# Kids Reflect vLLM Analysis

## Introduction

In this notebook, we will analyze the Kids Reflect dataset using both Azure OpenAI and vLLM for comparison. This notebook is specifically optimized for running on a supercomputer environment, taking advantage of multiple GPUs and high-performance computing resources.

To help you navigate this notebook, here is a step-by-step outline of what we will do:

1. **Configure vLLM for Supercomputer Environment**  
   - Set environment variables to optimize vLLM for high-performance computing
   - Verify GPU availability and configuration

2. **Load and Preprocess the Dataset**  
   - Load the Kids Reflect dataset
   - Clean and normalize text columns
   - Convert integer columns to the appropriate data type
   - Create verbatim text for analysis

3. **Prepare Training and Validation Data**  
   - Filter labeled data
   - Split data into training and validation sets

4. **Define Prompt Templates and Scenarios**  
   - Create templates for both Azure OpenAI and vLLM scenarios
   - Configure model parameters for optimal performance

5. **Run Iterative Prompt Improvement**  
   - Execute each scenario separately to monitor progress
   - Track GPU usage during execution

6. **Analyze and Visualize Results**  
   - Compare performance between Azure OpenAI and vLLM
   - Visualize kappa values across iterations
   - Save results for further analysis

## Configure vLLM for Supercomputer Environment

Before we begin, we need to configure vLLM to take full advantage of the supercomputer environment. This involves setting environment variables that control how vLLM utilizes the available GPU resources.

### Key Configuration Parameters:

- **VLLM_MODEL_PATH**: Path to the model or HuggingFace model ID
- **VLLM_DTYPE**: Data type for model weights (float16 for efficiency)
- **VLLM_GPU_MEMORY_UTILIZATION**: Target GPU memory utilization (0.95 or 95% for supercomputers)
- **VLLM_TENSOR_PARALLEL_SIZE**: Number of GPUs to use for tensor parallelism (4 for multi-GPU setups)
- **VLLM_MAX_MODEL_LEN**: Maximum sequence length (2048 tokens)
- **VLLM_ENABLE_PREFIX_CACHING**: Enable prefix caching for better performance
- **VLLM_WORKER_MULTIPROC_METHOD**: Worker multiprocessing method (spawn for better compatibility)

These settings assume a multi-GPU node: `VLLM_TENSOR_PARALLEL_SIZE=4` requires at least four visible GPUs, so lower it if your allocation is smaller.
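As a rough illustration of how these variables could map onto engine arguments, the sketch below reads them from the environment and builds a keyword dictionary. The helper name `vllm_kwargs_from_env` and its fallback defaults are assumptions made for this example; how the `qualitative_analysis` package actually consumes the variables is not shown in this notebook.

```python
import os


def vllm_kwargs_from_env(env=None):
    """Build a keyword dictionary from the VLLM_* variables listed above.

    Hypothetical helper: the fallback defaults are illustrative assumptions.
    """
    env = os.environ if env is None else env
    return {
        "model": env.get("VLLM_MODEL_PATH", "TinyLlama/TinyLlama-1.1B-Chat-v1.0"),
        "dtype": env.get("VLLM_DTYPE", "float16"),
        "gpu_memory_utilization": float(env.get("VLLM_GPU_MEMORY_UTILIZATION", "0.95")),
        "tensor_parallel_size": int(env.get("VLLM_TENSOR_PARALLEL_SIZE", "1")),
        "max_model_len": int(env.get("VLLM_MAX_MODEL_LEN", "2048")),
        # Environment values are strings, so the flag is parsed explicitly.
        "enable_prefix_caching": env.get("VLLM_ENABLE_PREFIX_CACHING", "false").lower() == "true",
    }


# A plain dict stands in for os.environ here so the example has no side effects.
example = vllm_kwargs_from_env({
    "VLLM_TENSOR_PARALLEL_SIZE": "4",
    "VLLM_ENABLE_PREFIX_CACHING": "true",
})
print(example)
```

The dictionary keys follow the argument names of vLLM's `LLM` class, so on a machine with vLLM installed the result could be passed as `LLM(**vllm_kwargs_from_env())`.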

In [None]:
# Set vLLM environment variables for supercomputer
%env VLLM_MODEL_PATH=TinyLlama/TinyLlama-1.1B-Chat-v1.0
%env VLLM_DTYPE=float16
%env VLLM_GPU_MEMORY_UTILIZATION=0.95
%env VLLM_TENSOR_PARALLEL_SIZE=4
%env VLLM_MAX_MODEL_LEN=2048
%env VLLM_ENABLE_PREFIX_CACHING=true
%env VLLM_WORKER_MULTIPROC_METHOD=spawn

# Display current configuration
!echo "Current vLLM configuration:"
!echo "VLLM_MODEL_PATH: $VLLM_MODEL_PATH"
!echo "VLLM_GPU_MEMORY_UTILIZATION: $VLLM_GPU_MEMORY_UTILIZATION"
!echo "VLLM_TENSOR_PARALLEL_SIZE: $VLLM_TENSOR_PARALLEL_SIZE"

### Check GPU Availability

Before proceeding, it's important to verify that GPUs are available and properly configured. This step helps identify any potential issues with GPU allocation or configuration before running the analysis.

In [None]:
# Check GPU availability
!nvidia-smi
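The same check can be done programmatically, which makes it possible to fail early when fewer GPUs are visible than the tensor-parallel size requests. The sketch below is one possible approach, not part of the `qualitative_analysis` package; the function names are hypothetical. It relies on `nvidia-smi -L`, which prints one `GPU n: ...` line per device.

```python
import shutil
import subprocess


def count_gpus(listing):
    """Count devices in the text printed by `nvidia-smi -L`."""
    return sum(1 for line in listing.splitlines() if line.strip().startswith("GPU "))


def visible_gpu_count():
    """Return the number of GPUs reported by nvidia-smi, or 0 if the tool is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return 0
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    return count_gpus(result.stdout) if result.returncode == 0 else 0


def check_tensor_parallel(requested, available):
    """Raise a clear error when more GPUs are requested than are visible."""
    if requested > available:
        raise RuntimeError(
            f"Requested {requested} GPUs for tensor parallelism but only {available} are visible"
        )
    return True


# Made-up listing used only to illustrate the parsing step.
sample_listing = "GPU 0: Example GPU (UUID: GPU-aaaa)\nGPU 1: Example GPU (UUID: GPU-bbbb)\n"
print(count_gpus(sample_listing))
```

In the notebook, `check_tensor_parallel(int(os.environ["VLLM_TENSOR_PARALLEL_SIZE"]), visible_gpu_count())` would stop the run before any model is loaded.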

## Import Libraries and Setup

Now we'll import the necessary libraries and modules for our analysis. The qualitative_analysis package provides functions for data loading, preprocessing, and model interaction.

In [None]:
import pandas as pd
import os
from qualitative_analysis import (
    clean_and_normalize,
    load_data,
    sanitize_dataframe,
)
from qualitative_analysis.prompt_engineering import run_iterative_prompt_improvement

# Define data directory
data_dir = 'exploratory_data'
os.makedirs(data_dir, exist_ok=True)

## Load and Preprocess Data

### Dataset Description

The Kids Reflect dataset contains entries from children who engaged in a four-step process to formulate divergent questions about a reference text. Each entry includes:

- **Reference**: The text that children read beforehand
- **IDENTIFY**: Where the child identifies a knowledge gap related to the reference text
- **GUESS**: Where the child makes a guess about what the answer could be
- **SEEK**: Where the child formulates a question to seek the answer
- **ASSESS**: Where the child evaluates whether an answer was found

The dataset also includes validity ratings for each step and overall mechanical ratings, as well as annotations from three human raters (Chloe, Oli, and Gaia).

### Data Preprocessing Steps

1. Load the dataset from the Excel file
2. Clean and normalize text columns
3. Convert integer columns to the appropriate data type
4. Sanitize the DataFrame to handle any inconsistencies

In [None]:
# Define the path to your dataset
data_file_path = os.path.join(data_dir, 'Kids_Reflect_3anno.xlsx')

# Load the data
data = load_data(data_file_path, file_type='xlsx', delimiter=';')

# 1) Define the text and integer columns to clean
text_columns = ["reference", "IDENTIFY", "GUESS", "SEEK", "ASSESS", "assess_cues"]
integer_columns = ["Identify_validity", "Guess_validity", "Seek_validity", "Assess_validity", "mechanical_rating", "Rater_Chloe", "Rater_Oli", "Rater_Gaia"]

# 2) Clean and normalize the text columns
for col in text_columns:
    data[col] = clean_and_normalize(data[col])

# 3) Convert selected columns to integers, preserving NaNs
for col in integer_columns:
    data[col] = pd.to_numeric(data[col], errors="coerce").astype("Int64")  # Uses nullable integer type

# 4) Sanitize the DataFrame
data = sanitize_dataframe(data)

# Display the first few rows of the dataset
data.head()

## Create Verbatim Text

Now we'll combine the different columns into a single verbatim text for each entry. This format makes it easier for the language model to process the entire entry as a cohesive unit.

The verbatim text includes:
- The unique key identifier
- The reference text
- The IDENTIFY, GUESS, SEEK, and ASSESS steps
- The assess_cues (possible answers proposed in the ASSESS step)
- The validity ratings for each step
- The mechanical rating (if available)

In [None]:
# Combine texts and entries
data['verbatim'] = data.apply(
    lambda row: (
        f"key: {row['key']}\n\n"
        f"reference: {row['reference']}\n\n"
        f"IDENTIFY: {row['IDENTIFY']}\n\n"
        f"GUESS: {row['GUESS']}\n\n"
        f"SEEK: {row['SEEK']}\n\n"
        f"ASSESS: {row['ASSESS']}\n\n"
        f"assess_cues: {row['assess_cues']}\n\n"
        f"Identify_validity: {row['Identify_validity']}\n\n"
        f"Guess_validity: {row['Guess_validity']}\n\n"
        f"Seek_validity: {row['Seek_validity']}\n\n"
        f"Assess_validity: {row['Assess_validity']}\n\n"
        f"mechanical_rating: {row['mechanical_rating']}\n\n"
    ),
    axis=1
)

# Extract the list of verbatims
verbatims = data['verbatim'].tolist()

print(f"Total number of verbatims: {len(verbatims)}")
print(f"Verbatim example:\n{verbatims[0]}")
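One detail worth checking in the printed example is how missing values are rendered. The prompt tells the model to act when `mechanical_rating` is empty, but a missing value in a nullable-integer column may be formatted as `<NA>` rather than as an empty string. The sketch below shows one way to build the same text with missing values left blank. It works on a plain dictionary with made-up values so it runs without pandas; `build_verbatim` and `is_missing` are hypothetical helpers, not functions from the package.

```python
FIELDS = [
    "key", "reference", "IDENTIFY", "GUESS", "SEEK", "ASSESS", "assess_cues",
    "Identify_validity", "Guess_validity", "Seek_validity", "Assess_validity",
    "mechanical_rating",
]


def is_missing(value):
    """True for None and float NaN. With pandas, pd.isna(value) would also cover pd.NA."""
    return value is None or (isinstance(value, float) and value != value)


def build_verbatim(row):
    """Join 'field: value' pairs with blank lines, leaving missing values empty."""
    parts = []
    for field in FIELDS:
        value = row.get(field)
        parts.append(f"{field}: {'' if is_missing(value) else value}")
    return "\n\n".join(parts) + "\n\n"


# Made-up entry for illustration only; it is not taken from the dataset.
example_row = {
    "key": 1,
    "reference": "Bees make honey.",
    "IDENTIFY": "how bees sleep",
    "GUESS": "maybe they sleep at night",
    "SEEK": "Do bees sleep?",
    "ASSESS": "no",
    "assess_cues": "Bees live in hives.",
    "Identify_validity": None,
    "Guess_validity": float("nan"),
    "Seek_validity": None,
    "Assess_validity": None,
    "mechanical_rating": None,
}
example_text = build_verbatim(example_row)
print(example_text)
```

With a DataFrame, the same function could be applied row by row via `data.apply(build_verbatim, axis=1)` after swapping the missing-value check for `pd.isna`.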

## Prepare Training and Validation Data

To evaluate the prompts, we split the labeled data into training and validation sets. The training set drives the iterative prompt improvement (no model weights are updated), and the validation set measures how well each prompt performs on entries that were not used to improve it.

### Steps:
1. Identify labeled data (entries with annotations from all three raters)
2. Create a subset of the labeled data for analysis
3. Split the subset into training (70%) and validation (30%) sets

In [None]:
# Identify the columns that represent your human ratings
annotation_columns = ['Rater_Chloe', 'Rater_Oli', 'Rater_Gaia']

# Filter labeled data (drop rows with NaN in any annotation column)
labeled_data = data.dropna(subset=annotation_columns)

# Filter unlabeled data
unlabeled_data = data[~data.index.isin(labeled_data.index)]

print("Number of labeled rows:", len(labeled_data))
print("Number of unlabeled rows:", len(unlabeled_data))

In [None]:
from sklearn.model_selection import train_test_split

subsample_size = 30

# Step 1: Draw a random subset of samples (stratified only if the line below is uncommented)
data_subset, _ = train_test_split(
    labeled_data,
    train_size=subsample_size,  # must be smaller than len(labeled_data)
    # stratify=labeled_data['label'],  # Uncomment if you have a label column to stratify on
    random_state=42
)

# Step 2: Split subset into train/val
train_data, val_data = train_test_split(
    data_subset,
    test_size=0.3,
    # stratify=data_subset['label'],  # Uncomment if you have a label column to stratify on
    random_state=42
)

print("Train size:", len(train_data))
print("Val size:", len(val_data))
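To make the arithmetic of the two-stage split explicit: 30 entries are drawn first, and 30% of them (9) are held out for validation, which leaves 21 for training. The standard-library sketch below mirrors that logic. It illustrates the sizes only; it is not a drop-in replacement for `train_test_split`, whose shuffling differs, and the function name is hypothetical.

```python
import random


def two_stage_split(items, subsample_size=30, val_fraction=0.3, seed=42):
    """Draw a random subset, then hold out a fraction of it for validation."""
    rng = random.Random(seed)
    # random.sample raises ValueError if subsample_size exceeds len(items).
    subset = rng.sample(items, subsample_size)
    n_val = round(len(subset) * val_fraction)
    return subset[n_val:], subset[:n_val]


# Integers stand in for labeled rows.
train_items, val_items = two_stage_split(list(range(100)))
print(len(train_items), len(val_items))  # 21 9
```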

## Define Prompt Templates

Now we'll define the prompt templates that will be used to instruct the language models. These templates include:

1. **Common Template**: The main instructions for evaluating the validity of a cycle
2. **Response Template**: The format in which the model should provide its response

The templates include detailed instructions on how to evaluate each step of the cycle and determine overall validity.
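Because the response template asks for a JSON object with a `Classification` field, the reply has to be parsed back into an integer label. The package handles this internally in a way this notebook does not show; the sketch below is only an assumption about what such a step might look like, using a hypothetical `parse_classification` helper that tolerates a surrounding code fence and a quoted integer.

```python
import json
import re


def parse_classification(response_text):
    """Extract the integer 'Classification' from a JSON reply, fenced or not.

    Returns None when no usable value is found.
    """
    # Grab the outermost {...} span so surrounding fences or prose are ignored.
    match = re.search(r"\{.*\}", response_text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(0))
        return int(payload["Classification"])
    except (KeyError, TypeError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError.
        return None


# Made-up reply for illustration.
reply = '```json\n{\n  "Reasoning": "All steps are valid.",\n  "Classification": "1"\n}\n```'
print(parse_classification(reply))  # 1
```

Returning `None` rather than raising keeps one malformed reply from aborting a whole batch; the caller can then decide whether to retry or to count it as an error.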

In [None]:
verbose = True

annotation_columns = ['Rater_Chloe', 'Rater_Oli', 'Rater_Gaia']
labels = [0, 1]  # 0 = invalid cycle, 1 = valid cycle
epsilon = 0.2

# Define the common template for both scenarios
common_template = """
You are an assistant that evaluates data entries.

You are provided with data entries in the following format:

The data has the following columns:
- "key": Unique identifier
- "reference": The reference text that participants must read beforehand. Their responses for the different steps must be semantically related to this text (same topic), but the answer to the question they are asking should not be found in the text.
- "IDENTIFY": Response for the IDENTIFY step
- "GUESS": Response for the GUESS step
- "SEEK": Response for the SEEK step
- "ASSESS": Response for the ASSESS step
- "assess_cues": Possible answers that were proposed in the ASSESS step
- "Identify_validity": If a number is already there (whatever the number), the step is valid
- "Guess_validity": If a number is already there (whatever the number), the step is valid
- "Seek_validity": If a number is already there (whatever the number), the step is valid
- "Assess_validity": If a number is already there (whatever the number), the step is valid
- "mechanical_rating": If a number is already there, you should use that as the final label (it over-rides any other logic in the codebook)


Here is an entry to evaluate:
{verbatim_text}

If a numeric value is present in the mechanical_rating column, copy it as the correct label.
If it's empty, you'll decide an overall cycle validity (0 or 1) based on the following codebook:

A cycle is considered valid if you can answer "yes" to all the following questions:

- Identify Step: Does the Identify step indicate a topic of interest?
- Guess Step: Does the Guess step suggest a possible explanation?
- Seek Step: Is the Seek step formulated as a question?
- Assess Step: Does it identify a possible answer or state that no answer was found ("no" is ok)?
- Consistency: Are the Identify, Guess, and Seek steps related to the same question?
- Reference Link: Are the Identify, Guess, and Seek steps related to the topic of the reference text?
- Seek Question Originality: Is the answer to the Seek question not found (even vaguely) in the reference text?
- Resolving Answer: If the Assess step states an answer, does it answer the question in the Seek step?
- Valid Answer: If the ASSESS step indicates an answer was found, is the answer indeed in the assess_cues? → If not, then no answer was actually found, and the cycle is not valid.
- Valid No: If the ASSESS step indicates no answer was found, confirm that the answer to the SEEK question is not actually present in the assess_cues. → If the participant claims no answer was found, but it is in fact in assess_cues, the cycle is not valid.

Identify_validity, Guess_validity, Seek_validity, Assess_validity:
If one of those columns already shows a numeric value (whatever the value), accept the step for this question without re-checking that step's validity.

If all these criteria are met, the cycle is valid.
Validity is expressed as:
1: Valid cycle
0: Invalid cycle

Minor spelling, grammatical, or phrasing errors should not be penalized as long as the intent of the entry is clear and aligns with the inclusion criteria. Focus on the content and purpose of the entry rather than linguistic perfection.
"""

# Define the common response template for both scenarios
common_response_template = """
Please follow the JSON format below:
```json
{{
  "Reasoning": "Your text here",
  "Classification": "Your integer here"
}}
```
"""

## Define Scenarios and GPU Monitoring

We'll define two scenarios for our analysis:

1. **Azure OpenAI with GPT-4o**: This scenario uses Azure's hosted GPT-4o model
2. **vLLM with Llama-2-7b-chat**: This scenario uses vLLM to run the Llama 2 model locally on the supercomputer

We'll also define a function to monitor GPU usage during execution, which is particularly useful for tracking resource utilization on the supercomputer.

In [None]:
# Function to monitor GPU usage during execution
def monitor_gpu():
    !nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv
    
# Check GPU status before starting
monitor_gpu()
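`monitor_gpu` prints the table but does not return anything, so the numbers cannot be logged or compared between runs. If you want them as Python values, a variant along the following lines could be used. It is a sketch with hypothetical function names; it calls `nvidia-smi` with the `csv,noheader,nounits` format so each line holds three plain numbers (utilization in percent, memory in MiB).

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_usage(csv_text):
    """Turn 'util, used, total' lines into a list of dictionaries, one per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        try:
            util, used, total = (float(part) for part in line.split(","))
        except ValueError:
            continue  # skip lines with unsupported fields such as "[N/A]"
        rows.append({"util_pct": util, "mem_used_mib": used, "mem_total_mib": total})
    return rows


def gpu_usage():
    """Query nvidia-smi; return an empty list when it is unavailable or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True)
    return parse_gpu_usage(result.stdout) if result.returncode == 0 else []


# Made-up output used only to illustrate the parsing step.
sample_rows = parse_gpu_usage("35, 1024, 40960\n0, 3, 40960\n")
print(sample_rows)
```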

In [None]:
scenarios = [
    # Azure OpenAI scenario
    {
        "provider_llm1": "azure",
        "model_name_llm1": "gpt-4o",
        "temperature_llm1": 0,

        # For the "improver" LLM2
        "provider_llm2": "azure",
        "model_name_llm2": "gpt-4o",
        "temperature_llm2": 0.7,

        "max_iterations": 4,
        "n_completions": 1,
        "prompt_name": "Azure-GPT4o",

        # Our initial prompt
        "template": common_template,
        "prefix": "Classification",
        "json_output": True,
        "selected_fields": ["Classification"],
        "label_type": "int",
        "response_template": common_response_template,
    },
    
    # vLLM scenario with a larger model than the TinyLlama model named in VLLM_MODEL_PATH above (adjust based on your supercomputer's capabilities)
    {
        "provider_llm1": "vllm",
        "model_name_llm1": "meta-llama/Llama-2-7b-chat-hf",  # Or another model available on your supercomputer
        "temperature_llm1": 0.1,
        
        # For the "improver" LLM2, still use Azure
        "provider_llm2": "azure",
        "model_name_llm2": "gpt-4o",
        "temperature_llm2": 0.7,
        
        "max_iterations": 4,  # Can increase this since you have more compute
        "n_completions": 1,
        "prompt_name": "vLLM-Llama2-7B",
        
        # Same template as the Azure scenario
        "template": common_template,
        "prefix": "Classification",
        "json_output": True,
        "selected_fields": ["Classification"],
        "label_type": "int",
        "response_template": common_response_template,
    }
]
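The two scenarios differ only in the classifier settings and the prompt name, so the shared keys could be factored out to keep them in sync. The sketch below shows one way to do that; `make_scenario` and `SHARED_SETTINGS` are hypothetical names, and the placeholder strings stand in for `common_template` and `common_response_template`.

```python
SHARED_SETTINGS = {
    # "Improver" LLM2 and loop settings common to both scenarios
    "provider_llm2": "azure",
    "model_name_llm2": "gpt-4o",
    "temperature_llm2": 0.7,
    "max_iterations": 4,
    "n_completions": 1,
    "prefix": "Classification",
    "json_output": True,
    "selected_fields": ["Classification"],
    "label_type": "int",
}


def make_scenario(provider, model_name, temperature, prompt_name, template, response_template):
    """Combine the shared settings with the classifier-specific ones."""
    scenario = dict(SHARED_SETTINGS)
    # Copy the list so scenarios do not share one mutable object.
    scenario["selected_fields"] = list(SHARED_SETTINGS["selected_fields"])
    scenario.update({
        "provider_llm1": provider,
        "model_name_llm1": model_name,
        "temperature_llm1": temperature,
        "prompt_name": prompt_name,
        "template": template,
        "response_template": response_template,
    })
    return scenario


# Placeholder templates keep the example self-contained.
example_scenarios = [
    make_scenario("azure", "gpt-4o", 0, "Azure-GPT4o", "<template>", "<response template>"),
    make_scenario("vllm", "meta-llama/Llama-2-7b-chat-hf", 0.1, "vLLM-Llama2-7B",
                  "<template>", "<response template>"),
]
print([s["prompt_name"] for s in example_scenarios])
```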

## Run Iterative Prompt Improvement

Now we'll run the iterative prompt improvement process for each scenario. This process involves:

1. Using the initial prompt to classify the training data
2. Evaluating the performance on the validation data
3. Improving the prompt based on the errors made
4. Repeating the process for a specified number of iterations

We'll run each scenario separately to better monitor progress and resource usage.
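The control flow of such a loop can be sketched independently of any LLM. The version below is a simplified illustration, not the implementation of `run_iterative_prompt_improvement`: `score_fn` stands in for classifying and scoring a prompt, `improve_fn` stands in for the improver model, and both are hypothetical callables supplied by the caller.

```python
def iterate_prompts(initial_prompt, score_fn, improve_fn, max_iterations=4):
    """Score a prompt, keep the best one seen so far, and ask for an improved version."""
    prompt = initial_prompt
    best_prompt, best_score = prompt, float("-inf")
    history = []
    for iteration in range(1, max_iterations + 1):
        score = score_fn(prompt)
        history.append({"iteration": iteration, "prompt": prompt, "score": score})
        if score > best_score:
            best_prompt, best_score = prompt, score
        if iteration < max_iterations:
            prompt = improve_fn(prompt)  # no need to improve after the last round
    return best_prompt, best_score, history


# Toy stand-ins: fixed scores per prompt and an "improver" that appends a character.
toy_scores = {"p": 0.2, "p+": 0.5, "p++": 0.4, "p+++": 0.45}
best_prompt, best_score, history = iterate_prompts("p", toy_scores.get, lambda p: p + "+")
print(best_prompt, best_score)  # p+ 0.5
```

The toy run shows why the best prompt is tracked separately: a later iteration can score lower than an earlier one, so the last prompt is not necessarily the one to keep.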

### Azure OpenAI Scenario

First, we'll run the Azure OpenAI scenario using GPT-4o. This will serve as our baseline for comparison.

In [None]:
# Azure OpenAI scenario
print("Running Azure OpenAI scenario...")
azure_results = []

best_prompt_azure, best_kappa_val_azure, iteration_rows_azure = run_iterative_prompt_improvement(
    scenario=scenarios[0],
    train_data=train_data,
    val_data=val_data,
    annotation_columns=annotation_columns,
    labels=labels,
    alt_test=True,
    errors_examples=0.5,
    examples_to_give=4,
    epsilon=epsilon,
    verbose=verbose
)
azure_results.extend(iteration_rows_azure)

# Check GPU status after the Azure run (inference is hosted by Azure, so this serves as a baseline before the vLLM run)
monitor_gpu()

### vLLM Scenario

Now we'll run the vLLM scenario using Llama-2-7b-chat. This will leverage the supercomputer's GPU resources for local inference.

In [None]:
# vLLM scenario
print("Running vLLM scenario...")
vllm_results = []

best_prompt_vllm, best_kappa_val_vllm, iteration_rows_vllm = run_iterative_prompt_improvement(
    scenario=scenarios[1],
    train_data=train_data,
    val_data=val_data,
    annotation_columns=annotation_columns,
    labels=labels,
    alt_test=True,
    errors_examples=0.5,
    examples_to_give=4,
    epsilon=epsilon,
    verbose=verbose
)
vllm_results.extend(iteration_rows_vllm)

# Check GPU status after vLLM run
monitor_gpu()

## Combine and Analyze Results

Now that we've run both scenarios, we'll combine the results and analyze them to compare the performance of Azure OpenAI and vLLM.

In [None]:
# Combine all results
all_results = azure_results + vllm_results
summary_df = pd.DataFrame(all_results)

# Display settings for better visualization
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Display the summary dataframe
summary_df

## Compare Azure vs vLLM Performance

Let's compare the performance of Azure OpenAI and vLLM by looking at the best results for each provider.

In [None]:
# Select the row with the highest validation kappa for each scenario (prompt_name)
best_by_provider = summary_df.loc[summary_df.groupby('prompt_name')['kappa_val'].idxmax()]
best_by_provider[['prompt_name', 'kappa_val', 'alt_test_val', 'iteration']]
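For readers unfamiliar with the metric, Cohen's kappa measures agreement between two sets of labels after correcting for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two label lists using only the standard library. How the package aggregates kappa across the three raters to produce `kappa_val` is not shown in this notebook, so this covers only the pairwise case with made-up labels.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("Both label sequences must be non-empty and of equal length")
    # Observed agreement: share of items on which the two label lists match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each label's marginal frequencies, summed over labels.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a.keys() | counts_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both lists use a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels: 7 of 8 items agree, chance agreement is 0.5.
model_labels = [1, 1, 0, 0, 1, 0, 1, 1]
rater_labels = [1, 0, 0, 0, 1, 0, 1, 1]
print(cohen_kappa(model_labels, rater_labels))  # 0.75
```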

### Visualize Performance Across Iterations

Now let's visualize how the performance of each provider changes across iterations. This will help us understand the effectiveness of the iterative prompt improvement process.

In [None]:
# Plot the results
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 6))

# Plot kappa values by iteration for each provider
sns.lineplot(data=summary_df, x='iteration', y='kappa_val', hue='prompt_name', marker='o')
plt.title('Kappa Values by Iteration and Provider')
plt.xlabel('Iteration')
plt.ylabel('Kappa Value')
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

## Save Results

Finally, let's save the results to a CSV file for further analysis or reporting.

In [None]:
# Save the results to a CSV file
output_dir = os.path.join(data_dir, 'outputs')
os.makedirs(output_dir, exist_ok=True)
summary_df.to_csv(os.path.join(output_dir, 'vllm_azure_comparison_results.csv'), index=False)

print(f"Results saved to {os.path.join(output_dir, 'vllm_azure_comparison_results.csv')}")