# Kids Reflect vLLM Analysis

## Introduction

In this notebook, we analyze the Kids Reflect dataset with both Azure OpenAI and vLLM and compare the results. The notebook is configured for a supercomputer environment with multiple GPUs and high-performance computing resources.

To help you navigate this notebook, here is a step-by-step outline of what we will do:

1. **Configure vLLM for Supercomputer Environment**  
   - Set environment variables to optimize vLLM for high-performance computing
   - Verify GPU availability and configuration

2. **Load and Preprocess the Dataset**  
   - Load the Kids Reflect dataset
   - Clean and normalize text columns
   - Convert integer columns to the appropriate data type
   - Create verbatim text for analysis

3. **Prepare Training and Validation Data**  
   - Filter labeled data
   - Split data into training and validation sets

4. **Define Prompt Templates and Scenarios**  
   - Create templates for both Azure OpenAI and vLLM scenarios
   - Configure model parameters for optimal performance

5. **Run Iterative Prompt Improvement**  
   - Execute each scenario separately to monitor progress
   - Track GPU usage during execution

6. **Analyze and Visualize Results**  
   - Compare performance between Azure OpenAI and vLLM
   - Visualize kappa values across iterations
   - Save results for further analysis

## Configure vLLM for Supercomputer Environment

Before we begin, we need to configure vLLM to take full advantage of the supercomputer environment. This involves setting environment variables that control how vLLM utilizes the available GPU resources.

### Key Configuration Parameters:

- **VLLM_MODEL_PATH**: Path to the model or HuggingFace model ID
- **VLLM_DTYPE**: Data type for model weights (float16 for efficiency)
- **VLLM_GPU_MEMORY_UTILIZATION**: Fraction of each GPU's memory that vLLM may use (0.95, i.e. 95%, for supercomputers)
- **VLLM_TENSOR_PARALLEL_SIZE**: Number of GPUs to use for tensor parallelism (4 here; it should not exceed the number of GPUs allocated to the job)
- **VLLM_MAX_MODEL_LEN**: Maximum sequence length (2048 tokens)
- **VLLM_ENABLE_PREFIX_CACHING**: Enable prefix caching, which reuses cached computation for prompts that share a common prefix
- **VLLM_WORKER_MULTIPROC_METHOD**: Worker multiprocessing method (spawn for better compatibility)

These settings are optimized for high-performance computing environments with multiple GPUs.
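
As a rough illustration, the settings above could be collected into a typed dictionary as sketched below. `read_vllm_settings` is a hypothetical helper, and whether the `qualitative_analysis` package reads these variables in this way is an assumption.

```python
import os


def read_vllm_settings(env=None):
    """Collect the vLLM-related settings from environment variables.

    Hypothetical helper for illustration; the defaults mirror the values
    used in this notebook.
    """
    env = os.environ if env is None else env
    return {
        "model": env.get("VLLM_MODEL_PATH", "TinyLlama/TinyLlama-1.1B-Chat-v1.0"),
        "dtype": env.get("VLLM_DTYPE", "float16"),
        "gpu_memory_utilization": float(env.get("VLLM_GPU_MEMORY_UTILIZATION", "0.95")),
        "tensor_parallel_size": int(env.get("VLLM_TENSOR_PARALLEL_SIZE", "4")),
        "max_model_len": int(env.get("VLLM_MAX_MODEL_LEN", "2048")),
        # Environment variables are strings, so the flag is parsed explicitly
        "enable_prefix_caching": env.get("VLLM_ENABLE_PREFIX_CACHING", "true").lower() == "true",
    }


settings = read_vllm_settings({"VLLM_TENSOR_PARALLEL_SIZE": "2"})
print(settings["tensor_parallel_size"], settings["gpu_memory_utilization"])
```

Converting the values once, in one place, makes a mistyped setting fail early rather than in the middle of a long run.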

In [None]:
# Set vLLM environment variables for supercomputer
%env VLLM_MODEL_PATH=TinyLlama/TinyLlama-1.1B-Chat-v1.0
%env VLLM_DTYPE=float16
%env VLLM_GPU_MEMORY_UTILIZATION=0.95
%env VLLM_TENSOR_PARALLEL_SIZE=4
%env VLLM_MAX_MODEL_LEN=2048
%env VLLM_ENABLE_PREFIX_CACHING=true
%env VLLM_WORKER_MULTIPROC_METHOD=spawn

# Display current configuration
!echo "Current vLLM configuration:"
!echo "VLLM_MODEL_PATH: $VLLM_MODEL_PATH"
!echo "VLLM_GPU_MEMORY_UTILIZATION: $VLLM_GPU_MEMORY_UTILIZATION"
!echo "VLLM_TENSOR_PARALLEL_SIZE: $VLLM_TENSOR_PARALLEL_SIZE"

### Check GPU Availability

Before proceeding, it's important to verify that GPUs are available and properly configured. This step helps identify any potential issues with GPU allocation or configuration before running the analysis.

In [None]:
# Check GPU availability
!nvidia-smi
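
The same check can be done programmatically. The sketch below counts the GPUs listed by `nvidia-smi -L` and compares the count with `VLLM_TENSOR_PARALLEL_SIZE`. The helper names are hypothetical, and the parsing assumes each GPU appears on its own line starting with `GPU <index>:`.

```python
import os
import re
import shutil
import subprocess


def count_gpus(listing):
    """Count GPUs in the text printed by `nvidia-smi -L` (one line per GPU)."""
    return sum(1 for line in listing.splitlines() if re.match(r"GPU \d+:", line.strip()))


def check_tensor_parallel(n_gpus, tensor_parallel_size):
    """Return True when enough GPUs are visible for the requested parallelism."""
    return 1 <= tensor_parallel_size <= n_gpus


if shutil.which("nvidia-smi"):
    listing = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    ).stdout
    tp_size = int(os.environ.get("VLLM_TENSOR_PARALLEL_SIZE", "1"))
    n_gpus = count_gpus(listing)
    status = "OK" if check_tensor_parallel(n_gpus, tp_size) else "not enough GPUs"
    print(f"{n_gpus} GPU(s) visible, tensor parallel size {tp_size}: {status}")
else:
    print("nvidia-smi not found on this node")
```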

## Import Libraries and Setup

Now we'll import the necessary libraries and modules for our analysis. The qualitative_analysis package provides functions for data loading, preprocessing, and model interaction.

In [None]:
import pandas as pd
import os
from qualitative_analysis import (
    clean_and_normalize,
    load_data,
    sanitize_dataframe,
)
from qualitative_analysis.prompt_engineering import run_iterative_prompt_improvement
from qualitative_analysis.alt_test import benjamini_yekutieli_correction

# Define data directory
data_dir = 'exploratory_data'
os.makedirs(data_dir, exist_ok=True)

## Load and Preprocess Data

### Dataset Description

The Kids Reflect dataset contains entries from children who engaged in a four-step process to formulate divergent questions about a reference text. Each entry includes:

- **Reference**: The text that children read beforehand
- **IDENTIFY**: Where the child identifies a knowledge gap related to the reference text
- **GUESS**: Where the child makes a guess about what the answer could be
- **SEEK**: Where the child formulates a question to seek the answer
- **ASSESS**: Where the child evaluates whether an answer was found

The dataset also includes validity ratings for each step and overall mechanical ratings, as well as annotations from three human raters (Chloe, Oli, and Gaia).

### Data Preprocessing Steps

1. Load the dataset from the Excel file
2. Clean and normalize text columns
3. Convert integer columns to the appropriate data type
4. Sanitize the DataFrame to handle any inconsistencies
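
Step 3 can be illustrated on a toy column. The sketch below uses made-up values, not the dataset, to show how unparseable entries become missing values while the remaining ratings stay integers.

```python
import pandas as pd

raw = pd.Series(["1", "0", "n/a", None])

# errors="coerce" turns unparseable entries into NaN instead of raising
numeric = pd.to_numeric(raw, errors="coerce")

# The nullable "Int64" dtype keeps integers as integers and stores gaps as <NA>
ratings = numeric.astype("Int64")
print(ratings.tolist())
```

With the plain `int64` dtype the missing entries could not be represented, and the column would fall back to floats.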

In [None]:
# Define the path to your dataset
data_file_path = os.path.join(data_dir, 'Kids_Reflect_3anno.xlsx')

# Load the data
data = load_data(data_file_path, file_type='xlsx', delimiter=';')

# 1) Define the text and integer columns to clean
text_columns = ["reference", "IDENTIFY", "GUESS", "SEEK", "ASSESS", "assess_cues"]
integer_columns = [
    "Identify_validity", "Guess_validity", "Seek_validity", "Assess_validity",
    "mechanical_rating", "Rater_Chloe", "Rater_Oli", "Rater_Gaia",
]

# 2) Clean and normalize the text columns
for col in text_columns:
    data[col] = clean_and_normalize(data[col])

# 3) Convert selected columns to integers, preserving NaNs
for col in integer_columns:
    data[col] = pd.to_numeric(data[col], errors="coerce").astype("Int64")  # Uses nullable integer type

# 4) Sanitize the DataFrame
data = sanitize_dataframe(data)

# Display the first few rows of the dataset
data.head()

## Create Verbatim Text

Now we'll combine the different columns into a single verbatim text for each entry. This format makes it easier for the language model to process the entire entry as a cohesive unit.

The verbatim text includes:
- The unique key identifier
- The reference text
- The IDENTIFY, GUESS, SEEK, and ASSESS steps
- The assessment cues (`assess_cues`)
- The validity ratings for each step
- The mechanical rating (if available)

In [None]:
# Combine texts and entries
data['verbatim'] = data.apply(
    lambda row: (
        f"key: {row['key']}\n\n"
        f"reference: {row['reference']}\n\n"
        f"IDENTIFY: {row['IDENTIFY']}\n\n"
        f"GUESS: {row['GUESS']}\n\n"
        f"SEEK: {row['SEEK']}\n\n"
        f"ASSESS: {row['ASSESS']}\n\n"
        f"assess_cues: {row['assess_cues']}\n\n"
        f"Identify_validity: {row['Identify_validity']}\n\n"
        f"Guess_validity: {row['Guess_validity']}\n\n"
        f"Seek_validity: {row['Seek_validity']}\n\n"
        f"Assess_validity: {row['Assess_validity']}\n\n"
        f"mechanical_rating: {row['mechanical_rating']}\n\n"
    ),
    axis=1
)

# Extract the list of verbatims
verbatims = data['verbatim'].tolist()

print(f"Total number of verbatims: {len(verbatims)}")
print(f"Verbatim example:\n{verbatims[0]}")
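
The same layout can be produced from a list of field names, which keeps the field order in one place. This is a minimal sketch; `FIELDS` and `build_verbatim` are hypothetical names, and the example row is a placeholder rather than real data.

```python
FIELDS = [
    "key", "reference", "IDENTIFY", "GUESS", "SEEK", "ASSESS", "assess_cues",
    "Identify_validity", "Guess_validity", "Seek_validity", "Assess_validity",
    "mechanical_rating",
]


def build_verbatim(row, fields=FIELDS):
    """Render one entry as 'name: value' blocks, each followed by a blank line."""
    return "".join(f"{name}: {row[name]}\n\n" for name in fields)


# Placeholder row: each value is just the field name in angle brackets
example = {name: f"<{name}>" for name in FIELDS}
print(build_verbatim(example))
```

Because the function only indexes the row by name, it could be passed directly to `data.apply(build_verbatim, axis=1)`.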

## Prepare Training and Validation Data

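Before running the models, we separate the entries that have human annotations from those that do not. No model is trained here: the "training" set is the data on which prompts are evaluated and iteratively improved, and the optional validation set is held out to check the resulting prompt.

### Steps:
1. Identify labeled data (entries with annotations from all three raters)
2. Draw a subset of the labeled data for analysis (`subsample_size` in each scenario)
3. Optionally split the subset into training and validation sets (`use_validation_set` and `validation_size` in each scenario)

The cell below performs step 1; steps 2 and 3 happen later, inside the run loop, based on each scenario's configuration.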

In [None]:
# Identify the columns that represent your human ratings
annotation_columns = ['Rater_Chloe', 'Rater_Oli', 'Rater_Gaia']

# Filter labeled data (drop rows with NaN in any annotation column)
labeled_data = data.dropna(subset=annotation_columns)

# Filter unlabeled data
unlabeled_data = data[~data.index.isin(labeled_data.index)]

print("Number of labeled rows:", len(labeled_data))
print("Number of unlabeled rows:", len(unlabeled_data))
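
The filtering logic can be checked on a toy table. In the sketch below the values are made up; only rows where all three raters gave a value count as labeled.

```python
import pandas as pd

toy = pd.DataFrame({
    "Rater_Chloe": [1, 0, None, 1],
    "Rater_Oli": [1, None, None, 0],
    "Rater_Gaia": [1, 0, None, 1],
})
raters = ["Rater_Chloe", "Rater_Oli", "Rater_Gaia"]

# A row counts as labeled only if every rater provided a value
toy_labeled = toy.dropna(subset=raters)
toy_unlabeled = toy[~toy.index.isin(toy_labeled.index)]

print(len(toy_labeled), len(toy_unlabeled))
```

A row annotated by only two of the three raters ends up in the unlabeled part, as row 1 does here.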

## Define Prompt Templates

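Now we'll define the prompt templates that will be used to instruct the language models. Each scenario carries two templates:

1. **Common Template** (`template`): The main instructions; `{verbatim_text}` marks where the entry to evaluate is inserted
2. **Response Template** (`response_template`): The format in which the model should provide its response

The initial prompt used here is minimal: it asks the model to copy the `mechanical_rating` value as the label. Both scenarios share the same templates and differ only in the main LLM (Azure `gpt-4o` versus `meta-llama/Llama-2-7b-chat-hf` served by vLLM).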
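
The doubled braces in `response_template` suggest that the templates are filled with Python's `str.format`, where `{{` and `}}` stand for literal braces. That is an assumption about the package, not something this notebook confirms. The sketch below shows the mechanism on shortened templates, together with a hypothetical `extract_classification` helper that reads the label back from a JSON reply.

```python
import json
import re

template = "Here is an entry to evaluate:\n{verbatim_text}\n"
response_template = (
    'Please follow the JSON format below:\n{{\n  "Classification": "Your integer here"\n}}\n'
)

# str.format fills {verbatim_text} and turns the doubled braces into single ones
prompt = (template + response_template).format(
    verbatim_text="key: 1\n\nmechanical_rating: 1"
)
print(prompt)


def extract_classification(reply):
    """Pull the integer 'Classification' out of a JSON reply (hypothetical helper)."""
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)  # tolerate text around the JSON
    if match is None:
        return None
    try:
        return int(json.loads(match.group(0))["Classification"])
    except (ValueError, KeyError, TypeError):
        return None


print(extract_classification('{"Reasoning": "copied", "Classification": "1"}'))
```

Returning `None` for an unparseable reply leaves it to the caller to decide whether to retry or to drop the entry.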

In [None]:
verbose = True

annotation_columns = ['Rater_Chloe', 'Rater_Oli', 'Rater_Gaia']
labels = [0,1]
epsilon = 0.2

# Scenarios define the configuration for each experiment run
# You can specify different providers and models by changing:
#   - provider_llm1: The provider for the main LLM (e.g., "azure", "openai", "together", "vllm")
#   - model_name_llm1: The model name for the main LLM
#   - provider_llm2: The provider for the improver LLM
#   - model_name_llm2: The model name for the improver LLM
#
# For vLLM provider:
#   - The model_name_llm1 should be a HuggingFace model ID (e.g., "meta-llama/Llama-2-7b-chat-hf")
#   - For testing, you can use a small model like "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
#
# Other parameters can be adjusted as needed (max_iterations, json_output, etc.)
scenarios = [
    {
        "provider_llm1": "azure",
        "model_name_llm1": "gpt-4o",
        "temperature_llm1": 0,

        # For the "improver" LLM2
        "provider_llm2": "azure",
        "model_name_llm2": "gpt-4o",
        "temperature_llm2": 0.7,

        "max_iterations": 1,
        "n_completions": 1,
        "prompt_name": "Basic",
        
        # Data configuration
        "subsample_size": 74,  # Size of data subset to use
        "use_validation_set": False,  # Whether to use a validation set
        "validation_size": 10,  # Size of validation set (if used)
        "random_state": 42,  # Random state for reproducibility

        # Our initial prompt
        "template": """
Here is an entry to evaluate:
{verbatim_text}

If a numeric value is present in the mechanical_rating column, copy it as the correct label.
""",
        "prefix": "Classification",
        "json_output": True,
        "selected_fields": ["Classification"],
        "label_type": "int",
        "response_template":
        """
Please follow the JSON format below:
```json
{{
  "Reasoning": "Your text here",
  "Classification": "Your integer here"
}}
```
""",
    },
    {
        "provider_llm1": "vllm",
        "model_name_llm1": "meta-llama/Llama-2-7b-chat-hf",
        "temperature_llm1": 0,

        # For the "improver" LLM2
        "provider_llm2": "azure",
        "model_name_llm2": "gpt-4o",
        "temperature_llm2": 0.7,

        "max_iterations": 1,
        "n_completions": 1,
        "prompt_name": "Basic",
        
        # Data configuration
        "subsample_size": 74,  # Size of data subset to use
        "use_validation_set": False,  # Whether to use a validation set
        "validation_size": 10,  # Size of validation set (if used)
        "random_state": 42,  # Random state for reproducibility

        # Our initial prompt
        "template": """
Here is an entry to evaluate:
{verbatim_text}

If a numeric value is present in the mechanical_rating column, copy it as the correct label.
""",
        "prefix": "Classification",
        "json_output": True,
        "selected_fields": ["Classification"],
        "label_type": "int",
        "response_template":
        """
Please follow the JSON format below:
```json
{{
  "Reasoning": "Your text here",
  "Classification": "Your integer here"
}}
```
""",
    },
]
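
Since the two scenarios differ only in the main LLM, they could also be generated from a shared base dictionary. The sketch below is one possible layout; `base_scenario`, `main_llms` and `scenarios_sketch` are illustrative names, and the template-related keys are left out for brevity.

```python
base_scenario = {
    "temperature_llm1": 0,
    "provider_llm2": "azure",
    "model_name_llm2": "gpt-4o",
    "temperature_llm2": 0.7,
    "max_iterations": 1,
    "n_completions": 1,
    "prompt_name": "Basic",
    "subsample_size": 74,
    "use_validation_set": False,
    "validation_size": 10,
    "random_state": 42,
}

# Only the main LLM differs between the two runs
main_llms = [
    ("azure", "gpt-4o"),
    ("vllm", "meta-llama/Llama-2-7b-chat-hf"),
]

scenarios_sketch = [
    {**base_scenario, "provider_llm1": provider, "model_name_llm1": model}
    for provider, model in main_llms
]

for s in scenarios_sketch:
    print(s["provider_llm1"], s["model_name_llm1"])
```

Keeping the shared settings in one place helps ensure that the comparison between providers is not affected by an accidental difference in the configuration.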

In [None]:

import numpy as np
from sklearn.model_selection import train_test_split

# Number of runs per scenario (can be adjusted)
n_runs = 3

# For the final summary dataframe
all_aggregated_results = []

# For storing all individual run results
all_detailed_results = []

for scenario in scenarios:
    # Extract data configuration from scenario
    subsample_size = scenario.get("subsample_size", 20)
    use_validation_set = scenario.get("use_validation_set", True)
    validation_size = scenario.get("validation_size", 10)
    random_state = scenario.get("random_state", 42)
    
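    # Step 1: Draw a random subset of the labeled samples
    # (stratification is currently disabled; see the commented-out argument)
    if subsample_size >= len(labeled_data):
        # train_test_split rejects an integer train_size >= n_samples,
        # so use every labeled row, shuffled reproducibly
        data_subset = labeled_data.sample(frac=1, random_state=random_state)
    else:
        data_subset, _ = train_test_split(
            labeled_data,
            train_size=subsample_size,
            # stratify=labeled_data['label'] if 'label' in labeled_data.columns else None,
            random_state=random_state
        )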
    
    # Step 2: Split subset into train/val if use_validation_set is True
    if use_validation_set:
        train_data, val_data = train_test_split(
            data_subset,
            test_size=validation_size,
            # stratify=data_subset['label'] if 'label' in data_subset.columns else None,
            random_state=random_state
        )
        print(f"Scenario '{scenario['prompt_name']}' - Train size: {len(train_data)}, Val size: {len(val_data)}")
    else:
        # Use all data for training
        train_data = data_subset
        val_data = None  # No validation set
        print(f"Scenario '{scenario['prompt_name']}' - Train size (all data): {len(train_data)}, No validation set")
    
    scenario_runs = []
    best_prompt_overall = None
    best_accuracy_overall = -1
    
    for run in range(n_runs):
        best_prompt, best_accuracy, iteration_rows = run_iterative_prompt_improvement(
            scenario=scenario,
            train_data=train_data,
            val_data=val_data,  # None when no validation set is used
            annotation_columns=annotation_columns,
            labels=labels,
            alt_test=True,
            errors_examples=0.5,
            examples_to_give=4,
            epsilon=epsilon,
            verbose=verbose
        )
        
        # Store the results from this run
        for row in iteration_rows:
            row['run'] = run + 1
            scenario_runs.append(row)
        
        # Track the best prompt across all runs
        if best_accuracy > best_accuracy_overall:
            best_accuracy_overall = best_accuracy
            best_prompt_overall = best_prompt
    
    # Store all detailed results
    all_detailed_results.extend(scenario_runs)
    
    # Group rows by iteration (sorted so the summary rows come out in order)
    iterations = sorted({row['iteration'] for row in scenario_runs})
    for iteration in iterations:
        iteration_rows = [row for row in scenario_runs if row['iteration'] == iteration]
        
        # Compute aggregated metrics for this iteration
        aggregated_metrics = {}
        for metric in ['kappa_train', 'kappa_val', 'accuracy_train', 'accuracy_val', 
                      'winning_rate_train', 'avg_adv_prob_train', 'winning_rate_val', 
                      'avg_adv_prob_val', 'tokens_used', 'cost', 'running_time_s']:
            # Filter out None or NaN values before computing mean
            valid_values = [row[metric] for row in iteration_rows if row[metric] is not None and not (isinstance(row[metric], float) and np.isnan(row[metric]))]
            aggregated_metrics[metric] = np.mean(valid_values) if valid_values else None
        
        # Handle p-values specially - compute mean p-values for each annotator
        if 'p_values_train' in iteration_rows[0] and iteration_rows[0]['p_values_train'] is not None:
            # Get the number of annotators (length of p_values list)
            n_annotators = len(iteration_rows[0]['p_values_train'])
            
            # Initialize lists to store p-values for each annotator across runs
            p_values_train_by_annotator = [[] for _ in range(n_annotators)]
            p_values_val_by_annotator = [[] for _ in range(n_annotators)]
            
            # Collect p-values for each annotator across runs
            for row in iteration_rows:
                if row['p_values_train'] is not None:
                    for i, p_val in enumerate(row['p_values_train']):
                        if not np.isnan(p_val):
                            p_values_train_by_annotator[i].append(p_val)
                
                if use_validation_set and 'p_values_val' in row and row['p_values_val'] is not None:
                    for i, p_val in enumerate(row['p_values_val']):
                        if not np.isnan(p_val):
                            p_values_val_by_annotator[i].append(p_val)
            
            # Compute mean p-values for each annotator
            mean_p_values_train = [np.mean(p_vals) if p_vals else np.nan for p_vals in p_values_train_by_annotator]
            
            # Store mean p-values
            aggregated_metrics['p_values_train'] = mean_p_values_train
            
            # Compute passed_alt_test based on mean p-values with proper Benjamini-Yekutieli correction
            alpha = 0.05
            # Filter out NaN values for the correction procedure
            valid_p_values_train = [p_val for p_val in mean_p_values_train if not np.isnan(p_val)]
            
            # Apply Benjamini-Yekutieli correction to the p-values
            if valid_p_values_train:
                train_rejections = benjamini_yekutieli_correction(valid_p_values_train, alpha=alpha)
                winning_rate_train = np.mean(train_rejections)
                aggregated_metrics['passed_alt_test_train'] = winning_rate_train >= 0.5
            else:
                aggregated_metrics['passed_alt_test_train'] = None
            
            # Handle validation p-values if using validation set
            if use_validation_set:
                mean_p_values_val = [np.mean(p_vals) if p_vals else np.nan for p_vals in p_values_val_by_annotator]
                aggregated_metrics['p_values_val'] = mean_p_values_val
                
                # Apply Benjamini-Yekutieli correction to validation p-values
                valid_p_values_val = [p_val for p_val in mean_p_values_val if not np.isnan(p_val)]
                if valid_p_values_val:
                    val_rejections = benjamini_yekutieli_correction(valid_p_values_val, alpha=alpha)
                    winning_rate_val = np.mean(val_rejections)
                    aggregated_metrics['passed_alt_test_val'] = winning_rate_val >= 0.5
                else:
                    aggregated_metrics['passed_alt_test_val'] = None
            else:
                aggregated_metrics['p_values_val'] = None
                aggregated_metrics['passed_alt_test_val'] = None
        
        # Add other necessary fields
        aggregated_metrics['data_set'] = iteration_rows[0]['data_set']
        aggregated_metrics['N_train'] = iteration_rows[0]['N_train']
        aggregated_metrics['N_val'] = iteration_rows[0]['N_val']
        aggregated_metrics['model'] = iteration_rows[0]['model']
        aggregated_metrics['prompt_name'] = iteration_rows[0]['prompt_name']
        aggregated_metrics['iteration'] = iteration
        aggregated_metrics['n_runs'] = n_runs
        aggregated_metrics['use_validation_set'] = use_validation_set
        
        # Add human annotator accuracies
        for annotator in annotation_columns:
            train_acc_key = f"{annotator}_train_acc"
            val_acc_key = f"{annotator}_val_acc"
            
            # Filter out None or NaN values before computing mean
            valid_train_values = [row[train_acc_key] for row in iteration_rows if train_acc_key in row and row[train_acc_key] is not None and not (isinstance(row[train_acc_key], float) and np.isnan(row[train_acc_key]))]
            
            aggregated_metrics[train_acc_key] = np.mean(valid_train_values) if valid_train_values else None
            
            # Only compute validation metrics if using validation set
            if use_validation_set:
                valid_val_values = [row[val_acc_key] for row in iteration_rows if val_acc_key in row and row[val_acc_key] is not None and not (isinstance(row[val_acc_key], float) and np.isnan(row[val_acc_key]))]
                aggregated_metrics[val_acc_key] = np.mean(valid_val_values) if valid_val_values else None
            else:
                aggregated_metrics[val_acc_key] = None
        
        all_aggregated_results.append(aggregated_metrics)
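
The cell above calls `benjamini_yekutieli_correction` from the package, whose implementation is not shown here. For reference, the sketch below is a plain re-implementation of the textbook Benjamini-Yekutieli step-up procedure. It assumes, based on how `np.mean(train_rejections)` is used above, that the result is one boolean per p-value. `by_rejections` is a hypothetical name, and the p-values are made up.

```python
def by_rejections(p_values, alpha=0.05):
    """Benjamini-Yekutieli step-up procedure.

    Returns one boolean per input p-value (True = null hypothesis rejected).
    Illustrative re-implementation, not the package's own code.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Harmonic-number factor that keeps the procedure valid under dependence
    c_m = sum(1.0 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k whose sorted p-value is at most k * alpha / (m * c_m)
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


rejections = by_rejections([0.02, 0.001, 0.04])
winning_rate = sum(rejections) / len(rejections)
print(rejections, winning_rate >= 0.5)
```

With three annotators the thresholds are roughly 0.009, 0.018 and 0.027, so in this toy case only the p-value of 0.001 leads to a rejection and the winning rate of 1/3 stays below the 0.5 cutoff.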

In [None]:

pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

# Create the final summary dataframe
summary_df = pd.DataFrame(all_aggregated_results)

# Define the desired column order for summary dataframe
summary_columns = [
    'data_set', 'N_train', 'N_val', 'model', 'prompt_name', 'iteration', 'n_runs', 'use_validation_set',
    'kappa_train', 'kappa_val', 'accuracy_train', 'accuracy_val',
    'winning_rate_train', 'passed_alt_test_train', 'avg_adv_prob_train', 
    'winning_rate_val', 'passed_alt_test_val', 'avg_adv_prob_val',
    'tokens_used', 'cost', 'running_time_s'
]

# Add human annotator columns
for annotator in annotation_columns:
    summary_columns.extend([f"{annotator}_train_acc", f"{annotator}_val_acc"])

# Add p-values columns if they exist
if 'p_values_train' in summary_df.columns:
    summary_columns.extend(['p_values_train', 'p_values_val'])

# Reorder columns (only include columns that exist in the dataframe)
available_columns = [col for col in summary_columns if col in summary_df.columns]
summary_df = summary_df[available_columns]

# Store the detailed results in a separate dataframe
detailed_df = pd.DataFrame(all_detailed_results)

# Define the desired column order for detailed dataframe
detailed_columns = [
    'run', 'data_set', 'N_train', 'N_val', 'model', 'prompt_name', 'iteration',
    'kappa_train', 'kappa_val', 'accuracy_train', 'accuracy_val',
    'winning_rate_train', 'passed_alt_test_train', 'avg_adv_prob_train',
    'winning_rate_val', 'passed_alt_test_val', 'avg_adv_prob_val',
    'tokens_used', 'cost', 'running_time_s'
]

# Add human annotator columns
for annotator in annotation_columns:
    detailed_columns.extend([f"{annotator}_train_acc", f"{annotator}_val_acc"])

# Add p-values columns if they exist
if 'p_values_train' in detailed_df.columns:
    detailed_columns.extend(['p_values_train', 'p_values_val'])

# Reorder columns (only include columns that exist in the dataframe)
available_detailed_columns = [col for col in detailed_columns if col in detailed_df.columns]
detailed_df = detailed_df[available_detailed_columns]

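# Keep the best prompt for reference. best_prompt_overall is reset for each
# scenario, so this is the best prompt of the last scenario in the loop only.
best_prompt_result = best_prompt_overall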

summary_df
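
To keep the results for further analysis, both tables can be written to disk. The sketch below is one way to do it; `save_results` and the file names are hypothetical, and toy tables stand in for `summary_df` and `detailed_df`.

```python
import os
import tempfile

import pandas as pd


def save_results(summary, detailed, out_dir):
    """Write both result tables to CSV and return the two file paths."""
    os.makedirs(out_dir, exist_ok=True)
    summary_path = os.path.join(out_dir, "kids_reflect_summary.csv")
    detailed_path = os.path.join(out_dir, "kids_reflect_detailed.csv")
    summary.to_csv(summary_path, index=False)
    detailed.to_csv(detailed_path, index=False)
    return summary_path, detailed_path


# Toy tables standing in for summary_df and detailed_df
toy_summary = pd.DataFrame({"model": ["toy-model"], "iteration": [1], "n_runs": [2]})
toy_detailed = pd.DataFrame({"run": [1, 2], "model": ["toy-model"] * 2, "iteration": [1, 1]})

with tempfile.TemporaryDirectory() as tmp:
    paths = save_results(toy_summary, toy_detailed, tmp)
    print(pd.read_csv(paths[1]).shape)
```

In the notebook, the call would be `save_results(summary_df, detailed_df, data_dir)`. Note that list-valued columns such as `p_values_train` are written to CSV as their string representation.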
