# Binary classification

## Introduction

In this notebook, we will perform **automatic classification of textual data** using **Large Language Models (LLMs)**.

The dataset we'll be working with requires binary labels (`0` or `1`), meaning this is a **binary classification task**, where each data entry is assigned to one of the two predefined categories.

To help you navigate this notebook, here is a step-by-step outline of what we will do:

1. **Getting started**  
   - Download and install the project and its dependencies, load import and your API key.

2. **Load and preprocess the dataset**  
   - Upload, explore and pre-process the dataset, with the sample dataset (recommended for a first use) or your own data.

3. **Prompt construction and classification on manually annotated data**  

4. **Evaluating Model Performance Against Human Annotations**  
   - Compute metrics (e.g., **Cohen's Kappa**, **Alt-Test**, ...)

5. **Final Step: Classify the Full Dataset**  

## Getting started

Before we begin, let's set up the environment by cloning the project and installing the necessary dependencies.

### Step 1: Clone the Project
Run the following cell to download the project files.
This will download the project folder into Colab and switch the working directory to it.

In [1]:
!git clone https://github.com/OlivierLClerc/qualitative_analysis_project

^C


Cloning into 'qualitative_analysis_project'...


### Step 2: Install Required Libraries
Now, install the project and its dependencies.

⚠️ Note:

- This will install all required libraries for the notebook to run.
- If Colab suggests restarting the runtime, click "Restart Runtime" and re-run this cell.

In [None]:
%cd qualitative_analysis_project
%pip install .

### Step 3: Load Your API Key

To use an LLM for analysis, you need to provide your **API key**. This key allows secure access to the API.

You can use this pipeline with an **OpenAI**, **Gemini**, or **Anthropic** key.  
The code in the cell below is currently configured for **OpenAI**.

If you're using another provider, simply replace all occurrences of `OPENAI_API_KEY` with the corresponding variable name:  
- For **Gemini** → `GEMINI_API_KEY`  
- For **Anthropic** → `ANTHROPIC_API_KEY`


#### Instructions

1. Click on the **🔑 "Key" icon** on the left sidebar in Colab (**⚙️ Settings** > **Secrets**).  
2. Click **"Add a new secret"**.  
3. Enter the following:  
   - **Name** → `OPENAI_API_KEY`
   - **Value** → *Your API Key* (Get it from [OpenAI](https://platform.openai.com/account/api-keys))  
4. Click **"Save"**.  

#### Troubleshooting

- **API Key not found?**  
  - Double-check that the secret name is exactly **`OPENAI_API_KEY`**.  
  - If the issue persists, **refresh the page** and rerun the cell.  

- **Is My Key Secure?**  
  - Yes! Colab's **Secrets Manager** encrypts your key and keeps it safe.  

In [None]:
from google.colab import userdata
import os
import pandas as pd

# Retrieve API keys securely from Colab Secrets
API_KEY = userdata.get('OPENAI_API_KEY')

# Check if the API key was loaded
if API_KEY:
    print("✅ API Key loaded successfully!")
    os.environ['OPENAI_API_KEY'] = API_KEY
else:
    print("⚠️ API Key not found. Please check the Secrets panel.")

### Step 4: Import Project Modules

Now that the project is installed, let's import the necessary modules and functions from the `qualitative_analysis` package. These tools will help us load data, process text, and perform binary classification analysis.

In [1]:
from qualitative_analysis import (
    load_data,
    clean_and_normalize,
    sanitize_dataframe,
)
import os
from qualitative_analysis.scenario_runner import run_scenarios
from qualitative_analysis.metrics.krippendorff import compute_krippendorff_non_inferiority, print_non_inferiority_results
from qualitative_analysis.metrics.kappa import compute_kappa_metrics
from qualitative_analysis.metrics.alt_test import run_alt_test_on_results
from qualitative_analysis.metrics.classification import compute_classification_metrics_from_results



  from .autonotebook import tqdm as notebook_tqdm


## Load and Preprocess the Dataset

### Step 1: Load the data (can be a CSV file or an xlsx file)

In [2]:
# Define data directory
data_dir = 'data/binary_user_case'
os.makedirs(data_dir, exist_ok=True)

# Define the path to your dataset
data_file_path = os.path.join(data_dir, 'binary_data.xlsx')

# Load the data
data = load_data(data_file_path, file_type='xlsx', delimiter=';')

# Preview the data
data.head()

Unnamed: 0,Code,model,prompt,but,réponse_attendue,réponse_llm,iteration,Rater_Oli,Invalid_Oli,Rater_chloe,Invalid_chloe,Rater_RA,Invalid_RA
0,1,GPT-4o mini,correct,Ton but est de trouver la pièce qui aide ces d...,Le composant reliant les capteurs avec les act...,La partie d'un robot qui relie les capteurs et...,1,1.0,False,1.0,False,0.0,False
1,1,GPT-4o mini,correct,Ton but est de trouver la pièce qui aide ces d...,Le composant reliant les capteurs avec les act...,La partie d'un robot qui relie les capteurs et...,2,,False,,False,,False
2,1,GPT-4o mini,correct,Ton but est de trouver la pièce qui aide ces d...,Le composant reliant les capteurs avec les act...,La partie d'un robot qui relie les capteurs et...,3,,False,,False,,False
3,1,GPT-4o mini,correct,Ton but est de trouver la pièce qui aide ces d...,Le composant reliant les capteurs avec les act...,La partie d'un robot qui relie les capteurs et...,4,,False,,False,,False
4,1,GPT-4o mini,correct,Ton but est de trouver la pièce qui aide ces d...,Le composant reliant les capteurs avec les act...,La partie d'un robot qui relie les capteurs et...,5,,False,,False,,False


### Dataset Description

The dataset used in this notebook contains **responses generated by a Large Language Model (LLM)**.  
We designed 12 STEM exercises, for which we created two categories of prompts:

- **Good prompts**: Intended to help the LLM generate a correct and relevant response to the exercise  
- **Bad prompts**: Expected to produce responses that are **insufficient** or irrelevant for correctly answering the task

The objective is to **evaluate the quality of LLM-generated responses** based on these varying prompt conditions.

### Dataset Structure

The dataset contains the following key columns:

- **`réponse_attendue`**: A reference response considered correct (the expected answer)  
- **`réponse_llm`**: The actual response generated by the LLM — this is what needs to be evaluated  
- **`iteration`**: An identifier for the exercise-prompt combination

To evaluate a given response, we compare the **LLM’s answer** to the **expected answer**.

The dataset also includes ratings from **three independent human annotators**.  
These annotations allow us to compute **inter-annotator agreement metrics**, helping assess the reliability of the labels and compare them with model predictions.

### Step 2: Data Preprocessing  (Optional, improve clarity and consistency of text data)

1. **Rename key columns**  
   Give important columns more descriptive names (commented here)

2. **Clean textual data**  
   For each text column, run `clean_and_normalize(series)` to  
   - trim leading/trailing spaces  
   - convert accented characters to plain ASCII (e.g. `'é'` → `'e'`).

3. **Convert to integers**  
   Convert selected columns to integers using `pd.to_numeric(...).astype("Int64")` to preserve missing values.

4. **Sanitize line breaks**  
   Run `sanitize_dataframe(df)` to replace newline (`\n`) and carriage‑return (`\r`) characters with a single space in every string column.

In [3]:
# 1a) Define a mapping from old column names to new names
# rename_map = {
#     "réponse_attendue": "expected_answer",
#     "réponse_llm": "llm_answer"
# }


# # 1b) Rename the columns in the DataFrame
# data = data.rename(columns=rename_map)

# 1) Now define the new column names for cleaning
text_columns = ["réponse_attendue", "réponse_llm"]
integer_columns = ["Code", "iteration", "Rater_Oli", "Rater_chloe", "Rater_RA"]

# 2) Clean and normalize the new columns
for col in text_columns:
    data[col] = clean_and_normalize(data[col])

# 4) Convert selected columns to integers, preserving NaNs
for col in integer_columns:
    data[col] = pd.to_numeric(data[col], errors="coerce").astype("Int64")

# 5) Sanitize the DataFrame
data = sanitize_dataframe(data)


### Step 3: Combine Texts and Questions

To prepare the data for the LLM, we gather exactly the information a human annotator would need.
The concatenated block of fields is called a **verbatim**.

#### Create the `verbatim` field

1. **Build verbatims**  
   For every row we create a multi‑line string containing:  
    - **réponse_attendue**: A reference response considered correct (the expected answer)   
    - **réponse_llm**: The actual response generated by the LLM — this is what needs to be evaluated  
   Each section is separated by a blank line for readability, and the result is written to a new column named `verbatim`.

2. **Sanity‑check**  
   - Print the total number of verbatims to ensure every row was processed.  
   - Display the first verbatim as a spot‑check of the format.

In [4]:
# Combine texts and entries

data['verbatim'] = data.apply(
    lambda row: (
        f"réponse_attendue: {row['réponse_attendue']}\n\n"
        f"réponse_llm: {row['réponse_llm']}\n\n"
    ),
    axis=1
)

# Extract the list of verbatims
verbatims = data['verbatim'].tolist()

print(f"Total number of verbatims: {len(verbatims)}")
print(f"Verbatim example:\n{verbatims[0]}")

Total number of verbatims: 720
Verbatim example:
réponse_attendue: Le composant reliant les capteurs avec les actionneurs est la carte de controle ou le microcontroleur.

réponse_llm: La partie d'un robot qui relie les capteurs et les moteurs pour aider a eviter les obstacles est generalement appelee le **systeme de controle** ou **unite de controle**. Ce systeme peut inclure un microcontroleur ou un ordinateur embarque qui recoit les donnees des capteurs (comme des capteurs de distance, des cameras, ou des capteurs infrarouges) et utilise ces informations pour prendre des decisions. Ensuite, il envoie des commandes aux moteurs pour ajuster la direction ou la vitesse du robot afin d'eviter les obstacles. Ce processus fait souvent partie d'un algorithme de navigation ou de traitement de signal qui permet au robot de reagir en temps reel a son environnement.




## Prompt construction and classification on manually annotated data

This framework allows you to evaluate different configurations to determine which prompt, model, and parameters yield the most accurate classification. These configurations are stored in the scenarios list.

The snippet defines two **classification scenarios** for evaluating participants’ “Identify → Guess → Seek → Assess” reasoning cycles with a Large Language Model (LLM).

Each scenario is a dictionary inside the `scenarios` list and can be seen as a self‑contained _experiment_: it specifies

* which LLM to call (`provider_llm1`, `model_name_llm1`, `temperature_llm1`);
* the **prompt template** that tells the LLM how to judge a single data row;
* the expected JSON output (fields listed in `selected_fields`);
* optional settings for **prompt‑refinement** by a second LLM (`provider_llm2`, …).

Running the pipeline iterates over every scenario and evaluates every (or a subsample of) data rows, then writes the chosen output fields back to your dataframe or file.

### LLM Settings

- `provider_llm1`: The LLM provider used for classification (`azure`, `openai`, `anthropic`, `gemini`)
- `model_name_llm1`: The model used for classification. This depends on the provider.

#### Example:

- **For** `azure` → `"gpt-4o"` or `"gpt-4o-mini"`
- **For** `openai` → `"gpt-4o"` or `"gpt-4o-mini"`
- **For** `anthropic` → `"claude-3-7-sonnet-20250219"`, `"claude-3-5-haiku-20241022"`
- **For** `gemini` → `"gemini-2.0-flash-001"`, `"gemini-2.5-pro-preview-03-25"`

- `temperature_llm1`: Controls output variability. Set to `0` for deterministic responses. Higher values add randomness (not recommended for evaluation tasks).
- `subsample_size`: Number of entries to evaluate. Set to `-1` to use the entire dataset.

### Prompt Configuration

- `prompt_name`: A short name identifying the scenario, used in performance tracking.
- `template`: The full prompt used to guide the LLM. It could include:
  - The **role** of the assistant
  - A **description** of the input columns
  - The **evaluation codebook** (la manière dont les données doivent etre classifiées)
  - Optionally, **examples**
  - ⚠️ **Must contain** the `{verbatim_text}` placeholder for the entry being evaluated

### Output

- `selected_fields`: The fields to extract from the LLM’s output (e.g., `"Classification"`, `"Reasoning"`).  
  You can modify this to include or exclude elements (like adding confidence scores, removing reasonning).
- `prefix`: The key to look for in the LLM output that contains the classification label (e.g., `"Classification"`).
Nous spécifions donc cela pour que le parsing du verdict soit plus facile, pour récuperer les labels de classification.
- `label_type`: Data type of the classification label. Typically `"int"` for binary classification (`0` or `1`),  
  but can be changed to `"float"` or `"str"` as needed.
- `response_template`: The required format of the LLM output (e.g., JSON). This ensures correct parsing. It is recommended not to change this format request.
- `json_output`: If `True`, the LLM must respond in JSON. Disabling this is not recommended. If you do, you will have to  
  change the `response_template` accordingly.


### Prompt Optimization (In developpement - better not to change anything)

This section enables **automatic prompt refinement** using a second LLM. It attempts to generate an improved version of the prompt to reduce classification errors.

- A second model (`llm2`) is used to review the prompt given to the first model (`llm1`) and suggest changes based on classification failures.
- If the new prompt performs better (fewer classification errors), it replaces the original.

**Warning**: This can lead to overfitting — the new prompt may work well on the training data but generalize poorly.  
It's highly recommended to **use a validation set** when using this feature.

### Prompt Optimization

- `provider_llm2`: LLM provider used for prompt improvement
- `model_name_llm2`: Name of the refinement model
- `temperature_llm2`: Temperature for the prompt-refiner LLM
- `max_iterations`: How many times the prompt should be revised.
For example, if you choose 3, each data entry will be classified three times: once with the original prompt, and twice with newly generated prompts.
- `use_validation_set`: Whether to use a separate validation set to monitor prompt overfitting (Boolean)
- `validation_size`: Number of samples in the validation set
- `random_state`: Random seed for reproducible train/validation split

### Majority vote

- `n_completions`: Number of completions per entry. 
It is possible to generate multiple responses for each entry using the same LLM. This will produce several classification labels for the same data point.
The final label is determined by majority vote. Generating multiple completions can improve robustness but also increases cost.

### Example

In the current example, we define two scenarios:

**Scenario 1**: One prompt in french.

**Scenario 2**: One prompt in english (but data still in french).

### Step 1: Define the scenarios

In [5]:
scenarios = [
    {
        # LLM settings
        "provider_llm1": "openrouter",
        "model_name_llm1": "anthropic/claude-3.7-sonnet",
        "temperature_llm1": 0,
        "prompt_name": "french_prompt",
        "subsample_size": 20,  # Size of data subset to use

        # Our initial prompt
        "template": """
Vous êtes un assistant chargé d’évaluer des entrées de données.

Les données comprennent les colonnes suivantes :
- "réponse_attendue": Un passage de réponse considérée comme satisfaisante
- "réponse_llm": La réponse fournie par le LLM, à juger
- "iteration": Identifiant

Voici une entrée à évaluer :
{verbatim_text}

Tâche d’évaluation :
Évaluer si la réponse_llm répond adéquatement, c'est à dire qu'elle correspond à la réponse attendue, en utilisant l’échelle suivante :

0 : La réponse générée ne permet pas ou difficilement de répondre à la question posée (hors sujet, incomplète ou incorrecte, vague, trop ou pas assez détaillée).
1 : La réponse générée permet de répondre à la question.
""",
        # Output
        "selected_fields": ["Classification", "Reasoning"],
        "prefix": "Classification",
        "label_type": "int",
        "response_template":
        """
S'il te plait, suis le format JSON ci-dessous :
```json
{{
  "Raisonnement": "Ton raisonnement ici",
  "Classification": "Ton integer ici"
}}
""",
        "json_output": True,

        # Prompt optimization
        "provider_llm2": "azure",
        "model_name_llm2": "gpt-4o",
        "temperature_llm2": 0.7,
        "max_iterations": 1,
        "use_validation_set": False,
        "validation_size": 10,
        "random_state": 42,

        # Majority vote
        "n_completions": 1,

    },
#         {
#         # LLM settings
#         "provider_llm1": "azure",
#         "model_name_llm1": "gpt-4o",
#         "temperature_llm1": 0,
#         "prompt_name": "english_prompt",
#         "subsample_size": 10,  # Size of data subset to use

#         # Our initial prompt
#         "template": """
# You are an assistant that evaluates data entries.

# The data has the following columns:
# - "réponse_attendue":  An excerpt of a reference answer considered satisfactory
# - "réponse_llm": The answer provided by the LLM, to be evaluated
# - "iteration": Identifier

# Here is an entry to evaluate:
# {verbatim_text}

# Evaluation Task:
# Evaluate whether the réponse_llm adequately matches the réponse_attendue using the following scale:

# 0: The generated answer does not help or barely helps to answer the question (off-topic, incomplete, or incorrect, vague, too detailed, or not detailed enough).
# 1: The generated answer answers the question.
# """,
#         # Output
#         "selected_fields": ["Classification", "Reasoning"],
#         "prefix": "Classification",
#         "label_type": "int",
#         "response_template":
#         """
# Please follow the JSON format below:
# ```json
# {{
#   "Reasoning": "Your text here",
#   "Classification": "Your integer here"
# }}
# """,
#         "json_output": True,

#         # Prompt optimization
#         "provider_llm2": "azure",
#         "model_name_llm2": "gpt-4o",
#         "temperature_llm2": 0.7,
#         "max_iterations": 3,
#         "use_validation_set": True,
#         "validation_size": 5,
#         "random_state": 42,

#         # Majority vote
#         "n_completions": 1,
#     },
]

### Step 2: Run the classification on Annotated Subset

Before launching the classification on the entire dataset, we first run it on the subset that has been manually annotated.  
This step allows us to compute performance metrics (e.g., **accuracy**, **F1-score**) by comparing LLM predictions to human labels,  
and therefore select which (if any) scenario can be used to classify the full, unlabeled dataset.

#### Configuration Parameters

- `annotation_columns`: The names of the columns containing human annotations.
- `labels`: The possible label values (in this case, `[0, 1]` for binary classification).

We filter out any rows with missing values in the annotation columns to ensure we're only evaluating on fully labeled data.

#### Repeated Runs for Stability

LLMs are **stochastic** by nature — even with a temperature of `0`, outputs can vary.  
To assess how consistent the model is, we introduce the `n_runs` parameter:

- `n_runs`: The number of times the classification is repeated for each scenario on the annotated data.

We recommend setting `n_runs = 3`, based on findings from **[Paper XX]** (insert reference),  
which showed that **three repetitions strike a good balance between stability and cost**.  
Running more times improves statistical reliability but increases costs proportionally.

#### `n_runs` vs `n_completions`

It’s important to distinguish between these two concepts:

- **`n_completions`**:  
  Controls how many responses are generated **within a single run** for each data point.  
  The final label is determined by **majority vote** over those completions.  
  **Example**:  
  If `n_completions = 3` and the model returns `[0, 0, 1]`, the selected label will be `0`.

- **`n_runs`**:  
  Repeats the **entire classification process** multiple times across the same data.  
  If you run the scenario three times and get `[0, 0, 1]` for a given entry,  
  that variation will be captured when calculating metrics (e.g., **variance**, **disagreement rate**).

In [6]:
# 9) Run scenarios and get results

annotation_columns = ["Rater_Oli", "Rater_chloe", "Rater_RA"]
labels = [0,1,2]

# Filter labeled data (drop rows with NaN in any annotation column)
labeled_data = data.dropna(subset=annotation_columns)
unlabeled_data = data[~data.index.isin(labeled_data.index)]

n_runs = 1  # Number of runs per scenario
verbose = True  # Whether to print verbose output

# Run the scenarios - this only runs the LLM and saves all the generated labels
complex_case_for_metrics = run_scenarios(
    scenarios=scenarios,
    data=labeled_data,
    annotation_columns=annotation_columns,
    labels=labels,
    n_runs=n_runs,
    verbose=verbose
)

Scenario 'french_prompt' - Train size (all data): 20, No validation set

=== Processing Verbatim 1/20 ===
Prompt:

Vous êtes un assistant chargé d’évaluer des entrées de données.

Les données comprennent les colonnes suivantes :
- "réponse_attendue": Un passage de réponse considérée comme satisfaisante
- "réponse_llm": La réponse fournie par le LLM, à juger
- "iteration": Identifiant

Voici une entrée à évaluer :
réponse_attendue: Les plantes stockent de l'energie sous forme de sucres produits par la photosynthese durant la journee. La nuit, elles utilisent ces reserves pour maintenir leurs fonctions vitales, en degradant les sucres via la respiration cellulaire pour liberer l'energie necessaire. Ce processus leur permet de continuer a croitre et a se reparer meme en l'absence de lumiere solaire.

réponse_llm: Les plantes ont besoin d'energie pour effectuer leurs activites nocturnes, telles que la croissance, le developpement et la maintenance de leurs fonctions metaboliques. Puisqu'el

### Step 3: Saving / Re-Loading the Results

This step provides an option to save the classification results to a file for future reference or further analysis.

In [34]:
# Possibility to save the results

# Save the annotated results to a CSV file
complex_case_for_metrics.to_csv("data/binary_user_case/outputs/binary_case_for_metrics.csv", sep=";", index=False, encoding="utf-8-sig")

In [35]:
# Optionally, load the annotated results from the CSV file if needed

complex_case_for_metrics = pd.read_csv(
    "data/binary_user_case/outputs/binary_case_for_metrics.csv",
    sep=";",
    encoding="utf-8-sig"
)

## Evaluating Model Performance Against Human Annotations

To determine whether the model's classification is reliable and can be used to annotate the rest of the unlabeled dataset,  
it is recommended to evaluate its alignment with human annotations.  
If the alignment is sufficiently high, you may choose to rely on the model-generated labels for the remaining data.

We propose **four types of analysis**, depending on your goals:

- **If you want to measure agreement between annotators**:  
  Use **Cohen's Kappa**, a simple and widely used metric for inter-rater agreement.

- **If you need detailed per-class performance metrics** (e.g., recall, true positives, false positives):  
  Use **Classification Metrics**. This method gives a descriptive breakdown of model performance by class.

- **If you have multiple manual annotations and want a more robust estimate**:  
  Use **Krippendorff's Alpha**. This method provides:
  - A confidence interval for the agreement, computed via bootstrapping
  - An estimate of the risk that the true alpha value lies outside this interval

- **If you have multiple annotation columns (≥ 3)** and want to assess whether the model can "replace" or **outperform individual annotators**,  
  and you can afford to annotate 50–100 entries:  
  Use the **Alt-Test**. This stricter test compares the model to each annotator using a **leave-one-out** approach.

Among the available methods, **Krippendorff’s Alpha** and the **Alt-Test** are the ones we consider more **rigorous and robust**.

> **Note 1**: The final decision on whether the model's performance is “good enough” depends on your research domain,  
> acceptable error tolerance, and practical factors such as annotation cost and time. It can be totally valid to accept the model based solely on its Cohen’s kappa score,
 if it is approximately equivalent to human inter-rater agreement.

> **Note 2**: If the agreement between human annotators is low, the issue likely lies in the codebook (e.g., unclear guidelines) or the annotation task itself.
> In such cases, it’s unrealistic to expect the LLM to achieve high performance if humans themselves struggle to agree on the correct labels.

> **Note 3**: If you're not satisfied with the model’s performance, you can go back and **adjust the scenario** (this may include updating the codebook, adding examples, using another model...)  
> ⚠️ However, if you do this **multiple times**, it is strongly recommended to use a **validation set** to avoid overfitting to your annotated subset.

### Cohen's Kappa

This analysis provides:

- **Mean agreement between the LLM and all human annotators** (when multiple annotators are available)
- **Mean agreement among human annotators** (when multiple annotators are available)
- **Individual agreement scores** for all pairwise comparisons

#### Weighting Options

You can set kappa_weights to different values. Use:

- **unweighted (remove the parameter)**:  
  Treats all disagreements equally.  
  _Example: Disagreeing between `0` and `1` is treated the same as between `0` and `2`._

- **linear**:  
  Weights disagreements by their distance.  
  _Example: A disagreement between `0` and `2` is considered twice as bad as between `0` and `1`._

- **quadratic**:  
  Weights disagreements by the square of their distance.  
  _Example: A disagreement between `0` and `2` is considered four times as bad as between `0` and `1`._

> **Note **: If `n_runs` > 1, the reported metrics will include **variability across runs**, allowing you to assess the **consistency** of LLM performance.  
> Lower variance indicates more stable and reliable model behavior.

In [7]:
# 10) Compute metrics from the detailed results
# First, compute kappa metrics
kappa_df, detailed_kappa_metrics = compute_kappa_metrics(
    detailed_results_df=complex_case_for_metrics,
    annotation_columns=annotation_columns,
    labels=labels,
    show_runs=True
)

kappa_df


=== Columns in detailed_results_df (in compute_kappa_metrics) ===
['sample_id', 'split', 'verbatim', 'iteration', 'Rater_Oli', 'Rater_chloe', 'Rater_RA', 'ModelPrediction', 'Raisonnement', 'run', 'prompt_name', 'use_validation_set']


Unnamed: 0,prompt_name,iteration,run,use_validation_set,N_train,N_val,accuracy_GT_train,kappa_GT_train,mean_kappa_llm_human,mean_human_human_agreement,Rater_Oli_kappa_GT,Rater_chloe_kappa_GT,Rater_RA_kappa_GT,n_runs
0,french_prompt,1,1,False,20,0,0.85,0.6875,0.606481,0.671222,0.79798,1.0,0.705882,
1,french_prompt,1,aggregated,False,20,0,0.85,0.6875,0.606481,0.671222,0.79798,1.0,0.705882,1.0


In [8]:
# Additional details about the kappa metrics

print("\n=== Detailed Kappa Metrics ===")
if detailed_kappa_metrics:
    for scenario_key, metrics in detailed_kappa_metrics.items():
        print(f"\nScenario: {scenario_key}")
        
        print("\nLLM vs Human Annotators:")
        print(metrics['llm_vs_human_df'])
        
        print("\nHuman vs Human Annotators:")
        print(metrics['human_vs_human_df'])
else:
    print("No detailed kappa metrics available.")


=== Detailed Kappa Metrics ===

Scenario: french_prompt_iteration_1

LLM vs Human Annotators:
  Human_Annotator  Cohens_Kappa
0       Rater_Oli      0.897959
1     Rater_chloe      0.897959
2        Rater_RA      0.615385

Human vs Human Annotators:
   Annotator_1  Annotator_2  Cohens_Kappa
0    Rater_Oli  Rater_chloe      0.797980
1    Rater_Oli     Rater_RA      0.509804
2  Rater_chloe     Rater_RA      0.705882

Scenario: english_prompt_iteration_1.0

LLM vs Human Annotators:
  Human_Annotator  Cohens_Kappa
0       Rater_Oli      1.000000
1     Rater_chloe      0.615385
2        Rater_RA      0.285714

Human vs Human Annotators:
   Annotator_1  Annotator_2  Cohens_Kappa
0    Rater_Oli  Rater_chloe      0.615385
1    Rater_Oli     Rater_RA      0.285714
2  Rater_chloe     Rater_RA      0.545455

Scenario: english_prompt_iteration_2.0

LLM vs Human Annotators:
  Human_Annotator  Cohens_Kappa
0       Rater_Oli      1.000000
1     Rater_chloe      0.615385
2        Rater_RA      0.2857

### Classification Metrics (Per-Class Analysis)

Analyze detailed classification metrics for each class, focusing on **recall** and **confusion matrix elements**.

This analysis uses the **majority vote from human annotations** as the ground truth and provides:

#### Global Metrics (prefix: `global_*`)

- `global_accuracy_train`: Overall accuracy on training data
- `global_recall_train`: Macro recall on training data
- `global_error_rate_train`: 1 - accuracy

(And similarly for validation data with suffix `_val`, if `use_validation_set = True`)

#### Per-Class Metrics (prefix: `class_<label>_*_train`)

For each class label (e.g., `0`, `1`), the following are computed:

- `class_<label>_recall_train`: Proportion of actual class instances correctly identified (True Positives)
- `class_<label>_error_rate_train`: Proportion of actual class instances incorrectly classified (Miss Rate)
- `class_<label>_correct_count_train`: Number of correctly predicted instances
- `class_<label>_missed_count_train`: Number of missed instances (False Negatives)
- `class_<label>_false_positives_train`: Number of incorrect predictions *as* this class (False Positives)


In [9]:
# Compute classification metrics
classification_df = compute_classification_metrics_from_results(
    detailed_results_df=complex_case_for_metrics,
    annotation_columns=annotation_columns,
    labels=labels,
    show_runs=True
)

pd.set_option("display.max_columns", None)    # show all columns
classification_df


=== Columns in detailed_results_df (in compute_classification_metrics_from_results) ===
['sample_id', 'split', 'verbatim', 'iteration', 'Rater_Oli', 'Rater_chloe', 'Rater_RA', 'ModelPrediction', 'Raisonnement', 'run', 'prompt_name', 'use_validation_set', 'Reasoning', 'prompt_iteration', 'best_accuracy', 'total_iterations', 'is_best_iteration']


Unnamed: 0,prompt_name,iteration,run,use_validation_set,N_train,N_val,prompt_iteration,global_accuracy_train,global_recall_train,global_error_rate_train,class_0_recall_train,class_0_error_rate_train,class_0_correct_count_train,class_0_missed_count_train,class_0_false_positives_train,class_1_recall_train,class_1_error_rate_train,class_1_correct_count_train,class_1_missed_count_train,class_1_false_positives_train,class_2_recall_train,class_2_error_rate_train,class_2_correct_count_train,class_2_missed_count_train,class_2_false_positives_train,global_accuracy_val,global_recall_val,global_error_rate_val,class_0_recall_val,class_0_error_rate_val,class_0_correct_count_val,class_0_missed_count_val,class_0_false_positives_val,class_1_recall_val,class_1_error_rate_val,class_1_correct_count_val,class_1_missed_count_val,class_1_false_positives_val,class_2_recall_val,class_2_error_rate_val,class_2_correct_count_val,class_2_missed_count_val,class_2_false_positives_val,n_runs
0,french_prompt,1,1,False,20,0,,0.95,0.62963,0.05,0.888889,0.111111,8,1,0,1.0,0.0,11,0,1,0.0,1.0,0,0,0,,,,,,,,,,,,,,,,,,,
1,french_prompt,1,2,False,20,0,,0.95,0.62963,0.05,0.888889,0.111111,8,1,0,1.0,0.0,11,0,1,0.0,1.0,0,0,0,,,,,,,,,,,,,,,,,,,
2,english_prompt,1,1,True,5,5,1.0,0.8,0.555556,0.2,0.666667,0.333333,2,1,0,1.0,0.0,2,0,1,0.0,1.0,0,0,0,1.0,0.666667,0.0,1.0,0.0,2.0,0.0,0.0,1.0,0.0,3.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,
3,english_prompt,1,2,True,5,5,1.0,0.8,0.555556,0.2,0.666667,0.333333,2,1,0,1.0,0.0,2,0,1,0.0,1.0,0,0,0,1.0,0.666667,0.0,1.0,0.0,2.0,0.0,0.0,1.0,0.0,3.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,
4,english_prompt,2,1,True,5,5,2.0,0.6,0.444444,0.4,0.333333,0.666667,1,2,0,1.0,0.0,2,0,2,0.0,1.0,0,0,0,1.0,0.666667,0.0,1.0,0.0,2.0,0.0,0.0,1.0,0.0,3.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,
5,english_prompt,2,2,True,5,5,2.0,0.8,0.555556,0.2,0.666667,0.333333,2,1,0,1.0,0.0,2,0,1,0.0,1.0,0,0,0,1.0,0.666667,0.0,1.0,0.0,2.0,0.0,0.0,1.0,0.0,3.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,
6,english_prompt,3,1,True,5,5,3.0,0.8,0.555556,0.2,0.666667,0.333333,2,1,0,1.0,0.0,2,0,1,0.0,1.0,0,0,0,1.0,0.666667,0.0,1.0,0.0,2.0,0.0,0.0,1.0,0.0,3.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,
7,english_prompt,3,2,True,5,5,3.0,0.6,0.444444,0.4,0.333333,0.666667,1,2,0,1.0,0.0,2,0,2,0.0,1.0,0,0,0,1.0,0.666667,0.0,1.0,0.0,2.0,0.0,0.0,1.0,0.0,3.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,
8,english_prompt,3,aggregated,True,30,30,,0.733333,0.518519,0.266667,0.555556,0.444444,10,8,0,1.0,0.0,12,0,8,0.0,1.0,0,0,0,1.0,0.666667,0.0,1.0,0.0,12.0,0.0,0.0,1.0,0.0,18.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2.0
9,french_prompt,1,aggregated,False,40,0,,0.95,0.62963,0.05,0.888889,0.111111,16,2,0,1.0,0.0,22,0,2,0.0,1.0,0,0,0,,,,,,,,,,,,,,,,,,,2.0


### Krippendorff’s α Non‑Inferiority Test  
*(Requires ≥ 3 human annotation columns)*

#### Purpose

This test evaluates whether the model's annotations are **statistically non-inferior** to fully human-annotated data.  
If successful, this means the model can probably take over the annotation of the remaining, unlabeled data.

#### How the Test Works

- **Human reliability (`α_human`)**  
  Krippendorff’s α is computed across all *n* human annotators.

- **Model reliability (`α_model`)**  
  For each possible panel of (*n − 1*) humans + the model, compute Krippendorff’s α.  
  The final value is the **mean** α across all such combinations.

- **Effect size (Δ)**  
  \[
  \Delta = \alpha_{\text{model}} - \alpha_{\text{human}}
  \]  
  - Positive Δ → Model improves reliability  
  - Negative Δ → Performance drop

- **Uncertainty estimation via bootstrapping**  
  The dataset is resampled thousands of times (e.g., 2,000) to recompute Δ.  
  A **90 % confidence interval (CI)** (configurable) is constructed to show where the true Δ likely lies.


- **Non‑Inferiority Margin (`δ`)**
    You define `δ` (commonly set to **−0.05**) as the **largest acceptable drop** in α when using the model.

- **Decision rule**:  
  If the entire confidence interval lies **above `δ`**, the model is declared **non-inferior**.  
  With a 90 % CI, this reflects a **5 % one-sided risk** of wrongly approving a model worse than the lower born of the CI.

#### Interpretation Cheatsheet

| CI Position                 | What It Means for Deployment                                               |
|----------------------------|-----------------------------------------------------------------------------|
| CI fully above **0**       | ✅ Model is **statistically superior** to humans  |
| CI fully above **δ**, but crosses 0 | ✅ Model is **non-inferior** (small, acceptable loss)     |
| CI touches or falls below **δ** | ❌ Model is possibly worse than the humans by the δ margin|

#### Why “5 % Risk”?

- A 90 % CI corresponds to a **one-sided α = 0.05** non-inferiority test.
- This 5 % risk applies to the **margin δ**, not to zero.
- If the CI just touches δ → ≈ 5 % chance that the **true Δ ≤ δ**
- If the CI is well above δ → Risk that **true Δ ≤ 0** is even lower than 5 %

#### Settings and Their Effects

| Setting                        | Increase →                          | Decrease →                          |
|-------------------------------|-------------------------------------|-------------------------------------|
| **Confidence level** (e.g. 90 % → 95 %) | – CI gets **wider**<br>– Test becomes **stricter**<br>– Type I error drops (5 % → 2.5 %) | – CI gets **narrower**<br>– Easier to declare non-inferiority<br>– Higher false positive risk |
| **Non-inferiority margin `δ`** (e.g. −0.05 → −0.10) | – You tolerate a **larger drop**<br>– Easier for model to pass<br>– Lower guaranteed quality | – You demand **closer match to humans**<br>– Harder to pass<br>– Stronger quality guarantee |


In [10]:
# Run the non-inferiority test
non_inferiority_results = compute_krippendorff_non_inferiority(
    detailed_results_df=complex_case_for_metrics,
    annotation_columns=annotation_columns,
    model_column="ModelPrediction",
    level_of_measurement='ordinal',
    non_inferiority_margin=-0.05,
    n_bootstrap=2000, 
    confidence_level=90.0,
    random_seed=42, 
    verbose=False   
)

# Print results in a formatted way
print_non_inferiority_results(non_inferiority_results, show_per_run=False)


=== Non-inferiority Test: english_prompt_iteration_3 ===
Human trios α: 0.4568 ± 0.0000
Model trios α: 0.4719 ± 0.0000
Δ = model − human = +0.0152 ± 0.0000
90% CI: [-0.1614, 0.1564]
Non-inferiority demonstrated in 0/2 runs
❌ Non-inferiority NOT demonstrated in any run (margin = -0.05)

=== Non-inferiority Test: french_prompt_iteration_1 ===
Human trios α: 0.6722 ± 0.0000
Model trios α: 0.7586 ± 0.0000
Δ = model − human = +0.0864 ± 0.0000
90% CI: [0.0226, 0.1567]
Non-inferiority demonstrated in 2/2 runs
✅ Non-inferiority consistently demonstrated across all runs (margin = -0.05)


### Alternative Annotator Test (ALT-Test)

The **ALT-Test** evaluates whether an LLM can perform **as well as or better than human annotators**, based on a **leave-one-human-out** approach.

This method requires **at least 3 human annotation columns**.

#### How It Works

- The LLM is compared against **each human annotator**, one at a time.
- For each comparison:
  - One human is **excluded**
  - The model’s predictions are evaluated **against the remaining human annotations**
  - This simulates a realistic setting where the LLM replaces a single annotator and is judged by agreement with the rest

#### Key Metrics in Output

- **`winning_rate_train`**: Proportion of annotators for which the LLM performs as well or better (after adjusting for ε)
- **`passed_alt_test_train`**: `True` if the LLM passes the test (i.e., `winning_rate ≥ 0.5`)
- **`avg_adv_prob_train`**: Average advantage probability, how likely the model is better across comparisons
- **`p_values_train`**: List of p-values for each comparison

#### Interpreting `ε` (Epsilon)

- `ε` accounts for the **cost/effort/time trade-off** between using an LLM and a human annotator.
- Higher `ε` gives the model more leeway, useful when **human annotations are costly**.
- Recommendations from the original paper:
  - `ε = 0.2` → when humans are **experts**
  - `ε = 0.1` → when humans are **crowdworkers**

> If `winning_rate ≥ 0.5`, the LLM is considered **statistically competitive with human annotators** for this dataset and scenario (the LLM is "better" than half the humans).

In [11]:
# Run ALT test
epsilon = 0.2  # Epsilon parameter for ALT test
alt_test_df = run_alt_test_on_results(
    detailed_results_df=complex_case_for_metrics,
    annotation_columns=annotation_columns,
    labels=labels,
    epsilon=epsilon,
    alpha=0.05,
    verbose=verbose,
    show_runs=True
)
alt_test_df = alt_test_df.drop(
    columns=["iteration", "run", "use_validation_set", "N_val", "n_runs"]
)

pd.set_option("display.max_colwidth", None)   # show full content in each cell
alt_test_df


=== Columns in detailed_results_df (in run_alt_test_on_results) ===
['sample_id', 'split', 'verbatim', 'iteration', 'Rater_Oli', 'Rater_chloe', 'Rater_RA', 'ModelPrediction', 'Raisonnement', 'run', 'prompt_name', 'use_validation_set', 'Reasoning', 'prompt_iteration', 'best_accuracy', 'total_iterations', 'is_best_iteration']
=== ALT Test: Label Debugging ===
Label counts for each rater:
  ModelPrediction: 20 valid labels
  Rater_Oli: 20 valid labels
  Rater_chloe: 20 valid labels
  Rater_RA: 20 valid labels

Label types for each rater:
  ModelPrediction: int64
  Rater_Oli: int64
  Rater_chloe: int64
  Rater_RA: int64

Mixed types across raters: False

=== Converting labels to consistent types ===
Using label_type: int
Model predictions type after conversion: <class 'numpy.int32'>
Rater_Oli type after conversion: <class 'numpy.int32'>
Rater_chloe type after conversion: <class 'numpy.int32'>
Rater_RA type after conversion: <class 'numpy.int32'>
=== Alt-Test: summary ===
P-values for each

Unnamed: 0,prompt_name,N_train,prompt_iteration,winning_rate_train,passed_alt_test_train,avg_adv_prob_train,p_values_train,winning_rate_val,passed_alt_test_val,avg_adv_prob_val,p_values_val
0,french_prompt,20,,1.0,True,1.0,"[3.9749874992087996e-05, 0.0, 0.0002056469356819377]",,,,
1,french_prompt,20,,1.0,True,1.0,"[3.9749874992087996e-05, 0.0, 0.0002056469356819377]",,,,
2,english_prompt,5,1.0,0.666667,True,1.0,"[0.0, 0.0, 0.05805826175840779]",1.0,True,1.0,"[0.0, 0.0, 0.0]"
3,english_prompt,5,1.0,0.666667,True,1.0,"[0.0, 0.0, 0.05805826175840779]",1.0,True,1.0,"[0.0, 0.0, 0.0]"
4,english_prompt,5,2.0,0.0,False,0.8,"[0.5, 0.5, 0.28071902212526284]",1.0,True,1.0,"[0.0, 0.0, 0.0]"
5,english_prompt,5,2.0,0.666667,True,1.0,"[0.0, 0.0, 0.05805826175840779]",1.0,True,1.0,"[0.0, 0.0, 0.0]"
6,english_prompt,5,3.0,0.666667,True,1.0,"[0.0, 0.0, 0.05805826175840779]",1.0,True,1.0,"[0.0, 0.0, 0.0]"
7,english_prompt,5,3.0,0.0,False,0.8,"[0.5, 0.5, 0.28071902212526284]",1.0,True,1.0,"[0.0, 0.0, 0.0]"
8,english_prompt,30,,0.0,False,0.9,"[0.25, 0.25, 0.1693886419418353]",1.0,True,1.0,"[0.0, 0.0, 0.0]"
9,french_prompt,40,,1.0,True,1.0,"[3.9749874992087996e-05, 0.0, 0.0002056469356819377]",,,,


## Final Step: Classify the Full Dataset

If you are satisfied with the evaluation metrics, you can now use the **best-performing scenario** to classify the **entire unlabeled dataset**.

Simply **copy the chosen scenario** and run the classification.

> This time, only **one run is needed**, since you're not computing evaluation metrics (there are no human labels to compare against).

If you're **not satisfied with the results**, feel free to continue exploring and testing **different scenarios**.

In [None]:
scenario = [
    {
        # LLM settings
        "provider_llm1": "azure",
        "model_name_llm1": "gpt-4o",
        "temperature_llm1": 0,
        "prompt_name": "few_shot",
        "subsample_size": -1,  # Size of data subset to use

        # Prompt configuration
        "template": """
Vous êtes un assistant chargé d’évaluer des entrées de données.

Les données comprennent les colonnes suivantes :
- "réponse": La réponse à évaluer
- "type": Le type de réponse attendue
- "mots-clés": Les mots-clés attendus. Tous les mot-clés ne doivent pas nécessairement être présents dans la réponse.
- "process": L'explication du processus, si celui-ci est attendu.
- "annotation": La grille d'annotation pour cette tâche.
- "exemple_faux": Un exemple de réponse fausse (0)
- "exemple_moyen": Un exemple de réponse partiellement correcte (1)
- "exemple_juste": Un exemple de réponse correcte (2)

Voici une entrée à évaluer :
{verbatim_text}

Tâche d’évaluation :
Pour chaque entrée, évaluer si la réponse est fausse, partiellement correcte ou correcte, en utilisant l’échelle fournie (annotation).
La réponse est écrite par des enfants, l'orthographe et la grammaire ne sont pas importantes.
La réponse n'a pas besoin d'être parfaitement similaire à l'exemple juste pour être considérée comme un 2. Si la réponse est correcte dans l'esprit, elle peut être considérée comme un 2 plutôt que comme un 1.
Répondre en donnant le Raisonnement et la Classification (0, 1 ou 2) de la réponse.
""",
        # Output
        "selected_fields": ["Classification", "Reasoning"],
        "prefix": "Classification",
        "label_type": "int",
        "response_template":
        """
S'il te plait, suis le format JSON ci-dessous :
```json
{{
  "Raisonnement": "Ton raisonnement ici",
  "Classification": "Ton integer ici"
}}
""",
        "json_output": True,

        # Prompt optimization
        "provider_llm2": "azure",
        "model_name_llm2": "gpt-4o",
        "temperature_llm2": 0.7,
        "max_iterations": 1,
        "use_validation_set": False,
        "validation_size": 10,
        "random_state": 42,

        # Majority vote
        "n_completions": 1,

    },
]

# Run the scenario
complex_case_fully_annotated = run_scenarios(
    scenarios=scenario,
    data=data,
    annotation_columns=annotation_columns,
    labels=labels,
    n_runs=n_runs,
    verbose=verbose
)

In [None]:
complex_case_fully_annotated.to_csv("data/multiclass_user_case/outputs/multiclass_case_fully_annotated.csv", sep=";", index=False, encoding="utf-8-sig")