# EDA - Experiment Results Data

This notebook shows the structure of the results obtained in these experiments.
It walks through the columns of a result file and demonstrates one metric calculation using ROUGE-1, the main metric used to evaluate the experiments.

In [96]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import evaluate
import ast
import re
import os

## 1. Overview of the file

Each result file is in .csv format, and its name indicates the specifications of the executed experiment.
<br> The possible settings are:

*   **Model:**
    -   Llama3 - 8B, Gemma2 - 9B or Gemma2 - 27B

*   **Selection Method**:
    -   KATE

*   **K Demonstrations**:
    - 8, 16, 32 or 64

*   **Context Generalization**:
    - general (KATE finds the closest training samples to a test sample across all classes) or in_class (KATE finds the closest samples only within the sample's own class)
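
As an illustration of how these settings map onto a file name, the sketch below parses a name such as the one loaded in this notebook. The naming scheme `results_<model>_<method>_<k>_<context>.csv` is an assumption inferred from that single example, and `parse_result_filename` is a hypothetical helper, not part of the experiment code.

```python
import re

# Assumed naming scheme, inferred from one example file name:
# results_<model>_<method>_<k>_<context>.csv
FILENAME_PATTERN = re.compile(
    r"^results_(?P<model>.+)_(?P<method>[a-z]+)_(?P<k>\d+)"
    r"_(?P<context>general|in_class)\.csv$"
)


def parse_result_filename(filename):
    """Split a result file name into its experiment settings (hypothetical helper)."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unexpected file name: {filename}")
    settings = match.groupdict()
    settings["k"] = int(settings["k"])  # number of demonstrations as an integer
    return settings


print(parse_result_filename("results_gemma2_27_kate_32_in_class.csv"))
# {'model': 'gemma2_27', 'method': 'kate', 'k': 32, 'context': 'in_class'}
```

File names that follow a different scheme would raise a `ValueError`, which makes a mismatch with the assumed pattern visible instead of silently producing wrong settings.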


The available columns in each file are:

*   **index**: indicates the index of the row
*   **task**: name of the task; the possible tasks are described in the README of the experiment
*   **input**: test input submitted on the prompt
*   **output**: ground-truth output
*   **possible_outputs**: other acceptable ground-truth outputs, when applicable (for example, for translations)
*   **input_encoding**: embedding of the input obtained using Sentence-BERT ('sentence-transformers/stsb-roberta-large')
*   **output_encoding**: embedding of the output obtained using Sentence-BERT ('sentence-transformers/stsb-roberta-large')
*   **distances**: the cosine distances between the input embedding and every training sample
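
To make the role of the cosine distances concrete, here is a minimal sketch of nearest-neighbour demonstration selection in the two context settings. It is not the experiment's implementation: the function names, the toy 2-D embeddings and the task labels are all made up for illustration.

```python
import numpy as np


def cosine_distances(query, candidates):
    """Cosine distance (1 - cosine similarity) between one vector and each matrix row."""
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return 1.0 - candidates @ query


def select_demonstrations(query, train_embeddings, train_tasks, k, query_task=None):
    """Return the indices of the k nearest training samples.

    query_task=None mimics the "general" setting (search across all tasks);
    passing a task name mimics "in_class" (search only within that task).
    """
    distances = cosine_distances(query, train_embeddings)
    candidate_idx = np.arange(len(train_tasks))
    if query_task is not None:
        candidate_idx = candidate_idx[np.asarray(train_tasks) == query_task]
    order = np.argsort(distances[candidate_idx])[:k]
    return candidate_idx[order]


# Toy 2-D embeddings (made up for illustration)
train_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]])
train_tasks = ["antonyms", "diff", "antonyms", "antonyms"]
query = np.array([1.0, 0.05])

print(select_demonstrations(query, train_embeddings, train_tasks, k=2))
print(select_demonstrations(query, train_embeddings, train_tasks, k=2, query_task="antonyms"))
```

In the toy data the second-closest sample overall belongs to a different task, so the "general" call returns it while the "in_class" call skips it in favour of a sample from the query's own task.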

In [80]:
results = "data/results/results_gemma2_27_kate_32_in_class.csv"
df = pd.read_csv(results)

In [3]:
df.head()

Unnamed: 0,k,task,input,output,predicted_output,possible_outputs,prompt
0,32,active_to_passive,The professor mentioned the artist.,The artist was mentioned by the professor.,\n Output: The artist was mentioned by the pro...,,input_variables=['input'] examples=[{'input': ...
1,32,active_to_passive,The presidents recommended the lawyer.,The lawyer was recommended by the presidents.,\n Output:,,input_variables=['input'] examples=[{'input': ...
2,32,active_to_passive,The professors thanked the tourists.,The tourists were thanked by the professors.,\n Output: The $ was _{} the professor by the.,,input_variables=['input'] examples=[{'input': ...
3,32,active_to_passive,The scientist contacted the judge.,The judge was contacted by the scientist.,\n Output: The judge was contacted by the scie...,,input_variables=['input'] examples=[{'input': ...
4,32,active_to_passive,The doctor stopped the managers.,The managers were stopped by the doctor.,\n Output: The managers were stopped by the do...,,input_variables=['input'] examples=[{'input': ...


## 2. Example of ROUGE-1 metric

ROUGE-1 was chosen as the metric because it handles generated texts of different lengths reasonably well, although its variance depends on the sentence length.


Below is an example of the calculation. Note that the comparison must also take the possible outputs into account: the prediction is scored against each of them, and the highest score is kept.
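
For intuition about what is being measured, the following is a simplified sketch of ROUGE-1 as a unigram-overlap F1 score, taking the maximum over several references. It assumes lowercased whitespace tokenization; the `rouge` metric loaded through `evaluate` uses its own tokenizer, so its values may differ slightly. The function names are hypothetical.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 with lowercased whitespace tokens (simplified ROUGE-1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both texts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_rouge1(prediction, references):
    """Score the prediction against every reference and keep the highest value."""
    return max(rouge1_f1(prediction, ref) for ref in references)


print(best_rouge1("puzzle", ["rendre perplexe", "devinette", "puzzle"]))  # 1.0
# A truncated prediction has perfect precision but a recall of 4/7
print(rouge1_f1("the judge was contacted", "the judge was contacted by the scientist"))
```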

In [4]:
rouge = evaluate.load('rouge')

In [37]:
# Sample example
df.loc[1972]

k                                                                  32
task                                                translation_en-fr
input                                                          puzzle
output                                                         puzzle
predicted_output                                    \n Output: puzzle
possible_outputs    ['puzzle', 'rendre perplexe', 'devinette', 'my...
prompt              input_variables=['input'] examples=[{'input': ...
Name: 1972, dtype: object

In [38]:
# Calculating Rouge-1 for one sample (with possible outputs)

references = ast.literal_eval(df.loc[1972]["possible_outputs"])
# Also include the ground-truth output of the same sample (row 1972)
references.append(df.loc[1972]["output"])
references

['puzzle',
 'rendre perplexe',
 'devinette',
 'mystère',
 'énigme',
 'casse-tête',
 'jeu de patience',
 'puzzle']
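
Many rows, such as the active_to_passive samples shown earlier, have an empty `possible_outputs` cell, which pandas reads as NaN and `ast.literal_eval` cannot parse. The sketch below shows one way to build the reference list for any row; `build_references` is a hypothetical helper, and dropping duplicate references is a design choice made here, not something taken from the experiment code.

```python
import ast


def build_references(output, possible_outputs):
    """Collect the ground truth plus any alternative outputs, without duplicates."""
    references = [output]
    # An empty CSV cell is read as NaN (a float), so only parse real strings
    if isinstance(possible_outputs, str) and possible_outputs.strip():
        for ref in ast.literal_eval(possible_outputs):
            if ref not in references:
                references.append(ref)
    return references


print(build_references("puzzle", "['puzzle', 'rendre perplexe', 'devinette']"))
print(build_references("The artist was mentioned by the professor.", float("nan")))
```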

In [41]:
pattern = r"Output: (.*)"
predicted_output = df.loc[1972]["predicted_output"]
predictions = [re.search(pattern, predicted_output).group(1)]
predictions

['puzzle']
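
The `re.search(...).group(1)` call above raises an `AttributeError` when the pattern does not match, which would happen for a generation that stops right after `Output:` (as in the second row of the preview earlier). A more defensive extraction might look like the sketch below; `extract_prediction` is a hypothetical helper, and treating a missing answer as an empty string is an assumption about how failed generations should be scored.

```python
import re

# Allow "Output:" with or without trailing text on the same line
OUTPUT_PATTERN = re.compile(r"Output:[ \t]*(.*)")


def extract_prediction(raw_text):
    """Return the text following 'Output:', or '' when nothing usable is found."""
    if not isinstance(raw_text, str):
        return ""  # e.g. NaN read from an empty CSV cell
    match = OUTPUT_PATTERN.search(raw_text)
    return match.group(1).strip() if match else ""


print(extract_prediction("\n Output: puzzle"))  # puzzle
print(repr(extract_prediction("\n Output:")))   # ''
```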

In [45]:
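# Score the prediction against each reference and keep the best result
scores = []
for ref in references:
    # Named `result` so the `results` file path defined earlier is not overwritten
    result = rouge.compute(predictions=predictions, references=[ref])
    scores.append(result["rouge1"])
print(max(scores))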

1.0


## 3. Merge of all experiments

Here all experiment results are merged and saved to a file called "merged_results.csv" in the "data" directory. <br>
The data comes from the "compiled" directory, which holds the experiment identification and the "rouge1" score for each experiment result. <br>
These files can be generated by running the script "calculate_rouge1_metric.py".

In [134]:
def merge_results(directory):
    """Concatenate every CSV file in `directory` and save it to data/merged_results.csv."""

    dfs = []

    # Iterate over all files in the directory
    for filename in os.listdir(directory):
        if filename.endswith(".csv"):
            file_path = os.path.join(directory, filename)
            # Read the CSV file into a DataFrame and append it to the list
            dfs.append(pd.read_csv(file_path, index_col=False))

    # Concatenate all DataFrames in the list into a single DataFrame
    combined_df = pd.concat(dfs, ignore_index=True)

    # Save the combined DataFrame to a new CSV file
    combined_df.to_csv('data/merged_results.csv', index=False)

    # Confirm that the file was written
    print("Saved file!")

In [135]:
merge_results("data/compiled")

Saved file!


## 4. Compile results for all experiments

Here all experiment results are joined and compiled into a complete dataframe with the mean ROUGE-1, both overall and per task, for all the settings described in section 1.

The columns of the compilation are the following:
*   **model:**
    -   Llama3 - 8B, Gemma2 - 9B or Gemma2 - 27B

*   **method**:
    -   KATE

*   **k**:
    - 8, 16, 32 or 64

*   **experiment_type**:
    - general (KATE finds the closest training samples to a test sample across all classes) or in_class (KATE finds the closest samples only within the sample's own class)

*   **task**:
    - "all" when computed over all tasks, otherwise the task name

*   **mean_rougeL**, **median_rougeL**, **stdev_rougeL**:
    - mean, median and standard deviation of the ROUGE-1 scores


In [140]:
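# Note: the "rouge1" column is renamed to "rougeL" here, but its values are still ROUGE-1 scores
merged = (
    pd.read_csv("data/merged_results.csv", index_col=False)
    .drop("Unnamed: 0", axis=1)
    .rename({"rouge1": "rougeL"}, axis=1)
)
merged.head()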

Unnamed: 0,model,method,k,experiment_type,task,input,output,predicted_output,possible_outputs,rougeL
0,gemma2-27,kate,16,general,active_to_passive,The professor mentioned the artist.,The artist was mentioned by the professor.,The artist was mentioned by the professor.,,1.0
1,gemma2-27,kate,16,general,active_to_passive,The presidents recommended the lawyer.,The lawyer was recommended by the presidents.,The lawyer was recommended by the presidents.,,1.0
2,gemma2-27,kate,16,general,active_to_passive,The professors thanked the tourists.,The tourists were thanked by the professors.,The tourists were thanked by the professors.,,1.0
3,gemma2-27,kate,16,general,active_to_passive,The scientist contacted the judge.,The judge was contacted by the scientist.,The judge was contacted by the scientist.,,1.0
4,gemma2-27,kate,16,general,active_to_passive,The doctor stopped the managers.,The managers were stopped by the doctor.,The managers were stopped by the doctor.,,1.0


In [141]:
# Group by task

grouped_task = merged.groupby(['model', 'k', 'experiment_type', 'task']).agg(
    mean_rougeL=('rougeL', 'mean'),
    median_rougeL=('rougeL', 'median'),
    stdev_rougeL=('rougeL', 'std')
).reset_index()

grouped_task.head()

Unnamed: 0,model,k,experiment_type,task,mean_rougeL,median_rougeL,stdev_rougeL
0,gemma2-27,8,general,active_to_passive,0.909281,1.0,0.226649
1,gemma2-27,8,general,antonyms,0.19,0.0,0.394277
2,gemma2-27,8,general,diff,0.12,0.0,0.326599
3,gemma2-27,8,general,first_word_letter,0.186297,0.0,0.374533
4,gemma2-27,8,general,larger_animal,0.32672,0.0,0.402796


In [142]:
# Group across all tasks

grouped_all = merged.groupby(['model', 'k', 'experiment_type']).agg(
    mean_rougeL=('rougeL', 'mean'),
    median_rougeL=('rougeL', 'median'),
    stdev_rougeL=('rougeL', 'std')
).reset_index()

grouped_all["task"] = "all"

grouped_all.head()

Unnamed: 0,model,k,experiment_type,mean_rougeL,median_rougeL,stdev_rougeL,task
0,gemma2-27,8,general,0.19597,0.0,0.356272,all
1,gemma2-27,8,inclass,0.319652,0.0,0.429769,all
2,gemma2-27,16,general,0.208515,0.0,0.365186,all
3,gemma2-27,16,inclass,0.371711,0.0,0.45055,all
4,gemma2-27,32,general,0.214987,0.0,0.366556,all


In [143]:
# Join the overall and the per-task aggregations
aggregated = pd.concat([grouped_all, grouped_task], ignore_index=True)
aggregated.to_csv("data/aggregated_results.csv")
