# ToT Analysis Results

This notebook analyses the evaluation results produced by ToT-evaluate-Mixtral.ipynb, ToT-evaluate-Mistral.ipynb and ToT-evaluate-Mistral-tuned.ipynb. The evaluations are based on the MMLU dataset (Hendrycks et al., 2021a; Hendrycks et al., 2021b; Hendrycks et al., 2023).

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a. Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 5 August 2024].

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021b. Measuring Massive Multitask Language Understanding. ICLR 2021, 4 May 2021, Vienna. Ithaca: Cornell University Library, arXiv.org, pp.1-27. Available from: https://arxiv.org/pdf/2009.03300.pdf [Accessed 5 August 2024].
 
Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D. and Steinhardt, J., 2023. Aligning AI With Shared Human Values. ICLR 2021, 4 May 2021, Vienna. Ithaca: Cornell University Library, arXiv.org, pp.1-29. Available from: https://arxiv.org/pdf/2008.02275.pdf [Accessed 5 August 2024]. 

In [10]:
import pandas as pd

def analyze_evaluate_dataset(evaluate_dataset):
    """Return the mean answer_evaluation value per subject."""
    # Column name reused from: Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a.
    # Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 30 December 2024].
    # Para.1
    first_column = "subject"
    #
    second_column = "answer_evaluation"
    # Code adapted from: pandas, 2024. How do I select a subset of a DataFrame? (v.2.2) [Online].
    # Available from: https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html#how-do-i-select-a-subset-of-a-dataframe [Accessed 30 December 2024].
    evaluate_dataset = evaluate_dataset[[first_column, second_column]]
    #
    # Code adapted from: pandas, 2024. pandas.DataFrame.groupby (v.2.2) [Online].
    # Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html [Accessed 30 December 2024].
    evaluate_dataset = evaluate_dataset.groupby([first_column]).mean()
    #
    return evaluate_dataset

def create_report(simple_evaluate_path, tot_evaluate_path):
    """Load both evaluation CSV files and report the per-subject means and their difference (ToT minus simple)."""
    # Column name reused from: Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a.
    # Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 30 December 2024].
    # Para.1
    first_column = "subject"
    #

    # Dataset loading adapted from: pandas, 2024. pandas.read_csv (v.2.2) [Online].
    # Available from: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html [Accessed 17 August 2024].
    simple_evaluate_dataset = pd.read_csv(simple_evaluate_path)
    tot_evaluate_dataset = pd.read_csv(tot_evaluate_path)
    #
    simple_evaluate_dataset = analyze_evaluate_dataset(simple_evaluate_dataset)
    tot_evaluate_dataset = analyze_evaluate_dataset(tot_evaluate_dataset)
    # Code adapted from: pandas, 2024. pandas.DataFrame.join (v.2.2) [Online].
    # Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html [Accessed 30 December 2024].
    simple_and_tot_evaluate_dataset = simple_evaluate_dataset.join(tot_evaluate_dataset, on=first_column, lsuffix="_simple", rsuffix="_tot")
    #
    # Code adapted from: pandas, 2024. How to create new columns derived from existing columns (v.2.2) [Online].
    # Available from: https://pandas.pydata.org/docs/getting_started/intro_tutorials/05_add_columns.html#how-to-create-new-columns-derived-from-existing-columns [Accessed 30 December 2024].
    simple_and_tot_evaluate_dataset["difference"] = simple_and_tot_evaluate_dataset["answer_evaluation_tot"] - simple_and_tot_evaluate_dataset["answer_evaluation_simple"]
    #
    print(simple_and_tot_evaluate_dataset)
    return simple_and_tot_evaluate_dataset
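
To illustrate what `create_report` computes, the following minimal sketch applies the same group-by, join and difference steps to two small in-memory DataFrames. The rows are made up for illustration and are not taken from the evaluation files. `per_subject_accuracy` and `build_report` are hypothetical helpers that mirror the logic above without reading CSV files, and the sketch assumes that `answer_evaluation` holds 0/1 correctness flags.

```python
import pandas as pd

def per_subject_accuracy(frame):
    # Mean of the 0/1 correctness flags per subject, i.e. per-subject accuracy.
    return frame[["subject", "answer_evaluation"]].groupby("subject").mean()

def build_report(simple_frame, tot_frame):
    simple = per_subject_accuracy(simple_frame)
    tot = per_subject_accuracy(tot_frame)
    # Both frames are indexed by subject, so an index-on-index join aligns them.
    report = simple.join(tot, lsuffix="_simple", rsuffix="_tot")
    report["difference"] = (
        report["answer_evaluation_tot"] - report["answer_evaluation_simple"]
    )
    return report

# Made-up rows for illustration only (assumed 0/1 correctness flags).
simple_frame = pd.DataFrame({
    "subject": ["anatomy", "anatomy", "astronomy", "astronomy"],
    "answer_evaluation": [1, 0, 1, 1],
})
tot_frame = pd.DataFrame({
    "subject": ["anatomy", "anatomy", "astronomy", "astronomy"],
    "answer_evaluation": [1, 1, 0, 1],
})

report = build_report(simple_frame, tot_frame)
print(report)
```

A positive `difference` means the ToT run scored higher than the simple run for that subject, and a negative value means it scored lower.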

In [11]:
import os

current_path = os.getcwd()
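
The cells that follow build file paths by concatenating `current_path` with strings that start with `/`. One possible alternative, sketched here, is to use `pathlib`, which inserts the separator appropriate for the operating system. `evaluation_file` is a hypothetical helper that is not part of the notebook, and the base directory and file name in the example are placeholders.

```python
from pathlib import Path

def evaluation_file(base_dir, folder, file_name):
    # Strip a leading slash so the folder is treated as relative to base_dir.
    return Path(base_dir) / folder.lstrip("/") / file_name

example_path = evaluation_file(
    "/home/user/project",  # hypothetical base directory
    "/test-dataset-for-evaluation-mixtral-simple",
    "results.csv",  # hypothetical file name
)
print(example_path)
```

`pd.read_csv` accepts `Path` objects, so a path built this way could be passed to `create_report` unchanged.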

## Experiment 1

In [36]:
simple_folder_1 = "/test-dataset-for-evaluation-mixtral-simple"
tot_folder_1 = "/test-dataset-for-evaluation-mixtral-tot"

Extract first answer

In [37]:
simple_evaluate_path_1 = current_path + simple_folder_1 + "/test_dataset_45_671_671_string_answer_evaluated_extracted_answer_evaluated.csv"
tot_evaluate_path_1 = current_path + tot_folder_1 + "/test_dataset_45_671_671_string_answer_evaluated_extracted_answer_evaluated.csv"
create_report(simple_evaluate_path_1, tot_evaluate_path_1)

                                     answer_evaluation_simple  \
subject                                                         
abstract_algebra                                     0.500000   
anatomy                                              0.571429   
astronomy                                            0.750000   
business_ethics                                      0.800000   
clinical_knowledge                                   0.615385   
college_biology                                      0.571429   
college_chemistry                                    0.600000   
college_computer_science                             0.200000   
college_mathematics                                  0.200000   
college_medicine                                     0.888889   
college_physics                                      0.600000   
computer_security                                    0.666667   
conceptual_physics                                   1.000000   
econometrics             

subject,answer_evaluation_simple,answer_evaluation_tot,difference
abstract_algebra,0.5,0.666667,0.166667
anatomy,0.571429,0.571429,0.0
astronomy,0.75,0.75,0.0
business_ethics,0.8,0.0,-0.8
clinical_knowledge,0.615385,0.769231,0.153846
college_biology,0.571429,0.142857,-0.428571
college_chemistry,0.6,0.0,-0.6
college_computer_science,0.2,0.0,-0.2
college_mathematics,0.2,0.4,0.2
college_medicine,0.888889,0.444444,-0.444444


Do not extract first answer

In [38]:
simple_evaluate_path_1 = current_path + simple_folder_1 + "/test_dataset_45_671_671_string_answer_evaluated_answer_evaluated.csv"
tot_evaluate_path_1 = current_path + tot_folder_1 + "/test_dataset_45_671_671_string_answer_evaluated_answer_evaluated.csv"
create_report(simple_evaluate_path_1, tot_evaluate_path_1)

                                     answer_evaluation_simple  \
subject                                                         
abstract_algebra                                     0.500000   
anatomy                                              0.571429   
astronomy                                            0.750000   
business_ethics                                      0.800000   
clinical_knowledge                                   0.769231   
college_biology                                      0.571429   
college_chemistry                                    0.600000   
college_computer_science                             0.200000   
college_mathematics                                  0.200000   
college_medicine                                     0.888889   
college_physics                                      0.600000   
computer_security                                    0.666667   
conceptual_physics                                   1.000000   
econometrics             

subject,answer_evaluation_simple,answer_evaluation_tot,difference
abstract_algebra,0.5,0.666667,0.166667
anatomy,0.571429,0.571429,0.0
astronomy,0.75,0.75,0.0
business_ethics,0.8,0.4,-0.4
clinical_knowledge,0.769231,0.846154,0.076923
college_biology,0.571429,0.142857,-0.428571
college_chemistry,0.6,0.4,-0.2
college_computer_science,0.2,0.0,-0.2
college_mathematics,0.2,0.4,0.2
college_medicine,0.888889,0.555556,-0.333333
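
Per-subject differences can be hard to read at a glance. The sketch below shows one possible way to condense a report of the shape returned by `create_report` into a few aggregate figures. `summarize_report` is a hypothetical helper, and the example frame contains made-up values rather than the results above.

```python
import pandas as pd

def summarize_report(report):
    difference = report["difference"]
    return {
        # Unweighted mean over subjects: every subject counts equally,
        # regardless of how many questions it contains.
        "mean_difference": float(difference.mean()),
        "improved": int((difference > 0).sum()),
        "unchanged": int((difference == 0).sum()),
        "worsened": int((difference < 0).sum()),
    }

# Made-up values for illustration only.
example_report = pd.DataFrame(
    {
        "answer_evaluation_simple": [0.5, 0.75, 0.5],
        "answer_evaluation_tot": [0.75, 0.75, 0.25],
    },
    index=pd.Index(["subject_a", "subject_b", "subject_c"], name="subject"),
)
example_report["difference"] = (
    example_report["answer_evaluation_tot"]
    - example_report["answer_evaluation_simple"]
)

summary = summarize_report(example_report)
print(summary)
```

Because the mean is taken over subjects rather than over individual questions, subjects with few test questions influence it as much as subjects with many.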


## Experiment 2

In [39]:
simple_folder_2 = "/test-dataset-for-evaluation-mistral-simple"
tot_folder_2 = "/test-dataset-for-evaluation-mistral-tot"

Extract first answer

In [40]:
simple_evaluate_path_2 = current_path + simple_folder_2 + "/test_dataset_45_671_671_string_answer_evaluated_extracted_answer_evaluated.csv"
tot_evaluate_path_2 = current_path + tot_folder_2 + "/test_dataset_45_671_671_string_answer_evaluated_extracted_answer_evaluated.csv"
create_report(simple_evaluate_path_2, tot_evaluate_path_2)

                                     answer_evaluation_simple  \
subject                                                         
abstract_algebra                                     0.333333   
anatomy                                              0.285714   
astronomy                                            0.625000   
business_ethics                                      0.800000   
clinical_knowledge                                   0.615385   
college_biology                                      0.571429   
college_chemistry                                    0.600000   
college_computer_science                             0.200000   
college_mathematics                                  0.000000   
college_medicine                                     0.777778   
college_physics                                      0.600000   
computer_security                                    0.833333   
conceptual_physics                                   0.583333   
econometrics             

subject,answer_evaluation_simple,answer_evaluation_tot,difference
abstract_algebra,0.333333,0.166667,-0.166667
anatomy,0.285714,0.142857,-0.142857
astronomy,0.625,0.125,-0.5
business_ethics,0.8,0.6,-0.2
clinical_knowledge,0.615385,0.461538,-0.153846
college_biology,0.571429,0.285714,-0.285714
college_chemistry,0.6,0.6,0.0
college_computer_science,0.2,0.2,0.0
college_mathematics,0.0,0.2,0.2
college_medicine,0.777778,0.333333,-0.444444


Do not extract first answer

In [41]:
simple_evaluate_path_2 = current_path + simple_folder_2 + "/test_dataset_45_671_671_string_answer_evaluated_answer_evaluated.csv"
tot_evaluate_path_2 = current_path + tot_folder_2 + "/test_dataset_45_671_671_string_answer_evaluated_answer_evaluated.csv"
create_report(simple_evaluate_path_2, tot_evaluate_path_2)

                                     answer_evaluation_simple  \
subject                                                         
abstract_algebra                                     0.500000   
anatomy                                              0.285714   
astronomy                                            0.625000   
business_ethics                                      0.800000   
clinical_knowledge                                   0.615385   
college_biology                                      0.571429   
college_chemistry                                    0.600000   
college_computer_science                             0.200000   
college_mathematics                                  0.000000   
college_medicine                                     0.777778   
college_physics                                      0.600000   
computer_security                                    0.833333   
conceptual_physics                                   0.583333   
econometrics             

subject,answer_evaluation_simple,answer_evaluation_tot,difference
abstract_algebra,0.5,0.333333,-0.166667
anatomy,0.285714,0.285714,0.0
astronomy,0.625,0.5,-0.125
business_ethics,0.8,0.6,-0.2
clinical_knowledge,0.615385,0.692308,0.076923
college_biology,0.571429,0.714286,0.142857
college_chemistry,0.6,0.6,0.0
college_computer_science,0.2,0.2,0.0
college_mathematics,0.0,0.4,0.4
college_medicine,0.777778,0.444444,-0.333333


## Experiment 3

In [42]:
simple_folder_3 = "/test-dataset-for-evaluation-mistral-simple"
tot_folder_3 = "/test-dataset-for-evaluation-mistral-tot-tuned"

Extract first answer

In [43]:
simple_evaluate_path_3 = current_path + simple_folder_3 + "/test_dataset_45_671_671_string_answer_evaluated_extracted_answer_evaluated.csv"
tot_evaluate_path_3 = current_path + tot_folder_3 + "/test_dataset_45_671_671_string_answer_evaluated_extracted_answer_evaluated.csv"
create_report(simple_evaluate_path_3, tot_evaluate_path_3)

                                     answer_evaluation_simple  \
subject                                                         
abstract_algebra                                     0.333333   
anatomy                                              0.285714   
astronomy                                            0.625000   
business_ethics                                      0.800000   
clinical_knowledge                                   0.615385   
college_biology                                      0.571429   
college_chemistry                                    0.600000   
college_computer_science                             0.200000   
college_mathematics                                  0.000000   
college_medicine                                     0.777778   
college_physics                                      0.600000   
computer_security                                    0.833333   
conceptual_physics                                   0.583333   
econometrics             

subject,answer_evaluation_simple,answer_evaluation_tot,difference
abstract_algebra,0.333333,0.166667,-0.166667
anatomy,0.285714,0.285714,0.0
astronomy,0.625,0.625,0.0
business_ethics,0.8,0.8,0.0
clinical_knowledge,0.615385,0.461538,-0.153846
college_biology,0.571429,0.714286,0.142857
college_chemistry,0.6,0.2,-0.4
college_computer_science,0.2,0.0,-0.2
college_mathematics,0.0,0.4,0.4
college_medicine,0.777778,0.666667,-0.111111


Do not extract first answer

In [44]:
simple_evaluate_path_3 = current_path + simple_folder_3 + "/test_dataset_45_671_671_string_answer_evaluated_answer_evaluated.csv"
tot_evaluate_path_3 = current_path + tot_folder_3 + "/test_dataset_45_671_671_string_answer_evaluated_answer_evaluated.csv"
create_report(simple_evaluate_path_3, tot_evaluate_path_3)

                                     answer_evaluation_simple  \
subject                                                         
abstract_algebra                                     0.500000   
anatomy                                              0.285714   
astronomy                                            0.625000   
business_ethics                                      0.800000   
clinical_knowledge                                   0.615385   
college_biology                                      0.571429   
college_chemistry                                    0.600000   
college_computer_science                             0.200000   
college_mathematics                                  0.000000   
college_medicine                                     0.777778   
college_physics                                      0.600000   
computer_security                                    0.833333   
conceptual_physics                                   0.583333   
econometrics             

subject,answer_evaluation_simple,answer_evaluation_tot,difference
abstract_algebra,0.5,0.166667,-0.333333
anatomy,0.285714,0.428571,0.142857
astronomy,0.625,0.625,0.0
business_ethics,0.8,0.8,0.0
clinical_knowledge,0.615385,0.538462,-0.076923
college_biology,0.571429,0.714286,0.142857
college_chemistry,0.6,0.2,-0.4
college_computer_science,0.2,0.0,-0.2
college_mathematics,0.0,0.4,0.4
college_medicine,0.777778,0.666667,-0.111111
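
To compare the experiments side by side, the `difference` columns of several reports could be collected into one table. The sketch below assumes each report has the shape returned by `create_report`, that is, a DataFrame indexed by subject with a `difference` column. `combine_differences` and `make_report` are hypothetical helpers, and the two small reports are made-up placeholders rather than the results above.

```python
import pandas as pd

def combine_differences(reports):
    # reports maps a label to a report DataFrame indexed by subject.
    columns = {label: report["difference"] for label, report in reports.items()}
    # Concatenating along the column axis aligns the Series on the subject index
    # and uses the dictionary keys as column names.
    return pd.concat(columns, axis=1)

def make_report(values):
    # Builds a placeholder report with made-up difference values.
    index = pd.Index(["subject_a", "subject_b"], name="subject")
    return pd.DataFrame({"difference": values}, index=index)

combined = combine_differences({
    "experiment_1": make_report([0.25, -0.5]),
    "experiment_2": make_report([0.0, 0.5]),
})
print(combined)
```

In the notebook itself, the values returned by the `create_report` calls could be passed in place of the placeholder reports.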
