# ToT Data ETL

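This notebook extracts the MMLU dataset (Hendrycks et al., 2021a; Hendrycks et al., 2021b; Hendrycks et al., 2023), transforms it (filtering records, adding an id, expanding the choices into separate columns, and building a lettered question and a letter answer), and saves the train, validation and test splits to CSV files.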

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a. Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 5 August 2024].

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021b. Measuring Massive Multitask Language Understanding. ICLR 2021, 4 May 2021, Vienna. Ithaca: Cornell University Library, arXiv.org, pp.1-27. Available from: https://arxiv.org/pdf/2009.03300.pdf [Accessed 5 August 2024].
 
Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D. and Steinhardt, J., 2023. Aligning AI With Shared Human Values. ICLR 2021, 4 May 2021, Vienna. Ithaca: Cornell University Library, arXiv.org, pp.1-29. Available from: https://arxiv.org/pdf/2008.02275.pdf [Accessed 5 August 2024]. 


## Extract

In [2]:
# Code, that is, the import, reused from: Hugging Face, n.d. Loading methods. datasets.load_dataset (V2.20.0) [Online]. 
# Available from: https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_dataset [Accessed 11 August 2024].
from datasets import load_dataset
#

# Code, that is, the loading, adapted from: Hugging Face, n.d. Loading methods. datasets.load_dataset (V2.20.0) [Online]. 
# Available from: https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_dataset [Accessed 11 August 2024].
# Code, that is, the values of the loading arguments reused from: Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a. 
# Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 5 August 2024].
full_dataset = load_dataset(path='cais/mmlu', data_dir='all')
#
print(full_dataset)

DatasetDict({
    train: Dataset({
        features: ['answer', 'choices', 'question', 'subject'],
        num_rows: 99842
    })
    validation: Dataset({
        features: ['answer', 'choices', 'question', 'subject'],
        num_rows: 1816
    })
    test: Dataset({
        features: ['answer', 'choices', 'question', 'subject'],
        num_rows: 14042
    })
})


## Transform

In [3]:
# Code, that is, the function and the filtering of the dataset, adapted from: Hugging Face, n.d. Process. Select and Filter (V2.21.0) [Online]. 
# Available from: https://huggingface.co/docs/datasets/en/process#select-and-filter [Accessed 15 August 2024].
def remove_records(record, index):
    # As a precaution, remove any record whose question or choices contain one of the fragments below, based on: Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a. 
    # Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 5 August 2024].
    bad_words = ["kill", "murder", "weapon", "criminal", "crime", "terro", "sex", "gun", "fight", "drug", "viole"]
    #
    # The test is a case-insensitive substring match, so "terro" matches both "terror" and "terrorism".
    # `index` is unused; it is accepted only because filter() is called with with_indices=True.
    question = record["question"].lower()
    choices = " ".join(record["choices"]).lower()
    # Keep the record (return True) only if no fragment occurs in the question or the choices.
    return not any(bad_word in question or bad_word in choices for bad_word in bad_words)

full_dataset = full_dataset.filter(remove_records, with_indices=True)
#
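
The filter above matches on substrings, so it also drops records in which a listed fragment appears inside an unrelated word. The following is a minimal sketch on plain dictionaries that shows this behaviour without downloading the dataset. The sample records and the helper name `contains_bad_word` are hypothetical and not part of the notebook.

```python
# Illustrative only: the records below are made up, and contains_bad_word is a
# hypothetical helper that mirrors the substring test used in remove_records.
BAD_WORDS = ["kill", "murder", "weapon", "criminal", "crime", "terro",
             "sex", "gun", "fight", "drug", "viole"]


def contains_bad_word(record):
    """Return True if any fragment occurs in the question or in any choice."""
    text = (record["question"] + " " + " ".join(record["choices"])).lower()
    return any(bad_word in text for bad_word in BAD_WORDS)


samples = [
    # No fragment present: kept.
    {"question": "What is 2 + 2?", "choices": ["3", "4", "5", "6"]},
    # "Essex" contains "sex": removed although the question is harmless.
    {"question": "In which English county is Colchester located?",
     "choices": ["Essex", "Kent", "Devon", "Cornwall"]},
    # "Violet" contains "viole": removed as well.
    {"question": "Which colour is produced by mixing red and blue?",
     "choices": ["Violet", "Green", "Orange", "Brown"]},
]

kept = [record for record in samples if not contains_bad_word(record)]
print(len(kept))  # 1
```

If such false positives matter, a word-boundary match (for example with the `re` module) would be one alternative, at the cost of no longer catching inflected forms through fragments such as "terro".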

In [4]:
# Code, that is, the function and the mapping of the dataset, adapted from: Hugging Face, n.d. Process. Map (V2.21.0) [Online]. 
# Available from: https://huggingface.co/docs/datasets/en/process#map [Accessed 15 August 2024].
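def add_id(record, index):
    # `index` is the position of the record within its split (supplied by with_indices=True); naming it `id` would shadow the Python built-in.
    record["id"] = index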
    return record

full_dataset = full_dataset.map(add_id, with_indices=True)
#

In [5]:
# Code, that is, the function and the mapping of the dataset, adapted from: Hugging Face, n.d. Main classes. map (V2.20.0) [Online]. 
# Available from: https://huggingface.co/docs/datasets/v2.20.0/en/package_reference/main_classes#datasets.Dataset.map [Accessed 15 August 2024].
def expand_choices(record):
    record["first_choice"] = record["choices"][0]
    record["second_choice"] = record["choices"][1]
    record["third_choice"] = record["choices"][2]
    record["fourth_choice"] = record["choices"][3]   
    return record

full_dataset = full_dataset.map(expand_choices)
#

In [6]:
# Code, that is, the function and the mapping of the dataset, adapted from: Hugging Face, n.d. Main classes. map (V2.20.0) [Online]. 
# Available from: https://huggingface.co/docs/datasets/v2.20.0/en/package_reference/main_classes#datasets.Dataset.map [Accessed 15 August 2024].
def enhance_question(record):
    # The logic of mapping the letters to the choices is according to: Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a. 
    # Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 5 August 2024].
    record["enhanced_question"] = record["question"] + "\nA. " + record["first_choice"] + "\nB. " + record["second_choice"] + "\nC. " + record["third_choice"] + "\nD. " + record["fourth_choice"]
    #
    return record

full_dataset = full_dataset.map(enhance_question)
#
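
The lettered layout built in `enhance_question` can also be written as a loop over letters and choices. This is a minimal sketch on plain Python values; `format_question` and its `letters` parameter are hypothetical names and not part of the notebook.

```python
# Illustrative only: format_question is a hypothetical stand-alone version of
# the string building done in enhance_question.
def format_question(question, choices, letters="ABCD"):
    """Join a question and its choices into 'question\\nA. ...\\nB. ...' form."""
    if len(choices) != len(letters):
        # zip() would silently truncate, so fail loudly on a length mismatch.
        raise ValueError(f"expected {len(letters)} choices, got {len(choices)}")
    lines = [question]
    for letter, choice in zip(letters, choices):
        lines.append(f"{letter}. {choice}")
    return "\n".join(lines)


example = format_question(
    "Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.",
    ["0", "4", "2", "6"],
)
print(example)
```

For this input the sketch prints the same five lines as the first test record shown under `In [14]`.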

In [7]:
# The logic of mapping the integer answers to the letter answers is according to: Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., 2021a. 
# Dataset Card for MMLU [Online]. s.l.: Hugging Face. Available from: https://huggingface.co/datasets/cais/mmlu [Accessed 5 August 2024].
def int_to_letter_answer(int_answer):
    if int_answer == 0:
        return "A"
    elif int_answer == 1:
        return "B"
    elif int_answer == 2:
        return "C"
    elif int_answer == 3:
        return "D"
    else:
        print("The integer answer is not in the range")
        return None
#

# Code, that is, the function and the mapping of the dataset, adapted from: Hugging Face, n.d. Main classes. map (V2.20.0) [Online]. 
# Available from: https://huggingface.co/docs/datasets/v2.20.0/en/package_reference/main_classes#datasets.Dataset.map [Accessed 15 August 2024].
def process_answer(record):
    letter_answer = int_to_letter_answer(record["answer"])
    record["letter_answer"] = letter_answer
    return record

full_dataset = full_dataset.map(process_answer)
#
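
The integer-to-letter mapping can be expressed as an index into the string `"ABCD"`. The sketch below is a hypothetical variant, not the notebook's function: it raises a `ValueError` for an out-of-range answer instead of printing a message and returning `None`, so a bad value cannot end up unnoticed in the `letter_answer` column.

```python
# Illustrative only: int_to_letter is a hypothetical alternative to
# int_to_letter_answer with stricter error handling.
LETTERS = "ABCD"


def int_to_letter(int_answer):
    """Map 0..3 to 'A'..'D'; raise ValueError for any other integer."""
    if not 0 <= int_answer < len(LETTERS):
        raise ValueError(
            f"answer index {int_answer} is outside 0..{len(LETTERS) - 1}"
        )
    return LETTERS[int_answer]


print([int_to_letter(i) for i in range(4)])  # ['A', 'B', 'C', 'D']
```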

In [8]:
def split_full_dataset(full_dataset):
    return full_dataset["train"], full_dataset["validation"], full_dataset["test"]

train_dataset, validation_dataset, test_dataset = split_full_dataset(full_dataset)

In [9]:
# Code, that is, the conversion to Pandas, adapted from: Hugging Face, n.d. Main classes. to_pandas (V2.21.0) [Online]. 
# Available from: https://huggingface.co/docs/datasets/v2.21.0/en/package_reference/main_classes#datasets.Dataset.to_pandas [Accessed 15 August 2024].
train_dataset_pd = train_dataset.to_pandas()
#
train_dataset_pd[1:4]

Unnamed: 0,answer,choices,question,subject,id,first_choice,second_choice,third_choice,fourth_choice,enhanced_question,letter_answer
1,2,[partial breach of contract only if Ames had p...,Ames had painted Bell's house under a contract...,,1,partial breach of contract only if Ames had pr...,partial breach of contract whether or not Ames...,total breach of contract only if Ames had prop...,total breach of contract whether or not Ames h...,Ames had painted Bell's house under a contract...,C
2,2,[succeed if he can prove that he had painted t...,Ames had painted Bell's house under a contract...,,2,succeed if he can prove that he had painted th...,"succeed, because he cashed the check under eco...","not succeed, because he cashed the check witho...","not succeed, because he is entitled to recover...",Ames had painted Bell's house under a contract...,C
3,0,"[succeed, because by cashing the check Ames im...",Ames had painted Bell's house under a contract...,,3,"succeed, because by cashing the check Ames imp...","succeed, because Ames accepted Bell's offer by...","not succeed, because Bell's letter of June 18 ...","not succeed, because there is no consideration...",Ames had painted Bell's house under a contract...,A


In [10]:
print(train_dataset_pd.loc[1, "enhanced_question"])

Ames had painted Bell's house under a contract which called for payment of $2,000. Bell, contending in good faith that the porch had not been painted properly, refused to pay anything. On June 15, Ames mailed a letter to Bell stating, "I am in serious need of money. Please send the $2,000 to me before July 1." On June 18, Bell replied, "I will settle for $1,800 provided that you agree to repaint the porch." Ames did not reply to this letter. Thereafter Bell mailed a check for $1,800 marked "Payment in full on the Ames-Bell painting contract as per letter dated June 18." Ames received the check on June 30. Because he was badly in need of money, check on June 30. Because he was badly in need of money, Questions Ames cashed the check without objection and spent the proceeds but has refused to repaint the porch.Bell's refusal to pay anything to Ames when he finished painting was a
A. partial breach of contract only if Ames had properly or substantially painted the porch.
B. partial breach 

In [11]:
# Code, that is, the conversion to Pandas, adapted from: Hugging Face, n.d. Main classes. to_pandas (V2.21.0) [Online]. 
# Available from: https://huggingface.co/docs/datasets/v2.21.0/en/package_reference/main_classes#datasets.Dataset.to_pandas [Accessed 15 August 2024].
validation_dataset_pd = validation_dataset.to_pandas()
#
validation_dataset_pd[200:203]

Unnamed: 0,answer,choices,question,subject,id,first_choice,second_choice,third_choice,fourth_choice,enhanced_question,letter_answer
200,3,[Two or more explanatory variables are perfect...,Near multicollinearity occurs when,econometrics,200,Two or more explanatory variables are perfectl...,The explanatory variables are highly correlate...,The explanatory variables are highly correlate...,Two or more explanatory variables are highly c...,Near multicollinearity occurs when\nA. Two or ...,D
201,0,"[It is the average return on Friday, It is the...",Consider the following time series model appli...,econometrics,201,It is the average return on Friday,It is the average return on Monday,It is the Friday deviation from the mean retur...,It is the Monday deviation from the mean retur...,Consider the following time series model appli...,A
202,0,"[(ii) and (iv) only, (i) and (iii) only, (i), ...",Which of the following statements are true con...,econometrics,202,(ii) and (iv) only,(i) and (iii) only,"(i), (ii), and (iii) only","(i), (ii), (iii), and (iv)",Which of the following statements are true con...,A


In [12]:
print(validation_dataset_pd.loc[200, "enhanced_question"])

Near multicollinearity occurs when
A. Two or more explanatory variables are perfectly correlated with one another
B. The explanatory variables are highly correlated with the error term
C. The explanatory variables are highly correlated with the dependent variable
D. Two or more explanatory variables are highly correlated with one another


In [13]:
# Code, that is, the conversion to Pandas, adapted from: Hugging Face, n.d. Main classes. to_pandas (V2.21.0) [Online]. 
# Available from: https://huggingface.co/docs/datasets/v2.21.0/en/package_reference/main_classes#datasets.Dataset.to_pandas [Accessed 15 August 2024].
test_dataset_pd = test_dataset.to_pandas()
#
test_dataset_pd[0:3]

Unnamed: 0,answer,choices,question,subject,id,first_choice,second_choice,third_choice,fourth_choice,enhanced_question,letter_answer
0,1,"[0, 4, 2, 6]",Find the degree for the given field extension ...,abstract_algebra,0,0,4,2,6,Find the degree for the given field extension ...,B
1,2,"[8, 2, 24, 120]","Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i...",abstract_algebra,1,8,2,24,120,"Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i...",C
2,3,"[0, 1, 0,1, 0,4]",Find all zeros in the indicated finite field o...,abstract_algebra,2,0,1,"0,1","0,4",Find all zeros in the indicated finite field o...,D


In [14]:
print(test_dataset_pd.loc[0, "enhanced_question"])

Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.
A. 0
B. 4
C. 2
D. 6


## Load

In [15]:
full_train_dataset_path_csv = "full_train_dataset.csv"
full_validation_dataset_path_csv = "full_validation_dataset.csv"
full_test_dataset_path_csv = "full_test_dataset.csv"

# Code, that is, the saving of the dataset, adapted from: pandas, 2024. pandas.DataFrame.to_csv (v.2.2) [Online]. 
# Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html#pandas.DataFrame.to_csv [Accessed 15 August 2024].
train_dataset_pd.to_csv(full_train_dataset_path_csv, index=False)
#
# Code, that is, the saving of the dataset, adapted from: pandas, 2024. pandas.DataFrame.to_csv (v.2.2) [Online]. 
# Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html#pandas.DataFrame.to_csv [Accessed 15 August 2024].
validation_dataset_pd.to_csv(full_validation_dataset_path_csv, index=False)
#
# Code, that is, the saving of the dataset, adapted from: pandas, 2024. pandas.DataFrame.to_csv (v.2.2) [Online]. 
# Available from: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html#pandas.DataFrame.to_csv [Accessed 15 August 2024].
test_dataset_pd.to_csv(full_test_dataset_path_csv, index=False)
#
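
Because `enhanced_question` contains newline characters, each saved record can span several physical lines in the CSV file. The following is a minimal sketch of that round trip using Python's standard `csv` module as a stand-in for pandas (an assumption made so that the example runs without the dataset); like the default in `to_csv`, the `csv` writer quotes fields that contain newlines. The sample row is shortened and illustrative.

```python
import csv
import io

# Illustrative only: a shortened row with the same kind of multi-line field
# that enhance_question produces.
row = {
    "id": 0,
    "enhanced_question": "Find the degree ...\nA. 0\nB. 4\nC. 2\nD. 6",
    "letter_answer": "B",
}

# Write the row to an in-memory buffer instead of a file on disk.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# Read it back: the quoted multi-line field is restored as a single value.
buffer.seek(0)
restored = next(csv.DictReader(buffer))

print(restored["enhanced_question"] == row["enhanced_question"])  # True
print(repr(restored["id"]))  # '0' (the csv module returns every value as a string)
```

A consequence is that line-oriented tools (for example counting lines to count records) will not give the number of records for these files; a CSV-aware reader is needed.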