# (ADA) Homework 1: Scoring the Language Model Olympics

---

By the end of this homework, we expect you to be able to:

- Load data and handle data using pandas;
- Navigate the documentation of Python packages by yourself;
- Filter and tidy up noisy real-world datasets;
- Aggregate your data in different (and hopefully helpful) ways;
- Create meaningful visualizations to analyze the data;
- Communicate your findings in a clear and concise manner.

---

**Important Dates.**

- Homework release: Fri 04 Oct 2024
- Homework due: Sat 18 Oct 2024, 23:59
- Grade release: Mon 04 Nov 2024

**Some rules**

- You are allowed to use any built-in Python library that comes with Anaconda. If you want to use an external library, you may do so, but must justify your choice.
- Make sure you use the data folder provided in the repository in read-only mode. (Or alternatively, be sure you don’t change any of the files.)
- Be sure to provide a concise textual description of your thought process, the assumptions you made, the solution you implemented, and explanations for your answers. A notebook that only has code cells will not suffice. To avoid confusion: use short comments for longer code answers.
- For questions containing the /Discuss:/ prefix, answer not with code, but with a textual explanation (in markdown).
- Back up any hypotheses and claims with data, since this is an important aspect of the course.
- Please write all your comments in English, and use meaningful variable names in your code. Your repo should have a single notebook (plus the required data files) in the master/main branch. If there are multiple notebooks present, we will not grade anything.
- We will not run your notebook for you! Rather, we will grade it as is, which means that only the results contained in your evaluated code cells will be considered, and we will not see the results in unevaluated code cells. Thus, be sure to hand in a fully-run and evaluated notebook. In order to check whether everything looks as intended, you can check the rendered notebook on the GitHub website once you have pushed your solution there.
- In continuation to the previous point, interactive plots, such as those generated using the ‘plotly’ package, should be strictly avoided! Make sure to print results and/or dataframes that confirm you have properly addressed the task.

**A Note on using Language Models (LMs)**

If you try hard enough, you will likely get away with cheating. Fortunately, our job is not to police, but rather to educate! So, please consider the following:
- Presumably, you are taking this course to learn something! LMs are not always right ([they often fail in silly ways](https://community.openai.com/t/why-9-11-is-larger-than-9-9-incredible/869824/4)). This course should prepare you to detect when they are wrong!
- Some of the TAs on this course literally published many works on detecting machine-generated text.
---

## Context

AI is booming! Newspapers, influencers, and your relatives all agree that AI is important. But while almost everyone agrees that AI is the future, much is unclear about what that future looks like…

Freshly graduated from the EPFL, you are hired by the Swiss government to advise on a large-scale “AI integration” initiative code-named **"NEUTRALITY"** (Navigating Efficient Upgrades Through Robust Artificial Learning Integration Techniques Yearly). Convinced by the stunning progress in language modeling, the government would like to battle the growing shortages in the education sector by using LMs. Your job description: investigate which LMs might be best suited!

You are given the results of three LMs on the [“Massive Multitask Language Understanding (MMLU)”](https://arxiv.org/abs/2009.03300) dataset to compare. This famous dataset consists of 57 subjects with multiple-choice questions, covering diverse subjects like mathematics, computer science, history, and law. Most providers of state-of-the-art LMs use this dataset to showcase the versatility of their latest models. Unfortunately, Horta-Ribeiro, the intern responsible for collecting the results, didn’t take EPFL’s famous ADA course. As a result, the collected datasets are slightly corrupted.

### A very brief primer on Language Models
Language models (LMs) are sophisticated statistical models designed to understand and generate human-like text. At their core, LMs are trained to predict the most likely continuation of a given input text. For example, given the input "The cat sat on the," an LM might predict "mat" as a likely continuation.
LMs are trained on vast text samples from various sources, including books, websites, and social media. This extensive training allows them to capture patterns and relationships in language, enabling them to generate coherent and contextually appropriate text across a wide range of topics and styles.

While LMs can produce text that appears to be written by intelligent humans, it's important to note that their capabilities can diverge from human intelligence in unexpected ways. They may sometimes generate factually incorrect information or struggle with complex reasoning tasks.

Two key concepts in understanding LMs are:
1. **Tokens**: LMs process text using "tokens" rather than individual characters. Tokens can be words, parts of words, or punctuation marks. For example, the sentence "I love AI!" might be tokenized as ["I", "love", "AI", "!"]. Tokenization is the first step in both training and using an LM.
2. **Context**: The input text provided to an LM is called the "context." This context informs the model's predictions or generations. A longer or more specific context often leads to more accurate and relevant outputs.

[See: Wikipedia entry on language models](https://en.wikipedia.org/wiki/Large_language_model)
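
To make the idea of tokens concrete, here is a minimal sketch of a naive tokenizer that reproduces the "I love AI!" example above. It is only an illustration: the helper `naive_tokenize` is hypothetical, and real LMs typically rely on learned subword vocabularies rather than a regular expression.

```python
import re


def naive_tokenize(text: str) -> list[str]:
    """Split text into word and punctuation tokens (illustrative only)."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)


tokens = naive_tokenize("I love AI!")
print(tokens)  # ['I', 'love', 'AI', '!']
```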

###  Files for this assignment
This assignment is divided into three tasks, each of which should bring you a step closer to providing a recommendation toward project NEUTRALITY’s objectives:

- **Task 1**: Inspecting the results and getting your first model ranking
- **Task 2**: Inspecting the underlying data used to generate the results for possible biases
- **Task 3**: Learning about tokens and providing a final recommendation


```
📁 PROJECT_NEUTRALITY
│
├── 📄 analysis.ipynb (the file you're currently reading!)
├── 📄 requirements.txt (install into your environment)
│
├── 📁 task_1
├── 📁 task_2
└── 📁 task_2.5
```   
 

In [136]:
# please make sure you install the packages listed in the requirements.txt file in your environment!
# using pip
# pip install -r requirements.txt
#
# using Conda:
# conda create --name <env_name> --file requirements.txt
#
# some basic imports
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
from scipy.stats import ttest_ind

# seed for reproducible random sampling
SEED = 42

## Task 1 (18 points): What's in an average anyway?

The files needed to complete task 1 can be found in the folder `data/task_1/`:
```
task_1/
│
├── mmlu_data/
│   └── test.csv
│
└── lm_scores/
    ├── lm_X.csv
    ├── lm_Y.csv
    └── lm_Z.csv
```

We will start by loading, (manually) inspecting, and cleaning the data. Although it doesn't seem "glamorous" (nor is it particularly fun...) - manually inspecting data is extremely important! In fact, it's one of the few things most AI and Data Science researchers agree on :). Next, we will take a first pass on ordering our Olympic podium between three LMs.

### 1.1 (1 pt)
 
Load the subfiles contained in the `mmlu_data` and `lm_scores` folders into separate dataframes:
- `df_test`
- `df_x`
- `df_y`
- `df_z`

For each, print its size.

In [None]:
# Load the single datasets
df_test = pd.read_csv("data/task_1/mmlu_data/test.csv")
df_x = pd.read_csv("data/task_1/lm_scores/lm_X.csv")
df_y = pd.read_csv("data/task_1/lm_scores/lm_Y.csv")
df_z = pd.read_csv("data/task_1/lm_scores/lm_Z.csv")

# Print the size of each dataset
print(f"Size of df_test: {df_test.shape[0]} rows, {df_test.shape[1]} columns")
print(f"Size of df_x: {df_x.shape[0]} rows, {df_x.shape[1]} columns")
print(f"Size of df_y: {df_y.shape[0]} rows, {df_y.shape[1]} columns")
print(f"Size of df_z: {df_z.shape[0]} rows, {df_z.shape[1]} columns")



### 1.2 (4 pt)
Unfortunately, LMs don't always output the format we want. In the column `result`, the value should be one of A, B, C, or D. 

A. For each of the LM score dataframes, use a `value_counts()` operation and print the results. 

B. /Discuss:/ Inspect the results and describe the types of answer formats you see. Besides the "expected" case, you should be able to find at least four unexpected formats.

In [None]:
# A
# Showing the most frequent answers given by the 3 models
# Because there are many possible answers, only the 20 most frequent are shown
print("LLM x:")
print(df_x["result"].value_counts()[:20])


In [None]:
print("LLM y:")
print(df_y["result"].value_counts()[:20])

In [None]:
print("LLM z:")
print(df_z["result"].value_counts()[:20])

**Answer: B**

In addition to the "expected" cases and very specific free-text formats (e.g., "creating insurmountable obstacles to the founding of factions, so the answer is A"), there are at least four other types of recurring unexpected formats in the dataset:

1. **Lettered Answers**: Responses formatted as "Answer: letter", where the letter is one of ["A", "B", "C", "D"].

2. **"None of the Above"**: The model states that none of the options applies, so the answer is not one of the possibilities ["A", "B", "C", "D"].

3. **"Not Sure"**: The model does not commit to any option, so again the answer is not one of the possibilities ["A", "B", "C", "D"].

4. **Letters with Trailing Spaces**: Even when the response appears to be a single letter, variations due to trailing spaces can create discrepancies. For example, the string "A" is distinct from "A " (with a space).
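
One way to check such a categorization systematically is to label each raw answer with the format it matches and then count the labels. The sketch below does this on a made-up toy Series; the strings, the helper `answer_format`, and the category names are assumptions for illustration, not values taken from the real score files.

```python
import pandas as pd

VALID_LETTERS = {"A", "B", "C", "D"}

# Toy stand-in for a `result` column (made-up values)
toy_results = pd.Series(
    ["A", "B ", "Answer: C", "None of the above", "Not Sure", "so the answer is D", None]
)


def answer_format(value) -> str:
    """Assign a coarse format label to one raw answer."""
    if pd.isna(value):
        return "missing"
    text = str(value)
    if text in VALID_LETTERS:
        return "expected"
    if text.strip() in VALID_LETTERS:
        return "letter with extra whitespace"
    if text.startswith("Answer:"):
        return "Answer: <letter>"
    if text.strip().lower() in {"none of the above", "not sure"}:
        return "no option chosen"
    return "other free text"


format_labels = toy_results.apply(answer_format)
print(format_labels.value_counts())
```

Applied to a real `result` column, the size of the "other free text" bucket indicates how much is left uncategorized.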


### 1.3 (5 pt)
Oh oh... That doesn't look great. Simply dropping all invalid answers seems overly wasteful, yet fixing all of these looks like a mess! Instead, let's focus for now on fixing just those answers of length < 10 characters that require only a single `str.replace()` operation. 

For example, if the answer looks like `--A--`, we could fix this by using the following simple function:

```
def clean_answer(s, pattern='-'):
    return str(s).replace(pattern, '')

dirty_answer = '--A--'
cleaned_answer = clean_answer(dirty_answer)  # separate name, so the function is not overwritten
```

A. Filter the three score dataframes to include only answers with less than 10 characters. Make a deep copy of the dataframes as you filter them.

B. Modify the `clean_answer()` example function to clean the answers in the filtered data frames using the `apply()` functionality. Finally, make sure **all remaining answers are one of `A, B, C, or D`.**

C. /Discuss:/ Compare the sizes of the original and filtered data frames. What do you see? Why might this be a problem?

In [141]:
# A
max_length = 10  # answers with this many characters or more are discarded

# keep only rows where the value of the "result" column has fewer than max_length characters
df_x_preprocessed = df_x[df_x["result"].str.len() < max_length].copy()
df_y_preprocessed = df_y[df_y["result"].str.len() < max_length].copy()
df_z_preprocessed = df_z[df_z["result"].str.len() < max_length].copy()

In [142]:
# B
# Define function to remove specific substrings
def clean_answer(s, patterns=(' ',)):  # tuple default avoids a shared mutable default argument
    s = str(s)
    for pat in patterns:
        s = s.replace(pat, '')
    return s

# Remove spaces and the "Answer:" prefix from each value of the "result" column
df_x_preprocessed["result"] = df_x_preprocessed["result"].apply(lambda x: clean_answer(x, [' ', "Answer:"]))
df_y_preprocessed["result"] = df_y_preprocessed["result"].apply(lambda x: clean_answer(x, [' ', "Answer:"]))
df_z_preprocessed["result"] = df_z_preprocessed["result"].apply(lambda x: clean_answer(x, [' ', "Answer:"]))

# Remove rows whose cleaned value is not one of the accepted answers
accepted_answers = ["A", "B", "C", "D"]
df_x_preprocessed.drop(df_x_preprocessed[~df_x_preprocessed["result"].isin(accepted_answers)].index, axis=0, inplace=True)
df_y_preprocessed.drop(df_y_preprocessed[~df_y_preprocessed["result"].isin(accepted_answers)].index, axis=0, inplace=True)
df_z_preprocessed.drop(df_z_preprocessed[~df_z_preprocessed["result"].isin(accepted_answers)].index, axis=0, inplace=True)


# Reset the index of the DataFrame
df_x_preprocessed.reset_index(drop=True, inplace=True)
df_y_preprocessed.reset_index(drop=True, inplace=True)
df_z_preprocessed.reset_index(drop=True, inplace=True)


# Just for verification: check that all the answers are in the form ["A", "B", "C", "D"]
assert(df_x_preprocessed["result"].isin(accepted_answers).all())
assert(df_y_preprocessed["result"].isin(accepted_answers).all())
assert(df_z_preprocessed["result"].isin(accepted_answers).all())

**C. /Discuss:/**

As shown in the cell below, only a small share of the rows is removed by the filtering and cleaning (only model Z loses almost 10%).
Thus, we hypothesize that no major drawbacks arise from these preprocessing choices, provided the removed rows are not concentrated in particular subjects.

In [None]:
# Print the number of rows before filtering and after filtering + cleaning, for each model
for name, df_before, df_after in [("x", df_x, df_x_preprocessed),
                                  ("y", df_y, df_y_preprocessed),
                                  ("z", df_z, df_z_preprocessed)]:
    n_before, n_after = df_before.shape[0], df_after.shape[0]
    loss = (n_before - n_after) / n_before * 100
    print(f"LLM {name}\n\tNumber of rows (all answers -> cleaned answers in A-D):\n\t\tBefore: {n_before}\n\t\tAfter: {n_after}\n\t\tLoss Percentage: {loss:.2f}%\n")
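
The three steps of this subsection (length filter, string cleaning, validity check) can be traced on a tiny example. This is a minimal sketch on made-up data; the frame `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Made-up answers covering several of the formats discussed above
toy = pd.DataFrame({
    "question_id": [1, 2, 3, 4, 5],
    "result": ["A", "B ", "Answer: C", "Not Sure", "A long explanation, so the answer is D"],
})
valid_answers = ["A", "B", "C", "D"]

# Step 1: keep only answers shorter than 10 characters
short = toy[toy["result"].str.len() < 10].copy()

# Step 2: strip the "Answer:" prefix and all spaces
short["result"] = (
    short["result"]
    .str.replace("Answer:", "", regex=False)
    .str.replace(" ", "", regex=False)
)

# Step 3: keep only rows whose cleaned answer is a valid letter
cleaned = short[short["result"].isin(valid_answers)].reset_index(drop=True)

print(f"kept {len(cleaned)} of {len(toy)} rows ({len(cleaned) / len(toy):.0%})")
```

In this toy case the long free-text answer is lost even though it contains a recoverable letter, which is the kind of loss part C asks about.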


### 1.4 (3 pt)

Now that our answer columns are nicely formatted, let's take a look at model performance:

A. Both the `MMLU` dataframes and the language model score data frames have the columns `question_id`. For each of the language model score data frames, use an inner join operation with the `df_test` dataframe on the `question_id` column.

B. Add a new column to each of the resulting dataframes called `correct`, that checks if the model's answer in `result` is the same as the expected answer in the column `answer`. Then, print the average score of each model.

In [None]:
# A
# Inner join to add information of each question
df_x_preprocessed_join = pd.merge(
    left=df_x_preprocessed, 
    right=df_test,
    left_on="question_id",
    right_on="question_id",
    how="inner",  # only keep the questions that are in both dataframes
    suffixes=["",""] #as all cols have different names, do not put suffixes
    ).copy()
df_y_preprocessed_join = pd.merge(
    left=df_y_preprocessed,
    right=df_test,
    left_on="question_id",
    right_on="question_id",
    how="inner",  # only take the ones that are in both df
    suffixes=["", ""]  # as all cols have different names, do not put suffixes
    ).copy()
df_z_preprocessed_join = pd.merge(
    left=df_z_preprocessed,
    right=df_test,
    left_on="question_id",
    right_on="question_id",
    how="inner",  # only take the ones that are in both df
    suffixes=["", ""]  # as all cols have different names, do not put suffixes
).copy()

# Little check
display(df_x_preprocessed_join)


In [None]:
# B

# Create new column to say if the model has given the right answer
df_x_preprocessed_join["correct"] = df_x_preprocessed_join["result"] == df_x_preprocessed_join["answer"]
df_y_preprocessed_join["correct"] = df_y_preprocessed_join["result"] == df_y_preprocessed_join["answer"]
df_z_preprocessed_join["correct"] = df_z_preprocessed_join["result"] == df_z_preprocessed_join["answer"]


# Little Check:
display(df_x_preprocessed_join.head(3))
display(df_y_preprocessed_join.head(3))
display(df_z_preprocessed_join.head(3))

In [None]:
# Print average accuracy for each model
average_correct_x = 100 * df_x_preprocessed_join['correct'].sum() / df_x_preprocessed_join.shape[0]
print(f"Average correct answers for model x is: {average_correct_x:.2f}%")

average_correct_y = 100 * df_y_preprocessed_join['correct'].sum() / df_y_preprocessed_join.shape[0]
print(f"Average correct answers for model y is: {average_correct_y:.2f}%")

average_correct_z = 100 * df_z_preprocessed_join['correct'].sum() / df_z_preprocessed_join.shape[0]
print(f"Average correct answers for model z is: {average_correct_z:.2f}%")
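
As a compact illustration of the join-then-score pattern, the sketch below merges two made-up frames on `question_id` and scores the result. The toy data is an assumption; `validate="one_to_one"` is an optional pandas check that raises an error if a `question_id` is duplicated on either side, and the mean of a boolean column equals the fraction of `True` values.

```python
import pandas as pd

# Made-up model answers and reference answers
toy_scores = pd.DataFrame({"question_id": [1, 2, 3], "result": ["A", "C", "B"]})
toy_test = pd.DataFrame({
    "question_id": [1, 2, 3, 4],
    "answer": ["A", "B", "B", "D"],
    "subject": ["law", "law", "math", "math"],
})

# Inner join: question 4 has no model answer and is dropped
joined = toy_scores.merge(toy_test, on="question_id", how="inner", validate="one_to_one")

# Boolean column; its mean is the accuracy
joined["correct"] = joined["result"] == joined["answer"]
accuracy = joined["correct"].mean()
print(f"Accuracy: {accuracy:.2%}")
```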

### 1.5 (5 pt)

Hmmm, something doesn't seem quite right. Let's investigate how "balanced" this dataset is:

A. For each of the 57 subjects in the MMLU, compare the number of questions answered by each model. Print the subjects for which there is a more than 10% difference.

B. Propose and implement a reasonable way to rebalance the results. (e.g., while throwing away 100% of the results perfectly rebalances the results, it is not reasonable).

C. Finally, print the updated accuracy on the rebalanced data.

**hint:**
- (A) For a given subject, let model X and model Y have answered 181 and 200 questions respectively. You can consider this a 10% difference from the perspective of X since: (200 - 181) / 181 > 0.10
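
The arithmetic in the hint depends on whose perspective is taken, which the following sketch makes explicit. The helper `relative_difference` is a hypothetical name introduced here for illustration.

```python
def relative_difference(own_count: int, other_count: int) -> float:
    """Relative difference of `other_count` from the perspective of `own_count`."""
    return (other_count - own_count) / own_count


from_x = relative_difference(181, 200)  # 19 / 181, about 0.105
from_y = relative_difference(200, 181)  # -19 / 200 = -0.095

print(f"From X's perspective: {from_x:.3f}")
print(f"From Y's perspective: {from_y:.3f}")
```

The same pair of counts exceeds the 10% threshold from X's perspective but not from Y's, so a check that uses only one perspective per pair can miss a subject.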

In [None]:
# A

# Count how many questions for each category
counts_cat_x = df_x_preprocessed_join.groupby("subject")["subject"].count()
counts_cat_y = df_y_preprocessed_join.groupby("subject")["subject"].count()
counts_cat_z = df_z_preprocessed_join.groupby("subject")["subject"].count()

# Create a new df with one column of counts per model
cat_df = pd.DataFrame({"x": counts_cat_x, "y": counts_cat_y, "z": counts_cat_z})
# Add columns with the pairwise relative differences
# ATTENTION: the first model is the baseline
# "diff_XY": difference from the perspective of X with respect to Y
cat_df["diff_XY"] = abs((cat_df["y"] - cat_df["x"]) / cat_df["x"])
cat_df["diff_YZ"] = abs((cat_df["z"] - cat_df["y"]) / cat_df["y"])
cat_df["diff_ZX"] = abs((cat_df["x"] - cat_df["z"]) / cat_df["z"])
# Largest relative difference over all pairs and both perspectives: (max - min) / min
counts_only = cat_df[["x", "y", "z"]]
cat_df["max_rel_diff"] = (counts_only.max(axis=1) - counts_only.min(axis=1)) / counts_only.min(axis=1)

# Keep the subjects whose largest relative difference exceeds max_cat_difference
max_cat_difference = 0.1
unbalanced_cat_df = cat_df[cat_df["max_rel_diff"] > max_cat_difference]
unbalanced_cat = unbalanced_cat_df.index.to_list()  # the index holds the subject names

print("First rows of comparisons:")
display(cat_df.head(3))
print("The dataframe that contains subjects that have more than 10% difference:")
display(unbalanced_cat_df)

print(f"Subjects that have more than {max_cat_difference:.0%} difference:")
for cat in unbalanced_cat:
    print(f"\t{cat}")

In [None]:
# B

# HOW TO REBALANCE?
# Subsample the unbalanced subjects so that every model has the same number of questions per subject.
# The number chosen is the smallest of the three models' counts.

def rebalance_df(df, col_name, cat, number_samples, seed):
    filtered_df = df[df[col_name] == cat]
    sampled_df = filtered_df.sample(n=number_samples, random_state=seed)  # Set random_state for reproducibility
    other_cats_df = df[df[col_name] != cat]  # rows of all other subjects are kept unchanged
    final_df = pd.concat([sampled_df, other_cats_df], ignore_index=True)
    return final_df


print("Unbalanced Subjects:")
for index, row in unbalanced_cat_df.iterrows():
    min_count = int(row[["x", "y", "z"]].min())
    print(f"\t{index}")
    df_x_preprocessed_join = rebalance_df(df=df_x_preprocessed_join, col_name="subject", cat=index, number_samples=min_count, seed=SEED)
    df_y_preprocessed_join = rebalance_df(df=df_y_preprocessed_join, col_name="subject", cat=index, number_samples=min_count, seed=SEED)
    df_z_preprocessed_join = rebalance_df(df=df_z_preprocessed_join, col_name="subject", cat=index, number_samples=min_count, seed=SEED)

df_x_bal = df_x_preprocessed_join.copy()
df_y_bal = df_y_preprocessed_join.copy()
df_z_bal = df_z_preprocessed_join.copy()


In [None]:
# C
# Print average accuracy for each model
average_correct_x = 100 * df_x_bal['correct'].sum() / df_x_bal.shape[0]
print(f"Average correct answers for model x is: {average_correct_x:.2f}%")

average_correct_y = 100 * df_y_bal['correct'].sum() / df_y_bal.shape[0]
print(f"Average correct answers for model y is: {average_correct_y:.2f}%")

average_correct_z = 100 * df_z_bal['correct'].sum() / df_z_bal.shape[0]
print(f"Average correct answers for model z is: {average_correct_z:.2f}%")
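
The downsampling idea can also be written without reassigning the dataframes inside a loop. The following is a minimal sketch on made-up data, not the method used above: the helper `downsample_to_targets` and the toy frames are assumptions, and each subject is reduced to the smallest count observed across the models.

```python
import pandas as pd


def downsample_to_targets(df: pd.DataFrame, targets: dict, seed: int = 42) -> pd.DataFrame:
    """Sample at most targets[subject] rows from each subject group."""
    parts = [
        group.sample(n=min(len(group), targets.get(subject, len(group))), random_state=seed)
        for subject, group in df.groupby("subject")
    ]
    return pd.concat(parts, ignore_index=True)


# Made-up results of two models with different numbers of answered questions
toy_x = pd.DataFrame({"subject": ["law"] * 5 + ["math"] * 3, "correct": [True] * 4 + [False] * 4})
toy_y = pd.DataFrame({"subject": ["law"] * 3 + ["math"] * 3, "correct": [True, False, True, False, True, True]})

# Per-subject target: the smallest count across the models
counts = pd.concat([toy_x["subject"].value_counts(), toy_y["subject"].value_counts()], axis=1)
targets = counts.min(axis=1).astype(int).to_dict()

balanced_x = downsample_to_targets(toy_x, targets)
balanced_y = downsample_to_targets(toy_y, targets)
print(balanced_x["subject"].value_counts().to_dict())
print(balanced_y["subject"].value_counts().to_dict())
```

Note that sampling each model independently equalizes the counts but does not guarantee that the models are compared on the same questions.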

## Task 2 (26 points): What do you mean A > D > B > C...?

Nice work! Having successfully inspected, cleaned, and rebalanced the provided data, you head over to the director of the government's NEUTRALITY project. Ms. Sakota is happy with your work so far, but worried that the sloppy intern might have done more undetected damage. To be sure, she orders a new set of evaluations of all models on both MMLU and another dataset.

After cleaning up and rebalancing, you are left with the concatenated score files in the second folder `task_2`:
```
task_2/
│
├── lm_scores_mmlu.csv
└── lm_scores_other.csv
```

Each has a new column called `model_name`, which is one of `X, Y` or `Z`.



_NOTE: **only** use data from `task_2` and `task_2_5` for this assignment! The values in `lm_scores_mmlu.csv` will NOT be the same as the dataframes you finished in task 1. This is due to "randomness" or "temperature" in language model inference. This can slightly shift around generative results. (Conveniently: it also ensures any mistakes made in Task 1 don't propagate further ;) )_

In [150]:
# PROVIDED CODE
df_mmlu = pd.read_csv('data/task_2/lm_scores_mmlu.csv')
df_other = pd.read_csv('data/task_2/lm_scores_other.csv')

In [None]:
df_mmlu

### 2.1 (4 pt)

Let's explore the new results:

A. Compute the mean accuracy and standard errors of each model on both datasets and print the results.

B. Then, show your results in a bar plot using standard errors with a 95% confidence interval around the mean. Make sure the plot is easy to read and well annotated.

C. /Discuss:/ the plot you created: (i) can you say that one of the models is the best? (ii) is there anything that seems odd?

In [None]:
# A
df_accuracy = pd.DataFrame()

for model in df_mmlu["model_name"].unique():
    # MMLU Dataset
    df_filtered = df_mmlu[df_mmlu["model_name"] == model]
    df_temp = pd.DataFrame({
        "accuracy": [df_filtered["correct"].sum()/df_filtered.shape[0]],
        "sem": [df_filtered["correct"].sem()],
        "model": [model],
        "dataset": ["mmlu"]
    })
    df_accuracy = pd.concat([df_accuracy, df_temp], ignore_index=True)

    # Other Dataset
    df_filtered = df_other[df_other["model_name"] == model]
    df_temp = pd.DataFrame({
        "accuracy": [df_filtered["correct"].sum()/df_filtered.shape[0]],
        "sem": [df_filtered["correct"].sem()],
        "model": [model],
        "dataset": ["other"]
    })
    df_accuracy = pd.concat([df_accuracy, df_temp], ignore_index=True)

# Calculate 95% confidence intervals (the height of the error bar in the next plot)
df_accuracy["yerr"] = 1.96 * df_accuracy["sem"]


display(df_accuracy)
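
For reference, the standard error used above can be reproduced by hand. The sketch below uses a made-up 0/1 correctness vector and assumes the usual normal approximation, under which a 95% confidence interval is the mean plus or minus 1.96 standard errors; pandas' `sem()` uses the sample standard deviation (`ddof=1`) divided by the square root of the sample size.

```python
import numpy as np
import pandas as pd

# Made-up correctness values: 70 correct and 30 incorrect answers
toy_correct = pd.Series([1.0] * 70 + [0.0] * 30)

n = len(toy_correct)
mean_accuracy = toy_correct.mean()

# Standard error of the mean, computed manually and via pandas
sem_manual = toy_correct.std(ddof=1) / np.sqrt(n)
sem_pandas = toy_correct.sem()

# Half-width of the 95% confidence interval under the normal approximation
half_width = 1.96 * sem_pandas
print(f"accuracy = {mean_accuracy:.3f} +/- {half_width:.3f}")
```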

In [None]:
# B
fig, axs = plt.subplots(1,2,figsize=(15,5), sharey=True)

#mmlu dataset
mmlu_data = df_accuracy[df_accuracy["dataset"] == "mmlu"]
axs[0].bar(mmlu_data["model"], mmlu_data["accuracy"], yerr=mmlu_data["yerr"], capsize=5, color='skyblue')
axs[0].set_title('Mean Accuracy of Models - MMLU Dataset')
axs[0].set_ylabel('Mean Accuracy')
axs[0].set_xlabel('Model Name')
axs[0].set_xticks(mmlu_data["model"])
axs[0].set_xticklabels(mmlu_data["model"])
axs[0].grid(axis='y')

# Bar plot for df_other
other_data = df_accuracy[df_accuracy["dataset"] == "other"]
axs[1].bar(other_data["model"], other_data["accuracy"], yerr=other_data["yerr"], capsize=5, color='lightgreen')
axs[1].set_title('Mean Accuracy of Models - Other Dataset')
axs[1].set_xlabel('Model Name')
axs[1].set_xticks(other_data["model"])
axs[1].set_xticklabels(other_data["model"])
axs[1].grid(axis='y')

plt.tight_layout()
plt.show()

**C. /Discuss:/**

1) Simply looking at the charts, model X seems to be performing better than the other two when generalizing to other datasets, with model Z doing the worst by quite a margin. However, this is not enough to conclude anything about the quality of the models or which one is really the best.
2) One odd observation is that the models seem to perform quite differently from one dataset to the next, with a difference in accuracy of about 5% between the two datasets for models X and Y. We can also notice that the standard errors are bigger in the "Other Dataset" than in the "MMLU Dataset", although the confidence intervals seem to be very small overall.

### 2.2 (5 pt)

Ms. Sakota has assured you that both datasets contain questions of similar difficulty, so, what could be going on here?

A. What is the distribution of correct answers (A, B, C, D) for each dataset? Create a bar chart to visualize this.

B. Perform a chi-square test of independence at $\alpha = 0.05$ to determine if there's a significant difference in the distribution of correct answers between the two datasets. What do you conclude?

**hints**:
- for (A), keep in mind that df_mmlu and df_other contain the results of all models, i.e., the `question_id` column is duplicated.
- for (A), take care to clearly annotate the bar chart, e.g., title, y-label, legend.
- for (B), clearly state the null hypothesis and alternative hypothesis
- use the `chi2_contingency` function from `scipy.stats`
- format your results from answer (A) as a 2D array

In [None]:
# A
fig, axs = plt.subplots(1,2,figsize=(15,5), sharey = True)

# Count the correct answers by letter, counting each question only once
# (every question_id appears once per model, see the hint above)
df_counts_mmlu = df_mmlu.drop_duplicates("question_id").groupby("answer").agg(
    count=('answer', 'size'),
).reset_index()
df_counts_mmlu["frac_answer"] = df_counts_mmlu["count"] / df_counts_mmlu["count"].sum()  # Fraction of each answer
display(df_counts_mmlu)

# Do the same for the other df
df_counts_other = df_other.drop_duplicates("question_id").groupby("answer").agg(
    count=('answer', 'size'),
).reset_index()
df_counts_other["frac_answer"] = df_counts_other["count"] / df_counts_other["count"].sum()

# Bar plot for df_mmlu
axs[0].bar(df_counts_mmlu["answer"], df_counts_mmlu["frac_answer"], color='skyblue')
axs[0].set_title('Distribution of Correct Answers - MMLU Dataset')
axs[0].set_ylabel('Fraction of questions')
axs[0].set_xlabel('Answer Letter')
axs[0].set_xticks(df_counts_mmlu["answer"])
axs[0].set_xticklabels(df_counts_mmlu["answer"])
axs[0].grid(axis='y')

# Bar plot for df_other
axs[1].bar(df_counts_other["answer"], df_counts_other["frac_answer"], color='salmon')
axs[1].set_title('Distribution of Correct Answers - Other Dataset')
axs[1].set_ylabel('Fraction of questions')
axs[1].set_xlabel('Answer Letter')
axs[1].set_xticks(df_counts_other["answer"])
axs[1].set_xticklabels(df_counts_other["answer"])
axs[1].grid(axis='y')

plt.tight_layout()
plt.show()

In [None]:
# B
from scipy.stats import chi2_contingency

# Contingency table: rows = datasets, columns = correct answer letter (A, B, C, D).
# Each question is counted once, since every question_id appears once per model.
answer_letters = ["A", "B", "C", "D"]
counts_mmlu = df_mmlu.drop_duplicates("question_id")["answer"].value_counts().reindex(answer_letters, fill_value=0)
counts_other = df_other.drop_duplicates("question_id")["answer"].value_counts().reindex(answer_letters, fill_value=0)
contingency_table = pd.DataFrame({"MMLU": counts_mmlu, "Other": counts_other}).T

display(contingency_table)

# Perform Chi-squared test of independence
chi2, p, dof, expected = chi2_contingency(contingency_table)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p value = {p:.3g}")

print("Null Hypothesis (H0): The distribution of correct answers (A, B, C, D) is independent of the dataset (i.e., it is the same in both datasets).")
print("Alternative Hypothesis (H1): The distribution of correct answers depends on the dataset (i.e., it differs between the two datasets).")

# Decision
alpha = 0.05  # significance level
if p < alpha:
    conclusion = "Reject the null hypothesis: There is a significant difference in the distribution of correct answers between the two datasets."
else:
    conclusion = "Fail to reject the null hypothesis: There is no significant difference in the distribution of correct answers between the two datasets."

print(conclusion)
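
To illustrate the 2D format that `chi2_contingency` expects, here is a minimal sketch with made-up counts (rows are two hypothetical datasets, columns are the answer letters A to D). The numbers are assumptions chosen so that one dataset over-represents answer "A"; they are not the homework data.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up counts of correct-answer letters (columns: A, B, C, D)
observed = np.array([
    [250, 250, 250, 250],  # hypothetical balanced dataset
    [400, 200, 200, 200],  # hypothetical dataset that favors "A"
])

chi2, p_value, dof, expected = chi2_contingency(observed)

# Degrees of freedom: (rows - 1) * (columns - 1) = 1 * 3
print(f"chi2 = {chi2:.2f}, dof = {dof}, p value = {p_value:.3g}")
print("Expected counts under independence:")
print(expected)
```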

### 2.3 (7 pt)

Let's dive in deeper:

A. What is language model X's mean accuracy conditioned on the four answer options for each dataset?

B. Compare LM X's performance when the correct answer is "A" between the two datasets. Use a T-test with CI = 0.95. What do you conclude?

C. Compare LM X's performance when the correct answer is "A" vs. "C or D" for each dataset. Use a T-test with CI = 0.95. What do you conclude?

In [None]:
# A
df_X_mmlu = df_mmlu[df_mmlu["model_name"] == "X"].groupby("answer").agg(
    tot_correct=("correct", "sum"),
    tot=("correct", "size")
)
df_X_mmlu["accuracy"] = df_X_mmlu["tot_correct"] / df_X_mmlu["tot"]
display(df_X_mmlu)

df_X_other = df_other[df_other["model_name"] == "X"].groupby("answer").agg(
    tot_correct=("correct", "sum"),
    tot=("correct", "size")
)
df_X_other["accuracy"] = df_X_other["tot_correct"] / df_X_other["tot"]
display(df_X_other)

In [None]:
# B

# Create 0/1 arrays of LM X's right and wrong answers on questions whose correct answer is "A"
# MMLU
mmlu_A = df_mmlu[(df_mmlu["model_name"] == "X") & (df_mmlu["answer"] == "A")]["correct"].astype(int).to_list()

# Other
other_A = df_other[(df_other["model_name"] == "X") & (df_other["answer"] == "A")]["correct"].astype(int).to_list()

# Perform a two-sided t-test; a confidence level of 0.95 corresponds to alpha = 0.05
alpha = 0.05
t_A = ttest_ind(a=mmlu_A, b=other_A, equal_var=True)

# Conclude
print("Null Hypothesis: LM X's mean accuracy when the correct answer is A is the same in the two datasets.")
print("Alternative Hypothesis: LM X's mean accuracy when the correct answer is A is different between the two datasets.")

print("p value:", round(t_A.pvalue, 3))

if t_A.pvalue < alpha:
    print("The p value is smaller than 0.05, therefore we reject the null hypothesis.")
    print("LM X's performance on questions to which the correct answer is A differs between the two datasets.")
else:
    print("The p value is greater than 0.05, therefore we do not reject the null hypothesis.")
    print("We cannot say that there is a difference in LM X's performance between the two datasets when answering questions to which the correct answer is A.")

In [None]:
# C

# Create 0/1 arrays of LM X's right and wrong answers on questions whose correct answer is "C" or "D"
# (mmlu_A and other_A were created in part B)
# mmlu
is_X_CD_mmlu = (df_mmlu["model_name"] == "X") & (df_mmlu["answer"].isin(["C", "D"]))
mmlu_CD = df_mmlu[is_X_CD_mmlu]["correct"].astype(int).to_list()

# other
is_X_CD_other = (df_other["model_name"] == "X") & (df_other["answer"].isin(["C", "D"]))
other_CD = df_other[is_X_CD_other]["correct"].astype(int).to_list()

# Perform the t-tests; a confidence level of 0.95 corresponds to alpha = 0.05
alpha = 0.05
t_mmlu_ACD = ttest_ind(a=mmlu_A, b=mmlu_CD, equal_var=True)
t_other_ACD = ttest_ind(a=other_A, b=other_CD, equal_var=True)

# Conclude
print("Null Hypothesis: LM X's mean accuracy when the correct answer is A is the same as when it is C or D.")
print("Alternative Hypothesis: LM X's mean accuracy when the correct answer is A is different from when it is C or D.")

for dataset_name, test in [("MMLU", t_mmlu_ACD), ("Other", t_other_ACD)]:
    print(f"\n{dataset_name}:")
    print("p value:", test.pvalue)
    if test.pvalue < alpha:
        print("The p value is smaller than 0.05, therefore we reject the null hypothesis.")
        print("We can conclude that there is a difference in the model's performance when answering questions where A is correct and where C or D is correct.")
    else:
        print("The p value is greater than 0.05, therefore we do not reject the null hypothesis.")
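
The tests above compare the means of two 0/1 vectors. The sketch below shows the same call on made-up data and, as an aside, also runs Welch's variant (`equal_var=False`), which does not assume equal variances. The group sizes and accuracies are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

# Made-up correctness vectors: 80% vs. 50% accuracy, 100 questions each
group_a = np.array([1] * 80 + [0] * 20)
group_b = np.array([1] * 50 + [0] * 50)

alpha = 0.05  # corresponds to a 0.95 confidence level

student = ttest_ind(group_a, group_b, equal_var=True)   # pooled-variance t-test
welch = ttest_ind(group_a, group_b, equal_var=False)    # Welch's t-test

for name, result in [("Student", student), ("Welch", welch)]:
    decision = "reject H0" if result.pvalue < alpha else "fail to reject H0"
    print(f"{name}: t = {result.statistic:.2f}, p = {result.pvalue:.2g} -> {decision}")
```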

### 2.4 (2 pt)

What an intriguing finding! 

A. Print the mean accuracies conditioned on the correct answer for all LMs for each dataset.

B. /Discuss:/ What do you observe?

In [None]:
# A

models = df_mmlu["model_name"].unique()  # Get unique model names from the MMLU dataset

# Create an empty DataFrame to store results
accuracy_summary = pd.DataFrame()

for model in models:
    # MMLU Dataset
    df_model_mmlu = df_mmlu[df_mmlu["model_name"] == model].groupby("answer").agg(
        tot_correct=("correct", "sum"),
        tot=("correct", "size")
    )
    df_model_mmlu["accuracy"] = df_model_mmlu["tot_correct"] / df_model_mmlu["tot"]
    df_model_mmlu["model_name"] = model
    df_model_mmlu["dataset"] = "mmlu"
    
    # Append results to the summary DataFrame
    accuracy_summary = pd.concat([accuracy_summary, df_model_mmlu.reset_index()], ignore_index=True)

    # Other Dataset
    df_model_other = df_other[df_other["model_name"] == model].groupby("answer").agg(
        tot_correct=("correct", "sum"),
        tot=("correct", "size")
    )
    df_model_other["accuracy"] = df_model_other["tot_correct"] / df_model_other["tot"]
    df_model_other["model_name"] = model
    df_model_other["dataset"] = "other"
    
    # Append results to the summary DataFrame
    accuracy_summary = pd.concat([accuracy_summary, df_model_other.reset_index()], ignore_index=True)

# Display the summarized accuracy results for all models
display(accuracy_summary)

In [None]:
import seaborn as sns

# Set the style for seaborn
sns.set_theme(style="whitegrid")

# Create a bar plot with one facet per dataset
g = sns.catplot(data=accuracy_summary, kind='bar', x='answer', y='accuracy',
                hue='model_name', col='dataset', palette='viridis',
                errorbar=None, height=6, aspect=1)

# Add titles and labels
g.set_titles('Dataset: {col_name}')
g.set_axis_labels('Answer Options', 'Mean Accuracy')
g.figure.suptitle('Mean Accuracy of Language Models Conditioned on Answer Options',
                  fontsize=16, y=1.03)
g.legend.set_title('Model Name')

# Show the plot
plt.show()

**B. /Discuss:/**

Model X does much better on questions whose correct answer is A than on questions with any other answer. We observe something similar for model Y on questions whose correct answer is D. Model Z, however, seems to do quite badly on all questions.

### 2.5 (2 pt)

Concerned with your findings so far, you quickly consult with Ms. Sakota. After thinking it over, Ms. Sakota concludes that more tests are needed. She orders a second round of MMLU results. However, the clever Ms. Sakota thinks of the following twist: while keeping questions fixed, she randomly permutes the position of the correct answer. The new results can be found in the folder `data/task_2_5/`:
```
task_2_5/
│
└── lm_scores_mmlu_shuffle.csv
```

/Discuss:/ Why would Ms. Sakota do this?

**/Discuss:/**

For different purposes, such as:
- Ensure that the model is not biased towards a specific answer position. For instance, if the correct answer is always the first one, the model may learn to pick the first option regardless of its content.
- Save time and money. Shuffling allows Ms. Sakota to create a "new" dataset for testing the LLM without spending much time or money.
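
To make the idea concrete, here is a minimal sketch of how the position of the correct answer could be permuted while the question and the option texts stay fixed. The helper `shuffle_options` and the sample options are hypothetical; we do not know how the shuffled file was actually generated.

```python
import random

LABELS = ["A", "B", "C", "D"]


def shuffle_options(options, correct_label, rng):
    """Randomly reassign option texts to labels and track the correct one.

    options: dict mapping label -> option text (texts assumed to be unique).
    Returns (new_options, new_correct_label).
    """
    texts = [options[label] for label in LABELS]
    correct_text = options[correct_label]
    rng.shuffle(texts)  # in-place permutation of the option texts
    new_options = dict(zip(LABELS, texts))
    new_correct = LABELS[texts.index(correct_text)]
    return new_options, new_correct


rng = random.Random(0)  # fixed seed so the sketch is reproducible
options = {"A": "Paris", "B": "Rome", "C": "Bern", "D": "Oslo"}
new_options, new_correct = shuffle_options(options, "A", rng)

# The content of the correct answer is unchanged; only its label may differ.
print(new_options, new_correct)
```

A model that answers based on content should be about equally accurate before and after such a shuffle, so a large drop would point to a position bias.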

### 2.6 (4 pt)

Increasingly sceptical of the language models' performance, you read up on proper testing practices. You stumble upon the concept of [test-retest stability](https://en.wikipedia.org/wiki/Repeatability), which roughly states that:

"_Measurements taken by a single person or instrument on the same item, under the same conditions, and in a short period of time, should have the same results._"

In our case, we would assume an LM would have the same performance on a given question regardless of the correct answer position. One way of testing this is by using the following metric:

$$\text{test-retest metric} = \frac{1}{N}\sum_{i=1}^N \frac{1}{M}\sum_{j=1}^M c^i_0 c_j^i,$$

where $c^i_0 \in \{0, 1\}$ indicates whether the model answers the $i^{\text{th}}$ question correctly (1 if correct, 0 if incorrect). $c_j^i$ indicates whether the model answers the $i^{\text{th}}$ question correctly in the $j^{\text{th}}$ shuffled version of the answer label content. Finally, $M$ is the total number of shuffles and $N$ is the dataset size.

Task: compute the test-retest metric for each language model using the original `lm_scores_mmlu.csv` file and the new `lm_scores_mmlu_shuffle.csv` file. Using a bar plot, visualize your results by comparing the accuracy of the original `lm_scores_mmlu.csv` and the test-retest scores.

**hints**
- what is $M$ in our case?

(bonus: no points, but so much sweet, sweet knowledge - check out [the following article](https://arxiv.org/pdf/2406.19470v1))

In [None]:
# Load the original and the shuffled datasets
df_mmlu = pd.read_csv('data/task_2/lm_scores_mmlu.csv')
df_mmlu_shuffle = pd.read_csv("data/task_2_5/lm_scores_mmlu_shuffle.csv")
df_mmlu["shuffled"] = False  # the original file keeps the original answer order

# merge them so that each row pairs the original and the shuffled result
df = pd.merge(left = df_mmlu,
              right = df_mmlu_shuffle,
              on = ["question_id", "model_name"],
              suffixes=["_normal", "_shuffle"]
              )

print(df_mmlu.shape)
print(df_mmlu_shuffle.shape)
print(df.shape)
display(df.head(3))
df.columns

In [None]:
# calculate accuracy for the normal model
df_accuracy = pd.DataFrame()

models = df["model_name"].unique()

for model in models:
    df_filtered = df_mmlu[df_mmlu["model_name"] == model]
    df_temp = pd.DataFrame({
        "accuracy": [df_filtered["correct"].sum()/df_filtered.shape[0]],
        "sem": [df_filtered["correct"].sem()],
        "model": [model],
        "dataset": ["mmlu"]
    })
    df_accuracy = pd.concat([df_accuracy, df_temp], ignore_index=True)

display(df_accuracy)

In [None]:
# with M = 1 the formula simplifies to:
# 1/N * sum(c_mmlu * c_mmlu_shuffled)

trm = []
for model in models:
    df_filtered = df[df["model_name"] == model]
    # print(model)
    # Calculate test retest metric
    c_shuffle = df_filtered["correct_shuffle"] #correct columns of shuffle
    c_normal = df_filtered["correct_normal"] #correct columns of normal
    trm.append(np.mean(c_shuffle * c_normal))

df_accuracy["trm"] = trm

display(df_accuracy)

In [None]:
# Setting up the bar plot
bar_width = 0.35  # Width of bars
x = np.arange(len(df_accuracy['model']))  # the label locations

fig, ax = plt.subplots(figsize=(10, 6))

# Create bars for accuracy and trm scores
bars1 = ax.bar(x - bar_width/2, df_accuracy['accuracy'], bar_width, label='Original Accuracy', color='skyblue')
bars2 = ax.bar(x + bar_width/2, df_accuracy['trm'], bar_width, label='Test-Retest Scores', color='salmon')

# Adding labels and title
ax.set_xlabel('Model')
ax.set_ylabel('Scores')
ax.set_title('Comparison of Original Accuracy and Test-Retest Scores')
ax.set_xticks(x)
ax.set_xticklabels(df_accuracy['model'])
ax.legend()

# Adding value labels on top of bars
for bar in bars1:
    yval = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2, yval, round(yval, 3), ha='center', va='bottom')

for bar in bars2:
    yval = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2, yval, round(yval, 3), ha='center', va='bottom')

# Show the plot
plt.tight_layout()
plt.show()

**discussion**

- A model's performance can vary significantly between tests.
- If an LM performs well on a training dataset but poorly on new, unseen data (as reflected by a low TRM), it could indicate that the model is overfitting.
- The test-retest scores are low for all three models, which confirms that the position of the answers makes quite a difference to the models' answers.

### 2.7 (2 pt)

A. Using the unshuffled data: For each LM, print the distribution of the answers they give as well as the accuracy conditioned on the answer they give.

B. /Discuss:/ Describe what you observe

[bonus: not scored, but again _that sweet, sweet knowledge_] Could you think of a plausible explanation?
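
The two quantities asked for in (A) can be illustrated on a handful of made-up records before applying them to the real scores. This is only a sketch: the function names and the toy records are assumptions, and the real data would be grouped per language model first.

```python
from collections import Counter

# Toy records: (answer given by the model, correct answer). Values are made up.
toy_records = [("A", "A"), ("A", "B"), ("A", "C"), ("B", "B"), ("C", "C"), ("A", "A")]


def answer_distribution(records):
    """Share of each answer label among all answers the model gave."""
    given = Counter(result for result, _ in records)
    return {label: count / len(records) for label, count in given.items()}


def accuracy_given_answer(records):
    """Accuracy conditioned on the answer the model gave."""
    given = Counter(result for result, _ in records)
    hits = Counter(result for result, truth in records if result == truth)
    return {label: hits[label] / given[label] for label in given}


print(answer_distribution(toy_records))
print(accuracy_given_answer(toy_records))  # {'A': 0.5, 'B': 1.0, 'C': 1.0}
```

Note that this conditions on the answer the model *gave*, whereas task 2.4 conditioned on the *correct* answer.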

In [None]:
# A

models = df_mmlu["model_name"].unique()  # Get unique model names from the MMLU dataset

# Create an empty DataFrame to store results
accuracy_summary = pd.DataFrame()

for model in models:
    # MMLU Dataset
    df_model_mmlu = df_mmlu[df_mmlu["model_name"] == model].groupby("result").agg(
        tot_correct=("correct", "sum"),
        tot=("correct", "size")
    )
    df_model_mmlu["result_proportion"] = df_model_mmlu["tot"] / len(df_mmlu[df_mmlu["model_name"] == model].index)
    df_model_mmlu["accuracy"] = df_model_mmlu["tot_correct"] / df_model_mmlu["tot"]
    df_model_mmlu["model_name"] = model
    
    # Append results to the summary DataFrame
    accuracy_summary = pd.concat([accuracy_summary, df_model_mmlu.reset_index()], ignore_index=True)

# Display the summarized accuracy results for all models
display(accuracy_summary)

B. /Discuss:/

- We can observe that both the accuracy and the proportion of results have a wide range of values depending on the result. This is particularly noticeable in models X and Y, X having accuracy ranging from 0.37 to 1.0 and Y having proportions ranging from 0.091 to 0.46.
- We can also note that answers with a low proportion like C and D for model X or A for model Y have the best accuracy.

## Task 3 (16 points): What do Questions and Answers look like for a Language Model?

While you feel pretty good about the tests you conducted so far, something still bothers you: what if the language models don't see the data like you do? Suddenly, you receive a phone call from a wise AI sage in the West, _Westoda_:

```
"Hmm, correct you are, young padawan, to question how the world is seen by large language models! Simple 'text' it is not, hmm? No, no, no! Characters and words, the way of puny humans, this is not, heh heh heh.

'Tokens', they use, yes! Mysterious and powerful, these tokens are. Expand our vocabulary, they do, beyond the simple 'a to Z'. Chunky blocks of text, they become, yes! 'Hello world', a simple phrase it may seem. But to a language model, '[24912, 2375]' it might appear, yes! Confusing, it is, hmm?

Wise, it would be, to explore these MMLU data points through the eyes of a language model, you think? Yes, yes! Much to learn, there is. The ways of the tokens, understand you must, if truly comprehend the great LMs, you wish to.
Meditate on this, you should. The force of natural language processing, strong it is. But patience, you must have, my young padawan. For only through great study and contemplation, will the mysteries of the tokens reveal themselves to you, they will. Yes, hmmm!"
```

Admittedly, Westoda at times speaks in riddles… However, he was explaining a crucial aspect of modern LMs called [Tokenization](https://learn.microsoft.com/en-us/dotnet/ai/conceptual/understanding-tokens):


“Tokens are words, character sets, or combinations of words and punctuation that are used by [language models (LMs)] to decompose text into. Tokenization is the first step in training”

Instead of characters, LMs process natural language using “tokens”. While this is useful for a number of reasons, it does at times introduce some “unintuitive” behavior…
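
One example of such behavior is that leading whitespace is often part of a token, so `"A"` and `" A"` can map to different IDs. The sketch below illustrates this with a toy greedy longest-match tokenizer. This is a deliberate simplification (real tokenizers such as the one used by `tiktoken` rely on byte-pair encoding), and the vocabulary and IDs are invented for illustration.

```python
# Hypothetical vocabulary; the IDs are invented and match no real tokenizer.
TOY_VOCAB = {"hello": 0, " world": 1, "A": 2, " A": 3}


def toy_tokenize(text):
    """Greedy longest-match tokenization over TOY_VOCAB."""
    tokens = []
    position = 0
    while position < len(text):
        # Try the longest candidate first, then shrink until a vocab entry matches.
        for end in range(len(text), position, -1):
            piece = text[position:end]
            if piece in TOY_VOCAB:
                tokens.append(TOY_VOCAB[piece])
                position = end
                break
        else:
            raise ValueError(f"cannot tokenize {text[position:]!r}")
    return tokens


print(toy_tokenize("hello world"))  # [0, 1]
print(toy_tokenize("A"))            # [2]
print(toy_tokenize("hello A"))      # [0, 3]
```

The same letter thus ends up as token 2 on its own but as token 3 when it follows a space, which matters when counting "A" tokens later on.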

In [None]:
# PROVIDED CODE

try:
    import tiktoken
except Exception as e:
    print('installing tiktoken package')
    
    !pip install tiktoken
    
    import tiktoken

def tokenize_text(s):
    enc = tiktoken.encoding_for_model('gpt-4o')
    tokens = enc.encode(str(s))
    return tokens

example_string = 'hello world'
print(f'humans see: "{example_string}" --> language models see: {tokenize_text(example_string)}')

### 3.1 (5 pt)

Use the provided code in the cell above to "see the world through the eyes of a language model":

A. Tokenize the questions of the original MMLU data provided in task 1: `task_1/mmlu_data/test.csv` and plot the token distribution (the frequency of each token).

B. Same as (A), but now for the answers in columns (columns "A", "B", "C", and "D").

C. Isolate the tokens for the strings "A", "B", "C", and "D", then, for their occurrences in both questions and answers, print their relative distribution to each other.

**hint**
- There are a _lot_ of tokens, consider using a cutoff point and log scale
- For (c), they should sum to 1
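
The hints can be sketched on toy data as follows. The cutoff used here keeps the `top_k` most frequent tokens, which is only one possible choice of cutoff; the function names and the token IDs are assumptions made for illustration.

```python
from collections import Counter


def token_frequencies(token_lists, top_k=None):
    """Count tokens over all lists and optionally keep the top_k most frequent."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return counts.most_common(top_k)


def relative_distribution(counts, tokens_of_interest):
    """Share of each token of interest among those tokens only (sums to 1).

    Assumes at least one of the tokens of interest occurs.
    """
    selected = {token: counts.get(token, 0) for token in tokens_of_interest}
    total = sum(selected.values())
    return {token: count / total for token, count in selected.items()}


# Made-up token IDs standing in for tokenized questions.
toy_token_lists = [[5, 7, 7, 9], [7, 5, 11], [7, 13]]

print(token_frequencies(toy_token_lists, top_k=2))  # [(7, 4), (5, 2)]

all_counts = dict(token_frequencies(toy_token_lists))
shares = relative_distribution(all_counts, [5, 9, 11, 13])
print(shares)
```

When plotting the resulting counts, a logarithmic y-axis keeps the many rare tokens visible next to the few very frequent ones.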

In [None]:
# A

df_test_tokenized = df_test.copy()

df_test_tokenized["tokenized_question"] = df_test_tokenized["question"].apply(lambda s : tokenize_text(s))

df_test_tokenized.head()

In [170]:
def get_all_tokens(token_series: pd.Series) -> pd.Series:
    # concatenate all tokens in the series
    all_tokens = []
    for tokens in token_series:
        all_tokens.extend(tokens)

    # create a series from the list of tokens
    return pd.Series(all_tokens)

In [None]:
# collect all question tokens into a single series
sr_question_tokens = get_all_tokens(df_test_tokenized["tokenized_question"])

# find the percentiles of the tokens
percentiles = np.arange(0, 1.01, 0.05)
token_percentiles = sr_question_tokens.quantile(percentiles)

count_unique_tokens = token_percentiles.apply(
    lambda x: sr_question_tokens[sr_question_tokens <= x].nunique()
)

plt.figure(figsize=(12, 6))

count_unique_tokens.plot(logy=True)

plt.title("Percentiles of Tokens in Original MMLU Data")
plt.ylabel("Number of unique tokens")
plt.xlabel("Percentile")
plt.xticks(percentiles)
plt.xlim(0, 1)

plt.show()

In [None]:
cut_off_point = int(sr_question_tokens.quantile(0.3))
print("Cut-off point:", cut_off_point)
sr_question_tokens_cut = sr_question_tokens[sr_question_tokens <= cut_off_point]
print("Number of unique tokens <= cut-off point:", sr_question_tokens_cut.nunique())
print("Number of tokens <= cut-off point:", len(sr_question_tokens_cut))

In [173]:
# We can see the distribution is right-skewed and the number of unique tokens starts increasing significantly after 0.3.
# As the point still keeps a reasonable amount of unique tokens, we choose 30th percentile as the cut-off point.
sr_question_tokens_cut_freq = sr_question_tokens_cut.value_counts() / len(sr_question_tokens_cut)

In [None]:
# Plot the token distribution

plt.figure(figsize=(35, 10))

sr_question_tokens_cut_freq.plot(kind="bar", logy=True)

plt.title(
    f"Question Token Distribution of Original MMLU Data (token <= {cut_off_point})",
    fontsize=22,
)
plt.ylabel("Frequency", fontsize=20)
plt.xlabel("Token", fontsize=20)

plt.show()

In [None]:
# B

# tokenize the answers
df_test_tokenized["tokenized_A"] = df_test_tokenized["A"].apply(lambda s : tokenize_text(s))
df_test_tokenized["tokenized_B"] = df_test_tokenized["B"].apply(lambda s : tokenize_text(s))
df_test_tokenized["tokenized_C"] = df_test_tokenized["C"].apply(lambda s : tokenize_text(s))
df_test_tokenized["tokenized_D"] = df_test_tokenized["D"].apply(lambda s : tokenize_text(s))

df_test_tokenized.head()

In [176]:
# Count the number of times each token appears
# We use the same cut-off point as before to keep the same range of unique tokens

sr_A_tokens = get_all_tokens(df_test_tokenized["tokenized_A"])
sr_A_tokens_cut = sr_A_tokens[sr_A_tokens <= cut_off_point]
sr_A_token_counts_cut = sr_A_tokens_cut.value_counts()
sr_A_token_counts_cut_freq = sr_A_token_counts_cut / len(sr_A_tokens_cut)

sr_B_tokens = get_all_tokens(df_test_tokenized["tokenized_B"])
sr_B_tokens_cut = sr_B_tokens[sr_B_tokens <= cut_off_point]
sr_B_token_counts_cut = sr_B_tokens_cut.value_counts()
sr_B_token_counts_cut_freq = sr_B_token_counts_cut / len(sr_B_tokens_cut)

sr_C_tokens = get_all_tokens(df_test_tokenized["tokenized_C"])
sr_C_tokens_cut = sr_C_tokens[sr_C_tokens <= cut_off_point]
sr_C_token_counts_cut = sr_C_tokens_cut.value_counts()
sr_C_token_counts_cut_freq = sr_C_token_counts_cut / len(sr_C_tokens_cut)

sr_D_tokens = get_all_tokens(df_test_tokenized["tokenized_D"])
sr_D_tokens_cut = sr_D_tokens[sr_D_tokens <= cut_off_point]
sr_D_token_counts_cut = sr_D_tokens_cut.value_counts()
sr_D_token_counts_cut_freq = sr_D_token_counts_cut / len(sr_D_tokens_cut)

dict_answer_token_cut_counts = {
    "A": sr_A_token_counts_cut_freq,
    "B": sr_B_token_counts_cut_freq,
    "C": sr_C_token_counts_cut_freq,
    "D": sr_D_token_counts_cut_freq,
}

In [None]:
# Plot the token distribution

for answer, freq in dict_answer_token_cut_counts.items():

    plt.figure(figsize=(35, 10))

    freq.plot(kind="bar", logy=True)

    plt.title(
        f"Answer {answer} Token Distribution of Original MMLU Data (token <= {cut_off_point})",
        fontsize=22,
    )
    plt.ylabel("Frequency", fontsize=20)
    plt.xlabel("Token", fontsize=20)

    plt.show()

In [None]:
# C

# build a dictionary of "A", "B", "C", "D" tokens
iso_chars = ["A", "B", "C", "D"]
dict_ABCD_token = {k: tokenize_text(k)[0] for k in iso_chars}

# find the occurrences of "A", "B", "C", "D" in the questions and answers
for char, token in dict_ABCD_token.items():
    df_test_tokenized[f"{char}_occur_pair"] = df_test_tokenized[
        "tokenized_question"
    ].apply(lambda x: token in x) | df_test_tokenized[f"tokenized_{char}"].apply(
        lambda x: token in x
    )

df_chars = pd.DataFrame.from_dict(dict_ABCD_token, orient="index").reset_index()
df_chars.columns = ["char", "token"]
df_chars["count_occur_question_answers"] = df_chars["char"].apply(
    lambda x: df_test_tokenized[f"{x}_occur_pair"].sum()
)
df_chars["count_occur_question_answers_rel"] = (
    df_chars["count_occur_question_answers"]
    / df_chars["count_occur_question_answers"].sum()
)

print("Relative distribution of A, B, C, D tokens occurring in both the questions and answers:")
display(df_chars[["char", "count_occur_question_answers_rel"]])

### 3.2 (3 pt)

What if the number of "A", "B", "C", and "D" tokens in the question and answer pairs could influence a language model's decisions?

A. For each combined question-answers pair, compute: 
1. the number of "A", "B", "C", and "D" tokens; and
2. the total number of tokens.
3. then, group by the "correct" answer and compute the mean frequency of A, B, C, and D tokens and the total number of tokens. 
4. finally, print your results

B. /Discuss:/ What do you think of the hypothesis that the frequency of A, B, C, and D tokens could influence answers?
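
The steps in (A) can be sketched on toy data as follows. The letter token IDs, the toy rows and the function name `mean_counts_by_answer` are all hypothetical; the real analysis would use the tokenized MMLU questions and answers.

```python
from collections import defaultdict

# Hypothetical token IDs for the single-letter strings (not real tokenizer IDs).
LETTER_TOKENS = {"A": 1, "B": 2, "C": 3, "D": 4}

# Each toy row: (correct answer, tokens of the question plus all four options).
toy_rows = [
    ("A", [1, 9, 9, 2, 1]),
    ("A", [9, 3, 9]),
    ("B", [2, 2, 9, 4]),
]


def mean_counts_by_answer(rows):
    """Group rows by correct answer; average letter-token counts and lengths."""
    grouped = defaultdict(list)
    for answer, tokens in rows:
        stats = {letter: tokens.count(tok) for letter, tok in LETTER_TOKENS.items()}
        stats["total"] = len(tokens)
        grouped[answer].append(stats)
    return {
        answer: {key: sum(s[key] for s in stats_list) / len(stats_list)
                 for key in stats_list[0]}
        for answer, stats_list in grouped.items()
    }


for answer, means in mean_counts_by_answer(toy_rows).items():
    print(answer, means)
```

If letter-token frequency drove the models' answers, one would expect these means to differ clearly between the groups of correct answers.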


In [None]:
# A

# compute number of tokens in question-answer pairs
for char, token in dict_ABCD_token.items():
    df_test_tokenized[f"pair_{char}_count"] = df_test_tokenized[
        "tokenized_question"
    ].apply(lambda x: x.count(token)) + df_test_tokenized[f"tokenized_{char}"].apply(
        lambda x: x.count(token)
    )
    df_test_tokenized[f"pair_{char}_total"] = df_test_tokenized[
        "tokenized_question"
    ].apply(len) + df_test_tokenized[f"tokenized_{char}"].apply(len)

df_test_tokenized.head(3)

In [180]:
# group by answer
group_by_answer = df_test_tokenized.groupby("answer").agg(
    mean_freq_A = ("pair_A_count", "mean"),
    mean_freq_B = ("pair_B_count", "mean"),
    mean_freq_C = ("pair_C_count", "mean"),
    mean_freq_D = ("pair_D_count", "mean"),
    mean_freq_total_pair_A = ("pair_A_total", "mean"),
    mean_freq_total_pair_B = ("pair_B_total", "mean"),
    mean_freq_total_pair_C = ("pair_C_total", "mean"),
    mean_freq_total_pair_D = ("pair_D_total", "mean"),
)

In [None]:
group_by_answer

B. /Discuss:/

The mean frequencies of each letter token are very close across the groups of correct answers. In addition, the frequency of the letter "A" is noticeably higher than that of the other letters in every group, regardless of which answer is correct. We therefore believe that the frequency of individual letter tokens does not influence the model's choice of answers.

### 3.3 (4 pt)

Three of the most important considerations when deciding between language models are:

- Quality
- Costs
- Speed

So far, much of your analysis has focused on quality. However, the government has indicated that they are quite concerned about both the total costs and speed as well. Specifically, it has been brought to their attention that a new `turbo` model has been launched! 

This model is both cheaper and faster than the models you evaluated so far. However, there is a catch: the context length* is much smaller than that of the other LMs. Namely, it can only process **300** tokens during inference. Meanwhile, the other models can process up to 100K tokens! 

*_The “context length” refers to the number of tokens that can be given to an LM as input._

A. Are there subjects where using the cheaper model might be problematic? I.e., where part of the question and answer(s) might not fit completely in the context?

B. /Discuss:/ Can you think of a strategy that would balance the needs of the government?

**hint**:
- An LM needs to have both the question and the different answer options in its context
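
A minimal sketch of the check implied by the hint: sum the token counts of the question and of all four options, compare against the limit, and aggregate per subject. The toy rows and the function name are made up, and the sketch assumes that no extra prompt or formatting tokens are added around the question and options.

```python
from collections import Counter

CONTEXT_LIMIT = 300  # context length of the turbo model, as stated in the task

# Toy rows: (subject, tokens in question, tokens in options A-D). Made-up values.
toy_rows = [
    ("law", 280, [10, 12, 9, 11]),
    ("law", 120, [8, 8, 8, 8]),
    ("math", 40, [3, 3, 3, 3]),
    ("history", 310, [5, 5, 5, 5]),
]


def exceed_rate_by_subject(rows, limit=CONTEXT_LIMIT):
    """Fraction of questions per subject whose full context exceeds the limit."""
    totals = Counter()
    exceeded = Counter()
    for subject, n_question, n_options in rows:
        totals[subject] += 1
        # The model needs the question AND every answer option in its context.
        if n_question + sum(n_options) > limit:
            exceeded[subject] += 1
    return {subject: exceeded[subject] / totals[subject] for subject in totals}


print(exceed_rate_by_subject(toy_rows))  # {'law': 0.5, 'math': 0.0, 'history': 1.0}
```

Subjects with a high rate are the ones where the turbo model would see truncated inputs.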

In [None]:
# A

# find contexts that exceed 300 tokens
# count --> number of questions (of that subject) that have more than 300 tokens (counting both question and answers)
# original_count --> total number of questions (of that subject)
df_test_tokenized["context"] = (
    df_test_tokenized["tokenized_question"]
    + df_test_tokenized["tokenized_A"]
    + df_test_tokenized["tokenized_B"]
    + df_test_tokenized["tokenized_C"]
    + df_test_tokenized["tokenized_D"]
)
df_token_exceeded = df_test_tokenized[df_test_tokenized["context"].apply(len) > 300]

# name the columns explicitly so the result does not depend on the pandas version
df_subject_exceeded = (
    df_token_exceeded["subject"]
    .value_counts()
    .rename_axis("subject")
    .reset_index(name="count")
)
df_subject_exceeded["original_count"] = df_subject_exceeded["subject"].apply(
    lambda x: df_test_tokenized[df_test_tokenized["subject"] == x].shape[0]
)
df_subject_exceeded["rate"] = (
    df_subject_exceeded["count"] / df_subject_exceeded["original_count"]
)
df_subject_exceeded

In [None]:
# list all affected subjects and those that are significantly affected
affected_subjects = df_subject_exceeded["subject"]
significant_subjects = df_subject_exceeded[df_subject_exceeded["rate"] > 0.1]["subject"]
print("Affected subjects:")
for subject in affected_subjects:
    print("\t", subject)
print("Problematic subjects (significantly affected, with rate > 0.1):")
for subject in significant_subjects:
    print("\t", subject)

B. /Discuss:/

Use the original model in subjects that are significantly affected, such as law and history, and apply the new turbo model in other subjects.

Try to summarize and reformulate the questions (for example, remove parts of the question that are repeated in the answers).

### 3.4 (4 pt)

/Discuss:/ The time has come to give your final recommendation on the use of LMs in education to the government! Taking into account everything you analyzed in all the preceding tasks (1, 2, and 3), please write a short recommendation consisting of 4 bullet points discussing your concerns.

**hint**
- Try to use the MECE framework: _Mutually Exclusive Collectively Exhaustive_

/Discuss:/

1. The model exhibits overfitting, performing worse on other datasets than on the MMLU dataset. This suggests that the model can carry inherent biases from its training data, and that its accuracy in real-world use may not meet expectations.

2. The model is sensitive to the position of the answers: it shows a bias towards certain answer positions and may not treat all answer options impartially.

3. In subjects with longer contexts, such as history and law, the model may be unable to process all the information, and therefore may not address questions in these subjects effectively.

4. The use of the model may incur significant costs, and its application across subjects should be adjusted based on budget considerations.