Data Analysis: Clinic 1 - Group 15

Names: Emma Calvino, Ingrid Salvador, Miriam Espinosa

Numbers: i6328339, i6314966, i6320314

**Use of genAI tools (e.g. chatGPT), websites (e.g. stackoverflow)**: *list websites where you found code (or other info) as well as include information on how you used genAI tools*

# Data Analysis, Clinic 1
# FIETS: Fundamentele Innovatie En Technologie in Scholing
## With FIETS, education keeps moving forward, even against the wind!

---

By completing and delivering the clinic tasks you will know how to :

- Load data and handle data using pandas;
- Navigate the documentation of Python packages by yourself;
- Filter and tidy up **noisy** real-world datasets;
- Aggregate your data in different (and hopefully helpful) ways;
- Use EDA to learn more about your data
- Create and interpret informative visualizations to explore the data set
- Derive meaningful insights for the societal impact of datasets

---
**Important Dates.**

- Clinic 1 release: Thu 30 Jan 2024
- Clinic 1 due: Fri 07 Feb 2024 late night, wildcards available

**Instructions for the deliverable:**

* You are allowed to use any built-in Python library that comes with Anaconda. If you want to use an external library, you may do so, but must justify your choice.

* Make sure that you include a proper amount/mix of comments, results and code. More specifically, be sure to provide a concise textual description of your thought process, the assumptions you made, the solution you implemented, and explanations for your answers. A notebook that only has code cells will not suffice. To avoid confusion: use short comments for longer code answers.

* For questions containing the /Discuss:/ prefix, answer not with code, but with a textual explanation (in markdown).

* Back up any hypotheses and claims with data, since this is an important aspect of the course.

* Please write all your comments in English, and use meaningful variable names (as far as possible) in your code.

* In the end, make sure that all cells are executed properly and everything you need to show is in your (executed) notebook. We will not run your notebook for you!

- In continuation to the previous point, interactive plots, such as those generated using the ‘plotly’ package, should be strictly avoided! Make sure to print results and/or dataframes that confirm you have properly addressed the task.

* You are asked to deliver **only your executed notebook file, .ipynb** and nothing else. If you deliver other files, we will not grade anything.

* Honor code applies to these tasks. If you are not certain about an action, consult with Jerry.

**A Note from Jerry on using Language Models (LMs)**

If you try hard enough, you will likely get away with cheating (that does not only apply to LMs). Fortunately, my job is not to police, but rather to educate you. So, please consider the following:

I assume that you are taking this course to learn something! LMs are not always right ([they often fail in silly ways](https://community.openai.com/t/why-9-11-is-larger-than-9-9-incredible/869824/4)). This course should prepare you to detect when they are wrong!

I don't restrict the use of LMs because I see the value of being helped when coding (esp. in the context of pandas dataframes nightmare :)). Based on what we saw last year in your notebooks, it's pretty clear when you "copy" some code and then you struggle to interpret the results. This is the essence of this course and of the skills you should try to build for yourself: Many people can run fancy models these days but not many people can interpret the results correctly. Try to be one of the latter.

---

## Context

AI is booming! Newspapers, influencers and your relatives all agree that AI is important. But while almost everyone agrees that AI is the future, much is unclear about what that future esp. in critical sectors like education looks like...

Freshly graduated from a top Dutch university in Limburg, you are hired by the Dutch government to advise on a large-scale “education innovation” initiative code-named "FIETS" (Flexibele Innovatie voor Efficiënte Toepassing in Scholing). With higher education facing severe budget cuts, the government is looking for creative solutions to "do more with less." Convinced by the stunning progress in language modeling, officials believe LLMs could help battle growing teacher shortages and reduce costs by automating parts of the education process. Your job description: investigate which LMs might be best suited to plug the gaps without draining the budget!

You are handed the results of three LMs on the [“Massive Multitask Language Understanding (MMLU)”](https://arxiv.org/abs/2009.03300) dataset to compare. This famous dataset consists of 57 subjects with multiple-choice questions, covering diverse subjects like mathematics, computer science, history, and law. Most providers of state-of-the-art LMs use this dataset to showcase the versatility of their latest models. Unfortunately, the intern responsible for collecting the results didn’t pay attention during DACS KEN3450: Data Analysis. As a result, the collected datasets are slightly corrupted. Jammer!

The success of FIETS depends on your ability to make sense of the messy data and recommend the best model to keep the Dutch education system pedaling forward—despite uphill challenges like funding shortages and a skeptical academic community!

### A very brief primer on Language Models
We studied LLMs in the context of the NLP course but here is a short reminder. Language models (LMs) are sophisticated statistical models designed to understand and generate human-like text. At their core, LMs are trained to predict the most likely continuation of a given input text. For example, given the input "The cat sat on the," an LM might predict "mat" as a likely continuation.
LMs are trained on vast text samples from various sources, including books, websites, and social media. This extensive training allows them to capture patterns and relationships in language, enabling them to generate coherent and contextually appropriate text across a wide range of topics and styles.

While LMs can produce text that appears to be written by intelligent humans, it's important to note that their capabilities can diverge from human intelligence in unexpected ways. They may sometimes generate factually incorrect information or struggle with complex reasoning tasks.

Two key concepts in understanding LMs are:
1. **Tokens**: LMs process text using "tokens" rather than individual characters. Tokens can be words, parts of words, or punctuation marks. For example, the sentence "I love AI!" might be tokenized as ["I", "love", "AI", "!"]. Tokenization is the first step in both training and using an LM.
2. **Context**: The input text provided to an LM is called the "context." This context informs the model's predictions or generations. A longer or more specific context often leads to more accurate and relevant outputs.

[See: Wikipedia entry on language models](https://en.wikipedia.org/wiki/Large_language_model)
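
To make the idea of tokens concrete, here is a minimal sketch of a toy tokenizer that reproduces the "I love AI!" example above. It is only an illustration: the function name `toy_tokenize` is hypothetical, and real LMs typically use learned subword tokenizers rather than a simple regular expression.

```python
import re

def toy_tokenize(text):
    """Split text into word and punctuation tokens (toy illustration only)."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = toy_tokenize("I love AI!")
print(tokens)  # ['I', 'love', 'AI', '!']
```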

###  Files for this assignment
This assignment is divided into three tasks, each of which should bring you a step closer to providing a recommendation toward the objectives of the FIETS project:

- **Task 1**: Inspecting the results and getting your first model ranking
- **Task 2**: Inspecting the underlying data used to generate the results for possible biases
- **Task 3**: Learning about tokens and providing a final recommendation


```
📁 FIETS
│
├── 📄 clinic1.ipynb (the file you're currently reading!)
│
└── 📁 data
    ├── 📁 task_1
    ├── 📁 task_2
    └── 📁 task_2.5
```   
 

In [166]:
# some basic imports
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
from scipy.stats import ttest_ind

## Task 1 (18 points): What's in an average anyway?

The files needed to complete task 1 can be found in the folder `data/task_1/`:
```
task_1/
│
├── mmlu_data/
│   └── test.csv
│
└── lm_scores/
    ├── lm_X.csv
    ├── lm_Y.csv
    └── lm_Z.csv
```

We will start by loading, (manually) inspecting, and cleaning the data. Although it doesn't seem "glamorous" (nor is it particularly fun...) - manually inspecting data is extremely important! In fact, it's one of the few things most AI and Data Science researchers agree on :). Next, we will take a first pass on ordering our Olympic podium between three LMs.

### 1.1 (1 pt)
 
Load the subfiles contained in the `mmlu_data` and `lm_scores` folders into separate dataframes:
- `df_test`
- `df_x`
- `df_y`
- `df_z`

For each, print its size.

In [167]:
df_test = pd.read_csv('data/task_1/mmlu_data/test.csv')

f = 'data/task_1/lm_scores/'
df_x = pd.read_csv(os.path.join(f, 'lm_X.csv'))
df_y = pd.read_csv(os.path.join(f, 'lm_Y.csv'))
df_z = pd.read_csv(os.path.join(f, 'lm_Z.csv'))

print('df_test: ', df_test.shape)
print('df_x: ', df_x.shape)
print('df_y: ', df_y.shape)
print('df_z: ', df_z.shape)

df_test:  (14042, 8)
df_x:  (13882, 2)
df_y:  (13978, 2)
df_z:  (13923, 2)


### 1.2 (4 pt)
Unfortunately, LMs don't always output the format we want. In the column `result`, the value should be one of A, B, C, or D. 

A. For each of the LM score dataframes, use a `value_counts()` operation and print the results. 

B. /Discuss:/ Inspect the results and describe the types of answer formats you see. Besides the "expected" case, you should be able to find at least four unexpected formats.

In [168]:
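# A
# For reference: the test dataframe (not an LM score dataframe)
print('df_test: ', df_test.value_counts())
# For the dataframe lm_X: count how often each distinct value occurs in 'result'
print('df_x: ', df_x['result'].value_counts())
# For the dataframe lm_Y
print('df_y: ', df_y['result'].value_counts())
# For the dataframe lm_Z
print('df_z: ', df_z['result'].value_counts())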

df_test:  question                                                                                                                                                                                                                                                                                                                                                      A                                                   B                                                  C                                                        D                                                                 answer  subject                 question_id
 A 10% increase (decrease) in price produces a 10% decrease (increase) in quantity demanded. This is referred to as:                                                                                                                                                                                                                                          Zero price elasticit

In [159]:
# B

The expected format for answers across all dataframes (df_x, df_y, and df_z) is a single letter: 'A', 'B', 'C', or 'D'. However, several responses deviate from this format, leading to inconsistencies in the data:

- Answers prefixed with "Answer: " instead of just the letter.
- Full-sentence responses that contain an explanation instead of a single letter.
- Answers padded with whitespace, e.g. "D " with a trailing space (visible in the printed dataframes in 1.3).

**Four Specific Examples of Unexpected Answers:**
- Question ID: 4658 (df_y) : The response is "Answer: C" instead of just "C".
- Question ID: 4664 (df_z) : The response is "Answer: A" instead of just "A".
- Question ID: 4661 (df_y) : The response is "The demand for labor is derived from the demand for the products produced by labor., so the answer is D" instead of just "D".
- Question ID: 9403 (df_x) : The response is "Answer: D" instead of just "D".


These inconsistencies may cause issues when processing or analyzing the data, so standardizing the answer format is necessary.
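
The formats described above could be counted systematically rather than spotted by eye. The following is a minimal sketch under stated assumptions: the category labels and the helper `classify_format` are hypothetical, and the toy values are made up to mirror the examples listed above rather than taken from the real files.

```python
import re
import pandas as pd

VALID_CHOICES = {"A", "B", "C", "D"}

def classify_format(value):
    """Assign a raw model answer to a coarse format category (hypothetical labels)."""
    if pd.isna(value):
        return "missing"
    text = str(value)
    if text in VALID_CHOICES:
        return "expected"
    if text.strip() in VALID_CHOICES:
        return "extra whitespace"
    if re.fullmatch(r"\s*Answer:\s*[ABCD]\s*", text):
        return "answer prefix"
    if len(text) >= 10:
        return "long text"
    return "other"

# Toy values (assumed, for illustration only)
toy_results = pd.Series(["B", "D ", "Answer: C", "..., so the answer is D", None])
format_counts = toy_results.apply(classify_format).value_counts()
print(format_counts)
```

Applied to a real score dataframe, the same call would be `df_x['result'].apply(classify_format).value_counts()`.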

### 1.3 (5 pt)
Oh oh... That doesn't look great. Simply dropping all invalid answers seems overly wasteful, yet fixing all of these looks like a mess! Instead, let's focus for now on fixing just those answers of length < 10 characters that require only a single `str.replace()` operation. 

For example, if the answer looks like `--A--`, we could fix this by using the following simple function:

```
def clean_answer(s, pattern='-'):
    return str(s).replace(pattern, '')

dirty_answer = '--A--'
cleaned_answer = clean_answer(dirty_answer)
```

A. Filter the three score dataframes to include only answers with less than 10 characters. Make a deep copy of the dataframes as you filter them.

B. Modify the `clean_answer()` example function to clean the answers in the filtered data frames using the `apply()` functionality. Finally, make sure **all remaining answers are one of `A, B, C, or D`.**

C. /Discuss:/ Compare the sizes of the original and filtered data frames. What do you see? Why might this be a problem?

In [169]:
#A
# Filter df_x, df_y and df_z to answers with fewer than 10 characters; .copy(deep=True) ensures a deep copy is created
df_x_filter = df_x[df_x['result'].str.len() < 10].copy(deep=True)
df_y_filter = df_y[df_y['result'].str.len() < 10].copy(deep=True)
df_z_filter = df_z[df_z['result'].str.len() < 10].copy(deep=True)

#and now print the results
print(df_x_filter)
print(df_y_filter)
print(df_z_filter)

       question_id     result
0                0          B
1                1          C
2                2         D 
3                3         B 
4                4  Answer: B
...            ...        ...
13877        14037         A 
13878        14038          A
13879        14039          B
13880        14040          B
13881        14041  Answer: A

[13509 rows x 2 columns]
       question_id     result
0                0  Answer: D
1                1          D
2                2  Answer: D
4                4          D
5                5          C
...            ...        ...
13973        14037         C 
13974        14038          D
13975        14039  Answer: D
13976        14040          B
13977        14041         D 

[13637 rows x 2 columns]
       question_id     result
0                0          B
1                1  Answer: B
2                2          C
3                3         B 
4                4          B
...            ...        ...
13918        14037

In [170]:
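#B
# Use the apply() Pandas functionality to remove the "Answer: " prefix so that answers read 'A', 'B', 'C', or 'D'
# Note: whitespace-padded answers such as 'D ' are not changed by this replacement
def clean_answer(s, pattern='Answer: '):
    return str(s).replace(pattern, '')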

# Apply to each dataframe
df_x_filter["result"] = df_x_filter["result"].apply(clean_answer)
df_y_filter["result"] = df_y_filter["result"].apply(clean_answer)
df_z_filter["result"] = df_z_filter["result"].apply(clean_answer)

#and now print the results
print(df_x_filter)
print(df_y_filter)
print(df_z_filter)

#And print the sizes for part C discussion
print('df_x: ', df_x_filter.shape)
print('df_y: ', df_y_filter.shape)
print('df_z: ', df_z_filter.shape)

       question_id result
0                0      B
1                1      C
2                2     D 
3                3     B 
4                4      B
...            ...    ...
13877        14037     A 
13878        14038      A
13879        14039      B
13880        14040      B
13881        14041      A

[13509 rows x 2 columns]
       question_id result
0                0      D
1                1      D
2                2      D
4                4      D
5                5      C
...            ...    ...
13973        14037     C 
13974        14038      D
13975        14039      D
13976        14040      B
13977        14041     D 

[13637 rows x 2 columns]
       question_id result
0                0      B
1                1      B
2                2      C
3                3     B 
4                4      B
...            ...    ...
13918        14037      A
13919        14038      A
13920        14039      B
13921        14040     B 
13922        14041      A

[12878 rows

C. /Discuss:/

Before filtering, the sizes were: df_x: (13882, 2), df_y: (13978, 2), df_z: (13923, 2).

After filtering, they are: df_x: (13509, 2), df_y: (13637, 2), df_z: (12878, 2).

Rows were removed from every dataframe: these are the answers with 10 or more characters (373 rows for X, 341 for Y and 1045 for Z, i.e. about 2.7%, 2.4% and 7.5% of the respective dataframes).

Even though this is a small share of the data, it is still a loss of information, and the loss is uneven: model Z loses roughly three times as many answers as X or Y. If the removed rows follow a systematic pattern (e.g., specific subjects or question types are affected), the comparison between models could be biased.

Instead of dropping long answers, we could normalize them (e.g., extract the letter from "..., so the answer is D" instead of removing the row).
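
The normalization idea could look roughly like the sketch below. It is not the cleaning used in this notebook: the helper `extract_choice` and its regular expression are assumptions that only cover the formats listed in 1.2 (whitespace padding, the "Answer: " prefix, and sentences ending in "the answer is X").

```python
import re
import pandas as pd

VALID_CHOICES = {"A", "B", "C", "D"}

def extract_choice(value):
    """Return the answer letter if one can be recovered, otherwise None."""
    if pd.isna(value):
        return None
    text = str(value).strip()  # removes padding such as 'D '
    if text in VALID_CHOICES:
        return text
    # Look for 'Answer: X' or '... the answer is X' (assumed patterns)
    match = re.search(r"(?:Answer:|answer is)\s*([ABCD])\b", text)
    return match.group(1) if match else None

# Toy values (assumed, for illustration only)
toy_results = pd.Series(["D ", "Answer: C", "Some explanation, so the answer is D", "no idea"])
recovered = toy_results.apply(extract_choice)
print(recovered.tolist())  # ['D', 'C', 'D', None]
```

Rows for which the helper returns `None` would still have to be dropped or inspected by hand.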

### 1.4 (3 pt)

Now that our answer columns are nicely formatted, let's take a look at model performance:

A. Both the `MMLU` dataframes and the language model score data frames have the columns `question_id`. For each of the language model score data frames, use an inner join operation with the `df_test` dataframe on the `question_id` column.

B. Add a new column to each of the resulting dataframes called `correct`, that checks if the model's answer in `result` is the same as the expected answer in the column `answer`. Then, print the average score of each model.

In [171]:
# A
# Merge each LM score dataframe with df_test on the question_id column,
# so that each model answer sits next to the question and its expected answer

# how='inner' keeps only the rows whose question_id appears in both dataframes

df_x_merged = pd.merge(df_x_filter, df_test, on="question_id", how="inner")
df_y_merged = pd.merge(df_y_filter, df_test, on="question_id", how="inner")
df_z_merged = pd.merge(df_z_filter, df_test, on="question_id", how="inner")

# and now we print these merged dataframes
print(df_x_merged)
print(df_y_merged)
print(df_z_merged)

       question_id result                                           question  \
0                0      B  Find the degree for the given field extension ...   
1                1      C  Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i...   
2                2     D   Find all zeros in the indicated finite field o...   
3                3     B   Statement 1 | A factor group of a non-Abelian ...   
4                4      B  Find the product of the given polynomials in t...   
...            ...    ...                                                ...   
13504        14037     A   What has been a central focus of religious tra...   
13505        14038      A   To whom did ordinary folk appeal during a dro...   
13506        14039      B   The theological term homoousios means which o...   
13507        14040      B  According to the Japanese origin myth, who giv...   
13508        14041      A   The numen of Augustus referred to which of th...   

                            A          

In [172]:
# B
# Add a 'correct' column: 1 if the model's result equals the expected answer, 0 otherwise (astype(int) turns True/False into 1/0)
df_x_merged['correct'] = (df_x_merged['result'] == df_x_merged['answer']).astype(int)
df_y_merged['correct'] = (df_y_merged['result'] == df_y_merged['answer']).astype(int)
df_z_merged['correct'] = (df_z_merged['result'] == df_z_merged['answer']).astype(int)

# Calculate the average score for each model
avg_score_x = df_x_merged['correct'].mean()
avg_score_y = df_y_merged['correct'].mean()
avg_score_z = df_z_merged['correct'].mean()

# Print the average scores
print(f" Model X: {avg_score_x:.4f}")
print(f" Model Y: {avg_score_y:.4f}")
print(f" model Z: {avg_score_z:.4f}")

 Model X: 0.5567
 Model Y: 0.5893
 model Z: 0.5947


### 1.5 (5 pt)

Hmmm, something doesn't seem quite right. Let's investigate how "balanced" this dataset is:

A. For each of the 57 subjects in the MMLU, compare the number of questions answered by each model. Print the subjects for which there is a more than 10% difference.

B. Propose and implement a reasonable way to rebalance the results. (e.g., while throwing away 100% of the results perfectly rebalances the results, it is not reasonable).

C. Finally, print the updated accuracy on the rebalanced data.

**hint:**:
- (A) For a given subject, let model X and model Y have answered 181 and 200 questions respectively. You can consider this a 10% difference from the perspective of X, i.e., (200 - 181) / 181 > 0.10

In [173]:
#A
# Group by subject
model_x_bysubject = df_x_merged.groupby('subject')['question_id']
model_y_bysubject = df_y_merged.groupby('subject')['question_id']
model_z_bysubject = df_z_merged.groupby('subject')['question_id']

#count how many questions are answered by these models
x_counts = model_x_bysubject.count()
y_counts = model_y_bysubject.count()
z_counts = model_z_bysubject.count()

# Calculate the percentage differences between each pair of models
# (each difference is relative to the second model of the pair)
X_Y_diff = abs(x_counts - y_counts) / y_counts * 100
X_Z_diff = abs(x_counts - z_counts) / z_counts * 100
Y_Z_diff = abs(y_counts - z_counts) / z_counts * 100

# Select subjects with more than 10% difference in any of the comparisons
subjects_with_diff = X_Y_diff[(X_Y_diff > 10) | (X_Z_diff > 10) | (Y_Z_diff > 10)]

# Print the flagged subjects; the values shown are the X-vs-Y differences only,
# so a subject can appear with a value below 10 when it was flagged by X-vs-Z or Y-vs-Z
print(subjects_with_diff)

subject
college chemistry            1.020408
college computer science     1.020408
computer security            2.040816
formal logic                12.096774
high school geography        0.512821
logical fallacies           12.408759
medical genetics             1.010101
moral disputes               9.539474
moral scenarios             14.532872
Name: question_id, dtype: float64
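
The hint for (A) measures the difference from the perspective of the model with fewer answers, i.e. (200 - 181) / 181. A minimal sketch of that variant, reporting all three pairwise differences, is shown below. The helper name and the toy counts are assumptions for illustration (only 181 and 200 come from the hint).

```python
from itertools import combinations
import pandas as pd

def pairwise_relative_differences(counts):
    """counts: DataFrame indexed by subject, one column of question counts per model."""
    diffs = {}
    for first, second in combinations(counts.columns, 2):
        # Divide by the smaller of the two counts, as in the hint
        smaller = counts[[first, second]].min(axis=1)
        diffs[f"{first} vs {second}"] = (counts[first] - counts[second]).abs() / smaller
    return pd.DataFrame(diffs)

# Toy counts (assumed); subject_a uses the 181 vs 200 example from the hint
toy_counts = pd.DataFrame(
    {"X": [181, 100], "Y": [200, 101], "Z": [195, 100]},
    index=["subject_a", "subject_b"],
)
diffs = pairwise_relative_differences(toy_counts)
flagged = diffs[(diffs > 0.10).any(axis=1)]
print(flagged)
```

With the real data, `toy_counts` would be replaced by `pd.DataFrame({'X': x_counts, 'Y': y_counts, 'Z': z_counts})`.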


In [174]:
#B
# Rebalance by undersampling: every subject is reduced to the same number of
# questions for all three models

# Step 1: Count questions per subject
x_counts = df_x_merged['subject'].value_counts()
y_counts = df_y_merged['subject'].value_counts()
z_counts = df_z_merged['subject'].value_counts()

# Step 2: Find the minimum number of questions per subject
# np.min avoids relying on the built-in min, which raises
# "TypeError: 'int' object is not callable" if a variable named `min` was assigned earlier in the session
min_samples_per_subject = int(np.min([x_counts.min(), y_counts.min(), z_counts.min()]))

# Step 3: Apply undersampling
def undersample(df, min_samples):
    return df.groupby('subject').sample(n=min_samples, random_state=42).reset_index(drop=True)

df_x_balanced = undersample(df_x_merged, min_samples_per_subject)
df_y_balanced = undersample(df_y_merged, min_samples_per_subject)
df_z_balanced = undersample(df_z_merged, min_samples_per_subject)

# Step 4: Verify that subjects are now balanced
print(df_x_balanced['subject'].value_counts())  # Should be the same for all three datasets
print(df_y_balanced['subject'].value_counts())
print(df_z_balanced['subject'].value_counts())


In [None]:
#C
# Step 1: Ensure the 'correct' column exists (True for correct answers, False otherwise)
df_x_balanced['correct'] = df_x_balanced['result'] == df_x_balanced['answer']
df_y_balanced['correct'] = df_y_balanced['result'] == df_y_balanced['answer']
df_z_balanced['correct'] = df_z_balanced['result'] == df_z_balanced['answer']

# Step 2: Calculate accuracy
accuracy_x = df_x_balanced['correct'].mean()
accuracy_y = df_y_balanced['correct'].mean()
accuracy_z = df_z_balanced['correct'].mean()

# Step 3: Print the accuracy
print(f"Accuracy of Model X: {accuracy_x:.2%}")
print(f"Accuracy of Model Y: {accuracy_y:.2%}")
print(f"Accuracy of Model Z: {accuracy_z:.2%}")
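
An alternative way to rebalance, shown here only as a hedged sketch and not as the method used above, is to keep the questions that all three models answered. Every model is then scored on exactly the same questions, and fewer rows are discarded than when each subject is cut down to the global minimum. The helper name and the toy dataframes are assumptions.

```python
import pandas as pd

def keep_common_questions(frames):
    """Keep only rows whose question_id appears in every dataframe of `frames`."""
    common_ids = set.intersection(*(set(df["question_id"]) for df in frames.values()))
    return {
        name: df[df["question_id"].isin(common_ids)].copy()
        for name, df in frames.items()
    }

# Toy dataframes (assumed): each model answered a slightly different set of questions
toy_frames = {
    "X": pd.DataFrame({"question_id": [0, 1, 2], "correct": [1, 0, 1]}),
    "Y": pd.DataFrame({"question_id": [1, 2, 3], "correct": [1, 1, 0]}),
    "Z": pd.DataFrame({"question_id": [1, 2], "correct": [0, 1]}),
}
balanced = keep_common_questions(toy_frames)
for name, df in balanced.items():
    print(name, df["question_id"].tolist(), df["correct"].mean())
```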

## Task 2 (26 points): What do you mean A > D > B > C...?

Nice work! Having successfully inspected, cleaned, and rebalanced the provided data, you head over to the director of the government's FIETS project, operating under the code name Geronimo. He is happy with your work so far, but worried that the sloppy intern might have done more undetected damage. To be sure, he orders a new set of evaluations of all models on both MMLU and another dataset.

After cleaning up and rebalancing, you are left with the concatenated score files in the second folder `task_2`:
```
task_2/
│
└── lm_scores_mmlu.csv
│
└── lm_scores_other.csv
```

Each has a new column called `model_name`, which is one of `X, Y` or `Z`.



_NOTE: **only** use data from `task_2` and `task_2_5` for this assignment! The values in `lm_scores_mmlu.csv` will NOT be the same as the dataframes you finished in task 1. This is due to "randomness" or "temperature" in language model inference. This can slightly shift around generative results. (Conveniently: it also ensures any mistakes made in Task 1 don't propagate further ;) )_

In [None]:
# PROVIDED CODE
df_mmlu = pd.read_csv('data/task_2/lm_scores_mmlu.csv')
df_other = pd.read_csv('data/task_2/lm_scores_other.csv')

### 2.1 (4 pt)

Let's explore the new results:

A. Compute the mean accuracy and standard errors of each model on both datasets and print the results.

B. Then, show your results in a bar plot using standard errors with a 95% confidence interval around the mean. Make sure the plot is easy to read and well annotated.

C. /Discuss:/ the plot you created: (i) can you say that one of the models is the best? (ii) is there anything that seems odd?

In [None]:
#A

def mean_accuracy(df):
    return df.groupby("model_name")["correct"].mean()

def standard_errors(df):
    summary = df.groupby("model_name")["correct"].agg(['std', 'count'])
    return summary['std'] / np.sqrt(summary['count'])


mean_mmlu = mean_accuracy(df_mmlu)
se_mmlu = standard_errors(df_mmlu)

mean_other = mean_accuracy(df_other)
se_other = standard_errors(df_other)

print("Mean accuracy (MMLU dataset): ", mean_mmlu)
print("Standard error (MMLU dataset): ", se_mmlu)
print("Mean accuracy (other dataset): ", mean_other)
print("Standard error (other dataset): ", se_other)
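
Because `correct` is a 0/1 variable, the standard error of the mean accuracy can also be written as sqrt(p(1 - p) / n), which is nearly identical to std / sqrt(n) for large n (pandas' `std` uses n - 1 in the denominator). The sketch below illustrates this on made-up data; the helper `proportion_ci` and the toy values are assumptions.

```python
import numpy as np
import pandas as pd

def proportion_ci(correct, z=1.96):
    """Mean, standard error and normal-approximation CI for a 0/1 series."""
    correct = pd.Series(correct).astype(float)
    n = correct.count()
    p = correct.mean()
    se = np.sqrt(p * (1 - p) / n)
    return p, se, (p - z * se, p + z * se)

# Toy data (assumed): 60 correct answers out of 100
toy_correct = [1] * 60 + [0] * 40
p, se, (ci_low, ci_high) = proportion_ci(toy_correct)
print(f"accuracy={p:.3f}, SE={se:.4f}, 95% CI=({ci_low:.3f}, {ci_high:.3f})")

# For comparison: the std / sqrt(n) version used in the cell above
se_sample = pd.Series(toy_correct).std() / np.sqrt(len(toy_correct))
print(f"std/sqrt(n) = {se_sample:.4f}")
```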


In [None]:
#B
bar_width = 0.4
fig, ax = plt.subplots(figsize=(10, 5))
x = np.arange(len(mean_mmlu))

# Half-width of the 95% confidence interval (normal approximation: 1.96 * SE)
ci_mmlu = 1.96 * se_mmlu
ci_other = 1.96 * se_other

# Bar plot with symmetric error bars showing the 95% confidence interval
ax.bar(x - bar_width/2, mean_mmlu, yerr=ci_mmlu, capsize=5, width=bar_width, label='MMLU')
ax.bar(x + bar_width/2, mean_other, yerr=ci_other, capsize=5, width=bar_width, label='Other Dataset')

# Set up plot visualization
ax.set_xticks(x)
ax.set_xticklabels(mean_mmlu.index)
ax.set_xlabel("Model")
ax.set_ylabel("Mean Accuracy")
ax.set_title("Model Performance (error bars: 95% confidence interval)")
ax.legend(title="Dataset")


# Show plot
plt.tight_layout()
plt.show()

### Answer C:
(i) We could say that model X is the best model, since its overall accuracy is higher than that of Y and Z. (ii) However, the performance of model X changes depending on the dataset, and the same happens with model Y, whereas the performance of model Z remains practically the same across the two datasets.

### 2.2 (5 pt)

Geronimo has assured you that both datasets contain questions of similar difficulty, so, what could be going on here?

A. What is the distribution of correct answers (A, B, C, D) for each dataset? Create a bar chart to visualize this.

B. Perform a chi-square test of independence at $\alpha = 0.05$ to determine if there's a significant difference in the distribution of correct answers between the two datasets. What do you conclude?

**hints**:
- for (A), keep in mind that df_mmlu and df_other contain the results of all models, i.e., the `question_id` column is duplicated.
- for (A), take care to clearly annotate the bar chart, e.g., title, y-label, legend.
- for (B), clearly state the null hypothesis and alternative hypothesis
- use the `chi2_contingency` function from `scipy.stats`
- format your results from answer (A) as a 2D array

In [None]:
#A

# Displaying purposes
fig, ax = plt.subplots(figsize=(10,6))
bar_width = 0.35
answers = ['A', 'B', 'C', 'D']
x = np.arange(len(answers))

# Each question appears once per model, so keep one row per question_id
df_mmlu_questions = df_mmlu.drop_duplicates(subset='question_id')
df_other_questions = df_other.drop_duplicates(subset='question_id')

# Count how often each option (A, B, C or D) is the correct answer
counts_mmlu = df_mmlu_questions['answer'].value_counts().reindex(answers, fill_value=0)
counts_other = df_other_questions['answer'].value_counts().reindex(answers, fill_value=0)

# Set up bar chart
ax.bar(x - bar_width/2, counts_mmlu.values, width=bar_width, label='MMLU Dataset')
ax.bar(x + bar_width/2, counts_other.values, width=bar_width, label='Other Dataset')

ax.set_xticks(x)
ax.set_xticklabels(answers)
ax.set_title("Distribution of Correct Answers (A, B, C, D)")
ax.set_xlabel("Correct Answer Option")
ax.set_ylabel("Number of Questions")
ax.legend(title="Dataset")

plt.show()



In [None]:
#B
from scipy.stats import chi2_contingency

# Contingency table: rows = datasets, columns = answer options A-D
contingency_table = np.array([counts_mmlu.values, counts_other.values])

# Perform chi-square test of independence
chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)

print("Null hypothesis: The distribution of correct answers is the same for both datasets")
print("Alternative hypothesis: The distribution of correct answers differs between the two datasets")
print("Chi-Square statistic: ", chi2_stat)
print("Degrees of freedom: ", dof)
print("p-value: ", p_value)
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
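
To see what `chi2_contingency` returns, here is a minimal sketch on a made-up 2 x 4 table (the counts are assumptions, not the real answer distributions). Under the null hypothesis, each expected count is row total * column total / grand total, and the test has (2 - 1) * (4 - 1) = 3 degrees of freedom.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Toy contingency table (assumed): rows = datasets, columns = options A, B, C, D
toy_table = np.array([
    [250, 250, 250, 250],  # a balanced dataset
    [400, 200, 200, 200],  # a dataset where 'A' is over-represented
])

chi2_stat, p_value, dof, expected = chi2_contingency(toy_table)
print("Chi-square statistic:", round(chi2_stat, 2))
print("Degrees of freedom:", dof)
print("Expected counts under the null hypothesis:")
print(expected)
print("Reject the null hypothesis" if p_value < 0.05 else "Fail to reject the null hypothesis")
```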

### 2.3 (7 pt)

Let's dive in deeper:

A. What is language model X's mean accuracy conditioned on the four answer options for each dataset?

B. Compare LM X's performance when the correct answer is "A" between the two datasets. Use a T-test with CI = 0.95. What do you conclude?

C. Compare LM X's performance when the correct answer is "A" vs. "C or D" for each dataset. Use a T-test with CI = 0.95. What do you conclude?

In [None]:
#A

# Filter df to only focus on model X
df_mmlu_x = df_mmlu[df_mmlu['model_name'] == 'X']
df_other_x = df_other[df_other['model_name'] == 'X']

# Group by answer choices and calculate mean accuracy
mmlu_mean_x = df_mmlu_x.groupby('answer')['correct'].mean()
other_mean_x = df_other_x.groupby('answer')['correct'].mean()

print("These are the following mean accuracies for model X: \n")
print("MMLU dataset:")
print(mmlu_mean_x)
print("\nOther dataset:")
print(other_mean_x)

In [None]:
#B
def standard_errors_x_a(df):
    result = df.agg(['std','count'])
    return result['std'] / np.sqrt(result['count'])

# Filter df to only consider cases in which the answer is A
df_mmlu_x_a = df_mmlu_x[df_mmlu_x['answer'] == 'A']
df_other_x_a = df_other_x[df_other_x['answer'] == 'A']

mean_mmlu_x_a = df_mmlu_x_a['correct'].mean()
mean_other_x_a = df_other_x_a['correct'].mean()

# Perform T-test
t_stat, p_value = ttest_ind(df_mmlu_x_a['correct'], df_other_x_a['correct'])

mean_diff = mean_mmlu_x_a - mean_other_x_a

se_mmlu_x_a = standard_errors_x_a(df_mmlu_x_a['correct'])
se_other_x_a = standard_errors_x_a(df_other_x_a['correct'])

se_diff = np.sqrt(se_mmlu_x_a**2 + se_other_x_a**2)


ci_lower = mean_diff - 1.96*se_diff
ci_upper = mean_diff + 1.96*se_diff

# Display the results
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")
print(f"Mean difference: {mean_diff}")
print(f"95% Confidence interval: [{ci_lower}, {ci_upper}]")
print("Null hypothesis: There is no significant difference between the two datasets when the correct answer is A")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
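
The t-test above uses the default `equal_var=True`, which assumes both groups have the same variance. When the two groups differ in size and accuracy, Welch's t-test (`equal_var=False`) is a common alternative. The sketch below runs it on simulated 0/1 data; the accuracies and sample sizes are made up for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

# Simulated 0/1 correctness values (assumed accuracies of 0.9 and 0.6)
rng = np.random.default_rng(0)
group_a = rng.binomial(1, 0.9, size=500)
group_b = rng.binomial(1, 0.6, size=800)

# Welch's t-test: does not assume equal variances in the two groups
t_stat, p_value = ttest_ind(group_a, group_b, equal_var=False)
print(f"T-statistic: {t_stat:.2f}")
print(f"P-value: {p_value:.3g}")
print("Reject the null hypothesis" if p_value < 0.05 else "Fail to reject the null hypothesis")
```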



In [None]:
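#C
# Reuses df_mmlu_x_a, df_other_x_a and their means from part B
# Series version of the helper (standard error of the mean = std / sqrt(n));
# it replaces the grouped standard_errors() from 2.1 for the rest of the session
def standard_errors(df):
    result = df.agg(['std','count'])
    return result['std'] / np.sqrt(result['count'])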

# Filter df to only consider cases in which the answer is C or D
df_mmlu_x_c_d = df_mmlu_x[df_mmlu_x['answer'].isin(['C','D'])]
df_other_x_c_d = df_other_x[df_other_x['answer'].isin(['C','D'])]

mean_mmlu_x_c_d = df_mmlu_x_c_d['correct'].mean()
mean_other_x_c_d = df_other_x_c_d['correct'].mean()

# Perform T-test
# Mean of each df is done in ttest_ind automatically
t_stat_mmlu, p_value_mmlu = ttest_ind(df_mmlu_x_a['correct'], df_mmlu_x_c_d['correct'])
t_stat_other, p_value_other = ttest_ind(df_other_x_a['correct'], df_other_x_c_d['correct'])

# Calculations for confidence interval for mmlu dataset
mean_diff_mmlu = mean_mmlu_x_a - mean_mmlu_x_c_d
se_mmlu_x_a = standard_errors(df_mmlu_x_a['correct'])
se_mmlu_x_c_d = standard_errors(df_mmlu_x_c_d['correct'])

se_diff_mmlu = np.sqrt(se_mmlu_x_a**2 + se_mmlu_x_c_d**2)


# Calculations for confidence interval for other dataset
mean_diff_other = mean_other_x_a - mean_other_x_c_d
se_other_x_a = standard_errors(df_other_x_a['correct'])
se_other_x_c_d = standard_errors(df_other_x_c_d['correct'])

se_diff_other = np.sqrt(se_other_x_a**2 + se_other_x_c_d**2)

alpha = 0.05  # significance level used for both tests

# Confidence interval calculation
ci_lower_mmlu = mean_diff_mmlu - 1.96 * se_diff_mmlu
ci_upper_mmlu = mean_diff_mmlu + 1.96 * se_diff_mmlu

ci_lower_other = mean_diff_other - 1.96 * se_diff_other
ci_upper_other = mean_diff_other + 1.96 * se_diff_other

# Results
print("MMLU dataset:")
print(f"T-statistic: {t_stat_mmlu}")
print(f"P-value: {p_value_mmlu}")
print(f"Mean Difference: {mean_diff_mmlu}")
print(f"95% Confidence Interval: ({ci_lower_mmlu}, {ci_upper_mmlu}) \n")

print("Other dataset:")
print(f"T-statistic: {t_stat_other}")
print(f"P-value: {p_value_other}")
print(f"Mean Difference: {mean_diff_other}")
print(f"95% Confidence Interval: ({ci_lower_other}, {ci_upper_other})")

print("\nNull hypothesis for MMLU dataset: There is no significant difference between 'A' and 'C or D'")
print("Result:")
if p_value_mmlu < alpha:
    print("Reject null hypothesis for MMLU")
else:
    print("Fail to reject the null hypothesis for MMLU")

print("\nNull hypothesis for Other dataset: There is no significant difference between 'A' and 'C or D'")
print("Result:")
if p_value_other < alpha:
    print("Reject null hypothesis for Other")
else:
    print("Fail to reject null hypothesis for Other")
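
Because `correct` is a 0/1 outcome, the comparisons above are in effect comparisons of two proportions. The following is a minimal sketch of the normal-approximation interval for the difference between two accuracies. The function name and the counts are made up for illustration, and the standard error uses p(1-p)/n, which differs slightly from the sample standard deviation used in the cells above.

```python
import math

def diff_of_proportions_ci(successes_1, n_1, successes_2, n_2, z=1.96):
    """Normal-approximation CI for p1 - p2 (z=1.96 gives roughly 95%)."""
    p_1 = successes_1 / n_1
    p_2 = successes_2 / n_2
    # Standard errors of the two proportions, combined as for independent samples
    se_diff = math.sqrt(p_1 * (1 - p_1) / n_1 + p_2 * (1 - p_2) / n_2)
    diff = p_1 - p_2
    return diff, diff - z * se_diff, diff + z * se_diff

# Made-up counts, for illustration only (not taken from the clinic data)
diff, ci_lower, ci_upper = diff_of_proportions_ci(90, 100, 70, 100)
print(f"difference = {diff:.3f}, 95% CI = [{ci_lower:.3f}, {ci_upper:.3f}]")
```

An interval that excludes 0 points in the same direction as rejecting the null hypothesis at the 5% level.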


### 2.4 (2 pt)

What an intriguing finding! 

A. Print the mean accuracies conditioned on the correct answer for all LMs for each dataset.

B. /Discuss:/ What do you observe?

In [None]:
#A

# Group df by model and answer choice, then compute mean accuracy and generate df with each combination
mean_mmlu = df_mmlu.groupby(['model_name', 'answer'])['correct'].mean().unstack()
mean_other = df_other.groupby(['model_name', 'answer'])['correct'].mean().unstack()

print(mean_mmlu)
print(mean_other)

B. /Discuss:/

For the first dataset, we observe the following:
- Model X has high accuracy when the correct answer is A; its accuracy decreases for B, C and D, with D being the lowest.
- Model Y performs best when D is the correct answer; its accuracy decreases for A, B and C, especially for A, where it is 62%.
- Model Z's accuracy is consistent across the answer choices, ranging between 64% and 67%.

For the second dataset, we observe the following:
- Model X follows practically the same pattern as in the first dataset, but its accuracy ranges from 97% down to 60%, compared with 97% down to 63% in the first dataset.
- Model Y behaves similarly to the first dataset, with a slightly different accuracy range of 92% to 62%.
- Model Z's accuracy is also consistent here and slightly higher than in the first dataset: 66% to 68%, compared with 64% to 67%.

In conclusion, model Z is the most consistent across answer choices but reaches a lower peak accuracy, while model X is always the best model when A is the correct answer and model Y is the best when D is the correct answer.
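
One way to put a number on "consistency" is the gap between a model's best and worst accuracy over the four answer positions. The sketch below uses made-up accuracies and a hypothetical helper name; it is an illustration, not the notebook's output.

```python
# Made-up per-answer accuracies, for illustration only
accuracy_by_answer = {
    "model_1": {"A": 0.95, "B": 0.70, "C": 0.68, "D": 0.65},
    "model_2": {"A": 0.66, "B": 0.65, "C": 0.67, "D": 0.64},
}

def position_spread(per_answer_accuracy):
    """Gap between the best and worst accuracy over the answer positions."""
    values = list(per_answer_accuracy.values())
    return max(values) - min(values)

for model, per_answer in accuracy_by_answer.items():
    print(f"{model}: spread = {position_spread(per_answer):.2f}")
```

A small spread means accuracy barely depends on where the correct answer sits; a large spread hints at a position effect.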


### 2.5 (2 pt)

Concerned with your findings so far, you quickly consult with Geronimo. After thinking it over, Geronimo concludes that more tests are needed. He orders a second round of MMLU results. However, Geronimo thinks of the following twist: while keeping questions fixed, he randomly permutes the position of the correct answer. The new results can be found in the folder `data/task_2_5/`:
```
task_2_5/
│
└── lm_scores_mmlu_shuffle.csv
```

/Discuss:/ Why would Geronimo do this?


Geronimo does this to rule out a potential position bias in the answer choices. Models X and Y perform better when the correct answer sits at a particular option, which suggests that they may be relying on the position of the correct answer rather than on its content.
By randomly permuting the position of the correct answer while keeping the questions fixed, Geronimo can check whether the models' accuracy depends on that position.
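
As a rough illustration of such a permutation, the sketch below shuffles the option texts of one made-up question and tracks where the correct answer ends up. The function name and the data are hypothetical; how the shuffled file was actually produced is not stated in the assignment.

```python
import random

def shuffle_options(options, correct_label, rng):
    """Permute option texts over the labels and return the new correct label.

    `options` maps a label ("A".."D") to its text.
    """
    labels = sorted(options)
    order = labels[:]          # order[i] is the old label moved to position i
    rng.shuffle(order)
    shuffled = {new: options[old] for new, old in zip(labels, order)}
    new_correct = labels[order.index(correct_label)]
    return shuffled, new_correct

# Made-up question "2 + 2 = ?" with correct answer "B"
options = {"A": "3", "B": "4", "C": "5", "D": "6"}
rng = random.Random(0)  # fixed seed so the shuffle is reproducible
shuffled, new_correct = shuffle_options(options, "B", rng)
print(shuffled, new_correct)
```

The question content is unchanged; only the label attached to the correct text moves.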

### 2.6 (4 pt)

Increasingly sceptical of the language models' performance, you read up on proper testing practices. You stumble upon the concept of [test-retest stability](https://en.wikipedia.org/wiki/Repeatability), which roughly states that:

"_Measurements taken by a single person or instrument on the same item, under the same conditions, and in a short period of time, should have the same results._"

In our case, we would assume an LM would have the same performance on a given question regardless of the correct answer position. One way of testing this is by using the following metric:

$$\text{test-retest metric} = \frac{1}{N}\sum_{i=1}^N \frac{1}{M}\sum_{j=1}^M c^i_0 c_j^i,$$

where $c^i_0 \in \{0, 1\}$ indicates whether the model answers the $i^{\text{th}}$ question correctly (1 if correct, 0 if incorrect). $c_j^i$ indicates whether the model answers the $i^{\text{th}}$ question correctly in the $j^{\text{th}}$ shuffled version of the answer label content. Finally, $M$ is the total number of shuffles and $N$ is the dataset size.
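
To make the formula concrete, here is a small worked sketch on made-up data. Results are keyed by a hypothetical question identifier so that each $c^i_0$ is paired with the $c^i_j$ of the same question; the function name is also an assumption.

```python
def compute_test_retest(original, shuffled_runs):
    """Mean over questions of the mean over shuffles of c_0 * c_j.

    original: dict question_id -> 0/1 (c_0)
    shuffled_runs: list of M dicts, each question_id -> 0/1 (c_j)
    """
    m = len(shuffled_runs)
    total = 0.0
    for question_id, c_0 in original.items():
        total += sum(c_0 * run[question_id] for run in shuffled_runs) / m
    return total / len(original)

# Made-up results for N = 4 questions and M = 1 shuffle
original = {"q1": 1, "q2": 1, "q3": 0, "q4": 1}
shuffled_runs = [{"q1": 1, "q2": 0, "q3": 1, "q4": 1}]

accuracy = sum(original.values()) / len(original)
metric = compute_test_retest(original, shuffled_runs)
print(accuracy, metric)  # 0.75 0.5
```

A question only counts when it is answered correctly both before and after shuffling, so the metric can never exceed the original accuracy.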

Task: compute the test-retest metric for each language model using the original `lm_scores_mmlu.csv` file and the new `lm_scores_mmlu_shuffle.csv` file. Using a bar plot, visualize your results by comparing the accuracy of the original `lm_scores_mmlu.csv` and the test-retest scores.

**hints**
- what is $M$ in our case?

(bonus: no points, but so much sweet, sweet knowledge - check out [the following article](https://arxiv.org/pdf/2406.19470v1))

In [None]:
#fancy code

# Loading datasets
df_original = pd.read_csv('data/task_2/lm_scores_mmlu.csv')  # Loaded again for clarity purposes
df_shuffled = pd.read_csv('data/task_2_5/lm_scores_mmlu_shuffle.csv')

# Filter df
grouped_original = df_original.groupby('model_name')
grouped_shuffled = df_shuffled.groupby('model_name')

# Mean accuracy
mean_original = grouped_original["correct"].mean()


# Computes test-retest metric
def test_retests(group_original, group_shuffled):

    c_original = group_original['correct'].astype(int)
    c_shuffled = group_shuffled['correct'].astype(int)

    # The product aligns rows on the DataFrame index, so this assumes both
    # files list the same questions in the same order
    test_retest = (c_original * c_shuffled).mean()

    return test_retest

test_retest_scores = {}
for model in grouped_original.groups:

    model_original = grouped_original.get_group(model)
    model_shuffled = grouped_shuffled.get_group(model)

    test_retest_scores[model] = test_retests(model_original, model_shuffled)


models = list(test_retest_scores.keys())
scores = list(test_retest_scores.values())
mean_scores = mean_original[models].values
x = np.arange(len(models))

# Visualization
bar_width = 0.35  # width of each bar; the two bars per model sit side by side
fig, ax = plt.subplots(figsize=(10, 6))

ax.bar(x - bar_width / 2, mean_scores, bar_width, label='Original Accuracy', color='blue')
ax.bar(x + bar_width / 2, scores, bar_width, label='Test-Retest Metric', color='green')

# Customize the plot
ax.set_xlabel('Models')
ax.set_ylabel('Accuracy / Test-Retest Metric')
ax.set_title('Comparison of Original Accuracy and Test-Retest Metric for Each Model')
ax.set_xticks(x) 
ax.set_xticklabels(models)
ax.legend()

plt.ylim(0, 1) # Displaying purposes
plt.show()


### 2.7 (2 pt)

A. Using the unshuffled data: For each LM, print the distribution of the answers they give as well as the accuracy conditioned on the answer they give.

B. /Discuss:/ Describe what you observe

[bonus: not scored, but again _that sweet, sweet knowledge_] Could you think of a plausible explanation?

In [None]:
#A

# Load dataset again for clarity purposes
df = pd.read_csv('data/task_2/lm_scores_mmlu.csv')

distribution = df.groupby('model_name')['answer'].value_counts(normalize=True).unstack()
mean = df.groupby(['model_name', 'answer'])['correct'].mean().unstack()

print(distribution)
print(mean)


B. /Discuss:/

The first table shows the distribution of answers, and we observe the following:
- The share of each answer is similar across all models, which suggests that the models answer the questions with similar tendencies.

The second table shows the accuracy of each model for the different answer choices, and we observe the following:
- Model X performs best when the correct answer is A, and model Y shows the same behaviour for D. Model Z's performance is consistent across all answer choices, but its overall accuracy is not as good as that of models X and Y.

In conclusion, models X and Y each perform very well on a different answer choice (A for model X and D for model Y), whereas model Z is more consistent but performs worse overall.
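
The task asks for statistics conditioned on the answer each model *gives*, which is not necessarily the correct answer. The sketch below shows that computation on made-up rows; the `response` field and the helper name are hypothetical, since the name of the column holding the model's own answer is not shown in this section.

```python
from collections import Counter, defaultdict

# Made-up rows: (model_name, response, correct); `response` is a hypothetical field
rows = [
    ("model_1", "A", 1), ("model_1", "A", 0), ("model_1", "B", 1), ("model_1", "A", 1),
    ("model_2", "D", 1), ("model_2", "C", 0), ("model_2", "D", 1), ("model_2", "D", 0),
]

def response_summary(rows):
    """Per model: share of each given answer and accuracy when giving it."""
    counts = defaultdict(Counter)   # how often each answer is given
    hits = defaultdict(Counter)     # how often that given answer is correct
    for model, response, correct in rows:
        counts[model][response] += 1
        hits[model][response] += correct
    summary = {}
    for model, per_response in counts.items():
        total = sum(per_response.values())
        summary[model] = {
            response: {"share": n / total, "accuracy": hits[model][response] / n}
            for response, n in per_response.items()
        }
    return summary

summary = response_summary(rows)
for model, per_response in summary.items():
    print(model, per_response)
```

A model that gives one letter far more often than the others, with lower accuracy on that letter, would be consistent with a preference for that position.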


## Task 3 (16 points): What do Questions and Answers look like for a Language Model?

While you feel pretty good about the tests you conducted so far, something still bothers you: what if the language models don't see the data like you do? Suddenly, you receive a phone call from a wise AI sage based in Maastricht named Yodata:

```
"Hmmm, correct you are, jonge padawan, to question how the wereld is seen by large language models! Simple 'text,' it is not, nee nee nee! Characters and words, the way of gewone humans, this is not, heh heh heh.

'Tokens,' they use, ja! Mysterious and powerful, these tokens are. Expand our vocabulary, they do, beyond the simple 'a to Z.' Chunky blocks of text, they become, yes! 'Hello world,' a simple phrase it may seem. But to a language model, '[24912, 2375]' it might appear, hmm? Verwarrend, it is!

Wise, it would be, to explore these MMLU data points through the eyes of a language model, you think? Yes, yes! Much to learn, there is. The ways of the tokens, understand you must, if truly comprehend the great LMs, you wish to.

Meditate on this, you should. The force of natural language processing, strong it is. But geduld, you must have, my jonge padawan. For only through great study and contemplation, will the mysteries of the tokens reveal themselves to you, they will. Ja, hmmm!"
```

Admittedly, Yodata at times speaks in riddles... However, he was explaining a crucial aspect of modern LMs called [Tokenization](https://learn.microsoft.com/en-us/dotnet/ai/conceptual/understanding-tokens):


“Tokens are words, character sets, or combinations of words and punctuation that are used by [language models (LMs)] to decompose text into. Tokenization is the first step in training”

Instead of characters, LMs process natural language using “tokens”. While this is useful for a number of reasons, it does at times introduce some “unintuitive” behavior…

In [None]:
# PROVIDED CODE

try:
    import tiktoken
except Exception as e:
    print('installing tiktoken package')
    
    %pip install tiktoken
    
    import tiktoken

def tokenize_text(s):
    enc = tiktoken.encoding_for_model('gpt-4o')
    tokens = enc.encode(str(s))
    return tokens

example_string = 'hello world'
print(f'humans see: "{example_string}" --> language models see: {tokenize_text(example_string)}')

### 3.1 (5 pt)

Use the provided code in the cell above to "see the world through the eyes of a language model":

A. Tokenize the questions of the original MMLU data provided in task 1: `task_1/mmlu_data/test.csv` and plot the token distribution (the frequency of each token).

B. Same as (A), but now for the answers in columns (columns "A", "B", "C", and "D").

C. Isolate the tokens for the strings "A", "B", "C", and "D", then, for their occurrences in both questions and answers, print their relative distribution to each other.

**hint**
- There are a _lot_ of tokens, consider using a cutoff point and log scale
- For (c), they should sum to 1

In [None]:
#A
def tokenize(s):
    enc = tiktoken.encoding_for_model('gpt-4o')
    tokens = enc.encode(str(s))
    return tokens

df_mmlu["question_tokens"] = df_mmlu["question"].apply(tokenize)

all_question_tokens = [token for tokens in df_mmlu["question_tokens"] for token in tokens]

question_token_counts = pd.Series(all_question_tokens).value_counts()

## CODE FOR FREQUENCY OF FREQUENCY
#frequency_counts = question_token_counts.value_counts().sort_index()
#size_frec = len(frequency_counts)
#print('size_frec:', size_frec)
#max_frec = frequency_counts.max()
#min_frec = frequency_counts.min()
#print('max: ', max_frec, 'min: ', min_frec)
#print('frequency_counts', frequency_counts.head(3))
#plt.figure(figsize=(10, 5))
#plt.hist(frequency_counts, bins=size_frec, alpha=0.7, color="blue", label="Questions")
#plt.xlabel("Frequency")
#plt.ylabel("Frequency of that frequency")
#plt.title("Frequency distribution")
#plt.legend()
#plt.show()

#CUTOFF
question_token_counts = question_token_counts[question_token_counts >= 10]

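size = len(question_token_counts)
#print('size:', size)
# Named max_count/min_count so the built-in max() and min() are not shadowed
max_count = question_token_counts.max()
min_count = question_token_counts.min()
#print('max: ', max_count, 'min: ', min_count)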

plt.figure(figsize=(10, 5))
plt.hist(all_question_tokens, bins=size, alpha=0.7, color="blue", label="Questions")
plt.xlabel("Token ID")
plt.ylabel("Frequency (log)")
plt.title("Token Distribution in Questions (MMLU)")
plt.yscale("log")
plt.legend()
plt.show()

In [None]:
#B
def tokenize(s):
    enc = tiktoken.encoding_for_model('gpt-4o')
    tokens = enc.encode(str(s))
    return tokens
    
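# Flatten the four answer columns into one Series (one entry per answer option)
array_merged = pd.Series(df_mmlu[["A", "B", "C", "D"]].values.flatten())

# Kept as its own Series: it has four entries per question, so assigning it
# to a df_mmlu column would align on the index and drop most of the answers
abcd_tokens = array_merged.apply(tokenize)

all_abcd_tokens = [token for tokens in abcd_tokens for token in tokens]

abcd_token_counts = pd.Series(all_abcd_tokens).value_counts()

#CUTOFF
abcd_token_counts = abcd_token_counts[abcd_token_counts >= 6]

size = len(abcd_token_counts)
#print('size:', size)
# Named max_count/min_count so the built-in max() and min() are not shadowed
max_count = abcd_token_counts.max()
min_count = abcd_token_counts.min()
#print('max: ', max_count, 'min: ', min_count)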

#frequency_counts = abcd_token_counts.value_counts().sort_index()
#size_frec = len(frequency_counts)
#print('size_frec:', size_frec)
#max_frec = frequency_counts.max()
#min_frec = frequency_counts.min()
#print('max: ', max_frec, 'min: ', min_frec)
#print('frequency_counts', frequency_counts.head(7))
#plt.figure(figsize=(10, 5))
#plt.hist(frequency_counts, bins=size_frec, alpha=0.7, color="blue", label="Questions")
#plt.xlabel("Frequency")
#plt.ylabel("Frequency of that frequency")
#plt.title("Frequency distribution")
#plt.legend()
#plt.show()

plt.figure(figsize=(10, 5))
plt.hist(all_abcd_tokens, bins=size, alpha=0.7, color="blue", label="Answers")
plt.xlabel("Token ID")
plt.ylabel("Frequency (log)")
plt.title("Token Distribution in A, B, C and D (MMLU)")
plt.yscale("log")
plt.legend()
plt.show()

In [None]:
#C
df_mmlu["question_tokens"] = df_mmlu["question"].apply(tokenize)

for col in ["A", "B", "C", "D"]:
    df_mmlu[f"{col}_tokens"] = df_mmlu[col].apply(tokenize)

all_question_tokens = [token for tokens in df_mmlu["question_tokens"] for token in tokens]
question_token_count = len(all_question_tokens)


answer_token_counts = {col: len([token for tokens in df_mmlu[f"{col}_tokens"] for token in tokens]) for col in ["A", "B", "C", "D"]}
total_answer_tokens = sum(answer_token_counts.values())

normalized_distribution = {col: answer_token_counts[col] / total_answer_tokens if total_answer_tokens > 0 else 0 for col in ["A", "B", "C", "D"]}

relative_distribution = {}
for col1 in ["A", "B", "C", "D"]:
    relative_distribution[col1] = {}
    for col2 in ["A", "B", "C", "D"]:
        if answer_token_counts[col2] > 0:
            relative_distribution[col1][col2] = answer_token_counts[col1] / answer_token_counts[col2]

print("Relative distribution of tokens in responses (A, B, C, D) with respect to the total:")
print(normalized_distribution)
print("Total sum (should be 1):", sum(normalized_distribution.values()))

print("\nRelative distribution of each option compared to the others:")
relative_df = pd.DataFrame(relative_distribution)
print(relative_df)
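
The cell above compares the total number of tokens per answer column. Part C can also be read as counting how often the tokens for the strings "A", "B", "C", and "D" themselves occur. The sketch below illustrates that reading on made-up token lists. The token IDs are placeholders; in the notebook they would come from tokenizing each letter, assuming each letter maps to a single token.

```python
from collections import Counter

def relative_letter_distribution(token_lists, letter_token_ids):
    """Share of each letter token among all letter-token occurrences.

    token_lists: iterable of token-ID lists (questions and answers together)
    letter_token_ids: dict letter -> token ID (placeholder IDs in this sketch)
    """
    id_to_letter = {token_id: letter for letter, token_id in letter_token_ids.items()}
    counts = Counter({letter: 0 for letter in letter_token_ids})
    for tokens in token_lists:
        for token in tokens:
            if token in id_to_letter:
                counts[id_to_letter[token]] += 1
    total = sum(counts.values())
    return {letter: (count / total if total else 0.0) for letter, count in counts.items()}

# Placeholder IDs and made-up token lists, for illustration only
letter_token_ids = {"A": 1, "B": 2, "C": 3, "D": 4}
token_lists = [[1, 9, 9, 2], [1, 3, 7], [4, 1, 8]]

letter_distribution = relative_letter_distribution(token_lists, letter_token_ids)
print(letter_distribution)
print("Total sum (should be 1):", sum(letter_distribution.values()))
```

Tokens that are not one of the four letters (IDs 7, 8 and 9 here) are ignored, so the four shares sum to 1 as the hint requires.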


### 3.2 (3 pt)

What if the number of "A", "B", "C", and "D" tokens in the question and answer pairs could influence a language model's decisions?

A. For each question-answer pair, compute: 
1. the number of "A", "B", "C", and "D" tokens that occur in the combined question and answers; 
2. the total number of tokens;
3. then, group by the "correct" answer and compute the mean frequency of A, B, C, and D tokens and the total number of tokens. 
4. finally, print your results

B. /Discuss:/ What do you think of the hypothesis that the frequency of A, B, C, and D tokens could influence answers?


In [None]:
#A
#step 1
df_mmlu["question_tokens"] = df_mmlu["question"].apply(tokenize)

for col in ["A", "B", "C", "D"]:
    df_mmlu[f"{col}_tokens"] = df_mmlu[col].apply(tokenize)

#step 2
df_mmlu["total_tokens"] = df_mmlu["question_tokens"].apply(len) + df_mmlu["A_tokens"].apply(len) + \
                          df_mmlu["B_tokens"].apply(len) + df_mmlu["C_tokens"].apply(len) + \
                          df_mmlu["D_tokens"].apply(len)

#step 3
df_mmlu["A_token_count"] = df_mmlu["A_tokens"].apply(len)
df_mmlu["B_token_count"] = df_mmlu["B_tokens"].apply(len)
df_mmlu["C_token_count"] = df_mmlu["C_tokens"].apply(len)
df_mmlu["D_token_count"] = df_mmlu["D_tokens"].apply(len)

token_stats_df = df_mmlu[["answer", "A_token_count", "B_token_count", "C_token_count", "D_token_count", "total_tokens"]]
grouped_stats = token_stats_df.groupby("answer").mean()

#step 4
print("Mean Frequency of A, B, C, D Tokens and Total Tokens by Correct Answer:")
print(grouped_stats)


B. /Discuss:/

**Can the frequency of A, B, C, and D tokens influence a language model's decisions?**

Based on the table, the total number of tokens stays roughly constant across the correct answers, with only small differences. "B" has the lowest total token count (89.37) and "C" the highest (92.38). This suggests that the dataset is fairly balanced in terms of total token count, which reduces the likelihood of a strong length-based bias.

When the correct answer is A, B, or C, the correct option also has the highest token count. This pattern does not hold for "D": when "D" is the correct answer, its D_token_count (8.98) is still lower than the count for "C" (9.06).

The differences are small, indicating only a weak correlation between the correct answer and token frequency. A model might pick up on such subtle patterns, but if token frequency had a strong influence, we would expect a clear imbalance in which the correct answer consistently has more tokens. Since that is not the case, token frequency alone is unlikely to significantly affect model predictions.
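
The hypothesis can also be checked by counting the letter tokens themselves in each combined question-and-answers text and averaging per correct answer. The sketch below shows this on made-up rows with placeholder token IDs; the helper name and all data are assumptions for illustration.

```python
from collections import defaultdict

def mean_letter_counts_by_answer(rows, letter_token_ids):
    """rows: list of (correct_answer, token_ids) pairs, one per question.

    token_ids holds the tokens of the question and its four options combined.
    """
    sums = defaultdict(lambda: defaultdict(float))
    n_rows = defaultdict(int)
    for answer, tokens in rows:
        n_rows[answer] += 1
        for letter, token_id in letter_token_ids.items():
            sums[answer][letter] += tokens.count(token_id)
        sums[answer]["total_tokens"] += len(tokens)
    return {
        answer: {key: value / n_rows[answer] for key, value in per_key.items()}
        for answer, per_key in sums.items()
    }

# Placeholder IDs and made-up rows, for illustration only
letter_token_ids = {"A": 1, "B": 2, "C": 3, "D": 4}
rows = [("A", [1, 1, 9, 2]), ("A", [9, 9, 3]), ("B", [2, 4, 4, 9, 9])]

means = mean_letter_counts_by_answer(rows, letter_token_ids)
for answer, stats in means.items():
    print(answer, stats)
```

If letter frequency mattered, the row for a given correct answer would show a clearly higher mean count for its own letter than the other rows do.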


### 3.3 (4 pt)

Three of the most important considerations when deciding between language models are:

- Quality
- Costs
- Speed

So far, much of your analysis has focused on quality. However, the government has indicated that they are quite concerned about both the total costs and speed as well. Specifically, it has been brought to their attention that a new `turbo` model has been launched! 

This model is both cheaper and faster than the models you evaluated so far. However, there is a catch: the context length* is much smaller than that of the other LMs. Namely, it can only process **300** tokens during inference. Meanwhile, the other models can process up to 100K tokens! 

*_The “context length” refers to the number of tokens that can be given to an LM as input._

A. Are there subjects where using the cheaper model might be problematic? I.e., where part of the question and answer(s) might not fit completely in the context?

B. /Discuss:/ Can you think of a strategy that would balance the needs of the government?

**hint**:
- An LM needs to have both the question and the different answer options in its context

In [None]:

#A
enc = tiktoken.encoding_for_model('gpt-4o')

# Returns the number of tokens; named count_tokens so the provided
# tokenize_text (which returns the token list) is not overwritten
def count_tokens(s):
    return len(enc.encode(str(s)))

# Compute token count for each question + its answer choices
df_test['total_tokens'] = df_test.apply(
    lambda row: count_tokens(row['question']) + 
                count_tokens(row['A']) + 
                count_tokens(row['B']) + 
                count_tokens(row['C']) + 
                count_tokens(row['D']), 
    axis=1
)

subject_analysis = df_test.groupby('subject').apply(
    lambda g: pd.Series({
        'questions_exceeding_limit': (g['total_tokens'] > 300).sum(),
        'total_questions': len(g), 
        'percentage_exceeding': (g['total_tokens'] > 300).sum()/len(g)*100 
    })
).reset_index()

filtered_subjects = subject_analysis[subject_analysis['percentage_exceeding'] > 25]

print(filtered_subjects)


B. /Discuss:/

Given the government's concerns about cost and speed, adopting the new turbo model presents benefits for most subjects but could pose challenges for those identified in section A.

The government could consider using an alternative model, depending on the budget, that accommodates all subjects or implement a hybrid strategy capable of classifying the question context and selecting the appropriate model. This way, the majority of subjects could be processed using the turbo model, while more complex cases would be redirected to a more capable language model that can process up to 100K tokens.

Additionally, they could explore input cleaning techniques, ensuring that instead of tokenizing the entire question, only the most important words are retained. This would help reduce the number of tokens required while maintaining essential information.
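
The hybrid strategy could be sketched as a simple router that sends a prompt to the turbo model when it fits in the 300-token context and to a larger model otherwise. The function names and the model labels `"turbo"` and `"large"` are hypothetical, and the token counts are made up; a real prompt would also need room for any instructions placed around the question.

```python
def route_question(token_count, turbo_limit=300):
    """Pick a model for a prompt of `token_count` tokens."""
    return "turbo" if token_count <= turbo_limit else "large"

def routing_summary(token_counts, turbo_limit=300):
    """Route every prompt and report the share handled by the turbo model."""
    routes = [route_question(n, turbo_limit) for n in token_counts]
    share_turbo = routes.count("turbo") / len(routes)
    return routes, share_turbo

# Made-up token counts for four question-and-answers prompts
routes, share_turbo = routing_summary([120, 280, 300, 450])
print(routes, share_turbo)
```

The share routed to the turbo model gives a first estimate of how much of the workload would benefit from the lower cost and higher speed.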

### 3.4 (4 pt)

/Discuss:/ The time has come to give your final recommendation on the use of LMs in education to the government! Taking into account everything you analyzed in all the preceding tasks (1, 2, and 3), please write a short recommendation consisting of 4 bullet points discussing your concerns.

B. /Discuss:/

1. The expected format for answers across all dataframes is a single letter: 'A', 'B', 'C', or 'D'. However, several responses differ from this format, leading to inconsistencies in the data. These inconsistencies may cause issues when processing or analyzing the data, so standardizing the answer format is necessary. To rebalance the dataframe, we can reduce the number of overrepresented samples by undersampling.

2. Model Z is the most consistent model across answer choices but reaches a lower peak accuracy, while model X is always the best when A is the correct answer and model Y is the best when D is the correct answer.

3. Model Selection Should Be Context-Based: While the turbo model offers cost and speed advantages, its limited context length could pose issues for subjects requiring complex reasoning. A hybrid approach that selects the appropriate model based on the complexity of the question is advisable. Preprocessing strategies such as question summarization or keyword extraction could reduce token consumption, allowing more efficient use of the turbo model without sacrificing comprehension.

4. Since educational requirements evolve, the effectiveness of the selected models should be assessed regularly. If a strong pattern of bias or accuracy degradation emerges, model selection or input processing should be adjusted accordingly.