(Placeholder for your group #)

(Placeholder for your names)

(Placeholder for your i-numbers)

**Use of genAI tools (e.g. chatGPT), websites (e.g. stackoverflow)**: *list websites where you found code (or other info) as well as include information on how you used genAI tools*

# Data Analysis, Clinic 1
# FIETS: Fundamentele Innovatie En Technologie in Scholing
## With FIETS, education keeps moving forward, even against the wind!

---

By completing and delivering the clinic tasks you will know how to:

- Load data and handle data using pandas;
- Navigate the documentation of Python packages by yourself;
- Filter and tidy up **noisy** real-world datasets;
- Aggregate your data in different (and hopefully helpful) ways;
- Use EDA to learn more about your data;
- Create and interpret informative visualizations to explore the data set;
- Derive meaningful insights for the societal impact of datasets.

---
**Important Dates.**

- Clinic 1 release: Thu 30 Jan 2024
- Clinic 1 due: Fri 07 Feb 2024 late night, wildcards available

**Instructions for the deliverable:**

* You are allowed to use any built-in Python library that comes with Anaconda. If you want to use an external library, you may do so, but must justify your choice.

* Make sure that you include a proper amount/mix of comments, results and code. More specifically, be sure to provide a concise textual description of your thought process, the assumptions you made, the solution you implemented, and explanations for your answers. A notebook that only has code cells will not suffice. To avoid confusion: use short comments for longer code answers.

* For questions containing the /Discuss:/ prefix, answer not with code, but with a textual explanation (in markdown).

* Back up any hypotheses and claims with data, since this is an important aspect of the course.

* Please write all your comments in English, and use meaningful variable names (as far as possible) in your code.

* In the end, make sure that all cells are executed properly and everything you need to show is in your (executed) notebook. We will not run your notebook for you!

- Following on from the previous point, interactive plots, such as those generated using the `plotly` package, should be strictly avoided! Make sure to print results and/or dataframes that confirm you have properly addressed the task.

* You are asked to deliver **only your executed notebook file, .ipynb** and nothing else. If you deliver other files, we will not grade anything.

* Honor code applies to these tasks. If you are not certain about an action, consult with Jerry.

**A Note from Jerry on using Language Models (LMs)**

If you try hard enough, you will likely get away with cheating (that does not only apply to LMs). Fortunately, my job is not to police, but rather to educate you. So, please consider the following:

I assume that you are taking this course to learn something! LMs are not always right ([they often fail in silly ways](https://community.openai.com/t/why-9-11-is-larger-than-9-9-incredible/869824/4)). This course should prepare you to detect when they are wrong!

I don't restrict the use of LMs because I see the value of being helped when coding (esp. in the context of the pandas dataframe nightmare :)). Based on what we saw last year in your notebooks, it's pretty clear when you "copy" some code and then struggle to interpret the results. This is the essence of this course and of the skills you should try to build for yourself: many people can run fancy models these days, but not many can interpret the results correctly. Try to be one of the latter.

---

## Context

AI is booming! Newspapers, influencers and your relatives all agree that AI is important. But while almost everyone agrees that AI is the future, much is unclear about what that future looks like, especially in critical sectors like education...

Freshly graduated from a top Dutch university in Limburg, you are hired by the Dutch government to advise on a large-scale “education innovation” initiative code-named "FIETS" (Flexibele Innovatie voor Efficiënte Toepassing in Scholing). With higher education facing severe budget cuts, the government is looking for creative solutions to "do more with less." Convinced by the stunning progress in language modeling, officials believe LLMs could help battle growing teacher shortages and reduce costs by automating parts of the education process. Your job description: investigate which LMs might be best suited to plug the gaps without draining the budget!

You are handed the results of three LMs on the [“Massive Multitask Language Understanding (MMLU)”](https://arxiv.org/abs/2009.03300) dataset to compare. This famous dataset consists of multiple-choice questions across 57 subjects, covering diverse areas like mathematics, computer science, history, and law. Most providers of state-of-the-art LMs use this dataset to showcase the versatility of their latest models. Unfortunately, the intern responsible for collecting the results didn’t pay attention during DACS KEN3450: Data Analysis. As a result, the collected datasets are slightly corrupted. Jammer!

The success of FIETS depends on your ability to make sense of the messy data and recommend the best model to keep the Dutch education system pedaling forward—despite uphill challenges like funding shortages and a skeptical academic community!

### A very brief primer on Language Models
We studied LLMs in the context of the NLP course but here is a short reminder. Language models (LMs) are sophisticated statistical models designed to understand and generate human-like text. At their core, LMs are trained to predict the most likely continuation of a given input text. For example, given the input "The cat sat on the," an LM might predict "mat" as a likely continuation.
LMs are trained on vast text samples from various sources, including books, websites, and social media. This extensive training allows them to capture patterns and relationships in language, enabling them to generate coherent and contextually appropriate text across a wide range of topics and styles.
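
To make "predict the most likely continuation" concrete, here is a minimal sketch of a toy bigram model that simply counts which word follows which. This is only an illustration: real LMs are neural networks operating on tokens, and the tiny corpus and the name `predict_next` are made up for this example.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus (an assumption, for illustration only)
corpus = [
    "the cat sat on the mat",
    "the dog sat on the mat",
    "the cat lay on the sofa",
]

# Count how often each word is followed by each other word
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current_word, following_word in zip(words, words[1:]):
        next_word_counts[current_word][following_word] += 1

def predict_next(word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = next_word_counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

print(predict_next("on"))   # the
print(predict_next("sat"))  # on
```

The prediction is only as good as the text the counts were collected from, which is the same reason real LMs need vast training data.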

While LMs can produce text that appears to be written by intelligent humans, it's important to note that their capabilities can diverge from human intelligence in unexpected ways. They may sometimes generate factually incorrect information or struggle with complex reasoning tasks.

Two key concepts in understanding LMs are:
1. **Tokens**: LMs process text using "tokens" rather than individual characters. Tokens can be words, parts of words, or punctuation marks. For example, the sentence "I love AI!" might be tokenized as ["I", "love", "AI", "!"]. Tokenization is the first step in both training and using an LM.
2. **Context**: The input text provided to an LM is called the "context." This context informs the model's predictions or generations. A longer or more specific context often leads to more accurate and relevant outputs.
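
As a rough illustration of the tokenization idea above, the sketch below splits text into word and punctuation tokens with a regular expression. Treat it as a simplification: real LMs use learned subword tokenizers, so their tokens often do not line up with whole words, and the function name `naive_tokenize` is hypothetical.

```python
import re

def naive_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

tokens = naive_tokenize("I love AI!")
print(tokens)       # ['I', 'love', 'AI', '!']
print(len(tokens))  # 4
```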

[See: Wikipedia entry on language models](https://en.wikipedia.org/wiki/Large_language_model)

###  Files for this assignment
This assignment is divided into three tasks, each of which should bring you a step closer to providing a recommendation toward the objectives of project FIETS:

- **Task 1**: Inspecting the results and getting your first model ranking
- **Task 2**: Inspecting the underlying data used to generate the results for possible biases
- **Task 3**: Learning about tokens and providing a final recommendation


```
📁 FIETS
│
├── 📄 clinic1.ipynb (the file you're currently reading!)
│
└── 📁 data
    ├── 📁 task_1
    ├── 📁 task_2
    └── 📁 task_2.5
```   
 

In [64]:
# some basic imports
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
from scipy.stats import ttest_ind
from tabulate import tabulate # To nicely print tables


## Task 1 (18 points): What's in an average anyway?

The files needed to complete task 1 can be found in the folder `data/task_1/`:
```
task_1/
│
├── mmlu_data/
│   └── test.csv
│
└── lm_scores/
    ├── lm_X.csv
    ├── lm_Y.csv
    └── lm_Z.csv
```

We will start by loading, (manually) inspecting, and cleaning the data. Although it doesn't seem "glamorous" (nor is it particularly fun...) - manually inspecting data is extremely important! In fact, it's one of the few things most AI and Data Science researchers agree on :). Next, we will take a first pass on ordering our Olympic podium between three LMs.

### 1.1 (1 pt)
 
Load the subfiles contained in the `mmlu_data` and `lm_scores` folders into separate dataframes:
- `df_test`
- `df_x`
- `df_y`
- `df_z`

For each, print its size.

In [65]:
df_test = pd.read_csv('data/task_1/mmlu_data/test.csv')

scores_dir = 'data/task_1/lm_scores/'  # Folder path, stored once since we reuse it

# LM score dataframes
df_x = pd.read_csv(os.path.join(scores_dir, 'lm_X.csv'))
df_y = pd.read_csv(os.path.join(scores_dir, 'lm_Y.csv'))
df_z = pd.read_csv(os.path.join(scores_dir, 'lm_Z.csv'))

print('df_test: ', df_test.shape)
print('df_x: ', df_x.shape)
print('df_y: ', df_y.shape)
print('df_z: ', df_z.shape)

df_test:  (14042, 8)
df_x:  (13882, 2)
df_y:  (13978, 2)
df_z:  (13923, 2)


### 1.2 (4 pt)
Unfortunately, LMs don't always output the format we want. In the column `result`, the value should be one of A, B, C, or D. 

A. For each of the LM score dataframes, use a `value_counts()` operation and print the results. 

B. /Discuss:/ Inspect the results and describe the types of answer formats you see. Besides the "expected" case, you should be able to find at least four unexpected formats.

In [66]:
# A
result_counts_df_x = df_x['result'].value_counts().reset_index()
result_counts_df_x.columns = ['Value', 'Count']

print(f'Total distinct answers in df_x: {df_x["result"].nunique()}')
print('df_x:')
print(result_counts_df_x.to_string(index=False))

Total distinct answers in df_x: 145
df_x:
Value  Count
    A   2733
   A    1657

In [67]:
# A
result_counts_df_y = df_y['result'].value_counts().reset_index()
result_counts_df_y.columns = ['Value', 'Count']

print(f'Total distinct answers in df_y: {df_y["result"].nunique()}')
print('df_y:')
print(result_counts_df_y.to_string(index=False))

Total distinct answers in df_y: 141
df_y:
Value  Count
    D   2894

In [68]:
# A
result_counts_df_z = df_z['result'].value_counts().reset_index()
result_counts_df_z.columns = ['Value', 'Count']

print(f'Total distinct answers in df_z: {df_z["result"].nunique()}\n') # There are 560
print('df_z:')
print(result_counts_df_z.to_string(index=False))

Total distinct answers in df_z: 560

df_z:
Value  Count
    D   2257

In [69]:
# B

B. /Discuss:/ The most striking observation is that, although we expect only four distinct answers (A, B, C, or D), we find many more: df_x has 145 distinct answers, df_y has 141, and df_z has 560. Many of these still point to one of the four choices but are written differently, for example 'Answer:A' or 'Not wrong, Not wrong, so the answer is D' (probably because the model first considered other options before settling on one). Others, such as 'Not sure', are not valid answers at all, since the test requires picking an option. Finally, answers like 'Vitamin B12, so the answer is C' include a reason next to the chosen letter, but the model does at least pick one.

Therefore, I would say the four unexpected formats are:
1. Answers where the model does not pick any choice and says something like "Not Sure";
2. Answers where the model gives a reason alongside its answer (a single sentence or even a full paragraph);
3. Answers where the model gives no reason but formats the answer differently, such as "Answer:A";
4. Answers that look like a plain "A" when printed but probably include a leading or trailing space (or another whitespace character), so they are not treated as equal to "A".
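
The four formats listed above could be counted rather than eyeballed. The following is a minimal sketch, assuming the raw answers are strings like the examples quoted in this discussion; the function `categorize_answer` and its category labels are hypothetical and would need tuning against the real `result` column.

```python
import re
import pandas as pd

VALID = {"A", "B", "C", "D"}

def categorize_answer(raw):
    """Assign a raw model answer to a coarse format category."""
    if pd.isna(raw):
        return "missing"
    text = str(raw)
    if text in VALID:
        return "expected"
    if text.strip() in VALID:
        return "extra whitespace"
    if re.fullmatch(r"\s*Answer:\s*[ABCD]\s*", text):
        return "prefixed"
    if text.strip().lower() == "not sure":
        return "no choice"
    if re.search(r"answer is [ABCD]\b", text):
        return "explanation + choice"
    return "other"

# Example strings taken from the discussion above, plus one missing value
examples = pd.Series(["A", "A ", "Answer:A", "Not Sure",
                      "Vitamin B12, so the answer is C", None])
print(examples.apply(categorize_answer).value_counts())
```

Applying the same function to `df_x['result']`, `df_y['result']`, and `df_z['result']` would show how large each category is per model, and how much ends up in "other".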

### 1.3 (5 pt)
Oh oh... That doesn't look great. Simply dropping all invalid answers seems overly wasteful, yet fixing all of these looks like a mess! Instead, let's focus for now on fixing just those answers of length < 10 characters that require only a single `str.replace()` operation. 

For example, if the answer looks like `--A--`, we could fix this by using the following simple function:

```
def clean_answer(s, pattern='-'):
    return str(s).replace(pattern, '')

dirty_answer = '--A--'
cleaned_answer = clean_answer(dirty_answer)  # separate name, so the function is not overwritten
```

A. Filter the three score dataframes to include only answers with fewer than 10 characters. Make a deep copy of the dataframes as you filter them.

B. Modify the `clean_answer()` example function to clean the answers in the filtered data frames using the `apply()` functionality. Finally, make sure **all remaining answers are one of `A, B, C, or D`.**

C. /Discuss:/ Compare the sizes of the original and filtered data frames. What do you see? Why might this be a problem?

In [70]:
#A
# Filter the dataframes to include only answers with less than 10 characters
# The Deep Copy means that no references will be shared between the original DataFrame and the new one
df_x_filtered = df_x[df_x['result'].str.len() < 10].copy()
df_y_filtered = df_y[df_y['result'].str.len() < 10].copy()
df_z_filtered = df_z[df_z['result'].str.len() < 10].copy()

print('Original df_x size: ', df_x.shape)
print('Filtered df_x size: ', df_x_filtered.shape)
print('Original df_y size: ', df_y.shape)
print('Filtered df_y size: ', df_y_filtered.shape)
print('Original df_z size: ', df_z.shape)
print('Filtered df_z size: ', df_z_filtered.shape)

Original df_x size:  (13882, 2)
Filtered df_x size:  (13509, 2)
Original df_y size:  (13978, 2)
Filtered df_y size:  (13637, 2)
Original df_z size:  (13923, 2)
Filtered df_z size:  (12878, 2)


In [71]:
#B
# Define the clean_answer function
def clean_answer(s, pattern='Answer: '):
    """Removes a specified pattern from a string and strips extra whitespace."""
    if pd.isna(s):
        return s
    return str(s).replace(pattern, '').strip()

# Apply the clean_answer function to the filtered dataframes
df_x_filtered['result'] = df_x_filtered['result'].apply(clean_answer)
df_y_filtered['result'] = df_y_filtered['result'].apply(clean_answer)
df_z_filtered['result'] = df_z_filtered['result'].apply(clean_answer)

# Checking the value_count() for each to see what is left Besides A, B, C, D
print('df_x_filtered:')
print(df_x_filtered['result'].value_counts())

print('\ndf_y_filtered:')
print(df_y_filtered['result'].value_counts())

print('\ndf_z_filtered:')
print(df_z_filtered['result'].value_counts())

df_x_filtered:
result
A           5788
B           2965
C           2350
D           2333
Not Sure      73
Name: count, dtype: int64

df_y_filtered:
result
D           5757
C           3242
B           2519
A           2033
Not Sure      86
Name: count, dtype: int64

df_z_filtered:
result
D           3348
C           3255
B           3124
A           3026
Not Sure     125
Name: count, dtype: int64


In [72]:
# B
# Ensure only valid answers (A, B, C, D) remain
valid_answers = {'A', 'B', 'C', 'D'}
df_x_filtered = df_x_filtered[df_x_filtered['result'].isin(valid_answers)]
df_y_filtered = df_y_filtered[df_y_filtered['result'].isin(valid_answers)]
df_z_filtered = df_z_filtered[df_z_filtered['result'].isin(valid_answers)]

# Checking the value_count() for each to see what is left Besides A, B, C, D (should not happen)
print('df_x_filtered:')
print(df_x_filtered['result'].value_counts())

print('\ndf_y_filtered:')
print(df_y_filtered['result'].value_counts())

print('\ndf_z_filtered:')
print(df_z_filtered['result'].value_counts())


df_x_filtered:
result
A    5788
B    2965
C    2350
D    2333
Name: count, dtype: int64

df_y_filtered:
result
D    5757
C    3242
B    2519
A    2033
Name: count, dtype: int64

df_z_filtered:
result
D    3348
C    3255
B    3124
A    3026
Name: count, dtype: int64


In [73]:
# Function to print comparison results
def print_comparison(df_name, original_df, filtered_df):
    removed = original_df.shape[0] - filtered_df.shape[0]
    percentage_removed = (removed / original_df.shape[0]) * 100

    print(f"Original {df_name} size: {original_df.shape}")
    print(f"Filtered {df_name} size: {filtered_df.shape}")
    print(f"Removed answers: {removed}")
    print(f"Percentage of removed answers: {percentage_removed:.2f}%\n")  # Limits to 2 decimal places

# Compare each dataframe
print_comparison("df_x", df_x, df_x_filtered)
print_comparison("df_y", df_y, df_y_filtered)
print_comparison("df_z", df_z, df_z_filtered)


Original df_x size: (13882, 2)
Filtered df_x size: (13436, 2)
Removed answers: 446
Percentage of removed answers: 3.21%

Original df_y size: (13978, 2)
Filtered df_y size: (13551, 2)
Removed answers: 427
Percentage of removed answers: 3.05%

Original df_z size: (13923, 2)
Filtered df_z size: (12753, 2)
Removed answers: 1170
Percentage of removed answers: 8.40%



C. /Discuss:/ The filtering removed quite a few samples. This can be problematic because it may change the original distribution of the data and therefore bias the results. For df_x, 3.21% of the answers were removed, which might not seem like much but still amounts to 446 samples; for df_y, 3.05% (427 samples) were removed. For df_z, 8.40% of the samples were removed (1170 samples), which is quite significant. The removal is also uneven across models, so the three models are no longer evaluated on the same set of questions. With more careful cleaning, we could probably recover which answer the model was trying to give (essentially extracting the A, B, C, or D), which would avoid removing so many samples.
In addition, even the answers for which this is impossible, because the model does not pick any of the four choices, would be worth keeping in the analysis. For example, we could ask: what percentage of questions does each model fail to answer? This could reveal additional information about the problem and about the performance of these language models.
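
As a sketch of what such "more careful cleaning" might look like, the function below tries to recover the chosen letter from the longer answer formats seen earlier. It is not part of the graded solution: the name `extract_choice` and the regular expression are assumptions, and the pattern only covers phrasings like "Answer:A" or "... so the answer is D".

```python
import re

def extract_choice(raw):
    """Return the chosen letter (A-D) if it can be recovered, otherwise None."""
    text = str(raw).strip()
    if text in {"A", "B", "C", "D"}:
        return text
    # Matches e.g. "Answer:A", "Answer: B", "... so the answer is D"
    match = re.search(r"[Aa]nswer(?:\s+is)?\s*:?\s*([ABCD])\b", text)
    return match.group(1) if match else None

print(extract_choice("Answer:A"))                                 # A
print(extract_choice("Not wrong, Not wrong, so the answer is D"))  # D
print(extract_choice("Not Sure"))                                 # None
```

Answers that return `None` could be kept as a separate "no answer" category instead of being dropped, which would also make the share of unanswered questions per model easy to report.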

### 1.4 (3 pt)

Now that our answer columns are nicely formatted, let's take a look at model performance:

A. Both the `MMLU` dataframe and the language model score dataframes have a `question_id` column. For each of the language model score dataframes, use an inner join operation with the `df_test` dataframe on the `question_id` column.

B. Add a new column to each of the resulting dataframes called `correct`, that checks if the model's answer in `result` is the same as the expected answer in the column `answer`. Then, print the average score of each model.

In [74]:
# Perform inner join on question_id for each dataset, keeping 'answer' and 'subject'
df_x_joined = df_x_filtered.merge(df_test[['question_id', 'answer', 'subject']], on='question_id', how='inner')
df_y_joined = df_y_filtered.merge(df_test[['question_id', 'answer', 'subject']], on='question_id', how='inner')
df_z_joined = df_z_filtered.merge(df_test[['question_id', 'answer', 'subject']], on='question_id', how='inner')

# Print sample results to confirm the join
print("df_x_joined sample:\n", df_x_joined.head())
print("df_y_joined sample:\n", df_y_joined.head())
print("df_z_joined sample:\n", df_z_joined.head())


df_x_joined sample:
    question_id result answer           subject
0            0      B      B  abstract algebra
1            1      C      C  abstract algebra
2            2      D      D  abstract algebra
3            3      B      B  abstract algebra
4            4      B      B  abstract algebra
df_y_joined sample:
    question_id result answer           subject
0            0      D      B  abstract algebra
1            1      D      C  abstract algebra
2            2      D      D  abstract algebra
3            4      D      B  abstract algebra
4            5      C      A  abstract algebra
df_z_joined sample:
    question_id result answer           subject
0            0      B      B  abstract algebra
1            1      B      C  abstract algebra
2            2      C      D  abstract algebra
3            3      B      B  abstract algebra
4            4      B      B  abstract algebra


In [75]:
# B
# Add the 'correct' column (True if result matches answer, False otherwise)
df_x_joined['correct'] = df_x_joined['result'] == df_x_joined['answer']
df_y_joined['correct'] = df_y_joined['result'] == df_y_joined['answer']
df_z_joined['correct'] = df_z_joined['result'] == df_z_joined['answer']

# Compute the average score (accuracy) for each model
accuracy_scores = {
    "Model": ["df_x", "df_y", "df_z"],
    "Accuracy": [
        df_x_joined['correct'].mean(),
        df_y_joined['correct'].mean(),
        df_z_joined['correct'].mean()
    ]
}

# Convert to DataFrame and print results
accuracy_df = pd.DataFrame(accuracy_scores)
print("Model Accuracy Comparison:\n", accuracy_df)


Model Accuracy Comparison:
   Model  Accuracy
0  df_x  0.767490
1  df_y  0.745849
2  df_z  0.663295


### 1.5 (5 pt)

Hmmm, something doesn't seem quite right. Let's investigate how "balanced" this dataset is:

A. For each of the 57 subjects in the MMLU, compare the number of questions answered by each model. Print the subjects for which there is a more than 10% difference.

B. Propose and implement a reasonable way to rebalance the results. (e.g., while throwing away 100% of the results perfectly rebalances the results, it is not reasonable).

C. Finally, print the updated accuracy on the rebalanced data.

**Hint:**
- (A) For a given subject, let model X and model Y have answered 181 and 200 questions respectively. You can consider this a 10% difference from the perspective of X, i.e., (200 - 181) / 181 > 0.10
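
The arithmetic in the hint can be checked with a small helper. This is a minimal sketch, assuming the difference is always measured relative to the smaller count (the "perspective of X" in the hint); the name `relative_difference` is hypothetical, and the function would divide by zero if a model answered no questions for a subject.

```python
def relative_difference(count_a, count_b):
    """Difference between two counts, relative to the smaller one."""
    smaller, larger = sorted((count_a, count_b))
    return (larger - smaller) / smaller

diff = relative_difference(181, 200)
print(f"{diff:.4f}")  # 0.1050
print(diff > 0.10)    # True
```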

In [76]:
# Checking the Distribution of the number of questions per subject
# Compute the value counts for subjects in df_test (in percentage, rounded to 2 decimal places)
subject_percentage_test = df_test['subject'].value_counts(normalize=True) * 100
subject_percentage_test = subject_percentage_test.round(2)  # Format to 2 decimal places

# Print the percentage of questions per subject
print(subject_percentage_test)

# Print the total number of subjects
print("\nTotal number of subjects:", subject_percentage_test.shape[0])


subject
professional law                       10.92
moral scenarios                         6.37
miscellaneous                           5.58
professional psychology                 4.36
high school psychology                  3.88
high school macroeconomics              2.78
elementary mathematics                  2.69
moral disputes                          2.46
prehistory                              2.31
philosophy                              2.21
high school biology                     2.21
nutrition                               2.18
professional accounting                 2.01
professional medicine                   1.94
high school mathematics                 1.92
clinical knowledge                      1.89
security studies                        1.74
high school microeconomics              1.69
high school world history               1.69
conceptual physics                      1.67
marketing                               1.67
human aging                             1.59
hi

It is very interesting to see the discrepancy in how the subjects are represented. Professional Law, for example, accounts for almost 11% of the total questions, while Machine Learning accounts for only 0.8%.
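
This imbalance matters for how an "average" is computed. The sketch below uses made-up toy data (not the real MMLU results) to contrast an average over all questions, where large subjects dominate, with an average of per-subject accuracies, where every subject counts equally.

```python
import pandas as pd

# Made-up toy results: one large subject and one small subject
toy = pd.DataFrame({
    "subject": ["professional law"] * 8 + ["machine learning"] * 2,
    "correct": [True] * 4 + [False] * 4 + [True] * 2,
})

# Average over questions: every question counts equally
per_question_accuracy = toy["correct"].mean()

# Average over subjects: every subject counts equally, regardless of size
per_subject_accuracy = toy.groupby("subject")["correct"].mean().mean()

print(f"per question: {per_question_accuracy:.2f}")  # per question: 0.60
print(f"per subject:  {per_subject_accuracy:.2f}")   # per subject:  0.75
```

In this toy case the model looks much weaker when the large subject dominates, so it is worth stating which kind of average a reported accuracy refers to.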

In [77]:
# Count the number of questions answered per subject for each model
subject_counts = pd.DataFrame({
    "df_x": df_x_joined['subject'].value_counts(),
    "df_y": df_y_joined['subject'].value_counts(),
    "df_z": df_z_joined['subject'].value_counts()
}).fillna(0)

# Compute the maximum and minimum number of questions answered per subject
subject_counts['max_count'] = subject_counts[['df_x', 'df_y', 'df_z']].max(axis=1)
subject_counts['min_count'] = subject_counts[['df_x', 'df_y', 'df_z']].min(axis=1)

# Compute the percentage difference using the correct formula
subject_counts['percent_diff'] = ((subject_counts['max_count'] - subject_counts['min_count']) / subject_counts['min_count']) * 100

# Identify subjects with more than 10% difference
imbalanced_subjects_count = subject_counts[subject_counts['percent_diff'] > 10].round(2)

# Print subjects with more than 10% difference in the number of questions answered
print("Subjects with more than 10% difference in the number of answered questions:\n", imbalanced_subjects_count)

# Print the number of subjects that exceed the 10% difference threshold
print("\nNumber of subjects with >10% difference in answered questions:", imbalanced_subjects_count.shape[0])


Subjects with more than 10% difference in the number of answered questions:
                           df_x  df_y  df_z  max_count  min_count  percent_diff
subject                                                                       
college chemistry           96    98    84         98         84         16.67
college computer science    97    98    84         98         84         16.67
computer security           95    98    87         98         87         12.64
formal logic               109   123   113        123        109         12.84
high school geography      195   193   176        195        176         10.80
logical fallacies          154   136   147        154        136         13.24
medical genetics            97    98    89         98         89         10.11
moral disputes             329   304   250        329        250         31.60
moral scenarios            737   865   774        865        737         17.37

Number of subjects with >10% difference in answered q

In [78]:
# Compute accuracy (mean correctness) per subject for each model
subject_accuracy = pd.DataFrame({
    "df_x": df_x_joined.groupby("subject")["correct"].mean(),
    "df_y": df_y_joined.groupby("subject")["correct"].mean(),
    "df_z": df_z_joined.groupby("subject")["correct"].mean()
}).fillna(0)

# Compute the maximum and minimum accuracy per subject
subject_accuracy['max_acc'] = subject_accuracy[['df_x', 'df_y', 'df_z']].max(axis=1)
subject_accuracy['min_acc'] = subject_accuracy[['df_x', 'df_y', 'df_z']].min(axis=1)

# Compute the percentage difference using the correct formula
subject_accuracy['percent_diff'] = ((subject_accuracy['max_acc'] - subject_accuracy['min_acc']) / subject_accuracy['min_acc']) * 100

# Identify subjects with more than 10% difference in accuracy
imbalanced_accuracy_subjects = subject_accuracy[subject_accuracy['percent_diff'] > 10].round(2)

# Print subjects where models have more than a 10% difference in accuracy
print("Subjects with more than 10% difference in accuracy:\n", imbalanced_accuracy_subjects)

# Print the number of subjects that exceed the 10% difference threshold
print("\nNumber of subjects with >10% difference in accuracy:", imbalanced_accuracy_subjects.shape[0])


Subjects with more than 10% difference in accuracy:
                                      df_x  df_y  df_z  max_acc  min_acc  \
subject                                                                   
abstract algebra                     0.74  0.66  0.67     0.74     0.66   
anatomy                              0.78  0.72  0.63     0.78     0.63   
astronomy                            0.75  0.78  0.63     0.78     0.63   
business ethics                      0.73  0.72  0.66     0.73     0.66   
clinical knowledge                   0.79  0.74  0.64     0.79     0.64   
college biology                      0.82  0.68  0.67     0.82     0.67   
college chemistry                    0.75  0.73  0.60     0.75     0.60   
college mathematics                  0.78  0.80  0.69     0.80     0.69   
college medicine                     0.73  0.75  0.65     0.75     0.65   
college physics                      0.79  0.82  0.63     0.82     0.63   
computer security                    0.81  0.65

In [79]:
#B
# Function to rebalance the dataset by downsampling to the minimum count per subject
def rebalance_data(df, subject_counts):
    return df.groupby('subject', group_keys=False).apply(
        lambda x: x.sample(n=int(subject_counts['min_count'].loc[x['subject'].iloc[0]]), random_state=42)
    )

# Apply rebalancing to each model's dataset
df_x_balanced = rebalance_data(df_x_joined, subject_counts)
df_y_balanced = rebalance_data(df_y_joined, subject_counts)
df_z_balanced = rebalance_data(df_z_joined, subject_counts)

# Count the number of questions per subject after rebalancing
balanced_subject_counts = pd.DataFrame({
    "df_x": df_x_balanced['subject'].value_counts(),
    "df_y": df_y_balanced['subject'].value_counts(),
    "df_z": df_z_balanced['subject'].value_counts()
}).fillna(0)

# Print the new balanced subject distributions
print("Balanced Subject Counts:\n", balanced_subject_counts)


Balanced Subject Counts:
                                      df_x  df_y  df_z
subject                                              
professional law                     1390  1390  1390
moral scenarios                       737   737   737
miscellaneous                         716   716   716
professional psychology               564   564   564
high school psychology                515   515   515
high school macroeconomics            350   350   350
elementary mathematics                348   348   348
prehistory                            299   299   299
nutrition                             286   286   286
philosophy                            282   282   282
high school biology                   277   277   277
professional accounting               261   261   261
professional medicine                 257   257   257
moral disputes                        250   250   250
high school mathematics               247   247   247
clinical knowledge                    244   244   244
se

In [81]:
# Compute accuracy (mean correctness) per subject for each model after rebalancing
subject_accuracy_balanced = pd.DataFrame({
    "df_x": df_x_balanced.groupby("subject")["correct"].mean(),
    "df_y": df_y_balanced.groupby("subject")["correct"].mean(),
    "df_z": df_z_balanced.groupby("subject")["correct"].mean()
}).fillna(0)

# Compute the overall accuracy across all subjects after rebalancing
accuracy_scores_balanced = {
    "Model": ["df_x_balanced", "df_y_balanced", "df_z_balanced"],
    "Overall Accuracy": [
        df_x_balanced['correct'].mean(),
        df_y_balanced['correct'].mean(),
        df_z_balanced['correct'].mean()
    ]
}

# Convert to DataFrame and print results
accuracy_df_balanced = pd.DataFrame(accuracy_scores_balanced)

# Print updated accuracy after rebalancing
print("Updated Accuracy on Rebalanced Data:\n", accuracy_df_balanced)


Updated Accuracy on Rebalanced Data:
            Model  Overall Accuracy
0  df_x_balanced          0.767347
1  df_y_balanced          0.745688
2  df_z_balanced          0.663228


In [82]:
# Compute accuracy (mean correctness) per subject for each model after rebalancing
subject_accuracy_balanced = pd.DataFrame({
    "df_x": df_x_balanced.groupby("subject")["correct"].mean(),
    "df_y": df_y_balanced.groupby("subject")["correct"].mean(),
    "df_z": df_z_balanced.groupby("subject")["correct"].mean()
}).fillna(0)

# Compute the number of samples per subject for weighting
subject_sample_counts = pd.DataFrame({
    "df_x": df_x_balanced['subject'].value_counts(),
    "df_y": df_y_balanced['subject'].value_counts(),
    "df_z": df_z_balanced['subject'].value_counts()
}).fillna(0)

# Compute inverse weights (1/n) and normalize them so they sum to 1
inverse_weights = 1 / subject_sample_counts
normalized_weights = inverse_weights.div(inverse_weights.sum(axis=0), axis=1)  # Normalize per model

# Compute inverse-weighted accuracy for each model
inverse_weighted_accuracy_scores = {
    "Model": ["df_x_balanced", "df_y_balanced", "df_z_balanced"],
    "Inverse-Weighted Accuracy": [
        (subject_accuracy_balanced["df_x"] * normalized_weights["df_x"]).sum(),
        (subject_accuracy_balanced["df_y"] * normalized_weights["df_y"]).sum(),
        (subject_accuracy_balanced["df_z"] * normalized_weights["df_z"]).sum()
    ]
}

# Convert to DataFrame and print results
accuracy_df_inverse_weighted = pd.DataFrame(inverse_weighted_accuracy_scores)

# Print updated accuracy after rebalancing (inverse-weighted)
print("Updated Inverse-Weighted Accuracy on Rebalanced Data:\n", accuracy_df_inverse_weighted)


Updated Inverse-Weighted Accuracy on Rebalanced Data:
            Model  Inverse-Weighted Accuracy
0  df_x_balanced                   0.767433
1  df_y_balanced                   0.732959
2  df_z_balanced                   0.660655


## Task 2 (26 points): What do you mean A > D > B > C...?

Nice work! Having successfully inspected, cleaned, and rebalanced the provided data, you head over to the director of the government's FIETS project, who operates under the code name Geronimo. He is happy with your work so far, but worried that the sloppy intern might have done more undetected damage. To be sure, he orders a new set of evaluations of all models on both MMLU and another dataset.

After cleaning up and rebalancing, you are left with the concatenated score files in the second folder `task_2`:
```
task_2/
│
└── lm_scores_mmlu.csv
│
└── lm_scores_other.csv
```

Each has a new column called `model_name`, which is one of `X`, `Y`, or `Z`.



_NOTE: **only** use data from `task_2` and `task_2_5` for this assignment! The values in `lm_scores_mmlu.csv` will NOT be the same as the dataframes you finished in task 1. This is due to "randomness" or "temperature" in language model inference, which can slightly shift generative results. (Conveniently: it also ensures any mistakes made in Task 1 don't propagate further ;) )_

In [7]:
# PROVIDED CODE
df_mmlu = pd.read_csv('data/task_2/lm_scores_mmlu.csv')
df_other = pd.read_csv('data/task_2/lm_scores_other.csv')

### 2.1 (4 pt)

Let's explore the new results:

A. Compute the mean accuracy and standard errors of each model on both datasets and print the results.

B. Then, show your results in a bar plot using standard errors with a 95% confidence interval around the mean. Make sure the plot is easy to read and well annotated.

C. /Discuss:/ the plot you created: (i) can you say that one of the models is the best? (ii) is there anything that seems odd?
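
As a rough illustration of what "mean accuracy with standard errors" looks like in pandas, here is a minimal sketch on made-up data. The column names `model_name` and `correct` follow the assignment files, but the values and the normal-approximation factor 1.96 are assumptions of this sketch, not a prescribed solution.

```python
import pandas as pd

# Toy stand-in for a score file; the values are invented for illustration.
toy_scores = pd.DataFrame({
    "model_name": ["X"] * 4 + ["Y"] * 4,
    "correct": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Mean accuracy and standard error of the mean (sample std / sqrt(n)) per model.
summary = toy_scores.groupby("model_name")["correct"].agg(["mean", "sem"])

# Half-width of an approximate 95% confidence interval (normal approximation).
summary["ci95"] = 1.96 * summary["sem"]

print(summary)
```

The `ci95` column can then be passed as the error-bar size when plotting, for example through the `yerr` argument of `summary["mean"].plot.bar(...)`.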

In [6]:
#A

In [None]:
#B

### 2.2 (5 pt)

Geronimo has assured you that both datasets contain questions of similar difficulty, so, what could be going on here?

A. What is the distribution of correct answers (A, B, C, D) for each dataset? Create a bar chart to visualize this.

B. Perform a chi-square test of independence at $\alpha = 0.05$ to determine if there's a significant difference in the distribution of correct answers between the two datasets. What do you conclude?

**hints**:
- for (A), keep in mind that df_mmlu and df_other contain the results of all models, i.e., the `question_id` column is duplicated.
- for (A), take care to clearly annotate the bar chart, e.g., title, y-label, legend.
- for (B), clearly state the null hypothesis and alternative hypothesis
- use the `chi2_contingency` function from `scipy.stats`
- format your results from answer (A) as a 2D array
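
To make the last two hints concrete, the sketch below runs `chi2_contingency` on a small 2D array. The counts are hypothetical (rows are two datasets, columns are the correct answers A to D); only the call pattern is meant to carry over.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts of correct answers A, B, C, D (columns)
# for two datasets (rows). These numbers are invented.
observed = np.array([
    [25, 25, 25, 25],
    [40, 20, 20, 20],
])

# Returns the test statistic, the p-value, the degrees of freedom,
# and the expected counts under the null hypothesis of independence.
chi2, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.3f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

With a 2 x 4 table the test has (2 - 1) * (4 - 1) = 3 degrees of freedom.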

In [7]:
#A 

In [9]:
#B

### 2.3 (7 pt)

Let's dive in deeper:

A. What is language model X's mean accuracy conditioned on the four answer options for each dataset?

B. Compare LM X's performance when the correct answer is "A" between the two datasets. Use a t-test at a 95% confidence level ($\alpha = 0.05$). What do you conclude?

C. Compare LM X's performance when the correct answer is "A" vs. "C or D" for each dataset. Use a t-test at a 95% confidence level ($\alpha = 0.05$). What do you conclude?
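
A two-sample t-test on 0/1 correctness values could be sketched as follows. The two lists are invented, and the choice of Welch's variant (`equal_var=False`) is an assumption of this sketch; the task itself does not say which variant to use.

```python
from scipy.stats import ttest_ind

# Invented correctness indicators (1 = correct, 0 = incorrect) for two groups,
# e.g. questions whose correct answer is "A" in two different datasets.
correct_group_1 = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
correct_group_2 = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]

# Welch's t-test: does not assume equal variances in the two groups.
t_stat, p_value = ttest_ind(correct_group_1, correct_group_2, equal_var=False)

alpha = 0.05  # corresponds to a 95% confidence level
reject_null = p_value < alpha
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, reject H0: {reject_null}")
```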

In [10]:
#A

In [None]:
#B

In [None]:
#C

### 2.4 (2 pt)

What an intriguing finding! 

A. Print the mean accuracies conditioned on the correct answer for all LMs for each dataset.

B. /Discuss:/ What do you observe?

In [11]:
#A

B. /Discuss:/

### 2.5 (2 pt)

Concerned with your findings so far, you quickly consult with Geronimo. After thinking it over, Geronimo concludes that more tests are needed. He orders a second round of MMLU results. However, Geronimo thinks of the following twist: while keeping questions fixed, he randomly permutes the position of the correct answer. The new results can be found in the folder `data/task_2_5/`:
```
task_2_5/
│
└── lm_scores_mmlu_shuffle.csv
```

/Discuss:/ Why would Geronimo do this?

B. /Discuss:/

### 2.6 (4 pt)

Increasingly sceptical of the language models' performance, you read up on proper testing practices. You stumble upon the concept of [test-retest stability](https://en.wikipedia.org/wiki/Repeatability), which roughly states that:

"_Measurements taken by a single person or instrument on the same item, under the same conditions, and in a short period of time, should have the same results._"

In our case, we would assume an LM would have the same performance on a given question regardless of the correct answer position. One way of testing this is by using the following metric:

$$\text{test-retest metric} = \frac{1}{N}\sum_{i=1}^N \frac{1}{M}\sum_{j=1}^M c^i_0 c_j^i,$$

where $c^i_0 \in \{0, 1\}$ indicates whether the model answers the $i^{\text{th}}$ question correctly (1 if correct, 0 if incorrect). $c_j^i$ indicates whether the model answers the $i^{\text{th}}$ question correctly in the $j^{\text{th}}$ shuffled version of the answer label content. Finally, $M$ is the total number of shuffles and $N$ is the dataset size.
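
For intuition, the formula can be evaluated directly with NumPy. The helper name `compute_test_retest` and the toy arrays are hypothetical; the sketch keeps $M$ general and uses two shuffles per question purely as an example.

```python
import numpy as np

def compute_test_retest(original, shuffled):
    """Evaluate the test-retest metric.

    original: 0/1 values of shape (N,), the c_0^i in the formula.
    shuffled: 0/1 values of shape (N, M), the c_j^i in the formula.
    """
    original = np.asarray(original, dtype=float)
    shuffled = np.asarray(shuffled, dtype=float)
    # Product c_0^i * c_j^i for every question i and shuffle j,
    # averaged over the M shuffles first and then over the N questions.
    return float((original[:, None] * shuffled).mean(axis=1).mean())

toy_original = [1, 1, 0, 1]
toy_shuffled = [[1, 0], [1, 1], [1, 1], [0, 0]]
print(compute_test_retest(toy_original, toy_shuffled))  # 0.375
```

A question only contributes when it is answered correctly both in the original run and in a shuffled run, so the metric can never exceed the original accuracy.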

Task: compute the test-retest metric for each language model using the original `lm_scores_mmlu.csv` file and the new `lm_scores_mmlu_shuffle.csv` file. Using a bar plot, visualize your results by comparing the accuracy of the original `lm_scores_mmlu.csv` and the test-retest scores.

**hints**
- what is $M$ in our case?

(bonus: no points, but so much sweet, sweet knowledge - check out [the following article](https://arxiv.org/pdf/2406.19470v1))

In [12]:
#fancy code

### 2.7 (2 pt)

A. Using the unshuffled data: For each LM, print the distribution of the answers they give as well as the accuracy conditioned on the answer they give.

B. /Discuss:/ Describe what you observe

[bonus: not scored, but again _that sweet, sweet knowledge_] Could you think of a plausible explanation?

In [13]:
#A

B. /Discuss:/

## Task 3 (16 points): What do Questions and Answers look like for a Language Model?

While you feel pretty good about the tests you conducted so far, something still bothers you: what if the language models don't see the data like you do? Suddenly, you receive a phone call from a wise AI sage based in Maastricht named Yodata:

```
"Hmmm, correct you are, jonge padawan, to question how the wereld is seen by large language models! Simple 'text,' it is not, nee nee nee! Characters and words, the way of gewone humans, this is not, heh heh heh.

'Tokens,' they use, ja! Mysterious and powerful, these tokens are. Expand our vocabulary, they do, beyond the simple 'a to Z.' Chunky blocks of text, they become, yes! 'Hello world,' a simple phrase it may seem. But to a language model, '[24912, 2375]' it might appear, hmm? Verwarrend, it is!

Wise, it would be, to explore these MMLU data points through the eyes of a language model, you think? Yes, yes! Much to learn, there is. The ways of the tokens, understand you must, if truly comprehend the great LMs, you wish to.

Meditate on this, you should. The force of natural language processing, strong it is. But geduld, you must have, my jonge padawan. For only through great study and contemplation, will the mysteries of the tokens reveal themselves to you, they will. Ja, hmmm!"
```

Admittedly, Yodata at times speaks in riddles... However, he was explaining a crucial aspect of modern LMs called [Tokenization](https://learn.microsoft.com/en-us/dotnet/ai/conceptual/understanding-tokens):


“Tokens are words, character sets, or combinations of words and punctuation that are used by [language models (LMs)] to decompose text into. Tokenization is the first step in training”

Instead of characters, LMs process natural language using “tokens”. While this is useful for a number of reasons, it does at times introduce some “unintuitive” behavior…

In [14]:
# PROVIDED CODE

try:
    import tiktoken
except ImportError:
    print('installing tiktoken package')
    
    !pip install tiktoken
    
    import tiktoken

def tokenize_text(s):
    enc = tiktoken.encoding_for_model('gpt-4o')
    tokens = enc.encode(str(s))
    return tokens

example_string = 'hello world'
print(f'humans see: "{example_string}" --> language models see: {tokenize_text(example_string)}')


### 3.1 (5 pt)

Use the provided code in the cell above to "see the world through the eyes of a language model":

A. Tokenize the questions of the original MMLU data provided in task 1: `task_1/mmlu_data/test.csv` and plot the token distribution (the frequency of each token).

B. Same as (A), but now for the answer options (columns "A", "B", "C", and "D").

C. Isolate the tokens for the strings "A", "B", "C", and "D", then, for their occurrences in both questions and answers, print their relative distribution to each other.

**hint**
- There are a _lot_ of tokens, consider using a cutoff point and log scale
- For (C), they should sum to 1
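
Counting token frequencies and the relative share of a few selected tokens can be sketched with the standard library alone. The helpers `token_frequencies` and `relative_share` are hypothetical names, and `str.split` stands in for a real tokenizer so the example runs offline; in the notebook one would presumably pass `tokenize_text` from the provided cell and compare token ids instead of strings.

```python
from collections import Counter

def token_frequencies(texts, tokenize):
    """Count how often each token occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(str(text)))
    return counts

def relative_share(counts, tokens_of_interest):
    """Share of each selected token relative to the selected tokens only."""
    selected = {tok: counts.get(tok, 0) for tok in tokens_of_interest}
    total = sum(selected.values())
    return {tok: (n / total if total else 0.0) for tok, n in selected.items()}

# Toy texts and a whitespace "tokenizer" (an assumption for this sketch).
toy_texts = ["A or B", "A and C", "D"]
counts = token_frequencies(toy_texts, str.split)
shares = relative_share(counts, ["A", "B", "C", "D"])

print(counts.most_common(3))
print(shares)
```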

In [15]:
#A

In [None]:
#B

In [None]:
#C

### 3.2 (3 pt)

What if the number of "A", "B", "C", and "D" tokens in the question and answer pairs could influence a language model's decisions?

A. For each question-answer pair, compute: 
1. the number of "A", "B", "C", and "D" tokens that occur in the combined question and answers; 
2. the total number of tokens;
3. then, group by the "correct" answer and compute the mean frequency of A, B, C, and D tokens and the total number of tokens. 
4. finally, print your results

B. /Discuss:/ What do you think of the hypothesis that the frequency of A, B, C, and D tokens could influence answers?
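
The counting and grouping steps above might look roughly like this in pandas. Everything here is a toy assumption: the `text` column stands for the combined question and answers, whitespace splitting replaces the real tokenizer, and the `n_A` to `n_D` column names are made up for the sketch.

```python
import pandas as pd

letters = ["A", "B", "C", "D"]

# Toy data: `text` is the combined question and answer options,
# `answer` is the label of the correct option.
toy = pd.DataFrame({
    "text": ["A B A", "C", "A D D D"],
    "answer": ["A", "B", "A"],
})

# Whitespace split as a stand-in tokenizer.
tokens = toy["text"].str.split()

# Per row: how often each letter token occurs, and the total token count.
for letter in letters:
    toy[f"n_{letter}"] = tokens.apply(lambda toks, l=letter: toks.count(l))
toy["total"] = tokens.apply(len)

# Mean counts per correct answer.
count_columns = [f"n_{letter}" for letter in letters] + ["total"]
mean_by_answer = toy.groupby("answer")[count_columns].mean()
print(mean_by_answer)
```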


In [17]:
#A

               A         B         C         D      total
            mean      mean      mean      mean       mean
answer
A       0.243017  0.018932  0.025140  0.013035  93.187151
B       0.231947  0.019642  0.029463  0.012709  88.846332
C       0.226410  0.018984  0.034897  0.015355  92.653825
D       0.242850  0.014566  0.030985  0.014301  92.110169


B. /Discuss:/

### 3.3 (4 pt)

Three of the most important considerations when deciding between language models are:

- Quality
- Costs
- Speed

So far, much of your analysis has focused on quality. However, the government has indicated that they are quite concerned about both the total costs and speed as well. Specifically, it has been brought to their attention that a new `turbo` model has been launched! 

This model is both cheaper and faster than the models you evaluated so far. However, there is a catch: the context length* is much smaller than that of the other LMs. Namely, it can only process **300** tokens during inference. Meanwhile, the other models can process up to 100K tokens! 

*_The “context length” refers to the number of tokens that can be given to an LM as input._

A. Are there subjects where using the cheaper model might be problematic? I.e., where part of the question and answer(s) might not fit completely in the context?

B. /Discuss:/ Can you think of a strategy that would balance the needs of the government?

**hint**:
- An LM needs to have both the question and the different answer options in its context
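
One way to check whether a question and its answer options fit a context limit is to count tokens per row and then aggregate per subject. The sketch below uses invented rows, a whitespace split instead of the real tokenizer, an assumed `question` column name, and a toy limit of 5 tokens in place of the 300 mentioned above.

```python
import pandas as pd

CONTEXT_LIMIT = 5  # toy value; the turbo model in the text allows 300 tokens

# Invented rows; the column names are assumptions modelled on the MMLU file.
toy = pd.DataFrame({
    "subject": ["law", "law", "math"],
    "question": ["one two three four", "one", "one two"],
    "A": ["x", "x", "x"],
    "B": ["y", "y", "y"],
    "C": ["z", "z", "z"],
    "D": ["w", "w", "w"],
})

text_columns = ["question", "A", "B", "C", "D"]

# Total tokens needed for the question plus all four answer options.
toy["n_tokens"] = toy[text_columns].apply(
    lambda row: sum(len(str(cell).split()) for cell in row), axis=1
)
toy["too_long"] = toy["n_tokens"] > CONTEXT_LIMIT

# Fraction of questions per subject that would not fit in the context.
share_too_long = toy.groupby("subject")["too_long"].mean()
print(share_too_long)
```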

In [16]:
#A

B. /Discuss:/

### 3.4 (4 pt)

/Discuss:/ The time has come to give your final recommendation on the use of LMs in education to the government! Taking into account everything you analyzed in all the preceding tasks (1, 2, and 3), please write a short recommendation consisting of 4 bullet points discussing your concerns.

B. /Discuss:/

1.

2.

3.

4.