# Has ChatGPT gotten dumber?
<p style="margin-top: -20px; font-size: 0.8em;">By Bryce Brady</p>

Recently, there have been claims that GPT-4's reasoning abilities, coding skills, and overall performance have degraded over time, or that OpenAI has "nerfed" the model. These claims were sparked in part by a paper from Stanford researchers (https://arxiv.org/pdf/2307.09009.pdf) that reports declines in GPT-4 performance over time but uses a horrendous methodology.

As a regular user of GPT-4-0613 since its release, I have not personally noticed any changes. However, these are testable hypotheses, so I decided to conduct an analysis.

In this notebook, I test three different versions of GPT-4 on a sample of problems from the MMLU Formal Logic Benchmark, with temperature set to 0 for all models:

1. GPT-4 on May 8, 2023 (GPT-4-0314): I happened to perform this same analysis back on May 8, so we can compare the May run against today's runs to see if changes were made, as some have claimed. The original analysis and version history are available here: [Smart GPT Eval](https://github.com/Bradybry/SmartGPT_eval)

2. GPT-4-0314 today: I hypothesize the performance will be identical to May, contrary to claims that OpenAI has degraded the model over time.

3. GPT-4-0613 today: I hypothesize the performance will be very similar to past versions.

The benchmark dataset is available from Hugging Face, and a copy is included in `./data/` of this repository: [MMLU Formal Logic](https://huggingface.co/datasets/tasksource/mmlu)

By testing multiple historic versions of GPT-4 on the same logic problems, we can see if performance has actually changed over time, or remained steady as I expect. This should provide evidence to evaluate claims from the Stanford paper and elsewhere that GPT-4 has been "nerfed".

In [1]:
## Imports and helper functions
import pandas as pd
from tqdm import tqdm
import time
import tiktoken
from sklearn.metrics import classification_report
from expert import LanguageExpert
tqdm.pandas()

march_prompt = {'name': 'Test Taker',
                'description': 'Correctly answer the following multiple choice question. Respond in the example_ouput format provided.',
                'example_output': '<answer>{A, B, C, or D}</answer>',
                'model_params' : {'model_name': 'gpt-4-0314', 'temperature': 0.0, 'frequency_penalty': 1.0, 'presence_penalty': 0.5, 'n': 1, 'max_tokens': 512}}
june_prompt = {'name': 'Test Taker',
                'description': 'Correctly answer the following multiple choice question. Respond in the example_ouput format provided.',
                'example_output': '<answer>{A, B, C, or D}</answer>',
                'model_params' : {'model_name': 'gpt-4-0613', 'temperature': 0.0, 'frequency_penalty': 1.0, 'presence_penalty': 0.5, 'n': 1, 'max_tokens': 512}}


def generate_questions(row):
    question = row['Question']
    choices = row['Choices']
    letters = ['A', 'B', 'C', 'D']
    question =  f'<question>{question}</question>\n'
    choices = [f'\n<choice_{letter}>{choice}</choice_{letter}>' for letter, choice in zip(letters, choices)]
    for choice in choices:
        question += choice
    return question

import re
def extract_answer(xml_string):
    # The re.DOTALL flag allows the dot to match newline characters.
    answer_pattern = re.compile(r'<answer>(.*?)<\/answer>', re.IGNORECASE | re.DOTALL)
    match = answer_pattern.search(xml_string)

    if match:
        # Extract the text content, strip any leading/trailing whitespace or newlines, and convert to lowercase.
        answer_text = match.group(1).strip().lower()
        return answer_text
    return None


target_names = {0: "a", 1: "b", 2: "c", 3: "d"}
target_numbers = {v: k for k, v in target_names.items()}


## We will estimate tokens using the OpenAI Ada encoding. Not perfect but probably good enough.
embedding_encoding = "cl100k_base"  # this is the encoding for text-embedding-ada-002 (GPT-4 models use it too)
encoding = tiktoken.get_encoding(embedding_encoding)


import ast
# Define a function to convert string to list
def str_to_list(s):
    try:
        return ast.literal_eval(s)
    except (ValueError, SyntaxError):
        return []
def extract_digit(s):
    match = re.search(r'\d', s)
    if match:
        return int(match.group())
    else:
        return None
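
The helpers above wrap each question in XML-style tags and then pull the letter back out of the model's reply. The sketch below mirrors that logic in a standalone form so the round trip can be seen on one example. The question row and the model replies are hypothetical and are not taken from the dataset or from any actual GPT-4 output.

```python
import re

def generate_questions(row):
    """Wrap a question and its choices in XML-style tags (same logic as the notebook helper)."""
    letters = ['A', 'B', 'C', 'D']
    question = f"<question>{row['Question']}</question>\n"
    for letter, choice in zip(letters, row['Choices']):
        question += f"\n<choice_{letter}>{choice}</choice_{letter}>"
    return question

def extract_answer(xml_string):
    """Return the lowercased text inside <answer>...</answer>, or None if the tag is missing."""
    match = re.search(r'<answer>(.*?)</answer>', xml_string, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip().lower() if match else None

target_numbers = {'a': 0, 'b': 1, 'c': 2, 'd': 3}

# Hypothetical row; a plain dict supports the same row['...'] lookups as a pandas row.
row = {'Question': 'Which formula is a tautology?',
       'Choices': ['P and not P', 'P or not P', 'P', 'not P']}
prompt = generate_questions(row)
print(prompt)

# Hypothetical replies, for illustration only.
parsed = extract_answer('<answer>B</answer>')
print(parsed, target_numbers.get(parsed))   # a well-formed reply maps to a label index

malformed = extract_answer('The answer is B.')   # no <answer> tag at all
braces = extract_answer('<answer>{B}</answer>')  # braces copied from the template
print(malformed, braces, target_numbers.get(braces))
```

Note the last case: a reply that keeps the braces from the `example_output` template parses to `{b}`, which has no entry in `target_numbers`, so it would be scored as a missing prediction rather than as answer B.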

In [2]:
## Load and format data
file = './data/MMLUFL_HF_100.csv'
df = pd.read_csv(file)
df['Choices'] = df['Choices'].apply(str_to_list)
df['Answer'] = df['Answer'].apply(extract_digit)
df['Question'] = df['Question'].str.strip('"').str.strip()
df['formatted_question'] = df.apply(generate_questions, axis=1)
df.head()


| | Question | Choices | Answer | formatted_question |
|---|---|---|---|---|
| 0 | Identify the conclusion of the following argum... | [It is hard not to verify in our peers the sam... | 3 | `<question>Identify the conclusion of the follo...` |
| 1 | Select the best translation into predicate log... | [Tdc, Tcd, Tcc, dTc] | 0 | `<question>Select the best translation into pre...` |
| 2 | Select the best English interpretation of the ... | [Some large houses are bigger than some apartm... | 2 | `<question>Select the best English interpretati...` |
| 3 | Construct a complete truth table for the follo... | [Valid, Invalid. Counterexample when G and H a... | 0 | `<question>Construct a complete truth table for...` |
| 4 | Use the following key to translate the given f... | [If it's not the case that both Izzy plays Min... | 1 | `<question>Use the following key to translate t...` |


In [3]:
df = df.sample(25, random_state=42) # Used the same random state as the original test set from May

## May 8th GPT-4 Test Results

See the original analysis for the commit history of these results: [Smart GPT Eval](https://github.com/Bradybry/SmartGPT_eval)

This run is called the `null_prompt` in the original analysis.

In [4]:
may_stats = {'precision': 0.8044675324675324, 'recall': 0.68, 'f1': 0.6474920634920635, 'support': 25, 'completion_tokens': 6.0, 'prompt_tokens': 204.84, 'latency': 2.67309814453125}

# GPT-4-0314 Test Results

In [5]:
## Generate March Prompt Results
march_expert = LanguageExpert(**march_prompt)

start_time = time.time()
df['march_result'] = df.progress_apply(lambda x: march_expert(x['formatted_question']), axis=1)
time_per_it = (time.time() - start_time)/len(df)

                    frequency_penalty was transferred to model_kwargs.
                    Please confirm that frequency_penalty is what you intended.
                    presence_penalty was transferred to model_kwargs.
                    Please confirm that presence_penalty is what you intended.
100%|██████████| 25/25 [00:29<00:00,  1.19s/it]


In [9]:
## Pull out answers
## TODO - this is a hacky way to do this. Should be a better way.
df['march_answer'] = df.apply(lambda x: extract_answer(x['march_result']), axis=1)
df['march_pred'] = df.march_answer.map(target_numbers)

## Get token counts
df["march_completion_tokens"] = df.march_result.apply(lambda x: len(encoding.encode(x)))

expert_tokens = len(encoding.encode(march_expert.get_content().content))
df["march_prompt_tokens"] = df.formatted_question.apply(lambda x: len(encoding.encode(x))) + expert_tokens

## Calculate performance stats
y_true = df['Answer']
y_pred = df['march_pred']
report = classification_report(y_true, y_pred,target_names=target_names.values(), labels=list(target_names.keys()), zero_division=1, output_dict=True)

### Collect statistics
march_stats = {
    "precision": report['weighted avg']['precision'],
    "recall": report['weighted avg']['recall'],
    "f1":  report['weighted avg']['f1-score'],
    "support": report['weighted avg']['support'],
    "completion_tokens": df["march_completion_tokens"].mean(),
    "prompt_tokens": df["march_prompt_tokens"].mean(),
    "latency": time_per_it
}

print(march_stats)

{'precision': 0.6699047619047619, 'recall': 0.64, 'f1': 0.6059994907053731, 'support': 25, 'completion_tokens': 6.0, 'prompt_tokens': 204.84, 'latency': 1.1876980113983153}


# GPT-4-0613 Test Results

In [10]:
## Generate June Prompt Results
june_expert = LanguageExpert(**june_prompt)

start_time = time.time()
df['june_result'] = df.progress_apply(lambda x: june_expert(x['formatted_question']), axis=1)
time_per_it = (time.time() - start_time)/len(df)

                    frequency_penalty was transferred to model_kwargs.
                    Please confirm that frequency_penalty is what you intended.
                    presence_penalty was transferred to model_kwargs.
                    Please confirm that presence_penalty is what you intended.
100%|██████████| 25/25 [00:29<00:00,  1.19s/it]


In [14]:
## Pull out answers
## TODO - this is a hacky way to do this. Should be a better way.
df['june_answer'] = df.apply(lambda x: extract_answer(x['june_result']), axis=1)
df['june_pred'] = df.june_answer.map(target_numbers)

## Get token counts
df["june_completion_tokens"] = df.june_result.apply(lambda x: len(encoding.encode(x)))

expert_tokens = len(encoding.encode(june_expert.get_content().content))
df["june_prompt_tokens"] = df.formatted_question.apply(lambda x: len(encoding.encode(x))) + expert_tokens

## Calculate performance stats
y_true = df['Answer']
y_pred = df['june_pred']
report = classification_report(y_true, y_pred,target_names=target_names.values(), labels=list(target_names.keys()), zero_division=1, output_dict=True)

### Collect statistics
june_stats = {
    "precision": report['weighted avg']['precision'],
    "recall": report['weighted avg']['recall'],
    "f1":  report['weighted avg']['f1-score'],
    "support": report['weighted avg']['support'],
    "completion_tokens": df["june_completion_tokens"].mean(),
    "prompt_tokens": df["june_prompt_tokens"].mean(),
    "latency": time_per_it
}

print(june_stats)

{'precision': 0.711, 'recall': 0.68, 'f1': 0.676923076923077, 'support': 25, 'completion_tokens': 6.0, 'prompt_tokens': 204.84, 'latency': 1.1905951309204101}


## Analyze the Results

In [15]:
## Let's display a table of the stats before doing any analysis.
from IPython.display import display, HTML, Markdown
data = {"May GPT-4": may_stats, "gpt-4-0314": march_stats,"gpt-4-0613": june_stats}
results = pd.DataFrame(data).transpose()

# Display the HTML table
display(HTML(results.to_html()))

| | precision | recall | f1 | support | completion_tokens | prompt_tokens | latency |
|---|---|---|---|---|---|---|---|
| May GPT-4 | 0.804468 | 0.68 | 0.647492 | 25.0 | 6.0 | 204.84 | 2.673098 |
| gpt-4-0314 | 0.669905 | 0.64 | 0.605999 | 25.0 | 6.0 | 204.84 | 1.187698 |
| gpt-4-0613 | 0.711 | 0.68 | 0.676923 | 25.0 | 6.0 | 204.84 | 1.190595 |
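
When reading the precision and recall columns, it may help to see how the weighted averages are formed. The sketch below is a plain-Python approximation of support-weighted precision and recall, not scikit-learn's implementation, and the labels are hypothetical rather than the notebook's data. It illustrates two points under those assumptions: weighted recall reduces to plain accuracy, and a class that is never predicted gets whatever precision the `zero_division` setting assigns, which shifts the weighted precision.

```python
from collections import Counter

def weighted_precision_recall(y_true, y_pred, labels, zero_division=1.0):
    """Support-weighted precision and recall (illustrative sketch, not sklearn's code)."""
    support = Counter(y_true)
    total = sum(support[label] for label in labels)
    precision = recall = 0.0
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        # If a class is never predicted, its precision is undefined; use the fallback value.
        class_precision = tp / predicted if predicted else zero_division
        class_recall = tp / support[label] if support[label] else zero_division
        weight = support[label] / total
        precision += weight * class_precision
        recall += weight * class_recall
    return precision, recall

# Hypothetical labels: class 3 appears in the truth but is never predicted.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 1, 1, 2, 2, 0, 1]
labels = [0, 1, 2, 3]

p_one, r_one = weighted_precision_recall(y_true, y_pred, labels, zero_division=1.0)
p_zero, r_zero = weighted_precision_recall(y_true, y_pred, labels, zero_division=0.0)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"zero_division=1: precision={p_one:.3f}, recall={r_one:.3f}")
print(f"zero_division=0: precision={p_zero:.3f}, recall={r_zero:.3f}")
print(f"accuracy={accuracy:.3f}")
```

In this toy case the recall is the same under both settings and equals the accuracy, while the precision changes with the fallback value. Since the notebook passes `zero_division=1`, the recall column is probably the more direct measure of how many questions each run got right.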


# Conclusion

I set out with two hypotheses: 

1. The May GPT-4 and GPT-4-0314 (today) would have identical performance.
2. GPT-4-0613 would have slightly different but overall equal or better performance.

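The first hypothesis was wrong based on the results. GPT-4-0314 scored lower today than it did in May on this benchmark (weighted precision 0.67 vs. 0.80, recall 0.64 vs. 0.68, F1 0.61 vs. 0.65), which suggests the model's behavior has changed over time. Note that the recall gap of 0.04 amounts to one question out of 25.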

The second hypothesis was supported: GPT-4-0613 achieved the highest weighted F1 of the three runs (0.68, vs. 0.65 for May GPT-4 and 0.61 for GPT-4-0314 today) and matched the May run's recall of 0.68, although its precision (0.71) was below the May run's (0.80). However, this is a single benchmark with only 25 examples, so the conclusion should not be extended too broadly.
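
To put the small sample size in perspective, here is a rough sketch that attaches a Wilson score interval to each run's score. It assumes the weighted recall in the results table equals plain accuracy (which holds when every true label is among the scored classes), so that recall times 25 gives the number of correct answers. The choice of the Wilson interval and of `z=1.96` (about 95% coverage) is mine and is not part of the original analysis.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95% coverage)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width

# Weighted recall values from the results table; 25 questions per run.
recalls = {"May GPT-4": 0.68, "gpt-4-0314": 0.64, "gpt-4-0613": 0.68}
n_questions = 25

intervals = {}
for name, recall in recalls.items():
    correct = round(recall * n_questions)  # assumption: weighted recall == accuracy
    intervals[name] = wilson_interval(correct, n_questions)
    low, high = intervals[name]
    print(f"{name}: {correct}/{n_questions} correct, approx. 95% interval [{low:.2f}, {high:.2f}]")
```

Under these assumptions each interval is more than 0.3 wide and the three overlap heavily, so a one-question gap between runs is hard to separate from noise at this sample size.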

OpenAI is making changes to the model over time that are causing variations in performance. This analysis suggests the model is getting better at formal reasoning, though it may be getting worse in other areas. There may be some truth to the claims that OpenAI is nerfing the model, but this analysis shows little evidence of it. In fact, the latest model is the most performant on the test set used in this analysis.