## Applied Exploration

Come up with a prompt to use in a model comparison and prompt sensitivity experiment, something like:

In [3]:
chat_test1 = [
    {"role": "system", "content": "You are a world-class chef."},
    {"role": "user", "content": "Explain the Maillard reaction to a first-year student in culinary school."},
]

In [4]:
chat_test2 = [
    {"role": "system", "content": "You are an expert computer science professor."},
    {"role": "user", "content": "Explain the Maillard reaction to a first-year student in culinary school."},
]

In [5]:
chat_test3 = [
    {"role": "system", "content": "You are a world-class chef."},
    {"role": "user", "content": "Explain the Maillard reaction to a fourth-year student in culinary school."},
]

In [8]:
OUTPUTDIR = r"./F2_LM_Responses"

Then, create two additional slight variations, one in the system prompt (e.g., *expert professor* instead of *helpful assistant*) and one in the user prompt.

Run each of the three variations using the [gpt-5.2 model](https://developers.openai.com/api/docs/models/gpt-5.2) (you can use web search if you want) three times and record all nine responses. 

Answer the following questions:
* When you repeated the request on the same prompt, how different were the responses?
* Were there any meaningful differences between the variations of the prompt you tried, or were they similar to the differences you noticed on repetitions of the same prompt?
* What changes seem to be the most meaningful?

In [13]:
from openai import OpenAI

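# import API key; I know there's a library for this, but that's not really necessary here
with open("../.env") as envfile:
    # skip blank lines and comments so every remaining line splits into a (key, value) pair
    env = dict(line.split('=', 1) for line in envfile.read().splitlines() if '=' in line and not line.lstrip().startswith('#'))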

client = OpenAI(api_key=env['OPENAI_API_KEY'])

for n, test in enumerate([chat_test1, chat_test2, chat_test3], start=1):
    with open(f'{OUTPUTDIR}/gpt_test{n}.txt', 'w') as outfile:
        outfile.write(f"Responses to test {n} with GPT 5.2\n")
        for _ in range(3):
            outfile.write(f"{'=' * 40} \n\n")
            gpt_response = client.responses.create(
                model="gpt-5.2",
                input=test
            )
            outfile.write(gpt_response.output_text + '\n')



### When you repeated the request on the same prompt, how different were the responses?
* not much difference, honestly
* the words are different, and some of the points are in different orders, but the information was largely the same
    * what the Maillard reaction is
    * how it happens/how to control it
    * common misconceptions
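
One way to put a rough number on "not much difference" is to score the word-level overlap between each pair of repeated responses. The sketch below uses `difflib.SequenceMatcher` from the standard library; the helper name `pairwise_similarity` and the sample strings are made up for illustration, and the score only reflects shared wording, not whether the same information is present.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_similarity(responses):
    """Return the SequenceMatcher ratio (0 to 1) for every pair of responses."""
    scores = {}
    for (i, a), (j, b) in combinations(enumerate(responses), 2):
        # Compare word sequences rather than characters so the score
        # reflects shared wording instead of shared letters.
        matcher = SequenceMatcher(None, a.split(), b.split(), autojunk=False)
        scores[(i, j)] = matcher.ratio()
    return scores


# Hypothetical stand-ins for three repeated responses to one prompt.
samples = [
    "The Maillard reaction browns food and builds flavor.",
    "The Maillard reaction browns food and creates flavor.",
    "Caramelization is a different process entirely.",
]
for pair, score in pairwise_similarity(samples).items():
    print(pair, round(score, 3))
```

A high ratio for repetitions of the same prompt and a lower one across prompt variations would support the impression above, though reordered points lower the score even when the content matches.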

### Were there any meaningful differences between the variations of the prompt you tried, or were they similar to the differences you noticed on repetitions of the same prompt?
* the change in system prompt (world-class chef vs. expert computer science professor) didn't seem to have any effect
* the change in user prompt (first-year vs. fourth-year culinary student) did seem to have a minor effect
    * went into slightly more detail, with a longer response
    * talked about specific chemical reactions that were happening (e.g. reducing sugars reacting with amino groups)
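
The "longer response" impression can be checked by counting words per response in the saved files. This is a minimal sketch that assumes the file layout written by the loops in this notebook (a header line, then each response preceded by a line of 40 `=` characters); `split_responses` and `mean_word_count` are hypothetical helper names, and the example text is invented.

```python
import re
from statistics import mean

# Matches the separator line written before each response ('=' * 40 plus a trailing space).
SEPARATOR = re.compile(r"^={40}[ \t]*$", flags=re.MULTILINE)


def split_responses(file_text):
    """Split the text of one output file into its individual responses."""
    # The first chunk is the "Responses to test N ..." header, so drop it.
    _header, *chunks = SEPARATOR.split(file_text)
    return [chunk.strip() for chunk in chunks]


def mean_word_count(file_text):
    """Average number of whitespace-separated words per response."""
    return mean(len(response.split()) for response in split_responses(file_text))


# Tiny made-up file contents in the same layout as the real output files.
example = (
    "Responses to test 1 with GPT 5.2\n"
    f"{'=' * 40} \n\nshort answer here\n"
    f"{'=' * 40} \n\na slightly longer answer here\n"
)
print(mean_word_count(example))
```

To use it on the real outputs, read each file (for example `gpt_test1.txt` and `gpt_test3.txt` in `OUTPUTDIR`) and compare the averages for the first-year and fourth-year prompts.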

### What changes seem to be the most meaningful?
* changing the user prompt seems much more meaningful than changing the system prompt, at least for this example
* I imagine the system prompt (changing the role of the chatbot) might matter more for questions that need in-depth, domain-specific knowledge
* I wonder if the chatbot has been trained to ignore the system prompt to some extent in favor of the user prompt. It seems reasonable that a user might want to talk about multiple things in one chat (even if only because they aren't very good at using the ChatGPT interface), and framing everything as a computer science problem might not be what most people want

Then, repeat the experiment using a small model like [SmolLM2-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct). 

* What differences did you notice between the large and small models?

In [12]:
from transformers import pipeline
from accelerate import Accelerator

device = Accelerator().device

smol = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-360M-Instruct", device=device)

for n, test in enumerate([chat_test1, chat_test2, chat_test3], start=1):
    with open(f'{OUTPUTDIR}/smol_test{n}.txt', 'w') as outfile:
        outfile.write(f"Responses to test {n} with SmolLM2\n")
        for _ in range(3):
            outfile.write(f"{'=' * 40} \n\n")
            smol_response = smol(test)
            outfile.write(smol_response[0]['generated_text'][-1]['content'] + '\n')


Both `max_new_tokens` (=256) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)

### Differences between large and small models
* obviously, the small language models are much less sophisticated
    * some lovely wisdom: "the Maillard reaction is a chemical reaction between amino acids and reducing sugars in cooked food, typically involving amino acids and reducing sugars"
* the small LM's responses also cut off abruptly; this is most likely a token limit, since the warning in the cell output shows `max_new_tokens` (=256) taking effect
* the small LMs also seem to make it much more dramatic, as if they're giving a presentation about it
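
To see how often the cut-off happens, a crude check is whether a response ends in closing punctuation. The sketch below is only a heuristic under that assumption; `looks_truncated` and its default `terminators` tuple are illustrative choices, not part of any library.

```python
def looks_truncated(response, terminators=(".", "!", "?", '"', ")", "*", "`")):
    """Heuristic: a non-empty response that does not end in closing
    punctuation was probably cut off by the generation limit."""
    stripped = response.rstrip()
    return bool(stripped) and not stripped.endswith(terminators)


# Made-up examples: one complete answer and one that stops mid-sentence.
complete = "The Maillard reaction creates browning and flavor."
cut_off = "The Maillard reaction is a chemical reaction between amino"
print(looks_truncated(complete), looks_truncated(cut_off))
```

If most SmolLM2 responses are flagged and most GPT responses are not, that points to the generation limit; rerunning with a larger `max_new_tokens` passed to the pipeline call would be one way to confirm it.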
