# Prompt Engineering

In this notebook, we will use the GPT-2 model to explore prompt engineering.

## Loading the model

We can load the pretrained GPT-2 weights and tokenizer from the [Hugging Face](https://huggingface.co/openai-community/gpt2) model hub, using the `transformers` library.

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model     = GPT2LMHeadModel.from_pretrained("gpt2")

Let's create a simple function to generate responses, given a prompt.

In [None]:
import torch

def generate_text(prompt, max_length=20, num_return_sequences=1):
    """Sample continuations of `prompt`; `max_length` is the total token count, prompt included."""
    # Tokenize input
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"]

    # Generate text
    output_sequences = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_length=max_length,
        num_return_sequences=num_return_sequences,
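        do_sample=True,                       # sample tokens instead of greedy decoding
        temperature=0.5,                      # below 1 favors the most likely tokens
        repetition_penalty=1.2,               # above 1 penalizes tokens that already appeared
        pad_token_id=tokenizer.eos_token_id   # GPT-2 has no pad token, so reuse EOS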
    )

    # Decode and return generated sequences
    return [tokenizer.decode(output_sequence, skip_special_tokens=True) for output_sequence in output_sequences]

def print_response(prompt, n_responses=5, max_length=20):
    generated_text = generate_text(
        prompt, max_length=max_length,
        num_return_sequences=n_responses
    )

    print(f"Prompt:\n{prompt}\n")
    print("Generated Responses:")
    for response in generated_text:
        print(f"- {response}")
        print("\n")

And we can test it out!

In [None]:
print_response("Once upon a time in a land far away, there lived a", n_responses=5)
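
Running the cell several times gives different responses because `generate_text` samples tokens with `do_sample=True` and `temperature=0.5`. As a rough illustration of what the temperature does, the sketch below divides a set of scores by the temperature before applying a softmax. The function name and the toy scores are made up for this example and are not taken from GPT-2.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate tokens (illustrative only)
toy_logits = [2.0, 1.0, 0.0]
for t in (1.0, 0.5):
    probs = softmax_with_temperature(toy_logits, temperature=t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

Lowering the temperature moves probability toward the highest-scoring token, so the sampled text becomes less varied.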

## Hallucination

A very common problem with large language models is **hallucination**: the confident reporting of false information as true. GPT-2, a small and early model, is especially prone to it. Let's try a few examples and see whether we can beat the model and become better prompters.

### Capital of France

In [None]:
query = "Q: What is the capital of France?\nA:"

print_response(query, n_responses=5, max_length=20)

You may notice that the responses are not very good at all. The model may get it right a couple of times, but it is likely to be inconsistent. Try to write a prompt that performs better, using prompt engineering techniques that you learned from the lecture. Note that `max_length` counts the prompt tokens as well as the generated ones, so increase it as your prompt grows longer.
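
Since `generate_text` passes `max_length` to `model.generate`, where it bounds the total sequence length (prompt plus continuation), one way to pick a value is to count the prompt tokens and add a budget for new ones. The helper below is a minimal sketch; `max_length_for` and its `new_tokens` parameter are names invented here, and it assumes the tokenizer returns a dictionary with an `"input_ids"` list, as the `tokenizer` loaded above does.

```python
def max_length_for(prompt, tokenizer, new_tokens=10):
    """Return a total length budget: prompt tokens plus `new_tokens` generated ones."""
    # Calling the tokenizer without return_tensors gives a plain list of token ids
    n_prompt_tokens = len(tokenizer(prompt)["input_ids"])
    return n_prompt_tokens + new_tokens
```

In this notebook it could be used as `print_response(query, max_length=max_length_for(query, tokenizer))`. Recent versions of `transformers` also accept a `max_new_tokens` argument in `generate`, which counts only the generated tokens.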

In [None]:
query = "Your query here"
print_response(query, n_responses=5, max_length=20)

### Simple math

In [None]:
query = "Q: What is 5 x 3?\nA:"

print_response(query, n_responses=5, max_length=20)

The responses here are usually remarkably bad. Again, see if you can come up with a prompt that guides the model towards the correct answer.
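
One technique you could try is few-shot prompting, where the prompt begins with a few solved examples in the same `Q:`/`A:` format. This is only one option and may not be the technique from the lecture. The helper name `build_few_shot_prompt` and the example questions below are made up for illustration, and there is no guarantee that GPT-2 will answer correctly with this prompt.

```python
def build_few_shot_prompt(examples, question):
    """Format solved (question, answer) pairs, followed by the open question."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    # End on "A:" with no trailing space so the model continues with the answer
    blocks.append(f"Q: {question}\nA:")
    return "\n".join(blocks)

# Illustrative solved examples in the same format as the target question
examples = [("What is 2 x 4?", "8"), ("What is 3 x 3?", "9")]
few_shot_query = build_few_shot_prompt(examples, "What is 5 x 3?")
print(few_shot_query)
```

A prompt like this is much longer than the original question, so `max_length` needs to be raised accordingly before passing it to `print_response`.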

In [None]:
query = "Your query here"

print_response(query, n_responses=5, max_length=30)

You may find that you can get the model to output a number, but perhaps not always the correct number. A simple LLM like GPT-2 has no built-in reasoning capabilities; the only reasoning it can do is whatever is encoded in the language patterns it has learned.
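
To compare prompts more systematically than by eye, you could count how often the first number in the generated continuation matches the expected answer. The sketch below assumes that each response begins with the prompt itself, which is how `generate_text` returns its output; the function names are invented for this example.

```python
import re

def first_number_after(prompt, response):
    """Return the first integer in the text generated after the prompt, or None."""
    # Assumption: the response starts with the prompt; otherwise report no answer
    if not response.startswith(prompt):
        return None
    continuation = response[len(prompt):]
    match = re.search(r"-?\d+", continuation)
    return int(match.group()) if match else None

def fraction_correct(prompt, responses, expected):
    """Fraction of responses whose first generated number equals `expected`."""
    if not responses:
        return 0.0
    hits = sum(first_number_after(prompt, r) == expected for r in responses)
    return hits / len(responses)
```

For example, `fraction_correct(query, generate_text(query, max_length=30, num_return_sequences=20), 15)` would estimate how often a given `query` leads to the answer 15. Because the output is sampled, the estimate will vary from run to run.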

## Bias

Large language models are trained on data produced by humans. This data comes with biases. It is worth noting that, unless we are careful, biases can make their way into a model through the training data, even if the model itself is not trained to be biased. In general, de-biasing models (both physics models and LLMs) is a difficult problem, and an active area of research.

GPT-2 was trained on a large dataset without effective de-biasing. Let's look at a clear example of bias, to illustrate the problem. 

A word of warning: GPT-2 samples its output at random, and might produce material that is violent, sexist, or racist. We have tried to choose an example that is not too offensive in this regard; nevertheless, if this might bother you, we recommend that you skip this section.

In [None]:
print_response("The person is a man, so he works as")

In [None]:
print_response("The person is a woman, so she works as")

You will likely see that the model has encoded some ideas about how gender determines a person's career. Other forms of bias can be more extreme, or more subtle. It is worth watching out for them.

This is a problem not only socially, but also in your physics research. Training a model on a biased dataset (for example, a simulated dataset with mismodeling) can cause problems when the model is applied to measurements that it has not seen before. Be careful when using your models, and do not ever trust them completely.