# Step 1: Generate Candidate Answers

First, generate a set of candidate answers for each prompt using a small pre-trained base model. The following code will:

1. Define 5 prompts.
2. Load a pre-trained model and tokenizer (`distilgpt2`).
3. Generate 4 candidate answers for each prompt.
4. Save the prompts and answers to `q2_reward/answers.csv` with a placeholder `rank` column.

In [None]:
import pandas as pd
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# 1. Define prompts
prompts = [
    "Tell me a joke about a programmer.",
    "Summarize the plot of the movie Inception in one paragraph.",
    "Write a mini-essay on the importance of recycling.",
    "What is the difference between a fruit and a vegetable?",
    "Explain the concept of machine learning to a 5-year-old."
]

# 2. Load model and tokenizer
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Set pad token if it's not set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

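# 3. Generate answers
num_candidates = 4  # candidate answers generated per prompt
data = []
for prompt in prompts:
    # A single prompt needs no padding
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    prompt_length = inputs["input_ids"].shape[1]
    outputs = model.generate(
        **inputs,
        max_length=100,  # counts prompt tokens plus generated tokens
        num_return_sequences=num_candidates,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        no_repeat_ngram_size=2,
        pad_token_id=tokenizer.pad_token_id,  # avoids the missing-pad-token warning
    )

    for output in outputs:
        # Decode only the newly generated tokens so the prompt is not repeated in the answer
        answer = tokenizer.decode(output[prompt_length:], skip_special_tokens=True).strip()
        data.append({
            'prompt': prompt,
            'answer': answer,
            'rank': 1  # Placeholder rank, to be replaced by hand in Step 2
        })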

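# 4. Save to CSV (create the output directory first if it is missing)
import os
os.makedirs('q2_reward', exist_ok=True)
df = pd.DataFrame(data)
df.to_csv('q2_reward/answers.csv', index=False)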

print("Successfully generated and saved answers to `q2_reward/answers.csv`.")
print("Please manually edit `q2_reward/answers.csv` to rank the answers for each prompt from 1 (best) to 4 (worst).")

# Step 2: Manually Rank the Answers

Now, open the `q2_reward/answers.csv` file in a spreadsheet editor or a text editor. Each prompt has 4 generated answers. Evaluate them and replace the placeholder in the `rank` column with a rank from 1 to 4, where 1 is the best answer for that prompt and 4 is the worst.

**Do not proceed until you have ranked all the answers.**
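
Before moving on, it can help to check the ranking mechanically. The sketch below is one possible check, not part of the original workflow: the helper name `find_rank_problems` is hypothetical, and it assumes each prompt should use every rank from 1 to 4 exactly once. It works on a list of row dictionaries, such as the output of `csv.DictReader` or `df.to_dict('records')`.

```python
from collections import defaultdict


def find_rank_problems(rows, n_candidates=4):
    """Return {prompt: ranks} for prompts whose ranks are not exactly 1..n_candidates.

    Hypothetical helper: assumes every row has 'prompt' and 'rank' keys and
    that each prompt should use each rank once.
    """
    ranks_by_prompt = defaultdict(list)
    for row in rows:
        ranks_by_prompt[row["prompt"]].append(int(row["rank"]))

    expected = list(range(1, n_candidates + 1))
    return {
        prompt: ranks
        for prompt, ranks in ranks_by_prompt.items()
        if sorted(ranks) != expected
    }


# Tiny in-memory example: prompt "A" is fully ranked, prompt "B" still has placeholders.
example_rows = [
    {"prompt": "A", "answer": "a1", "rank": 2},
    {"prompt": "A", "answer": "a2", "rank": 1},
    {"prompt": "A", "answer": "a3", "rank": 4},
    {"prompt": "A", "answer": "a4", "rank": 3},
    {"prompt": "B", "answer": "b1", "rank": 1},
    {"prompt": "B", "answer": "b2", "rank": 1},
    {"prompt": "B", "answer": "b3", "rank": 1},
    {"prompt": "B", "answer": "b4", "rank": 1},
]

problems = find_rank_problems(example_rows)
print(problems)
```

An empty result means every prompt is fully ranked. Any prompt that still carries the placeholder rank of 1 on all rows shows up in the returned dictionary.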

After you have ranked the answers, you can train the reward model by running the `train.py` script from your terminal:
```bash
python q2_reward/train.py
```
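
The contents of `train.py` are not shown here, so how it uses the ranks is unknown. As an illustration only, one common way to turn per-prompt ranks into a training signal for a reward model is to expand them into (chosen, rejected) preference pairs. The helper name `build_preference_pairs` is hypothetical, and nothing below is confirmed to match the actual script.

```python
from itertools import combinations


def build_preference_pairs(rows):
    """Expand ranked answers into (chosen, rejected) pairs for each prompt.

    Hypothetical helper: assumes rows have 'prompt', 'answer' and 'rank' keys,
    and that a lower rank number means a better answer.
    """
    by_prompt = {}
    for row in rows:
        by_prompt.setdefault(row["prompt"], []).append(row)

    pairs = []
    for prompt, group in by_prompt.items():
        for a, b in combinations(group, 2):
            rank_a, rank_b = int(a["rank"]), int(b["rank"])
            if rank_a == rank_b:
                continue  # tied answers carry no preference
            chosen, rejected = (a, b) if rank_a < rank_b else (b, a)
            pairs.append({
                "prompt": prompt,
                "chosen": chosen["answer"],
                "rejected": rejected["answer"],
            })
    return pairs


# Made-up rows for a single prompt: answer "a" is ranked best, "d" worst.
ranked_rows = [
    {"prompt": "Q", "answer": "c", "rank": 3},
    {"prompt": "Q", "answer": "a", "rank": 1},
    {"prompt": "Q", "answer": "d", "rank": 4},
    {"prompt": "Q", "answer": "b", "rank": 2},
]

pairs = build_preference_pairs(ranked_rows)
print(len(pairs))
```

With 4 distinctly ranked answers per prompt, this yields 6 pairs per prompt, which is why a complete ranking matters before training.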

# Step 3: Evaluate the Reward Model

Once the reward model is trained, we can use it to score answers. The following code will:

1. Load the trained reward model and tokenizer from the `q2_reward/reward_model/` directory.
2. Load the ranked answers from `q2_reward/answers.csv`.
3. Calculate the reward score for each answer.
4. Plot the reward scores against the manual ranks to see if they correlate.

In [None]:
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import matplotlib.pyplot as plt
import seaborn as sns
import os

# 1. Load the reward model and tokenizer
model_path = 'q2_reward/reward_model'
if not os.path.exists(model_path):
    print(f"Model directory `{model_path}` not found.")
    print("Please train the model first by running `python q2_reward/train.py`.")
else:
    reward_tokenizer = AutoTokenizer.from_pretrained(model_path)
    reward_model = AutoModelForSequenceClassification.from_pretrained(model_path)

    # 2. Load ranked answers
    df = pd.read_csv('q2_reward/answers.csv')

    # 3. Calculate reward scores
    scores = []
    for _, row in df.iterrows():
        text = f"{row['prompt']} {row['answer']}"
        inputs = reward_tokenizer(text, return_tensors="pt", truncation=True, padding=True)
        with torch.no_grad():
            score = reward_model(**inputs).logits[0].item()
        scores.append(score)

    df['score'] = scores

    # 4. Plot the results
    plt.style.use('seaborn-v0_8-whitegrid')
    fig, ax = plt.subplots(figsize=(10, 6))

    sns.boxplot(x='rank', y='score', data=df, ax=ax)

    ax.set_title('Reward Score vs. Manual Rank')
    ax.set_xlabel('Manual Rank (1=Best, 4=Worst)')
    ax.set_ylabel('Reward Score')
    plt.show()
    print("\nAnalysis of the plot:")
    print("A successful reward model should show a decreasing trend in scores as the rank increases (from 1 to 4).")
    print("This means that higher-quality answers (rank 1) should receive higher scores from the model.")
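
The decreasing trend described above can also be checked numerically rather than by eye. The following is a minimal sketch under stated assumptions: the helper names `mean_score_by_rank` and `is_decreasing` are hypothetical, the scores in the example are invented for illustration, and the rows could come from `df.to_dict('records')` once the `score` column exists.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_rank(rows):
    """Return {rank: mean reward score}, ordered by rank."""
    scores = defaultdict(list)
    for row in rows:
        scores[int(row["rank"])].append(float(row["score"]))
    return {rank: mean(values) for rank, values in sorted(scores.items())}


def is_decreasing(means):
    """True if the mean score strictly drops as the rank number increases."""
    values = [means[rank] for rank in sorted(means)]
    return all(a > b for a, b in zip(values, values[1:]))


# Invented scores for illustration only, two answers per rank.
scored_rows = [
    {"rank": 1, "score": 2.0}, {"rank": 1, "score": 1.0},
    {"rank": 2, "score": 0.5}, {"rank": 2, "score": 0.5},
    {"rank": 3, "score": -0.5}, {"rank": 3, "score": -1.5},
    {"rank": 4, "score": -2.0}, {"rank": 4, "score": -3.0},
]

means = mean_score_by_rank(scored_rows)
print(means)
print(is_decreasing(means))
```

Note that these scores are computed on the same answers used for training, so a clean decreasing trend shows that the model fits the rankings, not that it generalizes to new answers.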
