 1. Setup & Installation  
 Install the necessary Python libraries for the experiment.  

In [3]:
# %%
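# torch and accelerate are needed by the local Hugging Face pipelines (device_map="auto")
!pip install -q transformers datasets openai matplotlib tqdm torch accelerate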




 2. Import Libraries  
 Import the necessary Python modules for the experiment.  

In [5]:
# %%
import openai
from transformers import pipeline, AutoTokenizer
import numpy as np
from datasets import load_dataset
import matplotlib.pyplot as plt
from tqdm import tqdm


 3. Hugging Face Authentication  
Log in to Hugging Face to access gated models like Mistral.  

In [None]:
# %%
!pip install -q huggingface_hub

In [None]:
# %%
# @title 🔐 Authentication Setup
from huggingface_hub import notebook_login
notebook_login()


 4. What's Happening So Far?  
 **Step 1**: Installed all required libraries.  
 **Step 2**: Imported the necessary modules.  
 **Step 3**: Authenticated with Hugging Face to access gated models.  
 
 Next, we'll configure the experiment and load the dataset.  

 5. Next Steps  
 In the next section, we'll:  
 1. Load the GSM8K dataset.  
 2. Set up the Chain-of-Thought (CoT) prompt.  
 3. Run experiments on different models.  

In [None]:
# %%
# Load the GSM8K dataset
dataset = load_dataset("gsm8k", "main")['test']
print(f"Loaded {len(dataset)} math problems!")
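
 Each GSM8K record has a `question` field and an `answer` field, and the answer text ends with `#### ` followed by the final number. A minimal sketch of pulling that number out, assuming the marker is present in every record (the helper name `extract_gold_answer` is illustrative and not part of the notebook):

```python
def extract_gold_answer(answer_text: str) -> str:
    """Return the text after the last '####' marker, with thousands separators removed."""
    final = answer_text.split("####")[-1].strip()
    return final.replace(",", "")


# Hand-written string in the GSM8K answer layout (not an actual dataset record).
sample = "She has 3 + 4 = 7 apples.\n#### 7"
print(extract_gold_answer(sample))
```

 In the notebook this could be applied to `dataset[0]["answer"]` once the dataset has loaded.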

 1. Import OpenAI and Define Models  
 In this step, we:  
 - Import the `openai` library to interact with GPT-4.  
 - Define a dictionary (`MODELS`) to store configurations for the models we'll use in the experiment.  
 
 Key Points:  
 - **`MODELS` Dictionary**:  
   - Contains configurations for GPT-4, Mistral-7B, and Phi-3.  
   - Every entry has a `"type"`: `"api"` for OpenAI models, `"local"` for Hugging Face models.  
   - Local entries also have:  
     - `"model_name"`: The Hugging Face model identifier (e.g., `"mistralai/Mistral-7B-Instruct-v0.1"`).  
     - `"max_length"`: Maximum number of new tokens the model may generate.  
   - The GPT-4 entry has no `"model_name"` or `"max_length"`; it carries only a `"cot": True` flag.  
 - **Why This Matters**:  
   - This setup allows us to easily switch between different models during the experiment.  

In [None]:
import openai
MODELS = {
    "gpt-4": {"type": "api", "cot": True},
    "mistral-7b": {
        "type": "local", 
        "model_name": "mistralai/Mistral-7B-Instruct-v0.1",
        "max_length": 512
    },
    "phi-3": {
        "type": "local",
        "model_name": "microsoft/phi-3-mini-128k-instruct",
        "max_length": 1024
    }
}
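
 Because API and local entries carry different keys, a quick check before a long run can catch a missing field early. The sketch below is one possible way to do that; `validate_models` and `REQUIRED_KEYS` are hypothetical names, and the required-key sets are an assumption read off the dictionary above.

```python
# Assumed required keys per model type, based on the MODELS dictionary above.
REQUIRED_KEYS = {
    "api": {"type"},
    "local": {"type", "model_name", "max_length"},
}


def validate_models(models: dict) -> list:
    """Return a list of problems found; an empty list means every entry looks usable."""
    problems = []
    for name, config in models.items():
        model_type = config.get("type")
        if model_type not in REQUIRED_KEYS:
            problems.append(f"{name}: unknown type {model_type!r}")
            continue
        missing = REQUIRED_KEYS[model_type] - set(config)
        if missing:
            problems.append(f"{name}: missing {sorted(missing)}")
    return problems


# Deliberately incomplete entry to show what a failure report looks like.
demo = {"broken-local": {"type": "local", "model_name": "some/model"}}
print(validate_models(demo))
```

 Calling `validate_models(MODELS)` in the notebook should return an empty list for the configuration shown above.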





 2. Define Tasks  
 In this step, we:  
- Define a dictionary (`TASKS`) to store configurations for the tasks we'll test.  
 
 Key Points:  
 - **`TASKS` Dictionary**:  
   - Contains configurations for two tasks:  
     - `"gsm8k"`: A dataset of grade school math problems.  
     - `"mmlu"`: A dataset of multiple-choice questions, focused on philosophy.  
   - Each task has:  
     - `"dataset"`: The name of the dataset.  
     - `"split"`: The dataset split to use (e.g., `"test"`).  
     - `"subject"` (optional): Specific subject for MMLU.  
 - **Why This Matters**:  
   - This setup allows us to easily load and test different datasets.  

In [None]:
TASKS = {
    "gsm8k": {"dataset": "gsm8k", "split": "test"},
    "mmlu": {"dataset": "cais/mmlu", "split": "test", "subject": "philosophy"}
}
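
 A task entry can be turned into arguments for `load_dataset` with a small helper. This is a sketch under two assumptions: that `"subject"` is meant to be the dataset configuration name for MMLU, and that GSM8K uses the `"main"` configuration as in the earlier loading cell. The name `load_args` is hypothetical.

```python
def load_args(task: dict, default_config: str = "main") -> tuple:
    """Map a TASKS entry to (path, config_name, split) for datasets.load_dataset."""
    config_name = task.get("subject", default_config)
    return task["dataset"], config_name, task["split"]


# Copy of the TASKS dictionary so the example runs on its own.
TASKS_DEMO = {
    "gsm8k": {"dataset": "gsm8k", "split": "test"},
    "mmlu": {"dataset": "cais/mmlu", "split": "test", "subject": "philosophy"},
}

for task_name, task in TASKS_DEMO.items():
    path, config_name, split = load_args(task)
    print(f"{task_name}: load_dataset({path!r}, {config_name!r}, split={split!r})")
    # In the notebook (requires network access):
    # dataset = load_dataset(path, config_name, split=split)
```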




 3. Define Chain-of-Thought Prompt  
 In this step, we:  
 - Define a template (`COT_PROMPT`) for generating Chain-of-Thought (CoT) responses.  
 
 Key Points:  
 - **`COT_PROMPT`**:  
   - A template that asks the model to "think out loud" by breaking the problem into steps.  
   - `{question}`: Placeholder for the actual question.  
   - Abridged template:  
     ```
     Let's solve this problem step by step.  
     Question: {question}  
     Reasoning Steps:  
     ```  
 - **Why This Matters**:  
   - CoT prompting often improves accuracy on multi-step problems because the model writes out its intermediate reasoning before committing to an answer.  
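
 Filling the placeholder is a single `str.format` call. The sketch below uses a shortened stand-in template (`DEMO_TEMPLATE`) and a made-up question, both purely for illustration; `build_prompt` is a hypothetical helper.

```python
# Shortened stand-in for the notebook's template, for illustration only.
DEMO_TEMPLATE = (
    "Let's solve this problem step by step.\n\n"
    "Question: {question}\n\n"
    "Reasoning Steps:\n"
)


def build_prompt(template: str, question: str) -> str:
    """Insert the question into the template, trimming stray whitespace first."""
    return template.format(question=question.strip())


# Made-up question, not taken from GSM8K.
prompt = build_prompt(DEMO_TEMPLATE, "  A pen costs 2 dollars. How much do 3 pens cost?  ")
print(prompt)
```

 If the template itself ever needs literal braces, `str.format` requires them to be doubled (`{{` and `}}`).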

In [None]:
COT_PROMPT = """Let's solve this problem step by step. 

Question: {question}

First, identify the key components and reasoning requirements. Then work through each logical step carefully. Finally, end with a single line of the form `Final Answer: <answer>`, giving only the final value.

Reasoning Steps:
"""

 4. Define Helper Function to Generate Responses  
 In this step, we:  
 - Define a function (`generate_response`) to generate responses using the specified model.  
 
 ### Key Points:  
 - **`generate_response` Function**:  
   - Takes `model_config` (model settings) and `prompt` (input question) as inputs.  
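   - **For API Models (GPT-4)**:  
     - Sends the prompt to GPT-4 through OpenAI's chat completions endpoint (`openai.chat.completions.create` in `openai` 1.0 and later, `openai.ChatCompletion.create` in earlier releases).  
     - `temperature=0.3`: A low sampling temperature that keeps answers mostly consistent across runs while still allowing some variation.  
     - `max_tokens=600`: Caps the response at 600 tokens.  
   - **For Local Models (Mistral, Phi-3)**:  
     - Uses Hugging Face's `pipeline` for text generation.  
     - `device_map="auto"`: Places the model on a GPU when one is available (requires the `accelerate` package).  
     - `do_sample=True`: Enables sampling, so repeated runs can give different responses.  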
   - **Error Handling**:  
     - Catches and prints errors, returning an empty string if something goes wrong.  
 - **Why This Matters**:  
   - This function allows us to easily generate responses from any model in the `MODELS` dictionary.  

In [None]:
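# Most recently loaded local pipeline, kept so the model is not reloaded for every question
_LOADED = {"name": None, "pipe": None}


def generate_response(model_config, prompt):
    """Generate response with CoT prompting"""
    try:
        if model_config["type"] == "api":
            # openai>=1.0 interface; openai.ChatCompletion.create was removed in 1.0
            response = openai.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.3,
                max_tokens=600
            )
            return response.choices[0].message.content
        else:
            name = model_config["model_name"]
            if _LOADED["name"] != name:
                _LOADED["pipe"] = None  # release the previous model before loading the next
                _LOADED["pipe"] = pipeline("text-generation", model=name, device_map="auto")
                _LOADED["name"] = name
            outputs = _LOADED["pipe"](
                prompt,
                max_new_tokens=model_config["max_length"],
                do_sample=True,
                temperature=0.3,
                return_full_text=False  # return only the completion, not the echoed prompt
            )
            return outputs[0]['generated_text']
    except Exception as e:
        print(f"Error generating response: {str(e)}")
        return ""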


 ## 6. Define Function to Evaluate Responses  
 In this step, we:  
 - Define a function (`evaluate_response`) to check if the model's response matches the correct answer.  
 
 ### Key Points:  
 - **Purpose**:  
   - This function evaluates the accuracy of the model's response by comparing it to the correct answer.  
   - It handles two types of tasks:  
     - **GSM8K (Math Problems)**: Extracts the final answer from the Chain-of-Thought response.  
     - **MMLU (Multiple-Choice Questions)**: Checks if the predicted letter matches the correct answer.  
 
 ### How It Works:  
 - **For GSM8K**:  
   - Splits the response at `"Final Answer:"` to extract the model's final answer.  
   - Compares the extracted answer to the correct answer (after removing formatting like `####`).  
   - Example:  
     ```
     Response: "Final Answer: 42"  
     Correct Answer: "#### 42"  
     Result: True  
     ```  
 
 - **For MMLU**:  
   - Takes the last character of the response as the predicted letter, so the letter should come at the very end of the response.  
   - Compares it to the correct answer, with both converted to lowercase.  
   - Example:  
     ```
     Response: "The answer is A"  
     Correct Answer: "A"  
     Result: True  
     ```  
 
 - **Error Handling**:  
   - If anything goes wrong (e.g., invalid response format), the function returns `False`.  
 
 

In [None]:
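def evaluate_response(response, correct_answer, task_type):
    """Evaluate response accuracy"""
    try:
        if task_type == "gsm8k":
            # Extract final answer from CoT
            answer_blocks = response.split("Final Answer:")
            if len(answer_blocks) > 1:
                predicted = answer_blocks[-1].strip().split("\n")[0]
                # Drop formatting such as "$", thousands separators and a trailing period
                predicted = predicted.replace("$", "").replace(",", "").strip().rstrip(".")
                gold = correct_answer.split("####")[-1].replace(",", "").strip()
                return predicted == gold
        elif task_type == "mmlu":
            # cais/mmlu stores the answer as a class index (0-3); map it to a letter
            if isinstance(correct_answer, int):
                correct_answer = "ABCD"[correct_answer]
            # Multiple-choice letter extraction: last character, ignoring trailing punctuation
            cleaned = response.strip().rstrip(".!)")
            predicted = cleaned[-1].lower() if cleaned else ""
            return predicted == correct_answer.lower()
        return False
    except Exception:
        return False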
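
 Models do not always produce a `Final Answer:` line, in which case the check above counts the response as wrong. One possible fallback, shown here as a heuristic sketch rather than the notebook's method, is to take the last number that appears anywhere in the response. The names `NUMBER_PATTERN` and `last_number` are hypothetical.

```python
import re
from typing import Optional

# Optional minus sign, digits with optional thousands separators, optional decimals.
NUMBER_PATTERN = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def last_number(text: str) -> Optional[str]:
    """Return the last number found in text with commas removed, or None if there is none."""
    matches = NUMBER_PATTERN.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


print(last_number("6 * 7 = 42, so she needs 42 tiles"))
print(last_number("The total is $1,234.50."))
print(last_number("I am not sure."))
```

 This heuristic can pick up the wrong value when the response ends with an unrelated number, and string results such as `"1234.50"` and `"1234.5"` differ, so converting both sides with `float()` before comparing may be safer.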

 7. Define Function to Run GSM8K Experiment  
 In this step, we:  
 - Define a function (`run_gsm8k_experiment`) to test models on the GSM8K dataset.  
 
 Key Points:  
 - **Purpose**:  
   - This function runs experiments on the GSM8K dataset using the models defined earlier.  
   - It evaluates each model's accuracy and average response length.  

How It Works:  
 1. **Load Dataset**:  
   - Loads the GSM8K dataset and selects a subset of `num_samples` questions.  
    - Example: `dataset = load_dataset("gsm8k", "main")['test'].shuffle().select(range(50))`  
 
 2. **Initialize Results**:  
    - Creates a dictionary (`results`) to store accuracy and response length for each model.  
 
 3. **Test Each Model**:  
    - Loops through the `MODELS` dictionary to test each model.  
    - For API models (e.g., GPT-4), prompts the user to enter their OpenAI API key.  
    - For each question:  
      - Formats the question using the `COT_PROMPT` template.  
      - Generates a response using the `generate_response` function.  
      - Evaluates the response using the `evaluate_response` function.  
      - Tracks the number of correct answers.  
 
 4. **Store Results**:  
    - Calculates accuracy (`correct / num_samples`).  
    - Calculates average response length in characters using `np.mean`.  
    - Stores results in the `results` dictionary.  
 
 5. **Return Results**:  
    - Returns the `results` dictionary containing accuracy and average response length for each model.  
 
  

In [None]:
def run_gsm8k_experiment(num_samples=50):
    print("🚀 Loading GSM8K Dataset...")
    dataset = load_dataset("gsm8k", "main")['test'].shuffle().select(range(num_samples))
    
    results = {}
    for model_name, config in MODELS.items():
        print(f"\n🔍 Testing {model_name}...")
        correct = 0
        responses = []
        
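        if config["type"] == "api":
            # getpass hides the key instead of echoing it into the notebook output
            from getpass import getpass
            openai.api_key = getpass("Enter OpenAI API key: ").strip()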
        
        for example in tqdm(dataset, desc=model_name):
            prompt = COT_PROMPT.format(question=example["question"])
            response = generate_response(config, prompt)
            responses.append(response)
            
            if evaluate_response(response, example["answer"], "gsm8k"):
                correct += 1
        
        results[model_name] = {
            "accuracy": correct/num_samples,
            "avg_length": np.mean([len(r) for r in responses])
        }
    
    return results

In [None]:
results = run_gsm8k_experiment(num_samples=10)
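
 With only 10 samples, each accuracy figure is a rough estimate. A simple way to show that uncertainty is the binomial standard error, sqrt(p(1 - p) / n). The sketch below assumes `results` has the shape returned by `run_gsm8k_experiment`; the helper names `accuracy_with_stderr` and `summarize` are hypothetical.

```python
import math


def accuracy_with_stderr(correct: int, total: int) -> tuple:
    """Return (accuracy, binomial standard error) for `correct` hits out of `total`."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    return p, math.sqrt(p * (1 - p) / total)


def summarize(results: dict, num_samples: int) -> str:
    """Format one line per model: accuracy with standard error and mean response length."""
    lines = []
    for name, metrics in results.items():
        # Recover the number of correct answers from the stored accuracy.
        correct = round(metrics["accuracy"] * num_samples)
        acc, err = accuracy_with_stderr(correct, num_samples)
        lines.append(
            f"{name:<12} accuracy={acc:.2f} ± {err:.2f}  avg_chars={metrics['avg_length']:.0f}"
        )
    return "\n".join(lines)
```

 In the notebook, `print(summarize(results, num_samples=10))` would print one line per model. A larger `num_samples` shrinks the standard error, at the cost of more API calls and generation time.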