# Assignment 04 – Reasoning
In this assignment you will explore **reasoning** in large language models, experiment with *chain‑of‑thought* prompting, and reflect on two recent position papers debating whether LLMs truly “think.”

# 1&nbsp; What is Reasoning?

Write **your own definition** of reasoning in the context of intelligent systems.  
*Hints:* consider notions such as logical inference, causal deduction, abstraction, multi‑step planning, and how humans articulate intermediate thoughts.

# 2&nbsp; Build a Basic Chain‑of‑Thought (CoT)

Your goal is to wrap **any LLM backend of your choice** with a helper that can optionally trigger a *chain‑of‑thought* style response.

#### What to implement

1. **Choose a backend** (set `USE_BACKEND`):  
   * `"gemma"` – use the `google/gemma-3-4b-it` checkpoint via 🤗 Transformers.  
   * `"openai"` – route to your `call_openai()` helper (e.g., GPT‑4o).  
   * `"gemini"` – route to your `call_gemini()` helper (e.g., Gemini 1.5 Pro).

2. **Load / authenticate**  
   * HF backends need `HF_TOKEN`.  
   * OpenAI backends need `OPENAI_API_KEY`.  
   * Gemini backends need `GEMINI_API_KEY` (the variable read by `call_gemini()` below).

3. **Implement `run_llm(prompt, with_cot=False)`**  
   * When `with_cot=True`, prepend a CoT trigger such as **“Let’s think step by step.”**  
   * Return the *model’s final answer* (you may choose to strip the intermediate thoughts).

4. **Quick sanity‑check**  
   * Call the helper once *without* and once *with* CoT on a simple prompt (e.g., *“Is 17 a prime number?”*) and print both outputs.

**Reference:** Jason Wei, Xuezhi Wang, Dale Schuurmans, et al.  
*“Chain‑of‑Thought Prompting Elicits Reasoning in Large Language Models.”*  
arXiv:2201.11903 (2022) <https://arxiv.org/abs/2201.11903>
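
Step 3 above leaves open how to strip the intermediate thoughts. One possible approach, sketched here under assumptions, is to ask the model to end with a marked line and keep only the text after that marker. The names `build_prompt`, `strip_thoughts`, and the `Final answer:` marker are hypothetical and not part of the assignment; adapt them to whatever format your model follows reliably.

```python
COT_TRIGGER = "Let's think step by step."  # trigger phrase suggested in step 3
ANSWER_MARKER = "Final answer:"            # assumed marker, chosen for this sketch only


def build_prompt(prompt: str, with_cot: bool = False) -> str:
    """Prepend the CoT trigger and request a marked final line when with_cot is set."""
    if not with_cot:
        return prompt
    return f"{COT_TRIGGER}\n{prompt}\nEnd with a line starting with '{ANSWER_MARKER}'."


def strip_thoughts(response: str, marker: str = ANSWER_MARKER) -> str:
    """Return the text after the last marker, or the whole response if the marker is absent."""
    _, found, tail = response.rpartition(marker)
    return tail.strip() if found else response.strip()


# Offline example with a hand-written response (no model call involved).
example = "17 is not divisible by 2, 3, or 5.\nFinal answer: Yes"
final = strip_thoughts(example)
```

Falling back to the full response when the marker is missing keeps the helper usable with baseline (non-CoT) outputs, at the cost of sometimes returning the reasoning as well.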


In [1]:

import os  # needed for os.environ lookups in both helpers


def call_openai(prompt: str, model_name: str = "gpt-4o") -> str:
    """Call OpenAI API and return generated text."""
    from openai import OpenAI
    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()

def call_gemini(prompt: str, model_name: str = "models/gemini-1.5-pro") -> str:
    """Call Google Gemini API and return generated text."""
    import google.generativeai as genai
    genai.configure(api_key=os.environ.get("GEMINI_API_KEY"))
    model = genai.GenerativeModel(model_name)
    response = model.generate_content(prompt)
    return response.text.strip()


In [None]:
# ----------------------------- STUDENT TODOs BELOW -----------------------------
# 1️⃣  Choose your backend: 'gemma', 'openai', or 'gemini'
USE_BACKEND = "gemini"  # <-- change me

# 2️⃣  Load / configure the model for the chosen backend
if USE_BACKEND == "gemma":
    # Gemma 3‑4B instruction‑tuned via Hugging Face 🧩
    from transformers import AutoTokenizer, Gemma3ForConditionalGeneration, pipeline

    MODEL_ID = "google/gemma-3-4b-it"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, token=True)
    model = Gemma3ForConditionalGeneration.from_pretrained(
        MODEL_ID,
        attn_implementation="flash_attention_2",
        device_map="auto",
        torch_dtype="auto",
        token=True,                # ← relies on HF_TOKEN env variable
    )
    generator = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        do_sample=True,
        temperature=0.7,
    )

elif USE_BACKEND == "openai":
    # OpenAI helper must be defined elsewhere in the notebook
    # Example: from my_helpers import call_openai
    pass  # TODO: nothing to load – just make sure call_openai() is available

elif USE_BACKEND == "gemini":
    # Gemini helper must be defined elsewhere in the notebook
    # Example: from my_helpers import call_gemini
    pass  # TODO: nothing to load – just make sure call_gemini() is available

else:
    raise ValueError("Unsupported backend selected!")

# 3️⃣  Implement the CoT helper
def run_llm(prompt: str, *, with_cot: bool = False, max_new_tokens: int = 256):
    """Run the chosen backend, optionally triggering chain‑of‑thought."""
    cot_prefix = "INSERT YOUR PROMPT HERE\n"
    full_prompt = cot_prefix + prompt if with_cot else prompt

    if USE_BACKEND == "gemma":
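        # return_full_text=False drops the echoed prompt so only the completion is returned
        outputs = generator(full_prompt, max_new_tokens=max_new_tokens, return_full_text=False)
        return outputs[0]["generated_text"]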

    if USE_BACKEND == "openai":
        # TODO: ensure `call_openai()` exists
        return call_openai(full_prompt)

    if USE_BACKEND == "gemini":
        # TODO: ensure `call_gemini()` exists
        return call_gemini(full_prompt)

    raise RuntimeError("No valid backend route found.")

# 4️⃣  Sanity‑check – compare baseline vs. CoT
_test_prompt = "Is 17 a prime number?"
print("→ Baseline:", run_llm(_test_prompt, with_cot=False))
print("→ With CoT:", run_llm(_test_prompt, with_cot=True))
# -------------------------------------------------------------------------------


# 3&nbsp; Evaluate CoT Performance vs. Single‑Call


Pick **10 reasoning‑intensive questions** (e.g., logic puzzles, word problems, or multi‑step arithmetic queries).  
Run each question twice with `run_llm`: once *without* chain‑of‑thought and once *with* CoT.  
Manually (or programmatically) judge which output is *more correct, complete, and transparent*.  
Record your observations below.
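
If you judge programmatically, a loop along the following lines may help. This is only a sketch: `score_modes`, `normalize`, and `fake_llm` are hypothetical names, the substring check is a crude assumption that can produce false positives (for example `91` inside `191`), and `fake_llm` is a canned stand-in so the example runs offline. In the notebook you would pass `run_llm` instead.

```python
from typing import Callable, Dict, List, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(text.lower().split())


def score_modes(items: List[Tuple[str, str]], llm: Callable[..., str]) -> Dict[str, float]:
    """Return accuracy per mode over (question, expected_answer) pairs.

    An output counts as correct when the normalized expected answer occurs in it.
    """
    if not items:
        raise ValueError("items must contain at least one (question, expected) pair")
    hits = {"baseline": 0, "cot": 0}
    for question, expected in items:
        for mode, flag in (("baseline", False), ("cot", True)):
            output = llm(question, with_cot=flag)
            if normalize(expected) in normalize(output):
                hits[mode] += 1
    return {mode: count / len(items) for mode, count in hits.items()}


def fake_llm(prompt: str, *, with_cot: bool = False) -> str:
    """Canned outputs standing in for a real model call (hypothetical)."""
    return "13 * 7 = 70 + 21, so the answer is 91" if with_cot else "90"


demo_items = [("What is 13 * 7?", "91")]
demo_scores = score_modes(demo_items, fake_llm)
```

Automatic matching only covers correctness; completeness and transparency still need a manual read of both outputs.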

In [None]:
# ▶️ Comparison template
reasoning_question = "If all bloops are meems and some meems are glorps, are all bloops definitely glorps? Explain your answer."

baseline = run_llm(reasoning_question, with_cot=False)
cot = run_llm(reasoning_question, with_cot=True)

print("\n--- Baseline ---\n", baseline)
print("\n--- CoT ---\n", cot)

# TODO: Add your evaluation notes (e.g. accuracy, clarity) in the markdown cell that follows.


# 4&nbsp; Read Two Papers


* **“The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity”** (Shojaee *et al.*, 2025).  
  *ArXiv:* <https://arxiv.org/abs/2506.06941>
* **“The Illusion of the Illusion of Thinking”** (Opus & Lawsen, 2025) – a critical response.  
  *ArXiv:* <https://arxiv.org/abs/2506.09250>

Skim both (abstract → methods → main results) and take note of their competing claims about large reasoning models (LRMs) and chain‑of‑thought reasoning.

# 5&nbsp; Reflection (2 paragraphs)


In **≈150–250 words**, explain **which paper’s argument you find more convincing and why**.  
Consider the authors’ experimental setups, evidence, interpretation of “reasoning,” and any limitations you notice.


# 6&nbsp; Implementing Self‑Consistency Decoding

**Learning goal.** Experience how self‑consistency can improve reasoning accuracy by sampling diverse chains of thought and aggregating their final answers.
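
For intuition about why aggregation can help, consider an idealized model that the assignment itself does not assume: each sampled path is independently correct with probability `p`, and the vote succeeds only when a strict majority of the `k` paths is correct. The hypothetical helper `majority_correct_prob` below computes that probability from the binomial distribution. Real samples from one model are correlated and answers are rarely binary, so treat the numbers as a rough upper-level illustration rather than a prediction.

```python
from math import comb


def majority_correct_prob(p: float, k: int) -> float:
    """P(strict majority of k independent paths is correct), each correct with probability p."""
    need = k // 2 + 1  # smallest count that forms a strict majority
    return sum(comb(k, i) * p**i * (1 - p) ** (k - i) for i in range(need, k + 1))


# Probability of a correct vote for the sampling budgets used in this section.
curve = {k: majority_correct_prob(0.6, k) for k in (1, 3, 5)}
```

Under these assumptions and `p = 0.6`, the success probability rises from 0.6 with one path to 0.648 with three and 0.68256 with five, which also hints at diminishing returns as `k` grows.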


### 6.1 Single‑Model Self‑Consistency

#### Detailed Steps

Follow the eight steps below to implement self‑consistency with a *single* language model.

| Step | What you do | Why it matters |
|------|-------------|----------------|
| **1. Choose a base model** | Pick **one** of the foundation models you have tried earlier (e.g., GPT‑4o, Claude‑3 Sonnet, Gemini 1.5 Pro, etc.). Set it in the `MODEL` constant below. | Keeps the experiment focused and cost‑controlled. |
| **2. Select tasks** | Re‑use at least **five reasoning problems** from previous sections and store them in `TASKS`. Each task must include a ground‑truth answer for evaluation. | Ensures comparability across sampling budgets. |
| **3. Generate reasoning paths** | Write `generate_paths(question, n_paths)` that samples `n_paths` Chain‑of‑Thought explanations from the chosen model (temperature ≥ 0.7). | Diversity of reasoning paths is the heart of self‑consistency. |
| **4. Parse final answers** | Implement `extract_final_answer(full_response)` that returns the answer string from a model response. | Needed for voting. |
| **5. Majority vote aggregator** | Implement `majority_vote(answers)` that returns the most frequent answer and its support size. Break ties by picking the answer whose chain has the highest average log‑probability. | Converts diverse chains into a single prediction. |
| **6. Run experiments** | For each sampling budget **k ∈ {3, 5, 10}** and each task, generate `k` paths → vote → record whether the voted answer matches ground truth. | Measures accuracy vs. compute. |
| **7. Collect metrics** | Track (a) accuracy, (b) average latency, (c) total tokens. Store them in a `pandas.DataFrame`. | Enables quantitative comparison. |
| **8. Analyze results** | Plot / tabulate metrics and write a short discussion: *Where do returns diminish? How does cost scale?* | Connects empirical findings to theory. |
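
The tie‑break in Step 5 could be sketched as follows. This is one possible reading, not a reference solution: among the tied answers it picks the one backed by the single chain with the highest average log‑probability. The function name `vote_with_tiebreak` and the `avg_logprobs` list are hypothetical; you would need to fill that list from your SDK's log‑probability output, if it offers one.

```python
import collections
from typing import List, Tuple


def vote_with_tiebreak(answers: List[str], avg_logprobs: List[float]) -> Tuple[str, int]:
    """Return (winner, support); ties are broken by the best average log-probability."""
    if not answers:
        return "", 0
    if len(answers) != len(avg_logprobs):
        raise ValueError("answers and avg_logprobs must have the same length")

    counts = collections.Counter(answers)
    top = max(counts.values())
    tied = [answer for answer, count in counts.items() if count == top]
    if len(tied) == 1:
        return tied[0], top

    # For each tied answer, remember the highest average log-probability of its chains.
    best = {}
    for answer, logprob in zip(answers, avg_logprobs):
        if answer in tied:
            best[answer] = max(best.get(answer, float("-inf")), logprob)
    winner = max(tied, key=lambda answer: best[answer])
    return winner, top


# "A" and "B" both have two votes; "B" owns the most confident chain among the tied answers.
tie_result = vote_with_tiebreak(["A", "B", "A", "B", "C"], [-0.9, -0.2, -0.8, -0.5, -0.1])
```

An alternative reading averages the log‑probabilities over all chains that support each tied answer; either choice is defensible as long as you document it.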


In [None]:
# 🚀 Imports & configuration
import time, json, collections, statistics
import pandas as pd

# TODO: 🔑 Supply your API key through an environment variable (e.g. OPENAI_API_KEY); avoid hard-coding it
# from openai import OpenAI

MODEL = "gpt-4o-mini"  # TODO: replace with your chosen model
TEMPERATURE = 0.7      # Non‑zero for diverse sampling
MAX_TOKENS = 1200

# 👇 Populate with at least 5 {question, answer} dicts
TASKS = [
    # {"question": "What is 13 * 7?", "answer": "91"},
]


In [None]:
def generate_paths(question: str, n_paths: int):
    """Return a list of full model responses (thought + answer)."""
    paths = []
    for _ in range(n_paths):
        # TODO: 🔄 Call your LLM here with Chain‑of‑Thought prompting
        # Example with the OpenAI Python SDK v1+ (adapt to your provider's SDK):
        # client = OpenAI()  # reads OPENAI_API_KEY from the environment
        # response = client.chat.completions.create(
        #     model=MODEL,
        #     messages=[
        #         # Ask for an 'Answer:' line so extract_final_answer() can find it
        #         {"role": "user", "content": f"{question}\n\nLet's think step by step, "
        #                                     "then end with 'Answer: <final answer>'."}
        #     ],
        #     temperature=TEMPERATURE,
        #     max_tokens=MAX_TOKENS,
        # )
        # paths.append(response.choices[0].message.content)
        pass
    return paths


In [None]:
def extract_final_answer(response: str) -> str:
    """Extract the answer after the last occurrence of 'Answer:' (simple heuristic)."""
    import re
    match = re.findall(r"Answer\s*[:=]\s*(.*)", response, flags=re.IGNORECASE)
    return match[-1].strip() if match else ""

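def majority_vote(answers):
    """Return (winning_answer, support_count, counts_dict); ("", 0, {}) if answers is empty.

    Ties go to the answer seen first; the log-probability tie-break from Step 5 is left to you.
    """
    counts = collections.Counter(answers)
    if not counts:  # e.g. generate_paths() is still a stub and returned []
        return "", 0, counts
    winner, support = counts.most_common(1)[0]
    return winner, support, counts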


In [None]:
%%time
records = []

for k in [3, 5, 10]:
    for task in TASKS:
        question, truth = task['question'], task['answer']
        t0 = time.time()
        paths = generate_paths(question, k)
        gen_time = time.time() - t0

        answers = [extract_final_answer(p) for p in paths]
        pred, support, _ = majority_vote(answers)
        is_correct = (pred == truth)

        records.append({
            'k_paths': k,
            'question': question,
            'truth': truth,
            'predicted': pred,
            'support': support,
            'latency_sec': round(gen_time, 2),
            'correct': is_correct,
            # TODO: add token_usage if available from your SDK
        })

df_results = pd.DataFrame(records)
df_results.groupby('k_paths')['correct'].mean().rename('accuracy').to_frame()


In [None]:
# 📊 Inspect detailed results
df_results.head()

In [None]:
# ✍️ Reflection
# In a new markdown cell below, discuss:
# - How accuracy changes with k
# - Cost/latency implications
# - Any qualitative observations about reasoning diversity


### 6.2 Cross‑Model Self‑Consistency

#### Detailed Steps

This experiment ensembles **five distinct language models** by majority voting over *one* reasoning path from each model.

| Step | What you do | Why it matters |
|------|-------------|----------------|
| **1. Pick five models** | Populate the `MODELS` list below with at least five different LLM identifiers available to you (e.g., `gpt-4o`, `claude-3-opus`, `gemini-1.5-pro-latest`, `llama-3-70b-instruct`, `mistral-large`). | Horizontal diversity often yields complementary reasoning. |
| **2. Re‑use the tasks** | Use the same `TASKS` list created in Section 6.1 so results are comparable. | Controls for task variation. |
| **3. Generate one path per model** | Implement `generate_one_path(question, model)` that returns a single Chain‑of‑Thought response from the given model (*temperature ≈ 0* for deterministic decoding). | Mimics a cost‑constrained ensemble where each model fires once. |
| **4. Parse answers** | Re‑use `extract_final_answer` from 6.1 to extract each model’s answer string. | Enables voting. |
| **5. Majority vote across models** | Re‑use `majority_vote(answers)` from Section 6.1 to aggregate answers across models and return the winning answer + support. | Core of ensemble self‑consistency. |
| **6. Run the experiment** | For every task, call each model → vote → record accuracy, per‑task support distribution, latency, and token cost. | Produces cross‑model performance metrics. |
| **7. Compare strategies** | Tabulate accuracy/latency/cost vs. the 10‑path single‑model result from Section 6.1. | Quantifies trade‑offs between vertical and horizontal ensembles. |
| **8. Reflect** | Discuss which method you’d choose under (a) limited budget, (b) need for highest accuracy, and why. | Connects empirical evidence to deployment choices. |
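
Different models tend to format the same answer differently (`**91**`, `91.`, `91`), which can split votes in Step 5. A light normalization pass before voting is one way to reduce this. The sketch below uses hypothetical helpers `normalize_answer` and `vote_across_models`; the specific cleanup rules are assumptions and may be too aggressive or too weak for your tasks.

```python
import collections
import re
from typing import Dict, Tuple


def normalize_answer(raw: str) -> str:
    """Lowercase, drop markdown emphasis/backticks, and trim trailing punctuation."""
    text = raw.strip().lower()
    text = re.sub(r"[*`]", "", text)  # remove bold/italic asterisks and code ticks
    return text.rstrip(".!").strip()


def vote_across_models(answers_by_model: Dict[str, str]) -> Tuple[str, int, Dict[str, str]]:
    """Return (winner, support, cleaned_answers); empty answers do not vote."""
    cleaned = {model: normalize_answer(ans) for model, ans in answers_by_model.items()}
    counts = collections.Counter(ans for ans in cleaned.values() if ans)
    if not counts:
        return "", 0, cleaned
    winner, support = counts.most_common(1)[0]
    return winner, support, cleaned


# Placeholder model names; the raw strings imitate typical formatting differences.
raw_answers = {"model_a": "**91**", "model_b": "91.", "model_c": "90", "model_d": ""}
winner, support, cleaned = vote_across_models(raw_answers)
```

If you normalize model answers this way, apply the same function to the ground‑truth strings in `TASKS` before comparing, otherwise a correct vote can still be scored as wrong.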


In [None]:
# 🚀 Configuration for cross‑model ensemble
import time, collections, statistics
import pandas as pd

MODELS = [
    # "gpt-4o-mini",
    # "claude-3-haiku",
    # "gemini-1.5-pro",
    # "llama-3-70b-instruct",
    # "mistral-large",
]
TEMPERATURE_CROSS = 0.0  # Near-deterministic decoding
MAX_TOKENS = 512  # note: overrides the value set in Section 6.1 (1200) if both cells run


In [None]:
def generate_one_path(question: str, model: str) -> str:
    """Return a Chain‑of‑Thought response from *one* model run."""
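    # TODO: 🔄 Replace the pseudo‑call with your provider's SDK (each provider needs its own client).
    # Example with the OpenAI Python SDK v1+:
    # from openai import OpenAI
    # client = OpenAI()  # reads OPENAI_API_KEY from the environment
    # response = client.chat.completions.create(
    #     model=model,
    #     messages=[
    #         # Ask for an 'Answer:' line so extract_final_answer() can find it
    #         {"role": "user", "content": f"{question}\n\nLet's think step by step, "
    #                                     "then end with 'Answer: <final answer>'."}
    #     ],
    #     temperature=TEMPERATURE_CROSS,
    #     max_tokens=MAX_TOKENS,
    # )
    # return response.choices[0].message.content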
    return ""


In [None]:
%%time
records_cm = []

for task in TASKS:
    question, truth = task['question'], task['answer']
    answers_by_model = {}
    t0 = time.time()
    for model in MODELS:
        resp = generate_one_path(question, model)
        ans = extract_final_answer(resp)
        answers_by_model[model] = ans
    latency = time.time() - t0

    voted_answer, support, support_dict = majority_vote(list(answers_by_model.values()))
    is_correct = (voted_answer == truth)

    records_cm.append({
        'question': question,
        'truth': truth,
        'predicted': voted_answer,
        'support': support,
        'support_breakdown': support_dict,
        'latency_sec': round(latency, 2),
        'correct': is_correct,
        # TODO: aggregate token usage if your SDK provides it
    })

df_cm = pd.DataFrame(records_cm)
df_cm['correct'].mean()


In [None]:
# 📊 Cross‑model ensemble results
df_cm.head()

In [None]:
# ✍️ Reflection
# In a new markdown cell below, compare:
# - Accuracy vs. Section 6.1 (k=10)
# - Latency & token costs
# - Qualitative differences in reasoning styles across models
