# Prompt Engineering Lab

**From Zero-shot to Self-Consistency — A Practical Introduction to Prompt Engineering** 


## Introduction

Welcome to this interactive notebook on **Text Classification with Large Language Models (LLMs)**.  
In this lesson, you will explore how LLMs can be guided using **different prompting techniques** – such as *zero-shot*, *few-shot*, *chain-of-thought (CoT)*, and *self-consistency* – to perform classification tasks.

### LLM Backend

This notebook uses **Ollama** (local model) or the **OpenAI API** for inference.  
Make sure you've completed the setup in `index.html` before proceeding.

**Current configuration:**
- Backend: `ollama` (or `openai` if configured)
- Model: `llama3.2-long` (Ollama) or `gpt-4o-mini` (OpenAI)

If you encounter errors, verify that:
- Ollama is running: `ollama list` should show `llama3.2-long`
- `.env` file is configured correctly

#### Throughout this notebook, you will:
- Understand the **core concepts** behind each prompting method.
- Experiment with **real examples**.
- Test your understanding through **interactive quizzes**.
- Receive hints and feedback along the way.

#### Note about the code cells:

This notebook follows a clean architecture pattern where **all implementation details are kept in separate modules** under the `src/` folder. 

**If you're curious about how things work under the hood:**
- **Prompting logic** (how prompts are built): `src/prompting.py`
- **LLM calls** (API communication): `src/llm.py`
- **Evaluation metrics** (accuracy, precision, recall, F1): `src/metrics.py`
- **Interactive widgets & visualizations**: `src/notebook_helpers.py`
- **Progress tracking**: `src/progress.py`
- **Quiz answers & feedback**: `src/quiz_answers.py`

This design keeps the notebook **focused on learning**, not implementation details. You can always explore the source files if you want to understand or modify the underlying code! 🔍


---

### Setup & imports

Before we start with the first and simplest approach, **zero-shot prompting**, run the cell below to complete the setup.

In [None]:
# === IMPORTS (run this cell first) ===
import sys, os, csv, pathlib, subprocess
from pathlib import Path
from collections import Counter
import pandas as pd
import matplotlib.pyplot as plt
import ipywidgets as widgets
from IPython.display import display, clear_output, HTML
import time

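# Project structure: the notebook sits one level below the project root.
# Only step up when 'src/' is not already visible, so re-running this cell
# does not climb one directory further each time.
project_root = Path.cwd() if (Path.cwd() / 'src').is_dir() else Path.cwd().parent
os.chdir(project_root)
sys.path.insert(0, str(project_root))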

# Load environment variables
from dotenv import load_dotenv
load_dotenv()

# Project modules (implementation lives under src/)
from src.config import MODEL_BACKEND, OLLAMA_MODEL, OPENAI_MODEL
from src.llm import call_llm
from src.progress import load_progress, save_progress
from src.metrics import precision_recall_f1
from src.quiz_answers import check_answer
from src.prompting import (
    run_zero_shot, 
    run_few_shot, 
    run_cot, 
    run_self_consistency,
    build_zero_shot_prompt, 
    build_few_shot_prompt, 
    build_cot_prompt, 
    normalize_label
)
from src.notebook_helpers import (
    # Zero-shot
    display_zero_shot_example,
    create_zero_shot_interactive,
    create_zero_shot_quiz,
    run_zero_shot_full_dataset,
    # Few-shot
    display_few_shot_example,
    create_few_shot_interactive,
    create_few_shot_quiz,
    run_few_shot_full_dataset,
    # CoT
    display_cot_example,
    create_cot_interactive,
    create_cot_quiz,
    run_cot_full_dataset,
    # Self-Consistency
    display_self_consistency_example,
    create_self_consistency_interactive,
    create_self_consistency_quiz,
    run_self_consistency_full_dataset,
    # Comparison
    display_method_comparison,
    # Progress Tracker
    run_progress_tracker_and_validation
)

# Load dataset
DATASET = pathlib.Path('data/sentiment_tiny.csv')
rows = []
with open(DATASET, newline='', encoding='utf-8') as f:
    r = csv.DictReader(f)
    for row in r:
        rows.append((row['text'], row['label']))

# Load progress
progress = load_progress()
quiz = progress.setdefault('quiz', {})

# Drop stale quiz entries stored under old key names
for old_key in ['intro', 'Testar notebook', 'zero_shot', 'few_shot']:
    quiz.pop(old_key, None)

# Display configuration
print('='*60)
print('✅ Setup complete!')
print('='*60)
print(f'📊 Dataset loaded: {len(rows)} examples')
print(f'🤖 LLM Backend: {MODEL_BACKEND}')
if MODEL_BACKEND == 'ollama':
    print(f'   Model: {OLLAMA_MODEL}')
elif MODEL_BACKEND == 'openai':
    print(f'   Model: {OPENAI_MODEL}')
print('📁 Progress tracking ready')
print('='*60)
print()

---

# Section 1 – Zero-shot

<div style="background:#dbeafe; border-left:6px solid #2563eb; padding:16px 20px; border-radius:6px; color:#1e3a8a; line-height:1.6;">
<strong style="font-size:16px;">Zero-shot prompting</strong> means giving the model only an instruction or question, <strong>with no examples</strong>. The model must rely solely on its <strong>pretrained knowledge</strong> and <strong>understanding of instructions</strong>. It works best for clear, factual tasks and is cost-efficient since it requires fewer tokens.

<strong style="margin-top:12px; display:block;">Zero-shot prompting is powerful when:</strong>
<ul style="margin:8px 0 12px 20px;">
  <li>You have <strong>no labeled data</strong>.</li>
  <li>The task is <strong>clearly described</strong>.</li>
  <li>You want to quickly test a model's reasoning or world knowledge.</li>
</ul>

<em style="color:#1e40af;">However, it can sometimes produce inconsistent results since the model has no examples to guide its reasoning.</em>
</div>
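
To make "instruction only, no examples" concrete, here is a minimal sketch of what a zero-shot prompt for this task could look like. The function name `build_zero_shot_prompt_sketch` and the exact wording are illustrative assumptions; the prompt the notebook really sends is built in `src/prompting.py` and may read differently.

```python
def build_zero_shot_prompt_sketch(text: str) -> str:
    """Return an instruction-only prompt: no labeled examples are included.

    Hypothetical helper for illustration, not the notebook's own builder.
    """
    return (
        "Classify the sentiment of the following sentence as "
        "'positive' or 'negative'. Answer with a single word.\n\n"
        f"Sentence: {text}\n"
        "Sentiment:"
    )


prompt = build_zero_shot_prompt_sketch(
    "The movie was absolutely wonderful and full of emotion."
)
print(prompt)
```

The only sentence in the prompt is the one to classify, which is why zero-shot prompts stay short.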


### 1.1 Example: Zero-shot Classification

<div style="background:#eff6ff; border-left:4px solid #3b82f6; padding:16px; margin:16px 0; border-radius:6px; color:#1e3a8a;">
<strong>📚 What happens in this cell?</strong><br><br>
You will see a <strong>complete step-by-step example</strong> of how zero-shot prompting works.

<strong>📝 Example sentence used:</strong><br>
<em>"The movie was absolutely wonderful and full of emotion."</em>

<strong>What you will see:</strong>
- 📄 The input sentence and expected label
- 🔍 The exact prompt sent to the model
- 🤖 The model's raw response + normalized result
- 📊 A summary table with the classification result
- ⏱️ Response latency (how long it took)

<strong>💡 Pay attention to:</strong> The prompt is <em>very simple</em> – just instructions, no examples!
</div>
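
The "raw response + normalized result" step exists because models rarely reply with a bare label. The sketch below shows one plausible way to normalize a reply. `normalize_label_sketch` is a hypothetical name, and the rule it applies (take the last label mentioned) is an assumption, not necessarily what `normalize_label` in `src/prompting.py` does.

```python
import re


def normalize_label_sketch(raw: str) -> str:
    """Map a free-form model reply to 'positive', 'negative', or 'unknown'."""
    # Find every standalone occurrence of either label, ignoring case.
    matches = re.findall(r"\b(positive|negative)\b", raw.lower())
    # Assumption: the last mention is the verdict, since replies often end with it.
    return matches[-1] if matches else "unknown"


for reply in ["Positive.", "The sentiment is NEGATIVE!", "Hard to say."]:
    print(repr(reply), "->", normalize_label_sketch(reply))
```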

Run the cell below to see the demonstration:

In [None]:
# === Example: Zero-shot Classification ===
display_zero_shot_example(rows)

### 1.2 Try it yourself!

<div style="background:#eff6ff; border-left:4px solid #3b82f6; padding:16px; margin:16px 0; border-radius:6px; color:#1e3a8a;">
<strong>🎯 Now it's your turn!</strong><br><br>
Test zero-shot prompting with <strong>your own sentence</strong>.

<strong>What will happen:</strong>
- You enter a sentence (e.g., "This movie was terrible")
- The model classifies it as positive/negative
- You see the entire process: prompt → model → result

<strong>💡 Try experimenting with:</strong>
- Clearly positive sentences: "I loved it!"
- Clearly negative sentences: "Awful and boring."
- Difficult edge cases: "It was okay, not great but not bad."

<strong>⏱️ Expected time:</strong> ~2-3 seconds per classification
</div>

Run the cell below:

In [None]:
# === Try it yourself! (Zero-shot) ===
create_zero_shot_interactive()

### 1.3 Quiz: Test Your Understanding

<div style="background:#eff6ff; border-left:4px solid #3b82f6; padding:16px; margin:16px 0; border-radius:6px; color:#1e3a8a;">
<strong>🔍 Knowledge Check</strong><br><br>
Before moving forward, make sure you understand the <strong>zero-shot concept</strong>.

<strong>What you will get:</strong>
- ✅ Multiple choice question with clickable radio buttons
- 💡 Immediate feedback (green = correct, red = incorrect)
- 💾 Automatic saving of correct answers to progress file

<strong>🎯 Goal:</strong> Answer correctly to proceed to the next section!
</div>

Run the cell below:

In [None]:
# === Quiz: Test Your Understanding ===
create_zero_shot_quiz(progress)

<details>
  <summary>💡 Need hints? (click to expand)</summary> 

Think about what "zero" means in **zero-shot** - how many examples does the model see?

To understand the concept better, ask your AI assistant:
> "What are the advantages and drawbacks of zero-shot vs few-shot learning for text classification?"
</details>


### 1.4 Run Zero-shot on Full Dataset

<div style="background:#eff6ff; border-left:4px solid #3b82f6; padding:16px; margin:16px 0; border-radius:6px; color:#1e3a8a;">
<strong>🚀 Time to run on the entire dataset!</strong><br><br>
Now we will classify <strong>all 15 sentences</strong> in the dataset using zero-shot prompting.

<strong>What will happen:</strong>
1. ⏳ The model classifies each sentence (takes ~30-60 seconds total)
2. 📊 You get a table with accuracy, precision, recall, F1-score
3. 📈 A bar chart visualizes the results
4. 🎯 A target line shows if you reached 60% accuracy
5. 📋 Detailed breakdown (true positives, false positives, etc.)
6. 💾 All metrics are automatically saved to <code>progress/lesson_progress.json</code>

<strong>🎯 Goal:</strong> Reach at least <strong>60% accuracy</strong> to pass this section!

<strong>⚠️ Note:</strong> If zero-shot doesn't reach the target – don't worry! That's why we have few-shot and CoT.
</div>
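
If you want a feel for what a full-dataset run involves, the loop can be sketched in a few lines. Everything here is an assumption for illustration: `evaluate` and `keyword_classifier` are hypothetical names, the toy rows are made up, and the keyword rule stands in for the LLM call so that the sketch runs offline. Only the 60% target comes from this section.

```python
from typing import Callable, Iterable, Tuple

TARGET_ACCURACY = 0.60  # pass mark stated in this section


def evaluate(rows: Iterable[Tuple[str, str]], classify: Callable[[str], str]) -> float:
    """Return the fraction of (text, gold_label) rows that classify() gets right."""
    rows = list(rows)
    if not rows:
        return 0.0
    correct = sum(1 for text, gold in rows if classify(text) == gold)
    return correct / len(rows)


def keyword_classifier(text: str) -> str:
    """Hypothetical stand-in for an LLM call, so the sketch runs offline."""
    return "positive" if "good" in text.lower() else "negative"


toy_rows = [
    ("A good film.", "positive"),
    ("Good fun throughout.", "positive"),
    ("Dull and slow.", "negative"),
    ("I loved it.", "positive"),  # the keyword rule misses this one
]

accuracy = evaluate(toy_rows, keyword_classifier)
print(f"accuracy={accuracy:.0%}, target reached: {accuracy >= TARGET_ACCURACY}")
```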

Run the cell below:

In [None]:
# === 🚀 Run Zero-shot on Full Dataset ===
metrics_z = run_zero_shot_full_dataset(rows, progress)

---

# Section 2 – Few-shot

<div style="background:#f5f3ff; border-left:6px solid #8b5cf6; padding:16px 20px; border-radius:6px; color:#5b21b6; line-height:1.6;">
<strong style="font-size:16px;">Few-shot prompting</strong> means providing the model with <strong>a few labeled examples</strong> before asking it to classify new text. These examples help the model understand the <strong>pattern</strong> and <strong>desired output format</strong>.

<strong style="margin-top:12px; display:block;">Few-shot prompting is effective when:</strong>
<ul style="margin:8px 0 12px 20px;">
  <li>You have <strong>a small set of labeled examples</strong> (typically 2-10).</li>
  <li>The task requires understanding <strong>specific patterns or style</strong>.</li>
  <li>Zero-shot performance is insufficient and needs guidance.</li>
</ul>

<em style="color:#6d28d9;">Few-shot prompting typically improves accuracy compared to zero-shot, but requires more tokens (and thus higher cost).</em>
</div>
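
The idea of "a few labeled examples before the new text" can be sketched as follows. The helper name `build_few_shot_prompt_sketch`, the prompt wording, and the three demonstration sentences are assumptions made for illustration; the notebook's own builder lives in `src/prompting.py`.

```python
from typing import List, Tuple


def build_few_shot_prompt_sketch(demos: List[Tuple[str, str]], text: str) -> str:
    """Place labeled demonstrations first, then the sentence to classify."""
    lines = ["Classify the sentiment of each sentence as 'positive' or 'negative'.", ""]
    for demo_text, demo_label in demos:
        # Each demonstration shows both the input and the desired output format.
        lines += [f"Sentence: {demo_text}", f"Sentiment: {demo_label}", ""]
    # The final item is left unanswered for the model to complete.
    lines += [f"Sentence: {text}", "Sentiment:"]
    return "\n".join(lines)


demos = [  # hypothetical demonstrations, not taken from the lab's dataset
    ("I loved every minute.", "positive"),
    ("A complete waste of time.", "negative"),
    ("Great acting and a moving story.", "positive"),
]
prompt = build_few_shot_prompt_sketch(
    demos, "The movie was absolutely wonderful and full of emotion."
)
print(prompt)
```

Because every demonstration ends in a one-word label, the examples teach the output format as well as the task.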

### 2.1 Example: Few-shot Classification

<div style="background:#f5f3ff; border-left:4px solid #8b5cf6; padding:16px; margin:16px 0; border-radius:6px; color:#5b21b6;">
<strong>📚 What happens in this cell?</strong><br><br>
You will see how <strong>few-shot prompting</strong> works with demonstration examples.

<strong>📝 Example sentence used:</strong><br>
<em>"The movie was absolutely wonderful and full of emotion."</em>

<strong>What you will see:</strong>
- 📄 The same example sentence as in zero-shot
- 📚 3 demonstration examples shown to the model FIRST
- 🔍 The complete prompt (with all examples included)
- 🤖 The model's response after seeing the examples
- 📊 Results in table format
- ⏱️ Response latency (how long it took)

<strong>💡 Notice the difference:</strong> The prompt is now <em>much longer</em> because it contains examples!
</div>

Run the cell below to see the demonstration:

In [None]:
# === Example: Few-shot Classification ===
display_few_shot_example(rows)

### 2.2 Try it yourself!

<div style="background:#f5f3ff; border-left:4px solid #8b5cf6; padding:16px; margin:16px 0; border-radius:6px; color:#5b21b6;">
<strong>🎯 Test few-shot with your own sentence!</strong><br><br>

<strong>What will happen:</strong>
- You enter a sentence
- The model sees 3 demonstration examples FIRST
- Then it classifies your sentence
- You see the full prompt (including all examples)

<strong>💡 Experiment:</strong> Test the same sentences you used in zero-shot – did the results improve now?
</div>

Run the cell below:

In [None]:
# === Try it yourself! (Few-shot) ===
create_few_shot_interactive()

### 2.3 Quiz: Test Your Understanding

<div style="background:#f5f3ff; border-left:4px solid #8b5cf6; padding:16px; margin:16px 0; border-radius:6px; color:#5b21b6;">
<strong>🔍 Knowledge Check: Few-shot</strong><br><br>

<strong>What you will get:</strong>
- ✅ Multiple choice question about few-shot prompting
- 💡 Immediate feedback on your answer
- 📝 Explanation of why the answer is correct/incorrect
- 💾 Automatic saving upon correct answer

<strong>🎯 Focus:</strong> Understand <em>the difference between zero-shot and few-shot</em>!
</div>

Run the cell below:

In [None]:
# === Quiz: Test Your Understanding ===
create_few_shot_quiz(progress)

<details>
  <summary>💡 Need hints? (click to expand)</summary> 

What's the main difference between **zero-shot** and **few-shot**? What does the model get to see in each approach?

To understand the concept better, ask your AI assistant:
> "What are the advantages of few-shot learning over zero-shot? When would the extra token cost be justified?"
</details>

### 2.4 Run Few-shot on Full Dataset

<div style="background:#f5f3ff; border-left:4px solid #8b5cf6; padding:16px; margin:16px 0; border-radius:6px; color:#5b21b6;">
<strong>🚀 Run few-shot on all 15 sentences!</strong><br><br>

<strong>What will happen:</strong>
1. ⏳ Each sentence is classified with 3 demonstration examples (takes ~30-60 seconds)
2. 📊 Metrics: accuracy, precision, recall, F1
3. 📈 Bar chart for visualization
4. 🎯 Comparison against the 60% target
5. 💾 Automatically saved to progress

<strong>💡 Compare the results:</strong> Did few-shot perform better than zero-shot? Why/why not?

<strong>💰 Cost:</strong> Few-shot is more expensive than zero-shot (more tokens per prompt)!
</div>
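
To see roughly how much longer few-shot prompts are, you can compare prompt lengths directly. This is a rough sketch under stated assumptions: whitespace-separated words are only a crude proxy for tokens (real tokenizers count differently), and both prompts below are invented for illustration.

```python
def rough_token_count(prompt: str) -> int:
    """Very rough proxy: whitespace-separated words. Real tokenizers differ."""
    return len(prompt.split())


zero_shot = (
    "Classify the sentiment as positive or negative.\n"
    "Sentence: I loved it!\nSentiment:"
)
few_shot = (
    "Classify the sentiment as positive or negative.\n"
    "Sentence: Great film.\nSentiment: positive\n"
    "Sentence: Terrible plot.\nSentiment: negative\n"
    "Sentence: A real delight.\nSentiment: positive\n"
    "Sentence: I loved it!\nSentiment:"
)

DATASET_SIZE = 15  # number of sentences in this lab's dataset
for name, prompt in [("zero-shot", zero_shot), ("few-shot", few_shot)]:
    per_prompt = rough_token_count(prompt)
    print(f"{name}: ~{per_prompt} words per prompt, "
          f"~{per_prompt * DATASET_SIZE} across {DATASET_SIZE} sentences")
```

The demonstrations are resent with every single request, so the extra length is paid once per sentence, not once per run.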

Run the cell below:

In [None]:
# === Run Few-shot on Full Dataset ===
metrics_fs = run_few_shot_full_dataset(rows, progress)

---

# Section 3 – Chain-of-Thought (CoT)

<div style="background:#fce7f3; border-left:6px solid #ec4899; padding:16px 20px; border-radius:6px; color:#831843; line-height:1.6;">
<strong style="font-size:16px;">Chain-of-Thought (CoT) prompting</strong> encourages the model to <strong>explain its reasoning step-by-step</strong> before giving a final answer. Instead of jumping directly to a conclusion, the model shows its thought process.

<strong style="margin-top:12px; display:block;">CoT prompting is powerful when:</strong>
<ul style="margin:8px 0 12px 20px;">
  <li>The task requires <strong>complex reasoning</strong> or multi-step logic.</li>
  <li>You want to <strong>understand how the model reached its conclusion</strong>.</li>
  <li>Simple prompts give inconsistent or incorrect results.</li>
</ul>

<em style="color:#9d174d;">By asking the model to "show its work," we often get more accurate and interpretable results, especially for nuanced tasks like sentiment analysis.</em>
</div>
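
A CoT setup needs two pieces: a prompt that asks for reasoning, and a way to pull the final label out of a longer reply. The sketch below follows the "last word becomes the answer" convention described later in this section. The function names, the prompt wording, and the sample reply are assumptions for illustration, not the code in `src/prompting.py`.

```python
import re


def build_cot_prompt_sketch(text: str) -> str:
    """Ask for step-by-step reasoning, ending with a one-word verdict."""
    return (
        "Decide whether the sentence below is 'positive' or 'negative'.\n"
        "Explain your reasoning step by step, then end your reply with "
        "a single word: positive or negative.\n\n"
        f"Sentence: {text}\n"
    )


def extract_final_answer(reply: str) -> str:
    """Use the last word of the reply as the label, or 'unknown' if it is neither."""
    words = re.findall(r"[a-z]+", reply.lower())
    last_word = words[-1] if words else ""
    return last_word if last_word in {"positive", "negative"} else "unknown"


sample_reply = (  # hand-written reply standing in for real model output
    "Step 1: 'wonderful' is strongly favourable.\n"
    "Step 2: 'full of emotion' is praise in a film review.\n"
    "Final answer: positive"
)
print(extract_final_answer(sample_reply))
```

If the model keeps talking after its verdict, the last word is no longer the label, which is one reason CoT replies need more careful parsing than one-word answers.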

### 3.1 Example: Chain-of-Thought Classification

<div style="background:#fdf4ff; border-left:4px solid #ec4899; padding:16px; margin:16px 0; border-radius:6px; color:#831843;">
<strong>📚 What happens in this cell?</strong><br><br>
Now you will see how the model <strong>"thinks out loud"</strong> before answering.

<strong>📝 Example sentence used:</strong><br>
<em>"The movie was absolutely wonderful and full of emotion."</em>

<strong>What you will see:</strong>
- 📄 The example sentence (same as before)
- 🔍 A prompt asking the model to explain its reasoning
- 🧠 The model's <strong>complete reasoning</strong> (step-by-step)
- 🎯 The extracted final answer (positive/negative)
- 📊 Results table

<strong>💡 Notice:</strong> The output is now <em>much longer</em> because the model explains its thought process!
</div>

Run the cell below to see the demonstration:

In [None]:
# === Example: Chain-of-Thought Classification ===
display_cot_example(rows)

### 3.2 Try it yourself!

<div style="background:#fdf4ff; border-left:4px solid #ec4899; padding:16px; margin:16px 0; border-radius:6px; color:#831843;">
<strong>🎯 Test Chain-of-Thought with your own sentence!</strong><br><br>

<strong>What will happen:</strong>
- You enter a sentence
- The model gets instructions to "show its reasoning step-by-step"
- You see the model's <strong>thought process</strong> (reasoning)
- The last word in the reasoning becomes the answer

<strong>💡 Use this for:</strong>
- Difficult edge cases: "It was okay but not great"
- Complex sentences: "The acting was good but the plot was terrible"
- See <em>why</em> the model chose its answer!
</div>

Run the cell below:

In [None]:
# === Try it yourself! (Chain-of-Thought) ===
create_cot_interactive()

### 3.3 Quiz: Test Your Understanding

<div style="background:#fdf4ff; border-left:4px solid #ec4899; padding:16px; margin:16px 0; border-radius:6px; color:#831843;">
<strong>🔍 Knowledge Check: Chain-of-Thought</strong><br><br>

<strong>What you will get:</strong>
- ✅ Question about CoT's main advantage
- 💡 Feedback and explanation
- 💾 Automatic progress saving

<strong>🎯 Focus:</strong> Understand <em>why</em> explicit reasoning often, but not always, leads to better results!
</div>

Run the cell below:

In [None]:
# === Quiz: Test Your Understanding ===
create_cot_quiz(progress)

<details>
  <summary>💡 Need hints? (click to expand)</summary> 

What does **Chain-of-Thought (CoT)** suggest about how the model should approach the problem?

To understand the concept better, ask your AI assistant:
> "Why does asking a model to show its reasoning often lead to better answers? What cognitive process does this mimic?"
</details>

### 3.4 Run Chain-of-Thought on Full Dataset

<div style="background:#fdf4ff; border-left:4px solid #ec4899; padding:16px; margin:16px 0; border-radius:6px; color:#831843;">
<strong>🚀 Run CoT on all 15 sentences!</strong><br><br>

<strong>What will happen:</strong>
1. ⏳ Each sentence is classified WITH explicit reasoning (takes ~1-2 minutes)
2. 📊 Metrics are calculated (accuracy, precision, recall, F1)
3. 📈 Visualization of results
4. 🎯 Comparison against the 60% target
5. 💾 Saved to progress

<strong>⏱️ Notice:</strong> CoT takes longer because the model generates more text!

<strong>💡 Compare:</strong> Did CoT perform better than zero-shot and few-shot? Which method is best so far?
</div>

Run the cell below:

In [None]:
# === Run Chain-of-Thought on Full Dataset ===
metrics_cot = run_cot_full_dataset(rows, progress)

---

# Section 4 – Self-Consistency

<div style="background:#fef3c7; border-left:6px solid #f59e0b; padding:16px 20px; border-radius:6px; color:#78350f; line-height:1.6;">
<strong style="font-size:16px;">Self-Consistency</strong> takes Chain-of-Thought prompting one step further. Instead of generating <strong>one reasoning path</strong>, it generates <strong>multiple diverse reasoning paths</strong> and then uses <strong>majority voting</strong> to select the most consistent answer.

<strong style="margin-top:12px; display:block;">Self-Consistency is powerful when:</strong>
<ul style="margin:8px 0 12px 20px;">
  <li>You need <strong>high reliability</strong> and can afford extra compute cost.</li>
  <li>A single reasoning path might be flawed or biased.</li>
  <li>The task benefits from <strong>multiple perspectives</strong> before deciding.</li>
</ul>

<em style="color:#92400e;">By sampling multiple reasoning chains with higher temperature and aggregating results, Self-Consistency often achieves the highest accuracy – at the cost of k× as many API calls as a single CoT run (typically k=5-10).</em>
</div>
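
The voting step itself is small. Below is a minimal sketch of majority voting over k sampled answers. `self_consistency_vote` and `noisy_sampler` are hypothetical names, and the sampler is a seeded random stand-in for one CoT call at temperature above zero (assumed here to be right about 80% of the time), so the sketch runs without a model.

```python
import random
from collections import Counter
from typing import Callable, Tuple


def self_consistency_vote(sample_answer: Callable[[], str], k: int = 5) -> Tuple[str, Counter]:
    """Draw k answers and return (majority label, vote counts)."""
    votes = Counter(sample_answer() for _ in range(k))
    winner, _ = votes.most_common(1)[0]
    return winner, votes


rng = random.Random(0)  # seeded so the sketch is repeatable


def noisy_sampler() -> str:
    """Hypothetical stand-in for one sampled CoT call."""
    return "positive" if rng.random() < 0.8 else "negative"


label, votes = self_consistency_vote(noisy_sampler, k=5)
print(f"final answer: {label}, votes: {dict(votes)}")
```

With two labels, an odd k rules out ties as long as every reply yields a label, which is one practical reason to prefer k=5 over k=4.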


### 4.1 Example: Self-Consistency Classification

<div style="background:#fffbeb; border-left:4px solid #f59e0b; padding:16px; margin:16px 0; border-radius:6px; color:#78350f;">
<strong>📚 What happens in this cell?</strong><br><br>
Self-Consistency = CoT <strong>× 5</strong> + voting!

<strong>📝 Example sentence used:</strong><br>
<em>"The movie was absolutely wonderful and full of emotion."</em>

<strong>What you will see:</strong>
- 🔄 The same sentence classified 5 TIMES (with temperature=0.7)
- 🧠 5 different reasoning paths (the model thinks differently each time)
- 🗳️ Majority voting (which answer got the most votes?)
- 📊 Vote distribution (e.g., "positive: 4/5, negative: 1/5")
- 🎯 Final answer based on majority

<strong>⚠️ Warning:</strong> This is about <em>5× as expensive</em> as CoT (5 API calls per sentence)!

<strong>💡 The point:</strong> By running multiple different thought processes and voting, we get more <em>reliable</em> results.
</div>

Run the cell below to see the demonstration:

In [None]:
# === Example: Self-Consistency Classification ===
display_self_consistency_example(rows, k=5)

### 4.2 Try it yourself!

<div style="background:#fffbeb; border-left:4px solid #f59e0b; padding:16px; margin:16px 0; border-radius:6px; color:#78350f;">
<strong>🎯 Test Self-Consistency with your own sentence!</strong><br><br>

<strong>What will happen:</strong>
- You enter a sentence
- The model runs <strong>5 reasoning paths</strong> (takes ~10-15 seconds)
- You see all 5 thought processes (may be different!)
- Majority vote determines the final answer
- You see vote distribution (e.g., 4/5 positive)

<strong>💡 Test with:</strong>
- Difficult edge cases where you're unsure
- Compare: was the consensus answer more reliable?

<strong>💰 Cost:</strong> 5× more API calls = much more expensive!
</div>

Run the cell below:

In [None]:
# === Try it yourself! (Self-Consistency) ===
create_self_consistency_interactive(k=5)

### 4.3 Quiz: Test Your Understanding

<div style="background:#fffbeb; border-left:4px solid #f59e0b; padding:16px; margin:16px 0; border-radius:6px; color:#78350f;">
<strong>🔍 Knowledge Check: Self-Consistency</strong><br><br>

<strong>What you will get:</strong>
- ✅ Question about the Self-Consistency mechanism
- 💡 Feedback and explanation
- 💾 Progress saving

<strong>🎯 Focus:</strong> Understand the <em>trade-off</em> between cost and reliability!
</div>

Run the cell below:

In [None]:
# === Quiz: Test Your Understanding ===
create_self_consistency_quiz(progress)

<details>
  <summary>💡 Need hints? (click to expand)</summary> 

Think about what happens when you ask multiple experts to solve a problem independently and then go with the consensus.

To understand the concept better, ask your AI assistant:
> "Why does sampling multiple reasoning paths with higher temperature and then voting often lead to better results than a single deterministic path?"
</details>

### 4.4 Run Self-Consistency on Full Dataset

<div style="background:#fffbeb; border-left:4px solid #f59e0b; padding:16px; margin:16px 0; border-radius:6px; color:#78350f;">
<strong>🚀 Run Self-Consistency on all 15 sentences!</strong><br><br>

<strong>What will happen:</strong>
1. ⏳ Each sentence is classified 5 TIMES (takes ~4-6 minutes total!)
2. 📊 Metrics are calculated from majority-voted results
3. 📈 Visualization and comparison
4. 🎯 Check against the 60% target
5. 💾 Saved to progress

<strong>💰 Total cost:</strong> 15 sentences × 5 paths = <strong>75 API calls</strong>

<strong>💡 Expected:</strong> Self-Consistency should give the <em>highest accuracy</em> of all methods (but at the highest cost)!

<strong>🎉 After this cell:</strong> You have run all 4 methods and can compare them in the next section!
</div>
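
The call count above is simple arithmetic, and it is worth setting next to the other methods. This sketch assumes that zero-shot, few-shot, and CoT each make one call per sentence, which is how this lab describes them, and uses k=5 for Self-Consistency.

```python
DATASET_SIZE = 15  # sentences in this lab's dataset
K_PATHS = 5        # reasoning paths per sentence for Self-Consistency

# Assumption: the first three methods make exactly one call per sentence.
calls_per_sentence = {
    "zero-shot": 1,
    "few-shot": 1,
    "CoT": 1,
    "self-consistency": K_PATHS,
}

total_calls = {method: n * DATASET_SIZE for method, n in calls_per_sentence.items()}
for method, calls in total_calls.items():
    print(f"{method:>16}: {calls} API calls")
```

Call count is only part of the cost: each Self-Consistency call is also a full CoT generation, so output tokens grow along with the number of calls.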

Run the cell below:

In [None]:
# === Run Self-Consistency on Full Dataset ===
metrics_sc = run_self_consistency_full_dataset(rows, progress, k=5)

---

# Comparison & Measurement

<div style="background:#f3f4f6; border-left:6px solid #6b7280; padding:20px; margin:20px 0; border-radius:6px; color:#1f2937; line-height:1.6;">
<strong style="font-size:17px;">📊 Time to Compare All Methods!</strong><br><br>

You've now run <strong>all four prompting methods</strong> on the same dataset. In this section, you will:
<ul style="margin:12px 0 12px 20px; line-height:1.8;">
  <li><strong>Compare performance:</strong> Which method achieved the best accuracy?</li>
  <li><strong>Analyze trade-offs:</strong> Cost vs. accuracy — when is each method worth it?</li>
  <li><strong>Understand metrics:</strong> What do precision, recall, and F1-score actually mean?</li>
  <li><strong>Reflect critically:</strong> Answer reflection questions about your findings</li>
</ul>

<strong style="display:block; margin-top:12px;">🎯 Your Task:</strong>
<ol style="margin:8px 0 0 20px; line-height:1.8;">
  <li>Review the comparison tables and charts below</li>
  <li>Identify which method performed best (and why)</li>
  <li>Answer the <strong>3 reflection questions</strong> at the end</li>
  <li>Click <strong>"💾 Save Reflections"</strong> to complete the lesson</li>
</ol>
</div>

---

### Understanding the Metrics

<div style="background:#fef3c7; border-left:4px solid #f59e0b; padding:16px; margin:16px 0; border-radius:6px; color:#78350f;">
<strong>📚 What do these metrics mean?</strong><br><br>

<strong>1. Accuracy</strong> — Overall correctness<br>
<em style="color:#92400e;">Formula: (Correct predictions) / (Total predictions)</em><br>
Example: If the model correctly classifies 12 out of 15 sentences → Accuracy = 80%<br>
<strong>When to use:</strong> Good for balanced datasets, but can be misleading if classes are imbalanced.<br><br>

<strong>2. Precision</strong> — How accurate are the positive predictions?<br>
<em style="color:#92400e;">Formula: (True Positives) / (True Positives + False Positives)</em><br>
Example: Of all sentences predicted as "positive", how many were actually positive?<br>
<strong>When to use:</strong> Important when false positives are costly (e.g., spam detection).<br><br>

<strong>3. Recall</strong> — How many actual positives did we find?<br>
<em style="color:#92400e;">Formula: (True Positives) / (True Positives + False Negatives)</em><br>
Example: Of all actually positive sentences, how many did we correctly identify?<br>
<strong>When to use:</strong> Important when missing positives is costly (e.g., disease detection).<br><br>

<strong>4. F1-Score</strong> — Balanced measure of precision and recall<br>
<em style="color:#92400e;">Formula: 2 × (Precision × Recall) / (Precision + Recall)</em><br>
Harmonic mean that balances both precision and recall.<br>
<strong>When to use:</strong> Best overall metric when you care about both false positives and false negatives.<br><br>

<strong>💡 Key Insight:</strong> High accuracy doesn't always mean a good model! Always look at precision, recall, and F1 together.
</div>
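
The four formulas above translate directly into code. This is a minimal sketch, not the implementation in `src/metrics.py`: the function name `classification_metrics` is hypothetical, and the counts are invented so that 12 of 15 predictions are correct, matching the accuracy example.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Guard against empty denominators, e.g. when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 6 + 6 correct out of 15 sentences.
m = classification_metrics(tp=6, fp=2, fn=1, tn=6)
for name, value in m.items():
    print(f"{name:>9}: {value:.3f}")
```

Here the two kinds of error are of similar size (2 false positives, 1 false negative), so the metrics sit close together. They diverge most when one kind of error dominates.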

---

### Confusion Matrix Terminology

<div style="background:#dbeafe; border-left:4px solid #3b82f6; padding:16px; margin:16px 0; border-radius:6px; color:#1e3a8a;">
<strong>🔍 Understanding True/False Positives/Negatives</strong><br><br>

<table style="width:100%; border-collapse: collapse; margin-top:12px;">
  <tr style="background:#eff6ff;">
    <th style="padding:12px; border:1px solid #bfdbfe; text-align:center;"></th>
    <th style="padding:12px; border:1px solid #bfdbfe; text-align:center;">Predicted: Positive</th>
    <th style="padding:12px; border:1px solid #bfdbfe; text-align:center;">Predicted: Negative</th>
  </tr>
  <tr>
    <td style="padding:12px; border:1px solid #bfdbfe; font-weight:bold;">Actually: Positive</td>
    <td style="padding:12px; border:1px solid #bfdbfe; background:#d1fae5; text-align:center;">
      <strong>True Positive (TP)</strong><br>
      ✅ Correctly identified positive
    </td>
    <td style="padding:12px; border:1px solid #bfdbfe; background:#fee2e2; text-align:center;">
      <strong>False Negative (FN)</strong><br>
      ❌ Missed a positive (predicted negative)
    </td>
  </tr>
  <tr>
    <td style="padding:12px; border:1px solid #bfdbfe; font-weight:bold;">Actually: Negative</td>
    <td style="padding:12px; border:1px solid #bfdbfe; background:#fee2e2; text-align:center;">
      <strong>False Positive (FP)</strong><br>
      ❌ Wrong positive alert (predicted positive)
    </td>
    <td style="padding:12px; border:1px solid #bfdbfe; background:#d1fae5; text-align:center;">
      <strong>True Negative (TN)</strong><br>
      ✅ Correctly identified negative
    </td>
  </tr>
</table>

<br>
<strong>💡 Example:</strong><br>
If a model predicts "positive" for 8 sentences:<br>
- 6 were actually positive → <strong>6 True Positives (TP)</strong><br>
- 2 were actually negative → <strong>2 False Positives (FP)</strong><br>
- Precision = 6/(6+2) = 75%
</div>
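
The example above can be reproduced by tallying the four cells from pairs of gold and predicted labels. This is a small sketch; `confusion_counts` is a hypothetical helper, not part of the lab's `src/` modules.

```python
from collections import Counter
from typing import Sequence


def confusion_counts(gold: Sequence[str], predicted: Sequence[str],
                     positive: str = "positive") -> Counter:
    """Tally TP, FP, FN, and TN, treating `positive` as the positive class."""
    counts = Counter()
    for g, p in zip(gold, predicted):
        if p == positive:
            counts["TP" if g == positive else "FP"] += 1
        else:
            counts["FN" if g == positive else "TN"] += 1
    return counts


# The 8 sentences the model labeled "positive": 6 truly positive, 2 truly negative.
gold = ["positive"] * 6 + ["negative"] * 2
predicted = ["positive"] * 8

counts = confusion_counts(gold, predicted)
precision = counts["TP"] / (counts["TP"] + counts["FP"])
print(dict(counts), f"precision={precision:.0%}")  # precision=75%
```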

---

Run the cell below to see the complete comparison and answer the reflection questions:

In [None]:
# === Method Comparison: Overview & Reflection ===
display_method_comparison(progress)

---

# Main Takeaways

### 🎯 What You Learned

#### **1. Zero-shot Prompting**
- ✅ How zero-shot prompts work (instruction only, no examples)
- ✅ Best for simple, clearly-defined tasks
- ✅ Lowest token cost (most efficient)
- ✅ Quick to implement and test

#### **2. Few-shot Prompting**
- ✅ How to provide demonstrations in prompts
- ✅ Demonstrations are included **before** the task input
- ✅ More tokens = higher cost than zero-shot
- ✅ Helps model learn patterns from examples

#### **3. Chain-of-Thought (CoT)**
- ✅ Asking models to show reasoning step-by-step
- ✅ Improves performance on complex tasks
- ✅ Longer responses (more tokens)
- ✅ Reasoning transparency helps debugging

#### **4. Self-Consistency**
- ✅ Running multiple CoT samples and voting
- ✅ Most reliable but most expensive (k×cost)
- ✅ Best for high-stakes decisions
- ✅ Aggregates diverse reasoning paths

---

### 📊 Trade-offs Summary

| Method | Cost | Accuracy | Use When |
|--------|------|----------|----------|
| **Zero-shot** | 💰 | Variable | Simple tasks, budget constraints |
| **Few-shot** | 💰💰 | Better | Need pattern examples |
| **CoT** | 💰💰 | High on complex tasks | Complex reasoning needed |
| **Self-Consistency** | 💰💰💰 | Highest | Critical decisions |

---

### 💡 Understanding Your Results

<div style="background:#e0f2fe; border-left:4px solid #0284c7; padding:16px; margin:16px 0; border-radius:6px; color:#075985;">
<strong>📊 A note about the trade-offs table above:</strong><br><br>

The table shows **general expectations**, but your actual results might differ!

You may notice that:
- ✅ **Zero-shot performed surprisingly well** — Sometimes simplicity wins
- ⚠️ **Few-shot wasn't always better** — Bad examples can hurt more than help
- 🤔 **CoT didn't always improve accuracy** — Requires careful prompt design
- 🎯 **Self-Consistency was most reliable** — Voting reduces variance

<strong>Why do results vary from expectations?</strong>
- 📝 **Dataset difficulty** — With clear sentiment words, all methods work well
- 🤖 **Model capability** — Strong models (like llama3.2-long) can handle complexity in zero-shot
- 🎲 **Prompt quality** — Generic CoT prompts may not always trigger better reasoning
- 📊 **Sample size** — With only 15 examples, small variations matter

<strong>💼 Key lesson:</strong> Never trust theoretical trade-offs blindly! Always benchmark on YOUR data with YOUR model before choosing a method in production.
</div>
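
The sample-size point can be made concrete with a little arithmetic. The sketch below uses the dataset size and the 60% target from this lab. The standard error uses the usual binomial approximation, which is an addition for illustration and only a rough guide at n=15.

```python
import math

N = 15               # sentences in the dataset
TARGET_PERCENT = 60  # pass mark used throughout the lab

# One sentence more or less moves accuracy by this many percentage points.
step = 100 / N

# Smallest number of correct answers that reaches the target.
needed = math.ceil(TARGET_PERCENT * N / 100)

# Rough uncertainty of an observed accuracy of 80% (12 of 15 correct),
# using the binomial standard error sqrt(p * (1 - p) / n).
observed = 12 / N
std_err = math.sqrt(observed * (1 - observed) / N)

print(f"each sentence is worth {step:.1f} points")
print(f"{needed} of {N} correct are needed for {TARGET_PERCENT}%")
print(f"80% observed, standard error about {std_err * 100:.0f} points")
```

A gap of one or two sentences between two methods is therefore well within the noise, so rerun or enlarge the dataset before calling a winner.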

---

### 💭 Further Reflection 

Take a moment to consider what you've learned:

1. **Which method surprised you the most in terms of performance?** Did the results match your expectations?

2. **When would you choose few-shot over CoT?** Think about real-world scenarios where token cost matters versus reasoning transparency.

3. **How did temperature affect Self-Consistency?** What happens if temperature is too low (0.0) or too high (1.5)?

4. **What limitations did you observe?** Consider edge cases, truncation issues, or when prompting alone might not be enough.

5. **How would you apply this in production?** Balance accuracy, cost, latency, and reliability for a real application.
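
For question 3, it may help to see what temperature does to a probability distribution. The sketch below applies temperature-scaled softmax to two made-up scores. The logits are hypothetical, and real models choose among many tokens, not two labels. Treating a temperature of 0 as "always pick the top option" is a common convention and is assumed here.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Assumed convention: T = 0 means greedy choice, all mass on the top score.
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0]  # hypothetical scores for 'positive' and 'negative'
for t in (0.0, 0.7, 1.5):
    p_pos, p_neg = softmax_with_temperature(logits, t)
    print(f"T={t}: positive={p_pos:.2f}, negative={p_neg:.2f}")
```

At T=0 every sample is identical, so voting adds nothing. At high T the samples drift toward a coin flip, so the votes become noisy.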

---
### 🚀 Next Steps

Now that you understand the core prompting techniques, consider exploring:

- **RAG (Retrieval-Augmented Generation)** – Combine prompting with knowledge retrieval
- **Fine-tuning (LoRA/QLoRA)** – Adapt models to specific domains
- **Auto-prompting** – Automatically optimize prompts
- **Experiment tracking** – MLOps practices (LangFuse, Weights & Biases)

---

**Congratulations!** 🎉 You've completed the Prompt Engineering Lab!

Run `python src/verify.py` to validate your work and generate your completion receipt.

---

# Progress tracker

Run the cell below to execute the automated checks and see a compact progress report. If it prints PASS and a receipt is created, you are done.


In [None]:
# === Progress Tracker & Validation ===
run_progress_tracker_and_validation(progress, project_root)