# Notebook 02 • Prompt Engineering Lab

**Goal:** Experiment with different prompting techniques and learn to evaluate their effectiveness.

---

## 1. Setup

We'll use the OpenAI API (or any OpenAI-compatible endpoint). Make sure your API key is set in the `OPENAI_API_KEY` environment variable before running the cells below.

In [None]:
# Install if needed
# !pip install openai python-dotenv pandas

import os
from openai import OpenAI
import pandas as pd
import json
from typing import List, Dict
import time

# Initialize client
# Option 1: Use environment variable
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Option 2: For local LLMs (Ollama, LM Studio, etc.)
# client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")
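
If the key is missing, the first API call fails with an authentication error that can be hard to trace back to setup. A minimal sketch of an early check is shown below; `require_env` is a hypothetical helper name introduced here, not part of the OpenAI SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell before starting the notebook."
        )
    return value


# Usage (uncomment in the notebook):
# client = OpenAI(api_key=require_env("OPENAI_API_KEY"))
```

Failing fast here keeps configuration problems separate from prompt problems later in the lab.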

## 2. Helper Functions

Let's create some utilities for making API calls and tracking metrics.

In [None]:
def call_llm(
    messages: List[Dict],
    model: str = "gpt-3.5-turbo",
    temperature: float = 0.7,
    max_tokens: int = 500,
) -> Dict:
    """
    Call the LLM and return response with metadata.
    
    Returns:
        Dict with 'content', 'tokens_in', 'tokens_out', 'latency_ms'
    """
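    # perf_counter is a monotonic clock, better suited to measuring elapsed time than time.time()
    start = time.perf_counter()
    
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens,
    )
    
    latency_ms = (time.perf_counter() - start) * 1000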
    
    return {
        "content": response.choices[0].message.content,
        "tokens_in": response.usage.prompt_tokens,
        "tokens_out": response.usage.completion_tokens,
        "latency_ms": round(latency_ms, 2),
    }


def simple_prompt(user_input: str, system: str = "You are a helpful assistant.") -> Dict:
    """Simple single-turn prompt."""
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": user_input},
    ]
    return call_llm(messages)


print("Helper functions ready!")

## 3. Zero-shot vs Few-shot Prompting

### Zero-shot
Zero-shot prompting asks the model to perform a task without showing it any examples.

In [None]:
# Zero-shot sentiment classification
result = simple_prompt(
    user_input="Classify the sentiment of this review as positive, negative, or neutral:\n\n"
                "'The product arrived late but the quality exceeded my expectations.'",
    system="You are a sentiment classifier. Respond with only: POSITIVE, NEGATIVE, or NEUTRAL.",
)

print(f"Response: {result['content']}")
print(f"Tokens: {result['tokens_in']} in / {result['tokens_out']} out")
print(f"Latency: {result['latency_ms']}ms")

### Few-shot
Few-shot prompting includes labeled examples in the prompt to guide the model toward the expected labels and output format.

In [None]:
# Few-shot prompt with examples
few_shot_system = """You are a sentiment classifier. Classify reviews as POSITIVE, NEGATIVE, or NEUTRAL.

Examples:
Review: "Love this! Best purchase ever."
Sentiment: POSITIVE

Review: "Terrible quality, broke after one day."
Sentiment: NEGATIVE

Review: "It's okay. Does what it says."
Sentiment: NEUTRAL

Review: "Mixed feelings. Great features but poor support."
Sentiment: NEUTRAL

Now classify the following review. Respond with only the sentiment."""

result = simple_prompt(
    user_input="The product arrived late but the quality exceeded my expectations.",
    system=few_shot_system,
)

print(f"Response: {result['content']}")
print(f"Tokens: {result['tokens_in']} in / {result['tokens_out']} out")
print(f"Latency: {result['latency_ms']}ms")
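
The cell above packs the examples into the system prompt. An alternative worth trying is to present each example as a prior user/assistant turn. The sketch below shows one way to build such a message list; `build_few_shot_messages` is a hypothetical helper, and whether this layout beats the system-prompt version is something to measure rather than assume.

```python
from typing import Dict, List, Tuple


def build_few_shot_messages(
    instruction: str,
    examples: List[Tuple[str, str]],
    query: str,
) -> List[Dict[str, str]]:
    """Build a chat message list with each example as a user/assistant pair."""
    messages = [{"role": "system", "content": instruction}]
    for review, label in examples:
        messages.append({"role": "user", "content": f"Review: {review}"})
        messages.append({"role": "assistant", "content": label})
    # The real query goes last, in the same format as the examples.
    messages.append({"role": "user", "content": f"Review: {query}"})
    return messages


messages = build_few_shot_messages(
    instruction="Classify reviews as POSITIVE, NEGATIVE, or NEUTRAL. Respond with only the label.",
    examples=[
        ("Love this! Best purchase ever.", "POSITIVE"),
        ("Terrible quality, broke after one day.", "NEGATIVE"),
    ],
    query="The product arrived late but the quality exceeded my expectations.",
)
# In the notebook: result = call_llm(messages)
```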

## 4. Chain-of-Thought Prompting

Asking the model to "think step by step" can improve accuracy on multi-step problems such as arithmetic, at the cost of more output tokens.

In [None]:
problem = "A store sells apples for $2 each. If you buy 7 apples and pay with a $20 bill, how much change do you get?"

# Without CoT
result_direct = simple_prompt(
    user_input=problem,
    system="You are a math helper. Give the answer directly.",
)
print("Direct answer:")
print(result_direct['content'])
print()

# With Chain-of-Thought
result_cot = simple_prompt(
    user_input=problem,
    system="You are a math helper. Think step by step, showing your work. Then give the final answer.",
)
print("Chain-of-Thought:")
print(result_cot['content'])
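
To compare the two responses automatically, you need the reference answer and a way to pull the final number out of free-form text. For this problem the expected change is 20 - 7 × 2 = 6 dollars. The sketch below uses a simple "last number in the text" heuristic; `extract_final_number` is a hypothetical helper, and the heuristic will misfire if the model adds numbers after its answer or writes a subtraction without spaces.

```python
import re
from typing import Optional


def extract_final_number(text: str) -> Optional[float]:
    """Return the last number that appears in the text, or None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(matches[-1]) if matches else None


expected_change = 20 - 7 * 2  # 7 apples at $2 each, paid with a $20 bill

# Made-up response text, used only to exercise the parser.
sample = "7 apples x $2 = $14. Change: $20 - $14 = $6."
print(extract_final_number(sample) == expected_change)
```

In the notebook you could apply the same check to `result_direct['content']` and `result_cot['content']`.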

## 5. Temperature and Sampling

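Temperature controls how random the sampling is: values near 0 make the output nearly deterministic, while higher values produce more varied (and eventually less coherent) text. Let's see the effect.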

In [None]:
prompt = "Write a one-sentence tagline for a coffee shop."
temperatures = [0.0, 0.5, 1.0, 1.5]

print("Same prompt, different temperatures:\n")

for temp in temperatures:
    result = call_llm(
        messages=[{"role": "user", "content": prompt}],
        temperature=temp,
        max_tokens=50,
    )
    print(f"Temperature {temp}:")
    print(f"  {result['content']}\n")
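
To build intuition for what the parameter does, the toy sketch below applies a temperature-scaled softmax to made-up logits. It assumes the common formulation in which logits are divided by the temperature before the softmax. The sampling code behind any given API is not visible from this notebook, and the logits are invented for illustration.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0.2, 1.0, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures flatten the distribution, so lower-ranked tokens get sampled more often.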

## 6. Structured Output (JSON)

Asking for JSON lets you parse the model's output programmatically instead of scraping free-form text.

In [None]:
json_prompt = """Extract information from this text and return ONLY valid JSON:

Text: "John Smith is a 32-year-old software engineer from Seattle who loves hiking and coffee."

Return JSON with keys: name, age, occupation, city, hobbies (array)
Example format: {"name": "...", "age": 0, "occupation": "...", "city": "...", "hobbies": ["..."]}
"""

result = simple_prompt(json_prompt, system="You return only valid JSON, no other text.")

print("Raw response:")
print(result['content'])
print()

# Try to parse it
try:
    data = json.loads(result['content'])
    print("Parsed successfully:")
    print(json.dumps(data, indent=2))
except json.JSONDecodeError as e:
    print(f"Failed to parse: {e}")
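
Even when told to return only JSON, models sometimes wrap the object in a Markdown code fence, which makes `json.loads` fail. The sketch below strips one surrounding fence before parsing; `parse_json_response` is a hypothetical helper, and it does not handle other kinds of stray text around the JSON.

```python
import json
import re
from typing import Any

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable


def parse_json_response(text: str) -> Any:
    """Parse JSON from a model response, tolerating one surrounding code fence."""
    cleaned = text.strip()
    pattern = rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$"
    fenced = re.match(pattern, cleaned, flags=re.DOTALL)
    if fenced:
        cleaned = fenced.group(1)
    return json.loads(cleaned)  # still raises json.JSONDecodeError on invalid JSON


# Made-up response text, used only to exercise the parser.
wrapped = FENCE + 'json\n{"name": "John Smith", "age": 32}\n' + FENCE
print(parse_json_response(wrapped))
```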

## 7. 🎯 Sentiment Classification Benchmark

Let's build a proper benchmark to test different prompts on a labeled dataset.

In [None]:
# Sample labeled dataset
TEST_DATA = [
    {"text": "This is the best product I've ever bought!", "label": "POSITIVE"},
    {"text": "Completely disappointed. Waste of money.", "label": "NEGATIVE"},
    {"text": "It works as expected.", "label": "NEUTRAL"},
    {"text": "Amazing customer service, very helpful!", "label": "POSITIVE"},
    {"text": "The item arrived damaged.", "label": "NEGATIVE"},
    {"text": "Average product, nothing special.", "label": "NEUTRAL"},
    {"text": "Highly recommend to everyone!", "label": "POSITIVE"},
    {"text": "Never buying from here again.", "label": "NEGATIVE"},
    {"text": "Does the job, I guess.", "label": "NEUTRAL"},
    {"text": "Exceeded all my expectations!", "label": "POSITIVE"},
]

def evaluate_prompt(test_data, system_prompt, verbose=False):
    """
    Evaluate a prompt on test data.
    Returns accuracy and details.
    """
    correct = 0
    results = []
    
    for item in test_data:
        result = simple_prompt(
            user_input=f"Review: {item['text']}",
            system=system_prompt,
        )
        
        prediction = result['content'].strip().upper()
        is_correct = prediction == item['label']
        
        if is_correct:
            correct += 1
        
        results.append({
            "text": item['text'][:40],
            "expected": item['label'],
            "predicted": prediction,
            "correct": is_correct,
            "tokens": result['tokens_in'] + result['tokens_out'],
        })
        
        if verbose:
            status = "✓" if is_correct else "✗"
            print(f"{status} | Expected: {item['label']} | Got: {prediction}")
    
    accuracy = correct / len(test_data)
    total_tokens = sum(r['tokens'] for r in results)
    
    return {
        "accuracy": accuracy,
        "correct": correct,
        "total": len(test_data),
        "total_tokens": total_tokens,
        "results": results,
    }


print("Evaluation function ready!")
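
`evaluate_prompt` compares the raw response to the label with an exact match, so a reply such as `Positive.` or `Sentiment: NEGATIVE` counts as wrong even though the classification is right. The sketch below is one way to normalize responses before comparing; `normalize_label` is a hypothetical helper, and treating ambiguous replies as `None` is a design choice you may want to change.

```python
import re
from typing import Optional

LABELS = ("POSITIVE", "NEGATIVE", "NEUTRAL")


def normalize_label(raw: str) -> Optional[str]:
    """Return the single label mentioned in the response, or None if absent or ambiguous."""
    upper = raw.upper()
    found = {label for label in LABELS if re.search(rf"\b{label}\b", upper)}
    return found.pop() if len(found) == 1 else None


print(normalize_label("Positive."))
print(normalize_label("Sentiment: NEGATIVE"))
print(normalize_label("Could be positive or negative"))
```

Using this in place of `.strip().upper()` separates formatting failures from genuine misclassifications when you read the accuracy numbers.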

In [None]:
# Test different prompt strategies
prompts = {
    "Zero-shot (minimal)": "Classify the sentiment. Respond only: POSITIVE, NEGATIVE, or NEUTRAL.",
    
    "Zero-shot (detailed)": """You are a sentiment classifier.
Analyze the given review text and classify its overall sentiment.
Consider the tone, words used, and implied emotion.
Respond with ONLY one word: POSITIVE, NEGATIVE, or NEUTRAL.""",
    
    "Few-shot": """Classify sentiment. Examples:
"Love this!" -> POSITIVE
"Terrible experience" -> NEGATIVE
"It's okay" -> NEUTRAL
Now classify. Respond only: POSITIVE, NEGATIVE, or NEUTRAL.""",
}

# Run evaluations (commented out to save API calls - uncomment to run)
# for name, prompt in prompts.items():
#     print(f"\n{'='*50}")
#     print(f"Prompt: {name}")
#     print(f"{'='*50}")
#     eval_result = evaluate_prompt(TEST_DATA, prompt, verbose=True)
#     print(f"\nAccuracy: {eval_result['accuracy']:.1%}")
#     print(f"Total tokens: {eval_result['total_tokens']}")

## 8. Building Your Prompt Catalog

Create a collection of reusable prompt templates.

In [None]:
PROMPT_CATALOG = {
    "summarize": {
        "system": "You are a concise summarizer. Capture the key points in 2-3 sentences.",
        "template": "Summarize the following text:\n\n{text}",
        "use_case": "Document summarization",
    },
    "extract-entities": {
        "system": "You extract named entities. Return JSON with keys: people, organizations, locations, dates.",
        "template": "Extract entities from:\n\n{text}",
        "use_case": "Information extraction",
    },
    "code-review": {
        "system": "You are a code reviewer. Focus on: bugs, performance, readability, security.",
        "template": "Review this code and list issues/suggestions:\n\n```\n{code}\n```",
        "use_case": "Code quality checks",
    },
    "explain-like-im-5": {
        "system": "You explain complex topics simply, using analogies a 5-year-old would understand.",
        "template": "Explain this concept: {topic}",
        "use_case": "Educational content",
    },
    "rewrite-formal": {
        "system": "You rewrite text to be more professional and formal while keeping the meaning.",
        "template": "Rewrite this formally:\n\n{text}",
        "use_case": "Professional communication",
    },
}

# Save to JSON for reuse (create the target directory first if it does not exist)
os.makedirs("../prompts", exist_ok=True)
with open("../prompts/catalog.json", "w") as f:
    json.dump(PROMPT_CATALOG, f, indent=2)

print("Prompt catalog saved!")
print(f"\nAvailable prompts: {list(PROMPT_CATALOG.keys())}")
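
To use a catalog entry, fill the template's placeholders and pass the result along with the entry's system prompt. A minimal sketch follows; `render_prompt` is a hypothetical helper, and the two-entry catalog is a stand-in so the example runs on its own.

```python
from typing import Dict, Tuple


def render_prompt(
    catalog: Dict[str, Dict[str, str]], name: str, **fields: str
) -> Tuple[str, str]:
    """Return (system_prompt, user_prompt) for a catalog entry with placeholders filled."""
    entry = catalog[name]
    # str.format raises KeyError if a placeholder in the template is not supplied.
    return entry["system"], entry["template"].format(**fields)


demo_catalog = {
    "summarize": {
        "system": "You are a concise summarizer. Capture the key points in 2-3 sentences.",
        "template": "Summarize the following text:\n\n{text}",
    },
    "explain-like-im-5": {
        "system": "You explain complex topics simply.",
        "template": "Explain this concept: {topic}",
    },
}

system, user = render_prompt(demo_catalog, "summarize", text="LLMs predict the next token.")
print(user)
# In the notebook: result = simple_prompt(user_input=user, system=system)
```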

## 9. 🎯 Your Tasks

### Task 1: Improve the Sentiment Classifier
Try to get 100% accuracy on the test set. Experiment with:
- Different system prompts
- More few-shot examples
- Chain-of-thought

### Task 2: Create Your Own Prompt Template
Add a new template to the catalog for a use case you care about.

In [None]:
# Your experiments here

# TODO: Create a prompt for a task you find useful
my_prompt = {
    "system": "Your system prompt here",
    "template": "Your template with {placeholders}",
    "use_case": "What this is for",
}

# Test it
# result = simple_prompt(
#     user_input=my_prompt["template"].format(placeholders="your input"),  # keyword must match the {placeholders} name in the template
#     system=my_prompt["system"],
# )
# print(result['content'])

### Task 3: Track and Compare

Create a comparison table of your prompt experiments.

In [None]:
# TODO: Run your experiments and record results
experiments = [
    # {"name": "Experiment 1", "accuracy": 0.8, "tokens": 150, "notes": "..."},
]

df = pd.DataFrame(experiments)
df
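
One way to fill the `experiments` list is to turn each `evaluate_prompt` result into a flat row. The sketch below assumes the result dictionary has the `accuracy`, `total`, and `total_tokens` keys returned in Section 7; `experiment_row` is a hypothetical helper, and the numbers in the example are placeholders, not measured results.

```python
from typing import Any, Dict


def experiment_row(name: str, eval_result: Dict[str, Any], notes: str = "") -> Dict[str, Any]:
    """Flatten an evaluate_prompt-style result into one row for a comparison table."""
    total = eval_result["total"]
    tokens = eval_result["total_tokens"]
    return {
        "name": name,
        "accuracy": eval_result["accuracy"],
        "tokens": tokens,
        "tokens_per_item": round(tokens / total, 1) if total else 0.0,
        "notes": notes,
    }


# Placeholder values for illustration only.
fake_result = {"accuracy": 0.8, "correct": 8, "total": 10, "total_tokens": 450, "results": []}
row = experiment_row("Zero-shot (minimal)", fake_result, notes="placeholder numbers")
print(row)
# In the notebook: experiments.append(row); df = pd.DataFrame(experiments)
```

Recording tokens per item next to accuracy makes the cost of longer few-shot prompts visible in the same table.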

## 10. 📝 Reflection

1. Which prompt technique gave the best results for sentiment classification?
2. How does temperature affect the consistency of outputs?
3. What are the trade-offs between few-shot examples and prompt length?
4. When would you use JSON output vs. free-form text?

---

**Next:** Continue to Module 03 to learn about fine-tuning!