In [None]:
# DS776 Auto-Update (runs in ~2 seconds, only updates when needed)
# If this cell fails, see Lessons/Course_Tools/AUTO_UPDATE_SYSTEM.md for help
%run ../../Lessons/Course_Tools/auto_update_introdl.py

# Homework 11: Text Generation and Decoding Strategies

**Name:** [Your Name Here]  
**Total Points: 40**

## Submission Checklist
- [ ] All code cells executed with output saved
- [ ] All questions answered in markdown cells
- [ ] Used `DATA_PATH` and `MODELS_PATH` variables (no hardcoded paths)
- [ ] Decoding strategies compared with visualizations
- [ ] Creative application implemented
- [ ] Reflection questions answered
- [ ] Notebook exported to HTML
- [ ] Canvas filename includes `_GRADE_THIS_ONE`
- [ ] Files uploaded to Canvas

---

**Point Breakdown:**
- Part 1 (Decoding Strategies): 10 pts
- Part 2 (API Helper Functions): 8 pts
- Part 3 (Model Size Comparison): 8 pts
- Part 4 (Creative Application): 8 pts
- Part 5 (Analysis and Comparison): 4 pts
- Part 6 (Reflection): 2 pts

## Part 1 - Decoding Strategies (10 pts)

In this part, you'll experiment with different decoding strategies to see how they affect generated text quality and diversity.

**Model Suggestions for A6000 GPUs (48GB VRAM):**

**Small models (3-4B parameters, ~2-3GB VRAM):**
- `"unsloth/Llama-3.2-3B-Instruct-bnb-4bit"` - Fast, good for testing
- `"unsloth/Qwen2.5-3B-Instruct-bnb-4bit"` - Alternative 3B option

**Medium models (7-8B parameters, ~5-6GB VRAM):**
- `"unsloth/Llama-3.1-8B-Instruct-bnb-4bit"` - Balanced quality/speed
- `"unsloth/Meta-Llama-3.1-8B-bnb-4bit"` - Base 8B variant
- `"unsloth/Qwen2.5-7B-Instruct-bnb-4bit"` - Strong 7B model

**Large models (14B parameters, ~8-10GB VRAM):**
- `"unsloth/Qwen2.5-14B-Instruct-bnb-4bit"` - High quality generation

**Very large models (70B+ parameters, ~40-45GB VRAM):**
- `"unsloth/Meta-Llama-3.1-70B-bnb-4bit"` - Maximum quality (use if you have exclusive GPU access)

**Note:** All these models use 4-bit quantization (bnb-4bit), which reduces weight memory by roughly 4x compared to 16-bit (half-precision) weights. Start with a 3B or 8B model for faster experimentation, then try larger models if you want to compare quality.
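
The memory figures above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes the comparison is against 16-bit weights and counts the weights only; the helper name `estimate_weight_memory_gb` is made up for illustration. Real usage is higher because of quantization metadata, activations, and the KV cache.

```python
def estimate_weight_memory_gb(num_params_billions, bits_per_weight):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    # params (in billions) * bits / 8 bits-per-byte gives GB directly
    return num_params_billions * bits_per_weight / 8


for name, size_b in [("3B", 3), ("8B", 8), ("14B", 14), ("70B", 70)]:
    half = estimate_weight_memory_gb(size_b, 16)
    four_bit = estimate_weight_memory_gb(size_b, 4)
    print(f"{name}: ~{half:.1f} GB at 16-bit, ~{four_bit:.1f} GB at 4-bit")
```

The gap between these weight-only estimates and the VRAM ranges listed above is the runtime overhead you should expect on top of the weights.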

### 1.1 Understanding Model Outputs (2 pts)

Before implementing decoding strategies, you need to understand how to extract and display token probabilities from the model.

**Instructions:**

1. Load a text generation model (use one of the quantized models suggested above)
2. Define a prompt and get the model's output for the next token prediction
3. Complete the provided `get_top_tokens_with_probs()` function to:
   - Extract logits from model output
   - Convert logits to probabilities using softmax
   - Return the top-k tokens with their probabilities
4. Create a visualization showing the probability distribution
5. Answer the analysis questions below

**Hints:**
- Review Section 6 in the lesson for similar examples
- Use `torch.softmax()` to convert logits to probabilities
- Use `torch.topk()` to get the top tokens
- Probabilities should sum to approximately 1.0
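
To see what softmax and top-k do before touching a real model, here is a minimal pure-Python sketch on a made-up four-token vocabulary. The names `softmax`, `top_k`, `toy_vocab`, and `toy_logits` are illustrative only; in the homework you would use `torch.softmax()` and `torch.topk()` on the model's logits instead.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k):
    """Return the k (index, probability) pairs with the highest probability."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Toy values, not real model output
toy_vocab = ["bright", "uncertain", "here", "banana"]
toy_logits = [2.0, 1.0, 0.5, -1.0]

probs = softmax(toy_logits)
for idx, p in top_k(probs, k=3):
    print(f"{toy_vocab[idx]:>10s}  {p:.3f}")
print("sum of all probabilities:", round(sum(probs), 6))
```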

## Storage Guidance

**Always use the path variables** (`MODELS_PATH`, `DATA_PATH`, `CACHE_PATH`) instead of hardcoded paths. The actual locations depend on your environment:

| Variable | CoCalc Home Server | Compute Server |
|----------|-------------------|----------------|
| `MODELS_PATH` | `Homework_11_Models/` | `Homework_11_Models/` *(synced)* |
| `DATA_PATH` | `~/home_workspace/data/` | `~/cs_workspace/data/` *(local)* |
| `CACHE_PATH` | `~/home_workspace/downloads/` | `~/cs_workspace/downloads/` *(local)* |

**Why this matters:**
- On **Compute Servers**: Only `MODELS_PATH` syncs back to CoCalc (~10GB limit). Data and cache stay local (~50GB).
- On **CoCalc Home**: Everything syncs and counts against the ~10GB limit.
- **Storage_Cleanup.ipynb** (in this folder) helps free synced space when needed.

**Tip:** Always write `MODELS_PATH / 'model.pt'`; never hardcode paths like `'Homework_11_Models/model.pt'`.
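
As a small illustration of the tip, the sketch below assumes `MODELS_PATH` behaves like a `pathlib.Path` (which the `/` operator in the tip suggests). The assignment to `MODELS_PATH` here is a stand-in so the example runs on its own; in the course environment the variable is already provided, so do not redefine it in your notebook.

```python
from pathlib import Path

# Stand-in for illustration only (assumption: the real MODELS_PATH is a Path)
MODELS_PATH = Path("Homework_11_Models")

# Build file paths by joining onto the variable instead of typing the folder name
checkpoint_path = MODELS_PATH / "model.pt"
print(checkpoint_path.name)
```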

In [None]:
# YOUR CODE HERE
# Import necessary libraries
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import matplotlib.pyplot as plt

# Load model and tokenizer
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit"  # or another model from suggestions

# YOUR CODE: Load tokenizer and model


# YOUR CODE: Define a prompt and tokenize it
prompt = "The future of artificial intelligence is"


def get_top_tokens_with_probs(model, input_ids, k=10):
    """
    Extract top-k tokens and their probabilities for next token prediction.

    Args:
        model: The language model
        input_ids: Tokenized input (tensor)
        k: Number of top tokens to return

    Returns:
        List of (token_string, probability) tuples

    Hints:
    - Get model output: output = model(input_ids)
    - Extract logits for last position: logits = output.logits[0, -1, :]
    - Convert to probabilities: probs = torch.softmax(logits, dim=-1)
    - Get top-k: top_probs, top_ids = torch.topk(probs, k=k)
    - Decode tokens: tokenizer.decode(token_id) (uses the tokenizer loaded above)
    """
    # YOUR CODE HERE
    pass


def plot_probability_distribution(probabilities, title="Token Probabilities"):
    """
    Plot probability distribution for top 50 tokens and cumulative probability.

    Hints:
    - Sort probabilities: sorted_probs, _ = torch.sort(probabilities, descending=True)
    - Use plt.bar() for the bar chart of the top 50 probabilities
    - Use torch.cumsum() for cumulative probability
    - Create a 2-subplot figure
    """
    # YOUR CODE HERE
    pass


# YOUR CODE: Call the functions and display results
# 1. Get top 10 tokens with probabilities
# 2. Print them in a formatted table
# 3. Create probability distribution plot


### 1.2 Implement Three Decoding Strategies (5 pts)

Now implement three core decoding strategies and compare their behavior. Use the **same prompt** for all three strategies so you can compare the results fairly.

**Strategy 1: Greedy Search**
- `do_sample=False` (no sampling parameter needed)
- Always picks the token with highest probability
- Deterministic - same output every time

**Strategy 2: Beam Search with No-Repeat N-Gram**
- `num_beams=5`
- `no_repeat_ngram_size=2` (prevents repetitive phrases)
- `do_sample=False`
- Explores multiple paths simultaneously
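
The idea behind `no_repeat_ngram_size` can be sketched in a few lines: before choosing the next token, find every token that would complete an n-gram already present in the sequence and exclude it. This is a simplified illustration, not the `transformers` implementation, and `banned_next_tokens` is a hypothetical helper name.

```python
def banned_next_tokens(generated, ngram_size):
    """Return tokens that would repeat an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) < ngram_size - 1:
        return set()
    prefix_len = ngram_size - 1
    # The last (n-1) tokens form the prefix of the n-gram about to be completed
    current_prefix = tuple(generated[len(generated) - prefix_len:])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = generated[i:i + ngram_size]
        if tuple(ngram[:-1]) == current_prefix:
            banned.add(ngram[-1])
    return banned


tokens = ["the", "cat", "sat", "on", "the"]
# "the cat" already occurred, so "cat" may not follow the final "the"
print(banned_next_tokens(tokens, ngram_size=2))
```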

**Strategy 3: Nucleus (Top-P) Sampling**
- `do_sample=True`
- `top_p=0.9` (sample from tokens covering 90% probability mass)
- `temperature=0.7` (slightly lower than default for more focused generation)
- Stochastic - different output each time

**For each strategy:**
1. Generate text (`max_length=100`, which counts the prompt tokens, or `max_new_tokens=50`, which counts only generated tokens)
2. Display the generated text
3. Show the top 5 token probabilities at **three key timesteps**:
   - Beginning (token 1 or 2 after prompt)
   - Middle (around token 25)
   - End (last few tokens)
4. Create a comparison visualization
5. Comment on the text quality and probability patterns
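
To build intuition for how `temperature` and `top_p` interact, here is a pure-Python sketch on made-up logits. It is a toy illustration of the concepts, not what `model.generate()` runs internally, and the function names are hypothetical.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_filter(probs, top_p):
    """Keep the smallest set of top tokens whose mass reaches top_p, then renormalize."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for idx, p in ranked:
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {idx: p / total for idx, p in kept}


toy_logits = [3.0, 2.0, 1.0, 0.0, -1.0]  # toy values, not real model output

probs = apply_temperature(toy_logits, temperature=0.7)
greedy_choice = max(range(len(probs)), key=lambda i: probs[i])  # greedy: argmax
nucleus = nucleus_filter(probs, top_p=0.9)
nucleus_at_t1 = nucleus_filter(apply_temperature(toy_logits, temperature=1.0), top_p=0.9)

rng = random.Random(0)  # seeded so the sketch is repeatable
sampled_choice = rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]

print("greedy token id:", greedy_choice)
print("nucleus size at T=0.7:", len(nucleus), "| at T=1.0:", len(nucleus_at_t1))
print("sampled token id:", sampled_choice)
```

Lowering the temperature sharpens the distribution, so fewer tokens are needed to reach the same `top_p` mass.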

In [None]:
# YOUR CODE HERE
# Strategy 1: Greedy Search

prompt = "In a shocking scientific discovery, researchers found"  # Use your own prompt
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to(device)

print("=" * 70)
print("STRATEGY 1: GREEDY SEARCH")
print("=" * 70)

# Generate with greedy search
# YOUR CODE HERE

# Display generated text
# YOUR CODE HERE

# Show probability distributions at key timesteps
# Hint: You'll need to call the model again with the generated sequence
#       to get probabilities at different positions
# YOUR CODE HERE


In [None]:
# YOUR CODE HERE
# Strategy 2: Beam Search with No-Repeat N-Gram

print("\n" + "=" * 70)
print("STRATEGY 2: BEAM SEARCH (num_beams=5, no_repeat_ngram_size=2)")
print("=" * 70)

# YOUR CODE HERE


In [None]:
# YOUR CODE HERE
# Strategy 3: Nucleus (Top-P) Sampling

print("\n" + "=" * 70)
print("STRATEGY 3: NUCLEUS SAMPLING (top_p=0.9, temperature=0.7)")
print("=" * 70)

# YOUR CODE HERE

### 1.3 Comparison and Analysis (3 pts)

Write 2-3 paragraphs comparing the three decoding strategies. Address:

**Text Quality:**
- Which strategy produced the most coherent text?
- Did any strategy produce repetitive text?
- Which was most creative/diverse?

**Probability Patterns:**
- How did the probability distributions differ between strategies?
- Did greedy search show higher confidence (more peaked distributions)?
- How did sampling affect the probability spread?
- Did you notice the distributions changing from beginning to end of generation?

**Use Case Recommendations:**
- When would you use greedy search?
- When would beam search be preferred?
- When would you choose nucleus sampling?
- What trade-offs exist between coherence and diversity?

📝 **YOUR ANALYSIS HERE:**

## Part 2 - API Helper Functions (8 pts)

In Lesson 11, Section 5.5, you learned about the `Conversation` class for building chatbots with APIs. In Section 7.5, you learned about decoding parameters (temperature, top_p, max_tokens) that control text generation behavior.

In this part, you'll combine these concepts by extending the `Conversation` class to support decoding parameter control, allowing you to fine-tune the chatbot's behavior for different use cases.

**Learning Objectives:**
- Practice working with chat roles (system/user/assistant) and conversation history
- Understand how decoding parameters affect model behavior
- Apply object-oriented design patterns to build maintainable APIs
- Test different parameter combinations for various tasks
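
One design detail worth getting right is how per-call overrides fall back to the instance defaults. The sketch below shows one possible pattern under the assumption that `None` means "use the default"; `resolve_decoding_params` is a hypothetical helper, not part of the lesson's `Conversation` class.

```python
def resolve_decoding_params(defaults, **overrides):
    """Merge per-call overrides into the defaults, skipping overrides left as None."""
    resolved = dict(defaults)  # copy so the defaults are never mutated
    for name, value in overrides.items():
        # Compare against None explicitly: temperature=0 is a valid override
        if value is not None:
            resolved[name] = value
    return resolved


defaults = {"temperature": 0.7, "top_p": 0.9, "max_tokens": 256}
params = resolve_decoding_params(defaults, temperature=0, top_p=None, max_tokens=150)
print(params)
```

Writing `temperature or default` instead would silently discard `temperature=0`, since `0` is falsy in Python.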

In [None]:
# YOUR CODE HERE
# Import necessary libraries
import os
from openai import OpenAI

# Copy and extend the Conversation class
class Conversation:
    """
    Extended chatbot class with decoding parameter control.

    Args:
        api_key (str): OpenRouter API key
        model (str): Model name (e.g., 'google/gemini-2.5-flash-lite')
        system_prompt (str, optional): Initial instructions for the model
        temperature (float): Sampling temperature (default: 0.7)
        top_p (float): Nucleus sampling parameter (default: 0.9)
        max_tokens (int): Maximum tokens to generate (default: 256)
    """

    def __init__(self, api_key, model="google/gemini-2.5-flash-lite",
                 system_prompt=None, temperature=0.7, top_p=0.9, max_tokens=256):
        # YOUR CODE HERE
        pass

    def send(self, user_message, temperature=None, top_p=None, max_tokens=None):
        """
        Send a message and get a response.

        Args:
            user_message (str): The user's input
            temperature (float, optional): Override default temperature
            top_p (float, optional): Override default top_p
            max_tokens (int, optional): Override default max_tokens

        Returns:
            str: The assistant's response
        """
        # YOUR CODE HERE
        pass

    def display(self, show_recent_only=True):
        """Show conversation history (you can use a simple print for now)"""
        # YOUR CODE HERE
        pass

    def reset(self, system_prompt=None):
        """Reset conversation history"""
        # YOUR CODE HERE
        pass

print("✓ Extended Conversation class defined")

In [None]:
# YOUR CODE HERE
# Recipe 1: Factual Question Answering

print("=" * 70)
print("RECIPE 1: FACTUAL QUESTION ANSWERING")
print("(temperature=0, top_p=1.0, max_tokens=150)")
print("=" * 70)

# Create chatbot and test with multi-turn conversation
# YOUR CODE HERE

## Part 3 - Model Size Comparison (8 pts)

Different model sizes offer different trade-offs between quality and computational cost. In this part, you'll compare text generation across different model sizes.

**Tasks:**
1. Load at least two different-sized models (e.g., 3B and 8B parameter models). Suggested options:
   - `"unsloth/Llama-3.2-3B-Instruct-bnb-4bit"` (3B parameters, 4-bit quantized)
   - `"unsloth/Llama-3.1-8B-Instruct-bnb-4bit"` (8B parameters, 4-bit quantized)
   
2. Create 3 diverse prompts that test different capabilities:
   - A factual knowledge question
   - A creative writing task
   - A reasoning/problem-solving task
   
3. Generate responses from each model for each prompt using the same decoding parameters (e.g., temperature=0.7, top_p=0.9)

4. For each generation, measure and record:
   - Generation time
   - Memory usage (if possible)
   - Response length
   - Qualitative assessment of quality
   
5. Create visualizations:
   - Bar chart comparing generation times
   - Any other relevant comparisons
   
6. Write an analysis (2-3 paragraphs) discussing:
   - Performance differences between model sizes
   - Quality differences in outputs
   - When you would choose each model size
   - Trade-offs between speed and quality
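
A simple way to collect the timing and length measurements is to wrap each generation call in a small helper. The sketch below uses a stand-in `fake_generate` function so it runs without a model; both it and `time_generation` are hypothetical names, and you would pass your own generation function instead.

```python
import time


def time_generation(generate_fn, prompt):
    """Run generate_fn(prompt) and record wall-clock time and response length."""
    start = time.perf_counter()
    text = generate_fn(prompt)
    elapsed = time.perf_counter() - start
    return {
        "prompt": prompt,
        "seconds": elapsed,
        "num_words": len(text.split()),
        "text": text,
    }


def fake_generate(prompt):
    # Hypothetical stand-in for a real model call
    time.sleep(0.01)
    return prompt + " ... (generated text)"


result = time_generation(fake_generate, "Explain how photosynthesis works in plants.")
print(f"{result['seconds']:.3f} s, {result['num_words']} words")
```

When timing a GPU model, calling `torch.cuda.synchronize()` before stopping the timer is a common precaution, and `torch.cuda.max_memory_allocated()` is one way to record peak memory.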

In [None]:
# YOUR CODE HERE
# Part 3: Model Size Comparison

# Load and compare different model sizes
# Suggested: Compare a 3B model vs an 8B model

import time

# Define your test prompts
prompts = [
    "Explain how photosynthesis works in plants.",  # Factual
    "Write a short poem about the ocean at sunset.",  # Creative
    "If a train travels 60 mph for 2.5 hours, how far does it go? Show your reasoning."  # Reasoning
]

# YOUR CODE: Load models, generate responses, measure performance
# Track: generation time, response length, quality assessment

📝 **YOUR PART 3 ANALYSIS HERE:**

Discuss your findings on model size comparison (2-3 paragraphs):
- Performance differences between model sizes
- Quality differences in outputs
- When you would choose each model size
- Trade-offs between speed and quality

## Part 4 - Creative Application (8 pts)

Design and implement a creative text generation application that combines the techniques you've learned. Choose ONE of the following options or propose your own:

**Option A: Interactive Story Generator**
- Create a choose-your-own-adventure style story
- Use different temperature settings for different story branches
- Allow user choices to influence the narrative direction

**Option B: Writing Style Mimicry**
- Prompt the model to write in different author styles (e.g., Shakespeare, Hemingway, technical documentation)
- Compare outputs across different styles
- Analyze how well the model captures stylistic elements

**Option C: Multi-Agent Conversation**
- Create multiple chatbot personas with different system prompts
- Have them "discuss" a topic with each other
- Experiment with different temperature settings for different personalities

**Option D: Custom Application**
- Propose your own creative application
- Must demonstrate understanding of decoding strategies and/or API-based generation
- Should be something you find interesting or useful

**Requirements:**
1. Implement your chosen application with working code
2. Use appropriate decoding parameters for your use case
3. Generate at least 3-5 example outputs
4. Write a brief explanation (1-2 paragraphs) of:
   - What you built and why
   - What parameters you chose and why
   - What worked well and what could be improved

In [None]:
# YOUR CODE HERE
# Part 4: Creative Application

# Implement your chosen option (A, B, C, or D)
# Remember to:
# 1. Use appropriate decoding parameters
# 2. Generate 3-5 example outputs
# 3. Document your parameter choices

📝 **YOUR PART 4 EXPLANATION HERE:**

- What did you build and why?
- What parameters did you choose and why?
- What worked well and what could be improved?

## Part 5 - Analysis and Comparison (4 pts)

Synthesize your findings from all the previous parts into a comprehensive analysis.

**Write 2-3 paragraphs addressing:**

1. **Local vs. API-Based Generation:**
   - Compare the experience of using local models vs. API-based models
   - Discuss trade-offs in terms of control, cost, latency, and quality
   - When would you choose each approach for a production application?

2. **Decoding Strategy Selection:**
   - Which decoding strategies performed best for different types of tasks?
   - How did you balance coherence and diversity in your experiments?
   - What general principles would you follow when choosing decoding parameters?

3. **Practical Applications:**
   - What real-world applications would benefit from the techniques you explored?
   - What challenges did you encounter that would need to be addressed for production use?
   - What additional features or improvements would make your implementations more robust?

📝 **YOUR ANALYSIS HERE:**

## Part 6 - Reflection (2 pts)

* What, if anything, did you find difficult to understand in the lesson? Why?

📝 **YOUR ANSWER HERE:**

* What resources did you find supported your learning most and least for this lesson? (Be honest - I use your input to shape the course.)

📝 **YOUR ANSWER HERE:**

### Export Notebook to HTML for Canvas Upload

Uncomment the two lines below and run the cell to export the current notebook to HTML.

In [None]:
# from introdl import export_this_to_html
# export_this_to_html()