In [None]:
# DS776 Auto-Update (runs in ~2 seconds, only updates when needed)
# If this cell fails, see Lessons/Course_Tools/AUTO_UPDATE_SYSTEM.md for help
%run ../../Lessons/Course_Tools/auto_update_introdl.py

# Homework 11: Text Generation and Decoding Strategies

**Total Points: 50**
- Reading Questions: 10 points
- Part 1 (Decoding Strategies Comparison): 10 points
- Part 2 (Extending the Conversation Class with Decoding Parameters): 8 points
- Part 3 (Model Size Comparison): 8 points
- Part 4 (Creative Text Generation): 8 points
- Part 5 (Analysis and Comparison): 4 points
- Part 6 (Reflection): 2 points

## Reading Questions (10 points)

Answer the following questions based on Chapter 5: Text Generation from *Natural Language Processing with Transformers*.

**Question 1 (2 points):** Explain how autoregressive (causal) language models like GPT-2 generate text. What is conditional text generation, and how does the chain rule of probability factor into the text generation process? Be specific about how the model predicts each token.

📝 **YOUR ANSWER HERE:**

**Question 2 (2 points):** Why do we use log probabilities instead of regular probabilities when scoring sequences in text generation? Explain the numerical stability problem that arises with regular probabilities and how log probabilities solve it. Include a brief discussion of the mathematical transformation involved.

📝 **YOUR ANSWER HERE:**

**Question 3 (2 points):** Compare and contrast greedy search decoding with beam search decoding. What are the advantages of beam search over greedy search? What problem do both methods share, and how can the `no_repeat_ngram_size` parameter help address it?

📝 **YOUR ANSWER HERE:**

**Question 4 (2 points):** Explain the role of the temperature parameter in sampling-based text generation. How does temperature affect the probability distribution over tokens? What happens when temperature is very low (T << 1) versus very high (T >> 1), and what are the trade-offs in terms of text quality?

📝 **YOUR ANSWER HERE:**

**Question 5 (2 points):** Describe top-k and nucleus (top-p) sampling methods. How do they differ in how they restrict the vocabulary for sampling? According to the textbook, which decoding methods should you use for tasks that require factual correctness (like summarization) versus tasks that benefit from creativity (like story generation)?

📝 **YOUR ANSWER HERE:**

## Part 1 - Decoding Strategies Comparison (10 points)

In this part, you'll experiment with different decoding strategies to see how they affect generated text quality and diversity.
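
As a warm-up before working with a real model, the sketch below applies temperature scaling, top-k filtering, and nucleus (top-p) filtering to a made-up five-token distribution. The logits and the helper names (`softmax_with_temperature`, `top_k_filter`, `top_p_filter`) are illustrative assumptions, not part of any library; real generation libraries apply the same ideas to the model's logits at every step.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

toy_logits = [4.0, 3.0, 2.0, 1.0, 0.0]  # made-up scores for a 5-token vocabulary

# Lower temperature sharpens the distribution; higher temperature flattens it
for t in (0.5, 1.0, 1.5):
    scaled_probs = softmax_with_temperature(toy_logits, t)
    print(f"T={t}: " + ", ".join(f"{q:.3f}" for q in scaled_probs))

probs = softmax_with_temperature(toy_logits)
print("top-k (k=2):", top_k_filter(probs, 2))
print("top-p (p=0.9):", top_p_filter(probs, 0.9))

# Sample one token index from the nucleus (seeded so the run is repeatable)
nucleus = top_p_filter(probs, 0.9)
rng = random.Random(0)
token = rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]
print("sampled token index:", token)
```

Note that top-k always keeps a fixed number of tokens, while the size of the nucleus depends on how peaked the distribution is.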

**Tasks:**
1. Load a text generation model (e.g., GPT-2 or a Llama model like "unsloth/Llama-3.2-3B-Instruct-bnb-4bit")
2. Create an interesting prompt for text generation (e.g., a story opening, a technical explanation, or a dialogue start)
3. Generate text using at least 5 of the following decoding configurations:
   - Greedy search (`do_sample=False`)
   - Beam search with `num_beams=5`
   - Beam search with `num_beams=5` and `no_repeat_ngram_size=2`
   - Sampling with `temperature=0.5`
   - Sampling with `temperature=1.5`
   - Top-k sampling (`top_k=50`)
   - Nucleus sampling (`top_p=0.9`)
   
4. For each generated text:
   - Display the full generated text
   - Calculate and display the sequence log probability
   - Comment on the coherence, diversity, and quality
   
5. Write a summary (3-4 paragraphs) comparing the different decoding strategies. Which produced the most coherent text? Which was most creative? Which would you choose for different use cases?
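
The sequence log probability in task 4 is the sum of the log probabilities of the generated tokens, each taken from the model's distribution at that step. The sketch below shows the arithmetic on made-up logits using only the standard library; the function names and numbers are assumptions for illustration. With a real model you would take the per-step logits from the model's output and would typically use tensor operations instead of Python lists.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one step's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def sequence_log_prob(step_logits, token_ids):
    """Sum the log probability of each chosen token given its step's logits."""
    return sum(log_softmax(logits)[tok] for logits, tok in zip(step_logits, token_ids))

# Toy example: 3 generation steps over a 4-token vocabulary (made-up numbers)
step_logits = [
    [2.0, 1.0, 0.5, 0.1],
    [0.3, 2.5, 0.2, 0.1],
    [1.0, 1.0, 3.0, 0.0],
]
token_ids = [0, 1, 2]  # the token chosen at each step
print(f"sequence log prob: {sequence_log_prob(step_logits, token_ids):.4f}")

# Why logs: multiplying many small probabilities underflows to 0.0,
# while the equivalent sum of logs stays finite.
product = math.prod([0.5] * 2000)
log_sum = 2000 * math.log(0.5)
print("product of probabilities:", product)
print(f"sum of log probabilities: {log_sum:.2f}")
```

Because longer sequences accumulate more negative terms, compare log probabilities only between generations of similar length, or divide by the number of tokens first.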

In [None]:
# YOUR CODE HERE
# Load model and tokenizer

In [None]:
# YOUR CODE HERE
# Define functions for log probability calculation (similar to textbook examples)

In [None]:
# YOUR CODE HERE
# Experiment with different decoding strategies

📝 **YOUR COMPARISON SUMMARY HERE:**

## Part 2 - Extending the Conversation Class with Decoding Parameters (8 points)

In Lesson 11, Section 5.5, you learned about the `Conversation` class for building chatbots with APIs. In Section 7.5, you learned about decoding parameters (temperature, top_p, max_tokens) that control text generation behavior.

In this part, you'll combine these concepts by extending the `Conversation` class to support decoding parameter control, allowing you to fine-tune the chatbot's behavior for different use cases.

**Learning Objectives:**
- Practice working with chat roles (system/user/assistant) and conversation history
- Understand how decoding parameters affect model behavior
- Apply object-oriented design patterns to build maintainable APIs
- Test different parameter combinations for various tasks


### Task 2.1: Copy and Extend the Conversation Class (3 points)

**Instructions:**

1. Copy the `Conversation` class from Lesson 11, Section 5.5 (you'll find it in the "Building a Chatbot Class" section)

2. Extend the class to support decoding parameters:
   - Add `temperature`, `top_p`, and `max_tokens` parameters to `__init__()` with these defaults:
     - `temperature=0.7`
     - `top_p=0.9`
     - `max_tokens=256`
   - Modify the `send()` method to pass these parameters to the API call
   - Allow per-message parameter overrides in `send()` (e.g., `send(message, temperature=0.0)`)

3. Your extended class should:
   - Store default parameters as instance variables
   - Use defaults if no overrides provided
   - Pass all three parameters to `client.chat.completions.create()`

**Hints:**
- The original `Conversation` class already handles the OpenAI client and message history
- You just need to add parameter handling to `__init__()` and `send()`
- Use `if parameter is not None` to check for overrides
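
The last hint matters because `0` and `0.0` are valid parameter values but are falsy in Python. The small sketch below uses a hypothetical helper named `resolve` (not part of the lesson's `Conversation` class) to show the difference between an `is None` check and the tempting `or` shortcut.

```python
def resolve(override, default):
    """Return the override unless it is None; 0 and 0.0 count as real overrides."""
    return default if override is None else override

default_temperature = 0.7

print(resolve(None, default_temperature))  # no override given, so the default is used
print(resolve(0.0, default_temperature))   # an explicit 0.0 override is kept

# The shortcut below treats 0.0 as "missing" because it is falsy,
# so a request for temperature=0.0 would silently become 0.7.
print(0.0 or default_temperature)
```

The same check applies to `top_p` and `max_tokens`, so one small helper like this can serve all three parameters inside `send()`.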


In [None]:
# YOUR CODE HERE
# Import necessary libraries
import os
from openai import OpenAI

# Copy and extend the Conversation class
class Conversation:
    """
    Extended chatbot class with decoding parameter control.

    Args:
        api_key (str): OpenRouter API key
        model (str): Model name (e.g., 'google/gemini-2.5-flash-lite')
        system_prompt (str, optional): Initial instructions for the model
        temperature (float): Sampling temperature (default: 0.7)
        top_p (float): Nucleus sampling parameter (default: 0.9)
        max_tokens (int): Maximum tokens to generate (default: 256)
    """

    def __init__(self, api_key, model="google/gemini-2.5-flash-lite",
                 system_prompt=None, temperature=0.7, top_p=0.9, max_tokens=256):
        # YOUR CODE HERE
        pass

    def send(self, user_message, temperature=None, top_p=None, max_tokens=None):
        """
        Send a message and get a response.

        Args:
            user_message (str): The user's input
            temperature (float, optional): Override default temperature
            top_p (float, optional): Override default top_p
            max_tokens (int, optional): Override default max_tokens

        Returns:
            str: The assistant's response
        """
        # YOUR CODE HERE
        pass

    def display(self, show_recent_only=True):
        """Show conversation history (you can use a simple print for now)"""
        # YOUR CODE HERE
        pass

    def reset(self, system_prompt=None):
        """Reset conversation history"""
        # YOUR CODE HERE
        pass

print("✓ Extended Conversation class defined")


### Task 2.2: Test Different Parameter Recipes (3 points)

**Instructions:**

Create three different chatbot configurations optimized for different use cases:

**Recipe 1: Factual Question Answering**
- `temperature=0` (deterministic/greedy)
- `top_p=1.0` (no nucleus truncation; it has no practical effect when `temperature=0`)
- `max_tokens=150`
- System prompt: "You are a knowledgeable assistant providing accurate, factual information."
- Test with a factual question (e.g., "What is photosynthesis?")

**Recipe 2: Creative Writing**
- `temperature=1.2` (more random)
- `top_p=0.95` (keeps most of the probability mass, so a wide range of tokens stays available)
- `max_tokens=300`
- System prompt: "You are a creative storyteller with a vivid imagination."
- Test with a creative prompt (e.g., "Start a science fiction story about AI")

**Recipe 3: General Conversation**
- `temperature=0.7` (balanced)
- `top_p=0.9` (nucleus sampling)
- `max_tokens=200`
- System prompt: "You are a friendly, helpful assistant."
- Test with a conversational prompt (e.g., "Explain neural networks to a beginner")

For each recipe:
1. Create a new `Conversation` instance with the specified parameters
2. Send at least 2 messages to create a multi-turn conversation
3. Display the conversation (or print the messages)
4. Note the characteristics of the responses (consistency, creativity, length)


In [None]:
# YOUR CODE HERE
# Recipe 1: Factual Question Answering

print("=" * 70)
print("RECIPE 1: FACTUAL QUESTION ANSWERING")
print("(temperature=0, top_p=1.0, max_tokens=150)")
print("=" * 70)

# Create chatbot
factual_chat = Conversation(
    api_key=os.getenv("OPENROUTER_API_KEY"),
    model="google/gemini-2.5-flash-lite",
    system_prompt="You are a knowledgeable assistant providing accurate, factual information.",
    temperature=0,
    top_p=1.0,
    max_tokens=150
)

# Test with multi-turn conversation
# YOUR CODE HERE - send at least 2 messages and display responses


In [None]:
# YOUR CODE HERE
# Recipe 2: Creative Writing

print("\n" + "=" * 70)
print("RECIPE 2: CREATIVE WRITING")
print("(temperature=1.2, top_p=0.95, max_tokens=300)")
print("=" * 70)

# YOUR CODE HERE - create chatbot and test with creative prompts


In [None]:
# YOUR CODE HERE
# Recipe 3: General Conversation

print("\n" + "=" * 70)
print("RECIPE 3: GENERAL CONVERSATION")
print("(temperature=0.7, top_p=0.9, max_tokens=200)")
print("=" * 70)

# YOUR CODE HERE - create chatbot and test with conversational prompts


### Task 2.3: Analysis and Comparison (2 points)

**Write 2-3 paragraphs addressing:**

1. **Parameter Effects:**
   - How did temperature affect the responses? Compare the factual (temp=0) vs creative (temp=1.2) outputs.
   - What role did `top_p` play? Was the difference noticeable?
   - How did `max_tokens` affect response completeness?

2. **Use Case Recommendations:**
   - Which recipe worked best for each type of task?
   - When would you choose low temperature vs high temperature?
   - What trade-offs did you observe between consistency and creativity?

3. **Conversation History:**
   - Did the chatbot maintain context across multiple turns?
   - How does sending the full conversation history enable this?
   - What are the cost implications of longer conversations?


📝 **YOUR ANALYSIS HERE:**


## Part 3 - Model Size Comparison (8 points)

Different model sizes offer different trade-offs between quality and computational cost. In this part, you'll compare text generation across different model sizes.

**Tasks:**
1. Load at least two different-sized models (e.g., 3B and 8B parameter models). Suggested options:
   - "unsloth/Llama-3.2-3B-Instruct-bnb-4bit" (3B parameters, 4-bit quantized)
   - "unsloth/Llama-3.1-8B-Instruct-bnb-4bit" (8B parameters, 4-bit quantized)
   
2. Create 3 diverse prompts that test different capabilities:
   - A factual knowledge question
   - A creative writing task
   - A reasoning/problem-solving task
   
3. Generate responses from each model for each prompt using the same decoding parameters (e.g., temperature=0.7, top_p=0.9)

4. For each generation, measure and record:
   - Generation time
   - Memory usage (if possible)
   - Response length
   - Qualitative assessment of quality
   
5. Create visualizations:
   - Bar chart comparing generation times
   - Any other relevant comparisons
   
6. Write an analysis (2-3 paragraphs) discussing:
   - Performance differences between model sizes
   - Quality differences in outputs
   - When you would choose each model size
   - Trade-offs between speed and quality
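
One way to collect the timing and length measurements in task 4 is a small wrapper around whatever generation function you write. The sketch below is a minimal version under stated assumptions: `time_generation` is a hypothetical helper, and `fake_generate` is a stand-in that only sleeps briefly so the example runs without a model.

```python
import statistics
import time

def time_generation(generate_fn, prompt, repeats=3):
    """Call generate_fn(prompt) several times; return mean seconds and the last response."""
    durations = []
    response = ""
    for _ in range(repeats):
        start = time.perf_counter()
        response = generate_fn(prompt)
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), response

def fake_generate(prompt):
    """Stand-in for a real model call (hypothetical); replace with your own function."""
    time.sleep(0.01)
    return prompt + " ... generated text"

seconds, text = time_generation(fake_generate, "Once upon a time")
print(f"mean time: {seconds:.3f}s, response length: {len(text.split())} words")
```

Averaging over a few runs smooths out one-off delays such as first-call warm-up. If you time a model running on a GPU with PyTorch, consider calling `torch.cuda.synchronize()` before stopping the clock, since GPU work is launched asynchronously.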

In [None]:
# YOUR CODE HERE
# Load different-sized models

In [None]:
# YOUR CODE HERE
# Create prompts and generate responses with timing

In [None]:
# YOUR CODE HERE
# Create visualizations comparing performance

📝 **YOUR ANALYSIS HERE:**

## Part 4 - Creative Text Generation Application (8 points)

Apply what you've learned about text generation to build a creative application. Choose ONE of the following:

### Option A: Story Continuation System
- Create a system that takes a story opening and generates multiple continuations using different decoding strategies
- Allow the user to select which continuation they prefer
- Continue the story iteratively, building on the selected continuations
- Generate at least 3 rounds of story development

### Option B: Dialogue Generator
- Create a multi-character dialogue system
- Define personalities for 2-3 characters (via system prompts)
- Generate a conversation between the characters on a given topic
- Experiment with different decoding parameters for each character to reflect their personality

### Option C: Writing Style Transformer
- Take a piece of text and rewrite it in different styles (formal, casual, poetic, technical, etc.)
- Use both local models and API-based models
- Compare how different models handle style transformation
- Test with at least 3 different input texts and 4 different target styles

**Requirements:**
- Use at least 2 different models (one local, one via API)
- Experiment with at least 3 different decoding configurations
- Include clear output formatting and labeling
- Write a reflection (2-3 paragraphs) on what worked well and what didn't, and what you learned about text generation from this exercise
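
Whichever option you choose, keeping the decoding configurations in a named dictionary makes it easy to run the same prompt through each one and label the output clearly. The sketch below assumes a hypothetical `generate_fn(prompt, **params)` interface; `fake_generate` is a placeholder that echoes its settings, so substitute your own local-model or API call.

```python
def run_configs(generate_fn, prompt, configs):
    """Run generate_fn once per named config and return {name: text}."""
    return {name: generate_fn(prompt, **params) for name, params in configs.items()}

def fake_generate(prompt, temperature=1.0, top_p=1.0, max_tokens=50):
    """Stand-in for a local model or API call (hypothetical)."""
    return f"{prompt} [temperature={temperature}, top_p={top_p}, max_tokens={max_tokens}]"

# Example configurations; the names and values are illustrative choices
configs = {
    "conservative": {"temperature": 0.3, "top_p": 0.9},
    "balanced": {"temperature": 0.7, "top_p": 0.9},
    "adventurous": {"temperature": 1.2, "top_p": 0.95},
}

results = run_configs(fake_generate, "The lighthouse keeper opened the door and", configs)
for name, text in results.items():
    print("=" * 70)
    print(f"CONFIG: {name} {configs[name]}")
    print("=" * 70)
    print(text)
```

Passing the generation function as an argument lets the same loop drive both a local model and an API-based model.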

In [None]:
# YOUR CODE HERE
# Implement your chosen creative application

📝 **YOUR REFLECTION HERE:**

## Part 5 - Analysis and Comparison (4 points)

Synthesize your findings from all the previous parts into a comprehensive analysis.

**Write 2-3 paragraphs addressing:**

1. **Local vs. API-Based Generation:**
   - Compare the experience of using local models vs. API-based models
   - Discuss trade-offs in terms of control, cost, latency, and quality
   - When would you choose each approach for a production application?

2. **Decoding Strategy Selection:**
   - Which decoding strategies performed best for different types of tasks?
   - How did you balance coherence and diversity in your experiments?
   - What general principles would you follow when choosing decoding parameters?

3. **Practical Applications:**
   - What real-world applications would benefit from the techniques you explored?
   - What challenges did you encounter that would need to be addressed for production use?
   - What additional features or improvements would make your implementations more robust?

📝 **YOUR ANALYSIS HERE:**

## Part 6 - Reflection (2 points)

1. What, if anything, did you find difficult to understand in this lesson? Why?

📝 **YOUR ANSWER HERE:**

2. What resources did you find supported your learning most and least for this lesson? (Be honest - I use your input to shape the course.)

📝 **YOUR ANSWER HERE:**

### Export Notebook to HTML for Canvas Upload

Uncomment the two lines below and run the cell to export the current notebook to HTML.

In [None]:
# from introdl import export_this_to_html
# export_this_to_html()