In [None]:
# DS776 Environment Setup & Package Update
# Configures storage paths for proper cleanup/sync, then updates introdl if needed
# If this cell fails, see Lessons/Course_Tools/AUTO_UPDATE_SYSTEM.md for help
%run ../../Lessons/Course_Tools/auto_update_introdl.py

# Homework 12 - Text Summarization

**Total Points: 40**

We're going to work with conversational data in this homework.  The `SAMsum` dataset consists of chat-like conversations and summaries like this:

Conversation-
```
Olivia: Who are you voting for in this election?
Oliver: Liberals as always.
Olivia: Me too!!
Oliver: Great
```

Summary-
```
Olivia and Olivier are voting for liberals in this election.
```

Applications for this kind of summarization include generating chat and meeting summaries.

Throughout this assignment you'll work with the first 100 conversations and summaries from the validation split of ["knkarthick/samsum"](https://huggingface.co/datasets/knkarthick/samsum) on Hugging Face.

## Task 1 - Build a zero-shot LLM conversation summarizer (10 points)

Use either an 8B local Llama model or an API-based model like `gemini-2.0-flash-lite` or better to build an `llm_summarizer` function that takes as input a list of conversations and returns a list of extracted summaries.  

**Implementation Options:**
- Use the `llm_generate()` function from the `introdl` package, OR
- Write your own code/wrapper to access the OpenRouter API (as demonstrated in Lesson 11)

Your function should be constructed similarly to `llm_classifier` or `llm_ner_extractor` in Lessons 8 and 10, respectively.  

Put some effort into the prompt to make it good at generating succinct summaries of conversations that identify both the topics and the people.

Your list of returned summaries should be cleanly extracted summaries with no additional text such as parts of the input prompt.

Give a qualitative evaluation of the first three generated summaries compared to the ground-truth summaries.
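
One way to meet the "cleanly extracted summaries" requirement is to post-process each raw model response before returning it. The sketch below is only an illustration: `clean_summary` is a hypothetical helper (it is not part of `introdl`), and it assumes the model may echo the prompt or prepend a label such as `Summary:`. Your model's actual output format may differ, so adapt the patterns to what you observe.

```python
import re

def clean_summary(raw_output: str, prompt: str = "") -> str:
    """Return just the summary text from a raw LLM response (hypothetical helper)."""
    text = raw_output
    # Drop an echoed prompt if the model repeated it at the start of its output.
    if prompt and text.startswith(prompt):
        text = text[len(prompt):]
    # Remove a leading label such as "Summary:" or "**Summary:**" (assumed formats).
    text = re.sub(
        r"^\s*\**\s*summary\s*\**\s*[:\-]\s*\**\s*",
        "",
        text,
        flags=re.IGNORECASE,
    )
    # Collapse line breaks and repeated spaces, then strip surrounding quotes.
    text = " ".join(text.split())
    return text.strip("\"'")

print(clean_summary("Summary: Olivia and Oliver are both voting for the Liberals."))
```

Inside `llm_summarizer`, a helper like this would be applied to every response before the list of summaries is returned.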

In [None]:
# YOUR CODE: Load the SAMsum dataset and get first 100 from validation split
# Hint: from datasets import load_dataset
# dataset = load_dataset("knkarthick/samsum")
# val_data = dataset["validation"].select(range(100))

## Storage Guidance

**Always use the path variables** (`MODELS_PATH`, `DATA_PATH`, `CACHE_PATH`) instead of hardcoded paths. The actual locations depend on your environment:

| Variable | CoCalc Home Server | Compute Server |
|----------|-------------------|----------------|
| `MODELS_PATH` | `Homework_12_Models/` | `Homework_12_Models/` *(synced)* |
| `DATA_PATH` | `~/home_workspace/data/` | `~/cs_workspace/data/` *(local)* |
| `CACHE_PATH` | `~/home_workspace/downloads/` | `~/cs_workspace/downloads/` *(local)* |

**Why this matters:**
- On **Compute Servers**: Only `MODELS_PATH` syncs back to CoCalc (~10GB limit). Data and cache stay local (~50GB).
- On **CoCalc Home**: Everything syncs and counts against the ~10GB limit.
- **Storage_Cleanup.ipynb** (in this folder) helps free synced space when needed.

**Tip:** Always write `MODELS_PATH / 'model.pt'` — never hardcode paths like `'Homework_12_Models/model.pt'`.

In [None]:
# YOUR CODE: Build llm_summarizer function here
# Requirements:
# - Takes a list of conversations as input
# - Returns a list of summaries
# - Uses llm_generate() or your own API wrapper
# - Prompt should encourage concise summaries that identify topics and people
# - Clean the output to return just the summary (no prompt text)

In [None]:
# YOUR CODE: Test llm_summarizer on first 3 conversations
# Get first 3 conversations and their ground-truth summaries
# Generate summaries using llm_summarizer
# Display results for comparison

### Qualitative Evaluation of First Three Summaries

📝 **YOUR EVALUATION HERE:**

For each of the three examples, discuss:
- How does the generated summary compare to the ground-truth?
- Does it capture the key topics and identify the people involved?
- Is it concise and coherent?
- Any issues with the generated summary?

## Task 2 - Build a few-shot LLM conversation summarizer (6 points)

Follow the same instructions as in Task 1, but add a few examples from the training data.  Don't simply pick the first few examples; instead, take some care to choose diverse conversations and/or conversations that are difficult to summarize.
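
A few-shot prompt can be assembled by placing conversation/summary pairs ahead of the conversation to be summarized. The sketch below is one possible layout, not a required format: `build_few_shot_prompt` is a hypothetical helper, the instruction wording is only a placeholder, and the `dialogue` and `summary` keys are assumed to match the dataset's field names (check `dataset["train"].column_names` to confirm).

```python
def build_few_shot_prompt(examples, conversation):
    """Assemble a few-shot summarization prompt (illustrative layout only)."""
    parts = [
        "Summarize each conversation in one or two sentences, "
        "naming the people involved.",
        "",
    ]
    # Each worked example shows the model a conversation followed by its summary.
    for ex in examples:
        parts.append(f"Conversation:\n{ex['dialogue']}")
        parts.append(f"Summary: {ex['summary']}")
        parts.append("")
    # The final conversation is left without a summary for the model to complete.
    parts.append(f"Conversation:\n{conversation}")
    parts.append("Summary:")
    return "\n".join(parts)

# Example taken from the introduction of this notebook.
demo_examples = [
    {
        "dialogue": "Olivia: Who are you voting for in this election?\n"
                    "Oliver: Liberals as always.\nOlivia: Me too!!\nOliver: Great",
        "summary": "Olivia and Olivier are voting for liberals in this election.",
    }
]
print(build_few_shot_prompt(demo_examples, "A: Lunch at noon?\nB: Sure."))
```

Ending the prompt with an open `Summary:` line tends to make the response easier to extract, since the model's continuation is the summary itself.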

In [None]:
# YOUR CODE: Select diverse few-shot examples from training data
# Look at several training examples and pick 2-4 diverse ones
# Consider: different conversation styles, lengths, topics

In [None]:
# YOUR CODE: Build few-shot llm_summarizer function
# Include your selected examples in the prompt
# Format: show conversation -> summary pairs before the test conversation

In [None]:
# YOUR CODE: Test few-shot llm_summarizer on first 3 conversations
# Use the same 3 conversations as Task 1 for comparison

### Qualitative Evaluation

📝 **YOUR EVALUATION HERE:**

Compare the few-shot results to:
- The ground-truth summaries
- The zero-shot results from Task 1

Does few-shot prompting improve the quality? In what ways?

## Task 3 - Refine the llm_score function (10 points)

For this task you can use a local Llama model or an API-based model.  (I personally find the API-based models much easier to use.)

Start with the `llm_score` function from last week and refine the prompt to improve the scoring to better reflect similarities in semantic meaning between two texts.  Here are some guidelines that you should incorporate into your prompt:

- A score of **100** means the texts have **identical meaning**.
- A score of **80–99** means they are **strong paraphrases** or very similar in meaning.
- A score of **50–79** means they are **somewhat related**, but not expressing the same idea.
- A score of **1–49** means they are **barely or loosely related**.
- A score of **0** means **no semantic similarity**.
- Take into account word meaning, order, and structure.
- Synonyms count as matches.
- Do not reward scrambled words unless they convey the same meaning.
- Make the prompt few-shot by including several text pairs and the corresponding similarity scores.

Demonstrate your `llm_score` function by applying it to the 7 sentence pairs from the lesson.  Comment on the performance of the scoring.  Does it still get fooled by the sixth and seventh pairs like BERTScore did?
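
Because the model returns text rather than a number, `llm_score` needs a step that converts each response into an integer in the 0 to 100 range. The sketch below shows one way to do that, assuming the response contains the score as its first integer. `parse_score` is a hypothetical helper, and returning a default of 0 when no number is found is a design choice you may want to change (for example, to raise an error instead).

```python
import re

def parse_score(response: str, default: int = 0) -> int:
    """Extract the first integer from an LLM response and clamp it to 0-100."""
    match = re.search(r"-?\d+", response)
    if match is None:
        # No number in the response; fall back to the default (an assumption).
        return default
    # Clamp so that out-of-range replies such as "150" or "-5" stay valid scores.
    return max(0, min(100, int(match.group())))

print(parse_score("Score: 85"))
```

Instructing the model to reply with only the number makes this step more reliable, since any other digits in the response would otherwise be picked up first.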


In [None]:
# The 7 sentence pairs from Lesson 12 for testing
test_pairs = [
    {
        "reference": "The cat sat on the mat.",
        "prediction": "The cat sat on the mat.",
        "description": "Exact match"
    },
    {
        "reference": "The cat sat on the mat.",
        "prediction": "The feline rested on the rug.",
        "description": "Synonym substitution"
    },
    {
        "reference": "The cat sat on the mat in the afternoon.",
        "prediction": "In the afternoon, the cat was sitting on the mat.",
        "description": "Paraphrase with reordering"
    },
    {
        "reference": "The government announced a stimulus package to support the economy during the recession.",
        "prediction": "A stimulus package was announced.",
        "description": "Shorter prediction"
    },
    {
        "reference": "The plane crashed due to engine failure.",
        "prediction": "The aircraft accident was caused by mechanical problems.",
        "description": "Different vocabulary"
    },
    {
        "reference": "The court ruled in favor of the defendant.",
        "prediction": "The judge made a ruling in the case of the cat and the fiddle.",
        "description": "Copying style not content (FAILURE CASE)"
    },
    {
        "reference": "The stock market crashed due to unexpected inflation news.",
        "prediction": "Inflation stock news market due crashed the unexpected.",
        "description": "Word salad (FAILURE CASE)"
    }
]

In [None]:
# YOUR CODE: Build llm_score function
# 
# Requirements:
# - Must handle both single strings and lists of strings
#   - llm_score(ref, pred) -> single score
#   - llm_score([ref1, ref2], [pred1, pred2]) -> list of scores
# - Use llm_generate() with the guidelines from the task description
# - Return scores as integers from 0-100
# - Consider using temperature=0 for consistent scoring
#
# Note: Zero-shot prompting works surprisingly well for this task with modern LLMs.
# You can experiment with few-shot examples if you'd like, but it may not improve performance.
# 
# Hint: llm_generate() already handles lists of prompts, so you can leverage that!

In [None]:
# YOUR CODE: Test llm_score on the 7 sentence pairs
# For each pair, display:
# - The description
# - The reference and prediction texts
# - The llm_score result
# 
# You can test them one at a time or use the list functionality

### Performance Analysis

📝 **YOUR ANALYSIS HERE:**

- How well does llm_score perform on these examples?
- Does it correctly handle the two FAILURE CASES (pairs 6 and 7)?
- Compare to BERTScore performance from the lesson - does llm_score avoid the same pitfalls?
- Did BERTScore get fooled by the sixth and seventh pairs? Does llm_score?

## Task 4 - Evaluate a Pre-trained Model and LLM_summarizer (10 points)

For this task you're going to qualitatively and quantitatively compare the generated summaries from:
1. The already fine-tuned Hugging Face model - ['philschmid/flan-t5-base-samsum'](https://huggingface.co/philschmid/flan-t5-base-samsum)
2. The zero-shot or few-shot LLM summarizer from above.

If, for some reason, you can't get the specified Hugging Face model to work, then find a different Hugging Face summarization model that has already been fine-tuned on SAMsum.

First, qualitatively compare the first three generated summaries from each approach to the ground-truth summaries.  Explain how the two approaches seem to be working on the three examples.

Second, compute ROUGE scores, BERTScore, and llm_score for the first 100 examples in the validation set. 

What do these scores suggest about the performance of the two approaches?  Is one approach clearly better than the other?  Is llm_score working well as a metric?  Does it agree with the other metrics?
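
One way to check whether `llm_score` agrees with the other metrics is to compute a correlation between the per-example scores. The sketch below implements the Pearson correlation with the standard library only. `pearson` is a hypothetical helper, and the two lists are made-up placeholder values that exist only to show the call; they are not results from any model.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of per-example scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined when one list is constant.
        return float("nan")
    return cov / sqrt(var_x * var_y)

# Placeholder values for illustration only (not real metric outputs).
toy_llm_scores = [90, 40, 75, 60]
toy_rouge_l = [0.55, 0.20, 0.48, 0.30]
print(f"Pearson r: {pearson(toy_llm_scores, toy_rouge_l):.2f}")
```

A value near 1 would suggest that the two metrics rank the summaries similarly, while a value near 0 would suggest that they capture different aspects of quality.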

In [None]:
# YOUR CODE: Load the fine-tuned Hugging Face model
# Use 'philschmid/flan-t5-base-samsum' or another SAMsum-fine-tuned model

In [None]:
# YOUR CODE: Generate summaries from both approaches for first 3 examples
# 1. Generate using the Hugging Face model
# 2. Generate using your llm_summarizer (zero-shot or few-shot)
# 3. Display all results alongside ground-truth

### Qualitative Comparison (First 3 Examples)

📝 **YOUR ANALYSIS HERE:**

For each approach, discuss:
- How do the summaries compare to ground-truth?
- Which approach produces better summaries?
- How do the two approaches seem to be working?
- Any notable differences in style or content?

In [None]:
# YOUR CODE: Compute metrics for all 100 validation examples
# 
# For both approaches:
# 1. Generate summaries for all 100 examples
# 2. Compute ROUGE scores (use evaluate.load("rouge"))
# 3. Compute BERTScore (use evaluate.load("bertscore"))
# 4. Compute llm_score for all 100 examples
#    Hint: Your llm_score should handle lists of texts
# 
# Display the results in a clear format for comparison

### Quantitative Analysis (100 Examples)

📝 **YOUR ANALYSIS HERE:**

- What do the ROUGE, BERTScore, and llm_score results suggest about performance?
- Is one approach clearly better than the other?
- Do the three metrics agree with each other?
- Is llm_score working well as a metric? Does it correlate with ROUGE and BERTScore?
- Which metric do you think best reflects actual summary quality?

## Task 5 - Comparison and Reflection (4 points)

* Give a brief summary of what you learned in this assignment.

* What did you find most difficult to understand?

## Task 6 - Using a Specialized Summarizer in Production (6 points)

In this task, you'll explore how to use the HuggingFace `pipeline` API to simplify deployment of the fine-tuned summarization model from Task 4.

### Part A: Create a Summarization Pipeline (2 points)

Create a summarization pipeline using the `philschmid/flan-t5-base-samsum` model (or whichever fine-tuned model you used in Task 4). Demonstrate generating a summary for a single conversation from the validation set using the pipeline.

**Hint**: The `pipeline()` function from the `transformers` library makes this very simple - you just need to specify the task type and model.

### Part B: Batch Processing (2 points)

Use the pipeline to generate summaries for a batch of 5 conversations at once. Compare the wall-clock time for batch processing versus processing the same 5 conversations individually in a loop.

**Note**: You can use the `batch_size` parameter in the pipeline call to control how many examples are processed together.

### Part C: Production Considerations (2 points)

Answer the following questions based on your experience with the pipeline API:

1. What are the main advantages of using the `pipeline` API compared to manually handling tokenization, model inference, and decoding?

2. For a production system that needs to summarize thousands of conversations per day, what factors would you need to consider when choosing between:
   - A specialized fine-tuned model (like `philschmid/flan-t5-base-samsum`)
   - A general-purpose LLM accessed via API (like your `llm_summarizer` from Tasks 1-2)
   
   Consider aspects like cost, latency, quality, and maintenance.

In [None]:
# YOUR CODE: Part A - Create summarization pipeline
# 
# Hint: from transformers import pipeline
# summarizer = pipeline("summarization", model="philschmid/flan-t5-base-samsum")
# 
# Test on a single conversation from validation set

In [None]:
# YOUR CODE: Part B - Batch processing comparison
# 
# Compare timing for:
# 1. Processing 5 conversations individually in a loop
# 2. Processing 5 conversations together with batch_size parameter
#
# Timing pattern:
import time

# Approach 1: Individual processing
start_time = time.perf_counter()  # perf_counter is a monotonic, high-resolution timer
# YOUR CODE: Process conversations one at a time
individual_time = time.perf_counter() - start_time

# Approach 2: Batch processing
start_time = time.perf_counter()
# YOUR CODE: Process all conversations together
batch_time = time.perf_counter() - start_time

print(f"Individual processing: {individual_time:.2f} seconds")
print(f"Batch processing: {batch_time:.2f} seconds")
if batch_time > 0:  # avoid dividing by zero before the batch code is filled in
    print(f"Speedup: {individual_time / batch_time:.2f}x")

### Part C: Production Considerations

📝 **YOUR ANSWERS HERE:**

**Question 1:** What are the main advantages of using the `pipeline` API compared to manually handling tokenization, model inference, and decoding?

**Question 2:** For a production system that needs to summarize thousands of conversations per day, what factors would you need to consider when choosing between:
- A specialized fine-tuned model (like `philschmid/flan-t5-base-samsum`)
- A general-purpose LLM accessed via API (like your `llm_summarizer`)

Consider aspects like:
- Cost (compute vs API fees)
- Latency (response time)
- Quality (accuracy and consistency)
- Maintenance (updates, monitoring, debugging)

### Export Notebook to HTML for Canvas Upload

Uncomment the two lines below and run the cell to export the current notebook to HTML.

In [None]:
# from introdl import export_this_to_html
# export_this_to_html()