In [None]:
# =============================================================================
# DS776 REQUIRED SETUP - Run this cell FIRST, before any other code!
# =============================================================================
# This cell:
#   1. Configures cache paths so downloads go to the right location
#   2. Updates the course package (introdl) if needed
#   3. Suppresses TensorFlow/Keras warnings
#
# If this fails, see: Lessons/Course_Tools/SETUP_HELP.md
# =============================================================================
%run ../../Lessons/Course_Tools/auto_update_introdl.py

In [None]:
# =============================================================================
# IMPORTS AND PATH CONFIGURATION
# =============================================================================

# Course utilities - ALWAYS use flattened imports from introdl
from introdl import (
    config_paths_keys,
    get_device,
    wrap_print_text,
)

# Wrap print to format text nicely at 120 characters
print = wrap_print_text(print, width=120)

device = get_device()

# Configure paths - ALWAYS use these variables, never hardcode paths!
paths = config_paths_keys()
DATA_PATH = paths['DATA_PATH']      # Where datasets are stored
MODELS_PATH = paths['MODELS_PATH']  # Where your trained models are saved
# CACHE_PATH = paths['CACHE_PATH']  # (Optional) Where pretrained models cache

# Homework 12 - Text Summarization

We're going to work with conversational data in this homework.  The `SAMsum` dataset consists of chat-like conversations and summaries like this:

Conversation-
```
Olivia: Who are you voting for in this election?
Oliver: Liberals as always.
Olivia: Me too!!
Oliver: Great
```

Summary-
```
Olivia and Olivier are voting for liberals in this election.
```

Applications for this kind of summarization include generating chat and meeting summaries.

Throughout this assignment you'll work with the first 100 conversations and summaries from the validation split of ["knkarthick/samsum"](https://huggingface.co/datasets/knkarthick/samsum) on Hugging Face.

## Storage Guidance

**Always use the path variables** (`MODELS_PATH`, `DATA_PATH`, `CACHE_PATH`) instead of hardcoded paths. The actual locations depend on your environment:

| Variable | CoCalc Home Server | Compute Server |
|----------|-------------------|----------------|
| `MODELS_PATH` | `Homework_12_Models/` | `Homework_12_Models/` *(synced)* |
| `DATA_PATH` | `~/home_workspace/data/` | `~/cs_workspace/data/` *(local)* |
| `CACHE_PATH` | `~/home_workspace/downloads/` | `~/cs_workspace/downloads/` *(local)* |

**Why this matters:**
- On **Compute Servers**: Only `MODELS_PATH` syncs back to CoCalc (~10GB limit). Data and cache stay local (~50GB).
- On **CoCalc Home**: Everything syncs and counts against the ~10GB limit.
- **Storage_Cleanup.ipynb** (in this folder) helps free synced space when needed.

**Tip:** Always write `MODELS_PATH / 'model.pt'` — never hardcode paths like `'Homework_12_Models/model.pt'`.
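
A minimal sketch of the pattern in the tip above. The `MODELS_PATH` defined here is a stand-in so the snippet runs on its own; in the notebook it comes from `config_paths_keys()`, and the file name is only an example.

```python
from pathlib import Path

# Stand-in for illustration only. In the notebook, MODELS_PATH is set by
# config_paths_keys() and already points at the right location.
MODELS_PATH = Path("Homework_12_Models")

# Join with the / operator so the same line works in every environment.
model_file = MODELS_PATH / "model.pt"

print(model_file.name)    # model.pt
print(model_file.parent)  # Homework_12_Models
```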

## Task 1 - Build a zero-shot LLM conversation summarizer (10 points)

Use either an 8B local Llama model or an API-based model like `gemini-2.0-flash-lite` or better to build an `llm_summarizer` function that takes as input a list of conversations and returns a list of extracted summaries.  Your function should be constructed similarly to `llm_classifier` or `llm_ner_extractor` in Lessons 8 and 10, respectively.  

Put some effort into the prompt so that it produces succinct summaries of conversations that identify both the topics and the people.

Your list of returned summaries should be cleanly extracted summaries with no additional text such as parts of the input prompt.
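
One possible way to get cleanly extracted summaries is to end the prompt with a fixed marker and keep only the text after its last occurrence. The sketch below is a rough outline, not the structure of `llm_classifier` from the lessons: `build_summary_prompt`, `extract_summary`, the `SUMMARY:` marker, and the `generate` argument (any function that maps a prompt string to a reply string) are all assumptions made for illustration, and `fake_generate` is a stand-in for a real model call.

```python
def build_summary_prompt(conversation: str) -> str:
    """Return a zero-shot prompt whose last line is the marker 'SUMMARY:'."""
    return (
        "Summarize the conversation below in one or two sentences. "
        "Name the people involved and the main topic. "
        "Write only the summary.\n\n"
        f"Conversation:\n{conversation}\n\n"
        "SUMMARY:"
    )


def extract_summary(response: str, marker: str = "SUMMARY:") -> str:
    """Keep the text after the last marker (or the whole reply if it is absent)."""
    tail = response.rpartition(marker)[2]
    return " ".join(tail.split())  # collapse newlines and repeated spaces


def llm_summarizer(conversations, generate):
    """Summarize each conversation; `generate` maps a prompt to a reply string."""
    return [extract_summary(generate(build_summary_prompt(c))) for c in conversations]


def fake_generate(prompt: str) -> str:
    # Hypothetical stand-in for a real model call. It echoes the prompt, as
    # some local models do, and then appends a canned summary.
    return prompt + " Olivia and Oliver are both voting for the Liberals.\n"


conversation = "Olivia: Who are you voting for in this election?\nOliver: Liberals as always."
summaries = llm_summarizer([conversation], fake_generate)
print(summaries[0])
```

Because the marker appears only once, at the very end of the prompt, the same extraction works whether the model echoes the prompt or returns the summary alone.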

Give a qualitative evaluation of the first three generated summaries compared to the ground-truth summaries.

## Task 2 - Build a few-shot LLM conversation summarizer (6 points)

Follow the same instructions as in Task 1, but add a few examples from the training data.  Don't simply pick the first examples. Instead, take some care to choose diverse conversations and/or conversations that are difficult to summarize.
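
As one rough heuristic for diversity, you could pick examples spread across conversation lengths. The sketch below assumes each training example is a dict with `dialogue` and `summary` keys; the helper names and the toy conversations are made up for illustration. Length is only a proxy for diversity, so you may still want to read the candidates and swap in ones that are harder to summarize.

```python
def pick_diverse_examples(examples, k=3):
    """Pick k examples spread evenly from the shortest to the longest dialogue."""
    if k <= 0 or not examples:
        return []
    ranked = sorted(examples, key=lambda ex: len(ex["dialogue"]))
    if k >= len(ranked):
        return ranked
    if k == 1:
        return [ranked[len(ranked) // 2]]
    step = (len(ranked) - 1) / (k - 1)
    return [ranked[round(i * step)] for i in range(k)]


def format_few_shot(examples):
    """Render the chosen examples as a block to place before the new conversation."""
    return "\n\n".join(
        f"Conversation:\n{ex['dialogue']}\nSUMMARY: {ex['summary']}" for ex in examples
    )


# Made-up toy data, only to show the selection; not taken from SAMsum.
toy_train = [
    {"dialogue": "Ann: Lunch at noon?\nBob: Sure.",
     "summary": "Ann and Bob will have lunch at noon."},
    {"dialogue": "Cy: ok", "summary": "Cy agrees."},
    {"dialogue": "Dee: Did you send the report?\nEd: Not yet, the numbers are still off.\nDee: Deadline is Friday.",
     "summary": "Ed has not sent Dee the report, which is due Friday."},
    {"dialogue": "Flo: Call me.\nGus: Now?", "summary": "Flo asks Gus to call."},
    {"dialogue": "Hal: Is the meeting moved?\nIda: Yes, to 3pm in room B.",
     "summary": "Ida tells Hal the meeting moved to 3pm in room B."},
]

picked = pick_diverse_examples(toy_train, k=3)
print(format_few_shot(picked))
```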

## Task 3 - Refine the llm_score function (10 points)

For this task you can use a local Llama model or an API-based model.  (I personally find the API-based models much easier to use.)

Start with the `llm_score` function from last week and refine the prompt to improve the scoring to better reflect similarities in semantic meaning between two texts.  Here are some guidelines that you should incorporate into your prompt:

- A score of **100** means the texts have **identical meaning**.
- A score of **80–99** means they are **strong paraphrases** or very similar in meaning.
- A score of **50–79** means they are **somewhat related**, but not expressing the same idea.
- A score of **1–49** means they are **barely or loosely related**.
- A score of **0** means **no semantic similarity**.
- Take into account word meaning, order, and structure.
- Synonyms count as matches.
- Do not reward scrambled words unless they convey the same meaning.
- Make the prompt few-shot by including several text pairs and the corresponding similarity scores.
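
The guidelines above might be assembled into a prompt along these lines. This is a sketch under stated assumptions, not the `llm_score` function from last week: the helper names are hypothetical, and the example pairs and their scores are illustrative placeholders that you should replace with your own. The parser takes the first integer in the reply and clamps it to 0-100, so it relies on the model answering with a number only.

```python
import re

RUBRIC = (
    "Rate the semantic similarity of Text A and Text B from 0 to 100.\n"
    "100 = identical meaning; 80-99 = strong paraphrase; "
    "50-79 = somewhat related; 1-49 = barely related; 0 = no similarity.\n"
    "Consider word meaning, order, and structure. Synonyms count as matches. "
    "Do not reward scrambled words unless the meaning is the same.\n"
    "Reply with the integer score only."
)

# Illustrative placeholder pairs and scores (assumptions, not course data).
FEW_SHOT = [
    ("The cat sat on the mat.", "A cat was sitting on the mat.", 95),
    ("The dog chased the cat.", "The cat chased the dog.", 40),
]


def build_score_prompt(text_a, text_b, examples=FEW_SHOT):
    """Combine the rubric, the worked examples, and the new pair into one prompt."""
    shots = "\n\n".join(
        f"Text A: {a}\nText B: {b}\nScore: {s}" for a, b, s in examples
    )
    return f"{RUBRIC}\n\n{shots}\n\nText A: {text_a}\nText B: {text_b}\nScore:"


def parse_score(response):
    """Return the first integer in the reply, clamped to 0-100, or None if absent."""
    match = re.search(r"\d+", response)
    if match is None:
        return None
    return max(0, min(100, int(match.group())))


prompt = build_score_prompt("He is happy.", "He feels glad.")
print(prompt)
print(parse_score("Score: 85"))
```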

Demonstrate your `llm_score` function by applying it to the 7 sentence pairs from the lesson.  Comment on the performance of the scoring.  Does it still get fooled by the sixth and seventh pairs like BERTScore did?


## Task 4 - Evaluate a Pre-trained Model and LLM_summarizer (10 points)

For this task you're going to qualitatively and quantitatively compare the generated summaries from:
1. The already fine-tuned Hugging Face model - ['philschmid/flan-t5-base-samsum'](https://huggingface.co/philschmid/flan-t5-base-samsum)
2. The zero-shot or few-shot LLM summarizer from above.

If, for some reason, you can't get the specified Hugging Face model to work, then find a different Hugging Face summarization model that has already been fine-tuned on SAMsum.

First, qualitatively compare the first three generated summaries from each approach to the ground-truth summaries.  Explain how the two approaches seem to be working on the three examples.

Second, compute ROUGE scores, BERTScore, and llm_score for the first 100 examples in the validation set. 
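
To build intuition for what ROUGE measures, here is a simplified unigram-overlap F1 in plain Python. It is an illustration only: it splits on whitespace, does no stemming, and covers only a ROUGE-1 style score, so its numbers will not match a real ROUGE implementation. Use the tools from the lesson for the scores you report. The function names and the example sentences are assumptions made for this sketch.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap on lowercased whitespace tokens."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # shared tokens, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def mean_score(references, candidates):
    """Average the simplified score over paired reference/candidate summaries."""
    scores = [rouge1_f1(r, c) for r, c in zip(references, candidates)]
    return sum(scores) / len(scores) if scores else 0.0


score = rouge1_f1(
    "olivia and oliver are voting for liberals",
    "olivia and oliver vote for liberals",
)
print(round(score, 3))  # 0.769
```

In this example five of the candidate's six words appear in the seven-word reference, so precision is 5/6, recall is 5/7, and F1 is 10/13.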

What do these scores suggest about the performance of the two approaches?  Is one approach clearly better than the other?  Is llm_score working well as a metric?  Does it agree with the other metrics?

## Task 5 - Comparison and Reflection (4 points)

* Give a brief summary of what you learned in this assignment.

* What did you find most difficult to understand?

### Export Notebook to HTML for Canvas Upload

Uncomment the two lines below and run the cell to export the current notebook to HTML.

In [None]:
# from introdl import export_this_to_html
# export_this_to_html()