# LLM-Based Summarization Lab

**Objective**: Explore LLM summarization using local open-source models (FLAN-T5 in this lab; LLaMA/Mistral-class models as larger alternatives)

**Topics Covered**:
1. Loading local LLMs via Hugging Face
2. Prompt engineering (zero-shot vs few-shot)
3. Decoding parameter experiments (temperature, top-p, repetition penalty)
4. Handling long documents (chunking strategy)

**Duration**: 60-90 minutes

**Prerequisites**: Hugging Face Transformers, PyTorch, 8GB+ GPU (or CPU with patience)

---
## Part 1: Setup and Model Loading

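We'll use a small open-source model that can run locally:
- **Model**: FLAN-T5-small (used throughout this lab) or distilgpt2 (for low-resource environments)
- **Alternative**: mistral-7b-instruct (if you have a 16GB+ GPU)

FLAN-T5 is an encoder-decoder model and loads with `AutoModelForSeq2SeqLM`; distilgpt2 and Mistral are decoder-only models and load with `AutoModelForCausalLM`.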

In [4]:
# Install required packages (if needed)
!pip install transformers torch sentencepiece accelerate


In [5]:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForCausalLM
import warnings
warnings.filterwarnings('ignore')

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

PyTorch version: 2.5.1+cpu
CUDA available: False


In [8]:
# Load FLAN-T5-small (good for summarization, runs on CPU)
model_name = "google/flan-t5-small"  # 80M parameters, fast
# Alternative: "google/flan-t5-base" (250M parameters, better quality)

print(f"Loading {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
print(f"Model loaded on {device}")

Loading google/flan-t5-small...


Model loaded on cpu


---
## Part 2: Prompt Engineering Experiments

Compare **zero-shot** vs **few-shot** prompting on the same article.
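One way to keep the two prompt styles comparable is to build both from the same template. The sketch below is a hypothetical helper (`build_prompt` and its arguments are not part of the lab code, and the example article/summary pair is made up for illustration) that prepends zero or more worked examples to a shared instruction.

```python
def build_prompt(instruction, examples=()):
    """Return a prompt prefix.

    With no examples this is a zero-shot prompt (instruction only).
    With (article, summary) pairs it becomes a few-shot prompt.
    """
    if not examples:
        return instruction
    shots = [f"Article: {article}\nSummary: {summary}" for article, summary in examples]
    return "\n\n".join(shots) + "\n\n" + instruction


instruction = "Summarize this article in 3 sentences"
zero_shot = build_prompt(instruction)
few_shot = build_prompt(
    instruction,
    examples=[  # made-up example pair, for illustration only
        ("Stocks rose 2% as tech firms beat earnings forecasts.",
         "Markets gained on tech earnings. Major indexes rose 2%. Confidence improved."),
    ],
)
print(few_shot)
```

Because both prompts end with the same instruction, any difference in output is more plausibly due to the examples than to a change in wording.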

In [10]:
# Sample article to summarize
article = """The Federal Reserve raised interest rates by 0.25 percentage points on Wednesday, 
marking the tenth consecutive increase in borrowing costs as the central bank continues its 
battle against persistent inflation. The decision brings the benchmark federal funds rate to 
a range of 5.00% to 5.25%, the highest level in 16 years. Fed Chair Jerome Powell indicated 
that officials would carefully monitor economic data before deciding on future rate changes, 
suggesting a possible pause in rate hikes. Inflation has shown signs of cooling, with the 
consumer price index rising 4.9% year-over-year in April, down from 5.0% in March. However, 
core inflation, which excludes volatile food and energy prices, remains elevated at 5.5%. 
The Fed's preferred measure, the personal consumption expenditures price index, rose 4.2% 
in March. Financial markets reacted positively to the Fed's statement, with major stock 
indexes gaining ground. Economists are divided on whether the Fed will pause or implement 
one more rate increase at its next meeting in June."""

print(f"Article length: {len(article)} characters, {len(article.split())} words")

Article length: 1054 characters, 160 words


In [13]:
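def summarize(text, prompt_prefix="", max_length=100, temperature=1.0, top_p=1.0, repetition_penalty=1.0):
    """Generate a summary; samples when temperature > 0, otherwise decodes greedily"""
    # For FLAN-T5, format the input as an instruction followed by the text
    full_input = f"{prompt_prefix}\n\n{text}" if prompt_prefix else f"Summarize: {text}"
    
    # Inputs longer than 512 tokens are silently truncated here
    inputs = tokenizer(full_input, return_tensors="pt", max_length=512, truncation=True).to(device)
    
    gen_kwargs = dict(max_length=max_length, repetition_penalty=repetition_penalty, num_return_sequences=1)
    if temperature > 0:
        # temperature and top_p only take effect when sampling is enabled
        gen_kwargs.update(do_sample=True, temperature=temperature, top_p=top_p)
    else:
        # temperature=0 means greedy decoding; do not pass sampling parameters
        gen_kwargs.update(do_sample=False)
    
    outputs = model.generate(**inputs, **gen_kwargs)
    
    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return summary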

In [15]:
# Experiment 1: Zero-shot (simple instruction)
print("=== ZERO-SHOT PROMPT ===")
zero_shot_prompt = "Summarize this article in 3 sentences"
print(f"Prompt: {zero_shot_prompt}\n")

summary_zero = summarize(article, zero_shot_prompt)
print(f"Summary: {summary_zero}\n")
print(f"Length: {len(summary_zero.split())} words")

=== ZERO-SHOT PROMPT ===
Prompt: Summarize this article in 3 sentences

Summary: Federal Reserve Chair Jerome Powell says the financial markets remained happy on Tuesday, predicting the next week that the rate would lower for a year.

Length: 25 words


In [17]:
# Experiment 2: Few-shot (here a single worked example, i.e. one-shot)
print("=== FEW-SHOT PROMPT ===")

few_shot_prompt = """You are a financial news summarizer. Here's an example:

Article: The stock market rose 2% today as tech companies reported strong earnings...
Summary: Markets gained on tech earnings. Major indexes up 2%. Investor confidence increased.

Now summarize this article in the same style (3 short sentences, focus on facts):"""

print(f"Prompt (truncated): {few_shot_prompt[:100]}...\n")

summary_few = summarize(article, few_shot_prompt)
print(f"Summary: {summary_few}\n")
print(f"Length: {len(summary_few.split())} words")

=== FEW-SHOT PROMPT ===
Prompt (truncated): You are a financial news summarizer. Here's an example:

Article: The stock market rose 2% today as ...

Summary: Federal Reserve raises interest rates by 0.25 percentage points on Wednesday, a pause that continues to be a trend between the central bank and the central bank.

Length: 27 words


In [18]:
# Comparison
print("=== COMPARISON ===")
print(f"Zero-shot: {summary_zero}")
print(f"\nFew-shot: {summary_few}")
print(f"\nObservation: Which is more consistent with desired format?")

=== COMPARISON ===
Zero-shot: Federal Reserve Chair Jerome Powell says the financial markets remained happy on Tuesday, predicting the next week that the rate would lower for a year.

Few-shot: Federal Reserve raises interest rates by 0.25 percentage points on Wednesday, a pause that continues to be a trend between the central bank and the central bank.

Observation: Which is more consistent with desired format?
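The format question can be checked mechanically for the sentence-count constraint. This is a minimal sketch, assuming a naive splitter is good enough: it treats `.`, `!` or `?` as a sentence end only when followed by whitespace or the end of the text, so decimals such as `0.25` are not split. `count_sentences` is a hypothetical helper, and the two strings are the summaries recorded above.

```python
import re


def count_sentences(text):
    """Count sentences ending in ., ! or ? followed by whitespace or end of text."""
    parts = re.split(r"(?<=[.!?])(?:\s+|$)", text.strip())
    return len([part for part in parts if part])


summary_zero = ("Federal Reserve Chair Jerome Powell says the financial markets remained "
                "happy on Tuesday, predicting the next week that the rate would lower for a year.")
summary_few = ("Federal Reserve raises interest rates by 0.25 percentage points on Wednesday, "
               "a pause that continues to be a trend between the central bank and the central bank.")

print(count_sentences(summary_zero))  # 1
print(count_sentences(summary_few))   # 1
```

Both recorded summaries are a single sentence, so neither prompt met the three-sentence request in these runs.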


---
## Part 3: Decoding Parameter Experiments

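Test how **temperature**, **top-p**, and **repetition penalty** affect output quality. Every run below samples, so a single output per setting is anecdotal; rerun each cell a few times before drawing conclusions.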

In [22]:
# Experiment 3A: Temperature variations
print("=== TEMPERATURE EXPERIMENTS ===")
base_prompt = "Summarize this financial news in 2 sentences"

temperatures = [0.3, 0.7, 1.0]
for temp in temperatures:
    print(f"\n--- Temperature = {temp} ---")
    summary = summarize(article, base_prompt, temperature=temp, max_length=60)
    print(summary)

=== TEMPERATURE EXPERIMENTS ===

--- Temperature = 0.3 ---
Federal Reserve chiefs have raised interest rates to a range of 5.00% to 5.25%, the highest level in 16 years.

--- Temperature = 0.7 ---
Federal Reserve President Mark Zuckerberg told the Wall Street Journal the Federal Reserve remained calm in the wake of the flurry of interest rates.

--- Temperature = 1.0 ---
Federal Reserve chair Jerome Powell said the US rate had been lowered, a move which highlights ongoing uncertainty as the central bank faces interest rates.
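Temperature divides the model's token scores (logits) before they are turned into probabilities. The sketch below uses made-up logits for three candidate tokens (not values from FLAN-T5) to show the effect; `softmax_with_temperature` is an illustrative helper, not a Transformers function.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use greedy decoding instead of 0")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(score - peak) for score in scaled]
    total = sum(exps)
    return [value / total for value in exps]


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.3, 0.7, 1.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature drops, more of the probability mass moves to the top-scoring token, which is why low-temperature samples tend to stay closer to the model's most likely continuation.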


In [24]:
# Experiment 3B: Top-p (nucleus) variations
print("=== TOP-P (NUCLEUS) EXPERIMENTS ===")

top_ps = [0.8, 0.9, 0.95]
for p in top_ps:
    print(f"\n--- Top-p = {p} ---")
    summary = summarize(article, base_prompt, temperature=0.7, top_p=p, max_length=60)
    print(summary)

=== TOP-P (NUCLEUS) EXPERIMENTS ===

--- Top-p = 0.8 ---
Federal Reserve officials have raised interest rates by 0.25 percentage points in a bid to cut interest rates, despite a decline in inflation.

--- Top-p = 0.9 ---
Federal Reserve officials say they will monitor data on a possible rate hike to keep inflation lower.

--- Top-p = 0.95 ---
Federal Reserve Chairman Jerome Powell said he would monitor the current rate growth rate and make changes to interest rates in a way that would ensure a pause in rate hikes.
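Top-p (nucleus) sampling restricts each sampling step to the smallest set of highest-probability tokens whose cumulative probability reaches `top_p`. The sketch below illustrates that idea on a made-up four-token distribution; `top_p_filter` is a hypothetical helper, and the library's own implementation may differ in details such as tie handling.

```python
def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative mass reaches top_p, then renormalize.

    Returns a dict mapping kept token index -> renormalized probability.
    """
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {index: p / total for index, p in kept}


toy_probs = [0.55, 0.27, 0.11, 0.07]  # made-up next-token distribution
for p in (0.8, 0.9, 0.95):
    print(p, sorted(top_p_filter(toy_probs, p)))
```

With these numbers, `top_p=0.8` keeps two tokens, `0.9` keeps three, and `0.95` keeps all four, so raising top-p widens the pool of tokens the sampler can pick from.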


In [26]:
# Experiment 3C: Repetition penalty
print("=== REPETITION PENALTY EXPERIMENTS ===")

penalties = [1.0, 1.2, 1.5]
for penalty in penalties:
    print(f"\n--- Repetition Penalty = {penalty} ---")
    summary = summarize(article, base_prompt, temperature=0.7, repetition_penalty=penalty, max_length=60)
    print(summary)

=== REPETITION PENALTY EXPERIMENTS ===

--- Repetition Penalty = 1.0 ---
Federal Reserve Chairman Jerome Powell has warned that inflation may impede growth, but does not raise interest rates.

--- Repetition Penalty = 1.2 ---
Federal Reserve Chairman Jerome Powell has said he expects to raise interest rates by 0.25 per cent to 1.5%.

--- Repetition Penalty = 1.5 ---
Federal Reserve chiefs have announced a drop in interest rates.
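A repetition penalty lowers the scores of tokens that have already been generated. The sketch below shows the rule introduced with the CTRL model, which `repetition_penalty` in Transformers is generally understood to follow: positive scores are divided by the penalty and negative scores are multiplied by it. The logits and token ids are made up, and `apply_repetition_penalty` is an illustrative helper.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Return a copy of logits with already-generated token ids made less likely."""
    adjusted = list(logits)
    for token_id in set(generated_ids):  # each seen token is penalized once
        score = adjusted[token_id]
        # Positive scores shrink toward zero; negative scores move further from zero.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


toy_logits = [3.0, 1.5, -0.5, 0.2]  # made-up scores for a 4-token vocabulary
already_generated = [0, 2, 0]        # token 0 appeared twice, token 2 once
print(apply_repetition_penalty(toy_logits, already_generated, 1.5))
# [2.0, 1.5, -0.75, 0.2]
```

A penalty of 1.0 leaves the scores unchanged, which is why it serves as the baseline in the experiment above.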


---
## Part 4: Long Document Handling (Chunking)

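The `summarize` helper truncates its input at 512 tokens, so anything past that point is silently dropped. For longer documents, split the text into chunks, summarize each chunk, then merge the chunk summaries.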
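Before chunking, it helps to work out how many chunks a given size and overlap will produce. With overlapping windows, a new chunk starts every `chunk_size - overlap` words. The sketch below computes the start offsets; `chunk_starts` is a hypothetical helper, and 800 words corresponds to five copies of the 160-word sample article.

```python
import math


def chunk_starts(n_words, chunk_size, overlap):
    """Start offsets of overlapping word windows that together cover n_words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if n_words <= 0:
        return []
    step = chunk_size - overlap
    starts = []
    start = 0
    while True:
        starts.append(start)
        if start + chunk_size >= n_words:
            break  # this window already reaches the end of the text
        start += step
    return starts


starts = chunk_starts(800, chunk_size=300, overlap=50)
print(starts)  # [0, 250, 500]

# Closed form for the same count: ceil((n - overlap) / (chunk_size - overlap))
print(math.ceil((800 - 50) / (300 - 50)))  # 3
```

A larger overlap preserves more context across chunk boundaries but increases the number of chunks, and with it the number of model calls.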

In [None]:
# Simulate long document (concatenate article multiple times)
long_document = (article + "\n\n") * 5  # ~5x original length
print(f"Long document: {len(long_document)} characters, {len(long_document.split())} words")
print(f"\nThis exceeds typical context windows for smaller models.")

In [None]:
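def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into overlapping chunks of at most chunk_size words"""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # a new chunk starts every `step` words
    
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + chunk_size]))
        if i + chunk_size >= len(words):
            break  # reached the end; a further chunk would only repeat the overlap
    
    return chunks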

# Chunk the long document
chunks = chunk_text(long_document, chunk_size=300, overlap=50)
print(f"Split into {len(chunks)} chunks")
for i, chunk in enumerate(chunks[:3]):  # Show first 3
    print(f"\nChunk {i+1}: {len(chunk.split())} words")
    print(chunk[:150] + "...")

In [None]:
# Summarize each chunk
print("=== SUMMARIZING EACH CHUNK ===")
chunk_summaries = []

for i, chunk in enumerate(chunks):
    print(f"\nProcessing chunk {i+1}/{len(chunks)}...")
    summary = summarize(chunk, "Summarize briefly", max_length=50, temperature=0.3)
    chunk_summaries.append(summary)
    print(f"Summary: {summary}")

In [None]:
# Merge chunk summaries into final summary
print("\n=== MERGING CHUNK SUMMARIES ===")
combined_summaries = "\n\n".join(chunk_summaries)
print(f"Combined summaries ({len(combined_summaries.split())} words):\n")
print(combined_summaries)

print("\n=== FINAL SUMMARY (summarizing the summaries) ===")
final_summary = summarize(combined_summaries, "Combine these summaries into one coherent summary", 
                          max_length=80, temperature=0.3)
print(final_summary)

---
## Summary and Key Takeaways

In this lab, you explored:

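1. **Model Loading**: Used FLAN-T5-small, a local open-source model that runs on CPU
2. **Prompt Engineering**: 
   - Zero-shot: A plain instruction works, but the output varies from run to run
   - Few-shot: Examples are meant to steer format and style; in the runs above, both prompts still returned one sentence instead of the three requested
3. **Decoding Parameters**:
   - Temperature: Lower values (0.3) concentrate sampling on the most likely tokens; higher values (0.7 to 1.0) add variety and, with a model this small, more factual errors (for example, "Federal Reserve President Mark Zuckerberg")
   - Top-p: Samples only from the smallest set of tokens whose cumulative probability reaches p; 0.9 is a common starting point
   - Repetition penalty: Down-weights tokens that were already generated; 1.2 is a common starting point
4. **Long Documents**: Chunking + merging strategy handles context limits

Several of the sampled summaries contradict the article (rates "lowered", a rate "drop"), so always check generated summaries against the source text.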

### Next Steps

- Try larger models (FLAN-T5-base, mistral-7b) if you have GPU
- Experiment with different prompt templates
- Test on real documents from different domains (news, scientific, legal)
- Implement map-reduce or hierarchical summarization strategies
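
As a starting point for the last item, here is a minimal sketch of hierarchical (map-reduce style) summarization. It assumes a `summarize_fn` callable that maps text to a shorter text; `hierarchical_summarize`, `group_size`, and the `first_sentence` stand-in are hypothetical names used so the example runs without a model.

```python
def hierarchical_summarize(chunks, summarize_fn, group_size=3):
    """Summarize each chunk, then summarize groups of summaries until one remains."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so each round shrinks the list")
    if not chunks:
        return ""
    summaries = [summarize_fn(chunk) for chunk in chunks]  # map step
    while len(summaries) > 1:                               # reduce step(s)
        groups = [summaries[i:i + group_size]
                  for i in range(0, len(summaries), group_size)]
        summaries = [summarize_fn("\n\n".join(group)) for group in groups]
    return summaries[0]


def first_sentence(text):
    """Stand-in summarizer for testing: keep the text up to the first period."""
    return text.split(".")[0].strip() + "."


result = hierarchical_summarize(
    ["Rates rose. Markets gained.", "Inflation cooled. Core stayed high.",
     "Powell spoke. A pause is possible.", "Economists disagree. June is next."],
    first_sentence,
)
print(result)  # Rates rose.
```

In the lab, `summarize_fn` could be something like `lambda t: summarize(t, "Summarize briefly", max_length=50, temperature=0.3)`. Grouping keeps each reduce input short, which matters when the combined chunk summaries would themselves exceed the model's input limit.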