# **Leveraging LLMs for Text Generation and Summarization**


# **Table of Contents**

1. [Architectural Overview of LLMs](#architectural-overview-of-llms)
2. [Categories of LLMs](#categories-of-llms)
3. [Advanced Text Generation Techniques](#advanced-text-generation-techniques)
4. [Parameter Tuning for Different Needs](#parameter-tuning-for-different-needs)
5. [Temperature tuning experiment](#temperature-tuning-experiment)
6. [Structured output generation](#structured-output-generation)
7. [Extractive Summarization](#extractive-summarization)
8. [Understanding TF-IDF for Extractive Summarization](#understanding-tf-idf-for-extractive-summarization)
9. [Evaluation Metrics for Summarization](#evaluation-metrics-for-summarization)
10. [Abstractive Summarization & Control Parameters](#abstractive-summarization--control-parameters)
11. [Key Models for Abstractive Summarization](#key-models-for-abstractive-summarization)
12. [Controllable Summarization](#controllable-summarization)
13. [Building a Multi-Stage Summarizer](#building-a-multi-stage-summarization-pipeline)
14. [Multimodal Summarization](#multimodal-summarization)
15. [Practical Exercise: Building Your Custom Summarization System](#practical-exercise-building-your-custom-summarization-system)

# PART I

## Learning Objectives for Part 1:
- Understand why LLMs revolutionized summarization
- Master in-context learning techniques (zero-shot, few-shot)
- Compare foundation models vs specialized approaches through hands-on experiments
- Build practical skills for immediate application

# **Architectural Overview of LLMs**

## The Transformer Architecture

The transformer architecture revolutionized NLP when it was introduced in the paper "Attention Is All You Need" (Vaswani et al., 2017).

<!-- Transformer Architecture -->
<div align="center">
<img src='images/transformer architecture.png' alt="Transformer architecture" width=1000 height=700>
</div>


### Key Components:
- **Self-Attention Mechanism**: Allows the model to weigh the importance of different words in context
- **Multi-Head Attention**: Parallel attention mechanisms capturing different relationships
- **Positional Encoding**: Helps the model understand word order
- **Feed-Forward Networks**: Process the representations from attention layers
- **Layer Normalization**: Stabilizes training
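
To make the self-attention component more concrete, here is a minimal sketch of scaled dot-product attention in plain Python. The function names and the toy vectors are illustrative assumptions. Real transformer layers also apply learned query/key/value projections, use several heads, and add masking, none of which is shown here.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Return one output vector per query: a weighted average of the value vectors."""
    d_k = len(keys[0])
    dim = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Blend the value vectors using the attention weights
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)])
    return outputs

# Toy example: three tokens with 2-dimensional embeddings (made-up numbers).
# In self-attention, queries, keys, and values all come from the same tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = scaled_dot_product_attention(tokens, tokens, tokens)
for row in attended:
    print([round(x, 3) for x in row])
```

Each output row mixes information from all three tokens, which is the "weigh the importance of different words in context" behavior described in the list above.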


### Why Transformers Excel at Text Tasks
1. **Parallel Processing**: Unlike RNNs, can process entire sequences simultaneously
2. **Long-Range Dependencies**: Attention mechanism captures distant relationships
3. **Context Awareness**: Each token attends to all other tokens in the sequence
4. **Scalability**: Architecture scales well with data and compute

# Categories of LLMs

LLMs come in different architectural variants, each with strengths for different tasks:

## 1. Decoder-Only Models (Autoregressive)
- Examples: GPT series, LLaMA, Claude
- Trained to predict the next token
- **Strengths for summarization**: Creative text generation, coherent narrative
- **Weaknesses**: May hallucinate or add information not in source

## 2. Encoder-Only Models
- Examples: BERT, RoBERTa
- Trained on masked language modeling
- **Strengths for summarization**: Understanding document context, good for extractive summarization
- **Weaknesses**: Not designed for generation

## 3. Encoder-Decoder Models
- Examples: T5, BART
- Trained on sequence-to-sequence tasks
- **Strengths for summarization**: Balanced understanding and generation, ideal for abstractive summarization
- **Weaknesses**: Larger compute requirements

### Which architecture is best for summarization?
It depends on the task! We'll explore the tradeoffs throughout this tutorial.

# Advanced Text Generation Techniques

### The Anatomy of Effective Prompts

<div align="center">
<img src="images/co-star.png" width=700 height=500>
</div>



| Element       | Description                              |
| ------------- | ---------------------------------------- |
| **C**ontext   | Provide background information           |
| **O**bjective | State the goal of the task               |
| **S**tyle     | Specify tone, format, or constraints     |
| **T**ask      | What the model should actually do        |
| **A**udience  | Who the output is intended for           |
| **R**esponse  | Clarify what the output should look like |

📌 **Prompt Example:**

*Context*: You are a career advisor writing content for a university’s job preparation website. Many students are unsure how to describe their achievements on resumes, particularly in action-result format.

*Objective*: Help students craft clear and impressive resume bullet points for internships in data science.

*Style*: Keep the tone professional and concise. Use strong action verbs and quantify results wherever possible. Avoid first-person language.

*Task*: Based on the details provided, generate 3 resume bullet points that follow best practices in resume writing.

*Audience*: Undergraduate students applying for internships in data science roles.

*Response*: Your output should be a bulleted list of exactly 3 resume-ready statements.
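
One way to keep prompts like this consistent is to assemble them from their six elements in code. The sketch below is only an illustration: `build_costar_prompt` is a hypothetical helper, and the `Label: text` layout is an assumption rather than a required format.

```python
def build_costar_prompt(context, objective, style, task, audience, response):
    """Assemble the six elements from the table above into one labeled prompt."""
    sections = [
        ("Context", context),
        ("Objective", objective),
        ("Style", style),
        ("Task", task),
        ("Audience", audience),
        ("Response", response),
    ]
    # One labeled paragraph per element, separated by blank lines
    return "\n\n".join(f"{label}: {text.strip()}" for label, text in sections)

resume_prompt = build_costar_prompt(
    context="You are a career advisor writing content for a university's job preparation website.",
    objective="Help students craft clear and impressive resume bullet points for internships in data science.",
    style="Keep the tone professional and concise. Avoid first-person language.",
    task="Generate 3 resume bullet points that follow best practices in resume writing.",
    audience="Undergraduate students applying for internships in data science roles.",
    response="A bulleted list of exactly 3 resume-ready statements.",
)
print(resume_prompt)
```

The resulting string can be sent to a model as a single user message.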


# Parameter Tuning for Different Needs

Understanding how generation parameters affect output quality:

- **Temperature (0.0-2.0)**: Controls randomness
  - 0.0-0.3: Deterministic, factual content
  - 0.4-0.7: Balanced creativity and coherence  
  - 0.8-1.2: Creative, varied output
  - 1.3+: Highly creative but potentially incoherent

- **Top-p (0.0-1.0)**: Nucleus sampling
  - Samples from the smallest set of most probable tokens whose cumulative probability reaches the threshold p
  - Lower values: More focused, consistent
  - Higher values: More diverse vocabulary

- **Top-k**: Limits vocabulary to k most likely tokens and samples from it
  - Lower values: More predictable
  - Higher values: More creative word choices

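- **Beam Search**: Keeps the k highest-scoring partial sequences (beams) at each step instead of committing to a single one
  - A sequence's score is the cumulative sum of the log probabilities of its tokens.

<img src='images/beam-search.jpg' width=900 height=431>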
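
The parameters above can be sketched on a toy next-token distribution. The vocabulary and logits below are made-up numbers, and the helper names are assumptions for illustration; production libraries implement the same ideas on tensors over the full vocabulary.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Treat 0 as greedy decoding: all probability on the top-scoring token
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def sequence_score(token_probs):
    """Beam-search style score: sum of the tokens' log probabilities."""
    return sum(math.log(p) for p in token_probs)

vocab = ["the", "a", "time", "quantum", "banana"]
logits = [2.0, 1.5, 1.0, 0.2, -1.0]  # made-up scores for the next token

for t in (0.2, 1.0, 1.5):
    print(f"T={t}:", [round(x, 3) for x in apply_temperature(logits, t)])

probs = apply_temperature(logits, 1.0)
print("top-k=2:", [round(x, 3) for x in top_k_filter(probs, 2)])
print("top-p=0.8:", [round(x, 3) for x in top_p_filter(probs, 0.8)])

rng = random.Random(0)  # fixed seed so the draw is repeatable
print("sampled:", rng.choices(vocab, weights=top_p_filter(probs, 0.8), k=1)[0])
print("beam score:", round(sequence_score([0.5, 0.4, 0.9]), 3))
```

Printing the distributions side by side shows the pattern from the list: a low temperature concentrates probability on the top token, while top-k and top-p remove the unlikely tail before sampling.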

# Temperature tuning experiment

In [None]:
# Set up the environment
import os
import json
from typing import List

from openai import OpenAI


def is_colab():
    """Return True when the notebook runs inside Google Colab."""
    try:
        import google.colab  # noqa: F401
        return True
    except ImportError:
        return False


if is_colab():
    # Colab stores secrets in `userdata`; export the key so os.getenv() finds it later
    from google.colab import userdata
    os.environ["OPENROUTER_API_KEY"] = userdata.get('OPENROUTER_API_KEY')
else:
    # Locally, read OPENROUTER_API_KEY from a .env file
    from dotenv import load_dotenv
    load_dotenv()

In [22]:
class LLMClient:

    def __init__(self, model_name="google/gemini-2.0-flash-lite-001"):
        self.model = model_name
        self.api_key = os.getenv("OPENROUTER_API_KEY")
        self.client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key=self.api_key)

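    def generate(self, text, temperature=0.8, max_tokens=2000, tools=None):
        """Generate text using autoregressive model.

        Returns the message text, or the JSON argument string of the first
        tool call when `tools` is given and the model decides to call one.
        """
        request = {
            "model": self.model,
            "messages": [{"role": "user", "content": text}],
            "temperature": temperature,
            "max_tokens": max_tokens,
        }
        if tools:
            # Send tool settings only when tools are supplied;
            # tool_choice without tools can be rejected by the API
            request["tools"] = tools
            request["tool_choice"] = "auto"
        try:
            response = self.client.chat.completions.create(**request)
            message = response.choices[0].message
            if message.tool_calls:
                return message.tool_calls[0].function.arguments
            return message.content
        except Exception as e:
            return f"Error: {str(e)}"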
        
    def zero_shot_summarize(self, text, instruction="Summarize the following text:"):
        """Zero-shot summarization with customizable instructions"""
        prompt = f"{instruction}\n\n{text}\n\nSummary:"
        return self.generate(prompt)
    
    def few_shot_summarize(self, text, examples):
        """Few-shot summarization with examples"""
        prompt = "Here are examples of good summaries:\n\n"
        
        for i, example in enumerate(examples, 1):
            prompt += f"Example {i}:\n"
            prompt += f"Text: {example['text']}\n"
            prompt += f"Summary: {example['summary']}\n\n"
        
        prompt += f"Now summarize this text:\n{text}\n\nSummary:"
        return self.generate(prompt)

In [None]:
def compare_parameters(prompt: str, temperatures: List[float], client: LLMClient):

    results = []
    for temperature in temperatures:
        result = client.generate(prompt, temperature=temperature)
        results.append(
            {
                "temperature": temperature,
                "response": result,
                "length": len(result.split())
            }
        )   
    return results


test_prompt = "Write a creative opening sentence for a science fiction story about time travel."
temperatures = [0.1, 0.5, 0.9, 1.2]

results = compare_parameters(test_prompt, temperatures, LLMClient())
for result in results:
    print(f"Temperature: {result['temperature']}")
    print(f"Response: {result['response']}")
    print(f"Word count: {result['length']}")
    print("-" * 40)


# Structured output generation

Getting LLMs to produce consistent, parseable output formats is crucial for integration with downstream systems.

### Prompt-Based Structured Generation

In [None]:
schema = {
    "name": "string",
    "age": "integer",
    "profession": "string",
    "language": "string",
    "origin": "string"
}

character_bio = """
"Dr. Amara Liu is a 35-year-old astrophysicist from Shanghai. She researches dark matter and enjoys stargazing and sketching. Fluent in Mandarin and English."
"""

prompt = (
    f"{character_bio}\n\n"
    "Extract the following structured data as JSON:\n"
    f"{json.dumps(schema, indent=2)}\n\n"
    "Respond ONLY with valid JSON that matches the schema above."
)
llm_model = LLMClient()
response = llm_model.generate(prompt)

# Try to extract JSON
try:
    structured_data = json.loads(response)
except json.JSONDecodeError:
    import re
    match = re.search(r'\{.*\}', response, re.DOTALL)
    structured_data = json.loads(match.group()) if match else {"error": "invalid format"}

print(structured_data)
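
Parsing the response as JSON only shows that it is well-formed, not that it matches the schema. A minimal sketch of a follow-up check is shown below. `validate_against_schema` and `TYPE_MAP` are hypothetical names, and the mapping from the schema's type labels to Python types is an assumption. For stricter guarantees, a dedicated validation library would be a better fit.

```python
import json

# Assumed mapping from the schema's type labels to Python types
TYPE_MAP = {"string": str, "integer": int}

def validate_against_schema(data, schema):
    """Return a list of problems; an empty list means the data matches the schema."""
    problems = []
    for field, type_name in schema.items():
        if field not in data:
            problems.append(f"missing field: {field}")
            continue
        expected = TYPE_MAP[type_name]
        value = data[field]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"{field}: expected {type_name}, got {type(value).__name__}")
    for field in data:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems

example_schema = {
    "name": "string",
    "age": "integer",
    "profession": "string",
    "language": "string",
    "origin": "string",
}

# A well-formed reply and a faulty one (age as text, three fields missing)
good_reply = json.loads(
    '{"name": "Dr. Amara Liu", "age": 35, "profession": "astrophysicist", '
    '"language": "Mandarin and English", "origin": "Shanghai"}'
)
bad_reply = {"name": "Dr. Amara Liu", "age": "35"}

print(validate_against_schema(good_reply, example_schema))
print(validate_against_schema(bad_reply, example_schema))
```

When the list of problems is not empty, a common next step is to send the problems back to the model and ask for a corrected reply.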


### Schema-Aware Structured Output

In [None]:
# Structure Output via function calling
tools = [
    {
        "type": "function",
        "function": {
            "name": "extract_profile",
            "description": "Extracts a person's profile.",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                    "profession": {"type": "string"},
                    "language": {"type": "string"},
                    "origin": {"type": "string"}
                },
                "required": ["name", "age", "profession", "language", "origin"]
            }
        }
    }
]

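prompt = f"What is the name, age, profession, language, and origin of the person in the following text:\n {character_bio}"

# Call the client directly to access the full response object, including tool calls
response = llm_model.client.chat.completions.create(
    model=llm_model.model,
    messages=[{"role": "user", "content": prompt}],
    tools=tools,
    tool_choice="auto",
)
tool_calls = response.choices[0].message.tool_calls
if tool_calls:
    # The arguments arrive as a JSON string that follows the declared parameters
    print(json.loads(tool_calls[0].function.arguments))
else:
    # The model answered in plain text instead of calling the function
    print(response.choices[0].message.content)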


## The LLM Revolution in Summarization

### Quick Context: What Changed?

**Pre-2020**: Building a summarization system required:
- Manual feature engineering (TF-IDF, sentence scoring)
- Task-specific model training
- Domain-specific datasets
- Complex evaluation pipelines

**Post-2020**: A simple prompt to GPT-3:
```
"Summarize this article in 3 sentences: [article]"
```
Often outperforms specialized models trained for months.

**Research Finding**: Human evaluators prefer GPT-3 summaries over specialized models 77% of the time, despite GPT-3 never being explicitly trained on summarization data.

<img src="images/text summarization.jpg" width=1000>

# Extractive Summarization

Extractive summarization selects the most important sentences from the original text to create the summary.

## The Basic Process:
1. Score sentences based on importance
2. Select the top-scoring sentences using a ranking method (TF-IDF, SVM, etc.)
3. Arrange in coherent order (usually original order)

## Advantages:
- Factually accurate (uses original text)
- Computationally efficient
- Works well for objective content

## Disadvantages:
- May be disconnected or redundant
- Cannot reformulate or simplify complex content
- Limited by quality of source material

# Understanding TF-IDF for Extractive Summarization

## What is TF-IDF?

**TF-IDF (Term Frequency-Inverse Document Frequency)** is a numerical statistic that reflects how important a word is to a document within a collection of documents. It's widely used in information retrieval and text mining, making it perfect for identifying the most important sentences in a document for summarization.

## The Mathematical Foundation

### Term Frequency (TF)
**Definition**: How frequently a word appears in a specific sentence/document.

$$TF(word, sentence) = \frac{\text{Number of times word appears in sentence}}{\text{Total number of words in sentence}}$$

**Example**: In the sentence "The cat sat on the mat", the word "the" has TF = 2/6 = 0.33

### Inverse Document Frequency (IDF)
**Definition**: How rare or common a word is across all sentences in the document.

$$IDF(word) = \log_{10}\left(\frac{\text{Total number of sentences}}{\text{Number of sentences containing the word}}\right)$$

**Intuition**: 
- Common words (like "the", "and") appear in many sentences → Low IDF → Less important
- Rare words (like "quantum", "photosynthesis") appear in few sentences → High IDF → More important

### TF-IDF Score
**Final Formula**:
$$TF\text{-}IDF(word, sentence) = TF(word, sentence) \times IDF(word)$$

**Sentence Score**: Sum of TF-IDF scores for all words in the sentence
$$Score(sentence) = \sum_{word \in sentence} TF\text{-}IDF(word, sentence)$$

## Visual Example

Let's work through a simple example:

**Document**: 
- Sentence 1: "The cat sat on the mat"
- Sentence 2: "The dog ran quickly"  
- Sentence 3: "Cats and dogs are pets"

### Step 1: Calculate TF for each word in Sentence 1

| Word | Count in S1 | Total words in S1 | TF |
|------|-------------|-------------------|-----|
| the  | 2           | 6                 | 0.33|
| cat  | 1           | 6                 | 0.17|
| sat  | 1           | 6                 | 0.17|
| on   | 1           | 6                 | 0.17|
| mat  | 1           | 6                 | 0.17|

### Step 2: Calculate IDF for each word

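| Word | Sentences containing word | Total sentences | IDF (log base 10) |
|------|---------------------------|-----------------|-------------------|
| the  | 2 (S1, S2)                | 3               | log(3/2) = 0.18   |
| cat  | 1 (S1)                    | 3               | log(3/1) = 0.48   |
| sat  | 1 (S1)                    | 3               | log(3/1) = 0.48   |
| on   | 1 (S1)                    | 3               | log(3/1) = 0.48   |
| mat  | 1 (S1)                    | 3               | log(3/1) = 0.48   |

Without stemming, "cat" and "Cats" count as different words, so "cat" appears only in S1.

### Step 3: Calculate TF-IDF scores

| Word | TF   | IDF  | TF-IDF |
|------|------|------|--------|
| the  | 0.33 | 0.18 | 0.06   |
| cat  | 0.17 | 0.48 | 0.08   |
| sat  | 0.17 | 0.48 | 0.08   |
| on   | 0.17 | 0.48 | 0.08   |
| mat  | 0.17 | 0.48 | 0.08   |

**Sentence 1 Score** = 0.06 + 0.08 + 0.08 + 0.08 + 0.08 = 0.38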
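
The worked example can be reproduced in a few lines of Python. This is a minimal sketch that assumes lowercase whitespace tokenization, base-10 logarithms, and a sum over the unique words of each sentence, matching the tables above. Without rounding the intermediate values, Sentence 1 scores about 0.377, or roughly 0.38.

```python
import math

sentences = [
    "The cat sat on the mat",
    "The dog ran quickly",
    "Cats and dogs are pets",
]
# Lowercase and split on whitespace (no stemming, so "cat" and "cats" differ)
tokenized = [s.lower().split() for s in sentences]

def tf(word, tokens):
    """Share of the sentence's words that are `word`."""
    return tokens.count(word) / len(tokens)

def idf(word, all_tokens):
    """Base-10 log of (number of sentences / sentences containing `word`)."""
    containing = sum(1 for tokens in all_tokens if word in tokens)
    return math.log10(len(all_tokens) / containing)

def sentence_score(tokens, all_tokens):
    """Sum of TF-IDF over the unique words of one sentence."""
    return sum(tf(w, tokens) * idf(w, all_tokens) for w in set(tokens))

for sentence, tokens in zip(sentences, tokenized):
    print(f"{sentence_score(tokens, tokenized):.3f}  {sentence}")
```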

### Why TF-IDF Works for Summarization

### 1. **Identifies Content Words**
- Function words ("the", "and", "is") get low scores
- Content words ("algorithm", "photosynthesis", "democracy") get high scores

### 2. **Balances Frequency and Rarity**
- A word mentioned often in one sentence (high TF) but rare overall (high IDF) = very important
- A word mentioned everywhere (low IDF) = less distinctive

### 3. **Context Awareness**
- Same word gets different importance scores in different documents
- Adapts to the specific content being summarized

### Issues with TF-IDF
1. **Problem**: TF-IDF ignores where sentences appear in the document.
   - **Solution**: Add position-based scoring.

2. **Problem**: Longer sentences automatically get higher TF-IDF scores.
   - **Solution**: Normalize by optimal sentence length.

3. **Problem**: Multiple sentences might convey the same information.
   - **Solution**: Remove semantically similar sentences.

In [None]:
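import nltk
import numpy as np
from collections import Counter, defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

# Fetch the tokenizer and stop-word data on first run
# (newer NLTK releases look for 'punkt_tab' instead of 'punkt')
for resource in ("punkt", "punkt_tab", "stopwords"):
    nltk.download(resource, quiet=True)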

class ExtractiveSummarizer:
    """Enhanced extractive summarizer using proper TF-IDF and additional features"""
    
    def __init__(self):
        self.stop_words = set(stopwords.words('english'))
    
    def preprocess_text(self, text):
        """Clean and prepare text for processing"""
        # Remove extra whitespace and normalize
        text = re.sub(r'\s+', ' ', text).strip()
        return text
    
    def extract_sentences(self, text):
        """Extract and clean sentences"""
        sentences = sent_tokenize(text)
        # Filter out very short sentences (less than 5 words)
        sentences = [s for s in sentences if len(s.split()) >= 5]
        return sentences
    
    def calculate_tfidf_scores(self, sentences):
        """Calculate TF-IDF scores for sentences"""
        # Use sklearn's TfidfVectorizer for proper TF-IDF calculation
        vectorizer = TfidfVectorizer(
            stop_words='english',
            lowercase=True,
            max_features=1000,
            ngram_range=(1, 2)  # Include bigrams for better context
        )
        
        try:
            tfidf_matrix = vectorizer.fit_transform(sentences)
            # Sum TF-IDF scores for each sentence
            sentence_scores = np.array(tfidf_matrix.sum(axis=1)).flatten()
            return sentence_scores
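        except ValueError:
            print('failed to get tf-idf')
            # Fallback to simple word frequency if TF-IDF fails
            # (for example, when no terms survive stop-word removal)
            words_per_sentence = [
                [w.lower() for w in s.split() if w.lower() not in self.stop_words]
                for s in sentences
            ]
            freq = Counter(w for words in words_per_sentence for w in words)
            return np.array([float(sum(freq[w] for w in words)) for words in words_per_sentence])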

    def calculate_position_scores(self, sentences):
        """Calculate position-based scores (first and last sentences are important)"""
        num_sentences = len(sentences)
        position_scores = np.zeros(num_sentences)
        
        for i in range(num_sentences):
            if i == 0:  # First sentence
                position_scores[i] = 0.3
            elif i == num_sentences - 1:  # Last sentence
                position_scores[i] = 0.2
            elif i < num_sentences * 0.1:  # Early sentences
                position_scores[i] = 0.15
            elif i > num_sentences * 0.9:  # Late sentences
                position_scores[i] = 0.1
            else:
                position_scores[i] = 0.05
        
        return position_scores
                
    def calculate_sentence_length_scores(self, sentences):
        """Normalize scores by sentence length to avoid bias toward longer sentences"""
        length_scores = []
        for sentence in sentences:
            words = sentence.split()
            # Optimal sentence length is around 15-25 words
            if 15 <= len(words) <= 25:
                length_scores.append(1.0)
            elif 10 <= len(words) < 15 or 25 < len(words) <= 35:
                length_scores.append(0.8)
            else:
                length_scores.append(0.6)
        
        return np.array(length_scores)

    def remove_redundant_sentences(self, sentences, selected_indices, threshold=0.7):
        """Remove sentences that are too similar to already selected ones"""
        if len(selected_indices) <= 1:
            return selected_indices
        
        # Use simple word overlap for similarity
        filtered_indices = [selected_indices[0]]  # Keep the first (highest scoring)
        
        for idx in selected_indices[1:]:
            current_sentence = sentences[idx]
            current_words = set(word.lower() for word in word_tokenize(current_sentence)
                              if word.lower() not in self.stop_words and word.isalnum())
            
            is_redundant = False
            for selected_idx in filtered_indices:
                selected_sentence = sentences[selected_idx]
                selected_words = set(word.lower() for word in word_tokenize(selected_sentence)
                                   if word.lower() not in self.stop_words and word.isalnum())
                
                # Calculate Jaccard similarity
                if len(current_words) > 0 and len(selected_words) > 0:
                    intersection = current_words.intersection(selected_words)
                    union = current_words.union(selected_words)
                    similarity = len(intersection) / len(union)
                    
                    if similarity > threshold:
                        is_redundant = True
                        break
            
            if not is_redundant:
                filtered_indices.append(idx)
        
        return filtered_indices
    
    def summarize(self, text, ratio=0.3, max_sentences=None):
        """
        Generate extractive summary using improved TF-IDF approach
        
        Args:
            text: Input text to summarize
            ratio: Proportion of sentences to include (0.1 to 0.5)
            max_sentences: Maximum number of sentences (overrides ratio if specified)
        """
        # Preprocess and extract sentences
        text = self.preprocess_text(text)
        sentences = self.extract_sentences(text)
        
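        if len(sentences) <= 2:
            # Too short to rank: return everything in the same (summary, info) shape as below
            return ' '.join(sentences), {
                'selected_indices': list(range(len(sentences))),
                'scores': np.ones(len(sentences)),
                'num_original_sentences': len(sentences),
                'num_selected_sentences': len(sentences)
            }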
        
        # Calculate different scoring components
        tfidf_scores = self.calculate_tfidf_scores(sentences)
        position_scores = self.calculate_position_scores(sentences)
        length_scores = self.calculate_sentence_length_scores(sentences)
        
        # Normalize TF-IDF scores to 0-1 range
        if tfidf_scores.max() > 0:
            tfidf_scores = tfidf_scores / tfidf_scores.max()
        
        # Combine scores with weights
        final_scores = (
            0.7 * tfidf_scores +      # Content importance (70%)
            0.2 * position_scores +   # Position importance (20%)
            0.1 * length_scores       # Length preference (10%)
        )
        
        # Determine number of sentences to select
        if max_sentences:
            num_sentences = min(max_sentences, len(sentences))
        else:
            num_sentences = max(1, int(len(sentences) * ratio))
        
        # Select top sentences
        # Convert to a plain list so later steps can treat it like any Python list
        top_indices = np.argsort(final_scores)[-num_sentences:][::-1].tolist()
        
        # Remove redundant sentences
        filtered_indices = self.remove_redundant_sentences(sentences, top_indices)
        
        # Sort by original order in text
        filtered_indices.sort()
        
        # Create summary
        summary_sentences = [sentences[i] for i in filtered_indices]
        summary = ' '.join(summary_sentences)
        
        return summary, {
            'selected_indices': filtered_indices,
            'scores': final_scores,
            'num_original_sentences': len(sentences),
            'num_selected_sentences': len(filtered_indices)
        }

In [None]:
# Example usage

# Load a Ghanaian news article
with open('articles/article.txt', 'r', encoding='utf-8') as f:
    article = f.read()
    
# Print article length
print(f"Article contains {len(sent_tokenize(article))} sentences and {len(article.split())} words")

extractive_summarizer = ExtractiveSummarizer()
# summarize() returns the summary text plus a dict with selection details
generated_summary, summary_info = extractive_summarizer.summarize(article, ratio=0.3)
print(f"\nExtractive Summary\n{generated_summary}")

In [None]:
# Generate LLM summary

llm_summary = llm_model.zero_shot_summarize(article)
print("\nLLM Zero-Shot Summary:")
print(llm_summary)
print(f"\nLength: {len(llm_summary.split())} words")

###  Quick Analysis
Compare the two summaries above:
1. Which feels more natural to read?
2. Which captures the main ideas better?
3. Which is more concise while retaining key information?

# Evaluation Metrics for Summarization

How do we know if our summaries are good? Let's implement some common evaluation metrics:

## ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
- Measures overlap between machine-generated summary and reference summary
- ROUGE-N: N-gram recall
- ROUGE-L: Longest Common Subsequence 

<img src='images/Rouge L.jpeg' width=700>

LCS is the longest sequence of tokens that appears in both texts (Ref, Gen) in the same order, though not necessarily contiguously.
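
A minimal sketch of ROUGE-L built on the LCS idea follows. It assumes lowercase whitespace tokenization and a plain F1 of LCS-based precision and recall. Published ROUGE implementations add options such as stemming and a weighted F-measure, so treat the function names and scoring details here as illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, 1):
        for j, tok_b in enumerate(b, 1):
            if tok_a == tok_b:
                # Matching tokens extend the best subsequence found so far
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Otherwise carry over the better of the two shorter prefixes
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

def rouge_l(reference, generated):
    """Return LCS-based precision, recall, and F1 for two texts."""
    ref = reference.lower().split()
    gen = generated.lower().split()
    lcs = lcs_length(ref, gen)
    if lcs == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = lcs / len(gen)
    recall = lcs / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# The LCS here is "the cat on the mat" (5 of 6 tokens on each side)
rouge_l_scores = rouge_l("the cat sat on the mat", "the cat lay on the mat")
print({name: round(value, 3) for name, value in rouge_l_scores.items()})
```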

## BLEU (Bilingual Evaluation Understudy)
- Originally designed for translation, but used for summarization
- Precision-focused (how many generated n-grams appear in reference)
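
The precision focus can be illustrated with a simplified BLEU. The sketch below assumes a single reference, whitespace tokens, unigrams and bigrams only, and no smoothing. The helper names are hypothetical, and standard implementations typically use n-grams up to 4 and multiple references.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Clipped precision: a candidate n-gram counts at most as often as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    if not cand_counts:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

def brevity_penalty(reference, candidate):
    """Penalize candidates that are shorter than the reference."""
    if len(candidate) == 0:
        return 0.0
    if len(candidate) >= len(reference):
        return 1.0
    return math.exp(1 - len(reference) / len(candidate))

def simple_bleu(reference, candidate, max_n=2):
    """Geometric mean of clipped precisions (n = 1..max_n) times the brevity penalty."""
    precisions = [modified_precision(reference, candidate, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(reference, candidate) * math.exp(log_mean)

bleu_reference = "the cat sat on the mat".split()
repetitive = "the the the cat".split()

# Clipping stops the repeated "the" from inflating precision: 3 of 4 unigrams count
print(modified_precision(bleu_reference, repetitive, 1))
print(round(simple_bleu(bleu_reference, repetitive), 3))
```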

## BERTScore
- Uses contextual embeddings to compute similarity
- Better semantic understanding than n-gram methods

## Human Evaluation Dimensions
- **Relevance**: How well does the summary capture the main points?
- **Coherence**: Does it flow logically?
- **Fluency**: Is it grammatically correct?
- **Factuality**: Does it contain errors or hallucinations?
- **Accuracy**: Does the summary accurately represent the original content?
- **Readability**: Is the summary well-written and easy to understand?

In [None]:
# Implement custom ROUGE score
from collections import Counter


# Reference summary for comparison
with open("articles/reference summary.txt", 'r', encoding='utf-8') as f:
    reference_summary = f.read()


def calculate_rouge_n(reference_summary, generated_summary, n):
    """Calculate ROUGE-N score"""
    # Tokenize into words
    ref_tokens = word_tokenize(reference_summary.lower())
    cand_tokens = word_tokenize(generated_summary.lower())

    # Generate n-grams
    ref_ngrams = list(zip(*[ref_tokens[i:] for i in range(n)]))
    cand_ngrams = list(zip(*[cand_tokens[i:] for i in range(n)]))
    
    # Count ngrams
    ref_counter = Counter(ref_ngrams)
    cand_counter = Counter(cand_ngrams)
    
    # Count matches
    overlap = sum((ref_counter & cand_counter).values()) # intersection of the two counters

    # Calculate precision and recall
    precision = overlap / max(1, sum(cand_counter.values()))
    recall = overlap / max(1, sum(ref_counter.values()))
    
    # Calculate F1 score (harmonic mean); guard against division by zero
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1



In [12]:
# Let's create a simple function to evaluate our summaries for ROUGE-1 and ROUGE-2
def evaluate_summary(reference, candidate):
    """Evaluate a summary using multiple metrics"""
    scores = {
        'ROUGE-1': calculate_rouge_n(reference, candidate, 1),
        'ROUGE-2': calculate_rouge_n(reference, candidate, 2),
    }

    # Add readability metric: average words per sentence
    cand_sentences = sent_tokenize(candidate)
    avg_sentence_length = len(word_tokenize(candidate)) / max(1, len(cand_sentences))
    scores['Avg Words/Sentence'] = avg_sentence_length
    
    return scores

In [None]:
# Let's try evaluating our summary against a reference summary

# Evaluate our extractive summary against the reference
evaluation_scores = evaluate_summary(reference_summary, generated_summary)

print("Evaluation Results:")
for metric, score in evaluation_scores.items():
    print(f"{metric}: {score:.4f}")

# Abstractive Summarization & Control Parameters


## What is Abstractive Summarization?

Abstractive summarization involves:
- Understanding the source content deeply
- Identifying key concepts and relationships
- Generating new text by paraphrasing existing text
- Condensing information in ways that extraction cannot


# Controllable Summarization

One of the major advantages of modern summarization systems is the ability to control various aspects of the generated summaries:

## Common Control Parameters:

1. **Length**: Controlling how long or short the summary should be
2. **Style**: Formal vs. casual, simple vs. technical
3. **Focus**: Emphasizing particular topics or aspects
4. **Structure**: Bullet points, narrative, or question-answering

## How to Implement Controls:

1. **Model-specific parameters**: Using built-in generation controls
2. **Prompt engineering**: Adding instructional prefixes
3. **Output filtering**: Post-processing generated summaries
4. **Fine-tuning**: Training the model with examples of desired style
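
Two of these approaches, prompt engineering and output filtering, can be sketched without calling a model. The helper names and the wording of the instructions are assumptions chosen for illustration, and the regex-based sentence splitter is deliberately simple.

```python
import re

def build_controlled_instruction(max_sentences=3, style="formal", focus=None, structure="paragraph"):
    """Prompt engineering: turn the control parameters into an instruction prefix."""
    parts = [f"Summarize the following text in at most {max_sentences} sentences."]
    parts.append(f"Use a {style} style.")
    if focus:
        parts.append(f"Focus on {focus}.")
    if structure == "bullets":
        parts.append("Format the summary as bullet points.")
    return " ".join(parts)

def enforce_max_sentences(summary, max_sentences):
    """Output filtering: keep only the first `max_sentences` sentences."""
    # Split after sentence-ending punctuation that is followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", summary.strip())
    return " ".join(sentences[:max_sentences])

instruction = build_controlled_instruction(max_sentences=2, style="casual", focus="costs", structure="bullets")
print(instruction)
print(enforce_max_sentences("One. Two! Three? Four.", 2))
```

The instruction string can be passed as the `instruction` argument of `zero_shot_summarize`, and the filter can be applied to the returned summary as a safety net when the model ignores the length limit.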

## Prompt Control

In [None]:
def experiment_with_prompts():
    """Experiment with different instruction styles"""
    
    instructions = {
        "basic": "Summarize the following text:",
        "concise": "Provide a concise summary of the main points:",
        "detailed": "Write a comprehensive summary covering all key aspects:",
        "bullet": "Summarize the key points as bullet points:",
        "executive": "Write an executive summary focusing on implications and next steps:"
    }
    
    results = {}
    
    for style, instruction in instructions.items():
        print(f"\n--- {style.upper()} STYLE ---")
        summary = llm_model.zero_shot_summarize(article, instruction)
        results[style] = summary
        print(f"Instruction: {instruction}")
        print(f"Summary: {summary}")
        print(f"Length: {len(summary.split())} words")
    
    return results

instruction_results = experiment_with_prompts()

## Few-Shot vs. Zero-Shot

In [None]:
# Create few-shot examples for different domains
news_examples = [
    {
        "text": "Scientists at MIT have developed a new battery technology that could revolutionize electric vehicles. The lithium-metal batteries can charge to 80% capacity in just 10 minutes and last for over 500,000 miles. The breakthrough addresses two major concerns about electric vehicles: charging time and battery longevity. Commercial production is expected to begin in 2026.",
        "summary": "MIT scientists created fast-charging, long-lasting batteries for electric vehicles, addressing key concerns about charging time and durability, with commercial production planned for 2026."
    },
    {
        "text": "The Federal Reserve announced a 0.25% interest rate cut today, citing concerns about slowing economic growth and inflation falling below target levels. This marks the third rate cut this year. Stock markets responded positively, with the S&P 500 gaining 2.1% in after-hours trading. Economists predict this could stimulate business investment and consumer spending.",
        "summary": "The Federal Reserve cut interest rates by 0.25% due to economic concerns, prompting positive market reactions and expectations of increased business and consumer activity."
    }
]

research_examples = [
    {
        "text": "This study examined the effects of meditation on stress hormones in 200 participants over 8 weeks. Participants who meditated daily showed a 23% reduction in cortisol levels compared to the control group. The research also found improvements in sleep quality and self-reported well-being. These findings suggest meditation could be an effective intervention for stress management.",
        "summary": "An 8-week study of 200 participants found daily meditation reduced stress hormone levels by 23% and improved sleep and well-being, supporting meditation as a stress management tool."
    }
]

def compare_few_shot_vs_zero_shot(text, examples):
    """Compare zero-shot and few-shot performance"""
    
    # Zero-shot
    zero_shot = llm_model.zero_shot_summarize(text)
    
    # Few-shot
    few_shot = llm_model.few_shot_summarize(text, examples)
    
    print("ZERO-SHOT SUMMARY:")
    print(zero_shot)
    print(f"Length: {len(zero_shot.split())} words\n")
    
    print("FEW-SHOT SUMMARY:")
    print(few_shot)
    print(f"Length: {len(few_shot.split())} words\n")
    
    return zero_shot, few_shot

In [None]:
# Test with a new research article
new_research = """
A comprehensive analysis of social media usage patterns among teenagers reveals concerning trends in mental health outcomes. The study, following 1,500 participants aged 13-18 over two years, found that teens spending more than 3 hours daily on social platforms showed increased rates of anxiety and depression. Particularly concerning was the correlation between late-night social media use and sleep disorders. However, the research also identified positive outcomes, including enhanced social connections and access to mental health resources. The researchers recommend implementing digital wellness programs in schools and encouraging mindful social media usage rather than complete avoidance.
"""

zero_shot_result, few_shot_result = compare_few_shot_vs_zero_shot(new_research, research_examples)

### Discussion Point
After running the above experiment:
1. Which summary feels more consistent with the examples provided?
2. How did the few-shot examples influence the style and content selection?
3. What are the trade-offs between zero-shot (faster) and few-shot (more controlled)?
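
One way to make the speed versus control trade-off concrete is to look at how much longer a few-shot prompt is than a zero-shot one. The sketch below is only illustrative: `build_few_shot_prompt` and its prompt layout are hypothetical and are not necessarily what `few_shot_summarize` does internally, and the demo texts are made up for the example.

```python
def build_few_shot_prompt(text, examples, instruction="Summarize the following text:"):
    """Assemble a prompt from {'text', 'summary'} example dicts (hypothetical layout)."""
    parts = []
    for example in examples:
        # Each worked example shows the model the expected style and length
        parts.append(f"{instruction}\n{example['text']}\nSummary: {example['summary']}")
    # The new input uses the same layout, with the summary left open for the model
    parts.append(f"{instruction}\n{text}\nSummary:")
    return "\n\n".join(parts)


demo_examples = [
    {
        "text": "The city council approved a new bike lane budget on Monday.",
        "summary": "Council approves bike lane funding.",
    }
]
demo_text = "Researchers released an open dataset of street-level air quality readings."

zero_shot_prompt = build_few_shot_prompt(demo_text, [])
few_shot_prompt = build_few_shot_prompt(demo_text, demo_examples)

print(f"Zero-shot prompt: {len(zero_shot_prompt.split())} words")
print(f"Few-shot prompt:  {len(few_shot_prompt.split())} words")
```

Every added example lengthens the prompt, so few-shot calls cost more tokens and latency in exchange for tighter control over style.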

# Part II: Advanced Prompting Strategies & Controllable Summarization

## Learning Objectives for Part 2:
- Master advanced prompting techniques (chain-of-thought, role-playing, structured output)
- Implement controllable summarization for different audiences and formats
- Handle long documents and multi-document scenarios
- Build robust evaluation frameworks using modern techniques

## Advanced Prompting Techniques

### Chain-of-Thought (CoT) for Complex Summarization

When a document covers multiple topics or intricate relationships, breaking summarization into explicit steps (identify themes, extract key points, relate them, then synthesize) often helps an LLM produce a more complete summary.

In [None]:
class AdvancedSummarizer(LLMSummarizer):
    """Extended summarizer with advanced prompting capabilities"""
    
    def chain_of_thought_summarize(self, text, focus_areas=None):
        """Use step-by-step reasoning for complex summarization"""
        
        if focus_areas is None:
            focus_areas = ["main topic", "key findings", "implications"]
        
        # Fold the focus areas into the prompt so the argument actually steers the output
        focus_list = ", ".join(focus_areas)
        
        cot_prompt = f"""
I need to summarize this complex text systematically. Let me break this down step by step:

1. First, I'll identify the main topics and themes
2. Then, I'll extract the key points for each theme
3. Next, I'll identify relationships between different points
4. Finally, I'll synthesize this into a coherent summary

I should pay particular attention to: {focus_list}

Text to analyze:
{text}

Let me work through this step by step:

Step 1 - Main topics I can identify:
"""
        
        return self.generate(cot_prompt, max_tokens=500)
    
    def role_based_summarize(self, text, role="expert", audience="general"):
        """Summarize from a specific role perspective"""
        
        role_prompts = {
            "expert": f"As a domain expert, provide a technical summary for {audience} audience:",
            "journalist": f"As a journalist, write a news summary for {audience} readers:",
            "teacher": f"As an educator, explain this clearly for {audience} students:",
            "consultant": f"As a business consultant, provide strategic insights for {audience}:",
            "researcher": f"As a researcher, highlight methodology and findings for {audience}:"
        }
        
        prompt = f"{role_prompts.get(role, role_prompts['expert'])}\n\n{text}\n\nSummary:"
        return self.generate(prompt)
    
    def structured_summarize(self, text, structure="executive"):
        """Generate structured summaries with specific formats"""
        
        structures = {
            "executive": """
Create an executive summary with exactly these sections:
- Executive Summary (2-3 sentences)
- Key Findings (3-4 bullet points)
- Recommendations (2-3 bullet points)
- Next Steps (1-2 bullet points)
""",
            "scientific": """
Summarize in scientific paper format:
- Background & Objective (1 sentence)
- Methods (1 sentence)
- Results (2-3 sentences)
- Conclusions (1-2 sentences)
""",
            "news": """
Create a news summary with:
- Lead (Who, What, When, Where in first sentence)
- Key Details (2-3 supporting facts)
- Context/Background (1-2 sentences)
- Impact/Implications (1 sentence)
""",
            "meeting": """
Format as meeting minutes:
- Key Decisions Made
- Action Items Assigned
- Topics Discussed
- Next Meeting/Follow-up
"""
        }
        
        prompt = f"{structures.get(structure, structures['executive'])}\n\nText:\n{text}\n\nStructured Summary:"
        return self.generate(prompt, max_tokens=400)

# Initialize advanced summarizer
advanced_summarizer = AdvancedSummarizer()

### Experiment 1: Chain-of-Thought vs Direct Summarization

In [None]:
# Complex multi-topic article for testing
with open("articles/complex_article.txt", 'r', encoding='utf-8') as f:
    complex_article = f.read()

def compare_cot_vs_direct(text):
    """Compare chain-of-thought vs direct summarization"""
    
    print("🔗 CHAIN-OF-THOUGHT APPROACH:")
    print("=" * 50)
    cot_summary = advanced_summarizer.chain_of_thought_summarize(text)
    print(cot_summary)
    print(f"\nLength: {len(cot_summary.split())} words")
    
    print("\n📝 DIRECT APPROACH:")
    print("=" * 50)
    direct_summary = llm_model.zero_shot_summarize(
        text, 
        "Provide a comprehensive summary of this article covering all main points:"
    )
    print(direct_summary)
    print(f"\nLength: {len(direct_summary.split())} words")
    
    return cot_summary, direct_summary

cot_result, direct_result = compare_cot_vs_direct(complex_article)
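
Because the chain-of-thought prompt ends with "Step 1 - Main topics I can identify:", the model's output will likely contain its intermediate reasoning as well as the final synthesis, which inflates the word count in the comparison above. The sketch below shows one possible way to isolate the final step. `extract_final_summary` is a hypothetical helper, and the "Step 4 ...:" / "Final summary:" markers are an assumption about the output format; real model output may be laid out differently.

```python
import re


def extract_final_summary(cot_output):
    """Return the text after the last 'Step 4 ...:' or 'Final summary:' marker.

    Falls back to the whole output when no marker is found.
    """
    markers = list(
        re.finditer(r"(?im)^\s*(?:step\s*4\b[^\n:]*:|final summary\s*:)", cot_output)
    )
    if not markers:
        return cot_output.strip()
    return cot_output[markers[-1].end():].strip()


# Made-up output in the assumed format, for illustration only
sample_output = """Step 1 - Main topics I can identify: AI in hospitals, privacy
Step 2 - Key points: faster diagnosis; data risks
Step 3 - Relationships: speed gains depend on patient data
Step 4 - Synthesis: AI tools cut diagnostic time but raise privacy questions."""

final_only = extract_final_summary(sample_output)
print(final_only)
print(f"Full output: {len(sample_output.split())} words, final step: {len(final_only.split())} words")
```

Comparing `len(final_only.split())` with the direct summary's length gives a fairer picture than comparing against the full reasoning trace.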



### Analysis: Which Approach Worked Better?

In [None]:
import re

import pandas as pd

def analyze_summary_quality(cot_summary, direct_summary, original_text):
    """Analyze different aspects of summary quality"""
    
    # Hand-picked key topics (hard-coded; original_text is not parsed)
    key_topics = [
        "artificial intelligence", "AI", "healthcare", "hospitals", "diagnostic",
        "privacy", "FDA", "economic", "costs", "personalized medicine", 
        "ethical", "bias", "demographics"
    ]
    
    def count_topic_coverage(summary, topics):
        # Match whole words (optionally plural) so that short terms such as "AI"
        # are not counted when they merely appear inside another word ("maintain")
        covered = [
            topic for topic in topics
            if re.search(rf"\b{re.escape(topic)}(?:e?s)?\b", summary, flags=re.IGNORECASE)
        ]
        return len(covered), covered
    
    # Analyze both summaries
    cot_coverage, cot_topics = count_topic_coverage(cot_summary, key_topics)
    direct_coverage, direct_topics = count_topic_coverage(direct_summary, key_topics)
    
    analysis = {
        "Approach": ["Chain-of-Thought", "Direct"],
        "Word Count": [len(cot_summary.split()), len(direct_summary.split())],
        "Topic Coverage": [f"{cot_coverage}/{len(key_topics)}", f"{direct_coverage}/{len(key_topics)}"],
        "Structure Clarity": ["High" if "step" in cot_summary.lower() else "Medium", "Medium"],
        "Reasoning Visible": ["Yes" if "identify" in cot_summary.lower() or "step" in cot_summary.lower() else "No", "No"]
    }
    
    comparison_df = pd.DataFrame(analysis)
    print("SUMMARY QUALITY COMPARISON:")
    print(comparison_df.to_string(index=False))
    
    print(f"\nTopics covered by CoT: {cot_topics}")
    print(f"Topics covered by Direct: {direct_topics}")
    
    return comparison_df

quality_analysis = analyze_summary_quality(cot_result, direct_result, complex_article)

## Controllable Summarization: Audience and Purpose

### Role-Based Summarization

In [None]:
def demonstrate_role_based_summarization():
    """Show how different roles produce different summaries"""
    
    roles_and_audiences = [
        ("journalist", "general public"),
        ("researcher", "academic peers"),
        ("consultant", "business executives"),
        ("teacher", "high school students")
    ]
    
    results = {}
    
    for role, audience in roles_and_audiences:
        print(f"\n👤 ROLE: {role.upper()} | AUDIENCE: {audience.upper()}")
        print("=" * 60)
        
        summary = advanced_summarizer.role_based_summarize(
            complex_article, 
            role=role, 
            audience=audience
        )
        
        results[f"{role}_{audience}"] = summary
        print(summary)
        print(f"Length: {len(summary.split())} words")
    
    return results

role_based_results = demonstrate_role_based_summarization()

### Exercise: Audience Adaptation Analysis

In [None]:
import re

import matplotlib.pyplot as plt

def analyze_audience_adaptation(results):
    """Analyze how summaries adapt to different audiences"""
    
    # Define audience-specific characteristics to look for
    characteristics = {
        "technical_terms": ["AI", "algorithm", "diagnostic", "optimization", "regulatory"],
        "business_terms": ["investment", "ROI", "market", "revenue", "cost savings", "efficiency"],
        "simple_language": ["help", "better", "improve", "save money", "easier"],
        "educational_elements": ["learn", "understand", "example", "means", "because"]
    }
    
    analysis_results = []
    
    for summary_key, summary in results.items():
        role, audience = summary_key.split('_', 1)
        
        char_counts = {}
        for char_type, terms in characteristics.items():
            # Whole-word (optionally plural), case-insensitive match so that
            # "AI" is not found inside unrelated words such as "maintain"
            count = sum(
                1 for term in terms
                if re.search(rf"\b{re.escape(term)}(?:e?s)?\b", summary, flags=re.IGNORECASE)
            )
            char_counts[char_type] = count
        
        analysis_results.append({
            'role': role,
            'audience': audience.replace('_', ' '),
            'word_count': len(summary.split()),
            'technical_terms': char_counts['technical_terms'],
            'business_terms': char_counts['business_terms'],
            'simple_language': char_counts['simple_language'],
            'educational_elements': char_counts['educational_elements']
        })
    
    adaptation_df = pd.DataFrame(analysis_results)
    print("🎭 AUDIENCE ADAPTATION ANALYSIS:")
    print(adaptation_df.to_string(index=False))
    
    # Visualize adaptation patterns
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    fig.suptitle('Summary Characteristics by Role and Audience', fontsize=16)
    
    # Technical terms usage
    axes[0,0].bar(range(len(adaptation_df)), adaptation_df['technical_terms'])
    axes[0,0].set_title('Technical Terms Usage')
    axes[0,0].set_xticks(range(len(adaptation_df)))
    axes[0,0].set_xticklabels([f"{row['role']}\n({row['audience']})" for _, row in adaptation_df.iterrows()], rotation=45)
    
    # Business terms usage
    axes[0,1].bar(range(len(adaptation_df)), adaptation_df['business_terms'])
    axes[0,1].set_title('Business Terms Usage')
    axes[0,1].set_xticks(range(len(adaptation_df)))
    axes[0,1].set_xticklabels([f"{row['role']}\n({row['audience']})" for _, row in adaptation_df.iterrows()], rotation=45)
    
    # Simple language usage
    axes[1,0].bar(range(len(adaptation_df)), adaptation_df['simple_language'])
    axes[1,0].set_title('Simple Language Usage')
    axes[1,0].set_xticks(range(len(adaptation_df)))
    axes[1,0].set_xticklabels([f"{row['role']}\n({row['audience']})" for _, row in adaptation_df.iterrows()], rotation=45)
    
    # Word count comparison
    axes[1,1].bar(range(len(adaptation_df)), adaptation_df['word_count'])
    axes[1,1].set_title('Summary Length')
    axes[1,1].set_xticks(range(len(adaptation_df)))
    axes[1,1].set_xticklabels([f"{row['role']}\n({row['audience']})" for _, row in adaptation_df.iterrows()], rotation=45)
    
    plt.tight_layout()
    plt.show()
    
    return adaptation_df

audience_analysis = analyze_audience_adaptation(role_based_results)

### Building Structured Summaries

In [None]:
def demonstrate_structured_outputs():
    """Show different structured output formats"""
    
    # Use a business-focused article for this demo
    business_article = """
    TechCorp announced its Q3 earnings today, reporting revenue of $2.8 billion, up 15% from the previous quarter. The company's cloud services division led growth with 32% year-over-year increase, while its traditional software licensing revenue declined by 8%. 
    
    CEO Sarah Johnson highlighted the successful launch of their AI-powered analytics platform, which has already attracted 50,000 enterprise customers in its first month. The platform uses machine learning to help businesses optimize their operations and reduce costs by an average of 18%.
    
    However, the company faces challenges in the competitive landscape, with new entrants offering similar services at lower prices. TechCorp plans to invest $500 million in R&D next year to maintain its technological edge. The company also announced plans to acquire DataInsights Inc., a startup specializing in real-time data processing, for $150 million.
    
    Looking ahead, TechCorp expects Q4 revenue to reach $3.1 billion, driven by increased adoption of cloud services and the holiday shopping season. The company raised its full-year guidance to $11.2 billion, exceeding analyst expectations of $10.8 billion.
    """
    
    structures = ["executive", "news", "meeting"]
    structured_results = {}
    
    for structure in structures:
        print(f"\n📋 {structure.upper()} FORMAT:")
        print("=" * 50)
        
        summary = advanced_summarizer.structured_summarize(business_article, structure)
        structured_results[structure] = summary
        print(summary)
    
    return structured_results

structured_results = demonstrate_structured_outputs()
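
A model will not always follow the requested layout, so it can be worth checking which sections actually made it into the output. The sketch below is a simple, assumption-laden check: `REQUIRED_SECTIONS` and `missing_sections` are hypothetical names, the section labels are copied from the templates in `structured_summarize`, and the check assumes the model echoes those labels verbatim.

```python
# Section labels taken from the prompt templates in structured_summarize
REQUIRED_SECTIONS = {
    "executive": ["Executive Summary", "Key Findings", "Recommendations", "Next Steps"],
    "scientific": ["Background & Objective", "Methods", "Results", "Conclusions"],
    "news": ["Lead", "Key Details", "Context/Background", "Impact/Implications"],
    "meeting": ["Key Decisions Made", "Action Items Assigned", "Topics Discussed", "Next Meeting/Follow-up"],
}


def missing_sections(summary, structure):
    """Return the expected section labels that do not appear in the summary."""
    lowered = summary.lower()
    return [
        section for section in REQUIRED_SECTIONS[structure]
        if section.lower() not in lowered
    ]


# Made-up model output that omits the last section
sample_summary = """Executive Summary: Revenue grew 15% on strong cloud demand.
Key Findings:
- Cloud services grew 32% year over year
Recommendations:
- Keep investing in R&D"""

print(missing_sections(sample_summary, "executive"))
```

A non-empty result is a cue to retry the request or tighten the prompt; it says nothing about whether the content of each section is correct.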

## Handling Long Documents

### Recursive Summarization Strategy

In [None]:
class LongDocumentSummarizer(AdvancedSummarizer):
    """Specialized summarizer for handling long documents"""
    
    def chunk_text(self, text, chunk_size=1000, overlap=100):
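        """Split text into overlapping chunks"""
        # The loop advances by chunk_size - overlap words, so that step must be positive
        if overlap >= chunk_size:
            raise ValueError("overlap must be smaller than chunk_size")
        words = text.split()
        chunks = []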
        
        for i in range(0, len(words), chunk_size - overlap):
            chunk = ' '.join(words[i:i + chunk_size])
            chunks.append(chunk)
            
            if i + chunk_size >= len(words):
                break
        
        return chunks
    
    def recursive_summarize(self, text, target_length=200, max_iterations=3):
        """Recursively summarize long text"""
        
        print(f"📄 Starting recursive summarization...")
        print(f"Original length: {len(text.split())} words")
        print(f"Target length: {target_length} words")
        
        current_text = text
        iteration = 0
        
        while len(current_text.split()) > target_length and iteration < max_iterations:
            iteration += 1
            print(f"\n🔄 Iteration {iteration}:")
            
            # If text is still very long, chunk it first
            if len(current_text.split()) > 2000:
                print("  📝 Chunking large text...")
                chunks = self.chunk_text(current_text, chunk_size=800, overlap=50)
                print(f"  Created {len(chunks)} chunks")
                
                # Summarize each chunk
                chunk_summaries = []
                for i, chunk in enumerate(chunks):
                    print(f"  Processing chunk {i+1}/{len(chunks)}...")
                    chunk_summary = self.zero_shot_summarize(
                        chunk, 
                        "Summarize the key points from this text section:"
                    )
                    chunk_summaries.append(chunk_summary)
                
                # Combine chunk summaries
                current_text = ' '.join(chunk_summaries)
                print(f"  Combined chunks: {len(current_text.split())} words")
            
            # Final summarization pass
            if len(current_text.split()) > target_length:
                print("  📋 Final summarization pass...")
                current_text = self.zero_shot_summarize(
                    current_text,
                    f"Create a comprehensive summary of approximately {target_length} words:"
                )
                print(f"  Result: {len(current_text.split())} words")
        
        print(f"\n✅ Final summary: {len(current_text.split())} words")
        return current_text
    
    def hierarchical_summarize(self, text, levels=("detailed", "medium", "brief")):
        """Create multiple summary levels"""
        
        level_configs = {
            "detailed": {"length": 400, "instruction": "Provide a detailed summary covering all major points:"},
            "medium": {"length": 200, "instruction": "Create a balanced summary of key points:"},
            "brief": {"length": 100, "instruction": "Write a concise summary of main ideas:"},
            "executive": {"length": 50, "instruction": "Provide an executive summary in 2-3 sentences:"}
        }
        
        summaries = {}
        
        for level in levels:
            if level in level_configs:
                config = level_configs[level]
                print(f"\n📊 Creating {level} summary (target: ~{config['length']} words)...")
                
                summary = self.recursive_summarize(text, target_length=config['length'])
                summaries[level] = summary
                
                print(f"✅ {level.capitalize()} summary ({len(summary.split())} words):")
                print(summary[:200] + "..." if len(summary) > 200 else summary)
        
        return summaries

# Initialize long document summarizer
long_doc_summarizer = LongDocumentSummarizer()
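
To see what the overlap does, the chunking logic can be run on a toy input. The sketch below mirrors `chunk_text` as a free function (`chunk_words` is a name chosen for this illustration) and uses a tiny chunk size so the shared words are easy to spot.

```python
def chunk_words(text, chunk_size, overlap):
    """Split text into word chunks where consecutive chunks share `overlap` words."""
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


demo_text = " ".join(f"w{i}" for i in range(10))  # "w0 w1 ... w9"
demo_chunks = chunk_words(demo_text, chunk_size=4, overlap=1)

for chunk in demo_chunks:
    print(chunk)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
```

Each chunk repeats the last word of the previous one, so a sentence that straddles a boundary is at least partly visible to both summarization calls.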

In [None]:
# Load the long document used by both tests
with open("articles/long_document.txt", 'r', encoding='utf-8') as f:
    long_document = f.read()

# Test recursive summarization
print("\n🔄 Testing Recursive Summarization:")
recursive_result = long_doc_summarizer.recursive_summarize(long_document, target_length=150)

# Test hierarchical summarization
print("\n📊 Testing Hierarchical Summarization:")
hierarchical_results = long_doc_summarizer.hierarchical_summarize(
    long_document, 
    levels=["detailed", "medium", "brief"]
)

## Multi-Document Summarization

### Comparative and Synthesis Approaches

In [None]:
class MultiDocumentSummarizer(AdvancedSummarizer):
    """Summarizer for handling multiple related documents"""
    
    def comparative_summarize(self, documents, document_labels=None):
        """Compare and contrast multiple documents"""
        
        if document_labels is None:
            document_labels = [f"Document {i+1}" for i in range(len(documents))]
        
        # Create comparative prompt
        comparative_prompt = """
        I need to analyze and compare multiple documents on related topics. 
        Please provide a comparative summary that:
        1. Identifies common themes across documents
        2. Highlights key differences in perspectives or findings
        3. Synthesizes the most important information
        4. Notes any contradictions or conflicting information
        
        Documents to compare:
        """
        
        for label, doc in zip(document_labels, documents):
            comparative_prompt += f"\n\n{label}:\n{doc}"
        
        comparative_prompt += "\n\nComparative Summary:"
        
        return self.generate(comparative_prompt, max_tokens=500)
    
    def synthesis_summarize(self, documents, focus_question=None):
        """Synthesize information from multiple sources"""
        
        if focus_question is None:
            focus_question = "What are the key insights when considering all sources together?"
        
        synthesis_prompt = f"""
        Synthesize information from the following sources to answer: {focus_question}
        
        Please create a synthesis that:
        - Integrates information from all sources
        - Identifies patterns and trends
        - Resolves or notes conflicting information
        - Provides a coherent unified perspective
        
        Sources:
        """
        
        for i, doc in enumerate(documents, 1):
            synthesis_prompt += f"\n\nSource {i}:\n{doc}"
        
        synthesis_prompt += f"\n\nSynthesis addressing '{focus_question}':"
        
        return self.generate(synthesis_prompt, max_tokens=500)

# Initialize multi-document summarizer
multi_doc_summarizer = MultiDocumentSummarizer()

### Multi-Document Exercise

In [None]:
def test_multi_document_summarization():
    """Test multi-document summarization with different perspectives"""
    
    # Three documents with different perspectives on remote work
    documents = [
        """
        Study A: Remote Work Productivity Analysis
        A comprehensive study of 5,000 employees across 50 companies found that remote workers are 13% more productive than their office counterparts. The research, conducted over 18 months, measured productivity through completed tasks, project deadlines met, and output quality scores. Remote workers reported higher job satisfaction (8.2/10 vs 7.1/10) and better work-life balance. However, the study noted challenges in spontaneous collaboration and team building. Companies with strong digital infrastructure saw the highest productivity gains, while those with poor remote work policies experienced productivity declines.
        """,
        """
        Study B: Remote Work Challenges and Solutions
        Research focusing on management perspectives reveals significant challenges in remote work implementation. Surveys of 1,200 managers indicate concerns about employee oversight (67%), team cohesion (58%), and maintaining company culture (52%). The study found that productivity varies significantly by role type, with creative and collaborative roles showing 8% decreased output while individual contributor roles improved by 12%. Communication frequency increased by 35% in remote teams, but decision-making speed decreased by 23%. Companies investing in management training for remote leadership saw better outcomes.
        """,
        """
        Study C: Economic Impact of Remote Work
        Economic analysis of remote work trends shows substantial cost savings for both employers and employees. Companies report average savings of $11,000 per remote employee annually through reduced office space, utilities, and facility costs. Employees save an average of $4,000 yearly on commuting, work clothing, and meals. However, the analysis reveals increased spending on home office equipment and higher utility bills for workers. The real estate market has been significantly impacted, with commercial office space demand down 30% in major cities while residential markets in suburban areas have seen 15% price increases. The shift has created an estimated $1.2 trillion economic redistribution.
        """
    ]
    
    labels = ["Productivity Study", "Management Perspective", "Economic Analysis"]
    
    print("📑 COMPARATIVE ANALYSIS:")
    print("=" * 60)
    comparative_result = multi_doc_summarizer.comparative_summarize(documents, labels)
    print(comparative_result)
    
    print("\n🔬 SYNTHESIS FOR POLICY MAKERS:")
    print("=" * 60)
    synthesis_result = multi_doc_summarizer.synthesis_summarize(
        documents, 
        "What policy recommendations emerge from considering all three studies?"
    )
    print(synthesis_result)
    
    return comparative_result, synthesis_result

comparative_analysis, synthesis_analysis = test_multi_document_summarization()

## Modern Evaluation Techniques

### Beyond ROUGE: Human-Aligned Evaluation
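
For context, ROUGE-style metrics score a summary by its word overlap with a reference summary. The sketch below is a simplified ROUGE-1 (unigram overlap with whitespace tokenization, no stemming); it is meant to illustrate the idea and will not reproduce the scores of an official ROUGE implementation. The example sentences are made up.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Simplified ROUGE-1: clipped unigram overlap between candidate and reference."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


reference = "the rover collected a rock sample"
near_copy = "the rover collected a sample"
paraphrase = "perseverance gathered one martian stone"

print("Near copy: ", rouge_1(near_copy, reference))
print("Paraphrase:", rouge_1(paraphrase, reference))
```

The paraphrase conveys much the same idea but shares no words with the reference, so it scores zero. This blind spot is one motivation for the LLM-based evaluation methods that follow.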

In [None]:
class ModernEvaluator:
    """Modern evaluation methods for summarization quality"""
    
    def __init__(self, llm_summarizer):
        self.summarizer = llm_summarizer
    
    def llm_as_judge(self, summary, original_text, criteria=None):
        """Use LLM to judge summary quality"""
        
        if criteria is None:
            criteria = ["accuracy", "completeness", "clarity", "conciseness"]
        
        criteria_descriptions = {
            "accuracy": "Contains no false information or hallucinations",
            "completeness": "Covers all important points from the original",
            "clarity": "Easy to understand and well-written",
            "conciseness": "Appropriate length without unnecessary detail",
            "coherence": "Logical flow and good organization",
            "relevance": "Focuses on the most important information"
        }
        
        evaluation_prompt = f"""
        Evaluate this summary based on the criteria below. Rate each criterion from 1-5 (5 being excellent).
        
        Original Text:
        {original_text}
        
        Summary to Evaluate:
        {summary}
        
        Evaluation Criteria:
        """
        
        for criterion in criteria:
            desc = criteria_descriptions.get(criterion, "Quality assessment")
            evaluation_prompt += f"\n- {criterion.capitalize()}: {desc}"
        
        evaluation_prompt += """
        
        Please provide:
        1. A score (1-5) for each criterion
        2. Brief explanation for each score
        3. Overall assessment
        4. Specific suggestions for improvement
        
        Evaluation:
        """
        
        return self.summarizer.generate(evaluation_prompt, max_tokens=400)
    
    def factuality_check(self, summary, source_text):
        """Check summary for factual accuracy"""
        
        factuality_prompt = f"""
        Check if this summary contains any factual errors or information not present in the source.
        
        Source Text:
        {source_text}
        
        Summary to Check:
        {summary}
        
        Please identify:
        1. Any statements that contradict the source
        2. Any information added that's not in the source  
        3. Any numbers, dates, or names that are incorrect
        4. Overall factual accuracy rating (1-5)
        
        Factuality Assessment:
        """
        
        return self.summarizer.generate(factuality_prompt, max_tokens=300)
    

    def comparative_evaluation(self, summaries, source_text, labels=None):
        """Compare multiple summaries of the same source"""
        
        if labels is None:
            labels = [f"Summary {i+1}" for i in range(len(summaries))]
        
        comparison_prompt = f"""
        Compare these summaries of the same source text and rank them by quality.
        
        Source Text:
        {source_text}
        
        Summaries to Compare:
        """
        
        for label, summary in zip(labels, summaries):
            comparison_prompt += f"\n\n{label}:\n{summary}"
        
        comparison_prompt += """
        
        Please provide:
        1. Ranking from best to worst with brief reasoning
        2. Strengths and weaknesses of each summary
        3. Which summary you would recommend and why
        
        Comparative Evaluation:
        """
        
        return self.summarizer.generate(comparison_prompt, max_tokens=400)

# Initialize modern evaluator
modern_evaluator = ModernEvaluator(advanced_summarizer)
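
`llm_as_judge` returns free text, so comparing summaries numerically requires pulling the scores out of that text. The sketch below shows one possible approach. `parse_judge_scores` is a hypothetical helper, and it assumes the judge writes each score on the same line as the criterion name (for example "Accuracy: 4/5"); other layouts would need a different pattern.

```python
import re


def parse_judge_scores(evaluation_text, criteria):
    """Extract a 1-5 score per criterion from judge output; None when not found."""
    scores = {}
    for criterion in criteria:
        # Criterion name, then up to 20 non-digit characters on the same line, then the score
        pattern = rf"{re.escape(criterion)}[^\d\n]{{0,20}}([1-5])\b"
        match = re.search(pattern, evaluation_text, flags=re.IGNORECASE)
        scores[criterion] = int(match.group(1)) if match else None
    return scores


# Made-up judge output in the assumed format
sample_evaluation = """Accuracy: 4/5 - no factual errors, one vague claim.
Completeness: 3 - misses the cost estimate.
Clarity - 5
Overall the summary is solid."""

scores = parse_judge_scores(
    sample_evaluation, ["accuracy", "completeness", "clarity", "conciseness"]
)
print(scores)
```

Returning `None` for a missing criterion keeps parsing failures visible instead of silently treating them as a low score.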

## Hands-On Evaluation Exercise

In [None]:
def comprehensive_evaluation_demo():
    """Demonstrate modern evaluation techniques"""
    
    # Test article for evaluation
    eval_test_article = """
    The Mars Perseverance rover has successfully collected its 20th rock sample from the Martian surface, marking a significant milestone in the search for ancient microbial life. The sample, dubbed "Beartown," was extracted from a sedimentary rock formation in Jezero Crater that scientists believe was formed in an ancient lake bed approximately 3.6 billion years ago.
    
    NASA's analysis indicates the rock contains high levels of silica and phosphate, minerals that on Earth are associated with biological processes. The rover's PIXL instrument detected organic compounds within the sample, though scientists caution that these could have non-biological origins. Dr. Sarah Chen, lead geologist on the mission, stated that the findings are "extraordinarily promising" for future analysis when the samples are returned to Earth.
    
    The sample collection process took three days due to the rock's unusual hardness, requiring multiple drilling attempts. This brings the total sample collection to 20 tubes, with plans to collect 10 more before the sample return mission launches in 2028. The European Space Agency will provide the Earth Return Orbiter, while NASA handles the Sample Return Lander, in a collaborative effort estimated to cost $7 billion.
    """
    
    # Generate different summaries to evaluate
    test_summaries = {
        "Basic": advanced_summarizer.zero_shot_summarize(eval_test_article),
        "Chain-of-Thought": advanced_summarizer.chain_of_thought_summarize(eval_test_article),
        "News Style": advanced_summarizer.structured_summarize(eval_test_article, "news"),
        "Scientific": advanced_summarizer.structured_summarize(eval_test_article, "scientific")
    }
    
    print("📝 GENERATED SUMMARIES FOR EVALUATION:")
    print("=" * 60)
    for style, summary in test_summaries.items():
        print(f"\n{style} Summary:")
        print(summary)
        print(f"Length: {len(summary.split())} words")
    
    return test_summaries, eval_test_article

evaluation_summaries, eval_source = comprehensive_evaluation_demo()

### LLM-as-Judge Evaluation

In [None]:
def run_llm_judge_evaluation(summaries, source_text):
    """Run comprehensive LLM-based evaluation"""
    
    print("\n🏛️ LLM-AS-JUDGE EVALUATION:")
    print("=" * 60)
    
    # Evaluate each summary
    evaluations = {}
    
    for style, summary in summaries.items():
        print(f"\n📊 Evaluating {style} Summary...")
        evaluation = modern_evaluator.llm_as_judge(
            summary, 
            source_text,
            criteria=["accuracy", "completeness", "clarity", "conciseness"]
        )
        evaluations[style] = evaluation
        print(evaluation)
    
    # Comparative evaluation
    print(f"\n🥇 COMPARATIVE RANKING:")
    print("=" * 40)
    comparative_eval = modern_evaluator.comparative_evaluation(
        list(summaries.values()),
        source_text,
        list(summaries.keys())
    )
    print(comparative_eval)
    
    return evaluations, comparative_eval

judge_evaluations, comparative_ranking = run_llm_judge_evaluation(evaluation_summaries, eval_source)

### Factuality Assessment

In [None]:
def run_factuality_checks(summaries, source_text):
    """Check factual accuracy of summaries"""
    
    print("\n🔍 FACTUALITY ASSESSMENT:")
    print("=" * 50)
    
    factuality_results = {}
    
    for style, summary in summaries.items():
        print(f"\n🧪 Checking {style} Summary for factual accuracy...")
        factuality_check = modern_evaluator.factuality_check(summary, source_text)
        factuality_results[style] = factuality_check
        print(factuality_check)
    
    return factuality_results

factuality_assessments = run_factuality_checks(evaluation_summaries, eval_source)
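
An LLM-based factuality check can itself make mistakes, so a cheap deterministic check is a useful complement. The sketch below flags numbers that appear in a summary but not in the source. `unsupported_numbers` is a hypothetical helper; it only compares digit strings, so it cannot catch a correct number attached to the wrong fact or a number written out in words. The example texts are made up.

```python
import re

# Digits with optional decimal or thousands separators, e.g. "20", "3.6", "1,500"
NUMBER_PATTERN = re.compile(r"\d+(?:[.,]\d+)*")


def unsupported_numbers(summary, source_text):
    """Return numbers found in the summary that never occur in the source text."""
    source_numbers = set(NUMBER_PATTERN.findall(source_text))
    summary_numbers = set(NUMBER_PATTERN.findall(summary))
    return sorted(summary_numbers - source_numbers)


demo_source = (
    "The rover collected its 20th sample. The return mission launches in 2028 "
    "and is estimated to cost $7 billion."
)
demo_summary = "The rover has 20 samples ahead of a 2030 launch costing $7 billion."

print(unsupported_numbers(demo_summary, demo_source))
```

Any number this check returns is worth reviewing by hand or passing to `factuality_check` for a closer look.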

# Part III

# Multimodal Summarization


Multimodal summarization involves generating concise text that captures information from:
- Text documents
- Images
- Tables and charts
- Audio recordings
- Video content

## Approaches to Multimodal Summarization:

1. **Pipeline Approach**: Process each modality separately, then combine
2. **Unified Models**: Use multimodal models (like CLIP or GPT-4) that understand multiple modalities
3. **Extraction + Description**: Extract elements from non-text modalities and describe them in text
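
The pipeline approach can be sketched as follows: each modality is first converted to text, and the pieces are then combined into a single prompt for a text-only summarizer. Everything here is illustrative. `describe_image` is a placeholder for a real captioning step (it calls no model), and `table_to_text`, `build_pipeline_prompt`, and the sample file name are hypothetical.

```python
def describe_image(image_path):
    """Placeholder for a captioning step; a real pipeline would call a vision model here."""
    return f"[placeholder description of {image_path}]"


def table_to_text(rows):
    """Flatten a table (first row is the header) into one 'column: value' line per row."""
    header, *body = rows
    return "\n".join(
        ", ".join(f"{column}: {value}" for column, value in zip(header, row))
        for row in body
    )


def build_pipeline_prompt(text, image_paths=(), tables=()):
    """Combine text, image descriptions, and tables into one summarization prompt."""
    sections = [f"TEXT:\n{text}"]
    for index, path in enumerate(image_paths, 1):
        sections.append(f"IMAGE {index}:\n{describe_image(path)}")
    for index, rows in enumerate(tables, 1):
        sections.append(f"TABLE {index}:\n{table_to_text(rows)}")
    sections.append("Summarize the combined information above:")
    return "\n\n".join(sections)


demo_prompt = build_pipeline_prompt(
    "Quarterly revenue grew on strong cloud demand.",
    image_paths=["revenue_chart.png"],
    tables=[[["Quarter", "Revenue ($B)"], ["Q3", "2.8"]]],
)
print(demo_prompt)
```

The resulting prompt can be sent to any text-only summarizer. The trade-off is that whatever the per-modality steps fail to put into words is lost before summarization starts.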

## Challenges:

- Aligning information across modalities
- Handling inconsistencies between modalities 
- Determining relative importance of different modalities
- Technical complexity of processing multiple formats

In [None]:
# Implementing a multimodal summarizer for text + image data 

import requests
import base64
import os
import json
from PIL import Image
from io import BytesIO
import matplotlib.pyplot as plt

class MultimodalSummarizer:
    """Multimodal summarizer using Llama-4 Vision capabilities"""
    
    def __init__(self, api_key=None):
        """Initialize the multimodal summarizer with an OPENROUTER API key"""
        # Get API key from environment variable if not provided
        self.api_key = api_key or os.environ.get("OPENROUTER_API_KEY", "")
        if not self.api_key:
            print("Warning: No OPENROUTER API key provided. Please set your OPENROUTER_API_KEY.")
        
        self.api_url = "https://openrouter.ai/api/v1"
        self.model = "meta-llama/llama-4-maverick:free"
        
    def encode_image(self, image_path):
        """Encode an image to base64 for API submission"""
        # Check if it's a URL or local path
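        if image_path.startswith(('http://', 'https://')):
            response = requests.get(image_path, timeout=30)
            response.raise_for_status()
            # JPEG has no alpha channel, so convert (e.g. RGBA PNGs) to RGB before saving
            image = Image.open(BytesIO(response.content)).convert("RGB")
            buffered = BytesIO()
            image.save(buffered, format="JPEG")
            return base64.b64encode(buffered.getvalue()).decode('utf-8')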
        else:
            with open(image_path, "rb") as image_file:
                return base64.b64encode(image_file.read()).decode('utf-8')
    
    def create_payload(self, text, image_paths, max_tokens=500):
        """Create the API payload with text and images"""
        messages = [
            {
                "role": "system",
                "content": "You are a helpful assistant that creates concise summaries from text and images."
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": f"Please create a comprehensive summary that combines information from the following text and images. Focus on integrating visual information with the text content.\n\nTEXT: {text}"}
                ]
            }
        ]
        
        # Add images to the content
        for img_path in image_paths:
            try:
                base64_image = self.encode_image(img_path)
                messages[1]["content"].append(
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}",
                            "detail": "high"
                        }
                    }
                )
            except Exception as e:
                print(f"Error processing image {img_path}: {e}")
        
        return {
            "model": self.model,
            "messages": messages,
            "max_tokens": max_tokens
        }
    
    def summarize_multimodal(self, text, image_paths, max_tokens=500):
        """Generate a summary from text and images using the configured vision model"""
        if not self.api_key:
            return {"error": "No API key provided. Please set your OPENROUTER_API_KEY."}
        
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {self.api_key}"
        }
        
        payload = self.create_payload(text, image_paths, max_tokens)
        
        try:
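            print("Making API call to model...")

            # json= sends the payload as a JSON body; data= would form-encode the dict
            response = requests.post(self.api_url, headers=headers, json=payload, timeout=120)
            response.raise_for_status()
            result = response.json()
            
            return {
                'text_source': text,
                'image_paths': image_paths,
                # OpenAI-compatible responses carry the text in choices[0].message.content
                'combined_summary': result["choices"][0]["message"]["content"]
            }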
            
        except Exception as e:
            return {"error": str(e)}
    
    def display_images(self, image_paths):
        """Display the images used in the multimodal summary"""
        num_images = len(image_paths)
        
        if num_images == 0:
            return
        
        fig, axes = plt.subplots(1, num_images, figsize=(5*num_images, 5))
        
        if num_images == 1:
            axes = [axes]  
            
        for i, img_path in enumerate(image_paths):
            try:
                # Handle both URLs and local paths
                if img_path.startswith(('http://', 'https://')):
                    response = requests.get(img_path)
                    img = Image.open(BytesIO(response.content))
                else:
                    img = Image.open(img_path)
                
                axes[i].imshow(img)
                axes[i].set_title(f"Image {i+1}")
                axes[i].axis('off')
            except Exception as e:
                axes[i].text(0.5, 0.5, f"Error loading image: {e}", 
                             ha='center', va='center', transform=axes[i].transAxes)
        
        plt.tight_layout()
        plt.savefig('multimodal_input.png')
        plt.close()
        
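        # Import under an alias: a local import named Image would shadow PIL's Image
        # for the whole method and make Image.open() above raise UnboundLocalError
        from IPython.display import Image as IPyImage
        return IPyImage('multimodal_input.png')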

In [None]:
image_paths = [
    # "images/image1.jpg",
    # "images/image2.jpg",
    # "images/image3.png"
]

multimodal_summarizer = MultimodalSummarizer()

# Generate a multimodal summary
multimodal_result = multimodal_summarizer.summarize_multimodal(
    article,  # source text; assumes `article` was defined in an earlier cell
    image_paths,
    max_tokens=300
)
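
The dictionary returned by `summarize_multimodal` holds either an `error` key or a `combined_summary` entry. The helper below is a sketch of how the summary text might be read out safely. `get_summary_text` is a hypothetical name, and it assumes that a non-string `combined_summary` is an OpenAI-style chat completion payload, with the text under `choices[0].message.content`.

```python
from typing import Any, Dict


def get_summary_text(result: Dict[str, Any]) -> str:
    """Return the summary text from a summarize_multimodal-style result dict."""
    if "error" in result:
        raise RuntimeError(f"Summarization failed: {result['error']}")

    summary = result["combined_summary"]
    if isinstance(summary, str):
        # The summarizer already extracted the text
        return summary

    # Otherwise assume a raw OpenAI-style chat completion payload
    choices = summary.get("choices") if isinstance(summary, dict) else None
    if not choices:
        raise RuntimeError(f"Unexpected API response: {summary!r}")
    return choices[0]["message"]["content"]


# Example with a hand-written payload (not real model output)
example = {
    "combined_summary": {
        "choices": [{"message": {"content": "Example summary text."}}]
    }
}
print(get_summary_text(example))
```

Raising on an error payload, instead of printing it, keeps a failed API call from being mistaken for a summary later in the notebook.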

# Practical Exercise: Building Your Custom Summarization System

Now it's your turn to build a complete summarization system by combining techniques we've explored.

## Exercise Goals:
1. Create a pipeline that combines multiple approaches
2. Customize control parameters for your specific needs
3. Evaluate results using advanced metrics
4. Compare performance across different text types
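
For goal 4, one possible starting point is a small harness that scores summaries per text type. The sketch below uses a simplified unigram-overlap F1 as a stand-in for ROUGE-1: it has no stemming and only whitespace tokenization, so its numbers will not match an official ROUGE implementation. The function names and the sample data are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1 based on lowercase whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def compare_by_text_type(
    samples: Dict[str, List[Tuple[str, str]]]
) -> Dict[str, float]:
    """Average F1 per text type; each entry is a (summary, reference) pair."""
    return {
        text_type: sum(unigram_f1(s, r) for s, r in pairs) / len(pairs)
        for text_type, pairs in samples.items()
        if pairs
    }


# Made-up pairs, for illustration only
samples = {
    "news": [("the council approved the budget", "council approved the new budget")],
    "dialogue": [("patient reports mild pain", "the patient reported mild chest pain")],
}
for text_type, score in compare_by_text_type(samples).items():
    print(f"{text_type}: {score:.2f}")
```

Swapping `unigram_f1` for a proper ROUGE or BERTScore call leaves the comparison loop unchanged.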

## Project Ideas:
1. **News Summarizer Bot**: Create a system that retrieves and summarizes news articles on specific topics
2. **Meeting Minutes Generator**: Transcribe and summarize meeting audio recordings
3. **Research Paper Summarizer**: Generate summaries of academic papers with focus on methodology and results
4. **Medical Conversation Summarizer**: Summarize doctor-patient conversations, creating dual summaries (technical for doctors, simplified for patients)
5. **EHR Summarizer**: Create a system that generates longitudinal patient summaries from fragmented electronic health records, retrieving and synthesizing information across multiple visits, lab results, and clinical notes


## Useful Resources:

### Code and Libraries:
- [🤗 Transformers Documentation](https://huggingface.co/docs/transformers/index)
- [BART Model Card](https://huggingface.co/facebook/bart-large-cnn)
- [T5 Model Card](https://huggingface.co/t5-base)
- [Text Generation Parameters](https://huggingface.co/blog/mlabonne/decoding-strategies)

### Papers:
- "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension"
- "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"
- "Neural Abstractive Text Summarization with Sequence-to-Sequence Models"

### Datasets:
- [CNN/Daily Mail Dataset](https://huggingface.co/datasets/cnn_dailymail)
- [XSum Dataset](https://huggingface.co/datasets/xsum)
- [Multi-News](https://huggingface.co/datasets/multi_news)
- [BBC News Summary Dataset](https://www.kaggle.com/datasets/pariza/bbc-news-summary)

### Evaluation Tools:
- [ROUGE Implementation in Python](https://github.com/google-research/google-research/tree/master/rouge)
- [BERTScore](https://github.com/Tiiiger/bert_score)


In [None]:
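# Starter code for the Meeting Minutes Generator project idea
import whisper  # provided by the openai-whisper package; needs ffmpeg installed

def transcribe_meeting(audio_file, model_name="base"):
    """Transcribe a meeting recording to plain text with Whisper."""
    model = whisper.load_model(model_name)
    result = model.transcribe(audio_file)
    return result["text"]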
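
Meeting transcripts are often longer than a summarization model's context window, so a common next step is to split the transcript into overlapping chunks and summarize each one. The sketch below shows one way to do that. `chunk_transcript`, the word-based splitting, and the default sizes are assumptions for illustration; a token-based splitter would follow the model's real limits more closely.

```python
from typing import List


def chunk_transcript(
    transcript: str, max_words: int = 400, overlap: int = 50
) -> List[str]:
    """Split a transcript into word-based chunks that overlap slightly."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("Require max_words > 0 and 0 <= overlap < max_words")

    words = transcript.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current chunk reaches the end of the transcript
        if start + max_words >= len(words):
            break
    return chunks


# Tiny example: 10 words, chunks of 4 words sharing 1 word with the previous chunk
demo = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_transcript(demo, max_words=4, overlap=1):
    print(chunk)
```

The overlap repeats a few words across chunk boundaries so that a sentence cut in half still appears whole in at least one chunk.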

# **Facilitator(s) Details**

**Facilitator(s):**

*   Name: Nana Sam Yeboah                       
*   Email: nanayeb34@gmail.com
*   LinkedIn: [Nana Sam Yeboah](https://www.linkedin.com/in/nana-sam-yeboah-0b664484)


*   Name: Audrey Eyram Agbeve
*   Email: audreyagbeve02@gmail.com
*   LinkedIn: [Audrey (Eyram) Agbeve](https://www.linkedin.com/in/audreyagbeve02/)

### Please rate this Tutorial

<img src="images/Day1_feedback.png" height=500 width=500  >