# Text Summarization - Traditional NLP
## Interview Preparation Notebook for Senior Applied AI Scientist (Retail Banking)

---

**Goal**: Demonstrate mastery of extractive summarization techniques, with emphasis on faithfulness, production considerations, and banking document requirements.

**Interview Signal**: This notebook shows you can condense long documents while maintaining accuracy - critical for compliance and executive reporting in banking.

## 1. Business Context (Banking Lens)

### Why Text Summarization Exists in Retail Banking

Banking generates enormous volumes of long-form text that needs to be digested quickly by decision-makers, analysts, and regulators.

| Use Case | Input | Output | Business Value |
|----------|-------|--------|---------------|
| **Earnings Call Digests** | 60-page transcript | 1-page summary | Analyst efficiency |
| **Regulatory Filing Summaries** | 100+ page 10-K | Key risk highlights | Compliance monitoring |
| **Customer Interaction Summaries** | 30-min call transcript | 3-sentence recap | CRM updates |
| **Contract Clause Extraction** | 50-page agreement | Key terms summary | Risk assessment |
| **Complaint Aggregation** | 1000 complaints | Theme summary | Executive dashboards |

### The Business Problem

> "Our risk team reviews 500 regulatory filings per week. How do we help them focus on what matters?"

**Without summarization**: Analysts skim documents, miss critical information  
**With summarization**: Key information surfaced, consistent coverage, audit trail

### Real Banking Example

**Input** (earnings call excerpt, 500 words):
"In Q3, we saw continued strength in our consumer banking segment with deposits growing 12% year-over-year. Net interest income increased by $340 million, driven primarily by higher interest rates. However, provision for credit losses increased to $1.2 billion, reflecting our conservative approach to the uncertain macroeconomic environment. On the commercial side, we saw some softening in loan demand, particularly in commercial real estate where we've maintained disciplined underwriting standards..."

**Summary** (about 35 words):
"Q3 showed strong consumer banking with 12% deposit growth and $340M NII increase from higher rates. Credit provisions rose to $1.2B due to macro uncertainty. Commercial lending softened, especially in CRE, with disciplined underwriting maintained."

### Interview Framing

```
"Summarization in banking requires a different mindset than news summarization. We can't 
afford hallucinations - if a summary says the provision for credit losses is $1.5B when 
the actual number is $1.2B, that's a compliance failure. That's why I prefer extractive 
methods for factual documents and reserve abstractive approaches for less critical use cases."
```

## 2. Problem Definition

### Task Type: Document Compression

| Aspect | Description |
|--------|-------------|
| **Input** | Long document (100-10,000+ words) |
| **Output** | Shorter summary (10-500 words) |
| **Key Metric** | Compression ratio + information retention |
| **Core Challenge** | What to keep vs. what to drop |
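
Both halves of the key metric can be made concrete. The following is a minimal sketch, assuming whitespace tokenization; `compression_ratio` and `term_retention` are hypothetical helper names, and term retention is only a crude proxy for information retention, not a standard metric.

```python
def compression_ratio(source: str, summary: str) -> float:
    """Fraction of the source removed: 0.0 means nothing removed, near 1.0 means almost all removed."""
    source_words = len(source.split())
    if source_words == 0:
        raise ValueError("source must contain at least one word")
    return 1 - len(summary.split()) / source_words


def term_retention(summary: str, key_terms: list[str]) -> float:
    """Share of caller-supplied key terms found verbatim (case-insensitive) in the summary."""
    if not key_terms:
        return 1.0
    lowered = summary.lower()
    return sum(term.lower() in lowered for term in key_terms) / len(key_terms)


source = "Deposits grew 12% year-over-year. Provisions rose to $1.2 billion. The call lasted one hour."
summary = "Deposits grew 12% year-over-year. Provisions rose to $1.2 billion."

print(f"Compression: {compression_ratio(source, summary):.0%}")
print(f"Retention:   {term_retention(summary, ['12%', '$1.2 billion']):.0%}")
```

The two numbers pull in opposite directions: dropping more text raises compression but risks losing a key term, which is the "what to keep vs. what to drop" trade-off in the table.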

### Extractive vs. Abstractive Summarization

| Approach | Method | Faithfulness | Fluency | Banking Use |
|----------|--------|--------------|---------|-------------|
| **Extractive** | Select existing sentences | High (verbatim) | Lower | Regulatory docs, contracts |
| **Abstractive** | Generate new text | Risk of errors | Higher | Executive briefings |

### Why Extractive Before LLMs

1. **No hallucination risk**: Selected sentences exist verbatim in the source (a sentence taken out of context can still mislead, so review remains necessary)
2. **Attributable**: Can point to exact source location
3. **Deterministic**: Same input → same output
4. **Fast**: No generation, just selection
5. **Works without training data**: Unsupervised methods available
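
Points 1 and 2 can be checked mechanically. Here is a minimal sketch, assuming summary sentences are meant to be verbatim copies of source text; `attribute_sentences` is a hypothetical helper, not part of any library.

```python
def attribute_sentences(source: str, summary_sentences: list[str]) -> list[dict]:
    """Map each summary sentence to its character span in the source.

    A sentence that cannot be found verbatim gets start/end of None,
    signalling that it was altered or did not come from this source.
    """
    records = []
    for sentence in summary_sentences:
        start = source.find(sentence)
        found = start != -1
        records.append({
            "sentence": sentence,
            "start": start if found else None,
            "end": start + len(sentence) if found else None,
            "verbatim": found,
        })
    return records


source = "Deposits grew 12%. Provisions rose to $1.2 billion. Credit quality remains strong."
picked = ["Provisions rose to $1.2 billion.", "Provisions rose to $1.5 billion."]

records = attribute_sentences(source, picked)
for record in records:
    print(record["verbatim"], record["start"], record["end"])
```

If whitespace is normalized before sentence splitting, the same normalization has to be applied to the source before searching, otherwise genuine extracts may fail to match.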

### Mathematical Formulation

**Sentence Selection Problem**:
Given document $D = \{s_1, s_2, \ldots, s_n\}$ (sentences), select a subset $S \subseteq D$ that:
- Maximizes coverage: $\text{Coverage}(S, D)$
- Minimizes redundancy: $\text{Redundancy}(S)$
- Satisfies length constraint: $|S| \leq k$ sentences or $\sum_{s \in S} |s| \leq L$ words
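
One common way to fold the first two criteria into a single objective is a weighted trade-off. This is a sketch rather than the only formulation; the weight $\lambda \in [0, 1]$ is an assumption introduced here for illustration, and it plays the same role as the `lambda_param` of the MMR method in Section 4:

$$
S^{*} = \underset{S \subseteq D,\; |S| \leq k}{\arg\max} \;\; \lambda \cdot \text{Coverage}(S, D) \;-\; (1 - \lambda) \cdot \text{Redundancy}(S)
$$

With $\lambda = 1$ redundancy is ignored entirely; smaller values penalize overlapping sentences more heavily. Because searching all subsets is combinatorial, practical methods usually build $S$ greedily, one sentence at a time.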

## 3. Dataset

### Dataset Selection

For this demo, we'll create sample financial documents. In practice, you'd use:
- **CNN/DailyMail**: News articles with highlights
- **XSum**: Extreme summarization (single sentence)
- **Financial reports**: SEC filings, earnings transcripts

In [None]:
# Install required packages
# !pip install scikit-learn nltk networkx numpy pandas matplotlib

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
import re
import warnings
warnings.filterwarnings('ignore')

# NLP libraries
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

# For TextRank
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('punkt_tab', quiet=True)

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("Libraries loaded successfully")

In [None]:
# Sample banking documents for summarization
sample_documents = [
    {
        "title": "Q3 2024 Earnings Call Transcript (Excerpt)",
        "text": """Good morning and welcome to our third quarter 2024 earnings call. 
        
I'm pleased to report that we delivered strong results this quarter with revenue of $32.4 billion, up 8% year-over-year. Our consumer banking division showed particular strength with deposits growing 12% compared to the same period last year. This growth was driven by successful marketing campaigns and competitive interest rates on savings products.

Net interest income increased by $340 million to reach $14.2 billion, primarily due to higher interest rates. Our net interest margin expanded by 15 basis points to 2.85%. We continue to benefit from the rate environment while managing our funding costs carefully.

However, we did increase our provision for credit losses to $1.2 billion this quarter, up from $800 million in Q2. This reflects our conservative approach to the uncertain macroeconomic environment. We are seeing some early signs of stress in our credit card portfolio, particularly in the subprime segment, but overall credit quality remains strong.

On the commercial banking side, we saw some softening in loan demand, especially in commercial real estate. CRE loans declined by 3% as we maintained our disciplined underwriting standards. We remain cautious about office space lending given the ongoing shift to hybrid work arrangements.

Our wealth management division delivered record revenue of $5.8 billion, driven by strong asset inflows and higher transaction volumes. Assets under management grew to $4.2 trillion, up 15% year-over-year.

Looking ahead, we expect continued pressure on net interest margin as deposit competition intensifies. We are investing heavily in digital capabilities to improve customer experience and reduce operating costs. Our technology spend will increase by 12% next year.

We remain committed to returning capital to shareholders. This quarter, we repurchased $2.5 billion of common stock and paid dividends of $1.1 billion. Our CET1 ratio stands at 12.8%, well above regulatory requirements.

In summary, we delivered solid results despite a challenging environment. Our diversified business model continues to serve us well, and we are well positioned for the quarters ahead.""",
        "type": "earnings_call"
    },
    {
        "title": "Risk Management Policy Update",
        "text": """This document outlines updates to our enterprise risk management framework effective January 1, 2025.

The Board of Directors has approved several enhancements to our risk appetite statement. Our credit risk appetite for commercial real estate has been reduced from 15% to 12% of total loans. This change reflects current market conditions and our assessment of sector-specific risks.

Market risk limits have been updated to reflect increased volatility in interest rate markets. The Value-at-Risk limit for the trading book has been increased from $50 million to $65 million to accommodate larger hedging positions. However, stress testing requirements have been enhanced with three additional scenarios.

Operational risk management procedures have been strengthened following recent industry incidents. All critical systems must now have recovery time objectives of less than four hours, reduced from the previous eight-hour standard. Cyber security investments will increase by 25% in the coming year.

Model risk governance has been enhanced with the creation of a dedicated Model Validation team reporting directly to the Chief Risk Officer. All pricing and credit risk models must be independently validated annually, with challenger models required for Tier 1 applications.

Liquidity risk management now requires maintaining a liquidity coverage ratio of 120%, up from the regulatory minimum of 100%. Additional stress scenarios for deposit outflows have been implemented based on recent regional bank experiences.

Compliance risk procedures have been updated to address new regulatory requirements. Anti-money laundering transaction monitoring thresholds have been lowered, and suspicious activity report filing timelines have been shortened from 30 to 15 days.

The Risk Committee will review these policies quarterly and report to the Board on implementation progress.""",
        "type": "policy"
    }
]

print(f"Loaded {len(sample_documents)} sample documents")

In [None]:
# Analyze documents
for doc in sample_documents:
    sentences = sent_tokenize(doc['text'])
    words = word_tokenize(doc['text'])
    print(f"\nDocument: {doc['title']}")
    print(f"  Sentences: {len(sentences)}")
    print(f"  Words: {len(words)}")
    print(f"  Avg words/sentence: {len(words)/len(sentences):.1f}")

## 4. Traditional NLP Pipeline

### 4.1 Text Preprocessing for Summarization

In [None]:
class SummarizationPreprocessor:
    """
    Preprocessor for extractive summarization.
    
    Key considerations:
    - Sentence boundaries must be preserved (we select sentences)
    - Keep structure for ranking, but normalize for comparison
    - Handle financial text patterns (numbers, percentages)
    """
    
    def __init__(self):
        self.stop_words = set(stopwords.words('english'))
    
    def split_sentences(self, text):
        """Split text into sentences while handling edge cases."""
        # Clean up whitespace
        text = re.sub(r'\s+', ' ', text.strip())
        
        # Use NLTK sentence tokenizer
        sentences = sent_tokenize(text)
        
        # Filter very short sentences
        sentences = [s.strip() for s in sentences if len(s.split()) >= 5]
        
        return sentences
    
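    def normalize_sentence(self, sentence):
        """Normalize sentence for comparison (lowercase, drop stopwords and punctuation)."""
        words = word_tokenize(sentence.lower())
        # Keep any token containing a letter or digit so figures such as '1.2' or
        # 'year-over-year' survive; str.isalnum() alone would discard them.
        words = [
            w for w in words
            if any(ch.isalnum() for ch in w) and w not in self.stop_words
        ]
        return ' '.join(words)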
    
    def preprocess_document(self, text):
        """Preprocess document for summarization."""
        sentences = self.split_sentences(text)
        normalized = [self.normalize_sentence(s) for s in sentences]
        
        return {
            'original_sentences': sentences,
            'normalized_sentences': normalized,
            'num_sentences': len(sentences)
        }

preprocessor = SummarizationPreprocessor()

# Demo
doc = sample_documents[0]
processed = preprocessor.preprocess_document(doc['text'])

print(f"PREPROCESSING DEMO")
print(f"="*60)
print(f"Original sentences: {processed['num_sentences']}")
print(f"\nFirst sentence (original):")
print(f"  {processed['original_sentences'][0]}")
print(f"\nFirst sentence (normalized):")
print(f"  {processed['normalized_sentences'][0]}")

### 4.2 Extractive Summarization Methods

#### Method 1: TF-IDF Based Scoring

**Intuition**: Important sentences contain important words (high TF-IDF)

In [None]:
class TFIDFSummarizer:
    """
    TF-IDF based extractive summarizer.
    
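    Sentence importance = sum of TF-IDF scores of words in sentence.
    TfidfVectorizer L2-normalizes each row by default, so this sum still grows
    with the number of distinct terms and tends to favor longer sentences.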
    """
    
    def __init__(self):
        self.vectorizer = TfidfVectorizer(stop_words='english')
        self.preprocessor = SummarizationPreprocessor()
    
    def summarize(self, text, num_sentences=3):
        """Extract top sentences based on TF-IDF scores."""
        processed = self.preprocessor.preprocess_document(text)
        original_sentences = processed['original_sentences']
        
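        if len(original_sentences) <= num_sentences:
            # Nothing to rank: return every sentence in the same dict shape as below
            return {
                'summary': ' '.join(original_sentences),
                'selected_indices': list(range(len(original_sentences))),
                'scores': []
            }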
        
        # Calculate TF-IDF matrix
        tfidf_matrix = self.vectorizer.fit_transform(original_sentences)
        
        # Score each sentence by sum of TF-IDF values
        sentence_scores = np.asarray(tfidf_matrix.sum(axis=1)).flatten()
        
        # Get top sentence indices (maintain original order)
        top_indices = sentence_scores.argsort()[-num_sentences:][::-1]
        top_indices = sorted(top_indices)  # Maintain document order
        
        # Extract sentences
        summary_sentences = [original_sentences[i] for i in top_indices]
        
        return {
            'summary': ' '.join(summary_sentences),
            'selected_indices': top_indices,
            'scores': [(i, sentence_scores[i]) for i in top_indices]
        }

# Test TF-IDF summarizer
tfidf_summarizer = TFIDFSummarizer()

result = tfidf_summarizer.summarize(sample_documents[0]['text'], num_sentences=4)

print("TF-IDF SUMMARIZATION")
print("="*60)
print(f"\nSummary ({len(result['summary'].split())} words):")
print(result['summary'])

#### Method 2: TextRank (Graph-Based)

**Intuition**: Important sentences are similar to many other important sentences (like PageRank for web pages)
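
The PageRank idea behind this intuition can be written out as a short power iteration. This is a sketch for intuition only, assuming a symmetric, non-negative similarity matrix with a zero diagonal; the matrix values are made up and `weighted_pagerank` is a hypothetical helper, not a library function.

```python
def weighted_pagerank(similarity, damping=0.85, iterations=100, tol=1e-8):
    """Power iteration for PageRank on a weighted sentence-similarity matrix."""
    n = len(similarity)
    out_weight = [sum(row) for row in similarity]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Score held by sentences with no edges is spread uniformly over all sentences
        dangling = sum(scores[j] for j in range(n) if out_weight[j] == 0)
        new_scores = []
        for i in range(n):
            # Each neighbor j passes on its score in proportion to the edge weight j -> i
            incoming = sum(
                scores[j] * similarity[j][i] / out_weight[j]
                for j in range(n) if out_weight[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * (incoming + dangling / n))
        converged = sum(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores


# Hypothetical similarities: sentences 0-2 overlap heavily, sentence 3 barely connects
similarity = [
    [0.0, 0.6, 0.5, 0.0],
    [0.6, 0.0, 0.4, 0.0],
    [0.5, 0.4, 0.0, 0.1],
    [0.0, 0.0, 0.1, 0.0],
]
scores = weighted_pagerank(similarity)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print("Sentences ranked by centrality:", ranking)
```

The weakly connected sentence ends up last in the ranking, which is the behavior the intuition describes: a sentence earns a high score only by being similar to other well-connected sentences.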

In [None]:
class TextRankSummarizer:
    """
    TextRank algorithm for extractive summarization.
    
    Based on the PageRank algorithm:
    1. Build similarity graph between sentences
    2. Run PageRank to find important sentences
    3. Select top-ranked sentences
    """
    
    def __init__(self, similarity_threshold=0.1):
        self.similarity_threshold = similarity_threshold
        self.vectorizer = TfidfVectorizer(stop_words='english')
        self.preprocessor = SummarizationPreprocessor()
    
    def build_similarity_matrix(self, sentences):
        """Build sentence similarity matrix using cosine similarity."""
        tfidf_matrix = self.vectorizer.fit_transform(sentences)
        similarity_matrix = cosine_similarity(tfidf_matrix)
        
        # Apply threshold to create sparse graph
        similarity_matrix[similarity_matrix < self.similarity_threshold] = 0
        
        # Remove self-loops
        np.fill_diagonal(similarity_matrix, 0)
        
        return similarity_matrix
    
    def summarize(self, text, num_sentences=3):
        """Extract top sentences using TextRank."""
        processed = self.preprocessor.preprocess_document(text)
        original_sentences = processed['original_sentences']
        
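        if len(original_sentences) <= num_sentences:
            # Nothing to rank: return every sentence in the same dict shape as below
            return {
                'summary': ' '.join(original_sentences),
                'selected_indices': list(range(len(original_sentences))),
                'scores': []
            }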
        
        # Build similarity matrix
        similarity_matrix = self.build_similarity_matrix(original_sentences)
        
        # Create graph and run PageRank
        graph = nx.from_numpy_array(similarity_matrix)
        scores = nx.pagerank(graph, max_iter=100)
        
        # Convert to array
        sentence_scores = np.array([scores.get(i, 0) for i in range(len(original_sentences))])
        
        # Get top sentence indices (maintain original order)
        top_indices = sentence_scores.argsort()[-num_sentences:][::-1]
        top_indices = sorted(top_indices)
        
        # Extract sentences
        summary_sentences = [original_sentences[i] for i in top_indices]
        
        return {
            'summary': ' '.join(summary_sentences),
            'selected_indices': top_indices,
            'scores': [(i, sentence_scores[i]) for i in top_indices]
        }

# Test TextRank summarizer
textrank_summarizer = TextRankSummarizer(similarity_threshold=0.1)

result = textrank_summarizer.summarize(sample_documents[0]['text'], num_sentences=4)

print("TEXTRANK SUMMARIZATION")
print("="*60)
print(f"\nSummary ({len(result['summary'].split())} words):")
print(result['summary'])

#### Method 3: Lead Bias (Position-Based)

**Intuition**: Important information often appears at the beginning (news articles, executive summaries)
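
How strongly a position prior acts depends entirely on the decay rate. This is a small sketch, assuming an exponential weight of `decay ** i` for the sentence at index `i`; `position_weights` is a hypothetical helper used only for this illustration.

```python
def position_weights(num_sentences: int, decay: float) -> list[float]:
    """Weight for sentence i is decay ** i, so the first sentence always gets 1.0."""
    return [decay ** i for i in range(num_sentences)]


for decay in (0.9, 0.95):
    weights = position_weights(24, decay)
    print(f"decay={decay}: 1st={weights[0]:.2f}, 10th={weights[9]:.2f}, 24th={weights[-1]:.2f}")
```

With a decay of 0.9 the tenth sentence keeps roughly 0.39 of its content score, against roughly 0.63 with 0.95. For documents whose key figures appear mid-text, such as an earnings call that opens with a greeting, a gentler decay is the safer starting point.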

In [None]:
class LeadSummarizer:
    """
    Lead-based summarizer with position decay.
    
    Combines content importance with position bias:
    Score = content_score * position_weight
    """
    
    def __init__(self, position_decay=0.9):
        self.position_decay = position_decay
        self.vectorizer = TfidfVectorizer(stop_words='english')
        self.preprocessor = SummarizationPreprocessor()
    
    def summarize(self, text, num_sentences=3):
        """Extract sentences with position bias."""
        processed = self.preprocessor.preprocess_document(text)
        original_sentences = processed['original_sentences']
        
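        if len(original_sentences) <= num_sentences:
            # Nothing to rank: return every sentence in the same dict shape as below
            return {
                'summary': ' '.join(original_sentences),
                'selected_indices': list(range(len(original_sentences))),
                'scores': []
            }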
        
        # Calculate content scores (TF-IDF)
        tfidf_matrix = self.vectorizer.fit_transform(original_sentences)
        content_scores = np.asarray(tfidf_matrix.sum(axis=1)).flatten()
        content_scores = content_scores / content_scores.max()  # Normalize
        
        # Calculate position weights (exponential decay)
        position_weights = np.array([
            self.position_decay ** i 
            for i in range(len(original_sentences))
        ])
        
        # Combined scores
        combined_scores = content_scores * position_weights
        
        # Get top sentence indices
        top_indices = combined_scores.argsort()[-num_sentences:][::-1]
        top_indices = sorted(top_indices)
        
        summary_sentences = [original_sentences[i] for i in top_indices]
        
        return {
            'summary': ' '.join(summary_sentences),
            'selected_indices': top_indices,
            'scores': [(i, combined_scores[i]) for i in top_indices]
        }

# Test Lead summarizer
lead_summarizer = LeadSummarizer(position_decay=0.95)

result = lead_summarizer.summarize(sample_documents[0]['text'], num_sentences=4)

print("LEAD-BIASED SUMMARIZATION")
print("="*60)
print(f"\nSummary ({len(result['summary'].split())} words):")
print(result['summary'])

#### Method 4: MMR (Maximal Marginal Relevance)

**Intuition**: Balance relevance with diversity - don't select redundant sentences

In [None]:
class MMRSummarizer:
    """
    Maximal Marginal Relevance summarizer.
    
    MMR = λ * Sim(s, query) - (1-λ) * max(Sim(s, s_selected))
    
    Balances:
    - Relevance to document (coverage)
    - Novelty compared to already selected sentences (diversity)
    """
    
    def __init__(self, lambda_param=0.7):
        self.lambda_param = lambda_param
        self.vectorizer = TfidfVectorizer(stop_words='english')
        self.preprocessor = SummarizationPreprocessor()
    
    def summarize(self, text, num_sentences=3):
        """Extract sentences using MMR."""
        processed = self.preprocessor.preprocess_document(text)
        original_sentences = processed['original_sentences']
        
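        if len(original_sentences) <= num_sentences:
            # Nothing to rank: return every sentence in the same dict shape as below
            return {
                'summary': ' '.join(original_sentences),
                'selected_indices': list(range(len(original_sentences)))
            }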
        
        # Calculate TF-IDF representations
        tfidf_matrix = self.vectorizer.fit_transform(original_sentences)
        
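        # Document centroid (query proxy). A sparse .mean() returns np.matrix, which
        # recent scikit-learn versions reject, so convert it to a plain ndarray.
        doc_centroid = np.asarray(tfidf_matrix.mean(axis=0))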
        
        # Similarity to document
        doc_similarity = cosine_similarity(tfidf_matrix, doc_centroid).flatten()
        
        # Pairwise sentence similarity
        sentence_similarity = cosine_similarity(tfidf_matrix)
        
        # Greedy MMR selection
        selected_indices = []
        remaining_indices = list(range(len(original_sentences)))
        
        for _ in range(num_sentences):
            best_idx = None
            best_score = float('-inf')
            
            for idx in remaining_indices:
                # Relevance to document
                relevance = doc_similarity[idx]
                
                # Maximum similarity to already selected
                if selected_indices:
                    redundancy = max(sentence_similarity[idx][j] for j in selected_indices)
                else:
                    redundancy = 0
                
                # MMR score
                mmr_score = self.lambda_param * relevance - (1 - self.lambda_param) * redundancy
                
                if mmr_score > best_score:
                    best_score = mmr_score
                    best_idx = idx
            
            if best_idx is not None:
                selected_indices.append(best_idx)
                remaining_indices.remove(best_idx)
        
        # Sort to maintain document order
        selected_indices = sorted(selected_indices)
        summary_sentences = [original_sentences[i] for i in selected_indices]
        
        return {
            'summary': ' '.join(summary_sentences),
            'selected_indices': selected_indices
        }

# Test MMR summarizer
mmr_summarizer = MMRSummarizer(lambda_param=0.7)

result = mmr_summarizer.summarize(sample_documents[0]['text'], num_sentences=4)

print("MMR SUMMARIZATION")
print("="*60)
print(f"\nSummary ({len(result['summary'].split())} words):")
print(result['summary'])

## 5. Model Training & Inference

In [None]:
# Compare all methods on the same document
def compare_summarizers(text, num_sentences=4):
    """Compare all summarization methods."""
    
    summarizers = {
        'TF-IDF': TFIDFSummarizer(),
        'TextRank': TextRankSummarizer(similarity_threshold=0.1),
        'Lead-Biased': LeadSummarizer(position_decay=0.95),
        'MMR': MMRSummarizer(lambda_param=0.7)
    }
    
    results = {}
    for name, summarizer in summarizers.items():
        result = summarizer.summarize(text, num_sentences)
        results[name] = {
            'summary': result['summary'],
            'word_count': len(result['summary'].split()),
            'selected_indices': result['selected_indices']
        }
    
    return results

# Compare on earnings call
doc = sample_documents[0]
original_word_count = len(doc['text'].split())

print(f"SUMMARIZATION COMPARISON")
print(f"="*70)
print(f"Document: {doc['title']}")
print(f"Original length: {original_word_count} words")
print(f"Target: 4 sentences\n")

results = compare_summarizers(doc['text'], num_sentences=4)

for name, result in results.items():
    compression = 100 * (1 - result['word_count'] / original_word_count)
    print(f"\n--- {name} ---")
    print(f"Sentences selected: {result['selected_indices']}")
    print(f"Words: {result['word_count']} (compression: {compression:.1f}%)")
    print(f"Summary: {result['summary'][:200]}...")

In [None]:
# Production summarization pipeline
class ProductionSummarizer:
    """
    Production-ready summarizer with multiple strategy options.
    """
    
    def __init__(self, default_method='mmr'):
        self.methods = {
            'tfidf': TFIDFSummarizer(),
            'textrank': TextRankSummarizer(),
            'lead': LeadSummarizer(),
            'mmr': MMRSummarizer()
        }
        self.default_method = default_method
        self.preprocessor = SummarizationPreprocessor()
    
    def summarize(self, text, target_words=None, target_sentences=None, 
                  method=None, min_sentence_length=10):
        """
        Generate summary with specified constraints.
        
        Args:
            text: Input document
            target_words: Approximate word count target
            target_sentences: Number of sentences to select
            method: Summarization method to use
            min_sentence_length: Minimum words per sentence (currently unused; the preprocessor applies a fixed 5-word minimum)
        """
        method = method or self.default_method
        
        # Validate input
        if not text or len(text.strip()) < 50:
            return {
                'status': 'error',
                'message': 'Input text too short'
            }
        
        # Preprocess to get sentence count
        processed = self.preprocessor.preprocess_document(text)
        total_sentences = processed['num_sentences']
        total_words = len(text.split())
        
        # Determine number of sentences to extract
        if target_words:
            # Estimate based on average sentence length
            avg_sentence_length = total_words / total_sentences
            num_sentences = max(1, int(target_words / avg_sentence_length))
        elif target_sentences:
            num_sentences = target_sentences
        else:
            # Default: keep ~30% of sentences (roughly 70% compression)
            num_sentences = max(1, int(total_sentences * 0.3))
        
        # Cap at reasonable limits
        num_sentences = min(num_sentences, total_sentences - 1)
        num_sentences = max(1, num_sentences)
        
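        # Generate summary (reject unknown method names instead of raising KeyError)
        if method not in self.methods:
            return {
                'status': 'error',
                'message': f"Unknown method '{method}'. Choose from: {sorted(self.methods)}"
            }
        summarizer = self.methods[method]
        result = summarizer.summarize(text, num_sentences)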
        
        # Calculate metrics
        summary_words = len(result['summary'].split())
        compression_ratio = 1 - (summary_words / total_words)
        
        return {
            'status': 'success',
            'summary': result['summary'],
            'method': method,
            'metrics': {
                'original_words': total_words,
                'summary_words': summary_words,
                'compression_ratio': compression_ratio,
                'sentences_selected': len(result['selected_indices']),
                'sentences_total': total_sentences
            },
            'source_indices': result['selected_indices']
        }

# Test production summarizer
prod_summarizer = ProductionSummarizer(default_method='mmr')

result = prod_summarizer.summarize(
    sample_documents[0]['text'],
    target_words=100,
    method='mmr'
)

print("PRODUCTION SUMMARIZER OUTPUT")
print("="*60)
print(f"Status: {result['status']}")
print(f"Method: {result['method']}")
print(f"\nMetrics:")
for key, value in result['metrics'].items():
    if isinstance(value, float):
        print(f"  {key}: {value:.2%}")
    else:
        print(f"  {key}: {value}")
print(f"\nSummary:\n{result['summary']}")
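
The pipeline above turns `target_words` into a sentence count via the average sentence length, so the actual summary can overshoot the word target when long sentences are selected. A minimal sketch of a stricter alternative follows, assuming per-sentence scores are already available; `select_within_budget` is a hypothetical helper and the greedy rule is one simple choice among several.

```python
def select_within_budget(sentences, scores, max_words):
    """Greedily take the highest-scoring sentences that still fit in the word budget."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for i in ranked:
        length = len(sentences[i].split())
        if used + length <= max_words:
            chosen.append(i)
            used += length
    return sorted(chosen)  # restore document order


sentences = [
    "Revenue rose 8% to $32.4 billion.",
    "Thank you all for joining us today.",
    "Provisions for credit losses increased to $1.2 billion this quarter.",
    "CET1 stands at 12.8%.",
]
scores = [0.9, 0.1, 0.8, 0.5]  # made-up importance scores

selected = select_within_budget(sentences, scores, max_words=12)
print(selected, [sentences[i] for i in selected])
```

A sentence that does not fit is skipped rather than ending the loop, so a shorter, lower-ranked sentence can still use the remaining budget.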

## 6. Evaluation Strategy

### Why Standard Metrics Are Tricky for Summarization

Unlike classification, there's no single "correct" summary. Multiple valid summaries exist.

### ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

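| Metric | Measures | Recall Formula | What It Captures |
|--------|----------|----------------|------------------|
| **ROUGE-1** | Unigram overlap | matched unigrams / reference unigrams | Content coverage |
| **ROUGE-2** | Bigram overlap | matched bigrams / reference bigrams | Local word order (a rough fluency proxy) |
| **ROUGE-L** | Longest common subsequence | LCS length / reference length | Sentence-level structure |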
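
A worked example makes the overlap arithmetic concrete. This is a minimal sketch, assuming lowercase whitespace tokenization and clipped n-gram counts; `rouge_n` is a hypothetical helper, not the API of any ROUGE package.

```python
from collections import Counter


def rouge_n(candidate: str, reference: str, n: int = 1) -> dict:
    """ROUGE-N with clipped n-gram counts over lowercase whitespace tokens."""
    def ngram_counts(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngram_counts(candidate), ngram_counts(reference)
    # Counter intersection keeps min(count) per n-gram, which is the clipping step
    overlap = sum((cand & ref).values())
    precision = overlap / max(1, sum(cand.values()))
    recall = overlap / max(1, sum(ref.values()))
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


reference = "provisions rose to 1.2 billion"
candidate = "provisions rose rose to 1.5 billion"

print("ROUGE-1:", rouge_n(candidate, reference, n=1))
print("ROUGE-2:", rouge_n(candidate, reference, n=2))
```

Here four of the five reference unigrams are matched (recall 0.8) even though the candidate states the wrong figure. ROUGE rewards word overlap and says nothing about factual correctness, which is why it cannot be the only metric for banking summaries.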

In [None]:
def calculate_rouge_scores(summary, reference):
    """
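    Calculate ROUGE scores (simplified implementation).
    
    N-grams are compared as sets, so repeated n-grams count once; standard ROUGE
    uses clipped counts, and scores can differ. In production, use the
    'rouge-score' library.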
    """
    def get_ngrams(text, n):
        words = word_tokenize(text.lower())
        return set(tuple(words[i:i+n]) for i in range(len(words)-n+1))
    
    def lcs_length(s1, s2):
        """Longest common subsequence."""
        words1 = word_tokenize(s1.lower())
        words2 = word_tokenize(s2.lower())
        m, n = len(words1), len(words2)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if words1[i-1] == words2[j-1]:
                    dp[i][j] = dp[i-1][j-1] + 1
                else:
                    dp[i][j] = max(dp[i-1][j], dp[i][j-1])
        
        return dp[m][n]
    
    # ROUGE-1
    summary_unigrams = get_ngrams(summary, 1)
    reference_unigrams = get_ngrams(reference, 1)
    overlap_1 = len(summary_unigrams & reference_unigrams)
    rouge1_precision = overlap_1 / max(1, len(summary_unigrams))
    rouge1_recall = overlap_1 / max(1, len(reference_unigrams))
    rouge1_f1 = 2 * rouge1_precision * rouge1_recall / max(0.001, rouge1_precision + rouge1_recall)
    
    # ROUGE-2
    summary_bigrams = get_ngrams(summary, 2)
    reference_bigrams = get_ngrams(reference, 2)
    overlap_2 = len(summary_bigrams & reference_bigrams)
    rouge2_precision = overlap_2 / max(1, len(summary_bigrams))
    rouge2_recall = overlap_2 / max(1, len(reference_bigrams))
    rouge2_f1 = 2 * rouge2_precision * rouge2_recall / max(0.001, rouge2_precision + rouge2_recall)
    
    # ROUGE-L
    lcs = lcs_length(summary, reference)
    summary_words = len(word_tokenize(summary))
    reference_words = len(word_tokenize(reference))
    rougel_precision = lcs / max(1, summary_words)
    rougel_recall = lcs / max(1, reference_words)
    rougel_f1 = 2 * rougel_precision * rougel_recall / max(0.001, rougel_precision + rougel_recall)
    
    return {
        'rouge1': {'precision': rouge1_precision, 'recall': rouge1_recall, 'f1': rouge1_f1},
        'rouge2': {'precision': rouge2_precision, 'recall': rouge2_recall, 'f1': rouge2_f1},
        'rougeL': {'precision': rougel_precision, 'recall': rougel_recall, 'f1': rougel_f1}
    }

# Example evaluation
reference_summary = """Q3 showed strong revenue growth of 8% to $32.4 billion. Consumer banking 
deposits grew 12% and net interest income increased $340 million. Credit provisions rose to 
$1.2 billion due to economic uncertainty. Commercial real estate lending softened with 
disciplined underwriting maintained."""

generated_summary = result['summary']

scores = calculate_rouge_scores(generated_summary, reference_summary)

print("ROUGE EVALUATION")
print("="*40)
for metric, values in scores.items():
    print(f"\n{metric.upper()}:")
    print(f"  Precision: {values['precision']:.4f}")
    print(f"  Recall: {values['recall']:.4f}")
    print(f"  F1: {values['f1']:.4f}")

### Banking-Specific Evaluation Criteria

| Criterion | Description | How to Measure |
|-----------|-------------|----------------|
| **Factual Accuracy** | Numbers, names, dates correct | Manual review / NER comparison |
| **Key Information** | Critical facts included | Checklist coverage |
| **No Hallucination** | Everything in summary exists in source | Attribution check |
| **Coherence** | Summary reads naturally | Human rating |
| **Compression** | Significant length reduction | Word count ratio |
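
The factual-accuracy and no-hallucination rows can be partly automated for numbers. The following is a minimal sketch, assuming figures are written the same way in source and summary; the regular expression and the helper names `extract_numbers` and `unsupported_numbers` are illustrative assumptions, not an exhaustive financial-number parser.

```python
import re

# Matches tokens such as "$1.2", "12%", "340", or "1,200" (assumed pattern, not exhaustive)
NUMBER_PATTERN = re.compile(r"\$?\d[\d,]*(?:\.\d+)?%?")


def extract_numbers(text: str) -> set[str]:
    """Collect numeric tokens, dropping thousands separators so '1,200' equals '1200'."""
    return {token.replace(",", "") for token in NUMBER_PATTERN.findall(text)}


def unsupported_numbers(source: str, summary: str) -> set[str]:
    """Numeric tokens that occur in the summary but nowhere in the source."""
    return extract_numbers(summary) - extract_numbers(source)


source = "Provisions rose to $1.2 billion, up from $800 million in Q2. Deposits grew 12%."
faithful = "Provisions rose to $1.2 billion. Deposits grew 12%."
altered = "Provisions rose to $1.5 billion. Deposits grew 12%."

print(unsupported_numbers(source, faithful))
print(unsupported_numbers(source, altered))
```

The check ignores units and context: `$1.2 billion` rewritten as `$1,200 million` would be flagged, and a correct number attached to the wrong metric would pass. It is a cheap first filter ahead of manual review, not a replacement for it.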

## 7. Production Readiness Checklist

```
INPUT HANDLING
[ ] Document length limits (min/max)
[ ] Character encoding normalization
[ ] Language detection
[ ] Multi-document support (if needed)

SUMMARIZATION QUALITY
[ ] Factual consistency checking
[ ] Key entity preservation (names, numbers)
[ ] Attribution to source sentences
[ ] Readability scoring

OUTPUT REQUIREMENTS
[ ] Configurable length (words/sentences)
[ ] Sentence boundary preservation
[ ] Source attribution (which sentences selected)
[ ] Confidence/quality score

BANKING-SPECIFIC
[ ] Numerical accuracy verification
[ ] Key metric extraction (revenue, profit, etc.)
[ ] Sentiment preservation
[ ] Regulatory term handling

MONITORING
[ ] Compression ratio tracking
[ ] Processing time monitoring
[ ] Error rate tracking
[ ] Human feedback loop

GOVERNANCE
[ ] Audit trail (input → output mapping)
[ ] Version control of algorithm
[ ] Human review for critical documents
[ ] Disclosure that summary is machine-generated
```
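
The audit-trail and disclosure items in the governance block can be captured in one record per summary. This is a minimal sketch, assuming the summarizer exposes the indices of the selected sentences; `build_audit_record`, its field names, and the version label are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_audit_record(source_text, summary, selected_indices, method, algorithm_version):
    """Assemble a JSON-serializable record linking a summary to its exact input."""
    return {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "source_sha256": hashlib.sha256(source_text.encode("utf-8")).hexdigest(),
        "summary_sha256": hashlib.sha256(summary.encode("utf-8")).hexdigest(),
        # Cast to plain int: NumPy integer indices are not JSON serializable
        "selected_indices": [int(i) for i in selected_indices],
        "method": method,
        "algorithm_version": algorithm_version,
        "machine_generated": True,
    }


record = build_audit_record(
    source_text="Deposits grew 12%. Provisions rose to $1.2 billion.",
    summary="Provisions rose to $1.2 billion.",
    selected_indices=[1],
    method="mmr",
    algorithm_version="0.1.0",  # hypothetical version label
)
print(json.dumps(record, indent=2))
```

Hashing the input rather than storing it keeps the record small while still letting a reviewer prove which document version produced a given summary.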

## 8. Modern LLM-Based Approach

### Abstractive Summarization with LLMs

**Option 1: Fine-tuned T5/BART**
```python
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
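# The pipeline returns a list of dicts; truncation=True guards against inputs
# longer than the model's maximum input length
result = summarizer(text, max_length=150, min_length=50, truncation=True)
summary = result[0]["summary_text"]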
```

**Option 2: GPT/Claude Prompting**
```python
prompt = f"""
Summarize the following earnings call transcript in 3-4 sentences.
Focus on: revenue, profit metrics, key business highlights, and forward guidance.
Use exact numbers from the text.

Transcript:
{text}

Summary:
"""
```

**Option 3: Hybrid (Extract then Abstract)**
```python
# Step 1: Extract key sentences (traditional)
key_sentences = extractive_summarize(text, n=10)

# Step 2: Rewrite for fluency (LLM)
prompt = f"Rewrite these key points as a coherent paragraph:\n{key_sentences}"
```
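
One way to keep the hybrid approach auditable is to carry the source sentence indices into the rewrite prompt, so each claim in the rewritten paragraph can be traced back. The sketch below assumes the extractive step has already produced a list of sentences and the indices it selected; `build_hybrid_prompt` and its `max_chars` budget are hypothetical and not part of the notebook's code.

```python
def build_hybrid_prompt(sentences, selected_indices, max_chars=3000):
    """Build a rewrite prompt from extracted sentences, each tagged with its source index.

    Sentences are added in document order until the character budget is reached.
    """
    lines = []
    used = 0
    for i in sorted(selected_indices):
        line = f"[{i}] {sentences[i]}"
        if used + len(line) > max_chars:
            break  # stay within the (assumed) context budget
        lines.append(line)
        used += len(line) + 1  # +1 for the newline separator
    key_points = "\n".join(lines)
    return (
        "Rewrite these key points as a coherent paragraph.\n"
        "Use only the facts below, keep every number exactly as written, "
        "and cite the bracketed index after each claim.\n\n"
        f"{key_points}"
    )
```

Because the LLM only sees extracted sentences, any number in its output that does not appear in the prompt can be flagged automatically.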

In [None]:
# Example prompt for LLM summarization
def create_summarization_prompt(text, doc_type='earnings_call', max_words=100):
    """
    Create prompt for LLM-based summarization.
    
    Banking considerations:
    - Emphasize factual accuracy
    - Request specific metrics
    - Avoid speculation/inference
    """
    
    type_instructions = {
        'earnings_call': """Focus on:
- Revenue and profit figures (use exact numbers)
- Key business segment performance
- Forward guidance and outlook
- Material risks or concerns mentioned""",
        'regulatory': """Focus on:
- New requirements or changes
- Compliance deadlines
- Impact on operations
- Required actions""",
        'contract': """Focus on:
- Parties involved
- Key terms and conditions
- Financial obligations
- Important dates and deadlines"""
    }
    
    instructions = type_instructions.get(doc_type, "Identify the key points.")
    
    prompt = f"""You are summarizing a banking document. Accuracy is critical.

Document Type: {doc_type}

{instructions}

Rules:
1. Use EXACT numbers from the document - do not round or approximate
2. Do not add information not in the document
3. Do not speculate or make inferences
4. Keep summary under {max_words} words
5. Maintain neutral, factual tone

Document:
---
{text[:3000]}
---

Summary:"""
    
    return prompt

# Example
prompt = create_summarization_prompt(
    sample_documents[0]['text'],
    doc_type='earnings_call',
    max_words=100
)

print("LLM SUMMARIZATION PROMPT")
print("="*60)
print(prompt[:1500] + "...")

## 9. Traditional vs LLM Decision Matrix

| Dimension | Extractive (TextRank/MMR) | Abstractive (T5/BART) | LLM (GPT-4/Claude) |
|-----------|--------------------------|----------------------|--------------------|
| **Faithfulness** (rough estimates) | ~100% (verbatim) | ~90-95% | ~85-95% |
| **Fluency** | Lower (disjointed sentences, unresolved references) | High | Very high |
| **Hallucination Risk** | None | Low | Medium |
| **Latency** | <100ms | 500ms-2s | 1-5s |
| **Cost per summary** (rough) | ~$0 | $0.001-0.01 | $0.01-0.05 |
| **Long docs** | Excellent | Limited (context) | Limited (context) |
| **Training data** | None needed | Required | Few examples |

### When to Use Each Approach

**Use Extractive**:
- Legal/regulatory documents where verbatim accuracy required
- Audit trail needed (which exact sentences were selected)
- Very long documents (hundreds of pages)
- High volume, low latency requirements

**Use Abstractive (fine-tuned)**:
- Customer-facing summaries needing polish
- Consistent output format needed
- Moderate accuracy requirements

**Use LLM**:
- Executive briefings where fluency matters
- Low volume, high quality requirements
- Complex synthesis across document sections
- **Always with human review for banking**
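
The guidance above can be written down as a small routing function, which is sometimes easier to review than prose. This is only a sketch: `choose_summarizer`, its flags, and the 100-page threshold for "very long" are hypothetical, and the rule order is one reasonable reading of the bullets rather than a fixed policy.

```python
def choose_summarizer(*, verbatim_required=False, audit_trail_required=False,
                      pages=1, high_volume=False, customer_facing=False,
                      needs_synthesis=False, long_doc_pages=100):
    """Return (approach, human_review_required) following the bullets above.

    Extractive constraints are checked first because they are hard requirements;
    LLM output always carries a human-review flag.
    """
    if (verbatim_required or audit_trail_required or high_volume
            or pages >= long_doc_pages):
        return ("extractive", False)
    if needs_synthesis:
        return ("llm", True)
    if customer_facing:
        return ("abstractive", False)
    # Default to the most conservative option when nothing else applies
    return ("extractive", False)
```

A very long document that also needs synthesis is routed to extractive here; in practice that case is a candidate for the extract-then-abstract hybrid.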

## 10. Interview Soundbites

### Ready-to-Say Statements

**On Extractive vs Abstractive:**
> "For regulatory documents and financial statements, I default to extractive summarization. When a summary says 'revenue grew 12%' I need to be able to point to the exact sentence that says that. Abstractive models can subtly change numbers or merge facts incorrectly - that's a compliance risk."

**On TextRank:**
> "TextRank is elegant because it treats sentences like web pages. A sentence is important if it is similar to many other important sentences - it's PageRank for documents. No training data needed, works across domains, and gives surprisingly good results."

**On MMR:**
> "I prefer MMR over pure relevance ranking because it explicitly penalizes redundancy. In a 100-sentence document, the top 5 by relevance might all say the same thing differently. MMR's diversity term ensures we cover more ground."

**On Evaluation:**
> "ROUGE scores are useful but limited. A summary can score high on ROUGE and still miss the most important fact. For banking, I complement ROUGE with key fact checklists - did we capture the revenue number, the provision change, the forward guidance? That's what matters."

**On Hallucination:**
> "Hallucination in summarization is a dealbreaker for banking. If an LLM summarizes '12% growth' as '15% growth' that's not a rounding error - that's material misrepresentation. For high-stakes documents, I use extractive as a safety layer even if the output is less polished."

**On Long Documents:**
> "LLMs have context limits. For a 100-page 10-K filing, I use a two-stage approach: extractive to select key passages within context limits, then optionally LLM to synthesize. This combines the coverage of extractive with the fluency of generative."

**On Production:**
> "Every machine-generated summary in banking needs attribution. I include the source sentence indices so reviewers can verify. And we're transparent with users that it's AI-generated - no pretending summaries are human-written."

---

### Common Interview Questions

**Q: How do you handle multi-document summarization?**
> Aggregate sentences from all documents, run similarity-based de-duplication, then apply standard extractive methods. For MMR, the redundancy penalty naturally handles cross-document overlap.

**Q: What if the document has multiple distinct topics?**
> First segment by topic (using topic modeling), summarize each segment, then combine. Or use query-focused summarization where each topic becomes a query and we extract relevant sentences per topic.

**Q: How do you ensure numbers are preserved correctly?**
> Extract numerical entities first, track their source sentences, and ensure at least one source sentence for each key number is selected. Post-process to verify number presence in final summary.
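
The number-preservation answer can be sketched with a regular expression and two small helpers. Everything here is illustrative: `NUMBER_PATTERN`, `extract_numbers`, `unsupported_numbers`, and `ensure_number_coverage` are hypothetical names, and the pattern only covers simple dollar amounts, percentages, and million/billion/trillion magnitudes.

```python
import re

# Optional $, digits with optional thousands separators and decimals,
# optional %, optional magnitude word. The lookbehind skips digits inside
# tokens such as "Q3".
NUMBER_PATTERN = re.compile(
    r'(?<![\w.])\$?\d+(?:,\d{3})*(?:\.\d+)?%?(?:\s?(?:million|billion|trillion)\b)?',
    re.IGNORECASE,
)

def normalize_number(token):
    """Lowercase, drop thousands separators, collapse whitespace."""
    return re.sub(r'\s+', ' ', token.lower().replace(',', '')).strip()

def extract_numbers(text):
    """Return the set of normalized numeric expressions found in text."""
    return {normalize_number(m.group()) for m in NUMBER_PATTERN.finditer(text)}

def unsupported_numbers(summary, source):
    """Numbers that appear in the summary but nowhere in the source."""
    return sorted(extract_numbers(summary) - extract_numbers(source))

def ensure_number_coverage(sentences, selected_indices, key_numbers):
    """Add the first sentence carrying each key number if none is selected yet."""
    selected = set(selected_indices)
    for number in key_numbers:
        target = normalize_number(number)
        carriers = [i for i, s in enumerate(sentences)
                    if target in extract_numbers(s)]
        if carriers and not selected.intersection(carriers):
            selected.add(carriers[0])
    return sorted(selected)
```

`ensure_number_coverage` runs before the summary is assembled, and `unsupported_numbers` runs afterwards as the verification step; a non-empty result is a signal to block the summary or route it to review.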

In [None]:
print("""
╔══════════════════════════════════════════════════════════════════╗
║                    NOTEBOOK SUMMARY                               ║
╠══════════════════════════════════════════════════════════════════╣
║  Task: Text Summarization                                        ║
║  Approach: Traditional NLP (Extractive Methods)                  ║
║  Banking Use: Document digests, compliance summaries             ║
║                                                                  ║
║  Key Takeaways:                                                  ║
║  1. Extractive = no hallucination (verbatim sentences)           ║
║  2. TextRank (graph-based) works without training                ║
║  3. MMR balances relevance and diversity                         ║
║  4. ROUGE + fact checklist for evaluation                        ║
║  5. Attribution required for banking compliance                  ║
╚══════════════════════════════════════════════════════════════════╝
""")