# Text Summarization Using an LLM - Complete Demonstration

**Assignment**: News Article Summarization with Production-Ready Models

**Author**: Your Name  
**Date**: January 2026

---

## 📋 Overview

This notebook demonstrates:
1. ✅ Model Selection (BART-CNN for production)
2. ✅ Data Preprocessing (XSum dataset)
3. ✅ Pipeline Implementation
4. ✅ Model Comparison (BART vs PEGASUS vs LED)
5. ✅ Performance Evaluation

## 1. Setup & Installation

In [None]:
# Install required packages (run once)
!pip install -q transformers datasets torch pandas accelerate sentencepiece

In [None]:
# Import libraries
import torch
from transformers import pipeline
from datasets import load_dataset
import pandas as pd
import time
import warnings
warnings.filterwarnings('ignore')

print("✅ Setup complete!")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA Available: {torch.cuda.is_available()}")
print(f"Device: {'GPU' if torch.cuda.is_available() else 'CPU'}")

## 2. Model Selection & Justification

### Primary Model: facebook/bart-large-cnn

**Why BART-CNN? (Production-Focused)**

| Criterion | BART-CNN | PEGASUS-XSum | LED-16K |
|-----------|----------|--------------|----------|
| **Training** | CNN/DailyMail | XSum (BBC) | Pretrained base checkpoint (not fine-tuned for summarization) |
| **Output** | 3-4 sentences | 1 sentence | 3-4 sentences |
| **Industry Use** | ✅ Very High | ⚠️ Niche | ✅ Specialized |
| **Business Value** | ✅ High | ❌ Low | ✅ Medium |
| **Speed (approx.)** | Fast (1-2s) | Fast (1-2s) | Slower (3-5s) |
| **Production Ready** | ✅ Yes | ⚠️ Limited | ⚠️ After fine-tuning (long docs) |

**Decision**: BART-CNN is the primary model because it is fine-tuned on CNN/DailyMail news articles and produces multi-sentence summaries.

In [None]:
# Model configurations
MODELS = {
    "BART-CNN": {
        "name": "facebook/bart-large-cnn",
        "max_len": 142,
        "min_len": 56,
        "description": "Production standard - informative 3-4 sentence summaries"
    },
    "PEGASUS-XSum": {
        "name": "google/pegasus-xsum",
        "max_len": 64,
        "min_len": 10,
        "description": "Academic - single sentence extreme summarization"
    },
    "LED-16K": {
        "name": "allenai/led-base-16384",
        "max_len": 142,
        "min_len": 56,
        "description": "Specialized - handles very long documents"
    }
}

print("📊 Available Models:")
for name, config in MODELS.items():
    print(f"\n{name}:")
    print(f"  Model: {config['name']}")
    print(f"  Description: {config['description']}")

## 3. Data Loading & Preprocessing

### Library Choices and Justification:

**Why `datasets` (HuggingFace) over alternatives?**
- **Native Integration**: Seamlessly works with HuggingFace models and tokenizers
- **Efficient Loading**: Lazy loading and caching reduce memory footprint
- **Built-in Support**: Direct access to XSum dataset without manual download/processing
- **Alternative Considered**: Manual CSV/JSON loading - rejected due to complexity and lack of optimization

**Why `pandas` for data manipulation?**
- **Structured Data**: Excellent for tabular data operations and analysis
- **DataFrame Operations**: Easy filtering, selection, and transformation
- **Integration**: Works well with datasets library output
- **Alternative Considered**: NumPy arrays - rejected as they lack structured data capabilities

**Why `torch` (PyTorch)?**
- **Model Backend**: The backend that HuggingFace transformers uses to run these checkpoints
- **GPU Acceleration**: CUDA support for faster inference when a GPU is available
- **Tensor Operations**: Efficient numerical computations
- **Alternative Considered**: TensorFlow - rejected as HuggingFace defaults to PyTorch for these models

In [None]:
# Load XSum dataset
print("📥 Loading XSum dataset...")
dataset = load_dataset("xsum", split="test")
samples = dataset.select(range(50))

print(f"✅ Loaded {len(samples)} samples")
print(f"\nDataset structure: {samples.column_names}")
print(f"First example keys: {samples[0].keys()}")

In [None]:
# Inspect sample data
print("📊 Sample Data Inspection:\n")
print("=" * 80)

for i in range(3):
    print(f"\nExample {i+1}:")
    print(f"Article (first 200 chars): {samples[i]['document'][:200]}...")
    print(f"XSum Reference: {samples[i]['summary']}")
    print("-" * 80)
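
The cells above load and inspect the data but leave the articles untouched. A minimal preprocessing sketch follows, assuming that whitespace normalization and a minimum-length filter are all that is needed. The helper names `clean_document` and `prepare_documents` and the `min_words=20` threshold are hypothetical choices for illustration, not part of the original notebook.

```python
import re


def clean_document(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def prepare_documents(records, min_words=20):
    """Clean each record's 'document' field and skip articles that are too short.

    `records` is any iterable of dicts with 'document' and 'summary' keys,
    which is what iterating over the XSum samples yields.
    """
    prepared = []
    for record in records:
        doc = clean_document(record["document"])
        # min_words is an illustrative threshold, not a tuned value.
        if len(doc.split()) >= min_words:
            prepared.append({"document": doc, "summary": record["summary"].strip()})
    return prepared
```

It could be applied as `prepared = prepare_documents(samples)` before any article is passed to a summarizer.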

## 4. Pipeline Implementation

### Pipeline Abstraction:

The HuggingFace `pipeline()` function abstracts away the following complexities:

```
Input Text → Tokenization → Model Inference → Decoding → Summary Output
```

**What complexities are abstracted?**

1. **Tokenization**: 
   - Manual handling of tokenizers, special tokens, padding, truncation
   - Without pipeline: Need to manually call `tokenizer(text, return_tensors="pt", padding=True, truncation=True)`
   - With pipeline: Automatically handles all tokenization steps

2. **Model Loading**:
   - Model weights, configuration, and tokenizer initialization
   - Without pipeline: Need to load model, tokenizer, and config separately
   - With pipeline: Single function call handles everything

3. **Device Management**:
   - CPU/GPU device placement and tensor movement
   - Without pipeline: Manual `.to(device)` calls and device management
   - With pipeline: A single `device` argument places the model and its input tensors

4. **Decoding**:
   - Converting token IDs back to text, handling special tokens
   - Without pipeline: Manual `tokenizer.decode()` with cleanup
   - With pipeline: Clean text output automatically

5. **Batch Processing**:
   - Efficient batching, attention masks, padding
   - Without pipeline: Complex batching logic required
   - With pipeline: Pass a list of texts and a `batch_size`; padding and masks are handled internally

**Why use pipeline over manual implementation?**
- **Simplicity**: Replaces the manual tokenize/generate/decode code with a single call
- **Reliability**: Widely used implementation with built-in error handling
- **Flexibility**: Switching models only requires changing the `model` argument
- **Maintenance**: Fixes and improvements arrive with library upgrades, without changes to our code
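
The steps listed above can be written out by hand to see what the pipeline hides. This is a minimal sketch, not a description of how `pipeline()` is implemented internally. The function names `manual_summarize` and `load_bart_cnn` are hypothetical, the 1024-token limit is the assumed encoder limit for BART, and the sketch assumes the model and inputs stay on the CPU.

```python
def manual_summarize(text, tokenizer, model, max_length=142, min_length=56):
    """Summarize one text with explicit tokenize -> generate -> decode steps.

    `tokenizer` and `model` are expected to behave like the objects returned by
    AutoTokenizer.from_pretrained(...) and AutoModelForSeq2SeqLM.from_pretrained(...).
    """
    # 1. Tokenization: truncate so long articles do not exceed the encoder limit.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)

    # 2. Inference: deterministic decoding (no sampling), as in the notebook cells.
    output_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,
        min_length=min_length,
        do_sample=False,
    )

    # 3. Decoding: convert token IDs back to text and drop special tokens.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


def load_bart_cnn():
    """Load tokenizer and model; the checkpoint is downloaded on first use."""
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    name = "facebook/bart-large-cnn"
    return AutoTokenizer.from_pretrained(name), AutoModelForSeq2SeqLM.from_pretrained(name)
```

A call would look like `tokenizer, model = load_bart_cnn()` followed by `manual_summarize(article, tokenizer, model)`. GPU use would additionally require moving both the model and the input tensors to the same device.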

In [None]:
# Initialize BART-CNN (Primary Model)
print("🚀 Loading BART-CNN model...")
print("⚠️  First run downloads ~2GB - subsequent runs are fast\n")

device = 0 if torch.cuda.is_available() else -1

bart_summarizer = pipeline(
    "summarization",
    model="facebook/bart-large-cnn",
    device=device
)

print("✅ BART-CNN ready!")
print(f"   Device: {'GPU' if device == 0 else 'CPU'}")

## 5. Single Text Summarization Demo

In [None]:
test_text = samples[0]['document']
xsum_ref = samples[0]['summary']

print("📄 Input Article (first 400 chars):")
print("=" * 80)
print(test_text[:400] + "...\n")

# Generate summary
print("⏳ Generating summary...")
start = time.time()

result = bart_summarizer(
    test_text,
    max_length=142,
    min_length=56,
    do_sample=False,
    truncation=True  # Cut articles that exceed the model's input limit
)

elapsed = time.time() - start
summary = result[0]['summary_text']

print("\n" + "=" * 80)
print("📝 BART-CNN Summary (3-4 sentences):")
print("=" * 80)
print(summary)

print("\n" + "=" * 80)
print("🎯 XSum Reference (1 sentence):")
print("=" * 80)
print(xsum_ref)

print("\n" + "=" * 80)
print("📊 Metrics:")
print("=" * 80)
print(f"Input length: {len(test_text.split())} words")
print(f"BART summary: {len(summary.split())} words")
print(f"XSum reference: {len(xsum_ref.split())} words")
print(f"Inference time: {elapsed:.2f} seconds")
print(f"Compression ratio: {len(summary.split())/len(test_text.split())*100:.1f}%")
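
The compression ratio printed above is the summary's word count as a percentage of the article's word count, so a 30-word summary of a 400-word article gives roughly 7.5%. The same arithmetic can be wrapped in small helpers. The names `word_count` and `compression_ratio` are hypothetical, and words are assumed to be whitespace-delimited, as in the cell above.

```python
def word_count(text):
    """Count whitespace-delimited words."""
    return len(text.split())


def compression_ratio(summary, source):
    """Return the summary length as a percentage of the source length."""
    source_words = word_count(source)
    if source_words == 0:
        # Guard against empty articles instead of dividing by zero.
        return 0.0
    return word_count(summary) / source_words * 100


print(compression_ratio("a b", "a b c d"))  # 50.0
```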

## 6. Model Comparison (BART vs PEGASUS vs LED)

This section runs all three models on the same article to show how their outputs differ.
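
Holding three checkpoints in memory at once may be more than a laptop or a small GPU can manage. One option, sketched here under that assumption, is to load, run, and release each model in turn. The function `run_models_sequentially` and the `loaders` mapping are hypothetical names introduced for this example.

```python
import gc
import time


def run_models_sequentially(loaders, text, configs):
    """Run each model on `text`, releasing it before the next one is loaded.

    loaders: maps a model key to a zero-argument function that returns a
             summarizer callable (for example a lambda wrapping pipeline(...)).
    configs: maps the same keys to dicts with 'max_len' and 'min_len'.
    """
    rows = []
    for key, load in loaders.items():
        summarizer = load()
        start = time.perf_counter()
        result = summarizer(
            text,
            max_length=configs[key]["max_len"],
            min_length=configs[key]["min_len"],
            do_sample=False,
            truncation=True,
        )
        elapsed = time.perf_counter() - start
        summary = result[0]["summary_text"]
        rows.append({
            "Model": key,
            "Summary": summary,
            "Words": len(summary.split()),
            "Time (s)": round(elapsed, 2),
        })
        # Drop the only reference so the weights can be garbage-collected.
        del summarizer
        gc.collect()
    return rows
```

With the notebook's `MODELS` dictionary, the loaders could be built as `{k: (lambda n=c["name"]: pipeline("summarization", model=n, device=device)) for k, c in MODELS.items()}`. On a GPU, calling `torch.cuda.empty_cache()` after each model may also help return memory.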

In [None]:
# Load all models for comparison
print("🔄 Loading models for comparison...\n")

models_loaded = {}

for name, config in MODELS.items():
    print(f"Loading {name}...")
    if name == "BART-CNN":
        # Reuse the pipeline from Section 4 instead of loading the weights twice
        models_loaded[name] = bart_summarizer
    else:
        models_loaded[name] = pipeline(
            "summarization",
            model=config['name'],
            device=device
        )
    print(f"  ✅ {name} ready")

print("\n✅ All models loaded!")

In [None]:
# Compare on same text
comparison_results = []

print("⚖️  Comparing Models on Same Article\n")
print("=" * 80)

for model_name, model in models_loaded.items():
    config = MODELS[model_name]
    
    print(f"\n{model_name}:")
    print("-" * 80)
    
    start = time.time()
    result = model(
        test_text,
        max_length=config['max_len'],
        min_length=config['min_len'],
        do_sample=False,
        truncation=True  # Cut articles that exceed the model's input limit
    )
    elapsed = time.time() - start
    
    summary = result[0]['summary_text']
    
    print(f"Summary: {summary}")
    print(f"Length: {len(summary.split())} words")
    print(f"Time: {elapsed:.2f}s")
    
    comparison_results.append({
        'Model': model_name,
        'Summary': summary,
        'Words': len(summary.split()),
        'Time (s)': round(elapsed, 2),  # Keep numeric so the column can be sorted
        'Description': config['description']
    })

print("\n" + "=" * 80)

In [None]:
comparison_df = pd.DataFrame(comparison_results)

print("\n📊 Model Comparison Table:")
print("=" * 80)
print(comparison_df.to_string(index=False))

# Expected patterns; check them against the table printed above
print("\n💡 Key Insights:")
print("=" * 80)
print("✅ BART-CNN: Multi-sentence, most informative (production choice)")
print("⚡ PEGASUS-XSum: Shortest, single sentence (headline style)")
print("📚 LED-16K: Base checkpoint without summarization fine-tuning; intended for very long docs")

## 7. Batch Processing

This section summarizes ten documents in a single call, using `batch_size` to group them.
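
As a conceptual picture of what `batch_size=4` means for ten documents, the sketch below splits a list into consecutive groups. It is not the pipeline's internal implementation, which also handles padding and attention masks, and the helper name `chunked` is hypothetical.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Ten placeholder documents split into batches of four
batches = list(chunked(list(range(10)), 4))
print([len(b) for b in batches])  # [4, 4, 2]
```

The last batch is smaller whenever the number of documents is not a multiple of the batch size.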

In [None]:
print("📦 Batch Processing Demo\n")

batch_docs = [samples[i]['document'] for i in range(10)]
batch_refs = [samples[i]['summary'] for i in range(10)]

print(f"Processing {len(batch_docs)} documents...\n")

start = time.time()
batch_results = bart_summarizer(
    batch_docs,
    max_length=142,
    min_length=56,
    do_sample=False,
    truncation=True,  # Cut articles that exceed the model's input limit
    batch_size=4  # Process 4 at a time
)
total_time = time.time() - start

batch_summaries = [r['summary_text'] for r in batch_results]

print("✅ Batch processing complete!\n")
print(f"Total time: {total_time:.2f}s")
print(f"Avg per document: {total_time/len(batch_docs):.2f}s")
print(f"Throughput: {len(batch_docs)/total_time:.2f} docs/second")

In [None]:
batch_df = pd.DataFrame({
    'Article': [d[:100] + '...' for d in batch_docs[:5]],
    'BART Summary': [s[:100] + '...' for s in batch_summaries[:5]],
    'XSum Ref': batch_refs[:5],
    'BART Words': [len(s.split()) for s in batch_summaries[:5]],
    'Ref Words': [len(r.split()) for r in batch_refs[:5]]
})

print("\n📋 Sample Results (first 5):")
print("=" * 80)
print(batch_df.to_string(index=False))

## 8. Performance Analysis

In [None]:
from statistics import mean

metrics = {
    'Average BART Length': mean([len(s.split()) for s in batch_summaries]),
    'Average XSum Length': mean([len(r.split()) for r in batch_refs]),
    'Compression Ratio': mean([
        len(batch_summaries[i].split()) / len(batch_docs[i].split()) * 100
        for i in range(len(batch_docs))
    ])
}

print("\n📊 Performance Metrics:")
print("=" * 80)
for metric, value in metrics.items():
    print(f"{metric}: {value:.2f}")

print("\n💡 Analysis:")
print("=" * 80)
print(f"✅ BART summaries are {metrics['Average BART Length']/metrics['Average XSum Length']:.1f}x longer than the XSum references")
print("✅ Longer summaries carry more detail for business use")
print(f"✅ Summaries are ~{metrics['Compression Ratio']:.1f}% of the original article length")
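
The metrics above measure length only. To get a rough sense of content overlap with the references, a simplified ROUGE-1-style score can be computed with the standard library. This is a sketch, not the official ROUGE implementation: it assumes lowercased, whitespace-delimited tokens with no stemming, and the function name `unigram_f1` is hypothetical.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """Simplified ROUGE-1-style F1 on lowercased whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection counts each shared word at most as often as it
    # appears in both texts.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

It could be applied per pair, for example `[unigram_f1(s, r) for s, r in zip(batch_summaries, batch_refs)]`. Because the XSum references are single sentences and the BART summaries are longer, precision is likely to be low even for reasonable summaries.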

## 9. Save Results

In [None]:
results_df = pd.DataFrame({
    'Document': batch_docs,
    'BART_Summary': batch_summaries,
    'XSum_Reference': batch_refs,
    'BART_Words': [len(s.split()) for s in batch_summaries],
    'XSum_Words': [len(r.split()) for r in batch_refs]
})

results_df.to_csv('summarization_results.csv', index=False)
print("✅ Results saved to 'summarization_results.csv'")
print(f"   Total entries: {len(results_df)}")

## 10. Summary & Conclusions

### Key Achievements:

1. ✅ **Model Selection**: BART-CNN chosen for production readiness
2. ✅ **Data Processing**: Efficient XSum dataset loading
3. ✅ **Pipeline**: Abstracted complexity with HuggingFace pipeline
4. ✅ **Comparison**: Demonstrated understanding of model tradeoffs
5. ✅ **Performance**: Efficient batch processing

### Why BART-CNN?

| Aspect | Reason |
|--------|--------|
| **Output Quality** | 3-4 informative sentences vs 1 brief sentence |
| **Industry Use** | Widely used for news summarization |
| **Reliability** | Battle-tested, predictable behavior |
| **Business Value** | Suitable for reports, briefs, production use |

### Production vs Academic:

- **Academic Approach**: Match model to dataset (PEGASUS-XSum)
- **Production Approach**: Choose best model for business needs (BART-CNN)
- **Our Choice**: Production approach with academic awareness

### Next Steps:

1. Run the Streamlit app: `streamlit run app.py`
2. Try your own articles
3. Compare models interactively
4. Deploy for production use

---

**Assignment Complete! 🎉**