# Text Summarization Demo

This notebook demonstrates the text summarization capabilities of the NLP toolkit, including:
- Both abstractive and extractive summarization techniques
- Using pre-trained transformer models for summarization
- Fine-tuning summarization models
- Evaluating summaries with ROUGE metrics
- Applying summarization to real-world documents

In [None]:
# Setup path to allow importing from the src directory
import sys
import os
from pathlib import Path

# Add parent directory to path
project_root = Path().resolve().parent
sys.path.insert(0, str(project_root))

# Import toolkit modules
from src.data.preprocessing import TextPreprocessor
from src.data.data_loader import get_summarization_loader
from src.models.summarizer import TextSummarizer, ExtractiveSummarizer
from src.training.metrics import rouge_metrics

# Import standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import AutoTokenizer
import torch
from IPython.display import display, HTML

## 1. Configuration and Setup

In [None]:
# Configuration
TASK = "summarization"
# Pre-trained model used for abstractive summarization
ABSTRACTIVE_MODEL = "facebook/bart-large-cnn"
DATASET_NAME = "cnn_dailymail"
DATASET_VERSION = "3.0.0"
MAX_INPUT_LENGTH = 512
MAX_OUTPUT_LENGTH = 128
BATCH_SIZE = 4
NUM_EPOCHS = 1  # Using just 1 epoch for demonstration purposes

# Output directory for model and results
OUTPUT_DIR = os.path.join(project_root, "models", "demo_summarizer")
os.makedirs(OUTPUT_DIR, exist_ok=True)

## 2. Exploring Extractive Summarization

In [None]:
# Sample long text for summarization
sample_text = """
Artificial intelligence (AI) is intelligence demonstrated by machines, as opposed to the natural intelligence displayed by humans and animals. AI research has been defined as the field of study of intelligent agents, which refers to any system that perceives its environment and takes actions that maximize its chance of achieving its goals.

The term "artificial intelligence" had previously been used to describe machines that mimic and display "human" cognitive skills that are associated with the human mind, such as "learning" and "problem-solving". This definition has since been rejected by major AI researchers who now describe AI in terms of rationality and acting rationally, which does not limit how intelligence can be articulated.

AI applications include advanced web search engines, recommendation systems (used by YouTube, Amazon and Netflix), understanding human speech (such as Siri and Alexa), self-driving cars (such as Tesla), automated decision-making and competing at the highest level in strategic game systems (such as chess and Go).

As machines become increasingly capable, tasks considered to require "intelligence" are often removed from the definition of AI, a phenomenon known as the AI effect. For instance, optical character recognition is frequently excluded from things considered to be AI, having become a routine technology.

Artificial intelligence was founded as an academic discipline in 1956, and in the years since has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success and renewed funding. AI research has tried and discarded many different approaches during its lifetime, including simulating the brain, modeling human problem solving, formal logic, large databases of knowledge and imitating animal behavior. In the first decades of the 21st century, highly mathematical-statistical machine learning has dominated the field, and this technique has proved highly successful, helping to solve many challenging problems throughout industry and academia.

The various sub-fields of AI research are centered around particular goals and the use of particular tools. The traditional goals of AI research include reasoning, knowledge representation, planning, learning, natural language processing, perception, and the ability to move and manipulate objects. General intelligence (the ability to solve an arbitrary problem) is among the field's long-term goals. To solve these problems, AI researchers have adapted and integrated a wide range of problem-solving techniques, including search and mathematical optimization, formal logic, artificial neural networks, and methods based on statistics, probability and economics. AI also draws upon computer science, psychology, linguistics, philosophy, and many other fields.
"""

# Initialize extractive summarizers with different methods
textrank_summarizer = ExtractiveSummarizer(method="textrank")
lsa_summarizer = ExtractiveSummarizer(method="lsa")
luhn_summarizer = ExtractiveSummarizer(method="luhn")

# Generate summaries
textrank_summary = textrank_summarizer.summarize(sample_text, ratio=0.3)
lsa_summary = lsa_summarizer.summarize(sample_text, ratio=0.3)
luhn_summary = luhn_summarizer.summarize(sample_text, ratio=0.3)

# Display summaries
print("Original Text Length:", len(sample_text.split()))
print("\nTextRank Summary:")
print(textrank_summary)
print("Length:", len(textrank_summary.split()))

print("\nLSA Summary:")
print(lsa_summary)
print("Length:", len(lsa_summary.split()))

print("\nLuhn Summary:")
print(luhn_summary)
print("Length:", len(luhn_summary.split()))
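
To give a rough idea of what an extractive method does, the following is a minimal frequency-based sketch in the spirit of Luhn's approach. It is not the toolkit's `ExtractiveSummarizer` implementation: the function name `simple_extractive_summary`, the tiny stopword list, and the reading of `ratio` as the fraction of sentences to keep are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real implementations use larger lists)
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it",
             "that", "as", "by", "for", "on", "with"}


def simple_extractive_summary(text, ratio=0.3):
    """Keep the highest-scoring sentences, scored by average content-word frequency."""
    # Naive sentence split: whitespace that follows '.', '!' or '?'
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    if not sentences:
        return ""

    def content_words(sentence):
        return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]

    # Document-level frequencies of content words
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # Number of sentences to keep (assumed interpretation of ratio; at least one)
    k = max(1, math.ceil(len(sentences) * ratio))

    # Pick the top-k sentences by score, then restore their original order
    top = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)[:k]
    return " ".join(sentences[i] for i in sorted(top))


demo_text = "Cats sleep a lot. Cats and dogs are pets. The weather is nice today."
print(simple_extractive_summary(demo_text, ratio=0.3))  # Cats sleep a lot.
```

Because the output is built only from sentences that already appear in the input, an extractive summary never introduces new wording, which is the main contrast with the abstractive model used later in this notebook.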

In [None]:
# Compare extractive methods with a visualization
def highlight_common_sentences(original_text, summaries, summary_names):
    """Highlight sentences in original text based on which extractive methods selected them."""
    # Split text into sentences
    import re
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', original_text)
    sentences = [s.strip() for s in sentences if s.strip()]
    
    # Create a dataframe with sentences and which methods included them
    data = []
    for i, sentence in enumerate(sentences):
        row = {"id": i, "sentence": sentence}
        for name, summary in zip(summary_names, summaries):
            # Check if sentence is in summary (approximately)
            row[name] = any(sentence in s for s in re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', summary))
        data.append(row)
    
    df = pd.DataFrame(data)
    
    # Create HTML with highlighted sentences
    html = '<style>'
    for i, name in enumerate(summary_names):
        hue = i * (360 // len(summary_names))
        html += f'.{name} {{ background-color: hsla({hue}, 100%, 90%, 0.5); }}\n'
    html += '</style>'
    
    html += '<div style="line-height: 1.5">'
    for _, row in df.iterrows():
        sentence = row['sentence']
        classes = [name for name in summary_names if row[name]]
        
        if classes:
            class_str = ' '.join(classes)
            methods_str = ', '.join(classes)
            html += f'<span class="{class_str}" title="Methods: {methods_str}">{sentence}</span> '
        else:
            html += f'{sentence} '
    
    html += '</div>'
    
    # Create a legend
    html += '<div style="margin-top: 20px">Legend: '
    for name in summary_names:
        html += f'<span class="{name}" style="padding: 2px 5px; margin-right: 10px;">{name}</span>'
    html += '</div>'
    
    return html

# Create and display the visualization
summaries = [textrank_summary, lsa_summary, luhn_summary]
summary_names = ["TextRank", "LSA", "Luhn"]
html_output = highlight_common_sentences(sample_text, summaries, summary_names)
display(HTML(html_output))
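
The highlighting above depends on how the regular expression splits text into sentences, so it may help to see that pattern in isolation. The sketch below reuses the same pattern; the helper name `split_sentences` and the sample strings are made up for illustration.

```python
import re

# Same pattern as in highlight_common_sentences: split on whitespace that follows
# '.' or '?', unless the preceding text looks like "e.g." or a title such as "Dr."
SENTENCE_BOUNDARY = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s')


def split_sentences(text):
    """Split text into sentences using the notebook's boundary pattern."""
    return [s.strip() for s in SENTENCE_BOUNDARY.split(text) if s.strip()]


print(split_sentences("Dr. Smith arrived. Was he late? No."))
# ['Dr. Smith arrived.', 'Was he late?', 'No.']

print(split_sentences("Use tools, e.g. regex. Done."))
# ['Use tools, e.g. regex.', 'Done.']

# Limitation: a sentence ending in a capitalized two-letter word is not split,
# because "Go." looks like an abbreviation such as "Dr." to the pattern
print(split_sentences("We play Go. It is fun."))
# ['We play Go. It is fun.']
```

The pattern also ignores `!` as a sentence boundary, so the highlighting is best treated as approximate: a sentence that the summarizer split differently will not be matched by the substring check.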

## 3. Exploring Abstractive Summarization

In [None]:
# Initialize a pre-trained abstractive summarizer
abstractive_summarizer = TextSummarizer(model_name=ABSTRACTIVE_MODEL)

# Print model information
print(f"Model: {ABSTRACTIVE_MODEL}")
print(f"Number of parameters: {abstractive_summarizer.get_model_size():,}")

In [None]:
# Generate abstractive summary
tokenizer = AutoTokenizer.from_pretrained(ABSTRACTIVE_MODEL)
abstractive_summary = abstractive_summarizer.summarize_text(
    texts=[sample_text],
    tokenizer=tokenizer,
    max_length=150,
    min_length=50,
    do_sample=False
)[0]

print("Abstractive Summary:")
print(abstractive_summary)
print("Length:", len(abstractive_summary.split()))

In [None]:
# Compare all summaries
print("Comparison of All Summaries:")
print("-" * 50)
print(f"Original Text: {len(sample_text.split())} words")
print("-" * 50)
print(f"TextRank Summary: {len(textrank_summary.split())} words")
print(f"LSA Summary: {len(lsa_summary.split())} words")
print(f"Luhn Summary: {len(luhn_summary.split())} words")
print(f"Abstractive Summary: {len(abstractive_summary.split())} words")

# Calculate ROUGE scores between the abstractive summary and each extractive summary
# This gives a rough measure of how much the two approaches overlap
# (rouge_metrics was already imported in the setup cell)
extractive_summaries = [textrank_summary, lsa_summary, luhn_summary]
extractive_methods = ["TextRank", "LSA", "Luhn"]
for method, summary in zip(extractive_methods, extractive_summaries):
    rouge_scores = rouge_metrics([abstractive_summary], [summary])
    print(f"\nROUGE scores between Abstractive and {method}:")
    print(f"  ROUGE-1: {rouge_scores['rouge1']:.4f}")
    print(f"  ROUGE-2: {rouge_scores['rouge2']:.4f}")
    print(f"  ROUGE-L: {rouge_scores['rougeL']:.4f}")
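
For intuition about what these numbers mean, ROUGE-1 can be approximated by hand as unigram overlap between two texts. The sketch below is a simplified version (lowercasing and whitespace tokenization only, no stemming) and is not the toolkit's `rouge_metrics`; the name `rouge1_f1` is hypothetical.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Simplified ROUGE-1 F1: clipped unigram overlap between two texts."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Counter intersection keeps the minimum count per token (clipped matches)
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())  # share of candidate tokens matched
    recall = overlap / sum(ref_counts.values())      # share of reference tokens matched
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(f"{score:.4f}")  # 0.8333
```

Here five of the six tokens match on both sides, so precision and recall are both 5/6. ROUGE-2 applies the same idea to bigrams, and ROUGE-L uses the longest common subsequence instead of n-gram counts.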

## 4. Data Loading for Summarization

In [None]:
# Initialize tokenizer and preprocessor
tokenizer = AutoTokenizer.from_pretrained(ABSTRACTIVE_MODEL)
preprocessor = TextPreprocessor()

# Create dataset loader
dataset_loader = get_summarization_loader(
    tokenizer=tokenizer,
    preprocessor=preprocessor,
    max_input_length=MAX_INPUT_LENGTH,
    max_output_length=MAX_OUTPUT_LENGTH
)

In [None]:
# Load a small subset of the CNN/DailyMail dataset
# Limit to 100 training and 50 validation examples to keep the notebook's running time reasonable
print("Loading dataset...")
from datasets import load_dataset
dataset = load_dataset(DATASET_NAME, DATASET_VERSION, split="train[:100]")
validation_dataset = load_dataset(DATASET_NAME, DATASET_VERSION, split="validation[:50]")

# Combine into expected format
combined_dataset = {
    "train": dataset,
    "validation": validation_dataset
}

# Display dataset information
print(f"Dataset: {DATASET_NAME}")
print(f"Number of training examples: {len(combined_dataset['train'])}")
print(f"Number of validation examples: {len(combined_dataset['validation'])}")

# Show example data
print("\nExample data:")
example = combined_dataset["train"][0]
print(f"Article: {example['article'][:200]}...")
print(f"Highlights (Summary): {example['highlights']}")
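
With these subset sizes and `BATCH_SIZE = 4` from the configuration, the number of batches per pass over the data can be worked out ahead of time. The helper `num_batches` below is a hypothetical illustration; it mirrors how a PyTorch `DataLoader` reports its length, where the last partial batch is kept unless `drop_last=True`.

```python
import math


def num_batches(num_examples, batch_size, drop_last=False):
    """Number of batches needed to cover num_examples items."""
    if drop_last:
        # The incomplete final batch is discarded
        return num_examples // batch_size
    # The incomplete final batch is kept
    return math.ceil(num_examples / batch_size)


print(num_batches(100, 4))                 # 25
print(num_batches(50, 4))                  # 13
print(num_batches(50, 4, drop_last=True))  # 12
```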

In [None]:
# Preprocess the dataset
def preprocess_cnn_dailymail(examples):
    """Preprocess CNN/DailyMail dataset examples for summarization."""
    # Batched examples already arrive as lists of strings
    inputs = examples["article"]
    targets = examples["highlights"]
    
    # Tokenize inputs and targets
    model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, padding="max_length", truncation=True)
    
    # Tokenize targets via text_target, which replaces the deprecated
    # tokenizer.as_target_tokenizer() context manager in recent transformers releases
    labels = tokenizer(text_target=targets, max_length=MAX_OUTPUT_LENGTH, padding="max_length", truncation=True)
    
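    # Replace pad token ids with -100 so padded positions are ignored by the loss
    model_inputs["labels"] = [
        [token if token != tokenizer.pad_token_id else -100 for token in seq]
        for seq in labels["input_ids"]
    ]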
    
    return model_inputs

# Apply preprocessing
print("Preprocessing data...")
tokenized_train_dataset = combined_dataset["train"].map(
    preprocess_cnn_dailymail, batched=True, 
    remove_columns=["article", "highlights", "id"]
)
tokenized_val_dataset = combined_dataset["validation"].map(
    preprocess_cnn_dailymail, batched=True, 
    remove_columns=["article", "highlights", "id"]
)

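# Return PyTorch tensors so the default collate function builds (batch, seq_len) tensors
tokenized_train_dataset.set_format(type="torch")
tokenized_val_dataset.set_format(type="torch")

# Create PyTorch DataLoaders
from torch.utils.data import DataLoader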

train_dataloader = DataLoader(
    tokenized_train_dataset, 
    batch_size=BATCH_SIZE, 
    shuffle=True
)
val_dataloader = DataLoader(
    tokenized_val_dataset, 
    batch_size=BATCH_SIZE
)

print(f"Training batches: {len(train_dataloader)}")
print(f"Validation batches: {len(val_dataloader)}")
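
The preprocessing step refers to replacing padding token ids with -100 in the labels. This matters because -100 is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so padded positions do not contribute to the training loss. The sketch below shows the masking on plain Python lists; `mask_padding`, the token ids, and the pad id of 1 are illustrative assumptions, and in practice the pad id should be read from `tokenizer.pad_token_id`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_padding(label_ids, pad_token_id):
    """Replace padding ids with IGNORE_INDEX in a batch of label sequences."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]


# Two made-up label sequences padded to length 5 with an assumed pad id of 1
batch = [
    [0, 8774, 2, 1, 1],
    [0, 31414, 232, 2, 1],
]
print(mask_padding(batch, pad_token_id=1))
# [[0, 8774, 2, -100, -100], [0, 31414, 232, 2, -100]]
```

Without this masking, the model is also trained to predict padding tokens, which tends to make the reported loss look better than the quality of the generated summaries.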