# BART Text Summarization Project

This notebook implements a complete BART (Bidirectional and Auto-Regressive Transformers) text summarization pipeline following the methodology from the midterm presentation slides.

## Methodology Overview
1. **Collect and Select Dataset** - CNN/DailyMail dataset
2. **Preprocess text** - Tokenization and cleaning
3. **Fine-tune BART** - Configure hyperparameters (max_len, min_len, temp, num_beams, length_penalty)
4. **Generate summaries** - Use fine-tuned model to create summaries
5. **Evaluate** - Use ROUGE scores to assess model performance


## Section 1: Setup and Imports


In [5]:
# Import necessary libraries
import torch
import pandas as pd
import numpy as np
from transformers import (
    BartForConditionalGeneration,
    BartTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForSeq2Seq
)
from datasets import Dataset
from rouge_score import rouge_scorer
import os
import random
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(42)

# Set up device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA Version: {torch.version.cuda}")


  from .autonotebook import tqdm as notebook_tqdm


Using device: cpu


## Section 2: Dataset Loading and Exploration


In [6]:
# Load datasets from CSV files
print("Loading datasets...")
train_df = pd.read_csv('cnn_dailymail/train.csv')
val_df = pd.read_csv('cnn_dailymail/validation.csv')
test_df = pd.read_csv('cnn_dailymail/test.csv')

print(f"Training samples: {len(train_df)}")
print(f"Validation samples: {len(val_df)}")
print(f"Test samples: {len(test_df)}")

# Display dataset structure
print("\nDataset columns:", train_df.columns.tolist())
print("\nFirst few rows:")
train_df.head()


Loading datasets...
Training samples: 287113
Validation samples: 13368
Test samples: 11490

Dataset columns: ['id', 'article', 'highlights']

First few rows:


Unnamed: 0,id,article,highlights
0,0001d1afc246a7964130f43ae940af6bc6c57f01,By . Associated Press . PUBLISHED: . 14:11 EST...,"Bishop John Folda, of North Dakota, is taking ..."
1,0002095e55fcbd3a2f366d9bf92a95433dc305ef,(CNN) -- Ralph Mata was an internal affairs li...,Criminal complaint: Cop used his role to help ...
2,00027e965c8264c35cc1bc55556db388da82b07f,A drunk driver who killed a young woman in a h...,"Craig Eccleston-Todd, 27, had drunk at least t..."
3,0002c17436637c4fe1837c935c04de47adb18e9a,(CNN) -- With a breezy sweep of his pen Presid...,Nina dos Santos says Europe must be ready to a...
4,0003ad6ef0c37534f80b55b4235108024b407f0b,Fleetwood are the only team still to have a 10...,Fleetwood top of League One after 2-0 win at S...


In [7]:
# Explore dataset characteristics
def get_text_lengths(text):
    """Get word and character counts for text"""
    if pd.isna(text):
        return 0, 0
    words = text.split()
    return len(words), len(text)

# Calculate statistics
train_article_lengths = train_df['article'].apply(lambda x: get_text_lengths(x)[0])
train_summary_lengths = train_df['highlights'].apply(lambda x: get_text_lengths(x)[0])

print("Training Set Statistics:")
print(f"Article length - Mean: {train_article_lengths.mean():.1f}, Median: {train_article_lengths.median():.1f}, Max: {train_article_lengths.max()}")
print(f"Summary length - Mean: {train_summary_lengths.mean():.1f}, Median: {train_summary_lengths.median():.1f}, Max: {train_summary_lengths.max()}")

# Display sample article and summary
print("\n" + "="*80)
print("Sample Article and Summary:")
print("="*80)
sample_idx = 0
print(f"\nArticle ID: {train_df.iloc[sample_idx]['id']}")
print(f"\nArticle (first 500 chars):\n{train_df.iloc[sample_idx]['article'][:500]}...")
print(f"\nSummary:\n{train_df.iloc[sample_idx]['highlights']}")


Training Set Statistics:
Article length - Mean: 691.9, Median: 632.0, Max: 2347
Summary length - Mean: 51.6, Median: 48.0, Max: 1296

Sample Article and Summary:

Article ID: 0001d1afc246a7964130f43ae940af6bc6c57f01

Article (first 500 chars):
By . Associated Press . PUBLISHED: . 14:11 EST, 25 October 2013 . | . UPDATED: . 15:36 EST, 25 October 2013 . The bishop of the Fargo Catholic Diocese in North Dakota has exposed potentially hundreds of church members in Fargo, Grand Forks and Jamestown to the hepatitis A virus in late September and early October. The state Health Department has issued an advisory of exposure for anyone who attended five churches and took communion. Bishop John Folda (pictured) of the Fargo Catholic Diocese in N...

Summary:
Bishop John Folda, of North Dakota, is taking time off after being diagnosed .
He contracted the infection through contaminated food in Italy .
Church members in Fargo, Grand Forks and Jamestown could have been exposed .


## Section 3: Text Preprocessing


In [8]:
# Load BART tokenizer
model_name = 'facebook/bart-large-cnn'  # Pre-trained on CNN/DailyMail
print(f"Loading tokenizer for {model_name}...")
tokenizer = BartTokenizer.from_pretrained(model_name)

# Display tokenizer info
print(f"Vocabulary size: {tokenizer.vocab_size}")
print(f"Max model length: {tokenizer.model_max_length}")


Loading tokenizer for facebook/bart-large-cnn...
Vocabulary size: 50265
Max model length: 1000000000000000019884624838656


In [9]:
# Text cleaning function
def clean_text(text):
    """Clean and normalize text"""
    if pd.isna(text):
        return ""
    # Remove extra whitespace
    text = ' '.join(text.split())
    return text

# Preprocessing function for tokenization
def preprocess_function(examples, max_input_length=1024, max_target_length=142):
    """
    Preprocess articles and highlights for BART
    - max_input_length: Maximum length for articles (BART's limit is 1024)
    - max_target_length: Maximum length for summaries (typical for CNN/DailyMail is 142)
    """
    # Clean text
    articles = [clean_text(article) for article in examples['article']]
    highlights = [clean_text(highlight) for highlight in examples['highlights']]
    
    # Tokenize inputs (articles)
    model_inputs = tokenizer(
        articles,
        max_length=max_input_length,
        truncation=True,
        padding='max_length'
    )
    
    # Tokenize targets (highlights/summaries)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            highlights,
            max_length=max_target_length,
            truncation=True,
            padding='max_length'
        )
    
    model_inputs['labels'] = labels['input_ids']
    return model_inputs

print("Preprocessing function defined.")


Preprocessing function defined.


In [10]:
# Convert DataFrames to HuggingFace Datasets
print("Converting to HuggingFace Datasets format...")
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)
test_dataset = Dataset.from_pandas(test_df)

# Apply preprocessing
print("Preprocessing datasets (this may take a few minutes)...")
train_dataset = train_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=train_dataset.column_names
)

val_dataset = val_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=val_dataset.column_names
)

# For test set, we'll keep the original text for evaluation
test_dataset_processed = test_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=test_dataset.column_names
)

print("Preprocessing complete!")
print(f"Train dataset size: {len(train_dataset)}")
print(f"Validation dataset size: {len(val_dataset)}")
print(f"Test dataset size: {len(test_dataset_processed)}")


Converting to HuggingFace Datasets format...
Preprocessing datasets (this may take a few minutes)...


Map: 100%|██████████| 287113/287113 [12:37<00:00, 379.21 examples/s]
Map: 100%|██████████| 13368/13368 [00:36<00:00, 369.93 examples/s]
Map: 100%|██████████| 11490/11490 [00:31<00:00, 366.04 examples/s]

Preprocessing complete!
Train dataset size: 287113
Validation dataset size: 13368
Test dataset size: 11490





In [11]:
# Create data collator for sequence-to-sequence tasks
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model_name,
    padding=True
)

print("Data collator created.")


Data collator created.


## Section 4: BART Model Fine-tuning

### Hyperparameter Configuration
Based on the methodology, we'll configure:
- `max_length`: Maximum length for generated summaries
- `min_length`: Minimum length for generated summaries  
- `temperature`: Sampling temperature
- `num_beams`: Number of beams for beam search
- `length_penalty`: Length penalty for beam search


In [12]:
# Configuration for generation hyperparameters
GENERATION_CONFIG = {
    'max_length': 142,      # Maximum summary length (typical for CNN/DailyMail)
    'min_length': 56,       # Minimum summary length
    'temperature': 1.0,     # Temperature for sampling (1.0 = deterministic)
    'num_beams': 4,         # Number of beams for beam search
    'length_penalty': 2.0,  # Length penalty (higher = longer summaries)
    'no_repeat_ngram_size': 3,  # Prevent repetition
    'early_stopping': True
}

print("Generation hyperparameters configured:")
for key, value in GENERATION_CONFIG.items():
    print(f"  {key}: {value}")


Generation hyperparameters configured:
  max_length: 142
  min_length: 56
  temperature: 1.0
  num_beams: 4
  length_penalty: 2.0
  no_repeat_ngram_size: 3
  early_stopping: True


In [13]:
# Load pre-trained BART model
print(f"Loading BART model: {model_name}...")
model = BartForConditionalGeneration.from_pretrained(model_name)
model.to(device)
print(f"Model loaded on {device}")
print(f"Model parameters: {sum(p.numel() for p in model.parameters())/1e6:.1f}M")


Loading BART model: facebook/bart-large-cnn...


Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


Model loaded on cpu
Model parameters: 406.3M


In [14]:
# Configure training arguments
training_args = TrainingArguments(
    output_dir='./bart-cnn-dailymail',
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=4,  # Adjust based on GPU memory
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch size = 4 * 4 = 16
    warmup_steps=500,
    learning_rate=3e-5,
    fp16=torch.cuda.is_available(),  # Use mixed precision if GPU available
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    save_total_limit=2,
    prediction_loss_only=True,
    remove_unused_columns=False,
    report_to="none"  # Disable wandb/tensorboard
)

print("Training arguments configured:")
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  Batch size: {training_args.per_device_train_batch_size}")
print(f"  Gradient accumulation: {training_args.gradient_accumulation_steps}")
print(f"  Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"  Learning rate: {training_args.learning_rate}")


Training arguments configured:
  Epochs: 3
  Batch size: 4
  Gradient accumulation: 4
  Effective batch size: 16
  Learning rate: 3e-05


In [15]:
# Create Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer
)

print("Trainer created. Ready to start fine-tuning!")


Trainer created. Ready to start fine-tuning!


In [None]:
# Start training
print("Starting fine-tuning...")
print("This may take several hours depending on your hardware.")
trainer.train()

# Save the final model
trainer.save_model('./bart-cnn-dailymail/final_model')
tokenizer.save_pretrained('./bart-cnn-dailymail/final_model')
print("\nFine-tuning complete! Model saved to './bart-cnn-dailymail/final_model'")


Starting fine-tuning...
This may take several hours depending on your hardware.


Epoch,Training Loss,Validation Loss


### Load Fine-tuned Model (if resuming)

If you've already fine-tuned the model and want to load it:


In [None]:
# Uncomment to load a previously fine-tuned model
# model_path = './bart-cnn-dailymail/final_model'
# model = BartForConditionalGeneration.from_pretrained(model_path)
# tokenizer = BartTokenizer.from_pretrained(model_path)
# model.to(device)
# print(f"Loaded fine-tuned model from {model_path}")


## Section 5: Summary Generation


In [None]:
def generate_summary(model, tokenizer, article, generation_config):
    """
    Generate a summary for a given article
    
    Args:
        model: Fine-tuned BART model
        tokenizer: BART tokenizer
        article: Input article text
        generation_config: Dictionary with generation hyperparameters
    
    Returns:
        Generated summary text
    """
    # Clean article text
    article = clean_text(article)
    
    # Tokenize input
    inputs = tokenizer(
        article,
        max_length=1024,
        truncation=True,
        padding='max_length',
        return_tensors='pt'
    ).to(device)
    
    # Generate summary
    with torch.no_grad():
        summary_ids = model.generate(
            inputs['input_ids'],
            attention_mask=inputs['attention_mask'],
            max_length=generation_config['max_length'],
            min_length=generation_config['min_length'],
            num_beams=generation_config['num_beams'],
            length_penalty=generation_config['length_penalty'],
            temperature=generation_config['temperature'],
            no_repeat_ngram_size=generation_config['no_repeat_ngram_size'],
            early_stopping=generation_config['early_stopping']
        )
    
    # Decode summary
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

print("Summary generation function defined.")


In [None]:
# Generate summaries for a sample of test articles
print("Generating summaries for test set...")
print("This may take a while depending on the number of test samples...")

# Use a subset for demonstration (you can change this to use the full test set)
num_test_samples = min(100, len(test_df))  # Generate for first 100 test samples
test_subset = test_df.head(num_test_samples)

generated_summaries = []
reference_summaries = []

for idx, row in tqdm(test_subset.iterrows(), total=len(test_subset), desc="Generating summaries"):
    article = row['article']
    reference = row['highlights']
    
    # Generate summary
    generated = generate_summary(model, tokenizer, article, GENERATION_CONFIG)
    
    generated_summaries.append(generated)
    reference_summaries.append(reference)

print(f"\nGenerated {len(generated_summaries)} summaries!")


In [None]:
# Display sample generated summaries
print("="*80)
print("Sample Generated Summaries")
print("="*80)

num_samples_to_show = 3
for i in range(min(num_samples_to_show, len(generated_summaries))):
    print(f"\n{'='*80}")
    print(f"Sample {i+1}")
    print(f"{'='*80}")
    print(f"\nArticle (first 300 chars):\n{test_subset.iloc[i]['article'][:300]}...")
    print(f"\n{'─'*80}")
    print(f"Reference Summary:\n{reference_summaries[i]}")
    print(f"\n{'─'*80}")
    print(f"Generated Summary:\n{generated_summaries[i]}")
    print(f"\n{'='*80}\n")


## Section 6: Evaluation Metrics

### ROUGE Scores

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores measure summary quality against reference texts. We'll calculate:
- **ROUGE-1**: Overlap of unigrams (single words)
- **ROUGE-2**: Overlap of bigrams (word pairs)
- **ROUGE-L**: Longest common subsequence


In [None]:
# Initialize ROUGE scorer
rouge_scorer_obj = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

def calculate_rouge_scores(generated_summaries, reference_summaries):
    """
    Calculate ROUGE-1, ROUGE-2, and ROUGE-L scores
    
    Args:
        generated_summaries: List of generated summary texts
        reference_summaries: List of reference summary texts
    
    Returns:
        Dictionary with average ROUGE scores
    """
    rouge1_scores = []
    rouge2_scores = []
    rougeL_scores = []
    
    for gen, ref in zip(generated_summaries, reference_summaries):
        scores = rouge_scorer_obj.score(ref, gen)
        rouge1_scores.append(scores['rouge1'].fmeasure)
        rouge2_scores.append(scores['rouge2'].fmeasure)
        rougeL_scores.append(scores['rougeL'].fmeasure)
    
    return {
        'rouge1': {
            'precision': np.mean([rouge_scorer_obj.score(ref, gen)['rouge1'].precision 
                                  for gen, ref in zip(generated_summaries, reference_summaries)]),
            'recall': np.mean([rouge_scorer_obj.score(ref, gen)['rouge1'].recall 
                              for gen, ref in zip(generated_summaries, reference_summaries)]),
            'fmeasure': np.mean(rouge1_scores)
        },
        'rouge2': {
            'precision': np.mean([rouge_scorer_obj.score(ref, gen)['rouge2'].precision 
                                  for gen, ref in zip(generated_summaries, reference_summaries)]),
            'recall': np.mean([rouge_scorer_obj.score(ref, gen)['rouge2'].recall 
                              for gen, ref in zip(generated_summaries, reference_summaries)]),
            'fmeasure': np.mean(rouge2_scores)
        },
        'rougeL': {
            'precision': np.mean([rouge_scorer_obj.score(ref, gen)['rougeL'].precision 
                                  for gen, ref in zip(generated_summaries, reference_summaries)]),
            'recall': np.mean([rouge_scorer_obj.score(ref, gen)['rougeL'].recall 
                              for gen, ref in zip(generated_summaries, reference_summaries)]),
            'fmeasure': np.mean(rougeL_scores)
        }
    }

print("ROUGE calculation function defined.")


In [None]:
# Calculate ROUGE scores
print("Calculating ROUGE scores...")
rouge_scores = calculate_rouge_scores(generated_summaries, reference_summaries)

# Display results
print("\n" + "="*80)
print("ROUGE Evaluation Results")
print("="*80)
print(f"\nNumber of samples evaluated: {len(generated_summaries)}")
print(f"\n{'Metric':<15} {'Precision':<12} {'Recall':<12} {'F-Measure':<12}")
print("-" * 80)
print(f"{'ROUGE-1':<15} {rouge_scores['rouge1']['precision']:<12.4f} {rouge_scores['rouge1']['recall']:<12.4f} {rouge_scores['rouge1']['fmeasure']:<12.4f}")
print(f"{'ROUGE-2':<15} {rouge_scores['rouge2']['precision']:<12.4f} {rouge_scores['rouge2']['recall']:<12.4f} {rouge_scores['rouge2']['fmeasure']:<12.4f}")
print(f"{'ROUGE-L':<15} {rouge_scores['rougeL']['precision']:<12.4f} {rouge_scores['rougeL']['recall']:<12.4f} {rouge_scores['rougeL']['fmeasure']:<12.4f}")
print("="*80)


### Human Evaluation

As mentioned in the presentation slides, human evaluation is another important metric for assessing subjective quality in summarization tasks. While automated metrics like ROUGE provide quantitative measures, human evaluation can assess:
- **Coherence**: How well the summary flows and makes sense
- **Fluency**: How natural and readable the summary is
- **Relevance**: How well the summary captures the key information
- **Coverage**: How comprehensively the summary covers the article

For this project, we focus on automated ROUGE evaluation. Human evaluation would typically involve having human annotators rate the generated summaries.


## Summary

This notebook has implemented a complete BART text summarization pipeline:

1. ✅ **Dataset Loading**: Loaded and explored the CNN/DailyMail dataset
2. ✅ **Preprocessing**: Tokenized and prepared the data for training
3. ✅ **Fine-tuning**: Fine-tuned BART with configurable hyperparameters
4. ✅ **Generation**: Generated summaries using the fine-tuned model
5. ✅ **Evaluation**: Calculated ROUGE scores to assess model performance

The fine-tuned model can now be used to generate summaries for new articles!
