# LLM Text Summarizer: Fine-Tuning T5-small on CNN/DailyMail

**Goal:** Fine-tune a pre-trained T5-small model to perform abstractive text summarization, evaluating performance using the ROUGE metric.

**Environment:** Google Colab

---
## Install Dependencies & Set Global Variables
The following packages are essential for the NLP pipeline: `transformers` for the model, `datasets` for data handling, and `evaluate` for metrics.

In [None]:
# Install packages
!pip install -U transformers datasets evaluate accelerate sentencepiece rouge-score nltk


In [None]:
#Setup and Reproducibility
import os
from pathlib import Path
import random
import numpy as np
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

In [None]:
# Global Project Setup
PROJECT_DIR = Path('/content/text_summarizer_project_final')
PROJECT_DIR.mkdir(exist_ok=True)

In [None]:
# Reproducibility
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

In [None]:
# Data sizes
TRAIN_SIZE = 500
VAL_SIZE = 200

print('Project dir:', PROJECT_DIR)
print('TRAIN_SIZE=', TRAIN_SIZE, 'VAL_SIZE=', VAL_SIZE)
print('Using GPU:', torch.cuda.is_available(), 'Device count:', torch.cuda.device_count())

# Data Preparation and Preprocessing

## Load and Subset the Dataset
We use the **CNN/DailyMail** dataset, a gold-standard benchmark for summarization, which contains news articles and corresponding professional human-written summaries (highlights). We select a small subset for rapid fine-tuning.

In [None]:
#Download and Prepare Dataset
print('Downloading CNN/DailyMail (may take a few minutes)...')
raw = load_dataset('cnn_dailymail', '3.0.0')
train_ds = raw['train'].select(range(TRAIN_SIZE))
val_ds = raw['validation'].select(range(VAL_SIZE))
print('Loaded subsets: train=', len(train_ds), 'val=', len(val_ds))
print('Sample keys:', train_ds.column_names)
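
Taking the first `TRAIN_SIZE` rows is reproducible, but it only ever sees the start of the split. As a minimal sketch of an alternative, the helper below draws a seeded random set of row indices. `sample_indices` is a hypothetical name introduced here for illustration and is not part of `datasets`; its output would be passed to `.select(...)`.

```python
import random

def sample_indices(n_total, n_sample, seed=42):
    """Return a sorted, reproducible random sample of row indices (hypothetical helper)."""
    if n_sample > n_total:
        raise ValueError("n_sample cannot exceed n_total")
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    return sorted(rng.sample(range(n_total), n_sample))

# The same seed gives the same subset on every run
idx_a = sample_indices(10_000, 5, seed=42)
idx_b = sample_indices(10_000, 5, seed=42)
print(idx_a == idx_b)

# Possible use in this notebook (not executed here):
# train_ds = raw['train'].select(sample_indices(len(raw['train']), TRAIN_SIZE, seed))
```

`datasets` also offers `Dataset.shuffle(seed=...)`, which can be chained with `.select(range(TRAIN_SIZE))` for a similar effect.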

## Tokenization Strategy
We load the `t5-small` tokenizer. T5 is a sequence-to-sequence model that was trained with task prefixes, so **`summarize:`** is prepended to each article to condition the model on the summarization task. Inputs are truncated to 512 tokens and targets to 128 tokens to bound GPU memory usage.

In [None]:
#Tokenizer setup
MODEL_NAME = 't5-small'

print('Loading tokenizer:', MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

In [None]:
# Preprocessing function
def preprocess_function(examples, max_input_length=512, max_target_length=128):
    inputs = examples['article']
    # Add prefix for T5 models (e.g., 'summarize: ')
    inputs = [f"summarize: {text}" for text in inputs]

    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Tokenize the targets via text_target (as_target_tokenizer() is deprecated)
    labels = tokenizer(text_target=examples['highlights'], max_length=max_target_length, truncation=True)

    model_inputs['labels'] = labels['input_ids']
    return model_inputs
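
To make the interaction between the task prefix and truncation concrete, here is a toy sketch. It assumes a whitespace split as a stand-in for the real SentencePiece tokenizer, and `toy_encode` and `toy_preprocess` are hypothetical names. The real tokenizer produces subword ids and appends an end-of-sequence token, which this sketch omits.

```python
def toy_encode(text, max_length):
    """Whitespace stand-in for a tokenizer: split on spaces, keep at most max_length tokens."""
    return text.split()[:max_length]

def toy_preprocess(article, summary, max_input_length=8, max_target_length=4,
                   prefix="summarize: "):
    # The prefix is added before truncation, so it uses part of the input budget
    model_input = toy_encode(prefix + article, max_input_length)
    labels = toy_encode(summary, max_target_length)
    return {"input_ids": model_input, "labels": labels}

example = toy_preprocess(
    article="The city council approved the new transit budget after a long debate on Monday",
    summary="Council approves transit budget after debate",
)
print(example["input_ids"])  # starts with the task prefix, cut off after 8 tokens
print(example["labels"])     # summary cut off after 4 tokens
```

Because the prefix counts toward `max_input_length`, the article itself gets slightly less than the full budget.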

In [None]:
# Quick EDA
def avg_len(ds, key='article'):
    return np.mean([len(x.split()) for x in ds[key]])

print('Avg article words (train):', avg_len(train_ds, 'article'))
print('Avg summary words (train):', avg_len(train_ds, 'highlights'))
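
The mean alone can hide a long tail of very long articles, which matters when choosing `max_input_length`. The sketch below adds a median, a nearest-rank 95th percentile, and a maximum. `length_summary` is a hypothetical helper, and it counts whitespace-separated words, which are usually fewer than subword tokens.

```python
import statistics

def length_summary(texts):
    """Word-count statistics for a list of strings (whitespace split, not tokenizer tokens)."""
    lengths = sorted(len(t.split()) for t in texts)
    # Nearest-rank 95th percentile: ceil(0.95 * n), computed with integer arithmetic
    rank = max(1, -(-95 * len(lengths) // 100))
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "p95": lengths[rank - 1],
        "max": lengths[-1],
    }

docs = ["a b c", "a b", "a b c d e", "a"]
print(length_summary(docs))

# Possible use in this notebook (not executed here):
# print(length_summary(list(train_ds['article'])))
```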

In [None]:
#Tokenize Datasets
print('Tokenizing...')
train_tok = train_ds.map(preprocess_function, batched=True, remove_columns=train_ds.column_names)
val_tok = val_ds.map(preprocess_function, batched=True, remove_columns=val_ds.column_names)
print('Tokenization complete. Columns:', train_tok.column_names)

# Model Fine-Tuning and Evaluation

## Configure Trainer and Metrics
We initialize the `t5-small` model and configure the `Seq2SeqTrainer` with specific arguments:
* **`eval_strategy='epoch'`**: Evaluation runs at the end of each epoch to track progress. (This argument was named `evaluation_strategy` in older `transformers` releases.)
* **`fp16=use_fp16`**: Enables 16-bit mixed precision when a CUDA GPU is available, for faster training and a reduced memory footprint.
* **`report_to='none'`**: Prevents errors by disabling automatic logging to external tools like Weights & Biases.
* **ROUGE Metric**: The `compute_metrics` function handles post-processing (decoding tokens and handling padding) and calculates ROUGE scores, which measure the overlap between the generated summary and the reference summary.
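
To show what the ROUGE overlap scores measure, here is a simplified pure-Python sketch of the F1 variants of ROUGE-1, ROUGE-2, and ROUGE-L. It assumes lowercased whitespace tokens. The `evaluate` metric applies its own tokenization and options, so its numbers may differ from this illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def f1(overlap, pred_total, ref_total):
    if overlap == 0 or pred_total == 0 or ref_total == 0:
        return 0.0
    precision, recall = overlap / pred_total, overlap / ref_total
    return 2 * precision * recall / (precision + recall)

def rouge_n(pred, ref, n):
    p, r = ngrams(pred.lower().split(), n), ngrams(ref.lower().split(), n)
    overlap = sum((p & r).values())  # clipped n-gram matches
    return f1(overlap, sum(p.values()), sum(r.values()))

def lcs_length(a, b):
    """Length of the longest common subsequence, one DP row at a time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(pred, ref):
    p, r = pred.lower().split(), ref.lower().split()
    return f1(lcs_length(p, r), len(p), len(r))

pred = "the cat sat on the mat"
ref = "the cat lay on the mat"
# 5 of 6 unigrams match, 3 of 5 bigrams match, and the LCS has length 5
print(round(rouge_n(pred, ref, 1), 4), round(rouge_n(pred, ref, 2), 4), round(rouge_l(pred, ref), 4))
```

A single substituted word costs one unigram but two bigrams, which is why ROUGE-2 is typically the lowest of the three scores.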

In [None]:
#Model and Trainer Setup
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq
import evaluate

MODEL_NAME = 't5-small'
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

In [None]:
# Data collator
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [None]:
# Use fp16 only if CUDA available
use_fp16 = torch.cuda.is_available()

training_args = Seq2SeqTrainingArguments(
    output_dir=str(PROJECT_DIR / 'results'),
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    predict_with_generate=True,
    logging_steps=50,
    # 'eval_strategy' replaces the older 'evaluation_strategy' argument name
    eval_strategy='epoch',
    save_strategy='epoch',
    num_train_epochs=2,
    save_total_limit=1,
    fp16=use_fp16,
    learning_rate=5e-5,
    weight_decay=0.01,
    # Disable wandb/external logging integrations to avoid login errors
    report_to='none',
)

rouge = evaluate.load('rouge')

def compute_metrics(eval_preds):
    preds, labels = eval_preds
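    if isinstance(preds, tuple):
        preds = preds[0]
    # Generated ids can be padded with -100 when batches are gathered; restore the pad id before decoding
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)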

    # convert labels to numpy and replace -100
    labels = np.array(labels)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels)
    # Format the results as percentages
    return {k: round(v * 100, 4) for k, v in result.items()}

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_tok,
    eval_dataset=val_tok,
    processing_class=tokenizer,  # replaces the deprecated 'tokenizer' argument (transformers >= 4.46)
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

print('Trainer ready. To start training run the next cell:')
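
The `-100` replacement in `compute_metrics` exists because `DataCollatorForSeq2Seq` pads labels with `-100`, a value the loss ignores but the tokenizer cannot decode. A minimal sketch of the same idea on plain lists follows. `restore_padding` is a hypothetical helper, and the pad id of `0` is an assumption for this sketch, since the notebook reads it from `tokenizer.pad_token_id`.

```python
IGNORE_INDEX = -100  # label value the loss function skips
PAD_TOKEN_ID = 0     # assumption for this sketch; the notebook uses tokenizer.pad_token_id

def restore_padding(label_batch, pad_token_id=PAD_TOKEN_ID):
    """Replace ignored positions with the pad id so every id can be decoded."""
    return [
        [pad_token_id if tok == IGNORE_INDEX else tok for tok in seq]
        for seq in label_batch
    ]

batch = [[37, 1712, 1, -100, -100], [8, 1, -100, -100, -100]]
print(restore_padding(batch))
```

Decoding with `skip_special_tokens=True` then drops the restored pad tokens from the text.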

## Execute Fine-Tuning
We train the model on the 500-sample subset for 2 epochs. Evaluation metrics are reported at the end of each epoch.

In [None]:
#Start Training
train_result = trainer.train()
trainer.save_model(str(PROJECT_DIR / 'saved_t5_small'))
print('Training finished. Model saved to', PROJECT_DIR / 'saved_t5_small')
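
As a rough check on how long training should run, the number of optimizer steps can be estimated from the settings above. This sketch assumes a single device, no gradient accumulation, and that the last partial batch is kept; `training_steps` is a hypothetical helper, not a `transformers` function.

```python
import math

def training_steps(num_examples, per_device_batch_size, num_epochs,
                   num_devices=1, grad_accum_steps=1):
    """Estimate (steps per epoch, total steps) for a simple training setup."""
    effective_batch = per_device_batch_size * num_devices * grad_accum_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch, steps_per_epoch * num_epochs

per_epoch, total = training_steps(num_examples=500, per_device_batch_size=4, num_epochs=2)
print(per_epoch, total)  # 125 250
```

With `logging_steps=50`, 250 total steps would correspond to five logged loss values over the run.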

# Final Evaluation and Output

## Generate Example Summaries for Review
We load the final fine-tuned model and generate summaries for a few validation articles using **Beam Search (`num_beams=4`)**. This step is crucial for visual inspection and demonstrating model quality. The results are saved to `examples.csv`.

In [None]:
#Inference and Examples CSV
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# For inference use the saved model if available; otherwise use pretrained
try:
    # Use the saved model and tokenizer after training
    tok_inf = AutoTokenizer.from_pretrained(str(PROJECT_DIR / 'saved_t5_small'))
    model_inf = AutoModelForSeq2SeqLM.from_pretrained(str(PROJECT_DIR / 'saved_t5_small'))
    print('Loaded saved model for inference.')
except Exception:
    print('Saved model not found, using pretrained', MODEL_NAME)
    tok_inf = AutoTokenizer.from_pretrained(MODEL_NAME)
    model_inf = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

examples = val_ds.select(range(8))
rows = []
for e in examples:
    # IMPORTANT: Prepend 'summarize: ' for T5 inference
    input_text = f"summarize: {e['article']}"

    inputs = tok_inf(input_text, return_tensors='pt', truncation=True, max_length=1024)
    # Move inputs to GPU if model is on GPU
    if torch.cuda.is_available():
        inputs = {k: v.to(model_inf.device) for k, v in inputs.items()}

    out = model_inf.generate(**inputs, max_length=130, min_length=30, num_beams=4)
    pred = tok_inf.decode(out[0], skip_special_tokens=True)

    # Only save first 400 chars of the article to keep the CSV clean
    rows.append({'article': e['article'][:400] + '...', 'reference': e['highlights'], 'prediction': pred})

df = pd.DataFrame(rows)
examples_csv = PROJECT_DIR / 'examples.csv'
df.to_csv(str(examples_csv), index=False)
print('Saved examples to', examples_csv)
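
To illustrate why beam search can beat greedy decoding, here is a toy sketch over a hand-written next-token table. `TOY_MODEL` and its probabilities are invented for illustration. The implementation is deliberately simplified (no length penalty, and finished hypotheses leave the beam), so it is not how `generate` works internally.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.4, "dog": 0.35, "</s>": 0.25},
    "a": {"cat": 0.9, "</s>": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def beam_search(num_beams, max_length=5):
    """Return (tokens, probability) of the best sequence found."""
    beams = [(0.0, ["<s>"])]  # (cumulative log-probability, tokens)
    finished = []
    for _ in range(max_length):
        candidates = []
        for score, tokens in beams:
            for token, prob in TOY_MODEL[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        # Keep only the num_beams highest-scoring continuations
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:num_beams]:
            if cand[1][-1] == "</s>":
                finished.append(cand)
            else:
                beams.append(cand)
        if not beams:
            break
    best = max(finished + beams, key=lambda c: c[0])
    return best[1], math.exp(best[0])

print(beam_search(num_beams=1))  # greedy: commits to "the" and ends with probability 0.24
print(beam_search(num_beams=2))  # keeps "a" alive and finds a sequence with probability 0.36
```

Greedy decoding picks the locally best first token and misses the better overall sequence; a wider beam trades extra computation for a chance to recover it.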

## Compute Final ROUGE Score on Validation Set
Computing ROUGE over 50 validation articles, a larger sample than the 8 examples above, provides the key metric for the project write-up.

In [None]:
#Evaluate ROUGE on 50 validation samples
dataset_50 = raw['validation'].select(range(50))
print('Computing predictions for 50 samples...')
preds = []
refs = []

In [None]:
# Ensure the model is in eval mode and on the correct device
model_inf.eval()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model_inf.to(device)

for s in dataset_50:
    # IMPORTANT: Prepend 'summarize: ' for T5 inference
    input_text = f"summarize: {s['article']}"

    inputs = tok_inf(input_text, return_tensors='pt', truncation=True, max_length=1024)
    # Move inputs to device
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        out = model_inf.generate(**inputs, max_length=130, min_length=30, num_beams=4)
        preds.append(tok_inf.decode(out[0], skip_special_tokens=True))
        refs.append(s['highlights'])

res = rouge.compute(predictions=preds, references=refs)
# Print ROUGE scores in the desired percentage format
print('ROUGE results (percent):', {k: round(v*100, 4) for k, v in res.items()})
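
Generating one article at a time is simple but leaves the GPU underused. A possible refinement is to group prompts into small batches, sketched here with a hypothetical `chunked` helper on a plain list of strings. The tokenizer and model calls appear only as comments because they depend on the notebook state.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

articles = [f"article {i}" for i in range(10)]  # placeholder texts
batch_sizes = []
for batch in chunked(articles, batch_size=4):
    prompts = [f"summarize: {text}" for text in batch]
    # In the notebook, each list of prompts could be tokenized with padding=True
    # and passed to model_inf.generate(...) in a single call.
    batch_sizes.append(len(prompts))

print(batch_sizes)  # [4, 4, 2]
```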

## Prepare Final Output Folder
This final step organizes the portfolio files (the notebook and `examples.csv`) into a single folder while **excluding** the large model weights, which keeps the GitHub repository small.

In [None]:
#Prepare Minimal Final Outputs
import shutil
final_folder = Path('/content/final_outputs')
final_folder.mkdir(exist_ok=True)

# Define the expected notebook path in Colab
notebook_name = 'Project1_Text_Summarizer_COLAB_RUNABLE_FOR_GITHUB.ipynb'
notebook_path_in_runtime = '/content/' + notebook_name

if Path(notebook_path_in_runtime).exists():
    shutil.copy(notebook_path_in_runtime, final_folder)
    print('Copied notebook to', final_folder)
else:
    # Fallback to copy the current running notebook file (if the name is consistent)
    try:
        # This only works in some Colab environments if you know the exact notebook name
        shutil.copy('/content/Project1_Text_Summarizer_COLAB_RUNABLE.ipynb', final_folder)
        print('Copied notebook (via fallback name) to', final_folder)
    except FileNotFoundError:
        print(f'Notebook path not found. Please rename your uploaded file to "{notebook_name}" or download the current notebook manually from File->Download .ipynb')

In [None]:
# Copy examples.csv
examples_src = PROJECT_DIR / 'examples.csv'
if examples_src.exists():
    shutil.copy(str(examples_src), final_folder)
    print('Copied examples.csv to', final_folder)
else:
    print('examples.csv not found; run inference cell first to create it')

print('\n' + '-'*60)
print('FINAL STEP: Download the /content/final_outputs folder (right-click in Colab files pane) and upload contents to GitHub.')
print('-'*60)
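
Before uploading, it can help to confirm that no large file slipped into the output folder. The sketch below lists files above a size threshold. `large_files` is a hypothetical helper, and the 50 MB default is an assumption chosen to stay well under typical repository file-size limits.

```python
from pathlib import Path
import tempfile

def large_files(folder, max_bytes=50 * 1024 * 1024):
    """Return (relative path, size in bytes) for every file larger than max_bytes."""
    folder = Path(folder)
    return sorted(
        (str(p.relative_to(folder)), p.stat().st_size)
        for p in folder.rglob("*")
        if p.is_file() and p.stat().st_size > max_bytes
    )

# Demonstration on a temporary folder with a tiny threshold
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "examples.csv").write_text("article,reference,prediction\n")
    (Path(tmp) / "weights.bin").write_bytes(b"\0" * 2048)
    print(large_files(tmp, max_bytes=1024))

# Possible use in this notebook (not executed here):
# print(large_files('/content/final_outputs'))
```

An empty list for `/content/final_outputs` would indicate that only small files are present.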


## Model Deployment Preparation (Hugging Face Hub)

As the final step, we ensure the fine-tuned model and tokenizer are accessible via the Hugging Face Model Hub. This step validates the entire MLOps workflow and is required to deploy a free, interactive demo using Hugging Face Spaces.

In [None]:
#Hugging Face Login
!pip install -q huggingface_hub

from huggingface_hub import notebook_login
# This will prompt you to enter your Hugging Face token
notebook_login()

In [None]:
#Push Model and Tokenizer to the Hub
hf_repo_name = "benjose51/t5-small-cnn-dailymail-summarizer"

# Push the model weights
model.push_to_hub(hf_repo_name)

# Push the tokenizer config
tokenizer.push_to_hub(hf_repo_name)

print("✅ Deployment preparation complete.")
print(f"Your model is now accessible on the Hub at: https://huggingface.co/{hf_repo_name}")

# 🏆 Project Conclusion & Business Impact

The primary objective of this project was to produce abstractive text summaries with a compact model. Fine-tuning T5-small on a 500-article subset of CNN/DailyMail for 2 epochs yielded a computationally efficient summarizer and exercised an end-to-end MLOps workflow, from data preparation through evaluation to deployment.

---

## Key Performance Metrics (ROUGE Score)

The model's performance was evaluated using **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)**, the standard for summarization. The scores below reflect the metrics obtained after **2 epochs** of fine-tuning.

| Metric | Score (Fine-Tuned Model) | Technical Implication |
| :--- | :--- | :--- |
| **ROUGE-1** | **31.5125%** | Measures unigram overlap, indicating the presence of key topics and keywords. |
| **ROUGE-2** | **14.2222%** | Measures bigram overlap, indicating phrase structure and better coherence. |
| **ROUGE-L** | **23.2637%** | Measures the longest common subsequence, indicating fluency and content preservation. |

---

## 🚀 Deployment and Business Achievements

### 1. Model Deployment
The final model is deployed and publicly accessible, validating the project's production readiness:

* **Platform:** Hugging Face Spaces (using Gradio)
* **Live Demo Link:** [T5-CNN-DailyMail-Summarizer-Demo](https://huggingface.co/spaces/benjose51/T5-CNN-DailyMail-Summarizer-Demo)

### 2. Computational Efficiency
* **Model Choice:** Selected the **T5-small** model to prioritize fast iteration and cost-effective deployment, maximizing **Return on Investment (ROI)** for high-volume inference.
* **Resource Optimization:** Used **mixed-precision training (`fp16`)** when a GPU is available, for faster training and a reduced memory footprint.

### 3. Technical Mastery
* Implemented an end-to-end Hugging Face `transformers` pipeline, including correct setup of the `Seq2SeqTrainer` and integration of the `evaluate` library.
* Employed **beam search (`num_beams=4`)** during inference, which keeps several candidate summaries in play at each step and tends to produce more fluent output than greedy decoding.

---
## üßë‚Äçüíª Author and Contact Information

This project was developed by:

* **Name:** BEN JOSE
* **LinkedIn:** [https://www.linkedin.com/in/ben-jose-aa9537190/](https://www.linkedin.com/in/ben-jose-aa9537190/)
* **GitHub:** [https://github.com/BENJOSE51](https://github.com/BENJOSE51)
* **Email:** benjose51@gmail.com