<a href="https://colab.research.google.com/github/ShraddhaSharma24/Natural-Language-Processing/blob/main/Abstractive_Text_Summarization_using_T5_BART_on_XSum_with_ROUGE_Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Project Summary: Abstractive Text Summarization with Pre-trained Transformers**

This project implements an abstractive text summarization pipeline with pre-trained transformer models (T5 and BART). It uses the XSum dataset, loaded through the Hugging Face `datasets` library, to generate concise, human-like summaries of single-document news articles. The generated summaries are evaluated against the reference summaries with ROUGE and BERTScore, and fine-tuning was explored as a route to better performance. The mini-project is intended as a building block for knowledge-enhanced text generation in high-stakes NLP applications.

In [2]:
!pip install datasets transformers bert-score evaluate




In [3]:
from datasets import load_dataset

# Trust remote code (for XSum custom loader)
xsum = load_dataset("xsum", split="train[:500]", trust_remote_code=True)

# Convert to list of articles and summaries
articles = xsum["document"]
summaries = xsum["summary"]




[Download and split-generation progress bars omitted. Split sizes: train 204045, validation 11332, test 11334 examples.]

In [4]:
from textwrap import wrap

from transformers import pipeline

# Load a summarization model (t5-small is small and fast)
summarizer = pipeline("summarization", model="t5-small", tokenizer="t5-small")

def summarize_article(article):
    chunk_chars = 500        # chunk width in characters, as used by textwrap.wrap
    min_summary_ratio = 0.3  # aim for at least 30% of the chunk's word count
    max_summary_ratio = 0.5  # cap the summary at 50% of the chunk's word count

    # Split long articles into chunks of at most `chunk_chars` characters.
    # Note: wrap() breaks on whitespace and ignores sentence boundaries.
    chunks = wrap(article, chunk_chars)

    full_summary = []
    for chunk in chunks:
        # Word count is only a rough proxy: max_length/min_length are in tokens.
        input_length = len(chunk.split())

        # Upper bound: 50% of the input, clamped to the range [10, 200]
        max_len = max(10, min(int(input_length * max_summary_ratio), 200))

        # Guard against the summary budget exceeding the input length
        if input_length < max_len:
            max_len = max(2, input_length - 1)

        # Lower bound: 30% of the input, at least 5, and always below max_len
        min_len = min(max(5, int(input_length * min_summary_ratio)), max_len - 1)

        summary = summarizer(
            chunk,
            max_length=max_len,
            min_length=min_len,
            do_sample=False,
            truncation=True,
        )[0]["summary_text"]
        full_summary.append(summary)

    return " ".join(full_summary)




Device set to use cuda:0
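
The T5 cell above splits articles with `textwrap.wrap`, which cuts at a fixed character width and can therefore break a sentence in the middle. One possible alternative is to group whole sentences under a word budget. The sketch below is only an illustration under stated assumptions: `chunk_by_sentences` and `max_words` are hypothetical names that do not appear in the notebook, and the regular-expression sentence splitter is deliberately naive.

```python
import re

def chunk_by_sentences(text, max_words=100):
    """Group whole sentences into chunks of at most `max_words` words.

    Hypothetical helper, not part of the notebook. A single sentence that is
    longer than `max_words` is kept whole and becomes its own chunk.
    """
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

demo = "Floods hit the town. Roads were closed! Is help coming? Repairs begin soon."
for chunk in chunk_by_sentences(demo, max_words=7):
    print(chunk)
```

Each returned chunk could then be passed to `summarizer` in place of the output of `wrap(article, 500)`. Word counts remain only an approximation of the token counts that the model actually sees.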


In [6]:
# Step 1: Install Dependencies
!pip install datasets transformers evaluate bert_score --quiet

# Step 2: Import Libraries
import os

import pandas as pd
from datasets import load_dataset
from transformers import pipeline
from tqdm import tqdm

# Step 3: Load XSum dataset (trust custom code)
xsum = load_dataset("xsum", split="train[:50]", trust_remote_code=True)  # small subset for now

# Step 4: Convert to DataFrame
df = pd.DataFrame(xsum)
df = df[["document", "summary"]].rename(
    columns={"document": "article", "summary": "reference_summary"}
)

# Step 5: Initialize Summarizer
summarizer = pipeline("summarization", model="facebook/bart-large-cnn", tokenizer="facebook/bart-large-cnn")

# Step 6: Define Summarizer with Length Handling
def summarize_article(article):
    # Pick generation bounds (in tokens) from the article's word count
    n_words = len(article.split())
    if n_words < 50:
        max_len, min_len = 30, 10
    elif n_words < 100:
        max_len, min_len = 60, 20
    elif n_words < 250:
        max_len, min_len = 120, 40
    else:
        max_len, min_len = 150, 50
        # Note: this slice keeps the first 1024 characters, not 1024 tokens
        article = article[:1024]
    # truncation=True asks the tokenizer to cut inputs that exceed the model's limit
    return summarizer(article, max_length=max_len, min_length=min_len, do_sample=False, truncation=True)[0]["summary_text"]

# Step 7: Apply Summarizer with Progress Bar
tqdm.pandas()
df["generated_summary"] = df["article"].progress_apply(summarize_article)

# Step 8: Save Final CSV
os.makedirs("datasets", exist_ok=True)
df.to_csv("datasets/summarization.csv", index=False)
print("✅ Summarization CSV saved at datasets/summarization.csv")


Device set to use cuda:0
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
100%|██████████| 50/50 [00:44<00:00,  1.14it/s]

✅ Summarization CSV saved at datasets/summarization.csv
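
The length handling in Step 6 maps an article's word count to a `(max_length, min_length)` pair through a chain of `if`/`elif` branches. As a rough sketch, the same thresholds can be written as a lookup table, which makes the buckets easier to read and adjust. The names `WORD_LIMITS`, `BOUNDS` and `length_bounds` are hypothetical and introduced here only for illustration; the numbers are the ones used in Step 6.

```python
from bisect import bisect_right

# Exclusive upper word-count limits of the first three buckets (from Step 6).
WORD_LIMITS = [50, 100, 250]
# (max_length, min_length) for each bucket; the last entry covers 250+ words.
BOUNDS = [(30, 10), (60, 20), (120, 40), (150, 50)]

def length_bounds(n_words):
    """Return (max_length, min_length) for an article with `n_words` words."""
    # bisect_right counts how many limits are <= n_words, which is the bucket index.
    return BOUNDS[bisect_right(WORD_LIMITS, n_words)]

for n in (10, 50, 180, 400):
    print(n, length_bounds(n))
```

Inside `summarize_article`, the pair could be unpacked as `max_len, min_len = length_bounds(len(article.split()))`.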





In [7]:
df.head()[["article", "reference_summary", "generated_summary"]]


| | article | reference_summary | generated_summary |
|---|---|---|---|
| 0 | The full cost of damage in Newton Stewart, one... | Clean-up operations are continuing across the ... | The full cost of damage in Newton Stewart, one... |
| 1 | A fire alarm went off at the Holiday Inn in Ho... | Two tourist buses have been destroyed by fire ... | Fire alarm went off at the Holiday Inn in Hope... |
| 2 | Ferrari appeared in a position to challenge un... | Lewis Hamilton stormed to pole position at the... | Sebastian Vettel will start third ahead of tea... |
| 3 | John Edward Bates, formerly of Spalding, Linco... | A former Lincolnshire Police officer carried o... | John Edward Bates faces a total of 22 charges,... |
| 4 | Patients and staff were evacuated from Cerahpa... | An armed man who locked himself into a room at... | Patients and staff evacuated from Cerahpasa ho... |
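
In the preview above, the generated summary in row 0 begins with the same words as its article, while the reference summary is worded differently. One simple way to quantify how much a summary copies from its source is the share of summary n-grams that also occur in the article. The sketch below is an assumption-laden illustration rather than part of the original pipeline: `copied_ngram_fraction` is a hypothetical helper, it tokenizes on whitespace only, and the example sentences are made up.

```python
def ngrams(tokens, n):
    """Return the set of distinct n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def copied_ngram_fraction(article, summary, n=2):
    """Share of the summary's distinct n-grams that also occur in the article."""
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:
        return 0.0
    article_ngrams = ngrams(article.lower().split(), n)
    return len(summary_ngrams & article_ngrams) / len(summary_ngrams)

# Made-up toy texts, not taken from XSum.
article = "the storm closed roads in the town and damaged several homes"
extractive = "the storm closed roads in the town"
abstractive = "severe weather left a town cut off"

print(copied_ngram_fraction(article, extractive))   # every bigram is copied
print(copied_ngram_fraction(article, abstractive))  # no bigram is copied
```

Applied to the DataFrame, this could look like `df.apply(lambda r: copied_ngram_fraction(r["article"], r["generated_summary"]), axis=1)`. Values close to 1 would suggest that a summary is largely extractive.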


In [9]:
# Install evaluation libraries (rouge-score is required by evaluate's "rouge" metric)
!pip install evaluate bert_score rouge-score --quiet

# Imports
import pandas as pd
import evaluate
from bert_score import score

# Load your CSV
df = pd.read_csv("datasets/summarization.csv")

# Prepare reference and generated summaries
references = df["reference_summary"].tolist()
generated = df["generated_summary"].tolist()

# --- 1️⃣ ROUGE Evaluation ---
rouge = evaluate.load("rouge")
rouge_result = rouge.compute(predictions=generated, references=references)
print("🔍 ROUGE Evaluation")
for k, v in rouge_result.items():
    print(f"{k}: {v:.4f}")

# --- 2️⃣ BERTScore Evaluation ---
print("\n🔍 BERTScore Evaluation (may take 1-2 mins)...")
P, R, F1 = score(generated, references, lang="en", verbose=True)
print(f"Precision: {P.mean():.4f}")
print(f"Recall:    {R.mean():.4f}")
print(f"F1 Score:  {F1.mean():.4f}")


🔍 ROUGE Evaluation
rouge1: 0.1944
rouge2: 0.0326
rougeL: 0.1291
rougeLsum: 0.1287

🔍 BERTScore Evaluation (may take 1-2 mins)...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

calculating scores...
computing bert embedding.
computing greedy matching.
done in 1.05 seconds, 47.64 sentences/sec
Precision: 0.8450
Recall:    0.8709
F1 Score:  0.8576
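
To make the ROUGE numbers above easier to interpret, the following sketch computes simplified ROUGE-N and ROUGE-L F1 scores for a single prediction/reference pair. It is a minimal illustration, assuming lowercase whitespace tokenization; the `rouge` metric loaded through `evaluate` applies its own tokenization and aggregates over all pairs, so its values will not match this sketch exactly. The function names and the example sentences are hypothetical.

```python
from collections import Counter

def _f1(overlap, n_pred, n_ref):
    """Harmonic mean of precision (overlap/n_pred) and recall (overlap/n_ref)."""
    if overlap == 0:
        return 0.0
    precision, recall = overlap / n_pred, overlap / n_ref
    return 2 * precision * recall / (precision + recall)

def rouge_n_f1(prediction, reference, n=1):
    """Simplified ROUGE-N F1 based on clipped n-gram overlap."""
    def grams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    pred, ref = grams(prediction), grams(reference)
    overlap = sum((pred & ref).values())  # each n-gram counts at most min(pred, ref) times
    return _f1(overlap, sum(pred.values()), sum(ref.values()))

def rouge_l_f1(prediction, reference):
    """Simplified ROUGE-L F1 based on the longest common subsequence (LCS)."""
    a, b = prediction.lower().split(), reference.lower().split()
    # dp[i][j] holds the LCS length of a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return _f1(dp[-1][-1], len(a), len(b))

# Made-up example pair, not taken from XSum.
pred = "the cat sat on the mat"
ref = "the cat lay on the mat"
print(f"rouge1: {rouge_n_f1(pred, ref, 1):.4f}")
print(f"rouge2: {rouge_n_f1(pred, ref, 2):.4f}")
print(f"rougeL: {rouge_l_f1(pred, ref):.4f}")
```

In this toy pair, five of six unigrams and three of five bigrams match, which illustrates why ROUGE-2 is typically the lowest of the three scores: a single changed word breaks two bigrams.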
