# Task 2: Text Summarization

**Objective**: Build a system that condenses long articles, blog posts, or news stories into concise summaries using extractive and abstractive techniques.

**Dataset**: CNN/Daily Mail Dataset

In [3]:
!pip install --upgrade datasets fsspec spacy transformers evaluate -q

In [2]:

!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [4]:
from datasets import load_dataset
import spacy

dataset = load_dataset("cnn_dailymail", "3.0.0")
nlp = spacy.load("en_core_web_sm")

article = dataset['train'][0]['article']
reference_summary = dataset['train'][0]['highlights']
print("📰 Article:\n", article[:500])
print("\n📋 Summary:\n", reference_summary)

📰 Article:
 LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don't plan to be one of those people who, as s

📋 Summary:
 Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .
Young actor says he has no plans to fritter his cash away .
Radcliffe's earnings from first five Potter films have been held in trust fund .


## Extractive Summarization using spaCy

In [5]:
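def extractive_summary(text, num_sentences=3):
    """Return the `num_sentences` longest sentences as a naive extractive summary."""
    doc = nlp(text)
    sentences = list(doc.sents)
    # Heuristic: rank sentences by token count (len of a spaCy Span), longest first.
    sorted_sentences = sorted(sentences, key=len, reverse=True)
    # Note: the selected sentences are joined in length order, not document order.
    selected = sorted_sentences[:num_sentences]
    return " ".join(s.text.strip() for s in selected)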

extractive = extractive_summary(article)
print("\n[Extractive Summary]\n", extractive)


[Extractive Summary]
 Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart.
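
Sentence length is a crude proxy for importance, and the output above also loses the article's sentence order. A common alternative is to score sentences by how often their content words occur in the text and then restore document order. The block below is a minimal sketch of that idea, not the method used in this notebook. It assumes a naive regex sentence splitter instead of spaCy and a small hand-written stop-word list; `split_sentences`, `content_words`, and `frequency_summary` are hypothetical helper names.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption; spaCy ships a fuller one).
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "of", "to", "in", "on",
              "for", "is", "are", "was", "were", "it", "he", "she", "his",
              "her", "as", "at", "by", "with", "that", "this", "be", "has",
              "have", "had", "not", "no"}

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def content_words(sentence):
    # Lowercased word tokens with stop words removed.
    return [w for w in re.findall(r"[a-z0-9']+", sentence.lower())
            if w not in STOP_WORDS]

def frequency_summary(text, num_sentences=3):
    sentences = split_sentences(text)
    freqs = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(freqs[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score, then restore document order.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in keep)
```

Calling `frequency_summary(article)` would give a summary comparable to `extractive`, though the regex splitter will mis-split abbreviations that spaCy handles correctly.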


## Abstractive Summarization using Transformers (BART)

In [6]:
from transformers import pipeline

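# Load a BART checkpoint already fine-tuned on CNN/Daily Mail.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
# truncation=True clips inputs that exceed the model's 1024-token limit;
# do_sample=False keeps decoding deterministic.
abstractive = summarizer(article, max_length=130, min_length=30, do_sample=False, truncation=True)[0]['summary_text']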
print("\n[Abstractive Summary]\n", abstractive)

Device set to use cpu



[Abstractive Summary]
 Harry Potter star Daniel Radcliffe turns 18 on Monday. He gains access to a reported £20 million ($41.1 million) fortune. Radcliffe's earnings from the first five Potter films have been held in a trust fund.


## Fine-Tuning BART (Toy Example)

In [7]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, BartTokenizer, BartForConditionalGeneration
from datasets import Dataset
import torch

fine_tune_data = Dataset.from_dict({
    'text': [dataset['train'][i]['article'] for i in range(50)],
    'summary': [dataset['train'][i]['highlights'] for i in range(50)]
})

model_name = "facebook/bart-base"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

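def tokenize(batch):
    # Tokenize articles (inputs) and highlights (targets) to fixed lengths.
    inputs = tokenizer(batch['text'], max_length=1024, truncation=True, padding="max_length")
    targets = tokenizer(batch['summary'], max_length=128, truncation=True, padding="max_length")
    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask
    # Replace pad token ids with -100 so the loss ignores padded label positions.
    batch["labels"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in ids]
        for ids in targets.input_ids
    ]
    return batch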

fine_tune_data = fine_tune_data.map(tokenize, batched=True)
fine_tune_data.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    logging_steps=10,
    save_steps=10,
    logging_dir="./logs",
    report_to="none"
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=fine_tune_data
)

# trainer.train()  # Optional
print("\n[Fine-tuning Step Ready]\nFine-tuning setup completed. Training skipped for demo purposes.")


[Fine-tuning Step Ready]
Fine-tuning setup completed. Training skipped for demo purposes.
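
If `trainer.train()` were uncommented, the settings above imply a very short run. The arithmetic can be sketched as follows, assuming a single device and the default gradient accumulation of 1; `training_schedule` is a hypothetical helper for illustration, not part of `transformers`.

```python
import math

def training_schedule(num_examples, batch_size, epochs, save_steps):
    # Batches per epoch; the last batch may be smaller than batch_size.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    # With gradient accumulation of 1, each batch is one optimizer step.
    total_steps = steps_per_epoch * epochs
    # Steps that are multiples of save_steps (periodic checkpoints).
    checkpoints = list(range(save_steps, total_steps + 1, save_steps))
    return total_steps, checkpoints

total, ckpts = training_schedule(num_examples=50, batch_size=2, epochs=1, save_steps=10)
print(total, ckpts)  # 25 [10, 20]
```

Because `logging_steps` is also 10, the loss would be logged at those same two steps. A run this small is only useful as a smoke test of the pipeline.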


## Evaluation

In [10]:
# import evaluate
# rouge = evaluate.load("rouge")

# def evaluate_summary(reference, generated):
#     return rouge.compute(predictions=[generated], references=[reference])

# print("\n[Evaluation - Extractive Summary]\n", evaluate_summary(reference_summary, extractive))
# print("\n[Evaluation - Abstractive Summary]\n", evaluate_summary(reference_summary, abstractive))

# Install rouge_score first
!pip install rouge_score -q

# Run evaluation
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

def evaluate_summary(reference, generated):
    return scorer.score(reference, generated)

# Show results
print("\n[Evaluation - Extractive Summary]\n", evaluate_summary(reference_summary, extractive))
print("\n[Evaluation - Abstractive Summary]\n", evaluate_summary(reference_summary, abstractive))



[Evaluation - Extractive Summary]
 {'rouge1': Score(precision=0.226890756302521, recall=0.6923076923076923, fmeasure=0.34177215189873417), 'rouge2': Score(precision=0.16101694915254236, recall=0.5, fmeasure=0.24358974358974356), 'rougeL': Score(precision=0.14285714285714285, recall=0.4358974358974359, fmeasure=0.21518987341772156)}

[Evaluation - Abstractive Summary]
 {'rouge1': Score(precision=0.6756756756756757, recall=0.6410256410256411, fmeasure=0.6578947368421052), 'rouge2': Score(precision=0.4444444444444444, recall=0.42105263157894735, fmeasure=0.43243243243243246), 'rougeL': Score(precision=0.6486486486486487, recall=0.6153846153846154, fmeasure=0.631578947368421)}
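
To make the scores above easier to read, here is a minimal sketch of how ROUGE-1 precision, recall, and F-measure are derived from unigram overlap. It is a simplification and not the `rouge_score` implementation: it assumes plain lowercase alphanumeric tokenization and applies no stemming, so its numbers can differ slightly from the scorer configured with `use_stemmer=True`. `simple_tokens` and `rouge1_sketch` are hypothetical names.

```python
import re
from collections import Counter

def simple_tokens(text):
    # Lowercase alphanumeric tokens; no stemming (a simplifying assumption).
    return re.findall(r"[a-z0-9]+", text.lower())

def rouge1_sketch(reference, candidate):
    ref_counts = Counter(simple_tokens(reference))
    cand_counts = Counter(simple_tokens(candidate))
    # Clipped overlap: a unigram counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    denom = precision + recall
    fmeasure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, fmeasure

p, r, f = rouge1_sketch("the cat sat on the mat", "the cat lay on the mat today")
print(round(p, 3), round(r, 3), round(f, 3))
```

Read this way, the abstractive ROUGE-1 result above is consistent with 25 overlapping unigrams out of 37 generated tokens (precision 25/37) and 39 reference tokens (recall 25/39). The extractive summary reaches a slightly higher ROUGE-1 recall but much lower precision because it is roughly three times longer than the reference.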


## Test on Real-world Article

In [11]:
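# Summarize an unseen article from the test split.
new_article = dataset['test'][1]['article']
real_world_summary = summarizer(new_article, max_length=130, min_length=30, do_sample=False, truncation=True)[0]['summary_text']

print("\n[Test on Real-world Article]\n", real_world_summary)
# Re-print the summary of the first training article (computed earlier) for comparison.
print("\n[Final Concise Summary Output]\n", abstractive)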


[Test on Real-world Article]
 Theia, a one-year-old bully breed mix, was hit by a car and buried in a field. She managed to stagger to a nearby farm, dirt-covered and emaciated. She suffered a dislocated jaw, leg injuries and a caved-in sinus cavity.

[Final Concise Summary Output]
 Harry Potter star Daniel Radcliffe turns 18 on Monday. He gains access to a reported £20 million ($41.1 million) fortune. Radcliffe's earnings from the first five Potter films have been held in a trust fund.
