# **Task 2: Text Summarization**
**Objective:** Create a system that condenses lengthy articles, blog posts, or news stories into concise summaries.

**● Dataset:** CNN/Daily Mail Dataset

**● Steps:**
1. Preprocess textual data for summarization.
2. Implement extractive summarization using libraries like spaCy.
3. Implement abstractive summarization using pre-trained models like BERT or
GPT with HuggingFace's transformers.
4. Fine-tune models to improve the quality of summaries.
5. Test the model on real-world articles and evaluate summary coherence.

**● Outcome:** A summarization model capable of generating concise summaries from long texts.

# **Installations**

In [21]:
!pip install datasets transformers spacy rouge_score evaluate -q
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


# **1. Load and Preprocess Data**

Preprocess textual data for summarization.

In [32]:
from datasets import load_dataset
import spacy

# **Load dataset**

CNN/Daily Mail Dataset

In [22]:
dataset = load_dataset("cnn_dailymail", "3.0.0")


In [23]:
# Load spaCy model
nlp = spacy.load("en_core_web_sm")
# Extract a sample article for demonstration
article = dataset['train'][0]['article']
reference_summary = dataset['train'][0]['highlights']
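
The cell above loads the raw article without any cleaning. As a minimal sketch of a preprocessing step, the hypothetical helper `clean_article` below collapses whitespace and strips a leading newswire dateline such as `LONDON, England (Reuters) --`. The function name and the dateline pattern are assumptions for illustration, not part of the original notebook, and the pattern will not cover every dateline format in the dataset.

```python
import re

# Hypothetical pattern: an optional short location, a source in parentheses,
# then "--". Example it targets: "LONDON, England (Reuters) -- Harry Potter ..."
_DATELINE = re.compile(r"^\s*(?:[A-Z][\w.,' ]{0,60}?)?\((?:CNN|Reuters|AP)\)\s*--\s*")


def clean_article(text: str) -> str:
    """Strip a leading dateline and collapse runs of whitespace."""
    text = _DATELINE.sub("", text, count=1)
    return re.sub(r"\s+", " ", text).strip()
```

If used, it could be applied as `article = clean_article(dataset['train'][0]['article'])`; the outputs shown in the rest of this notebook were produced without it.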

# **2. Extractive Summarization using spaCy**
Implement extractive summarization using libraries like spaCy.

In [24]:
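def extractive_summary(text, num_sentences=3):
    """Naive extractive baseline: return the longest sentences in the text."""
    doc = nlp(text)
    sentences = list(doc.sents)
    # len(span) counts spaCy tokens, so sentences are ranked by token count.
    sorted_sentences = sorted(sentences, key=len, reverse=True)
    # The selected sentences come back in length order, not document order.
    selected = sorted_sentences[:num_sentences]
    return " ".join(s.text.strip() for s in selected)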

extractive = extractive_summary(article)
print("\n[Extractive Summary]\n", extractive)



[Extractive Summary]
 Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart.
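
Selecting the longest sentences favors length over relevance, and the result above is not in document order. One common alternative, sketched here under simplifying assumptions, scores each sentence by the average frequency of its content words and returns the top sentences in their original order. The function name `frequency_summary`, the tiny stopword set, and the regex tokenizer are illustrative assumptions; in this notebook the sentence list could come from `[s.text for s in nlp(text).sents]`.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (an assumption; spaCy's token.is_stop is richer).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "is",
             "are", "was", "were", "it", "he", "she", "that", "this", "for",
             "with", "as", "at", "by", "his", "her", "but", "be", "has", "have"}


def content_words(sentence):
    """Lowercased word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOPWORDS]


def frequency_summary(sentences, num_sentences=3):
    """Pick the highest-scoring sentences and return them in document order."""
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])  # restore original order
    return " ".join(sentences[i].strip() for i in keep)
```

This is still a crude heuristic: it has no notion of redundancy, so two near-duplicate sentences can both be selected.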


# **3. Abstractive Summarization using Transformers (BART)**

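Implement abstractive summarization using a pre-trained model from HuggingFace's transformers. The task description suggests BERT or GPT; this notebook uses BART (`facebook/bart-large-cnn`), an encoder-decoder model that has already been fine-tuned on CNN/Daily Mail.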

In [36]:
from transformers import pipeline

# Load summarization pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

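# Run abstractive summarization with deterministic decoding (no sampling);
# truncation=True guards against inputs longer than the model's 1,024-token limit
abstractive = summarizer(article, max_length=130, min_length=30, do_sample=False, truncation=True)[0]['summary_text']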
print("\n[Abstractive Summary]\n", abstractive)




Device set to use cpu



[Abstractive Summary]
 Harry Potter star Daniel Radcliffe turns 18 on Monday. He gains access to a reported £20 million ($41.1 million) fortune. Radcliffe's earnings from the first five Potter films have been held in a trust fund.
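
BART accepts at most 1,024 input tokens, so very long articles need truncation or splitting before summarization. The sketch below shows one simple option: split the text into word-based chunks, summarize each chunk, and join the partial summaries. `chunk_words`, `summarize_long`, and the 400-word chunk size are assumptions for illustration; word counts only approximate token counts, and `summarize_fn` stands in for any callable that maps a string to a summary string.

```python
def chunk_words(text, max_words=400):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def summarize_long(text, summarize_fn, max_words=400):
    """Summarize each chunk with summarize_fn and join the partial summaries."""
    return " ".join(summarize_fn(chunk) for chunk in chunk_words(text, max_words))
```

With the pipeline from this notebook, `summarize_fn` could be `lambda t: summarizer(t, max_length=130, min_length=30, do_sample=False)[0]['summary_text']`. Chunking on word boundaries can split a sentence in two, so splitting on sentence boundaries may give cleaner results.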


# **4. Fine-Tuning BART on CNN/DailyMail (Toy Example)**
Fine-tune models to improve the quality of summaries.

In [37]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, BartTokenizer, BartForConditionalGeneration
from datasets import Dataset
import torch

# Prepare small dataset for fine-tuning
fine_tune_data = Dataset.from_dict({
    'text': [dataset['train'][i]['article'] for i in range(50)],
    'summary': [dataset['train'][i]['highlights'] for i in range(50)]
})

model_name = "facebook/bart-base"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

# Tokenization function
def tokenize(batch):
    inputs = tokenizer(batch['text'], max_length=1024, truncation=True, padding="max_length")
    targets = tokenizer(batch['summary'], max_length=128, truncation=True, padding="max_length")
    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask
    # Replace padding token ids in the labels with -100 so the loss ignores them
    batch["labels"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in ids]
        for ids in targets.input_ids
    ]
    return batch

fine_tune_data = fine_tune_data.map(tokenize, batched=True)
fine_tune_data.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

# Training arguments (one short epoch; checkpoints and logs every 10 steps)
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    logging_steps=10,
    save_steps=10,
    logging_dir="./logs",
    report_to="none"
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=fine_tune_data
)

# Optional: Uncomment to train (requires resources)
# trainer.train()
print("\n[Fine-tuning Step Ready]\nFine-tuning setup completed. Training skipped for demo purposes.")




[Fine-tuning Step Ready]
Fine-tuning setup completed. Training skipped for demo purposes.
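
To see what these settings imply, the number of optimizer steps can be worked out by hand. The sketch below assumes a single device and no gradient accumulation; the helper name `training_steps` is hypothetical.

```python
import math


def training_steps(num_examples, batch_size, epochs):
    """Optimizer steps for a run on one device without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs


total = training_steps(num_examples=50, batch_size=2, epochs=1)
print(total)  # 25
# Steps at which logging_steps=10 and save_steps=10 would fire in this run.
print([s for s in range(1, total + 1) if s % 10 == 0])  # [10, 20]
```

With only 25 steps, `save_steps=10` would write checkpoints at steps 10 and 20, which is frequent for such a short run; a larger value may save disk space.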


In [39]:
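# Evaluate both summaries against the reference highlights using ROUGE.
# Scores come from a single article, so they are indicative only.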
import evaluate
rouge = evaluate.load("rouge")

def evaluate_summary(reference, generated):
    return rouge.compute(predictions=[generated], references=[reference])

print("\n[Evaluation - Extractive Summary]\n", evaluate_summary(reference_summary, extractive))
print("\n[Evaluation - Abstractive Summary]\n", evaluate_summary(reference_summary, abstractive))



[Evaluation - Extractive Summary]
 {'rouge1': np.float64(0.3291139240506329), 'rouge2': np.float64(0.24358974358974356), 'rougeL': np.float64(0.21518987341772156), 'rougeLsum': np.float64(0.3291139240506329)}

[Evaluation - Abstractive Summary]
 {'rouge1': np.float64(0.6578947368421052), 'rouge2': np.float64(0.43243243243243246), 'rougeL': np.float64(0.631578947368421), 'rougeLsum': np.float64(0.631578947368421)}
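
For intuition about the scores above, ROUGE-1 measures unigram overlap between a generated summary and its reference. The following is a simplified sketch, not the `rouge_score` implementation: it lowercases the text, keeps alphanumeric tokens, and skips stemming. The function name `rouge1_f1` is an assumption.

```python
import re
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap, no stemming."""
    pred = Counter(re.findall(r"[a-z0-9]+", prediction.lower()))
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    overlap = sum((pred & ref).values())  # per-token minimum of the two counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


print(round(rouge1_f1("the cat sat", "the cat sat on the mat"), 3))  # 0.667
```

Because F1 balances precision and recall, a long summary that covers the reference but adds many extra words can still score low.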


# **5. Test on New Real-World Article**

Test the model on real-world articles and evaluate summary coherence.

In [40]:
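new_article = dataset['test'][1]['article']
# truncation=True guards against articles longer than the model's 1,024-token limit
real_world_summary = summarizer(new_article, max_length=130, min_length=30, do_sample=False, truncation=True)[0]['summary_text']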
print("\n[Test on Real-world Article]\n", real_world_summary)

# Final output: the abstractive summary of the training article from section 3
print("\n[Final Concise Summary Output]\n")
print(abstractive)


[Test on Real-world Article]
 Theia, a one-year-old bully breed mix, was hit by a car and buried in a field. She managed to stagger to a nearby farm, dirt-covered and emaciated. She suffered a dislocated jaw, leg injuries and a caved-in sinus cavity.

[Final Concise Summary Output]

Harry Potter star Daniel Radcliffe turns 18 on Monday. He gains access to a reported £20 million ($41.1 million) fortune. Radcliffe's earnings from the first five Potter films have been held in a trust fund.
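
Coherence is hard to score automatically, and the cell above only prints the summaries. As a rough sanity check (not a coherence metric), the sketch below reports how strongly a summary compresses its article and whether every summary sentence ends with terminal punctuation, which can flag outputs cut off mid-sentence by `max_length`. The helper name `summary_checks` and both checks are assumptions for illustration.

```python
import re


def summary_checks(article, summary):
    """Return simple sanity statistics for a generated summary."""
    article_words = len(article.split())
    summary_words = len(summary.split())
    # Split on whitespace that follows ., ! or ? to approximate sentences.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", summary.strip()) if s]
    return {
        "compression_ratio": summary_words / article_words if article_words else 0.0,
        "num_sentences": len(sentences),
        "all_sentences_complete": all(s[-1] in ".!?\"'" for s in sentences),
    }
```

In this notebook it could be called as `summary_checks(new_article, real_world_summary)`; reading the summary against the article remains the more reliable coherence check.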
