### **Objective:** Create a system that summarizes lengthy articles, blogs, or news into concise summaries.

Task 2: Text Summarization

● Dataset: CNN/Daily Mail Dataset  
● Steps:
1. Preprocess textual data for summarization.
2. Implement extractive summarization using libraries like spaCy.
3. Implement abstractive summarization using pre-trained models like BERT or
GPT with HuggingFace's transformers.
4. Fine-tune models to improve the quality of summaries.
5. Test the model on real-world articles and evaluate summary coherence.  
● Outcome: A summarization model capable of generating
concise summaries from long texts.

In [1]:
# %pip install seaborn
# %pip install matplotlib.pyplot
# %pip install pandas
# %pip install nltk
# %pip install sklearn
# %pip install spacy
# %pip install scipy
# %pip install transformers
# %pip install torch


In [2]:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from transformers import pipeline

  from .autonotebook import tqdm as notebook_tqdm


In [3]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\LEO\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### **Loading dataset**

In [4]:
df = pd.read_csv("../datasets/cnn_dailymail/train.csv")
df.head()

Unnamed: 0,id,article,highlights
0,0001d1afc246a7964130f43ae940af6bc6c57f01,By . Associated Press . PUBLISHED: . 14:11 EST...,"Bishop John Folda, of North Dakota, is taking ..."
1,0002095e55fcbd3a2f366d9bf92a95433dc305ef,(CNN) -- Ralph Mata was an internal affairs li...,Criminal complaint: Cop used his role to help ...
2,00027e965c8264c35cc1bc55556db388da82b07f,A drunk driver who killed a young woman in a h...,"Craig Eccleston-Todd, 27, had drunk at least t..."
3,0002c17436637c4fe1837c935c04de47adb18e9a,(CNN) -- With a breezy sweep of his pen Presid...,Nina dos Santos says Europe must be ready to a...
4,0003ad6ef0c37534f80b55b4235108024b407f0b,Fleetwood are the only team still to have a 10...,Fleetwood top of League One after 2-0 win at S...


In [5]:
print("Columns in dataframe:",df.columns)
print("Shape of dataframe:",df.shape)

Columns in dataframe: Index(['id', 'article', 'highlights'], dtype='object')
Shape of dataframe: (287113, 3)


### **Preprocessing Data**

In [6]:
df_sample = df.sample(1, random_state=123)
article = df_sample['article'].values[0]
highlight = df_sample['highlights'].values[0]

In [7]:
# nltk.download('punkt_tab')

In [8]:
def preprocess_text(text):
    text = text.lower()
    #remove extra spaces
    text = re.sub(r'\s+',' ', text)
    #remove brackets and the content inside them
    text = re.sub(r'\[[^]]*\]','', text)
    #remove parentheses and the content inside them
    text = re.sub(r'\([^)]*\)', '', text)
    #remove special characters
    text = re.sub(r'[^a-zA-Z0-9.,?!\'\s]', '', text)
    return text.strip()
preprocess_article = preprocess_text(article)
sentences = sent_tokenize(preprocess_article)
print("\nPreprocessed Article:", preprocess_article)
print("\nHighlight:", highlight)
print("\nNumber of sentences in the article:", len(sentences))
print("\nFirst 3 sentences:", sentences[:3])


Preprocessed Article: a group of brazilian convicts who tunnelled out of jail to freedom filmed the whole escape on a mobile phone ... before posting the videos online to taunt their hoodwinked guards. in a daring breakout that echoed 1994 hollywood blockbuster the shawshank redemption, 13 prisoners dug a small hole under the wall between their cell and the prison yard in rio verde, goias state, before darting to freedom under cover of darkness. but unlike the famous movie, starring tim robbins and morgan freeman, which is set in the 1940s, their escape had very modern twist. rather than focus purely on getting out of jail without detection, they felt compelled to record the whole experience on a mobile phone and post it on the internet for all to see. scroll down for video . brazen the thirteen prisoners filmed their escape on mobile phones, laughing as they did so, before later posting the video online . like a movie in a daring breakout that echoed 1994 hollywood blockbuster the sh

### **Sentence Tokenization & Vectorization Using spaCy**

In [9]:
import spacy
import numpy as np

In [10]:
# !python -m spacy download en_core_web_md

In [11]:
#loading spaCy english model
nlp = spacy.load("en_core_web_md")

preprocess_article = preprocess_text(article)
sentences = sent_tokenize(preprocess_article)

# Store spaCy docs for each sentence
sentences_docs = [nlp(sentence) for sentence in sentences]   
# Filter out or Removing very short or empty sentences
filtered_docs = [doc for doc in sentences_docs if len(doc.text.split()) > 20]


In [12]:
print(f"📄 Total sentences: {len(sentences)}")
print(f"🧹 Sentences after filtering: {len(filtered_docs)}")
print("\n📌 First 3 sentences:")
for i, doc in enumerate(filtered_docs[:3]):
    print(f"{i+1}. {doc.text}")

📄 Total sentences: 23
🧹 Sentences after filtering: 9

📌 First 3 sentences:
1. a group of brazilian convicts who tunnelled out of jail to freedom filmed the whole escape on a mobile phone ... before posting the videos online to taunt their hoodwinked guards.
2. in a daring breakout that echoed 1994 hollywood blockbuster the shawshank redemption, 13 prisoners dug a small hole under the wall between their cell and the prison yard in rio verde, goias state, before darting to freedom under cover of darkness.
3. but unlike the famous movie, starring tim robbins and morgan freeman, which is set in the 1940s, their escape had very modern twist.


### **Compute Sentence Similarity Matrix**

In [13]:
#using cosine similarity
n = len(filtered_docs)
similarity_matrix = np.zeros((n, n))

for i in range(n):
    for j in range(n):
        # skipping self-similarity( will always be 1 or we can set it to 0 depending on the use case)
        if i != j:
            similarity_matrix[i][j] = filtered_docs[i].similarity(filtered_docs[j])
       

In [14]:
# print("📌 Similarity Matrix Shape:", similarity_matrix.shape)
# print("🔢 Similarity Matrix (first 3x3):\n", similarity_matrix[:3, :3])

### **Build Sentence Graph & Rank with PageRank (TextRank)**


In [15]:
import networkx as nx

In [16]:
print(filtered_docs)


[a group of brazilian convicts who tunnelled out of jail to freedom filmed the whole escape on a mobile phone ... before posting the videos online to taunt their hoodwinked guards., in a daring breakout that echoed 1994 hollywood blockbuster the shawshank redemption, 13 prisoners dug a small hole under the wall between their cell and the prison yard in rio verde, goias state, before darting to freedom under cover of darkness., but unlike the famous movie, starring tim robbins and morgan freeman, which is set in the 1940s, their escape had very modern twist., rather than focus purely on getting out of jail without detection, they felt compelled to record the whole experience on a mobile phone and post it on the internet for all to see., brazen the thirteen prisoners filmed their escape on mobile phones, laughing as they did so, before later posting the video online ., like a movie in a daring breakout that echoed 1994 hollywood blockbuster the shawshank redemption, the prisoners dug a s

In [17]:
# building a graph
sentence_graph = nx.from_numpy_array(similarity_matrix)

# using PageRank algorithm to rank sentences
scores = nx.pagerank(sentence_graph)

# sorting sentences by their scores
ranked_sentences = sorted(((scores[i], doc) for i, doc in enumerate(filtered_docs)), reverse=True)

In [18]:
scores

{0: 0.11279044533029339,
 1: 0.1138739171253475,
 2: 0.11135033999279662,
 3: 0.1112975260118318,
 4: 0.11305956129914166,
 5: 0.11461131919072506,
 6: 0.10932991895446192,
 7: 0.10692392443004176,
 8: 0.10676304766536027}

### **Generate Summary using top ranked sentences**

In [None]:
top_sent = sorted(ranked_sentences[:3], key=lambda x:filtered_docs.index(x[1]), reverse=True)

extracted_summary = '\n'.join([sent.text for _, sent in top_sent])

print("📄 Extractive Summary:\n")
print(extracted_summary)

print("\n📝 Original Human-Written Summary:\n")
print(highlight)

📄 Extractive Summary:

like a movie in a daring breakout that echoed 1994 hollywood blockbuster the shawshank redemption, the prisoners dug a small hole under the wall that separated their cell and the prison yard in rio verde, goias state, before darting to freedom under cover of darkness .
in a daring breakout that echoed 1994 hollywood blockbuster the shawshank redemption, 13 prisoners dug a small hole under the wall between their cell and the prison yard in rio verde, goias state, before darting to freedom under cover of darkness.
brazen the thirteen prisoners filmed their escape on mobile phones, laughing as they did so, before later posting the video online .

📝 Original Human-Written Summary:

13 prisoners escaped from Rio Grande prison after digging hole under wall .
The escape had shades of 1994 blockbuster The Shawshank Redemption .
But in a modern twist, they filmed escape on smuggled mobile phones .
They then posted the video online which showed them goading guards .


### **Abstractive Summarization with BART**

In [30]:
summarizer = pipeline("summarization", model = "facebook/bart-large-cnn")

#truncating due to the bart's input size limit
text = df_sample['article'].iloc[0]
if len(text.split()) > 1024:
    text = " ".join(text.split()[:1024])

summary = summarizer(df_sample['article'].iloc[0], max_length = 100, min_length = 50, do_sample=False)
print("\nBART Generated Summary:\n", summary)
print("\nActual Summary:\n",df_sample['highlights'].iloc[0])


Device set to use cpu



BART Generated Summary:
 [{'summary_text': "13 prisoners dug a small hole under the wall between their cell and the prison yard in Rio Verde, Goias State, before darting to freedom under cover of darkness. Rather than focus purely on getting out of jail without detection, they felt compelled to record the whole experience on a mobile phone. Police deputy Jacqueline Camargo Machado admitted that the prisoners' cells had not been inspected for two months."}]

Actual Summary:
 13 prisoners escaped from Rio Grande prison after digging hole under wall .
The escape had shades of 1994 blockbuster The Shawshank Redemption .
But in a modern twist, they filmed escape on smuggled mobile phones .
They then posted the video online which showed them goading guards .


In [31]:

print("Extractive Summary:\n", extracted_summary)
print("\nAbstractive Summary:\n", summary)
print("\nHuman Summary:\n", highlight)


Extractive Summary:
 like a movie in a daring breakout that echoed 1994 hollywood blockbuster the shawshank redemption, the prisoners dug a small hole under the wall that separated their cell and the prison yard in rio verde, goias state, before darting to freedom under cover of darkness .
in a daring breakout that echoed 1994 hollywood blockbuster the shawshank redemption, 13 prisoners dug a small hole under the wall between their cell and the prison yard in rio verde, goias state, before darting to freedom under cover of darkness.
brazen the thirteen prisoners filmed their escape on mobile phones, laughing as they did so, before later posting the video online .

Abstractive Summary:
 [{'summary_text': "13 prisoners dug a small hole under the wall between their cell and the prison yard in Rio Verde, Goias State, before darting to freedom under cover of darkness. Rather than focus purely on getting out of jail without detection, they felt compelled to record the whole experience on a m