
# Extractive Summarization

## Steps to follow

-Installing libraries and their dependencies 

-Import libraries 

-Load the dataset with news articles

-Text cleaning

-Sentence Tokenization

-Word tokenization

-Word-Frequency Table

-Extractive summarization

In [None]:
# To install the libraries, uncomment the lines below and run the cell
# !pip install pandas
# !pip install -U spacy
# !python -m spacy download en_core_web_sm 
# !pip install rouge-score

In [1]:
# Import all the necessary libraries
import pandas as pd 
import spacy 
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

In [2]:
# Load data from the specified path
data_path = './data/news_summary.csv'
df = pd.read_csv(data_path, encoding='latin-1')
df.shape

(4514, 7)

In [3]:
# Print the first few rows of the dataset
df.head()

| Unnamed: 0 | author | date | headlines | read_more | text | ctext | link = https://www.kaggle.com/sunnysai12345/news-summary |
|---|---|---|---|---|---|---|---|
| 0 | Chhavi Tyagi | 03 Aug 2017,Thursday | Daman & Diu revokes mandatory Rakshabandhan in... | http://www.hindustantimes.com/india-news/raksh... | The Administration of Union Territory Daman an... | The Daman and Diu administration on Wednesday ... | |
| 1 | Daisy Mowke | 03 Aug 2017,Thursday | Malaika slams user who trolled her for 'divorc... | http://www.hindustantimes.com/bollywood/malaik... | Malaika Arora slammed an Instagram user who tr... | From her special numbers to TV?appearances, Bo... | |
| 2 | Arshiya Chopra | 03 Aug 2017,Thursday | 'Virgin' now corrected to 'Unmarried' in IGIMS... | http://www.hindustantimes.com/patna/bihar-igim... | The Indira Gandhi Institute of Medical Science... | The Indira Gandhi Institute of Medical Science... | |
| 3 | Sumedha Sehra | 03 Aug 2017,Thursday | Aaj aapne pakad liya: LeT man Dujana before be... | http://indiatoday.intoday.in/story/abu-dujana-... | Lashkar-e-Taiba's Kashmir commander Abu Dujana... | Lashkar-e-Taiba's Kashmir commander Abu Dujana... | |
| 4 | Aarushi Maheshwari | 03 Aug 2017,Thursday | Hotel staff to get training to spot signs of s... | http://indiatoday.intoday.in/story/sex-traffic... | Hotels in Maharashtra will train their staff t... | Hotels in Mumbai and other Indian cities are t... | |


In [53]:
# Let's look at the first article
# The line below displays the full text instead of a truncated version
# (None removes the width limit; -1 is deprecated in recent pandas versions)
pd.set_option('display.max_colwidth', None)
article_number = 0
print("First article from the dataset:\n",df.loc[article_number,'ctext'])

First article from the dataset:
 The Daman and Diu administration on Wednesday withdrew a circular that asked women staff to tie rakhis on male colleagues after the order triggered a backlash from employees and was ripped apart on social media.The union territory?s administration was forced to retreat within 24 hours of issuing the circular that made it compulsory for its staff to celebrate Rakshabandhan at workplace.?It has been decided to celebrate the festival of Rakshabandhan on August 7. In this connection, all offices/ departments shall remain open and celebrate the festival collectively at a suitable time wherein all the lady staff shall tie rakhis to their colleagues,? the order, issued on August 1 by Gurpreet Singh, deputy secretary (personnel), had said.To ensure that no one skipped office, an attendance report was to be sent to the government the next evening.The two notifications ? one mandating the celebration of Rakshabandhan (left) and the other withdrawing the mandate (

In [54]:
# Check that the selected article is not missing (NaN) and contains valid text
not pd.isna(df.loc[article_number,'ctext'])

True

In [55]:
# Get all the stop words from spacy
stop_words = list(STOP_WORDS)
print('Few stop words:\n', stop_words[:10])

Few stop words:
 ['about', 'any', 'ourselves', 'elsewhere', 'most', 'over', 'then', 'beforehand', 'all', 'name']


##### Let's get started with spaCy

In [56]:
# Load the Spacy model using en_core_web_sm
nlp = spacy.load('en_core_web_sm')

In [57]:
# Run the article through the spaCy pipeline to get a Doc object
doc = nlp(df.loc[article_number,'ctext'])

In [58]:
# Print all the tokens of this sample text from the dataset
tokens = [token.text for token in doc]
print("Tokens after processing the article using spacy-\n",tokens)

Tokens after processing the article using spacy-
 ['The', 'Daman', 'and', 'Diu', 'administration', 'on', 'Wednesday', 'withdrew', 'a', 'circular', 'that', 'asked', 'women', 'staff', 'to', 'tie', 'rakhis', 'on', 'male', 'colleagues', 'after', 'the', 'order', 'triggered', 'a', 'backlash', 'from', 'employees', 'and', 'was', 'ripped', 'apart', 'on', 'social', 'media', '.', 'The', 'union', 'territory?s', 'administration', 'was', 'forced', 'to', 'retreat', 'within', '24', 'hours', 'of', 'issuing', 'the', 'circular', 'that', 'made', 'it', 'compulsory', 'for', 'its', 'staff', 'to', 'celebrate', 'Rakshabandhan', 'at', 'workplace.?It', 'has', 'been', 'decided', 'to', 'celebrate', 'the', 'festival', 'of', 'Rakshabandhan', 'on', 'August', '7', '.', 'In', 'this', 'connection', ',', 'all', 'offices/', 'departments', 'shall', 'remain', 'open', 'and', 'celebrate', 'the', 'festival', 'collectively', 'at', 'a', 'suitable', 'time', 'wherein', 'all', 'the', 'lady', 'staff', 'shall', 'tie', 'rakhis', 'to',

Note - The output above contains a lot of undesired text in the form of punctuation, special symbols, etc.

In [59]:
# Create the full list of punctuation marks and special symbols to be removed
punctuation = punctuation + '\n'
print("Punctuations:",punctuation)

Punctuations: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~






In [60]:
# Remove stop words, punctuation and newlines ('\n') to clean the text,
# then count how often each remaining (lowercased) word occurs
word_freq = {}
for word in doc:
    token = word.text.lower()
    if token not in stop_words and token not in punctuation:
        word_freq[token] = word_freq.get(token, 0) + 1
print("Word Frequencies in the article: \n",word_freq)

Word Frequencies in the article: 
 {'daman': 3, 'diu': 3, 'administration': 3, 'wednesday': 1, 'withdrew': 1, 'circular': 4, 'asked': 2, 'women': 2, 'staff': 3, 'tie': 3, 'rakhis': 2, 'male': 1, 'colleagues': 2, 'order': 3, 'triggered': 1, 'backlash': 1, 'employees': 1, 'ripped': 1, 'apart': 2, 'social': 1, 'media': 1, 'union': 1, 'territory?s': 1, 'forced': 1, 'retreat': 1, '24': 1, 'hours': 1, 'issuing': 1, 'compulsory': 1, 'celebrate': 4, 'rakshabandhan': 4, 'workplace.?it': 1, 'decided': 1, 'festival': 5, 'august': 2, '7': 1, 'connection': 1, 'offices/': 1, 'departments': 1, 'shall': 2, 'remain': 1, 'open': 1, 'collectively': 1, 'suitable': 1, 'time': 1, 'lady': 1, 'issued': 4, '1': 1, 'gurpreet': 1, 'singh': 1, 'deputy': 1, 'secretary': 1, 'personnel': 2, 'said': 3, 'ensure': 1, 'skipped': 1, 'office': 1, 'attendance': 1, 'report': 1, 'sent': 1, 'government': 3, 'evening': 2, 'notifications': 1, 'mandating': 1, 'celebration': 2, 'left': 1, 'withdrawing': 1, 'mandate': 1, 'right': 
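
One subtlety in the cleaning step above: `punctuation` is a string, so `not in punctuation` is a substring test rather than a per-character test. The sketch below illustrates the difference on a few made-up tokens. The helper name `is_punct_only` and the token list are assumptions for illustration only and are not part of the notebook.

```python
from string import punctuation

# Same extended punctuation string as in the notebook
punct = punctuation + '\n'

def is_punct_only(token: str) -> bool:
    """Hypothetical helper: True when every character of the token is punctuation."""
    return len(token) > 0 and all(ch in punct for ch in token)

tokens = [',', '()', '--', 'offices/', 'staff']  # illustrative tokens only

# Substring test (as in the notebook): '()' matches because it happens to be a
# contiguous run inside the punctuation string, while '--' does not match.
substring_hits = [t for t in tokens if t in punct]

# Per-character test: matches any token made up solely of punctuation.
charwise_hits = [t for t in tokens if is_punct_only(t)]

print(substring_hits)
print(charwise_hits)
```

Whether the stricter per-character check is worth adopting depends on how many multi-character punctuation tokens the tokenizer actually produces for a given article.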

In [61]:
# To normalize the frequencies, first get the maximum frequency
max_freq = max(word_freq.values())
print("Maximum frequency %d "% max_freq) 

Maximum frequency 5 


In [62]:
# Divide every frequency by the maximum frequency
for word in word_freq.keys():
    word_freq[word] = word_freq[word]/max_freq
print("Normalized Word Frequencies in the article:\n",word_freq)

Normalized Word Frequencies in the article:
 {'daman': 0.6, 'diu': 0.6, 'administration': 0.6, 'wednesday': 0.2, 'withdrew': 0.2, 'circular': 0.8, 'asked': 0.4, 'women': 0.4, 'staff': 0.6, 'tie': 0.6, 'rakhis': 0.4, 'male': 0.2, 'colleagues': 0.4, 'order': 0.6, 'triggered': 0.2, 'backlash': 0.2, 'employees': 0.2, 'ripped': 0.2, 'apart': 0.4, 'social': 0.2, 'media': 0.2, 'union': 0.2, 'territory?s': 0.2, 'forced': 0.2, 'retreat': 0.2, '24': 0.2, 'hours': 0.2, 'issuing': 0.2, 'compulsory': 0.2, 'celebrate': 0.8, 'rakshabandhan': 0.8, 'workplace.?it': 0.2, 'decided': 0.2, 'festival': 1.0, 'august': 0.4, '7': 0.2, 'connection': 0.2, 'offices/': 0.2, 'departments': 0.2, 'shall': 0.4, 'remain': 0.2, 'open': 0.2, 'collectively': 0.2, 'suitable': 0.2, 'time': 0.2, 'lady': 0.2, 'issued': 0.8, '1': 0.2, 'gurpreet': 0.2, 'singh': 0.2, 'deputy': 0.2, 'secretary': 0.2, 'personnel': 0.4, 'said': 0.6, 'ensure': 0.2, 'skipped': 0.2, 'office': 0.2, 'attendance': 0.2, 'report': 0.2, 'sent': 0.2, 'govern
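
The counting and normalization steps can also be sketched on a toy token list, which makes the arithmetic easy to check by hand. This is a minimal illustration using `collections.Counter` rather than the notebook's dictionary loop, and the token list is made up.

```python
from collections import Counter

# Illustrative tokens, assumed to be already lowercased and cleaned
tokens = ['festival', 'circular', 'festival', 'staff', 'circular', 'festival']

# Count occurrences of each word
word_freq = Counter(tokens)
max_freq = max(word_freq.values())

# Divide each count by the largest count so that all values lie in (0, 1]
normalized = {word: count / max_freq for word, count in word_freq.items()}
print(normalized)
```

The most frequent word always ends up with a normalized frequency of 1.0, as `festival` does in the article above.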

In [63]:
# Sentence tokenization: split the article into sentences, from which the best
# ones are picked for the summary (sentences are comma-separated in the output)
sentence = [s for s in doc.sents]
print("Sentence Tokens:\n",sentence)

Sentence Tokens:
 [The Daman and Diu administration on Wednesday withdrew a circular that asked women staff to tie rakhis on male colleagues after the order triggered a backlash from employees and was ripped apart on social media., The union territory?s administration was forced to retreat within 24 hours of issuing the circular that made it compulsory for its staff to celebrate Rakshabandhan at workplace.?It has been decided to celebrate the festival of Rakshabandhan on August 7., In this connection, all offices/ departments shall remain open and celebrate the festival collectively at a suitable time wherein all the lady staff shall tie rakhis to their colleagues,?, the order, issued on August 1 by Gurpreet Singh, deputy secretary (personnel), had said., To ensure that no one skipped office, an attendance report was to be sent to the government the next evening., The two notifications ?, one mandating the celebration of Rakshabandhan (left) and the other withdrawing the mandate (right

In [64]:
# Calculating sentence score for every sentence
sentence_score = {}
for s in sentence:
    for word in s:
        if word.text.lower() in word_freq.keys():
            if s not in sentence_score.keys():
                sentence_score[s] = word_freq[word.text.lower()]
            else:
                sentence_score[s] += word_freq[word.text.lower()]
print("Sentences score based on normalized word frequencies:\n",sentence_score)

Sentences score based on normalized word frequencies:
 {The Daman and Diu administration on Wednesday withdrew a circular that asked women staff to tie rakhis on male colleagues after the order triggered a backlash from employees and was ripped apart on social media.: 8.2, The union territory?s administration was forced to retreat within 24 hours of issuing the circular that made it compulsory for its staff to celebrate Rakshabandhan at workplace.?It has been decided to celebrate the festival of Rakshabandhan on August 7.: 8.8, In this connection, all offices/ departments shall remain open and celebrate the festival collectively at a suitable time wherein all the lady staff shall tie rakhis to their colleagues,?: 6.400000000000001, the order, issued on August 1 by Gurpreet Singh, deputy secretary (personnel), had said.: 3.8000000000000003, To ensure that no one skipped office, an attendance report was to be sent to the government the next evening.: 2.1999999999999997, The two notificat
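
To make the scoring rule concrete, here is a small sketch with hypothetical frequencies and pre-tokenized sentences. It computes the summed score used in the notebook and, as an optional variation that the notebook does not use, the average score per matched word. Summed scores tend to favour longer sentences, which is why the average is sometimes considered.

```python
# Hypothetical normalized word frequencies
word_freq = {'festival': 1.0, 'circular': 0.8, 'staff': 0.6, 'order': 0.6}

# Hypothetical sentences, already tokenized and lowercased
sentences = {
    'S1': ['the', 'circular', 'asked', 'staff', 'to', 'celebrate', 'the', 'festival'],
    'S2': ['the', 'order', 'was', 'withdrawn'],
}

def score(words, freqs):
    """Return (sum, average) of the frequencies of the words found in freqs."""
    matched = [freqs[w] for w in words if w in freqs]
    total = sum(matched)
    average = total / len(matched) if matched else 0.0
    return total, average

for name, words in sentences.items():
    total, average = score(words, word_freq)
    print(name, round(total, 2), round(average, 2))
```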

In [65]:
# To build the summary, keep a fixed percentage of the sentences: those with the highest scores
# This can be done with nlargest from the heapq module, imported below
from heapq import nlargest
sent_perc = 0.1
sent_len = int(len(sentence)*sent_perc)
print('Total number of sentences in the article: %d\nSentences describing the summary would be: %d'%(len(sentence),sent_len)) 

Total number of sentences in the article: 22
Sentences describing the summary would be: 2
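
Because `int()` truncates, a fixed 10% would select zero sentences for any article with fewer than ten sentences. The sketch below shows one possible guard. The function name `summary_length` and its `minimum` parameter are assumptions for illustration and do not appear in the notebook.

```python
def summary_length(n_sentences: int, fraction: float = 0.1, minimum: int = 1) -> int:
    """Hypothetical helper: number of sentences to keep.

    Never returns less than `minimum`, and never more than the number of
    sentences actually available.
    """
    return min(n_sentences, max(minimum, int(n_sentences * fraction)))

print(summary_length(22))  # the 22-sentence article above
print(summary_length(6))   # int(6 * 0.1) alone would give 0
```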


In [66]:
summary = nlargest(sent_len, sentence_score, key=sentence_score.get)
print('Summary based on sentences % used to generate summary-\n',summary)

Summary based on sentences % used to generate summary-
 [The union territory?s administration was forced to retreat within 24 hours of issuing the circular that made it compulsory for its staff to celebrate Rakshabandhan at workplace.?It has been decided to celebrate the festival of Rakshabandhan on August 7., The Daman and Diu administration on Wednesday withdrew a circular that asked women staff to tie rakhis on male colleagues after the order triggered a backlash from employees and was ripped apart on social media.]


In [67]:
# Combine all these sentences together to form the actual summary of the article 
# Each item in summary is a spaCy Span holding one sentence
predicted_summary = [sent.text for sent in summary]
predicted_summary = ' '.join(predicted_summary)
print('Final predicted summary of the article is as follows:\n',predicted_summary)

Final predicted summary of the article is as follows:
 The union territory?s administration was forced to retreat within 24 hours of issuing the circular that made it compulsory for its staff to celebrate Rakshabandhan at workplace.?It has been decided to celebrate the festival of Rakshabandhan on August 7. The Daman and Diu administration on Wednesday withdrew a circular that asked women staff to tie rakhis on male colleagues after the order triggered a backlash from employees and was ripped apart on social media.
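
Note that `nlargest` returns sentences ordered by score, so the summary above starts with the article's second sentence. A possible variation, not used in the notebook, is to put the selected sentences back in document order. The sketch below uses hypothetical sentences and scores to show the idea.

```python
from heapq import nlargest

# Hypothetical sentences, listed in document order, with hypothetical scores
sentences = ['First sentence.', 'Second sentence.', 'Third sentence.', 'Fourth sentence.']
scores = {'First sentence.': 8.2, 'Second sentence.': 8.8,
          'Third sentence.': 6.4, 'Fourth sentence.': 3.8}

# nlargest returns the best-scoring sentences, highest score first
top = nlargest(2, scores, key=scores.get)

# Optional variation: restore the original document order of the selection
in_order = sorted(top, key=sentences.index)

print(' '.join(top))
print(' '.join(in_order))
```

With spaCy sentence spans, the same effect could presumably be obtained by sorting on each span's start position, but that is an assumption to verify against the spaCy documentation.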


In [68]:
# Let's look at the target summary
pd.set_option('display.max_colwidth', None)
df.loc[article_number,'text']

'The Administration of Union Territory Daman and Diu has revoked its order that made it compulsory for women to tie rakhis to their male colleagues on the occasion of Rakshabandhan on August 7. The administration was forced to withdraw the decision within 24 hours of issuing the circular after it received flak from employees and was slammed on social media.'

#### Evaluation - ROUGE score 
Ref - https://stats.stackexchange.com/questions/301626/interpreting-rouge-scores

ROUGE does not try to assess how fluent the summary is. It only tries to assess adequacy, by counting how many n-grams in the generated summary match the n-grams in the reference summary (or summaries).

In short, the results can be interpreted approximately as follows:

ROUGE-n recall=40% means that 40% of the n-grams in the reference summary are also present in the generated summary.
ROUGE-n precision=40% means that 40% of the n-grams in the generated summary are also present in the reference summary.
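
The two definitions above can be illustrated with a deliberately simplified ROUGE-n computation. This sketch assumes lowercased whitespace tokenization and no stemming, so its numbers will not necessarily match those of the `rouge-score` package. The function name `rouge_n` and the example sentences are made up for illustration.

```python
from collections import Counter

def rouge_n(reference: str, candidate: str, n: int = 1):
    """Simplified ROUGE-n: returns (precision, recall) over word n-grams."""
    def ngrams(text):
        words = text.lower().split()
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    # Counter intersection keeps the minimum count per n-gram (clipped matches)
    overlap = sum((ref & cand).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    return precision, recall

precision, recall = rouge_n('the cat sat on the mat', 'the cat sat')
print(precision, recall)
```

Here every unigram of the short candidate appears in the reference, so precision is high, while the candidate covers only half of the reference unigrams, so recall is lower.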

###### Code Snippet below to evaluate the extractive summarization model 

In [69]:
# Evaluate predicted summary with targeted summary using ROUGE metrics
# Ref - https://pypi.org/project/rouge-score/
from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
scores = scorer.score(df.loc[article_number,'text'],predicted_summary)
scores

{'rouge1': Score(precision=0.6710526315789473, recall=0.85, fmeasure=0.7499999999999999),
 'rougeL': Score(precision=0.35526315789473684, recall=0.45, fmeasure=0.39705882352941174)}
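
The `fmeasure` values in this output are consistent with the harmonic mean of precision and recall (the F1 score). The short check below recomputes them from the reported precision and recall. The helper name `f_measure` is chosen for illustration.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall copied from the output above
rouge1 = f_measure(0.6710526315789473, 0.85)
rougeL = f_measure(0.35526315789473684, 0.45)
print(round(rouge1, 4), round(rougeL, 4))
```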

The model performs well against the target summary provided in the dataset for this article. The ROUGE-1 precision (about 67%) and recall (85%) are both good (values above 60% are considered good here). The ROUGE-L scores are lower, at about 36% precision and 45% recall.

These results were obtained using 10% of the article's sentences as the summary. This percentage can be tuned by evaluating on more examples.

#### Note: 
An extractive summary consists of sentences taken verbatim from the original article, so it will not completely match the target summary, which was written manually.

#### Further improvements 
The ROUGE score, used here as the evaluation metric, could be improved with abstractive summarization using pretrained models such as T5 or BART.