**Important Note: This notebook is run on Google Colab due to dependency issues with Anaconda!**

# **Imports**

In [None]:
!pip install gensim



In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem.porter import PorterStemmer
import math
import numpy as np
import pickle
from gensim.models import Word2Vec
from scipy import spatial
import networkx as nx
import time

In [None]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [None]:
# Get the English stop words
stop_words = set(stopwords.words('english')) 
# Stop words for every language NLTK ships; a set keeps membership tests fast
my_stopwords = set(stopwords.words())
# Create the stemmer
stemmer = PorterStemmer()

## **Importing dataset**

In [None]:
data = pd.read_csv('news_summary.csv', encoding='latin-1') #check encodings types

# **Exploring the dataset**

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4514 entries, 0 to 4513
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   author     4514 non-null   object
 1   date       4514 non-null   object
 2   headlines  4514 non-null   object
 3   read_more  4514 non-null   object
 4   text       4514 non-null   object
 5   ctext      4396 non-null   object
dtypes: object(6)
memory usage: 211.7+ KB


In [None]:
data.head()

| | author | date | headlines | read_more | text | ctext |
|---|---|---|---|---|---|---|
| 0 | Chhavi Tyagi | 03 Aug 2017,Thursday | Daman & Diu revokes mandatory Rakshabandhan in... | http://www.hindustantimes.com/india-news/raksh... | The Administration of Union Territory Daman an... | The Daman and Diu administration on Wednesday ... |
| 1 | Daisy Mowke | 03 Aug 2017,Thursday | Malaika slams user who trolled her for 'divorc... | http://www.hindustantimes.com/bollywood/malaik... | Malaika Arora slammed an Instagram user who tr... | From her special numbers to TV?appearances, Bo... |
| 2 | Arshiya Chopra | 03 Aug 2017,Thursday | 'Virgin' now corrected to 'Unmarried' in IGIMS... | http://www.hindustantimes.com/patna/bihar-igim... | The Indira Gandhi Institute of Medical Science... | The Indira Gandhi Institute of Medical Science... |
| 3 | Sumedha Sehra | 03 Aug 2017,Thursday | Aaj aapne pakad liya: LeT man Dujana before be... | http://indiatoday.intoday.in/story/abu-dujana-... | Lashkar-e-Taiba's Kashmir commander Abu Dujana... | Lashkar-e-Taiba's Kashmir commander Abu Dujana... |
| 4 | Aarushi Maheshwari | 03 Aug 2017,Thursday | Hotel staff to get training to spot signs of s... | http://indiatoday.intoday.in/story/sex-traffic... | Hotels in Maharashtra will train their staff t... | Hotels in Mumbai and other Indian cities are t... |


In [None]:
duplicateRows1 = data[data.duplicated(subset=['ctext'])]
print('complete text duplicates')
print(duplicateRows1)

complete text duplicates
                  author  ...                                              ctext
42          Chhavi Tyagi  ...  The Daman and Diu administration on Wednesday ...
190         Chhavi Tyagi  ...  Charges and counter charges flew in the Lok Sa...
231   Niharika Prabhakar  ...                                                NaN
286        Saloni Tandon  ...                                                NaN
368         Chhavi Tyagi  ...  Bihar chief minister Nitish Kumar comfortably ...
...                  ...  ...                                                ...
4381        Chhavi Tyagi  ...  Rounding off a day of hectic electioneering in...
4423      Mansha Mahajan  ...                                                NaN
4454     Abhishek Bansal  ...                                                NaN
4500      Mansha Mahajan  ...                                                NaN
4508        Tarun Khanna  ...                                                NaN

[1

In [None]:
print(str(data[4283:4284]['text']))
print(str(data[4285:4286]['text']))

4283    Elections in Goa ended up in a hung Assembly, ...
Name: text, dtype: object
4285    Uttar Pradesh Chief Minister Akhilesh Yadav on...
Name: text, dtype: object


In [None]:
duplicateRows2 = data[data.duplicated(subset=['text'])]
print('summary text duplicates',duplicateRows2)

summary text duplicates Empty DataFrame
Columns: [author, date, headlines, read_more, text, ctext]
Index: []


In [None]:
duplicateRows3 = data[data.duplicated(subset=['headlines'])]
print('headlines duplicates',duplicateRows3)

headlines duplicates Empty DataFrame
Columns: [author, date, headlines, read_more, text, ctext]
Index: []


In [None]:
selected_features = data[['headlines','text']]
selected_features.head()

| | headlines | text |
|---|---|---|
| 0 | Daman & Diu revokes mandatory Rakshabandhan in... | The Administration of Union Territory Daman an... |
| 1 | Malaika slams user who trolled her for 'divorc... | Malaika Arora slammed an Instagram user who tr... |
| 2 | 'Virgin' now corrected to 'Unmarried' in IGIMS... | The Indira Gandhi Institute of Medical Science... |
| 3 | Aaj aapne pakad liya: LeT man Dujana before be... | Lashkar-e-Taiba's Kashmir commander Abu Dujana... |
| 4 | Hotel staff to get training to spot signs of s... | Hotels in Maharashtra will train their staff t... |


In [None]:
selected_features.isnull().values.any()


False

In [None]:
X = selected_features['text'].values
Y = selected_features['headlines'].values
type(X)

numpy.ndarray

**This part (train_test_split) is mainly for the abstractive models and has nothing to do with the extractive models.**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size = 0.2, random_state = 25)
X_train.shape, X_test.shape

((3611,), (903,))

# **Extractive Approach : TF-IDF**

## **Main Functions**

In [None]:
def apply_preprocessing(text):
    original_sentences = sent_tokenize(text.lower())
    sentences = []
    # Sentences Pre-processing
    for sent in original_sentences:
        tokens = [item for item in word_tokenize(sent) if item not in my_stopwords and item != '.' and item.isalpha()] # Preprocessing
        sentences.append(' '.join([stemmer.stem(token) for token in tokens])) # Stemming
    return original_sentences, sentences

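def word_tf(word, sentence):
    # Term frequency: share of the sentence's tokens that equal `word`
    tokens = sentence.split()
    return tokens.count(word) / len(tokens)

def word_idf(word, sentences):
    # Inverse document frequency, treating each sentence as a document
    n_containing = sum(1 for sent in sentences if word in sent.split())
    return math.log10(len(sentences) / n_containing)

def word_in_sentence_tfidf(word, sentence, sentences):
    return word_tf(word, sentence) * word_idf(word, sentences)

def sentence_score(sentence, sentences):
    # Sum the TF-IDF weights over the sentence's word tokens (not its characters)
    return sum([word_in_sentence_tfidf(word, sentence, sentences) for word in sentence.split()])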

def sentences_scores(sentences):
    return [sentence_score(sent, sentences) for sent in sentences]

def get_best_k_sentences_indecies(k, sentences, original_sentences):
    #lista = sorted(list(np.argsort(sentences_scores(sentences))[-k:-1]))
    return sorted(list(np.argsort(sentences_scores(sentences))[-k:]))

def get_best_k_sentences(k, sentences, original_sentences):
    return [original_sentences[idx] for idx in get_best_k_sentences_indecies(k, sentences, original_sentences)]

def summarize(text, k):
    # Apply preprocessing
    original_sentences, sentences = apply_preprocessing(text)
    # Summarize
    return ' '.join(get_best_k_sentences(k, sentences, original_sentences))
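
To make the scoring concrete, here is a small self-contained sketch of token-level TF-IDF sentence scoring on pre-tokenized toy sentences. It assumes the conventional definition idf = log10(N / df), with each sentence treated as a document. The helper names (`tf`, `idf`, `score_sentences`, `top_k_indices`) and the toy data are illustrative and not part of the notebook.

```python
import math

def tf(word, tokens):
    # Share of the sentence's tokens that equal `word`
    return tokens.count(word) / len(tokens)

def idf(word, tokenized_sentences):
    # log10(N / df), where df is the number of sentences containing `word`
    df = sum(1 for tokens in tokenized_sentences if word in tokens)
    return math.log10(len(tokenized_sentences) / df)

def score_sentences(tokenized_sentences):
    # A sentence's score is the sum of the TF-IDF weights of its tokens
    return [
        sum(tf(w, tokens) * idf(w, tokenized_sentences) for w in tokens)
        for tokens in tokenized_sentences
    ]

def top_k_indices(scores, k):
    # Pick the k best-scoring sentences, then restore document order
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical stemmed tokens, loosely modelled on the milk example
toy = [
    ["milk", "product", "adulter"],
    ["milk", "product", "contamin"],
    ["milk", "soda", "powder"],
]
scores = score_sentences(toy)
best = top_k_indices(scores, 1)
print(scores, best)
```

In this toy input "milk" appears in every sentence, so its idf is zero and it contributes nothing. The third sentence wins because both of its remaining tokens are unique to it.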

## **Experiments & Results**

In [None]:
ext_summaries = []
n_sentences = 1
start_time = time.time()
for txt in X_train:
    ext_summaries.append(summarize(txt, n_sentences))
print('Elapsed time is {:.3f} seconds'.format(time.time()-start_time))

Elapsed time is 22.816 seconds


In [None]:
#Take a look at the output
print('the news: ', X_train[19])
print('===========================================')
print('the generated summary: ', ext_summaries[19])

the news:  Tamil Nadu Milk and Dairy Products Development Minister Rajenthra Bhalaji has alleged that products of private milk producers are adulterated. In a press briefing, Bhalaji held milk products by Nestle and Reliance, affirming that he had laboratory results which show that these are contaminated. He further alleged that there are contents of caustic soda and bleaching powder in the products.
the generated summary:  tamil nadu milk and dairy products development minister rajenthra bhalaji has alleged that products of private milk producers are adulterated.


# **Extractive Approach : TextRank**

## **Main Functions**

In [None]:
def apply_preprocessing(text):
    original_sentences = sent_tokenize(text.lower())
    sentences = []
    # Sentences Pre-processing
    for sent in original_sentences:
        tokens = [item for item in word_tokenize(sent) if item not in my_stopwords and item != '.' and item.isalpha()] # Preprocessing
        sentences.append(' '.join([stemmer.stem(token) for token in tokens])) # Stemming
    return original_sentences, sentences

def textrank_summarize(text, k):
  # Apply pre-processing to the input text
  original_sentences, sentences = apply_preprocessing(text)
  # Split each pre-processed sentence into word tokens (iterating over a string would yield characters)
  tokenized = [sent.split() for sent in sentences]
  # Train 1-D word embeddings on this text (gensim >= 4 argument names; gensim 3.x used size= and iter=)
  word_embeddings = Word2Vec(tokenized, min_count=1, vector_size=1, epochs=1000)
  # Concatenate the word embeddings of each sentence
  sentence_embeddings=[[word_embeddings.wv[word][0] for word in words] for words in tokenized]
  # Get the length of the longest sentence in the input text, used later for zero-padding
  max_length=max([len(tokens) for tokens in tokenized])
  sentence_embeddings=[np.pad(embedding,(0,max_length-len(embedding)),'constant') for embedding in sentence_embeddings]
  #Use the cosine similarity metric to compute the distances
  similarity_matrix = np.zeros([len(sentences), len(sentences)])
  for i,row_embedding in enumerate(sentence_embeddings):
    for j,column_embedding in enumerate(sentence_embeddings):
      similarity_matrix[i][j]=1-spatial.distance.cosine(row_embedding,column_embedding)
  # Transform the similarity matrix into a graph so it can be passed to the built-in PageRank function to retrieve each sentence's score
  nx_graph = nx.from_numpy_array(similarity_matrix)
  scores = nx.pagerank(nx_graph)
  # Rank the sentence indices by PageRank score and keep the k best
  top = sorted(scores, key=scores.get, reverse=True)[:k]
  # Return the selected original sentences in document order
  return ' '.join(original_sentences[idx] for idx in sorted(top))
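
`nx.pagerank` scores each sentence by how strongly the other sentences point to it in the similarity graph. As a rough illustration of that step, the sketch below runs a plain power iteration over a small hand-written similarity matrix. The function name `pagerank_scores`, the toy matrix, and the choice to ignore self-similarity on the diagonal are assumptions made for illustration. The damping factor 0.85 matches the networkx default, but networkx's implementation differs in details such as convergence checks and dangling-node handling.

```python
def pagerank_scores(similarity, damping=0.85, iterations=100):
    n = len(similarity)
    # Edge weights between distinct sentences; the diagonal is dropped (assumption)
    weights = [
        [similarity[i][j] if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    row_sums = [sum(row) for row in weights]
    # Start from a uniform distribution over sentences
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            # Score flowing into sentence j from each sentence i, proportional
            # to the weight of edge i -> j relative to all of i's edges
            incoming = sum(
                scores[i] * weights[i][j] / row_sums[i]
                for i in range(n)
                if row_sums[i] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

# Hypothetical cosine similarities between three sentences
toy_similarity = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]
toy_scores = pagerank_scores(toy_similarity)
best_sentence = max(range(len(toy_scores)), key=lambda i: toy_scores[i])
print(toy_scores, best_sentence)
```

The middle sentence is the most similar to the other two overall, so it receives the highest score and would be chosen for a one-sentence summary.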

## **Experiments & Results**

In [None]:
txt_summaries = []
n_sentences = 1
start_time = time.time()
for txt in X_train[0:250]:
    txt_summaries.append(textrank_summarize(txt, n_sentences))
print('Elapsed time is {:.3f} seconds'.format(time.time()-start_time))

Elapsed time is 314.064 seconds


In [None]:
#Take a look at the output
print('the news: ', X_train[19])
print('===========================================')
print('the generated summary: ', txt_summaries[19])

the news:  Tamil Nadu Milk and Dairy Products Development Minister Rajenthra Bhalaji has alleged that products of private milk producers are adulterated. In a press briefing, Bhalaji held milk products by Nestle and Reliance, affirming that he had laboratory results which show that these are contaminated. He further alleged that there are contents of caustic soda and bleaching powder in the products.
the generated summary:  tamil nadu milk and dairy products development minister rajenthra bhalaji has alleged that products of private milk producers are adulterated.


**To evaluate this approach, we compare its output with the output of the TF-IDF-based extractive summarization in the next section.**

# **Evaluation**

**As we're using Colab, we're comparing the performance of the TextRank approach with the TF-IDF approach.**

In [None]:
precision = sum([1 for i in range(len(txt_summaries)) if ext_summaries[i] == txt_summaries[i]])/len(txt_summaries)
print(precision)

0.736
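
The value above is the share of articles on which the two extractive methods pick exactly the same sentence, so it measures agreement rather than summary quality. One possible complement, sketched here as an assumption rather than something the notebook does, is a ROUGE-1-style unigram overlap between each summary and its reference headline (for example the matching entry of `y_train`). The function name `unigram_f1`, the whitespace tokenization, and the example strings are illustrative choices.

```python
from collections import Counter

def unigram_f1(candidate, reference):
    # Bag-of-words counts after lowercasing and whitespace splitting
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical summary and headline, not taken from the dataset
score = unigram_f1(
    "the minister alleged milk products are adulterated",
    "Minister alleges milk products adulterated",
)
print(round(score, 3))
```

Averaging such a score over `zip(ext_summaries, y_train)` and `zip(txt_summaries, y_train)` would give one number per method against the same references.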


**Let's see the differences!!**

In [None]:
for (x,y) in zip(ext_summaries, txt_summaries):
    if x != y:
        print('TF-IDF: ', x)
        print('TextRank: ', y)
        print('=========')

TF-IDF:  hindustan times wrote the film is not "an all-out-war against patriarchy" but "a subdued conversation starter".
TextRank:  'lipstick under my burkha', which released on friday, is "worth the hype...a must watch", wrote the quint.
TF-IDF:  the film also stars jackky bhagnani and prachi desai and focuses on environmental issues like global warming and climate change.
TextRank:  nawazuddin siddiqui's first look from the upcoming short film 'carbon' has been released.
TF-IDF:  while gs road has been renamed as mahapurush srimanta sankerdev path, beltola-khanapara road is renamed as peer azan fakir road.
TextRank:  the bjp-led assam government on thursday changed the names of all the major roads of guwahati.
TF-IDF:  police have arrested two suspects in relation to the first rape incident and have launched an investigation into the second attack.
TextRank:  a 14-year-old girl was raped twice in one night in uk's birmingham by different attackers, according to reports.
TF-IDF:  whil