### Using TextRank to perform text summarization

- TextRank-based summarization is essentially the same as vector-based summarization, but with a different ranking algorithm.

- **TextRank** is based on Google's **PageRank** algorithm, which is used to rank web pages by popularity.

#### Steps
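1. Split the document into sentences.
2. Compute a TF-IDF vector for each sentence.
3. Compute the cosine similarity between every pair of sentences.
4. Row-normalize and smooth the similarity matrix, then score each sentence by the stationary distribution of the resulting Markov chain.
5. Sort the sentences by score.
6. Print the top N sentences, the top N% of sentences, or the sentences scoring above a threshold.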
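
The scoring part of the steps above can be sketched as a single function. This is a minimal sketch using NumPy only: it assumes the pairwise similarity matrix has already been computed, and `textrank_scores`, `toy_sim`, `tol`, and `max_iter` are illustrative names that do not appear in the notebook.

```python
import numpy as np

def textrank_scores(sim, factor=0.15, tol=1e-8, max_iter=1000):
    """Score items by the stationary distribution of a smoothed random walk."""
    sim = np.asarray(sim, dtype=float)
    n = len(sim)

    # Row-normalize so each row is a probability distribution.
    P = sim / sim.sum(axis=1, keepdims=True)

    # Mix with a uniform transition matrix (smoothing).
    P = (1 - factor) * P + factor * np.ones((n, n)) / n

    # Power iteration, starting from the uniform distribution.
    dist = np.ones(n) / n
    for _ in range(max_iter):
        new_dist = dist @ P
        delta = np.abs(new_dist - dist).sum()
        dist = new_dist
        if delta < tol:
            break
    return dist

# Toy symmetric similarity matrix for three "sentences" (made-up values).
toy_sim = np.array([
    [1.0, 0.5, 0.1],
    [0.5, 1.0, 0.4],
    [0.1, 0.4, 1.0],
])
scores = textrank_scores(toy_sim)
ranking = np.argsort(-scores)  # indices from highest to lowest score
print(scores, ranking)
```

The `max_iter` cap is a design choice added here so the loop always terminates; the notebook's own loop relies on the tolerance alone.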


In [None]:
!wget -nc https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv

In [13]:
# import libraries
import random
import pandas as pd
import numpy as np
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Dimitris\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Dimitris\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [14]:
#import dataset
df = pd.read_csv('bbc_text_cls.csv')

# helper function to remove newline
def strip_nl(s):
    return s.replace("\n"," ")

# remove newline
df['text'] = df['text'].apply(strip_nl)

# get random article
article = df['text'].iloc[random.randint(0,len(df['text'])-1)]

In [15]:
from sklearn.metrics.pairwise import cosine_similarity
# Split the document in sentences
sentences = nltk.sent_tokenize(article)

# calculate TF-IDF
vectorizer = TfidfVectorizer(stop_words=stopwords.words('english'),norm='l1')
tfidf_matrix = vectorizer.fit_transform(sentences)

# compute the cosine similarity between every pair of sentences
# S is an (n_sentences x n_sentences) matrix
S = cosine_similarity(tfidf_matrix)



In [19]:
len(sentences) == len(S[0])

True

In [21]:
# Row-normalize the similarity matrix so each row sums to 1 (a transition matrix)
S /= S.sum(axis=1, keepdims=True)
S[0].sum()

1.0
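
Row normalization divides by each row's sum, so an all-zero row would produce NaNs. Such a row might arise if a sentence contains only stop words and therefore has an empty TF-IDF vector. The following is a minimal sketch of a guarded version, assuming it is acceptable to treat such a row as uniform; `row_normalize` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def row_normalize(sim):
    """Row-normalize a matrix, replacing all-zero rows with uniform rows."""
    out = np.asarray(sim, dtype=float).copy()

    # Assumption: a sentence with no similarity to anything links to all
    # sentences equally, so its row is set to ones before normalizing.
    zero_rows = out.sum(axis=1) == 0
    out[zero_rows] = 1.0

    return out / out.sum(axis=1, keepdims=True)

example = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],  # all-zero row
    [2.0, 1.0, 1.0],
])
normalized = row_normalize(example)
print(normalized)
```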

In [24]:
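# uniform transition matrix: every sentence is equally likely to be visited next
U = np.ones_like(S) / len(S)
# smoothed transition matrix (factor plays the role of PageRank's teleport probability)
factor = 0.15
S = (1 - factor) * S + factor * U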

S[0].sum()

0.9999999999999999

In [28]:
# find the limiting (stationary) distribution: the left eigenvector of S with eigenvalue 1
eigenvals, eigenvecs = np.linalg.eig(S.T)
eigenvals

array([1.        , 0.21907674, 0.7225    , 0.68056288, 0.28483241,
       0.62196215, 0.60569211, 0.34382012, 0.3505369 , 0.36377105,
       0.56879925, 0.40248621, 0.538296  , 0.43750048, 0.46529162,
       0.49965687, 0.48755798])

In [30]:
# eigenvectors are stored in columns; column 0 corresponds to eigenvalue 1 here
eigenvecs[:,0]


array([-0.32586662, -0.24131864, -0.23087772, -0.22113222, -0.25114315,
       -0.2279602 , -0.2218763 , -0.25083325, -0.24869245, -0.21984717,
       -0.23612786, -0.21835121, -0.25630691, -0.23618841, -0.23064255,
       -0.2335762 , -0.25167594])

In [32]:
# check: a stationary vector is unchanged by one transition (v @ S == v)
eigenvecs[:,0].dot(S)

array([-0.32586662, -0.24131864, -0.23087772, -0.22113222, -0.25114315,
       -0.2279602 , -0.2218763 , -0.25083325, -0.24869245, -0.21984717,
       -0.23612786, -0.21835121, -0.25630691, -0.23618841, -0.23064255,
       -0.2335762 , -0.25167594])

In [33]:
# rescale so the entries sum to 1 (a probability distribution)
eigenvecs[:,0] / eigenvecs[:,0].sum()

array([0.07943284, 0.05882353, 0.05627847, 0.05390291, 0.06121834,
       0.05556729, 0.05408429, 0.0611428 , 0.06062096, 0.05358967,
       0.05755823, 0.05322502, 0.06247705, 0.05757299, 0.05622114,
       0.05693624, 0.06134821])
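
`np.linalg.eig` does not guarantee any particular ordering of the eigenvalues, so taking column 0 works for this article but may not in general. A minimal sketch that instead picks the eigenvalue closest to 1 is shown here; `stationary_from_eig` and the two-state matrix `P` are illustrative names and values, not part of the notebook.

```python
import numpy as np

def stationary_from_eig(P):
    """Stationary distribution of a row-stochastic matrix via eigendecomposition."""
    # Left eigenvectors of P are right eigenvectors of P.T.
    eigenvals, eigenvecs = np.linalg.eig(np.asarray(P, dtype=float).T)

    # Pick the eigenvalue closest to 1 rather than assuming it comes first.
    idx = np.argmin(np.abs(eigenvals - 1.0))

    # eig may return a complex dtype; the stationary vector itself is real.
    v = np.real(eigenvecs[:, idx])

    # Dividing by the sum fixes both the scale and the sign.
    return v / v.sum()

# Two-state chain whose stationary distribution is [5/6, 1/6].
P = np.array([
    [0.9, 0.1],
    [0.5, 0.5],
])
pi = stationary_from_eig(P)
print(pi)  # approximately [0.8333, 0.1667]
```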

In [34]:
# start from the uniform distribution
limiting_dist = np.ones(len(S)) / len(S)
threshold = 1e-8
delta = float('inf')
iters = 0

while delta > threshold:
    iters += 1

    # one Markov transition
    p = limiting_dist.dot(S)

    # compute the change in the distribution
    delta = np.abs(p - limiting_dist).sum()

    # update the limiting distribution
    limiting_dist = p
    


In [37]:
limiting_dist

array([0.07943284, 0.05882353, 0.05627847, 0.05390291, 0.06121834,
       0.0555673 , 0.05408429, 0.0611428 , 0.06062096, 0.05358967,
       0.05755823, 0.05322502, 0.06247705, 0.05757299, 0.05622114,
       0.05693624, 0.06134821])

In [39]:
limiting_dist.sum()

0.9999999999999986

In [41]:
# total absolute difference between the eigenvector and power-iteration results
np.abs(eigenvecs[:,0] / eigenvecs[:,0].sum() - limiting_dist).sum()

1.4747766015343888e-08

In [43]:
scores = limiting_dist
sort_idx = np.argsort(-scores)

In [44]:
# pick top N sentences

N = 5
for i in sort_idx[:N]:
    print(f"{scores[i]} {sentences[i]}")

0.0794328417788865 Immigration to be election issue  Immigration and asylum have normally been issues politicians from the big parties have tiptoed around at election time.
0.06247705245183829 But, while all the parties appear to agree the time has come to properly debate and address the issue, there are already signs they will run into precisely the same problems as before.
0.06134821345017068 The challenge for the big parties is to ensure they can engage in the debate during the cut and thrust of a general election while also avoiding that trap.
0.06121833970252338 That was also true at the last general election and the issue did briefly become a campaigning issue.
0.06114279958487467 The Tories are already committed to imposing annual limits on immigration, with a quota for asylum seekers and with applications processed outside the UK.
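
The steps list mentions three ways of choosing sentences, while the cell above shows only the top-N case. The following sketch covers all three, under two assumptions: that the third option means "score above a threshold", and that the selected sentences should be returned in document order rather than score order. The helper names and `demo_scores` are hypothetical.

```python
import numpy as np

def select_top_n(scores, n):
    """Indices of the n highest-scoring sentences, in document order."""
    scores = np.asarray(scores)
    return np.sort(np.argsort(-scores)[:n])

def select_top_fraction(scores, fraction):
    """Indices of the top `fraction` (0-1) of sentences, at least one."""
    n = max(1, int(np.ceil(fraction * len(scores))))
    return select_top_n(scores, n)

def select_above_threshold(scores, threshold):
    """Indices of sentences whose score exceeds `threshold`."""
    return np.flatnonzero(np.asarray(scores) > threshold)

# Made-up scores for five sentences.
demo_scores = np.array([0.30, 0.10, 0.25, 0.15, 0.20])

top_2 = select_top_n(demo_scores, 2)
top_40_percent = select_top_fraction(demo_scores, 0.4)
above = select_above_threshold(demo_scores, 0.18)
print(top_2, top_40_percent, above)
```

With the notebook's variables, something like `[sentences[i] for i in select_top_n(scores, 5)]` would then give the summary sentences in their original order.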
