# TextRank summarization method

**Using the same dataset of BBC News articles, you can apply the TextRank method to a text document to create an extractive summary.**

**Starting with tokenization and vectorization, you then apply further calculations to the TF-IDF matrix that draw on linear algebra and Markov chains.**

In [1]:
import pandas as pd
import numpy as np
import textwrap

# For tokenization
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer

# For vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
#nltk.download('punkt')
#nltk.download('stopwords')

In [3]:
df = pd.read_csv('data/bbc_text_cls.csv')

In [4]:
df.head()

Unnamed: 0,text,labels
0,Ad sales boost Time Warner profit\n\nQuarterly...,business
1,Dollar gains on Greenspan speech\n\nThe dollar...,business
2,Yukos unit buyer faces loan claim\n\nThe owner...,business
3,High fuel prices hit BA's profits\n\nBritish A...,business
4,Pernod takeover talk lifts Domecq\n\nShares in...,business


In [5]:
# Function to wrap large text documents nicely

def wrap(x):
    return textwrap.fill(x, replace_whitespace=False, fix_sentence_endings=True)

In [6]:
# Obtain random business article

sample_doc = df[df.labels == 'business']['text'].sample(random_state=123)

In [7]:
# Wrap entire article (title and body)

print(wrap(sample_doc.iloc[0]))

Ad firm WPP's profits surge 15%

UK advertising giant WPP has posted
larger-than-expected annual profits and predicted that it will
outperform the market in 2005.

Pre-tax profits rose 15% from a year
ago to reach £546m ($1.04bn), ahead of average analysts' forecasts of
£532m.  Revenues were £4.3bn while the firm's operating margins were
14.1%, which it said could reach 14.8% by 2006. During the year WPP
bought US rival Grey Global, creating a giant big enough to rival
sector leader Omnicom.

Chief Executive Martin Sorrell on Friday told
Reuters news agency that WPP had submitted a proposal for United
Business Media's NOP World market research unit.  Analysts say the
unit sell could sell for up to £350m.  WPP in recent years has also
bought firms such as Ogilvy & Mather and Cordiant Communications.  It
also includes the firms Young & Rubicam and J Walter Thompson.  Events
such as the Olympics helped boost WPP's profits in 2004. The company
said the US Congressional elections and the FI

In [8]:
# View article body only using split function - 3 paragraphs

print(sample_doc.iloc[0].split("\n", 1)[1])


UK advertising giant WPP has posted larger-than-expected annual profits and predicted that it will outperform the market in 2005.

Pre-tax profits rose 15% from a year ago to reach £546m ($1.04bn), ahead of average analysts' forecasts of £532m. Revenues were £4.3bn while the firm's operating margins were 14.1%, which it said could reach 14.8% by 2006. During the year WPP bought US rival Grey Global, creating a giant big enough to rival sector leader Omnicom.

Chief Executive Martin Sorrell on Friday told Reuters news agency that WPP had submitted a proposal for United Business Media's NOP World market research unit. Analysts say the unit sell could sell for up to £350m. WPP in recent years has also bought firms such as Ogilvy & Mather and Cordiant Communications. It also includes the firms Young & Rubicam and J Walter Thompson. Events such as the Olympics helped boost WPP's profits in 2004. The company said the US Congressional elections and the FIFA World Cup are likely to present ad

In [9]:
# Tokenize article body into sentences

sentences = nltk.sent_tokenize(sample_doc.iloc[0].split("\n", 1)[1])

In [10]:
sentences

['\nUK advertising giant WPP has posted larger-than-expected annual profits and predicted that it will outperform the market in 2005.',
 "Pre-tax profits rose 15% from a year ago to reach £546m ($1.04bn), ahead of average analysts' forecasts of £532m.",
 "Revenues were £4.3bn while the firm's operating margins were 14.1%, which it said could reach 14.8% by 2006.",
 'During the year WPP bought US rival Grey Global, creating a giant big enough to rival sector leader Omnicom.',
 "Chief Executive Martin Sorrell on Friday told Reuters news agency that WPP had submitted a proposal for United Business Media's NOP World market research unit.",
 'Analysts say the unit sell could sell for up to £350m.',
 'WPP in recent years has also bought firms such as Ogilvy & Mather and Cordiant Communications.',
 'It also includes the firms Young & Rubicam and J Walter Thompson.',
 "Events such as the Olympics helped boost WPP's profits in 2004.",
 'The company said the US Congressional elections and the FI

In [11]:
len(sentences)

11

In [12]:
# Vectorize sentences 

tfidf_vect = TfidfVectorizer(stop_words=stopwords.words('english'), norm='l1')

In [13]:
X = tfidf_vect.fit_transform(sentences)

In [14]:
# Sparse matrix with TF-IDF scores (11 sentences x 105 words)

X

<11x105 sparse matrix of type '<class 'numpy.float64'>'
	with 129 stored elements in Compressed Sparse Row format>

## 1) Calculate cosine similarities

In [15]:
# Compute cosine similarity between each sentence and every other sentence

S = cosine_similarity(X)

In [16]:
# Sanity check for 11-x-11 matrix

S.shape

(11, 11)

In [17]:
# Confirm that there are 11 sentences

len(sentences)

11
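
To make explicit what `cosine_similarity` computes, here is a minimal NumPy sketch on a toy matrix. The array `toy` and the helper `cosine_sim` are illustrative names that do not appear in the notebook, and the values are made up. The formula is the standard one: the dot product of two rows divided by the product of their L2 norms.

```python
import numpy as np

# Toy "sentence x word" matrix (hypothetical values, not taken from the article)
toy = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
])

def cosine_sim(A):
    """Pairwise cosine similarity between the rows of a dense 2D array.

    Assumes no row is all zeros (otherwise the norm would be zero).
    """
    norms = np.linalg.norm(A, axis=1, keepdims=True)  # L2 norm of each row
    unit = A / norms                                   # rescale rows to unit length
    return unit @ unit.T                               # dot products of unit vectors

toy_S = cosine_sim(toy)
print(toy_S.round(3))
```

Because every row is rescaled to unit length first, multiplying a row by a positive constant does not change its similarities. That suggests the `norm='l1'` setting in the vectorizer should not affect `S`, as long as no sentence vector is all zeros.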

## 2) Convert to probabilities

**To create the Markov state-transition matrix, normalize the cosine similarities by dividing each row by its sum, so that every row adds up to one.**

In [18]:
# Ensure keepdims=True to maintain 2D matrix

S /= S.sum(axis=1, keepdims=True)

In [19]:
# Check that rows add up to one

S[0].sum()

0.9999999999999998
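
The reason `keepdims=True` matters can be seen on a small example. This is a sketch with a made-up 2-x-2 matrix `M`. Without `keepdims`, the row sums come back as a 1D array and NumPy broadcasts them across columns instead of down the rows.

```python
import numpy as np

# Hypothetical 2-x-2 matrix used only to illustrate broadcasting
M = np.array([[1.0, 1.0],
              [1.0, 3.0]])

row_sums_2d = M.sum(axis=1, keepdims=True)  # shape (2, 1): one sum per row
row_sums_1d = M.sum(axis=1)                 # shape (2,): lines up with columns

good = M / row_sums_2d  # every row is divided by its own sum
bad = M / row_sums_1d   # column j is divided by the sum of row j instead

print(good.sum(axis=1))  # rows sum to one
print(bad.sum(axis=1))   # rows do not sum to one
```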

**Apply smoothing to the Markov matrix to deal with zero values by combining it with a 'uniform' matrix, in which every entry is one divided by the number of sentences.**

**The 'uniform' matrix is weighted by a factor of 0.15, and the Markov matrix by the remaining 0.85.**

In [20]:
# Create uniform transition matrix with the same shape as S (every entry is 1 / number of sentences)

U = np.ones_like(S) / len(S)

In [21]:
# Rows should add up to one

U[0].sum()

1.0

In [22]:
# 'Smoothed' matrix - convex combination of S and U

factor = 0.15

S = (1 - factor) * S + factor * U

In [23]:
# Rows should add up to one

S[0].sum()

0.9999999999999998
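
The effect of smoothing is easier to see on a transition matrix that actually contains zeros. The 3-state matrix `P` below is a made-up example, not data from the article. After the convex combination, every entry is strictly positive and each row still sums to one.

```python
import numpy as np

# Hypothetical 3-state transition matrix with several zero entries
P = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0],
              [0.0, 0.0, 1.0]])

U3 = np.ones_like(P) / len(P)  # uniform matrix: every entry is 1/3
factor = 0.15

P_smooth = (1 - factor) * P + factor * U3

print(P_smooth.min())        # smallest entry is now factor / 3
print(P_smooth.sum(axis=1))  # rows still sum to one
```

With all entries positive, every state can reach every other state in a single step, which is what makes the limiting distribution in the next section unique.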

## 3) Find the limiting distribution

**Also referred to as the stationary distribution, the limiting distribution provides the score for each sentence. To find it, compute the eigenvalues and eigenvectors of the 'smoothed' probability matrix.**

**Eigenvalues** are the scalars λ for which a square matrix A satisfies Av = λv for some non-zero vector v.

**Eigenvectors** are the corresponding non-zero vectors v, which the matrix only rescales (by λ) without changing their direction.

The stationary distribution is a row vector π that satisfies πS = π, i.e. a left eigenvector of S with eigenvalue 1. Because `np.linalg.eig` returns right (column) eigenvectors, you must transpose the Markov matrix before finding the eigenvalues and eigenvectors.
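
A two-state chain makes the role of the transpose concrete. This is a minimal sketch with a made-up matrix `P2`; the names `vals`, `vecs`, `k` and `pi2` are illustrative. The eigenvector is picked by looking for the eigenvalue closest to 1 rather than by position, since `np.linalg.eig` does not promise any particular ordering.

```python
import numpy as np

# Hypothetical 2-state row-stochastic matrix
P2 = np.array([[0.9, 0.1],
               [0.5, 0.5]])

# Right eigenvectors of P2.T are left eigenvectors of P2
vals, vecs = np.linalg.eig(P2.T)

k = np.argmin(np.abs(vals - 1.0))  # position of the eigenvalue closest to 1
pi2 = np.real(vecs[:, k])          # drop any zero imaginary part
pi2 = pi2 / pi2.sum()              # rescale so the entries sum to one

print(pi2)       # stationary distribution
print(pi2 @ P2)  # unchanged by one more step of the chain
```

For this matrix the stationary distribution can also be worked out by hand as (5/6, 1/6), which gives a quick check on the result.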

In [24]:
# Calculate the eigenvalues and eigenvectors from transposed matrix

eigenvals, eigenvecs = np.linalg.eig(S.T)

In [25]:
# 11 eigenvalues (one should be 1)

eigenvals

array([1.        , 0.79424305, 0.74403335, 0.49082468, 0.52102797,
       0.70229786, 0.67039834, 0.61843748, 0.57085743, 0.59890148,
       0.58373726])

In [26]:
# 11 eigenvectors (stored as the columns of an 11-x-11 array)

len(eigenvecs)

11

In [27]:
# 1st eigenvector (1st column) - do not worry about negative values yet

eigenvecs[:, 0]

array([-0.31540874, -0.29916867, -0.2934562 , -0.30751681, -0.30162485,
       -0.29140274, -0.31760331, -0.2852635 , -0.29789447, -0.29632761,
       -0.30925864])

In [28]:
# Test assumption that this is an eigenvector - note that values do not change

eigenvecs[:, 0].dot(S)

array([-0.31540874, -0.29916867, -0.2934562 , -0.30751681, -0.30162485,
       -0.29140274, -0.31760331, -0.2852635 , -0.29789447, -0.29632761,
       -0.30925864])

In [29]:
# To make all values positive and add up to one, divide the vector by its total

eigenvecs[:, 0] / eigenvecs[:, 0].sum()

array([0.09514806, 0.09024899, 0.08852573, 0.09276734, 0.09098993,
       0.08790627, 0.09581009, 0.08605427, 0.08986461, 0.08939194,
       0.09329279])

In [30]:
# ------------------------- BRUTE FORCE METHOD TO FIND LIMITING DISTRIBUTION ------------------------- #


# Initialize limiting distribution vector (uniform starting distribution)
limiting_dist = np.ones(len(S)) / len(S)

# Define threshold (10 to the power of -8)
threshold = 1e-8

# Stores how much distribution changes from one step to next
delta = float('inf')

# Store current iteration number (counter)
iters = 0

while delta > threshold:
    iters += 1 
    
    # Advance the distribution one step through the Markov chain
    p = limiting_dist.dot(S) 
    
    # Compute change in limiting distribution 
    delta = np.abs(p - limiting_dist).sum() 
    
    # Update limiting distribution 
    limiting_dist = p

print(iters)

47


In [31]:
# Note that these values match eigenvecs[:, 0] / eigenvecs[:, 0].sum()

limiting_dist

array([0.09514806, 0.09024899, 0.08852574, 0.09276733, 0.09098993,
       0.08790628, 0.09581008, 0.08605426, 0.0898646 , 0.08939194,
       0.09329278])

In [32]:
# Values should add up to one

limiting_dist.sum()

1.0000000000000004

In [33]:
# Calculate absolute difference between limiting distribution and 1st eigenvector divided by its sum

np.abs(eigenvecs[:,0] / eigenvecs[:,0].sum() - limiting_dist).sum()

3.496053004037325e-08

**The limiting distribution is the same as the first eigenvector divided by its total:**

`limiting_dist = eigenvecs[:, 0] / eigenvecs[:, 0].sum()`

**If you calculate the difference between the two, as above, the answer is approximately 0.000000035, which is effectively zero.**
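
The iterative approach can also be wrapped in a reusable function. The sketch below is one possible form, not the notebook's own code: the name `power_iteration`, the `max_iter` cap and the 3-state matrix `P3` are assumptions added for illustration. The cap guards against an endless loop if the threshold is never reached.

```python
import numpy as np

def power_iteration(P, tol=1e-8, max_iter=1000):
    """Repeatedly apply a row-stochastic matrix P to a uniform start vector.

    Returns the final distribution and the number of iterations used.
    """
    dist = np.ones(len(P)) / len(P)
    for i in range(1, max_iter + 1):
        new_dist = dist @ P
        delta = np.abs(new_dist - dist).sum()  # L1 change between steps
        dist = new_dist
        if delta <= tol:
            break
    return dist, i

# Hypothetical 3-state chain with strictly positive entries
P3 = np.array([[0.5, 0.3, 0.2],
               [0.2, 0.6, 0.2],
               [0.3, 0.3, 0.4]])

pi3, n_iters = power_iteration(P3)
print(pi3, n_iters)
```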

In [34]:
# Use the limiting distribution as the sentence scores, then sort the indices by descending score

scores = limiting_dist

In [35]:
sort_idx = np.argsort(-scores)

In [36]:
sort_idx

array([ 6,  0, 10,  3,  4,  1,  8,  9,  2,  5,  7], dtype=int64)

### There are many options for how to choose which sentences to include:

* top N sentences
* top N words
* top X% sentences or top X% words
* sentences with scores > average score
* sentences with scores > factor * average score

**You also don't have to present the selected sentences in score order; a summary may read better if they are kept in the same order as in the original document.**
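
Two of these options can be sketched as follows. The arrays `demo_scores` and `demo_sentences` are made-up stand-ins for `scores` and `sentences`, and `multiple` is a hypothetical name for the factor applied to the average score.

```python
import numpy as np

# Hypothetical scores and sentences standing in for `scores` and `sentences`
demo_scores = np.array([0.28, 0.10, 0.25, 0.05, 0.32])
demo_sentences = ["s0", "s1", "s2", "s3", "s4"]

# Option 1: take the top-N sentences, then restore document order
top_n = 3
top_idx = np.argsort(-demo_scores)[:top_n]  # indices of the N highest scores
in_order = np.sort(top_idx)                 # same indices, in document order
summary_sentences = [demo_sentences[i] for i in in_order]

# Option 2: keep sentences scoring above a multiple of the average score
multiple = 1.3
above_avg = np.flatnonzero(demo_scores > multiple * demo_scores.mean())

print(summary_sentences)
print(above_avg)
```

`np.flatnonzero` returns indices in ascending order, so the second option preserves document order without a separate sort.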

In [37]:
print("Generated summary:\n")

for i in sort_idx[:5]:
    print(wrap("%.2f: %s" % (scores[i], sentences[i])))

Generated summary:

0.10: WPP in recent years has also bought firms such as Ogilvy &
Mather and Cordiant Communications.
0.10: 
UK advertising giant WPP has posted larger-than-expected annual
profits and predicted that it will outperform the market in 2005.
0.09: The long-term outlook looks "very favourable" because of media
and technology developments and the strength of the US economy, WPP
said.
0.09: During the year WPP bought US rival Grey Global, creating a
giant big enough to rival sector leader Omnicom.
0.09: Chief Executive Martin Sorrell on Friday told Reuters news
agency that WPP had submitted a proposal for United Business Media's
NOP World market research unit.


In [38]:
# View original article title

sample_doc.iloc[0].split("\n")[0]

"Ad firm WPP's profits surge 15%"

## Create TextRank summarization function

In [39]:
# Function to TextRank document - all steps from tokenization to printing summary

def summary(text, factor = 0.15):
    # Extract sentences 
    sents = nltk.sent_tokenize(text) 
    
    # Perform TF-IDF
    featurizer = TfidfVectorizer(stop_words=stopwords.words('english'), norm='l1') 
    X = featurizer.fit_transform(sents) 
    
    # Compute cosine similarity 
    S = cosine_similarity(X) 
    
    # Normalize cosine similarity matrix 
    S /= S.sum(axis=1, keepdims=True) 
    
    # Uniform transition matrix 
    U = np.ones_like(S) / len(S) 
    
    # Smoothed similarity matrix 
    S = (1 - factor) * S + factor * U 
    
    # Find the limiting / stationary distribution 
    eigenvals, eigenvecs = np.linalg.eig(S.T) 
    
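    # Compute scores from the eigenvector whose eigenvalue is closest to 1
    # (np.linalg.eig does not guarantee any ordering of the eigenvalues)
    k = np.argmin(np.abs(eigenvals - 1))
    scores = np.real(eigenvecs[:, k] / eigenvecs[:, k].sum())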
    
    # Sort the scores 
    sort_idx = np.argsort(-scores) 
    
    # Print top five sentences with scores 
    for i in sort_idx[:5]:
        print(wrap("%.2f: %s" % (scores[i], sents[i])))

In [40]:
# Test function on entertainment article

doc = df[df.labels == 'entertainment']['text'].sample()

In [41]:
summary(doc.iloc[0].split("\n", 1)[1])

0.12: Rapper Young Buck has been charged after allegedly stabbing a
man who hit Dr Dre as he was about to receive a lifetime achievement
award.
0.12: Mr Johnson allegedly approached Dr Dre, who was seated at a
table in front of the stage, and appeared to ask for an autograph
before punching him.
0.11: He said not holding the awards would be counter to the work the
magazine has done to promote hip hop music.
0.11: Vibe magazine president Kenard Gibbs said the attack earlier
this month in Santa Monica was "sickening".
0.11: 
The US Vibe awards will be held again next year despite a
stabbing which happened during the ceremony.


In [42]:
# View article title

doc.iloc[0].split("\n")[0]

'Vibe awards back despite violence'
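
One edge case the function does not handle: if a sentence contains only stop words, its TF-IDF row can be all zeros. Its row of similarities can then sum to zero, and the normalization step divides zero by zero. The helper below is a sketch of one way to guard against that. The name `row_normalize` and the choice to replace such rows with a uniform row are assumptions, not part of the original notebook.

```python
import numpy as np

def row_normalize(S):
    """Divide each row by its sum; rows that sum to zero become uniform."""
    S = np.asarray(S, dtype=float)
    row_sums = S.sum(axis=1, keepdims=True)
    zero_rows = (row_sums == 0).ravel()

    # Replace zero sums with 1 so the division never produces NaN
    safe_sums = np.where(row_sums == 0, 1.0, row_sums)
    out = S / safe_sums

    # Uniform fallback: a zero row moves to every sentence with equal probability
    out[zero_rows] = 1.0 / S.shape[1]
    return out

# Hypothetical similarity matrix whose last row is all zeros
demo_S = np.array([[1.0, 0.5, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.0, 0.0, 0.0]])

demo_P = row_normalize(demo_S)
print(demo_P)
```

A helper like this could stand in for the in-place `S /= S.sum(axis=1, keepdims=True)` line. Whether a uniform row is the right treatment for an empty sentence is a design choice; dropping such sentences before vectorizing is another option.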

# TextRank summarization with Sumy

You must first install Sumy from a command terminal, either from PyPI or directly from the GitHub repository:

    pip install sumy
    pip install git+https://github.com/miso-belica/sumy.git

Then you can import the necessary modules, apply them to the same randomly selected entertainment article as above, and compare the results.

In [43]:
import sumy

ModuleNotFoundError: No module named 'sumy'

In [44]:
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer

ModuleNotFoundError: No module named 'sumy'

In [None]:
# Using TextRank Summarizer

summarizer = TextRankSummarizer()

parser = PlaintextParser.from_string(doc.iloc[0].split("\n", 1)[1], Tokenizer("english"))

tr_summary = summarizer(parser.document, sentences_count=5)

In [None]:
tr_summary

In [None]:
# Loop through sentences in summary to wrap each in output

for s in tr_summary:
    print(wrap(str(s)))

In [None]:
# Using LSA Summarizer

summarizer = LsaSummarizer()

lsa_summary = summarizer(parser.document, sentences_count=5)

for s in lsa_summary:
    print(wrap(str(s)))

In [None]:
# Note: Sumy could not be installed in this environment, so the Sumy cells above were not run

# Text summarization with Gensim

The parameters for Gensim's **`summarize()`** function are:
    
* **`text`** (string) – given text document
* **`ratio`** (float, optional) – a number between 0 and 1 giving the proportion of the original text's sentences to include in the summary, e.g. 0.1 for 10%.
* **`word_count`** (int or None, optional) – determines how many words will be contained in the output. If both `ratio` and `word_count` parameters are specified, the ratio will be ignored.
* **`split`** (bool, optional) – if True, list of sentences will be returned. Otherwise joined strings will be returned.


**Since this lecture was recorded, the `gensim.summarization` module has been removed as of Gensim 4.0. You could install an older version (3.8.3) to run the cell below, or wait until a replacement has been implemented.**

In [45]:
from gensim.summarization.summarizer import summarize

# Use a distinct name so the summary() function defined above is not overwritten
gensim_summary = summarize(doc.iloc[0].split("\n", 1)[1])

print(wrap(gensim_summary))

ModuleNotFoundError: No module named 'gensim.summarization'

The official documentation for the `summarize()` function is available at the link below:

https://radimrehurek.com/gensim_3.8.3/summarization/summariser.html