<a href="https://colab.research.google.com/github/DrAlexSanz/nlpv2-course/blob/master/Text_rank.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook does extractive summarization. Instead of scoring sentences with the TF-IDF approach, I score them with TextRank.

In [32]:
import numpy as np
import pandas as pd
import textwrap
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [33]:
nltk.download("punkt")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [34]:
# Get the data

!wget https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv

--2022-08-08 12:29:42--  https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv
Resolving lazyprogrammer.me (lazyprogrammer.me)... 104.21.23.210, 172.67.213.166, 2606:4700:3031::6815:17d2, ...
Connecting to lazyprogrammer.me (lazyprogrammer.me)|104.21.23.210|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5085081 (4.8M) [text/csv]
Saving to: ‘bbc_text_cls.csv.1’


2022-08-08 12:29:42 (21.8 MB/s) - ‘bbc_text_cls.csv.1’ saved [5085081/5085081]



In [35]:
df = pd.read_csv("bbc_text_cls.csv")

In [36]:
df.head()

Unnamed: 0,text,labels
0,Ad sales boost Time Warner profit\n\nQuarterly...,business
1,Dollar gains on Greenspan speech\n\nThe dollar...,business
2,Yukos unit buyer faces loan claim\n\nThe owner...,business
3,High fuel prices hit BA's profits\n\nBritish A...,business
4,Pernod takeover talk lifts Domecq\n\nShares in...,business


In [37]:
# My dataset is just one article
doc = df[df["labels"] == "business"]["text"].sample(random_state = 42)

In [38]:
def wrap_text(x):
    return textwrap.fill(x, replace_whitespace = False, fix_sentence_endings = True)

In [39]:
# Look at the article I want to summarize. The title is included; it could serve as a good summary on its own, or as a baseline
print(wrap_text(doc.iloc[0]))

Christmas sales worst since 1981

UK retail sales fell in December,
failing to meet expectations and making it by some counts the worst
Christmas since 1981.

Retail sales dropped by 1% on the month in
December, after a 0.6% rise in November, the Office for National
Statistics (ONS) said.  The ONS revised the annual 2004 rate of growth
down from the 5.9% estimated in November to 3.2%. A number of
retailers have already reported poor figures for December.  Clothing
retailers and non-specialist stores were the worst hit with only
internet retailers showing any significant growth, according to the
ONS.

The last time retailers endured a tougher Christmas was 23 years
previously, when sales plunged 1.7%.

The ONS echoed an earlier
caution from Bank of England governor Mervyn King not to read too much
into the poor December figures.  Some analysts put a positive gloss on
the figures, pointing out that the non-seasonally-adjusted figures
showed a performance comparable with 2003. The Novembe

In [40]:
# Tokenize the document into sentences

sents = nltk.sent_tokenize(doc.iloc[0].split("\n", 1)[1]) # split("\n", 1) gives [title, body]; keep only the body

In [41]:
sents

['\nUK retail sales fell in December, failing to meet expectations and making it by some counts the worst Christmas since 1981.',
 'Retail sales dropped by 1% on the month in December, after a 0.6% rise in November, the Office for National Statistics (ONS) said.',
 'The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%.',
 'A number of retailers have already reported poor figures for December.',
 'Clothing retailers and non-specialist stores were the worst hit with only internet retailers showing any significant growth, according to the ONS.',
 'The last time retailers endured a tougher Christmas was 23 years previously, when sales plunged 1.7%.',
 'The ONS echoed an earlier caution from Bank of England governor Mervyn King not to read too much into the poor December figures.',
 'Some analysts put a positive gloss on the figures, pointing out that the non-seasonally-adjusted figures showed a performance comparable with 2003.',
 'The November-De
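The first sentence above still starts with `'\n'` because the title is followed by a blank line. A small sketch of the split, using a made-up article string (`article`, `title`, `body` and `clean_body` are illustrative names, not part of the notebook):

```python
# Hypothetical article: title, blank line, body (same layout as the BBC texts).
article = "Some title\n\nFirst sentence. Second sentence."

# maxsplit=1 cuts only at the first newline, so the body keeps its own newlines.
title, body = article.split("\n", 1)

print(repr(title))  # 'Some title'
print(repr(body))   # '\nFirst sentence. Second sentence.'

# strip() drops the leftover leading newline before sentence tokenization.
clean_body = body.strip()
print(repr(clean_body))  # 'First sentence. Second sentence.'
```

Stripping is optional here: the stray newline only affects how the first sentence prints, not the TF-IDF features.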

In [42]:
featurizer = TfidfVectorizer(stop_words = stopwords.words("english"), norm = "l1")

In [43]:
x = featurizer.fit_transform(sents)
x


<17x154 sparse matrix of type '<class 'numpy.float64'>'
	with 212 stored elements in Compressed Sparse Row format>

In [44]:
# Compute similarity matrix
s = cosine_similarity(x)

In [45]:
s.shape # Should be square matrix

(17, 17)

In [46]:
len(sents) # The similarity matrix has one row and one column per sentence

17
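To make the similarity step concrete, here is a rough NumPy-only sketch of what a pairwise cosine similarity computes. It is not scikit-learn's implementation; `cosine_similarity_matrix` and the `toy` array are made up for illustration, and the sketch assumes a dense array with no all-zero rows.

```python
import numpy as np

def cosine_similarity_matrix(x):
    """Pairwise cosine similarity between the rows of a dense 2-D array."""
    x = np.asarray(x, dtype=float)
    # Length (L2 norm) of each row, kept as a column for broadcasting.
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / norms  # assumption: no row is all zeros
    # Dot products between unit vectors are the cosines of the angles.
    return unit @ unit.T

# Toy "sentence by term" matrix: rows are sentences, columns are terms.
toy = np.array([[1.0, 0.0, 1.0],
                [1.0, 0.0, 0.0],
                [0.0, 2.0, 0.0]])
sim = cosine_similarity_matrix(toy)
print(sim.shape)  # (3, 3)
```

Because each row is divided by its own length, rescaling a row by a positive constant does not change the result. That is why the choice of `norm = "l1"` in the vectorizer should not affect the similarity matrix.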

In [47]:
# Normalize each row so that it sums to 1 (row-stochastic matrix)

s /= s.sum(axis = 1, keepdims = True)

In [48]:
# Check that the normalization worked

s[13].sum()

1.0
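The in-place division works here because every row has a positive sum (each sentence has similarity 1 with itself). A slightly more defensive sketch, assuming we want a uniform row whenever a row sums to zero (for instance, a sentence whose feature vector is all zeros); `row_normalize` is a hypothetical helper, not part of the notebook:

```python
import numpy as np

def row_normalize(sim):
    """Turn a non-negative matrix into a row-stochastic one."""
    sim = np.asarray(sim, dtype=float)
    row_sums = sim.sum(axis=1, keepdims=True)
    n = sim.shape[1]
    # Fallback row: jump to any sentence with equal probability.
    uniform = np.full_like(sim, 1.0 / n)
    # Avoid dividing by zero; those rows are replaced by the fallback anyway.
    safe = np.where(row_sums == 0, 1.0, row_sums)
    return np.where(row_sums == 0, uniform, sim / safe)

toy = np.array([[1.0, 0.5],
                [0.0, 0.0]])
print(row_normalize(toy))
```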

In [49]:
# Create uniform transition matrix

u = np.ones_like(s)/len(s)
u[0].sum() # Check that the row sums to 1

1.0

In [50]:
# Choose the smoothing factor and create the smoothed matrix

factor = 0.15
s = (1 - factor) * s + factor * u
s[0].sum() # Check that the rows still sum to 1


1.0
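In symbols, the smoothing step and the distribution it leads to can be written as follows. The names $\alpha$ (for `factor`), $S$ (the row-normalized similarity matrix), $U$ (the uniform matrix `u`), $G$ (the smoothed matrix) and $\pi$ are my notation, not the notebook's; $n$ is the number of sentences and $\pi$ is a row vector.

$$
G = (1 - \alpha)\, S + \alpha\, U, \qquad U_{ij} = \frac{1}{n}, \qquad \alpha = 0.15
$$

$$
\pi\, G = \pi, \qquad \sum_{i=1}^{n} \pi_i = 1
$$

Since the rows of $S$ and $U$ each sum to 1, so do the rows of $G$, and every entry of $G$ is at least $\alpha / n > 0$.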

In [52]:
# Stationary distribution

eigenvals, eigenv = np.linalg.eig(s.T)

In [53]:
eigenvals # One eigenvalue should be 1, because the rows of s sum to 1

array([1.        , 0.24245466, 0.72108199, 0.67644122, 0.34790129,
       0.34417302, 0.3866884 , 0.40333562, 0.41608572, 0.44238593,
       0.63909999, 0.62556792, 0.58922572, 0.57452382, 0.48511399,
       0.51329157, 0.52975372])

In [57]:
eigenv[:, 0] # Column 0 because the eigenvalue 1 happens to be at position 0. In general you need to search for it

array([-0.24206557, -0.27051337, -0.2213806 , -0.28613638, -0.25065894,
       -0.2499217 , -0.279622  , -0.21515455, -0.2226665 , -0.22745415,
       -0.2059112 , -0.20959727, -0.23526242, -0.24203809, -0.23663025,
       -0.2940483 , -0.20865607])
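`np.linalg.eig` does not promise any particular ordering of the eigenvalues, so relying on position 0 is fragile. A minimal sketch of looking the index up instead, on a made-up 2x2 transition matrix (`stationary_from_eig` and `P` are illustrative names):

```python
import numpy as np

def stationary_from_eig(P):
    """Stationary distribution of a row-stochastic matrix via eigendecomposition."""
    # Left eigenvectors of P are the (right) eigenvectors of P.T.
    eigenvals, eigenvecs = np.linalg.eig(P.T)
    # Pick the eigenvalue closest to 1 instead of assuming it comes first.
    idx = np.argmin(np.abs(eigenvals - 1.0))
    v = np.real(eigenvecs[:, idx])
    # Dividing by the sum fixes both the scale and the arbitrary sign.
    return v / v.sum()

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
pi = stationary_from_eig(P)
print(pi)
```

For this `P` the stationary distribution is (5/6, 1/6), which can be checked by hand from `pi @ P == pi`.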

In [58]:
# A left eigenvector with eigenvalue 1 is unchanged when I multiply it by the matrix s

eigenv[:, 0].dot(s)

array([-0.24206557, -0.27051337, -0.2213806 , -0.28613638, -0.25065894,
       -0.2499217 , -0.279622  , -0.21515455, -0.2226665 , -0.22745415,
       -0.2059112 , -0.20959727, -0.23526242, -0.24203809, -0.23663025,
       -0.2940483 , -0.20865607])

In [59]:
# Normalize the eigenvector so it sums to 1 (this also removes the negative sign)
eigenv[:,0]/eigenv[:, 0].sum()

array([0.05907327, 0.06601563, 0.05402535, 0.06982824, 0.06117038,
       0.06099047, 0.06823848, 0.05250595, 0.05433915, 0.05550753,
       0.05025022, 0.05114976, 0.05741304, 0.05906657, 0.05774684,
       0.07175905, 0.05092007])

In [78]:
# Alternative way to find the limiting distribution: brute-force power iteration

limiting_dist = np.ones(len(s))/len(s)
tol = 1e-8
delta = np.inf
iters = 0

while delta > tol:
    iters += 1
    # Markov transitions
    p = limiting_dist.dot(s)
    # Calculate change in limiting distribution
    delta = np.abs(p - limiting_dist).sum()
    # Update limiting dist
    limiting_dist = p

print(iters)

41


In [79]:
limiting_dist # This should match the normalized eigenvector

array([0.05907327, 0.06601563, 0.05402534, 0.06982824, 0.06117038,
       0.06099047, 0.06823848, 0.05250595, 0.05433915, 0.05550753,
       0.05025022, 0.05114977, 0.05741304, 0.05906657, 0.05774685,
       0.07175905, 0.05092008])

In [80]:
limiting_dist.sum()

0.9999999999999982

In [81]:
np.abs(eigenv[:, 0]/eigenv[:, 0].sum() - limiting_dist).sum() # A smaller tol gives a smaller error

1.9964739014777244e-08
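The loop above can be packaged as a small function. This is a sketch under my own assumptions: the `max_iters` cap and the `RuntimeError` are additions so the loop cannot run forever, and `power_iteration` and `P` are illustrative names.

```python
import numpy as np

def power_iteration(P, tol=1e-8, max_iters=1000):
    """Limiting distribution of a row-stochastic matrix by repeated multiplication."""
    n = P.shape[0]
    dist = np.ones(n) / n  # start from the uniform distribution
    for iters in range(1, max_iters + 1):
        new_dist = dist @ P                      # one Markov transition
        delta = np.abs(new_dist - dist).sum()    # L1 change between steps
        dist = new_dist
        if delta <= tol:
            return dist, iters
    raise RuntimeError("power iteration did not converge within max_iters")

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
dist, iters = power_iteration(P)
print(dist, iters)
```

The stopping rule is the same as in the `while` loop: stop once the L1 change between two consecutive steps is no larger than `tol`.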

## From here on it is the same as the text summarization notebook

In [82]:
scores = limiting_dist
sort_idx = np.argsort(-scores)

ind = np.argpartition(scores, -5)[-5:]
ind

array([ 4,  1,  6, 15,  3])

In [83]:
# The result is not too bad! Printing the scores is optional, but why not?
# Sorting by score is optional too; I may prefer not to scramble the order of the sentences

print("Here is the summary:")
for i in sort_idx[:5]:
    print(wrap_text("%.2f: %s" % (scores[i], sents[i])))

Here is the summary:
0.07: "The retail sales figures are very weak, but as Bank of England
governor Mervyn King indicated last night, you don't really get an
accurate impression of Christmas trading until about Easter," said Mr
Shaw.
0.07: A number of retailers have already reported poor figures for
December.
0.07: The ONS echoed an earlier caution from Bank of England governor
Mervyn King not to read too much into the poor December figures.
0.07: Retail sales dropped by 1% on the month in December, after a
0.6% rise in November, the Office for National Statistics (ONS) said.
0.06: Clothing retailers and non-specialist stores were the worst hit
with only internet retailers showing any significant growth, according
to the ONS.


In [84]:
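# Without sorting by score, in case it's a story or something like that.
# Note: np.argpartition returns the top 5 in arbitrary order, not document order; loop over np.sort(ind) to keep the original order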
for i in ind:
    print(wrap_text("%.2f: %s" % (scores[i], sents[i])))

0.06: Clothing retailers and non-specialist stores were the worst hit
with only internet retailers showing any significant growth, according
to the ONS.
0.07: Retail sales dropped by 1% on the month in December, after a
0.6% rise in November, the Office for National Statistics (ONS) said.
0.07: The ONS echoed an earlier caution from Bank of England governor
Mervyn King not to read too much into the poor December figures.
0.07: "The retail sales figures are very weak, but as Bank of England
governor Mervyn King indicated last night, you don't really get an
accurate impression of Christmas trading until about Easter," said Mr
Shaw.
0.07: A number of retailers have already reported poor figures for
December.
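Note that the indices returned by `np.argpartition` are not guaranteed to come out in document order (above they are 4, 1, 6, 15, 3). A small sketch of the three orderings on made-up scores (`scores`, `k`, `top_k`, `by_score` and `in_doc_order` are illustrative, not the notebook's values):

```python
import numpy as np

# Hypothetical sentence scores, one per sentence in document order.
scores = np.array([0.10, 0.20, 0.05, 0.30, 0.25, 0.10])
k = 3

# The k best indices, in no particular order.
top_k = np.argpartition(scores, -k)[-k:]

# Best first: sort the selected indices by descending score.
by_score = top_k[np.argsort(-scores[top_k])]

# Reading order: sort the selected indices themselves.
in_doc_order = np.sort(top_k)

print(by_score)      # [3 4 1]
print(in_doc_order)  # [1 3 4]
```

Looping over `in_doc_order` keeps the summary in the order the sentences appear in the article.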


## Next step: put this in a function and compare it with the TF-IDF method to see which one is better. I have not run that comparison; my expectation is that TextRank performs better.
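A minimal sketch of what such a function could look like, using NumPy only. This is not a tested replacement for the cells above: `textrank_scores` and `summarize` are hypothetical names, the function takes a dense sentence-by-feature array (with the notebook's objects that would be `x.toarray()`), and the zero-row fallback and `max_iters` cap are my additions.

```python
import numpy as np

def textrank_scores(features, factor=0.15, tol=1e-8, max_iters=1000):
    """Score sentences with the steps used above: similarity, normalize, smooth, iterate."""
    x = np.asarray(features, dtype=float)
    n = x.shape[0]

    # Cosine similarity between rows (all-zero rows get similarity 0).
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.where(norms == 0, 1.0, norms)
    sim = unit @ unit.T

    # Row-normalize; a row that sums to zero becomes uniform (assumption).
    row_sums = sim.sum(axis=1, keepdims=True)
    safe = np.where(row_sums == 0, 1.0, row_sums)
    trans = np.where(row_sums == 0, 1.0 / n, sim / safe)

    # Smooth with the uniform transition matrix.
    trans = (1 - factor) * trans + factor / n

    # Power iteration from the uniform distribution.
    dist = np.ones(n) / n
    for _ in range(max_iters):
        new_dist = dist @ trans
        delta = np.abs(new_dist - dist).sum()
        dist = new_dist
        if delta <= tol:
            break
    return dist

def summarize(sentences, scores, k=5, keep_order=True):
    """Return the k best sentences, in document order unless keep_order is False."""
    k = min(k, len(sentences))
    top = np.argsort(-np.asarray(scores))[:k]
    if keep_order:
        top = np.sort(top)
    return [sentences[i] for i in top]

# Toy features: sentences 0 and 1 are identical, sentence 2 has no features.
toy_scores = textrank_scores([[1.0, 0.0], [1.0, 0.0], [0.0, 0.0]])
print(toy_scores)
```

With the notebook's variables, a call along the lines of `summarize(sents, textrank_scores(x.toarray()))` would be the starting point for a comparison with the TF-IDF scoring.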