## Text Summarization

### There are two types of text summarization:
1. **Extractive summaries:** These summaries remove parts of the text that are less important, keeping the most significant parts intact (easier to implement).
2. **Abstractive summaries:** These summaries create new sentences using words that may not be present in the original text.

### We will use the extractive approach, and to achieve this goal, we will follow these steps:

1. Break the document into sentences (using `nltk.sent_tokenize()`).
2. Treat each sentence as a document.
3. Compute the Tf-Idf matrix.
4. Compute the score for each sentence (**).
5. Sort the sentences in descending order based on their scores.

### (**) How to Compute the Sentence Score

We will use a scoring method similar to Google's PageRank, which is based on a random walk over a Markov chain. As the number of steps grows, the state distribution converges to a limiting distribution, provided every state can be reached. Each step of the walk is governed by a Markov (transition) matrix. For web pages, Google uses hyperlinks to move to the next page. For our purpose, we will compute the cosine similarity between every pair of sentences, and these values, once each row is normalized to sum to 1, will be used as transition probabilities.

Here's the key idea:

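The state distribution at time $t+1$ follows from the distribution at time $t$:

$$p(s_{t+1}) = p(s_t) \, p(s_{t+1} \mid s_t) = p(s_t) \, A_{(t+1,t)}$$

where $A$ is the Markov matrix and $p(s_t)$ is a row vector.

As $t$ approaches infinity ($t \to \infty$), the distribution stops changing:

$$p(s_{\infty}) = p(s_{\infty}) \, A_{(t+1,t)}$$

This has the form of an eigenvector equation ($\lambda v = A v$, with $\lambda$ being the eigenvalue and $v$ being the eigenvector): $p(s_{\infty})$ is an eigenvector with eigenvalue $\lambda = 1$. Because $p(s_{\infty})$ multiplies $A$ from the left, it is an eigenvector of $A^T$.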

So, we need to compute the Markov Matrix and find the eigenvector associated with an eigenvalue of 1. This eigenvector will represent the limiting state distribution, and its values will be the scores for each sentence.
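As a quick sanity check of this idea, here is a minimal sketch on a hypothetical 3-state matrix (the matrix `A` and both helper names are illustrative assumptions, not part of the notebook). It compares repeatedly applying the update $p \leftarrow p A$ with taking the eigenvector of $A^T$ whose eigenvalue is closest to 1.

```python
import numpy as np

# Hypothetical 3-state row-stochastic matrix (each row sums to 1).
A = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
])

def stationary_by_power_iteration(A, n_steps=200):
    """Repeatedly apply p <- p A, starting from a uniform distribution."""
    p = np.full(A.shape[0], 1.0 / A.shape[0])
    for _ in range(n_steps):
        p = p @ A
    return p

def stationary_by_eig(A):
    """Eigenvector of A.T for the eigenvalue closest to 1, scaled to sum to 1."""
    eigenvalues, eigenvectors = np.linalg.eig(A.T)
    idx = np.abs(eigenvalues - 1).argmin()
    v = np.real(eigenvectors[:, idx])  # eig may return a complex dtype
    return v / v.sum()                 # dividing by the sum also fixes the sign

p_iter = stationary_by_power_iteration(A)
p_eig = stationary_by_eig(A)
print(p_iter)
print(p_eig)
```

Both approaches should agree up to floating-point error; the eigenvector route is the one used in the rest of this notebook.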

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import random
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet


import subprocess

# Download and unzip wordnet
try:
    nltk.data.find('wordnet.zip')
except LookupError:  # raised by nltk.data.find when the resource is missing
    nltk.download('wordnet', download_dir='/kaggle/working/')
    command = "unzip /kaggle/working/corpora/wordnet.zip -d /kaggle/working/corpora"
    subprocess.run(command.split())
    nltk.data.path.append('/kaggle/working/')

[nltk_data] Downloading package wordnet to /kaggle/working/...
Archive:  /kaggle/working/corpora/wordnet.zip
   creating: /kaggle/working/corpora/wordnet/
  inflating: /kaggle/working/corpora/wordnet/lexnames  
  inflating: /kaggle/working/corpora/wordnet/data.verb  
  inflating: /kaggle/working/corpora/wordnet/index.adv  
  inflating: /kaggle/working/corpora/wordnet/adv.exc  
  inflating: /kaggle/working/corpora/wordnet/index.verb  
  inflating: /kaggle/working/corpora/wordnet/cntlist.rev  
  inflating: /kaggle/working/corpora/wordnet/data.adj  
  inflating: /kaggle/working/corpora/wordnet/index.adj  
  inflating: /kaggle/working/corpora/wordnet/LICENSE  
  inflating: /kaggle/working/corpora/wordnet/citation.bib  
  inflating: /kaggle/working/corpora/wordnet/noun.exc  
  inflating: /kaggle/working/corpora/wordnet/verb.exc  
  inflating: /kaggle/working/corpora/wordnet/README  
  inflating: /kaggle/working/corpora/wordnet/index.sense  
  inflating: /kaggle/working/corpora/wordnet/data.

In [2]:
data_dir = '../input/bbc-dataset/'  # avoid shadowing the built-in dir()

df = pd.read_csv(data_dir+'bbc_text_cls.csv')

In [3]:
df.head()

Unnamed: 0,text,labels
0,Ad sales boost Time Warner profit\n\nQuarterly...,business
1,Dollar gains on Greenspan speech\n\nThe dollar...,business
2,Yukos unit buyer faces loan claim\n\nThe owner...,business
3,High fuel prices hit BA's profits\n\nBritish A...,business
4,Pernod takeover talk lifts Domecq\n\nShares in...,business


In [4]:
corpus = df[df.labels=='business']['text'].sample(random_state = 42)

In [5]:
corpus

480    Christmas sales worst since 1981\n\nUK retail ...
Name: text, dtype: object

In [6]:
#Removing the title

corpus = corpus.iloc[0].split("\n",1)[1]

In [7]:
corpus

'\nUK retail sales fell in December, failing to meet expectations and making it by some counts the worst Christmas since 1981.\n\nRetail sales dropped by 1% on the month in December, after a 0.6% rise in November, the Office for National Statistics (ONS) said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A number of retailers have already reported poor figures for December. Clothing retailers and non-specialist stores were the worst hit with only internet retailers showing any significant growth, according to the ONS.\n\nThe last time retailers endured a tougher Christmas was 23 years previously, when sales plunged 1.7%.\n\nThe ONS echoed an earlier caution from Bank of England governor Mervyn King not to read too much into the poor December figures. Some analysts put a positive gloss on the figures, pointing out that the non-seasonally-adjusted figures showed a performance comparable with 2003. The November-December jump last year wa

In [8]:
docs = nltk.sent_tokenize(corpus)

In [9]:
## Lemmatizing the input
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

# Lemmatization function
def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    words = nltk.word_tokenize(text)
    pos_tags = nltk.pos_tag(words)
    lemmatized_words = [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in pos_tags]
    return ' '.join(lemmatized_words)

docs_lemma = [lemmatize_text(doc) for doc in docs]
len(docs_lemma)

# Note: this overrides the lemmatized sentences, so the cells below work on the original sentences
docs_lemma = docs

In [10]:
#Compute Tf-Idf Matrix
tf_docs = TfidfVectorizer(decode_error='ignore', stop_words = 'english', norm = 'l1').fit_transform(docs_lemma)
N, V = tf_docs.shape

## Computing the scores

In [11]:
def cosine_similarity(row1, row2):
    
    dot_product = row1.dot(row2.T).toarray()[0, 0]
    norm1 = np.sqrt(row1.multiply(row1).sum())
    norm2 = np.sqrt(row2.multiply(row2).sum())
    
    cos_similarity = dot_product/(norm1*norm2)
    return cos_similarity

In [12]:
#Computing the Markov-Matrix - N x N -> N = number of sentences
markov_matrix = np.zeros((N,N))
for i in range(N):
    
    for j in range(N):
        
        markov_matrix[i,j] = cosine_similarity(tf_docs[i,:], tf_docs[j,:])
    
    #Normalize the row:
    markov_matrix[i,:] = markov_matrix[i,:]/(markov_matrix[i,:].sum())
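The double loop above can also be sketched in vectorized form. This is only an illustration under stated assumptions: it takes a dense `(N, V)` array (for example `tf_docs.toarray()`), the helper name `markov_from_features` and the toy matrix `X_toy` are hypothetical, and it assumes no row is all zeros (a sentence made only of stop words would otherwise cause a division by zero).

```python
import numpy as np

def markov_from_features(X):
    """Row-normalized cosine-similarity matrix for a dense (N, V) array X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / norms                 # assumes no all-zero rows
    sim = unit @ unit.T              # sim[i, j] = cosine similarity of rows i and j
    return sim / sim.sum(axis=1, keepdims=True)  # each row now sums to 1

# Hypothetical toy feature matrix: 3 "sentences", 3 "terms".
X_toy = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 2.0, 1.0],
    [1.0, 1.0, 0.0],
])
M_toy = markov_from_features(X_toy)
print(M_toy)
print(M_toy.sum(axis=1))
```

For a handful of sentences the loop is fast enough; the matrix form mainly helps when a document has many sentences.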

In [13]:
markov_matrix[:,1]

array([0.07039704, 0.49092893, 0.09514263, 0.03284654, 0.02693908,
       0.02951514, 0.04657406, 0.        , 0.07303445, 0.02222435,
       0.        , 0.0397964 , 0.02708133, 0.02693285, 0.05023479,
       0.05000916, 0.        ])

In [14]:
# Smoothing the matrix
lambda_smooth = 0.15
U = np.ones((N,N))*1/N

markov_matrix = markov_matrix*(1-lambda_smooth) + lambda_smooth*U
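To see what the smoothing step does, here is a small sketch on a hypothetical matrix that contains zeros (the helper `smooth` and the matrix `M` are illustrative assumptions). Mixing with the uniform matrix keeps every row summing to 1 and makes every entry strictly positive, which is what lets the walk reach any sentence from any other.

```python
import numpy as np

def smooth(markov_matrix, lambda_smooth=0.15):
    """Mix a row-stochastic matrix with the uniform matrix of the same size."""
    n = markov_matrix.shape[0]
    uniform = np.full((n, n), 1.0 / n)
    return (1 - lambda_smooth) * markov_matrix + lambda_smooth * uniform

# Hypothetical row-stochastic matrix with zero entries.
M = np.array([
    [0.50, 0.50, 0.00],
    [0.00, 1.00, 0.00],
    [0.25, 0.25, 0.50],
])
S = smooth(M)
print(S.sum(axis=1))   # row sums after smoothing
print((S > 0).all())   # whether every entry is strictly positive
```

With `lambda_smooth = 0.15` and 3 states, every former zero becomes `0.15 / 3`.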

In [15]:
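# Check that the matrix is still normalized (the printed value should be ~0).
# Note: this sums (column sum - 1) over all columns, so it only verifies that the
# entries add up to N; np.allclose(markov_matrix.sum(axis=1), 1) would check each row.
print((markov_matrix.sum(axis = 0) - 1).sum())

# Check for non-positive values (the printed value is 0 when no entry is negative)
print((markov_matrix[markov_matrix<=0].sum()))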

-2.1094237467877974e-15
0.0


In [16]:
#Computing eigenvector of markov_matrix

# In linear algebra: A * x = lambda * x, with x a column vector. Our distribution is a row vector (row * A = lambda * row), so we transpose A to get the standard form.
eigenvalues, eigenvectors = np.linalg.eig(markov_matrix.T)

In [17]:
print(eigenvalues)

[1.         0.21864025 0.71410912 0.31883039 0.33623722 0.66557628
 0.37708749 0.40113565 0.40980739 0.42617867 0.63084392 0.61855497
 0.59979725 0.55533    0.48662142 0.5061532  0.52044964]


In [18]:
# Find the index where eigenvalue = 1
def find_eigenvectors(eigvalues, value):
    distance = np.abs(eigvalues-value)
    idx_min = distance.argmin()
    
    return idx_min

In [19]:
idx = find_eigenvectors(eigenvalues, 1)
p_inf = eigenvectors[:,idx]  #Limit distribution

sentence_score = p_inf/p_inf.sum() #Score

In [20]:
sentence_score

array([0.06008621, 0.06624912, 0.05383286, 0.07433314, 0.06127197,
       0.05818043, 0.07018035, 0.05266862, 0.05273009, 0.0553683 ,
       0.05009977, 0.05087408, 0.05821364, 0.05805022, 0.05754962,
       0.06936807, 0.0509435 ])

In [21]:
sentence_score.sum()

1.0

In [22]:
n = 5
idx_sort = np.argsort(-sentence_score)

top_n_sentences = [docs_lemma[idx] for idx in idx_sort[:n]]
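The selection step can be wrapped in a small helper. This is a sketch under assumptions: the function name `top_sentences` and the toy data are hypothetical, and the `min` guard (for documents with fewer than `n` sentences) is an addition not present in the cell above.

```python
import numpy as np

def top_sentences(sentences, scores, n=5):
    """Return (score, sentence) pairs for the n highest-scoring sentences."""
    scores = np.asarray(scores, dtype=float)
    n = min(n, len(sentences))           # guard against short documents
    order = np.argsort(-scores)[:n]      # indices sorted by descending score
    return [(scores[i], sentences[i]) for i in order]

# Hypothetical toy data.
toy_sentences = ["first", "second", "third"]
toy_scores = [0.2, 0.5, 0.3]
result = top_sentences(toy_sentences, toy_scores, n=5)
for score, sentence in result:
    print(f"[{score * 100:.2f}%] {sentence}")
```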

In [23]:
print("Text Summary:")
for i in range(n):
    print(f"\n[{sentence_score[idx_sort[i]]*100:.2f}%] {top_n_sentences[i]}")

print("\n\n=================================\n\n")
print("\nFull Text\n")
print(corpus)

Text Summary:

[7.43%] A number of retailers have already reported poor figures for December.

[7.02%] The ONS echoed an earlier caution from Bank of England governor Mervyn King not to read too much into the poor December figures.

[6.94%] "The retail sales figures are very weak, but as Bank of England governor Mervyn King indicated last night, you don't really get an accurate impression of Christmas trading until about Easter," said Mr Shaw.

[6.62%] Retail sales dropped by 1% on the month in December, after a 0.6% rise in November, the Office for National Statistics (ONS) said.

[6.13%] Clothing retailers and non-specialist stores were the worst hit with only internet retailers showing any significant growth, according to the ONS.





Full Text


UK retail sales fell in December, failing to meet expectations and making it by some counts the worst Christmas since 1981.

Retail sales dropped by 1% on the month in December, after a 0.6% rise in November, the Office for National Statis

## Let's create a function

In [24]:
def summarize(document):
    
    #Remove the title
    corpus = document.iloc[0].split("\n",1)[1]
    
    #Divide the text into sentences
    docs = nltk.sent_tokenize(corpus)
    
    #Lemmatize the docs
    docs_lemma = [lemmatize_text(doc) for doc in docs]

    
    #Compute Tf-Idf Matrix
    tf_docs = TfidfVectorizer(decode_error='ignore', stop_words = 'english', norm = 'l1').fit_transform(docs_lemma)
    N, V = tf_docs.shape
    
    #Computing the Markov-Matrix - N x N -> N = number of sentences
    markov_matrix = np.zeros((N,N))
    for i in range(N):

        for j in range(N):

            markov_matrix[i,j] = cosine_similarity(tf_docs[i,:], tf_docs[j,:])

        #Normalize the row:
        markov_matrix[i,:] = markov_matrix[i,:]/(markov_matrix[i,:].sum())
        
        
    # Smoothing the matrix (note: 0.1 here, whereas the step-by-step cells above used 0.15)
    lambda_smooth = 0.1
    U = np.ones((N,N))*1/N

    markov_matrix = markov_matrix*(1-lambda_smooth) + lambda_smooth*U
    
    
    # Check that every row still sums to 1
    rows_normalized = np.allclose(markov_matrix.sum(axis = 1), 1)

    # Check that all values are > 0
    all_positive = (markov_matrix > 0).all()
    
    if not (rows_normalized and all_positive):
        print("Conditions of the Matrix are NOT met")
        
    #Computing eigenvector of markov_matrix
    eigenvalues, eigenvectors = np.linalg.eig(markov_matrix.T)
    
    
    idx = find_eigenvectors(eigenvalues, 1)
    p_inf = eigenvectors[:,idx]  #Limit distribution
    sentence_score = p_inf/p_inf.sum() #Score
    
    
    n = 5
    idx_sort = np.argsort(-sentence_score)
    # Note: docs_lemma holds the lemmatized sentences, so the summary is printed in lemmatized form
    top_n_sentences = [docs_lemma[idx] for idx in idx_sort[:n]]
    
    txt_title = document.iloc[0].split('\n',1)[0]
    print(f"Text Title: {txt_title}")
    
    print("\nText Summary:")
    for i in range(n):
        print(f"\n[{sentence_score[idx_sort[i]]*100:.2f}%] {top_n_sentences[i]}")

In [25]:
doc = df[df.labels == 'entertainment']['text'].sample(random_state=123)
summarize(doc)

Text Title: Goodrem wins top female MTV prize

Text Summary:

[11.59%] Goodrem , Green Day and the Black Eyed Peas take home two award each .

[10.57%] Other winner include Green Day , vote best group , and the Black Eyed Peas .

[10.29%] As well a best female , Goodrem also take home the Pepsi Viewers Choice Award , whilst Green Day bag the prize for best rock video for American Idiot .

[10.01%] The Black Eyed Peas win award for best R 'n ' B video and sexy video , both for Hey Mama .

[9.71%] Local singer and songwriter Missy Higgins take the title of breakthrough artist of the year , with Australian Idol winner Guy Sebastian take the honour for best pop video .


## TextRank - Using Python libraries

In [26]:
!pip install sumy

Collecting sumy
  Downloading sumy-0.11.0-py2.py3-none-any.whl.metadata (7.5 kB)
Collecting breadability>=0.1.20 (from sumy)
  Downloading breadability-0.1.20.tar.gz (32 kB)
  Preparing metadata (setup.py) ... done
Collecting pycountry>=18.2.23 (from sumy)
  Downloading pycountry-24.6.1-py3-none-any.whl.metadata (12 kB)
Collecting chardet (from breadability>=0.1.20->sumy)
  Downloading chardet-5.2.0-py3-none-any.whl.metadata (3.4 kB)
Downloading sumy-0.11.0-py2.py3-none-any.whl (97 kB)
Downloading pycountry-24.6.1-py3-none-any.whl (6.3 MB)
Downloading chardet-5.2.0-py3-none-any.whl (199 kB)
Bui

In [27]:
#sumy libraries do the job!
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.summarizers.lsa import LsaSummarizer  #Latent Semantic Analysis
from sumy.parsers.plaintext import PlaintextParser #Parse the text -> Create the 'vector' from the text that will be used
from sumy.nlp.tokenizers import Tokenizer

In [28]:
summarizer = TextRankSummarizer()

#We need to pass the text, and the tokenizer to create the parser object
parser = PlaintextParser.from_string(
    doc.iloc[0].split("\n", 1)[1],
    Tokenizer("english"))

#sentences_count = number of outputs
summary = summarizer(parser.document, sentences_count=5)

In [29]:
summary

(<Sentence: The 21-year-old singer won the award for best female artist, with Australian Idol runner-up Shannon Noll taking the title of best male at the ceremony.>,
 <Sentence: As well as best female, Goodrem also took home the Pepsi Viewers Choice Award, whilst Green Day bagged the prize for best rock video for American Idiot.>,
 <Sentence: The Black Eyed Peas won awards for best R 'n' B video and sexiest video, both for Hey Mama.>,
 <Sentence: Local singer and songwriter Missy Higgins took the title of breakthrough artist of the year, with Australian Idol winner Guy Sebastian taking the honours for best pop video.>,
 <Sentence: The ceremony was held at the Luna Park fairground in Sydney Harbour and was hosted by the Osbourne family.>)

In [30]:
for s in summary:
    print(s)

The 21-year-old singer won the award for best female artist, with Australian Idol runner-up Shannon Noll taking the title of best male at the ceremony.
As well as best female, Goodrem also took home the Pepsi Viewers Choice Award, whilst Green Day bagged the prize for best rock video for American Idiot.
The Black Eyed Peas won awards for best R 'n' B video and sexiest video, both for Hey Mama.
Local singer and songwriter Missy Higgins took the title of breakthrough artist of the year, with Australian Idol winner Guy Sebastian taking the honours for best pop video.
The ceremony was held at the Luna Park fairground in Sydney Harbour and was hosted by the Osbourne family.


In [31]:
#Using LSA
summarizer = LsaSummarizer()
summary = summarizer(parser.document, sentences_count=5)
for s in summary:
    print(s)

Goodrem, known in both Britain and Australia for her role as Nina Tucker in TV soap Neighbours, also performed a duet with boyfriend Brian McFadden.
Other winners included Green Day, voted best group, and the Black Eyed Peas.
Goodrem, Green Day and the Black Eyed Peas took home two awards each.
As well as best female, Goodrem also took home the Pepsi Viewers Choice Award, whilst Green Day bagged the prize for best rock video for American Idiot.
Artists including Carmen Electra, Missy Higgins, Kelly Osbourne, Green Day, Ja Rule and Natalie Imbruglia gave live performances at the event.
