<a href="https://colab.research.google.com/github/Nuri-Tas/NLP/blob/main/Text%20Classification/Text_Summarization_with_TextRank.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, we will build two extractive text summarization tools. Before getting into TextRank, we will implement a straightforward baseline that ranks sentences by their tf-idf scores and returns the highest-scoring ones. We will then implement the TextRank algorithm and compare its summaries with those of the simple algorithm.

In [4]:
import numpy as np
import pandas as pd
import nltk

# Simple Algorithm

## Load Data

We will work with the BBC News dataset. The download command below may fail with an HTTP 403 error. If it does, you can still download the dataset manually from the Kaggle link given in the comment.

In [11]:
# https://www.kaggle.com/shivamkushwaha/bbc-full-text-document-classification
!wget -nc https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv

File ‘bbc_text_cls.csv’ already there; not retrieving.



Each row corresponds to a different news article. We will summarize three random news articles.

In [12]:
df = pd.read_csv("/content/bbc_text_cls.csv")
df.head()

Unnamed: 0,text,labels
0,Ad sales boost Time Warner profit\n\nQuarterly...,business
1,Dollar gains on Greenspan speech\n\nThe dollar...,business
2,Yukos unit buyer faces loan claim\n\nThe owner...,business
3,High fuel prices hit BA's profits\n\nBritish A...,business
4,Pernod takeover talk lifts Domecq\n\nShares in...,business


In [13]:
# we need to download punkt from nltk to work with nltk.sent_tokenize
# (newer NLTK releases may additionally require nltk.download("punkt_tab"))
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Build the Algorithm

The `Summarize` class contains both the simple algorithm and the TextRank algorithm. The two methods differ only in how they score sentences, so setting `score_method` to `"simple"` or `"textrank"` is enough to switch between them.

In [126]:
class Summarize:
  def __init__(self, article=None):
    if article is not None:
      self.article = article
      self.documents = nltk.sent_tokenize(self.article)

  # alternative constructor: build a Summarize instance from a random article in df
  @classmethod
  def get_random_article(cls):
    article_no = np.random.choice(len(df))
    article = df.loc[article_no, "text"]
    return cls(article)

  # alternative constructor: build a Summarize instance for the given article
  @classmethod
  def get_article(cls, article):
    return cls(article)

  def get_vocabulary(self):
      vocabulary = []
      for doc in self.documents:
        doc = doc.lower()
        for word in doc.split():
          if word not in vocabulary:
            vocabulary.append(word)
      return vocabulary

  def get_tfidf(self):
      vocabulary = self.get_vocabulary()
      # tokenize each sentence once so that counts match whole words, not substrings
      tokenized = [doc.lower().split() for doc in self.documents]
      # build tf and idf vectors (idf is the raw ratio N / df, without a logarithm)
      term_frequency = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]
      inverse_document_frequency = [len(tokenized) / sum(word in tokens for tokens in tokenized) for word in vocabulary]

      # multiply tf by idf element-wise to get tf-idf
      tf_idf = [[tf*idf for tf, idf in zip(doc, inverse_document_frequency)] for doc in term_frequency]
      return tf_idf
      
  def cosine_similarity(self, article, alpha=0.001):
      self = self.get_article(article)
      tfidf = self.get_tfidf()
      tfidf = [[item / sum(row) for item in row] for row in tfidf]
      cos_sim_vector = np.empty((len(self.documents), len(self.documents)))
      for i in range(len(tfidf)):
        for j in range(len(tfidf)):
            item1, item2 = tfidf[i], tfidf[j]
            dot_product = np.sum([i1 * j1 for i1, j1 in zip(item1, item2)])
            norm1 = np.sqrt(np.sum([item**2 for item in item1]))
            norm2 = np.sqrt(np.sum([item**2 for item in item2]))
            cos_similarity = dot_product / (norm1 * norm2)
            if cos_similarity != 0:
              cos_sim_vector[i, j] = cos_similarity
            else:
              cos_sim_vector[i, j] = alpha / len(tfidf)
      return cos_sim_vector

  # ranks the sentences based on the mean tf-idf scores per sentence - terms with zero tf-idf scores are ignored when calculating the mean
  def simple_score(self, article):
      self = self.get_article(article)
      tf_idf = self.get_tfidf()
      sentence_scores = [(self.documents[idx], np.mean([item for item in row if item != 0])) for idx, row in enumerate(tf_idf)]
      return sentence_scores
  
  # ranks the sentences based on the values they have in the dominant eigenvector
  def textrank_score(self, article):
      cos_similarity = self.cosine_similarity(article)
      eigen_values, eigen_vectors = np.linalg.eigh(cos_similarity)
      # np.linalg.eigh returns the eigenvalues in ascending order and the eigenvectors as columns,
      # so the last column is the dominant eigenvector; its sign is arbitrary, hence the absolute value
      dominant = np.abs(eigen_vectors[:, -1])
      sentence_scores = [(self.documents[idx], item) for idx, item in enumerate(dominant)]
      return sentence_scores

  # print the top three sentences based on the score method, i.e. either "simple" or "textrank". The default is "simple".
  def get_summary(self, article=None, score_method="simple"):
      if article is None:
        self = self.get_random_article()
        # score the randomly drawn article rather than None
        article = self.article
      else:
        self = self.get_article(article)

      if score_method == "simple":
        sentence_scores = self.simple_score(article)
      elif score_method == "textrank":
        sentence_scores = self.textrank_score(article)
      else:
        raise ValueError('score_method must be "simple" or "textrank"')

      # sort all sentences by score, highest first, and keep the top three
      ranked = sorted(sentence_scores, key=lambda x: x[1], reverse=True)
      summarized = " ".join([item[0] for item in ranked[:3]])

      print("The main article is:", self.article, sep="\n")
      print("---------------")
      print("The summarized version is:", summarized, sep="\n")
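
To make the simple scoring rule concrete, here is a minimal standalone sketch of the same idea on three toy sentences. The helper name `mean_tfidf_scores` and the toy sentences are hypothetical and not part of the notebook. The sketch assumes whitespace tokenization with whole-word counts and the raw ratio N / df as idf (no logarithm), and it averages only the non-zero tf-idf values of each sentence.

```python
import numpy as np

def mean_tfidf_scores(sentences):
    """Score each sentence by the mean of its non-zero tf-idf values."""
    tokenized = [s.lower().split() for s in sentences]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_sentences = len(sentences)

    # tf: raw count of each vocabulary word in each sentence
    tf = np.array(
        [[tokens.count(word) for word in vocabulary] for tokens in tokenized],
        dtype=float,
    )
    # idf: number of sentences divided by the number of sentences containing the word
    document_frequency = (tf > 0).sum(axis=0)
    idf = n_sentences / document_frequency
    tfidf = tf * idf

    scores = []
    for row in tfidf:
        nonzero = row[row > 0]
        scores.append(float(nonzero.mean()) if nonzero.size else 0.0)
    return scores

# hypothetical toy input
toy = ["the cat sat", "the dog sat", "the bird flew away"]
scores = mean_tfidf_scores(toy)
print(scores)
```

The third sentence scores highest here because three of its four words occur in no other sentence, so their idf is at its maximum. This also shows a weakness of the baseline: sentences full of rare words win regardless of how central they are to the text.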

## Summarization Examples with the Simple Algorithm

In [128]:
summarizer = Summarize()

In [129]:
article_no = np.random.choice(len(df))
sample1 = df.loc[article_no, "text"]
summarizer.get_summary(sample1)

The main article is:
Why Cell will get the hard sell

The world is casting its gaze on the Cell processor for the first time, but what is so important about it, and why is it so different?

The backers of the processor are big names in the computer industry. IBM is one of the largest and most respected chip-makers in the world, providing cutting edge technology to large businesses. Sony will be using the chip inside its PlayStation 3 console, and its dominance of the games market means that it now has a lot of power to dictate the future of computer and gaming platforms. The technology inside the Cell is being heralded as revolutionary, from a technical standpoint. Traditional computers - whether they are household PCs or PlayStation 2s - use a single processor to carry out the calculations that run the computer. The Cell technology, on the other hand, uses multiple Cell processors linked together to run lots of calculations simultaneously.

This gives it processing power an order of m

In [130]:
article_no = np.random.choice(len(df))
sample2 = df.loc[article_no, "text"]
summarizer.get_summary(sample2)

The main article is:
Sony PSP console hits US in March

US gamers will be able to buy Sony's PlayStation Portable from 24 March, but there is no news of a Europe debut.

The handheld console will go on sale for $250 (£132) and the first million sold will come with Spider-Man 2 on UMD, the disc format for the machine. Sony has billed the machine as the Walkman of the 21st Century and has sold more than 800,000 units in Japan. The console (12cm by 7.4cm) will play games, movies and music and also offers support for wireless gaming. Sony is entering a market which has been dominated by Nintendo for many years.

It launched its DS handheld in Japan and the US last year and has sold 2.8 million units. Sony has said it wanted to launch the PSP in Europe at roughly the same time as the US, but gamers will now fear that the launch has been put back. Nintendo has said it will release the DS in Europe from 11 March. "It has gaming at its core, but it's not a gaming device. It's an entertainment 

In [131]:
article_no = np.random.choice(len(df))
sample3 = df.loc[article_no, "text"]
summarizer.get_summary(sample3)

The main article is:
Moyes U-turn on Beattie dismissal

Everton manager David Moyes will discipline striker James Beattie after all for his headbutt on Chelsea defender William Gallas.

The Scot initially defended Beattie, whose dismissal put Everton on the back foot in a game they ultimately lost 1-0, saying Gallas overreacted. But he has had a rethink after looking over the video evidence again. He said: "I believe that I should set the record straight by conceding that the dismissal was right and correct." Moyes added: "My comments on Saturday came immediately after the final whistle and at a point when I had only had the opportunity to see one, very quick re-run of the incident."

The club website also reported that Beattie, who seemed unrepentant after Saturday's match, insisting Gallas "would have stayed down a lot longer" if he had headbutted him, has now apologised. Moyes continued: "Although the incident was totally out of character - James has never even been suspended before

# TextRank

TextRank differs from the simple algorithm in the way it scores each sentence. It takes inspiration from the PageRank algorithm, which starts at a random page and walks randomly between pages by following the links between them. Intuitively, a page that receives many incoming links will be scored higher than a page with fewer links pointing to it. TextRank carries the concept of a "link" over to text by using the cosine similarity between sentences. Additionally, we rely on the Perron-Frobenius theorem to obtain the limiting distribution from the cosine similarity matrix, that is, the distribution the random walk converges to as the number of steps goes to infinity.
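
The random-walk picture can be sketched directly with power iteration. This is an illustrative sketch under stated assumptions, not the method used by the `Summarize` class: the function names, the toy vectors, and the damping factor of 0.85 (the conventional PageRank value) are assumptions introduced here. The sketch builds a cosine similarity matrix, row-normalizes it into transition probabilities, and repeatedly applies the walk until the distribution stops changing.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = vectors / norms
    return unit @ unit.T

def stationary_distribution(similarity, damping=0.85, tol=1e-10, max_iter=1000):
    """Limiting distribution of a damped random walk over the sentences."""
    n = similarity.shape[0]
    # row-normalize: each row becomes a probability distribution over the next sentence
    row_sums = similarity.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    transition = np.where(row_sums > 0, similarity / safe_sums, 1.0 / n)
    # damping (an assumption here) makes every entry strictly positive
    transition = damping * transition + (1.0 - damping) / n

    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        updated = rank @ transition
        converged = np.abs(updated - rank).sum() < tol
        rank = updated
        if converged:
            break
    return rank

# hypothetical non-negative sentence vectors (e.g. tf-idf rows)
toy_vectors = [[1, 0, 0], [1, 1, 0], [0, 1, 1]]
similarity = cosine_similarity_matrix(toy_vectors)
rank = stationary_distribution(similarity)
print(rank)
```

In this toy example the second sentence shares a term with each of the other two, so the walk visits it most often and it receives the highest score. Because every entry of the damped transition matrix is strictly positive, the Perron-Frobenius theorem guarantees that the limiting distribution exists and is unique.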

We will now summarize the same three articles with TextRank, so that the results can be compared with those of the simple algorithm:

In [132]:
summarizer.get_summary(sample1, score_method="textrank")

The main article is:
Why Cell will get the hard sell

The world is casting its gaze on the Cell processor for the first time, but what is so important about it, and why is it so different?

The backers of the processor are big names in the computer industry. IBM is one of the largest and most respected chip-makers in the world, providing cutting edge technology to large businesses. Sony will be using the chip inside its PlayStation 3 console, and its dominance of the games market means that it now has a lot of power to dictate the future of computer and gaming platforms. The technology inside the Cell is being heralded as revolutionary, from a technical standpoint. Traditional computers - whether they are household PCs or PlayStation 2s - use a single processor to carry out the calculations that run the computer. The Cell technology, on the other hand, uses multiple Cell processors linked together to run lots of calculations simultaneously.

This gives it processing power an order of m

In [134]:
summarizer.get_summary(sample2, score_method="textrank")

The main article is:
Sony PSP console hits US in March

US gamers will be able to buy Sony's PlayStation Portable from 24 March, but there is no news of a Europe debut.

The handheld console will go on sale for $250 (£132) and the first million sold will come with Spider-Man 2 on UMD, the disc format for the machine. Sony has billed the machine as the Walkman of the 21st Century and has sold more than 800,000 units in Japan. The console (12cm by 7.4cm) will play games, movies and music and also offers support for wireless gaming. Sony is entering a market which has been dominated by Nintendo for many years.

It launched its DS handheld in Japan and the US last year and has sold 2.8 million units. Sony has said it wanted to launch the PSP in Europe at roughly the same time as the US, but gamers will now fear that the launch has been put back. Nintendo has said it will release the DS in Europe from 11 March. "It has gaming at its core, but it's not a gaming device. It's an entertainment 

In [136]:
summarizer.get_summary(sample3, score_method="textrank")

The main article is:
Moyes U-turn on Beattie dismissal

Everton manager David Moyes will discipline striker James Beattie after all for his headbutt on Chelsea defender William Gallas.

The Scot initially defended Beattie, whose dismissal put Everton on the back foot in a game they ultimately lost 1-0, saying Gallas overreacted. But he has had a rethink after looking over the video evidence again. He said: "I believe that I should set the record straight by conceding that the dismissal was right and correct." Moyes added: "My comments on Saturday came immediately after the final whistle and at a point when I had only had the opportunity to see one, very quick re-run of the incident."

The club website also reported that Beattie, who seemed unrepentant after Saturday's match, insisting Gallas "would have stayed down a lot longer" if he had headbutted him, has now apologised. Moyes continued: "Although the incident was totally out of character - James has never even been suspended before