#TextRank Algorithm for Text Summarization

This notebook implements TextRank, an unsupervised algorithm for extractive text summarization. It is based on PageRank, the algorithm used to rank web pages. The same technique is used here to rank the sentences of a text and fetch the top N sentences for the summary.


The implementation relies on several libraries. The cell below imports the necessary libraries and modules, the main ones being 'NLTK', 'NumPy', 'Pandas', 'NetworkX', 'scikit-learn', 'gensim', and 'Matplotlib'.

In [7]:
import pandas as pd
import numpy as np
import csv
import nltk
nltk.download("stopwords")
nltk.download('punkt')
import re
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity
from gensim.models import word2vec
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


#Importing the Dataset

>The dataset is stored in a GitHub repository and is fetched directly by URL. It can also be uploaded from a local system by importing the 'files' module from 'google.colab'.

>The dataset contains 2 columns, 'Heading' and 'Content', plus an unnamed column of row numbers. To look up the content of an article by its heading, we set the index of the dataframe to the 'Heading' column of the dataset.

In [8]:
url="https://raw.githubusercontent.com/Aditi2806/Articles-Dataset/master/Articles%20Dataset.csv"
dataframe=pd.read_csv(url,index_col="Heading",encoding="utf-8")
dataframe[:10]

Heading,Unnamed: 0,Content
9 Tips For Training Lightning-Fast Neural Networks In Pytorch,0,"Let’s face it, your model is probably still st..."
How To Become A One-Drink Wonder,1,"Anyone can publish on Medium per our Policies,..."
Treat Yourself Like a CEO and You’ll Make 10x More Income,2,"As I wrote that headline, an old joke came to ..."
Bored? 7 Fun Things You Can Build,3,There is no real secret when it comes to becom...
First AI Model of the Universe Knows Science it was Never Taught,4,A new 3D model of the Universe developed by an...
10 Bad Habits of Unsuccessful People,5,The first successful person I ever met — truly...
Amazon Accidentally Sent Out Their Email Template,6,It’s comforting to see that even the titans of...
When Women ‘Dangle the Steak' in Front of Men,7,"I truly thought that by now, there wasn’t an o..."
Why Do Men’s Legacies Matter More Than Women’s Safety?,8,Almost immediately after Washington Post repor...
Is A.I. the Antichrist?,9,It may seem that old religious principles woul...
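
>With 'Heading' as the index, the text of an article can be looked up by its title. The sketch below shows the idea on a made-up two-row table; the name `toy` and its contents are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Hypothetical miniature dataset with the same two columns as the real one.
toy = pd.DataFrame(
    {
        "Heading": ["Article A", "Article B"],
        "Content": ["Text of article A.", "Text of article B."],
    }
).set_index("Heading")

# Look up one article's content by its heading.
content_b = toy.loc["Article B", "Content"]
print(content_b)
```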


#Pre-processing of Text

>The input text cannot be fed directly to our algorithm, since it may contain words that do not add much meaning to the text and may hinder the generation of the summary. We tokenize the input text into individual sentences, tokenize each sentence further into individual words, and clean the tokenized list to remove stopwords, invalid words, and other errors.

In [0]:
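# Use a separate name so the imported 'stopwords' module is not overwritten.
stop_words=set(stopwords.words('english'))
def remove_stopwords(sentence):
    # 'sentence' is a list of word tokens; the stopword check ignores case.
    clean_text=" ".join([s for s in sentence if s.lower() not in stop_words])
    return clean_text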
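
>The cleaning steps described above can be sketched on a single sentence as follows. This is a minimal illustration, not the notebook's exact code: `SAMPLE_STOPWORDS` is a tiny hand-picked set standing in for the NLTK English stopword list, and `clean_sentence` is a hypothetical helper name.

```python
import re

# Assumption: a small hand-picked stopword set so the sketch runs without NLTK data.
SAMPLE_STOPWORDS = {"the", "is", "a", "of", "and", "to", "in", "it"}

def clean_sentence(sentence, stop_words=SAMPLE_STOPWORDS):
    """Replace non-letters with spaces, lowercase, split into words, drop stopwords."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", sentence)
    tokens = letters_only.lower().split()
    return [t for t in tokens if t not in stop_words]

sample = "The model is trained in 3 epochs, and it converges!"
cleaned = clean_sentence(sample)
print(cleaned)
```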

In [0]:
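def pagerank(A, eps=0.0001, d=0.85):
    # Power-iteration PageRank on a NumPy matrix A.
    # d is the damping factor and eps the convergence threshold (L1 norm).
    # The scores only stay a probability distribution if every row of A sums to 1.
    # Note: the main loop further down calls nx.pagerank instead of this helper.
    P = np.ones(len(A)) / len(A)
    while True:
        new_P = np.ones(len(A)) * (1 - d) / len(A) + d * A.T.dot(P)
        delta = abs(new_P - P).sum()
        if delta <= eps:
            return new_P
        P = new_P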
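
>As a rough illustration of how a power-iteration PageRank like the one above behaves, the sketch below ranks three "sentences" from a made-up similarity matrix. The helper names `row_normalize` and `power_iteration_rank`, the iteration cap, and the toy matrix are assumptions added for this example; the rows are normalised first so that the scores keep summing to 1.

```python
import numpy as np

def row_normalize(matrix):
    """Scale each row to sum to 1; all-zero rows become uniform rows."""
    matrix = np.asarray(matrix, dtype=float)
    n = len(matrix)
    row_sums = matrix.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    return np.where(row_sums > 0, matrix / safe_sums, 1.0 / n)

def power_iteration_rank(A, eps=1e-4, d=0.85, max_iter=100):
    """Same update rule as the pagerank helper, with a cap on iterations."""
    n = len(A)
    P = np.ones(n) / n
    for _ in range(max_iter):
        new_P = (1 - d) / n + d * A.T.dot(P)
        if np.abs(new_P - P).sum() <= eps:
            return new_P
        P = new_P
    return P

# Hypothetical similarities: sentences 0 and 1 are close, sentence 2 is an outlier.
toy_similarity = np.array([
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.1],
    [0.1, 0.1, 0.0],
])
toy_scores = power_iteration_rank(row_normalize(toy_similarity))
print(toy_scores)
```

>Sentences 0 and 1 reinforce each other and end up with equal, higher scores, while the outlier sentence 2 ranks last.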

#Executing the TextRank Algorithm on the Dataset

> The algorithm starts by creating a list of the distinct article headings present in the dataset. This makes it easier to loop through the dataset, create a summary for every article, and save it to our file.

>The algorithm works at the sentence level, so it first tokenizes the text into sentences and cleans them by pre-processing. The pre-processed sentences are stored in the 'clean_sentences' list. Each sentence in that list is then tokenized into words, and the resulting word lists are used to create the 'word embeddings'. This step converts the text into numerical input, which the algorithm requires.

>To create the word embeddings, the 'Word2Vec' model of the 'gensim' library is used to convert each word into a vector of size 2. For each sentence in the text, the vectors of its words are averaged to give one vector per sentence. These vectors are stored in the 'sentence_vectors' list. An important input to the TextRank algorithm is the similarity matrix, which records how similar each pair of sentences is. The similarity matrix has size NxN, where N is the number of sentences present in the text.
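
>The averaging step can be sketched without training a model. In the example below, `toy_embeddings` is a made-up dictionary of 2-dimensional word vectors standing in for the trained Word2Vec vectors, and `sentence_vector` is a hypothetical helper. It uses a plain mean, whereas the notebook divides by the word count plus a small constant.

```python
import numpy as np

# Hypothetical 2-D embeddings; in the notebook these come from the Word2Vec model.
toy_embeddings = {
    "neural": np.array([0.2, 0.8]),
    "network": np.array([0.4, 0.6]),
    "training": np.array([0.9, 0.1]),
}

def sentence_vector(words, embeddings, dim=2):
    """Average the vectors of the known words; return zeros for an empty sentence."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

vec = sentence_vector(["neural", "network"], toy_embeddings)
empty_vec = sentence_vector([], toy_embeddings)
print(vec, empty_vec)
```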

>The similarity matrix contains the cosine-similarity score for every pair of sentences, and this matrix is the foundation of our algorithm. The matrix is used to generate a graph in which the nodes represent the sentences and the edge weights are the cosine-similarity scores. The graph is generated using the 'NetworkX' library, which also includes the 'PageRank' algorithm. PageRank is executed on the graph to calculate a rank for every sentence, and the sentences are sorted in descending order of their rank scores. These scores are used to fetch the top K sentences for our summary (K is 5 in the code below, or fewer if the article is shorter).
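
>The NxN similarity matrix can also be computed in one step with NumPy. The following is a minimal sketch rather than the notebook's method, which calls scikit-learn's `cosine_similarity` pair by pair; the function name `cosine_similarity_matrix` and the toy vectors are assumptions. As in the notebook, the diagonal is left at zero so that a sentence is not compared with itself.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity with a zero diagonal; zero vectors yield zero rows."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    safe_norms = np.where(norms > 0, norms, 1.0)  # avoid dividing by zero
    unit = vectors / safe_norms
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)
    return sim

# Hypothetical sentence vectors, including an all-zero vector for an empty sentence.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
sim = cosine_similarity_matrix(toy_vectors)
print(sim.round(3))
```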



In [11]:
article_name=list(set([i for i in dataframe.index]))
dataframe['TextRank Summary']=""
count=0
for row in article_name:
    sentences=sent_tokenize(dataframe['Content'][row])
    clean_sentences=pd.Series(sentences).str.replace("[^a-zA-Z]"," ",regex=True)
    clean_sentences=[s for s in clean_sentences]
    clean_sentences=[remove_stopwords(s.split()) for s in clean_sentences]

    tokenized=[]
    for s in clean_sentences:
        temp=[]
        for word in s.split(' '):
            word=word.split('.')[0]
            temp.append(word.lower())
        tokenized.append(temp)
    #print(tokenized)

    #unique_words=set([j for i in tokenized for j in i])

    # Note: gensim 4 and later renamed the 'size' argument to 'vector_size'.
    model = word2vec.Word2Vec(tokenized,workers=1,size=2,min_count=1,window=3,sg=0)

    word_embeddings={}
    sentence_vectors=[]
    #print("length of tokenized:",len(tokenized))
    for words in tokenized:
        total=0
        #word_embeddings[word]=model[word]
        if len(words)!=0:
            values=[model.wv[word] for word in words]
            v=sum(values).reshape(-1,1)/(len(words)+0.001)
        else:
            v=np.zeros((2,))
        sentence_vectors.append(v)
    similarity_matrix = np.zeros([len(sentence_vectors),len(sentence_vectors)])
    for i in range(len(sentence_vectors)):
        for j in range(len(sentence_vectors)):
            if i!=j:
                similarity_matrix[i][j]=cosine_similarity(sentence_vectors[i].reshape(1,-1),sentence_vectors[j].reshape(1,-1))[0,0]
    



    #nx_graph = nx.from_numpy_array(similarity_matrix)
    #nx.draw(nx_graph)
    #plt.show()
    try:
        nx_graph = nx.from_numpy_array(similarity_matrix)
        summary=[]
        scores = nx.pagerank(nx_graph,0.6)
        ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)),reverse=True)
        temp=pd.DataFrame(ranked_sentences,columns=['TextRank Score', 'Sentence'])
        sn=5
        print("Article Heading: ", row)
        print("Generated Summary:")
        if len(ranked_sentences)<sn:
            for i in range(len(ranked_sentences)):
                summary.append(ranked_sentences[i][1])
        else:
            for i in range(sn):
                summary.append(ranked_sentences[i][1])
        dataframe.loc[row,'TextRank Summary']=' '.join(summary)
        #print("Article Heading: ", row)
        #print(pd.DataFrame(similarity_matrix))
        summary = ' '.join(summary)
        print(summary)
        
        count+=1
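    except Exception as exc:
        # Skip articles that cannot be ranked, but report why instead of failing silently.
        print("Skipped:", row, "-", exc)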
#print(count)

Article Heading:  How It Feels to Live With AIDS for 30 Years
Generated Summary:
Why risk that? Most died within 15 months of being diagnosed.The official prediction was that there would be twice as many cases in the following year.As a young gay Black man, Chris would be considered high-risk, but he was as far away from contracting the virus as one could possibly be. She told Chris that in order to even speak to the recruiter, he’d have to sign some forms. By 1988, funding had been established for national, regional, and community-based organizations — including CAP, the Colorado AIDS Project.Chris knew nothing about CAP. “I looked in the mirror and saw how swollen my glands were.”It was called the Monster back then, a nickname earned by how quickly it took down those infected.
Article Heading:  I Loved My Husband. I Loved Him So Much.
Generated Summary:
Those phone calls were not hallucinations. I loved my husband. I love my husband so much.”I would catch myself chanting this as I tr
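
>The selection of the top sentences can be isolated in a small helper. The sketch below is an illustration under assumptions: `top_k_sentences` and its `keep_document_order` flag are hypothetical names, and the scores are made up. By default it returns sentences in rank order, as the loop above does; the optional flag restores their original order in the text.

```python
def top_k_sentences(sentences, scores, k=5, keep_document_order=False):
    """Return the k highest-scoring sentences; slicing also covers texts shorter than k."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    if keep_document_order:
        ranked = sorted(ranked)
    return [sentences[i] for i in ranked]

toy_sentences = ["First.", "Second.", "Third.", "Fourth."]
# Hypothetical scores keyed by sentence position, the same shape nx.pagerank returns.
toy_scores = {0: 0.10, 1: 0.30, 2: 0.40, 3: 0.20}

by_rank = top_k_sentences(toy_sentences, toy_scores, k=2)
by_position = top_k_sentences(toy_sentences, toy_scores, k=2, keep_document_order=True)
print(by_rank, by_position)
```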

In [0]:
# Standalone version of the loop body for a single text.
# It reuses the imports and remove_stopwords() from the cells above.

content = " "  # placeholder: put the text to summarize here

sentences=sent_tokenize(content)
clean_sentences=pd.Series(sentences).str.replace("[^a-zA-Z]"," ",regex=True)
clean_sentences=[s for s in clean_sentences]
clean_sentences=[remove_stopwords(s.split()) for s in clean_sentences]

tokenized=[]
for s in clean_sentences:
    temp=[]
    for word in s.split(' '):
        word=word.split('.')[0]
        temp.append(word.lower())
    tokenized.append(temp)
    #print(tokenized)

    #unique_words=set([j for i in tokenized for j in i])

# Note: gensim 4 and later renamed the 'size' argument to 'vector_size'.
model = word2vec.Word2Vec(tokenized,workers=1,size=2,min_count=1,window=3,sg=0)

word_embeddings={}
sentence_vectors=[]
#print("length of tokenized:",len(tokenized))
for words in tokenized:
    total=0
        #word_embeddings[word]=model[word]
    if len(words)!=0:
        values=[model.wv[word] for word in words]
        v=sum(values).reshape(-1,1)/(len(words)+0.001)
    else:
        v=np.zeros((2,))
    sentence_vectors.append(v)
similarity_matrix = np.zeros([len(sentence_vectors),len(sentence_vectors)])
for i in range(len(sentence_vectors)):
    for j in range(len(sentence_vectors)):
        if i!=j:
            similarity_matrix[i][j]=cosine_similarity(sentence_vectors[i].reshape(1,-1),sentence_vectors[j].reshape(1,-1))[0,0]
    

try:
    nx_graph = nx.from_numpy_array(similarity_matrix)
    summary=[]
    scores = nx.pagerank(nx_graph,0.6)
    ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)),reverse=True)
    temp=pd.DataFrame(ranked_sentences,columns=['TextRank Score', 'Sentence'])
    sn=5
    if len(ranked_sentences)<sn:
        for i in range(len(ranked_sentences)):
            summary.append(ranked_sentences[i][1])
    else:
        for i in range(sn):
            summary.append(ranked_sentences[i][1])

    summary = ' '.join(summary)
    print(summary)

except Exception as exc:
    print("Could not generate a summary:", exc)

In [0]:
dataframe[:10]

#Downloading the final output file

The results are stored in the dataframe and written to an Excel file named 'TextRank Result.xlsx'. The file is then downloaded using the 'files' module from the 'google.colab' library.

In [0]:
dataframe.to_excel('TextRank Result.xlsx')

In [0]:
from google.colab import files
files.download('TextRank Result.xlsx')
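
The 'google.colab' module is only available inside Colab. When running elsewhere, one option is to write the results to a local CSV file instead. The sketch below assumes a made-up one-row `results` table and a temporary output folder; both are illustrative stand-ins for the notebook's dataframe and file name.

```python
import os
import tempfile

import pandas as pd

# Hypothetical results table shaped like the notebook's dataframe.
results = pd.DataFrame(
    {"Content": ["Full text of article A."], "TextRank Summary": ["Summary of A."]},
    index=pd.Index(["Article A"], name="Heading"),
)

# Write to a temporary folder and read the file back to confirm the round trip.
output_path = os.path.join(tempfile.mkdtemp(), "TextRank Result.csv")
results.to_csv(output_path, encoding="utf-8")
reloaded = pd.read_csv(output_path, index_col="Heading", encoding="utf-8")
print(reloaded)
```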