# Text summarisation using PageRank?

I saw this approach on Medium...
https://medium.com/analytics-vidhya/an-introduction-to-text-summarization-using-the-textrank-algorithm-with-python-implementation-2370c39d0c60

And immediately knew I had to give it a try on my TED Talks data...

In [1]:
import numpy as np 
import pandas as pd 
import nltk 
import re

In [2]:
df = pd.read_csv("~/Documents/Twitter/TED_data.csv")
df.head()
dt = df.iloc[:2]
print(dt)

   ID        date       author  \
0   1  2017-02-17  Grady Booch   
1   2  2020-03-24    Stefan Al   

                                         long_title  \
0       Grady Booch: Don't fear superintelligent AI   
1  Stefan Al: Why isn't the Netherlands underwater?   

                                            keywords  \
0  TED, Talks, Themes, Speakers, Technology, Ente...   
1  TED, Talks, Themes, Speakers, Technology, Ente...   

                                         description  \
0  New tech spawns new anxieties, says scientist ...   
1  In January 1953, a tidal surge shook the North...   

                                   title      author2  \
0         Don't fear superintelligent AI  Grady Booch   
1  Why isn't the Netherlands underwater?    Stefan Al   

                                           full_text  Year  
0  When I was a kid, I was the quintessential ner...  2017  
1  In January of 1953, a tidal surge shook the No...  2020  


In [3]:
dt['full_text'][0]

'When I was a kid, I was the quintessential nerd. I think some of you were, too.  And you, sir, who laughed the loudest, you probably still are.  I grew up in a small town in the dusty plains of north Texas, the son of a sheriff who was the son of a pastor. Getting into trouble was not an option. And so I started reading calculus books for fun.  You did, too. That led me to building a laser and a computer and model rockets, and that led me to making rocket fuel in my bedroom. Now, in scientific terms, we call this a very bad idea.  Around that same time, Stanley Kubrick\'s "2001: A Space Odyssey" came to the theaters, and my life was forever changed. I loved everything about that movie, especially the HAL 9000. Now, HAL was a sentient computer designed to guide the Discovery spacecraft from the Earth to Jupiter. HAL was also a flawed character, for in the end he chose to value the mission over human life. Now, HAL was a fictional character, but nonetheless he speaks to our fears, our f

In [6]:
from nltk.tokenize import sent_tokenize
# Requires NLTK's Punkt sentence tokenizer data to be downloaded once.
# Split each talk into sentences and flatten the result into a single list.
sentences = [sent for text in dt['full_text'] for sent in sent_tokenize(text)]

In [7]:
sentences[:5]

['When I was a kid, I was the quintessential nerd.',
 'I think some of you were, too.',
 'And you, sir, who laughed the loudest, you probably still are.',
 'I grew up in a small town in the dusty plains of north Texas, the son of a sheriff who was the son of a pastor.',
 'Getting into trouble was not an option.']

In [8]:
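# Remove punctuation, digits and special characters.
# regex=True is needed: since pandas 2.0, str.replace treats the pattern as a literal string by default.
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)

# Convert to lowercase
clean_sentences = [s.lower() for s in clean_sentences]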

In [9]:
# Uncomment and run once if the stopwords corpus is not installed yet:
#nltk.download('stopwords')
from nltk.corpus import stopwords 
stop_words = stopwords.words('english')

In [10]:
# Remove stopwords from a list of words and rejoin what is left
def remove_stopwords(sen):
    return " ".join(w for w in sen if w not in stop_words)

# Apply to every cleaned sentence
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]

### Vector Representation of Sentences

In [11]:
# Load the pre-trained GloVe vectors (100 dimensions) into a dict: word -> vector
word_embeddings = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs

Now, let’s create vectors for our sentences. We first fetch the vector (100 elements each) for every word in a sentence, then average those vectors to get a single vector for the whole sentence.

In [12]:
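sentence_vectors = []
for i in clean_sentences:
    if len(i) != 0:
        # Average the word vectors; words missing from GloVe count as zero vectors.
        # The +0.001 in the denominator is a guard against division by zero.
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
    else:
        # Nothing left after cleaning: fall back to a zero vector
        v = np.zeros((100,))
    sentence_vectors.append(v)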
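
To make the averaging step concrete, here is a minimal sketch on toy data. The 3-dimensional `toy_embeddings` and the helper name `sentence_vector` are made up for illustration (they stand in for the 100-dimensional GloVe vectors), and the sketch takes a plain mean rather than dividing by the length plus 0.001 as the cell above does.

```python
import numpy as np

# Hypothetical 3-d embeddings, standing in for GloVe's 100-d vectors
toy_embeddings = {
    "rocket": np.array([1.0, 0.0, 2.0]),
    "fuel": np.array([3.0, 2.0, 0.0]),
}

def sentence_vector(sentence, embeddings, dim):
    """Average the vectors of the words in `sentence`; unknown words count as zeros."""
    words = sentence.split()
    if not words:
        # Empty sentence: return a zero vector of the right size
        return np.zeros(dim)
    vectors = [embeddings.get(w, np.zeros(dim)) for w in words]
    return np.mean(vectors, axis=0)

vec = sentence_vector("rocket fuel", toy_embeddings, dim=3)
print(vec)  # [2. 1. 1.]
```

Note that a word missing from the embeddings still counts towards the length, so it pulls the average towards zero, just as it does in the notebook cell.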

### Similarity Matrix Preparation
The next step is to measure how similar the sentences are to one another, and we will use cosine similarity for this.
Let’s first define a zero matrix of dimensions (n * n), where n is the number of sentences, and then fill it with the pairwise cosine similarity scores.

In [13]:
# similarity matrix 
sim_mat = np.zeros([len(sentences), len(sentences)])

In [14]:
sim_mat.shape

(135, 135)

In [15]:
from sklearn.metrics.pairwise import cosine_similarity

In [16]:
# Fill the matrix with pairwise cosine similarities (the diagonal stays at 0).
for i in range(len(sentences)):
    for j in range(len(sentences)): 
        if i != j: 
            sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]
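
The double loop calls `cosine_similarity` once per pair, which gets slow as the number of sentences grows. A vectorised sketch in plain NumPy could look like the following; the function name `cosine_similarity_matrix` and the toy vectors are my own assumptions, not part of the original tutorial.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarities between row vectors, with a zeroed diagonal."""
    X = np.vstack(vectors).astype(float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero rows stay all-zero, so their similarity is 0
    X_unit = X / norms
    sim = X_unit @ X_unit.T
    np.fill_diagonal(sim, 0.0)  # no self-similarity, matching the `i != j` check
    return sim

# Toy 2-d vectors: two orthogonal axes, their diagonal, and an empty sentence
toy_vectors = [
    np.array([1.0, 0.0]),
    np.array([0.0, 1.0]),
    np.array([1.0, 1.0]),
    np.zeros(2),
]
sim = cosine_similarity_matrix(toy_vectors)
print(sim.shape)  # (4, 4)
```

Passing the stacked sentence vectors to scikit-learn's `cosine_similarity` in a single call and then zeroing the diagonal should give the same matrix.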

### Applying PageRank Algorithm
Before proceeding further, let’s convert the similarity matrix sim_mat into a graph. The nodes of this graph represent the sentences, and the edge weights are the similarity scores between them. We then apply the PageRank algorithm to this graph to arrive at the sentence rankings.

In [17]:
import networkx as nx 
nx_graph = nx.from_numpy_array(sim_mat) 
scores = nx.pagerank(nx_graph)
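
For intuition about what `nx.pagerank` computes, here is a minimal power-iteration sketch in plain NumPy. It is not networkx's implementation: the function name `pagerank_power_iteration`, the toy matrix and the stopping tolerance are my own assumptions, and it assumes non-negative edge weights. The damping factor of 0.85 matches the networkx default.

```python
import numpy as np

def pagerank_power_iteration(weights, damping=0.85, tol=1e-10, max_iter=200):
    """Rank nodes of a weighted graph by repeatedly applying the PageRank update."""
    W = np.asarray(weights, dtype=float)
    n = W.shape[0]
    row_sums = W.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    # Row-normalise into transition probabilities; nodes without edges jump uniformly
    P = np.where(row_sums > 0, W / safe_sums, 1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # With probability `damping` follow an edge, otherwise jump to a random node
        new_rank = (1.0 - damping) / n + damping * (rank @ P)
        if np.abs(new_rank - rank).sum() < tol:
            return new_rank
        rank = new_rank
    return rank

# Toy graph: node 0 is connected to nodes 1 and 2, which are not connected to each other
toy_weights = np.array([
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
])
toy_rank = pagerank_power_iteration(toy_weights)
print(toy_rank.argmax())  # 0
```

The hub node ends up with the highest score, which is the same logic that pushes sentences similar to many other sentences to the top of the ranking.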

In [19]:
ranked_sentences = sorted(((scores[i],s) for i,s in 
                           enumerate(sentences)), reverse=True)
# Extract top 10 sentences as the summary 
for i in range(10): 
    print(ranked_sentences[i][1])

These things are all true to a degree, but it's also the case that these technologies brought to us things that extended the human experience in some profound ways.
At a point in time we saw the written word become pervasive, people thought we would lose our ability to memorize.
So let's accept for a moment that it's possible to build such an artificial intelligence for this kind of mission and others.
Is it really possible for us to take a system of millions upon millions of devices, to read in their data streams, to predict their failures and act in advance?
Indeed, we stand at a remarkable time in human history, where, driven by refusal to accept the limits of our bodies and our minds, we are building machines of exquisite, beautiful complexity and grace that will extend the human experience in ways beyond our imagining.
How might I use computing to help take us to the stars?
Is it really possible to build an artificial intelligence like that?
When we first saw telephones come in, p
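
The sentences above are printed in score order, so they jump around the talk. If you would rather read the summary in the order the sentences were spoken, one option is to pick the top-scoring indices first and then sort them by position. This is a small sketch on toy data; the helper name `top_k_in_document_order` and the toy scores are made up for illustration.

```python
def top_k_in_document_order(sentences, scores, k):
    """Pick the k highest-scoring sentences, then return them in their original order."""
    top_indices = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top_indices)]

# Toy data shaped like the notebook's `sentences` list and `scores` dict
toy_sentences = ["First point.", "A minor aside.", "Second point.", "Closing remark."]
toy_scores = {0: 0.30, 1: 0.10, 2: 0.35, 3: 0.25}

summary = top_k_in_document_order(toy_sentences, toy_scores, k=2)
print(summary)  # ['First point.', 'Second point.']
```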

In [59]:
print(scores)

{0: 0.0034815934709387874, 1: 0.007270682802675584, 2: 0.006329128685573784, 3: 0.006617024369020606, 4: 0.007243105474039097, 5: 0.006761948534664872, 6: 0.0034695259288742963, 7: 0.007431057910719146, 8: 0.008140082583390338, 9: 0.007712234183499516, 10: 0.007348137343875435, 11: 0.006455121514384519, 12: 0.008176206049631325, 13: 0.006909588837474255, 14: 0.006032778186141106, 15: 0.008447014670640716, 16: 0.007699609848075474, 17: 0.007768335204556238, 18: 0.008001823525931782, 19: 0.008128464040602516, 20: 0.008134646854397192, 21: 0.008301950245211505, 22: 0.007697782768761685, 23: 0.0082611063928738, 24: 0.006433234997727685, 25: 0.008372863995087718, 26: 0.007322374684660267, 27: 0.008126646818843931, 28: 0.006526848586570854, 29: 0.005877186440572776, 30: 0.007737410605077583, 31: 0.00832516626600242, 32: 0.008444476636615038, 33: 0.005784363367006628, 34: 0.007442111851959082, 35: 0.005784363367006628, 36: 0.008064909572423597, 37: 0.005784363367006628, 38: 0.0079763838285295