  ## **TEXT SUMMARIZATION:-**

### **IMPORTING LIBRARIES:-**

In [56]:
import pandas as pd
import numpy as np
import nltk
nltk.download('punkt')
import re
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.tokenize import sent_tokenize
import networkx as nx

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### **LOADING DATASET:-**

In [76]:
dataset = pd.read_csv("tennis_articles.csv",encoding = 'Windows-1252')
print(dataset)

   article_id  ...                                             source
0           1  ...  https://www.tennisworldusa.org/tennis/news/Mar...
1           2  ...  http://www.tennis.com/pro-game/2018/10/copil-s...
2           3  ...  https://scroll.in/field/899938/tennis-roger-fe...
3           4  ...  http://www.tennis.com/pro-game/2018/10/nishiko...
4           5  ...  https://www.express.co.uk/sport/tennis/1036101...
5           6  ...  https://www.express.co.uk/sport/tennis/1037119...
6           7  ...  http://www.tennis.com/pro-game/2018/10/tennisc...
7           8  ...  https://www.foxsports.com.au/tennis/tennis-jou...

[8 rows x 4 columns]


In [74]:
dataset['article_text'][7]

'I PLAYED golf last week with Todd Reid. He picked me up at 5.30am, half an hour early because he couldn’t sleep. Or hadn’t slept, to be specific. Not because he’d been out on a bender or anything — those days were in the past. The former Wimbledon junior champion was full of hope, excited about getting his life back together after a troubled few years and a touch-and-go battle with pancreatitis. “I’m pleased with that,” he said after grinding out an eight-over-par front nine at the not-so-royal Northbridge club in Sydney and smashing down an egg- and-bacon roll at the halfway house. To most players of his rare sporting gifts, such a modest return would be unacceptable. To Reid the 15-marker, just being up and at ‘em was enough; a few bogeys and one well-made par — broomstick putter and all — vindication for his recent decision to renew his membership at nearby Bankstown. Exhausted after spending half his round deep in the bushes searching for my ball, as well as those of two other gol

### There are 8 articles to be summarized. Here we generate a single summary that covers all 8 articles.

### **SPLITTING TEXT INTO SENTENCES:-**

In [58]:
# split each article into sentences
sentences = []
for s in dataset['article_text']:
  sentences.append(sent_tokenize(s))  # one list of sentences per article

sentences = [y for x in sentences for y in x]  # flatten the list of lists


In [59]:
sentences[5]  # shows the sentence at index 5 (the sixth sentence)

"I'm a pretty competitive girl."

### **EXTRACTING WORD EMBEDDINGS/WORD VECTORS:-**

In [60]:
# Extract word vectors from the pre-trained GloVe file (100-dimensional)
word_embeddings = {}  # maps each word to its vector
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs

In [61]:
len(word_embeddings)#no. of word vectors in dictionary(word_embeddings)

181223
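
The GloVe file is plain text with one word per line, followed by its vector components separated by spaces. A minimal sketch of the same parsing step, run on two in-memory lines instead of the real file, might look like this. The helper name `load_embeddings` and the sample numbers are made up for illustration.

```python
import io
import numpy as np

def load_embeddings(lines):
    """Parse 'word v1 v2 ...' lines into a dict of float32 vectors."""
    embeddings = {}
    for line in lines:
        values = line.split()
        if not values:  # skip blank lines
            continue
        embeddings[values[0]] = np.asarray(values[1:], dtype="float32")
    return embeddings

# Made-up 3-dimensional sample; the notebook reads 100-dimensional vectors.
sample = io.StringIO("tennis 0.1 0.2 0.3\nmatch 0.4 0.5 0.6\n")
demo_embeddings = load_embeddings(sample)
print(demo_embeddings["tennis"])
```

Because the function accepts any iterable of lines, it should also work on an open file handle.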

### **TEXT PREPROCESSING:-**

In [62]:
# replace punctuation, digits and special characters with spaces
# (regex=True is required on pandas >= 2.0, where the default is a literal match)
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)
# convert to lowercase
clean_sentences = [s.lower() for s in clean_sentences]
stop_words = stopwords.words('english')
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new
#removes stopwords from sentences
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]
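
To see what the cleaning does to one sentence, here is a small standalone sketch using only the standard library. `DEMO_STOP_WORDS` is a hypothetical, tiny stop-word set chosen for this example; the notebook uses the full NLTK English list, so real results can differ.

```python
import re

# Hypothetical stop-word set for illustration only.
DEMO_STOP_WORDS = {"i", "a", "the", "is", "to", "and", "of", "m"}

def clean_sentence(sentence, stop_words=DEMO_STOP_WORDS):
    """Keep letters only, lowercase the text, and drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", sentence)
    tokens = letters_only.lower().split()
    return " ".join(t for t in tokens if t not in stop_words)

cleaned = clean_sentence("I'm a pretty competitive girl.")
print(cleaned)
```

Note that replacing the apostrophe with a space splits "I'm" into "i" and "m", which is why single letters end up as separate tokens.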

### **CREATING VECTORS FOR SENTENCES:-**

In [63]:
sentence_vectors = []
for i in clean_sentences:
  if len(i) != 0:
    v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
  else:
    v = np.zeros((100,))
  sentence_vectors.append(v)
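
The sentence vector is the sum of the word vectors divided by the word count (plus 0.001); words without an embedding contribute a zero vector. A minimal sketch of that averaging on toy 3-dimensional embeddings follows. The dictionary `toy_embeddings` and the function name `sentence_vector` are assumptions made for this example.

```python
import numpy as np

DIM = 3  # toy dimension; the notebook uses 100-dimensional GloVe vectors

# Made-up embeddings for illustration only.
toy_embeddings = {
    "tennis": np.array([1.0, 0.0, 0.0]),
    "match": np.array([0.0, 1.0, 0.0]),
}

def sentence_vector(sentence, embeddings, dim):
    """Average the word vectors of a sentence; unknown words count as zeros."""
    words = sentence.split()
    if not words:
        return np.zeros(dim)
    total = sum(embeddings.get(w, np.zeros(dim)) for w in words)
    return total / (len(words) + 0.001)

vec = sentence_vector("tennis match unknown", toy_embeddings, DIM)
print(vec)
```

Unknown words still count toward the denominator, so a sentence with many out-of-vocabulary words gets a shorter vector.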

### **SIMILARITY MATRIX:-**

In [64]:
from sklearn.metrics.pairwise import cosine_similarity

# similarity matrix, initialised with zeros (the diagonal stays 0)
sim_mat = np.zeros([len(sentences), len(sentences)])
# fill the matrix with the cosine similarity of each pair of sentence vectors
for i in range(len(sentences)):
  for j in range(len(sentences)):
    if i != j:
      sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]
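
The nested loop makes one library call per sentence pair, which gets slow as the number of sentences grows. As a possible alternative, the whole matrix can be computed in one step with NumPy. This is a sketch, not the code used in this notebook; `cosine_similarity_matrix` is a hypothetical helper, and it assumes all sentence vectors have the same length.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity with a zero diagonal."""
    X = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero rows stay zero, giving similarity 0
    unit = X / norms
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)  # a sentence is not compared with itself
    return sim

# Toy 2-dimensional vectors, including an all-zero one.
demo_vectors = [np.array([1.0, 0.0]), np.array([1.0, 1.0]), np.zeros(2)]
demo_sim = cosine_similarity_matrix(demo_vectors)
print(demo_sim)
```

With the notebook's variables, the call would be `cosine_similarity_matrix(sentence_vectors)`.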


### **APPLYING PAGERANK ALGORITHM:-**

In [65]:
#Convert the similarity matrix sim_mat into a graph
#The nodes of this graph will represent the sentences and the edges will represent the similarity scores between the sentences
nx_graph = nx.from_numpy_array(sim_mat)
#We will apply the PageRank algorithm on this graph to arrive at the sentence rankings
scores = nx.pagerank(nx_graph)
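
To build intuition for what the scores mean, here is a simplified power-iteration sketch of the PageRank idea on a tiny weighted graph. It is not the networkx implementation; the function name, the tolerance and the toy weights are assumptions for this example, and it assumes non-negative edge weights. The damping factor 0.85 matches the default `alpha` of `nx.pagerank`.

```python
import numpy as np

def pagerank_power_iteration(weights, damping=0.85, tol=1e-8, max_iter=100):
    """Rank nodes of a weighted graph by repeated score propagation."""
    W = np.asarray(weights, dtype=float)
    n = W.shape[0]
    row_sums = W.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    # Row-normalise; a node with no outgoing weight jumps uniformly.
    transition = np.where(row_sums > 0, W / safe_sums, 1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = (1 - damping) / n + damping * (rank @ transition)
        if np.abs(new_rank - rank).sum() < tol:
            return new_rank
        rank = new_rank
    return rank

# Toy similarity matrix: sentence 1 is similar to both other sentences.
toy_sim = np.array([
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.9],
    [0.1, 0.9, 0.0],
])
toy_scores = pagerank_power_iteration(toy_sim)
print(toy_scores)
```

The sentence that is most similar to the others collects the highest score, which is the property the summary extraction step relies on.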

### **SUMMARY EXTRACTION**

In [77]:
#extract the top N sentences based on their rankings for summary generation.
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)
print("SUMMARY:")
for i in range(10):
  print(ranked_sentences[i][1])
  print('\n')

SUMMARY:
“I was on a nice trajectorythen,” Reid recalled.“If I hadn’t got sick, I think I could have started pushing towards the second week at the slams and then who knows.” Duringa comeback attempt some five years later, Reid added Bernard Tomic and 2018 US Open Federer slayer John Millman to his list of career scalps.


Major players feel that a big event in late November combined with one in January before the Australian Open will mean too much tennis and too little rest.


So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.


Speaking at the Swiss Indoors tournament where he will play in Sunday’s final against Romanian qualifier Marius Copil, the world number three said that given the impossibly short time frame to make a decision, he opted out of any commitment.


Currently in ninth place, Nishikori with a win could move to within 125 points of the cut for the eight-man event in London 