In [1]:
import numpy as np
import pandas as pd
import nltk
nltk.download('punkt')
import re

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\kmist\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


In [2]:
df=pd.read_csv('C:/Users/kmist/Desktop/tennis_articles_v4.csv')

In [3]:
df.head(2)

Unnamed: 0,article_id,article_text,source
0,1,Maria Sharapova has basically no friends as te...,https://www.tennisworldusa.org/tennis/news/Mar...
1,2,"BASEL, Switzerland (AP), Roger Federer advance...",http://www.tennis.com/pro-game/2018/10/copil-s...


In [4]:
df['article_text'][0]

"Maria Sharapova has basically no friends as tennis players on the WTA Tour. The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much. I think everyone knows this is my job here. When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match. I'm a pretty competitive girl. I say my hellos, but I'm not sending any players flowers as well. Uhm, I'm not really friendly or close to many players. I have not a lot of friends away from the courts.' When she said she is not really close to a lot of players, is that something strategic that she is doing? Is it different on the men's tour than the women's tour? 'No, not at all. I think just because you're in the same 

Generate one summary across all articles

In [8]:
#split text into sentences
from nltk.tokenize import sent_tokenize
sentences=[]
for s in df['article_text']:
    sentences.append(sent_tokenize(s))
    
sentences = [y for x in sentences for y in x] # flatten list    

In [12]:
sentences[:10]

['Maria Sharapova has basically no friends as tennis players on the WTA Tour.',
 "The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much.",
 'I think everyone knows this is my job here.',
 "When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.",
 "I'm a pretty competitive girl.",
 "I say my hellos, but I'm not sending any players flowers as well.",
 "Uhm, I'm not really friendly or close to many players.",
 "I have not a lot of friends away from the courts.'",
 'When she said she is not really close to a lot of players, is that something strategic that she is doing?',
 "Is it different on the men's tour than the women's tour?"]

In [22]:
#Change working directory
import os
print("Current Working Directory " , os.getcwd())
print('***changing the working directory to read word embeddings***')
 	
os.chdir("C:/Users/kmist/Desktop")
print('new path')
print(os.getcwd())

Current Working Directory  C:\Users\kmist\Desktop
***changing the working directory to read word embeddings***
new path
C:\Users\kmist\Desktop


In [23]:
#Extract word vectors
word_embeddings={}
f=open('glove.6B.100d.txt', encoding='utf-8')

In [24]:
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    word_embeddings[word] = coefs
f.close()
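The loading loop above can also be wrapped in a small function and used with a context manager, so the file is closed even if parsing fails. The sketch below is only an illustration: `load_embeddings` is a hypothetical helper name, and the three-dimensional sample lines are made up to mimic the "word followed by numbers" layout that the loop above expects.

```python
import io
import numpy as np

def load_embeddings(lines):
    """Parse lines of the form 'word v1 v2 ...' into a dict of float32 arrays.

    Hypothetical helper, not part of the notebook.
    """
    embeddings = {}
    for line in lines:
        values = line.split()
        if not values:
            continue  # skip blank lines
        embeddings[values[0]] = np.asarray(values[1:], dtype="float32")
    return embeddings

# Made-up sample in the same layout, held in memory instead of on disk.
sample = io.StringIO("tennis 0.1 0.2 0.3\ncourt 0.4 0.5 0.6\n")
toy_embeddings = load_embeddings(sample)
print(sorted(toy_embeddings))
```

With the real file, the same helper could be called as `with open('glove.6B.100d.txt', encoding='utf-8') as f: word_embeddings = load_embeddings(f)`.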

In [25]:
len(word_embeddings)

400000

Text preprocessing: remove punctuation, numbers, and special characters

In [26]:
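# regex=True is required on recent pandas versions, where str.replace treats the pattern literally by default
clean_sentences=pd.Series(sentences).str.replace("[^a-zA-Z]"," ", regex=True)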
clean_sentences=[s.lower() for s in clean_sentences]

In [28]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kmist\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [29]:
from nltk.corpus import stopwords
stop_words=stopwords.words('english')

In [31]:
def remove_stopwords(sen):
    sen_new=" ".join([i for i in sen if i not in stop_words])
    return sen_new

In [32]:
clean_sentences=[remove_stopwords(r.split()) for r in clean_sentences]
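To see what the cleaning steps do to a single sentence, here is a minimal, self-contained sketch. It uses the standard `re` module instead of pandas, and `toy_stop_words` is a tiny hypothetical stop-word set chosen only for the illustration (the notebook uses NLTK's English list). `clean_sentence` is likewise a hypothetical helper name.

```python
import re

# Hypothetical, tiny stop-word set used only for this illustration.
toy_stop_words = {"the", "is", "a", "to", "in"}

def clean_sentence(sentence, stop_words):
    """Keep letters only, lowercase, and drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", sentence)
    tokens = letters_only.lower().split()
    return " ".join(t for t in tokens if t not in stop_words)

cleaned = clean_sentence("The world No. 3 is in Basel!", toy_stop_words)
print(cleaned)
```

Note that digits and punctuation disappear entirely, so "No. 3" is reduced to the single token "no".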

Now that the sentences are reasonably clean, we can build a vector for each one from the GloVe embeddings loaded above.
We first fetch the 100-dimensional vector of each word in a sentence, then average those vectors to get a single vector for the sentence.

In [35]:
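sentence_vectors=[]
for i in clean_sentences:
    if len(i)!=0:
        # Average the word vectors; words missing from GloVe count as zeros.
        # The +0.001 keeps the denominator nonzero.
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
    else:
        # Nothing left after cleaning: fall back to a zero vector
        v = np.zeros((100,))
    sentence_vectors.append(v)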
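The averaging step is easier to check by hand on a toy case. The sketch below assumes made-up three-dimensional embeddings in place of the 100-dimensional GloVe vectors, and `sentence_vector` is a hypothetical helper. Unlike the cell above, it divides by the exact word count rather than the count plus 0.001.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for GloVe vectors.
toy_embeddings = {
    "federer": np.array([1.0, 0.0, 2.0]),
    "wins": np.array([3.0, 2.0, 0.0]),
}

def sentence_vector(sentence, embeddings, dim):
    """Average the vectors of the words in `sentence`; unknown words count as zeros."""
    words = sentence.split()
    if not words:
        return np.zeros(dim)
    total = sum(embeddings.get(w, np.zeros(dim)) for w in words)
    return total / len(words)

vec = sentence_vector("federer wins", toy_embeddings, dim=3)
empty_vec = sentence_vector("", toy_embeddings, dim=3)
print(vec)
```

Here the sentence vector is the element-wise mean of the two word vectors, and an empty sentence maps to the zero vector.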

Use cosine similarity to measure how similar the sentences are

In [37]:
sim_matrix=np.zeros([len(sentences),len(sentences)])

In [38]:
from sklearn.metrics.pairwise import cosine_similarity

In [42]:
for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i != j:
            sim_matrix[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]                      
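The nested loop calls the similarity function once per sentence pair, which gets slow as the number of sentences grows. As a sketch of what is being computed, the whole matrix can be built in one step with plain NumPy: normalize each vector to unit length and take all pairwise dot products. `cosine_similarity_matrix` and the two-dimensional toy vectors are hypothetical and used only for illustration.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Return all pairwise cosine similarities, with the diagonal set to zero."""
    X = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Leave all-zero rows as zeros instead of dividing by zero.
    safe_norms = np.where(norms == 0, 1.0, norms)
    unit = X / safe_norms
    sim = unit @ unit.T
    # Match the loop above, which skips the i == j entries.
    np.fill_diagonal(sim, 0.0)
    return sim

toy_vectors = [
    np.array([1.0, 0.0]),
    np.array([0.0, 2.0]),
    np.array([3.0, 3.0]),
    np.array([0.0, 0.0]),  # e.g. a sentence that was empty after cleaning
]
sim = cosine_similarity_matrix(toy_vectors)
print(sim.round(3))
```

scikit-learn's `cosine_similarity` also accepts a single 2-D array of all vectors, so `cosine_similarity(np.vstack(sentence_vectors))` followed by zeroing the diagonal should be an equivalent shortcut in the notebook.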

Use the PageRank algorithm. First, convert the similarity matrix into a graph: the nodes represent the sentences and the edge weights represent the similarity scores.
Then apply PageRank to obtain the sentence rankings.

In [43]:
import networkx as nx
nx_graph=nx.from_numpy_array(sim_matrix)
scores=nx.pagerank(nx_graph)
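To make the ranking step less of a black box, here is a minimal power-iteration sketch of PageRank on a weighted similarity matrix, written with NumPy only. It is not networkx's implementation, and its scores may differ slightly from `nx.pagerank`. `pagerank_scores`, the tolerance, the iteration cap, and the 3x3 toy matrix are all assumptions made for the illustration; the damping factor 0.85 matches the usual default.

```python
import numpy as np

def pagerank_scores(sim, damping=0.85, tol=1e-8, max_iter=200):
    """Rank nodes of a weighted graph by power iteration (illustrative sketch)."""
    n = sim.shape[0]
    row_sums = sim.sum(axis=1, keepdims=True)
    # Turn each row into transition probabilities; rows with no
    # outgoing weight jump uniformly to every node.
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    transition = np.where(row_sums > 0, sim / safe_sums, 1.0 / n)
    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_scores = (1 - damping) / n + damping * (transition.T @ scores)
        if np.abs(new_scores - scores).sum() < tol:
            return new_scores
        scores = new_scores
    return scores

# Toy similarity matrix: sentence 0 is similar to both of the others.
toy_sim = np.array([
    [0.0, 0.9, 0.8],
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.0],
])
toy_scores = pagerank_scores(toy_sim)
print(toy_scores.round(3))
```

Sentence 0 receives the highest score because it is strongly connected to both other sentences, which is the property the summarizer relies on: sentences similar to many others are treated as central.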

Finally, extract the summary

In [44]:
# Sort the sentences by PageRank score, highest first, so the top N can be read off
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)

In [46]:
#Extract top 10 sentences in the summary
for i in range(10):
    print(ranked_sentences[i][1])      

When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.
Major players feel that a big event in late November combined with one in January before the Australian Open will mean too much tennis and too little rest.
Speaking at the Swiss Indoors tournament where he will play in Sundays final against Romanian qualifier Marius Copil, the world number three said that given the impossibly short time frame to make a decision, he opted out of any commitment.
"I felt like the best weeks that I had to get to know players when I was playing were the Fed Cup weeks or the Olympic weeks, not necessarily during the tournaments.
Currently in ninth place, Nishikori with a win could move to within 125 points of the cut for the eight-man event in London 