### Import required libraries

In [1]:
import numpy as np
import pandas as pd
import nltk
nltk.download('punkt')
import re

[nltk_data] Downloading package punkt to C:\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Method 1

Generate the summary in the source language, then use neural machine translation to translate it into the target language.

## Part 1: Summarization

### Read the data

In [4]:
df = pd.read_csv("tennis_articles_v4.csv")

### Inspect the data

In [5]:
df.head()

Unnamed: 0,article_id,article_text,source
0,1,Maria Sharapova has basically no friends as te...,https://www.tennisworldusa.org/tennis/news/Mar...
1,2,"BASEL, Switzerland (AP), Roger Federer advance...",http://www.tennis.com/pro-game/2018/10/copil-s...
2,3,Roger Federer has revealed that organisers of ...,https://scroll.in/field/899938/tennis-roger-fe...
3,4,Kei Nishikori will try to end his long losing ...,http://www.tennis.com/pro-game/2018/10/nishiko...
4,5,"Federer, 37, first broke through on tour over ...",https://www.express.co.uk/sport/tennis/1036101...


In [6]:
df['article_text'][0]

"Maria Sharapova has basically no friends as tennis players on the WTA Tour. The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much. I think everyone knows this is my job here. When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match. I'm a pretty competitive girl. I say my hellos, but I'm not sending any players flowers as well. Uhm, I'm not really friendly or close to many players. I have not a lot of friends away from the courts.' When she said she is not really close to a lot of players, is that something strategic that she is doing? Is it different on the men's tour than the women's tour? 'No, not at all. I think just because you're in the same 

### Split text into sentences

In [7]:
from nltk.tokenize import sent_tokenize
sentences = []
for s in df['article_text']:
    sentences.append(sent_tokenize(s))

sentences = [y for x in sentences for y in x] # flatten list

In [69]:
sentences[:5]

['Maria Sharapova has basically no friends as tennis players on the WTA Tour.',
 "The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much.",
 'I think everyone knows this is my job here.',
 "When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.",
 "I'm a pretty competitive girl."]

### GloVe Word Embeddings

In [9]:
# Extract word vectors from the pre-trained GloVe file
word_embeddings = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs

In [10]:
len(word_embeddings)

400000

### Text Preprocessing

In [11]:
# remove punctuation, numbers and special characters
# (regex=True is needed in pandas 2.0+, where str.replace matches literally by default)
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)

# convert to lowercase
clean_sentences = [s.lower() for s in clean_sentences]

In [12]:
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

[nltk_data] Downloading package stopwords to C:\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [13]:
# function to remove stopwords
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new

In [14]:
# remove stopwords from the sentences
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]

### Vector representation of sentences


In [15]:
sentence_vectors = []
for i in clean_sentences:
    if len(i) != 0:
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
    else:
        v = np.zeros((100,))
    sentence_vectors.append(v)
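
As a small illustration of the averaging step above, the sketch below builds a sentence vector from a toy embedding table. The 3-dimensional `toy_embeddings` values and the helper name `sentence_vector` are assumptions made for this example; the notebook itself uses the 100-dimensional GloVe vectors.

```python
import numpy as np

# Toy 3-dimensional embeddings (hypothetical values, for illustration only)
toy_embeddings = {
    "tennis": np.array([1.0, 0.0, 2.0]),
    "match": np.array([3.0, 2.0, 0.0]),
}

def sentence_vector(sentence, embeddings, dim):
    """Average the word vectors of a cleaned sentence; unknown words count as zeros."""
    words = sentence.split()
    if not words:
        return np.zeros(dim)
    total = sum(embeddings.get(w, np.zeros(dim)) for w in words)
    # the small constant mirrors the notebook's guard against division by zero
    return total / (len(words) + 0.001)

vec = sentence_vector("tennis match", toy_embeddings, dim=3)
empty = sentence_vector("", toy_embeddings, dim=3)
print(vec)
print(empty)
```

Words missing from the embedding table contribute a zero vector but still count towards the denominator, so sentences with many out-of-vocabulary words are pulled towards the origin.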

### Similarity matrix preparation

In [17]:
# similarity matrix
sim_mat = np.zeros([len(sentences), len(sentences)])

In [18]:
from sklearn.metrics.pairwise import cosine_similarity

In [19]:
for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i != j:
            sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]
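
The nested loop above calls `cosine_similarity` once per sentence pair, which gets slow as the number of sentences grows. A possible alternative, sketched here with plain NumPy, computes all pairs in a single matrix product. The helper name `cosine_similarity_matrix` and the toy vectors are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity with a zero diagonal (no self-similarity)."""
    X = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero rows stay zero instead of dividing by zero
    unit = X / norms
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)  # same effect as the `if i != j` check
    return sim

# Toy 2-dimensional "sentence vectors" (hypothetical values)
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
toy_sim = cosine_similarity_matrix(toy_vectors)
print(toy_sim.round(3))
```

With the notebook's data, `cosine_similarity_matrix(sentence_vectors)` should yield a matrix of the same shape as `sim_mat`.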

### Applying PageRank Algorithm

In [20]:
import networkx as nx

nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)
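
To make the ranking step less of a black box, here is a minimal power-iteration sketch of weighted PageRank. It is not the networkx implementation: the function name `pagerank_power_iteration`, the tolerance, and the toy graph are assumptions for illustration, and the sketch assumes non-negative edge weights.

```python
import numpy as np

def pagerank_power_iteration(weights, damping=0.85, tol=1e-8, max_iter=200):
    """Rank nodes of a weighted graph given as a square matrix of edge weights."""
    W = np.asarray(weights, dtype=float)
    n = W.shape[0]
    row_sums = W.sum(axis=1, keepdims=True)
    # row-normalise into transition probabilities; nodes without edges jump uniformly
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    P = np.where(row_sums > 0, W / safe_sums, 1.0 / n)
    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_scores = (1.0 - damping) / n + damping * (scores @ P)
        if np.abs(new_scores - scores).sum() < tol:
            return new_scores
        scores = new_scores
    return scores

# Toy graph: node 0 is similar to nodes 1 and 2, which are not similar to each other
toy_weights = np.array([
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
])
toy_scores = pagerank_power_iteration(toy_weights)
print(toy_scores)
```

In the toy graph, node 0 receives the highest score because both other nodes point to it, which is the same intuition behind ranking sentences by how similar they are to the rest of the text.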

### Summary Extraction

In [21]:
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)

In [50]:
# Extract top 10 sentences as the summary
for i in range(10):
    print(ranked_sentences[i][1])

When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.
Major players feel that a big event in late November combined with one in January before the Australian Open will mean too much tennis and too little rest.
Speaking at the Swiss Indoors tournament where he will play in Sundays final against Romanian qualifier Marius Copil, the world number three said that given the impossibly short time frame to make a decision, he opted out of any commitment.
"I felt like the best weeks that I had to get to know players when I was playing were the Fed Cup weeks or the Olympic weeks, not necessarily during the tournaments.
Currently in ninth place, Nishikori with a win could move to within 125 points of the cut for the eight-man event in London 

## Part 2: Machine Translation

In [35]:
from nltk import sent_tokenize
import googletrans
from googletrans import Translator
translator = Translator()

In [53]:
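def translate(data):
    # translate each sentence separately and join the results,
    # so that no sentence is dropped from multi-sentence input
    translated = []
    for tt in sent_tokenize(data):
        translated.append(translator.translate(tt, dest="fr").text)
    return " ".join(translated)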

In [54]:
for i in range(10):
    print(translate(ranked_sentences[i][1]))

Quand je suis sur les tribunaux ou quand je suis sur le jeu de la cour, je suis un compétiteur et je veux battre chaque personne si elles sont dans le vestiaire ou à travers le net.So Je ne suis pas un pour entamer une conversation sur le temps et je sais que dans les prochaines minutes, je dois aller et essayer de gagner un match de tennis.
Les principaux intervenants estiment qu'un grand événement à la fin Novembre combiné avec un en Janvier avant l'Open d'Australie signifie trop le tennis et trop peu de repos.
Prenant la parole lors du tournoi Swiss Indoors où il jouera dimanche finale contre Marius Copil Roumain, le numéro trois mondial a déclaré que, compte tenu du laps de temps incroyablement court pour prendre une décision, il a choisi de tout engagement.
« Je me sentais comme les meilleures semaines que je devais apprendre à connaître les joueurs quand je jouais ont été les semaines de la Fed Cup ou les semaines olympiques, pas nécessairement lors des tournois.
À l'heure actuel

## Method 2

Apply the TextRank algorithm to the source document, then find the corresponding translations in the second language.

### Reading the data

In [55]:
with open('./deu.txt', encoding='utf-8') as f:
    text = f.read()

In [56]:
deu_eng = text.strip().split('\n')
deu_eng = [i.split('\t') for i in deu_eng]
#deu_eng = np.array(deu_eng)

In [67]:
deu_eng

[['Hi.', 'Hallo!'],
 ['Hi.', 'Grüß Gott!'],
 ['Run!', 'Lauf!'],
 ['Wow!', 'Potzdonner!'],
 ['Wow!', 'Donnerwetter!'],
 ['Fire!', 'Feuer!'],
 ['Help!', 'Hilfe!'],
 ['Help!', 'Zu Hülf!'],
 ['Stop!', 'Stopp!'],
 ['Wait!', 'Warte!'],
 ['Go on.', 'Mach weiter.'],
 ['Hello!', 'Hallo!'],
 ['I ran.', 'Ich rannte.'],
 ['I see.', 'Ich verstehe.'],
 ['I see.', 'Aha.'],
 ['I try.', 'Ich probiere es.'],
 ['I won!', 'Ich hab gewonnen!'],
 ['I won!', 'Ich habe gewonnen!'],
 ['Smile.', 'Lächeln!'],
 ['Cheers!', 'Zum Wohl!'],
 ['Freeze!', 'Keine Bewegung!'],
 ['Freeze!', 'Stehenbleiben!'],
 ['Got it?', 'Kapiert?'],
 ['Got it?', 'Verstanden?'],
 ['Got it?', 'Einverstanden?'],
 ['He ran.', 'Er rannte.'],
 ['He ran.', 'Er lief.'],
 ['Hop in.', 'Mach mit!'],
 ['Hug me.', 'Drück mich!'],
 ['Hug me.', 'Nimm mich in den Arm!'],
 ['Hug me.', 'Umarme mich!'],
 ['I fell.', 'Ich fiel.'],
 ['I fell.', 'Ich fiel hin.'],
 ['I fell.', 'Ich stürzte.'],
 ['I fell.', 'Ich bin hingefallen.'],
 ['I fell.', 'Ich bin gestür

In [70]:
# keep the English side of the first 100 sentence pairs
eng = [pair[0] for pair in deu_eng[:100]]

### Text Preprocessing

In [73]:
# remove punctuation, numbers and special characters
# (regex=True is needed in pandas 2.0+, where str.replace matches literally by default)
clean_sentences = pd.Series(eng).str.replace("[^a-zA-Z]", " ", regex=True)

# convert to lowercase
clean_sentences = [s.lower() for s in clean_sentences]

In [74]:
# remove stopwords from the sentences
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]

### Vector representation of sentences

In [75]:
sentence_vectors = []
for i in clean_sentences:
    if len(i) != 0:
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
    else:
        v = np.zeros((100,))
    sentence_vectors.append(v)

### Similarity matrix preparation

In [77]:
# similarity matrix
sim_mat = np.zeros([len(eng), len(eng)])

In [78]:
for i in range(len(eng)):
    for j in range(len(eng)):
        if i != j:
            sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]

### Applying PageRank Algorithm

In [79]:
nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)

### Summary Extraction

In [80]:
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(eng)), reverse=True)

In [81]:
# Extract top 10 sentences as the summary
for i in range(10):
    print(ranked_sentences[i][1])

Get out.
Get out.
Get out!
Get out!
Go on.
Come on.
Come on!
Come on!
Come on!
Come on!
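
Method 2 still needs the lookup of the German counterparts for the ranked English sentences. One way this could be done, sketched below, is to index the parallel corpus by English sentence and skip repeated summary sentences. The helper names `build_translation_index` and `translate_summary` are assumptions for illustration; the toy pairs are taken from the `deu_eng` output shown earlier.

```python
from collections import defaultdict

# A few (English, German) pairs in the same layout as deu_eng
toy_pairs = [
    ["Hi.", "Hallo!"],
    ["Hi.", "Grüß Gott!"],
    ["Run!", "Lauf!"],
    ["Help!", "Hilfe!"],
]

def build_translation_index(pairs):
    """Map each English sentence to all German translations found in the corpus."""
    index = defaultdict(list)
    for pair in pairs:
        english, german = pair[0], pair[1]  # any extra columns are ignored
        index[english].append(german)
    return dict(index)

def translate_summary(summary_sentences, index):
    """Return (English, [German, ...]) tuples, skipping repeated sentences."""
    seen = set()
    result = []
    for sentence in summary_sentences:
        if sentence in seen:
            continue
        seen.add(sentence)
        result.append((sentence, index.get(sentence, [])))
    return result

toy_index = build_translation_index(toy_pairs)
toy_summary = translate_summary(["Hi.", "Hi.", "Run!"], toy_index)
for english, german_options in toy_summary:
    print(english, "->", german_options)
```

With the notebook's variables, this would correspond to building the index from `deu_eng` and passing `[s for _, s in ranked_sentences[:10]]` as the summary. Skipping repeats also avoids the duplicated lines seen in the output above.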
