## Text Summarization (Sentence Ranking)

#### STEP 1 : Data Cleaning (removing non-letter characters, converting to lowercase)
#### STEP 2 : Building Sentence Similarity Matrix
#### STEP 3 : Sentence Ranking
#### STEP 4 : Summary Generation

## Initial Phase
### Importing Libraries and Reading Data

In [1]:
import re

import networkx as nx
import nltk
import numpy as np
import pandas as pd
from nltk.cluster.util import cosine_distance
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize

In [2]:
df = pd.read_csv('tennis_articles_v4.csv')
df['article_text']

0    Maria Sharapova has basically no friends as te...
1    BASEL, Switzerland (AP), Roger Federer advance...
2    Roger Federer has revealed that organisers of ...
3    Kei Nishikori will try to end his long losing ...
4    Federer, 37, first broke through on tour over ...
5    Nadal has not played tennis since he was force...
6    Tennis giveth, and tennis taketh away. The end...
7    Federer won the Swiss Indoors last week by bea...
Name: article_text, dtype: object

In [3]:
# Quick check of the cleaning regex: every non-letter character becomes a space
s = 'he&&&s'
s = re.sub("[^a-zA-Z]", " ", s)  # 'he   s'

## STEP 1 : Data Cleaning
### Cleaning sentences by removing non-alphabetic characters and converting to lowercase

In [4]:
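# Maps each cleaned sentence back to its original (lower-cased) form
dict = {}

# Concatenate all articles into one lower-cased string and split it into sentences
text = "".join(df['article_text']).lower()
sentences = sent_tokenize(text)
final = []

for sentence in sentences:
    # Replace every non-letter character with a space
    temp = re.sub("[^a-zA-Z]", " ", sentence)
    final.append(temp)
    dict[temp] = sentence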
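
The cleaning step can be tried on a small in-memory example. The sketch below is an illustration only: it swaps `sent_tokenize` for a naive regex split so that it runs without NLTK data, and `clean_sentences` and `sample` are hypothetical names that do not appear in the notebook.

```python
import re

def clean_sentences(text):
    """Return (cleaned, mapping), mirroring the cleaning cell above.

    Assumption: sentences are split with a naive regex instead of
    nltk's sent_tokenize, which is good enough for this toy input.
    """
    raw = [p for p in re.split(r"(?<=[.!?])\s+", text.lower()) if p]
    cleaned, mapping = [], {}
    for sentence in raw:
        # Replace every non-letter character with a space
        letters_only = re.sub("[^a-zA-Z]", " ", sentence)
        cleaned.append(letters_only)
        mapping[letters_only] = sentence
    return cleaned, mapping

sample = "Federer won in 2018. Nadal didn't play!"
cleaned, mapping = clean_sentences(sample)
print(cleaned)
```

Keeping the mapping from cleaned text to original text makes it possible to rank on the cleaned form while still displaying the readable sentence later.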

## STEP 2 : Building Sentence Similarity Matrix
### Similarity is the cosine similarity between the vector representations of two sentences

In [5]:
def sentence_similarity(sent1, sent2, stopwords=None):
    if stopwords is None:
        stopwords = []

    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]

    all_words = list(set(sent1 + sent2))

    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)

    # build the vector for the first sentence
    for w in sent1:
        if w in stopwords:
            continue
        vector1[all_words.index(w)] += 1

    # build the vector for the second sentence
    for w in sent2:
        if w in stopwords:
            continue
        vector2[all_words.index(w)] += 1

    return 1 - cosine_distance(vector1, vector2)

def build_similarity_matrix(sentences, stop_words):
    # Create an empty similarity matrix
    similarity_matrix = np.zeros((len(sentences), len(sentences)))

    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 == idx2: #ignore if both are same sentences
                continue
            similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], stop_words)
    return similarity_matrix
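
`sentence_similarity` iterates over its arguments element by element, so it compares words when given lists of tokens and characters when given raw strings. The sketch below reimplements the same count-vector cosine with the standard library only, to make that difference visible. `bow_cosine` is a hypothetical helper, not part of the notebook or of NLTK, and returning 0.0 for an empty vector is an assumption of this sketch.

```python
import math
from collections import Counter

def bow_cosine(items1, items2, stop_words=()):
    """Cosine similarity of the count vectors of two sequences."""
    c1 = Counter(x for x in items1 if x not in stop_words)
    c2 = Counter(x for x in items2 if x not in stop_words)
    dot = sum(c1[k] * c2[k] for k in c1)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: treat an empty vector as "no similarity"
    return dot / (norm1 * norm2)

a = "federer won the final"
b = "nadal lost the final"

word_level = bow_cosine(a.split(), b.split())  # compares words
char_level = bow_cosine(a, b)                  # compares single characters
print(word_level, char_level)
```

Character-level vectors share many letters and spaces across almost any pair of English sentences, so their similarities tend to sit close together; word-level vectors usually separate related and unrelated sentences more clearly.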

## STEP 3 : Sentence Ranking
### Sentences are ranked by running the PageRank algorithm on the graph built from the sentence similarity matrix

In [6]:
# Step 2 - Generate the similarity matrix across sentences
sentence_similarity_matrix = build_similarity_matrix(final, '')

# Step 3 - Rank sentences with PageRank on the similarity graph
sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
scores = nx.pagerank(sentence_similarity_graph)

# Step 4 - Sort sentences by score, highest first
ranked_sentence = sorted(((scores[i],s) for i,s in enumerate(final)), reverse=True)
print("Indexes of top ranked_sentence order are ", ranked_sentence)

# Step 5 - Output the summarized text
summarized_sentences = [sentence for importance, sentence in ranked_sentence]
summarize_text = ". ".join(summarized_sentences)
print('Summarize Text:\n', summarize_text)

Indexes of top ranked_sentence order are  [(0.009324689801439471, 'argentina and britain received wild cards to the new look event  and will compete along with the four      semi finalists and the    teams who win qualifying rounds next february '), (0.009320168305270129, 'the competition is set to feature    countries in the november       finals in madrid next year  and will replace the classic home and away ties played four times per year for decades '), (0.009317507906421309, ' nadal has not played tennis since he was forced to retire from the us open semi finals against juan martin del porto with a knee injury '), (0.009314905907429172, ' not always  but i really feel like in the mid      years there was a huge shift of the attitudes of the top players and being more friendly and being more giving  and a lot of that had to do with players like roger coming up '), (0.00930520140708298, 'but with the atp world tour finals due to begin next month  nadal is ready to prove his fitness 
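
To show what the ranking step does with a similarity matrix, here is a simplified power-iteration PageRank in plain Python. It is a sketch of the idea, not a reproduction of `nx.pagerank`: the function name `pagerank_scores`, the damping factor of 0.85, the tolerance, the handling of nodes without outgoing weight, and the toy matrix are all assumptions chosen for illustration.

```python
def pagerank_scores(matrix, damping=0.85, max_iter=100, tol=1e-6):
    """Simplified weighted PageRank by power iteration.

    matrix[i][j] is the edge weight from node i to node j.
    Assumption: a node with no outgoing weight spreads its score evenly.
    """
    n = len(matrix)
    out_weight = [sum(row) for row in matrix]
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        new_scores = []
        for j in range(n):
            incoming = 0.0
            for i in range(n):
                if out_weight[i] > 0:
                    incoming += scores[i] * matrix[i][j] / out_weight[i]
                else:
                    incoming += scores[i] / n
            new_scores.append((1 - damping) / n + damping * incoming)
        converged = sum(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores

# Toy similarity matrix: sentence 1 is strongly similar to both others.
toy = [
    [0.0, 0.8, 0.1],
    [0.8, 0.0, 0.7],
    [0.1, 0.7, 0.0],
]
scores = pagerank_scores(toy)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

A sentence scores highly when it is similar to many other sentences that themselves score highly, which is why the top-ranked sentences act as a summary of the recurring content.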

## STEP 4 : Summary Generation
### The summary consists of the 10 top-ranked sentences

In [7]:
# Collect the sentences that make up the summary
summarize_text = []

# Print the original (uncleaned) text of the top-ranked sentence
print(dict[ranked_sentence[0][1]])

# Append the top 10 cleaned sentences to the summarize_text list
for i in range(10):
    summarize_text.append(ranked_sentence[i][1])

# Print the summarized text
print('Summarize Text:\n', ". ".join(summarize_text))

argentina and britain received wild cards to the new-look event, and will compete along with the four 2018 semi-finalists and the 12 teams who win qualifying rounds next february.
Summarize Text:
 argentina and britain received wild cards to the new look event  and will compete along with the four      semi finalists and the    teams who win qualifying rounds next february . the competition is set to feature    countries in the november       finals in madrid next year  and will replace the classic home and away ties played four times per year for decades .  nadal has not played tennis since he 
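
A common variation, not used in this notebook, is to present the selected sentences in their original document order rather than in rank order, which tends to read more naturally. The sketch below assumes a `scores` mapping from sentence index to score, shaped like the one returned by `nx.pagerank`; `top_k_in_order` and the sample data are hypothetical.

```python
def top_k_in_order(sentences, scores, k):
    """Pick the k highest-scoring sentences, then restore document order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:k])
    return [sentences[i] for i in chosen]

sentences = ["first point.", "minor aside.", "key result.", "closing remark."]
scores = {0: 0.30, 1: 0.10, 2: 0.45, 3: 0.15}  # made-up scores for illustration

summary = top_k_in_order(sentences, scores, k=2)
print(" ".join(summary))
```

Passing the original sentences instead of the cleaned ones would keep punctuation and digits in the final summary.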