# Text Summarizer: Producing a concise and fluent summary while preserving key information and overall meaning

Extractive summarization builds a summary by selecting the subset of an article's sentences that best retain its most important points. Each sentence is given a weight, and the highest-weighted sentences are reused verbatim to form the summary.
Sentences are weighted and ranked by how similar they are to the other sentences in the text. Cosine similarity between word-count vectors is the similarity measure used here.
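
To make the similarity measure concrete, here is a minimal sketch of cosine similarity on word counts, written with the standard library only. The function name `cosine_similarity` and the two example sentences are illustrative assumptions and are not part of the notebook, which relies on `nltk.cluster.util.cosine_distance` instead.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists, using raw word counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Only words present in both sentences contribute to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty sentence is similar to nothing
    return dot / (norm_a * norm_b)

s1 = "the mask is useful indoors".split()
s2 = "the mask is mandatory outdoors".split()
print(round(cosine_similarity(s1, s2), 2))  # 3 shared words, 5 words each -> 0.6
```

Two sentences with identical word counts score 1, and two sentences with no word in common score 0.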

In [155]:
import nltk
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
import numpy as np
import networkx as nx
import codecs

In [156]:
def read_article(file_name):
    # utf-8 decoding handles accented characters; "with" closes the file after reading
    with codecs.open(file_name, "r", encoding='utf-8') as file:
        filedata = file.readlines()
    article = filedata[0].split(". ")    # only the first line of the file is used; sentences are split on ". "
    
    sentences = []
    for sentence in article:             # iterate through sentences and build the list of words for each one
        #print(sentence)
        # NOTE: str.replace() treats "[^a-zA-Z]" as a literal string, not a regex,
        # so this call leaves the text unchanged (punctuation and accents are kept)
        sentences.append(sentence.replace("[^a-zA-Z]", " ").split(" "))
    #sentences.pop()   # optional: systematically drop the last sentence of the text?
    
    return sentences
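
Splitting on ". " misses sentences that end with "!" or "?". As a hedged alternative, the sketch below splits with a regular expression and extracts word tokens separately. The helpers `split_sentences` and `tokenize` are hypothetical names introduced for illustration, and switching to them would change the sentence counts printed further down.

```python
import re

def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    # The lookbehind keeps the punctuation attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Return the word tokens of a sentence, dropping punctuation."""
    # \w is Unicode-aware in Python 3, so accented letters are preserved.
    return re.findall(r"\w+", sentence)

text = "Le port du masque est utile. Est-il obligatoire ? Oui, à Paris !"
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

This keeps accented characters intact, which matters for the French articles summarized below.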

In [34]:
def sentence_similarity(sentence_1, sentence_2, stopwords=None):
    if stopwords is None:
        stopwords = []     # create an empty list to avoid error below
 
    sentence_1 = [w.lower() for w in sentence_1]
    sentence_2 = [w.lower() for w in sentence_2]

    all_words = list(set(sentence_1 + sentence_2))  # create total vocabulary of unique words for the two sentences compared

    vector1 = [0] * len(all_words)                  # prepare word-count (bag-of-words) vectors for each sentence over the shared vocab
    vector2 = [0] * len(all_words)

    # build the vector for the first sentence
    for w in sentence_1:
        if w in stopwords:
            continue 
        vector1[all_words.index(w)] += 1           # list.index(element) returns the index of the given element in the list

    # build the vector for the second sentence
    for w in sentence_2:
        if w in stopwords:
            continue
        vector2[all_words.index(w)] += 1

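    if not any(vector1) or not any(vector2):
        return 0.0                                 # a sentence made only of stop words has a zero vector; avoid dividing by zero
    return 1 - cosine_distance(vector1, vector2)   # cosine distance = 0 for identical word counts => returns 1 if perfectly similar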

In [35]:
def build_similarity_matrix(sentences, stop_words):
    # Create an empty similarity matrix
    similarity_matrix = np.zeros((len(sentences), len(sentences)))  # create a square matrix with dim the num of sentences
 
    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 == idx2: #ignore if both are same sentences (diagonal of the square matrix)
                continue
            # similarity of each sentence to all other sentences in the text is measured and logged in the matrix
            similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], stop_words)

    return similarity_matrix
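
Cosine similarity is symmetric, so each pair of sentences only needs to be compared once. The sketch below fills the upper triangle and mirrors it. `build_symmetric_matrix` is a hypothetical name, and `jaccard` is a stand-in similarity used only to keep the example self-contained; it is not the measure used in this notebook.

```python
def build_symmetric_matrix(sentences, similarity):
    """Fill an n x n similarity matrix, computing each pair only once."""
    n = len(sentences)
    matrix = [[0.0] * n for _ in range(n)]     # diagonal stays at 0, as in the notebook
    for i in range(n):
        for j in range(i + 1, n):              # upper triangle only
            score = similarity(sentences[i], sentences[j])
            matrix[i][j] = score
            matrix[j][i] = score               # mirror into the lower triangle
    return matrix

def jaccard(a, b):
    """Stand-in similarity: share of distinct words the two sentences have in common."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

sentences = [["masks", "help"], ["masks", "work"], ["cats", "sleep"]]
for row in build_symmetric_matrix(sentences, jaccard):
    print(row)
```

For a text with n sentences this performs n(n-1)/2 comparisons instead of n(n-1).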

In [175]:
def generate_summary(file_name, top_n=5, show=False):
    # The stop-word language is hard-coded; switch to the commented line for English texts
    #stop_words = stopwords.words('english')
    stop_words = stopwords.words('french')
    summarize_text = []
    
    # Step 1 - Read text and tokenize
    sentences =  read_article(file_name)
    print("number of sentences in text : ", len(sentences))
    
    # Step 2 - Generate similarity matrix across sentences
    sentence_similarity_matrix = build_similarity_matrix(sentences, stop_words)
    
    # Step 3 - Rank sentences. The similarity matrix is converted into a graph whose nodes
    # represent the sentences and whose edge weights are the similarity scores between them.
    # The PageRank algorithm is then applied to this graph to obtain the sentence rankings.
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
    scores = nx.pagerank(sentence_similarity_graph)
    
    # Step 4 - Sort the sentences by PageRank score, highest first
    ranked_sentence = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)
    if show :
        print("Ranked sentences (score, words), best first: ", ranked_sentence)
    # extract the top N sentences for the summary (never more than the text contains)
    for i in range(min(top_n, len(ranked_sentence))):
        summarize_text.append(" ".join(ranked_sentence[i][1]))
    
    # Step 5 - Output the summarized text
    print("Summarize Text: \n", ". ".join(summarize_text)+'.')
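
To show what the ranking step does with the similarity matrix, here is a minimal power-iteration sketch of weighted PageRank in plain Python. It is an illustration under stated assumptions, not the `networkx` implementation: the function name `pagerank_scores`, the damping factor of 0.85, the tolerance, and the 3 x 3 matrix are all chosen for the example.

```python
def pagerank_scores(matrix, damping=0.85, tol=1e-8, max_iter=100):
    """Rank nodes of a weighted graph given as a square similarity matrix."""
    n = len(matrix)
    row_sums = [sum(row) for row in matrix]
    scores = [1.0 / n] * n                     # start from a uniform distribution
    for _ in range(max_iter):
        # Rank held by isolated sentences (all-zero rows) is shared evenly.
        dangling = sum(scores[i] for i in range(n) if row_sums[i] == 0)
        new_scores = []
        for j in range(n):
            # Each sentence passes its rank to neighbours in proportion to similarity.
            inflow = sum(scores[i] * matrix[i][j] / row_sums[i]
                         for i in range(n) if row_sums[i] > 0)
            new_scores.append((1 - damping) / n + damping * (inflow + dangling / n))
        converged = sum(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores

similarity = [
    [0.0, 0.8, 0.1],
    [0.8, 0.0, 0.3],
    [0.1, 0.3, 0.0],
]
scores = pagerank_scores(similarity)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)               # [1, 0, 2]: sentence 1 is the most similar to the others
print(sorted(ranking[:2]))   # [0, 1]: top 2 sentences restored to document order
```

Sorting the selected indexes before joining the sentences, as in the last line, is one way to keep the summary in the order of the original text; the notebook itself outputs sentences in rank order.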

In [176]:
# let's begin
generate_summary( "covid.txt", 5, False)

number of sentences in text :  36
Summarize Text: 
 Le port du masque n’est donc pas utile. Rendre obligatoire le port du masque dans certains lieux extérieurs, relève sans doute plus du principe de précaution que d’une exigence scientifique. À l’instar de Paris, de plus en plus de municipalités rendent obligatoire le port du masque dans les rues et les zones les plus densément occupées. Après Lille, Nice ou encore Toulouse, c’est désormais au tour de Paris de rejoindre la liste grandissante des municipalités rendant obligatoire le port du masque dans certaines rues et certains quartiers. En revanche, sur une terrasse bondée ou lors d’un rassemblement festif très dense, le port du masque est à recommander.


In [177]:
generate_summary( "dgse.txt", 3)

number of sentences in text :  22
Summarize Text: 
 Le 28 juillet, les trois hommes sont mis en examen pour «tentative d'homicide volontaire en bande organisée», «recel en bande organisée de vol, transport, acquisition, détention d'armes de catégorie B en réunion» et «association de malfaiteurs en vue de la commission de crimes et délits punis de 10 ans d'emprisonnement». Selon le parquet de Paris, les deux jeunes militaires arrêtés le 24 juillet à Créteil (Val-de-Marne) semblaient viser une femme de 54 ans. Les 30 et 31 juillet, deux autres hommes sont à leur tour placés en garde à vue.


In [178]:
generate_summary( "dreyfus.txt", 3)

number of sentences in text :  21
Summarize Text: 
 Malgré les menées de l'armée pour étouffer cette affaire, le premier jugement condamnant Dreyfus est cassé par la Cour de cassation au terme d'une enquête minutieuse, et un nouveau conseil de guerre a lieu à Rennes en 1899. À cette date, l'opinion comme la classe politique française est unanimement défavorable à Dreyfus. Le même mois, Mathieu Dreyfus porte plainte auprès du ministère de la Guerre contre Walsin Esterhazy.


In [179]:
generate_summary( "summarize.txt", 3)

number of sentences in text :  11
Summarize Text: 
 The limited study is available for abstractive summarization as it requires a deeper understanding of the text as compared to the extractive approach. It’s good to understand Cosine similarity to make the best use of code you are going to see. Since we will be representing our sentences as the bunch of vectors, we can use it to find the similarity among sentences.


In [190]:
generate_summary( "maria.txt", 5)

number of sentences in text :  16
Summarize Text: 
 So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match. I have not a lot of friends away from the courts.' When she said she is not really close to a lot of players, is that something strategic that she is doing? Is it different on the men's tour than the women's tour? 'No, not at all. I think everyone just thinks because we're tennis players we should be the greatest of friends. When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net. I think just because you're in the same sport doesn't mean that you have to be friends with everyone just because you're categorized, you're a tennis player, so you're going to get along with tennis players.
