<a href="https://colab.research.google.com/github/Abhinaay/TextSummarization-and-Classification/blob/master/Text_Summarization_Test3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Implementation of Tf-idf with stemming.**

In [0]:
import numpy as np
import pandas as pd
import nltk
nltk.download('punkt')
import re
import bs4 as bs
import urllib.request

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [0]:
scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Narendra_Modi')
article = scraped_data.read()

In [0]:
parsed_article = bs.BeautifulSoup(article,'lxml')
paragraphs = parsed_article.find_all('p')
article_text = ""
for p in paragraphs:
    article_text += p.text

In [0]:
# Sentence Tokenization
sentences = nltk.sent_tokenize(article_text)

In [0]:
# Create the Frequency matrix of the words in each sentence.
def _create_frequency_matrix(sentences):
    frequency_matrix = {}
    stopWords = set(stopwords.words("english"))
    ps = PorterStemmer()

    for sent in sentences:
        freq_table = {}
        words = word_tokenize(sent)
        for word in words:
            word = word.lower()
            word = ps.stem(word)
            if word in stopWords:
                continue

            if word in freq_table:
                freq_table[word] += 1
            else:
                freq_table[word] = 1

        frequency_matrix[sent[:15]] = freq_table

    return frequency_matrix

In [0]:
# Calculate tf and generate matrix
def _create_tf_matrix(freq_matrix):
    tf_matrix = {}

    for sent, f_table in freq_matrix.items():
        tf_table = {}

        count_words_in_sentence = len(f_table)
        for word, count in f_table.items():
            tf_table[word] = count / count_words_in_sentence

        tf_matrix[sent] = tf_table

    return tf_matrix

In [0]:
#  Creating a table for documents per words
def _create_documents_per_words(freq_matrix):
    word_per_doc_table = {}

    for sent, f_table in freq_matrix.items():
        for word, count in f_table.items():
            if word in word_per_doc_table:
                word_per_doc_table[word] += 1
            else:
                word_per_doc_table[word] = 1

    return word_per_doc_table

In [0]:
# Calculate IDF and generate a matrix
def _create_idf_matrix(freq_matrix, count_doc_per_words, total_documents):
    idf_matrix = {}

    for sent, f_table in freq_matrix.items():
        idf_table = {}

        for word in f_table.keys():
            idf_table[word] = math.log10(total_documents / float(count_doc_per_words[word]))

        idf_matrix[sent] = idf_table

    return idf_matrix

In [0]:
# Calculate TF-IDF and generate a matrix
def _create_tf_idf_matrix(tf_matrix, idf_matrix):
    tf_idf_matrix = {}

    for (sent1, f_table1), (sent2, f_table2) in zip(tf_matrix.items(), idf_matrix.items()):

        tf_idf_table = {}

        for (word1, value1), (word2, value2) in zip(f_table1.items(),
                                                    f_table2.items()):  
            tf_idf_table[word1] = float(value1 * value2)

        tf_idf_matrix[sent1] = tf_idf_table

    return tf_idf_matrix

In [0]:
# Score the sentences
def _score_sentences(tf_idf_matrix) -> dict:
    sentenceValue = {}

    for sent, f_table in tf_idf_matrix.items():
        total_score_per_sentence = 0

        count_words_in_sentence = len(f_table)
        for word, score in f_table.items():
            total_score_per_sentence += score

        sentenceValue[sent] = total_score_per_sentence / count_words_in_sentence

    return sentenceValue

In [0]:
# Find the threshold
def _find_average_score(sentenceValue) -> int:
    
    sumValues = 0
    for entry in sentenceValue:
        sumValues += sentenceValue[entry]

    average = (sumValues / len(sentenceValue))

    return average

In [0]:
# Generate the summary
def _generate_summary(sentences, sentenceValue, threshold):
    sentence_count = 0
    summary = ''

    for sentence in sentences:
        if sentence[:15] in sentenceValue and sentenceValue[sentence[:15]] >= (threshold):
            summary += " " + sentence
            sentence_count += 1

    return summary

In [0]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [0]:
import math

from nltk import sent_tokenize, word_tokenize, PorterStemmer
from nltk.corpus import stopwords    
    
total_documents = len(sentences)

freq_matrix = _create_frequency_matrix(sentences)

tf_matrix = _create_tf_matrix(freq_matrix)

count_doc_per_words = _create_documents_per_words(freq_matrix)

idf_matrix = _create_idf_matrix(freq_matrix, count_doc_per_words, total_documents)

tf_idf_matrix = _create_tf_idf_matrix(tf_matrix, idf_matrix)

sentence_scores = _score_sentences(tf_idf_matrix)

threshold = _find_average_score(sentence_scores)

summary = _generate_summary(sentences, sentence_scores, 1.5 * threshold)
print(summary)

 He is the first prime minister outside of the Indian National Congress to win two consecutive terms with a full majority and the second to complete five years in office after Atal Bihari Vajpayee. He was introduced to the RSS at the age of eight, beginning a long association with the organisation. In 1971 he became a full-time worker for the RSS. Modi was elected to the legislative assembly soon after. He began a high-profile sanitation campaign and weakened or abolished environmental and labour laws. He initiated a controversial demonetisation of high-denomination banknotes. 1920). Shortly afterwards, the RSS was banned. [91] Independent sources put the death toll at over 2000. [85][92] Approximately 150,000 people were driven to refugee camps. [110][111] The Supreme Court gave the matter to the magistrate's court. Zakia Jaffri filed a protest petition in response. His policies during his second term have been credited with reducing corruption in the state. By December 2008, 500,000 