<b> Text Summarization using TFIDF
============================
    
### From Leaf by Niggle - JRR Tolkien
    
#### Automatic text summarization is the task of producing a **concise and fluent summary** without any human help while preserving the meaning of the original text document.

####  There are two approaches to summarization: Extractive and Abstractive

#### **Extractive:** selects important parts  of the text to produce a reduced version

####  **Abstractive:** aims to produce a summary by interpreting the text with advanced natural language techniques in order to generate a new, shorter text

####  This notebook performs **Extractive Summarization** using **TFIDF** on the **Leaf by Niggle** text.
    
#### Two forms of summarization are used: a manual calculation of each step, and the Scikit-Learn `TfidfVectorizer`

## <b> TFIDF stands for: Term Frequency - Inverse Document Frequency
    
    This is a technique to quantify the words in a set of documents: each word is assigned a weight that signifies its importance in the document and in the corpus.
    
    
   **Document**: It can be a phrase, a text file, a pandas row, etc. In this notebook, each sentence is treated as a document.
    
   **TF (Term Frequency)**: Frequency of a word in a document (term T in document D)
    
    *TF(T,D) = count of T in D/ number of words in D*
   
   **DF (Document Frequency)**: The number of documents in the corpus that contain term T (out of N documents in total)
   
    *DF(T) = number of documents containing T*
   
   **IDF (Inverse Document Frequency)**: IDF is the inverse of the document frequency which measures the informativeness of term T.
    
    *IDF(T) = log(N/(DF+1))*
    
   **The whole expression is:**
    
    
    TF-IDF(T, D) = TF(T, D) * log(N/(DF + 1))
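
As a quick check of the expression above, here is a minimal sketch that evaluates it for a single term. The function name `tf_idf` and the input numbers are hypothetical, and the base-10 logarithm is an assumption chosen to match the `math.log10` call used later in this notebook; the formula itself does not fix a base.

```python
import math

def tf_idf(term_count, doc_length, n_documents, doc_frequency):
    """Evaluate TF-IDF(T, D) = TF(T, D) * log(N / (DF + 1)) with a base-10 log."""
    tf = term_count / doc_length                         # TF(T, D)
    idf = math.log10(n_documents / (doc_frequency + 1))  # IDF(T)
    return tf * idf

# Hypothetical inputs: a term seen twice in a 15-word sentence,
# present in 20 of 591 sentences.
weight = tf_idf(term_count=2, doc_length=15, n_documents=591, doc_frequency=20)
print(weight)
```

A term that appears in almost every document gets an IDF near zero (or slightly negative, because of the `+ 1`), so its weight stays small no matter how often it occurs in one sentence.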
    
    

In [13]:
#-----------------------------------------------------
# Libraries
#-----------------------------------------------------

# Python 
import json
import math

import numpy as np
import pandas as pd
import networkx as nx
from src import DataProcessing  # project-local module

# NLP
import spacy
from nltk import sent_tokenize, word_tokenize
from nltk.cluster.util import cosine_distance
from nltk.corpus import stopwords

nlp = spacy.load("en_core_web_sm")
# Use a distinct name so the nltk `stopwords` corpus reader is not shadowed
stop_words = set(stopwords.words('english'))

# ML
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

### <b> Load Data

In [14]:
file_path = '../datasets/leaf_by_niggle.txt'
# The context manager closes the file once the text has been read
with open(file_path, 'r') as file:
    poem = file.read()

# <b> Summarization

### Tokenize Sentences

In [15]:
tokenized_sentences = sent_tokenize(poem)
documents_total = len(tokenized_sentences)
first_sentence = tokenized_sentences[0]
print('\n Number of Documents = ', documents_total)
print('\n First Sentence = ', first_sentence)


 Number of Documents =  591

 First Sentence =  There was once a little man called Niggle, who had a long journey to make.


## <b> Manual Summarization

### Frequency Matrix of the words in each sentence

In [4]:
def generate_words_frequency_matrix(tokenized_sentences):
    
    frequency_matrix = {}

    for sentence in tokenized_sentences:
        words_matrix = {}
        words = sentence.split()
        for word in words:
            if word not in words_matrix:
                words_matrix[word] = 1
            else:
                words_matrix[word] +=1
        frequency_matrix[sentence] = words_matrix
                
    return frequency_matrix

words_frequency_matrix =  generate_words_frequency_matrix(tokenized_sentences)

print("\n Original sentence = ", first_sentence)
print("\n Words frequency in sentence = ", words_frequency_matrix[first_sentence])


 Original sentence =  There was once a little man called Niggle, who had a long journey to make.

 Words frequency in sentence =  {'There': 1, 'was': 1, 'once': 1, 'a': 2, 'little': 1, 'man': 1, 'called': 1, 'Niggle,': 1, 'who': 1, 'had': 1, 'long': 1, 'journey': 1, 'to': 1, 'make.': 1}
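
The same per-sentence counts can also be produced with `collections.Counter` from the standard library. This is a minimal sketch, not the notebook's own function; the name `words_frequency_with_counter` is hypothetical, and it keeps the same whitespace splitting, so punctuation stays attached to words (`'Niggle,'`, `'make.'`).

```python
from collections import Counter

def words_frequency_with_counter(sentences):
    """Map each sentence to a dict of word -> count, splitting on whitespace."""
    return {sentence: dict(Counter(sentence.split())) for sentence in sentences}

sample = ["There was once a little man called Niggle, who had a long journey to make."]
freq = words_frequency_with_counter(sample)
print(freq[sample[0]])
```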


###  Term Frequency Matrix

    TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

In [5]:
def generate_term_frequency_matrix(words_frequency_matrix):
    
    terms_frequency_matrix = {}

    for document, terms_frequency_dict in words_frequency_matrix.items():   

        tf_matrix = {}        
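        # Note: len() counts the distinct terms in the sentence, not the total
        # number of words (14 rather than 15 for the first sentence)
        number_of_terms_in_document = len(terms_frequency_dict)    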

        for term_t, number_term_t_in_document in terms_frequency_dict.items():        
            tf_matrix[term_t] = number_term_t_in_document/number_of_terms_in_document
        terms_frequency_matrix[document] = tf_matrix
        
    return terms_frequency_matrix

terms_frequency_matrix = generate_term_frequency_matrix(words_frequency_matrix)

print("\n Original sentence = ", first_sentence)
print("\n Words frequency in sentence = ", words_frequency_matrix[first_sentence])
print("\n TF for words in each sentence = ", terms_frequency_matrix[first_sentence])


 Original sentence =  There was once a little man called Niggle, who had a long journey to make.

 Words frequency in sentence =  {'There': 1, 'was': 1, 'once': 1, 'a': 2, 'little': 1, 'man': 1, 'called': 1, 'Niggle,': 1, 'who': 1, 'had': 1, 'long': 1, 'journey': 1, 'to': 1, 'make.': 1}

 TF for words in each sentence =  {'There': 0.07142857142857142, 'was': 0.07142857142857142, 'once': 0.07142857142857142, 'a': 0.14285714285714285, 'little': 0.07142857142857142, 'man': 0.07142857142857142, 'called': 0.07142857142857142, 'Niggle,': 0.07142857142857142, 'who': 0.07142857142857142, 'had': 0.07142857142857142, 'long': 0.07142857142857142, 'journey': 0.07142857142857142, 'to': 0.07142857142857142, 'make.': 0.07142857142857142}
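
The output shows `'a'` at 2/14 because the function divides by the number of distinct terms. If the denominator should instead be the total number of words, as the formula states, a variant could look like the sketch below. The function name `term_frequency_total` is hypothetical, and its results would differ slightly from the values printed above.

```python
def term_frequency_total(words_frequency_matrix):
    """TF = count of the term / total number of words in the document."""
    tf = {}
    for document, counts in words_frequency_matrix.items():
        total_terms = sum(counts.values())  # total words, repeats included
        tf[document] = {term: count / total_terms for term, count in counts.items()}
    return tf

# Toy input: one document with four words, one of them repeated.
toy_counts = {"doc": {"a": 2, "long": 1, "journey": 1}}
toy_tf = term_frequency_total(toy_counts)
print(toy_tf["doc"])
```

With this denominator the TF values of each document sum to 1.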


### Inverse Document Frequency

    IDF(t) = log_10(Total number of documents / (Number of documents with term t in it + 1))

In [6]:
def generate_inverse_document_frequency(words_frequency_matrix):


    N = len(words_frequency_matrix)
    idf = {}

    for doc, word_dict in words_frequency_matrix.items():       
        for word, count in word_dict.items():        
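            val = 0        
            for sent in tokenized_sentences:                
                # Substring test on the raw sentence: 'a' also matches inside other words
                if word in sent:                
                    val+=1       
            # Base-10 log; the +1 keeps the denominator non-zero
            idf[word] = math.log10(N / (val + 1))     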
                
    return idf

inverse_document_frequency = generate_inverse_document_frequency(words_frequency_matrix)

print("\n IDF for word There = ", inverse_document_frequency['There'])


 IDF for word There =  1.4419568376564116
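
Because `word in sent` is a substring test on the raw sentence, short words are counted in every sentence that merely contains those letters. A token-level alternative is sketched below, assuming document frequency should count whole words only. The function name `inverse_document_frequency_tokens` is hypothetical, and the values it produces would not match the one printed above.

```python
import math

def inverse_document_frequency_tokens(words_frequency_matrix):
    """IDF(t) = log10(N / (DF(t) + 1)), where DF counts documents holding t as a whole token."""
    n_documents = len(words_frequency_matrix)
    document_frequency = {}
    for counts in words_frequency_matrix.values():
        for term in counts:  # each term appears once per document dict
            document_frequency[term] = document_frequency.get(term, 0) + 1
    return {
        term: math.log10(n_documents / (df + 1))
        for term, df in document_frequency.items()
    }

# Toy corpus of three documents.
toy_matrix = {
    "d1": {"cat": 1, "sat": 1},
    "d2": {"cat": 2},
    "d3": {"dog": 1},
}
toy_idf = inverse_document_frequency_tokens(toy_matrix)
print(toy_idf)
```

This version also makes a single pass over the corpus instead of rescanning every sentence for every word.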


### TFIDF

In [7]:
def generate_tfidf(terms_frequency_matrix, inverse_document_frequency):
    
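    # Note: the result is keyed by word only, so a word that occurs in several
    # sentences keeps the value computed for the last sentence containing it
    tfidf = {}
    for words, tf_dict in terms_frequency_matrix.items():
        for word, score in tf_dict.items():
            tfidf[word] = score * inverse_document_frequency[word]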
            
    return tfidf

tfidf = generate_tfidf(terms_frequency_matrix, inverse_document_frequency)

In [8]:
print("\n TFIDF for word There = ", tfidf['There'])


 TFIDF for word There =  0.09613045584376077
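
TF depends on the sentence, so a word generally has a different TFIDF weight in each sentence it appears in. A minimal sketch that keeps one weight per (sentence, word) pair is shown below; the function name `tfidf_per_document` and the toy inputs are hypothetical.

```python
def tfidf_per_document(terms_frequency_matrix, inverse_document_frequency):
    """Return {document: {term: tf * idf}} so each document keeps its own weights."""
    return {
        document: {
            term: tf * inverse_document_frequency[term]
            for term, tf in tf_dict.items()
        }
        for document, tf_dict in terms_frequency_matrix.items()
    }

# Toy inputs: "cat" has a different TF in each document.
toy_tf = {"s1": {"cat": 0.5, "sat": 0.5}, "s2": {"cat": 1.0}}
toy_idf = {"cat": 0.1, "sat": 0.3}
toy_tfidf = tfidf_per_document(toy_tf, toy_idf)
print(toy_tfidf)
```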


### Score Sentences and Find a Threshold

In [9]:
def generate_score_sentences(tokenized_sentences, tfidf):

    scored_sentence = {}

    for sent in tokenized_sentences:
        score = 0
        for word in word_tokenize(sent):
            # Counts tokens present in the TFIDF vocabulary; the weights themselves are not summed
            if word in tfidf:
                score += 1

        scored_sentence[sent] = score
        
    return scored_sentence

scored_sentence = generate_score_sentences(tokenized_sentences, tfidf)

def generate_threshold(scored_sentence, total):
    # Total score divided by 85% of the sentence count, i.e. slightly above the mean score
    sum_score = sum(scored_sentence.values())
    
    return np.round(sum_score/(total-(total*0.15)))    

threshold =  generate_threshold(scored_sentence, documents_total)

In [10]:
print("\n Original sentence = ", first_sentence)
print('\n Scored sentence = ', scored_sentence[first_sentence])
print('\n Threshold = ', threshold)


 Original sentence =  There was once a little man called Niggle, who had a long journey to make.

 Scored sentence =  16

 Threshold =  15.0
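
The scores above count how many tokens of a sentence appear in the vocabulary, so longer sentences tend to score higher. If the score should reflect the TFIDF weights themselves, one option is to sum the weights of each sentence and use the mean as the threshold. The sketch below assumes per-sentence weights in the form `{sentence: {term: weight}}`; the function names and toy values are hypothetical, and this is not the scoring that produced the output above.

```python
def score_sentences_by_weight(tfidf_by_document):
    """Score each sentence as the sum of the TFIDF weights of its terms."""
    return {
        document: sum(weights.values())
        for document, weights in tfidf_by_document.items()
    }

def mean_threshold(scores):
    """Use the average sentence score as the cut-off."""
    return sum(scores.values()) / len(scores)

# Toy per-sentence weights.
toy_weights = {"s1": {"cat": 0.05, "sat": 0.15}, "s2": {"cat": 0.1}}
toy_scores = score_sentences_by_weight(toy_weights)
toy_threshold = mean_threshold(toy_scores)
toy_selected = [s for s, score in toy_scores.items() if score >= toy_threshold]
print(toy_scores, toy_threshold, toy_selected)
```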


### Summary

In [11]:
def generate_summary_text(scored_sentence, threshold):
    summary = ' '
    for sent, score in scored_sentence.items():
        if score >= threshold:
            summary += sent + ' '
            
    return summary

summary = generate_summary_text(scored_sentence, threshold)
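
A threshold controls the summary length only indirectly. When a fixed number of sentences is wanted, an alternative is to keep the `n` highest-scoring sentences and emit them in their original order. This is a minimal sketch; the function name `summarize_top_n` is hypothetical, and it relies on dictionaries preserving insertion order (Python 3.7+).

```python
def summarize_top_n(scored_sentence, n):
    """Keep the n best-scoring sentences, joined in their original order."""
    top = set(sorted(scored_sentence, key=scored_sentence.get, reverse=True)[:n])
    return ' '.join(sent for sent in scored_sentence if sent in top)

# Toy scores: the second sentence is the weakest and gets dropped.
toy_scored = {"A.": 3, "B.": 1, "C.": 2}
toy_summary = summarize_top_n(toy_scored, 2)
print(toy_summary)
```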

## <b> Summarization Using TFIDF from Scikit-Learn

In [16]:
vectorizer = TfidfVectorizer()

vectors = vectorizer.fit_transform(tokenized_sentences)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = vectorizer.get_feature_names_out()
dense = vectors.todense()
dense_list = dense.tolist()

df = pd.DataFrame(dense_list, columns=feature_names)

In [19]:
df.sample(5)

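|  | able | about | absolutely | absorbed | ache | acquaintances | actually | added | adjoining | advice | ... | written | wrong | year | years | yellow | yes | yet | you | young | your |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 302 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 113 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 545 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 45 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.223418 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 359 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |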


In [20]:
# `word in df` tests membership in the DataFrame's column labels (the lowercased vocabulary)
scored_sentence = generate_score_sentences(tokenized_sentences, df)

In [21]:
threshold =  generate_threshold(scored_sentence, documents_total)

In [22]:
print("\n Original sentence = ", first_sentence)
print('\n Scored sentence = ', scored_sentence[first_sentence])
print('\n Threshold = ', threshold)


 Original sentence =  There was once a little man called Niggle, who had a long journey to make.

 Scored sentence =  11

 Threshold =  13.0


In [23]:
skl_summary = generate_summary_text(scored_sentence, threshold)
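
The TFIDF matrix from `TfidfVectorizer` already holds one row of weights per sentence, so a sentence score can also be taken as the sum of its row. The sketch below is one possible approach, not the scoring used for the results in this notebook: the three sample sentences are taken from the opening of the text, the mean is used as the threshold by assumption, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

sample_sentences = [
    "There was once a little man called Niggle, who had a long journey to make.",
    "Niggle was a painter.",
    "The laws in his country were rather strict.",
]

sample_vectorizer = TfidfVectorizer()
sample_matrix = sample_vectorizer.fit_transform(sample_sentences)  # sparse, one row per sentence

# Sum the weights of each row; np.asarray flattens the (n, 1) matrix result.
row_scores = np.asarray(sample_matrix.sum(axis=1)).ravel()
row_threshold = row_scores.mean()

selected = [s for s, score in zip(sample_sentences, row_scores) if score >= row_threshold]
print(row_scores, row_threshold)
print(selected)
```

With the default settings each row is L2-normalized, which dampens, but does not remove, the advantage of longer sentences.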

## <b> Results

In [24]:
print("Original text\n") 
print('Text length = ', len(poem))
print('\n')
poem[0:2000]

Original text

Text length =  39470




'There was once a little man called Niggle, who had a long journey to make. He did not want to go, indeed the whole idea was distasteful to him; but he could not get out of it. He knew he would have to start some time, but he did not hurry with his preparations.\n\nNiggle was a painter. Not a very successful one, partly because he had many other things to do. Most of these things he thought were a nuisance; but he did them fairly well, when he could not get out of them: which (in his opinion) was far too often. The laws in his country were rather strict. There were other hindrances, too. For one thing, he was sometimes just idle, and did nothing at all. For another, he was kind-hearted, in a way. You know the sort of kind heart: it made him uncomfortable more often than it made him do anything; and even when he did anything, it did not prevent him from grumbling, losing his temper, and swearing (mostly to himself). All the same, it did land him in a good many odd jobs for his neighbour

In [25]:
print("Summarized text (TFIDF Manual) \n") 
print('Text length = ', len(summary))
print('\n')
summary[0:2000]

Summarized text (TFIDF Manual) 

Text length =  24597




' There was once a little man called Niggle, who had a long journey to make. He did not want to go, indeed the whole idea was distasteful to him; but he could not get out of it. He knew he would have to start some time, but he did not hurry with his preparations. Not a very successful one, partly because he had many other things to do. Most of these things he thought were a nuisance; but he did them fairly well, when he could not get out of them: which (in his opinion) was far too often. You know the sort of kind heart: it made him uncomfortable more often than it made him do anything; and even when he did anything, it did not prevent him from grumbling, losing his temper, and swearing (mostly to himself). All the same, it did land him in a good many odd jobs for his neighbour, Mr. Parish, a man with a lame leg. Occasionally he even helped other people from further off, if they came and asked him to. Also, now and again, he remembered his journey, and began to pack a few things in an i

In [26]:
print("Summarized text (TFIDF from Scikit Learn) \n") 
print('Text length = ', len(skl_summary))
print('\n')
skl_summary[0:2000]

Summarized text (TFIDF from Scikit Learn) 

Text length =  23439




' He did not want to go, indeed the whole idea was distasteful to him; but he could not get out of it. He knew he would have to start some time, but he did not hurry with his preparations. Most of these things he thought were a nuisance; but he did them fairly well, when he could not get out of them: which (in his opinion) was far too often. You know the sort of kind heart: it made him uncomfortable more often than it made him do anything; and even when he did anything, it did not prevent him from grumbling, losing his temper, and swearing (mostly to himself). All the same, it did land him in a good many odd jobs for his neighbour, Mr. Parish, a man with a lame leg. Occasionally he even helped other people from further off, if they came and asked him to. Also, now and again, he remembered his journey, and began to pack a few things in an ineffectual way: at such times he did not paint very much. He had a number of pictures on hand; most of them were too large and ambitious for his skil