# Text Summarization


There are two types of summarization:

1. Abstractive summarization
2. Extractive summarization

**Abstractive Summarization**: Abstractive methods choose words based on semantic understanding, including words that do not appear in the source document. They aim to restate the important material in a new way: the text is interpreted and analyzed with advanced natural language techniques to generate a new, shorter text that conveys the most critical information from the original.
This is comparable to the way a human reads an article or blog post and then summarizes it in their own words.
**Input document → understand context → semantics → create own summary**

**Extractive Summarization**: Extractive methods summarize a document by selecting a subset of its existing sentences or phrases that retain the most important points.
This approach weights the sentences and builds the summary from the highest-weighted ones. Different algorithms and techniques are used to assign the weights and then rank the sentences by importance and by similarity to one another.
**Input document → sentence similarity → weight sentences → select sentences with higher rank**
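
The extractive pipeline above can be sketched in plain Python. This is only a minimal illustration and not the frequency-based method used later in this notebook: it assumes a bag-of-words cosine similarity and weights each sentence by its total similarity to the other sentences. The helper names `cosine_similarity` and `rank_sentences` and the toy sentences are invented for the example.

```python
import math
import re
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_sentences(sentences):
    """Return (sentence, weight) pairs, highest total similarity first."""
    bags = [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    weighted = []
    for i, bag in enumerate(bags):
        # Weight = sum of similarities to every other sentence.
        weight = sum(
            cosine_similarity(bag, other)
            for j, other in enumerate(bags)
            if j != i
        )
        weighted.append((sentences[i], weight))
    return sorted(weighted, key=lambda pair: pair[1], reverse=True)


# Invented sentences, for illustration only.
toy_sentences = [
    "Python is used for data science.",
    "Data science uses Python and statistics.",
    "The weather was sunny yesterday.",
]
ranked = rank_sentences(toy_sentences)
for sentence, weight in ranked:
    print(f"{weight:.2f}  {sentence}")
```

The off-topic sentence shares no words with the other two, so it receives the lowest weight and would be the first one dropped from a summary.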

# Steps in Text Summarization

1. Convert Paragraphs to Sentences
2. Text Preprocessing
3. Tokenizing the Sentences
4. Find Weighted Frequency of Occurrence
5. Replace Words by Weighted Frequency in Original Sentences
6. Sort Sentences in Descending Order of Sum


# Importing libraries and dependencies

In [None]:
import re
import string

import networkx as nx
import nltk
import numpy as np
import pandas as pd
import spacy
from gensim.parsing.preprocessing import remove_stopwords
from nltk.cluster.util import cosine_distance
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

# One-time setup if the NLTK data is missing:
# nltk.download("punkt"); nltk.download("stopwords")

nlp = spacy.load("en_core_web_sm")

# Reading Resume

In [None]:
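with open("Nabeel Khan-Chief Data Scientist.txt", "r", encoding="utf-8") as txt_file:
    file_content = txt_file.read()

# file_content keeps the original text; content is the copy that gets cleaned below.
content = file_content
content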

# Pre Processing

In [None]:
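content = content.lower()                                               # sentence words are lowercased when scored, so match that here
content = re.sub(r'\[.*?\]', '', content)                               # remove text in square brackets
content = re.sub(r'[‘’“”…]', '', content)                               # remove curly quotes and ellipsis characters
content = re.sub(r'\n', ' ', content)                                   # replace line breaks with spaces so adjacent words are not glued together
content = re.sub(r'\t\t', '', content)                                  # remove double tabs from the source layout
content = re.sub('[%s]' % re.escape(string.punctuation), '', content)   # remove ASCII punctuation
content = re.sub(r'\w*\d\w*', '', content)                              # remove tokens that contain digits
content = re.sub(r'\W', ' ', content)                                   # replace any remaining non-word character with a space
content = remove_stopwords(content)                                     # drop gensim's stop words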
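
To see what each substitution does, patterns modelled on the ones above can be applied to a short string. The sample text is invented for illustration, line breaks are replaced with a space so that words on adjacent lines stay separate, and the stop-word removal step is left out so that the sketch depends only on the standard library.

```python
import re
import string

# Invented sample text, used only to illustrate the patterns.
sample = "Led a team of 12 engineers [ref 3];\nbuilt “real-time” NLP pipelines in 2019."

steps = [
    (r"\[.*?\]", ""),                              # drop bracketed text such as "[ref 3]"
    (r"[‘’“”…]", ""),                              # drop curly quotes and ellipsis characters
    (r"\n", " "),                                  # turn line breaks into spaces
    ("[%s]" % re.escape(string.punctuation), ""),  # drop ASCII punctuation
    (r"\w*\d\w*", ""),                             # drop tokens that contain a digit
    (r"\W", " "),                                  # leftover non-word characters become spaces
]

cleaned = sample
for pattern, replacement in steps:
    cleaned = re.sub(pattern, replacement, cleaned)

# Collapse the runs of spaces left behind by the removals.
cleaned = " ".join(cleaned.split())
print(cleaned)
```

Note that removing punctuation before tokenizing merges hyphenated words ("real-time" becomes "realtime") and that every token containing a digit disappears, including years and team sizes.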

Two objects now exist: file_content holds the original resume and content holds the processed version.

# Converting Resume Text To Sentences

In [None]:
resume = nltk.sent_tokenize(file_content)

In [None]:
resume

# Finding Weighted Frequencies

- The processed resume (content) is used to find the weighted frequencies.
- First, store all the English stop words from the nltk library in a stopwords variable.
- Loop through the words of the processed text and check whether each one is a stop word.
- If it is not, check whether the word already exists in the word_frequencies dictionary.
- If the word is encountered for the first time, add it to the dictionary as a key and set its value to 1.
- If the word already exists in the dictionary, increment its value by 1.

In [None]:
stopwords = set(nltk.corpus.stopwords.words('english'))  # a set gives fast membership tests
word_frequencies = {}
for word in nltk.word_tokenize(content):
    if word not in stopwords:
        if word not in word_frequencies:
            word_frequencies[word] = 1   # first occurrence
        else:
            word_frequencies[word] += 1  # seen before

Divide the number of occurrences of each word by the count of the most frequent word to get its weighted frequency.

In [None]:
maximum_frequency = max(word_frequencies.values())

for word in word_frequencies:
    word_frequencies[word] = word_frequencies[word] / maximum_frequency
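
As a small worked example of the counting and normalization steps, here is the same idea on an invented token list, using `collections.Counter` in place of the hand-written dictionary loop. The token list and variable names are illustrative only.

```python
from collections import Counter

# Invented tokens standing in for the cleaned resume text.
tokens = ["data", "science", "data", "python", "data", "science"]

counts = Counter(tokens)          # raw counts: data=3, science=2, python=1
max_count = max(counts.values())  # count of the most frequent word

# Divide every count by the largest one, so weights fall in (0, 1].
weights = {word: count / max_count for word, count in counts.items()}
print(weights)
```

The most frequent word always gets a weight of 1.0, and every other word is expressed as a fraction of it (here 2/3 for "science" and 1/3 for "python").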

# Calculating Score

- Calculate the score of each sentence by adding the weighted frequencies of the words that occur in that sentence.
- First, create an empty sentence_scores dictionary.
- The keys of this dictionary will be the sentences and the values will be their scores.
- Loop through each sentence in the resume and tokenize the lowercased sentence into words.
- Check whether each word exists in the word_frequencies dictionary.
- Calculate the score only for sentences with fewer than 30 words.
- Next, check whether the sentence already exists in the sentence_scores dictionary.
- If it does not, add it to the sentence_scores dictionary as a key and assign it the weighted frequency of the first matching word as its value.
- If the sentence already exists in the dictionary, add the weighted frequency of the word to the existing value.

In [None]:
sentence_scores = {}
for sent in resume:
    # Only score sentences with fewer than 30 space-separated words.
    if len(sent.split(' ')) >= 30:
        continue
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies:
            if sent not in sentence_scores:
                sentence_scores[sent] = word_frequencies[word]
            else:
                sentence_scores[sent] += word_frequencies[word]
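
The scoring rule can be checked by hand on a toy input. The sketch below assumes an invented weights dictionary and three invented sentences, splits words with a simple regular expression instead of `nltk.word_tokenize` so that it needs no NLTK data, and wraps the logic in a hypothetical helper named `score_sentences`.

```python
import re


def score_sentences(sentences, weights, max_words=30):
    """Sum the weights of known words for each sentence shorter than max_words."""
    scores = {}
    for sentence in sentences:
        if len(sentence.split(" ")) >= max_words:
            continue  # long sentences are left out of the ranking
        for word in re.findall(r"\w+", sentence.lower()):
            if word in weights:
                scores[sentence] = scores.get(sentence, 0) + weights[word]
    return scores


# Invented weights and sentences, for illustration only.
toy_weights = {"data": 1.0, "science": 0.5, "python": 0.25}
toy_sentences = ["Data science with Python.", "I enjoy hiking.", "Python for data."]

toy_scores = score_sentences(toy_sentences, toy_weights)
print(toy_scores)
```

A sentence with no weighted words, such as "I enjoy hiking." here, never enters the dictionary at all, so it cannot be selected for the summary. Because the score is a plain sum, sentences with more matching words tend to score higher.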

# Summary

- The sentence_scores dictionary contains the sentences with their corresponding scores.
- To summarize the resume, take the top N sentences with the highest scores (N = 15 below).

In [None]:
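import heapq

# Pick the 15 highest-scoring sentences, best first.
summary_sentences = heapq.nlargest(15, sentence_scores, key=sentence_scores.get)

summary = ' '.join(summary_sentences)
summary = re.sub(r'\t\t\t', ' ', summary)  # collapse tab runs from the source layout
summary = re.sub(r'\n', ' ', summary)      # replace line breaks with spaces so words stay separate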

In [None]:
summary
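
When `heapq.nlargest` is given a dictionary and `key=<dict>.get`, it iterates over the keys and returns those with the largest values, highest first. A toy example with invented scores follows. The last step, which puts the selected sentences back into their original order, is an optional variation and not part of the summary cell above.

```python
import heapq

# Invented scores; insertion order stands in for the order of sentences in the document.
toy_scores = {
    "First sentence.": 0.4,
    "Second sentence.": 1.2,
    "Third sentence.": 0.9,
    "Fourth sentence.": 1.7,
}

# Keys with the two largest values, highest score first.
top_two = heapq.nlargest(2, toy_scores, key=toy_scores.get)
print(top_two)

# Optional: restore document order so the summary reads more naturally.
in_document_order = [s for s in toy_scores if s in top_two]
print(" ".join(in_document_order))
```

Joining sentences in score order, as the notebook does, puts the most representative sentence first; restoring document order may read more naturally when the sentences refer to one another.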