# Text summarization
Text summarization is the process of distilling the most important information from a source text.

## Applications
* Extract headlines from text
* Short-form article summaries
* Question answering
* Key information extraction
* Product reviews

## Types of summarization

#### Based on Input Type
* Single Document - Summarizes one document
* Multi Document - Summarizes multiple documents in a single summary
#### Based on Output
* Extractive - Extractive strategies select the top N sentences that best represent the key points of the article. The result is grammatically correct, because sentences are copied verbatim, but it may not read smoothly.
* Abstractive - Abstractive strategies create an intermediate semantic representation of the document and generate the summary from it. The summary may not reuse the original wording and may paraphrase it. Producing grammatically and semantically correct summaries this way is challenging.
* Hybrid - A mix of both
#### Based on Purpose
* Generic
* Domain Specific
* Query-based

### Steps
 - Text Cleaning
 - Word Tokenization
 - Word-frequency
 - Sentence Tokenization
 - Word/Sentence ranking
 - Summarization
 

In [1]:
from string import punctuation
import re

In [2]:
# Wiki article on machine learning
ml = """Machine learning (ML) is the study of computer algorithms that improve automatically through experience.[1] It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.[2] Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.

A subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers; but not all machine learning is statistical learning. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a related field of study, focusing on exploratory data analysis through unsupervised learning. In its application across business problems, machine learning is also referred to as predictive analytics."""

In [3]:
# Raw string avoids the invalid-escape warning for "\s"
ml_split = re.split(r"\s", ml)

In [4]:
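# Words are compared via word.lower() below, so only lowercase entries can match;
# the capitalized entries 'It' and 'A' have no effect and 'a' is not filtered.
stopwords = ['the', 'to', 'is', 'of', 'in', 'on', 'it', 'It', 'A', 'as']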

In [5]:
punctuation = punctuation + '\n'

In [6]:
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\n'

In [7]:
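word_freq = {}

for word in ml_split:
    # `punctuation` is a string, so `not in` is a substring test; it also
    # drops the empty tokens produced by consecutive whitespace.
    if word.lower() not in stopwords and word.lower() not in punctuation:
        # Keys keep their original case and any attached punctuation.
        word_freq[word] = word_freq.get(word, 0) + 1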

In [8]:
print(word_freq)

{'Machine': 3, 'learning': 6, '(ML)': 1, 'study': 2, 'computer': 2, 'algorithms': 4, 'that': 1, 'improve': 1, 'automatically': 1, 'through': 2, 'experience.[1]': 1, 'seen': 1, 'a': 4, 'part': 1, 'artificial': 1, 'intelligence.': 1, 'build': 1, 'model': 1, 'based': 1, 'sample': 1, 'data,': 1, 'known': 1, '"training': 1, 'data",': 1, 'order': 1, 'make': 1, 'predictions': 2, 'or': 2, 'decisions': 1, 'without': 1, 'being': 1, 'explicitly': 1, 'programmed': 1, 'do': 1, 'so.[2]': 1, 'are': 1, 'used': 1, 'wide': 1, 'variety': 1, 'applications,': 1, 'such': 1, 'email': 1, 'filtering': 1, 'and': 2, 'vision,': 1, 'where': 1, 'difficult': 1, 'unfeasible': 1, 'develop': 1, 'conventional': 1, 'perform': 1, 'needed': 1, 'tasks.': 1, 'A': 1, 'subset': 1, 'machine': 4, 'closely': 1, 'related': 2, 'computational': 1, 'statistics,': 1, 'which': 1, 'focuses': 1, 'making': 1, 'using': 1, 'computers;': 1, 'but': 1, 'not': 1, 'all': 1, 'statistical': 1, 'learning.': 3, 'mathematical': 1, 'optimization': 1, 
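
Because tokens keep their case and attached punctuation, the counts above treat `'Machine'` and `'machine'`, or `'learning'` and `'learning.'`, as different words. A minimal sketch of one way to normalize tokens before counting, assuming it is acceptable to lowercase everything, drop citation markers such as `[1]`, and strip punctuation from both ends of each token. The helper `count_words`, the `STOPWORDS` set, and the sample sentence are hypothetical and not part of the original notebook.

```python
import re
from collections import Counter
from string import punctuation

# Hypothetical stop-word set; lowercase only, because tokens are lowercased first.
STOPWORDS = {"the", "to", "is", "of", "in", "on", "it", "a", "as"}


def count_words(text):
    """Count lowercase tokens with citation markers and edge punctuation removed."""
    # Remove citation markers such as [1] so "experience.[1]" becomes "experience."
    text = re.sub(r"\[\d+\]", "", text)
    counts = Counter()
    for token in text.split():
        # strip() only removes punctuation at the ends of the token.
        word = token.strip(punctuation).lower()
        if word and word not in STOPWORDS:
            counts[word] += 1
    return counts


# Made-up sample sentence for illustration.
sample = 'Machine learning is the study of machine learning.[1] It is "learning".'
print(count_words(sample))
```

With this normalization, `'Machine'` and `'machine'` share one key, so the maximum frequency used for scaling reflects the word rather than one of its spellings.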

In [9]:
max_freq = max(word_freq.values())
max_freq

6

In [10]:
# Normalizing the word frequency
for word in word_freq.keys():
    word_freq[word] = word_freq[word]/max_freq

In [11]:
print(word_freq)

{'Machine': 0.5, 'learning': 1.0, '(ML)': 0.16666666666666666, 'study': 0.3333333333333333, 'computer': 0.3333333333333333, 'algorithms': 0.6666666666666666, 'that': 0.16666666666666666, 'improve': 0.16666666666666666, 'automatically': 0.16666666666666666, 'through': 0.3333333333333333, 'experience.[1]': 0.16666666666666666, 'seen': 0.16666666666666666, 'a': 0.6666666666666666, 'part': 0.16666666666666666, 'artificial': 0.16666666666666666, 'intelligence.': 0.16666666666666666, 'build': 0.16666666666666666, 'model': 0.16666666666666666, 'based': 0.16666666666666666, 'sample': 0.16666666666666666, 'data,': 0.16666666666666666, 'known': 0.16666666666666666, '"training': 0.16666666666666666, 'data",': 0.16666666666666666, 'order': 0.16666666666666666, 'make': 0.16666666666666666, 'predictions': 0.3333333333333333, 'or': 0.3333333333333333, 'decisions': 0.16666666666666666, 'without': 0.16666666666666666, 'being': 0.16666666666666666, 'explicitly': 0.16666666666666666, 'programmed': 0.1666
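
The normalization step above divides every count by the largest count, so each weight falls in the range (0, 1]. Written out, with $f(w)$ as the raw count of word $w$ (notation introduced here for illustration only):

```latex
\hat{f}(w) = \frac{f(w)}{\max_{w'} f(w')}
```

For example, `'learning'` has the maximum count of 6 and maps to 6/6 = 1.0, while `'Machine'` with a count of 3 maps to 3/6 = 0.5, matching the printed values.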

In [12]:
sent_tokens = re.split(r'\.', ml)
print(sent_tokens)

['Machine learning (ML) is the study of computer algorithms that improve automatically through experience', '[1] It is seen as a part of artificial intelligence', ' Machine learning algorithms build a model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so', '[2] Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks', '\n\nA subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers; but not all machine learning is statistical learning', ' The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning', ' Data mining is a related field of study, focusing on exploratory data analysis through unsupervised learning', ' In 
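
Splitting on every `.` drops the final period, leaves citation markers such as `[1]` at the start of the following sentence, and produces a trailing empty string. A rough sketch of an alternative, assuming a regex-based splitter is good enough for this text; it will still split incorrectly on abbreviations such as "e.g.". The helper `split_sentences` and the sample text are hypothetical.

```python
import re


def split_sentences(text):
    """Split text into sentences, keeping each sentence's final punctuation mark."""
    # Drop bracketed citation markers such as [1] or [23].
    text = re.sub(r"\[\d+\]", " ", text)
    # Split on whitespace that follows ., ! or ? (the lookbehind keeps the mark).
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


# Made-up sample with a citation marker and a paragraph break.
sample = "ML improves through experience.[1] It is part of AI.\n\nData mining is related."
print(split_sentences(sample))
```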

In [13]:
sent_scores = {}
for sent in sent_tokens:
    # Iterating over a string yields single characters, so `word` is one
    # character here and only one-character keys of word_freq (such as 'a') match.
    for word in sent:
        if word.lower() in word_freq:
            sent_scores[sent] = sent_scores.get(sent, 0) + word_freq[word.lower()]

In [14]:
print(sent_scores)

{'Machine learning (ML) is the study of computer algorithms that improve automatically through experience': 4.666666666666666, '[1] It is seen as a part of artificial intelligence': 3.333333333333333, ' Machine learning algorithms build a model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so': 9.333333333333332, '[2] Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks': 9.999999999999998, '\n\nA subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers; but not all machine learning is statistical learning': 9.333333333333332, ' The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning': 6.666666666666667, ' Da
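
The scores above are all multiples of 0.667, the weight of `'a'`, because `for word in sent` walks each sentence character by character. A minimal sketch of word-level scoring, assuming the frequency table is keyed by lowercase words without attached punctuation. The helper `score_sentences` and the toy data are hypothetical and are not the notebook's values.

```python
from string import punctuation


def score_sentences(sentences, word_freq):
    """Score each sentence by summing the frequencies of its words."""
    scores = {}
    for sent in sentences:
        total = 0.0
        # split() yields whole words; iterating over the string yields characters.
        for token in sent.split():
            word = token.strip(punctuation).lower()
            total += word_freq.get(word, 0.0)
        # Sentences with no known words are left out of the result.
        if total > 0:
            scores[sent] = total
    return scores


# Toy frequencies made up for illustration, not taken from the article above.
toy_freq = {"machine": 1.0, "learning": 1.0, "data": 0.5}
toy_sents = ["Machine learning uses data", "It is seen as a part of AI"]
print(score_sentences(toy_sents, toy_freq))
```

Summing frequencies favors long sentences; dividing the total by the number of words is one common adjustment, though the notebook does not apply it.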

In [15]:
from heapq import nlargest

In [16]:
select_length = int(len(sent_tokens)*0.3)
select_length

2

In [17]:
summary = nlargest(select_length, sent_scores, key = sent_scores.get)

In [18]:
len(summary)

2

In [19]:
final_summary = list(summary)  # the selected sentences, highest score first

In [20]:
summary = ' '.join(final_summary)

In [21]:
summary

'[2] Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks  Machine learning algorithms build a model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so'
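
`nlargest` returns sentences in descending score order, so the summary above opens with the article's fourth sentence and follows it with the third. A small sketch of one way to keep the selected sentences in document order, assuming scores are stored in a dict keyed by sentence text as in the notebook. The helper `top_sentences_in_order` and the toy data are hypothetical.

```python
from heapq import nlargest


def top_sentences_in_order(sentences, scores, k):
    """Pick the k highest-scoring sentences and return them in document order."""
    # Rank positions rather than text, so the original order can be restored.
    top = nlargest(k, range(len(sentences)),
                   key=lambda i: scores.get(sentences[i], 0.0))
    return [sentences[i] for i in sorted(top)]


# Toy data made up for illustration.
toy_sents = ["First point", "Second point", "Third point"]
toy_scores = {"First point": 1.0, "Second point": 2.0, "Third point": 3.0}
print(" ".join(top_sentences_in_order(toy_sents, toy_scores, 2)))
```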

In [22]:
len(ml), len(summary)

(1078, 397)