# Text summarization
Text summarization is the process of distilling the most important information from a source text.

## Applications
* Extract headlines from text
* Short-form article summaries ("in shorts")
* Question answering
* Key information extraction
* Product reviews

## Types of summarization

#### Based on Input Type
* Single Document - Summarizes a single document
* Multi Document - Summarizes multiple documents into one summary
#### Based on Output
* Extractive - Extractive strategies select the top N sentences that best represent the key points of the article. The result is grammatically correct, because the sentences are copied verbatim, but it may not read smoothly.
* Abstractive - Abstractive strategies create an intermediate semantic representation of the document and generate the summary from it. The summary may not reuse the original wording and may rely on paraphrasing. Producing grammatically and semantically correct summaries this way is challenging.
* Hybrid - A mix of both
#### Based on Purpose
* Generic
* Domain Specific
* Query-based

### Steps
 - Text Cleaning
 - Word Tokenization
 - Word-frequency
 - Sentence Tokenization
 - Common word sentence similarity
 - Summarization
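
The steps above can be sketched end to end with only the standard library. This is a minimal illustration, not the notebook's spaCy pipeline: the `STOP` set, the regex-based tokenizers, and the names `words_of` and `summarize` are all assumptions made for the example, and the "common word sentence similarity" step is read here as scoring each sentence by the frequent words it contains.

```python
import re
from collections import Counter
from heapq import nlargest

# Tiny illustrative stop-word list (an assumption, not spaCy's STOP_WORDS)
STOP = {"a", "an", "and", "the", "is", "of", "to", "in", "it", "on"}

def words_of(text):
    """Lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP]

def summarize(text, ratio=0.3):
    text = re.sub(r"\s+", " ", text).strip()        # text cleaning
    freq = Counter(words_of(text))                   # word tokenization + frequency
    if not freq:
        return ""
    top = max(freq.values())
    weight = {w: c / top for w, c in freq.items()}   # normalize to the range (0, 1]
    sentences = re.split(r"(?<=[.!?])\s+", text)     # naive sentence tokenization
    # Score each sentence by summing the weights of its words
    scores = {s: sum(weight[w] for w in words_of(s)) for s in sentences}
    k = int(len(sentences) * ratio)                  # number of sentences to keep
    return " ".join(nlargest(k, scores, key=scores.get))

text = "Cats chase mice. Dogs bark. Cats and dogs chase cats."
print(summarize(text, ratio=0.4))
```

Splitting on sentence-ending punctuation is far cruder than a trained sentence segmenter, so this sketch is only meant to show how the steps fit together.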
 

In [516]:
# Wikipedia article text on machine learning
ml = """Machine learning (ML) is the study of computer algorithms that improve automatically through experience.[1] It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.[2] Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.

A subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers; but not all machine learning is statistical learning. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a related field of study, focusing on exploratory data analysis through unsupervised learning. In its application across business problems, machine learning is also referred to as predictive analytics."""

In [517]:
# ml = re.sub(r"\[\d+\]", "", ml)  # would strip citation markers such as [1]
# ml
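
The source text contains Wikipedia citation markers such as `[1]`, which later surface inside tokens like `experience.[1`. A minimal sketch of stripping them with a regular expression follows; the helper name `strip_citations` is hypothetical, and running it before `nlp(ml)` would change the outputs recorded in this notebook.

```python
import re

def strip_citations(text):
    """Remove bracketed numeric citation markers such as [1] or [23]."""
    # The square brackets are escaped so they match literally
    # instead of opening a character class.
    return re.sub(r"\[\d+\]", "", text)

sample = "improve automatically through experience.[1] It is seen as a part of artificial intelligence."
print(strip_citations(sample))
```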

In [518]:
# ml_split = ml.split(" ")
# ml_sent = ml.split('.')

In [519]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
import re

In [520]:
stopwords = list(STOP_WORDS)

In [521]:
nlp = spacy.load("en_core_web_sm")

In [522]:
doc = nlp(ml)

In [523]:
punctuation = punctuation + '\n'

In [524]:
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\n'

In [525]:
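word_freq = {}

# Count each token that is neither a stop word nor punctuation.
# Keys keep the token's original casing, so "Machine" and "machine" are counted separately.
for word in doc:
    token = word.text
    if token.lower() in stopwords or token.lower() in punctuation:
        continue
    word_freq[token] = word_freq.get(token, 0) + 1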

In [526]:
print(word_freq)

{'Machine': 3, 'learning': 9, 'ML': 1, 'study': 3, 'computer': 2, 'algorithms': 4, 'improve': 1, 'automatically': 1, 'experience.[1': 1, 'seen': 1, 'artificial': 1, 'intelligence': 1, 'build': 1, 'model': 1, 'based': 1, 'sample': 1, 'data': 3, 'known': 1, 'training': 1, 'order': 1, 'predictions': 2, 'decisions': 1, 'explicitly': 1, 'programmed': 1, 'so.[2': 1, 'wide': 1, 'variety': 1, 'applications': 1, 'email': 1, 'filtering': 1, 'vision': 1, 'difficult': 1, 'unfeasible': 1, 'develop': 1, 'conventional': 1, 'perform': 1, 'needed': 1, 'tasks': 1, '\n\n': 1, 'subset': 1, 'machine': 4, 'closely': 1, 'related': 2, 'computational': 1, 'statistics': 1, 'focuses': 1, 'making': 1, 'computers': 1, 'statistical': 1, 'mathematical': 1, 'optimization': 1, 'delivers': 1, 'methods': 1, 'theory': 1, 'application': 2, 'domains': 1, 'field': 2, 'Data': 1, 'mining': 1, 'focusing': 1, 'exploratory': 1, 'analysis': 1, 'unsupervised': 1, 'business': 1, 'problems': 1, 'referred': 1, 'predictive': 1, 'analy
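
The counts above keep `'Machine'` and `'machine'` as separate keys, while the sentence-scoring cell later looks words up by `word.text.lower()`. A small sketch of the difference, using toy tokens rather than the spaCy `doc` (the variable names are illustrative):

```python
from collections import Counter

tokens = ["Machine", "learning", "machine", "Learning", "Data"]  # toy tokens

# Case-sensitive keys, as in the counting cell above
by_surface = Counter(tokens)
# Case-folded keys: one entry per word regardless of capitalisation
by_lower = Counter(t.lower() for t in tokens)

# A lowercase lookup misses keys that only exist in capitalised form
print(by_surface.get("data", 0), by_lower.get("data", 0))
# Counts for the same word are split across casings in the first table
print(by_surface["machine"], by_lower["machine"])
```

Counting on the lowercased token is one way to keep the frequency table and the later lookup consistent; it would change the numbers shown in this notebook.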

In [527]:
max_freq = max(word_freq.values())
max_freq

9

In [528]:
# Normalizing the word frequency
for word in word_freq.keys():
    word_freq[word] = word_freq[word]/max_freq

In [529]:
print(word_freq)

{'Machine': 0.3333333333333333, 'learning': 1.0, 'ML': 0.1111111111111111, 'study': 0.3333333333333333, 'computer': 0.2222222222222222, 'algorithms': 0.4444444444444444, 'improve': 0.1111111111111111, 'automatically': 0.1111111111111111, 'experience.[1': 0.1111111111111111, 'seen': 0.1111111111111111, 'artificial': 0.1111111111111111, 'intelligence': 0.1111111111111111, 'build': 0.1111111111111111, 'model': 0.1111111111111111, 'based': 0.1111111111111111, 'sample': 0.1111111111111111, 'data': 0.3333333333333333, 'known': 0.1111111111111111, 'training': 0.1111111111111111, 'order': 0.1111111111111111, 'predictions': 0.2222222222222222, 'decisions': 0.1111111111111111, 'explicitly': 0.1111111111111111, 'programmed': 0.1111111111111111, 'so.[2': 0.1111111111111111, 'wide': 0.1111111111111111, 'variety': 0.1111111111111111, 'applications': 0.1111111111111111, 'email': 0.1111111111111111, 'filtering': 0.1111111111111111, 'vision': 0.1111111111111111, 'difficult': 0.1111111111111111, 'unfeas

In [530]:
sent_tokens = [sent for sent in doc.sents]
print(sent_tokens)

[Machine learning (ML) is the study of computer algorithms that improve automatically through experience.[1], It is seen as a part of artificial intelligence., Machine learning algorithms build a model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.[2] Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.

, A subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers; but not all machine learning is statistical learning., The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning., Data mining is a related field of study, focusing on exploratory data analysis through unsupervised learning., In its application a

In [531]:
sent_scores = {}
# Score each sentence by summing the normalized frequencies of its words.
# Lookups use the lowercased token, so only lowercase keys in word_freq contribute.
for sent in sent_tokens:
    for word in sent:
        token = word.text.lower()
        if token in word_freq:
            sent_scores[sent] = sent_scores.get(sent, 0) + word_freq[token]

In [532]:
print(sent_scores)

{Machine learning (ML) is the study of computer algorithms that improve automatically through experience.[1]: 2.777777777777778, It is seen as a part of artificial intelligence.: 0.3333333333333333, Machine learning algorithms build a model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.[2] Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.

: 8.111111111111107, A subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers; but not all machine learning is statistical learning.: 5.222222222222222, The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.: 2.8888888888888893, Data mining is a related field of 
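
Because each score is a plain sum, longer sentences tend to score higher; the long merged sentence above has the largest score. One possible variation, which this notebook does not use, divides the sum by the number of scored words. The sketch below works on toy token lists, and the function name `score_sentences` and its `normalize` flag are assumptions for illustration.

```python
def score_sentences(sentences, weights, normalize=False):
    """Score tokenized sentences by summing word weights.

    sentences: list of token lists (already lowercased)
    weights:   dict mapping word -> normalized frequency
    normalize: if True, divide by the number of scored words
    """
    scores = {}
    for i, tokens in enumerate(sentences):
        hits = [weights[t] for t in tokens if t in weights]
        if not hits:
            continue  # sentences with no scored words get no entry
        total = sum(hits)
        scores[i] = total / len(hits) if normalize else total
    return scores

weights = {"learning": 1.0, "machine": 0.5, "data": 0.25}
sentences = [
    ["machine", "learning", "uses", "data", "and", "more", "data"],
    ["learning"],
]
print(score_sentences(sentences, weights))
print(score_sentences(sentences, weights, normalize=True))
```

With the plain sum the longer sentence ranks first; with averaging the short, dense sentence does. Which behaviour is preferable depends on the texts being summarized.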

In [533]:
from heapq import nlargest

In [534]:
select_length = int(len(sent_tokens)*0.3)
select_length

2

In [535]:
summary = nlargest(select_length, sent_scores, key = sent_scores.get)

In [536]:
len(summary)

2
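
Two details of this selection step are worth noting: `nlargest` returns items ordered by score rather than by position in the text, and `int(n * 0.3)` evaluates to 0 when there are fewer than four sentences. A hedged sketch that keeps at least one sentence and restores document order is shown below; it works on sentence indices with made-up scores, and `select_sentences` is a hypothetical helper, not part of the notebook.

```python
from heapq import nlargest

def select_sentences(scores, ratio=0.3):
    """Return indices of the top-scoring sentences in document order.

    scores: dict mapping sentence index -> score
    """
    k = max(1, int(len(scores) * ratio))        # keep at least one sentence
    top = nlargest(k, scores, key=scores.get)   # ordered by score, descending
    return sorted(top)                          # restore original order

# Made-up scores for seven sentences
scores = {0: 2.8, 1: 0.3, 2: 5.2, 3: 8.1, 4: 2.9, 5: 2.4, 6: 1.9}
print(select_sentences(scores))
```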

In [537]:
# Each item in summary is a spaCy sentence span; take its text
final_summary = [sent.text for sent in summary]

In [538]:
summary = ' '.join(final_summary)

In [539]:
summary

'Machine learning algorithms build a model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.[2] Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.\n\n A subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers; but not all machine learning is statistical learning.'

In [540]:
len(ml), len(summary)

(1078, 583)