## Text Summarization using NLP

**What is text summarization**

Text summarization is the process of distilling the most important information from a source text.

**Why automatic text summarization**
* Summaries reduce reading time.
* When researching documents, summaries make the selection process easier.
* Automatic summarization improves the effectiveness of indexing.
* Automatic summarization algorithms are less biased than human summarizers.
* Personalized summaries are useful in question-answering systems as they provide personalized information.
* Using automatic or semi-automatic summarization systems enables commercial abstract services to increase the number of text documents they are able to process.


**Types of summarization**

* Based on input type
    * Single document
    * Multi-document
* Based on output type
    * Extractive
    * Abstractive
* Based on purpose
    * Generic
    * Domain-specific
    * Query-based
    

**How to do text summarization**
- Text cleaning
- Sentence tokenization
- Word tokenization
- Word-frequency table
- Clustering
- Summarization
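
The steps above can be sketched end to end in plain Python. This is only a minimal illustration, not the spaCy-based code used in the rest of the notebook: the regex tokenizers, the tiny `STOP` set, and the `summarize` and `words` helpers are all assumptions made for the example. Unlike the cells that follow, it also puts the selected sentences back in document order.

```python
import re
from collections import Counter
from heapq import nlargest

# Tiny illustrative stop-word set (an assumption; spaCy's list is much larger).
STOP = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "on", "it"}


def words(sentence):
    """Word tokenization + cleaning: lowercase alphabetic tokens, stop words dropped."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP]


def summarize(text, ratio=0.3):
    # Sentence tokenization: naive split after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word-frequency table over the whole text.
    freq = Counter(w for s in sentences for w in words(s))
    if not freq:
        return ""
    top = max(freq.values())
    # Sentence score = sum of the normalized frequencies of its words.
    scores = {i: sum(freq[w] / top for w in words(s)) for i, s in enumerate(sentences)}
    k = max(1, int(len(sentences) * ratio))
    # Keep the k best sentences, then restore document order.
    best = sorted(nlargest(k, scores, key=scores.get))
    return " ".join(sentences[i] for i in best)


demo = "Cats sleep a lot. Cats chase mice and cats climb trees. Dogs bark."
print(summarize(demo, ratio=0.4))  # Cats chase mice and cats climb trees.
```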

**Text**


In [1]:
text = """

There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on. The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.). The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query. Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.

An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document. Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic). This problem is called multi-document summarization. A related application is summarizing news articles. Imagine a system, which automatically pulls together news articles on a given topic (from the web), and concisely represents the latest news as a summary.

Image collection summarization is another application example of automatic summarization. It consists in selecting a representative set of images from a larger set of images.[8] A summary in this context is useful to show the most representative images of results in an image collection exploration system. Video summarization is a related domain, where the system automatically creates a trailer of a long video. This also has applications in consumer or personal videos, where one might want to skip the boring or repetitive actions. Similarly, in surveillance videos, one would want to extract important and suspicious activity, while ignoring all the boring and redundant frames captured.


"""

## Let's Get Started with spaCy

In [2]:
!pip install -U spacy
!python -m spacy download en_core_web_sm





In [4]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation


In [7]:
# spaCy's built-in English stop words, converted to a list
stopwords = list(STOP_WORDS)
stopwords

['more',
 'whether',
 'twenty',
 'below',
 'behind',
 'whereby',
 'myself',
 'herein',
 'besides',
 'becomes',
 'never',
 'and',
 'our',
 "'d",
 'beside',
 'ourselves',
 'so',
 "'s",
 'any',
 'yourselves',
 'same',
 'their',
 'under',
 'yourself',
 '‘d',
 'out',
 'empty',
 'towards',
 '’re',
 'might',
 '’ve',
 'himself',
 'herself',
 'of',
 'off',
 'whoever',
 'serious',
 'using',
 'everywhere',
 'them',
 'seem',
 'before',
 'perhaps',
 'been',
 'for',
 'part',
 'when',
 'cannot',
 'nevertheless',
 'noone',
 'therein',
 'whence',
 'above',
 'always',
 'where',
 'former',
 'became',
 'a',
 'quite',
 'done',
 '‘ll',
 "'ll",
 "'m",
 'your',
 'there',
 'thru',
 'i',
 'again',
 'eight',
 'somewhere',
 'six',
 'hers',
 'own',
 'per',
 'then',
 'mine',
 'if',
 'become',
 'which',
 'therefore',
 'several',
 'we',
 'am',
 're',
 'toward',
 'seems',
 'only',
 'why',
 'almost',
 'take',
 'would',
 'fifteen',
 'from',
 'my',
 'made',
 'on',
 'beforehand',
 'among',
 'will',
 'yet',
 'upon',
 'thro

In [8]:
# load the small English spaCy pipeline
nlp = spacy.load('en_core_web_sm')

In [10]:
doc = nlp(text)
doc



There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on. The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.). The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query. Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.

An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document. Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic). This problem is called multi-document summarization. A related application is summ

In [13]:
doc[1]

There

In [11]:
# word tokenization: one string per token in the doc

tokens = [token.text for token in doc]
tokens

['\n\n',
 'There',
 'are',
 'broadly',
 'two',
 'types',
 'of',
 'extractive',
 'summarization',
 'tasks',
 'depending',
 'on',
 'what',
 'the',
 'summarization',
 'program',
 'focuses',
 'on',
 '.',
 'The',
 'first',
 'is',
 'generic',
 'summarization',
 ',',
 'which',
 'focuses',
 'on',
 'obtaining',
 'a',
 'generic',
 'summary',
 'or',
 'abstract',
 'of',
 'the',
 'collection',
 '(',
 'whether',
 'documents',
 ',',
 'or',
 'sets',
 'of',
 'images',
 ',',
 'or',
 'videos',
 ',',
 'news',
 'stories',
 'etc',
 '.',
 ')',
 '.',
 'The',
 'second',
 'is',
 'query',
 'relevant',
 'summarization',
 ',',
 'sometimes',
 'called',
 'query',
 '-',
 'based',
 'summarization',
 ',',
 'which',
 'summarizes',
 'objects',
 'specific',
 'to',
 'a',
 'query',
 '.',
 'Summarization',
 'systems',
 'are',
 'able',
 'to',
 'create',
 'both',
 'query',
 'relevant',
 'text',
 'summaries',
 'and',
 'generic',
 'machine',
 '-',
 'generated',
 'summaries',
 'depending',
 'on',
 'what',
 'the',
 'user',
 'needs

In [21]:
# Punctuation and stop words must be removed; newline tokens are treated as punctuation too

punctuation = punctuation + '\n' + '\n\n'
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\n\n\n\n'
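
Because `punctuation` is a plain string, an expression such as `x not in punctuation` is a substring test rather than a lookup in a collection of symbols. The sketch below illustrates the difference using only the standard library; the names `extended` and `punct_set` are introduced here for the example and are not part of the notebook.

```python
from string import punctuation

extended = punctuation + "\n" + "\n\n"  # same construction as the cell above

# Substring semantics: any run of characters found in the string matches.
print("." in extended)     # True
print("\n\n" in extended)  # True, the string contains consecutive newlines
print('!"' in extended)    # True, because '!' and '"' happen to be adjacent
print("" in extended)      # True, the empty string is a substring of any string

# Set semantics: only exact elements match.
punct_set = set(punctuation) | {"\n", "\n\n"}
print('!"' in punct_set)   # False
print("" in punct_set)     # False
```

For the tokens spaCy produces here the two behave the same in practice, but a set makes the intent explicit and avoids accidental multi-character matches.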

In [22]:
# text cleaning + word-frequency table:
# count every token that is neither a stop word nor punctuation

word_frequencies = {}
for word in doc:
    if word.text.lower() not in stopwords:
        if word.text.lower() not in punctuation:
            # first time the word appears
            if word.text not in word_frequencies.keys():
                word_frequencies[word.text] = 1
            else:
                word_frequencies[word.text] += 1
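
The same counting can be written more compactly with `collections.Counter`. The sketch below uses a hand-written token list and a small stop-word set (both made up for illustration) instead of a spaCy `doc`, so it runs without any model installed.

```python
from collections import Counter
from string import punctuation

# Hypothetical inputs standing in for the spaCy tokens and STOP_WORDS.
tokens = ["The", "summary", "is", "a", "short", "summary", ".", "Short", "text", "!"]
stop_words = {"the", "is", "a"}
punct = set(punctuation)

# Keep tokens that are neither stop words nor punctuation.
# As in the cell above, the filter is case-insensitive but the key keeps its casing.
counts = Counter(
    tok for tok in tokens
    if tok.lower() not in stop_words and tok not in punct
)
print(counts)
```

Note that `short` and `Short` end up as separate keys. Counting `tok.lower()` instead would merge them, which may be preferable when the table is later looked up with lowercased words.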

In [23]:
word_frequencies

{'broadly': 1,
 'types': 1,
 'extractive': 1,
 'summarization': 11,
 'tasks': 1,
 'depending': 2,
 'program': 1,
 'focuses': 2,
 'generic': 3,
 'obtaining': 1,
 'summary': 4,
 'abstract': 2,
 'collection': 3,
 'documents': 2,
 'sets': 1,
 'images': 3,
 'videos': 3,
 'news': 4,
 'stories': 1,
 'etc': 1,
 'second': 1,
 'query': 4,
 'relevant': 2,
 'called': 2,
 'based': 1,
 'summarizes': 1,
 'objects': 1,
 'specific': 1,
 'Summarization': 1,
 'systems': 1,
 'able': 1,
 'create': 1,
 'text': 1,
 'summaries': 2,
 'machine': 1,
 'generated': 1,
 'user': 1,
 'needs': 1,
 'example': 3,
 'problem': 2,
 'document': 4,
 'attempts': 1,
 'automatically': 3,
 'produce': 1,
 'given': 2,
 'interested': 1,
 'generating': 1,
 'single': 1,
 'source': 2,
 'use': 1,
 'multiple': 1,
 'cluster': 1,
 'articles': 3,
 'topic': 2,
 'multi': 1,
 'related': 2,
 'application': 2,
 'summarizing': 1,
 'Imagine': 1,
 'system': 3,
 'pulls': 1,
 'web': 1,
 'concisely': 1,
 'represents': 1,
 'latest': 1,
 'Image': 1,
 '

In [25]:
# find the maximum frequency
max_frequency = max(word_frequencies.values())

In [26]:
max_frequency

11

In [27]:
# then normalize the dictionary
for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word]/max_frequency
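
Dividing by the maximum maps every count into the range (0, 1], with the most frequent word at exactly 1.0. A dict comprehension is an equivalent way to write this that builds a new dictionary instead of updating the old one in place. The small `counts` dictionary below is a hand-picked subset of the counts shown earlier, used only for illustration.

```python
counts = {"summarization": 11, "summary": 4, "generic": 3, "broadly": 1}

max_count = max(counts.values())
normalized = {word: n / max_count for word, n in counts.items()}

for word, weight in normalized.items():
    print(word, round(weight, 4))
# summarization 1.0
# summary 0.3636
# generic 0.2727
# broadly 0.0909
```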

In [28]:
print(word_frequencies)

{'broadly': 0.09090909090909091, 'types': 0.09090909090909091, 'extractive': 0.09090909090909091, 'summarization': 1.0, 'tasks': 0.09090909090909091, 'depending': 0.18181818181818182, 'program': 0.09090909090909091, 'focuses': 0.18181818181818182, 'generic': 0.2727272727272727, 'obtaining': 0.09090909090909091, 'summary': 0.36363636363636365, 'abstract': 0.18181818181818182, 'collection': 0.2727272727272727, 'documents': 0.18181818181818182, 'sets': 0.09090909090909091, 'images': 0.2727272727272727, 'videos': 0.2727272727272727, 'news': 0.36363636363636365, 'stories': 0.09090909090909091, 'etc': 0.09090909090909091, 'second': 0.09090909090909091, 'query': 0.36363636363636365, 'relevant': 0.18181818181818182, 'called': 0.18181818181818182, 'based': 0.09090909090909091, 'summarizes': 0.09090909090909091, 'objects': 0.09090909090909091, 'specific': 0.09090909090909091, 'Summarization': 0.09090909090909091, 'systems': 0.09090909090909091, 'able': 0.09090909090909091, 'create': 0.0909090909

In [29]:
# sentence tokenization
sentence_tokens = [sent for sent in doc.sents]
print(sentence_tokens)

[

There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on., The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.)., The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query., Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.

, An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document., Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic)., This problem is called multi-document summarization., A related applicatio

In [30]:
# calculate the sentence scores
sentence_scores = {}
for sent in sentence_tokens:
    for word in sent:
        if word.text.lower() in word_frequencies.keys():
            if sent not in sentence_scores.keys():
                sentence_scores[sent] = word_frequencies[word.text.lower()]
            else:
                sentence_scores[sent]  += word_frequencies[word.text.lower()]
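
Each sentence's score is the sum of the normalized frequencies of the words it contains, so sentences packed with frequent words rank highest. Because it is a plain sum, longer sentences also tend to score higher. A minimal sketch with plain word lists follows; the `weights` values and the two sentences are invented for the example.

```python
weights = {"summarization": 1.0, "generic": 0.5, "query": 0.25}

sentences = {
    "s1": ["Generic", "summarization", "is", "generic"],
    "s2": ["query", "summarization"],
}

# Words missing from `weights` (stop words, punctuation) contribute nothing.
scores = {
    name: sum(weights.get(word.lower(), 0.0) for word in words)
    for name, words in sentences.items()
}
print(scores)  # {'s1': 2.0, 's2': 1.25}
```

One difference from the loop above: here a sentence with no scored words gets 0.0, whereas the loop leaves such a sentence out of `sentence_scores` entirely.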

In [33]:
sentence_scores
# this tells us how important each sentence is within the text

{
 
 There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on.: 2.818181818181818,
 The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.).: 3.9999999999999987,
 The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query.: 3.909090909090909,
 Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.
 : 3.09090909090909,
 An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.: 3.9999999999999996,
 Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of a

In [34]:
from heapq import nlargest

In [35]:
select_length = int(len(sentence_tokens)*0.3)
select_length

4

In [36]:
summary = nlargest(select_length, sentence_scores, key = sentence_scores.get)
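
`heapq.nlargest(n, iterable, key=...)` returns the `n` items with the largest key, ordered from highest to lowest. Iterating over a dictionary yields its keys, so passing `key=scores.get` ranks the keys by their values. A small sketch with made-up scores:

```python
from heapq import nlargest

scores = {"first": 2.8, "second": 3.9, "third": 4.0, "fourth": 3.1}

top_two = nlargest(2, scores, key=scores.get)
print(top_two)  # ['third', 'second']

# The result is ordered by score, not by position in the text.
# Re-sorting by original position is one optional way to restore reading order.
position = {name: i for i, name in enumerate(scores)}
print(sorted(top_two, key=position.get))  # ['second', 'third']
```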

In [37]:
summary
# the 4 highest-scoring sentences

[An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.,
 The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.).,
 The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query.,
 Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.
 ]

In [38]:
# combine the sentences

final_summary = [sent.text for sent in summary]

In [39]:
summary_join = " ".join(final_summary)

In [40]:
summary_join

'An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document. The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.). The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query. Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.\n\n'

In [42]:
print(text)



There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on. The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.). The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query. Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.

An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document. Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic). This problem is called multi-document summarization. A related application is summ

In [41]:
print(summary_join)

An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document. The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.). The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query. Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.




In [43]:
len(text)

1874

In [45]:
len(summary_join)

606
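
With the two lengths reported above, the summary is roughly a third of the original text by character count. The arithmetic, as a quick check (the variable names are chosen here for the example):

```python
text_length = 1874     # len(text) from the notebook output
summary_length = 606   # len(summary_join) from the notebook output

ratio = summary_length / text_length
print(f"{ratio:.1%}")  # 32.3%
```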