[View in Colaboratory](https://colab.research.google.com/github/KushalVenkatesh/botkit/blob/master/Text_summarization_algorithm_in_python_using_Gensim_(1).ipynb)

Text summarization is an exciting area of NLP that lets developers quickly get the gist of a document and extract its key words and phrases.
Here I walk through the text summarization features in Gensim.

The gensim implementation is based on the popular "TextRank" algorithm and was contributed by people from the Engineering Faculty of the University of Buenos Aires.

Text Summarization using Gensim:

This module automatically summarizes the given text, by extracting one or more important sentences from the text. In a similar way, it can also extract keywords.

Here, I will demonstrate how to use this summarization module via some examples. First, we will try a small example, then we will try two larger ones, and then we will review the performance of the summarizer in terms of speed.

This summarizer is based on the "TextRank" algorithm, from an article by Mihalcea et al. The algorithm was later improved by Barrios et al. in another article, which introduced a "BM25 ranking function" as the similarity measure between sentences. To follow along, you need Python and gensim installed.
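As a rough illustration of the ranking idea behind TextRank, here is a minimal sketch in plain Python. It is not gensim's implementation: it uses a simple word-overlap similarity in the spirit of the original TextRank article rather than BM25, and the names tokenize, similarity, textrank, and damping are my own choices for this example.

```python
import math
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def similarity(a, b):
    """Shared words, normalised by the log of each sentence's size."""
    words_a, words_b = tokenize(a), tokenize(b)
    denom = math.log(len(words_a) or 1) + math.log(len(words_b) or 1)
    if denom == 0:
        return 0.0
    return len(words_a & words_b) / denom


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences with a PageRank-style power iteration."""
    n = len(sentences)
    # Edge weights of the sentence graph (no self-loops).
    weights = [[similarity(a, b) if i != j else 0.0
                for j, b in enumerate(sentences)]
               for i, a in enumerate(sentences)]
    out_sums = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence collects score from its neighbours, in proportion
        # to the share of the neighbour's total edge weight it holds.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n) if out_sums[j] > 0
            )
            for i in range(n)
        ]
    return scores


sentences = [
    "Neo is a hacker who questions his reality.",
    "Morpheus shows Neo that reality is the Matrix.",
    "The weather was pleasant that day.",
]
scores = textrank(sentences)
best = max(range(len(sentences)), key=scores.__getitem__)
print(sentences[best])
```

The sentence that shares the most vocabulary with the others ends up with the highest score, which is the intuition behind extracting it as the summary.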

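Let us install gensim using pip. It can also be installed from the Anaconda Prompt. Note that the summarization module used here was removed in gensim 4.0, so a 3.x release (such as the 3.4.0 shown in the output below) is required.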

In [0]:
!pip install gensim


Collecting gensim
  Downloading https://files.pythonhosted.org/packages/eb/b5/e74d478d9e89528cc869c52a6d794f5a7dc5452585e23ad24db513636dc1/gensim-3.4.0-cp36-cp36m-win_amd64.whl (22.5MB)
Collecting smart-open>=1.2.1 (from gensim)
  Downloading https://files.pythonhosted.org/packages/4b/69/c92661a333f733510628f28b8282698b62cdead37291c8491f3271677c02/smart_open-1.5.7.tar.gz
Collecting bz2file (from smart-open>=1.2.1->gensim)
  Downloading https://files.pythonhosted.org/packages/61/39/122222b5e85cd41c391b68a99ee296584b2a2d1d233e7ee32b4532384f2d/bz2file-0.98.tar.gz
Collecting boto3 (from smart-open>=1.2.1->gensim)
  Downloading https://files.pythonhosted.org/packages/cf/94/a27c0437087d412932a187879af6f7a0839509368a643e6caca229a4529a/boto3-1.7.44-py2.py3-none-any.whl (128kB)
Collecting botocore<1.11.0,>=1.10.44 (from boto3->smart-open>=1.2.1->gensim)
  Downloading https://files.pythonhosted.org/packages/a5/f7/80f2e100e6051ee85caecf852b809b042e260f890516a70e3a7831fc9950/botocore-1.10.44-py2.p

After installing gensim, we import the "summarize" function from gensim.summarization.

In [0]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

from gensim.summarization import summarize

2018-06-22 12:00:54,485 : INFO : 'pattern' package not found; tag filters are not available for English


Here I will demonstrate how to summarize text using a small toy example;
later, we will use a larger piece of text.
The text here is really too short to need a summary, but it suffices as an illustrative example.

In [0]:
text = "Thomas A. Anderson is a man living two lives. By day he is an " + \
    "average computer programmer and by night a hacker known as " + \
    "Neo. Neo has always questioned his reality, but the truth is " + \
    "far beyond his imagination. Neo finds himself targeted by the " + \
    "police when he is contacted by Morpheus, a legendary computer " + \
    "hacker branded a terrorist by the government. Morpheus awakens " + \
    "Neo to the real world, a ravaged wasteland where most of " + \
    "humanity have been captured by a race of machines that live " + \
    "off of the humans' body heat and electrochemical energy and " + \
    "who imprison their minds within an artificial reality known as " + \
    "the Matrix. As a rebel against the machines, Neo must return to " + \
    "the Matrix and confront the agents: super-powerful computer " + \
    "programs devoted to snuffing out Neo and the entire human " + \
    "rebellion. "

print ('Input text:')
print (text)

Input text:
Thomas A. Anderson is a man living two lives. By day he is an average computer programmer and by night a hacker known as Neo. Neo has always questioned his reality, but the truth is far beyond his imagination. Neo finds himself targeted by the police when he is contacted by Morpheus, a legendary computer hacker branded a terrorist by the government. Morpheus awakens Neo to the real world, a ravaged wasteland where most of humanity have been captured by a race of machines that live off of the humans' body heat and electrochemical energy and who imprison their minds within an artificial reality known as the Matrix. As a rebel against the machines, Neo must return to the Matrix and confront the agents: super-powerful computer programs devoted to snuffing out Neo and the entire human rebellion. 


To summarize this text, I pass the raw string to the "summarize" function, which returns a summary.

Please note: I have made sure that the string does not contain newlines in the middle of a sentence.
A sentence with a newline character ("\n") in it will be treated as two sentences.
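When the input comes from a file or a web page, hard line wraps can be collapsed before summarizing. Here is a minimal sketch; the helper name unwrap_lines is my own, and it assumes that blank lines mark paragraph breaks while single newlines are only wraps.

```python
import re


def unwrap_lines(raw):
    """Join hard-wrapped lines so each paragraph sits on a single line."""
    # Normalise Windows ("\r\n") and old Mac ("\r") line endings first.
    raw = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Blank lines are assumed to separate paragraphs.
    paragraphs = re.split(r"\n\s*\n", raw.strip())
    # Collapse remaining whitespace runs (including newlines) to one space.
    return "\n".join(" ".join(p.split()) for p in paragraphs)


wrapped = ("Neo has always questioned\nhis reality, but the truth is\n"
           "far beyond his imagination.")
print(unwrap_lines(wrapped))
```

The cleaned string can then be passed to the summarizer in place of the raw text.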

In [0]:
print ('Summary:')
print (summarize(text))

2018-06-22 12:04:17,335 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-06-22 12:04:17,336 : INFO : built Dictionary(53 unique tokens: ['thoma', 'anderson', 'live', 'man', 'averag']...) from 6 documents (total 68 corpus positions)


Summary:
Morpheus awakens Neo to the real world, a ravaged wasteland where most of humanity have been captured by a race of machines that live off of the humans' body heat and electrochemical energy and who imprison their minds within an artificial reality known as the Matrix.


I can also pass the "split" option if I want the summary as a list of sentences instead of a single string.

In [0]:
print (summarize(text, split=True))

2018-06-22 12:05:26,501 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-06-22 12:05:26,501 : INFO : built Dictionary(53 unique tokens: ['thoma', 'anderson', 'live', 'man', 'averag']...) from 6 documents (total 68 corpus positions)


["Morpheus awakens Neo to the real world, a ravaged wasteland where most of humanity have been captured by a race of machines that live off of the humans' body heat and electrochemical energy and who imprison their minds within an artificial reality known as the Matrix."]



I can adjust how much text the summarizer outputs via the "ratio" parameter or the "word_count" parameter. The "ratio" parameter specifies what fraction of the sentences in the original text should be returned. Below I ask for 50% of the original text (the default is 20%).
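The effect of "ratio" can be pictured as keeping the top fraction of the ranked sentences and printing them in their original order. The sketch below uses made-up scores, and select_by_ratio is a hypothetical helper rather than a gensim function; the exact rounding gensim applies may differ.

```python
def select_by_ratio(sentences, scores, ratio=0.2):
    """Keep the highest-scoring fraction of sentences, in document order."""
    keep = int(len(sentences) * ratio)  # assumption: round down
    # Indices of the `keep` best sentences by score.
    top = sorted(range(len(sentences)),
                 key=lambda i: scores[i], reverse=True)[:keep]
    # Restore the original order so the summary reads naturally.
    return [sentences[i] for i in sorted(top)]


sentences = ["s1", "s2", "s3", "s4", "s5", "s6"]
scores = [0.1, 0.6, 0.3, 0.2, 0.9, 0.7]  # made-up scores
print(select_by_ratio(sentences, scores, ratio=0.5))
```

With six sentences, a ratio of 0.5 keeps three of them and the default of 0.2 keeps one.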

In [0]:
print ('Summary:')
print (summarize(text, ratio=0.5))

2018-06-22 12:08:55,162 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-06-22 12:08:55,163 : INFO : built Dictionary(53 unique tokens: ['thoma', 'anderson', 'live', 'man', 'averag']...) from 6 documents (total 68 corpus positions)


Summary:
By day he is an average computer programmer and by night a hacker known as Neo. Neo has always questioned his reality, but the truth is far beyond his imagination.
Morpheus awakens Neo to the real world, a ravaged wasteland where most of humanity have been captured by a race of machines that live off of the humans' body heat and electrochemical energy and who imprison their minds within an artificial reality known as the Matrix.
As a rebel against the machines, Neo must return to the Matrix and confront the agents: super-powerful computer programs devoted to snuffing out Neo and the entire human rebellion.


Using the "word_count" parameter, I can specify the maximum number of words we want in the summary. Below I have specified that we want no more than 50 words.
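One simple way to honour a word budget is to add the best-ranked sentences until the next one would no longer fit. This is only a sketch of the idea with invented scores; select_by_word_count is a hypothetical helper, and gensim's own stopping rule may differ.

```python
def select_by_word_count(sentences, scores, word_count):
    """Add sentences from best to worst score while the word budget allows."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for i in ranked:
        length = len(sentences[i].split())
        if used + length > word_count:
            break  # stop at the first sentence that does not fit
        chosen.append(i)
        used += length
    # Return the chosen sentences in their original order.
    return [sentences[i] for i in sorted(chosen)]


sentences = [
    "Neo is a hacker.",
    "Morpheus awakens Neo to the real world.",
    "The machines imprison human minds in the Matrix.",
]
scores = [0.2, 0.9, 0.5]  # made-up scores
print(select_by_word_count(sentences, scores, word_count=12))
```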

In [0]:
print ('Summary:')
print (summarize(text, word_count=50))

2018-06-22 12:10:02,774 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-06-22 12:10:02,775 : INFO : built Dictionary(53 unique tokens: ['thoma', 'anderson', 'live', 'man', 'averag']...) from 6 documents (total 68 corpus positions)


Summary:
Morpheus awakens Neo to the real world, a ravaged wasteland where most of humanity have been captured by a race of machines that live off of the humans' body heat and electrochemical energy and who imprison their minds within an artificial reality known as the Matrix.


This module also supports keyword extraction. Keyword extraction works in the same way as summary generation (i.e. sentence extraction), in that the algorithm tries to find words that are important or seem representative of the entire text. The keywords are not always single words; in the case of multi-word keywords, they are typically all nouns.
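The same graph-ranking idea can be applied to words instead of sentences. The following sketch links words that occur close to each other and ranks them with a PageRank-style iteration. It is an illustration only, not gensim's keyword code: the tiny STOPWORDS set, the window size, and the function name rank_keywords are assumptions made for this example.

```python
import re
from collections import defaultdict

# A deliberately tiny stop list, just enough for the sample text.
STOPWORDS = {"a", "an", "and", "as", "but", "by", "has", "he", "his",
             "is", "of", "the", "to"}


def rank_keywords(text, window=2, damping=0.85, iterations=30):
    """Rank words by PageRank over an undirected co-occurrence graph."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS]
    # Link each word to the next `window` words that follow it.
    neighbours = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1:i + window + 1]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)
    scores = {w: 1.0 for w in neighbours}
    for _ in range(iterations):
        # Each word shares its score equally among its neighbours.
        scores = {
            w: (1 - damping) + damping * sum(
                scores[n] / len(neighbours[n]) for n in neighbours[w]
            )
            for w in neighbours
        }
    return sorted(scores, key=scores.get, reverse=True)


sample = ("Neo is a hacker. Neo questions reality. "
          "Morpheus awakens Neo. Neo returns to the Matrix.")
print(rank_keywords(sample)[:3])
```

Words that co-occur with many different words, such as a main character's name, tend to rise to the top of the ranking.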

In [0]:
from gensim.summarization import keywords

print ('Keywords:')
print (keywords(text))

Keywords:
humanity
human
neo
humans body
super
reality
hacker


Here I will use another example with a larger piece of text: a synopsis of the movie "The Matrix", taken from its IMDb page.

In the code below, I read the text file directly from a web page using "requests". Then I demonstrate how to produce a summary and some keywords.

In [0]:
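import requests

# Download the synopsis and stop early if the request fails.
response = requests.get('http://rare-technologies.com/the_matrix_synopsis.txt', timeout=30)
response.raise_for_status()
text = response.text

# ratio=0.01 asks for a very short summary and keyword list.
print('Summary:')
print(summarize(text, ratio=0.01))

print('\nKeywords:')
print(keywords(text, ratio=0.01))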

2018-06-22 12:12:29,709 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-06-22 12:12:29,714 : INFO : built Dictionary(1093 unique tokens: ['cascad', 'code', 'fill', 'give', 'green']...) from 416 documents (total 2985 corpus positions)


Summary:
Anderson, a software engineer for a Metacortex, the other life as Neo, a computer hacker "guilty of virtually every computer crime we have a law for." Agent Smith asks him to help them capture Morpheus, a dangerous terrorist, in exchange for amnesty.
Morpheus explains that he's been searching for Neo his entire life and asks if Neo feels like "Alice in Wonderland, falling down the rabbit hole." He explains to Neo that they exist in the Matrix, a false reality that has been constructed for humans to hide the truth.
Neo is introduced to Morpheus's crew including Trinity; Apoc (Julian Arahanga), a man with long, flowing black hair; Switch; Cypher (bald with a goatee); two brawny brothers, Tank (Marcus Chong) and Dozer (Anthony Ray Parker); and a young, thin man named Mouse (Matt Doran).
Trinity brings the helicopter down to the floor that Morpheus is on and Neo opens fire on the three Agents.

Keywords:
neo
morpheus
trinity
cypher
agents
agent
smith
tank
says
saying


Anyone who has seen the movie will recognize that this summary is quite good.
Some of the most important characters (Neo, Morpheus, Trinity) were also extracted
as keywords.

Let me demonstrate another example similar to the one above.
This time, I will use the IMDb synopsis of the movie "The Big Lebowski".

Again, I download the text and demonstrate how to produce a summary and some keywords.

In [0]:
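import requests

# Download the synopsis and stop early if the request fails.
response = requests.get('http://rare-technologies.com/the_big_lebowski_synopsis.txt', timeout=30)
response.raise_for_status()
text = response.text

print('Summary:')
print(summarize(text, ratio=0.01))

print('\nKeywords:')
print(keywords(text, ratio=0.01))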

2018-06-22 12:14:41,888 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-06-22 12:14:41,892 : INFO : built Dictionary(1054 unique tokens: ['angel', 'elliott', 'fella', 'hillsid', 'jeffrei']...) from 227 documents (total 2434 corpus positions)


Summary:
Dude agrees to meet with the Big Lebowski, hoping to get compensation for his rug since it "really tied the room together" and figures that his wife, Bunny, shouldn't be owing money around town.
Walter resolves to go to Plan B; he tells Larry to watch out the window as he and Dude go back out to the car where Donny is waiting.

Keywords:
dude
dudes
walter
lebowski
brandt
maude
donny
bunny


It's interesting to note that the keywords above include several of the main characters in the movie.

Performance:

Performance was good on the above datasets. These tests were run on my system:
Windows 10 Home (© 2018 Microsoft Corporation) with
an Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz 1.99 GHz.

The tests also ran on an Intel Core i5-4210U CPU @ 1.70 GHz x 4 processor, and the summarizer was able to process the book "Honest Abe" by Alonzo Rothschild as well.
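To put numbers on speed, a call can be timed with the standard library. Here is a minimal sketch: time_call is a hypothetical helper of my own, and the lambda at the bottom is only a stand-in workload for a call such as summarize(text, ratio=0.01).

```python
import time


def time_call(func, *args, repeats=3, **kwargs):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload; in the notebook this could be
# time_call(summarize, text, ratio=0.01) instead.
elapsed = time_call(lambda: sum(range(100_000)))
print(f"best of 3 runs: {elapsed:.4f} s")
```

Taking the best of several runs reduces the influence of other processes on the measurement, which makes results from different machines easier to compare.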