NLP Text Summarization

In [1]:
!pip install -U spacy

!python -m spacy download en_core_web_sm

Collecting spacy
  Downloading spacy-2.3.5-cp37-cp37m-manylinux2014_x86_64.whl (10.4 MB)
Collecting numpy>=1.15.0
  Downloading numpy-1.19.5-cp37-cp37m-manylinux2010_x86_64.whl (14.8 MB)
Collecting plac<1.2.0,>=0.9.6
  Downloading plac-1.1.3-py2.py3-none-any.whl (20 kB)
Collecting cymem<2.1.0,>=2.0.2
  Downloading cymem-2.0.5-cp37-cp37m-manylinux2014_x86_64.whl (35 kB)
Collecting preshed<3.1.0,>=3.0.2
  Downloading preshed-3.0.5-cp37-cp37m-manylinux2014_x86_64.whl (126 kB)
Collecting blis<0.8.0,>=0.4.0
  Downloading blis-0.7.4-cp37-cp37m-manylinux2014_x86_64.whl (9.8 MB)
Collecting thinc<7.5.0,>=7.4.1
  Downloading thin

In [15]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

In [16]:
stopWords = list(STOP_WORDS)

In [17]:
introductionText = """
The capacity of language is what differentiates humans from other species. Humans learned how to speak around 100000 years ago, and 3000 years after humans learned how to write. Furthermore, this ability to communicate and share information is what makes humans successful. However, human language is one of the most complex and diverse parts of our species; there are 6500 languages spoken in the world. One of the means of communication that we use is communication online. According to industry estimates, only 20% of our data acquired through our messages and online activities has a structured form. The rest of our data is in an unstructured textual form. The Web consists of over a trillion pages of information, and as mentioned, is mostly in natural language. For knowledge acquisition to be possible, an agent needs to partially be able to interpret, the ambiguous natural language used. To be able to do so, it is important to understand the techniques of text analysis and natural language processing. Furthermore, text analysis/ text mining is the process of deriving significant information from natural language text; whereas, natural language processing is part of computer science and artificial intelligence that deals with human languages. This paper will cover natural language processing and some algorithms used such as Tokenization, Stemming, and Lemmatizing."""

In [18]:
nlp = spacy.load('en_core_web_sm')

In [19]:
doc = nlp(introductionText)

In [20]:
tokens = [token.text for token in doc]
print(tokens)

['\n', 'The', 'capacity', 'of', 'language', 'is', 'what', 'differentiates', 'humans', 'from', 'other', 'species', '.', 'Humans', 'learned', 'how', 'to', 'speak', 'around', '100000', 'years', 'ago', ',', 'and', '3000', 'years', 'after', 'humans', 'learned', 'how', 'to', 'write', '.', 'Furthermore', ',', 'this', 'ability', 'to', 'communicate', 'and', 'share', 'information', 'is', 'what', 'makes', 'humans', 'successful', '.', 'However', ',', 'human', 'language', 'is', 'one', 'of', 'the', 'most', 'complex', 'and', 'diverse', 'parts', 'of', 'our', 'species', ';', 'there', 'are', '6500', 'languages', 'spoken', 'in', 'the', 'world', '.', 'One', 'of', 'the', 'means', 'of', 'communication', 'that', 'we', 'use', 'is', 'communication', 'online', '.', 'According', 'to', 'industry', 'estimates', ',', 'only', '20', '%', 'of', 'our', 'data', 'acquired', 'through', 'our', 'messages', 'and', 'online', 'activities', 'has', 'a', 'structured', 'form', '.', 'The', 'rest', 'of', 'our', 'data', 'is', 'in', '

In [21]:
punctuation = punctuation + '\n'
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\n'
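
Because `punctuation` is a plain string, the `in` test used in the next cell is a substring check, not a per-character lookup. The sketch below is a small illustration using only the standard library; the names `punct_string` and `punct_set` are introduced here for the example and are not part of the notebook.

```python
from string import punctuation

# A string supports substring tests, so multi-character runs can match too.
punct_string = punctuation + "\n"
# A set only matches whole items, which are single characters here.
punct_set = set(punct_string)

for candidate in ["!", "./", ""]:
    # Print the candidate, then whether the string and the set contain it.
    print(repr(candidate), candidate in punct_string, candidate in punct_set)
```

For single-character tokens the two checks agree. Switching to a set would only matter for multi-character or empty tokens.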

In [22]:
wordFrequencies = {}
for word in doc:
    text = word.text
    # Skip stop words and punctuation (both compared in lowercase).
    if text.lower() in stopWords or text.lower() in punctuation:
        continue
    # Keys keep their original casing, so 'Humans' and 'humans' are counted separately.
    wordFrequencies[text] = wordFrequencies.get(text, 0) + 1

In [23]:
print(wordFrequencies)

{'capacity': 1, 'language': 8, 'differentiates': 1, 'humans': 3, 'species': 2, 'Humans': 1, 'learned': 2, 'speak': 1, '100000': 1, 'years': 2, 'ago': 1, '3000': 1, 'write': 1, 'Furthermore': 2, 'ability': 1, 'communicate': 1, 'share': 1, 'information': 3, 'makes': 1, 'successful': 1, 'human': 2, 'complex': 1, 'diverse': 1, 'parts': 1, '6500': 1, 'languages': 2, 'spoken': 1, 'world': 1, 'means': 1, 'communication': 2, 'use': 1, 'online': 2, 'According': 1, 'industry': 1, 'estimates': 1, '20': 1, 'data': 2, 'acquired': 1, 'messages': 1, 'activities': 1, 'structured': 1, 'form': 2, 'rest': 1, 'unstructured': 1, 'textual': 1, 'Web': 1, 'consists': 1, 'trillion': 1, 'pages': 1, 'mentioned': 1, 'natural': 6, 'knowledge': 1, 'acquisition': 1, 'possible': 1, 'agent': 1, 'needs': 1, 'partially': 1, 'able': 2, 'interpret': 1, 'ambiguous': 1, 'important': 1, 'understand': 1, 'techniques': 1, 'text': 4, 'analysis': 1, 'processing': 3, 'analysis/': 1, 'mining': 1, 'process': 1, 'deriving': 1, 'sign

In [24]:
maxFrequency = max(wordFrequencies.values())
maxFrequency

8

In [25]:
for word in wordFrequencies.keys():
    wordFrequencies[word] = wordFrequencies[word]/maxFrequency

print(wordFrequencies)

{'capacity': 0.125, 'language': 1.0, 'differentiates': 0.125, 'humans': 0.375, 'species': 0.25, 'Humans': 0.125, 'learned': 0.25, 'speak': 0.125, '100000': 0.125, 'years': 0.25, 'ago': 0.125, '3000': 0.125, 'write': 0.125, 'Furthermore': 0.25, 'ability': 0.125, 'communicate': 0.125, 'share': 0.125, 'information': 0.375, 'makes': 0.125, 'successful': 0.125, 'human': 0.25, 'complex': 0.125, 'diverse': 0.125, 'parts': 0.125, '6500': 0.125, 'languages': 0.25, 'spoken': 0.125, 'world': 0.125, 'means': 0.125, 'communication': 0.25, 'use': 0.125, 'online': 0.25, 'According': 0.125, 'industry': 0.125, 'estimates': 0.125, '20': 0.125, 'data': 0.25, 'acquired': 0.125, 'messages': 0.125, 'activities': 0.125, 'structured': 0.125, 'form': 0.25, 'rest': 0.125, 'unstructured': 0.125, 'textual': 0.125, 'Web': 0.125, 'consists': 0.125, 'trillion': 0.125, 'pages': 0.125, 'mentioned': 0.125, 'natural': 0.75, 'knowledge': 0.125, 'acquisition': 0.125, 'possible': 0.125, 'agent': 0.125, 'needs': 0.125, 'par
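
The output above keeps 'Humans' and 'humans' as separate entries. One possible variation, sketched here with the standard library only, is to lowercase every word before counting and then apply the same max-normalization. The stop-word set `STOP`, the sample `text`, and the regex tokenizer are assumptions made for this illustration; the notebook itself relies on spaCy's tokenizer and `STOP_WORDS`.

```python
import re
from collections import Counter

# Hypothetical miniature stop-word list; the notebook uses spaCy's STOP_WORDS.
STOP = {"the", "of", "is", "what", "from", "other", "how", "to", "and", "after", "around"}

text = "Humans learned how to speak. Later humans learned how to write."

# Crude regex tokenizer (an assumption; spaCy's tokenizer is more careful).
words = [w.lower() for w in re.findall(r"[A-Za-z0-9]+", text)]
counts = Counter(w for w in words if w not in STOP)

# Scale so the most frequent word has weight 1.0.
top = max(counts.values())
weights = {w: c / top for w, c in counts.items()}
print(weights)
```

With lowercase keys, a later lookup by `word.text.lower()` would find every counted word regardless of how it was capitalized in the text.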

In [26]:
sentenceTokens = list(doc.sents)
print(sentenceTokens)

[
The capacity of language is what differentiates humans from other species., Humans learned how to speak around 100000 years ago, and 3000 years after humans learned how to write., Furthermore, this ability to communicate and share information is what makes humans successful., However, human language is one of the most complex and diverse parts of our species; there are 6500 languages spoken in the world., One of the means of communication that we use is communication online., According to industry estimates, only 20% of our data acquired through our messages and online activities has a structured form., The rest of our data is in an unstructured textual form., The Web consists of over a trillion pages of information, and as mentioned, is mostly in natural language., For knowledge acquisition to be possible, an agent needs to partially be able to interpret, the ambiguous natural language used., To be able to do so, it is important to understand the techniques of text analysis and natu

In [27]:
sentenceScore = {}
for sentence in sentenceTokens:
    for word in sentence:
        key = word.text.lower()
        # Note: the lookup is lowercased, but wordFrequencies keys keep their
        # original casing, so capitalized-only keys such as 'Web' never match.
        if key in wordFrequencies:
            sentenceScore[sentence] = sentenceScore.get(sentence, 0) + wordFrequencies[key]

sentenceScore

{
 The capacity of language is what differentiates humans from other species.: 1.875,
 Humans learned how to speak around 100000 years ago, and 3000 years after humans learned how to write.: 2.375,
 Furthermore, this ability to communicate and share information is what makes humans successful.: 1.375,
 However, human language is one of the most complex and diverse parts of our species; there are 6500 languages spoken in the world.: 2.5,
 One of the means of communication that we use is communication online.: 1.0,
 According to industry estimates, only 20% of our data acquired through our messages and online activities has a structured form.: 1.625,
 The rest of our data is in an unstructured textual form.: 0.875,
 The Web consists of over a trillion pages of information, and as mentioned, is mostly in natural language.: 2.625,
 For knowledge acquisition to be possible, an agent needs to partially be able to interpret, the ambiguous natural language used.: 3.0,
 To be able to do so, it 
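
The scoring step adds up the normalized weight of every scored word in a sentence. The sketch below reproduces that idea with the standard library on made-up data; `weights`, `sentences`, and the regex word split are assumptions for illustration only. Unlike the notebook cell, a sentence with no scored word gets an explicit score of 0.0 here instead of being left out of the dictionary.

```python
import re

# Hypothetical normalized weights, in the same shape as wordFrequencies.
weights = {"language": 1.0, "natural": 0.75, "text": 0.5}

sentences = [
    "Natural language is ambiguous.",
    "Text mining works on natural language text.",
    "The rest is unstructured.",
]

scores = {}
for sent in sentences:
    # Lowercase once, then add the weight of each word (0.0 if unweighted).
    for word in re.findall(r"[a-z0-9]+", sent.lower()):
        scores[sent] = scores.get(sent, 0.0) + weights.get(word, 0.0)

print(scores)
```

A summed score tends to grow with sentence length. Dividing by the number of scored words is one possible alternative, though the notebook does not do this.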

In [28]:
from heapq import nlargest

In [29]:
# Number of sentences to keep in the summary: half of all sentences.
sentenceLength = int(len(sentenceTokens)*0.5)
sentenceLength

6

In [30]:
summary = nlargest(sentenceLength, sentenceScore, key = sentenceScore.get)
summary


[Furthermore, text analysis/ text mining is the process of deriving significant information from natural language text; whereas, natural language processing is part of computer science and artificial intelligence that deals with human languages.,
 To be able to do so, it is important to understand the techniques of text analysis and natural language processing.,
 For knowledge acquisition to be possible, an agent needs to partially be able to interpret, the ambiguous natural language used.,
 The Web consists of over a trillion pages of information, and as mentioned, is mostly in natural language.,
 However, human language is one of the most complex and diverse parts of our species; there are 6500 languages spoken in the world.,
 This paper will cover natural language processing and some algorithms used such as Tokenization, Stemming, and Lemmatizing.]

In [31]:
# Each item in summary is a sentence span; join the sentence texts into one string.
final_summary = [sentence.text for sentence in summary]
summary = ' '.join(final_summary)

In [32]:
summary

'Furthermore, text analysis/ text mining is the process of deriving significant information from natural language text; whereas, natural language processing is part of computer science and artificial intelligence that deals with human languages. To be able to do so, it is important to understand the techniques of text analysis and natural language processing. For knowledge acquisition to be possible, an agent needs to partially be able to interpret, the ambiguous natural language used. The Web consists of over a trillion pages of information, and as mentioned, is mostly in natural language. However, human language is one of the most complex and diverse parts of our species; there are 6500 languages spoken in the world. This paper will cover natural language processing and some algorithms used such as Tokenization, Stemming, and Lemmatizing.'
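
`nlargest` returns the selected sentences in descending score order, which is why the summary above does not follow the order of the original text. If document order is preferred, one option is to rank sentence indices and sort the winners before joining. This is a minimal sketch on made-up data: `sentences`, `scores`, and `ordered_summary` are hypothetical stand-ins, not names from the notebook.

```python
from heapq import nlargest

# Hypothetical sentences and scores standing in for sentenceTokens / sentenceScore.
sentences = ["First point.", "Minor aside.", "Key finding.", "Closing remark."]
scores = [2.0, 0.5, 3.0, 1.0]

k = int(len(sentences) * 0.5)  # same 50% ratio as the notebook

# Rank sentence indices by score, then sort the winners back into text order.
top_indices = nlargest(k, range(len(sentences)), key=scores.__getitem__)
ordered_summary = " ".join(sentences[i] for i in sorted(top_indices))
print(ordered_summary)
```

The selection is the same as before; only the final ordering changes.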