# Text Summarization with SpaCy


Text summarization is the process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks).


The idea of summarization is to find a subset of the data that contains the “information” of the entire set.


# Main Idea:
    #Text Preprocessing (remove stopwords, punctuation).
    #Frequency table of words/Word Frequency Distribution - how many times each word appears in the document
    #Score each sentence depending on the words it contains and the frequency table
    #Build summary by joining every sentence above a certain score limit
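The four steps above can be sketched end to end in plain Python before bringing in spaCy. This is only an illustration under stated assumptions: `TOY_STOPWORDS`, `_words`, and `summarize` are hypothetical names, the regex sentence splitter is a crude stand-in for spaCy's sentence segmentation, and the final step keeps the `n` best sentences (as the cells below do with `nlargest`) rather than applying a score threshold.

```python
import re
from collections import Counter
from heapq import nlargest

# Hypothetical, deliberately tiny stopword set; the notebook uses spaCy's STOP_WORDS.
TOY_STOPWORDS = {"the", "is", "a", "of", "and", "to", "in", "on"}

def _words(text):
    """Step 1: lowercase word tokens, dropping punctuation and stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in TOY_STOPWORDS]

def summarize(text, n=1):
    # Naive sentence split after ., ! or ? (stand-in for spaCy's docx.sents).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Step 2: frequency table over the whole text.
    freq = Counter(_words(text))
    # Step 3: score each sentence by summing the frequencies of its words.
    scores = {s: sum(freq[w] for w in _words(s)) for s in sentences}
    # Step 4: keep the n highest-scoring sentences and join them.
    return " ".join(nlargest(n, scores, key=scores.get))

print(summarize("Cats chase mice. Dogs chase cats and mice. Birds sing.", n=1))
```

The rest of the notebook follows the same outline, with spaCy handling tokenization and sentence splitting.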

In [1]:
# Load Pkgs
import spacy

In [2]:
# Text Preprocessing Pkg
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

In [3]:
# Build a List of Stopwords
stopwords = list(STOP_WORDS)
stopwords


['under',
 'whoever',
 'nevertheless',
 'empty',
 'used',
 'whose',
 'thereafter',
 'first',
 'while',
 'were',
 'some',
 'our',
 'will',
 'front',
 'less',
 'somehow',
 'next',
 'twelve',
 '’ve',
 'someone',
 'enough',
 'you',
 'am',
 'really',
 'during',
 'everyone',
 'about',
 'fifty',
 'whereas',
 'down',
 'is',
 'ever',
 'nothing',
 'however',
 'give',
 'off',
 'regarding',
 'which',
 'its',
 'five',
 'what',
 'namely',
 'as',
 'such',
 'well',
 'with',
 'before',
 'by',
 'anything',
 'here',
 'everything',
 'upon',
 'four',
 'another',
 'one',
 'ten',
 'this',
 'n‘t',
 'how',
 'i',
 'make',
 'hence',
 'must',
 'others',
 'out',
 'onto',
 'seeming',
 'if',
 'was',
 'anywhere',
 'his',
 'they',
 'nobody',
 'below',
 'same',
 'that',
 'serious',
 'whatever',
 'into',
 'noone',
 'cannot',
 'three',
 'name',
 'itself',
 'most',
 'therefore',
 'than',
 'after',
 'per',
 'herein',
 'either',
 '‘re',
 'where',
 'seems',
 'rather',
 'though',
 'became',
 'on',
 'no',
 'had',
 'ca',
 'top'

In [4]:
document1 ="""Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning.In its application across business problems, machine learning is also referred to as predictive analytics."""

In [5]:
document1

'Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning.In its application across business p

In [31]:
len(document1)

1069

In [6]:
document2 = """Our Father who art in heaven, hallowed be thy name. Thy kingdom come. Thy will be done, on earth as it is in heaven. Give us this day our daily bread; and forgive us our trespasses, as we forgive those who trespass against us; and lead us not into temptation, but deliver us from evil
"""

In [7]:
document2

'Our Father who art in heaven, hallowed be thy name. Thy kingdom come. Thy will be done, on earth as it is in heaven. Give us this day our daily bread; and forgive us our trespasses, as we forgive those who trespass against us; and lead us not into temptation, but deliver us from evil\n'

In [32]:
len(document2)

285

In [8]:
nlp = spacy.load('en_core_web_sm')

In [9]:
# Build an NLP Object
docx = nlp(document1)
docx


Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning.In its application across business pr

In [10]:
# Tokenization of Text
mytokens = [token.text for token in docx]
mytokens

['Machine',
 'learning',
 '(',
 'ML',
 ')',
 'is',
 'the',
 'scientific',
 'study',
 'of',
 'algorithms',
 'and',
 'statistical',
 'models',
 'that',
 'computer',
 'systems',
 'use',
 'to',
 'progressively',
 'improve',
 'their',
 'performance',
 'on',
 'a',
 'specific',
 'task',
 '.',
 'Machine',
 'learning',
 'algorithms',
 'build',
 'a',
 'mathematical',
 'model',
 'of',
 'sample',
 'data',
 ',',
 'known',
 'as',
 '"',
 'training',
 'data',
 '"',
 ',',
 'in',
 'order',
 'to',
 'make',
 'predictions',
 'or',
 'decisions',
 'without',
 'being',
 'explicitly',
 'programmed',
 'to',
 'perform',
 'the',
 'task',
 '.',
 'Machine',
 'learning',
 'algorithms',
 'are',
 'used',
 'in',
 'the',
 'applications',
 'of',
 'email',
 'filtering',
 ',',
 'detection',
 'of',
 'network',
 'intruders',
 ',',
 'and',
 'computer',
 'vision',
 ',',
 'where',
 'it',
 'is',
 'infeasible',
 'to',
 'develop',
 'an',
 'algorithm',
 'of',
 'specific',
 'instructions',
 'for',
 'performing',
 'the',
 'task',
 '.

# Word Frequency Table


#dictionary of words and their counts
#How many times each word appears in the document
#Using non-stopwords

In [11]:
# Build Word Frequency
# word.text is the raw text of each spaCy token
word_frequencies = {}
for word in docx:
    if word.text not in stopwords:
        if word.text not in word_frequencies:
            word_frequencies[word.text] = 1
        else:
            word_frequencies[word.text] += 1
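Because the cell above compares the raw token text, punctuation tokens are counted too and capitalized forms get their own keys; in the output that follows, `','` ends up as the most frequent entry and `'Machine'` and `'machine'` are counted separately. A possible variant, shown here as a sketch rather than as the notebook's method, lowercases each token and skips punctuation. `count_words` is a hypothetical helper that takes plain strings so it runs without spaCy; with spaCy tokens, the `is_stop` and `is_punct` attributes could serve the same purpose.

```python
from collections import Counter
from string import punctuation

def count_words(tokens, stopwords):
    """Count tokens after lowercasing, skipping stopwords and punctuation."""
    counts = Counter()
    for tok in tokens:
        word = tok.lower()
        # Skip stopwords and tokens made up only of punctuation characters.
        if word in stopwords or all(ch in punctuation for ch in word):
            continue
        counts[word] += 1
    return counts

# Hand-written token list standing in for [token.text for token in docx].
tokens = ["Machine", "learning", "(", "ML", ")", "is", "the",
          "study", "of", "machine", "learning", "."]
print(count_words(tokens, stopwords={"is", "the", "of"}))
```

Switching to this variant would change the counts and scores shown in the remaining cells, so the outputs below still reflect the original loop.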

In [12]:
word_frequencies

{'Machine': 4,
 'learning': 8,
 '(': 1,
 'ML': 1,
 ')': 1,
 'scientific': 1,
 'study': 3,
 'algorithms': 3,
 'statistical': 1,
 'models': 1,
 'computer': 2,
 'systems': 1,
 'use': 1,
 'progressively': 1,
 'improve': 1,
 'performance': 1,
 'specific': 2,
 'task': 3,
 '.': 7,
 'build': 1,
 'mathematical': 2,
 'model': 1,
 'sample': 1,
 'data': 3,
 ',': 9,
 'known': 1,
 '"': 2,
 'training': 1,
 'order': 1,
 'predictions': 2,
 'decisions': 1,
 'explicitly': 1,
 'programmed': 1,
 'perform': 1,
 'applications': 1,
 'email': 1,
 'filtering': 1,
 'detection': 1,
 'network': 1,
 'intruders': 1,
 'vision': 1,
 'infeasible': 1,
 'develop': 1,
 'algorithm': 1,
 'instructions': 1,
 'performing': 1,
 'closely': 1,
 'related': 1,
 'computational': 1,
 'statistics': 1,
 'focuses': 2,
 'making': 1,
 'computers': 1,
 'The': 1,
 'optimization': 1,
 'delivers': 1,
 'methods': 1,
 'theory': 1,
 'application': 2,
 'domains': 1,
 'field': 2,
 'machine': 3,
 'Data': 1,
 'mining': 1,
 'exploratory': 1,
 'analy

# Maximum Word Frequency


#Find the weighted frequency
#Each word's count divided by the count of the most frequent word
#Long sentence over short sentence

In [13]:
# Maximum Word Frequency
maximum_frequency = max(word_frequencies.values())
maximum_frequency

9

In [14]:
for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word] / maximum_frequency
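As a small worked example of this normalization, the snippet below takes four counts from the table above and divides each by the largest one. The names `counts`, `max_count`, and `weights` are illustrative only; the notebook instead updates `word_frequencies` in place, which is safe because the loop changes values without adding or removing keys.

```python
# A few raw counts copied from the word frequency table above.
counts = {"learning": 8, "Machine": 4, "study": 3, ",": 9}

max_count = max(counts.values())

# Build a new dict so the raw counts stay available for inspection.
weights = {word: count / max_count for word, count in counts.items()}

for word, weight in weights.items():
    print(f"{word!r}: {weight:.4f}")
```

Every weight lands in the range (0, 1], with the most frequent token at exactly 1.0.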

# Word Frequency Distribution


In [15]:
# Frequency Table
word_frequencies

{'Machine': 0.4444444444444444,
 'learning': 0.8888888888888888,
 '(': 0.1111111111111111,
 'ML': 0.1111111111111111,
 ')': 0.1111111111111111,
 'scientific': 0.1111111111111111,
 'study': 0.3333333333333333,
 'algorithms': 0.3333333333333333,
 'statistical': 0.1111111111111111,
 'models': 0.1111111111111111,
 'computer': 0.2222222222222222,
 'systems': 0.1111111111111111,
 'use': 0.1111111111111111,
 'progressively': 0.1111111111111111,
 'improve': 0.1111111111111111,
 'performance': 0.1111111111111111,
 'specific': 0.2222222222222222,
 'task': 0.3333333333333333,
 '.': 0.7777777777777778,
 'build': 0.1111111111111111,
 'mathematical': 0.2222222222222222,
 'model': 0.1111111111111111,
 'sample': 0.1111111111111111,
 'data': 0.3333333333333333,
 ',': 1.0,
 'known': 0.1111111111111111,
 '"': 0.2222222222222222,
 'training': 0.1111111111111111,
 'order': 0.1111111111111111,
 'predictions': 0.2222222222222222,
 'decisions': 0.1111111111111111,
 'explicitly': 0.1111111111111111,
 'programm

# Sentence Score and Ranking of Words in Each Sentence


#Sentence Tokens
#Scoring every sentence based on the words it contains
#Only non-stopwords found in our word frequency table contribute

In [16]:
# Sentence Tokens
sentence_list = [ sentence for sentence in docx.sents ]
sentence_list

[Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task.,
 Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.,
 Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task.,
 Machine learning is closely related to computational statistics, which focuses on making predictions using computers.,
 The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.,
 Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning.,
 In its application acro

In [17]:
# Example of Sentence Tokenization,Word Tokenization and Lowering All Text
# for t in sentence_list:
#     for w in t:
#         print(w.text.lower())
[w.text.lower() for t in sentence_list for w in t ]

['machine',
 'learning',
 '(',
 'ml',
 ')',
 'is',
 'the',
 'scientific',
 'study',
 'of',
 'algorithms',
 'and',
 'statistical',
 'models',
 'that',
 'computer',
 'systems',
 'use',
 'to',
 'progressively',
 'improve',
 'their',
 'performance',
 'on',
 'a',
 'specific',
 'task',
 '.',
 'machine',
 'learning',
 'algorithms',
 'build',
 'a',
 'mathematical',
 'model',
 'of',
 'sample',
 'data',
 ',',
 'known',
 'as',
 '"',
 'training',
 'data',
 '"',
 ',',
 'in',
 'order',
 'to',
 'make',
 'predictions',
 'or',
 'decisions',
 'without',
 'being',
 'explicitly',
 'programmed',
 'to',
 'perform',
 'the',
 'task',
 '.',
 'machine',
 'learning',
 'algorithms',
 'are',
 'used',
 'in',
 'the',
 'applications',
 'of',
 'email',
 'filtering',
 ',',
 'detection',
 'of',
 'network',
 'intruders',
 ',',
 'and',
 'computer',
 'vision',
 ',',
 'where',
 'it',
 'is',
 'infeasible',
 'to',
 'develop',
 'an',
 'algorithm',
 'of',
 'specific',
 'instructions',
 'for',
 'performing',
 'the',
 'task',
 '.

In [18]:
# Sentence score: sum the weighted frequency of every word in the sentence
sentence_scores = {}  
for sent in sentence_list:  
        for word in sent:
            if word.text.lower() in word_frequencies.keys():
                if len(sent.text.split(' ')) < 30:
                    if sent not in sentence_scores.keys():
                        sentence_scores[sent] = word_frequencies[word.text.lower()]
                    else:
                        sentence_scores[sent] += word_frequencies[word.text.lower()]
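Two details of the loop above are worth noting. Sentences with 30 or more space-separated words are skipped, which is consistent with the 33-word "email filtering" sentence being absent from the scores shown next. The lookup also uses `word.text.lower()`, while the frequency table was built from the original-case text, so capitalized keys such as `'Machine'` are never matched. The sketch below shows the same scoring idea on plain strings; `score_sentences` and its `max_words` parameter are hypothetical, and it assumes the weights are keyed by lowercase words.

```python
def score_sentences(sentences, weights, max_words=30):
    """Sum per-word weights for each sentence shorter than max_words words."""
    scores = {}
    for sent in sentences:
        words = sent.split()
        if len(words) >= max_words:
            continue  # length cutoff, similar to the notebook's "< 30" check
        # Strip surrounding punctuation and lowercase before the lookup.
        # Unlike the notebook loop, sentences with no known words score 0.0.
        scores[sent] = sum(
            weights.get(w.strip(".,;:!?\"'()").lower(), 0.0) for w in words
        )
    return scores

weights = {"machine": 0.5, "learning": 1.0}
print(score_sentences(["Machine learning is fun.", "Nothing here."], weights))
```

Because scores are plain sums, sentences with more matching words tend to rank higher, which is one reason a length cutoff is applied.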

In [19]:
sentence_scores

{Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task.: 4.555555555555556,
 Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.: 7.333333333333331,
 Machine learning is closely related to computational statistics, which focuses on making predictions using computers.: 4.111111111111112,
 The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.: 4.555555555555556,
 Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning.: 5.777777777777778,
 In its application across business problems, machine learning is also referred to as predictive analytics.: 3.7777777777777777}

# Finding the Top N Sentences with the Largest Score

In [20]:
from heapq import nlargest

In [21]:
summarized_sentences = nlargest(7, sentence_scores, key=sentence_scores.get)
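`nlargest(n, iterable, key=...)` returns at most `n` items, ordered from highest to lowest key. Iterating a dict yields its keys, so passing `sentence_scores.get` as the key ranks sentences by their score. When `n` exceeds the number of entries, every entry comes back, which is why asking for 7 sentences out of the 6 scored ones yields all 6. A toy illustration with made-up scores (the names `s1` to `s3` are placeholders):

```python
from heapq import nlargest

scores = {"s1": 4.6, "s2": 7.3, "s3": 4.1}

# Two best-scoring keys, highest first.
top_two = nlargest(2, scores, key=scores.get)

# n larger than the dict: all keys are returned, still sorted by score.
all_ranked = nlargest(7, scores, key=scores.get)

print(top_two)
print(all_ranked)
```

Choosing an `n` smaller than the number of scored sentences is what actually shortens the text; the returned order follows the scores, not the original document order.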



In [22]:
len(summarized_sentences)

6

In [23]:
# Convert Sentences from Spacy Span to Strings for joining entire sentence
for w in summarized_sentences:
    print(w.text)

Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.
Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning.
Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task.
The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.
Machine learning is closely related to computational statistics, which focuses on making predictions using computers.
In its application across business problems, machine learning is also referred to as predictive analytics.


In [24]:
# List Comprehension of Sentences Converted From Spacy.span to strings
final_sentences = [ w.text for w in summarized_sentences ]
final_sentences

['Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.',
 'Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning.',
 'Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task.',
 'The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.',
 'Machine learning is closely related to computational statistics, which focuses on making predictions using computers.',
 'In its application across business problems, machine learning is also referred to as predictive analytics.']

In [25]:
summary = ' '.join(final_sentences)

In [26]:
summary

'Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task. Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning. Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. In its application across business problems, machine learning is also referred to as predictive analytics.'

In [27]:
len(summary)

843

In [28]:
# Place All in a Function for Reusability
def text_summarizer(raw_docx):
    raw_text = raw_docx
    docx = nlp(raw_text)
    stopwords = list(STOP_WORDS)
    # Build Word Frequency
    # word.text is the raw text of each spaCy token
    word_frequencies = {}  
    for word in docx:  
        if word.text not in stopwords:
            if word.text not in word_frequencies.keys():
                word_frequencies[word.text] = 1
            else:
                word_frequencies[word.text] += 1


    maximum_frequency = max(word_frequencies.values())

    for word in word_frequencies.keys():
        word_frequencies[word] = word_frequencies[word] / maximum_frequency
    # Sentence Tokens
    sentence_list = [ sentence for sentence in docx.sents ]

    # Calculate Sentence Score and Ranking
    sentence_scores = {}  
    for sent in sentence_list:  
        for word in sent:
            if word.text.lower() in word_frequencies.keys():
                if len(sent.text.split(' ')) < 30:
                    if sent not in sentence_scores.keys():
                        sentence_scores[sent] = word_frequencies[word.text.lower()]
                    else:
                        sentence_scores[sent] += word_frequencies[word.text.lower()]

    # Find N Largest
    summary_sentences = nlargest(7, sentence_scores, key=sentence_scores.get)
    final_sentences = [ w.text for w in summary_sentences ]
    summary = ' '.join(final_sentences)
    print("Original Document\n")
    print(raw_docx)
    print("Total Length:",len(raw_docx))
    print('\n\nSummarized Document\n')
    print(summary)
    print("Total Length:",len(summary))


In [29]:
text_summarizer(document2)

Original Document

Our Father who art in heaven, hallowed be thy name. Thy kingdom come. Thy will be done, on earth as it is in heaven. Give us this day our daily bread; and forgive us our trespasses, as we forgive those who trespass against us; and lead us not into temptation, but deliver us from evil

Total Length: 285


Summarized Document

Our Father who art in heaven, hallowed be thy name. Thy will be done, on earth as it is in heaven. Thy kingdom come.
Total Length: 116


In [30]:
text_summarizer(document1)

Original Document

Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning.In its application