Introduction:
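Text Summarization using spaCy

spaCy is an Open-Source Library used for Natural Language Processing. Here, we use spaCy to perform Extractive Text Summarization: each sentence is scored by the frequencies of the words it contains, and the highest-scoring sentences form the summary.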

In [44]:
# Importing Dependencies:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from heapq import nlargest

In [45]:
# Opening a File to Process:
# text_file = open(r"cnn_stories_tokenized/000c835555db62e319854d9f8912061cdca1893e.story")

# with open(r"cnn_stories_tokenized/000c835555db62e319854d9f8912061cdca1893e.story") as text_file:
#     text = text_file.readlines()
# print(text[0])

Here, we are loading up the cnn_stories_tokenized dataset.
The first line of each .story file is appended to a single String "text", followed by a blank line.

In [46]:
# Loading up the Files:
text = ""

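import os, glob
# Raw string, so that the backslashes in the Windows path are not read as escape sequences:
path = r'D:\Coding\Python\Projects\Text Summarization\Dataset\TestData'
for filename in glob.glob(os.path.join(path, '*.story')):
    # glob already returns the full path, so the file can be opened directly (read-only mode):
    with open(filename, 'r') as text_file:
        text += text_file.readlines()[0]  # only the first line of each story is used
        text += "\n\n"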

print("Text: ", text)

Text:  -LRB- CNN -RRB- For the second time during his papacy , Pope Francis has announced a new group of bishops and archbishops set to become cardinals -- and they come from all over the world . Pope Francis said Sunday that he would hold a meeting of cardinals on February 14 `` during which I will name 15 new Cardinals who , coming from 13 countries from every continent , manifest the indissoluble links between the Church of Rome and the particular Churches present in the world , '' according to Vatican Radio .New cardinals are always important because they set the tone in the church and also elect the next pope , CNN Senior Vatican Analyst John L. Allen said . They are sometimes referred to as the princes of the Catholic Church . The new cardinals come from countries such as Ethiopia , New Zealand and Myanmar . `` This is a pope who very much wants to reach out to people on the margins , and you clearly see that in this set , '' Allen said . `` You 're talking about cardinals from t
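
The loading step above can also be wrapped in a small helper. The following is a minimal sketch, not the notebook's own code: the function name load_first_lines and the two tiny files written to a temporary directory are hypothetical, and unlike the cell above it strips the trailing newline and does not add a separator after the last story.

```python
import tempfile
from pathlib import Path


def load_first_lines(folder, pattern="*.story"):
    """Join the first line of every matching file, separated by blank lines."""
    parts = []
    # sorted() makes the order of the stories reproducible across runs.
    for file_path in sorted(Path(folder).glob(pattern)):
        with open(file_path, "r", encoding="utf-8") as handle:
            first_line = handle.readline()
        parts.append(first_line.rstrip("\n"))
    return "\n\n".join(parts)


# Demonstration on a throwaway directory with two hypothetical story files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.story").write_text("First story .\n\n@highlight\n", encoding="utf-8")
    Path(tmp, "b.story").write_text("Second story .\n", encoding="utf-8")
    demo_text = load_first_lines(tmp)

print(repr(demo_text))
```

Reading with readline() avoids loading the whole file when only its first line is needed.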

The summarizer function. This function takes in the raw form of all the documents, i.e., the "text" string, and returns the Summary, the spaCy doc, the Length of the Initial Text and finally the Length of the Summarized Text. Both lengths are counted in space-separated words.

This function combines the code snippets that are explained one by one further below.

In [47]:
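def summarizer(rawdoc):
    """Return (summary, doc, original word count, summary word count) for rawdoc.

    Sentences are scored by the sum of their normalized word frequencies,
    and the top-scoring 30% of the sentences are kept.
    """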
    # Defining Stop-Words:
    stopwords = list(STOP_WORDS)
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(rawdoc)

    # Tokenization:
    tokens = [token.text for token in doc]
    
    # Word Frequency:
    word_freq = {}

    for word in doc:
        # If token is NOT a stopword and is NOT a punctuation:
        if word.text.lower() not in stopwords and word.text.lower() not in punctuation:
            # If the word is NOT already present in the "word_freq" Dictionary, add it:
            if word.text not in word_freq.keys():
                word_freq[word.text] = 1
            # If word is present in the "word_freq" Dictionary, increase its frequency by 1:
            else:
                word_freq[word.text] += 1

    # Maximum Frequency:
    max_freq = max(word_freq.values())
    
    # Normalization of Frequencies:
    for word in word_freq.keys():
        word_freq[word] = word_freq[word] / max_freq

    # Creating tokens of Sentences:
    sent_tokens = [sent for sent in doc.sents]

    # Calculating Sentence Score, which is the sum of Normalized Frequencies of
    # all the tokens occurring in the sentence:

    sent_scores = {}
    for sent in sent_tokens:
        for word in sent:
            if word.text in word_freq.keys():
                if sent not in sent_scores.keys():
                    sent_scores[sent] = word_freq[word.text]
                else:
                    sent_scores[sent] += word_freq[word.text]


    # Defining the Length of the Summary to be Created:

    summaryLength = int(len(sent_tokens) * 0.3)
    summary = nlargest(summaryLength, sent_scores, key=sent_scores.get)

    # "summary" holds the selected sentences (highest score first), so take the text of each one:
    final_summary = [sent.text for sent in summary]
    summary = ' '.join(final_summary)

    return summary, doc, len(rawdoc.split(' ')), len(summary.split(' '))


These are the individual Snippets of Code that make up the Summarizer function, explained one step at a time:

Here, we are importing Stop-Words as a list from spaCy:

In [49]:
# Defining Stop-Words:
stopwords = list(STOP_WORDS)
print(stopwords)

['then', 'wherever', 'too', 'otherwise', 'however', 'seem', 'therein', 'were', '’ve', 'whence', 'herein', 'at', 'with', 'twelve', 'least', 'unless', 'just', 'had', 'six', 'for', 'be', 'take', "'re", 'empty', 'thru', 'amongst', 'fifty', 'few', '’ll', '‘d', 'back', 'our', 'afterwards', 'less', 'here', 'beforehand', 'ever', 'how', 'per', 'hers', 'everywhere', 'other', 'have', 'front', 'is', 'me', 'eleven', 'its', 'am', 'up', 'seems', 'already', 'why', 'very', 'himself', '‘s', 'often', 'side', 'the', 'whereupon', 'might', 'made', 'each', "n't", 'themselves', 'whom', 'within', 'after', 'doing', 'put', '’d', 'ours', 'are', 'her', 'becomes', 'bottom', 'myself', 'mine', 'still', '‘m', 'move', 'all', 'so', 'under', 'none', 'really', 'who', 'whither', 'they', 'anyway', 'everyone', 'hereafter', 'somewhere', 'rather', "'s", 'on', 'these', 'neither', 'ten', 'ourselves', 'yet', 'beyond', 'any', 'also', 'onto', 'will', 'can', 'nor', 'somehow', 'that', 'nothing', 'next', 'whereafter', 'call', 'else', 

In [50]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print(doc)

-LRB- CNN -RRB- For the second time during his papacy , Pope Francis has announced a new group of bishops and archbishops set to become cardinals -- and they come from all over the world . Pope Francis said Sunday that he would hold a meeting of cardinals on February 14 `` during which I will name 15 new Cardinals who , coming from 13 countries from every continent , manifest the indissoluble links between the Church of Rome and the particular Churches present in the world , '' according to Vatican Radio .New cardinals are always important because they set the tone in the church and also elect the next pope , CNN Senior Vatican Analyst John L. Allen said . They are sometimes referred to as the princes of the Catholic Church . The new cardinals come from countries such as Ethiopia , New Zealand and Myanmar . `` This is a pope who very much wants to reach out to people on the margins , and you clearly see that in this set , '' Allen said . `` You 're talking about cardinals from typicaly

Performing Tokenization on the String:

In [51]:
# Tokenization:
tokens = [token.text for token in doc]
print(tokens)



Calculating the Word Frequency:

Here, we create a dictionary of the tokens occurring in the text, excluding stop-words and punctuation, which stores the tokens as keys and the corresponding frequencies as values.


In [52]:
# Word Frequency:
word_freq = {}

for word in doc:
    # If token is NOT a stopword and is NOT a punctuation:
    if word.text.lower() not in stopwords and word.text.lower() not in punctuation:
        # If the word is NOT already present in the "word_freq" Dictionary, add it:
        if word.text not in word_freq.keys():
            word_freq[word.text] = 1
        # If word is present in the "word_freq" Dictionary, increase its frequency by 1:
        else:
            word_freq[word.text] += 1

print(word_freq)
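
The counting step above can be tried on a handful of tokens without loading spaCy. This is a minimal sketch under stated assumptions: toy_stopwords and toy_tokens are made-up stand-ins for the spaCy stop-word list and the tokens of doc, and collections.Counter replaces the explicit if/else. It also shows a side effect of the filter: "not in punctuation" is a substring test against the string.punctuation string, so multi-character tokens such as "--" are not filtered out.

```python
from collections import Counter
from string import punctuation

# Hypothetical stand-ins for the spaCy stop-word list and the token stream.
toy_stopwords = {"the", "a", "of", "and", "from"}
toy_tokens = ["The", "new", "cardinals", "come", "from", "Ethiopia", "--",
              "and", "the", "cardinals", "elect", "the", "pope", "."]

# Same filter as the cell above: lower-case for the checks, count the original text.
toy_freq = Counter(
    token for token in toy_tokens
    if token.lower() not in toy_stopwords and token.lower() not in punctuation
)

print(toy_freq["cardinals"])  # 2
print("." in toy_freq)        # False: "." is a character of string.punctuation
print("--" in toy_freq)       # True: "--" is not a substring of string.punctuation
```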



Getting the Maximum Frequency, i.e., the count of the most frequent token:

In [53]:
# Maximum Frequency:
max_freq = max(word_freq.values())
print(max_freq)

100


Normalizing the Frequencies to the range 0 - 1 by dividing each one by the Maximum Frequency:

In [54]:
# Normalization of Frequencies:
for word in word_freq.keys():
    word_freq[word] = word_freq[word] / max_freq

print(word_freq)
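
As a small worked example of the normalization, assuming three made-up counts (the words and numbers below are illustrative and not taken from the dataset):

```python
# Hypothetical raw counts for three words.
toy_freq = {"cardinals": 4, "pope": 2, "church": 1}

# default=0 avoids a ValueError when the dictionary is empty.
max_freq = max(toy_freq.values(), default=0)

# Building a new dictionary leaves the raw counts untouched.
normalized = {word: count / max_freq for word, count in toy_freq.items()}

print(normalized)  # {'cardinals': 1.0, 'pope': 0.5, 'church': 0.25}
```

The most frequent word always ends up with a weight of exactly 1.0, and every other word gets a weight proportional to its count.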



Here, we split the text into Sentences:

In [55]:
# Creating tokens of Sentences:
sent_tokens = [sent for sent in doc.sents]
print(sent_tokens)

[-LRB- CNN -RRB-, For the second time during his papacy , Pope Francis has announced a new group of bishops and archbishops set to become cardinals -- and they come from all over the world ., Pope Francis said Sunday that he would hold a meeting of cardinals on February 14 `` during which I will name 15 new Cardinals who , coming from 13 countries from every continent , manifest the indissoluble links between the Church of Rome and the particular Churches present in the world , '' according to Vatican Radio .New cardinals are always important because they set the tone in the church and also elect the next pope , CNN Senior Vatican Analyst John L. Allen said ., They are sometimes referred to as the princes of the Catholic Church ., The new cardinals come from countries such as Ethiopia , New Zealand and Myanmar ., `` This is a pope who very much wants to reach out to people on the margins , and you clearly see that in this set , '' Allen said ., `` You 're talking about cardinals from t

Here, we are calculating the Sentence Score.
The sentence score is defined as the Sum of the Normalized frequencies of all the Tokens appearing in the given Sentence:

In [56]:
# Calculating Sentence Score, which is the sum of Normalized Frequencies of
# all the tokens occurring in the sentence:

sent_scores = {}
for sent in sent_tokens:
    for word in sent:
        if word.text in word_freq.keys():
            if sent not in sent_scores.keys():
                sent_scores[sent] = word_freq[word.text]
            else:
                sent_scores[sent] += word_freq[word.text]

print(sent_scores)

{-LRB- CNN -RRB-: 2.66, For the second time during his papacy , Pope Francis has announced a new group of bishops and archbishops set to become cardinals -- and they come from all over the world .: 1.29, Pope Francis said Sunday that he would hold a meeting of cardinals on February 14 `` during which I will name 15 new Cardinals who , coming from 13 countries from every continent , manifest the indissoluble links between the Church of Rome and the particular Churches present in the world , '' according to Vatican Radio .New cardinals are always important because they set the tone in the church and also elect the next pope , CNN Senior Vatican Analyst John L. Allen said .: 2.26, They are sometimes referred to as the princes of the Catholic Church .: 0.05, The new cardinals come from countries such as Ethiopia , New Zealand and Myanmar .: 0.29000000000000004, `` This is a pope who very much wants to reach out to people on the margins , and you clearly see that in this set , '' Allen said
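
The scoring rule can be checked by hand on a toy input. In this sketch the weights and the three pre-tokenized sentences are made up, and sentences are identified by their index rather than by spaCy spans. As in the cell above, a sentence without any weighted token gets no entry in the score dictionary.

```python
# Hypothetical normalized weights and pre-tokenized sentences.
toy_weights = {"cardinals": 1.0, "pope": 0.5, "church": 0.25}
toy_sentences = [
    ["The", "pope", "named", "new", "cardinals", "."],
    ["They", "are", "princes", "of", "the", "church", "."],
    ["It", "rained", "."],
]

toy_scores = {}
for index, sentence in enumerate(toy_sentences):
    # Tokens without a weight (stop-words, punctuation) contribute nothing.
    score = sum(toy_weights.get(token, 0.0) for token in sentence)
    if score > 0:
        toy_scores[index] = score

print(toy_scores)  # {0: 1.5, 1: 0.25}
```

Because the score is a plain sum, longer sentences tend to score higher simply because they contain more weighted tokens.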

Summarization:

In [57]:
# Defining the Length of the Summary to be Created:
# (this cell keeps 50% of the sentences, while the summarizer function above keeps 30%)

summaryLength = int(len(sent_tokens) * 0.5)
print(summaryLength)

107


In [58]:
summary = nlargest(summaryLength, sent_scores, key=sent_scores.get)
print(summary)

[''


-LRB- CNN -RRB- -- Bayern Munich 's record winning start to the Bundesliga season came to an abrupt end Sunday as they were stunned 2-1 at home by Bayer Leverkusen .


, ATLANTA , Georgia -LRB- CNN -RRB- -- Dressed head to toe in black , designer Isaac Mizrahi is wearing an outfit that seems to contradict his personality -- and his usual fashion flair .


, -LRB- CNN -RRB- -- San Francisco 's new sheriff is facing misdemeanor charges over an alleged domestic abuse incident on New Year 's Eve , authorities said .


, -LRB- CNN -RRB- -- Affectionately known in his home city of Madrid as `` the wise man of Hortaleza , '' Luis Aragones left the legacy of helping Spain 's ascension to the top of world football .


, West Virginia -LRB- CNN -RRB- -- North Central West Virginia Airport boasts quick check-ins , free , accessible parking and a convenient baggage claim .


, -LRB- CNN -RRB- -- Two of Turkey 's main political parties are pushing for a constitutional amendment to lift bans o

In [59]:
# "summary" holds the selected sentences (highest score first), so take the text of each one:
final_summary = [sent.text for sent in summary]
summary = ' '.join(final_summary)
# print(summary)
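
heapq.nlargest returns the selected sentences ordered by score, not by their position in the text, which is why the summary above jumps between stories. The sketch below uses four made-up sentences and scores. Sorting the selected indices afterwards restores the reading order. That extra step is an optional variation and is not part of the notebook's summarizer function.

```python
from heapq import nlargest

# Hypothetical sentences and scores, keyed by sentence index.
toy_sentences = ["Sentence A .", "Sentence B .", "Sentence C .", "Sentence D ."]
toy_scores = {0: 0.4, 1: 1.5, 2: 0.1, 3: 2.0}

keep = int(len(toy_sentences) * 0.5)  # keep half of the sentences
top_by_score = nlargest(keep, toy_scores, key=toy_scores.get)
in_document_order = sorted(top_by_score)

print(top_by_score)       # [3, 1]
print(in_document_order)  # [1, 3]
print(" ".join(toy_sentences[i] for i in in_document_order))
```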

Calling the Summarizer function to get the final Summary of the given Text:

In [62]:
# Calling the Function:
summary_fin, doc, ogLength, summLength = summarizer(text)

print(summary_fin)

''


-LRB- CNN -RRB- -- Bayern Munich 's record winning start to the Bundesliga season came to an abrupt end Sunday as they were stunned 2-1 at home by Bayer Leverkusen .


 ATLANTA , Georgia -LRB- CNN -RRB- -- Dressed head to toe in black , designer Isaac Mizrahi is wearing an outfit that seems to contradict his personality -- and his usual fashion flair .


 -LRB- CNN -RRB- -- San Francisco 's new sheriff is facing misdemeanor charges over an alleged domestic abuse incident on New Year 's Eve , authorities said .


 -LRB- CNN -RRB- -- Affectionately known in his home city of Madrid as `` the wise man of Hortaleza , '' Luis Aragones left the legacy of helping Spain 's ascension to the top of world football .


 West Virginia -LRB- CNN -RRB- -- North Central West Virginia Airport boasts quick check-ins , free , accessible parking and a convenient baggage claim .


 -LRB- CNN -RRB- -- Two of Turkey 's main political parties are pushing for a constitutional amendment to lift bans on head

In [67]:
print("Length of Original Text: ", len(text.split(' ')))
print("Length of Summary Text: ", len(summary.split(' ')))

Length of Original Text:  3335
Length of Summary Text:  603
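
The two lengths above are counts of space-separated pieces, so they only approximate the number of words. A short sketch on a made-up string shows how split(' ') differs from split() once the text contains newlines:

```python
# Hypothetical text: two one-sentence stories separated by a blank line.
sample = "A story .\n\nAnother story ."

# split(' ') only breaks on single spaces, so ".\n\nAnother" stays one piece.
print(len(sample.split(' ')))  # 5

# split() breaks on any run of whitespace, including newlines.
print(len(sample.split()))     # 6
```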


ROUGE Metric:

In [65]:
%pip install rouge

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [66]:
from rouge import Rouge

In [70]:
rouge = Rouge()
# get_scores(hypothesis, reference): the summary is scored against the full original text
rouge.get_scores(summary, text)

[{'rouge-1': {'r': 0.1628232005590496, 'p': 1.0, 'f': 0.28004807451473085},
  'rouge-2': {'r': 0.11709770114942529,
   'p': 0.9367816091954023,
   'f': 0.208173688957003},
  'rouge-l': {'r': 0.16212438853948288,
   'p': 0.9957081545064378,
   'f': 0.2788461514378078}}]
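
To see why precision is close to 1.0 while recall stays low, here is a simplified unigram-overlap calculation in the spirit of ROUGE-1. It is only a sketch: the function unigram_overlap_scores and the two strings are hypothetical, and it is not claimed to match the tokenization or counting of the rouge package. Every word of an extractive summary also occurs in the text it was extracted from, so scoring the summary against that same text gives a precision of 1.0, while recall reflects how much of the text the summary covers.

```python
from collections import Counter


def unigram_overlap_scores(hypothesis, reference):
    """Recall, precision and F1 of whitespace-token overlap (simplified ROUGE-1)."""
    hyp_counts = Counter(hypothesis.split())
    ref_counts = Counter(reference.split())
    # Counter intersection keeps, per word, the smaller of the two counts.
    overlap = sum((hyp_counts & ref_counts).values())
    precision = overlap / sum(hyp_counts.values()) if hyp_counts else 0.0
    recall = overlap / sum(ref_counts.values()) if ref_counts else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"r": recall, "p": precision, "f": f1}


# Hypothetical example: the "summary" is a prefix of the "text".
toy_text = "the pope named new cardinals on sunday in rome"
toy_summary = "the pope named new cardinals"

toy_rouge = unigram_overlap_scores(toy_summary, toy_text)
print(toy_rouge["p"])  # 1.0
print(toy_rouge["r"])  # 5 of the 9 reference words are covered
```

Scoring against a separately written reference summary, where one is available, would make the precision value informative as well.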