Import Dependencies

In [None]:
# Load packages
import spacy

In [None]:
# Text preprocessing packages
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

In [None]:
# Build a List of Stopwords
stopwords = list(STOP_WORDS)

In [None]:
stopwords

In [None]:
len(stopwords)

326

In [None]:
document1 = """Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. Machine learning algorithms build a mathematical model of sample data, known as \"training data\", in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning.In its application across business problems, machine learning is also referred to as predictive analytics."""

In [None]:
# The 'en' shortcut was removed in spaCy 3; use the installed en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

In [None]:
# Build an NLP Object
docx = nlp(document1)

In [None]:
# Tokenization of Text
mytokens = [token.text for token in docx]

In [None]:
mytokens

Word Frequency Table

In [None]:
# Build Word Frequency
# word.text is the token's text in spaCy
word_frequencies = {}
for word in docx:
    if word.text not in stopwords:
        if word.text not in word_frequencies:
            word_frequencies[word.text] = 1
        else:
            word_frequencies[word.text] += 1
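
The loop above counts tokens exactly as they appear, so punctuation marks and capitalized variants get their own entries, while the sentence-scoring step later looks words up in lowercase. A minimal sketch of a variant that lowercases tokens and skips punctuation follows. It uses only the standard library; `tokens`, `demo_stopwords` and `demo_frequencies` are hypothetical stand-ins for the spaCy token texts, `STOP_WORDS` and `word_frequencies`, not part of the notebook.

```python
from collections import Counter
from string import punctuation

# Hypothetical stand-ins for the spaCy token texts and STOP_WORDS.
tokens = ["Machine", "learning", "is", "the", "study", "of", "algorithms",
          ".", "Machine", "learning", "algorithms", "build", "a", "model", "."]
demo_stopwords = {"is", "the", "of", "a"}
punct = set(punctuation)

# Lowercase every token, then drop stop words and punctuation before counting.
demo_frequencies = Counter(
    t.lower() for t in tokens
    if t.lower() not in demo_stopwords and t not in punct
)

print(demo_frequencies.most_common(3))
```

Counting lowercase forms keeps the keys consistent with a lowercase lookup, and dropping punctuation prevents a comma or period from becoming the most frequent "word".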

In [None]:
word_frequencies

Maximum Word Frequency

In [None]:
# Maximum Word Frequency
maximum_frequency = max(word_frequencies.values())

In [None]:
maximum_frequency

9

In [None]:
# Normalize each count by the maximum frequency
for word in word_frequencies:
    word_frequencies[word] = word_frequencies[word] / maximum_frequency
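
As a small illustration of this normalization step, the sketch below divides hypothetical counts by their maximum so that the most frequent entry ends up with a weight of 1.0. The names `demo_counts`, `demo_max` and `demo_weights` are made up for the example.

```python
# Hypothetical raw counts standing in for word_frequencies.
demo_counts = {"learning": 8, "machine": 4, "data": 2}

demo_max = max(demo_counts.values())

# Divide every count by the maximum so the most frequent word scores 1.0.
demo_weights = {word: count / demo_max for word, count in demo_counts.items()}

print(demo_weights)  # {'learning': 1.0, 'machine': 0.5, 'data': 0.25}
```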

Word Frequency Distribution

In [None]:
# Frequency Table
word_frequencies

Sentence Score and Ranking of Words in Each Sentence

In [None]:
# Sentence Tokens
sentence_list = [ sentence for sentence in docx.sents ]

In [None]:
# Example of Sentence Tokenization, Word Tokenization and Lowercasing All Text
# for t in sentence_list:
#     for w in t:
#         print(w.text.lower())
[w.text.lower() for t in sentence_list for w in t ]

In [None]:
# Sentence score: sum the normalized frequencies of the words in each sentence
# (sentences with 30 or more space-separated words are skipped)
sentence_scores = {}
for sent in sentence_list:
    for word in sent:
        if word.text.lower() in word_frequencies:
            if len(sent.text.split(' ')) < 30:
                if sent not in sentence_scores:
                    sentence_scores[sent] = word_frequencies[word.text.lower()]
                else:
                    sentence_scores[sent] += word_frequencies[word.text.lower()]
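
The scoring rule can be sketched without spaCy: a sentence's score is the sum of the weights of its words, and sentences with 30 or more space-separated words are left out. The helper `score_sentence` and the `demo_*` data below are hypothetical, and the sketch splits on spaces rather than using spaCy tokens, so it only approximates the cell above.

```python
# Hypothetical normalized frequencies and sentences for illustration.
demo_weights = {"learning": 1.0, "machine": 0.5, "data": 0.25}
demo_sentences = [
    "Machine learning uses data",
    "Data mining is a field of study",
]

def score_sentence(sentence, weights, max_words=30):
    """Sum the weights of known words; return None for overly long sentences."""
    words = sentence.split(" ")
    if len(words) >= max_words:
        return None
    return sum(weights.get(w.lower(), 0.0) for w in words)

demo_scores = {s: score_sentence(s, demo_weights) for s in demo_sentences}
print(demo_scores)
```

Unlike the notebook loop, this sketch also records a score of 0.0 for a short sentence that contains no known words.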

Get Sentence Score

In [None]:
# Sentence Score Table
sentence_scores

{Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task.: 4.555555555555556,
 Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.: 7.333333333333331,
 Machine learning is closely related to computational statistics, which focuses on making predictions using computers.: 4.111111111111112,
 The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.: 4.555555555555556,
 Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning.: 5.777777777777778,
 In its application across business problems, machine learning is also referred to as predictive analytics.: 3.7777777777777777}

Finding the Top N Sentences with the Largest Scores

In [None]:
# Import Heapq 
from heapq import nlargest

In [None]:
summarized_sentences = nlargest(7, sentence_scores, key=sentence_scores.get)
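
Only six sentences were scored, so `nlargest(7, ...)` returns all of them; the only sentence missing from the result is the one that the 30-word check kept out of `sentence_scores`. A possible refinement, sketched here with hypothetical data, is to pick fewer sentences and then put them back in document order. The names `demo_scores`, `top_n` and `demo_summary` are assumptions for the example.

```python
from heapq import nlargest

# Hypothetical sentence scores, listed in document order.
demo_scores = {
    "First sentence.": 4.5,
    "Second sentence.": 7.3,
    "Third sentence.": 4.1,
    "Fourth sentence.": 5.8,
}

top_n = 2  # assumed summary size; the notebook uses 7
best = nlargest(top_n, demo_scores, key=demo_scores.get)

# Re-sort the selected sentences by their position in the document.
order = {sentence: i for i, sentence in enumerate(demo_scores)}
demo_summary = " ".join(sorted(best, key=order.get))
print(demo_summary)
```

Restoring document order tends to read more naturally than listing sentences from highest to lowest score.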

In [None]:
summarized_sentences

[Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.,
 Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning.,
 Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task.,
 The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.,
 Machine learning is closely related to computational statistics, which focuses on making predictions using computers.,
 In its application across business problems, machine learning is also referred to as predictive analytics.]

In [None]:
# Convert Sentences from Spacy Span to Strings for joining entire sentence
for w in summarized_sentences:
    print(w.text)

Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.
Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning.
Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task.
The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.
Machine learning is closely related to computational statistics, which focuses on making predictions using computers.
In its application across business problems, machine learning is also referred to as predictive analytics.


In [None]:
# List Comprehension of Sentences Converted From Spacy.span to strings
final_sentences = [ w.text for w in summarized_sentences ]

Join sentences

In [None]:
summary = ' '.join(final_sentences)

In [None]:
summary

'Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task. Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning. Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. In its application across business problems, machine learning is also referred to as predictive analytics.'

In [None]:
# Length of Original Text
len(document1)

1069

In [None]:
# Length of Summary
len(summary)

843
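
For a quick sense of how much shorter the summary is, the two character counts reported above can be turned into a ratio. This is plain arithmetic on those numbers; the variable names are made up for the example.

```python
original_length = 1069  # len(document1) reported above
summary_length = 843    # len(summary) reported above

# Share of the original text that survives in the summary.
ratio = summary_length / original_length
print(f"Summary is {ratio:.1%} of the original length")
```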

In [None]:
# Place All in a Function for Reusability
def text_summarizer(raw_docx):
    raw_text = raw_docx
    docx = nlp(raw_text)
    stopwords = list(STOP_WORDS)
    # Build Word Frequency
    # word.text is the token's text in spaCy
    word_frequencies = {}
    for word in docx:
        if word.text not in stopwords:
            if word.text not in word_frequencies:
                word_frequencies[word.text] = 1
            else:
                word_frequencies[word.text] += 1

    # Normalize each count by the maximum frequency
    maximum_frequency = max(word_frequencies.values())
    for word in word_frequencies:
        word_frequencies[word] = word_frequencies[word] / maximum_frequency

    # Sentence Tokens
    sentence_list = [sentence for sentence in docx.sents]

    # Calculate Sentence Score and Ranking
    sentence_scores = {}
    for sent in sentence_list:
        for word in sent:
            if word.text.lower() in word_frequencies:
                if len(sent.text.split(' ')) < 30:
                    if sent not in sentence_scores:
                        sentence_scores[sent] = word_frequencies[word.text.lower()]
                    else:
                        sentence_scores[sent] += word_frequencies[word.text.lower()]

    # Find N Largest
    summary_sentences = nlargest(7, sentence_scores, key=sentence_scores.get)
    final_sentences = [w.text for w in summary_sentences]
    summary = ' '.join(final_sentences)
    print("Original Document\n")
    print(raw_docx)
    print("Total Length:", len(raw_docx))
    print('\n\nSummarized Document\n')
    print(summary)
    print("Total Length:", len(summary))

In [None]:
document2 = """ Harry Potter is the most miserable, lonely boy you can imagine. He’s shunned by his relatives, the Dursleys, who have raised him since he was an infant. He’s forced to live in the cupboard under the stairs, forced to wear his cousin Dudley’s hand-me-down clothes, and forced to go to his neighbour’s house when the rest of the family is doing something fun. Yes, he’s just about as miserable as you can get.

Harry’s world gets turned upside down on his 11th birthday, however. A magical half-giant, Hagrid, informs Harry that he’s really a wizard, and will soon be attending Hogwarts School of Witchcraft and Wizardry. Harry also learns that, in the wizarding world, he’s a hero. When he was an infant, the evil Lord Voldemort killed his parents and then tried to kill Harry too. What’s so amazing to everyone is that Harry survived, and allegedly destroyed Voldemort in the process.

When Harry hears all this, he doesn’t know what to think. However, everything Hagrid tells him turns out to be true, and with a joyful heart Harry starts wizarding school in September. He quickly becomes best friends with a boy named Ron Weasley, and before they even make it to Christmas, they break tons of school rules when they attack a troll and prevent it from killing fellow student Hermione Granger. After the troll incident, the three become inseparable, and Harry is amazed to have found such great friends. He is constantly busy trying to stay on top of the mounds of homework, as well as participating in weekly Quidditch practices. Quidditch is a popular sport among wizards and Harry is the youngest Quidditch player in over a century. It's also a game Harry loves more than anything else at school."""


In [None]:
text_summarizer(document2)

Original Document

 Harry Potter is the most miserable, lonely boy you can imagine. He’s shunned by his relatives, the Dursleys, who have raised him since he was an infant. He’s forced to live in the cupboard under the stairs, forced to wear his cousin Dudley’s hand-me-down clothes, and forced to go to his neighbour’s house when the rest of the family is doing something fun. Yes, he’s just about as miserable as you can get.

Harry’s world gets turned upside down on his 11th birthday, however. A magical half-giant, Hagrid, informs Harry that he’s really a wizard, and will soon be attending Hogwarts School of Witchcraft and Wizardry. Harry also learns that, in the wizarding world, he’s a hero. When he was an infant, the evil Lord Voldemort killed his parents and then tried to kill Harry too. What’s so amazing to everyone is that Harry survived, and allegedly destroyed Voldemort in the process.

When Harry hears all this, he doesn’t know what to think. However, everything Hagrid tells him

Calculating the Reading Time of A Text

In [None]:
document1

'Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning.In its application across business p

In [None]:
# Get Total Word Counts with Tokenization
docx1 = nlp(document1)

In [None]:
# Tokens
mytokens = [ token.text for token in docx1 ]

In [None]:
# Total number of tokens (words and punctuation)
len(mytokens)

173

In [None]:
# Reading time, assuming a pace of about 200 tokens per minute
def readingTime(docs):
    total_words_tokens = [token.text for token in nlp(docs)]
    estimatedtime = len(total_words_tokens) / 200
    return '{} mins'.format(round(estimatedtime))
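
Because `readingTime` rounds to the nearest minute, a text of fewer than 100 tokens would be reported as `'0 mins'`. A minimal alternative sketch is given below; it rounds up and makes the reading pace a parameter. The function `reading_time_minutes` and its `words_per_minute` argument are hypothetical, and it counts whitespace-separated words rather than spaCy tokens, so its results can differ slightly from the function above.

```python
import math

def reading_time_minutes(text, words_per_minute=200):
    """Estimate reading time, rounding up so short texts report 1 minute."""
    word_count = len(text.split())
    return max(1, math.ceil(word_count / words_per_minute))

demo_text = "word " * 450  # hypothetical 450-word text
print(reading_time_minutes(demo_text))  # 3
```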

In [None]:
readingTime(document1)

'1 mins'

In [None]:
document2

" Harry Potter is the most miserable, lonely boy you can imagine. He’s shunned by his relatives, the Dursleys, who have raised him since he was an infant. He’s forced to live in the cupboard under the stairs, forced to wear his cousin Dudley’s hand-me-down clothes, and forced to go to his neighbour’s house when the rest of the family is doing something fun. Yes, he’s just about as miserable as you can get.\n\nHarry’s world gets turned upside down on his 11th birthday, however. A magical half-giant, Hagrid, informs Harry that he’s really a wizard, and will soon be attending Hogwarts School of Witchcraft and Wizardry. Harry also learns that, in the wizarding world, he’s a hero. When he was an infant, the evil Lord Voldemort killed his parents and then tried to kill Harry too. What’s so amazing to everyone is that Harry survived, and allegedly destroyed Voldemort in the process.\n\nWhen Harry hears all this, he doesn’t know what to think. However, everything Hagrid tells him turns out to 

In [None]:
# Get Total Word Counts with Tokenization
docx2 = nlp(document2)

In [None]:
# Tokens
mytokens2 = [ token.text for token in docx2 ]

In [None]:
# Total number of tokens (words and punctuation)
len(mytokens2)

351

In [None]:
readingTime(document2)

'2 mins'