# Text Summarization - In Class Coding Example

In [1]:
import urllib.request  
import bs4 as BeautifulSoup
import nltk
nltk.download('stopwords')
from string import punctuation
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize 

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\zdszy\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Web Scraping the Article Source for Text Summarization

Look at the HTML source of [this](https://en.wikipedia.org/wiki/Machine_learning) webpage. What do you notice? Which types of tags tend to hold the important text?
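One way to explore these questions is to tally how much visible text sits directly under each tag. The sketch below does this with Python's standard-library `html.parser` on a tiny hand-written page. The `sample_html` string and the `TagTextCounter` class are illustrative assumptions, not part of the Wikipedia page or of this notebook.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagTextCounter(HTMLParser):
    """Hypothetical helper: count characters of text by innermost enclosing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []         # tags that are currently open
        self.chars = Counter()  # tag name -> characters of text directly inside it

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.chars[self.stack[-1]] += len(text)


# Hand-written stand-in for a real page.
sample_html = (
    "<html><head><title>Demo</title><script>var x = 1;</script></head>"
    "<body><h1>Heading</h1>"
    "<p>First paragraph with <b>bold</b> text.</p>"
    "<p>Second paragraph.</p></body></html>"
)

counter = TagTextCounter()
counter.feed(sample_html)
print(counter.chars.most_common())
```

On this toy page most of the text sits under `<p>`, while `<script>` holds code rather than prose. That is the pattern to look for in the real article source.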


In [2]:
# Open the Wikipedia page with urllib; this returns an HTTP response object, not the HTML itself
text = urllib.request.urlopen('https://en.wikipedia.org/wiki/Machine_learning')
print(type(text))

<class 'http.client.HTTPResponse'>


In [3]:
# Read the response body: the raw HTML as bytes
article = text.read()
article

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Machine learning - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"2444c583-9249-48d2-ab56-63821da0d508","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Machine_learning","wgTitle":"Machine learning","wgCurRevisionId":1035217173,"wgRevisionId":1035217173,"wgArticleId":233488,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 errors: missing periodical","Harv and Sfn no-target errors","CS1 maint: uses authors parameter","Articles with short description","Short description

In [4]:
# Parsing the URL content 
article_parsed = BeautifulSoup.BeautifulSoup(article,'html.parser')
print(type(article_parsed))
paragraphs = article_parsed.find_all('p')
print(type(paragraphs))

<class 'bs4.BeautifulSoup'>
<class 'bs4.element.ResultSet'>


In [5]:
# Extracted text from <p> tags
paragraphs

[<p><b>Machine learning</b> (<b>ML</b>) is the study of computer <a href="/wiki/Algorithm" title="Algorithm">algorithms</a> that improve automatically through experience and by the use of data.<sup class="reference" id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup> It is seen as a part of <a href="/wiki/Artificial_intelligence" title="Artificial intelligence">artificial intelligence</a>. Machine learning algorithms build a model based on sample data, known as "<a class="mw-redirect" href="/wiki/Training_data" title="Training data">training data</a>", in order to make predictions or decisions without being explicitly programmed to do so.<sup class="reference" id="cite_ref-2"><a href="#cite_note-2">[2]</a></sup> Machine learning algorithms are used in a wide variety of applications, such as in medicine, <a href="/wiki/Email_filtering" title="Email filtering">email filtering</a>, <a href="/wiki/Speech_recognition" title="Speech recognition">speech recognition</a>, and <a href="/wiki/Co

In [6]:
# Loop through all paragraphs and concatenate their text
article_content = ""
for p in paragraphs:  
    article_content += p.text

In [7]:
print(article_content)

Machine learning (ML) is the study of computer algorithms that improve automatically through experience and by the use of data.[1] It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.[2] Machine learning algorithms are used in a wide variety of applications, such as in medicine, email filtering, speech recognition, and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.[3]
A subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers; but not all machine learning is statistical learning. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a related field of study, focusing on exploratory data analysis 

## Raw Article Processing

In [8]:
# tokenize by word using nltk (this needs NLTK's Punkt tokenizer data to be installed)
article_tokens = nltk.word_tokenize(article_content)

In [9]:
# get stop words
stop_words = stopwords.words('english')
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [10]:
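# clean up tokens: drop stop words and non-alphabetic tokens, then convert to lower case
# note: the stop-word check runs before lowercasing, so capitalized stop words such as
# "It", "The" and "A" pass the filter; comparing word.lower() instead would drop them
print(f"Before cleaning, length is {len(article_tokens)}")
token_cleaned = [word.lower() for word in article_tokens if word.isalpha() and word not in stop_words]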
print(f"After cleaning characters, length is {len(token_cleaned)}\n")
print(token_cleaned)

Before cleaning, length is 7720
After cleaning characters, length is 3982

['machine', 'learning', 'ml', 'study', 'computer', 'algorithms', 'improve', 'automatically', 'experience', 'use', 'data', 'it', 'seen', 'part', 'artificial', 'intelligence', 'machine', 'learning', 'algorithms', 'build', 'model', 'based', 'sample', 'data', 'known', 'training', 'data', 'order', 'make', 'predictions', 'decisions', 'without', 'explicitly', 'programmed', 'machine', 'learning', 'algorithms', 'used', 'wide', 'variety', 'applications', 'medicine', 'email', 'filtering', 'speech', 'recognition', 'computer', 'vision', 'difficult', 'unfeasible', 'develop', 'conventional', 'algorithms', 'perform', 'needed', 'tasks', 'a', 'subset', 'machine', 'learning', 'closely', 'related', 'computational', 'statistics', 'focuses', 'making', 'predictions', 'using', 'computers', 'machine', 'learning', 'statistical', 'learning', 'the', 'study', 'mathematical', 'optimization', 'delivers', 'methods', 'theory', 'application', 'd
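The output above still contains `'it'`, `'the'` and `'a'`. The cell compares each token against the lowercase stop-word list before lowercasing it, so capitalized forms pass through. A minimal sketch of the alternative, lowercasing first and then filtering, is shown below. The small `stop_words` set and `article_tokens` list are hypothetical stand-ins for the notebook's variables, and `clean_tokens` is an illustrative helper name.

```python
# Hypothetical stand-ins for the notebook's `stop_words` and `article_tokens`.
stop_words = {"it", "the", "a", "is", "of", "in"}
article_tokens = ["It", "is", "the", "study", "of", "Data", ".", "A", "subset"]


def clean_tokens(tokens, stop_words):
    """Lowercase each token first, then keep alphabetic tokens that are not stop words."""
    stop_set = set(stop_words)  # set membership tests are faster than list scans
    cleaned = []
    for token in tokens:
        lowered = token.lower()
        if lowered.isalpha() and lowered not in stop_set:
            cleaned.append(lowered)
    return cleaned


token_cleaned = clean_tokens(article_tokens, stop_words)
print(token_cleaned)
```

Applied to the real article, this would change the token count and every frequency computed downstream, so the numbers shown in the rest of the notebook would no longer match.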

## Weighted Term Frequency & Sentence Scoring

In [11]:
# get words freq
freq = {}
for word in token_cleaned:
    freq[word] =  freq.get(word, 0) + 1
#print(freq)
# sort from high to low
sorted_freq = dict(sorted(freq.items(), key = lambda item: item[1], reverse = True))
print(sorted_freq)

{'learning': 166, 'machine': 93, 'data': 87, 'training': 48, 'algorithms': 44, 'in': 36, 'model': 34, 'the': 30, 'set': 30, 'used': 29, 'artificial': 22, 'algorithm': 21, 'methods': 19, 'also': 18, 'classification': 18, 'one': 17, 'example': 17, 'ai': 17, 'field': 16, 'models': 16, 'systems': 16, 'supervised': 15, 'examples': 15, 'computer': 14, 'tasks': 14, 'using': 14, 'often': 14, 'unsupervised': 13, 'many': 13, 'input': 13, 'analysis': 12, 'this': 12, 'system': 12, 'called': 12, 'features': 12, 'use': 11, 'it': 11, 'known': 11, 'predictions': 11, 'a': 11, 'related': 11, 'mining': 11, 'problems': 11, 'for': 11, 'performance': 11, 'may': 11, 'networks': 11, 'knowledge': 11, 'method': 11, 'feature': 11, 'regression': 11, 'based': 10, 'perform': 10, 'statistics': 10, 'theory': 10, 'neural': 10, 'learn': 10, 'learned': 10, 'inputs': 10, 'represented': 10, 'decision': 10, 'detection': 10, 'without': 9, 'research': 9, 'network': 9, 'time': 9, 'environment': 9, 'function': 9, 'trained': 9,
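The counting loop and the sort can also be written with `collections.Counter` from the standard library. The sketch below assumes a short hypothetical `token_cleaned` list in place of the real one.

```python
from collections import Counter

# Hypothetical stand-in for the notebook's `token_cleaned` list.
token_cleaned = ["learning", "machine", "learning", "data", "machine", "learning"]

# Counter tallies occurrences; most_common() returns (word, count) pairs
# ordered from the highest count to the lowest.
freq = Counter(token_cleaned)
sorted_freq = dict(freq.most_common())
print(sorted_freq)
```

`Counter` is a `dict` subclass, so `freq.get(word, 0)` and `max(freq.values())` work on it exactly as they do on the hand-built dictionary.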

In [12]:
# Calculate words weighted freq
weighted_freq = {}
# get maximum freq
max_freq = max(sorted_freq.values())
print(max_freq)

# compute and add to weighted_freq
weighted_freq = {word: count/max_freq for (word, count) in sorted_freq.items()}
print(weighted_freq)

166
{'learning': 1.0, 'machine': 0.5602409638554217, 'data': 0.5240963855421686, 'training': 0.2891566265060241, 'algorithms': 0.26506024096385544, 'in': 0.21686746987951808, 'model': 0.20481927710843373, 'the': 0.18072289156626506, 'set': 0.18072289156626506, 'used': 0.1746987951807229, 'artificial': 0.13253012048192772, 'algorithm': 0.12650602409638553, 'methods': 0.1144578313253012, 'also': 0.10843373493975904, 'classification': 0.10843373493975904, 'one': 0.10240963855421686, 'example': 0.10240963855421686, 'ai': 0.10240963855421686, 'field': 0.0963855421686747, 'models': 0.0963855421686747, 'systems': 0.0963855421686747, 'supervised': 0.09036144578313253, 'examples': 0.09036144578313253, 'computer': 0.08433734939759036, 'tasks': 0.08433734939759036, 'using': 0.08433734939759036, 'often': 0.08433734939759036, 'unsupervised': 0.0783132530120482, 'many': 0.0783132530120482, 'input': 0.0783132530120482, 'analysis': 0.07228915662650602, 'this': 0.07228915662650602, 'system': 0.07228915
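In symbols, the cell above divides each word's raw count by the largest count in the article. Writing $f(t)$ for the count of word $t$ (notation introduced here only for illustration), the weighted frequency is:

$$
w(t) = \frac{f(t)}{\max_{t'} f(t')}
$$

With the counts printed earlier, $w(\text{learning}) = 166 / 166 = 1$ and $w(\text{machine}) = 93 / 166 \approx 0.560$, which matches the output. Every weight therefore lies in $(0, 1]$, and the most frequent word always has weight 1.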

In [13]:
# tokenize by sentence and get their scores, store in a dictionary
tokens_sentence = nltk.sent_tokenize(article_content)
sentence_with_scores = {}
# grade sentence based on the sum of its words' total weight
for sentence in tokens_sentence:
    words = nltk.word_tokenize(sentence)
    cur_score = 0
    for word in words:
        cur_score += weighted_freq.get(word.lower(), 0)
    sentence_with_scores[sentence] = cur_score
sentence_with_scores

{'Machine learning (ML) is the study of computer algorithms that improve automatically through experience and by the use of data.': 3.0180722891566263,
 '[1] It is seen as a part of artificial intelligence.': 0.39156626506024095,
 'Machine learning algorithms build a model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.': 4.1144578313253,
 '[2] Machine learning algorithms are used in a wide variety of applications, such as in medicine, email filtering, speech recognition, and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.': 3.5240963855421685,
 '[3]\nA subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers; but not all machine learning is statistical learning.': 4.644578313253012,
 'The study of mathematical optimization delivers methods, theory and applicatio
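Several of the sentences above start with a citation marker such as `[1]` or `[3]`, because the marker follows the period and the sentence tokenizer attaches it to the next sentence. One possible cleanup, sketched below, is to strip such markers from the text before tokenizing. It assumes citations always appear as digits in square brackets; `strip_citations` and the `sample` string are illustrative names, not part of the notebook.

```python
import re

# Matches bracketed numeric citation markers such as "[1]" or "[30]".
CITATION_PATTERN = re.compile(r"\[\d+\]")


def strip_citations(text):
    """Remove bracketed numeric citation markers from the text."""
    return CITATION_PATTERN.sub("", text)


# Shortened sample built from the article's opening sentences.
sample = "improve through the use of data.[1] It is seen as a part of artificial intelligence."
print(strip_citations(sample))
```

Note that this pattern would also remove any legitimate bracketed number in the body text, so it is worth checking the result on the full article.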

## Summarization with Sentences of Highest Scores

In [14]:
# sort sentence by scores
sorted_sentence_tokens = dict(sorted(sentence_with_scores.items(), key = lambda pair: pair[1], reverse = True))
print(sorted_sentence_tokens)
print(len(sorted_sentence_tokens))

{'[30]\nMachine learning and data mining often employ the same methods and overlap significantly, but while machine learning focuses on prediction, based on known properties learned from the training data, data mining focuses on the discovery of (previously) unknown properties in the data (this is the analysis step of knowledge discovery in databases).': 7.843373493975906, 'Learning classifier systems (LCS) are a family of rule-based machine learning algorithms that combine a discovery component, typically a genetic algorithm, with a learning component, performing either supervised learning, reinforcement learning, or unsupervised learning.': 7.789156626506025, '[12]\nTom M. Mitchell provided a widely quoted, more formal definition of the algorithms studied in the machine learning field: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P,  improves with experience E."[

In [15]:
# top 10
sen_choice = 10
# the dictionary is already sorted by score, so its first sen_choice keys are the top sentences
selected_sentences = list(sorted_sentence_tokens)[:sen_choice]
for i, sentence in enumerate(selected_sentences):
    print('---> Sentence %d: %s' % (i, sentence))

---> Sentence 0: [30]
Machine learning and data mining often employ the same methods and overlap significantly, but while machine learning focuses on prediction, based on known properties learned from the training data, data mining focuses on the discovery of (previously) unknown properties in the data (this is the analysis step of knowledge discovery in databases).
---> Sentence 1: Learning classifier systems (LCS) are a family of rule-based machine learning algorithms that combine a discovery component, typically a genetic algorithm, with a learning component, performing either supervised learning, reinforcement learning, or unsupervised learning.
---> Sentence 2: [12]
Tom M. Mitchell provided a widely quoted, more formal definition of the algorithms studied in the machine learning field: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P,  improves with experience E
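The selection step can also be done without sorting the whole dictionary by using `heapq.nlargest` from the standard library. The sketch below assumes a small hypothetical `sentence_with_scores` dictionary. As an optional extra that is not in the notebook, it puts the chosen sentences back in their original article order, which can make the summary read more naturally.

```python
import heapq

# Hypothetical scores; insertion order stands in for the article's sentence order.
sentence_with_scores = {
    "Sentence A.": 1.5,
    "Sentence B.": 3.0,
    "Sentence C.": 2.5,
    "Sentence D.": 4.0,
}
sen_choice = 2

# Pick the sen_choice highest-scoring sentences, best first.
top_sentences = heapq.nlargest(
    sen_choice, sentence_with_scores, key=sentence_with_scores.get
)

# Optional: restore article order (dicts keep insertion order in Python 3.7+).
position = {sentence: i for i, sentence in enumerate(sentence_with_scores)}
summary = sorted(top_sentences, key=position.get)

print(top_sentences)
print(summary)
```

Because each score is a sum over words, longer sentences tend to score higher. That is a property of this scoring scheme rather than of the selection step.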

# Using spaCy

In [16]:
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_trf')

In [17]:
# process with spacy
doc = nlp(article_content)

In [18]:
for sent in doc.sents:
    print(sent)

Machine learning (ML) is the study of computer algorithms that improve automatically through experience and by the use of data.[1]
It is seen as a part of artificial intelligence.
Machine learning algorithms build a model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.[2]
Machine learning algorithms are used in a wide variety of applications, such as in medicine, email filtering, speech recognition, and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.[3]

A subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers; but not all machine learning is statistical learning.
The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.
Data mining is a related field of study, focusing on exploratory data analysis

## Clean and Lemmatize Words Using spaCy

In [19]:
# clean words and lemmatize
token_lemma_cleaned = [token.lemma_ for token in doc if not token.is_stop]
token_lemma_cleaned = [token.lower() for token in token_lemma_cleaned if token.isalpha()]
print(token_lemma_cleaned)

['machine', 'learning', 'ml', 'study', 'computer', 'algorithm', 'improve', 'automatically', 'experience', 'use', 'see', 'artificial', 'intelligence', 'machine', 'learning', 'algorithm', 'build', 'model', 'base', 'sample', 'datum', 'know', 'training', 'datum', 'order', 'prediction', 'decision', 'explicitly', 'program', 'machine', 'learning', 'algorithm', 'wide', 'variety', 'application', 'medicine', 'email', 'filtering', 'speech', 'recognition', 'computer', 'vision', 'difficult', 'unfeasible', 'develop', 'conventional', 'algorithm', 'perform', 'need', 'subset', 'machine', 'learning', 'closely', 'relate', 'computational', 'statistic', 'focus', 'make', 'prediction', 'computer', 'machine', 'learning', 'statistical', 'learning', 'study', 'mathematical', 'optimization', 'deliver', 'method', 'theory', 'application', 'domain', 'field', 'machine', 'learning', 'datum', 'mining', 'related', 'field', 'study', 'focus', 'exploratory', 'datum', 'analysis', 'unsupervised', 'application', 'business', '

## Weighted Term Frequency & Sentence Scoring

In [20]:
# get words freq
freq_spacy = {}
for word in token_lemma_cleaned:
    freq_spacy[word] =  freq_spacy.get(word, 0) + 1
# sort from high to low
sorted_freq_spacy = dict(sorted(freq_spacy.items(), key = lambda item: item[1], reverse = True))
print(sorted_freq_spacy)

{'learning': 161, 'machine': 101, 'datum': 67, 'algorithm': 63, 'model': 49, 'training': 41, 'learn': 32, 'set': 32, 'example': 31, 'method': 30, 'system': 27, 'feature': 23, 'artificial': 22, 'input': 22, 'network': 20, 'represent': 20, 'computer': 19, 'task': 19, 'rule': 19, 'base': 18, 'field': 18, 'train': 18, 'data': 17, 'classification': 17, 'decision': 16, 'perform': 16, 'use': 15, 'problem': 15, 'ai': 15, 'approach': 14, 'label': 14, 'bias': 14, 'program': 13, 'unsupervised': 13, 'find': 13, 'knowledge': 13, 'analysis': 12, 'include': 12, 'call': 12, 'prediction': 11, 'mining': 11, 'research': 11, 'performance': 11, 'time': 11, 'function': 11, 'signal': 11, 'process': 11, 'neuron': 11, 'human': 10, 'neural': 10, 'environment': 10, 'supervised': 10, 'output': 10, 'regression': 10, 'technique': 10, 'representation': 10, 'detection': 10, 'tree': 10, 'improve': 9, 'theory': 9, 'linear': 9, 'predict': 9, 'new': 9, 'test': 9, 'cluster': 9, 'variable': 9, 'dictionary': 9, 'layer': 9, 

In [21]:
# Calculate words weighted freq
weighted_freq_spacy = {}
# get maximum freq
max_freq_spacy = max(sorted_freq_spacy.values())
print(max_freq_spacy)

# compute and add to weighted_freq
weighted_freq_spacy = {word: count/max_freq_spacy for (word, count) in sorted_freq_spacy.items()}
print(weighted_freq_spacy)

161
{'learning': 1.0, 'machine': 0.6273291925465838, 'datum': 0.4161490683229814, 'algorithm': 0.391304347826087, 'model': 0.30434782608695654, 'training': 0.2546583850931677, 'learn': 0.19875776397515527, 'set': 0.19875776397515527, 'example': 0.19254658385093168, 'method': 0.18633540372670807, 'system': 0.16770186335403728, 'feature': 0.14285714285714285, 'artificial': 0.13664596273291926, 'input': 0.13664596273291926, 'network': 0.12422360248447205, 'represent': 0.12422360248447205, 'computer': 0.11801242236024845, 'task': 0.11801242236024845, 'rule': 0.11801242236024845, 'base': 0.11180124223602485, 'field': 0.11180124223602485, 'train': 0.11180124223602485, 'data': 0.10559006211180125, 'classification': 0.10559006211180125, 'decision': 0.09937888198757763, 'perform': 0.09937888198757763, 'use': 0.09316770186335403, 'problem': 0.09316770186335403, 'ai': 0.09316770186335403, 'approach': 0.08695652173913043, 'label': 0.08695652173913043, 'bias': 0.08695652173913043, 'program': 0.0807

In [25]:
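# split into sentences and score each one, storing the scores in a dictionary
tokens_sentence_spacy = doc.sents
sentence_with_scores_spacy = {}
# score each sentence as the sum of its words' weights
# note: weighted_freq_spacy is keyed by lemma, but the lookup below uses the lowercased
# surface text, so inflected forms such as "algorithms" may find no entry and add 0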
for sentence in tokens_sentence_spacy:
    words = nlp(sentence.text)
    cur_score = 0
    for word in words:
        cur_score += weighted_freq_spacy.get(word.text.lower(), 0)
    sentence_with_scores_spacy[sentence] = cur_score
print(sentence_with_scores_spacy)

{Machine learning (ML) is the study of computer algorithms that improve automatically through experience and by the use of data.[1]: 2.024844720496894, It is seen as a part of artificial intelligence.: 0.18012422360248448, Machine learning algorithms build a model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.[2]: 2.565217391304348, Machine learning algorithms are used in a wide variety of applications, such as in medicine, email filtering, speech recognition, and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.[3]: 2.0683229813664594, 
A subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers; but not all machine learning is statistical learning.: 4.440993788819876, The study of mathematical optimization delivers methods, theory and application domains to the f
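The scoring cell above looks words up by their lowercased surface text, while `weighted_freq_spacy` was built from lemmas. The sketch below illustrates the effect with plain Python. The three weights are copied from the output earlier in this section, but the `(text, lemma)` pairs are hand-written stand-ins for spaCy tokens and `score_tokens` is an illustrative helper. With real spaCy tokens, the analogous change would be to look up `word.lemma_.lower()`.

```python
# Weights copied from the lemma-based frequency output in this section.
weighted_freq_spacy = {
    "learning": 1.0,
    "machine": 0.6273291925465838,
    "algorithm": 0.391304347826087,
}

# Hand-written (surface text, lemma) pairs standing in for spaCy tokens.
tokens = [("Machine", "machine"), ("learning", "learning"), ("algorithms", "algorithm")]


def score_tokens(tokens, weights, use_lemma):
    """Sum the weights of the tokens, keyed either by lemma or by surface text."""
    total = 0.0
    for text, lemma in tokens:
        key = lemma if use_lemma else text
        total += weights.get(key.lower(), 0)
    return total


score_by_text = score_tokens(tokens, weighted_freq_spacy, use_lemma=False)
score_by_lemma = score_tokens(tokens, weighted_freq_spacy, use_lemma=True)
print(score_by_text, score_by_lemma)
```

In this toy case "algorithms" contributes nothing when keyed by text, because only the lemma "algorithm" has a weight. Switching the real cell to lemma lookups would change the sentence scores and possibly the ranking shown below.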

## Summarization with Sentences of Highest Scores

In [23]:
# sort sentence by scores
sorted_sentence_tokens_spacy = dict(sorted(sentence_with_scores_spacy.items(), key = lambda pair: pair[1], reverse = True))
print(sorted_sentence_tokens_spacy)
print(len(sorted_sentence_tokens_spacy))

{
Learning classifier systems (LCS) are a family of rule-based machine learning algorithms that combine a discovery component, typically a genetic algorithm, with a learning component, performing either supervised learning, reinforcement learning, or unsupervised learning.: 7.552795031055901, Data mining uses many machine learning methods, but with different goals; on the other hand, machine learning also employs data mining methods as "unsupervised learning" or as a preprocessing step to improve learner accuracy.: 4.881987577639752, 

Machine learning approaches are traditionally divided into three broad categories, depending on the nature of the "signal" or "feedback" available to the learning system:
Supervised learning algorithms build a mathematical model of a set of data that contains both the inputs and the desired outputs.[38]: 4.689440993788821, 
Machine learning and data mining often employ the same methods and overlap significantly, but while machine learning focuses on pred

In [24]:
# top 10
sen_choice_spacy = 10
# the dictionary is already sorted by score, so its first sen_choice_spacy keys are the top sentences
selected_sentences_spacy = list(sorted_sentence_tokens_spacy)[:sen_choice_spacy]
for i, sentence in enumerate(selected_sentences_spacy):
    print('---> Sentence %d: %s' % (i, sentence))

---> Sentence 0: 
Learning classifier systems (LCS) are a family of rule-based machine learning algorithms that combine a discovery component, typically a genetic algorithm, with a learning component, performing either supervised learning, reinforcement learning, or unsupervised learning.
---> Sentence 1: Data mining uses many machine learning methods, but with different goals; on the other hand, machine learning also employs data mining methods as "unsupervised learning" or as a preprocessing step to improve learner accuracy.
---> Sentence 2: 

Machine learning approaches are traditionally divided into three broad categories, depending on the nature of the "signal" or "feedback" available to the learning system:
Supervised learning algorithms build a mathematical model of a set of data that contains both the inputs and the desired outputs.[38]
---> Sentence 3: 
Machine learning and data mining often employ the same methods and overlap significantly, but while machine learning focuses 