# **SUMMER CAMP 2022**
## DAY 4: TEXT SUMMARIZATION

## What we are going to do in this project

- Data collection 
- Text cleaning 

- Extractive text summary
  - Sentence Tokenization
  - Word frequency table
  - Clustering 
  - Summarization


- Abstractive text summary
  - Introduction to the Hugging Face Transformers library

## Learning points 
- Scraping data from the web with Requests and Beautiful Soup
- Preprocessing text data
- Using models from the Hugging Face Transformers library


### Step 0: Import and configure modules

In [1]:
import requests
from bs4 import BeautifulSoup

### Step 1: Gather text data through web scraping

In [2]:
# enter the URL of the article you want to summarize
url = "https://en.wikipedia.org/wiki/Amazon_(company)"
# send a request to the URL and save the response as a variable
r = requests.get(url)
r

<Response [200]>

In [3]:
# get the html content of the page
r.text

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Amazon (company) - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"04612bff-af4c-4259-874a-6cbbfa6b4e11","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Amazon_(company)","wgTitle":"Amazon (company)","wgCurRevisionId":1115387317,"wgRevisionId":1115387317,"wgArticleId":90451,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: bot: original URL status unknown","CS1 maint: url-status","Webarchive template wayback links","CS1 French-language sources (fr)",

In [4]:
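# parse the HTML and find all the paragraph tags <p> in the page
# (naming the parser explicitly avoids Beautiful Soup's "no parser specified" warning)
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find_all("p")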

results

[<p class="mw-empty-elt">
 </p>,
 <p><b>Amazon.com, Inc.</b><sup class="reference" id="cite_ref-10K_1-1"><a href="#cite_note-10K-1">[1]</a></sup> (<span class="rt-commentedText nowrap"><span class="IPA nopopups noexcerpt" lang="en-fonipa"><a href="/wiki/Help:IPA/English" title="Help:IPA/English">/<span style="border-bottom:1px dotted"><span title="/ˈ/: primary stress follows">ˈ</span><span title="/æ/: 'a' in 'bad'">æ</span><span title="'m' in 'my'">m</span><span title="/ə/: 'a' in 'about'">ə</span><span title="'z' in 'zoom'">z</span><span title="/ɒ/: 'o' in 'body'">ɒ</span><span title="'n' in 'nigh'">n</span></span>/</a></span></span> <a href="/wiki/Help:Pronunciation_respelling_key" title="Help:Pronunciation respelling key"><i title="English pronunciation respelling"><span style="font-size:90%">AM</span>-ə-zon</i></a>) is an American <a href="/wiki/Multinational_corporation" title="Multinational corporation">multinational</a> <a href="/wiki/Technology_company" title="Technology compan
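
To see what "find every `<p>` tag and read its text" amounts to without making a network request, here is a minimal sketch that uses only the standard library's `html.parser` instead of Beautiful Soup. This is a swapped-in technique for illustration, not what the notebook runs; the class name `ParagraphText` and the `sample_html` string are made up.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside each <p> element (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # > 0 while we are inside a <p> tag
        self.paragraphs = []  # one string per <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # text inside nested tags such as <b> or <sup> is kept as well
        if self.depth > 0:
            self.paragraphs[-1] += data


# made-up HTML standing in for r.text
sample_html = (
    "<html><body><h1>Title</h1>"
    "<p><b>Amazon.com, Inc.</b><sup>[1]</sup> is a company.</p>"
    "<p>Second paragraph.</p></body></html>"
)

parser = ParagraphText()
parser.feed(sample_html)
parser.close()

paragraph_text = "".join(parser.paragraphs)
print(parser.paragraphs)
```

The heading text is skipped, the markup inside each paragraph is dropped, and the citation marker `[1]` survives as plain text, which is why a cleaning step is still needed afterwards.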

In [5]:
# extract the text from each paragraph tag and join it into one string
text = ""
for paragraph in results:
    text += paragraph.get_text()
text

'\nAmazon.com, Inc.[1] (/ˈæməzɒn/ AM-ə-zon) is an American multinational technology company that focuses on e-commerce, cloud computing, online advertising, digital streaming, and artificial intelligence. It has been referred to as "one of the most influential economic and cultural forces in the world",[5] and is one of the world\'s most valuable brands.[6] It is one of the Big Five American information technology companies, alongside Alphabet, Apple, Meta, and Microsoft.\nAmazon was founded by Jeff Bezos from his garage in Bellevue, Washington,[7] on July 5, 1994. Initially an online marketplace for books, it has expanded into a multitude of product categories, a strategy that has earned it the moniker The Everything Store.[8] It has multiple subsidiaries including Amazon Web Services (cloud computing), Zoox (autonomous vehicles), Kuiper Systems (satellite Internet), and Amazon Lab126 (computer hardware R&D). Its other subsidiaries include Ring, Twitch, IMDb, and Whole Foods Market. I

### Step 2: Text cleaning

In [6]:
import re
# pattern that matches numeric references/citations such as [1] in the Wikipedia text
# (a raw string keeps the backslashes from being treated as string escapes)
pattern = r"\[\d*?\]"

In [8]:
text = re.sub(pattern, '', text)
text

'\nAmazon.com, Inc. (/ˈæməzɒn/ AM-ə-zon) is an American multinational technology company that focuses on e-commerce, cloud computing, online advertising, digital streaming, and artificial intelligence. It has been referred to as "one of the most influential economic and cultural forces in the world", and is one of the world\'s most valuable brands. It is one of the Big Five American information technology companies, alongside Alphabet, Apple, Meta, and Microsoft.\nAmazon was founded by Jeff Bezos from his garage in Bellevue, Washington, on July 5, 1994. Initially an online marketplace for books, it has expanded into a multitude of product categories, a strategy that has earned it the moniker The Everything Store. It has multiple subsidiaries including Amazon Web Services (cloud computing), Zoox (autonomous vehicles), Kuiper Systems (satellite Internet), and Amazon Lab126 (computer hardware R&D). Its other subsidiaries include Ring, Twitch, IMDb, and Whole Foods Market. Its acquisition 
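
To make the citation pattern concrete, here is a small self-contained check on a made-up string (the names `sample` and `cleaned` are illustrative, not from the notebook). It suggests that numeric markers such as `[6]` are removed while non-numeric ones such as `[update]` are left alone, which would explain why `[update]` still shows up in the extractive summary further down.

```python
import re

# same pattern as in the cell above, written as a raw string
pattern = r"\[\d*?\]"

# made-up sample text with two numeric markers and one non-numeric marker
sample = "most valuable brands.[6] As of June 2022[update], sales grew.[12]"
cleaned = re.sub(pattern, "", sample)
print(cleaned)
```

If non-numeric markers should go too, the pattern would need to be widened; that is a design choice the notebook does not make.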

In [9]:
text = text.replace("\n", "")
text

'Amazon.com, Inc. (/ˈæməzɒn/ AM-ə-zon) is an American multinational technology company that focuses on e-commerce, cloud computing, online advertising, digital streaming, and artificial intelligence. It has been referred to as "one of the most influential economic and cultural forces in the world", and is one of the world\'s most valuable brands. It is one of the Big Five American information technology companies, alongside Alphabet, Apple, Meta, and Microsoft.Amazon was founded by Jeff Bezos from his garage in Bellevue, Washington, on July 5, 1994. Initially an online marketplace for books, it has expanded into a multitude of product categories, a strategy that has earned it the moniker The Everything Store. It has multiple subsidiaries including Amazon Web Services (cloud computing), Zoox (autonomous vehicles), Kuiper Systems (satellite Internet), and Amazon Lab126 (computer hardware R&D). Its other subsidiaries include Ring, Twitch, IMDb, and Whole Foods Market. Its acquisition of W
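
Removing newlines outright glues the last sentence of one paragraph to the first sentence of the next (note `Microsoft.Amazon` in the output above). One possible alternative, sketched here on a made-up sample, is to collapse every run of whitespace into a single space. This is an assumption about what is wanted, not what the notebook ran.

```python
import re

# made-up text with paragraph breaks
sample = "alongside Meta, and Microsoft.\nAmazon was founded in 1994.\n\nIt sells books."

# what the cell above does: newlines vanish and sentences run together
glued = sample.replace("\n", "")

# alternative: turn any run of whitespace into one space
spaced = re.sub(r"\s+", " ", sample).strip()

print(glued)
print(spaced)
```

Keeping a space between paragraphs tends to help sentence tokenizers find the boundary, and it also makes the final summary easier to read.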

## **PART 1**
### EXTRACTIVE TEXT SUMMARY

### Step 1: Sentence Tokenization 

In [10]:
# !pip install -U pip setuptools wheel
# !pip install -U 'spacy[apple]'
# !python -m spacy download en_core_web_sm
import spacy
from string import punctuation
sp = spacy.load('en_core_web_sm')

In [11]:
# spaCy's default English stopwords (a set), displayed here as a list
all_stopwords = sp.Defaults.stop_words
list(all_stopwords)

['another',
 'along',
 "'ve",
 '‘ll',
 'hereafter',
 'by',
 'down',
 'themselves',
 'has',
 'empty',
 'and',
 'everything',
 'did',
 'towards',
 'however',
 'top',
 'without',
 'forty',
 'whole',
 'first',
 'us',
 'the',
 'while',
 'beyond',
 'get',
 'so',
 'both',
 'became',
 'in',
 'even',
 'ever',
 'put',
 'after',
 'four',
 'well',
 'how',
 'now',
 'across',
 'very',
 'amount',
 'them',
 'on',
 'you',
 'whither',
 'name',
 'least',
 'serious',
 'three',
 'latter',
 'various',
 'together',
 'therein',
 'above',
 'anything',
 'cannot',
 'thus',
 'sometimes',
 'yours',
 'meanwhile',
 'keep',
 'more',
 'several',
 'someone',
 'besides',
 'perhaps',
 'do',
 'of',
 'whatever',
 'go',
 'against',
 'throughout',
 'wherever',
 'becoming',
 'anywhere',
 'at',
 'what',
 'beside',
 'who',
 'her',
 'though',
 'six',
 'their',
 "'s",
 'every',
 'wherein',
 'elsewhere',
 'full',
 'whereby',
 '‘s',
 'twelve',
 'your',
 'itself',
 'n’t',
 'per',
 '‘re',
 'become',
 "'d",
 'ourselves',
 'his',
 'whe

In [12]:
doc = sp(text)

In [13]:
# tokenizing the original text
tokens = [token.text for token in doc]
tokens

['Amazon.com',
 ',',
 'Inc.',
 '(',
 '/ˈæməzɒn/',
 'AM',
 '-',
 'ə',
 '-',
 'zon',
 ')',
 'is',
 'an',
 'American',
 'multinational',
 'technology',
 'company',
 'that',
 'focuses',
 'on',
 'e',
 '-',
 'commerce',
 ',',
 'cloud',
 'computing',
 ',',
 'online',
 'advertising',
 ',',
 'digital',
 'streaming',
 ',',
 'and',
 'artificial',
 'intelligence',
 '.',
 'It',
 'has',
 'been',
 'referred',
 'to',
 'as',
 '"',
 'one',
 'of',
 'the',
 'most',
 'influential',
 'economic',
 'and',
 'cultural',
 'forces',
 'in',
 'the',
 'world',
 '"',
 ',',
 'and',
 'is',
 'one',
 'of',
 'the',
 'world',
 "'s",
 'most',
 'valuable',
 'brands',
 '.',
 'It',
 'is',
 'one',
 'of',
 'the',
 'Big',
 'Five',
 'American',
 'information',
 'technology',
 'companies',
 ',',
 'alongside',
 'Alphabet',
 ',',
 'Apple',
 ',',
 'Meta',
 ',',
 'and',
 'Microsoft',
 '.',
 'Amazon',
 'was',
 'founded',
 'by',
 'Jeff',
 'Bezos',
 'from',
 'his',
 'garage',
 'in',
 'Bellevue',
 ',',
 'Washington',
 ',',
 'on',
 'July',


In [14]:
sentence_tokens = [sent for sent in doc.sents]
sentence_tokens

[Amazon.com, Inc. (/ˈæməzɒn/ AM-ə-zon) is an American multinational technology company that focuses on e-commerce, cloud computing, online advertising, digital streaming, and artificial intelligence.,
 It has been referred to as "one of the most influential economic and cultural forces in the world", and is one of the world's most valuable brands.,
 It is one of the Big Five American information technology companies, alongside Alphabet, Apple, Meta, and Microsoft.,
 Amazon was founded by Jeff Bezos from his garage in Bellevue, Washington, on July 5, 1994.,
 Initially an online marketplace for books, it has expanded into a multitude of product categories, a strategy that has earned it the moniker The Everything Store.,
 It has multiple subsidiaries including Amazon Web Services (cloud computing), Zoox (autonomous vehicles), Kuiper Systems (satellite Internet), and Amazon Lab126 (computer hardware R&D).,
 Its other subsidiaries include Ring, Twitch, IMDb, and Whole Foods Market.,
 Its ac

### Step 2: Word frequency table

In [16]:
word_frequencies = {}
for word in doc:
    token = word.text
    # skip stopwords and punctuation (both compared in lowercase)
    if token.lower() in all_stopwords or token.lower() in punctuation:
        continue
    # count the token; keys keep the token's original casing
    word_frequencies[token] = word_frequencies.get(token, 0) + 1

word_frequencies

{'Amazon.com': 8,
 'Inc.': 3,
 '/ˈæməzɒn/': 1,
 'ə': 1,
 'zon': 1,
 'American': 4,
 'multinational': 1,
 'technology': 4,
 'company': 17,
 'focuses': 2,
 'e': 5,
 'commerce': 3,
 'cloud': 7,
 'computing': 9,
 'online': 8,
 'advertising': 2,
 'digital': 3,
 'streaming': 6,
 'artificial': 2,
 'intelligence': 1,
 'referred': 1,
 'influential': 1,
 'economic': 1,
 'cultural': 1,
 'forces': 1,
 'world': 6,
 'valuable': 1,
 'brands': 4,
 'Big': 1,
 'information': 3,
 'companies': 7,
 'alongside': 1,
 'Alphabet': 1,
 'Apple': 5,
 'Meta': 1,
 'Microsoft': 3,
 'Amazon': 135,
 'founded': 4,
 'Jeff': 4,
 'Bezos': 10,
 'garage': 1,
 'Bellevue': 1,
 'Washington': 2,
 'July': 4,
 '5': 3,
 '1994': 2,
 'Initially': 1,
 'marketplace': 2,
 'books': 9,
 'expanded': 3,
 'multitude': 1,
 'product': 12,
 'categories': 2,
 'strategy': 1,
 'earned': 2,
 'moniker': 1,
 'Store': 1,
 'multiple': 2,
 'subsidiaries': 4,
 'including': 10,
 'Web': 6,
 'Services': 5,
 'Zoox': 2,
 'autonomous': 1,
 'vehicles': 1,
 'Ku
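
The same counting idea can be sketched without spaCy using `collections.Counter`. The token list and the tiny stopword set below are made-up stand-ins, and unlike the notebook this version lowercases the keys so that `Amazon` and `amazon` would share one entry; treat that as an assumption, not the notebook's behavior.

```python
from collections import Counter
from string import punctuation

# hypothetical stand-ins for spaCy's stopword list and token stream
stopwords = {"is", "an", "the", "it", "in"}
tokens = ["Amazon", "is", "an", "online", "company", ".",
          "The", "company", "sells", "books", "online", "."]

# count every token that is neither a stopword nor punctuation
counts = Counter(
    tok.lower()
    for tok in tokens
    if tok.lower() not in stopwords and tok not in punctuation
)
print(counts.most_common(3))
```

Lowercasing the keys keeps later lookups simple, because any code that scores sentences only has to lowercase the word before looking it up.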

In [17]:
max_freq = max(word_frequencies.values())
max_freq

135

In [18]:
# normalizing the word frequencies
for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word]/max_freq

word_frequencies

{'Amazon.com': 0.05925925925925926,
 'Inc.': 0.022222222222222223,
 '/ˈæməzɒn/': 0.007407407407407408,
 'ə': 0.007407407407407408,
 'zon': 0.007407407407407408,
 'American': 0.02962962962962963,
 'multinational': 0.007407407407407408,
 'technology': 0.02962962962962963,
 'company': 0.1259259259259259,
 'focuses': 0.014814814814814815,
 'e': 0.037037037037037035,
 'commerce': 0.022222222222222223,
 'cloud': 0.05185185185185185,
 'computing': 0.06666666666666667,
 'online': 0.05925925925925926,
 'advertising': 0.014814814814814815,
 'digital': 0.022222222222222223,
 'streaming': 0.044444444444444446,
 'artificial': 0.014814814814814815,
 'intelligence': 0.007407407407407408,
 'referred': 0.007407407407407408,
 'influential': 0.007407407407407408,
 'economic': 0.007407407407407408,
 'cultural': 0.007407407407407408,
 'forces': 0.007407407407407408,
 'world': 0.044444444444444446,
 'valuable': 0.007407407407407408,
 'brands': 0.02962962962962963,
 'Big': 0.007407407407407408,
 'information
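
As a quick worked example of the normalization, the sketch below divides three of the raw counts shown earlier by the largest one (135 for `Amazon`). The dictionary names `raw_counts` and `normalized` are illustrative.

```python
# three raw counts taken from the frequency table above
raw_counts = {"Amazon": 135, "company": 17, "Amazon.com": 8}

max_count = max(raw_counts.values())

# build a new dict instead of overwriting the counts in place
normalized = {word: count / max_count for word, count in raw_counts.items()}
print(normalized)
```

The most frequent word ends up with weight 1.0 and every other word with a fraction of it, so `company` comes out near 0.126 and `Amazon.com` near 0.059, matching the table above.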

In [19]:
# displaying the sentence tokens again before scoring them
sentence_tokens

[Amazon.com, Inc. (/ˈæməzɒn/ AM-ə-zon) is an American multinational technology company that focuses on e-commerce, cloud computing, online advertising, digital streaming, and artificial intelligence.,
 It has been referred to as "one of the most influential economic and cultural forces in the world", and is one of the world's most valuable brands.,
 It is one of the Big Five American information technology companies, alongside Alphabet, Apple, Meta, and Microsoft.,
 Amazon was founded by Jeff Bezos from his garage in Bellevue, Washington, on July 5, 1994.,
 Initially an online marketplace for books, it has expanded into a multitude of product categories, a strategy that has earned it the moniker The Everything Store.,
 It has multiple subsidiaries including Amazon Web Services (cloud computing), Zoox (autonomous vehicles), Kuiper Systems (satellite Internet), and Amazon Lab126 (computer hardware R&D).,
 Its other subsidiaries include Ring, Twitch, IMDb, and Whole Foods Market.,
 Its ac

In [20]:
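sentence_scores = {}
for sent in sentence_tokens:
    for word in sent:
        # note: the lookup is lowercased, but word_frequencies keeps each token's
        # original casing, so a capitalized token such as 'Amazon' only counts
        # if its lowercase form is also a key
        key = word.text.lower()
        if key in word_frequencies:
            # add the word's normalized frequency to the sentence's running score
            sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[key]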

sentence_scores

{Amazon.com, Inc. (/ˈæməzɒn/ AM-ə-zon) is an American multinational technology company that focuses on e-commerce, cloud computing, online advertising, digital streaming, and artificial intelligence.: 0.5555555555555556,
 It has been referred to as "one of the most influential economic and cultural forces in the world", and is one of the world's most valuable brands.: 0.16296296296296298,
 It is one of the Big Five American information technology companies, alongside Alphabet, Apple, Meta, and Microsoft.: 0.11851851851851851,
 Amazon was founded by Jeff Bezos from his garage in Bellevue, Washington, on July 5, 1994.: 0.07407407407407407,
 Initially an online marketplace for books, it has expanded into a multitude of product categories, a strategy that has earned it the moniker The Everything Store.: 0.36296296296296304,
 It has multiple subsidiaries including Amazon Web Services (cloud computing), Zoox (autonomous vehicles), Kuiper Systems (satellite Internet), and Amazon Lab126 (compu
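
Sentence scoring is just a sum of word weights, which the following spaCy-free sketch shows on made-up data. It assumes the weights are stored under lowercase keys and that words are lowercased before lookup, so capitalized words such as `Amazon` contribute to the score; the helper name `score` and the sample sentences are hypothetical. Unlike the notebook, it also records a score of 0 for sentences with no matching words.

```python
from string import punctuation

# hypothetical normalized weights with lowercase keys
weights = {"amazon": 1.0, "company": 0.5, "books": 0.25}

sentences = [
    "Amazon is a company.",
    "It sells books.",
    "The weather was nice.",
]


def score(sentence, weights):
    """Sum the weights of the words in a sentence (unknown words add 0)."""
    # strip surrounding punctuation and lowercase each word before the lookup
    words = (w.strip(punctuation).lower() for w in sentence.split())
    return sum(weights.get(w, 0.0) for w in words)


scores = {s: score(s, weights) for s in sentences}
print(scores)
```

Longer sentences naturally collect more weight under this scheme, so dividing by sentence length is a common variation when short, dense sentences should be able to compete.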

### Step 3: Summarization

In [21]:
# Calculating 30% of the total number of sentences
select_length = int(len(sentence_tokens) * 0.3)
select_length

52

In [22]:
from heapq import nlargest

In [25]:
# pick the highest-scoring sentences (returned in descending score order)
summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)
final_summary = [sent.text for sent in summary]
final_text = " ".join(final_summary)
final_text

'Andy Jassy, previously CEO of AWS, became Amazon\'s CEO.Amazon.com is an ecommerce platform that sells many product lines, include media (books, movies, music, and software), apparel, baby products, consumer electronics, beauty products, gourmet food, groceries, health and personal care products, industrial & scientific supplies, kitchen items, jewelry, watches, lawn and garden items, musical instruments, sporting goods, tools, automotive items, toys and games, and farm supplies and consulting services. This process, commonly known as a service-oriented architecture (SOA), resulted in mandatory dogfooding of services that would later be commercialized as part of AWS.As of June\xa02022[update], Amazon\'s board of directors were:Amazon.com is primarily a retail site with a sales revenue model; Amazon takes a small percentage of the sale price of each item that is sold through its website while also allowing companies to advertise their products by paying to be listed as featured product
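
`nlargest` returns the selected sentences ordered by score, so the summary above does not follow the order of the article. A possible refinement, sketched here with made-up sentences and scores, is to select the top sentences first and then put them back in their original order. The value `k = 3` is a hypothetical summary length.

```python
from heapq import nlargest

# made-up sentences in document order, with made-up scores
sentences = ["First point.", "Minor aside.", "Key finding.",
             "Another aside.", "Conclusion."]
scores = {
    "First point.": 0.6,
    "Minor aside.": 0.1,
    "Key finding.": 0.9,
    "Another aside.": 0.2,
    "Conclusion.": 0.7,
}

k = 3  # hypothetical number of sentences to keep

# highest-scoring sentences, best first
top = nlargest(k, scores, key=scores.get)

# restore document order before joining
ordered = [s for s in sentences if s in top]
summary_text = " ".join(ordered)
print(summary_text)
```

Reading the chosen sentences in document order usually makes an extractive summary easier to follow, at the cost of no longer showing which sentence scored highest.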

In [26]:
# original text
doc

Amazon.com, Inc. (/ˈæməzɒn/ AM-ə-zon) is an American multinational technology company that focuses on e-commerce, cloud computing, online advertising, digital streaming, and artificial intelligence. It has been referred to as "one of the most influential economic and cultural forces in the world", and is one of the world's most valuable brands. It is one of the Big Five American information technology companies, alongside Alphabet, Apple, Meta, and Microsoft.Amazon was founded by Jeff Bezos from his garage in Bellevue, Washington, on July 5, 1994. Initially an online marketplace for books, it has expanded into a multitude of product categories, a strategy that has earned it the moniker The Everything Store. It has multiple subsidiaries including Amazon Web Services (cloud computing), Zoox (autonomous vehicles), Kuiper Systems (satellite Internet), and Amazon Lab126 (computer hardware R&D). Its other subsidiaries include Ring, Twitch, IMDb, and Whole Foods Market. Its acquisition of Who

## **PART 2**
### ABSTRACTIVE TEXT SUMMARY USING HUGGING FACE TRANSFORMERS LIBRARY 🤗

In [27]:
# importing the libraries
from transformers import pipeline

In [28]:
# pipeline for text summarization
# default model used is sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)

summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


In [29]:
# summarizing the text with the pipeline
# truncation=True cuts the input down to the model's maximum input length
hf_summary = summarizer(text, max_length=400, min_length=100, do_sample=False, truncation=True)
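
Because `truncation=True` discards everything beyond the model's maximum input length, a long article is likely summarized from its opening paragraphs only. One possible workaround is to split the text into smaller pieces and summarize each piece separately. The helper `chunk_words` and the 400-word chunk size below are hypothetical, and word counts only approximate the token counts the model actually uses.

```python
def chunk_words(text, max_words=400):
    """Split text into consecutive pieces of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


# made-up 1000-word text standing in for the article
sample = " ".join(f"word{i}" for i in range(1000))
chunks = chunk_words(sample, max_words=400)
print([len(c.split()) for c in chunks])

# each chunk could then be passed to the summarizer separately, for example:
# partial = [summarizer(c, truncation=True)[0]["summary_text"] for c in chunks]
```

Splitting on word counts can cut a sentence in half at a chunk boundary, so chunking on sentence boundaries would be a natural next step.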

In [32]:
# Returning the summary text
hf_summary[0]['summary_text']

" Amazon.com, Inc. is an American multinational technology company that focuses on e-commerce, cloud computing, online advertising, digital streaming, and artificial intelligence . It is one of the Big Five American information technology companies, alongside Alphabet, Apple, Meta, and Microsoft . In 2021, it surpassed Walmart as the world's largest retailer outside of China, driven in large part by its paid subscription plan, Amazon Prime, which has over 200 million subscribers worldwide . It has been criticized for customer data collection practices, a toxic work culture, tax avoidance, and anti-competitive behavior ."