# CME538 - Introduction to Data Science
## Tutorial 3 - Parsing HTML

### Learning Objectives
After completing this tutorial, you should be comfortable:

- Understanding the overall structure of HTML tags
- Finding the information of interest within a web-page by going through the HTML source code
- Using Python and BeautifulSoup to extract the information of interest from HTML elements 

### Tutorial Structure
1. [What is HTML?](#section1)
2. [Scraping The Guardian](#section2)
3. [Scraping quotes.toscrape.com](#section3)

<a id='section1'></a>
# 1. What is HTML?

Source: https://www.w3schools.com/html/html_intro.asp

- HTML stands for Hyper Text Markup Language
- HTML is the standard markup language for creating Web pages
- It describes the structure of a Web page
- HTML consists of a series of elements that tell the browser how to display the content
- HTML elements label pieces of content such as "this is a **heading**", "this is a **paragraph**", "this is a **link**", etc.

<div style="display: flex; justify-content: flex-start; gap: 1px;">
    <img src="HTML-basic_format.png" alt="HTML Basic Format" width="500"/>
    <img src="HTML-doc.png" alt="HTML Doc Structure" width="200"/>
</div>

HTML tags look something like this (very similar to XML files):

![HTML Tags](HTML-sample_tag.png)

An example of some HTML code is shown below:

In [None]:
<div>
<img src="attachment:image.png" width="400" align='left'/>
</div>
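
To make the tag structure concrete, here is a minimal sketch that walks through a small HTML string and records each opening tag, its attributes, its text, and each closing tag. The HTML string and the `TagLogger` class are made up for illustration, and the sketch uses Python's built-in `html.parser` module rather than the BeautifulSoup library introduced later in this tutorial.

```python
from html.parser import HTMLParser

# A tiny, made-up HTML document used only for illustration
SAMPLE_HTML = """
<html>
  <body>
    <h1>My First Heading</h1>
    <p class="intro">My first paragraph.</p>
    <a href="https://www.w3schools.com">A link</a>
  </body>
</html>
"""


class TagLogger(HTMLParser):
    """Record each start tag, end tag, and non-empty piece of text."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, e.g. [('class', 'intro')]
        self.events.append(('start', tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.events.append(('text', text))


logger = TagLogger()
logger.feed(SAMPLE_HTML)

for event in logger.events:
    print(event)
```

Each element shows up as a `start` event, optionally a `text` event, and an `end` event, which mirrors the opening tag, content, and closing tag pattern described above.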

# 2. Scraping the Guardian
<a id='section2'></a>

Let's scrape the list of the best books of the 21st century, according to The Guardian:

In [1]:
import requests

# URL of the page to scrape
url = 'https://www.theguardian.com/books/2019/sep/21/best-books-of-the-21st-century'

# Request the page
response = requests.get(url)

# Print the response status to check that we retrieved the data successfully
print(response)

<Response [200]>
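
A status of 200 means the request succeeded. In a script, it can be convenient to check this automatically instead of reading the printed response. The helper below is a hedged sketch: the name `fetch_html` and the 10-second timeout are assumptions chosen for illustration, not part of the original tutorial.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML text at `url`, raising an error if the request fails."""
    # timeout stops the call from hanging forever if the server does not answer
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text


# Example usage (requires network access):
# html = fetch_html('https://www.theguardian.com/books/2019/sep/21/best-books-of-the-21st-century')
```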


To parse our HTML document, we’ll use a Python module called BeautifulSoup, one of the most widely used HTML parsing libraries for web scraping in Python.

In [2]:
from bs4 import BeautifulSoup

If the import above fails because the library is not installed, no worries! Let's `pip install` it now and then re-run the import:

In [3]:
!pip install bs4 # only needs to run once; skip this cell if bs4 is already installed

Collecting bs4
  Downloading bs4-0.0.2-py2.py3-none-any.whl (1.2 kB)
Installing collected packages: bs4
Successfully installed bs4-0.0.2


Let's start to parse the website data with `BeautifulSoup` like so:

In [5]:
# print(response.text)

In [9]:
html_soup = BeautifulSoup(response.text, 'html.parser')

print(str(html_soup)[0:200]) # preview the first 200 characters

<!DOCTYPE html>

<html lang="en">
<head>
<!-- Hello there, HTML enthusiast! -->
<!-- DCR commit hash 2eb4af1a41e9fbee7a623e8ab6eaf7343a9ffee3 -->
<title>The 100 best books of the 21st century  | Books


We are interested in extracting a few specific tags. For this, we can use the built-in `find_all` method:

In [10]:
help(html_soup.find_all)

Help on method find_all in module bs4.element:

find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs) method of bs4.BeautifulSoup instance
    Look in the children of this PageElement and find all
    PageElements that match the given criteria.
    
    All find_* methods take a common set of arguments. See the online
    documentation for detailed explanations.
    
    :param name: A filter on tag name.
    :param attrs: A dictionary of filters on attribute values.
    :param recursive: If this is True, find_all() will perform a
        recursive search of this PageElement's children. Otherwise,
        only the direct children will be considered.
    :param limit: Stop looking after finding this many results.
    :kwargs: A dictionary of filters on attribute values.
    :return: A ResultSet of PageElements.
    :rtype: bs4.element.ResultSet
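
The parameters listed in the help text can be tried out on a small example. The sketch below uses a made-up HTML string that imitates the heading layout of the article (the `<p class="note">` element is invented for illustration) and shows the `name`, `limit`, `attrs`, and `class_` filters.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the heading layout of the article
sample_html = """
<h2 id="100">100</h2>
<h2 id="i-feel-bad-about-my-neck"><strong>I Feel Bad About My Neck</strong></h2>
<h2 id="by-nora-ephron-2006">by Nora Ephron (2006)</h2>
<p class="note">Not a heading</p>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# name filter: every <h2> tag
all_h2 = sample_soup.find_all('h2')
print(len(all_h2))

# limit: stop after the first two matches
first_two = sample_soup.find_all('h2', limit=2)
print([tag.text for tag in first_two])

# attrs: filter on an attribute value
by_id = sample_soup.find_all('h2', attrs={'id': '100'})
print(by_id[0].text)

# class_ keyword: filter on the CSS class (note the trailing underscore,
# needed because `class` is a reserved word in Python)
notes = sample_soup.find_all('p', class_='note')
print(notes[0].text)
```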



In [11]:
# let's find all the h2 tags
book_titles = html_soup.find_all('h2')

# let's print our variable; find_all returns a list-like ResultSet
print(book_titles)

[<h2 class="dcr-1x1qaem" id="100">100</h2>, <h2 class="dcr-1x1qaem" id="i-feel-bad-about-my-neck"><strong>I Feel Bad About My Neck</strong></h2>, <h2 class="dcr-1x1qaem" id="by-nora-ephron-2006">by Nora Ephron (2006)</h2>, <h2 class="dcr-1x1qaem" id="99">99</h2>, <h2 class="dcr-1x1qaem" id="broken-glass"><strong>Broken Glass</strong></h2>, <h2 class="dcr-1x1qaem" id="by-alain-mabanckou-2005-translated-by-helen-stevenson-2009">by Alain Mabanckou (2005), translated by Helen Stevenson (2009)</h2>, <h2 class="dcr-1x1qaem" id="98">98</h2>, <h2 class="dcr-1x1qaem" id="the-girl-with-the-dragon-tattoo"><strong>The Girl </strong><strong>With the Dragon Tattoo</strong></h2>, <h2 class="dcr-1x1qaem" id="by-stieg-larsson-2005-translated-by-steven-t-murray-2008">by Stieg Larsson (2005), translated by Steven T Murray (2008)</h2>, <h2 class="dcr-1x1qaem" id="97">97</h2>, <h2 class="dcr-1x1qaem" id="harry-potter-and-the-goblet-of-fire"><strong>Harry Potter and the Goblet of Fire</strong></h2>, <h2 cla

In [16]:
# index 0 holds the rank ('100'), so the first title is at index 1
print(book_titles[1])

<h2 class="dcr-1x1qaem" id="i-feel-bad-about-my-neck"><strong>I Feel Bad About My Neck</strong></h2>


In [17]:
# print out the text content of the tag
print(book_titles[1].text)

I Feel Bad About My Neck


In [18]:
# let's iterate over all the h2 tags
for title in book_titles:
    print(title.text)

100
I Feel Bad About My Neck
by Nora Ephron (2006)
99
Broken Glass
by Alain Mabanckou (2005), translated by Helen Stevenson (2009)
98
The Girl With the Dragon Tattoo
by Stieg Larsson (2005), translated by Steven T Murray (2008)
97
Harry Potter and the Goblet of Fire
by JK Rowling (2000)
96
A Little Life
by Hanya Yanagihara (2015)
95
Chronicles: Volume One
by Bob Dylan (2004)
94
The Tipping Point
by Malcolm Gladwell (2000)
93
Darkmans
by Nicola Barker (2007)
92
The Siege
by Helen Dunmore (2001)
91
Light
by M John Harrison (2002)
90
Visitation
by Jenny Erpenbeck (2008), translated by Susan Bernofsky (2010)
89
Bad Blood
by Lorna Sage (2000)
88
Noughts & Crosses
by Malorie Blackman (2001)
87
Priestdaddy
by Patricia Lockwood (2017)
86
Adults in the Room
by Yanis Varoufakis (2017)
85
The God Delusion
by Richard Dawkins (2006)
84
The Cost of Living
by Deborah Levy (2018)
83
Tell Me How It Ends
by Valeria Luiselli (2016), translated by Luiselli with Lizzie Davis (2017)
82
Coraline
by Neil Gaim

We want to split this flat list into separate sub-lists, going from this:

In [None]:
# example list
book_example = [100,
         'I Feel Bad About My Neck',
         'by Nora Ephron (2006)',
         99,
         'Broken Glass',
         'by Alain Mabanckou (2005), translated by Helen Stevenson (2009)',
         98,
         'The Girl With the Dragon Tattoo',
         'by Stieg Larsson (2005), translated by Steven T Murray (2008)']

To this:

In [None]:
rank_list = [100, 99, 98]
title_list = ['I Feel Bad About My Neck', 'Broken Glass', 'The Girl With the Dragon Tattoo']
year_author_list = ['Nora Ephron (2006)','Alain Mabanckou (2005)', 'Stieg Larsson (2005)']

We can use a list comprehension to iterate over the output. First, a quick refresher on how list comprehensions work:

In [27]:
# VERBOSE: copy a list with an explicit loop
my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

copy_list = []
for i in my_list:
    copy_list.append(i)
    
print(copy_list)

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]


In [28]:
my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

# make a copy
copy_list = [item for item in my_list]
print(copy_list)

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]


In [29]:
# we can also modify the individual entries of the list

# add 2 to each entry
copy_list2 = [item+2 for item in my_list]
print(copy_list2)

[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]


In [30]:
# the range function
for index in range(0, 6, 2): # range(start, stop, step) -> 0, 2, 4 (stop is up to and not including)
    print(index)

0
2
4


How do we use the `range` function to pick every third item? Start at 0, use a step size of 3, and iterate until the end of the list, `len(my_list)`:

In [31]:
print(my_list)

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]


In [32]:
# print every 3rd
every_third_list = [my_list[index] for index in range(0, len(my_list), 3)]
print(every_third_list)

[1, 4, 7, 10]
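
As an aside, Python's slice notation with a step gives the same result without an explicit `range`. This is an alternative to the comprehension above, not a replacement for it; a minimal sketch:

```python
my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

# my_list[start:stop:step]; an omitted stop means "to the end of the list"
every_third = my_list[0::3]
print(every_third)   # [1, 4, 7, 10]

# starting at index 1 or 2 gives the other two interleaved groups
second_group = my_list[1::3]
third_group = my_list[2::3]
print(second_group)  # [2, 5, 8, 11]
print(third_group)   # [3, 6, 9, 12]
```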


Now let's apply this to the books!

In [35]:
books = [title.text for title in book_titles]  # text of every h2 tag
books[0:10]

['100',
 'I Feel Bad About My Neck',
 'by Nora Ephron (2006)',
 '99',
 'Broken Glass',
 'by Alain Mabanckou (2005), translated by Helen Stevenson (2009)',
 '98',
 'The Girl With the Dragon Tattoo',
 'by Stieg Larsson (2005), translated by Steven T Murray (2008)',
 '97']

In [39]:
books = [title.text for title in book_titles]
# print(books)

# let's extract the rank
rank = [books[index] for index in range(0, len(books), 3)]

# let's do the same for titles
titles = [books[index] for index in range(1, len(books), 3)]

# same for author/year 
author_year = [books[index] for index in range(2, len(books), 3)]
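
Another way to view the same split is to group the flat list into (rank, title, byline) triples with `zip`. The sketch below uses a short stand-in list, `books_sample`, copied from the first entries shown above; the variable names are chosen for illustration only.

```python
# A short stand-in for the `books` list built from the h2 tags
books_sample = [
    '100', 'I Feel Bad About My Neck', 'by Nora Ephron (2006)',
    '99', 'Broken Glass',
    'by Alain Mabanckou (2005), translated by Helen Stevenson (2009)',
]

# Pair up every 3rd item starting at offsets 0, 1, and 2
records = list(zip(books_sample[0::3], books_sample[1::3], books_sample[2::3]))

for book_rank, book_title, byline in records:
    print(book_rank, '|', book_title, '|', byline)
```

Note that `zip` stops at the shortest input, so an incomplete group at the end of the list is silently dropped.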

In [40]:
author_year[0]

'by Nora Ephron (2006)'

In [42]:
# get the year out: take the text after the last '(' and strip the closing ')'
print(author_year[0].split('(')[-1].strip(')'))

2006


In [45]:
# get the author name

# split on '(', remove the 'by', and strip() to remove extra whitespace before/after
print(author_year[0].split('(')[0].replace('by','').strip())

Nora Ephron


It looks like the `h2` tags include some entries that are not books, so let's keep only the first 100:

In [49]:
# trim our lists to the first 100 entries
rank = rank[0:100]
titles = titles[0:100]
author_year = author_year[0:100]

# print(titles)

In [51]:
# let's split the author_year variable into two sub-lists
year = [entry.split('(')[-1].strip(')') for entry in author_year]
author = [entry.split('(')[0].replace('by','').strip() for entry in author_year]

# let's print out our lists
print(year)
print(author)

['2006', '2009', '2008', '2000', '2015', '2004', '2000', '2007', '2001', '2002', '2010', '2000', '2001', '2017', '2017', '2006', '2018', '2017', '2002', '2013', '2002', '2009', '2015', '2015', '2011', '2018', '2016', '2009', '2019', '2000', '2003', '2013', '2001', '2018', '2014', '2012', '2000', '2010', '2006', '2014', '2002', '2002', '2005', '2000', '2019', '2006', '2017', '2000', '2004', '2009', '2003', '2011', '2002', '2003-2004', '2010', '2013', '2004', '2014', '2010', '2001', '2005', '2000', '2004', '2015', '2000', '2010', '2014', '2006', '2010', '2015', '2016', '2012', '2005', '2001', '2014', '2018', '2011', '2001', '2013', '2014', '2013', '2003', '2007', '2006', '2001', '2014', '2002', '2001', '2004', '2012', '2006', '2004', '2016', '2015', '2000', '2001', '2005', '2016', '2004', '2009']
['Nora Ephron', 'Alain Mabanckou', 'Stieg Larsson', 'JK Rowling', 'Hanya Yanagihara', 'Bob Dylan', 'Malcolm Gladwell', 'Nicola Barker', 'Helen Dunmore', 'M John Harrison', 'Jenny Erpenbeck', 'Lo
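
One thing to notice in the output: `split('(')[-1]` keeps the text after the last `(`, so for a translated book it returns the translation year (the second entry, Broken Glass, comes out as 2009 even though its byline lists 2005 first). Also, `replace('by','')` would remove the letters "by" anywhere in a name. If the original publication year is what you want, a regular expression is one option. The sketch below is an illustration under that assumption; `BYLINE_PATTERN` and `parse_byline` are hypothetical names.

```python
import re

# "by <author> (<year>" with anything (e.g. translator details) allowed after
BYLINE_PATTERN = re.compile(r'^by\s+(?P<author>.+?)\s*\((?P<year>\d{4})')


def parse_byline(byline):
    """Return (author, first_year), or (None, None) if the pattern does not match."""
    match = BYLINE_PATTERN.match(byline.strip())
    if match is None:
        return None, None
    return match.group('author'), match.group('year')


print(parse_byline('by Nora Ephron (2006)'))
print(parse_byline('by Alain Mabanckou (2005), translated by Helen Stevenson (2009)'))
```

Returning `(None, None)` for bylines that do not match makes unexpected entries easy to spot afterwards.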

Let's create our dataframe!

In [52]:
import pandas as pd

# let's create a df from the list
df_books = pd.DataFrame({
    'Titles': titles,
    'Rank': rank,
    'Year': year,
    'Author': author})

df_books.head()



| | Titles | Rank | Year | Author |
|---|---|---|---|---|
| 0 | I Feel Bad About My Neck | 100 | 2006 | Nora Ephron |
| 1 | Broken Glass | 99 | 2009 | Alain Mabanckou |
| 2 | The Girl With the Dragon Tattoo | 98 | 2008 | Stieg Larsson |
| 3 | Harry Potter and the Goblet of Fire | 97 | 2000 | JK Rowling |
| 4 | A Little Life | 96 | 2015 | Hanya Yanagihara |


In [53]:
df_books.tail()

| | Titles | Rank | Year | Author |
|---|---|---|---|---|
| 95 | Austerlitz | 5 | 2001 | WG Sebald |
| 96 | Never Let Me Go | 4 | 2005 | Kazuo Ishiguro |
| 97 | Secondhand Time | 3 | 2016 | Svetlana Alexievich |
| 98 | Gilead | 2 | 2004 | Marilynne Robinson |
| 99 | Wolf Hall | 1 | 2009 | Hilary Mantel |
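
The scraped values are all strings, so `Rank` and `Year` will not sort or summarize as numbers until they are converted. The sketch below shows one possible conversion on a small stand-in dataframe, `df_sample`; the third row is made up to show what happens to a value such as '2003-2004', which appears in the scraped years above.

```python
import pandas as pd

# A small stand-in for df_books, with strings as produced by the scrape
df_sample = pd.DataFrame({
    'Titles': ['I Feel Bad About My Neck', 'Broken Glass', 'A made-up title'],
    'Rank': ['100', '99', '47'],
    'Year': ['2006', '2009', '2003-2004'],
})

# Ranks are clean digit strings, so a direct cast works
df_sample['Rank'] = df_sample['Rank'].astype(int)

# errors='coerce' turns values that are not plain numbers into NaN
df_sample['Year'] = pd.to_numeric(df_sample['Year'], errors='coerce')

print(df_sample.dtypes)
print(df_sample.sort_values('Rank'))
```

Whether a year range should become NaN, its first year, or stay as text depends on the analysis, so treat the `coerce` choice here as one option among several.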


# 3. One more example: `Quotes to Scrape`
<a id='section3'></a>

In [54]:
# URL of the website
url = 'http://quotes.toscrape.com/'

In [56]:
# check that we can reach the website
response = requests.get(url)
print(response)

# Parse the page content with BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")
# print(soup)

<Response [200]>


In [57]:
quotes = soup.find_all("div", class_ = "quote")
print(quotes)

[<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>, <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>
<span>by <small class="author" itemprop="author">J.K.

In [62]:
# print out the first quote
quotes[0]

<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>

In [65]:
# we can search for tags directly inside a single element of the quotes variable
quote_text = quotes[0].find('span', class_ = 'text')

print(quote_text.text)

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”


In [67]:
# let's extract the author
quote_author = quotes[0].find('small', class_ = 'author')

print(quote_author.text)

Albert Einstein


In [84]:
# let's extract the tags
quote_tags = quotes[0].find('div', class_ = 'tags')

print(quote_tags.text)

# # let's convert it to a string
# tags = str(quote_tags.text).replace("Tags:", '').strip()

# # let's change the newline with comma-separated
# tags = tags.replace("\n",", ")

# in one line
tags = str(quote_tags.text).replace("Tags:", '').strip().replace("\n",", ")

print(tags)

change, deep-thoughts, thinking, world
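
The string clean-up above works, but it depends on the exact whitespace in the page. Since each tag is its own `<a class="tag">` element, another option is to collect those elements directly. The sketch below runs on a trimmed copy of the tags markup shown earlier, so it does not need network access; `quote_html` and `sample_quote` are names chosen for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the tags block from the first quote
quote_html = """
<div class="quote">
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>
"""

sample_quote = BeautifulSoup(quote_html, 'html.parser')

# Option 1: one list entry per <a class="tag"> element
tag_list = [link.text for link in sample_quote.find_all('a', class_='tag')]
print(tag_list)
print(', '.join(tag_list))

# Option 2: read the comma-separated keywords from the <meta> tag's attribute
keywords = sample_quote.find('meta', class_='keywords')['content']
print(keywords.split(','))
```

Keeping the tags as a list also makes later steps, such as counting how often each tag appears, more straightforward than working with one joined string.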


In [87]:
# let's iterate over every quote inside quotes

# data to be stored in a list
data = []

for quote in quotes:
    
    # extract the text of the quote
    text = quote.find('span', class_ = 'text').text
    
    # extract the author of the quote
    author = quote.find('small', class_ = 'author').text
    
    # extract the tags
    quote_tags = quote.find('div', class_ = 'tags')
    tags = str(quote_tags.text).replace("Tags:", '').strip().replace("\n",", ")
    
    # we will make a dictionary for each row
    data.append({
        'Quote': text,
        'Author': author,
        'Tags': tags
    }
    )
    
# outside of the loop, let's make our dataframe by combining rows
df_quotes = pd.DataFrame(data)

df_quotes

| | Quote | Author | Tags |
|---|---|---|---|
| 0 | “The world as we have created it is a process ... | Albert Einstein | change, deep-thoughts, thinking, world |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | abilities, choices |
| 2 | “There are only two ways to live your life. On... | Albert Einstein | inspirational, life, live, miracle, miracles |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen | aliteracy, books, classic, humor |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe | be-yourself, inspirational |
| 5 | “Try not to become a man of success. Rather be... | Albert Einstein | adulthood, success, value |
| 6 | “It is better to be hated for what you are tha... | André Gide | life, love |
| 7 | “I have not failed. I've just found 10,000 way... | Thomas A. Edison | edison, failure, inspirational, paraphrased |
| 8 | “A woman is like a tea bag; you never know how... | Eleanor Roosevelt | misattributed-eleanor-roosevelt |
| 9 | “A day without sunshine is like, you know, nig... | Steve Martin | humor, obvious, simile |
