# CME538 - Introduction to Data Science
## Tutorial 3 - Parsing HTML

### Learning Objectives
After completing this tutorial, you should be comfortable:

- Understanding the overall structure of HTML tags
- Finding the information of interest within a web-page by going through the HTML source code
- Using Python and BeautifulSoup to extract the information of interest from HTML elements 

### Tutorial Structure
1. [What is HTML?](#section1)
2. [Scraping The Guardian](#section2)
3. [Scraping quotes.toscrape.com](#section3)

<a id='section1'></a>
# 1. What is HTML?

Source: https://www.w3schools.com/html/html_intro.asp

- HTML stands for Hyper Text Markup Language
- HTML is the standard markup language for creating Web pages
- It describes the structure of a Web page
- HTML consists of a series of elements that tell the browser how to display the content
- HTML elements label pieces of content such as "this is a **heading**", "this is a **paragraph**", "this is a **link**", etc.

<div style="display: flex; justify-content: flex-start; gap: 1px;">
    <img src="HTML-basic_format.png" alt="HTML Basic Format" width="500"/>
    <img src="HTML-doc.png" alt="HTML Doc Structure" width="200"/>
</div>

HTML tags look something like this (very similar to XML files):

![HTML Tags](HTML-sample_tag.png)

An example of some HTML code is shown below:

In [None]:
<div>
<img src="attachment:image.png" width="400" align='left'/>
</div>
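
To make the tag structure concrete, here is a minimal sketch that walks a tiny hand-written page with Python's standard-library `html.parser` and lists each opening tag with its attributes. The page content and the `TagCollector` class are illustrative assumptions, not part of the tutorial's data.

```python
from html.parser import HTMLParser

# A tiny hand-written page (illustrative only, not taken from a real site)
sample_html = """
<html>
  <head><title>My First Page</title></head>
  <body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
    <a href="https://www.w3schools.com">This is a link</a>
  </body>
</html>
"""


class TagCollector(HTMLParser):
    """Record every opening tag and its attributes, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))


collector = TagCollector()
collector.feed(sample_html)

for name, attributes in collector.tags:
    print(name, attributes)
```

Each element has a tag name, optional attributes (such as `href` on the link), and content between the opening and closing tags.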

# 2. Scraping the Guardian
<a id='section2'></a>

Let's scrape the best books from the 21st century, according to The Guardian:

In [None]:
import requests

# URL of the page to scrape
url = 'https://www.theguardian.com/books/2019/sep/21/best-books-of-the-21st-century'

# Check if we were successful in retrieving the data (status code 200 means OK)
response = requests.get(url)
print(response.status_code)

To parse our HTML document, we’ll use a Python module called BeautifulSoup, one of the most widely used web scraping libraries for Python.

If you don't have this library installed, no worries! Let's `pip install` it now:

In [None]:
!pip install beautifulsoup4  # one-time install; no need to rerun once it is installed

Let's start to parse the website data with `BeautifulSoup` like so:

In [None]:
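from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.text, 'html.parser')  # print(response.text) shows the raw HTML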

We are interested in extracting a few specific tags, so we can use the built-in `find_all` method:

In [None]:
help(html_soup.find_all)
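
As a rough illustration of what `find_all` returns, the sketch below parses a small hand-written HTML string rather than the live page. The markup, the `title` class, and the variable names are assumptions made up for this example and do not reflect The Guardian's actual page structure.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a downloaded page (not the real Guardian markup)
sample_html = """
<div>
  <h2>100</h2>
  <h2 class="title">I Feel Bad About My Neck</h2>
  <p>by Nora Ephron (2006)</p>
  <h2>99</h2>
  <h2 class="title">Broken Glass</h2>
</div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

all_h2 = sample_soup.find_all('h2')                    # every <h2> tag
title_h2 = sample_soup.find_all('h2', class_='title')  # filter by CSS class
first_two = sample_soup.find_all('h2', limit=2)        # stop after two matches

print(len(all_h2))
print([tag.text for tag in title_h2])
print([tag.text for tag in first_two])
```

`find_all` returns a list-like collection of tag objects, so indexing, `len`, and iteration all work as they do on a regular list.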

In [None]:
# let's find all the h2 tags
book_titles = html_soup.find_all('h2')

In [None]:
# let's print the first title
print(book_titles[1])

In [None]:
# print out the text content of the tag
print(book_titles[1].text)

In [None]:
# let's iterate over all the h2 tags


We want to go from a flat list like this:

In [None]:
# example list
book_example = [100,
                'I Feel Bad About My Neck',
                'by Nora Ephron (2006)',
                99,
                'Broken Glass',
                'by Alain Mabanckou (2005), translated by Helen Stevenson (2009)',
                98,
                'The Girl With the Dragon Tattoo',  # comma added: without it the two strings are silently concatenated
                'by Stieg Larsson (2005), translated by Steven T Murray (2008)']

To this:

In [None]:
rank_list = [100, 99, 98]
title_list = ['I Feel Bad About My Neck', 'Broken Glass', 'The Girl With the Dragon Tattoo']
year_author_list = ['Nora Ephron (2006)', 'Alain Mabanckou (2005)', 'Stieg Larsson (2005)']

We can use a list comprehension to iterate over the output. First, a quick refresher:

In [None]:
# verbose way: copy a list with an explicit for loop
my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

copy_list = []
for i in my_list:
    copy_list.append(i)
    
print(copy_list)

In [None]:
my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

# make a copy
copy_list = [item for item in my_list]
print(copy_list)

In [None]:
# we can also modify the individual entries of the list

# add 2 to each entry
copy_list2 = [item+2 for item in my_list]
print(copy_list2)

In [None]:
# the range function
for index in range(0, 6, 2):  # range(start, stop, step) -> 0, 2, 4 (stop is exclusive)
    print(index)

How do we use the `range` function here? We start at index 0, use a step size of 3, and stop at the end of the list, `len(my_list)`:

In [None]:
print(my_list)

In [None]:
# take every 3rd element, starting from the first
every_third_list = [my_list[index] for index in range(0, len(my_list), 3)]
print(every_third_list)
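
The same stride idea can split a flat list into three sub-lists by changing only the start offset. The sketch below uses the three-book sample from earlier; the names `flat`, `ranks_demo`, `titles_demo`, and `author_year_demo` are chosen for this illustration only.

```python
# Flat list in the repeating order: rank, title, author/year
flat = [100, 'I Feel Bad About My Neck', 'by Nora Ephron (2006)',
        99, 'Broken Glass', 'by Alain Mabanckou (2005), translated by Helen Stevenson (2009)',
        98, 'The Girl With the Dragon Tattoo', 'by Stieg Larsson (2005), translated by Steven T Murray (2008)']

# offset 0 -> ranks, offset 1 -> titles, offset 2 -> author/year strings
ranks_demo = [flat[i] for i in range(0, len(flat), 3)]
titles_demo = [flat[i] for i in range(1, len(flat), 3)]
author_year_demo = [flat[i] for i in range(2, len(flat), 3)]

print(ranks_demo)
print(titles_demo)

# slicing with a step is a more compact way to get the same sub-lists
print(flat[0::3] == ranks_demo)
```

This only works when every record contributes exactly three entries; a single missing or extra entry shifts everything after it.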

Now let's apply this to the books!

In [None]:
books = [title.text for title in book_titles]
# print(books)

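# let's extract the rank (every 3rd entry, starting at index 0)
rank = [books[index] for index in range(0, len(books), 3)]

# let's do the same for titles (starting at index 1)
titles = [books[index] for index in range(1, len(books), 3)]

# same for author/year (starting at index 2)
author_year = [books[index] for index in range(2, len(books), 3)]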

In [None]:
# get the year out


In [None]:
# get the author name

# split on '(', remove the leading 'by', and strip() the surrounding whitespace


Hmm, it looks like there is still some noise in the tags. Let's make sure our lists contain only books:

In [None]:
# trim our lists to the first 100 entries
rank = rank[:100]
titles = titles[:100]
author_year = author_year[:100]


# print(titles)

In [None]:
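# let's modify the author_year variable to create two sub-lists
year = [entry.split('(')[1].split(')')[0] for entry in author_year]
author = [entry.split('(')[0].replace('by', '', 1).strip() for entry in author_year]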

# let's print out our lists
print(year)
print(author)
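
As an alternative to chained `split` calls, a regular expression can pull out the author and the first year in one step. This is only a sketch: the pattern assumes every entry looks like `by <author> (<year>)`, optionally followed by translator details, and the `samples` list is copied from the example above rather than scraped.

```python
import re

samples = [
    'by Nora Ephron (2006)',
    'by Alain Mabanckou (2005), translated by Helen Stevenson (2009)',
]

# "by", then the author (lazy match), then the first four-digit year in parentheses
pattern = re.compile(r'^by\s+(.+?)\s*\((\d{4})\)')

parsed = []
for entry in samples:
    match = pattern.match(entry)
    if match:  # skip anything that does not look like "by <author> (<year>)"
        parsed.append((match.group(1), match.group(2)))

print(parsed)
```

Entries that do not match are skipped instead of raising an `IndexError`, which can help when the scraped list still contains noise.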

Let's create our dataframe!

In [None]:

df_books = ...
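
One possible way to assemble the lists into a table is to pass a dictionary of equal-length lists to `pd.DataFrame`. The sketch below uses the three sample books; the column names and the `_demo` variables are assumptions for illustration, not the tutorial's final answer.

```python
import pandas as pd

# Small hand-made lists standing in for the scraped data
rank_demo = [100, 99, 98]
title_demo = ['I Feel Bad About My Neck', 'Broken Glass', 'The Girl With the Dragon Tattoo']
author_demo = ['Nora Ephron', 'Alain Mabanckou', 'Stieg Larsson']
year_demo = ['2006', '2005', '2005']

# each dictionary key becomes a column; all lists must have the same length
df_demo = pd.DataFrame({
    'rank': rank_demo,
    'title': title_demo,
    'author': author_demo,
    'year': year_demo,
})

# years were scraped as text, so convert them to integers
df_demo['year'] = df_demo['year'].astype(int)

print(df_demo)
```

If the lists differ in length, `pd.DataFrame` raises a `ValueError`, which is a useful check that the trimming step worked.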

# 3. One more example: `Quotes to Scrape`
<a id='section3'></a>

In [None]:
# URL of the website
url = 'http://quotes.toscrape.com/'

In [None]:
# check that we can scrape the website


# Parse the page content with BeautifulSoup


In [None]:
# print out the first quote
quotes[0]
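
To show how a single quote block might be taken apart, the sketch below parses a hand-written HTML snippet. The class names (`quote`, `text`, `author`, `tag`) are assumptions about the site's markup and should be confirmed against the page source; the quote text and author are placeholders.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one quote block (placeholder content)
quote_html = """
<div class="quote">
  <span class="text">"A sample quote."</span>
  <span>by <small class="author">Sample Author</small></span>
  <div class="tags">
    Tags:
    <a class="tag" href="/tag/life/">life</a>
    <a class="tag" href="/tag/books/">books</a>
  </div>
</div>
"""

demo_soup = BeautifulSoup(quote_html, 'html.parser')
demo_quote = demo_soup.find('div', class_='quote')

# search inside the quote block rather than the whole page
quote_text = demo_quote.find('span', class_='text').text
quote_author = demo_quote.find('small', class_='author').text

# collect the individual tag links and join them into one comma-separated string
quote_tags = ', '.join(a.text for a in demo_quote.find_all('a', class_='tag'))

print(quote_text)
print(quote_author)
print(quote_tags)
```

Joining the individual `<a>` tags avoids having to clean up newlines and the "Tags:" label that appear when taking `.text` of the whole tags container.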

In [None]:
# we can iterate using tags directly inside the quotes variable


In [None]:
# let's extract the author


In [None]:
# let's extract the tags


# let's convert it to a string


# let's replace the newlines with commas


# in one line


In [None]:
import pandas as pd

# let's iterate over every quote inside quotes

# data to be stored in a list
data = []

for quote in quotes:
    
    ...
    
# outside of the loop, let's make our dataframe by combining rows
df_quotes = pd.DataFrame(data)

df_quotes
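
A common pattern for this kind of loop is to build one dictionary per quote and append it to a list, then let pandas turn the list into rows. The sketch below runs on hand-written HTML with placeholder quotes; the class names and column names are assumptions for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hand-written stand-in for a page containing two quote blocks
page_html = """
<div class="quote">
  <span class="text">"First sample quote."</span>
  <small class="author">Author One</small>
  <a class="tag">life</a><a class="tag">books</a>
</div>
<div class="quote">
  <span class="text">"Second sample quote."</span>
  <small class="author">Author Two</small>
  <a class="tag">humor</a>
</div>
"""

demo_quotes = BeautifulSoup(page_html, 'html.parser').find_all('div', class_='quote')

rows = []
for q in demo_quotes:
    # one dictionary per quote; the keys become the DataFrame columns
    rows.append({
        'quote': q.find('span', class_='text').text,
        'author': q.find('small', class_='author').text,
        'tags': ', '.join(t.text for t in q.find_all('a', class_='tag')),
    })

df_demo_quotes = pd.DataFrame(rows)
print(df_demo_quotes)
```

Building the list first and creating the DataFrame once at the end is generally simpler and faster than appending to a DataFrame inside the loop.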