# BLU03 - Learning Notebook - Part 3 of 3 - Web scraping


## 1. Introduction

In the context of data wrangling, we've already talked about three data sources: files, databases and public APIs.
Now it's time to delve into the Web!

As we all know, there is a huge amount of data on the Web. Whenever we search for something on Google, it shows us thousands of web pages full of answers.

However, there is a problem here: in most cases, web pages show us the data in a beautiful but unstructured way. This makes sense, since the purpose of a web page is to be read by a human and not to have its content analysed by some computer program.

So we are left with the boring task of copying and pasting the data we want into CSV files or Excel tables, possibly thousands of times, before feeding it to some data model...

But worry no more!

<img src="media/web_scraping_to_the_rescue.png" width=350/>

## 2. What is web scraping

[Web scraping](https://en.wikipedia.org/wiki/Web_scraping) is the name given to the process of extracting data from web pages in an automated way.
There are many [techniques](https://en.wikipedia.org/wiki/Web_scraping#Techniques) that can be used to do web scraping and the one we're going to explore here is HTML parsing.

A web page is an HTML document, so HTML parsing means splitting the contents of a web page into several small pieces and selecting the parts we find interesting. This technique is useful when we want to extract data from many web pages that share a common template.

## 3. Understanding the HTML code of a web page

Before jumping to the part where we actually do web scraping, let's first understand the structure and code of a web page.

Usually, a web page has 3 different types of code:
* **HTML**: used to display the content of the web page
* **CSS**: used to apply styles to the web page, it's what makes the page pretty
* **JavaScript**: this is what makes the page dynamic, like triggering an action when a button is clicked.

We'll focus now on the HTML part, since it's the one that is related to what we want, which is data.

In the file **web_pages/nationalmuseum.html** you can see an example of an HTML document that represents a web page. Let's look at the code.

In [1]:
# use ! type for Windows (use full path)
! cat web_pages/nationalmuseum.html

<!DOCTYPE html>
<html>
  <body>
    <h1>Webpage about the Nationalmuseum</h1>
    <h3>It's in Sweden.</h3>
    <p>For more informations:</p>
    <br>
    <p>Check wikipedia!</p>
  </body>
</html>

And this is how the page looks in a browser.

![title](media/nationalmuseum_page_2.png)

As you can see above, an HTML page is a collection of HTML elements, where an element has the form:
```<tagname> content </tagname>```.

HTML elements can be nested inside other HTML elements, meaning that the content between the start and end tags can itself be a set of elements.

An HTML element can also have no content. In that case, it's simply a tagname, like this:
```<tagname>```.

Let's go through the elements in this page:
- the ```<!DOCTYPE html>``` says that this document is an HTML document
- the ```<html>``` element is the root element of an HTML page
- the ```<body>``` element has the page content
- the ```<h1>``` element is a large heading
- the ```<h3>``` element is a smaller heading
- the ```<p>``` element is a paragraph
- the ```<br>``` element is a line break, which is an example of an element without content
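
To make the list above concrete, here is a minimal sketch that feeds the same HTML to Beautiful Soup (the parsing package introduced in the next section) and lists the tags it finds. The `html` string is copied from the file shown earlier; the variable names are our own choice.

```python
from bs4 import BeautifulSoup

# Same content as web_pages/nationalmuseum.html, inlined so the sketch is self-contained
html = """<!DOCTYPE html>
<html>
  <body>
    <h1>Webpage about the Nationalmuseum</h1>
    <h3>It's in Sweden.</h3>
    <p>For more informations:</p>
    <br>
    <p>Check wikipedia!</p>
  </body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

# find_all(True) matches every tag, in document order
tag_names = [tag.name for tag in soup.find_all(True)]
print(tag_names)

# The text of each paragraph element
paragraphs = [p.get_text() for p in soup.find_all("p")]
print(paragraphs)
```

Note that the `<!DOCTYPE html>` declaration does not show up in `tag_names`, since it is a declaration and not an element.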

## 4. How to scrape the web

Now let's go to the fun part!

Going back to our movies database, you can see that there are some characters for which we're missing the `character_name`.
You can try to query the database to find out which characters these are, but in the meantime, we gathered them in the file **data/missing_character_names.csv**.

In [1]:
import pandas as pd
import requests
# Import some helper functions to print shorter outputs
import utils

from bs4 import BeautifulSoup

In [2]:
missing_character_names = pd.read_csv('data/missing_character_names.csv')
missing_character_names.head()

Unnamed: 0,id,movie_id,imdb_id,actor_id,name,character_name
0,1073,718,tt0116405,82957,Dan Aykroyd,
1,1218,17579,tt0120240,105261,Bonnie Hunt,
2,1219,17579,tt0120240,79974,N'Bushe Wright,
3,1220,17579,tt0120240,55658,Michael Rapaport,
4,1221,17579,tt0120240,57737,Denis Leary,


Can you think of a good way to get this missing data? An internet movie database seems like a very good candidate! Fortunately, the LDSA has got you covered.

As an exercise, let's try to find Dan Aykroyd's character name in the movie with ID `tt0116405`. A quick internet search reveals that this movie is called **Getting Away With Murder**.

The first thing to do is to open the web page that has the content we're interested in: **https://s02-infrastructure.s3.eu-west-1.amazonaws.com/ldsa_imdb/index.html#**

It should look like this:

<img src="media/imdb_movie_page.png"/>

Now, let's scroll down to the cast section of the page, since this is what we'll be scraping.

<img src="media/imdb_cast.png"/>

In order to get the page's content, we'll use a GET request.

We can get the content from the response, which will be... a bunch of incomprehensible HTML.

In [3]:
response = requests.get("https://s02-infrastructure.s3.eu-west-1.amazonaws.com/ldsa_imdb/index.html#")


# Printing short output, if you want to see everything, delete the friendly_print function call
utils.friendly_print_string(response.content)

b'<!DOCTYPE html><html lang="en">\r\n<head>\r\n<title>LDSA-IMDB</title>\r\n<meta charset="utf-8">\r\n<link rel="shortcut icon" type="image/x-icon" href="css/images/favicon.ico">\r\n<link rel="stylesheet" href="css/style.css" type="text/css" media="all">\r\n<link rel="stylesheet" href="css/colorbox.css" type="text/css" media="all">\r\n</head>\r\n<body>\r\n<!-- wrapper -->\r\n<div id="wrapper">\r\n  <div class="light-bg">\r\n    <!-- shell -->\r\n    <div class="shell">\r\n      <!-- header -->\r\n      <div class="header">\r\n   '
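
As a side note, a request can fail (the page moved, the server is down, there is no connection), and parsing an error page is rarely what we want. Below is a minimal sketch of a more defensive fetch. `fetch_html` is a hypothetical helper name of our own, not part of `requests`; the `timeout` argument and the `raise_for_status` method are standard `requests` features.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the decoded HTML of `url`, raising an exception on HTTP errors.

    Hypothetical helper for illustration; the 10 second timeout is an arbitrary choice.
    """
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx status codes instead of failing silently
    response.raise_for_status()
    return response.text
```

Calling `fetch_html` with the URL used above should return the same HTML as `response.content`, only decoded to a string.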


And here is where **Beautiful Soup** can help us. Beautiful Soup is a package for parsing HTML documents. It allows us to break down HTML documents into smaller components and extract the information we need. You can check its documentation [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

First, we need to create an instance of the BeautifulSoup class, passing it the HTML document to parse.

In [4]:
soup = BeautifulSoup(response.content, 'html.parser')

By calling the **prettify** method, we can see the HTML elements of the document in a pretty and indented way.

In [5]:
# Printing short output, if you want to see everything, delete the friendly_print function call
utils.friendly_print_string(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <title>
   LDSA-IMDB
  </title>
  <meta charset="utf-8"/>
  <link href="css/images/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
  <link href="css/style.css" media="all" rel="stylesheet" type="text/css"/>
  <link href="css/colorbox.css" media="all" rel="stylesheet" type="text/css"/>
 </head>
 <body>
  <!-- wrapper -->
  <div id="wrapper">
   <div class="light-bg">
    <!-- shell -->
    <div class="shell">
     <!-- header -->
     <div class="


By accessing the **children** property of the soup, we can break it down into its top-level elements.

We can see that this soup has two top-level elements:

* a Doctype element, with the value 'html'.
* a Tag element, with tag html.

As we've seen before, the Doctype element simply indicates that our soup corresponds to an html document (a webpage).

We're particularly interested in the `html` Tag element, which is where the actual HTML content is.

In [6]:
soup_children = list(soup.children)

# inspecting the types of the elements in the soup
[type(item) for item in soup_children]

[bs4.element.Doctype, bs4.element.Tag]

To get the `html` tag element from the soup, we can just call it by its name.

In [7]:
soup.html

<html lang="en">
<head>
<title>LDSA-IMDB</title>
<meta charset="utf-8"/>
<link href="css/images/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<link href="css/style.css" media="all" rel="stylesheet" type="text/css"/>
<link href="css/colorbox.css" media="all" rel="stylesheet" type="text/css"/>
</head>
<body>
<!-- wrapper -->
<div id="wrapper">
<div class="light-bg">
<!-- shell -->
<div class="shell">
<!-- header -->
<div class="header">
<!-- socials -->
<div class="socials"> <a class="facebook-ico" href="#">facebook-ico</a> <a class="twitter-ico" href="#">twitter-ico</a> <a class="you-tube-ico" href="#">you-tube-ico</a> </div>
<!-- end of socials -->
<h1 id="logo"><a href="#">LDSA-IMDB</a></h1>
<!-- navigation -->
<nav id="navigation">
<ul>
<li><a href="#">SHOW ALL</a></li>
<li><a href="#">LATEST MOVIES <span>20</span></a></li>
<li><a href="#">TOP RATED</a></li>
<li><a href="#">MOST COMMENTED</a></li>
</ul>
</nav>
<!-- end of navigation -->
<div class="cl"> </div>
</div>
<!-- en

We can navigate through the tags contained inside the `html` tag, to get to any element in the page.

Let's check out the title of the page. This is contained in the `title` tag.

We can find it two levels below the `html` tag, inside the `head` tag: 

In [8]:
soup.html.head.title

<title>LDSA-IMDB</title>

We can see that this tag has no child tags. Its content is simply a string, which we can get by calling the **get_text** method:

In [9]:
soup.html.head.title.get_text()

'LDSA-IMDB'
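
Spelling out the full path is not strictly required: accessing a tag name as an attribute returns the first tag with that name anywhere below, so a shorter form usually works too. The sketch below uses a trimmed, inlined copy of the page's head, so the variable names and the `<h1>` content are only for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the real page
html = "<html><head><title>LDSA-IMDB</title></head><body><h1>LDSA-IMDB</h1></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Full path, one level at a time
full_path = soup.html.head.title

# Shortcut: first <title> tag found anywhere in the document
shortcut = soup.title

print(full_path.get_text())
print(shortcut == full_path)
```

The shortcut is convenient for unique tags like `title`, but for tags that appear many times (such as `div`) it only gives you the first one.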

By now, you must be thinking that this is a somewhat complicated process, as it requires manually inspecting the HTML document and navigating through thousands of tags in order to find the interesting content in the middle of a big mess. And you're right!

However, there is an easier way to access the interesting content directly.

First, you need to open the **developer tools** of your browser, in the page you want to scrape.
These are tools that allow you to look in greater detail at the content of the website and at the processes running in the background.

Usually, you just have to right-click the page and select the "Inspect" option. 
If that's not the case, just Google "How to open developer tools in *\<your browser\>*".

![title](media/dt_open.png)

The developer tools will open at the bottom or on the side of the window. We're only interested in the **Inspector** tool, which allows us to look at the HTML elements that correspond to the different parts of the page.

After clicking on the small arrow (circled in red), you can click on any object in the page with your mouse, and you'll see the corresponding HTML element highlighted in the developer tools window. Similarly, if you hover over the HTML code in the Inspector window, the corresponding part of the page will be highlighted.

![title](media/dt_actordiv.png)

By inspection, we can see that all the information about the actors/actresses is inside an element with tag **div** and **class** `actor-list`. The class of an HTML element can be useful to identify what its content might be.

![title](media/dt_actorlist.png)

We can inspect even further and notice that the `actor-list` div has three children. The children have two classes - `actor-info` and `grid-container` - which seem to indicate that each child element contains information for a single actor.

Drilling down a bit more, we notice that the `actor-info` div contains two children, with `div` tags and classes `actor-portrait`/`actor-data`.

Finally, inside `actor-data`, we can find two children with `p` tags. These elements don't have a class, but have an **attribute** with name `infotype` and value `actor-name`/`character-name`. Attributes can also be used to identify the content of an element.

![title](media/dt_actor.png)

So we have arrived at the character names, which is exactly what we set out to discover!
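
Classes and attributes are not only visible in the developer tools; Beautiful Soup exposes them on each tag as well. The sketch below is a small illustration on an inlined fragment that mirrors the structure we just inspected (trimmed to a single actor, without the portrait).

```python
from bs4 import BeautifulSoup

# Trimmed fragment mirroring the structure seen in the Inspector
html = """<div class="actor-info grid-container">
 <div class="actor-data">
  <p infotype="actor-name">Dan Aykroyd</p>
  <p infotype="character-name">Jack Lambert</p>
 </div>
</div>"""
soup = BeautifulSoup(html, "html.parser")

# An element can have several classes, so "class" is returned as a list
actor_div = soup.find("div", class_="actor-info")
print(actor_div["class"])

# Any attribute can be used as a filter through the attrs dictionary
name_tag = soup.find("p", attrs={"infotype": "actor-name"})
print(name_tag["infotype"])
print(name_tag.attrs)
```

Searching by `class_="actor-info"` still matches the div even though it also carries the `grid-container` class.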

Let's try to replicate this process using our _beautiful soup_. 

First, call the soup's **find_all** method to find the div element with class `actor-list` (and make sure there's only one in this page).

In [10]:
# pay attention to the underscore after class (class_) in the function's parameters.
# this is because "class" is a Python keyword.
actor_list = soup.find_all('div', class_="actor-list")
print("Number of elements found: ", len(actor_list))

Number of elements found:  1


Cool! Now let's search for all children that have tag `div`, and the `actor-info` class:

In [11]:
actor_info = actor_list[0].find_all('div', class_='actor-info')

# Checking out how many were found:
print(f"Found {len(actor_info)} actors.\n")

# Checking out one of them
print(actor_info[0].prettify())

Found 3 actors.

<div class="actor-info grid-container">
 <div class="actor-portrait">
  <img src="css/images/dan.jpg" width="100%"/>
 </div>
 <div class="actor-data">
  <p infotype="actor-name">
   Dan Aykroyd
  </p>
  <p infotype="character-name">
   Jack Lambert
  </p>
 </div>
</div>



Looks correct!

Now, let's focus on the first actor - Dan Aykroyd - and discover the name of his character.

For that, we simply have to look for the `<p>` child element with attribute **infotype**=**character-name**. Since we're looking for a single element, we can use the `find` method.

Searching for children is recursive (meaning: it searches immediate children, then children-of-children, and so on), so we don't need to find the `actor-data` div first.

In [12]:
character_name = actor_info[0].find('p', infotype='character-name')

character_name

<p infotype="character-name">Jack Lambert</p>

We can extract the text using the `get_text` method:

In [13]:
character_name.get_text()

'Jack Lambert'
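
If you already know CSS selectors, Beautiful Soup also offers the `select` and `select_one` methods, which can express the same search in a more compact form. The following is a sketch of an equivalent lookup on an inlined, trimmed copy of the cast section (one actor only); it is an alternative to the `find_all`/`find` calls above, not a replacement for them.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the cast section of the page
html = """
<div class="actor-list">
  <div class="actor-info grid-container">
    <div class="actor-data">
      <p infotype="actor-name">Dan Aykroyd</p>
      <p infotype="character-name">Jack Lambert</p>
    </div>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# "div.actor-list div.actor-info": actor-info divs anywhere inside the actor-list div
cards = soup.select("div.actor-list div.actor-info")

# 'p[infotype="character-name"]': a <p> tag whose infotype attribute has that value
character = cards[0].select_one('p[infotype="character-name"]')

print(len(cards), character.get_text())
```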

We have found Dan Aykroyd's character in **Getting Away With Murder**, which is Jack Lambert!

And the best part is that it will only take a few minutes to get all the other character and actor names. You're invited to do that as an exercise.
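
Once a character name has been scraped, it still has to find its way back into the table of missing values. Here is a minimal sketch of that last step, assuming the scraped results are kept in a dictionary keyed by `(imdb_id, name)`. The small `missing` DataFrame and the `scraped` dictionary are hand-built stand-ins for illustration, using two rows from the CSV shown earlier.

```python
import pandas as pd

# Stand-in for a few columns of data/missing_character_names.csv
missing = pd.DataFrame({
    "imdb_id": ["tt0116405", "tt0120240"],
    "name": ["Dan Aykroyd", "Bonnie Hunt"],
    "character_name": [None, None],
})

# Assumed shape for the scraping results: (imdb_id, actor name) -> character name
scraped = {("tt0116405", "Dan Aykroyd"): "Jack Lambert"}

filled = []
for imdb_id, name, current in zip(
    missing["imdb_id"], missing["name"], missing["character_name"]
):
    if pd.isna(current):
        # Only fill gaps; stays empty if we have not scraped this actor yet
        filled.append(scraped.get((imdb_id, name)))
    else:
        filled.append(current)

missing["character_name"] = filled
print(missing)
```

Rows for which nothing was scraped simply stay empty, so the same code can be re-run as more pages are processed.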

## 5. Tables

An interesting and useful application of web scraping with the `BeautifulSoup` package is obtaining data from tables on the web.

In this example, we wish to get a list of companies in the S&P 500 from Wikipedia. This table exists at: https://en.wikipedia.org/wiki/List_of_S%26P_500_companies

We begin by making our request and then initializing a `soup` object.

In [4]:
# requests.get returns a response object, so we name it accordingly
response = requests.get("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
soup = BeautifulSoup(response.text, "html.parser")

The tag of interest will be the `table` tag.  If we use the `.find_all()` method, we can see that there are two tables on this page.

In [5]:
print(len(soup.find_all("table")))

2


Let's examine the content of the first table.

In [8]:
table = soup.find_all("table")[0]
print(table)

<table class="wikitable sortable" id="constituents">
<tbody><tr>
<th><a href="/wiki/Ticker_symbol" title="Ticker symbol">Symbol</a>
</th>
<th>Security</th>
<th><a href="/wiki/SEC_filing" title="SEC filing">SEC filings</a></th>
<th><a href="/wiki/Global_Industry_Classification_Standard" title="Global Industry Classification Standard">GICS</a> Sector</th>
<th>GICS Sub-Industry</th>
<th>Headquarters Location</th>
<th>Date first added</th>
<th><a href="/wiki/Central_Index_Key" title="Central Index Key">CIK</a></th>
<th>Founded
</th></tr>
<tr>
<td><a class="external text" href="https://www.nyse.com/quote/XNYS:MMM" rel="nofollow">MMM</a>
</td>
<td><a href="/wiki/3M" title="3M">3M</a></td>
<td><a class="external text" href="https://www.sec.gov/edgar/browse/?CIK=66740" rel="nofollow">reports</a></td>
<td>Industrials</td>
<td>Industrial Conglomerates</td>
<td><a href="/wiki/Saint_Paul,_Minnesota" title="Saint Paul, Minnesota">Saint Paul, Minnesota</a></td>
<td>1976-08-09</td>
<td>0000066740</td

This shows some new tags.  The ones of interest are:
- `th` : Table Header
- `tr` : Table Row
- `td` : Table Data

The Table Headers will be the columns of the table, while the data tags will be contained within the row tags.

To get the columns, we just loop through the `th` tags (the titles are stored in the text attribute).  We also note that some of these have a newline in the title, so we will strip that for now.

In [10]:
columns = []
for header in table.find_all("th"):
    columns.append(header.text.strip("\n"))
print(columns)

['Symbol', 'Security', 'SEC filings', 'GICS Sector', 'GICS Sub-Industry', 'Headquarters Location', 'Date first added', 'CIK', 'Founded']


Now we can get the data for the rows by looping through each `td` in a given `tr` (the first `tr` is the header row, so we want to start at index 1).

In [12]:
row = table.find_all("tr")[1]
row_data = []
for data in row.find_all("td"):
    row_data.append(data.text.strip("\n"))
print(row_data)

['MMM', '3M', 'reports', 'Industrials', 'Industrial Conglomerates', 'Saint Paul, Minnesota', '1976-08-09', '0000066740', '1902']


We see that this matches the first row on the Wikipedia page. We leave it as a follow-up exercise to figure out how to set up a pandas DataFrame using these table tags to match the Wikipedia page.
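
If you get stuck, here is one possible approach, sketched on a small inlined table so it runs without a network connection. The HTML is a trimmed stand-in with three of the columns and two of the rows seen in this notebook; the same loops should carry over to the `table` object scraped above, although the real page may need extra cleaning.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Trimmed stand-in for the Wikipedia table (three columns, two rows)
html = """
<table id="constituents">
<tbody><tr>
<th>Symbol
</th>
<th>Security</th>
<th>GICS Sector</th></tr>
<tr>
<td>MMM
</td>
<td>3M</td>
<td>Industrials</td></tr>
<tr>
<td>AOS
</td>
<td>A. O. Smith</td>
<td>Industrials</td></tr>
</tbody></table>
"""
table = BeautifulSoup(html, "html.parser").find("table")
rows = table.find_all("tr")

# First row holds the headers; strip=True removes the stray newlines
columns = [th.get_text(strip=True) for th in rows[0].find_all("th")]

# Every other row becomes one list of cell values
records = [
    [td.get_text(strip=True) for td in row.find_all("td")]
    for row in rows[1:]
]

df = pd.DataFrame(records, columns=columns)
print(df)
```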

### Aside: pd.read_html()

In the last section, we scraped a data table row by row.  This requires multiple loops and nested loops.  In certain cases, we can let pandas do the work for us!

Using the `read_html` function, we can replace our nested loops with one line of code. Remember there were two tables in the original HTML, so let's just look at the first one (note that this requires `lxml` to be installed).

In [3]:
print( pd.read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")[0].head() )

  Symbol     Security SEC filings  GICS Sector         GICS Sub-Industry  \
0    MMM           3M     reports  Industrials  Industrial Conglomerates   
1    AOS  A. O. Smith     reports  Industrials         Building Products   
2    ABT       Abbott     reports  Health Care     Health Care Equipment   
3   ABBV       AbbVie     reports  Health Care           Pharmaceuticals   
4   ABMD      Abiomed     reports  Health Care     Health Care Equipment   

     Headquarters Location Date first added      CIK      Founded  
0    Saint Paul, Minnesota       1976-08-09    66740         1902  
1     Milwaukee, Wisconsin       2017-07-26    91142         1916  
2  North Chicago, Illinois       1964-03-31     1800         1888  
3  North Chicago, Illinois       2012-12-31  1551152  2013 (1888)  
4   Danvers, Massachusetts       2018-05-31   815094         1981  


This was super quick and easy! It is not guaranteed to work, depending on the HTML, but it never hurts to try :)

## 6. Optional

### 6.1 Scraping and the Law

[This](https://benbernardblog.com/web-scraping-and-crawling-are-perfectly-legal-right/) is an interesting article about the subject, bottom line being: when scraping web pages, don't use a very high request rate, so that the owners of the website don't get angry.
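
Keeping the request rate low can be as simple as pausing between requests. The sketch below shows the idea with a hypothetical `fetch_all` helper of our own; the `fetch` argument is whatever function downloads a page (for example a wrapper around `requests.get`), and here it is replaced by a stand-in so that the example runs offline. The delay value is an arbitrary choice, not a recommendation from the article.

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between consecutive calls.

    Hypothetical helper for illustration.
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # Pause before every request except the first one
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


# Stand-in fetch function, so no real requests are made
pages = fetch_all(
    ["page-1", "page-2"],
    fetch=lambda url: f"<html>{url}</html>",
    delay_seconds=0.1,
)
print(pages)
```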

### 6.2 Scraping and JavaScript

Sometimes, when scraping web pages, you'll need to navigate from one page to the other, click buttons, or take other actions that enter the JavaScript domain. In such cases, Beautiful Soup is not enough to meet your needs. If you find yourself in this position, take a look at [Selenium](https://www.selenium.dev/).

### 6.3 Website changes

One of the biggest difficulties regarding scraping is that if there are changes to the layout of the website you're trying to scrape, you will inevitably need to rewrite part (or all) of your scraping code. This is why, for learning purposes, we are scraping a website hosted by the LDSA. If you are feeling brave, try scraping the same information from the official IMDB movie page!
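
You can't prevent a website from changing, but you can make your code fail in a clear way when it does. In Beautiful Soup, `find` returns `None` when nothing matches, so a small check goes a long way. The sketch below uses a hypothetical `extract_character_name` helper and two hand-written fragments: one with the layout we scraped, and one with an imagined changed layout.

```python
from bs4 import BeautifulSoup


def extract_character_name(actor_html):
    """Return the character name, or None if the expected element is missing.

    Hypothetical helper for illustration.
    """
    soup = BeautifulSoup(actor_html, "html.parser")
    tag = soup.find("p", infotype="character-name")
    if tag is None:
        # The layout is not what we expected; let the caller decide what to do
        return None
    return tag.get_text(strip=True)


# Layout as scraped in this notebook
current_layout = '<div><p infotype="character-name">Jack Lambert</p></div>'

# Imagined layout after a redesign (made up for this example)
changed_layout = '<div><span class="role">Jack Lambert</span></div>'

print(extract_character_name(current_layout))
print(extract_character_name(changed_layout))
```

Without the check, calling `get_text` on the missing element would raise an `AttributeError`, which is much harder to trace back to a layout change.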