In [1]:
library(tidyverse)
library(rvest)


# Lecture 11: Web scraping

<div style="border: 1px double black; padding: 10px; margin: 10px">

**After today's lecture you will:**
* Understand how to import data from online sources by scraping web pages.
</div>

These notes correspond to Chapter 26 of your book.


## Ethics of scraping data online
You should carefully read [Section 26.2](https://r4ds.hadley.nz/webscraping.html#scraping-ethics-and-legalities) of the book concerning various ethical and legal issues surrounding scraping web sites for data. In this class we will only look at large, public web sites like Wikipedia and IMDB, where there is no risk of anything bad happening. However, there are other situations where it may be unethical, or even illegal, to harvest data from a website, even if you are technically able. **As data scientists in the real world, it will be up to you to carefully weigh these concerns before using the tools discussed in today's lecture.**

## Reading data from the Internet
These days, it's increasingly common to pull data from online sources. For example, say I wanted to know the population of European countries. This is [easily found](https://en.wikipedia.org/wiki/Demographics_of_Europe#Population_by_country) on Wikipedia. How can I get these data into R and analyze them?

## How do web pages work?

Web pages are written in a special language called HTML (**H**yper**t**ext **M**arkup **L**anguage). Here is a simple example of some HTML:

    <html>
    <head> 
      <title>Page title</title>
    </head>
    <body>
      <h1 id='first'>A heading</h1>
      <p>Some text &amp; <b>some bold text.</b></p>
      <img src='myimg.png' width='100' height='100'>
    </body>
    </html>

Web scraping is possible because most web pages have a consistent, hierarchical structure. For example, if I asked you how to navigate to the title of the web page shown above, you would follow the "path"

    html > head > title
    
to arrive at "Page title".
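
As a minimal sketch of following such a path in R, the snippet below parses a trimmed copy of the example page from a string (the `<img>` tag is left out) and uses CSS selectors to walk to the title and the heading. The variable names `page`, `page_title` and `heading` are my own choices for illustration.

```r
library(rvest)

# Parse a literal HTML string instead of a URL, so no network access is needed
page <- read_html("
<html>
<head>
  <title>Page title</title>
</head>
<body>
  <h1 id='first'>A heading</h1>
  <p>Some text &amp; <b>some bold text.</b></p>
</body>
</html>")

# 'head > title' means: a <title> that is a direct child of <head>
page_title <- page %>% html_element("head > title") %>% html_text2()

# 'h1#first' means: an <h1> whose id attribute is 'first'
heading <- page %>% html_element("h1#first") %>% html_text2()

page_title
heading
```

The same selector syntax should carry over to real pages; only the path changes.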

## HTML elements

There are a lot of HTML elements that might contain interesting information. Here are a few of the most common:
- Block tags that denote sections of text: `<h1>` (heading), `<p>` (paragraph), `<ul>`/`<ol>` (un)ordered list, etc.
- `<table>` (a table), `<tr>` (a table row), `<td>` (a table cell), etc.
- Each of these elements can contain attributes such as `id=` or `class=`. For example, `<table id="movies">` is probably a table that contains movie information.
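
To make the tags and attributes above concrete, here is a small sketch on a made-up page. The page content, the `note` class and the movie names are all invented for illustration; only the `movies` id comes from the example above.

```r
library(rvest)

# A tiny hypothetical page with a paragraph and a table
page <- minimal_html("
  <h1>Movies</h1>
  <p class='note'>Ratings are made up.</p>
  <table id='movies'>
    <tr><th>title</th><th>rating</th></tr>
    <tr><td>Movie A</td><td>8.5</td></tr>
    <tr><td>Movie B</td><td>7.9</td></tr>
  </table>
")

# Select by class: a <p> with class='note'
note <- page %>% html_element("p.note") %>% html_text2()

# Select by id, then turn the <table> into a data frame
movies <- page %>% html_element("table#movies") %>% html_table()

# Read an attribute of an element
table_id <- page %>% html_element("table") %>% html_attr("id")

movies
```

Here `html_element()` returns the first match, while `html_elements()` returns every match.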

The `rvest` package is used to load a web page and extract elements and tables based on their HTML tags. Let's see how it works by scraping the Wikipedia page mentioned earlier:

In [4]:
europop <- read_html("http://en.wikipedia.org/wiki/Demographics_of_Europe#Population_by_country")

In this page there are many tables:

In [18]:
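wiki_tables <- europop %>% html_elements("table.wikitable") %>% html_table()
length(wiki_tables)  # how many tables did we find?
# the original indexed the list by position; matching on a column name is less fragile
wiki_tables %>% detect(~ "Averagepopulation" %in% names(.x)) %>%
  select(1:2) %>% slice(-1) %>%  # slice(-1) drops the repeated header row
  mutate(Year = as.integer(Year), avg_pop = parse_number(Averagepopulation))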


How can we find the correct one? One option is to use our browser to find something that uniquely identifies the table that we want. Alternatively, since there are only about 17, we can just look at each table until we find the one we want:

In [None]:
# find the table that contains the population for each country
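
One way to look through many tables quickly is to compare their column names. The sketch below is only an illustration on a hand-made list of two small data frames (the list `tables` stands in for the output of `html_table()`, and the values are copied from rows shown in these notes); it is not necessarily how you would do it on the live page.

```r
library(purrr)

# Stand-in for a list of scraped tables
tables <- list(
  data.frame(Year = c("1000", "1500"), Population = c("40", "78")),
  data.frame(Country = c("Albania", "Andorra"), Density = c("99.0", "169.0"))
)

# Look at the header of every table
header_list <- map(tables, names)

# Flag the tables that have a column called 'Density'
has_density <- map_lgl(tables, ~ "Density" %in% names(.x))

# Pull out the matching table by its position in the list
density_table <- tables[[which(has_density)]]
density_table
```

Matching on a column name tends to survive page edits better than a hard-coded position such as `tables[[2]]`.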

## 🤔 Quiz

What's the average population density ($\text{persons}/\text{km}^2$) for countries in Europe?

<ol style="list-style-type: upper-alpha;">
    <li>1234.5</li>
    <li>20000.0</li>
    <li>611.8</li>
    <li>6520.5</li>
    <li>101.1</li>
</ol>



In [23]:
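# avg pop density
wiki_tables <- europop %>% html_elements("table.wikitable") %>% html_table()
wiki_tables  # print every table so we can spot the one with a density column
wiki_tables %>% detect(~ any(str_detect(names(.x), "Density"))) %>%
  select(dens = 4) %>% mutate(dens = parse_number(dens)) %>%
  summarise(mean(dens, na.rm = TRUE))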


Year,Population(% of world total)
<chr>,<chr>
CE 1,34 (15%)
1000,40 (15%)
1500,78 (18%)
1600,112 (20%)
1700,127 (21%)
1820,224 (21%)
1913,498 (28%)
2000,742 (13%)

Country/region,1,1000,1500,1600,1700,1820,1870,1913,1950,1973,1998[6],2020
<chr>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
Austria,500.0,700.0,2000.0,2500.0,2500.0,3369.0,4520.0,6767.0,6935.0,7586.0,8078.0,8901.0
Belgium,300.0,400.0,1400.0,1600.0,2000.0,3424.0,5096.0,7666.0,8640.0,9738.0,10197.0,11493.0
Denmark,180.0,360.0,600.0,650.0,700.0,1155.0,1888.0,2983.0,4269.0,5022.0,5303.0,5823.0
Finland,20.0,40.0,300.0,400.0,400.0,1169.0,1754.0,3027.0,4009.0,4666.0,5153.0,5536.0
France,5000.0,6500.0,15000.0,18500.0,21471.0,31246.0,38440.0,41463.0,41836.0,52118.0,58805.0,67287.0
Germany,3000.0,3500.0,12000.0,16000.0,15000.0,24905.0,39231.0,65058.0,68371.0,78956.0,82029.0,83191.0
Italy,7000.0,5000.0,10500.0,13100.0,13300.0,20176.0,27888.0,37248.0,47105.0,54751.0,57592.0,59258.0
the Netherlands,200.0,300.0,950.0,1500.0,1900.0,2355.0,3615.0,6164.0,10114.0,13438.0,15700.0,17425.0
Norway,100.0,200.0,300.0,400.0,500.0,970.0,1735.0,2447.0,3265.0,3961.0,4432.0,5368.0
Sweden,200.0,400.0,550.0,760.0,1260.0,2585.0,4164.0,5621.0,7015.0,8137.0,8851.0,10379.0

Country/region,1,1000,1500,1600,1700,1820,1870,1913,1950,1973,1998,2018
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
Austria,0.2,0.3,0.5,0.4,0.4,0.3,0.4,0.4,0.3,0.2,0.1,
Belgium,0.1,0.1,0.3,0.3,0.3,0.3,0.4,0.4,0.3,0.2,0.2,
Denmark,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.2,0.2,0.1,0.1,
Finland,0.0,0.0,0.1,0.1,0.1,0.1,0.1,0.2,0.2,0.1,0.1,
France,2.2,2.4,3.4,3.3,3.6,3.0,3.0,2.3,1.7,1.3,1.0,
Germany,1.3,1.3,2.7,2.9,2.5,2.4,3.1,3.6,2.7,2.0,1.4,
Italy,3.0,1.9,2.4,2.4,2.2,1.9,2.2,2.1,1.9,1.4,1.0,
Netherlands,0.1,0.1,0.2,0.3,0.3,0.2,0.3,0.3,0.4,0.3,0.3,
Norway,0.0,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,
Sweden,0.1,0.1,0.1,0.1,0.2,0.2,0.3,0.3,0.3,0.2,0.1,

Year,Averagepopulation,Live births,Deaths,Natural change,Crude rates (per 1000),Crude rates (per 1000),Crude rates (per 1000),Total fertility rate,Life expectancy
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>.1,<chr>.2,<chr>,<chr>
Year,Averagepopulation,Live births,Deaths,Natural change,Births,Deaths,Natural change,Total fertility rate,Life expectancy
1950,549721718,12202220,6473233,5728987,22.2,11.8,10.4,2.70,62.8
1951,554559502,12112425,6609794,5502631,21.8,11.9,9.9,2.66,62.8
1952,559609904,12142368,6265135,5877233,21.7,11.2,10.5,2.66,64.0
1953,565058633,12120826,6220937,5899889,21.5,11.0,10.4,2.64,64.7
1954,570670994,12151779,6072645,6079134,21.3,10.6,10.7,2.64,65.5
1955,576304974,12134270,5987151,6147119,21.1,10.4,10.7,2.63,66.0
1956,581975516,12133583,5899594,6233989,20.8,10.1,10.7,2.62,66.9
1957,587711635,12194100,5963269,6230831,20.7,10.1,10.6,2.62,66.9
1958,593669297,12177600,5647571,6530029,20.5,9.5,11.0,2.60,68.2

Country (or territory),Population[1][2],Area.mw-parser-output .nobold{font-weight:normal}(km2)[14],Density(per km2),Capital
<chr>,<chr>,<chr>,<chr>,<chr>
Albania *,2854710,28748.0,99.0,Tirana
Andorra *,79034,468.0,169.0,Andorra la Vella
Armenia *,2790974,29743.0,94.0,Yerevan
Austria *,8922082,83871.0,106.0,Vienna
Azerbaijan *,10312992,86600.0,119.0,Baku
Belarus *,9578167,207600.0,46.0,Minsk
Belgium *,11611419,30528.0,380.0,Brussels
Bosnia and Herzegovina *,3270943,51209.0,64.0,Sarajevo
Bulgaria *,6520314,110900.0,59.0,Sofia
Croatia *,4060135,56594.0,72.0,Zagreb



## 🤔 Quiz

Use the same Wikipedia page (Demographics of Europe) to answer the following question:

On average, how many people were born *each day* in Europe between 2010 and 2021 (inclusive)?

<ol style="list-style-type: upper-alpha;">
    <li>90210.10</li>
    <li>23043.97</li>
    <li>7710127</li>
    <li>21123.64</li>
    <li>21109.18</li>
</ol>



In [24]:
# average births per day

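wiki_tables <- europop %>% html_elements("table.wikitable") %>% html_table()
wiki_tables %>% detect(~ "Live births" %in% names(.x)) %>%
  select(1:3) %>% slice(-1) %>% mutate(across(everything(), parse_number))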


In [None]:
# number of days in 2010--2021
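
A small sketch of the day count, using base R dates. Subtracting the first day of 2010 from the first day of 2022 covers every day from 2010 through 2021; the name `n_days` is my own.

```r
# Days from 2010-01-01 up to and including 2021-12-31
n_days <- as.integer(as.Date("2022-01-01") - as.Date("2010-01-01"))
n_days
```

This agrees with counting by hand: 12 years of 365 days plus 3 leap days (2012, 2016, 2020) gives 4383.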

## The Simpsons

The Simpsons is a popular and long-running TV show. How many people still watch the Simpsons? What is their most popular episode?

In [25]:
simpsons <- read_html('https://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes_(season_21–present)')

In [28]:
# parse simpsons
simpsons %>% html_elements("table.wikiepisodetable") %>% .[[1]] %>% html_table()


## 🤔 Quiz

The episode with the largest number of viewers was **Once Upon a Time in Springfield**. Which episode of the Simpsons had the **smallest** number of viewers?


<ol style="list-style-type: upper-alpha;">
    <li>My Octopus and a Teacher</li>
    <li>Treehouse of Horror XXI</li>
    <li>Marge the Meanie</li>
    <li>The D'oh-cial Network</li>
    <li>The Devil Wears Nada</li>
</ol>



In [None]:
# smallest number of viewers
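
A possible sketch of the last step, on invented data. The episode titles, viewer figures and column names below are all hypothetical; the real table's column names will differ and should be checked first. The point is that viewer counts scraped from Wikipedia may carry footnote markers, so they are parsed before ranking.

```r
library(dplyr)
library(readr)

# Hypothetical episode table; viewers are text with footnote markers
episodes <- tibble(
  Title   = c("Hypothetical Episode A", "Hypothetical Episode B", "Hypothetical Episode C"),
  viewers = c("5.93[1]", "1.02[2]", "3.50[3]")
)

least_watched <- episodes %>%
  mutate(viewers = parse_number(viewers)) %>%  # "5.93[1]" becomes 5.93
  slice_min(viewers, n = 1)                    # keep the row with the fewest viewers

least_watched
```

Swapping `slice_min()` for `slice_max()` would give the most-watched episode instead.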

## IMDB top movies

Let's consider a well-known table: the [top 250 movies on IMDB](https://www.imdb.com/chart/top/).

In [None]:
imdb.250 <- read_html("https://www.imdb.com/chart/top/")

In [29]:
# parse imdb
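# imdbf is assumed to be a data frame parsed from imdb.250 with numeric `year` and `rating` columns
imdbf %>% arrange(year) %>%
  # top_rating was left undefined in the original; cummax() (best rating so far) is an assumption
  mutate(top_rating = cummax(rating), king = rating == top_rating) %>%
  filter(king) %>% mutate(delta = lead(year) - year)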



## Exercise

"The Kid" came out in 1921 and has a rating of 8.2. Another movie that was rated at least as high didn't come out until 1927 (Metropolis), so we could say that The Kid reigned as the #1 film for six years. Metropolis reigned for four years until City Lights (rating 8.4) came out.

Which film reigned for the longest amount of time?

In [None]:
# longest reign
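
One way to think about "reigns" is with a running maximum. The sketch below is an illustration on a tiny hand-made table, not the real IMDB data: the years and the ratings of The Kid and City Lights come from the text above, while Metropolis's rating of 8.3 and the extra film are made-up values.

```r
library(dplyr)

films <- tibble(
  title  = c("The Kid", "Hypothetical Film X", "Metropolis", "City Lights"),
  year   = c(1921, 1925, 1927, 1931),
  rating = c(8.2, 8.0, 8.3, 8.4)  # 8.0 and 8.3 are made-up values
)

reigns <- films %>%
  arrange(year) %>%
  mutate(
    top_rating = cummax(rating),      # best rating seen up to this film
    king       = rating == top_rating # TRUE if this film matches or beats all earlier ones
  ) %>%
  filter(king) %>%
  mutate(delta = lead(year) - year)   # years until the next 'king' appears

reigns
```

On this toy table The Kid reigns for 6 years and Metropolis for 4, matching the description above; the last film has no successor, so its `delta` is `NA`.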

## Super Bowl TV ratings
We just had the Super Bowl. How have the TV ratings for the Super Bowl changed over the years?

In [None]:
sbtv <- read_html('https://en.wikipedia.org/wiki/Super_Bowl_television_ratings') %>% html_elements('table') %>% .[[1]] %>% html_table

In [None]:
# viewers over time

How does this compare with other major sports?

- https://en.wikipedia.org/wiki/World_Series_television_ratings
- https://en.wikipedia.org/wiki/NBA_Finals_television_ratings

In [None]:
# super bowl vs world series

## Scraping other types of web data

Here are some examples of other types of web data we can scrape:

### The UofM Stats department
Let's say I wanted to make a table of all the [undergraduate stats courses](https://lsa.umich.edu/stats/undergraduate-students/statistics-courses.html) offered by the department. 

In [None]:
stats <- read_html('https://lsa.umich.edu/stats/undergraduate-students/statistics-courses.html')

How should we extract the data from this web page? We notice from inspecting the page that each course title is a `<b>` (bold) element:

In [None]:
# extract statistics courses
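
A minimal sketch of pulling out bold elements, using a made-up fragment rather than the real department page. The course names and descriptions below are placeholders, and the real page's markup may be more complicated than this.

```r
library(rvest)

# Hypothetical fragment: each course title sits inside a <b> element
page <- minimal_html("
  <p><b>STATS 000: Placeholder Course One</b> A made-up description.</p>
  <p><b>STATS 001: Placeholder Course Two</b> Another made-up description.</p>
")

# Grab every <b> element and keep just its text
course_titles <- page %>% html_elements("b") %>% html_text2()
course_titles
```

From here, a function such as `tidyr::separate()` could split each title into a course number and a name.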

### Reddit
Let's see how to scrape the [UofM Reddit site](https://old.reddit.com/r/uofm):

In [None]:
top.reddit <- read_html('https://old.reddit.com/r/uofm/top/?sort=top&t=all')

Let's plot the top scoring posts, when they were posted, and how many votes they have received.

In [None]:
# top posts on r/uofm