# Web Scraping

## Hierarchical Data

![a tree of TV shows](data/tree.png)

- Hierarchical data can be represented using JSON or XML.
- JSON maps closely onto Python dictionaries and lists.
    - You can use basic Python to extract the information you want.
    - pandas provides functions like `pd.json_normalize` to "flatten" JSON into tabular data.
- XML is a different beast.
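
As a small illustration of both points, the sketch below parses a hand-written JSON string, pulls one value out with plain indexing, and then flattens the records with `pd.json_normalize`. The shows, networks, and field names are invented for this example and are not taken from the figure.

```python
import json

import pandas as pd

# Hypothetical JSON text; the shows and field names are made up.
json_text = """
[
  {"title": "Show A", "seasons": 3, "network": {"name": "Network X", "country": "US"}},
  {"title": "Show B", "seasons": 5, "network": {"name": "Network Y", "country": "UK"}}
]
"""

# Parsed JSON is ordinary Python lists and dictionaries.
shows = json.loads(json_text)
first_network = shows[0]["network"]["name"]

# json_normalize turns nested dictionaries into dotted column names
# such as "network.name".
df_shows = pd.json_normalize(shows)
print(df_shows)
```

Each show becomes one row, and each nested field becomes its own column.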

## XML

- Fields are represented by named *tags*.
- Each tag has an open `<tag>` and a close `</tag>`.
- Children are represented by nested tags.
- Repeated fields are represented by repeated tags.
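
These rules can be seen in a short sketch that uses `xml.etree.ElementTree` from the Python standard library. The XML document and its tag names are hypothetical, written only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document, written for this example.
xml_text = """
<show>
    <title>Example Show</title>
    <season>
        <number>1</number>
        <episode>Pilot</episode>
        <episode>Second Episode</episode>
    </season>
</show>
""".strip()

root = ET.fromstring(xml_text)

# A field is a named tag; .text is the content between its open and close tags.
title = root.find("title").text

# Children are nested tags, reached by a path from the parent.
season_number = root.find("season/number").text

# A repeated field is the same tag appearing more than once.
episodes = [ep.text for ep in root.findall("season/episode")]
print(title, season_number, episodes)
```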

## HyperText Markup Language (HTML)

- HTML is the standard language for describing the layout of webpages.
- It is like XML, with special tags for hyperlinks, tables, images, etc.
- You don't need to be an HTML expert to scrape webpages, but you do need to know a few basics.

## Hyperlinks

The `<a>` tag indicates a (hyper)link.

- The `href=` attribute contains the URL.
- The displayed text sits between the opening `<a>` and closing `</a>` tags.

In [None]:
Example: 
Web Scraping<br/>
    <a href="lectures/lecture18.pdf">
         slides
    </a> |
    <a href="https://colab.research.google.com/drive/1neQvH5uqoX1j74rgCbperbi-HV3uLd8N?usp=sharing">
        colab
    </a>
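
The same HTML can also be read programmatically. A minimal sketch, assuming the `bs4` package (BeautifulSoup, the library used later in these notes) is installed; the variable names `html_links`, `soup_links`, and `link_info` are chosen for this example.

```python
from bs4 import BeautifulSoup

# The example HTML from above, stored as a string.
html_links = """
Web Scraping<br/>
<a href="lectures/lecture18.pdf">
    slides
</a> |
<a href="https://colab.research.google.com/drive/1neQvH5uqoX1j74rgCbperbi-HV3uLd8N?usp=sharing">
    colab
</a>
"""

soup_links = BeautifulSoup(html_links, "html.parser")

# Pair each link's displayed text with the URL stored in its href attribute.
link_info = [
    (a.get_text(strip=True), a["href"])
    for a in soup_links.find_all("a")
]
for text, href in link_info:
    print(text, "->", href)
```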

## Tables

The `<table>` tag indicates a table.

- The `<tr>` tag indicates a row.
- The `<th>` (header) and `<td>` (data) tags each indicate a cell within a row.

In [None]:
<table>
    <tr>
        <th>Rank</th>
        <th>Player</th>
        <th>Saves</th>
    </tr>
    <tr>
        <td>1</td>
        <td>Mariano Rivera</td>
        <td>652</td>
    </tr>
    <tr>
        <td>2</td>
        <td>Trevor Hoffman</td>
        <td>601</td>
    </tr>
</table>

Result:

|Rank|Player        |Saves|
|----|--------------|-----|
|1   |Mariano Rivera|652  |
|2   |Trevor Hoffman|601  |
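
To connect the tags to the rendered result, here is a hedged sketch that walks the same table with BeautifulSoup (assuming `bs4` is installed) and builds one dictionary per data row. The variable names are chosen for this example.

```python
from bs4 import BeautifulSoup

table_html = """
<table>
    <tr><th>Rank</th><th>Player</th><th>Saves</th></tr>
    <tr><td>1</td><td>Mariano Rivera</td><td>652</td></tr>
    <tr><td>2</td><td>Trevor Hoffman</td><td>601</td></tr>
</table>
"""

soup_table = BeautifulSoup(table_html, "html.parser")
all_rows = soup_table.find_all("tr")

# The first row holds the <th> header cells.
header = [th.get_text(strip=True) for th in all_rows[0].find_all("th")]

# Every later row holds <td> data cells, matched to the header by position.
records = [
    dict(zip(header, [td.get_text(strip=True) for td in tr.find_all("td")]))
    for tr in all_rows[1:]
]
print(records)
```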

## Web Scraping

We will scrape the hockey statistics from this website: https://www.scrapethissite.com/pages/forms/

In [2]:
import requests
from bs4 import BeautifulSoup

url = "https://www.scrapethissite.com"
response = requests.get(url + "/pages/forms")
soup = BeautifulSoup(response.text, "html.parser")

Now let's find the main table on this page.

In [4]:
tables = soup.find_all("table")

In [5]:
tables[0]

<table class="table">
<tr>
<th>
                            Team Name
                        </th>
<th>
                            Year
                        </th>
<th>
                            Wins
                        </th>
<th>
                            Losses
                        </th>
<th>
                            OT Losses
                        </th>
<th>
                            Win %
                        </th>
<th>
                            Goals For (GF)
                        </th>
<th>
                            Goals Against (GA)
                        </th>
<th>
                            + / -
                        </th>
</tr>
<tr class="team">
<td class="name">
                            Boston Bruins
                        </td>
<td class="year">
                            1990
                        </td>
<td class="wins">
                            44
                        </td>
<td class="losses">
                            2

In [6]:
table = soup.find("table", attrs={"class": "table"})
rows = table.find_all("tr", attrs={"class": "team"})

In [7]:
rows[0]

<tr class="team">
<td class="name">
                            Boston Bruins
                        </td>
<td class="year">
                            1990
                        </td>
<td class="wins">
                            44
                        </td>
<td class="losses">
                            24
                        </td>
<td class="ot-losses">
</td>
<td class="pct text-success">
                            0.55
                        </td>
<td class="gf">
                            299
                        </td>
<td class="ga">
                            264
                        </td>
<td class="diff text-success">
                            35
                        </td>
</tr>

Let's extract the info from the cells of this table.

In [8]:
data_hockey = []
for row in rows:
    # get info from row
    row_info = {}
    cells = row.find_all("td")
    for cell in cells:
        field = cell.attrs["class"][0]
        row_info[field] = cell.string.strip()
    data_hockey.append(row_info)

We can convert this information into a `DataFrame`.

In [10]:
import pandas as pd
pd.DataFrame(data_hockey)

| |name|year|wins|losses|ot-losses|pct|gf|ga|diff|
|---|---|---|---|---|---|---|---|---|---|
|0|Boston Bruins|1990|44|24||0.55|299|264|35|
|1|Buffalo Sabres|1990|31|30||0.388|292|278|14|
|2|Calgary Flames|1990|46|26||0.575|344|263|81|
|3|Chicago Blackhawks|1990|49|23||0.613|284|211|73|
|4|Detroit Red Wings|1990|34|38||0.425|273|298|-25|
|5|Edmonton Oilers|1990|37|37||0.463|272|272|0|
|6|Hartford Whalers|1990|31|38||0.388|238|276|-38|
|7|Los Angeles Kings|1990|46|24||0.575|340|254|86|
|8|Minnesota North Stars|1990|27|39||0.338|256|266|-10|
|9|Montreal Canadiens|1990|39|30||0.487|273|249|24|
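
Every value scraped this way is a string, including the numbers and the empty `ot-losses` cells. A minimal sketch of one way to convert the numeric columns, using two rows copied from the output above; the names `sample_rows`, `df_sample`, and `numeric_cols` are chosen for this example.

```python
import pandas as pd

# Two rows copied from the scraped output; every value is a string.
sample_rows = [
    {"name": "Boston Bruins", "year": "1990", "wins": "44", "losses": "24",
     "ot-losses": "", "pct": "0.55", "gf": "299", "ga": "264", "diff": "35"},
    {"name": "Detroit Red Wings", "year": "1990", "wins": "34", "losses": "38",
     "ot-losses": "", "pct": "0.425", "gf": "273", "ga": "298", "diff": "-25"},
]
df_sample = pd.DataFrame(sample_rows)

# Convert every column except the team name.
# errors="coerce" makes values that cannot be parsed, such as "", become NaN.
numeric_cols = [col for col in df_sample.columns if col != "name"]
df_sample[numeric_cols] = df_sample[numeric_cols].apply(
    pd.to_numeric, errors="coerce"
)
print(df_sample.dtypes)
```

After the conversion, arithmetic and sorting behave numerically rather than alphabetically.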


But this is only the first page of data, and there are many more pages. How do we scrape all of the data?

We can switch to different pages by modifying the `page_num` parameter in the URL.
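
A minimal sketch of this first approach, building the page URLs with `urllib.parse.urlencode` from the standard library. The helper `page_url` and the value of `n_pages` are assumptions for illustration; the real number of pages has to be read off the site.

```python
from urllib.parse import urlencode

base_url = "https://www.scrapethissite.com/pages/forms/"


def page_url(page_num):
    """Build the URL for one page of results (hypothetical helper)."""
    return base_url + "?" + urlencode({"page_num": page_num})


# n_pages is an assumed value for illustration only.
n_pages = 3
page_urls = [page_url(i) for i in range(1, n_pages + 1)]
for u in page_urls:
    print(u)
```

`requests.get` also accepts a `params` dictionary, so `requests.get(base_url, params={"page_num": 2})` should request the same page without building the query string by hand.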

Alternatively, we can just grab the links at the bottom of the page.

In [11]:
pagination = soup.find("ul", attrs={"class": "pagination"})
links = pagination.find_all("a")

Let's take a look at the links found.

In [12]:
for link in links:
    print(link.attrs["href"])

/pages/forms/?page_num=1
/pages/forms/?page_num=2
/pages/forms/?page_num=3
/pages/forms/?page_num=4
/pages/forms/?page_num=5
/pages/forms/?page_num=6
/pages/forms/?page_num=7
/pages/forms/?page_num=8
/pages/forms/?page_num=9
/pages/forms/?page_num=10
/pages/forms/?page_num=11
/pages/forms/?page_num=12
/pages/forms/?page_num=13
/pages/forms/?page_num=14
/pages/forms/?page_num=15
/pages/forms/?page_num=16
/pages/forms/?page_num=17
/pages/forms/?page_num=18
/pages/forms/?page_num=19
/pages/forms/?page_num=20
/pages/forms/?page_num=21
/pages/forms/?page_num=22
/pages/forms/?page_num=23
/pages/forms/?page_num=24
/pages/forms/?page_num=1
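
These `href` values are relative paths, so they must be combined with the site's address before they can be requested, and the same link can appear more than once. A hedged sketch using `urllib.parse.urljoin` from the standard library; the short `hrefs` list stands in for the full output above.

```python
from urllib.parse import urljoin

base = "https://www.scrapethissite.com"

# A shortened stand-in for the links printed above, with one repeat.
hrefs = [
    "/pages/forms/?page_num=1",
    "/pages/forms/?page_num=2",
    "/pages/forms/?page_num=1",
]

# dict.fromkeys keeps the first occurrence of each href, in order.
unique_hrefs = list(dict.fromkeys(hrefs))

# urljoin resolves each relative path against the base address.
full_urls = [urljoin(base, href) for href in unique_hrefs]
for u in full_urls:
    print(u)
```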


In [13]:
import time

data_hockey = []
for link in links:
    # skip previous and next buttons
    if "aria-label" in link.attrs:
        continue

    # visit each link
    response = requests.get(url + link.attrs["href"])
    soup = BeautifulSoup(response.text, "html.parser")

    # scrape the table on each page
    table = soup.find("table", attrs={"class": "table"})
    rows = table.find_all("tr")
    for row in rows:
        # skip rows that don't represent teams
        if not ("class" in row.attrs and "team" in row.attrs["class"]):
            continue

        # get info from row
        row_info = {}
        cells = row.find_all("td")
        for cell in cells:
            field = cell.attrs["class"][0]
            row_info[field] = cell.string.strip()
        data_hockey.append(row_info)

    # stagger the requests to avoid spamming the server
    time.sleep(1)
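
One possible refactoring, offered as a sketch rather than as part of the original notebook, is to move the per-page parsing into a function that takes HTML text. It can then be tried on a small string without any network requests. The function name `parse_team_rows` and the miniature `sample_html` page are hypothetical. The sketch uses `get_text(strip=True)` in place of `cell.string.strip()`, because `.string` is `None` when a cell contains more than one child node.

```python
from bs4 import BeautifulSoup


def parse_team_rows(html):
    """Return one dict per <tr class="team"> row in the given HTML text."""
    page = BeautifulSoup(html, "html.parser")
    records = []
    for row in page.find_all("tr", attrs={"class": "team"}):
        # Key each cell by its first CSS class, e.g. "name" or "wins".
        records.append({
            cell.attrs["class"][0]: cell.get_text(strip=True)
            for cell in row.find_all("td")
        })
    return records


# Hypothetical miniature page for trying the function offline.
sample_html = """
<table class="table">
<tr><th>Team Name</th><th>Wins</th><th>OT Losses</th></tr>
<tr class="team">
<td class="name"> Boston Bruins </td>
<td class="wins"> 44 </td>
<td class="ot-losses">
</td>
</tr>
</table>
"""
parsed = parse_team_rows(sample_html)
print(parsed)
```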

Now let's see if we got all the data!

In [14]:
pd.DataFrame(data_hockey)

| |name|year|wins|losses|ot-losses|pct|gf|ga|diff|
|---|---|---|---|---|---|---|---|---|---|
|0|Boston Bruins|1990|44|24||0.55|299|264|35|
|1|Buffalo Sabres|1990|31|30||0.388|292|278|14|
|2|Calgary Flames|1990|46|26||0.575|344|263|81|
|3|Chicago Blackhawks|1990|49|23||0.613|284|211|73|
|4|Detroit Red Wings|1990|34|38||0.425|273|298|-25|
|...|...|...|...|...|...|...|...|...|...|
|577|Tampa Bay Lightning|2011|38|36|8|0.463|235|281|-46|
|578|Toronto Maple Leafs|2011|35|37|10|0.427|231|264|-33|
|579|Vancouver Canucks|2011|51|22|9|0.622|249|198|51|
|580|Washington Capitals|2011|42|32|8|0.512|222|230|-8|
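
One way to sanity-check a multi-page scrape is to look for repeated rows and to count rows per season. The sketch below does this on a small hypothetical stand-in for `data_hockey` that deliberately contains one duplicate; the names `sample`, `df_check`, and `df_clean` are chosen for this example.

```python
import pandas as pd

# Hypothetical stand-in for data_hockey, with one repeated row.
sample = [
    {"name": "Boston Bruins", "year": "1990"},
    {"name": "Buffalo Sabres", "year": "1990"},
    {"name": "Boston Bruins", "year": "1990"},
]
df_check = pd.DataFrame(sample)

# A team should appear at most once per season.
n_duplicates = int(df_check.duplicated(subset=["name", "year"]).sum())

# Drop the repeats and count how many teams remain in each season.
df_clean = df_check.drop_duplicates(subset=["name", "year"]).reset_index(drop=True)
rows_per_year = df_clean.groupby("year").size()
print(n_duplicates)
print(rows_per_year)
```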


## Ethical Considerations

- Serving a webpage costs the website owner a small amount each time you visit it.
- This is usually offset by advertising.
- But when you do web scraping:
    - it is easy to rack up many webpage visits,
    - and you don't see any ads to offset this cost.

## `robots.txt`

- Most websites have a `robots.txt` file in the site's root directory that indicates which bots are allowed to scrape and which pages they can scrape.
- Here are a few examples:
    - http://www.espn.com/robots.txt
    - http://www.nytimes.com/robots.txt
- However, `robots.txt` is informational only. It doesn’t *prevent* bots from scraping a webpage.
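
Because nothing enforces these rules, it is up to the scraper to check them. A minimal sketch using `urllib.robotparser` from the standard library; the rules in `robots_lines`, the bot name `my-scraper`, and the `example.com` URLs are hypothetical and do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written for this example.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# can_fetch reports whether the named bot may request the given URL.
allowed = parser.can_fetch("my-scraper", "https://example.com/pages/forms/")
blocked = parser.can_fetch("my-scraper", "https://example.com/private/data")
print(allowed, blocked)
```

For a live site, calling `parser.set_url(...)` with the address of its `robots.txt` followed by `parser.read()` downloads and parses the real file instead of a hand-written list.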

## Preventing Web Scraping

Some websites take more drastic measures to prevent web scraping...