In [2]:
import pandas as pd
import numpy as np
import os


import plotly.express as px
import plotly.figure_factory as ff
pd.options.plotting.backend = 'plotly'

# Lecture 08 – Data on the internet

### Agenda

- Introduction to HTTP.
- Making HTTP requests.
- Data formats.
- APIs and scraping
- The anatomy of HTML documents.
- Parsing HTML using Beautiful Soup

## Introduction to HTTP



### Data sources

* Sometimes, the data you need doesn't exist in "clean" `.csv` files.

* **Solution:** Collect your own data!
    - Design and administer your own survey or run an experiment.
    - Find related data on the internet.

- The internet contains **massive** amounts of historical record; for most questions you can think of, the answer exists somewhere on the internet.

### Collecting data from the internet

- There are two ways to programmatically access data on the internet:
    - through an API.
    - by scraping.

### HTTP

- HTTP stands for **Hypertext Transfer Protocol**.
    - It was developed in 1989 by Tim Berners-Lee (and friends).

- It is a **request-response** protocol.
    - Protocol = set of rules.

- HTTP allows...
    - computers to talk to each other over a network.
    - devices to fetch data from "web servers".

- The "S" in HTTPS stands for "secure".

### ARPANET 

Advanced Research Projects Agency Network (Many of the protocols used by computer networks today were developed for ARPANET, and it is considered the forerunner of the modern internet.)

<center><img src='imgs/arpanet.png'></center>

### The request-response model

HTTP follows the **request-response** model.

<center><img src='imgs/req-response.png' width=600></center>

- A **request** is made by the **client**.

- A **response** is returned by the **server**.

- **Example:** Tiktok 🎥.
    - Your phone's mobile App, a **client**, makes an HTTP **request** to view a video.
    - The **server**, Tiktok, is a computer that is sitting somewhere else.
    - The server returns a **response** that contains the video.

### Request methods

The request methods you will use most often are `GET` and `POST`; see [Mozilla's web docs](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods) for a detailed list of request methods.    

- `GET` is used to request data **from** a specified resource.

- `POST` is used to **send** data to the server. 
    - e.g. uploading a video to Tiktok or entering credit card information on Amazon.

### Example `GET` request

Below is an example `GET` HTTP request made by a browser when accessing [bing.com](https://bing.com).

```HTTP
authority: assets.msn.cn
:method: GET
:path: /bundles/v1/bingHomepage/latest/card-actions-wc.544845754c7853b4c833.js
:scheme: https
accept: */*
accept-encoding: gzip, deflate, br
accept-language: zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7
if-none-match: 0x8DB1694BFE94BE8
origin: https://www.bing.com
referer: https://www.bing.com/
sec-ch-ua: "Chromium";v="110", "Not A(Brand";v="24", "Google Chrome";v="110"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "macOS"
sec-fetch-dest: script
sec-fetch-mode: cors
sec-fetch-site: cross-site
user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0
```

-  "request line"
-  "header fields". Header fields contain metadata. 

- We _could_ also provide a "body" after the header fields.

- To see HTTP requests in Google Chrome, follow [these steps](https://mkyong.com/computer-tips/how-to-view-http-headers-in-google-chrome/).

### Example `GET` response



### Consequences of the request-response model

- When a request is sent to view content on a webpage, the server must:
    - process your request (i.e. prepare data for the response).
    - send content back to the client in its response.

- Remember, servers are computers. 
    - Someone has to pay to keep these computers running.
    - **This means that every time you access a website, someone has to pay.**

## Making HTTP requests

### Making HTTP requests

We'll see two ways to make HTTP requests outside of a browser:

- From the command line, with `curl`.

- **From Python, with the `requests` package.**

### Making HTTP requests using `curl`

[`curl`](https://curl.haxx.se/docs/httpscripting.html) is a **command-line tool** that sends HTTP requests, like a browser.

1. The client, `curl`, sends a HTTP request. 
2. The request contains a method (e.g. `GET` or `POST`).
3. The HTTP server responds with:
    - a status line, indicating if things went well, 
    - response headers, and
    - (usually) a response body, containing the requested data.

### Example: `GET` requests via `curl`

- By default, `curl` issues a `GET` request.

```zsh
# `-v` is short for verbose
curl -v https://httpbin.org/html 
```

- Remember, you can run command-line commands in a Jupyter Notebook by placing a `!` before them. Let's try that here.

In [3]:
# Compare the output to what you see when you go to https://httpbin.org/html in your browser!
!curl -v https://httpbin.org/html

<!DOCTYPE html>
<html>
  <head>
  </head>
  <body>
      <h1>Herman Melville - Moby-Dick</h1>

      <div>
        <p>
          Availing himself of the mild, summer-cool weather that now reigned in these latitudes, and in preparation for the peculiarly active pursuits shortly to be anticipated, Perth, the begrimed, blistered old blacksmith, had not removed his portable forge to the hold again, after concluding his contributory work for Ahab's leg, but still retained it on deck, fast lashed to ringbolts by the foremast; being now almost incessantly invoked by the headsmen, and harpooneers, and bowsmen to do some little job for them; altering, or repairing, or new shaping their various weapons and boat furniture. Often he would be surrounded by an eager circle, all waiting to be served; holding boat-spades, pike-heads, harpoons, and lances, and jealously watching his every sooty movement, as he toiled. Nevertheless, this old man's was a patient hammer wielded by a patient arm. No murmur

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Host httpbin.org:443 was resolved.
* IPv6: (none)
* IPv4: 52.207.37.75, 35.173.225.247
*   Trying 52.207.37.75:443...
* Connected to httpbin.org (52.207.37.75) port 443
* schannel: disabled automatic use of client certificate
* ALPN: curl offers http/1.1
* ALPN: server accepted http/1.1
* using HTTP/1.x
> GET /html HTTP/1.1

> Host: httpbin.org

> User-Agent: curl/8.7.1

> Accept: */*

> 

* Request completely sent off
< HTTP/1.1 200 OK

< Date: Mon, 15 Jul 2024 07:38:49 GMT

< Content-Type: text/html; charset=utf-8

< Content-Length: 3741

< Connection: keep-alive

< Server: gunicorn/19.9.0

< Access-Control-Allow-Origin: *

< Access-Control-Allow-Credentials: true

< 

{ [3741 bytes data]

100  3741  100  3741    0     0   3805      0 --:--:-- --:--

### Queries in a `GET` request

- In order to request more specific information, we can include a **query string** in the URL. `?` begins a query.

<a href="https://www.bing.com/search?q=shanghai"><pre>
https://www.bing.com/search?q=shanghai
</pre></a>

- This method works well when sending small amounts of data; we will use a similiar technique when working with APIs (coming later!).

- Be on the lookout for query strings in URLs you share on social media!

In [4]:
!curl -v http://cn.bing.com/search?q=shanghai

<!DOCTYPE html><html dir="ltr" lang="zh" xml:lang="zh" xmlns="http://www.w3.org/1999/xhtml" xmlns:Web="http://schemas.live.com/Web/"><script type="text/javascript" >//<![CDATA[
si_ST=new Date
//]]></script><head><!--pc--><title>shanghai - 搜索</title><meta content="text/html; charset=utf-8" http-equiv="content-type" /><meta property="og:description" content="通过必应的智能搜索，可以更轻松地快速查找所需内容并获得奖励。" /><meta property="og:site_name" content="必应" /><meta property="og:title" content="shanghai - 必应" /><meta property="og:url" content="https://cn.bing.com/search?q=shanghai" /><meta property="fb:app_id" content="3732605936979161" /><meta property="og:image" content="http://www.bing.com/sa/simg/facebook_sharing_5.png" /><meta property="og:type" content="website" /><meta property="og:image:width" content="600" /><meta property="og:image:height" content="315" /><link href="/search?format=rss&amp;q=shanghai" data-orighref="" rel="alternate" title="XML" type="text/xml" /><link href="/search?format=rss&amp;q=sh

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Host cn.bing.com:80 was resolved.
* IPv6: (none)
* IPv4: 202.89.233.101, 202.89.233.100
*   Trying 202.89.233.101:80...
* Connected to cn.bing.com (202.89.233.101) port 80
> GET /search?q=shanghai HTTP/1.1

> Host: cn.bing.com

> User-Agent: curl/8.7.1

> Accept: */*

> 

* Request completely sent off
< HTTP/1.1 200 OK

< Cache-Control: private, max-age=0

< Transfer-Encoding: chunked

< Content-Type: text/html; charset=utf-8

< Expires: Mon, 15 Jul 2024 07:37:49 GMT

< P3P: CP="NON UNI COM NAV STA LOC CURa DEVa PSAa PSDa OUR IND"

< Set-Cookie: MUID=17625CD153A262560E39486C528C639E; domain=.bing.com; expires=Sat, 09-Aug-2025 07:38:49 GMT; path=/

< Set-Cookie: MUIDB=17625CD153A262560E39486C528C639E; expires=Sat, 09-Aug-2025 07:38:49 GMT; path=/; Http

var Identity = Identity || {}; (function(i) { i.wlImgSm ="https://storage.live.com/users/0x{0}/myprofile/expressionprofile/profilephoto:UserTileStatic/p?ck=1\u0026ex=720\u0026sid=11320FA80E73606031461B150F5D61B0\u0026fofoff=1"; i.wlImgLg ="https://storage.live.com/users/0x{0}/myprofile/expressionprofile/profilephoto:UserTileMedium/p?ck=1\u0026ex=720\u0026sid=11320FA80E73606031461B150F5D61B0\u0026fofoff=1";i.popupLoginUrls = {"WindowsLiveId":"https://login.live.com/login.srf?wa=wsignin1.0\u0026rpsnv=11\u0026ct=1721029130\u0026rver=6.0.5286.0\u0026wp=MBI_SSL\u0026wreply=https:%2F%2fcn.bing.com%2Fsecure%2FPassport.aspx%3Fpopup%3D1\u0026lc=2052\u0026id=264960"}; })(Identity);;
//]]>--></div><div style="display:none" "><!--//<![CDATA[
sj_evt.bind("onP1", function() { window["RewardsHeaderSVG"] && RewardsHeaderSVG.fireDefaultEvent(); }, 1, 0);;var bepcfg = bepcfg || {};;bepcfg.wb =true? '1' : '0';;
//]]>--></div><div style="display:none" "><!--//<![CDATA[
_w["EnableSappUpsellSERPCN"] =true; 

### Making HTTP requests using `requests`

- `requests` is a Python module that allows you to use Python to interact with the internet!  
- There are other packages that work similarly (e.g. `urllib`), but `requests` is arguably the easiest to use.

In [5]:
import requests

### Example: `GET` requests via `requests`

To access the source code of the Bing home page, all we need to run is the following:

```py
requests.get('https://bing.com').text
```

In [6]:
res = requests.get('https://bing.com')

`res` is now a `Response` object.

In [7]:
res

<Response [200]>

In [8]:
res.text



The `text` attribute of `res` is a string that containing the entire response.

In [9]:
type(res.text)

str

In [10]:
len(res.text)

9369

In [11]:
print(res.text[:1000])

<!doctype html><html lang="zh" dir="ltr"><head><meta name="theme-color" content="#4F4F4F" /><meta http-equiv="X-UA-Compatible" content="IE=edge" /><meta name="viewport" content="width=device-width, initial-scale=1.0" /><title>必应</title><link rel="preconnect" href="https://r.bing.com" /><link rel="preconnect" href="https://r.bing.com" crossorigin/><link rel="dns-prefetch" href="https://r.bing.com" /><link rel="dns-prefetch" href="https://r.bing.com" crossorigin/><link rel="stylesheet" href="/rp/RCJCS6O5TykkUhXX1pwc6RsdyuI.gz.css" type="text/css"/><script type="text/javascript">//<![CDATA[
var logJSText=function(n,t){t===void 0&&(t=null);(new Image).src=_G.lsUrl+'&Type=Event.ClientInst&DATA=[{"T":"CI.ClientInst","FID":"CI","Name":"'+escape(n)+(t?'","Text":"'+escape(t):"")+'"}]'},getHref=function(){return location.href};try{var ignErr=["ResizeObserver loop","Script error"],maxErr=3,ignoreCurrentError=function(n,t){return ignErr.some(function(t){return n.includes(t)})?(t!=null&&(typeof sj_

### Example: `POST` requests via `requests`

The following call to `requests.post` makes a post request to https://httpbin.org/post, with a `'name'` parameter of `'King Triton'`.

In [12]:
post_res = requests.post('https://httpbin.org/post',
                         data={'name': 'King Triton'})

post_res

<Response [200]>

In [13]:
post_res.text

'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {\n    "name": "King Triton"\n  }, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, deflate", \n    "Content-Length": "16", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "httpbin.org", \n    "User-Agent": "python-requests/2.27.1", \n    "X-Amzn-Trace-Id": "Root=1-6694d20c-72674ce902a62e4407d9c09e"\n  }, \n  "json": null, \n  "origin": "58.247.22.201", \n  "url": "https://httpbin.org/post"\n}\n'

In [14]:
# More on this shortly!
post_res.json()

{'args': {},
 'data': '',
 'files': {},
 'form': {'name': 'King Triton'},
 'headers': {'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate',
  'Content-Length': '16',
  'Content-Type': 'application/x-www-form-urlencoded',
  'Host': 'httpbin.org',
  'User-Agent': 'python-requests/2.27.1',
  'X-Amzn-Trace-Id': 'Root=1-6694d20c-72674ce902a62e4407d9c09e'},
 'json': None,
 'origin': '58.247.22.201',
 'url': 'https://httpbin.org/post'}

What happens when we try and make a `POST` request somewhere where we're unable to?

In [15]:
yt_res = requests.post('https://www.sjtu.edu.cn',
                       data={'name': 'King Triton'})

yt_res

<Response [405]>

`yt_res.text` is a string containing HTML – we can render this in-line using `IPython.display.HTML`.

In [16]:
from IPython.display import HTML

In [17]:
HTML(yt_res.text)

### HTTP status codes

- When we **request** data from a website, the server includes an **HTTP status code** in the response.  

* The most common status code is `200`, which means there were no issues.  

* Other times, you will see a different status code, describing some sort of event or error.
    - Common examples: `400` – bad request, `404` – page not found, `500` – internal server error.
    - [The first digit of a status describes its general "category".](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status)

- See [https://httpstat.us](https://httpstat.us/) for a list of all HTTP status codes.
    - It also has example sites for each status code; for example, https://httpstat.us/404 returns a `404`.

In [18]:
yt_res.status_code

405

In [19]:
r = requests.get('https://httpstat.us/200')
print(r.status_code)
print(r.text)

200
200 OK


### Successful requests ✅

- You can check if a request was successful using the `ok` attribute, which returns a bool.
    - If a status is in the 200s, then it is successful.

In [20]:
yt_res.status_code, yt_res.ok

(405, False)

In [21]:
post_res.status_code, post_res.ok

(200, True)

- Unsuccessful requests can be re-tried, depending on the issue.
    - Wait a little, then try the request again.
    - You can even re-try requests programmatically (e.g. using a loop). If rate of requests is too high, slow down requests between each retry (e.g. using `time.sleep`).

## Data formats

### The data formats of the internet

Responses typically come in one of two formats: HTML or JSON.

- The response body of a `GET` request is usually either JSON (when using an API) or HTML (when accessing a webpage).

- The response body of a `POST` request is usually JSON.

- XML is also a common format, but not as popular as it once was.

### JSON

- JSON stands for **JavaScript Object Notation**. It is a lightweight format for storing and transferring data.

- It is:
    - very easy for computers to read and write.
    - moderately easy for programmers to read and write by hand.
    - meant to be generated and parsed.

- Most modern languages have an interface for working with JSON objects.
    - JSON objects _resemble_ Python dictionaries (but are not the same!).

### JSON data types

| Type | Description |
| --- | --- |
| String | Anything inside double quotes. |
| Number | Any number (no difference between ints and floats). |
| Boolean | `true` and `false`. |
| Null | JSON's empty value, denoted by `null`. |
| Array | Like Python lists. |
| Object | A collection of key-value pairs, like dictionaries. Keys must be strings, values can be anything (even other objects). |

See [json-schema.org](https://json-schema.org/understanding-json-schema/reference/type.html) for more details.

### Example JSON object

See `data/family.json`.

<center><img src='imgs/hierarchy.png' width=50%></center>

In [22]:
import json

f = open(os.path.join('data', 'family.json'), 'r')
family_tree = json.load(f)

In [23]:
family_tree

{'name': 'Grandma',
 'age': 94,
 'children': [{'name': 'Dad',
   'age': 60,
   'children': [{'name': 'Me', 'age': 24}, {'name': 'Brother', 'age': 22}]},
  {'name': 'My Aunt',
   'children': [{'name': 'Cousin 1', 'age': 34},
    {'name': 'Cousin 2',
     'age': 36,
     'children': [{'name': 'Cousin 2 Jr.', 'age': 2}]}]}]}

In [24]:
family_tree['children'][0]['children'][0]['age']

24

In [25]:
f_other = open(os.path.join('data', 'family.json'))
s = f_other.read()
s

'{\n    "name": "Grandma",\n    "age": 94,\n    "children": [\n        {\n        "name": "Dad",\n        "age": 60,\n        "children": [{"name": "Me", "age": 24}, \n                     {"name": "Brother", "age": 22}]\n        },\n        {\n        "name": "My Aunt",\n        "children": [{"name": "Cousin 1", "age": 34}, \n                     {"name": "Cousin 2", "age": 36, "children": \n                        [{"name": "Cousin 2 Jr.", "age": 2}]\n                     }\n                    ]\n        }\n    ]\n}'

In [26]:
json.loads(s)

{'name': 'Grandma',
 'age': 94,
 'children': [{'name': 'Dad',
   'age': 60,
   'children': [{'name': 'Me', 'age': 24}, {'name': 'Brother', 'age': 22}]},
  {'name': 'My Aunt',
   'children': [{'name': 'Cousin 1', 'age': 34},
    {'name': 'Cousin 2',
     'age': 36,
     'children': [{'name': 'Cousin 2 Jr.', 'age': 2}]}]}]}

### Using the `json` module

Let's process the same file using the `json` module. Recall:
- `json.load(f)` loads a JSON file from a file object.
- `json.loads(f)` loads a JSON file from a **s**tring.

In [27]:
f_other = open(os.path.join('data', 'evil_family.json'))
s = f_other.read()
s

'{\n    "name": "Grandma",\n    "age": 94,\n    "children": [\n        {\n        "name": util.err(),\n        "age": 60,\n        "children": [{"name": "Me", "age": 24}, \n                     {"name": "Brother", "age": 22}]\n        },\n        {\n        "name": "My Aunt",\n        "children": [{"name": "Cousin 1", "age": 34}, \n                     {"name": "Cousin 2", "age": 36, "children": \n                        [{"name": "Cousin 2 Jr.", "age": 2}]\n                     }\n                    ]\n        }\n    ]\n}'

In [28]:
# json.loads(s)

- Since `util.err()` is not a string in JSON (there are no quotes around it), `json.loads` is not able to parse it as a JSON object.

- This "safety check" is intentional.

### Handling _unfamiliar_ data

- Never trust data from an unfamiliar site.

- **Never** use `eval` on "raw" data that you didn't create!

- The JSON data format needs to be **parsed**, not evaluated as a dictionary.
    - It was designed with safety in mind!

## APIs and scraping

### Programmatic requests

* We learned how to use the Python `requests` package to exchange data via HTTP.
    - `GET` requests are used to request data **from** a server.
    - `POST` requests are used to **send** data to a server.

* There are two ways of collecting data through a request:
    * By using a published API (application programming interface).
    * By scraping a webpage to collect its HTML source code.

### APIs

An API is a service that makes data directly available to the user in a convenient fashion.

Advantages:

- The data are usually clean, up-to-date, and ready to use.

- The presence of a API signals that the data provider is okay with you using their data.

- The data provider can plan and regulate data usage.
    - Some APIs require you to create an API "key", which is like an account for using the API.
    - APIs can also give you access to data that isn't publicly available on a webpage.

Disadvantages:
- APIs don't always exist for the data you want!

### API terminology

- A URL, or uniform resource locator, describes the location of a website or resource.

- An **API endpoint** is a URL of the data source that the user wants to make requests to.

- For example, on the [Reddit API](https://www.reddit.com/dev/api/):
    * the `/comments` endpoint retrieves information about comments.
    * the `/hot` endpoint retrieves data about posts labeled "hot" right now. 
    - To access these endpoints, you add the endpoint name to the base URL of the API.

### API requests

- API requests are just `GET`/`POST` requests to a specially maintained URL.
- Let's test out the [Pokémon API](https://pokeapi.co).

First, let's make a `GET` request for `'squirtle'`.

In [29]:
r = requests.get('https://pokeapi.co/api/v2/pokemon/squirtle')
r

<Response [200]>

Remember, the 200 status code is good! Let's take a look at the **content**:

In [30]:
r.content[:1000]

b'{"abilities":[{"ability":{"name":"torrent","url":"https://pokeapi.co/api/v2/ability/67/"},"is_hidden":false,"slot":1},{"ability":{"name":"rain-dish","url":"https://pokeapi.co/api/v2/ability/44/"},"is_hidden":true,"slot":3}],"base_experience":63,"cries":{"latest":"https://raw.githubusercontent.com/PokeAPI/cries/main/cries/pokemon/latest/7.ogg","legacy":"https://raw.githubusercontent.com/PokeAPI/cries/main/cries/pokemon/legacy/7.ogg"},"forms":[{"name":"squirtle","url":"https://pokeapi.co/api/v2/pokemon-form/7/"}],"game_indices":[{"game_index":177,"version":{"name":"red","url":"https://pokeapi.co/api/v2/version/1/"}},{"game_index":177,"version":{"name":"blue","url":"https://pokeapi.co/api/v2/version/2/"}},{"game_index":177,"version":{"name":"yellow","url":"https://pokeapi.co/api/v2/version/3/"}},{"game_index":7,"version":{"name":"gold","url":"https://pokeapi.co/api/v2/version/4/"}},{"game_index":7,"version":{"name":"silver","url":"https://pokeapi.co/api/v2/version/5/"}},{"game_index":7,

Looks like JSON. We can extract the JSON from this request with the `json` method (or by passing `r.text` to `json.loads`).

In [31]:
r.json()

{'abilities': [{'ability': {'name': 'torrent',
    'url': 'https://pokeapi.co/api/v2/ability/67/'},
   'is_hidden': False,
   'slot': 1},
  {'ability': {'name': 'rain-dish',
    'url': 'https://pokeapi.co/api/v2/ability/44/'},
   'is_hidden': True,
   'slot': 3}],
 'base_experience': 63,
 'cries': {'latest': 'https://raw.githubusercontent.com/PokeAPI/cries/main/cries/pokemon/latest/7.ogg',
  'legacy': 'https://raw.githubusercontent.com/PokeAPI/cries/main/cries/pokemon/legacy/7.ogg'},
 'forms': [{'name': 'squirtle',
   'url': 'https://pokeapi.co/api/v2/pokemon-form/7/'}],
 'game_indices': [{'game_index': 177,
   'version': {'name': 'red', 'url': 'https://pokeapi.co/api/v2/version/1/'}},
  {'game_index': 177,
   'version': {'name': 'blue', 'url': 'https://pokeapi.co/api/v2/version/2/'}},
  {'game_index': 177,
   'version': {'name': 'yellow',
    'url': 'https://pokeapi.co/api/v2/version/3/'}},
  {'game_index': 7,
   'version': {'name': 'gold', 'url': 'https://pokeapi.co/api/v2/version/4/

Let's try a `GET` request for `'billy'`.

In [32]:
r = requests.get('https://pokeapi.co/api/v2/pokemon/bil')
r

<Response [404]>

Uh oh...

### Scraping

Scraping is the act of programmatically "browsing" the web, downloading the source code (HTML) of pages that you're interested in extracting data from.

Advantages:

* You can always do it!
    - e.g. Google scrapes webpages in order to make them searchable.

Disadvantages:

- It is often difficult to parse and clean scraped data.
    - Source code often includes a lot of content unrelated to the data you're trying to find (e.g. formatting, advertisements, other text).

- Websites can change often, so scraping code can get outdated quickly.

- Websites may not want you to scrape their data!

- **In general, we prefer APIs.**

### Accessing HTML

**Goal**: Access information about JI research center.

Let's start by making a `GET` request to the JI research center page and see what the resulting HTML looks like. 

In [33]:
r = requests.get('https://www.ji.sjtu.edu.cn/research/research-center/')
r

<Response [200]>

In [34]:
research_text = r.text
len(research_text)

97576

In [35]:
print(research_text[:1000])

<!DOCTYPE html>
<html lang="en-US">
<head>
	<meta charset="UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge">
	<link rel="pingback" href="https://www.ji.sjtu.edu.cn/xmlrpc.php" />

	<script type="text/javascript">
		document.documentElement.className = 'js';
	</script>

	<title>Research Centers | UM-SJTU JI</title>
<meta name='robots' content='max-image-preview:large' />
<link rel="alternate" hreflang="en" href="https://www.ji.sjtu.edu.cn/research/research-center/" />
<link rel="alternate" hreflang="zh" href="https://www.ji.sjtu.edu.cn/cn/research-2/research-center/" />
<link rel='dns-prefetch' href='//s.w.org' />
<link rel="alternate" type="application/rss+xml" title="UM-SJTU JI &raquo; Feed" href="https://www.ji.sjtu.edu.cn/feed/" />
<link rel="alternate" type="application/rss+xml" title="UM-SJTU JI &raquo; Comments Feed" href="https://www.ji.sjtu.edu.cn/comments/feed/" />
<meta content="PNXTheme v.3.21.3" name="generator"/><link rel='stylesheet' id='formidable-css'  h


In [36]:
'COO' in research_text

True

Wow, that is gross looking! 😰 

- It is **raw** HTML, which web browsers use to display websites.
- The information we are looking for – research information – is in there somewhere, but we have to search for it and extract it, which we wouldn't have to do if we had an API.
- We'll now look at how HTML documents are structured and how to extract information from them.

### Best practices for scraping

1. **Send requests slowly** and be upfront about what you are doing!
2. Respect the policy published in the page's `robots.txt` file.
    - Many sites have a `robots.txt` file in their root directory, which contains a policy that allows or disallows automatic access to their site. 
    - See [here](https://moz.com/learn/seo/robotstxt) for more details.
3. Don't spoof your User-agent (i.e. don't try to trick the server into thinking you are a person).
4. Read the Terms of Service for the site and follow it.

### Consequences of irresponsible scraping

If you make too many requests:
* The server may block your IP Address.
* You may take down the website.
   

## The anatomy of HTML documents

### What is HTML?

* HTML (HyperText Markup Language) is **the** basic building block of the internet. 


* It defines the content and layout of a webpage, and as such, it is what you get back when you scrape a webpage.

* See [this tutorial](http://fab.academany.org/2018/labs/fablaboshanghai/students/bob-wu/Fabclass/week2_project_management/HTML.html) for more details.

For instance, here's the content of a very basic webpage.

In [37]:
!cat data/Lec8_ex.html

<html>
	<head>
		<title>Page title</title>
	</head>

	<body>
		<h1>This is a heading</h1>
		<p>This is a paragraph.</p>
		<p>This is <b>another</b> paragraph.</p>
	</body>
</html>


Using `IPython.display.HTML`, we can render it directly in our notebook.

In [38]:
from IPython.display import HTML
HTML(os.path.join('data', 'Lec8_ex.html'))

### The anatomy of HTML documents

* **HTML document**: The totality of markup that makes up a webpage.

* **Document Object Model (DOM)**: The internal representation of a HTML document as a hierarchical **tree** structure.

* **HTML element**: An object in the DOM, such as a paragraph, header, or title.
* **HTML tags**: Markers that denote the **start** and **end** of an element, such as `<p>` and `</p>`.

<center><img src='imgs/dom.jpg'></center>

<center><a href='https://simplesnippets.tech/what-is-document-object-modeldom-how-js-interacts-with-dom/'>(source)</a></center>

### Useful tags to know


|Element|Description|
|:---|:---|
|`<html>`|the document|
|`<head>`|the header|
|`<body>`|the body|
|`<div>` |a logical division of the document|
|`<span>`|an *inline* logical division|
|`<p>`|a paragraph|
| `<a>`| an anchor (hyperlink)|
|`<h1>, <h2>, ...`| header(s) |
|`<img>`| an image |

There are many, many more. See [this article](https://en.wikipedia.org/wiki/HTML_element) for examples.

### Example: images and hyperlinks

Tags can have **attributes**, which further specify how to display information on a webpage.

For instance, `<img>` tags have `src` and `alt` attributes (among others):

```html
<img src="king-selfie.png" alt="A photograph of King Triton." width=500>
```

Hyperlinks have `href` attributes: 

```html
Click <a href="https://bing.com">this link</a> to search your keywords.
```

What do you think this webpage looks like?

In [39]:
!cat data/Lec8.html

<!DOCTYPE html>
<html>
<head><meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">

<title>Lec 09</title><script src="https://cdnjs.cloudflare.com/ajax/libs/require.js/2.1.10/require.min.js"></script>




<style type="text/css">
    pre { line-height: 125%; }
td.linenos .normal { color: inherit; background-color: transparent; padding-left: 5px; padding-right: 5px; }
span.linenos { color: inherit; background-color: transparent; padding-left: 5px; padding-right: 5px; }
td.linenos .special { color: #000000; background-color: #ffffc0; padding-left: 5px; padding-right: 5px; }
span.linenos.special { color: #000000; background-color: #ffffc0; padding-left: 5px; padding-right: 5px; }
.highlight .hll { background-color: var(--jp-cell-editor-active-background) }
.highlight { background: var(--jp-cell-editor-background); color: var(--jp-mirror-editor-variable-color) }
.highlight .c { color: var(--jp-mirror-editor-comment-color); font-style: italic } /* Co

### The `<div>` tag

```html
<div style="background-color:lightblue">
  <h3>This is a heading</h3>
  <p>This is a paragraph.</p>
</div>
```

* The `<div>` tag defines a division or a "section" of an HTML document.
    * Think of a `<div>` as a "cell" in a Jupyter Notebook.

* The `<div>` element is often used as a container for other HTML elements to style them with CSS or to perform operations involving them using JavaScript.

* `<div>` elements often have attributes, **which are important when scraping**!

### Document trees

Under the document object model (DOM), HTML documents are trees. In DOM trees, child nodes are **ordered**.

<center>

<img src="imgs/webpage_anatomy.png" width="50%">

</center>    

What does the DOM tree look like for this document?

<center><img src="imgs/dom_tree.png" width="50%"></center>

### Example: Quote scraping

Consider the following webpage.

<center><img src="imgs/quotes2scrape.png" width=60%></center>

- What do you think the DOM tree looks like?
- If you had to store the data on this page in a DataFrame, what would the rows and columns represent?

<center><img src="imgs/quote_dom.png" width="50%"></center>

## Parsing HTML using Beautiful Soup

### Beautiful Soup 🍜

* [Beautiful Soup 4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python HTML parser.
    - To "parse" means to "extract meaning from a sequence of symbols".
* **Warning:** Beautiful Soup 4 and Beautiful Soup 3 work differently, so make sure you are using and looking at documentation for Beautiful Soup 4.

### Example HTML document

To start, we'll work with the source code for an HTML page with the DOM tree shown below:

<center><img src="imgs/dom_tree_1.png" width="50%"></center>

The string `html_string` contains an HTML "document".

In [67]:
html_string = '''
<html>
    <body>
      <div id="content">
        <h1>Heading here</h1>
        <p>My First paragraph</p>
        <p>My <em>second</em> paragraph</p>
        <hr>
      </div>
      <div id="nav">
        <ul>
          <li>item 1</li>
          <li>item 2</li>
          <li>item 3</li>
        </ul>
      </div>
    </body>
</html>
'''.strip()

In [68]:
HTML(html_string)

### `BeautifulSoup` objects

`bs4.BeautifulSoup` takes in a string or file-like object representing HTML (`markup`) and returns a **parsed** document.

In [69]:
import bs4

In [70]:
bs4.BeautifulSoup?

[1;31mInit signature:[0m
[0mbs4[0m[1;33m.[0m[0mBeautifulSoup[0m[1;33m([0m[1;33m
[0m    [0mmarkup[0m[1;33m=[0m[1;34m''[0m[1;33m,[0m[1;33m
[0m    [0mfeatures[0m[1;33m=[0m[1;32mNone[0m[1;33m,[0m[1;33m
[0m    [0mbuilder[0m[1;33m=[0m[1;32mNone[0m[1;33m,[0m[1;33m
[0m    [0mparse_only[0m[1;33m=[0m[1;32mNone[0m[1;33m,[0m[1;33m
[0m    [0mfrom_encoding[0m[1;33m=[0m[1;32mNone[0m[1;33m,[0m[1;33m
[0m    [0mexclude_encodings[0m[1;33m=[0m[1;32mNone[0m[1;33m,[0m[1;33m
[0m    [0melement_classes[0m[1;33m=[0m[1;32mNone[0m[1;33m,[0m[1;33m
[0m    [1;33m**[0m[0mkwargs[0m[1;33m,[0m[1;33m
[0m[1;33m)[0m[1;33m[0m[1;33m[0m[0m
[1;31mDocstring:[0m     
A data structure representing a parsed HTML or XML document.

Most of the methods you'll call on a BeautifulSoup object are inherited from
PageElement or Tag.

Internally, this class defines the basic interface called by the
tree builders when converting an HTML/XML do

Normally, we pass the result of a `GET` request to `bs4.BeautifulSoup`, but here we will pass our hand-crafted `html_string`.

In [71]:
soup = bs4.BeautifulSoup(html_string)
soup

<html>
<body>
<div id="content">
<h1>Heading here</h1>
<p>My First paragraph</p>
<p>My <em>second</em> paragraph</p>
<hr/>
</div>
<div id="nav">
<ul>
<li>item 1</li>
<li>item 2</li>
<li>item 3</li>
</ul>
</div>
</body>
</html>

In [72]:
type(soup)

bs4.BeautifulSoup

`BeautifulSoup` objects have several useful attributes, e.g. `text`:

In [73]:
print(soup.text)




Heading here
My First paragraph
My second paragraph




item 1
item 2
item 3






### Traversing through `descendants`

The `descendants` attribute traverses a `BeautifulSoup` tree using **depth-first traversal**.

Why depth-first? Elements closer to one another on a page are more likely to be related than elements further away.

<center><img src="imgs/dom_tree_1.png" width="60%"></center>

In [74]:
soup.descendants

<generator object descendants at 0x00000216DA2B30A0>

In [75]:
list(soup.descendants)

[<html>
 <body>
 <div id="content">
 <h1>Heading here</h1>
 <p>My First paragraph</p>
 <p>My <em>second</em> paragraph</p>
 <hr/>
 </div>
 <div id="nav">
 <ul>
 <li>item 1</li>
 <li>item 2</li>
 <li>item 3</li>
 </ul>
 </div>
 </body>
 </html>,
 '\n',
 <body>
 <div id="content">
 <h1>Heading here</h1>
 <p>My First paragraph</p>
 <p>My <em>second</em> paragraph</p>
 <hr/>
 </div>
 <div id="nav">
 <ul>
 <li>item 1</li>
 <li>item 2</li>
 <li>item 3</li>
 </ul>
 </div>
 </body>,
 '\n',
 <div id="content">
 <h1>Heading here</h1>
 <p>My First paragraph</p>
 <p>My <em>second</em> paragraph</p>
 <hr/>
 </div>,
 '\n',
 <h1>Heading here</h1>,
 'Heading here',
 '\n',
 <p>My First paragraph</p>,
 'My First paragraph',
 '\n',
 <p>My <em>second</em> paragraph</p>,
 'My ',
 <em>second</em>,
 'second',
 ' paragraph',
 '\n',
 <hr/>,
 '\n',
 '\n',
 <div id="nav">
 <ul>
 <li>item 1</li>
 <li>item 2</li>
 <li>item 3</li>
 </ul>
 </div>,
 '\n',
 <ul>
 <li>item 1</li>
 <li>item 2</li>
 <li>item 3</li>
 </ul

In [76]:
for child in soup.descendants:
    # print(child) # What would happen if we ran this instead?
    if isinstance(child, str):
        continue
    print(child.name)
    
    
# Traverse, BFS or DFS？


html
body
div
h1
p
p
em
hr
div
ul
li
li
li


### Finding elements in a tree

Practically speaking, you will not use the `descendants` attribute (or the related `children` attribute) directly very often. Instead, you will use the following methods:

- `soup.find(tag)`, which finds the **first** instance of a tag (the first one on the page, i.e. the first one that DFS sees).
    - More general: `soup.find(name=None, attrs={}, recursive=True, text=None, **kwargs)`.
- `soup.find_all(tag)` will find **all** instances of a tag.


### Using `find`

Let's try and extract the first `<div>` subtree.

<center><img src="imgs/dom_tree_1.png" width="60%"></center>  

In [79]:
soup

<html>
<body>
<div id="content">
<h1>Heading here</h1>
<p>My First paragraph</p>
<p>My <em>second</em> paragraph</p>
<hr/>
</div>
<div id="nav">
<ul>
<li>item 1</li>
<li>item 2</li>
<li>item 3</li>
</ul>
</div>
</body>
</html>

In [80]:
div = soup.find('div')
div

<div id="content">
<h1>Heading here</h1>
<p>My First paragraph</p>
<p>My <em>second</em> paragraph</p>
<hr/>
</div>

<center><img src="imgs/dom_subtree_1.png" width="30%"></center>  

Let's try and find the `<div>` element that has an `id` attribute equal to `'nav'`.

In [81]:
soup.find('div', attrs={'id': 'nav'})

<div id="nav">
<ul>
<li>item 1</li>
<li>item 2</li>
<li>item 3</li>
</ul>
</div>

`find` will return the first occurrence of a tag, regardless of its depth in the tree.

In [82]:
soup.find('ul')

<ul>
<li>item 1</li>
<li>item 2</li>
<li>item 3</li>
</ul>

In [83]:
soup.find('li')

<li>item 1</li>

### Node attributes
* The `text` attribute of a tag element gets the text between the opening and closing tags.
* The `attrs` attribute lists all attributes of a tag.
* The `get(key)` method gets the value of a tag attribute.

In [84]:
soup.find('p')

<p>My First paragraph</p>

In [85]:
soup.find('p').text

'My First paragraph'

In [86]:
soup.find('div')

<div id="content">
<h1>Heading here</h1>
<p>My First paragraph</p>
<p>My <em>second</em> paragraph</p>
<hr/>
</div>

In [87]:
soup.find('div').attrs

{'id': 'content'}

In [88]:
soup.find('div').get('id')

'content'

The `get` method must be called directly on the node that contains the attribute you're looking for.

In [89]:
soup

<html>
<body>
<div id="content">
<h1>Heading here</h1>
<p>My First paragraph</p>
<p>My <em>second</em> paragraph</p>
<hr/>
</div>
<div id="nav">
<ul>
<li>item 1</li>
<li>item 2</li>
<li>item 3</li>
</ul>
</div>
</body>
</html>

In [90]:
# While there are multiple 'id' attributes, none of them are in the <html> tag at the top.
soup.get('id')

In [91]:
soup.find('div').get('id')

'content'

### Using `find_all`

`find_all` returns a list of all matches.

In [92]:
len(soup.find_all('div'))

2

In [93]:
soup.find_all('li')[0]

<li>item 1</li>

In [95]:
soup.find_all('li')

[<li>item 1</li>, <li>item 2</li>, <li>item 3</li>]

In [94]:
[x.text for x in soup.find_all('li')]

['item 1', 'item 2', 'item 3']

### Summary

- HTTP is the protocol the internet uses for transferring information.
- Clients can make `GET` HTTP requests to ask for information and `POST` HTTP requests to send information.
- Servers send responses with the desired information.
- We can use `curl` in the command-line or the `requests` Python module to make HTTP requests.
- The two main file formats used for storing information on the internet are HTML and JSON.
    - JSON objects resemble Python dictionaries, but they are not quite the same. 
    - Use the `.json()` method of a response object or the `json` package to parse them, **not** `eval`.
- APIs allow us to request information from web servers in a convenient fashion.
- When APIs don't exist, we instead scrape webpages to access their source HTML and then parse the HTML to extract the information we care about.
- Under the document object model (DOM), HTML documents are trees.
    - Elements are defined by tags.
- Beautiful Soup is an HTML parser that allows us to (somewhat) easily extract information from HTML documents.
    - `soup.find` and `soup.find_all` are the functions you will use most often.

### Next time

- Using HTTP to make API requests and scrape the web. 
- Parsing HTML files.