## Terminology
### Web API
Web application programming interfaces (APIs) are a way of interacting with external computer systems via the internet. These APIs essentially enable the calling of functions on other computers that can return data or perform operations on the external computer system (such as registering a user, making a move in a game of chess etc.).

### HTTP
HTTP is the protocol underlying the most common type of API on the internet. As a protocol, all HTTP does is establish a convention for sending messages between computers. Four types of request message (called 'methods') cover most API use:

- **GET**: GET-ing a URL returns data or a webpage. When you load a website by typing a URL, your browser sends a GET request in the background to the URL of the web application and the server decides what to return.
- **POST**: POST-ing to a URL sends new data to the web application.
- **PUT**: PUT-ing to a URL updates the state associated with the web application.
- **DELETE**: DELETE-ing a URL removes data associated with the web application.

Since HTTP is a protocol, it is ultimately up to the developers of the web application what actually happens when you send an HTTP request.
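As a rough illustration of the four request types, the sketch below builds (but never sends) one request of each kind using Python's standard library. The URL `https://example.com/api/users` and the JSON payloads are hypothetical and serve only to show where the method, the URL and the data go.

```python
from urllib.request import Request

# Hypothetical endpoint, used for illustration only; nothing is sent over the network.
BASE = "https://example.com/api/users"

reqs = [
    Request(BASE, method="GET"),                                     # read data
    Request(BASE, data=b'{"name": "Ada"}', method="POST"),           # send new data
    Request(BASE + "/1", data=b'{"name": "Ada L."}', method="PUT"),  # update existing state
    Request(BASE + "/1", method="DELETE"),                           # remove data
]

for r in reqs:
    # Each request is just a message: a method, a URL and (optionally) a body.
    print(r.get_method(), r.full_url, r.data)
```

What the server does with each of these messages is still decided entirely by the web application.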

### Endpoint
A URL that a web application uses to receive API requests is called an 'endpoint'.

### Root URL
A dedicated URL that handles all requests for a specific web application is called the 'root' URL. Developers can specify more granular endpoints by appending to the path of the 'root' URL.

For example, in `https://api.tfl.gov.uk/AirQuality`, `https://api.tfl.gov.uk` is the root URL and we request the `AirQuality` endpoint.
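The split between root URL and endpoint can be made explicit with the standard library. This is a minimal sketch that treats everything after the host as the endpoint path; the variable names are chosen here for illustration.

```python
from urllib.parse import urlsplit

url = "https://api.tfl.gov.uk/AirQuality"

parts = urlsplit(url)
root_url = f"{parts.scheme}://{parts.netloc}"  # scheme plus host
endpoint = parts.path.lstrip("/")              # path appended to the root URL

print(root_url)
print(endpoint)

# Going the other way: append an endpoint path to the root URL.
rebuilt = f"{root_url}/{endpoint}"
print(rebuilt == url)
```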

### JSON
JSON is a standard text format for encoding data to be transmitted between computers in requests and responses to APIs. 
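To make this concrete, here is a small sketch using Python's built-in `json` module. The record is made up for illustration (echoing the chess example above); the point is that a Python object becomes plain text that can be sent in a request or response and decoded again on the other side.

```python
import json

# Made-up record, for illustration only.
record = {"user": "ada", "active": True, "score": None, "moves": ["e4", "e5"]}

encoded = json.dumps(record)    # Python object -> JSON text
print(encoded)
print(type(encoded).__name__)   # JSON is transmitted as a plain string

decoded = json.loads(encoded)   # JSON text -> Python object
print(decoded == record)
```

Note how Python's `True` and `None` are written as `true` and `null` in the JSON text.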

## Some API examples in Python

`requests` is a common and simple library for working with HTTP APIs. Check out the documentation at https://requests.readthedocs.io/en/latest/ for specific details. Below we send a GET request to www.google.com, the same way a browser does when accessing the URL.

In [3]:
# uncomment the following line to install the library into your Jupyter kernel if you have not done so already
#! pip install requests

import requests

res = requests.get("https://www.google.com")
res

<Response [200]>

The response from the server contains the data and metadata associated with the request. Here we explore `status_code` and `text`. `status_code` indicates whether the server successfully interpreted and handled our request. See https://developer.mozilla.org/en-US/docs/Web/HTTP/Status for details on the status codes returned.

In [5]:
res.status_code

200
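Status codes are grouped by their first digit. The helper below is a sketch (the function name `describe_status` is our own) that uses the standard library's `http.HTTPStatus` to look up the standard phrase for a code and report which group it falls into. It only handles codes known to `HTTPStatus`; an unknown code raises `ValueError`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the code, its standard phrase and the broad outcome it signals."""
    phrase = HTTPStatus(code).phrase  # raises ValueError for unknown codes
    if 200 <= code < 300:
        outcome = "success"
    elif 300 <= code < 400:
        outcome = "redirection"
    elif 400 <= code < 500:
        outcome = "client error"
    elif 500 <= code < 600:
        outcome = "server error"
    else:
        outcome = "informational"
    return f"{code} {phrase}: {outcome}"


for code in (200, 404, 500):
    print(describe_status(code))
```

With a `requests` response, the same check would be applied to `res.status_code`.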

The data in text form can be accessed by `text`.

In [9]:
res.text

'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en-GB"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="Ugg4WtTh3UPzMoJL-3lvrg">(function(){var _g={kEI:\'uQ0RZ8ORLZqB9u8Pus_1yQ0\',kEXPI:\'0,1303875,2396450,114,945,448529,90132,2872,2891,73050,16105,18161,162437,23024,6700,96770,29549,8155,23351,22435,9779,62658,31491,44717,15816,1804,7734,27535,2413,9400,1632,29279,21780,5303,5212676,996,26,113,8832397,1222,11,1,42,7439818,20539939,16673,43886,3,1603,3,2124363,23029351,8163,4636,16436,95587,11081,15164,8181,17876,45673,6968,581,6756,155,1,1,2482,13504,7736,9138,4600,328,3216,5,1238,1766,1117,1830,17654,4863,8160,687,7850,22,2761,180,13256,5785,970,371,8822,4829,57,360,1852,2,9,11640,1768,2381,2462,3296,7767,348,1557,6853,1539,4176,797,8677,1,8192,7114,3066,111,376,4144,5390,1458,5,29,1514,251,597,1801,91

This works as long as the encoding (check `res.encoding`) is correct. Otherwise, we can set the encoding and try again, or look at the data in binary form with `res.content`. This is a byte string, displayed with a leading `b`, and its contents (for example a `charset` declaration) may indicate the encoding. Here it holds essentially the same characters as above.

In [10]:
res.content

b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en-GB"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="Ugg4WtTh3UPzMoJL-3lvrg">(function(){var _g={kEI:\'uQ0RZ8ORLZqB9u8Pus_1yQ0\',kEXPI:\'0,1303875,2396450,114,945,448529,90132,2872,2891,73050,16105,18161,162437,23024,6700,96770,29549,8155,23351,22435,9779,62658,31491,44717,15816,1804,7734,27535,2413,9400,1632,29279,21780,5303,5212676,996,26,113,8832397,1222,11,1,42,7439818,20539939,16673,43886,3,1603,3,2124363,23029351,8163,4636,16436,95587,11081,15164,8181,17876,45673,6968,581,6756,155,1,1,2482,13504,7736,9138,4600,328,3216,5,1238,1766,1117,1830,17654,4863,8160,687,7850,22,2761,180,13256,5785,970,371,8822,4829,57,360,1852,2,9,11640,1768,2381,2462,3296,7767,348,1557,6853,1539,4176,797,8677,1,8192,7114,3066,111,376,4144,5390,1458,5,29,1514,251,597,1801,914
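Why the encoding matters can be seen without any network access. The sketch below uses a made-up one-word payload: the same bytes give the right text when decoded as UTF-8 and garbled text when decoded with the wrong encoding. `res.content` and `res.text` relate to each other in the same way, with `res.encoding` deciding how the bytes are decoded.

```python
# Made-up payload: the bytes a server might send for the word "café" in UTF-8.
raw = "café".encode("utf-8")
print(raw)                     # a byte string, shown with a leading b

right = raw.decode("utf-8")    # decoding with the correct encoding
wrong = raw.decode("latin-1")  # decoding with a wrong encoding garbles the text

print(right)
print(wrong)
```

With `requests`, the equivalent fix is to assign the correct value to `res.encoding` before reading `res.text` again.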

Those among you with some HTML experience will notice that `res.text` is just HTML describing a webpage. HTML is designed to format and mark up documents so that they are easily read by humans. As data scientists, we instead wish to process the returned data using computers, so the HTML markup is unnecessary. Therefore, web applications typically offer API endpoints that return JSON data.

As an example, Transport for London (TfL) offers such an endpoint for information on air quality.

In [11]:
res = requests.get("https://api.tfl.gov.uk/AirQuality")
res.text

'{"$id":"1","$type":"Tfl.Api.Presentation.Entities.LondonAirForecast, Tfl.Api.Presentation.Entities","updatePeriod":"hourly","updateFrequency":"1","forecastURL":"http://londonair.org.uk/forecast","disclaimerText":"This forecast is intended to provide information on expected pollution levels in areas of significant public exposure. It may not apply in very specific locations close to unusually strong or short-lived local sources of pollution.","currentForecast":[{"$id":"2","$type":"Tfl.Api.Presentation.Entities.CurrentForecast, Tfl.Api.Presentation.Entities","forecastType":"Current","forecastID":"46707","forecastBand":"Low","forecastSummary":"Low air pollution forecast valid from Thursday 17 October to end of Friday 18 October GMT","nO2Band":"Low","o3Band":"Low","pM10Band":"Low","pM25Band":"Low","sO2Band":"Low","forecastText":"Rain will clear overnight for Thursday morning, still with odd shower through the day. Foggy start on Friday, then mainly dry and sunny for the rest of the day. C

As we saw in the previous week, JSON can be translated to Python objects such as dictionaries and lists.

In [17]:
res_dict = res.json()
print(res_dict)
print(type(res_dict))

{'$id': '1', '$type': 'Tfl.Api.Presentation.Entities.LondonAirForecast, Tfl.Api.Presentation.Entities', 'updatePeriod': 'hourly', 'updateFrequency': '1', 'forecastURL': 'http://londonair.org.uk/forecast', 'disclaimerText': 'This forecast is intended to provide information on expected pollution levels in areas of significant public exposure. It may not apply in very specific locations close to unusually strong or short-lived local sources of pollution.', 'currentForecast': [{'$id': '2', '$type': 'Tfl.Api.Presentation.Entities.CurrentForecast, Tfl.Api.Presentation.Entities', 'forecastType': 'Current', 'forecastID': '46707', 'forecastBand': 'Low', 'forecastSummary': 'Low air pollution forecast valid from Thursday 17 October to end of Friday 18 October GMT', 'nO2Band': 'Low', 'o3Band': 'Low', 'pM10Band': 'Low', 'pM25Band': 'Low', 'sO2Band': 'Low', 'forecastText': 'Rain will clear overnight for Thursday morning, still with odd shower through the day. Foggy start on Friday, then mainly dry a
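Once parsed, the response is navigated like any nested Python structure. The sketch below works offline on a shortened excerpt of the response shown above (only a few of its fields are kept) and pulls out the current forecast band.

```python
import json

# Shortened excerpt of the TfL response shown above; most fields are omitted.
body = (
    '{"updatePeriod":"hourly",'
    '"currentForecast":[{"forecastType":"Current","forecastBand":"Low","nO2Band":"Low"}]}'
)

data = json.loads(body)               # the outer JSON object becomes a dict
forecasts = data["currentForecast"]   # the JSON array becomes a list
current = forecasts[0]                # each array element is again a dict

print(type(data).__name__, type(forecasts).__name__)
print(current["forecastType"], current["forecastBand"])
```

With a live response, `res.json()` takes the place of `json.loads(body)`.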

It is possible to use specific parameters in a request to filter the data. There are two common conventions to supply these arguments:

1. Via the URL path. In the TfL example, the second part of the URL path can be used to specify which year of accident statistics to retrieve (see https://api.tfl.gov.uk/swagger/ui/index.html?url=/swagger/docs/v1#!/AccidentStats/AccidentStats_Get).

In [19]:
res = requests.get("https://api.tfl.gov.uk/AccidentStats/2019")
res.json()

[{'$type': 'Tfl.Api.Presentation.Entities.AccidentStats.AccidentDetail, Tfl.Api.Presentation.Entities',
  'id': 345979,
  'lat': 51.570865,
  'lon': -0.231959,
  'location': 'On Edgware Road Near The Junction With north Circular Road',
  'date': '2019-01-04T21:22:00Z',
  'severity': 'Slight',
  'borough': 'Barnet',
  'casualties': [{'$type': 'Tfl.Api.Presentation.Entities.AccidentStats.Casualty, Tfl.Api.Presentation.Entities',
    'age': 20,
    'class': 'Driver',
    'severity': 'Slight',
    'mode': 'PoweredTwoWheeler',
    'ageBand': 'Adult'}],
  'vehicles': [{'$type': 'Tfl.Api.Presentation.Entities.AccidentStats.Vehicle, Tfl.Api.Presentation.Entities',
    'type': 'Motorcycle_500cc_Plus'},
   {'$type': 'Tfl.Api.Presentation.Entities.AccidentStats.Vehicle, Tfl.Api.Presentation.Entities',
    'type': 'Car'}]},
 {'$type': 'Tfl.Api.Presentation.Entities.AccidentStats.AccidentDetail, Tfl.Api.Presentation.Entities',
  'id': 345980,
  'lat': 51.603859,
  'lon': -0.18724,
  'location': 'On

2. Via URL 'arguments' (also known as query parameters), by appending `key=value` pairs to the URL after a `?` and separating multiple pairs with `&`. For example, below we use an API to retrieve US data, specifying two arguments. We use `drilldowns=Nation` to indicate the granularity at which we retrieve information. Try changing this argument value to `State`. Similarly, we use `measures=Population` to select what is measured. Check https://datausa.io/about/api/ for more details.

In [32]:
res = requests.get("https://datausa.io/api/data?drilldowns=Nation&measures=Population")
res_json = res.json()
res_json

{'data': [{'ID Nation': '01000US',
   'Nation': 'United States',
   'ID Year': 2022,
   'Year': '2022',
   'Population': 331097593,
   'Slug Nation': 'united-states'},
  {'ID Nation': '01000US',
   'Nation': 'United States',
   'ID Year': 2021,
   'Year': '2021',
   'Population': 329725481,
   'Slug Nation': 'united-states'},
  {'ID Nation': '01000US',
   'Nation': 'United States',
   'ID Year': 2020,
   'Year': '2020',
   'Population': 326569308,
   'Slug Nation': 'united-states'},
  {'ID Nation': '01000US',
   'Nation': 'United States',
   'ID Year': 2019,
   'Year': '2019',
   'Population': 324697795,
   'Slug Nation': 'united-states'},
  {'ID Nation': '01000US',
   'Nation': 'United States',
   'ID Year': 2018,
   'Year': '2018',
   'Population': 322903030,
   'Slug Nation': 'united-states'},
  {'ID Nation': '01000US',
   'Nation': 'United States',
   'ID Year': 2017,
   'Year': '2017',
   'Population': 321004407,
   'Slug Nation': 'united-states'},
  {'ID Nation': '01000US',
   'N

URL arguments can be difficult to read due to the long verbose format, but at least the order of the additional arguments does not matter.

In [33]:
res = requests.get("https://datausa.io/api/data?measures=Population&drilldowns=Nation")
res_json2 = res.json()
res_json == res_json2

True

As using arguments in URLs is such a common operation, there are helper functions in the package `requests` to simplify calling endpoints with arguments.

In [34]:
params = {
    "measures" : "Population",
    "drilldowns" : "Nation"
}
# pass the arguments as a dictionary and let requests build the URL
res = requests.get("https://datausa.io/api/data", params=params)
res_json3 = res.json()
res_json == res_json3

True
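To see what such a helper does with the dictionary, the sketch below builds the same kind of URL by hand with the standard library's `urlencode`, and then parses two differently ordered URLs to show that they carry the same arguments. This only illustrates the URL convention; it is not how `requests` is implemented, and whether a given server treats argument order as irrelevant is up to that server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base = "https://datausa.io/api/data"
params = {"drilldowns": "Nation", "measures": "Population"}

# Join key=value pairs with & and append them after a ?.
url = f"{base}?{urlencode(params)}"
print(url)

# The same arguments written in the opposite order.
reordered = f"{base}?measures=Population&drilldowns=Nation"

# Parsing both query strings yields the same key/value mapping.
same = parse_qs(urlsplit(url).query) == parse_qs(urlsplit(reordered).query)
print(same)
```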

## Web scraping

Sometimes we want to extract information from web pages that is not explicitly provided in a JSON API. In that case, we have to parse and process the HTML directly. When this is done in an automated way, the process is called 'web scraping'. For more info see the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) Python library, which has many efficient helper methods for parsing HTML to extract information of interest.
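As a rough idea of what parsing HTML involves, the sketch below extracts the page title and all link targets from a small, made-up HTML snippet. It deliberately uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra installs; the class name `LinkAndTitleParser` is our own, and BeautifulSoup would achieve the same with far less code.

```python
from html.parser import HTMLParser


class LinkAndTitleParser(HTMLParser):
    """Collect the <title> text and the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text is only kept while we are inside the <title> element.
        if self._in_title:
            self.title += data


# Made-up page, for illustration only.
page = (
    "<html><head><title>Example page</title></head><body>"
    '<a href="/about">About</a> <a href="https://example.com/data">Data</a>'
    "</body></html>"
)

parser = LinkAndTitleParser()
parser.feed(page)
print(parser.title)
print(parser.links)
```

In practice, `page` would be the `text` of a response obtained with `requests.get`.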

When scraping information from web pages, be considerate of the server you are requesting from, as it is designed for human interaction. Try to limit the rate of your requests to avoid overloading the server, otherwise you might get blocked for a while.
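One simple way to limit the request rate is to pause between consecutive requests. The sketch below is one possible approach, not a standard recipe: `fetch_politely` is a hypothetical helper, and the `fetch` and `sleep` functions are passed in so that the example runs offline with stand-ins. The example URLs are placeholders too.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results


# Offline demonstration: a fake fetch function, and a fake sleep that
# records the requested pauses instead of actually waiting.
pauses = []
pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    fetch=lambda url: f"<html>{url}</html>",
    delay_seconds=0.5,
    sleep=pauses.append,
)
print(len(pages), pauses)
```

For real scraping, `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`, with the default `time.sleep` left in place.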