# Python: API scraping

**Goal**: Collect data from an API and put it to use!

## Introduction to APIs and GET requests

### What's an API?

In computer science, **API** stands for Application Programming Interface. An API is a software interface that lets applications communicate with each other and exchange services or data.

### Requests to APIs

The **requests** module is a widely used Python library for making HTTP requests to APIs.

In [1]:
# Example
import requests

#### The GET request

The **get** function of the **requests** module sends a **GET** request, the HTTP method used to **retrieve information** from an API.

Let's make a **request** to **get** the **latest position** of the **ISS** (International Space Station) from the **OpenNotify API**: **http://api.open-notify.org/iss-now.json**.

In [2]:
response = requests.get("http://api.open-notify.org/iss-now.json")

## Status codes

In [3]:
response

<Response [200]>

Responses carry status codes that tell us whether the request succeeded or failed. Each kind of failure has its own code. Here are some useful codes with their meanings:

* **200**: Everything is normal and the server returned the requested result

* **301**: The server redirects to another URL (the resource has moved permanently)

* **400**: Bad request (for example, a required parameter is missing or malformed)

* **401**: You are not authenticated (credentials are missing or invalid)

* **403**: The server understood the request but you are not allowed to access the resource

* **404**: The server did not find the resource
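
The meanings listed above are also available in Python's standard library, which can be handy when logging results. The following is a minimal sketch; `describe_status` is a hypothetical helper written for this tutorial, not part of **requests**.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description of an HTTP status code (hypothetical helper)."""
    status = HTTPStatus(code)  # raises ValueError for codes the module does not know
    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    else:
        kind = "failure"
    return f"{code} {status.phrase} ({kind})"


# The codes from the list above
for code in (200, 301, 400, 401, 403, 404):
    print(describe_status(code))
```

On a live response, the code is read from `response.status_code`, and `response.raise_for_status()` raises an exception for 4xx and 5xx codes.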

In [4]:
# Code 200
status_code = response.status_code
status_code

200

In [5]:
# Code 400: the iss-pass endpoint expects lat and lon parameters, which are missing here
response = requests.get("http://api.open-notify.org/iss-pass.json")
response

<Response [400]>

In [6]:
# Code 404: this endpoint does not exist
response = requests.get("http://api.open-notify.org/iss-before.json")
response

<Response [404]>

## Query parameters

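Some endpoints need parameters to work. With **requests**, they are passed as a dictionary through the **params** argument and appended to the URL as a query string.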

In [7]:
# Example: latitude and longitude of Paris
# Resulting URL: http://api.open-notify.org/iss-pass.json?lat=48.87&lon=2.33
parameters = {"lat": 48.87, "lon": 2.33}

In [8]:
response = requests.get("http://api.open-notify.org/iss-pass.json", params=parameters)
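
To see how a parameter dictionary turns into a query string, the sketch below rebuilds the URL by hand with the standard library. It only illustrates the encoding; it sends no request, and **requests** performs an equivalent step on its own when given `params`.

```python
from urllib.parse import urlencode

base_url = "http://api.open-notify.org/iss-pass.json"
parameters = {"lat": 48.87, "lon": 2.33}

# urlencode turns the dictionary into "key=value" pairs joined by "&"
query_string = urlencode(parameters)

# The query string is attached to the base URL after a "?"
full_url = f"{base_url}?{query_string}"
print(full_url)
```

On a live response, `response.url` shows the URL that was actually requested.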

To retrieve the raw body of the response, we use the **content** attribute, which returns bytes.

In [9]:
response_content = response.content
response_content

b'{\n  "message": "success", \n  "request": {\n    "altitude": 100, \n    "datetime": 1641826890, \n    "latitude": 48.87, \n    "longitude": 2.33, \n    "passes": 5\n  }, \n  "response": [\n    {\n      "duration": 434, \n      "risetime": 1641846180\n    }, \n    {\n      "duration": 631, \n      "risetime": 1641851838\n    }, \n    {\n      "duration": 652, \n      "risetime": 1641857627\n    }, \n    {\n      "duration": 650, \n      "risetime": 1641863448\n    }, \n    {\n      "duration": 651, \n      "risetime": 1641869259\n    }\n  ]\n}\n'
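
The `b` prefix in the output means the body is a bytes object, not a string. Here is a small sketch of the conversion, using a shortened, hard-coded sample of the body and assuming it is UTF-8 encoded.

```python
# Shortened sample of the raw body shown above (bytes, note the b prefix)
raw_body = b'{\n  "message": "success", \n  "request": {\n    "passes": 5\n  }\n}\n'

# Decode the bytes into text, assuming the body is UTF-8 encoded
text_body = raw_body.decode("utf-8")

print(type(raw_body).__name__)
print(type(text_body).__name__)
print(text_body)
```

With **requests**, `response.text` returns the decoded body directly.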

### Training

Make the same GET request for the city of San Francisco.

In [10]:
sf_parameters = {"lat" : 37.78, "lon" : -122.41}
sf_response = requests.get("http://api.open-notify.org/iss-pass.json", params=sf_parameters)
sf_content = sf_response.content
sf_content

b'{\n  "message": "success", \n  "request": {\n    "altitude": 100, \n    "datetime": 1641826891, \n    "latitude": 37.78, \n    "longitude": -122.41, \n    "passes": 5\n  }, \n  "response": [\n    {\n      "duration": 521, \n      "risetime": 1641873801\n    }, \n    {\n      "duration": 648, \n      "risetime": 1641879516\n    }, \n    {\n      "duration": 562, \n      "risetime": 1641885395\n    }, \n    {\n      "duration": 483, \n      "risetime": 1641891314\n    }, \n    {\n      "duration": 566, \n      "risetime": 1641897153\n    }\n  ]\n}\n'

## JSON format

**JSON** is the main format for sending and receiving data through an API. Python's **json** module provides two key functions, **dumps** and **loads**. The **dumps** function takes a Python object as input and returns a JSON string. The **loads** function takes a JSON string as input and returns a Python object (list, dictionary, etc.).

In [11]:
# Example
data_science = ["Mathematics", "Statistics", "Computer Science"]
data_science

['Mathematics', 'Statistics', 'Computer Science']

In [12]:
type(data_science)

list

In [13]:
import json

In [14]:
# dumps
data_science_string = json.dumps(data_science)
data_science_string

'["Mathematics", "Statistics", "Computer Science"]'

In [15]:
type(data_science_string)

str

In [16]:
# loads
data_science_list = json.loads(data_science_string)
data_science_list

['Mathematics', 'Statistics', 'Computer Science']

In [17]:
type(data_science_list)

list
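
The same two functions apply to an API body. The sketch below uses a shortened, hard-coded copy of the ISS pass payload rather than a live request. `json.loads` also accepts bytes, so the raw content can be parsed without decoding it first.

```python
import json

# Shortened copy of the ISS pass payload, as bytes
raw_body = b'{"message": "success", "response": [{"duration": 434, "risetime": 1641846180}]}'

# loads: JSON text (str or bytes) to a Python object
parsed = json.loads(raw_body)
print(type(parsed).__name__)
print(parsed["response"][0]["duration"])

# dumps: Python object to a JSON string; indent=2 only makes it easier to read
pretty = json.dumps(parsed, indent=2)
print(pretty)
```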

### Training

In [18]:
# Training with dictionaries
animals = {
    "dog" : 15,
    "cat" : 5,
    "mouse" : 25,
    "chicken" : 10
}

In [19]:
type(animals)

dict

In [20]:
animals_string = json.dumps(animals)
animals_string

'{"dog": 15, "cat": 5, "mouse": 25, "chicken": 10}'

In [21]:
type(animals_string)

str

In [22]:
animals_dict = json.loads(animals_string)
animals_dict

{'dog': 15, 'cat': 5, 'mouse': 25, 'chicken': 10}

In [23]:
type(animals_dict)

dict

## Get JSON from a request

The **json()** method converts the body of a response into a Python object (here, a dictionary).

In [24]:
# Example
parameters = {"lat" : 48.87, "lon" : 2.33}
response = requests.get("http://api.open-notify.org/iss-pass.json", params=parameters)

In [25]:
json_response = response.json()
json_response

{'message': 'success',
 'request': {'altitude': 100,
  'datetime': 1641826890,
  'latitude': 48.87,
  'longitude': 2.33,
  'passes': 5},
 'response': [{'duration': 434, 'risetime': 1641846180},
  {'duration': 631, 'risetime': 1641851838},
  {'duration': 652, 'risetime': 1641857627},
  {'duration': 650, 'risetime': 1641863448},
  {'duration': 651, 'risetime': 1641869259}]}

In [26]:
type(json_response)

dict

In [27]:
first_iss_pass_duration = json_response['response'][0]['duration']
first_iss_pass_duration

434
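
Once the response is a dictionary, ordinary Python is enough to explore it. The sketch below works on a hard-coded copy of the passes shown above and treats `risetime` as a Unix timestamp in seconds, which is an assumption that matches the dates in this notebook.

```python
from datetime import datetime, timezone

# Pass data copied from the response shown above
json_response = {
    "message": "success",
    "response": [
        {"duration": 434, "risetime": 1641846180},
        {"duration": 631, "risetime": 1641851838},
        {"duration": 652, "risetime": 1641857627},
        {"duration": 650, "risetime": 1641863448},
        {"duration": 651, "risetime": 1641869259},
    ],
}

passes = json_response["response"]

# Total visible time over all passes, in seconds
total_duration = sum(p["duration"] for p in passes)

# Convert each risetime (assumed Unix timestamp) to a timezone-aware UTC datetime
rise_times = [datetime.fromtimestamp(p["risetime"], tz=timezone.utc) for p in passes]

print(total_duration)
print(rise_times[0].isoformat())
```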

## Type of content

When we make a **GET** request, the server provides us with a **status code**, **data** and also **metadata** that contains information about how the data was generated. This information is found in the **headers** of the response and is accessed with the **headers** attribute.

In [31]:
# Example
response.headers

{'Server': 'nginx/1.10.3', 'Date': 'Mon, 10 Jan 2022 15:01:31 GMT', 'Content-Type': 'application/json', 'Content-Length': '518', 'Connection': 'keep-alive', 'Via': '1.1 vegur'}

The field that interests us most in these headers is **Content-Type**, which tells us the format of the response body.

In [32]:
response_content_type = response.headers['Content-Type']
response_content_type

'application/json'
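
A common use of this header is to check that the body really is JSON before parsing it. The following is a minimal sketch; `is_json` is a hypothetical helper, and it allows for servers that append parameters such as a charset to the value.

```python
def is_json(content_type):
    """Return True when a Content-Type value announces JSON (hypothetical helper)."""
    # Drop optional parameters such as "; charset=utf-8" and ignore case
    media_type = content_type.split(";")[0].strip().lower()
    return media_type == "application/json"


# Header value copied from the response above
headers = {"Content-Type": "application/json"}

print(is_json(headers["Content-Type"]))
print(is_json("application/json; charset=utf-8"))
print(is_json("text/html"))
```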

## Training

Let's find the number of people in space with the API **http://api.open-notify.org/astros.json**.

In [34]:
response = requests.get("http://api.open-notify.org/astros.json")
response

<Response [200]>

In [36]:
response.headers

{'Server': 'nginx/1.10.3', 'Date': 'Mon, 10 Jan 2022 15:19:41 GMT', 'Content-Type': 'application/json', 'Content-Length': '497', 'Connection': 'keep-alive', 'access-control-allow-origin': '*'}

In [37]:
response.content

b'{"people": [{"craft": "ISS", "name": "Mark Vande Hei"}, {"craft": "ISS", "name": "Pyotr Dubrov"}, {"craft": "ISS", "name": "Anton Shkaplerov"}, {"craft": "Shenzhou 13", "name": "Zhai Zhigang"}, {"craft": "Shenzhou 13", "name": "Wang Yaping"}, {"craft": "Shenzhou 13", "name": "Ye Guangfu"}, {"craft": "ISS", "name": "Raja Chari"}, {"craft": "ISS", "name": "Tom Marshburn"}, {"craft": "ISS", "name": "Kayla Barron"}, {"craft": "ISS", "name": "Matthias Maurer"}], "message": "success", "number": 10}'

In [38]:
response_json_data = response.json()
response_json_data

{'people': [{'craft': 'ISS', 'name': 'Mark Vande Hei'},
  {'craft': 'ISS', 'name': 'Pyotr Dubrov'},
  {'craft': 'ISS', 'name': 'Anton Shkaplerov'},
  {'craft': 'Shenzhou 13', 'name': 'Zhai Zhigang'},
  {'craft': 'Shenzhou 13', 'name': 'Wang Yaping'},
  {'craft': 'Shenzhou 13', 'name': 'Ye Guangfu'},
  {'craft': 'ISS', 'name': 'Raja Chari'},
  {'craft': 'ISS', 'name': 'Tom Marshburn'},
  {'craft': 'ISS', 'name': 'Kayla Barron'},
  {'craft': 'ISS', 'name': 'Matthias Maurer'}],
 'message': 'success',
 'number': 10}

In [39]:
nb_people_in_space = response_json_data['number']
nb_people_in_space

10
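
The same dictionary can answer further questions, such as how many people are aboard each craft. A short sketch, working on a hard-coded copy of the response shown above rather than a live request:

```python
from collections import Counter

# Data copied from the response shown above
response_json_data = {
    "people": [
        {"craft": "ISS", "name": "Mark Vande Hei"},
        {"craft": "ISS", "name": "Pyotr Dubrov"},
        {"craft": "ISS", "name": "Anton Shkaplerov"},
        {"craft": "Shenzhou 13", "name": "Zhai Zhigang"},
        {"craft": "Shenzhou 13", "name": "Wang Yaping"},
        {"craft": "Shenzhou 13", "name": "Ye Guangfu"},
        {"craft": "ISS", "name": "Raja Chari"},
        {"craft": "ISS", "name": "Tom Marshburn"},
        {"craft": "ISS", "name": "Kayla Barron"},
        {"craft": "ISS", "name": "Matthias Maurer"},
    ],
    "message": "success",
    "number": 10,
}

# Count how many people are aboard each craft
people_per_craft = Counter(person["craft"] for person in response_json_data["people"])

for craft, count in people_per_craft.items():
    print(craft, count)
```

The counts add up to the `number` field returned by the API.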