<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item">
<li><span><a href="#1.-Introduction-to-Web-Scraping" data-toc-modified-id="1.-Introduction-to-Web-Scraping-1">1. Introduction to Web Scraping</a></span><ul class="toc-item">
<li><span><a href="#1.1-Example:-Getting-information-about-the-International-Space-Station-(ISS)-from-http://api.open-notify.org" data-toc-modified-id="1.2-Example:-Getting-information-about-the-International-Space-Station-(ISS)-from-http://api.open-notify.org-1.2">1.1 Example: Getting information about the International Space Station (ISS) from <a href="http://api.open-notify.org" target="_blank">http://api.open-notify.org</a></a></span></li>
<li><span><a href="#1.2-Example:-Getting-information-about-countries-using-Rest-Countries-API" data-toc-modified-id="1.2-Example:-Getting-information-about-countries-using-Rest-Countries-API-1.3">1.2 Example: Getting information about countries using Rest Countries API</a></span></li></ul>
</li></ul></div>

---
# 1. Introduction to Web Scraping
---

Web scraping is the process of extracting data from websites. It can be performed manually or automated using software to download and store the data in an accessible format. 

In this notebook, we will explore web data access in Python using the third-party **requests** package, which allows us to make HTTP requests. It is not part of the standard library, so it may need to be installed first.

## 1.1 Example: Getting information about the International Space Station (ISS) from http://api.open-notify.org

In [1]:
## 
# Imports -- if this cell doesn't run then try, from command line, 
# python -m pip install requests
import requests
##

In [2]:
## 
# Request the current location of the ISS using the requests.get() method
url = r'http://api.open-notify.org/iss-now.json'
##
r = requests.get(url)

In [3]:
##
#  Check the status code of the response object
##
r.status_code

200
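A status code of 200 means the request succeeded. As a rough guide to the other codes you may meet, the sketch below groups them by their first digit. The helper name `describe_status` is made up for this illustration; in practice `requests` also offers `r.ok` and `r.raise_for_status()` for the same purpose.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short label for the class of an HTTP status code."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "informational or unknown"

# Print the standard phrase and the class for a few common codes
for code in (200, 404, 500):
    print(code, HTTPStatus(code).phrase, "->", describe_status(code))
```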

The response body is in JSON format, which looks similar to a Python dictionary. It can be converted into an actual Python dictionary using the `.json()` method.

Note: Some websites return the response in formats other than JSON (e.g. HTML or XML).

In [4]:
## 
# Get the contents of the response as plain text
##
r.text

'{"timestamp": 1709206449, "message": "success", "iss_position": {"latitude": "23.6430", "longitude": "-38.5010"}}'

In [5]:
## 
# Convert from json to a dictionary using .json() method
##
response = r.json()
response

{'timestamp': 1709206449,
 'message': 'success',
 'iss_position': {'latitude': '23.6430', 'longitude': '-38.5010'}}

Once converted into a dictionary, the response can be accessed using standard indexing techniques.

In [6]:
## 
# Getting the current latitude
##
response['iss_position']['latitude']

'23.6430'
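Note that the latitude comes back as a string, not a number. The sketch below works offline on a copy of the response text shown earlier: it parses the text with the standard `json` module (broadly what `r.json()` does for us) and converts the coordinates to floats so they can be used in calculations.

```python
import json

# Sample body copied from the r.text output shown earlier in this notebook
sample_text = (
    '{"timestamp": 1709206449, "message": "success", '
    '"iss_position": {"latitude": "23.6430", "longitude": "-38.5010"}}'
)

parsed = json.loads(sample_text)      # str -> dict
position = parsed["iss_position"]

# The API delivers coordinates as strings; convert before doing arithmetic
latitude = float(position["latitude"])
longitude = float(position["longitude"])
print(latitude, longitude)
```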

The timestamp is an integer number of seconds in [Unix time format](https://en.wikipedia.org/wiki/Unix_time). We can convert this into a readable date using the `utcfromtimestamp` function from Python's built-in `datetime` module:

In [9]:
## 
# Import the datetime package
import datetime as dt
##


In [11]:
##
#  Convert the timestamp from Unix time to a more readable format
##
print(dt.datetime.utcfromtimestamp(response['timestamp']))

2024-02-29 11:34:09
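`utcfromtimestamp` is deprecated as of Python 3.12. A timezone-aware alternative, sketched here with the timestamp value from the sample response, is to pass `tz=dt.timezone.utc` to `fromtimestamp`:

```python
import datetime as dt

timestamp = 1709206449  # value taken from the sample response above

# Build a timezone-aware datetime in UTC
moment = dt.datetime.fromtimestamp(timestamp, tz=dt.timezone.utc)
print(moment)              # 2024-02-29 11:34:09+00:00
print(moment.isoformat())  # 2024-02-29T11:34:09+00:00
```

The result carries its UTC offset with it, which avoids ambiguity when the value is later compared with local times.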


### Concept Check <a class="tocSkip">

Print out a list of all people who are currently in space on the ISS. Use `http://api.open-notify.org/astros.json`



In [13]:
api_endpoint = r"http://api.open-notify.org/astros.json"
r = requests.get(api_endpoint)
print(r.status_code)

200


In [14]:
string_or_dict = '{"hello":1}'
print(string_or_dict)

{"hello":1}


In [15]:
print(r.text)
print(type(r.text))

{"message": "success", "people": [{"name": "Jasmin Moghbeli", "craft": "ISS"}, {"name": "Andreas Mogensen", "craft": "ISS"}, {"name": "Satoshi Furukawa", "craft": "ISS"}, {"name": "Konstantin Borisov", "craft": "ISS"}, {"name": "Oleg Kononenko", "craft": "ISS"}, {"name": "Nikolai Chub", "craft": "ISS"}, {"name": "Loral O'Hara", "craft": "ISS"}], "number": 7}
<class 'str'>


In [16]:
info = r.json()
print(info)
print(type(info))

{'message': 'success', 'people': [{'name': 'Jasmin Moghbeli', 'craft': 'ISS'}, {'name': 'Andreas Mogensen', 'craft': 'ISS'}, {'name': 'Satoshi Furukawa', 'craft': 'ISS'}, {'name': 'Konstantin Borisov', 'craft': 'ISS'}, {'name': 'Oleg Kononenko', 'craft': 'ISS'}, {'name': 'Nikolai Chub', 'craft': 'ISS'}, {'name': "Loral O'Hara", 'craft': 'ISS'}], 'number': 7}
<class 'dict'>


In [18]:
info_people = info['people']
print(info_people)

[{'name': 'Jasmin Moghbeli', 'craft': 'ISS'}, {'name': 'Andreas Mogensen', 'craft': 'ISS'}, {'name': 'Satoshi Furukawa', 'craft': 'ISS'}, {'name': 'Konstantin Borisov', 'craft': 'ISS'}, {'name': 'Oleg Kononenko', 'craft': 'ISS'}, {'name': 'Nikolai Chub', 'craft': 'ISS'}, {'name': "Loral O'Hara", 'craft': 'ISS'}]


In [24]:
astro_list = []  # start from an empty list so re-running this cell does not duplicate names
for people_dict in info_people:
    print(people_dict['name'], people_dict['craft'])
    if people_dict['craft'] == 'ISS':
        astro_list.append(people_dict['name'])

Jasmin Moghbeli ISS
Andreas Mogensen ISS
Satoshi Furukawa ISS
Konstantin Borisov ISS
Oleg Kononenko ISS
Nikolai Chub ISS
Loral O'Hara ISS


In [22]:
print(astro_list)

['Jasmin Moghbeli', 'Andreas Mogensen', 'Satoshi Furukawa', 'Konstantin Borisov', 'Oleg Kononenko', 'Nikolai Chub', "Loral O'Hara"]


In [12]:
# Solution to concept check
api_endpoint = r"http://api.open-notify.org/astros.json"
r = requests.get(api_endpoint)
print(f"GET request status code to {api_endpoint}: {r.status_code}")
response = r.json()

astro_list = []

for person in response["people"]:
    if person["craft"] == "ISS":
        astro_list.append(person["name"])
        
print(astro_list)

GET request status code to http://api.open-notify.org/astros.json: 200
['Jasmin Moghbeli', 'Andreas Mogensen', 'Satoshi Furukawa', 'Konstantin Borisov', 'Oleg Kononenko', 'Nikolai Chub', "Loral O'Hara"]
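The same filter can be written as a list comprehension. This sketch runs offline on a small hand-made sample; the second entry is invented purely to show the filter excluding someone who is not on the ISS.

```python
sample = {
    "people": [
        {"name": "Jasmin Moghbeli", "craft": "ISS"},
        {"name": "Example Person", "craft": "Tiangong"},  # hypothetical entry
        {"name": "Loral O'Hara", "craft": "ISS"},
    ]
}

# Keep only the names of people whose craft is the ISS
iss_crew = [person["name"] for person in sample["people"] if person["craft"] == "ISS"]
print(iss_crew)
```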


In [28]:
import json
print(json.dumps(info, indent=3))
 


{
   "message": "success",
   "people": [
      {
         "name": "Jasmin Moghbeli",
         "craft": "ISS"
      },
      {
         "name": "Andreas Mogensen",
         "craft": "ISS"
      },
      {
         "name": "Satoshi Furukawa",
         "craft": "ISS"
      },
      {
         "name": "Konstantin Borisov",
         "craft": "ISS"
      },
      {
         "name": "Oleg Kononenko",
         "craft": "ISS"
      },
      {
         "name": "Nikolai Chub",
         "craft": "ISS"
      },
      {
         "name": "Loral O'Hara",
         "craft": "ISS"
      }
   ],
   "number": 7
}


# Request Parameters 

There are many types of *parameters* that can accompany an HTTP request. These can be used to provide more information about what content is requested, to provide authorization credentials, or to supply other information such as the browser or platform type that is making the request. 

HTTP *parameters* are similar to Python function *arguments*:
- A Python function can be designed to accept positional and keyword arguments that are used to affect the processing of the function. 
- An HTTP request can be designed to accept several types of parameters that are used to affect the processing of the request. 

There are several types of parameters, including:
- `path` parameter: it's possible to specify a string as part of the url that can be interpreted and used by the server. See the first example below for a use of this. (This is quite similar to 'positional' arguments in a function.)
- `query` parameters: these key-value pairs are appended to the url, after the `?` character. These provide the standard mechanism for providing information to the server, so that it can provide the content requested by the client. See the second example for a use of this. (These `query` parameters are very similar to the Python keyword arguments, i.e. a set of key-value pairs.)
- `header` parameters: typically used to provide authentication credentials (such as a username and password, or an API key), and other technical details. We don't use these fields in these examples. [The Wikipedia page](https://en.wikipedia.org/wiki/List_of_HTTP_header_fields) has a useful listing of the various types. 
- `cookie` parameters: many web servers ask to store small pieces of data on the client device (via the web browser). This 'cookie' data is sent back to the server by the browser with later requests. We don't use these fields in these examples.
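To make the first three parameter types concrete, the sketch below assembles a request with the standard library without sending it. The endpoint `api.example.com`, the field names and the values are all hypothetical; with `requests` the query and header dictionaries would be passed as `params=` and `headers=`.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

base_url = "https://api.example.com/v1/name"   # hypothetical endpoint
country = "New Zealand"                         # path parameter
query = {"fields": "capital", "limit": 5}       # query parameters
headers = {"Accept": "application/json"}        # header parameter

# Path parameter goes into the URL path; query parameters follow the '?'
url = f"{base_url}/{quote(country)}?{urlencode(query)}"
request = Request(url, headers=headers)  # built here, but never sent

print(request.full_url)
print(request.get_method())
print(request.get_header("Accept"))
```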



## Example: Getting information about countries using Rest Countries API
Rest Countries API: <https://restcountries.com>

The above url contains the documentation for this RESTful service.

The `v3.1/name` endpoint allows a search string to be appended to it, as a `path` parameter, as the example below shows:

In [29]:
## 
# Getting the API url for a particular country
country = 'Japan'
##


url = rf'https://restcountries.com/v3.1/name/{country}'

In [30]:
##
#  Do the request and check the status code
##
r = requests.get(url)
r.status_code

200

In [32]:
r.url

'https://restcountries.com/v3.1/name/Japan'

In [31]:
##
# Check the response
##
response = r.json()
print(response)

[{'name': {'common': 'Japan', 'official': 'Japan', 'nativeName': {'jpn': {'official': '日本', 'common': '日本'}}}, 'tld': ['.jp', '.みんな'], 'cca2': 'JP', 'ccn3': '392', 'cca3': 'JPN', 'cioc': 'JPN', 'independent': True, 'status': 'officially-assigned', 'unMember': True, 'currencies': {'JPY': {'name': 'Japanese yen', 'symbol': '¥'}}, 'idd': {'root': '+8', 'suffixes': ['1']}, 'capital': ['Tokyo'], 'altSpellings': ['JP', 'Nippon', 'Nihon'], 'region': 'Asia', 'subregion': 'Eastern Asia', 'languages': {'jpn': 'Japanese'}, 'translations': {'ara': {'official': 'اليابان', 'common': 'اليابان'}, 'bre': {'official': 'Japan', 'common': 'Japan'}, 'ces': {'official': 'Japonsko', 'common': 'Japonsko'}, 'cym': {'official': 'Japan', 'common': 'Japan'}, 'deu': {'official': 'Japan', 'common': 'Japan'}, 'est': {'official': 'Jaapan', 'common': 'Jaapan'}, 'fin': {'official': 'Japani', 'common': 'Japani'}, 'fra': {'official': 'Japon', 'common': 'Japon'}, 'hrv': {'official': 'Japan', 'common': 'Japan'}, 'hun': {'o

In [33]:
##
# Having a look at what is in the response object
# Note: The response is returned as a list of dictionaries. The above response contains only one dictionary in the list.
##
response[0].keys()

dict_keys(['name', 'tld', 'cca2', 'ccn3', 'cca3', 'cioc', 'independent', 'status', 'unMember', 'currencies', 'idd', 'capital', 'altSpellings', 'region', 'subregion', 'languages', 'translations', 'latlng', 'landlocked', 'area', 'demonyms', 'flag', 'maps', 'population', 'gini', 'fifa', 'car', 'timezones', 'continents', 'flags', 'coatOfArms', 'startOfWeek', 'capitalInfo', 'postalCode'])

In [None]:
# Getting the currency name from the response
response[0]['currencies']['JPY']['name']

In [34]:
print(response[0]['currencies'])
print(response[0]['currencies']['JPY'])
print(response[0]['currencies']['JPY']['name'])

{'JPY': {'name': 'Japanese yen', 'symbol': '¥'}}
{'name': 'Japanese yen', 'symbol': '¥'}
Japanese yen


### Concept Check  <a class="tocSkip">

1. Create a list of all the capital cities in Europe that begin with the letter 'L'.
2. Create a dictionary that contains the capital city for each country (in the world) whose name begins with the letter 'L'. Use the country as the key, and the capital city as the value. 

(Stretch Challenge: put each of the above in its own function, allowing the caller of the function to specify different starting letters, e.g. 'L', 'M', 'N'...)

In [6]:
import requests
url = 'https://restcountries.com/v3.1/region/europe'
r = requests.get(url)
r


<Response [200]>


In [2]:
# Parse the JSON body: the result is a list with one dictionary per country
response = r.json()
print(type(response))
print(len(response))

<class 'list'>
53

In [63]:
for i in response:
    print(i['capital'][0])

Nicosia
Bratislava
Vatican City
Belgrade
Tórshavn
Tirana
Rome
Madrid
Dublin
Zagreb
Tallinn
London
Gibraltar
Helsinki
Stockholm
Reykjavik
Bern
Riga
Warsaw
Vilnius
Andorra la Vella
Saint Helier
Valletta
Berlin
City of San Marino
Luxembourg
Bucharest
Longyearbyen
Minsk
Mariehamn
St. Peter Port
Oslo
Brussels
Lisbon
Copenhagen
Prague
Athens
Vienna
Monaco
Ljubljana
Sarajevo
Paris
Sofia
Chișinău
Douglas
Podgorica
Budapest
Skopje
Pristina
Amsterdam
Kyiv
Vaduz
Moscow


In [7]:
#1
import requests
url = 'https://restcountries.com/v3.1/region/europe'
r = requests.get(url)
response = r.json()
european_capitals = []
for country in response:
    capital_city = country.get('capital')
    if capital_city is not None:
        if capital_city[0].startswith('L'):
            european_capitals.append(capital_city[0])
european_capitals


['London', 'Luxembourg', 'Longyearbyen', 'Lisbon', 'Ljubljana']

In [68]:
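#1 as a function
def get_capital_cities(beginning_letter):
    """Return European capital cities whose names start with beginning_letter."""
    url = 'https://restcountries.com/v3.1/region/europe'
    r = requests.get(url)
    response = r.json()
    european_capitals = []
    for country in response:
        # .get() avoids a KeyError if a record has no 'capital' entry
        capital_city = country.get('capital')
        if capital_city and capital_city[0].startswith(beginning_letter):
            european_capitals.append(capital_city[0])
    
    return european_capitals

get_capital_cities('L')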

#TODO 
#definitely going to have to practice this syntax

['London', 'Luxembourg', 'Longyearbyen', 'Lisbon', 'Ljubljana']

In [67]:
#2
import requests
url = 'https://restcountries.com/v3.1/all'
r = requests.get(url)
response = r.json()
world_capitals = {}
for country in response:
    country_name = country['name']['common']
    if country_name[0] == 'L':
        capital_city = country.get('capital')
        world_capitals[country_name] = capital_city[0]
world_capitals

{'Liberia': 'Monrovia',
 'Latvia': 'Riga',
 'Lithuania': 'Vilnius',
 'Lebanon': 'Beirut',
 'Libya': 'Tripoli',
 'Liechtenstein': 'Vaduz',
 'Laos': 'Vientiane',
 'Luxembourg': 'Luxembourg',
 'Lesotho': 'Maseru'}

In [66]:
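#2 as a function
def get_world_capitals(beginning_letter):
    """Map each country whose name starts with beginning_letter to its capital."""
    url = 'https://restcountries.com/v3.1/all'
    r = requests.get(url)
    response = r.json()
    world_capitals = {}
    for country in response:
        country_name = country['name']['common']
        # Some records may have no 'capital' entry; skip those instead of failing
        capital_city = country.get('capital')
        if country_name.startswith(beginning_letter) and capital_city:
            world_capitals[country_name] = capital_city[0]

    return world_capitals

get_world_capitals('L')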

{'Liberia': 'Monrovia',
 'Latvia': 'Riga',
 'Lithuania': 'Vilnius',
 'Lebanon': 'Beirut',
 'Libya': 'Tripoli',
 'Liechtenstein': 'Vaduz',
 'Laos': 'Vientiane',
 'Luxembourg': 'Luxembourg',
 'Lesotho': 'Maseru'}

In [1]:
# Solution to concept check 1
import requests
url = 'https://restcountries.com/v3.1/region/europe'
r = requests.get(url)
response = r.json()

european_capitals = []
for country in response:
    capital_city = country.get('capital')
    if capital_city is not None:
        if capital_city[0].startswith('L'):
            european_capitals.append(capital_city[0])
european_capitals

['London', 'Lisbon', 'Ljubljana', 'Longyearbyen', 'Luxembourg']

In [2]:
# Solution to concept check 2

import requests

url = 'https://restcountries.com/v3.1/all'
r = requests.get(url)
response = r.json()

worldwide_capitals = {}

for country in response:
    country_name = country['name']['common']
    if country_name[0] =='L':
        capital_city = country.get('capital')
        worldwide_capitals[country_name] = capital_city[0]

worldwide_capitals

{'Lithuania': 'Vilnius',
 'Liechtenstein': 'Vaduz',
 'Libya': 'Tripoli',
 'Liberia': 'Monrovia',
 'Latvia': 'Riga',
 'Lebanon': 'Beirut',
 'Laos': 'Vientiane',
 'Lesotho': 'Maseru',
 'Luxembourg': 'Luxembourg'}
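Solution 2 works for 'L', but `country.get('capital')` returns `None` for any record without a `capital` key, and indexing `None` raises a `TypeError`. The sketch below shows one way to guard against that when the starting letter is chosen by the caller. It runs offline on a hand-made list; the function name and the 'Lostland' record are invented for illustration.

```python
def capitals_by_country(countries, letter):
    """Map country name -> first capital, for names starting with `letter`."""
    result = {}
    for country in countries:
        name = country["name"]["common"]
        capitals = country.get("capital") or []  # missing or empty -> []
        if name.startswith(letter) and capitals:
            result[name] = capitals[0]
    return result

sample_countries = [  # hypothetical, trimmed-down records
    {"name": {"common": "Latvia"}, "capital": ["Riga"]},
    {"name": {"common": "Laos"}, "capital": ["Vientiane"]},
    {"name": {"common": "Lostland"}},  # made-up entry with no capital
    {"name": {"common": "Japan"}, "capital": ["Tokyo"]},
]

print(capitals_by_country(sample_countries, "L"))
```

Separating the filtering from the HTTP call also makes the logic easy to check without network access.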

## Query string parameters




From Python, the most convenient way to include query string parameters in an HTTP `GET` request is to:
- Create a Python dictionary (e.g. `my_params`) that includes all the required key-value pairs
- Pass this dictionary into the `requests.get` function call as a keyword argument, i.e. `params=my_params`
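Passing a dictionary matters because the values get escaped for us. As a rough illustration of that encoding step, this sketch uses the standard library's `urlencode` on a hypothetical endpoint and parameter set, then parses the query string back to confirm nothing was lost. `requests` performs a broadly equivalent encoding when given `params=`.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base_url = "https://api.example.com/v2/latest"  # hypothetical endpoint
my_params = {"city": "Newcastle upon Tyne", "parameter": "pm25", "limit": 10}

# Spaces and other special characters are escaped automatically
url = f"{base_url}?{urlencode(my_params)}"
print(url)

# Round trip: decode the query string back into key -> list of values
recovered = parse_qs(urlsplit(url).query)
print(recovered)
```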


To demonstrate this, we'll use the [Open-Air-Quality REST API](https://docs.openaq.org/docs/about-api)



In [69]:
##
#  Example: find the latest London air quality measurements

url = 'https://api.openaq.org/v2/latest'
##

import requests
my_params = {'city':'London'}
r = requests.get(url, params = my_params)
r.status_code

200

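In [71]:
# A hand-built query string is easy to get wrong: here `city` appears twice
# (once empty) and the server rejects the request with status 422.
# A separate variable name keeps `r` from the previous cell intact.
r_locations = requests.get('https://api.openaq.org/v2/locations?page=1&offset=0&sort=desc&radius=1000&city=london&city=&order_by=lastUpdated&dump_raw=false')
print(r_locations)

<Response [422]>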


In [70]:
##
# Look at the response content  
## 

data = r.json()
data['results'][0].keys()


dict_keys(['location', 'city', 'country', 'coordinates', 'measurements'])

For `openaq.org`, the term `parameter` has two meanings:
- There are HTTP request parameters, such as `query` parameters (what we are using). 
- Air Quality (AQ) parameters: the different types of AQ measurement (e.g. pm25, carbon monoxide, nitrogen dioxide).


## Three Challenges:

- Write a function `get_parameter_count(city)` that returns a dictionary of the AQ parameters used to measure AQ in the latest readings, in a given city. In the returned dictionary, each key should be an AQ parameter (e.g. `'pm25'`), and each value should be the number of locations in which that parameter has a measurement (e.g. `13` pm25 readings in London, Jan 2023).
- Write a function `get_readings(city, parameter)` that returns a Pandas DataFrame containing the AQ readings from the last week, for a given city and given parameter type. Each DataFrame column should contain a different location within that city. 
- Plot the AQ (Air Quality) readings for the past week, for a given parameter type and locations within a given city.


In [None]:
# Solution to challenge 1:

import requests
def get_parameter_count(city):
    url = 'https://api.openaq.org/v2/latest'

    my_params = {'city':city}
    r = requests.get(url, params = my_params)
    if r.status_code >=300:
        raise Exception (r.reason)
    data = r.json()
    aq_parameter_count = {}
    for result in data['results']:
        for measurement in result['measurements']:
            parameter = measurement['parameter']
            if parameter in aq_parameter_count:
                aq_parameter_count[parameter] += 1
            else:
                aq_parameter_count[parameter] = 1

    return aq_parameter_count

params = get_parameter_count('London')

params
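The counting loop in the solution can also be expressed with `collections.Counter`. This sketch runs offline on a hand-made payload shaped like the `results` list above; the site names and values are invented, and `count_parameters` is a hypothetical helper name.

```python
from collections import Counter

sample_data = {  # hypothetical payload, shaped like the API response
    "results": [
        {"location": "Site A",
         "measurements": [{"parameter": "pm25", "value": 8.0},
                          {"parameter": "no2", "value": 21.0}]},
        {"location": "Site B",
         "measurements": [{"parameter": "pm25", "value": 11.0}]},
    ]
}

def count_parameters(data):
    """Count how many measurements exist for each AQ parameter."""
    counts = Counter(
        measurement["parameter"]
        for result in data["results"]
        for measurement in result["measurements"]
    )
    return dict(counts)

print(count_parameters(sample_data))
```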


In [None]:
# Solution to challenge 2:

import datetime as dt
import pandas as pd
import requests
def get_readings(city='Liverpool', aq_parameter='no2'):
    url = 'https://api.openaq.org/v2/measurements'
    # Query window: the seven days up to today, as ISO dates (YYYY-MM-DD)
    today = dt.date.today()
    last_week = today - dt.timedelta(days=7)
    date_to = today.isoformat()
    date_from = last_week.isoformat()
    my_params = {
        'city':city, 
        'parameter': aq_parameter, 
        'date_from':date_from, 
        'date_to':date_to,
        'limit':9999
        }
    r = requests.get(url, params = my_params)
    if r.status_code >=300:
        raise Exception (r.reason)
    data = r.json()
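    # One row per timestamp, one column per location
    df = pd.DataFrame(index=pd.to_datetime([]))
    for reading in data['results']:
        utc_string = reading['date']['utc']
        # Drop the '+00:00' offset before parsing
        timestamp = dt.datetime.strptime(utc_string.split('+')[0],'%Y-%m-%dT%H:%M:%S')
        location = reading['location']
        df.loc[timestamp, location] = reading['value']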

    return df

df = get_readings('London', 'no2')
df

In [None]:
ax = df.plot(figsize=(14,8))
ax.set_title('no2 readings in London')
ax.set_ylim(0, 160)
ax.set_xlabel('date')
ax.set_ylabel('parts per million')


## Further topics

### Other Request Types

This notebook has only demonstrated GET requests (to *read* data). We also use POST, PUT and DELETE requests when we wish to *create*, *update* and *delete* data, respectively. 

For example, when we created your projects on GitLab, we used [this POST request](https://docs.gitlab.com/ee/api/members.html#add-a-member-to-a-group-or-project) to add you to the group we created for you. 
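As a rough sketch of what a POST request carries, the code below builds one with the standard library and inspects it without sending anything. The URL and payload fields are hypothetical; with `requests`, the equivalent call would be `requests.post(url, json=payload)`.

```python
import json
from urllib.request import Request

url = "https://api.example.com/projects/42/members"  # hypothetical endpoint
payload = {"user_id": 7, "access_level": 30}          # hypothetical fields

# A POST carries its data in the request body, here encoded as JSON
body = json.dumps(payload).encode("utf-8")
request = Request(
    url,
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)  # built here, but never sent

print(request.get_method())
print(json.loads(request.data))
```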

### Headers

This notebook has only used resources that do not require authorization credentials. 

These credentials can sometimes be required in the `params`; alternatively, they may be required in the headers (as in the GitLab example above). 
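A common pattern is to send a token in the `Authorization` header. The sketch below only builds the request and never sends it; the URL, the token and the helper name `build_authorized_request` are placeholders, and the exact header name and scheme depend on the service. With `requests`, the same dictionary would be passed as `requests.get(url, headers=headers)`.

```python
from urllib.request import Request

def build_authorized_request(url, token):
    """Build (but do not send) a GET request carrying a bearer token."""
    headers = {
        "Authorization": f"Bearer {token}",  # scheme varies between services
        "Accept": "application/json",
    }
    return Request(url, headers=headers)

# Placeholder values only: never hard-code a real token in a notebook
request = build_authorized_request("https://api.example.com/v1/me", "not-a-real-token")
print(request.get_header("Authorization"))
```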

### More RESTful resources:

[Here's a list](https://github.com/public-apis/public-apis) of publicly accessible RESTful web services. Some will require registration, others not. You may find it useful and interesting to browse these data sources!

