# Tutorial Wk 7: Using Web APIs with JSON and XML to gather data

## Introduction
In this tutorial, we will continue looking at methods to gather data from Web-based sources, particularly from Web APIs.
The common source formats will be JSON or XML.

# A review of Scraping Data from a Website

Last week we used Python to scrape data from the convict records website: https://convictrecords.com.au

We introduced the following Python libraries:
- **Requests**        for interacting with websites and web services
- **Beautiful Soup**  for webpage parsing

For more documentation about the functions available in Beautiful Soup, see here:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/

We then stored the retrieved data in CSV format and inside a DBMS.


# EXERCISE 1: Using Web APIs

In this exercise, we are looking at some examples of how to access web APIs, which are specifically provided for programs to retrieve data. The advantage is that the data is well defined - no distracting HTML tags in between.

However, the services use two different formats - either JSON or XML.

For **JSON**, we will use the standard language support in Python and its **requests** library.<br>
For **XML**, we will use the **lxml** parser library.
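
As a minimal sketch of what "standard language support" means for JSON: a JSON document maps directly onto Python dictionaries, lists, strings and numbers. The JSON text below is made up for illustration and does not come from any of the services used in this tutorial.

```python
import json

# Hypothetical JSON text, standing in for the body of a web API response.
raw = '{"name": "demo", "totals": {"visits": 42, "browsers": ["A", "B"]}}'

data = json.loads(raw)  # a JSON object becomes a Python dict

print(type(data).__name__)            # the top-level container type
print(data["totals"]["visits"])       # nested lookup by key
print(data["totals"]["browsers"][0])  # JSON arrays become Python lists
```

Calling `.json()` on a Requests response performs the same kind of conversion on the response body.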

### Example 1: U.S. Government Website Analytic API

First, a JSON example from the U.S. government website analytics:

In [1]:
# The number of people who visited a U.S. government website using Internet Explorer 6.0 in the last 90 days
import requests
response = requests.get("https://analytics.usa.gov/data/live/ie.json")
print(response.json()['totals']['ie_version']['6.0'])

52726


Ouch - Internet Explorer 6.0, which was released in 2001, still seems to be in use in 2020...

Which version of Internet Explorer is used most at this moment when contacting U.S. government websites? To answer this, we need to look at the actual JSON response. It is helpful to have a 'pretty-print' of the JSON data returned by the analytics.usa.gov website. The Python **json** library can do this for us:

In [2]:
# The raw response from the U.S. government website
import requests
response = requests.get("https://analytics.usa.gov/data/live/ie.json")

import json
print(json.dumps(response.json(), indent=4, sort_keys=True))

{
    "meta": {
        "description": "90 days of visits from Internet Explorer users broken down by version for all sites. (>100 sessions)",
        "name": "Internet Explorer"
    },
    "name": "ie",
    "query": {
        "dimensions": "ga:date,ga:browserVersion",
        "end-date": "yesterday",
        "filters": "ga:browser==Internet Explorer;ga:sessions>10",
        "max-results": 10000,
        "metrics": [
            "ga:sessions"
        ],
        "samplingLevel": "HIGHER_PRECISION",
        "sort": [
            "ga:date",
            "-ga:sessions"
        ],
        "start-date": "90daysAgo",
        "start-index": 1
    },
    "sampling": {
        "containsSampledData": false
    },
    "taken_at": "2020-05-11T10:00:06.440Z",
    "totals": {
        "ie_version": {
            "10.0": 527085,
            "10.6": 26125,
            "11.0": 235643989,
            "11.0.37": 1066,
            "11.0.38": 1062,
            "11.0.39": 1075,
            "4.0": 11,
         

From the data above, it seems that IE version 11.0 is currently the most popular version of Internet Explorer.
<br>Amazingly, there are apparently still a few visits with IE 4.0 though...
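
Rather than reading the answer off the printout, the most-used version can also be computed. The sketch below works on an excerpt of the `ie_version` totals copied from the output above (the full response contains more entries), so it runs without a network connection.

```python
# Excerpt of the 'ie_version' totals printed above; the real response has more entries.
ie_version = {
    "10.0": 527085,
    "10.6": 26125,
    "11.0": 235643989,
    "11.0.37": 1066,
    "11.0.38": 1062,
    "11.0.39": 1075,
    "4.0": 11,
}

# max() with key=dict.get returns the key whose value is largest.
most_used = max(ie_version, key=ie_version.get)
print(most_used, ie_version[most_used])

# The same idea as a ranking: the three versions with the most visits.
ranking = sorted(ie_version.items(), key=lambda pair: pair[1], reverse=True)
for version, visits in ranking[:3]:
    print(version, visits)
```

With a live response, the dictionary would presumably be taken from `response.json()['totals']['ie_version']` instead of being typed in.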

### Example 2: ABS Population Clock API

The Australian Bureau of Statistics provides the following web API as a *Population Clock Web Service* which gives some statistics about the current Australian population. The meaning of the various fields is explained here: http://www.abs.gov.au/AUSSTATS/abs@.nsf/Latestproducts/1420.0.55.001Main%20Features2User%20Guide?opendocument&tabname=Summary&prodno=1420.0.55.001&issue=User%20Guide&num=&view=

In [3]:
import requests
import json
response = requests.get("http://www.abs.gov.au/api/demography/populationprojection")
print(json.dumps(response.json(), indent=4, sort_keys=True))

population = response.json()['popNow']
print("Current population in Australia: "+str(population))

{
    "attribution": "Australian Bureau of Statistics",
    "birthRate": "1 minute and 44 seconds",
    "copyRight": "Copyright Commonwealth of Australia",
    "deathRate": "3 minutes and 11 seconds",
    "growthRate": "1 minute and 27 seconds",
    "overseasMigrationRate": "2 minutes and 23 seconds",
    "popNow": "25685249",
    "projectionStartDate": "30 September 2019",
    "rateSecond": "87.70624797184293951723",
    "source": "Australian Demographic Statistics, September Quarter 2019 (cat. no. 3101.0)",
    "sourceURL": "https://www.abs.gov.au/ausstats/abs@.nsf/mf/3101.0",
    "timeStamp": "12 May 2020 11:26:57 AEST"
}
Current population in Australia: 25685249
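
Note that every value in this response is a JSON string, including the numeric ones. The following sketch shows one way to convert them before doing arithmetic; it uses values copied from the output above. Reading `rateSecond` as the `growthRate` interval expressed in seconds is an assumption based on the two values matching, not something confirmed here by the ABS documentation.

```python
# Values copied from the response shown above; note that they are all strings.
record = {
    "popNow": "25685249",
    "rateSecond": "87.70624797184293951723",
    "growthRate": "1 minute and 27 seconds",
}

population = int(record["popNow"])          # string of digits -> integer
rate_seconds = float(record["rateSecond"])  # decimal string -> float

# Assumption: rateSecond is the growthRate interval expressed in seconds.
minutes, seconds = divmod(int(rate_seconds), 60)

print(f"Population as a number: {population:,}")
print(f"rateSecond is about {minutes} minute(s) and {seconds} seconds")
```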


### Example 3: Map APIs
Here's another example, this time with parameters sent to a web service:

There are several map API systems that allow you to convert a street address to a GPS location (plus some further information). The most popular of these is Google's Maps Platform. However, its access rules changed in June 2018, and it now requires an API key and associated billing information.

So instead, we will be using the OpenStreetMap project at https://www.openstreetmap.org. 
Note, however, that this service has a restriction of 1 API call per second. 
The following example looks up the GPS location of the School of IT building at "1 Cleveland Street, Darlington, Australia":

In [None]:
# Lookup of a given address via the OpenStreetMap Web API:
import requests
import time
import random
import pprint

def waitrequest(base_url, my_params):
    # wait a random number of seconds (between 1 and classsize) before we make the request,
    # so that the requests from the whole class are spread out over time;
    # if we do too many requests, the whole uni's IP range can be locked out!
    classsize = 25
    sleeptime = random.randint(1,classsize)
    print("waiting for "+str(sleeptime)+" seconds based on a class size of "+str(classsize))
    time.sleep(sleeptime)
    return requests.get(base_url, params = my_params)

base_url = 'https://nominatim.openstreetmap.org/search'
my_params= {'q': '1 Cleveland Street,Darlington,Australia','format':'json','polygon' : 1, 'addressdetails': 1 }

response = waitrequest(base_url, my_params)
results  = response.json()
if results:
    # Check what the results look like
    print("This is what the response looks like:")
    pprint.pprint(results)
    # A non-empty result list has at least one entry; use the first one
    x_geo = results[0]
    print("Here is the Longitude and Latitude of our school:")
    print(x_geo['lon'], x_geo['lat'])
else:
    print("no results")
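
To see what "sending parameters" means in practice, the sketch below builds the request URL by hand. It uses `urllib.parse` from the standard library as a stand-in; Requests does its own encoding when given `params`, and the exact escaping it produces may differ slightly. No request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base_url = "https://nominatim.openstreetmap.org/search"
my_params = {
    "q": "1 Cleveland Street,Darlington,Australia",
    "format": "json",
    "addressdetails": 1,
}

# urlencode() escapes characters such as spaces and commas
# and joins the key=value pairs with '&'.
full_url = base_url + "?" + urlencode(my_params)
print(full_url)

# Decoding the query string again recovers the original values (as strings).
decoded = parse_qs(urlsplit(full_url).query)
print(decoded["q"][0])
```

Pasting the printed URL into a browser is a convenient way to inspect a response before writing any parsing code, but remember that this also counts towards the request limit.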

## YOUR TASK: Retrieve Geo-Location of Arrival Ports of Convict Ship 'Adelaide'
 - Where is 'Van Diemen's Land'?<br>
   Use the OpenStreetMap Web API to look up the *GPS location* of the landing locations of the first voyage of the convict transportation ship "Adelaide" (cf. Exercise 1b): **Port Phillip** and **Van Diemen's Land**
 - Also retrieve the 'boundingbox'.<br>
   For this you might first need to inspect how the JSON response is structured, so start with a pretty-print of the JSON response data.
 - Tip: if you want to see a map for a given GPS location, try: https://www.latlong.net/
 - Discuss: How would you store this information in your relational database next to the passenger list information?

In [None]:
# TODO: replace the content of this cell with your Python solution
## You might have to use the more generic API for your request
## my_q1_params= {'q': 'Port Phillip','format':'json'}
## my_q2_params= {'q': "Van Diemen's Land",'format':'json'}

## remember that you should always have a wait time so use the waitrequest() function to make a request.

raise NotImplementedError

### Example 4: Web API returning XML

Some web APIs return data in **XML** format.
One of the easiest libraries for working with this kind of data in Python is **lxml**.
Its documentation can be found here:<br>
http://lxml.de

In [None]:
# In the "Justice News" RSS feed maintained by the Justice Department: the number of items published on a Friday
from datetime import datetime
from lxml import etree
import pandas as pd
import requests

url = 'https://www.justice.gov/feeds/opa/justice-news.xml'
news = requests.get(url).content   # download the feed once and reuse the bytes
doc = etree.fromstring(news)
items = doc.xpath('//channel/item')

# how many news items in the feed were published on a Friday?
dates = [item.find('pubDate').text.strip() for item in items]
ts = [datetime.strptime(d[0:16], "%a, %d %b %Y") for d in dates]
# for weekday(), 4 corresponds to Friday
print(len([t for t in ts if t.weekday() == 4]))

# which news items were these?
titles = [item.find('title').text for item, t in zip(items, ts) if t.weekday() == 4]
titles_df = pd.DataFrame(titles, columns=['Titles'])
titles_df
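
The same weekday logic can be tried offline on a small, made-up feed. This sketch deliberately swaps in the standard library's `xml.etree.ElementTree` for lxml so that it runs without extra installs, and it parses the dates with `email.utils.parsedate_to_datetime` rather than slicing the string. The feed content is hypothetical.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical RSS document with made-up items, standing in for a downloaded feed.
rss = """<rss version="2.0">
  <channel>
    <title>Sample Feed</title>
    <item><title>First item</title><pubDate>Thu, 07 May 2020 10:15:00 -0400</pubDate></item>
    <item><title>Second item</title><pubDate>Fri, 08 May 2020 16:40:00 -0400</pubDate></item>
    <item><title>Third item</title><pubDate>Mon, 11 May 2020 09:05:00 -0400</pubDate></item>
  </channel>
</rss>"""

doc = ET.fromstring(rss)  # the root element is <rss>

friday_titles = []
for item in doc.findall("./channel/item"):
    # RSS dates use the e-mail date format, which this helper understands.
    published = parsedate_to_datetime(item.findtext("pubDate").strip())
    if published.weekday() == 4:  # Monday is 0, so Friday is 4
        friday_titles.append(item.findtext("title"))

print(friday_titles)
```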

### Example 5: GitHub API and using Pandas for JSON data
A final, more complex example extracts some information from the metadata of a GitHub repository.

This is also an example of how to use the Pandas library to work with JSON data.

In [None]:
# From the lecture slides: name and main programming language of the 5 most recently updated repositories of the 'postgres' GitHub account
from io import StringIO
import requests
import pandas as pd

base_url = 'https://api.github.com/users/postgres/repos'
response = requests.get(base_url)

# Wrap the JSON text in StringIO: newer pandas versions deprecate passing a raw JSON string
df = pd.read_json(StringIO(response.text))

df.sort_values(by='updated_at', ascending=False, inplace=True)
df[['name','language','updated_at']].head(5)
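
To make clear what the sort-and-head steps do, here is a sketch of the same idea in plain Python, without Pandas and without a network call. The repository records are made up for illustration; only the field names `name`, `language` and `updated_at` are taken from the example above.

```python
import json

# Hypothetical excerpt of a repository listing; names and dates are made up.
raw = """[
  {"name": "repo-a", "language": "C",      "updated_at": "2020-05-01T08:00:00Z"},
  {"name": "repo-b", "language": "Python", "updated_at": "2020-05-10T12:30:00Z"},
  {"name": "repo-c", "language": null,     "updated_at": "2020-04-20T17:45:00Z"}
]"""

repos = json.loads(raw)  # a JSON array of objects becomes a list of dicts

# ISO 8601 timestamps in the same time zone sort chronologically as plain strings.
latest = sorted(repos, key=lambda repo: repo["updated_at"], reverse=True)[:2]

for repo in latest:
    print(repo["name"], repo["language"], repo["updated_at"])
```

A JSON `null` becomes Python's `None` here, so a repository without a detected language needs no special handling when loading.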

## Further Web API References

If you are further interested in exploring some web APIs, have a look at the following lists:

101 Data Journalist Challenges
https://github.com/stanfordjournalism/search-script-scrape

Tutorial on how to use New York Times API (needs registration with NYT)
https://stanford.edu/~vbauer/teaching/nyt.html

NSW Public Transport Events (needs registration)
https://opendata.transport.nsw.gov.au/dataset/public-transport-realtime-alerts-0

Twitter web API (needs registration):
https://developer.twitter.com/en/docs/basics/getting-started

Canvas Web API (needs an API key for cURL access): 
https://canvas.instructure.com/doc/api/

# Assignment

In Week 8 (after the break) you will be issued the group assignment for this course. There are some administrative tasks for the assignment that we already started in last week's tutorial and will follow up now, as well as some hints about the content of the assignment.

## Exercise 1: Forming Groups

Form an assignment group of 2-3 students who are all in your dual tutorial class.
Then register your group in Canvas under DATA2001 -> People -> Project Groups.
Make sure you only form a group within your assigned dual tutorial class.
If you are looking for a group, you can post an advertisement on Ed under the 'Looking for Group' topic heading.

## Assignment Hints

There are three primary components to the assignment:
1. Data integration from multiple sources
1. Using the integrated data to generate a model / metric
1. Writing a report based on some analysis using your model / metric

The data integration component is likely to involve the following tasks:
1. As a starting point, we will provide you with a few initial data sets (in CSV format).
    2. You will have to load those data sets into your PostgreSQL database.
1. We will also provide some links to some GeoData (which will be covered in Week 9).
1. You will have to load this GeoData into your PostgreSQL database as well (the university server accounts you have been given should handle GIS data).
1. We will then give you a model or a metric to calculate based on the data you have been provided.
1. You will have to augment the data for the model by gathering extra data from other sources using the Web scraping techniques from last week and the Web API access techniques from today's tutorial.

Once you have a working model/metric, we will ask you to prepare a report based on an analysis of some statistics.

# References

Books:
- Seppe van den Broucke and Bart Baesens: "Practical Web Scraping for Data Science", Springer 2018. (available electronically via USYD library)

# End of Tutorial. Many Thanks.