## Data Collection & Data Formats

### Downloading Data
The built-in Python *urllib.request* module has functions which help in downloading content from HTTP URLs using minimal code.

In [None]:
import urllib.request
url = "http://mlg.ucd.ie/modules/COMP41680/ucd.txt"
response = urllib.request.urlopen(url)
text = response.read().decode()
print(text)

In practice, we may often want to wrap code to fetch URLs in a try block, to handle the case where we cannot access the URL.

In [None]:
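import urllib.error
url = "http://somemissinglink.ucd.ie/ucd.txt"
try:
    response = urllib.request.urlopen(url)
    text = response.read().decode()
except urllib.error.URLError as err:
    # URLError also covers HTTPError (e.g. a 404 response)
    print("Failed to retrieve %s (%s)" % (url, err.reason))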
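If we fetch many URLs, it can be convenient to put this pattern into a small helper. The sketch below is one possible way to do so; the function name `fetch_text` and its `timeout` and `encoding` defaults are our own choices, not part of *urllib*. The demonstration uses a `data:` URL and a made-up URL scheme so that it runs without network access.

```python
import urllib.error
import urllib.request

def fetch_text(url, timeout=10, encoding="utf-8"):
    """Return the decoded content at url, or None if it cannot be retrieved."""
    try:
        # The with-statement closes the connection once we have read the content
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode(encoding)
    except urllib.error.URLError as err:
        # URLError also covers HTTPError, such as a 404 response
        print("Failed to retrieve %s (%s)" % (url, err.reason))
        return None

# A data: URL carries its content inline, so no network is needed here
ok = fetch_text("data:text/plain;charset=utf-8,UCD")
# An unknown URL scheme is used here purely to trigger the failure path
missing = fetch_text("nosuchscheme://example")
print(ok, missing)
```

Returning `None` on failure lets the calling code decide whether to skip the URL or stop.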

### Working with CSV

The CSV ("Comma Separated Values") file format is often used to exchange tabular data between different applications, such as Excel. Essentially, a CSV file is a plain text file where the values on each line are separated by commas. Some variants use a tab or a space as the separator instead.

We could download a CSV file using *urllib.request* and manually parse it...

In [None]:
# Download the CSV and store as a string
url = "http://mlg.ucd.ie/modules/COMP41680/goal_scorers.csv"
response = urllib.request.urlopen(url)
raw_csv = response.read().decode()
# Parse each line
lines = raw_csv.split("\n")
for line in lines:
    line = line.strip()
    if len(line) > 0:
        # split based on a comma separator
        parts = line.split(",")
        print(parts)
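Splitting on commas by hand breaks down when a value itself contains a comma inside quotes. As a minimal sketch, the built-in *csv* module handles this case for us. The two-line sample below is invented for illustration and is not taken from the *goal_scorers.csv* file.

```python
import csv
import io

# Invented sample data: the player name contains a comma inside quotes
raw_csv = 'player,club,goals\n"Kane, Harry",Tottenham,21\n'

# Naive splitting cuts the quoted name into two pieces
naive = raw_csv.splitlines()[1].split(",")
print(len(naive), naive)

# csv.reader understands the quoting rules and keeps the name together
rows = list(csv.reader(io.StringIO(raw_csv)))
print(len(rows[1]), rows[1])

# For tab-separated data, we pass a different delimiter
raw_tsv = "player\tgoals\nKane\t21\n"
tsv_rows = list(csv.reader(io.StringIO(raw_tsv), delimiter="\t"))
print(tsv_rows)
```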

Alternatively, we can use Pandas to download and parse the CSV data directly, creating a DataFrame which is ready to analyse.

In [None]:
import pandas as pd
df = pd.read_csv("http://mlg.ucd.ie/modules/COMP41680/goal_scorers.csv")
df

### Working with JSON

[JSON](http://json.org/) is a lightweight format which is becoming increasingly popular for online data exchange. It was originally based on the JavaScript language and is (relatively) easy for humans to read and write.

The built-in module *json* provides an easy way to encode and decode data in JSON in Python.

In [None]:
import json

Let's try downloading and parsing a simple JSON file which contains information about a number of books, originally from librarything.com:

In [None]:
url = "http://mlg.ucd.ie/modules/COMP41680/books.json"
response = urllib.request.urlopen(url)
raw_json = response.read().decode("utf-8")

In [None]:
print(raw_json)

We can now parse the JSON, converting it from a string into a useful Python data structure:

In [None]:
data = json.loads(raw_json)
print(data)

We can now iterate through the books in the list and extract the relevant information that we require.

In [None]:
for book in data:
    print( "%s = %d" % ( book["title"], book["year"] ) )
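The *json* module also works in the other direction, encoding Python data as a JSON string. The sketch below shows a round trip and how invalid input is reported. The book record is a made-up example which only assumes the same `title` and `year` keys used above.

```python
import json

# Made-up record with the same keys as the books data above
books = [{"title": "Example Book", "year": 2001}]

# Encode the Python list as a JSON string (indent makes it easier to read)
encoded = json.dumps(books, indent=2)
print(encoded)

# Decoding the string gives back an equal Python structure
decoded = json.loads(encoded)
print(decoded == books)

# Invalid JSON raises json.JSONDecodeError, which we can catch
try:
    json.loads("{not valid json}")
    parse_failed = False
except json.JSONDecodeError as err:
    parse_failed = True
    print("Could not parse JSON: %s" % err.msg)
```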

### Working with XML

Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format which is both human-readable and machine-readable. XML is a widely-adopted format. Python includes several built-in modules for parsing XML data.

The *xml.etree.ElementTree* module can be used to extract data from a simple XML file based on its tree structure. 

In [None]:
# download the content
url = "http://mlg.ucd.ie/modules/COMP41680/books.xml"
response = urllib.request.urlopen(url)
raw_xml = response.read().decode()
print(raw_xml)

We can use the *xml.etree.ElementTree.fromstring()* function to parse content from a string containing XML data.

In [None]:
import xml.etree.ElementTree
tree = xml.etree.ElementTree.fromstring(raw_xml)

An XML tree has a root node (i.e. the top level of the document), with child nodes at lower levels. We can iterate over these:

In [None]:
for child in tree:
    # get the name of the tag, along with any XML attributes which the tag has
    print( child.tag, child.attrib )

We can also search for tags with a specific name, such as '<book>', and then in turn find child nodes of each matching tag by name.

In [None]:
for book in tree.findall("book"):
    # get the text inside a <title> tag, contained within a <book> tag
    title = book.find("title").text
    print(title)
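Real XML files often have attributes on tags, and some child tags may be missing. As a minimal sketch, the example below collects each book into a dictionary, using `get()` for attributes and `findtext()` with a default value for optional tags. The inline XML is invented for illustration, so its `id` attribute and `<year>` tag are assumptions and may not match *books.xml*.

```python
import xml.etree.ElementTree as ET

# Invented XML: the second book has no <year> tag
raw_xml = """<catalog>
  <book id="b1"><title>First Example</title><year>1999</year></book>
  <book id="b2"><title>Second Example</title></book>
</catalog>"""

root = ET.fromstring(raw_xml)
records = []
for book in root.findall("book"):
    records.append({
        # get() reads an XML attribute from the tag
        "id": book.get("id"),
        # findtext() returns the text of the first matching child tag
        "title": book.findtext("title"),
        # the default is used when the child tag is missing
        "year": book.findtext("year", default="unknown"),
    })

for record in records:
    print(record)
```

Calling `book.find("year").text` on the second book would fail with an `AttributeError`, because `find()` returns `None` when no matching tag exists.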

### Working with APIs

#### Example - Wikipedia

As a simple example of using an online API, we will retrieve JSON data from the Wikipedia web API. The Wikipedia page for 'Belfield' is [here](https://en.wikipedia.org/wiki/Belfield,_Dublin). We can retrieve the same information in a cleaner JSON format from the Wikipedia API endpoint (https://en.wikipedia.org/w/api.php).

In [None]:
title = "Belfield,_Dublin"
url = "https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=true&titles=" + title
print(url)
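Building the query string by hand works for simple titles, but titles containing spaces or other special characters need to be escaped. One possible approach, sketched below, is to let `urllib.parse.urlencode()` build the query string from a dictionary. The parameter names are the same ones used in the URL above.

```python
from urllib.parse import urlencode

# Same parameters as in the hand-built URL above
params = {
    "format": "json",
    "action": "query",
    "prop": "extracts",
    "exintro": "true",
    "titles": "Belfield,_Dublin",
}

# urlencode escapes special characters, such as the comma in the title
url = "https://en.wikipedia.org/w/api.php?" + urlencode(params)
print(url)
```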

In [None]:
response = urllib.request.urlopen(url)
raw_json = response.read().decode("utf-8")

Once we have downloaded the JSON data into a string, we parse it using the *json.loads()* function, which will convert it into a Python dictionary.

In [None]:
data = json.loads(raw_json)
data

The response still needs to be inspected. Note that the results we want are in *data["query"]["pages"]*:

In [None]:
print(data["query"]["pages"])

In [None]:
result = data["query"]["pages"]["918146"]
print(result["title"])
print(result["extract"])
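Hard-coding the page ID only works for this one article. As a minimal sketch, we can instead loop over whatever pages the response contains. The dictionary below is a hand-written stand-in with the same nesting as the response above; its extract text is a placeholder, not real Wikipedia content.

```python
# Hand-written stand-in for the parsed API response (placeholder extract text)
data = {
    "query": {
        "pages": {
            "918146": {
                "pageid": 918146,
                "title": "Belfield, Dublin",
                "extract": "<p>Placeholder extract text.</p>",
            }
        }
    }
}

titles = []
# The keys of "pages" are page IDs, so we do not need to know them in advance
for page_id, page in data["query"]["pages"].items():
    titles.append(page["title"])
    print(page_id, page["title"])
    # use get() in case a page has no extract
    print(page.get("extract", ""))
```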

#### Example - Currency Exchange Rates

In the next example, we will use the *frankfurter.app* (formerly *Fixer.io*) API to get currency exchange rate information: https://frankfurter.app

To retrieve the latest exchange rates relative to the Euro (EUR), we request the following URL:

In [None]:
url = "https://frankfurter.app/latest"
response = urllib.request.urlopen(url)
raw_json = response.read().decode("utf-8")
print(raw_json)

Next, we parse the JSON data:

In [None]:
data = json.loads(raw_json)
# List all the rates
data

In [None]:
# Get a specific rate
data["rates"]["CHF"]

We can change the URL to get rates for a different currency, such as US Dollars (USD):

In [None]:
url = "https://frankfurter.app/latest?base=USD"
# Retrieve the JSON
response = urllib.request.urlopen(url)
raw_json = response.read().decode("utf-8")
# Parse the JSON
data = json.loads(raw_json)
# Display the rates data for US dollars
data["rates"]
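Once the rates are in a dictionary, converting an amount is a single multiplication. The sketch below assumes the response has `base` and `rates` keys, as used above. The helper name `convert` is our own, and the sample rates are invented for illustration rather than real market values.

```python
import json

# Invented sample in the assumed shape of the API response (not real rates)
raw_json = '{"amount": 1.0, "base": "EUR", "date": "2024-01-02", "rates": {"CHF": 0.95, "USD": 1.10}}'
data = json.loads(raw_json)

def convert(amount, data, target):
    """Convert an amount in the base currency into the target currency."""
    try:
        rate = data["rates"][target]
    except KeyError:
        raise ValueError("No rate available for %s" % target)
    return amount * rate

chf = convert(100, data, "CHF")
print("100 %s = %.2f CHF" % (data["base"], chf))
```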