# Data Collection

### Downloading Data
The built-in Python *urllib.request* module has functions which help in downloading content from HTTP URLs using minimal code.

In [1]:
import urllib.request
url = "http://mlg.ucd.ie/modules/COMP30760/ucd.txt"
response = urllib.request.urlopen(url)
text = response.read().decode()
print(text)

History of UCD

Originally known as the Catholic University of Ireland and subsequently as the Royal University, the university became UCD in 1908 and a constituent college of the National University of Ireland (NUI). 

In 1997, UCD became an autonomous university within the loose federal structure of the NUI and UCD students are awarded degrees of the National University of Ireland.

UCD has been a major contributor to the making of modern Ireland. Many UCD students and staff participated in the struggle for Irish independence and the university has produced numerous Irish Presidents and Taoisigh (Prime Ministers) in addition to generations of Irish business, professional, cultural and sporting leaders. 

Among UCD’s well-known graduates are authors (Maeve Binchy, Roddy Doyle, Flann O’Brien), actors (Gabriel Byrne, Brendan Gleeson), directors (Neil Jordan, Jim Sheridan) and sports stars such as Irish rugby captain Brian O’Driscoll and former Manchester United and Ireland captain Kevin

In practice, we may often want to wrap code to fetch URLs in a try block, to handle the case where we cannot access the URL.

In [2]:
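import urllib.error
url = "http://somemissinglink.ucd.ie/ucd.txt"
try:
    response = urllib.request.urlopen(url)
    text = response.read().decode()
except urllib.error.URLError:
    # raised when the host cannot be reached or the server returns an error
    print("Failed to retrieve %s" % url)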

Failed to retrieve http://somemissinglink.ucd.ie/ucd.txt
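
The pattern above could be packaged as a small reusable helper. The following is a minimal sketch rather than a definitive implementation: the function name `fetch_text` and the 10 second timeout are assumptions made for illustration, and the demonstration uses a `data:` URL purely so that it runs without network access.

```python
import urllib.error
import urllib.request


def fetch_text(url, timeout=10):
    """Return the decoded body of `url`, or None if it cannot be retrieved.

    Hypothetical helper: the name and the default timeout are illustrative.
    """
    try:
        # The context manager closes the response once we are done with it
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Use the charset declared in the headers, falling back to UTF-8
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset)
    except urllib.error.URLError as err:
        # URLError also covers HTTPError (e.g. a 404), which is a subclass
        print("Failed to retrieve %s (%s)" % (url, err))
        return None


# A data: URL keeps this demonstration offline; an http:// URL is used the same way
print(fetch_text("data:text/plain;charset=utf-8,History%20of%20UCD"))
# An unsupported URL scheme raises URLError, so the helper returns None
print(fetch_text("nosuchscheme://example/ucd.txt"))
```

Returning `None` on failure lets the calling code decide whether to skip the URL, retry, or stop.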


### Working with CSV

The CSV ("Comma Separated Values") file format is often used to exchange tabular data between different applications, such as Excel. Essentially, a CSV file is a plain text file in which the values on each line are separated by commas. Other delimiters, such as tabs or spaces, are sometimes used instead.

We could download a CSV file using *urllib.request* and manually parse it...

In [3]:
# Download the CSV and store as a string
url = "http://mlg.ucd.ie/modules/COMP30760/goal_scorers.csv"
response = urllib.request.urlopen(url)
raw_csv = response.read().decode()
# Parse each line
lines = raw_csv.split("\n")
for line in lines:
    line = line.strip()
    if len(line) > 0:
        # split based on a comma separator
        parts = line.split(",")
        print(parts)

['Player', 'Team', 'Total Goals', 'Penalties', 'Home Goals', 'Away Goals']
['J Vardy', 'Leicester City', '19', '4', '11', '8']
['H Kane', 'Tottenham', '16', '4', '7', '9']
['R Lukaku', 'Everton', '16', '1', '8', '8']
['O Ighalo', 'Watford', '15', '0', '8', '7']
['S Aguero', 'Manchester City', '14', '1', '10', '4']
['R Mahrez', 'Leicester City', '14', '4', '4', '10']
['O Giroud', 'Arsenal', '12', '0', '4', '8']
['D Costa', 'Chelsea', '10', '0', '7', '3']
['J Defoe', 'Sunderland', '10', '0', '3', '7']
['G Wijnaldum', 'Newcastle Utd', '9', '0', '9', '0']
['T Deeney', 'Watford', '8', '5', '2', '6']
['R Barkley', 'Everton', '8', '2', '5', '3']
['A Ayew', 'Swansea City', '8', '0', '5', '3']
['G Sigurdsson', 'Swansea City', '7', '3', '2', '5']
['W Rooney', 'Manchester Utd', '7', '1', '3', '4']
['A Martial', 'Manchester Utd', '7', '0', '4', '3']
['D Alli', 'Tottenham', '7', '0', '1', '6']
['D Payet', 'West Ham Utd', '7', '0', '3', '4']
['M Arnautovic', 'Stoke City', '7', '2', '4', '3']
['Y Tou
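
Splitting on commas by hand breaks down when a value itself contains a comma. As a hedged illustration, the built-in *csv* module handles quoted fields and alternative delimiters. The sample data below is hypothetical: it mimics the shape of the file above, with an invented quoted team name added to show the problem.

```python
import csv
import io

# Hypothetical sample in the same shape as goal_scorers.csv. The quoted team
# name contains a comma, which a plain split(",") would break into two values.
raw_csv = (
    "Player,Team,Total Goals\n"
    "J Vardy,Leicester City,19\n"
    'H Kane,"Tottenham, London",16\n'
)

# DictReader uses the header row as keys and respects quoted fields
rows = list(csv.DictReader(io.StringIO(raw_csv)))
for row in rows:
    # Every value is read as a string, so numeric columns need converting
    print(row["Player"], "|", row["Team"], "|", int(row["Total Goals"]))

# Tab-separated data only needs a different delimiter
raw_tsv = "J Vardy\t19\nH Kane\t16\n"
tsv_rows = list(csv.reader(io.StringIO(raw_tsv), delimiter="\t"))
print(tsv_rows)
```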

But we can also use Pandas to download and parse CSV data directly, creating a DataFrame which is ready to analyse.

In [4]:
import pandas as pd
df = pd.read_csv("http://mlg.ucd.ie/modules/COMP30760/goal_scorers.csv")
df

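| | Player | Team | Total Goals | Penalties | Home Goals | Away Goals |
|---|---|---|---|---|---|---|
| 0 | J Vardy | Leicester City | 19 | 4 | 11 | 8 |
| 1 | H Kane | Tottenham | 16 | 4 | 7 | 9 |
| 2 | R Lukaku | Everton | 16 | 1 | 8 | 8 |
| 3 | O Ighalo | Watford | 15 | 0 | 8 | 7 |
| 4 | S Aguero | Manchester City | 14 | 1 | 10 | 4 |
| 5 | R Mahrez | Leicester City | 14 | 4 | 4 | 10 |
| 6 | O Giroud | Arsenal | 12 | 0 | 4 | 8 |
| 7 | D Costa | Chelsea | 10 | 0 | 7 | 3 |
| 8 | J Defoe | Sunderland | 10 | 0 | 3 | 7 |
| 9 | G Wijnaldum | Newcastle Utd | 9 | 0 | 9 | 0 |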


### Working with JSON

[JSON](http://json.org/) is a lightweight format which is becoming increasingly popular for online data exchange. It was originally based on the JavaScript language and is (relatively) easy for humans to read and write.

The built-in module *json* provides an easy way to encode and decode data in JSON in Python.

In [5]:
import json

Let's try downloading and parsing a simple JSON file which contains information about a number of books, originally from librarything.com:

In [6]:
url = "http://mlg.ucd.ie/modules/COMP30760/books.json"
response = urllib.request.urlopen(url)
raw_json = response.read().decode("utf-8")

In [7]:
print(raw_json)

[{
	"book_id": "13585350",
	"title": "The World Treasury of Science Fiction",
	"ISBN": "",
	"year": 1989,
	"rating": 3,
	"language": "eng"
}, {
	"book_id": "124205572",
	"title": "The War of the Worlds",
	"ISBN": "1936594056",
	"year": 2013,
	"rating": 4,
	"language": "eng"
}, {
	"book_id": "127360065",
	"title": "Under the Dome: A Novel",
	"ISBN": "1439149038",
	"year": 2013,
	"rating": 2,
	"language": "eng"
}, {
	"book_id": "13908800",
	"title": "The Ultimate Hitchhiker's Guide to the Galaxy",
	"ISBN": "0345453743",
	"year": 2002,
	"rating": 5,
	"language": "eng"
}, {
	"book_id": "123734934",
	"title": "The Time Traveler's Wife",
	"ISBN": "1476764832",
	"year": 2014,
	"rating": 5,
	"language": "eng"
}, {
	"book_id": "13603020",
	"title": "Salem's Lot",
	"ISBN": "0451098277",
	"year": 1976,
	"rating": 3,
	"language": "eng"
}, {
	"book_id": "124173974",
	"title": "Republic",
	"ISBN": "039395501X",
	"year": 1985,
	"rating": 3,
	"language": "eng"
}, {
	"book_id": "123102859",
	"title": "

We can now parse the JSON, converting it from a string into a useful Python data structure:

In [8]:
data = json.loads(raw_json)
print(data)

[{'book_id': '13585350', 'title': 'The World Treasury of Science Fiction', 'ISBN': '', 'year': 1989, 'rating': 3, 'language': 'eng'}, {'book_id': '124205572', 'title': 'The War of the Worlds', 'ISBN': '1936594056', 'year': 2013, 'rating': 4, 'language': 'eng'}, {'book_id': '127360065', 'title': 'Under the Dome: A Novel', 'ISBN': '1439149038', 'year': 2013, 'rating': 2, 'language': 'eng'}, {'book_id': '13908800', 'title': "The Ultimate Hitchhiker's Guide to the Galaxy", 'ISBN': '0345453743', 'year': 2002, 'rating': 5, 'language': 'eng'}, {'book_id': '123734934', 'title': "The Time Traveler's Wife", 'ISBN': '1476764832', 'year': 2014, 'rating': 5, 'language': 'eng'}, {'book_id': '13603020', 'title': "Salem's Lot", 'ISBN': '0451098277', 'year': 1976, 'rating': 3, 'language': 'eng'}, {'book_id': '124173974', 'title': 'Republic', 'ISBN': '039395501X', 'year': 1985, 'rating': 3, 'language': 'eng'}, {'book_id': '123102859', 'title': 'The Road', 'ISBN': '0307387895', 'year': 2006, 'rating': 5,

We can now iterate through the books in the list and extract the relevant information that we require.

In [9]:
for book in data:
    print( "%s = %d" % ( book["title"], book["year"] ) )

The World Treasury of Science Fiction = 1989
The War of the Worlds = 2013
Under the Dome: A Novel = 2013
The Ultimate Hitchhiker's Guide to the Galaxy = 2002
The Time Traveler's Wife = 2014
Salem's Lot = 1976
Republic = 1985
The Road = 2006
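
Once the data is a list of dictionaries, ordinary Python tools apply to it, and *json.dumps()* converts the result back into a JSON string. The sketch below uses three records copied from the output above. The rating threshold of 4 is an arbitrary choice for illustration. It also shows how a truncated JSON string can be caught with *json.JSONDecodeError*.

```python
import json

# Three records copied from the books.json output shown above
raw_json = """[
  {"book_id": "13585350", "title": "The World Treasury of Science Fiction",
   "ISBN": "", "year": 1989, "rating": 3, "language": "eng"},
  {"book_id": "124205572", "title": "The War of the Worlds",
   "ISBN": "1936594056", "year": 2013, "rating": 4, "language": "eng"},
  {"book_id": "13908800", "title": "The Ultimate Hitchhiker's Guide to the Galaxy",
   "ISBN": "0345453743", "year": 2002, "rating": 5, "language": "eng"}
]"""
books = json.loads(raw_json)

# Keep the books rated 4 or higher (arbitrary threshold), newest first
top = sorted(
    (book for book in books if book["rating"] >= 4),
    key=lambda book: book["year"],
    reverse=True,
)

# dumps() is the inverse of loads(): it encodes Python data as a JSON string
encoded = json.dumps(top, indent=2)
print(encoded)

# Incomplete or malformed JSON raises json.JSONDecodeError
try:
    json.loads('[{"title": "')
except json.JSONDecodeError as err:
    print("Invalid JSON: %s" % err.msg)
```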


### Working with XML

Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format which is both human-readable and machine-readable. XML is a widely-adopted format. Python includes several built-in modules for parsing XML data.

The *xml.etree.ElementTree* module can be used to extract data from a simple XML file based on its tree structure. 

In [None]:
# download the content
url = "http://mlg.ucd.ie/modules/COMP30760/books.xml"
response = urllib.request.urlopen(url)
raw_xml = response.read().decode()
print(raw_xml)

We can use the *xml.etree.ElementTree.fromstring()* function to parse content from a string containing XML data.

In [None]:
import xml.etree.ElementTree
tree = xml.etree.ElementTree.fromstring(raw_xml)

An XML tree has a root node (i.e. the top level of the document), with child nodes at lower levels. We can iterate over these:

In [None]:
for child in tree:
    # get the name of the tag, along with any XML attributes which the tag has
    print( child.tag, child.attrib )

We can also search for tags with a specific name, such as '<book>', and then in turn find child nodes of each matching tag by name.

In [None]:
for book in tree.findall("book"):
    # get the text inside a <title> tag, contained within a <book> tag
    title = book.find("title").text
    print(title)
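
Because the cells above depend on a remote file, here is a self-contained sketch of the same steps on an inline string. The XML document is hypothetical: the '<library>' root, the 'id' attribute and the '<year>' tag are assumptions made for illustration, and the real books.xml may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML; the tag and attribute names are assumed for illustration
raw_xml = """<library>
  <book id="13585350">
    <title>The World Treasury of Science Fiction</title>
    <year>1989</year>
  </book>
  <book id="124205572">
    <title>The War of the Worlds</title>
  </book>
</library>"""

root = ET.fromstring(raw_xml)

records = []
for book in root.findall("book"):
    records.append({
        # get() reads an XML attribute, returning None if it is absent
        "id": book.get("id"),
        # findtext() returns the text of the first matching child tag
        "title": book.findtext("title"),
        # the default avoids an error when the child tag is missing
        "year": book.findtext("year", default="unknown"),
    })

for record in records:
    print(record)
```

Using `findtext()` with a default avoids the `AttributeError` that `book.find("year").text` would raise for the second book, which has no '<year>' tag.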

### Working with APIs

#### Example - Wikipedia

As a simple example of using an online API, we will retrieve JSON data from the Wikipedia web API. The Wikipedia page for 'Belfield' is [here](https://en.wikipedia.org/wiki/Belfield,_Dublin). We can retrieve the same content in a cleaner JSON format from the Wikipedia API endpoint (https://en.wikipedia.org/w/api.php).

In [10]:
title = "Belfield,_Dublin"
url = "https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=true&titles=" + title
print(url)

https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=true&titles=Belfield,_Dublin
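
Concatenating strings works for a simple title, but titles containing spaces or characters such as '&' need escaping. One possible alternative, sketched here with the same parameters as above, is to build the query string with *urllib.parse.urlencode()*. The variable names `base_url` and `params` are illustrative.

```python
import urllib.parse

base_url = "https://en.wikipedia.org/w/api.php"
# The same query parameters as in the hand-built URL above
params = {
    "format": "json",
    "action": "query",
    "prop": "extracts",
    "exintro": "true",
    "titles": "Belfield,_Dublin",
}

# urlencode() joins the pairs with '&' and percent-encodes unsafe characters
url = base_url + "?" + urllib.parse.urlencode(params)
print(url)

# Decoding the query string recovers the original values
decoded = urllib.parse.parse_qs(urllib.parse.urlsplit(url).query)
print(decoded["titles"])
```

Note that the comma in the title is encoded as `%2C`, so the printed URL differs slightly from the hand-built one while carrying the same parameter values.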


In [11]:
response = urllib.request.urlopen(url)
raw_json = response.read().decode("utf-8")

Once we have downloaded the JSON data into a string, we parse it using the *json.loads()* function, which converts it into a Python dictionary.

In [12]:
data = json.loads(raw_json)
data

{'batchcomplete': '',
 'query': {'normalized': [{'from': 'Belfield,_Dublin',
    'to': 'Belfield, Dublin'}],
  'pages': {'918146': {'pageid': 918146,
    'ns': 0,
    'title': 'Belfield, Dublin',
    'extract': '<p><b>Belfield</b> is a small enclave, not quite a suburb, in Dún Laoghaire–Rathdown, Ireland. It is synonymous with the main campus of University College Dublin.\n</p><p>Belfield is close to Donnybrook, Ballsbridge, Clonskeagh, Goatstown and Stillorgan and takes its name from Belfield House and Demesne, one of eight properties bought to form the main campus of University College Dublin. It is adjacent to the R138 road.\n</p>'}}}}

The response still needs to be inspected. Note that the results we want are in *data["query"]["pages"]*:

In [13]:
print(data["query"]["pages"])

{'918146': {'pageid': 918146, 'ns': 0, 'title': 'Belfield, Dublin', 'extract': '<p><b>Belfield</b> is a small enclave, not quite a suburb, in Dún Laoghaire–Rathdown, Ireland. It is synonymous with the main campus of University College Dublin.\n</p><p>Belfield is close to Donnybrook, Ballsbridge, Clonskeagh, Goatstown and Stillorgan and takes its name from Belfield House and Demesne, one of eight properties bought to form the main campus of University College Dublin. It is adjacent to the R138 road.\n</p>'}}


In [14]:
result = data["query"]["pages"]["918146"]
print(result["title"])
print(result["extract"])

Belfield, Dublin
<p><b>Belfield</b> is a small enclave, not quite a suburb, in Dún Laoghaire–Rathdown, Ireland. It is synonymous with the main campus of University College Dublin.
</p><p>Belfield is close to Donnybrook, Ballsbridge, Clonskeagh, Goatstown and Stillorgan and takes its name from Belfield House and Demesne, one of eight properties bought to form the main campus of University College Dublin. It is adjacent to the R138 road.
</p>
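
The page id '918146' is specific to this article, and the extract still contains HTML tags. A minimal sketch of one way to handle both points is given below: it loops over every entry in *data["query"]["pages"]* and strips the tags with the built-in *html.parser* module. The `TextExtractor` class and `strip_tags` function are hypothetical helpers, and the `data` dictionary is a shortened, paraphrased stand-in for the real response so that the example runs offline.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects the text of an HTML fragment, dropping tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for the text found between tags
        self.chunks.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks).strip()


# Shortened, paraphrased stand-in for the parsed API response shown above
data = {"query": {"pages": {"918146": {
    "pageid": 918146,
    "title": "Belfield, Dublin",
    "extract": "<p><b>Belfield</b> is a small enclave.\n</p>"
               "<p>It is adjacent to the R138 road.\n</p>",
}}}}

# Iterate over the pages instead of hard-coding the page id key
summaries = {}
for page in data["query"]["pages"].values():
    summaries[page["title"]] = strip_tags(page["extract"])
    print(page["title"])
    print(summaries[page["title"]])
```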
