# Data Acquisition Lab

This lab is divided into short sections, one for each section of theory.

## Accessing unprotected web pages

In [16]:
# import the Python requests library so that you can use it in your program
import requests as rq

In [17]:
# Go to the Australian Bureau of Meteorology website and work out which page corresponds to the
# Sydney weather forecast. Store that page's URL in a variable here.
url = 'http://www.bom.gov.au/nsw/forecasts/sydney.shtml'

In [18]:
# Use the requests.get() method to fetch that page
r = rq.get(url)

In [19]:
# Did that succeed? What was the .status_code?
r.status_code

200
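
The status code is a three-digit number, and its first digit gives the class of the response: 2xx means success, 4xx means a client error (for example, a missing password). As a minimal sketch, the standard library's `http.HTTPStatus` enum can turn a code into its standard reason phrase. The helper name `describe_status` is made up for this illustration, and the code assumes Python 3.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable label for an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes it does not know
    outcome = 'success' if 200 <= code < 300 else 'not a success'
    return '{} {} ({})'.format(code, status.phrase, outcome)


print(describe_status(200))
print(describe_status(401))
```

The same helper can be handed `r.status_code` directly after any call to `rq.get()`.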

In [20]:
# What was the .text or .content of that page? Save it in a variable, because we will be using it
# a little later
syd_weath_html = r.text

## Accessing forms

The pandas library already has a module for getting information from the Yahoo Finance pages,
so you are unlikely to need the following code in a normal environment. It is still a useful
example of a simple web API.

In [None]:
# There is a stock price lookup form on https://au.finance.yahoo.com (it says Enter Symbol)
# Inspect that element, and identify:
# - The <INPUT> tag with the name "s"
# - The <INPUT> tag with the name "ql" (which has a type of "hidden")
# - The <FORM> tag surrounding them with the action of "/q" and the method of GET
#
# Create a dictionary with appropriate keys to provide values for the input tags.
# Create a variable with the full URL to submit to

In [None]:
# Use requests.get to retrieve that page
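
A form with the method GET sends its input values as a query string appended to the form's action URL. The following is a minimal sketch of that encoding using only the standard library (Python 3 assumed). The ticker `BHP.AX` and the value `1` for the hidden `ql` field are placeholders chosen for illustration, not values taken from the real form.

```python
from urllib.parse import urlencode, urljoin

site = 'https://au.finance.yahoo.com'
action = '/q'  # the action attribute of the <FORM> tag

# Hypothetical values: a made-up ticker for "s", a placeholder for the hidden "ql"
form_values = {'s': 'BHP.AX', 'ql': '1'}

# Resolve the action against the site, then append the encoded inputs
submit_url = urljoin(site, action)
full_url = submit_url + '?' + urlencode(form_values)
print(full_url)
```

With the requests library, passing the dictionary as `params=` (for example `rq.get(submit_url, params=form_values)`) performs the same encoding for you.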

## Secured pages

The username for files under http://www.ifost.org.au/ga/protected is "ga" and the password is "s3cr3t"

In this section we will fetch a file from a website that requires authentication.

In [9]:
# What happens if you use the requests library to fetch http://www.ifost.org.au/ga/protected/data.json 
# without supplying a password? What is the .status_code?
r = rq.get('http://www.ifost.org.au/ga/protected/data.json')
r.status_code

401

In [14]:
# Try again, but this time supplying a username and password
r = rq.get('http://www.ifost.org.au/ga/protected/data.json', auth = ('ga', 's3cr3t'))
r.status_code

200

In [15]:
r.text

u'{\n "result": "success",\n "message": "you have accessed data from a protected page"\n}\n'
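
When `auth` is given as a `(username, password)` tuple, requests uses HTTP Basic authentication: the two values are joined with a colon, Base64-encoded, and sent in an `Authorization` header. The sketch below rebuilds that header by hand with the standard library (Python 3 assumed). The function name `basic_auth_header` is made up for this illustration.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of the Authorization header for HTTP Basic auth."""
    credentials = '{}:{}'.format(username, password).encode('utf-8')
    return 'Basic ' + base64.b64encode(credentials).decode('ascii')


header = basic_auth_header('ga', 's3cr3t')
print(header)

# Base64 is reversible, so anyone who sees the header can recover the password
print(base64.b64decode(header.split(' ', 1)[1]))
```

Because Base64 is an encoding rather than encryption, Basic authentication over plain `http://` offers no real secrecy for the password.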

## Parsing HTML

In this section we will find the prediction for tomorrow's weather.

In [21]:
# import BeautifulSoup library (version 4)
from bs4 import BeautifulSoup

In [22]:
# Create a variable called "soup" with the result of parsing the Bureau of Meteorology prediction for
# Sydney that you captured at the start of this notebook.
soup = BeautifulSoup(syd_weath_html, 'lxml')

In [25]:
def has_the_word_tuesday(x):
    return 'Tuesday' in x

# Find the first element in "soup" which has the word Tuesday in it
# You might find the function "has_the_word_tuesday" helpful
tues_elt = soup.find(string=has_the_word_tuesday)
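
The function above hard-codes "Tuesday", which only finds tomorrow's forecast when the notebook is run on a Monday. As a sketch of one way to generalise it, the code below works out tomorrow's day name from a date and builds a matching function for it. The names `tomorrow_name` and `make_day_matcher` are made up for this illustration, and the year 2016 in the example date is an assumption (the page only shows "Tuesday 28 June").

```python
from datetime import date, timedelta

DAY_NAMES = ('Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday')


def tomorrow_name(today=None):
    """Return the English name of the day after `today` (default: the real today)."""
    today = today or date.today()
    return DAY_NAMES[(today + timedelta(days=1)).weekday()]


def make_day_matcher(day_name):
    """Build a function that reports whether a piece of text mentions day_name."""
    def matcher(text):
        return text is not None and day_name in text
    return matcher


# Assumed date: Monday 27 June 2016, the day before "Tuesday 28 June"
has_tomorrow = make_day_matcher(tomorrow_name(date(2016, 6, 27)))
print(has_tomorrow('Tuesday 28 June'))
```

The resulting `has_tomorrow` function could be passed to `soup.find(string=...)` in place of `has_the_word_tuesday`.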

In [38]:
# The weather prediction will be in the <DIV> that contains this text.
# The string's parent is the <H2> heading, and the heading's parent is the <DIV>,
# so go up two levels. You might find the .prettify() method makes it easier to display
tues_div = tues_elt.parent.parent
tues_div

<div class="day main">\n<h2>Tuesday 28 June</h2>\n<div class="forecast">\n<dl>\n<dt>Summary</dt>\n<dd class="image">\n<img alt="" height="42" src="/images/symbols/large/partly-cloudy.png" width="45"/>\n</dd>\n<dd>Min <em class="min">10</em></dd>\n<dd>Max <em class="max">18</em></dd>\n<dd class="summary">Cloud clearing.</dd>\n<dd class="rain">Possible rainfall: <em class="rain">0 mm</em></dd>\n<dd class="rain">Chance of any rain: <em class="pop">10%\n\t\t\t\t\t<img alt="" height="10" src="/images/ui/weather/rain_10.gif" width="69"/></em></dd>\n</dl>\n<h3>Sydney area</h3>\n<p>Sunny. Winds southwesterly 20 to 30 km/h becoming light in the late afternoon.</p>\n</div>\n<p class="alert">No UV Alert, UV Index predicted to reach 2 [Low]</p>\n</div>

In [41]:
# Can you find a <DD> element with a CSS class "summary"? (Use the parameter class_ in BeautifulSoup)
tues_div.find('dd', class_='summary')

<dd class="summary">Cloud clearing.</dd>

In [42]:
# Display the "string" attribute of this summary element. Do you need to bring an umbrella?
tues_div.find('dd', class_='summary').string

u'Cloud clearing.'

In [74]:
# Get the forecast for every day.
# Filtering the children by hand is clumsy: [x for x in tues_div.parent.children if (x!='\n')]
# Searching by CSS class is simpler:
days_elts = tues_div.parent.find_all('div', class_="day")

In [75]:
# Pair each day's heading with its summary
[(x.find('h2'), x.find('dd', class_='summary')) for x in days_elts]

[(<h2>Forecast for the rest of Monday</h2>,
  <dd class="summary">A little rain at times.</dd>),
 (<h2>Tuesday 28 June</h2>, <dd class="summary">Cloud clearing.</dd>),
 (<h2>Wednesday 29 June</h2>, <dd class="summary">Sunny.</dd>),
 (<h2>Thursday 30 June</h2>, <dd class="summary">Partly cloudy.</dd>),
 (<h2>Friday 1 July</h2>, <dd class="summary">Partly cloudy.</dd>),
 (<h2>Saturday 2 July</h2>, <dd class="summary">Partly cloudy.</dd>),
 (<h2>Sunday 3 July</h2>, <dd class="summary">Sunny.</dd>),
 (<h2>Monday 4 July</h2>, <dd class="summary">Partly cloudy.</dd>)]
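
BeautifulSoup does the tree-walking for you here. To see roughly what that involves, the following is a standard-library sketch (not the BeautifulSoup approach used in this lab) that pulls the same heading and summary pairs out of a hand-trimmed fragment of the forecast HTML. The class name `ForecastParser` is made up for this illustration, the fragment is cut down from the output above, and the code assumes Python 3.

```python
from html.parser import HTMLParser


class ForecastParser(HTMLParser):
    """Collect (heading, summary) pairs from <h2> and <dd class="summary"> tags."""

    def __init__(self):
        super().__init__()
        self.forecasts = []    # finished (heading, summary) pairs
        self._capture = None   # which kind of text we are inside, if any
        self._heading = None   # most recent <h2> text

    def handle_starttag(self, tag, attrs):
        if tag == 'h2':
            self._capture = 'h2'
        elif tag == 'dd' and ('class', 'summary') in attrs:
            self._capture = 'summary'

    def handle_data(self, data):
        if self._capture == 'h2':
            self._heading = data.strip()
        elif self._capture == 'summary':
            self.forecasts.append((self._heading, data.strip()))

    def handle_endtag(self, tag):
        if tag in ('h2', 'dd'):
            self._capture = None


# Trimmed-down sample of the forecast page, for illustration only
sample_html = (
    '<div class="day main"><h2>Tuesday 28 June</h2>'
    '<dl><dd class="summary">Cloud clearing.</dd></dl></div>\n'
    '<div class="day"><h2>Wednesday 29 June</h2>'
    '<dl><dd class="summary">Sunny.</dd></dl></div>'
)

parser = ForecastParser()
parser.feed(sample_html)
print(parser.forecasts)
```

The BeautifulSoup version is much shorter because it builds the whole tree first and lets you query it.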

## JSON APIs

Many websites expose their data in JSON format. In this section we will interact
with the Pokemon database at http://pokeapi.co/.

In [None]:
# Look up their documentation. What is the base URL for querying a Pokemon? What URL
# would you use to look up the Pokemon called "Groudon"? Store it in a variable

In [76]:
# Use the requests library to fetch the Groudon data
g = rq.get('http://pokeapi.co/api/v2/pokemon/groudon')

In [77]:
# Check the status code to make sure that it worked
g.status_code

200

In [78]:
# Is the content of the response in JSON format? Use the requests library function
# to decode it from JSON format into a Python dictionary
groudon_dict = g.json()
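
Calling `.json()` on a response is, roughly, a convenience for decoding the body text with a JSON parser. The sketch below shows the same step offline with the standard `json` module. The payload is a hand-trimmed stand-in for the real response, which is much larger; the `"name"` value in particular is assumed for illustration.

```python
import json

# Hand-trimmed stand-in for the response body (the real one has many more keys)
body = '{"name": "groudon", "weight": 9500}'

data = json.loads(body)  # JSON object -> Python dictionary
print(type(data).__name__)
print(sorted(data.keys()))
print(data['weight'])
```

JSON objects become dictionaries, arrays become lists, and numbers become `int` or `float` values.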

In [79]:
# What are the keys of this Python dictionary?
groudon_dict.keys()

[u'is_default',
 u'abilities',
 u'stats',
 u'name',
 u'weight',
 u'held_items',
 u'location_area_encounters',
 u'height',
 u'forms',
 u'base_experience',
 u'id',
 u'game_indices',
 u'species',
 u'moves',
 u'order',
 u'sprites',
 u'types']

In [80]:
# Is "weight" listed there? If so, then the value in it should be a number
# If you play Pokemon, does this number look reasonable?
groudon_dict['weight']

9500
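
To judge whether the number is reasonable you need its unit. The PokeAPI documentation describes `weight` as being in hectograms (tenths of a kilogram); taking that unit as an assumption to check against the documentation, a small conversion sketch looks like this (Python 3 assumed, and the function name `hectograms_to_kg` is made up for this illustration).

```python
HECTOGRAMS_PER_KG = 10  # assumed unit: 1 kg = 10 hectograms


def hectograms_to_kg(hectograms):
    """Convert a weight in hectograms to kilograms."""
    return hectograms / HECTOGRAMS_PER_KG


print(hectograms_to_kg(9500))
```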