# Web Scraping and Visualising Data

The Bureau of Meteorology (BoM) website includes a handy map tool which allows users to zoom in to locations, pick a data type, and then view the data on a subsequent page. That tool is shown in the image below. You can view the page at the following URL:

http://www.bom.gov.au/climate/data/

While this process is easy enough to do manually, if you wish to download multiple data sets then it will quickly become tedious. We want to automate this!

In this tutorial, we will:
- Explore and understand the BoM website URL parameters
- Build a URL to download a web page
- Harvest a URL from that page to download a zipped data set
- Unpack the data set and extract a file
- Visualise the data using pandas

<div class="alert alert-info">
This notebook contains a lot of code and explanatory text. If you wanted to build a streamlined dashboard to explore weather station data, you could extract the code into a separate Python file and keep only the GUI and plots in the notebook.
</div>

![title](images/BOM_Map.PNG)

# Workshop Dependencies and Configuration

Import the packages required for this workshop, and add a little bit of configuration as appropriate.

In [None]:
import pandas as pd # building data frames from text
from requests import get  # facilitate HTTP GET request
from lxml import etree # harvest data from HTML
from io import BytesIO # reading zip data to a byte stream for ZipFile to interpret
from io import StringIO # reading weather data to a string stream for Pandas to interpret
from zipfile import ZipFile # manipulate the downloaded ZIP file

# IPython dependencies to facilitate dropdown handling
from IPython.display import clear_output, display
from ipywidgets import widgets
#from ipywidgets import interact

Configure matplotlib to use the inline rendering backend:

In [None]:
%matplotlib inline

# Downloading Web Resources

Uniform Resource Locators (URLs) often contain structures which can be deconstructed from a few minutes of experimentation in your favorite web browser. For this workshop, we will be investigating weather data provided by the [Bureau of Meteorology](http://www.bom.gov.au/) (BoM).

The BoM publishes ZIP files of weather station data, covering several observation types, on its website. We will use the Python library `requests` to download these files. An example URL is shown below:

http://www.bom.gov.au/jsp/ncc/cdio/wData/wdata?p_nccObsCode=136&p_display_type=dailyDataFile&p_stn_num=086338&p_startYear=

The above URL represents the page below. What are we looking at? It appears to represent Daily Rainfall data for the Melbourne (Olympic Park) weather station.


![title](images/BOM_Daily.png)

If you visit the above page, you will notice that there is a lot of text data. While we could parse this data manually if necessary, there is a better alternative, as highlighted in the image below. The following URL is the link circled in red:

```
http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_display_type=dailyZippedDataFile&p_stn_num=086338&p_c=-1490905306&p_nccObsCode=136&p_startYear=2017
```

At that URL, a ZIP file is available, which contains all of the available Daily Rainfall data for the Melbourne (Olympic Park) station.

<div class="alert alert-info">
You don't need to download this file now. The goal of this notebook is to develop a flexible way of downloading archived data for any weather station.
</div>

![title](images/BOM_Daily_Annotated.png)

# Making Sense of The URLs
The first step in building our rainfall archive tool is to understand the request parameters of the main weather station page.

## Weather Station Page URL

Inspect the URL of the weather station page we have visited:

http://www.bom.gov.au/jsp/ncc/cdio/wData/wdata?p_nccObsCode=136&p_display_type=dailyDataFile&p_stn_num=086338&p_startYear=

There are some interesting parameters which are included with the URL:
- `p_nccObsCode=136`
- `p_display_type=dailyDataFile`
- `p_stn_num=086338`
- `p_startYear=`
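
If you would rather not pick the parameters out by eye, the standard library can split a URL into its parts. The snippet below is a minimal sketch using `urllib.parse`, which the rest of this notebook does not rely on; the variable names are chosen only for illustration.

```python
from urllib.parse import urlsplit, parse_qs

url = ('http://www.bom.gov.au/jsp/ncc/cdio/wData/wdata?'
       'p_nccObsCode=136&p_display_type=dailyDataFile'
       '&p_stn_num=086338&p_startYear=')

# keep_blank_values=True keeps parameters with no value, such as p_startYear
params = parse_qs(urlsplit(url).query, keep_blank_values=True)

for name, values in params.items():
    # parse_qs returns a list of values for each parameter name
    print('{0} = {1!r}'.format(name, values[0]))
```

Each parameter comes back as a list because a query string may repeat a name; here every list holds a single entry.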

Two of these parameters matter most: `p_stn_num` and `p_nccObsCode`. Normally you would have to experiment to work out what such values mean, but today that work is summarised for you below. Together, these two numbers unlock the data scraping potential of the BoM's weather datasets:

### Station ID ( p_stn_num )
This is a six-digit zero-padded number that uniquely identifies the weather station. Some examples include:
- 086338: Melbourne (Olympic Park)
- 070247: Canberra (Australian National Botanic Gardens)
- 015590: Alice Springs Airport
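
Zero-padding matters once the IDs are handled as integers, because the leading zero is lost. The short sketch below shows one way to convert between the two forms with Python string formatting; the list name `station_ids` is just for illustration.

```python
# Station IDs as integers, e.g. as they might appear after reading a CSV file
station_ids = [86338, 70247, 15590]

# '{0:06d}' pads the integer with leading zeros to a width of six digits
padded = ['{0:06d}'.format(station) for station in station_ids]
print(padded)

# converting back simply drops the leading zeros again
print([int(text) for text in padded])
```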

### Observation Code ( p_nccObsCode )
A three-digit number (possibly zero-padded) that identifies the type of observation:
- 122: Daily Maximum Temperature (degrees Celsius)
- 123: Daily Minimum Temperature (degrees Celsius)
- 136: Daily Rainfall (mm)
- 193: Daily Global Solar Exposure (MJ/m^2)

**Combinations of the Station ID and the Observation Code will be used in this workshop to gather the data sets.**

### Starting Year (p_startYear)
<div class="alert alert-info">
The name `p_startYear` suggests that it provides a starting year for the requested data set. The fact that our URL works without a value suggests that this value is optional, and the logical default action is to request all years. We won't be using the start year in this notebook, but feel free to experiment on your own. One idea is to extend the `make_indirect_url()` function to accept an optional starting year and build it into the URL string. You could then add a GUI widget to select the required starting year.
</div>

## Direct ZIP File URL

The BoM makes zip files of weather station data available with the following request structure:

http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_display_type=dailyZippedDataFile&p_stn_num=086338&p_c=-1490905306&p_nccObsCode=136&p_startYear=2017

While the above URL is quite long, in reality only a handful of characters will change between requests. As you can see, there are multiple parameters to this URL as well, including:
- `p_display_type=dailyZippedDataFile`
- `p_stn_num=086338`
- `p_c=-1490905306`
- `p_nccObsCode=136`
- `p_startYear=2017`

While it is obvious that `p_stn_num` and `p_nccObsCode` are the station ID and observation code that we have already identified, the other parameters are unclear. For example, `p_c` does not have a clear relation to the other numbers (though it may be related to a timestamp). This means we cannot build the direct ZIP download URL to harvest the required ZIP files in a single step. But we do know enough to build the URL of the information page containing the ZIP download links (the indirect URL).

**Therefore, through the magic of web scraping, we will use the indirect URL to lead us to the direct URL!**

# Understanding the Station ID and Observation Code Values
Before we can request and scrape a weather station page, we need to understand the valid values for the station ID and observation code. We need those values to build the page request.

## Station ID
Rather than figuring out the Station IDs manually by visiting the BoM map, a data set containing all of the station IDs and associated meta data is available at the following page:

http://www.bom.gov.au/climate/cdo/about/sitedata.shtml

The URL is difficult to find! Here is the direct link:

`ftp://ftp.bom.gov.au/anon2/home/ncc/metadata/sitelists/stations.zip`

This data set has already been downloaded, and an excerpt of the data set is shown in the next cell. 

In [None]:
frame = pd.read_csv('StationsComplete.csv', index_col=0)
frame.head()

<div class="alert alert-info">
Note that the file headers have been adjusted slightly in order to simplify reading the data set into Pandas. If you are curious you could download the original file and compare it, although this is not required for this notebook exercise.
</div>

As we have specified the first column as the index (the Weather Station ID), we can view a particular station by specifying the Weather Station ID, as shown below:

In [None]:
station_melbourne = 86338 # The Melbourne (Olympic Park) Station ID

frame.loc[[station_melbourne]]

## Observation Code

To save time, the detective work for the observation codes has been done for you! This value determines the type of observation data being requested. There are only a few potential numerical values which we give to you now. 

To help with the web scraping GUI that we build later, let's define the available codes as a dictionary, using a human-readable name as the key:

In [None]:
mode_mapper = {
    'Maximum temperature (Degree C)': '122',
    'Minimum temperature (Degree C)': '123',
    'Rainfall amount (millimetres)': '136',
    'Daily global solar exposure (MJ/m*m)': '193'
}

Now that we understand the two essential request parameters, we can move on to building the weather station page URL and scraping the ZIP download page URL.

# Retrieving the Weather Station Page via the Indirect URL
A function to build the URL that contains both the Station ID and Observation Code is below.

Note that the Station ID will also be zero-padded to six characters here, as it is necessary for the BoM web server.

In [None]:
def make_indirect_url(station, mode):
    # creates a URL for indirect access to BoM data set resources
    url_string = ('http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?'
                  'p_nccObsCode={mode}&'
                  'p_display_type=dailyDataFile'
                  '&p_startYear='
                  '&p_c='
                  '&p_stn_num={station:06d}')
    return url_string.format(station=int(station), mode=mode)
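
As one possible take on the earlier suggestion about `p_startYear`, the sketch below builds the same kind of URL with an optional starting year. The function name `make_indirect_url_with_year` and the constant `BASE_URL` are hypothetical and not part of this notebook, and whether the BoM server honours a start year on this page is an assumption you would need to test yourself.

```python
from urllib.parse import urlencode

BASE_URL = 'http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av'

def make_indirect_url_with_year(station, mode, start_year=None):
    # hypothetical variant of make_indirect_url() with an optional start year
    params = {
        'p_nccObsCode': mode,
        'p_display_type': 'dailyDataFile',
        # an empty value mirrors the original URL, which leaves p_startYear blank
        'p_startYear': '' if start_year is None else int(start_year),
        'p_c': '',
        # zero-pad the station ID to six digits, as the BoM server expects
        'p_stn_num': '{0:06d}'.format(int(station)),
    }
    # urlencode joins the name=value pairs with '&' and escapes special characters
    return '{0}?{1}'.format(BASE_URL, urlencode(params))

print(make_indirect_url_with_year(86338, '136'))
print(make_indirect_url_with_year(86338, '136', start_year=2017))
```

Building the query string with `urlencode` avoids having to place the `&` separators by hand.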


Using the `requests.get()` function, we now define a convenience function that takes a URL and returns the content found there. This function will be used twice in this workshop:
1. Downloading the indirect URL to harvest the direct URL (if it is present)
2. Downloading the ZIP file data to store in-memory

In [None]:
def download_url_content(url):
    # while this function is just two lines, it helps to separate out common processes
    # where particular steps may be forgotten. In this instance, the 'content' property
    # of response may be forgotten when downloading
    response = get(url)
    return response.content

# Harvesting the Direct URL
We now have the HTML page content from the indirect URL. Now we need to mine the HTML in order to find the data archive URL.

There are many techniques and libraries for extracting information from HTML data. [Regular expressions](https://en.wikipedia.org/wiki/Regular_expression) and Python string searching functions can be used to look for particular text patterns, while the [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) library provides a full suite of tools for web scraping.

Here we are going to use a third approach. Exploiting the structure of HTML, we will use what is termed an 'XPath' to specify the location of a specific item in the HTML. An example is shown below:

`//*[@id="content-block"]/ul[2]/li[2]/a/@href`

What does this mean? And how did we gather this? The HTML page is made up of tags such as `<div></div>` (a generic container), `<li></li>` (a list element), and `<a></a>` (a hyperlink). The above XPath is a set of instructions to traverse the HTML document in the following order:
1. `//*[@id="content-block"]` : select all elements on the page with an id of 'content-block' (assumed to yield just one entry)
2. `/ul[2]` : select the second unordered list tag
3. `/li[2]` : select the second list entry tag
4. `/a` : select the hyperlink
5. `/@href` : select the 'href' attribute of the hyperlink (`<a>`)
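
To watch those steps run on something small, here is a sketch against a made-up page fragment. It deliberately uses the standard library's `xml.etree.ElementTree` rather than lxml, so it runs without extra packages; ElementTree supports only a subset of XPath, so the final `/@href` step is replaced by a call to `.get('href')`. The HTML and the link targets are invented for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed fragment that mimics the layout described above
page = """
<html><body>
  <div id="content-block">
    <ul><li><a href="/first-list">Some other link</a></li></ul>
    <ul>
      <li><a href="/one-year.zip">1 year of data</a></li>
      <li><a href="/all-years.zip">All years of data</a></li>
    </ul>
  </div>
</body></html>
"""

root = ET.fromstring(page.strip())

# steps 1 to 4: element with the given id, second <ul>, second <li>, then <a>
link = root.find(".//*[@id='content-block']/ul[2]/li[2]/a")

# step 5: read the href attribute from the hyperlink
href = link.get('href')
print(href)
```

Unlike this toy fragment, real web pages are rarely well-formed XML, which is one reason to prefer a forgiving HTML parser such as lxml for actual scraping.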

For this workshop, finding the required XPath is quite simple. Using Chrome:
1. Right click on the 'All years of data' URL, and click 'Inspect'. The Chrome developer tools will appear below or at the right of the browser window.
2. Right click on the highlighted element (coloured blue as shown below), and hover over 'Copy'
3. Click on 'Copy XPath'. This will add the XPath to your clipboard.
4. Paste the XPath in to your available editor, and add `/@href` to indicate we want the URL fragment from the hyperlink.

Often more complex handling is needed where pages contain a less predictable structure. Thank you for the consistency, BoM!

![title](images/BOM_XPath.png)

<div class="alert alert-info">
If you feel confident, or just curious, feel free to [open this page](http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_nccObsCode=122&p_display_type=dailyDataFile&p_startYear=&p_c=&p_stn_num=086338) in a new tab and recreate the XPath discovery method.

If you use Chrome, you may follow the method exactly. Most other browsers have comparable developer tools for inspecting the page [DOM](https://en.wikipedia.org/wiki/Document_Object_Model), but the steps will vary.
</div>

## Defining the Web Scraping Function
Let's put the XPath of the ZIP archive download link to work by defining a function to retrieve the link from the page HTML. We are using the [lxml](http://lxml.de/) library to query the HTML DOM with our XPath before building the full URL:

In [None]:
def gather_direct_url(html):
    try:
        # take the raw HTML string and convert it into something 
        # parseable by our python code
        html_parsed = etree.HTML(html)
        
        # search the parsed HTML for the XPath we have created for the BoM data page
        filtered = html_parsed.xpath('//*[@id="content-block"]/ul[2]/li[2]/a/@href')
        
        # xpath() returns a list, which is empty when nothing matches
        if filtered:
            # if found, the URL is only a relative URL, so we have to attach
            # the BoM domain in order for it to be a complete URL
            return 'http://www.bom.gov.au{0}'.format(filtered[0])
        
        else:
            # if nothing is found, then we pass None back to the caller
            return None
        
    except Exception as e:
        # simply return nothing, as there were issues with the provided html
        print(e)
        return None
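
The function above prepends the BoM domain by hand, which works because the harvested link starts with `/`. As an alternative sketch, `urllib.parse.urljoin` from the standard library resolves a link against the URL of the page it was found on, and also copes with links that are already absolute. The shortened URLs below are illustrative only.

```python
from urllib.parse import urljoin

# URL of the page that was scraped (shortened for illustration)
page_url = 'http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_nccObsCode=136'

# a root-relative link, like the one harvested from the page
relative_href = '/jsp/ncc/cdio/weatherData/av?p_display_type=dailyZippedDataFile'
absolute = urljoin(page_url, relative_href)
print(absolute)

# a link that is already absolute is returned unchanged
unchanged = urljoin(page_url, 'http://example.com/data.zip')
print(unchanged)
```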

# Building a GUI for Selecting Weather Station and Observation Type

Right, we now have all the code to find the data archive URL from a given station ID and observation code. Let's use the [IPyWidgets](http://ipywidgets.readthedocs.io/en/latest/) library to build a GUI to streamline selection of those two values.

## Station Search

When building this workshop, it was obvious that searching the text on the Site Name column was the simplest method of interaction. Type in the name of local towns in your neighbourhood to discover if BoM has installed a station there.

Notice how we use the `IPython.display.clear_output()` function to animate the search results into the output cell. This function clears the output cell and replaces it with the new output. It's a simple but powerful technique.

In [None]:
columns_display = ['Site Name', 'Start', 'End']

def search_site_name(value):
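    # regex=False treats the search text literally, so characters such as '(' are safe to type
    # na=False treats rows with a missing site name as non-matching
    matches = frame['Site Name'].str.contains(value.upper(), regex=False, na=False)
    print(frame[matches][columns_display])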

def change_search(change):
    if change['type'] == 'change' and change['name'] == 'value':
        clear_output(wait=True)
        search_site_name(change.new)
    else:
        return False

# Create text widget for input
input_text = widgets.Text(description='Search Site Names', value='Melbourne')
input_text.observe(change_search)
display(input_text)

# perform the initial run to demonstrate an example entry
search_site_name(input_text.value)

<div class="alert alert-info">
Once you have found an interesting station, enter its ID in the input below for validation and to assign it for the download URL.
</div>

In [None]:
# Set the default selection to the 'Melbourne (Olympic Park)' Station ID
station_selected = 86338

In [None]:
def search_site_id(value):
    global station_selected
    if value and value.isdigit():
        value = int(value)
        if value in frame.index:
            print('Station found for ID {0}'.format(value))
            station_selected = int(value)
            print(frame.loc[int(value)])
        else:
            print('No Station found for ID {0}'.format(value))
            station_selected = None
    else:
        print('Please enter a Station ID (numerical)')
        station_selected = None

def change_search(change):
    if change['type'] == 'change' and change['name'] == 'value':
        clear_output(wait=True)
        search_site_id(change.new)
    else:
        return False

# Create text widget for input
input_text = widgets.Text(description='Enter Site ID', value=str(station_selected))
input_text.observe(change_search)
display(input_text)

# perform the initial run to demonstrate an example entry
search_site_id(input_text.value)

## Observation Type
Since there is just a short list of observation types, we will use a Dropdown widget from IPyWidgets.

The Dropdown widget accepts a dictionary. The keys are used for the display while the values are used as the selected value. However, when handling the change event of the Dropdown widget, we also want to go backwards from the selected value to the human-readable text. Unfortunately the Dropdown widget only lets you query the value, and Python dictionaries do not provide a reverse lookup from value to key. To get around this we build a second dictionary with the keys and values reversed. Note that this requires the values to be unique as well as the keys (*why?*).

This approach can become error prone if the dictionaries change since you must remember to modify both dictionaries. If you have a dynamic dictionary that requires reverse lookups, consider a third-party library such as [`bidict`](https://pypi.python.org/pypi/bidict/0.3.1).

An alternative approach would be to search the `mode_mapper` dictionary each time to find the human-readable text corresponding to the selected value.
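
That search-based alternative might look like the sketch below. The helper name `find_mode_name` is hypothetical, and the dictionary is repeated here only so the snippet runs on its own.

```python
mode_mapper = {
    'Maximum temperature (Degree C)': '122',
    'Minimum temperature (Degree C)': '123',
    'Rainfall amount (millimetres)': '136',
    'Daily global solar exposure (MJ/m*m)': '193'
}

def find_mode_name(code, mapping=mode_mapper):
    # scan the dictionary and return the first key whose value matches,
    # or None when the code is unknown
    return next((name for name, value in mapping.items() if value == code), None)

print(find_mode_name('136'))
print(find_mode_name('999'))
```

The scan visits every entry in the worst case, which is fine for four items but is the reason an inverse dictionary is preferable for large mappings.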

To refresh your memory, here is the `mode_mapper` dictionary we defined earlier:

In [None]:
mode_mapper

And here is the inverse dictionary:

In [None]:
inverse_mode_mapper = {v: k for k, v in mode_mapper.items()}
inverse_mode_mapper

With that out of the way, let's display the Dropdown widget:

In [None]:
# Arbitrarily select the first mode as our default selection
mode_selected = mode_mapper[list(mode_mapper.keys())[0]]
mode_pretty = inverse_mode_mapper[mode_selected]

In [None]:
def change_mode(change):
    global mode_selected
    global mode_pretty
    clear_output()
    mode_selected = change.new
    mode_pretty = inverse_mode_mapper[mode_selected]
    print('Changed mode to {0} - {1}'.format(mode_selected, mode_pretty))

# build the dropdown widget to manipulate the mode option
dropdown = widgets.Dropdown(options=mode_mapper, value=mode_selected, description='Weather Data')
dropdown.observe(change_mode, 'value')
display(dropdown)

<div class="alert alert-info">
Note that not all data modes are available for each station.
</div>

# Putting It All Together: Perform The ZIP Downloads
Attempt to capture the direct URL with the functions we have created, then download the ZIP:

In [None]:
url_indirect = make_indirect_url(station_selected, mode_selected)
print('Built indirect URL - {0}'.format(url_indirect))
indirect_content = download_url_content(url_indirect)
print('Indirect URL contained HTML of {0} characters'.format(len(indirect_content)))
url_direct = gather_direct_url(indirect_content)

if url_direct is not None:
    print('Found URL {0}, downloading ZIP file...'.format(url_direct))
    direct_content = download_url_content(url_direct)
    print('ZIP file downloaded')
else:
    print('A URL for Station ID {0}, Mode {1} was not found!'.format(station_selected, mode_selected))
    # Set direct_content so that subsequent cells will not produce results from last attempt
    direct_content = None

# Inspect The Zip File
If a file was successfully downloaded in the prior cell, let's inspect the ZIP file and extract a CSV file if one is found.

In [None]:
# In this workshop, we do not save the file on the hard drive.
# Instead, we wrap the downloaded bytes in a BytesIO, which emulates a binary file
zip_data = BytesIO(direct_content)

input_zip = ZipFile(zip_data)

# gather the names of files in the ZIP to loop through
file_names = input_zip.namelist()

weather_data = None

for entry in file_names:
    # here we see if the file name contains '.csv'
    # we have assumed that there is only one such 'Data.csv' file in the ZIP
    if '.csv' in entry:
        print('CSV File Found - {0}'.format(entry))
        weather_data = input_zip.read(entry)
        print('Read weather data of {0} bytes'.format(len(weather_data)))
        
if weather_data is None:
    print('No CSV data set found in ZIP file!')
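
To try the in-memory technique without touching the network, the sketch below first builds a small ZIP archive in memory and then reads it back the same way. The file names and contents are made up purely for illustration and do not reflect the real BoM archive layout.

```python
from io import BytesIO
from zipfile import ZipFile

# Build a tiny ZIP archive entirely in memory (made-up contents)
buffer = BytesIO()
with ZipFile(buffer, 'w') as archive:
    archive.writestr('Example_Data.csv', 'Year,Month,Day,Value\n2020,1,1,0.4\n')
    archive.writestr('Example_Note.txt', 'made-up note\n')
zip_bytes = buffer.getvalue()

# Read it back, just as we would with bytes downloaded from a web server
with ZipFile(BytesIO(zip_bytes)) as archive:
    csv_names = [name for name in archive.namelist()
                 if name.lower().endswith('.csv')]
    csv_bytes = archive.read(csv_names[0])

print(csv_names)
print(csv_bytes.decode('utf-8'))
```

Checking that the name ends with `.csv` is slightly stricter than testing whether `.csv` appears anywhere in the name.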

# Read the Data into a Pandas DataFrame
As the weather data for this workshop is all in-memory (rather than stored on the hard drive), there are a couple of extra steps needed to read the file into a Pandas DataFrame.

In [None]:
# decode the raw bytes into a string (the file is assumed to be UTF-8 encoded)
weather_decoded = weather_data.decode("utf-8")

# wrap the decoded text in a StringIO to emulate a file on the hard drive
weather_file = StringIO(weather_decoded)

# read the CSV 'file' into a DataFrame 
weather_frame = pd.read_csv(weather_file)

weather_frame.head()

# Cleaning the Data
Now that we have read the data, the next stage is to assign something more appropriate to the DataFrame index, and perform some basic data cleaning:

In [None]:
# build a datetime index out of the Year, Month, and Day columns
# it is assumed these columns are provided in the BoM data set
weather_frame.index = pd.to_datetime(weather_frame[['Year', 'Month', 'Day']])

weather_frame.head()

In [None]:
# we can discard much of the provided data, besides the particular environmental data
# column that we are interested in
weather_cleaned = weather_frame[[mode_pretty]]

# the BoM data may contain a large number of empty rows, as the weather station
# may have started operation on May 10th, while the BoM data by default starts at Jan 1st
# in addition, some rows throughout the data set may be empty due to unexpected downtime or
# maintenance of the weather station
weather_cleaned = weather_cleaned.dropna()

# some rows may have been dropped in between rows that contain data.
# after dropping the empty rows, now fix the indexing so that we again have a sequential data set
idx = pd.date_range(weather_cleaned.index[0], weather_cleaned.index[-1])
weather_cleaned = weather_cleaned.reindex(idx)

# interpolate missing entries from rows after the reindexing
# be aware that this strategy suits data with only a small number of consecutive
# missing rows, and can produce undesired results if the data is sensitive to time or
# behaves stochastically
weather_cleaned[mode_pretty] = weather_cleaned[mode_pretty].interpolate().round(1)

# Preview the data
weather_cleaned.head()
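
The effect of the drop, reindex, and interpolate steps is easier to see on a handful of rows. The sketch below repeats them on made-up readings in which one day is absent from the data and another is present but empty; the column name `value` and the numbers are invented for illustration.

```python
import pandas as pd

# Made-up readings: 3 January is absent altogether, 4 January has no value
dates = pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-04', '2020-01-05'])
toy = pd.DataFrame({'value': [10.0, 12.0, None, 18.0]}, index=dates)

# drop the empty row, then rebuild a gap-free daily index
toy = toy.dropna()
full_range = pd.date_range(toy.index[0], toy.index[-1])
toy = toy.reindex(full_range)

# both missing days are now NaN and are filled by linear interpolation
toy['value'] = toy['value'].interpolate().round(1)
print(toy)
```

The two missing days are filled with evenly spaced values between 12.0 and 18.0, which is reasonable for a smooth quantity but can be misleading for something as bursty as daily rainfall.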

# Plotting the Data
Once the data has been appropriately cleaned, we can then use Pandas to visualise the data.

In [None]:
figsize = (16, 5)

In [None]:
ax = weather_cleaned.plot(kind='line', legend=False, figsize=figsize)

ax.set(xlabel="Year", title=mode_pretty);

In [None]:
ax = weather_cleaned.groupby(weather_cleaned.index.year).mean().plot(legend=False, figsize=figsize)

ax.set(xlabel="Year", title="Weather Data - Yearly Mean");

In [None]:
ax = weather_cleaned.groupby(weather_cleaned.index.month).mean().plot(legend=False, figsize=figsize)

ax.set(xlabel="Month", title="Weather Data - Monthly Mean");