# CAB420, Practical 1 - Question 1 Solution

## Combining and Filtering Multiple Datasets

CAB420 Tutorial1.zip contains a number of datasets, split into two directories as follows:
* BOM, which contains Bureau of Meteorology data for Brisbane City from the years 1999-2019. The data is split into three files.
 * IDCJAC0009_040913_1800_Data.csv contains daily rainfall data;
 * IDCJAC0010_040913_1800_Data.csv contains maximum daily temperature data; and
 * IDCJAC0013_040913_1800_Data.csv contains daily solar exposure data.
* BCCCyclewayCounts, which contains five years of data (2014-2018) for Brisbane City Council cycleways, with each year's data in a separate file (e.g. bike-ped-auto-counts-2014.csv contains the data for 2014).

You are to combine these datasets into a single table using Python (or the programming language of your choice) such that:
* You have a single table that spans the time period of the BCCCyclewayCounts data;
* Duplicate information is avoided (i.e. you don’t have multiple date columns, or similar);
* For the cycleway data, only columns that are available in every year's data are included in the final table (i.e. if a counter is available in 2014-2017 but not 2018, that column should be excluded).

### Suggested Packages

The following packages are suggested. There are many ways to approach things in Python, though, so if you'd rather use different packages that's cool too.

In [1]:
# numpy handles pretty much anything that is a number/vector/matrix/array
import numpy as np
# pandas handles dataframes
import pandas as pd
# matplotlib emulates MATLAB's plotting functionality
import matplotlib.pyplot as plt
# statsmodels is the package that will perform the regression analysis
from statsmodels import api as sm
from scipy import stats
from sklearn.metrics import mean_squared_error
# os allows us to manipulate variables on our local machine, such as paths and environment variables
import os
# self-explanatory, dates and times
from datetime import datetime, date
# a helper package to help us iterate over objects
import itertools

### Step 1: Load Weather Data

You have three weather files to load. Each of these covers a longer period than the cycling files, so only select data in the same date range as the cycle counts (2014-2018).
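The year filter can be sketched as follows. This is only an illustration on a small made-up table standing in for a loaded BOM file: the ``Year``, ``Month`` and ``Day`` column names match those used by ``create_date()``, while the ``Rainfall`` column name and all values are placeholders.

```python
import pandas as pd

# Stand-in for a BOM table loaded with pd.read_csv(); all values are made up.
weather = pd.DataFrame({
    'Year':  [2013, 2014, 2016, 2018, 2019],
    'Month': [12, 1, 6, 12, 1],
    'Day':   [31, 1, 15, 31, 1],
    'Rainfall': [0.0, 1.2, 0.0, 5.4, 0.2],  # placeholder column name
})

# Keep only rows whose year falls inside the cycleway period (both ends inclusive).
in_range = weather['Year'].between(2014, 2018)
weather_subset = weather[in_range].reset_index(drop=True)
```

Filtering before converting dates or merging keeps the later steps working on a smaller table.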

You may also wish to convert the date information in this data into a date object to make it easier to work with. If you wish to do this, you can convert the three date values in each row (Year, Month, Day) to a datetime object with the ``create_date()`` function below. You could use the ``apply()`` method of a pandas dataframe to apply this to your tables.

Ideally, it'd be good to merge these into one overall weather table. It's a good idea to take a quick look at your BOM data after it's loaded; the ``head()`` function is good for this. You may then want to remove some redundant columns, which ``drop()`` can do.

In [2]:
def create_date(row):
    # build a date string in the format Y/m/d, then parse it into a datetime
    return datetime.strptime('{:04d}/{:02d}/{:02d}'.format(row.Year, row.Month, row.Day),
                             '%Y/%m/%d')
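One possible way to put the pieces of this step together is sketched below. It is not the only approach, and it runs on tiny made-up tables rather than the real BOM files: the measurement column names (``Rainfall``, ``MaxTemp``, ``SolarExposure``), the ``tidy()`` helper and all values are assumptions for illustration.

```python
from datetime import datetime
import pandas as pd

def create_date(row):
    # Same helper as above, repeated so this sketch runs on its own.
    return datetime.strptime(
        '{:04d}/{:02d}/{:02d}'.format(row.Year, row.Month, row.Day), '%Y/%m/%d')

def tidy(df):
    # Hypothetical helper: add a Date column, then drop the now-redundant parts.
    # apply() is given only the integer date columns so that the values stay
    # integers, which the '{:04d}' format in create_date() requires.
    out = df.copy()
    out['Date'] = out[['Year', 'Month', 'Day']].apply(create_date, axis=1)
    return out.drop(columns=['Year', 'Month', 'Day'])

# Stand-ins for the three BOM tables; column names and values are made up.
days = {'Year': [2014, 2014], 'Month': [1, 1], 'Day': [1, 2]}
rain = pd.DataFrame({**days, 'Rainfall': [0.0, 3.2]})
temp = pd.DataFrame({**days, 'MaxTemp': [31.5, 29.8]})
solar = pd.DataFrame({**days, 'SolarExposure': [28.1, 22.4]})

# Merge the tidied tables one at a time on the shared Date column.
bom = tidy(rain)
bom = bom.merge(tidy(temp), on='Date')
bom = bom.merge(tidy(solar), on='Date')
print(bom.head())
```

Dropping ``Year``, ``Month`` and ``Day`` before merging means the combined table carries a single ``Date`` column rather than three copies of the date parts.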

### Step 2: Load BCC Data

Now we need to load the BCC data. This will follow a broadly similar path to the BOM data:
* Load the individual files
* Convert the date. This data already has a date column, but you should check the format to make sure it's being parsed as a date object and not a string. The pandas ``to_datetime()`` function could be of use.
* Merge the tables. To do this, you will need to look at what columns are in common between the five tables. You can create a list of column names for each table, and take the intersection of these lists to find the set of common columns.
* Pull out the common columns from the individual tables, and merge the final results

As with the BOM data, inspect the data after it has been merged.
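The steps listed above can be sketched as follows, using two small made-up tables in place of the five yearly files. The counter names, the counts and the day/month/year date format are all assumptions; check the real files before reusing any of them.

```python
import pandas as pd

# Stand-ins for two yearly cycleway files; counter names and counts are made up.
bcc_2014 = pd.DataFrame({
    'Date': ['01/01/2014', '02/01/2014'],
    'Counter A': [120, 98],
    'Counter B': [40, 55],
})
bcc_2015 = pd.DataFrame({
    'Date': ['01/01/2015', '02/01/2015'],
    'Counter A': [131, 110],
    'Counter C': [12, 9],
})
tables = [bcc_2014, bcc_2015]

# Parse the date strings into datetimes. The explicit day/month/year format
# is an assumption about how the dates are written.
for t in tables:
    t['Date'] = pd.to_datetime(t['Date'], format='%d/%m/%Y')

# Find the columns present in every table, kept in the order of the first table.
common = set(tables[0].columns)
for t in tables[1:]:
    common &= set(t.columns)
common_cols = [c for c in tables[0].columns if c in common]

# Stack the yearly tables, keeping only the shared columns.
bcc = pd.concat([t[common_cols] for t in tables], ignore_index=True)
print(bcc.head())
```

Passing an explicit ``format`` to ``to_datetime()`` avoids ambiguity between day-first and month-first dates such as 02/01/2014.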

### Step 3: Merge the Data

Here, we can use the pandas ``merge()`` function to merge our two dataframes. Consider merging them based on the ``Date`` columns to make sure that entries line up.
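A minimal sketch of such a merge is shown below, on two made-up tables standing in for the combined BOM and BCC dataframes. The column names other than ``Date`` and all values are placeholders.

```python
import pandas as pd

dates = pd.to_datetime(['2014-01-01', '2014-01-02', '2014-01-03'])

# Stand-ins for the weather and cycleway tables; values are made up.
bom = pd.DataFrame({'Date': dates, 'Rainfall': [0.0, 3.2, 0.4]})
bcc = pd.DataFrame({'Date': dates[:2], 'Counter A': [120, 98]})

# An inner join on Date keeps only the days that appear in both tables.
merged = pd.merge(bom, bcc, on='Date', how='inner')
print(merged)
```

Both ``Date`` columns need to hold the same type (datetimes here) for the rows to match. An inner join is one reasonable choice; ``how='left'`` or ``how='outer'`` would instead keep unmatched days, with missing values filled in as NaN.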

Visualise the merged dataset, and consider if there are any other columns that could be removed.

Finally, you may want to save the dataset using the ``to_csv()`` method of the pandas dataframe.
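As a rough sketch, saving and reloading could look like the following. The table contents, the file name ``combined.csv`` and the use of a temporary directory are all placeholders chosen for illustration.

```python
import os
import tempfile
import pandas as pd

# Stand-in for the final merged table; values are made up.
merged = pd.DataFrame({
    'Date': pd.to_datetime(['2014-01-01', '2014-01-02']),
    'Rainfall': [0.0, 3.2],
    'Counter A': [120, 98],
})

# Hypothetical output location; a temporary directory keeps the sketch tidy.
out_path = os.path.join(tempfile.mkdtemp(), 'combined.csv')

# index=False stops pandas writing the row index as an extra column.
merged.to_csv(out_path, index=False)

# Reading the file back with parse_dates restores Date as a datetime column.
restored = pd.read_csv(out_path, parse_dates=['Date'])
```

A CSV file stores dates as text, so without ``parse_dates`` the ``Date`` column would come back as strings.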