# CAB420, Practical 1 - Question 2 Template
## Linear Regression

Using the dataset from Problem 1, split the data into training, validation and testing as follows:
* Training: All data from the years 2014-2016
* Validation: All data from 2017
* Testing: All data from 2018

Develop a regression model to predict one of the cycleway data series in your dataset. In developing this model you should:
* Initially, use all weather data (temperature, rainfall and solar exposure) and all other data series for a particular counter type (i.e. if you’re predicting cyclists inbound for a counter, use all other cyclist inbound counters)
* Use p-values, qqplots, and performance on the validation set to remove terms and improve the model.

When you have finished refining the model, evaluate it on the test set, and compare the Root Mean Squared Error (RMSE) for the training, validation and test sets.
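
For reference, with $N$ samples, observed values $y_i$ and model predictions $\hat{y}_i$ (symbols chosen here for illustration), the RMSE is usually written as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$$

It is in the same units as the response, so a value computed on each of the three sets can be compared directly.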

In training the model, you will need to ensure that you have no samples (i.e. rows) with missing data. As such, you should remove samples with missing data from the dataset before training and evaluating the model. This may also mean that you have to remove some columns that contain large amounts of missing data.

### Relevant Examples

The first linear regression example, ``CAB420_Regression_Example_1_Linear_Regression.ipynb`` is a useful starting point here.

### Suggested Packages

The following packages are suggested. However, there are many ways to approach things in Python; if you'd rather use different packages, that's cool too.

In [1]:
# numpy handles pretty much anything that is a number/vector/matrix/array
import numpy as np
# pandas handles dataframes
import pandas as pd
# matplotlib emulates MATLAB's plotting functionality
import matplotlib.pyplot as plt
# seaborn is another good plotting library. In particular, I like it for heatmaps (https://seaborn.pydata.org/generated/seaborn.heatmap.html)
import seaborn as sns
# statsmodels is the package that will perform the regression analysis
from statsmodels import api as sm
from scipy import stats
from sklearn.metrics import mean_squared_error
# os allows us to manipulate variables on our local machine, such as paths and environment variables
import os
# self-explanatory, dates and times
from datetime import datetime, date
# a helper package to help us iterate over objects
import itertools

### Step 1: Load the data
This may be the data you created in Q1, or the pre-baked merged data.

Use pandas and the read_csv function to load the data. It is suggested you inspect the data after loading (print some of it, use the ``head()`` function, possibly plot some series) as a sanity check.

In [3]:
combined = pd.read_csv('combined.csv')
combined['Date'] = pd.to_datetime(combined['Date'])
combined.head()

Unnamed: 0.1,Unnamed: 0,Rainfall amount (millimetres),Date,Maximum temperature (Degree C),Daily global solar exposure (MJ/m*m),North Brisbane Bikeway Mann Park Windsor Cyclists Outbound,Jack Pesch Bridge Pedestrians Outbound,Story Bridge East Pedestrian Inbound,Kedron Brook Bikeway Lutwyche Pedestrians Outbound,Kedron Brook Bikeway Mitchelton Pedestrian Outbound,...,Story Bridge East Pedestrian Outbound,North Brisbane Bikeway Mann Park Windsor Pedestrian Outbound,Story Bridge West Cyclists Inbound,Bicenntenial Bikeway,Story Bridge East Cyclists Inbound,Bishop Street Pedestrians Inbound,Story Bridge West Cyclists Outbound,North Brisbane Bikeway Mann Park Windsor Pedestrian Inbound,Kedron Brook Bikeway Mitchelton Pedestrian Inbound,Schulz Canal Bridge Cyclists Inbound
0,0,0.0,2014-01-01,30.6,31.2,,,0.0,,,...,0.0,,0.0,3333.0,0.0,,0.0,,,92.0
1,1,0.0,2014-01-02,31.8,23.4,,,0.0,,,...,0.0,,0.0,4863.0,0.0,,0.0,,,123.0
2,2,1.0,2014-01-03,34.5,29.6,,,0.0,,,...,0.0,,0.0,3905.0,0.0,,0.0,,,77.0
3,3,0.0,2014-01-04,38.7,30.5,,,0.0,,,...,0.0,,0.0,3066.0,0.0,,0.0,,,57.0
4,4,0.0,2014-01-05,33.6,15.7,,,0.0,,,...,0.0,,0.0,4550.0,0.0,,0.0,,,92.0


### Step 2: Filter the data

As you inspect the data, you may see some series have fewer samples than others. Trying to find rows that have all data series may lead to having too little data for analysis. A suggested approach is:
* Loop through the columns in the table. You can use something like ``for column in mydata.columns.values:`` to do this iteration. For each column:
  * Get the number of NaNs in the column. The ``isna()`` function that operates on a pandas series could be useful here.
  * If the column has a number of NaNs above a threshold, flag it for removal
* After the loop, remove the columns. The ``drop()`` function in the pandas dataframe class that takes column names as an input could help here.

Be sure to check what's left in the table after your operations.

After this, you should remove any final NaNs. The ``dropna()`` function in the pandas dataframe class could be of use here.

In [4]:
threshold = 300
toRemoveColumns = []
for column in combined.columns.values:
    # flag columns whose NaN count exceeds the threshold
    if combined[column].isna().sum() > threshold:
        toRemoveColumns.append(column)

print(toRemoveColumns)
print(len(toRemoveColumns))

['North Brisbane Bikeway Mann Park Windsor Cyclists Outbound', 'Jack Pesch Bridge Pedestrians Outbound', 'Kedron Brook Bikeway Lutwyche Pedestrians Outbound', 'Kedron Brook Bikeway Mitchelton Pedestrian Outbound', 'Ekibin Park Pedestrians Outbound', 'Kedron Brook Bikeway Mitchelton', 'Bishop Street Cyclists Inbound', 'Riverwalk Cyclists Inbound', 'Granville Street Bridge Pedestrians Outbound', 'Riverwalk Cyclists Outbound', 'Kedron Brook Bikeway Mitchelton Cyclist Inbound', 'Granville Street Bridge Cyclists Inbound', 'Kedron Brook Bikeway Lutwyche Pedestrians Inbound', 'Ekibin Park Cyclists Inbound', 'Kedron Brook Bikeway Lutwyche Cyclists Inbound', 'Granville Street Bridge Pedestrians Inbound', 'Kedron Brook Bikeway Lutwyche', 'Ekibin Park Cyclists Outbound', 'Ekibin Park Pedestrians Inbound', 'Granville Street Bridge Cyclists Outbound', 'Bishop Street Pedestrians Outbound', 'Riverwalk Pedestrians Inbound', 'Riverwalk Pedestrians Outbound', 'Jack Pesch Bridge Cyclists Inbound', 'Jack 
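
The remaining part of this step (dropping the flagged columns, then dropping any rows that still contain NaNs) could look something like the sketch below. It runs on a small made-up table rather than the real data, so the column names, values and the threshold of 3 are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged table (hypothetical column names and values).
toy = pd.DataFrame({
    "Date": pd.date_range("2014-01-01", periods=6, freq="D"),
    "Rainfall": [0.0, 1.0, np.nan, 0.0, 2.0, 0.0],
    "Counter A": [10.0, 12.0, 11.0, 9.0, 13.0, 12.0],
    "Counter B": [np.nan, np.nan, np.nan, np.nan, 5.0, 6.0],
})

nan_threshold = 3  # illustrative only; the cell above uses 300 on the real data

# Count NaNs per column and flag the columns that exceed the threshold.
nan_counts = toy.isna().sum()
to_remove = nan_counts[nan_counts > nan_threshold].index.tolist()

# Drop the flagged columns, then drop any rows that still contain a NaN.
filtered = toy.drop(columns=to_remove).dropna()

print(to_remove)
print(filtered.shape)
```

Doing the column drop before ``dropna()`` matters: a single sparse column would otherwise cause most rows to be discarded.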

### Step 3: Split into Train, Validation and Test Splits

You can split the data now. Be sure to check dataset size after splitting to make sure that you have datasets of roughly the size you expect.

As part of this you should also pull out your X and Y data, i.e. your predictors and response.
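
One possible way to do the year-based split and pull out X and Y is sketched below. The table here is synthetic, and the names ``temperature``, ``rainfall`` and ``response_counter`` are placeholders, not columns from the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the filtered table: one row per day, 2014 to 2018.
rng = np.random.default_rng(0)
dates = pd.date_range("2014-01-01", "2018-12-31", freq="D")
data = pd.DataFrame({
    "Date": dates,
    "temperature": rng.normal(25, 4, len(dates)),
    "rainfall": rng.exponential(2, len(dates)),
    "response_counter": rng.poisson(100, len(dates)).astype(float),
})

# Split on the calendar year of each row.
year = data["Date"].dt.year
train = data[year.between(2014, 2016)]  # inclusive at both ends
val = data[year == 2017]
test = data[year == 2018]

# Separate predictors (X) from the response (Y); names are hypothetical.
response = "response_counter"
predictors = ["temperature", "rainfall"]

X_train, Y_train = train[predictors], train[response]
X_val, Y_val = val[predictors], val[response]
X_test, Y_test = test[predictors], test[response]

# Sanity check on the split sizes.
print(train.shape, val.shape, test.shape)
```

On the real data the sizes will be smaller than a full calendar year wherever rows were removed for missing values, which is why the size check is worth doing.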

You could also visualise some of this data, and aspects such as:
* Correlation between predictors and the response
* Correlation between pairs of predictors
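
Both kinds of correlation can be read off a single correlation matrix. The sketch below uses synthetic data in which ``solar`` is deliberately built to correlate with ``temperature``; all names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic training data (hypothetical names and values).
rng = np.random.default_rng(1)
n = 200
temperature = rng.normal(25, 4, n)
solar = 0.8 * temperature + rng.normal(0, 1, n)  # built to track temperature
rainfall = rng.exponential(2, n)
response = 50 + 3 * temperature - 4 * rainfall + rng.normal(0, 5, n)

train = pd.DataFrame({
    "temperature": temperature,
    "solar": solar,
    "rainfall": rainfall,
    "response": response,
})

# Pearson correlation between every pair of columns.
corr = train.corr()

# Correlation of each predictor with the response.
print(corr["response"].drop("response").sort_values(ascending=False))

# Correlation between one pair of predictors.
print(corr.loc["temperature", "solar"])
```

In a notebook, passing ``corr`` to ``sns.heatmap(corr, annot=True)`` gives a quick visual version of the same matrix. Strongly correlated predictor pairs are candidates for removing one of the two terms later.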

In [None]:
# Typical train/validation/test proportions: 80/10/10 or 70/20/10
# The majority of the data should be in the training set

trainData = 


### Step 4: Create the Model

Using the X and Y arrays you created above, fit a regression model. 

Explore the outputs you get from the model, including:
* The resultant model, including coefficients, p-values, and $R^2$
* A QQ-Plot, to see if assumptions around residuals hold

### Step 5: Refine the Model, and Evaluate the results

Based on model outputs and other data such as correlation, try to improve the model.

Remove terms that look unhelpful. After a term is removed, evaluate the model on the validation and testing sets.