# Part 2: machine learning

## Outline
Now that you have successfully prepared the data, you will train a machine learning model to <b> predict the number of bike shares </b> `cnt`.

1. load prepared data
2. specify target value and training attributes
3. split data into train and test sets
4. train and predict
5. performance testing

<img src="Figures/bike.png" alt="Paris" width="600" style="float:left"> 

<p style="font-size:1vw; color:#808080">Powered by TfL Open Data, 
Contains OS data © Crown copyright and database rights 2016 and Geomni UK Map data ©<br> and database rights [2019]<br>
https://www.kaggle.com/hmavrodiev/london-bike-sharing-dataset/metadata</p>

## import necessary libraries

The model you will train is a <b> decision tree </b>. For this, import the scikit-learn model with `from sklearn.tree import DecisionTreeRegressor` and the splitting helper with `from sklearn.model_selection import train_test_split`. You will later test the performance with the mean squared error (MSE), so also import `from sklearn.metrics import mean_squared_error`.

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
...

In [None]:
sns.set_style('whitegrid')
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

## load prepared data
Use `pd.read_csv(...)` to load your data.

In [None]:
df = ...
df

## set target value and attributes to train on

The value you want to predict is `cnt`, the count of bike shares at a specific time of the year. Define `y` as the target value vector. The training attributes go into the matrix `X`. Choose the attributes `t1`, `t2`, `hum`, `wind_speed`, `weather_code`, `is_holiday`, `is_weekend`, `season`, `month`, `hour` for your matrix `X` by using `df[[column1,column2,..]]`.

In [None]:
y = ...
X = ...
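
The difference between single and double brackets can be sketched on a toy frame. The frame `toy` and its three rows are made up for illustration and are not the bike-sharing data; only the column names are borrowed from the text above.

```python
import pandas as pd

# Hypothetical toy frame, NOT the prepared bike-sharing data.
toy = pd.DataFrame({
    "t1": [3.0, 2.5, 4.0],
    "hum": [93.0, 96.5, 81.0],
    "cnt": [182, 138, 134],
})

target = toy["cnt"]            # single brackets: one column as a 1-D Series
features = toy[["t1", "hum"]]  # list of names: several columns as a 2-D DataFrame

print(type(target).__name__, target.shape)
print(type(features).__name__, features.shape)
```

scikit-learn expects the attributes as a 2-D table and the target as a 1-D vector, which is why `X` is selected with a list of column names and `y` with a single name.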

## split data into train and test set

Since you need 'untouched' data to test your algorithm later, split the data into two parts: <b> one to train </b> and <b> one to test </b>. Use the function `train_test_split(X,y,train_size,test_size)` with `train_size=0.95` and `test_size=0.05` to do so.

In [None]:
X_train, X_test, y_train, y_test = ...
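
As a rough illustration of what the split returns, here is a sketch on synthetic arrays. The names `X_toy` and `y_toy` are hypothetical, and `random_state=42` is an added assumption that only makes the shuffle reproducible; the exercise text does not ask for it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 samples with 2 attributes each.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# 95 % of the rows go to training, the remaining 5 % are held back for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.95, test_size=0.05, random_state=42
)

print(X_tr.shape, X_te.shape)
print(y_tr.shape, y_te.shape)
```

The rows are shuffled before splitting, and every sample ends up in exactly one of the two parts.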

## train and predict

Here all the magic happens. Use the scikit-learn class `DecisionTreeRegressor(random_state=42)` to create the decision tree. Then `decisiontree.fit(X_train, y_train)` does the training.

More information on the really cool decision tree algorithm by scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html

In [None]:
decisiontree = ...

my_decisiontree = ...
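
The create-then-fit pattern can be sketched on a tiny made-up dataset. `X_toy`, `y_toy`, `tree` and `fitted` are hypothetical names, and the numbers are invented.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: one attribute, six samples.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([10.0, 12.0, 30.0, 33.0, 5.0, 7.0])

tree = DecisionTreeRegressor(random_state=42)  # untrained model
fitted = tree.fit(X_toy, y_toy)                # training; fit() returns the estimator itself

print(fitted is tree)
print(fitted.predict(X_toy))
```

Because `fit` returns the same object it was called on, both names refer to one trained model. With the default settings, the tree keeps splitting until it reproduces these training targets exactly. That is why a separate test set is needed to judge it.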

Tadaa. The variable `my_decisiontree` now knows how your data works; it is your precious decision-making tool. Look at 3 random instances of your test set, using `X_test.sample(3)`.

In [None]:
X_sample = ...
X_sample

According to the respective attributes, how many bike shares `cnt` would you estimate? 

Let's see what your model says: Use `my_decisiontree.predict(X_sample)` to get the predictions of the decision tree.

In [None]:
...

Who was closer to the truth? Use `y_test[X_sample.index]` to see the true bike shares stored in your test set.

In [None]:
pd.DataFrame(...)
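
The index lookup works because a sampled frame keeps the row labels of the rows it was drawn from. Here is a minimal sketch with invented numbers; `y_true`, `X_sample` and `predictions` are hypothetical stand-ins, and `.loc` is used as an explicit label-based variant of the bracket lookup.

```python
import numpy as np
import pandas as pd

# Hypothetical true values with non-consecutive row labels, as left over after a shuffle.
y_true = pd.Series([120, 85, 300, 42], index=[17, 3, 256, 90], name="cnt")

# Hypothetical sample of two rows; only the row labels matter here.
X_sample = pd.DataFrame({"hour": [8, 23]}, index=[256, 3])

# Stand-in for the array a model's predict() would return for X_sample.
predictions = np.array([310.0, 80.0])

comparison = pd.DataFrame(
    {"predicted": predictions, "true": y_true.loc[X_sample.index]},
    index=X_sample.index,
)
print(comparison)
```

Each prediction sits next to the true value with the same row label, regardless of the order in which the rows were sampled.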

Probably the estimates will be a few counts away from the true value. How well does the model perform in general?

## performance testing

Test your model on a broader scale. First, predict the bike shares for the whole test set `X_test` with `y_predict = my_decisiontree.predict(...)`.

In [None]:
...

In [None]:
x = range(len(y_predict))
plt.figure(figsize = (8,6))
plt.plot(x,y_predict,'.',label='predictions')
plt.plot(x,y_test,'.',label='true bike shares')
plt.ylabel('bike shares')
plt.xlabel('samples')
plt.legend()
plt.show()

Use the function `mean_squared_error(y_test,y_predict, squared=False)` to evaluate the root mean squared error (RMSE).

In [None]:
RMSE = ...
print("RMSE = ", RMSE)

How bad is this? Determine the ratio between the `RMSE` and the mean value of `cnt` using the function `mean()` and the division operator `/`.

In [None]:
ratio = ...
ratio
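
What the RMSE and the ratio measure can be worked through by hand on four invented values. This sketch uses plain NumPy instead of scikit-learn. It may also serve as a fallback if your installed scikit-learn no longer accepts the `squared` argument, since recent releases deprecate it in favour of a separate `root_mean_squared_error` function; check this against your installed version.

```python
import numpy as np

# Hypothetical true and predicted counts.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 9.0, 9.0])

squared_errors = (y_true - y_pred) ** 2  # 1, 0, 4, 0
mse = squared_errors.mean()              # (1 + 0 + 4 + 0) / 4 = 1.25
rmse = np.sqrt(mse)                      # root of the MSE, in the same unit as the target

ratio = rmse / y_true.mean()             # error relative to the mean target value of 6.0

print("MSE =", mse)
print("RMSE =", rmse)
print("ratio =", ratio)
```

Taking the root brings the error back to the unit of `cnt`. Dividing by the mean expresses it as a fraction of a typical value.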

## improve

Can your decision tree do better? There are many parameters that you can adjust in a decision tree.
Take the parameter `min_samples_leaf`, for example: the minimum number of samples required to be at a leaf node. By default this parameter is set to `1`. Test it with `min_samples_leaf=10`.

In [None]:
decisiontree = DecisionTreeRegressor(random_state=42, min_samples_leaf=...)

model_fit = decisiontree.fit(X_train, y_train)
y_predict = model_fit.predict(X_test)

RMSE = mean_squared_error(y_test, y_predict, squared=False)
print("RMSE = ", RMSE)

Indeed, this is already much better. But what would be the optimal value for `min_samples_leaf`?

In [None]:
error = []
# try every min_samples_leaf from 1 to 99 and record the RMSE on the test set
for i in range(1,100):
    decisiontree = DecisionTreeRegressor(random_state=42, min_samples_leaf=i)

    model_fit = decisiontree.fit(X_train, y_train)
    y_predict = model_fit.predict(X_test)
    RMSE = mean_squared_error(y_test, y_predict, squared=False)
    error.append(RMSE)

In [None]:
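plt.figure(figsize = (8,6))
# error[0] belongs to min_samples_leaf=1, so plot against the tested values 1..99
plt.plot(range(1, len(error) + 1), error)
plt.ylabel('RMSE')
plt.xlabel('minimum sample leaf')
plt.show()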

We see that the error first decreases strongly as the parameter grows. However, for `min_samples_leaf > 10` the error starts rising again. Where is the optimal (= minimum) error?

In [None]:
# error[k] belongs to min_samples_leaf = k + 1, because the loop started at 1
index = int(np.argmin(error))
print("The minimum error is", error[index])
print("at a minimum sample leaf = ", index + 1)

The conclusion: a `min_samples_leaf` of about 8 gives the minimum error. Try it with the optimized parameter!

In [None]:
decisiontree = ...

model_fit = ...
y_predict = ...

RMSE = ...
print("RMSE = ", RMSE)
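
One caveat: the search above picks `min_samples_leaf` by looking at the test set, so the final RMSE is somewhat optimistic. A common alternative, which this notebook does not use, is to choose the parameter by cross-validation on the training data only. The following is a sketch with scikit-learn's `GridSearchCV` on synthetic data. `X_toy`, `y_toy`, the sine-shaped target and the grid `1..30` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy data standing in for the training set.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(300, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.3, size=300)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid={"min_samples_leaf": list(range(1, 31))},
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes scores, hence the sign
    cv=5,                                   # 5-fold cross-validation
)
search.fit(X_toy, y_toy)

best_leaf = search.best_params_["min_samples_leaf"]
best_rmse = -search.best_score_  # flip the sign back to a positive RMSE

print("best min_samples_leaf:", best_leaf)
print("cross-validated RMSE:", best_rmse)
```

With this approach the test set stays untouched until the very end, so the RMSE measured on it remains an honest estimate.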

Congratulations, you have <b> trained, tested and optimized your own machine learning model! </b> The stakeholders of the city of Vienna will be more than satisfied!

__________
Additional:

Try the same with another model, e.g. `RandomForestRegressor(random_state=42)`.

In [None]:
from sklearn.ensemble import RandomForestRegressor

randomforest = ...

my_randomforest = ...
y_predict = ...

RMSE = ...
print("RMSE = ", RMSE)