# Metro Dataset

URL: http://archive.ics.uci.edu/ml/machine-learning-databases/00477/

## Content

1) [Data preprocessing](#dataproc)

2) [Model training](#train)
    
2.a) [Linear regression](#linear)

2.b) [Lasso Regression](#lasso)

2.c) [kNN](#knn)

2.d) [Random Forest](#rf)

3) [Evaluation](#eval)

---

In [None]:
# Basic imports
import numpy as np
import pandas as pd
import glob
import math
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

# models for linear regression
from sklearn.linear_model import LinearRegression
from sklearn import linear_model

# models for Lasso regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# statistic tools
from sklearn import metrics
from statistics import stdev

# preprocessing
from sklearn import preprocessing

# models for kNN
from sklearn.neighbors import KNeighborsRegressor




---

<a id='dataproc'></a>

# 1) Data preprocessing

In [None]:
input_file = 'Metro.csv'
df_raw = pd.read_csv(input_file,  sep = ',', header = 0)
df_raw

# Description of data columns

holiday ... Name of the US holiday, if any (e.g. Columbus Day, Veterans Day, Labor Day); "None" otherwise

temp ... Temperature in kelvin [K]

rain_1h ... Amount in mm of rain that occurred in the hour

snow_1h ... Amount in mm of snow that occurred in the hour

clouds_all ... Numeric Percentage of cloud cover

weather_main ... text description of the current weather situation (Clouds, Clear, Rain, Drizzle, Mist, Fog, Thunderstorm, Snow, Haze)

weather_description ... more detailed text description of the current weather situation

date_time ... DateTime Hour of the data collected in local CST time

traffic_volume ... Numeric Hourly I-94 ATR 301 reported westbound traffic volume

In [None]:
col_dict = {'holiday':  'holiday', 
            'temp':  'Temperature',
            'rain_1h':  'mm rain in one hour',
            'snow_1h':  'mm snow in one hour',
            'clouds_all':  'Percentage of clouds',
            'weather_main':  'current weather situation',
            'weather_description':  'more detailed current weather situation',
            'date_time': 'date time',
            'traffic_volume': 'traffic volume',
            'koff':'koff',
            'year': 'year',
            'mounth': 'mounth',
            'day': 'day',
            'hour': 'hour'
           }

In [None]:
def display_all(df):
    with pd.option_context("display.max_rows", 1000): 
        with pd.option_context("display.max_columns", 1000): 
            display(df)
            
def add_RD(df):
    df['RD'] = df.apply(lambda row: row.RS - row.RA, axis = 1) 

# First look at the data and its summary information

In [None]:
display_all(df_raw.tail().transpose())
print('#'*40)
display('Some more info')
print('#'*40)
df_raw.info()  # info() prints its summary itself and returns None

# Convert the dataset

In [None]:
df_raw

# Split the date column into four separate columns and delete the original column

In [None]:
df_raw["date_time"] = pd.to_datetime(df_raw.date_time)
df_raw['year'] = df_raw.date_time.dt.year
df_raw['mounth'] = df_raw.date_time.dt.month
df_raw['day'] = df_raw.date_time.dt.day
df_raw['hour'] = df_raw.date_time.dt.hour
df_raw = df_raw.drop("date_time", axis = 1)
df_raw

# Finally, convert the remaining columns

In [None]:
weather_cond = df_raw.weather_main.unique() # stores the distinct values of the weather_main column
weather_dis = df_raw.weather_description.unique() # stores the distinct values of the weather_description column

# Encode each weather_main category as its integer index in weather_cond
for i in range(len(weather_cond)):
    df_raw.loc[df_raw.weather_main == weather_cond[i], 'weather_main'] = i

# Encode each weather_description category as its integer index in weather_dis
for i in range(len(weather_dis)):
    df_raw.loc[df_raw.weather_description == weather_dis[i], 'weather_description'] = i

# holiday becomes a binary flag: 0 = no holiday, 1 = any holiday
df_raw.loc[df_raw.holiday == "None", 'holiday'] = 0
df_raw.loc[df_raw.holiday != 0, 'holiday'] = 1
# Replace implausible outliers: temperatures below 100 K and more than 100 mm rain per hour
df_raw.loc[df_raw.temp < 100, 'temp'] = np.mean(df_raw.temp)
df_raw.loc[df_raw.rain_1h > 100, 'rain_1h'] = 0




## Add the koff column

We can now add another column, "koff", which combines the three columns "clouds_all", "weather_main" and "weather_description". To do that, we have to scale the values of these columns so that each of them contributes with equal weight.
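To illustrate why the scaling matters, the following sketch applies min-max scaling, $x' = (x - x_{min}) / (x_{max} - x_{min})$, to two columns with very different ranges before adding them. The values and the helper name `min_max_scale` are made up for this example and are not taken from the Metro data.

```python
import numpy as np

def min_max_scale(values):
    """Scale a 1-D array linearly to the range [0, 1]."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # A constant column carries no information; map it to zeros.
        return np.zeros_like(values)
    return (values - values.min()) / span

# Made-up example columns: a percentage (0..100) and an integer category code (0..3)
clouds = np.array([0, 40, 90, 100])
weather_code = np.array([0, 1, 3, 2])

clouds_scaled = min_max_scale(clouds)
weather_scaled = min_max_scale(weather_code)

# After scaling, both columns contribute on the same [0, 1] scale to the sum
combined = clouds_scaled + weather_scaled
print(combined)
```

Without the scaling step, the percentage column would dominate the sum simply because its numbers are larger.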

In [None]:
df_raw_koff = df_raw.copy()  # work on a copy instead of an alias of df_raw
scaler = preprocessing.MinMaxScaler()

# Scale the three weather columns to [0, 1] and sum them up into "koff"
koff_cols = ["clouds_all", "weather_main", "weather_description"]
df_raw_koff[koff_cols] = scaler.fit_transform(df_raw_koff[koff_cols])
df_raw_koff["koff"] = df_raw_koff.clouds_all + df_raw_koff.weather_main + df_raw_koff.weather_description

float_cols = ['holiday','temp','rain_1h','snow_1h','traffic_volume']
df_raw_koff[float_cols] = df_raw_koff[float_cols].astype(float)
# df_raw keeps the scaled weather columns but not the combined "koff" column
df_raw = df_raw_koff.drop("koff", axis = 1)

In [None]:
df_raw
#df_raw_koff

# Restrict the raw dataset to the rush hour

As another approach, we could restrict the regression to the rush hour, which is every day from 8 to 10 am and from 15 to 17.

In [None]:
rushour = [8,9,10,15,16,17]

# Keep only the rows whose hour falls into the rush hour
df_raw_rush = df_raw[df_raw.hour.isin(rushour)].copy()
df_raw_rush

# Preprocessing for random forest

In [None]:
# Split into train and test
def split_simple(df, n): 
    '''n... number to split at'''
    return df[:n].copy(), df[n:].copy()

# Bootstrapping:

Bootstrapping: Selecting data from a dataset to generate a new dataset of the same size by picking WITH replacement.

Example:

    > DS = [1,2,3,4]
    > could turn into 
    > DS_bootstrapped = [3,2,4,4]
    
Consequences:

- Instances (rows) of the original set can end up duplicated (multiple times) in the resulting dataset.
- Some instances are left out entirely (on average about 1/3 of them, more precisely a fraction of about $1/e \approx 0.37$ for large datasets) --> "Out-Of-Bag Dataset" (=OOB Dataset)
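The two consequences listed above can be checked with a short sketch. It draws one bootstrap sample from a made-up index range (the size `n = 10_000` and the seed are assumptions for this example) and measures how many rows end up out-of-bag.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is reproducible

n = 10_000
indices = np.arange(n)

# Draw n indices WITH replacement: this is one bootstrap sample
bootstrap_idx = rng.choice(indices, size=n, replace=True)

# Rows that were never drawn form the out-of-bag (OOB) set
oob_mask = ~np.isin(indices, bootstrap_idx)
oob_fraction = oob_mask.mean()

# Rows drawn more than once are the duplicated instances
_, counts = np.unique(bootstrap_idx, return_counts=True)
num_duplicated = int((counts > 1).sum())

print(f"OOB fraction: {oob_fraction:.3f}")
print(f"1/e         : {np.exp(-1):.3f}")
print(f"rows drawn more than once: {num_duplicated}")
```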

## Using the OOB Dataset

The OOB dataset was not used to construct the tree, so we can actually use it to test our tree and gain some insight into the error measure of the tree.
This error is called the "Out-Of-Bag Error" (OOB error).
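In scikit-learn, the forest can compute this measure itself when it is created with `oob_score=True`. The sketch below uses synthetic data (an assumption for this example, not the Metro data) and reads the result from the `oob_score_` attribute, which for `RandomForestRegressor` is the $R^2$ of the out-of-bag predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(seed=0)

# Synthetic regression problem (made up for this sketch)
X_demo = rng.uniform(0, 10, size=(500, 3))
y_demo = 2.0 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(0, 0.5, size=500)

# oob_score=True makes the forest evaluate each tree on the rows it did not see
rf_demo = RandomForestRegressor(n_estimators=100, bootstrap=True,
                                oob_score=True, random_state=0)
rf_demo.fit(X_demo, y_demo)

oob_r2 = rf_demo.oob_score_               # R^2 from out-of-bag predictions
train_r2 = rf_demo.score(X_demo, y_demo)  # R^2 on the training data itself

print(f"OOB R^2:   {oob_r2:.3f}")
print(f"Train R^2: {train_r2:.3f}")
```

The OOB score is typically lower than the training score because it is computed on rows the individual trees have not seen.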

## Three different kinds of datasets

1) df_raw, which is the full dataset after the conversion steps above

2) df_raw_rush, which contains only the rows that fall into the rush hour

3) df_raw_koff, to which we add another column, "koff", that combines the coefficients derived from the weather description.

In [None]:
df_raw_koff

# Preprocessing LinReg

For the linear regression we are interested in columns that depend linearly on each other. Our target is "traffic_volume", and we now try to find columns that show a linear relationship with it.
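Besides looking at scatter plots, one quick numerical check for a linear relationship is the Pearson correlation with the target. The sketch below uses a synthetic DataFrame; the column names `feature_linear`, `feature_noise` and `target` are hypothetical and only serve this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n = 500

# Synthetic data: one feature drives the target linearly, the other is unrelated
demo = pd.DataFrame({
    "feature_linear": rng.uniform(0, 1, n),
    "feature_noise": rng.uniform(0, 1, n),
})
demo["target"] = 5.0 * demo["feature_linear"] + rng.normal(0, 0.2, n)

# Pearson correlation of each feature with the target, strongest first
corr_with_target = (
    demo.corr()["target"]
    .drop("target")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```

A correlation close to +1 or -1 indicates a strong linear relationship, while a value near 0 suggests that a straight line will explain little of the target.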

In [None]:
df_raw = df_raw_koff
# Note: the "analysis" folder has to exist before the figures can be saved
sns_plot = sns.lmplot(x="temp", y="traffic_volume", data=df_raw).set(title = 'temp vs traffic_volume')
sns_plot.savefig("analysis/temp vs traffic_volume_koff.png")

sns_plot = sns.lmplot(x="rain_1h", y="traffic_volume", data=df_raw).set(title = 'rain_1h vs traffic_volume')
sns_plot.savefig("analysis/rain_1h vs traffic_volume_koff.png")

sns_plot = sns.lmplot(x="snow_1h", y="traffic_volume", data=df_raw).set(title = 'snow_1h vs traffic_volume')
sns_plot.savefig("analysis/snow_1h vs traffic_volume_koff.png")

#sns_plot = sns.lmplot(x="clouds_all", y="traffic_volume", data=df_raw).set(title = 'clouds_all vs traffic_volume')
#sns_plot.savefig("analysis/clouds_all vs traffic_volume_koff.png")

#sns_plot = sns.lmplot(x="weather_main", y="traffic_volume", data=df_raw).set(title = 'weather_main vs traffic_volume')
#sns_plot.savefig("analysis/weather_main vs traffic_volume_koff.png")

#sns_plot = sns.lmplot(x="weather_description", y="traffic_volume", data=df_raw).set(title = 'weather_description vs traffic_volume')
#sns_plot.savefig("analysis/weather_description vs traffic_volume_koff.png")

sns_plot = sns.lmplot(x="koff", y="traffic_volume", data=df_raw).set(title = 'koff vs traffic_volume')
sns_plot.savefig("analysis/koff vs traffic_volume_koff.png")

#drop_col = ["holiday","clouds_all","weather_main","year","day","hour","mounth"]
#df_lin = df_raw.drop(drop_col, axis=1)
#df_knn = df_lin

#fig = sns_hist.get_figure()
#fig.savefig("output.png")


In [None]:
df_raw_rush

# Preprocessing LassoReg

In [None]:
scale_cols = ['holiday','temp','rain_1h','snow_1h','traffic_volume']

# Work on copies so that the scaling does not silently modify the source DataFrames
df_raw_lasso = df_raw.copy()
df_raw_lasso[scale_cols] = scaler.fit_transform(df_raw_lasso[scale_cols])

df_raw_rush_lasso = df_raw_rush.copy()
df_raw_rush_lasso[scale_cols] = scaler.fit_transform(df_raw_rush_lasso[scale_cols])

df_raw_koff_lasso = df_raw_koff.copy()
df_raw_koff_lasso[scale_cols] = scaler.fit_transform(df_raw_koff_lasso[scale_cols])
df_raw_koff_lasso

# Preprocessing kNN

---

<a id='train'></a>

# 2) Model training
---

<a id='linear'></a>

# Linear Regression explanation

Linear regression is a type of regression analysis in which the target is modelled as a linear function of the input variables. The goal is to find the coefficients that minimize the cost function

$J = \frac{1}{n}\sum_{i = 1}^{n}(y_i-\hat{y}_i)^2$

For a single input variable $x$, this means calculating the two coefficients of the equation $\hat{y} = a_0 + a_1 x$.

One way to do that is the gradient descent method. scikit-learn's `LinearRegression`, which is used below, solves the least-squares problem directly instead.
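To illustrate how gradient descent minimizes the cost $J$ for a single input variable, here is a minimal sketch on synthetic data. The data, the learning rate and the number of iterations are assumptions chosen for this example; the result is compared with the closed-form least-squares fit from `np.polyfit`.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic data from y = 1 + 2x plus noise (made up for this sketch)
x = rng.uniform(0, 1, size=200)
y = 1.0 + 2.0 * x + rng.normal(0, 0.1, size=200)

a0, a1 = 0.0, 0.0      # initial coefficients
learning_rate = 0.5    # step size, chosen by hand for this example
n = len(x)

for _ in range(2000):
    y_hat = a0 + a1 * x
    residual = y_hat - y
    # Partial derivatives of J = (1/n) * sum((y - y_hat)^2)
    grad_a0 = (2.0 / n) * residual.sum()
    grad_a1 = (2.0 / n) * (residual * x).sum()
    a0 -= learning_rate * grad_a0
    a1 -= learning_rate * grad_a1

# Closed-form least-squares fit for comparison (returns slope, then intercept)
a1_ls, a0_ls = np.polyfit(x, y, deg=1)

print(f"gradient descent: a0={a0:.3f}, a1={a1:.3f}")
print(f"least squares:    a0={a0_ls:.3f}, a1={a1_ls:.3f}")
```

Both approaches minimize the same cost function, so they should agree once gradient descent has converged.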

In [None]:
 #	holiday 	temp 	rain_1h 	snow_1h 	clouds_all 	weather_main 	weather_description 	traffic_volume 	year 	mounth 	day 	hour 	koff
Y = df_raw.traffic_volume
X = df_raw[['holiday','snow_1h','temp','rain_1h','clouds_all','weather_main','weather_description','year','mounth','hour']]
#X = df_raw[['rain_1h','weather_main']]

In [None]:
Y

# a) Linear Regression

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.3, random_state = 0)

In [None]:
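# The `normalize` argument was removed in scikit-learn 1.2; for ordinary least squares it does not change the predictions
linreg = LinearRegression()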
linreg.fit(X_train,Y_train)

In [None]:
print("coefficients: ", linreg.coef_)
score = linreg.score(X_test,Y_test)
print("score: ",score)
Y_lin_pred = linreg.predict(X_train)

<a id='lasso'></a>

# b) Lasso Regression

For the lasso regression we have to minimize the following cost function:

$J = \frac{1}{n}\sum_{i = 1}^{n}\bigg(y_i-\sum_{j = 0}^p \omega_j \times x_{ij}\bigg)^2+\alpha \sum_{j = 1}^p|\omega_j|$

where $x_{i0} = 1$, so that $\omega_0$ is the intercept, and $\alpha$ is the weight of the penalty term that lasso adds to the cost function of the linear regression. The grid search below tries every value in the list of candidate parameters and keeps the one with the best cross-validated score.
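The effect of $\alpha$ can be seen in a small sketch on synthetic data, where only two of five features influence the target. The data and the list of $\alpha$ values are assumptions for this example. Note that scikit-learn's `Lasso` scales the squared-error term by $\frac{1}{2n}$ rather than $\frac{1}{n}$, so its $\alpha$ values are not directly comparable with the formula above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(seed=0)

# Synthetic data: only the first two of five features influence the target
X_demo = rng.normal(size=(300, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0, 0.1, size=300)

nonzero_counts = {}
for alpha in [0.001, 0.1, 1.0, 10.0]:
    model = Lasso(alpha=alpha).fit(X_demo, y_demo)
    # Count coefficients that the L1 penalty has not driven to exactly zero
    nonzero_counts[alpha] = int(np.sum(model.coef_ != 0))
    print(f"alpha={alpha:<6} coef={np.round(model.coef_, 3)}")

print(nonzero_counts)
```

A larger $\alpha$ shrinks the coefficients more strongly and sets more of them to exactly zero, which is why lasso can also act as a feature selection step.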

In [None]:
Y = df_raw_lasso.traffic_volume
X = df_raw_lasso[['holiday','snow_1h','temp','rain_1h','clouds_all','weather_main','weather_description','year','mounth','hour']]

In [None]:
#Y = df_raw_lasso.traffic_volume
#X = df_raw_lasso[['temp','rain_1h','koff']]

X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.3, random_state = 0)

In [None]:
# The `normalize` argument was removed in scikit-learn 1.2; scale the features beforehand if needed
lasso = Lasso()
parameters = {'alpha':[1e-15,1e-10,1e-8,1e-3,1e-2,1,2,5,10,20,30,35,40,45,50,55,100]}
lasso_regressor = GridSearchCV(lasso,parameters,scoring = 'neg_mean_squared_error',cv = 5)

In [None]:
lasso_regressor.fit(X_train,Y_train)
print(lasso_regressor.best_params_)
print(lasso_regressor.best_score_)

In [None]:
Y_lasso_pred = lasso_regressor.predict(X_test)

<a id='knn'></a>

# c) kNN

In [None]:
#df_knn = df_prep_rf

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0,1))

X_train_knn_scaled = scaler.fit_transform(X_train)
X_train_knn = pd.DataFrame(X_train_knn_scaled)

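# Reuse the scaling fitted on the training data; refitting on the test data would scale both sets differently
X_test_knn_scaled = scaler.transform(X_test)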
X_test_knn = pd.DataFrame(X_test_knn_scaled)

In [None]:
rmse_val_knn = [] # to store rmse values for different k
for k in range(1, 37):
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(X_train_knn, Y_train)
    pred = model.predict(X_test_knn)
    error = np.sqrt(metrics.mean_squared_error(Y_test, pred))
    rmse_val_knn.append(error)
    print("RMSE for k={}: {}".format(k, error))
    print("R^2 for k={}: {}\n".format(k, model.score(X_test_knn, Y_test)))

In [None]:
plt.figure(figsize=(20,10))
plt.plot(range(1,37), rmse_val_knn, color='blue', linestyle='dashed', marker='o',
        markerfacecolor='red', markersize=5)
plt.title('RMSE vs. k-Value')
plt.xlabel('k')
plt.ylabel('RMSE')
plt.savefig('analysis/RMSE vs k-Value')

<a id='rf'></a>

# d) Random Forest

In [None]:
# Imports for RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
from sklearn.model_selection import train_test_split
from IPython.display import display

In [None]:
import math
def rmse(x,y): 
    return math.sqrt(((x-y)**2).mean())

def print_score(m, X_train, X_valid, y_train, y_valid, score=''):
    res = {
        'RMS(train)': rmse(m.predict(X_train), y_train),
        'RMS(valid)': rmse(m.predict(X_valid), y_valid)}
    if score=='neg_mean_squared_error':
        res['Model_Score=r²'] = [np.sqrt(-m.score(X_train, y_train)), np.sqrt(-m.score(X_valid, y_valid))]
    elif score=='pos_mean_squared_error':
        res['Model_Score=r²'] = [np.sqrt(m.score(X_train, y_train)), np.sqrt(m.score(X_valid, y_valid))]
    else:
        res['Model_Score=r²'] = [m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'): res['oob_score_'] = m.oob_score_
    display(res)
    return res

# Feature importance
from prettytable import PrettyTable as PT # pip install PTable
def print_RF_featureImportance(rf, X):
    table = PT()
    table.field_names = ['Feature', 'Score', 'Comment']
    for name, score in zip(X.columns.values, rf.feature_importances_):
        print(f"{name}: {score:.5f}\t\t... {col_dict[name]}")
        table.add_row([name, round(score, ndigits=4), col_dict[name]])
    print(table)

def print_GridSearchResult(grid):
    print(grid.best_params_)
    print(grid.best_estimator_)

In [None]:
# Split for random forest
df = df_raw
rnd_state = 42
ratio = 0.2 # test/num_samples
#####
num_instances, _ = df.shape
print(f"From {num_instances} using {num_instances*ratio:.0f} for testing and {num_instances*(1-ratio):.0f} for training. Ratio = {ratio*100:.2f}%")
#X, y = (d.drop(['W', 'RD'], axis=1), df.W)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = ratio, random_state = rnd_state)
display(Y)

In [None]:
before = 0

In [None]:
# Simple training of RFRegressor
n_cores = 4
rf_W = RandomForestRegressor(n_jobs=n_cores)
# All string columns were converted to numbers during preprocessing, so the fit runs on numeric data only
rf_W.fit(X_train, Y_train)
print("Before:")
display(before)#
print("Now:")
before = print_score(rf_W, X_train, X_test, Y_train, Y_test)


In [None]:
print_RF_featureImportance(rf_W, X_train)

In [None]:
rf_W_prediction = rf_W.predict(X_test)

In [None]:
n_cores = 4
number_of_trees = 100 # default = 100
rf = RandomForestRegressor(n_jobs=n_cores, n_estimators=number_of_trees, bootstrap=True) #, verbose=1)

rf.fit(X_train, Y_train)
print("Before:")
display(before)#
print("Now:")
before = print_score(rf, X_train, X_test, Y_train, Y_test)
print()
print("Feature importance")
print_RF_featureImportance(rf, X_train)
rf_RD = rf

In [None]:
rfRD_prediction = rf_RD.predict(X_test)

# Optimize Hyperparameters via GridSearch

Instead of tuning the hyperparameters by hand, we let a grid search try the combinations for us.

## Notes on the RandomForestRegressor from scikit-learn
The default values for the parameters controlling the size of the trees
(e.g. ``max_depth``, ``min_samples_leaf``, etc.) lead to fully grown and
unpruned trees which can potentially be very large on some data sets. To
reduce memory consumption, the complexity and size of the trees should be
controlled by setting those parameter values.
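The following sketch illustrates the note above on synthetic data (made up for this example). It compares the size of a forest with default parameters against one that limits `max_depth` and `min_samples_leaf`; the helper `forest_size` and the chosen limits are assumptions, not tuned values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(seed=0)

# Synthetic regression problem (made up for this sketch)
X_demo = rng.uniform(0, 10, size=(1000, 4))
y_demo = X_demo[:, 0] * X_demo[:, 1] + rng.normal(0, 1.0, size=1000)

def forest_size(**params):
    """Fit a small forest and return (maximum tree depth, total node count)."""
    forest = RandomForestRegressor(n_estimators=10, random_state=0, **params)
    forest.fit(X_demo, y_demo)
    depth = max(tree.get_depth() for tree in forest.estimators_)
    nodes = sum(tree.tree_.node_count for tree in forest.estimators_)
    return depth, nodes

full_depth, full_nodes = forest_size()
limited_depth, limited_nodes = forest_size(max_depth=5, min_samples_leaf=10)

print(f"default parameters: depth={full_depth}, nodes={full_nodes}")
print(f"limited trees:      depth={limited_depth}, nodes={limited_nodes}")
```

Whether the smaller trees also predict well enough has to be checked separately, for example with the grid search described in this section.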

## Number of variables/features per tree --> 'max_features'

A good starting point is/might be: *the square root of the number of features presented to the tree*. Then, test some values below and above that starting point.

## Number of trees in the forest --> 'n_estimators'

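More trees generally give more stable predictions, at the cost of longer training time. Beyond a certain number of trees the improvement levels off.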

In [None]:
from numpy import sqrt
num_features = X.shape[1]
print(num_features)
sqrt_num_features = round(sqrt(num_features), 0)
sqrt_num_features

In [None]:
from sklearn.model_selection import GridSearchCV
n_cores = 4
# but since we don't have that many features, we simply brute-force all values
#[3, 10, 30, 100, 1000]
param_grid = [
    {
        'n_estimators': [3, 10, 20,30], 'max_features': [i for i in range(1,num_features+1)]
    }
#,{'bootstrap': [False], 'n_estimators': [3, 30, 100, 1000], 'max_features': [2, 3, 4]},
]
k = 10
forest_reg = RandomForestRegressor(n_jobs=n_cores)
grid_search = GridSearchCV(forest_reg, param_grid, n_jobs=n_cores , cv=k, scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(X_train, Y_train)


In [None]:
print_GridSearchResult(grid_search)
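print(grid_search.scorer_)  # the scorer built from scoring='neg_mean_squared_error'
scores = grid_search.score(X_test, Y_test)  # negative MSE on the test set
print_score(grid_search, X_train, X_test, Y_train, Y_test, score='neg_mean_squared_error')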

# k-fold cross validation

In [None]:
from sklearn.model_selection import cross_val_score
from prettytable import PrettyTable

def display_scores(scores):
    print("Scores:", scores)
    table = PrettyTable()
    table.field_names = ['Run', 'Score']
    for i, score in enumerate(scores):
        table.add_row([i, round(score, 3)])
    print(table)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

In [None]:
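k = 5
model = rf_RD
scores = cross_val_score(model, X, Y, scoring="neg_mean_squared_error", cv=k)
# cross_val_score returns negative MSE values here; convert them to RMSE
rf_rmse_scores = np.sqrt(-scores)

In [None]:
display_scores(rf_rmse_scores)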

<a id='eval'></a>

# 3) Evaluation

# a) Linear Regression

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(Y_train, Y_lin_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(Y_train, Y_lin_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(Y_train, Y_lin_pred)))

In [None]:
#sns.distplot(Y_train - Y_lin_pred)

In [None]:
#sns_plot = sns.lmplot("snow_1h","traffic_volume",df_raw).set(title = 'temp vs traffic_volume_koff')
#sns_plot.savefig("analysis/snow_1h vs traffic_volume_koff.png")

sns_plot = sns.distplot(Y_train).set(title = 'linear regression vs train_data')
sns_plot = sns.distplot(Y_lin_pred)
#sns_plot.savefig("analysis/linear prediction vs train_data.png")

fig = sns_plot.get_figure()
fig.savefig("analysis/linear regression vs train_data")


# b) Lasso Regression

In [None]:
#sns.distplot(Y_train - Y_lasso_pred)

In [None]:
sns_plot = sns.distplot(Y_train).set(title = 'lasso regression vs train_data')
sns_plot = sns.distplot(Y_lasso_pred)

fig = sns_plot.get_figure()
fig.savefig("analysis/lasso regression vs train_data")

# c) kNN

# d) Random Forest

In [None]:
# Distribution of the residuals (test data minus prediction)
sns_plot = sns.distplot(Y_test-rfRD_prediction)
fig = sns_plot.get_figure()
fig.savefig("analysis/difference between test data and prediction")

In [None]:
# Distributions of prediction and test data in one figure
sns_plot = sns.distplot(rfRD_prediction)
sns_plot = sns.distplot(Y_test)

fig = sns_plot.get_figure()
# Separate file name so that the residual plot above is not overwritten
fig.savefig("analysis/test data vs prediction")

In [None]:
# Unfortunately this does not work yet: the dimensions do not match
# (Y_lasso_pred was computed on X_test, so it cannot be compared with Y_train)
#print('Mean Absolute Error:', metrics.mean_absolute_error(Y_train, Y_lasso_pred))  
#print('Mean Squared Error:', metrics.mean_squared_error(Y_train, Y_lasso_pred))  
#print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(Y_train, Y_lasso_pred)))

# Save model and DF

In [None]:
# Dump model
import joblib
import os

os.makedirs('tmp', exist_ok=True)
joblib.dump(rf_RD, "tmp/rf_RD.pkl")
# To load the model
# my_model_loaded = joblib.load("tmp/rf_RD.pkl")

In [None]:
import os
os.makedirs('tmp', exist_ok=True)
# Write and read back the same file
df_raw.to_feather('tmp/metro-raw')
df_raw = pd.read_feather('tmp/metro-raw')