# ML - Guideline

**Steps of a ML development pipeline**

1. Collect data
2. Split: train, validation, test (a separate validation set is optional when using cross-validation (CV) on the train set)
3. EDA (exploratory data analysis): look at the features and their distributions, clean the data, fill missing values, and check the correlations between features and the output as well as among the features.
4. Fit a very simple model as the baseline (e.g. a dummy classifier)
5. Use train and validation data (or just apply CV to train) to iteratively improve the model and find the best one (feature engineering, hyperparameter tuning, ...)
6. After steps 1-5, we end up with the best model (best features to use, best hyperparameter values)
7. Apply the best model to the test data to estimate how it will perform on new data (the test score should not vary too much from the best validation score from step 5)
8. By now, you know the transformations, encodings, etc. that you need. Retrain the best model on **all the data you have!!** (train+validation+test); then it is ready for deployment
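
The steps above can be sketched end to end on synthetic data. This is only a minimal illustration, not this notebook's actual model: the data from `make_regression`, the two candidate configurations and all variable names are assumptions made for the example, and the EDA step is skipped.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# Step 1: synthetic stand-in data (assumption: replaces the real data collection)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Step 2: hold out a test set; CV on the train part stands in for a validation set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)


def cv_mse(model):
    """Mean 5-fold cross-validated MSE on the training data."""
    scores = cross_val_score(model, X_train, y_train, cv=5,
                             scoring="neg_mean_squared_error")
    return -scores.mean()


# Step 4: baseline that always predicts the training mean
baseline_mse = cv_mse(DummyRegressor(strategy="mean"))

# Steps 5-6: compare candidate models with CV and keep the best one
candidates = {
    "shallow": RandomForestRegressor(max_depth=3, n_estimators=50, random_state=0),
    "deep": RandomForestRegressor(max_depth=None, n_estimators=50, random_state=0),
}
cv_scores = {name: cv_mse(model) for name, model in candidates.items()}
best_name = min(cv_scores, key=cv_scores.get)
best_model = candidates[best_name]

# Step 7: a single check on the untouched test set
best_model.fit(X_train, y_train)
test_mse = mean_squared_error(y_test, best_model.predict(X_test))

# Step 8: retrain the chosen model on all available data before deployment
best_model.fit(X, y)

print(f"baseline CV MSE: {baseline_mse:.1f}")
print(f"best candidate: {best_name}, CV MSE: {cv_scores[best_name]:.1f}, test MSE: {test_mse:.1f}")
```

The test set is touched exactly once, after the model choice is final, so the test score remains an honest estimate.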

# Load libraries and data

In [None]:
#import libraries
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
## with the inline backend, plots are shown automatically (no plt.show() needed)

import sklearn  # needed for sklearn.__version__ in the next cell
from sklearn import metrics
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             mean_squared_error, mean_squared_log_error, r2_score)
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import (FunctionTransformer, MinMaxScaler, OneHotEncoder,
                                   StandardScaler)

from friendly.jupyter import *

In [None]:
sklearn.__version__

In [None]:
path=os.getcwd()
path

## Data Overview

In [None]:
df_bike= pd.read_csv(path+'/train.csv', sep=",",parse_dates=True, index_col=0) #Attention: this file contains ALL the labelled data; it is split into train and validation sets later
df_test= pd.read_csv(path+'/test.csv', sep=",",parse_dates=True, index_col=0) #data to predict for the Kaggle submission

df_test.head() #in df_bike: casual + registered = count --> exclude casual and registered from the dataset



In [None]:
#Check shape of test and train
print(df_bike.shape, df_test.shape)

In [None]:
df_test.index

In [None]:
df_bike.info()

In [None]:
df_test.info()

- datetime - hourly date + timestamp --> only the first 19 days of each month (at least 2 full weeks, enough to see a possible weekly rental pattern per month)
- season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
- holiday - whether the day is considered a holiday
- workingday - whether the day is neither a weekend nor holiday
- weather
 - 1: Clear, Few clouds, Partly cloudy
 - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
 - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
 - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp - temperature in Celsius
- atemp - "feels like" temperature in Celsius
- humidity - relative humidity
- windspeed - wind speed
- casual - number of non-registered user rentals initiated
- registered - number of registered user rentals initiated
- count - number of total rentals --> sum of casual+registered

In [None]:
df_bike.isna().sum() #Perfect, no missing values, so no imputation is necessary

## append dataframe with year/month/day/hour 

In [None]:
def append(df):
    df['hour']=df.index.hour
    df['day']=df.index.weekday
    df['month']=df.index.month
    df['year']=df.index.year
    return df

df_list=[df_bike,df_test]
for i in df_list:
    i=append(i)

df_bike
df_test

## drop already unnecessary columns

In [None]:
def drop(df,columns):
    df.drop(columns,axis=1,inplace=True)
    return df

#df_bike
drop_columns=['casual','registered']
df_bike=drop(df_bike,drop_columns)
#reset_index
#df_bike=df_bike.reset_index(drop=True)
#df_test=df_test.reset_index(drop=True)
df_bike.head()
df_test.head()

In [None]:
#Check shape
df_bike.shape, df_test.shape #--> count still included in df_bike for EDA

# EDA

## statistical overview

In [None]:
plt.figure(figsize=(18,10))

sns.heatmap(df_bike.corr(),cbar=True,annot=True,cmap="Blues");
#atemp and temp are highly correlated --> nearly the same, so one of them is probably unnecessary
#count is correlated with temp/atemp and hour

In [None]:
#development of rented bikes
df_bike.groupby(['year','month'])[['count']].sum().plot.bar(subplots=True);

## season, holiday, workingday

In [None]:
#compare rentals by season, holiday and workingday
fig,ax=plt.subplots(1,3)

#take x and y from the same groupby result so that labels and bars cannot get out of order
season_sum=df_bike.groupby('season')['count'].sum()
holiday_mean=df_bike.groupby('holiday')['count'].mean()
workingday_mean=df_bike.groupby('workingday')['count'].mean()

sns.barplot(x=season_sum.index,y=season_sum.values,ax=ax[0]).set(title='Season Sum')
sns.barplot(x=holiday_mean.index,y=holiday_mean.values,ax=ax[1]).set(title='Holiday mean')
sns.barplot(x=workingday_mean.index,y=workingday_mean.values,ax=ax[2]).set(title='Workingday mean');
#Season has an influence, but holiday/working day make little difference to the average rentals



## Weather, Humidity, Wind and Temperature

In [None]:
df_bike[['weather','temp','atemp','windspeed','humidity']].describe()

In [None]:
fig,ax=plt.subplots(1,4,figsize=(20,5)) #set the size here; a separate plt.figure() call would only open a second, empty figure
sns.histplot(data=df_bike['weather'],ax=ax[0]);
sns.histplot(data=df_bike['atemp'],ax=ax[1],discrete=True) #atemp and temp share one axis for comparison
sns.histplot(data=df_bike['temp'],ax=ax[1],discrete=True)
sns.histplot(data=df_bike['windspeed'],ax=ax[2],discrete=True)
sns.histplot(data=df_bike['humidity'],ax=ax[3],discrete=True);

In [None]:
#weather
df_bike.groupby(['weather'])[['count']].mean().plot.bar(subplots=True);
#high influence --> the better the weather, the higher the rentals

In [None]:
#windspeed
df_bike['wind_bin']=pd.cut(df_bike['windspeed'],[-1,10,20,30,40,50,60])
df_bike.groupby(['wind_bin'])[['count']].mean().plot.bar(subplots=True);
#average rentals are not that much influenced by the windspeed

In [None]:
df_bike['Hum_bin']=pd.cut(df_bike['humidity'],[0,20,40,60,80,100])
df_bike
df_bike.groupby(['Hum_bin'])[['count']].mean().plot.bar();
#average rentals decrease when the humidity gets high

In [None]:
#temp
df_bike['temp_bin']=pd.cut(df_bike['temp'],[-5,0,10,20,30,40,45])
df_bike
df_bike.groupby(['temp_bin'])[['count']].mean().plot.bar();
#average rentals increase as it gets warmer

In [None]:
#compare temp and atemp with each other
#fig,ax=plt.subplots(figsize=(20,10))
#sns.lineplot(data=df_bike[['atemp','temp']],ax=ax);
df_bike[['temp','atemp']].plot.line();
#---> both features move together and carry the same information about rental numbers --> get rid of one of them

## hour, day, month

In [None]:
#hour
df_bike.groupby(['hour'])[['count']].mean().plot.bar();
#the time of day has a huge influence

In [None]:
#day
df_bike.groupby(['day'])[['count']].mean().plot.bar(subplots=True);
#the weekday does not seem to have much influence?!

In [None]:
#month
df_bike.groupby(['month'])[['count']].mean().plot.bar(subplots=True);
#the warmer the month, the more rentals

## conclusion EDA

- One Hot Encoding:
 - season, weather, year, month, day, hour --> categorical values
- Scaling:
 - windspeed, humidity, temp
- Binning after Scaling:
 - windspeed, humidity, temp
- Drop:
 - datetime, atemp
 - maybe day and windspeed later
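
The preprocessing plan above could be wired together roughly as follows. This is a sketch under assumptions: the toy frame, the bin count of 3 and the `uniform` binning strategy are chosen only for illustration and are not fixed anywhere in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder, StandardScaler

# Toy rows with the same column names as the bike data (values are made up)
toy = pd.DataFrame({
    "season": [1, 2, 3, 4, 1, 2],
    "weather": [1, 2, 1, 3, 2, 1],
    "hour": [0, 6, 12, 18, 23, 9],
    "temp": [9.8, 14.8, 27.9, 13.1, 8.2, 20.5],
    "atemp": [14.4, 18.2, 31.8, 15.2, 12.9, 24.2],
    "humidity": [81, 93, 83, 45, 86, 60],
    "windspeed": [0.0, 7.0, 6.0, 17.0, 0.0, 11.0],
})

categorical = ["season", "weather", "hour"]
numerical = ["temp", "humidity", "windspeed"]

# Scale first, then bin, as listed in the conclusion
scale_then_bin = Pipeline([
    ("scaler", StandardScaler()),
    ("binner", KBinsDiscretizer(n_bins=3, encode="onehot-dense", strategy="uniform")),
])

# Columns not listed (here: atemp) are dropped because remainder='drop' is the default
preprocessor = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("numerical", scale_then_bin, numerical),
])

features = preprocessor.fit_transform(toy)
print(features.shape)
```

Dropping `atemp` by simply not listing it keeps the drop decision in the same place as the encoding decisions.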

## Splitting 

In [None]:
# easy way to separate the whole dataframe into Feature-Set (X) and Target-Set (Y) (Y is a one-column DataFrame because of the double brackets)
X=df_bike.drop(['atemp','wind_bin','Hum_bin','temp_bin','count'],axis=1)
Y=df_bike[['count']]
X_test=df_test.drop(['atemp'],axis=1)

In [None]:
#Split into Validate and Train Data Set
X_train, X_valid, y_train, y_valid = train_test_split(X,Y, test_size=0.2, random_state=101) #test_size generally around 20-25%

In [None]:
print(X_train.shape, X_valid.shape,y_train.shape, y_valid.shape, X_test.shape)

In [None]:
X_train

# BaseLine Model

Steps:
- Create by hand 
- Create with a pipeline
- little effort
- only feature engineering: encode season and day, drop atemp
- compare the RMSLE by hand with the RMSLE from the pipeline; they need to be the same
--> this gives a base score to compare against after setting up the real model
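
The by-hand and pipeline scores only match when both paths build the same features and the model is deterministic. The sketch below illustrates this on made-up data; it swaps in `LinearRegression` (so that no random seed is involved) and uses `drop='first'`, `remainder='passthrough'` and `sparse_threshold=0` as assumptions to mirror `pd.get_dummies(..., drop_first=True)`. All names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "season": rng.integers(1, 5, size=200),
    "temp": rng.uniform(0, 35, size=200),
})
target = 20 * toy["season"] + 5 * toy["temp"] + rng.normal(0, 10, size=200)
train, valid = toy.iloc[:150], toy.iloc[150:]
y_tr, y_va = target.iloc[:150], target.iloc[150:]

# By hand: a fixed category list keeps the dummy columns identical in every split
season_dtype = pd.CategoricalDtype([1, 2, 3, 4])


def encode(df):
    dummies = pd.get_dummies(df["season"].astype(season_dtype),
                             prefix="season", drop_first=True)
    return pd.concat([df.drop(columns="season"), dummies], axis=1).astype(float)


manual = LinearRegression().fit(encode(train), y_tr)
pred_manual = manual.predict(encode(valid))

# Pipeline: same encoding, and the untouched columns are passed through
preprocessor = ColumnTransformer(
    [("onehot", OneHotEncoder(drop="first"), ["season"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
pipe = Pipeline([("preprocessor", preprocessor), ("model", LinearRegression())])
pred_pipe = pipe.fit(train, y_tr).predict(valid)

rmse_manual = np.sqrt(mean_squared_error(y_va, pred_manual))
rmse_pipe = np.sqrt(mean_squared_error(y_va, pred_pipe))
print(rmse_manual, rmse_pipe)
```

With a randomized model such as a random forest, the two scores would additionally need the same `random_state` to be comparable.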

## by hand

### as RandomForest model

In [None]:
#only feature engineering: one-hot encode day and season
#function
def encoding(df,columns):
    #print(columns)
    df=pd.DataFrame(df).copy()
    df_dummy=pd.get_dummies(data=df[columns],columns=columns,drop_first=True,dummy_na=False,prefix=columns)
    df_new=pd.concat([df,df_dummy],axis=1)
    df_new.drop(columns,axis=1,inplace=True)
    return df_new

ohe=['day','season']      
X_train_ohe=encoding(X_train,ohe)
X_valid_ohe=encoding(X_valid,ohe) 
X_test_ohe=encoding(X_test,ohe) 

In [None]:
X_train_ohe.shape, X_valid_ohe.shape, X_test_ohe.shape
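
Calling `pd.get_dummies` separately on each split only yields the same columns when every category occurs in every split. A minimal sketch of one way to guard against that, using made-up frames (`train_toy` and `test_toy` are hypothetical names):

```python
import pandas as pd

train_toy = pd.DataFrame({"season": [1, 2, 3, 4], "temp": [5.0, 15.0, 25.0, 10.0]})
test_toy = pd.DataFrame({"season": [1, 2], "temp": [7.0, 18.0]})  # seasons 3 and 4 never occur

train_ohe = pd.get_dummies(train_toy, columns=["season"], drop_first=True)
test_ohe = pd.get_dummies(test_toy, columns=["season"], drop_first=True)

# Force the test frame onto the training columns;
# categories that are missing in test become all-zero columns
test_aligned = test_ohe.reindex(columns=train_ohe.columns, fill_value=0)

print(train_ohe.shape, test_ohe.shape, test_aligned.shape)
```

For `day` and `season` in the bike data every category is likely present in all splits, so the shape check above is usually enough; the reindex step matters for rarer categories.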

In [None]:
#instantiate model
m_ohe = RandomForestRegressor()

In [None]:
#fit the model
m_ohe.fit(X_train_ohe,y_train.values.ravel()) #pass y as a 1-D array to avoid a DataConversionWarning

In [None]:
y_pred_ohe=m_ohe.predict(X_valid_ohe)
y_pred_ohe[y_pred_ohe<0]=0

In [None]:
#get a Root Mean Squared Log Error (RMSLE)-score
rmsle_b_ohe=np.sqrt(mean_squared_log_error(y_valid,y_pred_ohe))
print(f'baseline model score: {round(rmsle_b_ohe,6)}')

### as a Pipeline

In [None]:
#Create a preprocessor ColumnTransformer (with just 1 task/step)
categorical_features=['day','season']

categorical_steps = [('onehot', OneHotEncoder(handle_unknown='ignore'))]
# Create sub-pipeline as part of preprocessor
categorical_transformer = Pipeline(categorical_steps)
# Note: ColumnTransformer drops all columns that are not listed (remainder='drop' is the default),
# so this pipeline only sees day and season; use remainder='passthrough' to keep the other features
preprocessor = ColumnTransformer(transformers=
                                 [('categoric', categorical_transformer, categorical_features)])
preprocessor

In [None]:
final_steps = [('preprocessor', preprocessor),
     ('RanFo', RandomForestRegressor())] # instanciation of the model class

In [None]:
pipeline = Pipeline(final_steps)

In [None]:
#have a look at the pipeline steps
pipeline.named_steps

In [None]:
#use the pipeline
pipeline.fit(X_train,y_train)

In [None]:
y_pred_pipe=pipeline.predict(X_valid)

In [None]:
X_valid.shape, X_train.shape

In [None]:
#get a Root Mean Squared Log Error (RMSLE)-score
rmsle_b_pipe=np.sqrt(mean_squared_log_error(y_valid,y_pred_pipe))
print(f'baseline model score: {round(rmsle_b_pipe,6)}')

### as mean out of "month-day"

In [None]:
X_valid

In [None]:
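#note: the means are computed on all of df_bike, including the rows that are later used for validation
y_day_month_mean=df_bike.groupby(['month','day'])[['count']].mean()
y_day_month_mean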

In [None]:
y_day_month_mean.info()
#type(y_day_month_mean.loc[1,3])

In [None]:
#y_day_month_mean gives us the mean per weekday (0-6) for the corresponding month --> 12*7=84 rows

#use these means as baseline predictions

def baseline_mean(df_day_month_mean,col_day,col_month,i):
    #tuple unpacking: iterrows() yields (index, row)
    a,b=i
    #Month-Value
    month=int(b[col_month])
    #Day-Value
    day=int(b[col_day])
    #look up the mean for this (month, day) pair in the multi-index
    pred_value=df_day_month_mean.loc[(month,day),'count']
    
    return pred_value
    
day='day'
month='month'

#List Comprehension
y_b_pred_mean=[baseline_mean(y_day_month_mean,day,month,row) for row in X_valid.iterrows()]
y_b_pred_mean



In [None]:
#Check if it worked
df_y_b_pred_mean=pd.DataFrame(y_b_pred_mean,columns=['day_month_mean'],index=X_valid.index)
#use a separate name so that the test set in df_test is not overwritten
df_check=pd.concat([df_y_b_pred_mean,X_valid],axis=1)
df_check






In [None]:
y_day_month_mean.head(20)

In [None]:
#get a Root Mean Squared Log Error (RMSLE)-score
rmsle_b_mean=np.sqrt(mean_squared_log_error(y_valid,df_y_b_pred_mean))
print(f'baseline model score: {round(rmsle_b_mean,6)}')
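
The row-by-row lookup above can also be done in one vectorized step. A minimal sketch on made-up data, assuming the group means are keyed by `(month, day)` as in this section; `train_toy`, `valid_toy` and `group_mean` are hypothetical names:

```python
import pandas as pd

train_toy = pd.DataFrame({
    "month": [1, 1, 1, 2, 2, 2],
    "day":   [0, 0, 1, 0, 1, 1],
    "count": [10, 20, 30, 40, 50, 70],
})
valid_toy = pd.DataFrame({"month": [1, 2, 1], "day": [0, 1, 1]})

# Mean count per (month, day) group, indexed by a two-level MultiIndex
group_mean = train_toy.groupby(["month", "day"])["count"].mean()

# Build one lookup key per validation row and fetch all means at once
keys = pd.MultiIndex.from_frame(valid_toy[["month", "day"]])
pred = group_mean.reindex(keys).to_numpy()

print(pred)
```

A `(month, day)` pair that never occurs in the training data would come back as `NaN` here instead of raising a `KeyError`, so such rows still need a fallback value.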

## compare RMSLE Baseline, submit to Kaggle

In [None]:
print(f'baseline model pipeline score: {round(rmsle_b_pipe,6)}')
print(f'baseline model ohe manual score: {round(rmsle_b_ohe,6)}')
print(f'baseline model mean manual score: {round(rmsle_b_mean,6)}')
#ohe is the best

In [None]:
#Predict
y_pred_ohe_kaggle=m_ohe.predict(X_test_ohe)
y_pred_ohe_kaggle.shape


In [None]:
df_count_kaggle=pd.DataFrame(y_pred_ohe_kaggle,index=X_test.index)
df_count_kaggle.rename(columns = {0:'count'}, inplace = True)
df_count_kaggle[df_count_kaggle['count']<0]=0
df_count_kaggle

In [None]:
#csv
df_count_kaggle.to_csv("submission.csv",index=True)

# Train and Test the Model (Estimator)

In [None]:
#train and test WITHOUT taking the EDA findings into account, just to check that the process works

In [135]:
df_bike=pd.read_csv('train.csv', parse_dates=True, index_col=0 )
X=df_bike.copy()
X.drop(['casual','registered','count'],axis=1,inplace=True)
X.head(10)

| datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed |
|---|---|---|---|---|---|---|---|---|
| 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 |
| 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 |
| 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 |
| 2011-01-01 03:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 |
| 2011-01-01 04:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 |
| 2011-01-01 05:00:00 | 1 | 0 | 0 | 2 | 9.84 | 12.88 | 75 | 6.0032 |
| 2011-01-01 06:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 |
| 2011-01-01 07:00:00 | 1 | 0 | 0 | 1 | 8.2 | 12.88 | 86 | 0.0 |
| 2011-01-01 08:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 |
| 2011-01-01 09:00:00 | 1 | 0 | 0 | 1 | 13.12 | 17.425 | 76 | 0.0 |


In [136]:
y = df_bike.copy() #df_bike is the frame loaded above (bike was undefined)
y = y['count']#.to_frame()
#y = y.reset_index(level=0)
y

## Create Pipeline and do Feature Engineering

In [137]:
#categorical to encode 
categorical_features=['season','holiday','workingday','weather']

#numerical features to scale and bin afterwards
numerical_features= ['temp','atemp','humidity','windspeed']


### define all necessary functions (Function Transformer)

In [138]:
def day_period_dataframe(X):
    X = pd.DataFrame(X).copy()
    
    X["day_period"]=X["hour"].apply(day_period)
    return X

In [139]:
def day_period(hour):
    label=None
    if hour>=22 or hour<4: #hour 0 is already covered by hour<4
        label="night"
    elif hour<10:
        label="morning"
    elif hour<16:
        label="afternoon"
    else:
        label="evening"
    return label

In [140]:
def timebreakdown (X):
    X = pd.DataFrame(X).copy()
    
    X['year'] = X.index.year
    X['month'] = X.index.month
    X['weekday'] = X.index.day_name()
    X['hour'] = X.index.hour

    
    return X

In [141]:
timebreakdown_step = FunctionTransformer(timebreakdown)

In [142]:
day_period_step = FunctionTransformer(day_period_dataframe)
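
To see what such a `FunctionTransformer` step does in isolation, it can be applied to a small frame with a datetime index. The helper `add_hour` and the toy frame below are hypothetical and only mirror the pattern used in `timebreakdown`:

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def add_hour(X):
    """Return a copy of X with an extra 'hour' column taken from the datetime index."""
    X = pd.DataFrame(X).copy()
    X["hour"] = X.index.hour
    return X


toy = pd.DataFrame(
    {"temp": [9.8, 13.1, 20.5]},
    index=pd.to_datetime(["2011-01-01 00:00", "2011-01-01 09:00", "2011-01-01 17:00"]),
)

# With the default validate=False the DataFrame (and its index) reaches the function unchanged
step = FunctionTransformer(add_hour)
out = step.fit_transform(toy)

print(out)
```

Because the function works on a copy, the input frame keeps its original columns, which makes the step safe to reuse inside a pipeline.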

### Main pipeline and 2 sub-pipelines

In [143]:
#Categorical
# now we define the steps we need to do for both groups of columns
categorical_steps = [('timebreakdown', timebreakdown_step),
                     ('day_period_step', day_period_step),
                     ('onehot', OneHotEncoder(handle_unknown='ignore'))]

In [144]:
#sub pipeline 1
categorical_transformer = Pipeline(categorical_steps)
categorical_transformer

In [145]:
# sub-pipeline 2
numeric_steps = [('imputer', SimpleImputer(strategy='median')), 
                 ('scaler', StandardScaler())]

numeric_transformer  = Pipeline(numeric_steps)
numeric_transformer

In [146]:
#combining both pipelines --> main pipeline
preprocessor = ColumnTransformer(transformers=[
        ('numeric', numeric_transformer, numerical_features),
        ('categorical', categorical_transformer, categorical_features)])


In [147]:
final_steps = [('preprocessor', preprocessor),
     ('LinReg', LinearRegression())]

In [148]:
pipeline = Pipeline(final_steps)

In [149]:
pipeline

## Split DF into Train and Test

In [150]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                    test_size = 0.2, random_state=42)

In [151]:
X_train.head()

| datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed |
|---|---|---|---|---|---|---|---|---|
| 2011-07-06 05:00:00 | 3 | 0 | 1 | 1 | 27.88 | 31.82 | 83 | 6.0032 |
| 2012-08-04 16:00:00 | 3 | 0 | 0 | 1 | 36.9 | 40.91 | 39 | 19.9995 |
| 2012-07-11 15:00:00 | 3 | 0 | 1 | 1 | 32.8 | 35.605 | 41 | 16.9979 |
| 2011-04-10 04:00:00 | 2 | 0 | 0 | 2 | 14.76 | 18.18 | 93 | 7.0015 |
| 2011-11-19 10:00:00 | 4 | 0 | 0 | 1 | 13.12 | 15.15 | 45 | 16.9979 |


## Fit and Run Model

In [152]:
#train
pipeline.fit(X_train,y_train)

In [153]:
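preprocessor.fit_transform(X_train).shape  #not necessary for training (pipeline.fit already fits the preprocessor); only a sanity check of the transformed shape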

## Check Performance

In [154]:
y_pred = pipeline.predict(X_test)
y_pred

In [155]:
y_pred[y_pred<0] = 0
y_pred

In [156]:
y_pred_series=pd.Series(y_pred)

In [163]:
test_score = mean_squared_log_error(y_test, y_pred) #this is the MSLE; its square root (next cell) is the RMSLE
print(f'Test MSLE: {test_score}')

#Definition of MSLE:
#https://peltarion.com/knowledge-center/modeling-view/build-an-ai-model/loss-functions/mean-squared-logarithmic-error-(msle)
#cares more about the relative (percentage) difference than the absolute difference
#penalizes underestimates more than overestimates

Test MSLE: 1.190801094399844


In [164]:
np.sqrt(mean_squared_log_error(y_test,y_pred))#,squared=False)
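
The two properties of the MSLE noted above can be checked with a tiny worked example. The numbers are made up; the manual formula assumes the `log(1 + y)` form that scikit-learn documents for `mean_squared_log_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

y_true = np.array([100.0])
under = np.array([50.0])    # 50 below the true value
over = np.array([150.0])    # 50 above the true value

msle_under = mean_squared_log_error(y_true, under)
msle_over = mean_squared_log_error(y_true, over)

# Same absolute error, but the underestimate is penalized more
print(f"MSLE underestimate: {msle_under:.4f}")
print(f"MSLE overestimate:  {msle_over:.4f}")

# The same value computed by hand: mean of squared differences of log(1 + y)
manual_under = np.mean((np.log1p(under) - np.log1p(y_true)) ** 2)

# The RMSLE is simply the square root of the MSLE
rmsle_under = np.sqrt(msle_under)
print(f"RMSLE underestimate: {rmsle_under:.4f}")
```

The asymmetry is the reason negative predictions are clipped to 0 before scoring: the logarithm is not defined for values below -1, and scikit-learn rejects negative inputs for this metric.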

# Cross-Validation

In [None]:
cross_acc = cross_val_score(estimator=RandomForestRegressor(), # estimator: the model you want to evaluate
                            X=X_train,                           # the training input data/features
                            y=y_train,                           # the training output data/target  
                            cv=5,                               # number of cross validation datasets/folds  
                            scoring='neg_mean_squared_error'    # evaluation metric (negated MSE: closer to 0 is better)
                            ) 

In [None]:
# these are the validation scores for the k fitted models in cross validation
cross_acc

In [None]:
cross_acc.mean() 

In [None]:
cross_acc.std()
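
Because the baseline scores in this notebook are RMSLE values, cross-validation scores are easier to compare when they use the same metric. A minimal sketch on synthetic, non-negative count-like data, assuming the built-in `neg_mean_squared_log_error` scorer; all names and numbers are made up:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(200, 3))
# Non-negative target, as required by the log-based metric
y_toy = np.round(100 * X_toy[:, 0] + 20 * X_toy[:, 1] + rng.uniform(0, 5, size=200))

neg_msle = cross_val_score(
    RandomForestRegressor(n_estimators=30, random_state=0),
    X_toy,
    y_toy,
    cv=5,
    scoring="neg_mean_squared_log_error",  # scorers are negated so that higher is better
)

# Flip the sign and take the square root to get one RMSLE per fold
rmsle_per_fold = np.sqrt(-neg_msle)
print(rmsle_per_fold.mean(), rmsle_per_fold.std())
```

The mean summarizes the expected score; the standard deviation shows how much that score depends on the particular fold.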

# Hyperparameter - Tuning

In [None]:
#Let's find best hyperparameters for current model

In [None]:
RandomForestRegressor().get_params()

In [None]:
# define the hyperparameter values to combine: 5*4*3 = 60 combinations of hyperparameters; for each combination we are fitting
# k=5 models (300 fits in total)

# python dict

hyperparam_grid = {
    'max_depth': [3, 5, 10, 20, 31], 
    'n_estimators': [5, 10, 30, 50],
    'min_samples_leaf': [5, 10, 20]
}

In [None]:
grid_cv = GridSearchCV(estimator=RandomForestRegressor(),            # unfitted model/estimator
                       param_grid=hyperparam_grid,                    # hyperparameters dict
                       cv=5,                                          # number of folds, k
                       scoring='neg_mean_squared_error')   

In [None]:
# fit all models with all the different hyperparamters
grid_cv.fit(X_train, y_train)
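
After fitting, the search object exposes the winning combination and its score. The sketch below also shows how a grid can address a model inside a `Pipeline` through `step__parameter` names. Data, step names and grid values are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(150, 3))
y_toy = 100 * X_toy[:, 0] + 20 * X_toy[:, 1] + rng.normal(0, 2, size=150)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("forest", RandomForestRegressor(random_state=0)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter name>
param_grid = {
    "forest__max_depth": [3, 10],
    "forest__n_estimators": [10, 30],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="neg_mean_squared_error")
search.fit(X_toy, y_toy)

print(search.best_params_)              # best hyperparameter combination
print(-search.best_score_)              # its mean cross-validated MSE
print(len(search.cv_results_["params"]))  # number of combinations tried
```

With the default `refit=True`, `search.best_estimator_` is already refitted on the full data passed to `fit` and can be used for prediction directly.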

# Evaluation