# Bike Sharing Linear Regression Model

## Steps
1. Reading, understanding and visualizing the data
2. Preparing the data for Modelling
    - Train - Test split
    - Rescaling
3. Training the Model 
4. Residual Analysis
5. Predictions and Evaluations on the test set

### Step 1: Reading, understanding and visualizing the data

In [66]:
# Import Needed Libraries
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import calendar


from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor


In [None]:
# Read the dataset
bike_sharing = pd.read_csv('day.csv')
bike_sharing.shape

In [None]:
bike_sharing.info()
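
Beyond reading the `info()` output by eye, missing values and dtypes can be checked explicitly. A minimal sketch on a small made-up frame (the `demo` name and its values are hypothetical stand-ins for `bike_sharing`):

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; values are made up for illustration.
demo = pd.DataFrame({
    "temp": [14.1, 14.9, 8.05, 8.2],
    "hum": [80.6, 69.6, 43.7, 59.0],
    "cnt": [985, 801, 1349, 1562],
})

# Count missing values per column; all zeros means nothing needs imputing.
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# Confirm every column is already numeric, so no dtype conversion is needed.
all_numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in demo.dtypes)
print("All numeric:", all_numeric)
```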

"""No Missing Values and no datatype conversions required"""

In [None]:
# Drop insignificant columns
# instant is a unique record index, so it adds no information
# casual and registered are already captured in cnt (the target column), so they are dropped
# dteday and yr only record when an observation was taken; they are treated as history that does not help predict demand for a current date/month/year
insig_cols = ['instant','dteday','yr','casual','registered']
bike_sharing.drop(insig_cols,axis=1,inplace=True)
bike_sharing.columns

In [None]:
# Visualizing the data for linearity and multicollinearity
# pairplot creates its own figure, so no plt.figure() call is needed
sns.pairplot(bike_sharing)
plt.show()

"""At this point, temp and atemp appear to be collinear (positively correlated). This is easily explained: temp is the actual temperature and atemp is the feels-like temperature."""

In [None]:
# Visualizing the data: continuous independent variables against the target
sns.pairplot(data=bike_sharing,x_vars=['temp', 'atemp', 'hum', 'windspeed'],y_vars='cnt')
plt.show()
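
The visual impression from the pair plots can be backed with numbers by computing pairwise correlations. The sketch below uses synthetic data in which `atemp` is deliberately built as a noisy copy of `temp`; the column values and the 0.9 cut-off are assumptions for illustration, not results from `day.csv`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: atemp is temp plus small noise, mimicking the
# near-duplicate relationship suspected between the two real columns.
temp = rng.uniform(0, 35, size=200)
atemp = temp + rng.normal(0, 1.0, size=200)
hum = rng.uniform(20, 100, size=200)
demo = pd.DataFrame({"temp": temp, "atemp": atemp, "hum": hum})

# Pearson correlation matrix; values close to 1 flag candidate collinear pairs.
corr = demo.corr()
print(corr.round(2))

# List feature pairs whose absolute correlation exceeds the chosen threshold.
threshold = 0.9  # assumed cut-off, not taken from the notebook
high_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```

On the real data the same `corr()` call would be applied to `bike_sharing[['temp', 'atemp', 'hum', 'windspeed', 'cnt']]`.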

"""There seems to be linear correlation between temp vs cnt and atemp vs cnt"""

In [None]:
# Visualizing the data: Categorical Independent Variables
# Use a wide canvas so the 2x3 grid of box plots stays readable
plt.figure(figsize=(20, 12))
plt.subplot(2,3,1)
sns.boxplot(x='season',y='cnt',data=bike_sharing)
plt.subplot(2,3,2)
sns.boxplot(x='mnth',y='cnt',data=bike_sharing)
plt.subplot(2,3,3)
sns.boxplot(x='holiday',y='cnt',data=bike_sharing)
plt.subplot(2,3,4)
sns.boxplot(x='weekday',y='cnt',data=bike_sharing)
plt.subplot(2,3,5)
sns.boxplot(x='workingday',y='cnt',data=bike_sharing)
plt.subplot(2,3,6)
sns.boxplot(x='weathersit',y='cnt',data=bike_sharing)
plt.show()


#### Inferences
- Season seems to influence the total number of bike rentals, and so does month
- workingday and weekday don't seem to influence total rentals
- Weather situation seems to influence total rentals
- holiday seems to have little influence on total rentals (judging by the medians)
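
The box-plot reading above relies on medians, which can also be tabulated directly. A small sketch with made-up values (the `demo` frame is hypothetical; on the real data the same calls would be applied to `bike_sharing` before encoding):

```python
import pandas as pd

# Hypothetical mini-sample; the real analysis would use bike_sharing instead.
demo = pd.DataFrame({
    "season": [1, 1, 2, 2, 3, 3, 4, 4],
    "holiday": [0, 1, 0, 0, 0, 1, 0, 0],
    "cnt": [1200, 900, 4500, 5000, 5600, 5200, 4300, 4700],
})

# Median rentals per category level: the same statistic as the centre line
# of each box, but as numbers that can be compared directly.
median_by_season = demo.groupby("season")["cnt"].median()
median_by_holiday = demo.groupby("holiday")["cnt"].median()
print(median_by_season)
print(median_by_holiday)
```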

### Step 2:  Preparing the data for Modelling

#### Encoding
 - Yes/no variables are already encoded as 1/0, so no change is needed
 - Some nominal variables (season, mnth, weekday, weathersit) are stored as integer codes, as if they were ordinal. These have to be mapped to labels and dummy encoded

In [None]:
## List each categorical column with its unique values
column_values = {}
col_list = ['season', 'mnth', 'holiday', 'weekday', 'workingday', 'weathersit']
for col in col_list:
    column_values[col] = list(bike_sharing[col].value_counts().index)

print(column_values)

In [None]:
## Replace the integer codes of the nominal variables with the labels from the data dictionary
season_mappings = {1:'spring', 2:'summer', 3:'fall', 4:'winter'}
weathersit_mappings = {1:'Clear',2:'Mist_Cloudy',3:'Light_Snow',4:'Heavy_Rain'}

bike_sharing['season'] = bike_sharing['season'].map(season_mappings)
bike_sharing['weathersit'] = bike_sharing['weathersit'].map(weathersit_mappings)
bike_sharing['mnth'] = bike_sharing['mnth'].apply(lambda x : calendar.month_abbr[x])
# Note: calendar.day_abbr[0] is 'Mon'; confirm this matches the dataset's weekday coding
bike_sharing['weekday'] = bike_sharing['weekday'].apply(lambda x : calendar.day_abbr[x])
bike_sharing.head()
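
Because `calendar.day_abbr` starts the week on Monday at index 0, the weekday labels are only right if the dataset uses the same coding. The sketch below shows the indexing and a hypothetical helper, `label_weekday`, for the case where a data dictionary codes Sunday as 0; whether that applies to `day.csv` is an assumption to verify, not something established here.

```python
import calendar

# calendar.day_abbr is indexed with Monday as 0 and Sunday as 6.
print(list(calendar.day_abbr))

def label_weekday(code, sunday_is_zero=True):
    """Map a weekday code to its abbreviation (hypothetical helper).

    sunday_is_zero=True assumes codes run 0=Sunday ... 6=Saturday; set it to
    False if the data dictionary says 0=Monday.
    """
    index = (code - 1) % 7 if sunday_is_zero else code
    return calendar.day_abbr[index]

print([label_weekday(c) for c in range(7)])
```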

In [None]:
## Dummy encoding
var_list = ['season','mnth','weekday','weathersit']
# dtype=int keeps the dummy columns numeric (recent pandas versions default to bool)
dummy_encoded_values = pd.get_dummies(data=bike_sharing[var_list],drop_first=True,dtype=int)

# Add the new encoded cols to original dataframe and drop the source columns
bike_sharing = pd.concat([bike_sharing,dummy_encoded_values],axis=1)
bike_sharing.drop(var_list,axis=1,inplace=True)
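
To make the effect of `drop_first=True` concrete, here is a toy example on a single column (the `demo` frame is illustrative only). A variable with k levels yields k-1 indicator columns, and the dropped level acts as the baseline.

```python
import pandas as pd

# Toy column with the four season labels used above.
demo = pd.DataFrame({"season": ["spring", "summer", "fall", "winter", "fall"]})

# drop_first=True keeps k-1 indicator columns for k levels; the dropped level
# (the first in sorted order, here 'fall') becomes the baseline.
encoded = pd.get_dummies(demo, columns=["season"], drop_first=True, dtype=int)
print(encoded.columns.tolist())

# A 'fall' row is encoded as zeros in every remaining indicator column.
print(encoded.iloc[2].tolist())
```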

#### Train-test split

In [None]:
df_train, df_test = train_test_split(bike_sharing,train_size=0.7,random_state=100)

In [None]:
df_train.columns

#### Scaling the features using MinMaxScaler

In [None]:
# Normalize the numerical columns other than categorical dummy cols
num_vars = ['temp','atemp','hum','windspeed','cnt']
scaler = MinMaxScaler()
df_train[num_vars] = scaler.fit_transform(df_train[num_vars])
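
The scaler is fitted on the training rows only. When the test set is evaluated later, it should be rescaled with the same fitted scaler using `transform`, not `fit_transform`. A minimal sketch with hypothetical `train` and `test` frames standing in for `df_train` and `df_test`:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical train/test frames; values are made up for illustration.
train = pd.DataFrame({"temp": [10.0, 20.0, 30.0], "cnt": [1000.0, 3000.0, 5000.0]})
test = pd.DataFrame({"temp": [15.0, 35.0], "cnt": [2000.0, 6000.0]})

num_vars = ["temp", "cnt"]
scaler = MinMaxScaler()

# Learn min and max from the training rows only, then rescale them.
train[num_vars] = scaler.fit_transform(train[num_vars])

# Reuse the training min/max on the test rows: transform, never fit.
test[num_vars] = scaler.transform(test[num_vars])
print(test)
```

Test values outside the training range land outside [0, 1], which is expected and preferable to leaking test-set information into the scaler.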


### Step 3:  Model Building

#### Dividing the training set into X and y

In [None]:
y_train = df_train.pop('cnt')
X_train = df_train

In [None]:
# Create Linear Regression Model
lm = LinearRegression()
lm.fit(X_train,y_train)

# Run RFE, keeping the 15 top-ranked features
output_var_count = 15
rfe = RFE(lm,n_features_to_select=output_var_count)
rfe = rfe.fit(X_train, y_train)


In [None]:
list(zip(X_train.columns,rfe.support_,rfe.ranking_))
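
To read the `support_` and `ranking_` output, a toy run on synthetic data may help. In this sketch only the first two of five features drive the target; the data and the choice of two features to keep are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Synthetic data: only the first two of five features drive the target.
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Ask RFE to keep 2 features; each round it drops the weakest coefficient.
rfe = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)

# support_ is True for kept features; ranking_ is 1 for kept features and
# grows the earlier a feature was eliminated.
for idx, (kept, rank) in enumerate(zip(rfe.support_, rfe.ranking_)):
    print(f"feature_{idx}: kept={kept}, rank={rank}")
```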

In [50]:
resulting_rfe_cols = X_train.columns[rfe.support_]
resulting_rfe_cols

Index(['holiday', 'workingday', 'temp', 'atemp', 'hum', 'windspeed',
       'season_summer', 'season_winter', 'mnth_Jul', 'mnth_Jun', 'mnth_May',
       'mnth_Sep', 'weekday_Sun', 'weathersit_Light_Snow',
       'weathersit_Mist_Cloudy'],
      dtype='object')

In [48]:
X_train.columns[~rfe.support_]

Index(['season_spring', 'mnth_Aug', 'mnth_Dec', 'mnth_Feb', 'mnth_Jan',
       'mnth_Mar', 'mnth_Nov', 'mnth_Oct', 'weekday_Mon', 'weekday_Sat',
       'weekday_Thu', 'weekday_Tue', 'weekday_Wed'],
      dtype='object')

#### Building the model with statsmodels to get detailed statistics

In [63]:
# Keeping only the columns from RFE
X_train_rfe = X_train[resulting_rfe_cols]

In [62]:
## Build Model and return summary
def build_model(X_train,y_train):
    X_train_sm = sm.add_constant(X_train) # Add constant
    lm = sm.OLS(y_train,X_train_sm).fit() # Fitting the model
    return lm.summary() # Return summary

In [72]:
# Compute VIF
def compute_vif(X_train):
    vif = pd.DataFrame()
    vif['Features'] = X_train.columns
    vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
    vif['VIF'] = round(vif['VIF'], 2) # Rounding to 2 decimal values
    vif = vif.sort_values(by = "VIF", ascending = False)
    return vif
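
For reference, the VIF of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features. The NumPy sketch below computes this on synthetic data. It adds an intercept to each auxiliary regression, which is an assumption of this sketch: statsmodels' `variance_inflation_factor` regresses on exactly the columns it is given, so `compute_vif` above, which passes no constant column, may return different numbers.

```python
import numpy as np

def vif_with_intercept(X):
    """Return VIF per column of X, using 1 / (1 - R^2) from auxiliary OLS fits."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Auxiliary regression of column j on the other columns plus intercept.
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.1, size=300)  # nearly a copy of a
c = rng.normal(size=300)                 # independent of both
vifs = vif_with_intercept(np.column_stack([a, b, c]))
print(vifs.round(2))
```

The two near-duplicate columns get large VIFs while the independent column stays close to 1, the same pattern seen with temp and atemp.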
    

In [74]:
# Building Model with 15 params from RFE
build_model(X_train_rfe,y_train)

| | | | |
|---|---|---|---|
| Dep. Variable: | cnt | R-squared: | 0.597 |
| Model: | OLS | Adj. R-squared: | 0.584 |
| Method: | Least Squares | F-statistic: | 48.69 |
| Date: | Mon, 07 Feb 2022 | Prob (F-statistic): | 3.21e-87 |
| Time: | 22:42:50 | Log-Likelihood: | 270.35 |
| No. Observations: | 510 | AIC: | -508.7 |
| Df Residuals: | 494 | BIC: | -441.0 |
| Df Model: | 15 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.3256 | 0.046 | 7.045 | 0.000 | 0.235 | 0.416 |
| holiday | -0.0727 | 0.044 | -1.657 | 0.098 | -0.159 | 0.013 |
| workingday | 0.0424 | 0.019 | 2.258 | 0.024 | 0.006 | 0.079 |
| temp | 0.7711 | 0.207 | 3.728 | 0.000 | 0.365 | 1.177 |
| atemp | -0.0647 | 0.220 | -0.295 | 0.768 | -0.496 | 0.367 |
| hum | -0.2946 | 0.061 | -4.824 | 0.000 | -0.415 | -0.175 |
| windspeed | -0.2021 | 0.042 | -4.819 | 0.000 | -0.285 | -0.120 |
| season_summer | 0.0934 | 0.020 | 4.594 | 0.000 | 0.053 | 0.133 |
| season_winter | 0.1408 | 0.017 | 8.234 | 0.000 | 0.107 | 0.174 |

| | | | |
|---|---|---|---|
| Omnibus: | 20.467 | Durbin-Watson: | 2.055 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 9.636 |
| Skew: | 0.079 | Prob(JB): | 0.00808 |
| Kurtosis: | 2.345 | Cond. No. | 80.0 |


In [75]:
# Computing VIF for 15 variables from RFE
compute_vif(X_train_rfe)

| | Features | VIF |
|---|---|---|
| 3 | atemp | 361.04 |
| 2 | temp | 353.97 |
| 4 | hum | 17.61 |
| 1 | workingday | 5.01 |
| 5 | windspeed | 3.98 |
| 6 | season_summer | 2.48 |
| 14 | weathersit_Mist_Cloudy | 2.12 |
| 12 | weekday_Sun | 1.9 |
| 10 | mnth_May | 1.78 |
| 7 | season_winter | 1.75 |


##### Interpretations:
- The 15 variables chosen by RFE explain about 58% of the variance in the target variable (adjusted R-squared is 0.584)
- atemp has a very high p-value (0.768) and a very high VIF (361.04)

Let's remove atemp, rebuild the model and recompute the VIF.
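
As a quick sanity check on the adjusted R-squared figure, the standard formula can be evaluated by hand. This sketch plugs in the rounded values printed in the summary above (R-squared 0.597, 510 observations, 15 predictors), so the last digit may differ slightly from what statsmodels reports.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Inputs are read from the printed summary; R^2 is already rounded there.
adj = adjusted_r_squared(0.597, n_obs=510, n_predictors=15)
print(round(adj, 3))
```

The penalty term grows with the number of predictors, which is why adjusted R-squared is the figure tracked while variables are removed.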

In [80]:
# Remove atemp which has very high p value
cols_to_be_removed = ['atemp']
X = X_train_rfe.drop(cols_to_be_removed,axis=1)

In [81]:
build_model(X,y_train)

| | | | |
|---|---|---|---|
| Dep. Variable: | cnt | R-squared: | 0.596 |
| Model: | OLS | Adj. R-squared: | 0.585 |
| Method: | Least Squares | F-statistic: | 52.26 |
| Date: | Mon, 07 Feb 2022 | Prob (F-statistic): | 4.54e-88 |
| Time: | 22:52:20 | Log-Likelihood: | 270.31 |
| No. Observations: | 510 | AIC: | -510.6 |
| Df Residuals: | 495 | BIC: | -447.1 |
| Df Model: | 14 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.3248 | 0.046 | 7.047 | 0.000 | 0.234 | 0.415 |
| holiday | -0.0722 | 0.044 | -1.649 | 0.100 | -0.158 | 0.014 |
| workingday | 0.0425 | 0.019 | 2.263 | 0.024 | 0.006 | 0.079 |
| temp | 0.7112 | 0.039 | 18.163 | 0.000 | 0.634 | 0.788 |
| hum | -0.2957 | 0.061 | -4.855 | 0.000 | -0.415 | -0.176 |
| windspeed | -0.1999 | 0.041 | -4.850 | 0.000 | -0.281 | -0.119 |
| season_summer | 0.0925 | 0.020 | 4.608 | 0.000 | 0.053 | 0.132 |
| season_winter | 0.1400 | 0.017 | 8.294 | 0.000 | 0.107 | 0.173 |
| mnth_Jul | -0.0853 | 0.030 | -2.832 | 0.005 | -0.144 | -0.026 |

| | | | |
|---|---|---|---|
| Omnibus: | 20.458 | Durbin-Watson: | 2.057 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 9.599 |
| Skew: | 0.075 | Prob(JB): | 0.00823 |
| Kurtosis: | 2.345 | Cond. No. | 19.8 |


In [78]:
compute_vif(X)

| | Features | VIF |
|---|---|---|
| 3 | hum | 17.18 |
| 2 | temp | 12.21 |
| 1 | workingday | 5.01 |
| 4 | windspeed | 3.86 |
| 5 | season_summer | 2.42 |
| 13 | weathersit_Mist_Cloudy | 2.11 |
| 11 | weekday_Sun | 1.9 |
| 9 | mnth_May | 1.78 |
| 6 | season_winter | 1.71 |
| 7 | mnth_Jul | 1.69 |


##### Interpretations:
- Adjusted R-squared is essentially unchanged (0.584 to 0.585), indicating that atemp was a redundant variable
- As the EDA suggested, atemp is highly correlated with temp; removing atemp brought the VIF of temp down from 353.97 to 12.21
- mnth_May has a high p-value although its VIF is low (1.78)

As a rule of thumb, variables with a high p-value are removed, so let's remove mnth_May.

In [82]:
# Remove atemp which has very high p value
# Remove mnth_May which has high p value
cols_to_be_removed = ['atemp','mnth_May']
X = X_train_rfe.drop(cols_to_be_removed,axis=1)

In [83]:
build_model(X,y_train)

| | | | |
|---|---|---|---|
| Dep. Variable: | cnt | R-squared: | 0.596 |
| Model: | OLS | Adj. R-squared: | 0.585 |
| Method: | Least Squares | F-statistic: | 56.2 |
| Date: | Mon, 07 Feb 2022 | Prob (F-statistic): | 9.77e-89 |
| Time: | 22:53:19 | Log-Likelihood: | 269.79 |
| No. Observations: | 510 | AIC: | -511.6 |
| Df Residuals: | 496 | BIC: | -452.3 |
| Df Model: | 13 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.3307 | 0.046 | 7.230 | 0.000 | 0.241 | 0.421 |
| holiday | -0.0714 | 0.044 | -1.631 | 0.104 | -0.157 | 0.015 |
| workingday | 0.0425 | 0.019 | 2.265 | 0.024 | 0.006 | 0.079 |
| temp | 0.7034 | 0.038 | 18.329 | 0.000 | 0.628 | 0.779 |
| hum | -0.3023 | 0.061 | -4.994 | 0.000 | -0.421 | -0.183 |
| windspeed | -0.1972 | 0.041 | -4.795 | 0.000 | -0.278 | -0.116 |
| season_summer | 0.0821 | 0.017 | 4.765 | 0.000 | 0.048 | 0.116 |
| season_winter | 0.1408 | 0.017 | 8.347 | 0.000 | 0.108 | 0.174 |
| mnth_Jul | -0.0812 | 0.030 | -2.721 | 0.007 | -0.140 | -0.023 |

| | | | |
|---|---|---|---|
| Omnibus: | 22.118 | Durbin-Watson: | 2.061 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 10.328 |
| Skew: | 0.1 | Prob(JB): | 0.00572 |
| Kurtosis: | 2.332 | Cond. No. | 19.7 |


In [84]:
compute_vif(X)

| | Features | VIF |
|---|---|---|
| 3 | hum | 17.17 |
| 2 | temp | 11.82 |
| 1 | workingday | 5.0 |
| 4 | windspeed | 3.76 |
| 12 | weathersit_Mist_Cloudy | 2.11 |
| 10 | weekday_Sun | 1.89 |
| 5 | season_summer | 1.79 |
| 6 | season_winter | 1.7 |
| 7 | mnth_Jul | 1.66 |
| 8 | mnth_Jun | 1.4 |


##### Interpretations:
- Adjusted R-squared has stayed at 0.585, indicating that atemp and mnth_May add little to the fit of the model
- holiday has a high p-value (0.104) although its VIF is low

Let's remove holiday.

In [86]:
# Remove atemp which has very high p value
# Remove mnth_May which has high p value
# Remove holiday which has high p value
cols_to_be_removed = ['atemp','mnth_May','holiday']
X = X_train_rfe.drop(cols_to_be_removed,axis=1)

In [87]:
build_model(X,y_train)

| | | | |
|---|---|---|---|
| Dep. Variable: | cnt | R-squared: | 0.593 |
| Model: | OLS | Adj. R-squared: | 0.584 |
| Method: | Least Squares | F-statistic: | 60.46 |
| Date: | Mon, 07 Feb 2022 | Prob (F-statistic): | 4.53e-89 |
| Time: | 22:56:14 | Log-Likelihood: | 268.42 |
| No. Observations: | 510 | AIC: | -510.8 |
| Df Residuals: | 497 | BIC: | -455.8 |
| Df Model: | 12 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.3179 | 0.045 | 7.043 | 0.000 | 0.229 | 0.407 |
| workingday | 0.0532 | 0.018 | 3.016 | 0.003 | 0.019 | 0.088 |
| temp | 0.7045 | 0.038 | 18.331 | 0.000 | 0.629 | 0.780 |
| hum | -0.3004 | 0.061 | -4.954 | 0.000 | -0.419 | -0.181 |
| windspeed | -0.1979 | 0.041 | -4.805 | 0.000 | -0.279 | -0.117 |
| season_summer | 0.0831 | 0.017 | 4.816 | 0.000 | 0.049 | 0.117 |
| season_winter | 0.1405 | 0.017 | 8.318 | 0.000 | 0.107 | 0.174 |
| mnth_Jul | -0.0793 | 0.030 | -2.655 | 0.008 | -0.138 | -0.021 |
| mnth_Jun | -0.0472 | 0.027 | -1.719 | 0.086 | -0.101 | 0.007 |

| | | | |
|---|---|---|---|
| Omnibus: | 18.838 | Durbin-Watson: | 2.057 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 9.253 |
| Skew: | 0.088 | Prob(JB): | 0.00979 |
| Kurtosis: | 2.364 | Cond. No. | 19.6 |


In [88]:
compute_vif(X)

| | Features | VIF |
|---|---|---|
| 2 | hum | 16.68 |
| 1 | temp | 11.82 |
| 0 | workingday | 4.52 |
| 3 | windspeed | 3.7 |
| 11 | weathersit_Mist_Cloudy | 2.1 |
| 4 | season_summer | 1.79 |
| 9 | weekday_Sun | 1.78 |
| 5 | season_winter | 1.7 |
| 6 | mnth_Jul | 1.66 |
| 7 | mnth_Jun | 1.4 |


##### Interpretations:
- Adjusted R-squared is essentially unchanged (0.585 to 0.584), indicating that the removed variables add little to the fit of the model
- mnth_Jun has a high p-value (0.086) although its VIF is low (1.4)

Let's remove mnth_Jun.

In [89]:
# Remove atemp which has very high p value
# Remove mnth_May which has high p value
# Remove holiday which has high p value
# Remove mnth_Jun which has high p value
cols_to_be_removed = ['atemp','mnth_May','holiday','mnth_Jun']
X = X_train_rfe.drop(cols_to_be_removed,axis=1)

In [91]:
build_model(X,y_train)

| | | | |
|---|---|---|---|
| Dep. Variable: | cnt | R-squared: | 0.591 |
| Model: | OLS | Adj. R-squared: | 0.582 |
| Method: | Least Squares | F-statistic: | 65.43 |
| Date: | Mon, 07 Feb 2022 | Prob (F-statistic): | 2.34e-89 |
| Time: | 22:58:43 | Log-Likelihood: | 266.91 |
| No. Observations: | 510 | AIC: | -509.8 |
| Df Residuals: | 498 | BIC: | -459.0 |
| Df Model: | 11 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.3150 | 0.045 | 6.970 | 0.000 | 0.226 | 0.404 |
| workingday | 0.0539 | 0.018 | 3.052 | 0.002 | 0.019 | 0.089 |
| temp | 0.6799 | 0.036 | 19.026 | 0.000 | 0.610 | 0.750 |
| hum | -0.2856 | 0.060 | -4.749 | 0.000 | -0.404 | -0.167 |
| windspeed | -0.1934 | 0.041 | -4.695 | 0.000 | -0.274 | -0.112 |
| season_summer | 0.0795 | 0.017 | 4.634 | 0.000 | 0.046 | 0.113 |
| season_winter | 0.1420 | 0.017 | 8.401 | 0.000 | 0.109 | 0.175 |
| mnth_Jul | -0.0660 | 0.029 | -2.284 | 0.023 | -0.123 | -0.009 |
| mnth_Sep | 0.0848 | 0.026 | 3.296 | 0.001 | 0.034 | 0.135 |

| | | | |
|---|---|---|---|
| Omnibus: | 22.692 | Durbin-Watson: | 2.065 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 10.396 |
| Skew: | 0.093 | Prob(JB): | 0.00553 |
| Kurtosis: | 2.325 | Cond. No. | 19.4 |


In [92]:
compute_vif(X)

| | Features | VIF |
|---|---|---|
| 2 | hum | 16.17 |
| 1 | temp | 10.06 |
| 0 | workingday | 4.52 |
| 3 | windspeed | 3.69 |
| 10 | weathersit_Mist_Cloudy | 2.1 |
| 8 | weekday_Sun | 1.78 |
| 4 | season_summer | 1.76 |
| 5 | season_winter | 1.7 |
| 6 | mnth_Jul | 1.55 |
| 7 | mnth_Sep | 1.29 |


##### Interpretations:
- Adjusted R-squared is essentially unchanged (0.584 to 0.582), indicating that the removed variables add little to the fit of the model
- hum has a high VIF (16.17)

Let's remove hum.

In [93]:
# Remove atemp which has very high p value
# Remove mnth_May which has high p value
# Remove holiday which has high p value
# Remove mnth_Jun which has high p value
# Remove hum which has high VIF
cols_to_be_removed = ['atemp','mnth_May','holiday','mnth_Jun','hum']
X = X_train_rfe.drop(cols_to_be_removed,axis=1)

In [94]:
build_model(X,y_train)

| | | | |
|---|---|---|---|
| Dep. Variable: | cnt | R-squared: | 0.573 |
| Model: | OLS | Adj. R-squared: | 0.564 |
| Method: | Least Squares | F-statistic: | 66.83 |
| Date: | Mon, 07 Feb 2022 | Prob (F-statistic): | 1.49e-85 |
| Time: | 23:01:04 | Log-Likelihood: | 255.62 |
| No. Observations: | 510 | AIC: | -489.2 |
| Df Residuals: | 499 | BIC: | -442.7 |
| Df Model: | 10 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.1509 | 0.030 | 5.072 | 0.000 | 0.092 | 0.209 |
| workingday | 0.0591 | 0.018 | 3.284 | 0.001 | 0.024 | 0.094 |
| temp | 0.6383 | 0.035 | 18.039 | 0.000 | 0.569 | 0.708 |
| windspeed | -0.1380 | 0.040 | -3.420 | 0.001 | -0.217 | -0.059 |
| season_summer | 0.0775 | 0.018 | 4.424 | 0.000 | 0.043 | 0.112 |
| season_winter | 0.1275 | 0.017 | 7.510 | 0.000 | 0.094 | 0.161 |
| mnth_Jul | -0.0546 | 0.029 | -1.857 | 0.064 | -0.112 | 0.003 |
| mnth_Sep | 0.0779 | 0.026 | 2.967 | 0.003 | 0.026 | 0.129 |
| weekday_Sun | 0.0611 | 0.023 | 2.636 | 0.009 | 0.016 | 0.107 |

| | | | |
|---|---|---|---|
| Omnibus: | 26.746 | Durbin-Watson: | 2.079 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 11.473 |
| Skew: | 0.096 | Prob(JB): | 0.00323 |
| Kurtosis: | 2.29 | Cond. No. | 11.2 |


In [95]:
compute_vif(X)

| | Features | VIF |
|---|---|---|
| 1 | temp | 5.93 |
| 0 | workingday | 4.03 |
| 2 | windspeed | 3.4 |
| 3 | season_summer | 1.76 |
| 7 | weekday_Sun | 1.69 |
| 9 | weathersit_Mist_Cloudy | 1.54 |
| 5 | mnth_Jul | 1.52 |
| 4 | season_winter | 1.47 |
| 6 | mnth_Sep | 1.29 |
| 8 | weathersit_Light_Snow | 1.08 |


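##### Interpretations:
- Adjusted R-squared dropped from 0.582 to 0.564 after removing hum, while the highest remaining VIF (temp) fell to 5.93
- mnth_Jul now has a high p-value (0.064)

Let's remove mnth_Jul.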

In [97]:
# Remove atemp which has very high p value
# Remove mnth_May which has high p value
# Remove holiday which has high p value
# Remove mnth_Jun which has high p value
# Remove hum which has high VIF
# Remove mnth_Jul which has high p value
cols_to_be_removed = ['atemp','mnth_May','holiday','mnth_Jun','hum','mnth_Jul']
X = X_train_rfe.drop(cols_to_be_removed,axis=1)

In [98]:
build_model(X,y_train)

| | | | |
|---|---|---|---|
| Dep. Variable: | cnt | R-squared: | 0.57 |
| Model: | OLS | Adj. R-squared: | 0.562 |
| Method: | Least Squares | F-statistic: | 73.52 |
| Date: | Mon, 07 Feb 2022 | Prob (F-statistic): | 9.19e-86 |
| Time: | 23:02:36 | Log-Likelihood: | 253.86 |
| No. Observations: | 510 | AIC: | -487.7 |
| Df Residuals: | 500 | BIC: | -445.4 |
| Df Model: | 9 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.1562 | 0.030 | 5.260 | 0.000 | 0.098 | 0.214 |
| workingday | 0.0600 | 0.018 | 3.328 | 0.001 | 0.025 | 0.095 |
| temp | 0.6088 | 0.032 | 19.208 | 0.000 | 0.547 | 0.671 |
| windspeed | -0.1369 | 0.040 | -3.384 | 0.001 | -0.216 | -0.057 |
| season_summer | 0.0880 | 0.017 | 5.295 | 0.000 | 0.055 | 0.121 |
| season_winter | 0.1332 | 0.017 | 7.955 | 0.000 | 0.100 | 0.166 |
| mnth_Sep | 0.0905 | 0.025 | 3.559 | 0.000 | 0.041 | 0.140 |
| weekday_Sun | 0.0612 | 0.023 | 2.636 | 0.009 | 0.016 | 0.107 |
| weathersit_Light_Snow | -0.3288 | 0.040 | -8.250 | 0.000 | -0.407 | -0.251 |

| | | | |
|---|---|---|---|
| Omnibus: | 26.096 | Durbin-Watson: | 2.096 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 11.138 |
| Skew: | 0.08 | Prob(JB): | 0.00381 |
| Kurtosis: | 2.294 | Cond. No. | 10.9 |


In [99]:
compute_vif(X)

| | Features | VIF |
|---|---|---|
| 1 | temp | 4.43 |
| 0 | workingday | 4.0 |
| 2 | windspeed | 3.37 |
| 6 | weekday_Sun | 1.69 |
| 3 | season_summer | 1.57 |
| 8 | weathersit_Mist_Cloudy | 1.53 |
| 4 | season_winter | 1.39 |
| 5 | mnth_Sep | 1.2 |
| 7 | weathersit_Light_Snow | 1.08 |


##### Interpretations:
