# <font color = brown>Problem Statement</font>

**BoomBikes, a US bike-sharing provider, has recently suffered considerable dips in revenue and is finding it very difficult to sustain itself in the current market. It has therefore decided to come up with a mindful business plan to accelerate its revenue.<br>To that end, BoomBikes wants to understand the demand for shared bikes among the people. Specifically, it wants to understand the factors affecting the demand for these shared bikes in the American market.**

<font color = red><br>
- The company wants to know:
    - Which variables are significant in predicting the demand for shared bikes.
    - How well those variables describe the bike demand.</font>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

##### <font color = green>importing statsmodels library</font>

In [2]:
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

##### <font color = green>importing sklearn library</font>

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

In [4]:
import warnings
warnings.filterwarnings("ignore")

## <font color = brown>Reading data</font>

In [5]:
raw_data = pd.read_csv("day.csv")

FileNotFoundError: [Errno 2] No such file or directory: 'day.csv'

In [None]:
pd.set_option("display.max_columns",500)
pd.set_option("display.max_rows",raw_data.shape[0])
#pd.set_option("display.float_format", lambda x : "%.3f" %x)

In [None]:
raw_data.head()

In [None]:
raw_data.shape

In [None]:
raw_data.info()

-  We have **no null values** to deal with.
-  The variables $"season", "mnth", "weathersit"$ are stored as integers, so we have to convert them to categorical variables for further analysis.

In [None]:
# Cross checking the null values
raw_data.isnull().sum()

In [None]:
# Checking for duplicate records

raw_data.duplicated().sum()

In [None]:
raw_data.describe()

- Variables like $"temp", "atemp", "hum", "windspeed", "registered", "cnt"$ have very similar mean and median values, which indicates that these variables are not heavily skewed and that outliers in them are trivial.

## <font color = brown>Data Understanding</font>

In [None]:
num_vars = ["temp","atemp","hum","windspeed","casual","registered","cnt"]

In [None]:
corr =  raw_data[num_vars].corr()
corr

In [None]:
plt.figure(figsize = (10,6))
sns.heatmap(corr,cmap = "Greens",annot = True)
plt.show()

- The variable $"registered"$ is highly correlated with the target variable $"cnt"$.
- The variables $"casual", "atemp"$ and $"temp"$ are almost equally correlated with the target variable $"cnt"$.

-  There is a very high correlation of 0.99 between $"temp"$ and $"atemp"$. This is a clear sign of **multicollinearity**, so we have to drop one of the two variables.

<font color = navy><br>
**Checking for linear relation among highly correlated variables**</font>

In [None]:
plt.figure(figsize = (12,10))

# x and y are passed as keywords; recent seaborn versions reject positional data arguments
plt.subplot(2,2,1)
sns.regplot(x = raw_data["registered"], y = raw_data["cnt"], fit_reg = False)


plt.subplot(2,2,2)
sns.regplot(x = raw_data["casual"], y = raw_data["cnt"], fit_reg = False)

plt.subplot(2,2,3)
sns.regplot(x = raw_data["atemp"], y = raw_data["cnt"], fit_reg = False)

plt.subplot(2,2,4)
sns.regplot(x = raw_data["temp"], y = raw_data["cnt"], fit_reg = False)

plt.show()

-  The plots show that these variables have a positive, roughly linear relationship with the target variable.
-  This confirms them as potential variables for model building.

### <font color = navy>Checking for any sign of relationship between categorical variables and target variable "cnt"</font>

In [None]:
plt.figure(figsize = [16,12])

plt.subplot(3,3,1)
sns.boxplot(x = "season", y = "cnt", data = raw_data)

plt.subplot(3,3,2)
sns.boxplot(x = "yr", y = "cnt", data = raw_data)

plt.subplot(3,3,3)
sns.boxplot(x = "mnth", y = "cnt", data = raw_data)

plt.subplot(3,3,4)
sns.boxplot(x = "holiday", y = "cnt", data = raw_data)

plt.subplot(3,3,5)
sns.boxplot(x = "weekday", y = "cnt", data = raw_data)

plt.subplot(3,3,6)
sns.boxplot(x = "workingday", y = "cnt", data = raw_data)

plt.subplot(3,3,7)
sns.boxplot(x = "weathersit", y = "cnt", data = raw_data)

plt.show()

-  Except for $"weekday"$ and $"workingday"$, every other category has a noticeable influence on the target variable.

## <font color = brown>Data preparation</font>

##### <font color = navy>variable "atemp"</font>

-  As the variables $"temp"$ and $"atemp"$ are highly correlated with each other, one of them is redundant for model building, so we can drop either one.
-  Here I am **dropping variable "atemp"**

In [None]:
raw_data = raw_data.drop(["atemp"], axis = 1)

In [None]:
raw_data.head()

In [None]:
raw_data.instant.unique()

-  **It is clear that the variable $"instant"$ is just a serial number for the records and serves no analytical purpose in this model, so we can drop it.**

In [None]:
raw_data.drop(["instant"],axis = 1,inplace = True)

##### <font color = navy>variable "dteday"</font>

- This variable indicates the date of each record. As we already have separate variables for the year ($"yr"$, 2018 or 2019), the month ($"mnth"$) and the season ($"season"$), we can drop the variable $"dteday"$.

In [None]:
raw_data = raw_data.drop(["dteday"], axis = 1)

In [None]:
raw_data.head()

##### <font color = navy>variable "season"</font>

In [None]:
raw_data.season.value_counts()

-  As per the data dictionary, each of these numbers represents a specific season of the year. For better analysis we convert this variable into a categorical variable, and later create dummy variables from its levels.

In [None]:
raw_data["season"] = raw_data["season"].map({1 : "spring", 2 : "summer", 3 : "fall", 4 : "winter"})

In [None]:
# cross checking the variable that we have mapped.

raw_data["season"].value_counts()

In [None]:
raw_data["season"].dtype
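A small sketch of the mapping step on toy data (the `codes` series and its invalid value are made up for illustration): `Series.map` returns `NaN` for any value missing from the dictionary, so it can be worth checking for unmapped codes after the conversion.

```python
import pandas as pd

# Toy season codes; the 5 is a deliberately invalid code for illustration.
codes = pd.Series([1, 2, 3, 4, 5])
season_names = {1: "spring", 2: "summer", 3: "fall", 4: "winter"}

mapped = codes.map(season_names)

# Codes that are absent from the dictionary come back as NaN.
unmapped = codes[mapped.isna()].tolist()
print(unmapped)
```

An empty list here would mean every code was covered by the dictionary.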

##### <font color = navy>variable "yr"</font>

In [None]:
raw_data["yr"].value_counts()

-  Here the years 2018 and 2019 have been **encoded as 0 and 1**. It is advisable to convert them to separate categories for building a better model.

In [None]:
raw_data["yr"] = raw_data["yr"].map({1 : "2019", 0 : "2018"})

In [None]:
# cross checking the variable that we have mapped.

raw_data["yr"].value_counts()

##### <font color = navy>variable "mnth"</font>

In [None]:
raw_data["mnth"].value_counts()

-  According to the data dictionary these numbers represent the months of the year. Keeping them as integers would add no meaning to the model, so we convert them to a categorical variable with the month names as categories.<br><br>
- Later we can create dummy variables for these categories.

In [None]:
months_dict = {1 : "jan", 2 : "feb", 3 : "mar", 4 : "apr", 5 : "may", 6 : "jun",
               7 : "jul", 8 : "aug", 9 : "sep", 10 : "oct", 11 : "nov", 12 : "dec"}

In [None]:
raw_data["mnth"] = raw_data["mnth"].map(months_dict)

In [None]:
# cross checking the variable that we have mapped.

raw_data["mnth"].value_counts()

In [None]:
raw_data["mnth"].dtype

##### <font color = navy>variable "holiday"</font>

In [None]:
raw_data["holiday"].value_counts()

-  It is perfect for analysis

##### <font color = navy>variable "weekday"</font>

In [None]:
raw_data["weekday"].value_counts()

-  As mentioned in the data dictionary, the variable $"weekday"$ represents the days of the week, so keeping it as integers adds no value for model building. We therefore convert it to a categorical variable.


In [None]:
week_dict = {0 : "Sun", 1 : "Mon", 2 : "Tue", 3 : "Wed", 4 : "Thr", 5 : "Fri", 6 : "Sat"}

In [None]:
raw_data["weekday"] = raw_data["weekday"].map(week_dict)

In [None]:
# cross checking the data type of the variable "weekday"; it should now be the object data type.

raw_data["weekday"].dtype

##### <font color = navy>variable "workingday"</font>

In [None]:
raw_data["workingday"].value_counts()

-  It is perfect for analysis

##### <font color = navy>variable "weathersit"</font>

In [None]:
raw_data["weathersit"].value_counts()

- As per the data dictionary, certain weather conditions have been grouped together and **encoded as 1, 2, 3, so let's convert this variable to a categorical variable** for the purpose of model building.

In [None]:
raw_data["weathersit"] = raw_data["weathersit"].astype("object")

In [None]:
# cross checking the data type of the variable "weathersit", It has been perfectly changed to object data type.

raw_data["weathersit"].dtype

##### <font color = navy>variable "casual" & "registered"</font>

-  As our target variable is just the sum of these two variables, **keeping them for model building would explain all the variance** in the target variable, which makes the model unreliable.
-  The business problem is to find the driving variables that can boost the business after the lockdown is lifted, so we have to focus on those driving variables.
-  We are dropping both the $"casual"$ and $"registered"$ variables.

In [None]:
raw_data.drop(["casual", "registered"], axis = 1, inplace = True)
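The leakage argument can be illustrated on synthetic data. This is a sketch with made-up random counts, not the real `day.csv`: when the target is exactly the sum of two predictors, a least-squares fit recovers coefficients of 1 and an R2 of 1, which says nothing about what actually drives demand.

```python
import numpy as np

# Synthetic stand-ins for the two rider counts (illustrative values only).
rng = np.random.default_rng(0)
casual = rng.integers(0, 1000, size=50)
registered = rng.integers(0, 5000, size=50)
cnt = casual + registered

# Least-squares fit of cnt on [intercept, casual, registered].
X = np.column_stack([np.ones(50), casual, registered]).astype(float)
coef, *_ = np.linalg.lstsq(X, cnt.astype(float), rcond=None)

pred = X @ coef
ss_res = np.sum((cnt - pred) ** 2)
ss_tot = np.sum((cnt - cnt.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(np.round(coef, 6), round(r2, 6))
```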

In [None]:
raw_data.head()

### <font color = brown>creating dummy variables for categorical variables</font>

In [None]:
cat_var = raw_data.select_dtypes(exclude = ["int64", "float64"])
cat_var.columns

In [None]:
# dtype = int keeps the dummies numeric (recent pandas versions default to bool)
dum_df = pd.get_dummies(raw_data[cat_var.columns], drop_first = True, dtype = int)

In [None]:
dum_df.head()

In [None]:
dum_df.shape

-  So we have 23 dummy variables now
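The count follows from `drop_first = True`: a variable with k levels yields k - 1 dummy columns, so season (4), yr (2), mnth (12), weekday (7) and weathersit (3) give 3 + 1 + 11 + 6 + 2 = 23. A minimal sketch on a toy frame (the `toy` data is made up for illustration):

```python
import pandas as pd

# Toy column with four levels, mirroring the "season" variable.
toy = pd.DataFrame({"season": ["spring", "summer", "fall", "winter", "fall"]})

# The alphabetically first level ("fall") is dropped and becomes the baseline.
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)
print(list(dummies.columns))
```

Rows belonging to the baseline level are all zeros in the remaining dummy columns.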

##### <font color = brown>concatenating dum_df & raw_data and dropping the categorical variables</font>

In [None]:
bike_data = pd.concat([raw_data, dum_df], axis = 1)

In [None]:
bike_data.head()

In [None]:
bike_data.shape

In [None]:
bike_data.drop(cat_var.columns, axis = 1, inplace = True)

In [None]:
bike_data.head()

In [None]:
bike_data.shape

-  In total, we have 29 columns.

## <font color = brown>Now the Data is ready for model building</font>

### <font color = navy>Creating train and test data</font>

In [None]:
df_train, df_test = train_test_split(bike_data, test_size = 0.3, random_state = 100)

In [None]:
print(df_train.shape)

print(df_test.shape)

## <font color = brown>Scaling the variables</font>

-  **As the business problem seeks to find the major factors that drive the bike count, we need to interpret the coefficients of the model.**
-  **So the best way is to bring the variables onto a common scale.**

In [None]:
scaler = MinMaxScaler()

In [None]:
num_var_scale = ["temp", "hum", "windspeed", "cnt"]

In [None]:
df_train[num_var_scale] = scaler.fit_transform(df_train[num_var_scale])

In [None]:
# Cross checking the scaled variables

df_train.describe()

-  The **min and max** values show that all the numerical variables have been **scaled to the range 0 to 1.**

## <font color = brown>Model Building</font>

In [None]:
X_train = df_train.drop(["cnt"], axis = 1, inplace = False)
y_train = df_train["cnt"]

In [None]:
lm = LinearRegression()

In [None]:
# Initial training of the model with all the features.

lm.fit(X_train,y_train)

In [None]:
lm.coef_

In [None]:
lm.intercept_

-  **So the model has learnt the coefficients for all the variables, as well as the intercept.**

## <font color = navy>Obtaining the top 15 features using RFE ( Coarse tuning )</font>

$$NOTE$$
-  **I selected the top 15 features after trying several values by trial and error to find a suitable number of features for the model.**

In [None]:
# Automated feature selection using Recursive Feature Elimination

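# n_features_to_select must be passed as a keyword in current scikit-learn versions
rfe = RFE(lm, n_features_to_select = 15)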

In [None]:
rfe = rfe.fit(X_train, y_train)

In [None]:
list(zip(X_train.columns, rfe.support_, rfe.ranking_))

In [None]:
# Filtering only the top 15 features obtained by RFE (Coarse tuning)

new_col = X_train.columns[rfe.support_]
new_col
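As a rough illustration of what RFE does, the sketch below runs it on synthetic data in which only two of six columns influence the target (the data, seed and coefficients are assumptions made for this example). RFE repeatedly refits the estimator and discards the feature with the smallest absolute coefficient until the requested number remains.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: only columns 0 and 3 drive the target.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

# Keep the two features that survive recursive elimination.
selector = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)

selected = np.flatnonzero(selector.support_).tolist()
print(selected, selector.ranking_.tolist())
```

Selected features get rank 1 in `ranking_`; higher ranks were eliminated earlier.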

## <font color = brown>Manual feature elimination ( Fine Tuning )</font>

#### <font color = navy>building the model using statsmodels for a detailed statistics summary</font>

In [None]:
X_train = X_train[new_col]

In [None]:
X_train_sm = sm.add_constant(X_train)

In [None]:
lr_sm = sm.OLS(y_train, X_train_sm).fit()

In [None]:
lr_sm.summary()

-  All the variables are **highly significant**, except for **"mnth_dec"**, whose p-value is **slightly** higher.
-  We can drop this variable and check whether R2 and adjusted R2 fall significantly, so that we can be confident about the model even after dropping it.
-  Before dropping the variable we can also look at the VIF values to check for multicollinearity.

##### <font color = navy>Checking the Multicollinearity between the independant variables</font>

In [None]:
# Variance Inflation Factor

def vif_value(X_train_sm):
    vif = pd.DataFrame()
    vif["Features"] = X_train_sm.columns
    vif["VIF"] = [variance_inflation_factor(X_train_sm.values, i) for i in range (X_train_sm.shape[1])]
    vif["VIF"] = round(vif["VIF"],2)
    vif = vif.sort_values(by="VIF", ascending = False)
    return vif

In [None]:
vif_value(X_train_sm)

-  The **VIF** values are **good for all variables** except **"season_spring"**.
-  This strongly implies that there is only **very trivial multicollinearity** among the selected features.
-  So dropping the insignificant variable first should improve the summary statistics further.
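For reference, the VIF of a feature is 1 / (1 - R2), where R2 comes from regressing that feature on all the others. The sketch below computes it by hand with NumPy on synthetic data; the helper `vif` and the columns `a`, `b`, `c` are hypothetical names for illustration. Unlike the `vif_value` helper above, which relies on the constant column already being in the frame, this version adds the intercept itself.

```python
import numpy as np

def vif(X, i):
    """VIF of column i: regress it on the other columns plus an intercept."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r2)

# Synthetic features: c is nearly a copy of a, b is independent.
rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
X = np.column_stack([a, b, c])

vifs = [vif(X, i) for i in range(X.shape[1])]
print([round(v, 2) for v in vifs])
```

The two nearly identical columns get large VIFs, while the independent one stays close to 1.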

##### <font color = navy>dropping the variable "mnth_dec"</font>

-  Its p-value is only slightly higher than the others, so this is a **trial and error check of the model's significance after dropping this variable.**<br><br>
-  **By checking the R2 and adjusted R2 values after dropping the variable** we can decide whether to keep or drop it in the model.

In [None]:
X_train = X_train.drop("mnth_dec", axis = 1)

new_col = X_train.columns

X_train = X_train[new_col]

In [None]:
X_train_sm = sm.add_constant(X_train)

lr_sm2 = sm.OLS(y_train, X_train_sm).fit()

lr_sm2.summary()

In [None]:
vif_value(X_train_sm)

-  The **VIF** values are **good for all variables** except **"season_spring"**.
-  This strongly implies that there is only **very trivial multicollinearity** among the selected features.
-  So dropping the insignificant variable first should improve the summary statistics further.

##### <font color = navy>dropping the variable "mnth_nov"</font>
-  Its high p-value indicates that the variable "mnth_nov" is not significant for the model.

In [None]:
X_train = X_train.drop("mnth_nov", axis = 1)

new_col = X_train.columns

X_train = X_train[new_col]

In [None]:
X_train_sm = sm.add_constant(X_train)

lr_sm3 = sm.OLS(y_train, X_train_sm).fit()

lr_sm3.summary()

In [None]:
vif_value(X_train_sm)

-  The VIF values are good for all variables except "season_spring".
-  This strongly implies that there is only **very trivial multicollinearity** among the selected features.
-  So dropping the insignificant variable first should improve the summary statistics further.

##### <font color = navy>dropping the variable "mnth_jan"</font>
-  The p-value of "mnth_jan" is above the significance threshold, so the variable is not significant for the model.

In [None]:
X_train = X_train.drop("mnth_jan", axis = 1)

new_col = X_train.columns

X_train = X_train[new_col]

In [None]:
X_train_sm = sm.add_constant(X_train)

lr_sm4 = sm.OLS(y_train, X_train_sm).fit()

lr_sm4.summary()

In [None]:
vif_value(X_train_sm)

##### <font color = navy>dropping the variable "season_spring"</font>
-  Its VIF is above the threshold of 5. Since VIF = 1 / (1 - R2), this indicates that **more than 80% of the variance in "season_spring" can be explained by the other variables combined.**

In [None]:
X_train = X_train.drop("season_spring", axis = 1)

new_col = X_train.columns

X_train = X_train[new_col]

In [None]:
X_train_sm = sm.add_constant(X_train)

lr_sm5 = sm.OLS(y_train, X_train_sm).fit()

lr_sm5.summary()

In [None]:
vif_value(X_train_sm)

-  The **VIF** values are now **good for all variables.**
-  This strongly implies that there is only **very trivial multicollinearity** among the selected features.

## <font color = brown>Linear model assumption checking</font>
<font color = navy><br>
-  **Checking normally distributed residuals.**</font>

In [None]:
y_train_pred = lr_sm5.predict(X_train_sm)

In [None]:
res = y_train - y_train_pred

# Mean of the residuals
res.mean()

-  It is very close to zero, indicating that the residuals have zero mean.

In [None]:
# histplot with a KDE overlay replaces the deprecated distplot
sns.histplot(res, kde = True)
plt.show()

The plot shows that the **residuals are approximately normally distributed, centred on 0.**

<font color = navy><br>
-  **Checking for Homoscedasticity**</font>

In [None]:
# Residuals against fitted values: an even band around zero suggests constant variance
plt.scatter(y_train_pred, res)
plt.show()

The model shows **almost constant variance of the residual terms.**
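Beyond eyeballing the plot, one crude numeric check is to compare the spread of the residuals for low and high fitted values. The sketch below does this on synthetic data that is homoscedastic by construction; the data, the seed and the half-and-half split are assumptions for illustration, and this is a heuristic rather than a formal test.

```python
import numpy as np

# Synthetic fitted values and constant-variance residuals (illustrative only).
rng = np.random.default_rng(7)
fitted = rng.uniform(0, 1, size=1000)
resid = rng.normal(scale=0.1, size=1000)

# Split the residuals by fitted value and compare their standard deviations.
order = np.argsort(fitted)
low, high = resid[order[:500]], resid[order[500:]]
ratio = high.std() / low.std()
print(round(ratio, 2))
```

A ratio far from 1 would hint that the residual variance changes with the fitted value.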

## <font color = brown>Testing the model in the test set (Unseen data)</font>

In [None]:
df_test[num_var_scale] = scaler.transform(df_test[num_var_scale])

In [None]:
df_test.describe()

-  The **min and max** values are only **around 0 and 1** respectively, because the **MinMaxScaler scales the test data using the min and max values it learned from the training data**.
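A hand-rolled sketch of the min-max formula, (x - min) / (max - min), shows why test values can fall outside 0 to 1 when the minimum and maximum come from the training data. The arrays below are toy values, and the helper `min_max` is a hypothetical stand-in for the fitted scaler.

```python
import numpy as np

# Toy train and test values (illustrative only).
train = np.array([10.0, 20.0, 30.0])
test = np.array([5.0, 25.0, 35.0])

# "Fit": learn the range from the training data only.
lo, hi = train.min(), train.max()

def min_max(x):
    """Scale x with the range learned from the training data."""
    return (x - lo) / (hi - lo)

train_scaled = min_max(train)
test_scaled = min_max(test)
print(train_scaled, test_scaled)
```

Test values below the training minimum become negative, and values above the training maximum exceed 1.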

##### <font color = navy>Preparing test data</font>

In [None]:
X_test = df_test.drop("cnt",axis = 1)
y_test = df_test["cnt"]

In [None]:
X_test = X_test[X_train.columns]

In [None]:
X_test.head()

In [None]:
y_test.head()

In [None]:
X_test_sm = sm.add_constant(X_test)

In [None]:
X_test_sm.head()

### <font color = navy>predicting y_test values</font>

In [None]:
y_test_pred = lr_sm5.predict(X_test_sm)

In [None]:
y_test_pred.head()

In [None]:
test_res = y_test - y_test_pred
test_res.head()

## <font color = brown>Model validation</font>

<font color = navy><br>
**R2 value of test data**</font>

In [None]:
test_R2 = r2_score(y_true = y_test, y_pred = y_test_pred)
print("test_R2 value =",round(test_R2,3))

<font color = navy><br>
**R2 value of train data**</font>

In [None]:
train_R2 = r2_score(y_true = y_train, y_pred = y_train_pred)
print("train_R2 value =",round(train_R2,3))

<font color = navy><br>
**adjusted R2 value of test data**</font>

In [None]:
# N -> number of rows in the dataset
N = X_test_sm.shape[0]

# p -> number of predictors (the constant column is excluded)
p = X_test_sm.shape[1] - 1

# formula for adjusted R2

adjusted_test_R2 = 1 - ((1-test_R2)*(N-1)/(N-p-1))
print("adjusted_test_R2 value =",round(adjusted_test_R2,3))

<font color = navy><br>
**adjusted R2 value of train data**</font>

In [None]:
# N -> number of rows in the dataset
N = X_train_sm.shape[0]

# p -> number of predictors (the constant column is excluded)
p = X_train_sm.shape[1] - 1

# formula for adjusted R2

adjusted_train_R2 = 1 - ((1-train_R2)*(N-1)/(N-p-1))
print("adjusted_train_R2 value =",round(adjusted_train_R2,3))
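The adjusted R2 formula used in the two cells above can be wrapped in a small helper. This is a sketch: the function name `adjusted_r2` and the example numbers are made up, and `n_predictors` is meant to count the features without the constant.

```python
def adjusted_r2(r2, n_rows, n_predictors):
    """Adjusted R2 = 1 - (1 - R2) * (N - 1) / (N - p - 1)."""
    return 1 - (1 - r2) * (n_rows - 1) / (n_rows - n_predictors - 1)

# Illustrative numbers only: R2 = 0.80 with 101 rows and 10 predictors.
value = adjusted_r2(0.80, n_rows=101, n_predictors=10)
print(round(value, 4))
```

The penalty grows with the number of predictors, so adjusted R2 is always at most R2.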

<font color = navy><br>
**The model generalizes the unseen test data very well**</font>

# <font color = brown>Final interpretation of the model</font>

-  The variables in the equation below are the driving factors of the target variable, so for the business to bounce back it has to concentrate on these key factors.

<font color = navy><br>
**cnt = 0.2265 X const - 0.0994 X holiday + 0.5982 X temp - 0.1850 X hum - 0.1895 X windspeed + 0.0815 X season_summer + 0.1358 X season_winter + 0.2284 X yr_2019 - 0.0481 X mnth_jul + 0.0959 X mnth_sep - 0.0505 X weathersit_2 - 0.2322 X weathersit_3**</font>

***

### Top driving factors of the business

##### variable "temp"

-  **If the min-max scaled temperature increases by 1 unit, with the other variables held fixed, the scaled bike count will <font color = red>Increase</font> by about 0.60 units**

##### variable "weathersit_3"

-  **When the weather situation is category 3 (weathersit_3 = 1) rather than the baseline category 1, with the other variables held fixed, the scaled bike count will <font color = red>Decrease</font> by about 0.23 units**
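Because both "temp" and "cnt" were min-max scaled, a coefficient in scaled units can be converted back to raw units by multiplying by the ratio of the two ranges. The sketch below uses hypothetical ranges that are not taken from `day.csv`; in the notebook the real values should be available from the fitted scaler's `data_min_` and `data_max_` attributes.

```python
# Hypothetical raw ranges, for illustration only (not taken from day.csv).
temp_min, temp_max = 0.0, 40.0
cnt_min, cnt_max = 0.0, 8000.0

# Coefficient of temp from the fitted equation, in scaled units.
beta_scaled = 0.5982

# Change in raw cnt per one raw unit of temp.
beta_raw = beta_scaled * (cnt_max - cnt_min) / (temp_max - temp_min)
print(round(beta_raw, 2))
```

With these placeholder ranges, one extra unit of raw temperature would correspond to roughly 120 more rentals; the real figure depends on the actual ranges.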