## Bike Sharing Case Study

#### Problem Statement:

A US bike-sharing provider, BoomBikes, has recently suffered a considerable dip in revenue due to the ongoing Corona pandemic. The company is finding it very difficult to sustain itself in the current market. It has therefore decided to come up with a mindful business plan to accelerate its revenue as soon as the ongoing lockdown ends and the economy returns to a healthy state.
Essentially, the company wants:


- To identify the variables which are significant in predicting the demand for shared bikes.

- To create a linear model that quantifies how well those variables describe bike demand.

- To know the accuracy of the model, i.e. how well these variables can predict bike demand.

**So interpretation is important!**

## Step 1: Reading and Understanding the Data

Let us first import NumPy and Pandas and read the bike-sharing dataset.

In [None]:
# Suppress warnings

import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd

In [None]:
day = pd.read_csv("../input/boom-bike-dataset/bike_sharing_data.csv")

In [None]:
day.head()

Inspect the various aspects of the bike-sharing dataframe.

In [None]:
day.shape

In [None]:
day.info()

In [None]:
day.describe()

## Step 2: Visualising the Data


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

#### Visualising Numeric Variables

Let's make a pairplot of all the numeric variables

In [None]:
sns.pairplot(day)
plt.show()

#### Visualising Categorical Variables

As you might have noticed, there are a few categorical variables as well. Let's make a boxplot for some of these variables.

In [None]:
plt.figure(figsize=(20, 12))
plt.subplot(2,3,1)
sns.boxplot(x = 'weathersit', y = 'cnt', data = day)
plt.subplot(2,3,2)
sns.boxplot(x = 'season', y = 'cnt', data = day)
plt.subplot(2,3,3)
sns.boxplot(x = 'mnth', y = 'cnt', data = day)
plt.subplot(2,3,4)
sns.boxplot(x = 'holiday', y = 'cnt', data = day)
plt.subplot(2,3,5)
sns.boxplot(x = 'weekday', y = 'cnt', data = day)
plt.show()

## Step 3: Data Preparation

Dropping unnecessary variables - 'instant', 'dteday', 'casual', 'registered'

In [None]:
day.drop(['instant','dteday', 'casual', 'registered'], axis = 1, inplace = True)

In [None]:
day.head()

### Dummy Variables

The variables 'season', 'mnth', 'weekday' and 'weathersit' each have multiple levels.

For this, we will use dummy variables.
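To make the encoding concrete, here is a minimal sketch on a hypothetical toy column (not the real dataset). It assumes we map the codes to labels first and fix the category order, so that `drop_first=True` removes the intended baseline level rather than whichever label sorts first alphabetically.

```python
import pandas as pd

# Hypothetical toy column using the same 1-4 season codes as the dataset.
toy = pd.DataFrame({"season": [1, 2, 3, 4, 2]})

# Map codes to labels and fix the category order, so that
# drop_first=True removes 'spring' (the first category) as the baseline.
labels = {1: "spring", 2: "summer", 3: "fall", 4: "winter"}
toy["season"] = pd.Categorical(
    toy["season"].map(labels),
    categories=["spring", "summer", "fall", "winter"],
)

# dtype=int keeps the dummies numeric (newer pandas versions default to bool).
season_dummies = pd.get_dummies(toy["season"], drop_first=True, dtype=int)
print(season_dummies)
```

A row with all three dummies equal to 0 is a spring day, which is why one level can be dropped without losing information.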

The variable 'season' has four levels
- `1` will correspond to `spring`
- `2` will correspond to `summer`
- `3` will correspond to `fall`
- `4` will correspond to `winter`

In [None]:
# drop_first=True drops the first level (1, spring), which becomes the baseline
# dtype=int keeps the dummies numeric (newer pandas versions default to bool)
season_status = pd.get_dummies(day['season'], drop_first = True, dtype = int)

season_status = season_status.rename(columns ={ 1:'spring',2:'summer',
                                                3:'fall',
                                                4:'winter'})
# Add the results to the original bike-sharing dataframe
day = pd.concat([day, season_status], axis = 1)

# Now let's see the head of our dataframe.
day.head()


In [None]:
# Dropping 'season' as we have created the dummies for it
day.drop(['season'], axis = 1, inplace = True)

day.head()

The variable 'mnth' has 12 levels from 1 to 12

- 1 will correspond to January .....
- 12 will correspond to December


In [None]:
# drop_first=True drops the first level (1, January), which becomes the baseline
month = pd.get_dummies(day['mnth'], drop_first = True, dtype = int)

month = month.rename(columns ={ 1:'January',2:'February',3:'March',4:'April',5:'May',6:'June',7:'July',8:'August',9:'September',10:'October',11:'November',12:'December'})
# Add the results to the original bike-sharing dataframe
day = pd.concat([day, month], axis = 1)

# Dropping 'mnth' as we have created the dummies for it
day.drop(['mnth'], axis = 1, inplace = True)

# Now let's see the head of our dataframe.
day.head()

The variable 'weekday' has 7 levels from 0 to 6

- 0 will correspond to Sunday
- 1 will correspond to Monday .....
- 6 will correspond to Saturday


In [None]:
# drop_first=True drops the first level (0, Sunday), which becomes the baseline
new_weekday = pd.get_dummies(day['weekday'], drop_first = True, dtype = int)

new_weekday = new_weekday.rename(columns ={ 0:'Sunday',1:'Monday',2:'Tuesday',3:'Wednesday',4:'Thursday',5:'Friday',6:'Saturday'})

# Add the results to the original bike-sharing dataframe
day = pd.concat([day, new_weekday], axis = 1)

# Dropping 'weekday' as we have created the dummies for it
day.drop(['weekday'], axis = 1, inplace = True)

# Now let's see the head of our dataframe.
day.head()

The variable 'weathersit' has 4 levels

- 1 will correspond to Clear, Few clouds, Partly cloudy
- 2 will correspond to Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3 will correspond to Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
- 4 will correspond to Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog



In [None]:
# drop_first=True drops the first level (1, Clear), which becomes the baseline
weathersituation = pd.get_dummies(day['weathersit'], drop_first = True, dtype = int)

weathersituation = weathersituation.rename(columns ={ 1:'Clear',2:'Mist',3:'Light Snow',4:'Heavy Rain'})

# Add the results to the original bike-sharing dataframe
day = pd.concat([day, weathersituation], axis = 1)

# Dropping 'weathersit' as we have created the dummies for it
day.drop(['weathersit'], axis = 1, inplace = True)

# Now let's see the head of our dataframe.
day.head()

## Step 4: Splitting the Data into Training and Testing Sets

The first basic step for regression is performing a train-test split.

In [None]:
from sklearn.model_selection import train_test_split

# random_state fixes the shuffle, so the train and test sets always contain the same rows
np.random.seed(0)
df_train, df_test = train_test_split(day, train_size = 0.7, test_size = 0.3, random_state = 100)

### Rescaling the Features
We will use MinMax scaling.
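Min-max scaling maps each value to x' = (x - min) / (max - min), where min and max are learned from the training data only. The sketch below uses NumPy with hypothetical temperature values (the helper names are ours, not part of scikit-learn) to show why test values can land outside [0, 1] when the training range is reused.

```python
import numpy as np

def fit_min_max(train_col):
    """Return the (min, max) learned from the training column."""
    return float(np.min(train_col)), float(np.max(train_col))

def apply_min_max(col, col_min, col_max):
    """Scale values with x' = (x - min) / (max - min)."""
    return (np.asarray(col, dtype=float) - col_min) / (col_max - col_min)

# Hypothetical temperatures, not taken from the dataset.
train_temp = np.array([5.0, 10.0, 20.0, 25.0])
test_temp = np.array([0.0, 15.0, 30.0])

# Learn the range on the training data, then reuse it for the test data.
lo, hi = fit_min_max(train_temp)
train_scaled = apply_min_max(train_temp, lo, hi)
test_scaled = apply_min_max(test_temp, lo, hi)

print(train_scaled)
print(test_scaled)
```

This mirrors the `fit_transform` on the training set and plain `transform` on the test set used in this notebook.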

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [None]:
# Applying scaler() to all the columns except the 'yes-no' and 'dummy' variables
num_vars = ['temp','atemp', 'hum', 'windspeed', 'cnt' ]

df_train[num_vars] = scaler.fit_transform(df_train[num_vars])

df_train.head()

In [None]:
df_train.describe()

In [None]:
# Let's check the correlation coefficients to see which variables are highly correlated

plt.figure(figsize = (26, 15))
sns.heatmap(df_train.corr(), annot = True, cmap="YlGnBu")
plt.show()

As noticed here, 'atemp' seems to be correlated to 'cnt' the most. Let's see a scatter plot for 'atemp' vs 'cnt'.

In [None]:
plt.figure(figsize=[6,6])
plt.scatter(df_train.atemp, df_train.cnt)
plt.show()
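Reading a large heatmap by eye is error-prone, so the ranking can also be checked numerically. The sketch below uses synthetic data (the column names only mimic the dataset) and sorts features by the absolute value of their correlation with the target.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training frame; values are made up.
rng = np.random.default_rng(0)
n = 200
temp = rng.normal(size=n)
toy = pd.DataFrame({
    "temp": temp,
    "atemp": temp + rng.normal(scale=0.1, size=n),   # nearly a copy of temp
    "windspeed": rng.normal(size=n),
})
toy["cnt"] = 2.0 * toy["temp"] - 0.5 * toy["windspeed"] + rng.normal(scale=0.5, size=n)

# Correlation of every feature with the target, strongest first.
corr_with_target = toy.corr()["cnt"].drop("cnt")
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)

# Highly correlated predictors (here temp and atemp) hint at multicollinearity.
print(toy.corr().loc["temp", "atemp"])
```

On the real data the same two lines would be applied to `df_train` in place of the toy frame.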

So, we pick 'atemp' as the first variable and we'll try to fit a regression line to that.

### Dividing into X and Y sets for the model building


In [None]:
y_train = df_train.pop('cnt')
X_train = df_train

## Step 5: Building a model using RFE

Using the LinearRegression class from scikit-learn for its compatibility with RFE (Recursive Feature Elimination).

### RFE

In [None]:
# Importing RFE and LinearRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

In [None]:
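# Running RFE with the number of output variables equal to 10
lm = LinearRegression()
lm.fit(X_train,y_train)

# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe = RFE(lm, n_features_to_select = 10)    # running RFE
rfe = rfe.fit(X_train,y_train)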

In [None]:
list(zip(X_train.columns,rfe.support_,rfe.ranking_))

In [None]:
col = X_train.columns[rfe.support_]
col

In [None]:
X_train.columns[~rfe.support_]
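To see what the elimination loop is doing, here is a hand-rolled illustration in NumPy on synthetic data. It is a simplified sketch of the idea (refit, drop the weakest feature, repeat), not scikit-learn's implementation. It assumes the features are on a common scale so that coefficient sizes are comparable, and `simple_rfe` is a hypothetical helper name.

```python
import numpy as np

def simple_rfe(X, y, n_keep):
    """Refit OLS repeatedly, dropping the feature with the smallest |coefficient|."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # Ordinary least squares with an intercept on the remaining columns.
        A = np.column_stack([np.ones(len(y)), X[:, remaining]])
        coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
        # Skip the intercept, find the weakest feature and eliminate it.
        weakest = int(np.argmin(np.abs(coefs[1:])))
        remaining.pop(weakest)
    return remaining

# Synthetic data: only columns 0 and 3 actually drive the target.
rng = np.random.default_rng(42)
X = rng.uniform(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=300)

selected = simple_rfe(X, y, n_keep=2)
print(selected)
```

The indices that survive play the same role as the columns flagged by `rfe.support_` above.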

### Building model using statsmodels, for the detailed statistics

In [None]:
# Creating X_train_rfe dataframe with RFE selected variables
X_train_rfe = X_train[col]

In [None]:
# Adding a constant variable
import statsmodels.api as sm
X_train_rfe = sm.add_constant(X_train_rfe)

In [None]:
lm = sm.OLS(y_train,X_train_rfe).fit()  # Running the Linear Model

In [None]:
# Summary of the Linear Model
print(lm.summary())

In [None]:
# Calculate the VIFs for the new model
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
X = X_train_rfe
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values,i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by="VIF", ascending = False)
vif
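The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The sketch below recomputes it by hand in NumPy on synthetic columns, adding the intercept explicitly. `vif_by_hand` is a hypothetical helper for illustration, not the statsmodels function.

```python
import numpy as np

def vif_by_hand(X):
    """VIF_i = 1 / (1 - R_i^2), regressing column i on the other columns."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for i in range(X.shape[1]):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        # Regress the chosen column on the rest, with an intercept.
        A = np.column_stack([np.ones(len(target)), others])
        fitted = A @ np.linalg.lstsq(A, target, rcond=None)[0]
        ss_res = np.sum((target - fitted) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic columns: b is nearly a copy of a, c is independent.
rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)
c = rng.normal(size=500)

vifs = vif_by_hand(np.column_stack([a, b, c]))
print(np.round(vifs, 2))
```

A common rule of thumb treats a VIF above 5 or 10 as a sign of problematic multicollinearity. The value reported for the `const` column itself is usually not interpreted.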

    

## Residual Analysis of the train data

Now, to check whether the error terms are normally distributed (which is, in fact, one of the major assumptions of linear regression), let us plot a histogram of the error terms and see what it looks like.

In [None]:
y_train_cnt = lm.predict(X_train_rfe)

In [None]:
# Importing the required libraries for plots.
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
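# Plot the histogram of the error terms
fig = plt.figure()
# histplot with kde=True replaces distplot, which is deprecated in recent seaborn versions
sns.histplot((y_train - y_train_cnt), bins = 20, kde = True, stat = 'density')
fig.suptitle('Error Terms', fontsize = 20)                  # Plot heading 
plt.xlabel('Errors', fontsize = 18)                         # X-label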
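A histogram is a visual check only. As a rough numeric complement, the sketch below computes the skewness and excess kurtosis of a set of residuals, both of which should be close to 0 for normally distributed errors. It uses synthetic residuals and a hypothetical helper name; it is not a formal normality test.

```python
import numpy as np

def skew_and_excess_kurtosis(residuals):
    """Return (skewness, excess kurtosis) of the standardised residuals."""
    r = np.asarray(residuals, dtype=float)
    z = (r - r.mean()) / r.std()
    return float(np.mean(z ** 3)), float(np.mean(z ** 4) - 3.0)

# Synthetic residuals: one roughly normal set, one clearly skewed set.
rng = np.random.default_rng(7)
normal_like = rng.normal(size=5000)
skewed = rng.exponential(size=5000)

normal_stats = skew_and_excess_kurtosis(normal_like)
skewed_stats = skew_and_excess_kurtosis(skewed)
print(normal_stats)
print(skewed_stats)
```

On the real model the input would be `y_train - y_train_cnt`.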

## Step 6 - Making Predictions

#### Applying the scaling on the test sets

In [None]:
num_vars = ['temp','atemp','hum', 'windspeed', 'cnt' ]
df_test[num_vars] = scaler.transform(df_test[num_vars])

#### Dividing into X_test and y_test

In [None]:
y_test = df_test.pop('cnt')
X_test = df_test

In [None]:
# Now let's use our model to make predictions.


# Adding a constant variable 
X_test_new = sm.add_constant(X_test)


# Creating X_test_new dataframe by dropping variables from X_test
X_test_new = X_test_new[X_train_rfe.columns]



In [None]:
# Making predictions
y_pred = lm.predict(X_test_new)


## Step 7: Model Evaluation

Let's now plot the graph for actual versus predicted values.

In [None]:
# Plotting y_test and y_pred to understand the spread

fig = plt.figure()
plt.scatter(y_test, y_pred)
fig.suptitle('y_test vs y_pred', fontsize = 20)              # Plot heading 
plt.xlabel('y_test', fontsize = 18)                          # X-label
plt.ylabel('y_pred', fontsize = 16)      
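The scatter plot shows the spread, but the accuracy asked for in the problem statement is usually summarised by R² on the test set. The sketch below computes R² from its definition with NumPy, on made-up values.

```python
import numpy as np

def r_squared(y_true, y_hat):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

# Made-up actual and predicted values on the scaled 0-1 range.
y_true_demo = np.array([0.2, 0.4, 0.6, 0.8])
y_hat_demo = np.array([0.25, 0.35, 0.65, 0.75])

score = r_squared(y_true_demo, y_hat_demo)
print(round(score, 4))
```

In this notebook the same quantity should be available as `sklearn.metrics.r2_score(y_test, y_pred)`.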


We can see that the equation of our best fitted line is:

$ cnt = 0.2265 \times yr - 0.0897 \times holiday + 0.5666 \times temp - 0.2864 \times hum - 0.2014 \times windspeed + 0.1002 \times summer + 0.1521 \times winter + 0.0494 \times August + 0.1187 \times September - 0.1917 \times \text{Light Snow} $


'temp', 'yr' and 'winter' are some of the variables which help to increase the count significantly.
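Since interpretation matters here, the sketch below reads the coefficients as "change in scaled cnt when one variable changes and everything else stays equal". The coefficients are copied from the equation above, while `scaled_change` is a hypothetical helper. Converting the result back to raw counts would require the training range of 'cnt' stored in the scaler, which is not shown here.

```python
# Coefficients copied from the fitted equation above (scaled-cnt units).
coefs = {
    "yr": 0.2265, "holiday": -0.0897, "temp": 0.5666, "hum": -0.2864,
    "windspeed": -0.2014, "summer": 0.1002, "winter": 0.1521,
    "August": 0.0494, "September": 0.1187, "Light Snow": -0.1917,
}

def scaled_change(before, after):
    """Change in predicted scaled cnt between two feature settings.

    Features missing from a dict are treated as 0. The model's intercept
    cancels out in the difference, so it is not needed here.
    """
    return sum(
        coef * (after.get(name, 0.0) - before.get(name, 0.0))
        for name, coef in coefs.items()
    )

# Effect of 'yr' going from 0 to 1, everything else equal.
year_effect = scaled_change({"yr": 0}, {"yr": 1})

# Effect of a light-snow day compared with the baseline weather.
snow_effect = scaled_change({"Light Snow": 0}, {"Light Snow": 1})

print(round(year_effect, 4), round(snow_effect, 4))
```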