The dataset used is the Bike Sharing Counts Dataset available [here](https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset). 

In [25]:
# importing libraries
import pandas as pd

from sklearn.model_selection import train_test_split

In [44]:
df = pd.read_csv("Bike_Sharing_Dataset_Cleaned.csv", index_col=0)

df.head()

Unnamed: 0,season,yr,holiday,workingday,weathersit,temp,hum,windspeed,cnt,days_since_2011
0,1,0,0,0,2,0.344167,0.805833,0.160446,985,0
1,1,0,0,0,2,0.363478,0.696087,0.248539,801,1
2,1,0,0,1,1,0.196364,0.437273,0.248309,1349,2
3,1,0,0,1,1,0.2,0.590435,0.160296,1562,3
4,1,0,0,1,1,0.226957,0.436957,0.1869,1600,4


The two columns with more than two categories, season and weathersit, are dummy coded.

In [45]:
df1 = pd.get_dummies(df, columns=['season', 'weathersit'])

df1.head()

Unnamed: 0,yr,holiday,workingday,temp,hum,windspeed,cnt,days_since_2011,season_1,season_2,season_3,season_4,weathersit_1,weathersit_2,weathersit_3
0,0,0,0,0.344167,0.805833,0.160446,985,0,1,0,0,0,0,1,0
1,0,0,0,0.363478,0.696087,0.248539,801,1,1,0,0,0,0,1,0
2,0,0,1,0.196364,0.437273,0.248309,1349,2,1,0,0,0,1,0,0
3,0,0,1,0.2,0.590435,0.160296,1562,3,1,0,0,0,1,0,0
4,0,0,1,0.226957,0.436957,0.1869,1600,4,1,0,0,0,1,0,0


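A categorical feature with l levels needs only (l-1) dummy columns. If all l columns are kept, they sum to 1 in every row, which is perfectly collinear with the intercept, so the model is unnecessarily overparameterised.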
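As a small illustration of the (l-1) rule, the sketch below dummy-codes a toy `season` column twice: once with all levels and once with `drop_first=True`, a `pd.get_dummies` option that omits the first level. The toy data frame is invented for this example. Note that the notebook itself drops the last level of each feature by hand rather than the first.

```python
import pandas as pd

# Toy data, invented for illustration: one categorical feature with 4 levels.
toy = pd.DataFrame({"season": [1, 2, 3, 4, 1, 3]})

# Full dummy coding: one column per level (l = 4 columns).
full = pd.get_dummies(toy, columns=["season"])

# Reduced coding: the first level becomes the reference category (l - 1 = 3 columns).
reduced = pd.get_dummies(toy, columns=["season"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))

# With all l columns, every row sums to 1, which duplicates the intercept.
row_sums = full.sum(axis=1)
print(row_sums.unique())
```

Whichever level is dropped becomes the reference category, and the remaining weights are read as differences from it.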

Here, we use a linear regression model to predict the number of bike rentals on a given day from weather and calendar information. The feature selection follows the book.

In [20]:
df1 = df1.drop(['yr', 'season_4', 'weathersit_3'], axis=1)

df1.head()

Unnamed: 0,holiday,workingday,temp,hum,windspeed,cnt,days_since_2011,season_1,season_2,season_3,weathersit_1,weathersit_2
0,0,0,0.344167,0.805833,0.160446,985,0,1,0,0,0,1
1,0,0,0.363478,0.696087,0.248539,801,1,1,0,0,0,1
2,0,1,0.196364,0.437273,0.248309,1349,2,1,0,0,1,0
3,0,1,0.2,0.590435,0.160296,1562,3,1,0,0,1,0
4,0,1,0.226957,0.436957,0.1869,1600,4,1,0,0,1,0


In [22]:
X = df1.drop('cnt', axis=1)
y = df1.cnt

In [23]:
X.head()

Unnamed: 0,holiday,workingday,temp,hum,windspeed,days_since_2011,season_1,season_2,season_3,weathersit_1,weathersit_2
0,0,0,0.344167,0.805833,0.160446,0,1,0,0,0,1
1,0,0,0.363478,0.696087,0.248539,1,1,0,0,0,1
2,0,1,0.196364,0.437273,0.248309,2,1,0,0,1,0
3,0,1,0.2,0.590435,0.160296,3,1,0,0,1,0
4,0,1,0.226957,0.436957,0.1869,4,1,0,0,1,0


In [24]:
y.head()

0     985
1     801
2    1349
3    1562
4    1600
Name: cnt, dtype: int64

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Explore the dataset

The target variable appears approximately symmetric, so treating it as roughly normally distributed is a reasonable working assumption.
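Symmetry can also be checked numerically. The sketch below computes the sample skewness with `Series.skew()` on two synthetic series (invented stand-ins, not the bike data). Values near 0 indicate approximate symmetry, although symmetry alone does not establish normality.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for a target column (not the bike data).
symmetric = pd.Series(rng.normal(loc=4500, scale=1900, size=5000))
right_skewed = pd.Series(rng.exponential(scale=1900, size=5000))

# Sample skewness: close to 0 for a symmetric distribution,
# positive for a distribution with a long right tail.
skew_symmetric = symmetric.skew()
skew_right = right_skewed.skew()

print(f"symmetric sample skewness:    {skew_symmetric:.2f}")
print(f"right-skewed sample skewness: {skew_right:.2f}")
```

On the notebook's data, the same check would be `y_train.skew()`.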

In [27]:
from interpret import show
from interpret.data import Marginal

marginal = Marginal().explain_data(X_train, y_train, name='Train Data')
show(marginal)

All the categorical features are imbalanced. The target variable has a moderately positive correlation with temp and days_since_2011.

# Train the linear model

In [29]:
from interpret.glassbox import LinearRegression

lr = LinearRegression(random_state=42)
lr.fit(X_train, y_train)

<interpret.glassbox.linear.LinearRegression at 0x7eff593f5b50>

# Evaluate performance

In [31]:
from interpret.perf import RegressionPerf

lr_perf = RegressionPerf(lr.predict).explain_perf(X_test, y_test, name='Linear Regression')

In [32]:
show(lr_perf)

On the test set, this model explains 79% of the total variance of the target variable.
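For reference, the share of explained variance is the coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below computes it by hand on made-up numbers and compares the result with scikit-learn's `r2_score`. That the performance view above reports exactly this statistic is an assumption here.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up observed and predicted counts, for illustration only.
y_true = np.array([985.0, 801.0, 1349.0, 1562.0, 1600.0])
y_pred = np.array([900.0, 950.0, 1300.0, 1500.0, 1700.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # variation left unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1.0 - ss_res / ss_tot

print(round(r2_manual, 3), round(r2_score(y_true, y_pred), 3))
```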

The residuals are also positively skewed rather than symmetric, which is at odds with the assumption of normally distributed errors. Homoscedasticity, the assumption that the variance of the residuals is constant over the whole feature space, is a separate property and would need its own check, for example by plotting the residuals against the predicted values.
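The shape of the residual distribution and the constancy of its variance are distinct properties. As a minimal sketch on synthetic data (not the notebook's model output), the residuals below are symmetric around zero, yet their spread grows with the prediction, so they are heteroscedastic without being skewed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic predictions and residuals, invented for illustration.
# The noise level grows with the prediction, so the spread is not constant.
y_hat = rng.uniform(1000, 8000, size=4000)
residuals = rng.normal(loc=0.0, scale=0.1 * y_hat)

# Check 1: shape of the residual distribution (skewness near 0 means symmetric).
residual_skew = pd.Series(residuals).skew()

# Check 2: constant variance. Compare the spread for low and high predictions.
cut = np.median(y_hat)
low = residuals[y_hat < cut]
high = residuals[y_hat >= cut]
spread_ratio = high.std() / low.std()

print(f"skewness: {residual_skew:.2f}, spread ratio (high/low): {spread_ratio:.2f}")
```

With the notebook's objects, the residuals for the same checks would be `y_test - lr.predict(X_test)`.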

# Global Explanations: What the model learned overall

In [33]:
lr_global = lr.explain_global(name='Linear Regression')

show(lr_global)

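Among the continuous features, temp gets the largest weight. An increase of temp by one unit increases the expected number of bikes by 5282, given that all other features stay the same. Note that temp is stored on a normalized scale (values such as 0.34 in the table above), so one unit corresponds to far more than 1 degree Celsius.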

The estimated number of bikes is 1730 higher when the weather is clear (weathersit_1) than in the reference category weathersit_3, whose dummy column was dropped, given that all other features stay the same.

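One thing to note is that days_since_2011 gets almost no importance in the model even though it is correlated with the target variable. A likely factor is scale: the feature ranges over hundreds of days, so its weight per day can be small while its total effect is large.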

The individual effect plots also help to understand how much the combination of a weight and a feature contributes to the predictions of our data. The x-axis represents the feature values and the score represents the product of the respective weight and feature value.
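The effect computation described above can be sketched as follows. This example uses scikit-learn's plain `LinearRegression` on synthetic data rather than the InterpretML wrapper and the bike data, and the two columns are invented stand-ins for temp and days_since_2011.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Synthetic features standing in for temp and days_since_2011 (not the bike data).
X = np.column_stack([rng.uniform(0, 1, 200), rng.integers(0, 730, 200)])
y = 1500 + 5000 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 100, 200)

model = LinearRegression().fit(X, y)

# Effect of feature j for instance i = weight_j * x_ij.
effects = X * model.coef_

# Each prediction is the intercept plus the sum of that row's effects.
reconstructed = model.intercept_ + effects.sum(axis=1)
print(np.allclose(reconstructed, model.predict(X)))
```

Because an effect is a product, a feature with a small weight can still contribute a lot when its values are large.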

# Local Explanations: How an individual prediction was made

In [34]:
lr_local = lr.explain_local(X_test[:5], y_test[:5], name='Linear Regression')

show(lr_local)

Below is a look at the effects behind the predicted bike count for a single instance, i.e. one day (the first row of X_test).

Some features contribute unusually little or unusually much to the predicted bike count compared with the overall dataset. Temperature (0.48 on the normalized scale) contributes less to the predicted value than average, while the trend feature days_since_2011 contributes unusually much because this instance is from late 2012 (703 days).

It could be that days_since_2011 behaves this erratically because the dataset is not scaled, or that the trend inherent in the values of days_since_2011 itself accounts for it. A further analysis with a scaled dataset could lead to better results.
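A minimal sketch of such a follow-up, assuming standardization with scikit-learn's `StandardScaler` and a plain `LinearRegression` on synthetic data (the two columns are invented stand-ins for temp and days_since_2011):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Synthetic stand-ins: one feature on a 0-1 scale, one trend feature on a 0-730 scale.
temp = rng.uniform(0, 1, 500)
days = rng.uniform(0, 730, 500)
X = np.column_stack([temp, days])
y = 1500 + 5000 * temp + 5 * days + rng.normal(0, 100, 500)

raw = LinearRegression().fit(X, y)

# Standardize each column to mean 0 and standard deviation 1 before fitting.
X_scaled = StandardScaler().fit_transform(X)
scaled = LinearRegression().fit(X_scaled, y)

print("raw weights:   ", np.round(raw.coef_, 1))
print("scaled weights:", np.round(scaled.coef_, 1))
```

Scaling leaves the predictions of an ordinary least squares fit unchanged and only re-expresses each weight per standard deviation, which makes the weights comparable across features. In a train/test workflow, the scaler would be fitted on the training data only and then applied to the test data.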

