## Predicting daily bike rental counts

In [1]:
import pandas as pd

df = pd.read_csv('day.csv')
df.describe()

| | instant | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 | 731.0 |
| mean | 366.0 | 2.49658 | 0.500684 | 6.519836 | 0.028728 | 2.997264 | 0.683995 | 1.395349 | 0.495385 | 0.474354 | 0.627894 | 0.190486 | 848.176471 | 3656.172367 | 4504.348837 |
| std | 211.165812 | 1.110807 | 0.500342 | 3.451913 | 0.167155 | 2.004787 | 0.465233 | 0.544894 | 0.183051 | 0.162961 | 0.142429 | 0.077498 | 686.622488 | 1560.256377 | 1937.211452 |
| min | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.05913 | 0.07907 | 0.0 | 0.022392 | 2.0 | 20.0 | 22.0 |
| 25% | 183.5 | 2.0 | 0.0 | 4.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.337083 | 0.337842 | 0.52 | 0.13495 | 315.5 | 2497.0 | 3152.0 |
| 50% | 366.0 | 3.0 | 1.0 | 7.0 | 0.0 | 3.0 | 1.0 | 1.0 | 0.498333 | 0.486733 | 0.626667 | 0.180975 | 713.0 | 3662.0 | 4548.0 |
| 75% | 548.5 | 3.0 | 1.0 | 10.0 | 0.0 | 5.0 | 1.0 | 2.0 | 0.655417 | 0.608602 | 0.730209 | 0.233214 | 1096.0 | 4776.5 | 5956.0 |
| max | 731.0 | 4.0 | 1.0 | 12.0 | 1.0 | 6.0 | 1.0 | 3.0 | 0.861667 | 0.840896 | 0.9725 | 0.507463 | 3410.0 | 6946.0 | 8714.0 |


- instant: record index
- dteday : date
- season : season (1: spring, 2: summer, 3: fall, 4: winter)
- yr : year (0: 2011, 1: 2012)
- mnth : month (1 to 12)
- hr : hour (0 to 23); not present in the daily file loaded here
- holiday : whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
- weekday : day of the week
- workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0
- weathersit :
        - 1: Clear, Few clouds, Partly cloudy
        - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
        - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
        - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp : normalized temperature in Celsius. The values are divided by 41 (max)
- atemp: normalized feeling temperature in Celsius. The values are divided by 50 (max)
- hum: normalized humidity. The values are divided by 100 (max)
- windspeed: normalized wind speed. The values are divided by 67 (max)
- casual: count of casual users
- registered: count of registered users
- cnt: count of total rental bikes, including both casual and registered
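
To make the normalized columns easier to read, the scale factors in the list above can be applied in reverse. The sketch below is only an illustration: the `SCALE` table and the `denormalize` helper are hypothetical names, and it assumes the listed divisors apply to `day.csv` exactly as stated.

```python
# Divisors quoted in the field list (assumed to apply to day.csv as stated).
SCALE = {"temp": 41.0, "atemp": 50.0, "hum": 100.0, "windspeed": 67.0}


def denormalize(column: str, value: float) -> float:
    """Convert a normalized value back to its original unit (hypothetical helper)."""
    return value * SCALE[column]


# Mean normalized values copied from the describe() output.
mean_temp_c = denormalize("temp", 0.495385)    # degrees Celsius
mean_hum_pct = denormalize("hum", 0.627894)    # percent relative humidity

print(round(mean_temp_c, 2), round(mean_hum_pct, 2))
```

Under these assumptions the average day in the dataset is roughly 20 °C with about 63% humidity.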

The goal is to predict the daily bike rental count (`cnt`) from the **environmental** and **seasonal** settings.

In [2]:
y = df['cnt'].values

In [3]:
X = df.loc[:, 'season':'windspeed'].values

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
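
As a quick sanity check on the split, the expected row counts can be worked out by hand. This sketch assumes that scikit-learn rounds a fractional `test_size` up and assigns the remaining rows to the training set; the variable names are illustrative only.

```python
import math

n_samples = 731     # row count from the describe() output
test_size = 0.33    # fraction passed to train_test_split

# Assumption: the test share is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

If the assumption holds, `X_train.shape[0]` and `X_test.shape[0]` should match these two numbers.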

In [5]:
from sklearn.preprocessing import StandardScaler

stdsc = StandardScaler()
stdsc.fit(X_train)
X_test = stdsc.transform(X_test)
X_train = stdsc.transform(X_train)
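
The cell above fits the scaler on the training split only and then applies the same statistics to both splits, which keeps information about the test set out of preprocessing. A minimal pure-Python sketch of the same idea for a single feature follows; the data are made up, and `standardize` is a hypothetical helper, not part of scikit-learn.

```python
from statistics import fmean, pstdev

# Toy single-feature data, made up for illustration (not taken from day.csv).
train = [0.2, 0.4, 0.6, 0.8]
test = [0.5, 1.0]

# Statistics come from the training split only.
mu = fmean(train)
sigma = pstdev(train)  # population standard deviation (ddof=0), as StandardScaler uses


def standardize(values):
    """Apply z = (x - mu) / sigma with the training statistics."""
    return [(v - mu) / sigma for v in values]


train_z = standardize(train)
test_z = standardize(test)

print(train_z)
print(test_z)
```

The standardized training values have mean 0 and unit variance by construction; the test values generally do not, because they are scaled with the training statistics.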

In [6]:
from sklearn.linear_model import LinearRegression


lr = LinearRegression()
lr.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [7]:
lr.coef_

array([ 570.24224207,  988.50252958, -151.26970154,  -68.56680044,
        127.44053837,   58.90911908, -329.44361269, -563.06264412,
       1546.68659247, -154.07905434, -156.98683907])

In [8]:
lr.score(X_test, y_test)

0.8002307500142452
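
For a regressor, `score` returns the coefficient of determination R², not a classification accuracy. The sketch below computes R² by hand on made-up numbers to show what the value means; `r2_by_hand` is a hypothetical helper written for this illustration.

```python
from statistics import fmean


def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy rental counts, made up for illustration.
y_true = [1000, 2000, 3000, 4000]
y_pred = [1200, 1900, 3100, 3800]

r2 = r2_by_hand(y_true, y_pred)

# Always predicting the mean of y_true gives R^2 = 0 by definition.
r2_baseline = r2_by_hand(y_true, [fmean(y_true)] * len(y_true))

print(r2, r2_baseline)
```

Read this way, a test score of about 0.80 says the model explains roughly 80% of the variance in `cnt` on the held-out days.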

In [9]:
from sklearn.metrics import mean_squared_error

mean_squared_error(y_test, lr.predict(X_test))

777068.4148123121
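
The mean squared error is the average of the squared residuals, so it is expressed in squared rental counts. A small hand-rolled version on made-up numbers shows the definition and confirms that swapping the two arguments does not change the result; `mse_by_hand` is a hypothetical helper.

```python
def mse_by_hand(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy rental counts, made up for illustration.
y_true = [1000, 2000, 3000, 4000]
y_pred = [1200, 1900, 3100, 3800]

mse = mse_by_hand(y_true, y_pred)
mse_swapped = mse_by_hand(y_pred, y_true)  # squared error is symmetric

print(mse, mse_swapped)
```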

## SGDRegressor
Here we fit the same linear model with stochastic gradient descent and an L1 penalty, which can shrink the coefficients of redundant features to exactly zero.

In [81]:
from sklearn.linear_model import SGDRegressor

lrsgd = SGDRegressor(alpha=20, max_iter=10000, penalty='l1')
lrsgd.fit(X_train, y_train)

SGDRegressor(alpha=20, average=False, early_stopping=False, epsilon=0.1,
             eta0=0.01, fit_intercept=True, l1_ratio=0.15,
             learning_rate='invscaling', loss='squared_loss', max_iter=10000,
             n_iter_no_change=5, penalty='l1', power_t=0.25, random_state=None,
             shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0,
             warm_start=False)

In [82]:
lrsgd.coef_

array([ 434.52802807,  968.76935342,    0.        ,  -66.48142947,
        108.77526368,   51.31362526, -335.3885948 ,  429.19137139,
        549.59553314, -129.55902757, -158.42102406])
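
To see which features the L1 penalty removed, the coefficients can be paired with column names. The sketch below copies the values printed above and assumes the feature order is the column order of `day.csv` from `season` to `windspeed`, as selected for `X`.

```python
# Assumed column order of df.loc[:, 'season':'windspeed'].
feature_names = [
    "season", "yr", "mnth", "holiday", "weekday", "workingday",
    "weathersit", "temp", "atemp", "hum", "windspeed",
]

# Coefficients copied from the lrsgd.coef_ output.
coef = [
    434.52802807, 968.76935342, 0.0, -66.48142947,
    108.77526368, 51.31362526, -335.3885948, 429.19137139,
    549.59553314, -129.55902757, -158.42102406,
]

# Features whose coefficient was driven to exactly zero.
dropped = [name for name, c in zip(feature_names, coef) if c == 0.0]

# Remaining features ranked by absolute coefficient size.
ranked = sorted(
    ((name, c) for name, c in zip(feature_names, coef) if c != 0.0),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)

print(dropped)
print(ranked[:3])
```

Under that ordering, only `mnth` is zeroed in this run. That is plausible because `season` carries much of the same information.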

In [83]:
print(f'Training R^2 {lrsgd.score(X_train, y_train)}')
print(f'Test R^2 {lrsgd.score(X_test, y_test)}')

Training R^2 0.7885556743750166
Test R^2 0.8125808500409872


In [60]:
from sklearn.metrics import mean_squared_error

mean_squared_error(y_test, lrsgd.predict(X_test))

720592.3055799959
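
Taking the square root puts both errors back into units of bikes per day, which makes the two models easier to compare. The sketch below uses only the two MSE values reported in this notebook. Because the `SGDRegressor` was fitted with `random_state=None`, the gap comes from a single run and may shift on re-execution.

```python
import math

# Test-set MSE values copied from the outputs above.
mse_ols = 777068.4148123121   # LinearRegression
mse_sgd = 720592.3055799959   # SGDRegressor with L1 penalty

rmse_ols = math.sqrt(mse_ols)
rmse_sgd = math.sqrt(mse_sgd)

# Relative reduction in MSE from the plain fit to the L1-penalized fit.
relative_drop = (mse_ols - mse_sgd) / mse_ols

print(round(rmse_ols, 1), round(rmse_sgd, 1), round(relative_drop, 3))
```

Both models are off by several hundred rentals per day on average, against a mean daily count of about 4,500.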