## Exercise 3

First, we import the Python modules we need:

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.preprocessing import PolynomialFeatures

import warnings
warnings.simplefilter('ignore')

We use the [Bike Sharing Demand](https://www.kaggle.com/c/bike-sharing-demand/overview) dataset from Kaggle.

In [2]:
path = 'bike-sharing-demand/'
rides = pd.read_csv(path + 'train.csv')
rides.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1


**Data Fields**

* datetime - hourly date + timestamp    
* season -  1 = spring, 2 = summer, 3 = fall, 4 = winter   
* holiday - whether the day is considered a holiday  
* workingday - whether the day is neither a weekend nor holiday  
* weather -   
    1: Clear, Few clouds, Partly cloudy   
    2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist   
    3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds   
    4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog   
* temp - temperature in Celsius  
* atemp - "feels like" temperature in Celsius  
* humidity - relative humidity  
* windspeed - wind speed  
* casual - number of non-registered user rentals initiated  
* registered - number of registered user rentals initiated  
* count - number of total rentals  

Let us look at the *datetime* values.

In [3]:
rides['datetime'].values[:5]

array(['2011-01-01 00:00:00', '2011-01-01 01:00:00',
       '2011-01-01 02:00:00', '2011-01-01 03:00:00',
       '2011-01-01 04:00:00'], dtype=object)

We extract the date, hour, weekday, and month from the `datetime` column.

In [4]:
import calendar
from datetime import datetime

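def extract_from_datetime(rides):
    # Each value looks like 'YYYY-MM-DD HH:MM:SS': split it into a date string
    # and a zero-padded hour string (e.g. '00', '13').
    rides["date"] = rides["datetime"].apply(lambda x : x.split()[0])
    rides["hour"] = rides["datetime"].apply(lambda x : x.split()[1].split(":")[0])
    # Map each date to its weekday name (e.g. 'Saturday').
    rides["weekday"] = rides["date"].apply(lambda dateString : 
        calendar.day_name[datetime.strptime(dateString,"%Y-%m-%d").weekday()])
    # Map each date to its month name (e.g. 'January').
    rides["month"] = rides["date"].apply(lambda dateString : 
        calendar.month_name[datetime.strptime(dateString,"%Y-%m-%d").month])
    return rides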
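The same columns can also be derived without row-wise `apply` calls. The following is a minimal sketch, assuming the `datetime` strings always follow the `YYYY-MM-DD HH:MM:SS` layout shown above; the function name `extract_from_datetime_vectorized` and the two-row `sample` frame are hypothetical and only serve the illustration.

```python
import pandas as pd

def extract_from_datetime_vectorized(rides):
    # Hypothetical alternative to extract_from_datetime; works on a copy
    # so the caller's dataframe is left untouched.
    rides = rides.copy()
    parsed = pd.to_datetime(rides["datetime"], format="%Y-%m-%d %H:%M:%S")
    rides["date"] = parsed.dt.strftime("%Y-%m-%d")
    rides["hour"] = parsed.dt.strftime("%H")    # zero-padded string, e.g. "00"
    rides["weekday"] = parsed.dt.day_name()     # e.g. "Saturday"
    rides["month"] = parsed.dt.month_name()     # e.g. "January"
    return rides

# Tiny made-up sample in the same format as the 'datetime' column.
sample = pd.DataFrame({"datetime": ["2011-01-01 00:00:00", "2011-01-01 13:00:00"]})
result = extract_from_datetime_vectorized(sample)
print(result[["date", "hour", "weekday", "month"]])
```

Keeping `hour` as a zero-padded string means the dummy columns created later would still be named `hour_00`, `hour_01`, and so on.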

We one-hot encode the categorical features.

In [5]:
# Ask to fill in
def one_hot_encoding(rides):
    dummy_fields = ['season', 'weather', 'month', 'hour', 'weekday']
    for each in dummy_fields:
        dummies = pd.get_dummies(rides[each], prefix=each, drop_first=False)
        rides = pd.concat([rides, dummies], axis=1)
    return rides

We drop the columns that are redundant now.

In [6]:
# Ask to fill in
def drop_features(rides):
    features_to_drop = ['datetime', 'date', 
                        'month', 'hour', 'weekday', 
                        'season', 'weather']

    rides = rides.drop(features_to_drop, axis=1)
    return rides
rides.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1


Now we apply all the above defined functions to the `rides` dataframe.

In [7]:
def feature_engineering(rides):
    rides = extract_from_datetime(rides)
    rides = one_hot_encoding(rides)
    rides = drop_features(rides)
    return rides

rides = feature_engineering(rides)
target = rides[['casual', 'registered', 'count']]
rides = rides.drop(['casual', 'registered', 'count'], axis=1)

We define the feature engineering as a function so that we do not have to repeat the same steps for the `test.csv` dataset, for which we will make predictions at the end.
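One thing to watch when the same function is reused on another file: `pd.get_dummies` only creates columns for the categories it actually sees. If a category happens to be absent from one file, the two encoded frames end up with different columns. The sketch below uses two small made-up frames (`train` and `test` here are hypothetical, not the Kaggle files) and aligns them with `reindex`.

```python
import pandas as pd

# Made-up example: weather categories 3 and 4 never occur in the second frame.
train = pd.DataFrame({"weather": [1, 2, 3, 4]})
test = pd.DataFrame({"weather": [1, 2, 2]})

train_encoded = pd.get_dummies(train["weather"], prefix="weather")
test_encoded = pd.get_dummies(test["weather"], prefix="weather")

# Reindex to the training layout; dummies missing from the test frame become 0.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(test_encoded.columns))
print(list(test_aligned.columns))
```

After alignment, both frames have the same columns in the same order, which is what a fitted model expects at prediction time.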

In [8]:
rides.head()

Unnamed: 0,holiday,workingday,temp,atemp,humidity,windspeed,season_1,season_2,season_3,season_4,...,hour_21,hour_22,hour_23,weekday_Friday,weekday_Monday,weekday_Saturday,weekday_Sunday,weekday_Thursday,weekday_Tuesday,weekday_Wednesday
0,0,0,9.84,14.395,81,0.0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,0,0,9.02,13.635,80,0.0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,0,0,9.02,13.635,80,0.0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,0,0,9.84,14.395,75,0.0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,0,0,9.84,14.395,75,0.0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0


In [9]:
rides.columns

Index(['holiday', 'workingday', 'temp', 'atemp', 'humidity', 'windspeed',
       'season_1', 'season_2', 'season_3', 'season_4', 'weather_1',
       'weather_2', 'weather_3', 'weather_4', 'month_April', 'month_August',
       'month_December', 'month_February', 'month_January', 'month_July',
       'month_June', 'month_March', 'month_May', 'month_November',
       'month_October', 'month_September', 'hour_00', 'hour_01', 'hour_02',
       'hour_03', 'hour_04', 'hour_05', 'hour_06', 'hour_07', 'hour_08',
       'hour_09', 'hour_10', 'hour_11', 'hour_12', 'hour_13', 'hour_14',
       'hour_15', 'hour_16', 'hour_17', 'hour_18', 'hour_19', 'hour_20',
       'hour_21', 'hour_22', 'hour_23', 'weekday_Friday', 'weekday_Monday',
       'weekday_Saturday', 'weekday_Sunday', 'weekday_Thursday',
       'weekday_Tuesday', 'weekday_Wednesday'],
      dtype='object')

In [10]:
rides.shape

(10886, 57)

In [11]:
X_train, X_valid, y_train, y_valid = train_test_split(rides, target['count'],
                                        random_state = 0)

In [12]:
linreg = LinearRegression().fit(X_train, y_train)

print('R-squared score (training): {:.3f}'
     .format(linreg.score(X_train, y_train)))
print('R-squared score (validation): {:.3f}'
     .format(linreg.score(X_valid, y_valid)))

R-squared score (training): 0.641
R-squared score (validation): 0.625


You should get the output   
`R-squared score (training): 0.641
R-squared score (validation): 0.625`

In [13]:
poly2 = PolynomialFeatures(degree=2)
X_poly2 = poly2.fit_transform(rides)
X_train_poly2, X_valid_poly2, y_train_poly2, y_valid_poly2 = train_test_split(X_poly2, 
                                                    target['count'], random_state = 0)
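It helps to know how many columns the polynomial expansion produces. With the default settings (bias column included), `PolynomialFeatures` generates every monomial up to the given degree, so `n` inputs yield C(n + degree, degree) features. The helper `n_poly_features` below is a hypothetical name introduced for this sketch.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def n_poly_features(n_inputs, degree):
    # Number of monomials of total degree <= degree in n_inputs variables,
    # including the constant (bias) term.
    return comb(n_inputs + degree, degree)

# Sanity check against scikit-learn on a small random matrix (4 inputs).
X_small = np.random.RandomState(0).rand(5, 4)
sk_count = PolynomialFeatures(degree=2).fit_transform(X_small).shape[1]
print(sk_count == n_poly_features(4, 2))

print(n_poly_features(57, 2))  # 1711 features for degree 2 on 57 columns
print(n_poly_features(57, 3))  # 34220 features for degree 3 on 57 columns
```

With the default split, about three quarters of the 10,886 rows are used for training, so a degree-3 expansion has several times more features than training rows.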

In [14]:
polyreg_ridge = Ridge(alpha=1000).fit(X_train_poly2, y_train_poly2)

print('R-squared score (training): {:.3f}'
     .format(polyreg_ridge.score(X_train_poly2, y_train_poly2)))
print('R-squared score (validation): {:.3f}'
     .format(polyreg_ridge.score(X_valid_poly2, y_valid_poly2)))

R-squared score (training): 0.736
R-squared score (validation): 0.710


In [15]:
polyreg = LinearRegression().fit(X_train_poly2, y_train_poly2)

polyreg_train_score = polyreg.score(X_train_poly2, y_train_poly2)
polyreg_valid_score = polyreg.score(X_valid_poly2, y_valid_poly2)

print('R-squared score (training): {:.3f}'
     .format(polyreg_train_score))
print('R-squared score (validation): {:.3f}'
     .format(polyreg_valid_score))

R-squared score (training): 0.870
R-squared score (validation): 0.827


In [16]:
polyreg_ridge = Ridge(alpha=100).fit(X_train_poly2, y_train_poly2)

print('R-squared score (training): {:.3f}'
     .format(polyreg_ridge.score(X_train_poly2, y_train_poly2)))
print('R-squared score (validation): {:.3f}'
     .format(polyreg_ridge.score(X_valid_poly2, y_valid_poly2)))

R-squared score (training): 0.827
R-squared score (validation): 0.797


In [17]:
polyreg_lasso = Lasso(alpha=5).fit(X_train_poly2, y_train_poly2)

print('R-squared score (training): {:.3f}'
     .format(polyreg_lasso.score(X_train_poly2, y_train_poly2)))
print('R-squared score (validation): {:.3f}'
     .format(polyreg_lasso.score(X_valid_poly2, y_valid_poly2)))

R-squared score (training): 0.691
R-squared score (validation): 0.675


In [18]:
polyreg_elasticnet = ElasticNet(alpha=1000).fit(X_train_poly2, y_train_poly2)

print('R-squared score (training): {:.3f}'
     .format(polyreg_elasticnet.score(X_train_poly2, y_train_poly2)))
print('R-squared score (validation): {:.3f}'
     .format(polyreg_elasticnet.score(X_valid_poly2, y_valid_poly2)))

R-squared score (training): 0.255
R-squared score (validation): 0.266


In [20]:
poly3 = PolynomialFeatures(degree=3)
X_poly3 = poly3.fit_transform(rides)
X_train_poly3, X_valid_poly3, y_train_poly3, y_valid_poly3 = train_test_split(X_poly3, 
                                                    target['count'], random_state = 0)

In [21]:
polyreg3 = LinearRegression().fit(X_train_poly3, y_train_poly3)

polyreg3_train_score = polyreg3.score(X_train_poly3, y_train_poly3)
polyreg3_valid_score = polyreg3.score(X_valid_poly3, y_valid_poly3)

print('R-squared score (training): {:.3f}'
     .format(polyreg3_train_score))
print('R-squared score (validation): {:.3f}'
     .format(polyreg3_valid_score))

R-squared score (training): 0.972
R-squared score (validation): -5059817.445
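A large negative validation score is possible because R-squared is defined as 1 - SS_res / SS_tot: it drops below zero as soon as the model's predictions are worse than always predicting the mean of the targets. The sketch below reimplements the formula on made-up numbers (the helper `r_squared` and the arrays are hypothetical) to show both cases.

```python
import numpy as np

def r_squared(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # spread around the mean
    return 1.0 - ss_res / ss_tot

y_true = [10, 20, 30, 40]
good = [12, 18, 33, 39]          # close to the targets
wild = [1000, -500, 2000, -900]  # far off, as an unregularized overfit model can be

print('{:.3f}'.format(r_squared(y_true, good)))
print('{:.3f}'.format(r_squared(y_true, wild)))
```

A near-perfect training score combined with a hugely negative validation score is the typical signature of overfitting, which is why the regularized models are tried next.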


In [22]:
polyreg_ridge = Ridge(alpha=100).fit(X_train_poly3, y_train_poly3)

print('R-squared score (training): {:.3f}'
     .format(polyreg_ridge.score(X_train_poly3, y_train_poly3)))
print('R-squared score (validation): {:.3f}'
     .format(polyreg_ridge.score(X_valid_poly3, y_valid_poly3)))

R-squared score (training): 0.924
R-squared score (validation): 0.822


In [23]:
polyreg_ridge = Ridge(alpha=3000).fit(X_train_poly3, y_train_poly3)

print('R-squared score (training): {:.3f}'
     .format(polyreg_ridge.score(X_train_poly3, y_train_poly3)))
print('R-squared score (validation): {:.3f}'
     .format(polyreg_ridge.score(X_valid_poly3, y_valid_poly3)))

R-squared score (training): 0.904
R-squared score (validation): 0.841


In [24]:
polyreg_ridge = Ridge(alpha=5000).fit(X_train_poly3, y_train_poly3)

print('R-squared score (training): {:.3f}'
     .format(polyreg_ridge.score(X_train_poly3, y_train_poly3)))
print('R-squared score (validation): {:.3f}'
     .format(polyreg_ridge.score(X_valid_poly3, y_valid_poly3)))

R-squared score (training): 0.901
R-squared score (validation): 0.841
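Instead of re-running the cell by hand for each `alpha`, the comparison can be written as a loop. This is a minimal sketch on synthetic stand-in data (not the bike-sharing set), so the alpha grid and the resulting scores are illustrative only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data so the block runs on its own.
rng = np.random.RandomState(0)
X = rng.rand(300, 5)
y = X[:, 0] * X[:, 1] + 0.5 * X[:, 2] + 0.1 * rng.randn(300)

X_poly = PolynomialFeatures(degree=3).fit_transform(X)
X_tr, X_va, y_tr, y_va = train_test_split(X_poly, y, random_state=0)

scores = {}
for alpha in [0.01, 0.1, 1, 10, 100]:
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    scores[alpha] = (model.score(X_tr, y_tr), model.score(X_va, y_va))
    print('alpha={:<6} train={:.3f} valid={:.3f}'.format(alpha, *scores[alpha]))

# Pick the alpha with the highest validation score.
best_alpha = max(scores, key=lambda a: scores[a][1])
print('best alpha by validation score:', best_alpha)
```

On the notebook's data, the same loop would use `X_train_poly3`, `X_valid_poly3`, `y_train_poly3`, and `y_valid_poly3` in place of the synthetic split.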


In [25]:
polyreg_lasso = Lasso(alpha=5).fit(X_train_poly3, y_train_poly3)

print('R-squared score (training): {:.3f}'
     .format(polyreg_lasso.score(X_train_poly3, y_train_poly3)))
print('R-squared score (validation): {:.3f}'
     .format(polyreg_lasso.score(X_valid_poly3, y_valid_poly3)))

R-squared score (training): 0.858
R-squared score (validation): 0.824


In [26]:
polyreg_elasticnet = ElasticNet(alpha=1000).fit(X_train_poly3, y_train_poly3)

print('R-squared score (training): {:.3f}'
     .format(polyreg_elasticnet.score(X_train_poly3, y_train_poly3)))
print('R-squared score (validation): {:.3f}'
     .format(polyreg_elasticnet.score(X_valid_poly3, y_valid_poly3)))

R-squared score (training): 0.644
R-squared score (validation): 0.635


- "In gradient based learning method, it is common to normalize the numerical variables to speed up the training"
- Optional PCA for degree 3?
- Add learning rate
- Add scaling
- Three predictors
- Add Bootstrapping and/or cross-validation

Credit: Udacity
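Two of the items above, scaling and cross-validation, can be combined in a few lines. The following is a sketch under stated assumptions: the data are synthetic, and the choice of `StandardScaler`, `Ridge(alpha=1.0)`, and 5 folds is illustrative rather than a recommendation for this dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data whose columns live on very different scales.
rng = np.random.RandomState(0)
X = rng.rand(200, 4) * [1, 10, 100, 1000]
y = X[:, 0] + 0.1 * X[:, 1] + 0.05 * rng.randn(200)

# The scaler is fitted inside each fold, so no information from the
# held-out part leaks into the scaling statistics.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
cv_scores = cross_val_score(model, X, y, cv=5)  # R-squared per fold

print('mean R-squared: {:.3f} (std {:.3f})'.format(cv_scores.mean(), cv_scores.std()))
```

Averaging over several folds gives a less split-dependent estimate than the single `train_test_split` used so far.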

Hyper-parameter tuning is an essential part of optimizing a model: it addresses both overfitting and underfitting. Tuning the hyper-parameters requires a good understanding of the algorithm used, especially of how each setting reduces or increases the model's complexity.
The aim of this exercise is to give you some confidence to tackle datasets and apply your knowledge while you are still learning and building the conceptual understanding.
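As an illustration of tuning model complexity, the sketch below searches jointly over the polynomial degree and the ridge penalty with `GridSearchCV`. The data are synthetic, and the step names (`poly`, `scale`, `ridge`) and grid values are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a genuinely quadratic relationship.
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X[:, 0] ** 2 + X[:, 1] * X[:, 2] + 0.05 * rng.randn(200)

pipeline = Pipeline([
    ('poly', PolynomialFeatures(include_bias=False)),
    ('scale', StandardScaler()),
    ('ridge', Ridge()),
])

# A higher degree increases complexity; a larger alpha reduces it.
param_grid = {
    'poly__degree': [1, 2, 3],
    'ridge__alpha': [0.01, 1, 100],
}

search = GridSearchCV(pipeline, param_grid, cv=5)  # default scoring: R-squared
search.fit(X, y)

print(search.best_params_)
print('best cross-validated R-squared: {:.3f}'.format(search.best_score_))
```

Each combination is scored by 5-fold cross-validation, so the selected setting does not depend on a single train/validation split.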

In [None]:
# X_train_poly is not defined above; the degree-2 split is assumed here.
polyreg_ridge = Ridge(alpha=1).fit(X_train_poly2, y_train_poly2)

print('R-squared score (training): {:.3f}'
     .format(polyreg_ridge.score(X_train_poly2, y_train_poly2)))
print('R-squared score (validation): {:.3f}'
     .format(polyreg_ridge.score(X_valid_poly2, y_valid_poly2)))

In [None]:
# X_train_poly is not defined above; the degree-2 split is assumed here.
polyreg_lasso = Lasso(alpha=1).fit(X_train_poly2, y_train_poly2)

print('R-squared score (training): {:.3f}'
     .format(polyreg_lasso.score(X_train_poly2, y_train_poly2)))
print('R-squared score (validation): {:.3f}'
     .format(polyreg_lasso.score(X_valid_poly2, y_valid_poly2)))

In [None]:
# X_train_poly is not defined above; the degree-2 split is assumed here.
polyreg_elasticnet = ElasticNet(alpha=1).fit(X_train_poly2, y_train_poly2)

print('R-squared score (training): {:.3f}'
     .format(polyreg_elasticnet.score(X_train_poly2, y_train_poly2)))
print('R-squared score (validation): {:.3f}'
     .format(polyreg_elasticnet.score(X_valid_poly2, y_valid_poly2)))