Bike sharing polynomial features
---

Exercise - Load and split the data, set the baseline
---

> **Exercise**: Load the data set. Encode categorical variables with one-hot encoding. Split the data into train/test sets with the `train_test_split()` function from Scikit-learn (50-50 split, `random_state=0`). Fit a linear regression and compare its performance to the median baseline using the mean absolute error (MAE) measure.

In [2]:
import pandas as pd

# Load the data
data_df = pd.read_csv('bike-sharing.csv')

# First five rows
data_df.head()

Unnamed: 0,temp,hum,windspeed,yr,workingday,holiday,weekday,season,weathersit,casual
0,0.344,0.806,0.16,2011,no,no,6,spring,cloudy,331
1,0.363,0.696,0.249,2011,no,no,0,spring,cloudy,131
2,0.196,0.437,0.248,2011,yes,no,1,spring,clear,120
3,0.2,0.59,0.16,2011,yes,no,2,spring,clear,108
4,0.227,0.437,0.187,2011,yes,no,3,spring,clear,82


In [4]:
# Encode categorical variables
data_df2 = pd.get_dummies(data_df, columns=['weekday','season','weathersit'])
data_df2.head()

Unnamed: 0,temp,hum,windspeed,yr,workingday,holiday,casual,weekday_0,weekday_1,weekday_2,...,weekday_4,weekday_5,weekday_6,season_fall,season_spring,season_summer,season_winter,weathersit_clear,weathersit_cloudy,weathersit_rainy
0,0.344,0.806,0.16,2011,no,no,331,0,0,0,...,0,0,1,0,1,0,0,0,1,0
1,0.363,0.696,0.249,2011,no,no,131,1,0,0,...,0,0,0,0,1,0,0,0,1,0
2,0.196,0.437,0.248,2011,yes,no,120,0,1,0,...,0,0,0,0,1,0,0,1,0,0
3,0.2,0.59,0.16,2011,yes,no,108,0,0,1,...,0,0,0,0,1,0,0,1,0,0
4,0.227,0.437,0.187,2011,yes,no,82,0,0,0,...,0,0,0,0,1,0,0,1,0,0
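Note that in the output above, `workingday` and `holiday` still hold `'yes'`/`'no'` strings, so they also need encoding before a model can use them. The following is a minimal sketch on a toy frame; `toy_df` and `toy_encoded` are hypothetical names, and the three rows are copied from the table above. Passing `dtype=float` is one way to keep the resulting array purely numeric.

```python
import pandas as pd

# Toy frame (hypothetical) with three of the columns shown above
toy_df = pd.DataFrame({
    'temp': [0.344, 0.363, 0.196],
    'workingday': ['no', 'no', 'yes'],
    'weathersit': ['cloudy', 'cloudy', 'clear'],
})

# One-hot encode every string column; dtype=float avoids mixed dtypes
toy_encoded = pd.get_dummies(
    toy_df, columns=['workingday', 'weathersit'], dtype=float)

print(toy_encoded.columns.tolist())
print(toy_encoded.to_numpy().dtype)
```

Each string column becomes one indicator column per category, and the numeric `temp` column is left untouched.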


In [17]:
# Split into train/test sets
from sklearn.model_selection import train_test_split

# workingday and holiday still contain 'yes'/'no' strings. Encode them too,
# otherwise fitting fails with "could not convert string to float: 'yes'"
features_df = pd.get_dummies(
    data_df2.drop('casual', axis=1), columns=['workingday', 'holiday'])

X = features_df.astype(float).values
y = data_df2.casual.values

# Split data
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.5, test_size=0.5, random_state=0)

In [15]:
import numpy as np

# Mean absolute error (MAE)
def MAE(y, y_pred):
    return np.mean(np.abs(y - y_pred))

In [16]:
from sklearn.linear_model import LinearRegression

# Median baseline: predict the median of the training targets
mae_baseline = MAE(y_te, np.median(y_tr))

# Linear regression
linreg = LinearRegression()
linreg.fit(X_tr, y_tr);

# Evaluate on the test set
y_pred_te = linreg.predict(X_te)
mae_lr = MAE(y_te, y_pred_te)

print('MAE baseline: {:.3f}'.format(mae_baseline))
print('MAE linear regression: {:.3f}'.format(mae_lr))
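The median is a natural baseline for MAE because, among all constant predictions, it gives the smallest mean absolute error. The sketch below illustrates this with a toy target vector; `y_toy` is a hypothetical name holding the five `casual` values from the first rows shown earlier, and the helper repeats the `MAE` function from the notebook so that the block runs on its own.

```python
import numpy as np

# Mean absolute error, same definition as in the notebook
def MAE(y, y_pred):
    return np.mean(np.abs(y - y_pred))

# Toy targets (hypothetical): the first five `casual` values
y_toy = np.array([331, 131, 120, 108, 82])

# Compare two constant predictions
mae_median = MAE(y_toy, np.median(y_toy))
mae_mean = MAE(y_toy, np.mean(y_toy))

print('MAE median: {:.3f}'.format(mae_median))  # 54.400
print('MAE mean: {:.3f}'.format(mae_mean))      # 70.640
```

The large first value pulls the mean upward, while the median stays close to the bulk of the points.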

Exercise - Add polynomial features
---

> **Exercise**: Add the `temp^2` and `temp^3` polynomial features. Then fit and evaluate a linear regression. Plot your model with a scatter plot of temperatures vs. number of users. Feel free to add other features.

In [None]:
# Add polynomial features
???

# Fit a linear regression
mae_lr2 = ???
print('MAE lr with new features: {:.3f}'.format(mae_lr2))

In [None]:
# Plot predictions
???
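One possible way to approach this exercise is sketched below on synthetic data, not on the bike-sharing set. The arrays `temp` and `users`, the coefficients of the curve, and the noise level are all assumptions made for illustration. The sketch stacks `temp`, `temp^2`, and `temp^3` as columns and compares the test MAE against a model that uses `temp` alone.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def MAE(y, y_pred):
    return np.mean(np.abs(y - y_pred))

rng = np.random.RandomState(0)

# Synthetic stand-in for the data (assumption): a curved relationship
temp = rng.uniform(0, 1, size=200)
users = 1000 * temp - 600 * temp**2 + rng.normal(0, 20, size=200)

# Feature matrices: temp only vs. temp, temp^2, temp^3
X_lin = temp[:, np.newaxis]
X_poly = np.c_[temp, temp**2, temp**3]

# Simple 50-50 split: first half for training, second half for testing
tr, te = slice(0, 100), slice(100, 200)

lr_lin = LinearRegression().fit(X_lin[tr], users[tr])
lr_poly = LinearRegression().fit(X_poly[tr], users[tr])

mae_lin = MAE(users[te], lr_lin.predict(X_lin[te]))
mae_poly = MAE(users[te], lr_poly.predict(X_poly[te]))

print('MAE temp only: {:.3f}'.format(mae_lin))
print('MAE with polynomial features: {:.3f}'.format(mae_poly))
```

The model stays linear in its coefficients; only the inputs are transformed. Scikit-learn's `PolynomialFeatures` transformer could generate the same columns, but building them by hand keeps the feature set explicit.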

Exercise - Separate sources
---

In the last exercise, we saw that the data comes from two sources:

1. Data points collected during working days
2. Data points collected during non-working days

The goal of this exercise is to create a model for each source using your extended set of features, e.g., the original features plus the `temp^2`, `temp^3` polynomial features.

> **Exercise**: Create a model for each source with the extended set of features, and evaluate the overall performance on the test set using MAE. Plot the two models with a scatter plot of temperatures vs. number of users. Create a final comparison using a bar chart.

In [None]:
# Separate data points
???

In [None]:
# Fit a linear regression for working days (wd)
# and one for non-working days (nwd)
???

# Compute overall performance with MAE
mae_wdnwd = ???
print('MAE two sources: {:.3f}'.format(mae_wdnwd))
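A sketch of the two-model idea follows, again on synthetic data. The boolean mask `workingday`, the two curves, and the noise level are assumptions for illustration only. The point it tries to show is that the overall MAE is computed on a single prediction vector assembled from both models, rather than by averaging the two per-source scores.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def MAE(y, y_pred):
    return np.mean(np.abs(y - y_pred))

rng = np.random.RandomState(0)
n = 400

# Synthetic stand-in (assumption): each source follows its own curve
temp = rng.uniform(0, 1, size=n)
workingday = rng.rand(n) < 0.7  # True for working days
users = np.where(workingday, 300 * temp, 1500 * temp - 500 * temp**2)
users = users + rng.normal(0, 20, size=n)

X = np.c_[temp, temp**2, temp**3]

# 50-50 split: first half for training, second half for testing
X_tr, X_te = X[:200], X[200:]
y_tr, y_te = users[:200], users[200:]
wd_tr, wd_te = workingday[:200], workingday[200:]

# One model per source, each fitted on its own training points
lr_wd = LinearRegression().fit(X_tr[wd_tr], y_tr[wd_tr])
lr_nwd = LinearRegression().fit(X_tr[~wd_tr], y_tr[~wd_tr])

# Assemble a single prediction vector for the whole test set
y_pred = np.empty_like(y_te)
y_pred[wd_te] = lr_wd.predict(X_te[wd_te])
y_pred[~wd_te] = lr_nwd.predict(X_te[~wd_te])
mae_two = MAE(y_te, y_pred)

# Single model on all points, for comparison
mae_one = MAE(y_te, LinearRegression().fit(X_tr, y_tr).predict(X_te))

print('MAE one model: {:.3f}'.format(mae_one))
print('MAE two sources: {:.3f}'.format(mae_two))
```

Filling `y_pred` through the masks keeps the predictions aligned with `y_te`, so the MAE weights each source by its number of test points.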

In [None]:
# Plot predictions
???

In [None]:
# Final comparison
???