# 12 Regression Exercise
### Q1
Assess the performance on the bike sharing dataset of Linear Regression models that use:
- `temp` as the only input variable (`X = bikes_df[['temp']].values`)
- `hum` as the only input variable
- all features except `casual`, `registered`, `instant` and `dteday` (as set up in notebook 12 Regression Part 2)

Use all the data for both training and testing.  
You may use LinearRegression or SGDRegressor.  
Score performance using R2, MAPE and MAE (`mean_absolute_error`).  

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error as MAPE

bikes_df = pd.read_csv('bike_sharing.csv')

# Target variable; pop() also removes the column from the feature frame
y = bikes_df.pop('count').values
# Remove the columns the exercise asks us to exclude from the features
bikes_df = bikes_df.drop(columns=['casual', 'registered', 'instant', 'dteday'])

X_all = bikes_df.values
X_t = bikes_df[["temp"]].values
X_h = bikes_df[["hum"]].values

In [2]:
bikes_df

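| | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 0.344167 | 0.363625 | 0.805833 | 0.160446 |
| 1 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0.363478 | 0.353739 | 0.696087 | 0.248539 |
| 2 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.196364 | 0.189405 | 0.437273 | 0.248309 |
| 3 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 0.200000 | 0.212122 | 0.590435 | 0.160296 |
| 4 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 0.226957 | 0.229270 | 0.436957 | 0.186900 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 726 | 1 | 1 | 12 | 0 | 4 | 1 | 2 | 0.254167 | 0.226642 | 0.652917 | 0.350133 |
| 727 | 1 | 1 | 12 | 0 | 5 | 1 | 2 | 0.253333 | 0.255046 | 0.590000 | 0.155471 |
| 728 | 1 | 1 | 12 | 0 | 6 | 0 | 2 | 0.253333 | 0.242400 | 0.752917 | 0.124383 |
| 729 | 1 | 1 | 12 | 0 | 0 | 0 | 1 | 0.255833 | 0.231700 | 0.483333 | 0.350754 |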


In [3]:
from sklearn.preprocessing import StandardScaler
X_all_scal = StandardScaler().fit_transform(X_all)
X_t_scal = StandardScaler().fit_transform(X_t)
X_h_scal = StandardScaler().fit_transform(X_h)

In [4]:
X_all_scal

array([[-1.34821315, -1.00136893, -1.60016072, ..., -0.67994602,
         1.25017133, -0.38789169],
       [-1.34821315, -1.00136893, -1.60016072, ..., -0.74065231,
         0.47911298,  0.74960172],
       [-1.34821315, -1.00136893, -1.60016072, ..., -1.749767  ,
        -1.33927398,  0.74663186],
       ...,
       [-1.34821315,  0.99863295,  1.58866019, ..., -1.42434419,
         0.87839173, -0.85355213],
       [-1.34821315,  0.99863295,  1.58866019, ..., -1.49004895,
        -1.01566357,  2.06944426],
       [-1.34821315,  0.99863295,  1.58866019, ..., -1.54048197,
        -0.35406086, -0.46020122]])

In [5]:
from sklearn.linear_model import SGDRegressor
SGD_all = SGDRegressor(max_iter=50, tol=1e-3).fit(X_all_scal, y)
SGD_t = SGDRegressor(max_iter=50, tol=1e-3).fit(X_t_scal, y)
SGD_h = SGDRegressor(max_iter=50, tol=1e-3).fit(X_h_scal, y)

In [6]:
from sklearn.metrics import mean_absolute_error

print('All - R squared: {:.2f}'.format(SGD_all.score(X_all_scal, y)))
print('All - MAE: {:.2f}'.format(mean_absolute_error(y, SGD_all.predict(X_all_scal))))
print('All - MAPE: {:.2f}'.format(MAPE(y, SGD_all.predict(X_all_scal))), end="\n\n")

print('Temp - R squared: {:.2f}'.format(SGD_t.score(X_t_scal, y)))
print('Temp - MAE: {:.2f}'.format(mean_absolute_error(y, SGD_t.predict(X_t_scal))))
print('Temp - MAPE: {:.2f}'.format(MAPE(y, SGD_t.predict(X_t_scal))), end="\n\n")

print('Hum - R squared: {:.2f}'.format(SGD_h.score(X_h_scal, y)))
print('Hum - MAE: {:.2f}'.format(mean_absolute_error(y, SGD_h.predict(X_h_scal))))
print('Hum - MAPE: {:.2f}'.format(MAPE(y, SGD_h.predict(X_h_scal))))

All - R squared: 0.80
All - MAE: 644.08
All - MAPE: 0.45

Temp - R squared: 0.39
Temp - MAE: 1246.37
Temp - MAPE: 0.66

Hum - R squared: 0.01
Hum - MAE: 1568.34
Hum - MAPE: 0.85
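
The three groups of print statements above compute the same three metrics for each feature set. A minimal sketch of a helper that gathers them in one place follows. The name `score_model` and the synthetic data are assumptions made for illustration (the notebook itself uses `bikes_df`), and `LinearRegression` is swapped in for `SGDRegressor` so that the result does not depend on a random seed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    r2_score,
)


def score_model(model, X, y):
    """Return R2, MAE and MAPE for a fitted model evaluated on (X, y)."""
    y_pred = model.predict(X)
    return {
        "R2": float(r2_score(y, y_pred)),
        "MAE": float(mean_absolute_error(y, y_pred)),
        "MAPE": float(mean_absolute_percentage_error(y, y_pred)),
    }


# Synthetic stand-in for the bike data: two features, only the first is informative.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 100 + 20 * X_demo[:, 0] + rng.normal(scale=5, size=200)

feature_sets = {
    "both": X_demo,
    "informative only": X_demo[:, [0]],
    "uninformative only": X_demo[:, [1]],
}

results = {}
for name, X in feature_sets.items():
    # Train and score on the same data, as the exercise asks
    model = LinearRegression().fit(X, y_demo)
    results[name] = score_model(model, X, y_demo)
    print(name, {k: round(v, 3) for k, v in results[name].items()})
```

With the real data, the dictionary keys would be the three feature sets from the question (`X_all_scal`, `X_t_scal`, `X_h_scal`). Note that `SGDRegressor` is stochastic, so its scores can vary slightly between runs unless `random_state` is set.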


### Q2
For the bike sharing dataset, the weather features are normalized but the calendar features are not. Find the regression coefficients for the following models:
1. Partially normalized: the original format, with only the weather features normalized (provided in notebook 12 Regression Exercise).
2. Fully normalized (also provided): what happens to the `mnth` coefficient?
3. A model that uses only the normalized `temp` and `mnth` features: what has happened to the `mnth` coefficient?



In [7]:
X_all = bikes_df

In [8]:
X_all.head()

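| | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 0.344167 | 0.363625 | 0.805833 | 0.160446 |
| 1 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0.363478 | 0.353739 | 0.696087 | 0.248539 |
| 2 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.196364 | 0.189405 | 0.437273 | 0.248309 |
| 3 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 0.2 | 0.212122 | 0.590435 | 0.160296 |
| 4 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 0.226957 | 0.22927 | 0.436957 | 0.1869 |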


In [9]:
SGD_all_raw = SGDRegressor(tol=1e-3).fit(X_all, y)

In [10]:
for i,j in zip(SGD_all_raw.coef_, X_all.columns):
    print(i,j)

595.5572673012142 season
2079.2361850200173 yr
-39.48705978630515 mnth
-266.17257424795173 holiday
70.15050428690218 weekday
169.37189906287404 workingday
-872.4196187028776 weathersit
2424.0233355061905 temp
2210.042310276023 atemp
178.47578174118982 hum
-285.42521714092715 windspeed


**2.2.** Fully normalized (also provided): what happens to the `mnth` coefficient?

In [11]:
X_all_scal = StandardScaler().fit_transform(X_all)

In [12]:
SGD_all_scal = SGDRegressor(tol=1e-3).fit(X_all_scal, y)
for i,j in zip(SGD_all_scal.coef_, bikes_df.columns):
    print(i,j)

547.5739137161171 season
1020.1671466748388 yr
-124.90082959479324 mnth
-89.20803629667397 holiday
136.70856095501856 weekday
56.919235650186316 workingday
-332.7109229898094 weathersit
453.66616525555105 temp
504.6938596758846 atemp
-146.37025534327879 hum
-205.83357059749943 windspeed
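
One way to read the change in the `mnth` coefficient: standardizing divides each feature by its standard deviation, so in an exact least-squares fit the coefficient is multiplied by that standard deviation. A feature with a wide raw range, such as a month number, therefore gets a larger coefficient after scaling, while a feature squeezed into 0..1 gets a smaller one. The sketch below checks this relationship on synthetic data; the feature names and values are hypothetical, and `LinearRegression` is used instead of `SGDRegressor` because the identity only holds exactly for a converged least-squares solution.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Hypothetical features: a month-like integer (1..12) and a weather value in 0..1
month = rng.integers(1, 13, size=500).astype(float)
weather = rng.uniform(0, 1, size=500)
X_raw = np.column_stack([month, weather])
y_demo = 50 * month + 2000 * weather + rng.normal(scale=100, size=500)

# Fit once on the raw features and once on the standardized features
raw_model = LinearRegression().fit(X_raw, y_demo)
scaler = StandardScaler().fit(X_raw)
scaled_model = LinearRegression().fit(scaler.transform(X_raw), y_demo)

# For an exact least-squares fit: coef_scaled = coef_raw * feature_std
predicted_scaled_coef = raw_model.coef_ * scaler.scale_

print("raw coefficients:   ", raw_model.coef_)
print("scaled coefficients:", scaled_model.coef_)
print("raw * std:          ", predicted_scaled_coef)
```

After scaling, the coefficients are expressed per standard deviation of each feature, which makes them comparable with one another. The SGD coefficients printed in the notebook need not match this relationship exactly, since SGD is iterative and stochastic.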


**2.3.** A model that uses only the normalized `temp` and `mnth` features: what has happened to the `mnth` coefficient?


In [13]:
X_mt = bikes_df[['mnth', 'temp']]

In [18]:
X_mt_scaled = StandardScaler().fit_transform(X_mt)

In [20]:
SGD_mt_scal = SGDRegressor(tol=1e-3).fit(X_mt_scaled, y)
for i,j in zip(SGD_mt_scal.coef_, X_mt.columns):
    print(i,j)

286.44657510752694 mnth
1146.271517799063 temp
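
The `mnth` coefficient is negative in the full model but positive in this two-feature model. One plausible explanation (not verified here on the bike data) is that `mnth` is correlated with features that were dropped, such as `season`, and absorbs part of their effect once they are gone. The sketch below reproduces that kind of sign flip on synthetic data; the variables `a` and `b` are hypothetical stand-ins, not columns of the dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 2000
# Hypothetical correlated pair: 'a' plays the role of a feature that is
# dropped from the reduced model, 'b' is strongly correlated with it.
a = rng.normal(size=n)
b = a + rng.normal(scale=0.5, size=n)
# The true effect of 'b' is negative, the true effect of 'a' is positive
y_demo = 3 * a - 1 * b + rng.normal(scale=0.5, size=n)

full = LinearRegression().fit(np.column_stack([a, b]), y_demo)
reduced = LinearRegression().fit(b.reshape(-1, 1), y_demo)

print("coefficient of b with a present:", full.coef_[1])
print("coefficient of b with a dropped:", reduced.coef_[0])
```

The takeaway is that a regression coefficient describes a feature's effect given the other features in the model, so it can change size or even sign when the feature set changes.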


### Q3
Calculate error scores for the following weather predictions:  
`temp =   [12,13,15,12,11,11,17,13,12,14]`  
`t_pred = [12,12,14,13,12,11,15,12,12,13]`  
`rain =   [0,0,5,7,1,1,0,8,0,4]`  
`r_pred = [0,1,4,7,1,0,0,0,1,4]`  
`t_pred` and `r_pred` are the predicted values.  
Calculate R2, MAE, MAPE and RMSE. What do we learn from the different scores? There is code in the notebook 12 Regression Exercise to get you started. 


In [14]:
from sklearn.metrics import mean_absolute_percentage_error as MAPE
from sklearn.metrics import mean_absolute_error as MAE
from sklearn.metrics import r2_score as r2
# mean_squared_error returns the MSE; take its square root to get the RMSE
from sklearn.metrics import mean_squared_error as MSE

In [15]:
temp   = [12,13,15,12,11,11,17,13,12,14]
t_pred = [12,12,14,13,12,11,15,12,12,13]
rain   = [0,0,5,7,1,1,0,8,0,4]
r_pred = [0,1,4,7,1,0,0,0,1,4]

In [16]:
print('Temp MAPE', MAPE(temp, t_pred).round(3))
print('Temp MAE', MAE(temp, t_pred))
print('Temp R2 Score', r2(temp, t_pred).round(3))
print('Temp RMSE', np.sqrt(MSE(temp, t_pred)).round(3))

Temp MAPE 0.058
Temp MAE 0.8
Temp R2 Score 0.688
Temp RMSE 1.0


Rain scores:

In [17]:
print('Rain MAPE', MAPE(rain, r_pred).round(3))
print('Rain MAE', MAE(rain, r_pred))
print('Rain R2 Score', r2(rain, r_pred).round(3))
print('Rain RMSE', np.sqrt(MSE(rain, r_pred)).round(3))

Rain MAPE 900719925474099.4
Rain MAE 1.2
Rain R2 Score 0.231
Rain RMSE 2.608
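
The enormous MAPE for rain comes from days where the actual rainfall is 0 but the prediction is not: the relative error divides by the actual value. The sketch below recomputes the per-day terms with NumPy. Replacing a zero denominator with machine epsilon is an assumption about how the library guards against division by zero; it gives a result of the same magnitude as the value printed above.

```python
import numpy as np

rain = np.array([0, 0, 5, 7, 1, 1, 0, 8, 0, 4], dtype=float)
r_pred = np.array([0, 1, 4, 7, 1, 0, 0, 0, 1, 4], dtype=float)

# Assumed zero guard: divide by at least machine epsilon instead of by 0
eps = np.finfo(np.float64).eps
per_day = np.abs(rain - r_pred) / np.maximum(np.abs(rain), eps)

# Show which days dominate the average
for actual, pred, term in zip(rain, r_pred, per_day):
    print(f"actual={actual:.0f} pred={pred:.0f} relative error={term:.3g}")

mape = per_day.mean()
print("MAPE over all days:", mape)

# MAPE restricted to days with non-zero actual rainfall
nonzero = rain != 0
mape_nonzero = per_day[nonzero].mean()
print("MAPE on rainy days only:", round(mape_nonzero, 3))
```

Two days (actual 0, predicted 1) account for essentially the whole score, so MAPE is a poor choice for targets that can be zero. MAE and RMSE stay meaningful in that case, and RMSE penalizes the single large miss (actual 8, predicted 0) more heavily than MAE does.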
