# <img style="float: left; padding-right: 10px; width: 45px" src="https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/iacs.png"> CS109A Introduction to Data Science: 
## Homework 4 - Regularization 



**Harvard University**<br/>
**Fall 2018**<br/>
**Instructors**: Pavlos Protopapas, Kevin Rader

<hr style="height:2pt">

### INSTRUCTIONS

- **This homework must be completed individually.**

- To submit your assignment follow the instructions given in Canvas.
- Restart the kernel and run the whole notebook again before you submit. 
- As much as possible, stick to the hints and to the functions we import at the top of the homework, as those are the ideas and tools the class supports and aims to teach. If a problem specifies a particular library, you are required to use that library, and possibly others from the import list.


Names of people you have worked with go here: 

<hr style="height:2pt">

In [109]:
#RUN THIS CELL 
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)

import these libraries

In [110]:
import warnings
#warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import KFold

import statsmodels.api as sm
from statsmodels.regression.linear_model import OLS

# from pandas.core import datetools  # unused below; this module is not available in newer pandas releases
%matplotlib inline

# Continuing Bike Sharing Usage Data

In this homework, we will focus on regularization and cross validation. We will continue to build regression models for the [Capital Bikeshare program](https://www.capitalbikeshare.com) in Washington D.C.  See homework 3 for more information about the Capital Bikeshare data that we'll be using extensively. 



<div class='exercise'> <b> Question 1 [20pts]  Data pre-processing </b> </div>

**1.1** Read in the provided `bikes_student.csv` to a data frame named `bikes_main`. Split it into a training set `bikes_train` and a validation set `bikes_val`. Use `random_state=90`, a test set size of .2, and stratify on month. Remember to specify the data's index column as you read it in.

**1.2** As with last homework, the response will be the `counts` column and we'll drop `counts`, `registered` and `casual` for being trivial predictors, drop `workingday` and `month` for being multicollinear with other columns, and `dteday` for being inappropriate for regression. Write code to do this.

Encapsulate this process as a function with appropriate inputs and outputs, and **test** your code by producing `practice_y_train` and `practice_X_train`.

**1.3** Write a function to standardize a provided subset of columns in your training/validation/test sets. Remember that while you will be scaling all of your data, you must learn the scaling parameters (mean and SD) from only the training set.

Test your code by building a list of all non-binary columns in your `practice_X_train` and scaling only those columns. Call the result `practice_X_train_scaled`. Display the `.describe()` and verify that you have correctly scaled all columns, including the polynomial columns.

**Hint: employ the provided list of binary columns and use the `.columns.difference()` method of your data frame**

`binary_columns = [ 'holiday', 'workingday','Feb', 'Mar', 'Apr',
       'May', 'Jun', 'Jul', 'Aug', 'Sept', 'Oct', 'Nov', 'Dec', 'spring',
       'summer', 'fall', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat',
       'Cloudy', 'Snow', 'Storm']`


**1.4** Write code to augment your dataset with higher-order features for `temp`, `atemp`, `hum`, `windspeed`, and `hour`. You should include ONLY the pure powers of these columns. So with degree=2 you should produce `atemp^2` and `hum^2` but not `atemp*hum` or any other two-feature interactions. 


Encapsulate this process as a function with appropriate inputs and outputs, and test your code by producing `practice_X_train_poly`, a training dataset with quadratic and cubic features built from `practice_X_train_scaled`, and printing `practice_X_train_poly`'s column names and `.head()`.

**1.5** Write code to add interaction terms to the model. Specifically, we want interactions between the continuous predictors (`temp`,`atemp`, `hum`,`windspeed`) and the month and weekday dummies (`Feb`, `Mar`...`Dec`, `Mon`, `Tue`, ... `Sat`). That means you SHOULD build `atemp*Feb` and `hum*Mon` and so on, but NOT `Feb*Mar` and NOT `Feb*Tue`. The interaction terms should always be a continuous feature times a month dummy or a continuous feature times a weekday dummy.


Encapsulate this process as a function with appropriate inputs and outputs, and test your code by adding interaction terms to `practice_X_train_poly` and showing its column names and `.head()`.

**1.6** Combine all your code so far into a function that takes in `bikes_train`, `bikes_val`, the names of the columns that should receive polynomial terms, the target column, and the columns to be dropped, and produces computation-ready design matrices `X_train` and `X_val` and responses `y_train` and `y_val`. Your final function should build correct, scaled design matrices with the stated interaction terms and any polynomial degree.



### Solutions 

**1.1** Read in the provided `bikes_student.csv` to a data frame named `bikes_main`. Split it into a training set `bikes_train` and a validation set `bikes_val`. Use `random_state=90`, a test set size of .2, and stratify on month. Remember to specify the data's index column as you read it in.
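
As a minimal sketch of what a stratified, reproducible split does, the toy example below uses a made-up data frame (`toy`, with 10 rows per month) rather than the bikeshare data. The names `toy`, `toy_train`, and `toy_val` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for bikes_main (hypothetical data): 120 rows, 10 per month.
toy = pd.DataFrame({
    "month": np.repeat(np.arange(1, 13), 10),
    "counts": np.arange(120),
})

# stratify keeps each month's share the same in both pieces;
# random_state makes the split reproducible from run to run.
toy_train, toy_val = train_test_split(
    toy, test_size=0.2, random_state=90, stratify=toy["month"]
)

print(toy_train.shape, toy_val.shape)
print(toy_val["month"].value_counts().sort_index().tolist())
```

With 10 rows per month and a validation share of 0.2, every month contributes exactly 2 rows to the validation piece.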

In [111]:
# read in the data
bikes_main = pd.read_csv('./data/bikes_student.csv', index_col= 0)
bikes_main.head()

Unnamed: 0,dteday,hour,year,holiday,workingday,temp,atemp,hum,windspeed,casual,...,Mon,Tue,Wed,Thu,Fri,Sat,Cloudy,Snow,Storm,month
5887,2011-09-07,19,0,0,1,0.64,0.5758,0.89,0.0,14,...,0,0,1,0,0,0,1,0,0,9
10558,2012-03-21,1,1,0,1,0.52,0.5,0.83,0.0896,4,...,0,0,1,0,0,0,0,0,0,3
14130,2012-08-16,23,1,0,1,0.7,0.6515,0.54,0.1045,58,...,0,0,0,1,0,0,0,0,0,8
2727,2011-04-28,13,0,0,1,0.62,0.5758,0.83,0.2985,18,...,0,0,0,1,0,0,1,0,0,4
8716,2012-01-04,0,1,0,1,0.08,0.0606,0.42,0.3284,0,...,0,0,1,0,0,0,0,0,0,1


In [112]:
print(bikes_main.columns)

Index(['dteday', 'hour', 'year', 'holiday', 'workingday', 'temp', 'atemp',
       'hum', 'windspeed', 'casual', 'registered', 'counts', 'Feb', 'Mar',
       'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sept', 'Oct', 'Nov', 'Dec',
       'spring', 'summer', 'fall', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat',
       'Cloudy', 'Snow', 'Storm', 'month'],
      dtype='object')


In [113]:
bikes_train, bikes_val = train_test_split(bikes_main, test_size = 0.2, random_state = 90, stratify = bikes_main.month)

**1.2** As with last homework, the response will be the `counts` column and we'll drop `counts`, `registered` and `casual` for being trivial predictors, drop `workingday` and `month` for being multicollinear with other columns, and `dteday` for being inappropriate for regression. Write code to do this.

Encapsulate this process as a function with appropriate inputs and outputs, and test your code by producing `practice_y_train` and `practice_X_train`.


In [114]:
# your code here
def get_X_and_y(df, response_column, columns_to_drop):
    # accept a single column name or a list of names, so that the
    # response always comes back as a DataFrame
    if isinstance(response_column, str):
        response_column = [response_column]
    
    df_X = df.drop(columns_to_drop, axis = 1)
    df_y = df[response_column]
    
    return df_X, df_y

In [115]:
response_column = ['counts']
columns_to_drop = ['counts', 'registered', 'casual','workingday','month','dteday']
practice_X_train, practice_y_train = get_X_and_y(bikes_train, response_column, columns_to_drop)

In [116]:
print(practice_X_train.columns)

Index(['hour', 'year', 'holiday', 'temp', 'atemp', 'hum', 'windspeed', 'Feb',
       'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sept', 'Oct', 'Nov', 'Dec',
       'spring', 'summer', 'fall', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat',
       'Cloudy', 'Snow', 'Storm'],
      dtype='object')


In [117]:
print(practice_y_train.columns)

Index(['counts'], dtype='object')


**1.3** Write a function to standardize a provided subset of columns in your training/validation/test sets. Remember that while you will be scaling all of your data, you must learn the scaling parameters (mean and SD) from only the training set.

Test your code by building a list of all non-binary columns in your `practice_X_train` and scaling only those columns. Call the result `practice_X_train_scaled`. Display the `.describe()` and verify that you have correctly scaled all columns, including the polynomial columns.

**Hint: employ the provided list of binary columns and use the `.columns.difference()` method of your data frame**

`binary_columns = [ 'holiday', 'workingday','Feb', 'Mar', 'Apr',
       'May', 'Jun', 'Jul', 'Aug', 'Sept', 'Oct', 'Nov', 'Dec', 'spring',
       'summer', 'fall', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat',
       'Cloudy', 'Snow', 'Storm']`
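
One way to learn the scaling parameters from the training set only, while leaving the binary columns in place, is sketched below with `StandardScaler`. The helper name `scale_keep_all` and the toy frames are assumptions for illustration, not part of the assignment. Note that `StandardScaler` divides by the population SD (`ddof=0`), whereas pandas `.std()` defaults to the sample SD (`ddof=1`), so results differ slightly from a hand-rolled pandas version.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def scale_keep_all(df_to_scale, df_train, cols_to_scale):
    """Standardize cols_to_scale with df_train's mean/SD; keep every other column."""
    cols = list(cols_to_scale)
    scaler = StandardScaler().fit(df_train[cols])      # parameters come from training data only
    out = df_to_scale.copy()
    out[cols] = scaler.transform(df_to_scale[cols])    # overwrite only the scaled columns
    return out

# Toy data (hypothetical): one continuous column and one binary column.
toy_train = pd.DataFrame({"temp": [0.2, 0.4, 0.6, 0.8], "holiday": [0, 1, 0, 0]})
toy_val = pd.DataFrame({"temp": [0.5, 1.0], "holiday": [1, 0]})

train_scaled = scale_keep_all(toy_train, toy_train, ["temp"])
val_scaled = scale_keep_all(toy_val, toy_train, ["temp"])
print(train_scaled)
print(val_scaled)
```

The validation frame is shifted and scaled by the training mean and SD, so its own mean need not be zero.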


<font color = 'red'> WHAT POLYNOMIAL COLUMN?? </FONT>

In [118]:
# your code here
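def scale_col(df_to_scale, df_train, col_to_scale):
    """
    df_to_scale pandas DataFrame: the data to standardize
    df_train pandas DataFrame: the training data the mean and SD are learned from
    col_to_scale list(str): the columns to standardize
    Returns only the columns in col_to_scale; all other columns are dropped.
    """
    #select subsets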
    df_train = df_train[col_to_scale]
    df_to_scale = df_to_scale[col_to_scale]
    
    #get means and standard dev. for training data
    means = df_train.mean()
    stds = df_train.std()
    
    #standardize columns
    return (df_to_scale - means)/stds

In [119]:
binary_columns = [ 'holiday', 'workingday','Feb', 'Mar', 'Apr',
       'May', 'Jun', 'Jul', 'Aug', 'Sept', 'Oct', 'Nov', 'Dec', 'spring',
       'summer', 'fall', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat',
       'Cloudy', 'Snow', 'Storm']
non_binary_columns = practice_X_train.columns.difference(binary_columns)
practice_X_train_scaled  = scale_col(practice_X_train, practice_X_train, non_binary_columns)
practice_X_train_scaled.describe()

Unnamed: 0,atemp,hour,hum,temp,windspeed,year
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,5.222489e-15,-5.329071000000001e-17,2.813749e-15,-4.462208e-15,6.590284e-15,-7.105427e-18
std,1.0,1.0,1.0,1.0,1.0,1.0
min,-2.415738,-1.69161,-3.415218,-2.374607,-1.617293,-1.027889
25%,-0.8334458,-0.8123948,-0.742276,-0.8155112,-0.775293,-1.027889
50%,0.04560565,0.06682035,0.05960661,0.01600672,-0.05415467,0.9718949
75%,0.837042,0.7994997,0.8080304,0.8475247,0.6677894,0.9718949
max,2.50753,1.678715,1.930666,2.302681,5.237147,0.9718949


**1.4** Write code to augment your dataset with higher-order features for `temp`, `atemp`, `hum`, `windspeed`, and `hour`. You should include ONLY pure powers of these columns. So with degree=2 you should produce `atemp^2` and `hum^2` but not `atemp*hum` or any other two-feature interactions. 


Encapsulate this process as a function with appropriate inputs and outputs, and test your code by producing `practice_X_train_poly`, a training dataset with quadratic and cubic features built from `practice_X_train_scaled`, and printing `practice_X_train_poly`'s column names and `.head()`.
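
To illustrate why the "pure powers only" requirement matters, the sketch below contrasts `PolynomialFeatures` applied to two columns at once, which also generates the cross term, with raising each column to a power on its own. The toy frame and its values are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Toy data (hypothetical values).
toy = pd.DataFrame({"atemp": [1.0, 2.0, 3.0], "hum": [0.5, 1.0, 1.5]})

# Applied to both columns together, PolynomialFeatures also builds atemp*hum:
# the five outputs are atemp, hum, atemp^2, atemp*hum, hum^2.
pf = PolynomialFeatures(degree=2, include_bias=False)
full = pf.fit_transform(toy)
print(full.shape[1])

# Raising each column to a power separately yields only the pure powers.
pure = toy.copy()
for col in toy.columns:
    pure[col + "^2"] = toy[col] ** 2
print(list(pure.columns))
```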

In [120]:
# your code here
def add_poly_columns(df, columns, degree):
    """
    df pandas DataFrame: the df to add polynomial terms to
    columns list(str): the list of columns for which to add polynomial terms
    degree int: add polynomials from 2 to this degree (inclusive). Interactions are not included.
    """
    df_poly = df.copy()
    for d in range(2, degree + 1):
        for col in columns:
            df_poly[col+'^'+str(d)] = df[col] ** d
    return df_poly

In [123]:
columns = practice_X_train_scaled.columns
practice_X_train_poly = add_poly_columns(practice_X_train_scaled, columns, 3)
practice_X_train_poly.head()

Unnamed: 0,atemp,hour,hum,temp,windspeed,year,atemp^2,hour^2,hum^2,temp^2,windspeed^2,year^2,atemp^3,hour^3,hum^3,temp^3,windspeed^3,year^3
10090,0.309611,0.06682,-1.437241,0.327826,0.186762,0.971895,0.095859,0.004465,2.065661,0.10747,0.03488,0.94458,0.029679,0.000298,-2.968853,0.035231,0.006514,0.918032
8441,-0.39363,-0.372787,0.113065,-0.503692,0.787845,-1.027889,0.154945,0.13897,0.012784,0.253706,0.620699,1.056556,-0.060991,-0.051806,0.001445,-0.12779,0.489015,-1.086022
8835,-1.360877,1.678715,-0.421523,-1.23127,0.426873,0.971895,1.851985,2.818083,0.177682,1.516026,0.18222,0.94458,-2.520323,4.730759,-0.074897,-1.866638,0.077785,0.918032
9270,-0.39363,-1.105467,1.930666,-0.503692,0.787845,0.971895,0.154945,1.222056,3.727471,0.253706,0.620699,0.94458,-0.060991,-1.350942,7.196502,-0.12779,0.489015,0.918032
9458,-1.185066,-1.69161,0.326901,-1.23127,-0.775293,0.971895,1.404382,2.861544,0.106864,1.516026,0.601079,0.94458,-1.664286,-4.840617,0.034934,-1.866638,-0.466013,0.918032


In [124]:
print(practice_X_train_poly.columns)

Index(['atemp', 'hour', 'hum', 'temp', 'windspeed', 'year', 'atemp^2',
       'hour^2', 'hum^2', 'temp^2', 'windspeed^2', 'year^2', 'atemp^3',
       'hour^3', 'hum^3', 'temp^3', 'windspeed^3', 'year^3'],
      dtype='object')


**1.5** Write code to add interaction terms to the model. Specifically, we want interactions between the continuous predictors (`temp`,`atemp`, `hum`,`windspeed`) and the month and weekday dummies (`Feb`, `Mar`...`Dec`, `Mon`, `Tue`, ... `Sat`). That means you SHOULD build `atemp*Feb` and `hum*Mon` and so on, but NOT `Feb*Mar` and NOT `Feb*Tue`. The interaction terms should always be a continuous feature times a month dummy or a continuous feature times a weekday dummy. <font color = 'red'> **CHECK THIS** </font>


Encapsulate this process as a function with appropriate inputs and outputs, and test your code by adding interaction terms to `practice_X_train_poly` and showing its column names and `.head()`.


In [129]:
# your code here
def add_interaction_terms(df_to_add_interactions,
                          df_original,
                          continuous_columns = ['temp','atemp','hum','windspeed'],
                          dummy_columns = ['Feb', 'Mar', 'Apr','May', 'Jun', 'Jul',
                                           'Aug', 'Sept', 'Oct', 'Nov', 'Dec','Mon', 
                                           'Tue', 'Wed', 'Thu', 'Fri', 'Sat']):
    """
    df_to_add_interactions pandas DataFrame: dataframe to add interaction terms
    df_original pandas DataFrame: dataframe holding the terms to form interactions with
    continuous_columns list(str): names of continuous predictors
    dummy_columns: names of dummy predictors (0/1)
    """
    df_interact = df_to_add_interactions.copy()
    for cont_col in continuous_columns:
        for other_col in dummy_columns:
            if cont_col != other_col:
                df_interact[cont_col+"*"+other_col] = df_original[cont_col]*df_original[other_col]
    return df_interact

In [130]:
practice_X_train_interact = add_interaction_terms(practice_X_train_poly, practice_X_train)

In [131]:
practice_X_train_interact.columns

Index(['atemp', 'hour', 'hum', 'temp', 'windspeed', 'year', 'atemp^2',
       'hour^2', 'hum^2', 'temp^2', 'windspeed^2', 'year^2', 'atemp^3',
       'hour^3', 'hum^3', 'temp^3', 'windspeed^3', 'year^3', 'temp*Feb',
       'temp*Mar', 'temp*Apr', 'temp*May', 'temp*Jun', 'temp*Jul', 'temp*Aug',
       'temp*Sept', 'temp*Oct', 'temp*Nov', 'temp*Dec', 'temp*Mon', 'temp*Tue',
       'temp*Wed', 'temp*Thu', 'temp*Fri', 'temp*Sat', 'atemp*Feb',
       'atemp*Mar', 'atemp*Apr', 'atemp*May', 'atemp*Jun', 'atemp*Jul',
       'atemp*Aug', 'atemp*Sept', 'atemp*Oct', 'atemp*Nov', 'atemp*Dec',
       'atemp*Mon', 'atemp*Tue', 'atemp*Wed', 'atemp*Thu', 'atemp*Fri',
       'atemp*Sat', 'hum*Feb', 'hum*Mar', 'hum*Apr', 'hum*May', 'hum*Jun',
       'hum*Jul', 'hum*Aug', 'hum*Sept', 'hum*Oct', 'hum*Nov', 'hum*Dec',
       'hum*Mon', 'hum*Tue', 'hum*Wed', 'hum*Thu', 'hum*Fri', 'hum*Sat',
       'windspeed*Feb', 'windspeed*Mar', 'windspeed*Apr', 'windspeed*May',
       'windspeed*Jun', 'windspeed*Jul', 'w

In [132]:
practice_X_train_interact.head()

Unnamed: 0,atemp,hour,hum,temp,windspeed,year,atemp^2,hour^2,hum^2,temp^2,...,windspeed*Sept,windspeed*Oct,windspeed*Nov,windspeed*Dec,windspeed*Mon,windspeed*Tue,windspeed*Wed,windspeed*Thu,windspeed*Fri,windspeed*Sat
10090,0.309611,0.06682,-1.437241,0.327826,0.186762,0.971895,0.095859,0.004465,2.065661,0.10747,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2239,0.0,0.0
8441,-0.39363,-0.372787,0.113065,-0.503692,0.787845,-1.027889,0.154945,0.13897,0.012784,0.253706,...,0.0,0.0,0.0,0.2985,0.0,0.0,0.0,0.0,0.2985,0.0
8835,-1.360877,1.678715,-0.421523,-1.23127,0.426873,0.971895,1.851985,2.818083,0.177682,1.516026,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9270,-0.39363,-1.105467,1.930666,-0.503692,0.787845,0.971895,0.154945,1.222056,3.727471,0.253706,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2985,0.0
9458,-1.185066,-1.69161,0.326901,-1.23127,-0.775293,0.971895,1.404382,2.861544,0.106864,1.516026,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1045


**1.6** Combine all your code so far into a function that takes in `bikes_train`, `bikes_val`, the names of the columns that should receive polynomial terms, the target column, and the columns to be dropped, and produces computation-ready design matrices `X_train` and `X_val` and responses `y_train` and `y_val`. Your final function should build correct, scaled design matrices with the stated interaction terms and any polynomial degree.

<font color = 'red'> **WHICH COLUMNS FOR SCALING? ARE THESE THE ONLY ONE DEGREE TERMS?** </font>

In [133]:
def get_design_mats(train_df, val_df,  degree, 
                    columns_forpoly=['temp', 'atemp', 'hum','windspeed', 'hour'],
                    target_col='counts', 
                    bad_columns=['counts', 'registered', 'casual', 'workingday', 'month', 'dteday']):
    # get predictors and target (validation data must come from val_df)
    x_val,y_val = get_X_and_y(val_df, target_col, bad_columns)
    x_train,y_train = get_X_and_y(train_df, target_col, bad_columns)
    
    # scale columns
    x_val_scaled = scale_col(x_val, x_train, columns_forpoly)
    x_train_scaled = scale_col(x_train, x_train, columns_forpoly)
    
    # add polynomial terms
    x_val_poly = add_poly_columns(x_val_scaled, columns_forpoly, degree)
    x_train_poly = add_poly_columns(x_train_scaled, columns_forpoly, degree)
    
    # add interaction terms
    x_val_interact = add_interaction_terms(x_val_poly, x_val)
    x_train_interact = add_interaction_terms(x_train_poly, x_train)
    
    x_train, x_val = x_train_interact, x_val_interact
    
    return x_train,y_train, x_val,y_val

In [134]:
# your code here
x_train,y_train, x_val,y_val = get_design_mats(bikes_train, bikes_val,  3)

In [135]:
x_train.columns

Index(['temp', 'atemp', 'hum', 'windspeed', 'hour', 'temp^2', 'atemp^2',
       'hum^2', 'windspeed^2', 'hour^2', 'temp^3', 'atemp^3', 'hum^3',
       'windspeed^3', 'hour^3', 'temp*Feb', 'temp*Mar', 'temp*Apr', 'temp*May',
       'temp*Jun', 'temp*Jul', 'temp*Aug', 'temp*Sept', 'temp*Oct', 'temp*Nov',
       'temp*Dec', 'temp*Mon', 'temp*Tue', 'temp*Wed', 'temp*Thu', 'temp*Fri',
       'temp*Sat', 'atemp*Feb', 'atemp*Mar', 'atemp*Apr', 'atemp*May',
       'atemp*Jun', 'atemp*Jul', 'atemp*Aug', 'atemp*Sept', 'atemp*Oct',
       'atemp*Nov', 'atemp*Dec', 'atemp*Mon', 'atemp*Tue', 'atemp*Wed',
       'atemp*Thu', 'atemp*Fri', 'atemp*Sat', 'hum*Feb', 'hum*Mar', 'hum*Apr',
       'hum*May', 'hum*Jun', 'hum*Jul', 'hum*Aug', 'hum*Sept', 'hum*Oct',
       'hum*Nov', 'hum*Dec', 'hum*Mon', 'hum*Tue', 'hum*Wed', 'hum*Thu',
       'hum*Fri', 'hum*Sat', 'windspeed*Feb', 'windspeed*Mar', 'windspeed*Apr',
       'windspeed*May', 'windspeed*Jun', 'windspeed*Jul', 'windspeed*Aug',
       'windspeed*Se

In [106]:
y_train.columns

Index(['counts'], dtype='object')

In [107]:
x_val.columns

Index(['temp', 'atemp', 'hum', 'windspeed', 'hour', 'temp^2', 'atemp^2',
       'hum^2', 'windspeed^2', 'hour^2', 'temp^3', 'atemp^3', 'hum^3',
       'windspeed^3', 'hour^3', 'temp*atemp', 'temp*hum', 'temp*windspeed',
       'temp*Feb', 'temp*Mar', 'temp*Apr', 'temp*May', 'temp*Jun', 'temp*Jul',
       'temp*Aug', 'temp*Sept', 'temp*Oct', 'temp*Nov', 'temp*Dec', 'temp*Mon',
       'temp*Tue', 'temp*Wed', 'temp*Thu', 'temp*Fri', 'temp*Sat',
       'atemp*temp', 'atemp*hum', 'atemp*windspeed', 'atemp*Feb', 'atemp*Mar',
       'atemp*Apr', 'atemp*May', 'atemp*Jun', 'atemp*Jul', 'atemp*Aug',
       'atemp*Sept', 'atemp*Oct', 'atemp*Nov', 'atemp*Dec', 'atemp*Mon',
       'atemp*Tue', 'atemp*Wed', 'atemp*Thu', 'atemp*Fri', 'atemp*Sat',
       'hum*temp', 'hum*atemp', 'hum*windspeed', 'hum*Feb', 'hum*Mar',
       'hum*Apr', 'hum*May', 'hum*Jun', 'hum*Jul', 'hum*Aug', 'hum*Sept',
       'hum*Oct', 'hum*Nov', 'hum*Dec', 'hum*Mon', 'hum*Tue', 'hum*Wed',
       'hum*Thu', 'hum*Fri', 'hum*Sat', 

In [108]:
y_val.columns

Index(['counts'], dtype='object')

<div class='exercise'> <b> Question 2 [20pts]: Regularization via Ridge </b></div>

**2.1** For each degree in 1 through 8:

1.  Build the training design matrix and validation design matrix using the function `get_design_mats` with polynomial terms up through the specified degree.

2.  Fit a regression model to the training data.

3.  Report the model's score on the validation data.

**2.2** Discuss patterns you see in the results from 2.1. Which model would you select, and why?

**2.3** Let's try regularizing our models via ridge regression. Build a table showing the validation set $R^2$ of polynomial models with degree from 1-8, regularized at the levels $\lambda = (.01, .05, .1,.5, 1, 5, 10, 50, 100)$. Do not perform cross validation at this point, simply report performance on the single validation set. 

**2.4** Find the best-scoring degree and regularization combination.

**2.5** It's time to see how well our selected model will do on future data. Read in the provided test dataset, do any required formatting, and report the best model's $R^2$ score. How does it compare to the validation set score that made us choose this model? 

**2.6** Why do you think our model's test score was quite a bit worse than its validation score? Does the test set simply contain harder examples, or is something else going on?

### Solutions 

**2.1** For each degree in 1 through 8:

1.  Build the training design matrix and validation design matrix using the function `get_design_mats` with polynomial terms up through the specified degree.

2.  Fit a regression model to the training data.

3.  Report the model's score on the validation data.
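
The three steps above can be sketched on synthetic data as follows. This is an illustration under stated assumptions, not the homework solution: `toy_design` is a hypothetical stand-in for `get_design_mats` that only builds pure powers of a single column `x`, and the data are simulated from a quadratic trend.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): quadratic signal plus noise.
rng = np.random.RandomState(0)
x = rng.uniform(-2, 2, size=200)
toy = pd.DataFrame({"x": x, "y": 1.0 + x - 0.5 * x**2 + rng.normal(scale=0.3, size=200)})
toy_train, toy_val = train_test_split(toy, test_size=0.2, random_state=90)

def toy_design(df, degree):
    """Hypothetical stand-in for get_design_mats: powers of x from 1 to degree."""
    return pd.DataFrame({f"x^{d}": df["x"] ** d for d in range(1, degree + 1)})

val_scores = {}
for degree in range(1, 9):
    # 1. build the design matrices, 2. fit on training data, 3. score on validation data
    model = LinearRegression().fit(toy_design(toy_train, degree), toy_train["y"])
    val_scores[degree] = model.score(toy_design(toy_val, degree), toy_val["y"])

for degree, score in val_scores.items():
    print(degree, round(score, 3))
```

`LinearRegression.score` returns $R^2$, so `val_scores` maps each degree to its validation $R^2$.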

In [20]:
# your code here


**2.2** Discuss patterns you see in the results from 2.1. Which model would you select, and why?

*your answer here*


**2.3** Let's try regularizing our models via ridge regression. Build a table showing the validation set $R^2$ of polynomial models with degree from 1-8, regularized at the levels $\lambda = (.01, .05, .1,.5, 1, 5, 10, 50, 100)$. Do not perform cross validation at this point, simply report performance on the single validation set. 
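
A minimal sketch of such a table, again on synthetic data, is shown below. The helper `toy_design` and the simulated data are assumptions for illustration; in scikit-learn the regularization level $\lambda$ is passed to `Ridge` as the `alpha` argument.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): quadratic signal plus noise.
rng = np.random.RandomState(0)
x = rng.uniform(-2, 2, size=200)
toy = pd.DataFrame({"x": x, "y": 1.0 + x - 0.5 * x**2 + rng.normal(scale=0.3, size=200)})
toy_train, toy_val = train_test_split(toy, test_size=0.2, random_state=90)

def toy_design(df, degree):
    """Hypothetical stand-in for get_design_mats: powers of x from 1 to degree."""
    return pd.DataFrame({f"x^{d}": df["x"] ** d for d in range(1, degree + 1)})

lambdas = [0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0, 50.0, 100.0]
rows = {}
for degree in range(1, 9):
    X_tr, X_va = toy_design(toy_train, degree), toy_design(toy_val, degree)
    # one row per degree: validation R^2 at every regularization level
    rows[degree] = {
        lam: Ridge(alpha=lam).fit(X_tr, toy_train["y"]).score(X_va, toy_val["y"])
        for lam in lambdas
    }

ridge_table = pd.DataFrame.from_dict(rows, orient="index")
ridge_table.index.name = "degree"
ridge_table.columns.name = "lambda"
print(ridge_table.round(3))
```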


In [21]:
# your code here


**2.4** Find the best-scoring degree and regularization combination.
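
Given a table with degrees as rows and regularization levels as columns, the best cell can be located with `idxmax`. The sketch below uses a small table of made-up placeholder numbers (not results from the bikeshare data) purely to show the lookup.

```python
import pandas as pd

# Placeholder scores (made up for illustration); rows are degrees, columns are lambdas.
scores = pd.DataFrame(
    [[0.50, 0.52, 0.51],
     [0.55, 0.61, 0.58],
     [0.40, 0.57, 0.60]],
    index=pd.Index([1, 2, 3], name="degree"),
    columns=pd.Index([0.1, 1.0, 10.0], name="lambda"),
)

# Row whose best entry is largest, then the best column within that row.
best_degree = scores.max(axis=1).idxmax()
best_lambda = scores.loc[best_degree].idxmax()
print(best_degree, best_lambda, scores.loc[best_degree, best_lambda])
```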

In [22]:
# your code here


**2.5** It's time to see how well our selected model will do on future data. Read in the provided test dataset `data/bikes_test.csv`, do any required formatting, and report the best model's $R^2$ score. How does it compare to the validation set score that made us choose this model? 

In [23]:
# your code here


In [24]:
# your code here


**2.6** Why do you think our model's test score was quite a bit worse than its validation score? Does the test set simply contain harder examples, or is something else going on?

In [25]:
# your code here


*your answer here*


<div class='exercise'><b> Question 3 [20pts]: Comparing Ridge, Lasso, and OLS </b> </div>

**3.1** Build a dataset with polynomial degree 1 and fit an OLS model, a Ridge model, and a Lasso model. Use `RidgeCV` and `LassoCV` to select the best regularization level from among `(.1,.5,1,5,10,50,100)`. 

Note: On the lasso model, you will need to increase `max_iter` to 100,000 for the optimization to converge.

**3.2** Plot histograms of the coefficients found by each of OLS, ridge, and lasso. What trends do you see in the magnitude of the coefficients?

**3.3** The plots above show the overall distribution of coefficient values in each model, but do not show how each model treats individual coefficients. Build a plot which cleanly presents, for each feature in the data, 1) The coefficient assigned by OLS, 2) the coefficient assigned by ridge, and 3) the coefficient assigned by lasso.

**Hint: Bar plots are a possible choice, but you are not required to use them**

**Hint: use `xticks` to label coefficients with their feature names**

**3.4** What trends do you see in the plot above? How do the three approaches handle the correlated pair `temp` and `atemp`?

### Solutions

**3.1** Build a dataset with polynomial degree 1 and fit an OLS model, a Ridge model, and a Lasso model. Use `RidgeCV` and `LassoCV` to select the best regularization level from among `(.1,.5,1,5,10,50,100)`. 

Note: On the lasso model, you will need to increase `max_iter` to 100,000 for the optimization to converge.
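
As a sketch of the API calls involved, the example below fits the three models on synthetic data (an assumption; the real exercise uses the degree-1 design matrix). Both `RidgeCV` and `LassoCV` accept the candidate levels through `alphas` and expose the selected one as `alpha_` after fitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV

# Synthetic data (assumption): only the first two of five features matter.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

alphas = [0.1, 0.5, 1, 5, 10, 50, 100]

ols = LinearRegression().fit(X, y)
ridge = RidgeCV(alphas=alphas).fit(X, y)
lasso = LassoCV(alphas=alphas, max_iter=100000).fit(X, y)

print("ridge alpha:", ridge.alpha_)
print("lasso alpha:", lasso.alpha_)
print("OLS coefficients:  ", np.round(ols.coef_, 2))
print("ridge coefficients:", np.round(ridge.coef_, 2))
print("lasso coefficients:", np.round(lasso.coef_, 2))
```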

In [26]:
#your code here



**3.2** Plot histograms of the coefficients found by each of OLS, ridge, and lasso. What trends do you see in the magnitude of the coefficients?

In [27]:
# your code here


*your answer here*


**3.3** The plots above show the overall distribution of coefficient values in each model, but do not show how each model treats individual coefficients. Build a plot which cleanly presents, for each feature in the data, 1) The coefficient assigned by OLS, 2) the coefficient assigned by ridge, and 3) the coefficient assigned by lasso.

**Hint: Bar plots are a possible choice, but you are not required to use them**

**Hint: use `xticks` to label coefficients with their feature names**
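
One possible layout is a grouped bar chart with one cluster of three bars per feature. The sketch below uses generic feature names and made-up placeholder coefficients (assumptions, not fitted values) only to show how the bars are offset and how the ticks are labeled.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder names and values (made up for illustration only).
feature_names = ["f0", "f1", "f2", "f3"]
coefs = {
    "OLS": np.array([4.0, 3.5, -2.0, 0.5]),
    "Ridge": np.array([3.0, 2.8, -1.8, 0.4]),
    "Lasso": np.array([6.0, 0.0, -1.5, 0.0]),
}

positions = np.arange(len(feature_names))
width = 0.25

fig, ax = plt.subplots(figsize=(8, 4))
for i, (label, values) in enumerate(coefs.items()):
    # shift each model's bars so the three sit side by side around the tick
    ax.bar(positions + (i - 1) * width, values, width=width, label=label)

ax.set_xticks(positions)
ax.set_xticklabels(feature_names, rotation=45)
ax.set_ylabel("coefficient")
ax.legend()
```

With many features, a wider figure or horizontal bars (`ax.barh`) may keep the labels readable.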

In [189]:
# your code here


**3.4** What trends do you see in the plot above? How do the three approaches handle the correlated pair `temp` and `atemp`?

In [191]:
# your code here 


*your answer here*


<div class='exercise'> <b> Question 4 [20 pts]: Reflection </b></div>
These problems are open-ended, and you are not expected to write more than 2-3 sentences. We are interested in seeing that you have thought about these issues; you will be graded on how well you justify your conclusions here, not on what you conclude.

**4.1** Reflect back on the `get_design_mats` function you built. Was writing this function useful in your analysis? What issues might you have encountered if you had copied and pasted the model-building code instead of tying it together in a function? Does a `get_design_mats` function seem wise in general, or are there better options?

*your answer here*

**4.2** What are the costs and benefits of applying ridge/lasso regularization to an overfit OLS model, versus setting a specific degree of polynomial or forward selecting features for the model?

*your answer here*

**4.3** This pset posed a purely predictive goal: forecast ridership as accurately as possible. How important is interpretability in this context? Considering, e.g., your lasso and ridge models from Question 3, how would you react if the models predicted well, but the coefficient values didn't make sense once interpreted?

*your answer here*


**4.4** Reflect back on our original goal of helping BikeShare predict what demand will be like in the week ahead, and thus how many bikes they can bring in for maintenance. In your view, did we accomplish this goal? If yes, which model would you put into production and why? If not, which model came closest, what other analyses might you conduct, and how likely do you think they are to work?

*your answer here*
