In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures

import numpy as np
import pandas as pd

In the following cells, you will load the Brooklyn Bridge pedestrian traffic dataset, which you have worked with before for exploratory data analysis.

You will train a model to predict pedestrian traffic based on the following features: temperature, precipitation, hour, whether or not it is a weekend, and whether or not it is a holiday or other special event.

The feature data will be loaded into `X` and the target variable into `y`.

In [None]:
df = pd.read_excel('brooklyn-bridge-automated-counts.xlsx')

# Derive calendar features from the hourly timestamp
df['hour'] = df['hour_beginning'].dt.hour
df['date'] = df['hour_beginning'].dt.date
df['day_name'] = df['hour_beginning'].dt.day_name()
df['day_no'] = df['hour_beginning'].dt.dayofweek

# Forward-fill missing weather values
# (.ffill() replaces the deprecated fillna(method="ffill"))
df['temperature'] = df['temperature'].ffill()
df['precipitation'] = df['precipitation'].ffill()
df['weather_summary'] = df['weather_summary'].ffill()

# Binary indicators: Saturday/Sunday, and any non-null entry in `events`
df['is_weekend'] = df['day_no'].isin([5, 6]).astype('int')
df['is_holiday'] = df['events'].notnull().astype('int')

In [None]:
X = np.array(df[['temperature', 'precipitation', 'hour', 'is_weekend', 'is_holiday']])
y = np.array(df['Pedestrians'])

You have reason to believe that there may be interaction effects or non-linear effects of these features on the target variable. For example, if it is cold *and* rainy, that may have more of a deterrent effect on pedestrians than just the sum of the effects of cold and rainy individually.

So, before training a model, you will use `sklearn`'s `PolynomialFeatures` transformer to generate polynomial and interaction features. According to its documentation, this transformer will:

> Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].

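To make the quoted example concrete, here is a minimal sketch on a single made-up sample with a = 2 and b = 3. The values and the names `X_toy` and `toy_poly` are illustrative only and are not part of the assignment.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One made-up sample [a, b] = [2, 3]; illustrative values only.
X_toy = np.array([[2.0, 3.0]])

toy_poly = PolynomialFeatures(degree=2)
X_toy_trans = toy_poly.fit_transform(X_toy)

# Columns follow the order [1, a, b, a^2, ab, b^2].
print(X_toy_trans)  # [[1. 2. 3. 4. 6. 9.]]
```

The `ab` column (here 6) is the interaction term: it lets a linear model assign a weight to the combination of two features, beyond the weights on each feature alone.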

For example, this code will generate the degree-2 polynomial features for the Brooklyn Bridge data in `X`:





In [None]:
poly = PolynomialFeatures(degree=2)
X_trans = poly.fit_transform(X)
X_trans.shape

where the new features are:

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
for i, f in enumerate(poly.get_feature_names_out()):
    print(i, f)

You are interested in training a linear regression on this data, to predict the number of pedestrians, but you don't know what degree of polynomial to use. 

You decide to evaluate linear models on transformed versions of `X` up to degree 5 (including degree 5), to see which has the best performance in a linear regression.

First, you use `PolynomialFeatures` to create a transformed data set with polynomial features up to and including degree 5. 

In [None]:
poly = PolynomialFeatures(degree=5)
X_trans = poly.fit_transform(X)

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
X_trans_names = poly.get_feature_names_out()

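As a sanity check on the size of the degree-5 expansion: the number of polynomial terms of total degree at most d in n variables is C(n + d, d). The sketch below computes this for the five features and compares it against `PolynomialFeatures` fitted on placeholder random data (not the real `X`). It also checks that the output columns are ordered by total degree, which is how current scikit-learn versions behave for dense input; treat that ordering as something to verify in your own environment rather than a guarantee. The names `X_demo`, `demo_poly`, and `col_degree` are illustrative.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 5  # temperature, precipitation, hour, is_weekend, is_holiday

# Number of terms of total degree <= d in n variables: C(n + d, d).
n_terms = [comb(n_features + d, d) for d in range(6)]
print(n_terms)  # [1, 6, 21, 56, 126, 252]

# Compare with PolynomialFeatures on placeholder data (random, not the real X).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10, n_features))
demo_poly = PolynomialFeatures(degree=5).fit(X_demo)
print(demo_poly.n_output_features_)

# Total degree of each output column; non-decreasing values mean that
# lower-degree terms come before higher-degree terms.
col_degree = demo_poly.powers_.sum(axis=1)
print(np.all(np.diff(col_degree) >= 0))
```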
Then, you set aside 30% of the rows of `X_trans` (and the matching entries of `y`) for evaluating the final model at the end. Save the result in `X_tr`, `y_tr`, `X_ts`, and `y_ts`.

You use `sklearn`'s `train_test_split` without shuffling (because of the temporal structure of the data).

In [None]:
X_tr, X_ts, y_tr, y_ts = train_test_split(X_trans, y, test_size = 0.3, shuffle=False)

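To see what the unshuffled split does, here is a small sketch on ten time-ordered placeholder values (not the real data). With `shuffle=False`, the first 70% of the rows go to the training set and the last 30% to the test set, so the model is never evaluated on observations that precede its training data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten time-ordered placeholder observations (not the real data).
t = np.arange(10)

t_tr, t_ts = train_test_split(t, test_size=0.3, shuffle=False)
print(t_tr)  # [0 1 2 3 4 5 6]
print(t_ts)  # [7 8 9]
```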
Now, you will use 10-fold cross-validation (with `sklearn`'s `KFold`) to evaluate each `degree` from 0 to 5 (including 5) in an `sklearn` `LinearRegression` model, using `r2_score` as the metric. A degree of 0 corresponds to a model with only the constant term.

In your cross-validation, you will save the validation R2 for each degree and fold in an array called `r2_val`, and save the training R2 in an array called `r2_train`.

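As a reminder of what `KFold.split` produces, here is a minimal sketch on six placeholder rows with three folds. It only shows the index arrays; it is not the cross-validation loop for this exercise. Unpacking each pair directly in the `for` statement is an illustrative choice, and the names `X_demo`, `demo_kf`, `idx_tr`, and `idx_val` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold

# Six placeholder rows standing in for a training set.
X_demo = np.arange(12).reshape(6, 2)

demo_kf = KFold(n_splits=3, shuffle=False)
for ifold, (idx_tr, idx_val) in enumerate(demo_kf.split(X_demo)):
    # idx_tr and idx_val are integer index arrays into the rows of X_demo.
    print(ifold, idx_tr, idx_val)
# 0 [2 3 4 5] [0 1]
# 1 [0 1 4 5] [2 3]
# 2 [0 1 2 3] [4 5]
```

Each row appears in the validation indices of exactly one fold, and without shuffling the validation blocks are contiguous.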




In [None]:
nd = 6
nfold = 10

r2_train = np.zeros((nd,nfold))
r2_val = np.zeros((nd,nfold))

In [None]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)

# note: only the code in this cell and the code provided for you will be 
# passed to the autograder. If you define any additional variables
# that are required to run this cell, make sure they are defined in this cell!


kf = KFold(n_splits=nfold, shuffle=False)
kf.get_n_splits(X_tr)

for isplit, idx in enumerate(kf.split(X_tr)):
        
    ...


Then, create an array `r2_mean` with the mean validation R2 value for each degree, averaged across the `nfold` folds.

In [None]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
r2_mean = ...

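For an array laid out like the ones above, with one row per candidate and one column per fold, averaging across folds means reducing along axis 1. A minimal sketch on made-up scores (the numbers and the name `scores_demo` are illustrative, not results from this dataset):

```python
import numpy as np

# Made-up scores: 3 candidates (rows) x 4 folds (columns).
scores_demo = np.array([
    [0.10, 0.20, 0.15, 0.15],
    [0.50, 0.60, 0.55, 0.55],
    [0.40, 0.30, 0.35, 0.35],
])

# Averaging over axis 1 collapses the folds, leaving one value per row.
mean_demo = scores_demo.mean(axis=1)
print(mean_demo.shape)  # (3,)
print(mean_demo)
```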
Finally, select the model with the best mean validation R2. Save its polynomial degree (the model order) in `d_opt`.

In [None]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
d_opt = ...