# Lab 3: Visualizing nonlinear regression

Data science is a visual practice. Visualizing your models, their predictions, and their errors so that you can communicate findings and limitations well is roughly half of the job.

We will recreate many of the models shown in class. First, load and look through the motorcycle dataset.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('motorcycle.csv')
print(df)

The dataset column 'times' is the time since impact and 'accel' is the acceleration of the rider's head in g (multiples of gravitational acceleration).

The dataset has some duplicated times. Without a way to interpret what those repeats mean, we remove them, since some methods do not handle duplicated inputs well. The following line drops every entry whose 'times' value has already appeared, keeping only the first occurrence.

In [None]:
df = df[~ df['times'].duplicated()]
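
To see what that boolean mask does, here is a minimal sketch on a made-up four-row table (the values in `toy` are invented for illustration and are not taken from the motorcycle data). `duplicated()` marks every repeat of an earlier value, and `~` inverts the mask so that only first occurrences are kept.

```python
import pandas as pd

# Hypothetical toy data: the time 2.0 appears twice.
toy = pd.DataFrame({
    'times': [1.0, 2.0, 2.0, 3.0],
    'accel': [0.0, -1.0, -5.0, 2.0],
})

mask = toy['times'].duplicated()  # True for each repeat of an earlier value
deduped = toy[~mask]              # keep only rows where the mask is False

print(mask.tolist())
print(deduped)
```

Note that the surviving rows keep their original index labels, so the index of the result can have gaps.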

Now we can plot the data.

In [None]:
df.plot.scatter(x='times', y='accel')

## Polynomial regression

The first model to fit is polynomial regression. You saw in lecture that polynomial regression (like all basis expansion methods) is simply fit by linear regression on an appropriately constructed basis.
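
As a small illustration of that idea, the sketch below builds a degree-2 basis for a handful of made-up points and fits it with ordinary linear regression. The data here are synthetic and noise-free (an assumption chosen so the result is easy to check), and the variable names are only for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data generated from y = 1 + 2x - 0.5x^2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x - 0.5 * x ** 2

X = x.reshape(-1, 1)                 # scikit-learn expects a 2-D input
basis = PolynomialFeatures(degree=2)
X_poly = basis.fit_transform(X)      # columns are 1, x, x^2

# The basis already contains the constant column, so skip the extra intercept.
model = LinearRegression(fit_intercept=False).fit(X_poly, y)

print(X_poly)
print(model.coef_)  # one coefficient per basis column
```

Because the data are exactly quadratic, the fitted coefficients should recover the generating polynomial up to floating-point error. With real, noisy data the same two steps apply, but the fit will only approximate the underlying curve.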

### Problem 1
Use the class 'sklearn.preprocessing.PolynomialFeatures' to construct those bases, fit a few polynomials of differing degrees, and plot them to compare.

In [None]:
"""
The import statements below are to help you. If you read the documentation, you will see PolynomialFeatures
is used like this:

    pr = PolynomialFeatures(degree=3)  # for a degree 3 polynomial
    X_polynomial = pr.fit_transform(X)
    
Note that, even though we are only dealing with univariate inputs, most scikit-learn routines expect a 2-D array
of shape (n_samples, n_features), both to fit and to make predictions.

For a column of a dataframe, you can make sure you have that by doing:

    X = df['times'].to_frame()
    
Or for a numpy array, you can use

    x_line = np.linspace(0, 60, 1000)  # this is just a 1-D array of shape (1000,)
    x_line_2d = x_line.reshape(-1, 1)  # this is now a 2-D array of shape (1000, 1)
    
"""

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression


## Regression splines

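Splines are usually a better choice than a single global polynomial. (Natural) cubic regression splines are among the most popular regression splines used in practice. SciPy (a scientific computing package that builds on NumPy and is included in most scientific Python distributions) has a class 'scipy.interpolate.UnivariateSpline' that fits a cubic smoothing spline by default. Note that it does not impose the natural boundary conditions, so it is a close relative of the natural cubic spline rather than exactly the same model.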

### Problem 2
Fit a cubic spline with 'UnivariateSpline' using various values of the smoothing (regularization) parameter, which is called 's'.
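
To get a feel for how 's' controls the fit, here is a minimal sketch on synthetic data. The noisy sine curve, the noise level, and the variable names are assumptions made for illustration, not part of the lab data. In `UnivariateSpline`, 's' is an upper bound on the sum of squared residuals: `s=0` forces the spline through every point, while a larger 's' allows a smoother curve with fewer knots.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Synthetic data: a sine curve plus Gaussian noise (x must be increasing).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

# s=0 interpolates: the spline passes through every data point.
interp = UnivariateSpline(x, y, s=0)

# A common starting point is s close to n * (noise variance).
s = 0.2 ** 2 * len(x)
smooth = UnivariateSpline(x, y, s=s)

for name, spl in [('s=0', interp), ('s=%.1f' % s, smooth)]:
    # get_knots() returns the knot positions, get_residual() the sum of squared residuals
    print(name, len(spl.get_knots()), spl.get_residual())
```

Plotting `interp(x_line)` and `smooth(x_line)` on a fine grid `x_line` is a quick way to compare the wiggly and the smooth fit by eye.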

In [None]:
"""
After importing, you create and fit the model like:

    x = df['times'].to_numpy()  # 1-D inputs; UnivariateSpline requires x to be increasing
    y = df['accel'].to_numpy()
    s = 100 * len(x)  # the smoothing parameter... you can play around with different values
    spl = UnivariateSpline(x, y, s=s)
    y_pred = spl(x_test)  # to make predictions. Here x_test can be a 1-D array.
    
"""

from scipy.interpolate import UnivariateSpline


The natural next step is to choose a good value for the smoothing parameter via cross-validation. Because this is a relatively small dataset, we probably want to use K-fold cross-validation, which lets every observation be used for both fitting and validation.

We cannot easily reuse the cross-validation helper introduced in Lab 2 (last week), because the smoothing spline is not a scikit-learn model. Instead we can use another scikit-learn helper for K-fold cross-validation: 'sklearn.model_selection.KFold'.
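
The general pattern for a model that is not a scikit-learn estimator is to loop over the folds yourself: fit on the training positions, predict on the held-out positions, and average the errors. The sketch below shows that loop using `np.polyfit` as a stand-in model on synthetic data. The helper name `cv_mse`, the data, and the candidate degrees are all hypothetical choices for illustration; the same loop structure should carry over to the smoothing spline with 's' in place of the degree.

```python
import numpy as np
from sklearn.model_selection import KFold

# Synthetic data: a straight line plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 60)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.size)


def cv_mse(x, y, degree, n_splits=5):
    """Average held-out mean squared error of a degree-`degree` polynomial fit."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    fold_errors = []
    for train_index, test_index in kf.split(x):
        # Fit on the training positions only.
        coefs = np.polyfit(x[train_index], y[train_index], deg=degree)
        # Evaluate on the held-out positions.
        y_pred = np.polyval(coefs, x[test_index])
        fold_errors.append(np.mean((y[test_index] - y_pred) ** 2))
    return float(np.mean(fold_errors))


# Score each candidate setting and pick the one with the lowest CV error.
scores = {degree: cv_mse(x, y, degree) for degree in (0, 1, 5)}
best_degree = min(scores, key=scores.get)
print(scores, best_degree)
```

Fixing `random_state` means every candidate is scored on the same folds, which makes the comparison between candidates fairer.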

### Problem 3
Use cross-validation to select a good value of the smoothing parameter 's' for the smoothing spline.

In [None]:
from sklearn.model_selection import KFold

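# shuffle=True matters if the rows are ordered by time: unshuffled folds would be contiguous time blocks
kf = KFold(n_splits=10, shuffle=True, random_state=0)  # K-fold cross validation

# split() yields integer positions (not index labels), so select rows with df.iloc[...]
for train_index, test_index in kf.split(df):
    print(train_index)  # positions of the rows used for fitting in this fold
    print(test_index)   # positions of the held-out rows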
