### Polynomial Regression using Scikit-learn

<br>

In [None]:
# Import useful libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures


Let's apply polynomial regression using the Auto dataset. You can find a CSV file with this dataset on Blackboard.

The Auto dataset contains data about gas mileage, horsepower, number of cylinders, and other information for 392 types of vehicles.

In this example, our predictor will be __horsepower__ and the outcome will be __gas mileage (mpg)__.

In [None]:
Auto_df=pd.read_csv('C:\\Users\\jheredi2\\Documents\\PythonDataAnalytics\\1-Datasets\\Auto_ISLR.csv')

In [None]:
Auto_df.info()

#### Are horsepower and mpg related?

In [None]:
Auto_df[['mpg','horsepower']].corr()

Let's do a scatterplot to get a better sense of their relationship and identify potential outliers!

In [None]:
plt.style.use('seaborn-v0_8') # Just makes the graph look nicer (on Matplotlib older than 3.6 the style is named 'seaborn')

plt.scatter(Auto_df['horsepower'], Auto_df['mpg'],c='blue',marker='o')

plt.xlabel("Horsepower")

plt.ylabel("MPG")

plt.axhline(Auto_df['mpg'].mean(),c='red',ls='--')

plt.show()

### Fitting a polynomial model of second degree for mpg based on horsepower:

__Estimated mpg__ = b0 + b1 * horsepower + b2 * (horsepower squared)

The PolynomialFeatures() class from scikit-learn allows us to specify what kind of polynomial model we want to fit to the data; that is, it allows us to specify the degree of the polynomial, whether we want only the interaction terms, whether to include a column of ones for the intercept, and other specifications.
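
As a small illustration of these options, here is a sketch on made-up data with two predictors (not the Auto dataset). The names `X_demo`, `full`, `no_bias`, and `inter_only` are just for this example.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Made-up data: two observations of two predictors, a and b
X_demo = np.array([[2.0, 3.0],
                   [4.0, 5.0]])

# Defaults with degree=2: columns are 1, a, b, a^2, a*b, b^2
full = PolynomialFeatures(degree=2).fit_transform(X_demo)

# include_bias=False drops the leading column of ones
no_bias = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_demo)

# interaction_only=True keeps products of different predictors
# but no pure powers: columns are 1, a, b, a*b
inter_only = PolynomialFeatures(degree=2, interaction_only=True).fit_transform(X_demo)

print(full.shape, no_bias.shape, inter_only.shape)
print(full[0])
```

With a single predictor such as horsepower there is nothing to interact with, so only the degree and the column of ones matter.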



In [None]:
poly2_object= PolynomialFeatures(degree=2)

The object returned by PolynomialFeatures() is then used to create the matrix with the values of X, X squared, X cubed, etc. To do that, we use its fit_transform() method.

The predictor (or predictors) needs to be converted into a NumPy array before being passed to fit_transform().

Also, __very important__: scikit-learn methods expect the NumPy array containing the values of the predictors to have two dimensions.

So, what do we do in this case, where we only have one predictor? The values of a single predictor fit in a one-dimensional array, so what do we do?

We transform the array with the values of the predictor into a two-dimensional array where the number of columns equals 1. That is, we create a column array: an array with m rows and 1 column.

We can use the reshape() method to do that.

In [None]:
# Horsepower as one dimensional array
np.array(Auto_df['horsepower'])

In [None]:
X= np.array(Auto_df['horsepower']).reshape(-1,1)

In [None]:
# Horsepower as a two dimensional array with m rows and 1 column

X

An alternative to using reshape() is to use this code: X= np.array(Auto_df['horsepower'])[:, np.newaxis]

This would also create a column array with two dimensions.
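
To see the shapes involved, here is a minimal sketch with made-up numbers (the values in `x_1d` and `y_demo` are not taken from the Auto dataset). It also shows what happens if the one-dimensional array is passed to scikit-learn directly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x_1d = np.array([130.0, 165.0, 150.0, 140.0])   # made-up horsepower values
y_demo = np.array([18.0, 15.0, 18.0, 16.0])     # made-up mpg values

col_a = x_1d.reshape(-1, 1)     # -1 lets NumPy work out the number of rows
col_b = x_1d[:, np.newaxis]     # same column array, built by indexing

print(x_1d.shape, col_a.shape, col_b.shape)
same = np.array_equal(col_a, col_b)
print(same)

# Passing the one-dimensional predictor makes scikit-learn raise a ValueError
try:
    LinearRegression().fit(x_1d, y_demo)
    raised = False
except ValueError:
    raised = True
print(raised)
```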

Next, let's use fit_transform() to create the matrix with the values of X and X squared.

In [None]:
X_poly2= poly2_object.fit_transform(X)
X_poly2

Next, the outcome variable also needs to be converted into an array. However, the outcome variable should be stored in a one-dimensional array!!!

__Do not convert the array with values of Y into a column array !!!__
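
One practical reason, sketched on made-up data: scikit-learn will fit with a column-array outcome too, but the shapes of coef_ and intercept_ change, which would break code that indexes the coefficients as a flat array. The names `fit_1d` and `fit_2d` are just for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])   # made-up predictor (column array)
y_1d = np.array([2.0, 4.1, 5.9, 8.2])             # made-up outcome (one-dimensional)

fit_1d = LinearRegression().fit(X_demo, y_1d)
fit_2d = LinearRegression().fit(X_demo, y_1d.reshape(-1, 1))

# One-dimensional y: flat coefficient array and a scalar intercept
print(fit_1d.coef_.shape, np.shape(fit_1d.intercept_))

# Column-array y: coefficients gain an extra dimension and the intercept becomes an array
print(fit_2d.coef_.shape, np.shape(fit_2d.intercept_))
```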

In [None]:
y= np.array(Auto_df['mpg'])

Now, we need to specify how we are going to obtain the polynomial regression of degree 2.

We need to specify what method we will use to obtain the coefficients of the polynomial regression of degree 2.

As discussed in the slides, in polynomial regression we can still use OLS to achieve this purpose.

In scikit-learn, to use OLS, we create a LinearRegression() object.
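
As a sanity check that this really is ordinary least squares, the sketch below compares LinearRegression() with NumPy's generic least-squares solver on made-up data. The data, the small range of `x`, and the variable names are assumptions chosen only to keep the example short and numerically well behaved.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Made-up data that follows a quadratic trend plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=40).reshape(-1, 1)
y_demo = 1.0 + 2.0 * x.ravel() - 0.5 * x.ravel() ** 2 + rng.normal(0, 0.3, size=40)

# Design matrix with columns 1, x, x^2
X_design = PolynomialFeatures(degree=2).fit_transform(x)

# Plain least squares on the design matrix
beta, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)

# scikit-learn fit: intercept_ plays the role of b0, coef_[1] and coef_[2] of b1 and b2
model = LinearRegression(fit_intercept=True).fit(X_design, y_demo)
sk_beta = np.array([model.intercept_, model.coef_[1], model.coef_[2]])

print(beta)
print(sk_beta)
print(np.allclose(beta, sk_beta, atol=1e-6))
```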

In [None]:
model_poly2 = LinearRegression(fit_intercept=True)

Now we use the linear regression object (model_poly2) to fit a model where the X values are the polynomial features stored in X_poly2.

In [None]:
model_poly2.fit(X_poly2, y)

In [None]:
# Getting the equation coefficients

model_poly2.coef_

In [None]:
model_poly2.intercept_

Let's use the polynomial equation to predict values of __mpg__.

For example, let's predict the value of mpg based on horsepower for the first five training observations.

In [None]:
model_poly2.predict(X_poly2)[0:5]

An alternative (although LONGER!) way of getting the same predictions is using the coefficients and writing out the equation.

As practice, let's predict using this alternative way.

In [None]:
# To retrieve the individual coefficients we can do this:

model_poly2.coef_

In [None]:
print(model_poly2.coef_[1])
print(model_poly2.coef_[2])

In [None]:
model_poly2.intercept_

In [None]:
# Writing out the equation: coefficients * values of X

model_poly2.intercept_ + model_poly2.coef_[1]* (Auto_df['horsepower'].values[0:5]) + model_poly2.coef_[2]* ( (Auto_df['horsepower'].values**2)[0:5])
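
Why does the equation skip coef_[0]? The sketch below, on made-up data, suggests the reason: with the default PolynomialFeatures() the first column is all ones, and because LinearRegression(fit_intercept=True) estimates the intercept separately, the coefficient attached to that column comes out as (essentially) zero. The variable names are assumptions for this example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])            # made-up predictor
y_demo = np.array([2.1, 3.9, 9.2, 15.8, 25.1, 35.9])    # made-up outcome

X_design = PolynomialFeatures(degree=2).fit_transform(x.reshape(-1, 1))
model = LinearRegression(fit_intercept=True).fit(X_design, y_demo)

b0 = model.intercept_
_, b1, b2 = model.coef_    # the first entry belongs to the column of ones

# Writing out the equation by hand
manual = b0 + b1 * x + b2 * x**2

print(model.coef_[0])
print(np.allclose(manual, model.predict(X_design)))
```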


__QUESTION__: The expression that writes out the equation with model_poly2.intercept_ and model_poly2.coef_ shows two important properties of arrays: __vectorization__ and __recycling__ (NumPy calls recycling __broadcasting__). Where does each property show up in this expression?

In [None]:
# vectorization
y1= 2* np.array([3,4,5])
y1

In [None]:
# recycling
1 + y1

Now, let's use predict() to predict the value of mpg for five NEW test observations.

We need to generate five new values of horsepower. How can we make sure that the new values of horsepower make sense?

Make sense = they fall within the range of horsepower values observed in the data.
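
One way to check this in code, sketched with made-up numbers (`observed_hp` stands in for Auto_df['horsepower'], and `candidate_hp` is a hypothetical set of new values):

```python
import numpy as np

observed_hp = np.array([46, 75, 93, 126, 230])   # made-up stand-in for the observed horsepower
candidate_hp = np.array([40, 100, 120, 250])     # made-up new values to predict for

# True where a candidate lies between the observed minimum and maximum
in_range = (candidate_hp >= observed_hp.min()) & (candidate_hp <= observed_hp.max())
print(in_range)

# Keep only the candidates that do not require extrapolation
usable = candidate_hp[in_range]
print(usable)
```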

In [None]:
Auto_df['horsepower'].describe()

In [None]:
X_test= np.arange(100,121,5).reshape(-1,1)
X_test

In [None]:
X_poly2_test= poly2_object.transform(X_test) # poly2_object was already fitted on X, so transform() is enough for new data
X_poly2_test

In [None]:
model_poly2.predict(X_poly2_test)
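
As an alternative that is not used in this notebook, scikit-learn's make_pipeline() can chain the polynomial transformation and the regression, so new data only needs to be passed to predict(). The sketch below uses made-up data and hypothetical names (`pipe`, `x_train`, `y_train`, `x_new`) and checks that the pipeline gives the same predictions as the two-step approach.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up training data with a quadratic trend plus noise
rng = np.random.default_rng(1)
x_train = rng.uniform(50, 200, size=60).reshape(-1, 1)
y_train = (55 - 0.45 * x_train.ravel() + 0.0012 * x_train.ravel() ** 2
           + rng.normal(0, 2, size=60))

# Pipeline: polynomial features first, then OLS
pipe = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
pipe.fit(x_train, y_train)

# New values go straight into predict(); the pipeline transforms them internally
x_new = np.arange(100, 121, 5).reshape(-1, 1)
pipe_pred = pipe.predict(x_new)

# Two-step approach for comparison: transform, then fit and predict
poly = PolynomialFeatures(degree=2)
two_step = LinearRegression().fit(poly.fit_transform(x_train), y_train)
two_step_pred = two_step.predict(poly.transform(x_new))

print(pipe_pred)
print(np.allclose(pipe_pred, two_step_pred))
```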

### Let's evaluate the quality of this second-degree polynomial equation

#### R squared

In [None]:
# Obtain R SQUARED for polynomial model

from sklearn.metrics import r2_score

In [None]:
np.round (r2_score(y, model_poly2.predict(X_poly2)), 2)

#### Adjusted R squared

Adj R sq is not included in scikit-learn, but we can create a formula to get it. Computing Adj R sq is not needed now since we are not comparing models, but it could be useful when comparing polynomials of different degrees.

As a reference and reminder, you can check out the formula of adj R sq that we discussed in the R class (or you can find it in a book or online).

__Note__: As an alternative to the formula I am going to use to get Adj R sq, you can use the formula where Adj R sq is obtained directly from R sq (you can search for this formula online).
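
For reference, here is a sketch of that alternative formula, Adj R sq = 1 - (1 - R sq) * (n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors. The helper name `adjusted_r2` and the numbers are made up for this example; the check at the end confirms that both routes give the same value.

```python
import numpy as np

def adjusted_r2(r2, n_obs, n_predictors):
    # Adj R sq = 1 - (1 - R sq) * (n - 1) / (n - p - 1)
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Made-up outcome values and hypothetical predictions from a model with p = 2 predictors
y_true = np.array([3.0, 5.0, 7.5, 9.0, 11.5, 13.0])
y_hat = np.array([3.2, 4.8, 7.1, 9.4, 11.2, 13.3])
p = 2

sse = np.sum((y_true - y_hat) ** 2)            # sum of squared residuals
sst = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1 - sse / sst

# Route 1: directly from R sq
from_r2 = adjusted_r2(r2, y_true.size, p)

# Route 2: from the sums of squares, each divided by its degrees of freedom
from_sums = 1 - (sse / (y_true.size - p - 1)) / (sst / (y_true.size - 1))

print(from_r2, from_sums)
print(np.isclose(from_r2, from_sums))
```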

In [None]:
# Obtain the sum of squared residuals (abbreviated as SSR or SSE)

sse=sum((y -model_poly2.predict(X_poly2))**2)
sse

In [None]:
top_adj_r2=sse/(y.size-2-1)

# The "2" in this formula represents that we have two predictors in the model: X and X squared

In [None]:
bottom_adj_r2= (sum((y-y.mean())**2))/(y.size-1)

In [None]:
# ADJ R SQUARED for polynomial model

np.round (1-(top_adj_r2/bottom_adj_r2), 2)

#### Cross validation to estimate test prediction error

In [None]:
from sklearn.model_selection import cross_val_score

<br>

Some comments about the __cross_val_score()__ method

1) The parameter 'scoring' (SEE next code cell) refers to what metric is used to evaluate the quality of the estimated model. Some options are:

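scoring='r2' (what you effectively get by default for regression models: the default, scoring=None, uses the model's own score() method, which returns R squared)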

scoring='neg_mean_squared_error'

scoring= 'neg_root_mean_squared_error'

2) When we set the parameter 'cv' equal to a number k, the dataset is split into k groups (folds), but the splitting does not occur randomly. That is, the folds are formed by cutting the data into k consecutive parts, in the order in which the rows are stored.

You have the option of using a CV splitter method to shuffle the groups(folds) used by __cross_val_score()__ 
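
A minimal sketch of that option, assuming KFold as the splitter and using made-up data (the names `splitter`, `neg_mse`, and `cv_mse` are just for this example). The predictor is sorted on purpose, to mimic a file whose rows are stored in a meaningful order.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import PolynomialFeatures

# Made-up data, sorted by the predictor
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(50, 200, size=100)).reshape(-1, 1)
y_demo = 55 - 0.45 * x.ravel() + 0.0012 * x.ravel() ** 2 + rng.normal(0, 2, size=100)

X_design = PolynomialFeatures(degree=2).fit_transform(x)

# shuffle=True assigns rows to folds at random; random_state makes the split reproducible
splitter = KFold(n_splits=10, shuffle=True, random_state=0)

neg_mse = cross_val_score(LinearRegression(), X_design, y_demo,
                          scoring='neg_mean_squared_error', cv=splitter)

# Flip the sign to get the usual (positive) mean squared error
cv_mse = -neg_mse.mean()
print(np.round(cv_mse, 2))
```

Passing the splitter through cv replaces the plain number k; everything else in the call stays the same.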

In [None]:
cross_val_score(model_poly2.fit(X_poly2,y), X_poly2, y, scoring= 'neg_mean_squared_error', cv=10)

In [None]:
# Save the previous values in an object and multiply them by -1

cv_values= -1*cross_val_score(model_poly2.fit(X_poly2,y), X_poly2, y, scoring= 'neg_mean_squared_error', cv=10)
cv_values

In [None]:
# Estimated test Mean Squared Error for the polynomial model of second degree based on CV

np.round (cv_values.mean(), 2)

A scatterplot that shows the second degree polynomial curve:

In [None]:
plt.style.use('seaborn-v0_8') # On Matplotlib older than 3.6 the style is named 'seaborn'

plt.scatter(Auto_df['horsepower'], Auto_df['mpg'],c='blue',marker='o')

plt.xlabel("Horsepower")

plt.ylabel("MPG")

plt.axhline(Auto_df['mpg'].mean(),c='red',ls='--')

# To plot the polynomial curve, you are required to create an array of sorted data for the x axis. You cannot use 
# the values of the X variable unless they are sorted. Why? The plot() method from pyplot starts joining 
# the points with a line in the order that they are stored in the X and Y variables. So, if the values are not sorted, 
# you get a mess of lines crossing each other. Try it and you will see !!!

xaxis=np.arange(50,225,5).reshape(-1,1)

# Linear Regression line
plt.plot(xaxis, LinearRegression().fit(X,y).predict(xaxis), c='red', ls='-', linewidth=3)

# Second degree poly curve
plt.plot(xaxis, model_poly2.fit(X_poly2, y).predict(poly2_object.transform(xaxis)), c='orange', ls='-', linewidth=3)

plt.show()
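
The plot above avoids the sorting problem by using an evenly spaced grid for the x axis. If you would rather draw the curve through the observed values, one option (sketched here with made-up numbers) is to sort x and reorder the matching y values with np.argsort() before calling plot().

```python
import numpy as np

x_unsorted = np.array([150.0, 60.0, 200.0, 95.0])          # made-up x values in storage order
y_curve = 0.001 * x_unsorted**2 - 0.4 * x_unsorted + 55    # made-up curve values at those x

# argsort returns the positions that would put x in increasing order
order = np.argsort(x_unsorted)

# Apply the same ordering to x and y so each pair stays together
x_sorted = x_unsorted[order]
y_sorted = y_curve[order]

print(x_sorted)
print(y_sorted)
# plt.plot(x_sorted, y_sorted) would now join the points from left to right
```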

### Fitting a polynomial model of third degree for mpg based on horsepower:

__Estimated mpg__ = b0 + b1 * horsepower + b2 * (horsepower squared) + b3 * (horsepower cubed)

Later, we will compare the second and third degree polynomials.

In [None]:
poly3_object= PolynomialFeatures(degree=3)

In [None]:
# X was already created before!

X_poly3= poly3_object.fit_transform(X)

X_poly3

In [None]:
model_poly3 = LinearRegression(fit_intercept=True)

In [None]:
# y was already created before!

model_poly3.fit(X_poly3, y)

In [None]:
model_poly3.coef_

In [None]:
model_poly3.intercept_

#### R squared

In [None]:
# Obtain R SQUARED for polynomial model of degree 3

np.round (r2_score(y, model_poly3.predict(X_poly3)), 2)

#### Adjusted R squared

In [None]:
# Obtain the sum of squared residuals (abbreviated as SSR or SSE)

sse=sum((y - model_poly3.predict(X_poly3))**2)
sse

In [None]:
top_adj_r2=sse/(y.size-3-1)

# The "3" in this formula represents that we have three predictors in the model: X, X squared, and X cubed.

In [None]:
bottom_adj_r2= (sum((y-y.mean())**2))/(y.size-1)

In [None]:
# ADJ R SQUARED for polynomial model of degree 3

np.round (1-(top_adj_r2/bottom_adj_r2), 2)

<br>

What is this value of Adj R sq telling us about the third degree poly when compared to the second degree poly?

#### Cross validation to estimate test prediction error

In [None]:
cross_val_score(model_poly3.fit(X_poly3,y), X_poly3, y, scoring= 'neg_mean_squared_error', cv=10)

In [None]:
# Save the previous values in an object and multiply them by -1

cv_values= -1*cross_val_score(model_poly3.fit(X_poly3,y), X_poly3, y, scoring= 'neg_mean_squared_error', cv=10)
cv_values

In [None]:
# Estimated test Mean Squared Error for the polynomial model of third degree based on CV

np.round (cv_values.mean(), 2)

<br>

What is this CV score telling us about the third degree poly when compared to the second degree poly?
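
To make this kind of comparison easier to repeat, the CV error can be computed for several degrees in a loop. The sketch below is one possible way to do it, on made-up data whose true trend is quadratic; the pipeline, the shuffled KFold splitter, and the name `cv_mse_by_degree` are assumptions of this example, not part of the cells above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up data with a quadratic trend plus noise
rng = np.random.default_rng(3)
x = rng.uniform(50, 200, size=120).reshape(-1, 1)
y_demo = 55 - 0.45 * x.ravel() + 0.0012 * x.ravel() ** 2 + rng.normal(0, 2, size=120)

# Use the same folds for every degree so the comparison is fair
splitter = KFold(n_splits=10, shuffle=True, random_state=0)

cv_mse_by_degree = {}
for degree in (1, 2, 3):
    pipe = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    scores = cross_val_score(pipe, x, y_demo,
                             scoring='neg_mean_squared_error', cv=splitter)
    cv_mse_by_degree[degree] = -scores.mean()   # flip the sign to get MSE

for degree, mse in cv_mse_by_degree.items():
    print(degree, round(mse, 2))
```

A lower estimated test MSE points to the degree that is expected to predict new observations better.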