### Polynomial Regression using Statsmodels

<br>

In [None]:
# Import useful libraries

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

In [None]:
Auto_df=pd.read_csv('C:\\Users\\jheredi2\\Documents\\PythonDataAnalytics\\1-Datasets\\Auto_ISLR.csv')

### Fitting a polynomial model of second degree for mpg based on horsepower

In [None]:
regression_object1=smf.ols('mpg~horsepower+I(horsepower**2)', data=Auto_df)
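
A note on the I() wrapper used in the formula above: inside a formula string, ** is read as a formula operator rather than as ordinary arithmetic, so the squared term is wrapped in I() to have it computed numerically. The sketch below illustrates the difference on synthetic data; toy_df, x, and y are made-up names for illustration and are not part of the Auto dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (hypothetical, not the Auto dataset): y depends on x and x squared
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({'x': rng.uniform(1, 10, size=50)})
toy_df['y'] = 3 + 2 * toy_df['x'] - 0.5 * toy_df['x'] ** 2 + rng.normal(0, 1, size=50)

# Without I(): ** is interpreted by the formula parser, so no squared column is built
fit_without_I = smf.ols('y ~ x + x**2', data=toy_df).fit()

# With I(): x**2 is evaluated as plain Python arithmetic and becomes its own column
fit_with_I = smf.ols('y ~ x + I(x**2)', data=toy_df).fit()

# Compare which coefficients each model actually estimated
print(list(fit_without_I.params.index))
print(list(fit_with_I.params.index))
```

Comparing the two printed lists should show that only the second model contains a squared term.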

In [None]:
regression_model1= regression_object1.fit()

In [None]:
regression_model1.summary() 

Use the predict() method to predict the value of mpg based on horsepower for the first five training observations 

In [None]:
regression_model1.predict()[0:5]

An alternative (LONGER!) way of getting the same predictions: writing out the equation with the coefficients

In [None]:
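# .iloc picks the coefficients by position (intercept, linear, squared); plain [0] on a labeled Series is deprecated in recent pandas
regression_model1.params.iloc[0] + regression_model1.params.iloc[1]*(Auto_df['horsepower'].values[0:5]) + regression_model1.params.iloc[2]*((Auto_df['horsepower'].values**2)[0:5])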

Use predict() to predict the value of mpg based on the polynomial equation for five NEW test observations.

We need to generate five new values of horsepower. These values will be 100, 105, 110, 115, and 120.

In [None]:
regression_model1.predict({'horsepower':np.arange(100, 121,5)})

### Evaluating the quality of the estimated polynomial equation

R sq and Adj R sq were already reported when the summary() method was applied.
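
Both values can also be read directly from the fitted results object instead of from the summary table. The following is a minimal sketch on synthetic data (toy_df, x, and y are hypothetical names); it also recomputes Adj R sq by hand from R sq as a sanity check.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data, for illustration only
rng = np.random.default_rng(1)
toy_df = pd.DataFrame({'x': rng.uniform(0, 5, size=40)})
toy_df['y'] = 1 + 2 * toy_df['x'] + rng.normal(0, 1, size=40)

toy_fit = smf.ols('y ~ x + I(x**2)', data=toy_df).fit()

# Attributes of the results object
r_sq = toy_fit.rsquared
adj_r_sq = toy_fit.rsquared_adj

# Adj R sq by hand: n observations, p predictors (the intercept is not counted)
n = int(toy_fit.nobs)
p = int(toy_fit.df_model)
adj_by_hand = 1 - (1 - r_sq) * (n - 1) / (n - p - 1)

print(round(r_sq, 3), round(adj_r_sq, 3), round(adj_by_hand, 3))
```

With the notebook's model, the same attributes would be regression_model1.rsquared and regression_model1.rsquared_adj.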

We might still be interested in estimating the test error with cross-validation (CV). That comes next.

__A basic implementation of CV for this problem__

In [None]:
from sklearn.model_selection import KFold

In [None]:
k10fold=KFold(n_splits=10, shuffle=False)

The KFold class has a method called split() that we can use to generate the indexes needed to split
the data into 10 groups.
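
To see what split() produces, here is a small sketch on a toy array of 10 indexes with 5 folds (the names toy_indexes and toy_kfold are illustrative only). Each iteration yields one array of training positions and one array of test positions.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_indexes = np.arange(10)
toy_kfold = KFold(n_splits=5, shuffle=False)

# split() is a generator; list() collects the (train, test) pairs for inspection
folds = list(toy_kfold.split(toy_indexes))

for fold_number, (train_index, test_index) in enumerate(folds):
    print(fold_number, train_index, test_index)
```

Because shuffle=False, the test positions are consecutive blocks taken in order.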

In [None]:
# Get an array with the indexes to use to split the data

indexes= np.arange(len(Auto_df['mpg']))
# The next two commands are to check that the array 'indexes' contains the indexes we want
print(indexes[0:5])
print(indexes[-5:])

Now, we create an empty array to save the Mean Squared Error resulting after each iteration of cross validation.

The last step will be to get the mean of the values stored in this array.

__Note__: In a previous loop we did not create an empty array but an empty list.
Then, we used the append() method to add each new element to the list after each loop iteration.

__As practice__, you can try doing this loop using an empty list instead of an empty array!

In [None]:
cv_scores=np.empty(shape=10)

In [None]:
for i, (train_index, test_index) in enumerate(k10fold.split(indexes)):
    # Select rows by position so the loop does not depend on the data frame's index labels
    train_df = Auto_df.iloc[train_index]
    test_df = Auto_df.iloc[test_index]
    regression_model = smf.ols('mpg~horsepower+I(horsepower**2)', data=train_df).fit()
    predictions = regression_model.predict(test_df)
    # The next line computes the test Mean Squared Error for each iteration
    cv_scores[i] = np.mean((test_df['mpg'] - predictions)**2)

In [None]:
cv_scores

In [None]:
np.round(cv_scores.mean(), 2)

This is exactly the same CV error that we got when we applied scikit-learn!
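
The agreement can be illustrated with a sketch on synthetic data. The scikit-learn side below is one plausible counterpart (PolynomialFeatures plus LinearRegression scored with cross_val_score); it is an assumption and not necessarily the code used in the earlier notebook. The names toy_df, x, and y are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data, for illustration only
rng = np.random.default_rng(2)
toy_df = pd.DataFrame({'x': rng.uniform(1, 10, size=100)})
toy_df['y'] = 5 + 3 * toy_df['x'] - 0.4 * toy_df['x'] ** 2 + rng.normal(0, 1, size=100)

kfold = KFold(n_splits=10, shuffle=False)

# Manual CV loop with statsmodels, collecting the fold errors in a list
manual_mse = []
for train_index, test_index in kfold.split(toy_df):
    fit = smf.ols('y ~ x + I(x**2)', data=toy_df.iloc[train_index]).fit()
    predictions = fit.predict(toy_df.iloc[test_index])
    manual_mse.append(np.mean((toy_df['y'].iloc[test_index] - predictions) ** 2))
manual_cv_error = np.mean(manual_mse)

# Same folds with scikit-learn; the scorer returns negative MSE, so the sign is flipped
pipeline = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
neg_mse = cross_val_score(pipeline, toy_df[['x']], toy_df['y'],
                          cv=kfold, scoring='neg_mean_squared_error')
sklearn_cv_error = -neg_mse.mean()

print(manual_cv_error, sklearn_cv_error)
```

Both approaches fit the same least squares model on the same folds, so the two numbers should agree up to floating point error.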

<br>

__COMBINING PREDICTORS WITH POLYNOMIAL TERMS AND LINEAR TERMS__


I think the __possibility of easily combining__ polynomial terms from one predictor and linear terms from others is one of the advantages of Statsmodels compared to Scikit-learn.

<br>

In the previous notebook we checked that a third degree polynomial based on horsepower does not improve on a second degree
polynomial.

An interesting idea to pursue now is adding a second variable to the second degree polynomial based on horsepower. Let's practice including an additional predictor linearly (= a first degree term only).

What's a good choice to add as a second predictor to the second degree polynomial based on horsepower?

Let's find a list of possible predictors that can be added:

In [None]:
Auto_df.columns

Now, we can create an __array with the column names__ of __POTENTIAL predictors__.

Then, we can loop through this array to choose the best predictor to add to the polynomial based on horsepower.

We are excluding the following columns from the column names array:

'mpg' (Why is it being excluded?)

'horsepower' (Why is it being excluded?)

'name' (because it is a column with the car model name... useless for prediction purposes)

'origin' (because it needs to be fixed and cleaned before it is ready for processing. To avoid spending time on this cleaning, we are going to exclude it)

In [None]:
columns_auto= Auto_df.columns.difference(['mpg','horsepower','name','origin'])
columns_auto

<br>

__Question__: Among these five predictors, what's the best one to add to the polynomial model of second degree based on horsepower?

__Answer__: When added to the poly model based on horsepower, which one of these five variables produces the highest increase in R sq? (Why do I focus on R sq here instead of on Adj R sq?)
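
A hint for the question in parentheses, using the standard definition of Adj R sq for a model with n observations and p predictors:

$$
R^2_{adj} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
$$

Each of the five candidate models has the same number of predictors (horsepower, horsepower squared, and one added variable). Assuming they are also fit to the same n observations, the factor (n - 1)/(n - p - 1) is the same constant for all five, so ranking the models by R sq gives the same order as ranking them by Adj R sq.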

<br>

Let's create an empty, one-column data frame to store the value of R sq for each model.

In [None]:
data_out= pd.DataFrame({'R sq':np.empty(shape=5)}, index=list(columns_auto))
data_out

The following loop tests the model that combines the second degree poly on horsepower and each of the predictors.

At each iteration, the R sq for each model is recorded.

In [None]:
for i in columns_auto:
    regression_object=smf.ols('mpg~horsepower+I(horsepower**2)'+ '+' + i, data=Auto_df)
    model=regression_object.fit()
    data_out.loc[i, 'R sq'] = model.rsquared

In [None]:
data_out

In [None]:
data_out.loc[data_out['R sq']== max(data_out['R sq'])]

The highest R sq is for the model that combines the polynomial of second degree on horsepower together with the linear term of year.

In [None]:
regression_object2=smf.ols('mpg~horsepower+I(horsepower**2)+year', data=Auto_df)

In [None]:
regression_model2=regression_object2.fit()

In [None]:
regression_model2.summary()

<br>
It seems that adding the linear term of year to the poly model based on horsepower produces a valuable increase in Adj R squared.

Adj R squared is 77.4%, compared to 68.6% for the poly model based on horsepower.

<br>
If you want to be more certain that there is practical value in adding year to the poly model, you may apply
CV to find out if the test prediction error shows a reduction.

__DO INDEPENDENTLY IF INTERESTED!__
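
As a possible starting point, the CV loop from earlier can be wrapped in a small function that takes a formula. The sketch below is an assumption about how one might organize it, not the notebook's own solution: cv_mse is a hypothetical helper, and the synthetic columns x1, x2, and y stand in for horsepower, year, and mpg.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.model_selection import KFold


def cv_mse(formula, data, response, n_splits=10):
    """Hypothetical helper: mean test MSE of an OLS formula over K unshuffled folds."""
    kfold = KFold(n_splits=n_splits, shuffle=False)
    fold_mse = []
    for train_index, test_index in kfold.split(data):
        fit = smf.ols(formula, data=data.iloc[train_index]).fit()
        test_data = data.iloc[test_index]
        predictions = fit.predict(test_data)
        fold_mse.append(np.mean((test_data[response] - predictions) ** 2))
    return np.mean(fold_mse)


# Synthetic stand-in data: y depends on x1 (with curvature) and linearly on x2
rng = np.random.default_rng(3)
toy_df = pd.DataFrame({'x1': rng.uniform(1, 10, size=200),
                       'x2': rng.uniform(0, 5, size=200)})
toy_df['y'] = (2 + 3 * toy_df['x1'] - 0.3 * toy_df['x1'] ** 2
               + 4 * toy_df['x2'] + rng.normal(0, 1, size=200))

mse_poly_only = cv_mse('y ~ x1 + I(x1**2)', toy_df, 'y')
mse_poly_plus_x2 = cv_mse('y ~ x1 + I(x1**2) + x2', toy_df, 'y')
print(mse_poly_only, mse_poly_plus_x2)
```

With the Auto data, the two calls would use the formulas 'mpg~horsepower+I(horsepower**2)' and 'mpg~horsepower+I(horsepower**2)+year' with Auto_df and 'mpg'.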

<br>

__LAST QUESTION:__

Does it make sense to include year in the previous equation only with a linear term or with both a linear
and quadratic term?

Estimated mpg = b0 + b1 * horsepower + b2 * (horsepower)^2 + b3 * year

or ...

Estimated mpg = b0 + b1 * horsepower + b2 * (horsepower)^2 + b3 * year + b4 * (year)^2

<br>
This question asks whether "it makes sense...". To judge whether the proposal is even reasonable, you can plot the residuals of the previous equation against the variable in question (year in this case) and check for curvature that a quadratic term could capture.

Do a plot of the residuals versus year

In [None]:
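# The 'seaborn' style was renamed 'seaborn-v0_8' in matplotlib 3.6; the old name no longer works in recent versions
plt.style.use('seaborn-v0_8')

# Residuals = observed mpg minus fitted mpg, plotted against year
plt.scatter(Auto_df['year'], Auto_df['mpg'] - regression_model2.predict(), c='blue', marker='o')
plt.xlabel("Year")
plt.ylabel("Residuals")
plt.axhline(0, c='red', ls='--')  # reference line at zero residual
plt.show()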
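
A numerical complement to the residual plot is to fit the candidate model with the extra squared term and inspect that term's p-value and the change in Adj R sq. The sketch below does this on synthetic data where the curvature is built in on purpose; toy_df, x, and y are hypothetical names, and the result says nothing about the Auto data itself.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data whose true relationship includes a squared term
rng = np.random.default_rng(4)
toy_df = pd.DataFrame({'x': rng.uniform(0, 10, size=150)})
toy_df['y'] = 1 + 0.5 * toy_df['x'] + 0.3 * toy_df['x'] ** 2 + rng.normal(0, 1, size=150)

linear_fit = smf.ols('y ~ x', data=toy_df).fit()
quadratic_fit = smf.ols('y ~ x + I(x**2)', data=toy_df).fit()

# The squared term is the last coefficient in the quadratic model
quadratic_p_value = quadratic_fit.pvalues.iloc[-1]

print(linear_fit.rsquared_adj, quadratic_fit.rsquared_adj, quadratic_p_value)
```

For the notebook's question, the analogous comparison would be between 'mpg~horsepower+I(horsepower**2)+year' and the same formula with '+I(year**2)' added.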