## Codio Activity 8.1: Adding Nonlinear Features

**Estimated time: 60 minutes**

**Total Points: 20 Points**

This activity focuses on building polynomial models with `sklearn`. You will fit a standard first-degree linear regression model, then create a quadratic term similar to the `hp2` feature from video 8.2 and fit a second model that includes it. Using scikit-learn, you will compare the performance of the two models and determine the appropriate model complexity.

## Index:

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)
- [Problem 5](#Problem-5)


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import plotly.express as px

### The Data

This exercise uses a dataset of automobiles that includes their horsepower and fuel economy. Your goal is to build a model that predicts the `mpg` column using the `horsepower` column as the model's input. Below, the dataset is loaded and a scatterplot of `horsepower` vs. `mpg` is displayed.

In [None]:
auto = pd.read_csv('data/auto.csv')
auto

In [None]:
px.scatter(data_frame=auto, x='horsepower', y='mpg')

In [None]:
auto.info()

In [None]:
auto.head()

[Back to top](#Index:) 

## Problem 1

### Regression with `horsepower`

**4 Points**

Below, instantiate and fit an sklearn `LinearRegression` model to predict `mpg` using the `horsepower` column.  Your model will be of the form:

$$\text{mpg} = \beta_0 + \beta_1*\text{horsepower}$$

Assign your fitted model to `first_degree_model` below. Assign the model's mean squared error as a float to the variable `first_degree_mse`.
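
As a point of reference, here is a minimal sketch of the same fit-then-score pattern on a tiny made-up dataset (the `toy` DataFrame and its column names `x` and `y` are assumptions for illustration, not part of the auto data). Note the double brackets in `toy[["x"]]`: scikit-learn expects the features as a two-dimensional object, so a one-column DataFrame is passed rather than a Series.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Made-up data that follows y = 1 + 2x exactly (illustration only)
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0],
                    "y": [3.0, 5.0, 7.0, 9.0]})

X_toy = toy[["x"]]   # double brackets -> one-column DataFrame (2D)
y_toy = toy["y"]     # single brackets -> Series (1D target)

# Instantiate and fit in one step, then score on the same data
toy_model = LinearRegression().fit(X_toy, y_toy)
toy_mse = float(mean_squared_error(y_toy, toy_model.predict(X_toy)))

print(toy_model.intercept_, toy_model.coef_, toy_mse)
```

Because the toy data lies exactly on a line, the fitted intercept and slope should come out very close to 1 and 2, with a mean squared error near zero. Real data such as the auto dataset will leave a nonzero error.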

In [None]:
### GRADED

X = ''
y = ''
first_degree_model = ''
first_degree_mse = ''

# YOUR CODE HERE
#raise NotImplementedError()
X = auto[['horsepower']]
y = auto['mpg']
first_degree_model = LinearRegression().fit(X, y)
first_degree_mse = float(mean_squared_error(y, first_degree_model.predict(X)))


# Answer check
print(type(first_degree_model))
print(first_degree_model.coef_)
print(first_degree_mse)

[Back to top](#Index:) 

## Problem 2

### Creating quadratic feature

**4 Points**

To build a second-degree (quadratic) model, you will first add a new column to the data by squaring the `horsepower` column. Do so below, naming the new column `hp2`.
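
The column arithmetic involved can be sketched on a small made-up DataFrame (the name `toy_sq` and its columns `x` and `x2` are hypothetical). The `**` operator works elementwise on a pandas Series, so no loop is needed.

```python
import pandas as pd

# Made-up values for illustration only
toy_sq = pd.DataFrame({"x": [1.0, 2.0, 3.0]})

# Elementwise squaring; assigning to a new key adds a column
toy_sq["x2"] = toy_sq["x"] ** 2

print(toy_sq)
```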

In [None]:
### GRADED

auto['hp2'] = ''

# YOUR CODE HERE
#raise NotImplementedError()
auto['hp2'] = auto['horsepower'] ** 2

# Answer check
print(auto.shape)
print(auto.columns)

[Back to top](#Index:) 

## Problem 3

### Building a quadratic model

**4 Points**

Using both the `horsepower` and `hp2` features, fit a `LinearRegression` model with `mpg` as the target. Create the features as a two-column DataFrame in the order `horsepower, hp2`. Assign your fitted model to the variable `quadratic_model` below, and the model's mean squared error as a float to `quad_mse`. Note that your model will be of the form:

$$\text{mpg} = \beta_0 + \beta_1*\text{horsepower} + \beta_2*\text{hp2}$$
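
To see why a *linear* regression can fit a curve, it may help to look at a sketch on made-up data generated from a known quadratic, $y = 1 + 2x + 3x^2$ (the names `toy_q`, `x`, `x2`, and `y` are assumptions for illustration). The model is still linear in its coefficients; the squared column is simply treated as a second feature.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up, noise-free data from y = 1 + 2x + 3x^2 (illustration only)
toy_q = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]})
toy_q["x2"] = toy_q["x"] ** 2
toy_q["y"] = 1 + 2 * toy_q["x"] + 3 * toy_q["x2"]

# Column order matters: coef_ is reported in the same order as the features
toy_quad_model = LinearRegression().fit(toy_q[["x", "x2"]], toy_q["y"])

print(toy_quad_model.intercept_)  # estimate of beta_0
print(toy_quad_model.coef_)       # estimates of beta_1 (x) and beta_2 (x2)
```

With noise-free data, the fitted intercept and coefficients should land very close to 1, 2, and 3. This is also why the activity asks for the features in the order `horsepower, hp2`: the entries of `coef_` line up with that order.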

In [None]:
### GRADED

X = ''
y = ''
quadratic_model = ''
quad_mse = ''

# YOUR CODE HERE
#raise NotImplementedError()
X = auto[['horsepower', 'hp2']]
y = auto['mpg']
quadratic_model = LinearRegression().fit(X, y)
quad_mse = float(mean_squared_error(y, quadratic_model.predict(X)))

# Answer check
print(quadratic_model.coef_)
print(quadratic_model.intercept_)
print(quad_mse)

[Back to top](#Index:) 

## Problem 4

### Plotting Predictions

**4 Points**

Because our data is not ordered by horsepower, a line plot of `.predict(X)` would zigzag back and forth instead of tracing a smooth curve. To plot the predictions of your quadratic model correctly, sort the two-feature DataFrame by the `horsepower` column from least to greatest and assign the result as a DataFrame to `x_for_pred` below. Note that the resulting DataFrame should start with:

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>horsepower</th>      <th>hp2</th>    </tr>  </thead>  <tbody>    <tr>      <th>19</th>      <td>46.0</td>      <td>2116.0</td>    </tr>    <tr>      <th>101</th>      <td>46.0</td>      <td>2116.0</td>    </tr>    <tr>      <th>324</th>      <td>48.0</td>      <td>2304.0</td>    </tr>    <tr>      <th>323</th>      <td>48.0</td>      <td>2304.0</td>    </tr>    <tr>      <th>242</th>      <td>48.0</td>      <td>2304.0</td>    </tr>  </tbody></table>
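
The sorting step can be sketched on a three-row made-up DataFrame (the name `toy_hp` and its values are hypothetical). `sort_values` returns a new DataFrame and keeps each row's original index label, which is why the expected output above shows index values such as 19 and 101 rather than 0, 1, 2.

```python
import pandas as pd

# Made-up values for illustration only
toy_hp = pd.DataFrame({"horsepower": [130.0, 46.0, 95.0]})
toy_hp["hp2"] = toy_hp["horsepower"] ** 2

# Ascending order is the default; the original DataFrame is left unchanged
toy_sorted = toy_hp.sort_values(by="horsepower")

print(toy_sorted)
```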

In [None]:
### GRADED

x_for_pred = ''

# YOUR CODE HERE
#raise NotImplementedError()
x_for_pred = X.sort_values(by='horsepower')

# Answer check
print(type(x_for_pred))
x_for_pred.head()

[Back to top](#Index:) 

## Problem 5

### Comparing the model performance

**4 Points**

Reflect on the mean squared error of the two models.  Which model more closely approximated the data -- linear or quadratic?  Assign your answer as a string to `best_model` below (`linear` or `quadratic`).  
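
As an illustration of this kind of comparison, the sketch below fits both model forms to synthetic curved data and prints the two training errors (everything here, including the `demo` DataFrame, its columns, and the generating coefficients, is made up for illustration and says nothing about the auto data). One fact worth keeping in mind: the first-degree model is the special case of the quadratic model with $\beta_2 = 0$, so on the data used for fitting, the least-squares quadratic fit cannot have a higher mean squared error than the linear fit.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic curved data with noise (illustration only)
rng = np.random.default_rng(42)
demo = pd.DataFrame({"x": np.linspace(0, 10, 50)})
demo["x2"] = demo["x"] ** 2
demo["y"] = 5 - 2 * demo["x"] + 0.3 * demo["x2"] + rng.normal(0, 0.5, size=50)

# Fit a first-degree and a second-degree model to the same target
linear_fit = LinearRegression().fit(demo[["x"]], demo["y"])
quad_fit = LinearRegression().fit(demo[["x", "x2"]], demo["y"])

# Training error of each model
demo_linear_mse = float(mean_squared_error(demo["y"], linear_fit.predict(demo[["x"]])))
demo_quad_mse = float(mean_squared_error(demo["y"], quad_fit.predict(demo[["x", "x2"]])))

print("linear MSE:   ", demo_linear_mse)
print("quadratic MSE:", demo_quad_mse)
```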

In [None]:
### GRADED

best_model = ''

# YOUR CODE HERE
#raise NotImplementedError()
best_model = 'quadratic'

# Answer check
print(best_model)

#### Visualization and Summary

As an ungraded exercise, use `plotly` or `matplotlib` to create a scatterplot of the data alongside the quadratic model as a line plot. 
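
One possible approach with `matplotlib` is sketched below. It runs on a synthetic stand-in for the auto data so that it is self-contained; the `demo_auto` DataFrame and the numbers used to generate it are assumptions for illustration. In the notebook, you would use `auto`, `quadratic_model`, and `x_for_pred` from the problems above instead.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the auto data (made-up values, illustration only)
rng = np.random.default_rng(0)
demo_auto = pd.DataFrame({"horsepower": rng.uniform(46, 230, size=100)})
demo_auto["hp2"] = demo_auto["horsepower"] ** 2
demo_auto["mpg"] = (60 - 0.5 * demo_auto["horsepower"]
                    + 0.0013 * demo_auto["hp2"]
                    + rng.normal(0, 2, size=100))

features = ["horsepower", "hp2"]
demo_model = LinearRegression().fit(demo_auto[features], demo_auto["mpg"])

# Sort the features so the line is drawn from left to right
demo_x_sorted = demo_auto[features].sort_values(by="horsepower")

fig, ax = plt.subplots()
ax.scatter(demo_auto["horsepower"], demo_auto["mpg"], alpha=0.5, label="data")
ax.plot(demo_x_sorted["horsepower"], demo_model.predict(demo_x_sorted),
        color="red", label="quadratic model")
ax.set_xlabel("horsepower")
ax.set_ylabel("mpg")
ax.legend()
```

In a notebook the figure is normally displayed when the cell finishes; in a plain script, a call to `plt.show()` would be needed at the end.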