### Codio Activity 8.5: Comparing Complexity and Variance

**Expected Time: 60 Minutes**

**Total Points: 35**

In this activity, you will explore how model complexity affects the variance of a model's predictions. Continuing with the automotive data, you will build models on a sample of 10 vehicles. You will then compare the models' errors against the entire dataset and investigate how the variance of those errors changes with model complexity.

#### Index:

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)


In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
import plotly.express as px

In [2]:
auto = pd.read_csv('data/auto.csv')

In [3]:
auto.head()

    mpg  cylinders  displacement  horsepower  weight  acceleration  year  origin                       name
0  18.0          8         307.0       130.0    3504          12.0    70       1  chevrolet chevelle malibu
1  15.0          8         350.0       165.0    3693          11.5    70       1          buick skylark 320
2  18.0          8         318.0       150.0    3436          11.0    70       1         plymouth satellite
3  16.0          8         304.0       150.0    3433          12.0    70       1              amc rebel sst
4  17.0          8         302.0       140.0    3449          10.5    70       1                ford torino


### The Sample

Below, a sample of ten vehicles is extracted from the data. This sample forms our **training** data and is split into `X_train` and `y_train`. Use this smaller dataset to build your models, then explore their performance using the entire dataset.

In [4]:
X = auto.loc[:,['horsepower']]
y = auto['mpg']
sample = auto.sample(10, random_state = 22)
X_train = sample.loc[:, ['horsepower']]
y_train = sample['mpg']

In [5]:
X_train

     horsepower
280        88.0
57         80.0
46        100.0
223       110.0
303        90.0
73        140.0
98        100.0
250       105.0
254       100.0
337       110.0


In [6]:
y_train

280    22.3
57     25.0
46     19.0
223    17.5
303    28.4
73     13.0
98     18.0
250    19.2
254    20.5
337    23.5
Name: mpg, dtype: float64

In [7]:
X.shape

(392, 1)

[Back to top](#Index:) 

### Problem 1

#### Iterate on Models

**20 Points**

Complete the code below according to the following instructions:

- Assign the values in the `horsepower` column of `auto` to the variable `X` below.
- Assign the values in the `mpg` column of `auto` to the variable `y` below.

Use a `for` loop to loop over the values from one to ten. For each iteration `i`:

- Use `Pipeline` to create a pipeline object. Inside the pipeline, define a tuple whose first element is the string identifier `'quad_features'` and whose second element is an instance of `PolynomialFeatures` of degree `i` with `include_bias = False`. Then define a second tuple whose first element is the string identifier `'quad_model'` and whose second element is an instance of `LinearRegression`. Assign the pipeline object to the variable `pipe`.
- Use the `fit` function on `pipe` to train your model on `X_train` and `y_train`.
- Use the `predict` function on `pipe` to make predictions for `X_train`. Assign the result to `preds`.
- Store `preds` in the `model_predictions` dictionary under the key for degree `i` (for example, `degree_1` when `i` is 1).
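
The steps above can be tried out on a small example first. The following is a minimal sketch on made-up data, not the graded solution: the names `X_toy`, `y_toy`, `toy_pipe`, `toy_predictions`, and `toy_mse`, as well as the step identifiers, are hypothetical and chosen so they do not clash with the variables used in this activity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical toy data (not the auto dataset): y = 1 + 2x + 3x^2 exactly
X_toy = np.arange(6, dtype=float).reshape(-1, 1)
y_toy = 1 + 2 * X_toy.ravel() + 3 * X_toy.ravel() ** 2

toy_predictions = {}
for degree in (1, 2):
    # Each step is a (name, estimator) tuple; the names here are illustrative
    toy_pipe = Pipeline([
        ('poly_features', PolynomialFeatures(degree=degree, include_bias=False)),
        ('linear_model', LinearRegression()),
    ])
    toy_pipe.fit(X_toy, y_toy)  # fit returns the fitted pipeline, not predictions
    toy_predictions[f'degree_{degree}'] = toy_pipe.predict(X_toy)

# Mean squared error on the data each model was fit to
toy_mse = {
    name: float(np.mean((preds - y_toy) ** 2))
    for name, preds in toy_predictions.items()
}
```

Because the toy target is an exact quadratic, the degree-2 pipeline should reproduce it almost perfectly on the data it was fit to, while the degree-1 pipeline cannot.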

In [10]:
### GRADED

### YOUR SOLUTION HERE
model_predictions = {f'degree_{i}': None for i in range(1, 11)}

print("Starting Dictionary of Predictions\n", model_predictions)
#for 1, 2, 3, ..., 10

    #create pipeline
    
    #fit pipeline on training data
    
    #make predictions on the training data
    
    #assign to model_predictions
    


### BEGIN SOLUTION
for i in range(1, 11):
    pipe = Pipeline([('quad_features', PolynomialFeatures(degree = i, include_bias = False)), ('quad_model', LinearRegression())])
    pipe.fit(X_train, y_train)
    preds = pipe.predict(X_train)
    model_predictions[f'degree_{i}'] = preds
### END SOLUTION

# Answer check
model_predictions['degree_1'][:10]

Starting Dictionary of Predictions
 {'degree_1': None, 'degree_2': None, 'degree_3': None, 'degree_4': None, 'degree_5': None, 'degree_6': None, 'degree_7': None, 'degree_8': None, 'degree_9': None, 'degree_10': None}


array([23.60120856, 25.25782873, 21.1162783 , 19.04550308, 23.18705352,
       12.83317743, 21.1162783 , 20.08089069, 21.1162783 , 19.04550308])

In [11]:
### BEGIN HIDDEN TESTS
auto_ = pd.read_csv('data/auto.csv')

sample_ = auto_.sample(10, random_state = 22)
X_train_ = sample_.loc[:, ['horsepower']]
y_train_ = sample_['mpg']
model_predictions_ = {f'degree_{i}': None for i in range(1, 11)}
for i in range(1, 11):
    pipe_ = Pipeline([('quad_features', PolynomialFeatures(degree = i, include_bias = False)), ('quad_model', LinearRegression())])
    pipe_.fit(X_train_, y_train_)
    preds_ = pipe_.predict(X_train_)
    model_predictions_[f'degree_{i}'] = preds_
    
#
#
#
#
assert type(model_predictions) == type(model_predictions_)
for i,j in zip(model_predictions.values(), model_predictions_.values()):
    np.testing.assert_array_equal(i, j)
### END HIDDEN TESTS

[Back to top](#Index:) 

### Problem 2

#### DataFrame of Predictions

**5 Points**

Use the `model_predictions` dictionary to create a DataFrame of the ten models' predictions. Assign your solution to `pred_df` below as a DataFrame.
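
As a reminder of how pandas treats a dictionary of equal-length arrays, here is a small sketch with made-up values. The names `toy_preds` and `toy_pred_df` are hypothetical and are not part of the activity.

```python
import numpy as np
import pandas as pd

# Hypothetical predictions from two models on three observations
toy_preds = {
    'degree_1': np.array([1.0, 2.0, 3.0]),
    'degree_2': np.array([1.5, 2.5, 3.5]),
}

# Each dictionary key becomes a column; rows receive a default 0..n-1 index
toy_pred_df = pd.DataFrame(toy_preds)
```

Note that the resulting index is a fresh integer range and carries no information about which rows of the original data the predictions came from.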

In [12]:
### GRADED

### YOUR SOLUTION HERE
pred_df = ''
    


### BEGIN SOLUTION
pred_df = pd.DataFrame(model_predictions)
### END SOLUTION

# Answer check
print(type(pred_df))
print(pred_df.head())

<class 'pandas.core.frame.DataFrame'>
    degree_1   degree_2   degree_3   degree_4   degree_5   degree_6  \
0  23.601209  23.730040  23.517217  25.640822  24.918018  25.053190   
1  25.257829  25.669836  26.057265  24.755267  24.864097  24.842677   
2  21.116278  20.981922  20.820752  19.496913  19.845529  19.809102   
3  19.045503  18.839933  19.152249  20.457650  20.746906  20.716476   
4  23.187054  23.258556  22.988407  24.670613  25.141226  25.030400   

    degree_7   degree_8   degree_9  degree_10  
0  25.172014  25.269563  25.350122  25.415929  
1  24.826836  24.807190  24.787023  24.766288  
2  19.770343  19.728814  19.686741  19.645249  
3  20.686792  20.660393  20.636749  20.615915  
4  24.952850  24.899247  24.867972  24.855330  


In [13]:
### BEGIN HIDDEN TESTS
pred_df_ = pd.DataFrame(model_predictions_)
#
#
#
assert type(pred_df_) == type(pred_df)
pd.testing.assert_frame_equal(pred_df, pred_df_)
### END HIDDEN TESTS

[Back to top](#Index:) 

### Problem 3

#### DataFrame of Errors

**5 Points**

Now, determine the error for each model and create a DataFrame of these errors. One way to do this is to use your prediction DataFrame's `.subtract` method to subtract `y` from each column. Assign the DataFrame of errors to `error_df` below.
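
When a Series is subtracted with `axis = 0`, pandas pairs rows by index label rather than by position. The sketch below uses made-up values to illustrate this behavior; `toy_pred_df`, `toy_y`, and `toy_error_df` are hypothetical names, not part of the activity.

```python
import pandas as pd

# Hypothetical predictions with a default 0..2 index
toy_pred_df = pd.DataFrame({
    'degree_1': [10.0, 20.0, 30.0],
    'degree_2': [11.0, 21.0, 31.0],
})

# Hypothetical targets whose index labels only partly overlap (0 and 1 match)
toy_y = pd.Series([9.0, 18.0, 33.0], index=[0, 1, 5])

# axis=0 matches the Series index against the DataFrame's row labels;
# labels found on only one side produce rows of NaN
toy_error_df = toy_pred_df.subtract(toy_y, axis=0)
```

Here the result has four rows (labels 0, 1, 2, and 5), and only labels 0 and 1 hold numeric errors. It is therefore worth checking that the two indexes refer to the same observations before interpreting the differences.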

In [14]:
### GRADED

### YOUR SOLUTION HERE
error_df = ''
    


### BEGIN SOLUTION
error_df = pred_df.subtract(y, axis = 0)
### END SOLUTION

# Answer check
print(type(error_df))
print(error_df.head())

<class 'pandas.core.frame.DataFrame'>
    degree_1   degree_2   degree_3  degree_4  degree_5  degree_6  degree_7  \
0   5.601209   5.730040   5.517217  7.640822  6.918018  7.053190  7.172014   
1  10.257829  10.669836  11.057265  9.755267  9.864097  9.842677  9.826836   
2   3.116278   2.981922   2.820752  1.496913  1.845529  1.809102  1.770343   
3   3.045503   2.839933   3.152249  4.457650  4.746906  4.716476  4.686792   
4   6.187054   6.258556   5.988407  7.670613  8.141226  8.030400  7.952850   

   degree_8  degree_9  degree_10  
0  7.269563  7.350122   7.415929  
1  9.807190  9.787023   9.766288  
2  1.728814  1.686741   1.645249  
3  4.660393  4.636749   4.615915  
4  7.899247  7.867972   7.855330  


In [15]:
### BEGIN HIDDEN TESTS
error_df_ = pred_df_.subtract(y, axis = 0)
#
#
#
assert type(error_df_) == type(error_df)
pd.testing.assert_frame_equal(error_df, error_df_)
### END HIDDEN TESTS

[Back to top](#Index:) 

### Problem 4

#### Mean and Variance of Model Errors

**5 Points**


Using the DataFrame of errors, examine the mean and variance of each model's error. Which model degree has the highest variance? Assign your response as an integer to `highest_var_degree` below.
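
One possible way to inspect these statistics is with the DataFrame's `.mean`, `.var`, and `.idxmax` methods. The sketch below uses made-up errors, not the activity's results, and every `toy_` name is hypothetical.

```python
import pandas as pd

# Hypothetical errors for two models (not the activity's results)
toy_error_df = pd.DataFrame({
    'degree_1': [1.0, -1.0, 1.0, -1.0],
    'degree_2': [3.0, -3.0, 3.0, -3.0],
})

toy_means = toy_error_df.mean()   # per-column mean error
toy_vars = toy_error_df.var()     # per-column sample variance (ddof=1 by default)
toy_top = toy_vars.idxmax()       # column label with the largest variance

# Recover the integer degree from a label such as 'degree_2'
toy_top_degree = int(toy_top.split('_')[1])
```

In this toy case both columns have a mean error of zero, yet the second column's errors are spread three times as widely, so its variance is larger.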

In [16]:
### GRADED

### YOUR SOLUTION HERE
highest_var_degree = ''
    


### BEGIN SOLUTION
highest_var_degree = 10
### END SOLUTION

# Answer check
print(type(highest_var_degree))
print(highest_var_degree)

<class 'int'>
10


In [18]:
### BEGIN HIDDEN TESTS
highest_var_degree_ = 10
#
#
#
assert highest_var_degree == highest_var_degree_
### END HIDDEN TESTS