In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

In [None]:
cars = pd.read_csv('../data/fuel_economy/cars_2010.csv')

In [None]:
cars.head()

In [None]:
cars.columns

Recall from earlier that variable valve timing by itself had very little predictive power. Now, let's see whether it becomes more useful when we consider it in conjunction with other predictors. First, let's plot engine displacement against fuel economy, coloring the points by whether or not a car has variable valve timing.

In [None]:
fig, ax = plt.subplots(figsize = (10,6))

sns.regplot(data = cars[cars['VarValveTiming'] == 0], x = 'EngDispl', y = 'FE', color = 'red', label = 'VVT - No')
sns.regplot(data = cars[cars['VarValveTiming'] == 1], x = 'EngDispl', y = 'FE', color = 'blue', label = 'VVT - Yes')
plt.legend();

**What do you notice?**

In [None]:
X = cars[['EngDispl', 'VarValveTiming']]
y = cars['FE']

First, we'll look just at EngDispl.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X[['EngDispl']], y, random_state = 321)

linreg = LinearRegression().fit(X_train, y_train)

np.sqrt(mean_squared_error(y_test, linreg.predict(X_test)))

In [None]:
linreg.intercept_

In [None]:
linreg.coef_

**How is the model making predictions?**

Let's see if we do better by using a degree two polynomial.

In [None]:
pipe = Pipeline(steps = [
    ('poly', PolynomialFeatures(degree = 2, include_bias = False)),
    ('linreg', LinearRegression())
])

pipe.fit(X_train, y_train)

np.sqrt(mean_squared_error(y_test, pipe.predict(X_test)))

In [None]:
pipe[1].intercept_

In [None]:
pipe[1].coef_

**How is the model making predictions?**
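One way to check your answer is to rebuild a prediction by hand from the fitted intercept and coefficients. The sketch below uses synthetic data as a stand-in for `EngDispl` and `FE` (the numbers are made up, and `demo_pipe` is a hypothetical name), so only the mechanics carry over to the cars data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for EngDispl (x) and FE (y); not the real cars data.
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 8.0, size=200)
y = 60 - 10 * x + 0.7 * x**2 + rng.normal(0, 1, size=200)

demo_pipe = Pipeline(steps=[
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('linreg', LinearRegression())
])
demo_pipe.fit(x.reshape(-1, 1), y)

# With one input and degree 2, the expanded columns are x and x**2, in that order.
b0 = demo_pipe[1].intercept_
b1, b2 = demo_pipe[1].coef_

# Rebuild the prediction for a hypothetical 3.0 L engine by hand.
x_new = 3.0
by_hand = b0 + b1 * x_new + b2 * x_new**2
from_pipe = demo_pipe.predict(np.array([[x_new]]))[0]
print(by_hand, from_pipe)
```

The two printed values should agree, which shows that the pipeline is still a linear model; it is just linear in `x` and `x**2` rather than in `x` alone.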

Let's add in the VarValveTiming feature and see if it helps.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 321)

linreg = LinearRegression().fit(X_train, y_train)
np.sqrt(mean_squared_error(y_test, linreg.predict(X_test)))

**Was adding that feature useful?**

In [None]:
linreg.intercept_

In [None]:
linreg.coef_

**How is the model making predictions?**

Now, we can add interactions between our two features.

In [None]:
pipe = Pipeline(steps = [
    ('interact', PolynomialFeatures(degree = 2, include_bias = False, interaction_only=True)),
    ('linreg', LinearRegression())
])

pipe.fit(X_train, y_train)

np.sqrt(mean_squared_error(y_test, pipe.predict(X_test)))

In [None]:
pipe[1].intercept_

In [None]:
pipe[1].coef_

To see which coefficient is which, we can use the `get_feature_names_out` method of our PolynomialFeatures (in scikit-learn versions before 1.0, this method was called `get_feature_names`).

In [None]:
pf = pipe[0]
pf.get_feature_names_out()

**How is the model making predictions?**

Next, let's include the squared terms, too.

In [None]:
pipe = Pipeline(steps = [
    ('poly', PolynomialFeatures(degree = 2, include_bias = False)),
    ('linreg', LinearRegression())
])

pipe.fit(X_train, y_train)

np.sqrt(mean_squared_error(y_test, pipe.predict(X_test)))

In [None]:
pipe[1].intercept_

In [None]:
pipe[1].coef_

In [None]:
pf = pipe[0]

pf.get_feature_names_out()

**How is the model making predictions?**

Finally, let's see the effect of regularization.

In [None]:
from sklearn.linear_model import RidgeCV, LassoCV

In [None]:
pipe = Pipeline(steps = [
    ('poly', PolynomialFeatures(degree = 2, include_bias = False)),
    ('ridge', RidgeCV(normalize = True))
])

pipe.fit(X_train, y_train)

np.sqrt(mean_squared_error(y_test, pipe.predict(X_test)))

In [None]:
pipe[1].intercept_

In [None]:
pipe[1].coef_

In [None]:
pipe = Pipeline(steps = [
    ('poly', PolynomialFeatures(degree = 2, include_bias = False)),
    ('lasso', LassoCV(normalize = True))
])

pipe.fit(X_train, y_train)

np.sqrt(mean_squared_error(y_test, pipe.predict(X_test)))

In [None]:
pipe[1].intercept_

In [None]:
pipe[1].coef_

Based on this analysis, it looks like VarValveTiming does not add much predictive power, either by itself or in combination with EngDispl.

**But**, what if the model we have is missing out on other possible interactions? In this notebook, you'll see a way to build a model that considers all features and their combinations by making use of penalized regression. Let's go back to all of the features and see how good a predictive model we can make.

In [None]:
X = cars.drop(columns = 'FE')
y = cars['FE']

**First, create dummy variables out of your categorical predictors.**
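If you have not used it before, `pd.get_dummies` is one common way to do this. The sketch below runs on a toy frame with hypothetical column names (`displacement` and `drive` are placeholders, not the real cars columns), and `drop_first=True` is one possible choice rather than a requirement.

```python
import pandas as pd

# Toy frame with hypothetical column names; not the real cars columns.
toy = pd.DataFrame({
    'displacement': [2.0, 3.5, 5.7],
    'drive': ['FWD', 'AWD', 'RWD'],
})

# drop_first=True drops one level per categorical column, which avoids
# dummy columns that are perfectly collinear with the intercept.
toy_dummies = pd.get_dummies(toy, columns=['drive'], drop_first=True)
print(list(toy_dummies.columns))
```

Numeric columns pass through unchanged, and each remaining category level becomes its own 0/1 column.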

In [None]:
# Your code here.

**Next, perform a train/test split. Use random_state = 321.**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 321)

**Now, fit a linear regression model. How well does it do?**

In [None]:
# Your code here.

**Question:** Do any of your features have an excessively low number of observations?

In [None]:
# Your code here.

If you have a very low number of observations for a particular category, it can become a problem in linear regression models. We can add in a VarianceThreshold in order to filter these features out.

Check out the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html) and make sure that you understand what the pipeline below is doing.
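To build some intuition first, here is a minimal sketch on a made-up dummy-coded matrix. It relies on the fact that a 0/1 column with a proportion p of ones has variance p * (1 - p), so a threshold of 0.01 removes dummy columns where roughly 1% or fewer of the rows are ones.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy dummy-coded matrix with 200 rows (made-up data):
#   column 0: a common category (half of the rows)
#   column 1: a rare category (1 row out of 200)
common = np.tile([0, 1], 100)
rare = np.zeros(200)
rare[0] = 1
X_toy = np.column_stack([common, rare])

# Column variances: 0.25 for the common column, about 0.005 for the rare one.
print(X_toy.var(axis=0))

vt = VarianceThreshold(threshold=0.01)
X_kept = vt.fit_transform(X_toy)
print(vt.get_support())  # boolean mask of the columns that survive
print(X_kept.shape)
```

Only the common column survives here. Because the threshold is on variance, it also depends on the scale of a column, so it behaves most predictably on 0/1 dummy columns.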

In [None]:
from sklearn.feature_selection import VarianceThreshold

In [None]:
pipe = Pipeline(steps=[
    # Add the argument here that will remove any features with variance < 0.01.
    ('vt', VarianceThreshold()),
    ('linreg', LinearRegression())
])

Fit this pipeline to the data. Does this help your model?

In [None]:
# Your code here

Take a look at the distribution of your target and predictors. If you want to try transforming any variables, go ahead and do that now and see if you get any improvement.

In [None]:
# Your code here.

Now, let's experiment with adding some interaction terms. We'll do that using PolynomialFeatures.

In [None]:
from sklearn.preprocessing import PolynomialFeatures

**Question:** How many predictors does your dataset have? If you create degree two polynomial features, how many predictors will you have?

*Your answer here.*
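If you want to check your arithmetic, the counts follow from a standard combinatorial formula: a full polynomial expansion of degree d in n features has C(n + d, d) terms, including the constant. The helper names below are hypothetical and written just for this sketch.

```python
from math import comb

def n_poly_features(n_features, degree=2, include_bias=False):
    """Number of columns produced by a full polynomial expansion."""
    total = comb(n_features + degree, degree)  # this count includes the constant term
    return total if include_bias else total - 1

def n_interaction_features(n_features):
    """Degree 2 with interaction_only=True and no bias: originals plus pairwise products."""
    return n_features + comb(n_features, 2)

print(n_poly_features(2))         # 5: x1, x2, x1^2, x1*x2, x2^2
print(n_interaction_features(2))  # 3: x1, x2, x1*x2
print(n_poly_features(50))        # 1325
```

The count grows quadratically with the number of predictors, which matters once the dummy variables have been created.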

Fit a linear regression model, including degree 2 polynomial features.

How well does it do on the training set?
How does it do on the test set?

In [None]:
# Feel free to change this if you want to try out any other transformations on your features.

pipe = Pipeline(steps = [
    # ADD THE STEP FOR POLYNOMIAL FEATURES HERE, AS A ('name', transformer) TUPLE
    ('linreg', LinearRegression())
])

**Why do you think your model did such a terrible job?**

*Your answer here*

**Perhaps we could use either LASSO or Ridge Regression and get a better performing model.**

In [None]:
from sklearn.linear_model import LassoCV, RidgeCV

**Now, fit a lasso model using degree 2 polynomial features. How well does it do?**

**Make sure that your predictors are either standardized or normalized prior to fitting your LASSO model.**

**Note:** You might get a ConvergenceWarning. **Check the docstring** for LassoCV and see if you can change an argument so that you no longer have that issue.

**Note #2:** This might take longer than the other models that we've used so far to fit.
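A version note that may help here: the `normalize` argument of LassoCV and RidgeCV was deprecated in scikit-learn 1.0 and removed in 1.2. On newer versions, one alternative is to put a StandardScaler step in the pipeline, as sketched below on synthetic data. This is not numerically identical to `normalize = True`, so treat it as a substitute approach rather than an exact replacement.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data: 5 predictors, only two of which matter (made-up values).
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(200, 5))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(0, 0.5, size=200)

demo_pipe = Pipeline(steps=[
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('scale', StandardScaler()),        # puts every expanded column on the same scale
    ('lasso', LassoCV(max_iter=10000))  # a larger max_iter gives the solver more room to converge
])
demo_pipe.fit(X_demo, y_demo)

# With an extra step in the pipeline, the regressor is the last step, not pipe[1].
coef = demo_pipe[-1].coef_
print(coef.shape)
print(demo_pipe[-1].alpha_)  # penalty strength chosen by cross-validation
```

Scaling after the polynomial expansion means the squared and interaction columns are standardized as well, so the penalty treats all of them comparably.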

In [None]:
pipe = Pipeline(steps = [
    ('poly', PolynomialFeatures(degree = 2, include_bias = False)),
    ('lasso', LassoCV(normalize = True))
])

**For your fit model, how many coefficients end up as 0? What percentage of coefficients are 0?**
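The counting itself takes only a couple of lines with NumPy. The sketch below uses a hypothetical coefficient vector in place of your fitted model's `coef_` attribute.

```python
import numpy as np

# Hypothetical coefficient vector, standing in for the fitted lasso's coef_ attribute.
coef = np.array([1.8, 0.0, 0.0, -0.4, 0.0, 0.0, 0.0, 2.1])

n_zero = int(np.sum(coef == 0))       # number of coefficients that are exactly zero
pct_zero = 100 * np.mean(coef == 0)   # share of coefficients that are exactly zero
print(n_zero, pct_zero)               # 5 62.5
```

Comparing to exactly 0 is reasonable for lasso, since its solver sets dropped coefficients to exactly zero rather than to small nonzero values.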

In [None]:
# Your code here.

Recall that we started out with some very low-frequency categories. Once we multiply those together, we will end up with even lower-frequency features. Let's see if applying a variance threshold, which throws out any features that have an excessively small number of observations, helps the situation.

In [None]:
pipe = Pipeline(steps = [
    ('poly', PolynomialFeatures(degree = 2, include_bias = False)),
    ('vt', VarianceThreshold(threshold=0.01)),
    ('lasso', LassoCV(normalize = True, max_iter=10000))
])

In [None]:
# Your code here.

**Now, fit a Ridge Regression model using degree 2 polynomial features. Again, make sure that your predictors are either standardized or normalized prior to fitting your model. How well does this model perform?**

In [None]:
pipe = Pipeline(steps = [
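    # `ct` is not defined earlier in this notebook; define your column transformer first, or remove this step.
    ('transformer', ct),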
    ('poly', PolynomialFeatures(degree = 2, include_bias = False)),
    ('ridge', RidgeCV(normalize = True))
])

In [None]:
# Your code here.

**Finally, modify the above model to include a VarianceThreshold. Does this improve the model?**

In [None]:
# Your code here