## Regularized Regression
In this notebook, you'll work with data on cars from the year 2010. The goal is to see how well you can predict fuel economy (`FE`) from the other variables in the dataset.

In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV, Lasso, ElasticNetCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold

In [None]:
cars = pd.read_csv('data/cars_2010.csv')

In [None]:
cars.head()

Let's start with a baseline model that uses only `EngDispl` and `EngDispl^2`.

In [None]:
variables = ['EngDispl']

X = cars[variables]
y = cars['FE']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 321)

pipe = Pipeline(steps = [
    ('pf', PolynomialFeatures(include_bias=False, interaction_only=False, degree = 2)),
    ('linreg', LinearRegression())
])

pipe.fit(X_train, y_train)

In [None]:
features = list(pipe['pf'].get_feature_names_out(variables))

coefficients = pd.DataFrame({
    'variable': ['intercept'] + features,
    'coefficient': [pipe['linreg'].intercept_] + list(pipe['linreg'].coef_)
})
coefficients

In [None]:
print(f'MSE: {mean_squared_error(y_test, pipe.predict(X_test))}')
print(f'R2: {r2_score(y_test, pipe.predict(X_test))}')

In [None]:
cars.columns

Now, let's add in all other features to see how much of an improvement we can get.
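
Before fitting, it can help to estimate how many columns the degree-2 expansion will create once the categorical variables are one-hot encoded. The sketch below counts output columns for a few hypothetical input widths; the helper name `n_degree2_features` is introduced here for illustration only.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def n_degree2_features(n_inputs, interaction_only):
    """Count the columns PolynomialFeatures(degree=2, include_bias=False) produces."""
    dummy = np.zeros((1, n_inputs))  # only the number of columns matters here
    pf = PolynomialFeatures(degree=2, interaction_only=interaction_only,
                            include_bias=False)
    return pf.fit(dummy).n_output_features_

# Hypothetical input widths; substitute the width of your encoded data.
for n in (5, 20, 50):
    print(n, n_degree2_features(n, True), n_degree2_features(n, False))
```

Comparing these counts with the number of training rows (`X_train.shape[0]`) is one way to judge how much room the model has to overfit.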

In [None]:
variables = [x for x in cars.columns if x != 'FE']
categorical_variables = ['Transmission', 'AirAspirationMethod', 'DriveDesc', 'CarlineClassDesc']

X = cars[variables]
y = cars['FE']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 321)

pipe = Pipeline(
    steps = [
        ('ct', ColumnTransformer(
            transformers = [
                # `sparse` was renamed to `sparse_output` in scikit-learn 1.2
                ('ohe', OneHotEncoder(sparse_output = False, drop = 'first'), categorical_variables)
            ],
            remainder = 'passthrough')),
        ('pf', PolynomialFeatures(interaction_only = True, include_bias = False)),
        ('vt', VarianceThreshold()),
        ('linreg', LinearRegression())
    ]
)

pipe.fit(X_train, y_train)

First, look at the performance on the training data.

In [None]:
print(f'MSE: {mean_squared_error(y_train, pipe.predict(X_train))}')
print(f'R2: {r2_score(y_train, pipe.predict(X_train))}')

Now, on the test set.

In [None]:
print(f'MSE: {mean_squared_error(y_test, pipe.predict(X_test))}')
print(f'R2: {r2_score(y_test, pipe.predict(X_test))}')

**Question 1:** How do you interpret the R^2 value that we got?
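
As a reminder of what `r2_score` computes, here is a small sketch of the usual definition, one minus the ratio of the residual sum of squares to the total sum of squares, on made-up numbers. The helper `r2_by_hand` is hypothetical. The second set of predictions shows that the value is not bounded below by zero.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot measured around the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0]   # made-up targets
good = [1.1, 1.9, 3.2]     # predictions close to the targets
bad = [3.0, 3.0, 3.0]      # predictions worse than always guessing the mean

print(r2_by_hand(y_true, good), r2_by_hand(y_true, bad))
```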

**Question 2:** Why might the model be performing so poorly?

In [None]:
features = list(pipe['ct'].named_transformers_['ohe'].get_feature_names_out(categorical_variables))
features += [x for x in X_train.columns if x not in categorical_variables]
features = list(pipe['pf'].get_feature_names_out(features))
features = list(np.array(features)[pipe['vt'].get_support()])

coefficients = pd.DataFrame({
    'variable': ['intercept'] + features,
    'coefficient': [pipe['linreg'].intercept_] + list(pipe['linreg'].coef_)
})

coefficients

**Question 3:** Explore the coefficients that you get. Does anything appear suspect?

Now, let's switch to ridge regression to see how it changes our model.
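
To build some intuition first, the sketch below compares ordinary least squares with ridge on synthetic data containing two nearly identical predictors. The data, the seed, and `alpha=1.0` are all assumptions chosen for illustration and have nothing to do with the cars dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)    # nearly a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=n)  # only x1 drives the target

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)          # alpha chosen arbitrarily

print('OLS coefficients:  ', ols.coef_)
print('Ridge coefficients:', ridge.coef_)
```

The ridge penalty adds `alpha` times the sum of squared coefficients to the least-squares loss, so the overall size of its coefficient vector cannot exceed that of the unpenalized fit.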

In [None]:
variables = [x for x in cars.columns if x != 'FE']
categorical_variables = ['Transmission', 'AirAspirationMethod', 'DriveDesc', 'CarlineClassDesc']

X = cars[variables]
y = cars['FE']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 321)

pipe = Pipeline(
    steps = [
        ('ct', ColumnTransformer(
            transformers = [
                ('ohe', OneHotEncoder(sparse_output = False,
                                      #drop = 'first',
                                      handle_unknown = 'ignore'
                                     ), categorical_variables)
            ],
            remainder = 'passthrough')),
        ('pf', PolynomialFeatures(interaction_only = False, include_bias = False)),
        ('vt', VarianceThreshold()),
        ('scaler', StandardScaler()),
        ('linreg', RidgeCV())
    ]
)

pipe.fit(X_train, y_train)

In [None]:
print(f'MSE: {mean_squared_error(y_train, pipe.predict(X_train))}')
print(f'R2: {r2_score(y_train, pipe.predict(X_train))}')

In [None]:
print(f'MSE: {mean_squared_error(y_test, pipe.predict(X_test))}')
print(f'R2: {r2_score(y_test, pipe.predict(X_test))}')

In [None]:
features = list(pipe['ct'].named_transformers_['ohe'].get_feature_names_out(categorical_variables))
features += [x for x in X_train.columns if x not in categorical_variables]
features = list(pipe['pf'].get_feature_names_out(features))
features = list(np.array(features)[pipe['vt'].get_support()])

coefficients = pd.DataFrame({
    'variable': ['intercept'] + features,
    'coefficient': [pipe['linreg'].intercept_] + list(pipe['linreg'].coef_)
})

coefficients

**Question 4:** What value of alpha did the model decide on? (Hint: this is a fitted attribute of the model. You might want to look at the [RidgeCV documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html).)

In [None]:
pipe['linreg'].alpha_

**Question 5: True or False -** A smaller value of alpha will tend to result in smaller coefficient values.

**Question 6:** By default, the RidgeCV model tries only three values of alpha: 0.1, 1, and 10. Modify the code below to try a larger range of alpha values. Can you find a better model? What is the best value of alpha that you can find?
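
One common way to build a wider grid is to space candidate values evenly on a log scale with `np.logspace`. The sketch below does this on synthetic data from `make_regression`; the grid limits and the dataset are assumptions, not recommendations for the cars data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Synthetic stand-in data (not the cars dataset).
X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)

alphas = np.logspace(-3, 3, 13)   # 0.001 ... 1000, evenly spaced on a log scale
model = RidgeCV(alphas=alphas).fit(X, y)

print('chosen alpha:', model.alpha_)
if model.alpha_ in (alphas.min(), alphas.max()):
    print('The best alpha sits on the edge of the grid; consider widening it.')
```

In the pipeline, the same idea would amount to passing an array like this as the `alphas` argument of `RidgeCV`.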

In [None]:
variables = [x for x in cars.columns if x != 'FE']
categorical_variables = ['Transmission', 'AirAspirationMethod', 'DriveDesc', 'CarlineClassDesc']

X = cars[variables]
y = cars['FE']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 321)

pipe = Pipeline(
    steps = [
        ('ct', ColumnTransformer(
            transformers = [
                ('ohe', OneHotEncoder(sparse_output = False,
                                      #drop = 'first',
                                      handle_unknown = 'ignore'
                                     ), categorical_variables)
            ],
            remainder = 'passthrough')),
        ('pf', PolynomialFeatures(interaction_only = False, include_bias = False)),
        ('vt', VarianceThreshold()),
        ('scaler', StandardScaler()),
        ('linreg', RidgeCV(alphas = (10, 50, 100)))
    ]
)

pipe.fit(X_train, y_train)

In [None]:
pipe['linreg'].alpha_

In [None]:
print(f'MSE: {mean_squared_error(y_test, pipe.predict(X_test))}')
print(f'R2: {r2_score(y_test, pipe.predict(X_test))}')

In [None]:
features = list(pipe['ct'].named_transformers_['ohe'].get_feature_names_out(categorical_variables))
features += [x for x in X_train.columns if x not in categorical_variables]
features = list(pipe['pf'].get_feature_names_out(features))
features = list(np.array(features)[pipe['vt'].get_support()])

coefficients = pd.DataFrame({
    'variable': ['intercept'] + features,
    'coefficient': [pipe['linreg'].intercept_] + list(pipe['linreg'].coef_)
})

coefficients

Next, let's try out a lasso model. Notice that we have increased the `max_iter` value so that the solver has a good chance of converging.
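
The lasso is fit by an iterative coordinate-descent solver, and scikit-learn emits a `ConvergenceWarning` when the solver stops at `max_iter` before reaching its tolerance. As a rough sketch on synthetic data (an assumption, not the cars dataset), the fitted `LassoCV` exposes the chosen alpha and the number of iterations used in the final fit:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 30 features, only 5 of which affect the target.
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)

model = LassoCV(max_iter=5000, cv=5, random_state=0).fit(X, y)

print('chosen alpha:', model.alpha_)
print('coordinate-descent iterations used:', model.n_iter_)
print('nonzero coefficients:', np.sum(model.coef_ != 0), 'of', model.coef_.size)
```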

In [None]:
variables = [x for x in cars.columns if x != 'FE']
categorical_variables = ['Transmission', 'AirAspirationMethod', 'DriveDesc', 'CarlineClassDesc']

X = cars[variables]
y = cars['FE']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 321)

pipe = Pipeline(
    steps = [
        ('ct', ColumnTransformer(
            transformers = [
                ('ohe', OneHotEncoder(sparse_output = False, drop = 'first'), categorical_variables)
            ],
            remainder = 'passthrough')),
        ('pf', PolynomialFeatures(interaction_only = False, include_bias = False)),
        ('vt', VarianceThreshold()),
        ('scaler', StandardScaler()),
        ('linreg', LassoCV(max_iter = 5000))
    ]
)

pipe.fit(X_train, y_train)

In [None]:
print(f'MSE: {mean_squared_error(y_test, pipe.predict(X_test))}')
print(f'R2: {r2_score(y_test, pipe.predict(X_test))}')

In [None]:
features = list(pipe['ct'].named_transformers_['ohe'].get_feature_names_out(categorical_variables))
features += [x for x in X_train.columns if x not in categorical_variables]
features = list(pipe['pf'].get_feature_names_out(features))
features = list(np.array(features)[pipe['vt'].get_support()])

coefficients = pd.DataFrame({
    'variable': ['intercept'] + features,
    'coefficient': [pipe['linreg'].intercept_] + list(pipe['linreg'].coef_)
})

coefficients

**Question 7:** What proportion of coefficients end up being zero?

In [None]:
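# Exclude the intercept row, which the lasso penalty does not shrink
(coefficients.loc[coefficients['variable'] != 'intercept', 'coefficient'] == 0).mean()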

In [None]:
coefficients[coefficients['coefficient'] != 0]

Finally, let's use the Lasso class so that we can manually set the value of alpha to see the effect on the model.

**Question 8:** What seems to be the relationship between alpha and the performance of the model? Between alpha and the number of nonzero coefficients?
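
Rather than editing alpha by hand and rerunning the cell each time, one option is to loop over a few values and record the results. The sketch below does this on synthetic data; the alpha grid is arbitrary, and the final value is deliberately extreme so that the penalty dominates.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the cars dataset).
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=321)
scaler = StandardScaler().fit(X_tr)          # fit the scaler on training rows only
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

results = []
for alpha in (0.01, 0.1, 1, 10, 1000):       # 1000 is deliberately extreme
    model = Lasso(alpha=alpha, max_iter=5000).fit(X_tr, y_tr)
    n_nonzero = int(np.sum(model.coef_ != 0))
    results.append((alpha, n_nonzero, model.score(X_te, y_te)))

for alpha, n_nonzero, r2 in results:
    print(f'alpha={alpha:>7}: {n_nonzero:2d} nonzero coefficients, test R2 = {r2:.3f}')
```

The same loop could wrap the full pipeline by rebuilding it with a different `Lasso(alpha=...)` on each pass.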

In [None]:
variables = [x for x in cars.columns if x != 'FE']
categorical_variables = ['Transmission', 'AirAspirationMethod', 'DriveDesc', 'CarlineClassDesc']

X = cars[variables]
y = cars['FE']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 321)

pipe = Pipeline(
    steps = [
        ('ct', ColumnTransformer(
            transformers = [
                ('ohe', OneHotEncoder(sparse_output = False, drop = 'first'), categorical_variables)
            ],
            remainder = 'passthrough')),
        ('pf', PolynomialFeatures(interaction_only = False, include_bias = False)),
        ('vt', VarianceThreshold()),
        ('scaler', StandardScaler()),
        ('linreg', Lasso(alpha = 0.5))
    ]
)

pipe.fit(X_train, y_train)

print(f'MSE: {mean_squared_error(y_test, pipe.predict(X_test))}')
print(f'R2: {r2_score(y_test, pipe.predict(X_test))}')

In [None]:
features = list(pipe['ct'].named_transformers_['ohe'].get_feature_names_out(categorical_variables))
features += [x for x in X_train.columns if x not in categorical_variables]
features = list(pipe['pf'].get_feature_names_out(features))
features = list(np.array(features)[pipe['vt'].get_support()])

coefficients = pd.DataFrame({
    'variable': ['intercept'] + features,
    'coefficient': [pipe['linreg'].intercept_] + list(pipe['linreg'].coef_)
})

coefficients

coefficients[np.abs(coefficients['coefficient']) > 0]