## More Linear Regression Practice

In this notebook, you'll work with data on cars from the year 2010. The goal is to see how well you can predict fuel economy (the FE column) from the other variables in the dataset.

In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures
from sklearn.compose import ColumnTransformer

In [None]:
cars = pd.read_csv('data/cars_2010.csv')

In [None]:
cars.head()

First, we'll build a linear model that uses only the EngDispl variable. Make sure that you do a train/test split before fitting your model so that you can evaluate its performance.

In [None]:
variables = ['EngDispl']

X = cars[variables]
y = cars['FE']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 321)

linreg = LinearRegression().fit(X_train, y_train)

In [None]:
coefficients = pd.DataFrame({
    'variable': ['intercept'] + list(X_train.columns),
    'coefficient': [linreg.intercept_] + list(linreg.coef_)
})

coefficients

**Question 1:** What would the predicted FE value be for a car with an EngDispl value of 3?
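One way to check an answer to Question 1 is to compute the prediction by hand from the intercept and slope, then compare it with what `predict` returns. The sketch below assumes a small made-up dataset (`toy` and its values are hypothetical, not taken from `cars_2010.csv`), so the numbers will differ from yours.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data standing in for the cars dataset.
toy = pd.DataFrame({
    'EngDispl': [1.6, 2.0, 2.5, 3.0, 3.5, 5.0],
    'FE': [45.0, 41.0, 37.5, 34.0, 31.0, 24.0],
})

toy_model = LinearRegression().fit(toy[['EngDispl']], toy['FE'])

# Prediction by hand: intercept + slope * EngDispl
by_hand = toy_model.intercept_ + toy_model.coef_[0] * 3

# Same prediction through the model; a DataFrame keeps the feature name.
by_model = toy_model.predict(pd.DataFrame({'EngDispl': [3]}))[0]

print(by_hand, by_model)
```

The two printed values should agree, since a fitted simple linear regression is nothing more than the intercept plus the slope times the input.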

Now, let's look at how well our model did on the test data.

In [None]:
print(f'Mean Squared Error: {mean_squared_error(y_test, linreg.predict(X_test))}')
print(f'R2: {r2_score(y_test, linreg.predict(X_test))}')
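To make the two metrics concrete, here is a minimal sketch that computes them by hand and compares the results with scikit-learn. The `y_true` and `y_pred` arrays are hypothetical values chosen for illustration, not output from the model above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values, for illustration only.
y_true = np.array([30.0, 35.0, 40.0, 25.0])
y_pred = np.array([32.0, 34.0, 37.0, 27.0])

residuals = y_true - y_pred

# MSE: the average squared residual
mse_by_hand = np.mean(residuals ** 2)

# R2: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_by_hand = 1 - ss_res / ss_tot

print(mse_by_hand, mean_squared_error(y_true, y_pred))
print(r2_by_hand, r2_score(y_true, y_pred))
```

MSE is measured in squared units of FE, so lower is better. An R2 of 1 means perfect predictions, 0 means the model does no better than always predicting the mean, and negative values are possible on test data.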

Now, we'll build a model that uses just the AirAspirationMethod to predict fuel economy.

In [None]:
variables = ['AirAspirationMethod']
categorical_variables = ['AirAspirationMethod']

X = cars[variables]
y = cars['FE']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 321)

pipe = Pipeline(steps = [
    ('ct', ColumnTransformer(
        transformers = [
            ('ohe', OneHotEncoder(drop = 'first'), categorical_variables)
        ],
        remainder = 'passthrough'
    )
    ),
    ('linreg', LinearRegression())
])

pipe.fit(X_train, y_train)

In [None]:
print(f'MSE: {mean_squared_error(y_test, pipe.predict(X_test))}')
print(f'R2: {r2_score(y_test, pipe.predict(X_test))}')

In [None]:
# get_feature_names was removed in scikit-learn 1.2; get_feature_names_out replaces it
features = list(pipe['ct'].named_transformers_['ohe'].get_feature_names_out(categorical_variables))
features += [x for x in X_train.columns if x not in categorical_variables]

coefficients = pd.DataFrame({
    'variable': ['intercept'] + features,
    'coefficient': [pipe['linreg'].intercept_] + list(pipe['linreg'].coef_)
})
coefficients

**Question 2:** What does your model predict for the FE value for a Turbocharged car?

Now, we'll fit a model using EngDispl and AirAspirationMethod, but not the interaction between them.

In [None]:
variables = ['EngDispl', 'AirAspirationMethod']
categorical_variables = ['AirAspirationMethod']

X = cars[variables]
y = cars['FE']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 321)

pipe = Pipeline(steps = [
    ('ct', ColumnTransformer(
        transformers = [
            ('ohe', OneHotEncoder(drop = 'first'), categorical_variables)
        ],
        remainder = 'passthrough'
        )
    ),
    ('linreg', LinearRegression())
])

pipe.fit(X_train, y_train)

In [None]:
# get_feature_names was removed in scikit-learn 1.2; get_feature_names_out replaces it
features = list(pipe['ct'].named_transformers_['ohe'].get_feature_names_out(categorical_variables))
features += [x for x in X_train.columns if x not in categorical_variables]

coefficients = pd.DataFrame({
    'variable': ['intercept'] + features,
    'coefficient': [pipe['linreg'].intercept_] + list(pipe['linreg'].coef_)
})
coefficients

**Question 3:** What does your model predict for the FE value of a supercharged car that has an EngDispl value of 4?

**Question 4:** What does your model predict for the FE value of a naturally aspirated car that has an EngDispl value of 4?
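Answers to questions like these can be checked by passing new rows through the fitted pipeline. The following is a sketch on made-up data (the `toy` values and category labels are hypothetical), using the same pipeline structure as above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; labels and values are made up for illustration.
toy = pd.DataFrame({
    'EngDispl': [1.6, 2.0, 2.4, 3.0, 3.5, 4.0, 5.0, 6.2],
    'AirAspirationMethod': ['Turbocharged', 'NaturallyAspirated', 'NaturallyAspirated',
                            'Supercharged', 'NaturallyAspirated', 'Supercharged',
                            'NaturallyAspirated', 'Turbocharged'],
    'FE': [44.0, 42.0, 39.0, 33.0, 32.0, 28.0, 25.0, 20.0],
})

toy_pipe = Pipeline(steps=[
    ('ct', ColumnTransformer(
        transformers=[('ohe', OneHotEncoder(drop='first'), ['AirAspirationMethod'])],
        remainder='passthrough')),
    ('linreg', LinearRegression()),
])
toy_pipe.fit(toy[['EngDispl', 'AirAspirationMethod']], toy['FE'])

# New cars to score, with the same column names as the training features.
new_cars = pd.DataFrame({
    'EngDispl': [4, 4],
    'AirAspirationMethod': ['Supercharged', 'NaturallyAspirated'],
})
preds = toy_pipe.predict(new_cars)

# Without an interaction term, the gap between the two predictions
# equals the Supercharged coefficient, whatever the EngDispl value is.
supercharged_coef = toy_pipe['linreg'].coef_[0]
print(preds, preds[0] - preds[1], supercharged_coef)
```

Because the model is additive, the fitted lines for the different aspiration methods are parallel: they share one EngDispl slope and differ only by a constant shift.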

Let's see how this affected the model performance.

In [None]:
print(f'MSE: {mean_squared_error(y_test, pipe.predict(X_test))}')
print(f'R2: {r2_score(y_test, pipe.predict(X_test))}')

Now, we'll fit a model using EngDispl, AirAspirationMethod, and the interaction between the two variables.

In [None]:
variables = ['EngDispl', 'AirAspirationMethod']
categorical_variables = ['AirAspirationMethod']

X = cars[variables]
y = cars['FE']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 321)

pipe = Pipeline(steps = [
    ('ct', ColumnTransformer(
        transformers = [
            ('ohe', OneHotEncoder(drop = 'first'), categorical_variables)
        ],
        remainder = 'passthrough'
        )
    ),
    ('pf', PolynomialFeatures(include_bias=False, interaction_only=True)),
    ('linreg', LinearRegression())
])

pipe.fit(X_train, y_train)

In [None]:
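# get_feature_names was removed in scikit-learn 1.2; get_feature_names_out replaces it
features = list(pipe['ct'].named_transformers_['ohe'].get_feature_names_out(categorical_variables))
features += [x for x in X_train.columns if x not in categorical_variables]
# Convert to a list so that it can be concatenated with ['intercept'] below
features = list(pipe['pf'].get_feature_names_out(features))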

coefficients = pd.DataFrame({
    'variable': ['intercept'] + features,
    'coefficient': [pipe['linreg'].intercept_] + list(pipe['linreg'].coef_)
})
coefficients

**Question 5:** What FE value does the model predict for a supercharged car with an EngDispl value of 4?

Let's check how it did on the test data.

In [None]:
print(f'MSE: {mean_squared_error(y_test, pipe.predict(X_test))}')
print(f'R2: {r2_score(y_test, pipe.predict(X_test))}')

Let's look at the plot of EngDispl vs. FE.

In [None]:
cars.plot(kind = 'scatter', x = 'EngDispl', y = 'FE', figsize = (10,6));

It looks like the relationship between these variables might be slightly non-linear. Perhaps a higher-degree polynomial will fit better.
Let's try fitting a model of the form

$$\text{Predicted FE} = \beta_0 + \beta_1\cdot(\text{EngDispl}) + \beta_2\cdot(\text{EngDispl})^2$$

We'll do that using the PolynomialFeatures class again.

In [None]:
variables = ['EngDispl']

X = cars[variables]
y = cars['FE']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 321)

pipe = Pipeline(steps = [
    ('pf', PolynomialFeatures(include_bias=False, interaction_only=False, degree = 2)),
    ('linreg', LinearRegression())
])

pipe.fit(X_train, y_train)

In [None]:
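# get_feature_names was removed in scikit-learn 1.2; get_feature_names_out replaces it.
# Convert to a list so that it can be concatenated with ['intercept'] below.
features = list(pipe['pf'].get_feature_names_out(variables))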

coefficients = pd.DataFrame({
    'variable': ['intercept'] + features,
    'coefficient': [pipe['linreg'].intercept_] + list(pipe['linreg'].coef_)
})
coefficients

**Question 6:** What does this model predict for the fuel economy of a car with an EngDispl value of 4?

Let's check the model performance.

In [None]:
print(f'MSE: {mean_squared_error(y_test, pipe.predict(X_test))}')
print(f'R2: {r2_score(y_test, pipe.predict(X_test))}')

If you want to see the curve that was fit, you can use the following code to create the plot.

In [None]:
x_grid = np.linspace(start = cars['EngDispl'].min(),
                    stop = cars['EngDispl'].max(),
                    num = 150)
# Wrap the grid in a DataFrame so the column name matches the one used in fitting
y_grid = pipe.predict(pd.DataFrame({'EngDispl': x_grid}))

fig, ax = plt.subplots(figsize = (10,6))
ax.plot(x_grid, y_grid)
cars.plot(kind = 'scatter',
             x = 'EngDispl',
             y = 'FE', ax = ax);

**Bonus Questions:** Continue to explore other variables and see how well you can fit a linear model to this dataset.