In this notebook, you'll see how to create a linear regression model using the `scikit-learn` library.

You will be using a cleaned-up version of the auto mpg dataset, with the goal of predicting a car's mpg from the other attributes of that car.

In [None]:
import pandas as pd

In [None]:
cars = pd.read_csv('../data/auto_mpg_cleaned.csv')

In [None]:
cars.head()

#### Do some exploration to see how the variables are related to mpg.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import interact
%matplotlib inline

First, take a look at the numeric predictors.

In [None]:
@interact(x = ['displacement', 'horsepower', 'weight', 'acceleration'])
def make_scatter(x):
    cars.plot(kind = 'scatter', x = x, y = 'mpg');

There are also some discrete numeric variables, and some categorical variables.

In [None]:
@interact(x = ['cylinders', 'origin'])
def make_box(x):
    sns.boxplot(data = cars, x = x, y = 'mpg')

Before proceeding, you are going to transform the categorical and discrete predictors. You will do this by creating new indicator columns for all of the different possible levels of these variables. You can accomplish this by using the `get_dummies` function from `pandas`.

In [None]:
cars['cylinders'] = cars['cylinders'].astype('category')

cars = cars.drop(columns = ['car_name', 'model_year'])    

cars = pd.get_dummies(cars, drop_first=True)

In [None]:
cars.head()

The `get_dummies()` function created 4 columns out of the original cylinders column and two columns out of the original origin column. In these columns, the corresponding value is marked with a 1, and all other values are marked with a 0. You may notice that there are 5 possible values for the number of cylinders and 3 possible values for the origin in the original dataset, which is one more than the number of indicator columns in each case. The reason is that `drop_first=True` drops the first level of each variable, and no information is lost by doing so. For example, we know that if a car is not European or Japanese, then it must be American (at least in the dataset we are working with).
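To make the dropped level concrete, here is a minimal sketch on a hypothetical four-row DataFrame (the `toy` data is invented for illustration and is not the notebook's dataset). It assumes the origin levels are stored as strings, so that `American` sorts first and is the level removed by `drop_first=True`.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({
    'mpg': [18.0, 26.0, 31.0, 24.0],
    'origin': ['American', 'European', 'Japanese', 'American'],
})

# dtype=int requests 0/1 indicators; recent pandas versions
# return True/False by default.
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)
print(dummies)
```

The American rows have a 0 in both indicator columns, which is how the dropped level is still represented.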

Now, let's split off our predictors and response variable.

In [None]:
# Predictors:
X = cars.drop(columns = 'mpg')

# Response:
y = cars['mpg']

When building models, you are often interested in the predictive power of the model. You are not interested in how well the model predicts on data that it has already seen, but in how well it generalizes to new, unseen data.

To evaluate this, you will set aside a portion of the full dataset as your _test set_. The remaining portion, called the _training set_, will be used to fit the model.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 321)
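As a rough check of what the split produces, the sketch below runs `train_test_split` on a small made-up array (`X_demo` and `y_demo` are hypothetical names, not part of the notebook) and confirms that the two sets do not share rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 rows, 2 features, with a unique label per row.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=321
)

# A quarter of the rows go to the test set, the rest to the training set.
print(X_tr.shape, X_te.shape)

# No row appears in both sets.
print(set(y_tr) & set(y_te))
```

Fixing `random_state` makes the shuffle reproducible, so rerunning the cell gives the same split.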

In [None]:
from sklearn.linear_model import LinearRegression

First, you need to create a LinearRegression instance.

In [None]:
lr = LinearRegression()

And then fit it on the training data. _Fitting a model_ means having your model instance learn the relationship between the predictor variables and the target variable.

In [None]:
lr.fit(X_train, y_train)

Now, let's evaluate how well the model performs.

In [None]:
from sklearn.metrics import r2_score, mean_absolute_error

First, you can look at the $R^2$ score. Recall that this shows the proportion of the variation in the mpg values that can be accounted for by the model. To use this function, you need to pass in the true values (contained in `y_train`) and the predicted values. You can obtain the predicted values by calling `.predict` on `lr` and passing in `X_train`.

In [None]:
r2_score(y_train, lr.predict(X_train))
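For reference, $R^2$ can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does this on a few made-up values (`y_true` and `y_hat` are hypothetical, not taken from the cars data) and compares the result with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true and predicted values, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

# Variation left unexplained by the predictions.
ss_res = np.sum((y_true - y_hat) ** 2)

# Total variation of the true values around their mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_hat))
```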

If you want to use your model to make predictions on new data, then you don't really care how it does on the training data. Instead, you want to see how well it generalizes to new data. You can check this by predicting the target variables for the test data.

In [None]:
y_pred = lr.predict(X_test)

r2_score(y_test, y_pred)

Typically, there will be a drop in performance from the training set to the test set, and you see that here. This drop will often be more pronounced the more flexible a model you use. For example, using a large number of predictor variables in a linear regression model will increase the flexibility.
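The effect of flexibility can be sketched on synthetic data. The example below is an illustration under stated assumptions, not a result from the cars dataset: it generates one informative predictor plus 30 pure-noise predictors and compares a model using only the informative predictor with a model using all 31. Adding predictors can never lower the training $R^2$ of a least-squares fit, while the test $R^2$ frequently gets worse (though that part is not guaranteed).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: one real signal column and 30 noise columns.
rng = np.random.default_rng(0)
n = 80
signal = rng.normal(size=(n, 1))
noise_cols = rng.normal(size=(n, 30))
y_syn = 3 * signal[:, 0] + rng.normal(size=n)

X_small = signal
X_big = np.hstack([signal, noise_cols])

# Use the same row split for both models so the scores are comparable.
idx_train, idx_test = train_test_split(np.arange(n), test_size=0.25, random_state=0)

results = {}
for name, X_mat in [('1 predictor', X_small), ('31 predictors', X_big)]:
    model = LinearRegression().fit(X_mat[idx_train], y_syn[idx_train])
    train_r2 = r2_score(y_syn[idx_train], model.predict(X_mat[idx_train]))
    test_r2 = r2_score(y_syn[idx_test], model.predict(X_mat[idx_test]))
    results[name] = (train_r2, test_r2)
    print(f'{name}: train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}')
```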

You can also look at other metrics. For example, the mean absolute error measures how far off the predictions are (in magnitude) on average. One bonus of using the mean absolute error is that it is measured in the same units as the response variable.
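As a small worked example, the mean absolute error is just the average of the absolute differences between true and predicted values. The values below are made up for illustration (`y_true` and `y_hat` are hypothetical names).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true and predicted values, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

# Average size of the errors, ignoring their sign.
mae_manual = np.mean(np.abs(y_true - y_hat))
print(mae_manual, mean_absolute_error(y_true, y_hat))
```

If the values were in mpg, the result would read directly as "off by 0.5 mpg on average".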

First, on the training set: 

In [None]:
mean_absolute_error(y_train, lr.predict(X_train))

And then on the test set:

In [None]:
mean_absolute_error(y_test, y_pred)

Again, there is a slight drop in performance from the training set to the test set. With the training data you see that the model is off by about 2.85 mpg on average, while it is off by an average of around 3.19 mpg on the test data.

What if you want to understand _how_ the model is making predictions? Since you are using a linear model, looking at coefficients can help you understand the model. The intercept and coefficients can be accessed from our trained model, `lr`.  

The code in the following cell extracts the coefficients and converts the result into a DataFrame.

In [None]:
coefficients = pd.DataFrame({'variable': ['intercept'] + list(X.columns),
                             'coefficient': [lr.intercept_] + list(lr.coef_)})

In [None]:
coefficients

For the continuous variables, the coefficient represents the change in mpg that would occur for a one-unit change in the corresponding predictor, _if all other predictors are held constant_.

For example, our coefficients show that for every one unit increase in horsepower, all other variables held constant, there is a drop in mpg of 0.093286.
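One way to check this interpretation is to predict for a single row, increase one predictor by one unit while leaving the others alone, and compare the two predictions. The sketch below does this on synthetic data; the `demo` DataFrame, its coefficients, and the chosen row are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data, invented for illustration only.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    'horsepower': rng.uniform(50, 200, 100),
    'weight': rng.uniform(1500, 5000, 100),
})
demo['mpg'] = (45 - 0.05 * demo['horsepower'] - 0.005 * demo['weight']
               + rng.normal(0, 1, 100))

model = LinearRegression().fit(demo[['horsepower', 'weight']], demo['mpg'])

# One hypothetical car, and the same car with one more unit of horsepower.
base = pd.DataFrame({'horsepower': [100.0], 'weight': [3000.0]})
bumped = base.assign(horsepower=base['horsepower'] + 1)

diff = model.predict(bumped)[0] - model.predict(base)[0]

# The change in the prediction matches the horsepower coefficient.
print(diff, model.coef_[0])
```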

There is one strange value that stands out. It seems that according to the model, increasing displacement will increase mpg. However, if you look at the scatterplot earlier in this notebook, it seems that there is a negative association between the two variables. Cars with higher than average displacement (or engine size) tend to have lower than average mpg.

This can happen for a number of reasons, but a common cause of an unexpected coefficient sign is correlation with other predictor variables.

In [None]:
cars.plot(kind = 'scatter', x = 'horsepower', y = 'displacement');

You can see, for example, that displacement is strongly correlated with horsepower. One possible explanation for the positive coefficient on displacement is that the effect of displacement has already been captured by horsepower.

All of this is to say that when you have correlated predictors, you need to exercise caution when interpreting their coefficients. It is always a good idea to do thorough exploratory analysis.
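The sign flip can be reproduced on synthetic data. The sketch below is an illustration under assumed values, not a reconstruction of the cars dataset: the variable names are borrowed for readability, the values are in arbitrary standardized units, and the true coefficients (-2.0 and 0.5) are chosen by hand. The second predictor is built to be nearly a copy of the first, so its overall association with the response is negative even though its coefficient in the model is positive.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, standardized data, invented for illustration only.
rng = np.random.default_rng(42)
n = 500
horsepower = rng.normal(size=n)
displacement = horsepower + 0.3 * rng.normal(size=n)  # strongly correlated with horsepower
mpg = -2.0 * horsepower + 0.5 * displacement + 0.1 * rng.normal(size=n)

# On its own, displacement is negatively associated with mpg.
marginal_corr = np.corrcoef(displacement, mpg)[0, 1]

# With horsepower in the model, the displacement coefficient is positive.
model = LinearRegression().fit(np.column_stack([horsepower, displacement]), mpg)
displacement_coef = model.coef_[1]

print(marginal_corr, displacement_coef)
```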

What about the categorical predictors? Look specifically at the origin variable. Since the `origin_American` column was dropped when creating the indicator columns, American cars serve as the baseline. You can interpret the coefficients of the other two columns as the change in mpg from changing a car's origin from American to either European or Japanese, keeping all other variables fixed.

The model is telling you that, all other variables held fixed, a European car will tend to get about 1.15 mpg higher than an American car, and a Japanese car will tend to get about 2.74 mpg higher than an American car.
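The role of the dropped level as a baseline can be seen in a stripped-down sketch. The `toy` data below is invented for illustration, and the model uses only the origin indicators. In that special case the intercept equals the mean mpg of the baseline group and each coefficient equals the difference between a group's mean and the baseline mean. With other predictors in the model, as in this notebook, the coefficients are differences adjusted for those predictors rather than raw differences in means.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({
    'origin': ['American'] * 3 + ['European'] * 3 + ['Japanese'] * 3,
    'mpg': [18.0, 20.0, 22.0, 26.0, 27.0, 28.0, 30.0, 31.0, 32.0],
})

# American sorts first, so it is the level dropped by drop_first=True.
X_toy = pd.get_dummies(toy[['origin']], drop_first=True, dtype=int)
model = LinearRegression().fit(X_toy, toy['mpg'])

group_means = toy.groupby('origin')['mpg'].mean()

print(group_means)
print(model.intercept_, dict(zip(X_toy.columns, model.coef_)))
```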