# Predicting Automobile Fuel Efficiency

## A simple regression example

In this notebook, we'll explore some techniques for training a *regression* machine learning model that predicts the miles-per-gallon fuel efficiency of an automobile based on its features.

### Explore and clean the data

Let's start by loading the automobile data into a Pandas dataframe and looking at the first 20 instances.

In [None]:
import pandas as pd

# load the training dataset
auto_data = pd.read_csv('data/auto-mpg.csv')
auto_data.head(20)

The data we have to work with includes *features* of automobiles, such as the model, number of cylinders, engine displacement, horsepower, weight, acceleration, model year, and origin (North America, Europe, or Asia), and also the *label* that we want to train a model to predict (*mpg*).

Let's see how many observations we have in our dataset.

In [None]:
len(auto_data)

We can also count the number of values in each column:

In [None]:
auto_data.count()

Hmm, it looks like there are some missing values in the *horsepower* column. We can confirm the number of null values in each column by using the **isnull** method, like this:

In [None]:
auto_data.isnull().sum()

Let's see the records with missing values in context:

In [None]:
auto_data[auto_data.isnull().any(axis=1)]

The *NaN* value indicates that the value is "not a number" - in this case because it is missing.

There are a few ways we can deal with missing values like this. For example, we could substitute the missing values with a reasonable value, such as the mean horsepower for all of the other records. However, sometimes the easiest thing to do (assuming you have enough remaining data with which to train a model) is to simply eliminate the rows with missing data.

In [None]:
auto_data = auto_data.dropna(axis=0, how='any')
auto_data.count()
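The substitution approach mentioned above can be sketched in plain Python, independent of pandas. The values below are made up for illustration and are not taken from the dataset; `None` stands in for a missing horsepower reading.

```python
from statistics import mean

# Hypothetical horsepower readings; None marks a missing value
horsepower = [130.0, 165.0, None, 150.0, None, 95.0]

# Mean of the values that are actually present
known = [hp for hp in horsepower if hp is not None]
fill_value = mean(known)

# Option 1: substitute the mean for each missing value
imputed = [fill_value if hp is None else hp for hp in horsepower]

# Option 2: drop the missing values entirely
dropped = known

print("Fill value:", fill_value)
print("Rows kept when imputing:", len(imputed))
print("Rows kept when dropping:", len(dropped))
```

In pandas, the same idea is usually written with the `fillna` method, for example `auto_data['horsepower'].fillna(auto_data['horsepower'].mean())`. Imputing keeps every row at the cost of inventing values; dropping keeps only real measurements at the cost of losing rows.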

Alright, now that we've cleaned up the data, let's see what we have. First, let's identify the data types in our dataset:

In [None]:
auto_data.dtypes

Some of the data variables are numeric (in this case, 64-bit integers and floating-point values), while others are objects (often text values, or *strings*).

Let's look at the statistical description of the dataframe values (this applies only to numeric columns):

In [None]:
auto_data.describe()

The *mpg* column contains the *label* value that we want to train a model to predict. The other columns contain the numeric *features* that might help predict the label.

In general, there are two kinds of numeric feature:

- *Continuous* features - values that represent numeric values you would typically *measure*, and which could be fractional
- *Discrete* features - values that represent discrete quantities that you would typically *count*, and which are typically whole numbers.

While discrete numeric values often represent quantities, they are sometimes used as *categorical* variables - in other words, rather than representing a quantity, they can be viewed as category indicators. In the case of our automobile data, the *model_year* variable is an obvious example of this - the value 1970 does not represent a quantity of 1,970.00, but rather the year 1970.

A less obvious example might be *cylinders*. Even though this represents a quantity, the values range only from 3 to 8; so while this value does indeed represent the number of cylinders a car has, you could also view it as a category that groups cars into 3-cylinder cars, 4-cylinder cars, 8-cylinder cars, and so on. In a real machine learning project, you'd likely do some more analysis of the data to determine how best to treat this variable, but for the purposes of this exercise, we'll consider it a categorical rather than a numeric value.

So, let's explicitly define the *features* we want to consider as numeric values, along with the numeric *label* we want to train a model to predict (*mpg*):

In [None]:
numeric_features = ['displacement', 'horsepower', 'weight', 'accelleration']
auto_data[numeric_features + ['mpg']].describe()

Let's now focus on the label that our model will try to predict: *mpg*.

Let's use matplotlib to plot a histogram and a box plot so we can understand the distribution of mpg values in the sample data we have.

In [None]:
import matplotlib.pyplot as plt

# This ensures plots are displayed inline in the Jupyter notebook
%matplotlib inline

# Get the label column
label = auto_data['mpg']


# Create a figure for 2 subplots (2 rows, 1 column)
fig, ax = plt.subplots(2, 1, figsize = (9,12))

# Plot the histogram   
ax[0].hist(label, bins=100)
ax[0].set_ylabel('Frequency')

# Add lines for the mean and median
ax[0].axvline(label.mean(), color='magenta', linestyle='dashed', linewidth=2)
ax[0].axvline(label.median(), color='cyan', linestyle='dashed', linewidth=2)

# Plot the boxplot   
ax[1].boxplot(label, vert=False)
ax[1].set_xlabel(label.name)

# Add a title to the Figure
fig.suptitle('Distribution')

# Show the figure
plt.show()

The distribution of *mpg* values is slightly *right skewed*; in other words, the bulk of the data is at the lower end of the scale, with a longer tail of cars with high values. However, the mean and median values are not too far from the center of the distribution; so while the data does not follow what statisticians call a *normal* distribution, it seems reasonably balanced. There are no apparent *outliers* that would represent unusually high or low values.
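If you'd rather not judge skew by eye, a quick numeric check is to compare the mean with the median and to compute the sample skewness. The sketch below uses made-up values (not the notebook's data) in which most values sit at the low end and a few stretch out toward the high end; a positive skewness, together with a mean above the median, indicates that the longer tail is on the high side.

```python
from statistics import mean, median

# Hypothetical mpg-like values with a long tail toward high values
values = [14, 15, 16, 17, 18, 19, 20, 22, 26, 33, 44]

n = len(values)
mu = mean(values)

# Population standard deviation
sigma = (sum((v - mu) ** 2 for v in values) / n) ** 0.5

# Skewness: the mean of the cubed standardized deviations
skewness = sum(((v - mu) / sigma) ** 3 for v in values) / n

print("mean:", mu, "median:", median(values))
print("skewness:", round(skewness, 2))
```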

Now let's examine the distributions of the various numeric features.

In [None]:
# Plot a histogram for each numeric feature
for col in numeric_features:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = auto_data[col]
    feature.hist(bins=100, ax = ax)
    ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)
    ax.set_title(col)
plt.show()

Again, there don't appear to be any significant outliers. The distribution of *accelleration* looks more or less *normal* (with low values at the extremes and most of the data peaking in the middle at the mean value), while the others tend to be *right-skewed*.

Now let's explore the relationships between the numeric features and the label. To do this, we'll create a *scatterplot* for each feature that plots each feature value against the corresponding label value. We'll also calculate a statistic called *correlation*, which measures the strength of the linear relationship between two numeric variables on a scale between -1 and +1. A correlation of +1 indicates that high values of one variable tend to coincide with high values of the other; a correlation of -1 indicates that high values of one variable tend to coincide with low values of the other; and a correlation of 0 indicates that there is no discernible linear relationship between the variables.
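To make the correlation statistic less of a black box, here's a minimal plain-Python sketch of the Pearson correlation coefficient, which is what the pandas `corr` method computes by default. The `pearson` helper and the sample values are hypothetical and exist only for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables vary together...
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # ...relative to how much each varies on its own
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values: heavier cars with lower mpg
weight = [2000, 2500, 3000, 3500, 4000]
mpg = [34, 29, 24, 19, 14]

print(pearson(weight, mpg))      # perfectly linear and decreasing
print(pearson(weight, weight))   # a variable compared with itself
```

Because the made-up *mpg* values fall in a perfectly straight line as *weight* rises, the first result sits at the negative end of the scale; real data will land somewhere in between.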

In [None]:
# Plot a scatterplot for each numeric feature vs the label
for col in numeric_features:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    y = auto_data[col]
    x = auto_data['mpg']
    plt.scatter(x, y)
    cor = auto_data['mpg'].corr(auto_data[col])
    ax.set_title(col + ' vs mpg (correlation: ' + str(round(cor, 2)) + ')')
    ax.set_ylabel(col)
    ax.set_xlabel('mpg')
plt.show()


The scatterplots for *displacement*, *horsepower*, and *weight* show a distinct diagonal pattern in which cars with high values for these features tend to have lower *mpg* values. This visual pattern is supported by the negative correlation scores calculated for these columns. The *acceleration* feature, however, has a slightly positive correlation with the label, in which cars with higher acceleration also tend to have higher *mpg* values.

The key takeaway, from the perspective of building a predictive model, is that there do seem to be relationships between all of these features and the label we're trying to predict - so all of these features may be useful in trying to predict the unknown *mpg* of a car.

### Train a predictive model

OK, now it's time for our first attempt at training a model that can predict the *mpg* of a car based on some of its features. We'll use the numeric features we've been exploring, and which appear to have a relationship with the *mpg*.

Predicting numeric values, such as *mpg*, is a form of machine learning known as *regression*. Regression is an example of a *supervised* machine learning technique, in which a dataset containing known label values is used to train a model, which can then be applied to new data for which the label is unknown. The first step is to identify the *features* we want to use to train the model, and the *label* we want to train it to predict. The following code separates these into two arrays of data, and prints the first 10 elements of each.

> **Note**: By convention, we often refer to the feature values as **X** and the label values as **y** - in effect, training a machine learning model is the process of determining a function (**f**) that operates on **X** to calculate **y**, or mathematically, ***f*(X)=y**

In [None]:
# Separate features and labels
X, y = auto_data[numeric_features].values, label.values
print('Features:',X[:10], '\nLabels:', y[:10], sep='\n')

Next, because we have a dataset containing known label values, we can use some rows to train a model, but hold some rows back that we can use later to test the model and see how close the predicted labels for the hold-back data are to the actual label values that we already know. To accomplish this, we'll use the Scikit-Learn library's *train_test_split* function to randomly split the data into a training set (which consists of an array of features and the corresponding array of labels) and a test set (which again, consists of an array of features and a corresponding array of labels).

It's common to use most of the data to train the model, and a smaller subset to test it - a 70%/30% split is a typical starting point.

In [None]:
from sklearn.model_selection import train_test_split

# Split data 70%-30% into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

print ('Training Set: %d rows\nTest Set: %d rows' % (X_train.shape[0], X_test.shape[0]))
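Conceptually, a random split just shuffles the row positions and holds back a fraction of them. The sketch below illustrates that idea in plain Python; `split_rows` is a hypothetical helper, not Scikit-Learn's implementation, and it won't reproduce the exact rows that *train_test_split* selects.

```python
import random

def split_rows(rows, test_fraction=0.30, seed=0):
    """Shuffle row indices and hold back a fraction of them for testing."""
    indices = list(range(len(rows)))
    # A seeded generator makes the shuffle repeatable
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(10))  # stand-in for 10 observations
train, test = split_rows(rows)
print('Training Set: %d rows\nTest Set: %d rows' % (len(train), len(test)))
```

Fixing the seed plays the same role as `random_state=0` in the cell above: the split stays the same from run to run, so differences in results come from the model rather than from the luck of the draw.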

With the data split into train and test subsets, we can now go ahead and use the training data to train a model. To do this, we need to select the training *algorithm* that we want to use to *fit* the features and labels in the training data to a predictive model. There are many different kinds of algorithm for regression, and in this case, we'll start with one of the simplest and most well-established of these - *linear regression*, which attempts to establish a *linear* relationship between the feature values and the label values.

In [None]:
# Train the model
from sklearn.linear_model import LinearRegression

# Fit a linear regression model on the training set
model = LinearRegression().fit(X_train, y_train)
print (model.coef_)

The *coefficients* for the trained model are displayed - these are the values that the linear regression algorithm has determined should be applied to the feature values in order to calculate the predicted label value.
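To see how coefficients turn feature values into a prediction, here's a small sketch: each feature value is multiplied by its coefficient, and the results are added to an intercept (which Scikit-Learn stores separately, in `model.intercept_`). The numbers below are invented for illustration and are NOT the coefficients learned in this notebook.

```python
def predict_linear(features, coefficients, intercept):
    """Weighted sum of the feature values plus an intercept."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))

# Hypothetical values, not the coefficients learned above
# Order: displacement, horsepower, weight, accelleration
coefficients = [-0.01, -0.05, -0.005, 0.1]
intercept = 45.0
car = [300.0, 140.0, 3500.0, 10.0]

prediction = predict_linear(car, coefficients, intercept)
print('Predicted mpg:', round(prediction, 1))
```

A negative coefficient pulls the prediction down as the feature grows, which is what we'd expect for features like *weight* that were negatively correlated with *mpg*.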

Let's see what labels the model predicts for the test dataset we held back previously, and compare those predicted label values to the actual label values we know to be true:

In [None]:
import numpy as np

predictions = model.predict(X_test)
np.set_printoptions(suppress=True)
print('Predicted labels: ', np.round(predictions, 1)[:10])
print('Actual labels   : ' ,y_test[:10])

So how did it do?

We can compare each predicted value to the actual value, and get a sense that the model is reasonably close for each prediction; but that's quite laborious, and makes it hard to get an overall view of how well the model predicts.

An alternative approach might be to visualize the predicted and the actual values as a scatter plot, so we can see how well they line up.

In [None]:
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Predicted vs Actual Labels')
# overlay the regression line
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()

The scatterplot shows a diagonal trend in which the predicted values align more or less linearly with the actual values - we've added a *trend line* to the chart to make it more obvious. It's not perfect, but definitely seems to have some predictive capability.

What we might want to do is to try a few other algorithms to see if we can produce a better model, and while we could visually compare scatterplots for each model's predictions, it's easier to evaluate model performance if we can quantify the predictive performance of each model as a simple metric that can be compared. Fortunately, there are many such metrics we can calculate, including:

- **Mean Square Error (MSE)**: The mean of the squared differences between predicted and actual values. This is a *relative* metric in which the smaller the value, the better the fit of the model.
- **Root Mean Square Error (RMSE)**: The square root of the MSE. This is an *absolute* metric in the same unit as the label (in this case, *mpg*). Again, the smaller the value, the better the model.
- **Coefficient of Determination (usually known as *R-squared* or R<sup>2</sup>)**: A *relative* metric in which the higher the value, the better the fit of the model. Essentially, R<sup>2</sup> represents how much of the variance in the actual label values the model is able to explain. A perfect fit scores 1, and a model that does no better than always predicting the mean label scores 0 (the value can be negative for a model that does worse than that).
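The three metrics above are simple enough to compute by hand. Here's a plain-Python sketch using a handful of made-up actual and predicted values; `regression_metrics` is a hypothetical helper written for this illustration.

```python
from math import sqrt

def regression_metrics(actual, predicted):
    """Return (MSE, RMSE, R2) for two equal-length sequences."""
    n = len(actual)
    # Sum of squared differences between actual and predicted values
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    mse = ss_res / n
    rmse = sqrt(mse)
    # Total variation of the actual values around their mean
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot
    return mse, rmse, r2

# Hypothetical actual and predicted mpg values
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

mse, rmse, r2 = regression_metrics(actual, predicted)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2:", r2)
```

Note that RMSE is in the same unit as the label, so it can be read directly as "typically off by about this many *mpg*", while R<sup>2</sup> is unitless and easier to compare across datasets.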

Let's see these metrics for our linear regression model.

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)

rmse = np.sqrt(mse)
print("RMSE:", rmse)

r2 = r2_score(y_test, predictions)
print("R2:", r2)

OK, now that we have quantified the predictive performance of our model, let's try a different kind of algorithm and see if it gives us better results. This time, we'll use a *tree-based* model, in which the features in the dataset are examined in a series of evaluations, each of which results in a *branch* in a *decision tree* based on the feature value. At the end of each series of branches are leaf-nodes with the predicted label value based on the feature values.

To see this in action, we'll train a *Decision Tree* regression model using the automobile data. After training the model, the code below will print the model definition and a text representation of the tree it uses to predict label values.

In [None]:
# Train the model
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import export_text

# Fit a decision tree regression model on the training set
model = DecisionTreeRegressor().fit(X_train, y_train)
print (model)

# Visualize the model tree
tree = export_text(model)
print(tree)
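Each branch in the printed tree is a threshold test on one feature. As a rough illustration of how a single branch might be chosen, the sketch below tries each candidate threshold on one feature and keeps the one that gives the lowest squared error when each side predicts the mean label of its rows. The `best_split` helper and the data are hypothetical; Scikit-Learn's implementation is considerably more elaborate.

```python
def best_split(xs, ys):
    """Find the threshold on one feature that minimizes squared error
    when each side of the split predicts the mean label of its rows."""
    def sse(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values)

    best = None
    candidates = sorted(set(xs))
    # Try a threshold halfway between each pair of adjacent distinct values
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        error = sse(left) + sse(right)
        if best is None or error < best[1]:
            best = (threshold, error, sum(left) / len(left), sum(right) / len(right))
    return best

# Hypothetical weights and mpg values
weight = [1800, 2000, 2200, 3600, 3800, 4000]
mpg = [35.0, 33.0, 31.0, 17.0, 15.0, 13.0]

threshold, error, left_mean, right_mean = best_split(weight, mpg)
print("Split at weight <=", threshold)
print("Predict", left_mean, "mpg below the threshold and", right_mean, "mpg above it")
```

A full tree repeats this search within each resulting group, which is how the nested branches in the text representation arise.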

So, how does our tree-based model perform with our test data?

In [None]:
predictions = model.predict(X_test)

plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Predictions')
# overlay the regression line
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()

mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)

rmse = np.sqrt(mse)
print("RMSE:", rmse)

r2 = r2_score(y_test, predictions)
print("R2:", r2)

Well, it looks like the decision tree model performs slightly worse than the linear regression model. Given the linear relationships we observed previously between the numeric features and the label, it's perhaps not all that surprising that a linear model works well.

However, let's not be hasty - there are other algorithms we can try, including *ensemble* methods that actually combine multiple base algorithms to produce an optimal model, either by applying an aggregate function to a collection of base models (sometimes referred to as *bagging*) or by building a sequence of models that build on one another to improve predictive performance (referred to as *boosting*).

We'll try a *gradient boosting* ensemble model that tries to combine a sequence of models that minimize the *loss* (error) in predictions by determining the curve (*gradient*) for a function that calculates the loss, and adjusting the model coefficients so that the loss value is reduced (a technique known generally as *gradient descent*).
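For squared-error loss, the negative gradient of the loss with respect to the current predictions is simply the residual (actual minus predicted), so a minimal version of the boosting idea is to fit each new model to the residuals left by the models before it. The sketch below does this with one-feature, single-split trees on made-up data; `fit_stump` and `boost` are hypothetical helpers and not Scikit-Learn's implementation.

```python
def fit_stump(xs, residuals):
    """Best single split on one feature; returns (threshold, left_value, right_value)."""
    best = None
    points = sorted(set(xs))
    for lo, hi in zip(points, points[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]

def boost(xs, ys, n_estimators=50, learning_rate=0.5):
    """Fit a sequence of stumps, each to the residuals left by the previous ones."""
    base = sum(ys) / len(ys)           # start from the mean label
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        # Take a partial step (scaled by the learning rate) toward the residuals
        preds = [p + learning_rate * (lv if x <= t else rv) for x, p in zip(xs, preds)]
    return base, stumps, preds

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Hypothetical weights and mpg values
xs = [1800, 2200, 2600, 3000, 3400, 3800]
ys = [35.0, 31.0, 26.0, 22.0, 17.0, 14.0]

_, _, preds_1 = boost(xs, ys, n_estimators=1)
_, _, preds_50 = boost(xs, ys, n_estimators=50)
print("Training MSE after 1 round  :", mse(ys, preds_1))
print("Training MSE after 50 rounds:", mse(ys, preds_50))
```

Each round only nudges the predictions, so the training error falls gradually as estimators are added. Keep in mind that this measures error on the training data only; what matters in practice is how well the model does on held-back data.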

In [None]:
# Train the model
from sklearn.ensemble import GradientBoostingRegressor

# Fit a gradient boosting model on the training set
model = GradientBoostingRegressor().fit(X_train, y_train)
print (model)

predictions = model.predict(X_test)

plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Predictions')
# overlay the regression line
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()

mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)

rmse = np.sqrt(mse)
print("RMSE:", rmse)

r2 = r2_score(y_test, predictions)
print("R2:", r2)

That's a little better than the original linear regression algorithm - we're making progress!

Now, to improve even further, we can try adjusting some settings of the algorithm to control how it behaves. Technically, these settings are called *hyperparameters*: the values that a model learns from the training data (such as the coefficients of a linear model) are its parameters, so we use the term hyperparameters to be clear that we mean values that are set outside of the training process.

In the case of the gradient boosting algorithm, there are several hyperparameters we can experiment with, including the *learning rate* (the size of the adjustments made to coefficients to optimize the loss function) and *n_estimators* (the number of estimators combined to form the ensemble sequence). Let's try setting these values explicitly.

In [None]:
# Train the model
from sklearn.ensemble import GradientBoostingRegressor

# Fit a gradient boosting model on the training set
model = GradientBoostingRegressor(learning_rate=0.5, n_estimators=50).fit(X_train, y_train)
print (model)

predictions = model.predict(X_test)

plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Predictions')
# overlay the regression line
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()

mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)

rmse = np.sqrt(mse)
print("RMSE:", rmse)

r2 = r2_score(y_test, predictions)
print("R2:", r2)

We get a slightly different result from before, so clearly the hyperparameters affect the model.

But how can we choose the best hyperparameter values? Well, we can use a technique known as *hyperparameter sweeping*, which tries multiple combinations of parameter values until we find the ones that produce the best model (based on the metric we specify).
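At its core, a sweep over a grid is just a loop over every combination of candidate values. Here's a minimal sketch of that loop; `sweep` is a hypothetical helper, and `toy_score` is a made-up stand-in for "train a model with these settings and return its validation score".

```python
from itertools import product

def sweep(param_grid, score_fn):
    """Score every combination in the grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical scoring function; a real one would train and evaluate a model
def toy_score(learning_rate, n_estimators):
    return -abs(learning_rate - 0.5) - abs(n_estimators - 100) / 1000

grid = {"learning_rate": [0.1, 0.5, 1.0], "n_estimators": [50, 100, 150]}
best_params, best_score = sweep(grid, toy_score)
print("Best parameter combination:", best_params)
```

With three values for each of two hyperparameters, the loop evaluates nine combinations. The cost grows quickly as values and hyperparameters are added, particularly when each combination is scored with cross-validation.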

Let's try hyperparameter sweeping to find the model with the best R<sup>2</sup> metric.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, r2_score

# Use a Gradient Boosting algorithm
alg = GradientBoostingRegressor()

# Try these hyperparameter values
params = {
 'learning_rate': [0.1, 0.5, 1.0],
 'n_estimators' : [50, 100, 150]
 }

# Find the best hyperparameter combination to optimize the R2 metric
score = make_scorer(r2_score)
gridsearch = GridSearchCV(alg, params, scoring=score, cv=3, return_train_score=True)
gridsearch.fit(X_train, y_train)
print("Best parameter combination:", gridsearch.best_params_, "\n")

# Get the best model
model=gridsearch.best_estimator_
print(model, "\n")

# Evaluate the model using the test data
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)

# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Predictions')
# overlay the regression line
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()

Pretty good - we've now improved the model a little more.

### Use categorical features

Up to now, we've trained the model using only the numeric features in the dataset. However, there are some categorical features that we could also make use of. Let's take a look at those.

In [None]:
import numpy as np

# plot a bar plot for each categorical feature count
categorical_features = ['car_name','cylinders','model_year','origin']

for col in categorical_features:
    counts = auto_data[col].value_counts().sort_index()
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    counts.plot.bar(ax = ax, color='steelblue')
    ax.set_title(col + ' counts')
    ax.set_xlabel(col) 
    ax.set_ylabel("Frequency")
plt.show()


There are (not unexpectedly) lots of distinct car names in the dataset. The cars all seem to have 3, 4, 5, 6, or 8 cylinders (with 4 cylinder cars being the most common), and there's a relatively even number of examples from each year between 1970 and 1982 (though there are more from 1973). The cars are manufactured in Asia, Europe, or North America (with many more North American cars than Asian or European).

Let's compare each of these categorical feature values with the label, by comparing the distribution and median *mpg* value for each category within each feature.

In [None]:
# plot a boxplot for the label by each categorical feature
for col in categorical_features:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    auto_data.boxplot(column = label.name, by = col, ax = ax)
    ax.set_title(label.name +' by ' + col)
    ax.set_ylabel(label.name)
plt.show()

There are definitely some differences in the label distributions depending on specific categories for cylinders, model year, and origin. It's hard to see any meaningful differences for car name because there are so many, and if you think about it logically, there's no real reason that the name of a car would influence its fuel efficiency.

So, let's exclude the car name feature, and convert the data type of the others to *categorical* so we know to treat them as categories and not integer numeric values.

In [None]:
auto_data_copy = auto_data.drop(['car_name'], axis=1).copy()

print("Data Types:")
print(auto_data_copy.dtypes)

# Change the cylinders and model_year columns to categorical
cat_columns = ['cylinders', 'model_year']
auto_data_copy[cat_columns] = auto_data_copy[cat_columns].astype("category")

print("\nConverted Data Types:")
print(auto_data_copy.dtypes)

Machine learning algorithms generally work with numeric feature values, even when the features are categorical, so we need to encode the text-based origins as integer representations. To do this, we'll use an ordinal encoder, which generates an integer code for each distinct origin value.

In [None]:
from sklearn.preprocessing import OrdinalEncoder

origin_encoder = OrdinalEncoder(dtype=int).fit(auto_data_copy[['origin']])

origin_codes = origin_encoder.transform(auto_data_copy[['origin']])
# transform returns a 2-D array with a single column; flatten it before assigning
auto_data_copy['origin_code'] = origin_codes.ravel()
auto_data_copy.head(20)
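The idea behind ordinal encoding can be sketched in a few lines of plain Python: collect the distinct category values, sort them, and use each value's position as its code. The sample list below is made up for illustration.

```python
# Hypothetical origin values for five cars
origins = ["North America", "Europe", "Asia", "North America", "Asia"]

# Assign each distinct category an integer code, in sorted order
categories = sorted(set(origins))
code_for = {category: code for code, category in enumerate(categories)}
origin_codes = [code_for[o] for o in origins]

print(code_for)
print(origin_codes)
```

One caveat worth keeping in mind is that integer codes imply an ordering (0 < 1 < 2) that has no real meaning for a feature like *origin*, and some algorithms may treat that ordering as if it were significant.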


Now that we have some categorical features, let's add them to the numeric features and use them to train a model.

In [None]:
# Separate features and labels
features = ['cylinders','displacement','horsepower','weight','accelleration','model_year','origin_code']
X, y = auto_data_copy[features].values, label.values

# Split data 70%-30% into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Fit a gradient boosting model on the training set
model = GradientBoostingRegressor(learning_rate=0.1, n_estimators=100).fit(X_train, y_train)

# test the model
predictions = model.predict(X_test)

mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)

rmse = np.sqrt(mse)
print("RMSE:", rmse)

r2 = r2_score(y_test, predictions)
print("R2:", r2)

plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Predictions')
# overlay the regression line
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()

Pretty good!

We previously encoded the *origin* feature using an ordinal integer value. An alternative approach is to use *one-hot* encoding, in which a column for each possible category value is added to the dataset, and a 1 or 0 is used to indicate whether a row belongs to each category.

Let's take a look.

In [None]:
from sklearn.preprocessing import OneHotEncoder

auto_data_copy = auto_data_copy.drop(['origin_code'], axis=1).copy()

origin_oh_encoder = OneHotEncoder(dtype=int)

origin_oh_codes = origin_oh_encoder.fit_transform(auto_data[['origin']])
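# Name the new columns after the encoded categories (get_feature_names_out requires scikit-learn 1.0 or later)
auto_data_copy[origin_oh_encoder.get_feature_names_out()] = origin_oh_codes.toarray()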
auto_data_copy.head(20)


Now, instead of a single *origin_code* feature with a value of 0, 1, or 2; there are three columns (one for each possible origin - Asia, Europe, or North America), and each automobile has a **1** for the feature representing its origin, and **0** for the others.
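To make the mechanics concrete, here's a plain-Python sketch of one-hot encoding on a few made-up origin values: one column per distinct category, with a 1 in the column that matches each row.

```python
# Hypothetical origin values for four cars
origins = ["North America", "Europe", "Asia", "North America"]

# One column per distinct category, in sorted order
categories = sorted(set(origins))
one_hot = [[1 if o == category else 0 for category in categories] for o in origins]

print(categories)
for origin, row in zip(origins, one_hot):
    print(row, origin)
```

Every row contains exactly one 1, so no category is treated as "larger" than another; the trade-off is that the number of columns grows with the number of distinct categories.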

Let's retrain the model using this encoding scheme.

In [None]:
# Separate features (original columns plus encoder results) and labels
features = np.append(['cylinders','displacement','horsepower','weight','accelleration','model_year'],origin_oh_encoder.get_feature_names_out())
X, y = auto_data_copy[features].values, label.values

# Split data 70%-30% into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Fit a gradient boosting model on the training set
model = GradientBoostingRegressor(learning_rate=0.1, n_estimators=100).fit(X_train, y_train)

# test the model
predictions = model.predict(X_test)

mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)

rmse = np.sqrt(mse)
print("RMSE:", rmse)

r2 = r2_score(y_test, predictions)
print("R2:", r2)

plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Predictions')
# overlay the regression line
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()

### Pre-process numeric features

Now we're using a combination of numeric and categorical features, and getting pretty good results from the algorithm we've selected for model training, but there's one more optimization we should explore. Currently, we're using the numeric values as-is, even though they're all measured on different scales. For example, *displacement* and *horsepower* are typically measured in the hundreds, while *weight* values are in the thousands, and *accelleration* is in the tens.

Why is this a problem? Well, in some algorithms, larger values may be disproportionately affected by coefficients determined during training. Put simply, features with larger values may outweigh other, smaller features - even if the smaller features are more predictive.

So what's the solution? We typically *normalize* the numeric values so they're on a similar scale (for example, by assigning all numeric features a value between 0 and 1, relative to their actual unscaled values).

Let's try that, using a MinMax scaler that assigns 0 to the minimum and 1 to the maximum value for each feature, and assigns relative values between 0 and 1 to the others.

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Get a scaler object
scaler = MinMaxScaler()

# Normalize the numeric columns
new_cols = ['displacement_norm','horsepower_norm','weight_norm','accelleration_norm']
auto_data_copy[new_cols] = scaler.fit_transform(auto_data_copy[numeric_features])

# Display the normalized values
auto_data_copy.head(20)
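The arithmetic behind min-max scaling is a single formula: subtract the minimum, then divide by the range. Here's a plain-Python sketch on a few made-up weight values; `min_max_scale` is a hypothetical helper written for this illustration.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1.
    Assumes the values are not all identical (the range would be zero)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical weights in pounds
weights = [2000, 2500, 3000, 4000]
scaled = min_max_scale(weights)
print(scaled)
```

The relative spacing of the values is preserved; only the scale changes, so a feature measured in thousands ends up on the same footing as one measured in tens.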

Now let's train the model again, this time using the normalized numeric features.

In [None]:
features = np.append(['cylinders','displacement_norm','horsepower_norm','weight_norm','accelleration_norm','model_year'],origin_oh_encoder.get_feature_names_out())
X, y = auto_data_copy[features].values, label.values

# Split data 70%-30% into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Fit a gradient boosting model on the training set
model = GradientBoostingRegressor(learning_rate=0.1, n_estimators=100).fit(X_train, y_train)

# test the model
predictions = model.predict(X_test)

mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)

rmse = np.sqrt(mse)
print("RMSE:", rmse)

r2 = r2_score(y_test, predictions)
print("R2:", r2)

plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Predictions')
# overlay the regression line
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()

### Create a pipeline

Alright, now we have a model that seems to produce reasonably accurate predictions. However, it's based on features that we had to perform some pre-processing tasks on to prepare them for the model, so when used to predict the *mpg* for a new car, it will only work if we apply the same preprocessing steps to the new features.

A better approach is to encapsulate all of the pre-processing as well as the model itself into a *pipeline*. We can use a pipeline to prepare the data and train the model, similarly to before, but the model that it produces includes all of the pre-processing steps - so we can submit new car data as it is, and the model will do all the necessary encoding and scaling before generating a prediction.
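The key idea is that the preprocessing settings learned during training (such as each feature's minimum and maximum) are stored alongside the model and reapplied to new data. The toy class below illustrates this with one feature, min-max scaling, and a simple linear fit. `ScaleThenPredict` is a hypothetical class written for this sketch and is far simpler than a Scikit-Learn pipeline.

```python
class ScaleThenPredict:
    """Toy pipeline: min-max scaling followed by a one-feature linear fit."""

    def fit(self, xs, ys):
        # Preprocessing settings are learned from the training data...
        self.lo, self.hi = min(xs), max(xs)
        scaled = [self._scale(x) for x in xs]
        # ...and the model is fitted on the preprocessed values
        mean_x = sum(scaled) / len(scaled)
        mean_y = sum(ys) / len(ys)
        self.slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(scaled, ys))
                      / sum((x - mean_x) ** 2 for x in scaled))
        self.intercept = mean_y - self.slope * mean_x
        return self

    def _scale(self, x):
        return (x - self.lo) / (self.hi - self.lo)

    def predict(self, xs):
        # New data is passed in raw; the stored scaling is applied automatically
        return [self.intercept + self.slope * self._scale(x) for x in xs]

# Hypothetical weights and mpg values
toy_pipeline = ScaleThenPredict().fit([2000, 3000, 4000], [30.0, 22.0, 14.0])
toy_predictions = toy_pipeline.predict([2500])
print(toy_predictions)
```

Notice that the caller passes an unscaled weight to `predict`; nothing outside the object needs to know how the training data was prepared.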

To see how this works, let's use a pipeline to normalize the numeric features, encode the categorical features, and train a model.

In [None]:
# Train the model
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.ensemble import GradientBoostingRegressor

# Separate features and labels
features = ['cylinders','displacement','horsepower','weight','accelleration','model_year','origin']
X, y = auto_data[features].values, label.values

# Split data 70%-30% into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Define preprocessing for numeric columns (scale them)
numeric_features = [1,2,3,4]
numeric_transformer = Pipeline(steps=[
    ('scaler', MinMaxScaler())])

# Define preprocessing for categorical features (one-hot encode them)
categorical_features = [0,5,6]
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)],
    remainder = 'drop' # Remove any other features
    )

# Create preprocessing and training pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', GradientBoostingRegressor(learning_rate=0.1, n_estimators=100))])


# fit the pipeline to train a gradient boosting model on the training set
model = pipeline.fit(X_train, y_train)
print (model)

The output shows the pipeline, which includes steps for normalizing and encoding before applying the trained Gradient Boosting regressor.

Let's try it with our test data.

In [None]:
predictions = model.predict(X_test)

mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)

rmse = np.sqrt(mse)
print("RMSE:", rmse)

r2 = r2_score(y_test, predictions)
print("R2:", r2)

plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Predictions')
# overlay the regression line
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()

If you find that encapsulating your training process in a pipeline results in a drop in performance, you can easily experiment with alternative algorithms just by changing the algorithm step, and using the same pre-processing transformer steps as before.

In [None]:
from sklearn.linear_model import LinearRegression

# Create preprocessing and training pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', LinearRegression())])


# fit the pipeline to train a linear regression model on the training set
model = pipeline.fit(X_train, (y_train))
print (model)


predictions = model.predict(X_test)

mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)

rmse = np.sqrt(mse)
print("RMSE:", rmse)

r2 = r2_score(y_test, predictions)
print("R2:", r2)

plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Predictions')
# overlay the regression line
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()

### Save and use the model

Now that we have a working model, let's save it so we can use it again later.

To save the model, we'll use the *joblib* library, which writes it to a *pickle* file (to preserve it - geddit?)

In [None]:
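import os
import joblib

# Save the model as a pickle file (creating the target folder if it doesn't exist)
filename = './models/auto.pkl'
os.makedirs(os.path.dirname(filename), exist_ok=True)
joblib.dump(model, filename)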
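If you're curious what saving and restoring an object looks like at its simplest, here's a round trip using Python's standard-library `pickle` module. The dictionary is a made-up stand-in for a trained model, and the temporary file path exists only for this sketch.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained model
toy_model = {"coefficients": [-0.01, -0.05], "intercept": 45.0}

# Write the object to a file, then read it back
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
with open(path, "wb") as f:
    pickle.dump(toy_model, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

print("Restored object matches the original:", restored == toy_model)
```

Loading a pickle file can execute arbitrary code, so only load model files that come from a source you trust.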

With the model saved, we can load it again later when we need it to predict the *mpg* for a new car.

In [None]:
# Load the model from the file
loaded_model = joblib.load(filename)

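# Create a numpy array containing details of a new car
# (dtype=object keeps the numbers numeric instead of converting every value to a string)
X_new = np.array([[8,302,140,3449,10.5,1970,'North America']], dtype=object)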
print ('New sample: {}'.format(list(X_new[0])))

# Use the model to predict mpg
result = loaded_model.predict(X_new)
print('Prediction: {:.0f} mpg'.format(result[0]))

The model's **predict** method accepts an array of observations, so you can use it to generate multiple predictions as a batch. For example, suppose you have details for three new cars; you could use the model to predict *mpg* for each car.

In [None]:
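# An array of features based on three new cars
# (dtype=object keeps the numbers numeric instead of converting every value to a string)
X_new = np.array([[8,302,140,3449,10.5,1970,'North America'],
                  [4,97,46,1835,20.5,1970,'Europe'],
                  [4,97,88,2130,14.5,1971,'Asia']], dtype=object)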

# Use the model to predict mpg
results = loaded_model.predict(X_new)
print('MPG predictions:')
for prediction in results:
    print(np.round(prediction))