# Basic regression: Predict fuel efficiency

*This notebook is based on the [tutorial notebook](https://www.tensorflow.org/tutorials/keras/regression) provided by TensorFlow.*

---
In a *regression* problem, we aim to predict the output of a continuous value, like a price or a probability. 

This notebook uses the classic [Auto MPG](https://archive.ics.uci.edu/ml/datasets/auto+mpg) Dataset, which you've already encountered in previous notebooks, and builds a model to predict the fuel efficiency of late-1970s and early 1980s automobiles. To do this, we'll provide the model with a description of many automobiles from that time period. This description includes attributes like: cylinders, displacement, horsepower, and weight.

This example uses the `tf.keras` API; see [this guide](https://www.tensorflow.org/guide/keras) for details.

## Setup

In [None]:
import datetime, time, os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Make numpy printouts easier to read.
np.set_printoptions(precision=3, suppress=True)

In [None]:
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers
import tensorflow.keras.layers as kl

print(tf.__version__)

### Setup for TensorBoard

We will use TensorBoard to visualize some results. You can find more information and the board itself at the end of this notebook, but we define the path where the logs should be stored here at the beginning.

In [None]:
# With this command you can clear any logs from previous runs
# If you want to compare different runs you can skip this cell 
!rm -rf my_logs/

In [None]:
# Define path for new directory 
root_logdir = os.path.join(os.curdir, "my_logs")

In [None]:
# Define function for creating a new folder for each run
def get_run_logdir():
    run_id = time.strftime('run_%d_%m_%Y-%H_%M_%S')
    return os.path.join(root_logdir, run_id)

In [None]:
run_logdir = get_run_logdir()

In [None]:
# Create the TensorBoard callback; "name" should be the name of the model you use
def get_callbacks(name):
    # Join with a path separator so each model logs to its own subfolder of the run
    return tf.keras.callbacks.TensorBoard(os.path.join(run_logdir, name), histogram_freq=1)
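
To make the timestamped folder name concrete, here is a small sketch that formats a fixed, made-up timestamp with the same format string used in `get_run_logdir()`. The variable names `fixed_time` and `example_logdir` and the chosen date are assumptions for illustration only.

```python
import os
import time

# Hypothetical fixed timestamp (1 March 2024, 14:30:05) so the result is reproducible.
fixed_time = time.strptime("2024-03-01 14:30:05", "%Y-%m-%d %H:%M:%S")

# Same format string as in get_run_logdir(): day, month, year, then the time.
run_id = time.strftime("run_%d_%m_%Y-%H_%M_%S", fixed_time)
example_logdir = os.path.join(os.curdir, "my_logs", run_id)

print(run_id)  # run_01_03_2024-14_30_05
print(example_logdir)
```

Because the run id contains the time down to the second, each new run gets its own folder, which is what lets TensorBoard show runs side by side.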

## The Auto MPG dataset

The dataset is available from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data).


### Get the data
First download and import the dataset using pandas:

In [None]:
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'

column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']

raw_dataset = pd.read_csv(url, names=column_names,
                          na_values='?', comment='\t',
                          sep=' ', skipinitialspace=True)

In [None]:
df = raw_dataset.copy()
df.head()

### Clean the data

The `Horsepower` column contains a few unknown values.

In [None]:
df.isna().sum()

Let's drop those rows to keep this tutorial simple.

In [None]:
df = df.dropna()

Let's check the datatypes and the number of unique values for each column.

In [None]:
df.info()

In [None]:
df.nunique()

The `"Origin"` column is really categorical, not numeric, so we have to convert it to a one-hot (dummy) encoding:

(Note: You can set up the `keras.Model` to do this kind of transformation for you. That's beyond the scope of this tutorial. See the [preprocessing layers](https://www.tensorflow.org/tutorials/structured_data/preprocessing_layers) or [Loading CSV data](https://www.tensorflow.org/tutorials/load_data/csv) tutorials for examples.)

In [None]:
df['Origin'] = df['Origin'].map({1: 'USA', 2: 'Europe', 3: 'Japan'})

In [None]:
df = pd.get_dummies(df, prefix='', prefix_sep='', dtype='uint8')
df.head()
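
The two steps above can be tried on a tiny frame to see what the encoding produces. This is a minimal sketch: the three rows and the name `toy` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Tiny made-up sample; the values are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "MPG": [18.0, 26.0, 31.0],
    "Origin": [1, 2, 3],
})

# Step 1: replace the numeric codes with readable category names.
toy["Origin"] = toy["Origin"].map({1: "USA", 2: "Europe", 3: "Japan"})

# Step 2: expand the string column into one indicator column per category.
encoded = pd.get_dummies(toy, prefix="", prefix_sep="", dtype="uint8")
print(encoded)
```

Each row ends up with exactly one `1` among the `Europe`, `Japan`, and `USA` columns, so no artificial ordering between the origins is implied.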

### Split the data into train and test

Now we'll split the dataset into a training set and a test set.

We will use the test set in the final evaluation of our models.

In [None]:
df_train = df.sample(frac=0.8, random_state=0)
df_test = df.drop(df_train.index)
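
The sample-then-drop pattern can be checked on a small stand-in frame. The sketch below assumes a frame with unique index labels (here a hypothetical ten-row `toy` frame); dropping by index only yields a clean complement when the labels are unique.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned dataset: ten rows with a default integer index.
toy = pd.DataFrame({"MPG": [float(v) for v in range(10, 20)]})

# Draw 80% of the rows at random (reproducibly), then keep the remainder for testing.
toy_train = toy.sample(frac=0.8, random_state=0)
toy_test = toy.drop(toy_train.index)

print(len(toy_train), len(toy_test))
print(set(toy_train.index) & set(toy_test.index))  # empty set: no row is in both parts
```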

### Inspecting the data

Let's have a quick look at the joint distribution of a few pairs of columns from the training set.

Looking at the top row, it should be clear that the fuel efficiency (MPG) is a function of all the other parameters. Looking at the other rows, it should be clear that they are functions of each other.

In [None]:
sns.pairplot(df_train[['MPG', 'Cylinders', 'Displacement', 'Weight']], diag_kind='kde');

Also look at the overall statistics. Note how each feature covers a very different range:

In [None]:
df_train.describe().transpose()

### Split features from labels

Before we can start with the modelling process we need to separate our label from the dataset. This label is the value that we will train the model to predict.

In [None]:
X_train = df_train.copy()
X_test = df_test.copy()

y_train = X_train.pop('MPG')
y_test = X_test.pop('MPG')

## Sklearn

Before we start using TensorFlow and Keras, let's train a simple `LinearRegression` model from sklearn for comparison. 

In [None]:
# Scaling the data 
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
# Initializing and training the model
lin_reg = LinearRegression()
lin_reg.fit(X_train_scaled, y_train)

In [None]:
# Making predictions 
y_pred = lin_reg.predict(X_test_scaled)

In [None]:
# Evaluating the model (the built-in round works for both Python and NumPy floats)
mae = round(mean_absolute_error(y_test, y_pred), 2)
mse = round(mean_squared_error(y_test, y_pred), 2)

print('MAE:', mae)
print('MSE:', mse)

We'll store the result in a dictionary in order to compare the results of different models in the end.

In [None]:
test_results = {}
test_results['sklearn_model'] =  [mae, mse]

## Normalization

In the table of statistics it's easy to see how different the ranges of each feature are.

In [None]:
X_train.describe().transpose()[['mean', 'std']]

It is good practice to normalize features that use different scales and ranges. 

One reason this is important is because the features are multiplied by the model weights. So the scale of the outputs and the scale of the gradients are affected by the scale of the inputs. 

Although a model *might* converge without feature normalization, normalization makes training much more stable. 
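
The effect of input scale on gradients can be seen in a small NumPy sketch. Everything here is an assumption made for illustration: the synthetic data, the helper `mse_gradient_at_zero`, and the use of a squared-error loss (chosen because its gradient has a simple closed form, whereas the models below use the mean absolute error).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic features on very different scales (made-up numbers, loosely car-like).
weight = rng.normal(3000.0, 800.0, size=n)
cylinders = rng.normal(5.5, 1.7, size=n)
X = np.column_stack([weight, cylinders])
y = 45.0 - 0.005 * weight - 1.0 * cylinders + rng.normal(0.0, 1.0, size=n)

def mse_gradient_at_zero(features, target):
    """Gradient of mean((features @ w - target)**2) w.r.t. w, evaluated at w = 0."""
    return -2.0 * features.T @ target / len(target)

raw_grad = mse_gradient_at_zero(X, y)

# Standardize each column: subtract its mean, divide by its standard deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_grad = mse_gradient_at_zero(X_scaled, y)

# Ratio between the two gradient components, before and after scaling.
raw_ratio = abs(raw_grad[0] / raw_grad[1])
scaled_ratio = abs(scaled_grad[0] / scaled_grad[1])
print(raw_ratio, scaled_ratio)
```

On the raw data, the gradient component for the large-scale feature is hundreds of times larger than the other one, so a single learning rate cannot suit both weights. After standardization the two components are of similar size.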

### The Normalization layer
The `layers.Normalization` layer (imported in this notebook as `kl.Normalization`) is a clean and simple way to build that preprocessing into your model.

The first step is to create the layer:

In [None]:
normalizer = kl.Normalization()

Then `.adapt()` it to the data:

In [None]:
X_train

In [None]:
normalizer.adapt(X_train.values)  # adapt expects an array

This calculates the mean and variance, and stores them in the layer. 

In [None]:
normalizer.mean.numpy()

When the layer is called it returns the input data, with each feature independently normalized. We can have a look at the first training instance and compare the original and normalized features:

In [None]:
first = X_train[:1]*1.

with np.printoptions(precision=2, suppress=True):
    print('First example:', first)
    print()
    print('Normalized:', normalizer(first.values).numpy())
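
As a rough picture of what happens inside the layer, the following NumPy sketch reproduces the two steps by hand: learn a mean and variance per feature, then shift and rescale. It is an approximation under stated assumptions: the four rows are made up, and the small numerical safeguard that the real layer applies against division by zero is ignored.

```python
import numpy as np

# Made-up batch: 4 examples, 2 features on very different scales.
data = np.array([
    [130.0, 3504.0],
    [165.0, 3693.0],
    [95.0, 2372.0],
    [88.0, 2130.0],
])

# "adapt" step: learn one mean and one variance per feature (column).
mean = data.mean(axis=0)
variance = data.var(axis=0)

# "call" step: shift and rescale every feature independently.
normalized = (data - mean) / np.sqrt(variance)

print(normalized.mean(axis=0))  # approximately 0 for each feature
print(normalized.std(axis=0))   # approximately 1 for each feature
```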

## Linear regression

Before building a DNN (deep neural network) model, let's start with a linear regression.

### One Variable

We'll start easy with a single-variable linear regression, to predict `MPG` from `Horsepower`.

Training a model with `tf.keras` typically starts by defining the model architecture.

In this case we'll use a `keras.Sequential` model. This model represents a sequence of steps. In this case there are two steps:

1. Normalize the input `Horsepower`.
2. Apply a linear transformation ($y = mx+b$) to produce 1 output using `layers.Dense`.

The number of _inputs_ can either be set by the `input_shape` argument, or automatically when the model is run for the first time.

1. First create the horsepower `Normalization` layer:

In [None]:
kl.Input(shape=[1,])

In [None]:
horsepower = np.array(X_train['Horsepower'])  # equivalent to X_train['Horsepower'].values

horsepower_normalizer = kl.Normalization(input_shape=[1,], axis=None)
horsepower_normalizer.adapt(horsepower)

2. Then we'll build the sequential model:

In [None]:
horsepower_model = tf.keras.Sequential([
    horsepower_normalizer,
    layers.Dense(units=1)
])

# We can print a summary of the model architecture with the following line:
horsepower_model.summary()

This model will predict `MPG` from `Horsepower`.

Run the untrained model on the first 10 horsepower values. The output won't be good, but you'll see that it has the expected shape, `(10,1)`:

In [None]:
horsepower_model.predict(horsepower[:10]).shape

In [None]:
horsepower_model.predict(horsepower[:10])

Once the model is built, configure the training procedure using the `Model.compile()` method. The most important arguments to `compile` are the `loss` and the `optimizer`, since these define what will be optimized (`mean_absolute_error`) and how (using `optimizers.Adam`). We'll also set `metrics` to track the `mean_squared_error`.

In [None]:
horsepower_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
    loss=['mae'],
    metrics=['mse'])

Once the training is configured, use `Model.fit()` to execute the training:

In [None]:
%%time
history = horsepower_model.fit(
    X_train['Horsepower'], y_train,
    epochs=100,
    # suppress logging (if you want to see the output for the different epochs set the value to 1)
    verbose=0,
    # Calculate validation results on 20% of the training data
    validation_split = 0.2,
    # Store information for TensorBoard
    callbacks=get_callbacks("horsepower_model"))
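
A note on `validation_split`: according to the Keras documentation, the validation samples are taken from the end of the provided data, before any shuffling. The sketch below mimics that behaviour with plain NumPy; the helper `split_like_validation_split` is hypothetical and only meant to illustrate the idea, not to reproduce the Keras internals.

```python
import math
import numpy as np

def split_like_validation_split(x, y, validation_split):
    """Hypothetical helper: hold out the last fraction of the samples, without shuffling."""
    split_at = int(math.floor(len(x) * (1.0 - validation_split)))
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])

x = np.arange(10, dtype="float32")
y = 2.0 * x

(train_x, train_y), (val_x, val_y) = split_like_validation_split(x, y, 0.2)
print(train_x)  # the first 8 samples are used for training
print(val_x)    # the last 2 samples are held out for validation
```

Since the training set in this notebook was drawn with a random sample, its last 20% should already be a reasonably random subset.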

Visualize the model's training progress using the stats stored in the `history` object.

In [None]:
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch

# Show results from first 5 epochs
hist.head() 

In [None]:
# Show results from last 5 epochs (loss and val_loss decreased)
hist.tail()

In [None]:
def plot_loss(history):
    plt.plot(history.history['loss'], label='loss')
    plt.plot(history.history['val_loss'], label='val_loss')
    plt.ylim([0, 10])
    plt.xlabel('Epoch')
    plt.ylabel('Error [MPG]')
    plt.legend()
    plt.grid(True)

In [None]:
plot_loss(history)

We'll store the results on the test set for later:

In [None]:
test_results['horsepower_model'] = horsepower_model.evaluate(
    X_test['Horsepower'],
    y_test, verbose=0)

Since this is a single-variable regression, it's easy to look at the model's predictions as a function of the input:

In [None]:
# We'll predict the MPG for 251 different values for horsepower 
x = tf.linspace(0.0, 250, 251)
y = horsepower_model.predict(x)

In [None]:
def plot_horsepower(x, y):
    plt.scatter(X_train['Horsepower'], y_train, label='Data')
    plt.plot(x, y, color='k', label='Predictions')
    plt.xlabel('Horsepower')
    plt.ylabel('MPG')
    plt.legend()

In [None]:
plot_horsepower(x,y)

### Multiple inputs

You can use an almost identical setup to make predictions based on multiple inputs. This model still does the same $y = mx+b$ except that $m$ is a matrix and $b$ is a vector.

This time we'll use the `Normalization` layer that was adapted to the whole dataset.

In [None]:
linear_model = tf.keras.Sequential([
    normalizer,
    layers.Dense(units=1)
])

When you call this model on a batch of inputs, it produces `units=1` outputs for each example.

In [None]:
linear_model.predict(X_train[:10]*1.)

When you call the model, its weight matrices are built. Now you can see that the `kernel` (the $m$ in $y=mx+b$) has a shape of `(9,1)`.

In [None]:
linear_model.layers[1].kernel
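
The shapes involved can be checked with a NumPy sketch of the same computation, $y = xm + b$ applied to a batch. The random values below are placeholders standing in for normalized inputs and for the layer's `kernel` and `bias`; they are not the weights of the model above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten made-up, already-normalized examples with 9 features each.
batch = rng.normal(size=(10, 9))

# Randomly initialized parameters standing in for the layer's kernel and bias.
kernel = rng.normal(size=(9, 1))  # the matrix m
bias = np.zeros(1)                # the vector b

# y = x @ m + b: one output per example.
outputs = batch @ kernel + bias
print(outputs.shape)  # (10, 1)
```

With nine inputs and one output unit, the layer holds nine weights plus one bias, so ten trainable parameters in total.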

Use the same `compile` and `fit` calls as for the single-input `horsepower` model:

In [None]:
linear_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
    loss=['mae'],
    metrics=['mse'])

In [None]:
%%time
history = linear_model.fit(
    X_train*1., y_train, 
    epochs=100,
    # suppress logging (again: change it to 1 if you want to print more information about each epoch)
    verbose=0,
    # Calculate validation results on 20% of the training data
    validation_split = 0.2,
    # Store information for TensorBoard
    callbacks=get_callbacks("linear_model"))

Using all the inputs achieves a much lower training and validation error than the `horsepower` model: 

In [None]:
plot_loss(history)

We'll collect the test set result in our previously defined dictionary for later:

In [None]:
test_results['linear_model'] = linear_model.evaluate(
    np.array(X_test).astype('float32'), y_test, verbose=0)

## A DNN regression

The previous section implemented linear models for single and multiple inputs.

This section implements single-input and multiple-input DNN models. The code is basically the same, except that the model is expanded to include some "hidden" non-linear layers. The name "hidden" here just means not directly connected to the inputs or outputs.

These models will contain a few more layers than the linear model:

1. The normalization layer.
2. Two hidden, nonlinear, `Dense` layers using the `gelu` nonlinearity.
3. A linear single-output layer.

Both will use the same training procedure so we'll include the `compile` method in the newly defined `build_and_compile_model` function below.

In [None]:
def build_and_compile_model(norm):
    model = keras.Sequential([
        norm,
        layers.Dense(64, activation='gelu'),
        layers.Dense(64, activation='gelu'),
        layers.Dense(1)
    ])

    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss=['mae'],
        metrics=['mse'],)
    return model
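
For intuition about the `gelu` nonlinearity used in the hidden layers, here is a sketch of its exact form, $x \cdot \Phi(x)$, where $\Phi$ is the standard normal CDF. The function below is a hand-written illustration; depending on the Keras version, the built-in activation may use a tanh-based approximation instead, so values can differ slightly.

```python
import math

def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for value in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(value, round(gelu(value), 4))
```

For large positive inputs the function behaves like the identity, for large negative inputs it approaches zero, and in between it bends smoothly. That bend is what allows the stacked `Dense` layers to model curves rather than straight lines.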

### One variable

Let's start again with a DNN model for a single input: "Horsepower".

We can use the normalization layer we've created for our first model. 

In [None]:
dnn_horsepower_model = build_and_compile_model(horsepower_normalizer)

This model has quite a few more trainable parameters than the linear models.

In [None]:
dnn_horsepower_model.summary()
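
The trainable parameter counts can be worked out by hand: a `Dense` layer with `n_inputs` inputs and `n_units` units has `n_inputs * n_units` weights plus `n_units` biases. The helpers below are hypothetical and count only the `Dense` layers; the statistics stored by the normalization layer are not trained and are left out.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_inputs * n_units + n_units

def mlp_params(n_features):
    """Two hidden layers of 64 units followed by a single output unit."""
    return dense_params(n_features, 64) + dense_params(64, 64) + dense_params(64, 1)

linear_horsepower = dense_params(1, 1)
dnn_horsepower = mlp_params(1)
print(linear_horsepower, dnn_horsepower)  # 2 4353
```

Most of the parameters sit in the connection between the two hidden layers (64 * 64 + 64 = 4160), regardless of how many input features there are.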

Let's train the model:

In [None]:
%%time
history = dnn_horsepower_model.fit(
    X_train['Horsepower'], y_train,
    validation_split=0.2,
    verbose=0, epochs=100,
    callbacks=get_callbacks("dnn_horsepower_model"))

This model does slightly better than the linear-horsepower model.

In [None]:
plot_loss(history)

If you plot the predictions as a function of `Horsepower`, you'll see how this model takes advantage of the nonlinearity provided by the hidden layers:

In [None]:
x = tf.linspace(0.0, 250, 251)
y = dnn_horsepower_model.predict(x)

In [None]:
plot_horsepower(x, y)

We'll also collect the results on the test set, for later:

In [None]:
# Evaluate on the held-out test set (not the training data)
test_results['dnn_horsepower_model'] = dnn_horsepower_model.evaluate(
    X_test['Horsepower'], y_test,
    verbose=0)

### Full model

If you repeat this process using all the inputs it slightly improves the performance on the validation dataset.

In [None]:
dnn_model = build_and_compile_model(normalizer)
dnn_model.summary()

In [None]:
%%time
history = dnn_model.fit(
    X_train*1., y_train,
    validation_split=0.2,
    verbose=0, epochs=100,
    callbacks=get_callbacks("dnn_model"))

In [None]:
plot_loss(history)

Collect the results on the test set:

In [None]:
test_results['dnn_model'] = dnn_model.evaluate(X_test, y_test, verbose=0)

## Performance

Now that all the models are trained let's check the test-set performance and compare how they performed:

In [None]:
pd.DataFrame(test_results, index=['Mean absolute error [MPG]', 'Mean squared error [MPG]']).T

These results match the validation error seen during training.

### Make predictions

Finally, have a look at the errors made by the model when making predictions on the test set:

In [None]:
y_pred = dnn_model.predict(X_test).flatten()

a = plt.axes(aspect='equal')
plt.scatter(y_test, y_pred)
plt.xlabel('True Values [MPG]')
plt.ylabel('Predictions [MPG]')
lims = [0, 50]
plt.xlim(lims)
plt.ylim(lims)
_ = plt.plot(lims, lims)

It looks like the model predicts reasonably well. 

Now take a look at the error distribution:

In [None]:
error = y_pred - y_test
plt.hist(error, bins=25)
plt.xlabel('Prediction Error [MPG]')
_ = plt.ylabel('Count')

If you're happy with the model you can save it for later use:

In [None]:
dnn_model.save('dnn_model.keras')

If you reload the model, it will give you identical outputs:

In [None]:
reloaded = tf.keras.models.load_model('dnn_model.keras')

test_results['reloaded'] = reloaded.evaluate(
    X_test, y_test, verbose=0)

In [None]:
pd.DataFrame(test_results, index=['Mean absolute error [MPG]', 'Mean squared error [MPG]']).T

## Conclusion

This notebook introduced a few techniques to handle a regression problem. Here are a few more tips that may help:

* [Mean Squared Error (MSE)](https://www.tensorflow.org/api_docs/python/tf/losses/MeanSquaredError) and [Mean Absolute Error (MAE)](https://www.tensorflow.org/api_docs/python/tf/losses/MeanAbsoluteError) are common loss functions used for regression problems. Mean Absolute Error is less sensitive to outliers. Different loss functions are used for classification problems.
* Similarly, evaluation metrics used for regression differ from classification.
* When numeric input data features have values with different ranges, each feature should be scaled independently to the same range.
* Overfitting is a common problem for DNN models, although it wasn't a problem in this tutorial. See the [overfit and underfit](https://www.tensorflow.org/tutorials/keras/overfit_and_underfit) tutorial for more help with this.
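
To illustrate the first tip, that MAE is less sensitive to outliers than MSE, here is a small worked example. The numbers are made up, and the helper functions `mae` and `mse` are written by hand for illustration rather than taken from a library.

```python
import numpy as np

y_true = np.array([20.0, 25.0, 30.0, 35.0])

# Predictions that are each off by 1 MPG ...
y_close = y_true + 1.0
# ... and the same predictions with a single large miss of 10 MPG (made-up outlier).
y_outlier = y_true + np.array([1.0, 1.0, 1.0, 10.0])

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

print(mae(y_true, y_close), mse(y_true, y_close))      # 1.0 1.0
print(mae(y_true, y_outlier), mse(y_true, y_outlier))  # 3.25 25.75
```

A single bad prediction roughly triples the MAE but multiplies the MSE by more than 25, because squaring amplifies large errors.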


---
## TensorBoard

In machine learning, to improve something you often need to be able to measure it. TensorBoard is a tool for providing the measurements and visualizations needed during the machine learning workflow. It enables tracking experiment metrics like loss and accuracy, visualizing the model graph, projecting embeddings to a lower dimensional space, and much more.

TensorBoard can be used directly within a notebook in Colab and Jupyter. This can be helpful for sharing results, integrating TensorBoard into existing workflows, and using TensorBoard without installing anything locally.

You can find a nice introductory notebook [here](https://www.tensorflow.org/tensorboard/get_started).

If you are running this notebook in a browser and TensorBoard cannot be displayed, try a different browser: for example, if you are using Safari, switch to Google Chrome and run it again.

### Example using TensorBoard

When training with Keras's `.fit()`, adding the `tf.keras.callbacks.TensorBoard` callback ensures that logs are created and stored. Additionally, enable histogram computation every epoch with `histogram_freq=1` (this is turned off by default).

We've saved the logs in a timestamped subdirectory to allow easy selection of different training runs.

In [None]:
# Load the TensorBoard notebook extension
%load_ext tensorboard

You can start TensorBoard within the notebook using magics. At the end of the command you need to specify the path where the log files are saved.

In [None]:
%tensorboard --logdir=./my_logs