In [None]:
# Based on the TensorFlow tutorial: https://www.tensorflow.org/tutorials/keras/regression
# Modified by Mehdi Ammi, Univ. Paris 8

# TensorFlow: Regression for Predictive Modeling

In regression tasks, the goal is to forecast the output of a continuous variable, such as a price or a probability. This differs from classification problems, where the objective is to choose a class from a set of classes (for instance, identifying whether a picture shows an apple or an orange).

This tutorial employs the renowned Auto MPG dataset to illustrate how to construct models to predict the fuel efficiency of cars from the late 1970s and early 1980s. You will provide the models with detailed information about numerous cars from that era, including attributes like cylinders, displacement, horsepower, and weight.

We'll utilize the Keras API in this example. (Refer to the Keras tutorials and guides for more information.)

## Install packages & import libraries

In [None]:
# Install the seaborn library for data visualization.
!pip install -q seaborn

In [None]:
# Import libraries for plotting, numerical operations, data manipulation, and visualization.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Set NumPy print options for better readability.
np.set_printoptions(precision=3, suppress=True)

In [None]:
# Import TensorFlow and Keras for building neural networks.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Print the TensorFlow version.
print(tf.__version__)

## The Auto MPG dataset

The dataset is available from the UCI Machine Learning Repository.

### Get the data
First download and import the dataset using pandas:

In [None]:
# URL of the dataset to be loaded.
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'

# Define the column names for the dataset.
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']

# Load the dataset from the URL, specifying column names, handling missing values, and removing comments.
raw_dataset = pd.read_csv(url, names=column_names,
                          na_values='?', comment='\t',
                          sep=' ', skipinitialspace=True)

# Create a copy of the dataset for further manipulation and analysis.
dataset = raw_dataset.copy()

# Display the last few rows of the dataset.
dataset.tail()

|index|MPG|Cylinders|Displacement|Horsepower|Weight|Acceleration|Model Year|Origin|
|---|---|---|---|---|---|---|---|---|
|393|27\.0|4|140\.0|86\.0|2790\.0|15\.6|82|1|
|394|44\.0|4|97\.0|52\.0|2130\.0|24\.6|82|2|
|395|32\.0|4|135\.0|84\.0|2295\.0|11\.6|82|1|
|396|28\.0|4|120\.0|79\.0|2625\.0|18\.6|82|1|
|397|31\.0|4|119\.0|82\.0|2720\.0|19\.4|82|1|

### Clean the data

The dataset has a few missing values:

In [None]:
dataset.isna().sum()

In [None]:
>>
MPG             0
Cylinders       0
Displacement    0
Horsepower      6
Weight          0
Acceleration    0
Model Year      0
Origin          0
dtype: int64

Drop those rows to keep this initial tutorial simple:

In [None]:
dataset = dataset.dropna()

The "Origin" column is categorical, not numeric. So the next step is to one-hot encode the values in the column with pd.get_dummies.

In [None]:
# Map the values in the 'Origin' column to country labels.
dataset['Origin'] = dataset['Origin'].map({1: 'USA', 2: 'Europe', 3: 'Japan'})

# Convert the 'Origin' column to indicator variables (one-hot encoding).
dataset = pd.get_dummies(dataset, columns=['Origin'], prefix='', prefix_sep='')

# Display the last few rows of the dataset.
dataset.tail()

|index|MPG|Cylinders|Displacement|Horsepower|Weight|Acceleration|Model Year|Europe|Japan|USA|
|---|---|---|---|---|---|---|---|---|---|---|
|393|27\.0|4|140\.0|86\.0|2790\.0|15\.6|82|false|false|true|
|394|44\.0|4|97\.0|52\.0|2130\.0|24\.6|82|true|false|false|
|395|32\.0|4|135\.0|84\.0|2295\.0|11\.6|82|false|false|true|
|396|28\.0|4|120\.0|79\.0|2625\.0|18\.6|82|false|false|true|
|397|31\.0|4|119\.0|82\.0|2720\.0|19\.4|82|false|false|true|
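To make the encoding step concrete, here is a minimal sketch of one-hot encoding done by hand with NumPy. The five origin codes are made up for illustration; only the code-to-country mapping comes from the cell above.

```python
import numpy as np

# Hypothetical origin codes for five cars (1 = USA, 2 = Europe, 3 = Japan).
origin_codes = np.array([1, 2, 1, 3, 1])

# Same mapping as in the tutorial cell above.
code_to_label = {1: 'USA', 2: 'Europe', 3: 'Japan'}

# Alphabetical column order, matching the table produced by pd.get_dummies.
labels = ['Europe', 'Japan', 'USA']

# One row per car, with a single True in the column of its origin.
one_hot = np.array([[code_to_label[int(code)] == label for label in labels]
                    for code in origin_codes])

print(labels)
print(one_hot)
```

Each row contains exactly one True value, so the model sees three independent indicator features instead of an arbitrary numeric code.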

### Split the data into training and testing sets

Now, split the dataset into a training set and a test set:

In [None]:
# Sample 80% of the data to create the training dataset.
train_dataset = dataset.sample(frac=0.8, random_state=0)

# Use the remaining 20% of the data to create the testing dataset.
test_dataset = dataset.drop(train_dataset.index)
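The same sample-then-drop idea can be sketched with plain NumPy index arrays. This is only an illustration on a hypothetical 10-row dataset; it does not reproduce the exact rows that `dataset.sample` selects.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, playing the role of random_state=0
n_rows = 10                      # hypothetical dataset size
indices = np.arange(n_rows)

# Draw 80% of the row indices without replacement for training.
train_idx = rng.choice(indices, size=int(0.8 * n_rows), replace=False)

# Every index that was not drawn goes to the test set.
test_idx = np.setdiff1d(indices, train_idx)

print(sorted(train_idx.tolist()), test_idx.tolist())
```

Because the test indices are defined as the complement of the training indices, no row can end up in both sets.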

### Inspect the data

Review the joint distribution of a few pairs of columns from the training set.

The top row suggests that the fuel efficiency (MPG) is a function of all the other parameters. The other rows indicate they are functions of each other.

In [None]:
sns.pairplot(train_dataset[['MPG', 'Cylinders', 'Displacement', 'Weight']], diag_kind='kde')

![reg_plot-1.png](attachment:6f28749a-fc9b-4111-a335-dc9e39050bd1.png)

Let's also check the overall statistics. Note how each feature covers a very different range:

In [None]:
train_dataset.describe().transpose()

|index|count|mean|std|min|25%|50%|75%|max|
|---|---|---|---|---|---|---|---|---|
|MPG|318\.0|23\.590566037735847|7\.913617162025714|10\.0|17\.125|22\.75|29\.0|46\.6|
|Cylinders|318\.0|5\.427672955974843|1\.6829413919287102|3\.0|4\.0|4\.0|6\.0|8\.0|
|Displacement|318\.0|193\.06132075471697|103\.8127417257744|70\.0|100\.25|151\.0|259\.5|455\.0|
|Horsepower|313\.0|104\.06709265175719|38\.67466171160924|46\.0|75\.0|92\.0|120\.0|230\.0|
|Weight|318\.0|2963\.8238993710693|844\.7498054897484|1613\.0|2219\.25|2792\.5|3571\.25|5140\.0|
|Acceleration|318\.0|15\.595911949685535|2\.796282280384398|8\.0|13\.9|15\.5|17\.3|24\.8|
|Model Year|318\.0|75\.94654088050315|3\.7052657537475624|70\.0|73\.0|76\.0|79\.0|82\.0|

### Split features from labels

Separate the target value—the "label"—from the features. This label is the value that you will train the model to predict.

In [None]:
# Create a copy of the training and testing datasets to separate features and labels.
train_features = train_dataset.copy()
test_features = test_dataset.copy()

# Remove the target variable 'MPG' from the features dataset and store it separately as labels.
train_labels = train_features.pop('MPG')
test_labels = test_features.pop('MPG')

### Normalization

In the table of statistics it's easy to see how different the ranges of each feature are:

In [None]:
train_dataset.describe().transpose()[['mean', 'std']]

|index|mean|std|
|---|---|---|
|MPG|23\.590566037735847|7\.913617162025714|
|Cylinders|5\.427672955974843|1\.6829413919287102|
|Displacement|193\.06132075471697|103\.8127417257744|
|Horsepower|104\.06709265175719|38\.67466171160924|
|Weight|2963\.8238993710693|844\.7498054897484|
|Acceleration|15\.595911949685535|2\.796282280384398|
|Model Year|75\.94654088050315|3\.7052657537475624|

It is good practice to normalize features that use different scales and ranges.

One reason this matters is that the features are multiplied by the model weights, so the scale of the inputs affects both the scale of the outputs and the scale of the gradients.

Although a model might converge without feature normalization, normalization makes training much more stable.
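The effect of input scale on the gradients can be seen in a small NumPy sketch. The three rows are invented values on the scale of 'Weight' and 'Cylinders', and the sketch uses a squared-error loss for simplicity (an assumption; the models in this tutorial use mean absolute error).

```python
import numpy as np

# Hypothetical mini-batch: column 0 is on the scale of 'Weight',
# column 1 on the scale of 'Cylinders'.
x = np.array([[3500.0, 8.0],
              [2200.0, 4.0],
              [2900.0, 6.0]])
y = np.array([15.0, 30.0, 22.0])   # hypothetical MPG labels
w = np.zeros(2)                    # untrained weights

def mse_gradient(x, y, w):
    # Gradient of mean((x @ w - y) ** 2) with respect to w.
    residual = x @ w - y
    return 2.0 * (residual[:, None] * x).mean(axis=0)

# Gradients on the raw features differ by orders of magnitude.
raw_grad = mse_gradient(x, y, w)

# After standardizing each column, the gradients are on a comparable scale.
x_norm = (x - x.mean(axis=0)) / x.std(axis=0)
norm_grad = mse_gradient(x_norm, y, w)

print('raw gradient:       ', raw_grad)
print('normalized gradient:', norm_grad)
```

With raw inputs, a single learning rate is either too large for the 'Weight' column or too small for the 'Cylinders' column, which is one way unnormalized features make training less stable.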

### The Normalization layer

The tf.keras.layers.Normalization layer is a clean and simple way to add feature normalization to your model.

The first step is to create the layer:

In [None]:
normalizer = tf.keras.layers.Normalization(axis=-1)

Then, fit the state of the preprocessing layer to the data by calling Normalization.adapt:

In [None]:
# Convert train_features to a NumPy array with dtype float32 before adapting the normalizer.
train_features_array = np.array(train_features, dtype=np.float32)

# Adapt the normalizer to the training features.
normalizer.adapt(train_features_array)

Calling adapt calculates the mean and variance of each feature and stores them in the layer:

In [None]:
print(normalizer.mean.numpy())

In [None]:
>>
[[   5.428  193.061      nan 2963.824   15.596   75.947    0.164    0.195
     0.642]]

When the layer is called, it returns the input data, with each feature independently normalized:

In [None]:
# Convert boolean columns to float (0.0 and 1.0)
train_features = train_features.astype(float)

# Extract the first row of the training features as a NumPy array with dtype float32.
first = np.array(train_features[:1], dtype=np.float32)

with np.printoptions(precision=2, suppress=True):
  print('First example:', first)
  print()
  print('Normalized:', normalizer(first).numpy())
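As a rough NumPy equivalent of what the layer computes, each column is shifted by its mean and divided by the square root of its variance. The feature values below are hypothetical, and the sketch ignores the small epsilon the real layer uses to avoid division by zero.

```python
import numpy as np

# Hypothetical training features: two columns on very different scales.
features = np.array([[4.0, 2000.0],
                     [6.0, 3000.0],
                     [8.0, 4000.0]], dtype=np.float32)

# Statistics computed once from the training data (the role of adapt).
mean = features.mean(axis=0)
variance = features.var(axis=0)

def normalize(batch):
    # Applied on every call: each feature is normalized independently.
    return (batch - mean) / np.sqrt(variance)

normalized = normalize(features)
print(normalized)

# A single missing value makes that column's mean NaN.
with_missing = np.array([[4.0, 2000.0],
                         [np.nan, 3000.0]])
print(with_missing.mean(axis=0))
```

The last lines show that a missing value left in a column propagates to that column's mean, so the statistics are only meaningful when the rows with missing values have actually been dropped before adapting.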

## Linear regression

Before building a deep neural network model, start with linear regression using one and several variables.

### Linear regression with one variable
Begin with a single-variable linear regression to predict 'MPG' from 'Horsepower'.

Training a model with tf.keras typically starts by defining the model architecture. Use a tf.keras.Sequential model, which represents a sequence of steps.

There are two steps in your single-variable linear regression model:

 - Normalize the 'Horsepower' input features using the tf.keras.layers.Normalization preprocessing layer.
 - Apply a linear transformation (y = mx + b) to produce 1 output using a linear layer (tf.keras.layers.Dense).
 
The number of inputs can either be set by the input_shape argument, or automatically when the model is run for the first time.
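The two steps can be sketched in plain NumPy as follows. The horsepower values and the slope and intercept are placeholders chosen for illustration, not values learned by the Keras model.

```python
import numpy as np

# Hypothetical horsepower values standing in for the training column.
horsepower_values = np.array([130.0, 165.0, 150.0, 95.0])

# Step 1: statistics the normalization layer would learn from the data.
hp_mean = horsepower_values.mean()
hp_variance = horsepower_values.var()

# Step 2: the two trainable parameters of the linear layer (arbitrary here).
m, b = 0.5, 0.0

def predict_mpg(hp):
    # Normalize the input, then apply y = m * x + b.
    z = (hp - hp_mean) / np.sqrt(hp_variance)
    return m * z + b

print(predict_mpg(horsepower_values))
```

The linear layer has only two trainable values, m and b; the mean and variance are computed from the data and are not updated by training.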

First, create a NumPy array made of the 'Horsepower' features. Then, instantiate the tf.keras.layers.Normalization and fit its state to the horsepower data:

In [None]:
# Extract the 'Horsepower' column from the training features as a NumPy array.
horsepower = np.array(train_features['Horsepower'])

# Create a Normalization layer for normalizing the 'Horsepower' data.
horsepower_normalizer = layers.Normalization(input_shape=[1,], axis=None)

# Adapt the normalization layer to the 'Horsepower' data, calculating the mean and variance.
horsepower_normalizer.adapt(horsepower)

Build the Keras Sequential model:

In [None]:
horsepower_model = tf.keras.Sequential([
    horsepower_normalizer,
    layers.Dense(units=1)
])

horsepower_model.summary()

In [None]:
>>
Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 normalization_3 (Normaliza  (None, 1)                 3         
 tion)                                                           
                                                                 
 dense_2 (Dense)             (None, 1)                 2         
                                                                 
=================================================================
Total params: 5 (24.00 Byte)
Trainable params: 2 (8.00 Byte)
Non-trainable params: 3 (16.00 Byte)
_________________________________________________________________

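Run the untrained model on the first 10 'Horsepower' values. The output won't be good, but it has the expected shape of (10, 1):

In [None]:
horsepower_model.predict(horsepower[:10])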

In [None]:
>>
1/1 [==============================] - 0s 52ms/step
array([[-1.133],
       [-0.64 ],
       [ 2.09 ],
       [-1.588],
       [-1.436],
       [-0.564],
       [-1.701],
       [-1.436],
       [-0.374],
       [-0.64 ]], dtype=float32)

Once the model is built, configure the training procedure using the Keras Model.compile method. The most important arguments to compile are the loss and the optimizer, since these define what will be optimized (mean_absolute_error) and how (using tf.keras.optimizers.Adam).

In [None]:
horsepower_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
    loss='mean_absolute_error')
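For reference, the mean absolute error is the average of the absolute differences between the labels and the predictions. A minimal NumPy sketch, using made-up values:

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    # Average of |label - prediction| over all examples.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

# Hypothetical labels and predictions, in MPG.
labels_mpg = [20.0, 30.0, 25.0]
predictions_mpg = [22.0, 27.0, 25.0]

mae = mean_absolute_error(labels_mpg, predictions_mpg)
print(mae)  # absolute errors are 2, 3 and 0, so the mean is 5/3
```

Because the loss is expressed in the same units as the label, a value of 3 means the predictions are off by about 3 MPG on average.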

Use Keras Model.fit to execute the training for 100 epochs:

In [None]:
%%time
history = horsepower_model.fit(
    train_features['Horsepower'],
    train_labels,
    epochs=100,
    # Suppress logging.
    verbose=0,
    # Calculate validation results on 20% of the training data.
    validation_split = 0.2)

In [None]:
>>
CPU times: user 4.53 s, sys: 183 ms, total: 4.71 s
Wall time: 5.76 s
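According to the Keras documentation, validation_split holds out the last fraction of the samples, taken before any shuffling. The following sketch illustrates that slicing on a hypothetical 10-sample array; the exact rounding Keras applies is treated here as an assumption.

```python
import numpy as np

samples = np.arange(10)      # hypothetical training samples, in their original order
validation_split = 0.2

# Index where the held-out tail begins (assumed rounding: floor).
split_at = int(len(samples) * (1.0 - validation_split))

train_part = samples[:split_at]   # used to update the weights
val_part = samples[split_at:]     # only used to compute val_loss

print(train_part, val_part)
```

Since the tail is taken in order, it helps that the training rows were already randomly sampled when the dataset was split.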

Visualize the model's training progress using the stats stored in the history object:

In [None]:
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
hist.tail()

|index|loss|val\_loss|epoch|
|---|---|---|---|
|95|3\.803338050842285|4\.1811957359313965|95|
|96|3\.802067279815674|4\.20026969909668|96|
|97|3\.803834915161133|4\.194756507873535|97|
|98|3\.811049222946167|4\.185152530670166|98|
|99|3\.8024814128875732|4\.211414337158203|99|

In [None]:
def plot_loss(history):
  plt.plot(history.history['loss'], label='loss')
  plt.plot(history.history['val_loss'], label='val_loss')
  plt.ylim([0, 10])
  plt.xlabel('Epoch')
  plt.ylabel('Error [MPG]')
  plt.legend()
  plt.grid(True)

In [None]:
plot_loss(history)

![reg_plot-2.png](attachment:ff274ac9-5948-49a8-ad74-149b43ccc599.png)

Collect the results on the test set for later:

In [None]:
test_results = {}

test_results['horsepower_model'] = horsepower_model.evaluate(
    test_features['Horsepower'],
    test_labels, verbose=0)

Since this is a single variable regression, it's easy to view the model's predictions as a function of the input:

In [None]:
x = tf.linspace(0.0, 250, 251)
y = horsepower_model.predict(x)

In [None]:
>>
8/8 [==============================] - 0s 5ms/step

In [None]:
def plot_horsepower(x, y):
  plt.scatter(train_features['Horsepower'], train_labels, label='Data')
  plt.plot(x, y, color='k', label='Predictions')
  plt.xlabel('Horsepower')
  plt.ylabel('MPG')
  plt.legend()

In [None]:
plot_horsepower(x, y)

![reg_plot-3.png](attachment:d713f25e-6f8c-4282-b862-0a0025e7f204.png)

### Linear regression with multiple inputs
You can use an almost identical setup to make predictions based on multiple inputs. This model still does the same y=mx+b except that m is a matrix and x is a vector.

Create a two-step Keras Sequential model again, with the first layer being the normalizer (tf.keras.layers.Normalization(axis=-1)) that you defined earlier and adapted to the training features:

In [None]:
linear_model = tf.keras.Sequential([
    normalizer,
    layers.Dense(units=1)
])

When you call Model.predict on a batch of inputs, it produces units=1 outputs for each example:

In [None]:
linear_model.predict(train_features[:10])

In [None]:
>>
1/1 [==============================] - 0s 98ms/step
array([[ 0.065],
       [-0.725],
       [ 2.513],
       [-1.437],
       [-1.561],
       [-0.375],
       [-1.875],
       [-3.045],
       [ 0.378],
       [-0.319]], dtype=float32)

When you call the model, its weight matrices will be built—check that the kernel weights (the m in y=mx+b) have a shape of (9, 1):

In [None]:
linear_model.layers[1].kernel

In [None]:
>>
<tf.Variable 'dense_3/kernel:0' shape=(9, 1) dtype=float32, numpy=
array([[-0.168],
       [ 0.137],
       [ 0.159],
       [ 0.576],
       [-0.535],
       [-0.657],
       [-0.007],
       [ 0.62 ],
       [ 0.69 ]], dtype=float32)>
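The shapes involved can be checked with a short NumPy sketch. The inputs and the kernel here are random placeholders, not the values of the Keras layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch of 10 examples with 9 (already normalized) features each.
inputs = rng.normal(size=(10, 9))

# Kernel with one weight per feature, plus a single bias term.
kernel = rng.normal(size=(9, 1))
bias = np.zeros(1)

# y = x @ m + b, computed for the whole batch at once.
outputs = inputs @ kernel + bias

print(outputs.shape)  # (10, 1)
```

Each output is the dot product of one example's nine features with the nine kernel weights, plus the bias.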

Configure the model with Keras Model.compile and train with Model.fit for 100 epochs:

In [None]:
linear_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
    loss='mean_absolute_error')

In [None]:
%%time
history = linear_model.fit(
    train_features,
    train_labels,
    epochs=100,
    # Suppress logging.
    verbose=0,
    # Calculate validation results on 20% of the training data.
    validation_split = 0.2)

In [None]:
>>
CPU times: user 5.05 s, sys: 181 ms, total: 5.24 s
Wall time: 13.3 s

Using all the inputs in this regression model achieves a much lower training and validation error than the horsepower_model, which had one input:

In [None]:
plot_loss(history)

![reg_plot-4.png](attachment:6d656a95-975f-4d3b-b8d6-3b43646351cf.png)

Collect the results on the test set for later:

In [None]:
# Ensure all features are converted to float32 to be compatible with TensorFlow.
test_features = test_features.astype(np.float32)

# Evaluate the model on the test dataset and store the results.
test_results['linear_model'] = linear_model.evaluate(test_features, test_labels, verbose=0)

## Regression with a deep neural network (DNN)

In the previous section, you implemented two linear models for single and multiple inputs.

Here, you will implement single-input and multiple-input DNN models.

The code is basically the same except the model is expanded to include some "hidden" non-linear layers. The name "hidden" here just means not directly connected to the inputs or outputs.

These models will contain a few more layers than the linear model:

 - The normalization layer, as before (with horsepower_normalizer for a single-input model and normalizer for a multiple-input model).
 - Two hidden, non-linear Dense layers with the ReLU (relu) activation function.
 - A linear Dense single-output layer.

Both models will use the same training procedure, so the compile method is included in the build_and_compile_model function below.

In [None]:
def build_and_compile_model(norm):
  model = keras.Sequential([
      norm,
      layers.Dense(64, activation='relu'),
      layers.Dense(64, activation='relu'),
      layers.Dense(1)
  ])

  model.compile(loss='mean_absolute_error',
                optimizer=tf.keras.optimizers.Adam(0.001))
  return model
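The computation performed by the three Dense layers can be sketched as a NumPy forward pass. The weights below are random placeholders, and the input is assumed to be already normalized and to have 9 features, as in the multiple-input case.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    # ReLU keeps positive values and replaces negative ones with zero.
    return np.maximum(z, 0.0)

def init_dense(n_inputs, n_units):
    # Random placeholder weights and zero biases; these are not trained values.
    return rng.normal(scale=0.1, size=(n_inputs, n_units)), np.zeros(n_units)

w1, b1 = init_dense(9, 64)
w2, b2 = init_dense(64, 64)
w3, b3 = init_dense(64, 1)

def forward(x):
    h1 = relu(x @ w1 + b1)    # first hidden layer
    h2 = relu(h1 @ w2 + b2)   # second hidden layer
    return h2 @ w3 + b3       # linear output layer, one value per example

batch = rng.normal(size=(5, 9))   # hypothetical batch of 5 examples
dnn_outputs = forward(batch)
print(dnn_outputs.shape)  # (5, 1)
```

Without the two relu calls, the three matrix products would collapse into a single linear transformation, so the hidden layers would add nothing over the linear model.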

### Regression using a DNN and a single input

Create a DNN model with only 'Horsepower' as input and horsepower_normalizer (defined earlier) as the normalization layer:

In [None]:
dnn_horsepower_model = build_and_compile_model(horsepower_normalizer)

This model has quite a few more trainable parameters than the linear models:

In [None]:
dnn_horsepower_model.summary()

In [None]:
>>
Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 normalization_6 (Normaliza  (None, 1)                 3         
 tion)                                                           
                                                                 
 dense_7 (Dense)             (None, 64)                128       
                                                                 
 dense_8 (Dense)             (None, 64)                4160      
                                                                 
 dense_9 (Dense)             (None, 1)                 65        
                                                                 
=================================================================
Total params: 4356 (17.02 KB)
Trainable params: 4353 (17.00 KB)
Non-trainable params: 3 (16.00 Byte)
_________________________________________________________________
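The parameter counts in the summary follow from the layer sizes: a Dense layer has one weight per input-unit pair plus one bias per unit. A small sketch of that arithmetic, where `dense_params` is a helper introduced here for illustration:

```python
def dense_params(n_inputs, n_units):
    # One weight per (input, unit) pair, plus one bias per unit.
    return n_inputs * n_units + n_units

# Single-input DNN: 1 -> 64 -> 64 -> 1
single_input = [dense_params(1, 64), dense_params(64, 64), dense_params(64, 1)]

# Multiple-input DNN: 9 -> 64 -> 64 -> 1
multi_input = [dense_params(9, 64), dense_params(64, 64), dense_params(64, 1)]

print(single_input, sum(single_input))  # [128, 4160, 65] 4353
print(multi_input, sum(multi_input))    # [640, 4160, 65] 4865
```

Only the first Dense layer depends on the number of input features, which is why moving from 1 to 9 inputs adds relatively few trainable parameters.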

Train the model with Keras Model.fit:

In [None]:
%%time
history = dnn_horsepower_model.fit(
    train_features['Horsepower'],
    train_labels,
    validation_split=0.2,
    verbose=0, epochs=100)

In [None]:
>>
CPU times: user 6.59 s, sys: 193 ms, total: 6.78 s
Wall time: 11.5 s

This model does slightly better than the linear single-input horsepower_model:

In [None]:
plot_loss(history)

![reg_plot-5.png](attachment:5aab5c95-be7f-46b0-8a5b-b17724f124a6.png)

If you plot the predictions as a function of 'Horsepower', you should notice how this model takes advantage of the nonlinearity provided by the hidden layers:

In [None]:
x = tf.linspace(0.0, 250, 251)
y = dnn_horsepower_model.predict(x)

In [None]:
>>
8/8 [==============================] - 0s 4ms/step

In [None]:
plot_horsepower(x, y)

![reg_plot-6.png](attachment:3a6b3bf2-95a1-4cf7-aca2-81146330508b.png)

Collect the results on the test set for later:

In [None]:
test_results['dnn_horsepower_model'] = dnn_horsepower_model.evaluate(
    test_features['Horsepower'], test_labels,
    verbose=0)

### Regression using a DNN and multiple inputs

Repeat the previous process using all the inputs. The model's performance slightly improves on the validation dataset.

In [None]:
dnn_model = build_and_compile_model(normalizer)
dnn_model.summary()

In [None]:
>>
Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 normalization_5 (Normaliza  (None, 9)                 19        
 tion)                                                           
                                                                 
 dense_10 (Dense)            (None, 64)                640       
                                                                 
 dense_11 (Dense)            (None, 64)                4160      
                                                                 
 dense_12 (Dense)            (None, 1)                 65        
                                                                 
=================================================================
Total params: 4884 (19.08 KB)
Trainable params: 4865 (19.00 KB)
Non-trainable params: 19 (80.00 Byte)
_________________________________________________________________

In [None]:
%%time
history = dnn_model.fit(
    train_features,
    train_labels,
    validation_split=0.2,
    verbose=0, epochs=100)

In [None]:
>>
CPU times: user 5.51 s, sys: 176 ms, total: 5.69 s
Wall time: 11.8 s

In [None]:
plot_loss(history)

![reg_plot-7.png](attachment:37b561b0-c112-4efc-afe0-f167f61e7876.png)

Collect the results on the test set:

In [None]:
test_results['dnn_model'] = dnn_model.evaluate(test_features, test_labels, verbose=0)

### Performance

Since all models have been trained, you can review their test set performance:

In [None]:
pd.DataFrame(test_results, index=['Mean absolute error [MPG]']).T

|index|Mean absolute error \[MPG\]|
|---|---|
|horsepower\_model|3\.651707172393799|
|linear\_model|2\.481541156768799|
|dnn\_horsepower\_model|2\.9002833366394043|
|dnn\_model|1\.683077335357666|

These results match the validation error observed during training.

### Make predictions

You can now make predictions with the dnn_model on the test set using Keras Model.predict and review the loss:

In [None]:
test_predictions = dnn_model.predict(test_features).flatten()

a = plt.axes(aspect='equal')
plt.scatter(test_labels, test_predictions)
plt.xlabel('True Values [MPG]')
plt.ylabel('Predictions [MPG]')
lims = [0, 50]
plt.xlim(lims)
plt.ylim(lims)
_ = plt.plot(lims, lims)

![reg_plot-8.png](attachment:61b0deef-e9bb-404c-8b28-bfcaad8247a6.png)

It appears that the model predicts reasonably well.

Now, check the error distribution:

In [None]:
error = test_predictions - test_labels
plt.hist(error, bins=25)
plt.xlabel('Prediction Error [MPG]')
_ = plt.ylabel('Count')

![reg_plot-9.png](attachment:cc754a66-3408-44bd-977e-91f5579c1171.png)
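The same errors can also be summarized numerically. The sketch below uses four made-up label and prediction pairs rather than the actual test set.

```python
import numpy as np

# Hypothetical labels and predictions, in MPG.
true_mpg = np.array([18.0, 25.0, 31.0, 22.0])
predicted_mpg = np.array([20.0, 24.0, 28.0, 22.5])

# Same sign convention as above: prediction minus label.
errors = predicted_mpg - true_mpg

mean_error = errors.mean()                 # systematic over- or under-prediction
mean_abs_error = np.abs(errors).mean()     # typical size of an error

# Bin counts, as plt.hist computes them before drawing the bars.
counts, bin_edges = np.histogram(errors, bins=5)

print(mean_error, mean_abs_error)
print(counts)
```

A mean error close to zero with a roughly symmetric histogram suggests the model is not systematically over- or under-predicting.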

If you're happy with the model, save it for later use with Model.save:

In [None]:
dnn_model.save('dnn_model.keras')

## Exercises

### Exercise 1: Data Preprocessing

1. Load and preprocess the Auto MPG dataset, including handling missing values and one-hot encoding categorical variables.
2. Replace missing values in the 'Horsepower' column with the median value instead of dropping them.
3. Modify the one-hot encoding to include a prefix for the origin countries ('Origin_').


### Exercise 2: Single-Variable Linear Regression

1. Create and train a single-variable linear regression model to predict 'MPG' from 'Horsepower'.
2. Use 'Weight' as the single feature for prediction instead of 'Horsepower'.
3. Change the optimizer from 'Adam' to 'SGD' with a learning rate of 0.01 and retrain the model.


### Exercise 3: Multi-Variable Linear Regression

1. Create and train a linear regression model using multiple features.
2. Add an additional Dense layer with 10 units before the output layer and retrain the model.
3. Change the learning rate to 0.05 and retrain the model.


### Exercise 4: Deep Neural Network Regression

1. Create and train a deep neural network model using multiple features.
2. Use 3 hidden layers with 128, 64, and 32 units respectively.
3. Change the activation function of the hidden layers from 'relu' to 'tanh' and retrain the model.



### Exercise 5: Evaluating Model Performance

1. Evaluate the single-variable linear regression model on the test dataset.
2. Plot the true vs. predicted 'MPG' values for the test dataset.
3. Compute and plot the distribution of prediction errors (true values - predicted values).


### Exercise 6: Feature Engineering

1. Add polynomial features (e.g., square and cubic terms) for 'Horsepower' to the dataset and retrain the linear regression model.
2. Implement feature scaling using Min-Max normalization instead of standard normalization and retrain the model.
3. Compare the performance of the model with polynomial features to the original linear model.


### Exercise 7: Regularization

1. Add L2 regularization to the multi-variable linear regression model and retrain it.
2. Adjust the regularization strength and observe its effect on model performance and overfitting.
3. Plot the training and validation loss curves to visualize the impact of regularization.


### Exercise 8: Hyperparameter Tuning

1. Perform hyperparameter tuning for the deep neural network model using Keras Tuner to find the optimal number of layers, units, and learning rate.
2. Train the model with the best hyperparameters found.
3. Compare the performance of the tuned model with the original DNN model.


### Exercise 9: Cross-Validation

1. Implement k-fold cross-validation for the multi-variable linear regression model.
2. Calculate the average mean absolute error (MAE) across all folds.
3. Compare the cross-validation performance to the train-test split performance.


### Exercise 10: Model Deployment

1. Save the trained deep neural network model to a file.
2. Load the saved model and make predictions on a new dataset.
3. Implement a simple Flask web application that accepts input features and returns the predicted 'MPG' value.