<a href="https://colab.research.google.com/github/Benjamin-morel/TensorFlow/blob/main/03_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---


# Machine Learning Model: regression and predictions
| | |
|------|------|
| Filename | 03_regression.ipynb |
| Author(s) | Benjamin Morel (benjaminmorel27@gmail.com) |
| Date | September 8, 2024 |
| Aim(s) | Do regression tasks with a neural network machine learning model|
| Dataset(s) | Auto MPG dataset from UC Irvine [[1]](https://archive.ics.uci.edu/dataset/9/auto+mpg) |
| Version | Python 3.12 - TensorFlow 2.17.0 |


<br> **!!Read before running!!** <br>
1. CPU execution is enough
2. Run all and read comments.


---

#### **Motivation**

A car's fuel efficiency, i.e. the distance a car can travel on a given amount of fuel, is difficult for car manufacturers to calculate. It depends on many factors: vehicle weight, engine power, engine displacement, number of cylinders, technology used, etc.

However, many authorities and organizations require an exact fuel efficiency figure for every vehicle, in order to check whether it complies with pollution standards. Using the Auto MPG dataset from UC Irvine [[1]](https://archive.ics.uci.edu/dataset/9/auto+mpg), the neural network models in this script predict the **fuel efficiency** of 1970s and early 1980s automobiles. A simple linear regression model is used first, and its predictions are then compared with those of a second, non-linear model.

---


#### **1. Import libraries & prebuilt dataset**

In [157]:
import numpy as np # scientific computing
import pandas as pd # data structures and data analysis tools
import tensorflow as tf # machine learning models
import plotly.graph_objects as go # graphing packages
from plotly.subplots import make_subplots
import time

###### **1.1. What is the dataset formed of?**

The dataset used contains data about fuel efficiency for 398 car models, along with 8 characteristics of the cars:
*   **displacement**: engine displacement (in^3)
*   **cylinders**: number of cylinders in the engine
*   **weight**: vehicle weight (lb)
*   **model year**: car's model year
*   **origin**: country of origin (1 = USA, 2 = Europe, 3 = Japan)
*   **car name**: descriptive string containing the car’s name
*   **acceleration**: time to accelerate from 0 to 60 mph (s)
*   **horsepower**: engine power (some missing values)
*   **mpg**: car's fuel efficiency (MPG=Miles Per Gallon)

The target variable is the fuel efficiency **mpg**. The dataset is imported with *pandas* as a table with 398 rows and 8 columns: the **car name** column follows a tab character in the raw file and is dropped on import. The `tail` method displays the last rows of this table.

In [158]:
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration', 'Model Year', 'Origin'] # MPG = fuel efficiency

raw_dataset = pd.read_csv(url, names=column_names,
                          na_values='?', # "?" values become a NaN value
                          comment='\t', # elements following \t are ignored
                          sep=' ', # columns are separated by spaces
                          skipinitialspace=True)

print("Summary of the dataset: ",
      "\n",
      "\nNumber of rows in raw_dataset: ", len(raw_dataset), "/ 398",
      "\nNumber of missing values: ", raw_dataset["Horsepower"].isna().sum(),
      "/", raw_dataset.size) # horsepower NaN values / total number of cells (rows x columns)

raw_dataset.tail()

Summary of the dataset:  
 
Number of rows in raw_dataset:  398 / 398 
Number of missing values:  6 / 3184


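| | MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model Year | Origin |
|------|------|------|------|------|------|------|------|------|
| 393 | 27.0 | 4 | 140.0 | 86.0 | 2790.0 | 15.6 | 82 | 1 |
| 394 | 44.0 | 4 | 97.0 | 52.0 | 2130.0 | 24.6 | 82 | 2 |
| 395 | 32.0 | 4 | 135.0 | 84.0 | 2295.0 | 11.6 | 82 | 1 |
| 396 | 28.0 | 4 | 120.0 | 79.0 | 2625.0 | 18.6 | 82 | 1 |
| 397 | 31.0 | 4 | 119.0 | 82.0 | 2720.0 | 19.4 | 82 | 1 |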
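
To make the `read_csv` options used above more concrete, the following sketch parses a single raw record by hand. The helper `parse_line` and the sample record are hypothetical: they only mimic the file layout (space-separated numbers, then a tab, then the quoted car name) and are not how pandas is implemented.

```python
import math

COLUMN_NAMES = ['MPG', 'Cylinders', 'Displacement', 'Horsepower',
                'Weight', 'Acceleration', 'Model Year', 'Origin']

def parse_line(line):
    """Parse one raw record, mimicking the read_csv options (illustrative only)."""
    numeric_part = line.split('\t')[0]  # comment='\t': ignore the tab and the car name
    tokens = numeric_part.split()       # sep=' ' with skipinitialspace=True
    # na_values='?': a question mark becomes NaN
    values = [math.nan if token == '?' else float(token) for token in tokens]
    return dict(zip(COLUMN_NAMES, values))

# Hypothetical record with a missing horsepower value
sample = '20.0   4   100.0      ?      2500.      15.0   75  1\t"example car"'
row = parse_line(sample)
print(row['MPG'], row['Horsepower'])
```

The car name never reaches the table, which is why only 8 columns are visible in the output of `tail`.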


###### **1.2. Dataset cleaning**

Vehicles with missing data must be removed before training. In this dataset, only the `Horsepower` column contains missing values (6 rows).

In [159]:
dataset = raw_dataset.copy() # copy of the original data before modification
dataset = dataset.dropna() # remove rows containing NaN values

print("Summary of the dataset after data cleaning: ",
      "\n",
      "\nNumber of rows in the dataset cleaned: ", len(dataset), "/", len(raw_dataset),
      "\nNumber of missing values: ", dataset["Horsepower"].isna().sum(),
      "/", dataset.size) # horsepower NaN values / total number of cells (rows x columns)

Summary of the dataset after data cleaning:  
 
Number of rows in the dataset cleaned:  392 / 398 
Number of missing values:  0 / 3136


---


#### **2. Sample preparation & pre-processing**

###### **2.1. How are training and test sets formed?**

The dataset is randomly split into 2 subsets: training (80%) and test (20%) [[2]](https://arxiv.org/pdf/2202.03326). The test set is used for a final evaluation of the model. The function `plot_data` is used to check how the data is distributed for each pair of variables. The first row of plots (MPG) suggests that this parameter depends on the others.

In [160]:
train_dataset = dataset.sample(frac=0.8) # randomly draws 80% of dataset
test_dataset = dataset.drop(train_dataset.index) # removes the rows selected for the training set from dataset

print("# of the training set:", len(train_dataset),
      "\n# of the test set:", len(test_dataset),
      "\n# of the training and test sets:", len(train_dataset)+len(test_dataset))

# of the training set: 314 
# of the test set: 78 
# of the training and test sets: 392
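
The 80/20 split above can be sketched without pandas as a shuffle of the row indices. The function `split_indices` and its `seed` argument are assumptions made for illustration; in the notebook itself, the split changes on every run because `sample` is called without a `random_state`.

```python
import random

def split_indices(n_rows, train_fraction=0.8, seed=0):
    """Return two disjoint, sorted index lists (train, test)."""
    rng = random.Random(seed)           # fixed seed so the split is reproducible
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_train = round(n_rows * train_fraction)
    return sorted(indices[:n_train]), sorted(indices[n_train:])

train_idx, test_idx = split_indices(392)  # 392 rows remain after cleaning
print(len(train_idx), len(test_idx))      # 314 78
```

With pandas, passing `random_state` to `DataFrame.sample` is one way to get the same reproducibility.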


In [161]:
def plot_data(dataset):
  df = dataset
  fig = go.Figure(data=go.Splom(dimensions=[dict(label='MPG', values=df['MPG']),
                                            dict(label='Displacement', values=df['Displacement']),
                                            dict(label='Horsepower', values=df['Horsepower']),
                                            dict(label='Weight', values=df['Weight'])],
                                diagonal_visible=False)) # remove plots on diagonal
  fig.update_layout(title="Some representations of the dataset", width=1200, height=800)
  fig.show()

plot_data(train_dataset)

###### **2.2. Training set handling**

The next step is to separate the label (the variable to be predicted) from the features (the variables used for the prediction). In this case, the fuel efficiency `MPG` is the label and the other columns are the features.

In [162]:
train_features, test_features = train_dataset.copy(), test_dataset.copy()
train_labels, test_labels = train_features.pop('MPG'), test_features.pop('MPG')

###### **2.3. Normalization step**

The dataset is composed of quantities with different ranges and units. A good practice is to normalize the features so that the scale of the outputs and the scale of the gradients are not affected by the scale of the inputs. Models converge more stably with normalized features.

For the following part of the script, the engine horsepower parameter `Horsepower` is used to predict the fuel efficiency of a vehicle. The variable `Horsepower` is kept in the dataset `train_features`. The mean value and standard deviation of this distribution are computed for the normalization layer. Since only one value is needed to predict an output, the input size `input_shape` is set to 1.


In [163]:
horsepower = np.array(train_features['Horsepower'])
horsepower_normalizer = tf.keras.layers.Normalization(input_shape=[1,], axis=None)
horsepower_normalizer.adapt(horsepower) # fit to the data to compute the mean and standard deviation


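Running this cell prints the following Keras warning (it is not an error, and the layer is still created): *Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.*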
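
As a rough illustration of what the normalization layer does once it has been adapted, the sketch below standardizes a feature with its mean and standard deviation. The horsepower values are hypothetical, and numerical safeguards inside the Keras layer are ignored.

```python
import numpy as np

def standardize(values, mean, std):
    """Shift to zero mean and scale to unit standard deviation."""
    return (values - mean) / std

# Hypothetical horsepower values (not taken from the dataset)
horsepower_sample = np.array([75.0, 90.0, 110.0, 150.0, 200.0])

mean = horsepower_sample.mean()
std = horsepower_sample.std()  # population standard deviation
normalized = standardize(horsepower_sample, mean, std)

print(normalized)
```

After this transformation the feature has zero mean and unit standard deviation, whatever its original unit.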



---


#### **3. Linear regression model (L-model)**

The first model is a single-variable linear regression that predicts `MPG` from `Horsepower`. It applies two steps:
*   The feature `Horsepower` is normalized by the preprocessing layer passed as the `norm` argument (here, the `horsepower_normalizer` defined above).
*   A linear transformation `y = ax + b` is applied by the `Dense` layer to produce 1 output.
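
The two steps listed above amount to the following computation for one input value. This is only a sketch: the function name and all numeric values (mean, standard deviation, `a`, `b`) are hypothetical, whereas in the notebook `a` and `b` are learned during training.

```python
def linear_model_forward(horsepower, mean, std, a, b):
    """Normalize the input, then apply y = a*x + b."""
    x = (horsepower - mean) / std  # step 1: normalization layer
    return a * x + b               # step 2: Dense(units=1), one weight and one bias

# Hypothetical parameter values, chosen only to make the arithmetic easy to follow
prediction = linear_model_forward(horsepower=130.0, mean=100.0, std=40.0, a=-6.0, b=23.0)
print(prediction)  # 18.5
```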



In [164]:
def create_linear_model(norm):
  model = tf.keras.Sequential([norm, tf.keras.layers.Dense(units=1)])
  model.compile(optimizer=tf.optimizers.Adam(learning_rate=0.1), loss='mean_absolute_error')
  return model

In [165]:
linear_model = create_linear_model(horsepower_normalizer)
computation_time = {}
t0 = time.time()
history = linear_model.fit(train_features['Horsepower'], train_labels, epochs=100, verbose=0, validation_split = 0.2)
t1 = time.time()
computation_time['L-model'] = t1-t0 # save the computation time

In [166]:
test_results = {}
test_results['L-model'] = linear_model.evaluate(test_features['Horsepower'], test_labels, verbose=0) # save the loss value for the test set
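
The value stored by `evaluate` is the loss chosen at compile time, here the mean absolute error. A plain-Python sketch of that metric, with hypothetical MPG values, could look like this:

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical values, in MPG
true_mpg = [20.0, 30.0, 25.0]
predicted_mpg = [22.0, 27.0, 25.0]
mae = mean_absolute_error(true_mpg, predicted_mpg)
print(round(mae, 3))
```

Because the error is expressed in the unit of the label, a value of 2 means the predictions are off by 2 MPG on average.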

---


#### **4. Multiple inputs linear regression model (nL-model)**


In [167]:
normalizer_multiple = tf.keras.layers.Normalization(axis=-1)
normalizer_multiple.adapt(np.array(train_features))
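
With `axis=-1`, the normalization layer keeps one mean and one variance per feature column instead of a single pair for the whole input. A minimal NumPy sketch of this per-column standardization follows; the small feature matrix is hypothetical.

```python
import numpy as np

# Hypothetical mini-dataset: 4 cars x 3 features (cylinders, horsepower, weight)
features = np.array([[4.0,  70.0, 2000.0],
                     [4.0,  90.0, 2500.0],
                     [6.0, 110.0, 3000.0],
                     [8.0, 150.0, 4000.0]])

column_mean = features.mean(axis=0)  # one mean per column
column_std = features.std(axis=0)    # one standard deviation per column
normalized = (features - column_mean) / column_std

print(normalized.round(2))
```

Each column ends up on the same scale, so a feature measured in pounds no longer dominates one measured in cylinders.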

In [168]:
multiple_linear_model = create_linear_model(normalizer_multiple)
t0 = time.time()
history = multiple_linear_model.fit(train_features, train_labels, epochs=100, verbose=0, validation_split = 0.2)
t1 = time.time()
computation_time['nL-model'] = t1-t0

In [169]:
test_results['nL-model'] = multiple_linear_model.evaluate(test_features, test_labels, verbose=0)

---


#### **5. Deep Neural Network model (DNN-model)**

In [170]:
def DNN_model(norm):
  model = tf.keras.Sequential([
      norm,
      tf.keras.layers.Dense(64, activation='relu'),
      tf.keras.layers.Dense(64, activation='relu'),
      tf.keras.layers.Dense(1)])

  model.compile(loss='mean_absolute_error', optimizer=tf.keras.optimizers.Adam(0.001))
  return model
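
The forward pass of this architecture (two hidden layers of 64 ReLU units, then one linear output) can be sketched in NumPy as shown below. The helper names and the random initialization are assumptions for illustration; they do not reproduce the Keras initializer or the trained weights.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def init_params(n_inputs, n_hidden=64, seed=0):
    """Random weights and zero biases for a 2-hidden-layer network (illustrative)."""
    rng = np.random.default_rng(seed)
    return {
        'W1': rng.normal(scale=0.1, size=(n_inputs, n_hidden)), 'b1': np.zeros(n_hidden),
        'W2': rng.normal(scale=0.1, size=(n_hidden, n_hidden)), 'b2': np.zeros(n_hidden),
        'W3': rng.normal(scale=0.1, size=(n_hidden, 1)),        'b3': np.zeros(1),
    }

def dnn_forward(x, params):
    """Dense(64, relu) -> Dense(64, relu) -> Dense(1), applied to normalized inputs."""
    h1 = relu(x @ params['W1'] + params['b1'])
    h2 = relu(h1 @ params['W2'] + params['b2'])
    return h2 @ params['W3'] + params['b3']

params = init_params(n_inputs=1)
x = np.array([[-1.0], [0.0], [1.0]])  # three already-normalized horsepower values
y = dnn_forward(x, params)
n_params = sum(p.size for p in params.values())
print(y.shape, n_params)  # (3, 1) 4353
```

The ReLU activations are what make the model non-linear: without them, the three layers would collapse into a single linear transformation.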

In [171]:
dnn_model = DNN_model(horsepower_normalizer)
t0 = time.time()
history = dnn_model.fit(train_features['Horsepower'], train_labels, validation_split=0.2, verbose=0, epochs=100)
t1 = time.time()
computation_time['DNN-model'] = t1-t0

In [172]:
test_results['DNN-model'] = dnn_model.evaluate(test_features['Horsepower'], test_labels, verbose=0)

---


#### **6. Multiple inputs deep Neural Network model (nDNN-model)**

In [173]:
multiple_dnn_model = DNN_model(normalizer_multiple)
t0 = time.time()
history = multiple_dnn_model.fit(train_features, train_labels, validation_split=0.2, verbose=0, epochs=100)
t1 = time.time()
computation_time['nDNN-model'] = t1-t0

In [174]:
test_results['nDNN-model'] = multiple_dnn_model.evaluate(test_features, test_labels, verbose=0)

---


#### **7. Comparison & comments**

In [179]:
pd.DataFrame([test_results, computation_time], index=['Mean absolute error [MPG]', "Computation time [s]"]).T

| | Mean absolute error [MPG] | Computation time [s] |
|------|------|------|
| L-model | 3.963596 | 10.000369 |
| nL-model | 2.665467 | 8.958707 |
| DNN-model | 3.543202 | 9.774154 |
| nDNN-model | 1.837558 | 11.008158 |
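
One way to read this table is to express each error relative to the single-variable linear model. The sketch below uses the values reported above; they come from one particular run and will differ slightly from run to run, since the train/test split is random.

```python
# Mean absolute errors reported in the table above, in MPG
test_mae = {
    'L-model': 3.963596,
    'nL-model': 2.665467,
    'DNN-model': 3.543202,
    'nDNN-model': 1.837558,
}

baseline = test_mae['L-model']
reductions = {}
for name, mae in test_mae.items():
    reductions[name] = 100 * (baseline - mae) / baseline  # error reduction in %
    print(f"{name}: {mae:.2f} MPG ({reductions[name]:.1f}% lower than L-model)")

best_model = min(test_mae, key=test_mae.get)
print("Lowest error:", best_model)
```

In this run, using all features reduces the error more than switching from a linear to a non-linear model with `Horsepower` alone.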


In [176]:
predictions_linear_model = linear_model.predict(test_features["Horsepower"], verbose=0).flatten()
predictions_multiple_linear_model = multiple_linear_model.predict(test_features, verbose=0).flatten()
predictions_dnn_model = dnn_model.predict(test_features["Horsepower"], verbose=0).flatten()
predictions_multiple_dnn_model = multiple_dnn_model.predict(test_features, verbose=0).flatten()

df_predictions = [predictions_linear_model, predictions_multiple_linear_model, predictions_dnn_model, predictions_multiple_dnn_model]

In [177]:
def plot_prediction(predictions):
  n_models = len(predictions) # one row of plots per model
  subtitles = ["L-model", "L-model", "nL-model", "nL-model", "DNN-model", "DNN-model", "nDNN-model", "nDNN-model"]
  fig = make_subplots(rows=n_models, cols=2, subplot_titles=subtitles)
  xaxis = np.linspace(np.min(test_labels)*0.9, np.max(test_labels)*1.1, 50) # identity line y = x (perfect predictions)

  for index, model in enumerate(predictions):
    row = index + 1
    # left column: predictions vs. true values, with the identity line as reference
    fig.add_trace(go.Scatter(x=test_labels, y=model, mode='markers', marker_color='blue', showlegend=False), row=row, col=1)
    fig.add_trace(go.Scatter(x=xaxis, y=xaxis, mode='lines', line_color='red', showlegend=False), row=row, col=1)
    # right column: histogram of the prediction errors
    fig.add_trace(go.Histogram(x=model - test_labels, nbinsx=25, marker_color='blue', showlegend=False), row=row, col=2)
    fig.update_yaxes(title_text="Predictions [MPG]", row=row, col=1)
    fig.update_yaxes(title_text="Count", range=[0, 25], row=row, col=2)

  fig.update_layout(title=dict(text='Predicted values Vs. True values'), width=800, height=1000)
  fig.update_xaxes(title_text="True values [MPG]", row=n_models, col=1)
  fig.update_xaxes(title_text="Prediction error [MPG]", row=n_models, col=2)
  fig.show()

In [178]:
plot_prediction(df_predictions)
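
The histograms in the right-hand column show the signed prediction error, i.e. prediction minus true value. The sketch below computes the same quantity on hypothetical values and bins it with `np.histogram`; it only illustrates the calculation and does not use the notebook's predictions.

```python
import numpy as np

# Hypothetical true and predicted fuel efficiencies, in MPG
true_mpg = np.array([18.0, 25.0, 31.0, 22.0, 36.0])
predicted_mpg = np.array([20.0, 24.0, 28.0, 22.5, 33.0])

errors = predicted_mpg - true_mpg      # signed error: negative means under-prediction
counts, bin_edges = np.histogram(errors, bins=5)

mean_error = errors.mean()             # bias of the predictions
mean_abs_error = np.abs(errors).mean() # same metric as the loss used for training
print(counts, round(mean_error, 2), round(mean_abs_error, 2))
```

A histogram centered on zero indicates unbiased predictions, while its width reflects the size of the typical error.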