<a href="https://colab.research.google.com/github/Benjamin-morel/TensorFlow/blob/main/03_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---


# **Machine Learning Model: regression and predictions**
| | |
|------|------|
| Filename | 03_regression.ipynb |
| Author(s) | Benjamin Morel (benjaminmorel27@gmail.com) |
| Date | September 8, 2024 |
| Aim(s) | Perform regression tasks with neural network models |
| Dataset(s) | Auto MPG dataset from UC Irvine [[1]](https://archive.ics.uci.edu/dataset/9/auto+mpg) |
| Version | Python 3.12 - TensorFlow 2.17.0 |


<br> **!!Read before running!!** <br>
1. Fill in the inputs
2. CPU execution is enough
3. Run all and read comments.


---

#### **Motivation**

A car's fuel efficiency, i.e. the amount of fuel consumed by a car for a given distance, is a parameter that is difficult for car manufacturers to calculate. It depends on many factors: vehicle weight, power, engine volume, number of cylinders, technology used, etc.

However, many authorities and organizations require an accurate fuel efficiency figure for every vehicle, in order to check whether it complies with pollution standards. With the Auto MPG dataset from UC Irvine [[1]](https://archive.ics.uci.edu/dataset/9/auto+mpg), the neural network models used in this script predict the **fuel efficiency** of 1970s and 1980s automobiles. Simple linear regression models are used first, then compared with the predictions made by non-linear models.



---


#### **0. Input section**

The models have already been trained: the **parameters** (weights and biases) of each neuron have already been fitted to the base dataset. The user can either keep these parameters and **not retrain the models** (`'No'`), or repeat the **training phase** (`'Yes'`). Using pre-trained models saves time, computing resources and CO2 emissions.

In [1]:
training_phase = 'No'

---


#### **1. Import libraries & prebuilt dataset**

###### **1.1. Python libraries**

In [2]:
from numpy import array, max, min                                   # scientific computing
from pandas import read_csv, DataFrame                              # data manipulation tool
from tensorflow import keras, optimizers, linspace                  # machine learning models
from plotly.graph_objects import Figure, Splom, Scatter, Histogram  # graphing packages
from plotly.subplots import make_subplots                           # graphing packages
from time import thread_time                                        # timer

###### **1.2. GitHub imports**

In [3]:
def get_github_files():
  !git clone https://github.com/Benjamin-morel/TensorFlow.git TensorFlow_duplicata # clone the GitHub repository TensorFlow
  model_L = keras.models.load_model('TensorFlow_duplicata/99_pre_trained_models/03_regression/Lmodel.keras') # pre-trained L-model
  model_nL = keras.models.load_model('TensorFlow_duplicata/99_pre_trained_models/03_regression/nLmodel.keras') # pre-trained nL-model
  model_DNN = keras.models.load_model('TensorFlow_duplicata/99_pre_trained_models/03_regression/DNNmodel.keras') # pre-trained DNN-model
  model_nDNN = keras.models.load_model('TensorFlow_duplicata/99_pre_trained_models/03_regression/nDNNmodel.keras') # pre-trained nDNN-model
  performances_CSV = read_csv('TensorFlow_duplicata/99_pre_trained_models/03_regression/performances_regression.csv') # saved performance table
  !rm -rf TensorFlow_duplicata/ # delete the cloned repository
  return model_L, model_nL, model_DNN, model_nDNN, performances_CSV

###### **1.3. What is the dataset formed of?**

The dataset used contains data about fuel efficiency for 398 car models, along with 8 characteristics of the cars:
*   **displacement**: engine displacement (in^3)
*   **cylinders**: number of cylinders in the engine
*   **weight**: vehicle weight (lb)
*   **model year**: car's model year
*   **origin**: country of origin (1 = USA, 2 = Europe, 3 = Japan)
*   **car name**: descriptive string containing the car’s name
*   **acceleration**: time to accelerate from 0 to 60 mph (s)
*   **horsepower**: engine power (some missing values)
*   **mpg**: car's fuel efficiency (MPG=Miles Per Gallon)

The target variable is the fuel efficiency **mpg**. The dataset is imported with *pandas* as a table with 398 rows and 8 columns: the car name, which follows a tab character in the raw file, is discarded during import. The method `tail` displays the last rows of this pandas table.

In [4]:
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower',
                'Weight', 'Acceleration', 'Model Year', 'Origin']               # MPG = fuel efficiency

raw_dataset = read_csv(url, names=column_names,
                          na_values='?',                                        # "?" values become NaN values
                          comment='\t',                                         # everything after a tab (the car name) is ignored
                          sep=' ',                                              # columns are separated by spaces
                          skipinitialspace=True)                                # consecutive spaces count as one separator

print("Summary of the dataset: ",
      "\n",
      "\nNumber of rows in raw_dataset: ", len(raw_dataset), "/ 398",
      "\nNumber of missing values: ", raw_dataset["Horsepower"].isna().sum(),
      "/", len(raw_dataset)*9)                                                  # horsepower NaN values

raw_dataset.tail()

Summary of the dataset:  
 
Number of rows in raw_dataset:  398 / 398 
Number of missing values:  6 / 3582


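| | MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model Year | Origin |
|------|------|------|------|------|------|------|------|------|
| 393 | 27.0 | 4 | 140.0 | 86.0 | 2790.0 | 15.6 | 82 | 1 |
| 394 | 44.0 | 4 | 97.0 | 52.0 | 2130.0 | 24.6 | 82 | 2 |
| 395 | 32.0 | 4 | 135.0 | 84.0 | 2295.0 | 11.6 | 82 | 1 |
| 396 | 28.0 | 4 | 120.0 | 79.0 | 2625.0 | 18.6 | 82 | 1 |
| 397 | 31.0 | 4 | 119.0 | 82.0 | 2720.0 | 19.4 | 82 | 1 |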
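To make the import options more concrete, here is a minimal sketch in plain Python of what `na_values='?'`, `comment='\t'` and the space separator do to one raw line. The two sample lines are illustrative strings written in the file's layout, and `parse_line` is a hypothetical helper, not part of the notebook.

```python
from math import isnan, nan

COLUMN_NAMES = ['MPG', 'Cylinders', 'Displacement', 'Horsepower',
                'Weight', 'Acceleration', 'Model Year', 'Origin']

def parse_line(line):
    """Parse one raw line into a dict of floats; '?' becomes NaN."""
    numeric_part = line.split('\t', 1)[0]   # drop everything after the tab (car name)
    tokens = numeric_part.split()           # split on runs of spaces
    values = [nan if tok == '?' else float(tok) for tok in tokens]
    return dict(zip(COLUMN_NAMES, values))

# Illustrative lines in the layout of the raw file (assumption, not copied data)
sample_lines = [
    '18.0   8   307.0      130.0      3504.      12.0   70  1\t"car a"',
    '25.0   4   98.00      ?          2046.      19.0   71  1\t"car b"',
]

rows = [parse_line(line) for line in sample_lines]
missing = sum(isnan(row['Horsepower']) for row in rows)
print(len(rows[0]), "columns,", missing, "missing horsepower value(s)")
```

Each parsed row has 8 entries, which is why the imported table has one column fewer than the 9 attributes listed above.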


###### **1.4. Dataset cleaning**

It is necessary to remove vehicles for which data is missing, especially for horsepower.

In [5]:
dataset = raw_dataset.copy() # copy of the original data before modification
dataset = dataset.dropna() # remove rows containing NaN values

print("Summary of the dataset after data cleaning: ",
      "\n",
      "\nNumber of rows in the dataset cleaned: ", len(dataset), "/", len(raw_dataset),
      "\nNumber of missing values: ", dataset["Horsepower"].isna().sum(),
      "/", len(dataset)*9) # horsepower NaN values

Summary of the dataset after data cleaning:  
 
Number of rows in the dataset cleaned:  392 / 398 
Number of missing values:  0 / 3528
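As a rough illustration of the cleaning step, the sketch below reproduces the idea of `dropna` on a tiny hand-written table: any row holding at least one NaN is removed. The rows and the helper `drop_incomplete` are assumptions made for the example.

```python
from math import isnan, nan

def drop_incomplete(rows):
    """Keep only the rows in which no value is NaN."""
    return [row for row in rows if not any(isnan(v) for v in row.values())]

# Illustrative rows (assumption): the second one has no horsepower value
toy_rows = [
    {'MPG': 18.0, 'Horsepower': 130.0},
    {'MPG': 25.0, 'Horsepower': nan},
    {'MPG': 27.0, 'Horsepower': 86.0},
]

cleaned = drop_incomplete(toy_rows)
print(len(cleaned), "/", len(toy_rows), "rows kept")
```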


---


#### **2. Sample preparation & pre-processing**

###### **2.1. How are training and test sets formed?**

The dataset is randomly split into 2 subsets: training (80%) and test (20%) [[2]](https://arxiv.org/pdf/2202.03326). The test set is used for a final evaluation of the models. The function `plot_data` is used to check the data distribution for each pair of variables. The first row of the scatter matrix (MPG) suggests that this parameter depends on the others.

In [6]:
train_dataset = dataset.sample(frac=0.8) # randomly draws 80% of dataset
test_dataset = dataset.drop(train_dataset.index) # removes the rows selected for the training set from dataset

print("# of the training set:", len(train_dataset),
      "\n# of the test set:", len(test_dataset),
      "\n# of the training and test sets:", len(train_dataset)+len(test_dataset))

# of the training set: 314 
# of the test set: 78 
# of the training and test sets: 392
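The split above draws different rows at every run. A minimal sketch of a reproducible 80/20 split, written in plain Python with a fixed seed, could look as follows; `split_indices` and the seed value are assumptions for illustration.

```python
import random

def split_indices(n_rows, train_fraction=0.8, seed=0):
    """Return two disjoint, sorted lists of row indices (train, test)."""
    rng = random.Random(seed)                      # fixed seed: same split at every run
    n_train = round(n_rows * train_fraction)
    train = sorted(rng.sample(range(n_rows), n_train))
    train_set = set(train)
    test = [i for i in range(n_rows) if i not in train_set]
    return train, test

train_idx, test_idx = split_indices(392)           # 392 rows remain after cleaning
print(len(train_idx), len(test_idx))
```

With pandas, the same effect is usually obtained by passing `random_state` to `DataFrame.sample`, for instance `dataset.sample(frac=0.8, random_state=0)`.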


In [7]:
def plot_data(dataset):
  df = dataset
  fig = Figure(data=Splom(dimensions=[dict(label='MPG', values=df['MPG']),
                                            dict(label='Displacement', values=df['Displacement']),
                                            dict(label='Horsepower', values=df['Horsepower']),
                                            dict(label='Weight', values=df['Weight'])],
                                diagonal_visible=False)) # remove plots on diagonal
  fig.update_layout(title="Some representations of the dataset", width=1000, height=600)
  fig.show()

plot_data(train_dataset)

###### **2.2. Training set handling**

The next step is to arrange the label variables (those to be predicted) and the feature variables (those to be used for prediction). In this case, the fuel efficiency `MPG` is the label and the other data are the features.

In [8]:
train_features, test_features = train_dataset.copy(), test_dataset.copy()
train_labels, test_labels = train_features.pop('MPG'), test_features.pop('MPG')

###### **2.3. Normalization step**

The dataset is composed of quantities with different ranges and units. A good practice is to normalize the features so that the scale of the outputs and the scale of the gradients are not affected by the scale of the inputs. Models converge more stably with normalized features.

For the following part of the script, only the engine horsepower `Horsepower` is used to predict the fuel efficiency of a vehicle. The column `Horsepower` is extracted from the dataset `train_features`. The mean value and standard deviation of this distribution are computed by `adapt` for the normalization layer. Since only one value is needed to predict an output, the input size `input_shape` is set to `(1,)`.
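The normalization itself is a standard score: each value is shifted by the mean and divided by the standard deviation. The sketch below shows that arithmetic in plain Python on a few illustrative horsepower values; it assumes a non-zero standard deviation and is not the Keras implementation.

```python
from statistics import fmean, pstdev

def fit_normalizer(values):
    """Compute the mean and (population) standard deviation of the data."""
    return fmean(values), pstdev(values)

def normalize(values, mean, std):
    """Apply (x - mean) / std to every value."""
    return [(v - mean) / std for v in values]

# Illustrative horsepower values (assumption)
horsepower_sample = [130.0, 165.0, 150.0, 95.0, 86.0]

hp_mean, hp_std = fit_normalizer(horsepower_sample)
hp_normalized = normalize(horsepower_sample, hp_mean, hp_std)
print(round(fmean(hp_normalized), 6), round(pstdev(hp_normalized), 6))
```

After this transformation the sample has zero mean and unit standard deviation, whatever the original unit was.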


In [9]:
horsepower = array(train_features['Horsepower'])
horsepower_normalizer = keras.layers.Normalization(input_shape=(1,), axis=None)
horsepower_normalizer.adapt(horsepower) # fit to the data to compute the mean and standard deviation


Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.



In [10]:
def show_evolution(history):
  history_dict = history.history # metrics recorded at each epoch

  fig = Figure()

  loss_train, loss_val = history_dict['loss'], history_dict['val_loss'] # training and validation losses
  epochs = array(range(1, len(loss_train) + 1))

  fig.add_trace(Scatter(x=epochs, y=loss_train, mode='lines', name='loss'))
  fig.add_trace(Scatter(x=epochs, y=loss_val, mode='lines', name='val_loss'))
  fig.update_layout(legend=dict(x=0.02, y=0.98, xanchor='left', yanchor='top', bgcolor='rgba(255, 255, 255, 0.8)', bordercolor='black', borderwidth=1),
                    width=500, height=500)
  fig.update_xaxes(title="epochs")
  fig.update_yaxes(title="loss")

  fig.show()

---


#### **3. Linear regression model (L-model)**

First model: a single-variable linear regression model that predicts `MPG` from `Horsepower`. The model performs two steps:
*   The feature `Horsepower` is normalized by a preprocessing layer `norm` (here, the layer `horsepower_normalizer` defined above).
*   A linear transformation `y = ax + b` is applied by a `Dense` layer to produce 1 output.
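To illustrate what the training minimizes, here is a small plain-Python sketch of the linear transformation `y = ax + b` and of the mean absolute error used as loss. The data points and the values of `a` and `b` are made up for the example; in the notebook they are learned by the optimizer.

```python
def predict_linear(x, a, b):
    """Apply y = a*x + b to every input value."""
    return [a * xi + b for xi in x]

def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted| over all samples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative data (assumption): normalized horsepower and MPG lying on a line
x = [-1.0, 0.0, 1.0]
y = [30.0, 23.0, 16.0]

mae_good = mean_absolute_error(y, predict_linear(x, a=-7.0, b=23.0))  # exact fit
mae_flat = mean_absolute_error(y, predict_linear(x, a=0.0, b=23.0))   # ignores the input
print(mae_good, round(mae_flat, 3))
```

The second parameter set ignores horsepower entirely, so its error is larger; training moves `a` and `b` toward the values that lower this number.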



In [11]:
def create_linear_model(norm):
  model = keras.Sequential([norm, keras.layers.Dense(units=1)])
  model.compile(optimizer=optimizers.Adam(learning_rate=0.1), loss='mean_absolute_error')
  return model

In [12]:
linear_model = create_linear_model(horsepower_normalizer)

if training_phase == 'Yes':
  computation_time = {}
  t0 = thread_time()
  history = linear_model.fit(train_features['Horsepower'], train_labels, epochs=50, verbose=0, validation_split = 0.2)
  t1 = thread_time()
  computation_time['L-model'] = t1-t0 # save the computation time
  linear_model.save('Lmodel.keras')
  show_evolution(history)

In [13]:
if training_phase == "Yes":
  test_results = {}
  test_results['L-model'] = linear_model.evaluate(test_features['Horsepower'], test_labels, verbose=0) # save the loss value for the test set

---


#### **4. Multiple inputs linear regression model (nL-model)**

Second model: a 7-variable linear regression model that predicts `MPG` from all the columns of `train_features`. The model performs two steps:
*   The features `train_features` are normalized column by column by a preprocessing layer `norm`
*   A linear transformation `y = Mx + v` is applied by a `Dense` layer to produce 1 output, where `M` holds one weight per feature and `v` is the bias
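The two steps above can be sketched in plain Python: each column gets its own mean and standard deviation, then the output is a weighted sum of the normalized features plus a bias. For brevity the example uses 2 illustrative features instead of 7, and the weights are arbitrary rather than trained values.

```python
from statistics import fmean, pstdev

def normalize_columns(rows):
    """Normalize each column with its own mean and standard deviation."""
    columns = list(zip(*rows))
    stats = [(fmean(col), pstdev(col)) for col in columns]
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]

def linear_output(x, weights, bias):
    """Compute y = M.x + v for one sample and one output."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

# Illustrative samples (assumption): [cylinders, weight in lb]
samples = [[8.0, 3504.0], [4.0, 2130.0], [6.0, 2790.0]]
normalized = normalize_columns(samples)

# Arbitrary weights and bias, for illustration only
outputs = [linear_output(x, weights=[-2.0, -4.0], bias=23.0) for x in normalized]
print([round(o, 2) for o in outputs])
```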

In [14]:
normalizer_multiple = keras.layers.Normalization(axis=-1)
normalizer_multiple.adapt(array(train_features))

In [15]:
multiple_linear_model = create_linear_model(normalizer_multiple)
if training_phase == "Yes":
  t0 = thread_time()
  history = multiple_linear_model.fit(train_features, train_labels, epochs=50, verbose=0, validation_split = 0.2)
  t1 = thread_time()
  multiple_linear_model.save('nLmodel.keras')
  computation_time['nL-model'] = t1-t0
  test_results['nL-model'] = multiple_linear_model.evaluate(test_features, test_labels, verbose=0) # save the loss value for the test set
  show_evolution(history)

---


#### **5. Deep Neural Network model (DNN-model)**

Third model: a deep neural network regression model that predicts `MPG` from `Horsepower`. The model is composed of one pre-processing layer, 2 hidden layers and one output layer:
* The pre-processing layer `norm`
* Two hidden `Dense` layers of 64 units with the non-linear function `ReLU` as activation function
* One output layer `Dense` to compute a single output
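As a worked example of this architecture, the sketch below counts the trainable parameters of the `Dense` layers (one weight per input-output pair plus one bias per unit) and defines ReLU. The helper names are hypothetical, and the statistics stored by the normalization layer are deliberately left out of the count.

```python
def relu(x):
    """Rectified linear unit: negative values are set to zero."""
    return max(0.0, x)

def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus biases (n_out) of one Dense layer."""
    return n_in * n_out + n_out

def count_params(n_inputs, hidden=(64, 64), n_outputs=1):
    """Total trainable parameters of the stacked Dense layers."""
    sizes = [n_inputs, *hidden, n_outputs]
    return sum(dense_params(a, b) for a, b in zip(sizes, sizes[1:]))

print(relu(-3.0), relu(2.5))
print(count_params(1))   # single input (Horsepower): 128 + 4160 + 65
print(count_params(7))   # seven inputs (all features): 512 + 4160 + 65
```

Compared with the 2 parameters of the single-variable linear model, the extra capacity is what lets the network fit a curved relation between horsepower and MPG.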


In [16]:
def DNN_model(norm):
  model = keras.Sequential([
      norm,
      keras.layers.Dense(64, activation='relu'),
      keras.layers.Dense(64, activation='relu'),
      keras.layers.Dense(1)])

  model.compile(loss='mean_absolute_error', optimizer=keras.optimizers.Adam(0.001))
  return model

In [17]:
dnn_model = DNN_model(horsepower_normalizer)
if training_phase == "Yes":
  t0 = thread_time()
  history = dnn_model.fit(train_features['Horsepower'], train_labels, epochs=50, validation_split=0.2, verbose=0)
  t1 = thread_time()
  computation_time['DNN-model'] = t1-t0
  test_results['DNN-model'] = dnn_model.evaluate(test_features['Horsepower'], test_labels, verbose=0)
  dnn_model.save('DNNmodel.keras')
  show_evolution(history)

---


#### **6. Multiple inputs deep Neural Network model (nDNN-model)**

Last model: a deep neural network regression model with 7 inputs that predicts `MPG` from `train_features`.

In [18]:
multiple_dnn_model = DNN_model(normalizer_multiple)
if training_phase == "Yes":
  t0 = thread_time()
  history = multiple_dnn_model.fit(train_features, train_labels, epochs=50, validation_split=0.2, verbose=0)
  t1 = thread_time()
  computation_time['nDNN-model'] = t1-t0
  test_results['nDNN-model'] = multiple_dnn_model.evaluate(test_features, test_labels, verbose=0)
  multiple_dnn_model.save('nDNNmodel.keras')
  show_evolution(history)

---


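#### **7. Comparison & comments**

If the models have been trained, it is possible to compare the performance of each model. In the pre-trained results printed below, rows 0 to 3 are assumed to follow the training order (`L-model`, `nL-model`, `DNN-model`, `nDNN-model`), since the model names are not saved in the CSV file. Under that reading, `nDNN-model` is the best in terms of minimizing the loss function, while `DNN-model` has the longest computing time. The function `plot_prediction` can then be used to visualize predicted values as a function of true values (left graph) and the error distribution (right graph) for any of the models.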

In [19]:
if training_phase == "Yes":
  performances = DataFrame([test_results, computation_time], index=['Mean absolute error [MPG]', "Computation time [s]"]).T
  performances.to_csv('performances_regression.csv', index=False)
else:
  linear_model, multiple_linear_model, dnn_model, multiple_dnn_model, performances = get_github_files()

Cloning into 'TensorFlow_duplicata'...
remote: Enumerating objects: 312, done.
remote: Counting objects: 100% (118/118), done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 312 (delta 114), reused 88 (delta 88), pack-reused 194 (from 2)
Receiving objects: 100% (312/312), 31.29 MiB | 19.77 MiB/s, done.
Resolving deltas: 100% (154/154), done.


In [20]:
print(performances)

   Mean absolute error [MPG]  Computation time [s]
0                   3.827374              4.595454
1                   2.092484              3.868038
2                   3.352383              5.078055
3                   1.965180              4.296537
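The table can be read programmatically with a few lines of plain Python. The numbers below are copied from the output above; the mapping of rows 0 to 3 onto model names is an assumption based on the order in which the models are trained, because the CSV does not store the names.

```python
# (mean absolute error [MPG], computation time [s]) per model.
# Row-to-model mapping is an assumption (training order).
saved_performances = {
    'L-model':    (3.827374, 4.595454),
    'nL-model':   (2.092484, 3.868038),
    'DNN-model':  (3.352383, 5.078055),
    'nDNN-model': (1.965180, 4.296537),
}

most_accurate = min(saved_performances, key=lambda name: saved_performances[name][0])
slowest = max(saved_performances, key=lambda name: saved_performances[name][1])
print("Lowest error:", most_accurate)
print("Longest computation:", slowest)
```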


In [21]:
all_model = {'L-model': linear_model, 'nL-model': multiple_linear_model, 'DNN-model': dnn_model, 'nDNN-model': multiple_dnn_model}

In [22]:
def plot_prediction(name):
  model = all_model[name]
  if name == 'L-model' or name == 'DNN-model':
    predictions = model.predict(test_features["Horsepower"], verbose=0).flatten()
  else:
    predictions = model.predict(test_features, verbose=0).flatten()
  fig = make_subplots(rows=1, cols=2, subplot_titles=["Predicted values Vs. True values", "Error distribution"])
  # Subplot 1
  fig.add_trace(Scatter(x=test_labels, y=predictions, mode='markers', marker_color='blue', showlegend=False), row=1, col=1)
  xaxis = linspace(min(test_labels)*0.9,max(test_labels)*1.1,50)
  fig.add_trace(Scatter(x=xaxis, y=xaxis, mode='lines', marker_color='red', showlegend=False), row=1, col=1) # line y=x
  # Subplot 2
  fig.add_trace(Histogram(x=predictions - test_labels, nbinsx=25, marker_color='blue', showlegend=False), row=1, col=2)
  fig.update_yaxes(title_text="Predictions [MPG]", row=1, col=1)
  fig.update_yaxes(title_text="Count", range=[0, 25], row=1, col=2)
  # Layout and axis
  fig.update_layout(title=dict(text=name), width=900, height=500)
  fig.update_xaxes(title_text="True values [MPG]", row=1, col=1)
  fig.update_xaxes(title_text="Prediction error [MPG]", row=1, col=2)
  fig.show()

In [23]:
plot_prediction('nDNN-model')
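The right-hand graph of `plot_prediction` is a histogram of `predictions - test_labels`. A minimal plain-Python sketch of that computation, on made-up values and with hypothetical helper names, is given here to show how the errors are signed and binned.

```python
from collections import Counter
from math import floor

def prediction_errors(y_true, y_pred):
    """Signed error: positive when the model over-estimates MPG."""
    return [p - t for t, p in zip(y_true, y_pred)]

def histogram(values, bin_width):
    """Count values per bin; each bin is labelled by its lower edge."""
    return Counter(floor(v / bin_width) * bin_width for v in values)

# Illustrative true and predicted MPG values (assumption)
y_true = [20.0, 30.0, 25.0, 18.0]
y_pred = [22.0, 27.0, 25.5, 18.5]

errors = prediction_errors(y_true, y_pred)
counts = histogram(errors, bin_width=2.0)
mae = sum(abs(e) for e in errors) / len(errors)
print(errors, dict(counts), mae)
```

A histogram centered on zero suggests no systematic over- or under-estimation, while its width reflects the mean absolute error reported in the table.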