# Lab 4: Basic regression - Predict fuel efficiency



## Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns # we use this library to load the dataset
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

## Load data

In [2]:
# Load the 'mpg' dataset using seaborn library into a Pandas DataFrame
df = sns.load_dataset('mpg')

The MPG dataset can be viewed online at  
https://github.com/mwaskom/seaborn-data/blob/master/mpg.csv

## Data Exploration - Pandas Review

### Show the first 5 rows of the dataset

In [3]:
#your code here

### Show the size of the dataframe

In [5]:
#your code here

### Find the column names and their types (numerical or categorical)

In [7]:
#your code here

### Find the number of missing values in each column

In [11]:
#your code here

### Handle the missing values in the dataframe

Since the number of missing values is low, we could simply drop the rows containing them. However, as practice and review, let's substitute the missing values in the numerical columns (if any) with the mean of the respective column, and the missing values in the categorical columns (if any) with the mode (most frequent value) of the respective column, since a median is not defined for unordered categories.
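
As a rough illustration of this kind of imputation, here is a minimal sketch on a small made-up DataFrame (the `toy` table and its values are hypothetical, not the mpg data). It fills numerical gaps with the column mean and uses the most frequent value for the text column, since a median cannot be computed for unordered text categories.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the idea.
toy = pd.DataFrame({
    "horsepower": [100.0, np.nan, 140.0],
    "origin": ["usa", None, "usa"],
})

# Separate numerical and non-numerical (categorical) columns.
num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns

# Numerical columns: fill gaps with the column mean.
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].mean())

# Categorical columns: fill gaps with the most frequent value.
for col in cat_cols:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```

The same pattern should carry over to the lab DataFrame, but check which columns actually contain missing values first.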

In [13]:
#your solution here

### Compute the average and the median weight

In [15]:
#your code here

### Find the number of cars that weigh more than 2000 lbs

In [18]:
#your code here

### Find how many cars there are for each number of cylinders

In [20]:
#your code here

### Find the car models that have 3 or 5 cylinders

In [22]:
#your code here

### Show the `value_counts()` of the `origin` column, or show its unique values.

In [24]:
#your code here

## Data Preprocessing

### Use one-hot encoding to change the categorical values of the `origin` column to numerical values.

- Use `pd.get_dummies()` to do the encoding.
- Join the original DataFrame with the new dummy DataFrame using `pd.concat()` with `axis=1` to concatenate horizontally (column-wise).
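
The two steps above can be sketched on a small made-up table (the `toy` DataFrame is hypothetical and only mimics the shape of the problem). `dtype=float` is an optional choice here so that the dummy columns are numeric rather than boolean.

```python
import pandas as pd

# Hypothetical toy data with one categorical column.
toy = pd.DataFrame({
    "mpg": [18.0, 31.0, 26.0],
    "origin": ["usa", "japan", "europe"],
})

# One column per category, with 1.0 marking the row's category.
dummies = pd.get_dummies(toy["origin"], dtype=float)

# axis=1 places the dummy columns next to the original columns.
encoded = pd.concat([toy, dummies], axis=1)

print(encoded)
```

Note that the original `origin` column is still present after the concatenation, so it has to be dropped separately.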

In [25]:
#your code here

### Remove the `name` and `origin` columns from the DataFrame so that all remaining columns are numerical.

In [28]:
#your code here

### Does the input need reshaping?

In [30]:
#your code here

### Form features `X` and labels `y` based on the processed DataFrame

In [31]:
#your code here
X = None
y = None

### Split the data into training and test sets and form `train_features`, `train_labels`, `test_features`, `test_labels`

In [37]:
from sklearn.model_selection import train_test_split
#your code here
train_features, test_features, train_labels, test_labels = ...

### For simplicity in the following steps, convert the dataset from a pandas DataFrame to a numpy array.

In [38]:
train_features = np.array(train_features)
train_labels = np.array(train_labels)
test_features = np.array(test_features)
test_labels = np.array(test_labels)

### Do some sanity checks on the shape of the data before building a model

In [None]:
# your code here

## Normalization layer

To ensure stable training of neural networks, we typically normalize the data. This process also enhances the convergence of the gradient descent algorithm.

There is no single way to normalize the data. You can also use `scikit-learn` or `pandas` to do it. However, in this lab we will use the normalization layer provided by TensorFlow, which integrates with the other parts of the model.

The `tf.keras.layers.Normalization` is a clean and simple way to add feature normalization into your model.

The first step is to create the layer:

In [42]:
normalizer = tf.keras.layers.Normalization(axis=-1)

Then, fit the state of the preprocessing layer to the data by calling `Normalization.adapt`.

It calculates the mean and variance of each feature and stores them in the layer.

In [43]:
normalizer.adapt(train_features)

When the layer is called, it returns the input data, with each feature independently normalized.

In [44]:
first = train_features[0]
print('First example:', first)
print()
print('Normalized:', normalizer(first).numpy())

First example: [6.00e+00 2.25e+02 1.10e+02 3.62e+03 1.87e+01 7.80e+01 0.00e+00 0.00e+00
 1.00e+00]

Normalized: [[ 0.3048616   0.2845775   0.14142872  0.7548031   1.1217592   0.4945284
  -0.42559615 -0.50199604  0.74128604]]
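
To make the computation concrete, the following NumPy sketch reproduces the per-feature standardization, `(x - mean) / sqrt(variance)`, on made-up numbers. The `features` array is hypothetical, and the sketch leaves out the small numerical safeguards a real layer applies.

```python
import numpy as np

# Hypothetical feature matrix: 3 examples, 2 features on very different scales.
features = np.array([
    [1.0, 200.0],
    [2.0, 400.0],
    [3.0, 600.0],
])

# Statistics are computed per feature (per column), as adapt() does.
mean = features.mean(axis=0)
variance = features.var(axis=0)

# Each feature is shifted to zero mean and scaled to unit variance.
normalized = (features - mean) / np.sqrt(variance)

print(normalized)
```

After this transformation both columns are on the same scale, which is the property that helps gradient descent converge.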


## **Approach #1:** Regression using `Linear Regression`

**You are welcome to use scikit-learn to perform linear regression on this dataset.**

However, here we aim to implement it using TensorFlow.

- As we saw in Lab Week 2, `logistic regression` is essentially a single neuron with a `sigmoid` activation function.

- Similarly, `linear regression` can be viewed as a single neuron with a `linear` activation function.
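
The single-neuron view can be sketched in plain NumPy. The weights, bias, and input below are arbitrary made-up values; the only difference between the two cases is the activation applied to the same weighted sum.

```python
import numpy as np

def linear(z):
    # Identity activation: the output is the weighted sum itself.
    return z

def sigmoid(z):
    # Squashes the weighted sum into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, activation):
    # A single neuron: weighted sum of the inputs plus a bias, then an activation.
    return activation(x @ w + b)

# Hypothetical input, weights, and bias.
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
b = 0.1

regression_output = neuron(x, w, b, linear)       # unbounded real value
classification_output = neuron(x, w, b, sigmoid)  # value between 0 and 1

print(regression_output, classification_output)
```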

### **Step 1:** Linear regression model architecture

In [45]:
linear_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(9,)),
    normalizer,
    layers.Dense(1, activation='linear')
])

**Note:** You can define your model all at once, as in the cell above, or you can build the model incrementally (suitable for your assignment).

In [46]:
# Defining the model incrementally (suitable for your assignment)
linear_model = tf.keras.Sequential()
linear_model.add(tf.keras.layers.Input(shape=(9,)))
linear_model.add(normalizer)
linear_model.add(layers.Dense(1, activation='linear'))

### **Step 2:** Configure the model with Keras `Model.compile()`

The most important arguments to `compile()` are the `loss` and the `optimizer`: the loss defines what will be optimized (`"mean_absolute_error"`) and the optimizer defines how (`tf.keras.optimizers.Adam(learning_rate=0.1)`).

**Arguments:**
- `optimizer=tf.keras.optimizers.Adam(learning_rate=0.1)`
- `loss='mean_absolute_error'`
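
For intuition about what this loss measures, here is a minimal NumPy sketch of mean absolute error on made-up values. The helper `mean_absolute_error` below is a hypothetical stand-in written for illustration, not the Keras implementation.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    # Average of the absolute differences between targets and predictions.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical MPG targets and predictions.
y_true = [20.0, 30.0, 25.0]
y_pred = [22.0, 27.0, 25.0]

mae = mean_absolute_error(y_true, y_pred)
print(mae)
```

Because the error is in the same units as the label, the loss can be read directly as "MPG off on average".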

In [47]:
#your code here


### **Step 3:** Train the model using `Model.fit()` for `100` epochs, and store the output in a variable named `history`.

In [None]:
history = linear_model.fit(train_features, train_labels, epochs=100)

In [None]:
history.history

In [None]:
def plot_loss(history):
  plt.plot(history.history['loss'], label='loss')
  plt.xlabel('Epoch')
  plt.ylabel('Error [MPG]')
  plt.legend()
  plt.grid(True)

plot_loss(history)

### Get the model summary

In [None]:
linear_model.summary()

### **Step 4:** Evaluate the linear model on the test set using Keras `Model.evaluate()`, check the `mean_absolute_error`, and save the result for future comparison.

In [53]:
#your code here

## **Approach #2:** Regression using a `Deep Neural Network (DNN)`

### Solve the same problem using a deep neural network with the following architecture:
- 1st hidden layer no. of units = 64
- 2nd hidden layer no. of units = 64
- Choose appropriate `activation` functions for hidden and output layers

In [62]:
#your code here


### Print the model summary (after training). How many parameters are there in the model?

In [55]:
#your code here

## Compare the evaluation result of the two approaches, i.e., linear regression and deep neural network.

In [56]:
#your code here

## Use the following large model and evaluate it on the test set.

In [57]:
model_dnn_large = tf.keras.Sequential([
    normalizer,
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='linear')
])


In [None]:
# your code here

### Explain your observation. Why do you think the large model is not performing well?

- Hint: when the number of trainable parameters is very large (even larger than the number of data points), the model may overfit the training data. One way to solve this problem is to use more data.
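
To put numbers on the hint, the sketch below counts the trainable parameters of a stack of dense layers, using the rule that each layer has `(inputs + 1) * units` weights and biases. The helper `dense_param_count` is hypothetical, the layer sizes are those of the large model above with 9 input features, and the row count of 398 is an assumption about the full mpg dataset before any train/test split.

```python
def dense_param_count(n_inputs, layer_units):
    # Each dense layer has one weight per (input, unit) pair plus one bias per unit.
    total = 0
    for units in layer_units:
        total += (n_inputs + 1) * units
        n_inputs = units  # this layer's outputs feed the next layer
    return total

# Layer sizes of the large model: four hidden layers of 64 units and one output unit.
large_model_params = dense_param_count(9, [64, 64, 64, 64, 1])

# Assumption: approximate number of rows in the full dataset.
assumed_rows = 398

print("Trainable parameters:", large_model_params)
print("Parameters per data point:", large_model_params / assumed_rows)
```

Comparing this figure with `Model.summary()` is a quick way to confirm the count for your own models.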