<a href="https://colab.research.google.com/github/GrzegorzMeller/AlgorithmsForMassiveData/blob/master/MPG_PREDICTION.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercises 14/05

In regression problems, we want to predict a continuous value. Contrast this with a classification problem, where we want to pick a class from a list of classes.

In the 14/05 lecture we will focus on regression. We will use the [Auto MPG](https://archive.ics.uci.edu/ml/datasets/auto+mpg) dataset to build a model that predicts the fuel efficiency of late-1970s and early-1980s automobiles. Fuel efficiency is predicted from several attributes, such as the number of cylinders, displacement, horsepower, and weight.

The dataset can be downloaded as follows:


In [1]:
import keras

dataset_path = keras.utils.get_file("auto-mpg.data", "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data")
dataset_path

Using TensorFlow backend.


Downloading data from http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data


'/root/.keras/datasets/auto-mpg.data'

Pandas can be used to first preprocess the dataset:

In [2]:
import pandas as pd

column_names = ['MPG','Cylinders','Displacement','Horsepower','Weight',
                'Acceleration', 'Model Year', 'Origin']

raw_dataset = pd.read_csv(dataset_path, names=column_names,
                      na_values = "?", comment='\t',
                      sep=" ", skipinitialspace=True)

dataset = raw_dataset.copy()
dataset.tail()

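|     | MPG  | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model Year | Origin |
|-----|------|-----------|--------------|------------|--------|--------------|------------|--------|
| 393 | 27.0 | 4         | 140.0        | 86.0       | 2790.0 | 15.6         | 82         | 1      |
| 394 | 44.0 | 4         | 97.0         | 52.0       | 2130.0 | 24.6         | 82         | 2      |
| 395 | 32.0 | 4         | 135.0        | 84.0       | 2295.0 | 11.6         | 82         | 1      |
| 396 | 28.0 | 4         | 120.0        | 79.0       | 2625.0 | 18.6         | 82         | 1      |
| 397 | 31.0 | 4         | 119.0        | 82.0       | 2720.0 | 19.4         | 82         | 1      |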


The dataset contains a few unknown values:

In [3]:
dataset.isna().sum()

MPG             0
Cylinders       0
Displacement    0
Horsepower      6
Weight          0
Acceleration    0
Model Year      0
Origin          0
dtype: int64

Simply drop the rows that contain unknown values.
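
One way to drop such rows in pandas is `DataFrame.dropna()`. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in for `dataset`) so that it runs on its own; on the real data the equivalent call would presumably be `dataset = dataset.dropna()`. Note that the solution cells further down impute missing values with the column mean instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `dataset`: one row has a missing Horsepower value.
toy = pd.DataFrame({
    "MPG": [18.0, 25.0, 31.0],
    "Horsepower": [130.0, np.nan, 82.0],
    "Weight": [3504.0, 2046.0, 2720.0],
})

# dropna() returns a new frame without the rows that contain any NaN.
cleaned = toy.dropna()

print(cleaned.isna().sum())     # per-column count of missing values
print(len(toy), len(cleaned))   # row counts before and after
```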

Build the training set and the test set as follows:

In [4]:
train_dataset = dataset.sample(frac = 0.8, random_state = 0)
test_dataset = dataset.drop(train_dataset.index)
train_dataset

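|     | MPG  | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model Year | Origin |
|-----|------|-----------|--------------|------------|--------|--------------|------------|--------|
| 65  | 14.0 | 8         | 351.0        | 153.0      | 4129.0 | 13.0         | 72         | 1      |
| 132 | 25.0 | 4         | 140.0        | 75.0       | 2542.0 | 17.0         | 74         | 1      |
| 74  | 13.0 | 8         | 302.0        | 140.0      | 4294.0 | 16.0         | 72         | 1      |
| 78  | 21.0 | 4         | 120.0        | 87.0       | 2979.0 | 19.5         | 72         | 2      |
| 37  | 18.0 | 6         | 232.0        | 100.0      | 3288.0 | 15.5         | 71         | 1      |
| ... | ...  | ...       | ...          | ...        | ...    | ...          | ...        | ...    |
| 207 | 20.0 | 4         | 130.0        | 102.0      | 3150.0 | 15.7         | 76         | 2      |
| 279 | 29.5 | 4         | 98.0         | 68.0       | 2135.0 | 16.6         | 78         | 3      |
| 227 | 19.0 | 6         | 225.0        | 100.0      | 3630.0 | 17.7         | 77         | 1      |
| 148 | 26.0 | 4         | 116.0        | 75.0       | 2246.0 | 14.0         | 74         | 2      |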


In [9]:
test_dataset

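|     | MPG  | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model Year | Origin |
|-----|------|-----------|--------------|------------|--------|--------------|------------|--------|
| 9   | 15.0 | 8         | 390.0        | 190.0      | 3850.0 | 8.5          | 70         | 1      |
| 25  | 10.0 | 8         | 360.0        | 215.0      | 4615.0 | 14.0         | 70         | 1      |
| 28  | 9.0  | 8         | 304.0        | 193.0      | 4732.0 | 18.5         | 70         | 1      |
| 31  | 25.0 | 4         | 113.0        | 95.0       | 2228.0 | 14.0         | 71         | 3      |
| 32  | 25.0 | 4         | 98.0         | NaN        | 2046.0 | 19.0         | 71         | 1      |
| ... | ...  | ...       | ...          | ...        | ...    | ...          | ...        | ...    |
| 368 | 27.0 | 4         | 112.0        | 88.0       | 2640.0 | 18.6         | 82         | 1      |
| 370 | 31.0 | 4         | 112.0        | 85.0       | 2575.0 | 16.2         | 82         | 1      |
| 382 | 34.0 | 4         | 108.0        | 70.0       | 2245.0 | 16.9         | 82         | 3      |
| 384 | 32.0 | 4         | 91.0         | 67.0       | 1965.0 | 15.7         | 82         | 3      |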


Then, separate the target value, or "label", from the features. The label is the value that the model is trained to predict; here it is the MPG (miles per gallon) column.
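
A minimal sketch of this separation, assuming the label column is named `MPG` as in the tables above. The frame `toy_train` is a hypothetical stand-in for `train_dataset`, and `DataFrame.pop` is only one of several ways to split off a column.

```python
import pandas as pd

# Hypothetical stand-in for `train_dataset`, with a subset of its columns.
toy_train = pd.DataFrame({
    "MPG": [14.0, 25.0, 13.0],
    "Cylinders": [8, 4, 8],
    "Weight": [4129.0, 2542.0, 4294.0],
})

# Work on a copy so the original frame keeps its MPG column.
features = toy_train.copy()
# pop() removes the column from `features` and returns it as a Series.
labels = features.pop("MPG")

print(list(features.columns))
print(labels.tolist())
```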

After the preprocessing step:

- Build a neural network to predict the MPG value from the provided data.
- Use the Mean Squared Error (MSE) as the loss function. MSE is a common loss function for regression problems.
- Evaluation metrics for regression differ from those used for classification. A common regression metric that you can try is the Mean Absolute Error (MAE).
- Try to scale the input data and the output data using Normalization or Standardization.
- Do you notice any improvement in the performance when data scaling is applied?
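
To make the two quantities in the list above concrete: MSE averages the squared differences between true and predicted values, while MAE averages their absolute differences. The NumPy sketch below uses made-up numbers, and the helper names `mse` and `mae` are illustrative rather than part of any library.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean Absolute Error: average of the absolute residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Made-up MPG values, for illustration only.
y_true = [20.0, 30.0, 25.0]
y_pred = [22.0, 27.0, 25.0]

print(mse(y_true, y_pred))
print(mae(y_true, y_pred))
```

MAE is expressed in the same unit as the target (MPG here), which makes it easier to interpret, whereas MSE is in squared units and penalizes large errors more heavily.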

In [41]:
# Features are every column except MPG; the label is the MPG column
x_train = train_dataset[train_dataset.columns[1:]]
y_train = train_dataset.MPG.astype(float)

x_test = test_dataset[test_dataset.columns[1:]]
y_test = test_dataset.MPG.astype(float)

# Replace NaN values (only Horsepower has any) with the training-set column means,
# so that no test-set statistics are used during preprocessing
train_means = x_train.mean()
x_train = x_train.fillna(train_means)
x_test = x_test.fillna(train_means)
print(type(y_train))
print(x_train.shape, y_train.shape, x_test.shape, y_test.shape)

<class 'pandas.core.series.Series'>
(318, 7) (318,) (80, 7) (80,)


In [44]:
from tensorflow import keras

# Neural network: two hidden layers of 64 ReLU units and one linear output unit for MPG
model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(x_train.shape[1],)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1)])

model.compile(optimizer='adam',
              loss='mean_squared_error',
              metrics=['mean_absolute_error'])

model.summary()

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_18 (Dense)             (None, 64)                512       
_________________________________________________________________
dense_19 (Dense)             (None, 64)                4160      
_________________________________________________________________
dense_20 (Dense)             (None, 1)                 65        
Total params: 4,737
Trainable params: 4,737
Non-trainable params: 0
_________________________________________________________________
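
The parameter counts in the summary can be checked by hand: a `Dense` layer with `n` inputs and `m` units has `n * m` weights plus `m` biases. The short sketch below reproduces the numbers; `dense_params` is a hypothetical helper written for this check.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# 7 input features, followed by the three Dense layers of the model.
layer_sizes = [7, 64, 64, 1]
counts = [dense_params(n_in, n_out)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

print(counts)       # [512, 4160, 65]
print(sum(counts))  # 4737
```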


In [45]:
history = model.fit(x_train,
                    y_train,
                    batch_size=60,
                    epochs=30,
                    validation_data=(x_test,y_test),
                    )

Epoch 1/30
...
Epoch 30/30


In [46]:
# Data normalization: rescale every feature to the [0, 1] range
from sklearn import preprocessing

# Fit the scaler on the training data only, then reuse it for the test data
# so that both sets are mapped with the same minimum and maximum
scaler = preprocessing.MinMaxScaler()
scaler.fit(x_train)
x_train_norm = scaler.transform(x_train)
x_test_norm = scaler.transform(x_test)

print(x_train_norm.shape, x_test_norm.shape)

(318, 7) (80, 7)
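
Standardization is the other scaling option mentioned in the exercise: each feature is shifted to zero mean and scaled to unit variance. The NumPy sketch below shows the computation on made-up arrays; `standardize_fit` and `standardize_apply` are hypothetical helper names, and the statistics are taken from the training rows only and then reused for the test rows, which is the usual practice. scikit-learn's `StandardScaler` provides the same computation through `fit` and `transform`.

```python
import numpy as np

def standardize_fit(x):
    """Return the per-column mean and standard deviation of the training data."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    # Guard against constant columns to avoid division by zero.
    std = np.where(std == 0, 1.0, std)
    return mean, std

def standardize_apply(x, mean, std):
    """Apply previously computed statistics to any data set."""
    return (x - mean) / std

# Made-up (Cylinders, Weight) rows, for illustration only.
x_train_toy = np.array([[8.0, 4129.0], [4.0, 2542.0], [6.0, 3288.0]])
x_test_toy = np.array([[4.0, 2228.0]])

mean, std = standardize_fit(x_train_toy)
x_train_std = standardize_apply(x_train_toy, mean, std)
x_test_std = standardize_apply(x_test_toy, mean, std)

print(x_train_std.mean(axis=0))  # close to 0 for every column
print(x_train_std.std(axis=0))   # close to 1 for every column
```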


In [47]:
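# Start from fresh weights so the comparison with the unscaled run is fair
model = keras.models.clone_model(model)
model.compile(optimizer='adam',
              loss='mean_squared_error',
              metrics=['mean_absolute_error'])
history = model.fit(x_train_norm,
                    y_train,
                    batch_size=60,
                    epochs=30,
                    validation_data=(x_test_norm,y_test),
                    )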

Epoch 1/30
...
Epoch 30/30
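
The exercise also suggests scaling the output data, which the cells above do not do (`y_train` stays in MPG units). The following is a sketch, under stated assumptions, of how the target could be standardized and how predictions and errors could be mapped back to MPG. All numbers are made up, and `scaled_predictions` is a placeholder for what `model.predict` would return for a network trained on the scaled targets.

```python
import numpy as np

# Made-up training targets in MPG, for illustration only.
y_train_toy = np.array([14.0, 25.0, 13.0, 21.0, 18.0])

# Statistics come from the training targets only.
y_mean = y_train_toy.mean()
y_std = y_train_toy.std()

# Scaled targets are what the network would be trained on.
y_train_scaled = (y_train_toy - y_mean) / y_std

# Hypothetical predictions in the scaled space (placeholder for model output).
scaled_predictions = np.array([-0.5, 1.0])

# Invert the transform to report predictions in MPG again.
predictions_mpg = scaled_predictions * y_std + y_mean

# An MAE measured on scaled targets converts back to MPG by multiplying by y_std.
mae_scaled = 0.2  # made-up value
mae_mpg = mae_scaled * y_std

print(predictions_mpg)
print(mae_mpg)
```

Without this conversion, the MAE of a model trained on scaled targets is not directly comparable with the MAE of a model trained on raw MPG values.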
