# Basic regression: Predict fuel efficiency

In a *regression* problem, we aim to predict the output of a continuous value, like a price or a probability. 

This notebook uses the classic [Auto MPG](https://archive.ics.uci.edu/ml/datasets/auto+mpg) Dataset and builds a model to predict the fuel efficiency. 

In [None]:
# Use seaborn for pairplot
!pip install seaborn

# Use some functions from tensorflow_docs
!pip install git+https://github.com/tensorflow/docs

In [None]:
import pathlib

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [None]:
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers

print(tf.__version__)
tf.random.set_seed(10)

In [None]:
import tensorflow_docs as tfdocs
import tensorflow_docs.plots
import tensorflow_docs.modeling

## The Auto MPG dataset

The dataset is available from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/).


### Get the data
First download the dataset.

In [None]:
dataset_path = keras.utils.get_file("auto-mpg.data", "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data")
dataset_path

Import it using pandas:

In [None]:
column_names = ['MPG','Cylinders','Displacement','Horsepower','Weight',
                'Acceleration', 'Model Year', 'Origin']
dataset = pd.read_csv(dataset_path, names=column_names,
                      na_values = "?", comment='\t',
                      sep=" ", skipinitialspace=True)

dataset.tail()
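
To see what the `read_csv` arguments above do, here is a minimal sketch that parses two made-up rows laid out like `auto-mpg.data` (space-padded columns, `?` for a missing value, and a tab before the quoted car name). The row contents and the names `sample_text` and `sample` are illustrative assumptions, not data from the real file.

```python
import io

import pandas as pd

# Two made-up rows in the same layout as auto-mpg.data (illustrative only):
# space-padded columns, "?" for a missing horsepower, tab before the car name.
sample_text = (
    '18.0   8   307.0      130.0      3504.      12.0   70  1\t"car a"\n'
    '25.0   4   98.00      ?          2046.      19.0   71  1\t"car b"\n'
)

column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']

# na_values="?" turns "?" into NaN, comment='\t' discards everything from the
# tab onward (the car name), and skipinitialspace=True collapses the padding.
sample = pd.read_csv(io.StringIO(sample_text), names=column_names,
                     na_values="?", comment='\t',
                     sep=" ", skipinitialspace=True)

print(sample.shape)
print(sample['Horsepower'].isna())
```

The car name never reaches the DataFrame because it sits after the tab, which is treated as the start of a comment.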

### Clean the data

The dataset contains a few unknown values. Drop the rows that contain them.


In [None]:
# Write code to remove the unknown values
dataset = dataset.dropna()
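
Before dropping rows it can help to count how many values are missing. The sketch below does this on a small made-up frame; `toy` and `cleaned` are hypothetical names, and the same calls should apply to `dataset`.

```python
import numpy as np
import pandas as pd

# Small made-up frame with one missing horsepower value (illustrative only)
toy = pd.DataFrame({
    'MPG': [18.0, 25.0, 21.0],
    'Horsepower': [130.0, np.nan, 90.0],
})

# Number of missing values per column
print(toy.isna().sum())

# dropna() removes every row that has at least one NaN
cleaned = toy.dropna()
print(len(toy), '->', len(cleaned))
```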

The `"Origin"` column is categorical, not numeric, so convert it to a one-hot encoding:

In [None]:
dataset['Origin'] = dataset['Origin'].map({1: 'USA', 2: 'Europe', 3: 'Japan'})

In [None]:
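# dtype=float keeps the one-hot columns numeric, so describe() and the
# normalization step below include them (newer pandas versions default to bool)
dataset = pd.get_dummies(dataset, prefix='', prefix_sep='', dtype=float)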
dataset.tail()
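
As a small illustration of the one-hot step, the sketch below applies the same mapping and `pd.get_dummies` call to a three-row made-up frame. The frame `toy` and its values are assumptions for demonstration only.

```python
import pandas as pd

# Made-up rows, one per origin code (illustrative only)
toy = pd.DataFrame({'MPG': [18.0, 31.0, 26.0], 'Origin': [1, 3, 2]})

# Replace the numeric codes with names, then expand them into indicator columns
toy['Origin'] = toy['Origin'].map({1: 'USA', 2: 'Europe', 3: 'Japan'})
encoded = pd.get_dummies(toy, prefix='', prefix_sep='', dtype=float)

print(encoded)
```

Each row ends up with exactly one `1.0` across the `Europe`, `Japan`, and `USA` columns.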

### Split the data into train and test

Now split the dataset into a training set and a test set.

We will use the test set in the final evaluation of our model.

In [None]:
train_dataset = dataset.sample(frac=0.8,random_state=0)
test_dataset = dataset.drop(train_dataset.index)
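
A quick sanity check for this kind of split is that the two parts do not share any rows and together cover the whole dataset. Here is a minimal sketch on a ten-row made-up frame; `toy`, `train`, and `test` are hypothetical names.

```python
import pandas as pd

# Ten made-up rows (illustrative only)
toy = pd.DataFrame({'MPG': range(10)})

# Same pattern as above: sample 80% for training, keep the rest for testing
train = toy.sample(frac=0.8, random_state=0)
test = toy.drop(train.index)

print(len(train), len(test))

# The two index sets should not overlap
print(train.index.intersection(test.index).empty)
```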

### Split features from labels

Separate the target value, or "label", from the features. This label is the value that you will train the model to predict.

In [None]:
train_labels = train_dataset.pop('MPG')
test_labels = test_dataset.pop('MPG')

### Normalize the data

Apply z-score normalization to both datasets, using the mean and standard deviation computed from the training set only.

In [None]:
# Compute per-feature statistics on the training set; they are used to
# normalize both the train and test datasets
train_stats = train_dataset.describe()
train_stats = train_stats.transpose()
train_stats

In [None]:
def norm(x):
  # z-score: subtract the training mean, divide by the training std
  return (x - train_stats['mean']) / train_stats['std']

normed_train_data = norm(train_dataset)
normed_test_data = norm(test_dataset)
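
The z-score of a value $x$ is $z = (x - \mu) / \sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation of that feature in the training set. The sketch below repeats the `norm` pattern on a tiny made-up frame to show that the normalized training features have mean 0 and standard deviation 1, while test rows are simply expressed in the same units. All values and the names `train`, `test`, and `stats` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up feature values (illustrative only)
train = pd.DataFrame({'Weight': [2000.0, 3000.0, 4000.0],
                      'Cylinders': [4.0, 6.0, 8.0]})
test = pd.DataFrame({'Weight': [3500.0], 'Cylinders': [8.0]})

# Per-feature statistics from the training data only
stats = train.describe().transpose()

def norm(x):
    # Columns of x are aligned with the index of the stats Series
    return (x - stats['mean']) / stats['std']

normed_train = norm(train)
normed_test = norm(test)

print(normed_train.mean(), normed_train.std())
print(normed_test)
```

The test row is scaled with the training statistics, so its values are generally not centered at zero.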

## The model

### Build the model

(Q7) Build a `Sequential` model with two densely connected hidden layers (64 units each, with `relu` activation) and an output layer that returns a single, continuous value. The model building steps must be wrapped in a function, `build_model`. Use the Adam optimizer with its default arguments and `mse` as the loss.

In [None]:
def build_model():
  # Two hidden layers of 64 ReLU units, then a single linear output unit
  model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=[len(train_dataset.keys())]),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)
  ])

  # 0.001 is Adam's default learning rate
  optimizer = tf.keras.optimizers.Adam(0.001)
  model.compile(loss='mse',
                optimizer=optimizer,
                metrics=['mae', 'mse'])
  return model

In [None]:
# Build and compile your model in this cell.
model = build_model()
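
To get a feel for the size of this network, the sketch below counts trainable parameters by hand. It assumes 9 input features (six numeric columns plus the three one-hot `Origin` columns); `dense_params` and `mlp_params` are hypothetical helpers, not part of Keras. The result can be compared against the total reported by `model.summary()`.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

def mlp_params(n_features, hidden_units, n_hidden_layers=2):
    """Parameters of equal-width hidden layers followed by one output unit."""
    total = 0
    n_inputs = n_features
    for _ in range(n_hidden_layers):
        total += dense_params(n_inputs, hidden_units)
        n_inputs = hidden_units
    # Final layer: a single output unit
    return total + dense_params(n_inputs, 1)

n_features = 9  # assumption: 6 numeric columns + 3 one-hot Origin columns

print(mlp_params(n_features, 64))   # two hidden layers of 64 units
print(mlp_params(n_features, 100))  # two hidden layers of 100 units
```

Going from 64 to 100 hidden units more than doubles the parameter count, mostly because of the hidden-to-hidden weight matrix.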

### Train the model

Train the model for 1000 epochs, and record the training and validation loss and metrics (`mae` and `mse`) in the `history` object.

In [None]:
EPOCHS = 1000

history = model.fit(
  normed_train_data, train_labels,
  epochs=EPOCHS, validation_split = 0.2, verbose=0,
  callbacks=[tfdocs.modeling.EpochDots()])

In [None]:
loss, mae, mse = model.evaluate(normed_test_data, test_labels, verbose=2)

print("Testing set Mean Abs Error: {:5.2f} MPG".format(mae))
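
The two metrics reported by `evaluate` are simple averages over the prediction errors: mean absolute error (MAE) averages the absolute errors, and mean squared error (MSE) averages their squares. Here is a minimal sketch of both on made-up numbers; `y_true` and `y_pred` are illustrative values, not model output.

```python
import numpy as np

# Made-up labels and predictions in MPG (illustrative only)
y_true = np.array([20.0, 30.0, 25.0])
y_pred = np.array([22.0, 27.0, 25.0])

errors = y_pred - y_true

mae = np.mean(np.abs(errors))  # average size of the error, in MPG
mse = np.mean(errors ** 2)     # average squared error, in MPG squared

print(mae, mse)
```

Because MSE squares each error, large errors contribute disproportionately to it, whereas MAE weights all errors linearly.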

Q. 8 Change the model architecture by increasing the units of both hidden layers to 100. Train the model again. What is the range of the training error after the 900th epoch of the training process? Try to think of why this is happening.

In [None]:
def build_model2():
  # Same architecture as build_model, but with 100 units per hidden layer
  model2 = keras.Sequential([
    layers.Dense(100, activation='relu', input_shape=[len(train_dataset.keys())]),
    layers.Dense(100, activation='relu'),
    layers.Dense(1)
  ])

  optimizer = tf.keras.optimizers.Adam(0.001)
  model2.compile(loss='mse',
                 optimizer=optimizer,
                 metrics=['mae', 'mse'])
  return model2

In [None]:
model2 = build_model2()

In [None]:
EPOCHS = 1000

history = model2.fit(
  normed_train_data, train_labels,
  epochs=EPOCHS, validation_split = 0.2, verbose=0,
  callbacks=[tfdocs.modeling.EpochDots()])
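
Keras stores the per-epoch values in `history.history`, a dictionary that maps each metric name (for example `'loss'`, `'mae'`, `'val_mae'`) to a list with one entry per epoch. One way to read off the range of a metric after a given epoch is sketched below. `metric_range` is a hypothetical helper, and `fake_history` is a synthetic stand-in generated from a formula, not a real training run.

```python
def metric_range(history_dict, key, start_epoch):
    """Return (min, max) of a metric from start_epoch to the last epoch."""
    values = history_dict[key][start_epoch:]
    return min(values), max(values)

# Synthetic stand-in for history.history: a smoothly decaying curve
# (illustrative only, not the output of model.fit)
fake_history = {'mae': [10.0 / (epoch + 1) for epoch in range(1000)]}

low, high = metric_range(fake_history, 'mae', 900)
print(low, high)
```

With a trained model, the equivalent call would be `metric_range(history.history, 'mae', 900)`.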

In [None]:
loss, mae, mse = model2.evaluate(normed_test_data, test_labels, verbose=2)

print("Testing set Mean Abs Error: {:5.2f} MPG".format(mae))

Q.10 We will now change the loss from `mse` to `mae`. The loss on the test data after training the best model lies in the range:

In [None]:
def build_model3():
  # Same architecture as build_model2, but trained with mean absolute error
  model3 = keras.Sequential([
    layers.Dense(100, activation='relu', input_shape=[len(train_dataset.keys())]),
    layers.Dense(100, activation='relu'),
    layers.Dense(1)
  ])

  optimizer = tf.keras.optimizers.Adam(0.001)
  model3.compile(loss='mae',
                 optimizer=optimizer,
                 metrics=['mae', 'mse'])
  return model3

In [None]:
model3 = build_model3()

In [None]:
EPOCHS = 1000

history = model3.fit(
  normed_train_data, train_labels,
  epochs=EPOCHS, validation_split = 0.2, verbose=0,
  callbacks=[tfdocs.modeling.EpochDots()])

In [None]:
loss, mae, mse = model3.evaluate(normed_test_data, test_labels, verbose=2)

print("Testing set Mean Abs Error: {:5.2f} MPG".format(mae))