# Fuel Efficiency

Source: https://www.tensorflow.org/alpha/tutorials/keras/basic_regression

Dataset: https://archive.ics.uci.edu/ml/datasets/auto+mpg

We are going to use the [Auto MPG](https://archive.ics.uci.edu/ml/datasets/auto+mpg) Dataset to predict the fuel efficiency of late-1970s and early-1980s automobiles. To do this we'll use TensorFlow 2.0 and Keras.

First, install our libraries: 
* seaborn
* matplotlib pyplot
* pandas
* Tensorflow

In [0]:
!pip install tensorflow==2.0.0-alpha0

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

import tensorflow as tf

print(tf.__version__)

## The Auto MPG dataset


### Get the data
We will need to download the dataset using [tf.keras.utils.get_file](https://www.tensorflow.org/api_docs/python/tf/keras/utils/get_file).

Link to the dataset: http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data

In [0]:
dataset_path = tf.keras.utils.get_file("auto-mpg.data", "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data")

Now that the file is saved on our Google Colab, we will need to import it into a Pandas DataFrame so we can manipulate it. Use [pd.read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) to import it. Then display the first five rows so we can see what the dataset looks like.

In [0]:
raw_dataset = pd.read_csv(dataset_path)
raw_dataset.head()

Something is off with this dataset. Let's see if we can clean things up using the parameters in [pd.read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).

Define: 
* column_names
* na_values
* comment
* sep
* skipinitialspace

Check out the documentation on [pd.read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) to see what all these parameters do. Once you've imported the data, display the first five rows again.

In [0]:
column_names = ['MPG','Cylinders','Displacement','Horsepower','Weight',
                'Acceleration', 'Model Year', 'Origin']
dataset = pd.read_csv(dataset_path, names=column_names,
                      na_values = "?", comment='\t',
                      sep=" ", skipinitialspace=True)

dataset.head()
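
To see what these `read_csv` parameters do in isolation, here is a minimal sketch that parses two sample rows from an in-memory string instead of the downloaded file. The rows are written in the layout of `auto-mpg.data` (space-padded numbers, a tab, then a quoted name); the car names `"car a"` and `"car b"` and the variable names `sample_text` and `sample` are placeholders introduced for this illustration.

```python
import io

import pandas as pd

# Two sample rows in the layout of auto-mpg.data: space-padded numeric
# fields, then a tab, then a quoted car name. "?" marks a missing value.
sample_text = (
    '18.0   8   307.0      130.0      3504.      12.0   70  1\t"car a"\n'
    '25.0   4   98.00      ?          2046.      19.0   71  1\t"car b"\n'
)

column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']

sample = pd.read_csv(
    io.StringIO(sample_text),
    names=column_names,      # the data has no header row
    na_values="?",           # treat "?" as missing (NaN)
    comment='\t',            # ignore everything from the tab onward (the car name)
    sep=" ",                 # fields are separated by spaces
    skipinitialspace=True,   # skip the padding spaces after each separator
)

print(sample)
print(sample.isna().sum())   # Horsepower should report one missing value
```

Without `comment='\t'` the quoted name would be read as an extra field, and without `na_values="?"` the `Horsepower` column would be read as text rather than numbers.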

### Clean the data

First, we'll need to check for missing data in our dataset. Pandas has a built-in method for this, [isna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isna.html). Run it on the dataset and take the sum to count the missing values in each column.

In [0]:
dataset.isna().sum()

Looks like there are some missing values. Let's get rid of them by using the [dropna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) function. Since we are dropping so few rows, it won't have a large impact on our dataset.

In [0]:
dataset = dataset.dropna()
dataset.isna().sum()

Next, the `"Origin"` column is really categorical, not numeric. We will need to convert it using **one-hot encoding**, which replaces the single column with one 0/1 column per category. Pandas has a built-in function, [get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html), that can help us here.

In [0]:
dataset = pd.get_dummies(dataset, columns=['Origin'], dtype=float)
dataset.head()

As you'll notice, `get_dummies` did not produce descriptive column names. Let's update these manually using the Pandas [rename](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html) function. We will need to create a dictionary that maps the generated names to the correct ones.

Note that:
* USA = 1
* Europe = 2
* Japan = 3

In [0]:
origin_names = {'Origin_1':'USA', 'Origin_2':'Europe', 'Origin_3': 'Japan'}
dataset.rename(index=str, columns=origin_names, inplace=True)
dataset.head()
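
The encode-then-rename steps above can be sketched on a tiny frame to make the result easier to inspect. The frame `toy` and its values are made up for this illustration; only the `Origin` codes (1 = USA, 2 = Europe, 3 = Japan) follow the mapping given above.

```python
import pandas as pd

# Made-up rows: only the Origin codes matter for this illustration.
toy = pd.DataFrame({'MPG': [18.0, 26.0, 31.0, 24.0],
                    'Origin': [1, 2, 3, 1]})

# One-hot encode: Origin becomes Origin_1, Origin_2, Origin_3 (0.0 or 1.0).
encoded = pd.get_dummies(toy, columns=['Origin'], dtype=float)

# Give the generated columns readable names.
encoded = encoded.rename(
    columns={'Origin_1': 'USA', 'Origin_2': 'Europe', 'Origin_3': 'Japan'})

print(encoded)
```

Each row has exactly one `1.0` among the three origin columns, so no artificial ordering between the categories is implied.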

## Exploring the Data


### Graphs
Before we create our label datasets, let's explore the Auto MPG dataset. We can use seaborn's [pairplot](https://seaborn.pydata.org/generated/seaborn.pairplot.html) to automatically build graphs displaying our features. 

Plot out:
- MPG
- Cylinders
- Displacement 
- Weight

In [0]:
sns.pairplot(dataset[["MPG", "Cylinders", "Displacement", "Weight"]], diag_kind="kde")

Here are some interesting things to note from the graphs:
* Most cars are 4, 6, or 8 cylinder cars, with a few 3 and 5 cylinder cars. 
* The cylinders vs MPG graph shows us a downward trend: more cylinders generally means lower gas mileage.
* We also see the same trend with displacement vs MPG and weight vs MPG. 
* More cylinders = more displacement = more weight. 
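
The trends read off the graphs can also be checked numerically with a correlation matrix. This is a minimal sketch on made-up values (the frame `toy` is not taken from the Auto MPG data); on the real data you could call `dataset.corr()` in the same way.

```python
import pandas as pd

# Made-up values that mimic the trend in the plots:
# heavier cars with more cylinders get fewer miles per gallon.
toy = pd.DataFrame({
    'MPG':       [35.0, 30.0, 24.0, 19.0, 15.0],
    'Cylinders': [4, 4, 6, 8, 8],
    'Weight':    [2000, 2500, 3000, 3500, 4000],
})

# Pearson correlation between every pair of columns.
corr = toy.corr()
print(corr['MPG'])   # negative values mean the feature rises as MPG falls
```

A correlation close to -1 between a feature and MPG corresponds to the downward slopes seen in the pairplot.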


### Statistics
We can also look at the overall statistics of our dataset using Pandas [describe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) and [transpose](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transpose.html) functions:

*Don't forget to pop off the MPG*

In [0]:
stats = dataset.describe()
stats.pop("MPG")
stats = stats.transpose()
stats

## Training and Testing

### Split the data into train and test

Now split the dataset into a training set and a test set. We can use the Pandas [sample](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html) function to grab a random 80% selection from the dataset. 

We can then use the Pandas [drop](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) function to remove the sampled rows from the dataset, which leaves the remaining 20% to use as our testing data.

Don't forget to [pop](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pop.html) off the labels as well. 

In [0]:
train_data = dataset.sample(frac=0.8)
test_data = dataset.drop(train_data.index)
train_labels = train_data.pop('MPG')
test_labels = test_data.pop('MPG')
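
As a small illustration of the sample-then-drop pattern, the sketch below splits a made-up ten-row frame. The `random_state=0` argument is an addition for this example: it makes the random selection repeatable, which the cell above does not do.

```python
import pandas as pd

# Made-up frame with ten rows so the 80/20 split is easy to count.
toy = pd.DataFrame({'MPG': range(10, 20), 'Weight': range(2000, 3000, 100)})

train = toy.sample(frac=0.8, random_state=0)  # random 80% of the rows
test = toy.drop(train.index)                  # whatever was not sampled

# pop removes the label column from the features and returns it.
train_labels = train.pop('MPG')
test_labels = test.pop('MPG')

print(len(train), len(test))   # 8 2
```

Because `drop` works on the index labels returned by `sample`, no row can end up in both sets.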

### Normalize the data

Look again at the `stats` block above and note how different the ranges of each feature are.

We want to create a function to normalize the data. Features on very different scales can make training harder and tie the model to the units used for each input, so we rescale every feature to a comparable range.

To accomplish this, we will build a function that subtracts each feature's mean from the data and then divides the result by that feature's standard deviation.

For example: (data - mean) / standard deviation

In [0]:
def norm(x):
  return (x - stats['mean']) / stats['std']
normed_train_data = norm(train_data)
normed_test_data = norm(test_data)

This normalized data is what we will use to train the model.

Caution: The statistics used to normalize the inputs here (mean and standard deviation) need to be applied to any other data that is fed to the model, along with the one-hot encoding that we did earlier.  That includes the test data.
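
The caution above can be sketched on a tiny example. One assumption here differs from the notebook: the statistics (`train_stats`) are computed from the training rows only and then reused for the test rows, which is a common variant. The frames and values are made up for this illustration.

```python
import pandas as pd

# Made-up training and test features.
train = pd.DataFrame({'Weight': [2000.0, 3000.0, 4000.0],
                      'Cylinders': [4.0, 6.0, 8.0]})
test = pd.DataFrame({'Weight': [2500.0], 'Cylinders': [4.0]})

# Assumption for this sketch: statistics come from the training rows only.
train_stats = train.describe().transpose()

def norm(x):
    # (data - mean) / standard deviation, applied column by column.
    return (x - train_stats['mean']) / train_stats['std']

normed_train = norm(train)
normed_test = norm(test)   # the same statistics are reused for the test data

print(normed_train)
print(normed_test)
```

After normalization each training column has mean 0 and standard deviation 1, while the test rows are simply expressed on that same scale.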

## The model

### Build the model

Let's build our model. Here, we'll use a `Sequential` model with two densely connected hidden layers, and an output layer that returns a single, continuous value. We will need to define the layers and compile the model.

For the compile step we are going to pass an explicitly configured optimizer, [tf.keras.optimizers.RMSprop](https://keras.io/optimizers/), with a learning rate of 0.001. RMSprop is an adaptive learning rate optimization algorithm designed for neural networks; it was never formally published. Click [here](https://towardsdatascience.com/understanding-rmsprop-faster-neural-network-learning-62e116fcf29a) for more information.

Since this is a regression problem, we'll use [mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error) as the loss and track both mean squared error and [mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error) as metrics. *Hint: Keras abbreviates these to mse and mae*


In [0]:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=[len(train_data.keys())]),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
  ])

In [0]:
# RMSprop optimizer with a learning rate of 0.001
optimizer = tf.keras.optimizers.RMSprop(0.001)

model.compile(loss='mse',
              optimizer=optimizer,
              metrics=['mae', 'mse'])
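
To make the two quantities named in the compile call concrete, here is a small worked example in plain Python. The true and predicted values are made up; the point is only how MAE and MSE are computed from the same errors.

```python
# Made-up true and predicted MPG values for three cars.
y_true = [20.0, 30.0, 25.0]
y_pred = [22.0, 27.0, 25.0]

# Per-example errors: 2.0, -3.0, 0.0
errors = [p - t for p, t in zip(y_pred, y_true)]

# Mean absolute error: average size of the errors, in MPG.
mae = sum(abs(e) for e in errors) / len(errors)

# Mean squared error: average of the squared errors, in MPG^2.
mse = sum(e ** 2 for e in errors) / len(errors)

print(f"MAE = {mae:.3f} MPG, MSE = {mse:.3f} MPG^2")
# MAE = 1.667 MPG, MSE = 4.333 MPG^2
```

Squaring makes MSE penalize the 3 MPG miss much more than the 2 MPG miss, which is why it is a common training loss, while MAE is easier to read because it stays in the original units.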

### Inspect the model

Use the `.summary` method to print a simple description of the model.

In [0]:
model.summary()
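
The parameter counts printed by the summary can be reproduced by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below assumes 9 input features (six numeric columns plus the three one-hot origin columns); `dense_params` is a helper name introduced for this illustration.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Assumption: 9 features = Cylinders, Displacement, Horsepower, Weight,
# Acceleration, Model Year, USA, Europe, Japan.
n_features = 9
layer_units = [64, 64, 1]

counts = []
n_inputs = n_features
for n_units in layer_units:
    counts.append(dense_params(n_inputs, n_units))
    n_inputs = n_units  # this layer's outputs feed the next layer

total = sum(counts)
print(counts, total)   # [640, 4160, 65] 4865
```

If the summary reports different numbers, the most likely cause is a different number of input columns.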


Now try out the model. Take a batch of `10` examples from the normalized training data and call `model.predict` on it. The predictions will likely be far from the true MPG values, because the model has not been trained yet.

In [0]:
example_batch = normed_train_data[:10]
example_result = model.predict(example_batch)
example_result

### Train the model

Train the model for up to 1000 epochs and record the training and validation loss and error metrics in the `history` object. Take 20% of our training data to use for validation. **Remember** to pass in our normalized data!

While fitting our model we are going to define a custom callback. 1000 epochs is a lot and we don't want the training ticks taking up all of our screen. We'll create a custom class using [tf.keras.callbacks.Callback](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/Callback) that will allow us to change what displays after the completed epoch. 

We are also going to define [EarlyStopping](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/EarlyStopping). This callback monitors a chosen quantity (here the validation loss) and stops training once that quantity has gone `patience` epochs without improving on its best value.


In [0]:
# Display training progress by printing a single dot for each completed epoch
class PrintDot(tf.keras.callbacks.Callback):
  def on_epoch_end(self, epoch, logs=None):
    # Start a new line every 100 epochs so the dots stay readable
    if epoch % 100 == 0: print('')
    print('.', end='')

# patience is the number of epochs without improvement to wait before stopping
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)


history = model.fit(
  normed_train_data, train_labels,
  epochs=1000, validation_split = 0.2, verbose=0,
  callbacks=[early_stop, PrintDot()])

Note that with early stopping defined, our model trains for far fewer epochs.
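
The patience rule can be illustrated without TensorFlow. The function below is a simplified sketch of the idea, not the Keras implementation: it tracks the best validation loss seen so far and stops once `patience` epochs have passed without a new best. The name `epochs_before_stop` and the loss values are made up for this example.

```python
def epochs_before_stop(val_losses, patience):
    """Return how many epochs run before a patience-based stop (simplified)."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch  # stop here; later epochs never run
    return len(val_losses)    # never triggered: all epochs run

# Made-up validation losses: improve, stall, improve once more, then worsen.
val_losses = [5.0, 4.0, 3.5, 3.6, 3.7, 3.4, 3.5, 3.6, 3.7, 3.8]

stopped_at_2 = epochs_before_stop(val_losses, patience=2)
stopped_at_3 = epochs_before_stop(val_losses, patience=3)
print(stopped_at_2, stopped_at_3)   # 5 9
```

With `patience=2` training stops during the first stall and misses the later improvement at epoch 6, which is why a somewhat larger patience (such as the 10 used above) is often chosen.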

Now that the model is finished, we can visualize the model's training progress using the stats stored in the `history` object. 

1.   Import the history into a pandas dataframe
2.   Create a column for the epochs
3.   Look at the end of the dataframe



In [0]:
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
hist.tail()

Using the following code, we can take the mean squared error and mean absolute error from the history object and graph them.

In [0]:
def plot_history(history):
  hist = pd.DataFrame(history.history)
  hist['epoch'] = history.epoch

  plt.figure()
  plt.xlabel('Epoch')
  plt.ylabel('Mean Abs Error [MPG]')
  plt.plot(hist['epoch'], hist['mae'],
           label='Train Error')
  plt.plot(hist['epoch'], hist['val_mae'],
           label = 'Val Error')
  plt.ylim([0,5])
  plt.legend()

  plt.figure()
  plt.xlabel('Epoch')
  plt.ylabel('Mean Square Error [$MPG^2$]')
  plt.plot(hist['epoch'], hist['mse'],
           label='Train Error')
  plt.plot(hist['epoch'], hist['val_mse'],
           label = 'Val Error')
  plt.ylim([0,20])
  plt.legend()
  plt.show()


plot_history(history)

You can see that even though we defined 1000 epochs, the model only trained for around 70 in this run (the exact number varies from run to run). That's because our EarlyStopping callback stopped the training once the validation loss stopped improving.

You can learn more about this callback [here](https://www.tensorflow.org/versions/master/api_docs/python/tf/keras/callbacks/EarlyStopping).

The graph shows that on the validation set, the mean absolute error is usually around 2 MPG. In other words, the model's predictions are off by about 2 MPG on average.

Let's see how well the model generalizes by using the **test** set, which we did not use when training the model.  This tells us how well we can expect the model to predict when we use it in the real world.

In [0]:
loss, mae, mse = model.evaluate(normed_test_data, test_labels, verbose=0)

print("Testing set Mean Abs Error: {:5.2f} MPG".format(mae))

### Make predictions

Finally, predict MPG values using data in the testing set. The following code plots the predicted MPG against the actual MPG for each test example, along with the diagonal line where the two are equal. The closer the points sit to that line, the better the predictions.

In [0]:
test_predictions = model.predict(normed_test_data).flatten()

plt.scatter(test_labels, test_predictions)
plt.xlabel('True Values [MPG]')
plt.ylabel('Predictions [MPG]')
plt.axis('equal')
plt.axis('square')
plt.xlim([0,plt.xlim()[1]])
plt.ylim([0,plt.ylim()[1]])
_ = plt.plot([-100, 100], [-100, 100])


It looks like our model predicts reasonably well. Let's take a look at the error distribution.

In [0]:
error = test_predictions - test_labels
plt.hist(error, bins = 25)
plt.xlabel("Prediction Error [MPG]")
_ = plt.ylabel("Count")

This last graph shows us the distribution of prediction errors. Most errors are close to zero, with only one prediction far enough off to skew the distribution.

## Conclusion

This notebook introduced a few techniques to handle a regression problem.

* Mean Squared Error (MSE) is a common loss function used for regression problems (different loss functions are used for classification problems).
* Similarly, evaluation metrics used for regression differ from classification. A common regression metric is Mean Absolute Error (MAE).
* When numeric input data features have values with different ranges, each feature should be scaled independently to the same range.
* If there is not much training data, one technique is to prefer a small network with few hidden layers to avoid overfitting.
* Early stopping is a useful technique to prevent overfitting.