# Project Cars - MPG Prediction (Regression)
Copyright 2023, LEAKY.AI LLC

This is project 2 for the Introduction to Deep Learning with PyTorch course (https://www.leaky.ai).

In this project you will build a neural network from scratch and train it to predict Miles per Gallon (MPG) for various types of cars from other properties of each car. You will work with a real-world car dataset that includes, for each car, the number of cylinders, horsepower, weight, model year and acceleration. From these values, your neural network will predict the expected Miles per Gallon (MPG for short) of the car.

### In this project you will:
- Build and train a neural network from scratch to predict MPG (Miles per Gallon) for cars using other car attributes (model year, displacement, etc.)
- Prepare a real-world dataset of car attributes (along with MPGs) to be used for training the neural network
- Experiment with different models to achieve a target average loss (or better) on the test set
- Use the trained model to make new predictions of MPGs

### To get started:
- Open up a web browser (preferably Chrome)
- Copy the Project GitHub Link: https://github.com/LeakyAI/PyTorch-Overview
- Head over to Google Colab (https://colab.research.google.com)
- Load the notebook: Project Cars - START HERE.ipynb
- Replace the [TBD]'s with your own code
- Execute the notebook after completing each cell

### Hint
Don't forget to print out your PyTorch and Pandas Cheatsheet and keep it handy when tackling this project. You can find it on the right-hand side of the course home landing page in the Resource section.

Good luck!

In [None]:
# Import PyTorch libraries
import [TBD]
import [TBD] as nn
import [TBD] as optim
import [TBD] as F

# Import math and visual libraries we will need
import math, random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

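# Set our seeds to get reproducible results
# (NumPy is seeded too because np.random.shuffle is used later for the data split)
torch.manual_seed(4)
random.seed(4)
np.random.seed(4)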

# Modify print options for numpy and pandas
np.set_printoptions(precision=3, suppress=True)
pd.options.display.float_format = "{:,.2f}".format

### Load and Analyze the Dataset
Below you will download the car dataset and explore the different attributes.

In [None]:
# Dataset from UCI - https://archive.ics.uci.edu/ml/datasets/auto+mpg
# Dua, D. and Graff, C. (2019). UCI Machine Learning Repository Irvine, CA
# University of California, School of Information and Computer Science.
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']
cars = pd.read_csv(url, names=column_names, na_values='?', comment='\t', 
                   sep=' ', skipinitialspace=True)
cars['Origin']=cars['Origin'].replace({1:'USA',2:'Europe',3:'Japan'})
cars.head()

In [None]:
cars.describe()

### Question:  What are some of the common dataset issues you see in the above tables that will need to be tackled before we can use the data for training neural networks?

Your Answer:   [TBD]

# Prepare our Dataset
### Start by counting and removing rows with missing items


In [None]:
# Determine the number of missing items from each attribute
cars.[TBD]

In [None]:
# Drop rows with missing items
cars = cars.[TBD]

# Check all missing items have been removed
cars.isna().sum()

### One-hot encode categorical attributes
Origin is a categorical attribute in this dataset. Start by determining how many categories exist after dropping the rows with missing items above.
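
As a minimal sketch of what one-hot encoding does, the toy example below uses a small hypothetical DataFrame (the `toy` frame and its `Color` column are made up for illustration and are not part of the car dataset). Each category becomes its own indicator column, and the original text column is dropped.

```python
import pandas as pd

# Hypothetical toy data, not the car dataset
toy = pd.DataFrame({"Size": [1.0, 2.0, 3.0],
                    "Color": ["red", "blue", "red"]})

# One new column per category; dtype=float keeps the indicators numeric
dummies = pd.get_dummies(toy["Color"], prefix="Color", dtype=float)

# Swap the text column for the new indicator columns
toy = pd.concat([toy.drop("Color", axis=1), dummies], axis=1)
print(toy)
```

Passing `dtype=float` is a choice made for this sketch: recent pandas versions return boolean indicator columns by default, which may get in the way when the DataFrame is later converted to a float tensor.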

In [None]:
# Count number of unique values in each attribute including 'Origin'
cars.[TBD]

### Question:  How many unique categories exist for Origin?

Your Answer:  [TBD]

In [None]:
# One-hot encode the Origin attribute using pd.get_dummies, drop and pd.concat
oneHot = pd.get_dummies([TBD])
cars=cars.drop([TBD],axis=1)
cars = pd.concat([TBD],axis=1)
cars.describe()

### Question:  How many total attributes are left in the dataset after one-hot encoding the Origin attribute?

Your Answer:  [TBD]

### Extract input X and output Y (MPG) from the dataset

In [None]:
# Extract our target output ydf (MPG) and input x attributes
ydf = [TBD]
xdf = cars.drop([TBD],axis=1)

### Apply standardization to our input xdf and output ydf 
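Standardization rescales each column as `z = (value - mean) / std`. The sketch below shows the idea on a small hypothetical DataFrame (the `toy` frame and the `toy_mean` / `toy_std` names are illustrative only). The statistics are stored in variables so the same scaling can be reused later, for example on new inputs.

```python
import pandas as pd

# Hypothetical toy data, not the car dataset
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                    "b": [10.0, 20.0, 30.0, 40.0]})

# Per-column statistics, saved so the same scaling can be reused later
toy_mean = toy.mean()
toy_std = toy.std()

# z = (value - mean) / std, applied column by column
toy_z = (toy - toy_mean) / toy_std
print(toy_z.describe())
```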

In [None]:
# Calculate and save the values required to standardize both the input xdf and output ydf
xMean = xdf.[TBD]
xStd = xdf.[TBD]
yMean = ydf.[TBD]
yStd = ydf.[TBD]

# Standardize both the input (xdf) and output (ydf) values
xdf = [TBD]
ydf = [TBD]

In [None]:
# Check the input xdf values are now standardized
xdf.describe()

### Question:  What are you checking in the above to ensure the input values are correctly standardized?  

Your Answer: [TBD]

In [None]:
# Check that the output target ydf (MPG) is also correctly standardized
ydf.describe()

### Convert dataframes to PyTorch Tensors and Dataloaders
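To illustrate how a dataloader groups samples into mini-batches, here is a minimal sketch on hypothetical toy tensors. It uses PyTorch's built-in `TensorDataset` for brevity rather than the custom `Dataset` class defined in this notebook, and the batch size of 4 is an arbitrary choice.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy tensors: 10 samples, 3 input features, 1 target each
toy_x = torch.arange(30, dtype=torch.float).reshape(10, 3)
toy_y = torch.arange(10, dtype=torch.float).unsqueeze(dim=1)

# TensorDataset pairs each row of toy_x with the matching row of toy_y
toy_set = TensorDataset(toy_x, toy_y)

# batch_size=4 is arbitrary; the last batch holds the 2 leftover samples
toy_loader = DataLoader(toy_set, batch_size=4, shuffle=True)

for xb, yb in toy_loader:
    print(xb.shape, yb.shape)
```

With 10 samples and a batch size of 4, the loader yields 3 mini-batches, so `len(toy_loader)` counts batches while `len(toy_loader.sampler)` counts samples.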

In [None]:
# Create input and output PyTorch tensors of type torch.float
x = torch.tensor([TBD],dtype=[TBD])
y = torch.tensor([TBD],dtype=[TBD])

In [None]:
# Add a second dimension to y so its shape is (N, 1), then print out the shape and type of both x and y
y = y.unsqueeze(dim=1)
print (f"x.shape -> {x.shape}")
print (f"x.type() -> {x.type()}")
print (f"y.shape -> {y.shape}")
print (f"y.type() -> {y.type()}")

In [None]:
# Create the index splits for training, validation and test
# Import numpy for shuffling index values
import numpy as np

# Start by finding the total number of items in the original dataset
# using the len function
total = len(x)

# Build a list of indices and shuffle them randomly
# Indices should be in range of the length of x
indices = list(range(total))

# Shuffle the indices
np.random.shuffle(indices)

# Allocate 80% of the data for the training set
# (10% for test set and 10% for validation set)
trainingPercent = .8

# Calculate the first split point so that x[:split1] will be your training set
split1 = int(total*trainingPercent)

# Calculate your 2nd split point so that x[split1:split2] will be
# your validation set and x[split2:] will be your testing set
split2 = int(((total - split1)/2)+split1)

In [None]:
# Create a simple dataset using the PyTorch Dataset class
class Dataset(torch.utils.data.Dataset):
  def __init__(self, x, y):
        # Initialize both x and y
        self.x = x
        self.y = y

  def __len__(self):
        # Total number of samples in the dataset
        return len(self.y)

  def __getitem__(self, index):
        # Return the data at location index
        x=self.x[index]
        y=self.y[index]
        return x,y

# Instantiate the three datasets
train_set = Dataset(x[indices[:split1]], y[indices[:split1]])
val_set = Dataset(x[indices[split1:split2]], y[indices[split1:split2]])
test_set = Dataset(x[indices[split2:]], y[indices[split2:]])

# Create dataloaders for each dataset
# For the training set, make sure to set shuffle to true
train_loader = torch.utils.data.DataLoader([TBD])
val_loader = torch.utils.data.DataLoader([TBD])
test_loader = torch.utils.data.DataLoader([TBD])

In [None]:
# Check the size and number of batches for each set
print (f"Train Loader - Total Number of Mini-Batches: {len(train_loader)}")
print (f"Train Loader - Total Size of Dataset: {len(train_loader.sampler)}")
print (f"Validation Loader - Total Number of Mini-Batches: {len(val_loader)}")
print (f"Validation Loader - Total Size of Dataset: {len(val_loader.sampler)}")
print (f"Test Loader - Total Number of Mini-Batches: {len(test_loader)}")
print (f"Test Loader - Total Size of Dataset: {len(test_loader.sampler)}")

### Create a scoring function for validation and test sets
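The scoring function weights each batch's average loss by the batch size before dividing by the dataset size. The sketch below, on hypothetical toy values, suggests why: when the last batch is smaller than the others, a plain mean of the batch means no longer matches the loss over the whole dataset.

```python
import torch
import torch.nn as nn

criterion = nn.MSELoss()          # returns the mean loss over a batch

# Hypothetical predictions and targets, split into batches of 3 and 1
pred = torch.tensor([[1.0], [2.0], [3.0], [10.0]])
target = torch.zeros(4, 1)
batches = [(pred[:3], target[:3]), (pred[3:], target[3:])]

# Weight each batch mean by its batch size, then divide by the sample count
loss_total = sum(criterion(p, t).item() * p.size(0) for p, t in batches)
weighted_avg = loss_total / len(pred)

# A plain mean of the batch means over-weights the small final batch
naive_avg = sum(criterion(p, t).item() for p, t in batches) / len(batches)

print(weighted_avg, criterion(pred, target).item(), naive_avg)
```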

In [None]:
# Purpose:  This function will calculate the average loss
#           using the criterion on the loader dataset
# Returns:  Average loss
@torch.[TBD] 
def scoreModel(model, loader, criterion):
    model.[TBD]                                # Set the model to inference mode
    lossTotal = 0.0                            # Initialize the total loss
    
    for x,y in loader:
        pred = [TBD]                           # Single forward pass
        loss = [TBD]                           # Calculate the average loss for the batch
        lossTotal+=loss.item()*x.size(0)       # Add the average loss adjusting for size of batch
        
    lossAvg = lossTotal/len(loader.sampler)    # Calculate the average loss for the entire dataset
    return lossAvg

### Create the training function
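As a rough sketch of the forward pass and weight update that a training loop repeats, the example below fits a single linear layer to a hypothetical toy problem. The data, the `toy_model` name, the SGD optimizer and the learning rate are all assumptions made for illustration and are not the settings expected for this project.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Hypothetical toy problem: learn y = 2x from 32 samples
toy_x = torch.linspace(-1, 1, 32).unsqueeze(dim=1)
toy_y = 2 * toy_x

toy_model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = optim.SGD(toy_model.parameters(), lr=0.1)   # lr chosen arbitrarily

losses = []
for step in range(200):
    loss = criterion(toy_model(toy_x), toy_y)   # forward pass and loss
    optimizer.zero_grad()                       # clear gradients from the last step
    loss.backward()                             # compute new gradients
    optimizer.step()                            # update the weights
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.6f}")
```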

In [None]:
# Purpose:  This function will train the neural network on the
#           given training dataset for epochs number of iterations
#           returning the average loss on the test set
# Inputs:
#    epochs:       The total number of training epochs
#    model:        The neural network definition
#    train_loader: Training set dataloader 
#    val_loader:   Validation set dataloader 
#    test_loader:  Testing set dataloader
#    criterion:    Loss function used during training
#    optimizer:    Algorithm for determining weights
#
# Returns:
#    tLoss:        Average loss on the test_loader
def train(epochs, model, train_loader, val_loader, test_loader, criterion, optimizer):

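    # Track the lowest validation loss seen so far
    minValLoss = float('inf')
    
    # Train model for epochs number of epochs (full pass of training set)
    for epoch in range(epochs):
        
        # Set the model to training mode (enable dropout, batch norm stats etc.)
        # Repeated every epoch because scoreModel leaves the model in inference mode
        model.[TBD]
        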
        # Track loss over the entire epoch
        totalLoss = 0
        for x, y in train_loader:
            
            # Perform a single forward pass with a mini-batch and calculate loss
            y_pred = [TBD] 
            loss = [TBD]
            totalLoss+=loss.item()*x.size(0)
            
            # Update the weights
            optimizer.[TBD]
            loss.[TBD]
            optimizer.[TBD]

        # Calculate the Average Training Loss
        avgTLoss = totalLoss/len(train_loader.sampler)

        # Calculate validation loss and checkpoint model if lower
        vLoss = scoreModel([TBD],[TBD],[TBD])
        if (minValLoss > vLoss):
            # Save the model if the validation loss improved
            # print ("Model validation score improved, saving model...")
            torch.save(model.state_dict(), "trainingModelCheckpoint.pt")
            minValLoss = vLoss
                
        # Display average loss every 50 epochs
        if ((epoch+1) % 50 == 0):
            print (f"Epoch {epoch+1}  Training Loss: {avgTLoss:.4f} Validation Loss: {vLoss:.4f}")
     
    # Finally, score the best model on the test dataset
    model.load_state_dict(torch.load("trainingModelCheckpoint.pt"))
    tLoss = scoreModel([TBD], [TBD], [TBD])
    print (f"Final Average Test Dataset Loss:  {tLoss:.4f}")  
    
    # Return the average loss on the test set
    return [TBD]

### Create a graphing function to visualize your final results on the test set
Since your model is a regression model, one way to visualize its performance is to plot its predictions against the actual values on the test set using a scatter plot. That way you can easily see how close the model's predicted values are to the actual values.
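
Because the model is trained on standardized targets, its outputs have to be converted back to MPG before plotting. A minimal sketch of that round trip, using hypothetical statistics (`toy_mean` and `toy_std` are made-up values, not the ones computed from the car dataset):

```python
import torch

# Hypothetical statistics standing in for the saved target mean and std
toy_mean, toy_std = 23.5, 7.8

# Original values -> standardized -> back to the original units
mpg = torch.tensor([15.0, 23.5, 31.3])
mpg_z = (mpg - toy_mean) / toy_std
mpg_back = mpg_z * toy_std + toy_mean
print(mpg_z, mpg_back)
```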

In [None]:
# Plot predictions vs. true values
@torch.no_grad() 
def graphPredictions(model, loader, minValue, maxValue):
    
    model.eval()                               # Set the model to inference mode
    
    predictions=[]                             # Track predictions
    actual=[]                                  # Track the actual labels
    
    for x,y in loader:
        
        # Single forward pass
        pred = model(x)                               

        # Un-normalize our prediction and label
        pred = pred*yStd+yMean                 
        y= y*yStd+yMean

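        # Save predictions and actual labels (extend keeps one flat entry per
        # sample, so batches of different sizes can still be plotted together)
        predictions.extend([TBD].tolist())
        actual.extend([TBD].tolist())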
    
    # Plot actuals vs predictions
    plt.scatter(actual, predictions)
    plt.xlabel('Actual MPGs')
    plt.ylabel('Predicted MPGs')
    plt.plot([minValue,maxValue], [minValue,maxValue]) 
    plt.xlim(minValue, maxValue)
    plt.ylim(minValue, maxValue)
 
    # Make the display equal in both dimensions
    plt.gca().set_aspect('equal', adjustable='box')
    plt.show()

### Build and Train your Model - Target an Average Test Loss < 0.2
Build your model and train it to achieve an average test loss of less than 0.2 by:

- Iterating on the model size and shape
- Experimenting with different optimizers and learning rates
- Adding regularization techniques like dropout
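
For reference, one possible shape for a small regression network with dropout is sketched below. The class name `ToyRegressor`, the layer sizes and the dropout rate are hypothetical placeholders, not the intended solution; the number of inputs in particular must match the number of columns in your own prepared dataset.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyRegressor(nn.Module):
    # All sizes and the dropout rate are hypothetical defaults
    def __init__(self, num_inputs=4, num_hidden=32, p_drop=0.2):
        super().__init__()
        self.fc1 = nn.Linear(num_inputs, num_hidden)
        self.drop = nn.Dropout(p_drop)
        self.fc2 = nn.Linear(num_hidden, 1)   # single regression output

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.drop(x)          # active in train() mode, a no-op in eval() mode
        return self.fc2(x)

toy_net = ToyRegressor()
toy_net.eval()                    # disable dropout for a repeatable forward pass
toy_in = torch.randn(5, 4)
out = toy_net(toy_in)
print(out.shape)
```

Dropout only has an effect while the model is in training mode, which is why the mode matters both when training and when scoring.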

In [None]:
# Step 1 - Create a model and save the initialized weights
# Use a variable to define the number of hidden units in the hidden layer
numHiddenUnits = [TBD]

# Build the simple Neural Network by extending the nn.Module class
class MyModel(nn.Module):
    
        def __init__(self):
            super().__init__()
            [TBD]
            
        def forward(self, x):
            [TBD]
            return x

# Create an instance of the model, save its initial weights and print out the model summary
net = MyModel()
torch.save(net.state_dict(), 'modelcheckpoint.pth')
print(net)

In [None]:
# Step 2 - Reset the model and train using different learning rates and optimizers
# in order to achieve a test loss < 0.2

# Load the model weights to the same as initialized
net.load_state_dict(torch.load('modelcheckpoint.pth'))

# Select our criterion (MSE Loss) and optimizer (SGD or Adam or experiment with others...)
criterion = [TBD]
optimizer = [TBD]

# Train network for 1000 epochs (experiment with more or less as well)
train([TBD], [TBD], [TBD], [TBD], [TBD], [TBD], [TBD])

# Display the final results on the test set
graphPredictions(net, test_loader, 0, 45)

### Make your own predictions
After you achieve an average test loss of less than 0.2, you can use the model to make your own predictions. 

In [None]:
# Specify the inputs to your model
# (the columns must match the columns of the training input xdf)
df = pd.DataFrame()
df.loc[0,'Cylinders']=[TBD]
df.loc[0,'Displacement']=[TBD]
df.loc[0,'Horsepower']=[TBD]
df.loc[0,'Weight']=[TBD]
df.loc[0,'Acceleration']=[TBD]
df.loc[0,'Model Year']=[TBD]
df.loc[0,'Origin_Japan']=[TBD]
df.loc[0,'Origin_USA']=[TBD]

df.head()

In [None]:
# Normalize the input values using the previously calculated xMean and xStd
xdf = (df-[TBD])/[TBD]
xdf.head()

In [None]:
# Create a torch tensor for the input
x = torch.tensor(xdf.to_numpy(), dtype=torch.float)

# Set the model to evaluation mode
net = net.[TBD]

# Perform a single forward pass
pred = net(x)

# Unnormalize the output using yMean and yStd
pred = pred*yStd+yMean

print (f"Predicted MPG: {pred.item():0.4}")

### Submitting Your Project 
You have made it to the end of the project! Once you have achieved a test set average loss of less than 0.2, you may submit your project by downloading it and emailing it to us for review.