<img src="https://futurejobs.my/wp-content/uploads/2021/05/d-min-1024x297.png" width="300"> </img>

> **Copyright &copy; 2021 Skymind Education Group Sdn. Bhd.**<br>
 <br>
This program and the accompanying materials are made available under the
terms of the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). \
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations
under the License. <br>
<br>**SPDX-License-Identifier: Apache-2.0** 

# Demo - Predict Car Price 
Authored by : [Nazurah Kamil](mailto:nazurah.kamil@skymind.my)

In [Feature Engineering](../../machine_learning/supervised_learning/Feature%20Engineering.ipynb), we built a `LinearRegression` model with the `scikit-learn` library to predict car prices. This notebook uses a neural network to tackle the same problem instead.

###### Learning Outcome
By the end of this notebook, you will be able to:
- Apply a deep learning model to a regression problem
- Utilize `Dataset` and `DataLoader`
- Determine how well our deep learning model generalizes to the test set

Let us first import the required libraries for this notebook.

In [None]:
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.optim as optim

from torch.utils.tensorboard import SummaryWriter
%reload_ext tensorboard

import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Import dataset (pandas is already imported as pd in the first cell)
car_data = pd.read_csv("../datasets/car-price.csv")

In [None]:
car_data.shape

In [None]:
car_data.head()

In [None]:
# Remove unused column
car_data.drop(['CarName','car_ID'], axis=1, inplace=True)

In [None]:
car_data.shape

## Feature Engineering

In [None]:
# Change from string to int
word2num = {"doornumber": {"two": 2, "four": 4},
            "cylindernumber": {"two": 2, "three": 3, "four": 4, "five": 5,
                               "six": 6, "eight": 8, "twelve": 12}}

# Replace the values in dataset
car_data = car_data.replace(word2num)
car_data.dtypes

In [None]:
car_data['fueltype'].unique()

## **Dummy Encoding**
Let's use dummy encoding to convert our string-valued categorical features into binary indicator columns.

In [None]:
# Dummy encoding to categorical features
car_encode = pd.get_dummies(car_data, columns=['fueltype', 'aspiration', 'carbody', 'drivewheel', 'enginelocation',
                                      'enginetype', 'fuelsystem'], drop_first=True)
car_encode.head()

In [None]:
car_encode.columns.tolist().index('fueltype_gas')
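
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy DataFrame. The data and the names `toy` and `encoded` are hypothetical and are not part of the car price dataset.

```python
import pandas as pd

# Hypothetical toy data: one categorical column and one numeric column
toy = pd.DataFrame({"fueltype": ["gas", "diesel", "gas"],
                    "price": [100, 200, 150]})

# The categories are sorted alphabetically (diesel, gas) and the first one
# is dropped, so a single indicator column is enough to encode two categories
encoded = pd.get_dummies(toy, columns=["fueltype"], drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

A row with `fueltype_gas` equal to 0 (or `False`) is implicitly a diesel car, which is why the dropped column carries no extra information.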

## Features and Label
Now, we will separate the features from the label, storing the features in variable `X` and the label in variable `y`.

In [None]:
X = car_encode.drop('price', axis=1)
X.head()

In [None]:
X.shape

In [None]:
y = car_encode.price
y

# Split to Train And Test

In [None]:
# split train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=5)

# Data Preprocessing

Next, we are going to perform feature scaling on `X_train` and `X_test` using `StandardScaler` from `scikit-learn`.<br>
*Note: only fit the train features but transform both train and test features*

In [None]:
# feature scaling: fit on the train features only, then transform both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# The label is left unscaled; the datasets below use y_train and y_test as-is
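
The note above about fitting only on the train features can be illustrated with a small sketch. The random arrays and the names `toy_train` and `toy_test` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for train and test features
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=10.0, scale=2.0, size=(100, 3))
toy_test = rng.normal(loc=10.0, scale=2.0, size=(20, 3))

toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean and std from train only
toy_test_scaled = toy_scaler.transform(toy_test)        # reuses the train statistics

# The train columns have mean ~0 and std ~1 by construction;
# the test columns are only approximately standardized
print(toy_train_scaled.mean(axis=0))
print(toy_test_scaled.mean(axis=0))
```

Reusing the train statistics keeps information about the test set out of the preprocessing step, so the test loss remains an honest estimate.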

# Dataset And DataLoader

Here, we are using a custom dataset from a `csv` file. Thus, we have to build our own `Dataset` class by subclassing from `torch.utils.data.Dataset`.

The PyTorch [documentation](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) states that a subclass of `Dataset` has to override the `__getitem__()` method and can optionally override the `__len__()` method.<br>
We will mainly have three methods in this `Dataset` class:
- `__init__(self, features, labels)`: passes the features and labels into the dataset
- `__len__(self)`: lets the dataset report how many instances of data there are
- `__getitem__(self, idx)`: returns the features and label at a given index

In [None]:
class Custom_Dataset(Dataset):
    def __init__(self, features, labels):
        # convert dataset to tensor
        self.features = torch.tensor(features, dtype = torch.float32)
        self.labels = torch.tensor(labels.values, dtype  = torch.float32)

    def __len__(self):
        # return len of features
        return self.features.shape[0]
    
    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

In [None]:
train_dataset = Custom_Dataset(X_train, y_train)
test_dataset = Custom_Dataset(X_test, y_test)
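
As a quick illustration of how `__len__` and `__getitem__` are used, the sketch below builds a dataset of the same shape as `Custom_Dataset` on hypothetical toy data. The class name `ToyDataset` and the toy arrays are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset

class ToyDataset(Dataset):
    """Same structure as Custom_Dataset, repeated here so the sketch is self-contained."""
    def __init__(self, features, labels):
        self.features = torch.tensor(features, dtype=torch.float32)
        self.labels = torch.tensor(labels.values, dtype=torch.float32)

    def __len__(self):
        return self.features.shape[0]

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

# Hypothetical data: 4 samples with 3 features each, labels as a pandas Series
toy_features = np.arange(12, dtype=np.float64).reshape(4, 3)
toy_labels = pd.Series([10.0, 20.0, 30.0, 40.0])
toy_dataset = ToyDataset(toy_features, toy_labels)

print(len(toy_dataset))      # calls __len__
x0, y0 = toy_dataset[0]      # calls __getitem__
print(x0.shape, y0.shape)    # one feature vector and one scalar label
```

Note that each label comes back as a scalar tensor, so a batch of labels has shape `(batch_size,)` rather than `(batch_size, 1)`.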

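`DataLoader` wraps our dataset in an iterable and loads it in mini-batches of a configurable size (`batch_size`). It can also shuffle the data at the start of each epoch (`shuffle=True`), which randomizes the order in which samples are seen; in the cell below, shuffling is left at its default (off). Training on mini-batches lets the optimizer update the weights several times per epoch instead of once per full pass over the data.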

In [None]:
train_loader = DataLoader(train_dataset, batch_size = 32)
test_loader = DataLoader(test_dataset, batch_size = 64 )
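
To see what a `DataLoader` yields, here is a minimal sketch with a hypothetical 10-sample dataset built from `TensorDataset`. The tensor values, the batch size of 4, and the seeded generator are assumptions for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical data: 10 samples with 2 features each
toy_x = torch.arange(20, dtype=torch.float32).reshape(10, 2)
toy_y = torch.arange(10, dtype=torch.float32)
toy_ds = TensorDataset(toy_x, toy_y)

# shuffle=True reorders the samples each epoch; the seeded generator
# only makes the order reproducible for this sketch
toy_loader = DataLoader(toy_ds, batch_size=4, shuffle=True,
                        generator=torch.Generator().manual_seed(0))

batch_sizes = [xb.shape[0] for xb, yb in toy_loader]
print(len(toy_loader))   # number of mini-batches, not number of samples
print(batch_sizes)       # the last batch holds the remaining samples
```

Here `len(toy_loader)` counts mini-batches, which is the quantity used to average the loss per epoch during training.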

`torch.nn.Sequential` is a container module that accepts `nn.Module` layers as arguments and chains them, so that the output of each layer becomes the input of the next. We will be configuring these layers:
1. nn.Linear(38,50)
2. nn.ReLU()
3. nn.Linear(50,25)
4. nn.ReLU()
5. nn.Linear(25,10)
6. nn.ReLU()
7. nn.Linear(10,1)

# Define Model

In [None]:
n_features = 38
num_output = 1
torch.manual_seed(123)
model_sequential = nn.Sequential(nn.Linear(n_features, 50),
                                 nn.ReLU(),
                                 nn.Linear(50, 25),
                                 nn.ReLU(),
                                 nn.Linear(25, 10),
                                 nn.ReLU(),
                                 nn.Linear(10, num_output)
                                 )
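
A quick way to sanity-check a stack like this is to push a dummy batch through it and count the trainable parameters. The sketch below rebuilds the same layer sizes under the hypothetical name `sketch_model`; the random input is an assumption for illustration.

```python
import torch
import torch.nn as nn

# Same layer sizes as the model above, rebuilt so the sketch is self-contained
sketch_model = nn.Sequential(nn.Linear(38, 50),
                             nn.ReLU(),
                             nn.Linear(50, 25),
                             nn.ReLU(),
                             nn.Linear(25, 10),
                             nn.ReLU(),
                             nn.Linear(10, 1))

# Dummy batch: 32 samples with 38 features each
dummy_batch = torch.randn(32, 38)
dummy_output = sketch_model(dummy_batch)
print(dummy_output.shape)  # one prediction per sample

# Each Linear layer has in*out weights plus out biases
n_params = sum(p.numel() for p in sketch_model.parameters())
print(n_params)
```

The input width of the first layer has to match the number of feature columns after dummy encoding, otherwise the forward pass raises a shape error.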

In [None]:
# Loss and optimizer
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model_sequential.parameters(), lr = 0.01)

# Tensorboard
TENSORBOARD_LOG_PATH = Path("./run_Linear_Regression").resolve()
writer = SummaryWriter(TENSORBOARD_LOG_PATH)

# Start Model Training

In [None]:
# Perform model training on training data only
losses_list = []
def train_model(epochs, model, criterion, optimizer, loader, writer):
    running_loss = 0.0
    
    # Repeat for given epoch numbers
    for epoch in range(1, epochs+1):
        
        # Train with mini-batch of data
        for i, data in enumerate(loader):
            
            # clearing the gradient from previous minibatch gradient computation
            optimizer.zero_grad()
            
            # divide into features and labels
            features, labels = data[0], data[1]
            
            # predict using features
            prediction = model(features)
            
            # RMSE loss
            loss = torch.sqrt(criterion(prediction, torch.unsqueeze(labels, dim=1)))
            
            # Compute gradient
            loss.backward()
            
            # Updating weight and bias
            optimizer.step()
            
            # accumulate the loss of each mini-batch within the current epoch
            running_loss += loss.item()
        
        loss_per_epoch = running_loss / len(loader) # average over the number of mini-batches
        
        # Print the progress (for this print every 10 epochs)
        if (epoch % 10 == 0 or epoch == 1):
            print(f"Epoch {epoch} Train Loss: {loss_per_epoch}")
        running_loss = 0.0
        
        writer.add_scalar('Loss', loss_per_epoch, epoch)
        losses_list.append(loss_per_epoch)
    writer.close()

In [None]:
num_epochs = 100
train_model(num_epochs, model_sequential, criterion, optimizer, train_loader, writer)
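
The training loop takes the square root of `nn.MSELoss` to obtain RMSE and calls `torch.unsqueeze` on the labels. The sketch below, on hypothetical toy tensors, shows why the shapes need to match: subtracting a `(3,)` tensor from a `(3, 1)` tensor broadcasts to a `(3, 3)` matrix instead of comparing element by element.

```python
import torch
import torch.nn as nn

toy_criterion = nn.MSELoss()

# Hypothetical values: predictions have shape (3, 1), labels have shape (3,)
toy_prediction = torch.tensor([[1.0], [2.0], [3.0]])
toy_labels = torch.tensor([1.0, 2.0, 5.0])

# Matching shapes: labels become (3, 1), so errors are computed per sample
toy_rmse = torch.sqrt(toy_criterion(toy_prediction, toy_labels.unsqueeze(1)))
print(toy_rmse.item())  # sqrt((0 + 0 + 4) / 3)

# Mismatched shapes: broadcasting pairs every prediction with every label
mismatched_diff = toy_prediction - toy_labels
print(mismatched_diff.shape)
```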

In [None]:
# Visualizing the loss curve
plt.figure(figsize=(10,6))
plt.plot(range(1, num_epochs + 1), losses_list);  # epochs are numbered from 1
plt.grid()
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.show()

From the graph, we can observe the loss value at each epoch. After about 20 epochs, the loss changes little in subsequent epochs, so training beyond that point does not improve the loss significantly. Next, we will see how well our model generalizes to the test set.

There is no point in computing gradients during model testing. So, let's use `torch.no_grad` when evaluating our model on the test set.

**`torch.no_grad`**<br>
Inside a `torch.no_grad` block, the result of every computation has its `requires_grad` property set to `False`, so the `Autograd` engine does not track the operations needed to compute gradients with respect to the parameters. We do not need any gradients in the test step because the parameter updates are done in the training step, hence this context manager is used in the test phase only.
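
The effect of `torch.no_grad` can be checked directly on a small tensor. This is a minimal sketch; the tensor `w` is a hypothetical stand-in for a model parameter.

```python
import torch

# Hypothetical stand-in for a trainable parameter
w = torch.ones(3, requires_grad=True)

# Outside the context manager, the result is tracked by autograd
y_tracked = (w * 2).sum()

# Inside the context manager, the result is not tracked
with torch.no_grad():
    y_untracked = (w * 2).sum()

print(y_tracked.requires_grad)    # True
print(y_untracked.requires_grad)  # False
print(w.requires_grad)            # True: the parameter itself is unchanged
```

Only the results computed inside the block are affected; the parameters keep their `requires_grad` setting for later training.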

In [None]:
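# Set our model to evaluation mode (model.eval())
model_sequential.eval()

# Test set
with torch.no_grad():
    running_loss = 0.0
    
    # Use separate batch names so that X_test and y_test are not overwritten
    for idx, (features, labels) in enumerate(test_loader):
        y_predtest = model_sequential(features)
        
        # RMSE of this batch
        loss = torch.sqrt(criterion(y_predtest, torch.unsqueeze(labels, dim=1)))
        
        # Weight the batch loss by the number of samples in the batch
        running_loss += loss.item() * y_predtest.size(0)
        
    # Average over all test samples, not only those in the last batch
    print(f'Test Loss: {running_loss / len(test_dataset)}')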
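
Averaging per-batch RMSE values is only an approximation of the RMSE over the whole test set when there is more than one batch. As an alternative sketch, one could accumulate squared errors and take the square root once at the end. The helper name `rmse_over_loader` and the toy model and data are hypothetical.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def rmse_over_loader(model, loader):
    """Hypothetical helper: RMSE over every sample yielded by the loader."""
    model.eval()
    squared_error_sum, n_samples = 0.0, 0
    with torch.no_grad():
        for features, labels in loader:
            preds = model(features).squeeze(1)  # (batch, 1) -> (batch,)
            squared_error_sum += torch.sum((preds - labels) ** 2).item()
            n_samples += labels.shape[0]
    return (squared_error_sum / n_samples) ** 0.5

# Hypothetical toy model and data, used only to exercise the helper
torch.manual_seed(0)
toy_inputs = torch.randn(10, 3)
toy_targets = torch.randn(10)
toy_model = nn.Linear(3, 1)
toy_eval_loader = DataLoader(TensorDataset(toy_inputs, toy_targets), batch_size=4)

print(rmse_over_loader(toy_model, toy_eval_loader))
```

With a single test batch, as happens when the batch size exceeds the test set size, both approaches give the same value.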

It seems that our model generalizes to the test dataset fairly well.

# Tensorboard

You can also make use of `TensorBoard` to visualize the loss per epoch that was logged during training.

If you're using **Windows** and are not able to view your plots after running the following cell, there is a temporary workaround [(Reference)](https://github.com/tensorflow/tensorboard/issues/2481#issuecomment-516974546). Run the following commands in `CMD.exe` or `PowerShell` and try running the cell again:
>`taskkill /im tensorboard.exe /f`<br>
>`del /q %TMP%\.tensorboard-info\*` (CMD.exe only)<br>
>`del $env:TMP\.tensorboard-info\*` (PowerShell only)<br>

In [None]:
%tensorboard --logdir={TENSORBOARD_LOG_PATH.as_posix()}