# Assignment 3 – Tabular Dataset Preparation

This is Assignment 3 for the Introduction to Deep Learning with PyTorch course (www.leaky.ai). In this assignment you will practice preparing tabular datasets for training a neural network by applying normalization and standardization techniques. You will also use pandas to convert categorical inputs into numerical values.

## To Get Started

1.	Open up a web browser (preferably Chrome)
2.	Copy the Project GitHub Link: https://github.com/LeakyAI/PyTorch-Overview
3.	Head over to Google Colab (https://colab.research.google.com)
4.	Load the notebook: Tabular Dataset Preparation - Start Here.ipynb
5.	Replace the [TBD]'s with your own code
6.	Execute the notebook after completing each cell and check your answers using the solution notebook

Good Luck!

Key Takeaways:
- You calculated the minimum and maximum values for each input and applied normalization
- You then applied standardization and compared the results
- You replaced categorical inputs with numerical values using one-hot encoding

## Part 1 - Standardization and Min Max Normalization

In [None]:
# Import PyTorch and configure how tensors are printed
import torch
torch.set_printoptions(precision=3, sci_mode=False)  # Make tensors easier to read

In [None]:
# Create a PyTorch tensor with the following content:
#    [[1,100,3,0.01,5000],[0,10,8,-0.002,0.01],[1,25,13,0.04,0.2],[1,45,18,-0.05,0.5]]
data = torch.tensor([TBD], dtype=torch.float)
print (data)

#### Answer
<pre>
tensor([[     1.000,    100.000,      3.000,      0.010,   5000.000],
        [     0.000,     10.000,      8.000,     -0.002,      0.010],
        [     1.000,     25.000,     13.000,      0.040,      0.200],
        [     1.000,     45.000,     18.000,     -0.050,      0.500]])
</pre>

## Normalize the Values
Here you will apply normalization to the column values.
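
As a rough illustration on a separate toy tensor (the values and the names `toy`, `col_min`, `col_max` and `toy_normalized` are invented for this sketch and are not part of the assignment), reducing along `dim=0` returns both the per-column values and their row indices, which is why `.values` is needed before applying (x - min) / (max - min):

```python
import torch

# Hypothetical toy data: 3 rows, 2 columns (not the assignment tensor)
toy = torch.tensor([[2.0, 10.0],
                    [4.0, 30.0],
                    [6.0, 20.0]])

# Reducing along dim=0 works column by column and returns a
# named tuple holding .values and .indices
col_max = toy.max(dim=0)
col_min = toy.min(dim=0)

# Min-max normalization: broadcasting applies each column's
# min and max to every row of that column
toy_normalized = (toy - col_min.values) / (col_max.values - col_min.values)
print(toy_normalized)
```

After this step every column of the toy tensor spans exactly the range 0 to 1, regardless of its original scale.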

In [None]:
# Find the minimum and maximum value for each column
# Hint:  Make sure you use axis = 0 when calling min and max as we
#        want to apply the function calls to the columns (not entire tensor)
maximums = [TBD]
minimums = [TBD]
print (f"Max Values: {maximums.values}")
print (f"Min Values: {minimums.values}")

#### Answer
<pre>
Max Values: tensor([    1.000,   100.000,    18.000,     0.040,  5000.000])
Min Values: tensor([     0.000,     10.000,      3.000,     -0.050,      0.010])
</pre>

In [None]:
# Apply normalization to each input
# Use the formula x = (x - min) / (max - min)
# Hint:  use maximums.values and minimums.values
dataNormalized = (data - [TBD]) / ([TBD] - [TBD])
print (dataNormalized)

#### Answer
<pre>
tensor([[    1.000,     1.000,     0.000,     0.667,     1.000],
        [    0.000,     0.000,     0.333,     0.533,     0.000],
        [    1.000,     0.167,     0.667,     1.000,     0.000],
        [    1.000,     0.389,     1.000,     0.000,     0.000]])
</pre>

### Question
What observations can be made about using normalization?  Does normalization work well in all cases?  How about the last column?

### Your Answer
[TBD]

## Standardize the Values
Use the following formula:
xStandardized = (x - xMean) / xStdDeviation
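
A minimal sketch of this formula on a separate toy tensor may help (the values and the names `toy`, `toy_mean`, `toy_std` and `toy_standardized` are made up for illustration). Note that `std` in PyTorch uses the unbiased (n - 1) estimator by default, which is consistent with the expected standard deviation of 0.500 for the first column of the assignment data:

```python
import torch

# Hypothetical toy data: 3 rows, 2 columns (not the assignment tensor)
toy = torch.tensor([[1.0, 100.0],
                    [2.0, 200.0],
                    [3.0, 300.0]])

# Per-column statistics; std defaults to the unbiased (n - 1) estimator
toy_mean = toy.mean(dim=0)
toy_std = toy.std(dim=0)

# Standardization: subtract the column mean, divide by the column std
toy_standardized = (toy - toy_mean) / toy_std
print(toy_standardized)
```

Both toy columns end up with the same standardized values even though their original scales differ by a factor of 100.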

In [None]:
# Calculate the mean and standard deviation of each column
dataMean = [TBD]
dataStDev = [TBD]

print (f"Mean  : {dataMean}")
print (f"St Dev: {dataStDev}")

#### Answer
<pre>
Mean  : tensor([     0.750,     45.000,     10.500,     -0.001,   1250.177])
St Dev: tensor([    0.500,    39.370,     6.455,     0.037,  2499.882])
</pre>

In [None]:
# Standardize the columns using the following formula:
# dataStandardized = (data - mean) / (standardDeviation)
# Hint:  use dataMean and dataStDev from the previous cell, which must be
#        computed with axis=0 (per column, not per row or over the entire tensor)
dataStandardized = (data - [TBD]) / ([TBD])
print (dataStandardized)

#### Answer
<pre>
tensor([[ 0.500,  1.397, -1.162,  0.281,  1.500],
        [-1.500, -0.889, -0.387, -0.040, -0.500],
        [ 0.500, -0.508,  0.387,  1.082, -0.500],
        [ 0.500,  0.000,  1.162, -1.322, -0.500]])
</pre>

### Question
What observations can be made about using standardization?  Does standardization work well in all cases?  How about the last column?

### Your Answer
[TBD]


## One-Hot Encoding
Most tabular datasets contain categorical data.  You will need to convert this type of data into numerical data before training.  We will use the pandas library to automatically convert our categorical data into numeric values with the get_dummies function.
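
To give a feel for what get_dummies produces, here is a small sketch on an invented DataFrame (the column names `color` and `size`, their values, and the names `toy_df` and `toy_onehot` are hypothetical and unrelated to the course dataset):

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns
toy_df = pd.DataFrame({"color": ["red", "green", "red"],
                       "size": ["S", "M", "S"]})

# Each distinct category becomes its own indicator column,
# named <column>_<category>
toy_onehot = pd.get_dummies(toy_df)
print(toy_onehot.columns.tolist())
print(toy_onehot)
```

Each row has exactly one active indicator per original column. Depending on the pandas version, the indicators are stored as booleans or as small integers.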

In [None]:
# Load a categorical dataset using Pandas
import pandas as pd
!wget https://raw.githubusercontent.com/LeakyAI/PyTorch-Overview/main/cat_data_v1.csv
df = pd.read_csv('cat_data_v1.csv')

In [None]:
# Understand the shape of the data by displaying the value of shape
[TBD]

In [None]:
# Show the first portion of the data using head()
[TBD]

In [None]:
# Use the describe() function to better understand the data
# and look for missing values
[TBD]

In [None]:
# Drop rows that contain missing values using dropna()
[TBD]
df.describe()

In [None]:
# Create one-hot encoded values for each column using
# the get_dummies function:
OneHot = [TBD]
OneHot.head()

### Key Takeaways:
- You calculated the minimum and maximum values for each input and applied normalization
- You then applied standardization and compared the results
- You replaced categorical inputs with numerical values using one-hot encoding