# Assignment 3 – Regression Dataset Preparation

This is Assignment 3 for the Introduction to Deep Learning with PyTorch course (www.leaky.ai).  

In this assignment you will practice preparing a real-world dataset for training a neural network.  You will deal with missing items, input features whose values span very different ranges, and categorical text features.

Finally, when training neural networks, it’s usually better to process a batch of inputs at the same time instead of one input at a time.  This tends to give better training results and lets us take advantage of parallel computing to accelerate the calculations.  But how do we create batches of input data?  The good news is that PyTorch includes dataset and dataloader objects that automatically create batches of the input data for the training process.  Once the data is ready, all we have to do is wrap it in a dataset object and pass that to a dataloader object.  You will practice this technique towards the end of the assignment.
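
As a rough illustration of the batching idea described above, here is a minimal sketch that wraps two tensors in a `TensorDataset` and iterates over them with a `DataLoader`. The toy data, the variable names and the batch size of 4 are assumptions made for this example; they are not taken from the assignment notebook.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical toy data: 10 samples with 3 features each and one target value
features = torch.arange(30, dtype=torch.float).reshape(10, 3)
targets = torch.arange(10, dtype=torch.float).reshape(10, 1)

# Wrap the tensors in a dataset object, then hand it to a dataloader
dataset = TensorDataset(features, targets)
loader = DataLoader(dataset, batch_size=4, shuffle=False)

# Each iteration yields one batch of inputs and the matching targets
batch_shapes = []
for x_batch, y_batch in loader:
    batch_shapes.append((tuple(x_batch.shape), tuple(y_batch.shape)))
    print(x_batch.shape, y_batch.shape)
```

With 10 samples and a batch size of 4, the last batch holds only the 2 remaining samples. For training you would typically set `shuffle=True` so that the batches differ from epoch to epoch.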

### To Get Started
1.	Open up a web browser (preferably Chrome)
2.	Copy the Project GitHub Link: https://github.com/LeakyAI/PyTorch-Overview
3.	Head over to Google Colab (https://colab.research.google.com)
4.	Load the notebook: <b>Assignment 3 – Regression Dataset Preparation - Start Here.ipynb</b>
5.	Replace each [TBD] with your own code
6.	Execute the notebook after completing each cell and check your answers using the solution notebook

Good Luck!




In [4]:
# Import PyTorch and check the version
import torch
torch.manual_seed(6)  # Fix the random seed for reproducibility
torch.__version__

# Create a PyTorch tensor with the following content
# Tensor a should contain [[1,2,3,4,5],[6,7,8,9,10],[11,12,13,14,15],[16,17,18,19,20]]
a = torch.tensor([[1,2,3,4,5],[6,7,8,9,10],[11,12,13,14,15],[16,17,18,19,20]], dtype=torch.float)
print(a)

tensor([[ 1.,  2.,  3.,  4.,  5.],
        [ 6.,  7.,  8.,  9., 10.],
        [11., 12., 13., 14., 15.],
        [16., 17., 18., 19., 20.]])


# Finding Maximum Values
Here you will explore different approaches to extracting the highest values from tensors.

In [2]:
# Find the highest and lowest value in the entire tensor 
maximum = a.max()
minimum = a.min()

print (f"Max: {maximum.item()}\nMin: {minimum.item()}")

Max: 20.0
Min: 1.0


### Correct Answer:  
<pre>
Values:
tensor([[ 5.,  4.],
        [10.,  9.],
        [15., 14.],
        [20., 19.]])

Indices:
tensor([[4, 3],
        [4, 3],
        [4, 3],
        [4, 3]])
</pre>
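
The values and indices in the expected answer above are consistent with taking the two largest entries in each row of `a`. A minimal sketch of one way to produce them, assuming `torch.topk` with `k=2` along `dim=1` is the intended call (the notebook cell itself is not shown here):

```python
import torch

# Same tensor as defined at the start of the notebook
a = torch.tensor([[1,2,3,4,5],[6,7,8,9,10],[11,12,13,14,15],[16,17,18,19,20]], dtype=torch.float)

# Two largest values in each row (dim=1) and their column positions
values, indices = torch.topk(a, k=2, dim=1)

print(f"Values:\n{values}\n\nIndices:\n{indices}")
```

`torch.topk` returns the values sorted from largest to smallest by default, which is why column 4 is listed before column 3 in every row.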

# Standardize Columns
For tabular datasets, you will likely need to standardize or normalize the data before using it for training.

In [3]:
# Standardize the column values

# Start by calculating the mean for each column (dim=0)
mean = a.mean(dim=0)

# Then, calculate the standard deviation for each column (dim=0)
std = a.std(dim=0)

# Standardize the data by subtracting the mean and dividing by st. dev.
a_std = (a - mean) / std

# Output the results
print (a)
print (f"\nMeans:\n{mean}")
print (f"\nStandard Deviations:\n{std}")
print (f"\nStandardized:\n{a_std}")

tensor([[ 1.,  2.,  3.,  4.,  5.],
        [ 6.,  7.,  8.,  9., 10.],
        [11., 12., 13., 14., 15.],
        [16., 17., 18., 19., 20.]])

Means:
tensor([ 8.5000,  9.5000, 10.5000, 11.5000, 12.5000])

Standard Deviations:
tensor([6.4550, 6.4550, 6.4550, 6.4550, 6.4550])

Standardized:
tensor([[-1.1619, -1.1619, -1.1619, -1.1619, -1.1619],
        [-0.3873, -0.3873, -0.3873, -0.3873, -0.3873],
        [ 0.3873,  0.3873,  0.3873,  0.3873,  0.3873],
        [ 1.1619,  1.1619,  1.1619,  1.1619,  1.1619]])


### Correct Answer:  
<pre>
tensor([[ 1.,  2.,  3.,  4.,  5.],
        [ 6.,  7.,  8.,  9., 10.],
        [11., 12., 13., 14., 15.],
        [16., 17., 18., 19., 20.]])

Means:
tensor([[ 8.5000,  9.5000, 10.5000, 11.5000, 12.5000]])

Standard Deviations:
tensor([[6.4550, 6.4550, 6.4550, 6.4550, 6.4550]])

Standardized:
tensor([[-1.1619, -1.1619, -1.1619, -1.1619, -1.1619],
        [-0.3873, -0.3873, -0.3873, -0.3873, -0.3873],
        [ 0.3873,  0.3873,  0.3873,  0.3873,  0.3873],
        [ 1.1619,  1.1619,  1.1619,  1.1619,  1.1619]])</pre>
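
The means and standard deviations in the expected answer are printed with an extra pair of brackets, which is what you get when the reduced dimension is kept. The sketch below assumes `keepdim=True` was used for that output, and also checks that each standardized column ends up with a mean close to 0 and a standard deviation close to 1. The names `col_means` and `col_stds` are introduced here for illustration only.

```python
import torch

# Same values as tensor a in the notebook
a = torch.arange(1, 21, dtype=torch.float).reshape(4, 5)

# keepdim=True keeps the reduced dimension, giving shape (1, 5) instead of (5,)
mean = a.mean(dim=0, keepdim=True)
# By default, std() is the sample standard deviation (divides by n - 1)
std = a.std(dim=0, keepdim=True)

# Broadcasting applies the per-column statistics to every row
a_std = (a - mean) / std

# Sanity check: each standardized column should have mean ~0 and std ~1
col_means = a_std.mean(dim=0)
col_stds = a_std.std(dim=0)
print(mean.shape, std.shape)
print(col_means, col_stds)
```

Either shape, `(5,)` or `(1, 5)`, broadcasts correctly against the `(4, 5)` input, so the standardized result is the same with or without `keepdim=True`.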

## Key Takeaways:
- You removed missing items from the dataset
- You used standardization to ensure your input values were of similar scale
- You replaced categorical inputs with numerical values using one-hot encoding
- You wrapped the dataset using the PyTorch dataset and dataloader objects, making it ready to be used for training a neural network
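
The first and third takeaways can be sketched in a few lines of plain PyTorch. This is only an illustration under stated assumptions: the toy tensors, the use of NaN to mark missing items and the integer category codes are hypothetical, and the assignment notebook may well handle these steps differently.

```python
import torch
import torch.nn.functional as F

# Hypothetical numeric features; NaN marks a missing item
x = torch.tensor([[1.0, 2.0],
                  [float("nan"), 3.0],
                  [4.0, 5.0]])

# Hypothetical categorical column, already mapped from text to integer codes
codes = torch.tensor([0, 2, 1])

# Keep only the rows that contain no missing values
keep = ~torch.isnan(x).any(dim=1)
x_clean = x[keep]
codes_clean = codes[keep]

# Replace each category code with a one-hot vector of length 3
one_hot = F.one_hot(codes_clean, num_classes=3).float()

# Combine numeric and one-hot features into a single input tensor
inputs = torch.cat([x_clean, one_hot], dim=1)
print(inputs)
```

Passing `num_classes` explicitly keeps the one-hot width fixed even if a category no longer appears after the rows with missing values are dropped.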
