# Datasets and Neural Networks
This notebook will step through the process of loading an arbitrary dataset in PyTorch, and creating a simple neural network for regression.

# Datasets
We will first work through loading an arbitrary dataset in PyTorch. For this project, we chose the <a href="http://www.cs.toronto.edu/~delve/data/abalone/desc.html">DELVE abalone dataset</a>.

First, download the archive from the link above and extract it, then unzip `Dataset.data.gz` and move `Dataset.data` into `hackpack-ml/models/data`.
We are given the following attribute information in the spec:
```
Attributes:
  1   sex                 u  M F I	# Gender or Infant (I)
  2   length              u  (0,Inf]	# Longest shell measurement (mm)
  3   diameter            u  (0,Inf]	# perpendicular to length     (mm)
  4   height              u  (0,Inf]	# with meat in shell (mm)
  5   whole_weight        u  (0,Inf]	# whole abalone  (gr)
  6   shucked_weight      u  (0,Inf]	# weight of meat (gr)    
  7   viscera_weight      u  (0,Inf]	# gut weight (after bleeding) (gr)
  8   shell_weight        u  (0,Inf]	# after being dried (gr)
  9   rings               u  0..29	# +1.5 gives the age in years
```

In [None]:
import math
from tqdm import tqdm
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import torch.nn.functional as F
import pandas as pd

from torch.utils.data import Dataset, DataLoader

Pandas is a data manipulation library that works really well with structured data. We can use Pandas DataFrames to load the dataset.

In [None]:
col_names = ['sex', 'length', 'diameter', 'height', 'whole_weight', 
             'shucked_weight', 'viscera_weight', 'shell_weight', 'rings']
abalone_df = pd.read_csv('../data/Dataset.data', sep=' ', names=col_names)
abalone_df.head(n=3)
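
The `sex` column holds the letters M, F and I, so it has to be turned into numbers before a row can become a tensor. As a minimal sketch of one-hot encoding with `pd.get_dummies`, here is a small hand-made frame (the values are made up for illustration and are not rows from the dataset):

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset.
toy_df = pd.DataFrame({
    'sex': ['M', 'F', 'I', 'M'],
    'length': [0.45, 0.53, 0.33, 0.44],
})

# Replace the single 'sex' column with one indicator column per category.
encoded = pd.get_dummies(toy_df, columns=['sex'], dtype=float)
print(encoded.columns.tolist())
print(encoded)
```

Each category gets its own 0/1 column, so no artificial ordering is implied between M, F and I.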

We define a subclass of PyTorch Dataset for our Abalone dataset.

In [None]:
class AbaloneDataset(data.Dataset):
    """Abalone dataset. Provides quick iteration over rows of data."""

    def __init__(self, csv):
        """
        Args: csv (string): Path to the Abalone dataset.
        """
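        self.y = ['rings']
        raw_columns = ['sex', 'length', 'diameter', 'height', 'whole_weight',
                       'shucked_weight', 'viscera_weight', 'shell_weight']
        df = pd.read_csv(csv, sep=' ', names=(raw_columns + self.y))

        # Turn categorical data into a machine-interpretable format: one-hot
        # encoding replaces 'sex' with the indicator columns sex_F, sex_I, sex_M.
        self.abalone_df = pd.get_dummies(df, columns=['sex'], dtype=float)
        # Every column except the target is a feature (10 after encoding)
        self.features = [c for c in self.abalone_df.columns if c not in self.y]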

    def __len__(self):
        return len(self.abalone_df)

    def __getitem__(self, idx):
        """Return an (x, y) pair: x holds the abalone features, y the ring count."""
        row = self.abalone_df.iloc[idx]
        features = row[self.features].to_numpy(dtype='float32')
        y = row[self.y].to_numpy(dtype='float32')
        return torch.from_numpy(features), torch.from_numpy(y)
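
To see how a `Dataset` subclass plugs into a `DataLoader` without needing the abalone file, here is a small sketch. `ToyDataset` is a hypothetical stand-in filled with random tensors; its row and feature counts are arbitrary assumptions.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """Hypothetical stand-in for AbaloneDataset, backed by random tensors."""

    def __init__(self, n_rows=10, n_features=8):
        self.x = torch.randn(n_rows, n_features)
        self.y = torch.randn(n_rows, 1)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        # Same contract as above: one (features, target) pair per index.
        return self.x[idx], self.y[idx]


toy = ToyDataset()
loader = DataLoader(toy, batch_size=4, shuffle=False)

# The loader stacks individual items into batches along a new first dimension.
batch_shapes = [(tuple(xb.shape), tuple(yb.shape)) for xb, yb in loader]
print(batch_shapes)
```

Only `__len__` and `__getitem__` are needed; batching and shuffling are handled by the `DataLoader`.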

# Neural Networks

The task is to predict the number of rings of an abalone, which is a proxy for its age (rings + 1.5 gives the age in years), from physical measurements. We build a simple neural network with one hidden layer to model the regression.

In [None]:
class Net(nn.Module):

    def __init__(self, feature_size):
        super(Net, self).__init__()
        # feature_size input features -> 4 hidden units -> 1 output
        self.fc1 = nn.Linear(feature_size, 4)
        self.fc2 = nn.Linear(4, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x
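
A quick way to sanity-check a network like this is to push a random batch through it and look at the output shape and parameter count. The sketch below redefines the same layout under the hypothetical name `TinyNet` so it runs on its own; the feature count of 8 and batch size of 5 are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyNet(nn.Module):
    """Same layout as the network above: one hidden layer of 4 units."""

    def __init__(self, feature_size):
        super().__init__()
        self.fc1 = nn.Linear(feature_size, 4)
        self.fc2 = nn.Linear(4, 1)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))


model = TinyNet(feature_size=8)
dummy_batch = torch.randn(5, 8)  # 5 rows, 8 features each
out = model(dummy_batch)

# Weights and biases of both layers: (8*4 + 4) + (4*1 + 1)
n_params = sum(p.numel() for p in model.parameters())
print(out.shape, n_params)
```

The output has one value per input row, which is the shape `nn.MSELoss` expects when the targets are stored as a column.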

We instantiate an Abalone dataset instance and create DataLoaders for train and test sets.

In [None]:
dataset = AbaloneDataset('../data/Dataset.data')
train_split = math.floor(len(dataset) * 0.8)
test_split = len(dataset) - train_split # Remainder, so the two sizes always sum to len(dataset)

trainset = [dataset[i] for i in range(train_split)]
testset = [dataset[train_split + j] for j in range(test_split)]
batch_sz = len(trainset) # The dataset is small, so a single batch can hold the whole training set
trainloader = data.DataLoader(trainset, batch_size=batch_sz, shuffle=True, num_workers=4)
testloader = data.DataLoader(testset, batch_size=batch_sz, shuffle=False, num_workers=4)
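
The split above takes the first 80% of rows in file order. If the order of rows in a file is not random, a shuffled split may be preferable. The following is a sketch of that alternative using `torch.utils.data.random_split` on a toy `TensorDataset`; the seed and the toy sizes are assumptions, and this is not what the cell above does.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy data standing in for the real dataset: 10 rows, 8 features, 1 target.
toy = TensorDataset(torch.randn(10, 8), torch.randn(10, 1))

n_train = int(len(toy) * 0.8)
n_test = len(toy) - n_train  # remainder, so the sizes always add up

# A seeded generator makes the random split reproducible.
generator = torch.Generator().manual_seed(0)
train_part, test_part = random_split(toy, [n_train, n_test], generator=generator)
print(len(train_part), len(test_part))
```

The two returned subsets can be passed to `DataLoader` in the same way as the lists used above.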

Now we can initialize our network and define the train and test functions.

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
net = Net(len(dataset.features)).to(device) # Model and data must live on the same device
loss_fn = nn.MSELoss()
optimizer = optim.Adam(net.parameters(), lr=0.1)
gpu_ids = [0] # On Colab, we have access to one GPU. Change this value as you see fit

def train(epoch):
    """
    Trains our net on data from the trainloader for a single epoch
    """
    net.train()
    with tqdm(total=len(trainloader.dataset)) as progress_bar:
        for batch_idx, (inputs, targets) in enumerate(trainloader):
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad() # Clear any stored gradients for new step
            outputs = net(inputs.float())
            loss = loss_fn(outputs, targets) # Calculate loss between prediction and label  
            loss.backward() # Backpropagate gradient updates through net based on loss
            optimizer.step() # Update net weights based on gradients
            progress_bar.set_postfix(loss=loss.item())
            progress_bar.update(inputs.size(0))
            
        
def test(epoch):
    """
    Run net in inference mode on test data. 
    """                       
    net.eval()
    # Disable gradient tracking; nothing is backpropagated during evaluation
    with torch.no_grad():
        with tqdm(total=len(testloader.dataset)) as progress_bar:
            for batch_idx, (inputs, targets) in enumerate(testloader):
                inputs, targets = inputs.to(device).float(), targets.to(device).float()
                outputs = net(inputs)
                loss = loss_fn(outputs, targets)
                progress_bar.set_postfix(testloss=loss.item())
                progress_bar.update(inputs.size(0))
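
Both functions score predictions with `nn.MSELoss`, which by default averages the squared differences between predictions and targets. A small worked example, using made-up ring counts, compares the formula computed by hand with the module:

```python
import torch
import torch.nn as nn

# Made-up predictions and targets, stored as columns like the batches above.
preds = torch.tensor([[9.0], [11.0], [7.0]])
targets = torch.tensor([[10.0], [10.0], [9.0]])

# Squared errors are 1, 1 and 4, so the mean is 6 / 3 = 2.
by_hand = ((preds - targets) ** 2).mean()
by_module = nn.MSELoss()(preds, targets)
print(by_hand.item(), by_module.item())
```

Because the errors are squared, the loss is in units of rings squared; taking its square root gives a figure that is easier to compare against ring counts.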


Now that everything is prepared, it's time to train!

In [None]:
test_freq = 5 # Evaluate on the test set every test_freq epochs

for epoch in range(0, 200):
    train(epoch)
    if epoch % test_freq == 0:
        test(epoch)

With the network in eval mode, we run a single test sample through it to see how well it does.

In [None]:
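net.eval()
sample = testset[0]
# Run the sample on whichever device the network lives on, without tracking gradients
net_device = next(net.parameters()).device
with torch.no_grad():
    predicted_rings = net(sample[0].to(net_device))
true_rings = sample[1]

print(f'Input features: {sample[0]}')
print(f'Predicted rings: {predicted_rings.item()}, True rings: {true_rings.item()}')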
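
The network is trained on the `rings` column, and the attribute list at the top says that adding 1.5 to the ring count gives the age in years. A minimal helper for that conversion could look like this (`rings_to_age` is a hypothetical name, not part of the notebook):

```python
def rings_to_age(rings):
    """Convert a ring count to an age in years using the +1.5 rule from the spec."""
    return rings + 1.5


# Works for plain numbers; the same arithmetic also applies to tensors.
print(rings_to_age(10))
```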

Congratulations! You now know how to load your own datasets into PyTorch and run models on them. For a computer vision example, check out the DenseNet notebook. Happy hacking!