# Machine Learning with PyTorch

## Tasks with Networks

<font size="+1"><u><b>A simple feature classifier</b></u></font>
<a href="NetworkExamples_0.ipynb"><img src="img/open-notebook.png" align="right"/></a>

<font size="+1">An image classifier</font>
<a href="NetworkExamples_1.ipynb"><img src="img/open-notebook.png" align="right"/></a>

<font size="+1">A regression prediction</font>
<a href="NetworkExamples_2.ipynb"><img src="img/open-notebook.png" align="right"/></a>

<font size="+1">Clustering with PyTorch</font>
<a href="NetworkExamples_3.ipynb"><img src="img/open-notebook.png" align="right"/></a>

<font size="+1">Generative Adversarial Networks (GAN)</font> 
<a href="NetworkExamples_4.ipynb"><img src="img/open-notebook.png" align="right"/></a>

<font size="+1">Part of Speech Tagger</font>
<a href="NetworkExamples_5.ipynb"><img src="img/open-notebook.png" align="right"/></a>

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import torch
from sklearn.model_selection import train_test_split

# For demonstration, we fall back to the CPU if CUDA is not available
device = torch.device('cpu')

# Check the status of the GPU (if present)
if torch.cuda.is_available():
    print(f'Allocated GPU memory: {torch.cuda.memory_allocated():,}')
    # *MUCH* faster to run on GPU
    device = torch.device('cuda') 

## A simple feature classifier

The next lesson is a slightly simplified version of a problem I worked on at a former job.  The company sold services to big and small clothing retailers to help guide their customers toward the best garment size for each customer.

We made these recommendations based on a survey of a few measurements that consumers tend to know about themselves.  The set of measurements is slightly different for men's versus women's sizes. Moreover, retailers size their garments differently from one another, and sometimes size different garment categories or lines differently as well.  This is a case where some machine learning might illuminate the patterns of the relationship between body sizes and clothing sizes.

First, let us read in some actual data used in one of our models.  In production, we used neither PyTorch nor any other DNN framework, only more "conventional" machine learning techniques.  I was curious how a DNN might perform.

You can see that the features below are from women customers and garments; the data is anonymized in the sense that any customer or garment numbers used by the retailer have been removed.  But the body and clothing sizes are actual, as are their distributions and interrelationships. The feature `bra_size_cup` numerically encodes the letter sizes used in American garments (e.g. 'A', 'B', 'C', 'D'); the other features all start out as numeric values, measured in years, inches, or pounds (or the not-really-inches encoded by numeric shoe sizes).

In [None]:
df = pd.read_csv('data/garments.csv.gz', dtype={'TARGET':str})
print(len(df))
df.head()

### Encoding the data

For this problem, we one-hot-encode the target classes.  Some loss functions instead expect a single output that is ordinal encoded, so we could choose a different encoding if our model needed that. One benefit of the one-hot approach is that we can derive something akin to probabilities for the different predictions.  This is desirable for this use case, and many classic models provide a similar `model.predict_proba` for multi-class classifiers.
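To make the contrast between the two encodings concrete, here is a minimal sketch on a tiny, made-up column of sizes (these are not rows from the garment data, and the names `sizes` and `order` are only illustrative). It shows that the ordinal code is just the position of the "hot" column once the columns are put in garment-size order.

```python
import pandas as pd

# Hypothetical sizes, not rows from the real data
sizes = pd.Series(['4', '00', '10', '4', '0'], name='TARGET')
order = ['00', '0', '4', '10']   # garment-size order, not lexicographic

# One-hot: one indicator column per class, reordered by garment size
one_hot = pd.get_dummies(sizes).astype(int)[order]

# Ordinal: a single integer per row giving the size's rank in `order`
ordinal = pd.Categorical(sizes, categories=order, ordered=True).codes

print(one_hot)
print(ordinal.tolist())

# The position of the 1 in each one-hot row is the ordinal code
recovered = one_hot.values.argmax(axis=1)
```

Either form carries the same information; the one-hot form simply gives the network one output per size to score.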

In [None]:
X = df[['age', 'bra_size_chest', 'bra_size_cup', 'height', 'shoe_size', 'weight']]

# One-hot encoding
df_one_hot = pd.get_dummies(df)
Y = df_one_hot[[col for col in df_one_hot.columns if col.startswith('TARGET')]]

# Nicer order for columns (sorted by garment size not lexicographically)
Y.columns = [col.replace('TARGET_', '') for col in Y.columns] 
Y = Y['00 0 2 4 6 8 10 12 14 16 18'.split()]
labels = list(Y.columns)
labels

In [None]:
# Take a look at the one-hot targets
Y.tail()

Choose some values for the sizes of the different layers.  We give the first hidden layer as many neurons as there are terms in a second-order polynomial of the input features; we hope this will be roughly similar to using something like `sklearn.preprocessing.PolynomialFeatures()`, for those familiar with that.
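As a quick sanity check on that size, the sketch below counts the second-order terms three ways for six input features: by enumerating the monomials, by the closed form C(n+2, 2), and by the arithmetic used in this notebook (one constant, n linear terms, n squares, and n(n-1)/2 cross terms). The helper name `n_poly_terms` is hypothetical.

```python
from itertools import combinations_with_replacement
from math import comb

def n_poly_terms(n_features, degree=2):
    "Count monomials of total degree <= `degree`, including the constant"
    return sum(
        1
        for d in range(degree + 1)
        for _ in combinations_with_replacement(range(n_features), d)
    )

n = 6   # the six feature columns selected into X
by_enumeration = n_poly_terms(n)
closed_form = comb(n + 2, 2)
notebook_formula = int(n * 2 + (n * (n - 1) / 2) + 1)
print(by_enumeration, closed_form, notebook_formula)
```

All three agree, so the first hidden layer has one neuron per polynomial term, although nothing forces the trained neurons to actually compute those terms.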

In [None]:
# The number of input features
in_dim = X.shape[1]

# The number of "polynomial features" of order 2
hidden1 = int(in_dim * 2 + (in_dim * (in_dim-1) / 2) + 1)
out_dim = Y.shape[1]

# The sizes of the "inference layers"
hidden2 = hidden3 = hidden4 = 2 * out_dim   

# Remind ourselves of the layer sizes
in_dim, hidden1, hidden2, hidden3, hidden4, out_dim

The network we generate will resemble the image below.  Weights are colored arbitrarily in the image simply to illustrate the different connection strengths in a trained network; these specific colors/values are randomly selected, not the ones actually trained below.

![Garment Network](img/garment-model.png)

Drawn with [NN-SVG](http://alexlenail.me/NN-SVG/index.html)

### The training regime

The below training function is relatively generic.  It will take as arguments:

* A `model` to train, most likely one we have configured with `torch.nn.Sequential`
* `X_train` for the training data
* `Y_train` as a corresponding one-hot-encoded output vector
* An `optimizer` to use within the training regime
* A `loss_fn` to use for backpropagation
* We can also optionally pass in settings for `epochs`, `batch_size`, and `early_stop` for abandoning additional epochs of training.
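The `early_stop` setting deserves a closer look, since in the function below it acts as a window for a plateau check rather than as an immediate stop. Here is a minimal standalone sketch of that rule, assuming the same 0.5% threshold and halving step; the names `plateau_halving`, `window`, and `tolerance` are illustrative and do not appear in the training function.

```python
def plateau_halving(loss_history, lr, window=6, tolerance=0.005):
    "Return the (possibly halved) learning rate after the latest epoch"
    if len(loss_history) < window:
        return lr                # not enough epochs yet to judge
    recent = loss_history[-window:]
    spread = max(recent) - min(recent)
    if spread / loss_history[-1] < tolerance:
        return lr / 2            # loss is flat: take smaller steps
    return lr

# Hypothetical loss values: still improving, then stalled
improving = [0.90, 0.70, 0.55, 0.45, 0.40, 0.37]
stalled = [0.3001, 0.3000, 0.3001, 0.3000, 0.3001, 0.3000]
print(plateau_halving(improving, 1e-4))
print(plateau_halving(stalled, 1e-4))
```

Training only ends early once repeated halvings push the learning rate below a tiny floor, so a single flat stretch slows the steps down rather than stopping outright.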

In [None]:
def do_training(model, X_train, Y_train, optimizer, loss_fn, 
                epochs=500, batch_size=1000, early_stop=6, quiet=False):
    "Perform a training regime that includes automatic decay of learning rate"
    loss_history = []
    print("+++ Beginning %d epochs with batch size %d" % (epochs, batch_size))
    
    # We expect to stop via learning-rate decay, but in case something odd
    # happens, enforce a hard limit on the number of epochs
    for epoch in range(1, epochs+1):
        for start in range(0, len(X_train), batch_size):
            # Next batch of training rows
            X = X_train[start:start+batch_size]
            Y = Y_train[start:start+batch_size]

            # Forward pass: compute predicted Y by passing X to the model.
            Y_pred = model(X)
            # Compute loss
            loss = loss_fn(Y_pred, Y)
        
            # Before the backward pass, use the optimizer object to zero all of the
            # gradients for the variables it will update (which are the learnable
            # weights of the model). This is because by default, gradients are
            # accumulated in buffers (i.e, not overwritten) whenever .backward()
            # is called. Checkout docs of torch.autograd.backward for more details.
            optimizer.zero_grad()

            # Backward pass: compute gradient of the loss with respect to model
            # parameters
            loss.backward()

            # Calling the step function on an Optimizer makes an update to its
            # parameters
            optimizer.step()

        # Every epoch print out some information
        if quiet:
            print('.', end='', flush=True)
        else:
            print("Epoch %d; Loss: %0.6f (lr=%0.8f)" % (
                    epoch, loss.item(), optimizer.param_groups[0]['lr']))
        loss_history.append(loss.item())

        # Is this regime currently failing to reduce loss?
        ## Run for at least `early_stop` epochs
        if len(loss_history) < early_stop:
            continue
            
        ## Lower learning rate by 2x if no improvement in loss for multiple epochs
        diff = max(loss_history[-early_stop:]) - min(loss_history[-early_stop:])
        if diff / loss_history[-1] < 0.005:
            optimizer.param_groups[0]['lr'] /= 2
            
        ## If learning rate is lowered to tiny value, we are not getting anywhere
        if optimizer.param_groups[0]['lr'] < 1e-8:
            print("+++ Discontinuing training regime when loss becomes constant")
            break

    # Final message in quiet mode
    if quiet:
        print("\nFinal: Epoch %d; Loss: %0.6f (lr=%0.8f)" % (
                epoch, loss.item(), optimizer.param_groups[0]['lr']))
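
The comment about `optimizer.zero_grad()` in the loop above can be checked directly on a single tensor. This is a minimal sketch with made-up values; it zeroes the gradient on the tensor itself rather than through an optimizer, which is a simplification for illustration.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d(sum(w*w))/dw = 2*w
(w * w).sum().backward()
first = w.grad.clone()

# Second backward pass WITHOUT zeroing: the new gradient is added on
(w * w).sum().backward()
accumulated = w.grad.clone()

# Zero the buffer, then a backward pass gives a fresh gradient again
w.grad.zero_()
(w * w).sum().backward()
fresh = w.grad.clone()

print(first, accumulated, fresh)
```

Without the zeroing step, each batch would update the weights using the sum of all gradients seen so far, not the gradient of the current batch.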

### Creating the model

This model, like many for this type of problem, uses multiple fully-connected or "linear" layers.  We hope the first hidden layer captures all the immediate feature interactions.  The three additional hidden layers after it are very similar to each other, each having what we hope is simply "enough" neurons.  In my experimentation, there is little difference if we use one fewer hidden layer.
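Before looking at the full model, it may help to see how a stack of linear layers maps a batch of rows to a batch of scores. The sketch below is a deliberately shortened stand-in, not the model used in this lesson: the name `tiny`, the random batch, and the reduced number of layers are all assumptions, with layer sizes picked to match six inputs and eleven garment sizes.

```python
import torch

torch.manual_seed(0)

# Illustrative sizes only: 6 inputs, 28 and 22 hidden units, 11 outputs
tiny = torch.nn.Sequential(
    torch.nn.Linear(6, 28),
    torch.nn.ReLU(),
    torch.nn.Linear(28, 22),
    torch.nn.LeakyReLU(),
    torch.nn.Linear(22, 11),
    torch.nn.Sigmoid(),
)

batch = torch.randn(4, 6)      # 4 made-up rows of 6 features
scores = tiny(batch)
print(scores.shape)            # one row of 11 scores per input row

# Each Linear layer holds (inputs * outputs) weights plus one bias per output
n_params = sum(p.numel() for p in tiny.parameters())
print(n_params)
```

The activations have no parameters of their own, so the parameter count comes entirely from the `Linear` layers.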

In [None]:
# Create a sequential NN
model = torch.nn.Sequential(
    # This layer allows "polynomial features"
    torch.nn.Linear(in_dim, hidden1),
    # The activation is treated as a separate layer
    torch.nn.ReLU(),

    # This layer is "inference"
    torch.nn.Linear(hidden1, hidden2),
    torch.nn.ReLU(), 
    
    # A Dropout layer sometimes reduces co-adaptation of neurons
    torch.nn.Dropout(p=0.1),

    # This layer is "inference"
    torch.nn.Linear(hidden2, hidden3),
    # Often Leaky ReLU eliminates the "dead neuron" danger
    torch.nn.LeakyReLU(), 
    
    # Might try another "inference" layer
    torch.nn.Linear(hidden3, hidden4),
    torch.nn.LeakyReLU(), 

    # A sigmoid activation is used for a binary decision
    # Since we use one-hot encoding, we make an independent decision per size
    torch.nn.Linear(hidden4, out_dim),  
    torch.nn.Sigmoid()
    ).to(device)
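
Because the final activation is a sigmoid, each output is an independent score between 0 and 1, and the eleven scores need not sum to 1. The sketch below uses NumPy and made-up pre-activation values to show this, along with one simple way to rescale the scores into something closer to probabilities. Softmax is included only for comparison; it is not used by the model above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())    # subtract the max for numerical stability
    return e / e.sum()

# Made-up pre-activation values for 5 classes
logits = np.array([-2.0, 0.5, 1.5, 0.0, -3.0])

independent = sigmoid(logits)                  # each in (0, 1), sum is free
normalized = independent / independent.sum()   # rescaled to sum to 1
competing = softmax(logits)                    # classes share one unit of mass

print(independent.sum(), normalized.sum(), competing.sum())
```

All three agree on which class ranks first; they differ in how the remaining mass is spread across the other classes.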

### Train/test split on data, and convert to tensors

In [None]:
# Free up the GPU
torch.cuda.empty_cache()
print("Just the model itself:")
print(f"{torch.cuda.memory_allocated():,} bytes allocated on GPU")

# Split the original data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)

# Convert arrays to tensors
X_train = torch.from_numpy(X_train.values).float().to(device)
X_test  = torch.from_numpy(X_test.values).float().to(device)
Y_train = torch.from_numpy(Y_train.values)[:, np.newaxis].float().to(device)
Y_test  = torch.from_numpy(Y_test.values)[:, np.newaxis].float().to(device)

print("Add the training and testing data to GPU:")
print(f"{torch.cuda.memory_allocated():,} bytes allocated on GPU")

In [None]:
from torchsummary import summary
summary(model, input_size=(1, X_train.shape[1]))

### Perform the training

In [None]:
# The target is 3D, but with the middle dimension 1
print(Y_train.size())
# We should remove the extra dimension to conform to model shape
target = Y_train.view(-1,11)
print(target.size())
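
The reshaping step above can be tried on a small stand-in tensor. This sketch uses made-up one-hot rows instead of the real targets, and shows that `.squeeze(1)` is an equivalent way to drop the size-1 middle dimension without hard-coding the number of classes.

```python
import torch

# Made-up one-hot rows standing in for the real targets: 3 rows, 11 classes
Y = torch.eye(11)[[0, 5, 10]]
Y3 = Y[:, None]                 # same effect as indexing with np.newaxis
print(Y3.shape)                 # torch.Size([3, 1, 11])

flat_view = Y3.view(-1, 11)     # the approach used in this notebook
flat_squeeze = Y3.squeeze(1)    # drops only the size-1 middle dimension
print(flat_view.shape, flat_squeeze.shape)
```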

In [None]:
%%time
## Now run model (start with high learning rate and decay)

# MSELoss is a common default, SmoothL1Loss does a little better here
loss_fn = torch.nn.SmoothL1Loss()
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
# Train against the 2D `target` built above, not the 3D `Y_train`
do_training(model, X_train, target, optimizer, loss_fn, epochs=250)

### Evaluating the model

Unfortunately, this approach does not make especially good predictions.  One wrinkle in the problem is that while we would ideally like to predict the absolutely correct answer as often as possible, there is some value as well in making predictions that are *close* and/or having 2nd or 3rd ranked predictions that are right or close.  Perhaps a custom loss function that expressed the exact real goal would do better.
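To put numbers on "close" and on "2nd or 3rd ranked", one could score predictions in three ways at once. The following is a rough sketch with made-up scores, assuming class indices follow garment-size order; the function name `size_metrics` and the choice of `k=3` are hypothetical, and this is not the evaluation we used in production.

```python
import numpy as np

def size_metrics(scores, truth, k=3):
    "Exact, within-one-size, and top-k hit rates for ordered classes"
    ranked = np.argsort(-scores, axis=1)      # best class first
    best = ranked[:, 0]
    exact = np.mean(best == truth)
    within_one = np.mean(np.abs(best - truth) <= 1)
    top_k = np.mean((ranked[:, :k] == truth[:, None]).any(axis=1))
    return exact, within_one, top_k

# Made-up scores for 4 customers over 5 ordered sizes
scores = np.array([
    [0.1, 0.7, 0.2, 0.0, 0.0],   # top prediction is class 1
    [0.0, 0.2, 0.5, 0.3, 0.0],   # top prediction is class 2
    [0.0, 0.1, 0.3, 0.4, 0.2],   # top prediction is class 3
    [0.6, 0.3, 0.1, 0.0, 0.0],   # top prediction is class 0
])
truth = np.array([1, 3, 3, 4])
exact, within_one, top_k = size_metrics(scores, truth)
print(exact, within_one, top_k)
```

In this toy data the second customer is counted as a miss by the exact measure but as a hit by the other two, which is the kind of partial credit described above.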

The main problem that shows up in this PyTorch model is one that we experienced using some other classes of machine learning approaches: the predictions tend to skew towards more central values.  This makes them correct, or at least close, for "average sized" customers, but often dramatically off for those at either end of the garment size scale.

In [None]:
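from random import randrange

ndx = randrange(len(X_test))
# Disable dropout and gradient tracking while predicting
model.eval()
with torch.no_grad():
    probs = model(X_test[ndx]).cpu().numpy()
# Restore training mode, since we train further below
model.train()
truth = torch.argmax(Y_test[ndx]).item()

plt.figure(figsize=(8, 4))
# Plot against integer positions so the line and the bar share one axis
plt.plot(range(len(labels)), probs)
plt.bar(truth, 1.1 * probs.max(), width=.25, color='red')
plt.xticks(range(len(labels)), labels, fontsize=8, rotation='vertical');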

### Additional training

Even though it is not obvious in the loss, we actually **do** get a narrower peak and often more accurate estimates with more epochs.

In [None]:
%%time
# Expect ~7 min
do_training(model, X_train, target, optimizer, loss_fn, 
            early_stop=1001, quiet=True, epochs=1000)

## Next Lesson

**Tasks with Networks**: This example was a relatively common one of making categorical predictions from independent features.  This type of network generally uses primarily or exclusively linear layers.  Next we will take a look at an image classifier that will need some layers to associate nearby pixels.

<a href="NetworkExamples_1.ipynb"><img src="img/open-notebook.png" align="left"/></a>