# Machine Learning with PyTorch

## Tasks with Networks

<font size="+1">A simple feature classifier</font>
<a href="NetworkExamples_0.ipynb"><img src="img/open-notebook.png" align="right"/></a>

<font size="+1">An image classifier</font>
<a href="NetworkExamples_1.ipynb"><img src="img/open-notebook.png" align="right"/></a>

<font size="+1"><u><b>A regression prediction</b></u></font>
<a href="NetworkExamples_2.ipynb"><img src="img/open-notebook.png" align="right"/></a>

<font size="+1">Clustering with PyTorch</font>
<a href="NetworkExamples_3.ipynb"><img src="img/open-notebook.png" align="right"/></a>

<font size="+1">Generative Adversarial Networks (GAN)</font> 
<a href="NetworkExamples_4.ipynb"><img src="img/open-notebook.png" align="right"/></a>

<font size="+1">Part of Speech Tagger</font>
<a href="NetworkExamples_5.ipynb"><img src="img/open-notebook.png" align="right"/></a>

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
from sklearn.model_selection import train_test_split

# For demonstration, we can use CPU target if CUDA not available
device = torch.device('cpu')

# Check the status of the GPU (if present)
if torch.cuda.is_available():
    torch.cuda.memory_allocated()
    # *MUCH* faster to run on GPU
    device = torch.device('cuda') 

## A regression prediction

In the classification problem I took from my work on predicting garment sizes, there was a simplification intended to help our DNN.  Actual sizes are two-dimensional, having both a numeric size that measures "width" and a descriptor ("LONG", "SHORT", "PETITE") that measures length.  Of course, cloth is flexible, so these things interact somewhat.  In any case, the data provided uses only the main size component, a "number", although `00` is a special number that is different from `0` in garment sizing.

Given only those single-number sizes, we can put them in a linear order unambiguously.  Framing it that way suggests a regression problem rather than a classification problem.  In the rest of this notebook, we construct a model that is mostly the same as the classification model we built before, but is framed as a regression instead.

In [None]:
df = pd.read_csv('data/garments.csv.gz', dtype={'TARGET':str})
print(len(df))
df.head()

### Encoding the target

For a regression problem, we do not wish to one-hot-encode the target.  Instead, we will simply convert the linearly ordered sizes to sequential integers.  These are, strictly speaking, ordinal rather than quantitative values, but it does not matter very much for this construction.  That is, size `8` is not "twice as much" as size `4` in any measure, nor is it so in the integer sequence encoding; `8` is simply *more than* `4` to some degree.

In [None]:
X = df[['age', 'bra_size_chest', 'bra_size_cup', 'height', 'shoe_size', 'weight']]

sizes = ['00', '0', '2', '4', '6', '8', '10', '12', '14', '16', '18']
size_to_num = dict(zip(sizes, range(len(sizes))))
num_to_size = {v:k for (k, v) in size_to_num.items()}

y = df.TARGET.map(size_to_num)
y.tail()
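
A small illustration of why the sizes are kept as strings and mapped through an explicit ordered list: converting the labels with `int()` would collapse `00` and `0` into the same value.  The snippet repeats the `sizes` list from the cell above so that it runs on its own; the `naive` variable is introduced here only for the comparison.

```python
# Ordered size labels, repeated from the cell above so this runs standalone
sizes = ['00', '0', '2', '4', '6', '8', '10', '12', '14', '16', '18']
size_to_num = dict(zip(sizes, range(len(sizes))))
num_to_size = {v: k for (k, v) in size_to_num.items()}

# Naive integer conversion cannot tell '00' from '0'
naive = [int(s) for s in sizes]
print(len(set(naive)), "distinct values from int() for", len(sizes), "labels")

# The explicit mapping keeps every label distinct and in order
print(size_to_num['00'], size_to_num['0'], size_to_num['18'])

# The inverse mapping recovers the original label
print(num_to_size[size_to_num['10']])
```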

The only real difference in setting up the layer sizes is that the output dimension is one rather than a larger number of one-hot-encoded elements.

In [None]:
# The number of input features
in_dim = X.shape[1]

# The number of "polynomial features" of order 2
hidden1 = int(in_dim * 2 + (in_dim * (in_dim-1) / 2) + 1)
out_dim = 1

# The sizes of the "inference layers"
hidden2 = hidden3 = hidden4 = 2 * len(y.unique())  

# Remind ourselves of the layer sizes
in_dim, hidden1, hidden2, hidden3, hidden4, out_dim
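
One way to read the `hidden1` formula, offered as an interpretation rather than something the notebook states: it counts the bias term, the linear terms, the squared terms, and the pairwise products of the inputs, which together are all monomials of degree at most 2.  The sketch below checks that count by brute-force enumeration, assuming the six feature columns selected earlier.

```python
from itertools import combinations_with_replacement

in_dim = 6  # the six feature columns selected above

# Count monomials of degree 0, 1 and 2 in `in_dim` variables by enumeration
n_terms = sum(
    1
    for degree in range(3)
    for _ in combinations_with_replacement(range(in_dim), degree)
)

# The closed form used for `hidden1` in the cell above
hidden1 = int(in_dim * 2 + (in_dim * (in_dim - 1) / 2) + 1)

print(n_terms, hidden1)
```

The layer does not compute polynomial terms itself; the count only motivates giving the first hidden layer enough units to approximate them.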

The network we generate will resemble the image below.  Weights are colored arbitrarily in the image simply to illustrate the different connection strengths in a trained network; these specific colors/values are randomly selected, not the ones actually trained below.

![Garment Regression Network](img/garment-regressor.png)

Drawn with [NN-SVG](http://alexlenail.me/NN-SVG/index.html)

### Customizing the training regime

For the most part, the code in `do_training()` is the same as we used previously.  However, a lot of trial and error went into tweaking the learning rate decay to be "pretty good" for this problem.  Batches and epochs are trained *much* more quickly with this simplified target; but at the same time, it takes many more epochs for loss to reach a stable valley than it did with the classification version.

In [None]:
def do_training(model, X_train, y_train, optimizer, loss_fn, 
                epochs=5000, batch_size=5000, early_stop=6):
    "Perform a training regime that includes automatic decay of learning rate"
    loss_history = []
    print("+++ Beginning %d epochs with batch size %d" % (epochs, batch_size))
    for epoch in range(1, epochs+1):
        for start in range(0, len(X_train), batch_size):
            # Next batch of training rows
            X = X_train[start:start+batch_size]
            y = y_train[start:start+batch_size]

            # Forward pass: compute predicted Y by passing X to the model.
            y_pred = model(X)

            # Compute loss.
            loss = loss_fn(y_pred, y)
                
            # Before the backward pass, use the optimizer object to zero all of the
            # gradients for the variables it will update (which are the learnable
            # weights of the model). This is because by default, gradients are
            # accumulated in buffers (i.e., not overwritten) whenever .backward()
            # is called. Check out the docs of torch.autograd.backward for more details.
            optimizer.zero_grad()

            # Backward pass: compute gradient of the loss with respect to model
            # parameters
            loss.backward()

            # Calling the step function on an Optimizer makes an update to its
            # parameters
            optimizer.step()

        # Print some progress information each epoch
        print("Epoch %d; Loss: %0.6f (lr=%0.8f)" % (
               epoch, loss.item(), optimizer.param_groups[0]['lr']))
        loss_history.append(loss.item())
                    
        # Is this regime currently failing to reduce loss?
        ## Run for at least `early_stop` epochs
        if len(loss_history) < early_stop:
            continue
            
        ## Lower learning rate by 2x if no improvement in loss for multiple epochs
        diff = max(loss_history[-early_stop:]) - min(loss_history[-early_stop:])
        if  diff/loss_history[-1] < 0.005:
            optimizer.param_groups[0]['lr'] /= 2
            
        ## If learning rate is lowered to tiny value, we are not getting anywhere
        if optimizer.param_groups[0]['lr'] < 1e-8:           
            print("+++ Discontinuing training regime when loss becomes constant")
            break
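
The plateau test inside `do_training()` can be looked at in isolation.  The sketch below pulls the same comparison into a standalone helper; the name `plateaued`, its parameter names, and the two loss curves are hypothetical and made up for illustration.

```python
def plateaued(loss_history, window=6, tolerance=0.005):
    """Return True if the last `window` losses vary by less than `tolerance`
    relative to the most recent loss (the comparison used in do_training)."""
    if len(loss_history) < window:
        return False
    recent = loss_history[-window:]
    return (max(recent) - min(recent)) / loss_history[-1] < tolerance

# Hypothetical loss curves, made up for illustration
still_improving = [1.00, 0.90, 0.81, 0.73, 0.66, 0.60]
flat = [0.5010, 0.5005, 0.5008, 0.5004, 0.5006, 0.5003]

lr = 1e-4
for name, history in [("improving", still_improving), ("flat", flat)]:
    if plateaued(history):
        lr /= 2  # halve the learning rate, as the training loop does
    print(name, plateaued(history), lr)
```

PyTorch also provides `torch.optim.lr_scheduler.ReduceLROnPlateau`, which implements a related but not identical rule; the hand-written version keeps the decay logic visible.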

### Defining the model

The only thing different in this model versus that used in the classifier is that the final layer has a single output, and there is no activation function applied to it.  This gives us a regression instead.  Perhaps not the optimal one possible, but at least framed the right way.

In [None]:
# Create a sequential NN
model = torch.nn.Sequential(
    # This layer allows "polynomial features"
    torch.nn.Linear(in_dim, hidden1),
    # The activation is treated as a separate layer
    torch.nn.ReLU(),

    # This layer is "inference"
    torch.nn.Linear(hidden1, hidden2),
    torch.nn.ReLU(), 
    
    # A Dropout layer sometimes reduces co-adaptation of neurons
    torch.nn.Dropout(p=0.1),

    # This layer is "inference"
    torch.nn.Linear(hidden2, hidden3),
    # Often Leaky ReLU eliminates the "dead neuron" danger
    torch.nn.LeakyReLU(), 
    
    # Might try another "inference" layer
    torch.nn.Linear(hidden3, hidden4),
    torch.nn.LeakyReLU(), 

    # A basic linear layer with one output for the "continuous" target 
    torch.nn.Linear(hidden4, out_dim),  
    ).to(device)

### Split the data, summarize the model

In [None]:
# Free up the GPU
torch.cuda.empty_cache()
print("Just the model itself:")
print(f"{torch.cuda.memory_allocated():,} bytes allocated on GPU")

# Split the original data
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Convert arrays to tensors
X_train = torch.from_numpy(X_train.values).float().to(device)
X_test  = torch.from_numpy(X_test.values).float().to(device)
y_train = torch.from_numpy(y_train.values)[:, np.newaxis].float().to(device)
y_test  = torch.from_numpy(y_test.values)[:, np.newaxis].float().to(device)

print("Add the training and testing data to GPU:")
print(f"{torch.cuda.memory_allocated():,} bytes allocated on GPU")
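
The `[:, np.newaxis]` in the cell above turns the targets into a column so that they match the `(N, 1)` output of the model.  The sketch below, using small made-up tensors, shows what can go wrong otherwise: as far as I know, current PyTorch warns and then broadcasts mismatched shapes, so every prediction gets compared with every target.

```python
import warnings
import torch

loss_fn = torch.nn.SmoothL1Loss()

# Stand-in for model output: one prediction per row, shape (3, 1)
y_pred = torch.tensor([[1.0], [2.0], [3.0]])

# Targets as a flat vector, shape (3,), and as a column, shape (3, 1)
y_flat = torch.tensor([1.0, 2.0, 3.0])
y_col = y_flat[:, None]

# Matching shapes: element-wise comparison of each prediction with its target
matched = loss_fn(y_pred, y_col)

# Mismatched shapes: the inputs are broadcast to (3, 3), so the loss
# mixes in comparisons between unrelated rows
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    mismatched = loss_fn(y_pred, y_flat)

print(tuple(y_pred.shape), tuple(y_col.shape), matched.item())
print(tuple(y_pred.shape), tuple(y_flat.shape), mismatched.item())
```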

In [None]:
from torchsummary import summary
# torchsummary assumes CUDA by default; pass the device type actually in use
summary(model, input_size=(1, X_train.shape[1]), device=device.type)

### Train the model

This will run for quite a few epochs.  On a good GPU, each epoch completes very quickly though.

In [None]:
%%time
## Now run model (start with high learning rate and decay)

loss_fn = torch.nn.SmoothL1Loss()
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
do_training(model, X_train, y_train, optimizer, loss_fn)

### Visualizing the predictions

With our regression approach, we cannot determine any prediction probabilities.  In some ways, this is an advantage because we do not get discontinuities between a "best guess" and a "second guess" as we saw in the classification approach.  As a domain matter, the second guess should almost surely be "a little bit smaller" or "a little bit larger" than the best guess.

However, that simplification relies on our prior simplification of the target to ordinal values.  In the two-dimensional sizing of the original garments, it is harder to say precisely what "next largest" or "next smallest" means.  The visualization below simply makes a large number of predictions, and plots each point with the "ground truth" on the X-axis and the prediction on the Y-axis.  The predictions are continuous values, but those could easily be rounded to integers in the mapped range to make an actual garment size prediction.

Notice that some predictions are numeric values greater than 10 (the ordinal encoding of size `18`).  That would be fine with a rounding-to-ordinal rule, though.  If the predictions were perfect, all the blue circles would lie on top of the red line.  The actual model is much worse than that.  It seems to do better than the classifier in some ways, but to a lesser degree it also under-predicts the largest and smallest sizes, favoring middle sizes.  Moreover, the spread of predictions around the ground truth is not yet especially tight.
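
A rounding-to-ordinal rule of the kind mentioned above might look like the following minimal sketch.  The helper name `prediction_to_size` and the sample values are hypothetical; the `sizes` list is repeated from earlier in the notebook so the snippet runs on its own.

```python
sizes = ['00', '0', '2', '4', '6', '8', '10', '12', '14', '16', '18']

def prediction_to_size(value, labels=sizes):
    """Round a continuous prediction to the nearest ordinal and clamp it
    into the valid index range before looking up the label."""
    index = int(round(value))
    index = max(0, min(len(labels) - 1, index))
    return labels[index]

# Hypothetical raw model outputs, made up for illustration
for raw in [-0.4, 0.6, 5.2, 10.7]:
    print(raw, "->", prediction_to_size(raw))
```

Clamping handles predictions that fall below 0 or above 10.  Note that Python's `round()` sends exact halves to the nearest even integer, which rarely matters for continuous model outputs.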

Do you have ideas for approaches to improve these predictions?

In [None]:
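from numpy.random import randint
labels = list(size_to_num.keys())

# Switch to evaluation mode so the Dropout layer is disabled for predictions
model.eval()
ndxs = randint(0, len(X_test), 10_000)
# Gradients are not needed when only making predictions
with torch.no_grad():
    predictions = [p.item() for p in model(X_test[ndxs])]
truths = [t.item() for t in y_test[ndxs]]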

plt.figure(figsize=(6, 6))
plt.scatter(truths, predictions, marker='o', alpha=0.1)
plt.xticks(range(len(labels)), labels, fontsize=8, rotation='vertical')
plt.yticks(range(len(labels)), labels, fontsize=8)
plt.xlabel("True size")
plt.ylabel("Predicted size")
ref = np.linspace(0, 10, 100);
plt.plot(ref, ref, color="red");

## Next Lesson

**Tasks with Networks**: This lesson constructed a basic regression model, with moderately good success in its domain.  Next we will look at unsupervised learning and perform clustering with PyTorch neural networks.

<a href="NetworkExamples_3.ipynb"><img src="img/open-notebook.png" align="left"/></a>