<p align="center">
    <img src="https://github.com/FRI-Energy-Analytics/energyanalytics/blob/main/EA_logo.jpg?raw=true" width="240" height="240" />
</p>

# Backpropagation
## Freshman Research Initiative Energy Analytics CS 309


So far we have only run forward passes through our networks. To train a network, we need to update its weight and bias tensors with backpropagation. We show the network many different examples, measure how far its predictions are from the correct answers, and adjust the weights and biases to reduce that error. This iterative procedure is called gradient descent. To use it, we need to define a loss (or cost) function. For example, we can use the mean squared error for binary classification:

$$
\large \ell = \frac{1}{2n}\sum_{i=1}^{n}{\left(y_i - \hat{y}_i\right)^2}
$$

where $n$ is the number of training examples, $y_i$ are the true labels, and $\hat{y}_i$ are the predicted labels. When we minimize this loss, we optimize the weights and bias tensors so that the network can make correct predictions.
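
To make the formula concrete, here is a minimal sketch that evaluates it on four made-up labels and predictions (the numbers are hypothetical, not taken from our well data). It also compares the result with `nn.MSELoss`, which averages over $n$ without the factor of $\frac{1}{2}$, so we halve it to match.

```python
import torch
from torch import nn

# Hypothetical true labels and predictions, invented for illustration
y = torch.tensor([1.0, 0.0, 1.0, 1.0])
y_hat = torch.tensor([0.9, 0.2, 0.6, 0.4])

# Loss exactly as written above: sum of squared errors divided by 2n
n = y.numel()
manual_loss = ((y - y_hat) ** 2).sum() / (2 * n)

# nn.MSELoss divides by n only, so multiply by 0.5 to match the formula
torch_loss = 0.5 * nn.MSELoss()(y_hat, y)

print(manual_loss.item(), torch_loss.item())
```

Both values should agree up to floating point rounding.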

To train the weights with gradient descent, we propagate the gradient of the loss backwards through the network. Each operation has some gradient between the inputs and outputs. As we send the gradients backwards, we multiply the incoming gradient with the gradient for the operation. Mathematically, this is really just calculating the gradient of the loss with respect to the weights using the chain rule.

$$
\large \frac{\partial \ell}{\partial W_1} = \frac{\partial L_1}{\partial W_1} \frac{\partial S}{\partial L_1} \frac{\partial L_2}{\partial S} \frac{\partial \ell}{\partial L_2}
$$
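
As a rough check of the chain rule, the sketch below multiplies the four factors by hand and compares the product with what PyTorch computes. It assumes that $L_1$ and $L_2$ are linear operations and that $S$ is a sigmoid activation, which is only our reading of the symbols above. The scalar values for the input, label, and weights are made up.

```python
import torch

# Hypothetical scalar input, label, and weights
x = torch.tensor(0.5)
y = torch.tensor(1.0)
w1 = torch.tensor(0.8, requires_grad=True)
w2 = torch.tensor(-1.2, requires_grad=True)

# Forward pass (assumed structure: linear -> sigmoid -> linear -> squared error)
L1 = w1 * x
S = torch.sigmoid(L1)
L2 = w2 * S
loss = 0.5 * (y - L2) ** 2

# Let PyTorch compute d(loss)/d(w1)
loss.backward()

# The same gradient by hand, one factor per operation
with torch.no_grad():
    dloss_dL2 = -(y - L2)       # derivative of 0.5 * (y - L2)^2
    dL2_dS = w2                 # derivative of w2 * S with respect to S
    dS_dL1 = S * (1 - S)        # derivative of the sigmoid
    dL1_dw1 = x                 # derivative of w1 * x with respect to w1
    manual_grad = dL1_dw1 * dS_dL1 * dL2_dS * dloss_dL2

print(w1.grad.item(), manual_grad.item())
```

The two printed numbers should match, since both are the same product of local gradients.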

We update our weights using this gradient with some learning rate $\alpha$. 

$$
\large W^\prime_1 = W_1 - \alpha \frac{\partial \ell}{\partial W_1}
$$

The learning rate $\alpha$ controls the size of each weight update. It should be small enough that the iterative method settles into a minimum instead of overshooting it, but not so small that training takes impractically long.
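
To see why the step size matters, here is a small sketch in plain Python. It applies the update rule to a toy loss $(w - 3)^2$, which is a hypothetical example and not our network's loss. The helper `gradient_descent` is made up for this illustration.

```python
def gradient_descent(alpha, steps=50, w=0.0):
    """Minimize the toy loss (w - 3)^2 with a fixed learning rate alpha."""
    for _ in range(steps):
        grad = 2 * (w - 3)       # derivative of the toy loss with respect to w
        w = w - alpha * grad     # update rule: W' = W - alpha * gradient
    return w

small = gradient_descent(alpha=0.1)   # each step shrinks the distance to w = 3
large = gradient_descent(alpha=1.1)   # each step overshoots further, so it diverges

print(small, large)
```

With the small learning rate the result ends up very close to 3, while the large one moves further away on every step.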


## Losses

In the `nn` module there are a bunch of different losses we can use. Today we are going to focus on cross-entropy (`nn.CrossEntropyLoss`). You'll usually see the loss assigned to `criterion`. With a classification problem we're using softmax to predict class probabilities. With a softmax output you want to use cross-entropy as the loss. To calculate the loss, you define the criterion then pass in the output of your network and the correct labels.

However, `nn.CrossEntropyLoss` combines `nn.LogSoftmax()` and `nn.NLLLoss()` in one single class in PyTorch. This means that we have to pass the raw output of our network into the loss function, and not the output of a softmax function. This raw output is often called scores, or *logits*.
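
A quick way to convince yourself of this is to compute the loss both ways on some made-up logits. The sketch below uses random scores for 4 samples and 2 classes, and hypothetical target labels.

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(4, 2)              # hypothetical raw scores: 4 samples, 2 classes
targets = torch.tensor([0, 1, 1, 0])    # hypothetical class indices (int64)

# Cross-entropy applied directly to the logits
ce = nn.CrossEntropyLoss()(logits, targets)

# The same thing in two steps: log-softmax, then negative log likelihood
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, targets)

print(ce.item(), nll.item())
```

The two losses should be equal, which is why we do not add a softmax layer to the model when using `nn.CrossEntropyLoss`.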

Before building our network, let's go through our normal data processing that we have seen for several weeks now.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import torch
from torch import nn
from sklearn import preprocessing #for label encoding
from sklearn.model_selection import train_test_split
%matplotlib inline

In [2]:
data = pd.read_csv(r'well_data.csv') #read it in
le = preprocessing.LabelEncoder()
top_names = data.TOP
le.fit(data.TOP)
tops = le.transform(data.TOP)
data.drop('TOP', axis=1, inplace=True)

In [3]:
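# Wrap the features and the encoded labels in a dataset of float tensors
train = torch.utils.data.TensorDataset(torch.Tensor(np.array(data)), torch.Tensor(np.array(tops)))
# Serve the data in shuffled mini-batches of 64 samples
train_loader = torch.utils.data.DataLoader(train, batch_size=64, shuffle=True)

# Grab a single batch to experiment with
features, labels = next(iter(train_loader))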

Now we want to define our model. Let's use `nn.Sequential` today.

In [4]:
# Hyperparameters for our network
input_size = 27
hidden_sizes = [54, 10]
output_size = 2

# Build a feed-forward network
model = nn.Sequential(nn.Linear(input_size, hidden_sizes[0]),
                      nn.ReLU(),
                      nn.Linear(hidden_sizes[0], hidden_sizes[1]),
                      nn.ReLU(),
                      nn.Linear(hidden_sizes[1], output_size),
                     )
print(model)

# Define our loss function
criterion = nn.CrossEntropyLoss()

# Do a forward pass (despite the name, `logps` holds raw logits, not log-probabilities)
logps = model(features)

# Now we need to calculate our loss (how close are the predictions)
loss = criterion(logps, labels.type(torch.LongTensor))

print(loss)

Sequential(
  (0): Linear(in_features=27, out_features=54, bias=True)
  (1): ReLU()
  (2): Linear(in_features=54, out_features=10, bias=True)
  (3): ReLU()
  (4): Linear(in_features=10, out_features=2, bias=True)
)
tensor(6.8755, grad_fn=<NllLossBackward>)


## Backpropagation
Now that we have made our forward pass through the network and calculated the loss, we need to perform backpropagation and update our weight and bias tensors. In PyTorch this is done with the `autograd` module, which automatically calculates the gradients of tensors. It does this by keeping track of the operations performed on tensors, then walking backwards through those operations and calculating the gradients along the way. Tracking is turned on for a tensor by passing `requires_grad=True` at creation, or by calling `.requires_grad_(True)` on an existing tensor. The parameters of `nn` layers have it turned on by default.
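
Here is a minimal sketch of `autograd` on a small tensor that is separate from our model. The tensors `x` and `z` are made up for illustration.

```python
import torch

# Track operations on x by setting requires_grad at creation
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# y = x1^2 + x2^2 + x3^2, so dy/dx = 2x
y = (x ** 2).sum()
print(y.grad_fn)     # the operation that created y, recorded by autograd

# Backpropagate from y to fill in x.grad
y.backward()
print(x.grad)        # tensor([2., 4., 6.])

# Tracking can also be switched on after creation
z = torch.ones(3)
print(z.requires_grad)   # False
z.requires_grad_(True)
print(z.requires_grad)   # True
```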

To calculate the gradients, you just need to call the `.backward()` method on the loss tensor. First let's check the gradient on our model weights after the forward pass.

In [5]:
print(model[0].weight.grad)

None


Now let's call the `.backward()` method on our loss and see what the gradients are.

In [6]:
loss.backward()
print(model[0].weight.grad)

tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00],
        [-6.4139e+00, -1.6867e-02, -1.6701e-02,  ...,  6.2831e-03,
         -5.9042e-04, -4.6634e-03],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00],
        ...,
        [-1.3922e+01, -3.6611e-02, -3.6251e-02,  ...,  1.3638e-02,
         -1.2815e-03, -1.0122e-02],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00],
        [ 4.2147e+01,  1.1084e-01,  1.0975e-01,  ..., -4.1288e-02,
          3.8798e-03,  3.0644e-02]])


We have our backpropagation and gradients in place, but we still need one last piece to update the weights with the gradients. This piece is called an optimizer. In PyTorch it is in the `optim` module. For this exercise, let's use stochastic gradient descent (`optim.SGD`), which takes the model parameters and a learning rate.

In [7]:
from torch import optim
optimizer = optim.SGD(model.parameters(), lr=0.01)
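
With its default settings (no momentum or weight decay), `optim.SGD` applies the update rule $W^\prime = W - \alpha \frac{\partial \ell}{\partial W}$ from earlier. The sketch below checks this on a tiny stand-in model. The names `layer` and `opt`, the random data, and the use of `nn.MSELoss` are assumptions made for illustration and are separate from our real model.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)
layer = nn.Linear(3, 1)                        # hypothetical tiny model
opt = optim.SGD(layer.parameters(), lr=0.01)   # plain SGD, no momentum

# Made-up batch of 5 samples
x = torch.randn(5, 3)
target = torch.randn(5, 1)

# Forward pass, loss, and backward pass
loss = nn.MSELoss()(layer(x), target)
loss.backward()

# What the update rule predicts for the weights, computed by hand
expected = layer.weight.detach() - 0.01 * layer.weight.grad

# Let the optimizer update the weights in place
opt.step()

print(torch.allclose(layer.weight.detach(), expected))
```

This should print `True`, because the optimizer step and the hand-computed update are the same calculation.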

Putting this all together now, to train our network in PyTorch we do the following:

1. Complete a forward pass through the network to get our logits
2. Use the logits with the actual values to calculate the loss
3. Complete a backward pass through the network to calculate the gradients
4. Take a step with the optimizer to update the weights

Let's take a single step through training our model.

In [8]:
# Reset the gradient of the first layer's weights to None
model[0].weight.grad = None

print(model[0].weight.grad)

None


In [9]:
# Clear the gradients that have accumulated
optimizer.zero_grad()

# 1. Forward pass
logps = model(features)

# 2. Calculate the loss
loss = criterion(logps, labels.type(torch.LongTensor))

# 3. Backward pass
loss.backward()

# 4. Update weights
optimizer.step()

print(model[0].weight.grad)

tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00],
        [-6.4139e+00, -1.6867e-02, -1.6701e-02,  ...,  6.2831e-03,
         -5.9042e-04, -4.6634e-03],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00],
        ...,
        [-1.3922e+01, -3.6611e-02, -3.6251e-02,  ...,  1.3638e-02,
         -1.2815e-03, -1.0122e-02],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00],
        [ 4.2147e+01,  1.1084e-01,  1.0975e-01,  ..., -4.1288e-02,
          3.8798e-03,  3.0644e-02]])
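
The call to `optimizer.zero_grad()` matters because PyTorch adds new gradients to whatever is already stored in `.grad`. A minimal sketch with a single made-up parameter `w` shows the effect:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

# First backward pass: d(w^2)/dw = 2w = 4
(w ** 2).backward()
first = w.grad.clone()
print(first)     # tensor(4.)

# Second backward pass without clearing: the new gradient is added on top
(w ** 2).backward()
second = w.grad.clone()
print(second)    # tensor(8.)

# Clear the stored gradient, then backpropagate again
w.grad.zero_()
(w ** 2).backward()
third = w.grad.clone()
print(third)     # tensor(4.)
```

Without clearing, every training step would use the sum of all gradients computed so far instead of the gradient for the current batch.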


This is one full pass through the network, forwards and backwards: we calculate the gradients and take a single optimizer step. Now let's put it all together and train on batches of 64 samples for 10 epochs.

In [10]:
epochs = 10
for e in range(epochs):
    running_loss = 0
    for features, labels in train_loader:
        # clear accumulated gradients
        optimizer.zero_grad()
        # Forward
        logps = model(features)
        # loss
        loss = criterion(logps, labels.type(torch.LongTensor))
        # backwards pass
        loss.backward()
        # update weights
        optimizer.step()
        
        # accumulate the batch loss for reporting
        running_loss += loss.item()
    # report the mean loss per batch for this epoch
    print(f"Training loss: {running_loss/len(train_loader)}")

Training loss: 16.910012881076614
Training loss: 0.39766802066980406
Training loss: 0.39912986200909284
Training loss: 0.39758059295804
Training loss: 0.3971757744980413
Training loss: 0.39655751509721887
Training loss: 0.39765095502831216
Training loss: 0.3970917201665945
Training loss: 0.3979684360498606
Training loss: 0.3971339514435724


Your model is now trained. Let's see how we can make predictions with it on our dataset.