# Exploring gradients with Weights & Biases

First, install a couple of additional packages into the Colab runtime so that the training loop can run.

In [0]:
%%capture
!pip install wandb tqdm

Next, log in with your Weights & Biases API key, which you can find [here](https://app.wandb.ai/authorize), to enable logging.

In [0]:
!wandb login

## Defining our model and train/test loops

Here we'll set up the code we need to run a couple of different models to classify the [MNIST digits](http://yann.lecun.com/exdb/mnist/) dataset. We'll be borrowing a lot of the boilerplate code from the PyTorch MNIST example found [here](https://github.com/pytorch/examples/blob/master/mnist/main.py).


Our model has two major components that illustrate a couple of different advantages of tracking gradients while training a deep learning model. The first component is a fairly basic 2D CNN followed by fully connected layers, which does the heavy lifting of making the actual prediction. The second component draws 10 random values per example, passes them through a fully connected layer (`rand_fc`), and concatenates the result with the flattened output of the second 2D CNN layer. Because the inputs to `rand_fc` are pure noise, its parameters carry no real value for the prediction task at hand. Check out how the gradients flowing to these parameters (which appear as `gradients/rand_fc.weight` and `gradients/rand_fc.bias` in your W&B dashboard) compare to those of the other model parameters.
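
The `(4*4*50) + 10` input size of `fc1` in the cell below can be checked by hand. The following is a minimal sketch of that arithmetic, assuming 28x28 MNIST inputs and unpadded convolutions and pooling; the helper name `conv_out` is made up for this illustration and is not part of PyTorch.

```python
def conv_out(size, kernel, stride=1):
    """Output side length of an unpadded convolution or pooling window."""
    return (size - kernel) // stride + 1


side = 28                    # MNIST images are 28x28 (assumption stated above)
side = conv_out(side, 5, 1)  # conv1: 5x5 kernel, stride 1 -> 24
side = conv_out(side, 2, 2)  # max pool: 2x2 window, stride 2 -> 12
side = conv_out(side, 5, 1)  # conv2: 5x5 kernel, stride 1 -> 8
side = conv_out(side, 2, 2)  # max pool: 2x2 window, stride 2 -> 4

flat_features = side * side * 50  # conv2 produces 50 channels
fc1_inputs = flat_features + 10   # plus the 10 outputs of rand_fc

print(side, flat_features, fc1_inputs)  # 4 800 810
```

If you change a kernel size or add a layer, rerunning this arithmetic tells you which value `fc1` and the `x.view(...)` call need.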

In [0]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
import wandb
from tqdm import tqdm


class CNN_Net(nn.Module):
    def __init__(self, device):
        super(CNN_Net, self).__init__()
        
        self.device = device
        
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
        self.rand_fc = nn.Linear(10, 10)
        self.fc1 = nn.Linear((4*4*50) + 10, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        rand_x = torch.randn(x.shape[0], 10).to(self.device)
        
        rand_x = F.relu(self.rand_fc(rand_x))
      
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2, 2)
        x = x.view(-1, 4*4*50)
        
        x = torch.cat((x, rand_x), dim=1)
        
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)
      
    
def train(model, device, train_loader, optimizer, epoch):
    model.train()
    
    # len(train_loader) is the number of batches, not the number of examples
    n_batches = len(train_loader)
    
    for batch_idx, (data, target) in tqdm(enumerate(train_loader), total=n_batches):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()

        
def test(model, device, test_loader, WANDB):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
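            # Sum the per-example losses so they can be averaged over the whole test set below
            test_loss += F.nll_loss(output, target, reduction='sum').item()
            # The index of the largest log-probability is the predicted class
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()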

    test_loss /= len(test_loader.dataset)

    tqdm.write('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))
    
    if WANDB:
        wandb.log({'test_loss': test_loss,
                   'accuracy': correct / len(test_loader.dataset)})
        
        
def main(config):
    
    if config['WANDB']:
        wandb.init(project='explore-gradients', reinit=True, config=config)
  
    use_cuda = torch.cuda.is_available()

    torch.manual_seed(config['SEED'])

    device = torch.device("cuda" if use_cuda else "cpu")

    kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
    train_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=config['BATCH_SIZE'], shuffle=True, **kwargs)
    test_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=False, transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=config['TEST_BATCH_SIZE'], shuffle=True, **kwargs)

    model = CNN_Net(device).to(device)
    
    
    if config['WANDB']:
        # log='all' records histograms of both gradients and parameter values
        wandb.watch(model, log='all')
    
    optimizer = optim.SGD(model.parameters(),
                          lr=config['LR'],
                          momentum=config['MOMENTUM'])

    for epoch in range(1, config['EPOCHS'] + 1):
        train(model, device, train_loader, optimizer, epoch)
        test(model, device, test_loader, config['WANDB'])
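
If you want a quick look at gradient magnitudes without opening the dashboard, the sketch below computes one L2 norm per parameter after a single backward pass. It uses a small stand-in model, `TinyNet`, which is hypothetical and only mimics the two-branch idea of `CNN_Net`; the helper `grad_norms` is likewise made up for this example. The parameter names come from `named_parameters()`, which should correspond to the names shown after the `gradients/` prefix in W&B.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def grad_norms(model):
    """Map each parameter name to the L2 norm of its current gradient."""
    return {name: p.grad.norm().item()
            for name, p in model.named_parameters()
            if p.grad is not None}


class TinyNet(nn.Module):
    """Hypothetical stand-in: real features concatenated with a noise branch."""

    def __init__(self):
        super().__init__()
        self.rand_fc = nn.Linear(10, 10)
        self.fc = nn.Linear(20 + 10, 10)

    def forward(self, x):
        # Noise branch, analogous to rand_fc in CNN_Net
        rand_x = F.relu(self.rand_fc(torch.randn(x.shape[0], 10)))
        x = torch.cat((x, rand_x), dim=1)
        return F.log_softmax(self.fc(x), dim=1)


torch.manual_seed(0)
model = TinyNet()
data = torch.randn(8, 20)             # fake batch of 8 examples
target = torch.randint(0, 10, (8,))   # fake class labels

loss = F.nll_loss(model(data), target)
loss.backward()

norms = grad_norms(model)
for name, value in norms.items():
    print(f"{name}: {value:.4f}")
```

The same helper can be called on `CNN_Net` right after `loss.backward()` in `train` to compare `rand_fc.weight` and `rand_fc.bias` against the other layers.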

## Training the model

Here you can edit the configuration dictionary to see how changing hyperparameters such as the learning rate or momentum affects the gradients. If you want to turn off W&B experiment tracking, set `WANDB` to `False`.

In [0]:
config = {
    'BATCH_SIZE'         : 64,
    'TEST_BATCH_SIZE'    : 1000,
    'EPOCHS'             : 30,
    'LR'                 : 0.01,
    'MOMENTUM'           : 0,
    'SEED'               : 17,
    'WANDB'              : True,
}

main(config=config)
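
To compare several settings side by side, one option is to generate variants of the configuration dictionary and pass each one to `main`. The sketch below only builds the variants; `config_grid` is a hypothetical helper, the learning rates and momenta are example values, and the call to `main` is left commented out so the block runs on its own.

```python
from itertools import product


def config_grid(base, **options):
    """Yield copies of `base` with every combination of the given option values."""
    keys = list(options)
    for values in product(*(options[k] for k in keys)):
        cfg = dict(base)               # copy so the base config is left untouched
        cfg.update(zip(keys, values))
        yield cfg


base_config = {
    'BATCH_SIZE': 64,
    'TEST_BATCH_SIZE': 1000,
    'EPOCHS': 30,
    'LR': 0.01,
    'MOMENTUM': 0,
    'SEED': 17,
    'WANDB': True,
}

# Example values only: two learning rates crossed with two momenta
variants = list(config_grid(base_config, LR=[0.1, 0.01], MOMENTUM=[0, 0.9]))

for cfg in variants:
    print(cfg['LR'], cfg['MOMENTUM'])
    # main(config=cfg)  # uncomment in the notebook to train one run per variant
```

Since `main` calls `wandb.init` with `reinit=True`, each variant should show up as its own run in the `explore-gradients` project, which makes the gradient histograms easy to compare.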