This tutorial walks through some uses of the PyTorch implementation of NeurOps.



Let's start by importing the PyTorch implementation and some other useful packages.

In [1]:
from pytorch.src import *

import copy
import numpy as np
import torch  # used directly below, so import it explicitly instead of relying on the star import
from torchvision import datasets, transforms

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")



We define a LeNet-style model with three convolutional layers followed by two fully-connected layers. The `ModSequential` class wraps the `ModConv2d` and `ModLinear` layers, which allows us to mask, prune, and grow the model. The `track_activations` and `track_auxiliary_gradients` arguments enable tracking of activations and auxiliary gradients for later use. Passing the `input_shape` of the data lets the model compute the conversion factor: how many input neurons to add to the first linear layer when a new output channel is added to the final convolutional layer.

In [7]:
model = ModSequential(
        ModConv2d(in_channels=1, out_channels=8, kernel_size=7, masked=True, padding=1, learnable_mask=True),
        ModConv2d(in_channels=8, out_channels=16, kernel_size=7, masked=True, padding=1, prebatchnorm=True, learnable_mask=True),
        ModConv2d(in_channels=16, out_channels=16, kernel_size=5, masked=True, prebatchnorm=True, learnable_mask=True),
        ModLinear(64, 32, masked=True, learnable_mask=True),
        ModLinear(32, 10, masked=True),
        track_activations=True,
        track_auxiliary_gradients=True,
        input_shape = (1, 14, 14)
    ).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()

print("This model has {} effective parameters.".format(model.parameter_count(masked = True)))
print("The conversion factor of this model is {} after layer {}.".format(model.conversion_factor, model.conversion_layer))

This model has 15538 effective parameters.
The conversion factor of this model is 4 after layer 2.
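
The reported conversion factor can be checked by hand. The sketch below is plain Python, not NeurOps code. It assumes stride 1 and no dilation for every convolution (the defaults of `torch.nn.Conv2d`, which `ModConv2d` is assumed to follow), and the helper name `conv_out_size` is made up for illustration.

```python
def conv_out_size(size, kernel_size, padding=0, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1

# (kernel_size, padding) of the three convolutional layers defined above
conv_layers = [(7, 1), (7, 1), (5, 0)]

height = width = 14  # from input_shape = (1, 14, 14)
for kernel_size, padding in conv_layers:
    height = conv_out_size(height, kernel_size, padding)
    width = conv_out_size(width, kernel_size, padding)

conversion_factor = height * width       # flattened values per output channel
linear_inputs = 16 * conversion_factor   # 16 channels leave the last conv layer

print(height, width, conversion_factor, linear_inputs)  # 2 2 4 64
```

Each channel added to the last convolutional layer therefore adds 4 inputs to the first `ModLinear` layer, which is why that layer is declared with 64 inputs.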


Let's get a dataset and define standard training and testing functions.

In [10]:
dataset = datasets.MNIST('../data', train=True, download=True,
                     transform=transforms.Compose([ 
                            transforms.ToTensor(),
                            transforms.Normalize((0.1307,), (0.3081,)),
                            transforms.Resize((14,14))
                        ]))
train_set, val_set = torch.utils.data.random_split(dataset, lengths=[int(0.9*len(dataset)), int(0.1*len(dataset))])
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=128, shuffle=True)

test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('../data', train=False, transform=transforms.Compose([
                            transforms.ToTensor(),
                            transforms.Normalize((0.1307,), (0.3081,)),
                            transforms.Resize((14,14))
                        ])),
    batch_size=128, shuffle=True)

def train(model, train_loader, optimizer, criterion, epochs=10, val_loader=None, verbose=True):
    model.train()
    for epoch in range(epochs):
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            if batch_idx % 100 == 0 and verbose:
                print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                    epoch, batch_idx * len(data), len(train_loader.dataset),
                    100. * batch_idx / len(train_loader), loss.item()))
        if val_loader is not None:
            print("Validation: ", end = "")
            test(model, val_loader, criterion)

def test(model, test_loader, criterion):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += criterion(output, target).item() # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True) # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    
    print('Average loss: {:.4f}, Accuracy: {}/{} ({:.2f}%)'.format(test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))

We'll pretrain the model before changing its architecture.

In [9]:
train(model, train_loader, optimizer, criterion, epochs=5, val_loader=val_loader)

Validation: Average loss: 0.0059, Accuracy: 4720/6000 (78.67%)
Validation: Average loss: 0.0018, Accuracy: 5607/6000 (93.45%)
Validation: Average loss: 0.0013, Accuracy: 5702/6000 (95.03%)
Validation: Average loss: 0.0012, Accuracy: 5729/6000 (95.48%)
Validation: Average loss: 0.0009, Accuracy: 5779/6000 (96.32%)


Let's see how we can use NeurOps to optimize the model.

First, we use a heuristic from `metrics.py` to score the existing channels and neurons and decide which ones to prune. The simplest heuristic is the norm of the incoming weights of a neuron. We'll copy the model (so we keep access to the original), then score each neuron and prune the lowest-scoring 25% within each layer. After running the following block, try uncommenting different lines to see how different metrics affect the model.

In [5]:
modded_model = copy.deepcopy(model)
modded_optimizer = torch.optim.SGD(modded_model.parameters(), lr=0.01)
modded_optimizer.load_state_dict(optimizer.state_dict())

for i in range(len(model)-1):
    scores = weight_sum(model[i].weight)
    # scores = weight_sum(model[i].weight) +  weight_sum(model[i+1].weight, fanin=False, conversion_factor=model.conversion_factor if i == model.conversion_layer else -1)
    # scores = activation_variance(model.activations[str(i)])
    # scores = svd_score(model.activations[str(i)])
    # scores = nuclear_score(model.activations[str(i)], average=i<3)
    # Before trying this line, run the following block:
    # scores = fisher_info(mask_grads[i])
    print("Layer {} scores: mean {:.3g}, std {:.3g}, min {:.3g}, smallest 25%:".format(i, scores.mean(), scores.std(), scores.min()), end=" ")
    to_prune = np.argsort(scores.detach().cpu().numpy())[:int(0.25*len(scores))]  # .cpu() so this also works when the model is on a GPU
    print(to_prune)
    modded_model.prune(i, to_prune, optimizer=modded_optimizer, clear_activations=True)
print("The pruned model has {} effective parameters.".format(modded_model.parameter_count(masked = True)))
print("Validation after pruning: ", end = "")
test(modded_model, val_loader, criterion)
train(modded_model, train_loader, modded_optimizer, criterion, epochs=2, val_loader=val_loader)

Layer 0 scores: mean 3.96, std 0.184, min 3.74, smallest 25%: [7 0]
Layer 1 scores: mean 11, std 0.502, min 9.76, smallest 25%: [ 2 12  8  5]
Layer 2 scores: mean 12.1, std 1.23, min 10.1, smallest 25%: [ 9  1 12 13]
Layer 3 scores: mean 4.51, std 0.498, min 3.74, smallest 25%: [ 7  1  5 10  4 16 15 24]
The pruned model has 8914 effective parameters.
Validation after pruning: Average loss: 0.0063, Accuracy: 4348/6000 (72.47%)
Validation: Average loss: 0.0011, Accuracy: 5763/6000 (96.05%)
Validation: Average loss: 0.0008, Accuracy: 5789/6000 (96.48%)
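
To make the scoring step concrete, here is a small NumPy sketch of norm-based pruning selection. It is an illustration, not the NeurOps `weight_sum` implementation: the function `incoming_l1_norm` and the toy weight matrix are hypothetical, and the L1 norm is an assumed choice of norm.

```python
import numpy as np

def incoming_l1_norm(weight):
    """Score each output neuron by the L1 norm of its incoming weights.

    `weight` has shape (out_features, in_features, ...); any trailing
    kernel dimensions are flattened into the fan-in.
    """
    flat = weight.reshape(weight.shape[0], -1)
    return np.abs(flat).sum(axis=1)

# Toy layer with 4 output neurons and 3 inputs each.
weight = np.array([
    [0.5, -0.5, 1.0],   # score 2.0
    [0.1,  0.0, 0.1],   # score 0.2
    [2.0, -1.0, 0.5],   # score 3.5
    [0.3,  0.3, 0.3],   # score 0.9
])

scores = incoming_l1_norm(weight)
num_to_prune = int(0.25 * len(scores))         # lowest-scoring 25%
to_prune = np.argsort(scores)[:num_to_prune]   # indices, weakest first

print(to_prune)  # [1]
```

The same selection pattern applies to any of the other metrics: only the line that computes `scores` changes.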


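The following helper collects the gradients of each layer's mask vector over the training set. The commented-out `fisher_info` line in the previous block needs them.

In [6]: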
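def collect_mask_grads():
    # Collect the gradient of each layer's mask vector, one row per training batch.
    mask_grads = []
    for i in range(len(model.activations)-1):
        # Allocate on the model's device so torch.cat below does not mix devices.
        mask_grads.append(torch.empty(0, *model[i].mask_vector.shape, device=device))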
    for data, target in train_loader:
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        for i in range(len(model.activations)-1):
            mask_grads[i] = torch.cat([mask_grads[i], model[i].mask_vector.grad.unsqueeze(0)])
    return mask_grads

#mask_grads = collect_mask_grads()
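
The gradients collected above have one row per batch and one column per neuron. One common way to turn such gradients into a pruning score is an empirical Fisher information estimate, the mean squared gradient per neuron. The NumPy sketch below shows that estimate on made-up numbers. Whether `fisher_info` in `metrics.py` uses exactly this formula is an assumption, and `empirical_fisher` is a hypothetical name.

```python
import numpy as np

def empirical_fisher(mask_grads):
    """Mean squared mask gradient per neuron.

    `mask_grads` has shape (num_batches, num_neurons).
    """
    return (mask_grads ** 2).mean(axis=0)

# Made-up gradients for 3 batches and 4 neurons.
mask_grads = np.array([
    [ 0.1, -2.0, 0.0,  1.0],
    [-0.1,  2.0, 0.0, -1.0],
    [ 0.1, -2.0, 0.0,  1.0],
])

scores = empirical_fisher(mask_grads)
print(scores)              # neuron 1 scores highest, neuron 2 lowest
print(np.argsort(scores))  # pruning order, least important first: [2 0 3 1]
```

Squaring before averaging matters here: gradients of opposite sign across batches would cancel in a plain mean and make an influential neuron look unimportant.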


We can also try iterative pruning: prune a smaller fraction (15%) of each layer, retrain for two epochs, and repeat.

In [7]:
modded_model_iterative = copy.deepcopy(model)
modded_optimizer_iterative = torch.optim.SGD(modded_model_iterative.parameters(), lr=0.01)
modded_optimizer_iterative.load_state_dict(optimizer.state_dict())

for iter in range(5):
    for i in range(len(modded_model_iterative)-1):
        scores = weight_sum(modded_model_iterative[i].weight)
        # scores = weight_sum(modded_model_iterative[i].weight) +  weight_sum(modded_model_iterative[i+1].weight, fanin=False, conversion_factor=modded_model_iterative.conversion_factor if i == modded_model_iterative.conversion_layer else -1)
        # scores = activation_variance(modded_model_iterative.activations[str(i)])
        # scores = svd_score(modded_model_iterative.activations[str(i)])
        # scores = nuclear_score(modded_model_iterative.activations[str(i)], average=i<3)
        print("Layer {} scores: mean {:.3g}, std {:.3g}, min {:.3g}, smallest 15%:".format(i, scores.mean(), scores.std(), scores.min()), end=" ")
        to_prune = np.argsort(scores.detach().cpu().numpy())[:int(0.15*len(scores))]  # .cpu() so this also works when the model is on a GPU
        print(to_prune)
        modded_model_iterative.prune(i, to_prune, optimizer=modded_optimizer_iterative, clear_activations=True)
    print("The pruned model now has {} effective parameters.".format(modded_model_iterative.parameter_count(masked = True)))
    print("Validation after pruning: ", end = "")
    test(modded_model_iterative, val_loader, criterion)
    train(modded_model_iterative, train_loader, modded_optimizer_iterative, criterion, epochs=2, val_loader=val_loader)

Layer 0 scores: mean 3.96, std 0.184, min 3.74, smallest 15%: [7]
Layer 1 scores: mean 9.63, std 0.421, min 8.46, smallest 15%: [ 2 12]
Layer 2 scores: mean 10.6, std 1.07, min 8.81, smallest 15%: [ 9 13]
Layer 3 scores: mean 4, std 0.494, min 3.14, smallest 15%: [ 7  5 10  1]
The pruned model now has 12008 effective parameters.
Validation after pruning: Average loss: 0.0032, Accuracy: 5220/6000 (87.00%)
Validation: Average loss: 0.0009, Accuracy: 5792/6000 (96.53%)
Validation: Average loss: 0.0008, Accuracy: 5808/6000 (96.80%)
Layer 0 scores: mean 4.09, std 0.208, min 3.81, smallest 15%: [0]
Layer 1 scores: mean 8.75, std 0.275, min 8.29, smallest 15%: [13  4]
Layer 2 scores: mean 10.5, std 1.18, min 8.64, smallest 15%: [1 0]
Layer 3 scores: mean 3.84, std 0.476, min 2.95, smallest 15%: [ 3  2 11 20]
The pruned model now has 8914 effective parameters.
Validation after pruning: Average loss: 0.0019, Accuracy: 5593/6000 (93.22%)
Validation: Average loss: 0.0008, Accuracy: 5827/6000 (97.
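
Because `int(0.15*len(scores))` rounds down, the number of neurons removed per round shrinks as a layer gets narrower. The short sketch below (plain Python, independent of NeurOps; `pruning_schedule` is a hypothetical helper) traces the width of each prunable layer over the five rounds.

```python
def pruning_schedule(width, fraction=0.15, rounds=5):
    """Return the layer width before pruning and after each pruning round."""
    widths = [width]
    for _ in range(rounds):
        width -= int(fraction * width)  # same rounding as the loop above
        widths.append(width)
    return widths

# Initial output widths of layers 0 to 3 in the model defined earlier.
for layer, width in enumerate([8, 16, 16, 32]):
    print("Layer {}: {}".format(layer, pruning_schedule(width)))
```

The first two rounds agree with the printed output: layer 0 loses one neuron per round and layer 3 loses four. From the third round on, layer 0 stops shrinking because `int(0.15*6)` is 0.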

Of course, we can also grow the model. The following cell uses a neurogenesis strategy similar to NORTH-Random.

In [8]:
modded_model_grow = copy.deepcopy(model)
modded_optimizer_grow = torch.optim.SGD(modded_model_grow.parameters(), lr=0.01)
modded_optimizer_grow.load_state_dict(optimizer.state_dict())

for iter in range(5):
    for i in range(len(modded_model_grow)-1):
        #score = orthogonality_gap(modded_model_grow.activations[str(i)])
        max_rank = modded_model_grow[i].out_features if i > modded_model_grow.conversion_layer else modded_model_grow[i].out_channels
        score = effective_rank(modded_model_grow.activations[str(i)])
        to_add = max(score-int(0.95*max_rank), 0)
        print("Layer {} score: {}/{}, neurons to add: {}".format(i, score, max_rank, to_add))
        modded_model_grow.grow(i, to_add, fanin_weights="iterative_orthogonalization", 
                               optimizer=modded_optimizer_grow)
    print("The grown model now has {} effective parameters.".format(modded_model_grow.parameter_count(masked = True)))
    print("Validation after growing: ", end = "")
    test(modded_model_grow, val_loader, criterion)
    train(modded_model_grow, train_loader, modded_optimizer_grow, criterion, epochs=2, val_loader=val_loader)

Layer 0 score: 8/8, neurons to add: 1
Layer 1 score: 16/16, neurons to add: 1
Layer 2 score: 16/16, neurons to add: 1
Layer 3 score: 30/32, neurons to add: 0
The grown model now has 16405 effective parameters.
Validation after growing: Average loss: 0.0007, Accuracy: 5826/6000 (97.10%)
Validation: Average loss: 0.0008, Accuracy: 5811/6000 (96.85%)
Validation: Average loss: 0.0007, Accuracy: 5828/6000 (97.13%)
Layer 0 score: 9/9, neurons to add: 1
Layer 1 score: 17/17, neurons to add: 1
Layer 2 score: 17/17, neurons to add: 1
Layer 3 score: 30/32, neurons to add: 0
The grown model now has 18713 effective parameters.
Validation after growing: Average loss: 0.0007, Accuracy: 5828/6000 (97.13%)
Validation: Average loss: 0.0007, Accuracy: 5834/6000 (97.23%)
Validation: Average loss: 0.0006, Accuracy: 5833/6000 (97.22%)
Layer 0 score: 10/10, neurons to add: 1
Layer 1 score: 18/18, neurons to add: 1
Layer 2 score: 18/18, neurons to add: 1
Layer 3 score: 30/32, neurons to add: 0
The grown mode
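
The growth rule adds neurons only when a layer's activations are close to full rank. The NumPy sketch below illustrates the idea with a thresholded singular-value count. The actual definition used by `effective_rank` in NeurOps may differ, and `thresholded_rank`, the relative threshold of 0.01, and the random activations are assumptions for illustration.

```python
import numpy as np

def thresholded_rank(activations, threshold=0.01):
    """Count singular values above `threshold` times the largest one.

    `activations` has shape (num_samples, num_neurons).
    """
    singular_values = np.linalg.svd(activations, compute_uv=False)
    return int((singular_values > threshold * singular_values[0]).sum())

def neurons_to_add(score, max_rank, fraction=0.95):
    """Same rule as the loop above: grow by the excess over 95% of the width."""
    return max(score - int(fraction * max_rank), 0)

rng = np.random.default_rng(0)
full = rng.standard_normal((100, 8))   # 8 independent neurons
redundant = full.copy()
redundant[:, 7] = redundant[:, 0]      # neuron 7 duplicates neuron 0

print(thresholded_rank(full), thresholded_rank(redundant))  # 8 7
print(neurons_to_add(8, 8), neurons_to_add(16, 16), neurons_to_add(30, 32))  # 1 1 0
```

The last line reproduces the first round of printed output: layers 0 to 2 are at full rank and gain one neuron each, while layer 3, with a score of 30 out of 32, gains none.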

Let's try masking neurons for a simple grow-and-prune strategy, first doubling each layer's capacity.

In [11]:
modded_model_masked = copy.deepcopy(model)
modded_optimizer_masked = torch.optim.SGD(modded_model_masked.parameters(), lr=0.01)
modded_optimizer_masked.load_state_dict(optimizer.state_dict())

for i in range(len(modded_model_masked)-1):
    neurons = modded_model_masked[i].out_features if i > modded_model_masked.conversion_layer else modded_model_masked[i].out_channels
    modded_model_masked.grow(i, neurons, fanin_weights="kaiming", fanout_weights="kaiming", optimizer=modded_optimizer_masked)
    modded_model_masked.mask(i, list(range(neurons, 2*neurons)))

for iter in range(5):
    for i in range(len(modded_model_masked)-1):
        scores = weight_sum(modded_model_masked[i].get_weights())
        print("Layer {} scores: mean {:.3g}, std {:.3g}, min {:.3g}, smallest 25% to mask:".format(i, scores[scores != 0].mean(), scores[scores != 0].std(), scores[scores != 0].min()), end=" ")
        to_mask = np.argsort(scores.detach().cpu().numpy())[sum(scores == 0):sum(scores == 0)+int(0.25*sum(scores != 0))]
        print(to_mask, end=", ")
        modded_model_masked.mask(i, to_mask)
        to_unmask = np.argsort(scores.detach().cpu().numpy())[:sum(scores == 0)]
        to_unmask = np.random.choice(to_unmask, size=len(to_mask), replace=False)
        print("random neurons to unmask:", to_unmask)
        modded_model_masked.unmask(i, to_unmask, optimizer=modded_optimizer_masked)
    print("The masked model now has {} effective parameters.".format(modded_model_masked.parameter_count(masked = True)))
    print("Validation after growing: ", end = "")
    test(modded_model_masked, val_loader, criterion)
    train(modded_model_masked, train_loader, modded_optimizer_masked, criterion, epochs=2, val_loader=val_loader, verbose=False)
    


Layer 0 scores: mean 4.22, std 0.359, min 3.77, smallest 25% to mask: [3 6], random neurons to unmask: [ 8 10]

Layer 1 scores: mean 18, std 0.597, min 17, smallest 25% to mask: [8 6 9 7], random neurons to unmask: [21 29 24 30]

Layer 2 scores: mean 19.5, std 1.24, min 17.6, smallest 25% to mask: [ 0 12  8 13], random neurons to unmask: [27 31 24 25]

Layer 3 scores: mean 7.35, std 0.45, min 6.65, smallest 25% to mask: [14 10 31  9 13 23 27 28], random neurons to unmask: [62 46 37 42 39 58 54 44]
The masked model now has 30578 effective parameters.
Validation after growing: Average loss: 0.0049, Accuracy: 4825/6000 (80.42%)
Validation: Average loss: 0.0010, Accuracy: 5788/6000 (96.47%)
Validation: Average loss: 0.0008, Accuracy: 5822/6000 (97.03%)

Layer 0 scores: mean 4.28, std 0.495, min 3.52, smallest 25% to mask: [ 8 10], random neurons to unmask: [ 9 12]

Layer 1 scores: mean 17.4, std 2.11, min 13.8, smallest 25% to mask: [29 30 24 21], random neurons to unmask: [20  7 22 25]
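
The index arithmetic in the masking loop relies on masked neurons having a score of exactly 0, so that they sort to the front. The NumPy sketch below replays that selection on made-up scores. It does not call NeurOps, and the seed and values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up scores for 8 neurons; zeros mark neurons that are currently masked.
scores = np.array([4.1, 3.2, 5.0, 3.9, 0.0, 0.0, 0.0, 0.0])

order = np.argsort(scores)               # masked neurons (score 0) come first
num_masked = int((scores == 0).sum())
num_active = int((scores != 0).sum())
num_to_swap = int(0.25 * num_active)

# Weakest active neurons: skip the masked block, then take the next 25%.
to_mask = order[num_masked:num_masked + num_to_swap]

# Unmask the same number of neurons, drawn at random from the masked block.
to_unmask = rng.choice(order[:num_masked], size=len(to_mask), replace=False)

print("mask:", to_mask, "unmask:", to_unmask)
```

Since every newly masked neuron is matched by a newly unmasked one, the number of active neurons in the layer stays constant from round to round.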

