I want to study the relationship between the rank of the weight matrices of each layer and the problem at hand. I have a hunch that the convergent rank of the entire network(to be defined) will match the computational complexity of the task defined and implied by the loss function and the data. What I would love to have is equivalence classes of networks whose convergent ranks under SGD with weight decay are equal. I would also like to know whether or not shrinking or expanding the layers based on the convergent rank of a previous iteration of the network will yeild similar accuracy.

Problem 1) Train N networks with randomly chosen networks with sufficient capacity for the problem. Train them under SGD with weight decay until the loss saturates. Compute the rank of the network(to be defined in a way that is comparable accross architechtures) and compare, look for trends. A quick expectation is that there will be cases of overfitting where the rank is higher than networks that perform better with lower rank. Essentially indicating that there is enough computational power to memorize the dataset.

Problem 2) Develop an iterative algorithm that trains a large network, then prunes it according the per-layer rank. Retrain and reprune. Study the behaviour of the final accuracy and its relationship to trianing time. Another variation could be reducing the weight decay each iteration.

In [1]:
#@title Import Dependencies

import torch
import torch.nn as nn
import torchvision.datasets as dsets
import torchvision.transforms as transforms
from torch.autograd import Variable

In [2]:
#@title Define Hyperparameters

input_size = 784 # img_size = (28,28) ---> 28*28=784 in total
hidden_size_base = 784 # number of nodes at hidden layer
num_classes = 10 # number of output classes discrete range [0,9]
num_epochs = 100 # number of times which the entire dataset is passed throughout the model
small_batch_size = 64 # the size of input data took for one iteration
large_batch_size = 128 # the size of input data took for one iteration
lr = 1e-3 # size of step
model_layer_configs = [2,3,4,5]

In [3]:
#@title Downloading MNIST data

train_data = dsets.MNIST(root = './data', train = True,
                        transform = transforms.ToTensor(), download = True)

test_data = dsets.MNIST(root = './data', train = False,
                       transform = transforms.ToTensor())

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ./data/MNIST/raw/train-images-idx3-ubyte.gz


100%|██████████| 9912422/9912422 [00:00<00:00, 16308709.37it/s]


Extracting ./data/MNIST/raw/train-images-idx3-ubyte.gz to ./data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ./data/MNIST/raw/train-labels-idx1-ubyte.gz


100%|██████████| 28881/28881 [00:00<00:00, 75568118.42it/s]


Extracting ./data/MNIST/raw/train-labels-idx1-ubyte.gz to ./data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ./data/MNIST/raw/t10k-images-idx3-ubyte.gz


100%|██████████| 1648877/1648877 [00:00<00:00, 24760097.51it/s]


Extracting ./data/MNIST/raw/t10k-images-idx3-ubyte.gz to ./data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ./data/MNIST/raw/t10k-labels-idx1-ubyte.gz


100%|██████████| 4542/4542 [00:00<00:00, 14929881.48it/s]

Extracting ./data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ./data/MNIST/raw






In [5]:
#@title Loading the data

train_gen_small_batch = torch.utils.data.DataLoader(dataset = train_data,
                                             batch_size = small_batch_size,
                                             shuffle = True,
                                             num_workers=0)

test_gen_small_batch = torch.utils.data.DataLoader(dataset = test_data,
                                      batch_size = small_batch_size,
                                      shuffle = False,
                                      num_workers=0)

train_gen_large_batch = torch.utils.data.DataLoader(dataset = train_data,
                                             batch_size = large_batch_size,
                                             shuffle = True,
                                             num_workers=0)

test_gen_large_batch = torch.utils.data.DataLoader(dataset = test_data,
                                      batch_size = large_batch_size,
                                      shuffle = False,
                                      num_workers=0)
# train_data.data.to("cuda:0")
# train_data.targets.to("cuda:0")
# test_data.data.to("cuda:0")
# test_data.targets.to("cuda:0")

In [6]:
#@title Define model class

class Net(nn.Module):
  def __init__(self, input_size, hidden_size, n_layers, num_classes, with_skip):
    super(Net,self).__init__()
    self.fc_layers = nn.ModuleList()
    self.fc_layers.append(nn.Linear(input_size, hidden_size[0]))
    for i in range(n_layers):
      self.fc_layers.append(nn.Linear(hidden_size[i], hidden_size[i+1]))
    self.fc_layers.append(nn.Linear(hidden_size[-1], num_classes))
    self.relu = nn.ReLU()
    self.n_layers = n_layers
    self.with_skip = with_skip


  def forward(self,x):
    in_val = x
    for i,layer in enumerate(self.fc_layers):
      out_val = layer(in_val)
      if i!=self.n_layers+1:
        out = self.relu(out_val)
        in_val=out


    return out

In [7]:
from itertools import chain
loss_function = nn.CrossEntropyLoss().cuda()
optimizers = {}
lr_scheds = {}
optimized_ranks = {}
history = {}


In [8]:
for n_layers in model_layer_configs:
  optimized_ranks[str(n_layers)] = [hidden_size_base for _ in range(int(n_layers)+1)]


In [9]:
#@title Training the model
import matplotlib.pyplot as plt
import time
from collections import defaultdict
import numpy as np

# meta training loop
for meta_i in range(15):
  #@title Build the model
  nets = {str(n_layers): Net(input_size, optimized_ranks[str(n_layers)], n_layers , num_classes, with_skip=True) for n_layers in model_layer_configs}
  for n_layers, net in nets.items():
    print(f"n_layers:{n_layers}, n_neurons: {sum(optimized_ranks[n_layers])+10+input_size}")
    # net.cuda()

  #@title Define loss-function & optimizer

  for n_layers, net in nets.items():
    decay_params = chain(*[net.fc_layers[i].parameters() for i in range(net.n_layers+1)])
    optimizers[n_layers] = torch.optim.Adam([{'params':decay_params, "weight_decay":0}, {'params':net.fc_layers[-1].parameters()}], lr=lr )

    lr_scheds[n_layers] = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizers[n_layers], threshold=1e-2, verbose=True)

    history[str(n_layers)] = []
  batch_time_sum = 0
  batch_count = 0
  metric_time_sum = 0
  metric_count = 0
  prev_acc = defaultdict(lambda:0)
  for epoch in range(num_epochs):
    batch_time = time.time()

    for i ,(images,labels) in enumerate(train_gen_small_batch):
      images = Variable(images.view(-1,28*28))#.cuda()
      labels = Variable(labels)#.cuda()

      loss = {}
      for n_layers, opt, net in zip(optimizers.keys(), optimizers.values(), nets.values()):
        opt.zero_grad()
        outputs = net(images)
        loss[str(n_layers)] = loss_function(outputs, labels)
        loss[str(n_layers)].backward(retain_graph=True)
        opt.step()




      if (i+1) % 500==0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{len(train_data)//small_batch_size}], Loss: {loss}')
        print(f"Time statistics: Ave. batch time: {batch_time_sum/(batch_count or 1)}")

    for i ,(images,labels) in enumerate(train_gen_large_batch):
      images = Variable(images.view(-1,28*28)).cuda()
      labels = Variable(labels).cuda()

      loss = {}
      for n_layers, opt, net in zip(optimizers.keys(), optimizers.values(), nets.values()):
        opt.zero_grad()
        outputs = net(images)
        loss[str(n_layers)] = loss_function(outputs, labels)
        loss[str(n_layers)].backward(retain_graph=True)
        opt.step()




      if (i+1) % 500==0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{len(train_data)//large_batch_size}], Loss: {loss}')
        print(f"Time statistics: Ave. batch time: {batch_time_sum/(batch_count or 1)}")
    for n_layers, lr_sched in lr_scheds.items():
      lr_sched.step(loss[str(n_layers)])
    batch_time_sum += (time.time()-batch_time)
    batch_count += 1

    if (epoch+1) % 3 == 0:
      metric_time = time.time()
      rank_fc={}
      rank_fc_grad={}
      norm_fc={}
      for n_layers, net in nets.items():
        rank_fc[str(n_layers)] = []
        rank_fc_grad[str(n_layers)] = []
        norm_fc[str(n_layers)] = []
        for layer in net.fc_layers:
          rank_fc[str(n_layers)].append(np.linalg.matrix_rank(layer.weight.detach().cpu().numpy()))
          if layer.weight.grad is not None:
            rank_fc_grad[str(n_layers)].append(np.linalg.matrix_rank(layer.weight.grad.detach().cpu().numpy()))
          norm_fc[str(n_layers)].append(np.linalg.norm(layer.weight.detach().cpu().numpy(), ord=2))

      correct = {}
      total = {}

      for n_layers, net in nets.items():
        correct[str(n_layers)] = 0
        total[str(n_layers)] = 0


        for images,labels in test_gen_large_batch:
          images = Variable(images.view(-1,28*28)).cuda()
          labels = labels.cuda()

          output = net(images)

          _, predicted = torch.max(output,1)

          correct[str(n_layers)] += (predicted == labels).sum()
          total[str(n_layers)] += labels.size(0)

      metric_time_sum += time.time()-metric_time
      metric_count += 1
      evolve = True
      for n_layers in nets.keys():
        history[str(n_layers)].append({"loss":loss[str(n_layers)].item(), "acc":correct[str(n_layers)]/total[str(n_layers)], "rank_fc":rank_fc[str(n_layers)],"rank_fc_grad":rank_fc_grad[str(n_layers)], "norm_fc":norm_fc[str(n_layers)]})

        print(f"{n_layers}")
        print(f"Test Accuracy:{correct[str(n_layers)]/total[str(n_layers)]}")
        print(f"Ranks:{rank_fc[str(n_layers)]}")
        print(f"Grad Ranks:{rank_fc_grad[str(n_layers)]}")
        print(f"Layer sizes: {optimized_ranks}")
        print(f"Trainable parameters:{sum(p.numel() for p in nets[n_layers].parameters() if p.requires_grad)}")
        print(f"{sum(rank_fc[n_layers])/(sum(optimized_ranks[n_layers])+10)}")
        print(f"Time statistics: Ave. batch time: {batch_time_sum/batch_count}, ave. metric time: {metric_time_sum/metric_count}")

        if correct[str(n_layers)]/total[str(n_layers)] < 0.9 or (correct[str(n_layers)]/total[str(n_layers)]-prev_acc[str(n_layers)])>0.01:
          evolve = False
        prev_acc[str(n_layers)] = correct[str(n_layers)]/total[str(n_layers)]


      for n_layers, layer_ranks in rank_fc.items():
        optimized_ranks[n_layers] = layer_ranks[:-1]
      print(optimized_ranks)
      if evolve:
        print("Evolving!")
        break


n_layers:2, n_neurons: 3146
n_layers:3, n_neurons: 3930
n_layers:4, n_neurons: 4714
n_layers:5, n_neurons: 5498
Epoch [1/100], Step [500/937], Loss: {'2': tensor(1.7919, grad_fn=<NllLossBackward0>), '3': tensor(0.6880, grad_fn=<NllLossBackward0>), '4': tensor(3.2156, grad_fn=<NllLossBackward0>), '5': tensor(0.2288, grad_fn=<NllLossBackward0>)}
Time statistics: Ave. batch time: 0.0


RuntimeError: ignored

In [None]:
#@title Evaluating the accuracy of the model

correct = {}
total = {}

for n_layers, net in nets.items():
  correct[str(n_layers)] = 0
  total[str(n_layers)] = 0


  for images,labels in test_gen:
    images = Variable(images.view(-1,28*28)).cuda()
    labels = labels.cuda()

    output = net(images)

    _, predicted = torch.max(output,1)

    correct[str(n_layers)] += (predicted == labels).sum()
    total[str(n_layers)] += labels.size(0)

  print(f'Accuracy of the model{n_layers}: {(100*correct[str(n_layers)])/(total[str(n_layers)]+1)}')
  print(f"{[sum(x['rank_fc']) for x in history[str(n_layers)]]}")
  print(f"{sum(optimized_ranks[str(net.n_layers)])+10+input_size}")
  print(f"{[sum(x['rank_fc'])/(sum(optimized_ranks[str(net.n_layers)])+10+input_size) for x in history[str(n_layers)]]}")

In [None]:
for n_layers, net in nets.items():
  plt.plot([x["loss"] for x in history[str(n_layers)]], [x["acc"].cpu() for x in history[str(n_layers)]], "*", label="loss")
  plt.title(f"{n_layers} Layers")
  plt.xlabel("Loss")
  plt.ylabel("Accuracy")
  plt.show()
  for i in range(net.n_layers+1):
    plt.plot([x["rank_fc"][i].cpu() for x in history[str(n_layers)]], [x["acc"].cpu() for x in history[str(n_layers)]], "*", label=f"rank {i}")
  plt.title(f"{n_layers} Layers")
  plt.xlabel("Rank")
  plt.ylabel("Accuracy")
  plt.legend()
  plt.show()

  for i in range(net.n_layers+1):
    plt.plot([x["norm_fc"][i].cpu().detach().numpy() for x in history[str(n_layers)]], [x["acc"].cpu() for x in history[str(n_layers)]], "*", label=f"norm {i}")
  plt.title(f"{n_layers} Layers")
  plt.xlabel("Norm")
  plt.ylabel("Accuracy")
  plt.legend()
  plt.show()


  plt.plot([x["loss"] for x in history[str(n_layers)]], "*", label="loss")
  plt.plot([x["acc"].cpu() for x in history[str(n_layers)]], "*", label="loss")

  plt.title(f"{n_layers} Layers")
  plt.ylabel("Loss and Accuracy")
  plt.xlabel("Epochs")
  plt.show()

  for i in range(net.n_layers+1):
    plt.plot([x["rank_fc"][i].cpu() for x in history[str(n_layers)]], "*", label=f"rank {i}")
    plt.plot([x["norm_fc"][i].cpu().detach().numpy() for x in history[str(n_layers)]], "*", label=f"norm {i}")
  plt.title(f"{n_layers} Layers")
  plt.ylabel("Rank and Norm")
  plt.xlabel("Epochs")
  plt.legend()
  plt.show()
