Here we investigate how we can optimize the weights of a trained neural network to achieve non-vacuous PAC-Bayes bounds on the error rate. For simplicity we will consider a two-layer fully connected ReLU network trained on a binary-classification problem. We use PyTorch for the training of the network.

In [133]:
# Import PyTorch

import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

# Import other packages

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import truncnorm

The network we will consider will have $600$ hidden units an will be trained using stochastic gradient descent.

In [134]:
# Define the hyper-parameters of the model

input_size = 784
hidden_size = 600
num_classes = 1
num_epochs = 20
batch_size = 100
learning_rate = 0.01
momentum = 0.9

We will use MNIST with transformed labels to train the network. Images classified as $\{0,1,2,3,4\}$ will be given the label $1$, and images classified as $\{5,6,7,8,9\}$ will be given the label $-1$. The training dataset contains $60000$ examples, and the test dataset contains $10000$ examples.

In [135]:
# Import the MNIST dataset as two separate datasets

train_dataset = torchvision.datasets.MNIST(root='data', 
                                           train=True, 
                                           transform=transforms.ToTensor(),
                                           target_transform=lambda y: -1 if y<=4 else 1,  
                                           download=True)

test_dataset = torchvision.datasets.MNIST(root='data', 
                                          train=False,
                                          target_transform=lambda y: -1 if y<=4 else 1,
                                          transform=transforms.ToTensor())

# Create the data loader for training
train_dataset_loader = DataLoader(dataset=train_dataset, 
                                           batch_size=batch_size, 
                                           shuffle=True)

# Create the data loader for validating

test_dataset_loader = DataLoader(dataset=test_dataset, 
                                          batch_size=1, 
                                          shuffle=False) 

In [136]:
# Defining a FC neural network

class NeuralNet(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(NeuralNet, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size) 
        self.fc3 = nn.Linear(hidden_size, num_classes)  
    
    def forward(self, x):
        x = self.fc1(x)
        x = x.relu()
        x = self.fc3(x)
        return x

def ReLU_glorot_init(model):
    for name, param in model.named_parameters():
        
        if name.endswith(".bias"):
            param.data.fill_(0)
        else:
            nn.init.xavier_normal_(param)

model = NeuralNet(input_size, hidden_size, num_classes)
ReLU_glorot_init(model)

We will employ the Soft Margin Loss to dictate the learning procedure. For the prediction $\hat{y}$ and true labels $y$ of a batch $X$, the Soft Margin Loss returns
$$\frac{1}{\vert X\vert}\sum_{i}\log(1+\exp(\hat{y}[i]y[i])).$$

In [137]:
criterion = nn.SoftMarginLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate) 

# Training Epoch

def train_epoch():
    for i, (images, labels) in enumerate(train_dataset_loader):  
        # Move tensors to the configured device
        images = images.reshape(-1, 28*28)
        labels = labels
        
        # Forward pass
        outputs = model(images)
        loss = criterion(torch.reshape(outputs,(len(outputs),)), labels)
        
        # Backprpagation and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        

    return loss.item()

for epoch in range(num_epochs):
    loss=train_epoch()
    print ('Epoch {}, Loss: {:.3f}'.format(epoch+1,loss))

Epoch 1, Loss: 0.398
Epoch 2, Loss: 0.289
Epoch 3, Loss: 0.315
Epoch 4, Loss: 0.190
Epoch 5, Loss: 0.245
Epoch 6, Loss: 0.152
Epoch 7, Loss: 0.175
Epoch 8, Loss: 0.150
Epoch 9, Loss: 0.173
Epoch 10, Loss: 0.187
Epoch 11, Loss: 0.067
Epoch 12, Loss: 0.164
Epoch 13, Loss: 0.122
Epoch 14, Loss: 0.073
Epoch 15, Loss: 0.165
Epoch 16, Loss: 0.107
Epoch 17, Loss: 0.134
Epoch 18, Loss: 0.110
Epoch 19, Loss: 0.111
Epoch 20, Loss: 0.053


In [138]:
train_error=0
with torch.no_grad():
    for image, label in train_dataset_loader:
        image = image.reshape(-1, 28*28)
        outputs = torch.reshape(torch.sign(model(image)),(len(label),))
        train_error+=torch.sum(torch.abs(outputs-label))/len(train_dataset)

test_error=0
with torch.no_grad():
    for image, label in test_dataset_loader:
        image = image.reshape(-1, 28*28)
        outputs = torch.reshape(torch.sign(model(image)),(len(label),))
        test_error+=torch.sum(torch.abs(outputs-label))/len(train_dataset)

no_parameters=0
for param in model.named_parameters():
    if 'weight' in param[0]:
        no_parameters+=param[1].size()[0]*param[1].size()[1]
    else:
        no_parameters+=len(param[1])

print('Number of Parameters: {}, Train Error: {:.3f}, Test Error: {:.3f}'.format(no_parameters,train_error,test_error))

Number of Parameters: 471601, Train Error: 0.060, Test Error: 0.011


In [None]:
def newt_inv_KL(q,c):
    p_0=q+np.sqrt(c/2)
    for n in range(5):
        if p_0>1:
            return 1
        p_0=p_0-(q*np.log(q/c)+(1-q)*np.log((1-q)/(1-c))-c)/((1-q)/(1-p_0)-q/p_0)
    return p_0

Now that we have trained the network, we optimize the error bound of a stochastic neural network. The weights of which are random perturbations of the ones learned. 