# Introduction
In this notebook, we demonstrate the potential of combining the deep learning capabilities of PyTorch with Gaussian process models using GPyTorch. We use deep kernel learning (DKL) to train a deep neural network with a Gaussian process prediction layer for classification, using the MNIST dataset as a simple example.

For an introduction to DKL see these papers:
https://arxiv.org/abs/1511.02222
https://arxiv.org/abs/1611.00336

In [2]:
# Import our GPyTorch library
import gpytorch

# Import some classes we will use from torch
from torch.autograd import Variable
from torch.optim import SGD, Adam
from torch.utils.data import DataLoader

## Loading data

First, we load the standard MNIST train and test sets, which are available through torchvision.

In [3]:
# Import datasets to access MNIST and transforms to format data for learning
from torchvision import transforms, datasets

# Download and load the MNIST dataset to train on
# Compose lets us chain multiple transformations. Specifically, ToTensor makes the data a torch.FloatTensor of shape
# (colors x height x width) in the range [0.0, 1.0], as opposed to an image with shape (height x width x colors),
# then Normalize standardizes it using the precomputed mean (0.1307) and standard deviation (0.3081)

# Transformation documentation here: http://pytorch.org/docs/master/torchvision/transforms.html
train_dataset = datasets.MNIST('/tmp', train=True, download=True,
                               transform=transforms.Compose([
                                   transforms.ToTensor(),
                                   transforms.Normalize((0.1307,), (0.3081,))
                               ]))
test_dataset = datasets.MNIST('/tmp', train=False, download=True,
                              transform=transforms.Compose([
                                  transforms.ToTensor(),
                                  transforms.Normalize((0.1307,), (0.3081,))
                              ]))

# Put the data into a DataLoader. We shuffle the training data but not the test data, because the order in which
# training data is presented affects the outcome, whereas the order of the test data does not
train_loader = DataLoader(train_dataset, batch_size=256, shuffle=True, pin_memory=True)
test_loader = DataLoader(test_dataset, batch_size=256, shuffle=False, pin_memory=True)

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Processing...
Done!
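
To make the preprocessing concrete, here is a minimal sketch in plain Python (no torchvision needed) of what `Normalize((0.1307,), (0.3081,))` does to a pixel that `ToTensor` has already scaled to [0, 1], and of how many minibatches a batch size of 256 yields. The helper names `normalize_pixel` and `num_batches` are made up for illustration, and the dataset sizes (60,000 training and 10,000 test images) are the standard MNIST split rather than something read from the loaders.

```python
import math

MNIST_MEAN = 0.1307  # mean passed to Normalize in the cell above
MNIST_STD = 0.3081   # standard deviation passed to Normalize in the cell above


def normalize_pixel(value, mean=MNIST_MEAN, std=MNIST_STD):
    """Hypothetical helper: the elementwise operation that Normalize applies."""
    return (value - mean) / std


def num_batches(n_examples, batch_size):
    """Number of minibatches when the last, smaller batch is kept."""
    return math.ceil(n_examples / batch_size)


black = normalize_pixel(0.0)  # darkest possible pixel after ToTensor
white = normalize_pixel(1.0)  # brightest possible pixel after ToTensor
print("normalized range: [%.4f, %.4f]" % (black, white))
print("train batches:", num_batches(60000, 256))
print("test batches:", num_batches(10000, 256))
```

The 235 training batches match the `[001/235]` counter in the DKL training log later in the notebook.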


## Define the feature extractor for our deep kernel

In this cell, we define the deep neural network we will use as the basis for our deep kernel. To keep things simple, we use the classic LeNet architecture.

In [4]:
# Import torch's neural network
# Documentation here: http://pytorch.org/docs/master/nn.html
from torch import nn
# Import torch.nn.functional for various activation/pooling functions
# Documentation here: http://pytorch.org/docs/master/nn.html#torch-nn-functional
from torch.nn import functional as F

# We build a classic LeNet-style architecture without the final 10-output prediction layer. It serves as a feature
# extractor that reduces the dimensionality of our data to 64. We will pretrain these layers by adding
# a final 64-->10 classification layer
# https://medium.com/@siddharthdas_32104/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5
class LeNetFeatureExtractor(nn.Module):
    def __init__(self):
        super(LeNetFeatureExtractor, self).__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=5, padding=2)
        self.norm1 = nn.BatchNorm2d(16)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=5, padding=2)
        self.norm2 = nn.BatchNorm2d(32)
        self.fc3 = nn.Linear(32 * 7 * 7, 64)
        self.norm3 = nn.BatchNorm1d(64)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.norm1(self.conv1(x))), 2)
        x = F.max_pool2d(F.relu(self.norm2(self.conv2(x))), 2)
        x = x.view(-1, 32 * 7 * 7)
        x = F.relu(self.norm3(self.fc3(x)))
        return x
    
feature_extractor = LeNetFeatureExtractor().cuda()
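
The `32 * 7 * 7` input size of `fc3` follows from the layer shapes. As a quick check in plain Python, the sketch below applies the usual output-size formulas for convolution and pooling to a 28 x 28 MNIST image. `conv_out` and `pool_out` are illustrative helpers, not PyTorch functions.

```python
def conv_out(size, kernel_size, padding=0, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1


def pool_out(size, kernel_size):
    """Output size of max pooling with stride equal to the kernel size."""
    return (size - kernel_size) // kernel_size + 1


size = 28                                          # MNIST images are 28 x 28
size = pool_out(conv_out(size, 5, padding=2), 2)   # conv1 keeps 28, pooling halves it
size = pool_out(conv_out(size, 5, padding=2), 2)   # conv2 keeps the size, pooling halves it
flat_features = 32 * size * size                   # 32 channels after conv2
print(size, flat_features)
```

With `kernel_size=5` and `padding=2` the convolutions preserve the spatial size, so only the two pooling steps shrink the image.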

### Pretrain the feature extractor a bit

We next pretrain the deep feature extractor using a simple linear classifier. While this step is in general not necessary, we include it to demonstrate that GPs can be added on to a neural network as a simple fine-tuning step that adds minimal training overhead.

In [5]:
# Make a final classifier layer that operates on the feature extractor's output
classifier = nn.Linear(64, 10).cuda()
# Make list of parameters to optimize (both the parameters of the feature extractor and classifier)
params = list(feature_extractor.parameters()) + list(classifier.parameters())
# We train the network using stochastic gradient descent
optimizer = SGD(params, lr=0.1, momentum=0.9)

def pretrain(epoch):
    # Set the feature extractor to training mode
    feature_extractor.train()
    train_loss = 0.
    # Basic training loop for a DNN
    for data, target in train_loader:
        data, target = Variable(data.cuda()), Variable(target.cuda())
        optimizer.zero_grad()
        # Forward data through the feature extractor, the classifier, and a log-softmax
        features = feature_extractor(data)
        output = F.log_softmax(classifier(features), 1)
        # Compute the loss
        loss = F.nll_loss(output, target)
        # Back propagate and update weights
        loss.backward()
        optimizer.step()
        train_loss += loss.data[0] * len(data)
    print('Train Epoch: %d\tLoss: %.6f' % (epoch, train_loss / len(train_dataset)))

def pretest():
    # Switch the feature extractor to eval mode
    feature_extractor.eval()
    test_loss = 0
    correct = 0
    # Loop over minibatches of test data and compute accuracy
    for data, target in test_loader:
        data, target = data.cuda(), target.cuda()
        data, target = Variable(data, volatile=True), Variable(target)
        features = feature_extractor(data)
        output = F.log_softmax(classifier(features), 1)
        test_loss += F.nll_loss(output, target, size_average=False).data[0] # sum up batch loss
        pred = output.data.max(1, keepdim=True)[1] # get the index of the max log-probability
        correct += pred.eq(target.data.view_as(pred)).cpu().sum()
    test_loss /= len(test_loader.dataset)
    print('Test set: Average loss: {:.4f}, Accuracy: {}/{} ({:.3f}%)'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))

n_epochs = 3
for epoch in range(1, n_epochs + 1):
    pretrain(epoch)
    pretest()

Train Epoch: 1	Loss: 0.153308
Test set: Average loss: 0.0542, Accuracy: 9817/10000 (98.170%)
Train Epoch: 2	Loss: 0.039772
Test set: Average loss: 0.0285, Accuracy: 9898/10000 (98.980%)
Train Epoch: 3	Loss: 0.025896
Test set: Average loss: 0.0295, Accuracy: 9892/10000 (98.920%)
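
The pretraining loss above is `F.nll_loss` applied to `F.log_softmax` outputs, which together amount to the usual cross-entropy loss. The following plain-Python sketch walks through both steps for a single example. The logits are made-up numbers and the function names are illustrative only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_norm for z in logits]


def nll(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]


logits = [2.0, 0.5, -1.0]  # made-up classifier outputs for 3 classes
log_probs = log_softmax(logits)
loss = nll(log_probs, target=0)
# The prediction is the class with the largest log-probability
prediction = max(range(len(log_probs)), key=lambda i: log_probs[i])
print("loss: %.4f, predicted class: %d" % (loss, prediction))
```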


## Define the deep kernel GP

We next define a DKLModel that uses the feature extractor. This is a Gaussian process that applies an additive RBF kernel to the features extracted by the deep neural network. The key thing that is different between this model and models we've seen in other example notebooks is in forward: rather than working directly with x, we first extract features using the deep feature extractor.
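
In other words, the kernel is evaluated on extracted features rather than on raw inputs. A minimal plain-Python sketch of that idea follows, with a toy stand-in for the network and an additive RBF kernel that sums one one-dimensional RBF kernel per feature. The functions `toy_features`, `rbf_1d`, and `additive_deep_kernel` are hypothetical illustrations, not GPyTorch APIs, and the RBF form used is one common parameterization.

```python
import math


def toy_features(x):
    """Stand-in for the feature extractor: maps a raw input to two features."""
    return [math.tanh(sum(x)), math.tanh(x[0] - x[-1])]


def rbf_1d(a, b, lengthscale=1.0):
    """One-dimensional RBF kernel between two scalars."""
    return math.exp(-0.5 * ((a - b) / lengthscale) ** 2)


def additive_deep_kernel(x1, x2, lengthscale=1.0):
    """Sum of per-feature RBF kernels, evaluated on extracted features."""
    f1, f2 = toy_features(x1), toy_features(x2)
    return sum(rbf_1d(a, b, lengthscale) for a, b in zip(f1, f2))


x_a = [0.2, -0.1, 0.4]   # made-up raw inputs
x_b = [0.3, 0.0, -0.5]
k_aa = additive_deep_kernel(x_a, x_a)
k_ab = additive_deep_kernel(x_a, x_b)
print(k_aa, k_ab)
```

Because each feature gets its own one-dimensional kernel, each component can be handled on a one-dimensional grid, which is what makes the grid-based approximation practical with 64 features.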

The loss used for training is the standard variational lower bound used for training Gaussian processes. Since we use an additive RBF kernel, we can make use of the AdditiveGridInducingVariationalGP model, which efficiently performs inference using SKI in this setting.
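
For reference, the variational lower bound in its standard form for sparse variational GPs is written out below for $N$ training points, latent function values $f_i$, and inducing values $\mathbf{u}$, together with its minibatch estimate over a batch $B$. The notation is generic and not taken from the GPyTorch source; any additional rescaling GPyTorch applies internally is not shown in this notebook.

```latex
\mathcal{L}
  = \sum_{i=1}^{N} \mathbb{E}_{q(f_i)}\!\left[\log p(y_i \mid f_i)\right]
    - \mathrm{KL}\!\left[q(\mathbf{u}) \,\|\, p(\mathbf{u})\right]
  \approx \frac{N}{|B|} \sum_{i \in B} \mathbb{E}_{q(f_i)}\!\left[\log p(y_i \mid f_i)\right]
    - \mathrm{KL}\!\left[q(\mathbf{u}) \,\|\, p(\mathbf{u})\right]
```

The `n_data` argument passed to the marginal log likelihood object in the training cell supplies $N$, and the training loop minimizes the negative of this bound.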

In [6]:
# now this is our first exposure to the usefulness of gpytorch

# A gpytorch.Module is a subclass of torch.nn.Module
class DKLModel(gpytorch.Module):
    def __init__(self, feature_extractor, n_features=64, grid_bounds=(-10., 10.)):
        super(DKLModel, self).__init__()
        # We add the feature-extracting network to the class
        self.feature_extractor = feature_extractor
        # The latent function is what transforms the features into the output
        self.latent_functions = LatentFunctions(n_features=n_features, grid_bounds=grid_bounds)
        # The grid bounds are the range we expect the features to fall into
        self.grid_bounds = grid_bounds
        # n_features is the dimension of the extracted feature vector (64)
        self.n_features = n_features
    
    def forward(self, x):
        # In the forward method of the Module, first feed the input data through the
        # feature extraction network
        features = self.feature_extractor(x)
        # Scale to fit inside grid bounds
        features = gpytorch.utils.scale_to_bounds(features, self.grid_bounds[0], self.grid_bounds[1])
        # The result is the output of the latent functions
        res = self.latent_functions(features.unsqueeze(-1))
        return res
    
# The AdditiveGridInducingVariationalGP trains multiple GPs on the features
# These are mixed together by the likelihood function to generate the final
# classification output

# Grid bounds specify the allowed values of features
# grid_size is the number of subdivisions along each dimension
class LatentFunctions(gpytorch.models.AdditiveGridInducingVariationalGP):
    # n_features is the number of features from feature extractor
    # mixing params = False means the result of the GPs will simply be summed instead of mixed
    def __init__(self, n_features=64, grid_bounds=(-10., 10.), grid_size=128):
        super(LatentFunctions, self).__init__(grid_size=grid_size, grid_bounds=[grid_bounds],
                                              n_components=n_features, mixing_params=False, sum_output=False)
        # We will use the RBF kernel, a very common universal approximator
        cov_module = gpytorch.kernels.RBFKernel()
        # Initialize the log lengthscale of the kernel to 0, i.e. a lengthscale of 1
        cov_module.initialize(log_lengthscale=0)
        self.cov_module = cov_module
        self.grid_bounds = grid_bounds
        
    def forward(self, x):
        # Zero mean
        mean = Variable(x.data.new(len(x)).zero_())
        # Covariance using RBF kernel as described in __init__
        covar = self.cov_module(x)
        # Return as Gaussian
        return gpytorch.random_variables.GaussianRandomVariable(mean, covar)
    
# Initialize the model
model = DKLModel(feature_extractor).cuda()
# Choose the likelihood function to use
# Here we use the softmax likelihood (e^z_i)/SUM_over_j(e^z_j)
# https://en.wikipedia.org/wiki/Softmax_function
likelihood = gpytorch.likelihoods.SoftmaxLikelihood(n_features=model.n_features, n_classes=10).cuda()
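
As a rough illustration of what a softmax likelihood does with the latent functions, the plain-Python sketch below mixes one sample of 64 latent function values into 10 class scores with a weight matrix and then applies a softmax. It assumes the mixing is a plain linear map, which the `n_features` and `n_classes` arguments suggest but this notebook does not confirm. The weights and latent values are random placeholders, not trained parameters.

```python
import math
import random

random.seed(0)
n_features, n_classes = 64, 10

# Placeholder for one joint sample of the 64 latent GP outputs at a single input
latent_sample = [random.gauss(0.0, 1.0) for _ in range(n_features)]
# Placeholder mixing weights (assumed linear map from latent functions to classes)
mixing_weights = [[random.gauss(0.0, 0.1) for _ in range(n_features)]
                  for _ in range(n_classes)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# One score per class: weighted sum of the latent function values
class_scores = [sum(w * f for w, f in zip(row, latent_sample))
                for row in mixing_weights]
class_probs = softmax(class_scores)
predicted_class = max(range(n_classes), key=lambda c: class_probs[c])
print(predicted_class, round(sum(class_probs), 6))
```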

## Train the DKL model
In this cell we train the DKL model we defined above. 

In [None]:
# Simple DataLoader
train_loader = DataLoader(train_dataset, batch_size=256, shuffle=True, pin_memory=True)

# We use an Adam optimizer over both the model and likelihood parameters
# https://arxiv.org/abs/1412.6980
optimizer = Adam([
    {'params': model.parameters()},
    {'params': likelihood.parameters()},  # SoftmaxLikelihood contains parameters
], lr=0.01)

# "Loss" for GPs - the variational bound on the marginal log likelihood (we minimize its negative)
mll = gpytorch.mlls.VariationalMarginalLogLikelihood(likelihood, model, n_data=len(train_dataset))

def train(epoch):
    model.train()
    likelihood.train()
    
    train_loss = 0.
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.cuda(), target.cuda()
        data, target = Variable(data), Variable(target)
        optimizer.zero_grad()
        output = model(data)
        loss = -mll(output, target)
        loss.backward()
        optimizer.step()
        print('Train Epoch: %d [%03d/%03d], Loss: %.6f' % (epoch, batch_idx + 1, len(train_loader), loss.data[0]))

def test():
    model.eval()
    likelihood.eval()

    # Note: test_loss is never accumulated in this loop, so the reported average loss is always 0
    test_loss = 0
    correct = 0
    for data, target in test_loader:
        data, target = data.cuda(), target.cuda()
        data, target = Variable(data, volatile=True), Variable(target)
        output = likelihood(model(data))
        pred = output.argmax()
        correct += pred.eq(target.view_as(pred)).data.cpu().sum()
    test_loss /= len(test_loader.dataset)
    print('Test set: Average loss: {:.4f}, Accuracy: {}/{} ({:.3f}%)'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))

n_epochs = 10

# While we have theoretically fast algorithms for Toeplitz matrix-vector multiplication, GPU hardware
# is so well suited to naive multiplication that it beats the current implementation of our algorithm.
# Because of this, we set the use_toeplitz flag to False to minimize runtime
with gpytorch.settings.use_toeplitz(False):
    for epoch in range(1, n_epochs + 1):
        %time train(epoch)
        test()
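
For context on the Toeplitz remark: a stationary kernel such as RBF evaluated on a regularly spaced grid gives a symmetric Toeplitz matrix, which is fully determined by its first column. The plain-Python sketch below builds such a matrix for a small grid and checks that a matrix-vector product computed from the first column alone matches the dense product. It only illustrates the structure. The grid size and function names are illustrative, and this is not how GPyTorch implements the multiplication (fast versions typically rely on FFTs).

```python
import math


def rbf(a, b, lengthscale=1.0):
    """RBF kernel between two scalars."""
    return math.exp(-0.5 * ((a - b) / lengthscale) ** 2)


grid_size = 8                      # tiny grid for illustration
lower, upper = -10.0, 10.0         # same bounds as the default grid_bounds
step = (upper - lower) / (grid_size - 1)
grid = [lower + i * step for i in range(grid_size)]

# Dense kernel matrix on the grid: entry (i, j) depends only on |i - j|
dense = [[rbf(a, b) for b in grid] for a in grid]
first_column = [row[0] for row in dense]


def dense_matvec(matrix, vec):
    """Ordinary matrix-vector product using every stored entry."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, vec)) for row in matrix]


def toeplitz_matvec(column, vec):
    """Symmetric Toeplitz product that needs only the first column."""
    n = len(vec)
    return [sum(column[abs(i - j)] * vec[j] for j in range(n)) for i in range(n)]


vec = [float(i + 1) for i in range(grid_size)]
y_dense = dense_matvec(dense, vec)
y_toeplitz = toeplitz_matvec(first_column, vec)
max_diff = max(abs(a - b) for a, b in zip(y_dense, y_toeplitz))
print("largest difference:", max_diff)
```

Storing one column instead of the full matrix cuts memory from quadratic to linear in the grid size. The version above still does quadratic work per product, which is the "naive multiplication" the comment in the cell refers to.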

Train Epoch: 1 [001/235], Loss: 65.087952
Train Epoch: 1 [002/235], Loss: 58.054047
Train Epoch: 1 [003/235], Loss: 51.670380
Train Epoch: 1 [004/235], Loss: 46.254906
Train Epoch: 1 [005/235], Loss: 41.669289
Train Epoch: 1 [006/235], Loss: 37.813416
Train Epoch: 1 [007/235], Loss: 34.537491
Train Epoch: 1 [008/235], Loss: 31.772659
Train Epoch: 1 [009/235], Loss: 29.402857
Train Epoch: 1 [010/235], Loss: 27.366085
Train Epoch: 1 [011/235], Loss: 25.626835
Train Epoch: 1 [012/235], Loss: 24.105766
Train Epoch: 1 [013/235], Loss: 22.802094
Train Epoch: 1 [014/235], Loss: 21.577768
Train Epoch: 1 [015/235], Loss: 20.498735
Train Epoch: 1 [016/235], Loss: 19.505106
Train Epoch: 1 [017/235], Loss: 18.606359
Train Epoch: 1 [018/235], Loss: 17.759821
Train Epoch: 1 [019/235], Loss: 16.960762
Train Epoch: 1 [020/235], Loss: 16.218292
Train Epoch: 1 [021/235], Loss: 15.443375
Train Epoch: 1 [022/235], Loss: 14.865941
Train Epoch: 1 [023/235], Loss: 14.219132
Train Epoch: 1 [024/235], Loss: 13

Train Epoch: 1 [201/235], Loss: 0.636681
Train Epoch: 1 [202/235], Loss: 0.576975
Train Epoch: 1 [203/235], Loss: 0.572340
Train Epoch: 1 [204/235], Loss: 0.623887
Train Epoch: 1 [205/235], Loss: 0.576107
Train Epoch: 1 [206/235], Loss: 0.560232
Train Epoch: 1 [207/235], Loss: 0.529076
Train Epoch: 1 [208/235], Loss: 0.536193
Train Epoch: 1 [209/235], Loss: 0.570294
Train Epoch: 1 [210/235], Loss: 0.550720
Train Epoch: 1 [211/235], Loss: 0.534817
Train Epoch: 1 [212/235], Loss: 0.565627
Train Epoch: 1 [213/235], Loss: 0.546166
Train Epoch: 1 [214/235], Loss: 0.563529
Train Epoch: 1 [215/235], Loss: 0.537592
Train Epoch: 1 [216/235], Loss: 0.614546
Train Epoch: 1 [217/235], Loss: 0.571719
Train Epoch: 1 [218/235], Loss: 0.522561
Train Epoch: 1 [219/235], Loss: 0.539072
Train Epoch: 1 [220/235], Loss: 0.576180
Train Epoch: 1 [221/235], Loss: 0.546636
Train Epoch: 1 [222/235], Loss: 0.542657
Train Epoch: 1 [223/235], Loss: 0.539473
Train Epoch: 1 [224/235], Loss: 0.523308
Train Epoch: 1 [

  softmax = nn.functional.softmax(mixed_fs.t()).view(n_data, n_samples, self.n_classes)


Test set: Average loss: 0.0000, Accuracy: 9800/10000 (98.000%)
Train Epoch: 2 [001/235], Loss: 0.492063
Train Epoch: 2 [002/235], Loss: 0.494656
Train Epoch: 2 [003/235], Loss: 0.530519
Train Epoch: 2 [004/235], Loss: 0.489570
Train Epoch: 2 [005/235], Loss: 0.486440
Train Epoch: 2 [006/235], Loss: 0.482859
Train Epoch: 2 [007/235], Loss: 0.492164
Train Epoch: 2 [008/235], Loss: 0.461867
Train Epoch: 2 [009/235], Loss: 0.530505
Train Epoch: 2 [010/235], Loss: 0.467194
Train Epoch: 2 [011/235], Loss: 0.460434
Train Epoch: 2 [012/235], Loss: 0.510900
Train Epoch: 2 [013/235], Loss: 0.455643
Train Epoch: 2 [014/235], Loss: 0.468740
Train Epoch: 2 [015/235], Loss: 0.452816
Train Epoch: 2 [016/235], Loss: 0.474421
Train Epoch: 2 [017/235], Loss: 0.449498
Train Epoch: 2 [018/235], Loss: 0.453194
Train Epoch: 2 [019/235], Loss: 0.481237
Train Epoch: 2 [020/235], Loss: 0.514508
Train Epoch: 2 [021/235], Loss: 0.466621
Train Epoch: 2 [022/235], Loss: 0.464630
Train Epoch: 2 [023/235], Loss: 0.4

Train Epoch: 2 [202/235], Loss: 0.400685
Train Epoch: 2 [203/235], Loss: 0.377952
Train Epoch: 2 [204/235], Loss: 0.328984
Train Epoch: 2 [205/235], Loss: 0.388258
Train Epoch: 2 [206/235], Loss: 0.383484
Train Epoch: 2 [207/235], Loss: 0.362737
Train Epoch: 2 [208/235], Loss: 0.382919
Train Epoch: 2 [209/235], Loss: 0.348237
Train Epoch: 2 [210/235], Loss: 0.356975
Train Epoch: 2 [211/235], Loss: 0.367515
Train Epoch: 2 [212/235], Loss: 0.432005
Train Epoch: 2 [213/235], Loss: 0.378355
Train Epoch: 2 [214/235], Loss: 0.330096
Train Epoch: 2 [215/235], Loss: 0.347276
Train Epoch: 2 [216/235], Loss: 0.361132
Train Epoch: 2 [217/235], Loss: 0.470303
Train Epoch: 2 [218/235], Loss: 0.359323
Train Epoch: 2 [219/235], Loss: 0.355758
Train Epoch: 2 [220/235], Loss: 0.370636
Train Epoch: 2 [221/235], Loss: 0.361874
Train Epoch: 2 [222/235], Loss: 0.400282
Train Epoch: 2 [223/235], Loss: 0.365867
Train Epoch: 2 [224/235], Loss: 0.385278
Train Epoch: 2 [225/235], Loss: 0.419722
Train Epoch: 2 [

Test set: Average loss: 0.0000, Accuracy: 9860/10000 (98.600%)
Train Epoch: 3 [001/235], Loss: 0.344784
Train Epoch: 3 [002/235], Loss: 0.385207
Train Epoch: 3 [003/235], Loss: 0.353562
Train Epoch: 3 [004/235], Loss: 0.375506
Train Epoch: 3 [005/235], Loss: 0.360863
Train Epoch: 3 [006/235], Loss: 0.346552
Train Epoch: 3 [007/235], Loss: 0.352084
Train Epoch: 3 [008/235], Loss: 0.407934
Train Epoch: 3 [009/235], Loss: 0.346343
Train Epoch: 3 [010/235], Loss: 0.344270
Train Epoch: 3 [011/235], Loss: 0.361802
Train Epoch: 3 [012/235], Loss: 0.356864
Train Epoch: 3 [013/235], Loss: 0.392363
Train Epoch: 3 [014/235], Loss: 0.357057
Train Epoch: 3 [015/235], Loss: 0.425715
Train Epoch: 3 [016/235], Loss: 0.356624
Train Epoch: 3 [017/235], Loss: 0.393174
Train Epoch: 3 [018/235], Loss: 0.338429
Train Epoch: 3 [019/235], Loss: 0.348844
Train Epoch: 3 [020/235], Loss: 0.363367
Train Epoch: 3 [021/235], Loss: 0.355188
Train Epoch: 3 [022/235], Loss: 0.405230
Train Epoch: 3 [023/235], Loss: 0.3

Train Epoch: 3 [200/235], Loss: 0.347229
Train Epoch: 3 [201/235], Loss: 0.329555
Train Epoch: 3 [202/235], Loss: 0.342013
Train Epoch: 3 [203/235], Loss: 0.351251
Train Epoch: 3 [204/235], Loss: 0.358006
Train Epoch: 3 [205/235], Loss: 0.327806
Train Epoch: 3 [206/235], Loss: 0.334583
Train Epoch: 3 [207/235], Loss: 0.348838
Train Epoch: 3 [208/235], Loss: 0.349632
Train Epoch: 3 [209/235], Loss: 0.352472
Train Epoch: 3 [210/235], Loss: 0.322700
Train Epoch: 3 [211/235], Loss: 0.313943
Train Epoch: 3 [212/235], Loss: 0.328301
Train Epoch: 3 [213/235], Loss: 0.367909
Train Epoch: 3 [214/235], Loss: 0.414840
Train Epoch: 3 [215/235], Loss: 0.338532
Train Epoch: 3 [216/235], Loss: 0.335628
Train Epoch: 3 [217/235], Loss: 0.352955
Train Epoch: 3 [218/235], Loss: 0.345070
Train Epoch: 3 [219/235], Loss: 0.334396
Train Epoch: 3 [220/235], Loss: 0.340255
Train Epoch: 3 [221/235], Loss: 0.338765
Train Epoch: 3 [222/235], Loss: 0.317491
Train Epoch: 3 [223/235], Loss: 0.322715
Train Epoch: 3 [

Test set: Average loss: 0.0000, Accuracy: 9892/10000 (98.920%)
Train Epoch: 4 [001/235], Loss: 0.338766
Train Epoch: 4 [002/235], Loss: 0.367297
Train Epoch: 4 [003/235], Loss: 0.347085
Train Epoch: 4 [004/235], Loss: 0.335469
Train Epoch: 4 [005/235], Loss: 0.326927
Train Epoch: 4 [006/235], Loss: 0.351749
Train Epoch: 4 [007/235], Loss: 0.331996
Train Epoch: 4 [008/235], Loss: 0.329588
Train Epoch: 4 [009/235], Loss: 0.343325
Train Epoch: 4 [010/235], Loss: 0.339506
Train Epoch: 4 [011/235], Loss: 0.339820
Train Epoch: 4 [012/235], Loss: 0.340228
Train Epoch: 4 [013/235], Loss: 0.338951
Train Epoch: 4 [014/235], Loss: 0.354065
Train Epoch: 4 [015/235], Loss: 0.341825
Train Epoch: 4 [016/235], Loss: 0.338622
Train Epoch: 4 [017/235], Loss: 0.330435
Train Epoch: 4 [018/235], Loss: 0.351709
Train Epoch: 4 [019/235], Loss: 0.315660
Train Epoch: 4 [020/235], Loss: 0.358695
Train Epoch: 4 [021/235], Loss: 0.326777
Train Epoch: 4 [022/235], Loss: 0.359385
Train Epoch: 4 [023/235], Loss: 0.3

Train Epoch: 4 [200/235], Loss: 0.331628
Train Epoch: 4 [201/235], Loss: 0.324979
Train Epoch: 4 [202/235], Loss: 0.324731
Train Epoch: 4 [203/235], Loss: 0.355444
Train Epoch: 4 [204/235], Loss: 0.332848
Train Epoch: 4 [205/235], Loss: 0.353993
Train Epoch: 4 [206/235], Loss: 0.326868
Train Epoch: 4 [207/235], Loss: 0.329959
Train Epoch: 4 [208/235], Loss: 0.329309
Train Epoch: 4 [209/235], Loss: 0.309408
Train Epoch: 4 [210/235], Loss: 0.328027
Train Epoch: 4 [211/235], Loss: 0.330238
Train Epoch: 4 [212/235], Loss: 0.308669
Train Epoch: 4 [213/235], Loss: 0.341492
Train Epoch: 4 [214/235], Loss: 0.321429
Train Epoch: 4 [215/235], Loss: 0.370373
Train Epoch: 4 [216/235], Loss: 0.370197
Train Epoch: 4 [217/235], Loss: 0.323293
Train Epoch: 4 [218/235], Loss: 0.368431
Train Epoch: 4 [219/235], Loss: 0.332610
Train Epoch: 4 [220/235], Loss: 0.341482
Train Epoch: 4 [221/235], Loss: 0.331595
Train Epoch: 4 [222/235], Loss: 0.349746
Train Epoch: 4 [223/235], Loss: 0.375028
Train Epoch: 4 [

Test set: Average loss: 0.0000, Accuracy: 9872/10000 (98.720%)
Train Epoch: 5 [001/235], Loss: 0.343467
Train Epoch: 5 [002/235], Loss: 0.325436
Train Epoch: 5 [003/235], Loss: 0.325466
Train Epoch: 5 [004/235], Loss: 0.337124
Train Epoch: 5 [005/235], Loss: 0.353041
Train Epoch: 5 [006/235], Loss: 0.332303
Train Epoch: 5 [007/235], Loss: 0.341696
Train Epoch: 5 [008/235], Loss: 0.334159
Train Epoch: 5 [009/235], Loss: 0.348334
Train Epoch: 5 [010/235], Loss: 0.325254
Train Epoch: 5 [011/235], Loss: 0.324216
Train Epoch: 5 [012/235], Loss: 0.327630
Train Epoch: 5 [013/235], Loss: 0.339285
Train Epoch: 5 [014/235], Loss: 0.331850
Train Epoch: 5 [015/235], Loss: 0.320544
Train Epoch: 5 [016/235], Loss: 0.337012
Train Epoch: 5 [017/235], Loss: 0.327656
Train Epoch: 5 [018/235], Loss: 0.329944
Train Epoch: 5 [019/235], Loss: 0.325355
Train Epoch: 5 [020/235], Loss: 0.338756
Train Epoch: 5 [021/235], Loss: 0.336839
Train Epoch: 5 [022/235], Loss: 0.322360
Train Epoch: 5 [023/235], Loss: 0.3

Train Epoch: 5 [200/235], Loss: 0.328927
Train Epoch: 5 [201/235], Loss: 0.353309
Train Epoch: 5 [202/235], Loss: 0.336863
Train Epoch: 5 [203/235], Loss: 0.347983
Train Epoch: 5 [204/235], Loss: 0.344681
Train Epoch: 5 [205/235], Loss: 0.370748
Train Epoch: 5 [206/235], Loss: 0.338474
Train Epoch: 5 [207/235], Loss: 0.325556
Train Epoch: 5 [208/235], Loss: 0.330005
Train Epoch: 5 [209/235], Loss: 0.333893
Train Epoch: 5 [210/235], Loss: 0.338651
Train Epoch: 5 [211/235], Loss: 0.335742
Train Epoch: 5 [212/235], Loss: 0.337552
Train Epoch: 5 [213/235], Loss: 0.326950
Train Epoch: 5 [214/235], Loss: 0.335224
Train Epoch: 5 [215/235], Loss: 0.361109
Train Epoch: 5 [216/235], Loss: 0.340011
Train Epoch: 5 [217/235], Loss: 0.334965
Train Epoch: 5 [218/235], Loss: 0.319811
Train Epoch: 5 [219/235], Loss: 0.369815
Train Epoch: 5 [220/235], Loss: 0.352917
Train Epoch: 5 [221/235], Loss: 0.323578
Train Epoch: 5 [222/235], Loss: 0.340981
Train Epoch: 5 [223/235], Loss: 0.348733
Train Epoch: 5 [

Test set: Average loss: 0.0000, Accuracy: 9925/10000 (99.250%)
Train Epoch: 6 [001/235], Loss: 0.310753
Train Epoch: 6 [002/235], Loss: 0.326897
Train Epoch: 6 [003/235], Loss: 0.327825
Train Epoch: 6 [004/235], Loss: 0.333024
Train Epoch: 6 [005/235], Loss: 0.337635
Train Epoch: 6 [006/235], Loss: 0.325754
Train Epoch: 6 [007/235], Loss: 0.326003
Train Epoch: 6 [008/235], Loss: 0.334552
Train Epoch: 6 [009/235], Loss: 0.327853
Train Epoch: 6 [010/235], Loss: 0.329951
Train Epoch: 6 [011/235], Loss: 0.329439
Train Epoch: 6 [012/235], Loss: 0.335213
Train Epoch: 6 [013/235], Loss: 0.334272
Train Epoch: 6 [014/235], Loss: 0.322461
Train Epoch: 6 [015/235], Loss: 0.318361
Train Epoch: 6 [016/235], Loss: 0.331106
Train Epoch: 6 [017/235], Loss: 0.337697
Train Epoch: 6 [018/235], Loss: 0.333111
Train Epoch: 6 [019/235], Loss: 0.349010
Train Epoch: 6 [020/235], Loss: 0.335571
Train Epoch: 6 [021/235], Loss: 0.408719
Train Epoch: 6 [022/235], Loss: 0.327110
Train Epoch: 6 [023/235], Loss: 0.3

KeyboardInterrupt: 

Test set: Average loss: 0.0000, Accuracy: 9887/10000 (98.870%)
Train Epoch: 7 [001/235], Loss: 0.339845
Train Epoch: 7 [002/235], Loss: 0.334462
Train Epoch: 7 [003/235], Loss: 0.356453
Train Epoch: 7 [004/235], Loss: 0.366893
Train Epoch: 7 [005/235], Loss: 0.329700
Train Epoch: 7 [006/235], Loss: 0.393469
Train Epoch: 7 [007/235], Loss: 0.311550
Train Epoch: 7 [008/235], Loss: 0.339231
Train Epoch: 7 [009/235], Loss: 0.344048
Train Epoch: 7 [010/235], Loss: 0.323837
Train Epoch: 7 [011/235], Loss: 0.335915
Train Epoch: 7 [012/235], Loss: 0.310757
Train Epoch: 7 [013/235], Loss: 0.339466
Train Epoch: 7 [014/235], Loss: 0.329001
Train Epoch: 7 [015/235], Loss: 0.314345
Train Epoch: 7 [016/235], Loss: 0.337905
Train Epoch: 7 [017/235], Loss: 0.355110
Train Epoch: 7 [018/235], Loss: 0.329310
Train Epoch: 7 [019/235], Loss: 0.335341
Train Epoch: 7 [020/235], Loss: 0.318637
Train Epoch: 7 [021/235], Loss: 0.359748
Train Epoch: 7 [022/235], Loss: 0.322279
Train Epoch: 7 [023/235], Loss: 0.3