In [1]:
import copy
import random

import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from scipy.stats import entropy

# Assignment 3

## Generative Adversarial Networks on MNIST

In this assignment you will create and train a GAN to generate images of digits that mimic those in the MNIST dataset.

### Evaluation metric: Inception Score

Rather than just eyeballing whether GAN samples look good or not, researchers have come up with multiple objective metrics for assessing the quality and diversity of GAN outputs. We will use one of these metrics, the *Inception Score*.

Calculating the Inception Score involves running a pretrained neural network. This is where the name comes from: the authors who proposed this metric used a pretrained [Inception Network](https://arxiv.org/pdf/1409.4842.pdf) from TensorFlow in their [paper](https://proceedings.neurips.cc/paper/2016/file/8a3363abe792db2d8761d6403605aeb7-Paper.pdf). Since we will be using the MNIST dataset in this assignment, we provide a simpler neural network pretrained on MNIST as the scoring model.

The idea behind the Inception Score is simple: a good GAN should generate *meaningful* and *diverse* samples. For MNIST, a sample is "meaningful" if it looks like one of the 10 digits. When we run a good digit classifier on such a sample, it should assign high probability to one of the 10 classes and low probability to the others. In information-theoretic terms, this means the predicted label distribution $p(y|x)$ for any specific sample $x$ should have low entropy. On the other hand, if the generated samples are diverse, they should cover all 10 classes once we generate a large enough set of samples. This means that the "average" label distribution $p(y) = \int p(y|x=G(z))\, p(z)\, \mathrm{d}z$ should have high entropy. The Inception Score is defined as $\exp (\mathbb{E}_x \mathrm{KL}(p(y|x) || p(y)))$, where $\mathrm{KL}(P||Q)$ is the K-L divergence, a measure of how much probability distribution $P$ differs from distribution $Q$. Intuitively, if the generated samples are good, $p(y|x)$ should be very different from $p(y)$, since the former should have low entropy while the latter should have high entropy. With 10 classes, the score ranges from 1 (worst) to 10 (best).
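To make the definition concrete, here is a minimal NumPy sketch that evaluates the score directly on hand-made label distributions rather than on classifier outputs. The function name `toy_inception_score` and the three toy inputs are illustrative assumptions, not part of the assignment code.

```python
import numpy as np


def toy_inception_score(probs, eps=1e-12):
    """Compute exp(E_x[KL(p(y|x) || p(y))]) from rows of class probabilities.

    probs: array of shape (num_samples, num_classes); each row sums to 1.
    eps:   small constant that keeps log() finite for zero probabilities.
    """
    probs = np.asarray(probs, dtype=np.float64)
    p_y = probs.mean(axis=0, keepdims=True)  # Marginal label distribution p(y).
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))


num_classes = 10
# Confident and diverse: one sharp prediction for each of the 10 classes.
confident_diverse = np.eye(num_classes)
# Confident but collapsed: every sample is predicted as class 0.
collapsed = np.tile(np.eye(num_classes)[0], (num_classes, 1))
# Diverse on average but never confident: every prediction is uniform.
unsure = np.full((num_classes, num_classes), 1.0 / num_classes)

score_good = toy_inception_score(confident_diverse)  # Close to 10.
score_collapsed = toy_inception_score(collapsed)     # Close to 1.
score_unsure = toy_inception_score(unsure)           # Close to 1.
print(score_good, score_collapsed, score_unsure)
```

Only the first case, where each $p(y|x)$ is sharp and $p(y)$ is spread over all classes, reaches the maximum of 10. Mode collapse and uncertain predictions both drive the score down to 1.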

Don't be too worried if you don't fully get how the score is defined and calculated. Just remember that in this assignment, we want our GAN to have a high Inception Score.

In [2]:
# Pretrained model used to evaluate the Inception Score.
class ScoringModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.drop_out = nn.Dropout()
        self.fc1 = nn.Linear(7 * 7 * 64, 1000)
        self.fc2 = nn.Linear(1000, 10)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.drop_out(out)
        out = self.fc1(out)
        out = self.fc2(out)
        return out

    
def inception_score_mnist(
    imgs,
    model_path='weights/mnist.ckpt',
    batch_size=32,
    num_splits=10,
):
    """Computes the inception score of `imgs`.
    
    Args:
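    - imgs: NumPy array of shape (number of data points, 1, 28, 28). Pixel
        values are expected in [0, 1], as produced by `ToTensor()`.
    - model_path: Path to the checkpoint of the pretrained MNIST model.
    - batch_size: Batch size for feeding data into the pretrained MNIST model.
    - num_splits: Number of splits. We split the samples into multiple subsets
        and calculate the scores on each of them. Their mean is used as the
        final score.

    Returns:
    - Mean and standard deviation of the per-split scores.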
    """
    # Verify that input arguments have the correct formats.
    assert isinstance(imgs, np.ndarray)
    assert imgs.shape[1:] == (1, 28, 28)
    assert batch_size > 0
    assert len(imgs) > batch_size
    
    # Choose device to be used.
    device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

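    # Preprocess input: normalize with the MNIST mean and standard deviation,
    # and cast to float32 to match the model weights.
    imgs = copy.copy(imgs)
    imgs = ((imgs - 0.1307) / 0.3081).astype(np.float32)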
    
    # Set up dataloader.
    dataloader = torch.utils.data.DataLoader(imgs, batch_size=batch_size)

    # Load pretrained scoring model.
    model = ScoringModel()
    model.load_state_dict(torch.load(model_path, map_location=device))
    model = model.to(device)
    model.eval()

    # Get predictions.
    preds = []
    for i, batch in enumerate(dataloader):
        batch = batch.to(device)
        with torch.no_grad():
            logits = model(batch)
            probs = F.softmax(logits, dim=1).cpu().numpy()
        preds.append(probs)
    preds = np.concatenate(preds)

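    # Compute the mean KL divergence on each split.
    split_scores = []
    n = len(imgs) // num_splits  # Number of samples per split.

    for i in range(num_splits):
        split = preds[i*n:(i+1)*n, :]
        py = np.mean(split, axis=0)  # Marginal label distribution p(y).
        scores = []
        for j in range(split.shape[0]):
            pyx = split[j, :]  # Conditional label distribution p(y|x).
            # With two arguments, scipy's entropy() returns KL(pyx || py).
            scores.append(entropy(pyx, py))
        split_scores.append(np.exp(np.mean(scores)))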
    
    return np.mean(split_scores), np.std(split_scores)

Now, let's calculate the Inception Score on the actual MNIST dataset.

Make sure that the provided file `mnist.ckpt` is under `./weights`. Alternatively, you can specify its path via the `model_path` argument of `inception_score_mnist()`. If using Google Colab, click `View > Table of Contents > Files` and then upload it.

In [3]:
transform = torchvision.transforms.ToTensor()
# Note: train=False loads the MNIST test split.
test_loader = torch.utils.data.DataLoader(
    torchvision.datasets.MNIST('./data', train=False, download=True, transform=transform),
    batch_size=500, shuffle=True)

# Score one random batch of 500 real images.
x, _ = next(iter(test_loader))
x = x.numpy()
x = x.reshape((-1, 1, 28, 28))
print('Shape of data:', x.shape)
mean, std = inception_score_mnist(x)
print(f'Inception Score: mean={mean:.3f}, std={std:.3f}')

Shape of data: (500, 1, 28, 28)
Inception Score: mean=8.933, std=0.498


The score for the real MNIST dataset should be above 8.


### Generating MNIST images (100 points)

As you did with the Gaussian distribution example in the weekly notebook, define and train a GAN to generate images that mimic those in the MNIST dataset.

#### Deliverables

- After training your model, generate at least 1500 samples using the trained generator, and evaluate your model by calculating the Inception score on the generated samples.
- Pick a few generated samples and visualize them.
- Plot the training losses for the discriminator and the generator.
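When scoring generated samples, note that `inception_score_mnist()` expects a NumPy array of shape `(N, 1, 28, 28)`, and its normalization constants suggest pixel values in `[0, 1]`, as produced by `ToTensor()`. Below is a minimal sketch of the conversion, assuming (hypothetically) that your generator ends in `tanh` and therefore outputs values in `[-1, 1]`. The helper name `to_scoring_array` is made up for illustration.

```python
import numpy as np
import torch


def to_scoring_array(samples: torch.Tensor) -> np.ndarray:
    """Map generator outputs in [-1, 1] to a float32 array in [0, 1].

    Assumes `samples` holds N images with 28 * 28 values each, in any
    shape that can be reshaped to (N, 1, 28, 28).
    """
    samples = samples.detach().cpu().reshape(-1, 1, 28, 28)
    samples = (samples + 1.0) / 2.0    # Rescale [-1, 1] -> [0, 1].
    samples = samples.clamp(0.0, 1.0)  # Guard against rounding drift.
    return samples.to(torch.float32).numpy()


# Stand-in for real generator output: random values squashed by tanh.
fake = torch.tanh(torch.randn(1500, 28 * 28))
imgs = to_scoring_array(fake)
print(imgs.shape, imgs.dtype)
```

The resulting `imgs` can then be passed to `inception_score_mnist(imgs)`. If your generator ends in a sigmoid instead, skip the rescaling step.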

Given the limited computational resources, you only need to achieve an Inception Score of 1.5 or greater for full credit. A score of 1.5 won't yield great images. For nice-looking images, you'll need an Inception Score of around 6.0, but this is not required for full credit.

#### Model Submission

If your model takes a long time to train (for example, because it uses a more complicated architecture), save the trained model and include a code snippet that loads it, so that the notebook runs without errors and we can grade it easily. In this case, set `epochs = 0` and include the saved model in your submission (or a Google Drive share link if it's too large).
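One common way to do this in PyTorch is to save the generator's state dict and load it back into a freshly constructed model. The following is a minimal sketch, not a required format: `make_generator` is a hypothetical stand-in for your own architecture, and the checkpoint path is a temporary placeholder.

```python
import os
import tempfile

import torch
import torch.nn as nn


def make_generator() -> nn.Module:
    # Hypothetical stand-in; replace with your own generator definition.
    return nn.Sequential(nn.Linear(64, 28 * 28), nn.Tanh())


generator = make_generator()
# ... training would happen here ...

# Save only the learned parameters (the state dict), not the whole object.
ckpt_path = os.path.join(tempfile.mkdtemp(), 'generator.ckpt')
torch.save(generator.state_dict(), ckpt_path)

# Later (e.g. with epochs = 0): rebuild the same architecture, load weights.
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
restored = make_generator()
restored.load_state_dict(torch.load(ckpt_path, map_location=device))
restored = restored.to(device).eval()
```

Passing `map_location` lets a checkpoint saved on a GPU load on a CPU-only machine, which helps when the grading environment differs from yours.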

#### Tips

- It will be easier to get better results with a convolutional GAN. You may find this [tutorial](https://pytorch.org/tutorials/beginner/dcgan_faces_tutorial.html) on [DCGAN](https://arxiv.org/pdf/1511.06434.pdf) helpful. The generators of DCGANs make use of transposed convolutions (`nn.ConvTranspose2d` in PyTorch) to map features to larger sizes. This [article](https://d2l.ai/chapter_computer-vision/transposed-conv.html) does a good job illustrating how they work.
- Feel free to try different architectures, layers, optimizers, training schemes and other hyperparameters. Particularly, if training with one type of optimizer is slow or unstable, give other types of optimizers a try.
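To illustrate how transposed convolutions enlarge feature maps, here is a small shape check. The channel counts and the 7 x 7 starting size are arbitrary choices for this sketch, not a recommended architecture.

```python
import torch
import torch.nn as nn

# Output size of a transposed convolution (dilation 1, no output padding):
#   out = (in - 1) * stride - 2 * padding + kernel_size
up = nn.Sequential(
    nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),  # 7 -> 14
    nn.ReLU(),
    nn.ConvTranspose2d(32, 1, kernel_size=4, stride=2, padding=1),   # 14 -> 28
    nn.Tanh(),
)

features = torch.randn(8, 64, 7, 7)  # Batch of 8 feature maps of size 7 x 7.
images = up(features)
print(images.shape)  # torch.Size([8, 1, 28, 28])
```

With `kernel_size=4`, `stride=2`, and `padding=1`, each layer exactly doubles the spatial size, so two layers take a 7 x 7 map to the 28 x 28 resolution of MNIST.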

There are plenty of online resources about GANs that you can reference for inspiration, but do not plagiarize. Please write your own custom networks.

In [1]:
# TODO