# DCGAN - Parzen Window-based Log-Likelihood Estimates

This notebook is a wrapper around the Parzen log-likelihood estimator described in the original GAN paper (Goodfellow et al., 2014) and implemented in
[`parzen_ll.py`](https://github.com/goodfeli/adversarial/blob/master/parzen_ll.py) from that paper's code release.

> We estimate probability of the test set data under $p_g$ by fitting a Gaussian Parzen window to the
> samples generated with G and reporting the log-likelihood under this distribution. The σ parameter of the Gaussians was obtained by cross validation on the validation set. This procedure was introduced in Breuleux et al. [8] and used for various generative models for which the exact likelihood
> is not tractable.

Slight modifications are made in the local file (`parzen_ll.py`) for the following:

- Migrate from Python2 -> Python3 syntax
- Add comments and docstrings for clarity

The goal of this project is not to develop a new framework for evaluating generative models; consequently, the log-likelihoods calculated here are meant only for internal comparison between models. As you'll notice, I do not use `MNIST`, `TFD`, or `CIFAR-10` as a validation set, but rather a sample of MSLS images held out from training.
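
For reference, the estimator itself can be sketched in plain NumPy. This is a minimal illustration, not the Theano code in `parzen_ll.py`: the function name `parzen_log_likelihood` and the toy 2-D data are assumptions made for the example. Each held-out point $x$ is scored by the log of the average Gaussian kernel density over the $N$ generated samples $\mu_i$, that is $\log p(x) = \log \frac{1}{N}\sum_i \exp\left(-\|x-\mu_i\|^2 / 2\sigma^2\right) - \frac{D}{2}\log(2\pi\sigma^2)$.

```python
import numpy as np


def parzen_log_likelihood(x, mu, sigma):
    """Per-row log-density of `x` under a Gaussian Parzen window centred on `mu`.

    x:     [M, D] points to score (e.g. flattened held-out images)
    mu:    [N, D] kernel centres (e.g. flattened generator samples)
    sigma: kernel bandwidth, shared by every dimension
    """
    d = mu.shape[1]
    # Squared Euclidean distance between every point and every centre: [M, N]
    sq_dist = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    a = -0.5 * sq_dist / sigma**2
    # Numerically stable log-mean-exp over the N centres
    a_max = a.max(axis=1, keepdims=True)
    log_mean_exp = a_max[:, 0] + np.log(np.exp(a - a_max).mean(axis=1))
    # Gaussian normalising constant in D dimensions
    log_z = 0.5 * d * np.log(2.0 * np.pi * sigma**2)
    return log_mean_exp - log_z


# Toy data standing in for generated samples and held-out real data
rng = np.random.default_rng(0)
centres = rng.normal(size=(200, 2))
held_out = rng.normal(size=(50, 2))

ll = parzen_log_likelihood(held_out, centres, sigma=0.5)
print(f"mean LL = {ll.mean():.3f}, se = {ll.std() / np.sqrt(len(ll)):.3f}")
```

This sketch materialises the full `[M, N, D]` difference tensor, so it only suits small inputs; for image-sized data the points to score need to be processed in batches, which is presumably why `batch_size` is passed to `plle.get_nll` later in this notebook.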

--------------------

In [None]:
# These are NOT on all `conda_amazonei_pytorch_latest_p3X` or `conda_pytorch_p3X` builds
! pip3 install \
    tensorboard \
    theano

In [None]:
# General
import numpy as np
import theano
import re
import datetime
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import ImageGrid

# Torch Deps
import torch
import torchvision.datasets as dset
import torchvision.transforms as transforms
from torch.utils.tensorboard import SummaryWriter

# DCGAN
import gaudi_dcgan as dcgan
import parzen_ll as plle
import msls_dcgan_utils as utils

In [None]:
class LimitDataset(torch.utils.data.Dataset):
    """
    Simple wrapper around torch.utils.data.Dataset that limits the number of
    data points passed to a DataLoader; used to cap how much data is loaded
    into memory in a single batch.
    """

    def __init__(self, dataset, n):
        self.dataset = dataset
        self.n = n

    def __len__(self):
        """Clobber the old Length"""
        return self.n

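    def __getitem__(self, i):
        # Enforce the limit so plain iteration also stops after `n` items
        if not 0 <= i < self.n:
            raise IndexError(i)
        return self.dataset[i]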

In [None]:
# Inputs 

# NOTE: 
#    - The directory (`CV_DATAROOT_00X`) is assumed to be populated with a holdout of images from MSLS. 
#    - `DATASET_SIGMA` can be set below to skip the sigma estimation/cross-validation step...
CV_DATAROOT_001 = "/data/cross_validation_images/set001/" 
CV_DATAROOT_002 = "/data/cross_validation_images/set002/" 
IMG_SIZE = 64
BATCH_SIZE = 128

# If you want to skip sigma validation, use the following as a *ROUGH* value; else
# set `DATASET_SIGMA = None`
DATASET_SIGMA = np.logspace(-1., 0, 10)[5]

VALIDATION_SAMPLE_SIZE = 5000

# Set Estimation Epoch && Number of Samples to Generate
ESTIMATION_EPOCH = 16
N_SAMPLES = 1000

# See `Data and Transformations` section for details.
TORCH_DL_COMPOSED_TRANSFORMS = transforms.Compose([
    transforms.RandomAffine(degrees=0, translate=(0.3, 0.0)),
    transforms.CenterCrop(IMG_SIZE * 4),
    transforms.Resize(IMG_SIZE),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])


In [None]:
# ImageFolder reads from the directory of images and applies the transformations
# at load time to generate image tensors from the source images.
# NOTE: `torchvision.datasets` is imported above as `dset`.
dataset = dset.ImageFolder(
    root=CV_DATAROOT_001,
    transform=TORCH_DL_COMPOSED_TRANSFORMS
)

# Use LimitDataset wrapper to ensure we're able to transfer into memory safely
limited_msls_data = utils.LimitDataset(
    dataset, min(VALIDATION_SAMPLE_SIZE, len(dataset))
)

# Create the dataloader
#
# WARNING: By setting `batch_size=len(limited_msls_data)`, we're forcing the loader to read all data in a single
# iteration. This is "safe" ONLY because we've used `utils.LimitDataset` above to cap the number of
# images (and therefore the memory) that operation will use!
msls_real_dataloader = torch.utils.data.DataLoader(
    limited_msls_data,
    shuffle=False,
    num_workers=8,
    pin_memory=True,
    batch_size=len(limited_msls_data),  # Force Single Batch
)

# The implementation of Parzen LL expects data in a particular shape; convert from
# [N x 3 x 64 x 64] => [N x (3 x 64 x 64)]
msls_real_data = next(iter(msls_real_dataloader))[0].numpy()
print(f"Shape after Fetch From Loader: {msls_real_data.shape}")

msls_real_data = msls_real_data.reshape(
    (msls_real_data.shape[0], np.prod(msls_real_data.shape[1:]))
)

print(f"Shape after Reshape: {msls_real_data.shape}")

## Initialize Model and Training Configs 

Model and training configs are required to specify which model to load and to generate samples from `G` as of a specific epoch.

-------

In [None]:
# Initialize Model and Training Configs w. default args. Not strictly required, just for clarity.
model_cfg = dcgan.ModelCheckpointConfig(
    model_name="msls_dcgan_ml_p3_8xlarge_001",  # Custom Model Name To Identify Gaudi vs GPU Trained!
    model_dir="/efs/trained_model",
    save_frequency=1,
    log_frequency=50,
    gen_progress_frequency=250,
)

train_cfg = dcgan.TrainingConfig(
    batch_size=128,
    img_size=64,
    nc=3,
    nz=100,
    ngf=64,
    ndf=64,
    lr=0.0002,
    beta1=0.5,
    beta2=0.999,
)

## Calculating Log-Likelihood For a Single Epoch

In the section below we generate samples from `G` as of a specific epoch, `ESTIMATION_EPOCH`. We then use these samples to select `sigma` from a set of candidate values by cross-validation (skipped when `DATASET_SIGMA` is set), and then estimate the log-likelihood of the held-out set.
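
The sigma-selection step can be sketched as a simple grid search: fit the Parzen window on the generated samples with each candidate bandwidth, score the held-out data, and keep the candidate with the highest mean log-likelihood. The helper names `mean_parzen_ll` and `pick_sigma` and the toy 2-D data are assumptions made for this illustration; the notebook itself relies on `plle.cross_validate_sigma`.

```python
import numpy as np


def mean_parzen_ll(x, mu, sigma):
    """Mean Gaussian Parzen log-likelihood of the rows of `x` given centres `mu`."""
    d = mu.shape[1]
    a = -0.5 * ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2) / sigma**2
    a_max = a.max(axis=1, keepdims=True)
    log_mean_exp = a_max[:, 0] + np.log(np.exp(a - a_max).mean(axis=1))
    return float((log_mean_exp - 0.5 * d * np.log(2.0 * np.pi * sigma**2)).mean())


def pick_sigma(samples, validation, candidates):
    """Return the candidate bandwidth with the highest mean validation LL."""
    scores = [mean_parzen_ll(validation, samples, s) for s in candidates]
    return candidates[int(np.argmax(scores))], scores


rng = np.random.default_rng(0)
samples = rng.normal(size=(300, 2))          # toy stand-in for generated data
validation = rng.normal(size=(100, 2))       # toy stand-in for held-out data
candidates = np.logspace(-1.0, 0.0, num=10)  # same grid as used in this notebook

best_sigma, scores = pick_sigma(samples, validation, candidates)
print(f"best sigma = {best_sigma:.3f}")
```

A bandwidth that is too small makes the density spiky around individual samples, while one that is too large blurs it out; scoring on held-out data is what balances the two.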

--------------

In [None]:
# Generate N samples
generated_data = dcgan.generate_fake_samples(
    n_samples=N_SAMPLES,
    train_cfg=train_cfg,
    model_cfg=model_cfg,
    as_of_epoch=ESTIMATION_EPOCH,
).numpy()

print(f"Shape after Generation: {generated_data.shape}")

# The implementation of Parzen LL expects data in a particular shape; convert from
# [N x 3 x 64 x 64] => [N x (3 x 64 x 64)]

generated_data = generated_data.reshape(
    (generated_data.shape[0], np.prod(generated_data.shape[1:]))
)

print(f"Shape after Reshape: {generated_data.shape}")

# Skip Sigma Calcs if Sigma is already set...
if DATASET_SIGMA is not None:
    print(f"Skipping Sigma Estimation, Using: {DATASET_SIGMA}")
    sigma = DATASET_SIGMA
else:
    # Estimate Sigma on G(Z) and MSLS Data...
    sigma = plle.cross_validate_sigma(
        generated_data,
        msls_real_data,
        np.logspace(-1.0, 0, num=10),  # Default Sigma-space from DCGAN
        BATCH_SIZE,  # Default Batch Size
    )

# Fit Parzen Estimator && Calculate LL
parzen = plle.theano_parzen(generated_data, sigma)

ll = plle.get_nll(msls_real_data, parzen, batch_size=BATCH_SIZE)

se = ll.std() / np.sqrt(msls_real_data.shape[0])

print(f"Log-Likelihood of Test Set = {ll.mean()}, se: {se}")

## Calculate Log-Likelihood For Multiple Epochs

This follows the same procedure as above, but calculates the LL over multiple epochs by loading `G` from each checkpoint and generating samples from it.
It uses a fixed `sigma`. This may not be the optimal method to assess the quality of a GAN's output, but it does demonstrate progress over time.

----------

In [None]:
# Fit Parzen Estimator && Calculate LL
EPOCH_FREQUENCY = 1
# Raw string avoids the invalid-escape warning that "\." triggers in Python 3
PLOT_DTTM = re.sub(r"[:\- .]", "_", str(datetime.datetime.utcnow()))

log_likelihoods = []
std_errs = []

# Create new TensorBoard Writer...
writer = SummaryWriter(f"{model_cfg.model_dir}/{model_cfg.model_name}/events")

for cur_epoch in range(0, ESTIMATION_EPOCH, EPOCH_FREQUENCY):

    # Generate Data as of Epoch
    generated_data = dcgan.generate_fake_samples(
        n_samples=N_SAMPLES,
        train_cfg=train_cfg,
        model_cfg=model_cfg,
        as_of_epoch=cur_epoch,
    ).numpy()

    generated_data = generated_data.reshape(
        (generated_data.shape[0], np.prod(generated_data.shape[1:]))
    )

    parzen = plle.theano_parzen(generated_data, sigma)

    # Estimate Log-Likelihood
    ll = plle.get_nll(msls_real_data, parzen, batch_size=BATCH_SIZE)

    se = ll.std() / np.sqrt(msls_real_data.shape[0])
    log_likelihoods.append(ll.mean())
    std_errs.append(se)

    # Write to TensorBoard...
    writer.add_scalar(
        "parzen_estimated_LL",
        ll.mean(),
        cur_epoch,
    )
    writer.flush()

    print(f"Epoch {cur_epoch}: Log-Likelihood of Test Set = {ll.mean()}, se: {se}")


# Plot LL over Range && Save Figure to Disk

plt.figure(figsize=(12, 6))
plt.title(f"Log-Likelihood over Training Epochs - {model_cfg.model_name}")

plt.plot(range(0, ESTIMATION_EPOCH, EPOCH_FREQUENCY), log_likelihoods)

plt.errorbar(
    range(0, ESTIMATION_EPOCH, EPOCH_FREQUENCY), log_likelihoods, yerr=std_errs, fmt="o"
)

plt.xlabel("Epoch")
plt.ylabel("Log-Likelihood")

# Save before `plt.show()`; showing first can leave an empty figure to save
plt.savefig(
    f"{model_cfg.model_dir}/{model_cfg.model_name}/figures/log_likelihood_{PLOT_DTTM}.png"
)

plt.show()

writer.close()

## Calculating Log Likelihood on Samples from the Test Data

> We estimate probability of the test set data under $p_g$ by fitting a Gaussian Parzen window to the samples generated with G and reporting the log-likelihood under this distribution.

To put these values into context, I'm going to run the procedure against a sample of real images...
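
The idea behind this baseline can be illustrated on toy data: when the kernel centres come from the same distribution as the points being scored, the log-likelihood should be noticeably higher than when they come from a different one. The sketch below is only an illustration under that assumption; `mean_parzen_ll`, the 2-D Gaussians, and the fixed `sigma` of 0.5 are all hypothetical stand-ins, not the notebook's data or code.

```python
import numpy as np


def mean_parzen_ll(x, mu, sigma):
    """Mean Gaussian Parzen log-likelihood of the rows of `x` given centres `mu`."""
    d = mu.shape[1]
    a = -0.5 * ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2) / sigma**2
    a_max = a.max(axis=1, keepdims=True)
    log_mean_exp = a_max[:, 0] + np.log(np.exp(a - a_max).mean(axis=1))
    return float((log_mean_exp - 0.5 * d * np.log(2.0 * np.pi * sigma**2)).mean())


rng = np.random.default_rng(0)
real_a = rng.normal(loc=0.0, size=(400, 2))  # toy "real" batch used as kernel centres
real_b = rng.normal(loc=0.0, size=(200, 2))  # second toy "real" batch to score
fake = rng.normal(loc=3.0, size=(400, 2))    # toy "generated" batch, shifted away

sigma = 0.5
ll_real_vs_real = mean_parzen_ll(real_b, real_a, sigma)
ll_real_vs_fake = mean_parzen_ll(real_b, fake, sigma)
print(f"real vs real: {ll_real_vs_real:.2f}, real vs fake: {ll_real_vs_fake:.2f}")
```

The real-vs-real figure acts as a rough reference point: a generator's score can be read against it rather than as an absolute number.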

------

In [None]:
# Repeat the data-loading steps used above for a second held-out set

# Create Secondary Dataset of Real Images; Treat these as if they were "fake"
cv_dataset_002 = dset.ImageFolder(  # `torchvision.datasets` is imported as `dset`
    root=CV_DATAROOT_002,
    transform=TORCH_DL_COMPOSED_TRANSFORMS
)

# Use LimitDataset wrapper to ensure we're able to transfer into memory safely
# (wrap the second dataset, not the first one loaded earlier)
limited_msls_data_002 = utils.LimitDataset(
    cv_dataset_002, min(VALIDATION_SAMPLE_SIZE, len(cv_dataset_002))
)

# Create the Dataloader
msls_real_data_batch_002 = torch.utils.data.DataLoader(
    limited_msls_data_002,
    shuffle=False,
    num_workers=8,
    pin_memory=True,
    batch_size=len(limited_msls_data_002),  # Force Single Batch
)

# Convert from [N x 3 x 64 x 64] => [N x (3 x 64 x 64)]
msls_real_data_batch_002 = next(iter(msls_real_data_batch_002))[0].numpy()

msls_real_data_batch_002 = msls_real_data_batch_002.reshape(
    (msls_real_data_batch_002.shape[0], np.prod(msls_real_data_batch_002.shape[1:]))
)

# Fit Parzen Estimator && Calculate LL on Two Batches of Real Data
parzen = plle.theano_parzen(msls_real_data_batch_002, sigma)

# `msls_real_data` is the first held-out batch loaded earlier in the notebook
ll = plle.get_nll(msls_real_data, parzen, batch_size=BATCH_SIZE)

se = ll.std() / np.sqrt(msls_real_data.shape[0])

print(f"Log-Likelihood of Test Set = {ll.mean()}, se: {se}")