# HW4: RNNs and GANs


Designed by Anil Kag drawing from prior work by Samarth Mishra, Kun He, Xide Xia, Kubra Cilingir, Vijay Thakkar, Ali Siahkamari, and Brian Kulis.


In this assignment you will:

1. Understand state transitions in vanilla RNNs by solving a simple sequential addition task. [20 points]

2. Build a sentiment classifier using LSTMs. This task is based on the popular IMDB review dataset. [40 points]

3. Learn the building blocks of generative adversarial learning. [40 points]

4. Use these blocks to change the hairstyles of celebrities in the CelebA dataset. [60 points]

**NOTE**: Problems 3 and 4 are time-consuming, not only to code but also to train: reaching reasonable GAN performance takes effort (expect training to take roughly 10 hours or more, depending on the complexity you choose), so **PLEASE START EARLY.**

## Preamble

We recommend using a GPU as the primary compute for these problems. The first two tasks should not need much GPU time (efficient code should finish within 20 minutes on a GPU such as a K80). The GAN problems will run longer (Problem 4 in particular should take more than 6 hours to produce a reasonable generator).

## Environment setup

We provide you the datasets required for this assignment as shared folders in both SCC (`'/projectnb/ec523/anilkag/datasets/'`) and [Google Drive](https://drive.google.com/drive/folders/1EFJKm8mZk6GPrN2BRuadL_HRg0STb7V0?usp=sharing).


Below we describe the environment setup steps for each option, depending on your preference. The [SCC](https://www.bu.edu/tech/support/research/computing-resources/scc/) gives you access to GPU compute; the steps below set up a tunnel so that you can open a Jupyter notebook on your own machine.

If you intend to use Google Colab, feel free to skip the SCC steps. Remember that Google Colab only provides preemptible access to a GPU, so you could lose the GPU in the middle of your experiments. At best, the free quota allows GPU access for up to 12 hours. Also, since it's a shared resource, be respectful of the Google Colab usage policy.

## SCC Configuration

For this assignment, we recommend that you use the shared computing cluster [(SCC)](https://www.bu.edu/tech/support/research/computing-resources/scc/). Each of you has an account on the SCC and can log in using the command below (more info in the [quick start guide](https://www.bu.edu/tech/support/research/system-usage/scc-quickstart/)):
```
ssh <bu_loginname>@scc1.bu.edu
```

[comment]: #   ( Copy `HW4_datasets.zip` to the SCC, extract and change working directory to the extracted folder. )

Here, we provide some instructions to start a Jupyter Notebook or JupyterLab server on the SCC. More detailed instructions can be found on the SCC's info website (linked above) and are also fairly easy to find with a Google search.
After logging in, request an interactive session on a compute node using:
```
qrsh -pe omp <num_cpu_cores> -l gpus=<num_gpus_per_cpu_core> -l gpu_c=<gpu_compute_capacity>
```
Recommended values for the parameters in <> above are 2, 0.5, and 3.5 respectively, which will assign you 2 CPU cores and 1 GPU of compute capability at least 3.5 for an interactive session of 12 hours. We also have access to some GPUs with compute capability 6.0 (though these may be limited in number); if you want to explore them, feel free to request with `gpu_c=6.0`.

Now that you are on a compute node, you can load modules you need as:
```
module load cuda/10.1
module load python3/3.6.9
module load pytorch/1.3
```
Running a jupyter server (you can replace the following with `jupyter notebook` for the old interface):
```
jupyter lab --no-browser --ip=0.0.0.0 --port=8888
```
To access the jupyter interface on a browser on your personal machine, use ssh port forwarding as,
```
ssh -N -f -L 8889:<scc-compute-node>:8888 <bu_loginname>@scc1.bu.edu
```
where `<scc-compute-node>` is the compute node on the SCC that your interactive session is running on (e.g. `scc-k02`). Open a browser and the interface should be accessible at `localhost:8889`. The ports `8888` and `8889` are only examples; if either is already in use, substitute any unused port number in the commands above.

### Assignment Shared Data path

If you are using your local machine or the SCC, please update the Python variables `EXPERIMENTS_DIRECTORY` and `DATA_DIRECTORY`, prepopulated below, to point to the correct folders for this assignment.

Note that 
1. `EXPERIMENTS_DIRECTORY`: refers to the folder where your experiment results, models, logs, and generated samples will be stored. Please update the path to a folder where you have write access. For the SCC, it is recommended that you create a folder `'/projectnb/ec523/<USER-NAME>/experiments'`, replacing `<USER-NAME>` with your SCC username. For Google Colab, use any folder where you have write access.

2. `DATA_DIRECTORY`: refers to the folder where the datasets required for this assignment are stored. For the SCC, I've already copied the datasets to a folder (`'/projectnb/ec523/anilkag/datasets/'`) where everyone has read access. For Google Colab, I've shared the dataset at `https://drive.google.com/drive/folders/1EFJKm8mZk6GPrN2BRuadL_HRg0STb7V0?usp=sharing`; please update this variable according to the path that your BU Google Drive account shows.
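
Before running anything expensive, it can help to confirm that both folders are usable. The sketch below is a hypothetical helper (`check_paths` is not part of the provided notebook); it only assumes that the experiments folder should be writable and the data folder readable.

```python
import os
import tempfile

def check_paths(experiments_dir, data_dir):
    """Report whether the two assignment folders are usable.

    Hypothetical helper, not part of the provided notebook.
    """
    # Create the experiments folder if it does not exist yet.
    os.makedirs(experiments_dir, exist_ok=True)
    return {
        "experiments_writable": os.access(experiments_dir, os.W_OK),
        "data_readable": os.path.isdir(data_dir) and os.access(data_dir, os.R_OK),
    }

# Demonstration on throwaway folders; substitute your own paths in the notebook.
with tempfile.TemporaryDirectory() as tmp:
    status = check_paths(os.path.join(tmp, "experiments"), tmp)
    print(status)
```

In the notebook, the same call could be made with the two path variables once they are set.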

## Google Colab Configuration

No special setup is required (just ensure that the runtime is set up to use a GPU).

### Assignment Shared Data path

If you are using Google Colab for the experiments, it is easy to link your BU Google Drive account for storage.

1. The first step is to give Google Colab access permissions. The following commands perform the authentication: run these two lines of code, open the link they provide, and paste the authorization code into the input box shown by the script.
```
from google.colab import drive
drive.mount('/content/gdrive')
```

2. Finding the right folder path for experiments. Similar to the ```ls``` command in linux, the following command lists the content of the drive (use this to choose your base experiment directory)
```
ls "/content/gdrive/My Drive"
```

3. Please copy all the data provided for this assignment to the experiment directory, for example ```"/content/gdrive/My Drive/Experiments-Deep-Learning-HW4"``` (if you change the path, please update it in the script below).
```
ls "/content/gdrive/My Drive/Experiments-Deep-Learning-HW4"
```


In [None]:
# Change this part to locate the correct directory where dataset resides
use_colab = True

if use_colab:
    from google.colab import drive
    drive.mount('/content/gdrive')
    #!ls "/content/gdrive/My Drive/Datasets/shared/HW4_shared_files"

    ## Update the experiments directory
    EXPERIMENTS_DIRECTORY = '/content/gdrive/My Drive/Datasets/shared/experiments/'
    DATA_DIRECTORY = '/content/gdrive/My Drive/Datasets/shared/HW4_shared_files/'
    CELEBA_GOOGLE_DRIVE_PATH = DATA_DIRECTORY + 'celeba_attributes_images.hdf5'
    IMDB_REVIEWS_FILE_PATH = DATA_DIRECTORY + 'data/'
else:
    ## Update the experiments directory
    EXPERIMENTS_DIRECTORY = '/projectnb/ec523/anilkag/experiments'
    DATA_DIRECTORY = '/projectnb/ec523/anilkag/datasets/'
    CELEBA_GOOGLE_DRIVE_PATH = DATA_DIRECTORY + 'celeba_attributes_images.hdf5'
    IMDB_REVIEWS_FILE_PATH = DATA_DIRECTORY + 'data/'


### Modules needed

In [None]:
import os
import random
import math
import datetime
import time
import numpy as np
import h5py
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from IPython.display import HTML
from PIL import Image

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.nn.parallel
import torch.backends.cudnn as cudnn
import torch.optim as optim
import torch.utils.data
from torch.utils import data
import torchvision.datasets as dset
import torchvision.transforms as transforms
import torchvision.utils as vutils
from torchvision import transforms as T
from torchvision.datasets import ImageFolder
from torch.autograd import Variable
from torchvision.utils import save_image
from torch.backends import cudnn

%matplotlib inline

# Set random seed for reproducibility
manualSeed = 999
# manualSeed = random.randint(1, 10000) # use if you want new results
print("Random Seed: ", manualSeed)
random.seed(manualSeed)
np.random.seed(manualSeed)
torch.manual_seed(manualSeed)



## Q1. RNN Example [20 points]

In this example we train an RNN to solve a simple addition task presented sequentially. Each input consists of two rows. The first row contains random floats between 0 and 1; the second row is all zeros, except for two randomly chosen locations marked as 1. The output label is a float: the sum of the two numbers in the first row at the locations marked as 1 in the second row. The row length $T$ is the length of the input sequence.


<img src='https://minpy.readthedocs.io/en/latest/_images/adding_problem.png' style="width: 600px;"/>
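
As a concrete instance of the task, the sketch below builds one input by hand and computes its label. The numbers are made up for illustration; the actual data comes from the generator provided further down.

```python
import numpy as np

# A hand-written sequence of length T = 6 (values chosen for illustration).
values = np.array([0.3, 0.7, 0.1, 0.9, 0.5, 0.2])   # first row: numbers between 0 and 1
mask   = np.array([0.0, 1.0, 0.0, 0.0, 1.0, 0.0])   # second row: two marked positions

# The model sees one (value, mask) pair per timestep: shape (T, 2).
x = np.stack([values, mask], axis=1)

# The target is the sum of the values at the marked positions: 0.7 + 0.5.
y = float(np.sum(values * mask))
print(x.shape, y)
```

Because the marked values can be far apart in the sequence, the network has to carry the first one in its hidden state until it reaches the second.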

We will first implement a vanilla RNN that stores the hidden state $h_t$ as a summary of the sequential input seen up to timestep $t$.

1. We assume that $\{ x_t \}^T_{t=1}$ is the sequential input data to an RNN.

2. A vanilla RNN maintains $h_t = \phi( U h_{t-1} + W x_{t} + b )$, where $h_{t-1}$ is the hidden state up to timestep $t-1$, $x_t$ is the current input, and $\phi$ is a nonlinearity such as $\tanh$ or ReLU. 

3. The parameters $\{ U, W, b \}$ are learnt by training the RNN on the given training data through back-propagation.
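
For orientation on shapes, PyTorch's built-in `nn.RNNCell` implements the same recurrence (it keeps two bias vectors, `bias_ih` and `bias_hh`, where the formula above has a single $b$). The sketch below only inspects that reference cell. The sizes 2 and 128 match the values used later in Q1.2; the batch size of 4 is an arbitrary choice.

```python
import torch
import torch.nn as nn

input_size, hidden_size, batch = 2, 128, 4   # batch size is arbitrary here

cell = nn.RNNCell(input_size, hidden_size, nonlinearity='tanh')

# weight_hh plays the role of U, weight_ih the role of W.
print(cell.weight_hh.shape)   # (hidden_size, hidden_size)
print(cell.weight_ih.shape)   # (hidden_size, input_size)

x_t = torch.rand(batch, input_size)          # one timestep for a batch of sequences
h_prev = torch.zeros(batch, hidden_size)     # initial hidden state
h_t = cell(x_t, h_prev)                      # next hidden state, same shape as h_prev
print(h_t.shape)
```

Comparing your own cell against this one on the same task is one way to check that the training curve looks plausible.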


To start with, the following code segment generates the data. 

In [None]:
def adding_problem_generator(N, seq_len=6, high=1, number_of_ones=2): 
    X_num = np.random.uniform(low=0, high=high, size=(N, seq_len, 1))
    X_mask = np.zeros((N, seq_len, 1))
    Y = np.ones((N, 1))
    for i in range(N):
        # Default uniform distribution on position sampling
        positions1 = np.random.choice(np.arange(math.floor(seq_len/2)), size=math.floor(number_of_ones/2), replace=False)
        positions2 = np.random.choice(np.arange(math.ceil(seq_len/2), seq_len), size=math.ceil(number_of_ones/2), replace=False)

        positions = []
        positions.extend(list(positions1))
        positions.extend(list(positions2))
        positions = np.array(positions)

        X_mask[i, positions] = 1        
        Y[i, 0] = np.sum(X_num[i, positions])
    X = np.append(X_num, X_mask, axis=2)
    return torch.FloatTensor(X), torch.FloatTensor(Y)

In [None]:
#tmpX, tmpY = adding_problem_generator(1, seq_len=6, high=1, number_of_ones=2)
#print('First data point : ')
#print(tmpX[0])
#print(tmpY[0]) 
print('Uncomment the above function calls to see the input data point and the output ')



### Q1.1 (10 points)
Write the transition function that takes the previous hidden state $h_{t-1}$ and current observation $x_t$ as input and outputs the next hidden state $h_t$ using the parameter matrices. (HINT: initialize the bias to be close to 0.)


In [None]:
class VanillaRNNCell( nn.Module ):
  def __init__(self, input_size, hidden_size, nonlinearity='tanh'):
    super(VanillaRNNCell, self).__init__()
    self.input_size = input_size
    self.hidden_size = hidden_size
    self.nonlinearity = nonlinearity

    ################################################
    ##### TODO CODE HERE
    #####     Declare your parameters here (U, W, b)
    #####     Use nn.Parameter and shapes given above for declaration
    raise NotImplementedError
    ################################################
  
    ################################################
    ##### TODO CODE HERE
    #####     Initialize your parameters (if not then it'll use default initialization)
    #####     Use nn.init functions 
    raise NotImplementedError
    ################################################

  def forward(self, x, h):
    if h is None:
      h = x.new_zeros(x.shape[0], self.hidden_size)

    ## 
    ## [CODE] Write the transition step and return new hidden state
    ##        h_t = nonlinearity( U * h_{t-1} + W * x_t + b )
    ##  
    ##        Use torch.relu or torch.tanh to handle the two cases
    if self.nonlinearity == 'relu':
      raise NotImplementedError
    elif self.nonlinearity == 'tanh':
      raise NotImplementedError
    else:
      raise RuntimeError("Unknown nonlinearity: {}".format(self.nonlinearity))

    return h


Below is the boilerplate code, which takes an RNN implementation as input and trains a simple network to solve the addition task.

In [None]:
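# Use the GPU when one is available (CUDA is referenced by the training code below)
CUDA = torch.cuda.is_available()

class Model(nn.Module):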
    def __init__(self, rec_net):
        super(Model, self).__init__()
        self.rnn = rec_net
        self.lin = nn.Linear(hidden_size, 1) 
        self.loss_func = nn.MSELoss()

        nn.init.xavier_normal_(self.lin.weight)

    def forward(self, x, y):        
        loss = 0
        hidden = None

        for i in range(len(x)):
            hidden = self.rnn.forward(x[i], hidden)

        out = self.lin(hidden)           
        loss += self.loss_func(out, y.squeeze(1).t())
        return loss

def train_model(net, optimizer, batch_size, n_steps, c_length=20):
    accs = []
    losses = []
    rec_nets = []
    first_hid_grads = []
    
    for i in range(n_steps):        
        s_t = time.time()
        x,y = adding_problem_generator(batch_size, seq_len=c_length, number_of_ones=2)        
        if CUDA:
            x = x.cuda()
            y = y.cuda()
        x = x.transpose(0, 1)
        y = y.transpose(0, 1)
        
        optimizer.zero_grad()
        
        loss = net.forward(x, y)
        loss_act = loss
        
        loss.backward()        
        losses.append(loss_act.item())

        optimizer.step()

        if i%200 == 0:
          print('Update {}, Time for Update: {} , Average Loss: {}'
              .format(i +1, time.time()- s_t, loss_act.item() ))
    
    print("Average loss: ", np.mean(np.array(losses)))
    return losses

def run_addition_task( rnn, rnn_cell_name ):
  print('\n\nWill solve using RNN=', rnn_cell_name)
  net = Model(rnn)
  if CUDA:
    net = net.cuda()
    net.rnn = net.rnn.cuda()

  optimizer = optim.Adam(net.parameters(), lr=0.001)
  losses = train_model(net, optimizer, batch_size, n_steps = 1000)
  return losses

### Q1.2 (10 points)
Use the `VanillaRNNCell` written above and run the training code to produce a convergence plot showing progress on this learning task. (As a reference, consider the problem solved when the loss drops below 0.001.)
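
Per-step losses are noisy, so one way to apply the 0.001 threshold is to smooth the curve first. The sketch below is a hypothetical helper (`first_solved_step` is not part of the assignment); the window of 20 steps is an arbitrary smoothing choice, and the loss curve is synthetic.

```python
import numpy as np

def first_solved_step(losses, threshold=1e-3, window=20):
    """Return the first step (1-indexed) at which the moving-average loss
    over `window` steps drops below `threshold`, or None if it never does.

    Hypothetical helper; the window size is an arbitrary smoothing choice.
    """
    losses = np.asarray(losses, dtype=float)
    if len(losses) < window:
        return None
    kernel = np.ones(window) / window
    smoothed = np.convolve(losses, kernel, mode='valid')
    below = np.flatnonzero(smoothed < threshold)
    if below.size == 0:
        return None
    # smoothed[k] averages losses[k : k + window]; report the last step in that window.
    return int(below[0]) + window

# Synthetic, exponentially decaying loss curve used only to exercise the helper.
fake_losses = 0.5 * np.exp(-np.arange(1000) / 50.0)
solved_at = first_solved_step(fake_losses)
print(solved_at)
```

In the notebook, the same helper could be applied to each list of losses returned by `run_addition_task`.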


In [None]:
input_size = 2
hidden_size = 128
batch_size = 100

from torch.nn import RNNCell, GRUCell 

################################################
##### TODO CODE HERE
#####     Invoke RNNCells written above with both tanh and relu nonlinearities
#####     Also, invoke a GRUCell in order to compare the performance 
vanilla_rnn_tanh = raise NotImplementedError
vanilla_rnn_relu = raise NotImplementedError
gru = raise NotImplementedError
################################################


## This collects the losses and generates plot for visualization
data = {}
data['VanillaRNN(tanh)'] = run_addition_task(vanilla_rnn_tanh, 'VanillaRNN(tanh)' )
data['VanillaRNN(relu)'] = run_addition_task(vanilla_rnn_relu, 'VanillaRNN(relu)' )
data['GRU'] = run_addition_task( gru, 'GRU' )


Run the script below to generate the convergence plot (it shows progress on the learning task). Please leave the graph in the submission PDF so that we can evaluate the convergence (not doing so will result in a 5-point deduction).

In [None]:
legends = []
for label in data.keys():
    losses = data[label]
    plt.plot(1 + np.arange(len(losses)),  losses )
    legends.append(label)
    
plt.legend(legends, loc='upper right')
plt.title('Training Error ')
plt.xlabel('Training Steps')
plt.ylabel('Mean Squared Error')
plt.ylim(0.0, 0.5)

plt.show()
plt.close()

## Q2: Movie Review Sentiment Classification using an LSTM [40 points]

In this part you will train a sentiment classifier using LSTM that predicts whether a movie review is positive or negative.  

### IMDB Reviews

This dataset contains reviews taken from the IMDB website for many popular movies. It has been curated to avoid spurious characters and contains only English reviews. 

Below we provide an interface that loads the IMDB training reviews, preprocesses them to create a vocabulary, and splits them into train, validation, and test sets for the purposes of this experiment. Do not modify this `ReviewDataset` interface.

Note that we also provide one sample positive review and one sample negative review, which are used later to test whether your learned LSTM can detect the sentiment of each review. 
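
To make the preprocessing concrete, the sketch below repeats the same two ideas on a toy corpus: a frequency-ranked vocabulary and left-padding with zeros. The three "reviews" and the sequence length of 3 are made up for illustration; the real pipeline is the `ReviewDataset` class below.

```python
from collections import Counter
import numpy as np

# Toy corpus standing in for the IMDB reviews (illustrative text only).
reviews = ["good movie", "bad movie bad acting", "good good fun"]

# The most frequent word gets index 1; index 0 is reserved for padding.
counts = Counter(" ".join(reviews).split())
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word: i for i, word in enumerate(vocab, 1)}

tokenized = [[vocab_to_int[w] for w in r.split()] for r in reviews]

# Left-pad with zeros (or truncate) so every review has exactly seq_length tokens.
seq_length = 3
features = np.zeros((len(tokenized), seq_length), dtype=int)
for i, row in enumerate(tokenized):
    row = row[:seq_length]
    features[i, seq_length - len(row):] = row

print(vocab_to_int)
print(features)
```

Padding on the left keeps the actual words at the end of the sequence, so the final hidden state of a recurrent model is computed right after the last real token.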

In [None]:
import numpy as np
from string import punctuation
from collections import Counter

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

class ReviewDataset():
    def __init__(self, seq_length = 200):
        # Maximum sequence length for reviews
        self.seq_length = seq_length

        self.reviews_file = IMDB_REVIEWS_FILE_PATH + 'reviews.txt' 
        self.labels_file  = IMDB_REVIEWS_FILE_PATH + 'labels.txt'  

        # Read the reviews and labels 
        self.read_reviews_labels()

        # Create a vocabulary from the reviews
        self.create_vocabulary()

        # Convert the reviews into integer tokens using vocab
        self.tokenize_reviews()

        self.create_train_test_valid_split()

    def read_reviews_labels(self):
        print('Reading reviews and labels from the data folder.')

        # read data from text files
        with open(self.reviews_file, 'r') as f:
            self.reviews = f.read()

        with open(self.labels_file, 'r') as f:
            self.labels = f.read()

        # lowercase, standardize
        self.reviews = self.reviews.lower() 

        print('Will remove punctuation = ', punctuation)

        # get rid of punctuation
        all_text = ''.join([c for c in self.reviews if c not in punctuation])

        # split by new lines and spaces
        self.reviews_split = all_text.split('\n')

        # 1=positive, 0=negative label conversion
        self.labels_split = self.labels.split('\n')
        self.encoded_labels = np.array([1 if label == 'positive' else 0 for label in self.labels_split])

    def create_vocabulary(self):
        # Gather all text from the Corpus
        all_text = ' '.join(self.reviews_split)

        # create a list of words
        words = all_text.split()

        # Build a dictionary that maps words to integers
        counts = Counter(words)
        self.vocab = sorted(counts, key=counts.get, reverse=True)
        self.vocab_to_int = {word: ii for ii, word in enumerate(self.vocab,1)} 

        # stats about vocabulary
        print('Unique words: ', len((self.vocab_to_int)))  # should ~ 74000+

    def pad_features(self, reviews_ints, seq_length):
        ''' Return features of review_ints, where each review is padded with 0's 
            or truncated to the input seq_length.
        '''
        ## getting the correct rows x cols shape
        features = np.zeros((len(reviews_ints), seq_length), dtype=int)
        
        ## for each review, I grab that review
        for i, row in enumerate(reviews_ints):
          features[i, -len(row):] = np.array(row)[:seq_length]
        
        return features

    def tokenize_reviews(self):
        ## use the dict to tokenize each review in reviews_split
        ## store the tokenized reviews in reviews_ints
        self.reviews_ints = []
        for review in self.reviews_split:
            self.reviews_ints.append([self.vocab_to_int[word] for word in review.split()])

        ## get any indices of any reviews with length 0
        non_zero_idx = [ii for ii, review in enumerate(self.reviews_ints) if len(review) != 0]

        # remove 0-length review with their labels
        self.reviews_ints = [self.reviews_ints[ii] for ii in non_zero_idx]
        self.encoded_labels = np.array([self.encoded_labels[ii] for ii in non_zero_idx])
        self.reviews_split = [self.reviews_split[ii] for ii in non_zero_idx]
        self.labels_split = [self.labels_split[ii] for ii in non_zero_idx]

        print('Number of reviews after removing outliers: ', len(self.reviews_ints))

        self.features = self.pad_features(self.reviews_ints, self.seq_length)

        ## test statements - do not change - ##
        assert len(self.features)==len(self.reviews_ints), "Your features should have as many rows as reviews."
        assert len(self.features[0])==self.seq_length, "Each feature row should contain seq_length values."

        # print tokens in first review
        print('Raw review: \n', self.reviews_split[0])
        print('Tokenized review: \n', self.reviews_ints[0])
        print('Raw label: ', self.labels_split[0])
        print('Encoded label: ', self.encoded_labels[0])

    def tokenize_review(self, review):
        # lowercase
        review = review.lower() 

        # get rid of punctuation
        text = ''.join([c for c in review if c not in punctuation])
        
        # splitting by spaces
        words = text.split()
        
        # tokens
        tokens = []
        tokens.append([self.vocab_to_int[word] for word in words])
        #print(tokens)
        
        # sequence padding
        features = self.pad_features(tokens, self.seq_length)
        #print(features)

        # test conversion to tensor and pass it to model
        feature_tensor = torch.from_numpy(features)
        #print(feature_tensor.size())

        return feature_tensor

    def get_positive_and_negative_reviews(self):
        # negative test review
        review_neg = 'The worst movie I have seen; acting was terrible and I want my money back. This movie had bad acting and the dialogue was slow.'

        # positive test review
        review_pos = 'This movie had the best acting and the dialogue was so good. I loved it.'

        #features_pos = tokenize_review(review_pos)
        #features_neg = tokenize_review(review_neg)
        #return review_pos, review_neg, features_pos, features_neg
        return review_pos, review_neg

    def create_train_test_valid_split(self):
        split_frac = 0.8

        # split data into training, validation, and test data (features and labels, x and y)
        split_idx = int(len(self.features)*split_frac)
        self.train_x, remaining_x = self.features[:split_idx], self.features[split_idx:]
        self.train_y, remaining_y = self.encoded_labels[:split_idx], self.encoded_labels[split_idx:]

        test_idx = int(len(remaining_x)*0.5)
        self.val_x, self.test_x = remaining_x[:test_idx], remaining_x[test_idx:]
        self.val_y, self.test_y = remaining_y[:test_idx], remaining_y[test_idx:]

        # print out the shapes of your resultant feature data
        print("\t\t\tFeatures Shapes:")
        print("Train set: \t\t{}".format(self.train_x.shape),
              "\nValidation set: \t{}".format(self.val_x.shape),
              "\nTest set: \t\t{}".format(self.test_x.shape))

    def get_data_loaders(self, batch_size = 50):
        # create Tensor datasets
        train_data = TensorDataset(torch.from_numpy(self.train_x), torch.from_numpy(self.train_y))
        valid_data = TensorDataset(torch.from_numpy(self.val_x), torch.from_numpy(self.val_y))
        test_data = TensorDataset(torch.from_numpy(self.test_x), torch.from_numpy(self.test_y))

        # make sure to SHUFFLE your data
        train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
        valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)
        test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)

        return train_loader, valid_loader, test_loader

### Q2.1 Instantiate dataset and gather data loaders [5 points]

We need to get hold of the data loaders for train, valid, and test splits. The data handling utility is available in the `ReviewDataset` class. Please invoke the appropriate functions to get hold of the corresponding data loaders. 

After setting up the data loaders, please print one sample batch of  features and labels from the training set.

Please use the batch_size provided below.

In [None]:
batch_size = 50

################################################
##### TODO CODE HERE
#####   First instantiate a ReviewDataset class
#####   Then call the data loader function
#####   Then invoke an iterator on the train data loader
dset = raise NotImplementedError
train_loader, valid_loader, test_loader = raise NotImplementedError
sample_x, sample_y = raise NotImplementedError
################################################

print('Sample input size: ', sample_x.size()) # batch_size, seq_length
print('Sample input: \n', sample_x)
print()
print('Sample label size: ', sample_y.size()) # batch_size
print('Sample label: \n', sample_y)

In [None]:
# First checking if GPU is available
train_on_gpu=torch.cuda.is_available()

if(train_on_gpu):
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')

### Q2.2 LSTM implementation [20 points]

We will implement an [LSTM](https://en.wikipedia.org/wiki/Long_short-term_memory) in this problem. 

We have already provided functions for loading the training data. Please define your LSTM model in the class `LSTM`.

Feel free to change the parameters or code if needed (we recommend that you do not change the signatures of the methods). 
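
For reference, one common formulation of the LSTM update is sketched below. The notation is an assumption chosen to mirror Q1 ($W$ acts on the input, $U$ on the previous hidden state); $\sigma$ is the logistic sigmoid and $\odot$ is elementwise multiplication. Other references name or arrange the gates differently.

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

Here $i_t$, $f_t$, and $o_t$ are the input, forget, and output gates, $g_t$ is the candidate cell update, and $(h_t, c_t)$ is the state pair carried to the next timestep.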

In [None]:
class LSTM(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(LSTM, self).__init__()
        
        ################################################
        ##### TODO CODE HERE
        ##### Declare your parameters for various gates
        raise NotImplementedError
        ################################################

        ################################################
        ##### TODO CODE HERE
        ##### Initialize all the parameters with nn.init.kaiming_normal_
        ##### NOTE : initialize biases as 0.0
        for param in self.parameters():
            raise NotImplementedError
        ################################################
                
    def forward(self, inputs, state=None):
        # Argument: 
        #     input : (batch_size, seq_length, input_size)
        # Returns:
        #     output : (batch_size, seq_length, hidden_size)
        #     hidden : ( (batch_size, hidden_size), (batch_size, hidden_size) )
        
        if state is None:
            state = ( inputs.new_zeros(inputs.shape[0], self.hidden_size), 
                      inputs.new_zeros(inputs.shape[0], self.hidden_size) )

        h, c = state
        outputs = []
        for input in torch.unbind(inputs, dim=1):

            ################################################
            ##### TODO CODE HERE
            ##### Use input, and previous (h,c) to update (h,c) for this timestep
            c = raise NotImplementedError
            h = raise NotImplementedError
            ################################################

            outputs.append( h )
            
        output = torch.stack( outputs, dim=1 )  # (batch_size, seq_length, hidden_size)
        return output, (h, c)

Below we define the `SentimentRNN` class which utilizes the `LSTM` defined above as well as introduces a word embedding and a logistic classifier. 

Please go through this module in order to understand the sentiment classifier and its building blocks.

Please do not modify this code.

In [None]:
class SentimentRNN(nn.Module):
    """
    The RNN model that will be used to perform "sentiment analysis".
    """

    def __init__(self, vocab_size, embedding_dim, hidden_dim, drop_prob=0.5):
        """
        Initialize the model by setting up the layers.
        """
        super(SentimentRNN, self).__init__()

        self.hidden_dim = hidden_dim
        
        # embedding and LSTM layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = LSTM(embedding_dim, hidden_dim)
        
        # dropout layer
        self.dropout = nn.Dropout(0.3)
        
        # linear and sigmoid layer
        self.fc = nn.Linear(hidden_dim, 1)
        self.sig = nn.Sigmoid()

    def forward(self, x, hidden):
        """
        Perform a forward pass of our model on some input and hidden state.
        """
        batch_size = x.size(0)
        
        # embeddings and lstm_out
        embeds = self.embedding(x)
        lstm_out, hidden = self.lstm(embeds, hidden)
        
        # use the final hidden state as the summary of the whole review
        lstm_out = hidden[0] 
        
        # dropout and fully connected layer
        out = self.dropout(lstm_out)
        out = self.fc(out)
        
        # sigmoid function
        sig_out = self.sig(out)
        
        # reshape to be batch_size first
        sig_out = sig_out.view(batch_size, -1)
        sig_out = sig_out[:, -1] # get last batch of labels
        
        # return last sigmoid output and hidden state
        return sig_out, hidden
    
    
    def init_hidden(self, batch_size, cuda=True):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data
        
        if(cuda):
          hidden = (weight.new(batch_size, self.hidden_dim).zero_().cuda(),
                   weight.new(batch_size, self.hidden_dim).zero_().cuda())
        else:
          hidden = (weight.new(batch_size, self.hidden_dim).zero_(),
                   weight.new(batch_size, self.hidden_dim).zero_())
        
        return hidden
        

### Q2.3 Set up optimization [10 points]

We instantiate the `SentimentRNN` module and ask you to specify the loss function as well as the optimizer for the learning task.
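
The training loop below also asks for gradient clipping. As a minimal sketch of what `nn.utils.clip_grad_norm_` does, the example here uses a toy linear model (a stand-in, not the `SentimentRNN`) with deliberately large inputs; the threshold of 5 matches the `clip` value used in the loop.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy = nn.Linear(10, 1)                        # stand-in model, not the SentimentRNN
x = 100.0 * torch.randn(32, 10)               # large inputs to provoke large gradients
loss = (toy(x) ** 2).mean()
loss.backward()

max_norm = 5.0
# clip_grad_norm_ rescales all gradients in place and returns the norm before clipping.
norm_before = nn.utils.clip_grad_norm_(toy.parameters(), max_norm)

# Recompute the total gradient norm after clipping.
norm_after = torch.sqrt(sum((p.grad ** 2).sum() for p in toy.parameters()))
print(float(norm_before), float(norm_after))
```

The call has to come after `backward()` and before the optimizer step, since it modifies the gradients that the optimizer will consume.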

In [None]:
# Instantiate the model w/ hyperparams
vocab_size = len(dset.vocab_to_int) + 1 # +1 for zero padding + our word tokens
embedding_dim = 400 
hidden_dim = 256

net = SentimentRNN(vocab_size, embedding_dim, hidden_dim)
print(net)

# loss and optimization functions
################################################
##### TODO CODE HERE
##### Note that we are dealing with binary classification (so choose the loss function appropriately)
##### Also, specify the optimizer (feel free to use any)
criterion = raise NotImplementedError
optimizer = raise NotImplementedError

# Set epochs to the epoch number at which the validation loss stops decreasing
epochs = 4 
################################################

# training params
counter = 0
print_every = 100
clip=5 # gradient clipping

# move model to GPU, if available
if(train_on_gpu):
    net.cuda()

net.train()
# train for some number of epochs
for e in range(epochs):
    # initialize hidden state
    h = net.init_hidden(batch_size, cuda=train_on_gpu)

    # batch loop
    for inputs, labels in train_loader:
        counter += 1

        if(train_on_gpu):
            inputs, labels = inputs.cuda(), labels.cuda()

        # Creating new variables for the hidden state, otherwise
        # we'd backprop through the entire training history
        h = tuple([each.data for each in h])

        # zero accumulated gradients
        net.zero_grad()

        # get the output from the model
        output, h = net(inputs, h)

        # calculate the loss and perform backprop
        ################################################
        ##### TODO CODE HERE
        #####     1) find out loss value (between the output and labels)
        #####     2) perform backpropagation
        #####     3) try using nn.utils.clip_grad_norm_ to clip gradients 
        #####         in order to avoid exploding gradients (a common issue with LSTMs)
        #####     4) take optimizer step 
        #####      
        raise NotImplementedError
        ################################################

        # loss stats
        if counter % print_every == 0:
            # Get validation loss
            val_h = net.init_hidden(batch_size)
            val_losses = []
            net.eval()
            for inputs, labels in valid_loader:

                # Creating new variables for the hidden state, otherwise
                # we'd backprop through the entire training history
                val_h = tuple([each.data for each in val_h])

                if(train_on_gpu):
                    inputs, labels = inputs.cuda(), labels.cuda()

                output, val_h = net(inputs, val_h)
                val_loss = criterion(output.squeeze(), labels.float())

                val_losses.append(val_loss.item())

            net.train()
            print("Epoch: {}/{}...".format(e+1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))

### Q2.4 Evaluation [5 points]

We will evaluate the performance of the trained sentiment classifier below.
There are two aspects to evaluation 

1. Accuracy on the test data
2. Qualitative results on the given positive and negative reviews.

If you have trained the network correctly, you should easily reach 80% accuracy on the test data, and the model should detect the correct sentiment for the provided reviews.

Please run the following two code blocks and leave the results as they are in the submission.
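
For intuition, here is a minimal plain-Python sketch of the accuracy computation: probabilities are thresholded at 0.5 to get a class, and the fraction of matches with the labels is reported. The helper name and the five toy predictions are hypothetical.

```python
def accuracy_from_probs(probs, labels, threshold=0.5):
    """Percentage of predictions that match the 0/1 labels."""
    preds = [1 if p > threshold else 0 for p in probs]
    num_correct = sum(int(p == y) for p, y in zip(preds, labels))
    return 100.0 * num_correct / len(labels)

# Hypothetical model outputs and ground-truth labels.
probs = [0.91, 0.12, 0.65, 0.40, 0.08]
labels = [1, 0, 0, 1, 0]
acc = accuracy_from_probs(probs, labels)
print("Test accuracy: {:.2f}%".format(acc))
```

Note that the evaluation cell divides by `len(test_loader.dataset)`. If your loader skips some samples (for example by dropping an incomplete last batch), the reported accuracy would be slightly lower than the true one.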

In [None]:
# Get test data loss and accuracy

test_losses = [] # track loss
num_correct = 0

# init hidden state
h = net.init_hidden(batch_size)

net.eval()
# iterate over test data
for inputs, labels in test_loader:

    # Creating new variables for the hidden state, otherwise
    # we'd backprop through the entire training history
    h = tuple([each.data for each in h])

    if(train_on_gpu):
        inputs, labels = inputs.cuda(), labels.cuda()
    
    # get predicted outputs
    output, h = net(inputs, h)
    
    # calculate loss
    test_loss = criterion(output.squeeze(), labels.float())
    test_losses.append(test_loss.item())
    
    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze())  # rounds to the nearest integer
    
    # compare predictions to true label
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
    num_correct += np.sum(correct)


# -- stats! -- ##
# avg test loss
print("Test loss: {:.3f}".format(np.mean(test_losses)))

# accuracy over all test data
test_acc = 100.0*(  num_correct/len(test_loader.dataset) )
print("Test accuracy: {:.2f}%".format(test_acc))

In [None]:
def predict(net, test_review, sequence_length=200):
    ''' Prints out whether a given review is predicted to be 
        positive or negative in sentiment, using a trained model.
        
        params:
        net - A trained net 
        test_review - a review made of normal text and punctuation
        sequence_length - the padded length of a review
        '''
    
    net.eval()
    
    feature_tensor = dset.tokenize_review(test_review)

    batch_size = feature_tensor.size(0)
    
    # initialize hidden state
    h = net.init_hidden(batch_size)
    
    if(train_on_gpu):
      feature_tensor = feature_tensor.cuda()
      
    # get the output from the model
    output, h = net(feature_tensor, h)
    
    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze())
    # printing output value, before rounding
    print('Prediction value, pre-rounding: {:.6f}'.format(output.item()))
    
    # print custom response based on whether test_review is pos/neg
    if(pred.item()==1):
      print('Positive review detected!')
    else:
      print('Negative review detected!')
    
# call function
# try negative and positive reviews!
review_pos, review_neg = dset.get_positive_and_negative_reviews()

# Prediction should be negative sentiment
predict(net, review_neg, dset.seq_length)

# Prediction should be positive sentiment
predict(net, review_pos, dset.seq_length) 

### Q2.5 Bonus: Improve the classifier [20 points]

The 80% test accuracy on this task leaves room for improvement. Try improving it with the following tools:

1. implement a bi-directional LSTM
2. use multiple LSTM layers for the classifier
3. incorporate some form of regularization in the loss, or dropout in the LSTM layers

Please create a new LSTM class for this implementation.
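
When switching to a bi-directional or multi-layer LSTM, the tensor shapes change, which affects both `init_hidden` and the input size of the final linear layer. The helper below is a hypothetical plain-Python sketch of the shapes `nn.LSTM` produces, assuming `batch_first=True`; check which layout your own class uses.

```python
def lstm_output_shapes(batch_size, seq_len, hidden_dim,
                       num_layers=1, bidirectional=False):
    """Shapes of the LSTM output and of each state tensor (h and c)."""
    num_directions = 2 if bidirectional else 1
    # Per-time-step output: both directions are concatenated on the last axis.
    output_shape = (batch_size, seq_len, num_directions * hidden_dim)
    # One state per layer and per direction.
    state_shape = (num_layers * num_directions, batch_size, hidden_dim)
    return output_shape, state_shape

# Example sizes (hypothetical batch size of 50, reviews padded to 200 tokens).
out_shape, state_shape = lstm_output_shapes(50, 200, 256,
                                            num_layers=2, bidirectional=True)
print(out_shape, state_shape)
```

The `dropout` argument of `nn.LSTM` is applied between stacked layers, so it only has an effect when `num_layers` is greater than 1.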

**Solution**

## Q3 : GAN model on Celeb-A face dataset (40 points)

We will implement a Generative Adversarial Network (GAN) in Q3 and Q4. In this problem, we start by implementing basic helper functions and a working training routine, which will be modified slightly in Q4 to build a hairstyle-change application. 

This assignment is inspired by the following research paper, 

[StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation](https://arxiv.org/abs/1711.09020)

Note that the basic building blocks of a GAN are:
1. **Generator** : generates images similar to the real images provided in the dataset in order to fool the discriminator.

2. **Discriminator** : acts as a fact checker that determines which images are fake and which are real.

3. **Latent representation** : the generator cannot generate images out of thin air. It starts from a latent representation (usually random noise or some fixed pattern) and uses it as the latent code from which the image is generated.

For an easy tutorial to understand GAN's building blocks, please follow the PyTorch DC-GAN [tutorial](https://pytorch.org/tutorials/beginner/dcgan_faces_tutorial.html). It is recommended you study the tutorial before proceeding with the assignment. 


Note that f and g in the following image are the same as $D$ and $G$ respectively

<img src="https://i.imgur.com/FhSycJD.png" style="width: 600px;"/>

Although the tutorial describes the GAN framework very well with appropriate links to the original paper, for completeness, we mention the losses for the discriminator and generator. The GAN optimization objective is the following : 

$$\underset{G}{\text{min}} ~~ \underset{D}{\text{max}} ~~ V(D,G) = \mathbb{E}_{x\sim p_{x}(x)}\big[\log D(x)\big] + \mathbb{E}_{z\sim p_{z}(z)}\big[\log (1-D(G(z)))\big]$$

where $p_{x}(x)$ is the distribution of the real image data and $p_{z}(z)$ is the distribution from which latent vectors are sampled as input to the generator. $G$ is the generator with parameters $w$, and $D$ is the discriminator with parameters $\phi$. In practice, these parameters are usually optimized in an alternating fashion, fixing one set while optimizing the other, with the following loss functions:

$$ loss_D(\phi) = - \mathbb{E}_{x\sim p_{x}(x)}\big[\log D(x; \phi)\big] - \mathbb{E}_{z\sim p_{z}(z)}\big[\log (1-D(G(z; w); \phi))\big] $$

$$ loss_G(w) = - \mathbb{E}_{z\sim p_{z}(z)}\big[\log (D(G(z; w); \phi))\big] $$

In this assignment you shall use these loss functions and method of training $G$ and $D$.
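
To get a feel for the scale of these losses, the plain-Python sketch below evaluates $loss_D$ and $loss_G$ on a few hand-picked discriminator outputs. The function names and the numbers are illustrative only; `d_real` stands for values of $D(x)$ and `d_fake` for values of $D(G(z))$, both probabilities in (0, 1).

```python
import math

def discriminator_loss(d_real, d_fake):
    """-E[log D(x)] - E[log(1 - D(G(z)))], with means over the batch."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)

def generator_loss(d_fake):
    """-E[log D(G(z))], with the mean over the batch."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
confused = discriminator_loss([0.5, 0.5], [0.5, 0.5])
# A discriminator that separates the two classes well.
sharp = discriminator_loss([0.9, 0.8], [0.1, 0.2])

g_fooled = generator_loss([0.5, 0.5])   # fakes look plausible to D
g_caught = generator_loss([0.1, 0.2])   # fakes are easily detected
print(confused, sharp, g_fooled, g_caught)
```

A fully confused discriminator gives $loss_D = 2\log 2 \approx 1.386$ and $loss_G = \log 2 \approx 0.693$, which are useful reference values when reading training logs for this objective.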

### Arguments

First, we define some arguments for the training run. 

-  **selected_attributes** -  the CelebA attributes that the generator will learn to change on a celebrity face.
-  **c_dim** - the number of attributes we will use from the CelebA dataset. It is set to `len(selected_attributes)`
-  **image_size** - the spatial size of the images used for training.
   This implementation defaults to 64x64. 
-  **g_conv_dim** - the number of convolutional filters for the generator
-  **d_conv_dim** - the number of convolutional filters for the discriminator 
-  **g_repeat_num** - the number of residual blocks in the generator 
-  **d_repeat_num** - the number of strided convolutional layers in the discriminator
-  **lambda_cls** - the regularization hyper-parameter for classification error
-  **lambda_rec** -  the regularization hyper-parameter for reconstruction error
-  **lambda_gp** -  the regularization hyper-parameter for gradient penalty

-  **batch_size** - the batch size used in training. The paper
   uses a batch size of 16 for the large configuration
-  **num_iters** - the number of training iterations
-  **num_iters_decay** - the number of iterations after which learning rate will decay
-  **g_lr** - the learning rate for generator
-  **d_lr** - the learning rate for discriminator
-  **n_critic** - generator will be updated every n_critic iterations.
-  **beta1** - beta1 hyperparameter for Adam optimizers. As described in
   the paper, this number should be 0.5
-  **beta2** - beta2 hyperparameter for Adam optimizers. As described in
   the paper, this number should be 0.999
-  **num_workers** - the number of worker processes for loading the data with
   the DataLoader
-  **log_step** - log the update every log_step iterations
-  **sample_step** - generate a new sample every sample_step
-  **model_save_step** - save the model every model_save_step
-  **lr_update_step** - update the learning rate every lr_update_step
-  **log_dir** - directory where logs are stored
-  **sample_dir** - directory where samples are stored
-  **model_save_dir** - directory where trained models are stored
-  **result_dir** - directory where results are stored
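
To see how `num_iters`, `num_iters_decay`, and `lr_update_step` interact, the sketch below replays the learning-rate decay rule used in the Q3.7 training loop in plain Python. The function name is hypothetical, and the values passed in are the ones used for the Q3 run (`num_iters=10000`, so `num_iters_decay=5000`).

```python
def lr_schedule(num_iters, num_iters_decay, lr_update_step, base_lr):
    """Return (iteration, learning rate) pairs for every decay event."""
    lr = base_lr
    history = []
    for i in range(num_iters):
        step = i + 1
        # Same condition as in the training loop: decay only in the
        # last num_iters_decay iterations, every lr_update_step steps.
        if step % lr_update_step == 0 and step > (num_iters - num_iters_decay):
            lr -= base_lr / float(num_iters_decay)
            history.append((step, lr))
    return history

history = lr_schedule(num_iters=10000, num_iters_decay=5000,
                      lr_update_step=1000, base_lr=1e-4)
for step, lr in history:
    print(step, lr)
```

With these settings the rule fires five times (at iterations 6000 through 10000), and each event subtracts `base_lr / num_iters_decay`, so the learning rate ends only about 0.1% below its starting value.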



In [None]:
cudnn.benchmark = True

def get_experiment_configuration(repeat_num=6, num_iters=200000,
              log_step=100, sample_step=100, model_save_step=10000, 
              lr_update_step=1000, batch_size=16, mode='train', resume_iters=False,
              selected_attributes = ['Black_Hair', 'Blond_Hair', 'Brown_Hair', 'Male', 'Young']):
    config = {}

    # Model configurations.
    config['c_dim'] = len(selected_attributes)
    config['image_size'] = 64
    config['g_conv_dim'] = 64
    config['d_conv_dim'] = 64
    config['g_repeat_num'] = repeat_num
    config['d_repeat_num'] = repeat_num
    config['lambda_cls'] = 1
    config['lambda_rec'] = 10
    config['lambda_gp'] = 10
    config['selected_attributes'] = selected_attributes 

    # Training configurations.
    config['batch_size'] = batch_size #16
    config['num_iters'] = num_iters
    config['num_iters_decay'] = num_iters//2
    config['g_lr'] = 0.0001
    config['d_lr'] = 0.0001
    config['n_critic'] = 5
    config['beta1'] = 0.5
    config['beta2'] = 0.999
    config['resume_iters'] = resume_iters

    # Test configurations.
    config['test_iters'] = num_iters

    # Miscellaneous.
    config['device'] = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    config['num_workers'] = 1
    config['mode'] = mode

    # Step size.
    config['log_step'] = log_step #10
    config['sample_step'] = sample_step
    config['model_save_step'] =  model_save_step #10000
    config['lr_update_step'] = lr_update_step # 1000

    EXPERIMENT_RESULTS_FOLDER = EXPERIMENTS_DIRECTORY + 'gan-experiments/'

    suffix = str(repeat_num) + '-cdim-' + str(len(selected_attributes))
    config['log_dir'] = EXPERIMENT_RESULTS_FOLDER + 'logs-' + suffix
    config['sample_dir'] = EXPERIMENT_RESULTS_FOLDER + 'sample_dir-' + suffix
    config['model_save_dir'] = EXPERIMENT_RESULTS_FOLDER + 'model_save_dir-' + suffix
    config['result_dir'] = EXPERIMENT_RESULTS_FOLDER + 'result_dir-' + suffix

    print('\n\nPlease ensure you are using a GPU for computation')
    print('Will be using the following device for computation : ', config['device'])

    # Create directories if not exist.
    if not os.path.exists(config['log_dir']):
        os.makedirs(config['log_dir'])
    if not os.path.exists(config['sample_dir']):
        os.makedirs(config['sample_dir'])
    if not os.path.exists(config['model_save_dir']):
        os.makedirs(config['model_save_dir'])
    if not os.path.exists(config['result_dir']):
        os.makedirs(config['result_dir'])

    return config

### Data

The pre-processed data is stored in HDF5 format in the zip file provided with the assignment. Think of the file as storing a large numpy ndarray of images (shape : `num_imgs x height x width x num_channels`, which is how the `CelebA` class below unpacks it). The `CelebA` class implemented below derives from `torch.utils.data.Dataset` and provides the code infrastructure to read images from this file. 

This data contains images of many celebrities along with labels for various image attributes (hair, gender, age, etc). There are 40 such attributes. We will use them later for cool applications. 
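
The following plain-Python sketch illustrates the two conversions applied to each sample: picking the selected attributes out of the full attribute row, and mapping pixel values between [0, 1] and [-1, 1]. The shortened attribute list, the toy attribute row, and the assumption that attributes are stored as 0/1 values are all hypothetical.

```python
# Hypothetical shortened attribute list (the real one has 40 names).
ATTRIBUTE_NAMES = ['Bald', 'Black_Hair', 'Blond_Hair', 'Brown_Hair', 'Male', 'Young']
attr2idx = {name: i for i, name in enumerate(ATTRIBUTE_NAMES)}

def select_label(attribute_row, selected_attrs):
    """Keep only the entries of attribute_row named in selected_attrs."""
    return [float(attribute_row[attr2idx[name]]) for name in selected_attrs]

def normalize(pixel, mean=0.5, std=0.5):
    """Map a pixel value from [0, 1] to [-1, 1]."""
    return (pixel - mean) / std

def denorm(value):
    """Map a value from [-1, 1] back to [0, 1], clamping out-of-range inputs."""
    return min(max((value + 1) / 2, 0.0), 1.0)

row = [0, 0, 1, 0, 1, 1]  # toy attribute row: blond, male, young
label = select_label(row, ['Black_Hair', 'Blond_Hair', 'Brown_Hair'])
print(label, normalize(0.0), normalize(1.0), denorm(normalize(0.25)))
```

The [-1, 1] range matches the `Tanh` output of the generator, which is why images are de-normalized before being displayed or saved.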

In [None]:
ALL_ATTRIBUTES = ['5_o_Clock_Shadow', 'Arched_Eyebrows', 'Attractive',  
      'Bags_Under_Eyes',  'Bald', 'Bangs', 'Big_Lips', 'Big_Nose', 'Black_Hair', 
      'Blond_Hair', 'Blurry', 'Brown_Hair', 'Bushy_Eyebrows', 'Chubby',
      'Double_Chin', 'Eyeglasses', 'Goatee', 'Gray_Hair', 'Heavy_Makeup',
      'High_Cheekbones', 'Male', 'Mouth_Slightly_Open', 'Mustache', 
      'Narrow_Eyes', 'No_Beard', 'Oval_Face', 'Pale_Skin', 'Pointy_Nose', 
      'Receding_Hairline', 'Rosy_Cheeks', 'Sideburns', 'Smiling', 'Straight_Hair',
      'Wavy_Hair', 'Wearing_Earrings', 'Wearing_Hat', 'Wearing_Lipstick',
      'Wearing_Necklace', 'Wearing_Necktie', 'Young' ]
print('# attributes = ', len(ALL_ATTRIBUTES))

In [None]:
class CelebA(torch.utils.data.Dataset):
    """Dataset class for the CelebA dataset."""

    def __init__(self, transform, mode, config):
        """Initialize and preprocess the CelebA dataset."""

        self.file = h5py.File(CELEBA_GOOGLE_DRIVE_PATH, 'r')
        self.total_num_imgs, self.H, self.W, self.C = self.file['images'].shape

        self.images = self.file['images']
        self.attributes = self.file['attributes']

        self.selected_attrs = config['selected_attributes'] 
        self.all_attr_names = ALL_ATTRIBUTES

        self.transform = transform
        self.mode = mode

        self.train_dataset = []
        self.test_dataset = []
        self.attr2idx = {}
        self.idx2attr = {}
        self.preprocess()

        if mode == 'train':
            self.num_images = len(self.train_dataset)
        else:
            self.num_images = len(self.test_dataset)

    def preprocess(self):
        """Preprocess the CelebA attribute file."""
        for i, attr_name in enumerate(self.all_attr_names):
            self.attr2idx[attr_name] = i
            self.idx2attr[i] = attr_name

        self.all_idxs = np.arange(self.total_num_imgs)
        N_test = 9
        self.train_dataset = self.all_idxs[:-N_test] 
        self.test_dataset = self.all_idxs[-N_test:]

        random.seed(1234)
        np.random.seed(1234)        
        np.random.shuffle(self.train_dataset)

        print('Finished preprocessing the CelebA dataset...')

    def __getitem__(self, index):
        """Return one image and its corresponding attribute label."""
        dataset = self.train_dataset if self.mode == 'train' else self.test_dataset
        idx = dataset[index]

        image = self.file['images'][idx]
        attributes = self.file['attributes'][idx]

        label = []
        for attr_name in self.selected_attrs:
            idx = self.attr2idx[attr_name]
            label.append(attributes[idx])
        
        return self.transform(image), torch.FloatTensor(label)

    def __len__(self):
        """Return the number of images."""
        return self.num_images


def get_loader(config, mode='train'):
    """Build and return a data loader."""
    
    batch_size = config['batch_size']
    num_workers = config['num_workers']
    
    transform = []
    transform.append(T.ToPILImage())
    if mode == 'train':
        transform.append(T.RandomHorizontalFlip())
    transform.append(T.ToTensor())
    transform.append(T.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)))
    transform = T.Compose(transform)
    
    dataset = CelebA(transform, mode, config)

    data_loader = torch.utils.data.DataLoader(dataset=dataset,
                                  batch_size=batch_size,
                                  shuffle=(mode=='train'),
                                  num_workers=num_workers)
    return data_loader
  
def denorm(x):
    """Convert the range from [-1, 1] to [0, 1]."""
    out = (x + 1) / 2
    return out.clamp_(0, 1)

In [None]:
SELECTED_ATTRIBUTES = ['Black_Hair', 'Blond_Hair', 'Brown_Hair']

small_config = get_experiment_configuration(repeat_num=1, num_iters=20000,
              batch_size=128, selected_attributes = SELECTED_ATTRIBUTES)

loader = get_loader(small_config, mode='test')
data_iter = iter(loader)
x_fixed, _ = next(data_iter)

from torchvision.transforms import ToPILImage
to_img = ToPILImage()

# display tensor
to_img( denorm( x_fixed[0]  ) )

### Modules for Generator and Discriminator

The following cell defines the generator and discriminator networks as `nn.Module` subclasses.
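
Before reading the network code, it can help to trace the spatial sizes by hand. The sketch below uses the standard output-size formulas for `Conv2d` and `ConvTranspose2d` (dilation 1, no output padding) with the kernel, stride, and padding values from the cell that follows. The helper names are made up for this example.

```python
def conv_out(size, kernel, stride, padding):
    """Output size of a Conv2d layer along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_transpose_out(size, kernel, stride, padding):
    """Output size of a ConvTranspose2d layer along one spatial axis."""
    return (size - 1) * stride - 2 * padding + kernel

def discriminator_shapes(image_size, repeat_num):
    """Sizes of the feature map, the real/fake map, and the class map."""
    size = image_size
    for _ in range(repeat_num):  # first conv plus (repeat_num - 1) loop convs
        size = conv_out(size, kernel=4, stride=2, padding=1)
    cls_kernel = int(image_size / 2 ** repeat_num)
    src_map = conv_out(size, kernel=3, stride=1, padding=1)
    cls_map = conv_out(size, kernel=cls_kernel, stride=1, padding=0)
    return size, src_map, cls_map

def generator_output_size(image_size):
    """Spatial size of the generator output for a square input."""
    size = conv_out(image_size, 7, 1, 3)
    for _ in range(2):                      # down-sampling
        size = conv_out(size, 4, 2, 1)
    for _ in range(2):                      # up-sampling
        size = conv_transpose_out(size, 4, 2, 1)
    return conv_out(size, 7, 1, 3)

print(discriminator_shapes(64, 1))   # configuration used in Q3
print(discriminator_shapes(128, 6))  # class defaults
print(generator_output_size(64))
```

In both configurations the classification head reduces the feature map to 1x1, which is why `out_cls` can be reshaped to `(batch, c_dim)`, while the generator returns an image of the same size as its input.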

In [None]:
class ResidualBlock(nn.Module):
    """Residual Block with instance normalization."""
    def __init__(self, dim_in, dim_out):
        super(ResidualBlock, self).__init__()
        self.main = nn.Sequential(
            nn.Conv2d(dim_in, dim_out, kernel_size=3, stride=1, padding=1, bias=False),
            nn.InstanceNorm2d(dim_out, affine=True, track_running_stats=True),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim_out, dim_out, kernel_size=3, stride=1, padding=1, bias=False),
            nn.InstanceNorm2d(dim_out, affine=True, track_running_stats=True))

    def forward(self, x):
        return x + self.main(x)

class Generator(nn.Module):
    """Generator network."""
    def __init__(self, conv_dim=64, c_dim=5, repeat_num=6):
        super(Generator, self).__init__()

        layers = []
        layers.append(nn.Conv2d(3+c_dim, conv_dim, kernel_size=7, stride=1, padding=3, bias=False))
        layers.append(nn.InstanceNorm2d(conv_dim, affine=True, track_running_stats=True))
        layers.append(nn.ReLU(inplace=True))

        # Down-sampling layers.
        curr_dim = conv_dim
        for i in range(2):
            layers.append(nn.Conv2d(curr_dim, curr_dim*2, kernel_size=4, stride=2, padding=1, bias=False))
            layers.append(nn.InstanceNorm2d(curr_dim*2, affine=True, track_running_stats=True))
            layers.append(nn.ReLU(inplace=True))
            curr_dim = curr_dim * 2

        # Bottleneck layers.
        for i in range(repeat_num):
            layers.append(ResidualBlock(dim_in=curr_dim, dim_out=curr_dim))

        # Up-sampling layers.
        for i in range(2):
            layers.append(nn.ConvTranspose2d(curr_dim, curr_dim//2, kernel_size=4, stride=2, padding=1, bias=False))
            layers.append(nn.InstanceNorm2d(curr_dim//2, affine=True, track_running_stats=True))
            layers.append(nn.ReLU(inplace=True))
            curr_dim = curr_dim // 2

        layers.append(nn.Conv2d(curr_dim, 3, kernel_size=7, stride=1, padding=3, bias=False))
        layers.append(nn.Tanh())
        self.main = nn.Sequential(*layers)

    def forward(self, x, c):
        # Replicate spatially and concatenate domain information.
        # Note that this type of label conditioning does not work at all if we use reflection padding in Conv2d.
        # This is because instance normalization ignores the shifting (or bias) effect.
        c = c.view(c.size(0), c.size(1), 1, 1)
        c = c.repeat(1, 1, x.size(2), x.size(3))
        x = torch.cat([x, c], dim=1)
        return self.main(x)

class Discriminator(nn.Module):
    """Discriminator network."""
    def __init__(self, image_size=128, conv_dim=64, c_dim=5, repeat_num=6):
        super(Discriminator, self).__init__()
        layers = []
        layers.append(nn.Conv2d(3, conv_dim, kernel_size=4, stride=2, padding=1))
        layers.append(nn.LeakyReLU(0.01))

        curr_dim = conv_dim
        for i in range(1, repeat_num):
            layers.append(nn.Conv2d(curr_dim, curr_dim*2, kernel_size=4, stride=2, padding=1))
            layers.append(nn.LeakyReLU(0.01))
            curr_dim = curr_dim * 2

        kernel_size = int(image_size / np.power(2, repeat_num))
        self.main = nn.Sequential(*layers)
        self.conv1 = nn.Conv2d(curr_dim, 1, kernel_size=3, stride=1, padding=1, bias=False)
        self.conv2 = nn.Conv2d(curr_dim, c_dim, kernel_size=kernel_size, bias=False)
        
    def forward(self, x):
        h = self.main(x)
        out_src = self.conv1(h)
        out_cls = self.conv2(h)
        return out_src, out_cls.view(out_cls.size(0), out_cls.size(1))

Miscellaneous functions for updating learning rates, resetting the gradients and restoring a trained model from storage.

In [None]:
def update_lr(g_optimizer, d_optimizer, g_lr, d_lr):
    """Decay learning rates of the generator and discriminator."""
    for param_group in g_optimizer.param_groups:
        param_group['lr'] = g_lr
    for param_group in d_optimizer.param_groups:
        param_group['lr'] = d_lr

def reset_grad(g_optimizer, d_optimizer):    
    g_optimizer.zero_grad()
    d_optimizer.zero_grad()

### Helper Functions

Implement the following helper functions to complete the training code. Follow the instructions in the questions below.



### Q3.1  Print number of parameters in the networks (5 points)

Write a function that takes a model and its name as input, and prints the model and the number of parameters in the model.
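
A hand count is a handy cross-check for the number your function prints. The sketch below counts the parameters of the discriminator from its layer definitions, using the fact that a `Conv2d` layer has `out_channels * in_channels * k * k` weights plus `out_channels` biases when `bias=True`. The helper names are hypothetical, and the default arguments mirror the Q3 configuration (`image_size=64`, `conv_dim=64`, `c_dim=1`, `repeat_num=1`).

```python
def conv2d_params(in_ch, out_ch, kernel, bias=True):
    """Number of learnable parameters in a square-kernel Conv2d layer."""
    return out_ch * in_ch * kernel * kernel + (out_ch if bias else 0)

def discriminator_params(image_size=64, conv_dim=64, c_dim=1, repeat_num=1):
    """Parameter count of the Discriminator defined in this notebook."""
    total = conv2d_params(3, conv_dim, 4)            # first strided conv
    curr = conv_dim
    for _ in range(1, repeat_num):                   # remaining strided convs
        total += conv2d_params(curr, curr * 2, 4)
        curr *= 2
    cls_kernel = int(image_size / 2 ** repeat_num)
    total += conv2d_params(curr, 1, 3, bias=False)               # conv1 (real/fake)
    total += conv2d_params(curr, c_dim, cls_kernel, bias=False)  # conv2 (classes)
    return total

print(discriminator_params())
```

For the Q3 configuration this gives 3136 + 576 + 65536 = 69248 parameters, which should agree with the count printed for `D` if the model is built as defined above.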


In [None]:
def print_network(model, name):
  """Print out the network information."""
  num_params = 0

  ################################################
  ##### TODO CODE HERE
  raise NotImplementedError
  ################################################

  print("The number of parameters: {}".format(num_params))

### Q3.2 Invoke the optimizers on the Generator and Discriminator parameters (5 points)
Write a function that takes optimization parameters as input and returns PyTorch optimizers for the generator and the discriminator. Use the [Adam](https://arxiv.org/pdf/1412.6980.pdf) optimizer with the parameters $\beta_1$ and $\beta_2$ specified earlier.
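
To make the role of $\beta_1$ and $\beta_2$ concrete, here is a plain-Python sketch of a single Adam update on one scalar parameter, following the update rule from the Adam paper. It is an illustration of the algorithm, not the PyTorch API; the function name is made up, and the defaults mirror the configuration values (`lr=1e-4`, `beta1=0.5`, `beta2=0.999`).

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.5, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter at (1-based) step t."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# First step from zero-initialized moments with a toy gradient of 0.3.
p, m, v = adam_step(param=1.0, grad=0.3, m=0.0, v=0.0, t=1)
print(p, m, v)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `lr` regardless of the gradient's magnitude. A smaller `beta1`, such as the 0.5 used here, makes the first-moment average react faster to recent gradients.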

In [None]:
def get_optimizers(G, D, g_learning_rate, d_learning_rate, beta1, beta2):
    """
    Returns a 2-tuple (g_optimizer, d_optimizer): the optimizers for the
    parameters of the generator G and the discriminator D.
    """

    ################################################
    ##### TODO CODE HERE
    g_optimizer = raise NotImplementedError
    d_optimizer = raise NotImplementedError
    ################################################

    return g_optimizer, d_optimizer

### Q3.3 Compute classification loss (5 points)

Given the logits and the target labels, compute binary cross entropy loss $\ell_{\text{cls}}$ .
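
Because the inputs here are logits rather than probabilities, the loss is usually computed in a numerically stable form. The plain-Python sketch below compares the stable expression $\max(x, 0) - x\,t + \log(1 + e^{-|x|})$ with the naive "sigmoid, then log" version. Both function names are hypothetical, and averaging over all elements is an assumption about the reduction.

```python
import math

def bce_with_logits(logits, targets):
    """Stable binary cross-entropy on raw logits, averaged over elements."""
    total = 0.0
    for x, t in zip(logits, targets):
        total += max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
    return total / len(logits)

def bce_naive(logits, targets):
    """Same loss via an explicit sigmoid; breaks down for large |x|."""
    total = 0.0
    for x, t in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-x))
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(logits)

logits = [2.0, -1.0, 0.0]
targets = [1.0, 0.0, 1.0]
stable = bce_with_logits(logits, targets)
naive = bce_naive(logits, targets)
extreme = bce_with_logits([1000.0], [0.0])  # the naive form would hit log(0)
print(stable, naive, extreme)
```

The two versions agree for moderate logits, but only the stable one stays finite when a logit is very large and the target disagrees with it.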

In [None]:
def classification_loss(logit, target):
    """
    Compute binary cross entropy loss.
    """

    loss = 0.0

    ################################################
    ##### TODO CODE HERE
    loss = raise NotImplementedError
    ################################################

    return loss

### Q3.4 Compute reconstruction loss (5 points)

This is a very popular loss function used in situations where you are given an original input $x$. In generative learning, through some latent space, you generate a near replica of $x$, which we denote by $\hat{x}$.

The reconstruction loss measures the distance between the replica and the original. Let $N$ be the number of elements in $x$ and $\hat{x}$, then the loss can be written as 

$$
\ell_{\text{rec}} =  \frac{1}{N} \sum^N_{i=1} | x_i - \hat{x}_i |
$$
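
As a small worked example of this formula, the plain-Python sketch below computes the mean absolute error between two flattened toy "images". The function name and values are illustrative only.

```python
def l1_reconstruction_loss(x, x_hat):
    """Mean absolute difference between two equally sized flat lists."""
    assert len(x) == len(x_hat)
    return sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)

x = [0.0, 1.0, -1.0, 0.5]      # original (toy values in [-1, 1])
x_hat = [0.5, 1.0, 0.0, 0.5]   # reconstruction
loss = l1_reconstruction_loss(x, x_hat)  # (0.5 + 0 + 1 + 0) / 4
print(loss)
```

Here $N$ counts every element, so for a batch of images it is the product of the batch size, the number of channels, the height, and the width.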

In [None]:
def reconstruction_loss( x_real, x_reconstructed ):
    """
    Compute the reconstruction loss.
    """

    ################################################
    ##### TODO CODE HERE
    loss = raise NotImplementedError
    ################################################

    return loss

### Q3.5 Implement the discriminator loss (5 points)
Write a function that returns the discriminator loss written as :

$$
\ell_{discriminator} = \ell_{real} + \ell_{fake}  +  \lambda_{cls} * \ell_{cls}
$$

Note that we add an additional term associated with the generated data (it is simple to compute, so we do it for you :P ).
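
The sketch below shows, on scalar toy values, how the three terms combine. It assumes the sign convention used in the provided code, where $\ell_{real}$ is the negated mean score on real images and $\ell_{fake}$ is the mean score on generated images. The function name and numbers are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def discriminator_total(scores_real, scores_fake, cls_loss, lambda_cls=1.0):
    """Combine the real, fake, and weighted classification terms."""
    loss_real = -mean(scores_real)   # higher scores on real images lower the loss
    loss_fake = mean(scores_fake)    # lower scores on fake images lower the loss
    total = loss_real + loss_fake + lambda_cls * cls_loss
    return total, loss_real, loss_fake

# Toy discriminator scores and a toy classification loss.
total, l_real, l_fake = discriminator_total([2.0, 1.0], [-1.0, 0.0], cls_loss=0.7)
print(total, l_real, l_fake)
```

With this convention the total can be negative, so a negative discriminator loss in the training log is not by itself a sign of a bug.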

In [None]:
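def get_discriminator_loss(G, D, label_org, x_real, c_trg, lambda_cls, lambda_gp):
    """
    Compute the discriminator loss on a batch of real images and the fake
    images generated from them. Returns the total loss followed by its real,
    fake, classification, and gradient-penalty terms.
    """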

    out_src, out_cls = D(x_real)
    d_loss_real = -torch.mean(out_src)
    ################################################
    ##### TODO CODE HERE
    ##### classification loss between out_cls, label_org
    d_loss_cls = raise NotImplementedError
    ################################################

    # Compute loss with fake images.
    x_fake = G(x_real, c_trg)
    out_src, out_cls = D(x_fake.detach())
    d_loss_fake = torch.mean(out_src)

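    # No gradient penalty is computed here; keep a zero tensor so that .item() works when logging.
    d_loss_gp = torch.zeros((), device=x_real.device)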

    # Backward and optimize.
    d_loss = d_loss_real + d_loss_fake
    ################################################
    ##### TODO CODE HERE
    ##### add remaining loss terms as described in the problem
    d_loss += raise NotImplementedError
    ################################################
    
    return d_loss, d_loss_real, d_loss_fake, d_loss_cls, d_loss_gp

### Q3.6 Implement the generator loss (5 points)

Write a function that returns the generator loss written as :

$$
\ell_{generator} = \ell_{fake} + \lambda_{rec} * \ell_{rec} +  \lambda_{cls} * \ell_{cls}
$$

Note that we add an additional term associated with the generated data (it is simple to compute, so we do it for you :P ).
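
As a quick sanity check on the weighting, the plain-Python sketch below combines scalar toy values for the three generator terms using the weights from the configuration (`lambda_cls = 1`, `lambda_rec = 10`). The function name and inputs are hypothetical, and $\ell_{fake}$ is taken to be the negated mean discriminator score on generated images, as in the provided code.

```python
def generator_total(scores_fake, cls_loss, rec_loss, lambda_cls=1.0, lambda_rec=10.0):
    """Combine the adversarial, reconstruction, and classification terms."""
    loss_fake = -sum(scores_fake) / len(scores_fake)
    return loss_fake + lambda_rec * rec_loss + lambda_cls * cls_loss

total = generator_total([-1.0, 0.0], cls_loss=0.7, rec_loss=0.25)
print(total)  # 0.5 + 10 * 0.25 + 1 * 0.7
```

With `lambda_rec = 10`, even a modest reconstruction error dominates the sum, which pushes the generator to preserve everything in the image that the target attribute does not require changing.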

In [None]:
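def get_generator_loss(G, D, x_real, c_trg, c_org, label_trg, lambda_rec, lambda_cls):
    """
    Compute the generator loss: translate x_real to the target domain, then
    back to the original domain. Returns the total loss followed by its fake,
    classification, and reconstruction terms.
    """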

    # Original-to-target domain.
    x_fake = G(x_real, c_trg)
    out_src, out_cls = D(x_fake)
    g_loss_fake = - torch.mean(out_src)

    ################################################
    ##### TODO CODE HERE
    ##### classification loss between out_cls and label_trg
    g_loss_cls = raise NotImplementedError
    ################################################

    # Target-to-original domain.
    x_reconst = G(x_fake, c_org)
    ################################################
    ##### TODO CODE HERE
    ##### reconstruction loss between x_real and x_reconst
    g_loss_rec = raise NotImplementedError
    ################################################

    # Backward and optimize.
    g_loss = g_loss_fake
    ################################################
    ##### TODO CODE HERE
    ##### add remaining loss terms as described in the problem
    g_loss += raise NotImplementedError
    ################################################

    return g_loss, g_loss_fake, g_loss_cls, g_loss_rec

The following cell initializes the generator and discriminator, prints both networks, and creates their optimizers.

In [None]:
# Instantiate Generator and Discriminator

SELECTED_ATTRIBUTES = ['Blond_Hair']

config = get_experiment_configuration(repeat_num=1, num_iters=10000, 
              log_step=100, sample_step=1000, model_save_step=1000, 
              batch_size=64, selected_attributes = SELECTED_ATTRIBUTES)

G = Generator(config['g_conv_dim'], config['c_dim'], config['g_repeat_num'])

D = Discriminator(config['image_size'], 
                  config['d_conv_dim'], 
                  config['c_dim'], 
                  config['d_repeat_num']) 

g_optimizer, d_optimizer = get_optimizers(G, D, 
                                          config['g_lr'], config['d_lr'], 
                                          config['beta1'], config['beta2'])

print_network(G, 'G')
print_network(D, 'D')
    
G = G.to(config['device'])
D = D.to(config['device'])

In [None]:
def create_labels(c_org, c_dim=5, selected_attrs=SELECTED_ATTRIBUTES):
    """Generate target domain labels for debugging and testing."""
    # Get hair color indices.
    hair_color_indices = []
    for i, attr_name in enumerate(selected_attrs):
        if attr_name in ['Black_Hair', 'Blond_Hair', 'Brown_Hair', 'Gray_Hair']:
            hair_color_indices.append(i)

    c_trg_list = []
    for i in range(c_dim):
        c_trg = c_org.clone()
        if i in hair_color_indices:  # Set one hair color to 1 and the rest to 0.
            c_trg[:, i] = 1
            for j in hair_color_indices:
                if j != i:
                    c_trg[:, j] = 0
        else:
            c_trg[:, i] = (c_trg[:, i] == 0)  # Reverse attribute value.

        c_trg_list.append(c_trg.to(config['device']))
    return c_trg_list

# Set data loader.
data_loader = get_loader(config, 'train')
device = config['device']

# Fetch fixed inputs for debugging.
data_iter = iter(data_loader)
x_fixed, c_org = next(data_iter)
x_fixed = x_fixed.to(device)
c_fixed_list = create_labels(c_org, config['c_dim'], config['selected_attributes'])

### Q3.7 : Training loop (10 points)
Now, using the functions defined above, implement the main training loop. Some of it has already been done for you. Fill in code where indicated.

Note that after every `config['sample_step']` iterations, the code saves new samples to the directory given in the configuration. Please monitor them to see what your generated images look like.
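
Two details of the loop can be previewed in plain Python: how target labels are obtained by shuffling the labels within a batch, and how an `n_critic` schedule would select the iterations on which the generator is updated. The second part is an assumption, since the skeleton below does not show where `n_critic` is applied. The function names and toy labels are hypothetical.

```python
import random

def generator_update_steps(num_iters, n_critic):
    """1-based iterations on which the generator would be updated."""
    return [i + 1 for i in range(num_iters) if (i + 1) % n_critic == 0]

def shuffled_targets(labels, rng):
    """Target labels drawn by permuting the labels within the batch."""
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    return [labels[j] for j in idx]

steps = generator_update_steps(num_iters=20, n_critic=5)

label_org = [[1.0], [0.0], [0.0], [1.0]]   # toy 'Blond_Hair' labels
label_trg = shuffled_targets(label_org, random.Random(0))
print(steps, label_trg)
```

If you do update the generator only every `n_critic` iterations, the logging lines that read the generator losses would need to run only on those iterations as well, because the generator loss tensors are not recomputed on the others.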


In [None]:
# Learning rate cache for decaying.
g_lr = config['g_lr']
d_lr = config['d_lr']

# Start training from scratch or resume training.
start_iters = 0

G_losses = []
D_losses = []
cur_g_loss = 0
cur_d_loss = 0

# Start training.
print('Start training...')
start_time = time.time()
for i in range(start_iters, config['num_iters']):
    # =================================================================================== #
    #                             1. Preprocess input data                                #
    # =================================================================================== #

    # Fetch real images and labels.
    try:
        x_real, label_org = next(data_iter)
    except StopIteration:  # the loader is exhausted; start a new pass over the data
        data_iter = iter(data_loader)
        x_real, label_org = next(data_iter)

    # Generate target domain labels randomly.
    rand_idx = torch.randperm(label_org.size(0))
    label_trg = label_org[rand_idx]

    c_org = label_org.clone()
    c_trg = label_trg.clone()

    x_real = x_real.to(device)           # Input images.
    c_org = c_org.to(device)             # Original labels.
    c_trg = c_trg.to(device)             # Target labels.
    label_org = label_org.to(device)     # Labels for computing classification loss.
    label_trg = label_trg.to(device)     # Labels for computing classification loss.

    # Train discriminator
    ################################################
    ##### TODO CODE HERE
    ##### Get the discriminator loss and optimize discriminator
    d_loss, d_loss_real, d_loss_fake, d_loss_cls, d_loss_gp = raise NotImplementedError
    
    # Now Optimize discriminator
    ################################################

    cur_d_loss = d_loss.item()
    # Logging.
    loss = {}
    loss['D/loss_real'] = d_loss_real.item()
    loss['D/loss_fake'] = d_loss_fake.item()
    loss['D/loss_cls'] = d_loss_cls.item()
    loss['D/loss_gp'] = d_loss_gp.item()
    
    # Train the generator                         
    ################################################
    ##### TODO CODE HERE
    ##### Get the generator loss and optimize generator
    g_loss, g_loss_fake, g_loss_cls, g_loss_rec =  raise NotImplementedError     
    
    # Now Optimize generator
    ################################################

    # Logging.
    loss['G/loss_fake'] = g_loss_fake.item()
    loss['G/loss_rec'] = g_loss_rec.item()
    loss['G/loss_cls'] = g_loss_cls.item()
    cur_g_loss = g_loss.item() 

    # Save Losses for plotting later
    G_losses.append(cur_g_loss)
    D_losses.append(cur_d_loss)

    # Print out training information.
    if (i+1) % config['log_step']  == 0:
        et = time.time() - start_time
        et = str(datetime.timedelta(seconds=et))[:-7]
        log = "Elapsed [{}], Iteration [{}/{}]".format(et, i+1, config['num_iters'])
        for tag, value in loss.items():
            log += ", {}: {:.4f}".format(tag, value)
        print(log)

    # Translate fixed images for debugging.
    if (i+1) %  config['sample_step']  == 0:
        with torch.no_grad():
            x_fake_list = [x_fixed]
            for c_fixed in c_fixed_list:
                x_fake_list.append(G(x_fixed, c_fixed))
            x_concat = torch.cat(x_fake_list, dim=3)
            sample_path = os.path.join(config['sample_dir'], '{}-images.jpg'.format(i+1))
            save_image(denorm(x_concat.data.cpu()), sample_path, nrow=1, padding=0)
            print('Saved real and fake images into {}...'.format(sample_path))

    # Decay learning rates.
    if (i+1) % config['lr_update_step'] == 0 and (i+1) > (config['num_iters'] - config['num_iters_decay']):
        g_lr -= (config['g_lr'] / float(config['num_iters_decay']))
        d_lr -= (config['d_lr'] / float(config['num_iters_decay']))
        update_lr(g_optimizer, d_optimizer, g_lr, d_lr)
        print ('Decayed learning rates, g_lr: {}, d_lr: {}.'.format(g_lr, d_lr))

- Plot the generator and discriminator losses. Remember to leave this output intact when you submit the notebook. Not doing so will result in a 2-point penalty.

In [None]:
# Losses
plt.figure(figsize=(10,5))
plt.title("Generator and Discriminator Loss During Training")
plt.plot(G_losses,label="G")
plt.plot(D_losses,label="D")
plt.xlabel("iterations")
plt.ylabel("Loss")
plt.legend()
plt.show()

## Q4: Hair-style transformation

In this problem, we will take images and change their hairstyles using the GAN trained in the previous step.

### Q4.1 : Save trained model (5 points)

We will implement a routine to save the trained generator and discriminator models, so that we can simply load them later for inference.

In [None]:
def save_model(G, D, config, step):
    """
    Save the trained generator and discriminator
    """
    model_save_dir = config['model_save_dir']
    G_path = os.path.join(model_save_dir, '{}-G.ckpt'.format(step+1))
    D_path = os.path.join(model_save_dir, '{}-D.ckpt'.format(step+1))

    ################################################
    ##### TODO CODE HERE
    raise NotImplementedError
    ################################################
    
    print('Saved model checkpoints into {}...'.format(model_save_dir))


### Q4.2 : Load trained model (5 points)

We will implement a routine to load the trained generator and discriminator models.

In [None]:
def restore_model(resume_iters, model_save_dir):
    """
    Restore the trained generator and discriminator.
    """

    print('Loading the trained models from step {}...'.format(resume_iters))
    G_path = os.path.join(model_save_dir, '{}-G.ckpt'.format(resume_iters))
    D_path = os.path.join(model_save_dir, '{}-D.ckpt'.format(resume_iters))

    ################################################
    ##### TODO CODE HERE
    raise NotImplementedError
    ################################################

    return G, D
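
The save and load routines have to agree on file names. The sketch below illustrates only that naming convention and does not save or load any weights. The helper `checkpoint_paths` and the directory `'models'` are hypothetical names used for illustration. Since `save_model` formats `step+1` while `restore_model` formats `resume_iters` directly, a checkpoint written at loop index `i` would be restored with `resume_iters = i + 1`:

```python
import os

def checkpoint_paths(model_save_dir, iteration):
    """Hypothetical helper: build the G and D checkpoint paths for an iteration count."""
    G_path = os.path.join(model_save_dir, '{}-G.ckpt'.format(iteration))
    D_path = os.path.join(model_save_dir, '{}-D.ckpt'.format(iteration))
    return G_path, D_path

# save_model(G, D, config, step) puts step+1 in the file name ...
step = 9999
saved = checkpoint_paths('models', step + 1)

# ... so restore_model needs resume_iters = step + 1 to find the same files.
resume_iters = 10000
restored = checkpoint_paths('models', resume_iters)

print(saved == restored)
```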

### Q4.3 Implement gradient penalty (10 points)

Given $y = f(x)$, we refer to $ \frac{ dy }{ dx } $ as the gradient in this problem. We want to include a gradient penalty in the GAN loss. 
We can write the gradient penalty $\ell_{gp}$ as 

$$
\ell_{gp} = \Bigg(  \Big\| \frac{ dy }{ dx } \Big\|_2 - 1 \Bigg)^2
$$

(Hint: use the ```grad``` function in the ```torch.autograd``` module to compute the gradient penalty.)
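
To make the formula concrete, here is a small worked example that does not use autograd. It assumes a hypothetical linear critic $f(x) = w \cdot x$, whose gradient with respect to $x$ is $w$ for every input, so the penalty can be computed by hand. The names `l2_norm` and `penalty_for_linear_critic` and the weight values are illustrative only and are not part of the assignment code.

```python
import math

def l2_norm(v):
    """Euclidean norm of a flat list of numbers."""
    return math.sqrt(sum(t * t for t in v))

def penalty_for_linear_critic(w, batch_size):
    """Gradient penalty for f(x) = w . x, averaged over a batch.

    For a linear critic dy/dx = w for every sample, so each per-sample
    term is (||w||_2 - 1)^2 and the batch mean equals that same value.
    """
    per_sample = [(l2_norm(w) - 1.0) ** 2 for _ in range(batch_size)]
    return sum(per_sample) / batch_size

# ||(3, 4)||_2 = 5, so the penalty is (5 - 1)^2 = 16.
print(penalty_for_linear_critic([3.0, 4.0], batch_size=8))

# A gradient with (numerically almost) unit norm incurs essentially no penalty.
print(penalty_for_linear_critic([0.6, 0.8], batch_size=8))
```

In the assignment the critic is a network, so the gradient differs from sample to sample and has to be obtained with autograd. The per-sample norm would then be taken over all non-batch dimensions before averaging.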

In [None]:
def gradient_penalty(y, x):
    """
    Compute gradient penalty: (L2_norm(dy/dx) - 1)**2.
    """

    ################################################
    ##### TODO CODE HERE
    dydx_l2norm = raise NotImplementedError
    ################################################

    loss = torch.mean((dydx_l2norm-1)**2)
    return loss

### Q4.4 Implement the discriminator loss (5 points)
Write a function that returns the discriminator loss written as :

$$
\ell_{discriminator} = \ell_{real} + \ell_{fake} + \lambda_{gp} \times \ell_{gp} +  \lambda_{cls} \times \ell_{cls}
$$

Note that we add an additional term associated with the generated data (it's simple to compute, so we do it for you :P).

In [None]:

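def get_new_discriminator_loss(G, D, label_org, x_real, c_trg, lambda_cls, lambda_gp):
    """
    Compute the discriminator loss and return it together with its components:
    the real and fake adversarial terms, the classification loss, and the gradient penalty.
    """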

    out_src, out_cls = D(x_real)
    d_loss_real = -torch.mean(out_src)
    ################################################
    ##### TODO CODE HERE
    ##### classification loss between out_cls, label_org
    d_loss_cls = raise NotImplementedError
    ################################################

    # Compute loss with fake images.
    x_fake = G(x_real, c_trg)
    out_src, out_cls = D(x_fake.detach())
    d_loss_fake = torch.mean(out_src)

    # Compute loss for gradient penalty.
    alpha = torch.rand(x_real.size(0), 1, 1, 1).to(device)
    x_hat = (alpha * x_real.data + (1 - alpha) * x_fake.data).requires_grad_(True)
    out_src, _ = D(x_hat)
    ################################################
    ##### TODO CODE HERE
    ##### gradient penalty on y=out_src, x=x_hat
    d_loss_gp = raise NotImplementedError
    ################################################

    # Combine the loss terms (the optimizer step happens in the training loop).
    d_loss = d_loss_real + d_loss_fake
    ################################################
    ##### TODO CODE HERE
    ##### add remaining loss terms as described in the problem
    d_loss += raise NotImplementedError
    ################################################
    
    return d_loss, d_loss_real, d_loss_fake, d_loss_cls, d_loss_gp

### Q4.5 : Initialize a larger GAN using ```get_experiment_configuration``` (10 points)

We will use three hair style attributes in this experiment.

We will train a larger GAN in this problem. First, let's get larger generator and discriminator models (use more than 3 repeat blocks in the experiment configuration, which will increase the number of residual blocks in both models).

Initialize the generator and discriminator accordingly and get the optimizers.

In [None]:

SELECTED_ATTRIBUTES = ['Black_Hair', 'Blond_Hair', 'Brown_Hair']

################################################
##### TODO CODE HERE
raise NotImplementedError
################################################

print_network(G, 'G')
print_network(D, 'D')
    
G = G.to(config['device'])
D = D.to(config['device'])

### Q4.6 : Train the larger GAN (15 points)

At the heart of a GAN network is a minimax problem. Earlier we were optimizing the Generator and the Discriminator at the same speed. 

It turns out that in this case, it's recommended that the generator be updated at a slower pace than the discriminator. 

One way to achieve this is to run the generator with a smaller learning rate. Instead, we will keep the same learning rates as before but update the generator only every few iterations: the discriminator is trained in every iteration, while the generator is trained every 5 iterations or, more specifically, every ```config['n_critic']``` iterations. Use this recommended value or change it as per your intuition.
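
As a rough sketch of the resulting schedule (no models involved), the snippet below lists the iterations at which the generator would be updated. The function name `update_schedule` is hypothetical, and the condition `(i+1) % n_critic == 0` is an assumption modeled on the other periodic checks in the training loop. Other ways of skipping generator updates are possible.

```python
def update_schedule(num_iters, n_critic):
    """Return the 1-based iterations at which the generator is updated.

    The discriminator is updated at every iteration; the generator only
    when the (1-based) iteration count is a multiple of n_critic.
    """
    return [i + 1 for i in range(num_iters) if (i + 1) % n_critic == 0]

g_steps = update_schedule(num_iters=20, n_critic=5)
print(g_steps)        # [5, 10, 15, 20]
print(len(g_steps))   # 4 generator updates for 20 discriminator updates
```

With this condition the generator loss values are only refreshed every `n_critic` iterations, so any logging of generator losses needs to account for the iterations in which they were not recomputed.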

Note that after every `config['sample_step']` iterations, the code generates new samples in the directory indicated in the configuration. Please monitor these to see how your generated images look.

The script also saves your model to `config['model_save_dir']` every `config['model_save_step']` iterations so that you can resume training (in case your script crashes after making significant progress) and so that we can restore this model when we generate new hairstyles on the test images.

Hopefully your results will be better with these updates.

This will take more than 10 hours to give you reasonable images. So start early.


In [None]:
# Learning rate cache for decaying.
g_lr = config['g_lr']
d_lr = config['d_lr']

# Start training from scratch or resume training.
start_iters = 0
if config['resume_iters']:
    start_iters = config['resume_iters'] 
    G, D = restore_model(config['resume_iters'], config['model_save_dir'])

G_losses = []
D_losses = []
cur_g_loss = 0
cur_d_loss = 0

# Start training.
print('Start training...')
start_time = time.time()
for i in range(start_iters, config['num_iters']):
    # =================================================================================== #
    #                             1. Preprocess input data                                #
    # =================================================================================== #

    # Fetch real images and labels.
    try:
        x_real, label_org = next(data_iter)
    except (NameError, StopIteration):   # data_iter not created yet, or loader exhausted
        data_iter = iter(data_loader)
        x_real, label_org = next(data_iter)

    # Generate target domain labels randomly.
    rand_idx = torch.randperm(label_org.size(0))
    label_trg = label_org[rand_idx]

    c_org = label_org.clone()
    c_trg = label_trg.clone()

    x_real = x_real.to(device)           # Input images.
    c_org = c_org.to(device)             # Original labels.
    c_trg = c_trg.to(device)             # Target labels.
    label_org = label_org.to(device)     # Labels for computing classification loss.
    label_trg = label_trg.to(device)     # Labels for computing classification loss.

    # Train discriminator
    ################################################
    ##### TODO CODE HERE
    ##### Get the discriminator loss and optimize discriminator
    d_loss, d_loss_real, d_loss_fake, d_loss_cls, d_loss_gp = raise NotImplementedError
    
    # Now Optimize discriminator
    ################################################

    cur_d_loss = d_loss.item()
    # Logging.
    loss = {}
    loss['D/loss_real'] = d_loss_real.item()
    loss['D/loss_fake'] = d_loss_fake.item()
    loss['D/loss_cls'] = d_loss_cls.item()
    loss['D/loss_gp'] = d_loss_gp.item()
    
    # Train the generator                         
    ################################################
    ##### TODO CODE HERE
    ##### Get the generator loss and optimize generator (every n_critic iterations)
    g_loss, g_loss_fake, g_loss_cls, g_loss_rec =  raise NotImplementedError     
    
    # Now Optimize generator
    # Logging.
    loss['G/loss_fake'] = g_loss_fake.item()
    loss['G/loss_rec'] = g_loss_rec.item()
    loss['G/loss_cls'] = g_loss_cls.item()
    ################################################
        
    # Print out training information.
    if (i+1) % config['log_step']  == 0:
        et = time.time() - start_time
        et = str(datetime.timedelta(seconds=et))[:-7]
        log = "Elapsed [{}], Iteration [{}/{}]".format(et, i+1, config['num_iters'])
        for tag, value in loss.items():
            log += ", {}: {:.4f}".format(tag, value)
        print(log)

    # Translate fixed images for debugging.
    if (i+1) %  config['sample_step']  == 0:
        with torch.no_grad():
            x_fake_list = [x_fixed]
            for c_fixed in c_fixed_list:
                x_fake_list.append(G(x_fixed, c_fixed))
            x_concat = torch.cat(x_fake_list, dim=3)
            sample_path = os.path.join(config['sample_dir'], '{}-images.jpg'.format(i+1))
            save_image(denorm(x_concat.data.cpu()), sample_path, nrow=1, padding=0)
            print('Saved real and fake images into {}...'.format(sample_path))

    # Save model checkpoints.
    if (i+1) % config['model_save_step'] == 0:
        save_model(G, D, config, i)

    # Decay learning rates.
    if (i+1) % config['lr_update_step'] == 0 and (i+1) > (config['num_iters'] - config['num_iters_decay']):
        g_lr -= (config['g_lr'] / float(config['num_iters_decay']))
        d_lr -= (config['d_lr'] / float(config['num_iters_decay']))
        update_lr(g_optimizer, d_optimizer, g_lr, d_lr)
        print ('Decayed learning rates, g_lr: {}, d_lr: {}.'.format(g_lr, d_lr))
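
To see what the decay rule at the end of the loop does numerically, the sketch below replays the same condition and update in isolation. The function `simulate_lr_decay` and the configuration values are hypothetical, chosen only to keep the arithmetic easy to follow. They are not the values returned by `get_experiment_configuration`.

```python
def simulate_lr_decay(lr0, num_iters, num_iters_decay, lr_update_step):
    """Replay the decay rule from the training loop; return (final_lr, num_decays)."""
    lr = lr0
    num_decays = 0
    for i in range(num_iters):
        if (i + 1) % lr_update_step == 0 and (i + 1) > (num_iters - num_iters_decay):
            lr -= lr0 / float(num_iters_decay)
            num_decays += 1
    return lr, num_decays

# Hypothetical configuration values.
final_lr, num_decays = simulate_lr_decay(
    lr0=1e-4, num_iters=2000, num_iters_decay=1000, lr_update_step=100)
print(num_decays)   # 10 decay steps, at iterations 1100, 1200, ..., 2000
print(final_lr)
```

Each decay step subtracts `lr0 / num_iters_decay`, and a step happens once every `lr_update_step` iterations during the last `num_iters_decay` iterations, so the total reduction is `lr0 / lr_update_step`. With the values above, the learning rate ends near `9.9e-5`. Only with `lr_update_step = 1` would it decay all the way to approximately zero.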

### Q4.7 : Generate new hairstyles for test images (10 points)

We will apply the hairstyle transformations through the trained Generator model.
We will restore the already trained model, load the test data and invoke the Generator with various hairstyle attributes.

You should expect the output to look like the following image.

<img src="https://i.imgur.com/gbNzQk8.jpg" style="width: 600px;"/>

In [None]:
from IPython.display import Image, display

# Choose the iteration number of the saved model (both G, D should be present)
################################################
##### TODO CODE HERE
##### Load the trained model
##### Also, load the data_loader in test mode
G, D = raise NotImplementedError
data_loader = raise NotImplementedError
################################################

with torch.no_grad():
    for i, (x_real, c_org) in enumerate(data_loader):

        # Prepare input images and target domain labels.
        x_real = x_real.to(config['device'])
        c_trg_list = create_labels(c_org, config['c_dim'], config['selected_attributes'])

        # Translate images.
        x_fake_list = [x_real]
        for c_trg in c_trg_list:
            x_fake = G(x_real, c_trg)
            x_fake_list.append(x_fake)

        # Save the translated images.
        x_concat = torch.cat(x_fake_list, dim=3)
        result_path = os.path.join( config['result_dir'], '{}-images.jpg'.format(i+1) )
        save_image(denorm(x_concat.data.cpu()), result_path, nrow=1, padding=0)
        print('Saved real and fake images into {}...'.format(result_path))
        display(Image(filename=result_path))

### Q4.8 : (Bonus) Use other attributes and develop something cool. (20 points)

We have seen how to change hairstyles using GANs so far, but it's possible to use any other attributes and develop something much cooler. You can use any other loss functions or generator/discriminator architectures. Feel free to be creative.