$$
\newcommand{\mat}[1]{\boldsymbol {#1}}
\newcommand{\mattr}[1]{\boldsymbol {#1}^\top}
\newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}}
\newcommand{\vec}[1]{\boldsymbol {#1}}
\newcommand{\vectr}[1]{\boldsymbol {#1}^\top}
\newcommand{\rvar}[1]{\mathrm {#1}}
\newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}}
\newcommand{\diag}{\mathop{\mathrm {diag}}}
\newcommand{\set}[1]{\mathbb {#1}}
\newcommand{\cset}[1]{\mathcal{#1}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\bb}[1]{\boldsymbol{#1}}
\newcommand{\E}[2][]{\mathbb{E}_{#1}\left[#2\right]}
\newcommand{\ip}[3]{\left<#1,#2\right>_{#3}}
\newcommand{\given}[]{\,\middle\vert\,}
\newcommand{\DKL}[2]{\cset{D}_{\text{KL}}\left(#1\,\Vert\, #2\right)}
\newcommand{\grad}[]{\nabla}
$$

# Part 3: Mini-Project
<a id=part3></a>

In this part you'll implement a small comparative-analysis project, heavily based on the materials from the tutorials and homework.

You must **choose one** of the project options specified below.

### Guidelines

- You should implement the code which displays your results in this notebook, and add any additional code files for your implementation in the `project/` directory. You can import these files here, as we do for the homeworks.
- Running this notebook should not perform any training - load your results from some output files and display them here. The notebook must be runnable from start to end without errors.
- You must include a detailed write-up (in the notebook) of what you implemented and how. 
- Explain the structure of your code and how to run it to reproduce your results.
- Explicitly state any external code you used, including built-in pytorch models and code from the course tutorials/homework.
- Analyze your numerical results, explaining **why** you got these results (not just specifying the results).
- Where relevant, place all results in a table or display them using a graph.
- Before submitting, make sure all files which are required to run this notebook are included in the generated submission zip.
- Try to keep the submission file size under 10MB. Do not include model checkpoint files, dataset files, or any other non-essentials files. Instead include your results as images/text files/pickles/etc, and load them for display in this notebook. 

## Sentiment Analysis with Self-Attention and Word Embeddings

Based on Tutorials 6 and 7, we'll implement and train an improved sentiment analysis model.
We'll use self-attention instead of RNNs and incorporate pre-trained word embeddings.

In tutorial 6 we saw that we can train word embeddings together with the model.
Although this produces embeddings which are customized to the specific task at hand,
it also greatly increases training time.
A common technique is to use pre-trained word embeddings.
This is essentially a large mapping from words (e.g. in english) to some
high-dimensional vector, such that semantically similar words have an embedding that is
"close" by some metric (e.g. cosine distance).
Use the [GloVe](https://nlp.stanford.edu/projects/glove/) 6B embeddings for this purpose.
You can load these vectors into the weights of an `nn.Embedding` layer.

In tutorial 7 we learned how attention can be used to learn to predict a relative importance
for each element in a sequence, compared to the other elements.
Here, we'll replace the RNN with self-attention only approach similar to Transformer models, roughly based on [this paper](https://www.aclweb.org/anthology/W18-6219.pdf).
After embedding each word in the sentence using the pre-trained word-embedding a positional-encoding vector is added to provide each word in the sentence a unique value based on it's location.
One or more self-attention layers are then applied to the results, to obtain an importance weighting for each word.
Then we classify the sentence based on the average these weighted encodings.


Now, using these approaches, you need to:

- Implement a **baseline** model: Use pre-trained embeddings with an RNN-based model.
You can use LSTM/GRU or bi-directional versions of these, in a way very similar to what we implemented in the tutorial.
-  Implement an **improved** model: Based on the self-attention approach, implement an attention-based sentiment analysis model that has 1-2 self-attention layers instead of an RNN. You should use the same pre-trained word embeddings for this model.
- You can use pytorch's built-in RNNs, attention layers, etc.
- For positional encoding you can use the sinosoidal approach described in the paper (first proposed [here](https://arxiv.org/pdf/1706.03762.pdf)). You can use existing online implementations (even though it's straightforward to implement). 
- You can use the SST database as shown in the tutorial.

Your results should include:
- Everything written in the **Guidelines** above.
- A comparative analysis: compare the baseline to the improved model. Compare in terms of overall classification accuracy and show a multiclass confusion matrix.
- Visualize of the attention maps for a few movie reviews from each class, and explain the results.

## Spectrally-Normalized Wasserstein GANs

In HW3 we implemented a simple GANs from scratch, using an approach very similar to the original GAN paper. However, the results left much to be desired and we discovered first-hand how hard it is to train GANs due to their inherent instability.

One of the prevailing approaches for improving training stability for GANs is to use a technique called [Spectral Normalization](https://arxiv.org/pdf/1802.05957.pdf) to normalize the largest singular value of a weight matrix so that it equals 1.
This approach is generally applied to the discriminator's weights in order to stabilize training. The resulting model is sometimes referred to as a SN-GAN.
See Appendix A in the linked paper for the exact algorithm. You can also use pytorch's `spectral_norm`.

Another very common improvement to the vanilla GAN is known a [Wasserstein GAN](https://arxiv.org/pdf/1701.07875.pdf) (WGAN). It uses a simple modification to the loss function, with strong theoretical justifications based on the Wasserstein (earth-mover's) distance.
See also [here](https://developers.google.com/machine-learning/gan/loss) for a brief explanation of this loss function.

One problem with generative models for images is that it's difficult to objectively assess the quality of the resulting images.
To also obtain a quantitative score for the images generated by each model,
we'll use the [Inception Score](https://arxiv.org/pdf/1606.03498.pdf).
This uses a pre-trained Inception CNN model on the generated images and computes a score based on the predicted probability for each class.
Although not a perfect proxy for subjective quality, it's commonly used a way to compare generative models.
You can use an implementation of this score that you find online, e.g. [this one](https://github.com/sbarratt/inception-score-pytorch) or implement it yourself.

Based on the linked papers, add Spectral Normalization and the Wassertein loss to your GAN from HW3.
Compare between:
- The baseline model (vanilla GAN)
- SN-GAN (vanilla + Spectral Normalization)
- WGAN (using Wasserstein Loss)
- Optional: SN+WGAN, i.e. a combined model using both modifications.

As a dataset, you can use [LFW](http://vis-www.cs.umass.edu/lfw/) as in HW3 or [CelebA](http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html), or even choose a custom dataset (note that there's a dataloder for CelebA in `torchvision`). 

Your results should include:
- Everything written in the **Guidelines** above.
- A comparative analysis between the baseline and the other models. Compare:
  - Subjective quality (show multiple generated images from each model)
  - Inception score (can use a subset of the data).
- You should show substantially improved subjective visual results with these techniques.

## Implementation

#### Loading Existing Data

In [75]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import torch
import torchtext
import torch.nn as nn
import project.model as model
import project.self_attention_model as attn_model
import project.glove_parser as glove
import numpy as np 

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
# embedding,word_dict = glove.load_embedding(glove.GloveDimSize.HUNDRED)
# learning_model = model.SimplePredictionModel(embedding_dim = encoding.embedding_dim, hidden_dim=100, num_layers = 4, embedding = encoding,device=device)

In [34]:
ds_train,ds_valid,ds_test,embedding_tensor = model.load_data()
embeding = nn.Embedding.from_pretrained(embedding_tensor)
BATCH_SIZE = 32

# BucketIterator creates batches with samples of similar length
# to minimize the number of <pad> tokens in the batch.
dl_train, dl_valid, dl_test = torchtext.data.BucketIterator.splits(
    (ds_train, ds_valid, ds_test), batch_size=BATCH_SIZE,
    shuffle=True, device='cpu')





In [73]:
batch = next(iter(dl_train))
X, y = batch.text.t(), batch.label
print(x.shape)
lr=0.01
learning_model = attn_model.AttentionModel(embedding = embeding,embedding_dim = embeding.embedding_dim)
optimizer = optim.Adam(learning_model.parameters(),lr=lr)
learning_model.to(device)
X, y = batch.text, batch.label
X = X.t()
loss = nn.NLLLoss()
for i in range(500):
    optimizer.zero_grad()
    output = learning_model(X)
    loss_item=loss(output,y)
    loss_item.backward()
    optimizer.step()
    print(loss_item)


torch.Size([47, 32])
shape of embedded is torch.Size([32, 42, 100])
tensor(1.1649, grad_fn=<NllLossBackward>)
shape of embedded is torch.Size([32, 42, 100])
tensor(1.3608, grad_fn=<NllLossBackward>)
shape of embedded is torch.Size([32, 42, 100])
tensor(0.9767, grad_fn=<NllLossBackward>)
shape of embedded is torch.Size([32, 42, 100])
tensor(0.8732, grad_fn=<NllLossBackward>)
shape of embedded is torch.Size([32, 42, 100])
tensor(0.8927, grad_fn=<NllLossBackward>)
shape of embedded is torch.Size([32, 42, 100])
tensor(0.8356, grad_fn=<NllLossBackward>)
shape of embedded is torch.Size([32, 42, 100])
tensor(0.7249, grad_fn=<NllLossBackward>)
shape of embedded is torch.Size([32, 42, 100])
tensor(0.6384, grad_fn=<NllLossBackward>)
shape of embedded is torch.Size([32, 42, 100])
tensor(0.6009, grad_fn=<NllLossBackward>)
shape of embedded is torch.Size([32, 42, 100])
tensor(0.5543, grad_fn=<NllLossBackward>)
shape of embedded is torch.Size([32, 42, 100])
tensor(0.4647, grad_fn=<NllLossBackward>)


KeyboardInterrupt: 

In [None]:
from project.HW3_additions import training
from project.HW3_additions.training import LSTMTrainer, AttentionTrainer
import torch.optim as optim
lr=0.01
learning_model = attn_model.AttentionModel(embedding = embeding,embedding_dim = embeding.embedding_dim)
learning_model.to(device)
optimizer = optim.Adam(learning_model.parameters(),lr=lr)
print(learning_model)
# print(f"parameters are {len(params)}")
# print(f"optimizer is {optimizer}")
# scheduler = optim.lr_scheduler.ReduceLROnPlateau(
#     optimizer, mode='max', factor=0.1, patience=0.1, verbose=True
# )


trainer = AttentionTrainer(learning_model,loss,optimizer,device)
trainer.fit(dl_train,dl_test,500)
# for epoch in range(500):
#     epoch_result = trainer.train_epoch(dl_train, verbose=True)
#     epoch_test_result = trainer.test_epoch(dl_test, verbose=True)
#     # Every X epochs, we'll generate a sequence starting from the first char in the first sequence
#     # to visualize how/if/what the model is learning.
#     if epoch == 0 or (epoch+1) % 25 == 0:
#         avg_loss = np.mean(epoch_result.losses)
#         accuracy = np.mean(epoch_result.accuracy)
#         print(f'\nEpoch #{epoch+1}: Avg. loss = {avg_loss:.3f}, Accuracy = {accuracy:.2f}%')

AttentionModel(
  (embedding): Embedding(15482, 100)
  (pos_enc): PositionalEncoding(
    (dropout): Dropout(p=0.4, inplace=False)
  )
  (attn1): MultiplicativeAttention(
    (softmax): Softmax(dim=-1)
  )
  (bnorm_1): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
  (attn2): MultiplicativeAttention(
    (softmax): Softmax(dim=-1)
  )
  (bnorm_2): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
  (attn3): MultiplicativeAttention(
    (softmax): Softmax(dim=-1)
  )
  (bnorm_3): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
  (fc): Linear(in_features=100, out_features=3, bias=False)
)
device is cpu
--- EPOCH 1/500 ---
train_batch (Avg. Loss 0.954, Accuracy 56.4): 100%|██████████| 267/267 [00:04<00:00, 57.33it/s]
test_batch (Avg. Loss 0.860, Accuracy 61.8): 100%|██████████| 70/70 [00:00<00:00, 184.92it/s]
--- EPOCH 2/500 ---
train_batch (Avg. Loss 0.868, Accuracy 61.5): 100%|██████████| 267/267 [00:05<00:00, 51.79it/s]
test_batch (Avg. Loss 0.851, Accuracy 63.8): 100%