# ECE 590, Spring 2024
## Homework 2

## Full name:

You may submit in any format, as long as the submission includes both your written answers and your code.
- Example 1: a PDF report and a folder of code
- Example 2: this Jupyter notebook

## Question 1: Training a Light Score-Based Generative Model with Sliced Score Matching on MNIST

**Objective:** Implement and train a lightweight score-based generative model using the sliced score matching technique. The goal is to learn the data distribution's score for generating new samples similar to the training data.

**Dataset:** Use the MNIST dataset, which consists of 70,000 28x28 grayscale images of handwritten digits (0-9), divided into 60,000 training images and 10,000 test images. MNIST is available at https://pytorch.org/vision/main/generated/torchvision.datasets.MNIST.html. To reduce computational complexity, you can downscale the MNIST images to 7x7. Score models trained on either 28x28 or 7x7 MNIST will receive full credit.

**Tasks:**
1. Data preparation: Normalize the MNIST images to have pixel values between -1 and 1.
2. Model Architecture: Construct a simple convolutional neural network (CNN) for estimating the data distribution's score. This network should accept a noisy image as input and output a score estimate.
3. Sliced Score Matching: Implement the sliced score matching objective. Add Gaussian noise to the input images, and train the model to approximate the score of the noise-perturbed data distribution.
4. Training: Use a smaller batch size if necessary to accommodate memory constraints. Train the model using a straightforward optimizer like Adam, with a conservative learning rate (e.g., 1e-3). Consider reducing the number of training epochs and implementing checkpointing to save the model intermittently.
5. Evaluation and Generation: Evaluate the model qualitatively by visual inspection of generated digits.

(A helpful website: https://github.com/mfkasim1/score-based-tutorial/blob/main/01-SGM-without-SDE.ipynb)
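
The sliced score matching objective in Task 3 can be sketched as follows. This is a minimal illustration, not the assignment's CNN: the names `sliced_score_matching_loss` and `toy_net`, the noise level `noise_std=0.1`, and the use of a single Gaussian projection per sample are all assumptions made here. For each noisy sample $x$ and random direction $v$, the loss averages $v^\top \nabla_x\big(v^\top s_\theta(x)\big) + \tfrac{1}{2}\big(v^\top s_\theta(x)\big)^2$, which needs only one extra backward pass instead of a full Jacobian.

```python
import torch
import torch.nn as nn

def sliced_score_matching_loss(score_net, x, noise_std=0.1):
    """One-projection sliced score matching loss on a batch x (any shape (B, ...))."""
    # Perturb the data with Gaussian noise, then track gradients w.r.t. the input
    x = x + noise_std * torch.randn_like(x)
    x = x.detach().requires_grad_(True)

    # One random projection direction v ~ N(0, I) per sample
    v = torch.randn_like(x)

    score = score_net(x)  # s_theta(x), same shape as x
    v_dot_s = (v * score).flatten(start_dim=1).sum(dim=1)  # v^T s_theta(x), shape (B,)

    # Gradient of v^T s_theta(x) w.r.t. x; create_graph=True keeps the loss trainable
    grad = torch.autograd.grad(v_dot_s.sum(), x, create_graph=True)[0]
    v_jac_v = (grad * v).flatten(start_dim=1).sum(dim=1)  # v^T (ds/dx) v, shape (B,)

    return (v_jac_v + 0.5 * v_dot_s ** 2).mean()

# Toy usage on 2-D vectors with a small MLP (a stand-in for the real score network)
torch.manual_seed(0)
toy_net = nn.Sequential(nn.Linear(2, 32), nn.Softplus(), nn.Linear(32, 2))
x = torch.randn(16, 2)
loss = sliced_score_matching_loss(toy_net, x)
loss.backward()
```

The per-sample gradient trick assumes the network treats samples independently, so layers that mix information across the batch (such as batch normalization in training mode) would break it.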

In [None]:
""" Data Preparation """

from torchvision.datasets import MNIST
from torchvision import transforms
from torch.utils.data import DataLoader

# Training transform: light augmentation, then scale pixel values to [-1, 1]
train_transform = transforms.Compose([
    transforms.RandomAffine(degrees=5, translate=(0.05, 0.05)),  # slight rotation and translation
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))  # maps [0, 1] to [-1, 1]
])

# Test transform: same normalization, but no random augmentation
test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

# Load the MNIST dataset
train_dataset = MNIST(root='./data', train=True, download=True, transform=train_transform)
test_dataset = MNIST(root='./data', train=False, download=True, transform=test_transform)

# DataLoader
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)


In [None]:
""" Model Architecture """
import torch
import torch.nn as nn

# The score network takes an input with the MNIST image dimensions and returns
# an output of the same size. Score networks usually follow a U-Net structure.
class scoreNet(nn.Module):
    def __init__(self):
        super(scoreNet, self).__init__()
        """YOUR CODE"""


    def forward(self, x):
        return

In [None]:
"""Sliced Score Matching"""
# Implement the sliced score matching function, as illustrated in slides 26-28
# of Lecture 4 of the Teaching Staff Lecture Slides on Sakai.


In [None]:
"""Training"""
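import time
import torch

# Instantiate the model defined above and optimize its parameters
score_network = scoreNet()
opt = torch.optim.Adam(score_network.parameters(), lr=3e-4)

t0 = time.time()
for i_epoch in range(5000):  # reduce the epoch count to fit your compute budget
    total_loss = 0
    for data, _ in train_loader:  # MNIST batches are (image, label); labels are unused
        opt.zero_grad()

        # training step; calc_loss is assumed to be the sliced score matching
        # loss you implemented in the cell above
        loss = calc_loss(score_network, data)
        loss.backward()
        opt.step()

        # running stats
        total_loss = total_loss + loss.detach().item() * data.shape[0]

    # print the training stats
    if i_epoch % 500 == 0:
        print(f"{i_epoch} ({time.time() - t0}s): {total_loss / len(train_dataset)}")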

In [None]:
"""Evaluation and Generation"""
# Sample 9 images from your trained score model and plot them in a 3x3 grid
# using matplotlib's subplot function.

## Question 2: Sampling from energy models with Langevin dynamics and Stein scores

Energy-based models learn an energy functional $E_{\theta}:\mathcal{X}\rightarrow \mathbb{R}$. We consider the corresponding Gibbs distribution:

$p_{\theta}(x) = \frac{1}{Z_{\theta}}e^{-E_{\theta}(x)}$, where $Z_{\theta} = \int_{\mathcal{X}}e^{-E_{\theta}(y)}dy$.

Directly sampling from $p_{\theta}$ is hard, but we can approximate samples using a Markov chain whose stationary distribution is $p_{\theta}$. Specifically, we use the discretized Langevin dynamics:

<!-- '''$\frac{d x_t}{dt} = \nabla_x\log p_{\theta}(x_t)dt+\sqrt{2}dW_t,$ -->

<!-- where $dW_t$ is a white noise process, given by the Brownian motion $W_t$. -->

<!-- (Diffusion following these dynamics converges asymptotically to samples $x_t\sim p_{\theta}$, in the sense that $D(x_t\|p_{\theta})→0$ as $t→∞$.) -->

<!-- Discretizing the Langevin dynamics, we have -->

$x_{t+1} = x_t+\eta\nabla_x\log p_{\theta}(x_t)+\sqrt{2\eta}\epsilon_t,$

where $\epsilon_t\sim\mathcal{N}(0,I)$ and $\eta$ is the step size.

We consider a 2D case, where $x\in\mathbb{R}^2$. Let $E_{\theta}(x) = \theta\cdot x$, where $\theta\in\mathbb{R}^{2}$ is a vector containing all the parameters.

Calculate the expression for the distribution of $x_N$, where $x_0\sim \mathcal{N}(0,I)$ and $N$ is the number of steps, in terms of $\eta, \theta, N$.

(You can implement and see if your computational results match your analytical results. A helpful website: https://courses.cs.washington.edu/courses/cse599i/20au/resources/L16_ebm.pdf)
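
A numerical check of this kind can be sketched as below. The example deliberately uses a different energy from the question, the quadratic $E(x) = \tfrac{1}{2}\|x\|^2$, whose score is $\nabla_x \log p(x) = -x$ and whose Gibbs distribution is $\mathcal{N}(0, I)$. The helper `langevin_sample` and its arguments are hypothetical names. It applies the standard unadjusted Langevin update, in which each step moves along the score and adds Gaussian noise scaled by $\sqrt{2\eta}$.

```python
import numpy as np

def langevin_sample(score_fn, x0, step_size, n_steps, rng):
    """Run n_steps of unadjusted Langevin dynamics on a batch of chains x0 of shape (B, d)."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        # x_{t+1} = x_t + eta * score(x_t) + sqrt(2 * eta) * eps_t
        x = x + step_size * score_fn(x) + np.sqrt(2.0 * step_size) * noise
    return x

rng = np.random.default_rng(0)
# Start 5000 independent 2-D chains away from the target on purpose
x0 = 3.0 + rng.standard_normal((5000, 2))
samples = langevin_sample(lambda x: -x, x0, step_size=0.01, n_steps=1000, rng=rng)

print("empirical mean:", samples.mean(axis=0))
print("empirical covariance:\n", np.cov(samples.T))
```

To test your own analytical answer, swap in the score of the energy from the question and compare the empirical mean and covariance after $N$ steps with your formula.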

## Question 3: Transformer for translation

Here, we implement transformers for neural machine translation (NMT), such as turning "Hello world" into "Salut le monde". You will follow these steps:
1. Load and prepare the data. We provide "en-fr.txt". Each line of this file contains an English phrase, the equivalent French phrase, and an attribution identifying where the translation came from. The file can also be found at: https://github.com/jeffprosise/Applied-Machine-Learning/tree/main/Chapter%2013/Data
2. Build and train a model. Implement a transformer from scratch in PyTorch. We will provide you with an existing implementation in Keras. You might also find https://github.com/gordicaleksa/pytorch-original-transformer useful.
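
A from-scratch transformer is built around scaled dot-product attention, $\mathrm{softmax}(QK^\top/\sqrt{d_k})\,V$. A minimal PyTorch sketch of that one operation follows. The function name and the boolean mask convention (`True` means "may attend") are choices made here for illustration, not a required interface.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: tensors of shape (..., seq_len, d_k). Returns (output, attention weights)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (..., L_q, L_k)
    if mask is not None:
        # Positions where mask is False get -inf, hence zero weight after softmax
        scores = scores.masked_fill(~mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights

# Self-attention on a random batch with a causal (lower-triangular) mask,
# as used in the decoder so that position i only sees positions <= i
torch.manual_seed(0)
q = k = v = torch.randn(2, 5, 8)
causal_mask = torch.tril(torch.ones(5, 5, dtype=torch.bool))
out, attn = scaled_dot_product_attention(q, k, v, mask=causal_mask)
```

Multi-head attention applies this same function in parallel to several learned linear projections of the inputs and concatenates the results.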

For deliverables, plot your training and validation accuracy. The x-axis should be epoch, the y-axis should be your translation accuracy.

For reference, the provided code at https://github.com/jeffprosise/Applied-Machine-Learning/blob/main/Chapter%2013/Neural%20Machine%20Translation%20(Transformer).ipynb achieves 85% accuracy after 14 epochs. You do not have to achieve the same performance to get full marks; just show understanding and functional code.
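
One possible way to compute the accuracy for the plot is per-token accuracy that ignores padding positions. The sketch below assumes padding uses token id 0 (the default padding value of `pad_sequences`) and that the model outputs logits of shape `(batch, seq_len, vocab_size)`. `masked_accuracy` is a hypothetical helper name, and other definitions of translation accuracy are equally acceptable.

```python
import torch
import torch.nn.functional as F

def masked_accuracy(logits, targets, pad_id=0):
    """Fraction of non-padding target tokens that the model predicts correctly."""
    preds = logits.argmax(dim=-1)        # (batch, seq_len)
    mask = targets != pad_id             # True at real (non-padding) tokens
    correct = (preds == targets) & mask
    return correct.sum().item() / max(mask.sum().item(), 1)

# Tiny hand-made example: 7 real tokens in total, the rest is padding (id 0)
targets = torch.tensor([[2, 5, 3, 0, 0],
                        [2, 7, 8, 3, 0]])
logits = F.one_hot(targets, num_classes=10).float()  # "perfect" predictions
logits[0, 1] = 0.0
logits[0, 1, 9] = 1.0                                # force exactly one wrong token
acc = masked_accuracy(logits, targets)
print(f"masked accuracy: {acc:.4f}")
```

Averaging this value over all batches of an epoch, separately for the training and validation sets, gives the two curves to plot.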

In [None]:
"""Clean the text by removing punctuation symbols and numbers, converting
characters to lowercase, and replacing Unicode characters with their ASCII
equivalents. For the French samples, insert [start] and [end] tokens at the
 beginning and end of each phrase"""
import pandas as pd
import re
from unicodedata import normalize

df = pd.read_csv('Data/en-fr.txt', names=['en', 'fr', 'attr'], usecols=['en', 'fr'], sep='\t')
df = df.sample(frac=1, random_state=42)
df = df.reset_index(drop=True)
df.head()

def clean_text(text):
    text = normalize('NFD', text.lower())
    text = re.sub('[^A-Za-z ]+', '', text)
    return text

def clean_and_prepare_text(text):
    text = '[start] ' + clean_text(text) + ' [end]'
    return text

df['en'] = df['en'].apply(lambda row: clean_text(row))
df['fr'] = df['fr'].apply(lambda row: clean_and_prepare_text(row))
df.head()

In [None]:
"""The next step is to scan the phrases and determine the maximum length of the
English phrases and then of the French phrases. These lengths will determine
the lengths of the sequences input to and output from the model"""
en = df['en']
fr = df['fr']

en_max_len = max(len(line.split()) for line in en)
fr_max_len = max(len(line.split()) for line in fr)
sequence_len = max(en_max_len, fr_max_len)

print(f'Max phrase length (English): {en_max_len}')
print(f'Max phrase length (French): {fr_max_len}')
print(f'Sequence length: {sequence_len}')

In [None]:
"""Now fit one Tokenizer to the English phrases and another Tokenizer to their
French equivalents, and generate padded sequences for all the phrases"""
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

en_tokenizer = Tokenizer()
en_tokenizer.fit_on_texts(en)
en_sequences = en_tokenizer.texts_to_sequences(en)
en_x = pad_sequences(en_sequences, maxlen=sequence_len, padding='post')

fr_tokenizer = Tokenizer(filters='!"#$%&()*+,-./:;<=>?@\\^_`{|}~\t\n')
fr_tokenizer.fit_on_texts(fr)
fr_sequences = fr_tokenizer.texts_to_sequences(fr)
fr_y = pad_sequences(fr_sequences, maxlen=sequence_len + 1, padding='post')

In [None]:
"""Compute the vocabulary sizes from the Tokenizer instances"""
en_vocab_size = len(en_tokenizer.word_index) + 1
fr_vocab_size = len(fr_tokenizer.word_index) + 1

print(f'Vocabulary size (English): {en_vocab_size}')
print(f'Vocabulary size (French): {fr_vocab_size}')

In [None]:
"""Finally, create the features and the labels the model will be trained with.
The features are the padded English sequences and the padded French sequences
minus the [end] tokens. The labels are the padded French sequences minus the
[start] tokens. Package the features in a dictionary so they can be input to a
model that accepts multiple inputs."""
inputs = { 'encoder_input': en_x, 'decoder_input': fr_y[:, :-1] }
outputs = fr_y[:, 1:]

Now, define and train the transformer in PyTorch. We provide some example code in Keras here, **but note that you have to write your implementation in PyTorch**.
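
A possible starting point on the PyTorch side is to wrap the padded arrays in a `TensorDataset` and hold out 20% of the samples for validation. The arrays below are random stand-ins with hypothetical sizes so that the snippet runs by itself; in the notebook you would use `en_x` and `fr_y` from the cells above. The loader names and the batch size of 64 are also assumptions.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Stand-in data with hypothetical sizes; replace with en_x and fr_y
rng = np.random.default_rng(0)
seq_len = 12
en_x_demo = rng.integers(0, 100, size=(1000, seq_len))
fr_y_demo = rng.integers(0, 120, size=(1000, seq_len + 1))

enc_in = torch.as_tensor(en_x_demo, dtype=torch.long)
fr_all = torch.as_tensor(fr_y_demo, dtype=torch.long)
dec_in = fr_all[:, :-1]   # decoder input: French sequence without its last token
labels = fr_all[:, 1:]    # labels: the same sequence shifted left by one token

dataset = TensorDataset(enc_in, dec_in, labels)
n_val = int(0.2 * len(dataset))
train_set, val_set = random_split(
    dataset, [len(dataset) - n_val, n_val],
    generator=torch.Generator().manual_seed(42),  # reproducible split
)
nmt_train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
nmt_val_loader = DataLoader(val_set, batch_size=64)
```

Each batch is then a triple `(encoder_input, decoder_input, labels)` of integer tensors, ready to pass through embedding layers.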

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.layers import Input, Dense, Dropout
from keras_nlp.layers import TokenAndPositionEmbedding, TransformerEncoder
from keras_nlp.layers import TransformerDecoder
from tensorflow.keras.callbacks import EarlyStopping
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set()

np.random.seed(42)
tf.random.set_seed(42)

num_heads = 8
embed_dim = 256

encoder_input = Input(shape=(None,), dtype='int64', name='encoder_input')
x = TokenAndPositionEmbedding(en_vocab_size, sequence_len, embed_dim)(encoder_input)
encoder_output = TransformerEncoder(embed_dim, num_heads)(x)
encoded_seq_input = Input(shape=(None, embed_dim))

decoder_input = Input(shape=(None,), dtype='int64', name='decoder_input')
x = TokenAndPositionEmbedding(fr_vocab_size, sequence_len, embed_dim, mask_zero=True)(decoder_input)
x = TransformerDecoder(embed_dim, num_heads)(x, encoded_seq_input)
x = Dropout(0.4)(x)

decoder_output = Dense(fr_vocab_size, activation='softmax')(x)
decoder = Model([decoder_input, encoded_seq_input], decoder_output)
decoder_output = decoder([decoder_input, encoder_output])

model = Model([encoder_input, decoder_input], decoder_output)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary(line_length=120)

callback = EarlyStopping(monitor='val_accuracy', patience=3, restore_best_weights=True)
hist = model.fit(inputs, outputs, epochs=50, validation_split=0.2, callbacks=[callback])

acc = hist.history['accuracy']
val = hist.history['val_accuracy']
epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, '-', label='Training accuracy')
plt.plot(epochs, val, ':', label='Validation accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.show()

## Question 4: BERT for sentiment analysis

For the last problem, we will learn how to use the Hugging Face libraries to train a simple BERT classifier for sentiment analysis.

We will use the IMDB dataset. You can load the dataset from Hugging Face using the following commands:

```
from datasets import load_dataset
imdb = load_dataset("imdb")
```
To access BERT, use
```
from transformers import BertForSequenceClassification
# Load pre-trained BERT; IMDB sentiment is binary (negative/positive), so 2 labels
model = BertForSequenceClassification.from_pretrained('bert-base-uncased',
                                                      num_labels=2,
                                                      output_attentions=False,
                                                      output_hidden_states=False)
```
To reduce training complexity, you can choose to freeze the weights of the pretrained BERT model and train only the classifier. The classifier should have at least 3 layers.
You might find https://huggingface.co/blog/sentiment-analysis-python and https://github.com/baotramduong/Twitter-Sentiment-Analysis-with-Deep-Learning-using-BERT/blob/main/Notebook.ipynb helpful.
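
Freezing the pretrained weights and training only a classifier can be sketched as below. To keep the snippet runnable without downloading any weights, a small `nn.Linear` stands in for the BERT encoder. `freeze` and `ClassifierHead` are hypothetical names, the layer widths are arbitrary, and the hidden size of 768 matches `bert-base-uncased`.

```python
import torch
import torch.nn as nn

def freeze(module):
    """Exclude all parameters of `module` from gradient updates."""
    for p in module.parameters():
        p.requires_grad = False
    module.eval()  # also disables dropout inside the frozen part

class ClassifierHead(nn.Module):
    """Three linear layers mapping a sentence feature vector to class logits."""
    def __init__(self, hidden_size=768, num_labels=2, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, 256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, 64), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(64, num_labels),
        )

    def forward(self, features):
        return self.net(features)

encoder = nn.Linear(32, 768)  # stand-in for the pretrained BERT encoder
freeze(encoder)
head = ClassifierHead()
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)  # only the head is trained

features = encoder(torch.randn(4, 32))  # (batch, hidden_size)
logits = head(features)                 # (batch, num_labels)
loss = nn.functional.cross_entropy(logits, torch.tensor([0, 1, 1, 0]))
loss.backward()
optimizer.step()
```

With the real model, `features` would be a pooled sentence representation from the encoder, for example the `pooler_output` of `BertModel`, and the same `freeze` call would be applied to the BERT module.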

