## Introduction

The goal of this tutorial is to implement a Variational Autoencoder (VAE) for Topic Models. The aim is to give you a sense of:


*   How topic models can be implemented under Variational Autoencoder (VAE)
*   How the "*reparametrization trick*" enables backpropagation through latent variables
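
As a rough illustration of the second point, the sketch below samples a latent vector as a differentiable function of the encoder outputs. The tensor shapes, the toy loss, and the reading of `logsigm` as log(sigma) are assumptions made for this example only; they are not taken from the model defined later.

```python
import torch

torch.manual_seed(0)

# Hypothetical encoder outputs for a batch of 4 documents and 3 topics.
mu = torch.zeros(4, 3, requires_grad=True)
logsigm = torch.full((4, 3), -1.0, requires_grad=True)  # assumed to be log(sigma)

# Sample noise that does not depend on the parameters ...
eps = torch.randn_like(mu)
# ... and express z as a deterministic, differentiable function of mu and logsigm.
z = mu + torch.exp(logsigm) * eps

# Any scalar loss computed from z can now be backpropagated to mu and logsigm.
loss = (z ** 2).sum()
loss.backward()

print(mu.grad.shape, logsigm.grad.shape)
```

Because the randomness lives entirely in `eps`, the gradient of the loss with respect to `mu` and `logsigm` is well defined, which would not be the case if `z` were drawn directly from a sampler parameterised by them.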


First, we need to import the necessary packages:

In [None]:
import torch
import torch.nn as nn
from torch.autograd import Variable
from torch.nn import Parameter
import torch.nn.functional as F
import math
import os
import time  # needed by fix_seed() when no seed is given
import string
import numpy as np
import random
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from collections import OrderedDict
from tqdm.notebook import tqdm

In [None]:
###############
# Torch setup #
###############
print('Torch version: {}, CUDA: {}'.format(torch.__version__, torch.version.cuda))
cuda_available = torch.cuda.is_available()

if not torch.cuda.is_available():
  print('WARNING: You may want to change the runtime to GPU for faster training!')
  DEVICE = 'cpu'
else:
  DEVICE = 'cuda:0'

#########################
# Some helper functions #
#########################
def fix_seed(seed=None):
  """Sets the seeds of random number generators."""
  if seed is None:
    # Take a random seed
    seed = time.time()
  seed = int(seed)
  random.seed(seed)  # create_batches() shuffles with the `random` module
  np.random.seed(seed)
  torch.manual_seed(seed)
  torch.cuda.manual_seed(seed)
  return seed

fix_seed(1234)

## Data Preprocessing
### Download dataset

We experiment on a standard news corpus, ***20NewsGroups***, which we download using scikit-learn. This dataset consists of about 20k news articles, each classified into one of 20 topics.

In [None]:
from sklearn.datasets import fetch_20newsgroups

train_news_group = fetch_20newsgroups(subset='train')
test_news_group = fetch_20newsgroups(subset='test')

train_data = train_news_group['data']
test_data = test_news_group['data']

print("Size of training data:", len(train_data))
print("Size of test data:", len(test_data))
print("All topics:", train_news_group.target_names)

### Preprocess Dataset

In this section, we define the functions to do conventional preprocessing and build the vocabulary.
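
One way the vocabulary-building step could look is sketched below on a toy corpus. The stop-word set `STOPS` and the helper `count_words` are hypothetical stand-ins introduced for illustration; the notebook itself uses NLTK's English stop-word list.

```python
from collections import Counter

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
STOPS = {"the", "a", "is", "of", "and", "to", "in"}

def count_words(docs, stops):
    """Count how often each non-stop word occurs across tokenised documents."""
    counts = Counter()
    for doc in docs:
        counts.update(word for word in doc if word not in stops)
    return dict(counts)

docs = [["the", "orbit", "of", "the", "spacecraft"],
        ["spacecraft", "orbit", "is", "stable"],
        ["orbit", "and", "launch"]]

vocab_demo = count_words(docs, STOPS)
# Sort by frequency so that slicing keeps the most frequent words.
top_words = sorted(vocab_demo.items(), key=lambda kv: kv[1], reverse=True)[:2]
print(top_words)
```

Sorting before slicing matters: a plain dictionary preserves insertion order, so taking the first N items without sorting would keep the first N words seen rather than the N most frequent ones.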

In [None]:
def preprocess(samples):
    # Strip all punctuation except apostrophes.
    punctuations = string.punctuation.replace("'", "")
    trans_table = str.maketrans('', '', punctuations)
    output = []
    for item in samples:
        words = item.replace('\n', '').strip().lower().split(' ')
        stripped_words = [word.translate(trans_table) for word in words]
        # Drop empty strings and purely numeric tokens.
        words = [word for word in stripped_words if word and not word.isdigit()]
        output.append(words)
    return output

train_prep = preprocess(train_data)
test_prep = preprocess(test_data)

In [None]:
def get_vocab(data):
    vocab = {}
    
    stops = set(stopwords.words('english'))
    ### -------------- TODO --------------- ###
    # remove stop words and count frequency of words
    
    return vocab


vocab_total = get_vocab(train_prep + test_prep)
print("Total number of words in vocabulary:", len(vocab_total))
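# Sort by frequency (most frequent first) so that the slice below keeps the top words.
vocab_total = dict(sorted(vocab_total.items(), key=lambda x: x[1], reverse=True))
vocab = vocab_total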

'''
Here we filter vocabulary to save some training time, otherwise our model input dimension would be huge (V=350k+).
You can uncomment the line below to include more words (around 52k words in the vocabulary), which would help in classifying the topics (Q4).
'''
vocab = {k:v for k,v in list(vocab_total.items())[:5000]}
# vocab = {k:v for k,v in vocab_total.items() if v > 3}
vocab_size = len(vocab)
print("Vocabulary size after filtering:", vocab_size)

word2idx = {k:n for n,(k,v) in enumerate(vocab.items())}

In [None]:
train_doc = [[word for word in doc if word in vocab] for doc in train_prep]
train_doc = [doc for doc in train_doc if len(doc) > 5]

test_doc = [[word for word in doc if word in vocab] for doc in test_prep]
test_doc = [doc for doc in test_doc if len(doc) > 5]

### Process Bag-of-words Inputs

Next, we define several helper functions to create input batches. Our inputs use a bag-of-words (BoW) representation, where each article/document is a vector of **V** elements, one count per vocabulary word. Batching is also done in this section, so the inputs to the model have dimension *(batch_size, vocab_size)*.
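
To make the representation concrete, here is a minimal sketch on a toy vocabulary. The names `word2idx_demo` and `to_bow` are hypothetical and exist only for this example; the notebook's own helpers store sparse dictionaries first and densify per batch.

```python
from collections import Counter

# Hypothetical toy vocabulary mapping each word to a column index.
word2idx_demo = {"orbit": 0, "spacecraft": 1, "windows": 2, "printer": 3}

def to_bow(doc, word2idx):
    """Turn a tokenised document into a dense count vector of length V."""
    vec = [0] * len(word2idx)
    for word, freq in Counter(doc).items():
        if word in word2idx:          # out-of-vocabulary words are dropped
            vec[word2idx[word]] = freq
    return vec

# A "batch" of two documents: a list of shape (batch_size, vocab_size).
batch = [to_bow(doc, word2idx_demo) for doc in (
    ["orbit", "spacecraft", "orbit"],
    ["printer", "windows", "windows", "mouse"],
)]
print(batch)
```

Word order is discarded entirely; only the counts survive, which is the usual simplification behind bag-of-words topic models.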

In [None]:
from collections import Counter
def data_set(data_url):
    """process data input."""
    data = []
    word_count = []
    for words in data_url:
        word2freq = dict(Counter(words))
        doc = {}
        count = 0

        for word,freq in word2freq.items():
            doc[int(word2idx[word])] = freq
            count += freq

        if count > 0:
            data.append(doc)
            word_count.append(count)

    return data, word_count

In [None]:
def create_batches(data_size, batch_size, shuffle=True):
    """create a batch of indices."""
    batches = []
    ids = list(range(data_size))
    if shuffle:
        random.shuffle(ids)
    for i in range(data_size // batch_size):
        start = i * batch_size
        end = (i + 1) * batch_size
        batches.append(ids[start:end])
    # the batch of which the length is less than batch_size
    rest = data_size % batch_size
    if rest > 0:
        batches.append(ids[-rest:] + [-1] * (batch_size - rest))  # -1 as padding
    return batches

In [None]:
def fetch_data(data, count, idx_batch, vocab_size):
    """fetch input data by batch."""
    batch_size = len(idx_batch)
    data_batch = np.zeros((batch_size, vocab_size))
    count_batch = []
    mask = np.zeros(batch_size)
    indices = []
    values = []
    for i, doc_id in enumerate(idx_batch):
        if doc_id != -1:
            for word_id, freq in data[doc_id].items():
                data_batch[i, word_id] = freq
            count_batch.append(count[doc_id])
            mask[i]=1.0
        else:
            count_batch.append(0)
    return data_batch, count_batch, mask

### Question 1: Finish the neural structures of the VAE encoder and decoder, and the reparameterisation trick.
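
Before filling in the model, it may help to see the two loss terms on plain Python numbers. The sketch below is one common formulation, not necessarily the intended solution: it assumes a standard normal prior, a diagonal Gaussian posterior, a softmax decoder over the vocabulary, and that `logsigm` means log(sigma). The function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]

def reconstruction_loss(bow, logits):
    """Negative log-likelihood of the word counts under softmax(logits)."""
    return -sum(x * lp for x, lp in zip(bow, log_softmax(logits)))

def kl_to_standard_normal(mu, logsigm):
    """KL( N(mu, sigma^2) || N(0, I) ), with logsigm assumed to be log(sigma)."""
    return -0.5 * sum(1.0 + 2.0 * ls - m * m - math.exp(2.0 * ls)
                      for m, ls in zip(mu, logsigm))

bow = [2, 0, 1]
logits = [0.0, 0.0, 0.0]
recons = reconstruction_loss(bow, logits)            # 3 words, each with probability 1/3
kld = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])  # posterior equals the prior
print(recons, kld)
```

With uniform logits each of the three observed words has probability 1/3, so the reconstruction term is 3·log 3, and the KL term vanishes when the posterior matches the prior. If `logsigm` were instead a log-variance, the factors of 2 in the KL expression would change.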

In [None]:
class TopicModel(nn.Module):
    def __init__(self, 
                 vocab_size,
                 input_size,
                 n_hidden,
                 n_topic, 
                 batch_size):
        super(TopicModel, self).__init__()

        self.vocab_size = vocab_size
        self.n_hidden = n_hidden
        self.n_topic = n_topic
        self.batch_size = batch_size
 
        ### -------------- TODO --------------- ###
        self.mu_layer = 
        self.logsigm_layer = 

        ### -------------- TODO --------------- ###
        self.encoder = nn.Sequential(nn.Linear(),
                                     nn.ReLU(),
                                     nn.Linear(),
                                     nn.ReLU())
        
        ### -------------- TODO --------------- ###
        self.decoder =  


    def zero_bias(self,):
        self.mu_layer.bias.data.fill_(0.0)
        self.logsigm_layer.bias.data.fill_(0.0)
        
    def forward(self, input):

        # encoder forward
        doc_vec = self.encoder(input)
        mu = self.mu_layer(doc_vec)
        logsigm = self.logsigm_layer(doc_vec)
        
        # reparameterisation
        ### -------------- TODO --------------- ###
        eps = 
        z =

        # decoder forward
        logits = self.decoder(z)
        
        # reconstruction loss
        ### -------------- TODO --------------- ###
        recons = 
        
        # kl-divergence loss
        ### -------------- TODO --------------- ###
        kld = 
        
        loss = torch.mean(recons + kld)
        recons = torch.mean(recons)
        kld = torch.mean(kld)

        # print(loss, recons, kld)
        

        return loss, recons, kld

### Question 2: Finish the Training File

In [None]:
def main_train():
    num_epoch = 20
    batch_size = 64
    vocab_size = len(vocab)
    n_hidden = 256
    n_topic = 50
    learning_rate = 0.0001
    alternate_epochs = 5
    
    train_set, train_count = data_set(train_doc)
    
    ### -------------- TODO --------------- ###
    model = 
    
    model.zero_bias()
    model.to(DEVICE)


    ### -------------- TODO --------------- ###
    optimizer_enc = torch.optim.Adam(,
                                     lr = learning_rate,
                                     eps= 1e-8)
    optimizer_dec = torch.optim.Adam(, 
                                     lr = learning_rate,
                                     eps= 1e-8)
    
    for epoch in range(num_epoch):
        train_batches = create_batches(len(train_set), batch_size, shuffle=True)
        model.train() 
        
        ### -------------- TODO --------------- ###
        # Question: why do we need two optimizers #
        for switch in range(0, 2): 
            if switch == 0:
                optimizer = optimizer_dec
                print_mode = 'updating decoder'
            else:
                optimizer = optimizer_enc
                print_mode = 'updating encoder'
                
            loss_epoch = 0.0
            recons_epoch = 0.0
            kld_epoch = 0.0
            count = 0
    
            for i in range(alternate_epochs):
                                 
                for idx_batch in train_batches:
                    data_batch, count_batch, mask = fetch_data(train_set, train_count, idx_batch, vocab_size)
                    input = torch.from_numpy(data_batch).float().to(DEVICE)
                    loss, recons, kld = model(input)
                    
                    # optimize
                    optimizer.zero_grad()      
                    loss.backward()        
                    optimizer.step()        
                    # accumulate Python floats so the computation graph is not kept alive
                    loss_epoch += loss.item()
                    recons_epoch += recons.item()
                    kld_epoch += kld.item()
                    count += 1

            print(f'Epoch {epoch} ({print_mode}), loss={loss_epoch/count}, recons={recons_epoch/count}, kld={kld_epoch/count}')

    return model
    

In [None]:
model = main_train()

In [None]:
# save model to use in Q4
torch.save(model.state_dict(), "vae.pt")

### Question 3: Code qualitative analysis for topics (p(x|z))
Now that we have trained the VAE with 50 candidate topics, we can explore how the model clusters words with similar topics together.

In the following section, you will also need to evaluate the perplexity of the VAE model.
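
One common convention for perplexity is the exponential of the total loss divided by the total number of words. The sketch below uses that convention on made-up numbers; `corpus_perplexity` is a hypothetical helper, and other definitions (for example, averaging per-document values) exist.

```python
import math

def corpus_perplexity(doc_losses, doc_word_counts):
    """exp(total loss / total number of words)."""
    total_words = sum(doc_word_counts)
    return math.exp(sum(doc_losses) / total_words)

# Hypothetical per-document losses (recons + kld) and word counts.
losses = [3 * math.log(4), 5 * math.log(4)]
counts = [3, 5]
ppl_demo = corpus_perplexity(losses, counts)
print(ppl_demo)
```

Here every word costs log 4 nats, so the perplexity is approximately 4. Note that `fetch_data` returns a count of 0 and a mask of 0 for padded rows, so those rows would need to be excluded from both sums.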

In [None]:
#Add meta information (authors, time, geolocation etc.) to improve quality of the topics

associations = {
    'jesus': ['prophet', 'jesus', 'matthew', 'christ', 'worship', 'church'],
    'comp ': ['floppy', 'windows', 'microsoft', 'monitor', 'workstation', 'macintosh', 
              'printer', 'programmer', 'colormap', 'scsi', 'jpeg', 'compression'],
    'car  ': ['wheel', 'tire'],
    'polit': ['amendment', 'libert', 'regulation', 'president'],
    'crime': ['violent', 'homicide', 'rape'],
    'midea': ['lebanese', 'israel', 'lebanon', 'palest'],
    'sport': ['coach', 'hitter', 'pitch'],
    'gears': ['helmet', 'bike'],
    'nasa ': ['orbit', 'spacecraft'],
}
def identify_topic_in_line(line):
    topics = []
    for topic, keywords in associations.items():
        for word in keywords:
            if word in line:
                topics.append(topic)
                break
    return topics

def print_top_words(beta, feature_names, n_top_words=10):
    print('---------------Printing the Topics------------------')
    for i in range(len(beta)):
        line = " ".join([feature_names[j][0] for j in beta[i].argsort()[:-n_top_words - 1:-1]])
        topics = identify_topic_in_line(line)
        print('|'.join(topics))
        print('     {}'.format(line))
    print('---------------End of Topics------------------')


def print_perp(model):
    cost=[]
    model.eval()
    test_set, test_count = data_set(test_doc)
    test_batches = create_batches(len(test_set), 64)
    
    ### -------------- TODO --------------- ###
    
    ppl = 
    print('The approximated perplexity is: ', ppl)

In [None]:
# perplexity on test data
print_perp(model)

In [None]:
# model latent topics
emb = model.decoder[0].weight.data.cpu().numpy().T
# feature names must follow the word2idx order, i.e. the iteration order of vocab
print_top_words(emb, list(vocab.items()))

### Question 4: Use Topics to do Classification
In this section, you will use both the articles and their labels to train a topic classifier. First, train a vanilla classifier; you are likely to get around 83% validation accuracy with a vocabulary of 50k words (the accuracy may be lower with a smaller vocabulary). Then use the pre-trained VAE encoder as the classifier's encoder and fine-tune it to see what happens.
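
For reference, loss and accuracy for a single batch can be computed along the following lines. The logits and labels are made-up values, and `criterion_demo` is a hypothetical name; this is a sketch of the metric computation only, not of the full evaluation loop.

```python
import torch
import torch.nn as nn

criterion_demo = nn.CrossEntropyLoss()

# Hypothetical logits for 3 documents over 4 classes, and their gold labels.
logits = torch.tensor([[2.0, 0.1, 0.1, 0.1],
                       [0.1, 0.2, 3.0, 0.1],
                       [1.5, 0.3, 0.2, 0.1]])
labels = torch.tensor([0, 2, 1])

loss = criterion_demo(logits, labels)          # mean cross-entropy over the batch
preds = logits.argmax(dim=1)                   # predicted class per document
accuracy = (preds == labels).float().mean().item()
print(f"loss={loss.item():.3f}, acc={accuracy:.3f}")
```

`nn.CrossEntropyLoss` expects raw logits and integer class indices, so no softmax should be applied beforehand.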

In [None]:
'''
Uncomment these lines to train the classifier on larger vocabulary
NOTE: if you want to use the pre-trained VAE encoder, the vocab size for the classifier should be the same as the VAE model
'''

# vocab = {k:v for k,v in vocab_total.items() if v > 3}
# vocab_size = len(vocab)
# print("Vocabulary size after filtering:", vocab_size)
# word2idx = {k:n for n,(k,v) in enumerate(vocab.items())}

In [None]:
def fetch_labelled_data(data, labels, idx_batch, vocab_size):
    """fetch input data and labels by batch."""
    # drop the -1 padding indices added by create_batches
    idx_batch = [i for i in idx_batch if i != -1]
    batch_size = len(idx_batch)
    data_batch = np.zeros((batch_size, vocab_size))

    texts_batch = [data[i] for i in idx_batch]
    label_batch = [labels[i] for i in idx_batch]

    for i, text in enumerate(texts_batch):
        for word in text:
            if word in vocab:
                data_batch[i, word2idx[word]] += 1

    return data_batch, np.array(label_batch)

In [None]:
class VanillaClassifier(nn.Module):
    def __init__(self, input_size, n_hidden, n_class, dp):
        super().__init__()

        ### --------------------- TODO ----------------------- ###
        # construct the same encoder architecture as the VAE encoder
        self.encoder = 
        self.dropout = nn.Dropout(dp)
        self.output = nn.Linear(n_hidden, n_class, bias=True)

    def forward(self, input):
        doc_vec = self.dropout(self.encoder(input))
        logits = self.output(doc_vec)
        return logits

class VAEClassifier(nn.Module):

    def __init__(self, vae, n_class, dp):
        super().__init__()

        self.encoder = vae.encoder
        self.vae_output = vae.n_hidden
        
        self.dropout = nn.Dropout(dp)
        self.output = nn.Linear(self.vae_output, n_class, bias=True)

    def forward(self, input):
        doc_vec = self.dropout(self.encoder(input))

        logits = self.output(doc_vec)

        return logits

In [None]:
def evaluate(classifier, idx_batches, data, labels, vocab_size, criterion):
    with torch.no_grad():
        total_loss = 0
        total_acc = 0
        val_count = 0

        for idx_batch in idx_batches:
            ### --------------------- TODO ----------------------- ###
            # compute validation loss and accuracy

    return total_loss/val_count, total_acc/val_count

In [None]:
def cls_train(train_data, train_labels, valid_data, valid_labels, vocab_size):
    num_epoch = 10
    batch_size = 64
    dropout = 0.1
    n_hidden = 64
    learning_rate = 0.0001
    
    # model.load_state_dict(torch.load("vae.pt"))
    # classifier = VAEClassifier(model, 20, dropout)
    classifier = VanillaClassifier(vocab_size, n_hidden, 20, dropout)

    classifier.to(DEVICE)

    optimizer = torch.optim.Adam(classifier.parameters(),
                                     lr = learning_rate)
    
    criterion = nn.CrossEntropyLoss()
    
    for epoch in range(num_epoch):
        # size the batches from the arguments, not from the global train_prep/test_prep
        train_batches = create_batches(len(train_labels), batch_size, shuffle=True)
        valid_batches = create_batches(len(valid_labels), batch_size, shuffle=False)
        
                
        train_loss = 0.0
        count = 0
        acc = 0
        classifier.train()              
        for idx_batch in train_batches:
            optimizer.zero_grad()
            ### --------------------- TODO ----------------------- ###
            # finish training loop

            
            count += 1
        
        # validation
        classifier.eval()
        valid_loss, valid_acc = evaluate(classifier, valid_batches, valid_data, valid_labels, vocab_size, criterion)
        print(f'Epoch {epoch},\ttrain_loss={train_loss/count:.3f},\ttrain_acc={acc/count:.3f},\tvalid_loss={valid_loss:.3f},\tvalid_acc={valid_acc:.3f}')

In [None]:
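# the labels must be defined before the first call to cls_train
train_labels = np.array(train_news_group['target'])
test_labels = np.array(test_news_group['target'])

cls_train(train_prep, train_labels, test_prep, test_labels, vocab_size)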

Since training the VAE takes a long time, the vocabulary was truncated to 5,000 words, so the benefit of VAE pre-training may not be obvious. You may try training a new VAE with a larger vocabulary (e.g. 50k words, for 20 epochs), which should help in classifying the topics.

In addition, the validation accuracy may be low if your vocabulary contains only 5,000 words (around 65%). The code below is a baseline that uses all the words; it should achieve around 83% accuracy on the validation set.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
train_features = vectorizer.fit_transform(train_data)
test_features = vectorizer.transform(test_data)

train_labels = np.array(train_news_group['target'])
test_labels = np.array(test_news_group['target'])

vocab_size = len(vectorizer.vocabulary_)
print(vocab_size)

In [None]:
def fetch_labelled_data(features, labels, idx_batch, vocab_size=None):
    idxs = np.array(idx_batch)
    # drop the -1 padding indices added by create_batches
    idxs = idxs[idxs != -1]

    feature_batch = features[idxs, :].toarray()
    label_batch = labels[idxs]
    return feature_batch, label_batch

In [None]:
cls_train(train_features, train_labels, test_features, test_labels, vocab_size)