# Assignment 3: Text processing with LSTM in PyTorch

*Author:* Thomas Adler

*Copyright statement:* This material, no matter whether in printed or electronic form, may be used for personal and non-commercial educational use only. Any reproduction of this manuscript, no matter whether as a whole or in parts, no matter whether in printed or in electronic form, requires explicit prior acceptance of the authors.

In this assignment you will train an LSTM to generate text. To be able to feed text into (recurrent) neural networks we first have to choose a good representation. There are several options, ranging from simple character embeddings to more sophisticated approaches like [word embeddings](https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa) or [token embeddings](https://medium.com/@_init_/why-bert-has-3-embedding-layers-and-their-implementation-details-9c261108e28a). We will use a character embedding in this assignment.

Character embeddings work as follows. First we define an alphabet, a set of characters that we want to be able to represent. To feed a character into our network we use a one-hot vector. The dimension of this vector is equal to the size of our alphabet and the "hot" position indicates the character we want to represent. While this is logically a decent representation (all characters have the same norm, are orthogonal to one another, etc.), it is inefficient in terms of memory because we have to store a lot of zeros. In the first layer of our network we multiply our one-hot vector with a weight matrix, i.e. we compute the preactivation by a matrix-vector product of the form $We_i$, where $e_i$ is the $i$-th canonical basis vector. This operation corresponds to selecting the $i$-th column of $W$, so an efficient implementation performs a simple lookup operation in $W$. Embedding layers for words or tokens work the same way: they are learnable lookup tables.
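
The equivalence between the matrix-vector product and a lookup can be checked numerically. The following is a minimal sketch; the sizes and variable names are chosen for illustration only. Note that `torch.nn.Embedding` stores its table with one row per character, so its weight corresponds to $W^\top$ in the notation above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
alphabet_size, embedding_dim = 5, 3  # illustrative sizes
W = torch.randn(embedding_dim, alphabet_size)

i = 2
e_i = F.one_hot(torch.tensor(i), num_classes=alphabet_size).float()

via_matmul = W @ e_i   # full matrix-vector product with the one-hot vector
via_lookup = W[:, i]   # selecting the i-th column gives the same result

# nn.Embedding keeps one row per character, i.e. it stores W transposed
embedding = nn.Embedding(alphabet_size, embedding_dim)
with torch.no_grad():
    embedding.weight.copy_(W.T)
via_embedding = embedding(torch.tensor(i))

print(torch.allclose(via_matmul, via_lookup))
print(torch.allclose(via_embedding, via_lookup))
```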

## Exercise 1: Encoding characters

Write a class `Encoder` that implements the methods `__init__` and `__call__`. The method `__init__` takes a string as argument that serves as the alphabet. The method `__call__` takes one argument. If it is a string, then it should return a sequence of integers as a one-dimensional `torch.Tensor` representing the input string. Each integer should represent a character of the alphabet. The alphabet consists of the characters matched by the regex `[a-z0-9 .!?]`. If the input text contains characters that are not in the alphabet, then `__call__` should either remove them or map them to a corresponding character that belongs to the alphabet. If the argument is a `torch.Tensor`, then the method should return a string representation of the input, i.e. it should function as a decoder.

In [56]:
import re
import torch

########## YOUR SOLUTION HERE ##########

class Encoder:

    def __init__(self, alphabet_string: str):
        # Sorted list of unique characters; the list index is the integer code
        self.alphabet_list = sorted(set(alphabet_string))
        pattern = re.compile(r"[a-z0-9 .!?]")
        for char in self.alphabet_list:
            assert pattern.fullmatch(char), f"'{char}' does not match alphabet"
        # Reverse lookup: character -> integer code
        self.alphabet = {char: idx for idx, char in enumerate(self.alphabet_list)}

    def __call__(self, arg):
        if isinstance(arg, torch.Tensor):
            # Decode: map every integer code back to its character
            return ''.join(self.alphabet_list[int(idx)] for idx in arg)
        elif isinstance(arg, str):
            # Encode: lowercase the text and drop characters outside the alphabet.
            # Integer (long) codes are what an embedding layer expects as input.
            codes = [self.alphabet[c] for c in arg.lower() if c in self.alphabet]
            return torch.tensor(codes, dtype=torch.long)
        raise TypeError(f"expected str or torch.Tensor, got {type(arg)}")


encoder = Encoder("test string tensor")
tensor_ = encoder("string")
string_ = encoder(torch.tensor([6, 7, 5, 3, 4, 2]))
string_tensor = encoder(encoder("stringtensor"))

print(tensor_)
print(string_)
print(string_tensor)

tensor([7, 8, 6, 3, 4, 2])
rsoing
stringtensor


## Exercise 2: PyTorch Dataset

Write a class `TextDataset` that derives from `torch.utils.data.Dataset`. It should wrap a text file and make it usable for training with PyTorch. Implement the methods `__init__`, `__len__`, `__getitem__`. The method `__init__` should take a path to a text file as string and an integer `l` specifying the length of one sample sequence. The method `__len__` takes no arguments and should return the size of the dataset, i.e. the number of sample sequences in the dataset. The method `__getitem__` should take an integer indexing a sample sequence and should return that sequence as a `torch.Tensor`. The input file can be viewed as one long sequence. The first sample sequence consists of the characters at positions `0..l-1` in the input file. The second sequence consists of the characters at positions `l..2*l-1` and so on. That is, the samples of our dataset are non-overlapping sequences. The last incomplete sequence may be dropped.
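
As a small worked example of the indexing scheme, the sketch below splits a short string into non-overlapping samples with plain Python slicing. The helper name `split_into_samples` is hypothetical and only illustrates the arithmetic; it is not part of the required interface.

```python
def split_into_samples(text, l):
    """Return the non-overlapping samples of length l, dropping the incomplete tail."""
    n_samples = len(text) // l  # integer division discards the last partial sample
    return [text[i * l:(i + 1) * l] for i in range(n_samples)]

samples = split_into_samples("abcdefghij", 4)
print(samples)  # ['abcd', 'efgh']
```

With 10 characters and `l = 4` there are `10 // 4 = 2` samples; the trailing `"ij"` is dropped.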

In [65]:
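import re
import torch
from torch.utils.data import Dataset

########## YOUR SOLUTION HERE ##########

class TextDataset(Dataset):

    def __init__(self, file_path: str, l: int):
        # Read the whole file as one long character sequence
        with open(file_path, "r", encoding="utf-8") as f:
            text = f.read().lower()
        # Design choice: collapse whitespace to single spaces, drop all other
        # characters that are not part of the alphabet
        text = re.sub(r"\s+", " ", text)
        text = re.sub(r"[^a-z0-9 .!?]", "", text)
        # Encode with the Encoder from Exercise 1, using the full alphabet
        self.encoder = Encoder("abcdefghijklmnopqrstuvwxyz0123456789 .!?")
        self.data = self.encoder(text).long()
        self.l = l

    def __len__(self):
        # Number of complete, non-overlapping sequences; the tail is dropped
        return len(self.data) // self.l

    def __getitem__(self, idx: int):
        if not 0 <= idx < len(self):
            raise IndexError(f"sample index {idx} out of range")
        return self.data[idx * self.l:(idx + 1) * self.l]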




## Exercise 3: The Model

Write a class `NextCharLSTM` that derives from `torch.nn.Module` and takes `alphabet_size`, the `embedding_dim`, and the `hidden_dim` as arguments. It should consist of a `torch.nn.Embedding` layer that maps the alphabet to embeddings, a `torch.nn.LSTM` that takes the embeddings as inputs and maps them to hidden states, and a `torch.nn.Linear` output layer that maps the hidden states of the LSTM back to the alphabet. Implement the methods `__init__` that sets up the module and `forward` that takes an input sequence and returns the logits (i.e. no activation function on the output layer) of the model prediction at every time step. 

In [None]:
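import torch.nn as nn
import torch.nn.functional as F

class NextCharLSTM(nn.Module):

    def __init__(self, alphabet_size: int, embedding_dim: int, hidden_dim: int):
        super().__init__()
        # Learnable lookup table: one embedding vector per character
        self.embedding = nn.Embedding(alphabet_size, embedding_dim)
        # Assumption: inputs are laid out as (batch, time), hence batch_first=True
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        # Maps every hidden state back to one score per character
        self.output = nn.Linear(hidden_dim, alphabet_size)

    def forward(self, input_sequence):
        embedded = self.embedding(input_sequence)   # (batch, time, embedding_dim)
        hidden_states, _ = self.lstm(embedded)      # (batch, time, hidden_dim)
        logits = self.output(hidden_states)         # (batch, time, alphabet_size)
        return logits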
        

########## YOUR SOLUTION HERE ##########

## Exercise 4: Training/Validation Epoch

Write a function `epoch` that takes a `torch.utils.data.DataLoader`, a `NextCharLSTM`, and a `torch.optim.Optimizer` as arguments, where the last one might be `None`. If the optimizer is `None`, then the function should validate the model. Otherwise it should train the model for next-character prediction in the many-to-many setting. That is, given a sequence `x` of length `l`, the input sequence is `x[:l-1]` and the corresponding target sequence is `x[1:]`. The function should perform one epoch of training/validation and return the loss values of each mini-batch as a numpy array. Use the cross-entropy loss function for both training and validation.
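
One possible shape of such a function is sketched below. It is not a reference solution: it assumes that each batch is an integer tensor of shape `(batch, l)` and that the model returns logits of shape `(batch, l-1, alphabet_size)`. To keep the example self-contained, a small embedding-plus-linear stand-in replaces `NextCharLSTM`, and random integers replace the text data.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader

def epoch(data_loader, model, optimizer=None):
    """Run one epoch; train if an optimizer is given, otherwise validate."""
    training = optimizer is not None
    model.train(training)
    losses = []
    with torch.set_grad_enabled(training):
        for x in data_loader:
            # Many-to-many: predict character t+1 from characters 0..t
            inputs, targets = x[:, :-1], x[:, 1:]
            logits = model(inputs)
            # cross_entropy expects (N, classes) logits and (N,) targets
            loss = F.cross_entropy(
                logits.reshape(-1, logits.shape[-1]), targets.reshape(-1)
            )
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            losses.append(loss.item())
    return np.array(losses)

# Stand-in data and model (assumptions for illustration only)
torch.manual_seed(0)
alphabet_size = 5
data = torch.randint(0, alphabet_size, (8, 6))  # 8 sequences of length l = 6
loader = DataLoader(data, batch_size=4)
model = nn.Sequential(nn.Embedding(alphabet_size, 3), nn.Linear(3, alphabet_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

train_losses = epoch(loader, model, optimizer)
val_losses = epoch(loader, model)
print(train_losses.shape, val_losses.shape)
```

Switching gradient tracking off during validation avoids building the autograd graph and saves memory.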

In [None]:
from torch.utils.data import DataLoader
import numpy as np

########## YOUR SOLUTION HERE ##########

## Exercise 5: Model Selection

Usually, we would now train and validate our model with different hyperparameters to see which setting performs best. However, this is pretty expensive in terms of compute, so we will provide you with a setting that should work quite well. Train your model for 30 epochs using `torch.optim.Adam`. Validate your model after every epoch and persist the model that performs best on the validation set using `torch.save`. Visualize and discuss the training and validation progress.
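
The "keep the best model" pattern can be sketched independently of the LSTM. In the example below, the function name `train_with_checkpointing`, the callables it receives, and the toy regression task are all assumptions made for illustration. Saving the `state_dict` rather than the whole module is one common choice, not the only one.

```python
import os
import tempfile
import torch
import torch.nn as nn

def train_with_checkpointing(model, run_train_epoch, run_val_epoch, num_epochs, path):
    """Save the weights of the epoch with the lowest mean validation loss."""
    history = {"train": [], "val": []}
    best_val = float("inf")
    for _ in range(num_epochs):
        history["train"].append(float(run_train_epoch()))
        val_loss = float(run_val_epoch())
        history["val"].append(val_loss)
        if val_loss < best_val:
            best_val = val_loss
            torch.save(model.state_dict(), path)
    return history, best_val

# Toy task standing in for the language model: fit y = 2x with one linear unit
torch.manual_seed(0)
model = nn.Linear(1, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-1)
x_train, x_val = torch.randn(64, 1), torch.randn(32, 1)

def run_train_epoch():
    optimizer.zero_grad()
    loss = ((model(x_train) - 2 * x_train) ** 2).mean()
    loss.backward()
    optimizer.step()
    return loss.item()

def run_val_epoch():
    with torch.no_grad():
        return ((model(x_val) - 2 * x_val) ** 2).mean().item()

path = os.path.join(tempfile.mkdtemp(), "best_model.pt")
history, best_val = train_with_checkpointing(
    model, run_train_epoch, run_val_epoch, num_epochs=20, path=path
)

# Restore the best weights into a fresh module of the same architecture
best_model = nn.Linear(1, 1)
best_model.load_state_dict(torch.load(path))
```

The two lists in `history` can be plotted against the epoch index to compare training and validation progress.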

In [None]:
import matplotlib.pyplot as plt

sequence_length = 100
batch_size = 256
embedding_dim = 8
hidden_dim = 512
learning_rate = 1e-3
num_epochs = 100

########## YOUR SOLUTION HERE ##########

## Exercise 6: Top-$k$ Accuracy

Write a function `topk_accuracy` that takes a list of integers $k$, a model, and a data loader and returns the top-$k$ accuracy of the model on the given data set for each $k$. A sample is considered to be classified correctly if the true label appears in the top-$k$ classes predicted by the model. Then load the best model from the previous exercise using `torch.load` and plot its top-$k$ accuracy as a function of $k$ for all possible values of $k$. Discuss the results. 
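
A possible sketch of the computation is shown below, assuming the same many-to-many batch layout as in Exercise 4 and returning a dictionary from $k$ to accuracy (the return type is a choice made here, not something the exercise prescribes). `FixedModel` is a hypothetical stand-in that always predicts the same ranking, which makes the expected accuracies easy to verify by hand.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def topk_accuracy(ks, model, data_loader):
    """Fraction of targets that appear among the k highest-scoring classes."""
    model.eval()
    correct = {k: 0 for k in ks}
    total = 0
    max_k = max(ks)
    for x in data_loader:
        inputs, targets = x[:, :-1], x[:, 1:]
        logits = model(inputs)                         # (batch, time, classes)
        top = logits.topk(max_k, dim=-1).indices       # (batch, time, max_k)
        hits = top == targets.unsqueeze(-1)            # True where the target is ranked
        for k in ks:
            correct[k] += hits[..., :k].any(dim=-1).sum().item()
        total += targets.numel()
    return {k: correct[k] / total for k in ks}

class FixedModel(nn.Module):
    """Stand-in model: always ranks class 0 first, then 1, 2, 3."""
    def forward(self, inputs):
        scores = torch.tensor([3.0, 2.0, 1.0, 0.0])
        return scores.expand(*inputs.shape, 4)

# A plain list of batches stands in for a DataLoader; targets are 0, 1, 2, 3
loader = [torch.tensor([[0, 0, 1, 2, 3]])]
acc = topk_accuracy([1, 2, 4], FixedModel(), loader)
print(acc)  # {1: 0.25, 2: 0.5, 4: 1.0}
```

Top-$k$ accuracy is non-decreasing in $k$ and reaches 1 when $k$ equals the alphabet size.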

In [None]:
########## YOUR SOLUTION HERE ##########

## Exercise 7: Deterministic Text Generation

In this exercise we utilize the trained network to generate novel text. To do this, take some seed text, which can be chosen by the user, and feed it to the network. Subsequently, extrapolate new text by always appending the top-1 character according to the model prediction to the input sequence. Discuss the quality of your model as a text generator. 
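
The generation loop can be sketched as follows. The function `generate_greedy` and its `encode`/`decode` arguments are hypothetical names; in the assignment the `Encoder` from Exercise 1 would play both roles. `CyclicModel` is a stand-in that always predicts the next letter of a three-letter alphabet, so the output is predictable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def generate_greedy(model, encode, decode, seed, num_chars):
    """Extend the seed by repeatedly appending the top-1 predicted character."""
    model.eval()
    sequence = encode(seed).unsqueeze(0)           # (1, time)
    for _ in range(num_chars):
        logits = model(sequence)                   # (1, time, alphabet_size)
        next_char = logits[0, -1].argmax().view(1, 1)
        sequence = torch.cat([sequence, next_char], dim=1)
    return decode(sequence[0])

class CyclicModel(nn.Module):
    """Stand-in model: predicts the character that follows the current one."""
    def __init__(self, alphabet_size):
        super().__init__()
        self.alphabet_size = alphabet_size

    def forward(self, inputs):
        nxt = (inputs + 1) % self.alphabet_size
        return F.one_hot(nxt, self.alphabet_size).float()

alphabet = "abc"
encode = lambda s: torch.tensor([alphabet.index(c) for c in s])
decode = lambda t: "".join(alphabet[i] for i in t.tolist())

text = generate_greedy(CyclicModel(len(alphabet)), encode, decode, "ab", 4)
print(text)  # abcabc
```

Feeding the whole sequence again at every step is simple but recomputes all earlier time steps; carrying the LSTM state from step to step would avoid that.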

In [None]:
########## YOUR SOLUTION HERE ##########

## Exercise 8: Probabilistic Text Generation

Utilize your trained model as a text generator as in the previous exercise, but with one difference: instead of always choosing the top-1 character, make a probabilistic choice. The network prediction constitutes a probability distribution over the alphabet. Choose the next character by sampling from this distribution. Compare the results to those of the previous exercise and discuss the observed differences.
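
The difference between the two selection rules can be seen on a single prediction. The sketch below uses made-up logits for a three-character alphabet: `argmax` always returns the same index, whereas sampling from `Categorical` returns each index with its predicted probability.

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)
# Made-up logits whose softmax is (0.7, 0.2, 0.1)
logits = torch.log(torch.tensor([0.7, 0.2, 0.1]))

greedy_choice = logits.argmax().item()  # deterministic: always the most likely class

# Passing logits lets Categorical apply the softmax internally
samples = Categorical(logits=logits).sample((10000,))
frequencies = torch.bincount(samples, minlength=3).float() / samples.numel()

print(greedy_choice)
print(frequencies)  # close to (0.7, 0.2, 0.1)
```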

In [None]:
from torch.distributions import Categorical

########## YOUR SOLUTION HERE ##########

## Exercise 9: Visualize Neurons

Visualize the value of the 512 neurons while the trained model processes some user-defined text. Take a look at the last figure of [this blog](https://openai.com/blog/unsupervised-sentiment-neuron/) (which is also a good read) to get an idea of how to do the visualization. You can install and use the package `colorama` for that. Can you figure out the responsibilities of certain neurons?
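
One way to render a single neuron is to give every character a background color that reflects the neuron's activation at that time step. The sketch below is an assumption-laden illustration: instead of `colorama`'s named constants it writes raw 24-bit ANSI escape sequences directly, because a continuous color scale needs more than the handful of named colors (terminal support for 24-bit color varies). The helper `colorize` is hypothetical, and the activations are made up; in the assignment they would be one column of the LSTM hidden states, which lie in $(-1, 1)$.

```python
def colorize(text, activations):
    """Color each character by one neuron's activation: red > 0, blue < 0, white = 0."""
    pieces = []
    for char, a in zip(text, activations):
        a = max(-1.0, min(1.0, a))            # clamp to the range of LSTM hidden states
        fade = int(255 * (1 - abs(a)))        # weaker activation -> closer to white
        r, g, b = (255, fade, fade) if a >= 0 else (fade, fade, 255)
        # 30 = black foreground, 48;2;r;g;b = 24-bit background color
        pieces.append(f"\x1b[30;48;2;{r};{g};{b}m{char}")
    return "".join(pieces) + "\x1b[0m"        # reset all attributes at the end

# Made-up activations of one neuron over the text "hello"
print(colorize("hello", [-1.0, -0.5, 0.0, 0.5, 1.0]))
```

Printing one such line per neuron of interest makes it possible to scan for neurons that react to, for example, spaces or sentence ends.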

In [None]:
# provides readable names for ANSI escape sequences
from colorama import Fore, Back, Style

########## YOUR SOLUTION HERE ##########

## Bonus Exercise (3 Points):

Adapt your code from the previous exercises such that the model runs in the many-to-one setting, i.e., it should read `l-1` characters of a sample sequence and predict the `l`-th character. Train/validate the model in the many-to-one setting and compare it to the many-to-many setting in terms of top-$k$ accuracy on the validation set and probabilistic text generation. Visualize your results. What are the pros and cons? 
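
The core change relative to the many-to-many setting is that only the last time step contributes to the loss. The following sketch shows this on random data with small, illustrative layer sizes; the layers are created directly rather than through `NextCharLSTM` so that the example is self-contained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
alphabet_size, l = 5, 6
x = torch.randint(0, alphabet_size, (4, l))        # 4 random sequences of length l

# Illustrative sizes, not the ones from Exercise 5
embedding = nn.Embedding(alphabet_size, 3)
lstm = nn.LSTM(3, 8, batch_first=True)
output_layer = nn.Linear(8, alphabet_size)

# Many-to-one: read the first l-1 characters, predict only the l-th
inputs, target = x[:, :-1], x[:, -1]
hidden_states, _ = lstm(embedding(inputs))          # (batch, l-1, 8)
last_logits = output_layer(hidden_states[:, -1])    # (batch, alphabet_size)
loss = F.cross_entropy(last_logits, target)

print(last_logits.shape, loss.item())
```

Each sequence now yields a single training signal instead of `l-1`, which is one of the trade-offs to weigh in the comparison.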

In [None]:
########## YOUR SOLUTION HERE ##########