# Assignment 5: Neural Networks

---

## Task 1) RNN as Language Model

Similar to the n-gram language models in the previous tasks, imagine you have to write another thesis and just want to generate an interesting topic.
In this assignment, you will train and use Recurrent Neural Networks as language models to generate new potential thesis topics.

### Data

Download the `theses.csv` data set from the `Supplemental Materials` in the `Files` section of our Microsoft Teams group.
This dataset consists of approx. 3,000 theses topics chosen by students in the past.
Here are some examples of the file content:

```
27.10.94;14.07.95;1995;intern;Diplom;DE;Monte Carlo-Simulation für ein gekoppeltes Round-Robin-System;
04.11.94;14.03.95;1995;intern;Diplom;DE;Implementierung eines Testüberdeckungsgrad-Analysators für RAS;
01.11.20;01.04.21;2021;intern;Bachelor;DE;Landessprachenerkennung mittels X-Vektoren und Meta-Klassifikation;
```

### Basic Setup

For the assignment on Recurrent Neural Networks, we'll (again) heavily use [PyTorch](https://pytorch.org) as go-to Deep Learning library.
Here, we'll rely on the RNN and Embedding modules already implemented by PyTorch.
You can imagine the Embedding layer as a simple lookup table that stores embeddings of a fixed dictionary and size (quite similar to the Word2Vec parameters we've trained in assignment 2).
Head over to the [RNN](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html) and [Embedding](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) modules to gain some understanding of their functionality.
Code for processing data samples, batching, converting to tensors, etc. can get messy and hard to maintain. 
Therefore, you can use PyTorch's [Datasets & DataLoaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html). 
Get familiar with the basics of data handling, as it will help you for upcoming assignments.
As always, you can use [NumPy](https://numpy.org) and [Pandas](https://pandas.pydata.org) for data handling etc.

*In this Jupyter Notebook, we will provide the steps to solve this task and give hints via functions & comments. However, code modifications (e.g., function naming, arguments) and implementation of additional helper functions & classes are allowed. The code aims to help you get started.*

---

In [109]:
# Dependencies
import os
import tqdm
import numpy as np
import pandas as pd
from dataclasses import dataclass
from typing import TypedDict, Iterator, Optional, Callable
import re
from functools import reduce
from math import floor, ceil
from sklearn.model_selection import train_test_split
import csv

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torch import masked
from torch.optim import Optimizer, Adam
from torcheval.metrics import Perplexity
from torch.nn import CrossEntropyLoss

### Prepare the Data

1.1 Spend some time on preparing the dataset. It may be helpful to lower-case the data and to filter for German titles. The format of the CSV-file should be:

```
Anmeldedatum;Abgabedatum;JahrAkademisch;Art;Grad;Sprache;Titel;Abstract
```

1.2 Create the vocabulary from the prepared dataset. You'll need it for the modeling part such as nn.Embedding.

1.3 Create a PyTorch Dataset class which handles your tokenized data with respect to model inputs and labels.

In [110]:
@dataclass
class Thesis:
    registration_date: str
    due_date: str
    year_academic: int
    type: str
    degree: str
    language: str
    title: str
    abstract: str

class _Thesis(TypedDict):
    Anmeldedatum: str
    Abgabedatum: str
    JahrAkademisch: str
    Art: str
    Grad: str
    Sprache: str
    Titel: str
    Abstract: str

def to_thesis(thesis: _Thesis) -> Thesis:
    return Thesis(
        registration_date=thesis["Anmeldedatum"],
        due_date=thesis["Abgabedatum"],
        year_academic=int(thesis["JahrAkademisch"]),
        type=thesis["JahrAkademisch"],
        degree=thesis["Grad"],
        language=thesis["Sprache"],
        title=thesis["Titel"],
        abstract=thesis["Abstract"]
    )

def load_theses_dataset(filepath) -> pd.DataFrame:
    """Loads all theses instances and returns them as a dataframe."""
    ### YOUR CODE HERE
    
    lists = {key: [] for key in Thesis.__dataclass_fields__.keys()}
    with open(filepath, encoding="utf-8-sig") as fp:
        theses = map(to_thesis, csv.DictReader(fp.readlines(), delimiter=";")) # type: ignore
        for thesis in theses:
            for key in lists:
                lists[key].append(thesis.__dict__[key])
    return pd.DataFrame(lists)
    
    ### END YOUR CODE

In [111]:
### Notice: Think about start and end of sentence tokens

def tokenize(text: str) -> Iterator[str]:
    yield "<s>"
    for s in text.split():
        m = re.match(r"^(\w+)?([,\.?!])?$", s)
        if m is not None:
            if m.group(1) is not None:
                yield m.group(1).lower()
            if m.group(2) is not None:
                yield m.group(2)
    yield "</s>"

def preprocess(dataframe) -> list[list[str]]:
    """Preprocesses and tokenizes the given theses titles for further use."""
    ### YOUR CODE HERE
    
    l = []
    for i in range(len(dataframe)):
        if dataframe["language"][i] == "DE":
            l.append(list(tokenize(dataframe["abstract"][i])))
    return l

    ### END YOUR CODE

In [112]:
THESES_DATASET_PATH = "../4-nnet/data/theses2022.csv"

dataframe = load_theses_dataset(THESES_DATASET_PATH)
tokenized_data = preprocess(dataframe)
vocabulary = {w for l in tokenized_data for w in l}
idx2word = sorted(list(vocabulary))
word2idx = {w: i for i, w in enumerate(idx2word)}

In [113]:
### TODO: 1.3 Implement the PyTorch theses dataset
### Notice: It is possible to solve the task without this class.
### Notice: However, with respect to DataLoaders it makes your life easier.

### YOUR CODE HERE

class ThesesDataset(Dataset):
    @property
    def dtype(self) -> torch.dtype:
        return self.__dtype
    
    @property
    def voc_size(self) -> int:
        return len(self.__word2idx)

    def __init__(self, dataset: list[list[str]], word2idx: dict[str, int], dtype: torch.dtype = torch.float16):
        self.__max_length = reduce(lambda acc, l: acc if acc > l else l, map(len, dataset), 0) - 1
        self.__seq_idcs = []
        self.__lengths = []
        for i, l in enumerate(dataset):
            for j in range(len(l) - 1):
                self.__seq_idcs.append(i)
                self.__lengths.append(j)
        self.__data = dataset
        self.__word2idx = word2idx
        self.__dtype = dtype


    def __len__(self):
        return len(self.__seq_idcs)

    def __getitem__(self, idx: slice | int) -> tuple[torch.Tensor, torch.Tensor]:
        if isinstance(idx, int):
            return self.__get_single(idx)
        else:
            return self.__get_multiple(idx)
    
    def __get_single(self, idx: int) -> tuple[torch.Tensor, torch.Tensor]:
        seq = self.__data[self.__seq_idcs[idx]]
        length = self.__lengths[idx]
        _x = torch.zeros((length , self.voc_size), dtype=self.dtype)
        for i in range(length):
            _x[i, self.__word2idx[seq[i]]] = 1
        x = torch.vstack([_x, torch.full((self.__max_length - length, self.voc_size), torch.nan, dtype=self.dtype)])
        y = torch.zeros(self.voc_size, dtype=self.dtype)
        y[self.__word2idx[seq[length]]] = 1
        return x, y
    
    def __get_multiple(self, idcs: slice) -> tuple[torch.Tensor, torch.Tensor]:
        xs = []
        ys = []
        for i in range(idcs.start, idcs.stop, idcs.step):
            x, y = self.__get_single(i)
            xs.append(x.reshape(1, x.shape[0], x.shape[1]))
            ys.append(y)
        return torch.vstack(xs), torch.stack(ys)
    
### END YOUR CODE

### Train and Evaluate

2.1 Implement the RNN Language Model. Therefore, you can use the nn.Module and overwrite the forward function. For the embedding layer you can either use the embeddings learned from the previous word2vec assignment or train the `nn.Embedding` module and corresponding parameters from scratch.

2.2 Implement the functionality to train your model with the train dataset.

2.3 Implement the functionality to evaluate your model with the test dataset.

2.4 Perform a train-test-split for your theses data, train the RNN Language Model and evaluate the loss & perplexity.

In [114]:
### TODO: 2.1 Implement the RNN Language Model (nn.Module)

### YOUR CODE HERE

class RNN_LM(nn.Module):
    @property
    def device(self) -> torch.device:
        return self.__device
    
    @device.setter
    def device(self, value: str | torch.device):
        if isinstance(value, str):
            value = torch.device(value)
        self.__device = value
        self.to(self.device)

    @property
    def dtype(self) -> torch.dtype:
        return self.__dtype
    
    def __init__(self, voc_size: int, embedding_dim: int, hidden_layer_sizes: list[int], device: torch.device, dtype: torch.dtype = torch.float16,  **kwargs):
        super(RNN_LM, self).__init__(**kwargs)
        self.__device = device
        self.__dtype = dtype
        self.__hidden_layer_sizes = hidden_layer_sizes
        self.embeddings = nn.Linear(voc_size, embedding_dim, False, device, dtype)
        self.hidden: list[nn.Linear] = []
        prev_size = embedding_dim
        for size in hidden_layer_sizes:
            self.hidden.append(nn.Linear(prev_size + size, size, True, device, dtype))
            prev_size = size
        self.classification_head = nn.Linear(prev_size, voc_size, True, device, dtype)
    
    def forward(self, X: torch.Tensor) -> torch.Tensor:
        hidden_states = [torch.zeros((X.shape[0], s), device=self.device, dtype=self.dtype) for s in self.__hidden_layer_sizes]
        word_embeddings = self.embeddings(X)
        for i in range(X.shape[1]):
            x = word_embeddings[:, i, :]
            mask = (~x[:, 0:1].isnan())
            print(mask)
            if not mask.any().item():
                break
            new_hidden_states = [self.__update_hidden(0, x, hidden_states[0], mask)]
            for j in range(1, len(self.hidden)):
                new_hidden_states.append(self.__update_hidden(j, new_hidden_states[-1], hidden_states[j], mask))
            hidden_states = new_hidden_states
        return self.classification_head(hidden_states[-1])
    
    def __update_hidden(self, i: int, x: torch.Tensor, prior: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        result = F.relu(self.hidden[i](torch.hstack([x, prior])))
        mask.any
        return torch.where(mask.repeat(1, result.shape[1]), result, prior)


### END YOUR CODE

In [115]:
### TODO: 2.2 Implement the train functionality
### Notice: If you want, you can also combine train and eval functionality

def train(model: nn.Module, loader: DataLoader, loss_fn: Callable[[torch.Tensor, torch.Tensor], torch.Tensor], opt: Optimizer):
    """Trains the RNN-LM for one epoch."""
    ### YOUR CODE HERE
    
    batch_count = len(loader)
    running_loss = 0
    for i, (X, Y) in enumerate(loader):
        X = X.to(model.device)
        Y = Y.to(model.device)
        print(f"\r  training batch {i+1}/{batch_count}", end="")
        opt.zero_grad()
        logits = model(X)
        loss = loss_fn(logits, Y)
        print(loss.item())
        running_loss += loss.item()
        loss.backward()
        opt.step()
    print()
    print(f"  average loss: {running_loss/batch_count}")

    ### END YOUR CODE

In [116]:
### TODO: 2.3 Implement the evaluation functionality
### Notice: If you want, you can also combine train and eval

def eval(model: nn.Module, loader: DataLoader, loss_fn: Callable[[torch.Tensor, torch.Tensor], torch.Tensor]):
    """Evaluates the optimized RNN-LM."""
    ### YOUR CODE HERE

    batch_count = len(loader)
    perplexity = Perplexity()
    running_perp = 0
    running_loss = 0
    with torch.no_grad():
        for i, (X, Y) in loader:
            X = X.to(model.device)
            Y = Y.to(model.device)
            print(f"\r  evaluating batch {i+1}/{batch_count}")
            logits = model(X)
            running_loss += loss_fn(logits, Y).item()
            perplexity.update(X, torch.argmax(Y, dim=1))
            running_perp += perplexity.compute().item()
    print()
    print(f"  average loss: {running_loss/batch_count}, average perplexity: {running_perp/batch_count}")

    ### END YOUR CODE

In [117]:
### TODO: 2.4 Initialize and train the RNN Language Model for X epochs

# For split reproducibility
# Optional: Use 5-fold cross validation
SEED = 42

EPOCHS = 100

DEVICE = "cuda" # 'cpu', 'mps' or 'cuda'

TEST_RATIO = 0.2

BATCH_SIZE = 1

EMBEDDING_DIM = 256

HIDDEN_LAYER_SIZES = [256, 64, 64]

### YOUR CODE HERE

train_data, test_data = train_test_split(tokenized_data, test_size=TEST_RATIO, random_state=SEED)

# Use batch_size=1 if you want to avoid padding handling
train_dataset = ThesesDataset(train_data, word2idx)
train_dataloader = DataLoader(train_dataset, BATCH_SIZE, True)

# Use batch_size=1 if you want to avoid padding handling
test_dataset = ThesesDataset(test_data, word2idx)
test_dataloader = DataLoader(test_dataset, BATCH_SIZE, True)

# Your language model
model = RNN_LM(len(vocabulary), EMBEDDING_DIM, HIDDEN_LAYER_SIZES, torch.device(DEVICE))

# Your loss function
criterion = CrossEntropyLoss()

# Your optimizer (optim.SGD should be okay)
optimizer = Adam(model.parameters(), 0.0001)


# TODO: Training for epoch i

for i in range(EPOCHS):
    print(f"training epoch {i+1}/{EPOCHS}...")
    train(model, train_dataloader, criterion, optimizer)

# TODO: Evaluation for epoch i

print(f"evaluating model...")
eval(model, test_dataloader, criterion)

### END YOUR CODE

training epoch 1/100...
  training batch 1/92329tensor([[True]], device='cuda:0')
tensor([[True]], device='cuda:0')
tensor([[True]], device='cuda:0')
tensor([[True]], device='cuda:0')
tensor([[True]], device='cuda:0')
tensor([[True]], device='cuda:0')
tensor([[True]], device='cuda:0')
tensor([[True]], device='cuda:0')
tensor([[True]], device='cuda:0')
tensor([[True]], device='cuda:0')
tensor([[True]], device='cuda:0')
tensor([[True]], device='cuda:0')
tensor([[True]], device='cuda:0')
tensor([[True]], device='cuda:0')
tensor([[True]], device='cuda:0')
tensor([[True]], device='cuda:0')
tensor([[True]], device='cuda:0')
tensor([[True]], device='cuda:0')
tensor([[True]], device='cuda:0')
tensor([[True]], device='cuda:0')
tensor([[True]], device='cuda:0')
tensor([[True]], device='cuda:0')
tensor([[True]], device='cuda:0')
tensor([[True]], device='cuda:0')
tensor([[True]], device='cuda:0')
tensor([[True]], device='cuda:0')
tensor([[True]], device='cuda:0')
tensor([[True]], device='cuda:0')


KeyboardInterrupt: 

### Generate Titles

3.1 Use the trained RNN Language Model to generate theses titles. How can you sample the next tokens?

3.2 Compare your results with n-gram language models (e.g., n=4). Of course, you can use a library such as NLTK toolkit
- What perplexity does a regular 4-gram have on the same split? 
- Compare the generated titles from the 4-gram and RNN-LM. Do you think the n-gram titles are better?

In [None]:
### TODO: 3.1 Generate titles with the trained RNN Language Model

def generate(arguments):
    ### YOUR CODE HERE

    raise NotImplementedError()

    ### END YOUR CODE

for i in range(10):
    generated_title = generate(None)
    print(" ".join(generated_title))

NotImplementedError: 

In [None]:
### TODO: 3.2 Generate titles with the trained n-gram language model

### YOUR CODE HERE



### END YOUR CODE