# Assignment 5: Neural Networks

---

## Task 1) RNN as Language Model

As with the n-gram language models in the previous tasks, imagine you have to write another thesis and just want to generate an interesting topic.
In this assignment, you will train and use Recurrent Neural Networks as language models to generate new potential thesis topics.

### Data

Download the `theses.csv` data set from the `Supplemental Materials` in the `Files` section of our Microsoft Teams group.
This dataset consists of approx. 3,000 thesis topics chosen by students in the past.
Here are some examples of the file content:

```
27.10.94;14.07.95;1995;intern;Diplom;DE;Monte Carlo-Simulation für ein gekoppeltes Round-Robin-System;
04.11.94;14.03.95;1995;intern;Diplom;DE;Implementierung eines Testüberdeckungsgrad-Analysators für RAS;
01.11.20;01.04.21;2021;intern;Bachelor;DE;Landessprachenerkennung mittels X-Vektoren und Meta-Klassifikation;
```

### Basic Setup

For the assignment on Recurrent Neural Networks, we'll (again) rely heavily on [PyTorch](https://pytorch.org) as our go-to deep learning library.
Here, we'll rely on the RNN and Embedding modules already implemented by PyTorch.
You can imagine the Embedding layer as a simple lookup table that stores embeddings of a fixed dictionary and size (quite similar to the Word2Vec parameters we've trained in assignment 2).
Head over to the [RNN](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html) and [Embedding](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) modules to gain some understanding of their functionality.
Code for processing data samples, batching, converting to tensors, etc. can get messy and hard to maintain.
Therefore, you can use PyTorch's [Datasets & DataLoaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) to keep data handling separate from the training code.
Get familiar with the basics of data handling, as it will help you in upcoming assignments.
As always, you can use [NumPy](https://numpy.org) and [Pandas](https://pandas.pydata.org) for data handling etc.
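
To make the lookup-table picture concrete, the following minimal sketch feeds one toy title through `nn.Embedding` and `nn.RNN` and prints the resulting shapes. The vocabulary, the embedding size of 8, and the hidden size of 16 are made-up values for illustration, not settings prescribed by the assignment.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy vocabulary; words and sizes are made up for illustration
word2idx = {"<s>": 0, "</s>": 1, "monte": 2, "carlo": 3, "simulation": 4}
embedding = nn.Embedding(num_embeddings=len(word2idx), embedding_dim=8)

# One title as a batch of size 1: shape (batch_size, sequence_length)
tokens = ["<s>", "monte", "carlo", "simulation"]
X = torch.tensor([[word2idx[t] for t in tokens]], dtype=torch.long)

# The embedding layer maps every index to its row in the weight matrix
embedded = embedding(X)  # shape (1, 4, 8)
same_rows = torch.equal(embedded[0], embedding.weight[X[0]])

# With batch_first=True the RNN consumes (batch, sequence, features)
rnn = nn.RNN(input_size=8, hidden_size=16, num_layers=1, batch_first=True)
output, hidden = rnn(embedded)

print(embedded.shape, output.shape, hidden.shape, same_rows)
```

Note that `output` holds the hidden state for every time step, while `hidden` holds only the final state of each layer.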

*In this Jupyter Notebook, we will provide the steps to solve this task and give hints via functions & comments. However, code modifications (e.g., function naming, arguments) and implementation of additional helper functions & classes are allowed. The code aims to help you get started.*

---

In [3]:
%pip install numpy scikit-learn matplotlib torch torchvision tqdm


In [5]:
%pip install pandas



In [6]:
# Dependencies
import os
import tqdm
import numpy as np
import pandas as pd

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

### Prepare the Data

1.1 Spend some time on preparing the dataset. It may be helpful to lower-case the data and to filter for German titles. The format of the CSV-file should be:

```
Anmeldedatum;Abgabedatum;JahrAkademisch;Art;Grad;Sprache;Titel;Abstract
```

1.2 Create the vocabulary from the prepared dataset. You'll need it for the modeling part, e.g., to size `nn.Embedding`.

1.3 Create a PyTorch Dataset class which handles your tokenized data with respect to model inputs and labels.
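
As a starting point for step 1.1, here is a minimal sketch of reading the semicolon-separated format and keeping only German titles. It assumes the file has no header row (pass `header=0` if yours has one) and parses an in-memory string instead of the real file. The first three rows are the examples shown above; the fourth row is made up to show the language filter at work.

```python
import io
import pandas as pd

COLUMNS = ["Anmeldedatum", "Abgabedatum", "JahrAkademisch", "Art",
           "Grad", "Sprache", "Titel", "Abstract"]

# Rows 1-3 are the examples from the task description;
# row 4 is made up for illustration only.
raw = (
    "27.10.94;14.07.95;1995;intern;Diplom;DE;Monte Carlo-Simulation für ein gekoppeltes Round-Robin-System;\n"
    "04.11.94;14.03.95;1995;intern;Diplom;DE;Implementierung eines Testüberdeckungsgrad-Analysators für RAS;\n"
    "01.11.20;01.04.21;2021;intern;Bachelor;DE;Landessprachenerkennung mittels X-Vektoren und Meta-Klassifikation;\n"
    "01.01.20;01.06.20;2020;intern;Master;EN;A Made-Up English Title;\n"
)

# Assumption: no header row, so the column names are supplied explicitly
df = pd.read_csv(io.StringIO(raw), sep=";", header=None, names=COLUMNS)

# Keep German titles only, drop missing titles, and lower-case the rest
german = df[df["Sprache"] == "DE"]
titles = german["Titel"].dropna().str.lower().tolist()

print(len(titles), titles[0])
```

For the real data, replace `io.StringIO(raw)` with the path to the downloaded file.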

In [7]:
def load_theses_dataset(filepath):
    """Loads all theses instances and returns the titles as a list of strings."""
    ### YOUR CODE HERE
    # Assumes a tab-separated file without a header row
    # whose fourth column (index 3) holds the titles
    data = pd.read_csv(filepath, sep='\t', encoding='utf-8', header=None)
    return data[3].to_list()
    ### END YOUR CODE

In [13]:
def preprocess(dataframe):
    """Preprocesses and tokenizes the given theses titles in place.

    Note: despite its name, `dataframe` is the list of title strings
    returned by load_theses_dataset.
    """
    ### YOUR CODE HERE
    # Delete punctuation; map hyphens and underscores to spaces
    table = str.maketrans("-_", "  ", "'\"(),.!?:;")

    for i in range(len(dataframe)):
        # Convert to lowercase and remove special characters
        text = dataframe[i].lower().translate(table)
        # Tokenize on spaces and drop the empty strings left by repeated spaces
        tokens = [token for token in text.split(" ") if token]
        # Add start and end tokens
        dataframe[i] = ["<s>"] + tokens + ["</s>"]

    return dataframe
    ### END YOUR CODE

In [14]:
def create_vocab(dataframe):
    """Creates a vocabulary (set of unique tokens) from the tokenized titles."""
    ### YOUR CODE HERE
    vocab = set()
    for title in dataframe:
        vocab.update(title)
    return vocab
    ### END YOUR CODE

In [15]:
dataframe = load_theses_dataset("data/theses.tsv")
preprocess(dataframe)

vocabulary = create_vocab(dataframe)
word2idx = {word: idx for idx, word in enumerate(vocabulary)}
idx2word = {idx: word for word, idx in word2idx.items()}

In [16]:
class ThesesDataset(Dataset):
    def __init__(self, dataset, word2idx):
        """
        Initializes the dataset.
        Args:
            dataset (list of list of str): Tokenized theses titles.
            word2idx (dict): Mapping of words to their indices.
        """
        self.data = []
        self.labels = []
        
        for title in dataset:
            # Convert words to indices
            indices = [word2idx[word] for word in title[:-1]]  # All words except the last one
            label_indices = [word2idx[word] for word in title[1:]]  # All words except the first one
            
            self.data.append(indices)
            self.labels.append(label_indices)

    def __len__(self):
        """
        Returns the number of samples in the dataset.
        """
        return len(self.data)

    def __getitem__(self, idx):
        """
        Returns the sample and label at the given index.
        Args:
            idx (int): Index of the sample.
        Returns:
            tuple: (sample, label) where both are LongTensors of word indices.
        """
        sample = torch.tensor(self.data[idx], dtype=torch.long)
        labels = torch.tensor(self.labels[idx], dtype=torch.long)
        return sample, labels

In [18]:
dataset = ThesesDataset(dataframe, word2idx)
print(f"Dataset size: {len(dataset)}")
print(dataset[0])  # Print the first sample and its label

Dataset size: 2979
(tensor([4366, 6021, 2135, 7205, 5187, 5226, 4112]), tensor([6021, 2135, 7205, 5187, 5226, 4112, 5987]))


### Train and Evaluate

2.1 Implement the RNN Language Model. To do so, subclass `nn.Module` and override its `forward` method. For the embedding layer, you can either reuse the embeddings learned in the previous word2vec assignment or train the `nn.Embedding` module and its parameters from scratch.

2.2 Implement the functionality to train your model with the train dataset.

2.3 Implement the functionality to evaluate your model with the test dataset.

2.4 Perform a train-test-split for your theses data, train the RNN Language Model and evaluate the loss & perplexity.
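
One possible shape for steps 2.2 to 2.4 is a single function that trains when it receives an optimizer and only evaluates otherwise. The sketch below is not the reference solution: `TinyRNNLM` and `run_epoch` are hypothetical names, and the data, learning rate, and layer sizes are toy values. Perplexity is computed as the exponential of the mean per-token cross-entropy.

```python
import math
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

class TinyRNNLM(nn.Module):
    """Hypothetical stand-in with the interface of an RNN language model."""
    def __init__(self, vocab_size, embedding_dim=8, hidden_dim=16):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, X, hidden=None):
        output, hidden = self.rnn(self.embedding(X), hidden)
        return self.fc(output), hidden

def run_epoch(model, batches, criterion, optimizer=None):
    """Runs one pass over the data; trains only if an optimizer is given.

    Returns the mean per-token loss and the corresponding perplexity.
    """
    training = optimizer is not None
    model.train(training)
    total_loss, total_tokens = 0.0, 0
    with torch.set_grad_enabled(training):
        for X, y in batches:
            logits, _ = model(X)  # (batch, seq, vocab)
            # CrossEntropyLoss expects (N, vocab) logits and (N,) targets
            loss = criterion(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            # Weight the batch mean by its token count
            total_loss += loss.item() * y.numel()
            total_tokens += y.numel()
    mean_loss = total_loss / total_tokens
    return mean_loss, math.exp(mean_loss)

# Toy data: two index sequences with batch_size=1 (inputs and shifted labels)
seqs = [torch.tensor([[0, 2, 3, 4, 1]]), torch.tensor([[0, 3, 2, 5, 1]])]
batches = [(s[:, :-1], s[:, 1:]) for s in seqs]

model = TinyRNNLM(vocab_size=6)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

first_loss, first_ppl = run_epoch(model, batches, criterion)  # evaluation only
for _ in range(50):
    run_epoch(model, batches, criterion, optimizer)
last_loss, last_ppl = run_epoch(model, batches, criterion)

print(f"loss {first_loss:.3f} -> {last_loss:.3f}, "
      f"perplexity {first_ppl:.2f} -> {last_ppl:.2f}")
```

In the actual task, the evaluation pass should run on the held-out test split rather than on the training batches used here.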

In [None]:
class RNN_LM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers=1):
        """
        Initializes the RNN Language Model.
        Args:
            vocab_size (int): Size of the vocabulary.
            embedding_dim (int): Dimension of the word embeddings.
            hidden_dim (int): Number of hidden units in the RNN.
            num_layers (int): Number of RNN layers.
        """
        super(RNN_LM, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, X, hidden=None):
        """
        Forward pass of the RNN Language Model.
        Args:
            X (Tensor): Input tensor of shape (batch_size, sequence_length).
            hidden (Tensor, optional): Hidden state tensor of shape (num_layers, batch_size, hidden_dim).
        Returns:
            output (Tensor): Output tensor of shape (batch_size, sequence_length, vocab_size).
            hidden (Tensor): Hidden state tensor of shape (num_layers, batch_size, hidden_dim).
        """
        embedded = self.embedding(X)  # Shape: (batch_size, sequence_length, embedding_dim)
        output, hidden = self.rnn(embedded, hidden)  # RNN output and hidden state
        output = self.fc(output)  # Shape: (batch_size, sequence_length, vocab_size)
        return output, hidden

In [None]:
### TODO: 2.2 Implement the train functionality
### Notice: If you want, you can also combine train and eval functionality

def train(arguments):
    """Trains the RNN-LM for one epoch."""
    ### YOUR CODE HERE

    raise NotImplementedError()

    ### END YOUR CODE

In [None]:
### TODO: 2.3 Implement the evaluation functionality
### Notice: If you want, you can also combine train and eval

def evaluate(arguments):  # named `evaluate` to avoid shadowing Python's built-in eval()
    """Evaluates the optimized RNN-LM."""
    ### YOUR CODE HERE

    raise NotImplementedError()

    ### END YOUR CODE

In [None]:
### TODO: 2.4 Initialize and train the RNN Language Model for X epochs

# For split reproducibility
# Optional: Use 5-fold cross validation
SEED = 42

EPOCHS = 100

DEVICE = "cpu" # 'cpu', 'mps' or 'cuda'

### YOUR CODE HERE

# Use batch_size=1 if you want to avoid padding handling
train_dataset = None
train_dataloader = None

# Use batch_size=1 if you want to avoid padding handling
test_dataset = None
test_dataloader = None

# Your language model
model = None

# Your loss function
criterion = None

# Your optimizer (optim.SGD should be okay)
optimizer = None


# TODO: Training for epoch i

# TODO: Evaluation for epoch i


### END YOUR CODE
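
If you prefer batches larger than 1, the titles in a batch must be padded to a common length. The following sketch shows one way to do this with a reproducible split. It assumes that index 0 is reserved for a `<pad>` token (the vocabulary built earlier in this notebook does not contain one), and `collate_pad` is a hypothetical helper name. The random logits merely stand in for the model output.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, random_split

PAD_IDX = 0  # assumption: index 0 is reserved for a "<pad>" token

# Toy (input, label) pairs of different lengths; stand-in for a Dataset of titles
samples = [
    (torch.tensor([1, 5, 6]), torch.tensor([5, 6, 2])),
    (torch.tensor([1, 7]), torch.tensor([7, 2])),
    (torch.tensor([1, 5, 8, 9]), torch.tensor([5, 8, 9, 2])),
    (torch.tensor([1, 9]), torch.tensor([9, 2])),
]

def collate_pad(batch):
    """Pads inputs and labels of one batch to the length of its longest title."""
    inputs, labels = zip(*batch)
    X = pad_sequence(list(inputs), batch_first=True, padding_value=PAD_IDX)
    y = pad_sequence(list(labels), batch_first=True, padding_value=PAD_IDX)
    return X, y

# Seeded generator for a reproducible train-test split
generator = torch.Generator().manual_seed(42)
train_set, test_set = random_split(samples, [3, 1], generator=generator)

train_loader = DataLoader(train_set, batch_size=3, shuffle=False,
                          collate_fn=collate_pad)
X, y = next(iter(train_loader))

# ignore_index keeps the padded positions out of the loss
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
logits = torch.randn(X.size(0), X.size(1), 10)  # placeholder for model output
loss = criterion(logits.reshape(-1, 10), y.reshape(-1))

print(X.shape, y.shape, loss.item())
```

With `ignore_index` set, the loss is averaged over real tokens only, so the perplexity is not distorted by padding.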

### Generate Titles

3.1 Use the trained RNN Language Model to generate theses titles. How can you sample the next tokens?
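
One common answer to the sampling question is to draw the next token from the softmax distribution over the logits of the last time step, optionally sharpened or flattened by a temperature. The sketch below illustrates this with an untrained stand-in model and a made-up vocabulary. `TinyRNNLM`, `sample_next`, and the `temperature` parameter are illustrative assumptions, not part of the provided template.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

class TinyRNNLM(nn.Module):
    """Hypothetical, untrained stand-in for the trained language model."""
    def __init__(self, vocab_size, embedding_dim=8, hidden_dim=16):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, X, hidden=None):
        output, hidden = self.rnn(self.embedding(X), hidden)
        return self.fc(output), hidden

def sample_next(logits, temperature=1.0):
    """Samples one token index from the logits of a single time step."""
    probs = F.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

def generate(model, word2idx, idx2word, max_len=20, temperature=1.0):
    """Generates tokens one at a time until "</s>" or max_len is reached."""
    model.eval()
    token, hidden, title = word2idx["<s>"], None, []
    with torch.no_grad():
        for _ in range(max_len):
            X = torch.tensor([[token]], dtype=torch.long)  # shape (1, 1)
            logits, hidden = model(X, hidden)  # carry the hidden state forward
            token = sample_next(logits[0, -1], temperature)
            if token == word2idx["</s>"]:
                break
            title.append(idx2word[token])
    return title

# Made-up vocabulary for illustration
vocab = ["<s>", "</s>", "analyse", "von", "sprachdaten"]
word2idx = {word: idx for idx, word in enumerate(vocab)}
idx2word = {idx: word for word, idx in word2idx.items()}

model = TinyRNNLM(vocab_size=len(vocab))
title = generate(model, word2idx, idx2word, max_len=10, temperature=0.8)
print(" ".join(title))
```

A temperature below 1 makes the samples more conservative, while replacing the sampling step with an `argmax` yields greedy decoding, which always produces the same title.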

3.2 Compare your results with n-gram language models (e.g., n=4). You can of course use a library such as NLTK.
- What perplexity does a regular 4-gram have on the same split? 
- Compare the generated titles from the 4-gram and RNN-LM. Do you think the n-gram titles are better?
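
To show what the 4-gram perplexity measures, here is a hand-rolled sketch with add-one (Laplace) smoothing. It is a stand-in for a library implementation, not a replacement for it. The function names and the toy titles are made up, and the sentences are passed without `<s>`/`</s>` markers because the padding is added inside the functions.

```python
import math
from collections import Counter

def ngram_counts(sentences, n):
    """Counts n-grams and their (n-1)-gram contexts over padded sentences."""
    ngrams, contexts = Counter(), Counter()
    for sent in sentences:
        padded = ["<s>"] * (n - 1) + sent + ["</s>"]
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            ngrams[context + (padded[i],)] += 1
            contexts[context] += 1
    return ngrams, contexts

def perplexity(sentences, ngrams, contexts, vocab_size, n):
    """Perplexity of the sentences under add-one (Laplace) smoothing."""
    log_prob, num_tokens = 0.0, 0
    for sent in sentences:
        padded = ["<s>"] * (n - 1) + sent + ["</s>"]
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            count = ngrams[context + (padded[i],)]
            prob = (count + 1) / (contexts[context] + vocab_size)
            log_prob += math.log(prob)
            num_tokens += 1
    return math.exp(-log_prob / num_tokens)

N = 4
# Made-up tokenized titles for illustration
train = [
    ["analyse", "von", "sprachdaten"],
    ["analyse", "von", "bilddaten"],
    ["erkennung", "von", "sprachdaten"],
]
# Every predictable token: all training words plus the end marker
vocab = {word for sent in train for word in sent} | {"</s>"}

ngrams, contexts = ngram_counts(train, N)
ppl_seen = perplexity([["analyse", "von", "sprachdaten"]],
                      ngrams, contexts, len(vocab), N)
ppl_unseen = perplexity([["erkennung", "von", "bilddaten"]],
                        ngrams, contexts, len(vocab), N)

print(f"seen: {ppl_seen:.2f}, unseen: {ppl_unseen:.2f}")
```

For a fair comparison with the RNN-LM, both models should be fitted on the same training split and scored on the same test split with the same tokenization.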

In [None]:
### TODO: 3.1 Generate titles with the trained RNN Language Model

def generate(arguments):
    ### YOUR CODE HERE

    raise NotImplementedError()

    ### END YOUR CODE

for i in range(10):
    generated_title = generate(None)
    print(" ".join(generated_title))

In [None]:
### TODO: 3.2 Generate titles with the trained n-gram language model

### YOUR CODE HERE



### END YOUR CODE