# Assignment 4: Neural Networks

---

## Task 1) Skip-grams

Tomas Mikolov's [original paper](https://arxiv.org/abs/1301.3781) on word2vec is not very specific about how to actually compute the embedding matrices.
Xin Rong provides a much more detailed [walk-through](https://arxiv.org/pdf/1411.2738.pdf) of the math; I recommend you go through it before you continue with this assignment.
While the original implementation was written in C and estimates the matrices directly, in this assignment we want to use PyTorch (and autograd) to train the matrices.
There are plenty of example implementations and blog posts that show how to do it; I particularly recommend [Mateusz Bednarski's](https://towardsdatascience.com/implementing-word2vec-in-pytorch-skip-gram-model-e6bae040d2fb) version. Familiarize yourself with skip-grams and how to train them using PyTorch.

### Data

Download the `theses.csv` data set from the `Supplemental Materials` in the `Files` section of our Microsoft Teams group.
This dataset consists of approx. 3,000 thesis topics chosen by students in the past.
Here are some examples of the file content:

```
27.10.94;14.07.95;1995;intern;Diplom;DE;Monte Carlo-Simulation für ein gekoppeltes Round-Robin-System;
04.11.94;14.03.95;1995;intern;Diplom;DE;Implementierung eines Testüberdeckungsgrad-Analysators für RAS;
01.11.20;01.04.21;2021;intern;Bachelor;DE;Landessprachenerkennung mittels X-Vektoren und Meta-Klassifikation;
```

### Basic Setup

For the upcoming assignments on Neural Networks, we'll be heavily using [PyTorch](https://pytorch.org) as our go-to deep learning library.
If you're not already familiar with PyTorch, now's the time to get started with it.
Head over to the [Basics](https://pytorch.org/tutorials/beginner/basics/intro.html) and gain some understanding about the essentials.
Before starting this assignment, make sure you've got PyTorch installed in your working environment. 
It's a quick setup, and you'll find all the instructions you need on the PyTorch website.
As always, you can use [NumPy](https://numpy.org) and [Pandas](https://pandas.pydata.org) for data handling etc.

*In this Jupyter Notebook, we will provide the steps to solve this task and give hints via functions & comments. However, code modifications (e.g., function naming, arguments) and implementation of additional helper functions & classes are allowed. The code aims to help you get started.*

---

In [21]:
# Dependencies
import numpy as np
import pandas as pd
import torch
import csv
from typing import TypedDict, Iterator, Iterable
from dataclasses import dataclass
import re

if torch.cuda.is_available():
    device = torch.device("cuda:0")
else:
    device = torch.device("cpu")

### Prepare the Data

1.1 Spend some time preparing the dataset. It may be helpful to lower-case the data and to filter for German titles. The CSV file should have the following format:

```
Anmeldedatum;Abgabedatum;JahrAkademisch;Art;Grad;Sprache;Titel;Abstract
```

1.2 Create the vocabulary from the prepared dataset. You'll need it for the initialization of the matrices and to map tokens to indices.

1.3 Generate the training pairs with center word and context word. Which window size do you choose?
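
To make step 1.3 more concrete, here is a minimal sketch of how center/context pairs could be generated from tokenized titles. The function name `skipgram_pairs`, the toy data, and the default window size are assumptions for illustration only; out-of-vocabulary tokens are simply skipped.

```python
def skipgram_pairs(sentences, word2idx, window_size=2):
    """Return (center_idx, context_idx) pairs for each token and its neighbours."""
    pairs = []
    for sentence in sentences:
        # Map tokens to indices, ignoring anything not in the vocabulary.
        indices = [word2idx[w] for w in sentence if w in word2idx]
        for pos, center in enumerate(indices):
            lo = max(0, pos - window_size)
            hi = min(len(indices), pos + window_size + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:  # a word is not its own context
                    pairs.append((center, indices[ctx_pos]))
    return pairs

# Toy example (hypothetical data, not taken from the real dataset file).
toy = [["monte", "carlo", "simulation"]]
toy_word2idx = {"monte": 0, "carlo": 1, "simulation": 2}
toy_pairs = skipgram_pairs(toy, toy_word2idx, window_size=1)
print(toy_pairs)
```

Thesis titles are short, so a small window (for example 2 to 4) is probably a reasonable starting point. A larger window mostly pairs every word in a title with every other word.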

In [22]:
DATASET_PATH = "data/theses2022.csv"

@dataclass
class Thesis:
    registration_date: str
    due_date: str
    year_academic: int
    type: str
    degree: str
    language: str
    title: str
    abstract: str

class _Thesis(TypedDict):
    Anmeldedatum: str
    Abgabedatum: str
    JahrAkademisch: str
    Art: str
    Grad: str
    Sprache: str
    Titel: str
    Abstract: str

def to_thesis(thesis: _Thesis) -> Thesis:
    return Thesis(
        registration_date=thesis["Anmeldedatum"],
        due_date=thesis["Abgabedatum"],
        year_academic=int(thesis["JahrAkademisch"]),
        type=thesis["Art"],
        degree=thesis["Grad"],
        language=thesis["Sprache"],
        title=thesis["Titel"],
        abstract=thesis["Abstract"]
    )

def load_theses_dataset(filepath) -> pd.DataFrame:
    """Loads all theses instances and returns them as a dataframe."""
    ### YOUR CODE HERE

    lists = {key: [] for key in Thesis.__dataclass_fields__.keys()}
    with open(filepath, encoding="utf-8-sig") as fp:
        theses = map(to_thesis, csv.DictReader(fp.readlines(), delimiter=";")) # type: ignore
        for thesis in theses:
            for key in lists:
                lists[key].append(thesis.__dict__[key])
    return pd.DataFrame(lists)
    
    ### END YOUR CODE

dataset = load_theses_dataset(DATASET_PATH)

In [23]:
def tokenize(text: str) -> Iterator[str]:
    for s in text.split():
        m = re.match(r"^([@#]?\w+)[,\.?!]?$", s)
        if m is not None:
            yield m.group(1)

def preprocess(dataframe: pd.DataFrame) -> list[list[str]]:
    """Preprocesses and tokenizes the given theses titles for further use."""
    ### YOUR CODE HERE
    
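    tokenized = []
    # Iterate over rows; indexing a DataFrame with an integer looks up a column, not a row.
    for _, row in dataframe.iterrows():
        tokenized.append(list(tokenize(row["title"])))
        tokenized.append(list(tokenize(row["abstract"])))
    return tokenized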
    
    ### END YOUR CODE
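
The hint from 1.1 (lower-casing and filtering for German titles) could be sketched with Pandas as follows. This assumes the column names `language` and `title` produced by `load_theses_dataset` and the language code `DE` seen in the sample rows. The helper name `german_titles_lowercased` and the toy frame are hypothetical.

```python
import pandas as pd

def german_titles_lowercased(dataframe: pd.DataFrame) -> list[str]:
    """Keep rows whose language code is 'DE' and return their lower-cased titles."""
    is_german = dataframe["language"].str.strip().str.upper() == "DE"
    return dataframe.loc[is_german, "title"].str.lower().tolist()

# Toy frame: the second row is made up to show the filter.
toy_frame = pd.DataFrame({
    "language": ["DE", "EN"],
    "title": [
        "Monte Carlo-Simulation für ein gekoppeltes Round-Robin-System",
        "Some English Title",
    ],
})
toy_titles = german_titles_lowercased(toy_frame)
print(toy_titles)
```

Hyphenated tokens such as `Round-Robin-System` do not match the `\w+` pattern in `tokenize`, so they are dropped unless you adapt the regex.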

In [24]:
def word_frequencies(word2idx: dict[str, int], data: Iterable[str]) -> np.ndarray:
    counts = np.zeros(len(word2idx), np.int32)
    total = 0
    for w in data:
        counts[word2idx[w]] += 1
        total += 1
    return counts / total

def create_training_pairs(data: list[list[str]], word2idx: dict[str, int], window_size: int):
    """Creates training pairs based on skip-grams for further use."""
    ### YOUR CODE HERE
    
    freqs = word_frequencies(word2idx, (w for l in data for w in l))
    
    ### END YOUR CODE

In [25]:
dataframe = load_theses_dataset(DATASET_PATH)
tokenized_data = preprocess(dataframe)
# Sort the vocabulary so that the word-to-index mapping is reproducible across runs.
vocabulary = sorted({w for l in tokenized_data for w in l})
word2idx = {w: i for i, w in enumerate(vocabulary)}
idx2word = list(vocabulary)
# training_pairs = create_training_pairs(...)

### Train and Analyze

2.1 Implement and train the word2vec model with your training data.

2.2 Implement a method to find the top-k similar words for a given word (token).

2.3 Analyze: What are the most similar words to "Konzeption", "Cloud" and "virtuelle"?
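
As a starting point for 2.1, the following is a minimal sketch of a skip-gram model with a full softmax over the vocabulary, trained on the whole batch at once. It is not the original word2vec training procedure: there is no negative sampling, no mini-batching, and no device handling. The class name `SkipGram`, the helper `train_skipgram`, and all hyperparameters are assumptions.

```python
import torch
from torch import nn

class SkipGram(nn.Module):
    """Two weight matrices: center-word embeddings and context-word scores."""

    def __init__(self, vocab_size: int, embedding_dim: int):
        super().__init__()
        self.center = nn.Embedding(vocab_size, embedding_dim)
        self.context = nn.Linear(embedding_dim, vocab_size, bias=False)

    def forward(self, center_idx: torch.Tensor) -> torch.Tensor:
        # One row of unnormalized scores over the vocabulary per center word.
        return self.context(self.center(center_idx))

def train_skipgram(pairs, vocab_size, embedding_dim=16, epochs=5, lr=0.05):
    model = SkipGram(vocab_size, embedding_dim)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()  # softmax followed by negative log-likelihood
    centers = torch.tensor([c for c, _ in pairs])
    contexts = torch.tensor([o for _, o in pairs])
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(centers), contexts)
        loss.backward()
        optimizer.step()
    return model

# Toy run on four hand-written (center, context) index pairs.
torch.manual_seed(0)
toy_model = train_skipgram([(0, 1), (1, 0), (1, 2), (2, 1)], vocab_size=3)
toy_embeddings = toy_model.center.weight.detach()
print(toy_embeddings.shape)
```

After training, the rows of `center.weight` can serve as the word embeddings. For the real dataset you will likely want mini-batches and more epochs than in this toy run.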

In [26]:
### TODO: 2.1 Implement and train the word2vec model.

### YOUR CODE HERE



### END YOUR CODE

In [27]:
### TODO: 2.2 Implement a method to find the top-k similar words.

### YOUR CODE HERE



### END YOUR CODE
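
One possible approach to 2.2 is to rank words by cosine similarity. The sketch below assumes the trained embeddings are available as a NumPy array with one non-zero row per word (a PyTorch tensor can be converted with `tensor.detach().cpu().numpy()`). The function name `top_k_similar` and the toy vectors are made up for illustration.

```python
import numpy as np

def top_k_similar(word, embeddings, word2idx, idx2word, k=5):
    """Return the k words with the highest cosine similarity to `word`."""
    # Normalize rows to unit length so that a dot product equals cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = unit @ unit[word2idx[word]]
    scores[word2idx[word]] = -np.inf  # exclude the query word itself
    best = np.argsort(-scores)[:k]
    return [(idx2word[i], float(scores[i])) for i in best]

# Toy vectors: "server" points in almost the same direction as "cloud".
toy_idx2word = ["cloud", "server", "wald"]
toy_word2idx = {w: i for i, w in enumerate(toy_idx2word)}
toy_vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
toy_neighbours = top_k_similar("cloud", toy_vectors, toy_word2idx, toy_idx2word, k=2)
print(toy_neighbours)
```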

In [28]:
### TODO: 2.3 Find the most similar words for "Konzeption", "Cloud" and "virtuelle".

### YOUR CODE HERE



### END YOUR CODE

### Play with the Embeddings

3.1 Use the computed embeddings: Can you identify the most similar theses for some examples?

3.2 Visualize the embeddings for a subset of theses using [TSNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). You can use [Scikit-Learn](https://scikit-learn.org/stable/) and [Matplotlib](https://matplotlib.org) or [Seaborn](https://seaborn.pydata.org).
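
A simple way to get one vector per thesis for 3.1 is to average the word vectors of its title tokens. This is only one of several options (it ignores word order and word importance), and the helper name `thesis_embedding` and the toy data are assumptions.

```python
import numpy as np

def thesis_embedding(tokens, embeddings, word2idx):
    """Average the word vectors of all in-vocabulary tokens of one title."""
    rows = [embeddings[word2idx[t]] for t in tokens if t in word2idx]
    if not rows:
        # No known token: fall back to the zero vector.
        return np.zeros(embeddings.shape[1])
    return np.mean(rows, axis=0)

# Toy example with two 2-dimensional word vectors.
toy_vectors = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_word2idx = {"cloud": 0, "konzeption": 1}
toy_thesis = thesis_embedding(["konzeption", "cloud", "unbekannt"], toy_vectors, toy_word2idx)
print(toy_thesis)
```

Stacking these vectors into a matrix gives an input that can be compared with cosine similarity or passed to `TSNE(n_components=2).fit_transform(...)` for the 2D plot.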

In [29]:
### TODO: 3.1 Compute the embeddings for the theses and transform with TSNE.

### YOUR CODE HERE



### END YOUR CODE

In [30]:
### TODO: 3.2 Visualize the samples in the 2D space.

### YOUR CODE HERE



### END YOUR CODE