# Assignment 4: Neural Networks

---

## Task 1) Skip-grams

Tomas Mikolov's [original paper](https://arxiv.org/abs/1301.3781) for word2vec is not very specific on how to actually compute the embedding matrices.
Xin Rong provides a much more detailed [walk-through](https://arxiv.org/pdf/1411.2738.pdf) of the math; I recommend you go through it before you continue with this assignment.
While the original implementation was written in C and estimates the matrices directly, in this assignment we want to use PyTorch (and autograd) to train them.
There are plenty of example implementations and blog posts that show how to do it; I particularly recommend [Mateusz Bednarski's](https://towardsdatascience.com/implementing-word2vec-in-pytorch-skip-gram-model-e6bae040d2fb) version. Familiarize yourself with skip-grams and how to train them using PyTorch.

### Data

Download the `theses.csv` data set from the `Supplemental Materials` in the `Files` section of our Microsoft Teams group.
This dataset consists of approx. 3,000 thesis topics chosen by students in the past.
Here are some examples of the file content:

```
27.10.94;14.07.95;1995;intern;Diplom;DE;Monte Carlo-Simulation für ein gekoppeltes Round-Robin-System;
04.11.94;14.03.95;1995;intern;Diplom;DE;Implementierung eines Testüberdeckungsgrad-Analysators für RAS;
01.11.20;01.04.21;2021;intern;Bachelor;DE;Landessprachenerkennung mittels X-Vektoren und Meta-Klassifikation;
```

### Basic Setup

For the upcoming assignments on Neural Networks, we'll be using [PyTorch](https://pytorch.org) heavily as our go-to deep learning library.
If you're not already familiar with PyTorch, now's the time to get started with it.
Head over to the [Basics](https://pytorch.org/tutorials/beginner/basics/intro.html) and gain some understanding about the essentials.
Before starting this assignment, make sure you've got PyTorch installed in your working environment. 
It's a quick setup, and you'll find all the instructions you need on the PyTorch website.
As always, you can use [NumPy](https://numpy.org) and [Pandas](https://pandas.pydata.org) for data handling etc.

*In this Jupyter Notebook, we will provide the steps to solve this task and give hints via functions & comments. However, code modifications (e.g., function naming, arguments) and implementation of additional helper functions & classes are allowed. The code aims to help you get started.*

---

In [1]:
%pip install torch numpy matplotlib scikit-learn pandas

Note: you may need to restart the kernel to use updated packages.


In [2]:
# Dependencies
import numpy as np
import pandas as pd


### Prepare the Data

1.1 Spend some time on preparing the dataset. It may be helpful to lower-case the data and to keep only the German titles. The format of the CSV file should be:

```
Anmeldedatum;Abgabedatum;JahrAkademisch;Art;Grad;Sprache;Titel;Abstract
```

1.2 Create the vocabulary from the prepared dataset. You'll need it for the initialization of the matrices and to map tokens to indices.

1.3 Generate the training pairs with center word and context word. Which window size do you choose?
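
Step 1.1 could be sketched as follows. This is a minimal example under stated assumptions, not the reference solution: it assumes the file is semicolon-separated with the columns listed above and no header row, and it reads the three sample rows from the Data section instead of the real file. `COLUMNS`, `SAMPLE`, and `load_german_titles` are hypothetical names. If your copy of the file uses a different separator or already contains a header line, adjust `sep` and `header` accordingly.

```python
import io

import pandas as pd

# Column names taken from the format description above
COLUMNS = ["Anmeldedatum", "Abgabedatum", "JahrAkademisch", "Art",
           "Grad", "Sprache", "Titel", "Abstract"]

# Three sample rows from the Data section, used as a stand-in for the real file
SAMPLE = """27.10.94;14.07.95;1995;intern;Diplom;DE;Monte Carlo-Simulation für ein gekoppeltes Round-Robin-System;
04.11.94;14.03.95;1995;intern;Diplom;DE;Implementierung eines Testüberdeckungsgrad-Analysators für RAS;
01.11.20;01.04.21;2021;intern;Bachelor;DE;Landessprachenerkennung mittels X-Vektoren und Meta-Klassifikation;
"""


def load_german_titles(source):
    """Reads the theses file and returns the lower-cased German titles as a list."""
    frame = pd.read_csv(source, sep=";", header=None, names=COLUMNS, dtype=str)
    # Keep only rows whose language column is "DE"
    german = frame[frame["Sprache"] == "DE"]
    return german["Titel"].dropna().str.lower().tolist()


titles = load_german_titles(io.StringIO(SAMPLE))
print(titles[0])
```

Passing a file path instead of the `io.StringIO` object reads the real dataset in the same way.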

In [3]:
def load_theses_dataset(filepath):
    """Loads all theses instances and returns their titles as a list."""
    ### YOUR CODE HERE
    # This expects a tab-separated file without a header row
    # whose fourth column (index 3) holds the thesis title
    data = pd.read_csv(filepath, sep='\t', encoding='utf-8', header=None)
    return data[3].to_list()
    ### END YOUR CODE

In [4]:
def preprocess(dataframe):
    """Lower-cases, cleans, and tokenizes the given theses titles in place."""
    ### YOUR CODE HERE
    # Characters that are deleted outright
    removed = "'\"(),.!?:;"
    # Hyphens and underscores become spaces so compound parts turn into separate tokens
    table = str.maketrans("-_", "  ", removed)

    for i, title in enumerate(dataframe):
        # Convert to lowercase, remove special characters, then tokenize;
        # split() without arguments already drops empty strings
        dataframe[i] = title.lower().translate(table).split()
    ### END YOUR CODE

In [5]:
def cerate_vocab(dataframe):
    """Creates a vocabulary from the given dataframe."""
    ### YOUR CODE HERE
    # Create a vocabulary from the tokenized titles
    vocab = set()
    for title in dataframe:
        for word in title:
            vocab.add(word)
    return vocab
    ### END YOUR CODE
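
One thing to keep in mind with a plain `set`: its iteration order for strings can differ between interpreter runs, so the indices assigned to words are not reproducible. The following is a minimal sketch of an alternative, assuming you want a stable order and optionally a frequency cut-off. `build_vocab`, `min_count`, and the `toy_` variables are hypothetical names, and the second toy title is made up for illustration.

```python
from collections import Counter


def build_vocab(tokenized_titles, min_count=1):
    """Returns (word2idx, idx2word, counts) with a reproducible index order."""
    counts = Counter(token for title in tokenized_titles for token in title)
    # Sort by descending frequency, then alphabetically, so ties are stable
    kept = sorted((w for w, c in counts.items() if c >= min_count),
                  key=lambda w: (-counts[w], w))
    word2idx = {word: idx for idx, word in enumerate(kept)}
    idx2word = dict(enumerate(kept))
    return word2idx, idx2word, counts


# Toy input: two already tokenized titles
toy_titles = [["email", "am", "beispiel", "smtp", "im", "internet"],
              ["sicherheit", "im", "internet"]]
toy_word2idx, toy_idx2word, toy_counts = build_vocab(toy_titles)
print(toy_word2idx["im"], toy_idx2word[0])
```

If you raise `min_count`, tokens that were dropped from the vocabulary also have to be skipped when generating training pairs; otherwise the index lookup fails.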

In [6]:
def create_training_pairs(data, word2idx, window_size):
    """Creates (center, context) index pairs based on skip-grams for further use."""
    ### YOUR CODE HERE
    training_pairs = []

    for title in data:
        for i, word in enumerate(title):
            # Slice the window around position i (slicing copies the list)
            start = max(0, i - window_size)
            end = min(len(title), i + window_size + 1)
            context_words = title[start:end]
            print("words",context_words)
            # Drop the center word by position; remove(word) would delete the
            # first equal token, which is wrong if the word repeats in the window
            del context_words[i - start]
            print("withput word", context_words)
            for context_word in context_words:
                training_pairs.append((word2idx[word], word2idx[context_word]))

    return training_pairs
    ### END YOUR CODE

In [7]:
dataframe = load_theses_dataset("data/theses.tsv")
preprocess(dataframe)
vocabulary = cerate_vocab(dataframe)
word2idx = {word: idx for idx, word in enumerate(vocabulary)}
idx2word = {idx: word for idx, word in enumerate(vocabulary)}
training_pairs = create_training_pairs([dataframe[0]], word2idx, window_size=2)

words ['email', 'am', 'beispiel']
withput word ['am', 'beispiel']
words ['email', 'am', 'beispiel', 'smtp']
withput word ['email', 'beispiel', 'smtp']
words ['email', 'am', 'beispiel', 'smtp', 'im']
withput word ['email', 'am', 'smtp', 'im']
words ['am', 'beispiel', 'smtp', 'im', 'internet']
withput word ['am', 'beispiel', 'im', 'internet']
words ['beispiel', 'smtp', 'im', 'internet']
withput word ['beispiel', 'smtp', 'internet']
words ['smtp', 'im', 'internet']
withput word ['smtp', 'im']


In [8]:
print(vocabulary)

{'warehouse', 'concepts', 'parametrisierten', 'suite', 'bei', 'tests', 'hobby', 'energiewirtschaft', 'wirksamen', 'geführten', 'modellbasierten', 'ergie', 'gussbauteilfehlstellen', 'servicetechniker', 'selektion', 'emf', 'konformen', 'cmmi', 'ergebnissen', 'kits', 'templatebasierte', 'radiologische', 'adressen', 'sammlerkataloges', 'travel', '2016', 'apis', '166', 'plm', 'decisions', 'risikomanagementsystems', 'verifikation', 'recovery', 'working', 'sea', 'dextra', 'regulatorisch', 'potentieller', 'uveröffentlichung', '101', 'aop', 'zeitalter', 'unterstützt', 'independent', 'zusammengesetzten', 'synchronisationsverfahren', 'zeitbehafteter', 'betriebssystem', 'java/j2me/midp', 'appliance', 'gt', 'dual', 'informationsgewinnung', 'generative', 'behat', 'sporting', 'markov', 'desktop', 'h263', 'nao', 'dec', 'baustein', 'kurses', 'redaktionswerkezugs', '/3d', 'verhaltensmodellen', 'spannungsfeldes', 'supplier', 'datenbankbasierten', 'commerce', 'rails', 'automobilsektor', 'assistent', 'mult

### Train and Analyze

2.1 Implement and train the word2vec model with your training data.

2.2 Implement a method to find the top-k similar words for a given word (token).

2.3 Analyze: What are the most similar words to "Konzeption", "Cloud" and "virtuelle"?
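
As a starting point for 2.1, here is a minimal sketch of a skip-gram model in PyTorch. It makes several simplifying assumptions: it uses a full softmax over the vocabulary (the original implementation avoids this with hierarchical softmax or negative sampling), it trains on the whole pair list in one batch, and the hyperparameters are arbitrary. `SkipGram`, `train_skipgram`, and `toy_pairs` are hypothetical names, and the toy pairs are made up.

```python
import torch
import torch.nn as nn
import torch.optim as optim


class SkipGram(nn.Module):
    """Skip-gram with a full softmax over the vocabulary."""

    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        # Input (center word) embeddings
        self.in_embed = nn.Embedding(vocab_size, embedding_dim)
        # Output (context word) weights; produces one logit per vocabulary entry
        self.out_embed = nn.Linear(embedding_dim, vocab_size, bias=False)

    def forward(self, center_idx):
        return self.out_embed(self.in_embed(center_idx))


def train_skipgram(pairs, vocab_size, embedding_dim=16, epochs=50, lr=0.05):
    """Full-batch training on (center, context) index pairs; returns model and losses."""
    torch.manual_seed(0)
    centers = torch.tensor([c for c, _ in pairs], dtype=torch.long)
    contexts = torch.tensor([o for _, o in pairs], dtype=torch.long)
    model = SkipGram(vocab_size, embedding_dim)
    optimizer = optim.Adam(model.parameters(), lr=lr)
    # CrossEntropyLoss applies log-softmax to the logits internally
    loss_fn = nn.CrossEntropyLoss()
    losses = []
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(centers), contexts)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return model, losses


# Toy pairs over a 4-word vocabulary, purely for illustration
toy_pairs = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]
model, losses = train_skipgram(toy_pairs, vocab_size=4)
embeddings = model.in_embed.weight.detach()
print(tuple(embeddings.shape), losses[-1] < losses[0])
```

For the real dataset you would pass your own training pairs and `len(vocabulary)`, and most likely iterate over mini-batches instead of the full pair list. The rows of `in_embed.weight` are the word vectors used in the following steps.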

In [None]:
### TODO: 2.1 Implement and train the word2vec model.

### YOUR CODE HERE

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import numpy as np






### END YOUR CODE

In [11]:
### TODO: 2.2 Implement a method to find the top-k similar words.

### YOUR CODE HERE



### END YOUR CODE
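
A common way to approach 2.2 is to rank all words by cosine similarity to the query word. The sketch below assumes the embeddings are available as a NumPy array with one row per vocabulary index; `top_k_similar` and the `toy_` variables are hypothetical names, and the 2-D vectors are hand-made for illustration.

```python
import numpy as np


def top_k_similar(word, embeddings, word2idx, idx2word, k=5):
    """Returns the k words with the highest cosine similarity to `word`."""
    # Normalize rows so that a dot product equals the cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    scores = unit @ unit[word2idx[word]]
    # Exclude the query word itself before ranking
    scores[word2idx[word]] = -np.inf
    best = np.argsort(-scores)[:k]
    return [(idx2word[int(i)], float(scores[i])) for i in best]


# Hand-made 2-D vectors, purely for illustration
toy_words = ["cloud", "server", "konzeption", "entwurf"]
toy_word2idx = {w: i for i, w in enumerate(toy_words)}
toy_idx2word = {i: w for w, i in toy_word2idx.items()}
toy_embeddings = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])

result = top_k_similar("cloud", toy_embeddings, toy_word2idx, toy_idx2word, k=2)
print(result)
```

Because the titles are lower-cased during preprocessing, query words such as "Konzeption" have to be lower-cased before the lookup.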

In [12]:
### TODO: 2.3 Find the most similar words for "Konzeption", "Cloud" and "virtuelle".

### YOUR CODE HERE



### END YOUR CODE

### Play with the Embeddings

3.1 Use the computed embeddings: Can you identify the most similar theses for some examples?

3.2 Visualize the embeddings for a subset of theses using [TSNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). You can use [Scikit-Learn](https://scikit-learn.org/stable/) and [Matplotlib](https://matplotlib.org) or [Seaborn](https://seaborn.pydata.org).
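
For 3.1, one simple option (an assumption here, not the only possible choice) is to represent each thesis by the mean of its word vectors and compare theses by cosine similarity. `embed_title`, `most_similar_titles`, and the `toy_` variables are hypothetical names, and the vectors and titles are hand-made for illustration.

```python
import numpy as np


def embed_title(tokens, embeddings, word2idx):
    """Averages the vectors of all known tokens; returns zeros if none are known."""
    rows = [embeddings[word2idx[t]] for t in tokens if t in word2idx]
    if not rows:
        return np.zeros(embeddings.shape[1])
    return np.mean(rows, axis=0)


def most_similar_titles(query_idx, title_vectors, k=3):
    """Returns indices of the k titles closest to title `query_idx` by cosine similarity."""
    norms = np.linalg.norm(title_vectors, axis=1, keepdims=True)
    unit = title_vectors / np.clip(norms, 1e-12, None)
    scores = unit @ unit[query_idx]
    # Exclude the query title itself before ranking
    scores[query_idx] = -np.inf
    return [int(i) for i in np.argsort(-scores)[:k]]


# Hand-made vectors and tokenized titles, purely for illustration
toy_word2idx = {"cloud": 0, "server": 1, "konzeption": 2, "entwurf": 3}
toy_embeddings = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
toy_titles = [["cloud", "server"], ["konzeption", "entwurf"], ["server", "unbekannt"]]

title_vectors = np.stack([embed_title(t, toy_embeddings, toy_word2idx) for t in toy_titles])
print(most_similar_titles(0, title_vectors, k=1))
```

Averaging ignores word order and gives frequent function words the same weight as content words, so filtering stop words beforehand may give cleaner neighbours.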

In [13]:
### TODO: 3.1 Compute the embeddings for the theses and transform with TSNE.

### YOUR CODE HERE



### END YOUR CODE

In [14]:
### TODO: 3.2 Visualize the samples in the 2D space.

### YOUR CODE HERE



### END YOUR CODE
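
The projection and plot for 3.2 could look roughly like the sketch below. It uses random vectors as a stand-in for real thesis embeddings, so the resulting picture carries no meaning; `title_vectors`, `labels`, and the chosen `perplexity` are assumptions for illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Random stand-in for real title vectors: 30 titles with 16 dimensions each
rng = np.random.default_rng(0)
title_vectors = rng.normal(size=(30, 16))
labels = [f"thesis {i}" for i in range(len(title_vectors))]

# perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=5, init="pca", random_state=0)
coords = tsne.fit_transform(title_vectors)

# Scatter plot with one annotated point per thesis
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(coords[:, 0], coords[:, 1], s=15)
for (x, y), label in zip(coords, labels):
    ax.annotate(label, (x, y), fontsize=7)
ax.set_title("t-SNE projection of thesis embeddings (random stand-in data)")
```

With real data, replace `title_vectors` by your thesis embeddings and `labels` by (shortened) titles; plotting only a subset keeps the annotations readable.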