# Task 1

In this first task you will implement a toy version of the Word2Vec algorithm to produce some simple word embeddings using the PyTorch library. <br>
First, we need to install the required packages (if you have not installed them already):

In [None]:
import sys
!{sys.executable} -m pip install torch
!{sys.executable} -m pip install matplotlib

<br>We will need a corpus to train our embeddings on. To keep computation times low and to be able to produce simpler plots later, we will only use the short text below as a toy corpus.

In [None]:
corpus = "Human language is special for several reasons . It is specifically constructed to convey the speaker / writer's " \
       "meaning . It is a complex system , although little children can learn it pretty quickly . Another remarkable " \
       "thing about human language is that it is all about symbols . According to Chris Manning , a machine learning " \
       "professor at Stanford , it is a discrete , symbolic , categorical signaling system . This means we can convey the " \
       "same meaning in different ways ( i.e. , speech , gesture , signs , etc. ) The encoding by the human brain is a " \
       "continuous pattern of activation by which the symbols are transmitted via continuous signals of sound and " \
       "vision . Understanding human language is considered a difficult task due to its complexity . For example , there " \
       "are an infinite number of different ways to arrange words in a sentence . Also , words can have several meanings " \
       "and contextual information is necessary to correctly interpret sentences . Every language is more or less unique " \
       "and ambiguous . Just take a look at the following newspaper headline \" The Pope’s baby steps on gays . \" This " \
       "sentence clearly has two very different interpretations , which is a pretty good example of the challenges in " \
       "NLP . Note that a perfect understanding of language by a computer would result in an AI that can process the " \
       "whole information that is available on the internet , which in turn would probably result in artificial general " \
       "intelligence ."

<br>In order to train our model, we first need to transform this text into usable training data.
1. Extract the vocabulary from this text. Do not worry about collapsing different forms of the same word into a single entry (e.g. treat "language" and "languages" as separate words). You also do not need to worry about punctuation marks. They are already split from the words in the text where necessary, so you can treat them like individual words. This should result in a list with a single entry for each unique word (make sure to not have multiple entries for e.g. "is").
2. Assign an index to each token in the vocabulary, so that each word can be represented as a distinct number.
3. Transform our corpus into a sequence of these indices.
4. One-Hot encode the corpus using the <i>torch.nn.functional.one_hot()</i> function. You can find the documentation for it [here](https://pytorch.org/docs/stable/generated/torch.nn.functional.one_hot.html).
5. Convert the resulting tensor so that it contains floating point numbers instead of integers (this is just to make it compatible with our model implementation later).
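
The five steps above can be sketched on a much smaller example. This is only an illustration under assumptions: the sentence and the names `text`, `vocab`, `word_to_idx`, `indices` and `one_hot` are hypothetical and not prescribed by the task.

```python
import torch
import torch.nn.functional as F

# Hypothetical mini corpus, used only for illustration.
text = "the cat sat on the mat ."
tokens = text.split()

# dict.fromkeys keeps first-occurrence order and drops duplicates.
vocab = list(dict.fromkeys(tokens))
word_to_idx = {word: i for i, word in enumerate(vocab)}

# Corpus as a sequence of vocabulary indices.
indices = torch.tensor([word_to_idx[t] for t in tokens])

# One-hot encode, then convert from integers to floating point numbers.
one_hot = F.one_hot(indices, num_classes=len(vocab)).float()

print(vocab)          # ['the', 'cat', 'sat', 'on', 'mat', '.']
print(one_hot.shape)  # torch.Size([7, 6])
```

The word "the" occurs twice in the text but only once in `vocab`, so both occurrences map to the same index.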

In [None]:
### Your code here ###
import torch

<br>Now construct the actual training data. We are going to implement the skip-gram version of Word2Vec, using a context window of size 5 (the centre word plus two words on either side). This means that for each word, the model should try to predict each of the two words before it and the two words after it individually. Thus, the training data should consist of four input-label pairs for each word, where the four labels are the other words in the context window. E.g. for the word "special" in the first sentence of our toy corpus, the pairs would be (special, language), (special, is), (special, for), (special, several).<br>
Keep in mind, however, that our model implementation needs two tensors, not a list of tuples. So you should construct an input tensor containing the one-hot encoded tokens from our toy corpus four times, and for each of these four entries the label tensor should contain a different word from its context window. Since we are going to use CrossEntropyLoss as our loss function later, the labels should not be one-hot encoded, but simply state the index of the target token (this is a characteristic of the torch implementation of cross entropy loss).<br>
You might have noticed that this way of constructing training data is problematic for the first two and the last two tokens in the corpus, because their context window would be out of bounds. Usually, this would be solved by padding (adding two special tokens at the beginning and at the end of the text), but for simplicity's sake, you can just leave the first two and the last two tokens out of the input tensor.
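
One possible way to build the two tensors is sketched below on a hypothetical index sequence. The names `indices`, `offsets`, `centres`, `inputs` and `labels` are assumptions made for this example, and the pairs are interleaved per centre word rather than stored as four consecutive copies of the corpus; the set of pairs is the same either way.

```python
import torch
import torch.nn.functional as F

# Hypothetical index sequence for a 7-token corpus with 6 vocabulary entries.
indices = torch.tensor([0, 1, 2, 3, 0, 4, 5])
vocab_size = 6
offsets = [-2, -1, 1, 2]  # positions in a window of size 5, centre excluded

centres, labels = [], []
# Skip the first and last two tokens so the window never leaves the corpus.
for pos in range(2, len(indices) - 2):
    for off in offsets:
        centres.append(indices[pos])
        labels.append(indices[pos + off])

# Inputs are one-hot floats; labels stay plain indices for CrossEntropyLoss.
inputs = F.one_hot(torch.stack(centres), num_classes=vocab_size).float()
labels = torch.stack(labels)

print(inputs.shape, labels.shape)  # torch.Size([12, 6]) torch.Size([12])
```

With 7 tokens, only the 3 middle positions are used as centre words, which gives 3 × 4 = 12 input-label pairs.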

In [None]:
### Your code here ###

The next step is to construct our model. For easier visualisation later, we want to produce a model with embedding size 2. In reality, Word2Vec models use a much bigger embedding size, usually around 300.<br>
PyTorch uses classes to define models. You can find a code skeleton for such a class below. Define the model so that it has three layers: an input, a hidden and an output layer. Also apply the Softmax activation function to the output.<br>
We use two fully connected linear layers: one connects the input layer to the hidden layer, the other connects the hidden layer to the output layer. The input and output sizes should be identical and equal to the size of our vocabulary.
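
To make the layer sizes concrete, the following sketch passes a single one-hot vector through two standalone linear layers. It is not the class skeleton itself; the vocabulary size of 6 and the variable names are assumptions chosen for illustration.

```python
import torch

vocab_size = 6       # hypothetical vocabulary size
embedding_size = 2   # embedding size requested in this task

hidden = torch.nn.Linear(vocab_size, embedding_size)      # input -> hidden
prediction = torch.nn.Linear(embedding_size, vocab_size)  # hidden -> output
softmax = torch.nn.Softmax(dim=0)

x = torch.zeros(vocab_size)
x[3] = 1.0  # one-hot vector for the token with index 3

embedding = hidden(x)                    # 2 values: the word's embedding
scores = softmax(prediction(embedding))  # one probability per vocabulary entry

print(embedding.shape, scores.shape)  # torch.Size([2]) torch.Size([6])
```

Note that `hidden.weight` has shape `(embedding_size, vocab_size)`, so each column holds the two embedding values of one vocabulary entry.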

In [None]:
class Word2Vec(torch.nn.Module):

    def __init__(self):
        super().__init__()
        self.hidden = # Linear network connecting input to hidden layer
        self.prediction = # Linear network connecting embedding to output layer
        self.activation = torch.nn.Softmax(dim=0)
    def forward(self, x):
        # apply the layers defined above to the input x
        return x

Now instantiate the model. Additionally, define a learning rate (how much the weights of our model should change after each training iteration, usually around 0.01) and the number of epochs (how many iterations over the training data) you want to run. More epochs can increase model performance, but they also increase computation time. Because our training data is so small, the model should converge rather quickly, so do not choose an overly high number of epochs. A good starting point might be 100 epochs; then adjust that number depending on how long your code takes to run. As stated above, we are going to use cross entropy loss as our loss function and Adam as our optimizer (already implemented in the code below). An optimizer determines how the weights of our model are updated.

In [None]:
model = Word2Vec()
lr = # learning rate
epochs = # number of epochs
loss_function = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

It is time to train our model now.
1. For each epoch, loop through all training instances.
2. We update our weights after each instance, so you should call your model on each sample to get a prediction (model(sample)).
3. Compute the loss of each prediction using our defined loss function (Cross Entropy Loss). Pass the prediction and the vocabulary index of the correct word to the loss function.
4. Then, call <i>.backward()</i> on your loss, followed by <i>optimizer.step()</i> and <i>optimizer.zero_grad()</i>. This computes the gradients through backpropagation, updates the weights and resets the gradients for the next instance.
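
The four steps above can be sketched as follows. This is a minimal illustration under assumptions: the three training instances are made up, and a small `torch.nn.Sequential` stand-in replaces the `Word2Vec` class. The stand-in passes raw scores to the loss, since `CrossEntropyLoss` applies its own log-softmax internally.

```python
import torch

torch.manual_seed(0)
vocab_size = 6
# Hypothetical training data: three one-hot inputs with their target indices.
inputs = torch.eye(vocab_size)[[2, 3, 0]]
labels = torch.tensor([1, 0, 4])

# Stand-in model with the layer sizes described earlier (not the task's class).
model = torch.nn.Sequential(
    torch.nn.Linear(vocab_size, 2),
    torch.nn.Linear(2, vocab_size),
)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(20):
    epoch_loss = 0.0
    for sample, label in zip(inputs, labels):
        prediction = model(sample)               # forward pass for one instance
        loss = loss_function(prediction, label)  # compare with the target index
        loss.backward()                          # backpropagate the loss
        optimizer.step()                         # update the weights
        optimizer.zero_grad()                    # reset gradients
        epoch_loss += loss.item()

print(f"summed loss in final epoch: {epoch_loss:.3f}")
```

Tracking the summed loss per epoch, as done here with `epoch_loss`, is an optional addition that makes it easier to judge whether the chosen number of epochs is sufficient.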

In [None]:
### Your code here ###

Modify the following code to visualise your embeddings. You simply need to save the hidden layer weights into the <i>weights</i> variable. In the original implementation of Word2Vec, the prediction weights were used as embeddings and not the hidden weights (Mikolov et al., 2013, [Link](https://arxiv.org/abs/1301.3781)). However, in practice, both versions are used.

In [None]:
import matplotlib.pyplot as plt
weights = # This should look something like my_model.hidden.weight
weights.detach_()
weights_dim_1 = weights[0]
weights_dim_2 = weights[1]

plt.rcParams['figure.figsize'] = (20, 20)
plt.scatter(weights_dim_1, weights_dim_2, s=14)
for i, word in enumerate(uniques):
    plt.annotate(word, (weights_dim_1[i], weights_dim_2[i]), size= 10)
plt.show()

Obviously, our training data is way too small and our model too simple to produce good embeddings for all tokens in the corpus.<br>
Can you still find some tokens that our model produced meaningful embeddings for? (Hint: look for clusters of tokens you would expect to be close to each other in a vector space representation.)

[Your answer here]

# Task 2

In this task, you will fine-tune a version of BERT (Bidirectional Encoder Representations from Transformers), a transformer-based language model, to classify movie reviews from Rotten Tomatoes. The goal is to train the model to predict whether a review is positive (1) or negative (0).
As in Task 1, first install the following packages, if you have not done so already:

In [None]:
!{sys.executable} -m pip install datasets
!{sys.executable} -m pip install transformers
!{sys.executable} -m pip install numpy
!{sys.executable} -m pip install scikit-learn

<br>Download the dataset using the following command:

In [None]:
from datasets import load_dataset
dataset = load_dataset('rotten_tomatoes')

Look at some entries in the dataset, to familiarize yourself with its contents and structure. You can also have a look at [this website](https://huggingface.co/datasets/rotten_tomatoes) to get some additional information. 

In [None]:
### Your code for data exploration ###

To be able to use the texts in the dataset, we need them in tokenized form. For this you need to download a tokenizer. The following code snippet downloads a pretrained tokenizer and tokenizes the dataset:

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)

Using the whole data set would probably make the training process very long, so you should only use a small subset of the training and test instances contained in the data set. (Hint: use the .shuffle and .select methods of the Dataset class. The documentation can be found [here](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset).)

In [None]:
### Your code here ###

Load the DistilBERT model (a smaller version of the full BERT model) and the evaluation metric to be used (in this case 'accuracy'). Fill in the number of labels the classification model should use. (Do not worry about the warning, it does not concern the task at hand.) The model takes a sequence of input tokens, which is why we had to tokenise the reviews, and returns a logit for each possible label. A higher logit means that, according to the model's prediction, the input sequence is more likely to correspond to that label. The evaluation metric measures how good these predictions are: accuracy is the fraction of predictions that match the true labels.

In [None]:
from transformers import AutoModelForSequenceClassification
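# Note: load_metric was removed from newer releases of the datasets library;
# there, the separate evaluate package provides evaluate.load("accuracy") instead.
from datasets import load_metric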
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-cased', num_labels=) # fill in num_labels

metric = load_metric("accuracy")

In order to compute the metric for the predictions our model makes, we will need to first convert the logits that the model returns into the corresponding class label. The following function will do this:

In [None]:
import numpy as np
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
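
To see what the argmax step does, here is a small worked example with made-up numbers. The logits and labels are hypothetical, and accuracy is computed by hand instead of through the metric object.

```python
import numpy as np

# Hypothetical logits for three reviews and two labels (0 = negative, 1 = positive).
logits = np.array([[ 2.1, -1.3],
                   [-0.4,  0.9],
                   [ 0.2,  0.1]])
labels = np.array([0, 1, 1])

predictions = np.argmax(logits, axis=-1)   # index of the largest logit per row
accuracy = (predictions == labels).mean()  # fraction of correct predictions

print(predictions)  # [0 1 0]
```

Here the third review is misclassified, so two of three predictions are correct and `accuracy` is 2/3.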

Before training, specify the training arguments. The most important ones here are the number of epochs and the evaluation strategy: use 'epoch' to evaluate your model after each epoch, or 'steps' to evaluate it at regular intervals of weight update steps (set via the eval_steps argument). As for the number of epochs, since this model does many more computations per epoch than the one you implemented in Task 1, you should choose a much lower value. A good starting point is to set the num_train_epochs parameter to 3, and then adjust as needed. If you are interested and would like to tweak other arguments, you can find a list of them [here](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments).

In [None]:
from transformers import TrainingArguments
training_args = TrainingArguments(output_dir='my_training_dir', evaluation_strategy="epoch", num_train_epochs=) # tweak training args to your liking

The transformers library comes with a Trainer class that streamlines the training process. Create an instance of this class and pass it your base model, your training arguments, your training data set, your evaluation data set and the compute_metrics function.

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=,
    eval_dataset=,
    compute_metrics=compute_metrics,
)

Start the training by calling .train() on your trainer.

In [None]:
### Your code here ###

What accuracy did your model achieve at the end of the training process?

[Your answer here]

Now make some predictions using your model. Use the TextClassificationPipeline class included in the transformers library to write a function which takes any string as input and returns 'POSITIVE' if your model thinks the input was a positive review, and 'NEGATIVE' otherwise.
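
The mapping from scores to a sentiment string can be sketched without running the model. Everything here is an assumption for illustration: the helper `to_sentiment` is hypothetical, the scores are made up, and the list-of-dicts shape with the label names 'LABEL_0' and 'LABEL_1' should be checked against what your pipeline actually returns, since the output may be nested one level deeper.

```python
# Hypothetical scores for one review, in the list-of-dicts shape assumed here.
scores = [{"label": "LABEL_0", "score": 0.12},
          {"label": "LABEL_1", "score": 0.88}]


def to_sentiment(scores):
    """Map the highest-scoring entry to 'POSITIVE' or 'NEGATIVE'."""
    best = max(scores, key=lambda entry: entry["score"])
    # Assumption: 'LABEL_1' corresponds to the positive class (label 1).
    return "POSITIVE" if best["label"] == "LABEL_1" else "NEGATIVE"


print(to_sentiment(scores))  # POSITIVE
```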

In [None]:
from transformers import TextClassificationPipeline

pipeline = TextClassificationPipeline(model=model, tokenizer=tokenizer, return_all_scores=True)

def predict(review: str) -> str:
    ### Your code here ###
    

Could you find any kinds of input that your model struggles to classify correctly?

[Your answer here]