## Project NLP and Deep Learning

### 1. Project proposal presentation

In the presentation, you have 5 minutes to present your research proposal. During the presentation, you should explain:

* What is the topic of your project, what is the current state of this topic/task/setup
* What is the new part of your project
* What is the research question of your project

We have proposed a number of topics in the slides, which can be found on LearnIt. You can either pick one of these or come up with your own. If you choose your own topic, we suggest getting pre-approval from Rob van der Goot.

**Deadline for uploading slides: day before the presentation (23:59)** (PDF only; all slides will be merged into one long PDF for a smooth presentation)

### 2. Baseline
To get your project started, begin by implementing a baseline model. Ideally, this is going to be the main baseline that you compare against in your paper. Note that this baseline should be more advanced than just predicting the majority class (O).

We will use the EWT portion of the [Universal NER project](http://www.universalner.org/), which we provide with this notebook for convenience. You can use the train data (`en_ewt-ud-train.iob2`) and dev data (`en_ewt-ud-dev.iob2`) to build your baseline, then upload your predictions on the test data (`en_ewt-ud-test.iob2`).

It is important to upload your predictions in the same format as the training and dev files, so that the `span_f1.py` script can be used.
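
To make clear what a span-based score rewards, here is a minimal sketch of exact-match span F1 over IOB2 tag sequences. It is not the provided `span_f1.py` script, and the helper names `extract_spans` and `span_f1` are hypothetical. One assumption is labeled in the code: a stray `I-` tag without a preceding `B-` is treated as the start of a new span, which the official script may handle differently.

```python
def extract_spans(tags):
    """Return a set of (start, end, label) spans from one IOB2 tag sequence (end exclusive)."""
    spans = set()
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # the sentinel closes a trailing span
        # An open span ends at "O", at a new "B-", or at an "I-" with a different label.
        if start is not None and (not tag.startswith("I-") or tag[2:] != label):
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            # Assumption: a stray "I-" without a preceding "B-" opens a new span.
            start, label = i, tag[2:]
    return spans


def span_f1(gold_sents, pred_sents):
    """Micro-averaged precision, recall and F1 over exact span matches."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sents, pred_sents):
        gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "B-ORG"]]
print(span_f1(gold, pred))
```

In this toy example the person span matches exactly, while the second span has the right borders but the wrong label, so it counts as both a false positive and a false negative.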

Note that you do not have to implement your baseline from scratch; you can, for example, use the code from the RNN or BERT assignments as a starting point.

**Deadline: 20-03 on LearnIt (14:00)**

In [None]:
!uname --nodename

In [None]:
!pip install --upgrade pip
!pip install --upgrade torch

#### Helper functions

In [51]:
import torch


def read_iob2_file(path, sep="\t", word_index=1, tag_index=2):
    """
    read in conll file

    :param path: path to read from
    :returns: list with sequences of words and labels for each sentence
    """
    data = []
    current_words = []
    current_tags = []

    for line in open(path, encoding="utf-8"):
        line = line.strip()

        if line:
            if line[0] == "#":
                continue  # skip comments
            tok = line.split(sep)

            current_words.append(tok[word_index])
            current_tags.append(tok[tag_index])
        else:
            if current_words:  # skip empty lines
                data.append((current_words, current_tags))
            current_words = []
            current_tags = []

    # check for last one
    if current_tags != []:
        data.append((current_words, current_tags))
    return data


def read_cyner(path, sep="\t", word_index=1, tag_index=2):
    """
    read in conll file

    :param path: path to read from
    :returns: list with sequences of words and labels for each sentence
    """
    data = []
    current_words = []
    current_tags = []

    for line in open(path, encoding="utf-8"):
        line = line.strip()

        if line:
            if line[0] == "#":
                continue  # skip comments
            tok = line.split(sep)

            current_words.append(tok[word_index])
            tag = tok[tag_index]
            if "Vulnerability" in tag:
                current_tags.append("Vulnerability")
            else:
                current_tags.append("O")
        else:
            if current_words:  # skip empty lines
                data.append((current_words, current_tags))
            current_words = []
            current_tags = []

    # check for last one
    if current_tags != []:
        data.append((current_words, current_tags))
    return data


def read_aptner(path, sep=" ", word_index=0, tag_index=1):
    """
    read in conll file

    :param path: path to read from
    :returns: list with sequences of words and labels for each sentence
    """
    data = []
    current_words = []
    current_tags = []

    for line in open(path, encoding="utf-8"):
        line = line.strip()

        if line:
            if line[0] == "#":
                continue  # skip comments
            tok = line.split(sep)

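            current_words.append(tok[word_index])
            # Keep words and tags aligned: every token gets exactly one tag.
            tag = tok[tag_index] if len(tok) > tag_index else "O"
            if "B-VULNAME" in tag:
                current_tags.append("B-VULNAME")
            elif "I-VULNAME" in tag or "E-VULNAME" in tag:
                current_tags.append("I-VULNAME")
            else:
                # any other entity type (or a missing tag) is treated as outside
                current_tags.append("O")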
        else:
            if current_words:  # skip empty lines
                data.append((current_words, current_tags))
            current_words = []
            current_tags = []

    # check for last one
    if current_tags != []:
        data.append((current_words, current_tags))
    return data


import jsonlines


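def read_attacker(path, sep=" ", word_index=0, tag_index=1):
    """
    read in a JSON Lines file with one sentence per line

    :param path: path to read from (each object needs "tokens" and "tags")
    :param sep, word_index, tag_index: unused, kept so the signature matches the other readers
    :returns: list with sequences of words and labels for each sentence
    """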
    data = []

    with jsonlines.open(path) as reader:
        for obj in reader:
            tags = [tag if "VULNERABILITY" in tag else "O" for tag in obj["tags"]]
            tokens = obj["tokens"]
            data.append((tokens, tags))
    return data


class Vocab:
    def __init__(self, pad_unk="<PAD>"):
        """
        A convenience class that can help store a vocabulary
        and retrieve indices for inputs.
        """
        self.pad_unk = pad_unk
        self.word2idx = {self.pad_unk: 0}
        self.idx2word = [self.pad_unk]

    def getIdx(self, word, add=False):
        if word not in self.word2idx:
            if add:
                self.word2idx[word] = len(self.idx2word)
                self.idx2word.append(word)
            else:
                return self.word2idx[self.pad_unk]
        return self.word2idx[word]

    def getWord(self, idx):
        return self.idx2word[idx]


# Your implementation goes here:


class Preprocess:
    """
    data: the dataset from which we get the matrix used by a Neural network (instances + their tags)
    instances: number of instances in the dataset, needed for dimension of matrix
    features: the number of features/columns of the matrix
    """

    def __init__(self):
        self.vocab_words = Vocab()
        self.vocab_tags = Vocab()

    def build_vocab(self, data, instances, features):
        data_X = torch.zeros(instances, features, dtype=int)
        data_y = torch.zeros(instances, features, dtype=int)
        for i, sentence_tags in enumerate(data):
            for j, word in enumerate(sentence_tags[0]):
                data_X[i, j] = self.vocab_words.getIdx(word=word, add=True)
                data_y[i, j] = self.vocab_tags.getIdx(
                    word=sentence_tags[1][j], add=True
                )

        # returns the list of unique words in the list from the attributes of the Vocab()
        idx2word_train = self.vocab_words.idx2word
        # returns the list of unique tags in the list from the attributes of the Vocab()
        idx2label_train = self.vocab_tags.idx2word
        # only returned in the builder function, because they are reused for dev data in transform_prep_data()
        return data_X, data_y, idx2word_train, idx2label_train

    def transform_prep_data(self, data, instances, features):
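        # to be used on dev/test data: reuses the training vocabulary (add=False);
        # `features` must be at least the length of the longest sentence in `data`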
        data_X = torch.zeros(instances, features, dtype=int)
        data_y = torch.zeros(instances, features, dtype=int)
        for i, sentence_tags in enumerate(data):
            for j, word in enumerate(sentence_tags[0]):
                data_X[i, j] = self.vocab_words.getIdx(word=word, add=False)
                data_y[i, j] = self.vocab_tags.getIdx(
                    word=sentence_tags[1][j], add=False
                )
        return data_X, data_y


def prepare_output_file(
    transformer: Preprocess,
    data: list,
    pred_labels: torch.Tensor,
    input_file: str,
    output_file: str,
):
    global_labels = []
    for (_, placeholder), labels_idxs in zip(data, pred_labels):
        labels = []

        for i in range(len(placeholder)):
            labels.append(transformer.vocab_tags.idx2word[labels_idxs[i]])
        global_labels += labels

    with (
        open(output_file, mode="w", encoding="utf-8") as f_out,
        open(input_file, mode="r", encoding="utf-8") as f_in,
    ):
        i = 0
        for line in f_in.readlines():
            if line.strip():
                if line[0] == "#":
                    f_out.write(line)
                else:
                    words = line.split("\t")
                    words[2] = global_labels[i]
                    i += 1

                    new_line = "\t".join(words)
                    f_out.write(new_line)
            else:
                f_out.write("\n")
    assert i == len(global_labels)

### Cuda

In [24]:
print(torch.cuda.is_available())
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

False


In [25]:
print(torch.__version__)


2.6.0+cpu


### Load Data

In [32]:
train_data = read_iob2_file("./en_ewt-ud-train.iob2")
dev_data = read_iob2_file("./en_ewt-ud-dev.iob2")
test_data = read_iob2_file("./en_ewt-ud-test-masked.iob2")

In [33]:
train_data = read_iob2_file(
    "./data/APTNer/APTNERtrain.txt", sep=" ", word_index=0, tag_index=1
)
dev_data = read_iob2_file(
    "./data/APTNer/APTNERdev.txt", sep=" ", word_index=0, tag_index=1
)
test_data = read_iob2_file(
    "./data/APTNer/APTNERtest.txt", sep=" ", word_index=0, tag_index=1
)

In [41]:
train_data = read_attacker("./data/attackner/train.json")
dev_data = read_attacker("./data/attackner/dev.json")
test_data = read_attacker("./data/attackner/test.json")

In [54]:
train_data = read_cyner("./data/cyner/train.txt", word_index=0, tag_index=1)
dev_data = read_cyner("./data/cyner/valid.txt", word_index=0, tag_index=1)
test_data = read_cyner("./data/cyner/test.txt", word_index=0, tag_index=1)

### Transforms

In [55]:
transformer = Preprocess()
# Pad to the longest sentence across all splits; using only the training data
# raises an IndexError as soon as a dev or test sentence is longer.
max_len = max(len(x[0]) for x in train_data + dev_data + test_data)

train_X, train_y, idx2word, idx2label = transformer.build_vocab(
    train_data, len(train_data), max_len
)

dev_X, dev_y = transformer.transform_prep_data(dev_data, len(dev_data), max_len)

test_X, _ = transformer.transform_prep_data(test_data, len(test_data), max_len)
# The second return value does not hold true labels, since this is a test set.
# We only need to know the lengths of the sentences.
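
The transform step turns variable-length sentences into a fixed-width matrix of word ids, with index 0 doubling as padding and as the id for unknown words. The following is a minimal pure-Python sketch of that idea; the helpers `build_index` and `encode` are hypothetical stand-ins for the `Vocab` and `Preprocess` classes, not a replacement for them. It also illustrates that the width has to cover the longest sentence in every split, not only in the training data.

```python
PAD = "<PAD>"


def build_index(sentences):
    """Map each word to an integer id; id 0 is reserved for padding/unknown."""
    word2idx = {PAD: 0}
    for sentence in sentences:
        for word in sentence:
            word2idx.setdefault(word, len(word2idx))
    return word2idx


def encode(sentences, word2idx, width):
    """Turn sentences into equally long rows of ids, padded with 0."""
    rows = []
    for sentence in sentences:
        if len(sentence) > width:
            raise ValueError(
                f"sentence of length {len(sentence)} exceeds width {width}"
            )
        ids = [word2idx.get(word, 0) for word in sentence]  # unknown words -> 0
        rows.append(ids + [0] * (width - len(ids)))
    return rows


train = [["I", "like", "Copenhagen"], ["Hello"]]
dev = [["I", "like", "Aarhus", "too"]]

word2idx = build_index(train)
width = max(len(s) for s in train + dev)  # longest sentence over all splits

print(encode(train, word2idx, width))  # [[1, 2, 3, 0], [4, 0, 0, 0]]
print(encode(dev, word2idx, width))    # [[1, 2, 0, 0]]
```

With `width` taken from `train` alone (3 here), encoding the four-word dev sentence would fail, which is the same kind of failure as an out-of-bounds index in the padded tensor.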

In [None]:
!nvidia-smi

In [None]:
# move the data to the GPU up front if it fits in memory:
train_X, train_y = train_X.to(device), train_y.to(device)
dev_X, dev_y = dev_X.to(device), dev_y.to(device)
test_X = test_X.to(device)

### Batching

In [None]:
from torch.utils.data import DataLoader, TensorDataset

# TODO: Maybe dtype would need to be changed!
BATCH_SIZE = 32
train_dataset = TensorDataset(train_X, train_y)
train_loader = DataLoader(train_dataset, BATCH_SIZE)  # drop_last=True
n_batches = len(train_loader)

### Training

In [None]:
from torch import nn
import torch

torch.manual_seed(0)
DIM_EMBEDDING = 100
LSTM_HIDDEN = 100
BATCH_SIZE = 32
LEARNING_RATE = 0.01
EPOCHS = 15


class TaggerModel(torch.nn.Module):
    def __init__(self, nwords, ntags):
        super().__init__()
        # embedding -> dropout -> bidirectional LSTM -> dropout -> linear layer
        self.embed = nn.Embedding(nwords, DIM_EMBEDDING)
        self.drop1 = nn.Dropout(p=0.2)
        self.rnn = nn.LSTM(
            DIM_EMBEDDING, LSTM_HIDDEN, batch_first=True, bidirectional=True
        )
        self.drop2 = nn.Dropout(p=0.3)
        self.fc = nn.Linear(LSTM_HIDDEN * 2, ntags)

    def forward(self, input_data):
        word_vectors = self.embed(input_data)
        regular1 = self.drop1(word_vectors)
        output, hidden = self.rnn(regular1)
        regular2 = self.drop2(output)

        predictions = self.fc(regular2)
        return predictions


model = TaggerModel(len(idx2word), len(idx2label))
model = model.to(device)  # run on cuda if possible
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
loss_function = torch.nn.CrossEntropyLoss(ignore_index=0, reduction="sum")

# training loop

for epoch in range(EPOCHS):
    model.train()
    print(f"Epoch {epoch + 1}\n-------------------------------")
    loss_sum = 0

    # loop over batches
    # types for convenience
    batch_X: torch.Tensor
    batch_y: torch.Tensor
    for batch_X, batch_y in train_loader:
        # If you run into memory issues, remove the .to(device) calls in the
        # earlier cell and move each batch to the device here instead:
        # batch_X, batch_y = batch_X.to(device), batch_y.to(device)

        optimizer.zero_grad()

        predicted_values = model(batch_X)

        # CrossEntropyLoss expects predictions of shape (N, classes) and targets of shape (N,)

        # calculate loss
        loss = loss_function(
            predicted_values.view(batch_X.shape[0] * max_len, -1), batch_y.flatten()
        )  # batch_X.shape[0] also handles a smaller last batch
        loss_sum += loss.item()  # averaged over batches below

        # update
        loss.backward()
        optimizer.step()

    print(f"Average loss after epoch {epoch + 1}: {loss_sum / n_batches}")

# set to evaluation mode
model.eval()
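
The loss above relies on two details: the predictions are flattened to one row per token, and `ignore_index=0` skips padding positions, since index 0 is the `<PAD>` label in the tag vocabulary. Below is a small self-contained sketch on toy tensors (the shapes and values are made up for illustration) showing that only the non-padding tokens contribute to the summed loss.

```python
import math

import torch
from torch import nn

n_tags = 4  # index 0 is the padding label, 1..3 are real tags
# Toy logits of shape (batch=2, max_len=3, n_tags); all zeros means a uniform prediction.
logits = torch.zeros(2, 3, n_tags)
# Gold labels; the zeros are padding positions after the end of each sentence.
targets = torch.tensor([[1, 2, 0], [3, 0, 0]])

loss_fn = nn.CrossEntropyLoss(ignore_index=0, reduction="sum")
# Flatten to (batch * max_len, n_tags) and (batch * max_len,) as the loss expects.
loss = loss_fn(logits.view(-1, n_tags), targets.flatten())

n_real = (targets != 0).sum().item()  # number of non-padding tokens
# With uniform predictions each real token costs log(n_tags), padding costs nothing.
print(loss.item(), n_real * math.log(n_tags))
# Dividing by the number of real tokens gives a per-token loss.
print(loss.item() / n_real)
```

Because `reduction="sum"` is used, dividing the summed loss by the number of batches gives an average per batch; dividing by the number of real tokens, as in the last line, is one way to get a value that does not depend on batch size or sentence length.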

### Evaluate on dev

In [None]:
import gc

# eval using Span_F1
with torch.no_grad():  # no gradients are needed for inference; this saves memory
    predictions_dev = model(dev_X)
print(predictions_dev.shape)
# gives a score (logit) for each tag, for each word position, for each sentence:
# shape (sentences, max_len, tags)
# we assign each word the NER tag with the highest score
labels_dev = torch.argmax(predictions_dev, 2)
print(labels_dev.shape)
prepare_output_file(
    transformer, dev_data, labels_dev, "./en_ewt-ud-dev.iob2", "./dev.iob2"
)

del predictions_dev
del labels_dev
gc.collect()
torch.cuda.empty_cache()

!python span_f1.py en_ewt-ud-dev.iob2 dev.iob2

In [None]:
# # Eval using just accuracy.

# labels_dev = torch.flatten(labels_dev)  # model predictions
# dev_y_flat = torch.flatten(dev_y)  # true labels
# acc = []
# for i in range(len(labels_dev)):
#     if dev_y_flat[i] != 0:
#         acc.append(int(labels_dev[i] == dev_y_flat[i]))

# accuracy = sum(acc) / len(acc)
# print(f"Model accuracy on dev set: {accuracy}")

### Save test for submission

In [None]:
import gc

# Predict on the test data using the trained TaggerModel
with torch.no_grad():  # no gradients are needed for inference; this saves memory
    predictions_test = model(test_X)
print(predictions_test.shape)
# gives a score (logit) for each tag, for each word position, for each sentence:
# shape (sentences, max_len, tags)
# we assign each word the NER tag with the highest score
labels_test = torch.argmax(predictions_test, 2)
print(labels_test.shape)
# save labels
prepare_output_file(
    transformer, test_data, labels_test, "./en_ewt-ud-test-masked.iob2", "./test.iob2"
)

del predictions_test
del labels_test
gc.collect()
torch.cuda.empty_cache()

### 3. Project proposal

The written proposal should be at most one page in [ACL format](https://github.com/acl-org/acl-style-files) (the bibliography does not count towards the page limit). Here, you should explain the three points from the list in the project proposal presentation section above and place your project in a larger context (previous work).

Make sure your proposal is:
* Novel to some extent
* Doable within the time-frame

*Hint:* The [ACL Anthology](https://aclanthology.org/) contains almost all peer-reviewed NLP papers.

**Deadline: 03-04 on LearnIt (14:00)**

### 4. Final project
The final project report has a maximum length of 5 pages (excluding bibliography and appendix), using the [ACL style files](https://github.com/acl-org/acl-style-files).

Besides the main paper (discussed in class), you have to include:
* Group contributions. State who was responsible for which part of the project. Here you may state if there were any seriously unequal workloads among group members. This should be put in the appendix.
* A report on usage of chatbots. We follow: https://2023.aclweb.org/blog/ACL-2023-policy/
   * Add a section in the appendix if you made use of a chatbot (since we do not use a Responsible NLP Checklist)
   * Include each stage in the ACL policy, and indicate to what extent you used a chatbot
   * Use with care! You are responsible for the project, including plagiarism, correctness, etc.

You can also put additional results and details in the appendix. However, the paper itself should be standalone, and understandable without consulting the appendix.

Furthermore, the code should be available on www.github.itu.dk (with a link in a footnote at the end of the abstract). It should include a README with instructions on how to reproduce your results.

**Deadline: 23-05 on LearnIt** Please check the checklist below before uploading!

Optionally, you can upload a draft a week before, on **16-05 (before 09:00)**, for an extra round of feedback.

## Analysis

Analysis is essential for the interpretation of your results. In this section we will briefly describe some different types of analysis. We strongly suggest using at least one of these:

* **Ablation study**: Leave out a certain part of the model to study its effect. For example, disable the tokenizer, remove a certain feature (or group of features), or disable the stop-word removal. If the performance drops a lot, this part of the model contributes heavily to the model's final performance. This is commonly reported in one table that covers the different disabled parts of the model. Note that you can also do this the other way around, i.e., use only one feature (group) at a time and test the performance.
* **Learning curve**: Evaluate how much data your model needs to reach a certain performance. Especially for the data augmentation projects this is essential.
* **Quantitative analysis**: Automated means of analyzing in which cases your model performs worse. This can for example be done with a confusion matrix.
* **Qualitative analysis**: Manually inspect a certain number of errors, and try to categorize them or find trends. This can be combined with the quantitative analysis, e.g., inspect 100 cases of positive reviews predicted to be negative and 100 cases of negative reviews predicted to be positive.
* **Feature importance**: In traditional machine learning methods, one can often extract and inspect the weights of the features. In sklearn these can be found in: `trained_model.coef_`
* **Other metrics**: Per-class scores, partial matches, or a count of how often the span borders were correct but the label was wrong.
* **Input words importance**: To gain insight into which words have an impact on prediction performance (positive or negative), we can analyze per-word impact: given a trained model, replace a given word with the unknown-word token and observe the change in prediction score (probability for a class). This is shown in Figure 4 of [Rethmeier et al. (2018)](https://aclweb.org/anthology/W18-6246) (a paper on controversy detection), also shown below: red-colored tokens were important for controversy detection, while blue-colored tokens decreased prediction scores.

<img width=400px src=example.png>
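
The per-word procedure from the last bullet can be sketched as follows. This is only an illustration under stated assumptions: `toy_score` is a hypothetical stand-in for a trained model's class probability, and `"<UNK>"` is a placeholder for whatever unknown-word token your own vocabulary uses.

```python
def word_importance(tokens, score_fn, unk="<UNK>"):
    """Score drop per token when that token is replaced by the unknown-word token."""
    base = score_fn(tokens)
    importances = []
    for i in range(len(tokens)):
        perturbed = tokens[:i] + [unk] + tokens[i + 1:]
        # Positive value: the word supported the prediction; negative: it hurt it.
        importances.append(base - score_fn(perturbed))
    return importances


# Hypothetical stand-in for a trained model: the "probability" of the positive
# class grows with the number of cue words in the sentence.
CUES = {"great", "love"}


def toy_score(tokens):
    return min(1.0, 0.2 + 0.4 * sum(token in CUES for token in tokens))


sentence = ["I", "love", "this", "great", "film"]
for token, delta in zip(sentence, word_importance(sentence, toy_score)):
    print(f"{token:>6} {delta:+.2f}")
```

With a real model, `score_fn` would run a forward pass and return the probability of the class of interest; the resulting values can then be used to color the tokens as in the figure.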

Note that this is a non-exhaustive list, and you are encouraged to also explore additional analyses.
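
As one concrete instance of the quantitative analysis mentioned in the list above, token-level confusion counts can be collected with a few lines of standard-library Python. The function name `confusion_counts` and the toy tag sequences are hypothetical; in practice the gold and predicted tags would come from your dev files.

```python
from collections import Counter


def confusion_counts(gold_sents, pred_sents):
    """Count (gold_tag, predicted_tag) pairs over all tokens."""
    counts = Counter()
    for gold, pred in zip(gold_sents, pred_sents):
        counts.update(zip(gold, pred))
    return counts


gold = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG"]]
pred = [["B-PER", "O", "O", "B-ORG"], ["O", "B-ORG"]]
counts = confusion_counts(gold, pred)

# List the most frequent error types first (pairs where gold and prediction differ).
errors = {pair: n for pair, n in counts.items() if pair[0] != pair[1]}
for (gold_tag, pred_tag), n in sorted(errors.items(), key=lambda kv: -kv[1]):
    print(f"gold={gold_tag:<6} pred={pred_tag:<6} count={n}")
```

The most frequent off-diagonal pairs are a natural starting point for the manual, qualitative inspection.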

### Checklist final project
Please check all these items before handing in your final report. You only have to upload a pdf file on LearnIt; make sure a link to the code is included in the report and the code is accessible.

* Are all group members and their email addresses specified?
* Does the group report include a representative project title?
* Does the group report contain an abstract?
* Does the introduction clearly specify the research intention and research question?
* Does the group report adequately refer to the relevant literature?
* Does the group report properly use figures, tables and examples?
* Does the group report provide and discuss the empirical results?
* Is the group report proofread?
* Does the pdf contain the link to the project’s github repo?
* Is the github repo accessible to the public (within ITU)?
* Is the group report at most 5 pages long, excluding references and appendix?
* Are the group contributions added in the appendix?
* Does the repository contain all scripts and code to reproduce the results in the group report? Are instructions provided on how to run the code?
