## Project phase

### 1. Baseline
To get your project started, begin by implementing a baseline model. Ideally, this will be the main baseline that you compare against in your paper. Note that this baseline should be more advanced than just predicting O (the majority class).

We will use the following datasets for each of the tasks:
* EWT for Named Entity Recognition (NER): https://github.com/bplank/nested-ner , use `en_ewt_nn_train.conll` for training, and `en_ewt_nn_answers_test.conll` for testing (note that you can use the dev split for development). **You only have to take into account the 2nd column for the baseline, i.e. the task is not nested NER**
* CrossRE for Relation Extraction (RE): https://github.com/mainlp/CrossRE , use `music-train.json` for training, and `music-test.json` for testing (note that you can use the dev split for development).

Note that you do not have to implement your baseline from scratch; you can use, for example, https://github.itu.dk/robv/intro-nlp2023/blob/main/assignments/week3/viterbi-solution/viterbi_solution_cleaner.py , https://github.itu.dk/robv/intro-nlp2023/blob/main/assignments/week5/rnn.py or https://github.itu.dk/robv/intro-nlp2023/blob/main/assignments/week6/bert/bert-topic.py as a starting point.

It is important to upload your predictions in exactly the same format as the original datasets. For evaluation, we will use the scripts at: https://github.itu.dk/robv/intro-nlp2023/tree/main/assignments/project/span_f1.py and https://github.com/mainlp/CrossRE/blob/main/evaluate.py
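
To get a feel for what a span-level score measures before uploading, the sketch below computes precision, recall and F1 over entity spans. It is not the course's `span_f1.py` and may differ from it in details. It assumes BIO-style tags (`B-X`, `I-X`, `O`), and the function names `bio_to_spans` and `span_f1` are illustrative.

```python
def bio_to_spans(tags):
    """Collect (start, end, label) entity spans from one sentence of BIO tags."""
    spans = set()
    start, label = None, None
    # The extra "O" is a sentinel that closes a span ending at the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        # Assumption: an I- tag without a matching open span starts a new span.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


def span_f1(gold_sents, pred_sents):
    """Return (precision, recall, f1) over exactly matching labeled spans."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sents, pred_sents):
        gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
        tp += len(gold_spans & pred_spans)
        fp += len(pred_spans - gold_spans)
        fn += len(gold_spans - pred_spans)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "O", "O", "B-LOC"]]
print(span_f1(gold, pred))
```

In this example the predicted PER span is too short, so it counts as both a false positive and a false negative. Only the LOC span is a true positive.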

**Deadline: 27-03 on LearnIt (14:00)**

### 2. Project proposal presentation

In the presentation, you have 2 minutes to present your baseline, and 5 minutes to present your research proposal. During the presentation, you should explain:
* What was your baseline model (architecture, design decisions etc.)
* What is the topic of your project, what is the current state of this topic/task/setup
* What is the new part of your project
* What is the research question of your project

We have proposed a number of topics in the [slides](https://github.itu.dk/robv/intro-nlp2023/tree/main/slides/14-project.pdf); you can either pick one of these or come up with your own. If you pick your own, we suggest getting pre-approval from Rob van der Goot before you start writing the full-fledged proposal.

**Deadline: 28-03 on LearnIt (14:00)**  (pdf only, they will be put into one long pdf for a smooth presentation)

### 3. Project proposal

The written proposal should be at most one page in ACL format (the bibliography does not count towards the page limit). In it, you should explain the last three points from the list above and place your project in a larger context (previous work).

Make sure your proposal is:
* Novel to some extent
* Doable within the time-frame

The ACL style files can be found on: [https://github.com/acl-org/acl-style-files](https://github.com/acl-org/acl-style-files).

**Deadline: 12-04 on LearnIt (14:00)**

### 4. Final project
The final project has a maximum size of 5 pages (excluding bibliography and appendix). 

Besides the main paper (discussed in class), you have to include:
* Group contributions. State who was responsible for which part of the project. Here you may state if there were any serious unequal workloads among group members. This should be put in the appendix.
* A report on usage of chatbots. We follow: https://2023.aclweb.org/blog/ACL-2023-policy/
   * Add a section in appendix if you made use of a chatbot (since we do not use a Responsible NLP Checklist)
   * Include each stage of the ACL policy, and indicate to what extent you used a chatbot
   * Use with care: you remain responsible for the project, including plagiarism, correctness, etc.

You can also put additional results and details in the appendix. However, the paper itself should be standalone and understandable without consulting the appendix.

Furthermore, the code should be available on www.github.itu.dk (with a link in a footnote at the end of the abstract), and it should include a README with instructions on how to reproduce your results.

**Deadline: 26-05 on LearnIt (14:00)** Please check the checklist below before uploading!

Optionally, you can upload a draft a week earlier, by **19-05 (at 09:00)**, for an extra round of feedback.

## Analysis

Analysis is essential for the interpretation of your results. In this section we briefly describe some different types of analysis. We strongly suggest using at least one of these:

* **Ablation study**: Leave out a certain part of the model to study its effect. For example, disable the tokenizer, remove a certain (group of) feature(s), or disable the stop-word removal. If the performance drops a lot, this part of the model contributes heavily to the model's final performance. This is commonly presented in a single table that lists the results with different parts of the model disabled. Note that you can also do this the other way around, i.e. use only one feature (group) at a time and test the performance.
* **Learning curve**: Evaluate how much data your model needs to reach a certain performance. Especially for the data augmentation projects this is essential.
* **Quantitative analysis**: Automated means of analyzing in which cases your model performs worse. This can for example be done with a confusion matrix (like in [week2](https://github.itu.dk/bapl/2ndyearproject-2021-material/blob/master/assignments/week2/week2.ipynb)).
* **Qualitative analysis**: Manually inspect a certain number of errors, and try to categorize them or find trends. This can be combined with the quantitative analysis, e.g. inspect 100 cases of positive reviews predicted to be negative and 100 cases of negative reviews predicted to be positive.
* **Feature importance**: In traditional machine learning methods, one can often extract and inspect the weights of the features. In sklearn these can be found, for linear models such as logistic regression, in: `trained_model.coef_`
* **Input words importance**: To gain insight into which words have an impact on prediction performance (positive or negative), we can analyze per-word impact: given a trained model, replace a given word with the unknown word token and observe the change in prediction score (probability for a class). This is shown in Figure 4 of [Rethmeier et al (2018)](https://aclweb.org/anthology/W18-6246) (a paper on controversy detection), also shown below: red-colored tokens were important for controversy detection, blue-colored tokens decreased prediction scores.

<img width=400px src=example.png>
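
The occlusion idea behind the figure above can be sketched in a few lines. This is a minimal illustration, not the method of the cited paper: `toy_score` is a hypothetical stand-in for your trained model's class probability, and `<UNK>` is an assumed name for the unknown-word token.

```python
def word_importance(tokens, score_fn, unk_token="<UNK>"):
    """Return (token, score drop) pairs from replacing one word at a time."""
    base = score_fn(tokens)
    importances = []
    for i, token in enumerate(tokens):
        occluded = tokens[:i] + [unk_token] + tokens[i + 1:]
        # Positive value: the word pushed the score up; negative: it pushed it down.
        importances.append((token, base - score_fn(occluded)))
    return importances


def toy_score(tokens):
    """Hypothetical scorer; replace with your model's probability for a class."""
    weights = {"great": 0.4, "boring": -0.3}
    return 0.5 + sum(weights.get(t, 0.0) for t in tokens)


for token, delta in word_importance(["a", "great", "but", "boring", "film"], toy_score):
    print(f"{token}\t{delta:+.2f}")
```

With a real model, `score_fn` would map the tokens to indices, run the model in evaluation mode, and return the probability of the class of interest.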

Note that this is a non-exhaustive list, and you are encouraged to also explore additional analyses.
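
As one concrete instance of the quantitative analysis in the list above, the sketch below counts gold/predicted tag pairs at the token level and lists the most frequent confusions. The function names and the toy tag sequences are illustrative only. For NER you may prefer to count confusions over spans rather than tokens.

```python
from collections import Counter


def confusion_counts(gold_sents, pred_sents):
    """Count how often each gold tag is predicted as each tag (token level)."""
    counts = Counter()
    for gold, pred in zip(gold_sents, pred_sents):
        for gold_tag, pred_tag in zip(gold, pred):
            counts[(gold_tag, pred_tag)] += 1
    return counts


def most_common_errors(counts, n=5):
    """Return the n most frequent (gold, predicted) pairs that disagree."""
    errors = [(pair, c) for pair, c in counts.items() if pair[0] != pair[1]]
    return sorted(errors, key=lambda item: -item[1])[:n]


gold = [["B-PER", "O", "B-LOC"], ["O", "B-LOC"]]
pred = [["B-PER", "O", "O"], ["O", "O"]]
counts = confusion_counts(gold, pred)
for (gold_tag, pred_tag), count in most_common_errors(counts):
    print(f"gold={gold_tag}\tpredicted={pred_tag}\t{count}")
```

The most frequent error pairs are a good starting point for picking examples to inspect manually in a qualitative analysis.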

### Checklist final project
Please check all these items before handing in your final report. You only have to upload a pdf file on LearnIt; make sure that a link to the code is included in the report and that the code is accessible.

* Are all group members and their email addresses specified?
* Does the group report include a representative project title?
* Does the group report contain an abstract?
* Does the introduction clearly specify the research intention and research question?
* Does the group report adequately refer to the relevant literature?
* Does the group report properly use figures, tables and examples?
* Does the group report provide and discuss the empirical results?
* Is the group report proofread?
* Does the pdf contain the link to the project’s github repo?
* Is the github repo accessible to the public (within ITU)?
* Is the group report maximum 5 pages long, excluding references and appendix?
* Are the group contributions added in the appendix?
* Does the repository contain all scripts and code to reproduce the results in the group report? Are instructions provided on how to run the code?


# based on: https://jkk.name/neural-tagger-tutorial/
import codecs
import sys

import matplotlib.pyplot as plt
import seaborn as sns
import torch
from torch import nn

torch.manual_seed(0)
PAD = "PAD"
DIM_EMBEDDING = 100
LSTM_HIDDEN = 50
BATCH_SIZE = 32
LEARNING_RATE = 0.01
EPOCHS = 20

def read_data(file_name):
    """
    read in conll file
    
    :param file_name: path to read from
    :returns: list with sequences of words and labels for each sentence
    """
    data = []
    current_words = []
    current_tags = []

    for line in codecs.open(file_name, encoding='utf-8'):
        line = line.strip()

        if line:
            tok = line.split('\t')
            word = tok[0]
            tag = tok[1]

            current_words.append(word)
            current_tags.append(tag)

        else:
            if current_words:  # skip empty lines
                data.append((current_words, current_tags))
            current_words = []
            current_tags = []

    # check for last one
    if current_tags != []:
        data.append((current_words, current_tags))
    return data

train_data=read_data(sys.argv[1])


# Create vocabularies for both the tokens and the tags
id_to_token = [PAD]
token_to_id = {PAD: 0}
id_to_tag = [PAD]
tag_to_id = {PAD: 0}

for tokens, tags in train_data:
    for token in tokens:
        if token not in token_to_id:
            token_to_id[token] = len(token_to_id)
            id_to_token.append(token)
    for tag in tags:
        if tag not in tag_to_id:
            tag_to_id[tag] = len(tag_to_id)
            id_to_tag.append(tag)

NWORDS = len(token_to_id)
NTAGS = len(tag_to_id)

max_len=max([len(x[0]) for x in train_data])

# convert text data with labels to indices
def data2feats(inputData, word2idx, label2idx):
    feats = torch.zeros((len(inputData), max_len), dtype=torch.long)
    labels = torch.zeros((len(inputData), max_len), dtype=torch.long)

    for sentPos, sent in enumerate(inputData):
        for wordPos, word in enumerate(sent[0][:max_len]):
            wordIdx = word2idx[PAD] if word not in word2idx else word2idx[word]
            feats[sentPos][wordPos] = wordIdx

        for labelPos, label in enumerate(sent[1][:max_len]):
            # Fall back to the PAD index of the label vocabulary (not the word vocabulary)
            labelIdx = label2idx.get(label, label2idx[PAD])
            labels[sentPos][labelPos] = labelIdx

    return feats, labels

train_feats, train_labels = data2feats(train_data, token_to_id, tag_to_id)

# convert to batches (sentences beyond the last full batch are dropped)
num_batches = len(train_feats) // BATCH_SIZE
train_feats_batches = train_feats[:BATCH_SIZE*num_batches].view(num_batches, BATCH_SIZE, max_len)
train_labels_batches = train_labels[:BATCH_SIZE*num_batches].view(num_batches, BATCH_SIZE, max_len)


class TaggerModel(torch.nn.Module):
    def __init__(self, nwords, ntags):
        super().__init__()

        # Create word embeddings
        self.word_embedding = nn.Embedding(nwords, DIM_EMBEDDING)
        # Create input dropout parameter
        self.word_dropout = torch.nn.Dropout(.2)
        # Create the recurrent layer (a plain RNN, despite the LSTM_HIDDEN name)
        self.rnn = torch.nn.RNN(DIM_EMBEDDING, LSTM_HIDDEN, num_layers=1,
                batch_first=True, bidirectional=False)
        # Create output dropout parameter
        self.rnn_output_dropout = torch.nn.Dropout(.3)
        # Create final matrix multiply parameters
        self.hidden_to_tag = torch.nn.Linear(LSTM_HIDDEN, ntags)

    def forward(self, sentences):
        # Look up word vectors
        word_vectors = self.word_embedding(sentences)
        # Apply dropout
        dropped_word_vectors = self.word_dropout(word_vectors)
        rnn_out, _ = self.rnn(dropped_word_vectors, None)
        # Apply dropout
        rnn_out_dropped = self.rnn_output_dropout(rnn_out)
        # Matrix multiply to get scores for each tag
        output_scores = self.hidden_to_tag(rnn_out_dropped)

        # Return the per-token tag scores; the caller computes loss and predictions
        return output_scores


model = TaggerModel(NWORDS, NTAGS)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
loss_function = torch.nn.CrossEntropyLoss(ignore_index = 0, reduction = 'sum')

loss_history = []  # summed training loss per epoch, used for the plot below
for epoch in range(EPOCHS):
    model.train()
    model.zero_grad()

    # Loop over batches
    loss = 0
    match = 0
    total = 0
    for batchIdx in range(0, num_batches):
        # Prepare inputs
        input_array = train_feats_batches[batchIdx]
        output_array = train_labels_batches[batchIdx]

        # Construct computation (a single forward pass per batch)
        output_scores = model(input_array)
        # Calculate loss
        output_scores = output_scores.view(BATCH_SIZE * max_len, -1)
        flat_labels = output_array.view(BATCH_SIZE * max_len)
        batch_loss = loss_function(output_scores, flat_labels)

        # Predictions from the same scores, used only for training accuracy
        predicted_tags = torch.argmax(output_scores, 1)
        predicted_tags = predicted_tags.view(BATCH_SIZE, max_len)

        # Run computations
        batch_loss.backward()
        optimizer.step()
        model.zero_grad()
        loss += batch_loss.item()
        # Update the number of correct tags and total tags (padding excluded)
        for goldSent, predSent in zip(output_array, predicted_tags):
            for goldLabel, predLabel in zip(goldSent, predSent):
                if goldLabel != 0:
                    total += 1
                    if goldLabel == predLabel:
                        match += 1
    print(epoch, loss, match / total)
    loss_history.append(loss)

# Show the loss history on the training data (plotted once, after training)
fig, ax = plt.subplots()
sns.lineplot(x=list(range(EPOCHS)), y=loss_history, ax=ax)
ax.set_xlabel("Epoch")
ax.set_ylabel("Loss (Cross entropy)")
ax.set_title("RNN loss evolution throughout epochs (in training data)")
