## Project NLP and Deep Learning

### 1. Project proposal presentation

In the presentation, you have 5 minutes to present your research proposal. During the presentation, you should explain:
* What was your baseline model (architecture, design decisions etc.)
* What is the topic of your project, what is the current state of this topic/task/setup
* What is the new part of your project
* What is the research question of your project

We have proposed a number of topics in the slides, which can be found on LearnIt; you can either pick one of these or come up with your own. If you choose your own topic, we suggest getting pre-approval from Rob van der Goot.

**Deadline for uploading slides: 12-03 on LearnIt (14:00)**  (pdf only, they will be put into one long pdf for a smooth presentation)

**Presentations: 13-03 from 08:00-12:00**, we will split the class in half for the lecture hours (08:00-10:00) and the lab hours (10:00-12:00)


### 2. Baseline
To get your project started, begin by implementing a baseline model. Ideally, this is going to be the main baseline that you compare to in your paper. Note that this baseline should be more advanced than just predicting the majority class (O).
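To see why the majority class alone is too weak a reference point, the sketch below computes the token accuracy of always predicting the most frequent tag. The function name `majority_baseline` and the toy tag lists are made up for illustration and are not taken from the EWT data.

```python
from collections import Counter
from typing import List, Tuple


def majority_baseline(train_tags: List[str], eval_tags: List[str]) -> Tuple[str, float]:
    """Return the most frequent training tag and its token accuracy on eval_tags."""
    majority_tag = Counter(train_tags).most_common(1)[0][0]
    correct = sum(tag == majority_tag for tag in eval_tags)
    return majority_tag, correct / len(eval_tags)


# Made-up tags for illustration only
toy_train = ["O", "O", "O", "B-PER", "O", "O", "B-LOC", "I-LOC", "O", "O"]
toy_eval = ["O", "B-ORG", "O", "O", "O"]

tag, acc = majority_baseline(toy_train, toy_eval)
print(f"majority tag: {tag}, token accuracy: {acc:.2f}")
```

Because most tokens are tagged `O`, such a baseline can reach a high token accuracy without finding a single entity, which is one reason to also report a span-level score.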

We will use the EWT portion of the [Universal NER project](http://www.universalner.org/), which we provide with this notebook for convenience. You can use the train data (`en_ewt-ud-train.iob2`) and dev data (`en_ewt-ud-dev.iob2`) to build your baseline, then upload your predictions on the test data (`en_ewt-ud-test.iob2`).

It is important to upload your predictions in the same format as the training and dev files, so that the `span_f1.py` script can be used.
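As a rough illustration of what span-level scoring involves, the sketch below extracts entity spans from IOB2 tags and computes an exact-match F1. It is not the provided `span_f1.py`, whose internals are not shown here; the names `iob2_to_spans` and `span_f1` are hypothetical, and treating a stray `I-` tag as the start of a new span is an assumption of this sketch.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)


def iob2_to_spans(tags: List[str]) -> Set[Span]:
    """Collect the entity spans of one sentence from its IOB2 tags."""
    spans: Set[Span] = set()
    start, label = None, None
    for idx, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        # A span continues only when an I- tag matches the open span's type
        if tag.startswith("I-") and label is not None and tag[2:] == label:
            continue
        if label is not None:
            spans.add((start, idx, label))
            start, label = None, None
        if tag.startswith("B-") or tag.startswith("I-"):
            # Assumption: a stray I- tag (no matching B-) opens a new span
            start, label = idx, tag[2:]
    return spans


def span_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Exact-match span F1 over parallel lists of tag sequences."""
    tp = n_gold = n_pred = 0
    for gold_tags, pred_tags in zip(gold, pred):
        gold_spans = iob2_to_spans(gold_tags)
        pred_spans = iob2_to_spans(pred_tags)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Made-up example: one of two gold spans is found, nothing spurious is predicted
toy_gold = [["B-PER", "I-PER", "O", "B-LOC"]]
toy_pred = [["B-PER", "I-PER", "O", "O"]]
toy_f1 = span_f1(toy_gold, toy_pred)
print(f"span F1: {toy_f1:.4f}")
```

A scorer of this kind needs the sentence boundaries and tag columns to line up exactly with the gold file, which is why the output format matters.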

Note that you do not have to implement your baseline from scratch; you can, for example, use the code from the RNN or BERT assignments as a starting point.

**Deadline: 20-03 on LearnIt (11:59)**

In [2]:
# Imports

import torch
import myutils
from typing import List
from transformers import AutoModel, AutoTokenizer

### 1. Create a Function to Read the Universal NER Data:

In [3]:
# Function to read NER data

def read_universal_NER(file_path):
    with open(file_path, 'r', encoding = 'utf-8') as infile:
        # Split into lines
        lines = infile.readlines()

        # Define lists to store data 
        sentences = []
        labels = []
        current_sentence = []
        current_labels = []

        # Iterate over lines
        for line in lines:

            line = line.strip() # Remove whitespace
            if not line: # Skip empty lines
                continue

            # Check if line starts with sentence ID
            if line.startswith('# sent_id'):
                if current_sentence:
                    sentences.append(current_sentence)
                    labels.append(current_labels)
                current_sentence = []
                current_labels = []

            # Check for token lines
            elif not line.startswith("#"):
                parts = line.strip().split('\t')
                current_sentence.append(parts[1])
                current_labels.append(parts[2])

        if current_sentence:
            sentences.append(current_sentence)
            labels.append(current_labels)

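    # Flatten lists (one entry per token); a comprehension avoids the
    # quadratic cost of repeatedly concatenating lists with sum()
    sentences = [token for sentence in sentences for token in sentence]
    labels = [label for sentence_labels in labels for label in sentence_labels]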

    return sentences, labels
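The reader above returns one flat list of tokens, so sentence boundaries are lost. If you later need them (for example, to write predictions back sentence by sentence), a variant along the following lines keeps one `(tokens, tags)` pair per sentence. This is only a sketch: `read_iob2_sentences` is a hypothetical name, the sample text is made up, and the column layout (token in the second tab-separated column, tag in the third) is the same assumption the reader above makes.

```python
import os
import tempfile
from typing import List, Tuple

# Made-up sample in the layout assumed above:
# tab-separated columns with the token in column 2 and the tag in column 3.
SAMPLE = (
    "# sent_id = demo-1\n"
    "# text = Anna visited Paris\n"
    "1\tAnna\tB-PER\n"
    "2\tvisited\tO\n"
    "3\tParis\tB-LOC\n"
    "\n"
    "# sent_id = demo-2\n"
    "# text = Nice\n"
    "1\tNice\tO\n"
)


def read_iob2_sentences(path: str) -> List[Tuple[List[str], List[str]]]:
    """Return one (tokens, tags) pair per sentence instead of a flat list."""
    sentences = []
    tokens, tags = [], []
    with open(path, encoding="utf-8") as infile:
        for line in infile:
            line = line.rstrip("\n")
            if not line.strip():  # a blank line ends the current sentence
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
            elif not line.startswith("#"):  # skip comment lines
                parts = line.split("\t")
                tokens.append(parts[1])
                tags.append(parts[2])
    if tokens:  # the file may not end with a blank line
        sentences.append((tokens, tags))
    return sentences


# Write the sample to a temporary file, parse it, and clean up
with tempfile.NamedTemporaryFile("w", suffix=".iob2", delete=False, encoding="utf-8") as tmp:
    tmp.write(SAMPLE)
demo_sentences = read_iob2_sentences(tmp.name)
os.remove(tmp.name)

for demo_tokens, demo_tags in demo_sentences:
    print(list(zip(demo_tokens, demo_tags)))
```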

### 2. Define the BERT Model:

In [80]:
### EXISTING BERT MODEL FROM EX5

class ClassModel(torch.nn.Module):
    def __init__(self, nlabels: int, mlm: str):
        """
        Model for classification with transformers.

        The architecture of this model is simple: we just have a transformer-
        based language model, and add one linear layer to convert its output
        to our prediction.
    
        Parameters
        ----------
        nlabels : int
            Vocabulary size of output space (i.e. number of labels)
        mlm : str
            Name of the transformers language model to use, can be found on:
            https://huggingface.co/models
        """
        super().__init__()

        # The transformer model to use
        self.mlm = AutoModel.from_pretrained(mlm)

        # Find the size of the output of the masked language model
        if hasattr(self.mlm.config, 'hidden_size'):
            self.mlm_out_size = self.mlm.config.hidden_size
        elif hasattr(self.mlm.config, 'dim'):
            self.mlm_out_size = self.mlm.config.dim
        else: # if not found, guess
            self.mlm_out_size = 768

        # Create prediction layer
        self.hidden_to_label = torch.nn.Linear(self.mlm_out_size, nlabels)

    def forward(self, input: torch.tensor):
        """
        Forward pass
    
        Parameters
        ----------
        input : torch.tensor
            Tensor with wordpiece indices. shape=(batch_size, max_sent_len).

        Returns
        -------
        output_scores : torch.tensor
            Scores for each label. shape=(batch_size, nlabels).
        """
        # Run transformer model on input
        mlm_out = self.mlm(input)
        # Keep only the last layer: shape=(batch_size, max_len, DIM_EMBEDDING)
        mlm_out = mlm_out.last_hidden_state
        # Keep only the output for the first ([CLS]) token: shape=(batch_size, DIM_EMBEDDING)
        # Index the token dimension directly; squeeze() would also drop a batch dimension of size 1
        mlm_out = mlm_out[:, 0, :]

        # Matrix multiply to get scores for each label: shape=(batch_size, nlabels)
        output_scores = self.hidden_to_label(mlm_out)

        return output_scores
    
    def run_eval(self, text_batched: List[torch.tensor], labels_batched: List[torch.tensor]):
        """
        Run evaluation: predict and score
    
        Parameters
        ----------
        text_batched : List[torch.tensor]
            list with batches of text, containing wordpiece indices.
        labels_batched : List[torch.tensor]
            list with batches of labels (converted to ints).
    
        Returns
        -------
        score : float
            accuracy of the model on labels_batched given text_batched
        predictions : list
            list of predicted labels
        """

        # Set model to evaluation mode
        self.eval()

        # Store correct and total predictions
        correct = 0
        total = 0

        # Create empty list to store predictions
        predictions = []

        # Testing
        printed = False
        
        # Iterate over test data
        for sents, labels in zip(text_batched, labels_batched):

            # Perform forward pass (no gradients are needed for evaluation)
            with torch.no_grad():
                output_scores = self.forward(sents)

            # # Testing
            # if not printed:
            #     print(f'output_scores shape: {output_scores.shape}')
            #     print(f'output_scores (first 10 instances of first batch):\n {output_scores[:10]}')
            #     printed = True

            # Get prediction labels
            pred_labels = torch.argmax(output_scores, 1)

            # Convert predictions back to tags and append to list
            predictions.append(pred_labels)

            for gold_label, pred_label in zip(labels, pred_labels):
                total += 1
                if gold_label.item() == pred_label.item():
                    correct+= 1

        correct_freq = correct / total

        return correct_freq, predictions

### 3. Define a Function to Train the Model:

In [81]:
def train_ClassModel(train_file_path: str, 
                     dev_file_path: str,
                     MLM: str,
                     UNK: str,
                     lr: float,
                     batch_size: int,
                     device: str,
                     n_epochs: int,
                     max_train_sents = None,
                     return_model = False,
                     return_mappings = False
                     ):

    # Read data
    print('reading data...')
    train_sents, train_labels = read_universal_NER(train_file_path)
    dev_sents, dev_labels = read_universal_NER(dev_file_path)
    
    # Slice train data if max_train_sents is passed
    if max_train_sents is not None:
        train_sents = train_sents[:max_train_sents]
        train_labels = train_labels[:max_train_sents]

    id2label, label2id = myutils.labels2lookup(train_labels, UNK)
    n_labels = len(id2label)

    # Transform labels to numerical
    train_labels = [label2id.get(label, label2id[UNK]) for label in train_labels]
    dev_labels = [label2id.get(label, label2id[UNK]) for label in dev_labels]
    
    # Tokenize
    print('tokenizing...')
    tokzr = AutoTokenizer.from_pretrained(MLM)
    train_tokked = myutils.tok(train_sents, tokzr)
    dev_tokked = myutils.tok(dev_sents, tokzr)
    PAD = tokzr.pad_token_id
    
    # Convert to batches
    print('converting to batches...')
    train_sents_batched, train_labels_batched = myutils.to_batch(train_tokked, train_labels, batch_size, PAD, device)
    dev_sents_batched, dev_labels_batched = myutils.to_batch(dev_tokked, dev_labels, batch_size, PAD, device)
    
    # Create instance of model
    print('initializing model...')
    model = ClassModel(n_labels, MLM)
    model.to(device) # Move to device

    # Define optimizer and criterion
    optimizer = torch.optim.Adam(model.parameters(), lr = lr)
    criterion = torch.nn.CrossEntropyLoss(ignore_index = 0, reduction = 'sum')
    
    print('training...')
    for epoch in range(n_epochs):
        print('=====================')
        print(f'starting epoch {epoch + 1}/{n_epochs}')

        # Set model to training mode
        model.train() 
    
        # Keep total epoch loss
        loss = .0

        # Loop over batches
        for batch_idx in range(0, len(train_sents_batched)):

            if batch_idx % 100 == 0:
                print(f'running for batch {batch_idx}/{len(train_sents_batched)}')

            # Set gradients to zero
            optimizer.zero_grad()

            # Perform forward pass
            output_scores = model.forward(train_sents_batched[batch_idx])

            # Compute loss for batch and add to total loss
            batch_loss = criterion(output_scores, train_labels_batched[batch_idx])
            loss += batch_loss.item()
    
            # Perform backward pass
            batch_loss.backward()
            optimizer.step()
    
        # Compute dev accuracy
        dev_score, dev_preds = model.run_eval(dev_sents_batched, dev_labels_batched)

        # Print statistics
        print(f'Training Loss: {loss:.4f}')
        print(f'Dev Accuracy: {dev_score:.4f}')

        # # Testing
        # dev_preds_unique = torch.unique(torch.cat(dev_preds))
        # print('Dev Predictions:')
        # print(dev_preds_unique.tolist())

    # Return logic, not very nice code but it works for now
    returns = []
    if return_model:
        returns.append(model)
    if return_mappings:
        returns.append(label2id)
        returns.append(id2label)
    if returns:
        return tuple(returns)

### 4. Train the Model:

In [83]:
# Set seed
torch.manual_seed(42)

# We get the model and label mapping from the training
# Strange way to do it but whatever
model, _, id2label = train_ClassModel(train_file_path = 'en_ewt-ud-train.iob2',
                 dev_file_path = 'en_ewt-ud-dev.iob2',
                 MLM = 'distilbert-base-cased',
                 UNK = '[UNK]',
                 batch_size = 32,
                 lr = 0.0001,
                 device = 'cuda' if torch.cuda.is_available() else 'cpu',
                 n_epochs = 5,
                 max_train_sents = None,
                 return_model = True,
                 return_mappings = True
                 )

reading data...
tokenizing...
converting to batches...
initializing model...
training...
starting epoch 1/5
running for batch 0/6393
running for batch 100/6393
running for batch 200/6393
running for batch 300/6393
running for batch 400/6393
running for batch 500/6393
running for batch 600/6393
running for batch 700/6393
running for batch 800/6393
running for batch 900/6393
running for batch 1000/6393
running for batch 1100/6393
running for batch 1200/6393
running for batch 1300/6393
running for batch 1400/6393
running for batch 1500/6393
running for batch 1600/6393
running for batch 1700/6393
running for batch 1800/6393
running for batch 1900/6393
running for batch 2000/6393
running for batch 2100/6393
running for batch 2200/6393
running for batch 2300/6393
running for batch 2400/6393
running for batch 2500/6393
running for batch 2600/6393
running for batch 2700/6393
running for batch 2800/6393
running for batch 2900/6393
running for batch 3000/6393
running for batch 3100/6393
running 

### 5. Make Predictions on Test Data:

In [84]:
# Redefine variables
tokzr = AutoTokenizer.from_pretrained('distilbert-base-cased')
UNK = "[UNK]"
batch_size = 32
device = "cuda" if torch.cuda.is_available() else "cpu"
PAD = tokzr.pad_token_id

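# Load the test data; the train labels are needed to rebuild the label mapping
test_sents, test_labels = read_universal_NER('en_ewt-ud-test.iob2')
train_sents, train_labels = read_universal_NER('en_ewt-ud-train.iob2')
# Build the lookup from the training labels (not the test labels), so that the ids
# match those the model was trained with (assumes labels2lookup is deterministic)
id2label, label2id = myutils.labels2lookup(train_labels, UNK)
test_labels = [label2id.get(label, label2id[UNK]) for label in test_labels]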

# Tokenize testing data
test_tokked = myutils.tok(test_sents, tokzr)

# Convert testing data to batches
test_text_batched, test_labels_batched = myutils.to_batch(test_tokked, test_labels, batch_size, PAD, device)

# Evaluate the model on testing data
print('evaluating on test data...')
test_score, test_predictions = model.run_eval(test_text_batched, test_labels_batched)
print(f'Accuracy on test data: {test_score:.4f}')

evaluating on test data...
Accuracy on test data: 0.9304


### 6. Format Test Predictions:

In [85]:
### Note that because of batching, a small number of instances at the end
### of the test file have not received predictions.
### We can fix this issue later, but for now we will just ignore those last instances

# temp_mapping = {0: '[UNK]', 1: 'O', 2: 'B-LOC', 3: 'I-LOC'}

# Flatten batched predictions to non-batched list
test_predictions_flat = [pred.item() for batch in test_predictions for pred in batch]

# Convert to NER tags
NER_predictions = [id2label[pred] for pred in test_predictions_flat]

def insert_preds_to_file(masked_test_file_path, test_predictions_file_path, predictions):

    # Open masked test file
    with open(masked_test_file_path, 'r', encoding = 'utf-8') as infile:
        lines = infile.readlines()

    # Collect output lines and track the index of the next prediction
    output_lines = []
    pred_idx = 0

    # Iterate through each line in masked file and make copy to new file
    for line in lines:

        # Non-token lines
        if line.startswith('#'):
            output_lines.append(line)

        # Token lines
        elif line.strip():

            # Split line into parts and insert prediction into third column
            parts = line.strip().split('\t')
            parts[2] = predictions[pred_idx]

            # Move to next prediction
            pred_idx += 1
            output_lines.append('\t'.join(parts) + '\n')

            # Stop when we run out of predictions
            if pred_idx >= len(predictions):
                break
        
        # Empty separator lines
        else:
            output_lines.append('\n')

    # Write output to new file
    with open(test_predictions_file_path, 'w', encoding = 'utf-8') as outfile:
        outfile.writelines(output_lines)

# NER_predictions = convert_to_NER_tags(test_predictions, temp_mapping)
insert_preds_to_file('en_ewt-ud-test-masked.iob2', 'test_predictions.iob2', NER_predictions)
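The note at the top of this cell attributes the missing predictions to batching. How `myutils.to_batch` forms its batches is not shown here, so the following is only a generic sketch of one way to avoid dropping the tail: slice the instances so that the last, smaller batch is kept. The function name `to_batches_keep_remainder` and the toy data are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def to_batches_keep_remainder(items: Sequence[T], batch_size: int) -> List[Sequence[T]]:
    """Split items into consecutive batches; the final batch may be smaller."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [items[start:start + batch_size] for start in range(0, len(items), batch_size)]


# Toy data: 70 instances with a batch size of 32 gives batches of 32, 32 and 6
toy_instances = list(range(70))
batches = to_batches_keep_remainder(toy_instances, 32)
print([len(batch) for batch in batches])

# Sanity check worth running before writing a prediction file:
# every instance should end up in exactly one batch
assert sum(len(batch) for batch in batches) == len(toy_instances)
```

A similar count check between the number of predictions and the number of token lines in the masked test file would flag a truncated prediction file before it is uploaded.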

### 3. Project proposal

The written proposal should be at most one page in [ACL-format](https://github.com/acl-org/acl-style-files) (the bibliography does not count toward the page limit). In it, you should explain the last three points from the list above and place your project in a larger context (previous work).

Make sure your proposal is:
* Novel to some extent
* Doable within the time-frame

*hint* The [ACL Anthology](https://aclanthology.org/) contains almost all peer-reviewed NLP papers.

**Deadline: 03-04 on LearnIt (14:00)**

### 4. Final project
The final project has a maximum size of 5 pages (excluding bibliography and appendix), using the [ACL style files](https://github.com/acl-org/acl-style-files).

Besides the main paper (discussed in class), you have to include:
* Group contributions. State who was responsible for which part of the project. Here you may state if there were any seriously unequal workloads among group members. This should be put in the appendix.
* A report on usage of chatbots. We follow: https://2023.aclweb.org/blog/ACL-2023-policy/
   * Add a section in the appendix if you made use of a chatbot (since we do not use a Responsible NLP Checklist)
   * Include each stage listed in the ACL policy, and indicate to what extent you used a chatbot
   * Use with care! You are responsible for the project, including plagiarism, correctness, etc.

You can also put additional results and details in the appendix. However, the paper itself should be standalone, and understandable without consulting the appendix.

Furthermore, the code should be available on www.github.itu.dk (with a link in a footnote at the end of the abstract). It should include a README with instructions on how to reproduce your results.

**Deadline: 24-05 on LearnIt (14:00)** Please check the checklist below before uploading!

Optionally, you can upload a draft a week earlier, on **17-05 (before 09:00)**, for an extra round of feedback.

## Analysis

Analysis is essential for the interpretation of your results. In this section we briefly describe some different types of analysis. We strongly suggest using at least one of these:

* **Ablation study**: Leave out a certain part of the model to study its effect. For example, disable the tokenizer, remove a certain (group of) feature(s), or disable the stop-word removal. If the performance drops a lot, it means that this part of the model contributes heavily to the model's final performance. This is commonly presented in one table, with different parts of the model disabled in each row. Note that you can also do this the other way around, i.e. use only one feature (group) at a time, and test performance.
* **Learning curve**: Evaluate how much data your model needs to reach a certain performance. Especially for the data augmentation projects this is essential.
* **Quantitative analysis**: Automated means of analyzing in which cases your model performs worse. This can for example be done with a confusion matrix.
* **Qualitative analysis**: Manually inspect a certain number of errors, and try to categorize them or find trends. This can be combined with the quantitative analysis, e.g., inspect 100 cases of positive reviews predicted to be negative and 100 cases of negative reviews predicted to be positive.
* **Feature importance**: In traditional machine learning methods, one can often extract and inspect the weights of the features. In sklearn these can be found in: `trained_model.coef_`
* **Other metrics**: per class scores, partial matches, or count how often the span-borders were correct, but the label wrong.
* **Input words importance**: To gain insight into which words have an impact on prediction performance (positive or negative), we can analyze per-word impact: given a trained model, replace a given word with the unknown word token and observe the change in prediction score (probability for a class). This is shown in Figure 4 of [Rethmeier et al (2018)](https://aclweb.org/anthology/W18-6246) (a paper on controversy detection), also shown below: red-colored tokens were important for controversy detection, blue-colored tokens decreased prediction scores.

<img width=400px src=example.png>
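The input-word importance procedure described above can be sketched as follows. The scoring function here is a toy stand-in (the fraction of tokens found in a small cue-word set), not a trained model; in practice `score_fn` would return your model's probability for the class of interest. The names `word_importance`, `toy_score` and `CUE_WORDS` are hypothetical.

```python
from typing import Callable, List, Tuple

CUE_WORDS = {"terrible", "awful"}  # made-up cue words for the toy scorer


def toy_score(tokens: List[str]) -> float:
    """Toy stand-in for a model score: fraction of tokens that are cue words."""
    return sum(token in CUE_WORDS for token in tokens) / len(tokens)


def word_importance(
    tokens: List[str],
    score_fn: Callable[[List[str]], float],
    unk: str = "[UNK]",
) -> List[Tuple[str, float]]:
    """Score drop caused by replacing each token, one at a time, with unk."""
    base = score_fn(tokens)
    importances = []
    for idx, token in enumerate(tokens):
        occluded = tokens[:idx] + [unk] + tokens[idx + 1:]
        # Positive value: the token supported the prediction; negative: it hurt it
        importances.append((token, base - score_fn(occluded)))
    return importances


toy_sentence = ["this", "is", "terrible", "news"]
toy_importances = word_importance(toy_sentence, toy_score)
for token, delta in toy_importances:
    print(f"{token:>10}  {delta:+.2f}")
```

Tokens with a large positive value correspond to the red-colored tokens in the figure, and tokens with a negative value to the blue-colored ones.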

Note that this is a non-exhaustive list, and you are encouraged to also explore additional analyses.
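For the quantitative analysis mentioned above, a confusion matrix over tags can be built with the standard library alone. The sketch below counts (gold, predicted) pairs and lists the most frequent confusions; the function names and the toy tag sequences are made up for illustration.

```python
from collections import Counter
from typing import List, Tuple

TagPair = Tuple[str, str]  # (gold tag, predicted tag)


def confusion_counts(gold: List[str], pred: List[str]) -> Counter:
    """Count how often each gold tag was predicted as each tag."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    return Counter(zip(gold, pred))


def most_common_errors(counts: Counter, n: int) -> List[Tuple[TagPair, int]]:
    """Return the n most frequent (gold, pred) pairs where the tags differ."""
    errors = [(pair, count) for pair, count in counts.items() if pair[0] != pair[1]]
    # Sort by descending count, then alphabetically for a stable order
    return sorted(errors, key=lambda item: (-item[1], item[0]))[:n]


# Made-up tags for illustration only
toy_gold_tags = ["O", "B-LOC", "I-LOC", "O", "B-PER"]
toy_pred_tags = ["O", "B-LOC", "O", "O", "B-LOC"]

toy_counts = confusion_counts(toy_gold_tags, toy_pred_tags)
toy_errors = most_common_errors(toy_counts, 2)
for (gold_tag, pred_tag), count in toy_errors:
    print(f"gold={gold_tag:<6} pred={pred_tag:<6} count={count}")
```

The most frequent confusions are a natural starting point for the qualitative analysis, since they tell you which error types to inspect manually.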

### Checklist final project
Please check all these items before handing in your final report. You only have to upload a pdf file on LearnIt; make sure a link to the code is included in the report and that the code is accessible.

* Are all group members and their email addresses specified?
* Does the group report include a representative project title?
* Does the group report contain an abstract?
* Does the introduction clearly specify the research intention and research question?
* Does the group report adequately refer to the relevant literature?
* Does the group report properly use figures, tables and examples?
* Does the group report provide and discuss the empirical results?
* Is the group report proofread?
* Does the pdf contain the link to the project’s github repo?
* Is the github repo accessible to the public (within ITU)?
* Is the group report maximum 5 pages long, excluding references and appendix?
* Are the group contributions added in the appendix?
* Does the repository contain all scripts and code to reproduce the results in the group report? Are instructions provided on how to run the code?
