## Project NLP and Deep Learning

### 1. Project proposal presentation

In the presentation, you have 5 minutes to present your research proposal. During the presentation, you should explain:
* What was your baseline model (architecture, design decisions etc.)
* What is the topic of your project, what is the current state of this topic/task/setup
* What is the new part of your project
* What is the research question of your project

We have proposed a number of topics in the slides which can be found on LearnIt, you can either pick one of these or come up with your own. If you pick your own, we suggest to get a pre-approval with Rob van der Goot.

**Deadline for uploading slides: 12-03 on LearnIt (14:00)**  (pdf only, they will be put into one long pdf for a smooth presentation)

**Presentations: 13-03 from 08:00-12:00**, we will split the class in half for the lecture hours (08:00-10:00) and the lab hours (10:00-12:00)


### 2. Baseline
To get your project started, you start with implementing a baseline model. Ideally, this is going to be the main baseline that you are going to compare to in your paper. Note that this baseline should be more advanced than just predicting the majority class (O).

We will use EWT portion of the [Universal NER project](http://www.universalner.org/), which we provide with this notebook for convenience. You can use the train data (`en_ewt-ud-train.iob2`) and dev data(`en_ewt-ud-dev.iob2`) to build your baseline, then upload your prediction on the test data (`en_ewt-ud-test.iob2`).

It is important to upload your predictions in same format as the training and dev files, so that the `span_f1.py` script can be used.

Note that you do not have to implement your baseline from scratch, you can use for example the code from the RNN or BERT assignments as a starting point.

**Deadline: 20-03 on LearnIt (11:59)**

#### 1. We first create a function to read the universel NER data:

In [62]:
# Function to read NER data

def read_universal_NER(file_path):
    with open(file_path, 'r', encoding = 'utf-8') as infile:
        # Split into lines
        lines = infile.readlines()

        # Define lists to store data 
        sentences = []
        labels = []
        current_sentence = []
        current_labels = []

        # Iterate over lines
        for line in lines:

            line = line.strip() # Remove whitespace
            if not line: # Skip empty lines
                continue

            # print(line)

            # Check if line starts with sentence ID
            if line.startswith('# sent_id'):
                if current_sentence:
                    sentences.append(' '.join(current_sentence))
                    labels.append(current_labels)
                current_sentence = []
                current_labels = []

            # Check for token lines
            elif not line.startswith("#"):
                parts = line.strip().split('\t')
                current_sentence.append(parts[1])
                current_labels.append(parts[2])

        if current_sentence:
            sentences.append(' '.join(current_sentence))
            labels.append(current_labels)
    return sentences, labels

In [56]:
sentences, labels = read_universal_NER('en_ewt-ud-train.iob2')

['1', 'Where', 'O', '-', '-']
['2', 'in', 'O', '-', '-']
['3', 'the', 'O', '-', '-']
['4', 'world', 'O', '-', '-']
['5', 'is', 'O', '-', '-']
['6', 'Iguazu', 'B-LOC', '-', 'stephen']
['7', '?', 'O', '-', '-']
['1', 'Iguazu', 'B-LOC', '-', 'stephen']
['2', 'Falls', 'I-LOC', '-', 'stephen']
['1', 'Widely', 'O', '-', '-']
['2', 'considered', 'O', '-', '-']
['3', 'to', 'O', '-', '-']
['4', 'be', 'O', '-', '-']
['5', 'one', 'O', '-', '-']
['6', 'of', 'O', '-', '-']
['7', 'the', 'O', '-', '-']
['8', 'most', 'O', '-', '-']
['9', 'spectacular', 'O', '-', '-']
['10', 'waterfalls', 'O', '-', '-']
['11', 'in', 'O', '-', '-']
['12', 'the', 'O', '-', '-']
['13', 'world', 'O', '-', '-']
['14', ',', 'O', '-', '-']
['15', 'the', 'O', '-', '-']
['16', 'Iguazu', 'B-LOC', '-', 'stephen']
['17', 'Falls', 'I-LOC', '-', 'stephen']
['18', 'on', 'O', '-', '-']
['19', 'the', 'O', '-', '-']
['20', 'border', 'O', '-', '-']
['21', 'of', 'O', '-', '-']
['22', 'Argentina', 'B-LOC', '-', 'stephen']
['23', 'and', 'O'

#### 2. We then define the BERT model:

In [122]:
import torch
from typing import List
from transformers import AutoModel, AutoTokenizer

class NER_model(torch.nn.Module):
    def __init__(self, nlabels: int, mlm: str):
        super().__init__()

        # Define masked linear model
        self.mlm = AutoModel.from_pretrained(mlm)
        self.mlm_out_size = self.mlm.config.hidden_size
        self.hidden_to_label = torch.nn.Linear(self.mlm_out_size, nlabels)

    def forward(self, input_ids, attention_mask = None, token_type_ids = None):

        # Run transformer model on input
        mlm_out = self.mlm(input_ids = input_ids, attention_mask = attention_mask)
        
        # Keep only the last layer: shape=(batch_size, max_len, DIM_EMBEDDING)
        mlm_out = mlm_out.last_hidden_state
        # Keep only the output for the first ([CLS]) token: shape=(batch_size, DIM_EMBEDDING)
        mlm_out = mlm_out[:,:,:].squeeze()

        # Matrix multiply to get scores for each label: shape=(?,?)
        output_scores = self.hidden_to_label(mlm_out)

        return output_scores
    
    def run_eval(self, text_batched: List[torch.tensor], labels_batched: List[torch.tensor]):
        
        # Set model to evaluation mode
        self.eval()

        # Store amount of total instances and correct matches
        correct = 0
        total = 0

        # Iterate over batches
        for sents, labels in zip(text_batched, labels_batched):

            # Run forward pass and get output labels
            output_scores = self.forward(sents)
            pred_labels = torch.argmax(output_scores, 2)

            # Iterate over gold and predicted labels
            for gold_labels, pred_label in zip(labels, pred_labels):
                
                # Increment total
                total += 1

                # Iterate over gold labels and increment correct if correct pred
                for gold_label, pred in zip(gold_labels, pred_label):
                    if gold_label.item() == pred.item():
                        correct += 1

        correct_freq = correct / total
        return correct_freq

#### 3. We create functions to train the model:

In [151]:
def get_label_mapping(labels_list):
    unique_labels = set(label for labels in labels_list for label in labels)
    label2id = {label: i for i, label in enumerate(unique_labels)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label

def evaluate_model(model, dev_tokens, dev_labels, batch_size, device):

    # Set model to evaluation mode
    model.eval()

    # Store amount of total instances and correct matches
    correct = 0
    total = 0

    with torch.no_grad():
        for i in range(0, len(dev_tokens['input_ids']), batch_size):
            batch_inputs = {key: val[i: i+batch_size].to(device) for key, val in dev_tokens.items()}
            batch_labels = [dev_labels[i: i+batch_size] for dev_labels in dev_labels]
            output = model(**batch_inputs)
            pred_labels = torch.argmax(output, 2)
            for gold_labels, pred in zip(batch_labels, pred_labels):
                total += len(gold_labels)
                correct += sum(gold == pred_label.item() for gold, pred_label in zip(gold_labels, pred))
    
    correct_freq = correct / total
    return correct_freq

def train_model(mlm: str, 
          train_file_path: str, 
          dev_file_path: str,
          learning_rate: float,
          optimizer: torch.optim.Optimizer,
          criterion: torch.nn.modules.loss,
          n_epochs: int,
          batch_size: int,
          device: str,
          ):

    # Read data
    print('reading data...')
    train_text, train_labels = read_universal_NER(train_file_path)
    dev_text, dev_labels = read_universal_NER(dev_file_path)

    # print(type(train_text))
    # print(type(train_text[0]))

    # Tokenize
    print('tokenizing...')
    tokenizer = AutoTokenizer.from_pretrained(mlm)
    train_tokens = tokenizer(train_text, padding = True, truncation = True, return_tensors = 'pt')
    dev_tokens = tokenizer(dev_text, padding = True, truncation = True, return_tensors = 'pt')

    # Convert labels to indices
    label2id, _ = get_label_mapping(train_labels + dev_labels)
    train_labels = [[label2id[label] for label in sent_labels] for sent_labels in train_labels]
    dev_labels =  [[label2id[label] for label in sent_labels] for sent_labels in dev_labels]

    # Initialize model
    print('initializing model...')
    model = NER_model(nlabels = len(label2id), mlm = mlm)
    optimizer = optimizer(model.parameters(), lr = learning_rate)

    # Define the criterion with reduction argument set
    criterion = criterion(reduction = 'mean')

    # Training loop
    print('training...')
    for epoch in range(n_epochs):

        # Set model to training mode
        model.train()
        
        # Store total loss
        total_loss = .0

        for i in range(0, len(train_tokens['input_ids']), batch_size):

            # Set gradients to zero
            optimizer.zero_grad()

            # Define inputs
            batch_inputs = {key: val[i: i+batch_size].to(device) for key, val in train_tokens.items()}
            batch_labels = [labels[i: i+batch_size] for labels in train_labels]

            # Get output and flatten
            output = model(batch_inputs['input_ids'], attention_mask = batch_inputs.get('attention_mask'))
            flat_output = output.view(-1, output.shape[-1])

            # Flatten labels and convert to tensor
            flat_labels = torch.tensor([label for labels in batch_labels for label in labels], dtype=torch.long).to(device)
            flat_labels = flat_labels[:flat_output.shape[0] * flat_output.shape[1]]

            # Print shapes for debugging
            print('Shape of output:', flat_output.shape)
            print('Shape of labels:', flat_labels.shape)
            print(flat_labels[:5])

            # Compute loss and add to total loss
            loss = criterion(flat_output, flat_labels)
            total_loss += loss

            # Perform backward step
            loss.backward()
            optimizer.step()

        # Evaluate
        dev_accuracy = evaluate_model(model, dev_tokens, dev_labels, batch_size, device)

        # Print statistics
        print(f'Epoch {epoch + 1}:')
        print(f'Training Loss: {total_loss / len(train_tokens["input_ids"]):.4f}')
        print(f'Dev Accuracy: {dev_accuracy:.4f}')


#### 4. We train the model:

In [152]:
# Initialize model
train_model(mlm = 'distilbert-base-cased',
            train_file_path = 'en_ewt-ud-train.iob2',
            dev_file_path = 'en_ewt-ud-train.iob2',
            learning_rate = .001,
            optimizer = torch.optim.Adam,
            criterion = torch.nn.CrossEntropyLoss,
            n_epochs = 5,
            batch_size = 32,
            device = 'cuda' if torch.cuda.is_available() else 'cpu'
            )

reading data...
tokenizing...
initializing model...
training...
Shape of output: torch.Size([6464, 7])
Shape of labels: torch.Size([190944])
tensor([0, 0, 0, 0, 0])


ValueError: Expected input batch_size (6464) to match target batch_size (190944).

In [120]:
### EXISTING BERT MODEL FROM EX5

"""
A basic classifier based on the transformers (https://github.com/huggingface/transformers) 
library. It loads a masked language model (by default distilbert), and adds a linear layer for
prediction. Example usage:

python3 bert-topic.py topic-data/train.txt topic-data/dev.txt
"""
from typing import List, Dict
import codecs
import torch
# import sys # I don't need you
import bert.myutils as myutils # I changed this to import from bert dir
from transformers import AutoModel, AutoTokenizer

# set seed for consistency
torch.manual_seed(8446)
# Set some constants
MLM = 'distilbert-base-cased'
BATCH_SIZE = 8
LEARNING_RATE = 0.00001
EPOCHS = 3
# We have an UNK label for robustness purposes, it makes it easier to run on
# data with other labels, or without labels.
UNK = "[UNK]"
MAX_TRAIN_SENTS=64
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"


class ClassModel(torch.nn.Module):
    def __init__(self, nlabels: int, mlm: str):
        """
        Model for classification with transformers.

        The architecture of this model is simple, we just have a transformer
        based language model, and add one linear layer to converts it output
        to our prediction.
    
        Parameters
        ----------
        nlabels : int
            Vocabulary size of output space (i.e. number of labels)
        mlm : str
            Name of the transformers language model to use, can be found on:
            https://huggingface.co/models
        """
        super().__init__()

        # The transformer model to use
        self.mlm = AutoModel.from_pretrained(mlm)

        # Find the size of the output of the masked language model
        if hasattr(self.mlm.config, 'hidden_size'):
            self.mlm_out_size = self.mlm.config.hidden_size
        elif hasattr(self.mlm.config, 'dim'):
            self.mlm_out_size = self.mlm.config.dim
        else: # if not found, guess
            self.mlm_out_size = 768

        # Create prediction layer
        self.hidden_to_label = torch.nn.Linear(self.mlm_out_size, nlabels)

    def forward(self, input: torch.tensor):
        """
        Forward pass
    
        Parameters
        ----------
        input : torch.tensor
            Tensor with wordpiece indices. shape=(batch_size, max_sent_len).

        Returns
        -------
        output_scores : torch.tensor
            ?. shape=(?,?)
        """
        # Run transformer model on input
        mlm_out = self.mlm(input)
        # Keep only the last layer: shape=(batch_size, max_len, DIM_EMBEDDING)
        mlm_out = mlm_out.last_hidden_state
        # Keep only the output for the first ([CLS]) token: shape=(batch_size, DIM_EMBEDDING)
        mlm_out = mlm_out[:,:1,:].squeeze()

        # Matrix multiply to get scores for each label: shape=(?,?)
        output_scores = self.hidden_to_label(mlm_out)

        return output_scores

    def run_eval(self, text_batched: List[torch.tensor], labels_batched: List[torch.tensor]):
        """
        Run evaluation: predict and score
    
        Parameters
        ----------
        text_batched : List[torch.tensor]
            list with batches of text, containing wordpiece indices.
        labels_batched : List[torch.tensor]
            list with batches of labels (converted to ints).
        model : torch.nn.module
            The model to use for prediction.
    
        Returns
        -------
        score : float
            accuracy of model on labels_batches given feats_batches
        """
        self.eval()
        match = 0
        total = 0
        for sents, labels in zip(text_batched, labels_batched):
            output_scores = self.forward(sents)
            pred_labels = torch.argmax(output_scores, 1)
            for gold_label, pred_label in zip(labels, pred_labels):
                total += 1
                if gold_label.item() == pred_label.item():
                    match+= 1
        return(match/total)        

# I no longer need this part
# if len(sys.argv) < 2:
#     print('Please provide path to training and development data')

# I'll wrap this in a function so I can call it here
def train_ClassModel(train_file, dev_file):
    if __name__ == '__main__':

        print('reading data...')

        # Change how we load the data to be specified as a function argument instead of a command-line argument
        train_text, train_labels = myutils.read_data(train_file) # train_text, train_labels = myutils.read_data(sys.argv[1])
        train_text = train_text[:MAX_TRAIN_SENTS]
        train_labels = train_labels[:MAX_TRAIN_SENTS]
        
        id2label, label2id = myutils.labels2lookup(train_labels, UNK)
        NLABELS = len(id2label)
        print(train_labels)
        print(label2id)
        train_labels = [label2id[label] for label in train_labels]
        
        # Change how we load the data to be specified as a function argument instead of a command-line argument
        dev_text, dev_labels = myutils.read_data(dev_file)
        dev_labels = [label2id[label] for label in dev_labels]
        
        print('tokenizing...')
        tokzr = AutoTokenizer.from_pretrained(MLM)
        train_tokked = myutils.tok(train_text, tokzr)
        dev_tokked = myutils.tok(dev_text, tokzr)
        PAD = tokzr.pad_token_id
        
        print('converting to batches...')
        train_text_batched, train_labels_batched = myutils.to_batch(train_tokked, train_labels, BATCH_SIZE, PAD, DEVICE)
        # Note, some data is trown away if len(text_tokked)%BATCH_SIZE!= 0
        dev_text_batched, dev_labels_batched = myutils.to_batch(dev_tokked, dev_labels, BATCH_SIZE, PAD, DEVICE)
        
        print('initializing model...')
        model = ClassModel(NLABELS, MLM)
        model.to(DEVICE)
        optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
        loss_function = torch.nn.CrossEntropyLoss(ignore_index=0, reduction='sum')
        
        print('training...')
        for epoch in range(EPOCHS):
            print('=====================')
            print('starting epoch ' + str(epoch))
            model.train() 
        
            # Loop over batches
            loss = 0
            for batch_idx in range(0, len(train_text_batched)):
                optimizer.zero_grad()

                output_scores = model.forward(train_text_batched[batch_idx])
                batch_loss = loss_function(output_scores, train_labels_batched[batch_idx])
                loss += batch_loss.item()
        
                batch_loss.backward()

                optimizer.step()
        
            dev_score = model.run_eval(dev_text_batched, dev_labels_batched)
            print('Loss: {:.2f}'.format(loss))
            print('Acc(dev): {:.2f}'.format(100*dev_score))
            print()

We train the bro:

In [121]:
# Redefine variables
BATCH_SIZE = 8
MAX_TRAIN_SENTS = 500

train_ClassModel('en_ewt-ud-train.iob2', 'en_ewt-ud-dev.iob2')

reading data...


TypeError: unhashable type: 'list'

### 3. Project proposal

The written proposal should consist of maximum one page in [ACL-format](https://github.com/acl-org/acl-style-files) (The bibliography does not count for the word limit). In here, you should explain the last three points from the list above and place your project in a larger context (previous work).

Make sure your proposal is:
* Novel to some extent
* Doable within the time-frame

*hint* The [ACL Anthology](https://aclanthology.org/) contains almost all peer-reviewed NLP papers.

**Deadline: 03-04 on LearnIt (14:00)**

### 4. Final project
The final project has a maximum size of 5 pages (excluding bibliography and appendix), using the [ACL style files](https://github.com/acl-org/acl-style-files)

Besides the main paper (discussed in class), you have to include:
* Group contributions. State who was responsible for which part of the project. Here you may state if there
were any serious unequal workloads among group members. This should be put in the appendix.
* A report on usage of chatbots. We follow: https://2023.aclweb.org/blog/ACL-2023-policy/
   * Add a section in appendix if you made use of a chatbot (since we do not use a Responsible NLP Checklist)
   * Include each stage on the ACL policy, and indicate to what extend you used a chatbot
   * Use with care!, you are responsible for the project and plagiarism, correctness etc.

You can also put additional results and details in the appendix. However, the paper itself should be standalone, and understandable without consulting the appendix.

Furthermore, the code should be available on www.github.itu.dk (with a link in a footnote at the end of the abstract) , it should include a README with instructions on how to reproduce your results.

**Deadline: 24-05 on LearnIt (14:00)** Please check the checklist below before uploading!

Optionally, you can upload a draft a week before **17-05 (before 09:00)** for an extra round of feedback

## Analysis

Analysis is essential for the interpretation of your results. In this section we will shortly describe some different types of analysis. We strongly suggest to use at least one of these:

* **Ablation study**: Leave out a certain part of the model, to study its effects. For example, disable the tokenizer, remove a certain (group of) feature(s), or disable the stop-word removal. If the performance drops a lot, it means that this part of the model contributes heavily to the models final performance. This is commonly done in 1 table, while disabling different parts of the model. Note that you can also do this the other way around, i.e. use only one feature (group) at a time, and test performance
* **Learning curve**: Evaluate how much data your model needs to reach a certain performance. Especially for the data augmentation projects this is essential.
* **Quantitative analysis**: Automated means of analyzing in which cases your model performs worse. This can for example be done with a confusion matrix.
* **Qualitative analysis**: Manually inspect a certain number of errors, and try to categorize them/find trends. Can be combined with the quantitative analysis, i.e., inspect 100 cases of positive reviews predicted to be negative and 100 cases of negative reviews predicted to be positive
* **Feature importance**: In traditional machine learning methods, one can often extract and inspect the weights of the features. In sklearn these can be found in: `trained_model.coef_`
* **Other metrics**: per class scores, partial matches, or count how often the span-borders were correct, but the label wrong.
* **Input words importance**: To gain insight into which words have a impact on prediction performance (positive, negative), we can analyze per-word impact: given a trained model, replace a given word with
the unknown word token and observe the change in prediction score (probability for a class). This is
shown in Figure 4 of [Rethmeier et al (2018)](https://aclweb.org/anthology/W18-6246) (a paper on controversy detection), also shown below: red-colored
tokens were important for controversy detection, blue-colored token decreased prediction scores.

<img width=400px src=example.png>

Note that this is a non-exhaustive list, and you are encouraged to also explore additional analyses.

### Checklist final project
Please check all these items before handing in your final report. You only have to upload a pdf file on learnit, and make sure a link to the code is included in the report and the code is accesible. 

* Are all group members and their email addresses specified?
* Does the group report include a representative project title?
* Does the group report contain an abstract?
* Does the introduction clearly specify the research intention and research question?
* Does the group report adequately refer to the relevant literature?
* Does the group report properly use figure, tables and examples?
* Does the group report provide and discuss the empirical results?
* Is the group report proofread?
* Does the pdf contain the link to the project’s github repo?
* Is the github repo accessible to the public (within ITU)?
* Is the group report maximum 5 pages long, excluding references and appendix?
* Are the group contributions added in the appendix?
* Does the repository contain all scripts and code to reproduce the results in the group report? Are instructions
 provided on how to run the code?
