<a href="https://colab.research.google.com/github/AbeHandler/AbeHandler.github.io/blob/master/HW3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Intro

In this notebook you will get some practice with a span tagging task (also called token classification) using the popular [Huggingface](https://huggingface.co/) library which makes it (fairly) easy to use large transformer models for NLP tasks.

The data that we will be using for this assignment comes from [RedHOT](https://arxiv.org/abs/2210.06331), a corpus of medical posts from reddit labeled with "PIO" tags where P stands for "population," I stands for "intervention" and O stands for "outcome". PIO frames are often used to create structured representations of text from medical studies.

The creators of the RedHOT dataset created this resource to try to catch medical disinformation online. For example, you could use PIO elements to try to find posts which falsely propose that [bleach](https://www.theguardian.com/world/2020/sep/19/bleach-miracle-cure-amazon-covid) is a cure for COVID.

Credits: this problem set draws from blog posts from [Huggingface](https://huggingface.co/docs/transformers/tasks/token_classification) and [wandb](https://wandb.ai/mostafaibrahim17/ml-articles/reports/Named-Entity-Recognition-With-HuggingFace-Using-PyTorch-and-W-B--Vmlldzo0NDgzODA2) on span tagging.

#### Q1

Data science does not happen in a vacuum. It's usually a good idea to really understand your data before you even get started with modeling. When you are working in a supervised setting, you should spend some time understanding what your labels mean!

Get started by gaining a little bit of familiarity with the PIO tagging scheme by looking at the example in [Table 1](https://arxiv.org/pdf/2210.06331.pdf) of the RedHOT paper. Then define each of the following in your own words by filling out the cell below. You can write one clear sentence for each of these.

**population** _your answer here_

**intervention** _your answer here_

**outcome** _your answer here_

#### Q2

Find a post on r/AskDocs which references a population, intervention or outcome (or all of the above). Include the text of the post below, and describe the population, intervention and outcome in the post. If the post is really long, pick the most important one or two sentences and post them here.

_Your answer here_

### Coding preliminaries

In order to complete the assignment you will need to install the necessary packages and set up your runtime. There are three steps here:

1. In the menu at the top, go to "Runtime," pick "Change runtime type" and select "T4." T4 is a kind of GPU made by Nvidia. When you select this runtime your colab notebook will connect to a T4 GPU. Google gives you a small number of GPU credits as part of a free colab account. It is not really enough for production or research use, but it is an easy way to get started with GPU programming for a problem set. If you skip this step your model will train on a CPU, which will take a really long time. You will have a hard time completing this assignment without a GPU runtime.

2. Uncomment the next line and install the required dependencies using pip. You must do this after step 1 because pip will install different versions depending on whether you have access to the GPU.

3. Upload your `train.conll.txt` and `validation.conll.txt` files to colab. You can find instructions on how to do that [here](https://docs.google.com/document/d/1f1DQI_QUQId4x8fFQQ59cdenAZWWq9eK6HMyYchdpQM/edit?usp=sharing).

In [2]:
# ! pip install transformers datasets evaluate seqeval transformers[torch]

In [3]:
# run this cell to import the needed packages
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification, Trainer, TrainingArguments
from datasets import load_dataset, Dataset, DatasetDict  # load_metric is unused here and was removed from recent versions of datasets
import numpy as np
from seqeval.metrics import f1_score, precision_score, recall_score, classification_report


#### Q3: Understanding the CONLL format

In the lectures for this class, we have been talking about "supervised" learning for span tagging (aka token classification) kind of abstractly. But what does "supervised" data actually look like for span tagging tasks? Well, one common format is called CONLL. Each token in a CONLL file is listed on its own line, and its label is listed to the right (separated by a space delimiter).

In our setting, the tag labels are "P" for population, "I" for intervention and "O" for outcome. Also, the first token in a span gets a tag starting with B (for beginning), such as B-P, and all other tokens in the labeled span get a tag starting with I (for inside), such as I-P. Tokens outside of any span get the plain tag O. So for example the string "Im also HLA-B27 positive" gets tagged as follows.

```
I O
m O
also O
HLA B-P
- I-P
B27 I-P
positive O
```
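To make the tagging scheme concrete, here is a minimal sketch that groups B-/I- tags back into labeled spans. The helper name `bio_to_spans` is hypothetical (it is not part of the assignment code or of any library), and it makes the simplifying assumption that a stray I- tag with no matching B- tag is ignored.

```python
def bio_to_spans(tokens, tags):
    """Group tokens into (span_type, [tokens]) pairs using B-/I- tags."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span (closing any open one).
            if current_tokens:
                spans.append((current_type, current_tokens))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and tag[2:] == current_type:
            # An I- tag of the same type continues the open span.
            current_tokens.append(token)
        else:
            # Plain "O" (or a stray I- tag) closes any open span.
            if current_tokens:
                spans.append((current_type, current_tokens))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((current_type, current_tokens))
    return spans


example_tokens = ["I", "m", "also", "HLA", "-", "B27", "positive"]
example_tags = ["O", "O", "O", "B-P", "I-P", "I-P", "O"]
example_spans = bio_to_spans(example_tokens, example_tags)
print(example_spans)  # [('P', ['HLA', '-', 'B27'])]
```

Running this on the example recovers a single population span made of the three tokens "HLA", "-" and "B27".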


The next cell reads in a conll file. You can leave it unchanged.

In [5]:
def read_conll_file(file_path):
    with open(file_path, "r") as f:
        content = f.read().strip()
    sentences = content.split("\n\n")
    data = []
    for sentence in sentences:
        tokens = sentence.split("\n")
        token_data = []
        for token in tokens:
            token_data.append(token.split())
        data.append(token_data)
    return data


train_data = read_conll_file("train.conll.txt")
validation_data = read_conll_file("validation.conll.txt")

Q3A: _Explain what the code in the cell above is doing in your own words_

Q3B: Take your example from r/AskDocs from Q2 and write it in CONLL format below. You only need to do 25 tokens total, so if you picked a long post then just choose a small piece to annotate.


```
# Your data in CONLL format here
```

#### Converting to a Dataset object

Huggingface transformers models often use a dataset object from the Huggingface [Datasets](https://huggingface.co/docs/datasets/index) library. The code below simply converts the CONLL data into this format. There is not much you need to do for this. To some extent this is boilerplate preprocessing needed to get the RedHOT data ready to use in the Huggingface library.

In [6]:
def convert_to_dataset(data, label_map):
    formatted_data = {"tokens": [], "ner_tags": []}
    for sentence in data:
        if sentence[0] != []:
            try:
                tokens = [token_data[0] for token_data in sentence]
                ner_tags = [label_map[token_data[1]] for token_data in sentence]
                formatted_data["tokens"].append(tokens)
                formatted_data["ner_tags"].append(ner_tags)
            except IndexError:
                pass
            except KeyError:
                print(sentence)
    return Dataset.from_dict(formatted_data)


label_set = set()
for sentence in train_data:
    for token_data in sentence:
        try:
            label_set.add(token_data[1])
        except IndexError:
            pass

label_list = sorted(list(label_set))
label_map = {label: i for i, label in enumerate(label_list)}

train_dataset = convert_to_dataset(train_data, label_map)
validation_dataset = convert_to_dataset(validation_data, label_map)
#test_dataset = convert_to_dataset(test_data, label_map)

datasets = DatasetDict({
    "train": train_dataset,
    "validation": validation_dataset,
    #"test": test_dataset,
})
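As a rough illustration of what the conversion produces, the sketch below builds a label map and the two columns (`tokens` and `ner_tags`) for two toy sentences using plain Python. The `toy_` names and the second toy sentence are made up for this example, and the snippet deliberately skips the `Dataset.from_dict` call so that it runs without any libraries.

```python
# Two toy sentences in the nested-list shape returned by read_conll_file.
toy_data = [
    [["HLA", "B-P"], ["-", "I-P"], ["B27", "I-P"], ["positive", "O"]],
    [["chest", "B-O"], ["pain", "I-O"]],
]

# Collect the distinct tags and give each one an integer id (sorted for stability).
toy_label_list = sorted({tag for sentence in toy_data for _, tag in sentence})
toy_label_map = {label: i for i, label in enumerate(toy_label_list)}

# One list of tokens and one list of integer tags per sentence.
toy_columns = {
    "tokens": [[token for token, _ in sentence] for sentence in toy_data],
    "ner_tags": [[toy_label_map[tag] for _, tag in sentence] for sentence in toy_data],
}

print(toy_label_map)            # {'B-O': 0, 'B-P': 1, 'I-O': 2, 'I-P': 3, 'O': 4}
print(toy_columns["ner_tags"])  # [[1, 3, 3, 4], [0, 2]]
```

The ids depend only on the sorted order of the tags that appear in the data, which is why the real `label_map` is built from the training set.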


#### The distilbert model

In Huggingface transformers (and many NLP tasks) it is very common to work from a pretrained "[foundation](https://arxiv.org/abs/2108.07258)" model. There are a few options for models to try:
- The [BERT](https://huggingface.co/docs/transformers/model_doc/bert) model, which we have discussed in lectures. You can try this problem set using regular BERT if you want, provided you have enough colab credits (don't pay for more!). The training time for a BERT model is roughly double the training time for distilbert (see below).
- A model called "[distilbert](https://huggingface.co/docs/transformers/model_doc/distilbert)" which is a smaller model that is trained to act like BERT. The reason we are using distilbert is that it requires much less compute to train, as compared to the full BERT model.
- A model called "[Roberta](https://arxiv.org/abs/1907.11692)" which is similar to BERT, and achieved the highest score on the PICO task in the RedHOT paper.

The next cell shows how to load a pretrained foundation model. You don't need to change anything.

In [7]:
model_name = "distilbert-base-uncased"  # or "bert-base-uncased" or "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(label_list))



Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The next cell includes two functions. You don't have to change anything but you should know what they do.

The `compute_metrics` function returns the precision, recall and F1 score of the model's predictions on the validation data. This is a standard part of training using [Huggingface](https://huggingface.co/learn/nlp-course/chapter3/3?fw=pt); it gives you feedback on how your model is doing during each epoch of training. An epoch is a single pass through the training data where you update the weights based on the loss function.

In [8]:
def compute_metrics(eval_prediction):
    predictions, labels = eval_prediction
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    return {
        "precision": precision_score(true_labels, true_predictions),
        "recall": recall_score(true_labels, true_predictions),
        "f1": f1_score(true_labels, true_predictions),
        # "classification_report": classification_report(true_labels, true_predictions),
    }
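The scores above are computed over whole spans rather than individual tokens. The sketch below illustrates that idea in plain Python, under the assumption that a predicted span counts as correct only when its type and boundaries both match exactly. The helper names `extract_spans` and `span_prf` are hypothetical, and this is a simplified stand-in, not the actual seqeval implementation (which handles more edge cases).

```python
def extract_spans(tags):
    """Return a set of (span_type, start, end) for each B-/I- span in a tag list."""
    spans = set()
    start, span_type = None, None
    for i, tag in enumerate(tags + ["O"]):  # the extra "O" closes a trailing span
        inside = tag.startswith("I-") and tag[2:] == span_type
        if start is not None and not inside:
            spans.add((span_type, start, i))
            start, span_type = None, None
        if tag.startswith("B-"):
            start, span_type = i, tag[2:]
    return spans


def span_prf(true_tags, pred_tags):
    """Precision, recall and F1 over exactly matching spans."""
    gold, pred = extract_spans(true_tags), extract_spans(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


toy_true = ["B-P", "I-P", "O", "B-O", "I-O"]
toy_pred = ["B-P", "I-P", "O", "B-O", "O"]  # the outcome span is cut short
toy_scores = span_prf(toy_true, toy_pred)
print(toy_scores)  # (0.5, 0.5, 0.5)
```

In the toy prediction, four of the five token tags are right, but only one of the two spans matches exactly, so span-level scores can be much lower than token-level accuracy.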


The next function performs tokenization, which breaks up a string into tokens (words or pieces of words). It also assigns a label to each token (e.g. HLA = B-P, as in the previous example).

#### Q4: Understanding the BERT tokenizer

Working with BERT can be a little confusing because it uses an algorithm called ["WordPiece"](https://huggingface.co/learn/nlp-course/chapter6/6?fw=pt) to break strings into tokens. Understanding the BERT tokenizer requires getting your hands dirty, so let's explore it in the next cell. To answer these questions, you might want to review the course video "Using BERT for text" or check out the [Huggingface tutorial](https://huggingface.co/learn/nlp-course/chapter6/6?fw=pt) on WordPiece tokenization.

In [45]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased") # replace with your model here

In [48]:
inputs = tokenizer("He told me that the Nissan funduplication surgery was a success", return_tensors="pt")
tokenizer.convert_ids_to_tokens(inputs["input_ids"].squeeze())

['[CLS]',
 'he',
 'told',
 'me',
 'that',
 'the',
 'nissan',
 'fund',
 '##up',
 '##lica',
 '##tion',
 'surgery',
 'was',
 'a',
 'success',
 '[SEP]']

Q4A. Why does the tokenizer add the tokens `[CLS]` and `[SEP]`?

_your answer here_

Q4B. How many tokens are created from "funduplication"? Why might that be the case? Answer by referencing the code shown in the cells below. Be sure to explain the meaning of the ## tags in the tokens.

_your answer here_

In [49]:
"he" in tokenizer.vocab

True

In [50]:
"funduplication" in tokenizer.vocab # sparsity of text

False
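The vocabulary checks above hint at how WordPiece handles a word it has never seen. Here is a simplified sketch of the greedy longest-match-first lookup, assuming a tiny made-up vocabulary (`toy_vocab`) in place of the real one. The function name `wordpiece_tokenize` is hypothetical, and the real tokenizer also lowercases and splits on punctuation before this step.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest pieces found in vocab, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece fits, so give up on the whole word
        pieces.append(piece)
        start = end
    return pieces


# A made-up vocabulary, much smaller than the real one.
toy_vocab = {"he", "fund", "##up", "##lica", "##tion", "surgery"}
toy_pieces = wordpiece_tokenize("funduplication", toy_vocab)
print(toy_pieces)  # ['fund', '##up', '##lica', '##tion']
```

Words that are in the vocabulary come back as a single piece, while rarer words are assembled from smaller pieces.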

#### Q5: Understanding preprocessing

The next cell shows more boilerplate preprocessing code needed to get data into a Huggingface transformer model, drawn from [wandb](https://wandb.ai/mostafaibrahim17/ml-articles/reports/Named-Entity-Recognition-With-HuggingFace-Using-PyTorch-and-W-B--Vmlldzo0NDgzODA2). You don't need to do anything with this next cell, but it needs to be included for the code to run. You will also answer a few questions about it below.

In [54]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True, padding=True
    )
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

tokenized_datasets = datasets.map(tokenize_and_align_labels, batched=True)

training_args = TrainingArguments(
    output_dir="./results",
    eval_steps=500,
    save_steps=500,
    num_train_epochs=7,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_steps=100,
    learning_rate=5e-5,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)
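The inner loop of `tokenize_and_align_labels` can be easier to follow on a small hand-written input. The sketch below repeats that loop in plain Python, assuming a made-up `word_ids` list of the kind the tokenizer returns (`None` for special tokens, otherwise the index of the word each token came from). The name `align_labels` and the toy values are hypothetical.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first token and ignore_index to the rest."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None or word_idx == previous_word_idx:
            # Special tokens and non-initial word pieces are skipped by the loss.
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[word_idx])
        previous_word_idx = word_idx
    return aligned


# Three words; the second word was split into three word pieces.
toy_word_ids = [None, 0, 1, 1, 1, 2, None]
toy_word_labels = [6, 2, 6]  # one label id per original word
toy_aligned = align_labels(toy_word_ids, toy_word_labels)
print(toy_aligned)  # [-100, 6, 2, -100, -100, 6, -100]
```

The value -100 is the same ignore marker that `compute_metrics` filters out before scoring.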



The `label_map` defined above maps each token label to a number. The next cell prints out the `label_map`.

In [55]:
label_map

{'B-I': 0, 'B-O': 1, 'B-P': 2, 'I-I': 3, 'I-O': 4, 'I-P': 5, 'O': 6}

The next cell shows the tokens for passage 102 in the training data.

In [76]:
tokenized_datasets["train"]['tokens'][102][0:25]

['Fish',
 'oil',
 'omega-3',
 'stuff',
 '...',
 'but',
 'prescription',
 'strength',
 'and',
 'FDA',
 '-',
 'approved',
 '?',
 '(',
 'Lovaza',
 ')']

The next cell shows their NER tags.

In [75]:
tokenized_datasets["train"]['ner_tags'][102][0:25]

[0, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6]

The next cell shows output from the tokenizer. Notice the `[CLS]` and `[SEP]` tokens.

In [74]:
inputs = tokenizer('Fish oil omega-3 stuff ... but prescription strength and FDA approved', return_tensors="pt")
tokenizer.convert_ids_to_tokens(inputs["input_ids"].squeeze())

['[CLS]',
 'fish',
 'oil',
 'omega',
 '-',
 '3',
 'stuff',
 '.',
 '.',
 '.',
 'but',
 'prescription',
 'strength',
 'and',
 'fda',
 'approved',
 '[SEP]']

The preprocessing function maps each token to a label id based on its tag. The next cell shows the first 25 label ids for instance 102. Based on the label map and `tokenize_and_align_labels` function and your understanding of PIO frames, explain the output of the cell below in your own words.

_your answer here_

In [72]:
tokenized_datasets["train"]["labels"][102][0:25]

[-100,
 0,
 3,
 6,
 -100,
 -100,
 6,
 6,
 -100,
 -100,
 6,
 6,
 6,
 6,
 6,
 6,
 6,
 6,
 6,
 6,
 -100,
 -100,
 6,
 -100,
 -100]

OK, you are finally ready to train! Run the next cell to set up the trainer.

In [9]:
def data_collator(data):
    input_ids = [torch.tensor(item["input_ids"]) for item in data]
    attention_mask = [torch.tensor(item["attention_mask"]) for item in data]
    labels = [torch.tensor(item["labels"]) for item in data]


    input_ids = torch.nn.utils.rnn.pad_sequence(input_ids, batch_first=True, padding_value=tokenizer.pad_token_id)
    attention_mask = torch.nn.utils.rnn.pad_sequence(attention_mask, batch_first=True, padding_value=0)
    labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=-100)


    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
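The `data_collator` above pads every sequence in a batch to the same length so the batch can be stacked into one tensor. As a plain-Python illustration of that padding step (not the `pad_sequence` implementation itself), the sketch below uses a hypothetical helper `pad_batch`. The token ids are made-up placeholders, and the padding id of 0 is an assumption for this toy example.

```python
def pad_batch(sequences, padding_value):
    """Right-pad each list in sequences to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [padding_value] * (max_len - len(seq)) for seq in sequences]


# Made-up token ids and label ids for a batch of two sequences.
toy_input_ids = [[101, 2002, 102], [101, 2002, 2409, 2033, 102]]
toy_labels = [[-100, 6, -100], [-100, 6, 6, 6, -100]]

padded_ids = pad_batch(toy_input_ids, padding_value=0)      # assumed pad id
padded_labels = pad_batch(toy_labels, padding_value=-100)   # ignored by the loss

print(padded_ids[0])     # [101, 2002, 102, 0, 0]
print(padded_labels[0])  # [-100, 6, -100, -100, -100]
```

Padding the labels with -100 means the padded positions do not contribute to the loss or to the metrics.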

#### Ok, train!

You are finally ready to train your model. In many ways, this is kind of the easy part. Once you have set all of the training configurations there is just one line of code that you need to call to actually train the model, shown below. This will take about 45 minutes to run in a T4 runtime. Go take a walk :). Waiting for GPUs is alas a big part of modern data science.

In [10]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | 0.1249 | 0.094777 | 0.177914 | 0.165714 | 0.171598 |
| 2 | 0.1014 | 0.085162 | 0.213483 | 0.217143 | 0.215297 |
| 3 | 0.0628 | 0.092418 | 0.161017 | 0.217143 | 0.184915 |
| 4 | 0.0395 | 0.101051 | 0.154167 | 0.211429 | 0.178313 |
| 5 | 0.0272 | 0.12386 | 0.217949 | 0.194286 | 0.205438 |
| 6 | 0.0221 | 0.128126 | 0.205742 | 0.245714 | 0.223958 |
| 7 | 0.0175 | 0.137412 | 0.2 | 0.222857 | 0.210811 |


TrainOutput(global_step=4018, training_loss=0.06337380470061789, metrics={'train_runtime': 3009.8849, 'train_samples_per_second': 10.675, 'train_steps_per_second': 1.335, 'total_flos': 8395562585046384.0, 'train_loss': 0.06337380470061789, 'epoch': 7.0})

#### Check the results

Once you train a model, you are probably curious to see how it will do on real data. The next cell shows how to run the trained model. Let's test how it does on the example from Table 1 of the RedHOT paper, which we examined earlier in the problem set.

In [48]:
sentence = '''I’ve had costochondritis for a while, usually comes and goes.
Done all the heart/lung checks all clear. I’ve just recovered covid and what I’m
left with is chest pain/pressure. I mean it could
be a costo flare up which makes sense, but also
been reading about myocarditis after covid and
I’m worried, how can I tell which is which?'''


# tokenize the input
tokenized_input = tokenizer(sentence, return_tensors="pt").to(model.device)

# make predictions about the input
outputs = model(**tokenized_input)

# get the predicted labels for each token
predicted_labels = outputs.logits.argmax(-1)[0]

named_entities = [(tokenizer.decode([token]), label) for token, label in zip(tokenized_input["input_ids"][0], predicted_labels)]

print("Named Entities - Example 1:", named_entities)


Named Entities - Example 1: [('cost', tensor(2, device='cuda:0')), ('##och', tensor(5, device='cuda:0')), ('##ond', tensor(5, device='cuda:0')), ('##rit', tensor(5, device='cuda:0')), ('##is', tensor(5, device='cuda:0')), ('chest', tensor(1, device='cuda:0')), ('pain', tensor(1, device='cuda:0')), ('pressure', tensor(1, device='cuda:0'))]
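The output above reports label ids rather than tag names. As a small convenience sketch, the snippet below inverts the `label_map` printed earlier and applies it to the (token, id) pairs from that output, copied in by hand as plain integers. The names `id_to_label`, `predicted_pairs` and `decoded_pairs` are introduced only for this example.

```python
# The label map printed earlier in the notebook.
label_map = {"B-I": 0, "B-O": 1, "B-P": 2, "I-I": 3, "I-O": 4, "I-P": 5, "O": 6}

# Invert it so that a predicted id can be turned back into a tag name.
id_to_label = {label_id: label for label, label_id in label_map.items()}

# (token, predicted id) pairs copied from the output above.
predicted_pairs = [
    ("cost", 2), ("##och", 5), ("##ond", 5), ("##rit", 5), ("##is", 5),
    ("chest", 1), ("pain", 1), ("pressure", 1),
]

decoded_pairs = [(token, id_to_label[label_id]) for token, label_id in predicted_pairs]
for token, tag in decoded_pairs:
    print(token, tag)
```

With a live model you would call `.item()` on each predicted tensor before looking it up, since the dictionary keys are plain integers.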


#### Error analysis

- What is one thing the model got right?
- What is one thing the model got wrong?

#### Another error analysis

So far, you have just seen one example. Go to the [AskDocs](https://www.reddit.com/r/AskDocs/) subreddit and find another medical question.
Post the text of the question below and answer the questions again.

- What is one thing the model got right?
- What is one thing the model got wrong?

### Challenge question

If you look closely at the `TrainingArguments` passed to the `Trainer` above you will notice that it has a learning rate parameter. The learning rate parameter controls how quickly the model adjusts its weights based on the loss. If the learning rate parameter is small, that means the model makes small changes to the weights at each update step. This means it learns more slowly, which sounds like a bad thing. But if your learning rate is too high, then your model will make big changes in response to a few data points, which may make it hard to converge to the best parameters. Ideally, you want to find a learning rate that moves quickly to a (locally optimal) setting of the weights without wild swings in the loss (and F1 score).
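To build some intuition before touching the real model, here is a toy sketch of gradient descent on the one-parameter loss w² with three different learning rates. Everything in it (the function name `gradient_descent`, the starting weight and the specific rates) is made up for illustration and says nothing about which learning rate is best for distilbert.

```python
def gradient_descent(learning_rate, steps=10, w=5.0):
    """Minimize loss(w) = w**2 and return the loss recorded before each step."""
    losses = []
    for _ in range(steps):
        losses.append(w ** 2)
        gradient = 2 * w                  # derivative of w**2
        w = w - learning_rate * gradient  # the update rule
    return losses


small_lr_losses = gradient_descent(learning_rate=0.01)  # slow but steady
medium_lr_losses = gradient_descent(learning_rate=0.4)  # converges quickly
large_lr_losses = gradient_descent(learning_rate=1.1)   # overshoots and diverges

for name, losses in [
    ("small", small_lr_losses),
    ("medium", medium_lr_losses),
    ("large", large_lr_losses),
]:
    print(name, round(losses[0], 3), "->", round(losses[-1], 3))
```

With the smallest rate the loss shrinks only a little in ten steps, with the middle rate it drops close to zero, and with the largest rate each step overshoots the minimum so the loss grows.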

For the challenge question, try several learning rates, from larger to smaller, and plot the training and validation loss at each epoch for each one. Remember an epoch is just a pass through the training set.

What do you notice from your plot? How do changes to the learning rate affect the loss and F1 score at each epoch? Why do you think you are observing these results? Post your supporting code, plot and analysis below.

