<a href="https://colab.research.google.com/github/AbeHandler/AbeHandler.github.io/blob/master/HW3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Intro

In this notebook you will get some practice with a span tagging task (also called token classification) using the popular [Huggingface](https://huggingface.co/) library which makes it (fairly) easy to use large transformer models for NLP tasks.

The data that we will be using for this assignment comes from [RedHOT](https://arxiv.org/abs/2210.06331), a corpus of medical posts from Reddit labeled with "PIO" tags, where P stands for "population," I stands for "intervention" and O stands for "outcome". PIO frames are often used to create structured representations of text from medical studies.

This dataset is motivated by a desire to catch medical disinformation online. For example, you could use PIO elements to try to find posts which falsely propose that [bleach](https://www.theguardian.com/world/2020/sep/19/bleach-miracle-cure-amazon-covid) is a cure for COVID.

Credits: this problem set draws from tutorials from [Huggingface](https://huggingface.co/docs/transformers/tasks/token_classification) and [wandb](https://wandb.ai/mostafaibrahim17/ml-articles/reports/Named-Entity-Recognition-With-HuggingFace-Using-PyTorch-and-W-B--Vmlldzo0NDgzODA2).

#### Q1

Data science does not happen in a vacuum. It's usually a good idea to really understand your data before you even get started with modeling. When you are working in a supervised setting, you should spend some time understanding what your labels mean!

Get started by gaining a little bit of familiarity with the PIO tagging scheme by looking at the example in [Table 1](https://arxiv.org/pdf/2210.06331.pdf) of the RedHOT paper. Then define each of the following in your own words by filling out the cell below. You can write one clear sentence for each of these.

**population** _your answer here_

**intervention** _your answer here_

**outcome** _your answer here_

#### Q2

Find a post on r/AskDocs which references a population, intervention or outcome (or all of the above). Include the text of the post below, and describe the population, intervention and outcome in the post. If the post is really long, pick the most important one or two sentences.

_Your answer here_

### Coding preliminaries

In order to complete the assignment you will need to install the necessary packages and set up your runtime. There are three steps here:

1. At the menu at the top, go to "Runtime", pick "Change runtime type" and select "T4." T4 is a kind of GPU made by Nvidia. When you select this runtime your Colab notebook will connect to a T4 GPU. Google gives you a small number of GPU credits as part of a free Colab account. It is not really enough for production or research use, but it is an easy way to get started with GPU programming for a problem set. If you skip this step your model will train on a CPU, which will take a really long time.

2.  Uncomment the next line and install the required dependencies using pip. You don't have to use a GPU runtime to complete this assignment.

3. Upload your `train.conll.txt` and `validation.conll.txt` files to colab. You can find instructions on how to do that [here](https://docs.google.com/document/d/1f1DQI_QUQId4x8fFQQ59cdenAZWWq9eK6HMyYchdpQM/edit?usp=sharing).

In [2]:
# ! pip install transformers datasets evaluate seqeval transformers[torch]

In [3]:
# run this cell to import the needed packages
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification, Trainer, TrainingArguments
# load_metric is not used in this notebook and is no longer available in recent datasets releases
from datasets import load_dataset, Dataset, DatasetDict
import numpy as np
from seqeval.metrics import f1_score, precision_score, recall_score, classification_report


#### Q3: Understanding the CONLL format

In the lectures for this class, we have been talking about "supervised" learning for span tagging (aka token classification) kind of abstractly. But what does "supervised" data actually look like for span tagging tasks? Well, one common format is called CONLL. Each token in a CONLL file is listed on its own line, and its label is listed to the right (separated by a space delimiter).

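In our setting, the span types are "P" for population, "I" for intervention and "O" for outcome. The first token in a span gets a B- tag (for beginning), such as B-P, and all other tokens in the labeled span get an I- tag (for inside), such as I-P. Tokens that are not part of any span get the bare tag O (for outside), which is different from the outcome tags B-O and I-O. So for example the string "Im also HLA-B27 positive" gets tagged as follows.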

```
I O
m O
also O
HLA B-P
- I-P
B27 I-P
positive O
```
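To make the format concrete, here is a minimal sketch that splits the snippet above into (token, tag) pairs and groups the tagged tokens into spans. The helper name `extract_spans` is made up for illustration and is not part of the assignment code.

```python
# Illustrative sketch only: `extract_spans` is a hypothetical helper,
# not part of the assignment code.
conll_text = """I O
m O
also O
HLA B-P
- I-P
B27 I-P
positive O"""

# Each line holds a token and its tag, separated by whitespace.
pairs = [tuple(line.split()) for line in conll_text.split("\n")]


def extract_spans(pairs):
    """Group B-/I- tagged tokens into (span_type, [tokens]) tuples."""
    spans = []
    current = None
    for token, tag in pairs:
        if tag.startswith("B-"):
            # A B- tag always opens a new span.
            current = (tag[2:], [token])
            spans.append(current)
        elif tag.startswith("I-") and current is not None and current[0] == tag[2:]:
            # An I- tag of the same type continues the open span.
            current[1].append(token)
        else:
            # A bare O tag (or a stray I- tag) closes any open span.
            current = None
    return spans


spans = extract_spans(pairs)
print(spans)
```

Here the only span is a population span covering the three tokens `HLA`, `-` and `B27`.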


The next cell reads in a conll file. You can leave it unchanged.

In [43]:
def read_conll_file(file_path):
    lines = 0
    with open(file_path, "r") as f:
        content = f.read().strip()
        sentences = content.split("\n\n")
        data = []
        for sentence in sentences:
            tokens = sentence.split("\n")
            token_data = []
            for token in tokens:
                token_data.append(token.split())
            data.append(token_data)
            lines += 1
    return data


train_data = read_conll_file("train.conll.txt")
validation_data = read_conll_file("validation.conll.txt")

Question 3: _Explain what the code in the cell above is doing in your own words_

#### Converting to a Dataset object

Huggingface transformers models often use a dataset object from the Huggingface [Datasets](https://huggingface.co/docs/datasets/index) library. The code below simply converts the CONLL data into this format. There is not much you need to do for this. To some extent this is boilerplate preprocessing needed to get the RedHOT data ready to use with the Huggingface library.

In [44]:
def convert_to_dataset(data, label_map):
    formatted_data = {"tokens": [], "ner_tags": []}
    for sentence in data:
        if sentence[0] != []:
            try:
                tokens = [token_data[0] for token_data in sentence]
                ner_tags = [label_map[token_data[1]] for token_data in sentence]
                formatted_data["tokens"].append(tokens)
                formatted_data["ner_tags"].append(ner_tags)
            except IndexError:
                pass
            except KeyError:
                print(sentence)
    return Dataset.from_dict(formatted_data)


label_set = set()
for sentence in train_data:
    for token_data in sentence:
        try:
            label_set.add(token_data[1])
        except IndexError:
            pass

label_list = sorted(list(label_set))
label_map = {label: i for i, label in enumerate(label_list)}

train_dataset = convert_to_dataset(train_data, label_map)
validation_dataset = convert_to_dataset(validation_data, label_map)
#test_dataset = convert_to_dataset(test_data, label_map)

datasets = DatasetDict({
    "train": train_dataset,
    "validation": validation_dataset,
    #"test": test_dataset,
})
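If it is not obvious what the conversion produces, the following sketch applies the same idea to one toy sentence using plain Python (no Huggingface objects). The names `toy_sentence`, `toy_label_map` and `toy_row` are made up for illustration. The toy label map only contains the three tags that appear in the toy sentence, so its ids differ from the ones built from the full training data.

```python
# Illustrative sketch with made-up toy data, not the real RedHOT files.
toy_sentence = [["HLA", "B-P"], ["-", "I-P"], ["B27", "I-P"], ["positive", "O"]]

# Collect the distinct tags and give each one an integer id (sorted for stability).
toy_labels = sorted({tag for _, tag in toy_sentence})
toy_label_map = {label: i for i, label in enumerate(toy_labels)}

# One row of the dataset: parallel lists of tokens and integer tag ids.
toy_row = {
    "tokens": [token for token, _ in toy_sentence],
    "ner_tags": [toy_label_map[tag] for _, tag in toy_sentence],
}

print(toy_label_map)
print(toy_row)
```

Each sentence becomes one row with two lists of equal length, which is the shape that `convert_to_dataset` passes to `Dataset.from_dict`.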


#### The distilbert model

In Huggingface transformers (and many NLP tasks) it is very common to work from a pretrained "foundation" model. The model that we will be using for this HW is called "distilbert" which is a smaller model that is trained to act like BERT. The reason we are using distilbert is that it requires much less compute to train, as compared to the full BERT model. But you can try this problem set using regular BERT if you want, provided you have enough colab credits (don't pay for more!).

In [45]:
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(label_list))


Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [46]:
def compute_metrics(eval_prediction):
    predictions, labels = eval_prediction
    predictions = np.argmax(predictions, axis=2)


    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    return {
        "precision": precision_score(true_labels, true_predictions),
        "recall": recall_score(true_labels, true_predictions),
        "f1": f1_score(true_labels, true_predictions),
        # "classification_report": classification_report(true_labels, true_predictions),
    }


def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True, padding=True
    )
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
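The alignment step in `tokenize_and_align_labels` can be hard to picture, because the tokenizer may split one word into several pieces while the data has only one label per word. The sketch below repeats the same loop on hand-written inputs. The `word_ids` list is hard-coded as an assumption about how the tokenizer would split the two words; in the notebook it comes from `tokenized_inputs.word_ids(...)`.

```python
# Illustrative sketch: the pieces and word_ids are written by hand (an assumption),
# not produced by a real tokenizer.
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first piece; mark everything else as ignored."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens such as [CLS] and [SEP] belong to no word.
            aligned.append(ignore_index)
        elif word_idx != previous_word_idx:
            # First piece of a new word: keep the word's label.
            aligned.append(word_labels[word_idx])
        else:
            # Later pieces of the same word are ignored.
            aligned.append(ignore_index)
        previous_word_idx = word_idx
    return aligned


pieces = ["[CLS]", "cost", "##och", "##ond", "##rit", "##is", "flare", "[SEP]"]
word_ids = [None, 0, 0, 0, 0, 0, 1, None]  # which original word each piece came from
word_labels = [2, 6]                       # one label id per original word

aligned = align_labels(word_ids, word_labels)
for piece, label in zip(pieces, aligned):
    print(piece, label)
```

The value -100 is the default index that PyTorch's cross-entropy loss ignores, and `compute_metrics` above filters out the same value before scoring.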


In [47]:
tokenized_datasets = datasets.map(tokenize_and_align_labels, batched=True)

training_args = TrainingArguments(
    output_dir="./results",
    eval_steps=500,
    save_steps=500,
    num_train_epochs=7,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_steps=100,
    learning_rate=5e-5,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)



In [48]:
def data_collator(data):
    input_ids = [torch.tensor(item["input_ids"]) for item in data]
    attention_mask = [torch.tensor(item["attention_mask"]) for item in data]
    labels = [torch.tensor(item["labels"]) for item in data]


    input_ids = torch.nn.utils.rnn.pad_sequence(input_ids, batch_first=True, padding_value=tokenizer.pad_token_id)
    attention_mask = torch.nn.utils.rnn.pad_sequence(attention_mask, batch_first=True, padding_value=0)
    labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=-100)


    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

#### Ok, train!

You are finally ready to train your model. In many ways, this is kind of the easy part. Once you have set all of the training configurations there is just one line of code that you need to call to actually train the model, shown below. This will take about 15 minutes to run in a T4 runtime. Go take a walk :). Waiting for GPUs is alas a big part of modern data science.

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,0.2026,0.135739,0.0,0.0,0.0




#### Check the results

Once you train a model, you are probably curious to see how it will do on real text. The next cell shows how to run the trained model. Let's test how it does on the example from Table 1 of the paper, which we examined earlier in the problem set.

In [48]:
sentence = '''I’ve had costochondritis for a while, usually comes and goes.
Done all the heart/lung checks all clear. I’ve just recovered covid and what I’m
left with is chest pain/pressure. I mean it could
be a costo flare up which makes sense, but also
been reading about myocarditis after covid and
I’m worried, how can I tell which is which?'''


# tokenize the input
tokenized_input = tokenizer(sentence, return_tensors="pt").to(model.device)

# make predictions about the input
outputs = model(**tokenized_input)

# get the predicted labels for each token
predicted_labels = outputs.logits.argmax(-1)[0]

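# keep every token whose predicted tag is not "O" (outside any span);
# label id 0 is 'B-I' in label_map, so it should not be filtered out
named_entities = [(tokenizer.decode([token]), label) for token, label in zip(tokenized_input["input_ids"][0], predicted_labels) if label != label_map['O']]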

print("Named Entities - Example 1:", named_entities)


Named Entities - Example 1: [('cost', tensor(2, device='cuda:0')), ('##och', tensor(5, device='cuda:0')), ('##ond', tensor(5, device='cuda:0')), ('##rit', tensor(5, device='cuda:0')), ('##is', tensor(5, device='cuda:0')), ('chest', tensor(1, device='cuda:0')), ('pain', tensor(1, device='cuda:0')), ('pressure', tensor(1, device='cuda:0'))]
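The integers in this output are label ids, which are hard to read on their own. Here is a minimal sketch of turning them back into tag names, assuming the label map printed later in this notebook. The predictions are copied by hand as plain integers; in the notebook they are tensors, so you would call `.item()` on each one first. The names `id2label` and `readable` are made up for illustration.

```python
# Label map as printed later in this notebook.
label_map = {"B-I": 0, "B-O": 1, "B-P": 2, "I-I": 3, "I-O": 4, "I-P": 5, "O": 6}

# Invert the map so we can look up a tag name from its integer id.
id2label = {i: label for label, i in label_map.items()}

# Predictions copied by hand from the output above, as (word piece, label id).
predicted = [
    ("cost", 2), ("##och", 5), ("##ond", 5), ("##rit", 5), ("##is", 5),
    ("chest", 1), ("pain", 1), ("pressure", 1),
]

readable = [(piece, id2label[label_id]) for piece, label_id in predicted]
for piece, tag in readable:
    print(piece, tag)
```

Read this way, the model tagged the pieces of "costochondritis" as a population span and tagged "chest", "pain" and "pressure" each as the beginning of an outcome span.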


#### Error analysis

- What is one thing the model got right?
- What is one thing the model got wrong?

#### Another error analysis

So far, you have just seen one example. Go to the [AskDocs](https://www.reddit.com/r/AskDocs/) subreddit and find another medical question.
Post the text of the question below and answer the questions again.

- What is one thing the model got right?
- What is one thing the model got wrong?

In [47]:
label_map

{'B-I': 0, 'B-O': 1, 'B-P': 2, 'I-I': 3, 'I-O': 4, 'I-P': 5, 'O': 6}

In [19]:
# dataset["train"]

Dataset({
    features: ['ner_tags', 'tokens', 'id'],
    num_rows: 2265
})

### BIO tags question

assign your own tags

### Should you keep training?

### Load the tokenizer


In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")


Tokenizer question here

- [word piece](https://huggingface.co/learn/nlp-course/chapter6/6?fw=pt)
- CLS and SEP

Reading the documentation questions

In [6]:
# Example text

text = "He told me that the Nissan funduplication surgery was a success for 80% of cases after 15 years."

inputs = tokenizer(text, return_tensors="pt")
tokenizer.convert_ids_to_tokens(inputs["input_ids"].squeeze())

['[CLS]',
 'he',
 'told',
 'me',
 'that',
 'the',
 'nissan',
 'fund',
 '##up',
 '##lica',
 '##tion',
 'surgery',
 'was',
 'a',
 'success',
 'for',
 '80',
 '%',
 'of',
 'cases',
 'after',
 '15',
 'years',
 '.',
 '[SEP]']

In [9]:
V = tokenizer.vocab
"he" in V

True

In [12]:
len(V)

30522

In [13]:
"funduplication" in V # sparsity of text

False
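Since "funduplication" is not in the vocabulary, the tokenizer has to build it out of smaller pieces that are. The sketch below shows the greedy longest-match-first idea behind WordPiece tokenization on a tiny made-up vocabulary. The function `wordpiece` and the set `toy_vocab` are hypothetical and written for illustration; this is not the Huggingface implementation, and the real vocabulary has 30522 entries.

```python
# Illustrative sketch of greedy longest-match-first WordPiece splitting.
# `wordpiece` and `toy_vocab` are made up for this example.
def wordpiece(word, vocab, unk="[UNK]"):
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word are marked with "##".
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"he", "fund", "##up", "##lica", "##tion"}

fund_pieces = wordpiece("funduplication", toy_vocab)
he_pieces = wordpiece("he", toy_vocab)
print(fund_pieces)
print(he_pieces)
```

With this toy vocabulary the rare word is split into the same four pieces shown in the tokenizer output above, while the common word "he" stays whole.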

In [13]:
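# wnut is the WNUT 17 dataset from the Huggingface token classification tutorial
wnut = load_dataset("wnut_17")
example = wnut["train"][0]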
example

{'id': '0',
 'tokens': ['@paulwalk',
  'It',
  "'s",
  'the',
  'view',
  'from',
  'where',
  'I',
  "'m",
  'living',
  'for',
  'two',
  'weeks',
  '.',
  'Empire',
  'State',
  'Building',
  '=',
  'ESB',
  '.',
  'Pretty',
  'bad',
  'storm',
  'here',
  'last',
  'evening',
  '.'],
 'ner_tags': [0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  7,
  8,
  8,
  0,
  7,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0]}

In [6]:
example = wnut["train"][0]
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
tokens

['[CLS]',
 '@',
 'paul',
 '##walk',
 'it',
 "'",
 's',
 'the',
 'view',
 'from',
 'where',
 'i',
 "'",
 'm',
 'living',
 'for',
 'two',
 'weeks',
 '.',
 'empire',
 'state',
 'building',
 '=',
 'es',
 '##b',
 '.',
 'pretty',
 'bad',
 'storm',
 'here',
 'last',
 'evening',
 '.',
 '[SEP]']

### Challenge question

If you look closely at the TrainingArguments above you will notice a learning rate parameter. The learning rate controls how much the model adjusts its weights in response to the loss at each update step. If the learning rate is small, the model makes small changes to the weights at each step. This means it learns more slowly, which sounds like a bad thing. But if your learning rate is too high, then your model will make big changes in response to a few data points, which may make it hard to converge to the best parameters.
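To build some intuition before touching the real model, here is a toy sketch of gradient descent on a single weight with loss `w ** 2`. It is not the distilbert training loop, and the function name `gradient_descent` and the three learning rates are made up for illustration.

```python
# Toy sketch: minimize loss(w) = w ** 2 with plain gradient descent.
# This is an illustration, not the Trainer's actual optimizer.
def gradient_descent(learning_rate, steps=20, start=1.0):
    w = start
    history = [w]
    for _ in range(steps):
        gradient = 2 * w                  # derivative of w ** 2
        w = w - learning_rate * gradient  # one update step
        history.append(w)
    return history


small = gradient_descent(0.01)    # small steps: slow progress toward 0
moderate = gradient_descent(0.4)  # moderate steps: fast progress toward 0
large = gradient_descent(1.1)     # steps too big: overshoots and diverges

for name, history in [("small", small), ("moderate", moderate), ("large", large)]:
    print(f"{name}: final |w| = {abs(history[-1]):.4g}")
```

The best weight here is 0. The small learning rate is still far from it after 20 steps, the moderate one gets very close, and the large one moves further away with every step.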

For the challenge question, try changing the learning rate from bigger to smaller and plotting the training and validation loss at each epoch. Remember an epoch is just a pass through the training set.

What do you notice from your plot? Why do you think you are observing these results? Post your supporting code, plot and analysis below.