In this notebook, we explore a Named Entity Recognition (NER) task using transformers. The task involves fine-tuning the [ClinicalBERT](https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT) model.

In [1]:
! pip install pandas
! pip install datasets
! pip install transformers
! pip install torch
! pip install ipywidgets

Defaulting to user installation because normal site-packages is not writeable
Collecting pandas
  Downloading pandas-1.4.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.7 MB)
Collecting numpy>=1.18.5
  Downloading numpy-1.23.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
Collecting pytz>=2020.1
  Downloading pytz-2022.1-py2.py3-none-any.whl (503 kB)
Installing collected packages: pytz, numpy, pandas
Successfully installed numpy-1.23.0 pandas-1.4.3 pytz-2022.1
You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' com

In [2]:
import os
import itertools
import pandas as pd
import numpy as np
import datasets
from datasets import Dataset
from datasets import load_metric
from transformers import AutoTokenizer
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
from transformers import DataCollatorForTokenClassification
import torch



## The data

In [18]:
import pandas as pd

DATA_DIR = "../data/Bioinformatics_16/BioNLP13CG-IOB"


def read_split(split):
    """Load one split; each line holds one sentence's whitespace-separated tokens (or labels)."""
    with open(f"{DATA_DIR}/{split}_tokens.txt") as f:
        tokens = [line.split() for line in f]
    with open(f"{DATA_DIR}/{split}_labels.txt") as f:
        labels = [line.split() for line in f]
    return pd.DataFrame({"tokens": tokens, "ner_tags": labels})


# Test, validation and train data frames
test = read_split("test")
validation = read_split("validation")
train = read_split("train")


In [22]:
train.iloc[:2]

Unnamed: 0,tokens,ner_tags
0,"[The, TGF, -, beta, type, II, receptor, in, ch...","[O, B-Gene_or_gene_product, I-Gene_or_gene_pro..."
1,"[Genomic, instability, is, one, mechanism, pro...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."


Convert the data frames to a `DatasetDict`:

In [24]:
med_df = datasets.DatasetDict({
    "train": Dataset.from_pandas(train),
    "validation": Dataset.from_pandas(validation),
    "test": Dataset.from_pandas(test)
})

med_df

DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags'],
        num_rows: 3021
    })
    validation: Dataset({
        features: ['tokens', 'ner_tags'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['tokens', 'ner_tags'],
        num_rows: 1895
    })
})

In [32]:
# Example
example = pd.DataFrame(med_df["validation"][221])
example

Unnamed: 0,tokens,ner_tags
0,The,O
1,rats,B-Organism
2,were,O
3,divided,O
4,into,O
5,4,O
6,groups,O
7,.,O


Each record is annotated in the `inside-outside-beginning` (IOB) format, i.e. a `B-` prefix indicates the beginning of an entity, and consecutive tokens belonging to the same entity are given an `I-` prefix. An `O` tag indicates that the token does not belong to any entity. For example, in the validation sentence shown above, `rats` is tagged `B-Organism` and every other token is tagged `O`.
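
To make the tagging scheme concrete, here is a minimal sketch that groups IOB tags back into entity spans. The helper name `iob_to_spans` is hypothetical (it is not part of the notebook or of any library), and treating a stray `I-` tag as the start of a new entity is an assumption made for simplicity.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        # Split on the first hyphen only, so "B-Multi-tissue_structure" stays intact.
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and entity_type == current_type:
            # Continuation of the currently open entity.
            current_tokens.append(token)
            continue
        # Any other tag closes the open entity, if there is one.
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
        if prefix == "O":
            current_type, current_tokens = None, []
        else:
            # A "B-" tag (or a stray "I-" tag) opens a new entity.
            current_type, current_tokens = entity_type, [token]
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["The", "rats", "were", "divided", "into", "4", "groups", "."]
tags = ["O", "B-Organism", "O", "O", "O", "O", "O", "O"]
print(iob_to_spans(tokens, tags))  # [('Organism', 'rats')]
```

The same function can be applied to any row of the data frames by passing its `tokens` and `ner_tags` lists.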


As a quick check that we don't have any unusual imbalance in the tags, let's calculate the frequencies of each entity across each split:

In [39]:
# subclass for counting hashable objects
from collections import Counter
# calls a factory function to supply missing values
from collections import defaultdict
from datasets import DatasetDict

split2freqs = defaultdict(Counter)
for split, dataset in med_df.items():
    for row in dataset["ner_tags"]:
        for tag in row:
            if tag.startswith("B-"):
                # Split on the first hyphen only, so that names such as
                # "Multi-tissue_structure" are kept whole
                tag_type = tag.split("-", 1)[1]
                split2freqs[split][tag_type] += 1


pd.DataFrame(split2freqs)


Unnamed: 0,train,validation,test
Gene_or_gene_product,4016,1354,2513
Cancer,1226,430,918
Cell,1934,540,996
Organism,896,295,513
Simple_chemical,1096,443,720
Multi-tissue_structure,415,138,303
Organ,194,71,156
Organism_subdivision,47,12,39
Tissue,314,86,184
Immaterial_anatomical_entity,52,19,31


Some entity types, such as `Organism_subdivision` and `Immaterial_anatomical_entity`, have very few examples in every split. One option would be to combine the train and validation sets; for now, let's see how well the model generalizes on these.
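
As a rough way to flag under-represented entity types, the sketch below filters a few of the training counts from the table above against a cutoff. The helper `rare_entities` and the cutoff `MIN_MENTIONS = 100` are hypothetical and chosen only for illustration; a sensible threshold depends on the model and the task.

```python
from collections import Counter

# A few training-split mention counts copied from the frequency table.
train_counts = Counter({
    "Gene_or_gene_product": 4016,
    "Cancer": 1226,
    "Organ": 194,
    "Organism_subdivision": 47,
    "Tissue": 314,
    "Immaterial_anatomical_entity": 52,
})

MIN_MENTIONS = 100  # hypothetical cutoff, for illustration only


def rare_entities(counts, min_mentions=MIN_MENTIONS):
    """Return (entity_type, count) pairs below the cutoff, rarest first."""
    rare = [(name, n) for name, n in counts.items() if n < min_mentions]
    return sorted(rare, key=lambda item: item[1])


print(rare_entities(train_counts))
# [('Organism_subdivision', 47), ('Immaterial_anatomical_entity', 52)]
```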

We now need a way to encode the `ner_tags`, e.g. `B-Amino_acid`, in numerical form. Let's build the list of tags used to label each token, along with a mapping from each tag to an ID and vice versa. All of this information can be derived as follows:

In [54]:
split2freqs = defaultdict(Counter)

for split, dataset in med_df.items():
    for row in dataset["ner_tags"]:
        for tag in row:
            tag_type = tag
            split2freqs[split][tag_type] +=1


tag_names = pd.DataFrame(split2freqs).reset_index()["index"].to_list()
tag_names

['O',
 'B-Gene_or_gene_product',
 'I-Gene_or_gene_product',
 'B-Cancer',
 'I-Cancer',
 'B-Cell',
 'I-Cell',
 'B-Organism',
 'B-Simple_chemical',
 'I-Simple_chemical',
 'B-Multi-tissue_structure',
 'I-Multi-tissue_structure',
 'B-Organ',
 'B-Organism_subdivision',
 'B-Tissue',
 'I-Tissue',
 'B-Immaterial_anatomical_entity',
 'B-Organism_substance',
 'I-Organism_substance',
 'I-Organism',
 'I-Organism_subdivision',
 'B-Cellular_component',
 'I-Immaterial_anatomical_entity',
 'I-Cellular_component',
 'B-Pathological_formation',
 'I-Pathological_formation',
 'I-Organ',
 'B-Amino_acid',
 'I-Amino_acid',
 'B-Anatomical_system',
 'I-Anatomical_system',
 'B-Developing_anatomical_structure',
 'I-Developing_anatomical_structure']

Next, create the tag-to-index and index-to-tag dictionaries:

In [62]:
# Create index and tag mappings
tag2index = {tag: idx for idx, tag in enumerate(tag_names)}
index2tag = {idx: tag for idx, tag in enumerate(tag_names)}
print(index2tag[32])
print(tag2index["I-Developing_anatomical_structure"])

I-Developing_anatomical_structure
32
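
The tag list above appears to follow the order in which tags are first encountered in the data, so the IDs could shift if the files change. As a hedged alternative, the sketch below builds the list deterministically from a set of entity types. The names `build_tag_names`, `tag2index_sorted` and `index2tag_sorted` are hypothetical, the entity set is shortened for illustration, and the resulting IDs differ from the ones used in the rest of this notebook.

```python
def build_tag_names(entity_types):
    """Return ['O', 'B-X', 'I-X', ...] with entity types in sorted order."""
    tag_names = ["O"]
    for entity_type in sorted(entity_types):
        tag_names.extend([f"B-{entity_type}", f"I-{entity_type}"])
    return tag_names


entity_types = {"Cancer", "Organ", "Cell"}  # shortened for illustration
tag_names_sorted = build_tag_names(entity_types)

# Both mappings are derived from the same list, so they stay consistent.
tag2index_sorted = {tag: idx for idx, tag in enumerate(tag_names_sorted)}
index2tag_sorted = {idx: tag for tag, idx in tag2index_sorted.items()}

print(tag_names_sorted)
# ['O', 'B-Cancer', 'I-Cancer', 'B-Cell', 'I-Cell', 'B-Organ', 'I-Organ']
```

Keeping `O` at index 0 and pairing each `B-` tag with its `I-` tag also guarantees that every entity type has both tags, even if one of them never occurs in the data.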


In [70]:
# Look up several indices at once (a list cannot be used as a dictionary key)
[index2tag[idx] for idx in [0, 1]]

['O', 'B-Gene_or_gene_product']

With these mappings, the next step is to create a new column in each split with the numeric class labels for each observation. We'll use the `map()` method to apply a function to each observation:

In [72]:
# Add ner_tag ids
def create_tag_ids(example):
    # Without batched=True, map() passes one example (sentence) at a time
    return {"tag_ids": [tag2index[ner_tag] for ner_tag in example["ner_tags"]]}

# Apply the function to every example in every split
med_df = med_df.map(create_tag_ids)
med_df

100%|██████████| 3021/3021 [00:00<00:00, 9181.63ex/s]
100%|██████████| 1000/1000 [00:00<00:00, 8176.49ex/s]
100%|██████████| 1895/1895 [00:00<00:00, 9077.38ex/s]


DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags', 'tag_ids'],
        num_rows: 3021
    })
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'tag_ids'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'tag_ids'],
        num_rows: 1895
    })
})
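
If the mapping is ever run with `map(..., batched=True)`, each column arrives as a list of rows, so the function needs a nested comprehension. The sketch below shows that variant on a plain dictionary. The names `create_tag_ids_batched` and `tag2index_demo` are hypothetical, and `tag2index_demo` holds only three entries taken from the mapping built earlier.

```python
# Subset of the tag-to-index mapping, for illustration only.
tag2index_demo = {"O": 0, "B-Cancer": 3, "B-Organ": 12}


def create_tag_ids_batched(batch):
    """Batched variant: batch["ner_tags"] is a list of tag lists, one per sentence."""
    return {
        "tag_ids": [
            [tag2index_demo[tag] for tag in tags] for tags in batch["ner_tags"]
        ]
    }


batch = {"ner_tags": [["O", "B-Organ", "O"], ["B-Cancer", "O"]]}
print(create_tag_ids_batched(batch))  # {'tag_ids': [[0, 12, 0], [3, 0]]}
```

With the full `tag2index` dictionary in place of the demo mapping, this would be applied as `med_df.map(create_tag_ids_batched, batched=True)`.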

In [77]:
example = pd.DataFrame(med_df["validation"][111])
example

Unnamed: 0,tokens,ner_tags,tag_ids
0,Postoperative,O,0
1,progression,O,0
2,of,O,0
3,pulmonary,B-Organ,12
4,metastasis,O,0
5,in,O,0
6,osteosarcoma,B-Cancer,3
7,.,O,0


Much better! We still need to convert the tokens themselves into numeric representations with the model's tokenizer. We'll get back to that shortly.