# Clinical NER Transformer

In this notebook I use the pretrained Bio_ClinicalBERT encoder transformer to extract diagnoses and medications (NER and classification) from medical-note data.

### Main Objectives

1. Generate or Load Data

- Option A: Synthetic EMRs (fast to iterate; no licensing).
- Option B: Open datasets (stronger realism): BC5CDR (Diseases/Chemicals): https://huggingface.co/datasets/biocreative_cdr

2. BIO Tag the Data

- Label scheme: B-ENTITY, I-ENTITY, O (no entity).

3. Tokenize

- Use the same tokenizer as the model (handles subwords + offset mapping).

4. Load Pretrained Model

- Bio_ClinicalBERT: emilyalsentzer/Bio_ClinicalBERT

5. Fine-Tune

- Objective: add token classification head (NER).
- Loss: cross-entropy over token labels (ignore specials with -100).

6. Classify Diagnoses & Meds

- Post-process logits into entity spans.
- Aggregate subwords back to words; merge B-/I- runs.

7. Evaluate

- Metrics: precision / recall / F1 (seqeval).
- Inspect errors (boundary splits, abbreviations, synonyms).
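
Steps 6 and 7 can be illustrated without a model. The block below is a minimal pure-Python sketch, not the notebook's final implementation: the helper names `bio_to_spans` and `span_prf` are hypothetical, and the scoring rule (a prediction counts only if type and both boundaries match exactly) is an assumption modelled on how entity-level NER metrics are commonly computed.

```python
from typing import List, Tuple

Span = Tuple[str, int, int]  # (entity type, first word index, last word index + 1)

def bio_to_spans(tags: List[str]) -> List[Span]:
    """Merge B-/I- runs of word-level tags into entity spans."""
    spans: List[Span] = []
    start, current = None, None
    for i, tag in enumerate(tags + ["O"]):  # the extra "O" flushes the last run
        continues = tag.startswith("I-") and current == tag[2:]
        if current is not None and not continues:
            spans.append((current, start, i))
            start, current = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and current is None):
            # a stray I- with no open entity is treated as the start of a new one
            start, current = i, tag[2:]
    return spans

def span_prf(gold: List[Span], pred: List[Span]) -> Tuple[float, float, float]:
    """Entity-level precision, recall and F1 with exact span matching."""
    tp = len(set(gold) & set(pred))
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

gold_tags = ["O", "B-Chemical", "O", "B-Disease", "I-Disease", "O"]
pred_tags = ["O", "B-Chemical", "O", "B-Disease", "O", "O"]  # boundary error on the disease

gold_spans = bio_to_spans(gold_tags)
pred_spans = bio_to_spans(pred_tags)
print(gold_spans)
print(pred_spans)
print(span_prf(gold_spans, pred_spans))
```

In this toy case the chemical is matched but the truncated disease span is not, so precision, recall and F1 all come out at 0.5. This is the kind of boundary split the error inspection in step 7 is meant to surface.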

### Install and import necessary modules

In [None]:
!conda install -y pytorch cpuonly -c pytorch

In [1]:
!pip install -U transformers seqeval evaluate bioc
!pip install "datasets<4.0.0" fsspec pyarrow

Collecting huggingface-hub<1.0,>=0.34.0 (from transformers)
  Using cached huggingface_hub-0.34.4-py3-none-any.whl.metadata (14 kB)
Using cached huggingface_hub-0.34.4-py3-none-any.whl (561 kB)
Installing collected packages: huggingface-hub
  Attempting uninstall: huggingface-hub
    Found existing installation: huggingface-hub 0.25.2
    Uninstalling huggingface-hub-0.25.2:
      Successfully uninstalled huggingface-hub-0.25.2
Successfully installed huggingface-hub-0.34.4


In [2]:
import transformers
from datasets import load_dataset

In [3]:
print(transformers.__version__)

4.56.0


### Load the dataset from Hugging Face Datasets

The BioCreative V Chemical Disease Relation (CDR) dataset is a corpus of 1,500 PubMed articles in which human annotators marked all chemicals, diseases, and the interactions between them.

In [68]:
dataset = load_dataset("bigbio/bc5cdr", "bc5cdr_source", trust_remote_code=True)    

In [69]:
# dataset content
dataset

DatasetDict({
    train: Dataset({
        features: ['passages'],
        num_rows: 500
    })
    test: Dataset({
        features: ['passages'],
        num_rows: 500
    })
    validation: Dataset({
        features: ['passages'],
        num_rows: 500
    })
})

### Observe dataset features and structure


Each row in train / test / validation contains one document.
Inside it, the passages field is a list of two items:

- Title (type = "title")

- Abstract (type = "abstract")

Each passage has:

- text: the actual string

- entities: list of labeled entities (with offsets + text + type + normalized DB links)

- relations: relations between entities (e.g. chemical–disease links)
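
Because the offsets are character positions into the passage text, slicing the text with an offset pair should reproduce the entity string. The sketch below checks that property on a hand-copied, abbreviated version of the title passage of document 6747681 (printed in full further down). The dictionary keeps only the fields needed here, and the variable names are illustrative rather than part of the notebook.

```python
# Abbreviated copy of one title passage from the train split (illustrative subset of fields).
passage = {
    "type": "title",
    "text": "Intra-arterial BCNU chemotherapy for treatment of malignant gliomas "
            "of the central nervous system.",
    "entities": [
        {"offsets": [[15, 19]], "text": ["BCNU"], "type": "Chemical"},
        {"offsets": [[50, 67]], "text": ["malignant gliomas"], "type": "Disease"},
    ],
}

recovered = []
for ent in passage["entities"]:
    for (start, end), surface in zip(ent["offsets"], ent["text"]):
        # each offset pair is a [start, end) character range into passage["text"]
        snippet = passage["text"][start:end]
        recovered.append((ent["type"], snippet, snippet == surface))

for row in recovered:
    print(row)
```

Both rows should end in `True`, which indicates that the end offset is exclusive and that offsets are relative to the passage, not to the whole document.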

In [70]:
passage_feature = dataset["train"].features['passages']
passage_feature

[{'document_id': Value(dtype='string', id=None),
  'type': Value(dtype='string', id=None),
  'text': Value(dtype='string', id=None),
  'entities': [{'id': Value(dtype='string', id=None),
    'offsets': [[Value(dtype='int32', id=None)]],
    'text': [Value(dtype='string', id=None)],
    'type': Value(dtype='string', id=None),
    'normalized': [{'db_name': Value(dtype='string', id=None),
      'db_id': Value(dtype='string', id=None)}]}],
  'relations': [{'id': Value(dtype='string', id=None),
    'type': Value(dtype='string', id=None),
    'arg1_id': Value(dtype='string', id=None),
    'arg2_id': Value(dtype='string', id=None)}]}]

An example document from the training split (index 200):

In [71]:
dataset["train"][200] 

{'passages': [{'document_id': '6747681',
   'type': 'title',
   'text': 'Intra-arterial BCNU chemotherapy for treatment of malignant gliomas of the central nervous system.',
   'entities': [{'id': '0',
     'offsets': [[15, 19]],
     'text': ['BCNU'],
     'type': 'Chemical',
     'normalized': [{'db_name': 'MESH', 'db_id': 'D002330'}]},
    {'id': '1',
     'offsets': [[50, 67]],
     'text': ['malignant gliomas'],
     'type': 'Disease',
     'normalized': [{'db_name': 'MESH', 'db_id': 'D005910'}]}],
   'relations': [{'id': 'R0',
     'type': 'CID',
     'arg1_id': 'D002330',
     'arg2_id': 'D031300'}]},
  {'document_id': '6747681',
   'type': 'abstract',
   'text': 'Because of the rapid systemic clearance of BCNU (1,3-bis-(2-chloroethyl)-1-nitrosourea), intra-arterial administration should provide a substantial advantage over intravenous administration for the treatment of malignant gliomas. Thirty-six patients were treated with BCNU every 6 to 8 weeks, either by transfemoral cath

### Build BIO tags dataset

For each entity offset, mark the corresponding tokens with B-(TYPE) and I-(TYPE).
This converts the BigBio dataset into a standard token classification dataset.
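
To make the rule concrete before the full implementation, here is a deliberately simplified sketch on the start of the example title. It splits on whitespace only and assumes entities do not overlap. The function name `tag_words` is illustrative, and the notebook's own code below handles punctuation and overlapping spans more carefully.

```python
from typing import List, Tuple

def tag_words(text: str, entities: List[Tuple[int, int, str]]) -> List[Tuple[str, str]]:
    """Assign B-/I-/O tags to whitespace-separated words from character spans."""
    tagged = []
    pos = 0
    for word in text.split():
        start = text.index(word, pos)  # character span of this word
        end = start + len(word)
        pos = end
        tag = "O"
        for ent_start, ent_end, ent_type in entities:
            if start < ent_end and end > ent_start:  # word and entity overlap
                # the word covering the entity start gets B-, later words get I-
                prefix = "B-" if start <= ent_start else "I-"
                tag = prefix + ent_type
                break
        tagged.append((word, tag))
    return tagged

text = "Intra-arterial BCNU chemotherapy for treatment of malignant gliomas"
entities = [(15, 19, "Chemical"), (50, 67, "Disease")]  # (start, end, type)

for word, tag in tag_words(text, entities):
    print(f"{word:15} {tag}")
```

Here `BCNU` is tagged `B-Chemical`, `malignant` is tagged `B-Disease`, `gliomas` is tagged `I-Disease`, and every other word is tagged `O`.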

In [74]:
import re
from typing import List, Tuple, Dict

label_lst = ["O","B-Chemical","I-Chemical","B-Disease","I-Disease"]
label_to_id = {l:i for i,l in enumerate(label_lst)}

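def separate_to_words(text: str) -> List[Tuple[str, int, int]]:
    # split text into words and punctuation marks, each with its start and end character index
    # note: only ASCII letters and punctuation are matched; digits, underscores and
    # non-ASCII letters are skipped and produce no token
    # output shape: List[Tuple[str,int,int]]
    return [(m.group(0), m.start(), m.end()) for m in re.finditer(r"[A-Za-z]+(?:'[A-Za-z]+)?|[^\w\s]", text)]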

def get_entities(passage: Dict) -> List[Tuple[int, int, str]]:
    # take a passage dict and return its entities as (start, end, type) tuples
    # output shape: List[Tuple[int,int,str]]
    spans = []
    for ent in passage.get("entities", []):
        # BigBio offsets are a list of [start, end] pairs; an entity with several
        # pairs is collapsed to one span from the smallest start to the largest end
        offs = ent.get("offsets", [])
        if not offs:
            continue
        s = min(o[0] for o in offs)
        e = max(o[1] for o in offs)
        t = ent.get("type", "")
        spans.append((s, e, t))
    return spans

def merge_passages(passages: List[Dict], joiner=" "):
    # take the list of passages (title and abstract), join their texts, and shift
    # each entity's offsets so they index into the merged text
    # output shape: text, List[Tuple[int,int,str]]
    full, ents, offset = [], [], 0
    for p in passages:
        text = p.get("text","")
        full.append(text)
        full.append(joiner)  # add white space between title and abstract passages
        for (s,e,t) in get_entities(p):
            ents.append((s+offset, e+offset, t))
        offset += len(text) + len(joiner)
    if full and full[-1] == joiner:
        full.pop()  # drop redundant whitespace
 
    return "".join(full), ents

def main_doc_to_words_tags(doc: Dict) -> Dict[str, List]:
    text, entity_spans = merge_passages(doc["passages"])
    toks = separate_to_words(text)
    words  = [w for (w,_,_) in toks]
    tags   = ["O"] * len(words)

    # map BigBio types to two classes
    def map_type(t: str):
        t = t.lower()
        if t.startswith("chem"): 
            return "Chemical"
        if t.startswith("dis"):  
            return "Disease"
        return None

    # assign BIO tags by character overlap
    for i, (tok, s, e) in enumerate(toks):
        # find the entity with the largest overlap with this token
        best = None
        best_ov = 0
        for (es, ee, t) in entity_spans:
            ov = min(e, ee) - max(s, es)
            if ov > best_ov:
                best_ov = ov
                best = (es, ee, t)
        if best and best_ov > 0:
            es, ee, t = best
            cls = map_type(t)
            if cls:
                # B- if the entity starts inside this token (s <= es < e); otherwise I-
                tags[i] = "B-"+cls if s <= es < e else "I-"+cls

    # enforce I- continuity
    for i in range(len(tags)):
        if tags[i].startswith("I-"):
            if i == 0 or tags[i-1] == "O" or tags[i-1][2:] != tags[i][2:]:
                tags[i] = "B-" + tags[i][2:]

    ner_ids = [label_to_id.get(t, 0) for t in tags]
    return {"words": words, "ner_tags": ner_ids}

In [75]:
from datasets import Dataset, DatasetDict, Features, Sequence, Value, ClassLabel

processed = {}
for split in dataset.keys():  # train test and val
    rows = [main_doc_to_words_tags(doc) for doc in dataset[split]]
    ds_split = Dataset.from_list(rows)
    feats = Features({
        "words": Sequence(Value("string")),
        "ner_tags": Sequence(ClassLabel(names=label_lst))
    })
    processed[split] = ds_split.cast(feats)

token_level_ds = DatasetDict(processed)
print(token_level_ds)

DatasetDict({
    train: Dataset({
        features: ['words', 'ner_tags'],
        num_rows: 500
    })
    test: Dataset({
        features: ['words', 'ner_tags'],
        num_rows: 500
    })
    validation: Dataset({
        features: ['words', 'ner_tags'],
        num_rows: 500
    })
})


Show an example from the BIO-tagged dataset (first 20 words of document 200):

In [76]:
words = token_level_ds["train"][200]["words"][:20]
labels = token_level_ds["train"][200]["ner_tags"][:20]
line1 = ""
line2 = ""
for word, label in zip(words, labels):
    full_label = label_lst[label]
    max_length = max(len(word), len(full_label))
    line1 += word + " " * (max_length - len(word) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)

print(line1)
print(line2)

Intra - arterial BCNU       chemotherapy for treatment of malignant gliomas   of the central nervous system . Because of the rapid 
O     O O        B-Chemical O            O   O         O  B-Disease I-Disease O  O   O       O       O      O O       O  O   O     


### Apply Hugging Face tokenizer

In [50]:
from transformers import AutoTokenizer

# load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

In [80]:
def tokenize_and_align_labels(example):
    # tokenize words
    tokenized = tokenizer(example["words"], is_split_into_words=True, truncation=True)
    word_ids = tokenized.word_ids()

    aligned_labels = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            # special tokens (CLS, SEP, PAD)
            aligned_labels.append(-100)
        elif word_id != previous_word_id:
            # first subword → take label
            aligned_labels.append(example["ner_tags"][word_id])
        else:
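            # continuation subword of the same word as the previous token
            label = example["ner_tags"][word_id]
            # B-XXX ids are odd in label_lst and the matching I-XXX is the next id,
            # so a B-XXX label becomes I-XXX on continuation subwords
            if label % 2 == 1:
                label += 1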
            aligned_labels.append(label)
        previous_word_id = word_id

    tokenized["labels"] = aligned_labels
    return tokenized
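
The alignment logic can be traced by hand without loading a tokenizer. In the sketch below, the `word_ids` list is written out manually to mimic what the tokenizer returns for the first four words of document 200 (an assumption for illustration; real values come from `tokenized.word_ids()`), and `align` is a hypothetical standalone copy of the loop above.

```python
label_lst = ["O", "B-Chemical", "I-Chemical", "B-Disease", "I-Disease"]

def align(word_ids, word_labels):
    """Copy word-level label ids onto subword positions."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)                  # special token: ignored by the loss
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first subword keeps the word label
        else:
            label = word_labels[word_id]
            # odd ids are B- tags in label_lst; the matching I- tag is the next id
            aligned.append(label + 1 if label % 2 == 1 else label)
        previous = word_id
    return aligned

# words: "Intra", "-", "arterial", "BCNU"
# subwords: [CLS] in ##tra - art ##erial b ##c ##nu [SEP]
word_ids = [None, 0, 0, 1, 2, 2, 3, 3, 3, None]   # hand-written, see note above
word_labels = [0, 0, 0, 1]                        # O, O, O, B-Chemical

aligned = align(word_ids, word_labels)
print([label_lst[i] if i != -100 else "IGN" for i in aligned])
```

Only the first subword of `BCNU` keeps `B-Chemical`; the remaining two become `I-Chemical`, and the two special tokens receive `-100`.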

In [81]:
tokenized_datasets = token_level_ds.map(
    tokenize_and_align_labels,
    batched=False,
    remove_columns=token_level_ds["train"].column_names
)
print(tokenized_datasets["train"][0].keys())

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])


In [82]:
tokenized_datasets["train"][200] 

{'input_ids': [101,
  1107,
  4487,
  118,
  1893,
  19860,
  171,
  1665,
  14787,
  22572,
  5521,
  20939,
  1111,
  3252,
  1104,
  12477,
  2646,
  15454,
  176,
  9436,
  7941,
  1104,
  1103,
  2129,
  5604,
  1449,
  119,
  1272,
  1104,
  1103,
  6099,
  27410,
  16443,
  1104,
  171,
  1665,
  14787,
  113,
  117,
  118,
  16516,
  1116,
  118,
  113,
  118,
  22572,
  10885,
  7745,
  15644,
  1233,
  114,
  118,
  118,
  11437,
  8005,
  7301,
  3313,
  1161,
  114,
  117,
  1107,
  4487,
  118,
  1893,
  19860,
  3469,
  1431,
  2194,
  170,
  6432,
  4316,
  1166,
  1107,
  4487,
  7912,
  2285,
  3469,
  1111,
  1103,
  3252,
  1104,
  12477,
  2646,
  15454,
  176,
  9436,
  7941,
  119,
  3961,
  118,
  1565,
  4420,
  1127,
  5165,
  1114,
  171,
  1665,
  14787,
  1451,
  1106,
  2277,
  117,
  1719,
  1118,
  14715,
  8124,
  26271,
  1348,
  5855,
  4638,
  2083,
  2734,
  1104,
  1103,
  4422,
  1610,
  3329,
  2386,
  1137,
  1396,
  22460,
  6766,
  1233,
  1859

In [83]:
tokens = tokenizer.convert_ids_to_tokens(tokenized_datasets["train"][200]["input_ids"])
labels = tokenized_datasets["train"][200]["labels"]

for t, l in zip(tokens[:30], labels[:30]):
    label_name = label_lst[l] if l != -100 else "IGN"
    print(f"{t:15} {label_name}")


[CLS]           IGN
in              O
##tra           O
-               O
art             O
##erial         O
b               B-Chemical
##c             I-Chemical
##nu            I-Chemical
ch              O
##em            O
##otherapy      O
for             O
treatment       O
of              O
ma              B-Disease
##li            I-Disease
##gnant         I-Disease
g               I-Disease
##lio           I-Disease
##mas           I-Disease
of              O
the             O
central         O
nervous         O
system          O
.               O
because         O
of              O
the             O


### Fine-tuning the model