# Clinical NER Transformer

In this notebook I use the pretrained Bio_ClinicalBERT encoder (a transformer) to extract diagnoses and medications (NER and classification) from medical notes.

### Main Objectives

1. Generate or Load Data

- Option A: Synthetic EMRs (fast to iterate; no licensing).
- Option B: Open datasets (stronger realism): BC5CDR (Diseases/Chemicals): https://huggingface.co/datasets/biocreative_cdr

2. BIO Tag the Data

- Label scheme: B-ENTITY, I-ENTITY, O (no entity).

3. Tokenize

- Use the same tokenizer as the model (handles subwords + offset mapping).

4. Load Pretrained Model

- Bio_ClinicalBERT: emilyalsentzer/Bio_ClinicalBERT

5. Fine-Tune

- Objective: add token classification head (NER).
- Loss: cross-entropy over token labels (ignore specials with -100).

6. Classify Diagnoses & Meds

- Postprocess logits into entity spans.
- Aggregate subwords back to words; merge B-/I- runs.

7. Evaluate

- Metrics: precision / recall / F1 (seqeval).
- Inspect errors (boundary splits, abbreviations, synonyms).
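
Steps 5 and 6 rest on two small bookkeeping routines: masking subword positions with -100 during fine-tuning, and merging B-/I- runs back into entity spans afterwards. The block below is a minimal sketch in plain Python, not the notebook's final code. It assumes `word_ids` has the shape of the list returned by a fast Hugging Face tokenizer's `word_ids()` (one word index per subword, `None` for special tokens). The subword split in the example and the helper names `align_labels` and `merge_bio` are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def align_labels(word_ids, word_labels, label2id):
    """Give each word's label to its first subword; mask everything else."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:            # special token such as [CLS] or [SEP]
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:      # first subword of a new word
            aligned.append(label2id[word_labels[word_id]])
        else:                          # continuation subword of the same word
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


def merge_bio(words, tags):
    """Merge word-level B-/I- runs into (text, entity_type) spans."""
    spans, current_words, current_type = [], [], None
    for word, tag in zip(words, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            if current_words:
                spans.append((" ".join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-"):
            current_words.append(word)
        else:  # "O" closes any open span
            if current_words:
                spans.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        spans.append((" ".join(current_words), current_type))
    return spans


words = ["BCNU", "for", "malignant", "gliomas"]
word_labels = ["B-Chemical", "O", "B-Disease", "I-Disease"]
label2id = {"O": 0, "B-Chemical": 1, "I-Chemical": 2, "B-Disease": 3, "I-Disease": 4}

# Hypothetical subword split: "BCNU" and "gliomas" each become two pieces.
word_ids = [None, 0, 0, 1, 2, 3, 3, None]

aligned = align_labels(word_ids, word_labels, label2id)
spans = merge_bio(words, word_labels)
print(aligned)
print(spans)
```

Labeling only the first subword keeps evaluation at the word level, which is the granularity seqeval expects. An alternative is to give continuation subwords the matching I- label; which works better here is an open choice.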

### install and import necessary modules

In [None]:
!conda install -y pytorch cpuonly -c pytorch

In [1]:
!pip install -U transformers seqeval evaluate bioc
!pip install "datasets<4.0.0" fsspec pyarrow

Collecting huggingface-hub<1.0,>=0.34.0 (from transformers)
  Using cached huggingface_hub-0.34.4-py3-none-any.whl.metadata (14 kB)
Using cached huggingface_hub-0.34.4-py3-none-any.whl (561 kB)
Installing collected packages: huggingface-hub
  Attempting uninstall: huggingface-hub
    Found existing installation: huggingface-hub 0.25.2
    Uninstalling huggingface-hub-0.25.2:
      Successfully uninstalled huggingface-hub-0.25.2
Successfully installed huggingface-hub-0.34.4


In [2]:
import transformers
from datasets import load_dataset

In [3]:
print(transformers.__version__)

4.56.0


### load dataset from hugging face datasets

The BioCreative V Chemical Disease Relation (CDR) dataset is a human-annotated corpus of chemicals, diseases, and chemical-disease interactions in 1,500 PubMed articles.

In [5]:
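# "bc5cdr_source" keeps the corpus's original schema; trust_remote_code lets the dataset's loading script run
dataset = load_dataset("bigbio/bc5cdr", "bc5cdr_source", trust_remote_code=True)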

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

In [8]:
# dataset content
dataset

DatasetDict({
    train: Dataset({
        features: ['passages'],
        num_rows: 500
    })
    test: Dataset({
        features: ['passages'],
        num_rows: 500
    })
    validation: Dataset({
        features: ['passages'],
        num_rows: 500
    })
})

### observe dataset features and structure


Each row in train / test / validation contains one document.
Inside it, the passages field is a list of two items:

- Title (type = "title")

- Abstract (type = "abstract")

Each passage has:

- text: the actual string

- entities: list of labeled entities (with offsets + text + type + normalized DB links)

- relations: relations between entities (e.g. chemical–disease links)
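
The nesting described above (document, then passages, then entities, then one or more offset spans) can be walked with a short generator. This is a sketch under assumptions: `iter_entity_mentions` is a hypothetical helper, and the toy `document` holds only the title passage of training record 200 with its relations left empty. I have not verified here whether offsets in the abstract passage are relative to the passage text or to the whole document, so check `passage["text"][start:end] == surface` before relying on them.

```python
def iter_entity_mentions(document):
    """Yield one flat record per entity mention in a source-schema document."""
    for passage in document["passages"]:
        for entity in passage["entities"]:
            # "offsets" and "text" are parallel lists: one item per contiguous span.
            for (start, end), surface in zip(entity["offsets"], entity["text"]):
                yield {
                    "passage_type": passage["type"],
                    "start": start,
                    "end": end,
                    "text": surface,
                    "label": entity["type"],
                }


title = (
    "Intra-arterial BCNU chemotherapy for treatment of malignant gliomas "
    "of the central nervous system."
)

# Toy document: title passage only, relations omitted for brevity.
document = {
    "passages": [
        {
            "document_id": "6747681",
            "type": "title",
            "text": title,
            "entities": [
                {"id": "0", "offsets": [[15, 19]], "text": ["BCNU"],
                 "type": "Chemical",
                 "normalized": [{"db_name": "MESH", "db_id": "D002330"}]},
                {"id": "1", "offsets": [[50, 67]], "text": ["malignant gliomas"],
                 "type": "Disease",
                 "normalized": [{"db_name": "MESH", "db_id": "D005910"}]},
            ],
            "relations": [],
        }
    ]
}

mentions = list(iter_entity_mentions(document))
for mention in mentions:
    # Sanity check: slicing the passage text should reproduce the surface form.
    matches = title[mention["start"]:mention["end"]] == mention["text"]
    print(mention["label"], repr(mention["text"]), "offsets match:", matches)
```

With the real data, the same generator can be called on any row, for example `dataset["train"][200]`.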

In [24]:
passage_feature = dataset["train"].features['passages']
passage_feature

[{'document_id': Value(dtype='string', id=None),
  'type': Value(dtype='string', id=None),
  'text': Value(dtype='string', id=None),
  'entities': [{'id': Value(dtype='string', id=None),
    'offsets': [[Value(dtype='int32', id=None)]],
    'text': [Value(dtype='string', id=None)],
    'type': Value(dtype='string', id=None),
    'normalized': [{'db_name': Value(dtype='string', id=None),
      'db_id': Value(dtype='string', id=None)}]}],
  'relations': [{'id': Value(dtype='string', id=None),
    'type': Value(dtype='string', id=None),
    'arg1_id': Value(dtype='string', id=None),
    'arg2_id': Value(dtype='string', id=None)}]}]
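
The `offsets` field in this schema is a list of `[start, end]` character pairs, which is the input BIO tagging (objective 2) needs. Below is a minimal sketch of that conversion, assuming whitespace tokenization, end-exclusive offsets, and non-overlapping entities. `bio_tags` is a hypothetical helper name, and the example sentence is the title of training record 200 with its two annotated spans.

```python
import re


def bio_tags(text, entities):
    """Tag whitespace tokens with BIO labels.

    entities: list of (start, end, entity_type) character spans, end exclusive.
    """
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = ["O"] * len(tokens)
    for start, end, entity_type in entities:
        first = True
        for i, (_, tok_start, tok_end) in enumerate(tokens):
            if tok_start < end and tok_end > start:  # token overlaps the span
                tags[i] = ("B-" if first else "I-") + entity_type
                first = False
    return [tok for tok, _, _ in tokens], tags


text = (
    "Intra-arterial BCNU chemotherapy for treatment of malignant gliomas "
    "of the central nervous system."
)
entities = [(15, 19, "Chemical"), (50, 67, "Disease")]

words, tags = bio_tags(text, entities)
for word, tag in zip(words, tags):
    print(f"{word}\t{tag}")
```

Whitespace splitting leaves punctuation attached to words (the last token here is `system.`), so an entity that ends right before a comma or period would pull that punctuation into its tagged token. A tokenizer that separates punctuation would avoid this.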

Example record from the dataset (index 200):

In [25]:
dataset["train"][200] 

{'passages': [{'document_id': '6747681',
   'type': 'title',
   'text': 'Intra-arterial BCNU chemotherapy for treatment of malignant gliomas of the central nervous system.',
   'entities': [{'id': '0',
     'offsets': [[15, 19]],
     'text': ['BCNU'],
     'type': 'Chemical',
     'normalized': [{'db_name': 'MESH', 'db_id': 'D002330'}]},
    {'id': '1',
     'offsets': [[50, 67]],
     'text': ['malignant gliomas'],
     'type': 'Disease',
     'normalized': [{'db_name': 'MESH', 'db_id': 'D005910'}]}],
   'relations': [{'id': 'R0',
     'type': 'CID',
     'arg1_id': 'D002330',
     'arg2_id': 'D031300'}]},
  {'document_id': '6747681',
   'type': 'abstract',
   'text': 'Because of the rapid systemic clearance of BCNU (1,3-bis-(2-chloroethyl)-1-nitrosourea), intra-arterial administration should provide a substantial advantage over intravenous administration for the treatment of malignant gliomas. Thirty-six patients were treated with BCNU every 6 to 8 weeks, either by transfemoral cath