In [1]:
import pandas as pd
import numpy as np
pd.set_option('display.max_colwidth', 200)

df = pd.read_parquet("hf://datasets/argilla/medical-domain/data/train-00000-of-00001-67e4e7207342a623.parquet")

def extract_label(pred):
    if isinstance(pred, (list, np.ndarray)) and len(pred) > 0 and isinstance(pred[0], dict):
        return pred[0].get("label")
    return None

df['label'] = df['prediction'].apply(extract_label)
df['text_length'] = df['metrics'].apply(lambda x: x.get('text_length') if isinstance(x, dict) else None)

# drop empty columns, plus 'prediction' and 'metrics', which were unpacked into 'label' and 'text_length' above
df = df.drop(columns=['inputs', 'prediction', 'prediction_agent', 'annotation', 'annotation_agent', 'multi_label', 'explanation', 'metadata', 'status', 'event_timestamp', 'metrics'], errors='ignore')

# 1. Investigate which NER types appear (manual inspection)

In [2]:
# df['text'].sample(20).tolist()

In a manual inspection of 20 randomly sampled clinical notes, the following types of named entities appeared frequently and consistently:

Core Medical Entity Types
1.	DISEASE / CONDITION: "fracture", "polycythemia vera", "pneumonia", "multiple sclerosis", "otitis media"

2.	PROCEDURE / SURGERY: "colonoscopy", "laparoscopy", "arthroscopy", "right middle lobectomy", "heart catheterization"
3.	ANATOMY / BODY PART: "radius and ulna", "left shin", "rotator cuff", "middle lobe", "cervical spine"
4.	MEDICATION: "methadone", "aspirin prophylaxis", "prednisone", "amoxicillin", "Zithromax"
5.	LAB VALUE / MEASUREMENT: "CBC 41,900", "CRP 6.7", "BP 144/85", "weight 61.8 kg", "temperature 99.5°F"
6.	SYMPTOM / FINDING: "pain", "swelling", "wheezing", "fatigue", "rash", "tenderness"

Conclusion:
The dataset is rich in medical terminology, with DISEASE, PROCEDURE, ANATOMY, MEDICATION, LAB_VALUE, and SYMPTOM being the most prominent NER categories. These will be used to define the custom medical NER schema in the next steps.
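
As a minimal sketch, the six categories can be written down as a mapping from label to the example strings listed above. The names `MEDICAL_NER_SCHEMA` and `labels_seen` are hypothetical and not part of the original notebook, and the substring check is only a crude spot-check (for instance, "pain" would also match "painting"), not an entity recognizer.

```python
# Hypothetical schema: label -> example surface forms taken from the manual inspection.
MEDICAL_NER_SCHEMA = {
    "DISEASE": ["fracture", "polycythemia vera", "pneumonia", "multiple sclerosis", "otitis media"],
    "PROCEDURE": ["colonoscopy", "laparoscopy", "arthroscopy", "right middle lobectomy", "heart catheterization"],
    "ANATOMY": ["radius and ulna", "left shin", "rotator cuff", "middle lobe", "cervical spine"],
    "MEDICATION": ["methadone", "aspirin prophylaxis", "prednisone", "amoxicillin", "Zithromax"],
    "LAB_VALUE": ["CBC 41,900", "CRP 6.7", "BP 144/85", "weight 61.8 kg", "temperature 99.5°F"],
    "SYMPTOM": ["pain", "swelling", "wheezing", "fatigue", "rash", "tenderness"],
}

# Label names in schema order, e.g. for configuring a custom NER component later.
LABELS = list(MEDICAL_NER_SCHEMA)


def labels_seen(text, schema=MEDICAL_NER_SCHEMA):
    """Return the schema labels whose example terms occur in `text` (case-insensitive substring match)."""
    lowered = text.lower()
    return sorted(
        label
        for label, terms in schema.items()
        if any(term.lower() in lowered for term in terms)
    )
```

Calling `labels_seen(note)` on a few sampled notes gives a quick sense of which categories a note touches before any annotation is done.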

# 2. Apply spaCy’s standard NER classifier

In [3]:
from tqdm import tqdm
import spacy
import subprocess
import sys
import multiprocessing

model_name = "en_core_web_md"
try:
    nlp = spacy.load(model_name)
except OSError:
    subprocess.check_call([sys.executable, "-m", "spacy", "download", model_name])
    nlp = spacy.load(model_name)

texts = df['text'].tolist()

ents_list = []
for doc in tqdm(nlp.pipe(texts, batch_size=32, n_process=multiprocessing.cpu_count()), total=len(texts)):
    ents_list.append([(ent.text, ent.label_) for ent in doc.ents])

df['spacy_ents'] = ents_list

100%|██████████| 4966/4966 [01:43<00:00, 48.09it/s] 


In [4]:
df[['text', 'spacy_ents']].head()

Unnamed: 0,text,spacy_ents
0,"PREOPERATIVE DIAGNOSIS:, Iron deficiency anemia.,POSTOPERATIVE DIAGNOSIS:, Diverticulosis.,PROCEDURE:, Colonoscopy.,MEDICATIONS: , MAC.,PROCEDURE: , The Olympus pediatric variable colonoscope w...","[(Iron, ORG), (Diverticulosis, PERSON), (Colonoscopy, ORG), (MAC.,PROCEDURE, GPE), (Olympus, ORG), (retroflex, NORP), (Diverticulosis, PERSON), (2 years, DATE)]"
1,"CLINICAL INDICATION: ,Normal stress test.,PROCEDURES PERFORMED:,1. Left heart cath.,2. Selective coronary angiography.,3. LV gram.,4. Right femoral arteriogram.,5. Mynx closure device.,PROCE...","[(LV gram, PERSON), (Mynx, ORG), (2%, PERCENT), (6-French, QUANTITY), (6-French JL4, MONEY), (6-French 3DRC, QUANTITY), (6-French, QUANTITY), (Post LV gram, FAC), (Mynx, ORG), (LVEDP, ORG), (9, DA..."
2,"FINDINGS:,Axial scans were performed from L1 to S2 and reformatted images were obtained in the sagittal and coronal planes.,Preliminary scout film demonstrates anterior end plate spondylosis at T1...","[(L1 to S2, FAC), (T11-12, ORG), (T12-L1.,L1-2, ORG), (4.6mm, QUANTITY), (AP, ORG), (#25).,L4-5, MONEY)]"
3,"PREOPERATIVE DIAGNOSIS: , Blood loss anemia.,POSTOPERATIVE DIAGNOSES:,1. Diverticulosis coli.,2. Internal hemorrhoids.,3. Poor prep.,PROCEDURE PERFORMED:, Colonoscopy with photos.,ANESTHESIA: ...","[(DIAGNOSES:,1, ORG), (Diverticulosis, PERSON), (Conscious, ORG), (Anesthesia, PERSON), (85-year-old, DATE), (EGD, ORG), (the Endoscopy Suite, ORG), (the Anesthesia Department, ORG)]"
4,"REASON FOR VISIT: ,Elevated PSA with nocturia and occasional daytime frequency.,HISTORY: , A 68-year-old male with a history of frequency and some outlet obstructive issues along with irritative ...","[(nocturia, ORG), (68-year-old, DATE), (PSA, ORG), (PSA, ORG), (2004, DATE), (5.5, DATE), (2003, DATE), (Dr. X, PERSON), (1.6, CARDINAL), (Proscar, PERSON), (Proscar, PERSON), (greater than five y..."


In [4]:
# compute entity frequencies
from collections import Counter

ent_counter = Counter()
for ents in df['spacy_ents']:
    for _, label in ents:
        ent_counter[label] += 1

ent_counter.most_common()

[('ORG', 31211),
 ('CARDINAL', 28793),
 ('DATE', 19170),
 ('PERSON', 15137),
 ('QUANTITY', 10040),
 ('GPE', 4631),
 ('TIME', 4325),
 ('PRODUCT', 3739),
 ('ORDINAL', 3462),
 ('PERCENT', 3192),
 ('NORP', 2729),
 ('MONEY', 2283),
 ('LOC', 595),
 ('FAC', 540),
 ('LAW', 371),
 ('EVENT', 308),
 ('WORK_OF_ART', 194),
 ('LANGUAGE', 84)]

# 3. Evaluate spaCy NER (automatic + manual)

In [None]:
sample_df = df.sample(100, random_state=42)
sample_df[['text', 'spacy_ents']].head()

Unnamed: 0,text,spacy_ents
3138,"REASON FOR CONSULTATION: , Thyroid mass diagnosed as papillary carcinoma.,HISTORY OF PRESENT ILLNESS: ,The patient is a 16-year-old young lady, who was referred from the Pediatric Endocrinology D...","[(16-year-old, DATE), (the Pediatric Endocrinology Department, ORG), (first, ORDINAL), (about 2004, DATE), (the Pediatric Endocrinology Department, ORG), (zero, CARDINAL), (Tijuana, GPE), (Mexico,..."
1964,"PREOPERATIVE DIAGNOSIS:, Prior history of neoplastic polyps.,POSTOPERATIVE DIAGNOSIS:, Small rectal polyps/removed and fulgurated.,PREMEDICATIONS:, Prior to the colonoscopy, the patient complai...","[(25 mg, QUANTITY), (Demerol, ORG), (the IV Demerol, ORG), (25 mg, QUANTITY), (Phenergan IV, GPE), (7.5 mg, QUANTITY), (Digital, ORG), (P160, PRODUCT), (30 cm, QUANTITY), (five, CARDINAL), (One, C..."
1344,"PROCEDURE PERFORMED: , Esophagogastroduodenoscopy performed in the emergency department.,INDICATION: , Melena, acute upper GI bleed, anemia, and history of cirrhosis and varices.,FINAL IMPRESSION,...","[(Esophagogastroduodenoscopy, ORG), (Melena, PERSON), (GI, ORG), (IMPRESSION,1, ORG), (Repeat EGD, PERSON), (tomorrow, DATE), (morning, TIME), (ICU, ORG), (100, CARDINAL), (EGD, ORG), (An addition..."
2984,"HISTORY OF PRESENT ILLNESS: , The patient is a 35-year-old woman who reports that on the 30th of October 2008, she had a rupture of her membranes at nine months of pregnancy, and was admitted to h...","[(35-year-old, DATE), (the 30th of October 2008, DATE), (nine months, DATE), (approximately 14 to 18 hours, DATE), (the 31st of October, DATE), (Foley, PERSON), (the 1st of November 2008, DATE), (..."
4910,"PREOPERATIVE DIAGNOSIS: ,Carcinoma of the left upper lobe.,PROCEDURES PERFORMED:,1. Bronchoscopy with aspiration.,2. Left upper lobectomy.,PROCEDURE DETAILS: ,With patient in supine position u...","[(Bronchoscopy, ORG), (Foley, PERSON), (Betadine, NORP), (Hemostasis, PERSON), (sixth, ORDINAL), (sixth, ORDINAL), (3 cm, QUANTITY), (#00, MONEY), (Potts, PERSON), (Direction, FAC), (000, MONEY), ..."


### Manual Evaluation of spaCy NER (100 Entities)

We sampled 100 random entities from the model output and judged whether each predicted label is correct in the medical context. The table below lists 43 of the evaluated entities.
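
One way such an entity-level sample could be drawn is sketched here. The helper `sample_entities` is hypothetical (the notebook does not show how the entities were picked); it assumes the input has the same shape as `df['spacy_ents']`, that is, one list of `(text, label)` tuples per document.

```python
import random


def sample_entities(ents_per_doc, k=100, seed=42):
    """Flatten per-document entity lists and draw up to k entities uniformly at random.

    Each returned item is (document position, entity text, predicted label).
    """
    flat = [
        (doc_idx, text, label)
        for doc_idx, ents in enumerate(ents_per_doc)
        for text, label in ents
    ]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(flat, min(k, len(flat)))
```

With the notebook's data this would be called as `sample_entities(df['spacy_ents'], k=100)`. Sampling over the flattened list weights every entity equally, so long notes with many entities contribute more items than short ones.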

| Entity             | spaCy Label | Correct?     | Comment                                                      |
|--------------------|-------------|--------------|--------------------------------------------------------------|
| Iron               | ORG         | ❌ Incorrect | Should be DISEASE / LAB_VALUE, not an organization          |
| Diverticulosis     | PERSON      | ❌ Incorrect | A disease misclassified as a person                         |
| Colonoscopy        | ORG         | ❌ Incorrect | A procedure, not an organization                            |
| MAC.,PROCEDURE     | GPE         | ❌ Incorrect | Span merges anesthesia type "MAC" with the next header; not a place |
| Olympus            | ORG         | ✔️ Correct-ish | Device manufacturer — close enough                       |
| 2 years            | DATE        | ✔️ Correct   | Correct temporal expression                                  |
| LV gram            | PERSON      | ❌ Incorrect | Medical procedure (angiography), not a person               |
| Mynx               | ORG         | ✔️ Correct   | Brand name of closure device — ORG is fine                  |
| 7.5 mg             | QUANTITY    | ✔️ Correct   | Correct numeric quantity                                     |
| P160               | PRODUCT     | ✔️ Correct-ish | Likely device code; okay                                 |
| Post LV gram       | FAC         | ❌ Incorrect | Facility? No — medical procedure                            |
| 9                  | DATE        | ❌ Incorrect | Cardinal number, not a date                                 |
| L1 to S2           | FAC         | ❌ Incorrect | Anatomy, not a facility                                      |
| T11–12             | ORG         | ❌ Incorrect | Anatomy level                                                |
| 4.6 mm             | QUANTITY    | ✔️ Correct   | Measured lesion size                                         |
| AP                 | ORG         | ❌ Incorrect | Should be imaging orientation "anterior-posterior"          |
| %25                | PERCENT     | ✔️ Correct   | Correct percentage                                           |
| Diverticulitis     | PERSON      | ❌ Incorrect | Disease misclassified                                        |
| 68-year-old        | DATE        | ❌ Incorrect | Age, not a date                                              |
| PSA                | ORG         | ❌ Incorrect | Laboratory test ("PSA"), not organization                   |
| 5.5                | DATE        | ❌ Incorrect | Quantity, not date                                           |
| Dr. X              | PERSON      | ✔️ Correct   | Correct doctor name                                          |
| Proscar            | PERSON      | ❌ Incorrect | Medication misclassified as a person                         |
| 300 cc             | QUANTITY    | ✔️ Correct   | Measurement                                                   |
| 2003               | DATE        | ✔️ Correct   | Year                                                          |
| mid-shaft          | LOC         | ❌ Incorrect | Anatomy location, not generic location                       |
| Methadone          | ORG         | ❌ Incorrect | Medication misclassified                                     |
| C7                 | ORG         | ❌ Incorrect | Cervical vertebra (anatomy)                                  |
| 3 cm               | QUANTITY    | ✔️ Correct   | Correct                                                       |
| 4x4s               | PRODUCT     | ✔️ Correct-ish | Medical sponge size                                        |
| 80%                | PERCENT     | ✔️ Correct   |                                                              |
| 41,900             | QUANTITY    | ✔️ Correct   | Lab value                                                     |
| 56.7               | QUANTITY    | ✔️ Correct   |                                                              |
| 235,000            | QUANTITY    | ✔️ Correct   |                                                              |
| 61.8 kg            | QUANTITY    | ✔️ Correct   |                                                              |
| L5                 | ORG         | ❌ Incorrect | Spinal anatomy                                               |
| C5-6               | ORG         | ❌ Incorrect | Spinal anatomy                                               |
| 10 days            | DATE        | ✔️ Correct   |                                                              |
| 32-French          | PRODUCT     | ✔️ Correct-ish | Catheter size                                             |
| 7                  | CARDINAL    | ✔️ Correct   | Number                                                       |
| Mediastinal        | ORG         | ❌ Incorrect | Anatomy/anatomical region                                    |
| right middle lobe  | ORG         | ❌ Incorrect | Anatomy                                                      |
| BACITRACIN         | PERSON      | ❌ Incorrect | Medication misclassified                                     |

- Diseases are frequently mislabeled as PERSON (e.g., "Diverticulosis", "Diverticulitis")
- Procedures are mislabeled as ORG, PERSON, or FAC ("Colonoscopy", "LV gram", "Post LV gram")
- Anatomy is mislabeled as ORG, FAC, or LOC ("C5-6", "L1 to S2", "right forearm")
- Medications are mislabeled as ORG or even PERSON ("Methadone", "Bacitracin")
- Device and manufacturer names are sometimes reasonably labeled as ORG or PRODUCT ("Olympus", "P160")
- Measurements are labeled correctly most of the time (QUANTITY, PERCENT)
- Genuine dates and durations are identified correctly ("2 years", "2003", "10 days"), but ages and bare numbers are sometimes mislabeled as DATE ("68-year-old", "9", "5.5")

---
The spaCy general-purpose NER model performs poorly on medical entities. Its label set contains no medical types, so every medical term it detects is forced into a generic category.
Most errors fall into the following categories:
1.	Anatomy mislabeled as ORG, FAC, or LOC
2.	Diseases mislabeled as PERSON
3.	Procedures mislabeled as ORG
4.	Medications mislabeled as ORG or PERSON

By contrast, lab values (QUANTITY), dates, and measurements are generally labeled correctly.

Overall, the manual accuracy on the 100-entity sample is approximately:
- Correct: ≈ 25–30%
- Incorrect: ≈ 70–75%

This shows that spaCy’s pre-trained NER is not suitable for medical text and motivates fine-tuning a custom medical NER model.
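
An accuracy figure like the one above can be derived from the manual judgments with a short tally. The sketch below is illustrative only: the function names are hypothetical, and `judgments` contains just a handful of rows copied from the table, so its result is not the estimate reported for the full sample.

```python
from collections import defaultdict

# (entity text, spaCy label, judged correct?) for a few rows of the evaluation table.
judgments = [
    ("Diverticulosis", "PERSON", False),
    ("Colonoscopy", "ORG", False),
    ("2 years", "DATE", True),
    ("7.5 mg", "QUANTITY", True),
    ("L1 to S2", "FAC", False),
    ("PSA", "ORG", False),
    ("Dr. X", "PERSON", True),
    ("80%", "PERCENT", True),
]


def overall_accuracy(rows):
    """Fraction of judged entities whose predicted label was marked correct."""
    return sum(ok for _, _, ok in rows) / len(rows)


def accuracy_by_label(rows):
    """Per-label accuracy: correct judgments divided by all judgments for that label."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for _, label, ok in rows:
        totals[label] += 1
        correct[label] += int(ok)
    return {label: correct[label] / totals[label] for label in totals}
```

Breaking accuracy down by label makes the pattern in the table explicit: generic numeric and temporal labels tend to score well, while the labels that absorb medical terms (ORG, PERSON, FAC) score poorly.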

# 4. Extend NER with custom entity types (NER Annotator)

In [None]:
sample_texts = df['text'].sample(40, random_state=42)
# sample_texts.to_csv("to_annotate.txt", index=False)

In [None]:
import sys
sys.path.insert(0, '..')
# from src.ner import load_annotated_json, train_custom_ner

# db = load_annotated_json("annotations.json")
# db.to_disk("train.spacy")

# labels = ["DISEASE", "MEDICATION", "SYMPTOM", "PROCEDURE", "ANATOMY", "LAB_VALUE"]

# nlp_custom = train_custom_ner("train.spacy", None, labels)
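
The helpers in `src.ner` are not shown in this notebook. As a rough sketch of what the conversion step might look like, the function below turns annotations into a spaCy `DocBin`. It is not the author's implementation: `annotations_to_docbin` is a hypothetical name, and the input layout (a dict whose `"annotations"` key holds `[text, {"entities": [[start, end, label], ...]}]` items) is an assumption about the NER Annotator export format.

```python
import spacy
from spacy.tokens import DocBin
from spacy.util import filter_spans


def annotations_to_docbin(data, lang="en"):
    """Convert NER Annotator-style annotations (assumed layout) into a spaCy DocBin.

    With alignment_mode="contract", a character span that does not match token
    boundaries is shrunk to the tokens fully inside it, or skipped if there are none.
    """
    nlp = spacy.blank(lang)  # tokenizer only; no trained pipeline is required
    db = DocBin()
    for item in data.get("annotations", []):
        if not item:  # tolerate empty/None entries
            continue
        text, meta = item
        doc = nlp.make_doc(text)
        spans = []
        for start, end, label in meta.get("entities", []):
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is not None:
                spans.append(span)
        doc.ents = filter_spans(spans)  # remove overlapping spans, preferring longer ones
        db.add(doc)
    return db
```

After `json.load`-ing the exported file, the result could be saved with `db.to_disk("train.spacy")`, matching the commented-out call above.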