## Clinical NLP Mini-Project: Structuring Unstructured Paediatric Records

This project demonstrates a compact natural language processing (NLP) pipeline for transforming unstructured clinical text into structured, AI-ready data. The focus is on Clinical Named Entity Recognition (CNER) as a foundational step in clinical data digitisation.

The notebook presents a proof-of-concept workflow that prioritises clarity, reproducibility, and methodological correctness rather than model training or large-scale evaluation.

## Workflow Overview

1. Load a public, de-identified clinical text dataset.
2. Apply minimal preprocessing to preserve clinical meaning.
3. Perform Clinical Named Entity Recognition using a pre-trained transformer model.
4. Post-process and structure extracted entities into tabular form.
5. Demonstrate a short human-in-the-loop review step.
6. Discuss ethical considerations, limitations, and future extensions.

In [1]:
# Core libraries
import pandas as pd
import numpy as np

# NLP / Transformer utilities
from transformers import pipeline

print("Environment initialised and libraries loaded")




Environment initialised and libraries loaded


In [2]:
# Load the MTSamples dataset (downloaded from Kaggle)
df = pd.read_csv("/content/mtsamples.csv")

# Drop records without clinical text
df = df.dropna(subset=["transcription"]).reset_index(drop=True)

# Inspect the dataset
df.head()


Unnamed: 0.1,Unnamed: 0,description,medical_specialty,sample_name,transcription,keywords
0,0,A 23-year-old white female presents with comp...,Allergy / Immunology,Allergic Rhinitis,"SUBJECTIVE:, This 23-year-old white female pr...","allergy / immunology, allergic rhinitis, aller..."
1,1,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 2,"PAST MEDICAL HISTORY:, He has difficulty climb...","bariatrics, laparoscopic gastric bypass, weigh..."
2,2,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 1,"HISTORY OF PRESENT ILLNESS: , I have seen ABC ...","bariatrics, laparoscopic gastric bypass, heart..."
3,3,2-D M-Mode. Doppler.,Cardiovascular / Pulmonary,2-D Echocardiogram - 1,"2-D M-MODE: , ,1. Left atrial enlargement wit...","cardiovascular / pulmonary, 2-d m-mode, dopple..."
4,4,2-D Echocardiogram,Cardiovascular / Pulmonary,2-D Echocardiogram - 2,1. The left ventricular cavity size and wall ...,"cardiovascular / pulmonary, 2-d, doppler, echo..."


In [3]:
# filter for paediatric-related notes
df_peds = df[df["medical_specialty"].str.contains("Pediatrics", na=False)].reset_index(drop=True)

print(f"Total notes: {len(df)}")
print(f"Paediatric-related notes: {len(df_peds)}")


Total notes: 4966
Paediatric-related notes: 70


## Minimal Preprocessing

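Clinical text contains important cues such as negation (e.g., “no fever”), abbreviations, and shorthand. Aggressive cleaning (lowercasing, stop-word removal, punctuation stripping) can destroy these cues, so this step applies only one light, safe operation:
- standardise whitespace (line endings, repeated blank lines, runs of spaces or tabs)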


In [4]:
import re

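def clean_whitespace(text: str) -> str:
    """Normalise line endings and collapse repeated whitespace; words and punctuation are left untouched."""
    text = str(text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # normalise line endings (CRLF first, so it is not doubled)
    text = re.sub(r"\n{2,}", "\n\n", text)      # collapse runs of blank lines into one
    text = re.sub(r"[ \t]+", " ", text)         # collapse runs of spaces or tabs
    return text.strip()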

df_peds["text_raw"] = df_peds["transcription"].astype(str)
df_peds["text"] = df_peds["text_raw"].apply(clean_whitespace)

df_peds[["sample_name", "medical_specialty", "text"]].head(3)

Unnamed: 0,sample_name,medical_specialty,text
0,Well-Child Check - 7,Pediatrics - Neonatal,"SUBJECTIVE:, This is a 1-month-old who comes i..."
1,Well-Child Check - 6,Pediatrics - Neonatal,"SUBJECTIVE:, Patient presents with Mom and Dad..."
2,Well-Child Check - 4,Pediatrics - Neonatal,"SUBJECTIVE:, The patient presents with Mom and..."
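
A quick check of the whitespace-only policy, as a minimal sketch. The function restates `clean_whitespace` with one assumed tweak (CRLF line endings are normalised before the collapse), and the sample sentence is invented for illustration. It shows that negation words, punctuation, and casing pass through unchanged.

```python
import re

def clean_whitespace(text: str) -> str:
    # Assumed variant of the notebook function: CRLF is normalised first.
    text = str(text).replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"\n{2,}", "\n\n", text)   # collapse runs of blank lines
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces or tabs
    return text.strip()

# Invented sample with messy spacing, a tab, and Windows line endings.
raw = "  SUBJECTIVE:,  No fever.\t Denies   cough.\r\n\r\n\r\nPLAN:  Tylenol p.r.n. "
cleaned = clean_whitespace(raw)
print(repr(cleaned))
```

Because the NER step runs on the cleaned `text` column, the `start` and `end` offsets it returns index into `text`, not `text_raw`.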


## Clinical Named Entity Recognition (CNER)

The goal is to convert unstructured clinical text into structured signals by extracting key clinical concepts. A pre-trained clinical transformer encoder (DistilBERT-based, trained on the i2b2 2010 data) performs token-level named entity recognition and assigns each mention to one of three categories:
- PROBLEM (conditions, symptoms)
- TEST (labs, imaging, examinations)
- TREATMENT (medications, procedures)





In [5]:
ner = pipeline(
    "token-classification",
    model="nlpie/clinical-distilbert-i2b2-2010",
    aggregation_strategy="max"
)

# Quick test
sample_text = df_peds.loc[0, "text"]
ents = ner(sample_text)

ents[:10], len(ents)



Device set to use cpu


([{'entity_group': 'problem',
   'score': np.float32(0.99602365),
   'word': 'sore throat',
   'start': 160,
   'end': 171},
  {'entity_group': 'problem',
   'score': np.float32(0.99769044),
   'word': 'fevers',
   'start': 179,
   'end': 185},
  {'entity_group': 'treatment',
   'score': np.float32(0.9412074),
   'word': 'Enfamil Lipil',
   'start': 440,
   'end': 453},
  {'entity_group': 'problem',
   'score': np.float32(0.989988),
   'word': 'acute distress',
   'start': 710,
   'end': 724},
  {'entity_group': 'problem',
   'score': np.float32(0.9922184),
   'word': 'rash',
   'start': 748,
   'end': 752},
  {'entity_group': 'problem',
   'score': np.float32(0.99212694),
   'word': 'lesion',
   'start': 756,
   'end': 762},
  {'entity_group': 'problem',
   'score': np.float32(0.6218898),
   'word': 'masses',
   'start': 1207,
   'end': 1213}],
 7)
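
The entities above carry a label and a score but no assertion status, so a negated mention (for example "no rash") would still be listed as a `problem`. One possible way to flag such cases is a small rule-based check on the text just before each entity's `start` offset. The sketch below is only an illustration: `NEGATION_CUES`, `is_negated`, the 40-character window, and the sample sentence are all hypothetical and are not part of the notebook or the model.

```python
import re

# Hypothetical, deliberately small cue list; a real system needs a curated lexicon.
NEGATION_CUES = ("no", "not", "denies", "without", "negative for")
CUE_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, NEGATION_CUES)) + r")\b", re.IGNORECASE
)

def is_negated(text: str, start: int, window: int = 40) -> bool:
    """True if a negation cue appears in the same sentence, within `window` characters before the entity."""
    left = text[max(0, start - window):start]
    # Keep only the current sentence fragment so cues do not leak across sentences.
    left = re.split(r"[.;:\n]", left)[-1]
    return bool(CUE_PATTERN.search(left))

# Invented sentence; offsets are computed the same way the pipeline reports them.
text = "He has had a sore throat. No rash or lesion. Denies fevers."
entities = [
    {"word": word, "start": text.index(word)}
    for word in ("sore throat", "rash", "lesion", "fevers")
]
for e in entities:
    e["negated"] = is_negated(text, e["start"])
    print(e["word"], e["negated"])
```

A fixed look-back window is a crude heuristic: it misses cues that follow the entity and can over-apply a cue across a long list, so its flags are best treated as hints for the reviewer rather than as final assertion status.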

In [6]:
# Map the model's lowercase entity groups to the upper-case labels used in the output tables
LABEL_MAP = {"problem": "PROBLEM", "test": "TEST", "treatment": "TREATMENT"}


## Structuring Extracted Clinical Entities

The output of the CNER model consists of entity mentions identified within free-text clinical notes. To enable downstream analysis and reuse, these outputs are converted into structured tabular formats.

Two representations are created:
- a long format table, where each row corresponds to a single extracted entity, and
- a wide format table, where entities are grouped by note and clinical category.
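
To make the two shapes concrete, here is a toy sketch in plain Python. The records are invented for illustration and are not model output; the notebook itself performs this reshaping with pandas.

```python
from collections import defaultdict

# Long format: one record per extracted entity (toy values).
long_rows = [
    {"note_id": 0, "label": "PROBLEM", "entity_text": "rash"},
    {"note_id": 0, "label": "PROBLEM", "entity_text": "fevers"},
    {"note_id": 0, "label": "TREATMENT", "entity_text": "Tylenol"},
    {"note_id": 1, "label": "TEST", "entity_text": "Temperature"},
    {"note_id": 1, "label": "PROBLEM", "entity_text": "rash"},
    {"note_id": 1, "label": "PROBLEM", "entity_text": "rash"},  # repeated mention
]

LABELS = ("PROBLEM", "TEST", "TREATMENT")

# Wide format: one record per note, one column per label,
# each holding the sorted unique mentions for that note.
grouped = defaultdict(lambda: {label: set() for label in LABELS})
for row in long_rows:
    grouped[row["note_id"]][row["label"]].add(row["entity_text"])

wide_rows = [
    {"note_id": note_id, **{label: sorted(mentions[label]) for label in LABELS}}
    for note_id, mentions in sorted(grouped.items())
]
for row in wide_rows:
    print(row)
```

The wide form drops offsets, scores, and repeat counts, so the long table remains the record to go back to when an extraction needs to be traced to its position in the note.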

In [7]:
N = min(100, len(df_peds))  # cap at 100 notes; only 70 paediatric notes exist, so all are processed

rows = []
for i in range(N):
    note_id = int(df_peds.index[i])
    text = df_peds.loc[note_id, "text"]
    ents = ner(text)

    for e in ents:
        rows.append({
            "note_id": note_id,
            "sample_name": df_peds.loc[note_id, "sample_name"],
            "medical_specialty": df_peds.loc[note_id, "medical_specialty"],
            "entity_text": e["word"],
            "label": LABEL_MAP.get(e["entity_group"], e["entity_group"].upper()),
            "start": e["start"],
            "end": e["end"],
            "score": float(e["score"]),
        })

entities_long = pd.DataFrame(rows)
entities_long.to_csv("entities_long.csv", index=False)

entities_long.head(), entities_long["label"].value_counts()


(   note_id             sample_name       medical_specialty     entity_text  \
 0        0   Well-Child Check - 7    Pediatrics - Neonatal     sore throat   
 1        0   Well-Child Check - 7    Pediatrics - Neonatal          fevers   
 2        0   Well-Child Check - 7    Pediatrics - Neonatal   Enfamil Lipil   
 3        0   Well-Child Check - 7    Pediatrics - Neonatal  acute distress   
 4        0   Well-Child Check - 7    Pediatrics - Neonatal            rash   
 
        label  start  end     score  
 0    PROBLEM    160  171  0.996024  
 1    PROBLEM    179  185  0.997690  
 2  TREATMENT    440  453  0.941207  
 3    PROBLEM    710  724  0.989988  
 4    PROBLEM    748  752  0.992218  ,
 label
 PROBLEM      1063
 TREATMENT     453
 TEST          191
 Name: count, dtype: int64)

In [8]:
entities_wide = (
    entities_long.groupby(["note_id","label"])["entity_text"]
    .apply(lambda s: sorted(set(s)))
    .unstack(fill_value=[])
    .reset_index()
)

entities_wide.to_csv("entities_wide.csv", index=False)
entities_wide.head()


label,note_id,PROBLEM,TEST,TREATMENT
0,0,"[acute distress, fevers, lesion, masses, rash,...",[],[Enfamil Lipil]
1,1,"[Dad, Mom and, acute distress, behavioral conc...","[Blood, Temperature]",[]
2,2,"[Mom, afebrile, concerns, drooling, hearing, l...",[],"[Tylenol, a multivitamin]"
3,3,"[asthma, breast, cardiovascular disease, colon...",[birth],"[Mylicon drops, hepatitis, medications]"
4,4,"[Conjunctivae, Normocephalic, Tympanic membran...",[],"[antibiotics, his immunizations, vaginal]"
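
The wide table above lists mentions such as `Dad` and `Mom and` under PROBLEM, which suggests a light filtering pass before the entities are structured. A minimal sketch follows, assuming a hand-written blocklist and a score threshold. `BLOCKLIST`, `MIN_SCORE`, `keep_entity`, and the example records with their scores are hypothetical choices for illustration, not values tuned or used in this notebook.

```python
# Hypothetical settings; both would need review by a domain expert.
BLOCKLIST = {"mom", "dad", "mom and"}
MIN_SCORE = 0.70

def keep_entity(entity: dict) -> bool:
    """Keep an entity unless its text is blocklisted or its score is below the threshold."""
    text = entity["entity_text"].strip().lower()
    if text in BLOCKLIST:
        return False
    return entity["score"] >= MIN_SCORE

# Toy records in the same shape as rows of `entities_long` (scores invented).
candidates = [
    {"entity_text": "sore throat", "label": "PROBLEM", "score": 0.996},
    {"entity_text": "Dad", "label": "PROBLEM", "score": 0.81},
    {"entity_text": "Mom and", "label": "PROBLEM", "score": 0.77},
    {"entity_text": "masses", "label": "PROBLEM", "score": 0.62},
    {"entity_text": "Tylenol", "label": "TREATMENT", "score": 0.95},
]
kept = [c for c in candidates if keep_entity(c)]
print([c["entity_text"] for c in kept])
```

A low score does not prove a mention is wrong, so filtered entities are probably better routed to the review step than silently discarded.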


## Human-in-the-Loop (HITL) Review

Automated clinical NLP systems can produce errors, particularly in complex or safety-critical contexts. To illustrate how human expertise can guide and validate model outputs, a human-in-the-loop (HITL) review step is included. A random sample of 20 extracted entities is exported so that a reviewer can flag incorrect or missing entities, correct misclassified labels, and document common error patterns.

In [9]:
# Sample a small number of entities for HITL review
hitl_sample = (
    entities_long
    .sample(n=20, random_state=42)
    .assign(
        corrected_label="",
        error_type="",
        review_notes=""
    )
)

# Save HITL review log as a csv file
hitl_sample.to_csv("hitl_review_log.csv", index=False)

hitl_sample


Unnamed: 0,note_id,sample_name,medical_specialty,entity_text,label,start,end,score,corrected_label,error_type,review_notes
567,26,Pediatric Urology Letter,Pediatrics - Neonatal,pain through his entire right side,PROBLEM,556,590,0.956661,,,
1325,53,Ear pain - Pediatric Consult,Pediatrics - Neonatal,frequent heartburn symptoms,PROBLEM,1393,1420,0.997012,,,
1350,54,Circumcision - Infant,Pediatrics - Neonatal,glans,PROBLEM,953,958,0.787805,,,
115,6,URI & Eustachian Congestion,Pediatrics - Neonatal,enlarged,PROBLEM,1339,1347,0.909894,,,
453,21,Pediatric Rheumatology Consult,Pediatrics - Neonatal,hospitalizations,TREATMENT,1347,1363,0.861866,,,
1368,55,Difficulty Breathing - ER Visit,Pediatrics - Neonatal,Increased work of breathing,PROBLEM,741,768,0.988145,,,
483,22,Prematurity - Discharge Summary,Pediatrics - Neonatal,physiologic,PROBLEM,1197,1208,0.904642,,,
614,28,Patent Ductus Arteriosus Ligation,Pediatrics - Neonatal,persistent pulmonary over circulation,PROBLEM,177,214,0.997344,,,
1185,48,Head Injury,Pediatrics - Neonatal,dizzier,PROBLEM,430,437,0.996875,,,
915,40,Neuroblastoma - Consult,Pediatrics - Neonatal,his chemotherapy,TREATMENT,812,828,0.824656,,,


The HITL review table is exported as a template; annotation fields are intended to be completed by a clinician or qualified domain expert.
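
Once a reviewer has filled in the template, the log can be summarised into simple agreement figures. The sketch below rests on stated assumptions: a blank `corrected_label` is taken to mean the model label was accepted, and the `error_type` values are hypothetical categories a reviewer might enter. The rows are invented to show the mechanics and are not review results from this project.

```python
import csv
import io
from collections import Counter

# Invented example of a completed review log (template columns, trimmed).
completed_log = io.StringIO(
    "entity_text,label,corrected_label,error_type\n"
    "cough,PROBLEM,,\n"
    "Mom,PROBLEM,SPURIOUS,not_an_entity\n"
    "x-ray,TREATMENT,TEST,wrong_label\n"
    "amoxicillin,TREATMENT,,\n"
)
rows = list(csv.DictReader(completed_log))

def is_accepted(row: dict) -> bool:
    """Assumption: a blank or unchanged corrected_label means the reviewer agreed."""
    corrected = row["corrected_label"].strip()
    return corrected == "" or corrected == row["label"]

accepted = sum(is_accepted(r) for r in rows)
agreement = accepted / len(rows)
error_counts = Counter(r["error_type"] for r in rows if r["error_type"])

print(f"Reviewer agreement: {accepted}/{len(rows)} = {agreement:.2f}")
print(error_counts.most_common())
```

With only 20 reviewed entities, figures like these indicate error patterns rather than estimate model performance.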


## Ethics, Limitations, and Next Steps

### Ethical Considerations
This project uses a public, de-identified clinical text dataset to avoid exposure to identifiable patient information. No attempt is made to re-identify individuals or infer sensitive attributes. In real clinical deployments, strict data governance, institutional approvals, and clinician oversight would be required.

### Limitations
The dataset used here is small (70 paediatric notes) and not specific to a single healthcare system. The NER model is applied without fine-tuning, which can result in label ambiguity or spurious extractions: in the outputs above, `Dad` and `Mom and` are labelled PROBLEM, and `hepatitis` appears under TREATMENT. Automated outputs should therefore not be treated as clinically authoritative without expert review.

### Next Steps
Future extensions could include applying the pipeline to institution-specific clinical records under appropriate governance, incorporating optical character recognition (OCR) for handwritten notes, expanding domain-specific vocabularies, and using expert feedback to iteratively improve extraction quality.
