## Clinical NLP Mini-Project: Structuring Unstructured Paediatric Records

###### This project demonstrates a compact natural language processing (NLP) pipeline for transforming unstructured clinical text into structured, AI-ready data. The focus is on Clinical Named Entity Recognition (CNER) as a foundational step in clinical data digitisation.

###### The notebook presents a proof-of-concept workflow that prioritises clarity, reproducibility, and methodological correctness rather than model training or large-scale evaluation.

## Workflow Overview

1. Load a public, de-identified clinical text dataset.
2. Apply minimal preprocessing to preserve clinical meaning.
3. Perform Clinical Named Entity Recognition using a pre-trained transformer model.
4. Post-process and structure extracted entities into tabular form.
5. Demonstrate a short human-in-the-loop review step.
6. Discuss ethical considerations, limitations, and future extensions.

In [None]:
# Core libraries
import pandas as pd
import numpy as np

# NLP / Transformer utilities
from transformers import pipeline

print("Environment initialised and libraries loaded")


In [None]:
# Load the MTSamples dataset (downloaded from Kaggle)
df = pd.read_csv("/content/mtsamples.csv")

# Drop records without clinical text
df = df.dropna(subset=["transcription"]).reset_index(drop=True)

# Inspect the dataset
df.head()


In [None]:
# Filter for paediatric-related notes (MTSamples uses the US spelling "Pediatrics")
df_peds = df[df["medical_specialty"].str.contains("Pediatrics", na=False)].reset_index(drop=True)

print(f"Total notes: {len(df)}")
print(f"Paediatric-related notes: {len(df_peds)}")


## Minimal Preprocessing

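Clinical text contains important cues such as negation (e.g., “no fever”), abbreviations, and shorthand. Aggressive cleaning such as lowercasing, punctuation stripping, or stop-word removal can destroy these cues, so this step applies only light, safe preprocessing: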
- standardise whitespace
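
To illustrate why heavier cleaning is avoided, the sketch below contrasts whitespace normalisation with a naive stop-word filter. The stop-word list, the helper names, and the example sentence are all made up for illustration; they are not part of this notebook's pipeline.

```python
import re

# Hypothetical stop-word list for illustration only; general-purpose lists
# often contain negation words such as "no" and "not".
STOP_WORDS = {"the", "a", "of", "is", "no", "not"}


def normalise_whitespace(text: str) -> str:
    """Collapse any run of whitespace to one space; words are left untouched."""
    return re.sub(r"\s+", " ", text).strip()


def drop_stop_words(text: str) -> str:
    """Naive stop-word filter, shown only as an example of what to avoid."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)


note = "Patient has  no fever.\tNot vomiting."
light = normalise_whitespace(note)
heavy = drop_stop_words(light)

print(light)  # negation cues are still present
print(heavy)  # negation cues have been removed
```

After the stop-word filter, the sentence reads as if the patient has a fever, which is the opposite of what was recorded.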


In [None]:
import re

def clean_whitespace(text: str) -> str:
    text = str(text)
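    # Normalise Windows (\r\n) and old Mac (\r) line endings without doubling line breaks
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"\n{2,}", "\n\n", text)      # cap consecutive newlines at two
    text = re.sub(r"[ \t]+", " ", text)         # collapse runs of spaces or tabs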
    return text.strip()

df_peds["text_raw"] = df_peds["transcription"].astype(str)
df_peds["text"] = df_peds["text_raw"].apply(clean_whitespace)

df_peds[["sample_name", "medical_specialty", "text"]].head(3)

## Clinical Named Entity Recognition (CNER)

The goal is to convert unstructured clinical text into structured signals by extracting key clinical concepts. A pre-trained clinical transformer encoder (DistilBERT-based, trained on i2b2 2010 data) performs token-level named entity recognition over three categories:
- PROBLEM (conditions, symptoms)
- TEST (labs, imaging, examinations)
- TREATMENT (medications, procedures)
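
As a rough sketch of what the extraction step works with: an aggregated token-classification pipeline returns one dictionary per entity span, with the keys `entity_group`, `score`, `word`, `start`, and `end` (the same keys used later in this notebook). The note, the entities, the `keep_confident` helper, and the 0.5 threshold below are hypothetical and chosen only to show the shape of the data.

```python
from typing import Dict, List

# Hypothetical note and mock entities shaped like aggregated pipeline output.
note = "No fever. Chest x-ray was clear. Started amoxicillin."
mock_entities: List[Dict] = [
    {"entity_group": "problem", "score": 0.97, "word": "fever", "start": 3, "end": 8},
    {"entity_group": "test", "score": 0.91, "word": "chest x - ray", "start": 10, "end": 21},
    {"entity_group": "treatment", "score": 0.42, "word": "amoxicillin", "start": 41, "end": 52},
]


def keep_confident(entities: List[Dict], text: str, min_score: float = 0.5) -> List[Dict]:
    """Drop low-scoring spans and recover each mention from its character offsets."""
    kept = []
    for ent in entities:
        if ent["score"] < min_score:
            continue
        kept.append({
            "label": ent["entity_group"].upper(),
            # Slicing the source text preserves original casing and spacing,
            # which the tokenizer-decoded "word" field may not.
            "surface": text[ent["start"]:ent["end"]],
            "score": ent["score"],
        })
    return kept


confident = keep_confident(mock_entities, note)
for row in confident:
    print(row)
```

The threshold is an arbitrary illustrative value; a suitable cut-off, if one is used at all, would need to be chosen with expert review.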





In [None]:
ner = pipeline(
    "token-classification",
    model="nlpie/clinical-distilbert-i2b2-2010",
    # Merge sub-word tokens into entity spans; each word takes the label of its highest-scoring token
    aggregation_strategy="max"
)

# Quick test
sample_text = df_peds.loc[0, "text"]
ents = ner(sample_text)

ents[:10], len(ents)


In [None]:
# Map model labels to the upper-case category names used in this notebook
LABEL_MAP = {"problem": "PROBLEM", "test": "TEST", "treatment": "TREATMENT"}


## Structuring Extracted Clinical Entities

The output of the CNER model consists of entity mentions identified within free-text clinical notes. To enable downstream analysis and reuse, these outputs are converted into structured tabular formats.

Two representations are created:
- a long format table, where each row corresponds to a single extracted entity, and
- a wide format table, where entities are grouped by note and clinical category.
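
The reshaping from long to wide can be sketched without pandas to make the logic explicit. The rows below are invented for illustration, and `to_wide` is a hypothetical helper, not the notebook's implementation.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical long-format rows: one extracted entity per row.
long_rows = [
    {"note_id": 0, "label": "PROBLEM", "entity_text": "fever"},
    {"note_id": 0, "label": "PROBLEM", "entity_text": "cough"},
    {"note_id": 0, "label": "TEST", "entity_text": "chest x-ray"},
    {"note_id": 1, "label": "TREATMENT", "entity_text": "amoxicillin"},
    {"note_id": 1, "label": "PROBLEM", "entity_text": "fever"},
    {"note_id": 1, "label": "PROBLEM", "entity_text": "fever"},  # repeated mention
]
LABELS = ["PROBLEM", "TEST", "TREATMENT"]


def to_wide(rows: List[Dict]) -> List[Dict]:
    """Group entities by note and label, de-duplicating and sorting each group."""
    grouped = defaultdict(lambda: defaultdict(set))
    for row in rows:
        grouped[row["note_id"]][row["label"]].add(row["entity_text"])

    wide = []
    for note_id in sorted(grouped):
        record = {"note_id": note_id}
        for label in LABELS:
            # Labels with no mentions in a note yield an empty list.
            record[label] = sorted(grouped[note_id][label])
        wide.append(record)
    return wide


wide_rows = to_wide(long_rows)
for record in wide_rows:
    print(record)
```

The long table keeps every mention with its position and score, while the wide table gives one row per note and is easier to scan or join against other note-level data.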

In [None]:
N = min(100, len(df_peds))  # process at most the first 100 notes

rows = []
for i in range(N):
    note_id = int(df_peds.index[i])
    text = df_peds.loc[note_id, "text"]
    ents = ner(text)  # list of dicts, one per aggregated entity span

    for e in ents:
        rows.append({
            "note_id": note_id,
            "sample_name": df_peds.loc[note_id, "sample_name"],
            "medical_specialty": df_peds.loc[note_id, "medical_specialty"],
            "entity_text": e["word"],
            "label": LABEL_MAP.get(e["entity_group"], e["entity_group"].upper()),
            "start": e["start"],
            "end": e["end"],
            "score": float(e["score"]),
        })

entities_long = pd.DataFrame(rows)
entities_long.to_csv("entities_long.csv", index=False)

entities_long.head(), entities_long["label"].value_counts()


In [None]:
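entities_wide = (
    entities_long.groupby(["note_id", "label"])["entity_text"]
    # De-duplicate and join mentions into one string so each cell is written cleanly to CSV;
    # a scalar fill value also avoids version-dependent behaviour of a list-valued fill
    .apply(lambda s: "; ".join(sorted(set(s))))
    .unstack(fill_value="")
    .reset_index()
)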

entities_wide.to_csv("entities_wide.csv", index=False)
entities_wide.head()


## Human-in-the-Loop (HITL) Review

Automated clinical NLP systems can produce errors, particularly in complex or safety-critical contexts. To illustrate how human expertise can guide and validate model outputs, a human-in-the-loop (HITL) review step is included. A random sample of 20 extracted entities is exported so that a reviewer can manually identify incorrect or missing entities, correct misclassified labels, and document common error patterns.

In [None]:
# Sample a small number of entities for HITL review (fewer if under 20 were extracted)
hitl_sample = (
    entities_long
    .sample(n=min(20, len(entities_long)), random_state=42)
    .assign(
        corrected_label="",
        error_type="",
        review_notes=""
    )
)

# Save HITL review log as a csv file
hitl_sample.to_csv("hitl_review_log.csv", index=False)

hitl_sample


The HITL review table is exported as a template; annotation fields are intended to be completed by a clinician or qualified domain expert.
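
Once a reviewer has filled in the template, the log could be summarised along the following lines. This is a minimal sketch: the rows, the error-type values (`wrong_label`, `spurious`), and the convention that an empty `error_type` means "accepted" are all assumptions made for illustration, not results from this notebook.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical completed review rows; every value here is invented.
reviewed = [
    {"entity_text": "fever", "label": "PROBLEM", "corrected_label": "", "error_type": ""},
    {"entity_text": "chest x-ray", "label": "TREATMENT", "corrected_label": "TEST", "error_type": "wrong_label"},
    {"entity_text": "the patient", "label": "PROBLEM", "corrected_label": "NONE", "error_type": "spurious"},
    {"entity_text": "amoxicillin", "label": "TREATMENT", "corrected_label": "", "error_type": ""},
]


def summarise_review(rows: List[Dict]) -> Dict:
    """Summarise a review log, treating an empty error_type as an accepted entity (assumed convention)."""
    total = len(rows)
    accepted = sum(1 for r in rows if not r["error_type"])
    errors = Counter(r["error_type"] for r in rows if r["error_type"])
    return {
        "n_reviewed": total,
        "accepted_fraction": accepted / total if total else 0.0,
        "errors": dict(errors),
    }


summary = summarise_review(reviewed)
print(summary)
```

If the completed log is read back with `pd.read_csv`, empty cells arrive as `NaN` by default, so the empty-string check would need adjusting (for example by passing `keep_default_na=False`).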


## Ethics, Limitations, and Next Steps

### Ethical Considerations
This project uses a public, de-identified clinical text dataset to avoid exposure to identifiable patient information. No attempt is made to re-identify individuals or infer sensitive attributes. In real clinical deployments, strict data governance, institutional approvals, and clinician oversight would be required.

### Limitations
The dataset used here is limited in size and not specific to a single healthcare system. The NER model is applied without fine-tuning, which can result in label ambiguity or spurious extractions. As the HITL step illustrates, automated outputs should not be treated as clinically authoritative without expert review.
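
A further practical point, stated as an assumption rather than something measured in this notebook: DistilBERT-style encoders typically accept at most 512 tokens, so long notes may be truncated or rejected before every entity is seen. One possible mitigation is to run the model over overlapping windows and shift the character offsets back. In the sketch below, `ner_in_chunks`, `stub_ner`, and the window sizes are hypothetical, and character windows are only a crude proxy for token limits.

```python
import re
from typing import Callable, Dict, List


def ner_in_chunks(
    text: str,
    ner_fn: Callable[[str], List[Dict]],
    chunk_chars: int = 1000,
    overlap: int = 100,
) -> List[Dict]:
    """Run ner_fn over overlapping character windows and map spans back to the full text."""
    if not 0 <= overlap < chunk_chars:
        raise ValueError("overlap must be non-negative and smaller than chunk_chars")
    step = chunk_chars - overlap
    best: Dict[tuple, Dict] = {}
    offset = 0
    while True:
        chunk = text[offset:offset + chunk_chars]
        for ent in ner_fn(chunk):
            start, end = ent["start"] + offset, ent["end"] + offset
            key = (start, end, ent["entity_group"])
            # Overlapping windows can report the same span twice; keep the higher score.
            if key not in best or ent["score"] > best[key]["score"]:
                best[key] = {**ent, "start": start, "end": end}
        if offset + chunk_chars >= len(text):
            break
        offset += step
    return sorted(best.values(), key=lambda e: e["start"])


def stub_ner(chunk: str) -> List[Dict]:
    """Hypothetical stand-in for the real pipeline: tags every literal 'fever'."""
    return [
        {"entity_group": "problem", "score": 1.0, "word": m.group(),
         "start": m.start(), "end": m.end()}
        for m in re.finditer("fever", chunk)
    ]


long_note = "fever " + "x" * 40 + " fever"
found = ner_in_chunks(long_note, stub_ner, chunk_chars=30, overlap=10)
print([(e["start"], e["end"]) for e in found])
```

With the real pipeline, `ner_fn` would be the `ner` object itself. An entity cut by a window boundary may still be reported partially in one window and fully in the next; this sketch only de-duplicates identical spans and does not merge partial ones.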

### Next Steps
Future extensions could include applying the pipeline to institution-specific clinical records under appropriate governance, incorporating optical character recognition (OCR) for handwritten notes, expanding domain-specific vocabularies, and using expert feedback to iteratively improve extraction quality.
