# Task 2: Data Exploration and Processing

## 1. Manual data inspection
- Investigate which standard and potential new NER types are most prominent in your data set (i.e., manual data inspection)

Manual inspection of the data shows that the most prominent entity types are DATE, BODY_PART, DOSAGE, MEASUREMENT, DRUG and SYMPTOM.

Examples:
 1. DATE: 12/20/2005, 1/19/96
 2. BODY_PART: nose, abdomen, knee
 3. DOSAGE:  10/40 mg one a day, 0.25 micrograms a day, 50 mg twice a day, 10 ml
 4. MEASUREMENT: 3.98 kg, 8mm, pulse of 84, blood pressure 108/65
 5. DRUG:  Vytorin, Rocaltrol, Carvedilol, Cozaar,  Lasix
 6. SYMPTOM: erythematous, chest pain, constipated

DATE is a **standard NER type in spaCy**; the remaining types are specific to the medical domain.
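
Manual inspection can be supported by a rough frequency count. The sketch below is only an aid for deciding which types deserve a closer look; the regular expressions and the `count_candidates` helper are hypothetical and were not part of the original notebook. They only cover the simple numeric patterns seen in the examples above (dates, dosages, measurements).

```python
import re
from collections import Counter

# Hypothetical patterns, written only to surface candidates for manual review.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "DOSAGE": re.compile(r"\b\d+(?:\.\d+)?\s?(?:mg|ml|micrograms)\b", re.IGNORECASE),
    "MEASUREMENT": re.compile(r"\b\d+(?:\.\d+)?\s?(?:kg|mm|cm)\b", re.IGNORECASE),
}

def count_candidates(texts):
    """Count pattern matches per candidate entity type across all texts."""
    counts = Counter()
    for text in texts:
        for label, pattern in PATTERNS.items():
            counts[label] += len(pattern.findall(text))
    return counts

# Toy sentences built from the example values listed above.
notes = [
    "Seen on 12/20/2005. Carvedilol 50 mg twice a day.",
    "Birth weight 3.98 kg; lesion of 8mm noted on 1/19/96.",
]
print(count_candidates(notes))
```

On the real data, `notes` would be replaced by the `text` column of the dataset. Types such as DRUG, BODY_PART and SYMPTOM cannot be found with simple patterns and still need to be read by hand.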

## 2. Apply the standard NER classifier of spaCy


### Imports

In [None]:
from datasets import load_dataset
import pandas as pd
import numpy as np
import spacy
import random
import json
import re
## The following import is disabled: the annotations only need to be run once, and they are already done
# import utils.annotations_utils as utils_ann # ChatGPT API called to identify keywords/phrases associated with entities

%load_ext autoreload
%autoreload 2


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [31]:
%reload_ext autoreload

### Load Data

In [15]:
# ============================
# 1: Load Dataset
# ============================
dataset = load_dataset("argilla/medical-domain", split="train")
print("Features available")
print(dataset.column_names)
print("Format of 'prediction' column")
print(dataset.features['prediction'])
print("Dataset length: ", len(dataset))

Features available
['text', 'inputs', 'prediction', 'prediction_agent', 'annotation', 'annotation_agent', 'multi_label', 'explanation', 'id', 'metadata', 'status', 'event_timestamp', 'metrics']
Format of 'prediction' column
List({'label': Value('string'), 'score': Value('float64')})
Dataset length:  4966


### Filter such that we keep only samples pertaining to 'Surgery'

In [16]:
dataset_slimmed = dataset.filter(
	lambda row: (
		isinstance(row["text"], str)
        and row["text"] != ''
		and 'Surgery' in str(row["prediction"][0])
	)
)
dataset = dataset_slimmed
print(len(dataset))

1115


In [17]:
# ================================================
# 2: Define Entity Schema
# ================================================
# Our final gold-label schema
CUSTOM_LABELS = ['SYMPTOM', 'BODY_PART', 'DISEASE', 'DRUG', 'ROUTE']

print("Our NER schema = ", CUSTOM_LABELS)


Our NER schema =  ['SYMPTOM', 'BODY_PART', 'DISEASE', 'DRUG', 'ROUTE']


In [18]:
# =========================================
# 3: Pick N samples manually
# =========================================
PATH_TO_ANNOTATIONS = "ner/samples/"
SEED = 42 
random.seed(SEED)
np.random.seed(SEED)

N_SAMPLES = 4   # select a proper number to contain 100 ground-truth NERs 
sampled_texts = []

for i in range(N_SAMPLES):
	row = dataset[np.random.randint(0, len(dataset))]
	text = row["text"]
	sampled_texts.append(text)

df_samples = pd.DataFrame({"text": sampled_texts})
df_samples.to_csv(PATH_TO_ANNOTATIONS + "unannotated_samples.csv", index=False)

print("Saved unannotated_samples.csv with", len(df_samples), "sentences")


Saved unannotated_samples.csv with 4 sentences


## 3. Evaluation of Standard NER

### Perform manual annotations, then split into train/test 

 The annotation format is: 

	{
		"classes": List(labels), 
		"annotations": List(
			[str(text), dict("entities": List(List(start idx, end idx, class)))]
		)
	}
Where each item in "annotations" is a sentence.
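
As an illustration of this format, the sketch below builds one annotated sentence and checks it for consistency. The helpers `make_entity` and `validate_annotations` are hypothetical names introduced here, and the example sentence is made up. The end index is assumed to be exclusive, which is the character-offset convention spaCy uses for training data.

```python
def make_entity(text, phrase, label):
    """Locate `phrase` in `text` and return [start, end, label] with an exclusive end index."""
    start = text.index(phrase)
    return [start, start + len(phrase), label]

def validate_annotations(data):
    """Return a list of problems; an empty list means the structure looks consistent."""
    problems = []
    classes = set(data["classes"])
    for i, (text, ann) in enumerate(data["annotations"]):
        for start, end, label in ann["entities"]:
            if label not in classes:
                problems.append(f"sentence {i}: unknown label {label!r}")
            if not (0 <= start < end <= len(text)):
                problems.append(f"sentence {i}: span ({start}, {end}) out of range")
    return problems

sentence = "The patient reported pain in the left foot."
example = {
    "classes": ["SYMPTOM", "BODY_PART", "DISEASE", "DRUG", "ROUTE"],
    "annotations": [
        [sentence, {"entities": [
            make_entity(sentence, "pain", "SYMPTOM"),
            make_entity(sentence, "foot", "BODY_PART"),
        ]}],
    ],
}

print(example["annotations"][0][1]["entities"])
print(validate_annotations(example))
```

Running such a check before training catches label typos and off-by-one spans early, which is cheaper than debugging alignment warnings later.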


#### Manual Annotations (with AI assistance)

**NOTE: Skipping this section as unannotated_samples.csv and annotated_samples\*.json have already been generated.**

Additionally, you will need a valid OpenAI API key in your environment variables to run this.

Jump to [Train/Test splitting](#traintest-split-of-annotations)

In [None]:
# unannotated_samples_path = PATH_TO_ANNOTATIONS + "unannotated_samples.csv"
# df = pd.read_csv(unannotated_samples_path, header=0)
# text = df['text'].to_list()
# num_examples_per_label = 30
# annotations = utils_ann.chatgpt_annotate_text(CUSTOM_LABELS, text, num_examples_per_label)

In [None]:
# for label in annotations.keys():
# 	print(f" Label: {label} ".center(30, '='))
# 	for indexed_words in annotations[label]:
# 		print(indexed_words)

[0, 'pain']
[1, 'deformity']
[2, 'dysfunction']
[3, 'atrophy']
[4, 'flexible talus']
[5, 'motion changes']
[6, 'sensation changes']
[7, 'bleeding']
[8, 'infection']
[9, 'swelling']
[10, 'muscular dystrophy']
[11, 'hematoma']
[12, 'thinning']
[13, 'resection']
[14, 'inflammation']
[15, 'weakness']
[16, 'sensitivity']
[17, 'numbness']
[18, 'impaired function']
[19, 'low mobility']
[20, 'tenderness']
[21, 'ache']
[22, 'limited motion']
[23, 'stiffness']
[24, 'difficulty']
[25, 'irritation']
[26, 'complications']
[27, 'malalignment']
[28, 'protrusion']
[29, 'increased risk']
[0, 'foot']
[1, 'arm']
[2, 'forearm']
[3, 'elbow']
[4, 'mandible']
[5, 'maxilla']
[6, 'leg']
[7, 'shoulder']
[8, 'talocalcaneal joint']
[9, 'extremity']
[10, 'muscle']
[11, 'thigh']
[12, 'spine']
[13, 'sinus']
[14, 'upper limb']
[15, 'arm']
[16, 'heel']
[17, 'lower surface']
[18, 'biceps']
[19, 'nerve']
[20, 'foot arch']
[21, 'vein']
[22, 'artery']
[23, 'bone']
[24, 'palate']
[25, 'buttock']
[26, 'cavity']
[27, 'genial

In [None]:
# # Specify examples to exclude or modify
# # NOTE: pop larger indices first to maintain index ordering of earlier items after deletion
# annotations['SYMPTOM'].pop(29)

# annotations['BODY_PART'][8][1] = 'joint'
# annotations['BODY_PART'][14][1] = 'limb'
# annotations['BODY_PART'].pop(17)

# annotations['DISEASE'][0][1] = 'renal disease'

# annotations['DRUG'][3][1] = 'anesthetics'
# annotations['DRUG'][9][1] = 'saline'
# annotations['DRUG'].pop(28)

# annotations['ROUTE'].pop(17)
# annotations['ROUTE'].pop(6)

# for label in annotations.keys():
# 	print(f" Label: {label} ".center(30, '='))
# 	words_list = []
# 	for idx, word in annotations[label]:
# 		words_list.append(word)
# 		print(word)
# 	annotations[label] = words_list

pain
deformity
dysfunction
atrophy
flexible talus
motion changes
sensation changes
bleeding
infection
swelling
muscular dystrophy
hematoma
thinning
resection
inflammation
weakness
sensitivity
numbness
impaired function
low mobility
tenderness
ache
limited motion
stiffness
difficulty
irritation
complications
malalignment
protrusion
foot
arm
forearm
elbow
mandible
maxilla
leg
shoulder
joint
extremity
muscle
thigh
spine
sinus
limb
arm
heel
biceps
nerve
foot arch
vein
artery
bone
palate
buttock
cavity
genial tubercle
mental foramina
soft tissue
renal disease
myotonic muscular dystrophy
planovalgus
mandibular atrophy
maxillary atrophy
facial deformity
masticatory dysfunction
acquired deformity
vertical talus
implant failure
hip fracture
infection
anterior mandibular atrophy
arthrodesis indication
release complications
mandibular deficiency
surgical failure
hemorrhage
graft complications
suture complications
pneumonia
urinary infection
anesthetic reaction
scarring
paralysis risk
neuralgia
ed

#### Save annotations to the appropriate format

In [None]:
# # 1. Load samples: only column "text"
# df = pd.read_csv(unannotated_samples_path)

# # 2. Load spaCy for sentence splitting
# nlp = spacy.load("en_core_web_md")
# if "sentencizer" not in nlp.pipe_names:
# 	nlp.add_pipe("sentencizer")

# # 3. Extract all sentences across samples
# all_sentences = []
# for _, row in df.iterrows():
# 	full_text = str(row["text"])
# 	doc = nlp(full_text)

# 	# Extract sentences
# 	all_sentences.extend([s.text.strip() for s in doc.sents if len(s.text.strip()) > 0])

# # 4. Annotate all identified keywords in sentences and save to json file
# utils_ann.annotate_sentences_and_save(all_sentences, annotations, CUSTOM_LABELS, PATH_TO_ANNOTATIONS+'annotated_samples.json')

### Train/Test split of annotations

In [19]:
# ============================================
# Generate train/test annotation JSON files with ratio control
# ============================================
annotated_samples_path = PATH_TO_ANNOTATIONS + "annotated_samples.json"

# =====================================================
# Parameters: control split ratio (sentence-wise split)
# =====================================================
TRAIN_RATIO = 0.70     # first 70% for training
TEST_RATIO  = 0.30     # remaining 30% for evaluation/test

# ================================

# Load annotated samples: only key "annotations"
annotations_dict = None
with open(annotated_samples_path) as annotations_file:
	annotations_dict = json.load(annotations_file)
all_sentences = annotations_dict["annotations"]

# ===========================
# Create train/test sets
# ===========================

np.random.shuffle(all_sentences) # shuffle randomly first before splitting
total = len(all_sentences)
train_cutoff = int(total * TRAIN_RATIO)
test_cutoff = train_cutoff + int(total * TEST_RATIO)

train_sentences = all_sentences[:train_cutoff]
test_sentences = all_sentences[train_cutoff:test_cutoff]

print("N Train samples: ", len(train_sentences))
print("N Test samples: ", len(test_sentences))
# ===========================
# Save sentences into train/test
# ===========================
train_samples_path = PATH_TO_ANNOTATIONS + 'annotated_samples_train.json'
annotations_dict['annotations'] = train_sentences
with open(train_samples_path, 'w') as fp:
	json.dump(annotations_dict, fp)

test_samples_path = PATH_TO_ANNOTATIONS + 'annotated_samples_test.json'
annotations_dict['annotations'] = test_sentences
with open(test_samples_path, 'w') as fp:
	json.dump(annotations_dict, fp)

N Train samples:  84
N Test samples:  36


### 3.1 Manual Evaluation of Standard NER

In [None]:
### Manual Evaluation of Standard NER
from spacy import displacy

# !python -m spacy download en_core_web_sm # NOTE run if needed

# If we jump here directly, we need a definition of test_samples_path
test_samples_path = PATH_TO_ANNOTATIONS + 'annotated_samples_test.json'

# 1. Load sample text
annotations_dict = None
with open(test_samples_path) as annotations_file:
	annotations_dict = json.load(annotations_file)
	
# 2. Extract the text of every test sentence
test_text = [sent for sent, _ in annotations_dict["annotations"]]

print("=== Test Text Preview ===")

# 3. Load spaCy model (baseline or your updated one)
nlp = spacy.load("en_core_web_md")   # or: spacy.load("output_medical_ner")

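# 4. Run NER on the concatenated test sentences (at most the first 2000),
#    joined with a space so that words at sentence boundaries are not fused
doc = nlp(' '.join(test_text[:2000]))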

# 5. Render using displacy
displacy.render(doc, style="ent", jupyter=True)


=== Test Text Preview ===


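Observation: Some of the predicted entities don't make sense, e.g. Vomitting -> PERSON, Hypokalemia -> PERSON, Diarrhea -> PERSON. The standard model has no medical labels, so it maps unfamiliar capitalized terms to its general-purpose types.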

### 3.2 Automatic Evaluation of Standard NER

In [21]:
# =========================================
#  Standard NER evaluation
# =========================================

from sklearn.metrics import precision_recall_fscore_support
from ner.utils import extract_spans, get_label_lists, remove_overlapping_spans

# 1. Load JSON annotations
test_annotations_path = PATH_TO_ANNOTATIONS + "annotated_samples_test.json"

sentences = None # will become a list of sentences
gold_spans = None # will become a nested list of 1 list per sentence, containing a list of [start, end, label]
with open(test_annotations_path, "r", encoding="utf-8") as f:
	annotations_dict = json.load(f)
	sentences = [sent for sent, _ in annotations_dict["annotations"]]
	gold_spans = [entities_dict["entities"] for _, entities_dict in annotations_dict["annotations"]]

# Remove overlapping spans in gold if any
gold_spans = remove_overlapping_spans(gold_spans)
print(f"Loaded {len(sentences)} annotated sentences")

# 2. Load baseline spaCy NER
nlp = spacy.load("en_core_web_md")

# 3. Run the model on each sentence to get predicted character-level spans
pred_spans = extract_spans(nlp, sentences)

# 4. Align gold and predicted spans into parallel label lists for evaluation
true_labels, pred_labels, associated_sentence_idx = get_label_lists(gold_spans, pred_spans)

# 5. Evaluate macro and micro F1
prec, rec, f1, _ = precision_recall_fscore_support(
	true_labels, pred_labels, average="micro", zero_division=0
)

print("\n===== Baseline spaCy NER Evaluation =====")
print(f"Precision: {prec:.4f}")
print(f"Recall:    {rec:.4f}")
print(f"F1 score:  {f1:.4f}")

prec_m, rec_m, f1_m, _ = precision_recall_fscore_support(
	true_labels, pred_labels, average="macro", zero_division=0
)

print("\nMacro Precision:", round(prec_m, 4))
print("Macro Recall:   ", round(rec_m, 4))
print("Macro F1:       ", round(f1_m, 4))

# 6. Show some error cases
print("\n===== Examples of WRONG predictions =====\n")
for sent_id, (g, p) in enumerate(zip(gold_spans, pred_spans)):
    if g != p:
        print("Text:", sentences[sent_id])
        print("Gold:", g)
        print("Pred:", p)
        print("-" * 50)
        break

Loaded 36 annotated sentences
True labels
 ['DRUG', 'DRUG', 'NONE', 'BODY_PART', 'NONE', 'NONE', 'BODY_PART', 'NONE', 'NONE', 'NONE', 'DRUG', 'DRUG', 'NONE', 'NONE', 'BODY_PART', 'BODY_PART', 'NONE', 'NONE', 'DRUG', 'BODY_PART', 'BODY_PART', 'BODY_PART', 'BODY_PART', 'DISEASE', 'BODY_PART', 'SYMPTOM', 'NONE', 'BODY_PART', 'BODY_PART', 'NONE', 'NONE', 'NONE', 'BODY_PART', 'ROUTE', 'NONE', 'NONE', 'DISEASE', 'NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'ROUTE', 'NONE', 'SYMPTOM', 'DISEASE', 'SYMPTOM', 'BODY_PART', 'BODY_PART', 'BODY_PART', 'NONE', 'DRUG', 'NONE', 'NONE', 'BODY_PART', 'NONE', 'NONE', 'DRUG', 'DRUG', 'NONE', 'NONE', 'NONE']
pred
 ['NONE', 'NONE', 'CARDINAL', 'NONE', 'QUANTITY', 'DATE', 'NONE', 'PERSON', 'MONEY', 'TIME', 'PERSON', 'NONE', 'PERCENT', 'CARDINAL', 'NONE', 'NONE', 'PERSON', 'GPE', 'NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'TIME', 'NONE', 'NONE', 'QUANTITY', 'ORDINAL', 'QUANTITY', 'NONE', 'NONE', 'CARDINAL', 'CARDINAL', 'NONE', '
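
The `get_label_lists` helper from `ner.utils` is not shown in this notebook. The printed lists suggest that gold and predicted spans are paired up and that a missing counterpart is filled with `NONE`. A minimal sketch of such an alignment, assuming exact character-span matching, could look as follows; `align_labels` is a hypothetical stand-in, not the actual implementation.

```python
def align_labels(gold_spans, pred_spans, none_label="NONE"):
    """Pair gold and predicted spans per sentence by exact (start, end) match.

    A span that exists on only one side is paired with `none_label`.
    Returns parallel lists: true labels, predicted labels, sentence indices.
    """
    true_labels, pred_labels, sent_ids = [], [], []
    for sent_id, (gold, pred) in enumerate(zip(gold_spans, pred_spans)):
        gold_map = {(start, end): label for start, end, label in gold}
        pred_map = {(start, end): label for start, end, label in pred}
        for key in sorted(set(gold_map) | set(pred_map)):
            true_labels.append(gold_map.get(key, none_label))
            pred_labels.append(pred_map.get(key, none_label))
            sent_ids.append(sent_id)
    return true_labels, pred_labels, sent_ids

# Toy input: one list of [start, end, label] per sentence.
gold = [[[0, 8, "DRUG"]], [[4, 8, "BODY_PART"]]]
pred = [[[10, 12, "CARDINAL"]], [[4, 8, "BODY_PART"]]]
print(align_labels(gold, pred))
```

Under this scheme a missed gold entity appears as a `NONE` prediction, and a spurious prediction appears with `NONE` as its true label, which matches the pattern in the lists printed above.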

## 4. Extend the standard NER types using the NER Annotator

### 4.1 Training the Extended NER model with >100 manual annotations

In [22]:
# =========================================
#  Extended NER TRAINING
# =========================================

import json
import random
import spacy
from spacy.training import Example
from spacy.util import minibatch, compounding

# 1. Load annotated JSON
json_path = PATH_TO_ANNOTATIONS + "annotated_samples_train.json"

with open(json_path, "r", encoding="utf-8") as f:
	annotations_dict = json.load(f)
	training_examples = annotations_dict["annotations"]
	
# Remove overlapping spans in training examples
training_examples = remove_overlapping_spans(training_examples, is_annotations_list=True)
print(f"Loaded {len(training_examples)} annotated sentences")

# 2. Custom labels
CUSTOM_LABELS = ["SYMPTOM", "BODY_PART", "DISEASE", "DRUG", "ROUTE"]
assert annotations_dict['classes'] == CUSTOM_LABELS
print("Custom NER labels:", CUSTOM_LABELS)

# 3. Initialize blank model for spaCy 3.8+
nlp = spacy.blank("en")         
ner = nlp.add_pipe("ner")       

# 4. Add custom labels
for label in CUSTOM_LABELS:
	ner.add_label(label)

# 5. Training loop
n_iter = 35
optimizer = nlp.initialize()

for epoch in range(n_iter):
	random.shuffle(training_examples)
	losses = {}

	batches = minibatch(training_examples, size=compounding(4.0, 32.0, 1.5))

	for batch in batches:
		examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in batch]
		nlp.update(examples, sgd=optimizer, drop=0.2, losses=losses)

	print(f"Epoch {epoch+1}/{n_iter} Loss: {losses}")

# 6. Save model
output_dir = "output_medical_ner"
nlp.to_disk(output_dir)
print("Model saved to", output_dir)

# 7. Quick sanity check
test_text = random.choice(sentences)
doc = nlp(test_text)

print("\nTest sentence:", test_text)
print("Predicted NER:", [(ent.text, ent.label_) for ent in doc.ents])


Loaded 84 annotated sentences
Custom NER labels: ['SYMPTOM', 'BODY_PART', 'DISEASE', 'DRUG', 'ROUTE']




Epoch 1/35 Loss: {'ner': np.float32(1041.892)}
Epoch 2/35 Loss: {'ner': np.float32(225.54295)}
Epoch 3/35 Loss: {'ner': np.float32(137.1597)}
Epoch 4/35 Loss: {'ner': np.float32(123.322716)}
Epoch 5/35 Loss: {'ner': np.float32(109.891365)}
Epoch 6/35 Loss: {'ner': np.float32(90.60601)}
Epoch 7/35 Loss: {'ner': np.float32(65.43597)}
Epoch 8/35 Loss: {'ner': np.float32(49.277184)}
Epoch 9/35 Loss: {'ner': np.float32(41.314453)}
Epoch 10/35 Loss: {'ner': np.float32(30.830793)}
Epoch 11/35 Loss: {'ner': np.float32(34.150513)}
Epoch 12/35 Loss: {'ner': np.float32(28.016462)}
Epoch 13/35 Loss: {'ner': np.float32(20.584805)}
Epoch 14/35 Loss: {'ner': np.float32(11.637852)}
Epoch 15/35 Loss: {'ner': np.float32(9.644718)}
Epoch 16/35 Loss: {'ner': np.float32(4.12778)}
Epoch 17/35 Loss: {'ner': np.float32(2.430256)}
Epoch 18/35 Loss: {'ner': np.float32(1.5563459)}
Epoch 19/35 Loss: {'ner': np.float32(0.48532856)}
Epoch 20/35 Loss: {'ner': np.float32(1.9009217)}
Epoch 21/35 Loss: {'ner': np.float

### 4.2 Extended NER Evaluation

In [25]:
# =========================================
#  Extended NER evaluation with test sample
# =========================================
import json
import spacy
from sklearn.metrics import precision_recall_fscore_support
from ner.utils import extract_spans, get_label_lists, remove_overlapping_spans

# 1. Load JSON annotations for test
test_annotations_path = PATH_TO_ANNOTATIONS + "annotated_samples_test.json"

sentences = None # will become a list of sentences
gold_spans = None # will become a nested list of 1 list per sentence, containing a list of [start, end, label]
with open(test_annotations_path, "r", encoding="utf-8") as f:
	annotations_dict = json.load(f)
	sentences = [sent for sent, _ in annotations_dict["annotations"]]
	gold_spans = [entities_dict["entities"] for _, entities_dict in annotations_dict["annotations"]]

# Remove overlapping spans in gold if any
gold_spans = remove_overlapping_spans(gold_spans)
print(f"Loaded {len(sentences)} annotated sentences")

# 2. Load Extended spaCy NER
nlp = spacy.load("output_medical_ner")

# 3. Run the model on each sentence to get predicted character-level spans
pred_spans = extract_spans(nlp, sentences)

# 4. Align gold and predicted spans into parallel label lists for evaluation
true_labels, pred_labels, associated_sentence_idx = get_label_lists(gold_spans, pred_spans)
# 5. Evaluate macro and micro F1
prec, rec, f1, _ = precision_recall_fscore_support(
	true_labels, pred_labels, labels=CUSTOM_LABELS, average="micro", zero_division=0
)

print("\n===== Extended spaCy NER Evaluation =====")
print(f"Precision: {prec:.4f}")
print(f"Recall:    {rec:.4f}")
print(f"F1 score:  {f1:.4f}")

prec_m, rec_m, f1_m, _ = precision_recall_fscore_support(
	true_labels, pred_labels, labels=CUSTOM_LABELS, average="macro", zero_division=0
)

print("\nMacro Precision:", round(prec_m, 4))
print("Macro Recall:   ", round(rec_m, 4))
print("Macro F1:       ", round(f1_m, 4))

# 6. Show some error cases
print("\n===== Examples of WRONG predictions =====\n")

for sent_id, (g, p) in enumerate(zip(gold_spans, pred_spans)):
    if g != p:
        print("Text:", sentences[sent_id])
        print("Gold:", g)
        print("Pred:", p)
        print("-" * 50)
        break

Loaded 36 annotated sentences
True labels
 ['DRUG', 'DRUG', 'BODY_PART', 'BODY_PART', 'DRUG', 'DRUG', 'BODY_PART', 'BODY_PART', 'DRUG', 'BODY_PART', 'NONE', 'BODY_PART', 'BODY_PART', 'BODY_PART', 'DISEASE', 'BODY_PART', 'SYMPTOM', 'BODY_PART', 'BODY_PART', 'BODY_PART', 'ROUTE', 'DISEASE', 'ROUTE', 'SYMPTOM', 'DISEASE', 'SYMPTOM', 'BODY_PART', 'BODY_PART', 'BODY_PART', 'DRUG', 'BODY_PART', 'DRUG', 'DRUG']
pred
 ['NONE', 'NONE', 'BODY_PART', 'BODY_PART', 'DRUG', 'NONE', 'NONE', 'BODY_PART', 'NONE', 'BODY_PART', 'ROUTE', 'BODY_PART', 'BODY_PART', 'BODY_PART', 'DISEASE', 'NONE', 'NONE', 'BODY_PART', 'NONE', 'BODY_PART', 'ROUTE', 'DISEASE', 'NONE', 'NONE', 'DISEASE', 'NONE', 'NONE', 'NONE', 'NONE', 'DRUG', 'BODY_PART', 'NONE', 'NONE']
sent ids
 [0, 1, 2, 4, 6, 6, 7, 8, 11, 12, 12, 13, 14, 14, 15, 16, 18, 20, 20, 21, 24, 25, 28, 29, 29, 29, 30, 30, 30, 31, 32, 35, 35]

===== Extended spaCy NER Evaluation =====
Precision: 0.9412
Recall:    0.5000
F1 score:  0.6531

Macro Precision: 0.7
Macro 
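
The micro-averaged scores can be checked by hand from the two label lists printed above. The sketch below recomputes them in plain Python; `micro_prf` is a hypothetical helper, and it matches the `labels=CUSTOM_LABELS` setting only because every non-`NONE` label in these lists belongs to the custom schema.

```python
def micro_prf(true_labels, pred_labels, none_label="NONE"):
    """Micro precision, recall and F1 over all labels except `none_label`."""
    tp = sum(1 for t, p in zip(true_labels, pred_labels) if t == p and t != none_label)
    n_pred = sum(1 for p in pred_labels if p != none_label)  # entities the model predicted
    n_gold = sum(1 for t in true_labels if t != none_label)  # entities in the gold data
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Label lists copied from the output above.
true_labels = (
    "DRUG DRUG BODY_PART BODY_PART DRUG DRUG BODY_PART BODY_PART DRUG BODY_PART NONE "
    "BODY_PART BODY_PART BODY_PART DISEASE BODY_PART SYMPTOM BODY_PART BODY_PART BODY_PART "
    "ROUTE DISEASE ROUTE SYMPTOM DISEASE SYMPTOM BODY_PART BODY_PART BODY_PART DRUG "
    "BODY_PART DRUG DRUG"
).split()
pred_labels = (
    "NONE NONE BODY_PART BODY_PART DRUG NONE NONE BODY_PART NONE BODY_PART ROUTE "
    "BODY_PART BODY_PART BODY_PART DISEASE NONE NONE BODY_PART NONE BODY_PART "
    "ROUTE DISEASE NONE NONE DISEASE NONE NONE NONE NONE DRUG "
    "BODY_PART NONE NONE"
).split()

p, r, f1 = micro_prf(true_labels, pred_labels)
print(f"Precision: {p:.4f}  Recall: {r:.4f}  F1: {f1:.4f}")
# prints: Precision: 0.9412  Recall: 0.5000  F1: 0.6531
```

Of the 17 predicted entities, 16 carry the correct label (precision 16/17), while only 16 of the 32 gold entities are found (recall 16/32). The model is rarely wrong when it predicts, but it misses half of the entities.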

### Summary
- Performance is poor, mainly in recall (0.50 micro): the train and test sets are not in the same category and therefore do not share a large enough NER vocabulary.

In [26]:
# =========================================
#  DETECT TRAIN/TEST DATA LEAKAGE
# =========================================

# Load train sentences
train_json = PATH_TO_ANNOTATIONS + "annotated_samples_train.json"
with open(train_json, "r", encoding="utf-8") as f:
    train_data = json.load(f)
train_sentences = [text for text, _ in train_data["annotations"]]

# Load test sentences
test_json = PATH_TO_ANNOTATIONS + "annotated_samples_test.json"
with open(test_json, "r", encoding="utf-8") as f:
    test_data = json.load(f)
test_sentences = [text for text, _ in test_data["annotations"]]

# Exact-match leakage
print("\n===== Checking Exact Leakage =====")
leak_exact = []
for t in test_sentences:
    if t in train_sentences:
        leak_exact.append(t)

if leak_exact:
    print("EXACT LEAKAGE FOUND:")
    for s in leak_exact:
        print(" -", s)
else:
    print("No exact leakage.")


# Substring leakage (weaker but still harmful)
print("\n===== Checking Substring Leakage =====")
leak_sub = []
for test_s in test_sentences:
    for train_s in train_sentences:
        if test_s.strip() in train_s.strip() or train_s.strip() in test_s.strip():
            leak_sub.append(test_s)
            break

if leak_sub:
    print("SUBSTRING LEAKAGE FOUND:")
    for s in leak_sub:
        print(" -", s)
else:
    print("No substring leakage.")



===== Checking Exact Leakage =====
EXACT LEAKAGE FOUND:
 - Acquired facial deformity.
 - ,3.

===== Checking Substring Leakage =====
SUBSTRING LEAKAGE FOUND:
 - Acquired facial deformity.
 - DIAGNOSIS,End-stage renal disease.
 - ,3.
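
Exact leakage of this kind can be avoided by deduplicating sentences before the split. The sketch below shows one way to do this; `dedup_split` is a hypothetical helper, the toy annotations are made up, and matching on the stripped, lower-cased text is an assumption. It does not address substring leakage.

```python
import random

def dedup_split(annotations, train_ratio=0.7, seed=42):
    """Drop repeated sentences, shuffle, and split into train and test lists."""
    seen = set()
    unique = []
    for text, ann in annotations:
        key = text.strip().lower()  # assumed notion of "same sentence"
        if key in seen:
            continue
        seen.add(key)
        unique.append([text, ann])
    rng = random.Random(seed)  # local RNG keeps the split reproducible
    rng.shuffle(unique)
    cutoff = int(len(unique) * train_ratio)
    return unique[:cutoff], unique[cutoff:]

# Toy data with the two duplicated sentences reported above.
toy = [
    ["Acquired facial deformity.", {"entities": []}],
    [",3.", {"entities": []}],
    ["Acquired facial deformity.", {"entities": []}],
    ["The area was injected with Marcaine.", {"entities": []}],
    [",3.", {"entities": []}],
    ["A piece of bone was removed.", {"entities": []}],
]
train, test = dedup_split(toy)
overlap = {t for t, _ in train} & {t for t, _ in test}
print(len(train), len(test), overlap)
```

Note that only the first annotation of each repeated sentence is kept, so any duplicates with conflicting annotations would need to be reconciled separately.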


## 5. LLM-based NER classifier

The goal of this part was to see whether a large language model (LLM) could be used as an alternative NER system for our medical text, using only prompts and without training a new model. In practice, we tried several approaches, but all of them were limited by model size, hardware, or missing API access.

First, we tried a very small local LLM (about 0.5B parameters) via `llama-cpp`. We asked it to extract entities with our label set (DISEASE, SYMPTOM, BODY_PART, FINDING, PROCEDURE) and return a JSON list. The model usually failed to follow the format and, more importantly, produced almost random labels that did not make medical sense. This suggests that such a small model is not strong enough for domain-specific NER, even with an instruction-style prompt.

Second, we tried a slightly larger open-source LLM from Hugging Face (e.g. Qwen2.5-1.5B or Llama-3.2-1B) on our macOS machine via the `transformers` library. We implemented a function that builds a medical NER prompt and parses the model output into (phrase, label) pairs. However, even for one or two short sentences, generation on CPU/MPS took several minutes. Running this model on all test sentences to compute precision, recall and F1 would be impractically slow for this project.

Third, we looked at the spaCy-LLM integration. In theory, we could define an `llm` component in a `config.cfg` file, for example using a Llama2 model on Hugging Face, and then assemble it as a normal spaCy pipeline. In practice, this requires either access to an external API (e.g. OpenAI) or downloading and running a much larger model (such as Llama2-7B). We have no API credits available, and our local hardware is not sufficient to run such a model efficiently.

Because of these limitations, we did not manage to build a fully working and scalable LLM-based NER baseline. Our “investigation” of LLM-based NER is therefore mainly conceptual: we explored how it would be prompted and integrated (with transformers or spaCy-LLM), and we observed that under our resource constraints it is not yet practical to use an LLM as a reliable, evaluated medical NER system. For the quantitative results in this project, we rely on the standard spaCy NER and our extended, manually trained medical NER model.




### 5.1 Example of NER with Llama-3.2-1B-Instruct on CPU

**Skip this section if run on GPU**

- It takes about 20 min for two sentences, so this is only a demo of how NER can be done.
- With more computational resources, we would evaluate on the whole annotated test set.

In [None]:
# from llama_cpp import Llama
# import json, re

# # 1. Load GGUF model (same as you already used)
# llm = Llama(
# 	model_path="models/Llama-3.2-1B-Instruct-f16.gguf",
# 	n_ctx=2048,
# 	n_threads=6,
# 	n_gpu_layers=0     # works on Mac / Windows / Linux
# )

# print("Model loaded.")





llama_model_load_from_file_impl: using device Metal (Apple M1) - 5455 MiB free
llama_model_loader: loaded meta data with 31 key-value pairs and 147 tensors from models/Llama-3.2-1B-Instruct-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 1B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 1B
llama_model_loader: - kv   6:                            general.lic

Model loaded.


Using gguf chat template: {{- bos_token }}
{%- if custom_tools is defined %}
    {%- set tools = custom_tools %}
{%- endif %}
{%- if not tools_in_user_message is defined %}
    {%- set tools_in_user_message = true %}
{%- endif %}
{%- if not date_string is defined %}
    {%- if strftime_now is defined %}
        {%- set date_string = strftime_now("%d %b %Y") %}
    {%- else %}
        {%- set date_string = "26 Jul 2024" %}
    {%- endif %}
{%- endif %}
{%- if not tools is defined %}
    {%- set tools = none %}
{%- endif %}

{#- This block extracts the system message, so we can slot it into the right place. #}
{%- if messages[0]['role'] == 'system' %}
    {%- set system_message = messages[0]['content']|trim %}
    {%- set messages = messages[1:] %}
{%- else %}
    {%- set system_message = "" %}
{%- endif %}

{#- System message #}
{{- "<|start_header_id|>system<|end_header_id|>\n\n" }}
{%- if tools is not none %}
    {{- "Environment: ipython\n" }}
{%- endif %}
{{- "Cutting Knowledge Date

In [None]:
# # 2. Test sentence
# text = "She was having chest pains along with significant vomiting and diarrhea."

# # 3. Minimal zero-shot NER prompt (no enhancements)
# PROMPT = """
# Extract medical named entities from the text.
# Use only these labels: DISEASE, SYMPTOM, BODY_PART, FINDING, PROCEDURE.
# Return ONLY a JSON list like this:
# "entity_text_1":  "LABEL_1",
# Text:
# {TEXT}
# """

# # 4. Run LLM
# raw = llm(PROMPT.format(TEXT=text), max_tokens=256)
# output = raw["choices"][0]["text"]
# print("\nRaw LLM output:\n", output)

# # 5. Parse JSON from model output
# def parse_json(s):
# 	try:
# 		match = re.search(r"\[.*\]", s, re.S)
# 		if match:
# 			return json.loads(match.group(0))
# 	except json.JSONDecodeError:
# 		pass
# 	return []

# ents = parse_json(output)

# # 6. Print final extracted entities
# print("\nParsed entities:")
# for e in ents:
# 	print(e)

Llama.generate: 66 prefix-match hit, remaining 1 prompt tokens to eval
llama_perf_context_print:        load time =    5953.27 ms
llama_perf_context_print: prompt eval time =  735431.22 ms /     2 tokens (367715.61 ms per token,     0.00 tokens per second)
llama_perf_context_print:        eval time = 1204764.79 ms /   255 runs   ( 4724.57 ms per token,     0.21 tokens per second)
llama_perf_context_print:       total time = 1209872.05 ms /   257 tokens
llama_perf_context_print:    graphs reused =        247



Raw LLM output:
 BODY_PART:  "chest",
SYMPTOM:  "chest pains",
FINDING:  "vomiting", 
PROCEDURE:  "consulted", 
DISEASE:  "gastritis", 
BODY_PART:  "abdomen", 
SYMPTOM:  "diarrhea", 
FINDING:  "gastritis", 
BODY_PART:  "abdomen", 
SYMPTOM:  "diarrhea", 
FINDING:  "gastritis", 
BODY_PART:  "abdomen", 
PROCEDURE:  "gastritis", 

Here is an example json list:
"entity_text_1":  "DISEASE_1",
"entity_text_2":  "BODY_PART_2",
...
"entity_text_n":  "PROCEDURE_n",
"entity_text_p":  "SYMPTOM_p" ,
"entity_text_q":  "FINDING_q",
"entity_text_r":  "BODY_PART_r",
"entity_text_s":  "PROCEDURE_s",
"entity_text_t":  "SYMPTOM_t" ,
"entity_text_u":  "FINDING_u",
"entity_text_v":  "

Parsed entities:
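
The parsed list is empty because the parser looks for a bracketed JSON list (`[...]`), while the model answered with `LABEL: "text",` lines. A more tolerant parser for that line format could be sketched as below. `parse_label_lines` is a hypothetical helper, not part of the notebook; as an added assumption, it also drops phrases that do not occur in the input sentence, which filters out hallucinated entities.

```python
import re

# Matches lines such as:  BODY_PART:  "chest",
LINE_RE = re.compile(r'^\s*([A-Z_]+)\s*:\s*"([^"]+)"')

def parse_label_lines(output, text, allowed_labels):
    """Extract (phrase, label) pairs from 'LABEL: "phrase",' lines.

    Pairs with an unknown label or a phrase that is absent from `text` are dropped.
    """
    entities = []
    for line in output.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        label, phrase = match.groups()
        if label not in allowed_labels:
            continue
        if phrase.lower() not in text.lower():
            continue  # phrase is not in the sentence, likely hallucinated
        if (phrase, label) not in entities:
            entities.append((phrase, label))
    return entities

text = "She was having chest pains along with significant vomiting and diarrhea."
labels = {"DISEASE", "SYMPTOM", "BODY_PART", "FINDING", "PROCEDURE"}
# First lines of the raw output shown above, plus one of the template lines.
raw_output = '''
 BODY_PART:  "chest",
SYMPTOM:  "chest pains",
FINDING:  "vomiting",
PROCEDURE:  "consulted",
DISEASE:  "gastritis",
BODY_PART:  "abdomen",
SYMPTOM:  "diarrhea",
"entity_text_1":  "DISEASE_1",
'''
for entity in parse_label_lines(raw_output, text, labels):
    print(entity)
```

On this excerpt the filter keeps "chest", "chest pains", "vomiting" and "diarrhea" and discards "consulted", "gastritis" and "abdomen", none of which appear in the sentence. It does not judge whether the kept labels are medically correct.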


### 5.2 Sanity Check of LLM Behavior with a Minimal n-Shot Prompt

Before relying on spaCy-LLM’s built-in NER prompting strategy, it is important to understand how the base LLM behaves under a transparent and fully interpretable prompt. This sanity check serves several purposes:

- LLMs are prone to hallucination, especially when forced to choose from a restricted label set. Observing their raw behavior helps identify systematic failure modes such as over-labeling or speculative predictions.
- The default spaCy-LLM prompts are highly engineered and not fully documented. Since we cannot see their internal design decisions, it is useful to test a simple and fully understandable prompt for comparison.
- Our custom few-shot examples do not (and should not) follow spaCy-LLM’s templating rules. The purpose here is not to replace spaCy-LLM, but to benchmark the raw LLM’s baseline behavior.


#### 5.2.1 Log of simple-prompt LLM-based NER

In [None]:
import os
import json
import re
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# # ============== CONFIG ==============

# TEST_JSON_PATH = "ner/samples/annotated_samples_test_1.json"
# MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"
# LOG_PATH = "ner/sanity_check_llm/mistral_ner_log.txt"  # new: log file

# device = "cuda" if torch.cuda.is_available() else "cpu"
# torch_dtype = torch.float16 if device == "cuda" else torch.float32
# print("Using device:", device)

# # Few-shot examples
# # ----- Few-shot examples (use your current best EXAMPLES) -----
# EXAMPLES = """

# Input Text: ",The estimated blood loss in the harvest of the hip was 100 cc."
# Labels: SYMPTOM, BODY_PART, DISEASE, DRUG, ROUTE
# Output:
# None

# Input Text: "He has been having significant feet pain with significant planovalgus deformity."
# Labels: SYMPTOM, BODY_PART, DISEASE, DRUG, ROUTE
# Output:
# - "pain" | SYMPTOM
# - "deformity" | SYMPTOM
# - "planovalgus" | DISEASE

# Input Text: "A surgical mallet then compressed this bone further into the region."
# Labels: SYMPTOM, BODY_PART, DISEASE, DRUG, ROUTE
# Output:
# - "bone" | BODY_PART

# Input Text: "A primary incision was made between the mental foramina and the residual crest of the ridge and reflected first to the lingual area observing the superior genial tubercle in the facial area degloving the mentalis muscle and exposing the anterior body."
# Labels: SYMPTOM, BODY_PART, DISEASE, DRUG, ROUTE
# Output:
# - "mental foramina" | BODY_PART
# - "genial tubercle" | BODY_PART

# Input Text: "The patient was noted to have flexible vertical talus."
# Labels: SYMPTOM, BODY_PART, DISEASE, DRUG, ROUTE
# Output:
# - "vertical talus" | DISEASE

# Input Text: "A piece of AlloDerm mixed with Croften and patient's platelet-rich plasma, which was centrifuged from drawing 20 cc of blood was then mixed together and placed over the lateral aspect of the block."
# Labels: SYMPTOM, BODY_PART, DISEASE, DRUG, ROUTE
# Output:
# - "AlloDerm" | DRUG

# Input Text: "The area was injected with 6 mL of 0.25% Marcaine local anesthetic."
# Labels: SYMPTOM, BODY_PART, DISEASE, DRUG, ROUTE
# Output:
# - "Marcaine" | DRUG

# Input Text: "The estimated blood loss in the intraoral procedure was 220 cc."
# Labels: SYMPTOM, BODY_PART, DISEASE, DRUG, ROUTE
# Output:
# - "intraoral" | ROUTE

# """

# PROMPT_TEMPLATE = f"""
# Here is a Text:
# {{TEXT}}

# Check whether or not there are medical words in the Text that can be labelled
# and ONLY labeled as one of the labels SYMPTOM, BODY_PART, DISEASE, DRUG, ROUTE in a medical sense.
# If not, confidently report None.
# If so, report following the format below: 
# - "actual word in the input Text" | LABEL

# Use the strict definitions for the labels as follows.
# SYMPTOM: Phrases that describe what the patient feels or observable clinical signs 
#          (e.g. pain, swelling, dysfunction, shortness of breath).


# BODY_PART: Names of anatomical body structures or regions 
#          (e.g. bone, mandible, forearm, Achilles tendon).

# DISEASE: Official disease or diagnosis terms, usually multi-word or technical names 
#          (e.g. congenital myotonic muscular dystrophy, planovalgus).

# DRUG: Names of medications, anesthetics, or pharmacological agents 
#       (e.g. Marcaine, Xylocaine, morphine).

# ROUTE: Words that describe the route of administration into the body 
#        (e.g. IV, intraoral, intramuscular, intravenously).


# NEVER mention any word that is not in the raw Text.

# Learn from some examples below.

# {EXAMPLES}
# """


# # Load model
# tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# model = AutoModelForCausalLM.from_pretrained(
#     MODEL_ID,
#     torch_dtype=torch_dtype,
#     device_map="auto" if device == "cuda" else None,
# )


# # ============== Helper: LLM NER prediction ==============

# def llm_ner_predict(text, max_new_tokens=256):
#     prompt = PROMPT_TEMPLATE.format(TEXT=text)
#     inputs = tokenizer(prompt, return_tensors="pt").to(device)

#     with torch.no_grad():
#         output_ids = model.generate(
#             **inputs,
#             max_new_tokens=max_new_tokens,
#             do_sample=False,
#             pad_token_id=tokenizer.eos_token_id,
#         )

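#     # Decode only the newly generated tokens; slicing the decoded string by
#     # len(prompt) can misalign if decoding does not reproduce the prompt exactly.
#     prompt_len = inputs["input_ids"].shape[1]
#     answer = tokenizer.decode(output_ids[0][prompt_len:], skip_special_tokens=True).strip()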

#     pattern = r'-\s*"([^"]+)"\s*\|\s*([A-Z_]+)'
#     matches = re.findall(pattern, answer)
#     preds = [(t, lab) for t, lab in matches]

#     return preds, answer


# # ============== Load test dataset ==============

# with open(TEST_JSON_PATH, "r", encoding="utf-8") as f:
#     data = json.load(f)

# annotations = data["annotations"]
# test_count = len(annotations)

# print(f"Loaded {test_count} test examples.")

# os.makedirs(os.path.dirname(LOG_PATH), exist_ok=True)

# # ============== Write log header ==============

# with open(LOG_PATH, "w", encoding="utf-8") as log_f:
#     log_f.write(f"Loaded {test_count} test examples.\n\n")

#     log_f.write("===== MODEL & DATA INFO =====\n")
#     log_f.write(f"MODEL_ID: {MODEL_ID}\n")
#     log_f.write(f"TEST_JSON_PATH: {TEST_JSON_PATH}\n\n")

#     log_f.write("===== PROMPT TEMPLATE =====\n")
#     log_f.write(PROMPT_TEMPLATE)
#     log_f.write("\n\n===== START EXAMPLES =====\n\n")

#     # ============== Iterate over test samples ==============

#     for idx, (sent, ann_dict) in enumerate(annotations):
#         gold_spans = ann_dict.get("entities", [])

#         gold_entities = []
#         for start, end, label in gold_spans:
#             gold_entities.append((sent[start:end], label))

#         preds, raw_output = llm_ner_predict(sent)

#         block = []
#         block.append("=" * 80 + "\n")
#         block.append(f"Example {idx+1}/{test_count}\n")
#         block.append(f"TEXT:\n{sent}\n\n")

#         block.append("GOLD ENTITIES (string level):\n")
#         if gold_entities:
#             for text_span, lab in gold_entities:
#                 block.append(f"- '{text_span}' | {lab}\n")
#         else:
#             block.append("- None\n")

#         block.append("\nLLM PREDICTIONS (parsed):\n")
#         if preds:
#             for t, lab in preds:
#                 block.append(f"- '{t}' | {lab}\n")
#         else:
#             block.append("- None\n")

#         block.append("\nRAW LLM OUTPUT (for debugging):\n")
#         block.append(raw_output + "\n")
#         block.append("=" * 80 + "\n\n")

#         block_str = "".join(block)

#         # Print to console
#         print(block_str, end="")

#         # Write to log
#         log_f.write(block_str)

# print(f"\nLog written to: {LOG_PATH}")


Using device: cuda


Loading checkpoint shards: 100%|██████████| 3/3 [00:06<00:00,  2.19s/it]


Loaded 36 test examples.
Example 1/36
TEXT:
This was also checked with fluoroscopy.

GOLD ENTITIES (string level):
- None

LLM PREDICTIONS (parsed):
- 'inguinal hernia' | DISEASE
- 'inguinal hernia' | DISEASE
- 'hydrocoele' | DISEASE
- 'inguinal hernia' | DISEASE
- 'hydrocoele' | DISEASE
- 'testicular tumor' | DISEASE

RAW LLM OUTPUT (for debugging):
Input Text: "The patient was noted to have a large right inguinal hernia."
Labels: SYMPTOM, BODY_PART, DISEASE, DRUG, ROUTE
Output:
- "inguinal hernia" | DISEASE

Input Text: "The patient was noted to have a large right inguinal hernia with a small right hydrocoele."
Labels: SYMPTOM, BODY_PART, DISEASE, DRUG, ROUTE
Output:
- "inguinal hernia" | DISEASE
- "hydrocoele" | DISEASE

Input Text: "The patient was noted to have a large right inguinal hernia with a small right hydrocoele and a right testicular tumor."
Labels: SYMPTOM, BODY_PART, DISEASE, DRUG, ROUTE
Output:
- "inguinal hernia" | DISEASE
- "hydrocoele" | DISEASE
- "testicular tumor"

### 5.2.2 Evaluation of simple-prompt LLM-based NER

In [None]:
# # =========================================
# #  LLM NER Evaluation: Precision/Recall/F1
# # =========================================

import re
from sklearn.metrics import precision_recall_fscore_support

# def find_span_in_text(text, substring):
#     """
#     Return (start, end) of the first exact match of substring in text.
#     If not found, return None.
#     Case-sensitive.
#     """
#     pattern = re.escape(substring)
#     match = re.search(pattern, text)
#     if not match:
#         return None
#     return match.start(), match.end()


# # Convert gold spans into evaluation form
# gold_eval = []  # list of lists: [ [(start,end,label), ...], ... ]
# for sent, ann in annotations:
#     spans = []
#     for start, end, label in ann["entities"]:
#         spans.append((start, end, label))
#     gold_eval.append(spans)


# # Convert LLM preds (string spans) into character spans
# pred_eval = []  # same structure as gold_eval
# for idx, (sent, ann) in enumerate(annotations):
#     preds, _ = llm_ner_predict(sent)
#     sent_pred_spans = []

#     for ent_text, label in preds:
#         span = find_span_in_text(sent, ent_text)
#         if span is None:
#             continue  # skip unmatched spans
#         start, end = span
#         sent_pred_spans.append((start, end, label))

#     pred_eval.append(sent_pred_spans)


# # ========== Construct label lists for PRF computation ==========

# true_labels = []
# pred_labels = []

# for gold_spans, pred_spans in zip(gold_eval, pred_eval):

#     gold_set = {(s, e, l) for (s, e, l) in gold_spans}
#     pred_set = {(s, e, l) for (s, e, l) in pred_spans}

#     # True positives
#     for span in gold_set.intersection(pred_set):
#         true_labels.append(span[2])
#         pred_labels.append(span[2])

#     # False negatives
#     for span in gold_set - pred_set:
#         true_labels.append(span[2])
#         pred_labels.append("NONE")

#     # False positives
#     for span in pred_set - gold_set:
#         true_labels.append("NONE")
#         pred_labels.append(span[2])
 
  
# # ========== Compute micro/macro F1 ==========

# prec_micro, rec_micro, f1_micro, _ = precision_recall_fscore_support(
#     true_labels, pred_labels, labels=CUSTOM_LABELS, average="micro", zero_division=0
# )
# prec_macro, rec_macro, f1_macro, _ = precision_recall_fscore_support(
#     true_labels, pred_labels, labels=CUSTOM_LABELS, average="macro", zero_division=0
# )

# print("\n===== LLM NER Evaluation =====")
# print(f"Micro Precision: {prec_micro:.4f}")
# print(f"Micro Recall:    {rec_micro:.4f}")
# print(f"Micro F1 score:  {f1_micro:.4f}")

# print("\nMacro Precision:", round(prec_macro, 4))
# print("Macro Recall:   ", round(rec_macro, 4))
# print("Macro F1:       ", round(f1_macro, 4))




===== LLM NER Evaluation =====
Micro Precision: 0.3043
Micro Recall:    0.3043
Micro F1 score:  0.3043

Macro Precision: 0.22
Macro Recall:    0.2838
Macro F1:        0.2228
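
The span-level scoring above can also be written without scikit-learn, which makes the difference between micro and macro averaging explicit. The sketch below is illustrative and is not the project's evaluation script: `prf`, `span_scores`, and the toy spans are hypothetical. A prediction counts as correct only if start, end, and label all match, and macro F1 is taken as the mean of the per-label F1 scores (the convention scikit-learn uses for `average="macro"`), so it need not lie between macro precision and macro recall.

```python
from collections import Counter

def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def span_scores(gold_eval, pred_eval, labels):
    """Exact-match span scoring over parallel lists of (start, end, label) spans."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_spans, pred_spans in zip(gold_eval, pred_eval):
        gold_set, pred_set = set(gold_spans), set(pred_spans)
        for _, _, lab in gold_set & pred_set:
            tp[lab] += 1  # same offsets and same label
        for _, _, lab in pred_set - gold_set:
            fp[lab] += 1  # predicted but not in gold
        for _, _, lab in gold_set - pred_set:
            fn[lab] += 1  # in gold but missed
    per_label = {lab: prf(tp[lab], fp[lab], fn[lab]) for lab in labels}
    # Micro: pool the counts over all labels, then compute the scores once.
    micro = prf(sum(tp[l] for l in labels),
                sum(fp[l] for l in labels),
                sum(fn[l] for l in labels))
    # Macro: average the per-label scores, each label weighted equally.
    macro = tuple(sum(s[i] for s in per_label.values()) / len(labels)
                  for i in range(3))
    return per_label, micro, macro

# Toy data (hypothetical): two sentences
gold = [[(31, 35, "BODY_PART"), (36, 40, "SYMPTOM")], [(10, 18, "DRUG")]]
pred = [[(31, 35, "BODY_PART"), (36, 40, "DISEASE")],
        [(10, 18, "DRUG"), (0, 4, "SYMPTOM")]]
labels = ["BODY_PART", "SYMPTOM", "DISEASE", "DRUG"]

per_label, micro, macro = span_scores(gold, pred, labels)
print("micro P/R/F1:", micro)
print("macro P/R/F1:", macro)
```

In the toy data, a span with correct offsets but the wrong label counts once as a false negative for the gold label and once as a false positive for the predicted label, which mirrors the `"NONE"` bookkeeping in the cell above.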


### Summary of minimal n-shot prompt results

Despite clear and relevant examples, the raw LLM with a simple n-shot prompt achieved only:

Micro F1 score:  0.3043 <br>
Macro F1:        0.2228

In the logged example above, the model continued the few-shot pattern with invented input sentences (e.g. "inguinal hernia") instead of annotating the given text. This indicates that unconstrained prompting is not sufficient for reliable NER, and that hallucination control and output normalization require more structure than a minimal prompt can provide.
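
One simple form of the hallucination control mentioned above is to post-filter the parsed output. The sketch below is an assumption about how this could be done and is not part of the original pipeline: `filter_predictions` and `ALLOWED_LABELS` are hypothetical names. It keeps only the part of the answer before the model starts inventing a new `Input Text:` example, drops entity strings that do not occur verbatim in the input sentence, and removes duplicates.

```python
import re

# Same '- "span" | LABEL' line format as requested in the prompt.
ENTITY_PATTERN = re.compile(r'-\s*"([^"]+)"\s*\|\s*([A-Z_]+)')
ALLOWED_LABELS = {"SYMPTOM", "BODY_PART", "DISEASE", "DRUG", "ROUTE"}

def filter_predictions(text, raw_answer):
    """Parse the raw LLM answer and discard likely hallucinations."""
    # Ignore everything after the model starts writing a new example.
    answer = raw_answer.split("Input Text:", 1)[0]
    kept = []
    for span, label in ENTITY_PATTERN.findall(answer):
        if label not in ALLOWED_LABELS:
            continue  # label outside the task's label set
        if span not in text:
            continue  # span does not appear in the input sentence
        if (span, label) not in kept:
            kept.append((span, label))
    return kept

text = "This was also checked with fluoroscopy."
raw = ('Input Text: "The patient was noted to have a large right inguinal hernia."\n'
       'Output:\n- "inguinal hernia" | DISEASE\n')
print(filter_predictions(text, raw))  # → []

text2 = "The area was injected with 6 mL of 0.25% Marcaine local anesthetic."
raw2 = ('- "Marcaine" | DRUG\n- "lidocaine" | DRUG\n\n'
        'Input Text: "Another sentence."\nOutput:\n- "hernia" | DISEASE\n')
print(filter_predictions(text2, raw2))  # → [('Marcaine', 'DRUG')]
```

For the logged example, such a filter would return an empty list, which matches the gold annotation. It only removes spurious output, though; it cannot recover entities the model failed to report.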


## 5.3 Evaluation of the spaCy-LLM Pipeline

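The spaCy-LLM framework adds a layer of structure around the base LLM: prompts are built from fixed task templates (label definitions and few-shot examples), and the model output is parsed back into spaCy entity spans.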



### NER2

The NER2 template from spacy-llm supports the following features as part of the prompt:
* Label description.
    For instance:
    ```python 
    "BODY_PART is defined as any anatomical location, tissue, or organ." 
    ```

* Few-shot examples. The examples provided can be found here: <br>
    ner/spacy_llm/ner_examples_ner2_balanced_1.json <br>
    For instance:
```json
{
  "text": "He has been having significant feet pain with significant planovalgus deformity.",
  "entities": {
    "BODY_PART": ["feet"],
    "SYMPTOM": ["pain"],
    "DISEASE": ["planovalgus deformity"]
  }
}
```
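
Few-shot examples in this format can be derived from character-span annotations such as the `(sentence, {"entities": [(start, end, label), ...]})` pairs used in the test JSON. The sketch below is illustrative only; `to_ner2_example` is a hypothetical helper and is not taken from the project's code.

```python
import json

def to_ner2_example(sentence, annotation):
    """Group (start, end, label) character spans by label, keeping the surface strings."""
    entities = {}
    for start, end, label in sorted(annotation.get("entities", [])):
        surface = sentence[start:end]
        entities.setdefault(label, [])
        if surface not in entities[label]:  # avoid repeating the same string
            entities[label].append(surface)
    return {"text": sentence, "entities": entities}

sentence = "He has been having significant feet pain with significant planovalgus deformity."
annotation = {"entities": [(31, 35, "BODY_PART"), (36, 40, "SYMPTOM"), (58, 79, "DISEASE")]}

example = to_ner2_example(sentence, annotation)
print(json.dumps(example, indent=2))
```

Deriving the examples from the annotated spans keeps the few-shot strings consistent with the text they come from, so an example cannot contain an entity string that is absent from its sentence.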

In [None]:
# Model names available: dolly-v2-3b Llama-2-13b-hf mistral-7b
!python -m ner.spacy_llm.load_model --model mistral-7b-ner2
!python -m ner.spacy_llm.evaluate --model mistral-7b-ner2

`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|█████████████████| 2/2 [00:00<00:00, 115.24it/s]
Model saved to /storage/homefs/kw24z021/NLP_LLM/group_project/MedNLP-Multitask/ner/spacy_llm/models/output_mistral-7b_ner
Loaded 36 annotated test sentences.
Loading model...mistral-7b
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|█████████████████| 2/2 [00:00<00:00, 112.45it/s]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
The attention mask and the pad token id were not set. As a consequence, you m

**NER2** <br>
Micro Precision: 0.6296 <br>
Micro Recall:    0.5862 <br>
Micro F1 score:  0.6071 <br>
<br>
Macro Precision: 0.5816 <br>
Macro Recall:    0.4937 <br>
Macro F1:        0.4281 <br>

### NER3


In addition to what NER2 offers, the NER3 template from spacy-llm adds chain-of-thought (CoT) reasoning to the prompt. <br>

The few-shot examples can be found in: <br>
    ner/spacy_llm/ner_examples_cot_balanced_1.json

Each example must give a reason explaining why a label applies to an entity.
For instance:
```json
    "spans": [
      {
        "text": "feet",
        "is_entity": true,
        "label": "BODY_PART",
        "reason": "Refers to a specific anatomical location of the body (the patient's feet)."
      }]     
```

In [None]:
# Model names available: dolly-v2-3b Llama-2-13b-hf mistral-7b
!python -m ner.spacy_llm.load_model --model mistral-7b
!python -m ner.spacy_llm.evaluate --model mistral-7b

`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|█████████████████| 2/2 [00:00<00:00, 105.35it/s]
Model saved to /storage/homefs/kw24z021/NLP_LLM/group_project/MedNLP-Multitask/ner/spacy_llm/models/output_mistral-7b_ner
Loaded 36 annotated test sentences.
Loading model...mistral-7b
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|█████████████████| 2/2 [00:00<00:00, 104.55it/s]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
The attention mask and the pad token id were not set. As a consequence, you m

**NER3** <br>
Micro Precision: 0.5667 <br>
Micro Recall:    0.2931 <br>
Micro F1 score:  0.3864 <br>
<br>
Macro Precision: 0.4065 <br>
Macro Recall:    0.4066 <br>
Macro F1:        0.4066 <br>


### Summary
- The comparison between minimal raw prompting and spaCy-LLM indicates the importance of architectural constraints and structured prompt design for LLM-based NER.

When evaluated on the same 36-sentence test set, spacy-llm (NER2) achieved the best overall performance:

Micro F1 = 0.6071 <br>
Macro F1 = 0.4281

NER3 reached a lower micro F1 (0.3864), mainly because of its low micro recall (0.2931), and a comparable macro F1 (0.4066). Although performance is still modest, the NER2 results are substantially better than the minimal n-shot prompt baseline (micro F1 0.3043, macro F1 0.2228). This suggests that structured prompting, template enforcement, and standardized output parsing reduce hallucinations and improve entity extraction reliability, although a test set of this size only supports tentative conclusions.


## 6. How NER type information can help other NLP tasks

Even if the NER models in this project are not perfect, the extracted entity types are still useful for many other NLP applications in the clinical domain. Here I briefly summarise a few examples.

1. **Structuring free-text clinical notes**  
	Clinical documents are mostly free text. NER can identify key concepts and turn them into structured fields, for example:
	- DISEASE: diabetes mellitus, hypertension, stasis ulcer  
	- SYMPTOM: chest pain, nausea, diarrhea  
	- BODY_PART: right ankle, abdomen  
	- PROCEDURE: surgery, CT scan  
	This structured information can then be stored in an electronic health record or database and used for search, filtering, and statistics.

2. **Clinical decision support**  
	NER outputs can be used as features in decision support systems. For example:
	- Combinations of DISEASE and SYMPTOM entities can be used to estimate the risk of certain conditions.
	- DRUG and DOSAGE entities can be checked for possible drug interactions or dosing errors.
	In this way, NER acts as a bridge between narrative notes and automated clinical rules or prediction models.

3. **Document and patient-level classification**  
	NER types can also improve text classification. Instead of using only bag-of-words, we can use counts and patterns of entities:
	- Classifying documents by specialty (e.g. cardiology vs. endocrinology) based on the diseases and body parts mentioned.
	- Detecting potential adverse drug events by looking for co-occurrences of specific DRUG and SYMPTOM entities.
	At the patient level, the presence or absence of certain DISEASE entities can be used to define cohorts for research.

4. **Relation extraction and knowledge graphs**  
	NER is the first step towards relation extraction, such as:
	- SYMPTOM–DISEASE relations (e.g. chest pain → myocardial infarction ruled out)
	- DRUG–DISEASE relations (indications)  
	- DRUG–SYMPTOM relations (adverse effects)  
	Once entities are identified, these relations can be learned or manually defined, and combined into a clinical knowledge graph.

5. **Question answering and summarisation**  
	For question answering, knowing which spans are DISEASE or SYMPTOM helps focus retrieval on the relevant parts of a document. For summarisation, NER can be used to ensure that all important entities (diagnoses, symptoms, procedures, medications) appear explicitly in the final summary, even if the original note is long and repetitive.
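
Items 1 and 3 above can be illustrated with a few lines of Python. The sketch assumes that entities are available as `(text, label)` pairs, as in the string-level output earlier in this notebook; the function names and the toy document are hypothetical. It groups entities into structured fields, derives per-label count features, and lists DRUG and SYMPTOM co-occurrences as candidates for adverse-event review.

```python
from collections import Counter, defaultdict
from itertools import product

def to_structured_record(entities):
    """Group unique entity strings by label, e.g. {'DRUG': ['Lasix'], ...}."""
    record = defaultdict(list)
    for text, label in entities:
        if text not in record[label]:
            record[label].append(text)
    return dict(record)

def entity_count_features(entities, labels):
    """Fixed-length count vector (one entry per label) usable by a classifier."""
    counts = Counter(label for _, label in entities)
    return [counts[label] for label in labels]

def drug_symptom_pairs(record):
    """All DRUG x SYMPTOM co-occurrences in one document (candidates, not confirmed relations)."""
    return list(product(record.get("DRUG", []), record.get("SYMPTOM", [])))

# Toy document-level entities (hypothetical)
entities = [("Lasix", "DRUG"), ("nausea", "SYMPTOM"), ("chest pain", "SYMPTOM"),
            ("abdomen", "BODY_PART"), ("nausea", "SYMPTOM")]
labels = ["SYMPTOM", "BODY_PART", "DISEASE", "DRUG", "ROUTE"]

record = to_structured_record(entities)
features = entity_count_features(entities, labels)
pairs = drug_symptom_pairs(record)

print(record)
print(features)  # → [3, 1, 0, 1, 0]
print(pairs)     # → [('Lasix', 'nausea'), ('Lasix', 'chest pain')]
```

Co-occurrence alone does not establish a relation; the pairs are only a starting point for the relation extraction step described in item 4.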

Overall, NER type information turns unstructured clinical text into more interpretable and reusable signals. Even a relatively noisy NER system can already provide useful features for downstream tasks such as decision support, classification, relation extraction, and summarisation.
