TODO:

Update part 1 with new labels

Check if 3.1 runs

# Task 2: Data Exploration and Processing

## 1. Manual data inspection
- Investigate which standard and potential new NER types are most prominent in your data set (i.e., manual data inspection)

Manual inspection of the data shows that the most prominent entity types are DATE, BODY_PART, DOSAGE, MEASUREMENT, DRUG and SYMPTOM.

Examples:
 1. DATE: 12/20/2005, 1/19/96
 2. BODY_PART: nose, abdomen, knee
 3. DOSAGE:  10/40 mg one a day, 0.25 micrograms a day, 50 mg twice a day, 10 ml
 4. MEASUREMENT: 3.98 kg, 8mm, pulse of 84, blood pressure 108/65
 5. DRUG:  Vytorin, Rocaltrol, Carvedilol, Cozaar,  Lasix
 6. SYMPTOM: erythematous, chest pain, constipated

Of these, DATE is a **standard entity type in spaCy**; the remaining ones are specific to the medical domain.
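
To speed up this kind of manual inspection, simple regular expressions can surface candidate spans for the pattern-like types (DATE, DOSAGE, MEASUREMENT). The sketch below is only an illustration: the patterns and the helper name `find_candidates` are assumptions made here, not part of the notebook, and they will miss many real cases.

```python
import re

# Hypothetical, deliberately simple patterns for pattern-like entity types.
CANDIDATE_PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "DOSAGE": re.compile(r"\b\d+(?:\.\d+)?(?:/\d+)?\s?(?:mg|ml|micrograms)\b", re.I),
    "MEASUREMENT": re.compile(r"\b\d+(?:\.\d+)?\s?(?:kg|mm|cm)\b", re.I),
}

def find_candidates(text):
    """Return (start, end, label, matched_text) tuples for every pattern hit."""
    hits = []
    for label, pattern in CANDIDATE_PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((m.start(), m.end(), label, m.group(0)))
    return sorted(hits)

sample = "Seen on 12/20/2005. Weight 3.98 kg. Carvedilol 50 mg twice a day."
hits = find_candidates(sample)
for hit in hits:
    print(hit)
```

Types such as DRUG, BODY_PART and SYMPTOM have no fixed surface pattern, so they still need to be read in context.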

## 2. Apply the standard NER classifier of spaCy


### Imports

In [79]:
from datasets import load_dataset
import pandas as pd
import numpy as np
import spacy
import random
import json
import re

%load_ext autoreload
%autoreload 2


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [80]:
%reload_ext autoreload

### Load Data

In [51]:
# ============================
# 1: Load Dataset
# ============================
dataset = load_dataset("argilla/medical-domain", split="train")
print("Features available")
print(dataset.column_names)
print("Format of 'prediction' column")
print(dataset.features['prediction'])
print("Dataset length: ", len(dataset))

Features available
['text', 'inputs', 'prediction', 'prediction_agent', 'annotation', 'annotation_agent', 'multi_label', 'explanation', 'id', 'metadata', 'status', 'event_timestamp', 'metrics']
Format of 'prediction' column
List({'label': Value('string'), 'score': Value('float64')})
Dataset length:  4966


### Filter such that we keep only samples pertaining to 'Surgery'

In [52]:
dataset_slimmed = dataset.filter(
	lambda row: (
		isinstance(row["text"], str)
        and row["text"] != ''
		and 'Surgery' in str(row["prediction"][0])
	)
)
dataset = dataset_slimmed
print(len(dataset))

Filter:   0%|          | 0/4966 [00:00<?, ? examples/s]

Filter: 100%|██████████| 4966/4966 [00:00<00:00, 14060.83 examples/s]

1115
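
The filter above matches `'Surgery'` as a substring of the stringified first prediction, so it also accepts any label that merely contains that word, and it raises an `IndexError` if a row has an empty `prediction` list. A stricter variant could compare the `label` field directly. This is a sketch under assumptions: `is_surgery_row` is a hypothetical helper, the example rows (including their label strings) are invented to mimic the printed `prediction` format, and whether the narrower match is desirable depends on the intended scope.

```python
def is_surgery_row(row, category="Surgery"):
    """True if the row has non-empty text and its top prediction label equals `category`."""
    text = row.get("text")
    predictions = row.get("prediction") or []
    if not isinstance(text, str) or not text or not predictions:
        return False
    # Compare the label field itself, ignoring surrounding whitespace.
    return predictions[0]["label"].strip() == category

# Hypothetical rows that mimic the 'prediction' format printed above.
rows = [
    {"text": "Laparoscopic appendectomy ...", "prediction": [{"label": " Surgery", "score": 1.0}]},
    {"text": "Breast augmentation ...", "prediction": [{"label": "Cosmetic / Plastic Surgery", "score": 1.0}]},
    {"text": "", "prediction": [{"label": "Surgery", "score": 1.0}]},
    {"text": "No label", "prediction": []},
]
kept = [row for row in rows if is_surgery_row(row)]
print(len(kept))
```

Assuming each row is passed to the filter function as a plain dict, the helper could be used as `dataset.filter(is_surgery_row)`.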





In [42]:
# ================================================
# 2: Define Entity Schema
# ================================================
# Our final gold-label schema
CUSTOM_LABELS = ['SYMPTOM', 'BODY_PART', 'DISEASE', 'DRUG', 'ROUTE']

print("Our NER schema = ", CUSTOM_LABELS)


Our NER schema =  ['SYMPTOM', 'BODY_PART', 'DISEASE', 'DRUG', 'ROUTE']


In [55]:
# =========================================
# 3: Pick N samples manually
# =========================================
PATH_TO_ANNOTATIONS = "ner/samples/"
SEED = 42 
random.seed(SEED)
np.random.seed(SEED)

N_SAMPLES = 3   # select a proper number to contain 100 ground-truth NERs 
sampled_texts = []

for i in range(N_SAMPLES):
	row = dataset[np.random.randint(0, len(dataset))]
	text = row["text"]
	sampled_texts.append(text)

df_samples = pd.DataFrame({"text": sampled_texts})
df_samples.to_csv(PATH_TO_ANNOTATIONS + "unannotated_samples.csv", index=False)

print("Saved unannotated_samples.csv with", len(df_samples), "sentences")


Saved unannotated_samples.csv with 3 sentences
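
Note that drawing each index independently with `np.random.randint` can select the same row more than once. If distinct samples are wanted, sampling without replacement is one option. The sketch below uses only the standard library; `sample_indices` is a hypothetical helper and the seed value simply mirrors `SEED` above.

```python
import random

def sample_indices(n_rows, n_samples, seed=42):
    """Pick n_samples distinct row indices reproducibly."""
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    return rng.sample(range(n_rows), k=min(n_samples, n_rows))

indices = sample_indices(n_rows=1115, n_samples=3)
print(indices)
# With the filtered dataset this could become:
# sampled_texts = [dataset[i]["text"] for i in indices]
```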


## 3. Evaluation of Standard NER

### Perform manual annotations, then split into train/test 

 The annotation format is: 

	{
		"classes": List(labels), 
		"annotations": List(
			[str(text), dict("entities": List(List(start idx, end idx, class)))]
		)
	}
Where each item in "annotations" is a sentence.
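
As a concrete illustration of this format, the sketch below builds one annotated sentence and checks that labels and offsets are consistent. The sentence is invented, `validate` is a hypothetical helper, and we assume here that the end index is exclusive (Python slicing convention), which the format description does not state explicitly.

```python
example = {
    "classes": ["SYMPTOM", "BODY_PART", "DISEASE", "DRUG", "ROUTE"],
    "annotations": [
        ["Patient reports pain in the left knee.",
         {"entities": [[16, 20, "SYMPTOM"], [33, 37, "BODY_PART"]]}],
    ],
}

def validate(data):
    """Return a list of problems; an empty list means the structure is consistent."""
    problems = []
    classes = set(data["classes"])
    for text, meta in data["annotations"]:
        for start, end, label in meta["entities"]:
            if label not in classes:
                problems.append(f"unknown label {label!r}")
            if not (0 <= start < end <= len(text)):
                problems.append(f"bad offsets ({start}, {end}) in {text!r}")
    return problems

text, meta = example["annotations"][0]
covered = [text[start:end] for start, end, _ in meta["entities"]]
print(validate(example))
print(covered)
```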


#### Manual Annotations (with AI assistance)

In [None]:
import utils.annotations_utils as utils_ann # ChatGPT API called to identify keywords/phrases associated with entities

unannotated_samples_path = PATH_TO_ANNOTATIONS + "unannotated_samples.csv"
df = pd.read_csv(unannotated_samples_path, header=0)
text = df['text'].to_list()
num_examples_per_label = 30
annotations = utils_ann.chatgpt_annotate_text(CUSTOM_LABELS, text, num_examples_per_label)

In [73]:
for label in annotations.keys():
	print(f" Label: {label} ".center(30, '='))
	for indexed_words in annotations[label]:
		print(indexed_words)

[0, 'pain']
[1, 'deformity']
[2, 'dysfunction']
[3, 'atrophy']
[4, 'flexible talus']
[5, 'motion changes']
[6, 'sensation changes']
[7, 'bleeding']
[8, 'infection']
[9, 'swelling']
[10, 'muscular dystrophy']
[11, 'hematoma']
[12, 'thinning']
[13, 'resection']
[14, 'inflammation']
[15, 'weakness']
[16, 'sensitivity']
[17, 'numbness']
[18, 'impaired function']
[19, 'low mobility']
[20, 'tenderness']
[21, 'ache']
[22, 'limited motion']
[23, 'stiffness']
[24, 'difficulty']
[25, 'irritation']
[26, 'complications']
[27, 'malalignment']
[28, 'protrusion']
[29, 'increased risk']
[0, 'foot']
[1, 'arm']
[2, 'forearm']
[3, 'elbow']
[4, 'mandible']
[5, 'maxilla']
[6, 'leg']
[7, 'shoulder']
[8, 'talocalcaneal joint']
[9, 'extremity']
[10, 'muscle']
[11, 'thigh']
[12, 'spine']
[13, 'sinus']
[14, 'upper limb']
[15, 'arm']
[16, 'heel']
[17, 'lower surface']
[18, 'biceps']
[19, 'nerve']
[20, 'foot arch']
[21, 'vein']
[22, 'artery']
[23, 'bone']
[24, 'palate']
[25, 'buttock']
[26, 'cavity']
[27, 'genial

In [None]:
# Specify examples to exclude or modify
# NOTE: pop larger indices first to maintain index ordering of earlier items after deletion
annotations['SYMPTOM'].pop(29)

annotations['BODY_PART'][8][1] = 'joint'
annotations['BODY_PART'][14][1] = 'limb'
annotations['BODY_PART'].pop(17)

annotations['DISEASE'][0][1] = 'renal disease'

annotations['DRUG'][3][1] = 'anesthetics'
annotations['DRUG'][9][1] = 'saline'
annotations['DRUG'].pop(28)

annotations['ROUTE'].pop(17)
annotations['ROUTE'].pop(6)

for label in annotations.keys():
	print(f" Label: {label} ".center(30, '='))
	words_list = []
	for idx, word in annotations[label]:
		words_list.append(word)
		print(word)
	annotations[label] = words_list

pain
deformity
dysfunction
atrophy
flexible talus
motion changes
sensation changes
bleeding
infection
swelling
muscular dystrophy
hematoma
thinning
resection
inflammation
weakness
sensitivity
numbness
impaired function
low mobility
tenderness
ache
limited motion
stiffness
difficulty
irritation
complications
malalignment
protrusion
foot
arm
forearm
elbow
mandible
maxilla
leg
shoulder
joint
extremity
muscle
thigh
spine
sinus
limb
arm
heel
biceps
nerve
foot arch
vein
artery
bone
palate
buttock
cavity
genial tubercle
mental foramina
soft tissue
renal disease
myotonic muscular dystrophy
planovalgus
mandibular atrophy
maxillary atrophy
facial deformity
masticatory dysfunction
acquired deformity
vertical talus
implant failure
hip fracture
infection
anterior mandibular atrophy
arthrodesis indication
release complications
mandibular deficiency
surgical failure
hemorrhage
graft complications
suture complications
pneumonia
urinary infection
anesthetic reaction
scarring
paralysis risk
neuralgia
ed

#### Save annotations to the appropriate format

In [94]:
# 1. Load samples: only column "text"
df = pd.read_csv(unannotated_samples_path)

# 2. Load spaCy for sentence splitting
nlp = spacy.load("en_core_web_md")
if "sentencizer" not in nlp.pipe_names:
	nlp.add_pipe("sentencizer")

# 3. Extract all sentences across samples
all_sentences = []
for _, row in df.iterrows():
	full_text = str(row["text"])
	doc = nlp(full_text)

	# Extract sentences
	all_sentences.extend([s.text.strip() for s in doc.sents if len(s.text.strip()) > 0])

# 4. Annotate all identified keywords in sentences and save to json file
utils_ann.annotate_sentences_and_save(all_sentences, annotations, CUSTOM_LABELS, PATH_TO_ANNOTATIONS+'annotated_samples.json')
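
The implementation of `utils_ann.annotate_sentences_and_save` is not shown in this notebook. As a rough idea of what such a keyword-based annotator might do, the sketch below finds whole-word, case-insensitive keyword matches in a sentence and keeps the longest match when candidates overlap. `annotate_sentence` and its behaviour are assumptions for illustration only, not the actual helper.

```python
import re

def annotate_sentence(sentence, keywords_by_label):
    """Return [sentence, {"entities": [[start, end, label], ...]}] using whole-word matches."""
    candidates = []
    for label, keywords in keywords_by_label.items():
        for keyword in keywords:
            pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
            for m in pattern.finditer(sentence):
                candidates.append((m.start(), m.end(), label))
    # Longest matches first, so "foot arch" wins over "foot" at the same position.
    candidates.sort(key=lambda c: (-(c[1] - c[0]), c[0]))
    chosen = []
    for start, end, label in candidates:
        # Keep a candidate only if it does not overlap anything already chosen.
        if all(end <= s or e <= start for s, e, _ in chosen):
            chosen.append((start, end, label))
    return [sentence, {"entities": [list(c) for c in sorted(chosen)]}]

keywords = {"BODY_PART": ["foot", "foot arch", "heel"], "SYMPTOM": ["pain"]}
result = annotate_sentence("Pain in the foot arch and heel.", keywords)
print(result)
```

Dropping overlaps at this stage matters because spaCy's NER training expects non-overlapping entity spans.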

### Train/Test split of annotations

In [96]:
# ============================================
# Generate train/test annotation template JSONL with ratio control
# ============================================
annotated_samples_path = PATH_TO_ANNOTATIONS + "annotated_samples.json"

# =====================================================
# Parameters: control split ratio (sentence-wise split)
# =====================================================
TRAIN_RATIO = 0.70     # first 70% of the shuffled sentences for training
TEST_RATIO  = 0.30     # remaining 30% for evaluation/test

# ================================

# Load annotated samples: only key "annotations"
annotations_dict = None
with open(annotated_samples_path) as annotations_file:
	annotations_dict = json.load(annotations_file)
all_sentences = annotations_dict["annotations"]

# ===========================
# Create train/test sets
# ===========================

np.random.shuffle(all_sentences) # shuffle randomly first before splitting
total = len(all_sentences)
train_cutoff = int(total * TRAIN_RATIO)
test_cutoff = train_cutoff + int(total * TEST_RATIO)

train_sentences = all_sentences[:train_cutoff]
test_sentences = all_sentences[train_cutoff:test_cutoff]

print("N Train samples: ", len(train_sentences))
print("N Test samples: ", len(test_sentences))
# ===========================
# Save sentences into train/test
# ===========================
train_samples_path = PATH_TO_ANNOTATIONS + 'annotated_samples_train.json'
annotations_dict['annotations'] = train_sentences
with open(train_samples_path, 'w') as fp:
	json.dump(annotations_dict, fp)

test_samples_path = PATH_TO_ANNOTATIONS + 'annotated_samples_test.json'
annotations_dict['annotations'] = test_sentences
with open(test_samples_path, 'w') as fp:
	json.dump(annotations_dict, fp)

N Train samples:  84
N Test samples:  36
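
With 120 sentences the two cutoffs add up exactly (84 + 36). For other totals, truncating both `int(total * TRAIN_RATIO)` and `int(total * TEST_RATIO)` can leave a sentence unused: with 85 sentences the two terms give 59 + 25 = 84. A minimal alternative, assuming the test set should simply be everything after the training cutoff, is sketched below; `split_sentences` is a hypothetical helper.

```python
import random

def split_sentences(sentences, train_ratio=0.70, seed=42):
    """Shuffle a copy of `sentences` and split it; every item lands in exactly one part."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    cutoff = int(len(shuffled) * train_ratio)
    return shuffled[:cutoff], shuffled[cutoff:]

train, test = split_sentences([f"sentence {i}" for i in range(85)])
print(len(train), len(test))
```

Using a seeded local generator also makes the split reproducible regardless of how often the global NumPy generator was used earlier in the notebook.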


### 3.1 Manual Evaluation of Standard NER

In [103]:
### Manual Evaluation of Standard NER
from spacy import displacy

# 1. Load sample text
annotations_dict = None
with open(test_samples_path) as annotations_file:
	annotations_dict = json.load(annotations_file)
	
# 2. Extract the text of every test sentence
test_text = [sent for sent, _ in annotations_dict["annotations"]]

print("=== Test Text Preview ===")

# 3. Load spaCy model (baseline or your updated one)
nlp = spacy.load("en_core_web_md")   # or: spacy.load("output_medical_ner")

# 4. Run NER on the test sentences, joined with spaces so that words at
#    sentence boundaries are not glued together
doc = nlp(' '.join(test_text[:2000]))

# 5. Render using displacy
displacy.render(doc, style="ent", jupyter=True)


=== Test Text Preview ===


ImportError: cannot import name 'display' from 'IPython.core.display' (c:\Users\Esther\anaconda3\envs\nlp\Lib\site-packages\IPython\core\display.py)

Observation: Some predicted entities do not make sense, e.g. Vomiting -> PERSON, Hypokalemia -> PERSON, Diarrhea -> PERSON.
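
The `ImportError` above is raised in displaCy's Jupyter rendering path, which in this environment tries to import `display` from `IPython.core.display`. One way around it, without changing the installed packages, is to build the HTML from the entity offsets directly. The sketch below is a minimal stand-in written for that purpose: `render_entities_html` is a hypothetical helper, not a spaCy function, and it expects `(start, end, label)` tuples such as `[(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]`.

```python
import html

def render_entities_html(text, spans):
    """Wrap each (start, end, label) span in a <mark> tag; spans must not overlap."""
    parts, cursor = [], 0
    for start, end, label in sorted(spans):
        parts.append(html.escape(text[cursor:start]))
        parts.append(
            f'<mark title="{html.escape(label)}">{html.escape(text[start:end])}'
            f" <sub>{html.escape(label)}</sub></mark>"
        )
        cursor = end
    parts.append(html.escape(text[cursor:]))
    return "".join(parts)

markup = render_entities_html("Nausea and vomiting.", [(0, 6, "SYMPTOM"), (11, 19, "SYMPTOM")])
print(markup)
# In a notebook: from IPython.display import HTML; HTML(markup)
```

Calling `displacy.render(doc, style="ent", jupyter=False)` should also avoid the failing import, since it returns the markup as a string instead of displaying it; that string can then be shown with `IPython.display.HTML`.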

### 3.2 Automatic Evaluation of Standard NER

In [None]:
# =========================================
#  Standard NER evaluation
# =========================================

from sklearn.metrics import precision_recall_fscore_support
from ner.utils import extract_spans, convert_to_labels

# 1. Load JSONL annotations
jsonl_path = PATH_TO_ANNOTATIONS + "test_annotation_complete.jsonl"

data = []
with open(jsonl_path, "r", encoding="utf-8") as f:
	for line in f:
		obj = json.loads(line)
		data.append(obj)

print(f"Loaded {len(data)} annotated sentences")

# 2. Load baseline spaCy NER
nlp = spacy.load("en_core_web_md")

# 3. Convert annotations to character-level spans
gold_spans, pred_spans, labels = extract_spans(nlp, data)

# 4. Convert spans to entity sets for evaluation
true_labels, pred_labels = convert_to_labels(gold_spans, pred_spans)

# 5. Evaluate macro and micro F1
prec, rec, f1, _ = precision_recall_fscore_support(
	true_labels, pred_labels, average="micro", zero_division=0
)

print("\n===== Baseline spaCy NER Evaluation =====")
print(f"Precision: {prec:.4f}")
print(f"Recall:    {rec:.4f}")
print(f"F1 score:  {f1:.4f}")

prec_m, rec_m, f1_m, _ = precision_recall_fscore_support(
	true_labels, pred_labels, average="macro", zero_division=0
)

print("\nMacro Precision:", round(prec_m, 4))
print("Macro Recall:   ", round(rec_m, 4))
print("Macro F1:       ", round(f1_m, 4))

# 6. Show the first sentence where gold and predicted spans differ
print("\n===== Examples of WRONG predictions =====\n")

for i, (g, p, item) in enumerate(zip(gold_spans, pred_spans, data)):
	g_set = set(g)
	p_set = set(p)
	if g_set != p_set:
		print("Text:", item["text"])
		print("Gold:", g_set)
		print("Pred:", p_set)
		print("-" * 50)
		break

Loaded 26 annotated sentences

===== Baseline spaCy NER Evaluation =====
Precision: 0.0000
Recall:    0.0000
F1 score:  0.0000

Macro Precision: 0.0
Macro Recall:    0.0
Macro F1:        0.0

===== Examples of WRONG predictions =====

Text: ADMISSION DIAGNOSES:,1.
Gold: set()
Pred: {(0, 9, 'ORG')}
--------------------------------------------------
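
The helpers `extract_spans` and `convert_to_labels` live in `ner.utils` and are not shown here. To make the zero scores easier to interpret, the sketch below scores spans by exact match, which is one common way to evaluate NER; it is an assumption that the notebook's helpers behave similarly. Under exact matching, a baseline that only emits standard labels such as ORG or PERSON can never agree with gold spans that use the custom medical labels.

```python
def span_prf(gold_spans, pred_spans):
    """Micro precision, recall and F1 over exact (start, end, label) matches per sentence."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_spans, pred_spans):
        gold_set, pred_set = set(gold), set(pred)
        tp += len(gold_set & pred_set)
        fp += len(pred_set - gold_set)
        fn += len(gold_set - pred_set)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy data: the spaCy baseline predicts ORG where gold has nothing, and misses a SYMPTOM.
gold = [[], [(0, 6, "SYMPTOM")]]
pred = [[(0, 9, "ORG")], []]
baseline_scores = span_prf(gold, pred)
print(baseline_scores)
```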


## 4. Extend the standard NER types using the NER Annotator

### 4.1 Training the extended NER model with >100 manual annotations

In [None]:
# Make sure there are no overlapping entities in the training sample

path = PATH_TO_ANNOTATIONS + "train_annotation_complete.jsonl"

def spans_overlap(a, b):
	return not (a[1] <= b[0] or b[1] <= a[0])

bad = []

with open(path, "r", encoding="utf-8") as f:
	for line in f:
		obj = json.loads(line)
		text = obj["text"]
		anns = obj["annotation"]  # [[phrase,label],...]

		spans = []
		for phrase, label in anns:
			start = text.lower().find(phrase.lower())
			if start == -1:
				continue
			end = start + len(phrase)
			spans.append((start, end, phrase, label))

		# check overlapping
		for i in range(len(spans)):
			for j in range(i+1, len(spans)):
				a = spans[i]
				b = spans[j]
				if spans_overlap(a, b):
					bad.append((obj["sentence_id"], text, a, b))

print("===== Overlapping entity problems found =====")
for sid, text, a, b in bad:
	print(f"\nSentence {sid}: {text}")
	print("  ->", a)
	print("  ->", b)


===== Overlapping entity problems found =====


In [None]:
# =========================================
#  Extended NER TRAINING
# =========================================

import json
import random
import spacy
from spacy.training import Example
from spacy.util import minibatch, compounding
from ner.utils import find_span

# 1. Load annotated JSONL
json_path = PATH_TO_ANNOTATIONS + "train_annotation_complete.jsonl"
data = []

with open(json_path, "r", encoding="utf-8") as f:
	for line in f:
		data.append(json.loads(line))

print(f"Loaded {len(data)} annotated sentences")

# 2. Custom labels
CUSTOM_LABELS = ["DISEASE", "BODY_PART", "FINDING", "PROCEDURE", "SYMPTOM"]
print("Custom NER labels:", CUSTOM_LABELS)

# 3. Prepare training data
training_examples = []

for item in data:
	text = item["text"]
	labels = item["annotation"]  # [["hypertension","DISEASE"], ...]

	entities = []
	for phrase, label in labels:
		span = find_span(text, phrase)
		if span:
			entities.append((span[0], span[1], label))

	training_examples.append((text, {"entities": entities}))

print("Example:", training_examples[0])

# 4. Initialize blank model for spaCy 3.8+
nlp = spacy.blank("en")         
ner = nlp.add_pipe("ner")       

# Add custom labels
for label in CUSTOM_LABELS:
	ner.add_label(label)

# 5. Training loop
n_iter = 35
optimizer = nlp.initialize()

for epoch in range(n_iter):
	random.shuffle(training_examples)
	losses = {}

	batches = minibatch(training_examples, size=compounding(4.0, 32.0, 1.5))

	for batch in batches:
		examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in batch]
		nlp.update(examples, sgd=optimizer, drop=0.2, losses=losses)

	print(f"Epoch {epoch+1}/{n_iter} Loss: {losses}")

# 6. Save model
output_dir = "output_medical_ner"
nlp.to_disk(output_dir)
print("Model saved to", output_dir)

# 7. Quick sanity check
test_text = random.choice(data)["text"]
doc = nlp(test_text)

print("\nTest sentence:", test_text)
print("Predicted NER:", [(ent.text, ent.label_) for ent in doc.ents])


Loaded 83 annotated sentences
Custom NER labels: ['DISEASE', 'BODY_PART', 'FINDING', 'PROCEDURE', 'SYMPTOM']
Example: ('REASON FOR THE CONSULT:,  Nonhealing right ankle stasis ulcer.,HISTORY', {'entities': [(26, 36, 'FINDING'), (37, 48, 'BODY_PART'), (49, 61, 'DISEASE')]})
Epoch 1/35 Loss: {'ner': np.float32(776.6242)}
Epoch 2/35 Loss: {'ner': np.float32(269.1504)}
Epoch 3/35 Loss: {'ner': np.float32(131.03236)}
Epoch 4/35 Loss: {'ner': np.float32(118.2531)}
Epoch 5/35 Loss: {'ner': np.float32(100.19832)}
Epoch 6/35 Loss: {'ner': np.float32(97.127174)}
Epoch 7/35 Loss: {'ner': np.float32(82.98265)}
Epoch 8/35 Loss: {'ner': np.float32(72.80677)}
Epoch 9/35 Loss: {'ner': np.float32(69.08254)}
Epoch 10/35 Loss: {'ner': np.float32(76.261856)}
Epoch 11/35 Loss: {'ner': np.float32(84.334564)}
Epoch 12/35 Loss: {'ner': np.float32(57.126667)}
Epoch 13/35 Loss: {'ner': np.float32(90.96537)}
Epoch 14/35 Loss: {'ner': np.float32(81.351776)}
Epoch 15/35 Loss: {'ner': np.float32(70.72307)}
Epoch 16
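
The batch sizes in the training loop come from `compounding(4.0, 32.0, 1.5)`. As we understand it, this schedule starts at 4, multiplies by 1.5 at each step and is capped at 32, and `minibatch` takes the integer part of each value. The sketch below re-implements that behaviour with the standard library purely for illustration; `compounding_schedule` and `batch_sizes` are hypothetical names, not the spaCy or Thinc functions.

```python
def compounding_schedule(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound

def batch_sizes(n_examples, schedule):
    """Sizes of the batches obtained when each batch takes int(next schedule value) items."""
    sizes, remaining = [], n_examples
    for size in schedule:
        if remaining <= 0:
            break
        take = min(int(size), remaining)
        sizes.append(take)
        remaining -= take
    return sizes

sizes = batch_sizes(83, compounding_schedule(4.0, 32.0, 1.5))
print(sizes)
```

Because the training loop creates a new schedule inside every epoch, the batch size restarts at 4 each epoch; with only 83 sentences it never stays at the cap of 32.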

### 4.2 Extended NER evaluation

In [None]:
# =========================================
#  Extended NER evaluation with test sample
# =========================================
import json
import spacy
from sklearn.metrics import precision_recall_fscore_support
from ner.utils import extract_spans, convert_to_labels

# 1. Load JSONL annotations for test
jsonl_path = PATH_TO_ANNOTATIONS + "test_annotation_complete.jsonl"

data = []
with open(jsonl_path, "r", encoding="utf-8") as f:
	for line in f:
		obj = json.loads(line)
		data.append(obj)

print(f"Loaded {len(data)} annotated sentences")

# 2. Load Extended spaCy NER
nlp = spacy.load("output_medical_ner")

# 3. Convert annotations to character-level spans
gold_spans, pred_spans, labels = extract_spans(nlp, data)

# 4. Convert spans to entity sets for evaluation
true_labels, pred_labels = convert_to_labels(gold_spans, pred_spans)
# 5. Evaluate macro and micro F1
prec, rec, f1, _ = precision_recall_fscore_support(
	true_labels, pred_labels, average="micro", zero_division=0
)

print("\n===== Extended spaCy NER Evaluation =====")
print(f"Precision: {prec:.4f}")
print(f"Recall:    {rec:.4f}")
print(f"F1 score:  {f1:.4f}")

prec_m, rec_m, f1_m, _ = precision_recall_fscore_support(
	true_labels, pred_labels, average="macro", zero_division=0
)

print("\nMacro Precision:", round(prec_m, 4))
print("Macro Recall:   ", round(rec_m, 4))
print("Macro F1:       ", round(f1_m, 4))

# 6. Show the first sentence where gold and predicted spans differ
print("\n===== Examples of WRONG predictions =====\n")

for i, (g, p, item) in enumerate(zip(gold_spans, pred_spans, data)):
	g_set = set(g)
	p_set = set(p)
	if g_set != p_set:
		print("Text:", item["text"])
		print("Gold:", g_set)
		print("Pred:", p_set)
		print("-" * 50)
		break

Loaded 26 annotated sentences

===== Extended spaCy NER Evaluation =====
Precision: 0.1333
Recall:    0.1333
F1 score:  0.1333

Macro Precision: 0.3333
Macro Recall:    0.0583
Macro F1:        0.0952

===== Examples of WRONG predictions =====

Text: Nausea.,3.
Gold: {(0, 6, 'SYMPTOM')}
Pred: set()
--------------------------------------------------
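
The three micro scores above are identical (0.1333). That is expected whenever every evaluated item has exactly one gold and one predicted label: micro precision, recall and F1 then all reduce to accuracy. The macro scores average per-label values instead, so labels that are never predicted correctly pull them down. The sketch below illustrates this with invented labels; the `"O"` (no entity) label is an assumption, since the output of `convert_to_labels` is not shown.

```python
from collections import Counter

def micro_macro_f1(true_labels, pred_labels):
    """Return (micro_f1, macro_f1) for single-label predictions of equal length."""
    labels = sorted(set(true_labels) | set(pred_labels))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true_labels, pred_labels):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted but wrong
            fn[t] += 1  # t was missed

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro

true = ["SYMPTOM", "SYMPTOM", "DISEASE", "BODY_PART", "O", "O"]
pred = ["O", "SYMPTOM", "O", "O", "O", "O"]
micro, macro = micro_macro_f1(true, pred)
accuracy = sum(t == p for t, p in zip(true, pred)) / len(true)
print(micro, macro, accuracy)
```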


### Summary
- Performance is poor because the training and test sets come from different document categories and therefore share too little entity vocabulary.
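
One way to check this explanation would be to measure how many annotated phrases the two sets have in common. The sketch below computes a Jaccard overlap on records shaped like the JSONL rows used above (`text` plus `annotation` as `[phrase, label]` pairs). The records shown are invented placeholders and the helper names are hypothetical; the real files would need to be loaded instead.

```python
def entity_vocabulary(records):
    """Lower-cased set of annotated phrases from records shaped like the JSONL rows."""
    return {phrase.lower() for record in records for phrase, _label in record["annotation"]}

def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical records in the {"text": ..., "annotation": [[phrase, label], ...]} shape.
train_records = [{"text": "...", "annotation": [["stasis ulcer", "DISEASE"], ["right ankle", "BODY_PART"]]}]
test_records = [{"text": "...", "annotation": [["Nausea", "SYMPTOM"], ["right ankle", "BODY_PART"]]}]

train_vocab = entity_vocabulary(train_records)
test_vocab = entity_vocabulary(test_records)
overlap = jaccard(train_vocab, test_vocab)
unseen = sorted(test_vocab - train_vocab)  # test phrases never seen in training
print(overlap, unseen)
```

A low overlap together with a long list of unseen test phrases would support the explanation; a high overlap would point to other causes, such as the small training set.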

## 5. LLM-based NER classifier

The goal of this part was to see whether a large language model (LLM) could be used as an alternative NER system for our medical text, using only prompts and without training a new model. In practice, we tried several approaches, but all of them were limited by model size, hardware, or missing API access.

First, we tried a very small local LLM (about 0.5B parameters) via `llama-cpp`. We asked it to extract entities with our label set (DISEASE, SYMPTOM, BODY_PART, FINDING, PROCEDURE) and return a JSON list. The model usually failed to follow the format and, more importantly, produced almost random labels that did not make medical sense. This suggests that such a small model is not strong enough for domain-specific NER, even with an instruction-style prompt.

Second, we tried a slightly larger open-source LLM from Hugging Face (e.g. Qwen2.5-1.5B or Llama-3.2-1B) on our macOS machine via the `transformers` library. We implemented a function that builds a medical NER prompt and parses the model output into (phrase, label) pairs. However, even for one or two short sentences, generation on CPU/MPS took several minutes. Running this model on all test sentences to compute precision, recall and F1 would be impractically slow for this project.
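
The prompt-building and parsing step can be sketched independently of any model. The code below is not the function used in our experiments; it is a minimal illustration with hypothetical names (`build_prompt`, `parse_pairs`) that asks for a JSON array of objects and keeps only well-formed items with a known label.

```python
import json

LABELS = ["DISEASE", "SYMPTOM", "BODY_PART", "FINDING", "PROCEDURE"]

def build_prompt(text, labels=LABELS):
    """Build a zero-shot NER prompt that asks for a JSON array of objects."""
    return (
        "Extract medical named entities from the text.\n"
        f"Use only these labels: {', '.join(labels)}.\n"
        'Return ONLY a JSON array such as [{"text": "chest pain", "label": "SYMPTOM"}].\n'
        f"Text:\n{text}\n"
    )

def parse_pairs(output, labels=LABELS):
    """Parse model output into (phrase, label) pairs, dropping anything malformed."""
    start, end = output.find("["), output.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        items = json.loads(output[start:end + 1])
    except json.JSONDecodeError:
        return []
    pairs = []
    for item in items:
        if isinstance(item, dict) and item.get("label") in labels and isinstance(item.get("text"), str):
            pairs.append((item["text"], item["label"]))
    return pairs

# Invented model output, used only to exercise the parser.
fake_output = 'Sure: [{"text": "chest pains", "label": "SYMPTOM"}, {"text": "x", "label": "DRUG"}]'
pairs = parse_pairs(fake_output)
print(pairs)
```

Validating labels at parse time keeps out-of-schema predictions (here the invented `DRUG` item) from being counted as errors of a different kind during evaluation.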

Third, we looked at the spaCy-LLM integration. In theory, we could define an `llm` component in a `config.cfg` file, for example using a Llama2 model on Hugging Face, and then assemble it as a normal spaCy pipeline. In practice, this either requires access to an external API (e.g. OpenAI) or downloading and running a much larger model (such as Llama2-7B). We have no API credits available and our local hardware is not sufficient to run such a model efficiently.

Because of these limitations, we did not manage to build a fully working and scalable LLM-based NER baseline. Our “investigation” of LLM-based NER is therefore mainly conceptual: we explored how it would be prompted and integrated (with transformers or spaCy-LLM), and we observed that under our resource constraints it is not yet practical to use an LLM as a reliable, evaluated medical NER system. For the quantitative results in this project, we rely on the standard spaCy NER and our extended, manually trained medical NER model.




### Example of NER with Llama-3.2-1B-Instruct on one or two sentences

- Generation takes about 20 minutes for one or two sentences, so this is only a demonstration of how LLM-based NER can be done.
- With more computational resources, we would evaluate on the whole annotated test set.

In [None]:
from llama_cpp import Llama
import json, re

# 1. Load GGUF model (same as you already used)
llm = Llama(
	model_path="models/Llama-3.2-1B-Instruct-f16.gguf",
	n_ctx=2048,
	n_threads=6,
	n_gpu_layers=0     # no layers offloaded to the GPU; works on Mac / Windows / Linux
)

print("Model loaded.")





llama_model_load_from_file_impl: using device Metal (Apple M1) - 5455 MiB free
llama_model_loader: loaded meta data with 31 key-value pairs and 147 tensors from models/Llama-3.2-1B-Instruct-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 1B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 1B
llama_model_loader: - kv   6:                            general.lic

Model loaded.


In [None]:
# 2. Test sentence
text = "She was having chest pains along with significant vomiting and diarrhea."

# 3. Minimal zero-shot NER prompt (no enhancements)
PROMPT = """
Extract medical named entities from the text.
Use only these labels: DISEASE, SYMPTOM, BODY_PART, FINDING, PROCEDURE.
Return ONLY a JSON list like this:
"entity_text_1":  "LABEL_1",
Text:
{TEXT}
"""

# 4. Run LLM
raw = llm(PROMPT.format(TEXT=text), max_tokens=256)
output = raw["choices"][0]["text"]
print("\nRaw LLM output:\n", output)

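# 5. Parse JSON from model output
def parse_json(s):
	"""Return the first JSON array found in s, or [] if there is none or it is invalid."""
	match = re.search(r"\[.*\]", s, re.S)
	if not match:
		return []
	try:
		return json.loads(match.group(0))
	except json.JSONDecodeError:
		return []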

ents = parse_json(output)

# 6. Print final extracted entities
print("\nParsed entities:")
for e in ents:
	print(e)

Llama.generate: 66 prefix-match hit, remaining 1 prompt tokens to eval
llama_perf_context_print:        load time =    5953.27 ms
llama_perf_context_print: prompt eval time =  735431.22 ms /     2 tokens (367715.61 ms per token,     0.00 tokens per second)
llama_perf_context_print:        eval time = 1204764.79 ms /   255 runs   ( 4724.57 ms per token,     0.21 tokens per second)
llama_perf_context_print:       total time = 1209872.05 ms /   257 tokens
llama_perf_context_print:    graphs reused =        247



Raw LLM output:
 BODY_PART:  "chest",
SYMPTOM:  "chest pains",
FINDING:  "vomiting", 
PROCEDURE:  "consulted", 
DISEASE:  "gastritis", 
BODY_PART:  "abdomen", 
SYMPTOM:  "diarrhea", 
FINDING:  "gastritis", 
BODY_PART:  "abdomen", 
SYMPTOM:  "diarrhea", 
FINDING:  "gastritis", 
BODY_PART:  "abdomen", 
PROCEDURE:  "gastritis", 

Here is an example json list:
"entity_text_1":  "DISEASE_1",
"entity_text_2":  "BODY_PART_2",
...
"entity_text_n":  "PROCEDURE_n",
"entity_text_p":  "SYMPTOM_p" ,
"entity_text_q":  "FINDING_q",
"entity_text_r":  "BODY_PART_r",
"entity_text_s":  "PROCEDURE_s",
"entity_text_t":  "SYMPTOM_t" ,
"entity_text_u":  "FINDING_u",
"entity_text_v":  "

Parsed entities:
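
The parsed list is empty because the raw output contains no `[...]` array: the model answered with `LABEL: "phrase",` lines, which `parse_json` cannot read. A more tolerant, line-based parser could still recover pairs from this kind of output, and it can also discard phrases that do not occur in the input sentence (such as "gastritis" or "abdomen" above). The sketch below is one possible approach with a hypothetical helper name; `raw_output` is an abridged copy of the output shown above.

```python
import re

LABELS = {"DISEASE", "SYMPTOM", "BODY_PART", "FINDING", "PROCEDURE"}
LINE_PATTERN = re.compile(r'^\s*([A-Z_]+):\s*"([^"]+)"', re.M)

def parse_label_lines(output, source_text):
    """Recover unique (phrase, label) pairs from LABEL: "phrase" lines found in source_text."""
    pairs = []
    for label, phrase in LINE_PATTERN.findall(output):
        grounded = phrase.lower() in source_text.lower()
        if label in LABELS and grounded and (phrase, label) not in pairs:
            pairs.append((phrase, label))
    return pairs

text = "She was having chest pains along with significant vomiting and diarrhea."
raw_output = '''BODY_PART:  "chest",
SYMPTOM:  "chest pains",
FINDING:  "vomiting",
PROCEDURE:  "consulted",
DISEASE:  "gastritis",
BODY_PART:  "abdomen",
SYMPTOM:  "diarrhea",
SYMPTOM:  "diarrhea",
'''
recovered = parse_label_lines(raw_output, text)
print(recovered)
```

Even after this clean-up the labels remain questionable (for example "vomiting" tagged as FINDING rather than SYMPTOM), which is consistent with the observation that a 1B model is not reliable for this task.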


## 6. How NER type information can help other NLP tasks

Even if the NER models in this project are not perfect, the extracted entity types are still useful for many other NLP applications in the clinical domain. Here we briefly summarise a few examples.

1. **Structuring free-text clinical notes**  
	Clinical documents are mostly free text. NER can identify key concepts and turn them into structured fields, for example:
	- DISEASE: diabetes mellitus, hypertension, stasis ulcer  
	- SYMPTOM: chest pain, nausea, diarrhea  
	- BODY_PART: right ankle, abdomen  
	- PROCEDURE: surgery, CT scan  
	This structured information can then be stored in an electronic health record or database and used for search, filtering, and statistics.

2. **Clinical decision support**  
	NER outputs can be used as features in decision support systems. For example:
	- Combinations of DISEASE and SYMPTOM entities can be used to estimate the risk of certain conditions.
	- DRUG and DOSAGE entities can be checked for possible drug interactions or dosing errors.
	In this way, NER acts as a bridge between narrative notes and automated clinical rules or prediction models.

3. **Document and patient-level classification**  
	NER types can also improve text classification. Instead of using only bag-of-words, we can use counts and patterns of entities:
	- Classifying documents by specialty (e.g. cardiology vs. endocrinology) based on the diseases and body parts mentioned.
	- Detecting potential adverse drug events by looking for co-occurrences of specific DRUG and SYMPTOM entities.
	At the patient level, the presence or absence of certain DISEASE entities can be used to define cohorts for research.

4. **Relation extraction and knowledge graphs**  
	NER is the first step towards relation extraction, such as:
	- SYMPTOM–DISEASE relations (e.g. chest pain → myocardial infarction ruled out)
	- DRUG–DISEASE relations (indications)  
	- DRUG–SYMPTOM relations (adverse effects)  
	Once entities are identified, these relations can be learned or manually defined, and combined into a clinical knowledge graph.

5. **Question answering and summarisation**  
	For question answering, knowing which spans are DISEASE or SYMPTOM helps focus retrieval on the relevant parts of a document. For summarisation, NER can be used to ensure that all important entities (diagnoses, symptoms, procedures, medications) appear explicitly in the final summary, even if the original note is long and repetitive.

Overall, NER type information turns unstructured clinical text into more interpretable and reusable signals. Even a relatively noisy NER system can already provide useful features for downstream tasks such as decision support, classification, relation extraction, and summarisation.
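
Two of the uses above, structuring notes into fields and deriving classification features, can be sketched in a few lines. The entities below are invented examples and the helper names are hypothetical; in practice the `(phrase, label)` pairs would come from a trained NER model.

```python
from collections import Counter, defaultdict

def to_record(entities):
    """Group (phrase, label) pairs into a {label: sorted unique phrases} record."""
    grouped = defaultdict(set)
    for phrase, label in entities:
        grouped[label].add(phrase.lower())
    return {label: sorted(phrases) for label, phrases in grouped.items()}

def entity_count_features(entities, labels):
    """Fixed-length vector of entity counts per label, usable as classifier input."""
    counts = Counter(label for _phrase, label in entities)
    return [counts[label] for label in labels]

entities = [("chest pain", "SYMPTOM"), ("nausea", "SYMPTOM"),
            ("hypertension", "DISEASE"), ("right ankle", "BODY_PART")]
labels = ["DISEASE", "SYMPTOM", "BODY_PART", "FINDING", "PROCEDURE"]
record = to_record(entities)
features = entity_count_features(entities, labels)
print(record)
print(features)
```

The record form suits storage and search, while the count vector can be appended to bag-of-words features for document classification.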
