# Task 2: Data Exploration and Processing

## 1. Manual data inspection
- Investigate which standard and potential new NER types are most prominent in your data set (i.e., manual data inspection)

Manual inspection of the data shows that the most prominent entity types are: DATE, BODY_PART, DOSAGE, MEASUREMENT, DRUG and SYMPTOM.

Examples:
 1. DATE: 12/20/2005, 1/19/96
 2. BODY_PART: nose, abdomen, knee
 3. DOSAGE:  10/40 mg one a day, 0.25 micrograms a day, 50 mg twice a day, 10 ml
 4. MEASUREMENT: 3.98 kg, 8mm, pulse of 84, blood pressure 108/65
 5. DRUG:  Vytorin, Rocaltrol, Carvedilol, Cozaar,  Lasix
 6. SYMPTOM: erythematous, chest pain, constipated

Here, DATE is a **standard entity type in spaCy**; the remaining types are specific to the medical domain.

## 2. Apply the standard NER classifier of spaCy


### Imports

In [30]:
from datasets import load_dataset
import pandas as pd
import numpy as np
import spacy
import random
import json
import re
# import utils.annotations_utils as utils_ann # ChatGPT API called to identify keywords/phrases associated with entities

%load_ext autoreload
%autoreload 2


In [31]:
%reload_ext autoreload

### Load Data

In [3]:
# ============================
# 1: Load Dataset
# ============================
dataset = load_dataset("argilla/medical-domain", split="train")
print("Features available")
print(dataset.column_names)
print("Format of 'prediction' column")
print(dataset.features['prediction'])
print("Dataset length: ", len(dataset))

Features available
['text', 'inputs', 'prediction', 'prediction_agent', 'annotation', 'annotation_agent', 'multi_label', 'explanation', 'id', 'metadata', 'status', 'event_timestamp', 'metrics']
Format of 'prediction' column
List({'label': Value('string'), 'score': Value('float64')})
Dataset length:  4966


### Filter such that we keep only samples pertaining to 'Surgery'

In [5]:
dataset_slimmed = dataset.filter(
	lambda row: (
		isinstance(row["text"], str)
		and row["text"] != ''
		and 'Surgery' in str(row["prediction"][0])
	)
)
dataset = dataset_slimmed
print(len(dataset))

Filter: 100%|██████████| 4966/4966 [00:00<00:00, 20966.75 examples/s]

1115





In [6]:
# ================================================
# 2: Define Entity Schema
# ================================================
# Our final gold-label schema
CUSTOM_LABELS = ['SYMPTOM', 'BODY_PART', 'DISEASE', 'DRUG', 'ROUTE']

print("Our NER schema = ", CUSTOM_LABELS)


Our NER schema =  ['SYMPTOM', 'BODY_PART', 'DISEASE', 'DRUG', 'ROUTE']


In [7]:
# =========================================
# 3: Randomly pick N sample documents for manual annotation
# =========================================
PATH_TO_ANNOTATIONS = "ner/samples/"
SEED = 42 
random.seed(SEED)
np.random.seed(SEED)

N_SAMPLES = 4   # choose N so that the samples contain at least 100 ground-truth entities
sampled_texts = []

for i in range(N_SAMPLES):
	row = dataset[np.random.randint(0, len(dataset))]
	text = row["text"]
	sampled_texts.append(text)

df_samples = pd.DataFrame({"text": sampled_texts})
df_samples.to_csv(PATH_TO_ANNOTATIONS + "unannotated_samples.csv", index=False)

print("Saved unannotated_samples.csv with", len(df_samples), "sentences")


Saved unannotated_samples.csv with 4 sentences
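
The sampling loop draws indices with `np.random.randint`, which samples with replacement, so the same document could in principle be selected twice. A minimal sketch of drawing distinct indices instead is shown below; `sample_unique_indices` is a hypothetical helper name and not part of this notebook, and the row count 1115 is the size of the filtered dataset printed earlier.

```python
import random


def sample_unique_indices(n_rows: int, n_samples: int, seed: int = 42) -> list[int]:
    """Return n_samples distinct row indices in [0, n_rows)."""
    if n_samples > n_rows:
        raise ValueError("cannot sample more rows than the dataset contains")
    # A local generator keeps the draw reproducible without touching global state.
    rng = random.Random(seed)
    return rng.sample(range(n_rows), n_samples)


indices = sample_unique_indices(n_rows=1115, n_samples=4)
print(indices)
```

The returned indices could then be used as `dataset[i]["text"]` in place of the random draw inside the loop.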


## 3. Evaluation of Standard NER

### Perform manual annotations, then split into train/test 

 The annotation format is: 

	{
		"classes": List(labels), 
		"annotations": List(
			[str(text), dict("entities": List(List(start idx, end idx, class)))]
		)
	}
Each item in "annotations" is one sentence together with its entity spans; span offsets are character positions in that sentence.
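
To make the format concrete, the sketch below builds one annotation item from a sentence and gold spans that appear later in this notebook's evaluation output, and checks that every span lies inside its sentence. The end offset is treated as exclusive, which is what the gold spans of that sentence imply. `validate_annotations` is a hypothetical helper written for illustration and is not part of `ner.utils`.

```python
from typing import Any

TEXT = (
    "Once the bone was harvested, surgical templets were used to recontour "
    "initially the maxillary graft and the mandibular graft."
)

annotations_example: dict[str, Any] = {
    "classes": ["SYMPTOM", "BODY_PART", "DISEASE", "DRUG", "ROUTE"],
    "annotations": [
        [TEXT, {"entities": [[84, 91, "BODY_PART"], [9, 13, "BODY_PART"]]}],
    ],
}


def validate_annotations(data: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the data is consistent."""
    errors = []
    classes = set(data["classes"])
    for i, (text, ann) in enumerate(data["annotations"]):
        for start, end, label in ann["entities"]:
            if not (0 <= start < end <= len(text)):
                errors.append(f"item {i}: span [{start}, {end}) is outside the text")
            elif label not in classes:
                errors.append(f"item {i}: unknown label {label!r}")
    return errors


# Print the surface text that each gold span covers.
for start, end, label in annotations_example["annotations"][0][1]["entities"]:
    print(label, repr(TEXT[start:end]))
print(validate_annotations(annotations_example))
```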


#### Manual Annotations (with AI assistance)

**NOTE: Skip this section if unannotated_samples.csv and annotated_samples\*.json already exist.**

Jump to [Train/Test splitting](#traintest-split-of-annotations)

#### Save annotations to the appropriate format

### Train/Test split of annotations

In [8]:
# ============================================
# Split the annotated samples (JSON) into train/test sets with ratio control
# ============================================
annotated_samples_path = PATH_TO_ANNOTATIONS + "annotated_samples.json"

# =====================================================
# Parameters: control split ratio (sentence-wise split)
# =====================================================
TRAIN_RATIO = 0.70     # first 70% (after shuffling) for training
TEST_RATIO  = 0.30     # remaining 30% for evaluation/test

# ================================

# Load annotated samples: only key "annotations"
annotations_dict = None
with open(annotated_samples_path) as annotations_file:
	annotations_dict = json.load(annotations_file)
all_sentences = annotations_dict["annotations"]

# ===========================
# Create train/test sets
# ===========================

np.random.shuffle(all_sentences) # shuffle randomly first before splitting
total = len(all_sentences)
train_cutoff = int(total * TRAIN_RATIO)
test_cutoff = train_cutoff + int(total * TEST_RATIO)

train_sentences = all_sentences[:train_cutoff]
test_sentences = all_sentences[train_cutoff:test_cutoff]

print("N Train samples: ", len(train_sentences))
print("N Test samples: ", len(test_sentences))
# ===========================
# Save sentences into train/test
# ===========================
train_samples_path = PATH_TO_ANNOTATIONS + 'annotated_samples_train.json'
annotations_dict['annotations'] = train_sentences
with open(train_samples_path, 'w') as fp:
	json.dump(annotations_dict, fp)

test_samples_path = PATH_TO_ANNOTATIONS + 'annotated_samples_test.json'
annotations_dict['annotations'] = test_sentences
with open(test_samples_path, 'w') as fp:
	json.dump(annotations_dict, fp)

N Train samples:  84
N Test samples:  36


### 3.1 Manual Evaluation of Standard NER

In [10]:
### Manual Evaluation of Standard NER
from spacy import displacy
!python -m spacy download en_core_web_md
# If we jump here directly, we need a definition of test_samples_path
test_samples_path = PATH_TO_ANNOTATIONS + 'annotated_samples_test.json'

# 1. Load sample text
annotations_dict = None
with open(test_samples_path) as annotations_file:
	annotations_dict = json.load(annotations_file)
	
# 2. Extract the text of every test sentence
test_text = [sent for sent, _ in annotations_dict["annotations"]]

print("=== Test Text Preview ===")

# 3. Load spaCy model (baseline or your updated one)
nlp = spacy.load("en_core_web_md")   # or: spacy.load("output_medical_ner")

# 4. Run NER on the test sentences, joined with spaces so that sentence boundaries are not fused
doc = nlp(' '.join(test_text[:2000]))

# 5. Render using displacy
displacy.render(doc, style="ent", jupyter=True)


Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)

Observation: Several predicted entities are clearly wrong for this domain, e.g., Vomiting -> PERSON, Hypokalemia -> PERSON, Diarrhea -> PERSON. The standard model has no medical entity types.

### 3.2 Automatic Evaluation of Standard NER

In [11]:
# =========================================
#  Standard NER evaluation
# =========================================

from sklearn.metrics import precision_recall_fscore_support
from ner.utils import extract_spans, get_label_lists, remove_overlapping_spans

# 1. Load JSON annotations
test_annotations_path = PATH_TO_ANNOTATIONS + "annotated_samples_test.json"

sentences = None # will become a list of sentences
gold_spans = None # will become a nested list of 1 list per sentence, containing a list of [start, end, label]
with open(test_annotations_path, "r", encoding="utf-8") as f:
	annotations_dict = json.load(f)
	sentences = [sent for sent, _ in annotations_dict["annotations"]]
	gold_spans = [entities_dict["entities"] for _, entities_dict in annotations_dict["annotations"]]

# Remove overlapping spans in gold if any
gold_spans = remove_overlapping_spans(gold_spans)
print(f"Loaded {len(sentences)} annotated sentences")

# 2. Load baseline spaCy NER
nlp = spacy.load("en_core_web_md")

# 3. Run the model on each sentence and collect the predicted spans
pred_spans = extract_spans(nlp, sentences)

# 4. Align gold and predicted spans into parallel label lists for evaluation
true_labels, pred_labels, associated_sentence_idx = get_label_lists(gold_spans, pred_spans)

# 5. Evaluate micro- and macro-averaged precision, recall and F1
prec, rec, f1, _ = precision_recall_fscore_support(
	true_labels, pred_labels, average="micro", zero_division=0
)

print("\n===== Baseline spaCy NER Evaluation =====")
print(f"Precision: {prec:.4f}")
print(f"Recall:    {rec:.4f}")
print(f"F1 score:  {f1:.4f}")

prec_m, rec_m, f1_m, _ = precision_recall_fscore_support(
	true_labels, pred_labels, average="macro", zero_division=0
)

print("\nMacro Precision:", round(prec_m, 4))
print("Macro Recall:   ", round(rec_m, 4))
print("Macro F1:       ", round(f1_m, 4))

# 6. Show some error cases
print("\n===== Examples of WRONG predictions =====\n")
for sent_id, (g, p) in enumerate(zip(gold_spans, pred_spans)):
    if g != p:
        print("Text:", sentences[sent_id])
        print("Gold:", g)
        print("Pred:", p)
        print("-" * 50)
        break

Loaded 36 annotated sentences
True labels
 ['DRUG', 'DRUG', 'NONE', 'BODY_PART', 'NONE', 'NONE', 'BODY_PART', 'NONE', 'NONE', 'NONE', 'DRUG', 'DRUG', 'NONE', 'NONE', 'BODY_PART', 'BODY_PART', 'NONE', 'NONE', 'DRUG', 'BODY_PART', 'BODY_PART', 'BODY_PART', 'BODY_PART', 'DISEASE', 'BODY_PART', 'SYMPTOM', 'NONE', 'BODY_PART', 'BODY_PART', 'NONE', 'NONE', 'NONE', 'BODY_PART', 'ROUTE', 'NONE', 'NONE', 'DISEASE', 'NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'ROUTE', 'NONE', 'SYMPTOM', 'DISEASE', 'SYMPTOM', 'BODY_PART', 'BODY_PART', 'BODY_PART', 'NONE', 'DRUG', 'NONE', 'NONE', 'BODY_PART', 'NONE', 'NONE', 'DRUG', 'DRUG', 'NONE', 'NONE', 'NONE']
pred
 ['NONE', 'NONE', 'CARDINAL', 'NONE', 'QUANTITY', 'DATE', 'NONE', 'PERSON', 'MONEY', 'TIME', 'PERSON', 'NONE', 'PERCENT', 'CARDINAL', 'NONE', 'NONE', 'PERSON', 'GPE', 'NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'NONE', 'TIME', 'NONE', 'NONE', 'QUANTITY', 'ORDINAL', 'QUANTITY', 'NONE', 'NONE', 'CARDINAL', 'CARDINAL', 'NONE', '
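
The printed label lists pair a gold label with a predicted label and use `NONE` whenever one side has no entity. `get_label_lists` comes from `ner.utils` and its implementation is not shown here; the sketch below is only an assumption about how such parallel lists could be produced, using exact matching on `(start, end)` offsets. The function name `align_spans_exact` is hypothetical.

```python
def align_spans_exact(gold_spans, pred_spans, none_label="NONE"):
    """Align gold and predicted spans sentence by sentence on exact (start, end) offsets.

    Both arguments are lists with one entry per sentence, each entry being a
    list of [start, end, label]. A span that exists on only one side is paired
    with none_label on the other side.
    """
    true_labels, pred_labels, sentence_idx = [], [], []
    for idx, (gold, pred) in enumerate(zip(gold_spans, pred_spans)):
        gold_map = {(start, end): label for start, end, label in gold}
        pred_map = {(start, end): label for start, end, label in pred}
        for span in sorted(set(gold_map) | set(pred_map)):
            true_labels.append(gold_map.get(span, none_label))
            pred_labels.append(pred_map.get(span, none_label))
            sentence_idx.append(idx)
    return true_labels, pred_labels, sentence_idx


# One sentence with two gold spans, of which the model finds only the first.
gold = [[[84, 91, "BODY_PART"], [9, 13, "BODY_PART"]]]
pred = [[[9, 13, "BODY_PART"]]]
print(align_spans_exact(gold, pred))
```

With exact matching, a prediction whose boundaries differ from the gold span by even one character counts as one missed entity plus one spurious entity, which makes the resulting scores strict.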

## 4. Extend the standard NER types using the NER Annotator

### 4.1 Training the extended NER model with >100 manual annotations

In [12]:
# =========================================
#  Extended NER TRAINING
# =========================================

import json
import random
import spacy
from spacy.training import Example
from spacy.util import minibatch, compounding

# 1. Load annotated training samples (JSON)
json_path = PATH_TO_ANNOTATIONS + "annotated_samples_train.json"

with open(json_path, "r", encoding="utf-8") as f:
	annotations_dict = json.load(f)
	training_examples = annotations_dict["annotations"]
	
# Remove overlapping spans in training examples
training_examples = remove_overlapping_spans(training_examples, is_annotations_list=True)
print(f"Loaded {len(training_examples)} annotated sentences")

# 2. Custom labels
CUSTOM_LABELS = ["SYMPTOM", "BODY_PART", "DISEASE", "DRUG", "ROUTE"]
assert annotations_dict['classes'] == CUSTOM_LABELS
print("Custom NER labels:", CUSTOM_LABELS)

# 3. Initialize blank model for spaCy 3.8+
nlp = spacy.blank("en")         
ner = nlp.add_pipe("ner")       

# 4. Add custom labels
for label in CUSTOM_LABELS:
	ner.add_label(label)

# 5. Training loop
n_iter = 35
optimizer = nlp.initialize()

for epoch in range(n_iter):
	random.shuffle(training_examples)
	losses = {}

	batches = minibatch(training_examples, size=compounding(4.0, 32.0, 1.5))

	for batch in batches:
		examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in batch]
		nlp.update(examples, sgd=optimizer, drop=0.2, losses=losses)

	print(f"Epoch {epoch+1}/{n_iter} Loss: {losses}")

# 6. Save model
output_dir = "output_medical_ner"
nlp.to_disk(output_dir)
print("Model saved to", output_dir)

# 7. Quick sanity check (relies on `sentences`, the test sentences loaded in the evaluation cell of section 3.2)
test_text = random.choice(sentences)
doc = nlp(test_text)

print("\nTest sentence:", test_text)
print("Predicted NER:", [(ent.text, ent.label_) for ent in doc.ents])


Loaded 84 annotated sentences
Custom NER labels: ['SYMPTOM', 'BODY_PART', 'DISEASE', 'DRUG', 'ROUTE']




Epoch 1/35 Loss: {'ner': np.float32(1041.8917)}
Epoch 2/35 Loss: {'ner': np.float32(225.54295)}
Epoch 3/35 Loss: {'ner': np.float32(137.1597)}
Epoch 4/35 Loss: {'ner': np.float32(123.322716)}
Epoch 5/35 Loss: {'ner': np.float32(109.89136)}
Epoch 6/35 Loss: {'ner': np.float32(90.60602)}
Epoch 7/35 Loss: {'ner': np.float32(65.435486)}
Epoch 8/35 Loss: {'ner': np.float32(49.276722)}
Epoch 9/35 Loss: {'ner': np.float32(41.300846)}
Epoch 10/35 Loss: {'ner': np.float32(30.858707)}
Epoch 11/35 Loss: {'ner': np.float32(34.177063)}
Epoch 12/35 Loss: {'ner': np.float32(28.06081)}
Epoch 13/35 Loss: {'ner': np.float32(20.424911)}
Epoch 14/35 Loss: {'ner': np.float32(11.390415)}
Epoch 15/35 Loss: {'ner': np.float32(9.220827)}
Epoch 16/35 Loss: {'ner': np.float32(3.7878766)}
Epoch 17/35 Loss: {'ner': np.float32(2.4475176)}
Epoch 18/35 Loss: {'ner': np.float32(0.8665091)}
Epoch 19/35 Loss: {'ner': np.float32(0.53081256)}
Epoch 20/35 Loss: {'ner': np.float32(1.9285018)}
Epoch 21/35 Loss: {'ner': np.fl

### 4.2 Extended NER Evaluation

In [18]:
# =========================================
#  Extended NER evaluation with test sample
# =========================================
import json
import spacy
from sklearn.metrics import precision_recall_fscore_support
from ner.utils import extract_spans, get_label_lists, remove_overlapping_spans

# 1. Load JSON annotations for the test set
test_annotations_path = PATH_TO_ANNOTATIONS + "annotated_samples_test.json"

sentences = None # will become a list of sentences
gold_spans = None # will become a nested list of 1 list per sentence, containing a list of [start, end, label]
with open(test_annotations_path, "r", encoding="utf-8") as f:
	annotations_dict = json.load(f)
	sentences = [sent for sent, _ in annotations_dict["annotations"]]
	gold_spans = [entities_dict["entities"] for _, entities_dict in annotations_dict["annotations"]]

# Remove overlapping spans in gold if any
gold_spans = remove_overlapping_spans(gold_spans)
print(f"Loaded {len(sentences)} annotated sentences")

# 2. Load Extended spaCy NER
nlp = spacy.load("output_medical_ner")

# 3. Run the model on each sentence and collect the predicted spans
pred_spans = extract_spans(nlp, sentences)

# 4. Align gold and predicted spans into parallel label lists for evaluation
true_labels, pred_labels, associated_sentence_idx = get_label_lists(gold_spans, pred_spans)

# 5. Evaluate micro- and macro-averaged precision, recall and F1
prec, rec, f1, _ = precision_recall_fscore_support(
	true_labels, pred_labels, average="micro", zero_division=0
)

print("\n===== Extended spaCy NER Evaluation =====")
print(f"Precision: {prec:.4f}")
print(f"Recall:    {rec:.4f}")
print(f"F1 score:  {f1:.4f}")

prec_m, rec_m, f1_m, _ = precision_recall_fscore_support(
	true_labels, pred_labels, average="macro", zero_division=0
)

print("\nMacro Precision:", round(prec_m, 4))
print("Macro Recall:   ", round(rec_m, 4))
print("Macro F1:       ", round(f1_m, 4))

# 6. Show some error cases
print("\n===== Examples of WRONG predictions =====\n")

for sent_id, (g, p) in enumerate(zip(gold_spans, pred_spans)):
    if g != p:
        print("Text:", sentences[sent_id])
        print("Gold:", g)
        print("Pred:", p)
        print("-" * 50)
        break

Loaded 36 annotated sentences

===== Extended spaCy NER Evaluation =====
Precision: 0.6522
Recall:    0.6522
F1 score:  0.6522

Macro Precision: 0.5466
Macro Recall:    0.4343
Macro F1:        0.4592

===== Examples of WRONG predictions =====

Text: Once the bone was harvested, surgical templets were used to recontour initially the maxillary graft and the mandibular graft.
Gold: [[84, 91, 'BODY_PART'], [9, 13, 'BODY_PART']]
Pred: [[9, 13, 'BODY_PART']]
--------------------------------------------------
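
The micro-averaged precision, recall and F1 above are identical (0.6522). This is expected whenever every item has exactly one gold and one predicted label and all labels, including `NONE`, are averaged: each mismatch is simultaneously one false positive and one false negative, so all three values equal plain accuracy. The sketch below illustrates this with made-up labels and shows how the scores separate once `NONE` is excluded. `micro_prf` is a hypothetical helper; with scikit-learn, a similar effect can be obtained by passing only the entity labels to the `labels` parameter of `precision_recall_fscore_support`.

```python
def micro_prf(true_labels, pred_labels, ignore=None):
    """Micro-averaged precision, recall and F1 over all labels except `ignore`."""
    tp = fp = fn = 0
    for t, p in zip(true_labels, pred_labels):
        if t == p:
            if t != ignore:
                tp += 1  # correct prediction of a counted class
        else:
            if p != ignore:
                fp += 1  # predicted a counted class that is not the gold label
            if t != ignore:
                fn += 1  # missed a counted gold class
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Illustrative labels only, not taken from the evaluation above.
true_demo = ["BODY_PART", "BODY_PART", "NONE", "DRUG", "NONE"]
pred_demo = ["BODY_PART", "NONE", "PERSON", "DRUG", "CARDINAL"]

print("all labels:    ", micro_prf(true_demo, pred_demo))
print("excluding NONE:", micro_prf(true_demo, pred_demo, ignore="NONE"))
```

Reporting the scores with `NONE` excluded would make precision and recall informative separately, at the cost of no longer being comparable with the numbers printed in this notebook.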


### Summary
- Performance is modest (micro F1 = 0.6522, macro F1 = 0.4592). A likely reason is that the training and test sentences do not belong to the same category and therefore do not share a large enough entity vocabulary.

In [19]:
# =========================================
#  DETECT TRAIN/TEST DATA LEAKAGE
# =========================================

# Load train sentences
train_json = PATH_TO_ANNOTATIONS + "annotated_samples_train.json"
with open(train_json, "r", encoding="utf-8") as f:
    train_data = json.load(f)
train_sentences = [text for text, _ in train_data["annotations"]]

# Load test sentences
test_json = PATH_TO_ANNOTATIONS + "annotated_samples_test.json"
with open(test_json, "r", encoding="utf-8") as f:
    test_data = json.load(f)
test_sentences = [text for text, _ in test_data["annotations"]]

# Exact-match leakage
print("\n===== Checking Exact Leakage =====")
leak_exact = []
for t in test_sentences:
    if t in train_sentences:
        leak_exact.append(t)

if leak_exact:
    print("EXACT LEAKAGE FOUND:")
    for s in leak_exact:
        print(" -", s)
else:
    print("No exact leakage.")


# Substring leakage (weaker but still harmful)
print("\n===== Checking Substring Leakage =====")
leak_sub = []
for test_s in test_sentences:
    for train_s in train_sentences:
        if test_s.strip() in train_s.strip() or train_s.strip() in test_s.strip():
            leak_sub.append(test_s)
            break

if leak_sub:
    print("SUBSTRING LEAKAGE FOUND:")
    for s in leak_sub:
        print(" -", s)
else:
    print("No substring leakage.")



===== Checking Exact Leakage =====
EXACT LEAKAGE FOUND:
 - Maxillary atrophy.
 - ,2.
 - Severe mandibular atrophy.
 - Acquired facial deformity.
 - ,3.

===== Checking Substring Leakage =====
SUBSTRING LEAKAGE FOUND:
 - Maxillary atrophy.
 - ,2.
 - Severe mandibular atrophy.
 - Acquired facial deformity.
 - ,3.
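
Since the check above finds sentences that occur in both splits, one option is to drop exact duplicates before splitting. The sketch below is a minimal illustration under that assumption: `dedupe_and_split` is a hypothetical helper, it only removes exact duplicates (after stripping whitespace) and does not address substring overlap, and the toy annotations carry empty entity lists purely for demonstration.

```python
import random


def dedupe_and_split(annotations, train_ratio=0.7, seed=42):
    """Drop repeated sentences, shuffle, and split into (train, test)."""
    seen = set()
    unique = []
    for text, ann in annotations:
        key = text.strip()
        if key in seen:
            continue  # keep only the first occurrence of each sentence
        seen.add(key)
        unique.append([text, ann])
    rng = random.Random(seed)
    rng.shuffle(unique)
    cutoff = int(len(unique) * train_ratio)
    # The test set takes everything after the cutoff, so no sentence is lost to rounding.
    return unique[:cutoff], unique[cutoff:]


toy = [
    ["Maxillary atrophy.", {"entities": []}],
    [",2.", {"entities": []}],
    ["Maxillary atrophy.", {"entities": []}],
    ["Severe mandibular atrophy.", {"entities": []}],
    [",2.", {"entities": []}],
    ["Acquired facial deformity.", {"entities": []}],
]
train, test = dedupe_and_split(toy)
print(len(train), len(test))
```

Splitting at the document level rather than the sentence level would be another way to reduce overlap, because repeated headings and list markers tend to recur within the same note.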


## 5. LLM-based NER classifier

The goal of this part was to see whether a large language model (LLM) could be used as an alternative NER system for our medical text, using only prompts and without training a new model. In practice, we tried several approaches, but all of them were limited by model size, hardware, or missing API access.

First, we tried a very small local LLM (about 0.5B parameters) via `llama-cpp`. We asked it to extract entities with our label set (DISEASE, SYMPTOM, BODY_PART, FINDING, PROCEDURE) and return a JSON list. The model usually failed to follow the format and, more importantly, produced almost random labels that did not make medical sense. This suggests that such a small model is not strong enough for domain-specific NER, even with an instruction-style prompt.

Second, we tried a slightly larger open-source LLM from Hugging Face (e.g. Qwen2.5-1.5B or Llama-3.2-1B) on our macOS machine via the `transformers` library. We implemented a function that builds a medical NER prompt and parses the model output into (phrase, label) pairs. However, even for one or two short sentences, generation on CPU/MPS took several minutes. Running this model on all test sentences to compute precision, recall and F1 would be impractically slow for this project.
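
The prompt-building and output-parsing step described above can be sketched without calling any model. The code below is an illustration under stated assumptions, not the function used in this project: the prompt wording, the JSON schema with `text` and `label` keys, and the helper names `build_ner_prompt` and `parse_llm_entities` are all hypothetical. The parser tolerates extra text around the JSON list and discards items with unknown labels or phrases that do not occur in the sentence.

```python
import json
import re

LABELS = ("DISEASE", "SYMPTOM", "BODY_PART", "FINDING", "PROCEDURE")


def build_ner_prompt(sentence, labels=LABELS):
    """Build an instruction-style prompt that asks for a JSON list of entities."""
    return (
        "Extract medical entities from the sentence below.\n"
        f"Allowed labels: {', '.join(labels)}.\n"
        'Answer with a JSON list of objects like {"text": "...", "label": "..."}.\n'
        f"Sentence: {sentence}\n"
        "JSON:"
    )


def parse_llm_entities(raw_output, sentence, labels=LABELS):
    """Turn raw model output into (phrase, label) pairs, dropping malformed items."""
    match = re.search(r"\[.*\]", raw_output, flags=re.DOTALL)
    if match is None:
        return []
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    pairs = []
    for item in items:
        if not isinstance(item, dict):
            continue
        phrase, label = item.get("text"), item.get("label")
        if not isinstance(phrase, str) or label not in labels:
            continue  # wrong type or a label outside the allowed set
        if phrase.lower() not in sentence.lower():
            continue  # phrase does not occur in the input sentence
        pairs.append((phrase, label))
    return pairs


sentence = "The patient reports chest pain."
# Hand-written stand-in for a model response, used only to exercise the parser.
raw = (
    'Sure! [{"text": "chest pain", "label": "SYMPTOM"}, '
    '{"text": "aspirin", "label": "DRUG"}, '
    '{"text": "fever", "label": "SYMPTOM"}] Hope this helps.'
)
print(build_ner_prompt(sentence))
print(parse_llm_entities(raw, sentence))
```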

Third, we looked at the spaCy-LLM integration. In theory, we could define an `llm` component in a `config.cfg` file, for example using a Llama2 model on Hugging Face, and then assemble it as a normal spaCy pipeline. In practice, this either requires access to an external API (e.g. OpenAI) or downloading and running a much larger model (such as Llama2-7B). We have no API credits available and our local hardware is not sufficient to run such a model efficiently.

Because of these limitations, we did not manage to build a fully working and scalable LLM-based NER baseline. Our “investigation” of LLM-based NER is therefore mainly conceptual: we explored how it would be prompted and integrated (with transformers or spaCy-LLM), and we observed that under our resource constraints it is not yet practical to use an LLM as a reliable, evaluated medical NER system. For the quantitative results in this project, we rely on the standard spaCy NER and our extended, manually trained medical NER model.




### 5.1 Example of NER with Llama-3.2-1B-Instruct on CPU

**Skip this section if run on GPU**

- On CPU, inference takes about 20 minutes for two sentences, so this section is only a demonstration of how NER can be done with this model.
- With more computational resources, we would evaluate on the full annotated test set.

### 5.2 Sanity Check of LLM Behavior with a Minimal n-Shot Prompt

Before relying on spaCy-LLM’s built-in NER prompting strategy, it is important to understand how the base LLM behaves under a transparent and fully interpretable prompt. This sanity check serves several purposes:

- LLMs are prone to hallucination, especially when forced to choose from a restricted label set. Observing their raw behavior helps identify systematic failure modes such as over-labeling or speculative predictions.
- The default spaCy-LLM prompts are highly engineered and not fully documented. Since we cannot see their internal design decisions, it is useful to test a simple and fully understandable prompt for comparison.
- Our custom few-shot examples do not (and should not) follow spaCy-LLM’s templating rules. The purpose here is not to replace spaCy-LLM, but to benchmark the raw LLM’s baseline behavior.


#### 5.2.1 Log of simple-prompt LLM-based NER

#### 5.2.2 Evaluation of simple-prompt LLM-based NER

#### Summary of minimal n-shot prompt results

Despite including clear and relevant examples, the raw LLM under a simple n-shot prompt achieved only:

- Micro F1: 0.3043
- Macro F1: 0.2228

This confirms that unconstrained prompting is not sufficient for reliable NER, and that hallucination control and output normalization require more structure than a minimal prompt can provide.


### 5.3 Evaluation of the spaCy-LLM Pipeline

The spaCy-LLM framework adds an essential layer of structure around the base LLM. 



In [None]:
# Model names available: dolly-v2-3b Llama-2-13b-hf mistral-7b (more to be explored) original
!python -m ner.spacy_llm.load_model --model  mistral-7b
!python -m ner.spacy_llm.evaluate --model mistral-7b

Current Annotation

==== Micro-Averaged Metrics ====

Precision: 0.1972

Recall:    0.1972

F1-score:  0.1972

Macro Precision: 0.2220

Macro Recall:   0.1636

Macro F1:       0.1710

In [None]:
# Try different example settings, combined with revised gold annotations

# See example settings in ner_examples_cot_balanced_1.json, revised gold annotation in annotated_samples_test_1.json
# Modify the .cfg and evaluation.py accordingly. 
#  
# Model names available: dolly-v2-3b Llama-2-13b-hf mistral-7b (more to be explored)
!python -m ner.spacy_llm.load_model --model  mistral-7b
!python -m ner.spacy_llm.evaluate --model mistral-7b

`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|█████████████████| 2/2 [00:00<00:00, 105.35it/s]
Model saved to /storage/homefs/kw24z021/NLP_LLM/group_project/MedNLP-Multitask/ner/spacy_llm/models/output_mistral-7b_ner
Loaded 36 annotated test sentences.
Loading model...mistral-7b
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|█████████████████| 2/2 [00:00<00:00, 104.55it/s]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.

Revised gold annotation and examples

==== Micro-Averaged Metrics ====

Precision: 0.2429

Recall:    0.2429

F1-score:  0.2429

Macro Precision: 0.4889

Macro Recall:   0.3388

Macro F1:       0.3388


### Summary
- The comparison between minimal raw prompting and spaCy-LLM illustrates the role of architectural constraints and structured prompt design in LLM-based NER.

When evaluated on the same test dataset, spaCy-LLM achieved:

Micro F1 = 0.2063  
Macro F1 = 0.2041

Performance is still modest. These scores are not higher than those of the minimal n-shot prompt baseline (micro F1 = 0.3043, macro F1 = 0.2228); among the spaCy-LLM runs reported above, only the macro F1 obtained with the revised gold annotations and examples (0.3388) exceeds that baseline. The numbers alone therefore do not show that structured prompting, template enforcement, and standardized output parsing improve extraction accuracy on this test set; their expected benefit is fewer hallucinated labels and output that can be parsed reliably.


## 6. How NER type information can help other NLP tasks

Even if the NER models in this project are not perfect, the extracted entity types are still useful for many other NLP applications in the clinical domain. Here we briefly summarise a few examples.

1. **Structuring free-text clinical notes**  
	Clinical documents are mostly free text. NER can identify key concepts and turn them into structured fields, for example:
	- DISEASE: diabetes mellitus, hypertension, stasis ulcer  
	- SYMPTOM: chest pain, nausea, diarrhea  
	- BODY_PART: right ankle, abdomen  
	- PROCEDURE: surgery, CT scan  
	This structured information can then be stored in an electronic health record or database and used for search, filtering, and statistics.

2. **Clinical decision support**  
	NER outputs can be used as features in decision support systems. For example:
	- Combinations of DISEASE and SYMPTOM entities can be used to estimate the risk of certain conditions.
	- DRUG and DOSAGE entities can be checked for possible drug interactions or dosing errors.
	In this way, NER acts as a bridge between narrative notes and automated clinical rules or prediction models.

3. **Document and patient-level classification**  
	NER types can also improve text classification. Instead of using only bag-of-words, we can use counts and patterns of entities:
	- Classifying documents by specialty (e.g. cardiology vs. endocrinology) based on the diseases and body parts mentioned.
	- Detecting potential adverse drug events by looking for co-occurrences of specific DRUG and SYMPTOM entities.
	At the patient level, the presence or absence of certain DISEASE entities can be used to define cohorts for research.

4. **Relation extraction and knowledge graphs**  
	NER is the first step towards relation extraction, such as:
	- SYMPTOM–DISEASE relations (e.g. chest pain → myocardial infarction ruled out)
	- DRUG–DISEASE relations (indications)  
	- DRUG–SYMPTOM relations (adverse effects)  
	Once entities are identified, these relations can be learned or manually defined, and combined into a clinical knowledge graph.

5. **Question answering and summarisation**  
	For question answering, knowing which spans are DISEASE or SYMPTOM helps focus retrieval on the relevant parts of a document. For summarisation, NER can be used to ensure that all important entities (diagnoses, symptoms, procedures, medications) appear explicitly in the final summary, even if the original note is long and repetitive.

Overall, NER type information turns unstructured clinical text into more interpretable and reusable signals. Even a relatively noisy NER system can already provide useful features for downstream tasks such as decision support, classification, relation extraction, and summarisation.
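
As a small illustration of the first and third use cases, the sketch below groups (text, label) pairs into one structured record per note and counts DRUG and SYMPTOM co-occurrences across notes. The helper names are hypothetical, and the entity pairs are taken from the example lists in this document purely for demonstration; a co-occurrence count is only a candidate signal and does not by itself indicate an adverse drug event.

```python
from collections import Counter, defaultdict
from itertools import product


def to_structured_record(entities):
    """Group (text, label) pairs into {label: [unique lower-cased phrases]}."""
    record = defaultdict(list)
    for text, label in entities:
        phrase = text.lower()
        if phrase not in record[label]:
            record[label].append(phrase)
    return dict(record)


def drug_symptom_pairs(records):
    """Count how many notes mention each (drug, symptom) combination."""
    counts = Counter()
    for rec in records:
        for pair in product(rec.get("DRUG", []), rec.get("SYMPTOM", [])):
            counts[pair] += 1
    return counts


# Entities as they might come from doc.ents of an NER model, written by hand here.
note_entities = [
    ("Lasix", "DRUG"),
    ("chest pain", "SYMPTOM"),
    ("Chest pain", "SYMPTOM"),
    ("abdomen", "BODY_PART"),
]
record = to_structured_record(note_entities)
print(record)
print(drug_symptom_pairs([record, record]))
```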
