# Task 2: Data Exploration and Processing

## 1. Manual data inspection
- Investigate which standard and potential new NER types are most prominent in your data set (i.e., manual data inspection)

Visual inspection of the data shows that the most prominent entity types are DATE, BODY_PART, DOSAGE, MEASUREMENT, DRUG and SYMPTOM.

Examples:
 1. DATE: 12/20/2005, 1/19/96
 2. BODY_PART: nose, abdomen, knee
 3. DOSAGE:  10/40 mg one a day, 0.25 micrograms a day, 50 mg twice a day, 10 ml
 4. MEASUREMENT: 3.98 kg, 8mm, pulse of 84, blood pressure 108/65
 5. DRUG:  Vytorin, Rocaltrol, Carvedilol, Cozaar,  Lasix
 6. SYMPTOM: erythematous, chest pain, constipated

Of these, only DATE is a **standard spaCy entity type**; the remaining ones are specific to the medical domain.
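
Manual reading can be complemented by a rough frequency count. The sketch below is only an illustration: the `CANDIDATE_PATTERNS` regexes and the `count_candidates` helper are hypothetical and deliberately crude, and they cover only the surface-pattern types (DATE, DOSAGE, MEASUREMENT) from the examples above. They are meant to help decide where to look, not to replace inspection.

```python
import re
from collections import Counter

# Hypothetical, deliberately rough patterns derived from the examples above.
CANDIDATE_PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/(?:\d{4}|\d{2})\b"),
    "DOSAGE": re.compile(
        r"\b\d+(?:\.\d+)?(?:/\d+(?:\.\d+)?)?\s?(?:mg|ml|micrograms)\b",
        re.IGNORECASE,
    ),
    "MEASUREMENT": re.compile(r"\b\d+(?:\.\d+)?\s?(?:kg|mm|cm)\b", re.IGNORECASE),
}


def count_candidates(texts):
    """Count regex hits per candidate entity type over an iterable of texts."""
    counts = Counter()
    for text in texts:
        for label, pattern in CANDIDATE_PATTERNS.items():
            counts[label] += len(pattern.findall(text))
    return counts


# Toy inputs built from the example values listed above.
examples = [
    "Seen on 12/20/2005 and 1/19/96.",
    "Vytorin 10/40 mg once a day, Rocaltrol 0.25 micrograms a day.",
    "Weight 3.98 kg, lesion of 8mm.",
]
candidate_counts = count_candidates(examples)
print(candidate_counts)
```

On the real data set the same function could be applied to the `text` column; types such as DRUG or SYMPTOM have no simple surface pattern and still need manual reading.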

## 2. Apply the standard NER classifier of spaCy


### Imports

In [1]:
from datasets import load_dataset
import pandas as pd
import numpy as np
import spacy
import random





### Load Data

In [2]:
# ============================
# 1: Load Dataset
# ============================
dataset = load_dataset("argilla/medical-domain", split="train")
print(len(dataset))

4966


In [3]:
# ================================================
# 2: Define Entity Schema
# ================================================
# Our final gold-label schema
CUSTOM_LABELS = [

    "DISEASE", "BODY_PART", "PROCEDURE", "FINDING", "SYMPTOM"
]

print("Our NER schema = ", CUSTOM_LABELS)


Our NER schema =  ['DISEASE', 'BODY_PART', 'PROCEDURE', 'FINDING', 'SYMPTOM']


In [4]:
# =========================================
# 3: Pick N samples manually
# =========================================
PATH_TO_ANNOTATIONS = "ner/samples/"
SEED = 42 
random.seed(SEED)
np.random.seed(SEED)

N_SAMPLES = 2   # choose enough samples to yield at least 100 gold-standard entities
sampled_texts = []

for i in range(N_SAMPLES):
    row = dataset[np.random.randint(0, len(dataset))]
    text = row["text"] if "text" in row else row["content"]
    sampled_texts.append(text)

df_samples = pd.DataFrame({"text": sampled_texts})
df_samples.to_csv(PATH_TO_ANNOTATIONS + "sample_for_annotation.csv", index=False)

print("Saved sample_for_annotation.csv with", len(df_samples), "sentences")


Saved sample_for_annotation.csv with 2 sentences
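
Note that drawing each index independently with `np.random.randint` can, in principle, pick the same document twice. A minimal sketch of sampling without replacement is shown here; the helper name `sample_indices` is hypothetical and it uses the standard-library `random` module rather than NumPy, so the selected indices will differ from the ones used in this notebook.

```python
import random


def sample_indices(n_items, n_samples, seed=42):
    """Draw n_samples distinct indices from range(n_items), reproducibly."""
    rng = random.Random(seed)  # local generator, global random state is untouched
    return rng.sample(range(n_items), n_samples)


# 4966 is the data set size printed above.
indices = sample_indices(4966, 2)
print(indices)
```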


## 3. Evaluation of Standard NER


### Generate Templates for Manual Annotation. 

 The annotation format should be: 

    {
        "sample_id": 0,
        "sentence_id": 0,
        "text": "REASON FOR THE CONSULT:,  Nonhealing right ankle stasis ulcer.,HISTORY",
        "annotation": [["Nonhealing","FINDING"],["right ankle","BODY_PART"],["stasis ulcer","DISEASE"]]
    }
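
As a sketch of how such a record maps to character offsets, the snippet below parses one JSONL line and looks up each phrase in the text. The helper name `annotation_to_spans` is hypothetical; the lookup (case-insensitive, first occurrence only) mirrors the `find_span` logic used later in this notebook.

```python
import json


def annotation_to_spans(record):
    """Turn [[phrase, label], ...] into (start, end, label) character spans."""
    text = record["text"]
    spans = []
    for phrase, label in record["annotation"]:
        start = text.lower().find(phrase.lower())  # first occurrence only
        if start == -1:
            continue  # phrase not present in the sentence, skip it
        spans.append((start, start + len(phrase), label))
    return spans


line = json.dumps({
    "sample_id": 0,
    "sentence_id": 0,
    "text": "REASON FOR THE CONSULT:,  Nonhealing right ankle stasis ulcer.,HISTORY",
    "annotation": [["Nonhealing", "FINDING"],
                   ["right ankle", "BODY_PART"],
                   ["stasis ulcer", "DISEASE"]],
})
record = json.loads(line)
spans = annotation_to_spans(record)
print(spans)
```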



In [5]:
# ============================================
# Generate train/test annotation template JSONL with ratio control
# ============================================
INPUT_CSV = PATH_TO_ANNOTATIONS + "sample_for_annotation.csv"

# ================================
# Parameters: control split ratio
# ================================
TRAIN_RATIO = 0.60     # first 60% of the sentences in sample 0 are used for training
TEST_RATIO  = 0.30     # first 30% of the sentences in sample 1 are used for evaluation/test

# ================================

# 1. Load samples: only column "text"
df = pd.read_csv(INPUT_CSV)
df["sample_id"] = df.index  # auto-index

# 2. Load spaCy for sentence splitting
nlp = spacy.load("en_core_web_md")
if "sentencizer" not in nlp.pipe_names:
    nlp.add_pipe("sentencizer")

# Buckets
train_items = []
test_items = []

# ===========================
# Split sentences per sample
# ===========================
for _, row in df.iterrows():

    sample_id = int(row["sample_id"])
    full_text = str(row["text"])
    doc = nlp(full_text)

    # Extract sentences
    sents = [s.text.strip() for s in doc.sents if len(s.text.strip()) > 0]
    total = len(sents)

    if sample_id == 0:
        # sample 0 → train sample
        cutoff = int(total * TRAIN_RATIO)
        selected = sents[:cutoff]

        for i, text in enumerate(selected):
            train_items.append({
                "sample_id": sample_id,
                "sentence_id": i,
                "text": text,
                "annotation": []
            })

    elif sample_id == 1:
        # sample 1 → test sample
        cutoff = int(total * TEST_RATIO)
        selected = sents[:cutoff]

        for i, text in enumerate(selected):
            test_items.append({
                "sample_id": sample_id,
                "sentence_id": i,
                "text": text,
                "annotation": []
            })

print("Sample 0 sentences:", len([s for s in nlp(df.loc[0, "text"]).sents]))
print("Sample 1 sentences:", len([s for s in nlp(df.loc[1, "text"]).sents]))



Sample 0 sentences: 83
Sample 1 sentences: 89
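
The cell above collects `train_items` and `test_items` but does not show how the templates are written to disk. A minimal sketch of that step could look as follows; the helpers `write_jsonl` and `read_jsonl` and the file name `train_annotation_template.jsonl` are assumptions for illustration, not names taken from this notebook.

```python
import json
import os
import tempfile


def write_jsonl(items, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping empty lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Toy stand-in for train_items; the annotation list is empty until labelled by hand.
demo_items = [
    {"sample_id": 0, "sentence_id": 0, "text": "Nausea.,3.", "annotation": []},
    {"sample_id": 0, "sentence_id": 1, "text": "Diarrhea.", "annotation": []},
]

out_dir = tempfile.mkdtemp()
template_path = os.path.join(out_dir, "train_annotation_template.jsonl")  # hypothetical name
write_jsonl(demo_items, template_path)
loaded = read_jsonl(template_path)
print(len(loaded), "template rows written")
```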


### 3.1 Manual Evaluation of Standard NER

In [6]:
### Manual Evaluation of Standard NER

import spacy
import pandas as pd
from spacy import displacy

# 1. Load sample text file
df_samples = pd.read_csv(PATH_TO_ANNOTATIONS + "sample_for_annotation.csv")

# 2. Select the test document
# sample_for_annotation.csv has a header row ("text") followed by two data rows:
#   df_samples.iloc[0] -> sample 0: training sample for the NER extension
#   df_samples.iloc[1] -> sample 1: test sample
test_text = df_samples.iloc[1]["text"]

print("=== Test Text Preview ===")

# 3. Load spaCy model (baseline or your updated one)
nlp = spacy.load("en_core_web_md")   # or: spacy.load("output_medical_ner")

# 4. Run NER on the first 2000 characters of the document
doc = nlp(test_text[:2000])

# 5. Render using displacy
displacy.render(doc, style="ent", jupyter=True)


=== Test Text Preview ===


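Observation: Several predicted entities are clearly wrong, e.g., Vomiting -> PERSON, Hypokalemia -> PERSON, Diarrhea -> PERSON. The standard model has no medical entity types, so it maps these terms onto its general-purpose labels.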

### 3.2 Automatic Evaluation of Standard NER

In [7]:
# =========================================
#  Standard NER evaluation
# =========================================

import json
import spacy
from sklearn.metrics import precision_recall_fscore_support
from ner.utils import extract_spans, convert_to_labels

# 1. Load JSONL annotations
jsonl_path = PATH_TO_ANNOTATIONS + "test_annotation_complete.jsonl"

data = []
with open(jsonl_path, "r", encoding="utf-8") as f:
    for line in f:
        obj = json.loads(line)
        data.append(obj)

print(f"Loaded {len(data)} annotated sentences")

# 2. Load baseline spaCy NER
nlp = spacy.load("en_core_web_md")

# 3. Convert annotations to character-level spans
gold_spans, pred_spans, labels = extract_spans(nlp, data)

# 4. Convert spans to entity sets for evaluation
true_labels, pred_labels = convert_to_labels(gold_spans, pred_spans)

# 5. Evaluate macro and micro F1
prec, rec, f1, _ = precision_recall_fscore_support(
    true_labels, pred_labels, average="micro", zero_division=0
)

print("\n===== Baseline spaCy NER Evaluation =====")
print(f"Precision: {prec:.4f}")
print(f"Recall:    {rec:.4f}")
print(f"F1 score:  {f1:.4f}")

prec_m, rec_m, f1_m, _ = precision_recall_fscore_support(
    true_labels, pred_labels, average="macro", zero_division=0
)

print("\nMacro Precision:", round(prec_m, 4))
print("Macro Recall:   ", round(rec_m, 4))
print("Macro F1:       ", round(f1_m, 4))

# 6. Show some error cases
print("\n===== Examples of WRONG predictions =====\n")

for i, (g, p, item) in enumerate(zip(gold_spans, pred_spans, data)):
    g_set = set(g)
    p_set = set(p)
    if g_set != p_set:
        print("Text:", item["text"])
        print("Gold:", g_set)
        print("Pred:", p_set)
        print("-" * 50)
        break

Loaded 26 annotated sentences

===== Baseline spaCy NER Evaluation =====
Precision: 0.0000
Recall:    0.0000
F1 score:  0.0000

Macro Precision: 0.0
Macro Recall:    0.0
Macro F1:        0.0

===== Examples of WRONG predictions =====

Text: ADMISSION DIAGNOSES:,1.
Gold: set()
Pred: {(0, 9, 'ORG')}
--------------------------------------------------
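
A score of exactly zero is expected here rather than surprising: the evaluation requires an exact `(start, end, label)` match, and the baseline model's label set does not overlap with the custom schema. The sketch below illustrates this. The list of 18 labels is the OntoNotes-style set that `en_core_web_md` uses to my knowledge (it can be checked with `nlp.get_pipe("ner").labels`), and the `span_only_matches` helper and the toy spans are hypothetical.

```python
# Assumed label set of the baseline model (verify with nlp.get_pipe("ner").labels).
BASELINE_LABELS = {
    "CARDINAL", "DATE", "EVENT", "FAC", "GPE", "LANGUAGE", "LAW", "LOC", "MONEY",
    "NORP", "ORDINAL", "ORG", "PERCENT", "PERSON", "PRODUCT", "QUANTITY", "TIME",
    "WORK_OF_ART",
}
CUSTOM_LABELS = {"DISEASE", "BODY_PART", "PROCEDURE", "FINDING", "SYMPTOM"}

# No shared label means no strict true positive is possible.
shared_labels = BASELINE_LABELS & CUSTOM_LABELS


def span_only_matches(gold, pred):
    """Count predicted spans whose (start, end) equals a gold span, ignoring labels."""
    gold_offsets = {(start, end) for start, end, _ in gold}
    return sum((start, end) in gold_offsets for start, end, _ in pred)


# Toy example: the boundaries agree, but the label can never agree.
gold = [(0, 6, "SYMPTOM")]
pred = [(0, 6, "PERSON")]
strict_matches = len(set(gold) & set(pred))
loose_matches = span_only_matches(gold, pred)
print(shared_labels, strict_matches, loose_matches)
```

A label-agnostic count like this is only a diagnostic for boundary detection; it does not make the baseline comparable to a model trained on the custom labels.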


## 4. Extend the standard NER types using the NER Annotator

### 4.1 Training Extended NER model with >100 manual annotations

In [8]:
# Check that the training annotations contain no overlapping entity spans
import json

path = PATH_TO_ANNOTATIONS + "train_annotation_complete.jsonl"

def spans_overlap(a, b):
    return not (a[1] <= b[0] or b[1] <= a[0])

bad = []

with open(path, "r", encoding="utf-8") as f:
    for line in f:
        obj = json.loads(line)
        text = obj["text"]
        anns = obj["annotation"]  # [[phrase,label],...]

        spans = []
        for phrase, label in anns:
            start = text.lower().find(phrase.lower())
            if start == -1:
                continue
            end = start + len(phrase)
            spans.append((start, end, phrase, label))

        # check overlapping
        for i in range(len(spans)):
            for j in range(i+1, len(spans)):
                a = spans[i]
                b = spans[j]
                if spans_overlap(a, b):
                    bad.append((obj["sentence_id"], text, a, b))

print("===== Overlapping entity problems found =====")
for sid, text, a, b in bad:
    print(f"\nSentence {sid}: {text}")
    print("  ->", a)
    print("  ->", b)


===== Overlapping entity problems found =====
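
One caveat of the check above is that `str.find` returns only the first occurrence of a phrase, so a phrase that appears twice in a sentence is always mapped to the same span. The following sketch lists every occurrence instead; `find_all_spans` is a hypothetical helper, and the example sentence is made up for illustration.

```python
import re


def find_all_spans(text, phrase):
    """Return (start, end) for every case-insensitive occurrence of phrase."""
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    return [(m.start(), m.end()) for m in pattern.finditer(text)]


def spans_overlap(a, b):
    """Half-open intervals overlap if each one starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]


text = "Pain in the right knee; left knee normal."
knee_spans = find_all_spans(text, "knee")
right_knee = find_all_spans(text, "right knee")[0]

print(knee_spans)                                # both occurrences of "knee"
print(spans_overlap(right_knee, knee_spans[0]))  # "right knee" contains the first "knee"
print(spans_overlap(right_knee, knee_spans[1]))  # the second "knee" is separate
```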


In [9]:
# =========================================
#  Extended NER TRAINING
# =========================================

import json
import random
import spacy
from spacy.training import Example
from spacy.util import minibatch, compounding
from ner.utils import find_span

# 1. Load annotated JSONL
json_path = PATH_TO_ANNOTATIONS + "train_annotation_complete.jsonl"
data = []

with open(json_path, "r", encoding="utf-8") as f:
    for line in f:
        data.append(json.loads(line))

print(f"Loaded {len(data)} annotated sentences")

# 2. Custom labels
CUSTOM_LABELS = ["DISEASE", "BODY_PART", "FINDING", "PROCEDURE", "SYMPTOM"]
print("Custom NER labels:", CUSTOM_LABELS)

# 3. Prepare training data
training_examples = []

for item in data:
    text = item["text"]
    labels = item["annotation"]  # [["hypertension","DISEASE"], ...]

    entities = []
    for phrase, label in labels:
        span = find_span(text, phrase)
        if span:
            entities.append((span[0], span[1], label))

    training_examples.append((text, {"entities": entities}))

print("Example:", training_examples[0])

# 4. Initialize blank model for spaCy 3.8+
nlp = spacy.blank("en")         
ner = nlp.add_pipe("ner")       

# Add custom labels
for label in CUSTOM_LABELS:
    ner.add_label(label)

# 5. Training loop
n_iter = 35
optimizer = nlp.initialize()

for epoch in range(n_iter):
    random.shuffle(training_examples)
    losses = {}

    batches = minibatch(training_examples, size=compounding(4.0, 32.0, 1.5))

    for batch in batches:
        examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in batch]
        nlp.update(examples, sgd=optimizer, drop=0.2, losses=losses)

    print(f"Epoch {epoch+1}/{n_iter} Loss: {losses}")

# 6. Save model
output_dir = "output_medical_ner"
nlp.to_disk(output_dir)
print("Model saved to", output_dir)

# 7. Quick sanity check
test_text = random.choice(data)["text"]
doc = nlp(test_text)

print("\nTest sentence:", test_text)
print("Predicted NER:", [(ent.text, ent.label_) for ent in doc.ents])


Loaded 83 annotated sentences
Custom NER labels: ['DISEASE', 'BODY_PART', 'FINDING', 'PROCEDURE', 'SYMPTOM']
Example: ('REASON FOR THE CONSULT:,  Nonhealing right ankle stasis ulcer.,HISTORY', {'entities': [(26, 36, 'FINDING'), (37, 48, 'BODY_PART'), (49, 61, 'DISEASE')]})
Epoch 1/35 Loss: {'ner': np.float32(776.60944)}
Epoch 2/35 Loss: {'ner': np.float32(269.15738)}
Epoch 3/35 Loss: {'ner': np.float32(131.04156)}
Epoch 4/35 Loss: {'ner': np.float32(118.32787)}
Epoch 5/35 Loss: {'ner': np.float32(100.092255)}
Epoch 6/35 Loss: {'ner': np.float32(96.86833)}
Epoch 7/35 Loss: {'ner': np.float32(96.317795)}
Epoch 8/35 Loss: {'ner': np.float32(74.446815)}
Epoch 9/35 Loss: {'ner': np.float32(68.70432)}
Epoch 10/35 Loss: {'ner': np.float32(67.18328)}
Epoch 11/35 Loss: {'ner': np.float32(66.88499)}
Epoch 12/35 Loss: {'ner': np.float32(61.775448)}
Epoch 13/35 Loss: {'ner': np.float32(80.03961)}
Epoch 14/35 Loss: {'ner': np.float32(73.56988)}
Epoch 15/35 Loss: {'ner': np.float32(84.17685)}
Epoch 

### 4.2 Extended NER evaluation

In [11]:
# =========================================
#  Extended NER evaluation with test sample
# =========================================
import json
import spacy
from sklearn.metrics import precision_recall_fscore_support
from ner.utils import extract_spans, convert_to_labels

# 1. Load JSONL annotations for test
jsonl_path = PATH_TO_ANNOTATIONS + "test_annotation_complete.jsonl"

data = []
with open(jsonl_path, "r", encoding="utf-8") as f:
    for line in f:
        obj = json.loads(line)
        data.append(obj)

print(f"Loaded {len(data)} annotated sentences")

# 2. Load Extended spaCy NER
nlp = spacy.load("output_medical_ner")

# 3. Convert annotations to character-level spans
gold_spans, pred_spans, labels = extract_spans(nlp, data)

# 4. Convert spans to entity sets for evaluation
true_labels, pred_labels = convert_to_labels(gold_spans, pred_spans)
# 5. Evaluate macro and micro F1
prec, rec, f1, _ = precision_recall_fscore_support(
    true_labels, pred_labels, average="micro", zero_division=0
)

print("\n===== Extended spaCy NER Evaluation =====")
print(f"Precision: {prec:.4f}")
print(f"Recall:    {rec:.4f}")
print(f"F1 score:  {f1:.4f}")

prec_m, rec_m, f1_m, _ = precision_recall_fscore_support(
    true_labels, pred_labels, average="macro", zero_division=0
)

print("\nMacro Precision:", round(prec_m, 4))
print("Macro Recall:   ", round(rec_m, 4))
print("Macro F1:       ", round(f1_m, 4))

# 6. Show some error cases
print("\n===== Examples of WRONG predictions =====\n")

for i, (g, p, item) in enumerate(zip(gold_spans, pred_spans, data)):
    g_set = set(g)
    p_set = set(p)
    if g_set != p_set:
        print("Text:", item["text"])
        print("Gold:", g_set)
        print("Pred:", p_set)
        print("-" * 50)
        break

Loaded 26 annotated sentences

===== Extended spaCy NER Evaluation =====
Precision: 0.1515
Recall:    0.1515
F1 score:  0.1515

Macro Precision: 0.25
Macro Recall:    0.0734
Macro F1:        0.1111

===== Examples of WRONG predictions =====

Text: Nausea.,3.
Gold: {(0, 6, 'SYMPTOM')}
Pred: set()
--------------------------------------------------
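
The micro-averaged precision, recall and F1 above are identical (0.1515). This is most likely because `"NONE"` is treated as an ordinary class: when every class is included, micro-averaging reduces to plain accuracy over the aligned label pairs. An alternative is to count true positives, false positives and false negatives directly from the span sets, as in the sketch below. The helper `entity_prf` and the toy spans are hypothetical, and the numbers it prints are not results for this data set.

```python
def entity_prf(gold_spans, pred_spans):
    """Entity-level precision, recall and F1 with exact (start, end, label) matching."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_spans, pred_spans):
        g, p = set(gold), set(pred)
        tp += len(g & p)   # predicted and in gold
        fp += len(p - g)   # predicted but not in gold
        fn += len(g - p)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: two sentences, three gold entities, two predictions.
toy_gold = [[(0, 6, "SYMPTOM")], [(0, 5, "DISEASE"), (10, 14, "BODY_PART")]]
toy_pred = [[], [(0, 5, "DISEASE"), (20, 25, "FINDING")]]
precision, recall, f1 = entity_prf(toy_gold, toy_pred)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

With scikit-learn, a similar effect can be obtained by passing `labels=CUSTOM_LABELS` to `precision_recall_fscore_support`, which leaves `"NONE"` out of the averaged classes.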


## 5. LLM-based NER classifier

In [None]:
# =========================================
#  LLM-based NER evaluation with test sample
# =========================================
import json
import spacy
import time
from sklearn.metrics import precision_recall_fscore_support
from spacy_llm.util import assemble

# ===================================================
# 1. Load JSONL annotations for test
# ===================================================
jsonl_path = PATH_TO_ANNOTATIONS + "test_annotation_complete.jsonl"

data = []
with open(jsonl_path, "r", encoding="utf-8") as f:
    for line in f:
        obj = json.loads(line)
        data.append(obj)

print(f"Loaded {len(data)} annotated sentences")

# ===================================================
# 2. Create LLM NER
# ===================================================

nlp = assemble("config.cfg")
print("spaCy model loaded:", nlp.meta["name"])

# ===================================================
# 3. Convert annotations to character-level spans
# ===================================================

def find_span(text, phrase):
    """
    Locate (start, end) of the phrase in the sentence (case-insensitive).
    If the phrase occurs more than once, take the first occurrence.
    Return None if the phrase is not found.
    """
    start = text.lower().find(phrase.lower())
    if start == -1:
        return None
    return (start, start + len(phrase))

gold_spans = []
pred_spans = []
labels = []   # labels of all ground-truth annotations in the test set

print("Creating docs")
for item in data:
    text = item["text"]
    ann_list = item["annotation"]  # [["ankle","BODY_PART"], ...]
    
    doc = nlp(text)
    print('.')
    time.sleep(2)  # brief pause between LLM calls (e.g., to respect API rate limits)

    # -------- gold spans --------
    gold = []
    for phrase, label in ann_list:
        span = find_span(text, phrase)
        if span is not None:
            gold.append((span[0], span[1], label))
            labels.append(label)
        # else:
        #     print("Warning: phrase not found:", phrase)

    gold_spans.append(gold)

    # -------- predicted spans --------
    pred = []
    for ent in doc.ents:
        pred.append((ent.start_char, ent.end_char, ent.label_))

    pred_spans.append(pred)

# ===================================================
# 4. Convert spans to entity sets for evaluation
# ===================================================

true_labels = []
pred_labels = []

for g_spans, p_spans in zip(gold_spans, pred_spans):

    #  (start,end,label) precise matching
    g_set = set(g_spans)
    p_set = set(p_spans)

    # True positives
    for span in g_set:
        if span in p_set:
            true_labels.append(span[2])
            pred_labels.append(span[2])
        else:
            true_labels.append(span[2])
            pred_labels.append("NONE")  # missed

    # False positives
    for span in p_set:
        if span not in g_set:
            true_labels.append("NONE")     # if no gold annotation
            pred_labels.append(span[2])    # wrong

# ===================================================
# 5. Evaluate macro and micro F1
# ===================================================

prec, rec, f1, _ = precision_recall_fscore_support(
    true_labels, pred_labels, average="micro", zero_division=0
)

print("\n===== LLM-based NER Evaluation =====")
print(f"Precision: {prec:.4f}")
print(f"Recall:    {rec:.4f}")
print(f"F1 score:  {f1:.4f}")

prec_m, rec_m, f1_m, _ = precision_recall_fscore_support(
    true_labels, pred_labels, average="macro", zero_division=0
)

print("\nMacro Precision:", round(prec_m, 4))
print("Macro Recall:   ", round(rec_m, 4))
print("Macro F1:       ", round(f1_m, 4))


# ===================================================
# 6. Show some error cases
# ===================================================
print("\n===== Examples of WRONG predictions =====\n")

for i, (g, p, item) in enumerate(zip(gold_spans, pred_spans, data)):
    g_set = set(g)
    p_set = set(p)
    if g_set != p_set:
        print("Text:", item["text"])
        print("Gold:", g_set)
        print("Pred:", p_set)
        print("-" * 50)
        break



