### 1) Manual inspection of entity types

In [2]:
import spacy

spacy.__version__

'3.8.6'

In [36]:
from data_loader import load_df

df = load_df()
df.head()
#df.shape

Unnamed: 0,text,id,label,text_length
0,"PREOPERATIVE DIAGNOSIS:, Iron deficiency anem...",00001265-03e2-47b2-b6cf-bed32dad2fa9,Gastroenterology,1085
1,"CLINICAL INDICATION: ,Normal stress test.,PRO...",0007edf0-1413-4b16-8212-3a13c2ab4e43,Surgery,1798
2,"FINDINGS:,Axial scans were performed from L1 t...",00097d1e-1357-4447-a39a-fe8f8b7c36ae,Radiology,1141
3,"PREOPERATIVE DIAGNOSIS: , Blood loss anemia.,P...",001622b6-0182-4fee-9881-ae15e81ce836,Surgery,1767
4,"REASON FOR VISIT: ,Elevated PSA with nocturia...",0029245f-8b45-4796-ba09-7760612289c6,SOAP / Chart / Progress Notes,1519


In [12]:
sample_texts = df['text'].sample(5, random_state=42)

for text in sample_texts:
    print('### sample ###')
    print(text[:1000])
    print()

### sample ###
REASON FOR CONSULTATION: , Thyroid mass diagnosed as papillary carcinoma.,HISTORY OF PRESENT ILLNESS:  ,The patient is a 16-year-old young lady, who was referred from the Pediatric Endocrinology Department by Dr. X for evaluation and surgical recommendations regarding treatment of a mass in her thyroid, which has now been proven to be papillary carcinoma on fine needle aspiration biopsy.  The patient's parents relayed that they first noted a relatively small but noticeable mass in the middle portion of her thyroid gland about 2004.  An ultrasound examination had reportedly been done in the past and the mass is being observed.  When it began to enlarge recently, she was referred to the Pediatric Endocrinology Department and had an evaluation there.  The patient was referred for fine needle aspiration and the reports recently returned a diagnosis of papillary thyroid carcinoma.  The patient has not had any hoarseness, difficulty swallowing, or any symptoms of endocrine dys

In [139]:
import spacy

nlp = spacy.load("en_core_web_sm")

# print the standard entity labels used by spaCy's NER model

for label in nlp.get_pipe("ner").labels:
    print(label)

CARDINAL
DATE
EVENT
FAC
GPE
LANGUAGE
LAW
LOC
MONEY
NORP
ORDINAL
ORG
PERCENT
PERSON
PRODUCT
QUANTITY
TIME
WORK_OF_ART


### Answer
After inspecting some text samples from the dataset, we observe that the texts contain two standard spaCy entity types: PERSON (references to patients and clinicians) and DATE (explicit dates). Most of the other frequent entities belong to new types that the standard label set does not cover: Disease/Condition (papillary carcinoma), Procedure (colonoscopy), Anatomy/Body_Part (thyroid gland), Drug/Medication (Demerol), Measurement/Dosage (25 mg), Medical_Device (Foley catheter), and Symptom (severe headache). Only a small subset of the medically important entities therefore aligns with the standard NER categories; most medically relevant concepts require custom NER labels.

### 2) Apply standard NER classifier of spaCy to our data

In [151]:
def extract_spacy_entities(text):
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]

subsetdf = df.iloc[5:505].copy() # we skip the first 5 texts because later we train our model on sentences from the first 5 texts

subsetdf['spacy_ents'] = subsetdf['text'].apply(extract_spacy_entities)
subsetdf[["text", "spacy_ents"]].head()

Unnamed: 0,text,spacy_ents
5,"XYZ, O.D.,RE: ABC,DOB: MM/DD/YYYY,Dear Dr. X...","[(XYZ, ORG), (ABC, ORG), (DOB, ORG), (MM, GPE)..."
6,"PREOPERATIVE DIAGNOSES:,1. Intrauterine pregn...","[(Intrauterine, NORP), (30 and 4/7th weeks, DA..."
7,"REASON FOR CT SCAN: , The patient is a 79-year...","[(79-year-old, DATE), (CT, ORG), (January 16, ..."
8,"CLINICAL HISTORY: , Patient is a 37-year-old f...","[(Patient, PERSON), (37-year-old, DATE), (CT, ..."
9,"INDICATION: , Iron deficiency anemia.,PROCEDUR...","[(Colonoscopy, ORG), (DIAGNOSIS, PERSON), (15 ..."


### Answer
The standard spaCy NER model often misclassifies medical text. For example, it labels "Colonoscopy" as ORG, "DIAGNOSIS" as PERSON, and "Intrauterine" as NORP. This indicates that the general-purpose model is not adapted to the structure and vocabulary of medical text.

### 3) Evaluate quality of the NER classification

In [141]:
all_entities_of_subset = []

for ents in subsetdf['spacy_ents']:
    for ent_text, ent_label in ents:
        all_entities_of_subset.append((ent_text, ent_label))

len(all_entities_of_subset)

16490

In [143]:
import pandas as pd

df_entities = pd.DataFrame(all_entities_of_subset, columns =  ['entity_text', 'predicted_label'])

samples_of_100_entities_df = df_entities.sample(100, random_state=42)

#df_entities.head()
samples_of_100_entities_df

Unnamed: 0,entity_text,predicted_label
8419,Judkins,PERSON
5504,1.0,CARDINAL
14154,only 5 mm,QUANTITY
3434,1%,PERCENT
13379,IV Toradol,PERSON
...,...,...
1836,CAT,ORG
4546,Paradox,GPE
15681,Proscar 5,PERSON
15447,11 cm,QUANTITY


In [144]:
# manually check the entities and their labels

is_correct = [0, 1, 1, 1, 0, 1, 1, 1, 0, 0,
              0, 0, 0, 1, 0, 1, 0, 0, 1, 0,
              1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
              1, 0, 0, 0, 1, 1, 1, 1, 1, 0,
              0, 0, 1, 0, 0, 1, 0, 1, 0, 0,
              0, 0, 0, 0, 1, 0, 1, 0, 1, 0,
              0, 1, 0, 0, 0, 0, 0, 0, 1, 0,
              0, 1, 0, 1, 1, 0, 1, 0, 1, 0,
              0, 0, 0, 1, 0, 1, 1, 1, 1, 1,
              1, 0, 1, 0, 0, 0, 0, 0, 1, 1]

samples_of_100_entities_df['is_correct'] = is_correct

samples_of_100_entities_df

Unnamed: 0,entity_text,predicted_label,is_correct
8419,Judkins,PERSON,0
5504,1.0,CARDINAL,1
14154,only 5 mm,QUANTITY,1
3434,1%,PERCENT,1
13379,IV Toradol,PERSON,0
...,...,...,...
1836,CAT,ORG,0
4546,Paradox,GPE,0
15681,Proscar 5,PERSON,0
15447,11 cm,QUANTITY,1


In [145]:
import numpy as np

accuracy = np.mean(samples_of_100_entities_df['is_correct'])
print(accuracy)

0.4


In [146]:
label_counts_samples = samples_of_100_entities_df['predicted_label'].value_counts()
label_counts_samples

predicted_label
ORG         28
CARDINAL    24
PERSON      14
DATE         9
QUANTITY     6
GPE          6
PERCENT      3
PRODUCT      3
MONEY        2
NORP         2
TIME         2
ORDINAL      1
Name: count, dtype: int64

In [147]:
label_counts_entities = df_entities['predicted_label'].value_counts()
label_counts_entities

predicted_label
CARDINAL       3787
ORG            3737
PERSON         2338
DATE           2164
GPE            1080
QUANTITY        690
PRODUCT         489
TIME            467
ORDINAL         454
NORP            399
PERCENT         345
MONEY           292
LOC              97
FAC              56
LAW              53
WORK_OF_ART      20
LANGUAGE         12
EVENT            10
Name: count, dtype: int64

### Answer
We randomly sampled 100 predicted entities from the spaCy output and manually judged whether each prediction was correct.
The manual accuracy was 0.4 (40 of the 100 sampled predictions).
Less than half of the sampled predictions were correct, which shows that the standard spaCy NER model struggles with medical text.
We also computed the label distribution over all 16,490 predicted entities in our subset (500 documents).
The most common predicted labels were CARDINAL, ORG, PERSON, and DATE.
However, most medically relevant concepts (e.g., diseases, procedures, anatomy, devices, drugs) cannot receive a suitable label, because the standard spaCy model does not include these categories.
Given the low manual accuracy, many of the automatically counted entities are likely incorrect or incomplete.
The label counts therefore show the prediction tendencies of the model, but not its correctness.
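
Because the manual accuracy rests on only 100 sampled entities, it carries noticeable sampling uncertainty. The following is a minimal sketch of a 95% Wilson score interval for it, assuming the 100 entities are a simple random sample of all predictions. The helper name `wilson_interval` is introduced here for illustration and is not part of the notebook.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# 40 of the 100 manually checked entities were judged correct
low, high = wilson_interval(40, 100)
print(f"accuracy = 0.40, 95% interval = [{low:.2f}, {high:.2f}]")
```

Under these assumptions the interval runs from roughly 0.31 to 0.50, so the estimate of 0.4 should be read as approximate rather than exact.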

### 4) Extend the standard NER

In [148]:
medical_entities = ['ANATOMY', 'DISEASE', 'PROCEDURE', 'DRUG', 'DEVICE'] # tags used in NER text annotator

In [156]:
subsetdf_for_new_ner = df.head(5).copy()

subsetdf_for_new_ner['text'][4]
# we took some sentences from the first 5 medical texts of our dataset and labeled them with the 5 medical entity types above

'REASON FOR VISIT:  ,Elevated PSA with nocturia and occasional daytime frequency.,HISTORY: , A 68-year-old male with a history of frequency and some outlet obstructive issues along with irritative issues.  The patient has had history of an elevated PSA and PSA in 2004 was 5.5.  In 2003, he had undergone a biopsy by Dr. X, which was negative for adenocarcinoma of the prostate.  The patient has had PSAs as high as noted above.  His PSAs have been as low as 1.6, but those were on Proscar.  He otherwise appears to be doing reasonably well, off the Proscar, otherwise does have some irritative symptoms.  This has been ongoing for greater than five years.  No other associated symptoms or modifying factors.  Severity is moderate.  PSA relatively stable over time.,IMPRESSION: , Stable PSA over time, although he does have some irritative symptoms.  After our discussion, it does appear that if he is not drinking close to going to bed, he notes that his nocturia has significantly decreased.  At th

In [None]:
medical_sentences = ['The mucosa was normal throughout the colon.  No polyps or other lesions were identified, and no blood was noted.  Some diverticula were seen of the sigmoid colon with no luminal narrowing or evidence of inflammation.', 'Both groins were prepped and draped in the usual sterile fashion.  After local anesthesia with 2% lidocaine, a 6-French sheath was inserted in the right femoral artery.', 'Axial scans were performed from L1 to S2 and reformatted images were obtained in the sagittal and coronal planes.', 'The patient tolerated the procedure well and was transferred to recovery room in stable condition.  She will be placed on a high-fiber diet and Colace and we will continue to monitor her hemoglobin.', 'The patient will discontinue all caffeinated and carbonated beverages and any fluids three hours prior to going to bed.  He already knows that this does decrease his nocturia.  He will do this without medications to see how well he does and hopefully he may need no other additional medications other than may be changing his alpha-blocker to something of more efficacious.'] # these are the sentences we chose

In [157]:
import json

with open("annotations.json", "r") as f:
    data = json.load(f)

print(type(data)) # output: dict
data.keys() # output: classes, annotations
items = data['annotations']
items[:2]

<class 'dict'>


[['The mucosa was normal throughout the colon.  No polyps or other lesions were identified, and no blood was noted.  Some diverticula were seen of the sigmoid colon with no luminal narrowing or evidence of inflammation.\r',
  {'entities': [[4, 10, 'ANATOMY'],
    [37, 43, 'ANATOMY'],
    [48, 54, 'DISEASE'],
    [64, 71, 'DISEASE'],
    [119, 130, 'DISEASE'],
    [148, 161, 'ANATOMY'],
    [203, 215, 'DISEASE']]}],
 ['Both groins were prepped and draped in the usual sterile fashion.  After local anesthesia with 2% lidocaine, a 6-French sheath was inserted in the right femoral artery.\r',
  {'entities': [[5, 11, 'ANATOMY'],
    [73, 89, 'PROCEDURE'],
    [98, 107, 'DRUG'],
    [111, 126, 'DEVICE'],
    [147, 167, 'ANATOMY']]}]]
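
Since the annotations are plain character offsets, it can help to slice the text with them and look at what each span actually covers before training. The sketch below does this for the first annotated sentence shown above; `check_spans` is a hypothetical helper, and the rule it applies (flag spans that begin or end with whitespace or punctuation) is our own heuristic, not something the annotation tool enforces.

```python
import string

# First annotated sentence and its entity offsets, copied from the output above
text = ("The mucosa was normal throughout the colon.  No polyps or other lesions "
        "were identified, and no blood was noted.  Some diverticula were seen of "
        "the sigmoid colon with no luminal narrowing or evidence of inflammation.\r")
entities = [[4, 10, "ANATOMY"], [37, 43, "ANATOMY"], [48, 54, "DISEASE"],
            [64, 71, "DISEASE"], [119, 130, "DISEASE"], [148, 161, "ANATOMY"],
            [203, 215, "DISEASE"]]

def check_spans(text, entities):
    """Return all (start, end, label, span_text) tuples and the suspicious subset."""
    edge_chars = string.whitespace + string.punctuation
    spans = [(s, e, label, text[s:e]) for s, e, label in entities]
    # flag non-empty spans that start or end with whitespace or punctuation
    flagged = [sp for sp in spans
               if sp[3] and (sp[3][0] in edge_chars or sp[3][-1] in edge_chars)]
    return spans, flagged

spans, flagged = check_spans(text, entities)
for start, end, label, span_text in spans:
    print(f"{start:>3}-{end:<3} {label:<9} {span_text!r}")
print("suspicious:", flagged)
```

For this sentence the check flags the span [37, 43], which slices to `colon.` including the trailing period. Whether that matters depends on how strictly boundaries are matched during evaluation.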

In [189]:
import random
import numpy as np
import spacy
from spacy.util import fix_random_seed

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
fix_random_seed(SEED)

nlp = spacy.load("en_core_web_sm")

ner = nlp.get_pipe("ner")

items = data['annotations']
training_data = [(text, ann) for (text, ann) in items]

for _, ann in training_data:
    for start, end, label in ann["entities"]:
        ner.add_label(label)

for label in nlp.get_pipe("ner").labels: # check that the new labels are included
    print(label) # 5 labels (DRUG, ANATOMY, PROCEDURE, DEVICE, DISEASE) have been added

ANATOMY
CARDINAL
DATE
DEVICE
DISEASE
DRUG
EVENT
FAC
GPE
LANGUAGE
LAW
LOC
MONEY
NORP
ORDINAL
ORG
PERCENT
PERSON
PROCEDURE
PRODUCT
QUANTITY
TIME
WORK_OF_ART


In [159]:
training_data[:]

[('The mucosa was normal throughout the colon.  No polyps or other lesions were identified, and no blood was noted.  Some diverticula were seen of the sigmoid colon with no luminal narrowing or evidence of inflammation.\r',
  {'entities': [[4, 10, 'ANATOMY'],
    [37, 43, 'ANATOMY'],
    [48, 54, 'DISEASE'],
    [64, 71, 'DISEASE'],
    [119, 130, 'DISEASE'],
    [148, 161, 'ANATOMY'],
    [203, 215, 'DISEASE']]}),
 ('Both groins were prepped and draped in the usual sterile fashion.  After local anesthesia with 2% lidocaine, a 6-French sheath was inserted in the right femoral artery.\r',
  {'entities': [[5, 11, 'ANATOMY'],
    [73, 89, 'PROCEDURE'],
    [98, 107, 'DRUG'],
    [111, 126, 'DEVICE'],
    [147, 167, 'ANATOMY']]}),
 ('Axial scans were performed from L1 to S2 and reformatted images were obtained in the sagittal and coronal planes.\r',
  {'entities': [[0, 11, 'PROCEDURE'],
    [32, 34, 'ANATOMY'],
    [38, 40, 'ANATOMY'],
    [85, 93, 'ANATOMY'],
    [98, 112, 'ANATOMY']]}),


In [190]:
# training

import random
from spacy.training.example import Example

random.shuffle(training_data)
split = int(0.8 * len(training_data))
train_data = training_data[0:split]
test_data = training_data[split:]

other_pipes = [p for p in nlp.pipe_names if p != "ner"]

with nlp.select_pipes(disable=other_pipes): # select_pipes replaces the deprecated disable_pipes in spaCy v3
    optimizer = nlp.resume_training()

    for epoch in range(20):
        random.shuffle(train_data)
        losses = {}
        for text, ann in train_data:
            doc = nlp.make_doc(text)
            example = Example.from_dict(doc, ann)
            nlp.update([example], sgd=optimizer, losses=losses) # update weights
        print(f"Epoch {epoch}: {losses}")
nlp.to_disk("my_ner_model")

Epoch 0: {'ner': 29.748876931298522}
Epoch 1: {'ner': 23.50913157152811}
Epoch 2: {'ner': 19.234374840728485}
Epoch 3: {'ner': 16.409423711366372}
Epoch 4: {'ner': 16.096347384444584}
Epoch 5: {'ner': 14.1439067641369}
Epoch 6: {'ner': 13.252874352085243}
Epoch 7: {'ner': 73.24367454946228}
Epoch 8: {'ner': 38.588141746614475}
Epoch 9: {'ner': 12.985829026415828}
Epoch 10: {'ner': 22.502134727766233}
Epoch 11: {'ner': 10.707578108723173}
Epoch 12: {'ner': 10.11143771259589}
Epoch 13: {'ner': 13.992257199712402}
Epoch 14: {'ner': 22.63944678402162}
Epoch 15: {'ner': 22.822468919991564}
Epoch 16: {'ner': 17.428441711729675}
Epoch 17: {'ner': 10.31902671807463}
Epoch 18: {'ner': 9.441057194641614}
Epoch 19: {'ner': 9.459954951747747}


In [161]:
train_data[:]

[('Axial scans were performed from L1 to S2 and reformatted images were obtained in the sagittal and coronal planes.\r',
  {'entities': [[0, 11, 'PROCEDURE'],
    [32, 34, 'ANATOMY'],
    [38, 40, 'ANATOMY'],
    [85, 93, 'ANATOMY'],
    [98, 112, 'ANATOMY']]}),
 ('Both groins were prepped and draped in the usual sterile fashion.  After local anesthesia with 2% lidocaine, a 6-French sheath was inserted in the right femoral artery.\r',
  {'entities': [[5, 11, 'ANATOMY'],
    [73, 89, 'PROCEDURE'],
    [98, 107, 'DRUG'],
    [111, 126, 'DEVICE'],
    [147, 167, 'ANATOMY']]}),
 ('The patient tolerated the procedure well and was transferred to recovery room in stable condition.  She will be placed on a high-fiber diet and Colace and we will continue to monitor her hemoglobin.\r',
  {'entities': [[144, 150, 'DRUG']]}),
 ('The patient will discontinue all caffeinated and carbonated beverages and any fluids three hours prior to going to bed.  He already knows that this does decrease his noctu

In [162]:
test_data[:]

[('The mucosa was normal throughout the colon.  No polyps or other lesions were identified, and no blood was noted.  Some diverticula were seen of the sigmoid colon with no luminal narrowing or evidence of inflammation.\r',
  {'entities': [[4, 10, 'ANATOMY'],
    [37, 43, 'ANATOMY'],
    [48, 54, 'DISEASE'],
    [64, 71, 'DISEASE'],
    [119, 130, 'DISEASE'],
    [148, 161, 'ANATOMY'],
    [203, 215, 'DISEASE']]})]

In [191]:
# testing

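def evaluate_ner(nlp, data):
    """Count exact-match true positives, false positives and false negatives.

    A prediction counts as correct only if start, end and label all match.
    """
    tp = 0
    fn = 0
    fp = 0
    pred_entities = set() # stays empty if data is empty
    for text, ann in data:
        true_entities = set((start, end, label) for start, end, label in ann["entities"])

        doc = nlp(text)

        pred_entities = set((ent.start_char, ent.end_char, ent.label_) for ent in doc.ents)

        tp += len(true_entities & pred_entities) # predictions that match a gold entity exactly
        fp += len(pred_entities - true_entities) # predictions without an exact gold match
        fn += len(true_entities - pred_entities) # gold entities the model missed
    # note: pred_entities holds the predictions for the last text only
    return tp, fp, fn, pred_entities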

nlp_new = spacy.load("my_ner_model")

tp, fp, fn, pred_entities = evaluate_ner(nlp_new, test_data)
print(tp, fp, fn, pred_entities)

1 2 6 {(4, 10, 'ANATOMY'), (119, 130, 'ANATOMY'), (148, 155, 'ANATOMY')}
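
The raw counts are easier to interpret as precision, recall and F1. The sketch below uses the counts printed above (tp=1, fp=2, fn=6). `prf` is a helper name introduced here, and with a single test sentence these figures are only indicative.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from exact-match counts; 0.0 where undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# counts from the evaluation on the held-out sentence
precision, recall, f1 = prf(tp=1, fp=2, fn=6)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here one of three predictions is an exact match (precision about 0.33) and one of seven gold entities is found (recall about 0.14). The other two predictions fail on the label (`diverticula` tagged ANATOMY instead of DISEASE) and on the boundary (148 to 155 instead of 148 to 161).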


### Apply model with new entities to subset and samples from before

In [192]:
def extract_spacy_entities_new(text):
    doc = nlp_new(text)
    return [(ent.text, ent.label_) for ent in doc.ents]

subsetdf_new = df.iloc[5:505].copy() # we skip the first 5 texts because before we trained our model on sentences from the first 5 texts

subsetdf_new['spacy_ents'] = subsetdf_new['text'].apply(extract_spacy_entities_new)
subsetdf_new[["text", "spacy_ents"]].head()

Unnamed: 0,text,spacy_ents
5,"XYZ, O.D.,RE: ABC,DOB: MM/DD/YYYY,Dear Dr. X...","[(XYZ, DRUG), (ABC, ORG), (DOB, ORG), (MM, GPE..."
6,"PREOPERATIVE DIAGNOSES:,1. Intrauterine pregn...","[(Intrauterine, ANATOMY), (30 and 4/7th weeks,..."
7,"REASON FOR CT SCAN: , The patient is a 79-year...","[(79-year-old, DATE), (CT, ORG), (January 16, ..."
8,"CLINICAL HISTORY: , Patient is a 37-year-old f...","[(Patient, DRUG), (37-year-old, DATE), (adenom..."
9,"INDICATION: , Iron deficiency anemia.,PROCEDUR...","[(Colonoscopy, DRUG), (DIAGNOSIS, ANATOMY), (F..."


In [193]:
all_entities_of_subset_new = []

for ents in subsetdf_new['spacy_ents']:
    for ent_text, ent_label in ents:
        all_entities_of_subset_new.append((ent_text, ent_label))

len(all_entities_of_subset_new)

14963

In [194]:
df_entities_new = pd.DataFrame(all_entities_of_subset_new, columns =  ['entity_text', 'predicted_label'])

samples_of_100_entities_df_new = df_entities_new.sample(100, random_state=42)

#df_entities.head()
samples_of_100_entities_df_new

Unnamed: 0,entity_text,predicted_label
12669,node,ANATOMY
5369,fifth,ORDINAL
3059,23-hour,TIME
9755,Infant,DRUG
14300,37 plus weeks,DATE
...,...,...
7756,regimen,ANATOMY
13056,DIAGNOSTIC & LAB,ORG
1688,15,CARDINAL
5196,Lumbar,DRUG


In [195]:
# manually check the entities and their labels

is_correct_new = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1,
                  0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
                  0, 1, 0, 0, 0, 0, 0, 1, 1, 1,
                  0, 0, 1, 1, 1, 0, 0, 1, 0, 1,
                  0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
                  0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
                  0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
                  1, 0, 0, 0, 1, 1, 0, 0, 1, 0,
                  1, 0, 0, 0, 0, 1, 1, 0, 0, 0,
                  1, 0, 0, 0, 1, 0, 0, 1, 0, 0]

samples_of_100_entities_df_new['is_correct'] = is_correct_new

samples_of_100_entities_df_new

Unnamed: 0,entity_text,predicted_label,is_correct
12669,node,ANATOMY,1
5369,fifth,ORDINAL,1
3059,23-hour,TIME,1
9755,Infant,DRUG,0
14300,37 plus weeks,DATE,0
...,...,...,...
7756,regimen,ANATOMY,0
13056,DIAGNOSTIC & LAB,ORG,0
1688,15,CARDINAL,1
5196,Lumbar,DRUG,0


In [196]:
import numpy as np

accuracy_new = np.mean(samples_of_100_entities_df_new['is_correct'])
print(accuracy_new)

0.34
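
Whether the drop from 0.40 to 0.34 reflects a real difference or sampling noise can be roughly checked with a two-proportion z-test. The sketch below assumes that both sets of 100 entities are independent simple random samples. `two_proportion_z` is a hypothetical helper, not part of the notebook.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic and two-sided p-value (normal approximation)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p-value from the standard normal CDF, computed via the error function
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# 40/100 correct for the standard model, 34/100 after adding the medical labels
z, p_value = two_proportion_z(40, 100, 34, 100)
print(f"z = {z:.2f}, p = {p_value:.2f}")
```

With samples this small, a difference of six percentage points is within the range that sampling noise alone could produce, so the two manual accuracies by themselves do not establish that one model is better than the other.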


In [197]:
label_counts_samples_new = samples_of_100_entities_df_new['predicted_label'].value_counts()
label_counts_samples_new

predicted_label
DRUG         42
ANATOMY      29
DATE         14
CARDINAL      5
ORG           4
ORDINAL       1
TIME          1
GPE           1
PROCEDURE     1
DEVICE        1
PERSON        1
Name: count, dtype: int64

In [198]:
label_counts_entities_new = df_entities_new['predicted_label'].value_counts()
label_counts_entities_new

predicted_label
DRUG         6107
ANATOMY      4564
DATE         1326
ORG           833
CARDINAL      536
PERSON        488
ORDINAL       236
QUANTITY      161
PROCEDURE     127
DEVICE        125
GPE           113
TIME           92
MONEY          79
NORP           48
DISEASE        40
PRODUCT        35
PERCENT        19
FAC            14
LOC            10
EVENT           7
LAW             3
Name: count, dtype: int64
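
To see how strongly the retrained model favours the new labels, the counts above can be turned into shares of all predictions. This is a minimal sketch in plain Python: the counts are copied from the output above, and the variable names are ours.

```python
# Label counts copied from the value_counts() output of the retrained model
counts_new = {
    "DRUG": 6107, "ANATOMY": 4564, "DATE": 1326, "ORG": 833, "CARDINAL": 536,
    "PERSON": 488, "ORDINAL": 236, "QUANTITY": 161, "PROCEDURE": 127,
    "DEVICE": 125, "GPE": 113, "TIME": 92, "MONEY": 79, "NORP": 48,
    "DISEASE": 40, "PRODUCT": 35, "PERCENT": 19, "FAC": 14, "LOC": 10,
    "EVENT": 7, "LAW": 3,
}
medical_labels = ["ANATOMY", "DISEASE", "PROCEDURE", "DRUG", "DEVICE"]

total = sum(counts_new.values())
shares = {label: counts_new.get(label, 0) / total for label in medical_labels}
medical_share = sum(shares.values())

# print the new labels from most to least frequent
for label, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{label:<10} {share:6.1%}")
print(f"all medical labels: {medical_share:.1%} of {total} predictions")
```

About 73% of all predictions now carry one of the five new labels, and DRUG and ANATOMY alone account for most of them, while DISEASE appears only 40 times. This skew may reflect how few annotated sentences the model was trained on.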