**Tutorial Example: Building a Custom NER Model with spaCy**
- https://blog.futuresmart.ai/building-a-custom-ner-model-with-spacy-a-step-by-step-guide
- The dataset is from https://www.kaggle.com/datasets/finalepoch/medical-ner
- The goal is to recognize and tag mentions of diseases (medical conditions), pathogens, and medications in text.

# Install Packages & Import Libraries

In [2]:
!pip install spacy --quiet

# Download the large English model for spaCy.
# The '--quiet' flag is used to suppress the download messages for a cleaner output.
!python -m spacy download en_core_web_lg --quiet

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [3]:
import spacy
from spacy import displacy
from spacy.tokens import DocBin
from spacy.util import filter_spans

from tqdm import tqdm

# Getting Familiar with Named Entities in spaCy

In [4]:
# Load the large English model into the 'nlp' variable.
# The large model ships with word vectors and is generally more accurate than the small model, at the cost of a bigger download.
nlp = spacy.load("en_core_web_lg")

In [5]:
# Define the text to be processed
text = "Apple is looking at buying U.K. startup for $1 billion"

# Process the text using the loaded spaCy model
doc = nlp(text)

# Extract and print entities with their explanations
for ent in doc.ents:
    label = ent.label_
    explanation = spacy.explain(label)
    print(f"Entity: {ent.text}, Start: {ent.start_char}, End: {ent.end_char}, Label: {label}, Explanation: {explanation}")

Entity: Apple, Start: 0, End: 5, Label: ORG, Explanation: Companies, agencies, institutions, etc.
Entity: U.K., Start: 27, End: 31, Label: GPE, Explanation: Countries, cities, states
Entity: $1 billion, Start: 44, End: 54, Label: MONEY, Explanation: Monetary values, including unit
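
The `Start` and `End` values printed above are character offsets into the original string, with `End` exclusive, so slicing the text with them should reproduce each entity. A minimal check in plain Python, using the offsets copied from the output above (no spaCy required):

```python
# Offsets copied from the printed output above; End is exclusive.
text = "Apple is looking at buying U.K. startup for $1 billion"
printed_entities = [
    (0, 5, "ORG"),
    (27, 31, "GPE"),
    (44, 54, "MONEY"),
]

# Slicing with (start, end) recovers the entity text.
recovered = [(text[start:end], label) for start, end, label in printed_entities]
for entity_text, label in recovered:
    print(f"{label:<6}{entity_text}")
```

The annotated medical data used later in this notebook stores entities in the same `(start, end, label)` form.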


In [6]:
# Visualize the named entities using displacy
displacy.render(doc, style="ent", jupyter=True)

In [7]:
# Define a longer text to be processed
long_text = ("Apple is looking at buying U.K. startup for $1 billion. "
             "Amazon is also considering a similar move, investing in a French AI company. "
             "Meanwhile, Microsoft is expanding its operations in Canada.")

doc = nlp(long_text)
displacy.render(doc, style="ent", options={"ents": ["ORG", "GPE", "MONEY"]})

In [None]:
# displacy.serve(doc, style="ent", options={"ents": ["ORG", "GPE", "MONEY"]}, auto_select_port=True)

# Load the annotated medical data


In [8]:
import json
corona_filepath = 'Corona2.json'

with open(corona_filepath, 'r') as f:
    data = json.load(f)

In [9]:
data['examples'][0]['content'], data['examples'][0]['annotations'][0]

("While bismuth compounds (Pepto-Bismol) decreased the number of bowel movements in those with travelers' diarrhea, they do not decrease the length of illness.[91] Anti-motility agents like loperamide are also effective at reducing the number of stools but not the duration of disease.[8] These agents should be used only if bloody diarrhea is not present.[92]\n\nDiosmectite, a natural aluminomagnesium silicate clay, is effective in alleviating symptoms of acute diarrhea in children,[93] and also has some effects in chronic functional diarrhea, radiation-induced diarrhea, and chemotherapy-induced diarrhea.[45] Another absorbent agent used for the treatment of mild diarrhea is kaopectate.\n\nRacecadotril an antisecretory medication may be used to treat diarrhea in children and adults.[86] It has better tolerability than loperamide, as it causes less constipation and flatulence.[94]",
 {'id': '0825a1bf-6a6e-4fa2-be77-8d104701eaed',
  'tag_id': 'c06bd022-6ded-44a5-8d90-f17685bb85a1',
  'end

In [10]:
training_data = []

for example in data['examples']:
    temp_dict = {}
    temp_dict['text'] = example['content']
    temp_dict['entities'] = []
    for annotation in example['annotations']:
        start = annotation['start']
        end = annotation['end']
        label = annotation['tag_name'].upper()  # Convert label to uppercase
        temp_dict['entities'].append((start, end, label))
    training_data.append(temp_dict)
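
Before converting these tuples into spaCy documents, it can help to sanity-check the offsets. The sketch below is one possible check, not part of the original tutorial: the helper name `find_bad_offsets` and the sample record are hypothetical, and it only flags offsets that fall outside the text or that begin or end on whitespace.

```python
def find_bad_offsets(example):
    """Return (start, end, label, reason) for every suspicious annotation."""
    text = example["text"]
    problems = []
    for start, end, label in example["entities"]:
        if not (0 <= start < end <= len(text)):
            problems.append((start, end, label, "out of range"))
        elif text[start:end] != text[start:end].strip():
            problems.append((start, end, label, "leading/trailing whitespace"))
    return problems


# Hypothetical record in the same shape as one item of `training_data`.
sample = {
    "text": "Loperamide reduces diarrhea.",
    "entities": [
        (0, 10, "MEDICINE"),           # "Loperamide"
        (18, 27, "MEDICALCONDITION"),  # " diarrhea" (starts on a space)
        (19, 40, "MEDICALCONDITION"),  # runs past the end of the text
    ],
}

problems = find_bad_offsets(sample)
for problem in problems:
    print(problem)
```

Running the same helper over every item in `training_data` would list the annotations most likely to cause trouble in the conversion step.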

In [11]:
training_data

[{'text': "While bismuth compounds (Pepto-Bismol) decreased the number of bowel movements in those with travelers' diarrhea, they do not decrease the length of illness.[91] Anti-motility agents like loperamide are also effective at reducing the number of stools but not the duration of disease.[8] These agents should be used only if bloody diarrhea is not present.[92]\n\nDiosmectite, a natural aluminomagnesium silicate clay, is effective in alleviating symptoms of acute diarrhea in children,[93] and also has some effects in chronic functional diarrhea, radiation-induced diarrhea, and chemotherapy-induced diarrhea.[45] Another absorbent agent used for the treatment of mild diarrhea is kaopectate.\n\nRacecadotril an antisecretory medication may be used to treat diarrhea in children and adults.[86] It has better tolerability than loperamide, as it causes less constipation and flatulence.[94]",
  'entities': [(360, 371, 'MEDICINE'),
   (383, 408, 'MEDICINE'),
   (104, 112, 'MEDICALCONDITION

In [12]:
nlp = spacy.blank("en")   # create a blank English pipeline (tokenizer only, no trained components)
doc_bin = DocBin()

In [15]:
for training_example in tqdm(training_data):
    text = training_example['text']
    labels = training_example['entities']
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in labels:
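        # char_span returns None when the character offsets cannot be aligned to tokens.
        # With alignment_mode="contract" the span is shrunk to the tokens that lie fully
        # inside the range, so None means no whole token fits between start and end.
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")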
        else:
            ents.append(span)
    filtered_ents = filter_spans(ents)
    doc.ents = filtered_ents
    doc_bin.add(doc)

doc_bin.to_disk('train.spacy')

100%|█████████████████████████████████████████████████████████████████████████████████████| 31/31 [00:00<00:00, 1154.00it/s]

Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
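
The loop above also calls `filter_spans(ents)`, because `doc.ents` cannot hold overlapping spans. The sketch below is a plain-Python approximation of that overlap rule (longest span wins, and an earlier start breaks ties). It is an illustration only: the function name `filter_overlaps` and the sample spans are hypothetical, and it works on character offsets, whereas spaCy's own function works on token positions of `Span` objects.

```python
def filter_overlaps(spans):
    """Keep the longest spans; on ties, prefer the one that starts first."""
    # Longest first, then earliest start.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept, taken = [], set()
    for start, end, label in ordered:
        # Skip a span if either of its boundary positions is already covered.
        if start not in taken and (end - 1) not in taken:
            kept.append((start, end, label))
            taken.update(range(start, end))
    return sorted(kept)


# Hypothetical annotations: the first two overlap, the third stands alone.
spans = [
    (0, 6, "MEDICINE"),
    (0, 15, "MEDICALCONDITION"),
    (20, 30, "MEDICINE"),
]
filtered = filter_overlaps(spans)
print(filtered)
```

Here the shorter `(0, 6)` span is dropped because the longer `(0, 15)` span already covers its positions.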





# Initializing the training config file
https://spacy.io/usage/training

In [17]:
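# Assumes base_config.cfg already exists in the working directory
# (for example, a starter config generated from the spaCy training page linked above).
!python -m spacy init fill-config base_config.cfg config.cfg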

✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
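
The command printed above expects a separate `./dev.spacy` file, while this notebook only writes `train.spacy`. One way to obtain a held-out set is to split the list of examples before conversion. The sketch below is an assumption, not the tutorial's method: `split_examples`, the 20% dev fraction, and the fixed seed are illustrative choices, and the stand-in list simply mirrors the 31 examples reported by the progress bar.

```python
import random


def split_examples(examples, dev_fraction=0.2, seed=0):
    """Shuffle a copy of `examples` and split it into (train, dev) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


# Stand-in for the 31 annotated examples in `training_data`.
examples = [{"text": f"example {i}", "entities": []} for i in range(31)]
train_examples, dev_examples = split_examples(examples)
print(len(train_examples), len(dev_examples))
```

Each list would then go through the same `DocBin` loop and be saved under its own file name, for instance `train.spacy` and `dev.spacy`.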


# Training the NER model


In [18]:
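# Note: train.spacy is passed as both the training set and the dev set, so the scores
# below are measured on data the model has already seen and will overstate accuracy on new text.
!python -m spacy train config.cfg --output ./ --paths.train train.spacy --paths.dev train.spacy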

ℹ Saving to output directory: .
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     59.36    0.00    0.00    0.00    0.00
  3     200        325.87   3270.19   70.42   72.02   68.90    0.70
  7     400         82.63   1001.90   91.09   89.69   92.52    0.91
✔ Saved pipeline to output directory
model-last
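
In the log above, `ENTS_P` is precision, `ENTS_R` is recall, and `ENTS_F` is their harmonic mean (the F1 score). As a rough check, the sketch below recomputes F1 from the logged precision and recall. Because the logged values are already rounded to two decimals, the recomputed figure can differ from the logged one in the last digit.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (ENTS_P, ENTS_R, ENTS_F) copied from the training log above.
rows = [(0.00, 0.00, 0.00), (72.02, 68.90, 70.42), (89.69, 92.52, 91.09)]
for p, r, logged_f in rows:
    print(f"P={p:6.2f} R={r:6.2f} F1={f_score(p, r):6.2f} (logged {logged_f:.2f})")
```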


# Load the trained NER model

In [19]:
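# spacy train writes both model-best and model-last to the output directory;
# load the best-scoring checkpoint.
nlp_ner = spacy.load("model-best")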

In [20]:
doc = nlp_ner("While bismuth compounds (Pepto-Bismol) decreased the number of bowel movements in those with travelers' diarrhea, they do not decrease the length of illness.[91] Anti-motility agents like loperamide are also effective at reducing the number of stools but not the duration of disease.[8] These agents should be used only if bloody diarrhea is not present.")

colors = {"PATHOGEN": "#F67DE3", "MEDICINE": "#7DF6D9", "MEDICALCONDITION": "#a6e22d"}
options = {"colors": colors}

displacy.render(doc, style="ent", options=options, jupyter=True)

In [21]:
for ent in doc.ents:
    label = ent.label_
    print(f"Entity: {ent.text}, \tLabel: {label}")

Entity: bismuth compounds, 	Label: MEDICINE
Entity: Pepto-Bismol, 	Label: MEDICINE
Entity: diarrhea, 	Label: MEDICALCONDITION
Entity: loperamide, 	Label: MEDICINE
