<a href="https://colab.research.google.com/github/Praveen76/Build-a-Custom-NER-Model-using-Spacy/blob/main/Build-a-Custom-NER-Model-using-Spacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Learning Objectives:

At the end of the experiment, you will be able to:

* understand the spaCy library
* train a custom Named Entity Recognition (NER) model using spaCy

## Introduction

**Named Entity Recognition (NER)** is a core information-extraction task in NLP. It aims to locate and categorize key information, i.e., entities, in text data. An ‘entity’ can be any word or sequence of words that consistently refers to the same thing.

At its core, entity recognition systems have two steps:

- Detecting the entities in text
- Categorizing the entities into named classes

These categories change depending on the use case. Some of the most common entity classes are:

- Person
- Organization
- Location
- Time
- Measurements or Quantities
- String patterns like email addresses, phone numbers, or IP addresses
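
For the string-pattern category, both steps (detecting and categorizing) can be sketched with regular expressions alone. The snippet below is a minimal illustration and not a production recognizer: the simplified patterns, the label names `EMAIL` and `IP_ADDRESS`, and the helper `find_pattern_entities` are assumptions made for this example.

```python
import re

# Simplified, illustrative patterns (assumptions; real-world formats are messier)
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def find_pattern_entities(text):
    """Return (start, end, label) tuples for every pattern match, sorted by position."""
    entities = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append((match.start(), match.end(), label))
    return sorted(entities)

sample = "Contact admin@example.com from host 192.168.0.1 today."
for start, end, label in find_pattern_entities(sample):
    print(label, repr(sample[start:end]))
```

Each result is a `(start, end, label)` tuple of character offsets, the same shape used for the training annotations later in this notebook.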

Applications of Named Entity Recognition:

- Information Extraction And Summarization
- Optimizing Search Engines
- Machine Translation
- Content Classification
- Customer Support

### Install packages

In [None]:
!pip -q install spacy==3.7.4

In [None]:
!python -m spacy info

spaCy version    3.7.4                         
Location         /usr/local/lib/python3.10/dist-packages/spacy
Platform         Linux-6.1.58+-x86_64-with-glibc2.35
Python version   3.10.12                       
Pipelines        en_core_web_sm (3.7.1)        



The output above shows that the small English pipeline, `en_core_web_sm`, is already installed in this environment.

Medium, large, and transformer-based pre-trained pipelines must be installed first with the `!python -m spacy download` command.

In [None]:
# Install English transformer pipeline
# NOTE that Runtime needs to restart after this step

!python -m spacy download en_core_web_trf

Collecting en-core-web-trf==3.7.3
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.7.3/en_core_web_trf-3.7.3-py3-none-any.whl (457.4 MB)
Collecting spacy-curated-transformers<0.3.0,>=0.2.0 (from en-core-web-trf==3.7.3)
  Downloading spacy_curated_transformers-0.2.2-py2.py3-none-any.whl (236 kB)
Collecting curated-transformers<0.2.0,>=0.1.0 (from spacy-curated-transformers<0.3.0,>=0.2.0->en-core-web-trf==3.7.3)
  Downloading curated_transformers-0.1.1-py2.py3-none-any.whl (25 kB)
Collecting curated-tokenizers<0.1.0,>=0.0.9 (from spacy-curated-transformers<0.3.0,>=0.2.0->en-core-web-trf==3.7.3)
  Downloading curated_tokenizers-0.0.9-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (731 kB)

**Restart the Runtime/Session**

In [None]:
!python -m spacy info

spaCy version    3.7.4                         
Location         /usr/local/lib/python3.10/dist-packages/spacy
Platform         Linux-6.1.58+-x86_64-with-glibc2.35
Python version   3.10.12                       
Pipelines        en_core_web_trf (3.7.3), en_core_web_sm (3.7.1)



### Import required packages

In [None]:
import json
import spacy
from spacy import displacy             # to visualize/render text
from spacy.tokens import DocBin        # to efficiently serialize a collection of spaCy Doc objects
from spacy.util import filter_spans    # to handle entity span overlaps
from tqdm import tqdm                  # to make your loops show a smart progress meter

## NER in spaCy

By default, spaCy's trained English pipelines load a part-of-speech tagger, a dependency parser, and a named entity recognizer (`ner`), along with a few supporting components.

In [None]:
# Load transformer pipeline for English
nlp = spacy.load("en_core_web_trf")

nlp.pipe_names

['transformer', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [None]:
# Visualize a sample text
text = "On 3rd Feb, Ram was in Delhi.\nLater he traveled to Mumbai via Air India flight reading a Time magazine to meet Raj.\nAfter 10 days, he went again back to Delhi wearing a Timex watch."
doc = nlp(text)
spacy.displacy.render(doc, style='ent')

Test another example of medical text:

In [None]:
# Visualize another sample text
text2 = "Antiretroviral therapy (ART) is recommended for all HIV-infected individuals to reduce the risk of disease progression."
doc2 = nlp(text2)
spacy.displacy.render(doc2, style='ent')

Here, we can see that the general-purpose pipeline does not perform well on medical text. It has no labels for domain-specific entities such as medicines or medical conditions, so it misses most of the relevant information in text like medical journal articles.

To solve this we'll need to train our own NER model. The process is very straightforward with spaCy.

## Training custom NER model using spaCy

To train our custom named entity recognition model, we will need some relevant text data with the proper annotations.

### Dataset Description

The **Medical NER** [dataset](https://www.kaggle.com/datasets/finalepoch/medical-ner) contains manually tagged text (diseases, pathogens, and medications) for training NER systems.

This dataset was created to train a spaCy model to perform Named Entity Recognition for three categories:

* ***Medical condition names*** (e.g., influenza, headache, malaria)
* ***Medicine names*** (e.g., aspirin, penicillin, ribavirin, methotrexate)
* ***Pathogens*** (e.g., Corona Virus, Zika Virus, cyanobacteria, E. coli)

In [None]:
#@title Download the data
!wget https://cdn.iisc.talentsprint.com/AIandMLOps/Datasets/Corona2.json

from IPython.display import clear_output
clear_output()
print("Data downloaded successfully!")
!ls | grep '.json'

Data downloaded successfully!
Corona2.json


The `Corona2.json` file contains annotated text that was generated using the [LightTag](https://www.lighttag.io/) online tool.

Let's start by taking a look at the dataset.

### Load data

In [None]:
# Load data from json file

with open('/content/Corona2.json', 'r') as f:
    data = json.load(f)

In [None]:
data.keys()

dict_keys(['examples'])

In [None]:
# Number of samples
len(data['examples'])

31

In [None]:
# A sample
data['examples'][0]

{'id': '18c2f619-f102-452f-ab81-d26f7e283ffe',
 'content': "While bismuth compounds (Pepto-Bismol) decreased the number of bowel movements in those with travelers' diarrhea, they do not decrease the length of illness.[91] Anti-motility agents like loperamide are also effective at reducing the number of stools but not the duration of disease.[8] These agents should be used only if bloody diarrhea is not present.[92]\n\nDiosmectite, a natural aluminomagnesium silicate clay, is effective in alleviating symptoms of acute diarrhea in children,[93] and also has some effects in chronic functional diarrhea, radiation-induced diarrhea, and chemotherapy-induced diarrhea.[45] Another absorbent agent used for the treatment of mild diarrhea is kaopectate.\n\nRacecadotril an antisecretory medication may be used to treat diarrhea in children and adults.[86] It has better tolerability than loperamide, as it causes less constipation and flatulence.[94]",
 'metadata': {},
 'annotations': [{'id': '0825a1

In [None]:
data['examples'][0].keys()

dict_keys(['id', 'content', 'metadata', 'annotations', 'classifications'])

In [None]:
data['examples'][0]['content']

"While bismuth compounds (Pepto-Bismol) decreased the number of bowel movements in those with travelers' diarrhea, they do not decrease the length of illness.[91] Anti-motility agents like loperamide are also effective at reducing the number of stools but not the duration of disease.[8] These agents should be used only if bloody diarrhea is not present.[92]\n\nDiosmectite, a natural aluminomagnesium silicate clay, is effective in alleviating symptoms of acute diarrhea in children,[93] and also has some effects in chronic functional diarrhea, radiation-induced diarrhea, and chemotherapy-induced diarrhea.[45] Another absorbent agent used for the treatment of mild diarrhea is kaopectate.\n\nRacecadotril an antisecretory medication may be used to treat diarrhea in children and adults.[86] It has better tolerability than loperamide, as it causes less constipation and flatulence.[94]"

In [None]:
data['examples'][0]['annotations'][0]

{'id': '0825a1bf-6a6e-4fa2-be77-8d104701eaed',
 'tag_id': 'c06bd022-6ded-44a5-8d90-f17685bb85a1',
 'end': 371,
 'start': 360,
 'example_id': '18c2f619-f102-452f-ab81-d26f7e283ffe',
 'tag_name': 'Medicine',
 'value': 'Diosmectite',
 'correct': None,
 'human_annotations': [{'timestamp': '2020-03-21T00:24:32.098000Z',
   'annotator_id': 1,
   'tagged_token_id': '0825a1bf-6a6e-4fa2-be77-8d104701eaed',
   'name': 'Ashpat123',
   'reason': 'exploration'}],
 'model_annotations': []}

In [None]:
data['examples'][0]['annotations'][0].keys()

dict_keys(['id', 'tag_id', 'end', 'start', 'example_id', 'tag_name', 'value', 'correct', 'human_annotations', 'model_annotations'])

### Preprocess data

We only need the text string (`content`), each entity's `start` and `end` character offsets, and its label (`tag_name`).

In [None]:
# Extract text string, entity start and end indices, and entity label

training_data = []

for example in data['examples']:
    temp_dict = {}
    temp_dict['text'] = example['content']             # text string
    temp_dict['entities'] = []
    for annotation in example['annotations']:
        start = annotation['start']                    # entity start index
        end = annotation['end']                        # entity end index
        label = annotation['tag_name'].upper()         # entity label
        temp_dict['entities'].append((start, end, label))
    training_data.append(temp_dict)

print(training_data[0])

{'text': "While bismuth compounds (Pepto-Bismol) decreased the number of bowel movements in those with travelers' diarrhea, they do not decrease the length of illness.[91] Anti-motility agents like loperamide are also effective at reducing the number of stools but not the duration of disease.[8] These agents should be used only if bloody diarrhea is not present.[92]\n\nDiosmectite, a natural aluminomagnesium silicate clay, is effective in alleviating symptoms of acute diarrhea in children,[93] and also has some effects in chronic functional diarrhea, radiation-induced diarrhea, and chemotherapy-induced diarrhea.[45] Another absorbent agent used for the treatment of mild diarrhea is kaopectate.\n\nRacecadotril an antisecretory medication may be used to treat diarrhea in children and adults.[86] It has better tolerability than loperamide, as it causes less constipation and flatulence.[94]", 'entities': [(360, 371, 'MEDICINE'), (383, 408, 'MEDICINE'), (104, 112, 'MEDICALCONDITION'), (679,

In [None]:
# Processed data for first sample
training_data[0]['text']

"While bismuth compounds (Pepto-Bismol) decreased the number of bowel movements in those with travelers' diarrhea, they do not decrease the length of illness.[91] Anti-motility agents like loperamide are also effective at reducing the number of stools but not the duration of disease.[8] These agents should be used only if bloody diarrhea is not present.[92]\n\nDiosmectite, a natural aluminomagnesium silicate clay, is effective in alleviating symptoms of acute diarrhea in children,[93] and also has some effects in chronic functional diarrhea, radiation-induced diarrhea, and chemotherapy-induced diarrhea.[45] Another absorbent agent used for the treatment of mild diarrhea is kaopectate.\n\nRacecadotril an antisecretory medication may be used to treat diarrhea in children and adults.[86] It has better tolerability than loperamide, as it causes less constipation and flatulence.[94]"

In [None]:
# Extracted entity details for first sample
training_data[0]['entities']

[(360, 371, 'MEDICINE'),
 (383, 408, 'MEDICINE'),
 (104, 112, 'MEDICALCONDITION'),
 (679, 689, 'MEDICINE'),
 (6, 23, 'MEDICINE'),
 (25, 37, 'MEDICINE'),
 (461, 470, 'MEDICALCONDITION'),
 (577, 589, 'MEDICINE'),
 (853, 865, 'MEDICALCONDITION'),
 (188, 198, 'MEDICINE'),
 (754, 762, 'MEDICALCONDITION'),
 (870, 880, 'MEDICALCONDITION'),
 (823, 833, 'MEDICINE'),
 (852, 853, 'MEDICALCONDITION'),
 (461, 469, 'MEDICALCONDITION'),
 (535, 543, 'MEDICALCONDITION'),
 (692, 704, 'MEDICINE'),
 (563, 571, 'MEDICALCONDITION')]

In [None]:
training_data[0]['text'][360:371]

'Diosmectite'
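
The same slice check can be applied to every annotation at once. The sketch below assumes each annotation carries `start`, `end`, and `value` keys, as in the record shown earlier; the helper name `find_mismatches` and the tiny inline record are made up for illustration and are not part of the dataset.

```python
def find_mismatches(example):
    """Return annotations whose character slice differs from the recorded value."""
    text = example["content"]
    return [
        ann for ann in example["annotations"]
        if text[ann["start"]:ann["end"]] != ann["value"]
    ]

# Hypothetical record in the same shape as data['examples'][i]
toy_example = {
    "content": "Aspirin can relieve a headache.",
    "annotations": [
        {"start": 0, "end": 7, "value": "Aspirin", "tag_name": "Medicine"},
        # Deliberately off by one: the slice yields "headach"
        {"start": 22, "end": 29, "value": "headache", "tag_name": "MedicalCondition"},
    ],
}

for ann in find_mismatches(toy_example):
    found = toy_example["content"][ann["start"]:ann["end"]]
    print("Mismatch:", repr(ann["value"]), "vs", repr(found))
```

Running a check like this over `data['examples']` before training would surface offsets that do not point at the text they claim to tag.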

spaCy uses the [*DocBin*](https://spacy.io/api/docbin) class to store annotated data, so we'll have to create a *DocBin* object for our training examples. The *DocBin* class efficiently serializes the information from a collection of *Doc* objects. It is faster and produces smaller data sizes than pickle, and allows the user to deserialize without executing arbitrary Python code.
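
To make the serialization step concrete, here is a minimal round trip through `DocBin` using a blank pipeline and a made-up sentence. The text, the `MEDICINE` span, and the variable names are assumptions for illustration only.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # tokenizer-only pipeline; no model download needed

doc = nlp.make_doc("Aspirin can relieve a headache.")
# Label "Aspirin" (characters 0-7) so the round trip has an entity to preserve
doc.ents = [doc.char_span(0, 7, label="MEDICINE")]

doc_bin = DocBin()
doc_bin.add(doc)
payload = doc_bin.to_bytes()  # compact binary representation

# Restore the Docs; the vocab is needed to map stored IDs back to strings
restored = list(DocBin().from_bytes(payload).get_docs(nlp.vocab))
print([(ent.text, ent.label_) for ent in restored[0].ents])
```

`to_disk` and `from_disk` work the same way but read and write a file, which is the form `spacy train` expects.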

In [None]:
from spacy.tokens import DocBin

# Create a blank English pipeline (tokenizer only, no trained components)
nlp_blank = spacy.blank("en")
doc_bin = DocBin()

There are some entity span overlaps, i.e., the character ranges of some annotations overlap (for example, `(461, 470)` and `(461, 469)` in the first sample). spaCy provides a utility function [filter_spans](https://spacy.io/api/top-level#util.filter_spans) to deal with this.
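
According to the spaCy documentation, `filter_spans` prefers the (first) longest span when spans overlap. The sketch below applies that rule directly to character-offset tuples taken from the first sample, to show which annotation survives. It is an illustration only and not spaCy's implementation, which works on `Span` objects; `drop_overlaps` is a hypothetical helper.

```python
def drop_overlaps(spans):
    """Keep the longest span in each overlapping group (ties: earliest start)."""
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept, used = [], set()
    for start, end, label in ordered:
        positions = range(start, end)
        if not any(p in used for p in positions):
            kept.append((start, end, label))
            used.update(positions)
    return sorted(kept)

# A few annotations from the first training sample
spans = [
    (461, 470, "MEDICALCONDITION"),
    (461, 469, "MEDICALCONDITION"),  # overlaps the span above and is shorter
    (535, 543, "MEDICALCONDITION"),
    (852, 853, "MEDICALCONDITION"),  # adjacent to the next span, not overlapping
    (853, 865, "MEDICALCONDITION"),
]

for span in drop_overlaps(spans):
    print(span)
```

Only `(461, 469)` is dropped here; spans that merely touch, such as `(852, 853)` and `(853, 865)`, are both kept because end offsets are exclusive.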

In [None]:
from tqdm import tqdm
from spacy.util import filter_spans

for training_example in tqdm(training_data):
    text = training_example['text']
    labels = training_example['entities']
    doc = nlp_blank.make_doc(text)
    ents = []
    for start, end, label in labels:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print(f"Skipping entity '{text[start:end]}'")
        else:
            ents.append(span)
    filtered_ents = filter_spans(ents)
    doc.ents = filtered_ents
    doc_bin.add(doc)

doc_bin.to_disk("train.spacy")

100%|██████████| 31/31 [00:00<00:00, 345.73it/s]

Skipping entity 'flatulence'
Skipping entity ' '
Skipping entity 'l'
Skipping entity 'DMARDs.'
Skipping entity 'DMARDs'
Skipping entity 'leflunomide'
Skipping entity 'ie'
Skipping entity 'died'
Skipping entity 'richomonas vag'
Skipping entity 'inflammation'
Skipping entity 'Campylobacte'
Skipping entity 'lobemide'
Skipping entity 'nxiety'
Skipping entity 'M'
Skipping entity 'rifapentine'
Skipping entity 'HIV'





### Training Configuration

Training config files include all **settings and hyperparameters** for training your pipeline. Instead of providing lots of arguments on the command line, you only need to pass your `config.cfg` file to `spacy train`. This also makes it easy to integrate custom models and architectures, written in your framework of choice. A pipeline's `config.cfg` is considered the “single source of truth”, both at training and runtime.

Let's initialize a config file using the `!python -m spacy init config` command. It takes a few arguments and options:

- ***output_file***: File to save the config to
- ***lang***: Two-letter code of the language to use
- ***pipeline***: Comma-separated names of trainable pipeline components to include
- ***optimize***: Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model)

To learn more about the other config arguments, run the `!python -m spacy init config --help` command.

In [None]:
!python -m spacy init config config.cfg --lang en --pipeline ner --optimize efficiency

⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
ℹ Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
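
The command suggested in the output above expects separate `train.spacy` and `dev.spacy` files. One way to obtain them is to split the preprocessed examples before building the `DocBin` objects. The sketch below is a minimal version of that idea: `split_examples`, the 20% dev fraction, the fixed seed, and the stand-in example list are all assumptions for illustration.

```python
import random

def split_examples(examples, dev_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split off a held-out dev portion."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]

# Stand-in for the 31 preprocessed examples
examples = [{"text": f"example {i}", "entities": []} for i in range(31)]

train_examples, dev_examples = split_examples(examples)
print(len(train_examples), len(dev_examples))
```

Each part would then be converted to its own `DocBin` and saved in the same way as `train.spacy`. With only 31 examples the dev set is very small, so scores computed on it should be read with caution.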


### Training

In [None]:
# NOTE: ./train.spacy is passed as both the training and the dev set,
# so the scores below are measured on data the model has already seen
!python -m spacy train config.cfg --output ./ --paths.train ./train.spacy --paths.dev ./train.spacy --nlp.batch_size 100 --training.max_epochs 25

ℹ Saving to output directory: .
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00    153.29    0.00    0.00    0.00    0.00
  7     200        734.85   3155.35   76.73   79.66   74.02    0.77
 14     400        669.55    734.48   97.24   97.24   97.24    0.97
 22     600        945.17    272.49   98.03   98.03   98.03    0.98
✔ Saved pipeline to output directory
model-last
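
The `ENTS_P`, `ENTS_R`, and `ENTS_F` columns report entity-level precision, recall, and F-score. The sketch below shows how such scores are commonly computed from exact-match spans; it is not spaCy's scorer, and the gold and predicted sets, as well as the helper `prf`, are made up for illustration.

```python
def prf(gold, predicted):
    """Exact-match precision, recall and F1 over sets of (start, end, label)."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gold = {(0, 7, "MEDICINE"), (22, 30, "MEDICALCONDITION"), (40, 45, "PATHOGEN")}
# The second prediction has the right offsets but the wrong label, so it counts as an error
predicted = {(0, 7, "MEDICINE"), (22, 30, "MEDICINE")}

p, r, f = prf(gold, predicted)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```

Under this exact-match definition a prediction only counts when both the boundaries and the label agree with the gold annotation.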


### Inference

Let's load the best-performing model and test it on a piece of text.

In [None]:
# Load best model
nlp_ner = spacy.load("model-best")

In [None]:
doc = nlp_ner("While bismuth compounds (Pepto-Bismol) decreased the number of bowel movements in those with travelers' diarrhea, they do not decrease the length of illness.[91] Anti-motility agents like loperamide are also effective at reducing the number of stools but not the duration of disease.[8] These agents should be used only if bloody diarrhea is not present.")

#colors = {"PATHOGEN": "#F67DE3", "MEDICINE": "#7DF6D9", "MEDICALCONDITION":"#a6e22d"}
colors = {"PATHOGEN": "yellow", "MEDICINE": "lightblue", "MEDICALCONDITION":"lightgreen"}
options = {"colors": colors}

spacy.displacy.render(doc, style="ent", options= options)

**Test another example:**

In [None]:
# Visualize another sample text
text2 = "Antiretroviral therapy (ART) is recommended for all HIV-infected individuals to reduce the risk of disease progression."
doc2 = nlp_ner(text2)

colors = {"PATHOGEN": "yellow", "MEDICINE": "lightblue", "MEDICALCONDITION":"lightgreen"}
options = {"colors": colors}

spacy.displacy.render(doc2, style='ent', options= options)