# NAMED ENTITY RECOGNITION

### Description
Named Entity Recognition (NER) is a sub-task of information extraction in Natural Language Processing (NLP) that locates named entities in text and classifies them into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, and monetary values. These entities often carry the most significant information in a text, so recognizing them is crucial for many NLP applications.
NER serves as a bridge between unstructured text and structured data: by pinpointing specific entities within a sea of words, it lets machines sift through large volumes of text and extract the valuable pieces in categorized form.
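
To make the "unstructured text to structured data" idea concrete, here is a minimal sketch of what an entity record looks like. It uses a toy lookup table (`GAZETTEER` and `tag_entities` are hypothetical names introduced only for illustration) and is not how spaCy's statistical recognizer works internally.

```python
import re

# Toy gazetteer: a hypothetical lookup table used only for illustration.
# Real NER models such as spaCy's predict entities from context instead.
GAZETTEER = {
    "Droupadi Murmu": "PERSON",
    "India": "GPE",
    "25 July 2022": "DATE",
}

def tag_entities(text):
    """Return (start, end, surface text, label) records sorted by position."""
    records = []
    for phrase, label in GAZETTEER.items():
        for match in re.finditer(re.escape(phrase), text):
            records.append((match.start(), match.end(), match.group(), label))
    return sorted(records)

sentence = "Droupadi Murmu became President of India on 25 July 2022."
entities = tag_entities(sentence)
for record in entities:
    print(record)
```

Each record pairs a character span with a category, which is the same shape of information that `doc.ents` exposes later in this notebook.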

In [80]:
# !pip install spacy

In [81]:
import spacy

`en_core_web_lg` is spaCy's large English pipeline.  
It ships with pretrained components, including the named entity recognizer, so it can be used without any training.

In [None]:
# !pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.0.0/en_core_web_lg-3.0.0.tar.gz
!python -m spacy download en_core_web_lg

DEPRECATION: https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.0.0/en_core_web_lg-3.0.0-py3-none-any.whl#egg=en_core_web_lg==3.0.0 contains an egg fragment with a non-PEP 508 name pip 25.0 will enforce this behaviour change. A possible replacement is to use the req @ url syntax, and remove the egg fragment. Discussion can be found 

In [None]:
nlp=spacy.load("en_core_web_lg")


In [None]:
nlp

<spacy.lang.en.English at 0x7957d12ed330>

### Passing a document to the loaded nlp object

In [None]:
doc=nlp("Donald Trum is the President of US")

Reading an input file

In [None]:
with open("/content/Sample1.txt", 'r') as file:
    text = file.read()
doc1=nlp(text)

In [None]:
print(type(doc))
print(type(doc1))

<class 'spacy.tokens.doc.Doc'>
<class 'spacy.tokens.doc.Doc'>


In [None]:
doc.ents

(Donald Trum, US)

In [None]:
doc1.ents

(India,
 Bhārata,
 the Republic of India,
 first,
 the Indian Armed Forces,
 Droupadi Murmu,
 the 15th,
 25 July 2022,
 India,
 26 January 1950,
 the Parliament of India,
 India,
 Article 53 of the Constitution of India,
 the Council of Ministers,
 the Supreme Court,
 article 142,
 India,
 Article 60 of,
 Indian,
 constitution).The,
 Article 3,
 Article 111,
 Article 274,
 Article 74(2,
 Article 78C,
 Article 108,
 Article 111,
 India)

In [None]:
# Each entity in doc.ents is a spaCy Span object
doc.ents[0],type(doc.ents[0])

(Donald Trum, spacy.tokens.span.Span)

In [None]:
doc1.ents[0],type(doc1.ents[0])

(India, spacy.tokens.span.Span)
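
Because each entity is a `Span`, its label and character offsets are available as attributes. The sketch below shows this on the same sentence. As an assumption made only so the snippet runs without downloading a model, it uses a blank pipeline with a rule-based `entity_ruler` in place of `en_core_web_lg`; the names `nlp_demo` and `demo_doc` and the two patterns are illustrative, not part of the original notebook.

```python
import spacy

# Assumption: a blank pipeline plus a rule-based entity_ruler stands in for
# en_core_web_lg, so no model download is needed for this demonstration.
nlp_demo = spacy.blank("en")
ruler = nlp_demo.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Donald Trum"},
    {"label": "GPE", "pattern": "US"},
])

demo_doc = nlp_demo("Donald Trum is the President of US")

# Span attributes: surface text, entity label, and character offsets
rows = [(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in demo_doc.ents]
for row in rows:
    print(row)
```

The same attributes work on the entities of `doc` and `doc1` produced by the pretrained model.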

### Visualization
Generic entities identified by the pretrained model

In [None]:
from spacy import displacy
displacy.render(doc,style="ent",jupyter=True)

In [None]:
displacy.render(doc1,style="ent",jupyter=True)

Saving the results to an output file

In [None]:
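def save_entities_visualization(doc, output_file):
    # jupyter=False makes render() return the HTML string instead of
    # displaying it inline (inline display returns None in a notebook)
    html = spacy.displacy.render(doc, style="ent", page=True, jupyter=False)
    with open(output_file, "w", encoding="utf-8") as file:
        file.write(html)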

In [None]:
save_entities_visualization(doc1,"/content/Output.html")

In [None]:
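def recognize_entities(text):
    # Note: the model is loaded on every call, which is slow;
    # reuse an already loaded pipeline when calling this repeatedly
    nlp = spacy.load("en_core_web_lg")
    doc = nlp(text)
    # Collect (entity text, entity label) pairs
    named_entities = [(entity.text, entity.label_) for entity in doc.ents]
    return named_entities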

In [None]:
def save_entities_to_file(named_entities, output_file):
    # Explicit UTF-8 so non-ASCII entity text (e.g. "Bhārata") is written safely
    with open(output_file, "w", encoding="utf-8") as file:
        for entity, label in named_entities:
            file.write(f"{entity}: {label}\n")

In [None]:
named_entities = recognize_entities(text)
save_entities_to_file(named_entities, "/content/output_file.txt")
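
The list of `(text, label)` pairs returned by `recognize_entities` is also easy to summarize. A minimal sketch, assuming a small hand-written sample list in place of the real output (`count_labels` and `sample_entities` are hypothetical names):

```python
from collections import Counter

def count_labels(named_entities):
    """Count how many entities were found for each label."""
    return Counter(label for _, label in named_entities)

# Hypothetical sample in the same (text, label) shape as recognize_entities() returns
sample_entities = [
    ("India", "GPE"),
    ("Droupadi Murmu", "PERSON"),
    ("25 July 2022", "DATE"),
    ("India", "GPE"),
]

label_counts = count_labels(sample_entities)
for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```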

### Custom Entity Recognition

In [None]:
import json

In [None]:
with open("/content/Corona2.json",'r') as f:
  data=json.load(f)

In [None]:
data["examples"][0]
# Each training example is a dictionary

{'id': '18c2f619-f102-452f-ab81-d26f7e283ffe',
 'content': "While bismuth compounds (Pepto-Bismol) decreased the number of bowel movements in those with travelers' diarrhea, they do not decrease the length of illness.[91] Anti-motility agents like loperamide are also effective at reducing the number of stools but not the duration of disease.[8] These agents should be used only if bloody diarrhea is not present.[92]\n\nDiosmectite, a natural aluminomagnesium silicate clay, is effective in alleviating symptoms of acute diarrhea in children,[93] and also has some effects in chronic functional diarrhea, radiation-induced diarrhea, and chemotherapy-induced diarrhea.[45] Another absorbent agent used for the treatment of mild diarrhea is kaopectate.\n\nRacecadotril an antisecretory medication may be used to treat diarrhea in children and adults.[86] It has better tolerability than loperamide, as it causes less constipation and flatulence.[94]",
 'metadata': {},
 'annotations': [{'id': '0825a1

In [None]:
data["examples"][0].keys()

dict_keys(['id', 'content', 'metadata', 'annotations', 'classifications'])

In [None]:
data["examples"][0]["content"]

"While bismuth compounds (Pepto-Bismol) decreased the number of bowel movements in those with travelers' diarrhea, they do not decrease the length of illness.[91] Anti-motility agents like loperamide are also effective at reducing the number of stools but not the duration of disease.[8] These agents should be used only if bloody diarrhea is not present.[92]\n\nDiosmectite, a natural aluminomagnesium silicate clay, is effective in alleviating symptoms of acute diarrhea in children,[93] and also has some effects in chronic functional diarrhea, radiation-induced diarrhea, and chemotherapy-induced diarrhea.[45] Another absorbent agent used for the treatment of mild diarrhea is kaopectate.\n\nRacecadotril an antisecretory medication may be used to treat diarrhea in children and adults.[86] It has better tolerability than loperamide, as it causes less constipation and flatulence.[94]"

In [None]:
data["examples"][0]["annotations"][0]

{'id': '0825a1bf-6a6e-4fa2-be77-8d104701eaed',
 'tag_id': 'c06bd022-6ded-44a5-8d90-f17685bb85a1',
 'end': 371,
 'start': 360,
 'example_id': '18c2f619-f102-452f-ab81-d26f7e283ffe',
 'tag_name': 'Medicine',
 'value': 'Diosmectite',
 'correct': None,
 'human_annotations': [{'timestamp': '2020-03-21T00:24:32.098000Z',
   'annotator_id': 1,
   'tagged_token_id': '0825a1bf-6a6e-4fa2-be77-8d104701eaed',
   'name': 'Ashpat123',
   'reason': 'exploration'}],
 'model_annotations': []}
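
Each annotation stores character offsets (`start`, `end`) alongside the annotated `value`, so the two can be cross-checked before training. A minimal sketch on a short made-up example (`check_annotation`, `sample_content`, and both sample annotations are hypothetical; only the field names come from the data shown above):

```python
def check_annotation(content, annotation):
    """Return True if the offsets slice out exactly the annotated value."""
    return content[annotation["start"]:annotation["end"]] == annotation["value"]

# Hypothetical miniature example using the same field names as the dataset
sample_content = "Diosmectite is used to treat diarrhea."
good_annotation = {"start": 0, "end": 11, "value": "Diosmectite", "tag_name": "Medicine"}
bad_annotation = {"start": 0, "end": 10, "value": "Diosmectite", "tag_name": "Medicine"}

print(check_annotation(sample_content, good_annotation))
print(check_annotation(sample_content, bad_annotation))
```

Running such a check over `data["examples"]` is one way to spot annotations whose offsets have drifted from the text.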

In [None]:
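# Convert each annotated example into {'text': ..., 'entities': [(start, end, LABEL), ...]}
training_data=[]
for example in data["examples"]:
  temp_dict={}
  temp_dict['text']=example["content"]
  temp_dict['entities']=[]
  for annotation in example['annotations']:
    # Character offsets of the annotated span within the text
    start=annotation['start']
    end=annotation['end']
    # Labels are upper-cased, e.g. 'Medicine' becomes 'MEDICINE'
    label=annotation['tag_name'].upper()
    temp_dict['entities'].append((start,end,label))
  training_data.append(temp_dict)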

print(training_data[0])

{'text': "While bismuth compounds (Pepto-Bismol) decreased the number of bowel movements in those with travelers' diarrhea, they do not decrease the length of illness.[91] Anti-motility agents like loperamide are also effective at reducing the number of stools but not the duration of disease.[8] These agents should be used only if bloody diarrhea is not present.[92]\n\nDiosmectite, a natural aluminomagnesium silicate clay, is effective in alleviating symptoms of acute diarrhea in children,[93] and also has some effects in chronic functional diarrhea, radiation-induced diarrhea, and chemotherapy-induced diarrhea.[45] Another absorbent agent used for the treatment of mild diarrhea is kaopectate.\n\nRacecadotril an antisecretory medication may be used to treat diarrhea in children and adults.[86] It has better tolerability than loperamide, as it causes less constipation and flatulence.[94]", 'entities': [(360, 371, 'MEDICINE'), (383, 408, 'MEDICINE'), (104, 112, 'MEDICALCONDITION'), (679,

In [None]:
training_data[0]["entities"]

[(360, 371, 'MEDICINE'),
 (383, 408, 'MEDICINE'),
 (104, 112, 'MEDICALCONDITION'),
 (679, 689, 'MEDICINE'),
 (6, 23, 'MEDICINE'),
 (25, 37, 'MEDICINE'),
 (461, 470, 'MEDICALCONDITION'),
 (577, 589, 'MEDICINE'),
 (853, 865, 'MEDICALCONDITION'),
 (188, 198, 'MEDICINE'),
 (754, 762, 'MEDICALCONDITION'),
 (870, 880, 'MEDICALCONDITION'),
 (823, 833, 'MEDICINE'),
 (852, 853, 'MEDICALCONDITION'),
 (461, 469, 'MEDICALCONDITION'),
 (535, 543, 'MEDICALCONDITION'),
 (692, 704, 'MEDICINE'),
 (563, 571, 'MEDICALCONDITION')]
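
Some of these spans overlap, for example `(461, 470)` and `(461, 469)`. spaCy does not allow overlapping spans in `doc.ents`, which is why `filter_spans` (it keeps the longest span among overlapping ones) is applied before the entities are assigned. A minimal sketch for listing the overlaps, using four tuples copied from the output above (`find_overlaps` is a hypothetical helper):

```python
def find_overlaps(entities):
    """Return pairs of (start, end, label) spans that share at least one character."""
    ordered = sorted(entities)
    overlaps = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            # Spans are sorted by start, so once one starts at or after
            # the end of `first`, no later span can overlap it either
            if second[0] >= first[1]:
                break
            overlaps.append((first, second))
    return overlaps

# Four of the entity tuples shown in the output above
sample = [
    (461, 470, "MEDICALCONDITION"),
    (461, 469, "MEDICALCONDITION"),
    (852, 853, "MEDICALCONDITION"),
    (853, 865, "MEDICALCONDITION"),
]

overlapping_pairs = find_overlaps(sample)
for pair in overlapping_pairs:
    print(pair)
```

Note that `(852, 853)` and `(853, 865)` are adjacent rather than overlapping, because the end offset is exclusive.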

In [None]:
from spacy.tokens import DocBin
from tqdm import tqdm

nlp = spacy.blank("en") # create a blank English pipeline (tokenizer only)
doc_bin = DocBin()

In [None]:
from spacy.util import filter_spans

for training_example  in tqdm(training_data):
    text = training_example['text']
    labels = training_example['entities']
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in labels:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    filtered_ents = filter_spans(ents)
    doc.ents = filtered_ents
    doc_bin.add(doc)

doc_bin.to_disk("train.spacy")

100%|██████████| 31/31 [00:00<00:00, 165.37it/s]


Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
Skipping entity
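
The "Skipping entity" messages appear when `doc.char_span` returns `None`, which happens when the annotated offsets cannot be aligned to token boundaries under the chosen `alignment_mode`. A minimal sketch of the three modes on a made-up sentence (the text and offsets are illustrative assumptions, not taken from the dataset):

```python
import spacy

nlp_blank = spacy.blank("en")
demo = nlp_blank.make_doc("Take loperamide daily.")

# Characters 5..15 cover the token "loperamide" exactly;
# 5..14 stops one character short of the token boundary.
exact = demo.char_span(5, 15, label="MEDICINE")
strict = demo.char_span(5, 14, label="MEDICINE")  # default mode is "strict"
contract = demo.char_span(5, 14, label="MEDICINE", alignment_mode="contract")
expand = demo.char_span(5, 14, label="MEDICINE", alignment_mode="expand")

# 5..19 covers "loperamide" fully and "daily" only partly
partial_contract = demo.char_span(5, 19, label="MEDICINE", alignment_mode="contract")
partial_expand = demo.char_span(5, 19, label="MEDICINE", alignment_mode="expand")

print(exact, strict, contract, expand)
print(partial_contract, "|", partial_expand)
```

With `"contract"`, a span that does not fully contain any token collapses to nothing and is skipped, while `"expand"` would keep it by growing to the enclosing tokens. Which mode is appropriate depends on how much the annotations are trusted.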


In [None]:
# !python -m spacy init fill-config base_config.cfg config.cfg
!python -m spacy init fill-config /content/base_config.cfg config.cfg

✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [None]:
!python -m spacy train config.cfg --output ./ --paths.train ./train.spacy --paths.dev ./train.spacy

ℹ Using CPU
[2024-04-13 04:21:08,609] [INFO] Set up nlp object from config
[2024-04-13 04:21:08,625] [INFO] Pipeline: ['tok2vec', 'ner']
[2024-04-13 04:21:08,629] [INFO] Created vocabulary
[2024-04-13 04:21:13,687] [INFO] Added vectors: en_core_web_lg
[2024-04-13 04:21:13,687] [INFO] Finished initializing nlp obj
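
Note that the training command above passes `train.spacy` as both the training and the development set, so the scores reported during training measure fit to the training data rather than generalization. A minimal sketch of holding out part of the examples first, assuming an 80/20 split and a fixed seed (`split_examples`, `dev_fraction`, and the placeholder examples are hypothetical choices, not from the original notebook):

```python
import random

def split_examples(examples, dev_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split it into (train, dev) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]

# Placeholder examples with the same shape as training_data (31 items, as in the notebook)
examples = [{"text": f"example {i}", "entities": []} for i in range(31)]

train_examples, dev_examples = split_examples(examples)
print(len(train_examples), len(dev_examples))
```

Each list can then be converted with the same `DocBin` loop used earlier and saved as `train.spacy` and `dev.spacy`, so that `--paths.dev ./dev.spacy` points at data the model has not seen.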

In [None]:
nlp_ner = spacy.load("model-best")

In [None]:
doc = nlp_ner("While bismuth compounds (Pepto-Bismol) decreased the number of bowel movements in those with travelers' diarrhea, they do not decrease the length of illness.[91] Anti-motility agents like loperamide are also effective at reducing the number of stools but not the duration of disease.[8] These agents should be used only if bloody diarrhea is not present.")

colors = {"PATHOGEN": "#F67DE3", "MEDICINE": "#7DF6D9", "MEDICALCONDITION":"#a6e22d"}
options = {"colors": colors}

spacy.displacy.render(doc, style="ent", options= options, jupyter=True)