<a href="https://colab.research.google.com/github/Jay-Nehra/SpaCy_NER/blob/main/02_rules_based_ner.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Rules Based NER

There are two primary approaches to NLP and NER: rules-based and machine learning-based. This notebook will focus on the rules-based approach, while the subsequent one will explore machine learning-based techniques.

In rules-based NER, the user either adopts or devises an NLP system governed by a specific set of instructions, or "rules," to carry out particular NLP tasks.
> For NER, this process often involves the use of a gazetteer. A gazetteer is essentially a collection or dictionary of entities categorized under a particular label. For instance, in identifying people, this could encompass a compilation of first and last names. If one were to design an NER system for a certain geographical area, as will be discussed in a later document, this might include a comprehensive list of all places within that area.
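
To make the gazetteer idea concrete, here is a minimal sketch in plain Python. The `GAZETTEER` contents and the `gazetteer_ner` helper are hypothetical names made up for illustration; a real gazetteer would be far larger and would also need to handle multi-word names.

```python
import re

# Hypothetical gazetteer: label -> set of lowercase surface forms.
GAZETTEER = {
    "SPORT": {"basketball", "soccer"},
    "GPE": {"spain", "poland", "warsaw"},
}

def gazetteer_ner(text, gazetteer):
    """Return (surface form, start, end, label) for every word found in the gazetteer."""
    entities = []
    for match in re.finditer(r"\w+", text):
        word = match.group().lower()
        for label, names in gazetteer.items():
            if word in names:
                entities.append((match.group(), match.start(), match.end(), label))
    return entities

text = "Mary moved to Spain where she will be playing basketball and soccer."
found = gazetteer_ner(text, GAZETTEER)
print(found)
```

The lookup is a dictionary membership test per word, which is why gazetteer-based systems are quick to build but only ever find what is already on the list.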

### *When should you use a rules-based NER approach?*

A rules-based approach is an efficient choice when a specific entity appears in a limited set of patterns, so that the rules capture roughly 95-97% of its occurrences. The 95-97% target is not an industry norm; it is my personal benchmark for the performance of NER models. If a rules-based method reaches that level, matching what I would expect from a machine learning model, I am inclined to use it.
> This preference is primarily due to the quicker implementation time of rules-based approaches compared to the duration required to train, validate, and test a machine learning model.
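
One rough way to check whether a rule set reaches such a benchmark is to measure how many hand-labelled entities it recovers. The sketch below assumes a small, hypothetical labelled sample and a single regex rule; it measures coverage (recall) only and says nothing about false positives.

```python
import re

# Hypothetical hand-labelled sample: (text, list of gold entity strings).
labelled_sample = [
    ("Ship it to 12345 by Friday.", ["12345"]),
    ("Her zip code is 90210.", ["90210"]),
    ("Send it to zip 1234S.", ["1234S"]),  # OCR-style error: S instead of 5
    ("The code 60614 is in Chicago.", ["60614"]),
]

# Illustrative rule: any standalone run of exactly five digits.
ZIP_RULE = re.compile(r"\b\d{5}\b")

def rule_coverage(sample, rule):
    """Fraction of gold entities that the rule also finds (recall)."""
    total = found = 0
    for text, gold in sample:
        predicted = set(rule.findall(text))
        total += len(gold)
        found += sum(1 for entity in gold if entity in predicted)
    return found / total if total else 0.0

coverage = rule_coverage(labelled_sample, ZIP_RULE)
print(f"coverage: {coverage:.0%}")  # 3 of 4 gold entities are found: 75%
```

On this toy sample the rule falls well short of the 95-97% target, and the single miss is exactly the kind of noisy input discussed in the next section.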


### Rules-Based NER's Limitations

It's crucial to bear in mind, however, that rules-based methods are exactly what the term implies: reliant on rules. If an entity doesn't conform to the established rules, it won't be identified as such. This limitation becomes particularly apparent in texts that have undergone OCR processes, have been typed without spellcheck, remain unedited, or are in any other form of unprocessed state.

Cleaning texts is a fundamental step in preparing data for NLP applications, but it's not always feasible to thoroughly cleanse a text. Additionally, users of a specific NER framework might not be aware of the necessity to pre-clean texts.
> This represents a significant drawback of rules-based approaches and is a primary reason why researchers today lean towards machine learning methods. Machine learning models have the capacity to learn and, as a result, can generalize to unseen data, accommodating variances to a certain degree from previously encountered scenarios. This aspect will be discussed more comprehensively in the upcoming notebook.
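
The brittleness described above is easy to reproduce. This small sketch applies one exact-match rule to a clean sentence and to two noisy variants; the sample sentences are made up for illustration.

```python
import re

# One exact-match rule applied to a clean sentence and two noisy variants.
SPORT_RULE = re.compile(r"\bbasketball\b", re.IGNORECASE)

samples = {
    "clean": "She will be playing basketball until June.",
    "ocr_error": "She will be playing basketba11 until June.",  # 'l' read as '1'
    "typo": "She will be playing baskteball until June.",
}

# True when the rule fires on the sentence, False when the entity is missed.
results = {name: bool(SPORT_RULE.search(text)) for name, text in samples.items()}
print(results)
```

Only the clean sentence matches; both noisy variants are silently skipped, with no error to signal that an entity was missed.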

### SpaCy's EntityRuler

There are a few ways to engage in rules-based NER with spaCy, but one of the more fundamental is its EntityRuler.


In [None]:
text = """Mary, a senior,
moved to Spain where she will be playing basketball and soccer until 05 June 2022 or until she can't play any longer."""

#Import spacy
import spacy

#Create a blank spaCy model that will parse English ("en")
nlp = spacy.blank("en")


#Create a set of patterns
patterns = [{"label": "SPORT", "pattern": "basketball"}]

#Add a sentencizer so that doc.sents is available (used in the next cell)
nlp.add_pipe("sentencizer")
#Create a ruler that we will add to the model
entity_ruler = nlp.add_pipe("entity_ruler")

#Initialize the entity ruler with the patterns
entity_ruler.add_patterns(patterns)

#Create the doc object
doc = nlp(text)

#Iterate over all entities (there will be only one)
for ent in doc.ents:
    print(ent.text, ent.label_)


basketball SPORT
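
The cell above uses a single phrase pattern. As a hedged sketch of what else the EntityRuler accepts, the example below mixes a phrase pattern with token patterns on a blank English pipeline; the sentence and the extra sports are invented for illustration.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Phrase pattern: exact, case-sensitive string match by default.
    {"label": "SPORT", "pattern": "soccer"},
    # Token pattern: compares the lowercased token text, so "Basketball" also matches.
    {"label": "SPORT", "pattern": [{"LOWER": "basketball"}]},
    # Multi-token pattern: one dict per token.
    {"label": "SPORT", "pattern": [{"LOWER": "table"}, {"LOWER": "tennis"}]},
])

doc = nlp("Basketball, soccer and Table Tennis are popular; Soccer is not matched.")
sports = [(ent.text, ent.label_) for ent in doc.ents]
print(sports)
```

The capitalised "Soccer" is not picked up because the phrase pattern compares the exact text, whereas the `LOWER` token patterns ignore case.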


In [None]:

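# Token pattern: a single token made up of exactly five digits
new_label = {"label": "POTENTIAL_ZIP_CODE", "pattern": [{"IS_DIGIT": True, "LENGTH": 5}]}
patterns.append(new_label)
# Register only the new pattern; the ruler already holds the SPORT pattern
entity_ruler.add_patterns([new_label])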

text = "my zip code is 12345 and, the package needs to arrive there. I am expecting a basketball"

doc = nlp(text)

# Function to check for the context word in the sentence of a detected entity
def context_word_in_sentence(doc, ent, context_word):
    # Retrieve the sentence containing the entity
    sentence = next(sent for sent in doc.sents if ent.start_char >= sent.start_char and ent.end_char <= sent.end_char)
    # Check if the context word is in the sentence (case-insensitive)
    return context_word.lower() in [token.lower_ for token in sentence]

# Print a zip-code candidate only when "zip" occurs in its sentence;
# print every other entity unconditionally
for ent in doc.ents:
    if ent.label_ == "POTENTIAL_ZIP_CODE":
        if context_word_in_sentence(doc, ent, "zip"):
            print(ent.text, ent.label_)
    else:
        print(ent.text, ent.label_)

12345 POTENTIAL_ZIP_CODE
basketball SPORT
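
A more compact variant of the same context check, sketched under the assumption that sentence boundaries are set (here by the `sentencizer`): each entity span exposes its sentence as `ent.sent`, so the manual search over `doc.sents` is not needed. The `has_context` helper and the example text are hypothetical.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "POTENTIAL_ZIP_CODE", "pattern": [{"IS_DIGIT": True, "LENGTH": 5}]}
])

def has_context(ent, context_word):
    """True when context_word appears (case-insensitively) in the entity's sentence."""
    return context_word.lower() in {token.lower_ for token in ent.sent}

doc = nlp("My zip code is 12345. The invoice number is 67890.")
# Both five-digit numbers are candidates; only the one near "zip" is kept.
confirmed = [ent.text for ent in doc.ents if has_context(ent, "zip")]
print(confirmed)
```

Filtering by context after matching keeps the pattern itself simple while discarding five-digit numbers that are probably not zip codes.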


In [None]:
import spacy
import pprint


nlp = spacy.load("en_core_web_sm")

text = "John Doe lives in Warsaw, Poland. His email is john.doe@example.com, and his birthday is on 5th July 1988. Wikipedia notes that Treblinka is not large."

corpus = []

doc = nlp(text)
for sent in doc.sents:
    corpus.append(sent.text)

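# The ruler runs after "ner" and, by default, does not overwrite entities the
# model has already set, so a token the model labels first keeps that label
# (see the email in the output below). Pass config={"overwrite_ents": True} or
# add the ruler before "ner" to let the rule take precedence.
ruler = nlp.add_pipe("entity_ruler", after="ner")

# Raw string so that the backslash in the regex is not treated as a string escape
patterns = [
    {"label": "EMAIL", "pattern": [{"TEXT": {"REGEX": r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+$"}}]}
]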

ruler.add_patterns(patterns)

TRAIN_DATA = []

for sentence in corpus:
    doc = nlp(sentence)
    entities = []
    for ent in doc.ents:
        entities.append([ent.text, ent.start_char, ent.end_char, ent.label_])
    TRAIN_DATA.append([sentence, {"entities": entities}])

pprint.pprint(TRAIN_DATA)

[['John Doe lives in Warsaw, Poland.',
  {'entities': [['John Doe', 0, 8, 'PERSON'],
                ['Warsaw', 18, 24, 'GPE'],
                ['Poland', 26, 32, 'GPE']]}],
 ['His email is john.doe@example.com, and his birthday is on 5th July 1988.',
  {'entities': [['john.doe@example.com', 13, 33, 'PERSON'],
                ['5th July 1988', 58, 71, 'DATE']]}],
 ['Wikipedia notes that Treblinka is not large.',
  {'entities': [['Wikipedia', 0, 9, 'ORG']]}]]
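
In the output above the email address carries the model's `PERSON` label rather than `EMAIL`. The sketch below reproduces that effect without a trained model: a first EntityRuler stands in for the statistical `ner` component (an assumption made purely so the example is deterministic), and a second ruler adds the `EMAIL` pattern with and without `overwrite_ents`.

```python
import spacy

EMAIL_PATTERN = {
    "label": "EMAIL",
    "pattern": [{"TEXT": {"REGEX": r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+$"}}],
}
# Stand-in for a statistical model's wrong guess (hypothetical, for illustration only).
WRONG_GUESS = {"label": "PERSON", "pattern": "john.doe@example.com"}

def email_label(overwrite):
    """Run both rulers and return the (text, label) pairs of the resulting entities."""
    nlp = spacy.blank("en")
    first = nlp.add_pipe("entity_ruler", name="stand_in_for_ner")
    first.add_patterns([WRONG_GUESS])
    second = nlp.add_pipe(
        "entity_ruler", name="email_ruler", config={"overwrite_ents": overwrite}
    )
    second.add_patterns([EMAIL_PATTERN])
    doc = nlp("His email is john.doe@example.com, and he lives in Warsaw.")
    return [(ent.text, ent.label_) for ent in doc.ents]

kept = email_label(overwrite=False)
replaced = email_label(overwrite=True)
print(kept)      # the earlier PERSON label survives
print(replaced)  # the later ruler replaces it with EMAIL
```

With a real pipeline the same choice applies: either place the ruler before `ner` so the rule sets the entity first, or keep it after `ner` and enable `overwrite_ents`.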


In [None]:
!pip install gliner-spacy

Defaulting to user installation because normal site-packages is not writeable
Collecting gliner-spacy
  Downloading gliner-spacy-0.0.2.tar.gz (3.3 kB)
  Preparing metadata (setup.py) ... done
Collecting gliner (from gliner-spacy)
  Downloading gliner-0.1.6-py3-none-any.whl.metadata (8.3 kB)
Collecting transformers>=4.38.2 (from gliner->gliner-spacy)
  Downloading transformers-4.39.2-py3-none-any.whl.metadata (134 kB)
Collecting huggingface-hub>=0.21.4 (from gliner->gliner-spacy)
  Using cached huggingface_hub-0.22.2-py3-none-any.whl.metadata (12 kB)
Collecting flair==0.13.1 (from gliner->gliner-spacy)
  Downloading flair-0.13.1-py3-none-any.whl.metadata (12 kB)
Collecting seqeval (from gliner->gliner-spacy)
  Downloading seqeval-1.2.2.tar.gz (43 kB)

In [None]:
import spacy
from gliner_spacy.pipeline import GlinerSpacy

nlp = spacy.blank("en")
nlp.add_pipe("gliner_spacy")

text = "John Doe lives in Warsaw, Poland. His email is john.doe@example.com, and his birthday is on 5th July 1988. Wikipedia notes that Treblinka is not large."


doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.label_)