### Custom Named Entity Recognition (NER) with CRF and spaCy

Continuing from the previous exercise, we now extend **Named Entity Recognition (NER)** beyond what pre-trained models offer. We will explore two primary methods for Custom NER:

- **Conditional Random Fields (CRF)**: A traditional, powerful sequence tagging model requiring custom feature engineering. We will cover the data preparation and feature extraction process critical for CRF training.

- `spaCy` **EntityRuler**: A simpler, more practical method for adding custom, rule-based entities directly into the spaCy pipeline for quick deployment.


#### What we will cover in this exercise

- **CRF Data Preparation**: Preparing training data and IOB format.

- **Feature Engineering for CRF**: Defining token features (word shape, context, etc.).

- **CRF Model Training (Conceptual)**: Overview of the training and evaluation process.

- **Custom NER with spaCy EntityRuler**: Implementing a rule-based pipeline.

- **Deployment and Application**: Applying the custom rules to test sentences.

#### What we expect to learn from this exercise

- Custom NER is needed when standard spaCy labels (PERSON, ORG) don't cover your domain-specific terms (e.g., PRODUCT, DRUGNAME).

- CRF models excel at sequence labelling by modeling transitions between IOB states, relying heavily on hand-crafted features.

- Feature Engineering is the most critical step for CRF performance.

- spaCy's EntityRuler is the quickest way to deploy custom NER rules within an existing pipeline.

**Let's get started**

#### Setup and Pre-requisites:

For the CRF component, you would typically use `python-crfsuite` (imported as `pycrfsuite`) or its scikit-learn-style wrapper, `sklearn-crfsuite`. We will also use `spaCy`.

In [None]:
# Required library for CRF; it pulls in python-crfsuite as a dependency.
# Note: the PyPI package is named python-crfsuite, so `pip install pycrfsuite` fails.
! pip install sklearn-crfsuite

# Required library for spaCy pipeline
! pip install spacy
! python -m spacy download en_core_web_sm

Collecting sklearn-crfsuite
  Downloading sklearn_crfsuite-0.5.0-py2.py3-none-any.whl.metadata (4.9 kB)
Collecting python-crfsuite>=0.9.7 (from sklearn-crfsuite)
  Downloading python_crfsuite-0.9.11-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.3 kB)
Downloading sklearn_crfsuite-0.5.0-py2.py3-none-any.whl (10 kB)
Downloading python_crfsuite-0.9.11-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 18.7 MB/s eta 0:00:00
Installing collected packages: python-crfsuite, sklearn-crfsuite
Successfully installed python-crfsuite-0.9.11 sklearn-crfsuite-0.5.0
Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_

**Preparing Training Data (IOB Format)**: The training data for sequence models like CRF must be in a token-label format (or IOB format), usually as a list of lists of (word, label) pairs, where each inner list represents a sentence.

We will use a simple dataset to understand these concepts:
```
1. I bought the Giga Phone for 999 USD today.
2. She uses the Nova Pad daily.
```

Next, we need to express the training data in IOB (Inside-Outside-Beginning) format, as explored in the last challenge.

In [1]:
# Sample IOB-formatted training data (simplified for demonstration)
# Entities: PRODUCT (e.g., 'Giga Phone') and PRICE (e.g., '999 USD')
TRAIN_DATA_IOB = [
    [("I", "O"), ("bought", "O"), ("the", "O"), ("Giga", "B-PRODUCT"), ("Phone", "I-PRODUCT"), ("for", "O"), ("999", "B-PRICE"), ("USD", "I-PRICE"), ("today", "O"), (".", "O")],
    [("She", "O"), ("uses", "O"), ("the", "O"), ("Nova", "B-PRODUCT"), ("Pad", "I-PRODUCT"), ("daily", "O"), (".", "O")]
]
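
Hand-writing `(word, label)` pairs gets tedious beyond a few sentences. The sketch below shows one possible way to generate the same structure from token-index spans. The helper name `spans_to_iob` and the `(start, end, label)` span format (with `end` exclusive) are assumptions made for illustration; they are not part of any library used in this exercise.

```python
def spans_to_iob(tokens, spans):
    """Convert token-index entity spans into a list of (word, IOB-label) pairs.

    tokens: list of str
    spans:  list of (start, end, label); `end` is exclusive (hypothetical format)
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if not (0 <= start < end <= len(tokens)):
            raise ValueError(f"span out of range: {(start, end, label)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end, label)}")
        tags[start] = f"B-{label}"            # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"            # remaining tokens of the entity
    return list(zip(tokens, tags))


tokens = ["She", "uses", "the", "Nova", "Pad", "daily", "."]
sentence = spans_to_iob(tokens, [(3, 5, "PRODUCT")])
print(sentence)
```

The result has the same shape as the second entry of `TRAIN_DATA_IOB`, so sentences built this way can be appended to the training list directly.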

#### Conditional Random Fields (CRF) Implementation

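A **Conditional Random Field (CRF)** is a discriminative probabilistic model used for labeling or parsing sequential data, such as finding named entities in text. Unlike Hidden Markov Models (**HMM**s), which are generative and model the joint probability of words and labels, **CRF**s model the conditional probability of the label sequence given the entire observation sequence (the words). Because a linear-chain CRF normalizes over whole label sequences rather than at each step, it also avoids the "label bias problem" that affects Maximum Entropy Markov Models (MEMMs).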
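
For reference, the conditional probability described above is usually written as follows for a linear-chain CRF. This is the standard textbook notation, not something taken from this exercise's code: $\mathbf{x}$ is the word sequence, $\mathbf{y}$ a label sequence of length $T$, $f_k$ are feature functions, and $\lambda_k$ are their learned weights.

$$
p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \right),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \right)
$$

The normalizer $Z(\mathbf{x})$ sums over every possible label sequence $\mathbf{y}'$, which is what makes the model globally normalized.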

##### Feature Engineering for CRF

For a CRF to work effectively, we must manually extract features for every token. These features guide the model in deciding whether a token starts an entity (B-), continues one (I-), or lies outside any entity (O).

In [4]:
import re

# Function to extract features for a token at a given index (i) in a sentence (sent)
def extract_features(sent, i):
    word = sent[i][0]

    # 1. Base Features
    features = {
        'word': word,
        'is_start': i == 0,
        'is_end': i == len(sent) - 1,
        'is_capitalized': word.istitle(),
        'is_all_caps': word.isupper(),
        'is_digit': word.isdigit(),
        # Substitute uppercase first, then lowercase, then digits, so that the
        # inserted 'x' and 'd' placeholders are not rewritten by a later step
        # (replacing digits first would turn '999' into 'xxx' instead of 'ddd').
        'word_shape': re.sub(r'[0-9]', 'd', re.sub(r'[a-z]', 'x', re.sub(r'[A-Z]', 'X', word))),
        'prefix_3': word[:3],
        'suffix_3': word[-3:],
    }

    # 2. Context Features (Word before the target)
    if i > 0:
        prev_word = sent[i-1][0]
        features.update({
            'prev_word': prev_word,
            'prev_is_all_caps': prev_word.isupper(),
        })
    else:
        features['BOS'] = True # Beginning of Sentence

    # 3. Context Features (Word after the target)
    if i < len(sent) - 1:
        next_word = sent[i+1][0]
        features.update({
            'next_word': next_word,
            'next_is_capitalized': next_word.istitle(),
        })
    else:
        features['EOS'] = True # End of Sentence

    return features

# Example Feature Extraction
print("--- Example Feature Extraction for 'Giga' (Index 3) ---")
# 'Giga' is the 4th word in the first sentence
sample_sent = TRAIN_DATA_IOB[0]
giga_features = extract_features(sample_sent, 3)
for k, v in giga_features.items():
    print(f"{k:15}: {v}")

--- Example Feature Extraction for 'Giga' (Index 3) ---
word           : Giga
is_start       : False
is_end         : False
is_capitalized : True
is_all_caps    : False
is_digit       : False
word_shape     : Xxxx
prefix_3       : Gig
suffix_3       : iga
prev_word      : the
prev_is_all_caps: False
next_word      : Phone
next_is_capitalized: True


#### CRF Training and Evaluation

To train the CRF model, you would convert your IOB data into feature vectors and label sequences.

In [7]:
# --- TRAINING CODE ---
from sklearn_crfsuite import CRF
from sklearn.model_selection import train_test_split
from sklearn_crfsuite.metrics import flatten
from sklearn.metrics import classification_report


def extract_features_for_sentence(sent):
    """
    Takes a sentence (list of (word, label) tuples) and extracts features for every token.
    """
    return [extract_features(sent, i) for i in range(len(sent))]

# Example output for the first three tokens of a sentence (conceptual):
# [
#     {'word': 'I', 'is_start': True, 'prefix_3': 'I', 'BOS': True, ...},
#     {'word': 'bought', 'prev_word': 'I', 'prefix_3': 'bou', ...},
#     {'word': 'the', 'prev_word': 'bought', 'prefix_3': 'the', ...},
#     # ... and so on
# ]


def get_labels_for_sentence(sent):
    """
    Takes a sentence (list of (word, label) tuples) and extracts the labels (y values).
    """
    print(f"sentence: {sent}")
    return [label for (word, label) in sent]

# Example output for the first sentence:
# ['O', 'O', 'O', 'B-PRODUCT', 'I-PRODUCT', 'O', 'B-PRICE', 'I-PRICE', 'O', 'O']


# 1. Prepare Data for CRF
X = [extract_features_for_sentence(s) for s in TRAIN_DATA_IOB]
y = [get_labels_for_sentence(s) for s in TRAIN_DATA_IOB]

# 2. Split Data (with only two sentences this leaves one for training and one for testing,
#    so it demonstrates the structure only; a real task needs many more sentences)
if len(X) > 1:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
else:
    # Use all data as train/test if corpus is too small (not recommended for real tasks)
    X_train, X_test = X, X
    y_train, y_test = y, y
    print("\nWarning: Using all data for both training and testing due to small corpus size.")

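# 3. Train the CRF. c1 and c2 are the L1 and L2 regularization coefficients
#    used by the default L-BFGS training algorithm.
crf = CRF(c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)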

# --- EVALUATION ---
from sklearn_crfsuite import metrics

y_pred = crf.predict(X_test)
print(metrics.flat_f1_score(y_test, y_pred, average='weighted'))
#print(metrics.flat_classification_report(y_test, y_pred))

# Optional: pick labels explicitly (often exclude 'O' for NER reports)
# labels = [l for l in crf.classes_ if l != 'O']

print(classification_report(
    flatten(y_test),
    flatten(y_pred),
    # labels=labels,             # uncomment if you want a specific label order
    zero_division=0              # avoids divide-by-zero warnings for rare labels
))

sentence: [('I', 'O'), ('bought', 'O'), ('the', 'O'), ('Giga', 'B-PRODUCT'), ('Phone', 'I-PRODUCT'), ('for', 'O'), ('999', 'B-PRICE'), ('USD', 'I-PRICE'), ('today', 'O'), ('.', 'O')]
sentence: [('She', 'O'), ('uses', 'O'), ('the', 'O'), ('Nova', 'B-PRODUCT'), ('Pad', 'I-PRODUCT'), ('daily', 'O'), ('.', 'O')]
0.5952380952380952
              precision    recall  f1-score   support

   B-PRODUCT       0.00      0.00      0.00         1
   I-PRODUCT       0.00      0.00      0.00         1
           O       0.71      1.00      0.83         5

    accuracy                           0.71         7
   macro avg       0.24      0.33      0.28         7
weighted avg       0.51      0.71      0.60         7
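
The report above is computed per token, so the dominant `O` label inflates the weighted scores. NER systems are often also scored per entity, where a prediction counts only if both the span boundaries and the label match exactly. The following is a minimal pure-Python sketch of that idea; the helper names `iob_to_spans` and `entity_prf` are hypothetical, and stray `I-` tags with no preceding `B-` are simply ignored here.

```python
def iob_to_spans(tags):
    """Return a set of (start, end_exclusive, label) spans from a list of IOB tags."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes a trailing entity
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))           # close the open entity
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]              # open a new entity
    return spans


def entity_prf(y_true, y_pred):
    """Exact-match entity-level precision, recall and F1 over a list of sentences."""
    tp = fp = fn = 0
    for true_tags, pred_tags in zip(y_true, y_pred):
        gold, pred = iob_to_spans(true_tags), iob_to_spans(pred_tags)
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold_demo = [["O", "O", "O", "B-PRODUCT", "I-PRODUCT", "O", "O"]]
all_o_demo = [["O"] * 7]                       # a model that never predicts an entity
print(entity_prf(gold_demo, all_o_demo))       # (0.0, 0.0, 0.0)
print(entity_prf(gold_demo, gold_demo))        # (1.0, 1.0, 1.0)
```

With real predictions you could call `entity_prf(y_test, y_pred)` on the lists produced by the training cell.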



##### Explanation:

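The CRF model learns a weight for each feature-label pair, plus transition weights between labels (e.g., a score for B-PRODUCT being followed by I-PRODUCT). These weights are unnormalized scores rather than probabilities. Evaluation uses Precision, Recall, and F1-score per label, which are then averaged. In the run above, the model was trained on a single sentence and predicted `O` for every token of the test sentence, which is why `O` has recall 1.00 while both PRODUCT labels score 0.00. A usable model needs far more annotated sentences.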
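
To make the role of the two kinds of weights concrete, here is a small Viterbi decoding sketch in plain Python. It is an illustration only: the function `viterbi` and all the scores below are made up for this example and are not read from the trained `crf` object. In a real CRF, each per-token score would be the sum of the learned weights of the features active at that token.

```python
def viterbi(state_scores, transition_scores, labels):
    """Find the highest-scoring label sequence.

    state_scores:      one dict per token, mapping label -> score (missing = 0.0)
    transition_scores: dict mapping (prev_label, label) -> score (missing = 0.0)
    """
    # best[t][label] = (best total score ending in `label` at token t, previous label)
    best = [{lab: (state_scores[0].get(lab, 0.0), None) for lab in labels}]
    for t in range(1, len(state_scores)):
        column = {}
        for lab in labels:
            score, prev = max(
                (best[t - 1][p][0] + transition_scores.get((p, lab), 0.0), p)
                for p in labels
            )
            column[lab] = (score + state_scores[t].get(lab, 0.0), prev)
        best.append(column)

    # Backtrack from the best final label
    last = max(labels, key=lambda lab: best[-1][lab][0])
    path = [last]
    for t in range(len(state_scores) - 1, 0, -1):
        last = best[t][last][1]
        path.append(last)
    return path[::-1]


labels = ["O", "B-PRODUCT", "I-PRODUCT"]
# Hypothetical per-token scores for the tokens "the", "Nova", "Pad"
state = [
    {"O": 2.0},
    {"B-PRODUCT": 1.0, "O": 0.8},
    {"I-PRODUCT": 0.5, "O": 0.9},
]
# Hypothetical transition scores
transitions = {
    ("B-PRODUCT", "I-PRODUCT"): 1.5,   # continuing an entity is encouraged
    ("O", "I-PRODUCT"): -3.0,          # I- directly after O is penalized
    ("B-PRODUCT", "O"): -0.5,
}

with_transitions = viterbi(state, transitions, labels)
without_transitions = viterbi(state, {}, labels)
print(with_transitions)      # ['O', 'B-PRODUCT', 'I-PRODUCT']
print(without_transitions)   # ['O', 'B-PRODUCT', 'O']
```

With these made-up numbers, the token scores alone prefer `O` for "Pad", and it is the transition score from `B-PRODUCT` to `I-PRODUCT` that tips the decision toward a complete two-token entity.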

#### Custom NER Deployment with spaCy EntityRuler

For practical, rule-based custom NER deployment, spaCy's **EntityRuler** is highly effective. It allows us to add custom patterns that run before or after the main statistical NER model.

**Defining Custom Entity Rules**

We define dictionary patterns using spaCy's token matching syntax.

In [10]:
# Load the base spaCy model
import spacy
from spacy import displacy

print(f"spaCy Version: {spacy.__version__}")

nlp = spacy.load("en_core_web_sm")

# Define a list of patterns for our custom entities (PRODUCT and PRICE)
patterns = [
    # Pattern 1: Product names (literal match)
    {"label": "PRODUCT", "pattern": "Giga Phone"},
    {"label": "PRODUCT", "pattern": [{"LOWER": "nova"}, {"LOWER": "pad"}]},

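    # Pattern 2: Price expressions (using token attributes)
    {"label": "PRICE", "pattern": [{"SHAPE": "ddd"}, {"LOWER": "usd"}]}, # e.g., '999 USD'
    # The original third token {"OP": "?"} was an unconstrained optional wildcard
    # that could absorb any following token, so it has been dropped.
    {"label": "PRICE", "pattern": [{"LIKE_NUM": True}, {"TEXT": "$"}]}, # e.g., '10$'
    {"label": "PRICE", "pattern": [{"TEXT": "$"}, {"LIKE_NUM": True}]}, # e.g., '$100'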
]

spaCy Version: 3.8.7
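
Before combining rules with the statistical model, it can help to check the patterns on their own. The sketch below assumes spaCy 3.x and uses a blank English pipeline (tokenizer only), so no model download is involved. The names `nlp_rules`, `ruler_demo`, and `doc_rules` are chosen here to avoid clobbering the notebook's `nlp` object, and the test sentence is made up for illustration.

```python
import spacy

# Blank English pipeline: tokenizer only, no statistical components
nlp_rules = spacy.blank("en")
ruler_demo = nlp_rules.add_pipe("entity_ruler")
ruler_demo.add_patterns([
    {"label": "PRODUCT", "pattern": "Giga Phone"},                            # exact phrase
    {"label": "PRODUCT", "pattern": [{"LOWER": "nova"}, {"LOWER": "pad"}]},   # case-insensitive
    {"label": "PRICE", "pattern": [{"SHAPE": "ddd"}, {"LOWER": "usd"}]},      # e.g., '999 USD'
])

doc_rules = nlp_rules("I prefer the Nova Pad over the old Giga Phone, which costs 999 USD.")
found = [(ent.text, ent.label_) for ent in doc_rules.ents]
print(found)
```

Because nothing else in this pipeline sets entities, every span in `found` comes from a rule, which makes it easy to see whether a pattern fires as intended.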


**Integrating the EntityRuler into the Pipeline**

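The EntityRuler is added to the spaCy pipeline before the statistical `ner` component. In this position its matches are set first, and the statistical model leaves them in place and only predicts entities in the remaining text. If the ruler is instead added after `ner`, its `overwrite_ents` setting decides whether rule matches replace overlapping model predictions.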

In [9]:
# 1. Create the EntityRuler
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns(patterns)

# The new pipeline now includes the 'entity_ruler' before the default 'ner' component
print("\n--- spaCy Pipeline with Custom EntityRuler ---")
print(nlp.pipe_names)


--- spaCy Pipeline with Custom EntityRuler ---
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'entity_ruler', 'ner']



Here’s what each spaCy pipeline component does and why the order matters:

**tok2vec**

- Computes contextual token vectors (embeddings) from raw text.
- Provides shared features to the downstream components that listen to it, such as the tagger and parser.
- Runs first so those components can use these representations.

**tagger**

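- Predicts fine-grained part‑of‑speech tags (`token.tag_`, e.g., NN, VBD, NNP). In the English pipelines, the coarse-grained tags (`token.pos_`, e.g., NOUN, VERB, PROPN) are then derived from these by the attribute_ruler.
- Its output helps the attribute_ruler and lemmatizer.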

**parser**

- Predicts dependency parse (who modifies whom) and sentence boundaries.
- Useful for relation extraction and many downstream tasks.

**attribute_ruler**

- Rule-based assignment of token attributes such as POS, MORPH, LEMMA, or NORM based on patterns.
- Lets you normalize or fix attributes before the lemmatizer runs.
- Typically comes before the lemmatizer to provide better hints.

**lemmatizer**

- Reduces words to their base form (e.g., “running” → “run”), using POS/morph info plus rules/lookups.
- Depends on the tagger and attribute_ruler for good accuracy.

**entity_ruler**

- Rule-based NER using patterns (string or token-based) that can create or overwrite entities.
- It is placed before 'ner' here, so its matches are set first; the statistical NER respects those spans and adjusts its predictions around them. The `overwrite_ents` setting only matters when the ruler runs after 'ner'.

**ner**

- Statistical Named Entity Recognizer that predicts entity spans and labels (e.g., PERSON, ORG, DATE).
- In the small English pipeline it relies on its own internal tok2vec layer rather than on tagger or parser output.
- Runs after entity_ruler in this setup, so the custom rules take precedence over the model's decisions.

#### Deployment and Application

We apply the new spaCy pipeline, which combines the base NER model with our custom rules, on test sentences.

In [17]:
# Test sentences to evaluate custom NER
#--- spaCy Pipeline with Custom EntityRuler ---
# ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'entity_ruler', 'ner']

from IPython.display import HTML, display

TEST_SENTENCE_1 = "I prefer the Nova Pad over the old Giga Phone, which costs $100 less."
TEST_SENTENCE_2 = "The CEO of Google announced the product on Tuesday." # Test overlap with base NER

doc_test = nlp(TEST_SENTENCE_1)

print("\n--- Custom NER Application (EntityRuler) ---")
#displacy.render(doc_test, style="ent", jupyter=True)

html = displacy.render(doc_test, style="ent", jupyter=False)
display(HTML(html))

# Programmatic extraction to verify custom labels
print(f"{'Entity':20} | {'Label':10}")
print("-" * 35)
for ent in doc_test.ents:
    print(f"{ent.text:20} | {ent.label_:10}")

# Example of IOB tags with custom entities:
print("\n--- IOB Tags with Custom Entities ---")
print(f"{'Token':10} | {'Full IOB Tag'}")
print("-" * 25)
for token in doc_test:
    iob_prefix = token.ent_iob_
    entity_type = token.ent_type_

    if iob_prefix == 'O':
        full_iob_tag = 'O'
    else:
        full_iob_tag = f"{iob_prefix}-{entity_type}"

    print(f"{token.text:10} | {full_iob_tag}")


--- Custom NER Application (EntityRuler) ---


Entity               | Label     
-----------------------------------
Giga Phone           | ORG       
100                  | MONEY     

--- IOB Tags with Custom Entities ---
Token      | Full IOB Tag
-------------------------
I          | O
prefer     | O
the        | O
Nova       | O
Pad        | O
over       | O
the        | O
old        | O
Giga       | B-ORG
Phone      | I-ORG
,          | O
which      | O
costs      | O
$          | O
100        | B-MONEY
less       | O
.          | O

**Note on the output above:** `Nova Pad` is left untagged and `Giga Phone` is labelled `ORG`, which is what the base model produces when the EntityRuler is not active. The cell numbers (`In [10]` for loading the model, `In [9]` for adding the ruler) suggest that the model was reloaded after the ruler had been added, which discards it. Run the cells from top to bottom and confirm that `entity_ruler` appears in `nlp.pipe_names` before testing; with the ruler active, both product names should come out as `PRODUCT`.


#### Conclusion

In this exercise, we explored the core concepts required for building a Custom NER system.

We first detailed the data preparation, feature engineering, and training process needed for the traditional Conditional Random Fields (CRF) model. CRFs can model label sequences accurately, but they require significant manual effort in feature design and, as the two-sentence demo showed, enough annotated data to learn from.

We then showed a practical way to deploy custom rules within a modern framework by using spaCy's EntityRuler. This method is fast, easy to maintain, and integrates with the existing statistical NER model, which makes it a good first choice for quickly injecting domain-specific knowledge into an NLP pipeline. Custom NER is essential for extracting targeted, domain-specific information from text.