# 🌟 **NER (Named Entity Recognition)**

NER is an NLP task in which a model **finds spans of text that refer to real-world entities** and assigns each one a label such as:
PERSON, ORG (organization), GPE (countries, cities, states), DATE, MONEY, PRODUCT, etc.

Example:
**“Apple released the iPhone in California.”**

| Word       | Entity  |
| ---------- | ------- |
| Apple      | ORG     |
| iPhone     | PRODUCT |
| California | GPE     |

---

### ⭐ Why NER is used

* Resume parsing
* Chatbots
* Search engines
* Medical reports
* Invoice extraction
* News understanding
* Legal document automation

---

### ⭐ How NER Works

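NER systems generally follow one of three approaches:

✔ **Rule-based NER**

Uses hand-written patterns (regular expressions, dictionaries).
Simple to build, but it misses anything the rules do not anticipate.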
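
The rule-based approach can be sketched in plain Python. This is only an illustration: the `GAZETTEER` dictionary, the `MONEY_PATTERN` regex, and the `rule_based_ner` function are hypothetical names made up for this example, not part of any library.

```python
import re

# Hypothetical gazetteer (dictionary) of known names; a real one would be far larger.
GAZETTEER = {
    "Apple": "ORG",
    "Google": "ORG",
    "California": "GPE",
    "London": "GPE",
}

# Hypothetical pattern: a dollar amount, optionally followed by a magnitude word.
MONEY_PATTERN = re.compile(r"\$\d+(?:\.\d+)?(?:\s(?:thousand|million|billion))?")


def rule_based_ner(text):
    """Return (start, end, text, label) tuples found by dictionary and regex rules."""
    entities = []
    for name, label in GAZETTEER.items():
        # \b keeps "Apple" from matching inside a longer word such as "Applebee".
        for match in re.finditer(rf"\b{re.escape(name)}\b", text):
            entities.append((match.start(), match.end(), match.group(), label))
    for match in MONEY_PATTERN.finditer(text):
        entities.append((match.start(), match.end(), match.group(), "MONEY"))
    return sorted(entities)  # order by position in the text


sample = "Apple opened a $5 million office in California."
for start, end, span, label in rule_based_ner(sample):
    print(span, "-", label)
```

This prints `Apple - ORG`, `$5 million - MONEY`, and `California - GPE`. The weakness shows up quickly: the same rules would tag "Apple" as ORG in "Apple pie", and any name missing from the dictionary is never found.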

✔ **Statistical / ML-based NER**

Uses classical models such as SVMs and CRFs trained on hand-crafted features (old, but still used in classic NLP).
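
As a rough sketch of what "hand-crafted features" means here, the function below builds one feature dict per token, in the dict-per-token style that CRF toolkits such as sklearn-crfsuite accept. The feature set and the `token_features` name are assumptions chosen for illustration; no model is trained in this snippet.

```python
def token_features(tokens, i):
    """Build a feature dict for the token at position i (hypothetical feature set)."""
    word = tokens[i]
    features = {
        "lower": word.lower(),
        "is_title": word.istitle(),
        "is_upper": word.isupper(),
        "is_digit": word.isdigit(),
        "suffix3": word[-3:],
    }
    # Neighbouring words give the model context around the current token.
    if i > 0:
        features["prev_lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        features["next_lower"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True  # end of sentence
    return features


tokens = "Apple released the iPhone in California .".split()
sentence_features = [token_features(tokens, i) for i in range(len(tokens))]
print(sentence_features[0])
```

A statistical model would learn from many such dicts, paired with gold labels, which feature combinations signal an entity.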

✔ **Neural NER (current trend)**

Transformer models such as BERT and RoBERTa, and the neural pipelines shipped with spaCy → usually the highest accuracy.

---

# ⭐ Basic NER using spaCy (Built-in Model)

In [None]:
%pip install spacy

In [None]:
!python -m spacy download en_core_web_sm

In [7]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "Google hired John in London for 5 million dollars."

doc = nlp(text)

for ent in doc.ents:
    print(ent.text, "-", ent.label_)

Google - ORG
John - PERSON
London - GPE
5 million dollars - MONEY


In [4]:
# spaCy NER Visualization (Optional)

from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)

### ⭐ Understanding Common NER Labels

| Label   | Meaning                                              |
| ------- | ---------------------------------------------------- |
| PERSON  | People's names                                       |
| ORG     | Companies, institutions                              |
| GPE     | Countries, cities, states                            |
| LOC     | Non-GPE locations (mountain ranges, bodies of water) |
| DATE    | Dates                                                |
| TIME    | Times                                                |
| MONEY   | Monetary values                                      |
| PRODUCT | Devices, products                                    |
| EVENT   | Named events (festivals, wars, sports events)        |
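
spaCy can also describe its own labels. A small sketch using `spacy.explain`, which looks a label up in spaCy's built-in glossary and needs no model download:

```python
import spacy

# Labels taken from the table above; spacy.explain returns None for unknown labels.
labels = ["PERSON", "ORG", "GPE", "LOC", "DATE", "TIME", "MONEY", "PRODUCT", "EVENT"]

for label in labels:
    print(f"{label:8} {spacy.explain(label)}")
```

The exact wording of each description comes from the installed spaCy version, so it may differ slightly from the table.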

---

#### spaCy NER may be inaccurate

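Because:

* Limited coverage in the training data
* The “sm” model is small and trades accuracy for speed
* Domain mismatch between the training data and your text
* New/unseen entities
* Weak coverage of Indian names and places

**Fix:**
Use `en_core_web_trf` (transformer-based), which is usually more accurate but larger and slower.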

```bash
python -m spacy download en_core_web_trf
```

## When to use Transformers (BERT NER)

If your requirements include:

* High accuracy
* Domain-specific text
* Uncommon entities
* A large set of labels

Use Hugging Face Transformers:

In [None]:
%pip install transformers torch

### Simple NER using a BERT model

In [8]:
from transformers import pipeline

# Name the model explicitly (this is the checkpoint the pipeline defaults to).
# aggregation_strategy="simple" replaces the deprecated grouped_entities=True.
ner = pipeline(
    "ner",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",
    aggregation_strategy="simple",
)

text = "Elon Musk founded SpaceX in California."

result = ner(text)

In [7]:
print(result)

[{'entity_group': 'PER', 'score': 0.99840075, 'word': 'Elon Musk', 'start': 0, 'end': 9}, {'entity_group': 'ORG', 'score': 0.9987556, 'word': 'SpaceX', 'start': 18, 'end': 24}, {'entity_group': 'LOC', 'score': 0.99960726, 'word': 'California', 'start': 28, 'end': 38}]
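
The result is a plain list of dicts, so it can be post-processed with ordinary Python. The sketch below groups spans by label and drops low-confidence predictions. `MIN_SCORE` and `group_by_label` are hypothetical names, and the threshold value is an assumption to tune on your own data. Note that this model uses CoNLL-2003 labels (PER, ORG, LOC, MISC) rather than spaCy's PERSON/GPE.

```python
# Output copied from the cell above so this sketch runs without downloading the model.
text = "Elon Musk founded SpaceX in California."
result = [
    {"entity_group": "PER", "score": 0.99840075, "word": "Elon Musk", "start": 0, "end": 9},
    {"entity_group": "ORG", "score": 0.9987556, "word": "SpaceX", "start": 18, "end": 24},
    {"entity_group": "LOC", "score": 0.99960726, "word": "California", "start": 28, "end": 38},
]

MIN_SCORE = 0.90  # hypothetical cut-off; tune it on your own data


def group_by_label(entities, source_text, min_score):
    """Map each label to the text spans that meet the confidence threshold."""
    grouped = {}
    for entity in entities:
        if entity["score"] < min_score:
            continue
        # Slice the original text with the offsets instead of relying on "word",
        # which can differ from the source after subword tokens are merged.
        span = source_text[entity["start"]:entity["end"]]
        grouped.setdefault(entity["entity_group"], []).append(span)
    return grouped


print(group_by_label(result, text, MIN_SCORE))
# {'PER': ['Elon Musk'], 'ORG': ['SpaceX'], 'LOC': ['California']}
```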


# ⭐ Training Custom NER (Simple Version)

When the default NER doesn't detect what you want, you can train a custom model.


### Example: Label **“DEVICE”** for text like “Redmi Note 12”.

## Step 1: Prepare Training Data

In [12]:
# Each entity is (start_char, end_char, label); end_char is exclusive,
# so text[start_char:end_char] must equal the entity text exactly.
TRAIN_DATA = [
    ("I bought a Redmi Note 12 yesterday.",
     {"entities": [(11, 24, "DEVICE")]}),

    ("The Samsung S22 Ultra is expensive.",
     {"entities": [(4, 21, "DEVICE")]}),

    ("Xiaomi 14 Pro is a flagship phone.",
     {"entities": [(0, 13, "DEVICE")]}),

    ("I am planning to buy the Xiaomi 14 Pro.",
     {"entities": [(25, 38, "DEVICE")]}),

    ("OnePlus 12 is launching next month.",
     {"entities": [(0, 10, "DEVICE")]}),
]
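
Character offsets are easy to miscount by hand, and spaCy skips entity spans that do not line up with token boundaries. One way to avoid this is to derive the offsets from the entity text. The `make_example` helper below is a hypothetical sketch, not a spaCy function. It uses `str.find`, so it only locates the first occurrence of the phrase.

```python
def make_example(text, phrase, label):
    """Build one (text, annotations) pair, locating the phrase with str.find."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in {text!r}")
    end = start + len(phrase)  # end offset is exclusive
    return (text, {"entities": [(start, end, label)]})


example = make_example("I am planning to buy the Xiaomi 14 Pro.", "Xiaomi 14 Pro", "DEVICE")
print(example)
# ('I am planning to buy the Xiaomi 14 Pro.', {'entities': [(25, 38, 'DEVICE')]})

# Sanity check: slicing with the offsets must give back exactly the phrase.
sentence, annotations = example
for start, end, label in annotations["entities"]:
    assert sentence[start:end] == "Xiaomi 14 Pro"
```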


## Step 2: Train Model

In [13]:
import spacy
from spacy.training.example import Example

# ---------------------------------------------------------
# 1. Create a blank English NLP pipeline
# ---------------------------------------------------------
# Using blank("en") because we want to train NER from scratch,
# not use the default pretrained model.
nlp = spacy.blank("en")

# ---------------------------------------------------------
# 2. Add NER component to the pipeline
# ---------------------------------------------------------
# "ner" = Named Entity Recognizer
ner = nlp.add_pipe("ner")

# ---------------------------------------------------------
# 3. Add the new custom label we want the model to detect
# ---------------------------------------------------------
ner.add_label("DEVICE")  # Example: Redmi Note 12, Samsung S22

# ---------------------------------------------------------
# 4. Initialize the model
# ---------------------------------------------------------
# This sets up the model weights based on the labels we added.
optimizer = nlp.initialize()

# ---------------------------------------------------------
# 5. Start training loop
# ---------------------------------------------------------
# We train for several epochs (iterations).
# More epochs = better learning, but too many = overfitting.
for epoch in range(20):
    print(f"Epoch {epoch+1} started...")
    
    # Loop through each training example
    for text, annotations in TRAIN_DATA:

        # Step 1: Convert text into a spaCy doc object
        doc = nlp.make_doc(text)

        # Step 2: Convert your annotation dict into a training Example object
        example = Example.from_dict(doc, annotations)

        # Step 3: Update the NER model with this example
        # sgd is the parameter name for the optimizer returned by nlp.initialize()
        nlp.update([example], sgd=optimizer)

print("Training completed!")


Epoch 1 started...
Epoch 2 started...
Epoch 3 started...
Epoch 4 started...
Epoch 5 started...
Epoch 6 started...
Epoch 7 started...
Epoch 8 started...
Epoch 9 started...
Epoch 10 started...
Epoch 11 started...
Epoch 12 started...
Epoch 13 started...
Epoch 14 started...
Epoch 15 started...
Epoch 16 started...
Epoch 17 started...
Epoch 18 started...
Epoch 19 started...
Epoch 20 started...
Training completed!
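
The loop above prints only the epoch number, which says nothing about whether the model is still learning. A minimal variation, assuming spaCy v3, shuffles the data each epoch and reports the NER loss. The two training sentences, the epoch count, the batch size, and the dropout rate here are placeholders for illustration.

```python
import random

import spacy
from spacy.training import Example
from spacy.util import minibatch

# Two placeholder training sentences; substitute your own TRAIN_DATA.
TRAIN_DATA = [
    ("Xiaomi 14 Pro is a flagship phone.", {"entities": [(0, 13, "DEVICE")]}),
    ("OnePlus 12 is launching next month.", {"entities": [(0, 10, "DEVICE")]}),
]

random.seed(0)
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("DEVICE")
optimizer = nlp.initialize()

for epoch in range(5):
    random.shuffle(TRAIN_DATA)  # vary the order so updates do not follow a fixed pattern
    losses = {}
    for batch in minibatch(TRAIN_DATA, size=2):
        examples = [
            Example.from_dict(nlp.make_doc(text), annotations)
            for text, annotations in batch
        ]
        # drop=0.2 applies dropout; nlp.update fills the losses dict in place
        nlp.update(examples, sgd=optimizer, drop=0.2, losses=losses)
    print(f"Epoch {epoch + 1}: NER loss = {losses['ner']:.3f}")
```

A loss that keeps falling on the training data while results on held-out sentences get worse is the usual sign of the overfitting mentioned in the code comments.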


## Step 3: Test Model

In [14]:
test_text = "Xiaomi 14 Pro is the latest model."
doc = nlp(test_text)

for ent in doc.ents:
    print(ent.text, ent.label_)

Xiaomi 14 Pro DEVICE


**NOTE:**  
* Use at least **200–500 labelled sentences** for reliable performance
* Ensure entity boundaries are correct (no leading or trailing spaces inside a span)
* Mix short and long sentences
* Include variations of names
* Add negative examples (sentences that contain no entities)
* Avoid overfitting by adding diverse sentences
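
A small sketch of what negative examples and name variations can look like in the same training-data format. All sentences here are made up for illustration, and the empty `"entities"` list is meant to tell the trainer that the sentence contains no DEVICE.

```python
# Hypothetical sentences; the empty "entities" list marks them as negative examples.
NEGATIVE_EXAMPLES = [
    ("I bought a new phone yesterday.", {"entities": []}),
    ("The note on page 12 is expensive to print.", {"entities": []}),
]

# Hypothetical name variations of one device, each with its own offsets.
VARIATION_EXAMPLES = [
    ("My Redmi Note 12 arrived today.", {"entities": [(3, 16, "DEVICE")]}),
    ("The Note 12 has a good camera.", {"entities": [(4, 11, "DEVICE")]}),
]

train_data = NEGATIVE_EXAMPLES + VARIATION_EXAMPLES

# Check that every annotated span starts and ends on a non-space character.
for text, annotations in train_data:
    for start, end, label in annotations["entities"]:
        span = text[start:end]
        assert span == span.strip(), f"Bad boundary in {text!r}: {span!r}"

negatives = sum(1 for _, ann in train_data if not ann["entities"])
print(f"{negatives} of {len(train_data)} examples are negative")
```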