# Learning how I can use spaCy

- spaCy, a Python NLP library, is used in this project to train a custom NER model.

# Installation
- Install spaCy using pip
- Install the pretrained English model (`en_core_web_sm`) for testing

In [None]:
%pip install spacy

!python -m spacy download en_core_web_sm

## Importing Spacy and doing some trials

- The cell below imports the `spacy` library and loads the English NLP model. It processes the sample text "ANGA Hub is looking at buying a Western startup for $1 billion." to create a `doc` object, then iterates over the named entities in `doc.ents` and prints each entity's text and label.

In [25]:
import spacy

# Load the English NLP model
nlp = spacy.load("en_core_web_sm")

# Process a text
doc = nlp("ANGA Hub is looking at buying a Western startup for $1 billion.")

# Extract entities
for ent in doc.ents:
    print(ent.text, ent.label_)

ANGA Hub ORG
Western NORP
$1 billion MONEY
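The label names in the output above are abbreviations. As a small sketch, `spacy.explain` can be used to look up a short description for each one; it returns `None` for labels that are not in spaCy's glossary, so custom labels will not have a description.

```python
import spacy

# Labels copied from the printed output above
for label in ("ORG", "NORP", "MONEY"):
    # spacy.explain returns a human-readable description, or None if unknown
    print(label, "->", spacy.explain(label))
```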


## Training with Spacy
### Libraries

- Import the required libraries:
    - `spacy`: the NLP framework used to train, fine-tune, and deploy NER models.
    - `DocBin`: a container for efficiently serializing multiple `Doc` objects. Labeled examples (text + entity annotations) are converted into spaCy `Doc` objects for training, and `DocBin` can **store**, **serialize**, and **load** them compactly.
    - `Example`: a training instance that pairs the *input text* (a `Doc`) with the correct *entity labels* (*annotations*). The model learns from these pairs during `nlp.update`.
    - `filter_spans()`: resolves **overlapping entity annotations**. When spans overlap, it keeps the longest one (the earliest one on ties), so the result is **non-overlapping**, which `doc.ents` requires.
    - `json`: reads **JSON** files, which store the labeled *NER training data* in this project.
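To illustrate how `filter_spans` behaves, here is a minimal sketch using a blank English pipeline (no model download needed). The sentence and the two overlapping spans are made up for illustration and are not taken from the project data.

```python
import spacy
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.blank("en")
doc = nlp.make_doc("Pick up the red box from table A")

# Two hypothetical, overlapping candidate spans (token indices)
short = Span(doc, 3, 5, label="OBJECT")   # "red box"
longer = Span(doc, 2, 5, label="OBJECT")  # "the red box"

# filter_spans keeps the longest span when candidates overlap
kept = filter_spans([short, longer])
print([(s.text, s.label_) for s in kept])
```

Only the longer span survives, so the result can be assigned to `doc.ents` without an overlap error.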


In [1]:
import spacy
from spacy.tokens import DocBin
from spacy.training.example import Example
from spacy.util import filter_spans
import json

### Loading JSON data for training
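- Use the imported `json` module to load the training data from `data.json`. Each record is a dictionary with a `text` string and a list of `entities`, where every entity is given as `[start, end, label]` character offsets.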

In [2]:
with open("data.json", "r", encoding="utf-8") as data_file:
    TRAIN_DATA = json.load(data_file)

print(TRAIN_DATA)

[{'text': 'Pick up the red box from table A and place it on table B at 4 PM.', 'entities': [[12, 19, 'OBJECT'], [25, 32, 'LOCATION'], [46, 53, 'LOCATION'], [57, 61, 'TIME']]}, {'text': 'Move the blue box to table 5 before 10 AM.', 'entities': [[9, 17, 'OBJECT'], [21, 28, 'LOCATION'], [36, 40, 'TIME']]}, {'text': 'Transport the yellow box to table C at 2:30 PM.', 'entities': [[13, 24, 'OBJECT'], [28, 35, 'LOCATION'], [39, 45, 'TIME']]}, {'text': 'Shift the small box from table 3 to table 7 by 5:15 PM.', 'entities': [[10, 20, 'OBJECT'], [26, 33, 'LOCATION'], [37, 44, 'LOCATION'], [48, 54, 'TIME']]}, {'text': 'Retrieve the green box from table X and set it up on table Y at noon.', 'entities': [[13, 23, 'OBJECT'], [29, 36, 'LOCATION'], [54, 61, 'LOCATION'], [65, 69, 'TIME']]}, {'text': 'Deliver the big box to table 12 by 3 PM.', 'entities': [[12, 20, 'OBJECT'], [24, 31, 'LOCATION'], [35, 38, 'TIME']]}, {'text': 'Move the brown box to table B before 8 AM.', 'entities': [[9, 18, 'OBJECT'], [
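Because the annotations are character offsets, it is worth checking that each `[start, end]` pair covers the intended text. The following is a minimal sketch; `check_offsets` is a hypothetical helper (not part of spaCy), and the record is the first one copied from the printed data above.

```python
def check_offsets(record):
    """Return (start, end, label, covered_text) for each annotated entity."""
    text = record["text"]
    return [(start, end, label, text[start:end])
            for start, end, label in record["entities"]]

# First record copied from the printed training data
record = {
    "text": "Pick up the red box from table A and place it on table B at 4 PM.",
    "entities": [[12, 19, "OBJECT"], [25, 32, "LOCATION"],
                 [46, 53, "LOCATION"], [57, 61, "TIME"]],
}

for start, end, label, covered in check_offsets(record):
    print(f"{label:<8} [{start}:{end}] -> {covered!r}")
```

For this record the first two slices give `'red box'` and `'table A'`, but the last two give `'on tabl'` and `'at 4'`, which suggests those offsets are shifted. Offsets that do not fall on token boundaries cannot be turned into entity spans with the default `doc.char_span` settings, so such annotations may be lost during training.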

### Initialization
Initializing a blank **spaCy model** for *English* using `spacy.blank("en")`, which means it starts **without** any pre-trained components. Since *Named Entity Recognition (NER)* is not included by default in a blank model, the script checks whether `"ner"` exists in the pipeline using `nlp.pipe_names`. If it's missing, it is added as the last component using `nlp.add_pipe("ner", last=True)`; otherwise, it is retrieved with `nlp.get_pipe("ner")`. This setup ensures that the model is ready for **custom NER training**, allowing users to add new *entity labels* and *fine-tune* the pipeline for specific tasks. 

In [3]:
# Load a blank English model
nlp = spacy.blank("en")

# Add the NER component if it's not already present
if "ner" not in nlp.pipe_names:
    ner = nlp.add_pipe("ner", last=True)
else:
    ner = nlp.get_pipe("ner")

In [4]:
# Add labels to the NER component
for annotations in TRAIN_DATA:
    for ent in annotations["entities"]:
        ner.add_label(ent[2])

In [None]:
# Prepare training examples
doc_bin = DocBin()
for record in TRAIN_DATA:  # each record is a dict with "text" and "entities"
    doc = nlp.make_doc(record["text"])
    ents = []
    for start, end, label in record["entities"]:
        # char_span returns None when the offsets do not fall on token boundaries
        span = doc.char_span(start, end, label=label)
        if span is not None:
            ents.append(span)
    doc.ents = filter_spans(ents)
    doc_bin.add(doc)
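The `DocBin` built above is not used again in this notebook. As a sketch of what it is for, the example below serializes a `DocBin` and restores it, assuming spaCy v3. The sentence and offsets are illustrative, and the file name mentioned in the comment is hypothetical.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc = nlp.make_doc("Move the blue box to table 5")
doc.ents = [doc.char_span(9, 17, label="OBJECT")]  # covers "blue box"

doc_bin = DocBin()
doc_bin.add(doc)

# Round-trip through bytes; doc_bin.to_disk("train.spacy") and
# DocBin().from_disk("train.spacy") do the same with a file (name is hypothetical)
restored = DocBin().from_bytes(doc_bin.to_bytes())
docs = list(restored.get_docs(nlp.vocab))
print([(e.text, e.label_) for e in docs[0].ents])
```

A `.spacy` file saved this way is also the format that the `spacy train` command-line workflow reads as training data.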

In [None]:
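# Training loop
optimizer = nlp.initialize()  # spaCy v3: sets up the weights and returns the optimizer

for i in range(30):  # Training for 30 iterations
    losses = {}
    for record in TRAIN_DATA:
        # Example.from_dict expects the annotations as a dict with an "entities" key
        annotations = {"entities": record["entities"]}
        example = Example.from_dict(nlp.make_doc(record["text"]), annotations)
        nlp.update([example], drop=0.3, sgd=optimizer, losses=losses)
    print(f"Iteration {i+1}: Losses - {losses}")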



Iteration 1: Losses - {'ner': 101.67150656813548}
Iteration 2: Losses - {'ner': 37.60334010249971}
Iteration 3: Losses - {'ner': 32.16715796887578}
Iteration 4: Losses - {'ner': 39.83000926843897}
Iteration 5: Losses - {'ner': 37.16629142423336}
Iteration 6: Losses - {'ner': 17.3953892072243}
Iteration 7: Losses - {'ner': 13.633921551028623}
Iteration 8: Losses - {'ner': 12.374973254719581}
Iteration 9: Losses - {'ner': 11.464808898951325}
Iteration 10: Losses - {'ner': 16.02827886002787}
Iteration 11: Losses - {'ner': 14.738978374450632}
Iteration 12: Losses - {'ner': 13.269475577507045}
Iteration 13: Losses - {'ner': 9.196134885464357}
Iteration 14: Losses - {'ner': 6.996138207939732}
Iteration 15: Losses - {'ner': 7.794066351309302}
Iteration 16: Losses - {'ner': 2.618262086647651}
Iteration 17: Losses - {'ner': 1.8945431576479757}
Iteration 18: Losses - {'ner': 6.58648641994138}
Iteration 19: Losses - {'ner': 1.8793278869173617}
Iteration 20: Losses - {'ner': 3.9072475739671173}
It

In [None]:
# Save the trained model
nlp.to_disk("ner_GC_model")
print("Model saved successfully!")

Model saved successfully!


## Using the Trained Model

In [5]:
import spacy

# Load the trained model
nlp = spacy.load("ner_GC_model")

# Test with a new sentence
sentence = "Pick the ketchup box from Shelf A to the table 3E at 12 PM"
doc = nlp(sentence)

# Print the recognized entities
for ent in doc.ents:
    print(f"{ent.text} - {ent.label_}")

ketchup box - OBJECT
Shelf A - LOCATION
table 3E - LOCATION
