# Learning how to use spaCy

- spaCy is a Python NLP library; this project uses it to train a custom named entity recognition (NER) model.

# Installation
- Install spaCy using pip
- Install the pretrained English model (`en_core_web_sm`) for testing

In [3]:
!pip install spacy

!python -m spacy download en_core_web_sm




[notice] A new release of pip is available: 25.0 -> 25.0.1
[notice] To update, run: C:\Users\User\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip


Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)


## Importing spaCy and running a quick trial

- The cell below imports the `spacy` library and loads the English NLP model. It processes the sample text "ANGA Hub is looking at buying a Western startup for $1 billion." to create a `doc` object, then iterates over the named entities in the `doc` and prints each entity's text and label.

In [4]:
import spacy

# Load the English NLP model
nlp = spacy.load("en_core_web_sm")

# Process a text
doc = nlp("ANGA Hub is looking at buying a Western startup for $1 billion.")

# Extract entities
for ent in doc.ents:
    print(ent.text, ent.label_)

ANGA Hub ORG
Western NORP
$1 billion MONEY
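
The label codes printed above (`ORG`, `NORP`, `MONEY`) can be looked up with `spacy.explain`, which returns a short description for terms in spaCy's glossary. A minimal sketch, with the label list copied from the output above; no model needs to be loaded for the lookup:

```python
import spacy

# Labels copied from the output of the previous cell.
labels = ["ORG", "NORP", "MONEY"]

# spacy.explain returns a description string, or None for unknown terms.
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label}: {description}")
```

This is only a reading aid for the pretrained model's labels; the custom labels used later in this notebook (`OBJECT`, `LOCATION`, `TIME`) are defined by the training data.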


## Training with spaCy
### Libraries

- Importing the required libraries:
    - `spacy`: an NLP framework used for training, fine-tuning, and deploying NER models.
    - `DocBin`: a container for efficiently storing and serializing multiple `Doc` objects. For training, labeled examples (text + entity annotations) are converted into **spaCy** `Doc` **objects**; `DocBin` **stores**, **serializes**, and **loads** them in a compact form.
    - `Example`: a training instance that pairs the *input text* (as a `Doc`) with the correct *entity labels* (*annotations*). The model is updated by comparing its predictions against these annotations.
    - `filter_spans()`: resolves **overlapping entity annotations**. When spans overlap, it keeps the longest span (the first one if lengths tie) and drops the rest, so the result is **non-overlapping**, which `doc.ents` requires.
    - `json`: handles **JSON** files, which here store the labeled *NER training data*.


In [5]:
import spacy
from spacy.tokens import DocBin
from spacy.training.example import Example
from spacy.util import filter_spans
import json
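
To make the `filter_spans()` behaviour described above concrete, here is a small sketch. The sentence is shortened from the training data and the overlapping `"box"` span is made up for illustration; `spacy.blank("en")` is used only for its tokenizer.

```python
import spacy
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.blank("en")  # tokenizer only, no trained components needed
doc = nlp.make_doc("Move the blue box to table 5")

# Token indices: Move(0) the(1) blue(2) box(3) to(4) table(5) 5(6)
candidates = [
    Span(doc, 3, 4, label="OBJECT"),    # "box" (hypothetical overlapping span)
    Span(doc, 2, 4, label="OBJECT"),    # "blue box"
    Span(doc, 5, 7, label="LOCATION"),  # "table 5"
]

# Overlaps are resolved in favour of the longest span.
kept = filter_spans(candidates)
kept_pairs = [(span.text, span.label_) for span in kept]
print(kept_pairs)

# The filtered spans can be assigned to doc.ents without an overlap error.
doc.ents = kept
```

Here the shorter `"box"` span is dropped because it overlaps with the longer `"blue box"` span.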

### Loading JSON data for training
- Use the imported `json` module to load the training data from the JSON file `data.json`.

In [28]:
with open("data.json", "r", encoding="utf-8") as data_file:
    TRAIN_DATA = json.load(data_file)

print(TRAIN_DATA)

[{'text': 'Pick up the red box from table A and place it on table B at 4 PM.', 'entities': [[12, 19, 'OBJECT'], [25, 32, 'LOCATION'], [49, 56, 'LOCATION'], [60, 64, 'TIME']]}, {'text': 'Move the blue box to table 5 before 10 AM.', 'entities': [[9, 17, 'OBJECT'], [21, 28, 'LOCATION'], [36, 41, 'TIME']]}, {'text': 'Transport the yellow box to table C at 2:30 PM.', 'entities': [[13, 24, 'OBJECT'], [28, 35, 'LOCATION'], [39, 46, 'TIME']]}, {'text': 'Shift the small box from table 3 to table 7 by 5:15 PM.', 'entities': [[10, 19, 'OBJECT'], [25, 32, 'LOCATION'], [36, 43, 'LOCATION'], [47, 54, 'TIME']]}, {'text': 'Retrieve the green box from table X and set it up on table Y at noon.', 'entities': [[13, 22, 'OBJECT'], [28, 35, 'LOCATION'], [53, 60, 'LOCATION'], [64, 68, 'TIME']]}, {'text': 'Deliver the big box to table 12 by 3 PM.', 'entities': [[12, 19, 'OBJECT'], [23, 31, 'LOCATION'], [35, 39, 'TIME']]}, {'text': 'Move the brown box to table B before 8 AM.', 'entities': [[9, 18, 'OBJECT'], [
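
Each entry pairs a `text` with `[start, end, label]` character offsets, so a quick sanity check is to slice the text with those offsets and look at what comes out. The helper below, `find_suspect_spans`, is a hypothetical name introduced for this sketch; it uses plain Python only and flags offsets that are out of range or that slice to text with leading or trailing whitespace. This is a heuristic, not a replacement for spaCy's own token alignment check.

```python
def find_suspect_spans(data):
    """Return (text, start, end, label, snippet) for offsets that look wrong."""
    suspects = []
    for entry in data:
        text = entry["text"]
        for start, end, label in entry["entities"]:
            snippet = text[start:end]
            in_range = 0 <= start < end <= len(text)
            # Leading/trailing whitespace usually means the offsets are shifted.
            if not in_range or snippet != snippet.strip():
                suspects.append((text, start, end, label, snippet))
    return suspects


# Two entries copied from the printed TRAIN_DATA above.
sample = [
    {"text": "Move the blue box to table 5 before 10 AM.",
     "entities": [[9, 17, "OBJECT"], [21, 28, "LOCATION"], [36, 41, "TIME"]]},
    {"text": "Transport the yellow box to table C at 2:30 PM.",
     "entities": [[13, 24, "OBJECT"], [28, 35, "LOCATION"], [39, 46, "TIME"]]},
]

suspects = find_suspect_spans(sample)
for text, start, end, label, snippet in suspects:
    print(f"{label} [{start}, {end}] -> {snippet!r} in {text!r}")
```

On these two entries the check flags only `[13, 24]` in the "Transport the yellow box" sentence, which slices to `' yellow box'` with a leading space; the intended start offset is probably 14.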

### Initialization
Initializing a blank **spaCy model** for *English* using `spacy.blank("en")`, which means it starts **without** any pre-trained components. Since *Named Entity Recognition (NER)* is not included by default in a blank model, the script checks whether `"ner"` exists in the pipeline using `nlp.pipe_names`. If it's missing, it is added as the last component using `nlp.add_pipe("ner", last=True)`; otherwise, it is retrieved with `nlp.get_pipe("ner")`. This leaves the pipeline ready for **custom NER training**: new *entity labels* can be added and the component trained on task-specific data.

In [29]:
# Load a blank English model
nlp = spacy.blank("en")

# Add the NER component if it's not already present
if "ner" not in nlp.pipe_names:
    ner = nlp.add_pipe("ner", last=True)
else:
    ner = nlp.get_pipe("ner")

In [30]:
# Add labels to the NER component
for annotations in TRAIN_DATA:
    for ent in annotations["entities"]:
        ner.add_label(ent[2])

In [38]:
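# Prepare training examples
doc_bin = DocBin()
for entry in TRAIN_DATA:
    text, entities = entry["text"], entry["entities"]
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in entities:
        # char_span returns None when the offsets do not match token boundaries
        span = doc.char_span(start, end, label=label)
        if span is None:
            print(f"Skipping misaligned span [{start}, {end}] ({label}) in: {text!r}")
        else:
            ents.append(span)
    # doc.ents must not contain overlapping spans
    doc.ents = filter_spans(ents)
    doc_bin.add(doc)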
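
The cell above fills `doc_bin` but never writes it out. As a rough sketch of what serialization looks like, the example below round-trips a single annotated `Doc` through bytes and reads the entities back. The sentence is shortened from the training data; `doc_bin.to_disk(...)` and `DocBin().from_disk(...)` work the same way with a file path.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc = nlp.make_doc("Move the blue box to table 5")
doc.ents = [
    doc.char_span(9, 17, label="OBJECT"),     # "blue box"
    doc.char_span(21, 28, label="LOCATION"),  # "table 5"
]

# Serialize the annotated doc to bytes.
doc_bin = DocBin()
doc_bin.add(doc)
payload = doc_bin.to_bytes()

# Restore it; get_docs needs a vocab to rebuild the Doc objects.
restored_docs = list(DocBin().from_bytes(payload).get_docs(nlp.vocab))
restored_ents = [(ent.text, ent.label_) for ent in restored_docs[0].ents]
print(restored_ents)
```

The entity annotations survive the round trip, which is what makes `DocBin` suitable for storing prepared training data between sessions.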

In [40]:
# Training loop
# initialize() replaces the deprecated begin_training() in spaCy v3;
# resume_training() is intended for updating an already trained pipeline.
optimizer = nlp.initialize()

for i in range(50):  # Training for 50 iterations
    losses = {}
    for entry in TRAIN_DATA:
        text = entry["text"]
        annotations = {"entities": [tuple(ent) for ent in entry["entities"]]}
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer, drop=0.3, losses=losses)
    print(f"Iteration {i+1}: Losses - {losses}")



Iteration 1: Losses - {'ner': 217.74356665671075}
Iteration 2: Losses - {'ner': 130.74036908972164}
Iteration 3: Losses - {'ner': 73.74192956293571}
Iteration 4: Losses - {'ner': 13.106404100642681}
Iteration 5: Losses - {'ner': 8.707318455901177}
Iteration 6: Losses - {'ner': 4.418748851284166}
Iteration 7: Losses - {'ner': 7.081353269135576}
Iteration 8: Losses - {'ner': 2.715542732990426}
Iteration 9: Losses - {'ner': 0.9744601633906703}
Iteration 10: Losses - {'ner': 1.8425415525725655}
Iteration 11: Losses - {'ner': 1.6397290908444995}
Iteration 12: Losses - {'ner': 2.001732125031806}
Iteration 13: Losses - {'ner': 1.7655932470443647}
Iteration 14: Losses - {'ner': 1.0879585732324122}
Iteration 15: Losses - {'ner': 1.7804843550255665}
Iteration 16: Losses - {'ner': 0.0985103228113588}
Iteration 17: Losses - {'ner': 0.24461754339141772}
Iteration 18: Losses - {'ner': 0.004529335636171154}
Iteration 19: Losses - {'ner': 0.2785120410459396}
Iteration 20: Losses - {'ner': 3.6057925109

In [41]:
# Save the trained model
nlp.to_disk("ner_GC_model")
print("Model saved successfully!")

Model saved successfully!


## Using the Trained Model

In [57]:
import spacy

# Load the trained model
nlp = spacy.load("ner_GC_model")

# Test with a new sentence
sentence = "Pick the green packahe from Shelf 12A to the loading 3 at midnight"
doc = nlp(sentence)

# Print the recognized entities
for ent in doc.ents:
    print(f"{ent.text} - {ent.label_}")

green packahe - OBJECT
Shelf 12A - LOCATION
loading 3 - LOCATION
midnight - TIME
