# Train a Custom NER Model with spaCy

This notebook demonstrates how to train a custom Named Entity Recognition (NER) model using the spaCy library.

## What is Named Entity Recognition?

Named Entity Recognition (NER) is a natural language processing task that identifies and classifies entities in text into predefined categories such as:
- **People** (PER)
- **Organizations** (ORG)
- **Locations** (LOC)
- **Dates** (DATE)
- And other custom categories

## Overview

This tutorial will cover:
1. Installing required dependencies
2. Loading and preparing annotated training data
3. Converting data to spaCy format
4. Training a custom NER model
5. Running inference on new text

## Step 1: Install Dependencies

First, install spaCy along with `spacy-transformers`, which adds transformer-based pipeline components that typically improve accuracy.

In [None]:
!pip install -U spacy
!pip install spacy-transformers

## Step 2: Setup Environment and Import Libraries

Mount Google Drive to access your annotated dataset and import necessary libraries.

In [None]:
# Mount Google Drive to access project files
from google.colab import drive
drive.mount('/content/drive')

# Change to your project directory
%cd "/content/drive/MyDrive/"

# Create the directory tree 
!mkdir -p Custom_NER/annotations_dataset
!mkdir -p Custom_NER/config
!mkdir -p Custom_NER/trained_models

# Import required libraries
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm
import json

# Check SpaCy version
print(f"SpaCy version: {spacy.__version__}")

# Check GPU availability (important for faster training)
!nvidia-smi

## Step 3: Load Annotated Data

Load your annotated dataset from a JSON file. Each annotation pairs a text with its entities in the form `[[text, {"entities": [[start, end, label], ...]}], ...]`, where `start` and `end` are character offsets into `text` (`end` exclusive).
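
To make the expected format concrete, the sketch below builds one hypothetical annotated example and checks that every entity's offsets fall inside the text and select a non-empty substring. The sample sentence, the labels, and the helper name `check_annotations` are illustrative assumptions, not part of the dataset.

```python
import json

def check_annotations(data):
    """Return a list of problems found in [[text, {"entities": [...]}], ...] data."""
    problems = []
    for i, (text, annot) in enumerate(data):
        for start, end, label in annot["entities"]:
            # Offsets are character positions; end is exclusive
            if not (0 <= start < end <= len(text)):
                problems.append((i, start, end, label))
    return problems

# Hypothetical example in the expected format (not taken from the real dataset)
sample_json = '[["Leslie Urteaga viajó a Cusco.", {"entities": [[0, 14, "PER"], [23, 28, "LOC"]]}]]'
sample = json.loads(sample_json)

sample_text, sample_annot = sample[0]
for start, end, label in sample_annot["entities"]:
    # Slicing the text with the offsets should reproduce the entity string
    print(label, repr(sample_text[start:end]))

print(check_annotations(sample))
```

An empty list from `check_annotations` means every offset pair is in range; it does not guarantee that the offsets line up with token boundaries, which is checked later during conversion.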

In [None]:
# Load the annotated data from JSON file
# Update this path to match your dataset location
# The with-block closes the file once the data is loaded
with open('/content/drive/MyDrive/Custom_NER/annotations_dataset/annotations.json', 'r', encoding='utf-8-sig') as f:
    cv_data = json.load(f)

# Display dataset statistics
print(f"Total number of annotated examples: {len(cv_data)}")

# Display the first example to verify the format
print("\nFirst example:")
print(cv_data[0])

## Step 4: Initialize SpaCy Configuration

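Create a configuration file for training. spaCy requires a config file that defines the model architecture, training parameters, and hyperparameters. You can generate a base configuration with the quickstart widget in the [spaCy documentation](https://spacy.io/usage/training) and save it as `base_config.cfg` in the `Custom_NER/config` directory; the command below then fills in the remaining defaults.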
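
Before running `init fill-config`, it can help to confirm that the base config declares the language and pipeline you expect. The sketch below inspects the `[nlp]` section with Python's standard `configparser`; this is not spaCy's own config loader, and the inline sample config and the helper name `summarize_base_config` are assumptions for illustration.

```python
import configparser
import json

def summarize_base_config(cfg_text):
    """Read the [nlp] section of a config string and report language and pipeline."""
    # interpolation=None leaves values such as ${paths.train} untouched
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    # Values in the config are JSON-style literals, so json.loads decodes them
    lang = json.loads(parser["nlp"]["lang"])
    pipeline = json.loads(parser["nlp"]["pipeline"])
    return {"lang": lang, "pipeline": pipeline, "has_ner": "ner" in pipeline}

# Hypothetical minimal base config; a real one has more sections
sample_cfg = """
[paths]
train = null
dev = null

[nlp]
lang = "es"
pipeline = ["tok2vec","ner"]
"""

print(summarize_base_config(sample_cfg))
```

To check the real file, read it with `open(path, encoding="utf-8").read()` and pass the text to the same function.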

In [None]:
# Initialize SpaCy config from base configuration
# This fills in all default values for training
!python -m spacy init fill-config /content/drive/MyDrive/Custom_NER/config/base_config.cfg /content/drive/MyDrive/Custom_NER/config/config.cfg

## Step 5: Define Data Conversion Function

This function converts your annotated data into SpaCy's binary format (DocBin), which is required for training.
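
The conversion function in this step drops any entity whose character range overlaps an entity seen earlier, since spaCy's `doc.ents` cannot hold overlapping spans. A minimal sketch of that filtering step in isolation, using a set of claimed character positions; the helper name `drop_overlaps` and the sample offsets are hypothetical.

```python
def drop_overlaps(entities):
    """Keep each (start, end, label) whose characters are not already claimed."""
    claimed = set()
    kept = []
    for start, end, label in entities:
        span_chars = range(start, end)
        if any(idx in claimed for idx in span_chars):
            continue  # overlaps an earlier entity, so skip it
        claimed.update(span_chars)
        kept.append((start, end, label))
    return kept

# The second entity sits inside the first and is dropped
sample_entities = [(0, 14, "PER"), (7, 14, "PER"), (23, 28, "LOC")]
print(drop_overlaps(sample_entities))
```

The first entity in annotation order wins, so the order of the annotations decides which of two overlapping entities survives.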

In [None]:
def get_spacy_doc(file, data):
    """
    Convert annotated data to SpaCy DocBin format.
    
    Args:
        file: File handle for logging errors
        data: List of tuples (text, {"entities": [(start, end, label), ...]})
    
    Returns:
        DocBin: SpaCy DocBin object containing processed documents
    """
    # Create a blank Spanish language model
    nlp = spacy.blank('es')
    db = DocBin()
    
    # Process each annotated text
    for text, annot in tqdm(data):
        doc = nlp.make_doc(text)
        entities = annot['entities']
        
        ents = []
        # Character positions already claimed by an accepted entity
        claimed = set()

        # Convert character offsets to spaCy spans
        for start, end, label in entities:
            # Skip entities that overlap one already accepted
            if any(idx in claimed for idx in range(start, end)):
                continue

            try:
                # Create span with strict alignment
                span = doc.char_span(start, end, label=label, alignment_mode='strict')
            except Exception as e:
                # Log any errors during span creation
                file.write(f"{start},{end}: {e}\n")
                continue

            if span is None:
                # Log annotations that couldn't be aligned to token boundaries
                file.write(f"{start},{end}: {text}\n")
            else:
                # Claim the characters only once the span is accepted
                claimed.update(range(start, end))
                ents.append(span)

        # Add entities to document
        try:
            doc.ents = ents
            db.add(doc)
        except Exception as e:
            # Log and skip documents that cause errors
            file.write(f"Skipped document: {e}\n")
    
    return db

## Step 6: Prepare Training and Test Data

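Split the data into training and test sets, then convert each to spaCy's binary format. The test set is passed to training as the development set in Step 7, so it guides model selection and is not a fully held-out benchmark.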
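
For intuition, an 80/20 split amounts to shuffling the examples with a fixed seed and cutting the list at the 80% mark. The sketch below does this with the standard library only; it is not scikit-learn's implementation and will not reproduce the exact partition that `train_test_split` returns. The function name `simple_split` is hypothetical.

```python
import random

def simple_split(data, test_size=0.2, seed=42):
    """Shuffle a copy of data with a fixed seed and split it into (train, test)."""
    items = list(data)
    # A dedicated Random instance keeps the split reproducible
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_size)
    return items[n_test:], items[:n_test]

# Hypothetical stand-in for the annotated examples
examples = [f"example {i}" for i in range(10)]
train_part, test_part = simple_split(examples)
print(len(train_part), len(test_part))
```

Fixing the seed matters because the test set doubles as the evaluation data: a different shuffle on each run would make scores from separate runs hard to compare.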

In [None]:
# Split data into training (80%) and testing (20%) sets
from sklearn.model_selection import train_test_split
train, test = train_test_split(cv_data, test_size=0.2, random_state=42)

print(f"Training examples: {len(train)}")
print(f"Testing examples: {len(test)}")

# Define output directory for training files
output_dir = '/content/drive/MyDrive/Custom_NER/trained_models/'

# Open the error log; the with-block closes it even if conversion fails
with open(f'{output_dir}train_file.txt', 'w', encoding='utf-8') as file:
    # Convert training data to spaCy format
    print("\nProcessing training data...")
    db = get_spacy_doc(file, train)
    db.to_disk(f'{output_dir}train_data.spacy')

    # Convert test data to spaCy format
    print("Processing test data...")
    db = get_spacy_doc(file, test)
    db.to_disk(f'{output_dir}test_data.spacy')

print("\nData preparation complete!")

## Step 7: Train the NER Model

Now we'll train the model using the spaCy CLI. Training time depends on the dataset size and on whether a GPU is available; on a CPU-only runtime, drop the `--gpu-id 0` flag so that training runs on the CPU.

**Note:** Make sure all paths are consistent with your directory structure.
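
One way to keep the paths consistent is to derive all of them from a single project directory. The sketch below assembles the same `spacy train` arguments used in this step as a list; the helper name `build_train_command` and its `use_gpu` parameter are hypothetical, and only flags that appear in the command in this step are used.

```python
from pathlib import PurePosixPath

def build_train_command(project_dir, use_gpu=True):
    """Assemble the `spacy train` arguments from one project directory."""
    root = PurePosixPath(project_dir)
    models = root / "trained_models"
    cmd = [
        "python", "-m", "spacy", "train",
        str(root / "config" / "config.cfg"),
        "--output", str(models / "output"),
        "--paths.train", str(models / "train_data.spacy"),
        "--paths.dev", str(models / "test_data.spacy"),
    ]
    if use_gpu:
        cmd += ["--gpu-id", "0"]  # leave the flag out to train on the CPU
    return cmd

train_cmd = build_train_command("/content/drive/MyDrive/Custom_NER")
print(" ".join(train_cmd))
```

Outside a notebook, the resulting list can be passed to `subprocess.run(train_cmd, check=True)`.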

In [None]:
# Train the NER model
# Update paths to match your directory structure
!python -m spacy train \
  /content/drive/MyDrive/Custom_NER/config/config.cfg \
  --output /content/drive/MyDrive/Custom_NER/trained_models/output \
  --paths.train /content/drive/MyDrive/Custom_NER/trained_models/train_data.spacy \
  --paths.dev /content/drive/MyDrive/Custom_NER/trained_models/test_data.spacy \
  --gpu-id 0

---

# Inference and Testing

After training is complete, you can use the trained model to extract entities from new text.

## Step 8: Load the Trained Model

Load the best performing model from the training output.
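
`spacy train` writes two directories into the output folder, `model-best` and `model-last`. A small sketch that picks the best model when it exists and falls back to the last one; the helper name `pick_model_dir` is hypothetical, and the demo uses a temporary directory in place of the real output folder.

```python
import tempfile
from pathlib import Path

def pick_model_dir(output_dir):
    """Return model-best if present, else model-last, else raise FileNotFoundError."""
    for name in ("model-best", "model-last"):
        candidate = Path(output_dir) / name
        if candidate.is_dir():
            return candidate
    raise FileNotFoundError(f"No trained model found in {output_dir}")

# Demo with a temporary directory standing in for the training output
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "model-last").mkdir()
    print(pick_model_dir(tmp).name)   # only model-last exists at this point
    (Path(tmp) / "model-best").mkdir()
    print(pick_model_dir(tmp).name)   # model-best is preferred once present
```

The returned path can be converted with `str(...)` and passed to `spacy.load`.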

In [None]:
# Load the trained NER model
# Use the same path as specified in the training step
nlp = spacy.load('/content/drive/MyDrive/Custom_NER/trained_models/output/model-best')

print("Model loaded successfully!")
print(f"Pipeline components: {nlp.pipe_names}")

## Step 9: Run Inference on Sample Text

Test the model by extracting named entities from a sample Spanish text.

In [None]:
# Sample Spanish text for testing
text = """Los ministros de Cultura, Leslie Urteaga, de Comercio Exterior y Turismo, Juan Carlos Mathews y de Ambiente, Albina Ruíz, no llegaron a obtener acuerdos con las organizaciones sociales que se mantienen en huelga por la venta de entradas virtuales a Machu Picchu. La ministra, Leslie Urteaga, indicó que no se ha solicitado una tregua, lo que se comunicó era que la mesa se instalaba siempre y cuando la huelga se paralizaba. La titular del sector, planteó que sea la PCM la instancia que se encargue de la venta de entradas virtuales. El primer ministro, Alberto Otárola, manifestó que debe haber orden y recordó que habrían sanciones penales para los que bloqueen carreteras. Otros países han recomendado no viajar a Cusco. La periodista María Teresa Braschi, indicó que la Embajada de Estados Unidos, también recomendaron no viajar a Cusco."""

# Process the text with the trained model
doc = nlp(text)

# Display extracted entities
print("Extracted Named Entities:\n")
print(f"{'Entity':<40} {'Label':<15}")
print("-" * 55)

for ent in doc.ents:
    print(f"{ent.text:<40} {ent.label_:<15}")

# Display entity counts
from collections import Counter

print(f"\n\nTotal entities found: {len(doc.ents)}")
# Counter tallies how many entities carry each label
entity_counts = Counter(ent.label_ for ent in doc.ents)

print("\nEntity distribution:")
for label, count in entity_counts.items():
    print(f"  {label}: {count}")