# HEL-21 NER Model Training on Google Colab

This notebook trains the Named Entity Recognition model for the HelixGraph project.

**Training Data:**
- 680 training examples
- 170 validation examples
- 8 entity types: SUPPLIER, PRODUCT, CAMPAIGN, CONTRACT, PO, INVOICE, ROLE, SKILL

**Model:** RoBERTa-base transformer with spaCy

**Expected Training Time:**
- CPU: 2-3 hours
- GPU (T4): 30-45 minutes

## Step 1: Mount Google Drive

In [1]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

# Project directory on Google Drive
PROJECT_PATH = '/content/drive/MyDrive/Helixgraph'

# Create project directory if it doesn't exist
!mkdir -p "$PROJECT_PATH"

print("✅ Google Drive mounted")
print(f"📁 Project path: {PROJECT_PATH}")

Mounted at /content/drive
✅ Google Drive mounted
📁 Project path: /content/drive/MyDrive/Helixgraph


## Step 2: Install Dependencies

In [2]:
# Install spaCy with transformers support
!pip install -U spacy[transformers]

# Download spaCy's English transformer pipeline (RoBERTa-based)
!python -m spacy download en_core_web_trf

print("\n✅ Dependencies installed successfully!")

Collecting spacy_transformers<1.4.0,>=1.1.2 (from spacy[transformers])
  Downloading spacy_transformers-1.3.9-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.0 kB)
Collecting transformers<4.50.0,>=3.4.0 (from spacy_transformers<1.4.0,>=1.1.2->spacy[transformers])
  Downloading transformers-4.49.0-py3-none-any.whl.metadata (44 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.0/44.0 kB 2.0 MB/s eta 0:00:00
Collecting spacy-alignments<1.0.0,>=0.7.2 (from spacy_transformers<1.4.0,>=1.1.2->spacy[transformers])
  Downloading spacy_alignments-0.9.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.6 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers<4.50.0,>=3.4.0->spacy_transformers<1.4.0,>=1.1.2->spacy[transformers])
  Downloading tokenizers-0.21.4-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 k
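
Because the install uses `-U`, the resolved versions depend on when the cell is run. Recording them makes a later run easier to reproduce. The following is a minimal sketch using only the standard library; the helper name `installed_versions` and the package list (taken from the pip log above plus `torch` from Step 3) are illustrative assumptions, not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


# Package names are assumptions based on the pip log; adjust as needed.
for name, ver in installed_versions(
    ["spacy", "spacy-transformers", "transformers", "torch"]
).items():
    print(f"{name:20} {ver or 'not installed'}")
```

The printed versions can be copied into a requirements file if the training run needs to be repeated later.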

## Step 3: Check GPU Availability

In [None]:
import torch

# Check if GPU is available
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    print(f"✅ GPU Available: {gpu_name}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    use_gpu = 0
else:
    print("⚠️  No GPU available, will use CPU (slower)")
    print("   To enable GPU: Runtime → Change runtime type → GPU")
    use_gpu = -1

print(f"\n🎯 Training will use: {'GPU' if use_gpu >= 0 else 'CPU'}")

✅ GPU Available: Tesla T4
   Memory: 15.83 GB

🎯 Training will use: GPU
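
If PyTorch cannot be imported yet, the same `--gpu-id` value can probably be derived from `nvidia-smi` instead. This sketch assumes the NVIDIA driver utilities are on `PATH` whenever a GPU runtime is attached; the function name `detect_gpu_id` is hypothetical.

```python
import shutil
import subprocess


def detect_gpu_id():
    """Return 0 if nvidia-smi reports at least one GPU, else -1 (CPU)."""
    if shutil.which("nvidia-smi") is None:
        return -1
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    gpus = [line for line in result.stdout.splitlines() if line.strip()]
    if result.returncode != 0 or not gpus:
        return -1
    for line in gpus:
        print(f"GPU found: {line}")
    return 0


use_gpu = detect_gpu_id()
print(f"Training will use: {'GPU' if use_gpu >= 0 else 'CPU'}")
```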


## Step 4: Verify Training Files

All files are already synced to Google Drive! Let's verify they exist.

In [None]:
# Verify all files are in Google Drive
print("📁 Checking training files...\n")

# Check training data
print("Training Data:")
!ls -lh "$PROJECT_PATH/nlp/training_data/spacy/"

print("\nConfiguration:")
!ls -lh "$PROJECT_PATH/nlp/configs/"

print("\n✅ All files are ready in Google Drive!")

📁 Checking training files...

Training Data:
total 130K
-rw------- 1 root root  30K Nov 21 18:42 dev.spacy
-rw------- 1 root root 100K Nov 22 20:26 train.spacy

Configuration:
total 8.0K
-rw------- 1 root root 7.8K Nov 22 20:01 config.cfg

✅ All files are ready in Google Drive!


## Step 5: Final Verification

Let's double-check all required files are present and ready for training.

In [None]:
import os

# Check all required files
required_files = [
    f"{PROJECT_PATH}/nlp/training_data/spacy/train.spacy",
    f"{PROJECT_PATH}/nlp/training_data/spacy/dev.spacy",
    f"{PROJECT_PATH}/nlp/configs/config.cfg"
]

print("📋 Checking required files:\n")
all_present = True
for filepath in required_files:
    if os.path.exists(filepath):
        size = os.path.getsize(filepath) / 1024
        print(f"✅ {filepath} ({size:.1f} KB)")
    else:
        print(f"❌ {filepath} (MISSING)")
        all_present = False

if all_present:
    print("\n🎉 All files present! Ready to train.")
else:
    print("\n⚠️  Some files are missing. Please upload them first.")

📋 Checking required files:

✅ /content/drive/MyDrive/Helixgraph/nlp/training_data/spacy/train.spacy (99.7 KB)
✅ /content/drive/MyDrive/Helixgraph/nlp/training_data/spacy/dev.spacy (29.4 KB)
✅ /content/drive/MyDrive/Helixgraph/nlp/configs/config.cfg (7.7 KB)

🎉 All files present! Ready to train.
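
The check above only prints a warning, so a "Run all" would still proceed to training with files missing. One option is to raise instead. This is a sketch, not part of the original notebook; `require_files` is a hypothetical helper, and treating zero-byte files as missing is an added assumption.

```python
from pathlib import Path


def require_files(paths):
    """Raise FileNotFoundError listing every missing or empty file."""
    problems = [
        str(p) for p in map(Path, paths)
        if not p.is_file() or p.stat().st_size == 0
    ]
    if problems:
        raise FileNotFoundError("Missing or empty: " + ", ".join(problems))
    return True


# In the notebook, reuse the list defined in the cell above:
# require_files(required_files)
```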


## Step 6: Start Training! 🚀

This will take 30-45 minutes on GPU, or 2-3 hours on CPU.

In [None]:
# Change to project directory
os.chdir(PROJECT_PATH)

# Create output directory
!mkdir -p nlp/models/ner_model

# Start training
print("🚀 Starting NER model training...\n")
print(f"   Using: {'GPU' if use_gpu >= 0 else 'CPU'}")
print(f"   Config: {PROJECT_PATH}/nlp/configs/config.cfg")
print(f"   Output: {PROJECT_PATH}/nlp/models/ner_model\n")
print("=" * 60)

!python -m spacy train \
    nlp/configs/config.cfg \
    --output nlp/models/ner_model \
    --paths.train nlp/training_data/spacy/train.spacy \
    --paths.dev nlp/training_data/spacy/dev.spacy \
    --gpu-id $use_gpu

print("\n" + "=" * 60)
print("✅ Training completed!")

🚀 Starting NER model training...

   Using: GPU
   Config: /content/drive/MyDrive/Helixgraph/nlp/configs/config.cfg
   Output: /content/drive/MyDrive/Helixgraph/nlp/models/ner_model

ℹ Saving to output directory: nlp/models/ner_model
ℹ Using GPU: 0
tokenizer_config.json: 100% 25.0/25.0 [00:00<00:00, 135kB/s]
config.json: 100% 481/481 [00:00<00:00, 3.78MB/s]
vocab.json: 100% 899k/899k [00:00<00:00, 4.10MB/s]
merges.txt: 100% 456k/456k [00:00<00:00, 2.13MB/s]
tokenizer.json: 100% 1.36M/1.36M [00:00<00:00, 3.21MB/s]
2025-11-22 21:00:49.817601: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1763845249.849842    5240 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1763845249.859786    5240 cuda_blas.cc:1407] U
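
The training cell prints "Training completed!" even if `spacy train` exits with an error, because the `!` shell escape does not raise on a non-zero exit status. A possible alternative is to run the same command through `subprocess` and check the result. In this sketch, `build_train_command` and `run_training` are hypothetical helper names; the flags mirror the cell above.

```python
import subprocess
import sys


def build_train_command(config, output, train, dev, gpu_id):
    """Assemble the same `spacy train` invocation used in the cell above."""
    return [
        sys.executable, "-m", "spacy", "train", config,
        "--output", output,
        "--paths.train", train,
        "--paths.dev", dev,
        "--gpu-id", str(gpu_id),
    ]


def run_training(cmd):
    """Run the command; raises CalledProcessError on a non-zero exit."""
    subprocess.run(cmd, check=True)
    print("Training completed!")


cmd = build_train_command(
    "nlp/configs/config.cfg",
    "nlp/models/ner_model",
    "nlp/training_data/spacy/train.spacy",
    "nlp/training_data/spacy/dev.spacy",
    gpu_id=-1,  # placeholder; pass use_gpu from Step 3 in the notebook
)
print(" ".join(cmd))
# run_training(cmd)  # uncomment inside the notebook to start training
```

With `check=True`, a failed run stops the notebook at this cell instead of continuing to evaluation with a stale or missing `model-best`.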

## Step 7: Evaluate Model

In [None]:
print("📊 Evaluating model on dev set...\n")

!python -m spacy evaluate \
    nlp/models/ner_model/model-best \
    nlp/training_data/spacy/dev.spacy \
    --gpu-id $use_gpu

print("\n✅ Evaluation complete!")

📊 Evaluating model on dev set...

ℹ Using GPU: 0
2025-11-22 22:05:26.572554: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1763849126.605692   21600 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1763849126.616036   21600 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1763849126.640238   21600 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1763849126.640279   21600 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:17638491

## Step 8: Test Model with Sample Inputs

In [None]:
import spacy
from spacy import displacy

# Load trained model
print("📦 Loading trained model...")
nlp = spacy.load(f"{PROJECT_PATH}/nlp/models/ner_model/model-best")
print("✅ Model loaded!\n")

# Test sentences
test_sentences = [
    "The Marketing Coordinator managed the Nike Summer Sale campaign successfully.",
    "Invoice INV-123456 from Tech Suppliers Ltd was paid via PO-789012 last month.",
    "Our Software Engineer with Python expertise led the campaign optimization project.",
    "Contract CTR-445566 with Global Procurement Inc covers delivery of office supplies."
]

print("🧪 Testing model with sample sentences:\n")
print("=" * 80)

for i, text in enumerate(test_sentences, 1):
    doc = nlp(text)

    print(f"\n{i}. {text}")
    print(f"   Entities found: {len(doc.ents)}")

    if doc.ents:
        for ent in doc.ents:
            print(f"     - [{ent.label_}] '{ent.text}'")
    else:
        print("     (No entities detected)")

    print("-" * 80)

print("\n✅ Model testing complete!")

📦 Loading trained model...
✅ Model loaded!

🧪 Testing model with sample sentences:


1. The Marketing Coordinator managed the Nike Summer Sale campaign successfully.
   Entities found: 2
     - [ROLE] 'Marketing Coordinator'
     - [CAMPAIGN] 'Nike Summer Sale'
--------------------------------------------------------------------------------

2. Invoice INV-123456 from Tech Suppliers Ltd was paid via PO-789012 last month.
   Entities found: 3
     - [INVOICE] 'INV-123456'
     - [SUPPLIER] 'Tech Suppliers Ltd'
     - [PO] 'PO-789012'
--------------------------------------------------------------------------------

3. Our Software Engineer with Python expertise led the campaign optimization project.
   Entities found: 2
     - [ROLE] 'Software Engineer'
     - [SKILL] 'Python'
--------------------------------------------------------------------------------

4. Contract CTR-445566 with Global Procurement Inc covers delivery of office supplies.
   Entities found: 2
     - [CONTRACT

## Step 9: Visualize Entity Recognition

In [None]:
# Visualize entities in a sample sentence
sample_text = "The Product Manager with SQL and Leadership skills managed the Apple iPhone Launch campaign and approved invoice INV-998877 from Tech Solutions via PO-112233."

doc = nlp(sample_text)

print("🎨 Entity Visualization:\n")
displacy.render(doc, style="ent", jupyter=True)

print("\n📋 Detected Entities:")
for ent in doc.ents:
    print(f"   [{ent.label_:12}] {ent.text}")

🎨 Entity Visualization:

📋 Detected Entities:
   [ROLE        ] Product Manager
   [SKILL       ] SQL and Leadership
   [CAMPAIGN    ] Apple iPhone Launch
   [INVOICE     ] INV-998877
   [SUPPLIER    ] Tech Solutions
   [PO          ] PO-112233
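
Downstream code, such as the FastAPI integration listed under Next Steps, will usually want entities as plain data instead of spaCy objects. A minimal sketch follows; `entities_to_dicts` is a hypothetical helper that relies only on the `text`, `label_`, `start_char`, and `end_char` attributes of spaCy spans. The stand-in object exists only so the example runs without the trained model.

```python
from types import SimpleNamespace


def entities_to_dicts(doc):
    """Convert doc.ents into JSON-serializable dictionaries."""
    return [
        {
            "text": ent.text,
            "label": ent.label_,
            "start": ent.start_char,
            "end": ent.end_char,
        }
        for ent in doc.ents
    ]


# Stand-in for a spaCy Doc (an assumption for illustration);
# in the notebook, pass the real `doc = nlp(sample_text)` instead.
text = "Invoice INV-998877 from Tech Solutions"
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="INV-998877", label_="INVOICE", start_char=8, end_char=18),
    SimpleNamespace(text="Tech Solutions", label_="SUPPLIER", start_char=24, end_char=38),
])
for item in entities_to_dicts(fake_doc):
    print(item)
```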


## Step 10: View Training Metrics

In [None]:
import json
import os

# Check if training metrics file exists
metrics_file = f"{PROJECT_PATH}/nlp/models/ner_model/model-best/meta.json"

if os.path.exists(metrics_file):
    with open(metrics_file, 'r') as f:
        meta = json.load(f)

    print("📊 Training Metrics:")
    print("=" * 60)

    if 'performance' in meta:
        perf = meta['performance']
        print("\n🎯 Overall Performance:")
        for key, value in perf.items():
            if isinstance(value, float):
                print(f"   {key:20} : {value:.4f}")

    print("\n" + "=" * 60)
    print(f"✅ Model saved to: {PROJECT_PATH}/nlp/models/ner_model/model-best")
else:
    print("⚠️  Metrics file not found")

📊 Training Metrics:

🎯 Overall Performance:
   ents_f               : 0.9979
   ents_p               : 0.9979
   ents_r               : 0.9979
   transformer_loss     : 1039.9667
   ner_loss             : 616.7777

✅ Model saved to: /content/drive/MyDrive/Helixgraph/nlp/models/ner_model/model-best
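
The loop above prints only top-level float values, so the per-label breakdown is skipped. spaCy stores this breakdown as a nested dictionary under `performance["ents_per_type"]`. The sketch below formats that dictionary one label per line; `format_per_type` is a hypothetical helper, and the numbers in `example_performance` are placeholders for illustration, not results from this run.

```python
def format_per_type(performance):
    """Return one formatted line per entity label from a performance dict."""
    per_type = performance.get("ents_per_type") or {}
    lines = []
    for label in sorted(per_type):
        scores = per_type[label]
        lines.append(
            f"{label:10} P={scores['p']:.4f} R={scores['r']:.4f} F={scores['f']:.4f}"
        )
    return lines


# Placeholder numbers for illustration only; NOT results from this training run.
example_performance = {
    "ents_f": 0.5,
    "ents_per_type": {
        "SUPPLIER": {"p": 0.5, "r": 0.25, "f": 0.3333},
        "PO": {"p": 1.0, "r": 1.0, "f": 1.0},
    },
}
for line in format_per_type(example_performance):
    print(line)
```

In the notebook, calling `format_per_type(meta["performance"])` after loading `meta.json` would show whether any of the 8 entity types lags behind the overall score.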


## Step 11: Download Trained Model (Optional)

Download the trained model to your local machine for later use.

In [None]:
from google.colab import files

# Create a zip file of the trained model
model_path = f"{PROJECT_PATH}/nlp/models/ner_model/model-best"
zip_path = f"{PROJECT_PATH}/ner_model_trained.zip"

print("📦 Creating zip file...")
!cd "$PROJECT_PATH" && zip -r ner_model_trained.zip nlp/models/ner_model/model-best/

print("\n📥 Download the model:")
files.download(zip_path)

print("✅ Model downloaded!")

📦 Creating zip file...
updating: nlp/models/ner_model/model-best/ (stored 0%)
updating: nlp/models/ner_model/model-best/tokenizer (deflated 81%)
updating: nlp/models/ner_model/model-best/meta.json (deflated 69%)
updating: nlp/models/ner_model/model-best/config.cfg (deflated 62%)
updating: nlp/models/ner_model/model-best/transformer/ (stored 0%)
updating: nlp/models/ner_model/model-best/transformer/cfg (stored 0%)
updating: nlp/models/ner_model/model-best/transformer/model (deflated 15%)
updating: nlp/models/ner_model/model-best/ner/ (stored 0%)
updating: nlp/models/ner_model/model-best/ner/model (deflated 8%)
updating: nlp/models/ner_model/model-best/ner/moves (deflated 68%)
updating: nlp/models/ner_model/model-best/ner/cfg (deflated 33%)
updating: nlp/models/ner_model/model-best/vocab/ (stored 0%)
updating: nlp/models/ner_model/model-best/vocab/strings.json (deflated 72%)
updating: nlp/models/ner_model/model-best/vocab/vectors (deflated 45%)
updating: nlp/models/ner_model/model-bes

✅ Model downloaded!
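
The `updating:` lines in the log show that `ner_model_trained.zip` already existed, so `zip -r` updated it in place; entries from an older run that are no longer on disk would remain in the archive. One way to always get a fresh archive is `shutil.make_archive`, which rewrites the file. This is a sketch; `zip_model` and its default arguments are hypothetical and simply mirror the paths used above.

```python
import shutil
from pathlib import Path


def zip_model(project_path, model_rel="nlp/models/ner_model/model-best",
              archive_name="ner_model_trained"):
    """Write <project_path>/<archive_name>.zip containing model_rel; return its path."""
    project = Path(project_path)
    if not (project / model_rel).is_dir():
        raise FileNotFoundError(f"No model directory at {project / model_rel}")
    return shutil.make_archive(
        str(project / archive_name),  # base name; ".zip" is appended
        "zip",
        root_dir=str(project),
        base_dir=model_rel,  # keeps the nlp/models/... prefix inside the zip
    )
```

In Colab, the returned path can be passed straight to the download call, for example `files.download(zip_model(PROJECT_PATH))`.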


## Summary

### ✅ Training Complete!

Your NER model has been trained successfully on Google Colab.

**Model Location:** `{PROJECT_PATH}/nlp/models/ner_model/model-best`

**Next Steps:**
1. Review evaluation metrics above
2. Test with your own sentences
3. Download model for local use
4. Integrate into FastAPI (Phase 4)

**Model Capabilities:**
- Recognizes 8 entity types across 3 business domains
- Cross-domain entity recognition
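- Based on the RoBERTa-base transformer
- Scored 0.998 F1 (`ents_f`) on the 170-example validation set; test on real-world text before relying on it in production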