# HEL-21 NER Model Training on Google Colab

This notebook trains the Named Entity Recognition model for the HelixGraph project.

**Training Data:**
- 680 training examples
- 170 validation examples
- 8 entity types: SUPPLIER, PRODUCT, CAMPAIGN, CONTRACT, PO, INVOICE, ROLE, SKILL
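
The counts above correspond to an 80/20 split of 850 annotated examples (680 / 850 = 0.8). A minimal sketch of how such a split could be produced is shown below. The `split_examples` helper, the fixed seed, and the placeholder data are illustrative assumptions, not the project's actual preprocessing code.

```python
import random

def split_examples(examples, train_fraction=0.8, seed=42):
    """Shuffle a copy of `examples` and split it into (train, dev) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cutoff = round(len(shuffled) * train_fraction)
    return shuffled[:cutoff], shuffled[cutoff:]

# Placeholder strings standing in for 850 annotated sentences (hypothetical).
examples = [f"example-{i}" for i in range(850)]
train, dev = split_examples(examples)
print(len(train), len(dev))  # 680 170
```

Fixing the seed keeps the split reproducible between runs, which matters when comparing scores across training attempts.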

**Model:** RoBERTa-base transformer with spaCy

**Expected Training Time:** 
- CPU: 2-3 hours
- GPU (T4): 30-45 minutes

## Step 1: Mount Google Drive

In [None]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

# Navigate to project directory
PROJECT_PATH = '/content/drive/MyDrive/Helixgraph'

# Create project directory if it doesn't exist
!mkdir -p "$PROJECT_PATH"

print("✅ Google Drive mounted")
print(f"📁 Project path: {PROJECT_PATH}")

## Step 2: Install Dependencies

In [None]:
# Install spaCy with transformers support (quoted so the shell does not glob the brackets)
!pip install -U "spacy[transformers]"

# Download spaCy's English transformer pipeline (built on RoBERTa-base)
!python -m spacy download en_core_web_trf

print("\n✅ Dependencies installed successfully!")
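
Because `pip` can finish with warnings that are easy to miss, it may help to confirm which versions actually ended up in the runtime. The sketch below uses only the standard library; the `installed_version` helper is a hypothetical name, and the package list is an assumption about what this notebook depends on.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# Distribution names assumed to be relevant for this notebook.
for name in ("spacy", "spacy-transformers", "torch"):
    version = installed_version(name)
    print(f"{name:20} {version or 'NOT INSTALLED'}")
```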

## Step 3: Check GPU Availability

In [None]:
import torch

# Check if GPU is available
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    print(f"✅ GPU Available: {gpu_name}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    use_gpu = 0
else:
    print("⚠️  No GPU available, will use CPU (slower)")
    print("   To enable GPU: Runtime → Change runtime type → GPU")
    use_gpu = -1

print(f"\n🎯 Training will use: {'GPU' if use_gpu >= 0 else 'CPU'}")
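
As a cross-check that does not depend on PyTorch, the driver tool `nvidia-smi` can be queried directly. This is a rough sketch: `detect_gpu_id` is a hypothetical helper, and it assumes that a working `nvidia-smi --list-gpus` call implies a GPU usable by the training command. It returns the same values as `use_gpu` above (`0` for the first GPU, `-1` for CPU).

```python
import shutil
import subprocess

def detect_gpu_id():
    """Return 0 if `nvidia-smi` lists at least one GPU, otherwise -1 (CPU)."""
    if shutil.which("nvidia-smi") is None:
        return -1  # Tool not on PATH, so assume no NVIDIA GPU is attached.
    result = subprocess.run(
        ["nvidia-smi", "--list-gpus"], capture_output=True, text=True
    )
    has_gpu = result.returncode == 0 and result.stdout.strip() != ""
    return 0 if has_gpu else -1

print(f"Detected GPU id: {detect_gpu_id()}")
```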

## Step 4: Verify Training Files

The training files should already be synced to Google Drive. List the directories to confirm they are there.

In [None]:
# List the training files stored in Google Drive
print("📁 Checking training files...\n")

# Check training data
print("Training Data:")
!ls -lh "$PROJECT_PATH/nlp/training_data/spacy/"

print("\nConfiguration:")
!ls -lh "$PROJECT_PATH/nlp/configs/"

print("\n✅ Listing complete. Confirm train.spacy, dev.spacy, and config.cfg appear above.")

## Step 5: Final Verification

Let's double-check all required files are present and ready for training.

In [None]:
import os

# Check all required files
required_files = [
    f"{PROJECT_PATH}/nlp/training_data/spacy/train.spacy",
    f"{PROJECT_PATH}/nlp/training_data/spacy/dev.spacy",
    f"{PROJECT_PATH}/nlp/configs/config.cfg"
]

print("📋 Checking required files:\n")
all_present = True
for filepath in required_files:
    if os.path.exists(filepath):
        size = os.path.getsize(filepath) / 1024
        print(f"✅ {filepath} ({size:.1f} KB)")
    else:
        print(f"❌ {filepath} (MISSING)")
        all_present = False

if all_present:
    print("\n🎉 All files present! Ready to train.")
else:
    print("\n⚠️  Some files are missing. Please upload them first.")

## Step 6: Start Training! 🚀

Expect roughly 30-45 minutes on a T4 GPU, or 2-3 hours on CPU.

In [None]:
# Change to project directory
os.chdir(PROJECT_PATH)

# Create output directory
!mkdir -p nlp/models/ner_model

# Start training
print("🚀 Starting NER model training...\n")
print(f"   Using: {'GPU' if use_gpu >= 0 else 'CPU'}")
print(f"   Config: {PROJECT_PATH}/nlp/configs/config.cfg")
print(f"   Output: {PROJECT_PATH}/nlp/models/ner_model\n")
print("=" * 60)

!python -m spacy train \
    nlp/configs/config.cfg \
    --output nlp/models/ner_model \
    --paths.train nlp/training_data/spacy/train.spacy \
    --paths.dev nlp/training_data/spacy/dev.spacy \
    --gpu-id $use_gpu

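print("\n" + "=" * 60)
# This line prints even if the command above failed, so check the log for errors
print("✅ Training run finished. Check the log above for errors.")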
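
The `!` shell syntax does not stop the cell when the command fails, so the final message prints either way. One way to make the success message conditional is to launch the same command from Python and inspect its exit code. The sketch below is an assumption about how that could look: `run_and_stream` and `build_train_command` are hypothetical helpers, and the arguments simply mirror the cell above.

```python
import subprocess
import sys

def run_and_stream(args):
    """Run a command, echo its combined output line by line, return the exit code."""
    process = subprocess.Popen(
        args, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    for line in process.stdout:
        print(line, end="")
    return process.wait()

def build_train_command(gpu_id):
    """Assemble the `spacy train` invocation with the paths used in this notebook."""
    return [
        sys.executable, "-m", "spacy", "train",
        "nlp/configs/config.cfg",
        "--output", "nlp/models/ner_model",
        "--paths.train", "nlp/training_data/spacy/train.spacy",
        "--paths.dev", "nlp/training_data/spacy/dev.spacy",
        "--gpu-id", str(gpu_id),
    ]

# Usage inside the notebook (left commented out so this cell has no side effects):
# exit_code = run_and_stream(build_train_command(use_gpu))
# print("Training completed" if exit_code == 0 else f"Training failed ({exit_code})")
```

Passing the arguments as a list avoids shell quoting issues, and using `sys.executable` keeps the subprocess on the same Python interpreter as the notebook kernel.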

## Step 7: Evaluate Model

In [None]:
print("📊 Evaluating model on dev set...\n")

!python -m spacy evaluate \
    nlp/models/ner_model/model-best \
    nlp/training_data/spacy/dev.spacy \
    --gpu-id $use_gpu

print("\n✅ Evaluation complete!")

## Step 8: Test Model with Sample Inputs

In [None]:
import spacy
from spacy import displacy

# Load trained model
print("📦 Loading trained model...")
nlp = spacy.load(f"{PROJECT_PATH}/nlp/models/ner_model/model-best")
print("✅ Model loaded!\n")

# Test sentences
test_sentences = [
    "The Marketing Coordinator managed the Nike Summer Sale campaign successfully.",
    "Invoice INV-123456 from Tech Suppliers Ltd was paid via PO-789012 last month.",
    "Our Software Engineer with Python expertise led the campaign optimization project.",
    "Contract CTR-445566 with Global Procurement Inc covers delivery of office supplies."
]

print("🧪 Testing model with sample sentences:\n")
print("=" * 80)

for i, text in enumerate(test_sentences, 1):
    doc = nlp(text)
    
    print(f"\n{i}. {text}")
    print(f"   Entities found: {len(doc.ents)}")
    
    if doc.ents:
        for ent in doc.ents:
            print(f"     - [{ent.label_}] '{ent.text}'")
    else:
        print("     (No entities detected)")
    
    print("-" * 80)

print("\n✅ Model testing complete!")
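
Reading the printed entities by eye works for four sentences but does not scale. A small sketch of turning the spot check into something comparable against expectations follows. The `group_by_label` helper is hypothetical, and the entity list is written by hand for illustration; it is not output from the trained model.

```python
from collections import defaultdict

def group_by_label(entities):
    """Group (label, text) pairs into {label: [text, ...]}, keeping input order."""
    grouped = defaultdict(list)
    for label, text in entities:
        grouped[label].append(text)
    return dict(grouped)

# Hand-written stand-in for model predictions (illustrative only).
# With a loaded model: entities = [(ent.label_, ent.text) for ent in doc.ents]
entities = [
    ("INVOICE", "INV-123456"),
    ("SUPPLIER", "Tech Suppliers Ltd"),
    ("PO", "PO-789012"),
]
expected_labels = {"INVOICE", "SUPPLIER", "PO"}

grouped = group_by_label(entities)
missing = expected_labels - set(grouped)
print(grouped)
print("Missing labels:", sorted(missing))
```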

## Step 9: Visualize Entity Recognition

In [None]:
# Visualize entities in a sample sentence
sample_text = "The Product Manager with SQL and Leadership skills managed the Apple iPhone Launch campaign and approved invoice INV-998877 from Tech Solutions via PO-112233."

doc = nlp(sample_text)

print("🎨 Entity Visualization:\n")
displacy.render(doc, style="ent", jupyter=True)

print("\n📋 Detected Entities:")
for ent in doc.ents:
    print(f"   [{ent.label_:12}] {ent.text}")

## Step 10: View Training Metrics

In [None]:
import json

# Check if training metrics file exists
metrics_file = f"{PROJECT_PATH}/nlp/models/ner_model/model-best/meta.json"

if os.path.exists(metrics_file):
    with open(metrics_file, 'r', encoding='utf-8') as f:
        meta = json.load(f)
    
    print("📊 Training Metrics:")
    print("=" * 60)
    
    if 'performance' in meta:
        perf = meta['performance']
        print("\n🎯 Overall Performance:")
        for key, value in perf.items():
            # Only scalar scores are printed; nested per-type entries are skipped
            if isinstance(value, float):
                print(f"   {key:20} : {value:.4f}")
    
    print("\n" + "=" * 60)
    print(f"✅ Model saved to: {PROJECT_PATH}/nlp/models/ner_model/model-best")
else:
    print("⚠️  Metrics file not found")
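
The loop above prints only float values, so any nested per-label scores are skipped. The sketch below shows one way to print them, assuming the `performance` section contains an `ents_per_type` mapping with `p`, `r`, and `f` keys per label. The `format_per_type` helper is hypothetical, and the sample numbers are made up purely to show the structure; they are not results from this model.

```python
def format_per_type(performance):
    """Return one formatted line per entity label found under `ents_per_type`."""
    per_type = performance.get("ents_per_type") or {}
    lines = []
    for label, scores in sorted(per_type.items()):
        lines.append(
            f"{label:10} P={scores.get('p', 0.0):.4f} "
            f"R={scores.get('r', 0.0):.4f} F={scores.get('f', 0.0):.4f}"
        )
    return lines

# Placeholder values for illustration only (not real evaluation results).
sample_performance = {
    "ents_f": 0.5,
    "ents_per_type": {
        "SUPPLIER": {"p": 0.5, "r": 0.25, "f": 0.3333},
        "PO": {"p": 1.0, "r": 1.0, "f": 1.0},
    },
}
for line in format_per_type(sample_performance):
    print(line)

# In the notebook, pass the loaded metadata instead:
# for line in format_per_type(meta.get("performance", {})): print(line)
```

Per-label scores are useful with a small dataset like this one, since a label with few examples can score poorly while the overall F-score still looks acceptable.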

## Step 11: Download Trained Model (Optional)

Download the trained model to your local machine for later use.

In [None]:
from google.colab import files

# Create a zip file of the trained model
zip_path = f"{PROJECT_PATH}/ner_model_trained.zip"

print("📦 Creating zip file...")
!cd "$PROJECT_PATH" && zip -r ner_model_trained.zip nlp/models/ner_model/model-best/

# Trigger a browser download of the archive
print("\n📥 Downloading the model...")
files.download(zip_path)

print("✅ Download started!")
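
If the `zip` command is unavailable, or the archive should be built from Python so that a missing model directory raises an error, the standard library can do the same job. This is a sketch under that assumption; `zip_directory` is a hypothetical helper and is not part of the original notebook.

```python
import shutil
from pathlib import Path

def zip_directory(source_dir, archive_stem):
    """Zip `source_dir` into `<archive_stem>.zip` and return the archive path."""
    source = Path(source_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"Model directory not found: {source}")
    # root_dir/base_dir keep the top-level folder name inside the archive.
    return shutil.make_archive(
        str(archive_stem), "zip",
        root_dir=str(source.parent), base_dir=source.name,
    )

# Usage inside the notebook (commented out; requires the trained model on Drive):
# archive = zip_directory(
#     f"{PROJECT_PATH}/nlp/models/ner_model/model-best",
#     f"{PROJECT_PATH}/ner_model_trained",
# )
# files.download(archive)
```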

## Summary

### ✅ Training Complete!

Your NER model has been trained successfully on Google Colab.

**Model Location:** `/content/drive/MyDrive/Helixgraph/nlp/models/ner_model/model-best`

**Next Steps:**
1. Review evaluation metrics above
2. Test with your own sentences
3. Download model for local use
4. Integrate into FastAPI (Phase 4)

**Model Capabilities:**
- Recognizes 8 entity types across 3 business domains
- Cross-domain entity recognition
- Based on RoBERTa transformer
- Ready for integration once the evaluation metrics above meet your requirements