# Ingredient NER Model Training

This notebook trains a spaCy transformer-based Named Entity Recognition (NER) model to extract ingredients from raw text.

## Overview

The `run_ingredient_ner.py` script trains a custom NER model using:
- **Transformer backbone**: DistilBERT (configurable)
- **Training data**: Normalized ingredient lists from the preprocessing pipeline
- **Output**: Trained spaCy model saved to `models/ingredient_ner_trf/model-best/`

## Training Pipeline

1. **Data Preparation**: Converts ingredient lists into spaCy's `DocBin` format for training
2. **Train/Validation Split**: Holds out a fraction of the data (`valid_fraction`) for validation
3. **Model Training**: Trains the transformer-based NER model with early stopping
4. **Model Evaluation**: Evaluates on the held-out validation set
5. **Model Export**: Saves the best model checkpoint
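
Steps 1 and 2 can be sketched in plain Python. This is a minimal illustration, not the code in `run_ingredient_ner.py`: the function names, the `", "` separator, and the default `valid_fraction` are assumptions. Each `(start, end, label)` tuple could then be turned into an entity span with spaCy's `Doc.char_span` and the resulting docs collected in a `DocBin`.

```python
import random


def build_annotated_text(ingredients, label="INGREDIENT", sep=", "):
    """Join ingredients into one string and record each one's character span."""
    parts, spans, cursor = [], [], 0
    for item in ingredients:
        item = item.strip()
        if not item:
            continue  # skip empty entries
        if parts:
            cursor += len(sep)  # account for the separator before this item
        spans.append((cursor, cursor + len(item), label))
        parts.append(item)
        cursor += len(item)
    return sep.join(parts), spans


def train_valid_split(examples, valid_fraction=0.1, seed=42):
    """Shuffle deterministically and hold out a fraction for validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * valid_fraction)) if shuffled else 0
    return shuffled[n_valid:], shuffled[:n_valid]


text, spans = build_annotated_text(["flour", "olive oil", "eggs"])
print(text)   # flour, olive oil, eggs
print(spans)  # [(0, 5, 'INGREDIENT'), (7, 16, 'INGREDIENT'), (18, 22, 'INGREDIENT')]
```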

## Key Features

- Uses pre-normalized ingredients from the normalization pipeline
- Applies a deduplication map to keep ingredient forms consistent
- Supports transformer models (DistilBERT, BERT, etc.)
- Includes early stopping to prevent overfitting
- Cleans up intermediate training artifacts after completion
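
To illustrate what applying a deduplication map might look like, here is a small sketch. The map's format (variant spelling to canonical form), the lowercasing, and the function name `apply_dedup_map` are assumptions for illustration; the pipeline's actual map may be structured differently.

```python
def apply_dedup_map(ingredients, dedup_map):
    """Map each ingredient to its canonical form, dropping repeats in order."""
    seen, result = set(), []
    for item in ingredients:
        key = item.strip().lower()
        canonical = dedup_map.get(key, key)  # unknown items pass through unchanged
        if canonical and canonical not in seen:
            seen.add(canonical)
            result.append(canonical)
    return result


# Hypothetical map: variant spelling -> canonical form
dedup_map = {"scallions": "green onion", "green onions": "green onion"}
print(apply_dedup_map(["Scallions", "green onions", "garlic"], dedup_map))
# ['green onion', 'garlic']
```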


In [None]:
# Setup: Add pipeline to path
import sys
from pathlib import Path

# Add pipeline directory to path
pipeline_root = Path.cwd().parent.parent / "pipeline"
if str(pipeline_root) not in sys.path:
    sys.path.insert(0, str(pipeline_root))

print(f"Pipeline root: {pipeline_root}")
print(f"Pipeline root exists: {pipeline_root.exists()}")


## Step 1: Configure Training

Load the training configuration file. This specifies:
- Training data path
- Model architecture (transformer model, window size, etc.)
- Training hyperparameters (learning rate, epochs, batch size)
- Output directory for the trained model
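
Before loading the file, it can help to check that the keys this notebook reads are present. The sketch below assumes the config has a top-level `ner` section (as the next cell does); which keys count as required is an assumption made here, not something the training script enforces.

```python
# Keys this notebook reads from the `ner` section; treating them as
# required is an assumption for this sketch.
REQUIRED_KEYS = ("train_path", "ner_list_col", "transformer_model", "model_dir")


def missing_ner_keys(config):
    """Return the required `ner` keys that are absent or empty."""
    ner_cfg = (config or {}).get("ner") or {}
    return [key for key in REQUIRED_KEYS if ner_cfg.get(key) in (None, "")]


# Hypothetical, incomplete config for illustration
example = {"ner": {"train_path": "data.parquet", "n_epochs": 10}}
print(missing_ner_keys(example))
# ['ner_list_col', 'transformer_model', 'model_dir']
```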


In [None]:
import yaml
from pathlib import Path

# Configuration file
config_path = Path("./pipeline/config/ingredient_ner.yaml")

# Load and display config
if config_path.exists():
    with open(config_path, 'r') as f:
        config = yaml.safe_load(f) or {}  # An empty file loads as None
    
    print("Training configuration loaded:")
    ner_cfg = config.get('ner', {})
    
    print(f"\n  Data:")
    print(f"    - Train path: {ner_cfg.get('train_path', 'N/A')}")
    print(f"    - NER list column: {ner_cfg.get('ner_list_col', 'N/A')}")
    print(f"    - Max rows: {ner_cfg.get('max_rows', 'all')}")
    
    print(f"\n  Model:")
    print(f"    - Transformer: {ner_cfg.get('transformer_model', 'N/A')}")
    print(f"    - Window size: {ner_cfg.get('window', 'N/A')}")
    print(f"    - Stride: {ner_cfg.get('stride', 'N/A')}")
    
    print(f"\n  Training:")
    print(f"    - Epochs: {ner_cfg.get('n_epochs', 'N/A')}")
    print(f"    - Learning rate: {ner_cfg.get('lr', 'N/A')}")
    print(f"    - Batch size: {ner_cfg.get('batch_size', 'N/A')}")
    print(f"    - Validation fraction: {ner_cfg.get('valid_fraction', 'N/A')}")
    print(f"    - Early stopping patience: {ner_cfg.get('early_stopping_patience', 'N/A')}")
    
    print(f"\n  Output:")
    print(f"    - Model directory: {ner_cfg.get('model_dir', 'N/A')}")
else:
    print(f"Config file not found: {config_path}")
    ner_cfg = {}  # Later cells fall back to their built-in defaults


## Step 2: Check Training Data

Verify that the training data exists and is in the expected format before starting training.
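
A format check could look like the sketch below. It assumes the expected format is one non-empty sequence of non-empty strings per row; that definition and the function name `check_ingredient_rows` are assumptions, so adjust them if the pipeline allows empty lists.

```python
def check_ingredient_rows(rows):
    """Return indices of rows that are not non-empty sequences of non-empty strings."""
    bad = []
    for idx, row in enumerate(rows):
        # A bare string or a non-iterable value (e.g. None) is not a list of ingredients
        if isinstance(row, str) or not hasattr(row, "__iter__"):
            bad.append(idx)
            continue
        items = list(row)
        if not items or not all(isinstance(x, str) and x.strip() for x in items):
            bad.append(idx)
    return bad


rows = [["flour", "salt"], "flour, salt", [], ["eggs", ""], None]
print(check_ingredient_rows(rows))  # [1, 2, 3, 4]
```

In this notebook the same function could be called on the sampled column, for example `check_ingredient_rows(df_sample[ner_col])`.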


In [None]:
import pyarrow.parquet as pq
from pathlib import Path

# Check training data
train_path = Path(ner_cfg.get('train_path', './data/normalized/recipes_data_clean.parquet'))
ner_col = ner_cfg.get('ner_list_col', 'NER_clean')

if train_path.exists():
    print(f"✓ Training data found: {train_path}")
    
    # Load a small sample to inspect. pandas.read_parquet has no nrows
    # argument, so read only the first record batch through pyarrow.
    first_batch = next(pq.ParquetFile(train_path).iter_batches(batch_size=10))
    df_sample = first_batch.to_pandas()
    print(f"  Sample shape: {df_sample.shape}")
    print(f"  Columns: {list(df_sample.columns)}")
    
    if ner_col in df_sample.columns:
        print(f"\n  Sample {ner_col} data:")
        for idx in range(min(5, len(df_sample))):
            ingredients = df_sample[ner_col].iloc[idx]
            print(f"    Row {idx}: {ingredients}")
    else:
        print(f"\n  ✗ Column '{ner_col}' not found in dataset")
        print(f"    Available columns: {list(df_sample.columns)}")
else:
    print(f"✗ Training data not found: {train_path}")
    print("  Make sure you've run the normalization pipeline first!")


## Step 3: Run Training

Execute the training script. This will:
1. Prepare training data in spaCy DocBin format
2. Split into train/validation sets
3. Train the transformer-based NER model
4. Save the best model checkpoint

**Note**: Training can take a significant amount of time depending on:
- Dataset size
- Number of epochs
- Transformer model size
- Hardware (CPU vs GPU)

The script will show progress and validation metrics during training.
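
For readers unfamiliar with early stopping, the idea can be sketched as a patience counter over validation scores. This is a generic illustration, not the logic inside `run_ingredient_ner.py`; the function name and the example scores are made up.

```python
def early_stopping_point(scores, patience):
    """Return (best_step, best_score, stopped_step) for a list of validation scores.

    Training stops once `patience` evaluations have passed without a new best score.
    """
    best_score, best_step = float("-inf"), -1
    for step, score in enumerate(scores):
        if score > best_score:
            best_score, best_step = score, step
        elif step - best_step >= patience:
            return best_step, best_score, step
    return best_step, best_score, len(scores) - 1


# Hypothetical validation F-scores, one per evaluation
scores = [0.60, 0.72, 0.75, 0.74, 0.75, 0.73, 0.71]
print(early_stopping_point(scores, patience=3))  # (2, 0.75, 5)
```

The checkpoint from `best_step` is the one worth keeping, which is why the script saves the best model rather than the last one.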


In [None]:
import subprocess
import sys
from pathlib import Path

# Run the training script
cmd = [
    sys.executable,
    str(Path("./pipeline/scripts/run_ingredient_ner.py")),
    "--config", str(config_path),
]

print("Starting NER model training...")
print(f"Command: {' '.join(cmd)}")
print("\n" + "="*60)
print("This may take a while. Training progress will be shown below.")
print("="*60 + "\n")

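# Stream output line by line so progress is visible while training runs.
# (subprocess.run with capture_output=True prints nothing until the script exits.)
with subprocess.Popen(
    cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, bufsize=1
) as proc:
    for line in proc.stdout:
        print(line, end="")

# Leaving the `with` block waits for the process, so returncode is set here
returncode = proc.returncode
print(f"\nReturn code: {returncode}")

if returncode == 0:
    print("\n✓ Training completed successfully!")
else:
    print("\n✗ Training failed. Check the error messages above.")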


## Step 4: Verify Trained Model

Check that the model was saved correctly and inspect its location.
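
Beyond checking that files exist, the model's `meta.json` can be read directly. The sketch below assumes the file contains the `lang`, `pipeline`, and `performance` keys that spaCy typically writes after training; treat the key names, in particular `ents_f`, as assumptions and inspect the file if they are missing.

```python
import json
from pathlib import Path


def summarize_meta(model_dir):
    """Return a few fields from a model's meta.json, or None if the file is missing."""
    meta_path = Path(model_dir) / "meta.json"
    if not meta_path.exists():
        return None
    with open(meta_path, encoding="utf-8") as f:
        meta = json.load(f)
    performance = meta.get("performance", {})
    return {
        "lang": meta.get("lang"),
        "pipeline": meta.get("pipeline", []),
        "ents_f": performance.get("ents_f"),  # entity-level F-score, if recorded
    }


print(summarize_meta("./models/ingredient_ner_trf/model-best"))
```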


In [None]:
from pathlib import Path

# Check model directory
model_dir = Path(ner_cfg.get('model_dir', './models/ingredient_ner_trf/model-best'))

if model_dir.exists():
    print(f"✓ Trained model found: {model_dir}")
    
    # List model files (directories are excluded from the count and listing)
    model_files = sorted(p for p in model_dir.rglob("*") if p.is_file())
    print(f"\n  Model files ({len(model_files)} total):")
    for f in model_files[:20]:  # Show first 20 files
        size = f.stat().st_size / (1024 * 1024)  # Size in MB
        print(f"    - {f.relative_to(model_dir)} ({size:.2f} MB)")
    
    # Check for key model components
    key_files = ['config.cfg', 'meta.json']
    for key_file in key_files:
        key_path = model_dir / key_file
        if key_path.exists():
            print(f"\n  ✓ {key_file} exists")
        else:
            print(f"\n  ✗ {key_file} not found")
    
    # Check for transformer model
    transformer_dir = model_dir / "transformer"
    if transformer_dir.exists():
        print(f"\n  ✓ Transformer model directory exists")
    else:
        print(f"\n  ✗ Transformer model directory not found")
        
else:
    print(f"✗ Model directory not found: {model_dir}")
    print("  Training may have failed or model was saved to a different location.")


## Step 5: Test the Trained Model (Optional)

Load and test the trained model on a sample text to verify it works correctly.
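
Eyeballing the extracted entities is a quick sanity check. For something more systematic, predictions can be scored against a hand-written expected list, as in this sketch. The function name, the case-insensitive exact matching, and the example lists are assumptions made for illustration.

```python
def score_extraction(predicted, expected):
    """Return (precision, recall, f1) using case-insensitive exact matching."""
    pred = {p.strip().lower() for p in predicted}
    gold = {g.strip().lower() for g in expected}
    true_positives = len(pred & gold)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical model output vs. hand-labelled ingredients
predicted = ["olive oil", "garlic", "lemon"]
expected = ["chicken breast", "olive oil", "garlic", "lemon juice"]
p, r, f1 = score_extraction(predicted, expected)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")  # P=0.67 R=0.50 F1=0.57
```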


In [None]:
# Optional: Test the trained model
if model_dir.exists():
    try:
        import spacy
        
        print("Loading trained model...")
        nlp = spacy.load(str(model_dir))
        print("✓ Model loaded successfully")
        
        # Test on sample text
        test_texts = [
            "2 cups all-purpose flour, 1 teaspoon salt, 3 eggs",
            "chicken breast, olive oil, garlic, lemon juice",
            "tomatoes, basil, mozzarella cheese, balsamic vinegar"
        ]
        
        print("\nTesting model on sample texts:")
        for text in test_texts:
            doc = nlp(text)
            ingredients = [ent.text for ent in doc.ents if ent.label_ == "INGREDIENT"]
            print(f"\n  Text: {text}")
            print(f"  Extracted ingredients: {ingredients}")
    except Exception as e:
        print(f"Error loading/testing model: {e}")
        print("Check that spacy and spacy-transformers are installed in this kernel; the saved model may still load elsewhere.")
else:
    print("Model not found - skipping test")
