# Ingredient Normalization Pipeline

This notebook runs the ingredient normalization pipeline, which processes raw ingredient lists through multiple stages to create a clean, deduplicated, and encoded dataset.

## Overview

The `run_ingrnorm.py` script runs a four-stage normalization pipeline:

1. **Stage 1: spaCy Normalization** - Cleans and normalizes ingredient text using spaCy NLP
2. **Stage 2: Deduplication** - Identifies and maps duplicate ingredients using SBERT (sentence embeddings) or Word2Vec
3. **Stage 3: Apply Dedupe Map** - Applies the deduplication mapping to create canonical ingredient forms
4. **Stage 4: Encoding** - Encodes normalized ingredients to integer IDs for efficient storage and processing
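
To make the stages concrete, here is a minimal, self-contained sketch of stages 2 to 4 on toy data. It is an illustration only: the hand-written vectors stand in for SBERT or Word2Vec embeddings, and `cosine`, `build_dedupe_map`, the greedy matching strategy, and the 0.95 threshold are hypothetical choices, not taken from `run_ingrnorm.py`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_dedupe_map(embeddings, threshold=0.95):
    """Greedy stage-2 sketch (assumed strategy): map each phrase to the first
    earlier canonical phrase whose embedding is at least `threshold` similar."""
    canonical = []  # phrases kept as canonical forms, in insertion order
    mapping = {}    # duplicate phrase -> canonical phrase
    for phrase, vec in embeddings.items():
        match = next(
            (c for c in canonical if cosine(vec, embeddings[c]) >= threshold),
            None,
        )
        if match is None:
            canonical.append(phrase)
        else:
            mapping[phrase] = match
    return mapping

# Toy vectors standing in for real embeddings (made up for illustration).
toy_embeddings = {
    "tomato": [1.0, 0.0, 0.1],
    "tomatoes": [0.98, 0.02, 0.1],
    "olive oil": [0.0, 1.0, 0.2],
}

# Stage 2: build the dedupe map.
dedupe_map = build_dedupe_map(toy_embeddings)

# Stage 3: replace every ingredient with its canonical form.
recipe = ["tomatoes", "olive oil", "tomato"]
canonical_recipe = [dedupe_map.get(item, item) for item in recipe]

# Stage 4: assign integer IDs to the sorted canonical vocabulary.
token_to_id = {tok: i for i, tok in enumerate(sorted(set(canonical_recipe)))}
encoded_recipe = [token_to_id[tok] for tok in canonical_recipe]

print(dedupe_map)
print(canonical_recipe)
print(encoded_recipe)
```

In this toy run, "tomatoes" collapses onto "tomato" while "olive oil" stays distinct, so the encoded recipe uses only two IDs.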

## Input/Output

- **Input**: Parquet file with ingredient lists (e.g., from NER inference or raw data)
- **Output**: 
  - Normalized parquet files at each stage
  - Dedupe mapping (JSONL)
  - Encoder maps (JSON): `ingredient_token_to_id.json`, `ingredient_id_to_token.json`
  - Final encoded dataset


In [None]:
# Setup: Add pipeline to path
import sys
from pathlib import Path

# Add pipeline directory to path
pipeline_root = Path.cwd().parent.parent / "pipeline"
if str(pipeline_root) not in sys.path:
    sys.path.insert(0, str(pipeline_root))

print(f"Pipeline root: {pipeline_root}")
print(f"Pipeline root exists: {pipeline_root.exists()}")


## Step 1: Configure Normalization Pipeline

Set the configuration file path. The config file specifies:
- Input data path
- Output paths for each stage
- Normalization parameters (SBERT model, thresholds, etc.)
- Which stages to run
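
As a rough sketch of the layout these settings imply, the dictionary below mirrors the keys this notebook looks up after parsing the YAML (`data.input_path`, `output.*`, `stages.*`). The input path is a placeholder, the output paths are the fallback defaults used later in this notebook, and the helper `missing_keys` is hypothetical, not part of the pipeline. The real `ingrnorm.yaml` may contain further keys, such as the SBERT model name and thresholds, whose names are not shown here.

```python
# Assumed shape of the parsed config, inferred from the keys this notebook reads.
example_config = {
    "data": {"input_path": "path/to/input.parquet"},  # placeholder value
    "output": {
        "baseline_parquet": "./data/normalized/recipes_data_clean.parquet",
        "cosine_map_path": "./data/normalized/cosine_dedupe_map.jsonl",
        "unified_parquet": "./data/encoded/datasets_unified.parquet",
    },
    "stages": {
        "write_parquet": True,
        "sbert_dedupe": True,
        "apply_cosine_map": True,
        "encode_ids": True,
    },
}

# Keys the notebook expects, grouped by config section.
EXPECTED_KEYS = {
    "data": ("input_path",),
    "output": ("baseline_parquet", "cosine_map_path", "unified_parquet"),
    "stages": ("write_parquet", "sbert_dedupe", "apply_cosine_map", "encode_ids"),
}

def missing_keys(config, expected=EXPECTED_KEYS):
    """Return dotted names of expected keys that are absent from `config`."""
    return [
        f"{section}.{key}"
        for section, keys in expected.items()
        for key in keys
        if key not in (config.get(section) or {})
    ]

print(missing_keys(example_config))
print(missing_keys({"data": {"input_path": "x.parquet"}}))
```

A check like this surfaces a misnamed key up front instead of letting the `'N/A'` fallbacks in the display code hide it.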


In [None]:
import yaml
from pathlib import Path

# Configuration file
config_path = Path("./pipeline/config/ingrnorm.yaml")

# Load and display config
if config_path.exists():
    with open(config_path, 'r') as f:
        config = yaml.safe_load(f)
    
    print("Configuration loaded:")
    print(f"  Input path: {config.get('data', {}).get('input_path', 'N/A')}")
    print(f"  Baseline parquet: {config.get('output', {}).get('baseline_parquet', 'N/A')}")
    print(f"  Dedupe map: {config.get('output', {}).get('cosine_map_path', 'N/A')}")
    print(f"  Unified parquet: {config.get('output', {}).get('unified_parquet', 'N/A')}")
    print(f"\n  Stages enabled:")
    stages = config.get('stages', {})
    print(f"    - Write parquet: {stages.get('write_parquet', False)}")
    print(f"    - SBERT dedupe: {stages.get('sbert_dedupe', False)}")
    print(f"    - Apply map: {stages.get('apply_cosine_map', False)}")
    print(f"    - Encode IDs: {stages.get('encode_ids', False)}")
else:
    print(f"Config file not found: {config_path}")


## Step 2: Run Normalization Pipeline

Execute the full normalization pipeline. This will:
1. Normalize ingredients using spaCy
2. Build deduplication map using SBERT embeddings
3. Apply the dedupe map to create canonical forms
4. Encode ingredients to integer IDs

**Note**: This step can take a long time on large datasets. The pipeline skips a stage when its output files already exist, unless `--force` is passed.
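
The skip behaviour described in the note can be pictured with a small sketch. This is an assumption about the general pattern, not the script's actual code: `should_run_stage` is a hypothetical helper, and the real pipeline may apply additional checks.

```python
import tempfile
from pathlib import Path

def should_run_stage(output_path, force=False):
    """Run a stage when its output is missing, or always when `force` is set."""
    return force or not Path(output_path).exists()

with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "stage_output.parquet"  # hypothetical artifact name
    first = should_run_stage(artifact)               # output missing: run
    artifact.touch()                                 # pretend the stage wrote its output
    second = should_run_stage(artifact)              # output present: skip
    forced = should_run_stage(artifact, force=True)  # forced: run anyway

print(first, second, forced)
```

Under this pattern, a rerun after a partial failure only repeats the stages whose artifacts are missing.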


In [None]:
import subprocess
import sys

# Run the normalization script
cmd = [
    sys.executable,
    str(Path("./pipeline/scripts/run_ingrnorm.py")),
    "--config", str(config_path),
    # Uncomment to force rebuild all artifacts:
    # "--force",
]

print("Running ingredient normalization pipeline...")
print(f"Command: {' '.join(cmd)}")
print("\n" + "="*60)

result = subprocess.run(cmd, capture_output=True, text=True)

# Print output
print(result.stdout)
if result.stderr:
    print("STDERR:", result.stderr)
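print(f"\nReturn code: {result.returncode}")
if result.returncode != 0:
    print("Pipeline exited with a non-zero return code; check STDERR above.")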


## Step 3: Inspect Normalization Results

Examine the outputs from each stage to verify the normalization process worked correctly.
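
Beyond checking that the files exist, one useful sanity check is that the two encoder maps invert each other. The sketch below assumes `ingredient_token_to_id.json` maps token strings to integer IDs and `ingredient_id_to_token.json` maps IDs back to tokens. Because JSON object keys are always strings, the ID keys arrive as strings and are converted with `int()`. The helper name `maps_are_inverse` and the sample data are hypothetical.

```python
import json

def maps_are_inverse(tok2id, id2tok):
    """True when every token -> id entry is mirrored by id -> token and sizes match."""
    id2tok_int = {int(k): v for k, v in id2tok.items()}  # JSON keys are strings
    return len(tok2id) == len(id2tok_int) and all(
        id2tok_int.get(idx) == tok for tok, idx in tok2id.items()
    )

# Round-trip through JSON to mimic loading the maps from disk.
sample_tok2id = json.loads(json.dumps({"olive oil": 0, "tomato": 1}))
sample_id2tok = json.loads(json.dumps({0: "olive oil", 1: "tomato"}))
swapped_id2tok = {"0": "tomato", "1": "olive oil"}  # deliberately inconsistent

print(maps_are_inverse(sample_tok2id, sample_id2tok))
print(maps_are_inverse(sample_tok2id, swapped_id2tok))
```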


In [None]:
import json
from pathlib import Path

import pandas as pd
import yaml  # needed below to re-load the config when this cell runs on its own

# Load config to get output paths
with open(config_path, 'r') as f:
    config = yaml.safe_load(f)

output_cfg = config.get('output', {})

# Check baseline (normalized) parquet
baseline_path = Path(output_cfg.get('baseline_parquet', './data/normalized/recipes_data_clean.parquet'))
if baseline_path.exists():
    print(f"✓ Baseline parquet exists: {baseline_path}")
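    # read_parquet has no `nrows` argument, so load the file and keep the first 5 rows
    df_baseline = pd.read_parquet(baseline_path).head(5)
    print(f"  Sample shape (first 5 rows): {df_baseline.shape}")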
    print(f"  Columns: {list(df_baseline.columns)}")
    if 'NER_clean' in df_baseline.columns:
        print(f"  Sample NER_clean: {df_baseline['NER_clean'].iloc[0]}")
else:
    print(f"✗ Baseline parquet not found: {baseline_path}")

# Check dedupe map
dedupe_map_path = Path(output_cfg.get('cosine_map_path', './data/normalized/cosine_dedupe_map.jsonl'))
if dedupe_map_path.exists():
    print(f"\n✓ Dedupe map exists: {dedupe_map_path}")
    # Count non-empty lines in the JSONL file (one mapping per line)
    with open(dedupe_map_path, 'r') as f:
        lines = [line for line in f if line.strip()]
    print(f"  Number of mappings: {len(lines)}")
    # Show the first mapping
    if lines:
        print(f"  Sample mapping: {json.loads(lines[0])}")
else:
    print(f"\n✗ Dedupe map not found: {dedupe_map_path}")

# Check encoder maps
token_to_id_path = Path(output_cfg.get('ingredient_token_to_id', './data/encoded/ingredient_token_to_id.json'))
id_to_token_path = Path(output_cfg.get('ingredient_id_to_token', './data/encoded/ingredient_id_to_token.json'))

if token_to_id_path.exists():
    print(f"\n✓ Token-to-ID map exists: {token_to_id_path}")
    with open(token_to_id_path, 'r') as f:
        tok2id = json.load(f)
    print(f"  Vocabulary size: {len(tok2id):,} tokens")
    # Show sample tokens
    sample_tokens = list(tok2id.keys())[:10]
    print(f"  Sample tokens: {sample_tokens}")
else:
    print(f"\n✗ Token-to-ID map not found: {token_to_id_path}")

# Check final encoded dataset
unified_path = Path(output_cfg.get('unified_parquet', './data/encoded/datasets_unified.parquet'))
if unified_path.exists():
    print(f"\n✓ Unified encoded parquet exists: {unified_path}")
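    # read_parquet has no `nrows` argument, so load the file and keep the first 5 rows
    df_unified = pd.read_parquet(unified_path).head(5)
    print(f"  Sample shape (first 5 rows): {df_unified.shape}")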
    print(f"  Columns: {list(df_unified.columns)}")
    if 'Ingredients' in df_unified.columns:
        print(f"  Sample encoded ingredients: {df_unified['Ingredients'].iloc[0]}")
else:
    print(f"\n✗ Unified parquet not found: {unified_path}")
