# Cuisine Normalization Pipeline

This notebook runs the cuisine normalization pipeline, which applies the same normalization workflow used for ingredients to cuisine labels.

## Overview

The `run_cuisine_norm.py` script normalizes cuisine labels in six steps:

1. **Step 1: Prepare Cuisine Lists** - Splits multi-cuisine entries (e.g., "[American, Italian]") into separate list items
2. **Step 2: spaCy Normalization** - Cleans and normalizes cuisine text
3. **Step 3: SBERT Deduplication** - Identifies duplicate cuisine names (e.g., "American" vs "american")
4. **Step 4: Apply Dedupe Map** - Creates canonical cuisine forms
5. **Step 5: Encoding** - Encodes cuisines to integer IDs
6. **Step 6: Apply to Combined Dataset** - Updates the main combined dataset with normalized and encoded cuisines
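
As a rough illustration of Steps 4 and 5, the sketch below applies a dedupe map to lists of cuisine labels and then assigns integer IDs. The function names, the sample map, and the choice of sorted, zero-based IDs are assumptions made for this example; the actual script may build and number its vocabulary differently.

```python
def apply_dedupe_map(cuisines, dedupe_map):
    """Replace each label with its canonical form, dropping repeats but keeping order."""
    canonical = []
    for label in cuisines:
        mapped = dedupe_map.get(label, label)  # unmapped labels are kept as-is
        if mapped not in canonical:
            canonical.append(mapped)
    return canonical


def build_encoder(rows):
    """Assign IDs in sorted order so repeated runs yield the same mapping (assumed policy)."""
    vocab = sorted({label for row in rows for label in row})
    token_to_id = {token: idx for idx, token in enumerate(vocab)}
    id_to_token = {idx: token for token, idx in token_to_id.items()}
    return token_to_id, id_to_token


# Hypothetical inputs, not taken from the real dataset
dedupe_map = {"italian food": "italian", "tex mex": "tex-mex"}
rows = [["american", "italian food"], ["tex mex"], ["italian"]]

canonical_rows = [apply_dedupe_map(row, dedupe_map) for row in rows]
token_to_id, id_to_token = build_encoder(canonical_rows)
encoded_rows = [[token_to_id[c] for c in row] for row in canonical_rows]

print(canonical_rows)
print(encoded_rows)
```

Sorting the vocabulary before numbering is one way to keep IDs stable between runs as long as the set of canonical labels does not change.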

## Key Features

- Handles multi-cuisine entries (splits on commas, "&", "and")
- Removes "recipes" suffix from cuisine names
- Deduplicates similar cuisine names using semantic similarity
- Creates encoder maps for downstream use
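
The splitting and suffix-removal behavior listed above could look roughly like the following. This is a minimal sketch, not the pipeline's own code: `split_cuisines` and both regular expressions are hypothetical, and the real script may treat brackets, casing, or delimiters differently.

```python
import re

# Split on commas, ampersands, or the standalone word "and".
# The word boundaries keep names such as "Scandinavian" intact.
_SPLIT_RE = re.compile(r"\s*(?:,|&|\band\b)\s*", flags=re.IGNORECASE)

# Strip a trailing "recipes" suffix, e.g. "Italian recipes" -> "Italian".
_SUFFIX_RE = re.compile(r"\s+recipes$", flags=re.IGNORECASE)


def split_cuisines(raw):
    """Turn one raw cuisine string into a list of individual cuisine names."""
    text = raw.strip().strip("[]")  # drop surrounding list brackets, if any
    cleaned = []
    for part in _SPLIT_RE.split(text):
        part = _SUFFIX_RE.sub("", part.strip())
        if part:  # skip empty fragments left by stray delimiters
            cleaned.append(part)
    return cleaned


print(split_cuisines("[American, Italian]"))
print(split_cuisines("Thai & Vietnamese recipes"))
print(split_cuisines("Scandinavian and Tex-Mex"))
```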


In [None]:
# Setup: Add pipeline to path
import sys
from pathlib import Path

# Add pipeline directory to path
pipeline_root = Path.cwd().parent.parent / "pipeline"
if str(pipeline_root) not in sys.path:
    sys.path.insert(0, str(pipeline_root))

print(f"Pipeline root: {pipeline_root}")
print(f"Pipeline root exists: {pipeline_root.exists()}")


## Step 1: Configure Cuisine Normalization

Load the configuration file that specifies input/output paths and normalization parameters.
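
Before running the pipeline, it can help to confirm that the loaded configuration contains the keys this notebook reads. The sketch below works on a plain dictionary shaped like the parsed YAML. The key names come from this notebook, but treating them all as required is an assumption, and `missing_config_keys` is a hypothetical helper.

```python
# Keys this notebook reads from the config; whether the script requires all of them is assumed.
EXPECTED_KEYS = {
    "data": ["input_path", "cuisine_col"],
    "output": [
        "baseline_parquet",
        "cosine_map_path",
        "unified_parquet",
        "cuisine_token_to_id",
        "cuisine_id_to_token",
    ],
}


def missing_config_keys(config):
    """Return dotted names of expected keys that are absent from the config dict."""
    missing = []
    for section, keys in EXPECTED_KEYS.items():
        section_cfg = config.get(section) or {}  # tolerate a missing or empty section
        for key in keys:
            if key not in section_cfg:
                missing.append(f"{section}.{key}")
    return missing


# Illustrative config using the fallback paths that appear later in this notebook
example_config = {
    "data": {
        "input_path": "./data/combined_raw_datasets.parquet",
        "cuisine_col": "cuisine",
    },
    "output": {
        "cosine_map_path": "./data/cuisine_normalized/cuisine_dedupe_map.jsonl",
        "cuisine_token_to_id": "./data/cuisine_encoded/cuisine_token_to_id.json",
    },
}

print(missing_config_keys(example_config))
```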


In [None]:
import yaml
from pathlib import Path

# Configuration file (a relative path resolves against the notebook's working directory)
config_path = Path("./pipeline/config/cuisnorm.yaml")

# Load and display config
if config_path.exists():
    with open(config_path, 'r') as f:
        config = yaml.safe_load(f)
    
    print("Configuration loaded:")
    data_cfg = config.get('data', {})
    output_cfg = config.get('output', {})
    
    print(f"  Input path: {data_cfg.get('input_path', 'N/A')}")
    print(f"  Cuisine column: {data_cfg.get('cuisine_col', 'cuisine')}")
    print("\n  Output paths:")
    print(f"    - Baseline: {output_cfg.get('baseline_parquet', 'N/A')}")
    print(f"    - Dedupe map: {output_cfg.get('cosine_map_path', 'N/A')}")
    print(f"    - Unified: {output_cfg.get('unified_parquet', 'N/A')}")
    print(f"    - Token→ID: {output_cfg.get('cuisine_token_to_id', 'N/A')}")
else:
    print(f"Config file not found: {config_path}")


## Step 2: Run Cuisine Normalization Pipeline

Execute the cuisine normalization pipeline. This will process all cuisine labels through the normalization stages and update the combined dataset.
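
When the script is launched as a subprocess, a failed run is easy to miss unless the return code is checked. The sketch below wraps `subprocess.run` in a hypothetical helper, `run_step`, that raises on a non-zero exit code. The demo calls use throwaway `python -c` commands rather than the real script.

```python
import subprocess
import sys


def run_step(cmd):
    """Run a command, returning stdout; raise if it exits with a non-zero code."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"Command failed with exit code {result.returncode}:\n{result.stderr.strip()}"
        )
    return result.stdout


# Stand-in command for demonstration; substitute the real pipeline command as needed
ok_output = run_step([sys.executable, "-c", "print('ok')"])
print(ok_output)
```

Raising stops later cells from running against stale or partial artifacts.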


In [None]:
import subprocess
import sys

# Run the cuisine normalization script
cmd = [
    sys.executable,
    str(Path("./pipeline/scripts/run_cuisine_norm.py")),
    "--config", str(config_path),
    # Uncomment to force rebuild all artifacts:
    # "--force",
]

print("Running cuisine normalization pipeline...")
print(f"Command: {' '.join(cmd)}")
print("\n" + "="*60)

result = subprocess.run(cmd, capture_output=True, text=True)

# Print output
print(result.stdout)
if result.stderr:
    print("STDERR:", result.stderr)
print(f"\nReturn code: {result.returncode}")


## Step 3: Inspect Cuisine Normalization Results

Examine the outputs to verify cuisine normalization worked correctly.
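
One detail to watch when checking encoded values against the ID-to-token map: JSON object keys are always strings, so integer IDs need converting after loading. The sketch below assumes the map is a flat JSON object of ID to cuisine name and that encoded rows hold lists of integers. Both are assumptions about the artifact layout, and the helper names are hypothetical.

```python
import json


def load_id_to_token(json_text):
    """Parse an ID-to-token JSON object, converting its string keys back to ints."""
    raw = json.loads(json_text)
    return {int(key): token for key, token in raw.items()}


def decode(ids, id_to_token):
    """Map a list of integer IDs back to cuisine names."""
    return [id_to_token[i] for i in ids]


# json.dumps writes integer keys as strings, mimicking a saved map file
id_to_token_json = json.dumps({0: "american", 1: "italian"})
id_to_token = load_id_to_token(id_to_token_json)
decoded = decode([1, 0], id_to_token)

print(id_to_token_json)
print(decoded)
```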


In [None]:
import json
from pathlib import Path

import pandas as pd
import yaml

# Load config to get output paths
with open(config_path, 'r') as f:
    config = yaml.safe_load(f)

output_cfg = config.get('output', {})

# Check dedupe map
dedupe_map_path = Path(output_cfg.get('cosine_map_path', './data/cuisine_normalized/cuisine_dedupe_map.jsonl'))
if dedupe_map_path.exists():
    print(f"✓ Cuisine dedupe map exists: {dedupe_map_path}")
    with open(dedupe_map_path, 'r') as f:
        lines = f.readlines()
    print(f"  Number of mappings: {len(lines)}")
    if lines:
        print(f"  Sample mapping: {json.loads(lines[0])}")
else:
    print(f"✗ Cuisine dedupe map not found: {dedupe_map_path}")

# Check encoder maps
token_to_id_path = Path(output_cfg.get('cuisine_token_to_id', './data/cuisine_encoded/cuisine_token_to_id.json'))
id_to_token_path = Path(output_cfg.get('cuisine_id_to_token', './data/cuisine_encoded/cuisine_id_to_token.json'))

if token_to_id_path.exists():
    print(f"\n✓ Cuisine token-to-ID map exists: {token_to_id_path}")
    with open(token_to_id_path, 'r') as f:
        tok2id = json.load(f)
    print(f"  Vocabulary size: {len(tok2id):,} unique cuisines")
    # Show sample cuisines
    sample_cuisines = list(tok2id.keys())[:20]
    print(f"  Sample cuisines: {sample_cuisines}")
else:
    print(f"\n✗ Cuisine token-to-ID map not found: {token_to_id_path}")

# Check if combined dataset was updated
data_cfg = config.get('data', {})
combined_path = Path(data_cfg.get('input_path', './data/combined_raw_datasets.parquet'))
updated_path = combined_path.parent / f"{combined_path.stem}_with_cuisine_encoded.parquet"

if updated_path.exists():
    print(f"\n✓ Updated combined dataset exists: {updated_path}")
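    # pandas.read_parquet has no nrows argument, so load the file and keep the first 10 rows
    df = pd.read_parquet(updated_path).head(10)
    print(f"  Shape of preview (first 10 rows): {df.shape}")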
    print(f"  Columns: {list(df.columns)}")
    
    # Show cuisine transformations
    if 'cuisine' in df.columns and 'cuisine_encoded' in df.columns:
        print("\n  Sample cuisine transformations:")
        for idx in range(min(5, len(df))):
            orig = df['cuisine'].iloc[idx]
            encoded = df['cuisine_encoded'].iloc[idx]
            print(f"    Row {idx}: {orig} → {encoded}")
else:
    print(f"\n✗ Updated combined dataset not found: {updated_path}")
