# Unified Preprocessing Pipeline

This notebook demonstrates the unified pipeline that supports multiple modes:
- **train**: Train NER model from raw data
- **inference**: Run inference and normalization pipeline
- **full**: Train model then run inference pipeline

## Overview

The unified pipeline (`run_pipeline.py`) integrates:
1. **Raw Data Combination**: Ingests CSVs/Parquets from a directory
2. **Model Training**: Trains transformer-based NER model (optional)
3. **Inference & Normalization**: Processes columns through normalization, deduplication, and encoding

## Pipeline Modes

### Train Mode
- Combines raw datasets
- Trains NER model from labeled data
- Saves model to `config.pipeline.model_dir`

### Inference Mode
- Loads existing model (or uses pre-trained)
- Runs inference on input data
- Applies normalization, deduplication, encoding

### Full Mode
- Runs train mode, then inference mode
- Complete end-to-end pipeline


## Setup


In [1]:
# Setup: Add pipeline to path
import sys
from pathlib import Path
# set sys.path to the parent directory
sys.path.insert(0, str(Path.cwd().parent))


## Load Configuration


In [None]:

from pipeline.config import PipelineConfig
from pathlib import Path
# Load configuration
# preprocess_pipeline\pipeline\config\ingrnorm_unified_example.yaml

config_path = Path("../pipeline/config/config.yaml")
if not config_path.exists():
    print(f"Config not found at {config_path}")
    print("Please ensure the config file exists.")
else:
    config = PipelineConfig.from_yaml(config_path)
    print(f"Loaded config from: {config_path}")
    print(f"Pipeline mode: {config.pipeline.mode if config.pipeline else 'not set'}")
    print(f"Columns to process: {config.pipeline.columns if config.pipeline else 'default'}")


Loaded config from: ..\pipeline\config\config.yaml


AttributeError: 'PipelineConfig' object has no attribute 'pipeline'

## Example: Run Unified Pipeline

The unified pipeline can run in different modes. Configure the mode and execute.


In [None]:
from pipeline.run_pipeline import UnifiedPipelineOrchestrator
from pipeline.common.logging_setup import setup_logging
import logging

# Setup logging
setup_logging(config.to_dict())
logger = logging.getLogger(__name__)

# Set pipeline mode (train, inference, or full)
if config.pipeline is None:
    from pipeline.config import PipelineModeConfig
    config.pipeline = PipelineModeConfig()

# Uncomment the mode you want to use:
# config.pipeline.mode = "train"      # Train model only
# config.pipeline.mode = "inference"  # Run inference only
config.pipeline.mode = "full"         # Train + inference

print(f"Pipeline mode: {config.pipeline.mode}")
print(f"Input path: {config.pipeline.input_path if config.pipeline else config.data.input_path}")
print(f"Model dir: {config.pipeline.model_dir if config.pipeline else 'not set'}")


  from .autonotebook import tqdm as notebook_tqdm


Pipeline mode: full
Input path: data/raw/
Model dir: models/ingredient_ner/


In [None]:
# Create orchestrator and setup steps
orchestrator = UnifiedPipelineOrchestrator(config)
orchestrator.setup_steps()

print(f"Configured {len(orchestrator.steps)} step(s):")
for i, step in enumerate(orchestrator.steps, 1):
    print(f"  {i}. {step.name}")
    if hasattr(step, 'input_path') and step.input_path:
        print(f"      Input: {step.input_path}")
    if hasattr(step, 'output_path') and step.output_path:
        print(f"      Output: {step.output_path}")


[2025-11-25 15:13:59] INFO pipeline.run_pipeline.Orchestrator: Setting up pipeline steps for mode: full
[2025-11-25 15:13:59] INFO pipeline.run_pipeline.Orchestrator: Combining data for column 'ingredients' from data\raw
[2025-11-25 15:13:59] INFO pipeline.base.RawDataCombiner: Initialized with data_dir=data\raw
[2025-11-25 15:13:59] INFO pipeline.run_pipeline.Orchestrator: Combining data for column 'cuisine' from data\raw
[2025-11-25 15:13:59] INFO pipeline.base.RawDataCombiner: Initialized with data_dir=data\raw
[2025-11-25 15:13:59] INFO pipeline.base.NERModelTrainer: Initialized with target_column=NER, base_model=roberta-base, epochs=10
[2025-11-25 15:13:59] INFO pipeline.run_pipeline.Orchestrator: Processing column: ingredients
[2025-11-25 15:14:00] INFO pipeline.base.IngredientNormalizer: Initialized with model=en_core_web_sm, batch_size=1024, n_process=4
[2025-11-25 15:14:00] INFO pipeline.base.DeduplicationStep_SBERT: Initialized with method=sbert
[2025-11-25 15:14:00] INFO pip

Configured 9 step(s):
  1. RawDataCombiner
      Input: data\raw
      Output: data\normalized\combined_ingredients_data.parquet
  2. RawDataCombiner
      Input: data\raw
      Output: data\normalized\combined_cuisine_data.parquet
  3. NERModelTrainer
      Input: data\normalized\combined_ingredients_data.parquet
      Output: models\ingredient_ner
  4. IngredientNormalizer
      Input: data\normalized\combined_ingredients_data.parquet
      Output: data\normalized\recipes_data_clean.parquet
  5. DeduplicationStep_SBERT
      Input: data\normalized\recipes_data_clean.parquet
      Output: data\normalized\recipes_data_clean_spell_dedup.parquet
  6. EncodingStep
      Input: data\normalized\recipes_data_clean_spell_dedup.parquet
      Output: data\encoded\datasets_unified.parquet
  7. IngredientNormalizer
      Input: data\normalized\combined_cuisine_data.parquet
      Output: data\normalized\cuisine_baseline.parquet
  8. DeduplicationStep_SBERT
      Input: data\normalized\cuisine_base

In [None]:
# Execute pipeline (uncomment to run)
# orchestrator.run(force=False)

# Or run with force to rebuild all artifacts:
# orchestrator.run(force=True)

print("Pipeline configured. Uncomment orchestrator.run() above to execute.")


Pipeline configured. Uncomment orchestrator.run() above to execute.


## Inspect Results


In [None]:
import pandas as pd

# Check output files
output_path = Path(config.output.unified_parquet)
if output_path.exists():
    df = pd.read_parquet(output_path)
    print(f"Output file: {output_path}")
    print(f"Rows: {len(df):,}")
    print(f"Columns: {list(df.columns)}")
    print("\nFirst few rows:")
    print(df.head())
else:
    print(f"Output file not found: {output_path}")
    print("Run the pipeline first to generate output.")


Output file not found: data\encoded\datasets_unified.parquet
Run the pipeline first to generate output.
