# Ingredients Pipeline (End-to-End)

A narrative guide for newcomers. The pipeline starts with messy raw ingredient text and ends with encoded ingredient IDs and a dedupe summary. The default runner order is: normalization → NER train → combine raw → infer → encode → dedupe summary. The `runner.run(...)` calls in the code cells are commented out; uncomment them to run each stage with the settings in `src/recipe_pipeline/config/pipeline.yaml`.

## 1) Normalize Ingredients
**What:** Clean ~2M raw recipe rows, standardize spellings, build vocab, create cosine dedupe map.
**Why:** Consistent tokens reduce noise for encoding and modeling.
**Outputs:** baseline and deduped parquets, the cosine map, and vocab/ID maps.
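
To make the idea of a cosine dedupe map concrete, here is a minimal sketch using only the standard library. It is not the pipeline's implementation: the character-trigram vectors, the `0.7` threshold, and the helper names `char_ngrams`, `cosine`, and `build_dedupe_map` are all assumptions chosen for illustration. The real stage may use different features and thresholds.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_dedupe_map(counts: dict, threshold: float = 0.7) -> dict:
    """Map each ingredient to a canonical form (hypothetical helper).

    Ingredients are visited from most to least frequent. Each one joins the
    first existing canonical form whose similarity meets the threshold,
    or becomes a canonical form itself (an identity entry in the map).
    """
    canonical = []  # list of (name, trigram vector) pairs
    mapping = {}
    for name in sorted(counts, key=lambda k: (-counts[k], k)):
        vec = char_ngrams(name)
        target = next(
            (c for c, cvec in canonical if cosine(vec, cvec) >= threshold),
            None,
        )
        if target is None:
            canonical.append((name, vec))
            target = name
        mapping[name] = target
    return mapping


# Toy frequencies, invented for this example only.
toy_counts = {"tomato": 100, "tomatoes": 40, "tomatos": 3, "garlic": 80}
toy_map = build_dedupe_map(toy_counts, threshold=0.7)
```

Visiting frequent spellings first means rare variants collapse into the common form rather than the other way around. The threshold controls how aggressive the merging is and would need tuning against real data.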

In [None]:
from recipe_pipeline.runner import PipelineRunner
from pathlib import Path
import yaml

config_path = Path('src/recipe_pipeline/config/pipeline.yaml')
runner = PipelineRunner.from_file(config_path)
# runner.run(stages=["ingredient_normalization"])


## 2) Train Ingredient NER
**What:** Train transformer NER to label ingredient spans.
**Why:** Learned boundaries handle messy text better than rules.


In [None]:
# runner.run(stages=["ingredient_ner_train"])


## 3) Combine Raw Data
**What:** Merge raw datasets (with cuisines/country) into one parquet.
**Why:** Downstream steps consume a single consolidated file.
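
As a rough illustration of what combining involves, the sketch below aligns several frames to one shared column list and stacks them with pandas. The column names (`ingredients`, `cuisine`, `country`), the added `source` column, and the helper `combine_raw_frames` are assumptions for this example, not the stage's actual schema or code.

```python
import pandas as pd


def combine_raw_frames(frames: dict, columns: list) -> pd.DataFrame:
    """Stack raw datasets into one frame with a shared set of columns.

    `frames` maps a dataset label to its DataFrame. Columns a dataset lacks
    are filled with missing values, and a `source` column records the origin.
    """
    parts = []
    for source, frame in frames.items():
        part = frame.reindex(columns=columns)  # returns a new, aligned frame
        part["source"] = source
        parts.append(part)
    return pd.concat(parts, ignore_index=True)


# Toy inputs, invented for this example only.
raw_a = pd.DataFrame({"ingredients": ["2 ripe tomatoes"], "cuisine": ["thai"]})
raw_b = pd.DataFrame({"ingredients": ["1 clove garlic"], "country": ["IT"]})
combined = combine_raw_frames(
    {"a": raw_a, "b": raw_b},
    columns=["ingredients", "cuisine", "country"],
)
# combined.to_parquet("combined.parquet")  # needs a parquet engine such as pyarrow
```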


In [None]:
# runner.run(stages=["combine_raw"])


## 4) Infer on Raw Data
**What:** Run NER inference to add `inferred_ingredients` to each recipe.
**Why:** Structured ingredients are needed for encoding/analysis.


In [None]:
# runner.run(stages=["ingredient_ner_infer"])


## 5) Encode Ingredients
**What:** Convert `inferred_ingredients` into integer IDs.
**Why:** Compact, consistent representation for models and joins.
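
The encoding idea can be sketched in a few lines. This is an illustration under stated assumptions, not the stage's code: the `<unk>` token at ID 0, first-seen ID assignment, and the helpers `build_vocab` and `encode` are hypothetical.

```python
def build_vocab(recipes):
    """Assign integer IDs in first-seen order, reserving 0 for unknowns."""
    vocab = {"<unk>": 0}
    for ingredients in recipes:
        for name in ingredients:
            vocab.setdefault(name, len(vocab))
    return vocab


def encode(ingredients, vocab, dedupe_map=None):
    """Replace each ingredient with its ID, applying the dedupe map first."""
    dedupe_map = dedupe_map or {}
    return [vocab.get(dedupe_map.get(name, name), 0) for name in ingredients]


# Toy data, invented for this example only.
toy_recipes = [["tomato", "garlic"], ["garlic", "basil"]]
toy_vocab = build_vocab(toy_recipes)
toy_ids = encode(
    ["tomatoes", "basil", "saffron"],
    toy_vocab,
    dedupe_map={"tomatoes": "tomato"},
)
```

Applying the dedupe map before the lookup is what gives spelling variants the same ID; anything still unseen falls back to the unknown ID rather than raising an error.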


In [None]:
# runner.run(stages=["ingredient_encoding"])


## 6) Ingredient Dedupe Summary
**What:** Summarize dedupe map (identity vs merges), list top merge targets, plot.
**Why:** QA how effective deduping is; confirm the map actually merges variants rather than mapping every ingredient to itself.
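
A minimal sketch of the kind of summary this stage describes, assuming the dedupe map is a plain `{variant: canonical}` dictionary. The function name `summarize_dedupe_map` and the output keys are hypothetical and may not match `dedupe_summary.json`.

```python
from collections import Counter


def summarize_dedupe_map(mapping, top_n=5):
    """Count identity entries vs. merges and rank the most common targets."""
    merges = {src: dst for src, dst in mapping.items() if src != dst}
    total = len(mapping)
    return {
        "total": total,
        "identity": total - len(merges),
        "merged": len(merges),
        "merge_rate": len(merges) / total if total else 0.0,
        "top_targets": Counter(merges.values()).most_common(top_n),
    }


# Toy map, invented for this example only.
toy_mapping = {
    "tomato": "tomato",
    "tomatoes": "tomato",
    "tomatos": "tomato",
    "garlic": "garlic",
    "garlic clove": "garlic",
}
toy_summary = summarize_dedupe_map(toy_mapping)
```

A merge rate near zero would suggest the map is mostly identities, which is the failure mode this QA step is meant to catch.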


In [None]:
# runner.run(stages=["ingredients_summary"])


## Inspect Artifacts (optional)
Uncomment to peek at deduped parquets, maps, and summary plots.

In [None]:
# import json
# import pandas as pd
# from pathlib import Path
# from IPython.display import Image, display
# # Reuses `yaml` and `config_path` from the first code cell.
# cfg = yaml.safe_load(config_path.read_text(encoding='utf-8'))
# norm_out = cfg['pipeline']['ingredient_normalization']['output']
# dedup_parquet = Path(norm_out['dedup_parquet'])
# map_path = Path(norm_out['cosine_map_path'])
# summary = Path('reports/ingredients/dedupe_summary.json')
# plot = Path('reports/ingredients/dedupe_top_targets.png')
# if dedup_parquet.exists():
#     print(dedup_parquet)
#     print(pd.read_parquet(dedup_parquet).head())
# if map_path.exists():
#     # Counts lines; this equals the entry count only if the map has one entry per line.
#     with open(map_path, 'r', encoding='utf-8') as f:
#         print('map entries', sum(1 for _ in f))
# if summary.exists():
#     print(json.loads(summary.read_text(encoding='utf-8')))
# if plot.exists():
#     display(Image(filename=str(plot)))
