# Combine Raw Datasets

This notebook combines multiple raw CSV datasets into a unified format and runs ingredient NER inference to extract ingredients from raw text.

## Overview

The `combine_raw_datasets.py` script:
1. Reads multiple CSV files from `data/raw/`
2. Extracts the ingredients and cuisine columns (detected automatically by case-insensitive column-name matching)
3. Combines all datasets into a single DataFrame with standardized columns
4. Runs ingredient NER inference to extract normalized ingredients from raw text
5. Outputs a combined parquet file with `inferred_ingredients` and `encoded_ingredients` columns
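
The case-insensitive column detection in step 2 could look roughly like the sketch below. The helper name `detect_column` and the candidate name lists are hypothetical, chosen for illustration; the actual script may match different names or use a different strategy.

```python
from typing import Iterable, Optional, Sequence

# Hypothetical candidate names; the real script may look for others.
INGREDIENT_CANDIDATES = ("ingredients", "ingredient", "ingredient_list")
CUISINE_CANDIDATES = ("cuisine", "cuisines", "category")


def detect_column(columns: Iterable[str], candidates: Sequence[str]) -> Optional[str]:
    """Return the original column name that matches a candidate, ignoring case."""
    # Map each normalized name back to the name as it appears in the file.
    lookup = {str(col).strip().lower(): col for col in columns}
    for name in candidates:
        if name in lookup:
            return lookup[name]
    return None


# Works on any iterable of names, including a DataFrame's `.columns`.
example_columns = ["Title", "INGREDIENTS", "Cuisine"]
ingredients_col = detect_column(example_columns, INGREDIENT_CANDIDATES)
cuisine_col = detect_column(example_columns, CUISINE_CANDIDATES)
print(ingredients_col, cuisine_col)
```

Returning the original column name, not the normalized one, means the result can be used directly to index the DataFrame.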

## Workflow Steps

1. **Dataset Discovery**: Finds all CSV files in the raw data directory
2. **Column Detection**: Automatically detects ingredients and cuisine columns (case-insensitive)
3. **Cuisine Extraction**: Extracts cuisine labels from various formats (lists, strings, text columns)
4. **Data Combination**: Merges all datasets, tracking each row's source in a `Dataset_ID` column
5. **NER Inference**: Applies the trained ingredient NER model to extract ingredients from raw text
6. **Output**: Saves the combined dataset as a parquet file
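
As an illustration of the cuisine extraction step, here is a minimal sketch of normalizing cuisine values that arrive as real lists, stringified lists, or comma-separated strings. The function `parse_cuisine` is hypothetical, and lowercasing the labels is an assumption; the script's own rules may differ.

```python
import ast


def parse_cuisine(value):
    """Normalize a raw cuisine value to a list of lowercase labels."""
    # Treat None and float NaN (pandas' missing-value marker) as "no label".
    if value is None or (isinstance(value, float) and value != value):
        return []
    if isinstance(value, (list, tuple)):
        items = list(value)
    else:
        text = str(value).strip()
        if text.startswith("[") and text.endswith("]"):
            # Stringified list such as "['Mexican', 'Tex-Mex']".
            try:
                parsed = ast.literal_eval(text)
                items = list(parsed) if isinstance(parsed, (list, tuple)) else [parsed]
            except (ValueError, SyntaxError):
                # Not a valid Python literal; fall back to a plain comma split.
                items = text.strip("[]").split(",")
        else:
            items = text.split(",")
    return [str(item).strip().lower() for item in items if str(item).strip()]


print(parse_cuisine(["Italian", "Thai"]))
print(parse_cuisine("['Mexican']"))
print(parse_cuisine("Indian, Chinese"))
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for untrusted CSV content.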


In [1]:
# Setup: Add pipeline to path
import sys
from pathlib import Path

# Add pipeline directory to path
pipeline_root = Path.cwd().parent.parent / "pipeline"
if str(pipeline_root) not in sys.path:
    sys.path.insert(0, str(pipeline_root))

print(f"Pipeline root: {pipeline_root}")
print(f"Pipeline root exists: {pipeline_root.exists()}")


Pipeline root: c:\Users\georg.DESKTOP-2FS9VF1\source\repos\699-capstone-team14\preprocess_pipeline\pipeline
Pipeline root exists: True


## Step 1: Import Required Modules


In [2]:
import logging
from pathlib import Path
import pandas as pd

# Import the shared logging helper from the pipeline package
# (importable because the setup cell added the pipeline directory to sys.path)
from common.logging_setup import setup_logging

# Set up logging
logging.basicConfig(
    level=logging.INFO,
    format="[%(asctime)s] %(levelname)s %(name)s: %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
logger = logging.getLogger(__name__)


## Step 2: Configure Paths

Set the input directory (raw CSV files) and output path for the combined dataset.


In [3]:
# Configuration
data_dir = Path("./data/raw")
output_path = Path("./data/combined_raw_datasets.parquet")
inference_config = Path("./pipeline/config/ingredient_ner_inference.yaml")

# Check if directories exist
print(f"Data directory: {data_dir} (exists: {data_dir.exists()})")
print(f"Output path: {output_path}")
print(f"Inference config: {inference_config} (exists: {inference_config.exists()})")


Data directory: data\raw (exists: False)
Output path: data\combined_raw_datasets.parquet
Inference config: pipeline\config\ingredient_ner_inference.yaml (exists: False)
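
Both `exists` checks in the recorded output report `False`, most likely because the relative paths are resolved against the notebook's working directory while the setup cell locates the pipeline two levels above it. A minimal sketch of anchoring the paths to the project root instead is shown below. It assumes that `data/` sits next to `pipeline/` under that root, and `resolve_from_root` is a hypothetical helper, not part of the pipeline.

```python
from pathlib import Path


def resolve_from_root(project_root: Path, relative: str) -> Path:
    """Build an absolute path from the project root and a relative path."""
    return (project_root / relative).resolve()


# Assumption: the notebook runs two levels below the project root,
# mirroring `Path.cwd().parent.parent / "pipeline"` in the setup cell.
project_root = Path.cwd().parent.parent

data_dir = resolve_from_root(project_root, "data/raw")
output_path = resolve_from_root(project_root, "data/combined_raw_datasets.parquet")
inference_config = resolve_from_root(
    project_root, "pipeline/config/ingredient_ner_inference.yaml"
)

for label, path in [
    ("Data directory", data_dir),
    ("Inference config", inference_config),
]:
    print(f"{label}: {path} (exists: {path.exists()})")
```

Absolute paths also keep the subprocess call in Step 3 independent of the directory the notebook kernel was started from.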


## Step 3: Run Dataset Combination Script

This cell runs the actual combination script. The script will:
- Find all CSV files in the data directory
- Process each file to extract ingredients and cuisine columns
- Combine all datasets
- Run NER inference to extract ingredients from raw text
- Save the combined dataset

**Note**: This can take a while, depending on the size of your datasets and on whether NER inference runs. Pass `--skip-inference` to combine the datasets without running inference.
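
The combination step can be pictured with the sketch below, which concatenates per-file DataFrames and tags each row with a `Dataset_ID`. Numbering the files from 1 in the order given is an assumption made for illustration, and `combine_frames` is a hypothetical name; the script may assign IDs differently.

```python
import pandas as pd


def combine_frames(frames):
    """Concatenate per-file frames, tagging each row with a 1-based Dataset_ID."""
    tagged = [
        frame.assign(Dataset_ID=dataset_id)
        for dataset_id, frame in enumerate(frames, start=1)
    ]
    # ignore_index gives the combined frame a fresh 0..n-1 index.
    return pd.concat(tagged, ignore_index=True)


# Two tiny stand-in datasets with already standardized columns.
first = pd.DataFrame({"ingredients": ["flour, sugar"], "cuisine": ["french"]})
second = pd.DataFrame(
    {"ingredients": ["rice, nori", "tofu, miso"], "cuisine": ["japanese", "japanese"]}
)

combined = combine_frames([first, second])
print(combined["Dataset_ID"].value_counts().sort_index())
```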


In [None]:
# Run combine_raw_datasets.py in a subprocess so its output can be captured
import subprocess
import sys

# Build the command line, using the same interpreter as this notebook
cmd = [
    sys.executable,
    str(Path("./pipeline/scripts/combine_raw_datasets.py")),
    "--data-dir", str(data_dir),
    "--output", str(output_path),
    "--inference-config", str(inference_config),
    # Uncomment to skip inference if you just want to combine datasets:
    # "--skip-inference",
]

print("Running dataset combination script...")
print(f"Command: {' '.join(cmd)}")
print("\n" + "="*60)

# Run the script
result = subprocess.run(cmd, capture_output=True, text=True)

# Print output
print(result.stdout)
if result.stderr:
    print("STDERR:", result.stderr)
print(f"\nReturn code: {result.returncode}")

# Alternative: Import and run directly (uncomment to use)
# from pipeline.scripts.combine_raw_datasets import main
# import sys
# sys.argv = ['combine_raw_datasets.py', 
#              '--data-dir', str(data_dir),
#              '--output', str(output_path),
#              '--inference-config', str(inference_config)]
# main()
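
The cell above prints the return code but carries on even when the script fails. One option is to stop the notebook on a non-zero exit code, as in the sketch below. `run_checked` is a hypothetical helper, and the commands are stand-ins so that the example runs without the pipeline installed.

```python
import subprocess
import sys


def run_checked(cmd):
    """Run a command, returning stdout or raising if it exits with an error."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"Command failed with exit code {result.returncode}:\n{result.stderr}"
        )
    return result.stdout


# Stand-in for the real script invocation.
ok_output = run_checked([sys.executable, "-c", "print('combined')"])
print(ok_output)

try:
    run_checked([sys.executable, "-c", "import sys; sys.exit(3)"])
except RuntimeError as exc:
    print(f"Caught: {exc}")
```

Raising here prevents Step 4 from silently inspecting a stale parquet file left over from an earlier run.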


## Step 4: Inspect Combined Dataset

Load and examine the combined dataset to verify the results.


In [None]:
# Load the combined dataset
if output_path.exists():
    df = pd.read_parquet(output_path)
    
    print(f"Dataset shape: {df.shape}")
    print(f"\nColumns: {list(df.columns)}")
    print(f"\nDataset_ID distribution:")
    print(df['Dataset_ID'].value_counts().sort_index())
    
    print(f"\nFirst few rows:")
    print(df.head())
    
    # Check inferred ingredients
    if 'inferred_ingredients' in df.columns:
        print("\nSample inferred ingredients:")
        non_null = df['inferred_ingredients'].dropna()
        if non_null.empty:
            print("No non-null inferred ingredients found.")
        else:
            # Show the first row that actually has inferred ingredients
            print(f"Row {non_null.index[0]}: {non_null.iloc[0]}")
    
    # Check cuisine distribution
    if 'cuisine' in df.columns:
        print(f"\nCuisine distribution (top 10):")
        print(df['cuisine'].value_counts().head(10))
else:
    print(f"Output file not found: {output_path}")
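
Beyond eyeballing the printed output, a quick schema check can confirm that the columns named in this notebook are present. The sketch below is one way to do that. The helper `missing_columns` is hypothetical, and treating all three columns as required is an assumption; a run with `--skip-inference` would presumably lack the two ingredient columns.

```python
# Column names taken from this notebook's description of the output.
EXPECTED_COLUMNS = ("Dataset_ID", "inferred_ingredients", "encoded_ingredients")


def missing_columns(columns, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent, in their listed order."""
    present = set(columns)
    return [name for name in expected if name not in present]


# With a loaded DataFrame this would be `missing_columns(df.columns)`.
absent = missing_columns(["Dataset_ID", "cuisine"])
if absent:
    print(f"Missing columns: {absent}")
else:
    print("All expected columns are present.")
```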
