# Village Roadshows - Test/Holdout Dataset Cleanup Pipeline

This notebook contains a sequential pipeline to process raw data and generate the final `test_dataset.xlsx` file. This final file is used by the `cinema_dashboard.py` application for evaluating recommendation performance.

**Workflow:**
1.  **Inventory Transactions Cleanup:** Reads all raw `Inventory Transaction Data` Excel files from the `input/` directory, cleans them, imputes prices, and saves a consolidated `inventory_transactions_clean.xlsx` to the `output/` directory.
2.  **One-Hot Encoding (OHE) of Transactions:** Takes the cleaned inventory data and transforms it into a wide, hourly, one-hot encoded format based on item class and product name. This writes the intermediate file `ohe_trx_item_class_product.xlsx` to `output/`.
3.  **Movie Sessions Cleanup:** Reads all raw `Movie_sessions` Excel files from `input/`, cleans them, and applies business rules (e.g., filtering session times, binning durations).
4.  **Hourly Session Expansion:** Expands the cleaned session data so that each session is represented across every hour it is active (including buffer times).
5.  **Final Merge:** Combines the OHE transaction data with the expanded session data on a timestamp key to produce the final `test_dataset.xlsx` in the `output/` folder.
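
To make step 2 concrete, here is a minimal sketch of quantity-weighted one-hot encoding on a toy frame. The rows and values are hypothetical and only the column names (`timestamp`, `item_class`, `quantity`) follow the pipeline; it illustrates the idea rather than reproducing the pipeline's own helper.

```python
import pandas as pd

# Toy transactions (hypothetical values, not taken from the real workbooks).
trx = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2025-03-01 10:00", "2025-03-01 10:00", "2025-03-01 11:00"]
    ),
    "item_class": ["POPCORN", "DRINKS - LARGE", "POPCORN"],
    "quantity": [2, 1, 3],
})

# One indicator column per class, then weight each row by its quantity.
dummies = pd.get_dummies(trx["item_class"], dtype="int64")
weighted = dummies.mul(trx["quantity"], axis=0)

# Sum per hour: each cell is the number of items of that class sold in the hour.
hourly_counts = weighted.groupby(trx["timestamp"]).sum()
print(hourly_counts)
```

Each row of `hourly_counts` is one hour and each column one item class, which is the wide layout the later merge expects.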

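The hourly expansion in step 4 can be sketched for a single session as follows. The session start, runtime, and the one-hour pre-session buffer are assumptions chosen for illustration; the `in_show` flag separates hours when the film is running from buffer hours.

```python
import pandas as pd

PRE_BUFFER_HRS = 1  # assumed buffer for this sketch: patrons arrive up to 1 hour early

# One hypothetical session: starts 14:00 and runs 130 minutes (ends 16:10).
start = pd.Timestamp("2025-03-01 14:00")
end = start + pd.Timedelta(minutes=130)

# Hours with patrons on site: from the buffer start up to, but excluding, the end time.
site_hours = pd.date_range(
    start - pd.Timedelta(hours=PRE_BUFFER_HRS), end, freq="h", inclusive="left"
)

expanded = pd.DataFrame({"timestamp": site_hours})
# in_show is 1 while the film is running and 0 during the pre-session buffer.
expanded["in_show"] = (
    (expanded["timestamp"] >= start) & (expanded["timestamp"] < end)
).astype(int)
print(expanded)
```

This single session becomes four hourly rows (13:00 to 16:00), with only the 13:00 row flagged as buffer time.
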
In [1]:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Village Roadshows: F&B Analytics Data Processing Pipeline.

This script serves as the master data engineering workflow, transforming raw
transaction and movie session data into a unified, analysis-ready feature matrix.
It is designed to be run from a project directory containing 'input/' and
'output/' subdirectories.

Pipeline Overview:
------------------
The script executes five sequential stages:

1.  **Inventory Transaction Cleanup:**
    -   Discovers and loads all raw `Inventory Transaction Data*.xlsx` files.
    -   Normalizes inconsistent headers, filters for allowed food & beverage
        classes, and standardizes data types.
    -   Imputes missing prices using a 5-level fallback cascade (exact product
        match, 'NO ICE'/'NO SUGAR' proxy, chips median, class median, global median).
    -   Outputs: `inventory_transactions_clean.xlsx` and per-year splits.

2.  **Transaction One-Hot Encoding (OHE):**
    -   Converts the cleaned, long-format transaction data into a wide-format,
        hourly matrix where columns represent individual products.
    -   Values are the count of items sold per hour.
    -   Prepends a `total_price_aud` column for hourly revenue.
    -   Outputs: `ohe_trx_item_class_product.xlsx`

3.  **Movie Session Cleanup:**
    -   Discovers and loads all raw `Movie_sessions*.xlsx` files.
    -   Applies business rules: filters out irrelevant genres, invalid runtimes,
        and sessions outside of primary operating hours (09:00 - 21:59).
    -   Engineers new features like duration categories and time-of-day slots.
    -   Outputs: `movie_sessions_clean.xlsx` and per-year splits.

4.  **Hourly Session Expansion & OHE:**
    -   Expands each movie session to represent every hour patrons are on-site,
        including a pre-session buffer for early arrivals.
    -   One-hot encodes categorical session features (genre, rating, etc.).
    -   Aggregates data to the hour, calculating total concurrent admissions
        and the admit-weighted average movie duration for that hour.
    -   Outputs: `ohe_movie_sessions_hourly_expanded.xlsx`

5.  **Final Dataset Merge:**
    -   Performs a left join, combining the hourly transaction data (Stage 2)
        with the expanded hourly session data (Stage 4) on the `timestamp` key.
    -   This creates the final feature matrix, keeping all hours with sales.
    -   Outputs: `train_dataset.xlsx`, `test_dataset.xlsx`, or `processed_dataset.xlsx`,
        depending on the name of the working directory.

Execution:
----------
1. Place all raw Excel files into an `input/` directory.
2. Run this script from the directory that contains `input/` (all paths are relative to the working directory).
3. All intermediate and final files will be saved to an `output/` directory.
"""
from __future__ import annotations

import re
import warnings
from datetime import timedelta
from pathlib import Path
from typing import Dict, Final, List, Tuple

import pandas as pd

# =============================================================================
# --- 1. CONFIGURATION CONSTANTS ---
# =============================================================================
# All constants, business rules, and file paths are defined here for easy
# maintenance and modification without altering the core script logic.

# --- Path Configuration ---
BASE_DIR: Final[Path] = Path(".")
INPUT_DIR: Final[Path] = BASE_DIR / "input"
OUTPUT_DIR: Final[Path] = BASE_DIR / "output"

# --- Regex Patterns for File Discovery ---
# Matches files like "Inventory Transaction Data 2023 v0.1.xlsx" or "Inventory Transaction Data Feb 2025 v1.xlsx"
WORKBOOK_RX_INV: Final[re.Pattern[str]] = re.compile(
    r"Inventory Transaction Data "
    r"(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{4}|\d{4})"
    r"(?:\s+v\d+(?:\.\d+)?)?"
    r"\.xlsx$",
    re.I,
)
# Matches files like "Movie_sessions v1.2.xlsx" or "Movie_sessions_Jan2025.xlsx"
SOURCE_RX_SESS: Final[re.Pattern[str]] = re.compile(
    r"Movie_sessions"
    r"(?:_(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s?\d{4}))?"
    r"(?:\s+v\d+(?:\.\d+)?)?"
    r"\.xlsx$",
    re.I,
)

# --- Inventory Cleaning Business Rules ---
ALLOWED_CLASSES: Final[set[str]] = {
    "SNACK - CHIPS", "FOOD - VJUNIOR", "ICE CREAMS - OTHER", "ICE CREAMS - CHOC TO",
    "DRINKS - EXTRA LARGE", "DRINKS - LARGE", "DRINKS - MEDIUM", "DRINKS - SMALL",
    "DRINKS - NO ICE", "DRINKS", "POPCORN",
}
# Regex to find and remove "NO ICE" or "NO SUGAR" flags from product names
NO_FLAG_RX: Final[re.Pattern[str]] = re.compile(r"\bNO\s+(ICE|SUGAR)\b", re.I)

# --- Session Cleaning Business Rules ---
EXCLUDE_GENRES: Final[set[str]] = {"GAMING", "TO BE ADVISED"}
# Session runtimes of exactly 960 minutes are considered placeholder data
PLACEHOLDER_RUNTIME: Final[int] = 960

# --- Feature Engineering Constants ---
# Time-of-day slots, defined by minutes from midnight: [start, end)
SLOT_WINDOWS: Final[Dict[str, Tuple[int, int]]] = {
    "morning": (9 * 60, 11 * 60), "early_noon": (11 * 60, 13 * 60),
    "noon": (13 * 60, 15 * 60), "late_noon": (15 * 60, 17 * 60),
    "evening_1": (17 * 60, 17 * 60 + 30), "evening_2": (17 * 60 + 30, 18 * 60),
    "evening_3": (18 * 60, 18 * 60 + 15), "evening_4": (18 * 60 + 15, 18 * 60 + 30),
    "evening_5": (18 * 60 + 30, 18 * 60 + 45), "evening_6": (18 * 60 + 45, 19 * 60),
    "night_1": (19 * 60, 19 * 60 + 15), "night_2": (19 * 60 + 15, 19 * 60 + 30),
    "night_3": (19 * 60 + 30, 20 * 60), "night_4": (20 * 60, 20 * 60 + 30),
    "night_5": (20 * 60 + 30, 21 * 60), "night_6": (21 * 60, 22 * 60),
}
# Duration categories for movies (in minutes)
SHORT_MAX_SESS: Final[int] = 120
MEDIUM_MAX_SESS: Final[int] = 160

# Buffer window (in hours) for session expansion
PRE_BUFFER_HRS: Final[int] = 1  # How many hours before a session patrons might arrive
POST_BUFFER_HRS: Final[int] = 0  # How many hours after a session patrons might linger

# Categorical columns to one-hot encode during session expansion
CAT_COLS_SESS: Final[List[str]] = [
    "language", "genre", "rating", "slot", "duration_category"
]

# =============================================================================
# --- 2. HELPER FUNCTIONS ---
# =============================================================================
# Small, reusable utility functions used across multiple pipeline stages.
# Underscore prefix indicates they are intended for internal use in this module.

def _find_header_row(xl_path: Path, sheet_name: str, start_col_name: str) -> int:
    """Finds the header row index by searching for a specific column name.

    This is necessary because source Excel files may have variable numbers of
    title rows before the actual data table.

    Args:
        xl_path: Path to the Excel workbook.
        sheet_name: The name of the sheet to search within.
        start_col_name: The text the key header cell starts with (case-insensitive).

    Returns:
        The 0-based index of the header row.

    Raises:
        ValueError: If no row contains the specified header text.
    """
    raw = pd.read_excel(xl_path, sheet_name=sheet_name, header=None, dtype=str)
    # Lowercase the search text too, so the match is case-insensitive as documented
    needle = start_col_name.lower().strip()
    for idx, row in raw.iterrows():
        if any(str(cell).lower().strip().startswith(needle) for cell in row):
            return int(idx)
    raise ValueError(f"{xl_path.name}: Header with '{start_col_name}' not found.")

def _remove_no_flag(text: str) -> str:
    """Strips 'NO ICE'/'NO SUGAR' flags to get the base product name."""
    # Collapse any run of whitespace left behind after the flag is removed
    return re.sub(r"\s{2,}", " ", NO_FLAG_RX.sub("", str(text))).strip()

def _map_timestamp_to_slot(ts: pd.Timestamp) -> str:
    """Maps a timestamp to its defined time-of-day slot."""
    minutes = ts.hour * 60 + ts.minute
    for slot, (start, end) in SLOT_WINDOWS.items():
        if start <= minutes < end:
            return slot
    return "out_of_range"

def _get_duration_category(minutes: int) -> str:
    """Categorizes a movie's runtime into 'short', 'medium', or 'long'."""
    if minutes <= SHORT_MAX_SESS:
        return "short"
    if minutes <= MEDIUM_MAX_SESS:
        return "medium"
    return "long"

def _parse_runtime_from_text(duration_text: str) -> int:
    """Extracts the first integer from a text string (e.g., '145 min')."""
    match = re.search(r"(\d+)", str(duration_text))
    return int(match.group(1)) if match else 0

def _build_ohe_matrix(df: pd.DataFrame, cat_cols: list[str], *, keep_prefix: bool = False) -> pd.DataFrame:
    """Creates a timestamp-indexed, quantity-weighted OHE matrix.

    Args:
        df: The input DataFrame, must contain 'timestamp' and 'quantity'.
        cat_cols: A list of categorical columns to encode.
        keep_prefix: If True, OHE columns will be prefixed (e.g., 'item_class_POPCORN').
                     If False, they will not (e.g., 'POPCORN').

    Returns:
        A DataFrame indexed by 'timestamp' with OHE columns.
    """
    if "quantity" not in df.columns:
        raise KeyError("Input DataFrame must have a 'quantity' column.")

    dummy_kwargs = {} if keep_prefix else dict(prefix="", prefix_sep="")
    dummies = pd.get_dummies(df[cat_cols], columns=cat_cols, dtype="uint32", **dummy_kwargs)

    # Weight each OHE flag by the number of items sold in that transaction
    weighted_dummies = dummies.mul(df["quantity"].values, axis=0)

    # Aggregate by hour to get total counts for each category
    return weighted_dummies.groupby(df["timestamp"]).sum().astype("uint32")

# =============================================================================
# --- 3. CORE DATA PROCESSING FUNCTIONS ---
# =============================================================================
# These functions contain the detailed logic for cleaning a single workbook.

def _clean_one_inventory_workbook(xl_path: Path) -> pd.DataFrame:
    """Loads and cleans a single raw inventory transaction workbook."""
    print(f"→ Cleaning inventory file: {xl_path.name}")
    header_row = _find_header_row(xl_path, "Inventory Trans", "transaction date")
    df_raw = pd.read_excel(xl_path, sheet_name="Inventory Trans", header=header_row, dtype=str)

    # Filter out summary/footer rows
    df = df_raw[~df_raw["Transaction Date"].str.contains("result", case=False, na=False)].copy()

    # Standardize column names and drop empty 'Unnamed' columns
    rename_map = {}
    for col in df.columns:
        key = str(col).lower().strip()
        if "no of items" in key or key == "ea": rename_map[col] = "quantity"
        elif "sell price" in key or key == "aud": rename_map[col] = "price_aud"
    df.rename(columns=rename_map, inplace=True)
    df.drop(columns=[c for c in df.columns if str(c).startswith('Unnamed')], errors="ignore", inplace=True)

    # Parse and build timestamp
    df["Transaction Date"] = pd.to_datetime(df["Transaction Date"], dayfirst=True, errors="coerce")
    df["Transaction Hour"] = pd.to_numeric(df["Transaction Hour"], errors="coerce")
    df.dropna(subset=["Transaction Date", "Transaction Hour"], inplace=True)
    df["timestamp"] = (
        df["Transaction Date"].dt.normalize()
        + pd.to_timedelta(df["Transaction Hour"].astype(int), unit="h")
    )

    # Normalize text fields and filter for allowed item classes
    for col in ["Item Class", "VISTA Item"]:
        df[col] = df[col].str.strip()
    df["Item Class"] = df["Item Class"].str.replace(r"\s+", " ", regex=True)
    df = df[df["Item Class"].str.upper().isin(ALLOWED_CLASSES)].copy()

    # Clean numeric columns
    df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce").fillna(1).clip(lower=1).astype(int)
    df["price_aud"] = pd.to_numeric(df["price_aud"], errors="coerce")

    # --- Price Imputation Hierarchy ---
    df["unit_price"] = df["price_aud"] / df["quantity"]
    # Reference price per product: the first positive unit price seen in this workbook
    unit_price_map = df[df["unit_price"] > 0].drop_duplicates("VISTA Item").set_index("VISTA Item")["unit_price"]
    # Compare in upper case, matching the case-insensitive class filter above
    is_chips = df["Item Class"].str.upper() == "SNACK - CHIPS"

    # Fill missing or non-positive prices using a 5-level cascade
    for level in range(5):
        missing_mask = df["unit_price"].isna() | (df["unit_price"] <= 0)
        if not missing_mask.any():
            break
        has_price = df["unit_price"] > 0  # only valid prices feed the medians

        if level == 0:  # 1. Exact product name match
            df.loc[missing_mask, "unit_price"] = df.loc[missing_mask, "VISTA Item"].map(unit_price_map)
        elif level == 1:  # 2. 'NO ICE'/'NO SUGAR' proxy match
            df.loc[missing_mask, "unit_price"] = df.loc[missing_mask, "VISTA Item"].apply(_remove_no_flag).map(unit_price_map)
        elif level == 2:  # 3. Median price for 'SNACK - CHIPS'
            chips_mask = missing_mask & is_chips
            if chips_mask.any():
                chips_median = df.loc[is_chips & has_price, "unit_price"].median()
                df.loc[chips_mask, "unit_price"] = chips_median
        elif level == 3:  # 4. Median price for the item's class
            class_medians = df[has_price].groupby("Item Class")["unit_price"].median()
            df.loc[missing_mask, "unit_price"] = df.loc[missing_mask, "Item Class"].map(class_medians)
        elif level == 4:  # 5. Global median price
            df.loc[missing_mask, "unit_price"] = df.loc[has_price, "unit_price"].median()
    
    # Recalculate total price after imputation
    df["price_aud"] = (df["unit_price"] * df["quantity"]).round(2)

    # Final selection and renaming
    return df.rename(columns={"Item Class": "item_class", "VISTA Item": "product_name"})[
        ["timestamp", "item_class", "product_name", "quantity", "price_aud"]
    ].reset_index(drop=True)

def _clean_one_session_workbook(xl_path: Path) -> pd.DataFrame:
    """Loads, cleans, and engineers features for a single session workbook."""
    print(f"→ Cleaning session file: {xl_path.name}")
    header_row = _find_header_row(xl_path, "Sheet1", "session date")
    df = pd.read_excel(xl_path, sheet_name="Sheet1", header=header_row, dtype=str)

    # Initial cleanup and type conversion
    df = df.drop(columns=["Film"], errors="ignore")
    df["Session Date"] = pd.to_datetime(df["Session Date"], dayfirst=True, errors="coerce")
    df["Session Hour"] = pd.to_numeric(df["Session Hour"], errors="coerce")
    df = df.dropna(subset=["Session Date", "Session Hour", "Genre", "Total Admits", "Duration"])

    # Apply business rule filters
    df = df[~df["Genre"].str.upper().str.strip().isin(EXCLUDE_GENRES)]
    df["duration_min"] = df["Duration"].apply(_parse_runtime_from_text)
    df = df[df["duration_min"] != PLACEHOLDER_RUNTIME]
    
    # Feature Engineering
    df["timestamp"] = (
        df["Session Date"].dt.normalize()
        + pd.to_timedelta(df["Session Hour"].astype(int), unit="h")
    )
    df["duration_category"] = df["duration_min"].apply(_get_duration_category)
    df["slot"] = df["timestamp"].apply(_map_timestamp_to_slot)
    
    # Keep only sessions within defined trading hours
    df = df[df["slot"] != "out_of_range"].copy()

    # Final renaming and column selection
    df.rename(columns={
        "Session Audio Language": "language", "Genre": "genre",
        "Censor Rating": "rating", "Total Admits": "admits"
    }, inplace=True)
    df["admits"] = pd.to_numeric(df["admits"], errors='coerce').fillna(0).astype(int)
    
    return df[
        ["timestamp", "language", "genre", "rating", "admits",
         "duration_min", "duration_category", "slot"]
    ].sort_values("timestamp").reset_index(drop=True)

# =============================================================================
# --- 4. PIPELINE STAGE FUNCTIONS ---
# =============================================================================
# Each function here corresponds to a major stage in the data pipeline.

def run_inventory_cleaning_stage(input_dir: Path, output_dir: Path) -> Path | None:
    """Stage 1: Discovers, cleans, and consolidates all inventory files."""
    print("\n--- Stage 1: Cleaning Inventory Transactions ---")
    source_files = sorted(p for p in input_dir.iterdir() if WORKBOOK_RX_INV.fullmatch(p.name))
    if not source_files:
        print(f"📭 No source inventory workbooks found in '{input_dir}'. Skipping.")
        return None

    all_frames = [_clean_one_inventory_workbook(p) for p in source_files]
    master_df = pd.concat(all_frames, ignore_index=True).sort_values("timestamp").reset_index(drop=True)
    print(f"\n✅ Cleaned inventory rows total: {len(master_df):,}")

    # Save the consolidated master file
    master_output_path = output_dir / "inventory_transactions_clean.xlsx"
    master_df.to_excel(master_output_path, index=False)
    print(f"  • {master_output_path.name} written")

    # Save per-year splits for easier ad-hoc analysis
    for year, group in master_df.groupby(master_df["timestamp"].dt.year):
        fname = f"inventory_transactions_clean_{year}.xlsx"
        group.to_excel(output_dir / fname, index=False)
        print(f"  • {fname}  ({len(group):,} rows)")
        
    return master_output_path

def run_transaction_ohe_stage(input_path: Path | None, output_dir: Path) -> None:
    """Stage 2: Creates a one-hot encoded matrix from cleaned transactions."""
    print("\n--- Stage 2: One-Hot Encoding Transactions ---")
    if not input_path or not input_path.exists():
        print(f"⚠️ Source file not found at '{input_path}'. Skipping OHE stage.")
        return

    transactions = pd.read_excel(input_path, parse_dates=["timestamp"])
    transactions["item_class"] = transactions["item_class"].str.upper().str.strip()
    transactions["product_name"] = transactions["product_name"].str.upper().str.strip()
    
    hourly_revenue = transactions.groupby("timestamp")["price_aud"].sum().round(2).rename("total_price_aud")
    ohe_combined = _build_ohe_matrix(transactions, ["item_class", "product_name"], keep_prefix=True)
    ohe_combined.insert(0, "total_price_aud", hourly_revenue)

    output_path = output_dir / "ohe_trx_item_class_product.xlsx"
    ohe_combined.to_excel(output_path)
    print(f"  • {output_path.name} written")

def run_session_cleaning_stage(input_dir: Path, output_dir: Path) -> Path | None:
    """Stage 3: Discovers, cleans, and consolidates all movie session files."""
    print("\n--- Stage 3: Cleaning Movie Sessions ---")
    source_files = sorted(p for p in input_dir.iterdir() if SOURCE_RX_SESS.fullmatch(p.name))
    if not source_files:
        print(f"📭 No source movie session workbooks found in '{input_dir}'. Skipping.")
        return None

    all_frames = [_clean_one_session_workbook(p) for p in source_files]
    master_df = pd.concat(all_frames, ignore_index=True).sort_values("timestamp")
    print(f"\n✅ Total cleaned sessions: {len(master_df):,}")

    master_output_path = output_dir / "movie_sessions_clean.xlsx"
    master_df.to_excel(master_output_path, index=False)
    print(f"• {master_output_path.name} written")

    for year, group in master_df.groupby(master_df["timestamp"].dt.year):
        fname = f"movie_sessions_clean_{year}.xlsx"
        group.to_excel(output_dir / fname, index=False)
        print(f"• {fname} ({len(group):,} rows)")
        
    return master_output_path

def run_session_expansion_stage(input_path: Path | None, output_dir: Path) -> Path | None:
    """Stage 4: Expands session data to an hourly level and OHE encodes it."""
    print("\n--- Stage 4: Expanding & OHE Movie Sessions ---")
    if not input_path or not input_path.exists():
        print(f"⚠️ Source file not found at '{input_path}'. Skipping expansion stage.")
        return None

    sessions = pd.read_excel(input_path, parse_dates=["timestamp"])
    sessions["end_time"] = sessions["timestamp"] + pd.to_timedelta(sessions["duration_min"], unit="m")

    # Explode each session row into multiple rows, one for each hour it's active
    expanded_rows = []
    for _, sess in sessions.iterrows():
        site_start = sess["timestamp"] - timedelta(hours=PRE_BUFFER_HRS)
        site_end = sess["end_time"] + timedelta(hours=POST_BUFFER_HRS)
        for hr in pd.date_range(site_start, site_end, freq="h", inclusive="left"):
            row = sess.copy()
            row["timestamp"] = hr
            row["in_show"] = int(sess["timestamp"] <= hr < sess["end_time"]) # Flag if movie is running
            expanded_rows.append(row)
    expanded = pd.DataFrame(expanded_rows).drop(columns=["end_time"])
    
    # OHE and prepare for aggregation
    ohe = pd.get_dummies(expanded, columns=CAT_COLS_SESS, prefix=CAT_COLS_SESS, prefix_sep="_", dtype="uint8")
    cat_dummies = [c for c in ohe.columns if any(c.startswith(f"{cat}_") for cat in CAT_COLS_SESS)]
    ohe[cat_dummies] = ohe[cat_dummies].mul(ohe["in_show"], axis=0)
    
    # Calculate admits-weighted metrics for aggregation
    ohe["admits_in_show"] = ohe["admits"] * ohe["in_show"]
    ohe["dur_x_admits"] = ohe["duration_min"] * ohe["admits_in_show"]

    # Aggregate all data to the hourly level
    agg_map = {col: "sum" for col in cat_dummies}
    agg_map.update({"admits": "sum", "admits_in_show": "sum", "dur_x_admits": "sum"})
    hourly = ohe.groupby("timestamp", as_index=False).agg(agg_map)
    
    # Calculate final hourly metrics
    hourly.rename(columns={"admits": "total_admits"}, inplace=True)
    hourly["avg_duration_min"] = hourly["dur_x_admits"].div(hourly["admits_in_show"]).round(1).fillna(0)
    
    # Final cleanup and column ordering
    final_cols = ["timestamp", "total_admits", "avg_duration_min"] + cat_dummies
    hourly = hourly[final_cols]
    
    output_path = output_dir / "ohe_movie_sessions_hourly_expanded.xlsx"
    hourly.to_excel(output_path, index=False)
    print(f"✓ Saved hourly-expanded sessions → {output_path.name}  ({len(hourly)} rows)")
    return output_path

def run_final_merge_stage(trx_ohe_path: Path | None, sess_ohe_path: Path | None, output_dir: Path, output_filename: str) -> None:
    """Stage 5: Merges OHE transaction and session data into the final dataset."""
    print("\n--- Stage 5: Merging to Final Dataset ---")
    if not (trx_ohe_path and trx_ohe_path.exists()):
        print(f"⚠️ Transaction OHE file not found at '{trx_ohe_path}'. Cannot perform merge.")
        return
    if not (sess_ohe_path and sess_ohe_path.exists()):
        print(f"⚠️ Session OHE file not found at '{sess_ohe_path}'. Cannot perform merge.")
        return

    trx_df = pd.read_excel(trx_ohe_path, index_col=0, parse_dates=[0]).reset_index().rename(columns={"index": "timestamp"})
    sess_df = pd.read_excel(sess_ohe_path, parse_dates=["timestamp"])
    
    # Validate data integrity before merging
    if trx_df["timestamp"].duplicated().any():
        raise ValueError("Duplicate timestamps found in transaction data.")
    if sess_df["timestamp"].duplicated().any():
        raise ValueError("Duplicate timestamps found in session data.")

    # Left join to keep every hour with a sale, and add corresponding session data
    merged = pd.merge(trx_df, sess_df, on="timestamp", how="left", validate="one_to_one")
    print(f"✓ Merged table size: {len(merged):,} rows × {merged.shape[1]} columns")
    
    # Reorder key columns to the front for better readability
    lead_cols = ["timestamp", "total_admits", "total_price_aud", "avg_duration_min"]
    lead_cols_exist = [c for c in lead_cols if c in merged.columns]
    merged = merged[lead_cols_exist + [c for c in merged.columns if c not in lead_cols_exist]]
    
    final_output_path = output_dir / output_filename
    merged.to_excel(final_output_path, index=False)
    print(f"• Saved final dataset to: {final_output_path}")

# =============================================================================
# --- 5. MAIN PIPELINE WORKFLOW ---
# =============================================================================

def run_pipeline(output_filename: str = "train_dataset.xlsx") -> None:
    """
    Executes the full data processing pipeline from raw files to the final dataset.

    Args:
        output_filename: The name for the final merged dataset file.
    """
    print(f"\n--- Starting Data Cleanup Pipeline for '{output_filename}' ---\n")
    
    # Ensure input/output directories exist
    INPUT_DIR.mkdir(exist_ok=True)
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

    # Stage 1: Clean Inventory
    cleaned_inventory_path = run_inventory_cleaning_stage(INPUT_DIR, OUTPUT_DIR)
    
    # Stage 2: OHE Transactions
    run_transaction_ohe_stage(cleaned_inventory_path, OUTPUT_DIR)
    
    # Stage 3: Clean Sessions
    cleaned_sessions_path = run_session_cleaning_stage(INPUT_DIR, OUTPUT_DIR)
    
    # Stage 4: Expand and OHE Sessions
    expanded_sessions_path = run_session_expansion_stage(cleaned_sessions_path, OUTPUT_DIR)
    
    # Stage 5: Final Merge
    trx_ohe_final_path = OUTPUT_DIR / "ohe_trx_item_class_product.xlsx"
    run_final_merge_stage(trx_ohe_final_path, expanded_sessions_path, OUTPUT_DIR, output_filename)
    
    print(f"\n🎉 Pipeline complete. Final file '{output_filename}' is in '{OUTPUT_DIR}/'.")

# =============================================================================
# --- SCRIPT ENTRY POINT ---
# =============================================================================

if __name__ == '__main__':
    # Determines the output filename from the name of the current working
    # directory (BASE_DIR), not from the location of this script.
    # This allows the same script to be used for 'training' and 'testing' datasets.
    current_folder_name = BASE_DIR.resolve().name.lower()
    
    if "train" in current_folder_name:
        final_file = "train_dataset.xlsx"
    elif "test" in current_folder_name:
        final_file = "test_dataset.xlsx"
    else:
        final_file = "processed_dataset.xlsx"
        warnings.warn(f"Could not determine context from folder name. Defaulting output to '{final_file}'.")
    
    run_pipeline(output_filename=final_file)


--- Starting Data Cleanup Pipeline for 'test_dataset.xlsx' ---


--- Stage 1: Cleaning Inventory Transactions ---
→ Cleaning inventory file: Inventory Transaction Data Mar 2025 v1.xlsx

✅ Cleaned inventory rows total: 7,417
  • inventory_transactions_clean.xlsx written
  • inventory_transactions_clean_2025.xlsx  (7,417 rows)

--- Stage 2: One-Hot Encoding Transactions ---
  • ohe_trx_item_class_product.xlsx written

--- Stage 3: Cleaning Movie Sessions ---
→ Cleaning session file: Movie_sessions_Mar 2025 v1.xlsx

✅ Total cleaned sessions: 1,433
• movie_sessions_clean.xlsx written
• movie_sessions_clean_2025.xlsx (1,433 rows)

--- Stage 4: Expanding & OHE Movie Sessions ---
✓ Saved hourly-expanded sessions → ohe_movie_sessions_hourly_expanded.xlsx  (435 rows)

--- Stage 5: Merging to Final Dataset ---
✓ Merged table size: 381 rows × 176 columns
• Saved final dataset to: output/test_dataset.xlsx

🎉 Pipeline complete. Final file 'test_dataset.xlsx' is in 'output/'.
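
In the run above, Stage 4 wrote 435 session-hours while the merged table has 381 rows. Because Stage 5 is a left join from the transaction side, the merged row count follows the hourly transaction table, and session-hours without sales drop out. The sketch below shows one way to count both kinds of mismatch on toy frames; the values are hypothetical, and `indicator=True` is an addition for diagnosis that the pipeline itself does not use.

```python
import pandas as pd

# Toy hourly frames (hypothetical values) standing in for the Stage 2 and Stage 4 outputs.
trx = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2025-03-01 09:00", "2025-03-01 10:00", "2025-03-01 11:00"]
    ),
    "total_price_aud": [12.5, 40.0, 8.0],
})
sess = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2025-03-01 10:00", "2025-03-01 11:00", "2025-03-01 12:00"]
    ),
    "total_admits": [55, 80, 30],
})

merged = trx.merge(
    sess, on="timestamp", how="left", validate="one_to_one", indicator=True
)

# Sale hours with no session row carry NaN in the session columns.
unmatched_sales = int((merged["_merge"] == "left_only").sum())
# Session hours with no sales never enter a left join at all.
dropped_sessions = int((~sess["timestamp"].isin(trx["timestamp"])).sum())
print(len(merged), unmatched_sales, dropped_sessions)
```

Any sale hour reported as unmatched will have missing session features in the final file, so downstream consumers may need to fill or filter those rows.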

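The log shows one workbook discovered per cleaning stage. If a stage instead reports that no source workbooks were found, checking the file names against the discovery pattern can help. The sketch below copies the inventory pattern from the script; the first candidate name appears in the log, and the remaining names are hypothetical variants.

```python
import re

# Pattern copied from WORKBOOK_RX_INV in the script above.
inv_rx = re.compile(
    r"Inventory Transaction Data "
    r"(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{4}|\d{4})"
    r"(?:\s+v\d+(?:\.\d+)?)?"
    r"\.xlsx$",
    re.I,
)

candidates = [
    "Inventory Transaction Data Mar 2025 v1.xlsx",   # name seen in the log
    "Inventory Transaction Data 2023 v0.1.xlsx",     # year-only variant
    "Inventory Transaction Data March 2025.xlsx",    # hypothetical: full month name
    "Inventory_Transaction_Data Mar 2025.xlsx",      # hypothetical: underscores
]

# True means Stage 1 would pick the file up; False means it would be skipped.
matches = {name: bool(inv_rx.fullmatch(name)) for name in candidates}
for name, ok in matches.items():
    print(f"{'match' if ok else 'skip '}  {name}")
```

Files that are skipped here would need renaming to a three-letter month or year-only form before Stage 1 picks them up.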