# Scientific Marketplace Pricing Analysis

This notebook fetches marketplace prices and applies scientific outlier detection using **MAD (Median Absolute Deviation)** to produce clean, valid price ranges per SKU.

## Key Features:
- **MAD-based outlier detection**: MAD has a 50% breakdown point, so it stays stable until almost half the values are outliers (3.5 threshold per Iglewicz & Hoaglin, 1993)
- **Configurable percentile ranges**: Default P10-P90, easily adjustable
- **Transparent logging**: Track rows removed at each filtering step
- **WAC validation**: Filter prices within acceptable cost margins
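
The robustness claim can be checked on a toy example. The sketch below is a dependency-free illustration; the price list and the classical cutoff of 3 are assumptions chosen for the demo, not values from this notebook. One extreme price inflates the standard deviation enough to hide itself from a classical z-score, while the MAD-based modified z-score still flags it.

```python
import statistics

# Hypothetical prices for one SKU; 500 is the planted outlier.
prices = [98, 100, 101, 102, 103, 500]

# Classical z-score: the outlier inflates both the mean and the standard deviation.
mean = statistics.mean(prices)
stdev = statistics.stdev(prices)
classic_z = [(p - mean) / stdev for p in prices]

# Modified z-score (Iglewicz & Hoaglin): the median and MAD barely move.
median = statistics.median(prices)
mad = statistics.median([abs(p - median) for p in prices])  # nonzero for this data
modified_z = [0.6745 * (p - median) / mad for p in prices]

flagged_classic = [p for p, z in zip(prices, classic_z) if abs(z) > 3]
flagged_mad = [p for p, z in zip(prices, modified_z) if abs(z) > 3.5]

print("classical z-score flags:", flagged_classic)  # []
print("modified z-score flags:", flagged_mad)       # [500]
```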


In [None]:
# =============================================================================
# CONFIGURATION - All adjustable parameters in one place
# =============================================================================

# MAD Outlier Detection Settings
# Threshold of 3.5 is the standard recommendation by Iglewicz and Hoaglin (1993)
# Lower values = more aggressive outlier removal, Higher values = more permissive
MAD_THRESHOLD = 3.5

# WAC (Weighted Average Cost) Filter Settings
# Prices outside this percentage range from WAC4 will be excluded
WAC_LOWER_BOUND = -40  # Minimum acceptable % difference from WAC4
WAC_UPPER_BOUND = 40   # Maximum acceptable % difference from WAC4

# Percentile Settings for Price Bounds
# Define multiple percentiles to get a full price distribution (4-5 price points)
PERCENTILES = [10, 25, 50, 75, 90]  # P10, P25, P50 (median), P75, P90

# Sales Data Settings
SALES_LOOKBACK_DAYS = 100  # Days to look back for historical sales
RECENT_SALES_DAYS = 5      # Days to consider as "recent" sales

# Display Settings
SAMPLE_PRODUCT_ID = 1309  # Product ID to use for sample output verification

print("Configuration loaded successfully!")
print(f"  - MAD Threshold: {MAD_THRESHOLD}")
print(f"  - WAC Range: {WAC_LOWER_BOUND}% to {WAC_UPPER_BOUND}%")
print(f"  - Percentiles: {PERCENTILES}")


In [None]:
# =============================================================================
# IMPORTS AND ENVIRONMENT SETUP
# =============================================================================

import pandas as pd
import numpy as np
from tqdm import tqdm
import warnings
from IPython.display import display
import os

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

# Initialize environment (for Snowflake credentials)
import setup_environment_2
import importlib
importlib.reload(setup_environment_2)
setup_environment_2.initialize_env()

print("Environment initialized successfully!")


In [None]:
# =============================================================================
# HELPER FUNCTIONS
# =============================================================================

def query_snowflake(query, columns=None):
    """
    Execute a Snowflake query and return results as a DataFrame.
    
    Parameters:
    -----------
    query : str
        SQL query to execute
    columns : list, optional
        Column names for the resulting DataFrame (positional labels if omitted)
        
    Returns:
    --------
    pd.DataFrame
        Query results, or an empty DataFrame if the query fails
    """
    import snowflake.connector
    
    con = snowflake.connector.connect(
        user=os.environ["SNOWFLAKE_USERNAME"],
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        database=os.environ["SNOWFLAKE_DATABASE"]
    )
    cur = None  # stays None if cursor creation fails, so cleanup is safe
    try:
        cur = con.cursor()
        cur.execute("USE WAREHOUSE COMPUTE_WH")
        cur.execute(query)
        rows = np.array(cur.fetchall())
        # Pass column names only when provided; otherwise pandas labels them 0..n-1
        return pd.DataFrame(rows, columns=columns if columns else None)
    except Exception as e:
        print(f"Error: {e}")
        return pd.DataFrame()
    finally:
        if cur is not None:
            cur.close()
        con.close()


def mad_filter(data, threshold=3.5):
    """
    Apply MAD (Median Absolute Deviation) outlier detection.
    
    MAD is more robust than the standard deviation when data contain outliers
    or are non-normal. Its breakdown point is 50%, whereas the standard
    deviation's is 0% (a single extreme value can inflate it without bound)
    and the IQR's is 25%.
    
    The modified Z-score formula:
        M_i = 0.6745 * (x_i - median) / MAD
    
    Where 0.6745 is the consistency constant that makes MAD comparable 
    to standard deviation for normally distributed data.
    
    Reference: Iglewicz, B. and Hoaglin, D. (1993), 
               "How to Detect and Handle Outliers"
    
    Parameters:
    -----------
    data : array-like
        Numeric data to filter
    threshold : float
        Modified Z-score threshold (default 3.5 per Iglewicz & Hoaglin)
        
    Returns:
    --------
    np.ndarray
        Boolean mask where True = inlier, False = outlier
    """
    data = np.array(data, dtype=float)
    median = np.median(data)
    mad = np.median(np.abs(data - median))
    
    # MAD is 0 when more than half the values equal the median; with no usable
    # scale, every value is treated as an inlier
    if mad == 0:
        return np.ones(len(data), dtype=bool)
    
    # Calculate modified Z-scores
    # 0.6745 is the consistency constant for normal distribution
    modified_z_scores = 0.6745 * (data - median) / mad
    
    return np.abs(modified_z_scores) < threshold


def apply_mad_filter_to_group(group, column, threshold=3.5):
    """
    Apply MAD filter to a specific column within a group.
    
    Parameters:
    -----------
    group : pd.DataFrame
        Grouped DataFrame
    column : str
        Column name to filter on
    threshold : float
        MAD threshold
        
    Returns:
    --------
    pd.DataFrame
        Filtered group with outliers removed
    """
    if len(group) < 3:
        # Not enough data points for meaningful outlier detection
        return group
    
    mask = mad_filter(group[column].values, threshold)
    return group[mask]


def get_percentile_prices(group, column='price', percentiles=[10, 25, 50, 75, 90]):
    """
    Calculate multiple percentile-based prices for a group.
    
    Parameters:
    -----------
    group : pd.DataFrame
        Grouped DataFrame
    column : str
        Column to calculate percentiles on
    percentiles : list
        List of percentiles to calculate (e.g., [10, 25, 50, 75, 90])
        
    Returns:
    --------
    pd.Series
        Statistics including all percentile values and count
    """
    values = group[column].astype(float)
    
    # Build result dictionary with percentile prices
    result = {}
    for p in percentiles:
        result[f'price_p{p}'] = values.quantile(p / 100)
    
    # Add additional statistics
    result['price_count'] = len(values)
    result['true_min'] = values.min()
    result['true_max'] = values.max()
    
    return pd.Series(result)


def log_filtering_step(df_before, df_after, step_name):
    """
    Log the results of a filtering step.
    
    Parameters:
    -----------
    df_before : pd.DataFrame
        DataFrame before filtering
    df_after : pd.DataFrame
        DataFrame after filtering
    step_name : str
        Name of the filtering step
    """
    rows_removed = len(df_before) - len(df_after)
    pct_removed = (rows_removed / len(df_before) * 100) if len(df_before) > 0 else 0
    
    print(f"  {step_name}:")
    print(f"    - Before: {len(df_before):,} rows")
    print(f"    - After: {len(df_after):,} rows")
    print(f"    - Removed: {rows_removed:,} rows ({pct_removed:.2f}%)")


print("Helper functions loaded successfully!")


## Step 1: Data Fetching

Fetch marketplace prices, packing unit mappings, and WAC (cost) data from Snowflake.
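
The fetch cells below pass column names by hand. As a hedged alternative, DB-API 2.0 cursors expose the result column names through `cursor.description`, so the names can be read from the query itself. The sketch uses the standard-library `sqlite3` module as a stand-in for the Snowflake connection; the helper `fetch_rows` and the in-memory table are hypothetical and only illustrate the pattern.

```python
import sqlite3


def fetch_rows(con, query):
    """Run a query on a DB-API connection and return rows as dicts keyed by column name."""
    cur = con.cursor()
    try:
        cur.execute(query)
        names = [col[0] for col in cur.description]  # first field is the column name
        return [dict(zip(names, row)) for row in cur.fetchall()]
    finally:
        cur.close()


# In-memory stand-in for the packing unit table (illustrative data only).
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE packing_unit_products "
    "(product_id INTEGER, packing_unit_id INTEGER, basic_unit_count INTEGER)"
)
con.executemany(
    "INSERT INTO packing_unit_products VALUES (?, ?, ?)",
    [(1309, 1, 1), (1309, 2, 12)],
)

rows = fetch_rows(
    con,
    "SELECT product_id, packing_unit_id AS pu_id, basic_unit_count AS buc "
    "FROM packing_unit_products ORDER BY pu_id",
)
con.close()

for row in rows:
    print(row)
```

Reading names from the cursor keeps the column list and the `SELECT` clause from drifting apart; how a given driver cases the names is driver-specific and not covered here.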


In [None]:
# =============================================================================
# FETCH MARKETPLACE PRICES
# =============================================================================
# Get active seller prices with region mapping

print("Fetching marketplace prices...")

mp_query = '''
WITH seller_region AS (
    SELECT 
        seller_retailer.retailer_id,
        CASE 
            WHEN regions.name_en = 'Greater Cairo' THEN cities.name_en 
            ELSE regions.name_en 
        END AS region,
        seller_id,
        seller_retailer.POLYGON_ID
    FROM MATERIALIZED_VIEWS.SELLERS_RETAILERS_MAPPING seller_retailer
    JOIN retailers ON retailers.id = seller_retailer.retailer_id
    JOIN materialized_views.retailer_polygon ON materialized_views.retailer_polygon.retailer_id = retailers.id
    JOIN districts ON districts.id = materialized_views.retailer_polygon.district_id
    JOIN cities ON cities.id = districts.city_id
    JOIN states ON states.id = cities.state_id
    JOIN regions ON regions.id = states.region_id
    JOIN egypt_marketplace.sellers ON sellers.id = seller_retailer.seller_id AND sellers.status = 'ACTIVE'
),

recent_price AS (
    SELECT
        wp.product_id AS product_id,
        wp.packing_unit_id AS product_pu,
        wp.price AS price,
        wp.max_per_order,
        warehouses.seller_id AS seller_id,
        warehouses.MIN_TICKET_SIZE
    FROM egypt_marketplace.warehouse_products wp
    LEFT JOIN egypt_marketplace.warehouses ON warehouses.id = wp.warehouse_id 
    WHERE wp.AVAILABLE > 0 
        AND wp.total_stock > 0
        AND activation = 'true'
)

SELECT DISTINCT
    seller_region.region,
    recent_price.*
FROM recent_price
JOIN seller_region ON seller_region.seller_id = recent_price.seller_id
'''

mp_raw = query_snowflake(
    mp_query, 
    columns=['region', 'product_id', 'product_pu', 'price', 'max_per_order', 'seller_id', 'min_ticket_size']
)

# Convert data types
for col in ['product_id', 'product_pu', 'price', 'max_per_order', 'seller_id', 'min_ticket_size']:
    mp_raw[col] = pd.to_numeric(mp_raw[col])

print(f"  Fetched {len(mp_raw):,} marketplace price records")
print(f"  Unique products: {mp_raw['product_id'].nunique():,}")
print(f"  Unique sellers: {mp_raw['seller_id'].nunique():,}")
print(f"  Regions: {mp_raw['region'].unique().tolist()}")


In [None]:
# =============================================================================
# FETCH PACKING UNIT DATA
# =============================================================================
# Get packing unit to product mapping with basic unit count (BUC)

print("Fetching packing unit data...")

pu_query = '''
SELECT
    product_id,
    PACKING_UNIT_ID AS pu_id,
    BASIC_UNIT_COUNT AS buc
FROM PACKING_UNIT_PRODUCTS
'''

packing_units = query_snowflake(pu_query, columns=['product_id', 'pu_id', 'buc'])

# Convert data types
packing_units['product_id'] = pd.to_numeric(packing_units['product_id'])
packing_units['pu_id'] = pd.to_numeric(packing_units['pu_id'])
packing_units['buc'] = pd.to_numeric(packing_units['buc'])

print(f"  Fetched {len(packing_units):,} packing unit mappings")
print(f"  Unique products: {packing_units['product_id'].nunique():,}")


In [None]:
# =============================================================================
# FETCH WAC (WEIGHTED AVERAGE COST) DATA
# =============================================================================
# Get current cost data for price validation

print("Fetching WAC data...")

wac_query = '''
SELECT 
    f.product_id,
    f.wac1,
    f.wac4,
    f.wac_p
FROM finance.all_cogs f
WHERE current_timestamp BETWEEN f.from_date AND f.to_date
'''

wac_data = query_snowflake(wac_query, columns=['product_id', 'wac1', 'wac4', 'wac_p'])

# Convert data types
wac_data['product_id'] = pd.to_numeric(wac_data['product_id'])
wac_data['wac1'] = pd.to_numeric(wac_data['wac1'])
wac_data['wac4'] = pd.to_numeric(wac_data['wac4'])
wac_data['wac_p'] = pd.to_numeric(wac_data['wac_p'])

print(f"  Fetched WAC data for {len(wac_data):,} products")

# Create packing unit WAC by joining with BUC
pu_wac = pd.merge(packing_units, wac_data, on='product_id', how='left')
pu_wac['pu_wac1'] = pu_wac['buc'] * pu_wac['wac1']
pu_wac['pu_wac4'] = pu_wac['buc'] * pu_wac['wac4']

print(f"  Created PU-level WAC for {len(pu_wac):,} product-PU combinations")


## Step 2: WAC Mapping and Initial Filtering

Map prices to packing units and filter based on WAC4 percentage bounds.
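
A worked example of the arithmetic may help. The sketch below assumes a hypothetical product with a unit WAC4 of 9.0 that exists as a single unit (`buc = 1`) and as a case of 12 (`buc = 12`); the function names and the `None` return for a zero cost are illustrative choices, not part of the notebook. It applies the same formula, `(price - pu_wac4) / pu_wac4 * 100`, with the default bounds of -40% to 40%.

```python
def wac4_pct_diff(price, buc, wac4):
    """Percentage difference between a listed price and the packing-unit WAC4."""
    pu_wac4 = buc * wac4  # cost of the whole packing unit
    if pu_wac4 == 0:
        return None  # no cost basis, so the price cannot be validated
    return round((price - pu_wac4) / pu_wac4 * 100, 2)


def passes_wac_filter(diff, lower=-40, upper=40):
    """True when the difference lies inside the inclusive bounds."""
    return diff is not None and lower <= diff <= upper


# The same listed price of 120 compared against two packing units.
case_diff = wac4_pct_diff(price=120, buc=12, wac4=9.0)
unit_diff = wac4_pct_diff(price=120, buc=1, wac4=9.0)

print(case_diff, passes_wac_filter(case_diff))  # 11.11 True
print(unit_diff, passes_wac_filter(unit_diff))  # 1233.33 False
```

A price of 120 is plausible for the case of 12 but not for the single unit, so only the first pairing survives the filter.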


In [None]:
# =============================================================================
# WAC MAPPING AND INITIAL FILTER
# =============================================================================
# Join marketplace prices with WAC data and filter by acceptable cost margins

print("=" * 60)
print("STEP 2: WAC Mapping and Initial Filtering")
print("=" * 60)

# Rename column for clarity
mp_data = mp_raw.copy()
mp_data.rename(columns={'product_pu': 'mp_pu_id'}, inplace=True)

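# Join with packing unit WAC data on product_id only, so each price row is
# paired with every pu_id of the product; the WAC4 filter below then removes
# pairings whose price is implausible for that pu_id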
mp_with_wac = pd.merge(
    mp_data,
    pu_wac,
    how='inner',
    on='product_id'
)

print(f"\nAfter joining with WAC data: {len(mp_with_wac):,} rows")

# Calculate percentage difference from WAC4
# Formula: (price - pu_wac4) / pu_wac4 * 100
mp_with_wac['wac4_pct_diff'] = (
    (mp_with_wac['price'].astype(float) - mp_with_wac['pu_wac4'].astype(float)) 
    / mp_with_wac['pu_wac4'].astype(float) * 100
).round(2)

# Apply WAC filter using configurable bounds
print(f"\nApplying WAC4 filter: {WAC_LOWER_BOUND}% to {WAC_UPPER_BOUND}%")

before_wac_filter = len(mp_with_wac)
mp_wac_filtered = mp_with_wac[
    (mp_with_wac['wac4_pct_diff'] >= WAC_LOWER_BOUND) & 
    (mp_with_wac['wac4_pct_diff'] <= WAC_UPPER_BOUND)
].copy()

log_filtering_step(mp_with_wac, mp_wac_filtered, "WAC4 Filter")

# Select relevant columns for further processing
mp_clean = mp_wac_filtered[[
    'region', 'product_id', 'price', 'max_per_order', 
    'seller_id', 'min_ticket_size', 'pu_id'
]].copy()

print(f"\nData ready for outlier detection: {len(mp_clean):,} rows")


## Step 3: Scientific Outlier Removal (MAD)

Apply MAD-based outlier detection to `min_ticket_size`, `max_per_order`, and `price`, in that order, within each (region, product_id, pu_id) group.
MAD (Median Absolute Deviation) has a 50% breakdown point, which suits marketplace data where a few sellers post extreme values.
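
Two edge cases in the per-group filter are worth seeing on concrete numbers. The sketch below re-implements the group logic on plain lists (the function name `mad_inlier_mask` and the three sample groups are hypothetical): groups with fewer than three rows are left untouched, and a group where more than half the values are identical has a MAD of 0, so nothing is removed from it.

```python
import statistics


def mad_inlier_mask(values, threshold=3.5, min_size=3):
    """Mirror of the notebook's per-group logic on a plain list of numbers."""
    if len(values) < min_size:
        return [True] * len(values)  # too few points: keep everything
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return [True] * len(values)  # no usable scale: keep everything
    return [abs(0.6745 * (v - median) / mad) < threshold for v in values]


spread_group = [48, 50, 52, 51, 900]  # varied prices plus one extreme value
tied_group = [50, 50, 50, 50, 900]    # most sellers post the same price
tiny_group = [10, 900]                # fewer than three rows

print(mad_inlier_mask(spread_group))  # [True, True, True, True, False]
print(mad_inlier_mask(tied_group))    # [True, True, True, True, True]
print(mad_inlier_mask(tiny_group))    # [True, True]
```

In the tied and tiny groups the value 900 survives the MAD step, so for such groups the upstream WAC4 bounds are the only protection against extreme prices.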


In [None]:
# =============================================================================
# SCIENTIFIC OUTLIER REMOVAL USING MAD
# =============================================================================
# Apply MAD filter to each metric per group (region, product_id, pu_id)
# MAD has a higher breakdown point (50%) than IQR (25%), so it tolerates more contamination

print("=" * 60)
print("STEP 3: Scientific Outlier Removal (MAD)")
print("=" * 60)
print(f"Using MAD threshold: {MAD_THRESHOLD}")

# Define grouping columns
GROUP_COLS = ['region', 'product_id', 'pu_id']

# Start with WAC-filtered data
df = mp_clean.copy()
initial_count = len(df)

# -----------------------------------------------------------------------------
# 3.1 Filter outliers in min_ticket_size
# -----------------------------------------------------------------------------
print("\n3.1 Filtering min_ticket_size outliers...")

df_before = df.copy()
df = df.groupby(GROUP_COLS, group_keys=False).apply(
    lambda g: apply_mad_filter_to_group(g, 'min_ticket_size', MAD_THRESHOLD)
)
log_filtering_step(df_before, df, "MAD filter on min_ticket_size")

# -----------------------------------------------------------------------------
# 3.2 Filter outliers in max_per_order
# -----------------------------------------------------------------------------
print("\n3.2 Filtering max_per_order outliers...")

df_before = df.copy()
df = df.groupby(GROUP_COLS, group_keys=False).apply(
    lambda g: apply_mad_filter_to_group(g, 'max_per_order', MAD_THRESHOLD)
)
log_filtering_step(df_before, df, "MAD filter on max_per_order")

# -----------------------------------------------------------------------------
# 3.3 Filter outliers in price
# -----------------------------------------------------------------------------
print("\n3.3 Filtering price outliers...")

df_before = df.copy()
df = df.groupby(GROUP_COLS, group_keys=False).apply(
    lambda g: apply_mad_filter_to_group(g, 'price', MAD_THRESHOLD)
)
log_filtering_step(df_before, df, "MAD filter on price")

# -----------------------------------------------------------------------------
# Summary
# -----------------------------------------------------------------------------
mp_filtered = df.copy()
final_count = len(mp_filtered)
total_removed = initial_count - final_count
total_pct = (total_removed / initial_count * 100) if initial_count > 0 else 0

print("\n" + "=" * 60)
print("OUTLIER REMOVAL SUMMARY")
print("=" * 60)
print(f"Initial records: {initial_count:,}")
print(f"Final records: {final_count:,}")
print(f"Total removed: {total_removed:,} ({total_pct:.2f}%)")
print(f"Unique SKUs (region-product-pu): {mp_filtered.groupby(GROUP_COLS).ngroups:,}")


## Step 4: Multi-Percentile Price Calculation

Calculate 5 percentile-based prices (P10, P25, P50, P75, P90) for each SKU to get a full price distribution.
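
With only a handful of prices per SKU, percentiles usually fall between observed values. pandas' `Series.quantile` defaults to linear interpolation, and the sketch below reproduces that rule in plain Python on a hypothetical five-price group so the numbers can be checked by hand; `percentile_linear` is an illustrative helper, not a notebook function.

```python
import math


def percentile_linear(values, p):
    """Percentile p (0-100) using linear interpolation between order statistics."""
    data = sorted(values)
    if not data:
        raise ValueError("values must not be empty")
    pos = (len(data) - 1) * p / 100      # fractional index into the sorted data
    lower, upper = math.floor(pos), math.ceil(pos)
    fraction = pos - lower
    return data[lower] + (data[upper] - data[lower]) * fraction


prices = [10, 20, 30, 40, 50]  # hypothetical filtered prices for one SKU
result = {f"price_p{p}": round(percentile_linear(prices, p), 2)
          for p in [10, 25, 50, 75, 90]}
print(result)
```

For this group P10 lands 40% of the way from 10 to 20 (14.0) and P90 lands 60% of the way from 40 to 50 (46.0), so neither bound equals an actual listed price.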


In [None]:
# =============================================================================
# CALCULATE PRICE RANGES (MULTIPLE PERCENTILES)
# =============================================================================
# Compute multiple percentile-based prices per SKU

print("=" * 60)
print("STEP 4: Price Range Calculation")
print("=" * 60)
print(f"Calculating percentiles: {PERCENTILES}")

# Calculate price statistics per group with multiple percentiles
price_bounds = (
    mp_filtered
    .groupby(GROUP_COLS)
    .apply(lambda g: get_percentile_prices(
        g, 
        column='price', 
        percentiles=PERCENTILES
    ))
    .reset_index()
)

# Calculate mode price per group (most common price point)
def calculate_mode_price(group):
    """Get the mode (most frequent) price for a group."""
    mode_result = group['price'].mode()
    return mode_result.iloc[0] if len(mode_result) > 0 else np.nan

mode_prices = (
    mp_filtered
    .groupby(GROUP_COLS)
    .apply(calculate_mode_price)
    .reset_index(name='price_mode')
)

# Merge mode prices into price bounds
price_bounds = pd.merge(price_bounds, mode_prices, on=GROUP_COLS, how='left')

# Round all numeric columns
numeric_cols = [f'price_p{p}' for p in PERCENTILES] + ['true_min', 'true_max', 'price_mode']
for col in numeric_cols:
    if col in price_bounds.columns:
        price_bounds[col] = price_bounds[col].round(2)

print(f"\nGenerated price ranges for {len(price_bounds):,} SKUs")
print(f"\nOutput columns: {price_bounds.columns.tolist()}")


## Step 5: Sales-Weighted Average (Optional)

Calculate weighted average prices based on actual sales NMV to understand market-validated pricing.
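
The weighting formula is `sum(price * nmv) / sum(nmv)`. A minimal sketch on hypothetical numbers shows how it differs from a plain mean; the helper name and the `None` return for zero total NMV are assumptions added for the illustration.

```python
def nmv_weighted_price(rows):
    """NMV-weighted average price for (price, nmv) pairs; None if total NMV is zero."""
    total_nmv = sum(nmv for _, nmv in rows)
    if total_nmv == 0:
        return None  # no sales volume to weight by
    return sum(price * nmv for price, nmv in rows) / total_nmv


# Two hypothetical sellers of the same SKU: the cheaper one drives most of the sales.
rows = [(100, 900), (110, 100)]

weighted = nmv_weighted_price(rows)
simple_mean = sum(price for price, _ in rows) / len(rows)

print(weighted)     # 101.0
print(simple_mean)  # 105.0
```

The weighted figure sits close to 100 because 90% of the NMV was sold at that price, which is the sense in which the result reflects prices that buyers accepted.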


In [None]:
# =============================================================================
# FETCH SALES DATA FOR WEIGHTED AVERAGES
# =============================================================================
# Get historical and recent sales to calculate NMV-weighted average prices

print("=" * 60)
print("STEP 5: Sales-Weighted Average Calculation")
print("=" * 60)

# -----------------------------------------------------------------------------
# 5.1 Fetch historical sales data
# -----------------------------------------------------------------------------
print(f"\nFetching historical sales (last {SALES_LOOKBACK_DAYS} days)...")

historical_sales_query = f'''
SELECT
    seller_id,
    product_id,
    packing_unit_id AS pu_id,
    item_price,
    SUM(sop.total_price) AS nmv
FROM egypt_marketplace.sales_orders so
JOIN egypt_marketplace.sales_order_products sop ON sop.order_id = so.id
WHERE so.status = 6 
    AND so.created_at::date >= current_date - {SALES_LOOKBACK_DAYS}
GROUP BY ALL
'''

mp_sales_historical = query_snowflake(
    historical_sales_query, 
    columns=['seller_id', 'product_id', 'pu_id', 'item_price', 'nmv']
)

for col in ['seller_id', 'product_id', 'pu_id', 'item_price', 'nmv']:
    mp_sales_historical[col] = pd.to_numeric(mp_sales_historical[col])

print(f"  Fetched {len(mp_sales_historical):,} historical sales records")

# -----------------------------------------------------------------------------
# 5.2 Fetch recent sales data
# -----------------------------------------------------------------------------
print(f"\nFetching recent sales (last {RECENT_SALES_DAYS} days)...")

recent_sales_query = f'''
SELECT
    seller_id,
    product_id,
    packing_unit_id AS pu_id,
    item_price,
    SUM(sop.total_price) AS nmv
FROM egypt_marketplace.sales_orders so
JOIN egypt_marketplace.sales_order_products sop ON sop.order_id = so.id
WHERE so.status NOT IN (3, 7, 8)
    AND so.created_at::date >= current_date - {RECENT_SALES_DAYS}
GROUP BY ALL
'''

mp_sales_recent = query_snowflake(
    recent_sales_query, 
    columns=['seller_id', 'product_id', 'pu_id', 'item_price', 'nmv']
)

for col in ['seller_id', 'product_id', 'pu_id', 'item_price', 'nmv']:
    mp_sales_recent[col] = pd.to_numeric(mp_sales_recent[col])

print(f"  Fetched {len(mp_sales_recent):,} recent sales records")


In [None]:
# =============================================================================
# CALCULATE WEIGHTED AVERAGE PRICES
# =============================================================================
# Join sales with filtered MP data and compute NMV-weighted averages

# -----------------------------------------------------------------------------
# 5.3 Calculate historical weighted average
# -----------------------------------------------------------------------------
print("\nCalculating historical weighted average prices...")

# Join filtered data with historical sales
filtered_with_sales = pd.merge(
    mp_filtered,
    mp_sales_historical,
    how='inner',
    on=['seller_id', 'product_id', 'pu_id']
)

# Calculate weighted average: sum(price * nmv) / sum(nmv)
filtered_with_sales['price'] = pd.to_numeric(filtered_with_sales['price'])
filtered_with_sales['nmv'] = pd.to_numeric(filtered_with_sales['nmv'])

weighted_avg_historical = (
    filtered_with_sales
    .groupby(GROUP_COLS)
    .apply(lambda g: (g['price'] * g['nmv']).sum() / g['nmv'].sum())
    .reset_index(name='weighted_avg_price')
)

print(f"  Calculated weighted average for {len(weighted_avg_historical):,} SKUs")

# -----------------------------------------------------------------------------
# 5.4 Calculate recent weighted average
# -----------------------------------------------------------------------------
print("\nCalculating recent weighted average prices...")

# Join filtered data with recent sales
filtered_with_recent = pd.merge(
    mp_filtered,
    mp_sales_recent,
    how='inner',
    on=['seller_id', 'product_id', 'pu_id']
)

filtered_with_recent['price'] = pd.to_numeric(filtered_with_recent['price'])
filtered_with_recent['nmv'] = pd.to_numeric(filtered_with_recent['nmv'])

weighted_avg_recent = (
    filtered_with_recent
    .groupby(GROUP_COLS)
    .apply(lambda g: (g['price'] * g['nmv']).sum() / g['nmv'].sum())
    .reset_index(name='weighted_avg_price_recent')
)

print(f"  Calculated recent weighted average for {len(weighted_avg_recent):,} SKUs")

# -----------------------------------------------------------------------------
# 5.5 Merge weighted averages
# -----------------------------------------------------------------------------
weighted_averages = pd.merge(
    weighted_avg_historical, 
    weighted_avg_recent, 
    on=GROUP_COLS, 
    how='left'
)

# Round values
weighted_averages['weighted_avg_price'] = weighted_averages['weighted_avg_price'].round(2)
weighted_averages['weighted_avg_price_recent'] = weighted_averages['weighted_avg_price_recent'].round(2)

print(f"\nMerged weighted averages: {len(weighted_averages):,} SKUs")


## Step 6: Final Output and Validation

Merge all data and produce the final clean marketplace price dataset with validation.
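
Beyond eyeballing a sample, a few structural checks can be run on every output row. The sketch below is one possible set of checks on a row represented as a dict; `check_price_row` and the sample rows are hypothetical, and the column names follow the `price_p{p}`, `true_min`, `true_max`, and `price_count` pattern used in this notebook.

```python
def check_price_row(row, percentiles=(10, 25, 50, 75, 90)):
    """Return a list of human-readable problems found in one output row."""
    problems = []
    values = [row[f"price_p{p}"] for p in percentiles]
    if any(a > b for a, b in zip(values, values[1:])):
        problems.append("percentiles are not non-decreasing")
    if row["true_min"] > values[0] or row["true_max"] < values[-1]:
        problems.append("percentiles fall outside [true_min, true_max]")
    if row["price_count"] < 1:
        problems.append("price_count must be at least 1")
    return problems


good_row = {
    "price_p10": 14.0, "price_p25": 20.0, "price_p50": 30.0,
    "price_p75": 40.0, "price_p90": 46.0,
    "true_min": 10.0, "true_max": 50.0, "price_count": 5,
}
bad_row = dict(good_row, price_p25=12.0)  # P25 below P10 should never happen

print(check_price_row(good_row))  # []
print(check_price_row(bad_row))   # ['percentiles are not non-decreasing']
```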


In [None]:
# =============================================================================
# FINAL OUTPUT: MERGE ALL DATA
# =============================================================================
# Combine price bounds with weighted averages for the final dataset

print("=" * 60)
print("STEP 6: Final Output Generation")
print("=" * 60)

# Merge price bounds with weighted averages
mp_final = pd.merge(
    price_bounds, 
    weighted_averages, 
    on=GROUP_COLS, 
    how='left'  # Keep all price bounds, even without sales data
)

# Remove duplicates if any
mp_final = mp_final.drop_duplicates()

# Final rounding
numeric_cols = mp_final.select_dtypes(include=[np.number]).columns
mp_final[numeric_cols] = mp_final[numeric_cols].round(2)

print(f"\nFinal dataset: {len(mp_final):,} SKUs")
print(f"\nColumn summary:")
for col in mp_final.columns:
    null_count = mp_final[col].isnull().sum()
    null_pct = (null_count / len(mp_final) * 100)
    print(f"  - {col}: {null_count:,} nulls ({null_pct:.1f}%)")


In [None]:
# =============================================================================
# VALIDATION: SAMPLE OUTPUT
# =============================================================================
# Display sample data for verification

print("=" * 60)
print("VALIDATION: Sample Output")
print("=" * 60)

# Show sample for a specific product
print(f"\n1. Sample output for product_id = {SAMPLE_PRODUCT_ID}:")
sample_product = mp_final[mp_final['product_id'] == SAMPLE_PRODUCT_ID]
if len(sample_product) > 0:
    display(sample_product)
else:
    print(f"   No data found for product_id {SAMPLE_PRODUCT_ID}")

# Show summary statistics using min and max percentiles
p_min, p_max = PERCENTILES[0], PERCENTILES[-1]
print("\n2. Overall price range statistics:")
print(f"   - Min P{p_min} price: {mp_final[f'price_p{p_min}'].min():.2f}")
print(f"   - Max P{p_max} price: {mp_final[f'price_p{p_max}'].max():.2f}")
print(f"   - Average spread (P{p_max}-P{p_min}): {(mp_final[f'price_p{p_max}'] - mp_final[f'price_p{p_min}']).mean():.2f}")

# Show region distribution
print("\n3. SKUs per region:")
region_counts = mp_final.groupby('region').size().sort_values(ascending=False)
for region, count in region_counts.items():
    print(f"   - {region}: {count:,}")

# Show data quality metrics
print("\n4. Data quality metrics:")
print(f"   - Total SKUs with price ranges: {len(mp_final):,}")
print(f"   - SKUs with weighted avg price: {mp_final['weighted_avg_price'].notna().sum():,}")
print(f"   - SKUs with recent weighted avg: {mp_final['weighted_avg_price_recent'].notna().sum():,}")
print(f"   - Average price records per SKU: {mp_final['price_count'].mean():.1f}")


In [None]:
# =============================================================================
# FINAL OUTPUT: mp_final DataFrame
# =============================================================================
# The main output is the mp_final DataFrame containing:
#   - region: Geographic region
#   - product_id: Product identifier
#   - pu_id: Packing unit identifier
#   - price_p10, price_p25, price_p50, price_p75, price_p90: Percentile prices
#   - price_count: Number of price records remaining after filtering
#   - true_min: Minimum price after filtering
#   - true_max: Maximum price after filtering
#   - price_mode: Most common price
#   - weighted_avg_price: NMV-weighted average (historical)
#   - weighted_avg_price_recent: NMV-weighted average (recent)

print("=" * 60)
print("FINAL OUTPUT: mp_final")
print("=" * 60)
print(f"\nDataFrame shape: {mp_final.shape}")
print(f"\nFirst 10 rows:")
display(mp_final.head(10))

print("\n" + "=" * 60)
print("NOTEBOOK COMPLETE")
print("=" * 60)
print(f"""
The mp_final DataFrame is ready for use in pricing logic.

Key outputs (5 price points per SKU):
- price_p10: Floor price (10th percentile)
- price_p25: Low price (25th percentile)  
- price_p50: Median price (50th percentile)
- price_p75: High price (75th percentile)
- price_p90: Ceiling price (90th percentile)
- price_mode: Most common price
- weighted_avg_price: Market-validated price based on actual sales

Configuration used:
- MAD Threshold: {MAD_THRESHOLD}
- WAC Range: {WAC_LOWER_BOUND}% to {WAC_UPPER_BOUND}%
- Percentiles: {PERCENTILES}
""")
