# Scientific Marketplace Pricing Analysis

This notebook fetches marketplace prices, normalizes them to **unit prices** (per basic unit), and applies scientific outlier detection using **MAD (Median Absolute Deviation)** to produce clean, valid price ranges per product-region.

## Key Features:
- **Unit price normalization**: All packing units converted to basic unit price (price / BUC)
- **Larger sample pools**: All PUs contribute to each product-region analysis
- **MAD-based outlier detection**: Robust to up to 50% outliers (Iglewicz & Hoaglin, 1993)
- **5 percentile price points**: P10, P25, P50, P75, P90 for full distribution
- **WAC validation**: Filter prices within acceptable cost margins

## Output Format:
- Grouped by **(region, product_id)** only - NO packing unit
- All prices are **unit prices** (per basic unit)


In [1]:
# =============================================================================
# CONFIGURATION - All adjustable parameters in one place
# =============================================================================

# MAD Outlier Detection Settings
# Threshold of 3.5 is the standard recommendation by Iglewicz and Hoaglin (1993)
# Lower values = more aggressive outlier removal, Higher values = more permissive
MAD_THRESHOLD = 3.5

# WAC (Weighted Average Cost) Filter Settings
# Prices outside this percentage range from WAC4 will be excluded
WAC_LOWER_BOUND = -40  # Minimum acceptable % difference from WAC4
WAC_UPPER_BOUND = 40   # Maximum acceptable % difference from WAC4

# Percentile Settings for Price Bounds
# Define multiple percentiles to get a full price distribution (4-5 price points)
PERCENTILES = [10, 25, 50, 75, 90]  # P10, P25, P50 (median), P75, P90

# Minimum Price Points Threshold
# Product-regions with fewer price points than this will be excluded
MIN_PRICE_POINTS = 3  # Minimum for reliable statistics

# Sales Data Settings
SALES_LOOKBACK_DAYS = 100  # Days to look back for historical sales
RECENT_SALES_DAYS = 5      # Days to consider as "recent" sales

# Display Settings
SAMPLE_PRODUCT_ID = 1309  # Product ID to use for sample output verification

print("Configuration loaded successfully!")
print(f"  - MAD Threshold: {MAD_THRESHOLD}")
print(f"  - WAC Range: {WAC_LOWER_BOUND}% to {WAC_UPPER_BOUND}%")
print(f"  - Percentiles: {PERCENTILES}")
print(f"  - Min Price Points: {MIN_PRICE_POINTS}")


Configuration loaded successfully!
  - MAD Threshold: 3.5
  - WAC Range: -40% to 40%
  - Percentiles: [10, 25, 50, 75, 90]
  - Min Price Points: 3


In [2]:
# =============================================================================
# IMPORTS AND ENVIRONMENT SETUP
# =============================================================================

import pandas as pd
import numpy as np
from tqdm import tqdm
import warnings
from IPython.display import display
import os

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

# Initialize environment (for Snowflake credentials)
import setup_environment_2
import importlib
importlib.reload(setup_environment_2)
setup_environment_2.initialize_env()

print("Environment initialized successfully!")


/home/ec2-user/.Renviron
/home/ec2-user/service_account_key.json
Environment initialized successfully!


In [26]:
# =============================================================================
# HELPER FUNCTIONS
# =============================================================================

def query_snowflake(query, columns=[]):
    """
    Execute a Snowflake query and return results as a DataFrame.
    
    Parameters:
    -----------
    query : str
        SQL query to execute
    columns : list
        Column names for the resulting DataFrame
        
    Returns:
    --------
    pd.DataFrame
        Query results
    """
    import snowflake.connector
    
    con = snowflake.connector.connect(
        user=os.environ["SNOWFLAKE_USERNAME"],
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        database=os.environ["SNOWFLAKE_DATABASE"]
    )
    try:
        cur = con.cursor()
        cur.execute("USE WAREHOUSE COMPUTE_WH")
        cur.execute(query)
        if len(columns) == 0:
            out = pd.DataFrame(np.array(cur.fetchall()))
        else:
            out = pd.DataFrame(np.array(cur.fetchall()), columns=columns)
        return out
    except Exception as e:
        print(f"Error: {e}")
        return pd.DataFrame()
    finally:
        cur.close()
        con.close()


def mad_filter(data, threshold=3.5):
    """
    Apply MAD (Median Absolute Deviation) outlier detection.
    
    MAD is more robust than standard deviation for non-normal distributions.
    It can handle up to 50% outliers, while std breaks down at ~25%.
    
    The modified Z-score formula:
        M_i = 0.6745 * (x_i - median) / MAD
    
    Where 0.6745 is the consistency constant that makes MAD comparable 
    to standard deviation for normally distributed data.
    
    Reference: Iglewicz, B. and Hoaglin, D. (1993), 
               "How to Detect and Handle Outliers"
    
    Parameters:
    -----------
    data : array-like
        Numeric data to filter
    threshold : float
        Modified Z-score threshold (default 3.5 per Iglewicz & Hoaglin)
        
    Returns:
    --------
    np.ndarray
        Boolean mask where True = inlier, False = outlier
    """
    data = np.array(data, dtype=float)
    median = np.median(data)
    mad = np.median(np.abs(data - median))
    
    # Handle edge case where MAD is 0 (all values are the same)
    if mad == 0:
        return np.ones(len(data), dtype=bool)
    
    # Calculate modified Z-scores
    # 0.6745 is the consistency constant for normal distribution
    modified_z_scores = 0.6745 * (data - median) / mad
    
    return np.abs(modified_z_scores) < threshold


def apply_mad_filter_to_group(group, column, threshold=3.5):
    """
    Apply MAD filter to a specific column within a group.
    
    Parameters:
    -----------
    group : pd.DataFrame
        Grouped DataFrame
    column : str
        Column name to filter on
    threshold : float
        MAD threshold
        
    Returns:
    --------
    pd.DataFrame
        Filtered group with outliers removed
    """
    if len(group) < 3:
        # Not enough data points for meaningful outlier detection
        return group
    
    mask = mad_filter(group[column].values, threshold)
    return group[mask]


def get_percentile_prices(group, column='unit_price', percentiles=[10, 25, 50, 75, 90]):
    """
    Calculate multiple percentile-based unit prices for a group.
    
    Parameters:
    -----------
    group : pd.DataFrame
        Grouped DataFrame
    column : str
        Column to calculate percentiles on (default: unit_price)
    percentiles : list
        List of percentiles to calculate (e.g., [10, 25, 50, 75, 90])
        
    Returns:
    --------
    pd.Series
        Statistics including all percentile values and count
    """
    values = group[column].astype(float)
    
    # Build result dictionary with percentile unit prices
    result = {}
    for p in percentiles:
        result[f'unit_price_p{p}'] = values.quantile(p / 100)
    
    # Add additional statistics
    result['price_count'] = len(values)
    result['true_min'] = np.nan
    result['true_max'] = values.max()
    
    return pd.Series(result)


def log_filtering_step(df_before, df_after, step_name):
    """
    Log the results of a filtering step.
    
    Parameters:
    -----------
    df_before : pd.DataFrame
        DataFrame before filtering
    df_after : pd.DataFrame
        DataFrame after filtering
    step_name : str
        Name of the filtering step
    """
    rows_removed = len(df_before) - len(df_after)
    pct_removed = (rows_removed / len(df_before) * 100) if len(df_before) > 0 else 0
    
    print(f"  {step_name}:")
    print(f"    - Before: {len(df_before):,} rows")
    print(f"    - After: {len(df_after):,} rows")
    print(f"    - Removed: {rows_removed:,} rows ({pct_removed:.2f}%)")


print("Helper functions loaded successfully!")


Helper functions loaded successfully!


## Step 1: Data Fetching

Fetch marketplace prices, packing unit mappings, and WAC (cost) data from Snowflake.


In [27]:
# =============================================================================
# FETCH MARKETPLACE PRICES
# =============================================================================
# Get active seller prices with region mapping

print("Fetching marketplace prices...")

mp_query = '''
WITH seller_region AS (
    SELECT 
        seller_retailer.retailer_id,
        CASE 
            WHEN regions.name_en = 'Greater Cairo' THEN cities.name_en 
            ELSE regions.name_en 
        END AS region,
        seller_id,
        seller_retailer.POLYGON_ID
    FROM MATERIALIZED_VIEWS.SELLERS_RETAILERS_MAPPING seller_retailer
    JOIN retailers ON retailers.id = seller_retailer.retailer_id
    JOIN materialized_views.retailer_polygon ON materialized_views.retailer_polygon.retailer_id = retailers.id
    JOIN districts ON districts.id = materialized_views.retailer_polygon.district_id
    JOIN cities ON cities.id = districts.city_id
    JOIN states ON states.id = cities.state_id
    JOIN regions ON regions.id = states.region_id
    JOIN egypt_marketplace.sellers ON sellers.id = seller_retailer.seller_id AND sellers.status = 'ACTIVE'
),

recent_price AS (
    SELECT
        wp.product_id AS product_id,
        wp.packing_unit_id AS product_pu,
        wp.price AS price,
        wp.max_per_order,
        warehouses.seller_id AS seller_id,
        warehouses.MIN_TICKET_SIZE
    FROM egypt_marketplace.warehouse_products wp
    LEFT JOIN egypt_marketplace.warehouses ON warehouses.id = wp.warehouse_id 
    WHERE wp.AVAILABLE > 0 
        AND wp.total_stock > 0
        AND activation = 'true'
)

SELECT DISTINCT
    seller_region.region,
    recent_price.*
FROM recent_price
JOIN seller_region ON seller_region.seller_id = recent_price.seller_id
'''

mp_raw = query_snowflake(
    mp_query, 
    columns=['region', 'product_id', 'product_pu', 'price', 'max_per_order', 'seller_id', 'min_ticket_size']
)

# Convert data types
for col in ['product_id', 'product_pu', 'price', 'max_per_order', 'seller_id', 'min_ticket_size']:
    mp_raw[col] = pd.to_numeric(mp_raw[col])

print(f"  Fetched {len(mp_raw):,} marketplace price records")
print(f"  Unique products: {mp_raw['product_id'].nunique():,}")
print(f"  Unique sellers: {mp_raw['seller_id'].nunique():,}")
print(f"  Regions: {mp_raw['region'].unique().tolist()}")


Fetching marketplace prices...
  Fetched 134,632 marketplace price records
  Unique products: 6,978
  Unique sellers: 131
  Regions: ['Delta West', 'Giza', 'Cairo', 'Upper Egypt', 'Delta East', 'Alexandria']


In [28]:
# =============================================================================
# FETCH PACKING UNIT DATA
# =============================================================================
# Get packing unit to product mapping with basic unit count (BUC)

print("Fetching packing unit data...")

pu_query = '''
SELECT
    product_id,
    PACKING_UNIT_ID AS pu_id,
    BASIC_UNIT_COUNT AS buc
FROM PACKING_UNIT_PRODUCTS
'''

packing_units = query_snowflake(pu_query, columns=['product_id', 'pu_id', 'buc'])

# Convert data types
packing_units['product_id'] = pd.to_numeric(packing_units['product_id'])
packing_units['pu_id'] = pd.to_numeric(packing_units['pu_id'])
packing_units['buc'] = pd.to_numeric(packing_units['buc'])

print(f"  Fetched {len(packing_units):,} packing unit mappings")
print(f"  Unique products: {packing_units['product_id'].nunique():,}")


Fetching packing unit data...
  Fetched 35,071 packing unit mappings
  Unique products: 24,774


In [29]:
# =============================================================================
# FETCH WAC (WEIGHTED AVERAGE COST) DATA
# =============================================================================
# Get current cost data for price validation

print("Fetching WAC data...")

wac_query = '''
SELECT 
    f.product_id,
    f.wac1,
    f.wac4,
    f.wac_p
FROM finance.all_cogs f
WHERE current_timestamp BETWEEN f.from_date AND f.to_date
'''

wac_data = query_snowflake(wac_query, columns=['product_id', 'wac1', 'wac4', 'wac_p'])

# Convert data types
wac_data['product_id'] = pd.to_numeric(wac_data['product_id'])
wac_data['wac1'] = pd.to_numeric(wac_data['wac1'])
wac_data['wac4'] = pd.to_numeric(wac_data['wac4'])
wac_data['wac_p'] = pd.to_numeric(wac_data['wac_p'])

print(f"  Fetched WAC data for {len(wac_data):,} products")

# Create packing unit WAC by joining with BUC
pu_wac = pd.merge(packing_units, wac_data, on='product_id', how='left')
pu_wac['pu_wac1'] = pu_wac['buc'] * pu_wac['wac1']
pu_wac['pu_wac4'] = pu_wac['buc'] * pu_wac['wac4']

print(f"  Created PU-level WAC for {len(pu_wac):,} product-PU combinations")


Fetching WAC data...
  Fetched WAC data for 8,157 products
  Created PU-level WAC for 35,071 product-PU combinations


## Step 2: WAC Mapping and Initial Filtering

Map prices to packing units and filter based on WAC4 percentage bounds.


In [30]:
# =============================================================================
# WAC MAPPING AND INITIAL FILTER
# =============================================================================
# Join marketplace prices with WAC data and filter by acceptable cost margins

print("=" * 60)
print("STEP 2: WAC Mapping and Initial Filtering")
print("=" * 60)

# Rename column for clarity
mp_data = mp_raw.copy()
mp_data.rename(columns={'product_pu': 'mp_pu_id'}, inplace=True)

# Join with packing unit WAC data
mp_with_wac = pd.merge(
    mp_data,
    pu_wac,
    how='inner',
    on='product_id'
)

print(f"\nAfter joining with WAC data: {len(mp_with_wac):,} rows")

# Calculate percentage difference from WAC4
# Formula: (price - pu_wac4) / pu_wac4 * 100
mp_with_wac['wac4_pct_diff'] = (
    (mp_with_wac['price'].astype(float) - mp_with_wac['pu_wac4'].astype(float)) 
    / mp_with_wac['pu_wac4'].astype(float) * 100
).round(2)

# Apply WAC filter using configurable bounds
print(f"\nApplying WAC4 filter: {WAC_LOWER_BOUND}% to {WAC_UPPER_BOUND}%")

before_wac_filter = len(mp_with_wac)
mp_wac_filtered = mp_with_wac[
    (mp_with_wac['wac4_pct_diff'] >= WAC_LOWER_BOUND) & 
    (mp_with_wac['wac4_pct_diff'] <= WAC_UPPER_BOUND)
].copy()

log_filtering_step(mp_with_wac, mp_wac_filtered, "WAC4 Filter")

# =============================================================================
# NORMALIZE PRICES TO BASIC UNIT
# =============================================================================
# Convert all prices to unit price: unit_price = price / basic_unit_count
# This pools all packing units into a single analysis per (region, product_id)

print("\n" + "=" * 60)
print("STEP 2b: Price Normalization to Basic Unit")
print("=" * 60)

mp_wac_filtered['unit_price'] = (
    mp_wac_filtered['price'].astype(float) / mp_wac_filtered['buc'].astype(float)
).round(4)

print(f"Converted {len(mp_wac_filtered):,} prices to unit prices")
print(f"  - Original price range: {mp_wac_filtered['price'].min():.2f} - {mp_wac_filtered['price'].max():.2f}")
print(f"  - Unit price range: {mp_wac_filtered['unit_price'].min():.4f} - {mp_wac_filtered['unit_price'].max():.4f}")

# Select relevant columns for further processing (NO pu_id - grouped by product only)
mp_clean = mp_wac_filtered[[
    'region', 'product_id', 'unit_price', 'max_per_order', 
    'seller_id', 'min_ticket_size'
]].copy()

print(f"\nData ready for outlier detection: {len(mp_clean):,} rows")
print(f"Unique products per region: {mp_clean.groupby('region')['product_id'].nunique().sum():,}")


STEP 2: WAC Mapping and Initial Filtering

After joining with WAC data: 182,624 rows

Applying WAC4 filter: -40% to 40%
  WAC4 Filter:
    - Before: 182,624 rows
    - After: 71,111 rows
    - Removed: 111,513 rows (61.06%)

STEP 2b: Price Normalization to Basic Unit
Converted 71,111 prices to unit prices
  - Original price range: 13.00 - 9600.00
  - Unit price range: 3.1998 - 1925.0000

Data ready for outlier detection: 71,111 rows
Unique products per region: 11,625


## Step 3: Scientific Outlier Removal (MAD)

Apply MAD-based outlier detection to ticket size, max order, and price columns.
MAD (Median Absolute Deviation) is robust to up to 50% outliers, making it ideal for marketplace data.


In [31]:
# =============================================================================
# SCIENTIFIC OUTLIER REMOVAL USING MAD
# =============================================================================
# Apply MAD filter to each metric per group (region, product_id)
# MAD is more robust than IQR for non-normal distributions

print("=" * 60)
print("STEP 3: Scientific Outlier Removal (MAD)")
print("=" * 60)
print(f"Using MAD threshold: {MAD_THRESHOLD}")

# Define grouping columns - NO pu_id since we normalized to unit prices
GROUP_COLS = ['region', 'product_id']

# Start with WAC-filtered data
df = mp_clean.copy()
initial_count = len(df)

# -----------------------------------------------------------------------------
# 3.1 Filter outliers in min_ticket_size
# -----------------------------------------------------------------------------
print("\n3.1 Filtering min_ticket_size outliers...")

df_before = df.copy()
df = df.groupby(GROUP_COLS, group_keys=False).apply(
    lambda g: apply_mad_filter_to_group(g, 'min_ticket_size', MAD_THRESHOLD)
)
log_filtering_step(df_before, df, "MAD filter on min_ticket_size")

# -----------------------------------------------------------------------------
# 3.2 Filter outliers in max_per_order
# -----------------------------------------------------------------------------
print("\n3.2 Filtering max_per_order outliers...")

df_before = df.copy()
df = df.groupby(GROUP_COLS, group_keys=False).apply(
    lambda g: apply_mad_filter_to_group(g, 'max_per_order', MAD_THRESHOLD)
)
log_filtering_step(df_before, df, "MAD filter on max_per_order")

# -----------------------------------------------------------------------------
# 3.3 Filter outliers in unit_price
# -----------------------------------------------------------------------------
print("\n3.3 Filtering unit_price outliers...")

df_before = df.copy()
df = df.groupby(GROUP_COLS, group_keys=False).apply(
    lambda g: apply_mad_filter_to_group(g, 'unit_price', MAD_THRESHOLD)
)
log_filtering_step(df_before, df, "MAD filter on unit_price")

# -----------------------------------------------------------------------------
# Summary
# -----------------------------------------------------------------------------
mp_filtered = df.copy()
final_count = len(mp_filtered)
total_removed = initial_count - final_count
total_pct = (total_removed / initial_count * 100) if initial_count > 0 else 0

print("\n" + "=" * 60)
print("OUTLIER REMOVAL SUMMARY")
print("=" * 60)
print(f"Initial records: {initial_count:,}")
print(f"Final records: {final_count:,}")
print(f"Total removed: {total_removed:,} ({total_pct:.2f}%)")
print(f"Unique products (region-product): {mp_filtered.groupby(GROUP_COLS).ngroups:,}")


STEP 3: Scientific Outlier Removal (MAD)
Using MAD threshold: 3.5

3.1 Filtering min_ticket_size outliers...
  MAD filter on min_ticket_size:
    - Before: 71,111 rows
    - After: 67,065 rows
    - Removed: 4,046 rows (5.69%)

3.2 Filtering max_per_order outliers...
  MAD filter on max_per_order:
    - Before: 67,065 rows
    - After: 63,945 rows
    - Removed: 3,120 rows (4.65%)

3.3 Filtering unit_price outliers...
  MAD filter on unit_price:
    - Before: 63,945 rows
    - After: 61,782 rows
    - Removed: 2,163 rows (3.38%)

OUTLIER REMOVAL SUMMARY
Initial records: 71,111
Final records: 61,782
Total removed: 9,329 (13.12%)
Unique products (region-product): 11,625


## Step 4: Multi-Percentile Unit Price Calculation

Calculate 5 percentile-based **unit prices** (P10, P25, P50, P75, P90) for each product-region. All packing units are pooled together for a larger sample size.


In [32]:
# =============================================================================
# CALCULATE UNIT PRICE RANGES (MULTIPLE PERCENTILES)
# =============================================================================
# Compute multiple percentile-based unit prices per product-region

print("=" * 60)
print("STEP 4: Unit Price Range Calculation")
print("=" * 60)
print(f"Calculating percentiles: {PERCENTILES}")
print(f"Grouping by: {GROUP_COLS} (no packing unit)")

# Calculate unit price statistics per group with multiple percentiles
price_bounds = (
    mp_filtered
    .groupby(GROUP_COLS)
    .apply(lambda g: get_percentile_prices(
        g, 
        column='unit_price', 
        percentiles=PERCENTILES
    ))
    .reset_index()
)

# Calculate mode unit price per group (most common price point)
def calculate_mode_price(group):
    """Get the mode (most frequent) unit price for a group."""
    mode_result = group['unit_price'].mode()
    return mode_result.iloc[0] if len(mode_result) > 0 else np.nan

mode_prices = (
    mp_filtered
    .groupby(GROUP_COLS)
    .apply(calculate_mode_price)
    .reset_index(name='unit_price_mode')
)

# Merge mode prices into price bounds
price_bounds = pd.merge(price_bounds, mode_prices, on=GROUP_COLS, how='left')

# Round all numeric columns
numeric_cols = [f'unit_price_p{p}' for p in PERCENTILES] + ['true_min', 'true_max', 'unit_price_mode']
for col in numeric_cols:
    if col in price_bounds.columns:
        price_bounds[col] = price_bounds[col].round(4)

print(f"\nGenerated unit price ranges for {len(price_bounds):,} product-regions")
print(f"\nOutput columns: {price_bounds.columns.tolist()}")


STEP 4: Unit Price Range Calculation
Calculating percentiles: [10, 25, 50, 75, 90]
Grouping by: ['region', 'product_id'] (no packing unit)

Generated unit price ranges for 11,625 product-regions

Output columns: ['region', 'product_id', 'unit_price_p10', 'unit_price_p25', 'unit_price_p50', 'unit_price_p75', 'unit_price_p90', 'price_count', 'true_min', 'true_max', 'unit_price_mode']


## Step 5: Sales-Weighted Average (Optional)

Calculate weighted average prices based on actual sales NMV to understand market-validated pricing.


In [33]:
# =============================================================================
# FETCH SALES DATA FOR WEIGHTED AVERAGES
# =============================================================================
# Get historical and recent sales to calculate NMV-weighted average prices

print("=" * 60)
print("STEP 5: Sales-Weighted Average Calculation")
print("=" * 60)

# -----------------------------------------------------------------------------
# 5.1 Fetch historical sales data
# -----------------------------------------------------------------------------
print(f"\nFetching historical sales (last {SALES_LOOKBACK_DAYS} days)...")

historical_sales_query = f'''
SELECT
    seller_id,
    product_id,
    packing_unit_id AS pu_id,
    item_price,
    SUM(sop.total_price) AS nmv
FROM egypt_marketplace.sales_orders so
JOIN egypt_marketplace.sales_order_products sop ON sop.order_id = so.id
WHERE so.status = 6 
    AND so.created_at::date >= current_date - {SALES_LOOKBACK_DAYS}
GROUP BY ALL
'''

mp_sales_historical = query_snowflake(
    historical_sales_query, 
    columns=['seller_id', 'product_id', 'pu_id', 'item_price', 'nmv']
)

for col in ['seller_id', 'product_id', 'pu_id', 'item_price', 'nmv']:
    mp_sales_historical[col] = pd.to_numeric(mp_sales_historical[col])

print(f"  Fetched {len(mp_sales_historical):,} historical sales records")

# -----------------------------------------------------------------------------
# 5.2 Fetch recent sales data
# -----------------------------------------------------------------------------
print(f"\nFetching recent sales (last {RECENT_SALES_DAYS} days)...")

recent_sales_query = f'''
SELECT
    seller_id,
    product_id,
    packing_unit_id AS pu_id,
    item_price,
    SUM(sop.total_price) AS nmv
FROM egypt_marketplace.sales_orders so
JOIN egypt_marketplace.sales_order_products sop ON sop.order_id = so.id
WHERE so.status NOT IN (3, 7, 8)
    AND so.created_at::date >= current_date - {RECENT_SALES_DAYS}
GROUP BY ALL
'''

mp_sales_recent = query_snowflake(
    recent_sales_query, 
    columns=['seller_id', 'product_id', 'pu_id', 'item_price', 'nmv']
)

for col in ['seller_id', 'product_id', 'pu_id', 'item_price', 'nmv']:
    mp_sales_recent[col] = pd.to_numeric(mp_sales_recent[col])

print(f"  Fetched {len(mp_sales_recent):,} recent sales records")


STEP 5: Sales-Weighted Average Calculation

Fetching historical sales (last 100 days)...
  Fetched 51,598 historical sales records

Fetching recent sales (last 5 days)...
  Fetched 6,112 recent sales records


In [34]:
# =============================================================================
# CALCULATE WEIGHTED AVERAGE UNIT PRICES
# =============================================================================
# Join sales with filtered MP data and compute NMV-weighted averages

# -----------------------------------------------------------------------------
# 5.3 Calculate historical weighted average (using unit_price)
# -----------------------------------------------------------------------------
print("\nCalculating historical weighted average unit prices...")

# Join filtered data with historical sales (on seller_id and product_id only)
filtered_with_sales = pd.merge(
    mp_filtered,
    mp_sales_historical,
    how='inner',
    on=['seller_id', 'product_id']
)

# Calculate weighted average: sum(unit_price * nmv) / sum(nmv)
filtered_with_sales['unit_price'] = pd.to_numeric(filtered_with_sales['unit_price'])
filtered_with_sales['nmv'] = pd.to_numeric(filtered_with_sales['nmv'])

weighted_avg_historical = (
    filtered_with_sales
    .groupby(GROUP_COLS)
    .apply(lambda g: (g['unit_price'] * g['nmv']).sum() / g['nmv'].sum())
    .reset_index(name='weighted_avg_unit_price')
)

print(f"  Calculated weighted average for {len(weighted_avg_historical):,} product-regions")

# -----------------------------------------------------------------------------
# 5.4 Calculate recent weighted average (using unit_price)
# -----------------------------------------------------------------------------
print("\nCalculating recent weighted average unit prices...")

# Join filtered data with recent sales
filtered_with_recent = pd.merge(
    mp_filtered,
    mp_sales_recent,
    how='inner',
    on=['seller_id', 'product_id']
)

filtered_with_recent['unit_price'] = pd.to_numeric(filtered_with_recent['unit_price'])
filtered_with_recent['nmv'] = pd.to_numeric(filtered_with_recent['nmv'])

weighted_avg_recent = (
    filtered_with_recent
    .groupby(GROUP_COLS)
    .apply(lambda g: (g['unit_price'] * g['nmv']).sum() / g['nmv'].sum())
    .reset_index(name='weighted_avg_unit_price_recent')
)

print(f"  Calculated recent weighted average for {len(weighted_avg_recent):,} product-regions")

# -----------------------------------------------------------------------------
# 5.5 Merge weighted averages
# -----------------------------------------------------------------------------
weighted_averages = pd.merge(
    weighted_avg_historical, 
    weighted_avg_recent, 
    on=GROUP_COLS, 
    how='left'
)

# Round values
weighted_averages['weighted_avg_unit_price'] = weighted_averages['weighted_avg_unit_price'].round(4)
weighted_averages['weighted_avg_unit_price_recent'] = weighted_averages['weighted_avg_unit_price_recent'].round(4)

print(f"\nMerged weighted averages: {len(weighted_averages):,} product-regions")



Calculating historical weighted average unit prices...
  Calculated weighted average for 9,453 product-regions

Calculating recent weighted average unit prices...
  Calculated recent weighted average for 4,784 product-regions

Merged weighted averages: 9,453 product-regions


## Step 6: Final Output and Validation

Merge all data and produce the final clean marketplace price dataset with validation.


In [35]:
# =============================================================================
# FINAL OUTPUT: MERGE ALL DATA
# =============================================================================
# Combine price bounds with weighted averages for the final dataset

print("=" * 60)
print("STEP 6: Final Output Generation")
print("=" * 60)

# Merge price bounds with weighted averages
mp_final = pd.merge(
    price_bounds, 
    weighted_averages, 
    on=GROUP_COLS, 
    how='left'  # Keep all price bounds, even without sales data
)

# Remove duplicates if any
mp_final = mp_final.drop_duplicates()

# =============================================================================
# FILTER BY MINIMUM PRICE POINTS
# =============================================================================
# Remove product-regions with too few price points for reliable statistics

before_min_filter = len(mp_final)
mp_final = mp_final[mp_final['price_count'] >= MIN_PRICE_POINTS].copy()
after_min_filter = len(mp_final)
removed_count = before_min_filter - after_min_filter

print(f"\nMinimum price points filter (>= {MIN_PRICE_POINTS}):")
print(f"  - Before: {before_min_filter:,} product-regions")
print(f"  - After: {after_min_filter:,} product-regions")
print(f"  - Removed: {removed_count:,} ({(removed_count/before_min_filter*100):.1f}%)")

# Final rounding for unit prices (4 decimal places)
numeric_cols = mp_final.select_dtypes(include=[np.number]).columns
mp_final[numeric_cols] = mp_final[numeric_cols].round(4)

print(f"\nFinal dataset: {len(mp_final):,} product-regions (NO packing unit)")
print(f"\nColumn summary:")
for col in mp_final.columns:
    null_count = mp_final[col].isnull().sum()
    null_pct = (null_count / len(mp_final) * 100)
    print(f"  - {col}: {null_count:,} nulls ({null_pct:.1f}%)")


STEP 6: Final Output Generation

Minimum price points filter (>= 3):
  - Before: 11,625 product-regions
  - After: 5,852 product-regions
  - Removed: 5,773 (49.7%)

Final dataset: 5,852 product-regions (NO packing unit)

Column summary:
  - region: 0 nulls (0.0%)
  - product_id: 0 nulls (0.0%)
  - unit_price_p10: 0 nulls (0.0%)
  - unit_price_p25: 0 nulls (0.0%)
  - unit_price_p50: 0 nulls (0.0%)
  - unit_price_p75: 0 nulls (0.0%)
  - unit_price_p90: 0 nulls (0.0%)
  - price_count: 0 nulls (0.0%)
  - true_min: 5,852 nulls (100.0%)
  - true_max: 0 nulls (0.0%)
  - unit_price_mode: 0 nulls (0.0%)
  - weighted_avg_unit_price: 166 nulls (2.8%)
  - weighted_avg_unit_price_recent: 2,261 nulls (38.6%)


In [36]:
# =============================================================================
# VALIDATION: SAMPLE OUTPUT
# =============================================================================
# Display sample data for verification

print("=" * 60)
print("VALIDATION: Sample Output")
print("=" * 60)

# Show sample for a specific product
print(f"\n1. Sample output for product_id = {SAMPLE_PRODUCT_ID}:")
sample_product = mp_final[mp_final['product_id'] == SAMPLE_PRODUCT_ID]
if len(sample_product) > 0:
    display(sample_product)
else:
    print(f"   No data found for product_id {SAMPLE_PRODUCT_ID}")

# Show summary statistics using min and max percentiles
p_min, p_max = PERCENTILES[0], PERCENTILES[-1]
print("\n2. Overall unit price range statistics:")
print(f"   - Min P{p_min} unit price: {mp_final[f'unit_price_p{p_min}'].min():.4f}")
print(f"   - Max P{p_max} unit price: {mp_final[f'unit_price_p{p_max}'].max():.4f}")
print(f"   - Average spread (P{p_max}-P{p_min}): {(mp_final[f'unit_price_p{p_max}'] - mp_final[f'unit_price_p{p_min}']).mean():.4f}")

# Show region distribution
print("\n3. Products per region:")
region_counts = mp_final.groupby('region').size().sort_values(ascending=False)
for region, count in region_counts.items():
    print(f"   - {region}: {count:,}")

# Show data quality metrics
print("\n4. Data quality metrics:")
print(f"   - Total product-regions with unit price ranges: {len(mp_final):,}")
print(f"   - With weighted avg unit price: {mp_final['weighted_avg_unit_price'].notna().sum():,}")
print(f"   - With recent weighted avg: {mp_final['weighted_avg_unit_price_recent'].notna().sum():,}")
print(f"   - Average price points per product: {mp_final['price_count'].mean():.1f}")


VALIDATION: Sample Output

1. Sample output for product_id = 1309:


Unnamed: 0,region,product_id,unit_price_p10,unit_price_p25,unit_price_p50,unit_price_p75,unit_price_p90,price_count,true_min,true_max,unit_price_mode,weighted_avg_unit_price,weighted_avg_unit_price_recent
310,Alexandria,1309,46.175,46.5625,46.875,48.5625,50.0,12.0,,50.0,46.6667,47.0969,46.1667
2322,Cairo,1309,45.9333,46.6667,47.0833,48.0625,50.0,14.0,,50.0,46.6667,47.2378,46.1667
4362,Delta East,1309,46.3167,46.6667,50.0,50.0,50.0,5.0,,50.0,50.0,49.109,
5975,Delta West,1309,46.0083,46.5208,48.0833,50.0,50.0,8.0,,50.0,50.0,48.5621,
7820,Giza,1309,45.9,46.25,46.6667,48.0833,50.0,13.0,,50.0,46.6667,47.184,46.1667
9886,Upper Egypt,1309,45.9333,46.6667,47.0833,48.0625,50.0,14.0,,50.0,46.6667,47.2378,46.1667



2. Overall unit price range statistics:
   - Min P10 unit price: 3.1998
   - Max P90 unit price: 1289.0000
   - Average spread (P90-P10): 10.6595

3. Products per region:
   - Cairo: 1,161
   - Giza: 1,113
   - Upper Egypt: 1,107
   - Alexandria: 1,023
   - Delta West: 830
   - Delta East: 618

4. Data quality metrics:
   - Total product-regions with unit price ranges: 5,852
   - With weighted avg unit price: 5,686
   - With recent weighted avg: 3,591
   - Average price points per product: 9.2


In [37]:
# =============================================================================
# FINAL OUTPUT: mp_final DataFrame
# =============================================================================
# The main output is the mp_final DataFrame containing:
#   - region: Geographic region
#   - product_id: Product identifier
#   - unit_price_p10, p25, p50, p75, p90: Percentile unit prices (per basic unit)
#   - price_count: Number of price points (from all packing units)
#   - true_min: Actual minimum unit price
#   - true_max: Actual maximum unit price
#   - unit_price_mode: Most common unit price
#   - weighted_avg_unit_price: NMV-weighted average (historical)
#   - weighted_avg_unit_price_recent: NMV-weighted average (recent)

print("=" * 60)
print("FINAL OUTPUT: mp_final")
print("=" * 60)
print(f"\nDataFrame shape: {mp_final.shape}")
print(f"\nFirst 10 rows:")
display(mp_final.head(10))

print("\n" + "=" * 60)
print("NOTEBOOK COMPLETE")
print("=" * 60)
print(f"""
The mp_final DataFrame is ready for use in pricing logic.

OUTPUT: Unit prices per (region, product_id) - NO packing unit

Key outputs (5 unit price points per product):
- unit_price_p10: Floor unit price (10th percentile)
- unit_price_p25: Low unit price (25th percentile)  
- unit_price_p50: Median unit price (50th percentile)
- unit_price_p75: High unit price (75th percentile)
- unit_price_p90: Ceiling unit price (90th percentile)
- unit_price_mode: Most common unit price
- weighted_avg_unit_price: Market-validated based on actual sales

Note: All prices are per BASIC UNIT (price / basic_unit_count)
      This pools all packing units into one analysis per product.

Configuration used:
- MAD Threshold: {MAD_THRESHOLD}
- WAC Range: {WAC_LOWER_BOUND}% to {WAC_UPPER_BOUND}%
- Percentiles: {PERCENTILES}
- Min Price Points: {MIN_PRICE_POINTS}
""")


FINAL OUTPUT: mp_final

DataFrame shape: (5852, 13)

First 10 rows:


Unnamed: 0,region,product_id,unit_price_p10,unit_price_p25,unit_price_p50,unit_price_p75,unit_price_p90,price_count,true_min,true_max,unit_price_mode,weighted_avg_unit_price,weighted_avg_unit_price_recent
0,Alexandria,3,253.0,255.0,255.0,263.0,265.0,7.0,,265.0,255.0,255.1701,252.4904
1,Alexandria,9,837.2,840.5,846.0,848.0,849.2,3.0,,850.0,835.0,838.4034,835.0
2,Alexandria,10,270.0,270.0,270.0,272.5,274.0,3.0,,275.0,270.0,270.0846,
5,Alexandria,17,598.5,603.75,613.5,620.0,625.7,8.0,,639.0,620.0,599.4625,
6,Alexandria,18,597.0,601.0,605.0,613.5,620.0,15.0,,620.0,605.0,604.7631,595.0
7,Alexandria,23,258.0,260.0,260.0,265.0,266.8,22.0,,270.0,260.0,259.5825,259.2253
8,Alexandria,24,257.7,260.0,260.0,265.0,267.0,25.0,,270.0,260.0,258.3969,259.1697
9,Alexandria,25,257.8,260.0,260.0,264.0,265.0,27.0,,270.0,260.0,258.3889,258.0507
10,Alexandria,26,258.0,260.0,262.0,265.0,265.8,27.0,,270.0,260.0,258.9631,257.7495
11,Alexandria,27,258.0,260.0,262.0,265.0,266.2,25.0,,270.0,260.0,259.2696,259.6293



NOTEBOOK COMPLETE

The mp_final DataFrame is ready for use in pricing logic.

OUTPUT: Unit prices per (region, product_id) - NO packing unit

Key outputs (5 unit price points per product):
- unit_price_p10: Floor unit price (10th percentile)
- unit_price_p25: Low unit price (25th percentile)  
- unit_price_p50: Median unit price (50th percentile)
- unit_price_p75: High unit price (75th percentile)
- unit_price_p90: Ceiling unit price (90th percentile)
- unit_price_mode: Most common unit price
- weighted_avg_unit_price: Market-validated based on actual sales

Note: All prices are per BASIC UNIT (price / basic_unit_count)
      This pools all packing units into one analysis per product.

Configuration used:
- MAD Threshold: 3.5
- WAC Range: -40% to 40%
- Percentiles: [10, 25, 50, 75, 90]
- Min Price Points: 3



In [42]:
mp_final[mp_final['product_id']==615]

Unnamed: 0,region,product_id,unit_price_p10,unit_price_p25,unit_price_p50,unit_price_p75,unit_price_p90,price_count,true_min,true_max,unit_price_mode,weighted_avg_unit_price,weighted_avg_unit_price_recent
2209,Cairo,615,279.2,281.0,284.0,297.0,304.8,3.0,,310.0,278.0,284.0,
