# Segment-Level Validation with 2% Threshold

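This notebook performs **segment-level validation** of a CSV export against a Fabric export, using a configurable percentage tolerance. The matching function defaults to **2%**; the run recorded below sets `THRESHOLD_PERCENT = 3.0`.

**Key Feature:** Differences at or below the threshold are treated as **MATCHED** ✓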

**Validation Segments:**
1. Overall Totals
2. By Date
3. By Campaign
4. By Gender
5. By Age Group
6. By Campaign + Date
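
The cells in this notebook cover overall totals, dates, and campaigns. The remaining segments in the list (gender, age group, campaign + date) could follow the same aggregate, merge, and compare pattern. The sketch below is one possible generalisation, not code from the notebook: `validate_segment`, `METRICS`, and the demo frames are hypothetical names and data, and the column names `gender` and `age` are assumed to match the renamed CSV columns.

```python
import numpy as np
import pandas as pd

METRICS = ["cost", "impressions", "clicks"]

def validate_segment(csv_df, fabric_df, keys, threshold_pct=2.0):
    """Hypothetical helper: aggregate both sources by `keys` and compare metrics."""
    keys = [keys] if isinstance(keys, str) else list(keys)
    csv_agg = csv_df.groupby(keys, as_index=False)[METRICS].sum()
    fabric_agg = fabric_df.groupby(keys, as_index=False)[METRICS].sum()
    # Outer merge keeps segments that exist in only one source (they fail below).
    merged = csv_agg.merge(fabric_agg, on=keys, how="outer",
                           suffixes=("_csv", "_fabric"))
    all_ok = pd.Series(True, index=merged.index)
    for m in METRICS:
        c, f = merged[f"{m}_csv"], merged[f"{m}_fabric"]
        pct = (c - f) / f * 100
        merged[f"{m}_diff_pct"] = pct.round(2)
        # A zero Fabric value matches only a zero CSV value; NaN never matches.
        ok = np.where(f == 0, c == 0, pct.abs() <= threshold_pct)
        all_ok &= ok
    merged["status"] = np.where(all_ok, "PASS", "FAIL")
    return merged

# Made-up demo data: only the Female clicks differ (110 vs 100, i.e. 10%).
csv_demo = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "age": ["18-24", "18-24", "25-34"],
    "cost": [10.0, 20.0, 5.0],
    "impressions": [1000, 2000, 500],
    "clicks": [50, 110, 25],
})
fabric_demo = csv_demo.assign(clicks=[50, 100, 25])

by_gender = validate_segment(csv_demo, fabric_demo, "gender")
by_gender_age = validate_segment(csv_demo, fabric_demo, ["gender", "age"])
print(by_gender[["gender", "clicks_diff_pct", "status"]])
```

Passing a list such as `["campaign_name", "day"]` as `keys` would give the combined campaign + date view under the same assumptions.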

## Configuration: Set Threshold

In [1]:
# CONFIGURATION: Set your threshold here
THRESHOLD_PERCENT = 3.0  # Accept differences up to this percentage (3% in this run)

print("="*80)
print("VALIDATION CONFIGURATION")
print("="*80)
print(f"\nThreshold: {THRESHOLD_PERCENT}%")
print(f"Differences under {THRESHOLD_PERCENT}% will be marked as MATCHED")
print("\nYou can change THRESHOLD_PERCENT above to adjust tolerance")

VALIDATION CONFIGURATION

Threshold: 3.0%
Differences under 3.0% will be marked as MATCHED

You can change THRESHOLD_PERCENT above to adjust tolerance


## Step 1: Import Libraries

In [2]:
# Install openpyxl if needed
import sys
!{sys.executable} -m pip install openpyxl -q

import pandas as pd
import numpy as np
from datetime import datetime

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.2f}'.format)

print("✓ Libraries imported successfully")
print(f"Analysis started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

✓ Libraries imported successfully
Analysis started: 2025-12-20 23:13:32


## Step 2: Define Matching Function with Threshold

In [3]:
def check_match_with_threshold(csv_val, fabric_val, threshold_pct=2.0):
    """
    Check if two values match within a percentage threshold.
    
    Args:
        csv_val: Value from CSV
        fabric_val: Value from Fabric
        threshold_pct: Acceptable difference percentage (default 2%)
    
    Returns:
        Boolean: True if difference is within threshold
    """
    # Handle NaN values
    if pd.isna(csv_val) or pd.isna(fabric_val):
        return pd.isna(csv_val) and pd.isna(fabric_val)
    
    # Handle zero values
    if fabric_val == 0:
        return csv_val == 0
    
    # Calculate percentage difference
    pct_diff = abs((csv_val - fabric_val) / fabric_val * 100)
    
    return pct_diff <= threshold_pct

print("✓ Matching function defined")
print(f"  Threshold: {THRESHOLD_PERCENT}%")

✓ Matching function defined
  Threshold: 3.0%
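
The matching function works on one pair of values at a time, which is why later cells call it through `DataFrame.apply(..., axis=1)`. As an illustration only, the same rules (NaN handling, zero baseline, percentage difference) can be expressed on whole columns. `within_threshold` is a hypothetical name, and the input values are made up.

```python
import pandas as pd

def within_threshold(csv_vals, fabric_vals, threshold_pct=2.0):
    """Vectorised sketch of the per-value rules used by check_match_with_threshold."""
    csv_vals = pd.Series(csv_vals, dtype="float64")
    fabric_vals = pd.Series(fabric_vals, dtype="float64")

    both_nan = csv_vals.isna() & fabric_vals.isna()
    zero_base = fabric_vals == 0

    # Division by zero yields inf or NaN here; both compare as False below.
    pct_diff = ((csv_vals - fabric_vals) / fabric_vals * 100).abs()
    result = pct_diff <= threshold_pct

    # With a zero Fabric value, only a zero CSV value counts as a match.
    result = result.where(~zero_base, csv_vals == 0)
    # Two missing values count as a match; one missing value does not.
    return result | both_nan

demo = within_threshold(
    [101, 105, 0, 5, float("nan"), float("nan")],
    [100, 100, 0, 0, float("nan"), 1],
    threshold_pct=2.0,
)
print(demo.tolist())  # [True, False, True, False, True, False]
```

This version assumes both inputs share the same positional order; if they are Series with different indexes, they would need to be aligned first.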


## Step 3: Load and Prepare Data

In [4]:
# Load CSV (skip 2 header rows)
print("Loading CSV...")
csv_df = pd.read_csv("merged_age_gender(growth).csv", skiprows=2)

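# Clean and map columns
# 'Impr.' is expected as text with thousands separators (e.g. "1,234");
# astype(str) keeps this working if pandas already parsed it as numbers,
# and regex=False makes the comma a literal character.
csv_df['Impr.'] = csv_df['Impr.'].astype(str).str.replace(',', '', regex=False).astype(int)
# errors='coerce' turns unparseable values into NaN, which sum() then skips
csv_df['Cost'] = pd.to_numeric(csv_df['Cost'], errors='coerce')
csv_df['Clicks'] = pd.to_numeric(csv_df['Clicks'], errors='coerce')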

csv_df = csv_df.rename(columns={
    'Campaign': 'campaign_name',
    'Day': 'day',
    'Gender': 'gender',
    'Age': 'age',
    'Cost': 'cost',
    'Impr.': 'impressions',
    'Clicks': 'clicks'
})

print(f"✓ CSV loaded: {len(csv_df):,} rows")

# Load Fabric Excel
print("\nLoading Fabric export...")
fabric_df = pd.read_excel("merged_age_gender(gold)2.xlsx")
fabric_df['day'] = pd.to_datetime(fabric_df['day']).dt.strftime('%Y-%m-%d')

print(f"✓ Fabric loaded: {len(fabric_df):,} rows")

print("\n" + "="*80)
print("DATA SUMMARY")
print("="*80)
print(f"\nCSV Date Range: {csv_df['day'].min()} to {csv_df['day'].max()}")
print(f"Fabric Date Range: {fabric_df['day'].min()} to {fabric_df['day'].max()}")

Loading CSV...


FileNotFoundError: [Errno 2] No such file or directory: 'merged_age_gender(growth).csv'
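
The load step converts `Impr.` by stripping commas but passes `Cost` and `Clicks` straight to `pd.to_numeric`, so a value such as `"1,234"` in those columns would silently become NaN. A minimal sketch of a shared cleaning helper follows; `to_number` and the sample frame are hypothetical, and whether the real CSV contains comma-formatted costs or clicks is not known from the notebook.

```python
import pandas as pd

def to_number(series):
    """Strip thousands separators, then coerce; unparseable values become NaN."""
    cleaned = series.astype(str).str.replace(",", "", regex=False).str.strip()
    return pd.to_numeric(cleaned, errors="coerce")

# Made-up sample mimicking an ad-platform export.
raw = pd.DataFrame({
    "Impr.": ["1,234", "56", " -- "],
    "Clicks": [10, 20, 30],
    "Cost": ["1,000.50", "2.25", None],
})

clean = raw.apply(to_number)

# Count values that could not be parsed, per column, so they can be reviewed.
unparsed = clean.isna().sum()
print(clean)
print(unparsed)
```

Reporting the NaN counts before aggregating makes it easier to tell a genuine source discrepancy from a parsing loss.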

## Step 4: Overall Totals Comparison (with 2% threshold)

In [None]:
print("="*80)
print(f"OVERALL TOTALS COMPARISON (with {THRESHOLD_PERCENT}% threshold)")
print("="*80)

# Calculate totals
csv_totals = csv_df[['cost', 'impressions', 'clicks']].sum()
fabric_totals = fabric_df[['cost', 'impressions', 'clicks']].sum()

# Create comparison dataframe
overall_comparison = pd.DataFrame({
    'Metric': ['Cost (₹)', 'Impressions', 'Clicks'],
    'CSV': [csv_totals['cost'], csv_totals['impressions'], csv_totals['clicks']],
    'Fabric': [fabric_totals['cost'], fabric_totals['impressions'], fabric_totals['clicks']],
})

overall_comparison['Difference'] = overall_comparison['CSV'] - overall_comparison['Fabric']
overall_comparison['Diff %'] = (overall_comparison['Difference'] / overall_comparison['Fabric'] * 100).round(2)

# Apply threshold matching
overall_comparison['Match'] = overall_comparison['Diff %'].abs() <= THRESHOLD_PERCENT
overall_comparison['Status'] = overall_comparison['Match'].apply(lambda x: '✓ PASS' if x else '✗ FAIL')

display(overall_comparison)

# Summary
matches = overall_comparison['Match'].sum()
print(f"\n✓ Matches (within {THRESHOLD_PERCENT}%): {matches}/3 metrics")
if matches == 3:
    print(f"✓✓✓ ALL OVERALL TOTALS MATCH (within {THRESHOLD_PERCENT}% threshold)! ✓✓✓")
else:
    print(f"⚠ {3-matches} metric(s) exceed {THRESHOLD_PERCENT}% threshold")

OVERALL TOTALS COMPARISON (with 2% threshold)


|   | Metric | CSV | Fabric | Difference | Diff % | Match | Status |
|---|---|---|---|---|---|---|---|
| 0 | Cost (₹) | 260312.7 | 260320.48 | -7.78 | -0.0 | True | ✓ PASS |
| 1 | Impressions | 1338526.0 | 1335418.0 | 3108.0 | 0.23 | True | ✓ PASS |
| 2 | Clicks | 52059.0 | 50129.0 | 1930.0 | 3.85 | False | ✗ FAIL |



✓ Matches (within 3.0%): 2/3 metrics
⚠ 1 metric(s) exceed 3.0% threshold
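
The failing Clicks row can be checked by hand from the totals in the table. The snippet below reproduces that arithmetic and also illustrates a side effect of rounding `Diff %` to two decimals before comparing it with the threshold. The `borderline` value is hypothetical and does not come from the data.

```python
THRESHOLD_PERCENT = 3.0  # value used in the recorded run

# Totals taken from the Clicks row of the comparison table.
csv_clicks, fabric_clicks = 52059, 50129
diff_pct = (csv_clicks - fabric_clicks) / fabric_clicks * 100
print(round(diff_pct, 2), abs(diff_pct) <= THRESHOLD_PERCENT)  # 3.85 False

# Rounding before comparing can hide a borderline breach (hypothetical value).
borderline = 3.004
rounded_passes = round(borderline, 2) <= THRESHOLD_PERCENT
raw_passes = borderline <= THRESHOLD_PERCENT
print(rounded_passes, raw_passes)  # True False
```

Comparing the unrounded difference and rounding only for display would avoid that edge case.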


## Step 5: Validation by Date (with 2% threshold)

In [None]:
print("="*80)
print(f"SEGMENT VALIDATION: BY DATE (with {THRESHOLD_PERCENT}% threshold)")
print("="*80)

# Aggregate by date
csv_by_date = csv_df.groupby('day').agg({
    'cost': 'sum',
    'impressions': 'sum',
    'clicks': 'sum'
}).reset_index()
csv_by_date.columns = ['day', 'cost_csv', 'impressions_csv', 'clicks_csv']

fabric_by_date = fabric_df.groupby('day').agg({
    'cost': 'sum',
    'impressions': 'sum',
    'clicks': 'sum'
}).reset_index()
fabric_by_date.columns = ['day', 'cost_fabric', 'impressions_fabric', 'clicks_fabric']

# Merge and compare
date_comparison = pd.merge(csv_by_date, fabric_by_date, on='day', how='outer', indicator=True)

# Calculate differences and percentages
date_comparison['cost_diff'] = date_comparison['cost_csv'] - date_comparison['cost_fabric']
date_comparison['cost_diff_pct'] = (date_comparison['cost_diff'] / date_comparison['cost_fabric'] * 100).round(2)

date_comparison['impr_diff'] = date_comparison['impressions_csv'] - date_comparison['impressions_fabric']
date_comparison['impr_diff_pct'] = (date_comparison['impr_diff'] / date_comparison['impressions_fabric'] * 100).round(2)

date_comparison['clicks_diff'] = date_comparison['clicks_csv'] - date_comparison['clicks_fabric']
date_comparison['clicks_diff_pct'] = (date_comparison['clicks_diff'] / date_comparison['clicks_fabric'] * 100).round(2)

# Apply threshold matching
date_comparison['cost_match'] = date_comparison.apply(
    lambda row: check_match_with_threshold(row['cost_csv'], row['cost_fabric'], THRESHOLD_PERCENT), axis=1
)
date_comparison['impr_match'] = date_comparison.apply(
    lambda row: check_match_with_threshold(row['impressions_csv'], row['impressions_fabric'], THRESHOLD_PERCENT), axis=1
)
date_comparison['clicks_match'] = date_comparison.apply(
    lambda row: check_match_with_threshold(row['clicks_csv'], row['clicks_fabric'], THRESHOLD_PERCENT), axis=1
)

date_comparison['perfect_match'] = date_comparison['cost_match'] & date_comparison['impr_match'] & date_comparison['clicks_match']
date_comparison['status'] = date_comparison['perfect_match'].apply(lambda x: '✓ PASS' if x else '✗ FAIL')

# Display results
display_cols = ['day', 'cost_csv', 'cost_fabric', 'cost_diff_pct', 
                'impressions_csv', 'impressions_fabric', 'impr_diff_pct',
                'clicks_csv', 'clicks_fabric', 'clicks_diff_pct', 'status']

print(f"\nTotal dates compared: {len(date_comparison)}")
print(f"✓ Matches (within {THRESHOLD_PERCENT}%): {date_comparison['perfect_match'].sum()}")
print(f"✗ Exceeds threshold: {(~date_comparison['perfect_match']).sum()}")

print("\nDetailed comparison:")
display(date_comparison[display_cols].sort_values('day'))

# Save mismatches
if (~date_comparison['perfect_match']).sum() > 0:
    mismatches = date_comparison[~date_comparison['perfect_match']]
    mismatches[display_cols].to_csv('segment_validation_by_date_threshold.csv', index=False)
    print(f"\n✓ Date-level mismatches (>{THRESHOLD_PERCENT}%) saved to: segment_validation_by_date_threshold.csv")

SEGMENT VALIDATION: BY DATE (with 3.0% threshold)

Total dates compared: 29
✓ Matches (within 3.0%): 6
✗ Exceeds threshold: 23

Detailed comparison:


|   | day | cost_csv | cost_fabric | cost_diff_pct | impressions_csv | impressions_fabric | impr_diff_pct | clicks_csv | clicks_fabric | clicks_diff_pct | status |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-11-01 | 5406.62 | 5406.67 | -0.0 | 89790 | 89680 | 0.12 | 1519 | 1428 | 6.37 | ✗ FAIL |
| 1 | 2025-11-03 | 958.94 | 958.92 | 0.0 | 6534 | 6422 | 1.74 | 190 | 137 | 38.69 | ✗ FAIL |
| 2 | 2025-11-04 | 5404.33 | 5404.34 | -0.0 | 57103 | 56981 | 0.21 | 1475 | 1392 | 5.96 | ✗ FAIL |
| 3 | 2025-11-05 | 576.05 | 576.02 | 0.01 | 7172 | 7064 | 1.53 | 144 | 102 | 41.18 | ✗ FAIL |
| 4 | 2025-11-06 | 2212.8 | 2212.79 | 0.0 | 22750 | 22634 | 0.51 | 564 | 501 | 12.57 | ✗ FAIL |
| 5 | 2025-11-07 | 5691.94 | 5691.93 | 0.0 | 49953 | 49838 | 0.23 | 1396 | 1317 | 6.0 | ✗ FAIL |
| 6 | 2025-11-08 | 10985.24 | 10985.25 | -0.0 | 84543 | 84437 | 0.13 | 1955 | 1872 | 4.43 | ✗ FAIL |
| 7 | 2025-11-09 | 16887.9 | 16887.89 | 0.0 | 113292 | 113181 | 0.1 | 2092 | 2017 | 3.72 | ✗ FAIL |
| 8 | 2025-11-10 | 16130.8 | 16130.81 | -0.0 | 90208 | 90093 | 0.13 | 1838 | 1742 | 5.51 | ✗ FAIL |
| 9 | 2025-11-11 | 11965.48 | 11965.46 | 0.0 | 87841 | 87738 | 0.12 | 1773 | 1700 | 4.29 | ✗ FAIL |



✓ Date-level mismatches (>3.0%) saved to: segment_validation_by_date_threshold.csv


## Step 6: Validation by Campaign (with 2% threshold)

In [None]:
print("="*80)
print(f"SEGMENT VALIDATION: BY CAMPAIGN (with {THRESHOLD_PERCENT}% threshold)")
print("="*80)

# Aggregate by campaign
csv_by_campaign = csv_df.groupby('campaign_name').agg({
    'cost': 'sum',
    'impressions': 'sum',
    'clicks': 'sum'
}).reset_index()
csv_by_campaign.columns = ['campaign_name', 'cost_csv', 'impressions_csv', 'clicks_csv']

fabric_by_campaign = fabric_df.groupby('campaign_name').agg({
    'cost': 'sum',
    'impressions': 'sum',
    'clicks': 'sum'
}).reset_index()
fabric_by_campaign.columns = ['campaign_name', 'cost_fabric', 'impressions_fabric', 'clicks_fabric']

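# Merge and compare
# Note: how='inner' keeps only campaigns present in both sources, so the
# indicator column is always 'both' and campaigns missing from one side
# are not reported (the date comparison above uses how='outer').
campaign_comparison = pd.merge(csv_by_campaign, fabric_by_campaign, on='campaign_name', how='inner', indicator=True)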

# Calculate differences and percentages
campaign_comparison['cost_diff_pct'] = ((
    campaign_comparison['cost_csv'] - campaign_comparison['cost_fabric']
) / campaign_comparison['cost_fabric'] * 100).round(2)

campaign_comparison['impr_diff_pct'] = ((
    campaign_comparison['impressions_csv'] - campaign_comparison['impressions_fabric']
) / campaign_comparison['impressions_fabric'] * 100).round(2)

campaign_comparison['clicks_diff_pct'] = ((
    campaign_comparison['clicks_csv'] - campaign_comparison['clicks_fabric']
) / campaign_comparison['clicks_fabric'] * 100).round(2)

# Apply threshold matching
campaign_comparison['perfect_match'] = (
    (campaign_comparison['cost_diff_pct'].abs() <= THRESHOLD_PERCENT) & 
    (campaign_comparison['impr_diff_pct'].abs() <= THRESHOLD_PERCENT) & 
    (campaign_comparison['clicks_diff_pct'].abs() <= THRESHOLD_PERCENT)
)
campaign_comparison['status'] = campaign_comparison['perfect_match'].apply(lambda x: '✓ PASS' if x else '✗ FAIL')

# Display results
display_cols = ['campaign_name', 'cost_csv', 'cost_fabric', 'cost_diff_pct',
                'impressions_csv', 'impressions_fabric', 'impr_diff_pct',
                'clicks_csv', 'clicks_fabric', 'clicks_diff_pct', 'status']

print(f"\nTotal campaigns compared: {len(campaign_comparison)}")
print(f"✓ Matches (within {THRESHOLD_PERCENT}%): {campaign_comparison['perfect_match'].sum()}")
print(f"✗ Exceeds threshold: {(~campaign_comparison['perfect_match']).sum()}")

print("\nDetailed comparison:")
display(campaign_comparison[display_cols].sort_values('campaign_name'))

# Save mismatches
if (~campaign_comparison['perfect_match']).sum() > 0:
    mismatches = campaign_comparison[~campaign_comparison['perfect_match']]
    mismatches[display_cols].to_csv('segment_validation_by_campaign_threshold.csv', index=False)
    print(f"\n✓ Campaign-level mismatches (>{THRESHOLD_PERCENT}%) saved to: segment_validation_by_campaign_threshold.csv")

SEGMENT VALIDATION: BY CAMPAIGN (with 3.0% threshold)

Total campaigns compared: 5
✓ Matches (within 3.0%): 2
✗ Exceeds threshold: 3

Detailed comparison:


|   | campaign_name | cost_csv | cost_fabric | cost_diff_pct | impressions_csv | impressions_fabric | impr_diff_pct | clicks_csv | clicks_fabric | clicks_diff_pct | status |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cadiveu_Instamart_External_20th_Nov_2025 | 5499.5 | 5499.49 | 0.0 | 342 | 268 | 27.61 | 26 | 18 | 44.44 | ✗ FAIL |
| 1 | IKONIC-AMZ-Glide-Peach-14-Oct-2025 | 30429.6 | 30439.37 | -0.03 | 287833 | 287868 | -0.01 | 10622 | 10625 | -0.03 | ✓ PASS |
| 2 | ME_Search_\|_Oct_25 | 111296.45 | 111296.54 | -0.0 | 646629 | 645180 | 0.22 | 13091 | 12082 | 8.35 | ✗ FAIL |
| 3 | Nykaa_Black_Friday_Traffic | 3499.34 | 3497.13 | 0.06 | 216816 | 216752 | 0.03 | 16089 | 16010 | 0.49 | ✓ PASS |
| 4 | PRO_Search_\|_Oct_25 | 109587.81 | 109587.95 | -0.0 | 186906 | 185350 | 0.84 | 12231 | 11394 | 7.35 | ✗ FAIL |



✓ Campaign-level mismatches (>3.0%) saved to: segment_validation_by_campaign_threshold.csv
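
Both segment comparisons request `indicator=True` but never inspect the resulting `_merge` column. With an outer merge, that column shows which segments exist in only one source, which is a different kind of problem from a value that exceeds the threshold. A minimal sketch with made-up campaign names (`A`, `B`, `C`) and a hypothetical `coverage` column:

```python
import pandas as pd

# Hypothetical pre-aggregated frames; names and numbers are illustrative only.
csv_agg = pd.DataFrame({"campaign_name": ["A", "B"], "clicks": [10, 20]})
fabric_agg = pd.DataFrame({"campaign_name": ["B", "C"], "clicks": [20, 30]})

merged = csv_agg.merge(
    fabric_agg,
    on="campaign_name",
    how="outer",                      # keep rows found on either side
    suffixes=("_csv", "_fabric"),
    indicator=True,                   # adds the _merge column
)

# Translate pandas' left/right labels into the notebook's source names.
labels = {"left_only": "CSV only", "right_only": "Fabric only", "both": "both"}
merged["coverage"] = merged["_merge"].astype(str).map(labels)

one_sided = merged[merged["coverage"] != "both"]
print(merged[["campaign_name", "clicks_csv", "clicks_fabric", "coverage"]])
print(f"Segments present in only one source: {len(one_sided)}")
```

Reporting one-sided segments separately keeps them from being counted as ordinary threshold failures, or from being dropped entirely when the merge is an inner join.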


## Step 7: Final Summary Report

In [None]:
print("="*80)
print(f"SEGMENT VALIDATION SUMMARY REPORT (with {THRESHOLD_PERCENT}% threshold)")
print("="*80)
print(f"\nAnalysis completed: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

# Create summary table
summary_data = [
    ['Overall Totals', 3, overall_comparison['Match'].sum(), 3 - overall_comparison['Match'].sum()],
    ['By Date', len(date_comparison), date_comparison['perfect_match'].sum(), 
     (~date_comparison['perfect_match']).sum()],
    ['By Campaign', len(campaign_comparison), campaign_comparison['perfect_match'].sum(), 
     (~campaign_comparison['perfect_match']).sum()]
]

summary_df = pd.DataFrame(summary_data, 
                         columns=['Segment Type', 'Total Segments', 'Matches', 'Exceeds Threshold'])
summary_df['Match %'] = (summary_df['Matches'] / summary_df['Total Segments'] * 100).round(2)

print("\n")
display(summary_df)

# Overall assessment
total_segments = summary_df['Total Segments'].sum()
total_matches = summary_df['Matches'].sum()
overall_match_pct = (total_matches / total_segments * 100)

print("\n" + "="*80)
print(f"OVERALL MATCH RATE (within {THRESHOLD_PERCENT}%): {total_matches}/{total_segments} ({overall_match_pct:.1f}%)")
print("="*80)

if overall_match_pct == 100:
    print(f"\n✓✓✓ PERFECT VALIDATION! All segments within {THRESHOLD_PERCENT}% threshold! ✓✓✓")
elif overall_match_pct >= 95:
    print(f"\n✓ EXCELLENT! {overall_match_pct:.1f}% of segments within {THRESHOLD_PERCENT}% threshold")
elif overall_match_pct >= 80:
    print(f"\n⚠ GOOD: {overall_match_pct:.1f}% within threshold. Some segments need review.")
else:
    print(f"\n⚠ ATTENTION: Only {overall_match_pct:.1f}% within {THRESHOLD_PERCENT}% threshold. Review required.")

print("\n" + "-"*80)
print("KEY INSIGHTS:")
print("-"*80)
print(f"• Threshold used: {THRESHOLD_PERCENT}%")
print(f"• Segments passing: {total_matches}/{total_segments}")
print(f"• Segments exceeding threshold: {total_segments - total_matches}")

print("\n" + "="*80)
print("VALIDATION COMPLETE")
print("="*80)

SEGMENT VALIDATION SUMMARY REPORT (with 3.0% threshold)

Analysis completed: 2025-12-17 10:36:15




|   | Segment Type | Total Segments | Matches | Exceeds Threshold | Match % |
|---|---|---|---|---|---|
| 0 | Overall Totals | 3 | 2 | 1 | 66.67 |
| 1 | By Date | 29 | 6 | 23 | 20.69 |
| 2 | By Campaign | 5 | 2 | 3 | 40.0 |



OVERALL MATCH RATE (within 3.0%): 10/37 (27.0%)

⚠ ATTENTION: Only 27.0% within 3.0% threshold. Review required.

--------------------------------------------------------------------------------
KEY INSIGHTS:
--------------------------------------------------------------------------------
• Threshold used: 3.0%
• Segments passing: 10/37
• Segments exceeding threshold: 27

VALIDATION COMPLETE
