<!-- @format -->

# Validation Check Notebook

This notebook checks that the train/validation split of the PriceVision dataset is stratified by price bucket, so that both sets have similar price distributions.


## Setup and Data Loading

This section sets up the Python path, imports necessary modules, loads the training data, and creates stratified train/validation splits.


In [None]:
# Add parent directory to Python path for imports
import sys
sys.path.append('..')

import pandas as pd
# Import custom validation function for stratified splitting
from utils.validation import create_validation_split
# Import target column name from config
from config import TARGET_COL

# Load training data
df = pd.read_csv("../data/train.csv")

# Create stratified train/validation splits based on price buckets
train_df, val_df = create_validation_split(df, TARGET_COL)
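
The notebook does not show the body of `create_validation_split`. As a rough illustration of what a bucket-stratified split can look like, here is a minimal sketch. The function name `sketch_validation_split` and the parameters `n_buckets`, `val_frac`, and `seed` are hypothetical, and the choice of five quantile buckets with a 20% validation share is only an inference from the outputs further down. It also assumes the DataFrame has a unique index.

```python
import pandas as pd


def sketch_validation_split(df, target_col, n_buckets=5, val_frac=0.2, seed=42):
    """Hypothetical stand-in for create_validation_split (not the project's code)."""
    out = df.copy()
    # Quantile buckets: each bucket holds roughly the same number of rows.
    out["price_bucket"] = pd.qcut(
        out[target_col], q=n_buckets, labels=False, duplicates="drop"
    )
    # Sample the same fraction of rows from every bucket for validation.
    val_df = out.groupby("price_bucket").sample(frac=val_frac, random_state=seed)
    # Every row that was not sampled stays in the training set.
    train_df = out.drop(index=val_df.index)
    return train_df, val_df


# Toy data with a skewed price column, for illustration only.
toy = pd.DataFrame({"price": [float(i) ** 1.5 for i in range(1, 101)]})
toy_train, toy_val = sketch_validation_split(toy, "price")
print(len(toy_train), len(toy_val))
print(toy_val["price_bucket"].value_counts().sort_index())
```

Sampling per bucket, rather than from the whole frame, is what keeps the bucket proportions aligned between the two sets.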

## Price Statistics Comparison

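This section compares the price distributions of the training and validation sets. A stratified split should leave the mean, standard deviation, and quartiles close to each other; the minimum and maximum can still differ, because each extreme price lands in only one of the two sets.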


In [None]:
# Display price statistics for training set
print("Train price stats:")
print(train_df[TARGET_COL].describe())

# Display price statistics for validation set
print("\nValidation price stats:")
print(val_df[TARGET_COL].describe())

Train price stats:
count    40000.000000
mean        23.603449
std         31.946248
min          0.130000
25%          6.790000
50%         14.000000
75%         28.566250
max       1280.000000
Name: price, dtype: float64

Validation price stats:
count    10000.000000
mean        23.793345
std         32.835208
min          0.360000
25%          6.788750
50%         14.160000
75%         28.350000
max        691.160000
Name: price, dtype: float64
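
To put a number on "similar", the two summaries can be compared as relative gaps. The sketch below copies the values from the output above into plain dictionaries; the names `train_stats`, `val_stats`, and `rel_gap` are illustrative and not part of the project.

```python
# Summary statistics copied from the describe() output above.
train_stats = {
    "mean": 23.603449, "std": 31.946248,
    "25%": 6.79, "50%": 14.0, "75%": 28.56625, "max": 1280.0,
}
val_stats = {
    "mean": 23.793345, "std": 32.835208,
    "25%": 6.78875, "50%": 14.16, "75%": 28.35, "max": 691.16,
}

# Relative gap of each validation statistic against its training counterpart.
rel_gap = {
    name: (val_stats[name] - train_stats[name]) / train_stats[name]
    for name in train_stats
}

for name, gap in rel_gap.items():
    print(f"{name:>4}: {gap:+.2%}")
```

The mean, standard deviation, and quartiles all differ by less than 3%, while the maximum differs by about 46%. A large gap in the maximum is expected for a heavy-tailed price column, since the single most expensive item can only be in one set.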


## Bucket Distribution Analysis

This section compares the share of rows in each price bucket across the training and validation sets. If the split is stratified, the per-bucket proportions should be nearly identical.


In [None]:
# Display normalized bucket distribution for training set
print("Train buckets:")
print(train_df["price_bucket"].value_counts(normalize=True))

# Display normalized bucket distribution for validation set
print("\nValidation buckets:")
print(val_df["price_bucket"].value_counts(normalize=True))

Train buckets:
price_bucket
2    0.200125
0    0.200075
4    0.199975
1    0.199925
3    0.199900
Name: proportion, dtype: float64

Validation buckets:
price_bucket
0    0.2001
2    0.2001
4    0.2000
1    0.1999
3    0.1999
Name: proportion, dtype: float64
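
Rather than comparing the two listings by eye, the check can be reduced to a single number. The helper below is a minimal sketch, and `max_bucket_gap` is a hypothetical name rather than a function in `utils.validation`. It returns the largest absolute difference in bucket share between two Series of bucket labels.

```python
import pandas as pd


def max_bucket_gap(train_buckets, val_buckets):
    """Largest absolute difference in bucket share between two label Series."""
    train_share = train_buckets.value_counts(normalize=True)
    val_share = val_buckets.value_counts(normalize=True)
    # fill_value=0 treats a bucket missing from one split as a 0% share.
    return train_share.sub(val_share, fill_value=0).abs().max()


# Toy labels: bucket 3 is absent from the validation series.
toy_train = pd.Series([0, 0, 1, 1, 2, 2, 3, 3])
toy_val = pd.Series([0, 1, 2, 2])
print(max_bucket_gap(toy_train, toy_val))
```

In this notebook the call would be `max_bucket_gap(train_df["price_bucket"], val_df["price_bucket"])`. Going by the proportions printed above, the largest gap is 0.000025, so an assertion with a small tolerance such as 0.001 would pass comfortably.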


## Additional Validation Checks

Use this section for any further validation or analysis of the split.


In [None]:
# Add any additional validation code here
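
One candidate for this section is a basic integrity check: the two sets should not share rows, and together they should cover the full dataset. The sketch below assumes that the split keeps the original DataFrame index, which the notebook does not confirm. `check_split_integrity` is a hypothetical helper name.

```python
import pandas as pd


def check_split_integrity(full_df, train_df, val_df):
    """Return basic sanity checks for a train/validation split."""
    return {
        # No row index appears in both splits.
        "disjoint": train_df.index.intersection(val_df.index).empty,
        # The two splits together account for every original row.
        "complete": len(train_df) + len(val_df) == len(full_df),
        # Share of rows that ended up in the validation set.
        "val_fraction": len(val_df) / len(full_df),
    }


# Toy example: 10 rows split 8/2 by position.
toy_full = pd.DataFrame({"price": range(10)})
report = check_split_integrity(toy_full, toy_full.iloc[:8], toy_full.iloc[8:])
print(report)
```

With the row counts shown earlier (40,000 training and 10,000 validation rows), `val_fraction` would be 0.2 if `df` holds 50,000 rows.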