# 01 — Data Profiling & Readiness

This notebook runs lightweight profiling **before** any modeling. It covers dataset shape, missingness, duplicates, column types, and column-level red flags (high missingness, likely ID columns, low variance).

**Tip:** Put your dataset at `../data/sample.csv`, or update `DATA_PATH` in the first code cell.

In [None]:
import pandas as pd
from src.profiling import (
    load_csv, basic_overview, missingness_report, duplicates_report,
    type_summary, numeric_stats, categorical_preview, red_flag_columns
)

DATA_PATH = "../data/sample.csv"  # change if needed
df = load_csv(DATA_PATH)

print("Overview:", basic_overview(df))
df.head()

In [None]:
# Missingness report (top 20)
missingness_report(df).head(20)
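
The implementation of `missingness_report` lives in `src.profiling` and is not shown in this notebook. As a rough illustration of what a report like this typically computes, here is a minimal sketch in plain pandas. The function name `missingness_sketch`, the output column names, and the toy data are all assumptions made for this example, not the project's actual API.

```python
import numpy as np
import pandas as pd


def missingness_sketch(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: missing count and percentage per column, worst first."""
    report = pd.DataFrame({
        "n_missing": frame.isna().sum(),           # count of NaN/None per column
        "pct_missing": frame.isna().mean() * 100,  # share of rows missing, in percent
    })
    return report.sort_values("pct_missing", ascending=False)


# Toy data (illustrative only): "a" is half empty, "b" has one gap, "c" is complete.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})
missing = missingness_sketch(toy)
print(missing)
```

Sorting worst-first is what makes a `.head(20)` cut useful: the columns most in need of attention come out on top.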

## Duplicate checks

If your dataset has a natural key (e.g., `id`, `event_id`, `transaction_id`), set it in `KEY_COLS`.
Otherwise, keep it empty and just review global duplicates.

In [None]:
KEY_COLS = []  # e.g., ["event_id"]
# An empty list falls back to subset=None, i.e. the global duplicate check.
duplicates_report(df, subset=KEY_COLS or None)
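
To see why the choice of key matters, the sketch below compares a global check with a key-based check using `DataFrame.duplicated`. It is only an approximation of what `duplicates_report` might do. The helper `duplicates_sketch`, its return keys, and the toy data are hypothetical.

```python
import pandas as pd


def duplicates_sketch(frame: pd.DataFrame, subset=None) -> dict:
    """Hypothetical stand-in: count rows that repeat an earlier row."""
    # keep="first" marks every repeat after the first occurrence as a duplicate.
    dup_mask = frame.duplicated(subset=subset, keep="first")
    return {
        "n_rows": len(frame),
        "n_duplicates": int(dup_mask.sum()),
        "pct_duplicates": float(dup_mask.mean() * 100) if len(frame) else 0.0,
    }


# Toy data (illustrative only): event_id 2 appears twice with different values.
toy = pd.DataFrame({
    "event_id": [1, 2, 2, 3],
    "value": [10, 20, 21, 30],
})

global_dups = duplicates_sketch(toy)                     # compares whole rows
key_dups = duplicates_sketch(toy, subset=["event_id"])   # compares the key only

print(global_dups)
print(key_dups)
```

In this toy case the global check finds nothing, because the two `event_id == 2` rows differ in `value`, while the key-based check flags one row. Conflicting records under the same key are often the more serious problem, so it is worth setting `KEY_COLS` whenever a natural key exists.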

In [None]:
# Type summary (top 25)
type_summary(df).head(25)

In [None]:
# Numeric stats (if numeric columns exist)
numeric_stats(df).head(20)

In [None]:
# Category previews (top values per column); useful for spotting messy labels
previews = categorical_preview(df)
list(previews.items())[:5]
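
For readers without access to `src.profiling`, a preview like this can be approximated with `value_counts`. The sketch below is an assumption about the general idea, not the project's implementation. `categorical_preview_sketch`, its `top_n` parameter, and the toy data are invented for illustration.

```python
import pandas as pd


def categorical_preview_sketch(frame: pd.DataFrame, top_n: int = 5) -> dict:
    """Hypothetical stand-in: top value counts for each non-numeric column."""
    previews = {}
    # Excluding numeric dtypes keeps string-like columns across pandas versions.
    for col in frame.select_dtypes(exclude="number").columns:
        previews[col] = frame[col].value_counts(dropna=False).head(top_n)
    return previews


# Toy data (illustrative only): the same city is spelled three different ways.
toy = pd.DataFrame({
    "city": ["NYC", "nyc", "NYC ", "Boston", "NYC"],
    "amount": [1.0, 2.0, 3.0, 4.0, 5.0],
})
toy_previews = categorical_preview_sketch(toy)
print(toy_previews["city"])
```

Here `"NYC"`, `"nyc"`, and `"NYC "` (with a trailing space) show up as separate labels, which is the kind of inconsistency a top-values preview is meant to surface.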

In [None]:
# Red flags: high missingness, likely IDs, low variance
red_flag_columns(df).head(30)
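
The three checks named in the comment above can be sketched with simple per-column heuristics. The thresholds, flag names, and function `red_flags_sketch` below are assumptions chosen for illustration. The real `red_flag_columns` may use different rules and a different output shape.

```python
import numpy as np
import pandas as pd


def red_flags_sketch(
    frame: pd.DataFrame,
    missing_threshold: float = 0.5,   # assumed cutoff: more than 50% missing
    id_unique_ratio: float = 0.95,    # assumed cutoff: nearly every row unique
) -> pd.DataFrame:
    """Hypothetical stand-in: flag columns by missingness, uniqueness, and variance."""
    rows = []
    n_rows = len(frame)
    for col in frame.columns:
        series = frame[col]
        n_unique = series.nunique(dropna=True)
        flags = []
        if series.isna().mean() > missing_threshold:
            flags.append("high_missingness")
        if n_rows > 0 and n_unique / n_rows >= id_unique_ratio:
            flags.append("likely_id")
        if n_unique <= 1:  # at most one distinct non-null value
            flags.append("low_variance")
        if flags:
            rows.append({"column": col, "flags": ", ".join(flags)})
    return pd.DataFrame(rows, columns=["column", "flags"])


# Toy data (illustrative only).
toy = pd.DataFrame({
    "row_id": [101, 102, 103, 104],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
    "constant": ["a", "a", "a", "a"],
    "amount": [5.0, 7.5, 5.0, 7.5],
})
flags = red_flags_sketch(toy)
print(flags)
```

A uniqueness-ratio heuristic like this one will also flag continuous numeric features whose values rarely repeat, so flagged columns are candidates for review rather than automatic removal.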