## Table of Contents
1. BI objectives & KPIs
2. Load data
3. Validation checks
4. Minimal transformations
5. Visualizations
6. Data dictionary & reproducibility notes

## 1. BI objectives & KPIs
**Objectives:**
- Identify demographic and job-related factors associated with higher income.
- Provide validated BI metrics for dashboards and follow-up analyses.

**Primary KPIs (definitions):**
- High-income rate = (count of rows with income==1) / total rows
- Median hours-per-week by income group = median(hours_per_week) grouped by income
- Proportion high-income by education bucket = count(high_income & education_bucket) / count(education_bucket)
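
The three definitions above can be sketched on a small toy frame. This is an illustration only: the rows are made up, and `education_bucket` (with its cut points and the labels `low`, `mid`, `high`) is a hypothetical grouping of `education_num` that this notebook does not otherwise create.

```python
import pandas as pd

# Toy rows standing in for adult_cleaned.csv; values are invented for illustration.
toy = pd.DataFrame({
    "income": [1, 0, 0, 1, 0, 0],
    "hours_per_week": [50, 40, 35, 60, 40, 20],
    "education_num": [13, 9, 9, 14, 10, 7],
})

# Hypothetical education buckets; the bin edges and labels are assumptions.
toy["education_bucket"] = pd.cut(
    toy["education_num"], bins=[0, 8, 12, 16], labels=["low", "mid", "high"]
)

# High-income rate: for a 0/1 column the mean equals count(income==1) / total rows.
high_income_rate = toy["income"].mean()

# Median hours-per-week by income group.
median_hours = toy.groupby("income")["hours_per_week"].median()

# Proportion high-income within each education bucket.
rate_by_bucket = toy.groupby("education_bucket", observed=False)["income"].mean()

print(high_income_rate)
print(median_hours)
print(rate_by_bucket)
```

Because `income` is binary, taking the mean is equivalent to the count ratio in the KPI definition, which keeps the grouped version a one-liner.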

## 2. Load data

In [None]:
# Imports and display options
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid')
pd.set_option('display.max_rows', 40)
pd.set_option('display.max_columns', 120)

In [None]:
# Load the cleaned dataset (already available in repo)
PATH = 'adult_cleaned.csv'
df = pd.read_csv(PATH)
print('Loaded', PATH, 'shape=', df.shape)
df.head()

## 3. Validation checks
These checks help ensure the dataset is in the expected cleaned state before using it for KPIs or dashboards. If a check fails, investigate upstream data preparation and rerun validations.

In [None]:
# Basic structural checks
expected_cols = ['age','fnlwgt','education_num','capital_gain','capital_loss','hours_per_week','income']
missing_expected = [c for c in expected_cols if c not in df.columns]
print('Missing expected columns:', missing_expected)
print('Total rows:', len(df))
print('Duplicate rows:', df.duplicated().sum())
print('Missing values (per column):')
print(df.isna().sum()[lambda x: x>0])

# Income value checks
if 'income' in df.columns:
    print('Income unique values:', df['income'].unique()[:20])
    # check binary (0/1) or string labels, convert to binary if needed
    if set(df['income'].dropna().unique()) <= {0,1}:
        print('Income is already binary 0/1')
    else:
        print('Converting income to binary: >50K => 1 else 0')
        # Note: missing values become the string 'nan' and therefore map to 0
        df['income'] = df['income'].astype(str).str.contains('>50', regex=False).astype(int)

In [None]:
# Sanity assertions (fail loudly so problems are not missed)
assert len(df) > 0, 'Dataset is empty'
assert 'income' in df.columns, 'Missing income column'
assert set(df['income'].dropna().unique()) <= {0, 1}, 'Income should be binary (0/1) after conversion'
print('Sanity checks passed')

## 4. Minimal transformations
We keep transformations minimal because this is a cleaned dataset; transformations should be documented and reversible where possible.
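
One way to keep a transformation reversible is to write derived values to new columns on a copy and leave the source columns untouched. The sketch below assumes toy ages and a hypothetical helper named `add_age_group`; it uses the same bin edges as the age grouping in this section and shows which group the boundary ages fall into.

```python
import pandas as pd

# Toy ages chosen to sit on or next to the bin boundaries.
raw = pd.DataFrame({"age": [17, 25, 26, 50, 51, 90]})

def add_age_group(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with an added age_group column; the input is not modified."""
    out = frame.copy()
    out["age_group"] = pd.cut(
        out["age"],
        bins=[0, 25, 35, 50, 65, 120],  # right-inclusive by default: (0, 25], (25, 35], ...
        labels=["<=25", "26-35", "36-50", "51-65", "66+"],
    )
    return out

transformed = add_age_group(raw)

# Dropping the derived column recovers the original frame.
restored = transformed.drop(columns=["age_group"])

print(transformed)
```

Since the raw `age` column is never overwritten, undoing the step is just a matter of dropping `age_group`.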

In [None]:
# Ensure `income` is integer and create useful derived columns
df['income'] = df['income'].astype(int)
# create high_income binary (explicit column name for clarity)
if 'high_income' not in df.columns:
    df['high_income'] = df['income'].astype(int)
# Age groups
if 'age' in df.columns:
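    # Bins are right-inclusive: (0, 25], (25, 35], (35, 50], (50, 65], (65, 120]
    # pd.cut already returns a categorical column, so no further cast is needed
    df['age_group'] = pd.cut(df['age'], bins=[0,25,35,50,65,120], labels=['<=25','26-35','36-50','51-65','66+'])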
print('Transformations applied - sample:')
df[['age','age_group','income']].head()

## 5. Visualizations (quick KPI views)
These are quick visual checks of the KPIs; produce polished versions for presentations as needed.

In [None]:
# High-income counts (class balance)
plt.figure(figsize=(6,4))
sns.countplot(x='high_income', data=df)
plt.title('High-income count (0: <=50K, 1: >50K)')
plt.xlabel('high_income')
plt.show()

# Median hours-per-week by income
plt.figure(figsize=(8,5))
sns.barplot(x='income', y='hours_per_week', data=df, estimator=np.median)
plt.title('Median hours-per-week by income group')
plt.show()

## 6. Data dictionary (short)
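- age: integer age in years
- education_num: numeric encoding of education level
- hours_per_week: integer hours worked per week
- capital_gain, capital_loss: integers
- income: binary target (0 = <=50K, 1 = >50K)
- high_income: duplicate of income to make KPI naming explicit
- age_group: categorical age band derived from age (<=25, 26-35, 36-50, 51-65, 66+)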
- one-hot columns: many categorical values are pre-expanded into one-hot columns (e.g., `occupation_*`, `workclass_*`).
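
For dashboards it is sometimes handy to collapse a one-hot group back into a single categorical column. The sketch below is a minimal illustration on toy data: the column names are hypothetical, and it assumes exactly one flag is set per row. If the encoding dropped a baseline category, some rows would be all zeros and the check would not hold.

```python
import pandas as pd

# Toy one-hot group; these column names are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "workclass_Private": [1, 0, 0],
    "workclass_Self-emp": [0, 1, 0],
    "workclass_Gov": [0, 0, 1],
})

prefix = "workclass_"
cols = [c for c in toy.columns if c.startswith(prefix)]

# Check the assumption that exactly one flag is set in every row.
assert (toy[cols].sum(axis=1) == 1).all(), "expected exactly one flag per row"

# idxmax returns the name of the column holding the 1; strip the prefix to get the label.
workclass = toy[cols].idxmax(axis=1).str[len(prefix):]

print(workclass.tolist())
```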

---
## Reproducibility notes
- This notebook relies on `adult_cleaned.csv` in the repository root.
- Recommended packages: pandas, numpy, matplotlib, seaborn. Add these to `requirements.txt` or install with `pip install pandas numpy matplotlib seaborn`.
- To validate automatically as part of a CI pipeline, run the checks in the `Validation checks` section and fail on any assertion error.
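
For the CI use case, the validation checks could be wrapped in a small function that collects problems and raises if any are found. This is a sketch under assumptions: `validate` and `check_or_fail` are hypothetical names, and the rules shown mirror only the basic checks from the validation section. Raising explicitly, rather than relying on bare `assert`, also keeps the checks active when Python runs with `-O`, which skips assert statements.

```python
import pandas as pd

def validate(frame: pd.DataFrame, required=("income",)) -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if frame.empty:
        problems.append("dataset is empty")
    for col in required:
        if col not in frame.columns:
            problems.append(f"missing column: {col}")
    if "income" in frame.columns:
        values = set(frame["income"].dropna().unique())
        if not values <= {0, 1}:
            shown = sorted(str(v) for v in values)
            problems.append(f"income has non-binary values: {shown}")
    return problems

def check_or_fail(frame: pd.DataFrame) -> None:
    """Raise AssertionError listing every problem, so a CI job exits non-zero."""
    problems = validate(frame)
    if problems:
        raise AssertionError("; ".join(problems))

# Example on toy data; in CI the frame would come from pd.read_csv('adult_cleaned.csv').
check_or_fail(pd.DataFrame({"income": [0, 1, 0]}))
print("Validation passed")
```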

