# Phase 1: Synthetic Aadhaar-like Data Generator

## Aadhaar Pulse - ML Data Intelligence Layer

**Objective:** Generate a synthetic, aggregated dataset simulating district-level Aadhaar update behavior.

### Key Constraints:
- ❌ NO real Aadhaar or UIDAI data
- ❌ NO individual-level records
- ✅ ONLY synthetic, aggregated data
- ✅ Granularity: `district × month × age_group`
- ✅ Deterministic, rule-based simulation
- ✅ Explainable logic

### Output Schema:
| Field | Type | Description |
|-------|------|-------------|
| `district_id` | string | From district_master.csv |
| `month` | string | YYYY-MM format |
| `age_group` | enum | 0-17, 18-25, 26-35, 36-60, 60+ |
| `address_updates` | integer | Simulated address changes |
| `mobile_updates` | integer | Simulated mobile updates |
| `total_aadhaar` | integer | Base population count |

## 1. Import Required Libraries and Set Seed

Import pandas and numpy. Set a fixed random seed (42) for reproducibility.
All randomness must be deterministic to ensure reproducible outputs.
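
As a minimal sketch of what the fixed seed buys, the snippet below builds two NumPy generators from the same seed and checks that they produce identical draws. It uses `np.random.default_rng`, an alternative to the legacy global `np.random.seed` call in the cell below; the names `rng_a`, `rng_b`, and `same_draws` are illustrative only.

```python
import numpy as np

RANDOM_SEED = 42

# Two independent generators built from the same seed
rng_a = np.random.default_rng(RANDOM_SEED)
rng_b = np.random.default_rng(RANDOM_SEED)

# Draw five integers in [0, 1000) from each generator
draws_a = rng_a.integers(low=0, high=1000, size=5)
draws_b = rng_b.integers(low=0, high=1000, size=5)

# Same seed, same sequence
same_draws = bool((draws_a == draws_b).all())
print(same_draws)
```

Passing an explicit generator object around, rather than relying on global state, tends to make it easier to see which step consumed which random draws.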

In [None]:
"""
Import Required Libraries and Set Seed
--------------------------------------
- pandas: Data manipulation and CSV I/O
- numpy: Numerical operations and controlled randomness
- os/pathlib: File path handling
"""
import pandas as pd
import numpy as np
from pathlib import Path
import os

# =============================================================================
# FIXED SEED FOR REPRODUCIBILITY
# All random operations use this seed to ensure deterministic outputs.
# This is critical for explainability and debugging.
# =============================================================================
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

print("✅ Libraries imported successfully")
print(f"✅ Random seed set to: {RANDOM_SEED}")

## 2. Load District Master Data

Read `district_master.csv` to get `district_id` and `region_type` (urban/rural/peri-urban).
This drives the simulation logic - each region type has different update behaviors.
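
The loading step can be tried without the real file. The sketch below assumes the master file has at least the two columns named above and uses made-up district IDs (`D001` to `D003`) in an in-memory CSV; the actual file may contain more columns.

```python
import io
import pandas as pd

# Hypothetical stand-in for district_master.csv (IDs are invented)
sample_csv = """# comment lines like this one are skipped
district_id,region_type
D001,urban
D002,rural
D003,peri-urban
"""

# comment='#' makes pandas ignore fully commented lines
sample_districts = pd.read_csv(io.StringIO(sample_csv), comment="#")

# Flag any region_type outside the three values the simulation expects
allowed_regions = {"urban", "rural", "peri-urban"}
unexpected = set(sample_districts["region_type"]) - allowed_regions
print(len(sample_districts), unexpected)
```

Checking `unexpected` right after loading catches typos in `region_type` before they surface later as a `KeyError` in the rate lookups.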

In [None]:
"""
Load District Master Data
-------------------------
Source of truth for district identifiers and region classifications.
The region_type determines base population and update behavior patterns.
"""

# Define paths relative to notebook location
PROJECT_ROOT = Path.cwd().parent if Path.cwd().name == 'notebooks' else Path.cwd()
DISTRICT_MASTER_PATH = PROJECT_ROOT / 'data' / 'contracts' / 'district_master.csv'
OUTPUT_PATH = PROJECT_ROOT / 'data' / 'raw' / 'aadhaar_events_monthly.csv'

# Ensure output directory exists
OUTPUT_PATH.parent.mkdir(parents=True, exist_ok=True)

# Load district master (skip comment lines starting with #)
districts_df = pd.read_csv(DISTRICT_MASTER_PATH, comment='#')

print(f"✅ Loaded {len(districts_df)} districts from district_master.csv")
print("\n📊 Region type distribution:")
print(districts_df['region_type'].value_counts())
print("\n📋 Sample districts:")
districts_df.head()

## 3. Define Constants and Configuration

Define all simulation parameters with clear documentation:
- Age group enum values
- Date range for monthly simulation
- Base population multipliers by region type
- Update rate ranges by region type
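
Because every later step depends on these parameters, a quick self-check of the configuration can be useful. The sketch below copies two of the dictionaries from the configuration cell and verifies that the age shares sum to 1 and that each rate range is a valid, ordered fraction; the variable names `shares_sum` and `ranges_ok` are illustrative.

```python
import math

# Values copied from the configuration cell
AGE_GROUP_DISTRIBUTION = {
    "0-17": 0.25, "18-25": 0.18, "26-35": 0.22, "36-60": 0.25, "60+": 0.10,
}
ADDRESS_UPDATE_RATE = {
    "urban": (0.025, 0.050),
    "peri-urban": (0.015, 0.035),
    "rural": (0.005, 0.015),
}

# Age shares should sum to 1 (compared with a float tolerance)
shares_sum = sum(AGE_GROUP_DISTRIBUTION.values())
shares_ok = math.isclose(shares_sum, 1.0)

# Each (min, max) pair should be an ordered fraction between 0 and 1
ranges_ok = all(0 <= lo < hi <= 1 for lo, hi in ADDRESS_UPDATE_RATE.values())
print(shares_ok, ranges_ok)
```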

In [None]:
"""
Constants and Configuration
---------------------------
All magic numbers are defined here with explanations.
These values are synthetic and designed to create realistic patterns.
"""

# =============================================================================
# AGE GROUP DEFINITIONS
# These represent typical demographic segments for analysis.
# Youth groups (18-25, 26-35) show higher mobility patterns.
# =============================================================================
AGE_GROUPS = ['0-17', '18-25', '26-35', '36-60', '60+']

# Age group population distribution (sums to 1.0)
# Reflects a young-skewing synthetic population
AGE_GROUP_DISTRIBUTION = {
    '0-17': 0.25,   # Children and minors
    '18-25': 0.18,  # Young adults (high mobility)
    '26-35': 0.22,  # Working age (high mobility)
    '36-60': 0.25,  # Middle age (moderate stability)
    '60+': 0.10     # Seniors (high stability)
}

# =============================================================================
# TIME RANGE CONFIGURATION
# 12 months of data for trend analysis
# =============================================================================
MONTHS = pd.date_range(start='2024-01', end='2024-12', freq='MS').strftime('%Y-%m').tolist()

# =============================================================================
# BASE POPULATION BY REGION TYPE (total_aadhaar)
# These are synthetic population counts per district.
# Urban districts have higher populations due to density.
# =============================================================================
BASE_POPULATION = {
    'urban': (150000, 250000),       # High density: 150k-250k
    'peri-urban': (80000, 150000),   # Medium density: 80k-150k
    'rural': (30000, 80000)          # Low density: 30k-80k
}

# =============================================================================
# ADDRESS UPDATE RATES BY REGION TYPE (% of total_aadhaar)
# Urban: High churn due to job changes, rentals
# Rural: Low churn, stable communities
# Peri-urban: Growing areas with increasing updates
# =============================================================================
ADDRESS_UPDATE_RATE = {
    'urban': (0.025, 0.050),       # 2.5-5% monthly address updates
    'peri-urban': (0.015, 0.035),  # 1.5-3.5% (with growth trend)
    'rural': (0.005, 0.015)        # 0.5-1.5% (stable)
}

# =============================================================================
# MOBILE UPDATE RATIO (% of address_updates)
# Models digital access gap: rural areas lag behind
# =============================================================================
MOBILE_UPDATE_RATIO = {
    'urban': (0.70, 0.90),      # 70-90% also update mobile (digital savvy)
    'peri-urban': (0.50, 0.70), # 50-70% (developing digital access)
    'rural': (0.30, 0.50)       # 30-50% (digital exclusion risk)
}

# =============================================================================
# AGE-BASED MOBILITY MULTIPLIERS
# Youth (18-35) are more mobile due to education, jobs
# Children and seniors are more stable
# =============================================================================
AGE_MOBILITY_MULTIPLIER = {
    '0-17': 0.3,    # Children: Low independent mobility
    '18-25': 2.0,   # Young adults: Highest mobility (education, first jobs)
    '26-35': 1.8,   # Working professionals: High mobility (career moves)
    '36-60': 1.0,   # Middle age: Baseline (some stability)
    '60+': 0.4      # Seniors: Low mobility (settled)
}

# =============================================================================
# PERI-URBAN GROWTH RATE
# Monthly compound growth in address updates for emerging areas
# =============================================================================
PERI_URBAN_MONTHLY_GROWTH = 0.03  # 3% month-on-month increase

print("✅ Configuration loaded")
print(f"📅 Simulation period: {MONTHS[0]} to {MONTHS[-1]} ({len(MONTHS)} months)")
print(f"👥 Age groups: {AGE_GROUPS}")

## 4. Create Base Population by District and Age Group

Generate `total_aadhaar` values using region_type rules:
- **Urban:** Higher populations (150k-250k)
- **Rural:** Lower populations (30k-80k)  
- **Peri-urban:** In between (80k-150k)

Distribute across age groups with realistic proportions.
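
One way to place a district deterministically inside its population range is to derive a fraction from a stable checksum of its ID. The sketch below uses `zlib.crc32` for that purpose; the helper names `stable_fraction` and `district_total` and the ID `D001` are hypothetical. Python's built-in `hash()` is salted per process for strings, so it would not give the same value on every run.

```python
import zlib

# Population ranges copied from the configuration cell
BASE_POPULATION = {
    "urban": (150000, 250000),
    "peri-urban": (80000, 150000),
    "rural": (30000, 80000),
}

def stable_fraction(key: str) -> float:
    """Map a string to a repeatable value in [0.0, 0.999]."""
    return zlib.crc32(key.encode("utf-8")) % 1000 / 1000

def district_total(district_id: str, region_type: str) -> int:
    """Interpolate within the region's population range."""
    pop_min, pop_max = BASE_POPULATION[region_type]
    return int(pop_min + (pop_max - pop_min) * stable_fraction(district_id))

total = district_total("D001", "urban")
print(total)
```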

In [None]:
"""
Create Base Population by District and Age Group
------------------------------------------------
Generate total_aadhaar values based on region type.
Each district gets a stable base population that doesn't change month-to-month.
"""

def generate_base_population(district_id: str, region_type: str) -> dict:
    """
    Generate base population for a district across all age groups.
    
    Logic:
    - Use region_type to determine population range
    - Use district_id hash for deterministic variation within range
    - Distribute across age groups using predefined proportions
    
    Returns: dict with age_group -> total_aadhaar mapping
    """
    # Get population range for this region type
    pop_min, pop_max = BASE_POPULATION[region_type]
    
    # Use a stable CRC32 checksum of district_id for deterministic variation.
    # Built-in hash() is salted per process for strings, so it would give a
    # different population on every run.
    import zlib
    hash_value = zlib.crc32(f"{district_id}".encode()) % 1000 / 1000  # 0.0 to 0.999
    
    # Calculate total district population
    total_pop = int(pop_min + (pop_max - pop_min) * hash_value)
    
    # Distribute across age groups
    population_by_age = {}
    for age_group, proportion in AGE_GROUP_DISTRIBUTION.items():
        # Add small deterministic variation per age group
        age_hash = zlib.crc32(f"{district_id}_{age_group}".encode()) % 100 / 1000  # ±5% variation
        adjusted_proportion = proportion * (1 + age_hash - 0.05)
        population_by_age[age_group] = int(total_pop * adjusted_proportion)
    
    return population_by_age

# Generate base populations for all districts
district_populations = {}
for _, row in districts_df.iterrows():
    district_id = row['district_id']
    region_type = row['region_type']
    district_populations[district_id] = {
        'region_type': region_type,
        'populations': generate_base_population(district_id, region_type)
    }

# Display sample
print("✅ Base populations generated for all districts")
print("\n📊 Sample population breakdown (first 3 districts):")
for district_id in list(district_populations.keys())[:3]:
    info = district_populations[district_id]
    total = sum(info['populations'].values())
    print(f"\n  {district_id} ({info['region_type']}): {total:,} total")
    for age, pop in info['populations'].items():
        print(f"    {age}: {pop:,}")

## 5. Generate Monthly Time Series

Create a cartesian product of `district_id × month × age_group` to form the base dataframe structure.
Each combination becomes one row in the final dataset.
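
The same grid can also be built directly in pandas. The sketch below uses `pd.MultiIndex.from_product` on small made-up dimension lists, as an alternative to `itertools.product`; the IDs and the shortened month and age lists are illustrative.

```python
import pandas as pd

# Hypothetical, shortened dimensions
district_ids = ["D001", "D002"]
months = ["2024-01", "2024-02", "2024-03"]
age_groups = ["0-17", "18-25"]

# Cartesian product of the three dimensions, one row per combination
grid = pd.MultiIndex.from_product(
    [district_ids, months, age_groups],
    names=["district_id", "month", "age_group"],
).to_frame(index=False)

# 2 districts x 3 months x 2 age groups
print(len(grid))
```

The row count should always equal the product of the dimension sizes, which makes a cheap sanity check on the full dataset as well.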

In [None]:
"""
Generate Monthly Time Series
----------------------------
Create the base dataframe with all combinations of:
- district_id (20 districts)
- month (12 months)
- age_group (5 groups)

Total rows = 20 × 12 × 5 = 1,200 records
"""
from itertools import product

# Create cartesian product of all dimensions
combinations = list(product(
    districts_df['district_id'].tolist(),
    MONTHS,
    AGE_GROUPS
))

# Build base dataframe
df = pd.DataFrame(combinations, columns=['district_id', 'month', 'age_group'])

# Add region_type for calculations (will be used internally, not in final output)
district_to_region = districts_df.set_index('district_id')['region_type'].to_dict()
df['_region_type'] = df['district_id'].map(district_to_region)

# Add total_aadhaar from pre-generated populations
df['total_aadhaar'] = df.apply(
    lambda row: district_populations[row['district_id']]['populations'][row['age_group']],
    axis=1
)

# Add month index for trend calculations (0-11)
df['_month_idx'] = df['month'].apply(lambda m: MONTHS.index(m))

print(f"✅ Base dataframe created: {len(df):,} rows")
print(f"   Dimensions: {len(districts_df)} districts × {len(MONTHS)} months × {len(AGE_GROUPS)} age groups")
print("\n📋 Sample rows:")
df.head(10)

## 6. Apply Region-Type Based Update Rules

Calculate `address_updates` based on `region_type`:
- **Urban:** 2.5-5% of total_aadhaar (high churn)
- **Rural:** 0.5-1.5% of total_aadhaar (stable communities)
- **Peri-urban:** 1.5-3.5% with monthly growth trend (emerging areas)

Ensures `address_updates < total_aadhaar` constraint.
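
To make the rate ranges concrete, the sketch below interpolates a rate inside a region's range and applies it to a made-up age-group population of 40,000. The helper `interpolate_rate` is hypothetical; `fraction` stands in for the per-district, per-month variation value between 0 and 1.

```python
# Rate ranges copied from the configuration cell
ADDRESS_UPDATE_RATE = {
    "urban": (0.025, 0.050),
    "peri-urban": (0.015, 0.035),
    "rural": (0.005, 0.015),
}

def interpolate_rate(region_type: str, fraction: float) -> float:
    """Pick a rate inside the region's range; fraction is in [0, 1)."""
    rate_min, rate_max = ADDRESS_UPDATE_RATE[region_type]
    return rate_min + (rate_max - rate_min) * fraction

population = 40_000  # hypothetical age-group population

# Lowest and highest base update counts an urban district can get
low = int(population * interpolate_rate("urban", 0.0))
high = int(population * interpolate_rate("urban", 0.999))
print(low, high)
```

Even at the top of the urban range the base count stays a small share of the population, which is what keeps `address_updates` below `total_aadhaar` after the later multipliers are applied.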

In [None]:
"""
Apply Region-Type Based Update Rules
------------------------------------
Calculate base address_updates using region-specific rates.
This creates the foundation for migration signal detection.

Logic:
- Urban: High base rate (people move frequently for jobs, rentals)
- Rural: Low stable rate (long-term residents, less mobility)
- Peri-urban: Moderate rate with growth trend (developing areas attracting new residents)
"""

def calculate_base_address_updates(row) -> int:
    """
    Calculate address updates based on region type and population.
    
    Uses deterministic variation based on district_id and month
    to avoid pure randomness while maintaining realistic variation.
    """
    region_type = row['_region_type']
    total_pop = row['total_aadhaar']
    district_id = row['district_id']
    month = row['month']
    
    # Get rate range for this region type
    rate_min, rate_max = ADDRESS_UPDATE_RATE[region_type]
    
    # Deterministic variation from a CRC32 checksum of district + month.
    # Built-in hash() is salted per process for strings, so it would not be
    # reproducible across runs.
    import zlib
    variation_hash = zlib.crc32(f"{district_id}_{month}".encode()) % 1000 / 1000
    
    # Calculate rate within the defined range
    rate = rate_min + (rate_max - rate_min) * variation_hash
    
    # Calculate base updates
    base_updates = int(total_pop * rate)
    
    return base_updates

# Apply base address update calculation
df['_base_address_updates'] = df.apply(calculate_base_address_updates, axis=1)

print("✅ Base address updates calculated by region type")
print("\n📊 Average base address updates by region type:")
region_stats = df.groupby('_region_type')['_base_address_updates'].mean()
for region, avg in region_stats.items():
    print(f"   {region}: {avg:,.0f} updates/month")

## 7. Apply Age-Group Based Modifiers

Apply multipliers to `address_updates` based on age demographics:
- **18-25, 26-35:** 1.8x-2.0x modifier (youth mobility - education, jobs)
- **0-17, 60+:** 0.3x-0.4x modifier (dependents, stable demographics)
- **36-60:** 1.0x baseline (moderate stability)
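
The multipliers can be applied to a whole column at once. The sketch below is a vectorized alternative to a row-wise `apply`, using `Series.map` on a small made-up frame in which every age group starts from 1,000 base updates.

```python
import pandas as pd

# Multipliers copied from the configuration cell
AGE_MOBILITY_MULTIPLIER = {
    "0-17": 0.3, "18-25": 2.0, "26-35": 1.8, "36-60": 1.0, "60+": 0.4,
}

# Hypothetical frame: one row per age group, same base count
sample = pd.DataFrame({
    "age_group": list(AGE_MOBILITY_MULTIPLIER),
    "base_updates": [1000] * len(AGE_MOBILITY_MULTIPLIER),
})

# Look up each row's multiplier, scale, and truncate to whole updates
multiplier = sample["age_group"].map(AGE_MOBILITY_MULTIPLIER)
sample["adjusted_updates"] = (sample["base_updates"] * multiplier).astype(int)
print(sample)
```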

In [None]:
"""
Apply Age-Group Based Modifiers
-------------------------------
Youth demographics (18-35) show higher migration patterns.
This is a key signal for workforce mobility analysis.

Rationale:
- 18-25: Students, first-time job seekers → Highest mobility
- 26-35: Career advancement, family formation → High mobility
- 36-60: Established careers, children in school → Baseline
- 0-17: Dependent on parents → Low independent mobility
- 60+: Retired, health considerations → Low mobility
"""

def apply_age_modifier(row) -> int:
    """
    Apply age-based mobility multiplier to address updates.
    
    This creates the youth ratio signal used for migration analysis.
    """
    base_updates = row['_base_address_updates']
    age_group = row['age_group']
    
    # Get multiplier for this age group
    multiplier = AGE_MOBILITY_MULTIPLIER[age_group]
    
    # Apply multiplier
    adjusted_updates = int(base_updates * multiplier)
    
    return adjusted_updates

# Apply age-based modifiers
df['_age_adjusted_updates'] = df.apply(apply_age_modifier, axis=1)

print("✅ Age-based mobility modifiers applied")
print("\n📊 Average address updates by age group (after modifier):")
age_stats = df.groupby('age_group')['_age_adjusted_updates'].mean()
for age, avg in age_stats.items():
    print(f"   {age}: {avg:,.0f} updates/month")

## 8. Calculate Mobile Updates with Lag Effect

Set `mobile_updates ≤ address_updates` with region-based ratios:
- **Urban:** 70-90% of address_updates (high digital adoption)
- **Peri-urban:** 50-70% (developing digital access)
- **Rural:** 30-50% (digital exclusion risk)

This models the **digital access gap** - a key signal for exclusion analysis.
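
As an illustration of how the gap could be read back out of the data, the sketch below aggregates made-up counts by region and computes the mobile-to-address ratio. The numbers are invented for the example and are not outputs of this notebook.

```python
import pandas as pd

# Hypothetical aggregated counts
sample = pd.DataFrame({
    "region_type": ["urban", "urban", "rural", "rural"],
    "address_updates": [1000, 800, 200, 300],
    "mobile_updates": [850, 600, 80, 120],
})

# Sum counts per region, then take the ratio of the sums
totals = sample.groupby("region_type")[["address_updates", "mobile_updates"]].sum()
digital_ratio = totals["mobile_updates"] / totals["address_updates"]
print(digital_ratio)
```

Taking the ratio of sums, rather than averaging per-row ratios, weights each row by its volume, so small cells do not dominate the regional figure.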

In [None]:
"""
Calculate Mobile Updates with Lag Effect
----------------------------------------
Mobile updates lag behind address updates in less developed areas.
This creates the digital_exclusion signal:
  low mobile_updates / high address_updates = digital exclusion risk

Logic:
- Urban residents update mobile along with address (digital-first)
- Rural residents often lack smartphone access or digital literacy
- Peri-urban areas are transitioning
"""

def calculate_mobile_updates(row) -> int:
    """
    Calculate mobile updates as a fraction of address updates.
    
    The ratio depends on region type, modeling digital access gaps.
    """
    address_updates = row['_age_adjusted_updates']
    region_type = row['_region_type']
    district_id = row['district_id']
    month = row['month']
    
    # Get mobile update ratio range for this region
    ratio_min, ratio_max = MOBILE_UPDATE_RATIO[region_type]
    
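    # Deterministic variation from a CRC32 checksum (built-in hash() is
    # salted per process for strings, so it is not reproducible)
    import zlib
    variation_hash = zlib.crc32(f"{district_id}_{month}_mobile".encode()) % 1000 / 1000
    ratio = ratio_min + (ratio_max - ratio_min) * variation_hash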
    
    # Calculate mobile updates (always <= address_updates)
    mobile_updates = int(address_updates * ratio)
    
    return mobile_updates

# Calculate mobile updates
df['mobile_updates'] = df.apply(calculate_mobile_updates, axis=1)

print("✅ Mobile updates calculated with digital access lag")
print("\n📊 Mobile-to-Address ratio by region type:")
for region in ['urban', 'peri-urban', 'rural']:
    region_data = df[df['_region_type'] == region]
    total_address = region_data['_age_adjusted_updates'].sum()
    total_mobile = region_data['mobile_updates'].sum()
    ratio = total_mobile / total_address * 100 if total_address > 0 else 0
    print(f"   {region}: {ratio:.1f}% mobile update rate")

## 9. Apply Month-on-Month Smoothing for Peri-Urban Growth

Ensure smooth trends by applying growth adjustments:
- **Peri-urban districts:** 3% monthly compound growth in address_updates
- **Other regions:** Stable baseline with minor seasonal variation

This creates the **sustained growth signal** for peri-urbanization detection.
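
The two trend shapes can be tabulated on their own. The sketch below computes the compound growth factor for each of the 12 months and a sinusoidal seasonal factor with a 10% amplitude that peaks at month index 5; the amplitude and phase are taken from the code cell below, and the list names are illustrative.

```python
import math

PERI_URBAN_MONTHLY_GROWTH = 0.03  # 3% month-on-month

# Compound growth: month 0 is the baseline
growth_factors = [(1 + PERI_URBAN_MONTHLY_GROWTH) ** m for m in range(12)]

# Seasonal factor for urban districts: +/-10% sine wave over 12 months
seasonal_factors = [1 + 0.1 * math.sin((m - 2) * math.pi / 6) for m in range(12)]

# Month index where the seasonal factor is largest
peak_month_idx = max(range(12), key=seasonal_factors.__getitem__)

print(round(growth_factors[-1], 3), peak_month_idx)
```

After 11 compounding steps the peri-urban factor is roughly 1.38, so the last month sits about 38% above the first.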

In [None]:
"""
Apply Month-on-Month Smoothing for Peri-Urban Growth
----------------------------------------------------
Peri-urban areas show sustained growth as people migrate from
rural areas and urban overflow settles in these transitioning zones.

This creates explainable month-on-month trends:
- Peri-urban: Compound growth (3% monthly)
- Urban: Slight seasonal variation (stable high)
- Rural: Very stable (minimal change)
"""

def apply_growth_trend(row) -> int:
    """
    Apply month-over-month growth for peri-urban districts.
    
    Growth formula: base_value * (1 + growth_rate)^month_index
    This creates a smooth upward trend for emerging areas.
    """
    base_updates = row['_age_adjusted_updates']
    region_type = row['_region_type']
    month_idx = row['_month_idx']
    
    if region_type == 'peri-urban':
        # Compound growth for peri-urban districts
        # Month 0 = baseline, Month 11 = ~38% higher (1.03 ** 11 ≈ 1.384)
        growth_factor = (1 + PERI_URBAN_MONTHLY_GROWTH) ** month_idx
        adjusted_updates = int(base_updates * growth_factor)
    elif region_type == 'urban':
        # Urban: slight seasonal variation (summer peak)
        # Peak in months 4-6 (May-July)
        seasonal_factor = 1 + 0.1 * np.sin((month_idx - 2) * np.pi / 6)
        adjusted_updates = int(base_updates * seasonal_factor)
    else:
        # Rural: stable with minimal variation
        adjusted_updates = base_updates
    
    return adjusted_updates

# Apply growth trends
df['address_updates'] = df.apply(apply_growth_trend, axis=1)

# Recalculate mobile_updates to maintain ratio after growth adjustment
def recalc_mobile_updates(row) -> int:
    """Recalculate mobile updates maintaining the digital access ratio."""
    address_updates = row['address_updates']
    region_type = row['_region_type']
    district_id = row['district_id']
    month = row['month']
    
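    ratio_min, ratio_max = MOBILE_UPDATE_RATIO[region_type]
    # CRC32 checksum instead of built-in hash(), which is salted per process
    import zlib
    variation_hash = zlib.crc32(f"{district_id}_{month}_mobile".encode()) % 1000 / 1000
    ratio = ratio_min + (ratio_max - ratio_min) * variation_hash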
    
    return int(address_updates * ratio)

df['mobile_updates'] = df.apply(recalc_mobile_updates, axis=1)

print("✅ Month-on-month growth trends applied")
print("\n📊 Peri-urban growth trend (average address_updates by month):")
peri_urban_trend = df[df['_region_type'] == 'peri-urban'].groupby('month')['address_updates'].mean()
for i, (month, avg) in enumerate(peri_urban_trend.items()):
    growth = ((avg / peri_urban_trend.iloc[0]) - 1) * 100 if i > 0 else 0
    print(f"   {month}: {avg:,.0f} ({growth:+.1f}% from start)")

## 10. Validate Data Integrity

Run comprehensive sanity checks to ensure data quality:
- ✅ `total_aadhaar > address_updates >= mobile_updates >= 0`
- ✅ No null values
- ✅ Correct data types
- ✅ Valid `age_group` enum values
- ✅ Valid `month` format (YYYY-MM)
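
The ordering constraint in the first check can be expressed as a single boolean mask. The sketch below runs it on a small made-up frame in which the third row deliberately violates the rule; the column values are invented.

```python
import pandas as pd

# Hypothetical rows; the last one has more updates than population
sample = pd.DataFrame({
    "total_aadhaar": [50000, 42000, 30000],
    "address_updates": [1500, 900, 31000],
    "mobile_updates": [1200, 400, 100],
})

# total_aadhaar > address_updates >= mobile_updates >= 0, row by row
row_ok = (
    (sample["total_aadhaar"] > sample["address_updates"])
    & (sample["address_updates"] >= sample["mobile_updates"])
    & (sample["mobile_updates"] >= 0)
)

# Rows that break the invariant
violations = sample[~row_ok]
print(violations)
```

Keeping the offending rows, not just a count, makes it easier to trace a failure back to a specific district, month, and age group.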

In [None]:
"""
Validate Data Integrity
-----------------------
Comprehensive checks to ensure the synthetic data meets all constraints.
Any validation failure should halt the pipeline.
"""

def validate_data(df: pd.DataFrame) -> bool:
    """
    Run all validation checks on the synthetic dataset.
    Returns True if all checks pass; otherwise prints each failure and returns False.
    """
    errors = []
    
    # Check 1: No null values
    null_counts = df[['district_id', 'month', 'age_group', 'address_updates', 
                      'mobile_updates', 'total_aadhaar']].isnull().sum()
    if null_counts.sum() > 0:
        errors.append(f"❌ Null values found: {null_counts[null_counts > 0].to_dict()}")
    else:
        print("✅ Check 1: No null values")
    
    # Check 2: total_aadhaar > address_updates
    invalid_total = df[df['total_aadhaar'] <= df['address_updates']]
    if len(invalid_total) > 0:
        errors.append(f"❌ {len(invalid_total)} rows have total_aadhaar <= address_updates")
    else:
        print("✅ Check 2: total_aadhaar > address_updates")
    
    # Check 3: address_updates >= mobile_updates
    invalid_mobile = df[df['address_updates'] < df['mobile_updates']]
    if len(invalid_mobile) > 0:
        errors.append(f"❌ {len(invalid_mobile)} rows have address_updates < mobile_updates")
    else:
        print("✅ Check 3: address_updates >= mobile_updates")
    
    # Check 4: mobile_updates >= 0
    negative_mobile = df[df['mobile_updates'] < 0]
    if len(negative_mobile) > 0:
        errors.append(f"❌ {len(negative_mobile)} rows have negative mobile_updates")
    else:
        print("✅ Check 4: mobile_updates >= 0")
    
    # Check 5: address_updates >= 0
    negative_address = df[df['address_updates'] < 0]
    if len(negative_address) > 0:
        errors.append(f"❌ {len(negative_address)} rows have negative address_updates")
    else:
        print("✅ Check 5: address_updates >= 0")
    
    # Check 6: Valid age_group enum values
    valid_age_groups = set(AGE_GROUPS)
    actual_age_groups = set(df['age_group'].unique())
    if not actual_age_groups.issubset(valid_age_groups):
        invalid = actual_age_groups - valid_age_groups
        errors.append(f"❌ Invalid age_group values: {invalid}")
    else:
        print("✅ Check 6: Valid age_group enum values")
    
    # Check 7: Valid month format (YYYY-MM)
    import re
    month_pattern = re.compile(r'^\d{4}-(0[1-9]|1[0-2])$')
    invalid_months = df[~df['month'].apply(lambda m: bool(month_pattern.match(m)))]
    if len(invalid_months) > 0:
        errors.append(f"❌ {len(invalid_months)} rows have invalid month format")
    else:
        print("✅ Check 7: Valid month format (YYYY-MM)")
    
    # Check 8: Valid district_id (exists in master)
    valid_districts = set(districts_df['district_id'].tolist())
    actual_districts = set(df['district_id'].unique())
    if not actual_districts.issubset(valid_districts):
        invalid = actual_districts - valid_districts
        errors.append(f"❌ Invalid district_id values: {invalid}")
    else:
        print("✅ Check 8: Valid district_id values")
    
    # Check 9: Correct data types
    expected_types = {
        'district_id': 'object',
        'month': 'object',
        'age_group': 'object',
        'address_updates': 'int',
        'mobile_updates': 'int',
        'total_aadhaar': 'int'
    }
    type_errors = []
    for col, expected in expected_types.items():
        actual = str(df[col].dtype)
        if expected not in actual:
            type_errors.append(f"{col}: expected {expected}, got {actual}")
    if type_errors:
        errors.append(f"❌ Type mismatches: {type_errors}")
    else:
        print("✅ Check 9: Correct data types")
    
    # Report results
    if errors:
        print("\n🚨 VALIDATION FAILED:")
        for error in errors:
            print(f"   {error}")
        return False
    else:
        print("\n✅ ALL VALIDATION CHECKS PASSED")
        return True

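# Run validation and halt the pipeline if any check failed
validation_passed = validate_data(df)
if not validation_passed:
    raise ValueError("Synthetic dataset failed validation; see the messages above.")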

## 11. Export to CSV

Save the final dataframe to `data/raw/aadhaar_events_monthly.csv` with exact schema:
- `district_id`
- `month`
- `age_group`
- `address_updates`
- `mobile_updates`
- `total_aadhaar`

Internal columns (prefixed with `_`) are dropped before export.
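
A round trip through CSV is a simple way to confirm that the exported schema is what downstream steps expect. The sketch below writes a two-row made-up frame to an in-memory buffer instead of the real output path, then reads it back; the sample values and the `_region_type` column stand in for the notebook's dataframe.

```python
import io
import pandas as pd

FINAL_COLUMNS = [
    "district_id", "month", "age_group",
    "address_updates", "mobile_updates", "total_aadhaar",
]

# Hypothetical rows, including one internal helper column
sample = pd.DataFrame({
    "district_id": ["D001", "D001"],
    "month": ["2024-01", "2024-01"],
    "age_group": ["0-17", "18-25"],
    "address_updates": [300, 2000],
    "mobile_updates": [240, 1600],
    "total_aadhaar": [40000, 30000],
    "_region_type": ["urban", "urban"],
})

# Select only the public schema, write to a buffer, and read it back
buffer = io.StringIO()
sample[FINAL_COLUMNS].to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer, dtype={"district_id": str})
print(list(reloaded.columns))
```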

In [None]:
"""
Export to CSV
-------------
Final step: Save the synthetic dataset with exact schema.
Only the required columns are exported (no internal columns).
"""

# Define final schema (exact column order as specified)
FINAL_COLUMNS = [
    'district_id',
    'month',
    'age_group',
    'address_updates',
    'mobile_updates',
    'total_aadhaar'
]

# Create export dataframe with only required columns
export_df = df[FINAL_COLUMNS].copy()

# Sort for consistent output
export_df = export_df.sort_values(['district_id', 'month', 'age_group']).reset_index(drop=True)

# Export to CSV
export_df.to_csv(OUTPUT_PATH, index=False)

print(f"✅ Dataset exported to: {OUTPUT_PATH}")
print("\n📊 Export Summary:")
print(f"   Total rows: {len(export_df):,}")
print(f"   Districts: {export_df['district_id'].nunique()}")
print(f"   Months: {export_df['month'].nunique()}")
print(f"   Age groups: {export_df['age_group'].nunique()}")
print("\n📋 Schema:")
for col in FINAL_COLUMNS:
    dtype = export_df[col].dtype
    sample = export_df[col].iloc[0]
    print(f"   {col}: {dtype} (e.g., {sample})")

print("\n📄 First 10 rows:")
export_df.head(10)

## Summary: Synthetic Data Generation Logic

### Key Signals Created for ML Pipeline:

| Signal | Source | Interpretation |
|--------|--------|----------------|
| **Address Update Velocity** | `address_updates / total_aadhaar` | Migration intensity |
| **Youth Ratio** | Higher updates in 18-35 age groups | Workforce mobility |
| **Digital Gap** | `mobile_updates / address_updates` | Digital exclusion risk |
| **Sustained Growth** | Peri-urban month-on-month increase | Urbanization trend |

### Deterministic Factors:
- ✅ Fixed random seed (42)
- ✅ Hash-based variation using district_id + month
- ✅ No external data dependencies
- ✅ Reproducible on every run

### Constraints Enforced:
- ❌ NO real Aadhaar data
- ❌ NO individual records
- ✅ District × Month × Age Group granularity
- ✅ All values explainable