# 00 - Data Loading Utilities

## 📋 Purpose
This notebook provides **shared utilities** for all other notebooks:
- Data loading functions
- Data cleaning and standardization
- Common configurations

## 🎯 Outputs
**3 Clean DataFrames**:
- `enrolment_df` (3.6M rows)
- `demographic_df` (800K rows)
- `biometric_df` (500K rows)

## 📌 Usage
All other notebooks (01-06) import these utilities to avoid code duplication.

---
## Section 1: Setup & Configuration

In [None]:
# Standard Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import glob
import warnings
import os
import sys

# Configuration
# Handle encoding (works in scripts, not needed in Jupyter)
try:
    sys.stdout.reconfigure(encoding='utf-8')
except AttributeError:
    pass  # Jupyter notebooks handle encoding automatically

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
sns.set(style="whitegrid", palette="viridis")
plt.rcParams['figure.figsize'] = (14, 7)
warnings.filterwarnings('ignore')

print("✅ Libraries loaded and configured successfully")

In [None]:
# Create output directories
os.makedirs('../output', exist_ok=True)
os.makedirs('../output/enrollment', exist_ok=True)
os.makedirs('../output/demographic', exist_ok=True)
os.makedirs('../output/biometric', exist_ok=True)

print("✅ Output directories created/verified")

---
## Section 2: Utility Functions

In [None]:
def load_and_combine(pattern):
    """
    Load and combine multiple CSV files matching a pattern.
    
    Args:
        pattern (str): Glob pattern for files (e.g., 'dataset/api_data_aadhar_enrolment_*.csv')
    
    Returns:
        pd.DataFrame: Combined DataFrame from all matched files
    
    Example:
        >>> df = load_and_combine('dataset/api_data_aadhar_enrolment_*.csv')
        >>> print(f"Loaded {len(df):,} rows")
    """
    files = sorted(glob.glob(pattern))  # sorted: glob does not guarantee an order
    
    if not files:
        print(f"⚠️  WARNING: No files found for pattern: {pattern}")
        return pd.DataFrame()
    
    print(f"📁 Loading {len(files)} files for pattern: {pattern}")
    df_list = [pd.read_csv(f) for f in files]
    combined = pd.concat(df_list, ignore_index=True)
    
    print(f"✅ Loaded {len(combined):,} total rows")
    return combined

print("✅ load_and_combine() function defined")
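
As a quick illustration of the glob-and-concatenate approach used by `load_and_combine()`, the sketch below writes two small CSV part-files to a temporary directory and combines them. The file names, the columns, and the helper name `combine_csvs` are made up for this example. The matched paths are sorted because `glob.glob` returns them in arbitrary order, which keeps the row order reproducible between runs.

```python
import glob
import os
import tempfile

import pandas as pd


def combine_csvs(pattern):
    """Hypothetical stand-in for load_and_combine(), without the progress messages."""
    files = sorted(glob.glob(pattern))  # sorted for a reproducible row order
    if not files:
        return pd.DataFrame()  # same fallback as above: an empty frame
    return pd.concat((pd.read_csv(f) for f in files), ignore_index=True)


# Toy data: two part-files sharing the same (made-up) columns
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"state": ["Odisha"], "count": [10]}).to_csv(
        os.path.join(tmp, "demo_part_1.csv"), index=False)
    pd.DataFrame({"state": ["Kerala", "Goa"], "count": [5, 7]}).to_csv(
        os.path.join(tmp, "demo_part_2.csv"), index=False)

    combined = combine_csvs(os.path.join(tmp, "demo_part_*.csv"))
    empty = combine_csvs(os.path.join(tmp, "no_such_file_*.csv"))

print(combined)
print(len(combined), empty.empty)
```

Because of `ignore_index=True`, the combined frame gets a fresh 0..n-1 index instead of repeating each file's own row numbers.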

In [None]:
def clean_data(df):
    """
    Clean and standardize the data:
    - Fix date formats
    - Normalize state/district names (14 state and 4 district mappings)
    - Validate pincodes (Indian range: 110000-999999)
    - Handle missing values
    
    WHY: Raw government data has inconsistencies (e.g., 'West Bengal' vs 'West Bangal').
         Normalizing them lets us merge data correctly without losing records.
    
    Args:
        df (pd.DataFrame): Raw DataFrame
    
    Returns:
        pd.DataFrame: Cleaned DataFrame
    """
    if df.empty:
        return df
    
    print(f"🧹 Starting data cleaning for {len(df):,} rows...")
    
    # 1. Date Standardization
    if 'date' in df.columns:
        df['date'] = pd.to_datetime(df['date'], dayfirst=True, errors='coerce')
        print("   ✅ Dates standardized")
    
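    # 2. String Cleaning (State/District)
    for col in ['state', 'district']:
        if col in df.columns:
            cleaned = df[col].astype(str).str.strip().str.title()
            # title() capitalizes "and"; lowercase it so names match the state_map targets
            cleaned = cleaned.str.replace(' And ', ' and ', regex=False)
            # Keep missing values as NaN instead of the literal string 'Nan'
            df[col] = cleaned.where(df[col].notna())
    print("   ✅ State/District names cleaned")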
    
    # 3. State Name Normalization (CRITICAL for merging)
    state_map = {
        'Andaman & Nicobar Islands': 'Andaman and Nicobar Islands',
        'Andhra Pradsh': 'Andhra Pradesh',
        'Chhatisgarh': 'Chhattisgarh',
        'Dadra & Nagar Haveli': 'Dadra and Nagar Haveli and Daman and Diu',
        'Daman & Diu': 'Dadra and Nagar Haveli and Daman and Diu',
        'Jammu & Kashmir': 'Jammu and Kashmir',
        'Orissa': 'Odisha',
        'Pondicherry': 'Puducherry',
        'Tamilnadu': 'Tamil Nadu',
        'Telengana': 'Telangana',
        'Uttaranchal': 'Uttarakhand',
        'West Bangal': 'West Bengal',
        'Westbengal': 'West Bengal',
        'West Bengli': 'West Bengal'
    }
    if 'state' in df.columns:
        df['state'] = df['state'].replace(state_map)
        print(f"   ✅ Applied {len(state_map)} state name mappings")
    
    # 4. District Normalization
    dist_map = {
        'Bangalore': 'Bengaluru',
        'Bangalore Urban': 'Bengaluru Urban',
        'Calcutta': 'Kolkata',
        'Gurgaon': 'Gurugram'
    }
    if 'district' in df.columns:
        df['district'] = df['district'].replace(dist_map)
    
    # 5. Pincode Validation (Indian pincodes: 110000-999999)
    if 'pincode' in df.columns:
        original_len = len(df)
        df['pincode'] = pd.to_numeric(df['pincode'], errors='coerce').fillna(0).astype(int)
        # .copy() so the later column assignments act on an independent frame, not a slice
        df = df[(df['pincode'] >= 110000) & (df['pincode'] <= 999999)].copy()
        removed = original_len - len(df)
        if removed > 0:
            print(f"   ⚠️  Removed {removed:,} rows with invalid pincodes")
    
    # 6. Null Handling (Numeric → 0)
    num_cols = df.select_dtypes(include=[np.number]).columns
    df[num_cols] = df[num_cols].fillna(0)
    
    print(f"✅ Cleaning complete. Final size: {len(df):,} rows")
    return df

print("✅ clean_data() function defined")
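
To make the individual cleaning steps easier to follow, here is a minimal sketch that applies the date, state-name, and pincode steps to a few toy rows outside of `clean_data()`. All values are made up, and the two-entry `state_map` is a shortened copy of the full mapping used in the function.

```python
import pandas as pd

# Toy rows; all values are made up for illustration
raw = pd.DataFrame({
    "date": ["01-03-2025", "31-12-2024", "not a date"],
    "state": ["  west bangal ", "ORISSA", "Tamil Nadu"],
    "pincode": ["700001", "12345", "abc"],
})

# Step 1: day-first parsing; unparseable values become NaT
dates = pd.to_datetime(raw["date"], dayfirst=True, errors="coerce")

# Steps 2-3: trim, title-case, then map known variants to the canonical name
state_map = {"West Bangal": "West Bengal", "Orissa": "Odisha"}
states = raw["state"].str.strip().str.title().replace(state_map)

# Step 5: non-numeric pincodes become 0 and fall outside the accepted range
pins = pd.to_numeric(raw["pincode"], errors="coerce").fillna(0).astype(int)
valid_pin = (pins >= 110000) & (pins <= 999999)

print(dates.isna().tolist())  # [False, False, True]
print(states.tolist())        # ['West Bengal', 'Odisha', 'Tamil Nadu']
print(valid_pin.tolist())     # [True, False, False]
```

The mapping keys have to be written in title case, since the lookup happens after `.str.title()` has been applied.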

---
## Section 3: Load All Three Domains

In [None]:
print("\n" + "="*70)
print("📊 LOADING AADHAAR DATA (3 DOMAINS)")
print("="*70 + "\n")

# Load Enrollment Data
print("1️⃣  ENROLLMENT DATA")
enrolment_df = clean_data(load_and_combine('../dataset/api_data_aadhar_enrolment_*.csv'))
print()

# Load Demographic Data
print("2️⃣  DEMOGRAPHIC DATA")
demographic_df = clean_data(load_and_combine('../dataset/api_data_aadhar_demographic_*.csv'))
print()

# Load Biometric Data
print("3️⃣  BIOMETRIC DATA")
biometric_df = clean_data(load_and_combine('../dataset/api_data_aadhar_biometric_*.csv'))
print()

print("="*70)
print("✅ ALL DATA LOADED SUCCESSFULLY")
print("="*70)

---
## Section 4: Data Summary

In [None]:
print("\n📋 DATASET SUMMARY\n")
print(f"{'Domain':<20} {'Rows':>15} {'Columns':>10}")
print("-" * 50)
print(f"{'Enrollment':<20} {len(enrolment_df):>15,} {len(enrolment_df.columns):>10}")
print(f"{'Demographic':<20} {len(demographic_df):>15,} {len(demographic_df.columns):>10}")
print(f"{'Biometric':<20} {len(biometric_df):>15,} {len(biometric_df.columns):>10}")
print("-" * 50)
print(f"{'TOTAL':<20} {len(enrolment_df) + len(demographic_df) + len(biometric_df):>15,}")
print()
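
If the summary is needed as data rather than printed text, the same counts could be collected into a small DataFrame. The sketch below is one possible way to do that; `summarize_frames` is a hypothetical helper name, and the toy frames stand in for the three cleaned DataFrames.

```python
import pandas as pd


def summarize_frames(frames):
    """Return one row per named DataFrame with its row and column counts."""
    rows = [
        {"Domain": name, "Rows": len(df), "Columns": df.shape[1]}
        for name, df in frames.items()
    ]
    return pd.DataFrame(rows).set_index("Domain")


# Toy frames standing in for the cleaned datasets
toy = {
    "Enrollment": pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}),
    "Demographic": pd.DataFrame({"a": [1, 2]}),
    "Biometric": pd.DataFrame({"a": [1]}),
}

summary = summarize_frames(toy)
total_rows = int(summary["Rows"].sum())

print(summary)
print("TOTAL", total_rows)
```

In this notebook it would be called with the real frames, for example `summarize_frames({'Enrollment': enrolment_df, 'Demographic': demographic_df, 'Biometric': biometric_df})`.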

In [None]:
# Display sample data
print("\n📊 ENROLLMENT DATA SAMPLE:")
display(enrolment_df.head(3))

print("\n📊 DEMOGRAPHIC DATA SAMPLE:")
display(demographic_df.head(3))

print("\n📊 BIOMETRIC DATA SAMPLE:")
display(biometric_df.head(3))

---
## ✅ Utilities Ready!

### **Exported Variables** (Use in other notebooks):
- `enrolment_df` → Enrollment data (cleaned)
- `demographic_df` → Demographic update data (cleaned)
- `biometric_df` → Biometric update data (cleaned)

### **Exported Functions**:
- `load_and_combine(pattern)` → Load multiple CSVs
- `clean_data(df)` → Standard cleaning pipeline

### **Next Steps**:
1. Run this notebook first
2. Then run any domain notebook (01, 02, or 03)
3. Or proceed to cross-domain analysis (04)