# 📊 Notebook 01: Data Pipeline & Preprocessing

## AADHAAR INTELLIGENCE SYSTEM - UIDAI Hackathon 2025-26

---

### Objective
Extract, transform, and load the **3 Real UIDAI Datasets** for behavioral analytics:
1. **Enrolment Data** - New Aadhaar registrations by age group (0-5, 5-17, 18+)
2. **Demographic Update Data** - Address/Name updates by age group (5-17, 17+)
3. **Biometric Update Data** - Fingerprint/Iris updates by age group (5-17, 17+)

### Dataset Structure
- **Enrolment**: `date, state, district, pincode, age_0_5, age_5_17, age_18_greater`
- **Demographic**: `date, state, district, pincode, demo_age_5_17, demo_age_17_`
- **Biometric**: `date, state, district, pincode, bio_age_5_17, bio_age_17_`
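
The column lists above can double as a schema check before any processing starts. The sketch below is illustrative only: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names, not part of the notebook, and the check assumes the raw column names are exactly those listed.

```python
# Expected raw columns, copied from the "Dataset Structure" list.
EXPECTED_COLUMNS = {
    "enrolment": ["date", "state", "district", "pincode",
                  "age_0_5", "age_5_17", "age_18_greater"],
    "demographic": ["date", "state", "district", "pincode",
                    "demo_age_5_17", "demo_age_17_"],
    "biometric": ["date", "state", "district", "pincode",
                  "bio_age_5_17", "bio_age_17_"],
}

def missing_columns(dataset, actual_columns):
    """Return the expected columns that are absent, in schema order."""
    actual = set(actual_columns)
    return [col for col in EXPECTED_COLUMNS[dataset] if col not in actual]

# Example: a demographic file that lost its last column.
print(missing_columns(
    "demographic",
    ["date", "state", "district", "pincode", "demo_age_5_17"],
))  # ['demo_age_17_']
```

Calling it as `missing_columns("enrolment", df_enrolment.columns)` right after loading would surface a renamed or dropped column early.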

### Output
- Cleaned, merged master dataset
- Data quality report
- Export for downstream analysis

In [1]:
# ============================================
# CELL 1: Import Libraries
# ============================================

import pandas as pd
import numpy as np
import os
import glob
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Visualization
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

print("✅ Libraries imported successfully")
print(f"📅 Analysis Date: {datetime.now().strftime('%Y-%m-%d %H:%M')}")
print(f"🐍 Pandas version: {pd.__version__}")

✅ Libraries imported successfully
📅 Analysis Date: 2026-01-13 16:43
🐍 Pandas version: 2.3.3


In [2]:
# ============================================
# CELL 2: Configuration & Paths
# ============================================

# Define paths
DATA_DIR = '../data/'
OUTPUT_DIR = '../outputs/'

# Dataset folder paths (extracted from zip files)
DATASET_PATHS = {
    'enrolment': f"{DATA_DIR}enrolment/",
    'demographic': f"{DATA_DIR}demographic/",
    'biometric': f"{DATA_DIR}biometric/"
}

# Create output directory if not exists
os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(os.path.join(OUTPUT_DIR, "charts"), exist_ok=True)

print("📁 Configuration set:")
print(f"   Data Directory: {DATA_DIR}")
print(f"   Output Directory: {OUTPUT_DIR}")
print(f"\n📂 Dataset Paths:")
for name, path in DATASET_PATHS.items():
    print(f"   • {name}: {path}")

📁 Configuration set:
   Data Directory: ../data/
   Output Directory: ../outputs/

📂 Dataset Paths:
   • enrolment: ../data/enrolment/
   • demographic: ../data/demographic/
   • biometric: ../data/biometric/
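
Because `DATA_DIR` and `OUTPUT_DIR` already end in a slash, appending `/charts` with an f-string yields a doubled separator. Most operating systems accept that, but joining path components avoids it. A small sketch using `pathlib` (one possible approach; the notebook itself uses `os`):

```python
from pathlib import Path

OUTPUT_DIR = "../outputs/"  # same value as in Cell 2

concatenated = f"{OUTPUT_DIR}/charts"               # string concatenation
joined = (Path(OUTPUT_DIR) / "charts").as_posix()   # component-wise join

print(concatenated)  # ../outputs//charts
print(joined)        # ../outputs/charts
```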


In [3]:
# ============================================
# CELL 3: Data Loading Function
# ============================================

def load_all_csvs(folder_path, dataset_name):
    """
    Load and concatenate all CSV files from a folder
    
    Parameters:
    -----------
    folder_path : str - Path to folder containing CSVs
    dataset_name : str - Name for logging
    
    Returns:
    --------
    pd.DataFrame - Concatenated dataset
    """
    print(f"\n📦 Loading: {dataset_name.upper()}")
    print(f"   Path: {folder_path}")

    # Find all CSV files recursively; sort so the load order is deterministic
    all_files = sorted(glob.glob(os.path.join(folder_path, "**/*.csv"), recursive=True))

    if not all_files:
        print(f"   ⚠️ No CSV files found in {folder_path}")
        return None

    print(f"   📄 Found {len(all_files)} CSV files")

    # Load and concatenate all CSVs
    dfs = []
    for file in all_files:
        df = pd.read_csv(file)
        dfs.append(df)
        print(f"      ✓ {os.path.basename(file)}: {len(df):,} rows")

    combined_df = pd.concat(dfs, ignore_index=True)
    print(f"   ✅ Total loaded: {len(combined_df):,} records")

    return combined_df

print("✅ Data loading function ready")

✅ Data loading function ready


In [4]:
# ============================================
# CELL 4: Load All 3 UIDAI Datasets
# ============================================

print("="*60)
print("🔄 LOADING REAL UIDAI DATASETS")
print("="*60)

# Load Enrolment Data
df_enrolment = load_all_csvs(DATASET_PATHS['enrolment'], 'enrolment')

# Load Demographic Update Data  
df_demographic = load_all_csvs(DATASET_PATHS['demographic'], 'demographic')

# Load Biometric Update Data
df_biometric = load_all_csvs(DATASET_PATHS['biometric'], 'biometric')

print("\n" + "="*60)
print("✅ ALL DATASETS LOADED SUCCESSFULLY")
print("="*60)
print(f"\n📈 Dataset Summary:")
print(f"   Enrolment Records: {len(df_enrolment):,}")
print(f"   Demographic Records: {len(df_demographic):,}")
print(f"   Biometric Records: {len(df_biometric):,}")

🔄 LOADING REAL UIDAI DATASETS

📦 Loading: ENROLMENT
   Path: ../data/enrolment/
   📄 Found 3 CSV files
      ✓ api_data_aadhar_enrolment_0_500000.csv: 500,000 rows
      ✓ api_data_aadhar_enrolment_1000000_1006029.csv: 6,029 rows
      ✓ api_data_aadhar_enrolment_500000_1000000.csv: 500,000 rows
   ✅ Total loaded: 1,006,029 records

📦 Loading: DEMOGRAPHIC
   Path: ../data/demographic/
   📄 Found 5 CSV files
      ✓ api_data_aadhar_demographic_0_500000.csv: 500,000 rows
      ✓ api_data_aadhar_demographic_1000000_1500000.csv: 500,000 rows
      ✓ api_data_aadhar_demographic_1500000_2000000.csv: 500,000 rows
      ✓ api_data_aadhar_demographic_2000000_2071700.csv: 71,700 rows
      ✓ api_data_aadhar_demographic_500000_1000000.csv: 500,000 rows
   ✅ Total loaded: 2,071,700 records

📦 Loading: BIOMETRIC
   Path: ../data/biometric/
   📄 Found 4 CSV files
      ✓ api_data_aadhar_biometric_0_500000.csv: 500,000 rows
      ✓ api_data_aadhar_biometric_1000000_1500000.csv: 500,000 rows
      ✓ api_data_aadhar_biometric_1500000_1861108.csv: 361,108 rows
      ✓ api_data_aadhar_biometric_500000_1000000.csv: 500,000 rows
   ✅ Total loaded: 1,861,108 records

✅ ALL DATASETS LOADED SUCCESSFULLY

📈 Dataset Summary:
   Enrolment Records: 1,006,029
   Demographic Records: 2,071,700
   Biometric Records: 1,861,108
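
The file names in the log appear to encode row offsets (`<start>_<end>`), and the load order shown is lexical: `1000000_...` comes before `500000_...`. The order is harmless for the aggregations in this notebook, but if the original row order ever matters, the files could be sorted by their numeric start offset instead. A sketch, assuming the `_<start>_<end>.csv` naming holds for every chunk (`chunk_start` is a hypothetical helper):

```python
import re

def chunk_start(filename):
    """Return the numeric start offset from names like '..._500000_1000000.csv'."""
    match = re.search(r"_(\d+)_(\d+)\.csv$", filename)
    if match is None:
        raise ValueError(f"Unexpected chunk file name: {filename}")
    return int(match.group(1))

# File names as they appear in the loading log (lexical order).
files = [
    "api_data_aadhar_enrolment_0_500000.csv",
    "api_data_aadhar_enrolment_1000000_1006029.csv",
    "api_data_aadhar_enrolment_500000_1000000.csv",
]

ordered = sorted(files, key=chunk_start)
for name in ordered:
    print(name)
```

Inside `load_all_csvs`, the same key could be applied as `sorted(all_files, key=lambda p: chunk_start(os.path.basename(p)))`.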


In [5]:
# ============================================
# CELL 5: Data Exploration
# ============================================

print("\n🔍 DATA EXPLORATION")
print("="*60)

# Display column info for each dataset
print("\n📋 ENROLMENT DATA:")
print(f"   Columns: {df_enrolment.columns.tolist()}")
print(f"   Shape: {df_enrolment.shape}")
display(df_enrolment.head(3))

print("\n📋 DEMOGRAPHIC UPDATE DATA:")
print(f"   Columns: {df_demographic.columns.tolist()}")
print(f"   Shape: {df_demographic.shape}")
display(df_demographic.head(3))

print("\n📋 BIOMETRIC UPDATE DATA:")
print(f"   Columns: {df_biometric.columns.tolist()}")
print(f"   Shape: {df_biometric.shape}")
display(df_biometric.head(3))


🔍 DATA EXPLORATION

📋 ENROLMENT DATA:
   Columns: ['date', 'state', 'district', 'pincode', 'age_0_5', 'age_5_17', 'age_18_greater']
   Shape: (1006029, 7)

Unnamed: 0,date,state,district,pincode,age_0_5,age_5_17,age_18_greater
0,02-03-2025,Meghalaya,East Khasi Hills,793121,11,61,37
1,09-03-2025,Karnataka,Bengaluru Urban,560043,14,33,39
2,09-03-2025,Uttar Pradesh,Kanpur Nagar,208001,29,82,12

📋 DEMOGRAPHIC UPDATE DATA:
   Columns: ['date', 'state', 'district', 'pincode', 'demo_age_5_17', 'demo_age_17_']
   Shape: (2071700, 6)

Unnamed: 0,date,state,district,pincode,demo_age_5_17,demo_age_17_
0,01-03-2025,Uttar Pradesh,Gorakhpur,273213,49,529
1,01-03-2025,Andhra Pradesh,Chittoor,517132,22,375
2,01-03-2025,Gujarat,Rajkot,360006,65,765

📋 BIOMETRIC UPDATE DATA:
   Columns: ['date', 'state', 'district', 'pincode', 'bio_age_5_17', 'bio_age_17_']
   Shape: (1861108, 6)

Unnamed: 0,date,state,district,pincode,bio_age_5_17,bio_age_17_
0,01-03-2025,Haryana,Mahendragarh,123029,280,577
1,01-03-2025,Bihar,Madhepura,852121,144,369
2,01-03-2025,Jammu and Kashmir,Punch,185101,643,1091


In [6]:
# ============================================
# CELL 6: Data Preprocessing
# ============================================

print("\n🧹 DATA PREPROCESSING")
print("="*60)

# Convert date columns
df_enrolment['date'] = pd.to_datetime(df_enrolment['date'], format='%d-%m-%Y')
df_demographic['date'] = pd.to_datetime(df_demographic['date'], format='%d-%m-%Y')
df_biometric['date'] = pd.to_datetime(df_biometric['date'], format='%d-%m-%Y')

# Calculate total enrollments per record
df_enrolment['total_enrolments'] = df_enrolment['age_0_5'] + df_enrolment['age_5_17'] + df_enrolment['age_18_greater']

# Fix column names (strip the trailing underscore, e.g. 'demo_age_17_' -> 'demo_age_17')
df_demographic.columns = df_demographic.columns.str.rstrip('_')
df_biometric.columns = df_biometric.columns.str.rstrip('_')

# Rename columns for consistency
if 'demo_age_17' in df_demographic.columns:
    df_demographic.rename(columns={'demo_age_17': 'demo_age_18_greater'}, inplace=True)
if 'bio_age_17' in df_biometric.columns:
    df_biometric.rename(columns={'bio_age_17': 'bio_age_18_greater'}, inplace=True)

# Calculate totals for demographic and biometric
df_demographic['total_demo_updates'] = df_demographic.filter(like='demo_age').sum(axis=1)
df_biometric['total_bio_updates'] = df_biometric.filter(like='bio_age').sum(axis=1)

# Add derived date columns
for df in [df_enrolment, df_demographic, df_biometric]:
    df['year'] = df['date'].dt.year
    df['month'] = df['date'].dt.month
    df['quarter'] = df['date'].dt.quarter
    df['day_of_week'] = df['date'].dt.dayofweek

print("✅ Date columns converted to datetime")
print(f"   Enrolment date range: {df_enrolment['date'].min()} to {df_enrolment['date'].max()}")
print(f"   Demographic date range: {df_demographic['date'].min()} to {df_demographic['date'].max()}")
print(f"   Biometric date range: {df_biometric['date'].min()} to {df_biometric['date'].max()}")


🧹 DATA PREPROCESSING

✅ Date columns converted to datetime
   Enrolment date range: 2025-03-02 00:00:00 to 2025-12-31 00:00:00
   Demographic date range: 2025-03-01 00:00:00 to 2025-12-29 00:00:00
   Biometric date range: 2025-03-01 00:00:00 to 2025-12-29 00:00:00
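
The explicit `format='%d-%m-%Y'` in Cell 6 matters because the raw dates are day-first. A value such as `02-03-2025` is also a valid month-first date, so a parser guessing the layout could silently swap day and month. The standard-library sketch below shows the two readings of that one string, taken from the enrolment preview:

```python
from datetime import datetime

raw = "02-03-2025"  # first date shown in the enrolment preview

day_first = datetime.strptime(raw, "%d-%m-%Y")    # layout used in Cell 6
month_first = datetime.strptime(raw, "%m-%d-%Y")  # the alternative reading

print(day_first.date())    # 2025-03-02
print(month_first.date())  # 2025-02-03
```

The day-first reading matches the enrolment date range printed above, which begins on 2025-03-02.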


In [7]:
# ============================================
# CELL 7: Data Quality Report
# ============================================

def data_quality_report(df, name):
    """Generate comprehensive data quality report"""
    print(f"\n{'='*60}")
    print(f"📋 DATA QUALITY REPORT: {name.upper()}")
    print(f"{'='*60}")
    
    print(f"\n📊 BASIC STATISTICS:")
    print(f"   Total Records: {len(df):,}")
    print(f"   Total Columns: {len(df.columns)}")
    print(f"   Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    
    print(f"\n📝 COLUMN INFORMATION:")
    for col in df.columns:
        null_pct = df[col].isnull().sum() / len(df) * 100
        unique = df[col].nunique()
        dtype = df[col].dtype
        print(f"   • {col}: {dtype} | Unique: {unique:,} | Null: {null_pct:.1f}%")
    
    return {
        'name': name,
        'records': len(df),
        'columns': len(df.columns),
        'null_pct': df.isnull().sum().sum() / (len(df) * len(df.columns)) * 100
    }

# Generate reports for all datasets
quality_reports = []
quality_reports.append(data_quality_report(df_enrolment, 'Enrollment'))
quality_reports.append(data_quality_report(df_demographic, 'Demographic'))
quality_reports.append(data_quality_report(df_biometric, 'Biometric'))


📋 DATA QUALITY REPORT: ENROLLMENT

📊 BASIC STATISTICS:
   Total Records: 1,006,029
   Total Columns: 12
   Memory Usage: 173.22 MB

📝 COLUMN INFORMATION:
   • date: datetime64[ns] | Unique: 92 | Null: 0.0%
   • state: object | Unique: 55 | Null: 0.0%
   • district: object | Unique: 985 | Null: 0.0%
   • pincode: int64 | Unique: 19,463 | Null: 0.0%
   • age_0_5: int64 | Unique: 671 | Null: 0.0%
   • age_5_17: int64 | Unique: 624 | Null: 0.0%
   • age_18_greater: int64 | Unique: 199 | Null: 0.0%
   • total_enrolments: int64 | Unique: 1,028 | Null: 0.0%
   • year: int32 | Unique: 1 | Null: 0.0%
   • month: int32 | Unique: 9 | Null: 0.0%
   • quarter: int32 | Unique: 4 | Null: 0.0%
   • day_of_week: int32 | Unique: 7 | Null: 0.0%

📋 DATA QUALITY REPORT: DEMOGRAPHIC

📊 BASIC STATISTICS:
   Total Records: 2,071,700
   Total Columns: 11
   Memory Usage: 341.46 MB

📝 COLUMN INFORMATION:
   • date: datetime64[ns] | Unique: 95 | Null: 0.0%
   • state: object | Unique: 65 | Null: 0.0%
   • district: object | Unique: 983 | Null: 0.0%
   • pincode: int64 | Unique: 19,742 | Null: 0.0%
   • demo_age_5_17: int64 | Unique: 614 | Null: 0.0%
   • demo_age_18_greater: int64 | Unique: 2,668 | Null: 0.0%
   • total_demo_updates: int64 | Unique: 2,848 | Null: 0.0%
   • year: int32 | Unique: 1 | Null: 0.0%
   • month: int32 | Unique: 9 | Null: 0.0%
   • quarter: int32 | Unique: 4 | Null: 0.0%
   • day_of_week: int32 | Unique: 7 | Null: 0.0%

📋 DATA QUALITY REPORT: BIOMETRIC

📊 BASIC STATISTICS:
   Total Records: 1,861,108
   Total Columns: 11
   Memory Usage: 306.53 MB

📝 COLUMN INFORMATION:
   • date: datetime64[ns] | Unique: 89 | Null: 0.0%
   • state: object | Unique: 57 | Null: 0.0%
   • district: object | Unique: 974 | Null: 0.0%
   • pincode: int64 | Unique: 19,707 | Null: 0.0%
   • bio_age_5_17: int64 | Unique: 2,121 | Null: 0.0%
   • bio_age_18_greater: int64 | Unique: 2,212 | Null: 0.0%
   • total_bio_updates: int64 | Unique: 3,380 | Null: 0.0%
   • year: int32 | Unique: 1 | Null: 0.0%
   • month: int32 | Unique: 9 | Null: 0.0%
   • quarter: int32 | Unique: 4 | Null: 0.0%
   • day_of_week: int32 | Unique: 7 | Null: 0.0%
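
The reports list 55, 65 and 57 distinct `state` values, while India has 36 states and union territories. The surplus probably comes from spelling variants or stray values (the master-table preview in Cell 8 includes a row whose state is `100000`), though the notebook does not confirm this. A sketch of one way to canonicalise the labels before grouping follows. `normalize_state` is a hypothetical helper, and its rules (case folding, `&` to `and`, whitespace collapsing) are assumptions about the kind of variation present, not a verified list.

```python
import re

def normalize_state(name):
    """Hypothetical helper: canonicalise a state label for grouping."""
    text = str(name).strip().casefold()
    text = text.replace("&", " and ")
    return re.sub(r"\s+", " ", text).strip()

# Illustrative labels; only the first, third and last appear in the notebook output.
samples = [
    "Andaman & Nicobar Islands",
    "andaman and nicobar islands ",
    "Jammu and Kashmir",
    "100000",
]

normalized = {normalize_state(s) for s in samples}
suspicious = sorted(s for s in normalized if s.isdigit())

print(sorted(normalized))
print(suspicious)  # ['100000']
```

Purely numeric labels like the last one cannot be repaired by normalisation and would need to be dropped or mapped by hand.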


In [8]:
# ============================================
# CELL 8: Create Pincode-Level Master Dataset
# ============================================

print("\n📍 CREATING PINCODE-LEVEL MASTER DATASET")
print("="*60)

# Aggregate enrollment by pincode
enrolment_by_pincode = df_enrolment.groupby(['state', 'district', 'pincode']).agg({
    'age_0_5': 'sum',
    'age_5_17': 'sum',
    'age_18_greater': 'sum',
    'total_enrolments': 'sum',
    'date': 'count'
}).reset_index()
enrolment_by_pincode.rename(columns={'date': 'enrolment_days'}, inplace=True)

# Aggregate demographic updates by pincode
demo_by_pincode = df_demographic.groupby(['state', 'district', 'pincode']).agg({
    'total_demo_updates': 'sum'
}).reset_index()

# Aggregate biometric updates by pincode
bio_by_pincode = df_biometric.groupby(['state', 'district', 'pincode']).agg({
    'total_bio_updates': 'sum'
}).reset_index()

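# Merge all datasets
# NOTE: the join key is 'pincode' alone. A pincode that appears under more than one
# (state, district) group matches many-to-many, so rows repeat (visible in the preview
# below) and totals summed over master_pincode are inflated.
master_pincode = enrolment_by_pincode.merge(
    demo_by_pincode[['pincode', 'total_demo_updates']], 
    on='pincode', 
    how='left'
).merge(
    bio_by_pincode[['pincode', 'total_bio_updates']], 
    on='pincode', 
    how='left'
)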

# Fill NaN values
master_pincode['total_demo_updates'] = master_pincode['total_demo_updates'].fillna(0)
master_pincode['total_bio_updates'] = master_pincode['total_bio_updates'].fillna(0)

# Calculate total activity
master_pincode['total_activity'] = (
    master_pincode['total_enrolments'] + 
    master_pincode['total_demo_updates'] + 
    master_pincode['total_bio_updates']
)

# Calculate daily rate
master_pincode['daily_enrolment_rate'] = master_pincode['total_enrolments'] / master_pincode['enrolment_days']

print(f"\n📊 Master Pincode Dataset Created:")
print(f"   Total Unique Pincodes: {len(master_pincode):,}")
print(f"   Total States/UTs: {master_pincode['state'].nunique()}")
print(f"   Total Enrolments: {master_pincode['total_enrolments'].sum():,}")
print(f"   Total Demo Updates: {master_pincode['total_demo_updates'].sum():,.0f}")
print(f"   Total Bio Updates: {master_pincode['total_bio_updates'].sum():,.0f}")

display(master_pincode.head(10))


📍 CREATING PINCODE-LEVEL MASTER DATASET

📊 Master Pincode Dataset Created:
   Total Unique Pincodes: 147,399
   Total States/UTs: 55
   Total Enrolments: 21,020,763
   Total Demo Updates: 179,049,601
   Total Bio Updates: 240,696,861

Unnamed: 0,state,district,pincode,age_0_5,age_5_17,age_18_greater,total_enrolments,enrolment_days,total_demo_updates,total_bio_updates,total_activity,daily_enrolment_rate
0,100000,100000,100000,0,1,217,218,22,2.0,0.0,220.0,9.909091
1,Andaman & Nicobar Islands,Andamans,744101,8,1,0,9,9,303.0,1324.0,1636.0,1.0
2,Andaman & Nicobar Islands,Andamans,744101,8,1,0,9,9,303.0,1584.0,1896.0,1.0
3,Andaman & Nicobar Islands,Andamans,744101,8,1,0,9,9,392.0,1324.0,1725.0,1.0
4,Andaman & Nicobar Islands,Andamans,744101,8,1,0,9,9,392.0,1584.0,1985.0,1.0
5,Andaman & Nicobar Islands,Andamans,744103,24,1,0,25,22,148.0,215.0,388.0,1.136364
6,Andaman & Nicobar Islands,Andamans,744103,24,1,0,25,22,148.0,90.0,263.0,1.136364
7,Andaman & Nicobar Islands,Andamans,744103,24,1,0,25,22,148.0,1971.0,2144.0,1.136364
8,Andaman & Nicobar Islands,Andamans,744103,24,1,0,25,22,104.0,215.0,344.0,1.136364
9,Andaman & Nicobar Islands,Andamans,744103,24,1,0,25,22,104.0,90.0,219.0,1.136364
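
In the preview, pincode 744101 appears four times, pairing demographic totals 303 and 392 with biometric totals 1324 and 1584. That is the pattern a many-to-many join produces when the update tables hold more than one row per pincode and the merge key is `pincode` alone. The sketch below reproduces the effect on toy frames and shows one possible remedy: collapse each update table to one row per pincode before merging, and let `validate` guard the join. The numeric values are copied from the preview; the second district label is hypothetical. Whether the table's grain should be pincode or (state, district, pincode) is a decision for the analysis, not something this sketch settles.

```python
import pandas as pd

enrol = pd.DataFrame({
    "state": ["Andaman & Nicobar Islands"], "district": ["Andamans"],
    "pincode": [744101], "total_enrolments": [9],
})
# Two rows per pincode because the district label differs (hypothetical variant).
demo = pd.DataFrame({
    "district": ["Andamans", "South Andaman"], "pincode": [744101, 744101],
    "total_demo_updates": [303, 392],
})
bio = pd.DataFrame({
    "district": ["Andamans", "South Andaman"], "pincode": [744101, 744101],
    "total_bio_updates": [1324, 1584],
})

# Merge on pincode alone, as in Cell 8: 1 x 2 x 2 = 4 rows for one pincode.
fanned = (
    enrol.merge(demo[["pincode", "total_demo_updates"]], on="pincode", how="left")
         .merge(bio[["pincode", "total_bio_updates"]], on="pincode", how="left")
)

# Alternative: one row per pincode on the right-hand side, checked by validate.
demo_1 = demo.groupby("pincode", as_index=False)["total_demo_updates"].sum()
bio_1 = bio.groupby("pincode", as_index=False)["total_bio_updates"].sum()
clean = (
    enrol.merge(demo_1, on="pincode", how="left", validate="many_to_one")
         .merge(bio_1, on="pincode", how="left", validate="many_to_one")
)

print(len(fanned), len(clean))  # 4 1
print(clean[["pincode", "total_demo_updates", "total_bio_updates"]])
```

With `validate="many_to_one"`, pandas raises a `MergeError` if the right-hand table still contains duplicate keys, so the fan-out cannot pass unnoticed.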


In [9]:
# ============================================
# CELL 9: Data Summary Visualization
# ============================================

# Create summary visualization
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=(
        'Enrollments by State (Top 10)', 
        'Age Group Distribution',
        'Monthly Enrollment Trend',
        'Daily Activity Pattern'
    ),
    specs=[[{"type": "bar"}, {"type": "pie"}],
           [{"type": "scatter"}, {"type": "bar"}]]
)

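# Plot 1: Enrollments by State
# Aggregate from df_enrolment: master_pincode rows are duplicated by the pincode-only merge in Cell 8
state_enrol = df_enrolment.groupby('state')['total_enrolments'].sum().sort_values(ascending=True).tail(10)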
fig.add_trace(
    go.Bar(x=state_enrol.values, y=state_enrol.index, orientation='h', 
           marker_color='#FF6B35', name='Enrollments'),
    row=1, col=1
)

# Plot 2: Age Group Distribution
age_totals = {
    '0-5 years': df_enrolment['age_0_5'].sum(),
    '5-17 years': df_enrolment['age_5_17'].sum(),
    '18+ years': df_enrolment['age_18_greater'].sum()
}
fig.add_trace(
    go.Pie(labels=list(age_totals.keys()), values=list(age_totals.values()),
           marker_colors=['#1B998B', '#F77F00', '#D62828']),
    row=1, col=2
)

# Plot 3: Monthly Trend
monthly = df_enrolment.groupby(['year', 'month'])['total_enrolments'].sum().reset_index()
monthly['date'] = pd.to_datetime(monthly[['year', 'month']].assign(day=1))
fig.add_trace(
    go.Scatter(x=monthly['date'], y=monthly['total_enrolments'], mode='lines+markers',
               line_color='#1B998B', name='Trend'),
    row=2, col=1
)

# Plot 4: Daily Activity Pattern
daily_pattern = df_enrolment.groupby('day_of_week')['total_enrolments'].sum()
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
fig.add_trace(
    go.Bar(x=days, y=daily_pattern.values, marker_color='#004E89', name='Daily'),
    row=2, col=2
)

fig.update_layout(
    title_text='<b>AADHAAR DATA PIPELINE SUMMARY</b>',
    title_x=0.5,
    height=700,
    showlegend=False,
    template='plotly_white'
)

# Save figure
fig.write_html(os.path.join(OUTPUT_DIR, "charts", "01_data_pipeline_summary.html"))
print("📊 Chart saved to outputs/charts/01_data_pipeline_summary.html")

📊 Chart saved to outputs/charts/01_data_pipeline_summary.html


In [10]:
# ============================================
# CELL 10: Export Processed Data
# ============================================

print("\n💾 EXPORTING PROCESSED DATASETS")
print("="*60)

# Save cleaned datasets
df_enrolment.to_csv(os.path.join(OUTPUT_DIR, "enrolment_cleaned.csv"), index=False)
df_demographic.to_csv(os.path.join(OUTPUT_DIR, "demographic_cleaned.csv"), index=False)
df_biometric.to_csv(os.path.join(OUTPUT_DIR, "biometric_cleaned.csv"), index=False)
master_pincode.to_csv(os.path.join(OUTPUT_DIR, "master_pincode.csv"), index=False)

print(f"\n✅ Exported files:")
print(f"   • enrolment_cleaned.csv ({len(df_enrolment):,} records)")
print(f"   • demographic_cleaned.csv ({len(df_demographic):,} records)")
print(f"   • biometric_cleaned.csv ({len(df_biometric):,} records)")
print(f"   • master_pincode.csv ({len(master_pincode):,} pincodes)")


💾 EXPORTING PROCESSED DATASETS

✅ Exported files:
   • enrolment_cleaned.csv (1,006,029 records)
   • demographic_cleaned.csv (2,071,700 records)
   • biometric_cleaned.csv (1,861,108 records)
   • master_pincode.csv (147,399 pincodes)
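
CSV does not store column types, so the `date` column converted in Cell 6 will come back as text unless the downstream notebook asks for it to be parsed. The sketch below round-trips a toy frame through an in-memory buffer to show the difference; how the later notebooks actually reload these files is not shown here, so `parse_dates` is a suggestion rather than a description of them.

```python
import io

import pandas as pd

# Toy frame standing in for a cleaned dataset.
df = pd.DataFrame({
    "date": pd.to_datetime(["2025-03-02", "2025-12-31"]),
    "total_enrolments": [109, 86],
})

buffer = io.StringIO()
df.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)                          # 'date' is read back as text
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["date"])   # 'date' restored to datetime

print(plain["date"].dtype, parsed["date"].dtype)
```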


In [11]:
# ============================================
# CELL 11: Pipeline Metrics Summary
# ============================================

print("\n" + "="*60)
print("📊 FINAL PIPELINE METRICS")
print("="*60)

total_records = len(df_enrolment) + len(df_demographic) + len(df_biometric)

metrics = {
    'Total Records Processed': f"{total_records:,}",
    'Unique Pincodes': f"{len(master_pincode):,}",
    'Total States/UTs': f"{master_pincode['state'].nunique()}",
    'Total Enrollments': f"{master_pincode['total_enrolments'].sum():,}",
    'Total Demographic Updates': f"{master_pincode['total_demo_updates'].sum():,.0f}",
    'Total Biometric Updates': f"{master_pincode['total_bio_updates'].sum():,.0f}",
    'Age 0-5 Enrollments': f"{df_enrolment['age_0_5'].sum():,}",
    'Age 5-17 Enrollments': f"{df_enrolment['age_5_17'].sum():,}",
    'Age 18+ Enrollments': f"{df_enrolment['age_18_greater'].sum():,}",
    'Date Range': f"{df_enrolment['date'].min().date()} to {df_enrolment['date'].max().date()}"
}

for key, value in metrics.items():
    print(f"   • {key}: {value}")

print("\n" + "="*60)
print("✅ NOTEBOOK 01 COMPLETE - Proceed to 02_life_events.ipynb")
print("="*60)


📊 FINAL PIPELINE METRICS
   • Total Records Processed: 4,938,837
   • Unique Pincodes: 147,399
   • Total States/UTs: 55
   • Total Enrollments: 21,020,763
   • Total Demographic Updates: 179,049,601
   • Total Biometric Updates: 240,696,861
   • Age 0-5 Enrollments: 3,546,965
   • Age 5-17 Enrollments: 1,720,384
   • Age 18+ Enrollments: 168,353
   • Date Range: 2025-03-02 to 2025-12-31

✅ NOTEBOOK 01 COMPLETE - Proceed to 02_life_events.ipynb
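
A quick reconciliation of the printed metrics is worth doing before relying on them. `total_enrolments` is defined in Cell 6 as the sum of the three age columns, so the three age-group totals should add up to the total-enrolments figure. Using the numbers above:

```python
# Figures copied from the FINAL PIPELINE METRICS output.
age_0_5 = 3_546_965
age_5_17 = 1_720_384
age_18_plus = 168_353
master_total = 21_020_763  # summed over master_pincode

age_sum = age_0_5 + age_5_17 + age_18_plus
print(f"{age_sum:,}")                  # 5,435,702
print(f"{master_total - age_sum:,}")   # 15,585,061
print(master_total > age_sum)          # True
```

The age groups sum to 5,435,702, whereas the master-table figure is almost four times larger. Since the age totals come from `df_enrolment` and the larger figure from `master_pincode`, the gap suggests that the master table counts some enrolment rows more than once, which would fit the pincode-only merge in Cell 8. This is an inference from the printed numbers, not a re-run, and the update totals taken from `master_pincode` may be affected in the same way.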
