# 📍 Notebook 03: Hyper-Local Geographic Analysis

## AADHAAR INTELLIGENCE SYSTEM - LENS 2

---

### Objective
Perform pincode-level saturation analysis using **real UIDAI datasets**:
1. **Enrolment Data** - New Aadhaar registrations by age group
2. **Demographic Update Data** - Address/name updates
3. **Biometric Update Data** - Fingerprint/iris updates

### Analysis Goals
- Identify critical gap zones with low enrollment
- Find optimal mobile enrollment van deployment locations
- Analyze regional disparities in Aadhaar coverage
- Age-wise enrollment patterns (0-5, 5-17, 18+)

### Methods
- Geographic clustering (K-Means, DBSCAN)
- Saturation rate calculation by pincode
- Interactive heatmap visualization

### Key Insight
> "Identify critical pincodes for targeted mobile deployment based on real enrollment data"

In [28]:
# ============================================
# CELL 1: Import Libraries
# ============================================

import pandas as pd
import numpy as np
from datetime import datetime
import warnings
import os
import glob
warnings.filterwarnings('ignore')

# Visualization
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio

# Set default renderer for notebook
pio.renderers.default = "notebook"

# Machine Learning
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler

# Try to import folium for interactive maps
try:
    import folium
    from folium.plugins import HeatMap, MarkerCluster
    FOLIUM_AVAILABLE = True
except ImportError:
    FOLIUM_AVAILABLE = False
    print("⚠️ Folium not installed. Using Plotly for maps.")

print("✅ Libraries imported successfully")
print(f"📅 Analysis Date: {datetime.now().strftime('%Y-%m-%d %H:%M')}")

⚠️ Folium not installed. Using Plotly for maps.
✅ Libraries imported successfully
📅 Analysis Date: 2026-01-13 16:20


In [14]:
# ============================================
# CELL 2: Load Real UIDAI Datasets
# ============================================

DATA_DIR = '../data/'
OUTPUT_DIR = '../outputs/'

# Create output directories if not exist
os.makedirs(f"{OUTPUT_DIR}/charts", exist_ok=True)

# Function to load all CSV files from a folder
def load_all_csvs(folder_path):
    """Load and concatenate all CSV files from a folder"""
    all_files = glob.glob(os.path.join(folder_path, "**/*.csv"), recursive=True)
    if not all_files:
        print(f"⚠️ No CSV files found in {folder_path}")
        return None
    
    dfs = []
    for file in all_files:
        df = pd.read_csv(file)
        dfs.append(df)
        print(f"   📄 Loaded: {os.path.basename(file)} ({len(df):,} rows)")
    
    return pd.concat(dfs, ignore_index=True)

# Load Enrolment Data
print("📊 LOADING ENROLMENT DATA...")
print("-" * 50)
df_enrolment = load_all_csvs(f"{DATA_DIR}/enrolment")

# Load Demographic Update Data
print("\n📊 LOADING DEMOGRAPHIC UPDATE DATA...")
print("-" * 50)
df_demographic = load_all_csvs(f"{DATA_DIR}/demographic")

# Load Biometric Update Data
print("\n📊 LOADING BIOMETRIC UPDATE DATA...")
print("-" * 50)
df_biometric = load_all_csvs(f"{DATA_DIR}/biometric")

print("\n" + "=" * 60)
print("✅ ALL DATASETS LOADED SUCCESSFULLY!")
print("=" * 60)
print(f"\n📈 Dataset Summary:")
print(f"   Enrolment Records: {len(df_enrolment):,}")
print(f"   Demographic Update Records: {len(df_demographic):,}")
print(f"   Biometric Update Records: {len(df_biometric):,}")

📊 LOADING ENROLMENT DATA...
--------------------------------------------------
   📄 Loaded: api_data_aadhar_enrolment_0_500000.csv (500,000 rows)
   📄 Loaded: api_data_aadhar_enrolment_1000000_1006029.csv (6,029 rows)
   📄 Loaded: api_data_aadhar_enrolment_500000_1000000.csv (500,000 rows)

📊 LOADING DEMOGRAPHIC UPDATE DATA...
--------------------------------------------------
   📄 Loaded: api_data_aadhar_demographic_0_500000.csv (500,000 rows)
   📄 Loaded: api_data_aadhar_demographic_1000000_1500000.csv (500,000 rows)
   📄 Loaded: api_data_aadhar_demographic_1500000_2000000.csv (500,000 rows)
   📄 Loaded: api_data_aadhar_demographic_2000000_2071700.csv (71,700 rows)
   📄 Loaded: api_data_aadhar_demographic_500000_1000000.csv (500,000 rows)

📊 LOADING BIOMETRIC UPDATE DATA...
--------------------------------------------------
   📄 Loaded: api_data_aadhar_biometric_0_500000.csv (500,000 rows)
   📄 Loaded: api_data_aadhar_biometric_1000000_1500000.c
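
`load_all_csvs` returns `None` when a folder contains no CSV files, so the later `len(df_enrolment)` calls in the summary would raise a `TypeError` in that case. The sketch below shows one possible defensive variant. The name `load_all_csvs_safe` and the decision to return an empty DataFrame are assumptions made for illustration, not part of the notebook. Sorting the paths also makes the load order reproducible.

```python
import glob
import os
import tempfile

import pandas as pd


def load_all_csvs_safe(folder_path):
    """Hypothetical variant of load_all_csvs that never returns None."""
    pattern = os.path.join(folder_path, "**", "*.csv")
    files = sorted(glob.glob(pattern, recursive=True))  # stable load order
    if not files:
        # Assumption: an empty frame is easier for callers to handle than None.
        return pd.DataFrame()
    return pd.concat((pd.read_csv(f) for f in files), ignore_index=True)


# Small self-check on a temporary folder holding two CSV chunks (toy data).
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"pincode": [110001], "age_0_5": [3]}).to_csv(
        os.path.join(tmp, "part_a.csv"), index=False)
    pd.DataFrame({"pincode": [560043], "age_0_5": [5]}).to_csv(
        os.path.join(tmp, "part_b.csv"), index=False)
    loaded = load_all_csvs_safe(tmp)
    empty = load_all_csvs_safe(os.path.join(tmp, "missing"))

print(len(loaded), len(empty))
```

With this shape, `len(...)` works for both the populated and the missing folder.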

In [15]:
# ============================================
# CELL 3: Data Preprocessing & Exploration
# ============================================

print("\n🔍 DATA EXPLORATION")
print("="*60)

# Display column info for each dataset
print("\n📋 ENROLMENT DATA COLUMNS:")
print(df_enrolment.columns.tolist())
print(df_enrolment.head(3))

print("\n📋 DEMOGRAPHIC UPDATE DATA COLUMNS:")
print(df_demographic.columns.tolist())
print(df_demographic.head(3))

print("\n📋 BIOMETRIC UPDATE DATA COLUMNS:")
print(df_biometric.columns.tolist())
print(df_biometric.head(3))

# Convert date columns
df_enrolment['date'] = pd.to_datetime(df_enrolment['date'], format='%d-%m-%Y')
df_demographic['date'] = pd.to_datetime(df_demographic['date'], format='%d-%m-%Y')
df_biometric['date'] = pd.to_datetime(df_biometric['date'], format='%d-%m-%Y')

# Calculate total enrollments per record
df_enrolment['total_enrolments'] = df_enrolment['age_0_5'] + df_enrolment['age_5_17'] + df_enrolment['age_18_greater']

# Fix column names for demographic and biometric (remove trailing underscore if any)
df_demographic.columns = df_demographic.columns.str.rstrip('_')
df_biometric.columns = df_biometric.columns.str.rstrip('_')

# Rename columns for consistency
if 'demo_age_17' in df_demographic.columns:
    df_demographic.rename(columns={'demo_age_17': 'demo_age_18_greater'}, inplace=True)
if 'bio_age_17' in df_biometric.columns:
    df_biometric.rename(columns={'bio_age_17': 'bio_age_18_greater'}, inplace=True)

print("\n✅ Date columns converted to datetime")
print(f"   Enrolment date range: {df_enrolment['date'].min()} to {df_enrolment['date'].max()}")
print(f"   Demographic date range: {df_demographic['date'].min()} to {df_demographic['date'].max()}")
print(f"   Biometric date range: {df_biometric['date'].min()} to {df_biometric['date'].max()}")


🔍 DATA EXPLORATION

📋 ENROLMENT DATA COLUMNS:
['date', 'state', 'district', 'pincode', 'age_0_5', 'age_5_17', 'age_18_greater']
         date          state          district  pincode  age_0_5  age_5_17  \
0  02-03-2025      Meghalaya  East Khasi Hills   793121       11        61   
1  09-03-2025      Karnataka   Bengaluru Urban   560043       14        33   
2  09-03-2025  Uttar Pradesh      Kanpur Nagar   208001       29        82   

   age_18_greater  
0              37  
1              39  
2              12  

📋 DEMOGRAPHIC UPDATE DATA COLUMNS:
['date', 'state', 'district', 'pincode', 'demo_age_5_17', 'demo_age_17_']
         date           state   district  pincode  demo_age_5_17  demo_age_17_
0  01-03-2025   Uttar Pradesh  Gorakhpur   273213             49           529
1  01-03-2025  Andhra Pradesh   Chittoor   517132             22           375
2  01-03-2025         Gujarat     Rajkot   360006             65           765

📋 BIOMETRIC UPDATE DATA COLUMNS:
['date
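
The demographic file names its adult column `demo_age_17_`, which Cell 3 cleans up in two steps (strip the underscore, then rename). A minimal sketch of the same idea as a single helper follows. The helper name `normalise_update_columns` is hypothetical, and the sample row is copied from the preview above.

```python
import pandas as pd


def normalise_update_columns(df, prefix):
    """Hypothetical helper: '<prefix>_age_17_' -> '<prefix>_age_18_greater'."""
    out = df.copy()
    out.columns = out.columns.str.rstrip("_")  # trailing underscores only
    return out.rename(columns={f"{prefix}_age_17": f"{prefix}_age_18_greater"})


# First demographic row from the preview above.
demo_sample = pd.DataFrame(
    {"pincode": [273213], "demo_age_5_17": [49], "demo_age_17_": [529]}
)
demo_clean = normalise_update_columns(demo_sample, "demo")
print(demo_clean.columns.tolist())
```

Working on a copy keeps the raw frame intact, so the cell can be re-run without the rename silently becoming a no-op.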

In [16]:
# ============================================
# CELL 4: Create Master Pincode Dataset
# ============================================

print("\n📍 CREATING MASTER PINCODE DATASET")
print("="*60)

# Aggregate enrolment data by pincode
enrolment_by_pincode = df_enrolment.groupby(['state', 'district', 'pincode']).agg({
    'age_0_5': 'sum',
    'age_5_17': 'sum',
    'age_18_greater': 'sum',
    'total_enrolments': 'sum',
    'date': 'count'  # Number of days with data
}).reset_index()
enrolment_by_pincode.rename(columns={'date': 'enrolment_days'}, inplace=True)

# Aggregate demographic updates by pincode
demo_agg_dict = {'demo_age_5_17': 'sum'}
# Check for the 18+ column (might be named differently)
demo_18_col = [c for c in df_demographic.columns if '17_' in c or '18' in c]
if demo_18_col:
    demo_agg_dict[demo_18_col[0]] = 'sum'

demo_by_pincode = df_demographic.groupby(['state', 'district', 'pincode']).agg(demo_agg_dict).reset_index()
demo_by_pincode['total_demo_updates'] = demo_by_pincode.iloc[:, 3:].sum(axis=1)

# Aggregate biometric updates by pincode
bio_agg_dict = {'bio_age_5_17': 'sum'}
# Check for the 18+ column (might be named differently)
bio_18_col = [c for c in df_biometric.columns if '17_' in c or '18' in c]
if bio_18_col:
    bio_agg_dict[bio_18_col[0]] = 'sum'

bio_by_pincode = df_biometric.groupby(['state', 'district', 'pincode']).agg(bio_agg_dict).reset_index()
bio_by_pincode['total_bio_updates'] = bio_by_pincode.iloc[:, 3:].sum(axis=1)

# Merge all datasets
master_pincode = enrolment_by_pincode.merge(
    demo_by_pincode[['pincode', 'total_demo_updates']], 
    on='pincode', 
    how='left'
).merge(
    bio_by_pincode[['pincode', 'total_bio_updates']], 
    on='pincode', 
    how='left'
)

# Fill NaN values
master_pincode['total_demo_updates'] = master_pincode['total_demo_updates'].fillna(0)
master_pincode['total_bio_updates'] = master_pincode['total_bio_updates'].fillna(0)

# Calculate total activity
master_pincode['total_activity'] = (
    master_pincode['total_enrolments'] + 
    master_pincode['total_demo_updates'] + 
    master_pincode['total_bio_updates']
)

print(f"✅ Master pincode dataset created: {len(master_pincode):,} unique pincodes")
print(f"\n📊 Dataset Summary:")
print(f"   Total Pincodes: {len(master_pincode):,}")
print(f"   States/UTs: {master_pincode['state'].nunique()}")
print(f"   Total Enrolments: {master_pincode['total_enrolments'].sum():,}")
print(f"   Total Demographic Updates: {master_pincode['total_demo_updates'].sum():,.0f}")
print(f"   Total Biometric Updates: {master_pincode['total_bio_updates'].sum():,.0f}")

display(master_pincode.head(10))


📍 CREATING MASTER PINCODE DATASET
✅ Master pincode dataset created: 147,399 unique pincodes

📊 Dataset Summary:
   Total Pincodes: 147,399
   States/UTs: 55
   Total Enrolments: 21,020,763
   Total Demographic Updates: 179,049,601
   Total Biometric Updates: 240,696,861


Unnamed: 0,state,district,pincode,age_0_5,age_5_17,age_18_greater,total_enrolments,enrolment_days,total_demo_updates,total_bio_updates,total_activity
0,100000,100000,100000,0,1,217,218,22,2.0,0.0,220.0
1,Andaman & Nicobar Islands,Andamans,744101,8,1,0,9,9,303.0,1324.0,1636.0
2,Andaman & Nicobar Islands,Andamans,744101,8,1,0,9,9,303.0,1584.0,1896.0
3,Andaman & Nicobar Islands,Andamans,744101,8,1,0,9,9,392.0,1324.0,1725.0
4,Andaman & Nicobar Islands,Andamans,744101,8,1,0,9,9,392.0,1584.0,1985.0
5,Andaman & Nicobar Islands,Andamans,744103,24,1,0,25,22,148.0,215.0,388.0
6,Andaman & Nicobar Islands,Andamans,744103,24,1,0,25,22,148.0,90.0,263.0
7,Andaman & Nicobar Islands,Andamans,744103,24,1,0,25,22,148.0,1971.0,2144.0
8,Andaman & Nicobar Islands,Andamans,744103,24,1,0,25,22,104.0,215.0,344.0
9,Andaman & Nicobar Islands,Andamans,744103,24,1,0,25,22,104.0,90.0,219.0
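
The preview lists pincode 744101 four times with different update totals. That pattern is what a merge on `pincode` alone produces when each update table holds more than one row for the pincode: 2 demographic rows × 2 biometric rows give 4 output rows, and the enrolment counts are repeated on each. The toy frames below reuse the 744101 values from the preview to reproduce the effect. Collapsing each update table to one row per pincode before merging is one possible remedy; whether summing across the duplicate rows is appropriate for the real data is an assumption.

```python
import pandas as pd

enrol = pd.DataFrame({"pincode": [744101], "total_enrolments": [9]})
# Two rows for the same pincode, e.g. from two spellings of the state or district.
demo = pd.DataFrame({"pincode": [744101, 744101], "total_demo_updates": [303, 392]})
bio = pd.DataFrame({"pincode": [744101, 744101], "total_bio_updates": [1324, 1584]})

# Merging on pincode alone multiplies rows: 1 x 2 x 2 = 4.
naive = enrol.merge(demo, on="pincode", how="left").merge(bio, on="pincode", how="left")

# Collapse each update table to one row per pincode so the join stays one-to-one.
demo_1 = demo.groupby("pincode", as_index=False)["total_demo_updates"].sum()
bio_1 = bio.groupby("pincode", as_index=False)["total_bio_updates"].sum()
fixed = (
    enrol.merge(demo_1, on="pincode", how="left", validate="one_to_one")
         .merge(bio_1, on="pincode", how="left", validate="one_to_one")
)

print(len(naive), naive["total_enrolments"].sum())  # repeated rows inflate the sum
print(len(fixed), fixed["total_enrolments"].sum())  # one row per pincode
```

The `validate` argument makes pandas raise a `MergeError` if the join is not one-to-one, which surfaces this kind of duplication early.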


In [17]:
# ============================================
# CELL 5: Add Geographic Coordinates
# ============================================

print("\n🗺️ ADDING GEOGRAPHIC COORDINATES")
print("="*60)

# State to approximate coordinates mapping (centroids)
state_coords = {
    'Andaman and Nicobar': (11.7, 92.7),
    'Andhra Pradesh': (15.9, 79.7),
    'Arunachal Pradesh': (28.2, 94.7),
    'Assam': (26.2, 92.9),
    'Bihar': (25.1, 85.3),
    'Chandigarh': (30.7, 76.8),
    'Chhattisgarh': (21.2, 81.8),
    'Dadra and Nagar Haveli': (20.1, 73.0),
    'Daman and Diu': (20.4, 72.8),
    'Delhi': (28.7, 77.1),
    'Goa': (15.3, 74.0),
    'Gujarat': (22.2, 71.2),
    'Haryana': (29.0, 76.1),
    'Himachal Pradesh': (31.1, 77.2),
    'Jammu and Kashmir': (33.7, 76.5),
    'Jharkhand': (23.6, 85.3),
    'Karnataka': (15.3, 75.7),
    'Kerala': (10.8, 76.2),
    'Ladakh': (34.2, 77.6),
    'Lakshadweep': (10.6, 72.6),
    'Madhya Pradesh': (22.9, 78.7),
    'Maharashtra': (19.7, 75.7),
    'Manipur': (24.6, 93.9),
    'Meghalaya': (25.5, 91.4),
    'Mizoram': (23.2, 92.9),
    'Nagaland': (26.1, 94.6),
    'Odisha': (20.9, 84.8),
    'Puducherry': (11.9, 79.8),
    'Punjab': (31.1, 75.3),
    'Rajasthan': (27.0, 74.2),
    'Sikkim': (27.5, 88.5),
    'Tamil Nadu': (11.1, 78.6),
    'Telangana': (18.1, 79.0),
    'Tripura': (23.9, 91.9),
    'Uttar Pradesh': (26.8, 80.9),
    'Uttarakhand': (30.1, 79.3),
    'West Bengal': (22.9, 87.8)
}

# Synthetic coordinates: state centroid plus pseudo-random jitter seeded by the
# first four pincode digits (these are not surveyed pincode locations)
def get_coords(row):
    state = row['state']
    pincode = row['pincode']
    base = state_coords.get(state, (20.5, 78.9))  # Default to India center
    # Add variation based on pincode to spread points
    np.random.seed(int(str(pincode)[:4]) if pd.notna(pincode) else 42)
    lat_offset = np.random.normal(0, 1.2)
    lon_offset = np.random.normal(0, 1.2)
    return base[0] + lat_offset, base[1] + lon_offset

# Apply coordinate generation
coords = master_pincode.apply(get_coords, axis=1)
master_pincode['latitude'] = coords.apply(lambda x: x[0])
master_pincode['longitude'] = coords.apply(lambda x: x[1])

# Clip to India bounds
master_pincode['latitude'] = master_pincode['latitude'].clip(6, 37)
master_pincode['longitude'] = master_pincode['longitude'].clip(68, 98)

print(f"✅ Coordinates added for {len(master_pincode):,} pincodes")
print(f"   Latitude range: {master_pincode['latitude'].min():.2f} to {master_pincode['latitude'].max():.2f}")
print(f"   Longitude range: {master_pincode['longitude'].min():.2f} to {master_pincode['longitude'].max():.2f}")


🗺️ ADDING GEOGRAPHIC COORDINATES
✅ Coordinates added for 147,399 pincodes
   Latitude range: 7.80 to 37.00
   Longitude range: 68.80 to 96.65
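
The data contains state labels such as `Andaman & Nicobar Islands`, `andhra pradesh` and even `100000`, while `state_coords` is keyed on `'Andaman and Nicobar'` and title-cased names. Any label that does not match a key exactly falls back to the India-centre default, and the 55 distinct "states" reported in Cell 4 suggest several spellings per state. A minimal sketch of normalising labels before the lookup is shown below. The alias table and the helper name `normalise_state` are assumptions and would need extending after inspecting the real label set.

```python
import re

# Assumed alias table; extend after inspecting df_enrolment['state'].unique().
STATE_ALIASES = {
    "andaman & nicobar islands": "Andaman and Nicobar",
    "andaman and nicobar islands": "Andaman and Nicobar",
}

# Subset of the state_coords keys, used here to keep the sketch short.
CANONICAL = ["Andaman and Nicobar", "Andhra Pradesh", "Uttar Pradesh"]
_BY_LOWER = {name.lower(): name for name in CANONICAL}


def normalise_state(raw):
    """Map a raw state label onto a state_coords key, or None if unknown."""
    key = re.sub(r"\s+", " ", str(raw)).strip().lower()
    if key in STATE_ALIASES:
        return STATE_ALIASES[key]
    return _BY_LOWER.get(key)


for label in ["andhra pradesh", "Andaman & Nicobar Islands", "100000"]:
    print(repr(label), "->", normalise_state(label))
```

Returning `None` for unknown labels makes bad rows easy to count and inspect instead of silently placing them at the map centre.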


In [18]:
# ============================================
# CELL 6: Identify Low Activity Zones (Critical Gap Areas)
# ============================================

print("\n🚨 IDENTIFYING LOW ACTIVITY ZONES")
print("="*60)

# Calculate daily enrollment rate
master_pincode['daily_enrolment_rate'] = master_pincode['total_enrolments'] / master_pincode['enrolment_days']

# Calculate percentiles for categorization
p25 = master_pincode['total_enrolments'].quantile(0.25)
p50 = master_pincode['total_enrolments'].quantile(0.50)
p75 = master_pincode['total_enrolments'].quantile(0.75)

print(f"📊 Enrolment Distribution:")
print(f"   25th Percentile: {p25:,.0f} enrollments")
print(f"   50th Percentile (Median): {p50:,.0f} enrollments")
print(f"   75th Percentile: {p75:,.0f} enrollments")

# Categorize pincodes based on activity
def categorize_activity(total):
    if total <= p25:
        return 'Critical (Bottom 25%)'
    elif total <= p50:
        return 'Low (25-50%)'
    elif total <= p75:
        return 'Medium (50-75%)'
    else:
        return 'High (Top 25%)'

master_pincode['activity_category'] = master_pincode['total_enrolments'].apply(categorize_activity)

# Count by category
category_counts = master_pincode['activity_category'].value_counts()

print("\n📊 ACTIVITY DISTRIBUTION:")
print("-" * 50)
for cat in ['Critical (Bottom 25%)', 'Low (25-50%)', 'Medium (50-75%)', 'High (Top 25%)']:
    if cat in category_counts.index:
        count = category_counts[cat]
        pct = count / len(master_pincode) * 100
        print(f"   {cat:<25}: {count:>6,} pincodes ({pct:.1f}%)")

# Get critical pincodes (lowest activity)
critical_pincodes = master_pincode[master_pincode['activity_category'] == 'Critical (Bottom 25%)'].copy()
critical_pincodes = critical_pincodes.sort_values('total_enrolments')

print(f"\n🔴 CRITICAL LOW-ACTIVITY ZONES: {len(critical_pincodes):,} pincodes")


🚨 IDENTIFYING LOW ACTIVITY ZONES
📊 Enrolment Distribution:
   25th Percentile: 10 enrollments
   50th Percentile (Median): 39 enrollments
   75th Percentile: 124 enrollments

📊 ACTIVITY DISTRIBUTION:
--------------------------------------------------
   Critical (Bottom 25%)    : 38,884 pincodes (26.4%)
   Low (25-50%)             : 35,312 pincodes (24.0%)
   Medium (50-75%)          : 36,516 pincodes (24.8%)
   High (Top 25%)           : 36,687 pincodes (24.9%)

🔴 CRITICAL LOW-ACTIVITY ZONES: 38,884 pincodes
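
The "Critical" band holds 26.4% of rows rather than exactly 25% because the rule uses `<=` and many rows tie at the 25th-percentile value. The sketch below reproduces the same banding on toy counts with `numpy.select`, a vectorised alternative to the row-wise `apply`. The toy values are illustrative only.

```python
import numpy as np
import pandas as pd

totals = pd.Series([1, 1, 1, 2, 5, 9, 20, 40])  # toy enrolment counts
p25, p50, p75 = totals.quantile([0.25, 0.50, 0.75])

values = totals.to_numpy()
# Conditions are checked in order; the first match wins, as in categorize_activity.
labels = np.select(
    [values <= p25, values <= p50, values <= p75],
    ["Critical (Bottom 25%)", "Low (25-50%)", "Medium (50-75%)"],
    default="High (Top 25%)",
)
share_critical = (labels == "Critical (Bottom 25%)").mean()
print(p25, share_critical)
```

Here three of the eight rows tie at the 25th-percentile value, so the "Bottom 25%" band ends up holding more than a quarter of the rows.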


In [19]:
# ============================================
# CELL 7: Top Priority Pincodes for Mobile Deployment
# ============================================

print("\n🎯 TOP PRIORITY PINCODES FOR MOBILE DEPLOYMENT")
print("="*60)

# Priority scoring: low activity areas that still have some enrollment potential
# (not completely inactive - those might have other issues)
critical_pincodes['priority_score'] = (
    (critical_pincodes['total_enrolments'].max() - critical_pincodes['total_enrolments']) / 
    critical_pincodes['total_enrolments'].max() * 100
)

# Filter out completely inactive pincodes (might be data issues)
deployment_candidates = critical_pincodes[critical_pincodes['total_enrolments'] > 0].copy()

# Top 50 deployment targets
top_deployment = deployment_candidates.nlargest(50, 'priority_score')[
    ['pincode', 'state', 'district', 'total_enrolments', 'age_0_5', 'age_5_17', 
     'age_18_greater', 'daily_enrolment_rate', 'latitude', 'longitude']
].reset_index(drop=True)

top_deployment.index = top_deployment.index + 1  # 1-indexed

print("\n📋 TOP 20 HIGH-PRIORITY DEPLOYMENT LOCATIONS:")
print("-" * 100)
print(f"{'Rank':<5} {'Pincode':<10} {'State':<20} {'District':<20} {'Total Enrol':>12} {'Daily Rate':>10}")
print("-" * 100)
for idx, row in top_deployment.head(20).iterrows():
    print(f"{idx:<5} {row['pincode']:<10} {row['state'][:19]:<20} {row['district'][:19]:<20} {row['total_enrolments']:>12,} {row['daily_enrolment_rate']:>10.1f}")

print(f"\n📍 Total Priority Deployment Candidates: {len(top_deployment)} locations")
print(f"📊 Avg Daily Enrollment Rate in Critical Zones: {top_deployment['daily_enrolment_rate'].mean():.1f}")


🎯 TOP PRIORITY PINCODES FOR MOBILE DEPLOYMENT

📋 TOP 20 HIGH-PRIORITY DEPLOYMENT LOCATIONS:
----------------------------------------------------------------------------------------------------
Rank  Pincode    State                District              Total Enrol Daily Rate
----------------------------------------------------------------------------------------------------
1     501218     andhra pradesh       rangareddi                      1        1.0
2     501218     andhra pradesh       rangareddi                      1        1.0
3     501218     andhra pradesh       rangareddi                      1        1.0
4     744301     Andaman & Nicobar I  Nicobars                        1        1.0
5     744301     Andaman & Nicobar I  Nicobars                        1        1.0
6     744301     Andaman & Nicobar I  Nicobars                        1        1.0
7     744301     Andaman & Nicobar I  Nicobars                        1        1.0
8     744102     Andaman & Nicobar 
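
The ranking above repeats the same pincode in consecutive positions, so a "top 50" list may contain far fewer than 50 distinct locations. Because `priority_score` is a decreasing linear function of `total_enrolments`, ranking by the highest score is the same as ranking by the lowest enrolment count. A minimal sketch of de-duplicating before ranking follows. The enrolment value for 744102 is a made-up toy value, since the output above is cut off.

```python
import pandas as pd

# Toy candidates; pincodes mirror the listing above, counts are illustrative.
candidates = pd.DataFrame({
    "pincode": [501218, 501218, 501218, 744301, 744301, 744102],
    "total_enrolments": [1, 1, 1, 1, 1, 2],
})

# Keep one row per pincode, then take the lowest enrolment counts.
unique_candidates = candidates.drop_duplicates(subset="pincode")
top = unique_candidates.nsmallest(2, "total_enrolments").reset_index(drop=True)
print(top["pincode"].tolist())
```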

In [21]:
# ============================================
# CELL 8: State-wise Enrollment Analysis
# ============================================

print("\n📊 STATE-WISE ENROLLMENT ANALYSIS")
print("="*60)

# Aggregate by state
state_stats = master_pincode.groupby('state').agg({
    'pincode': 'count',
    'total_enrolments': 'sum',
    'age_0_5': 'sum',
    'age_5_17': 'sum',
    'age_18_greater': 'sum',
    'total_demo_updates': 'sum',
    'total_bio_updates': 'sum',
    'daily_enrolment_rate': 'mean'
}).reset_index()

state_stats.columns = ['state', 'num_pincodes', 'total_enrolments', 'enrol_0_5', 
                       'enrol_5_17', 'enrol_18_plus', 'demo_updates', 'bio_updates', 'avg_daily_rate']

state_stats = state_stats.sort_values('total_enrolments', ascending=False)

# Create visualization
fig_state = go.Figure()

# Add bar chart
fig_state.add_trace(go.Bar(
    x=state_stats['state'],
    y=state_stats['total_enrolments'],
    marker_color=state_stats['total_enrolments'],
    marker_colorscale='Viridis',
    text=[f"{e:,.0f}" for e in state_stats['total_enrolments']],
    textposition='outside'
))

fig_state.update_layout(
    title=dict(
        text='<b>AADHAAR ENROLMENTS BY STATE</b><br><sup>Based on Real UIDAI Data</sup>',
        x=0.5
    ),
    xaxis_title='State/UT',
    yaxis_title='Total Enrolments',
    height=600,
    template='plotly_white',
    showlegend=False,
    xaxis_tickangle=-45
)

# Save to HTML
fig_state.write_html(f"{OUTPUT_DIR}/charts/03_state_enrolments.html")
print(f"✅ Chart saved: {OUTPUT_DIR}/charts/03_state_enrolments.html")

print("\n📊 Top 10 States by Enrolments:")
display(state_stats.head(10))


📊 STATE-WISE ENROLLMENT ANALYSIS
✅ Chart saved: ../outputs//charts/03_state_enrolments.html

📊 Top 10 States by Enrolments:


Unnamed: 0,state,num_pincodes,total_enrolments,enrol_0_5,enrol_5_17,enrol_18_plus,demo_updates,bio_updates,avg_daily_rate
45,Uttar Pradesh,6224,2642461,1356069,1241802,44590,19904859.0,22354666.0,12.284391
51,West Bengal,16632,2342478,1724364,575077,43037,21720686.0,14605873.0,11.643197
42,Telangana,19705,2070404,1643635,411027,15742,15539366.0,16866313.0,4.361677
6,Bihar,3945,1881926,839527,1010565,31834,13499378.0,14263060.0,18.525778
27,Madhya Pradesh,3254,1444532,1061069,355918,27545,6881409.0,14576472.0,9.127408
28,Maharashtra,7567,1261402,938110,295723,27569,14784954.0,25625569.0,3.512079
23,Karnataka,11013,956305,759960,149954,46391,7463324.0,11314868.0,7.351281
16,Gujarat,4836,953856,623457,271172,59227,5057647.0,9715731.0,10.340428
22,Jharkhand,3668,925291,592865,324097,8329,6646803.0,10604720.0,13.593801
5,Assam,3152,795493,502664,225664,67165,3132423.0,3381426.0,5.639535


In [22]:
# ============================================
# CELL 9: Interactive Pincode Map
# ============================================

print("\n🗺️ CREATING INTERACTIVE PINCODE MAP")
print("="*60)

# Color mapping for categories
color_map = {
    'Critical (Bottom 25%)': '#D62828',
    'Low (25-50%)': '#F77F00',
    'Medium (50-75%)': '#FCBF49',
    'High (Top 25%)': '#1B998B'
}

# Sample for visualization (full data may be slow)
sample_size = min(5000, len(master_pincode))
map_data = master_pincode.sample(sample_size, random_state=42).copy()

# Ensure all critical pincodes are included
map_data = pd.concat([map_data, critical_pincodes]).drop_duplicates(subset=['pincode'])

fig_map = px.scatter_mapbox(
    map_data,
    lat='latitude',
    lon='longitude',
    color='activity_category',
    color_discrete_map=color_map,
    size='total_enrolments',
    size_max=15,
    hover_name='pincode',
    hover_data={
        'state': True,
        'district': True,
        'total_enrolments': ':,',
        'age_0_5': ':,',
        'age_5_17': ':,',
        'latitude': False,
        'longitude': False
    },
    category_orders={'activity_category': ['Critical (Bottom 25%)', 'Low (25-50%)', 'Medium (50-75%)', 'High (Top 25%)']},
    mapbox_style='carto-positron',
    zoom=4,
    center={'lat': 20.5, 'lon': 78.9}
)

fig_map.update_layout(
    title=dict(
        text='<b>INDIA PINCODE ACTIVITY MAP</b><br><sup>🔴 Critical Zones Identified for Mobile Deployment</sup>',
        x=0.5
    ),
    height=700,
    margin={'r': 0, 't': 80, 'l': 0, 'b': 0}
)

# Save to HTML
fig_map.write_html(f"{OUTPUT_DIR}/charts/03_pincode_activity_map.html")
print(f"✅ Interactive map saved: {OUTPUT_DIR}/charts/03_pincode_activity_map.html")


🗺️ CREATING INTERACTIVE PINCODE MAP
✅ Interactive map saved: ../outputs//charts/03_pincode_activity_map.html


In [23]:
# ============================================
# CELL 10: Geographic Clustering Analysis
# ============================================

print("\n🔬 GEOGRAPHIC CLUSTERING ANALYSIS")
print("="*60)

# Prepare features for clustering
cluster_features = ['latitude', 'longitude', 'total_enrolments', 'daily_enrolment_rate']
X = master_pincode[cluster_features].copy()

# Handle any NaN values
X = X.fillna(X.median())

# Normalize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# K-Means clustering
n_clusters = 8
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
master_pincode['cluster'] = kmeans.fit_predict(X_scaled)

# Analyze clusters
cluster_stats = master_pincode.groupby('cluster').agg({
    'pincode': 'count',
    'total_enrolments': ['sum', 'mean'],
    'daily_enrolment_rate': 'mean',
    'age_0_5': 'sum',
    'age_5_17': 'sum',
    'age_18_greater': 'sum'
}).reset_index()

cluster_stats.columns = ['Cluster', 'Pincodes', 'Total_Enrolments', 'Avg_Enrolments', 
                         'Avg_Daily_Rate', 'Age_0_5', 'Age_5_17', 'Age_18_Plus']

# Identify priority clusters (low average enrollments)
cluster_stats['Priority'] = cluster_stats.apply(
    lambda x: 'HIGH' if x['Avg_Enrolments'] < cluster_stats['Avg_Enrolments'].median() * 0.5 
    else 'MEDIUM' if x['Avg_Enrolments'] < cluster_stats['Avg_Enrolments'].median() 
    else 'LOW',
    axis=1
)

print("\n📊 CLUSTER ANALYSIS RESULTS:")
print("-" * 100)
display(cluster_stats.round(2))

high_priority_clusters = cluster_stats[cluster_stats['Priority'] == 'HIGH']['Cluster'].values
print(f"\n🚨 High-Priority Clusters (Need Intervention): {list(high_priority_clusters)}")


🔬 GEOGRAPHIC CLUSTERING ANALYSIS

📊 CLUSTER ANALYSIS RESULTS:
----------------------------------------------------------------------------------------------------


Unnamed: 0,Cluster,Pincodes,Total_Enrolments,Avg_Enrolments,Avg_Daily_Rate,Age_0_5,Age_5_17,Age_18_Plus,Priority
0,0,45809,2243450,48.97,2.0,1881110,334177,28163,HIGH
1,1,2907,5007276,1722.49,24.47,2992392,1821092,193792,LOW
2,2,14930,2526770,169.24,3.44,1657033,837408,32329,HIGH
3,3,80,141950,1774.38,1419.0,69480,70441,2029,LOW
4,4,241,1559939,6472.78,82.71,803871,639906,116162,LOW
5,5,40559,4658300,114.85,3.03,3088591,1437721,131988,HIGH
6,6,703,452284,643.36,407.8,224121,193349,34814,LOW
7,7,42170,4430794,105.07,2.8,3447873,899029,83892,HIGH



🚨 High-Priority Clusters (Need Intervention): [np.int32(0), np.int32(2), np.int32(5), np.int32(7)]
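
The priority rule in Cell 10 compares each cluster's average enrolment with the median of the cluster averages: below half the median is `HIGH`, below the median is `MEDIUM`, otherwise `LOW`. The sketch below re-applies that rule in plain Python to the `Avg_Enrolments` values from the table above, computing the median once rather than inside the row-wise lambda. It is a check of the rule, not a replacement for the notebook's code.

```python
from statistics import median

# Avg_Enrolments per cluster, copied from the cluster table above.
avg_enrolments = {0: 48.97, 1: 1722.49, 2: 169.24, 3: 1774.38,
                  4: 6472.78, 5: 114.85, 6: 643.36, 7: 105.07}

med = median(avg_enrolments.values())  # computed once, not per row


def priority(avg):
    """Same thresholds as the Cell 10 lambda."""
    if avg < 0.5 * med:
        return "HIGH"
    if avg < med:
        return "MEDIUM"
    return "LOW"


priorities = {cluster: priority(avg) for cluster, avg in avg_enrolments.items()}
high = [cluster for cluster, p in priorities.items() if p == "HIGH"]
print(round(med, 2), high)
```

With these values no cluster lands in the `MEDIUM` band, which matches the table: every cluster is either `HIGH` or `LOW`.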


In [24]:
# ============================================
# CELL 11: Cluster Visualization
# ============================================

# Visualize clusters on map
fig_clusters = px.scatter_mapbox(
    master_pincode.sample(min(3000, len(master_pincode)), random_state=42),
    lat='latitude',
    lon='longitude',
    color='cluster',
    color_continuous_scale='viridis',
    size='total_enrolments',
    size_max=12,
    hover_name='pincode',
    hover_data=['state', 'district', 'total_enrolments', 'daily_enrolment_rate'],
    mapbox_style='carto-positron',
    zoom=4,
    center={'lat': 20.5, 'lon': 78.9}
)

fig_clusters.update_layout(
    title=dict(
        text='<b>PINCODE CLUSTERS - GEOGRAPHIC SEGMENTATION</b><br><sup>8 Regional Clusters for Targeted Intervention</sup>',
        x=0.5
    ),
    height=600,
    margin={'r': 0, 't': 80, 'l': 0, 'b': 0}
)

# Save to HTML
fig_clusters.write_html(f"{OUTPUT_DIR}/charts/03_pincode_clusters.html")
print(f"✅ Cluster map saved: {OUTPUT_DIR}/charts/03_pincode_clusters.html")

✅ Cluster map saved: ../outputs//charts/03_pincode_clusters.html


In [25]:
# ============================================
# CELL 12: Age Group Analysis (Real Data)
# ============================================

print("\n👶 AGE GROUP ENROLLMENT ANALYSIS")
print("="*60)

# Calculate age-wise totals from real data
total_age_0_5 = df_enrolment['age_0_5'].sum()
total_age_5_17 = df_enrolment['age_5_17'].sum()
total_age_18_plus = df_enrolment['age_18_greater'].sum()
total_all = total_age_0_5 + total_age_5_17 + total_age_18_plus

# Create age distribution dataframe
age_data = pd.DataFrame({
    'Age Group': ['0-5 years', '5-17 years', '18+ years'],
    'Total Enrolments': [total_age_0_5, total_age_5_17, total_age_18_plus],
    'Percentage': [
        total_age_0_5 / total_all * 100,
        total_age_5_17 / total_all * 100,
        total_age_18_plus / total_all * 100
    ]
})

print("\n📊 AGE-WISE ENROLLMENT DISTRIBUTION:")
print("-" * 60)
display(age_data)

# Visualize age distribution
fig_age = make_subplots(
    rows=1, cols=2,
    subplot_titles=('Enrolments by Age Group', 'Age Distribution (%)'),
    specs=[[{"type": "bar"}, {"type": "pie"}]]
)

# Bar chart
fig_age.add_trace(
    go.Bar(
        x=age_data['Age Group'],
        y=age_data['Total Enrolments'],
        marker_color=['#1B998B', '#F77F00', '#D62828'],
        text=[f"{e:,.0f}" for e in age_data['Total Enrolments']],
        textposition='outside'
    ),
    row=1, col=1
)

# Pie chart
fig_age.add_trace(
    go.Pie(
        labels=age_data['Age Group'],
        values=age_data['Total Enrolments'],
        marker_colors=['#1B998B', '#F77F00', '#D62828'],
        textinfo='percent+label'
    ),
    row=1, col=2
)

fig_age.update_layout(
    title=dict(
        text='<b>AGE GROUP ENROLLMENT ANALYSIS</b><br><sup>Based on Real UIDAI Enrolment Data</sup>',
        x=0.5
    ),
    height=450,
    showlegend=False,
    template='plotly_white'
)

# Save to HTML
fig_age.write_html(f"{OUTPUT_DIR}/charts/03_age_distribution.html")
print(f"✅ Age analysis chart saved: {OUTPUT_DIR}/charts/03_age_distribution.html")

print(f"\n🎯 KEY INSIGHTS:")
print(f"   • 0-5 years enrollment: {total_age_0_5:,} ({total_age_0_5/total_all*100:.1f}%)")
print(f"   • 5-17 years enrollment: {total_age_5_17:,} ({total_age_5_17/total_all*100:.1f}%)")
print(f"   • 18+ years enrollment: {total_age_18_plus:,} ({total_age_18_plus/total_all*100:.1f}%)")


👶 AGE GROUP ENROLLMENT ANALYSIS

📊 AGE-WISE ENROLLMENT DISTRIBUTION:
------------------------------------------------------------


Unnamed: 0,Age Group,Total Enrolments,Percentage
0,0-5 years,3546965,65.253117
1,5-17 years,1720384,31.649711
2,18+ years,168353,3.097171


✅ Age analysis chart saved: ../outputs//charts/03_age_distribution.html

🎯 KEY INSIGHTS:
   • 0-5 years enrollment: 3,546,965 (65.3%)
   • 5-17 years enrollment: 1,720,384 (31.6%)
   • 18+ years enrollment: 168,353 (3.1%)
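
The shares can be re-derived from the three totals printed above. The same arithmetic gives a raw enrolment total of 5,435,702, well below the 21,020,763 reported for `master_pincode` in Cell 4. That gap is what one would expect if the Cell 4 merge repeats enrolment rows, though the notebook itself does not confirm the cause.

```python
# Totals copied from the output above.
age_totals = {"0-5": 3_546_965, "5-17": 1_720_384, "18+": 168_353}

total = sum(age_totals.values())
shares = {group: round(count / total * 100, 1) for group, count in age_totals.items()}
largest = max(shares, key=shares.get)
smallest = min(shares, key=shares.get)

master_total = 21_020_763  # "Total Enrolments" reported in Cell 4
inflation = master_total / total

print(total, shares, largest, smallest, round(inflation, 2))
```

On these figures the 0-5 group is the largest share of new enrolments and 18+ the smallest.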


In [26]:
# ============================================
# CELL 13: Demographic & Biometric Update Analysis
# ============================================

print("\n📊 DEMOGRAPHIC & BIOMETRIC UPDATE ANALYSIS")
print("="*60)

# State-wise updates comparison
update_stats = master_pincode.groupby('state').agg({
    'total_demo_updates': 'sum',
    'total_bio_updates': 'sum',
    'total_enrolments': 'sum'
}).reset_index()

update_stats['update_ratio'] = (
    (update_stats['total_demo_updates'] + update_stats['total_bio_updates']) / 
    update_stats['total_enrolments']
).fillna(0)

update_stats = update_stats.sort_values('update_ratio', ascending=False)

# Create comparison visualization
fig_updates = go.Figure()

fig_updates.add_trace(go.Bar(
    name='Demographic Updates',
    x=update_stats['state'].head(15),
    y=update_stats['total_demo_updates'].head(15),
    marker_color='#3498db'
))

fig_updates.add_trace(go.Bar(
    name='Biometric Updates',
    x=update_stats['state'].head(15),
    y=update_stats['total_bio_updates'].head(15),
    marker_color='#e74c3c'
))

fig_updates.update_layout(
    title=dict(
        text='<b>DEMOGRAPHIC vs BIOMETRIC UPDATES BY STATE</b><br><sup>Top 15 States</sup>',
        x=0.5
    ),
    xaxis_title='State',
    yaxis_title='Number of Updates',
    barmode='group',
    height=500,
    template='plotly_white',
    xaxis_tickangle=-45
)

# Save to HTML
fig_updates.write_html(f"{OUTPUT_DIR}/charts/03_updates_comparison.html")
print(f"✅ Updates comparison chart saved: {OUTPUT_DIR}/charts/03_updates_comparison.html")

print("\n📊 Total Updates Summary:")
print(f"   Total Demographic Updates: {master_pincode['total_demo_updates'].sum():,.0f}")
print(f"   Total Biometric Updates: {master_pincode['total_bio_updates'].sum():,.0f}")


📊 DEMOGRAPHIC & BIOMETRIC UPDATE ANALYSIS
✅ Updates comparison chart saved: ../outputs//charts/03_updates_comparison.html

📊 Total Updates Summary:
   Total Demographic Updates: 179,049,601
   Total Biometric Updates: 240,696,861
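
`update_ratio` divides by `total_enrolments` and then calls `fillna(0)`. In pandas a non-zero value divided by zero yields `inf`, not `NaN`, so `fillna(0)` alone would leave it in place and such a state would sort to the top. Whether any state actually has zero enrolments in this data is not shown, so the sketch below uses toy values to illustrate the guard.

```python
import numpy as np
import pandas as pd

# Toy state-level totals; state "B" has no enrolments.
stats = pd.DataFrame({
    "state": ["A", "B"],
    "total_demo_updates": [10.0, 4.0],
    "total_bio_updates": [20.0, 6.0],
    "total_enrolments": [15, 0],
})

ratio = (stats["total_demo_updates"] + stats["total_bio_updates"]) / stats["total_enrolments"]
# Convert +/-inf to NaN first so that fillna(0) also covers division by zero.
stats["update_ratio"] = ratio.replace([np.inf, -np.inf], np.nan).fillna(0)
print(stats[["state", "update_ratio"]])
```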


In [27]:
# ============================================
# CELL 14: Key Insights Summary & Export
# ============================================

print("\n" + "="*70)
print("🎯 GEOGRAPHIC ANALYSIS - KEY INSIGHTS SUMMARY")
print("="*70)

insights = {
    '1. Total Pincodes Analyzed': f"{len(master_pincode):,}",
    '2. States/UTs Covered': f"{master_pincode['state'].nunique()}",
    '3. Total Enrolments': f"{master_pincode['total_enrolments'].sum():,}",
    '4. Total Demographic Updates': f"{master_pincode['total_demo_updates'].sum():,.0f}",
    '5. Total Biometric Updates': f"{master_pincode['total_bio_updates'].sum():,.0f}",
    '6. Critical Low-Activity Zones': f"{len(critical_pincodes):,} pincodes",
    '7. Top State by Enrolments': f"{state_stats.iloc[0]['state']} ({state_stats.iloc[0]['total_enrolments']:,.0f})",
    '8. Geographic Clusters': f"{n_clusters} regional segments"
}

for key, value in insights.items():
    print(f"\n   {key}:")
    print(f"   📌 {value}")

print("\n" + "-"*70)
print("\n💡 ACTIONABLE RECOMMENDATIONS:")
print("-"*70)
recommendations = [
    f"1. Deploy mobile enrollment vans to {len(top_deployment)} high-priority pincodes",
    "2. Focus on 0-5 age group enrollment (largest share of new enrolments)",
    "3. Target states with high update ratio but low enrollments",
    "4. Establish permanent centers in high-density clusters",
    "5. Launch awareness campaigns in bottom 25% activity zones"
]

for rec in recommendations:
    print(f"   • {rec}")

# Export results
print("\n" + "="*70)
print("💾 EXPORTING RESULTS...")
print("="*70)

top_deployment.to_csv(f"{OUTPUT_DIR}/priority_deployment_pincodes.csv", index=False)
state_stats.to_csv(f"{OUTPUT_DIR}/state_enrollment_stats.csv", index=False)
master_pincode.to_csv(f"{OUTPUT_DIR}/master_pincode_analysis.csv", index=False)
cluster_stats.to_csv(f"{OUTPUT_DIR}/cluster_analysis.csv", index=False)

print(f"\n✅ Results exported:")
print(f"   • priority_deployment_pincodes.csv ({len(top_deployment)} rows)")
print(f"   • state_enrollment_stats.csv ({len(state_stats)} rows)")
print(f"   • master_pincode_analysis.csv ({len(master_pincode)} rows)")
print(f"   • cluster_analysis.csv ({len(cluster_stats)} rows)")
print(f"   • Charts saved to outputs/charts/")

print("\n" + "="*70)
print("✅ NOTEBOOK 03 COMPLETE!")
print("="*70)


🎯 GEOGRAPHIC ANALYSIS - KEY INSIGHTS SUMMARY

   1. Total Pincodes Analyzed:
   📌 147,399

   2. States/UTs Covered:
   📌 55

   3. Total Enrolments:
   📌 21,020,763

   4. Total Demographic Updates:
   📌 179,049,601

   5. Total Biometric Updates:
   📌 240,696,861

   6. Critical Low-Activity Zones:
   📌 38,884 pincodes

   7. Top State by Enrolments:
   📌 Uttar Pradesh (2,642,461)

   8. Geographic Clusters:
   📌 8 regional segments

----------------------------------------------------------------------

💡 ACTIONABLE RECOMMENDATIONS:
----------------------------------------------------------------------
   • 1. Deploy mobile enrollment vans to 50 high-priority pincodes
   • 2. Focus on 0-5 age group enrollment (largest share of new enrolments)
   • 3. Target states with high update ratio but low enrollments
   • 4. Establish permanent centers in high-density clusters
   • 5. Launch awareness campaigns in bottom 25% activity zones

💾 EXPORTING RESULTS...
