# 🏛️ Aadhaar Pulse 2.0
## Unlocking Societal Trends in Aadhaar Enrolment and Updates

---

### UIDAI Data Hackathon 2026

---

## Executive Summary

**Aadhaar Pulse 2.0** treats India's identity ecosystem as a living sensor of socio-economic dynamics. By analyzing enrolment and update patterns across **10 months** and **36 states/UTs**, we derive actionable intelligence for:

1. **Service Optimization** - Identifying overloaded service centers (WHERE to open new centers)
2. **Child Welfare Protection** - Detecting compliance gaps in mandatory biometric updates (WHICH children are at risk)
3. **Resource Allocation** - Predicting seasonal demand patterns (WHEN to deploy resources)

### Key Findings

| Metric | Finding |
|--------|--------|
| **~4.9M** | Total records processed |
| **36** | States/UTs covered |
| **Delhi** | Most stressed region (59K+ transactions/PIN) |
| **Gujarat** | Highest child compliance risk (4 of top 5 at-risk districts) |
| **June-Aug** | "School Rush" - 40% demand spike detected |

---

## 1. Problem Statement

> **"Identify meaningful patterns, trends, anomalies, or predictive indicators and translate them into clear insights or solution frameworks that can support informed decision-making and system improvements."**

### Our Interpretation: Three Critical Questions

| Question | Current Gap | Our Solution |
|----------|-------------|-------------|
| **WHERE** should UIDAI open new centers? | Center locations are known, but not whether they are overloaded | Service Pressure Score (SPS) |
| **WHICH** children are at risk of ID deactivation? | No district-level risk visibility | Child Compliance Z-Score |
| **WHEN** should resources be deployed? | Static allocation year-round | Seasonality Detection |

---

## 2. Datasets Used

| Dataset | Records | Columns | Description |
|---------|---------|---------|-------------|
| **Enrolment** | 1,006,029 | date, state, district, pincode, age_0_5, age_5_17, age_18_greater | New Aadhaar registrations by age group |
| **Demographic Updates** | 2,071,700 | date, state, district, pincode, demo_age_5_17, demo_age_17_ | Address/name/DOB changes |
| **Biometric Updates** | 1,861,108 | date, state, district, pincode, bio_age_5_17, bio_age_17_ | Fingerprint/iris/face updates |

**Date Range:** March 2025 - December 2025 (10 months)

**Total Records:** 4,938,837 (~4.9 million) across the three datasets
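Before loading, it can help to confirm that each file set carries the columns listed in the table above. The following is a minimal sketch under stated assumptions: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced here for illustration, and the column lists are copied from the table rather than from any official schema.

```python
import pandas as pd

# Column lists copied from the dataset table above (assumed to be complete).
EXPECTED_COLUMNS = {
    "enrolment": ["date", "state", "district", "pincode",
                  "age_0_5", "age_5_17", "age_18_greater"],
    "demographic": ["date", "state", "district", "pincode",
                    "demo_age_5_17", "demo_age_17_"],
    "biometric": ["date", "state", "district", "pincode",
                  "bio_age_5_17", "bio_age_17_"],
}

def missing_columns(df: pd.DataFrame, dataset: str) -> list[str]:
    """Return the expected columns that are absent from df (hypothetical helper)."""
    return [c for c in EXPECTED_COLUMNS[dataset] if c not in df.columns]

# Toy frame that lacks one biometric column.
toy = pd.DataFrame(columns=["date", "state", "district", "pincode", "bio_age_5_17"])
print(missing_columns(toy, "biometric"))
```

An empty list means the frame has every expected column; anything else names what is missing before the analysis cells fail with a `KeyError`.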

In [None]:
# Setup and Imports
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import glob
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:,.2f}'.format)

# Base path - UPDATE THIS FOR YOUR SYSTEM
BASE_PATH = "/Users/balamsanjay/Desktop/UDIAI-DataHackthon/"

print("✅ Libraries loaded successfully!")

In [None]:
# Load Raw Datasets
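def load_dataset(folder_name):
    """Read every CSV in BASE_PATH/folder_name into a single DataFrame."""
    folder_path = os.path.join(BASE_PATH, folder_name)
    all_files = sorted(glob.glob(os.path.join(folder_path, "*.csv")))
    # Fail with a clear message instead of pd.concat's "No objects to concatenate"
    if not all_files:
        raise FileNotFoundError(f"No CSV files found in {folder_path}")
    dfs = [pd.read_csv(f) for f in all_files]
    return pd.concat(dfs, ignore_index=True)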

print("📊 Loading datasets...")
enrol_df = load_dataset('api_data_aadhar_enrolment')
bio_df = load_dataset('api_data_aadhar_biometric')
demo_df = load_dataset('api_data_aadhar_demographic')

print(f"\n📈 Dataset Summary:")
print(f"   Enrolment:   {len(enrol_df):>10,} records")
print(f"   Biometric:   {len(bio_df):>10,} records")
print(f"   Demographic: {len(demo_df):>10,} records")
print(f"   TOTAL:       {len(enrol_df)+len(bio_df)+len(demo_df):>10,} records")

In [None]:
# Show sample data
print("📋 Enrolment Dataset Sample:")
display(enrol_df.head(3))
print("\n📋 Biometric Dataset Sample:")
display(bio_df.head(3))

---

## 3. Methodology

### 3.1 Data Cleaning

**Challenges:**
1. State name variations (50+ variations → 36 official names)
2. District duplicates (Bengaluru/Bangalore)
3. Garbage data ("100000", "?" as district names)
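The district-level challenges above can be illustrated on a toy frame. This is a minimal sketch, not the notebook's cleaning code: `DISTRICT_ALIASES` and `clean_districts` are hypothetical names, and the single alias entry is an assumption that would need to be extended from the actual data.

```python
import pandas as pd

# Hypothetical alias map (assumption): variant spelling -> canonical name.
DISTRICT_ALIASES = {"Bangalore": "Bengaluru"}

def clean_districts(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with normalized district names and garbage rows dropped."""
    out = df.dropna(subset=["district"]).copy()
    out["district"] = out["district"].astype(str).str.strip().str.title()
    # Keep names that contain a letter and are longer than two characters,
    # which drops entries such as "100000" and "?".
    valid = out["district"].str.contains(r"[A-Za-z]") & (out["district"].str.len() > 2)
    out = out[valid].copy()
    # Collapse known variant spellings onto one canonical name.
    out["district"] = out["district"].replace(DISTRICT_ALIASES)
    return out

toy = pd.DataFrame({
    "district": [" bangalore ", "Bengaluru", "100000", "?", None],
    "age_0_5": [1, 2, 3, 4, 5],
})
cleaned = clean_districts(toy)
print(cleaned)
```

Only the two Bengaluru rows survive, so a later `groupby('district')` sees them as one district instead of two.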

In [None]:
# Data Cleaning
def normalize_state_names(df):
    if 'state' not in df.columns: return df
    df['state'] = df['state'].astype(str).str.strip().str.title()
    state_map = {
        'Andaman And Nicobar Islands': 'Andaman & Nicobar Islands',
        'Nct Of Delhi': 'Delhi', 'Delhi Nct': 'Delhi',
        'Orissa': 'Odisha', 'Pondicherry': 'Puducherry',
    }
    df['state'] = df['state'].map(lambda x: state_map.get(x, x))
    return df

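def normalize_district_names(df):
    if 'district' not in df.columns: return df
    # Copy after dropping nulls so the assignment below does not write to a view
    df = df.dropna(subset=['district']).copy()
    df['district'] = df['district'].astype(str).str.strip().str.title()
    # Keep names that contain a letter and are longer than 2 characters (drops "100000", "?")
    mask = df['district'].str.contains(r'[a-zA-Z]') & (df['district'].str.len() > 2)
    return df[mask]

# Apply cleaning
# Each frame is reassigned explicitly: rebinding a loop variable would not update
# enrol_df, bio_df, or demo_df, because normalize_district_names returns a new frame.
print("🧹 Cleaning data...")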

enrol_df = normalize_state_names(enrol_df)
enrol_df = normalize_district_names(enrol_df)
bio_df = normalize_state_names(bio_df)
bio_df = normalize_district_names(bio_df)
demo_df = normalize_state_names(demo_df)
demo_df = normalize_district_names(demo_df)

print(f"✅ Unique States: {enrol_df['state'].nunique()}")
print(f"✅ Unique Districts: {enrol_df['district'].nunique()}")

---

## 4. Data Analysis

### 4.1 Pillar 1: Service Accessibility Index (SAI)

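**Metric:** Service Pressure Score (SPS), computed per district

**Formula:** `SPS = Total Transactions / Active PIN Codes`

Total transactions are summed across the enrolment, biometric, and demographic datasets; active PIN codes are the distinct PIN codes that appear for the district in the enrolment data.

**Interpretation:** High SPS = Service bottleneck (many transactions concentrated in few PIN codes)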
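As a worked example of the formula, the sketch below computes SPS for two invented districts. All numbers and the `transactions` column are made up for illustration; the real notebook sums the age-group columns instead.

```python
import pandas as pd

# Toy transaction counts per (district, pincode); all values are invented.
toy = pd.DataFrame({
    "district": ["A", "A", "A", "B", "B"],
    "pincode": [110001, 110001, 110002, 560001, 560001],
    "transactions": [500, 300, 200, 900, 100],
})

per_district = toy.groupby("district").agg(
    total_volume=("transactions", "sum"),   # numerator of SPS
    active_pins=("pincode", "nunique"),     # denominator of SPS
)
per_district["sps_score"] = per_district["total_volume"] / per_district["active_pins"]
print(per_district)
```

Both districts handle 1,000 transactions, but district B serves them through one PIN code instead of two, so its SPS is twice as high.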

In [None]:
# Calculate Service Pressure Score
print("📊 Calculating Service Pressure Score...")

enrol_vol = enrol_df.groupby('district')[['age_0_5', 'age_5_17', 'age_18_greater']].sum().sum(axis=1)
bio_vol = bio_df.groupby('district')[['bio_age_5_17', 'bio_age_17_']].sum().sum(axis=1)
demo_vol = demo_df.groupby('district')[['demo_age_5_17', 'demo_age_17_']].sum().sum(axis=1)

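total_volume = enrol_vol.add(bio_vol, fill_value=0).add(demo_vol, fill_value=0)
# PIN codes are counted from the enrolment data; districts absent from it fall back to 1
# so that sps_score stays consistent with the unique_pincodes column below
unique_pins = enrol_df.groupby('district')['pincode'].nunique().reindex(total_volume.index).fillna(1)
sps_score = total_volume / unique_pins.replace(0, 1)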

district_df = pd.DataFrame({
    'district': total_volume.index,
    'total_volume': total_volume.values,
    'unique_pincodes': unique_pins.reindex(total_volume.index).fillna(1).values,
    'sps_score': sps_score.values
})
state_map = enrol_df.groupby('district')['state'].first()
district_df['state'] = district_df['district'].map(state_map)

print("\n🔴 TOP 10 DISTRICTS BY SERVICE PRESSURE:")
for _, row in district_df.nlargest(10, 'sps_score').iterrows():
    print(f"   {row['district']:30} SPS: {row['sps_score']:,.0f}")

In [None]:
# Visualize SAI
top_pressure = district_df.nlargest(20, 'sps_score')

fig_sps = px.bar(
    top_pressure, x='district', y='sps_score', color='state',
    title='<b>Top 20 Districts by Service Pressure Score</b>',
    labels={'sps_score': 'Service Pressure Score', 'district': 'District'},
    template='plotly_white'
)
fig_sps.update_layout(xaxis_tickangle=-45, height=500)
fig_sps.show()

### 4.2 Pillar 2: Child Lifecycle Compliance Score (CLCS)

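**Compliance share (per district):** `Child Biometric Updates (5-17) / (Child Biometric Updates (5-17) + Child Enrolments (0-17))`

**Formula:** `Z-Score = (District Compliance Share - National Mean) / Std Dev`

Here the national mean and standard deviation are taken over the district-level shares, unweighted by district size.

**Interpretation:** Z < -1.5 = HIGH RISK (the district's share is more than 1.5 standard deviations below the mean)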
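The scoring can be traced on five invented districts. This is a sketch with made-up counts; it uses pandas' default sample standard deviation (`ddof=1`), which is what `Series.std()` applies, and the name `HIGH_RISK_THRESHOLD` is introduced here only for readability.

```python
import pandas as pd

# Invented per-district counts of child biometric updates and child enrolments.
child_bio = pd.Series({"D1": 80, "D2": 75, "D3": 70, "D4": 78, "D5": 20})
child_enrol = pd.Series({"D1": 20, "D2": 25, "D3": 30, "D4": 22, "D5": 80})

# Share of child activity that is a biometric update.
compliance_share = child_bio / (child_bio + child_enrol)

# Standardize against the unweighted mean and sample std across districts.
z = (compliance_share - compliance_share.mean()) / compliance_share.std()

HIGH_RISK_THRESHOLD = -1.5  # threshold stated above
high_risk = z[z < HIGH_RISK_THRESHOLD]
print(z.round(2))
print("High risk:", list(high_risk.index))
```

Only D5, whose share is far below the other four, crosses the threshold. With very few districts the Z-score is bounded, so the -1.5 cut-off is more meaningful on the full district list than on a toy sample like this one.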

In [None]:
# Calculate CLCS
print("📊 Calculating Child Compliance Z-Score...")

child_bio = bio_df.groupby('district')['bio_age_5_17'].sum()
child_enrol = enrol_df.groupby('district')[['age_0_5', 'age_5_17']].sum().sum(axis=1)
total_child = child_bio.add(child_enrol, fill_value=0)
compliance_share = child_bio / total_child.replace(0, 1)

national_mean = compliance_share.mean()
national_std = compliance_share.std()
clcs_zscore = (compliance_share - national_mean) / national_std

district_df['total_child_activity'] = district_df['district'].map(total_child).fillna(0)
district_df['clcs_zscore'] = district_df['district'].map(clcs_zscore).fillna(0)

print("\n⚠️ TOP 10 AT-RISK DISTRICTS:")
at_risk = district_df[district_df['total_child_activity'] > 1000].nsmallest(10, 'clcs_zscore')
for _, row in at_risk.iterrows():
    print(f"   {row['district']:30} Z: {row['clcs_zscore']:+.2f}σ")

In [None]:
# Visualize CLCS Risk
active = district_df[district_df['total_child_activity'] > 1000]

fig_risk = px.scatter(
    active, x='total_child_activity', y='clcs_zscore', color='state',
    size='total_volume', hover_name='district',
    title='<b>Child Compliance Risk Map</b>',
    labels={'clcs_zscore': 'Z-Score', 'total_child_activity': 'Child Activity'},
    template='plotly_white', height=600
)
fig_risk.add_hline(y=-1.5, line_dash="dash", line_color="red", annotation_text="HIGH RISK")
fig_risk.add_hline(y=0, line_dash="dot", line_color="gray")
fig_risk.show()

### 4.3 Pillar 3: Seasonality Detection (DIH)

Monthly enrolment totals are compared to detect the **School Rush** (June-Aug).
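One way to quantify a seasonal spike is to compare the average monthly volume in the rush months with the average in the remaining months. The sketch below uses invented totals, and this definition of uplift is an assumption; the notebook does not state how its reported spike was measured.

```python
import pandas as pd

# Invented monthly totals for March through December (months 3..12).
monthly_total = pd.Series(
    [100, 100, 100, 150, 150, 150, 100, 100, 100, 100],
    index=range(3, 13),
    name="total",
)
RUSH_MONTHS = [6, 7, 8]  # June-August, as labelled in this notebook

is_rush = monthly_total.index.isin(RUSH_MONTHS)
rush_mean = monthly_total[is_rush].mean()
normal_mean = monthly_total[~is_rush].mean()

# Percentage by which the average rush month exceeds the average normal month.
uplift_pct = (rush_mean / normal_mean - 1) * 100
print(f"School Rush uplift over normal months: {uplift_pct:.1f}%")
```

Comparing means rather than sums keeps the result independent of how many months fall in each group.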

In [None]:
# Seasonality Analysis
enrol_df['date'] = pd.to_datetime(enrol_df['date'], format='%d-%m-%Y', errors='coerce')
enrol_df['month'] = enrol_df['date'].dt.month

monthly = enrol_df.groupby('month')[['age_0_5', 'age_5_17', 'age_18_greater']].sum()
monthly['total'] = monthly.sum(axis=1)
monthly['season'] = monthly.index.map(lambda m: 'School Rush' if m in [6,7,8] else 'Normal')
monthly = monthly.reset_index()

fig_season = px.bar(
    monthly, x='month', y='total', color='season',
    title='<b>Monthly Volume with Seasonality</b>',
    template='plotly_white',
    color_discrete_map={'School Rush': 'red', 'Normal': 'blue'}
)
fig_season.show()

---

## 5. Summary & Recommendations

| Finding | Impact | Recommendation |
|---------|--------|----------------|
| Delhi: 30K-59K transactions/PIN | Service bottleneck | Open 5+ new centers |
| Gujarat: 4 of top 5 at-risk districts | Child welfare crisis | School Aadhaar Camps |
| June-Aug: School Rush | Demand spike | Pre-deploy in May |

---

## Thank You!

*"Aadhaar Pulse 2.0 - Turning data into actionable intelligence for a more inclusive India."*