🔍 Aadhaar Data Insights Analysis

Unlocking India's Digital Identity Patterns Through Data Science
A comprehensive analysis of 260+ million Aadhaar enrollment and update transactions using machine learning and predictive analytics to optimize resource allocation and enhance citizen experience.

Report:(https://aadhaar-data-hackathon.vercel.app)

🎯 Repository Description

Short Description:
Machine learning-powered analysis of Aadhaar enrollment patterns across 39 states, revealing strategic insights through demographic clustering, temporal forecasting, and geographic segmentation.

Detailed Description:
This project analyzes 260+ million Aadhaar transactions (enrollments, biometric updates, and demographic changes) from March-December 2025 to derive actionable insights for India's digital identity infrastructure. Using K-Means clustering, we discover distinct "enrollment archetypes" across states, while predictive models forecast demand patterns. The analysis spans 7 dimensions—from seasonal trends to age-cohort behaviors—delivering data-driven recommendations for policymakers and system administrators.

Tags: aadhaar data-analysis machine-learning india uidai kmeans-clustering predictive-analytics jupyter-notebook plotly data-visualization

✨ Project Highlights

🔬 260M+ Transactions Analyzed — Comprehensive analysis of enrollment, biometric, and demographic updates
🎯 K-Means Clustering — Discovered 3 distinct enrollment archetypes across 39 states
📈 Predictive Modeling — 85-90% accuracy in forecasting daily enrollment demand
🗺️ Geographic Segmentation — Zone-based analysis revealing regional patterns
📊 13 Interactive Visualizations — Plotly-powered HTML dashboards for insights
⚡ End-to-End Pipeline — From raw data cleaning to actionable recommendations

📋 Problem Statement and Approach

Problem Statement

Analyze Aadhaar enrollment and update data to extract actionable insights that can optimize resource allocation, improve service delivery, and enhance citizen experience across India's digital identity infrastructure.

Analytical Approach

Our multi-dimensional analysis framework addresses three core objectives:

Temporal Analysis: Identify enrollment patterns, seasonal trends, and growth dynamics to forecast future demand
Geographic Segmentation: Profile state-level system maturity and classify regions based on enrollment behavior
Predictive Modeling: Build machine learning models to enable data-driven resource planning

Key Innovation: We apply unsupervised learning (K-Means clustering) to discover natural groupings of states based on child-to-adult enrollment ratios, revealing distinct "enrollment archetypes" that inform tailored policy interventions.

📊 Datasets Used

Primary Datasets (UIDAI Provided)

1. Enrolment Dataset (`enrolment_master.csv`)

Captures new Aadhaar registrations across India.

Column	Description	Data Type
`date`	Transaction date (DD-MM-YYYY)	String → DateTime
`state`	State/Union Territory name	String
`district`	District name	String
`pincode`	6-digit area code	String
`age_0_5`	New enrollments age 0-5 years	String → Integer
`age_5_17`	New enrollments age 5-17 years	String → Integer
`age_18_greater`	New enrollments age 18+ years	String → Integer

Total Records: ~5.4 million new enrollments
Time Period: March - December 2025 (10 months)

2. Biometric Update Dataset (`biometric_master.csv`)

Tracks fingerprint/iris re-capture activities (mandatory 10-year renewals).

Column	Description	Data Type
`date`	Update date (DD-MM-YYYY)	String → DateTime
`state`	State/Union Territory	String
`district`	District name	String
`pincode`	Area code	String
`bio_age_5_17`	Biometric updates age 5-17	String → Integer
`bio_age_17_`	Biometric updates age 18+	String → Integer

Total Records: ~175 million biometric updates

3. Demographic Update Dataset (`demographic_master.csv`)

Captures address, phone, and other non-biometric changes (often self-service).

Column	Description	Data Type
`date`	Update date	String → DateTime
`state`	State/Union Territory	String
`district`	District name	String
`pincode`	Area code	String
`demo_age_5_17`	Demographic updates age 5-17	String → Integer
`demo_age_17_`	Demographic updates age 18+	String → Integer

Total Records: ~80 million demographic updates

Dataset Characteristics

Geographic Coverage: 39 states and union territories
Temporal Granularity: Daily transaction-level data
Age Segmentation: Three cohorts (0-5, 5-17, 18+)
Spatial Granularity: State → District → Pincode hierarchy

🔧 Methodology

1. Data Cleaning and Preprocessing

Date Standardization

df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y', errors='coerce')

Converted DD-MM-YYYY strings to datetime objects
Handled malformed dates with coercion to NaT

Type Casting and Validation

for col in ['age_0_5', 'age_5_17', 'age_18_greater']:
    df[col] = pd.to_numeric(df[col], errors='coerce').fillna(0).astype(int)

Numeric columns parsed from string format
Invalid values coerced to NaN, then filled with 0
Ensured integer data types for aggregation

Geographic Normalization

df['state_clean'] = df['state'].str.strip().str.title()

Removed leading/trailing whitespace
Standardized capitalization for consistent grouping

2. Feature Engineering

Temporal Features

Monthly Aggregation: df['month'] = df['date'].dt.to_period('M').dt.to_timestamp()
Quarterly Aggregation: df['quarter'] = df['date'].dt.to_period('Q').dt.to_timestamp()
Day of Week: df['dow'] = df['date'].dt.dayofweek (for forecasting models)

Derived Metrics

Total Enrollments: age_0_5 + age_5_17 + age_18_greater
Maintenance Activity: biometric_updates + demographic_updates
Maintenance/Growth Ratio: Maintenance ÷ Growth (system maturity indicator)
Child/Adult Ratio: (age_0_5 + age_5_17) ÷ age_18_greater (archetype classification)

Time-Series Features (for Predictions)

Lag Features: lag_1 (yesterday), lag_7 (same day last week)
Rolling Statistics: 7-day moving average
Seasonality: One-hot encoded day-of-week

Geographic Zoning

Mapped 39 states/UTs into 6 geographic zones for regional analysis:

North: J&K, HP, Punjab, Uttarakhand, Haryana, Delhi, Rajasthan, Chandigarh, Ladakh
South: AP, Karnataka, Kerala, Tamil Nadu, Telangana, Puducherry, A&N, Lakshadweep
East: West Bengal, Bihar, Sikkim, Odisha, Jharkhand
West: Goa, Gujarat, Maharashtra, Dadra & Nagar Haveli, Daman & Diu
Central: Madhya Pradesh, Chhattisgarh, Uttar Pradesh
Northeast: Assam, Arunachal Pradesh, Manipur, Meghalaya, Mizoram, Nagaland, Tripura

3. Analytical Methods

3.1 Descriptive Statistics

Monthly/quarterly aggregation using groupby()
Percentage change calculations for growth rates
Top/bottom state identification via ranking

3.2 Unsupervised Learning (K-Means Clustering)

from sklearn.cluster import KMeans
X = df[['ratio']].values
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
df['cluster'] = kmeans.fit_predict(X)

Purpose: Classify states into enrollment archetypes
Features: Child-to-Adult enrollment ratio
Algorithm: K-Means (k=3)
Output: Adult-Heavy, Balanced, Child-Heavy categories

3.3 Predictive Modeling (Linear Regression)

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)

Target: Next-day enrollment volume
Features: Lag variables, rolling averages, day-of-week
Performance: R² ≈ 0.85-0.90

4. Data Transformation Pipeline

Stage 1: Load → Read CSVs with dtype=str for safety
Stage 2: Clean → Date parsing, type casting, missing value handling
Stage 3: Aggregate → Group by time/geography, compute totals
Stage 4: Engineer → Create derived features and metrics
Stage 5: Analyze → Apply statistical/ML methods
Stage 6: Visualize → Generate interactive Plotly charts

📈 Data Analysis and Visualisation

Analysis Framework: 7 Dimensions

1. Monthly Enrollment Trends

Key Findings:

Peak Month: September 2025 (4.43M enrollments - 39% of Q3)
Growth Pattern: Exponential acceleration Mar→Sep, followed by Q4 stabilization
Seasonal Driver: School enrollment season + festival preparations

Visualization: enrolment_trends_monthly.html

Interactive line chart with markers
Monthly totals and trend line
Hover tooltips for precise values

2. Maintenance vs Growth Ratio (System Maturity)

Key Findings:

National Average Ratio: ~25-30 (mature system overall)
Most Saturated: Daman & Diu (124.3), A&N Islands (58.5), Chandigarh (49.5)
Highest Growth: Meghalaya (1.2), Assam (6.4), Nagaland (7.7)
Pattern: UTs and Southern states saturated; Northeastern states expanding

Visualization: maintenance_growth_ratio.csv

State-level ratios with components (maintenance, growth)
Ranked by maturity indicator

Interpretation:

Ratio > 40 → Focus on update efficiency
Ratio < 5 → Prioritize new enrollment infrastructure

3. Top & Bottom States Deep Dive

Key Findings:

Statistical Summary: Mean ratio 25-30, high variance (σ large)
Size Effect: Smaller states/UTs reach saturation faster
Geographic Influence: Island/remote areas show distinct patterns

Visualization: Console output with formatted tables

4. Month-over-Month & Quarter-over-Quarter Trends

Key Findings:

Highest MoM Growth: April (+1,452%), July (+186%), September (+139%)
Regional Leaders: Central Zone drives national trends (UP, MP dominant)
Zonal Patterns: Northeast highest growth rates; South most stable

Visualizations (4 files):

monthly_enrolment_mom_national.html - National MoM %
monthly_enrolment_mom_by_zone.html - Zone MoM comparison
quarterly_enrolment_qoq_national.html - National QoQ %
quarterly_enrolment_qoq_by_zone.html - Zone QoQ comparison

5. Enrollment Archetypes (Machine Learning)

Method: K-Means clustering (k=3) on child/adult enrollment ratio

Key Findings:

Adult-Heavy (31 states, avg ratio 29.5): Kerala, Gujarat, Karnataka
- Mature systems, focus on biometric renewals
- Policy: Digital-first demographic updates
Balanced (8 states, avg ratio 85.8): Andhra Pradesh, Haryana, Jharkhand
- Demographic transition states
- Policy: Hybrid service models
Child-Heavy (4 states, avg ratio 180.5): Tamil Nadu, Odisha, Lakshadweep
- Young populations, school-based drives
- Policy: School partnerships, birth certificate integration

Visualizations (5 files):

state_child_adult_ratio_clusters.html - Bar chart by state
archetype_scatter_child_vs_adult.html - Scatter plot with reference line
archetype_ratio_distribution.html - Box plots by archetype
archetype_top_bottom_states.html - Top/Bottom 10 comparison
archetype_summary_dashboard.html - 4-panel overview

6. Update Intensity per Age Cohort

Key Findings:

Children (5-17):
- 91.5% biometric, 8.5% demographic (10.8:1 ratio)
- Physical center-dependent, compliance-driven
Adults (18+):
- 55.4% biometric, 44.6% demographic (1.24:1 ratio)
- Self-service preference, digital literacy evident

Strategic Insight: Dual-track service model needed - physical infrastructure for children, digital-first for adults

7. Predictive Analytics

7.1 Daily Enrollment Forecasting

Model: Linear Regression with time-series features
Performance: R² ≈ 0.85-0.90
Features: lag_1, lag_7, rolling_mean_7, day_of_week
Business Value: Optimize staffing, server capacity, appointment slots

Operational Impact:

Proactive resource allocation
Prevent queue build-up
Reduce operational costs

Visualization Suite Summary

Total Output Files: 13

Category	Files
Temporal Trends	5 HTML (monthly trend, 4 MoM/QoQ)
Maturity Analysis	1 CSV (maintenance ratios)
Archetypes	5 HTML + 1 CSV (clustering visualizations)
Summary	1 MD (consolidated insights)

Technology Stack:

Visualization: Plotly (interactive HTML charts)
Analysis: Pandas, NumPy, scikit-learn
Documentation: Jupyter Notebook + Python scripts

🎯 Key Insights Summary

Strategic Recommendations

Short-Term (0-6 months)

Document September success factors for replication
Investigate August data gap (quality assurance)
Pilot mobile biometric units in top 5 saturated states

Medium-Term (6-18 months)

Deploy zone-specific strategies leveraging regional patterns
Scale school-based enrollment in child-heavy states
Implement enrollment forecasting system nationwide

Long-Term (18+ months)

Develop proactive lifecycle management system
Digital-first update platform for adult-heavy states
Equity-focused acceleration in growth states

Impact Delivered

✅ Data-driven policy framework for resource allocation
✅ Predictive capabilities for demand forecasting
✅ Geographic segmentation for targeted interventions
✅ Behavioral insights for service design optimization

📁 Project Structure

DataHackathon/
│
├── 📊 Data Processing Pipeline
│   ├── clean_datasets.py              # Initial data cleaning & validation
│   ├── impute_missing_data.py         # Missing value handling & imputation
│   ├── aggregate_and_engineer.py      # Feature engineering & aggregations
│   └── process_datasets.py            # Master workflow orchestrator
│
├── 🔬 Analysis Scripts
│   ├── insights.ipynb                 # Main analysis notebook (7 sections)
│   ├── run_insights.py                # Standalone Python script version
│   └── main.ipynb                     # Exploratory analysis notebook
│
├── 📈 Visualization & Reporting
│   ├── capture_visualizations.py      # Screenshot automation for reports
│   ├── report.tex                     # LaTeX report template
│   └── report.html                    # Rendered HTML report
│
├── 📂 Data & Outputs
│   ├── Datasets/                      # Raw data (gitignored)
│   │   ├── enrolment_master.csv
│   │   ├── biometric_master.csv
│   │   └── demographic_master.csv
│   │
│   └── output/
│       ├── master/                    # Cleaned master datasets
│       └── insights/                  # Analysis outputs (13 files)
│           ├── *.html                 # Interactive Plotly visualizations
│           ├── *.csv                  # Processed data tables
│           └── summary_report.md      # Consolidated insights
│
├── 📄 Documentation
│   ├── README.md                      # This file
│   ├── PDF_GENERATION_GUIDE.md        # LaTeX to PDF instructions
│   └── ADD_VISUALIZATIONS_GUIDE.md    # How to add new analyses
│
└── ⚙️ Configuration
    └── .gitignore                     # Version control configuration

Key Files Description

File	Purpose	Output
`insights.ipynb`	Complete 7-dimensional analysis	13 visualization files
`run_insights.py`	Non-interactive version of analysis	Same as notebook
`aggregate_and_engineer.py`	Feature engineering pipeline	Enriched datasets
`report.tex`	Professional LaTeX report	PDF documentation

🚀 Quick Start

Prerequisites

Ensure you have Python 3.8+ installed, then install dependencies:

pip install pandas numpy plotly scikit-learn jupyter

Option 1: Run Complete Analysis (Recommended)

Using Jupyter Notebook:

jupyter notebook insights.ipynb

Run cells sequentially. All outputs save to output/insights/.

Using Python Script:

python run_insights.py

Generates all 13 output files automatically in output/insights/.

Option 2: Run Data Pipeline Only

Process raw datasets through the complete cleaning and feature engineering pipeline:

# Step 1: Clean raw data
python clean_datasets.py

# Step 2: Handle missing values
python impute_missing_data.py

# Step 3: Aggregate and engineer features
python aggregate_and_engineer.py

Option 3: Generate Report

Create a professional PDF report from the LaTeX template (requires LaTeX installation):

# See detailed instructions in:
cat PDF_GENERATION_GUIDE.md

Expected Outputs

After running the analysis, you'll find:

13 HTML files in output/insights/ (interactive visualizations)
2 CSV files with cluster analysis results
1 markdown summary with key findings

Troubleshooting

Issue: Missing datasets
Solution: Ensure Datasets/ folder contains the three master CSV files from UIDAI

Issue: Import errors
Solution: Verify all dependencies are installed: pip list | grep -E "pandas|numpy|plotly|scikit" (or findstr on Windows)

👥 Team & Acknowledgments

Team Members

Glen Elric Fernandes - Data Science & Machine Learning
Reoney Iral Madtha - Data Engineering & Visualization

Hackathon Details

Detail	Value
Event	UIDAI Data Hackathon 2026
Dataset Provider	Unique Identification Authority of India (UIDAI)
Analysis Period	March - December 2025 (10 months)
Submission Date	January 2026
Data Volume	260+ million transactions

Acknowledgments

We thank UIDAI for providing this invaluable dataset and organizing the hackathon to drive data-driven insights for India's digital identity infrastructure. This analysis aims to contribute meaningful recommendations for improving Aadhaar service delivery nationwide.

📄 License

This analysis is submitted as part of the UIDAI Data Hackathon 2026. All datasets remain property of UIDAI. Analysis code and insights are provided for evaluation purposes.

For questions or further details, please refer to the comprehensive analysis in insights.ipynb.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
images		images
output/insights		output/insights
.gitignore		.gitignore
README.md		README.md
aggregate_and_engineer.py		aggregate_and_engineer.py
clean_datasets.py		clean_datasets.py
impute_missing_data.py		impute_missing_data.py
index.html		index.html
insights.ipynb		insights.ipynb
main.ipynb		main.ipynb
process_datasets.py		process_datasets.py
report.tex		report.tex
run_insights.py		run_insights.py

Folders and files

Latest commit

History

Repository files navigation

🔍 Aadhaar Data Insights Analysis

🎯 Repository Description

✨ Project Highlights

📋 Problem Statement and Approach

Problem Statement

Analytical Approach

📊 Datasets Used

Primary Datasets (UIDAI Provided)

1. Enrolment Dataset (enrolment_master.csv)

2. Biometric Update Dataset (biometric_master.csv)

3. Demographic Update Dataset (demographic_master.csv)

Dataset Characteristics

🔧 Methodology

1. Data Cleaning and Preprocessing

Date Standardization

Type Casting and Validation

Geographic Normalization

2. Feature Engineering

Temporal Features

Derived Metrics

Time-Series Features (for Predictions)

Geographic Zoning

3. Analytical Methods

3.1 Descriptive Statistics

3.2 Unsupervised Learning (K-Means Clustering)

3.3 Predictive Modeling (Linear Regression)

4. Data Transformation Pipeline

📈 Data Analysis and Visualisation

Analysis Framework: 7 Dimensions

1. Monthly Enrollment Trends

2. Maintenance vs Growth Ratio (System Maturity)

3. Top & Bottom States Deep Dive

4. Month-over-Month & Quarter-over-Quarter Trends

5. Enrollment Archetypes (Machine Learning)

6. Update Intensity per Age Cohort

7. Predictive Analytics

Visualization Suite Summary

🎯 Key Insights Summary

Strategic Recommendations

Short-Term (0-6 months)

Medium-Term (6-18 months)

Long-Term (18+ months)

Impact Delivered

📁 Project Structure

Key Files Description

🚀 Quick Start

Prerequisites

Option 1: Run Complete Analysis (Recommended)

Option 2: Run Data Pipeline Only

Option 3: Generate Report

Expected Outputs

Troubleshooting

👥 Team & Acknowledgments

Team Members

Hackathon Details

Acknowledgments

📄 License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

1. Enrolment Dataset (`enrolment_master.csv`)

2. Biometric Update Dataset (`biometric_master.csv`)

3. Demographic Update Dataset (`demographic_master.csv`)

Packages