# UIDAI Hackathon - Statistical Analysis

## Objective
This notebook performs comprehensive statistical analysis on the cleaned Aadhaar datasets:
- Temporal trend analysis and seasonality detection
- Geographical distribution patterns
- Age group demographic analysis
- Cross-dataset correlations
- State and district-level comparisons
- Update ratios (demographic/biometric vs enrolments)

**Author:** Harsh Vardhan  
**Date:** January 13, 2026  
**Input:** Raw Aadhaar datasets, loaded and cleaned in Section 2 with the project's preprocessing module  
**Output:** Statistical insights and analytical findings

## 1. Setup Environment

In [1]:
# Standard libraries
import sys
import os
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Data manipulation
import pandas as pd
import numpy as np

# Visualization (for quick checks)
import matplotlib.pyplot as plt
import seaborn as sns

# Add src directory to path (project_root is machine-specific; adjust it for your environment)
project_root = Path(r'c:\Users\harsh\OneDrive - Indian Institute of Information Technology, Nagpur\IIIT Nagpur\6th Semester\Projects\IdentityLab')
sys.path.append(str(project_root / 'src'))

# Import custom modules
from data_loader import AadhaarDataLoader
from preprocessing import AadhaarDataPreprocessor
from analysis import AadhaarAnalyzer, perform_chi_square_test, calculate_concentration_index

print("✓ Environment setup complete")
print(f"✓ Project root: {project_root}")

✓ Environment setup complete
✓ Project root: c:\Users\harsh\OneDrive - Indian Institute of Information Technology, Nagpur\IIIT Nagpur\6th Semester\Projects\IdentityLab


## 2. Load and Clean Data

In [None]:
# Load data
loader = AadhaarDataLoader(str(project_root))
preprocessor = AadhaarDataPreprocessor()

print("Loading and cleaning datasets...")
print("-" * 60)

# Load and clean enrolment
df_enrolment_raw = loader.load_enrolment_data()
df_enrolment = preprocessor.clean_enrolment_data(df_enrolment_raw)
df_enrolment = df_enrolment.drop_duplicates()
print(f"✓ Enrolment: {len(df_enrolment):,} records")

# Load and clean demographic
df_demographic_raw = loader.load_demographic_data()
df_demographic = preprocessor.clean_demographic_data(df_demographic_raw)
df_demographic = df_demographic.drop_duplicates()
print(f"✓ Demographic: {len(df_demographic):,} records")

# Load and clean biometric
df_biometric_raw = loader.load_biometric_data()
df_biometric = preprocessor.clean_biometric_data(df_biometric_raw)
df_biometric = df_biometric.drop_duplicates()
print(f"✓ Biometric: {len(df_biometric):,} records")

2026-01-13 01:29:15,740 - INFO - Loading enrolment data...
2026-01-13 01:29:15,742 - INFO - Found 3 enrolment files
2026-01-13 01:29:15,743 - INFO - Reading api_data_aadhar_enrolment_0_500000.csv


Loading and cleaning datasets...
------------------------------------------------------------


2026-01-13 01:29:15,957 - INFO - Reading api_data_aadhar_enrolment_1000000_1006029.csv
2026-01-13 01:29:15,963 - INFO - Reading api_data_aadhar_enrolment_500000_1000000.csv
2026-01-13 01:29:16,167 - INFO - Loaded 1,006,029 enrolment records
2026-01-13 01:29:16,175 - INFO - Cleaning enrolment data...
2026-01-13 01:29:17,355 - INFO - Enrolment data cleaned: 1,006,029 -> 1,006,029 rows
2026-01-13 01:29:17,745 - INFO - Loading demographic update data...
2026-01-13 01:29:17,746 - INFO - Found 5 demographic files
2026-01-13 01:29:17,746 - INFO - Reading api_data_aadhar_demographic_0_500000.csv
2026-01-13 01:29:17,933 - INFO - Reading api_data_aadhar_demographic_1000000_1500000.csv


✓ Enrolment: 983,000 records


2026-01-13 01:29:18,114 - INFO - Reading api_data_aadhar_demographic_1500000_2000000.csv
2026-01-13 01:29:18,297 - INFO - Reading api_data_aadhar_demographic_2000000_2071700.csv
2026-01-13 01:29:18,329 - INFO - Reading api_data_aadhar_demographic_500000_1000000.csv
2026-01-13 01:29:18,536 - INFO - Loaded 2,071,700 demographic update records
2026-01-13 01:29:18,549 - INFO - Cleaning demographic update data...
2026-01-13 01:29:21,190 - INFO - Demographic data cleaned: 2,071,700 -> 2,069,561 rows


## 3. Initialize Analyzer

In [None]:
# Initialize analyzer
analyzer = AadhaarAnalyzer()
print("✓ Analyzer initialized")

## 4. Univariate Analysis

Analyze the distribution of key metrics across datasets.
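
`AadhaarAnalyzer.univariate_analysis` lives in the project's `analysis` module and is not shown in this notebook. The sketch below is only a guess at what such a helper might compute, assuming it returns a flat dictionary of numeric statistics (which the `{value:.2f}` formatting in the cells below implies). The function name `summarize_column` and the exact set of keys are hypothetical.

```python
import pandas as pd


def summarize_column(df: pd.DataFrame, column: str) -> dict:
    """Hypothetical stand-in: basic descriptive statistics as plain floats."""
    s = df[column].dropna()
    return {
        "count": float(s.count()),
        "mean": float(s.mean()),
        "median": float(s.median()),
        "std": float(s.std()),        # sample standard deviation (ddof=1)
        "min": float(s.min()),
        "max": float(s.max()),
        "skewness": float(s.skew()),  # positive when the right tail is longer
    }


# Toy data, not taken from the Aadhaar datasets
toy = pd.DataFrame({"total_enrolments": [1, 2, 3, 4, 10]})
toy_stats = summarize_column(toy, "total_enrolments")
for key, value in toy_stats.items():
    print(f"{key.upper()}: {value:.2f}")
```

Returning only floats keeps the dictionary compatible with a single format string, which is the pattern the cells in this section rely on.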

In [None]:
# Univariate analysis for enrolment
print("="*80)
print("ENROLMENT STATISTICS")
print("="*80)
enrol_stats = analyzer.univariate_analysis(df_enrolment, 'total_enrolments')
for key, value in enrol_stats.items():
    print(f"{key.upper()}: {value:.2f}")

In [None]:
# Univariate analysis for demographic updates
print("\n" + "="*80)
print("DEMOGRAPHIC UPDATE STATISTICS")
print("="*80)
demo_stats = analyzer.univariate_analysis(df_demographic, 'total_demo_updates')
for key, value in demo_stats.items():
    print(f"{key.upper()}: {value:.2f}")

In [None]:
# Univariate analysis for biometric updates
print("\n" + "="*80)
print("BIOMETRIC UPDATE STATISTICS")
print("="*80)
bio_stats = analyzer.univariate_analysis(df_biometric, 'total_bio_updates')
for key, value in bio_stats.items():
    print(f"{key.upper()}: {value:.2f}")

## 5. Temporal Analysis

Analyze trends over time with different aggregation frequencies.
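
The aggregation described above can be sketched with plain pandas. This is an illustration under assumptions, not the project's `temporal_aggregation` implementation: it assumes a `date` column and output columns named `<column>_sum`, `<column>_mean` and `<column>_count` (only the `_sum` suffix and the `date` column are visible in this notebook). The function name `aggregate_over_time` is hypothetical.

```python
import pandas as pd


def aggregate_over_time(df, value_col, freq="M", date_col="date"):
    """Hypothetical sketch: aggregate value_col per period ('D', 'W' or 'M')."""
    # Period aliases 'D', 'W' and 'M' map each timestamp to its day/week/month
    periods = pd.to_datetime(df[date_col]).dt.to_period(freq)
    out = (
        df.groupby(periods)[value_col]
        .agg(["sum", "mean", "count"])
        .rename(columns=lambda name: f"{value_col}_{name}")
        .reset_index()
    )
    # Convert periods back to timestamps (start of each period)
    out[date_col] = out[date_col].dt.to_timestamp()
    return out


# Toy data, not taken from the Aadhaar datasets
toy = pd.DataFrame(
    {
        "date": ["2025-01-05", "2025-01-20", "2025-02-03"],
        "total_enrolments": [10, 20, 5],
    }
)
toy_monthly = aggregate_over_time(toy, "total_enrolments", freq="M")
print(toy_monthly)
```

Grouping on periods rather than resampling keeps the sketch independent of the changing offset aliases for month-end frequencies across pandas versions.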

In [None]:
# Monthly aggregation for enrolment
enrol_monthly = analyzer.temporal_aggregation(df_enrolment, 'total_enrolments', freq='M')
print("Monthly Enrolment Trends:")
display(enrol_monthly.head(10))

In [None]:
# Weekly aggregation for demographic updates
demo_weekly = analyzer.temporal_aggregation(df_demographic, 'total_demo_updates', freq='W')
print("\nWeekly Demographic Update Trends (first 10 weeks):")
display(demo_weekly.head(10))

In [None]:
# Daily aggregation for biometric updates
bio_daily = analyzer.temporal_aggregation(df_biometric, 'total_bio_updates', freq='D')
print("\nDaily Biometric Update Statistics:")
print(bio_daily['total_bio_updates_sum'].describe())

## 6. Seasonality Detection

Identify monthly patterns and peak/low periods.
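
One plausible way to derive the quantities printed below is to average the metric per calendar month and take the coefficient of variation (standard deviation divided by mean) of those monthly averages. The sketch assumes exactly that; the real `detect_seasonality` may differ. The dictionary keys mirror the ones used in this notebook, while the function name `seasonality_summary`, the `date` column and the 20% cut-off for "strong" seasonality are assumptions made for illustration.

```python
import pandas as pd


def seasonality_summary(df, value_col, date_col="date", cv_threshold=20.0):
    """Hypothetical sketch: month-of-year averages and their variability."""
    months = pd.to_datetime(df[date_col]).dt.month
    monthly_avg = df.groupby(months)[value_col].mean()
    # Coefficient of variation of the monthly averages, in percent
    cv = monthly_avg.std() / monthly_avg.mean() * 100
    return {
        "monthly_averages": monthly_avg.to_dict(),
        "peak_month": int(monthly_avg.idxmax()),
        "low_month": int(monthly_avg.idxmin()),
        "coefficient_of_variation": float(cv),
        # cv_threshold is an assumed cut-off, not a value from the project
        "has_strong_seasonality": bool(cv > cv_threshold),
    }


# Toy data, not taken from the Aadhaar datasets
toy = pd.DataFrame(
    {
        "date": ["2025-01-03", "2025-01-17", "2025-02-10", "2025-03-01", "2025-03-15"],
        "total_enrolments": [10, 20, 30, 60, 60],
    }
)
toy_seasonality = seasonality_summary(toy, "total_enrolments")
print(toy_seasonality["peak_month"], toy_seasonality["low_month"])
```

With less than a full year of data, month-of-year averages cannot separate seasonality from a one-off trend, so the CV is better read as a measure of month-to-month variability.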

In [None]:
# Seasonality in enrolments
enrol_seasonality = analyzer.detect_seasonality(df_enrolment, 'total_enrolments')
print("Enrolment Seasonality Analysis:")
print(f"Peak Month: {enrol_seasonality['peak_month']}")
print(f"Low Month: {enrol_seasonality['low_month']}")
print(f"Coefficient of Variation: {enrol_seasonality['coefficient_of_variation']:.2f}%")
print(f"Strong Seasonality: {enrol_seasonality['has_strong_seasonality']}")
print("\nMonthly Averages:")
for month, avg in enrol_seasonality['monthly_averages'].items():
    print(f"  Month {month}: {avg:.2f}")

In [None]:
# Seasonality in updates
demo_seasonality = analyzer.detect_seasonality(df_demographic, 'total_demo_updates')
bio_seasonality = analyzer.detect_seasonality(df_biometric, 'total_bio_updates')

print("\nDemographic Update Seasonality:")
print(f"Peak Month: {demo_seasonality['peak_month']}, Low Month: {demo_seasonality['low_month']}")
print(f"CV: {demo_seasonality['coefficient_of_variation']:.2f}%")

print("\nBiometric Update Seasonality:")
print(f"Peak Month: {bio_seasonality['peak_month']}, Low Month: {bio_seasonality['low_month']}")
print(f"CV: {bio_seasonality['coefficient_of_variation']:.2f}%")

## 7. Geographical Analysis

Analyze patterns across states and districts.

In [None]:
# State-level enrolment analysis
enrol_by_state = analyzer.geographical_aggregation(df_enrolment, 'state', 'total_enrolments')
print("Top 15 States by Enrolments:")
display(enrol_by_state.head(15))

In [None]:
# District-level enrolment analysis
enrol_by_district = analyzer.geographical_aggregation(df_enrolment, 'district', 'total_enrolments')
print("\nTop 15 Districts by Enrolments:")
display(enrol_by_district.head(15))

In [None]:
# State-level update analysis
demo_by_state = analyzer.geographical_aggregation(df_demographic, 'state', 'total_demo_updates')
bio_by_state = analyzer.geographical_aggregation(df_biometric, 'state', 'total_bio_updates')

print("\nTop 10 States by Demographic Updates:")
display(demo_by_state.head(10))

print("\nTop 10 States by Biometric Updates:")
display(bio_by_state.head(10))

## 8. Top N Analysis

Identify the states and districts with the highest enrolment totals.

In [None]:
# Top 10 states by enrolment
top_states_enrol = analyzer.top_n_analysis(df_enrolment, 'state', 'total_enrolments', n=10)
print("Top 10 States by Total Enrolments:")
display(top_states_enrol)

In [None]:
# Top 10 districts by enrolment
top_districts_enrol = analyzer.top_n_analysis(df_enrolment, 'district', 'total_enrolments', n=10)
print("\nTop 10 Districts by Total Enrolments:")
display(top_districts_enrol)

## 9. Update Ratio Analysis

Calculate the ratio of updates to enrolments by geography.
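
As a rough illustration of the ratio described above, the sketch below sums both datasets per state and divides updates by enrolments, expressed as a percentage. It assumes that definition and the `state` and `update_ratio` column names seen in this notebook; the function name `update_ratio_by_geo` and its extra parameters are hypothetical, and the project's `calculate_update_ratio` may work differently.

```python
import numpy as np
import pandas as pd


def update_ratio_by_geo(enrol, updates, update_col,
                        geo_level="state", enrol_col="total_enrolments"):
    """Hypothetical sketch: updates per 100 enrolments for each region."""
    enrol_total = enrol.groupby(geo_level)[enrol_col].sum()
    update_total = updates.groupby(geo_level)[update_col].sum()
    # Keep only regions present in both datasets
    merged = pd.concat([enrol_total, update_total], axis=1, join="inner")
    # Replace zero enrolments with NaN so the division never yields inf
    denominator = merged[enrol_col].replace(0, np.nan)
    merged["update_ratio"] = (merged[update_col] / denominator * 100).round(2)
    return merged.sort_values("update_ratio", ascending=False).reset_index()


# Toy data, not taken from the Aadhaar datasets
toy_enrol = pd.DataFrame({"state": ["A", "A", "B"], "total_enrolments": [100, 100, 50]})
toy_updates = pd.DataFrame({"state": ["A", "B", "C"], "total_demo_updates": [50, 100, 10]})
toy_ratio = update_ratio_by_geo(toy_enrol, toy_updates, "total_demo_updates")
print(toy_ratio)
```

Under this definition a ratio above 100% is possible, because updates recorded in the period can belong to residents who enrolled before it.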

In [None]:
# Demographic update ratio by state
demo_ratio = analyzer.calculate_update_ratio(df_enrolment, df_demographic, geo_level='state')
print("Demographic Update Ratio by State (Top 15):")
display(demo_ratio.head(15))

In [None]:
# Biometric update ratio by state
bio_ratio = analyzer.calculate_update_ratio(df_enrolment, df_biometric, geo_level='state')
print("\nBiometric Update Ratio by State (Top 15):")
display(bio_ratio.head(15))

## 10. Age Group Analysis

Analyze patterns across different age groups.

In [None]:
# Age group distribution in enrolments
age_groups = ['age_0_5', 'age_5_17', 'age_18_greater']
age_totals = df_enrolment[age_groups].sum()
age_percentages = (age_totals / age_totals.sum() * 100).round(2)

print("Enrolment Age Group Distribution:")
print("="*60)
for age, total, pct in zip(age_groups, age_totals, age_percentages):
    print(f"{age}: {total:,} ({pct}%)")
print(f"\nTotal: {age_totals.sum():,}")

In [None]:
# Age group distribution in updates
demo_age_totals = df_demographic[['demo_age_5_17', 'demo_age_17_']].sum()
bio_age_totals = df_biometric[['bio_age_5_17', 'bio_age_17_']].sum()

print("\nDemographic Update Age Distribution:")
print("="*60)
print(f"Age 5-17: {demo_age_totals['demo_age_5_17']:,}")
print(f"Age 17+: {demo_age_totals['demo_age_17_']:,}")

print("\nBiometric Update Age Distribution:")
print("="*60)
print(f"Age 5-17: {bio_age_totals['bio_age_5_17']:,}")
print(f"Age 17+: {bio_age_totals['bio_age_17_']:,}")

## 11. Concentration Analysis

Measure how unevenly activity is concentrated across states using the Gini coefficient.
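
For reference, the Gini coefficient of a set of non-negative totals can be computed from the sorted values with the standard rank-based formula. The sketch below uses that formula; it is not necessarily how `calculate_concentration_index` is implemented, and the function name `gini` is hypothetical.

```python
import numpy as np


def gini(values):
    """Gini coefficient of non-negative values via the sorted-rank formula."""
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    total = x.sum()
    if n == 0 or total == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    # G = sum((2i - n - 1) * x_i) / (n * sum(x)), with x sorted ascending
    return float(((2 * ranks - n - 1) * x).sum() / (n * total))


# Toy data, not taken from the Aadhaar datasets
gini_equal = gini([5, 5, 5, 5])          # every region has the same total
gini_concentrated = gini([0, 0, 0, 10])  # one region holds everything
print(gini_equal, gini_concentrated)
```

With this formula the maximum for `n` regions is `(n - 1) / n` rather than exactly 1, so values computed over a few dozen states sit slightly below the theoretical upper bound.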

In [None]:
# Calculate Gini coefficient for state-level distribution
gini_enrol = calculate_concentration_index(enrol_by_state, 'total_enrolments_sum')
gini_demo = calculate_concentration_index(demo_by_state, 'total_demo_updates_sum')
gini_bio = calculate_concentration_index(bio_by_state, 'total_bio_updates_sum')

print("Concentration Index (Gini Coefficient) by State:")
print("="*60)
print(f"Enrolments: {gini_enrol:.4f}")
print(f"Demographic Updates: {gini_demo:.4f}")
print(f"Biometric Updates: {gini_bio:.4f}")
print("\n(0 = perfect equality, 1 = perfect inequality)")

## 12. Growth Rate Analysis

Calculate period-over-period growth rates.
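
Period-over-period growth is commonly computed as the percentage change relative to the value `periods` rows earlier. The sketch below assumes that definition and the `<column>_growth` naming used in the next cell; the function name `add_growth_rate` is hypothetical and the project's `calculate_growth_rate` may scale or name things differently.

```python
import pandas as pd


def add_growth_rate(df, value_col, periods=1):
    """Hypothetical sketch: percentage change versus `periods` rows earlier."""
    out = df.copy()
    # The first `periods` rows have no earlier value and therefore become NaN
    out[f"{value_col}_growth"] = out[value_col].pct_change(periods=periods) * 100
    return out


# Toy data, not taken from the Aadhaar datasets
toy = pd.DataFrame({"total_enrolments_sum": [100, 150, 120]})
toy_growth = add_growth_rate(toy, "total_enrolments_sum")
print(toy_growth)
```

The rows must already be sorted chronologically, since the calculation depends on row order rather than on the dates themselves.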

In [None]:
# Monthly growth rate for enrolments
enrol_growth = analyzer.calculate_growth_rate(enrol_monthly, 'total_enrolments_sum', periods=1)
print("Monthly Enrolment Growth Rate:")
display(enrol_growth[['date', 'total_enrolments_sum', 'total_enrolments_sum_growth']].head(10))

In [None]:
# Growth statistics
print("\nGrowth Rate Statistics:")
print("="*60)
growth_stats = enrol_growth['total_enrolments_sum_growth'].dropna().describe()
print(growth_stats)

## 13. Analysis Summary

Consolidate key findings from all analyses.

In [None]:
print("="*80)
print("COMPREHENSIVE ANALYSIS SUMMARY")
print("="*80)

print("\n1. DATASET OVERVIEW:")
print(f"   Total Enrolments: {df_enrolment['total_enrolments'].sum():,}")
print(f"   Total Demographic Updates: {df_demographic['total_demo_updates'].sum():,}")
print(f"   Total Biometric Updates: {df_biometric['total_bio_updates'].sum():,}")

print("\n2. TEMPORAL PATTERNS:")
print(f"   Enrolment Peak Month: {enrol_seasonality['peak_month']}")
print(f"   Enrolment Low Month: {enrol_seasonality['low_month']}")
print(f"   Seasonality Strength: {enrol_seasonality['coefficient_of_variation']:.2f}%")

print("\n3. GEOGRAPHICAL DISTRIBUTION:")
top_state = top_states_enrol.iloc[0]
top_district = top_districts_enrol.iloc[0]
print(f"   Top State (Enrolments): {top_state['state']} ({top_state['total_total_enrolments']:,})")
print(f"   Top District (Enrolments): {top_district['district']} ({top_district['total_total_enrolments']:,})")
print(f"   Gini Coefficient (State-level): {gini_enrol:.4f}")

print("\n4. AGE GROUP BREAKDOWN:")
print(f"   Age 0-5: {age_percentages['age_0_5']}%")
print(f"   Age 5-17: {age_percentages['age_5_17']}%")
print(f"   Age 18+: {age_percentages['age_18_greater']}%")

print("\n5. UPDATE RATIOS (Top State):")
print(f"   Demographic: {demo_ratio.iloc[0]['state']} - {demo_ratio.iloc[0]['update_ratio']}%")
print(f"   Biometric: {bio_ratio.iloc[0]['state']} - {bio_ratio.iloc[0]['update_ratio']}%")

print("\n" + "="*80)
print("✓ Analysis complete!")
print("\nKey insights ready for visualization and reporting.")