# Aadhaar Data Analytics Project

## Objective
Analyze the gap between registered Aadhaar holders and actively participating users to estimate inactivity or utilization patterns. We aim to produce neutral, policy-oriented insights regarding enrolment vs. authentication levels across states.

## Scope & Limitations
- **Inactivity Index**: Framed as a comparative indicator of engagement (1 minus the ratio of authentication volume to enrolment volume) rather than a measure of absolute dormancy.
- **Failure Rates**: Due to data limitations (lack of explicit failure counts), analysis focuses on *utilization volume*.
- **Demographic Data**: Used as a proxy for non-biometric authentication/update activity.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats
from sklearn.ensemble import IsolationForest
import glob
import os

# Set aesthetic style
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

# 1. Data Preparation

## 1.1 Load Data
The data is distributed across three folders (`biometric`, `demographic`, `enrolment`), each containing chunked CSV files. We will load and combine them into three master DataFrames.

In [None]:
def load_dataset(folder_path):
    """
    Reads all CSV files in the given folder and concatenates them.
    """
    all_files = glob.glob(os.path.join(folder_path, "*.csv"))
    if not all_files:
        print(f"No files found in {folder_path}")
        return pd.DataFrame()
    
    df_list = []
    for filename in all_files:
        try:
            df = pd.read_csv(filename)
            df_list.append(df)
        except Exception as e:
            print(f"Error reading {filename}: {e}")
            
    return pd.concat(df_list, ignore_index=True) if df_list else pd.DataFrame()

# Define paths based on workspace structure
base_path = "/Users/shikhar/Downloads/Shikhar/hacka/Dataset"
bio_path = os.path.join(base_path, "api_data_aadhar_biometric")
demo_path = os.path.join(base_path, "api_data_aadhar_demographic")
enrol_path = os.path.join(base_path, "api_data_aadhar_enrolment")

print("Loading Biometric Data...")
df_bio = load_dataset(bio_path)
print(f"Biometric Shape: {df_bio.shape}")

print("Loading Demographic Data...")
df_demo = load_dataset(demo_path)
print(f"Demographic Shape: {df_demo.shape}")

print("Loading Enrolment Data...")
df_enrol = load_dataset(enrol_path)
print(f"Enrolment Shape: {df_enrol.shape}")

## 1.2 Data Standardization
We need to ensure state names are consistent across all datasets.
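
One way to confirm that names are consistent is to compare the sets of state names between two datasets and list whatever appears on one side only. The sketch below uses toy frames with illustrative values; the helper `unmatched_states` is a hypothetical name and is not part of the notebook.

```python
import pandas as pd

def unmatched_states(left, right, col="state"):
    """Return (only_in_left, only_in_right) as sorted lists of state names."""
    left_names = set(left[col].dropna().astype(str).str.strip())
    right_names = set(right[col].dropna().astype(str).str.strip())
    return sorted(left_names - right_names), sorted(right_names - left_names)

# Toy frames (illustrative values, not taken from the dataset)
toy_enrol = pd.DataFrame({"state": ["Odisha", "Kerala ", "Telangana"]})
toy_bio = pd.DataFrame({"state": ["Orissa", "Kerala", "Telangana"]})

only_enrol, only_bio = unmatched_states(toy_enrol, toy_bio)
print("Only in enrolment:", only_enrol)
print("Only in biometric:", only_bio)
```

Any name that shows up in either list is a candidate for the mapping dictionary used during standardization.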

In [None]:
# Standardization of State Names
def standardize_states(df, state_col='state'):
    if df.empty:
        return df
    
    # Common mappings for Indian states
    state_map = {
        'Orissa': 'Odisha',
        'Pondicherry': 'Puducherry',
        'Delhi': 'NCT of Delhi',
        'Andaman and Nicobar Islands': 'Andaman & Nicobar Islands',
        'Jammu and Kashmir': 'Jammu & Kashmir',
        'Dadra and Nagar Haveli': 'Dadra & Nagar Haveli and Daman & Diu',
        'Daman and Diu': 'Dadra & Nagar Haveli and Daman & Diu',
        # Common misspelling
        'Telengana': 'Telangana'
    }
    
    df[state_col] = df[state_col].astype(str).str.strip()
    df[state_col] = df[state_col].replace(state_map)
    return df

df_bio = standardize_states(df_bio)
df_demo = standardize_states(df_demo)
df_enrol = standardize_states(df_enrol)

# Check unique states
print("Unique States in Biometric:", df_bio['state'].nunique())
print("Unique States in Enrolment:", df_enrol['state'].nunique())

# 2. Analysis

## 2.1 Aggregation by State
We aggregate the counts by state to perform high-level analysis. We assume that each file holds incremental counts for the period, so summing rows gives the total volume. If the files were cumulative snapshots instead, summing would overcount.
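
As a minimal sketch of this aggregation on toy data: count columns are selected by prefix, summed per state, then summed across age groups. The column names `age_0_5` and `age_5_17` and all values are hypothetical; the real files may use different names.

```python
import pandas as pd

# Toy rows at pincode level (hypothetical columns and values)
toy = pd.DataFrame({
    "state": ["Kerala", "Kerala", "Goa"],
    "pincode": [682001, 695001, 403001],
    "age_0_5": [10, 5, 2],
    "age_5_17": [20, 15, 3],
})

# Select count columns by prefix so that pincode is never summed
count_cols = [c for c in toy.columns if c.startswith("age_")]

# Sum rows within each state, then sum across age groups
by_state = toy.groupby("state")[count_cols].sum().reset_index()
by_state["Total_Enrolment"] = by_state[count_cols].sum(axis=1)

totals = dict(zip(by_state["state"], by_state["Total_Enrolment"]))
print(totals)
```

Selecting by prefix rather than by dtype matters here because `pincode` is numeric and would otherwise be added into the totals.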

In [None]:
# Select count columns by name prefix (this leaves out pincode and other non-count fields)
def get_sum_columns(df, prefix):
    return [c for c in df.columns if c.startswith(prefix)]

bio_cols = get_sum_columns(df_bio, 'bio_age')
demo_cols = get_sum_columns(df_demo, 'demo_age')
enrol_cols = get_sum_columns(df_enrol, 'age_')

print(f"Biometric Cols to sum: {bio_cols}")
print(f"Demographic Cols to sum: {demo_cols}")
print(f"Enrolment Cols to sum: {enrol_cols}")

# Aggregation
state_bio = df_bio.groupby('state')[bio_cols].sum().reset_index()
state_demo = df_demo.groupby('state')[demo_cols].sum().reset_index()
state_enrol = df_enrol.groupby('state')[enrol_cols].sum().reset_index()

# Calculate Totals
state_bio['Total_Biometric'] = state_bio[bio_cols].sum(axis=1)
state_demo['Total_Demographic'] = state_demo[demo_cols].sum(axis=1)
state_enrol['Total_Enrolment'] = state_enrol[enrol_cols].sum(axis=1)

# Merge datasets
df_master = pd.merge(state_enrol[['state', 'Total_Enrolment']], state_bio[['state', 'Total_Biometric']], on='state', how='outer')
df_master = pd.merge(df_master, state_demo[['state', 'Total_Demographic']], on='state', how='outer')

# Fill NaNs with 0 (assuming missing means no activity in that state for that category)
df_master = df_master.fillna(0)

# Total Authentications
df_master['Total_Authentications'] = df_master['Total_Biometric'] + df_master['Total_Demographic']

df_master.head()

## 2.2 Core Metrics Calculation

### Inactivity Index
**Formula**: $1 - \frac{\text{Total Authentications}}{\text{Total Enrolment}}$

> **Interpretation**: 
> * Values closer to **1** indicate high inactivity (low utilization relative to enrolment).
> * Values < **0** indicate utilization exceeds the captured enrolment flow (highly active). 
> * Values near **0** indicate balanced activity.
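
To make the three cases concrete, here is a small worked example of the formula on made-up numbers. The helper `inactivity_index` is a hypothetical name, and treating zero enrolment as undefined (NaN) is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

def inactivity_index(authentications, enrolment):
    """Compute 1 - authentications / enrolment, with NaN where enrolment is 0."""
    auth = pd.Series(authentications, dtype="float64")
    enrol = pd.Series(enrolment, dtype="float64")
    # Replace zero enrolment with NaN so the ratio is undefined, not infinite
    ratio = auth / enrol.where(enrol != 0)
    return 1 - ratio

# Made-up volumes: low use, balanced, over-use, and no enrolment
idx = inactivity_index([20, 100, 150, 30], [100, 100, 100, 0])
print(idx.tolist())
```

The first three values correspond to high inactivity (0.8), balanced activity (0.0), and utilization above the captured enrolment flow (-0.5); the last is NaN because the ratio has no meaning without enrolment.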

In [None]:
# Inactivity Index
df_master['Inactivity_Index'] = 1 - (df_master['Total_Authentications'] / df_master['Total_Enrolment'])

# Update Activity Rate (Demographic share of total auth)
# This acts as a proxy for "Correction/Update" vs "Usage"
df_master['Demo_Auth_Share'] = df_master['Total_Demographic'] / df_master['Total_Authentications']

# Handle edge cases (divide by zero)
df_master.replace([np.inf, -np.inf], np.nan, inplace=True)

df_master.head()

# 3. Exploratory Data Analysis (EDA)

## 3.1 State Rankings

In [None]:
def plot_top_bottom(df, col, title, n=5):
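    # Drop NaN first: sort_values places NaN last, so undefined values would otherwise rank as "top"
    sorted_df = df.dropna(subset=[col]).sort_values(col)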
    top = sorted_df.tail(n)
    bottom = sorted_df.head(n)
    combined = pd.concat([bottom, top])
    
    plt.figure(figsize=(10, 6))
    sns.barplot(data=combined, y='state', x=col, hue='state', palette='coolwarm', legend=False)
    plt.title(f"Top and Bottom {n} States by {title}")
    plt.axvline(0, color='black', linewidth=1)
    plt.show()

plot_top_bottom(df_master, 'Inactivity_Index', 'Inactivity Index (Higher = Less Active)')

## 3.2 Enrolment vs Authentication Volume
Comparing the raw usage volume against the enrolled base.

In [None]:
# Plot for a subset of large states to avoid clutter
# We pick top 10 by enrolment volume
top_states = df_master.nlargest(10, 'Total_Enrolment')

top_states_melt = top_states.melt(id_vars='state', 
                                  value_vars=['Total_Enrolment', 'Total_Authentications'], 
                                  var_name='Metric', value_name='Count')

plt.figure(figsize=(12, 6))
sns.barplot(data=top_states_melt, x='state', y='Count', hue='Metric')
plt.title("Enrolment vs Authentication in Top 10 Enrolled States")
plt.xticks(rotation=45)
plt.show()

# 4. Anomaly Detection

We use a **Z-score** to identify states whose Inactivity Index deviates markedly from the unweighted mean across states.
- **Z > 2**: Unusually high inactivity. Possible explanations include ghost beneficiaries or out-migration; these are hypotheses to investigate, not findings.
- **Z < -2**: Unusually high activity. Possible explanations include in-migration or high service dependency.
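
The thresholding can be sketched on toy values as follows. The helper `zscore_flags` and the sample numbers are hypothetical; the sketch uses the sample standard deviation (pandas default, `ddof=1`).

```python
import pandas as pd

def zscore_flags(values, threshold=2.0):
    """Return z-scores and a boolean mask where |z| exceeds the threshold."""
    s = pd.Series(values, dtype="float64")
    z = (s - s.mean()) / s.std()  # sample standard deviation (ddof=1)
    return z, z.abs() > threshold

# Nine similar values and one clear outlier (made-up numbers)
toy_index = [0.10, 0.12, 0.08, 0.11, 0.09, 0.10, 0.13, 0.07, 0.10, 0.90]
z, flagged = zscore_flags(toy_index)
print(z.round(2).tolist())
print(flagged.tolist())
```

Only the last value is flagged. Note that with the sample standard deviation, |Z| cannot exceed (n - 1) / sqrt(n) for n observations, so a threshold of 2 is only meaningful when enough states are present (for n = 5 the bound is about 1.79).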

In [None]:
mean_inactivity = df_master['Inactivity_Index'].mean()
std_inactivity = df_master['Inactivity_Index'].std()

df_master['Z_Score'] = (df_master['Inactivity_Index'] - mean_inactivity) / std_inactivity

anomalies = df_master[(df_master['Z_Score'] > 2) | (df_master['Z_Score'] < -2)]

print("Detected Anomalies:")
display(anomalies[['state', 'Total_Enrolment', 'Total_Authentications', 'Inactivity_Index', 'Z_Score']])

plt.figure(figsize=(10, 6))
sns.scatterplot(data=df_master, x='Total_Enrolment', y='Inactivity_Index', hue='Z_Score', palette='RdBu_r', size='Z_Score')
plt.title("Anomaly Detection: Inactivity vs Enrolment Size")
plt.axhline(mean_inactivity, color='gray', linestyle='--')
plt.show()

# 5. Policy Recommendations

Based on the analysis:

1. **High Inactivity States**:
    - States with high Z-scores require targeted **Aadhaar usage drives**.
    - Investigation into **dead/duplicate entries** is recommended if demographic update volumes are also low.
    
2. **High Utilization States**:
    - States with a negative Inactivity Index (high utilization) may reflect **migrant inflows** or heavy reliance on DBT (Direct Benefit Transfer).
    - Recommendation: Strengthen **authentication infrastructure** (server capacity, biometric devices) in these regions to prevent failures.
    
3. **Demographic vs Biometric**:
    - Regions with disproportionately high Demographic authentication may be facing **biometric failures** (e.g., manual override or OTP fallback). Targeted hardware audits are advised.