# UIDAI Hackathon - Data Exploration

## Objective
This notebook performs initial exploration of the Aadhaar enrolment and update datasets to understand:
- Data structure and schema
- Data quality issues
- Basic patterns and trends
- Temporal and geographical distributions

**Author:** Harsh Vardhan  
**Date:** January 13, 2026  
**Dataset:** Aadhaar Enrolment, Demographic & Biometric Updates

## 1. Setup Development Environment

Import all necessary libraries and configure the environment.

In [3]:
# Standard libraries
import sys
import os
from pathlib import Path
import warnings
# Silence all warnings to keep cell output compact (this also hides deprecation notices)
warnings.filterwarnings('ignore')

# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Configure plotting
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

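# Add the project's src directory to sys.path so local modules such as data_loader can be imported
# NOTE: this absolute path is specific to the author's machine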
project_root = Path(r'c:\Users\harsh\OneDrive - Indian Institute of Information Technology, Nagpur\IIIT Nagpur\6th Semester\Projects\IdentityLab')
sys.path.append(str(project_root / 'src'))

print("✓ Libraries imported successfully")
print(f"✓ Project root: {project_root}")
print(f"✓ Pandas version: {pd.__version__}")
print(f"✓ NumPy version: {np.__version__}")

✓ Libraries imported successfully
✓ Project root: c:\Users\harsh\OneDrive - Indian Institute of Information Technology, Nagpur\IIIT Nagpur\6th Semester\Projects\IdentityLab
✓ Pandas version: 2.3.3
✓ NumPy version: 2.4.1
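
The setup cell above hard-codes an absolute Windows path, so it only resolves on one machine. A minimal sketch of an alternative, assuming the notebook is run from somewhere inside a project folder that contains a `src` directory. The helper names `find_project_root` and `add_src_to_path` are hypothetical and not part of the project.

```python
import sys
from pathlib import Path


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Walk upwards from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")


def add_src_to_path(root: Path, marker: str = "src") -> str:
    """Put `<root>/<marker>` at the front of sys.path (once) and return it."""
    src = str(root / marker)
    if src not in sys.path:
        sys.path.insert(0, src)
    return src
```

In the notebook this could be used as `project_root = find_project_root(Path.cwd())` followed by `add_src_to_path(project_root)`. Re-running the cell does not add duplicate entries to `sys.path`.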


## 2. Load Datasets

Load the three Aadhaar datasets using our custom data loader module.

In [2]:
# Import custom data loader
from data_loader import AadhaarDataLoader

# Initialize the data loader
loader = AadhaarDataLoader(str(project_root))

print("Loading datasets (this may take a few minutes)...")
print("-" * 60)

ModuleNotFoundError: No module named 'data_loader'
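
The `ModuleNotFoundError` above means Python could not find `data_loader` on `sys.path` when the cell ran. Two plausible causes are that the setup cell had not been executed yet in this kernel session, or that `data_loader.py` is not present in the `src` directory. The check below is a sketch for telling these apart. `diagnose_import` is a hypothetical helper, and it assumes the module is a single `.py` file rather than a package.

```python
import importlib.util
import sys
from pathlib import Path


def diagnose_import(module_name: str, src_dir: Path) -> dict:
    """Collect the facts needed to explain a failed `import module_name`."""
    return {
        # Does the directory we expect to import from exist at all?
        "src_dir_exists": src_dir.is_dir(),
        # Was it actually added to the import search path?
        "src_dir_on_sys_path": str(src_dir) in sys.path,
        # Assumption: the module is a plain file named <module_name>.py
        "module_file_exists": (src_dir / f"{module_name}.py").is_file(),
        # Can the import system locate the module right now?
        "importable": importlib.util.find_spec(module_name) is not None,
    }
```

Calling `diagnose_import("data_loader", project_root / "src")` before the import shows which of the four conditions fails.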

### 2.1 Load Enrolment Data

In [None]:
# Load enrolment data
df_enrolment = loader.load_enrolment_data(use_dask=False)
print(f"✓ Enrolment data loaded: {df_enrolment.shape[0]:,} rows × {df_enrolment.shape[1]} columns")

### 2.2 Load Demographic Update Data

In [None]:
# Load demographic data
df_demographic = loader.load_demographic_data(use_dask=False)
print(f"✓ Demographic data loaded: {df_demographic.shape[0]:,} rows × {df_demographic.shape[1]} columns")

### 2.3 Load Biometric Update Data

In [None]:
# Load biometric data
df_biometric = loader.load_biometric_data(use_dask=False)
print(f"✓ Biometric data loaded: {df_biometric.shape[0]:,} rows × {df_biometric.shape[1]} columns")
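
The internals of `AadhaarDataLoader` are not shown in this notebook. As a fallback while the import issue is unresolved, the sketch below reads a dataset with pandas alone. It assumes each dataset is stored as one or more CSV files in a single folder, which may not match the real layout. `load_csv_folder` is a hypothetical helper.

```python
from pathlib import Path

import pandas as pd


def load_csv_folder(folder: Path, pattern: str = "*.csv") -> pd.DataFrame:
    """Read every CSV matching `pattern` in `folder` and stack them into one frame."""
    files = sorted(Path(folder).glob(pattern))
    if not files:
        raise FileNotFoundError(f"No files matching {pattern!r} in {folder}")
    # Read the parts in name order, then renumber the rows from 0
    frames = [pd.read_csv(path) for path in files]
    return pd.concat(frames, ignore_index=True)
```

Files are read in sorted name order and the index is rebuilt, so row labels are unique across parts.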

## 3. Initial Data Exploration

Examine the structure, schema, and first few rows of each dataset.

### 3.1 Enrolment Data Overview

In [None]:
# Display first few rows
print("First 5 rows of Enrolment Data:")
display(df_enrolment.head())

print("\n" + "="*80)
print("Data Info:")
print("="*80)
df_enrolment.info()

In [None]:
# Statistical summary
print("Statistical Summary of Enrolment Data:")
display(df_enrolment.describe())

### 3.2 Demographic Update Data Overview

In [None]:
# Display first few rows
print("First 5 rows of Demographic Update Data:")
display(df_demographic.head())

print("\n" + "="*80)
print("Data Info:")
print("="*80)
df_demographic.info()

In [None]:
# Statistical summary
print("Statistical Summary of Demographic Update Data:")
display(df_demographic.describe())

### 3.3 Biometric Update Data Overview

In [None]:
# Display first few rows
print("First 5 rows of Biometric Update Data:")
display(df_biometric.head())

print("\n" + "="*80)
print("Data Info:")
print("="*80)
df_biometric.info()

In [None]:
# Statistical summary
print("Statistical Summary of Biometric Update Data:")
display(df_biometric.describe())
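
The three overviews above are inspected one at a time, and `describe()` summarises only numeric columns by default on mixed-type frames. To compare the schemas side by side, one option is a column-by-dataset table of dtypes. This is a sketch, and `compare_schemas` is a hypothetical helper.

```python
import pandas as pd


def compare_schemas(frames: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Return a table with one row per column name and one column per dataset.

    Each cell holds the dtype as a string; a column absent from a dataset shows as NaN.
    """
    dtypes = {name: df.dtypes.astype(str) for name, df in frames.items()}
    return pd.DataFrame(dtypes)
```

For example, `compare_schemas({'Enrolment': df_enrolment, 'Demographic': df_demographic, 'Biometric': df_biometric})` makes shared columns and dtype mismatches visible at a glance.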

## 4. Data Quality Assessment

Check for missing values, duplicates, data types, and potential issues.

### 4.1 Missing Values Analysis

In [None]:
# Function to analyze missing values
def analyze_missing_values(df, name):
    """Display and return per-column missing-value counts and percentages for `df`.

    Only columns with at least one missing value are kept, sorted by count (descending).
    """
    missing = df.isnull().sum()
    missing_pct = (missing / len(df) * 100).round(2)

    missing_df = pd.DataFrame({
        'Column': missing.index,
        'Missing_Count': missing.values,
        'Missing_Percentage': missing_pct.values
    })

    # Keep only the columns that actually have gaps, worst first
    missing_df = missing_df[missing_df['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False)

    print(f"\n{'='*60}")
    print(f"Missing Values in {name} Dataset")
    print(f"{'='*60}")

    if len(missing_df) == 0:
        print("✓ No missing values found!")
    else:
        display(missing_df)

    return missing_df

# Analyze all datasets
missing_enrolment = analyze_missing_values(df_enrolment, "Enrolment")
missing_demographic = analyze_missing_values(df_demographic, "Demographic")
missing_biometric = analyze_missing_values(df_biometric, "Biometric")
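
`isnull()` flags only `NaN`, `None`, and `NaT`, so empty or whitespace-only strings are not counted by the analysis above. The sketch below counts such blanks in text columns. Whether the Aadhaar datasets contain any is not known from this notebook. `blank_string_counts` is a hypothetical helper.

```python
import pandas as pd


def blank_string_counts(df: pd.DataFrame) -> pd.Series:
    """Count empty or whitespace-only strings per text column (columns with none are omitted)."""
    counts = {}
    for name in df.columns:
        col = df[name]
        # Skip numeric, datetime and other non-text columns
        if not (pd.api.types.is_object_dtype(col) or pd.api.types.is_string_dtype(col)):
            continue
        # Real missing values are dropped here; they are already covered by isnull()
        blanks = col.dropna().astype(str).str.strip().eq("").sum()
        if blanks:
            counts[name] = int(blanks)
    return pd.Series(counts, dtype="int64")
```

A non-empty result for, say, `blank_string_counts(df_enrolment)` would mean the missing-value percentages above understate the real gaps.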

### 4.2 Duplicate Records Check

In [None]:
# Check for duplicate rows
print("Duplicate Records:")
print("-" * 60)
print(f"Enrolment duplicates: {df_enrolment.duplicated().sum():,}")
print(f"Demographic duplicates: {df_demographic.duplicated().sum():,}")
print(f"Biometric duplicates: {df_biometric.duplicated().sum():,}")
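
`duplicated()` with default arguments marks only the second and later copies of a fully identical row, so the counts above are the number of extra copies. The sketch below also reports how many rows belong to any duplicate group, and how many repeat on a chosen set of key columns. The key columns are an assumption to be picked per dataset. `duplicate_report` is a hypothetical helper.

```python
import pandas as pd


def duplicate_report(df: pd.DataFrame, key_columns=None) -> dict:
    """Summarise duplicates in `df` at three levels of strictness."""
    report = {
        # Extra copies of rows that match an earlier row on every column
        "exact_repeats": int(df.duplicated().sum()),
        # Every row that has at least one exact twin, including the first copy
        "rows_in_duplicate_groups": int(df.duplicated(keep=False).sum()),
    }
    if key_columns:
        # Rows repeating an earlier row on the key columns only
        report["key_repeats"] = int(df.duplicated(subset=key_columns).sum())
    return report
```

For instance, `duplicate_report(df_enrolment, key_columns=['date', 'state', 'district', 'pincode'])` would show whether the same date and location appear more than once with different values; those four columns are the ones referenced elsewhere in this notebook.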

### 4.3 Unique Values Count

In [None]:
# Unique values for categorical columns
print("Unique Values in Enrolment Data:")
print("-" * 60)
print(f"Unique States: {df_enrolment['state'].nunique()}")
print(f"Unique Districts: {df_enrolment['district'].nunique()}")
print(f"Unique Pincodes: {df_enrolment['pincode'].nunique()}")
print(f"Date Range: {df_enrolment['date'].min()} to {df_enrolment['date'].max()}")
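
Two caveats apply to the counts above. If `date` is stored as text, `min()` and `max()` compare strings character by character rather than chronologically. And `nunique()` treats labels that differ only in case or surrounding whitespace as distinct. The sketch below addresses both. The date format is passed in because the real format is not known from this notebook, and both helper names are hypothetical.

```python
import pandas as pd


def parse_date_column(series: pd.Series, fmt: str) -> pd.Series:
    """Parse text dates with an explicit format; unparseable values become NaT."""
    return pd.to_datetime(series, format=fmt, errors="coerce")


def normalised_nunique(series: pd.Series) -> int:
    """Count distinct labels after trimming whitespace and ignoring case."""
    return int(series.astype("string").str.strip().str.casefold().nunique())
```

With day-first strings such as `"02-01-2025"` and `"10-12-2024"`, the string minimum is `"02-01-2025"` even though it is the later date, while `parse_date_column(s, "%d-%m-%Y").min()` returns the December 2024 date. Comparing `normalised_nunique(df_enrolment['state'])` with the raw `nunique()` gives a quick signal of inconsistent spellings.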

## 5. Basic Visualizations

Take a first look at the data distributions and patterns, starting with random samples from each dataset.

### 5.1 Sample Records from Each Dataset

In [None]:
# Random samples (fixed seed so the same rows are shown on every run)
print("Random samples from each dataset:\n")
print("Enrolment Sample:")
display(df_enrolment.sample(3, random_state=42))

print("\nDemographic Sample:")
display(df_demographic.sample(3, random_state=42))

print("\nBiometric Sample:")
display(df_biometric.sample(3, random_state=42))
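
A first plot for the geographical distribution could be the number of records per state. The sketch below counts rows, not enrolments, because the names of the count columns are not shown in this notebook. It assumes a `state` column, as used in section 4.3. Both helper names are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd


def records_per_state(df: pd.DataFrame, state_col: str = "state", top_n: int = 10) -> pd.Series:
    """Return the `top_n` states by number of rows, largest first."""
    return df[state_col].value_counts().head(top_n)


def plot_records_per_state(counts: pd.Series, title: str):
    """Draw a horizontal bar chart of the counts and return the Axes."""
    fig, ax = plt.subplots(figsize=(12, 6))
    # Sort ascending so the largest bar ends up at the top of the chart
    counts.sort_values().plot.barh(ax=ax)
    ax.set_xlabel("Number of records")
    ax.set_ylabel("State")
    ax.set_title(title)
    return ax
```

In the notebook, `plot_records_per_state(records_per_state(df_enrolment), "Enrolment records per state (top 10)")` would render the chart inline.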

## 6. Summary and Initial Insights

Summarize key findings from this initial exploration.

In [None]:
# Summary statistics
print("="*80)
print("DATA EXPLORATION SUMMARY")
print("="*80)

datasets = {
    'Enrolment': df_enrolment,
    'Demographic': df_demographic,
    'Biometric': df_biometric
}

summary_data = []
for name, df in datasets.items():
    summary_data.append({
        'Dataset': name,
        'Total Records': f"{len(df):,}",
        'Columns': df.shape[1],
        'Memory (MB)': f"{df.memory_usage(deep=True).sum() / 1024**2:.2f}",
        'Missing Values': df.isnull().sum().sum(),
        'Duplicates': df.duplicated().sum()
    })

summary_df = pd.DataFrame(summary_data)
display(summary_df)

print("\n✓ Data exploration complete!")
print("\nNext Steps:")
print("1. Data cleaning and preprocessing")
print("2. Detailed temporal analysis")
print("3. Geographical pattern analysis")
print("4. Cross-dataset correlation analysis")