# UIDAI Hackathon PS-1: Data Exploration
## Predictive Analysis of Aadhaar Update Demand

This notebook performs comprehensive exploratory data analysis on three Aadhaar datasets:
1. **Enrolment Data**: New Aadhaar enrollments by age group
2. **Demographic Data**: Demographic update requests
3. **Biometric Data**: Biometric update requests

### Objectives:
- Load and consolidate all CSV files from each dataset
- Perform data quality checks
- Generate statistical summaries
- Identify patterns and insights
- Save metadata for further analysis

## 1. Import Required Libraries

In [1]:
import pandas as pd
import numpy as np
import glob
import os
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

Libraries imported successfully!
Pandas version: 2.3.3
NumPy version: 2.4.1


## 2. Define Data Paths and Load Functions

In [2]:
# Define base path
BASE_PATH = Path('/home/prince/Desktop/UIDAI Hackathon')
DATASET_PATH = BASE_PATH / 'dataset'

# Define paths for each dataset
ENROLMENT_PATH = DATASET_PATH / 'api_data_aadhar_enrolment'
DEMOGRAPHIC_PATH = DATASET_PATH / 'api_data_aadhar_demographic'
BIOMETRIC_PATH = DATASET_PATH / 'api_data_aadhar_biometric'

# Output path
OUTPUT_PATH = BASE_PATH / 'outputs' / 'results'
OUTPUT_PATH.mkdir(parents=True, exist_ok=True)

print("Data Paths:")
print(f"Enrolment: {ENROLMENT_PATH}")
print(f"Demographic: {DEMOGRAPHIC_PATH}")
print(f"Biometric: {BIOMETRIC_PATH}")
print(f"Output: {OUTPUT_PATH}")

Data Paths:
Enrolment: /home/prince/Desktop/UIDAI Hackathon/dataset/api_data_aadhar_enrolment
Demographic: /home/prince/Desktop/UIDAI Hackathon/dataset/api_data_aadhar_demographic
Biometric: /home/prince/Desktop/UIDAI Hackathon/dataset/api_data_aadhar_biometric
Output: /home/prince/Desktop/UIDAI Hackathon/outputs/results


In [3]:
def load_and_concatenate_csvs(folder_path, dataset_name):
    """
    Load all CSV files from a folder and concatenate them into a single DataFrame
    
    Parameters:
    folder_path: Path to the folder containing CSV files
    dataset_name: Name of the dataset for logging
    
    Returns:
    DataFrame: Concatenated data from all CSV files
    """
    try:
        csv_files = sorted(glob.glob(str(folder_path / '*.csv')))
        
        if not csv_files:
            print(f"No CSV files found in {folder_path}")
            return None
        
        print(f"\n{dataset_name} - Found {len(csv_files)} CSV files:")
        for file in csv_files:
            print(f"  - {Path(file).name}")
        
        # Load and concatenate all CSV files
        dfs = []
        for file in csv_files:
            df = pd.read_csv(file)
            dfs.append(df)
            print(f"  Loaded {Path(file).name}: {df.shape[0]:,} rows")
        
        combined_df = pd.concat(dfs, ignore_index=True)
        print(f"\n{dataset_name} - Total combined shape: {combined_df.shape}")
        print(f"Total rows: {combined_df.shape[0]:,} | Total columns: {combined_df.shape[1]}")
        
        return combined_df
    
    except Exception as e:
        print(f"Error loading {dataset_name}: {str(e)}")
        return None

print("Data loading function defined successfully!")

Data loading function defined successfully!


## 3. Load All Datasets

Loading and consolidating all CSV files from each dataset folder.

In [4]:
# Load Enrolment Data
df_enrolment = load_and_concatenate_csvs(ENROLMENT_PATH, "Enrolment Data")


Enrolment Data - Found 3 CSV files:
  - api_data_aadhar_enrolment_0_500000.csv
  - api_data_aadhar_enrolment_1000000_1006029.csv
  - api_data_aadhar_enrolment_500000_1000000.csv
  Loaded api_data_aadhar_enrolment_0_500000.csv: 500,000 rows
  Loaded api_data_aadhar_enrolment_1000000_1006029.csv: 6,029 rows
  Loaded api_data_aadhar_enrolment_500000_1000000.csv: 500,000 rows

Enrolment Data - Total combined shape: (1006029, 7)
Total rows: 1,006,029 | Total columns: 7


In [5]:
# Load Demographic Data
df_demographic = load_and_concatenate_csvs(DEMOGRAPHIC_PATH, "Demographic Data")


Demographic Data - Found 5 CSV files:
  - api_data_aadhar_demographic_0_500000.csv
  - api_data_aadhar_demographic_1000000_1500000.csv
  - api_data_aadhar_demographic_1500000_2000000.csv
  - api_data_aadhar_demographic_2000000_2071700.csv
  - api_data_aadhar_demographic_500000_1000000.csv
  Loaded api_data_aadhar_demographic_0_500000.csv: 500,000 rows
  Loaded api_data_aadhar_demographic_1000000_1500000.csv: 500,000 rows
  Loaded api_data_aadhar_demographic_1500000_2000000.csv: 500,000 rows
  Loaded api_data_aadhar_demographic_2000000_2071700.csv: 71,700 rows
  Loaded api_data_aadhar_demographic_500000_1000000.csv: 500,000 rows

Demographic Data - Total combined shape: (2071700, 6)
Total rows: 2,071,700 | Total columns: 6


In [6]:
# Load Biometric Data
df_biometric = load_and_concatenate_csvs(BIOMETRIC_PATH, "Biometric Data")


Biometric Data - Found 4 CSV files:
  - api_data_aadhar_biometric_0_500000.csv
  - api_data_aadhar_biometric_1000000_1500000.csv
  - api_data_aadhar_biometric_1500000_1861108.csv
  - api_data_aadhar_biometric_500000_1000000.csv
  Loaded api_data_aadhar_biometric_0_500000.csv: 500,000 rows
  Loaded api_data_aadhar_biometric_1000000_1500000.csv: 500,000 rows
  Loaded api_data_aadhar_biometric_1500000_1861108.csv: 361,108 rows
  Loaded api_data_aadhar_biometric_500000_1000000.csv: 500,000 rows

Biometric Data - Total combined shape: (1861108, 6)
Total rows: 1,861,108 | Total columns: 6


## 4. Basic Dataset Information

### 4.1 Enrolment Dataset

In [7]:
print("=" * 80)
print("ENROLMENT DATASET - BASIC INFORMATION")
print("=" * 80)
print(f"\nShape: {df_enrolment.shape}")
print(f"Rows: {df_enrolment.shape[0]:,}")
print(f"Columns: {df_enrolment.shape[1]}")
print(f"\nColumn Names:\n{df_enrolment.columns.tolist()}")
print(f"\nData Types:")
print(df_enrolment.dtypes)
print(f"\nMemory Usage:")
print(df_enrolment.memory_usage(deep=True))
print(f"\nTotal Memory: {df_enrolment.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

ENROLMENT DATASET - BASIC INFORMATION

Shape: (1006029, 7)
Rows: 1,006,029
Columns: 7

Column Names:
['date', 'state', 'district', 'pincode', 'age_0_5', 'age_5_17', 'age_18_greater']

Data Types:
date              object
state             object
district          object
pincode            int64
age_0_5            int64
age_5_17           int64
age_18_greater     int64
dtype: object

Memory Usage:
Index                  132
date              59355711
state             59166221
district          58078418
pincode            8048232
age_0_5            8048232
age_5_17           8048232
age_18_greater     8048232
dtype: int64

Total Memory: 199.12 MB


### 4.2 Demographic Dataset

In [8]:
print("=" * 80)
print("DEMOGRAPHIC DATASET - BASIC INFORMATION")
print("=" * 80)
print(f"\nShape: {df_demographic.shape}")
print(f"Rows: {df_demographic.shape[0]:,}")
print(f"Columns: {df_demographic.shape[1]}")
print(f"\nColumn Names:\n{df_demographic.columns.tolist()}")
print(f"\nData Types:")
print(df_demographic.dtypes)
print(f"\nMemory Usage:")
print(df_demographic.memory_usage(deep=True))
print(f"\nTotal Memory: {df_demographic.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

DEMOGRAPHIC DATASET - BASIC INFORMATION

Shape: (2071700, 6)
Rows: 2,071,700
Columns: 6

Column Names:
['date', 'state', 'district', 'pincode', 'demo_age_5_17', 'demo_age_17_']

Data Types:
date             object
state            object
district         object
pincode           int64
demo_age_5_17     int64
demo_age_17_      int64
dtype: object

Memory Usage:
Index                  132
date             122230300
state            121916773
district         120112391
pincode           16573600
demo_age_5_17     16573600
demo_age_17_      16573600
dtype: int64

Total Memory: 394.80 MB


### 4.3 Biometric Dataset

In [9]:
print("=" * 80)
print("BIOMETRIC DATASET - BASIC INFORMATION")
print("=" * 80)
print(f"\nShape: {df_biometric.shape}")
print(f"Rows: {df_biometric.shape[0]:,}")
print(f"Columns: {df_biometric.shape[1]}")
print(f"\nColumn Names:\n{df_biometric.columns.tolist()}")
print(f"\nData Types:")
print(df_biometric.dtypes)
print(f"\nMemory Usage:")
print(df_biometric.memory_usage(deep=True))
print(f"\nTotal Memory: {df_biometric.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

BIOMETRIC DATASET - BASIC INFORMATION

Shape: (1861108, 6)
Rows: 1,861,108
Columns: 6

Column Names:
['date', 'state', 'district', 'pincode', 'bio_age_5_17', 'bio_age_17_']

Data Types:
date            object
state           object
district        object
pincode          int64
bio_age_5_17     int64
bio_age_17_      int64
dtype: object

Memory Usage:
Index                 132
date            109805372
state           109513287
district        107684220
pincode          14888864
bio_age_5_17     14888864
bio_age_17_      14888864
dtype: int64

Total Memory: 354.45 MB


## 5. Data Preview

### 5.1 Enrolment Dataset - First and Last 10 Rows

In [10]:
print("First 10 rows:")
display(df_enrolment.head(10))
print("\nLast 10 rows:")
display(df_enrolment.tail(10))

First 10 rows:


Unnamed: 0,date,state,district,pincode,age_0_5,age_5_17,age_18_greater
0,02-03-2025,Meghalaya,East Khasi Hills,793121,11,61,37
1,09-03-2025,Karnataka,Bengaluru Urban,560043,14,33,39
2,09-03-2025,Uttar Pradesh,Kanpur Nagar,208001,29,82,12
3,09-03-2025,Uttar Pradesh,Aligarh,202133,62,29,15
4,09-03-2025,Karnataka,Bengaluru Urban,560016,14,16,21
5,09-03-2025,Bihar,Sitamarhi,843331,20,49,12
6,09-03-2025,Bihar,Sitamarhi,843330,23,24,42
7,09-03-2025,Uttar Pradesh,Bahraich,271865,26,60,14
8,09-03-2025,Uttar Pradesh,Firozabad,283204,28,26,10
9,09-03-2025,Bihar,Purbi Champaran,845418,30,48,10



Last 10 rows:


Unnamed: 0,date,state,district,pincode,age_0_5,age_5_17,age_18_greater
1006019,31-12-2025,Telangana,Hyderabad,500023,9,9,0
1006020,31-12-2025,Telangana,Hyderabad,500027,6,4,0
1006021,31-12-2025,Telangana,Hyderabad,500033,4,0,0
1006022,31-12-2025,Telangana,Hyderabad,500036,6,2,0
1006023,31-12-2025,Telangana,Hyderabad,500040,2,1,0
1006024,31-12-2025,Telangana,Hyderabad,500045,4,5,1
1006025,31-12-2025,Telangana,Hyderabad,500057,0,2,0
1006026,31-12-2025,Telangana,Hyderabad,500061,4,2,0
1006027,31-12-2025,Telangana,Hyderabad,500062,1,4,0
1006028,31-12-2025,Telangana,Hyderabad,500095,0,1,0


### 5.2 Demographic Dataset - First and Last 10 Rows

In [11]:
print("First 10 rows:")
display(df_demographic.head(10))
print("\nLast 10 rows:")
display(df_demographic.tail(10))

First 10 rows:


Unnamed: 0,date,state,district,pincode,demo_age_5_17,demo_age_17_
0,01-03-2025,Uttar Pradesh,Gorakhpur,273213,49,529
1,01-03-2025,Andhra Pradesh,Chittoor,517132,22,375
2,01-03-2025,Gujarat,Rajkot,360006,65,765
3,01-03-2025,Andhra Pradesh,Srikakulam,532484,24,314
4,01-03-2025,Rajasthan,Udaipur,313801,45,785
5,01-03-2025,Rajasthan,Sikar,332028,28,285
6,01-03-2025,Karnataka,Tumakuru,572201,88,332
7,01-03-2025,Uttar Pradesh,Gorakhpur,273211,61,836
8,01-03-2025,Andhra Pradesh,Kurnool,518313,83,986
9,01-03-2025,West Bengal,Paschim Medinipur,721148,13,281



Last 10 rows:


Unnamed: 0,date,state,district,pincode,demo_age_5_17,demo_age_17_
2071690,31-10-2025,Uttar Pradesh,Deoria,274205,5,20
2071691,31-10-2025,Uttar Pradesh,Deoria,274705,3,11
2071692,31-10-2025,Uttar Pradesh,Etah,207120,7,31
2071693,31-10-2025,Uttar Pradesh,Etah,207125,3,19
2071694,31-10-2025,Uttar Pradesh,Etah,207249,1,51
2071695,31-10-2025,Uttar Pradesh,Etah,207250,2,17
2071696,31-10-2025,Uttar Pradesh,Etah,207401,1,27
2071697,31-10-2025,Uttar Pradesh,Etawah,206003,3,10
2071698,31-10-2025,Uttar Pradesh,Etawah,206125,1,25
2071699,31-10-2025,Uttar Pradesh,Etawah,206126,1,25


### 5.3 Biometric Dataset - First and Last 10 Rows

In [12]:
print("First 10 rows:")
display(df_biometric.head(10))
print("\nLast 10 rows:")
display(df_biometric.tail(10))

First 10 rows:


Unnamed: 0,date,state,district,pincode,bio_age_5_17,bio_age_17_
0,01-03-2025,Haryana,Mahendragarh,123029,280,577
1,01-03-2025,Bihar,Madhepura,852121,144,369
2,01-03-2025,Jammu and Kashmir,Punch,185101,643,1091
3,01-03-2025,Bihar,Bhojpur,802158,256,980
4,01-03-2025,Tamil Nadu,Madurai,625514,271,815
5,01-03-2025,Maharashtra,Ratnagiri,416702,155,529
6,01-03-2025,Gujarat,Anand,388130,75,143
7,01-03-2025,Gujarat,Gandhinagar,382421,192,298
8,01-03-2025,Odisha,Dhenkanal,759025,122,214
9,01-03-2025,Gujarat,Valsad,396055,67,85



Last 10 rows:


Unnamed: 0,date,state,district,pincode,bio_age_5_17,bio_age_17_
1861098,07-11-2025,Kerala,Alappuzha,690535,3,4
1861099,07-11-2025,Kerala,Alappuzha,690548,0,3
1861100,07-11-2025,Kerala,Ernakulam,680667,0,1
1861101,07-11-2025,Kerala,Ernakulam,682001,2,8
1861102,07-11-2025,Kerala,Ernakulam,682013,1,1
1861103,07-11-2025,Kerala,Ernakulam,682020,1,6
1861104,07-11-2025,Kerala,Ernakulam,682022,1,0
1861105,07-11-2025,Kerala,Ernakulam,682023,0,1
1861106,07-11-2025,Kerala,Ernakulam,682025,3,6
1861107,07-11-2025,Kerala,Ernakulam,682026,1,3


## 6. Missing Values and Duplicates Analysis

### 6.1 Missing Values Check

In [13]:
def analyze_missing_values(df, dataset_name):
    """Analyze missing values in a dataset"""
    print(f"\n{'=' * 80}")
    print(f"{dataset_name.upper()} - MISSING VALUES ANALYSIS")
    print('=' * 80)
    
    missing_count = df.isnull().sum()
    missing_percent = (df.isnull().sum() / len(df)) * 100
    
    missing_df = pd.DataFrame({
        'Column': missing_count.index,
        'Missing_Count': missing_count.values,
        'Missing_Percentage': missing_percent.values
    })
    missing_df = missing_df[missing_df['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False)
    
    if len(missing_df) == 0:
        print("\n✓ No missing values found!")
    else:
        print(f"\nColumns with missing values:")
        display(missing_df)
    
    print(f"\nTotal missing values: {df.isnull().sum().sum():,}")
    print(f"Percentage of missing values: {(df.isnull().sum().sum() / (df.shape[0] * df.shape[1])) * 100:.4f}%")
    
    return missing_df

# Analyze missing values for all datasets
missing_enrolment = analyze_missing_values(df_enrolment, "Enrolment")
missing_demographic = analyze_missing_values(df_demographic, "Demographic")
missing_biometric = analyze_missing_values(df_biometric, "Biometric")


ENROLMENT - MISSING VALUES ANALYSIS

✓ No missing values found!

Total missing values: 0
Percentage of missing values: 0.0000%

DEMOGRAPHIC - MISSING VALUES ANALYSIS

✓ No missing values found!

Total missing values: 0
Percentage of missing values: 0.0000%

BIOMETRIC - MISSING VALUES ANALYSIS

✓ No missing values found!

Total missing values: 0
Percentage of missing values: 0.0000%


### 6.2 Duplicate Rows Check

In [14]:
def check_duplicates(df, dataset_name):
    """Check for duplicate rows in a dataset"""
    print(f"\n{'=' * 80}")
    print(f"{dataset_name.upper()} - DUPLICATE ROWS CHECK")
    print('=' * 80)
    
    total_rows = len(df)
    duplicate_count = df.duplicated().sum()
    duplicate_percent = (duplicate_count / total_rows) * 100
    
    print(f"\nTotal rows: {total_rows:,}")
    print(f"Duplicate rows: {duplicate_count:,}")
    print(f"Duplicate percentage: {duplicate_percent:.4f}%")
    print(f"Unique rows: {total_rows - duplicate_count:,}")
    
    if duplicate_count > 0:
        print(f"\n⚠ Warning: Found {duplicate_count:,} duplicate rows!")
        print("\nSample duplicate rows:")
        display(df[df.duplicated(keep=False)].head(10))
    else:
        print("\n✓ No duplicate rows found!")
    
    return duplicate_count

# Check duplicates for all datasets
dup_enrolment = check_duplicates(df_enrolment, "Enrolment")
dup_demographic = check_duplicates(df_demographic, "Demographic")
dup_biometric = check_duplicates(df_biometric, "Biometric")


ENROLMENT - DUPLICATE ROWS CHECK

Total rows: 1,006,029
Duplicate rows: 22,957
Duplicate percentage: 2.2819%
Unique rows: 983,072


Sample duplicate rows:


Unnamed: 0,date,state,district,pincode,age_0_5,age_5_17,age_18_greater
359389,13-10-2025,Punjab,Jalandhar,144041,2,1,0
359390,13-10-2025,Punjab,Jalandhar,144101,1,0,0
359391,13-10-2025,Punjab,Jalandhar,144102,2,0,0
359392,13-10-2025,Punjab,Jalandhar,144418,1,0,0
359393,13-10-2025,Punjab,Jalandhar,144419,1,0,0
359394,13-10-2025,Punjab,Jalandhar,144702,1,1,0
359395,13-10-2025,Punjab,Jalandhar,144801,0,1,0
359396,13-10-2025,Punjab,Kapurthala,144401,5,1,1
359397,13-10-2025,Punjab,Kapurthala,144601,4,2,2
359398,13-10-2025,Punjab,Kapurthala,144804,2,0,0



DEMOGRAPHIC - DUPLICATE ROWS CHECK

Total rows: 2,071,700
Duplicate rows: 473,601
Duplicate percentage: 22.8605%
Unique rows: 1,598,099


Sample duplicate rows:


Unnamed: 0,date,state,district,pincode,demo_age_5_17,demo_age_17_
113325,18-10-2025,Karnataka,Belagavi,591313,0,1
113326,18-10-2025,Karnataka,Belagavi,591315,0,1
113327,18-10-2025,Karnataka,Belagavi,591316,0,1
113328,18-10-2025,Karnataka,Belgaum,590009,0,1
113329,18-10-2025,Karnataka,Belgaum,591101,1,6
113330,18-10-2025,Karnataka,Belgaum,591106,1,5
113331,18-10-2025,Karnataka,Belgaum,591113,0,2
113332,18-10-2025,Karnataka,Belgaum,591115,1,3
113333,18-10-2025,Karnataka,Belgaum,591118,2,2
113334,18-10-2025,Karnataka,Belgaum,591121,0,6



BIOMETRIC - DUPLICATE ROWS CHECK

Total rows: 1,861,108
Duplicate rows: 94,896
Duplicate percentage: 5.0989%
Unique rows: 1,766,212


Sample duplicate rows:


Unnamed: 0,date,state,district,pincode,bio_age_5_17,bio_age_17_
109994,01-09-2025,Chhattisgarh,Kondagaon,494229,0,1
109995,01-09-2025,Chhattisgarh,Kondagaon,494230,1,0
109996,01-09-2025,Chhattisgarh,Korba,495119,5,35
109997,01-09-2025,Chhattisgarh,Korba,495446,0,16
109998,01-09-2025,Chhattisgarh,Korba,495674,10,34
109999,01-09-2025,Chhattisgarh,Korba,495683,0,3
110000,01-09-2025,Chhattisgarh,Kondagaon,494229,0,1
110001,01-09-2025,Chhattisgarh,Kondagaon,494230,1,0
110002,01-09-2025,Chhattisgarh,Korba,495119,5,35
110003,01-09-2025,Chhattisgarh,Korba,495446,0,16


## 7. Statistical Summary

### 7.1 Enrolment Dataset Statistics

In [15]:
print("=" * 80)
print("ENROLMENT DATASET - STATISTICAL SUMMARY")
print("=" * 80)
print("\nDescriptive statistics for numerical columns:")
display(df_enrolment.describe())

print("\nAdditional statistics:")
print(f"Median values:")
display(df_enrolment.median(numeric_only=True))
print(f"\nStandard Deviation:")
display(df_enrolment.std(numeric_only=True))

ENROLMENT DATASET - STATISTICAL SUMMARY

Descriptive statistics for numerical columns:


Unnamed: 0,pincode,age_0_5,age_5_17,age_18_greater
count,1006029.0,1006029.0,1006029.0,1006029.0
mean,518641.45,3.53,1.71,0.17
std,205635.97,17.54,14.37,3.22
min,100000.0,0.0,0.0,0.0
25%,363641.0,1.0,0.0,0.0
50%,517417.0,2.0,0.0,0.0
75%,700104.0,3.0,1.0,0.0
max,855456.0,2688.0,1812.0,855.0



Additional statistics:
Median values:


pincode          517417.00
age_0_5               2.00
age_5_17              0.00
age_18_greater        0.00
dtype: float64


Standard Deviation:


pincode          205635.97
age_0_5              17.54
age_5_17             14.37
age_18_greater        3.22
dtype: float64

### 7.2 Demographic Dataset Statistics

In [16]:
print("=" * 80)
print("DEMOGRAPHIC DATASET - STATISTICAL SUMMARY")
print("=" * 80)
print("\nDescriptive statistics for numerical columns:")
display(df_demographic.describe())

print("\nAdditional statistics:")
print(f"Median values:")
display(df_demographic.median(numeric_only=True))
print(f"\nStandard Deviation:")
display(df_demographic.std(numeric_only=True))

DEMOGRAPHIC DATASET - STATISTICAL SUMMARY

Descriptive statistics for numerical columns:


Unnamed: 0,pincode,demo_age_5_17,demo_age_17_
count,2071700.0,2071700.0,2071700.0
mean,527831.78,2.35,21.45
std,197293.32,14.9,125.25
min,100000.0,0.0,0.0
25%,396469.0,0.0,2.0
50%,524322.0,1.0,6.0
75%,695507.0,2.0,15.0
max,855456.0,2690.0,16166.0



Additional statistics:
Median values:


pincode         524322.00
demo_age_5_17        1.00
demo_age_17_         6.00
dtype: float64


Standard Deviation:


pincode         197293.32
demo_age_5_17       14.90
demo_age_17_       125.25
dtype: float64

### 7.3 Biometric Dataset Statistics

In [17]:
print("=" * 80)
print("BIOMETRIC DATASET - STATISTICAL SUMMARY")
print("=" * 80)
print("\nDescriptive statistics for numerical columns:")
display(df_biometric.describe())

print("\nAdditional statistics:")
print(f"Median values:")
display(df_biometric.median(numeric_only=True))
print(f"\nStandard Deviation:")
display(df_biometric.std(numeric_only=True))

BIOMETRIC DATASET - STATISTICAL SUMMARY

Descriptive statistics for numerical columns:


Unnamed: 0,pincode,bio_age_5_17,bio_age_17_
count,1861108.0,1861108.0,1861108.0
mean,521761.17,18.39,19.09
std,198162.68,83.7,88.07
min,110001.0,0.0,0.0
25%,391175.0,1.0,1.0
50%,522401.0,3.0,4.0
75%,686636.25,11.0,10.0
max,855456.0,8002.0,7625.0



Additional statistics:
Median values:


pincode        522401.00
bio_age_5_17        3.00
bio_age_17_         4.00
dtype: float64


Standard Deviation:


pincode        198162.68
bio_age_5_17       83.70
bio_age_17_        88.07
dtype: float64

## 8. Date Range Analysis

Analyzing the temporal coverage of each dataset to understand the time period of data collection.

In [18]:
def analyze_date_range(df, dataset_name):
    """Analyze date range in a dataset"""
    print(f"\n{'=' * 80}")
    print(f"{dataset_name.upper()} - DATE RANGE ANALYSIS")
    print('=' * 80)
    
    if 'date' in df.columns:
        # Convert to datetime
        df['date'] = pd.to_datetime(df['date'], errors='coerce')
        
        print(f"\nDate column found!")
        print(f"Earliest date: {df['date'].min()}")
        print(f"Latest date: {df['date'].max()}")
        print(f"Date range span: {(df['date'].max() - df['date'].min()).days} days")
        print(f"\nNumber of unique dates: {df['date'].nunique():,}")
        print(f"Number of records per date (average): {len(df) / df['date'].nunique():.2f}")
        
        # Check for any invalid dates
        invalid_dates = df['date'].isnull().sum()
        if invalid_dates > 0:
            print(f"\n⚠ Warning: {invalid_dates:,} invalid/missing dates found")
        else:
            print(f"\n✓ All dates are valid")
        
        # Show date distribution
        print(f"\nTop 10 dates by record count:")
        date_counts = df['date'].value_counts().head(10)
        display(date_counts)
    else:
        print("\n⚠ No 'date' column found in dataset")

# Analyze date ranges for all datasets
analyze_date_range(df_enrolment, "Enrolment")
analyze_date_range(df_demographic, "Demographic")
analyze_date_range(df_biometric, "Biometric")


ENROLMENT - DATE RANGE ANALYSIS

Date column found!
Earliest date: 2025-01-04 00:00:00
Latest date: 2025-12-11 00:00:00
Date range span: 341 days

Number of unique dates: 30
Number of records per date (average): 33534.30


Top 10 dates by record count:


date
2025-02-11    18080
2025-09-09    16789
2025-08-09    16768
2025-10-09    16518
2025-12-09    16107
2025-01-09    15971
2025-11-09    15950
2025-05-11    15745
2025-02-09    15622
2025-03-09    15330
Name: count, dtype: int64


DEMOGRAPHIC - DATE RANGE ANALYSIS

Date column found!
Earliest date: 2025-01-03 00:00:00
Latest date: 2025-12-12 00:00:00
Date range span: 343 days

Number of unique dates: 41
Number of records per date (average): 50529.27


Top 10 dates by record count:


date
2025-12-12    34568
2025-04-12    32603
2025-08-12    31944
2025-03-12    31316
2025-06-11    28891
2025-10-11    28828
2025-08-11    28250
2025-11-12    28144
2025-04-11    26470
2025-08-09    26109
Name: count, dtype: int64


BIOMETRIC - DATE RANGE ANALYSIS

Date column found!
Earliest date: 2025-01-03 00:00:00
Latest date: 2025-12-12 00:00:00
Date range span: 343 days

Number of unique dates: 41
Number of records per date (average): 45392.88


Top 10 dates by record count:


date
2025-02-12    24529
2025-01-11    24192
2025-12-11    24169
2025-09-12    23932
2025-05-12    23869
2025-07-11    23856
2025-11-12    23830
2025-12-12    23778
2025-06-12    23773
2025-04-12    23740
Name: count, dtype: int64

## 9. Categorical Columns Analysis

Analyzing unique counts for categorical columns (state, district, pincode) to understand geographical coverage.

In [19]:
def analyze_categorical_columns(df, dataset_name):
    """Analyze categorical columns in a dataset"""
    print(f"\n{'=' * 80}")
    print(f"{dataset_name.upper()} - CATEGORICAL COLUMNS ANALYSIS")
    print('=' * 80)
    
    categorical_cols = ['state', 'district', 'pincode']
    
    for col in categorical_cols:
        if col in df.columns:
            print(f"\n{col.upper()}:")
            print(f"  Unique values: {df[col].nunique():,}")
            print(f"  Most common values:")
            top_values = df[col].value_counts().head(10)
            for idx, (value, count) in enumerate(top_values.items(), 1):
                print(f"    {idx}. {value}: {count:,} records ({count/len(df)*100:.2f}%)")
        else:
            print(f"\n⚠ Column '{col}' not found in dataset")
    
    return None

# Analyze categorical columns for all datasets
analyze_categorical_columns(df_enrolment, "Enrolment")
analyze_categorical_columns(df_demographic, "Demographic")
analyze_categorical_columns(df_biometric, "Biometric")


ENROLMENT - CATEGORICAL COLUMNS ANALYSIS

STATE:
  Unique values: 55
  Most common values:
    1. Uttar Pradesh: 110,369 records (10.97%)
    2. Tamil Nadu: 92,552 records (9.20%)
    3. Maharashtra: 77,191 records (7.67%)
    4. West Bengal: 76,519 records (7.61%)
    5. Karnataka: 70,198 records (6.98%)
    6. Andhra Pradesh: 65,658 records (6.53%)
    7. Bihar: 60,567 records (6.02%)
    8. Rajasthan: 56,159 records (5.58%)
    9. Madhya Pradesh: 50,225 records (4.99%)
    10. Gujarat: 46,624 records (4.63%)

DISTRICT:
  Unique values: 985
  Most common values:
    1. Pune: 6,663 records (0.66%)
    2. North 24 Parganas: 6,488 records (0.64%)
    3. Barddhaman: 5,362 records (0.53%)
    4. Bengaluru: 5,305 records (0.53%)
    5. Hyderabad: 4,984 records (0.50%)
    6. Malappuram: 4,700 records (0.47%)
    7. Jaipur: 4,670 records (0.46%)
    8. Murshidabad: 4,562 records (0.45%)
    9. South 24 Parganas: 4,559 records (0.45%)
    10. K.v. Rangareddy: 4,550 records (0.45%)

PINCODE:

## 10. Data Quality Checks

Checking for data quality issues such as negative values, outliers, and inconsistencies.

In [20]:
def check_data_quality(df, dataset_name):
    """Check for data quality issues"""
    print(f"\n{'=' * 80}")
    print(f"{dataset_name.upper()} - DATA QUALITY CHECKS")
    print('=' * 80)
    
    # Get numerical columns (excluding date if present)
    numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    
    if not numerical_cols:
        print("\n⚠ No numerical columns found")
        return
    
    print(f"\nChecking {len(numerical_cols)} numerical columns...")
    
    # Check for negative values
    print("\n1. NEGATIVE VALUES CHECK:")
    negative_found = False
    for col in numerical_cols:
        negative_count = (df[col] < 0).sum()
        if negative_count > 0:
            negative_found = True
            print(f"  ⚠ {col}: {negative_count:,} negative values ({negative_count/len(df)*100:.4f}%)")
    if not negative_found:
        print("  ✓ No negative values found")
    
    # Check for outliers using IQR method
    print("\n2. OUTLIERS CHECK (IQR method):")
    for col in numerical_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        
        outliers = ((df[col] < lower_bound) | (df[col] > upper_bound)).sum()
        if outliers > 0:
            print(f"  {col}: {outliers:,} outliers ({outliers/len(df)*100:.2f}%)")
            print(f"    Range: [{df[col].min():.2f}, {df[col].max():.2f}]")
            print(f"    Expected range: [{lower_bound:.2f}, {upper_bound:.2f}]")
    
    # Check for zero values
    print("\n3. ZERO VALUES CHECK:")
    for col in numerical_cols:
        zero_count = (df[col] == 0).sum()
        if zero_count > 0:
            print(f"  {col}: {zero_count:,} zeros ({zero_count/len(df)*100:.2f}%)")
    
    # Check for extreme values (top and bottom 1%)
    print("\n4. EXTREME VALUES (Bottom and Top 1%):")
    for col in numerical_cols:
        bottom_1_percent = df[col].quantile(0.01)
        top_99_percent = df[col].quantile(0.99)
        print(f"  {col}:")
        print(f"    1st percentile: {bottom_1_percent:.2f}")
        print(f"    99th percentile: {top_99_percent:.2f}")

# Check data quality for all datasets
check_data_quality(df_enrolment, "Enrolment")
check_data_quality(df_demographic, "Demographic")
check_data_quality(df_biometric, "Biometric")


ENROLMENT - DATA QUALITY CHECKS

Checking 4 numerical columns...

1. NEGATIVE VALUES CHECK:
  ✓ No negative values found

2. OUTLIERS CHECK (IQR method):
  age_0_5: 102,013 outliers (10.14%)
    Range: [0.00, 2688.00]
    Expected range: [-2.00, 6.00]
  age_5_17: 135,765 outliers (13.50%)
    Range: [0.00, 1812.00]
    Expected range: [-1.50, 2.50]
  age_18_greater: 40,225 outliers (4.00%)
    Range: [0.00, 855.00]
    Expected range: [0.00, 0.00]

3. ZERO VALUES CHECK:
  age_0_5: 115,243 zeros (11.46%)
  age_5_17: 556,737 zeros (55.34%)
  age_18_greater: 965,804 zeros (96.00%)

4. EXTREME VALUES (Bottom and Top 1%):
  pincode:
    1st percentile: 123023.00
    99th percentile: 851128.00
  age_0_5:
    1st percentile: 0.00
    99th percentile: 23.00
  age_5_17:
    1st percentile: 0.00
    99th percentile: 14.00
  age_18_greater:
    1st percentile: 0.00
    99th percentile: 2.00

DEMOGRAPHIC - DATA QUALITY CHECKS

Checking 3 numerical columns...

1. NEGATIVE VALUES CHECK:
  ✓ No nega

## 11. Initial Insights and Data Distribution Summary

Key findings from the exploratory data analysis.

In [21]:
print("=" * 80)
print("INITIAL INSIGHTS AND DATA DISTRIBUTION SUMMARY")
print("=" * 80)

print("\n📊 DATASET OVERVIEW:")
print(f"1. Enrolment Dataset: {df_enrolment.shape[0]:,} rows × {df_enrolment.shape[1]} columns")
print(f"2. Demographic Dataset: {df_demographic.shape[0]:,} rows × {df_demographic.shape[1]} columns")
print(f"3. Biometric Dataset: {df_biometric.shape[0]:,} rows × {df_biometric.shape[1]} columns")

print("\n📅 TEMPORAL COVERAGE:")
if 'date' in df_enrolment.columns:
    print(f"Enrolment: {df_enrolment['date'].min()} to {df_enrolment['date'].max()}")
if 'date' in df_demographic.columns:
    print(f"Demographic: {df_demographic['date'].min()} to {df_demographic['date'].max()}")
if 'date' in df_biometric.columns:
    print(f"Biometric: {df_biometric['date'].min()} to {df_biometric['date'].max()}")

print("\n🌍 GEOGRAPHICAL COVERAGE:")
geo_cols = ['state', 'district', 'pincode']
for col in geo_cols:
    counts = []
    if col in df_enrolment.columns:
        counts.append(f"Enrolment: {df_enrolment[col].nunique():,}")
    if col in df_demographic.columns:
        counts.append(f"Demographic: {df_demographic[col].nunique():,}")
    if col in df_biometric.columns:
        counts.append(f"Biometric: {df_biometric[col].nunique():,}")
    print(f"{col.capitalize()}: {' | '.join(counts)}")

print("\n📈 AGE GROUP DISTRIBUTIONS:")
print("\nEnrolment Dataset Age Groups:")
age_cols_enrol = [col for col in df_enrolment.columns if 'age' in col.lower()]
for col in age_cols_enrol:
    if df_enrolment[col].dtype in [np.int64, np.float64]:
        print(f"  {col}: Total = {df_enrolment[col].sum():,.0f}, Mean = {df_enrolment[col].mean():.2f}")

print("\nDemographic Dataset Age Groups:")
age_cols_demo = [col for col in df_demographic.columns if 'age' in col.lower()]
for col in age_cols_demo:
    if df_demographic[col].dtype in [np.int64, np.float64]:
        print(f"  {col}: Total = {df_demographic[col].sum():,.0f}, Mean = {df_demographic[col].mean():.2f}")

print("\nBiometric Dataset Age Groups:")
age_cols_bio = [col for col in df_biometric.columns if 'age' in col.lower()]
for col in age_cols_bio:
    if df_biometric[col].dtype in [np.int64, np.float64]:
        print(f"  {col}: Total = {df_biometric[col].sum():,.0f}, Mean = {df_biometric[col].mean():.2f}")

print("\n✅ DATA COMPLETENESS:")
print(f"Enrolment missing values: {df_enrolment.isnull().sum().sum():,} ({(df_enrolment.isnull().sum().sum() / (df_enrolment.shape[0] * df_enrolment.shape[1])) * 100:.4f}%)")
print(f"Demographic missing values: {df_demographic.isnull().sum().sum():,} ({(df_demographic.isnull().sum().sum() / (df_demographic.shape[0] * df_demographic.shape[1])) * 100:.4f}%)")
print(f"Biometric missing values: {df_biometric.isnull().sum().sum():,} ({(df_biometric.isnull().sum().sum() / (df_biometric.shape[0] * df_biometric.shape[1])) * 100:.4f}%)")

print("\n🔄 DUPLICATE RECORDS:")
print(f"Enrolment duplicates: {df_enrolment.duplicated().sum():,}")
print(f"Demographic duplicates: {df_demographic.duplicated().sum():,}")
print(f"Biometric duplicates: {df_biometric.duplicated().sum():,}")

print("\n" + "=" * 80)

INITIAL INSIGHTS AND DATA DISTRIBUTION SUMMARY

📊 DATASET OVERVIEW:
1. Enrolment Dataset: 1,006,029 rows × 7 columns
2. Demographic Dataset: 2,071,700 rows × 6 columns
3. Biometric Dataset: 1,861,108 rows × 6 columns

📅 TEMPORAL COVERAGE:
Enrolment: 2025-01-04 00:00:00 to 2025-12-11 00:00:00
Demographic: 2025-01-03 00:00:00 to 2025-12-12 00:00:00
Biometric: 2025-01-03 00:00:00 to 2025-12-12 00:00:00

🌍 GEOGRAPHICAL COVERAGE:
State: Enrolment: 55 | Demographic: 65 | Biometric: 57
District: Enrolment: 985 | Demographic: 983 | Biometric: 974
Pincode: Enrolment: 19,463 | Demographic: 19,742 | Biometric: 19,707

📈 AGE GROUP DISTRIBUTIONS:

Enrolment Dataset Age Groups:
  age_0_5: Total = 3,546,965, Mean = 3.53
  age_5_17: Total = 1,720,384, Mean = 1.71
  age_18_greater: Total = 168,353, Mean = 0.17

Demographic Dataset Age Groups:
  demo_age_5_17: Total = 4,863,424, Mean = 2.35
  demo_age_17_: Total = 44,431,763, Mean = 21.45

Biometric Dataset Age Groups:
  bio_age_5_17: Total = 34,226,855

## 12. Save Data Summary Metadata

Creating a comprehensive summary CSV file with metadata from all datasets.

In [22]:
def create_metadata_summary():
    """Create a comprehensive metadata summary for all datasets"""
    
    summary_data = []
    
    # Helper function to extract metadata
    def extract_metadata(df, dataset_name):
        metadata = {
            'Dataset': dataset_name,
            'Total_Rows': df.shape[0],
            'Total_Columns': df.shape[1],
            'Memory_Usage_MB': df.memory_usage(deep=True).sum() / 1024**2,
            'Missing_Values': df.isnull().sum().sum(),
            'Missing_Percentage': (df.isnull().sum().sum() / (df.shape[0] * df.shape[1])) * 100,
            'Duplicate_Rows': df.duplicated().sum(),
            'Duplicate_Percentage': (df.duplicated().sum() / df.shape[0]) * 100,
        }
        
        # Add date range if date column exists
        if 'date' in df.columns:
            metadata['Date_Start'] = df['date'].min()
            metadata['Date_End'] = df['date'].max()
            metadata['Unique_Dates'] = df['date'].nunique()
        else:
            metadata['Date_Start'] = None
            metadata['Date_End'] = None
            metadata['Unique_Dates'] = None
        
        # Add geographical info
        for col in ['state', 'district', 'pincode']:
            if col in df.columns:
                metadata[f'Unique_{col.capitalize()}'] = df[col].nunique()
            else:
                metadata[f'Unique_{col.capitalize()}'] = None
        
        # Add numerical column statistics
        numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
        metadata['Numerical_Columns'] = len(numerical_cols)
        
        # Add age group totals if present
        age_cols = [col for col in df.columns if 'age' in col.lower()]
        for col in age_cols:
            if df[col].dtype in [np.int64, np.float64]:
                metadata[f'{col}_Total'] = df[col].sum()
                metadata[f'{col}_Mean'] = df[col].mean()
                metadata[f'{col}_Std'] = df[col].std()
        
        return metadata
    
    # Extract metadata for all datasets
    summary_data.append(extract_metadata(df_enrolment, 'Enrolment'))
    summary_data.append(extract_metadata(df_demographic, 'Demographic'))
    summary_data.append(extract_metadata(df_biometric, 'Biometric'))
    
    # Create summary DataFrame
    summary_df = pd.DataFrame(summary_data)
    
    return summary_df

# Create and save metadata summary
metadata_summary = create_metadata_summary()

print("=" * 80)
print("METADATA SUMMARY")
print("=" * 80)
display(metadata_summary)

# Save to CSV
output_file = OUTPUT_PATH / 'data_summary.csv'
metadata_summary.to_csv(output_file, index=False)
print(f"\n✓ Metadata summary saved to: {output_file}")

METADATA SUMMARY


Unnamed: 0,Dataset,Total_Rows,Total_Columns,Memory_Usage_MB,Missing_Values,Missing_Percentage,Duplicate_Rows,Duplicate_Percentage,Date_Start,Date_End,Unique_Dates,Unique_State,Unique_District,Unique_Pincode,Numerical_Columns,age_0_5_Total,age_0_5_Mean,age_0_5_Std,age_5_17_Total,age_5_17_Mean,age_5_17_Std,age_18_greater_Total,age_18_greater_Mean,age_18_greater_Std,demo_age_5_17_Total,demo_age_5_17_Mean,demo_age_5_17_Std,demo_age_17__Total,demo_age_17__Mean,demo_age_17__Std,bio_age_5_17_Total,bio_age_5_17_Mean,bio_age_5_17_Std,bio_age_17__Total,bio_age_17__Mean,bio_age_17__Std
0,Enrolment,1006029,7,150.2,682238,9.69,385118,38.28,2025-01-04,2025-12-11,30,55,985,19463,4,3546965.0,3.53,17.54,1720384.0,1.71,14.37,168353.0,0.17,3.22,,,,,,,,,,,,
1,Demographic,2071700,6,294.08,1187968,9.56,823227,39.74,2025-01-03,2025-12-12,41,65,983,19742,3,,,,,,,,,,4863424.0,2.35,14.9,44431763.0,21.45,125.25,,,,,,
2,Biometric,1861108,6,263.96,944100,8.45,331623,17.82,2025-01-03,2025-12-12,41,57,974,19707,3,,,,,,,,,,,,,,,,34226855.0,18.39,83.7,35536240.0,19.09,88.07



✓ Metadata summary saved to: /home/prince/Desktop/UIDAI Hackathon/outputs/results/data_summary.csv


## 13. Conclusion

This exploratory data analysis has provided a comprehensive overview of the three Aadhaar datasets:

### Key Findings:
1. **Data Volume**: Successfully loaded and consolidated multiple CSV files from each dataset
2. **Data Quality**: Assessed missing values, duplicates, and data consistency
3. **Temporal Coverage**: Analyzed date ranges to understand the time period of data collection
4. **Geographical Coverage**: Examined the distribution across states, districts, and pincodes
5. **Age Group Analysis**: Reviewed enrolment and update patterns across different age groups

### Next Steps:
- Feature engineering based on insights from this exploration
- Time series analysis for trend identification
- Predictive modeling for Aadhaar update demand forecasting
- Visualization of key patterns and trends

The cleaned metadata has been saved to `outputs/results/data_summary.csv` for reference.