# UIDAI Data Hackathon – Data Understanding

**Author:** Ritesh Verma

**Role:** Solo Participant  
**Hackathon:** UIDAI Data Hackathon  

## Objective
This notebook focuses on:
- Loading all UIDAI datasets
- Understanding their structure and schema
- Performing basic sanity checks
- Validating data quality
- Documenting first observations

No transformations or modeling are performed here; the combined datasets are saved unchanged for use in later notebooks.

In [None]:
import pandas as pd
import numpy as np

# Display settings
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 120)

print("Libraries loaded successfully.")

Libraries loaded successfully.


In [None]:
# Enrolment dataset files
enrolment_files = [
    "../data/raw/api_data_aadhar_enrolment/api_data_aadhar_enrolment_0_500000.csv",
    "../data/raw/api_data_aadhar_enrolment/api_data_aadhar_enrolment_500000_1000000.csv",
    "../data/raw/api_data_aadhar_enrolment/api_data_aadhar_enrolment_1000000_1006029.csv"
]

# Biometric update dataset files
biometric_files = [
    "../data/raw/api_data_aadhar_biometric/api_data_aadhar_biometric_0_500000.csv",
    "../data/raw/api_data_aadhar_biometric/api_data_aadhar_biometric_500000_1000000.csv",
    "../data/raw/api_data_aadhar_biometric/api_data_aadhar_biometric_1000000_1500000.csv",
    "../data/raw/api_data_aadhar_biometric/api_data_aadhar_biometric_1500000_1861108.csv"
]

# Demographic update dataset files
demographic_files = [
    "../data/raw/api_data_aadhar_demographic/api_data_aadhar_demographic_0_500000.csv",
    "../data/raw/api_data_aadhar_demographic/api_data_aadhar_demographic_500000_1000000.csv",
    "../data/raw/api_data_aadhar_demographic/api_data_aadhar_demographic_1000000_1500000.csv",
    "../data/raw/api_data_aadhar_demographic/api_data_aadhar_demographic_1500000_2000000.csv",
    "../data/raw/api_data_aadhar_demographic/api_data_aadhar_demographic_2000000_2071700.csv"
]

In [4]:
def load_and_combine(file_list):
    """Read each CSV chunk and stack the chunks into one DataFrame with a fresh index."""
    df_list = [pd.read_csv(file) for file in file_list]
    return pd.concat(df_list, ignore_index=True)

enrolment_df = load_and_combine(enrolment_files)
biometric_df = load_and_combine(biometric_files)
demographic_df = load_and_combine(demographic_files)

print("Datasets loaded successfully.")

Datasets loaded successfully.


In [5]:
print("Enrolment shape:", enrolment_df.shape)
print("Biometric shape:", biometric_df.shape)
print("Demographic shape:", demographic_df.shape)

Enrolment shape: (1006029, 7)
Biometric shape: (1861108, 6)
Demographic shape: (2071700, 6)
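
The chunk file names appear to encode a row range (for example `_0_500000`), so the combined shapes can be cross-checked against them. The following is a minimal sketch, assuming the two trailing integers in each name are start and end row offsets; `expected_rows` is a hypothetical helper that is not part of the original notebook.

```python
import re

def expected_rows(file_list):
    """Sum the row counts implied by names ending in `_<start>_<end>.csv`."""
    total = 0
    for path in file_list:
        match = re.search(r"_(\d+)_(\d+)\.csv$", path)
        if match is None:
            raise ValueError(f"No row range found in file name: {path}")
        start, end = int(match.group(1)), int(match.group(2))
        total += end - start
    return total

# Enrolment chunk names as listed in the file-list cell (directory prefix omitted)
enrolment_names = [
    "api_data_aadhar_enrolment_0_500000.csv",
    "api_data_aadhar_enrolment_500000_1000000.csv",
    "api_data_aadhar_enrolment_1000000_1006029.csv",
]
print(expected_rows(enrolment_names))
```

Under that assumption the enrolment names imply 1,006,029 rows, which agrees with the reported enrolment shape. The same check could be applied to the biometric and demographic lists.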


In [6]:
print("Enrolment columns:")
display(enrolment_df.columns)

print("\nBiometric columns:")
display(biometric_df.columns)

print("\nDemographic columns:")
display(demographic_df.columns)

Enrolment columns:


Index(['date', 'state', 'district', 'pincode', 'age_0_5', 'age_5_17', 'age_18_greater'], dtype='object')


Biometric columns:


Index(['date', 'state', 'district', 'pincode', 'bio_age_5_17', 'bio_age_17_'], dtype='object')


Demographic columns:


Index(['date', 'state', 'district', 'pincode', 'demo_age_5_17', 'demo_age_17_'], dtype='object')
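
To make the schema comparison explicit, the column lists printed above can be split into shared keys and dataset-specific measures. This is a small sketch that hard-codes the column names from the output; the variable names are illustrative only.

```python
# Column lists copied from the notebook output above
schemas = {
    "Enrolment": ["date", "state", "district", "pincode",
                  "age_0_5", "age_5_17", "age_18_greater"],
    "Biometric": ["date", "state", "district", "pincode",
                  "bio_age_5_17", "bio_age_17_"],
    "Demographic": ["date", "state", "district", "pincode",
                    "demo_age_5_17", "demo_age_17_"],
}

# Columns present in every dataset, kept in their original order
shared = set.intersection(*(set(cols) for cols in schemas.values()))
shared_keys = [col for col in schemas["Enrolment"] if col in shared]
print("Shared keys:", shared_keys)

# Whatever is left is specific to one dataset
specific = {
    name: [col for col in cols if col not in shared]
    for name, cols in schemas.items()
}
for name, cols in specific.items():
    print(f"{name} measures:", cols)
```

The four shared columns (`date`, `state`, `district`, `pincode`) are the natural join keys. Only the enrolment data has a 0 to 5 age bucket, and its adult bucket is named `age_18_greater` while the update datasets use a `_17_` suffix, so the age buckets should be reconciled before the datasets are compared directly.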

In [7]:
display(enrolment_df.head())
display(biometric_df.head())
display(demographic_df.head())

Unnamed: 0,date,state,district,pincode,age_0_5,age_5_17,age_18_greater
0,02-03-2025,Meghalaya,East Khasi Hills,793121,11,61,37
1,09-03-2025,Karnataka,Bengaluru Urban,560043,14,33,39
2,09-03-2025,Uttar Pradesh,Kanpur Nagar,208001,29,82,12
3,09-03-2025,Uttar Pradesh,Aligarh,202133,62,29,15
4,09-03-2025,Karnataka,Bengaluru Urban,560016,14,16,21


Unnamed: 0,date,state,district,pincode,bio_age_5_17,bio_age_17_
0,01-03-2025,Haryana,Mahendragarh,123029,280,577
1,01-03-2025,Bihar,Madhepura,852121,144,369
2,01-03-2025,Jammu and Kashmir,Punch,185101,643,1091
3,01-03-2025,Bihar,Bhojpur,802158,256,980
4,01-03-2025,Tamil Nadu,Madurai,625514,271,815


Unnamed: 0,date,state,district,pincode,demo_age_5_17,demo_age_17_
0,01-03-2025,Uttar Pradesh,Gorakhpur,273213,49,529
1,01-03-2025,Andhra Pradesh,Chittoor,517132,22,375
2,01-03-2025,Gujarat,Rajkot,360006,65,765
3,01-03-2025,Andhra Pradesh,Srikakulam,532484,24,314
4,01-03-2025,Rajasthan,Udaipur,313801,45,785


In [8]:
display(enrolment_df.tail())
display(biometric_df.tail())
display(demographic_df.tail())

Unnamed: 0,date,state,district,pincode,age_0_5,age_5_17,age_18_greater
1006024,31-12-2025,West Bengal,West Midnapore,721149,2,0,0
1006025,31-12-2025,West Bengal,West Midnapore,721150,2,2,0
1006026,31-12-2025,West Bengal,West Midnapore,721305,0,1,0
1006027,31-12-2025,West Bengal,West Midnapore,721504,1,0,0
1006028,31-12-2025,West Bengal,West Midnapore,721517,2,1,0


Unnamed: 0,date,state,district,pincode,bio_age_5_17,bio_age_17_
1861103,29-12-2025,West Bengal,Uttar Dinajpur,733201,4,9
1861104,29-12-2025,West Bengal,Uttar Dinajpur,733213,0,1
1861105,29-12-2025,West Bengal,West Midnapore,721304,0,3
1861106,29-12-2025,West Bengal,West Midnapore,721451,2,0
1861107,29-12-2025,West Bengal,West Midnapore,721457,0,1


Unnamed: 0,date,state,district,pincode,demo_age_5_17,demo_age_17_
2071695,29-12-2025,West Bengal,West Midnapore,721212,0,12
2071696,29-12-2025,West Bengal,West Midnapore,721420,0,1
2071697,29-12-2025,West Bengal,West Midnapore,721424,0,5
2071698,29-12-2025,West Bengal,West Midnapore,721426,0,3
2071699,29-12-2025,West Bengal,hooghly,712701,0,1


In [9]:
def data_health_check(df, name):
    """Print dtypes, memory usage, and per-column missing-value counts for one dataset."""
    print(f"\n{name} info:")
    df.info()
    print("\nMissing values:")
    display(df.isnull().sum())

data_health_check(enrolment_df, "Enrolment")


Enrolment info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1006029 entries, 0 to 1006028
Data columns (total 7 columns):
 #   Column          Non-Null Count    Dtype 
---  ------          --------------    ----- 
 0   date            1006029 non-null  object
 1   state           1006029 non-null  object
 2   district        1006029 non-null  object
 3   pincode         1006029 non-null  int64 
 4   age_0_5         1006029 non-null  int64 
 5   age_5_17        1006029 non-null  int64 
 6   age_18_greater  1006029 non-null  int64 
dtypes: int64(4), object(3)
memory usage: 53.7+ MB

Missing values:


date              0
state             0
district          0
pincode           0
age_0_5           0
age_5_17          0
age_18_greater    0
dtype: int64

In [10]:
data_health_check(biometric_df, "Biometric")


Biometric info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1861108 entries, 0 to 1861107
Data columns (total 6 columns):
 #   Column        Dtype 
---  ------        ----- 
 0   date          object
 1   state         object
 2   district      object
 3   pincode       int64 
 4   bio_age_5_17  int64 
 5   bio_age_17_   int64 
dtypes: int64(3), object(3)
memory usage: 85.2+ MB

Missing values:


date            0
state           0
district        0
pincode         0
bio_age_5_17    0
bio_age_17_     0
dtype: int64

In [11]:
data_health_check(demographic_df, "Demographic")


Demographic info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2071700 entries, 0 to 2071699
Data columns (total 6 columns):
 #   Column         Dtype 
---  ------         ----- 
 0   date           object
 1   state          object
 2   district       object
 3   pincode        int64 
 4   demo_age_5_17  int64 
 5   demo_age_17_   int64 
dtypes: int64(3), object(3)
memory usage: 94.8+ MB

Missing values:


date             0
state            0
district         0
pincode          0
demo_age_5_17    0
demo_age_17_     0
dtype: int64
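
Missing-value counts do not reveal repeated rows, so a duplicate check is a natural companion to the health check above. The sketch below runs on a small invented frame (the rows are illustrative, not taken from the data) and assumes that `date`, `state`, `district`, and `pincode` together should identify one row; that assumption has not been verified for these datasets.

```python
import pandas as pd

# Toy frame: row 1 repeats row 0 exactly, row 2 repeats only the key columns
toy = pd.DataFrame({
    "date": ["01-03-2025", "01-03-2025", "01-03-2025", "02-03-2025"],
    "state": ["Bihar", "Bihar", "Bihar", "Bihar"],
    "district": ["Bhojpur", "Bhojpur", "Bhojpur", "Bhojpur"],
    "pincode": [802158, 802158, 802158, 802158],
    "bio_age_5_17": [256, 256, 10, 7],
    "bio_age_17_": [980, 980, 12, 9],
})

key_cols = ["date", "state", "district", "pincode"]  # assumed row key

# Rows identical in every column to an earlier row
exact_dupes = int(toy.duplicated().sum())
# Rows that share the assumed key with an earlier row
key_dupes = int(toy.duplicated(subset=key_cols).sum())

print("Exact duplicate rows:", exact_dupes)
print("Rows repeating an earlier key:", key_dupes)
```

Exact duplicates usually point to a loading or export problem, while repeated keys with different counts may be legitimate partial records that need to be summed during aggregation.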

In [12]:
enrolment_df.describe()

Unnamed: 0,pincode,age_0_5,age_5_17,age_18_greater
count,1006029.0,1006029.0,1006029.0,1006029.0
mean,518641.5,3.525709,1.710074,0.1673441
std,205636.0,17.53851,14.36963,3.220525
min,100000.0,0.0,0.0,0.0
25%,363641.0,1.0,0.0,0.0
50%,517417.0,2.0,0.0,0.0
75%,700104.0,3.0,1.0,0.0
max,855456.0,2688.0,1812.0,855.0


In [13]:
biometric_df.describe()

Unnamed: 0,pincode,bio_age_5_17,bio_age_17_
count,1861108.0,1861108.0,1861108.0
mean,521761.2,18.39058,19.09413
std,198162.7,83.70421,88.06502
min,110001.0,0.0,0.0
25%,391175.0,1.0,1.0
50%,522401.0,3.0,4.0
75%,686636.2,11.0,10.0
max,855456.0,8002.0,7625.0


In [14]:
demographic_df.describe()

Unnamed: 0,pincode,demo_age_5_17,demo_age_17_
count,2071700.0,2071700.0,2071700.0
mean,527831.8,2.347552,21.44701
std,197293.3,14.90355,125.2498
min,100000.0,0.0,0.0
25%,396469.0,0.0,2.0
50%,524322.0,1.0,6.0
75%,695507.0,2.0,15.0
max,855456.0,2690.0,16166.0
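
`describe()` treats `pincode` as a number, so its mean and standard deviation carry no meaning; only the minimum and maximum are useful as range checks. The minimum of 100000 in the enrolment and demographic summaries is worth inspecting. The sketch below flags values outside an assumed valid range of 110000 to 999999. That lower bound is an assumption (that no Indian PIN prefix is lower than 11) and should be confirmed against an official pincode directory; the sample values other than 100000 and 855456 are illustrative.

```python
import pandas as pd

# Small sample: two plausible PINs, the observed minimum and maximum,
# and one invented five-digit value
pincodes = pd.Series([793121, 560043, 100000, 855456, 12345], name="pincode")

# Assumed valid range for six-digit PINs (lower bound is an assumption)
in_range = pincodes.between(110000, 999999)
suspect = pincodes[~in_range]

print("Suspect pincodes:", suspect.tolist())
```

Rows flagged this way are candidates for review rather than automatic removal, since a placeholder pincode can still carry valid state and district information.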


In [15]:
for df, name in zip(
    [enrolment_df, biometric_df, demographic_df],
    ["Enrolment", "Biometric", "Demographic"]
):
    print(f"{name} date sample:", df["date"].head().tolist())

Enrolment date sample: ['02-03-2025', '09-03-2025', '09-03-2025', '09-03-2025', '09-03-2025']
Biometric date sample: ['01-03-2025', '01-03-2025', '01-03-2025', '01-03-2025', '01-03-2025']
Demographic date sample: ['01-03-2025', '01-03-2025', '01-03-2025', '01-03-2025', '01-03-2025']
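
The samples above, together with values such as `31-12-2025` in the tail output, indicate a day-first `DD-MM-YYYY` format. A minimal sketch of how these strings could be parsed in a later step follows, assuming the format is consistent throughout; the `not-a-date` entry is invented to show how unparseable values surface.

```python
import pandas as pd

# Date strings in the style seen in the datasets, plus one invalid entry
sample = pd.Series(["02-03-2025", "09-03-2025", "31-12-2025", "not-a-date"])

# An explicit format avoids month/day ambiguity; errors="coerce" turns
# unparseable strings into NaT instead of raising
parsed = pd.to_datetime(sample, format="%d-%m-%Y", errors="coerce")

print(parsed)
print("Unparseable values:", int(parsed.isna().sum()))
```

Counting `NaT` values after parsing gives a quick measure of how many rows deviate from the assumed format.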


In [16]:
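# Save the combined, still unmodified frames as a base for later notebooks
from pathlib import Path
Path("../data/cleaned").mkdir(parents=True, exist_ok=True)  # ensure the output folder exists
enrolment_df.to_csv("../data/cleaned/enrolment_base.csv", index=False)
biometric_df.to_csv("../data/cleaned/biometric_base.csv", index=False)
demographic_df.to_csv("../data/cleaned/demographic_base.csv", index=False)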

### Geographic Coverage Summary

In [17]:
def geographic_coverage(df, name):
    print(f"--- {name} Dataset ---")
    print("Total records:", len(df))
    print("Unique states:", df["state"].nunique())
    print("Unique districts:", df["district"].nunique())
    print("Unique pincodes:", df["pincode"].nunique())
    print()
    
geographic_coverage(enrolment_df, "Enrolment")
geographic_coverage(biometric_df, "Biometric")
geographic_coverage(demographic_df, "Demographic")

--- Enrolment Dataset ---
Total records: 1006029
Unique states: 55
Unique districts: 985
Unique pincodes: 19463

--- Biometric Dataset ---
Total records: 1861108
Unique states: 57
Unique districts: 974
Unique pincodes: 19707

--- Demographic Dataset ---
Total records: 2071700
Unique states: 65
Unique districts: 983
Unique pincodes: 19742
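
The unique state counts above (55, 57, and 65) are higher than the 36 states and union territories of India, which suggests that the `state` column contains label variants. The sketch below shows one way to measure how many variants collapse under simple normalization. The sample labels are invented for illustration, and lower-casing is used only as a comparison key, not as a final display form.

```python
import pandas as pd

# Invented label variants of the kind that inflate unique counts
states = pd.Series([
    "West Bengal", "west bengal", "WEST BENGAL ", "West  Bengal", "Bihar",
])

# Comparison key: trim, collapse repeated whitespace, lower-case
normalized = (
    states.str.strip()
          .str.replace(r"\s+", " ", regex=True)
          .str.lower()
)

print("Unique labels before:", states.nunique())
print("Unique labels after: ", normalized.nunique())
```

Normalization of this kind handles casing and whitespace only. Genuine spelling variants or renamed states would still need an explicit mapping table.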



## Initial Observations

- All datasets are aggregated and anonymized: each row holds counts per
  date, state, district, and pincode, split by age bucket, with no
  individual-level records. This keeps the analysis policy-safe.

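- No missing values were observed in any column of the three datasets.
  Completeness alone does not establish consistency, though: dates are
  still stored as `DD-MM-YYYY` strings, and label casing varies (for
  example, the district `hooghly` in the last demographic row).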

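- The datasets differ in size (about 1.0M, 1.9M, and 2.1M rows), and the
  per-row counts are heavily right-skewed. For example, `age_0_5` has a
  median of 2 and a maximum of 2,688, and `demo_age_17_` has a median of
  6 and a maximum of 16,166. This is consistent with event-driven
  enrolment and update activity rather than a uniform distribution over
  time or geography.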

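- Geographic coverage is extensive across all datasets, with close to
  1,000 district labels and nearly 20,000 pincodes. However, the 55 to
  65 unique `state` values exceed India's 36 states and union
  territories, so the column most likely contains spelling or casing
  variants. State and district labels need to be standardized before
  state, district, and pincode-level aggregation can be considered
  reliable.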

- The number of unique states, districts, and pincodes differs slightly
  across the enrolment, biometric, and demographic datasets. This may
  mean that not all types of Aadhaar events occur in every location, or
  it may reflect inconsistent labels. Either way, it reinforces the
  need for dataset-specific and geography-aware analysis.

- The availability of pincode-level granularity enables fine-grained
  regional analysis and supports the identification of localized
  operational hotspots.

- Together, the three datasets represent the full Aadhaar lifecycle:
  - Enrolment (initial entry into the system)
  - Biometric updates (ongoing identity maintenance)
  - Demographic updates (life-event and detail corrections)

- These observations justify proceeding with temporal standardization,
  geographic aggregation, and exploratory analysis to uncover patterns
  and operational insights.
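
The coverage differences noted in the observations can be made concrete by comparing the sets of pincodes in each dataset. The following is a minimal sketch on two small invented frames; the pincode values are borrowed from the `head()` outputs, but the overlap between the two frames is made up for illustration.

```python
import pandas as pd

# Toy stand-ins for two of the datasets (overlap is illustrative only)
enrol = pd.DataFrame({"pincode": [793121, 560043, 208001]})
bio = pd.DataFrame({"pincode": [560043, 208001, 123029]})

enrol_pins = set(enrol["pincode"].tolist())
bio_pins = set(bio["pincode"].tolist())

# Pincodes that appear in one dataset but not the other, and in both
only_enrol = sorted(enrol_pins - bio_pins)
only_bio = sorted(bio_pins - enrol_pins)
in_both = sorted(enrol_pins & bio_pins)

print("Only in enrolment:", only_enrol)
print("Only in biometric:", only_bio)
print("In both:", in_both)
```

Applied to the full datasets, the same set operations would show whether the differing unique counts come from a handful of locations or from a broader mismatch.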