# Biometric Update Dataset – Data Concatenation and Cleaning

## Overview
This notebook focuses on preparing the Aadhaar **Biometric Update Dataset** for downstream analysis as part of the UIDAI Data Hackathon.

The biometric update data has been provided as multiple CSV files, each representing a partition of the same logical dataset.
Before analysis, these files must be combined, validated, and cleaned in a consistent and reproducible manner.

## Objectives
The objectives of this notebook are to:

1. Load and inventory all biometric update data files
2. Validate schema consistency across file partitions
3. Concatenate the files into a unified dataset
4. Perform minimal and justified data cleaning, including:
   - Date parsing (if applicable)
   - Validation of biometric update counts
   - Standardization of geographic attributes
5. Persist a clean biometric update dataset for analysis

## Scope and Design Principles
- This notebook is strictly limited to data preparation
- No exploratory or inferential analysis is performed here
- Cleaning actions are evidence-driven and fully documented
- Administrative semantics are preserved unless explicitly justified

## Output
The final output of this notebook is:

03_Processed_Data/biometric_update_clean.csv

This dataset will be used in downstream analytical notebooks.

## Reproducibility
All steps in this notebook are deterministic and can be rerun end-to-end using the raw source files.


## Step 1: Environment Setup and Library Imports

This step initializes the Python environment and imports the required libraries to ensure consistent data handling and display.


In [40]:
import pandas as pd
from pathlib import Path

pd.set_option("display.max_columns", None)
pd.set_option("display.width", 120)

## Step 2: Defining the Biometric Update Data Source and File Inventory

The biometric update dataset is provided as multiple CSV files.
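This step lists all available files so that the partition inventory (four files here) can be checked before loading.
Note that `sorted` orders paths lexicographically, so the `1000000_1500000` partition is listed before `500000_1000000`; this affects row order only, not the concatenated content.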

In [41]:
# Define path to raw biometric update data
BIOMETRIC_DATA_PATH = Path("../01_Raw_Data_National/biometric_update")

# List all biometric update CSV files
biometric_files = sorted(BIOMETRIC_DATA_PATH.glob("*.csv"))

biometric_files

[WindowsPath('../01_Raw_Data_National/biometric_update/api_data_aadhar_biometric_0_500000.csv'),
 WindowsPath('../01_Raw_Data_National/biometric_update/api_data_aadhar_biometric_1000000_1500000.csv'),
 WindowsPath('../01_Raw_Data_National/biometric_update/api_data_aadhar_biometric_1500000_1861108.csv'),
 WindowsPath('../01_Raw_Data_National/biometric_update/api_data_aadhar_biometric_500000_1000000.csv')]
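The listing above can also be checked programmatically. The following is a minimal sketch, assuming that each file name encodes a half-open row range as `<start>_<end>` (an assumption, though it is consistent with the per-file row counts reported in Step 5). The helper names `partition_ranges` and `check_contiguous` are hypothetical and not part of the original notebook.

```python
import re
from pathlib import Path

# Assumption: file names end in "_<start>_<end>.csv", where the numbers are row offsets
RANGE_PATTERN = re.compile(r"_(\d+)_(\d+)\.csv$")

def partition_ranges(paths):
    """Return (start, end, name) tuples sorted numerically by start offset."""
    ranges = []
    for p in paths:
        name = Path(p).name
        match = RANGE_PATTERN.search(name)
        if match is None:
            raise ValueError(f"Unexpected file name: {name}")
        ranges.append((int(match.group(1)), int(match.group(2)), name))
    return sorted(ranges)

def check_contiguous(ranges):
    """Raise if consecutive partitions leave a gap or overlap; return implied row total."""
    for (_, prev_end, prev_name), (start, _, name) in zip(ranges, ranges[1:]):
        if start != prev_end:
            raise ValueError(f"Gap or overlap between {prev_name} and {name}")
    return ranges[-1][1] - ranges[0][0]

# File names copied from the listing above
names = [
    "api_data_aadhar_biometric_0_500000.csv",
    "api_data_aadhar_biometric_1000000_1500000.csv",
    "api_data_aadhar_biometric_1500000_1861108.csv",
    "api_data_aadhar_biometric_500000_1000000.csv",
]
ranges = partition_ranges(names)
expected_rows = check_contiguous(ranges)
print(len(ranges), expected_rows)  # 4 1861108
```

Sorting on the parsed integers rather than on the path strings restores the natural partition order.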

## Step 3: Loading Biometric Update Data with Provenance Tracking

In this step, each biometric update CSV file is loaded into memory.
A temporary provenance column is added to track the source file for each record, enabling validation of successful ingestion and concatenation.

This column is used only during data preparation and is removed before persisting the final dataset.


In [42]:
# Load biometric update files with provenance tracking
biometric_dfs = []

for file_path in biometric_files:
    df = pd.read_csv(file_path)
    df["source_file"] = file_path.name
    biometric_dfs.append(df)

# Confirm all files are loaded
len(biometric_dfs)

4

## Step 4: Schema Validation Across Biometric Update Files

Before concatenating the biometric update files, it is necessary to confirm that all partitions share a consistent schema.
Schema validation ensures that each column represents the same attribute across all files and prevents silent data corruption during concatenation.


In [43]:
# Extract column schemas from each biometric update DataFrame
schemas = [df.columns.tolist() for df in biometric_dfs]

# Display schemas for comparison
for idx, schema in enumerate(schemas, start=1):
    print(f"Schema for file {idx}:")
    print(schema)
    print("-" * 80)

# Check whether all schemas are identical
all_schemas_identical = all(schema == schemas[0] for schema in schemas)
all_schemas_identical

Schema for file 1:
['date', 'state', 'district', 'pincode', 'bio_age_5_17', 'bio_age_17_', 'source_file']
--------------------------------------------------------------------------------
Schema for file 2:
['date', 'state', 'district', 'pincode', 'bio_age_5_17', 'bio_age_17_', 'source_file']
--------------------------------------------------------------------------------
Schema for file 3:
['date', 'state', 'district', 'pincode', 'bio_age_5_17', 'bio_age_17_', 'source_file']
--------------------------------------------------------------------------------
Schema for file 4:
['date', 'state', 'district', 'pincode', 'bio_age_5_17', 'bio_age_17_', 'source_file']
--------------------------------------------------------------------------------


True
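The check above compares column names and their order. A slightly stricter variant also compares dtypes, since a column that loads as integers in one partition and as text in another would still pass a name-only check. This is a sketch on toy frames; `schema_diff` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def schema_diff(frames, names):
    """Compare each frame's columns and dtypes against the first frame."""
    reference = frames[0].dtypes.astype(str).to_dict()
    problems = {}
    for name, frame in zip(names[1:], frames[1:]):
        current = frame.dtypes.astype(str).to_dict()
        missing = sorted(set(reference) - set(current))
        extra = sorted(set(current) - set(reference))
        dtype_mismatch = {
            col: (reference[col], current[col])
            for col in reference.keys() & current.keys()
            if reference[col] != current[col]
        }
        if missing or extra or dtype_mismatch:
            problems[name] = {
                "missing": missing,
                "extra": extra,
                "dtype": dtype_mismatch,
            }
    return problems

# Toy partitions: the second one stores pincode as text instead of integers
part_a = pd.DataFrame({"pincode": [123029], "bio_age_5_17": [280]})
part_b = pd.DataFrame({"pincode": ["123029"], "bio_age_5_17": [280]})

problems = schema_diff([part_a, part_b], ["part_a", "part_b"])
print(problems)
```

An empty dictionary means every partition matches the first one in both column set and dtypes.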

## Step 5: Concatenating the Biometric Update Dataset

After confirming schema consistency across all biometric update files, the individual partitions are concatenated into a single unified dataset.
This step consolidates all biometric update records and prepares the data for validation and cleaning.


In [44]:
# Concatenate all biometric update DataFrames
biometric_combined = pd.concat(biometric_dfs, ignore_index=True)

# Inspect the shape of the combined dataset
biometric_combined.shape

(1861108, 7)

In [45]:
# Verify that all source files contributed data
biometric_combined["source_file"].value_counts()

source_file
api_data_aadhar_biometric_0_500000.csv           500000
api_data_aadhar_biometric_1000000_1500000.csv    500000
api_data_aadhar_biometric_500000_1000000.csv     500000
api_data_aadhar_biometric_1500000_1861108.csv    361108
Name: count, dtype: int64

## Step 6: Initial Data Quality Assessment and Cleaning Strategy (Biometric Update)

Before applying any cleaning operations, an initial assessment of the biometric update dataset is performed.
This step focuses on understanding data completeness, data types, and potential structural issues that may affect downstream analysis.

The objective is to identify:
- Columns with missing or inconsistent values
- Data types that require conversion (e.g., dates)
- Fields that may require validation (e.g., biometric update counts)
- Geographic attributes that require standardization


In [46]:
# High-level overview of the biometric update dataset
biometric_combined.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1861108 entries, 0 to 1861107
Data columns (total 7 columns):
 #   Column        Dtype 
---  ------        ----- 
 0   date          object
 1   state         object
 2   district      object
 3   pincode       int64 
 4   bio_age_5_17  int64 
 5   bio_age_17_   int64 
 6   source_file   object
dtypes: int64(3), object(4)
memory usage: 99.4+ MB


In [47]:
biometric_combined.isna().sum()

date            0
state           0
district        0
pincode         0
bio_age_5_17    0
bio_age_17_     0
source_file     0
dtype: int64
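`isna()` does not flag empty or whitespace-only strings, so a zero count here does not by itself rule out blank text fields. The sketch below shows one way to check for them on toy data. It also counts pincodes that are not six digits long, which is an assumed validity rule rather than one stated in the original notebook; `blank_string_counts` is a hypothetical helper.

```python
import pandas as pd

def blank_string_counts(frame, columns):
    """Count values that are empty or whitespace-only in the given text columns."""
    return pd.Series(
        {col: int(frame[col].astype(str).str.strip().eq("").sum()) for col in columns}
    )

# Toy rows for illustration only
toy = pd.DataFrame(
    {
        "state": ["Bihar", "  ", ""],
        "district": ["Bhojpur", "Madhepura", " "],
        "pincode": [802158, 852121, 1234],
    }
)

blanks = blank_string_counts(toy, ["state", "district"])

# Assumed rule: a valid pincode has exactly six digits
bad_pincodes = int((~toy["pincode"].between(100000, 999999)).sum())

print(blanks)
print(bad_pincodes)
```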

## Step 7: Applying Minimal and Justified Data Cleaning (Biometric Update)

Based on the initial data quality assessment, the biometric update dataset is complete and structurally consistent.
Cleaning actions in this step are intentionally minimal and focus on correctness, consistency, and cross-dataset alignment.

The following operations are performed:
- Robust parsing of the date column
- Validation of biometric age-based count fields
- Canonical standardization of State names (aligned with Enrollment and Demographic datasets)
- Structural standardization of District names
- Removal of temporary ingestion-related columns


### Step 7A: Robust Date Parsing

The biometric update dataset contains date values stored as text and expressed in more than one valid format.
To ensure reliable temporal analysis and avoid data loss, a robust date parsing strategy is applied that:
- Handles all observed date formats
- Respects the day-first convention used in Indian administrative datasets
- Preserves all valid records


In [48]:
# Work on a clean copy
biometric_clean = biometric_combined.copy()

# Attempt 1: DD-MM-YYYY
parsed_dash = pd.to_datetime(
    biometric_clean["date"],
    format="%d-%m-%Y",
    errors="coerce"
)

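# Attempt 2: day-first parsing without an explicit format, used as a fallback
# for values that do not match DD-MM-YYYY. In pandas 2.0 and later, a single
# format is inferred from the first value unless format="mixed" is passed.
parsed_slash = pd.to_datetime(
    biometric_clean["date"],
    dayfirst=True,
    errors="coerce"
)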

# Combine both parsing attempts
biometric_clean["date"] = parsed_dash.fillna(parsed_slash)

# Validate parsing success
biometric_clean["date"].isna().sum()


np.int64(0)
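An alternative that avoids format inference altogether is to try a short list of explicit formats in turn. The sketch below assumes the second format is `DD/MM/YYYY`; the notebook does not list the formats actually observed, so that choice is an assumption, and `parse_dates_multi` is a hypothetical helper.

```python
import pandas as pd

def parse_dates_multi(values, formats=("%d-%m-%Y", "%d/%m/%Y")):
    """Try each explicit format in turn; earlier formats take precedence."""
    parsed = None
    for fmt in formats:
        attempt = pd.to_datetime(values, format=fmt, errors="coerce")
        parsed = attempt if parsed is None else parsed.fillna(attempt)
    return parsed

# Toy values: one per assumed format, plus one that matches neither
sample = pd.Series(["01-03-2025", "15/03/2025", "not a date"])
parsed = parse_dates_multi(sample)

print(parsed)
print(parsed.isna().sum())  # number of values that matched no format
```

Because every format is explicit, a value such as `03/04/2025` can only be read day-first, and anything that matches no listed format stays `NaT` and can be inspected rather than silently reinterpreted.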

### Step 7B: Validation of Biometric Age Fields

Biometric update records include age-segmented counts.
This step checks that neither count column contains negative values; Step 6 already confirmed that both columns are integer-typed with no missing values.


In [49]:
bio_age_cols = ["bio_age_5_17", "bio_age_17_"]

# Check for negative values
(biometric_clean[bio_age_cols] < 0).sum()

bio_age_5_17    0
bio_age_17_     0
dtype: int64
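The individual checks on the count columns (integer dtype, no missing values, no negatives) can be gathered into one report. This is a small sketch on toy data; `validate_counts` is a hypothetical helper.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

def validate_counts(frame, columns):
    """Report dtype, missing values, and negative values for each count column."""
    report = {}
    for col in columns:
        series = frame[col]
        report[col] = {
            "integer_dtype": is_integer_dtype(series),
            "missing": int(series.isna().sum()),
            "negative": int((series < 0).sum()),
        }
    return report

# Toy rows for illustration only; one negative value is planted on purpose
toy = pd.DataFrame(
    {"bio_age_5_17": [280, -1, 144], "bio_age_17_": [577, 369, 980]}
)
report = validate_counts(toy, ["bio_age_5_17", "bio_age_17_"])
print(pd.DataFrame(report).T)
```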

### Step 7C: Standardization of State Names

State names are normalized using the canonical mapping adopted across all datasets
to ensure consistent state-level aggregation and comparison.


In [50]:
# Standardize State column
biometric_clean["state"] = (
    biometric_clean["state"]
    .astype(str)
    .str.strip()          # remove leading/trailing whitespace
    .str.title()          # standardize casing (e.g., 'karnataka' -> 'Karnataka')
)

In [51]:
# Check number of unique states
biometric_clean["state"].nunique()

50

In [52]:
biometric_clean["state"].unique()

array(['Haryana', 'Bihar', 'Jammu And Kashmir', 'Tamil Nadu',
       'Maharashtra', 'Gujarat', 'Odisha', 'West Bengal', 'Kerala',
       'Rajasthan', 'Punjab', 'Himachal Pradesh', 'Uttar Pradesh',
       'Assam', 'Uttarakhand', 'Madhya Pradesh', 'Karnataka',
       'Andhra Pradesh', 'Telangana', 'Goa', 'Nagaland', 'Jharkhand',
       'Delhi', 'Chhattisgarh', 'Meghalaya', 'Chandigarh', 'Orissa',
       'Puducherry', 'Pondicherry', 'Manipur', 'Sikkim', 'Tripura',
       'Mizoram', 'Arunachal Pradesh', 'Ladakh',
       'Dadra And Nagar Haveli And Daman And Diu', 'Daman And Diu',
       'Andaman And Nicobar Islands', 'Andaman & Nicobar Islands',
       'Dadra And Nagar Haveli', 'Lakshadweep', 'Daman & Diu',
       'Dadra & Nagar Haveli', 'Jammu & Kashmir', 'Westbengal',
       'West  Bengal', 'West Bangal', 'Uttaranchal', 'Chhatisgarh',
       'Tamilnadu'], dtype=object)

In [53]:
# Canonical state name mapping
state_normalization_map = {
    # West Bengal variants
    "West Bengal": "West Bengal",
    "West  Bengal": "West Bengal",
    "West Bangal": "West Bengal",
    "Westbengal": "West Bengal",

    # Odisha / Orissa
    "Orissa": "Odisha",

    # Jammu & Kashmir
    "Jammu & Kashmir": "Jammu And Kashmir",

    # Andaman & Nicobar
    "Andaman & Nicobar Islands": "Andaman And Nicobar Islands",

    # UT merger
    "Dadra & Nagar Haveli": "Dadra And Nagar Haveli And Daman And Diu",
    "Daman & Diu": "Dadra And Nagar Haveli And Daman And Diu",
    "Daman And Diu": "Dadra And Nagar Haveli And Daman And Diu",
    "Dadra And Nagar Haveli": "Dadra And Nagar Haveli And Daman And Diu",
    "The Dadra And Nagar Haveli And Daman And Diu":
        "Dadra And Nagar Haveli And Daman And Diu",

    # Puducherry
    "Pondicherry": "Puducherry"
}

biometric_clean["state"] = (
    biometric_clean["state"]
    .astype(str)
    .str.strip()
    .str.title()
    .replace(state_normalization_map)
)

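# Remove clearly invalid state values
# (no such value appears in the unique states listed above, so this acts as a safeguard)
invalid_states = ["100000"]

# Build the mask from the frame being filtered, not from the raw combined frame
biometric_clean = biometric_clean[
    ~biometric_clean["state"].isin(invalid_states)
]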

biometric_clean["state"].nunique()

39

In [54]:
biometric_clean["state"].unique()

array(['Haryana', 'Bihar', 'Jammu And Kashmir', 'Tamil Nadu',
       'Maharashtra', 'Gujarat', 'Odisha', 'West Bengal', 'Kerala',
       'Rajasthan', 'Punjab', 'Himachal Pradesh', 'Uttar Pradesh',
       'Assam', 'Uttarakhand', 'Madhya Pradesh', 'Karnataka',
       'Andhra Pradesh', 'Telangana', 'Goa', 'Nagaland', 'Jharkhand',
       'Delhi', 'Chhattisgarh', 'Meghalaya', 'Chandigarh', 'Puducherry',
       'Manipur', 'Sikkim', 'Tripura', 'Mizoram', 'Arunachal Pradesh',
       'Ladakh', 'Dadra And Nagar Haveli And Daman And Diu',
       'Andaman And Nicobar Islands', 'Lakshadweep', 'Uttaranchal',
       'Chhatisgarh', 'Tamilnadu'], dtype=object)
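The output above still contains `Uttaranchal`, `Chhatisgarh`, and `Tamilnadu`, which the mapping does not cover. The sketch below shows how such leftovers could be detected and mapped. The three target names are assumptions based on the canonical labels already present in the output, and `unmapped_states` is a hypothetical helper.

```python
import pandas as pd

# Assumed corrections for the three variants still visible in the output above
extra_state_fixes = {
    "Uttaranchal": "Uttarakhand",
    "Chhatisgarh": "Chhattisgarh",
    "Tamilnadu": "Tamil Nadu",
}

def unmapped_states(states, canonical):
    """Return state labels that are not in the canonical set."""
    return sorted(set(states) - set(canonical))

# Toy sample and a toy canonical set, for illustration only
sample = pd.Series(["Tamil Nadu", "Tamilnadu", "Uttaranchal", "Chhatisgarh", "Goa"])
canonical = {"Tamil Nadu", "Uttarakhand", "Chhattisgarh", "Goa"}

before = unmapped_states(sample, canonical)
after = unmapped_states(sample.replace(extra_state_fixes), canonical)

print(before)  # ['Chhatisgarh', 'Tamilnadu', 'Uttaranchal']
print(after)   # []
```

Running a check like this against a full canonical list after every mapping step would surface missed variants immediately instead of leaving them for visual inspection.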

### Step 7D: Structural Standardization of District Names

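District names are standardized structurally to remove formatting noise (extra whitespace, asterisks, parentheses, stray question marks, and non-ASCII dashes) while preserving
original administrative semantics. No semantic remapping is performed; rows whose district is only a direction token (e.g., "East") are dropped as invalid.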


In [55]:
biometric_clean["district"] = (
    biometric_clean["district"]
    .astype(str)
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
    .str.replace(r"\*", "", regex=True)
    .str.replace(r"[()]", "", regex=True)
    .str.replace("–", "-", regex=False)
    .str.replace("−", "-", regex=False)
    .str.replace("?", "", regex=False)
    .str.title()
)

# Remove clearly invalid direction-only tokens
invalid_districts = {"East", "West", "North", "South", "North East"}
biometric_clean = biometric_clean[
    ~biometric_clean["district"].isin(invalid_districts)
]
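Because this filter removes rows rather than renaming them, it can help to see which states the affected rows belong to before dropping them. The following is a sketch on toy data with placeholder state names; `audit_filter` is a hypothetical helper.

```python
import pandas as pd

def audit_filter(frame, mask, group_col):
    """Count the rows a boolean mask would drop, grouped by another column."""
    return frame.loc[mask, group_col].value_counts()

# Toy rows with placeholder state names, not taken from the dataset
toy = pd.DataFrame(
    {
        "state": ["State A", "State A", "State B", "State C"],
        "district": ["North East", "West", "East", "Bhojpur"],
    }
)
invalid_districts = {"East", "West", "North", "South", "North East"}

drop_mask = toy["district"].isin(invalid_districts)
audit = audit_filter(toy, drop_mask, "state")
kept = toy[~drop_mask]

print(audit)
print(len(kept))
```

If the dropped rows cluster in a few states, that is a hint that the tokens may be abbreviated district names there rather than noise, which would be worth checking before removal.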

In [56]:
# Drop temporary provenance column
if "source_file" in biometric_clean.columns:
    biometric_clean = biometric_clean.drop(columns=["source_file"])

In [57]:
biometric_clean.info()
biometric_clean.head()

<class 'pandas.core.frame.DataFrame'>
Index: 1859634 entries, 0 to 1861107
Data columns (total 6 columns):
 #   Column        Dtype         
---  ------        -----         
 0   date          datetime64[ns]
 1   state         object        
 2   district      object        
 3   pincode       int64         
 4   bio_age_5_17  int64         
 5   bio_age_17_   int64         
dtypes: datetime64[ns](1), int64(3), object(2)
memory usage: 99.3+ MB


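|   | date | state | district | pincode | bio_age_5_17 | bio_age_17_ |
|---|------|-------|----------|---------|--------------|-------------|
| 0 | 2025-03-01 | Haryana | Mahendragarh | 123029 | 280 | 577 |
| 1 | 2025-03-01 | Bihar | Madhepura | 852121 | 144 | 369 |
| 2 | 2025-03-01 | Jammu And Kashmir | Punch | 185101 | 643 | 1091 |
| 3 | 2025-03-01 | Bihar | Bhojpur | 802158 | 256 | 980 |
| 4 | 2025-03-01 | Tamil Nadu | Madurai | 625514 | 271 | 815 |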


### Step 7 Summary

The biometric update dataset now satisfies the following:
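- Dates are parsed to `datetime64[ns]` with no unparsed values
- Biometric age-count fields contain no negative values
- State name variants covered by the mapping are consolidated (50 labels reduced to 39); `Uttaranchal`, `Chhatisgarh`, and `Tamilnadu` are not covered by the mapping and remain as separate labels
- District names are structurally standardized, and 1,474 rows with direction-only district tokens are removed (1,861,108 rows reduced to 1,859,634)
- Temporary ingestion artifacts are removed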

The dataset is ready to be persisted for downstream analysis.


## Step 8: Persisting the Clean Biometric Update Dataset

After completing all validation and cleaning steps, the biometric update dataset is finalized.
In this step, the cleaned dataset is persisted to disk to serve as the single source of truth
for all downstream analytical tasks.


In [58]:
# Define output directory
OUTPUT_DIR = Path("../03_Processed_Data")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)  # also create missing parent folders

# Save cleaned biometric update dataset
output_path = OUTPUT_DIR / "biometric_update_clean.csv"
biometric_clean.to_csv(output_path, index=False)

output_path

WindowsPath('../03_Processed_Data/biometric_update_clean.csv')

In [59]:
# Quick verification
pd.read_csv(output_path, nrows=5)

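|   | date | state | district | pincode | bio_age_5_17 | bio_age_17_ |
|---|------|-------|----------|---------|--------------|-------------|
| 0 | 2025-03-01 | Haryana | Mahendragarh | 123029 | 280 | 577 |
| 1 | 2025-03-01 | Bihar | Madhepura | 852121 | 144 | 369 |
| 2 | 2025-03-01 | Jammu And Kashmir | Punch | 185101 | 643 | 1091 |
| 3 | 2025-03-01 | Bihar | Bhojpur | 802158 | 256 | 980 |
| 4 | 2025-03-01 | Tamil Nadu | Madurai | 625514 | 271 | 815 |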
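Reading back five rows confirms that the file opens, but not that the full dataset survived the round trip. A fuller check could compare shape, column order, and the re-parsed dates. The sketch below uses an in-memory buffer in place of the real output file so that it runs on its own; `roundtrip_matches` is a hypothetical helper.

```python
import io
import pandas as pd

def roundtrip_matches(frame, date_col="date"):
    """Write a frame to CSV and read it back; compare shape, columns, and dates."""
    buffer = io.StringIO()
    frame.to_csv(buffer, index=False)
    buffer.seek(0)
    reloaded = pd.read_csv(buffer, parse_dates=[date_col])

    same_shape = reloaded.shape == frame.shape
    same_columns = list(reloaded.columns) == list(frame.columns)
    # Reset the index so that rows are compared by position
    original_dates = frame[date_col].reset_index(drop=True)
    same_dates = bool(reloaded[date_col].eq(original_dates).all())
    return same_shape and same_columns and same_dates

# Two rows copied from the preview above
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(["2025-03-01", "2025-03-01"]),
        "state": ["Haryana", "Bihar"],
        "district": ["Mahendragarh", "Madhepura"],
        "pincode": [123029, 852121],
        "bio_age_5_17": [280, 144],
        "bio_age_17_": [577, 369],
    }
)
print(roundtrip_matches(toy))
```

Note that a CSV stores dates as text, so downstream notebooks need `parse_dates=["date"]` (or an equivalent conversion) when loading the file.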
