# Demographic Update Dataset – Data Concatenation and Cleaning

## Overview
This notebook focuses on preparing the Aadhaar **Demographic Update Dataset** for downstream analysis as part of the UIDAI Data Hackathon.

The demographic update data has been provided as multiple CSV files, each representing a partition of the same logical dataset.
Before analysis, these files must be combined, validated, and cleaned in a consistent and reproducible manner.

## Objectives
The objectives of this notebook are to:

1. Load and inventory all demographic update data files
2. Validate schema consistency across file partitions
3. Concatenate the files into a unified dataset
4. Perform minimal and justified data cleaning, including:
   - Date parsing (if applicable)
   - Validation of demographic update fields
   - Standardization of geographic attributes
5. Persist a clean demographic update dataset for analysis

## Scope and Design Principles
- This notebook is strictly limited to data preparation
- No exploratory or inferential analysis is performed here
- Cleaning actions are evidence-driven and fully documented
- Administrative semantics are preserved unless explicitly justified

## Output
The final output of this notebook is:

[03_Processed_Data/demographic_update_clean.csv](../03_Processed_Data/demographic_update_clean.csv)

This dataset will be used in downstream analytical notebooks.

## Reproducibility
All steps in this notebook are deterministic and can be rerun end-to-end using the raw source files.

## Step 1: Environment Setup and Library Imports

This step initializes the Python environment and imports the required libraries to ensure consistent data handling and display.

In [41]:
import pandas as pd
from pathlib import Path

pd.set_option("display.max_columns", None)
pd.set_option("display.width", 120)

## Step 2: Defining the Demographic Update Data Source and File Inventory

The demographic update dataset is provided as multiple CSV files.
This step lists all available files so that the number of partitions can be confirmed by inspection before loading.

In [42]:
# Define path to raw demographic update data
DEMOGRAPHIC_DATA_PATH = Path("../01_Raw_Data_National/demographic_update")

# List all demographic update CSV files
demographic_files = sorted(DEMOGRAPHIC_DATA_PATH.glob("*.csv"))

demographic_files

[WindowsPath('../01_Raw_Data_National/demographic_update/api_data_aadhar_demographic_0_500000.csv'),
 WindowsPath('../01_Raw_Data_National/demographic_update/api_data_aadhar_demographic_1000000_1500000.csv'),
 WindowsPath('../01_Raw_Data_National/demographic_update/api_data_aadhar_demographic_1500000_2000000.csv'),
 WindowsPath('../01_Raw_Data_National/demographic_update/api_data_aadhar_demographic_2000000_2071700.csv'),
 WindowsPath('../01_Raw_Data_National/demographic_update/api_data_aadhar_demographic_500000_1000000.csv')]
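The listing above is in lexicographic order, so the `1000000_1500000` partition appears before `500000_1000000`. The following is a minimal sketch of ordering the partitions numerically and checking for gaps. It assumes the two trailing numbers in each file name are the start and end row offsets of that partition, which is an interpretation of the names rather than something the source states. `RANGE_PATTERN` and `row_range` are hypothetical helpers introduced for illustration.

```python
import re

# File names as shown in the inventory output.
names = [
    "api_data_aadhar_demographic_0_500000.csv",
    "api_data_aadhar_demographic_1000000_1500000.csv",
    "api_data_aadhar_demographic_1500000_2000000.csv",
    "api_data_aadhar_demographic_2000000_2071700.csv",
    "api_data_aadhar_demographic_500000_1000000.csv",
]

# Assumed naming scheme: ..._<start>_<end>.csv
RANGE_PATTERN = re.compile(r"_(\d+)_(\d+)\.csv$")


def row_range(name):
    """Return the (start, end) offsets encoded in a partition file name."""
    match = RANGE_PATTERN.search(name)
    if match is None:
        raise ValueError(f"Unexpected file name: {name}")
    return int(match.group(1)), int(match.group(2))


# Sort by the numeric offsets instead of by the raw string.
ordered = sorted(names, key=row_range)
ranges = [row_range(name) for name in ordered]

# Each partition should start where the previous one ended.
contiguous = all(prev[1] == cur[0] for prev, cur in zip(ranges, ranges[1:]))
expected_rows = ranges[-1][1] - ranges[0][0]

print(contiguous, expected_rows)
```

If the assumption about the file names holds, `expected_rows` gives a figure that the combined row count can later be compared against.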

## Step 3: Loading Demographic Update Data with Provenance Tracking

In this step, each demographic update CSV file is loaded into memory.
A temporary provenance column is added to track the source file for each record, enabling validation of successful ingestion and concatenation.

This column is used only during data preparation and is removed before persisting the final dataset.


In [43]:
# Load demographic update files with provenance tracking
demographic_dfs = []

for file_path in demographic_files:
    df = pd.read_csv(file_path)
    df["source_file"] = file_path.name
    demographic_dfs.append(df)

# Confirm all files are loaded
len(demographic_dfs)

5

## Step 4: Schema Validation Across Demographic Update Files

Before concatenating the demographic update files, it is essential to confirm that all partitions share a consistent schema.
Schema validation ensures that each column represents the same attribute across all files and prevents silent data corruption during concatenation.

This step compares column names, column counts, and column ordering across all demographic update files.


In [44]:
# Extract column schemas from each demographic update DataFrame
schemas = [df.columns.tolist() for df in demographic_dfs]

# Display schemas for comparison
for idx, schema in enumerate(schemas, start=1):
    print(f"Schema for file {idx}:")
    print(schema)
    print("-" * 80)

# Check whether all schemas are identical
all_schemas_identical = all(schema == schemas[0] for schema in schemas)
all_schemas_identical

Schema for file 1:
['date', 'state', 'district', 'pincode', 'demo_age_5_17', 'demo_age_17_', 'source_file']
--------------------------------------------------------------------------------
Schema for file 2:
['date', 'state', 'district', 'pincode', 'demo_age_5_17', 'demo_age_17_', 'source_file']
--------------------------------------------------------------------------------
Schema for file 3:
['date', 'state', 'district', 'pincode', 'demo_age_5_17', 'demo_age_17_', 'source_file']
--------------------------------------------------------------------------------
Schema for file 4:
['date', 'state', 'district', 'pincode', 'demo_age_5_17', 'demo_age_17_', 'source_file']
--------------------------------------------------------------------------------
Schema for file 5:
['date', 'state', 'district', 'pincode', 'demo_age_5_17', 'demo_age_17_', 'source_file']
--------------------------------------------------------------------------------


True
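A single boolean confirms that the schemas match, but it would not say which file differs or how. As a hedged sketch, the helper below (`schema_report`, a hypothetical name) compares each frame against the first one and reports missing columns, extra columns, and pure reordering. The toy frames are illustrative and are not taken from the dataset.

```python
import pandas as pd


def schema_report(frames, labels):
    """Compare every frame's columns against the first frame's columns."""
    reference = list(frames[0].columns)
    problems = {}
    for label, frame in zip(labels[1:], frames[1:]):
        columns = list(frame.columns)
        if columns == reference:
            continue  # identical names, count, and order
        problems[label] = {
            "missing": [c for c in reference if c not in columns],
            "extra": [c for c in columns if c not in reference],
            "same_names_different_order": sorted(columns) == sorted(reference),
        }
    return problems


# Toy frames (illustrative only)
a = pd.DataFrame({"date": ["01-03-2025"], "state": ["Goa"], "pincode": [403001]})
b = pd.DataFrame({"date": ["01-03-2025"], "state": ["Goa"], "pin": [403001]})
c = pd.DataFrame({"state": ["Goa"], "date": ["01-03-2025"], "pincode": [403001]})

report = schema_report([a, b, c], ["a.csv", "b.csv", "c.csv"])
print(report)
```

An empty dictionary corresponds to the `True` result shown above.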

## Step 5: Concatenating the Demographic Update Dataset

After confirming schema consistency across all demographic update files, the individual partitions are concatenated into a single unified dataset.
This step combines all records while preserving row-level integrity and prepares the data for subsequent validation and cleaning.


In [45]:
# Concatenate all demographic update DataFrames
demographic_combined = pd.concat(demographic_dfs, ignore_index=True)

# Inspect the shape of the combined dataset
demographic_combined.shape

(2071700, 7)

In [46]:
demographic_combined["source_file"].value_counts()

source_file
api_data_aadhar_demographic_0_500000.csv           500000
api_data_aadhar_demographic_1000000_1500000.csv    500000
api_data_aadhar_demographic_1500000_2000000.csv    500000
api_data_aadhar_demographic_500000_1000000.csv     500000
api_data_aadhar_demographic_2000000_2071700.csv     71700
Name: count, dtype: int64
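The per-file counts can also be compared programmatically with the row counts recorded before concatenation. The sketch below uses two small toy partitions (hypothetical file names and values) to show the idea. It is not the notebook's own check.

```python
import pandas as pd

# Toy partitions standing in for the real CSV files.
parts = {
    "part_0_3.csv": pd.DataFrame({"pincode": [110001, 110002, 110003]}),
    "part_3_5.csv": pd.DataFrame({"pincode": [560001, 560002]}),
}

# Tag each partition with its source file, then concatenate.
frames = [df.assign(source_file=name) for name, df in parts.items()]
combined = pd.concat(frames, ignore_index=True)

# Row counts before and after concatenation, keyed by file name.
rows_before = {name: len(df) for name, df in parts.items()}
rows_after = combined["source_file"].value_counts().to_dict()

counts_match = rows_before == rows_after
total_matches = len(combined) == sum(rows_before.values())

print(counts_match, total_matches)
```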

## Step 6: Initial Data Quality Assessment and Cleaning Strategy

Before applying any cleaning operations, an initial assessment of the demographic update dataset is performed.
This step focuses on understanding data completeness, data types, and potential structural issues that may affect downstream analysis.

The objective is to identify:
- Columns with missing or inconsistent values
- Data types that require conversion (e.g., dates)
- Fields that may require standardization (e.g., geographic attributes)

All cleaning decisions applied in subsequent steps are driven by observations made here.


In [47]:
# High-level overview of the demographic update dataset
demographic_combined.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2071700 entries, 0 to 2071699
Data columns (total 7 columns):
 #   Column         Dtype 
---  ------         ----- 
 0   date           object
 1   state          object
 2   district       object
 3   pincode        int64 
 4   demo_age_5_17  int64 
 5   demo_age_17_   int64 
 6   source_file    object
dtypes: int64(3), object(4)
memory usage: 110.6+ MB


In [48]:
demographic_combined.isna().sum()

date             0
state            0
district         0
pincode          0
demo_age_5_17    0
demo_age_17_     0
source_file      0
dtype: int64
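Neither `info()` nor the missing-value count reveals which layouts the text dates use. One way to profile them, sketched here on toy values rather than the real column, is to replace every digit with a placeholder and count the resulting shapes.

```python
import pandas as pd

# Toy date strings (illustrative only)
dates = pd.Series(["01-03-2025", "15-03-2025", "2/3/2025", "17/03/2025"])

# Replace each digit with "9" so values with the same layout collapse together.
shapes = dates.str.replace(r"\d", "9", regex=True)
shape_counts = shapes.value_counts()

print(shape_counts)
```

Each distinct shape corresponds to a candidate format that the parsing step would need to handle.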

## Step 7: Applying Minimal and Justified Data Cleaning (Demographic Update)

Based on the data quality assessment, the demographic update dataset is complete and structurally consistent.
Cleaning actions in this step are intentionally minimal and focused on correctness and consistency, without introducing analytical assumptions.

The operations performed include:
- Robust parsing of the date column
- Validation of demographic age-count fields
- Standardization of State names using the canonical mapping adopted in Enrollment
- Structural standardization of District names (formatting only)
- Removal of temporary ingestion-related columns

### Step 7A: Robust Date Parsing

The demographic update dataset contains date values stored as text and expressed in more than one valid format,
which is common in administrative datasets collected across different reporting systems.

To ensure reliable temporal analysis and avoid data loss, a robust date parsing strategy is applied that:
- Explicitly handles all observed date formats
- Respects the day-first convention used in Indian datasets
- Preserves all records without arbitrary deletion

Date parsing is performed deterministically, and all parsed values are validated before proceeding.


In [49]:
# Work on a clean copy
demographic_clean = demographic_combined.copy()

# Attempt 1: DD-MM-YYYY
parsed_dash = pd.to_datetime(
    demographic_clean["date"],
    format="%d-%m-%Y",
    errors="coerce"
)

# Attempt 2: D/M/YYYY (Indian day-first), with an explicit format so that
# pandas does not infer a single format from the first value
parsed_slash = pd.to_datetime(
    demographic_clean["date"],
    format="%d/%m/%Y",
    errors="coerce"
)

# Combine parsing attempts
demographic_clean["date"] = parsed_dash.fillna(parsed_slash)

# Validate parsing
demographic_clean["date"].isna().sum()

np.int64(0)
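The same fallback idea can be written as a small helper that tries a list of explicit formats in order. This is a sketch under the assumption that only the two day-first layouts named above occur. `parse_dates` is a hypothetical name and the sample values are invented.

```python
import pandas as pd


def parse_dates(values, formats=("%d-%m-%Y", "%d/%m/%Y")):
    """Try each explicit format in turn; later formats only fill remaining gaps."""
    parsed = pd.to_datetime(values, format=formats[0], errors="coerce")
    for fmt in formats[1:]:
        attempt = pd.to_datetime(values, format=fmt, errors="coerce")
        parsed = parsed.fillna(attempt)
    return parsed


raw = pd.Series(["01-03-2025", "2/3/2025", "31/12/2025", "not a date"])
parsed = parse_dates(raw)

# Values that match none of the formats stay as NaT and can be inspected.
unparsed = raw[parsed.isna()]
print(parsed)
print(unparsed.tolist())
```

Using explicit formats keeps the day-first interpretation unambiguous, and the `unparsed` values show exactly which strings would need attention if the count were not zero.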

### Step 7B: Validation of Demographic Age Fields

Demographic update records include age-segmented counts.
These fields are validated to ensure basic numerical integrity prior to analysis.


In [50]:
age_cols = ["demo_age_5_17", "demo_age_17_"]

# Check for negative values
(demographic_clean[age_cols] < 0).sum()

demo_age_5_17    0
demo_age_17_     0
dtype: int64
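Negativity is one of several integrity properties a count field could be checked for. As a sketch, the hypothetical helper `count_field_issues` below also reports whether each column has an integer dtype and how many values are missing. The toy frame deliberately contains problems that the real data, per the outputs above, does not.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype


def count_field_issues(frame, columns):
    """Summarise basic integrity checks for count-like columns."""
    issues = {}
    for col in columns:
        series = frame[col]
        issues[col] = {
            "integer_dtype": is_integer_dtype(series),
            "missing": int(series.isna().sum()),
            "negative": int((series < 0).sum()),
        }
    return issues


# Toy frame with deliberate defects (illustrative only)
toy = pd.DataFrame(
    {
        "demo_age_5_17": [49, 22, -1],
        "demo_age_17_": [529.0, None, 765.0],
    }
)

issues = count_field_issues(toy, ["demo_age_5_17", "demo_age_17_"])
print(issues)
```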

### Step 7C: Standardization of State Names

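To ensure consistency across datasets, State values are normalized using the same canonical mapping
applied to the Enrollment dataset. This enables reliable cross-dataset aggregation.
Spelling variants and legacy names are mapped to canonical names, locality names that appear in the State column
are mapped to their parent state, and rows whose State value still does not match an official State or
Union Territory (for example `100000`) are removed.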

In [51]:
# Standardize State column
demographic_clean["state"] = (
    demographic_clean["state"]
    .astype(str)
    .str.strip()          # remove leading/trailing whitespace
    .str.title()          # standardize casing (e.g., 'karnataka' -> 'Karnataka')
)

In [52]:
# Check number of unique states
demographic_clean["state"].nunique()

58

In [53]:
demographic_clean["state"].unique()

array(['Uttar Pradesh', 'Andhra Pradesh', 'Gujarat', 'Rajasthan',
       'Karnataka', 'West Bengal', 'Telangana', 'Odisha', 'Maharashtra',
       'Kerala', 'Bihar', 'Tamil Nadu', 'Madhya Pradesh', 'Assam',
       'Tripura', 'Arunachal Pradesh', 'Punjab', 'Jharkhand', 'Delhi',
       'Chandigarh', 'Chhattisgarh', 'Jammu And Kashmir', 'Mizoram',
       'Nagaland', 'Himachal Pradesh', 'Goa', 'Haryana', 'Meghalaya',
       'Uttarakhand', 'Manipur', 'Daman And Diu', 'Puducherry', 'Sikkim',
       'Ladakh', 'Dadra And Nagar Haveli And Daman And Diu',
       'Dadra And Nagar Haveli', 'Orissa', 'Pondicherry',
       'Andaman & Nicobar Islands', 'Andaman And Nicobar Islands',
       'Daman & Diu', 'West  Bengal', 'Jammu & Kashmir', 'Lakshadweep',
       'Dadra & Nagar Haveli', 'Westbengal', 'West Bangal', 'Chhatisgarh',
       'West Bengli', 'Darbhanga', 'Puttenahalli', 'Uttaranchal',
       'Balanagar', 'Jaipur', 'Madanapalle', '100000', 'Nagpur',
       'Raja Annamalai Puram'], dtype=object)

In [54]:
# Canonical state name mapping
state_normalization_map = {
    # West Bengal variants
    "West Bengal": "West Bengal",
    "West  Bengal": "West Bengal",
    "West Bangal": "West Bengal",
    "Westbengal": "West Bengal",

    # Odisha / Orissa
    "Orissa": "Odisha",

    # Jammu & Kashmir
    "Jammu & Kashmir": "Jammu And Kashmir",

    # Andaman & Nicobar
    "Andaman & Nicobar Islands": "Andaman And Nicobar Islands",

    # UT merger
    "Dadra & Nagar Haveli": "Dadra And Nagar Haveli And Daman And Diu",
    "Daman & Diu": "Dadra And Nagar Haveli And Daman And Diu",
    "Daman And Diu": "Dadra And Nagar Haveli And Daman And Diu",
    "Dadra And Nagar Haveli": "Dadra And Nagar Haveli And Daman And Diu",
    "The Dadra And Nagar Haveli And Daman And Diu":
        "Dadra And Nagar Haveli And Daman And Diu",

    # Puducherry
    "Pondicherry": "Puducherry",

    "Tamilnadu": "Tamil Nadu",
    "West Bengli": "West Bengal",
    "Chhatisgarh": "Chhattisgarh",
    "Uttaranchal": "Uttarakhand"
}

LOCALITY_TO_STATE = {
    "Nagpur": "Maharashtra",                # Nagpur city / district is in Maharashtra
    "Darbhanga": "Bihar",                   # Darbhanga district is in Bihar
    "Jaipur": "Rajasthan",                  # Jaipur city / district is in Rajasthan
    "Balanagar": "Telangana",               # Balanagar (Hyderabad neighbourhood) is in Telangana
    "Puttenahalli": "Karnataka",            # Puttenahalli (Bengaluru suburb) is in Karnataka
    "Raja Annamalai Puram": "Tamil Nadu",   # R.A. Puram (Chennai neighbourhood) is in Tamil Nadu
    "Madanapalle": "Andhra Pradesh",        # Madanapalle is a town in Andhra Pradesh
}

demographic_clean["state"] = (
    demographic_clean["state"]
    .astype(str)
    .str.strip()
    .str.title()
    .replace(state_normalization_map)
)

# if state value is one of the localities, map to parent state
demographic_clean["state"] = (
    demographic_clean["state"]
    .replace(LOCALITY_TO_STATE)
)

official_states = [
    "Andhra Pradesh","Arunachal Pradesh","Assam","Bihar","Chhattisgarh","Goa","Gujarat",
    "Haryana","Himachal Pradesh","Jharkhand","Karnataka","Kerala","Madhya Pradesh",
    "Maharashtra","Manipur","Meghalaya","Mizoram","Nagaland","Odisha","Punjab","Rajasthan",
    "Sikkim","Tamil Nadu","Telangana","Tripura","Uttar Pradesh","Uttarakhand","West Bengal",
    "Andaman And Nicobar Islands","Chandigarh","Dadra And Nagar Haveli And Daman And Diu",
    "Delhi","Jammu And Kashmir","Ladakh","Lakshadweep","Puducherry"
]

demographic_clean = demographic_clean[
    demographic_clean["state"].isin(official_states)
].copy()

demographic_clean['state'].nunique()

36

In [55]:
demographic_clean["state"].unique()

array(['Uttar Pradesh', 'Andhra Pradesh', 'Gujarat', 'Rajasthan',
       'Karnataka', 'West Bengal', 'Telangana', 'Odisha', 'Maharashtra',
       'Kerala', 'Bihar', 'Tamil Nadu', 'Madhya Pradesh', 'Assam',
       'Tripura', 'Arunachal Pradesh', 'Punjab', 'Jharkhand', 'Delhi',
       'Chandigarh', 'Chhattisgarh', 'Jammu And Kashmir', 'Mizoram',
       'Nagaland', 'Himachal Pradesh', 'Goa', 'Haryana', 'Meghalaya',
       'Uttarakhand', 'Manipur',
       'Dadra And Nagar Haveli And Daman And Diu', 'Puducherry', 'Sikkim',
       'Ladakh', 'Andaman And Nicobar Islands', 'Lakshadweep'],
      dtype=object)
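The filter on `official_states` drops non-matching rows without recording which values were affected. A sketch of how the dropped values might be audited before filtering is shown below. `official` and `mapping` are truncated, illustrative stand-ins for `official_states` and the two mapping dictionaries, and the toy rows are invented.

```python
import pandas as pd

# Truncated stand-ins for the full lists used in the notebook.
official = {"Bihar", "West Bengal"}
mapping = {"West Bangal": "West Bengal", "Darbhanga": "Bihar"}

toy = pd.DataFrame(
    {
        "state": [" west bangal", "Bihar", "100000", "Darbhanga"],
        "demo_age_17_": [10, 20, 30, 40],
    }
)

# Same normalization chain: trim, title-case, then apply the mapping.
normalized = toy["state"].astype(str).str.strip().str.title().replace(mapping)
is_official = normalized.isin(official)

# Audit: original values that would be removed, with their row counts.
dropped = toy.loc[~is_official, "state"].value_counts()

# Keep only rows with an official state, storing the normalized name.
kept = toy.loc[is_official].assign(state=normalized[is_official])

print(dropped)
print(kept)
```

Recording `dropped` alongside the cleaned output would make the row removal in this step traceable.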

### Step 7D: Structural Standardization of District Names

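District names are standardized structurally to remove formatting noise while preserving
original administrative semantics. No semantic remapping or merging is performed.
Rows whose district value is only a direction token (for example `East` or `North East`) are removed
because they do not identify a district.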

In [56]:
demographic_clean["district"] = (
    demographic_clean["district"]
    .astype(str)
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
    .str.replace(r"\*", "", regex=True)
    .str.replace(r"[()]", "", regex=True)
    .str.replace("–", "-", regex=False)
    .str.replace("−", "-", regex=False)
    .str.replace("?", "", regex=False)
    .str.title()
)

# Remove clearly invalid direction-only tokens
invalid_districts = {"East", "West", "North", "South", "North East"}
demographic_clean = demographic_clean[
    ~demographic_clean["district"].isin(invalid_districts)
]

demographic_clean["district"].nunique()

952
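To illustrate the effect of formatting-only cleanup, the sketch below applies a similar chain to a few invented district strings. It is a variant rather than the notebook's exact code: whitespace is collapsed and trimmed after the symbol removal, so that stripping `*` or parentheses cannot leave stray spaces behind. `clean_district` is a hypothetical helper.

```python
import pandas as pd


def clean_district(values):
    """Formatting-only cleanup: no district is renamed or merged."""
    return (
        values.astype(str)
        .str.replace(r"[*()?]", "", regex=True)   # drop marker characters
        .str.replace(r"[–−]", "-", regex=True)    # unify dash variants
        .str.replace(r"\s+", " ", regex=True)     # collapse repeated whitespace
        .str.strip()
        .str.title()
    )


# Invented examples (not taken from the dataset)
sample = pd.Series(["  bengaluru   urban ", "Gurgaon *", "aurangabad (bh)", "North"])
cleaned = clean_district(sample)

# Direction-only tokens are flagged rather than silently dropped here.
invalid_districts = {"East", "West", "North", "South", "North East"}
flagged = cleaned[cleaned.isin(invalid_districts)]

print(cleaned.tolist())
print(flagged.tolist())
```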

In [57]:
# Drop temporary provenance column
if "source_file" in demographic_clean.columns:
    demographic_clean = demographic_clean.drop(columns=["source_file"])

In [58]:
demographic_clean.info()
demographic_clean.head()

<class 'pandas.core.frame.DataFrame'>
Index: 2069863 entries, 0 to 2071699
Data columns (total 6 columns):
 #   Column         Dtype         
---  ------         -----         
 0   date           datetime64[ns]
 1   state          object        
 2   district       object        
 3   pincode        int64         
 4   demo_age_5_17  int64         
 5   demo_age_17_   int64         
dtypes: datetime64[ns](1), int64(3), object(2)
memory usage: 110.5+ MB


        date           state    district  pincode  demo_age_5_17  demo_age_17_
0 2025-03-01   Uttar Pradesh   Gorakhpur   273213             49           529
1 2025-03-01  Andhra Pradesh    Chittoor   517132             22           375
2 2025-03-01         Gujarat      Rajkot   360006             65           765
3 2025-03-01  Andhra Pradesh  Srikakulam   532484             24           314
4 2025-03-01       Rajasthan     Udaipur   313801             45           785


### Step 7 Summary

The demographic update dataset now satisfies the following:
- Dates are consistently parsed and temporally reliable
- Age-count fields are validated for numerical integrity
- State names are canonical and consistent with Enrollment
- District names are structurally clean while preserving semantics
- Temporary ingestion artifacts are removed

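After the State and District filters, 2,069,863 of the 2,071,700 combined rows remain (1,837 removed).
The dataset is ready to be persisted for downstream analysis.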

## Step 8: Persisting the Clean Demographic Update Dataset

After completing all validation and cleaning steps, the demographic update dataset is finalized.
In this step, the cleaned dataset is persisted to disk to serve as the single source of truth
for all downstream analytical tasks.

In [59]:
# Define output directory (create parent directories as well if they are missing)
OUTPUT_DIR = Path("../03_Processed_Data")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Save cleaned demographic update dataset
output_path = OUTPUT_DIR / "demographic_update_clean.csv"
demographic_clean.to_csv(output_path, index=False)

output_path

WindowsPath('../03_Processed_Data/demographic_update_clean.csv')

In [60]:
# Quick verification of saved file
pd.read_csv(output_path, nrows=5)

         date           state    district  pincode  demo_age_5_17  demo_age_17_
0  2025-03-01   Uttar Pradesh   Gorakhpur   273213             49           529
1  2025-03-01  Andhra Pradesh    Chittoor   517132             22           375
2  2025-03-01         Gujarat      Rajkot   360006             65           765
3  2025-03-01  Andhra Pradesh  Srikakulam   532484             24           314
4  2025-03-01       Rajasthan     Udaipur   313801             45           785
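Reading the first five rows confirms that the file exists and looks right, but CSV does not store dtypes, so the `date` column comes back as text unless it is parsed again. The sketch below shows a fuller round-trip check on a toy frame written to a temporary directory. The file name and values are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the cleaned dataset.
frame = pd.DataFrame(
    {
        "date": pd.to_datetime(["2025-03-01", "2025-03-02"]),
        "state": ["Uttar Pradesh", "Gujarat"],
        "pincode": [273213, 360006],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "roundtrip.csv"
    frame.to_csv(path, index=False)
    # parse_dates restores the datetime type that CSV cannot carry.
    reloaded = pd.read_csv(path, parse_dates=["date"])

same_shape = reloaded.shape == frame.shape
same_columns = list(reloaded.columns) == list(frame.columns)
dates_match = bool((reloaded["date"] == frame["date"]).all())
pincodes_match = bool((reloaded["pincode"] == frame["pincode"]).all())

print(same_shape, same_columns, dates_match, pincodes_match)
```

Downstream notebooks that load the persisted file would likewise need `parse_dates=["date"]` (or an equivalent conversion) to recover the datetime column.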
