# Enrollment Dataset – Data Concatenation and Cleaning

## Overview
This notebook focuses on preparing the **Aadhaar Enrollment Dataset** for downstream analysis as part of the UIDAI Data Hackathon.

The enrollment data has been provided as **three separate CSV files**, which represent partitions of the same logical dataset. Before any meaningful analysis can be performed, these files must be **combined, validated, and cleaned** in a consistent and reproducible manner.

## Objectives
The objectives of this notebook are:

1. To **load and inspect** all enrollment data files provided by UIDAI
2. To **validate schema consistency** across the files
3. To **concatenate** the files into a single unified dataset
4. To perform **basic but essential data cleaning**, including:
   - Standardizing column names
   - Handling missing or malformed values
   - Parsing date fields
   - Removing structurally invalid records
5. To generate a **clean, analysis-ready enrollment dataset** that will serve as a single source of truth for subsequent analysis

## Scope and Design Principles
- This notebook is **limited strictly to data preparation**
- No exploratory analysis, modeling, or insights are derived here
- All transformations are **transparent, minimal, and reversible**
- No assumptions are made beyond what is necessary for data consistency

## Output
The final output of this notebook is a cleaned CSV file:

`03_Processed_Data/enrollment_clean.csv`


This file will be used as input for all further exploratory and analytical notebooks.

## Reproducibility
All steps in this notebook are deterministic and can be rerun end-to-end using the raw input files, ensuring full reproducibility of results.

## Step 1: Environment Setup and Library Imports

In this step, we import the required Python libraries and configure display settings to ensure consistent and readable outputs throughout the notebook.

In [22]:
import pandas as pd
from pathlib import Path

# Display configuration for better readability
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 120)

## Step 2: Defining the Enrollment Data Source and File Inventory

The Aadhaar Enrollment data has been provided as multiple CSV files, each representing a partition of the same dataset.
Before loading the data, we explicitly define the data source location and enumerate all available enrollment files.

This step ensures:
- The correct directory structure is being used
- All expected enrollment files are present
- The data loading process is transparent and reproducible

In [23]:
# Define the path to the raw enrollment data
ENROLLMENT_DATA_PATH = Path("../01_Raw_Data_National/enrolment")

# List all enrollment CSV files
enrollment_files = sorted(ENROLLMENT_DATA_PATH.glob("*.csv"))

# Display the discovered files
enrollment_files

[WindowsPath('../01_Raw_Data_National/enrolment/api_data_aadhar_enrolment_0_500000.csv'),
 WindowsPath('../01_Raw_Data_National/enrolment/api_data_aadhar_enrolment_1000000_1006029.csv'),
 WindowsPath('../01_Raw_Data_National/enrolment/api_data_aadhar_enrolment_500000_1000000.csv')]
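
The listing above is sorted alphabetically, so the `1000000_1006029` partition appears before `500000_1000000`. The following is a minimal sketch of one way to confirm that the partitions cover a continuous row range. It assumes that each file name ends with `_<start>_<end>.csv`, which is inferred from the names shown and not documented by the data provider. The helpers `parse_row_range` and `ranges_are_contiguous` are hypothetical names introduced here for illustration.

```python
import re

# Assumption: every file name ends with "_<start>_<end>.csv", as in the listing above.
RANGE_PATTERN = re.compile(r"_(\d+)_(\d+)\.csv$")


def parse_row_range(file_name: str) -> tuple[int, int]:
    """Return the (start, end) row range encoded in a partition file name."""
    match = RANGE_PATTERN.search(file_name)
    if match is None:
        raise ValueError(f"No row range found in file name: {file_name}")
    return int(match.group(1)), int(match.group(2))


def ranges_are_contiguous(file_names: list[str]) -> bool:
    """Check that each partition starts where the previous one ends."""
    ranges = sorted(parse_row_range(name) for name in file_names)
    return all(
        prev_end == next_start
        for (_, prev_end), (next_start, _) in zip(ranges, ranges[1:])
    )


file_names = [
    "api_data_aadhar_enrolment_0_500000.csv",
    "api_data_aadhar_enrolment_1000000_1006029.csv",
    "api_data_aadhar_enrolment_500000_1000000.csv",
]

# Order partitions by their numeric start row rather than alphabetically
ordered_names = sorted(file_names, key=parse_row_range)
print(ordered_names)
print(ranges_are_contiguous(file_names))
```

Row order does not affect the cleaning steps in this notebook, so numeric ordering is optional. The contiguity check is mainly useful for spotting a missing partition.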

## Step 3: Loading Enrollment Data with Provenance Tracking

In this step, we load each enrollment CSV file into memory and attach a provenance identifier to every record.
Since the dataset has been split into multiple files for distribution purposes, it is important to retain information about the source file for traceability and validation.

Adding provenance at this stage allows:
- Verification of successful concatenation
- Debugging of anomalies, if any, at the file level
- Transparency in the data preparation process

In [24]:
# Initialize a list to store individual enrollment DataFrames
enrollment_dfs = []

# Load each enrollment file and add provenance information
for file_path in enrollment_files:
    df = pd.read_csv(file_path)
    df["source_file"] = file_path.name
    enrollment_dfs.append(df)

# Confirm the number of loaded DataFrames
len(enrollment_dfs)

3

## Step 4: Schema Validation Across Enrollment Files

Before combining the enrollment data files, it is essential to verify that all files share a consistent schema.
Schema validation ensures that each column represents the same attribute across all partitions and prevents silent data corruption during concatenation.

In this step, we compare:
- Column names
- Column counts
- Column ordering (for reference)

Any discrepancies identified at this stage must be understood and resolved before proceeding further.

In [25]:
# Extract column schemas from each enrollment DataFrame
schemas = [df.columns.tolist() for df in enrollment_dfs]

# Display schema details for comparison
for idx, schema in enumerate(schemas, start=1):
    print(f"Schema for file {idx}:")
    print(schema)
    print("-" * 80)

# Check if all schemas are identical
all_schemas_identical = all(schema == schemas[0] for schema in schemas)
all_schemas_identical

Schema for file 1:
['date', 'state', 'district', 'pincode', 'age_0_5', 'age_5_17', 'age_18_greater', 'source_file']
--------------------------------------------------------------------------------
Schema for file 2:
['date', 'state', 'district', 'pincode', 'age_0_5', 'age_5_17', 'age_18_greater', 'source_file']
--------------------------------------------------------------------------------
Schema for file 3:
['date', 'state', 'district', 'pincode', 'age_0_5', 'age_5_17', 'age_18_greater', 'source_file']
--------------------------------------------------------------------------------


True
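
A single boolean confirms that the schemas match, but it does not explain a mismatch when one occurs. The sketch below shows one possible way to report missing columns, extra columns, ordering, and dtype differences between two partitions. The function `describe_schema_differences` and the sample frames are hypothetical and are not part of the notebook.

```python
import pandas as pd


def describe_schema_differences(reference: pd.DataFrame, other: pd.DataFrame) -> dict:
    """Summarize how `other` deviates from `reference` in columns and dtypes."""
    reference_cols = list(reference.columns)
    other_cols = list(other.columns)
    shared = [col for col in reference_cols if col in other_cols]
    return {
        "missing_columns": [c for c in reference_cols if c not in other_cols],
        "extra_columns": [c for c in other_cols if c not in reference_cols],
        "same_order": reference_cols == other_cols,
        "dtype_mismatches": {
            col: (str(reference[col].dtype), str(other[col].dtype))
            for col in shared
            if reference[col].dtype != other[col].dtype
        },
    }


# Hypothetical partitions: the second one has a string pincode and lacks age_0_5
part_a = pd.DataFrame({"state": ["Bihar"], "pincode": [800001], "age_0_5": [3]})
part_b = pd.DataFrame({"state": ["Goa"], "pincode": ["403001"]})

report = describe_schema_differences(part_a, part_b)
print(report)
```

Comparing dtypes as well as names can catch cases where a column such as `pincode` is read as text in one file and as an integer in another.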

## Step 5: Concatenating the Enrollment Dataset

After confirming that all enrollment files share an identical schema, we proceed to concatenate them into a single unified dataset.
This step combines all records while preserving row-level integrity and prepares the data for cleaning and standardization.

The resulting dataset represents the complete enrollment population covered by the provided UIDAI data.

In [26]:
# Concatenate all enrollment DataFrames into a single DataFrame
enrollment_combined = pd.concat(enrollment_dfs, ignore_index=True)

# Inspect the shape of the combined dataset
enrollment_combined.shape

(1006029, 8)

In [27]:
enrollment_combined["source_file"].value_counts()

source_file
api_data_aadhar_enrolment_0_500000.csv           500000
api_data_aadhar_enrolment_500000_1000000.csv     500000
api_data_aadhar_enrolment_1000000_1006029.csv      6029
Name: count, dtype: int64
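
The per-file counts above sum to the combined row count (500000 + 500000 + 6029 = 1006029). As a sketch, the same checks can be written as assertions so that a rerun fails loudly if a partition is dropped or duplicated. The partitions below are hypothetical stand-ins for the loaded files.

```python
import pandas as pd

# Hypothetical partitions standing in for the loaded enrollment files
parts = [
    pd.DataFrame({"pincode": [793121, 560043], "source_file": "part_a.csv"}),
    pd.DataFrame({"pincode": [208001], "source_file": "part_b.csv"}),
]

combined = pd.concat(parts, ignore_index=True)

# Rows in the combined frame should equal the sum of rows across the parts
expected_rows = sum(len(part) for part in parts)
assert len(combined) == expected_rows

# Per-file counts after concatenation should match each part's original length
counts_after = combined["source_file"].value_counts().to_dict()
counts_before = {part["source_file"].iloc[0]: len(part) for part in parts}
assert counts_after == counts_before

# ignore_index=True should yield a clean 0..n-1 index
assert combined.index.equals(pd.RangeIndex(expected_rows))
```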

## Step 6: Initial Data Quality Assessment and Cleaning Strategy

Before applying any cleaning operations, it is important to assess the overall quality of the combined enrollment dataset.
This step focuses on understanding data completeness, data types, and potential structural issues that may affect downstream analysis.

The objective is not to aggressively modify the data, but to:
- Identify columns with high levels of missing values
- Inspect data types and detect obvious inconsistencies
- Define a minimal and justified cleaning strategy

All cleaning decisions made after this step are guided by evidence observed here.


In [28]:
# High-level overview of the combined enrollment dataset
enrollment_combined.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1006029 entries, 0 to 1006028
Data columns (total 8 columns):
 #   Column          Non-Null Count    Dtype 
---  ------          --------------    ----- 
 0   date            1006029 non-null  object
 1   state           1006029 non-null  object
 2   district        1006029 non-null  object
 3   pincode         1006029 non-null  int64 
 4   age_0_5         1006029 non-null  int64 
 5   age_5_17        1006029 non-null  int64 
 6   age_18_greater  1006029 non-null  int64 
 7   source_file     1006029 non-null  object
dtypes: int64(4), object(4)
memory usage: 61.4+ MB
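
The `info()` output shows no missing values in any column. For datasets where that is not the case, a compact profile table can make the assessment easier to read. The sketch below is one possible approach. The function `profile_quality` and the sample data are hypothetical, and the duplicate-row count is an extra check that this notebook does not perform.

```python
import pandas as pd


def profile_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing count, and missing share for each column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(2),
    })


# Hypothetical sample with one missing district and one repeated row
sample = pd.DataFrame({
    "state": ["Bihar", "Bihar", "Goa", "Goa"],
    "district": ["Patna", "Patna", None, "North Goa"],
    "age_0_5": [4, 4, 1, 2],
})

profile = profile_quality(sample)
duplicate_rows = int(sample.duplicated().sum())
print(profile)
print(f"Fully duplicated rows: {duplicate_rows}")
```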


## Step 7: Applying Minimal and Justified Data Cleaning

Following dataset concatenation and schema validation, this step applies a series of minimal and evidence-based data cleaning operations.
All actions in this step are strictly limited to structural correctness, data integrity, and consistency, without introducing analytical assumptions or feature engineering.

Each sub-step below documents a specific validation or cleaning action performed on the enrollment dataset.

### Step 7A: Robust Date Parsing

Inspection of the enrollment dataset revealed that the date column contains valid values expressed in more than one official format,
which is common in Indian administrative data.

To prevent data loss and ensure accurate temporal analysis, a robust date parsing strategy was applied that:
- Explicitly handles all observed date formats
- Respects the day-first convention used in Indian datasets
- Preserves all records without arbitrary deletion

No rows were removed during this process.

In [29]:
# Create a clean copy to avoid mutating the combined dataset
enrollment_clean = enrollment_combined.copy()

# Attempt 1: Parse dates in DD-MM-YYYY format
parsed_dash = pd.to_datetime(
    enrollment_clean["date"],
    format="%d-%m-%Y",
    errors="coerce"
)

# Attempt 2: Fallback for remaining values (e.g., D/M/YYYY); pandas infers the format, day-first
parsed_slash = pd.to_datetime(
    enrollment_clean["date"],
    dayfirst=True,
    errors="coerce"
)

# Combine both parsing strategies
enrollment_clean["date"] = parsed_dash.fillna(parsed_slash)

In [30]:
# Check for any unparsed dates
enrollment_clean["date"].isna().sum()

np.int64(0)
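
The two-pass idea can be illustrated on a small example. The sketch below uses an explicit format for each pass instead of format inference, which is a slightly stricter variant than the cell above, and it lists the raw values that neither pass could parse. The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical raw values: dash-separated, slash-separated, and one invalid entry
raw_dates = pd.Series(["02-03-2025", "9/3/2025", "not a date"])

# Pass 1: DD-MM-YYYY; anything else becomes NaT
parsed_dash = pd.to_datetime(raw_dates, format="%d-%m-%Y", errors="coerce")

# Pass 2: D/M/YYYY; anything else becomes NaT
parsed_slash = pd.to_datetime(raw_dates, format="%d/%m/%Y", errors="coerce")

# Prefer the first pass and fill its gaps from the second
parsed = parsed_dash.fillna(parsed_slash)

# Keep the original strings that no pass could parse, for inspection
unparsed = raw_dates[parsed.isna()]
print(parsed)
print(unparsed.tolist())
```

Inspecting the unparsed raw strings, not only their count, makes it easier to decide whether another format is needed or the values are genuinely invalid.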

### Step 7B: Validation of Age-Based Numerical Fields

The enrollment dataset contains age-segmented numerical fields representing different age groups.
Before further analysis, these fields were validated to ensure basic numerical integrity.

The following checks were performed:
- Verification that all age-based values are non-negative
- Confirmation that no structurally invalid numerical values exist

No transformations or corrections were required, as all values were found to be valid.

In [31]:
# Validate age columns contain no negative values
age_columns = ["age_0_5", "age_5_17", "age_18_greater"]
(enrollment_clean[age_columns] < 0).sum()

age_0_5           0
age_5_17          0
age_18_greater    0
dtype: int64
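
Beyond counting negative values per column, it can help to confirm the integer dtype and to pull out the offending rows for inspection. The sketch below shows both checks on a hypothetical sample that contains one negative count.

```python
import pandas as pd

age_columns = ["age_0_5", "age_5_17", "age_18_greater"]

# Hypothetical sample containing one negative count
sample = pd.DataFrame({
    "age_0_5": [11, 14, -1],
    "age_5_17": [61, 33, 5],
    "age_18_greater": [37, 39, 0],
})

# Counts should be stored as integers, not floats or strings
non_integer_columns = [
    col for col in age_columns
    if not pd.api.types.is_integer_dtype(sample[col])
]

# Flag every row in which any age column is negative
negative_rows = sample[(sample[age_columns] < 0).any(axis=1)]

print(non_integer_columns)
print(negative_rows)
```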

### Step 7C: Standardization of State Names

State names in the enrollment dataset exhibited semantic duplication due to legacy spellings,
administrative reorganization, and formatting inconsistencies.

To enable reliable state-level aggregation, all State values were normalized to a canonical set of officially recognized names using an explicit and documented mapping strategy.

This process:
- Eliminated duplicate representations of the same State
- Preserved administrative correctness
- Removed clearly invalid tokens

All normalization decisions were applied transparently and consistently.


In [32]:
# Standardize State column
enrollment_clean["state"] = (
    enrollment_clean["state"]
    .astype(str)
    .str.strip()          # remove leading/trailing whitespace
    .str.title()          # standardize casing (e.g., 'karnataka' -> 'Karnataka')
)

In [33]:
# Check number of unique states
enrollment_clean["state"].nunique()

49

In [34]:
enrollment_clean["state"].unique()

array(['Meghalaya', 'Karnataka', 'Uttar Pradesh', 'Bihar', 'Maharashtra',
       'Haryana', 'Rajasthan', 'Punjab', 'Delhi', 'Madhya Pradesh',
       'West Bengal', 'Assam', 'Uttarakhand', 'Gujarat', 'Andhra Pradesh',
       'Tamil Nadu', 'Chhattisgarh', 'Jharkhand', 'Nagaland', 'Manipur',
       'Telangana', 'Tripura', 'Mizoram', 'Jammu And Kashmir',
       'Chandigarh', 'Sikkim', 'Odisha', 'Kerala',
       'The Dadra And Nagar Haveli And Daman And Diu',
       'Arunachal Pradesh', 'Himachal Pradesh', 'Goa',
       'Dadra And Nagar Haveli And Daman And Diu', 'Ladakh',
       'Andaman And Nicobar Islands', 'Orissa', 'Pondicherry',
       'Puducherry', 'Lakshadweep', 'Andaman & Nicobar Islands',
       'Dadra & Nagar Haveli', 'Dadra And Nagar Haveli', 'Daman And Diu',
       'Jammu & Kashmir', 'West  Bengal', '100000', 'Daman & Diu',
       'West Bangal', 'Westbengal'], dtype=object)

In [35]:
# Canonical state name mapping
state_normalization_map = {
    # West Bengal variants
    "West Bengal": "West Bengal",
    "West  Bengal": "West Bengal",
    "West Bangal": "West Bengal",
    "Westbengal": "West Bengal",

    # Odisha / Orissa
    "Orissa": "Odisha",

    # Jammu & Kashmir
    "Jammu & Kashmir": "Jammu And Kashmir",

    # Andaman & Nicobar
    "Andaman & Nicobar Islands": "Andaman And Nicobar Islands",

    # UT merger
    "Dadra & Nagar Haveli": "Dadra And Nagar Haveli And Daman And Diu",
    "Daman & Diu": "Dadra And Nagar Haveli And Daman And Diu",
    "Daman And Diu": "Dadra And Nagar Haveli And Daman And Diu",
    "Dadra And Nagar Haveli": "Dadra And Nagar Haveli And Daman And Diu",
    "The Dadra And Nagar Haveli And Daman And Diu":
        "Dadra And Nagar Haveli And Daman And Diu",

    # Puducherry
    "Pondicherry": "Puducherry"
}

# Apply normalization
enrollment_clean["state"] = (
    enrollment_clean["state"]
    .replace(state_normalization_map)
)

# Remove clearly invalid state values
invalid_states = ["100000"]

enrollment_clean = enrollment_clean[
    ~enrollment_clean["state"].isin(invalid_states)
]

enrollment_clean["state"].nunique()

36

In [36]:
enrollment_clean["state"].unique()

array(['Meghalaya', 'Karnataka', 'Uttar Pradesh', 'Bihar', 'Maharashtra',
       'Haryana', 'Rajasthan', 'Punjab', 'Delhi', 'Madhya Pradesh',
       'West Bengal', 'Assam', 'Uttarakhand', 'Gujarat', 'Andhra Pradesh',
       'Tamil Nadu', 'Chhattisgarh', 'Jharkhand', 'Nagaland', 'Manipur',
       'Telangana', 'Tripura', 'Mizoram', 'Jammu And Kashmir',
       'Chandigarh', 'Sikkim', 'Odisha', 'Kerala',
       'Dadra And Nagar Haveli And Daman And Diu', 'Arunachal Pradesh',
       'Himachal Pradesh', 'Goa', 'Ladakh', 'Andaman And Nicobar Islands',
       'Puducherry', 'Lakshadweep'], dtype=object)
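
One way to guard the mapping against new spellings in future data is to compare the normalized values with an explicit canonical set. The sketch below illustrates the idea. The set `canonical_states` is a hypothetical three-name subset used only for illustration, and `state_map` reuses a few entries from the mapping above.

```python
import pandas as pd

# Hypothetical subset of canonical names, for illustration only
canonical_states = {"West Bengal", "Odisha", "Puducherry"}

# A few entries taken from the normalization mapping above
state_map = {
    "West Bangal": "West Bengal",
    "Orissa": "Odisha",
    "Pondicherry": "Puducherry",
}

states = pd.Series(["West Bengal", "West Bangal", "Orissa", "100000", "Pondicherry"])
normalized = states.replace(state_map)

# Anything still outside the canonical set needs a mapping entry or removal
unexpected = sorted(set(normalized) - canonical_states)
print(unexpected)
```

With this check in place, a value such as `100000` is surfaced automatically instead of being found by reading the `unique()` output.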

### Step 7D: Structural Standardization of District Names

District names in the dataset reflect diverse administrative reporting practices, historical naming conventions, and recent district formations.
Given the frequency of district reorganization in India, full semantic normalization was intentionally avoided.

Instead, district names were standardized at a structural level by:
- Removing extraneous whitespace and formatting artifacts
- Normalizing text casing
- Eliminating clearly invalid non-district tokens

No semantic remapping or merging of district names was performed to preserve the original reporting granularity.


In [37]:
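enrollment_clean["district"] = (
    enrollment_clean["district"]
    .astype(str)
    .str.strip()                             # remove leading/trailing whitespace
    .str.replace(r"\s+", " ", regex=True)    # collapse repeated whitespace
    .str.replace(r"\*", "", regex=True)      # drop asterisks
    .str.replace(r"[()]", "", regex=True)    # drop parentheses
    .str.replace("–", "-", regex=False)      # en dash -> hyphen
    .str.replace("−", "-", regex=False)      # minus sign -> hyphen
    .str.replace("?", "", regex=False)       # drop question marks
    .str.title()                             # standardize casing
)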

In [38]:
# Bare directional tokens treated as invalid district values in this notebook
invalid_districts = {
    "East", "West", "North", "South", "North East"
}

enrollment_clean = enrollment_clean[
    ~enrollment_clean["district"].isin(invalid_districts)
]

enrollment_clean["district"].nunique()

956
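
The structural cleanup can be illustrated on a few sample strings. The sketch below is a variant of the cell above, not a copy of it. It groups the symbol removals into one pattern and trims whitespace after the symbols are removed, so that a value such as `"Bagalkot *"` does not keep a trailing space. The function `clean_district_names` and the sample values are hypothetical.

```python
import pandas as pd


def clean_district_names(districts: pd.Series) -> pd.Series:
    """Apply structural cleanup only; no district is renamed or merged."""
    return (
        districts.astype(str)
        .str.replace(r"[*()?]", "", regex=True)   # drop stray symbols
        .str.replace(r"[–−]", "-", regex=True)    # en dash / minus sign -> hyphen
        .str.replace(r"\s+", " ", regex=True)     # collapse repeated whitespace
        .str.strip()                              # trim after symbols are removed
        .str.title()                              # standardize casing
    )


# Hypothetical raw values showing whitespace, dash, and asterisk artifacts
samples = pd.Series(["  bengaluru   urban ", "Medchal–Malkajgiri", "Bagalkot *"])
cleaned = clean_district_names(samples)
print(cleaned.tolist())
```

Because this variant could change which strings compare equal, the unique-district count should be rechecked before adopting it in place of the original chain.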

In [39]:
# Drop temporary provenance column used during ingestion
if "source_file" in enrollment_clean.columns:
    enrollment_clean = enrollment_clean.drop(columns=["source_file"])

In [40]:
enrollment_clean.info()
enrollment_clean.head()

<class 'pandas.core.frame.DataFrame'>
Index: 1005293 entries, 0 to 1006028
Data columns (total 7 columns):
 #   Column          Non-Null Count    Dtype         
---  ------          --------------    -----         
 0   date            1005293 non-null  datetime64[ns]
 1   state           1005293 non-null  object        
 2   district        1005293 non-null  object        
 3   pincode         1005293 non-null  int64         
 4   age_0_5         1005293 non-null  int64         
 5   age_5_17        1005293 non-null  int64         
 6   age_18_greater  1005293 non-null  int64         
dtypes: datetime64[ns](1), int64(4), object(2)
memory usage: 61.4+ MB


| | date | state | district | pincode | age_0_5 | age_5_17 | age_18_greater |
|---|---|---|---|---|---|---|---|
| 0 | 2025-03-02 | Meghalaya | East Khasi Hills | 793121 | 11 | 61 | 37 |
| 1 | 2025-03-09 | Karnataka | Bengaluru Urban | 560043 | 14 | 33 | 39 |
| 2 | 2025-03-09 | Uttar Pradesh | Kanpur Nagar | 208001 | 29 | 82 | 12 |
| 3 | 2025-03-09 | Uttar Pradesh | Aligarh | 202133 | 62 | 29 | 15 |
| 4 | 2025-03-09 | Karnataka | Bengaluru Urban | 560016 | 14 | 16 | 21 |


### Step 7 Summary

At the conclusion of Step 7, the enrollment dataset satisfies the following conditions:
- Dates are consistently parsed and temporally reliable
- Numerical age fields are validated for integrity
- State names are canonical and aggregation-safe
- District names are structurally clean while preserving administrative semantics

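The dataset is now fully prepared for downstream exploratory and analytical tasks. Filtering invalid state and district values in Steps 7C and 7D reduced it from 1,006,029 to 1,005,293 rows (736 rows removed).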


## Step 8: Persisting the Clean Enrollment Dataset

After completing all validation and cleaning steps, the enrollment dataset is now finalized.
In this step, the cleaned dataset is persisted to disk to serve as the single source of truth for all downstream analysis.


In [41]:
# Define output directory for processed data
OUTPUT_DIR = Path("../03_Processed_Data")
OUTPUT_DIR.mkdir(exist_ok=True)

# Persist the cleaned enrollment dataset
output_path = OUTPUT_DIR / "enrollment_clean.csv"
enrollment_clean.to_csv(output_path, index=False)

output_path

WindowsPath('../03_Processed_Data/enrollment_clean.csv')

In [42]:
# Verify saved file can be read correctly
pd.read_csv(output_path, nrows=5)

| | date | state | district | pincode | age_0_5 | age_5_17 | age_18_greater |
|---|---|---|---|---|---|---|---|
| 0 | 2025-03-02 | Meghalaya | East Khasi Hills | 793121 | 11 | 61 | 37 |
| 1 | 2025-03-09 | Karnataka | Bengaluru Urban | 560043 | 14 | 33 | 39 |
| 2 | 2025-03-09 | Uttar Pradesh | Kanpur Nagar | 208001 | 29 | 82 | 12 |
| 3 | 2025-03-09 | Uttar Pradesh | Aligarh | 202133 | 62 | 29 | 15 |
| 4 | 2025-03-09 | Karnataka | Bengaluru Urban | 560016 | 14 | 16 | 21 |
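
Reading the first five rows confirms that the file opens, but it does not confirm that every value survived the round trip. The sketch below shows a fuller check on a small hypothetical frame written to a temporary directory. It assumes that the date column is re-parsed with `parse_dates`, since a CSV file does not store dtypes.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical cleaned sample; index 1 is missing, as it would be after filtering rows
sample = pd.DataFrame(
    {
        "date": pd.to_datetime(["2025-03-02", "2025-03-09"]),
        "state": ["Meghalaya", "Karnataka"],
        "pincode": [793121, 560043],
    },
    index=[0, 2],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "sample_clean.csv"
    sample.to_csv(path, index=False)
    # CSV does not store dtypes, so the date column is parsed again on load
    reloaded = pd.read_csv(path, parse_dates=["date"])

# The index is not written to disk, so compare against a reset index
expected = sample.reset_index(drop=True)
pd.testing.assert_frame_equal(expected, reloaded, check_dtype=False)
print(len(reloaded))
```

Downstream notebooks that read the saved file would need the same `parse_dates` argument, or an equivalent conversion, to get a datetime column back.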
