# AadhaarPulse: Data Loading & Preprocessing

**Objective**: Ingest raw, split CSV files for Enrolment, Demographic, and Biometric updates. Clean, merge, and prepare them for downstream analysis.

## 1. Setup & Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from utils import load_and_merge_datasets, basic_preprocessing

# %matplotlib inline
sns.set_style('whitegrid')

## 2. Data Loading
Using the custom `load_and_merge_datasets` utility to aggregate the split files for each dataset from the directory structure.

In [2]:
DATA_DIR = 'd:/UIDAI'
raw_data = load_and_merge_datasets(DATA_DIR)

Loading 3 files for enrolment...
Successfully merged enrolment: 1006029 rows, 7 columns.
Loading 5 files for demographic...
Successfully merged demographic: 2071700 rows, 6 columns.
Loading 4 files for biometric...
Successfully merged biometric: 1861108 rows, 6 columns.
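
The source of `load_and_merge_datasets` lives in `utils` and is not shown in this notebook. As a rough sketch of what such a loader might do, the hypothetical function below (`load_split_csvs`, not the project's actual code) assumes each part file is a CSV whose file name contains the dataset name, and concatenates the parts per dataset.

```python
from pathlib import Path

import pandas as pd


def load_split_csvs(data_dir, categories=("enrolment", "demographic", "biometric")):
    """Hypothetical stand-in for utils.load_and_merge_datasets.

    Assumption: part files are CSVs whose names contain the category name
    somewhere under data_dir. The real directory layout may differ.
    """
    merged = {}
    for category in categories:
        files = sorted(Path(data_dir).rglob(f"*{category}*.csv"))
        print(f"Loading {len(files)} files for {category}...")
        if not files:
            # Keep the key present so downstream `df.empty` checks still work.
            merged[category] = pd.DataFrame()
            continue
        # Read pincode as text so it is never coerced to a number.
        frames = [pd.read_csv(f, dtype={"pincode": str}) for f in files]
        merged[category] = pd.concat(frames, ignore_index=True)
    return merged
```

Passing `ignore_index=True` gives the merged frame a fresh 0..n-1 index, so rows from different part files never share an index label.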


## 3. Initial Inspection & Cleaning

In [3]:
processed_data = basic_preprocessing(raw_data)

enrolment_df = processed_data['enrolment']
demographic_df = processed_data['demographic']
biometric_df = processed_data['biometric']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['pincode'] = df['pincode'].str.replace(r'\D', '', regex=True)


Preprocessing completed for enrolment.
Preprocessing completed for demographic.
Preprocessing completed for biometric.
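
The `SettingWithCopyWarning` above suggests that `basic_preprocessing` assigns to a frame that pandas regards as a slice of another frame. One common way to avoid it is to take an explicit `.copy()` right after filtering. The sketch below uses made-up toy data and an assumed six-digit validity rule; only the `\D` regex is taken from the warning text.

```python
import pandas as pd

# Toy data, invented for illustration only.
raw = pd.DataFrame({
    "pincode": ["560 043", "208001", "79-3121", None],
    "age_0_5": [14, 29, 11, 5],
})

# Filtering returns a new object that pandas may treat as a slice of `raw`;
# .copy() makes ownership explicit so the assignment below does not warn.
clean = raw[raw["pincode"].notna()].copy()

# Same regex as in the warning text: strip every non-digit character.
clean["pincode"] = clean["pincode"].str.replace(r"\D", "", regex=True)

# Keep only six-digit PIN codes (an assumed rule, not confirmed by the source).
clean = clean[clean["pincode"].str.fullmatch(r"\d{6}")].copy()
```

The same `.copy()` applied to the three frames pulled out of `processed_data` would likely silence the warnings in the later cells as well, at the cost of some extra memory.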


In [4]:
print("Enrolment Shape:", enrolment_df.shape)
print("Demographic Shape:", demographic_df.shape)
print("Biometric Shape:", biometric_df.shape)

display(enrolment_df.head())
enrolment_df.info()  # info() prints its report and returns None, so it is not wrapped in display()

Enrolment Shape: (983072, 7)
Demographic Shape: (1598099, 6)
Biometric Shape: (1766212, 6)


| | date | state | district | pincode | age_0_5 | age_5_17 | age_18_greater |
|---|---|---|---|---|---|---|---|
| 0 | 2025-03-02 | Meghalaya | East Khasi Hills | 793121 | 11 | 61 | 37 |
| 1 | 2025-03-09 | Karnataka | Bengaluru Urban | 560043 | 14 | 33 | 39 |
| 2 | 2025-03-09 | Uttar Pradesh | Kanpur Nagar | 208001 | 29 | 82 | 12 |
| 3 | 2025-03-09 | Uttar Pradesh | Aligarh | 202133 | 62 | 29 | 15 |
| 4 | 2025-03-09 | Karnataka | Bengaluru Urban | 560016 | 14 | 16 | 21 |


<class 'pandas.core.frame.DataFrame'>
Index: 983072 entries, 0 to 1004911
Data columns (total 7 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   date            983072 non-null  datetime64[ns]
 1   state           983072 non-null  object        
 2   district        983072 non-null  object        
 3   pincode         983072 non-null  object        
 4   age_0_5         983072 non-null  int64         
 5   age_5_17        983072 non-null  int64         
 6   age_18_greater  983072 non-null  int64         
dtypes: datetime64[ns](1), int64(3), object(3)
memory usage: 60.0+ MB
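
Comparing the merged row counts from Section 2 with the shapes printed above shows how much each dataset shrank during preprocessing. The short calculation below uses only those logged numbers. Why the rows were dropped (duplicates, invalid PIN codes, unparseable dates) is not stated in this notebook, so the sketch measures the loss without explaining it.

```python
# Row counts copied from the loader log (raw) and the shape printout (clean).
raw_rows = {"enrolment": 1006029, "demographic": 2071700, "biometric": 1861108}
clean_rows = {"enrolment": 983072, "demographic": 1598099, "biometric": 1766212}

# Absolute and relative row loss per dataset.
dropped = {name: raw_rows[name] - clean_rows[name] for name in raw_rows}
dropped_pct = {
    name: round(100 * dropped[name] / raw_rows[name], 2) for name in raw_rows
}

for name in raw_rows:
    print(f"{name}: dropped {dropped[name]} rows ({dropped_pct[name]}%)")
```

The demographic dataset loses roughly 23% of its rows, far more than the other two, which may be worth investigating before relying on its totals.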

## 4. Feature Engineering

### 4.1 Time Extraction
Extracting `Year`, `Month`, `MonthName`, and `Quarter` from the `date` column for trend analysis.

In [5]:
for name, df in [('enrolment', enrolment_df), ('demographic', demographic_df), ('biometric', biometric_df)]:
    if not df.empty:
        df['Year'] = df['date'].dt.year
        df['Month'] = df['date'].dt.month
        df['MonthName'] = df['date'].dt.month_name()
        df['Quarter'] = df['date'].dt.quarter
        print(f"Added time features to {name} data.")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Year'] = df['date'].dt.year
Added time features to enrolment data.


Added time features to demographic data.
Added time features to biometric data.


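
As an alternative to mutating each frame inside a loop, the same four columns can be produced by a small helper that returns a new frame. This is a sketch only: `add_time_features` is a hypothetical name, and the toy rows are invented for illustration.

```python
import pandas as pd


def add_time_features(df, date_col="date"):
    """Return a copy of df with Year, Month, MonthName and Quarter columns."""
    out = df.copy()  # work on a copy so the caller's frame is left untouched
    dates = pd.to_datetime(out[date_col])
    out["Year"] = dates.dt.year
    out["Month"] = dates.dt.month
    out["MonthName"] = dates.dt.month_name()
    out["Quarter"] = dates.dt.quarter
    return out


# Toy frame, not real data.
toy = pd.DataFrame({"date": ["2025-03-02", "2025-12-31"], "age_0_5": [11, 4]})
toy_features = add_time_features(toy)
```

Because the helper copies its input, it can be applied as `enrolment_df = add_time_features(enrolment_df)` without triggering the slice warning.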


### 4.2 Total Activity Calculation
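Summing the age-group columns row by row to get `Total_Enrolments`, `Total_Demo_Updates`, and `Total_Bio_Updates`. Enrolment has three age bands, while the demographic and biometric datasets have two each (`5_17` and `17_`).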

In [6]:
# Enrolment Totals
if not enrolment_df.empty:
    enrolment_df['Total_Enrolments'] = enrolment_df['age_0_5'] + enrolment_df['age_5_17'] + enrolment_df['age_18_greater']

# Demographic Totals
if not demographic_df.empty:
    demographic_df['Total_Demo_Updates'] = demographic_df['demo_age_5_17'] + demographic_df['demo_age_17_']

# Biometric Totals
if not biometric_df.empty:
    biometric_df['Total_Bio_Updates'] = biometric_df['bio_age_5_17'] + biometric_df['bio_age_17_']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  enrolment_df['Total_Enrolments'] = enrolment_df['age_0_5'] + enrolment_df['age_5_17'] + enrolment_df['age_18_greater']
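
The three totals follow the same pattern, so they could share one helper. The sketch below is a hypothetical refactor (`add_total` is not part of the notebook) that also fails early if an expected column is missing. The two toy rows reuse the age counts from the first two rows of the `head()` output.

```python
import pandas as pd


def add_total(df, columns, total_name):
    """Return a copy of df with a row-wise sum of `columns` as `total_name`."""
    columns = list(columns)
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise KeyError(f"Missing columns: {missing}")
    out = df.copy()
    # Note: sum(axis=1) skips NaN, whereas chained `+` would propagate it.
    out[total_name] = out[columns].sum(axis=1)
    return out


toy_enrolment = pd.DataFrame({
    "age_0_5": [11, 14],
    "age_5_17": [61, 33],
    "age_18_greater": [37, 39],
})
toy_enrolment = add_total(
    toy_enrolment, ["age_0_5", "age_5_17", "age_18_greater"], "Total_Enrolments"
)
```

Since `info()` reports no nulls in the enrolment age columns, the NaN handling difference should not change the enrolment result. The demographic and biometric frames were not inspected here, so that is an assumption for them.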

## 5. Completeness Check
Checking the date range and the number of distinct dates in each dataset as a first look at temporal coverage. State coverage is not checked in this cell.

In [7]:
def check_temporal_continuity(df, name):
    """Print the date range and the number of distinct dates for one dataset."""
    if df.empty:
        return
    dates = df['date'].unique()
    print(f"{name}: Date Range {dates.min()} to {dates.max()}")
    print(f"{name}: Total Unique Days: {len(dates)}")

check_temporal_continuity(enrolment_df, 'Enrolment')
check_temporal_continuity(demographic_df, 'Demographic')
check_temporal_continuity(biometric_df, 'Biometric')

Enrolment: Date Range 2025-03-02 00:00:00 to 2025-12-31 00:00:00
Enrolment: Total Unique Days: 92
Demographic: Date Range 2025-03-01 00:00:00 to 2025-12-29 00:00:00
Demographic: Total Unique Days: 95
Biometric: Date Range 2025-03-01 00:00:00 to 2025-12-29 00:00:00
Biometric: Total Unique Days: 89
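
The output above reports the range and the count of distinct dates, but not which dates are absent. The enrolment range of 2025-03-02 to 2025-12-31 spans 305 calendar days, yet only 92 distinct dates appear. Whether that reflects missing files or a non-daily reporting cadence cannot be told from this output. The sketch below shows one way to list the absent dates and absent states. Both helper names and the expected-state list are hypothetical, and the toy rows are invented.

```python
import pandas as pd


def find_missing_dates(df, date_col="date"):
    """Return calendar days between the first and last date with no rows."""
    observed = pd.DatetimeIndex(pd.to_datetime(df[date_col]).unique()).normalize()
    expected = pd.date_range(observed.min(), observed.max(), freq="D")
    return expected.difference(observed)


def find_missing_states(df, expected_states, state_col="state"):
    """Return expected state names that never occur in df."""
    return sorted(set(expected_states) - set(df[state_col].unique()))


# Toy frame, not real data.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2025-03-02", "2025-03-03", "2025-03-06"]),
    "state": ["Karnataka", "Meghalaya", "Karnataka"],
})
gaps = find_missing_dates(toy)
absent = find_missing_states(toy, ["Karnataka", "Meghalaya", "Kerala"])
```

State matching here is exact, so spelling variants of the same state would be reported as missing unless names are normalized first.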


## 6. Export Cleaned Data
Saving the three cleaned datasets as separate CSV files for the next stage (EDA & Modeling).

In [8]:
import os

# Create the 'processed_data' output directory if it does not already exist
OUTPUT_DIR = 'd:/UIDAI/processed_data'
os.makedirs(OUTPUT_DIR, exist_ok=True)

# index=False keeps the non-contiguous row index out of the exported files
enrolment_df.to_csv(f'{OUTPUT_DIR}/enrolment_clean.csv', index=False)
demographic_df.to_csv(f'{OUTPUT_DIR}/demographic_clean.csv', index=False)
biometric_df.to_csv(f'{OUTPUT_DIR}/biometric_clean.csv', index=False)

print(f"Cleaned datasets saved to {OUTPUT_DIR}/")

Cleaned datasets saved to d:/UIDAI/processed_data/
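
CSV does not store dtypes, so the next stage has to restore them when reading these files back. The sketch below round-trips a one-row toy frame through an in-memory buffer to show the difference. The column names follow this notebook, but the row itself is made up.

```python
import io

import pandas as pd

# One-row toy frame mirroring the cleaned schema.
cleaned = pd.DataFrame({
    "date": pd.to_datetime(["2025-03-02"]),
    "pincode": ["560043"],
    "Total_Enrolments": [109],
})

buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# Default read: dates come back as plain strings, pincode as an integer.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Typed read: restore the datetime column and keep pincode as text.
buffer.seek(0)
typed = pd.read_csv(buffer, parse_dates=["date"], dtype={"pincode": str})
```

If CSV is not a hard requirement, a typed format such as Parquet would preserve these dtypes without extra arguments, though it needs an additional engine (for example `pyarrow`) installed.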
