# Data Sources & Lifecycle Integration

### Datasets Used
- **Aadhaar Enrolment Dataset** – captures initial Aadhaar enrolment activity across geographic units.
- **Aadhaar Demographic Update Dataset** – records updates to demographic details such as name, address, and date of birth.
- **Aadhaar Biometric Update Dataset** – reflects biometric update activity linked to Aadhaar records.

### Why These Datasets Are Merged
Each dataset represents a different stage of the Aadhaar lifecycle.  
By integrating them, we construct a **single lifecycle view** that enables geographic analysis of enrolment continuity and update coverage.

### Outcome of Integration
- Unified lifecycle indicators at the **PIN-code level**
- Aggregation to **District** and **State / Union Territory** levels
- Foundation for identifying areas with incomplete or inactive Aadhaar lifecycle patterns

In [95]:
import pandas as pd
import numpy as np
import glob

enrolment_files = glob.glob("InputEnrolled/api_data_aadhar_enrolment_*.csv")

enrolment_df = pd.concat(
    [pd.read_csv(f) for f in enrolment_files],
    ignore_index=True
)

print("Files loaded:", len(enrolment_files))
print("Shape:", enrolment_df.shape)

enrolment_df.head()

Files loaded: 3
Shape: (1006029, 7)


Unnamed: 0,date,state,district,pincode,age_0_5,age_5_17,age_18_greater
0,31-12-2025,Karnataka,Bidar,585330,2,3,0
1,31-12-2025,Karnataka,Bidar,585402,6,0,0
2,31-12-2025,Karnataka,Bidar,585413,1,0,0
3,31-12-2025,Karnataka,Bidar,585418,1,2,0
4,31-12-2025,Karnataka,Bidar,585421,4,3,0
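
The three loading cells share one pattern: glob a folder, read each CSV, and concatenate. A minimal sketch of that pattern as a reusable function is given below. `load_csv_folder` is a hypothetical helper name, not part of the notebook, and the demonstration files hold made-up rows. It sorts the file list for a reproducible row order and fails early when the pattern matches nothing, since `pd.concat` raises a less descriptive error on an empty list.

```python
import glob
import os
import tempfile

import pandas as pd


def load_csv_folder(pattern):
    """Hypothetical helper: read every CSV matching `pattern` into one DataFrame."""
    files = sorted(glob.glob(pattern))  # sorted so the row order is reproducible
    if not files:
        raise FileNotFoundError(f"No files match pattern: {pattern}")
    return pd.concat([pd.read_csv(f) for f in files], ignore_index=True)


# Demonstration with two throwaway files containing made-up rows.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(2):
        pd.DataFrame({"pincode": [585330 + i], "age_5_17": [i]}).to_csv(
            os.path.join(tmp, f"part_{i}.csv"), index=False
        )
    demo_df = load_csv_folder(os.path.join(tmp, "part_*.csv"))

print("Shape:", demo_df.shape)
```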


In [96]:
demographic_files = glob.glob("InputDemographic/api_data_aadhar_demographic_*.csv")

demographic_df = pd.concat(
    [pd.read_csv(f) for f in demographic_files],
    ignore_index=True
)

print("Files loaded:", len(demographic_files))
print("Shape:", demographic_df.shape)

demographic_df.head()

Files loaded: 5
Shape: (2071700, 6)


Unnamed: 0,date,state,district,pincode,demo_age_5_17,demo_age_17_
0,18-11-2025,Karnataka,Hasan,573118,0,2
1,18-11-2025,Karnataka,Hasan,573124,1,3
2,18-11-2025,Karnataka,Hasan,573150,0,2
3,18-11-2025,Karnataka,Hassan,573113,0,1
4,18-11-2025,Karnataka,Hassan,573120,1,4


In [97]:
biometric_files = glob.glob("InputBiometricupdate/api_data_aadhar_biometric_*.csv")

biometric_df = pd.concat(
    [pd.read_csv(f) for f in biometric_files],
    ignore_index=True
)

print("Files loaded:", len(biometric_files))
print("Shape:", biometric_df.shape)

biometric_df.head()

Files loaded: 4
Shape: (1861108, 6)


Unnamed: 0,date,state,district,pincode,bio_age_5_17,bio_age_17_
0,19-09-2025,Goa,North Goa,403502,0,4
1,19-09-2025,Goa,North Goa,403508,1,4
2,19-09-2025,Goa,North Goa,403513,2,0
3,19-09-2025,Goa,North Goa,403527,2,2
4,19-09-2025,Goa,South Goa,403601,7,3


In [98]:
# keep only rows where state and district are NOT numeric
enrolment_df_copy = (enrolment_df[
    ~pd.to_numeric(enrolment_df["state"], errors="coerce").notna() &
    ~pd.to_numeric(enrolment_df["district"], errors="coerce").notna()
]).copy()

enrolment_df_copy.head()

Unnamed: 0,date,state,district,pincode,age_0_5,age_5_17,age_18_greater
0,31-12-2025,Karnataka,Bidar,585330,2,3,0
1,31-12-2025,Karnataka,Bidar,585402,6,0,0
2,31-12-2025,Karnataka,Bidar,585413,1,0,0
3,31-12-2025,Karnataka,Bidar,585418,1,2,0
4,31-12-2025,Karnataka,Bidar,585421,4,3,0
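
The filter above works because `pd.to_numeric(..., errors="coerce")` turns any value that is not a number into `NaN`. Comparing the shapes before and after (1,006,029 rows versus 1,006,007) shows that it drops 22 enrolment rows. The sketch below reproduces the idea on three made-up rows; the values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy rows (made up): one valid row, one with a numeric state, one with a numeric district.
toy = pd.DataFrame({
    "state": ["Karnataka", "100000", "Goa"],
    "district": ["Bidar", "Bidar", "403001"],
    "pincode": [585330, 585402, 403001],
})

# errors="coerce" yields NaN for text, so isna() marks values that are genuine names.
state_is_text = pd.to_numeric(toy["state"], errors="coerce").isna()
district_is_text = pd.to_numeric(toy["district"], errors="coerce").isna()

clean_toy = toy[state_is_text & district_is_text].copy()
dropped = len(toy) - len(clean_toy)

print(clean_toy)
print("Rows dropped:", dropped)
```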


In [99]:
enrolment_df_copy.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1006007 entries, 0 to 1006028
Data columns (total 7 columns):
 #   Column          Non-Null Count    Dtype 
---  ------          --------------    ----- 
 0   date            1006007 non-null  object
 1   state           1006007 non-null  object
 2   district        1006007 non-null  object
 3   pincode         1006007 non-null  int64 
 4   age_0_5         1006007 non-null  int64 
 5   age_5_17        1006007 non-null  int64 
 6   age_18_greater  1006007 non-null  int64 
dtypes: int64(4), object(3)
memory usage: 61.4+ MB


✅ TABLE 1 — Enrolment presence table

Columns should be:

state | district | pincode | has_enrolment_5_17

In [100]:
groupcolumns = ["state","district","pincode"]
Enrolment_presence_table = (enrolment_df_copy
                            .groupby(groupcolumns)
                            .agg(has_enrolment_5_17=("age_5_17",lambda x: (x>0).any()))
                            .reset_index())

In [101]:
Enrolment_presence_table.head()

Unnamed: 0,state,district,pincode,has_enrolment_5_17
0,Andaman & Nicobar Islands,Andamans,744101,True
1,Andaman & Nicobar Islands,Andamans,744103,True
2,Andaman & Nicobar Islands,Andamans,744105,False
3,Andaman & Nicobar Islands,Andamans,744106,True
4,Andaman & Nicobar Islands,Andamans,744107,True
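
The presence flag answers one question per PIN code: did any row record a positive count for ages 5 to 17? The sketch below shows an equivalent formulation that computes the boolean first and then uses the built-in `any` aggregation instead of a Python lambda, which is usually faster on large frames. The toy data is made up for illustration.

```python
import pandas as pd

# Made-up rows: PIN 111111 appears twice, once with a positive count.
toy = pd.DataFrame({
    "state": ["A", "A", "A", "B"],
    "district": ["d1", "d1", "d1", "d2"],
    "pincode": [111111, 111111, 222222, 333333],
    "age_5_17": [0, 3, 0, 1],
})

keys = ["state", "district", "pincode"]

presence = (
    toy.assign(has_enrolment_5_17=toy["age_5_17"] > 0)  # row-level flag
       .groupby(keys)["has_enrolment_5_17"]
       .any()                                           # True if any row is True
       .reset_index()
)

print(presence)
```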


✅ TABLE 2 — Demographic update presence table

Columns should be:

state | district | pincode | has_demographic_5_17

In [102]:
# keep only rows where state and district are NOT numeric
demographic_df_copy = (demographic_df[
    ~pd.to_numeric(demographic_df["state"], errors="coerce").notna() &
    ~pd.to_numeric(demographic_df["district"], errors="coerce").notna()
]).copy()

In [103]:
demographic_presence_table = (demographic_df_copy
                              .groupby(groupcolumns)
                              .agg(has_demographic_5_17=("demo_age_5_17",lambda x: (x>0).any()))
                              .reset_index())

In [104]:
demographic_presence_table.head()

Unnamed: 0,state,district,pincode,has_demographic_5_17
0,Andaman & Nicobar Islands,Andamans,744101,True
1,Andaman & Nicobar Islands,Andamans,744102,False
2,Andaman & Nicobar Islands,Andamans,744103,True
3,Andaman & Nicobar Islands,Andamans,744105,True
4,Andaman & Nicobar Islands,Andamans,744106,False


✅ TABLE 3 — Biometric updates presence table

Columns should be:

state | district | pincode | has_biometric_5_17

In [105]:
# keep only rows where state and district are NOT numeric
biometric_df_copy = (biometric_df[
    ~pd.to_numeric(biometric_df["state"], errors="coerce").notna() &
    ~pd.to_numeric(biometric_df["district"], errors="coerce").notna()
]).copy()

In [106]:
biometric_presence_table = (biometric_df_copy
                            .groupby(groupcolumns)
                            .agg(has_biometric_5_17=("bio_age_5_17",lambda x: (x>0).any()))
                            .reset_index())

In [107]:
biometric_presence_table.head()

Unnamed: 0,state,district,pincode,has_biometric_5_17
0,Andaman & Nicobar Islands,Andamans,744101,True
1,Andaman & Nicobar Islands,Andamans,744102,False
2,Andaman & Nicobar Islands,Andamans,744103,True
3,Andaman & Nicobar Islands,Andamans,744105,True
4,Andaman & Nicobar Islands,Andamans,744106,True


In [108]:
pin_presence = Enrolment_presence_table.merge(
    demographic_presence_table,
    on=["state","district","pincode"],
    how="outer")

In [109]:
pin_presence = pin_presence.merge(
    biometric_presence_table,
    on=["state","district","pincode"],
    how="outer")

In [110]:
pin_presence = pin_presence.fillna(False)



In [111]:
pin_presence.head()

Unnamed: 0,state,district,pincode,has_enrolment_5_17,has_demographic_5_17,has_biometric_5_17
0,Andaman & Nicobar Islands,Andamans,744101,True,True,True
1,Andaman & Nicobar Islands,Andamans,744102,False,False,False
2,Andaman & Nicobar Islands,Andamans,744103,True,True,True
3,Andaman & Nicobar Islands,Andamans,744105,False,True,True
4,Andaman & Nicobar Islands,Andamans,744106,True,False,True
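
The outer merges leave `NaN` wherever a PIN code is missing from one of the datasets, and filling those gaps with `False` treats "never appeared in the dataset" the same as "appeared with zero activity". The sketch below performs the same three-way merge on made-up tables and converts the flags through the nullable `boolean` dtype, which is one way to end up with plain `bool` columns. The table contents and the `flag_cols` name are illustrative.

```python
from functools import reduce

import pandas as pd

keys = ["state", "district", "pincode"]

# Made-up presence tables that cover different PIN codes.
enrol = pd.DataFrame({"state": ["A", "A"], "district": ["d1", "d1"],
                      "pincode": [111111, 222222],
                      "has_enrolment_5_17": [True, False]})
demo = pd.DataFrame({"state": ["A", "A"], "district": ["d1", "d1"],
                     "pincode": [111111, 333333],
                     "has_demographic_5_17": [True, True]})
bio = pd.DataFrame({"state": ["A"], "district": ["d1"],
                    "pincode": [333333],
                    "has_biometric_5_17": [False]})

# Chain the outer merges so every PIN code seen in any table is kept.
merged = reduce(lambda left, right: left.merge(right, on=keys, how="outer"),
                [enrol, demo, bio])

# A missing flag means the PIN code was absent from that dataset; treat it as False.
flag_cols = [c for c in merged.columns if c.startswith("has_")]
merged[flag_cols] = merged[flag_cols].astype("boolean").fillna(False).astype(bool)

merged = merged.sort_values(keys).reset_index(drop=True)
print(merged)
```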


### Lifecycle categories

Each lifecycle category describes the combination of Aadhaar activity observed in a PIN code for the 5–17 age group:

- **Complete** – enrolment, demographic update, and biometric update activity are all observed, indicating full lifecycle continuity.
- **No Biometric** – enrolment and demographic updates are present, but biometric updates are missing.
- **Only Enrolment** – enrolment activity is observed without any demographic or biometric updates.
- **Only Demographic** – demographic update activity is observed without enrolment or biometric update activity.
- **No Activity** – no enrolment or update activity is observed for the PIN code.
- **Unclear** – every remaining combination. Given the conditions in the next cell, these are PIN codes with biometric update activity but without both enrolment and demographic update activity.
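
Three boolean flags give eight possible combinations, and only five of them are named. A small sketch, assuming the same condition order as the notebook's `np.select` call, enumerates all eight to show which ones fall through to "Unclear". The names `combos`, `e`, `d`, and `b` are illustrative.

```python
from itertools import product

import numpy as np
import pandas as pd

cols = ["has_enrolment_5_17", "has_demographic_5_17", "has_biometric_5_17"]

# Every possible True/False combination of the three presence flags.
combos = pd.DataFrame(list(product([True, False], repeat=3)), columns=cols)
e, d, b = (combos[c] for c in cols)

conditions = [
    e & d & b,      # Complete
    e & d & ~b,     # No Biometric
    e & ~d & ~b,    # Only Enrolment
    ~e & d & ~b,    # Only Demographic
    ~e & ~d & ~b,   # No Activity
]
names = ["Complete", "No Biometric", "Only Enrolment", "Only Demographic", "No Activity"]

combos["lifecycle_state"] = np.select(conditions, names, default="Unclear")
unclear = combos[combos["lifecycle_state"] == "Unclear"]

print(combos)
```

In this enumeration every "Unclear" row has a biometric update flag set while enrolment or demographic activity is missing.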

In [112]:
conditions = [((pin_presence["has_enrolment_5_17"] == True) & (pin_presence["has_demographic_5_17"] == True) & (pin_presence["has_biometric_5_17"] == True)),
              ((pin_presence["has_enrolment_5_17"] == True) & (pin_presence["has_demographic_5_17"] == True) & (pin_presence["has_biometric_5_17"] == False)),
              ((pin_presence["has_enrolment_5_17"] == True) & (pin_presence["has_demographic_5_17"] == False) & (pin_presence["has_biometric_5_17"] == False)),
              ((pin_presence["has_enrolment_5_17"] == False) & (pin_presence["has_demographic_5_17"] == True) & (pin_presence["has_biometric_5_17"] == False)),
              ((pin_presence["has_enrolment_5_17"] == False) & (pin_presence["has_demographic_5_17"] == False) & (pin_presence["has_biometric_5_17"] == False))]

lifecycle_state_names = ['Complete','No Biometric','Only Enrolment','Only Demographic','No Activity']

In [149]:
pin_presence["lifecycle_state"] = np.select(condlist=conditions,choicelist=lifecycle_state_names,default='Unclear')

### PIN-code-level lifecycle states

In [None]:
pin_presence.head()
#pin_presence.to_csv("Output/state_ut_district_pincode_Aadhaar_lifecycle_distribution.csv", index=True)


In [151]:
pin_presence["lifecycle_state"].value_counts()

lifecycle_state
Complete            23844
Unclear              5521
No Activity          2634
Only Enrolment        483
Only Demographic      447
No Biometric           96
Name: count, dtype: int64

In [116]:
District_level_lifecycle = (pin_presence
                            .groupby(["state","district","lifecycle_state"])
                            .agg(pincode_count=("lifecycle_state","size"))
                            .reset_index())

In [117]:
District_level_lifecycle.head()

Unnamed: 0,state,district,lifecycle_state,pincode_count
0,Andaman & Nicobar Islands,Andamans,Complete,3
1,Andaman & Nicobar Islands,Andamans,No Activity,5
2,Andaman & Nicobar Islands,Andamans,Unclear,4
3,Andaman & Nicobar Islands,Nicobars,Unclear,1
4,Andaman & Nicobar Islands,South Andaman,No Activity,2


### Pivoting lifecycle states into columns

In [118]:
District_level_lifecycle_sorted = (
    District_level_lifecycle
    .pivot_table(
        index=["state","district"],
        columns="lifecycle_state",
        values="pincode_count",
        fill_value=0
    )
    .reset_index()
)

In [119]:
District_level_lifecycle_sorted.columns.name = None
District_level_lifecycle_sorted.head(20)

Unnamed: 0,state,district,Complete,No Activity,No Biometric,Only Demographic,Only Enrolment,Unclear
0,Andaman & Nicobar Islands,Andamans,3.0,5.0,0.0,0.0,0.0,4.0
1,Andaman & Nicobar Islands,Nicobars,0.0,0.0,0.0,0.0,0.0,1.0
2,Andaman & Nicobar Islands,South Andaman,0.0,2.0,0.0,0.0,0.0,4.0
3,Andaman and Nicobar Islands,Nicobar,1.0,0.0,0.0,0.0,0.0,3.0
4,Andaman and Nicobar Islands,North And Middle Andaman,2.0,1.0,0.0,0.0,0.0,5.0
5,Andaman and Nicobar Islands,South Andaman,7.0,1.0,0.0,0.0,0.0,4.0
6,Andhra Pradesh,Adilabad,39.0,0.0,0.0,0.0,0.0,9.0
7,Andhra Pradesh,Alluri Sitharama Raju,25.0,4.0,0.0,0.0,0.0,14.0
8,Andhra Pradesh,Anakapalli,9.0,0.0,0.0,0.0,0.0,31.0
9,Andhra Pradesh,Anantapur,76.0,0.0,0.0,0.0,0.0,16.0
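
The counts above appear as floats because `pivot_table` aggregates with a mean by default. As an alternative sketch, `pd.crosstab` counts the PIN-code rows directly and returns integers, skipping the intermediate group-and-size step. The toy data below is made up for illustration.

```python
import pandas as pd

# Made-up PIN-code rows, one per PIN code, already labelled with a lifecycle state.
toy = pd.DataFrame({
    "state": ["A", "A", "A", "A", "B"],
    "district": ["d1", "d1", "d1", "d2", "d3"],
    "lifecycle_state": ["Complete", "Complete", "Unclear", "No Activity", "Complete"],
})

# Rows are (state, district) pairs, columns are lifecycle states, cells are counts.
wide = pd.crosstab([toy["state"], toy["district"]], toy["lifecycle_state"]).reset_index()
wide.columns.name = None

print(wide)
```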


### District-level lifecycle percentages
Percentages show the share of PIN codes in each district that fall into each Aadhaar lifecycle state (for ages 5–17).

In [None]:
district_pct = District_level_lifecycle_sorted.copy()
lifecycle_cols = [
    col for col in district_pct.columns
    if col not in ["state", "district"]
]

district_pct["total_pincodes"] = district_pct[lifecycle_cols].sum(axis=1)

for col in lifecycle_cols:
    district_pct[col] = (district_pct[col] / district_pct["total_pincodes"]) * 100

district_pct[lifecycle_cols] = district_pct[lifecycle_cols].round(2).astype(str) + " %"

In [None]:
district_pct.head(20)
#district_pct.to_csv("Output/state_ut_district_Aadhaar_lifecycle_distribution.csv", index=True)
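
Converting the percentages to strings such as `"63.94 %"` is convenient for display but makes the columns sort alphabetically rather than numerically. The sketch below, on made-up counts, computes the same row-wise shares with `DataFrame.div` and keeps them numeric, so the formatting step can be deferred until export. The names `counts` and `pct` are illustrative.

```python
import pandas as pd

# Made-up district counts per lifecycle state.
counts = pd.DataFrame({
    "state": ["A", "A"],
    "district": ["d1", "d2"],
    "Complete": [3, 1],
    "No Activity": [1, 1],
    "Unclear": [0, 2],
})
lifecycle_cols = ["Complete", "No Activity", "Unclear"]

# Divide each row by its own total, then scale to a percentage.
totals = counts[lifecycle_cols].sum(axis=1)
pct = counts[lifecycle_cols].div(totals, axis=0).mul(100).round(2)

# Numeric shares can be ranked directly, for example by share of complete PIN codes.
ranked = pd.concat([counts[["state", "district"]], pct], axis=1).sort_values(
    "Complete", ascending=False
)
print(ranked)
```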


In [122]:
def clean_state_name(s):
    if pd.isna(s):
        return s
    
    s = s.strip().lower()
    
    replacements = {
        "andaman and nicobar islands": "Andaman & Nicobar Islands",
        "andaman & nicobar islands": "Andaman & Nicobar Islands",
        
        "dadra and nagar haveli": "Dadra & Nagar Haveli and Daman & Diu",
        "daman and diu": "Dadra & Nagar Haveli and Daman & Diu",
        "dadra and nagar haveli and daman and diu": "Dadra & Nagar Haveli and Daman & Diu",
        
        "jammu and kashmir": "Jammu & Kashmir",
        "jammu & kashmir": "Jammu & Kashmir",
        
        # Input is stripped above, so no trailing-space variant is needed.
        "west bengal": "West Bengal",
        "west bengli": "West Bengal",
        "west bangal": "West Bengal",
        "westbengal": "West Bengal",
        
        "odisha": "Odisha",
        "orissa": "Odisha",
        
        "tamilnadu": "Tamil Nadu",
        "andhra pradesh": "Andhra Pradesh",
        
        "chhatisgarh": "Chhattisgarh",
        "chhattisgarh": "Chhattisgarh",
    }
    
    return replacements.get(s, s.title())


In [123]:
District_level_lifecycle_sorted["state_clean"] = (
    District_level_lifecycle_sorted["state"].apply(clean_state_name)
)

State-level percentages represent the distribution of PIN codes across Aadhaar lifecycle states.

In [124]:
state_counts = (
    District_level_lifecycle_sorted
        .groupby("state_clean", as_index=False)
        .sum(numeric_only=True)
)

In [125]:
state_counts.head(10)

Unnamed: 0,state_clean,Complete,No Activity,No Biometric,Only Demographic,Only Enrolment,Unclear
0,Andaman & Nicobar Islands,13.0,9.0,0.0,0.0,0.0,21.0
1,Andhra Pradesh,2053.0,160.0,6.0,23.0,22.0,947.0
2,Arunachal Pradesh,65.0,2.0,0.0,0.0,0.0,11.0
3,Assam,715.0,103.0,13.0,14.0,19.0,98.0
4,Balanagar,0.0,1.0,0.0,0.0,0.0,0.0
5,Bihar,1052.0,133.0,9.0,14.0,65.0,113.0
6,Chandigarh,16.0,1.0,0.0,0.0,0.0,13.0
7,Chhattisgarh,382.0,95.0,2.0,19.0,7.0,131.0
8,Dadra & Nagar Haveli,1.0,2.0,0.0,0.0,0.0,1.0
9,Dadra & Nagar Haveli and Daman & Diu,16.0,1.0,0.0,0.0,0.0,3.0


In [142]:
lifecycle_cols = [
    col for col in state_counts.columns
    if col != "state_clean"
]

state_pct = state_counts.copy()

state_pct["total_pincodes"] = state_pct[lifecycle_cols].sum(axis=1)

for col in lifecycle_cols:
    state_pct[col] = (state_pct[col] / state_pct["total_pincodes"]) * 100

state_pct[lifecycle_cols] = state_pct[lifecycle_cols].round(2).astype(str) + " %"

state_pct = state_pct.drop(columns=["total_pincodes"])

state_pct.head()

Unnamed: 0,state_clean,Complete,No Activity,No Biometric,Only Demographic,Only Enrolment,Unclear
0,Andaman & Nicobar Islands,30.23 %,20.93 %,0.0 %,0.0 %,0.0 %,48.84 %
1,Andhra Pradesh,63.94 %,4.98 %,0.19 %,0.72 %,0.69 %,29.49 %
2,Arunachal Pradesh,83.33 %,2.56 %,0.0 %,0.0 %,0.0 %,14.1 %
3,Assam,74.32 %,10.71 %,1.35 %,1.46 %,1.98 %,10.19 %
4,Bihar,75.9 %,9.6 %,0.65 %,1.01 %,4.69 %,8.15 %


State and Union Territory names were standardized by fixing casing, spacing, and known naming variants, and by removing district or city names that appeared incorrectly in the state field. This ensured one consistent record per administrative unit.

In [130]:
STATE_MERGE_MAP = {
    "West bengal": "West Bengal",
    "West Bengal": "West Bengal",

    "Pondicherry": "Puducherry",
    "Puducherry": "Puducherry",

    "Dadra & Nagar Haveli": "Dadra & Nagar Haveli and Daman & Diu",
    "Daman & Diu": "Dadra & Nagar Haveli and Daman & Diu",
    "Dadra & Nagar Haveli and Daman & Diu": "Dadra & Nagar Haveli and Daman & Diu",
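    "The Dadra And Nagar Haveli And Daman And Diu": "Dadra & Nagar Haveli and Daman & Diu",

    # Uttaranchal is the former name of Uttarakhand, so merge it rather than drop it.
    "Uttaranchal": "Uttarakhand"
}

# District or city names that appeared incorrectly in the state field.
REMOVE_STATES = [
    "Balanagar",
    "Darbhanga",
    "Jaipur",
    "Madanapalle",
    "Nagpur",
    "Puttenahalli",
    "Raja Annamalai Puram"
]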

In [134]:
state_counts["state_clean"] = (
    state_counts["state_clean"]
        .astype(str)
        .str.strip()                 # remove leading/trailing spaces
        .str.replace(r"\s+", " ", regex=True)  # collapse double spaces
)

In [135]:
state_counts["state_clean"] = state_counts["state_clean"].replace(STATE_MERGE_MAP)

state_counts = (
    state_counts
        .groupby("state_clean", as_index=False)
        .sum(numeric_only=True)
)

state_counts = state_counts[~state_counts["state_clean"].isin(REMOVE_STATES)]
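
A hand-maintained removal list only catches the stray names someone has already noticed. As a rough sketch, the cleaned names can also be compared against an allow-list to surface anything unexpected. `KNOWN_STATES` below is a hypothetical and deliberately partial set for illustration; a real check would list every State and Union Territory.

```python
import pandas as pd

# Illustrative, deliberately partial allow-list (an assumption, not from the notebook).
KNOWN_STATES = {"Bihar", "Goa", "Karnataka", "West Bengal"}

# Made-up state-level counts containing two stray city names.
toy_counts = pd.DataFrame({
    "state_clean": ["Bihar", "Balanagar", "West Bengal", "Jaipur"],
    "Complete": [10, 0, 7, 1],
})

# Anything not in the allow-list needs a merge-map entry or removal.
unexpected = sorted(set(toy_counts["state_clean"]) - KNOWN_STATES)
print("Names to review:", unexpected)
```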

In [136]:
lifecycle_cols = [
    col for col in state_counts.columns
    if col != "state_clean"
]

state_pct = state_counts.copy()

state_pct["total_pincodes"] = state_pct[lifecycle_cols].sum(axis=1)

for col in lifecycle_cols:
    state_pct[col] = (state_pct[col] / state_pct["total_pincodes"]) * 100

state_pct[lifecycle_cols] = state_pct[lifecycle_cols].round(2).astype(str) + " %"

state_pct = state_pct.drop(columns=["total_pincodes"]).reset_index(drop=True)

### State & Union Territory Aadhaar Lifecycle Distribution (Age 5–17)
This table shows the percentage distribution of Aadhaar lifecycle states across PIN codes within each State and Union Territory, based on enrolment, demographic updates, and biometric updates for the 5–17 age group.

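The Output folder contains three CSV files:
 1. PIN-code-level lifecycle states by State/UT and district (`state_ut_district_pincode_Aadhaar_lifecycle_distribution.csv`)
 2. District-level lifecycle distribution by State/UT (`state_ut_district_Aadhaar_lifecycle_distribution.csv`)
 3. State/UT-level lifecycle distribution (`state_ut_Aadhaar_lifecycle_distribution.csv`)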

In [None]:
state_pct.head(37)
#state_pct.to_csv("Output/state_ut_Aadhaar_lifecycle_distribution.csv", index=True)
