<a href="https://colab.research.google.com/github/NEHALSMITTAL/aadhaar-enrolment-analysis/blob/main/Aadhar_Enrolement_Anlysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Problem Statement**
To analyse Aadhaar enrolment patterns during 2025 across states and age groups, identify regional and early-age enrolment gaps, and provide data-driven insights to support targeted planning and policy interventions by UIDAI.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### **Dataset Description**
The analysis uses anonymised Aadhaar enrolment data for 2025, containing state, district, PIN code, date, and age-group-wise enrolment counts (0-5, 5-17, and 18+).


The dataset contains district- and PIN-level granularity, enabling further micro-level analysis if required.

In [None]:
df1 = pd.read_csv("/content/api_data_aadhar_enrolment_0_500000.csv")
df3 = pd.read_csv("/content/api_data_aadhar_enrolment_1000000_1006029.csv")
df2 = pd.read_csv("/content/api_data_aadhar_enrolment_500000_1000000.csv")

df = pd.concat([df1, df2, df3] , ignore_index= True)

### Data Cleaning and Preparation
- Standardised state and UT names to resolve spelling and casing inconsistencies
- Aligned records with post-2019 administrative boundaries
- Removed invalid or malformed records
- Derived total enrolment and age-group share metrics
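
The cleaning steps listed above can be sketched end to end on a small toy frame. The records, the shortened `known_states` list, and the helper name `standardise` are assumptions made for illustration; the cells that follow apply the same ideas to the full dataset.

```python
import pandas as pd
from difflib import get_close_matches

# Toy records standing in for the real enrolment file (values are made up).
toy = pd.DataFrame({
    "state": [" West Bengal", "west bengal ", "WEST BANGAL", "100000", "Kerala"],
    "age_0_5": [10, 5, 3, 1, 8],
    "age_5_17": [4, 2, 1, 1, 2],
    "age_18_greater": [1, 0, 0, 0, 0],
})

known_states = ["west bengal", "kerala"]  # shortened reference list for the demo


def standardise(name: str) -> str:
    """Return the closest known state name, or the input if nothing is close."""
    match = get_close_matches(name, known_states, n=1, cutoff=0.8)
    return match[0] if match else name


# 1. Normalise casing and whitespace, then snap near-misses to the reference list.
toy["state"] = toy["state"].str.strip().str.lower().map(standardise)

# 2. Drop rows whose state field contains no letters (e.g. a stray numeric code).
toy = toy[toy["state"].str.contains("[a-z]", regex=True)].copy()

# 3. Derive total enrolment per row, aggregate by state, and compute the 0-5 share.
age_cols = ["age_0_5", "age_5_17", "age_18_greater"]
toy["total_enrolment"] = toy[age_cols].sum(axis=1)
summary = toy.groupby("state")[age_cols + ["total_enrolment"]].sum()
summary["child_share_0_5"] = summary["age_0_5"] / summary["total_enrolment"] * 100
print(summary)
```

Fuzzy matching with a 0.8 cutoff is a convenience, not a guarantee: it can leave unusual spellings untouched, which is why an explicit correction dictionary is still useful afterwards.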


In [None]:
df.shape

In [None]:
df.isna().sum()

In [None]:
df.describe()

In [None]:
df.dtypes

In [None]:
df['date'] = pd.to_datetime(df['date'] , errors = 'coerce')
df['year'] = df['date'].dt.year

In [None]:
df['date'].isna().sum()

In [None]:
df_clean = df.dropna(subset=['date', 'state', 'age_0_5', 'age_5_17', 'age_18_greater']).copy()

In [None]:
df_clean.shape  # rows remaining after dropping incomplete records

In [None]:
df_clean['year'] = df_clean['date'].dt.year
df_clean['total_enrolment'] = (df_clean['age_0_5'] + df_clean['age_5_17'] + df_clean['age_18_greater'] )

In [None]:
yearly = (df_clean.groupby('year')['total_enrolment'].sum().reset_index().sort_values('year'))

In [None]:
print(df_clean['date'].nunique())  # only the last expression in a cell is displayed, so print this one
df_clean['date'].min(), df_clean['date'].max()

In [None]:
df_clean['state'] = (df_clean['state'].astype(str).str.strip().str.lower())

In [None]:
official_states = [
    'andhra pradesh', 'arunachal pradesh', 'assam', 'bihar',
    'chhattisgarh', 'goa', 'gujarat', 'haryana',
    'himachal pradesh', 'jharkhand', 'karnataka', 'kerala',
    'madhya pradesh', 'maharashtra', 'manipur', 'meghalaya',
    'mizoram', 'nagaland', 'odisha', 'punjab', 'rajasthan',
    'sikkim', 'tamil nadu', 'telangana', 'tripura',
    'uttar pradesh', 'uttarakhand', 'west bengal',
    'andaman and nicobar islands', 'chandigarh',
    'dadra and nagar haveli', 'daman and diu',
    'delhi', 'jammu and kashmir', 'ladakh',
    'lakshadweep', 'puducherry'
]


In [None]:
from difflib import get_close_matches
def standardize_state(state):
  match = get_close_matches(state, official_states, n=1, cutoff=0.8)
  return match[0] if match else state

df_clean['state'] = df_clean['state'].apply(standardize_state)



In [None]:
df_clean['state'] = df_clean['state'].str.title()

In [None]:
df_clean['state'].value_counts().head(20)

In [None]:
df_clean['state'].value_counts().tail(20)

In [None]:
df_clean['state'].value_counts()


In [None]:
df_clean['state'] = (df_clean['state'].astype(str).str.lower().str.strip())


In [None]:
state_corrections = {
    # West Bengal variants
    'west bangal': 'west bengal',
    'westbengal': 'west bengal',

    # Puducherry
    'pondicherry': 'puducherry',

    # Dadra & Nagar Haveli + Daman & Diu (MERGED UT)
    'daman and diu': 'dadra and nagar haveli and daman and diu',
    'dadra and nagar haveli': 'dadra and nagar haveli and daman and diu',
    'the dadra and nagar haveli and daman and diu': 'dadra and nagar haveli and daman and diu',

    # Odisha
    'orissa': 'odisha'
}


In [None]:
df_clean['state'] = df_clean['state'].replace(state_corrections)

In [None]:
df_clean['state'] = df_clean['state'].str.title()

In [None]:
df_clean['state'].value_counts()

In [None]:
df_clean[df_clean['state'].str.isnumeric()]


In [None]:
df_clean[~df_clean['state'].str.contains('[a-zA-Z]', regex=True)]


In [None]:
df_clean = df_clean[df_clean['state'].str.contains('[a-zA-Z]', regex=True)].copy()

In [None]:
df_clean['state'].value_counts()

In [None]:
df_clean['total_enrolment'] = (df_clean['age_0_5'] + df_clean['age_5_17'] + df_clean['age_18_greater'] )
statewise = (df_clean.groupby('state')['total_enrolment'].sum().reset_index().sort_values('total_enrolment', ascending = False))

# STATE-WISE ANALYSIS

In [None]:
statewise.head(10)

In [None]:
plt.figure(figsize =(12,6))
sns.barplot(data = statewise.head(10),
            x= 'total_enrolment', y= 'state')

plt.title('Top 10 States/UTs by Aadhaar Enrolment (2025)')
plt.xlabel('Total Enrolments')
plt.ylabel('State/UT')
plt.tight_layout()
plt.show()


# **Top 10 States by Aadhaar Enrolment**

**KEY INSIGHTS:**

*   Uttar Pradesh contributes the highest Aadhaar enrolment volume among all states
*   High-population states such as Bihar, Maharashtra, Madhya Pradesh, and Gujarat dominate total enrolments
*   Aadhaar enrolment volume is strongly influenced by population size and service demand
*   High enrolment volume does not necessarily indicate full coverage or saturation

In [None]:
statewise.tail(10)

In [None]:
plt.figure(figsize =(12,6))
sns.barplot(data = statewise.tail(10),
            x= 'total_enrolment', y= 'state')

plt.title('Bottom 10 States/UTs by Aadhaar Enrolment (2025)')
plt.xlabel('Total Enrolments')
plt.ylabel('State/UT')
plt.tight_layout()
plt.show()


# **Bottom 10 States / UTs by Aadhaar Enrolment**

**KEY INSIGHTS:**



*   Smaller states and Union Territories show significantly lower enrolment volumes
*   UTs such as Lakshadweep, Ladakh, and Andaman & Nicobar Islands have limited enrolment activity
*   Lower volumes reflect population size, geographic isolation, and accessibility challenges
*   These regions may require customised and region-specific enrolment strategies




In [None]:
total_national = statewise['total_enrolment'].sum()
statewise['percentage'] = (statewise['total_enrolment']/total_national*100)

statewise[['state' , 'percentage']].head()

In [None]:
top5 = statewise.head(5).copy()
others_sum = statewise.iloc[5:]['total_enrolment'].sum()

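# Append a single "All Other States" row so the pie chart shows the top 5 against the rest
top5 = pd.concat([top5, pd.DataFrame({'state': ['All Other States'], 'total_enrolment': [others_sum]})], ignore_index=True)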

In [None]:
plt.figure(figsize=(7,7))
plt.pie(
    top5['total_enrolment'],
    labels = top5['state'],
    autopct = '%1.1f%%',
    startangle = 140)

plt.title('Share of Aadhaar Enrolment: Top 5 States vs Rest (2025)')
plt.tight_layout()
plt.show()

# **Share of Aadhaar Enrolment (Top 5 States vs Rest of India)**

KEY INSIGHTS:



*   Aadhaar enrolment is highly concentrated in a small number of large states
*   Top 5 states account for a substantial share of total enrolments
*   Remaining states and UTs collectively contribute a smaller portion
*   Highlights the need for differentiated strategies for high-volume and low-volume regions
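
To put a number on "highly concentrated", one option is a cumulative share over the ranked states. The sketch below uses a hypothetical frame named `demo` with made-up totals; in this notebook the same steps could be applied to `statewise`.

```python
import pandas as pd

# Hypothetical state totals for illustration only; not taken from the dataset.
demo = pd.DataFrame({
    "state": ["A", "B", "C", "D", "E", "F", "G"],
    "total_enrolment": [400, 250, 150, 100, 50, 30, 20],
})

# Rank states by volume, then compute each state's share and the running total.
demo = demo.sort_values("total_enrolment", ascending=False).reset_index(drop=True)
demo["share_pct"] = demo["total_enrolment"] / demo["total_enrolment"].sum() * 100
demo["cumulative_share_pct"] = demo["share_pct"].cumsum()

# After reset_index, row 4 is the fifth-ranked state, so this is the top-5 share.
top5_share = demo.loc[4, "cumulative_share_pct"]
print(f"Top 5 share: {top5_share:.1f}%")
```

Reporting the cumulative share alongside the pie chart makes the concentration claim checkable at a glance.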



# AGE-GROUP ANALYSIS

In [None]:
age_state = (df_clean.groupby('state')[['age_0_5', 'age_5_17', 'age_18_greater']].sum().reset_index())

In [None]:
age_state['total_enrolment'] = (
    age_state['age_0_5'] +
    age_state['age_5_17'] +
    age_state['age_18_greater'])

age_state['child_share_0_5'] = (
    age_state['age_0_5'] / age_state['total_enrolment'] * 100)


In [None]:
top10_states = statewise.head(10)['state']

age_top10 = age_state[age_state['state'].isin(top10_states)]

plt.figure(figsize=(12,6))
sns.barplot(
    data=age_top10.sort_values('child_share_0_5'),
    x='child_share_0_5',
    y='state')
plt.xlabel('Child Enrolment Share (0–5 years) %')
plt.ylabel('State')
plt.title('Child (0–5) Aadhaar Enrolment Share in Top States (2025)')
plt.tight_layout()
plt.show()


# **Child (0–5) Aadhaar Enrolment Share**

KEY INSIGHTS:
*   Significant variation exists in child enrolment share across high-enrolment states
*   Southern and western states demonstrate stronger early-age Aadhaar coverage
*   Some high-population states show relatively lower child enrolment share
*   Meghalaya and certain northeastern states exhibit uneven child enrolment patterns
*   Early-age Aadhaar enrolment remains a key opportunity area for improvement
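
One way to make "relatively lower child enrolment share" concrete is to compare each state's 0–5 share with the national share. The sketch below is an illustration on made-up numbers; `age_state_demo` and `gap_vs_national_pp` are hypothetical names, and the same calculation could be run on `age_state`.

```python
import pandas as pd

# Hypothetical per-state aggregates; not taken from the dataset.
age_state_demo = pd.DataFrame({
    "state": ["A", "B", "C"],
    "age_0_5": [70, 30, 90],
    "total_enrolment": [100, 100, 100],
})

# National 0-5 share, weighted by enrolment volume rather than averaged across states.
national_share = age_state_demo["age_0_5"].sum() / age_state_demo["total_enrolment"].sum() * 100

age_state_demo["child_share_0_5"] = (
    age_state_demo["age_0_5"] / age_state_demo["total_enrolment"] * 100
)

# Gap in percentage points: negative values mark states below the national share.
age_state_demo["gap_vs_national_pp"] = age_state_demo["child_share_0_5"] - national_share

below = age_state_demo.loc[age_state_demo["gap_vs_national_pp"] < 0, "state"].tolist()
print(below)
```

A share below the national figure is only a signal for follow-up; without population data it does not by itself show lower coverage.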



In [None]:
overall_age = df_clean[['age_0_5', 'age_5_17', 'age_18_greater']].sum()

overall_age

In [None]:
overall_total = overall_age.sum()
overall_age_percent = (overall_age / overall_total * 100).round(2)
overall_age_percent

# **Overall Age-Group Distribution**

KEY INSIGHTS:
*   Children aged 0–5 years account for the largest share (≈60%) of Aadhaar enrolments in 2025
*   The 5–17 age group contributes around 36%, indicating strong enrolment during school-age years
*   Adult (18+) enrolment forms a very small share (≈4%), suggesting Aadhaar enrolment is largely driven by early-life registration and updates rather than first-time adult enrolment
*   The distribution indicates a shift towards early-age Aadhaar coverage, which supports long-term inclusion and lifecycle-based service delivery







In [None]:
age_state['child_share_0_5'] = (age_state['age_0_5'] / age_state['total_enrolment'] * 100)


In [None]:
print(age_state.columns)


In [None]:
# Step 1: Take top 10 states from statewise
top10_states = statewise[['state', 'total_enrolment']].head(10)

# Step 2: Merge child share from age_state
summary_top10 = top10_states.merge(
    age_state[['state', 'child_share_0_5']],
    on='state',
    how='left'
)

# Step 3: Rename & round
summary_top10 = summary_top10.rename(columns={
    'total_enrolment': 'Total Enrolment',
    'child_share_0_5': 'Child (0–5) Share %'
})

summary_top10['Child (0–5) Share %'] = summary_top10['Child (0–5) Share %'].round(2)

summary_top10



# **District-Level Analysis (UP)**

In [None]:
up_df = df_clean[df_clean['state']== 'Uttar Pradesh'].copy()

In [None]:
up_district = (up_df.groupby('district')['total_enrolment'].sum().reset_index().sort_values('total_enrolment', ascending = False))

In [None]:
top5_up = up_district.head(25)
bottom5_up = up_district.tail(25)

top5_up, bottom5_up

In [None]:
up_df['district'] = (up_df['district'].astype(str).str.lower().str.strip().str.replace('*', '', regex = False))

In [None]:
district_corrections_up = {
    # Bhadohi / Sant Ravidas Nagar
    'bhadohi': 'sant ravidas nagar',
    'sant ravidas nagar bhadohi': 'sant ravidas nagar',
    'sant ravidas nagar': 'sant ravidas nagar',

    # Kushinagar
    'kushi nagar': 'kushinagar',
    'kushinagar': 'kushinagar',

    # Maharajganj
    'mahrajganj': 'maharajganj',
    'maharajganj': 'maharajganj',

    # Ayodhya
    'faizabad': 'ayodhya',
    'ayodhya': 'ayodhya',

    # Amroha
    'jyotiba phule nagar': 'amroha',
    'amroha': 'amroha'}


In [None]:
up_df['district'] = up_df['district'].replace(district_corrections_up)


In [None]:
up_df['district'] = up_df['district'].str.title()


In [None]:
up_district = (
    up_df
    .groupby('district')['total_enrolment']
    .sum()
    .reset_index()
    .sort_values('total_enrolment', ascending=False))


In [None]:
top5_up = up_district.head(25)
bottom5_up = up_district.tail(25)

top5_up, bottom5_up


In [None]:
up_df['district'] = (up_df['district'].astype(str).str.lower().str.strip().str.replace('*', '', regex=False))


In [None]:
district_corrections_up = {
    # Rae Bareli
    'rae bareli': 'raebareli',
    'raebareli': 'raebareli',

    # Shravasti
    'shravasti': 'shravasti',
    'shrawasti': 'shravasti',

    # Sant Ravidas Nagar / Bhadohi
    'bhadohi': 'sant ravidas nagar',
    'sant ravidas nagar': 'sant ravidas nagar',

    # Kushinagar
    'kushi nagar': 'kushinagar',
    'kushinagar': 'kushinagar',

    # Maharajganj
    'mahrajganj': 'maharajganj',
    'maharajganj': 'maharajganj',

    # Ayodhya
    'faizabad': 'ayodhya',
    'ayodhya': 'ayodhya',

    # Amroha
    'jyotiba phule nagar': 'amroha',
    'amroha': 'amroha'
}


In [None]:
up_df['district'] = up_df['district'].replace(district_corrections_up)


In [None]:
up_df['district'] = up_df['district'].str.title()


In [None]:
up_district = (up_df.groupby('district')['total_enrolment'].sum().reset_index().sort_values('total_enrolment', ascending=False))


In [None]:
up_district[up_district['district'].isin(
    ['Rae Bareli', 'Raebareli', 'Shravasti', 'Shrawasti'])]


In [None]:
plt.figure(figsize=(10,5))
sns.barplot(
    data=up_district.head(5),  # top 5 from the re-aggregated, corrected district names
    x='total_enrolment',
    y='district'
)
plt.title('Top 5 Districts by Aadhaar Enrolment in Uttar Pradesh (2025)')
plt.xlabel('Total Enrolments')
plt.ylabel('District')
plt.tight_layout()
plt.show()

# **District-Level Insights**

KEY INSIGHTS (UP):



*   Aadhaar enrolment within Uttar Pradesh is unevenly distributed across districts
*   A small number of districts contribute a disproportionately high share of total enrolments
*   Several districts show very low enrolment volumes, indicating possible access or awareness gaps
*   District-level variation highlights the importance of targeted operational planning within large states
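
Districts with "very low enrolment volumes" can be flagged with a simple quantile rule. This is a sketch on hypothetical totals; the bottom-quartile threshold is an assumption chosen for illustration, and `up_district` could be substituted for the demo frame.

```python
import pandas as pd

# Hypothetical district totals; not taken from the dataset.
districts = pd.DataFrame({
    "district": ["D1", "D2", "D3", "D4", "D5", "D6", "D7", "D8"],
    "total_enrolment": [900, 700, 300, 200, 150, 100, 40, 10],
})

districts = districts.sort_values("total_enrolment", ascending=False)
total = districts["total_enrolment"].sum()
districts["share_pct"] = districts["total_enrolment"] / total * 100

# Flag districts at or below the 25th percentile of volume (threshold is an assumption).
threshold = districts["total_enrolment"].quantile(0.25)
districts["low_volume"] = districts["total_enrolment"] <= threshold

low = districts.loc[districts["low_volume"], "district"].tolist()
print(low)
```

Low volume may simply reflect a small population, so flagged districts are candidates for review, not confirmed gaps.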



# **Why this matters:**



*   Aadhaar is critical for access to welfare, education, and healthcare
*   Low child enrolment can delay access to benefits like nutrition schemes
*   Regional disparities indicate where targeted interventions are required

*   Early Aadhaar coverage improves long-term service delivery efficiency







# **🔹 Recommendations for UIDAI**



*   Launch targeted child enrolment drives in low-share states
*   Integrate Aadhaar enrolment with birth registration systems
*   Use Anganwadi and school networks for early enrolment
*   Deploy mobile enrolment units in rural and underserved districts
*   Monitor child enrolment as a key KPI alongside total enrolment
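
The KPI suggested in the last recommendation could be tracked as a monthly child share per state. The sketch below is one possible formulation on hypothetical records that reuse the column names of `df_clean`; the monthly granularity is an assumption.

```python
import pandas as pd

# Hypothetical records with the same columns as the cleaned frame.
records = pd.DataFrame({
    "date": pd.to_datetime(["2025-01-05", "2025-01-20", "2025-02-10", "2025-02-15"]),
    "state": ["Kerala", "Kerala", "Kerala", "Kerala"],
    "age_0_5": [6, 4, 3, 3],
    "age_5_17": [3, 5, 3, 1],
    "age_18_greater": [1, 1, 0, 0],
})

age_cols = ["age_0_5", "age_5_17", "age_18_greater"]

# Aggregate counts by state and calendar month, then derive the 0-5 share.
monthly = (
    records
    .assign(month=records["date"].dt.to_period("M"))
    .groupby(["state", "month"])[age_cols]
    .sum()
)
monthly["child_share_0_5"] = monthly["age_0_5"] / monthly[age_cols].sum(axis=1) * 100
print(monthly)
```

Tracking the share next to the raw 0–5 count avoids reading a falling share as a decline when adult enrolment has merely risen.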



# **Conclusion**

This analysis highlights significant regional and age-wise disparities in Aadhaar enrolment. While overall enrolment is high in populous states, child enrolment remains uneven across regions. Targeted, early-age enrolment strategies can help UIDAI achieve more inclusive and future-ready Aadhaar coverage.