# SAMS Student Data Analysis

This notebook processes and analyzes student application and enrollment data across multiple education modules (ITI, Diploma, HSS, DEG) to create unique student identifiers and generate summary statistics.

**Main Steps:**
1. Load data from SQLite database and Parquet files
2. Clean and standardize student records
3. Decrypt and validate roll numbers
4. Generate unique student keys for identity matching
5. Merge applications with enrollment data
6. Analyze Aadhaar coverage and generate summaries

## 1. Setup and Imports

Import core libraries and project helpers used across the notebook.

In [1]:
import sqlite3
import pandas as pd
import seaborn as sns
import json
import numpy as np
import os
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from sams.utils import load_data
from sams.config import datasets
import gc
import polars as pl
import duckdb
from pathlib import Path

2025-11-13 12:03:54.674 | INFO     | sams.config:<module>:15 - PROJ_ROOT path is: C:\Users\Admin\Documents\GitHub\sams
2025-11-13 12:03:54.711 | INFO     | sams.config:<module>:92 - Loaded 0 geocodes from cache


## 2. Data Loading

### 2.1 Connect to SQLite Database
Check connection and list available tables.

In [2]:
# Use the path from datasets metadata 
db_path = datasets["sams"]["path"]
conn = sqlite3.connect(db_path)
cursor = conn.cursor()

cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = cursor.fetchall()
print("Tables:", [t[0] for t in tables])

cursor.close()
conn.close()

Tables: ['students', 'institutes', 'results']


### 2.2 Extract ITI & Diploma Data from SQLite
Define a helper that parses the `mark_data` JSON column and extracts the qualification fields from its first entry.

Note: this method was necessary because the ITI marks and Diploma marks Parquet files do not include the student name. The `mark_data` field is therefore expanded carefully from the SQLite records, since Aadhaar numbers are unreliable for some years.

In [3]:
db_path = datasets["sams"]["path"]

# Load ITI + Diploma
conn = sqlite3.connect(db_path)
query = """
SELECT academic_year, aadhar_no, student_name, dob,
       admission_status, mark_data, module
FROM students
WHERE module IN ('ITI', 'Diploma');
"""
students_df = pd.read_sql_query(query, conn)
conn.close()

# Extract needed keys (first entry only)
keep_keys = ["YearofPassing", "RollNo", "ExaminationType",
             "HighestQualificationExamBoard", "ExamName"]

def extract_mark(row):
    md = row["mark_data"]
    try:
        md = json.loads(md) if isinstance(md, str) else md
        if isinstance(md, dict): md = [md]
    except (ValueError, TypeError):  # malformed JSON or unexpected input type
        md = []
    rec = md[0] if isinstance(md, list) and md else {}
    return {k: rec.get(k) for k in keep_keys}

mark_df = pd.json_normalize(students_df.apply(extract_mark, axis=1))

# Merge + rename
df = pd.concat([students_df.drop(columns=["mark_data"]),
                mark_df.rename(columns={
                    "YearofPassing": "passing_year",
                    "RollNo": "roll_no",
                    "ExaminationType": "exam_type",
                    "HighestQualificationExamBoard": "exam_board",
                    "ExamName": "exam_name"
                })],
               axis=1)

In [4]:
# ITI (all rows)
iti_enrollments = df[df["module"] == "ITI"].reset_index(drop=True)

# Diploma (ONLY exam_name = "10th")
diploma_enrollments = (
    df[(df["module"] == "Diploma") &
       (df["exam_name"].str.lower() == "10th")]
    .reset_index(drop=True)
)

In [5]:
# Convert dob in both enrollment tables from DD-Mon-YYYY to YYYY-MM-DD
iti_enrollments['dob'] = pd.to_datetime(iti_enrollments['dob'], format="%d-%b-%Y", errors="coerce").dt.strftime("%Y-%m-%d")
diploma_enrollments['dob'] = pd.to_datetime(diploma_enrollments['dob'], format="%d-%b-%Y", errors="coerce").dt.strftime("%Y-%m-%d")

In [None]:
iti_marks = load_data(datasets['iti_marks'])
diploma_marks = load_data(datasets['diploma_marks'])

### 2.3 Load Application Data from Parquet Files
Load interim Parquet files for ITI and Diploma applications.

In [6]:
# Load interim application data from Parquet files
DATA_DIR = Path("C:/Users/Admin/Documents/GitHub/sams/data")
RAW_DATA_DIR = DATA_DIR / "interim"
iti_applications = pd.read_parquet(RAW_DATA_DIR / "iti_applications.pq")
diploma_applications = pd.read_parquet(RAW_DATA_DIR / "diploma_applications.pq")

### 2.4 Load Enrollment Data for All Modules
Load enrollment/application datasets used later for keys, merges, and summaries.

In [7]:
hss_enrollments = load_data(datasets["hss_enrollments"])
deg_enrollments = load_data(datasets["deg_enrollments"]) 

2025-11-13 12:06:41.081 | INFO     | sams.utils:load_data:70 - Loading data from C:\Users\Admin\Documents\GitHub\sams\data\interim\hss_enrollments.pq
2025-11-13 12:07:27.369 | INFO     | sams.utils:load_data:70 - Loading data from C:\Users\Admin\Documents\GitHub\sams\data\interim\deg_enrollments.pq


In [8]:
deg_applications = load_data(datasets["deg_applications"])
hss_applications = load_data(datasets["hss_applications"])

2025-11-13 12:08:06.520 | INFO     | sams.utils:load_data:70 - Loading data from C:\Users\Admin\Documents\GitHub\sams\data\interim\deg_applications.pq
2025-11-13 12:08:19.243 | INFO     | sams.utils:load_data:70 - Loading data from C:\Users\Admin\Documents\GitHub\sams\data\interim\hss_applications.pq


### 2.5 Extract CHSE and BSE Results Data

Extract board exam results for CHSE (Higher Secondary) and BSE (Secondary) from the results table.

In [9]:
db_path = datasets["sams"]["path"]
conn = sqlite3.connect(db_path)

query = """
SELECT
    academic_year,
    student_name,
    dob,
    module,
    CASE 
        WHEN module = 'CHSE' THEN 'CHSE, Odisha'
        WHEN module = 'BSE' THEN 'BSE, Odisha'
        ELSE NULL
    END AS exam_board,
    academic_year AS passing_year,    
    roll_no,
    NULL AS roll_no_decrypted,
    exam_type
    
FROM results
WHERE module IN ('CHSE', 'BSE');
"""

df = pd.read_sql_query(query, conn)
conn.close()

# Split into CHSE and BSE datasets
chse_df = df[df["module"] == "CHSE"].reset_index(drop=True)
bse_df  = df[df["module"] == "BSE"].reset_index(drop=True)

## 3. Data Cleaning and Standardization

### 3.1 Standardize Date Formats and Exam Types

Normalize date formats for BSE and standardize exam type terminology for CHSE.

In [10]:
bse_df["dob"] = pd.to_datetime(bse_df["dob"], format="%d-%b-%Y", errors="coerce").dt.strftime("%Y-%m-%d")
chse_df['exam_type'] = chse_df['exam_type'].replace({'REGULAR': 'annual'})
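
As a plain-Python illustration of the two normalizations in this section, the sketch below converts a `DD-Mon-YYYY` date string to `YYYY-MM-DD` and maps `REGULAR` to `annual`. The helper names `normalize_dob` and `normalize_exam_type` are hypothetical and not part of the notebook; returning `None` for unparseable dates is meant to mirror `errors="coerce"`. It assumes English month abbreviations in the default locale.

```python
from datetime import datetime
from typing import Optional


def normalize_dob(value: Optional[str]) -> Optional[str]:
    """Convert 'DD-Mon-YYYY' (e.g. '05-Jan-2004') to 'YYYY-MM-DD'.

    Returns None when the value is missing or does not match the format.
    """
    if not isinstance(value, str):
        return None
    try:
        parsed = datetime.strptime(value.strip(), "%d-%b-%Y")
    except ValueError:
        return None
    return parsed.strftime("%Y-%m-%d")


def normalize_exam_type(value: str) -> str:
    """Map the CHSE label 'REGULAR' to 'annual'; leave other values as-is."""
    return {"REGULAR": "annual"}.get(value, value)


print(normalize_dob("05-Jan-2004"))
print(normalize_dob("2004/01/05"))
print(normalize_exam_type("REGULAR"))
```

Dates that fail to parse become missing values rather than raising, so it may be worth counting them after conversion.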

## 4. Roll Number Decryption and Validation

### 4.1 Define Decryption Function

Decrypt AES-encrypted roll numbers from the database.

In [11]:
# code for decryption
from base64 import b64decode
from Crypto.Cipher import AES

def decrypt_roll(enc_text: str,
                 key: bytes = b"y6idXfCVRG5t2dkeBnmHy9jLu6TEn5Du",
                 enforce_min_length: bool = False,
                 min_length: int = None) -> str:
    try:
        if not enc_text or not isinstance(enc_text, str):
            return "NA"

        raw = b64decode(enc_text)
        cipher = AES.new(key, AES.MODE_ECB)
        decrypted = cipher.decrypt(raw)

        pad_len = decrypted[-1]
        if pad_len < 1 or pad_len > 16:
            return "NA"
        decrypted = decrypted[:-pad_len]

        roll_no = decrypted.decode("utf-8").strip()
        return roll_no
    except Exception:
        return "NA"    
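
The padding handling in `decrypt_roll` reads the last byte as a padding length, which looks like PKCS#7 removal (an inference from the code; the source does not name the scheme). The sketch below isolates that step in a hypothetical helper, `pkcs7_unpad`, and adds a stricter check that every padding byte carries the same value. It uses only the standard library and does not perform any decryption.

```python
def pkcs7_unpad(data: bytes, block_size: int = 16) -> bytes:
    """Strip PKCS#7 padding, raising ValueError if the padding is malformed."""
    if not data or len(data) % block_size != 0:
        raise ValueError("input is empty or not a multiple of the block size")
    pad_len = data[-1]
    if pad_len < 1 or pad_len > block_size:
        raise ValueError("padding length out of range")
    # All padding bytes must equal the padding length
    if data[-pad_len:] != bytes([pad_len]) * pad_len:
        raise ValueError("padding bytes are inconsistent")
    return data[:-pad_len]


# 9 data bytes followed by 7 padding bytes of value 7 fill one 16-byte block
padded = b"123456789" + bytes([7]) * 7
print(pkcs7_unpad(padded))
```

A check like this may reject some ciphertexts decrypted with the wrong key that would otherwise yield garbage strings.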

### 4.2 Decrypt and Validate Roll Numbers

Validate decrypted roll numbers based on board-specific length requirements:
- BSE Odisha: 9 characters
- CHSE Odisha: 8 characters

In [12]:
def process_roll_numbers_len_format(df: pd.DataFrame, roll_col: str = 'roll_no') -> pd.DataFrame:
    """
    Decrypt roll numbers and validate only by length rule:
    - BSE Odisha: length must be 9
    - CHSE Odisha: length must be 8
    - Other boards: keep decrypted roll as-is
    """

    # Decrypt roll numbers
    df['roll_no_decrypted'] = df[roll_col].map(decrypt_roll)

    # Identify Odisha boards 
    board_col = df['exam_board'].fillna("NA").str.upper()
    # Match the known spellings of the BSE and CHSE Odisha board names
    mask_bse = (board_col.str.contains(r'\bBOARD OF SECONDARY EDUCATION,\s*ODISHA\b', regex=True)  
                | (board_col.str.contains(r'\bBSE\b(?! MADHYAMA).*ODISHA\b', regex=True) & ~board_col.str.contains(r'\bICSE\b|\bCBSE\b', regex=True)))
    
    mask_chse = (board_col.str.contains(r'\bCOUNCIL OF HIGHER SECONDARY EDUCATION,\s*ODISHA\b', regex=True) 
                 | board_col.str.contains(r'\bCHSE\b.*ODISHA\b', regex=True))

    # Apply validation
    if mask_bse.any():
        rolls_bse = df.loc[mask_bse & df['roll_no_decrypted'].notna(), 'roll_no_decrypted'].astype(str)
        valid_bse = rolls_bse.str.len() == 9
        df.loc[mask_bse & ~valid_bse, 'roll_no_decrypted'] = 'NA'

    if mask_chse.any():
        rolls_chse = df.loc[mask_chse & df['roll_no_decrypted'].notna(), 'roll_no_decrypted'].astype(str)
        valid_chse = rolls_chse.str.len() == 8
        df.loc[mask_chse & ~valid_chse, 'roll_no_decrypted'] = 'NA'

    return df
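
The length rule itself can be sketched on single values as follows. This is a simplified illustration: `validate_roll` and the short board codes are hypothetical, whereas the notebook function identifies boards by matching full board names with regular expressions.

```python
from typing import Optional

# Expected roll-number lengths, taken from the rules stated in this section
EXPECTED_LENGTH = {"BSE": 9, "CHSE": 8}


def validate_roll(roll: str, board: Optional[str]) -> str:
    """Return the roll number if it passes the board's length rule, else 'NA'."""
    expected = EXPECTED_LENGTH.get(board)
    if expected is not None and len(roll) != expected:
        return "NA"
    # Boards without a rule keep the decrypted roll number as-is
    return roll


print(validate_roll("123AB4567", "BSE"))
print(validate_roll("1234567", "CHSE"))
print(validate_roll("XYZ-1", "CBSE"))
```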

### 3.2 Prepare HSS and DEG Data

Select relevant columns and rename to match standard schema.

In [13]:
# Columns to keep, then drop everything else and rename finally
keep_cols = [
    'barcode', 'aadhar_no', 'academic_year', 'module', 'student_name',
    'dob', 'examination_board_of_the_highest_qualification','examination_type', 'year_of_passing' , 'roll_no'
]
hss_enrollments = hss_enrollments[keep_cols].copy()
deg_enrollments = deg_enrollments[keep_cols].copy()
rename_map = {
    'examination_board_of_the_highest_qualification': 'exam_board',
    'examination_type': 'exam_type',
    'year_of_passing': 'passing_year'
}
hss_enroll = hss_enrollments.rename(columns=rename_map)
deg_enroll = deg_enrollments.rename(columns=rename_map)

### 4.3 Apply Decryption to All Datasets

Process roll numbers for ITI, Diploma, HSS, and DEG. The BSE and CHSE calls are left commented out in this run.

In [14]:
iti_df = process_roll_numbers_len_format(iti_enrollments)
diploma_df = process_roll_numbers_len_format(diploma_enrollments)
hss_df = process_roll_numbers_len_format(hss_enroll)
deg_df = process_roll_numbers_len_format(deg_enroll)
# bse_df = process_roll_numbers_len_format(bse_df)
# chse_df = process_roll_numbers_len_format(chse_df)

## 5. Generate Student Keys

### 5.1 Define Student Key Generation Functions

Create unique student identifiers using name, roll number, DOB, passing year, exam board, and exam type.

In [15]:
def encode_part(s: pd.Series, *, na_label="NA", missing_label="MISSING", lower=False) -> pd.Series:
    """
    Encode parts of a student key by handling missing/NA values consistently.
    """
    is_nan = s.isna()
    t = s.astype(str).str.strip()
    t = t.str.strip('"').str.strip("'")   # remove quotes if present

    out = t.copy()

    # Replace explicit NA and missing values
    out = out.mask(t.eq("NA"), na_label)
    out = out.mask(t.eq("") | is_nan, missing_label)

    # Normalize casing if requested
    if lower:
        out = out.where(out.isin([na_label, missing_label]), out.str.lower().str.strip())

    return out

In [16]:
def generate_student_key_df(df, module_name: str) -> pd.DataFrame:
    """
    Clean key columns in-place, then generate a student_key
    and print diagnostics about duplicates.
    """
    new_df = df.copy()

    key_vars = ["passing_year", "dob",
                "roll_no_decrypted", "exam_board", "exam_type"]

    # Normalize and ensure all key parts are strings
    for col in key_vars + ["student_name"]:
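        # Cast to str, then blank out values that were missing before the cast
        # (astype(str) alone can turn them into the literal strings "nan"/"None")
        new_df[col] = (
            new_df[col].astype(str)
            .where(new_df[col].notna(), "")
            .str.strip().str.lower()
        )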

    # Construct student key safely
    new_df["student_key"] = (
        new_df["student_name"] + "_" +
        new_df["roll_no_decrypted"] + "_" +
        new_df["dob"] + "_" +
        new_df["passing_year"] + "_" +
        new_df["exam_board"] + "_" +
        new_df["exam_type"]
    )

    # Diagnostics
    total_records = len(new_df)
    unique_aadhar = new_df["aadhar_no"].nunique(dropna=True)
    unique_keys = new_df["student_key"].nunique()

    # Problematic duplicates = same key linked to multiple Aadhaar numbers
    dup_check = (
        new_df.groupby("student_key")["aadhar_no"]
        .nunique(dropna=True)
        .reset_index(name="unique_aadhar_count")
    )
    problematic_keys = dup_check[dup_check["unique_aadhar_count"] > 1]["student_key"]
    duplicate_keys_count = len(problematic_keys)

    print(f"\n[{module_name}]")
    print("Total student records:", total_records)
    print("Unique student keys generated:", unique_keys)
    print("Duplicate student keys:", duplicate_keys_count)
    print("Unique Aadhar numbers:", unique_aadhar)

    return new_df

In [27]:
def generate_student_key_four_var(df: pd.DataFrame, module_name: str) -> pd.DataFrame:
    """
    Generate a standardized 4-variable student key (`student_key_4_var`) 
    for identity matching across datasets.

    The key is built using module-specific rules:
    - CHSE (Higher Secondary): roll_no_decrypted + passing_year + exam_board + exam_type
    - BSE  (Secondary):        roll_no_decrypted + dob + passing_year + exam_board
    - DEG  (Degree):           roll_no_decrypted + passing_year + exam_board + exam_type
    - HSS  (Higher Secondary): roll_no_decrypted + dob + passing_year + exam_board

    All fields are normalized (lowercase, stripped) before concatenation.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame containing student records.
    module_name : str
        Module name ("CHSE", "BSE", "DEG", or "HSS")

    Returns
    -------
    pd.DataFrame
        DataFrame with a new column `student_key_4_var`
    """

    new_df = df.copy()
    module = module_name.upper()

    # Select key components based on module
    if module in ["CHSE", "DEG"]:
        key_parts = ["roll_no_decrypted", "passing_year", "exam_board", "exam_type"]
    elif module in ["BSE", "HSS"]:
        key_parts = ["roll_no_decrypted", "dob", "passing_year", "exam_board"]
    else:
        raise ValueError(f"Invalid module '{module_name}'. Use 'CHSE', 'BSE', 'DEG', or 'HSS'.")

    # Normalize fields
    for col in key_parts:
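        if col in new_df.columns:
            # Cast to str, then blank out values that were missing before the cast
            new_df[col] = (
                new_df[col].astype(str)
                .where(new_df[col].notna(), "")
                .str.strip().str.lower()
            )
        else:
            new_df[col] = ""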

    # Create composite key
    new_df["student_key_4_var"] = new_df[key_parts].agg("_".join, axis=1)

    # Summary
    print(f"\n[{module}] Student Key (4-var) Summary")
    print("Total records:", len(new_df))
    print("Unique keys:", new_df["student_key_4_var"].nunique())

    return new_df
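
To make the module-specific rules concrete, here is a small plain-Python sketch that builds the 4-variable key for one record. The `KEY_PARTS` table restates the rules from the docstring above; the helper `build_key` and the sample record are made up for illustration.

```python
# Key components per module, as listed in the docstring above
KEY_PARTS = {
    "CHSE": ["roll_no_decrypted", "passing_year", "exam_board", "exam_type"],
    "DEG": ["roll_no_decrypted", "passing_year", "exam_board", "exam_type"],
    "BSE": ["roll_no_decrypted", "dob", "passing_year", "exam_board"],
    "HSS": ["roll_no_decrypted", "dob", "passing_year", "exam_board"],
}


def build_key(record: dict, module: str) -> str:
    """Join the normalized key parts for the given module with underscores."""
    cleaned = []
    for name in KEY_PARTS[module.upper()]:
        value = record.get(name)
        # Missing values become empty strings; everything else is stripped and lowercased
        cleaned.append("" if value is None else str(value).strip().lower())
    return "_".join(cleaned)


# Hypothetical record, not taken from the dataset
record = {
    "roll_no_decrypted": " 123AB4567 ",
    "dob": "2004-01-05",
    "passing_year": 2020,
    "exam_board": "BSE, Odisha",
    "exam_type": "Annual",
}
print(build_key(record, "BSE"))
print(build_key(record, "DEG"))
```

The same record yields different keys under different modules, so keys are only comparable between datasets that share a rule (CHSE with DEG, BSE with HSS).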

### 5.2 Generate Keys for All Modules

Create student keys for ITI, Diploma, HSS, and DEG with diagnostic output.

In [17]:
iti_key_df = generate_student_key_df(iti_df, "ITI")
diploma_key_df = generate_student_key_df(diploma_df, "Diploma")
# hss_key_df = generate_student_key_df(hss_df, "HSS")
deg_key_df = generate_student_key_df(deg_df, "DEG")


[ITI]
Total student records: 559575
Unique student keys generated: 524796
Duplicate student keys: 1807
Unique Aadhar numbers: 518024

[Diploma]
Total student records: 445850
Unique student keys generated: 414078
Duplicate student keys: 4285
Unique Aadhar numbers: 392904

[DEG]
Total student records: 2054491
Unique student keys generated: 1634565
Duplicate student keys: 1917
Unique Aadhar numbers: 1506304



In [18]:
hss_key_df = generate_student_key_df(hss_df, "HSS")


[HSS]
Total student records: 3453401
Unique student keys generated: 2961052
Duplicate student keys: 5011
Unique Aadhar numbers: 2650769


## 6. Merge Applications with Enrollment Data

Link application records with student identity information from enrollment data using barcode.

In [29]:
def merge_enrollment_applications(enroll_df, app_df):
    """
    Merge application rows with student identity columns using barcode.
    Keeps all application rows and adds student info from enrollment table.
    """

    # Identity columns to pull from the enrollment table
    enroll_cols = [
        "barcode", "student_name", "aadhar_no", "dob", "module",
        "academic_year", "exam_board", "exam_type", "passing_year",
        "roll_no", "roll_no_decrypted", "student_key"
    ]
    
    enroll_reduced = enroll_df[enroll_cols].copy()

    # Only barcode from applications
    app_reduced = app_df[["barcode"]].copy()

    # Merge keeping all application rows
    merged = app_reduced.merge(enroll_reduced, on="barcode", how="left")

    return merged

In [30]:
hss_df2 = merge_enrollment_applications(hss_key_df, hss_applications)
deg_df2 = merge_enrollment_applications(deg_key_df, deg_applications)
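
One thing worth checking with a left join on `barcode`: if a barcode appears more than once in the enrollment table, the matching application rows are multiplied. The sketch below, on made-up toy data, shows how the `validate` and `indicator` options of `DataFrame.merge` could be used to guard against that and to count unmatched applications. Whether barcodes are actually unique in the enrollment data is not stated in this notebook.

```python
import pandas as pd

# Toy data, made up for illustration
apps = pd.DataFrame({"barcode": ["A1", "A1", "B2", "C3"]})
enroll = pd.DataFrame({
    "barcode": ["A1", "B2"],
    "student_key": ["key_a", "key_b"],
})

merged = apps.merge(
    enroll,
    on="barcode",
    how="left",
    validate="many_to_one",  # raises MergeError if enroll has duplicate barcodes
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Row count is preserved when the right side is unique on the join key
print(len(merged) == len(apps))
# Applications with no enrollment match
print(int((merged["_merge"] == "left_only").sum()))
```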

## 7. Summary Statistics and Analysis

### 7.1 Repeated Aadhaar Analysis

Flag Aadhaar numbers that appear more than five times within a single academic year; such repeats point to placeholder values or other data quality issues.

In [49]:
# Simple repeat Aadhaar check by academic year

def repeated_aadhaar_summary(df, module_name):
    print(f"\n--- {module_name.upper()}: Repeated Aadhaar by Year ---")
    repeat_summary = (
        df.groupby(['academic_year', 'aadhar_no'])
          .size()
          .reset_index(name='count')
          .query('count > 5')  # Only keep Aadhaar appearing more than five times
          .sort_values(['academic_year', 'count'], ascending=[True, False])
    )
    print(repeat_summary.to_string(index=False))

# Run for ITI
repeated_aadhaar_summary(iti_enrollments, "ITI ENROLLMENT")


--- ITI ENROLLMENT: Repeated Aadhaar by Year ---
 academic_year                                    aadhar_no  count
          2017 47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=   3181
          2018 UIC27nlODwzAwV13RAZD1vk8kiSxo2GLRDviArS4Ktg=     10


In [50]:
repeated_aadhaar_summary(diploma_enrollments, "Diploma ENROLLMENT")


--- DIPLOMA ENROLLMENT: Repeated Aadhaar by Year ---
 academic_year                                    aadhar_no  count
          2018 47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=  10118
          2019 47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=  13183


In [60]:
# Missing Aadhaar by academic_year from enrollments
# Treat repeated non-null Aadhaar values (hashed/placeholder duplicates) as missing as well

def missing_from_enrollments(df: pd.DataFrame, prefix: str) -> pd.DataFrame:
    if 'aadhar_no' not in df.columns or 'academic_year' not in df.columns:
        # Return empty/zero table if required columns are missing
        return pd.DataFrame({'academic_year': [], f'{prefix}_missing_aadhar': []})

    g = df[['academic_year', 'aadhar_no']].copy()

    # Base missing = explicit nulls
    missing_null = (
        g.groupby('academic_year', dropna=False)['aadhar_no']
         .apply(lambda s: s.isna().sum())
         .rename('missing_null')
    )

    # Hash/placeholder duplicates: non-null Aadhaar values that repeat > 1 in the year
    non_null = g[g['aadhar_no'].notna()].copy()
    if not non_null.empty:
        dup_counts = (
            non_null.groupby(['academic_year', 'aadhar_no'])
                    .size()
                    .reset_index(name='cnt')
        )
        hashed_rows_by_year = (
            dup_counts[dup_counts['cnt'] > 1]
            .groupby('academic_year')['cnt']
            .sum()
            .rename('hashed_rows')
        )
    else:
        hashed_rows_by_year = pd.Series(dtype='int64', name='hashed_rows')

    # Combine null-missing and hashed duplicate rows as total missing
    out = missing_null.to_frame().join(hashed_rows_by_year, how='left').fillna(0)
    out['missing_aadhar'] = out['missing_null'] + out['hashed_rows']

    res = (
        out[['missing_aadhar']]
        .reset_index()
        .rename(columns={'missing_aadhar': f'{prefix}_missing_aadhar'})
    )
    return res

iti_missing = missing_from_enrollments(iti_key_df, 'iti')
dip_missing = missing_from_enrollments(diploma_key_df, 'diploma')

print('ITI missing (null + repeated hashes):')
print(iti_missing.head(7).to_string(index=False))
print('\nDiploma missing (null + repeated hashes):')
print(dip_missing.head(7).to_string(index=False))

ITI missing (enrollments, null + repeated hashes):
 academic_year  iti_missing_aadhar
          2017             10949.0
          2018              8840.0
          2019                 0.0
          2020                 4.0
          2021                 0.0
          2022                 0.0
          2023                 0.0

Diploma missing (enrollments, null + repeated hashes):
 academic_year  diploma_missing_aadhar
          2018                 12079.0
          2019                 13329.0
          2020                   107.0
          2021                     0.0
          2022                     0.0
          2023                     0.0
          2024                     0.0


### 7.3 Aadhaar Coverage Summary

Calculate, by academic year:
- Total applications
- Unique Aadhaar count
- Non-unique (duplicate) Aadhaar IDs
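
As a minimal sketch of the unique and non-unique counts listed above, the following plain-Python example works on a handful of made-up `(academic_year, aadhar_no)` rows. It is an illustration of the definitions only, not the notebook's implementation, which uses pandas group-bys.

```python
from collections import Counter, defaultdict

# Made-up (academic_year, aadhar_no) rows; None stands for a missing Aadhaar
rows = [(2018, "aaa"), (2018, "aaa"), (2018, "bbb"), (2019, "ccc"), (2019, None)]

per_year = defaultdict(Counter)
for year, aadhaar in rows:
    if aadhaar is not None:
        per_year[year][aadhaar] += 1

summary = {
    year: {
        # Distinct Aadhaar values seen in the year
        "unique_aadhar": len(counts),
        # Aadhaar values that occur on more than one row in the year
        "non_unique_aadhar": sum(1 for n in counts.values() if n > 1),
    }
    for year, counts in sorted(per_year.items())
}
print(summary)
```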

In [64]:
# 7.x Unreliable Aadhaar detector (per-year)
# Treats nulls as missing and flags heavy repeated placeholders per year.
# Also allows explicit known placeholder hashes.

KNOWN_AADHAAR_PLACEHOLDERS = {
    # SHA-256 of empty string (Base64), often used as placeholder
    '47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=',
}


def build_unreliable_aadhaar_index(df: pd.DataFrame,
                                    per_year_threshold: int = 100,
                                    known_placeholders: set[str] | None = None) -> pd.DataFrame:
    """
    Return a DataFrame of (academic_year, aadhar_no, unreliable=True) pairs.
    - Unreliable if value appears >= per_year_threshold within that year
    - Unreliable if value is in known_placeholders (for any year it appears)
    """
    known_placeholders = known_placeholders or set()

    # Per-year counts (ignore nulls)
    counts = (
        df.dropna(subset=['aadhar_no'])
          .groupby(['academic_year', 'aadhar_no'])
          .size().reset_index(name='cnt')
    )

    # Heavy repeats within the year
    heavy = counts[counts['cnt'] >= per_year_threshold][['academic_year', 'aadhar_no']]

    # Known placeholders wherever they appear
    known = counts[counts['aadhar_no'].isin(known_placeholders)][['academic_year', 'aadhar_no']]

    out = pd.concat([heavy, known], axis=0).drop_duplicates()
    if out.empty:
        out = pd.DataFrame(columns=['academic_year', 'aadhar_no'])
    out['unreliable'] = True
    return out
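
The value in `KNOWN_AADHAAR_PLACEHOLDERS` can be checked with the standard library: it equals the Base64-encoded SHA-256 digest of an empty byte string. That suggests, though the source does not confirm it, that the Aadhaar field was empty before hashing for those rows.

```python
import base64
import hashlib

placeholder = "47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU="

# Base64 of the raw SHA-256 digest of zero bytes of input
empty_sha256_b64 = base64.b64encode(hashlib.sha256(b"").digest()).decode("ascii")

print(empty_sha256_b64 == placeholder)
```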

In [65]:
# 7.x Clean Aadhaar summary (per-year) using unreliable index
# Computes per-year metrics after excluding nulls and unreliable (hashed/heavy-repeat) values.


def summarize_clean_aadhaar(df: pd.DataFrame,
                            prefix: str,
                            unreliable_idx: pd.DataFrame) -> pd.DataFrame:
    """
    Returns per-year metrics:
    - {prefix}_unique_aadhar: distinct Aadhaar among reliable rows
    - {prefix}_supposed_missing: rows with null or unreliable Aadhaar
    - {prefix}_non_unique_aadhar: count of Aadhaar IDs still appearing >1 among reliable rows
    """
    g = df[['academic_year', 'aadhar_no']].copy()

    # Join unreliable pairs on (year, aadhar)
    if unreliable_idx is not None and not unreliable_idx.empty:
        g = g.merge(unreliable_idx, on=['academic_year', 'aadhar_no'], how='left')
        # After the left join the flag is True for unreliable pairs and NaN otherwise
        g['unreliable'] = g['unreliable'].notna()
    else:
        g['unreliable'] = False

    g['is_null'] = g['aadhar_no'].isna()
    g['supposed_missing'] = g['is_null'] | g['unreliable']

    # Reliable subset
    reliable = g[~g['supposed_missing']].dropna(subset=['aadhar_no'])

    # Unique Aadhaar among reliable
    unique_ids = (
        reliable.groupby('academic_year')['aadhar_no']
                .nunique()
                .rename(f'{prefix}_unique_aadhar')
    )

    # Count of Aadhaar IDs still appearing >1 among reliable rows
    if not reliable.empty:
        dup_ids = (
            reliable.groupby(['academic_year', 'aadhar_no']).size()
                    .reset_index(name='cnt')
                    .query('cnt > 1')
                    .groupby('academic_year').size()
                    .rename(f'{prefix}_non_unique_aadhar')
        )
    else:
        dup_ids = pd.Series(dtype='int64', name=f'{prefix}_non_unique_aadhar')

    # Supposed-missing row counts
    missing_counts = (
        g.groupby('academic_year')['supposed_missing']
         .sum()
         .rename(f'{prefix}_supposed_missing')
    )

    out = pd.concat([unique_ids, missing_counts, dup_ids], axis=1).reset_index().fillna(0)
    for c in out.columns:
        if c != 'academic_year':
            out[c] = out[c].astype(int)
    return out


# Build per-year unreliable indices (tune threshold if needed)
iti_unrel = build_unreliable_aadhaar_index(
    iti_key_df,
    per_year_threshold=100,
    known_placeholders=KNOWN_AADHAAR_PLACEHOLDERS,
)

diploma_unrel = build_unreliable_aadhaar_index(
    diploma_key_df,
    per_year_threshold=100,
    known_placeholders=KNOWN_AADHAAR_PLACEHOLDERS,
)

# Compute clean summaries and merge
iti_clean = summarize_clean_aadhaar(iti_key_df, 'iti', iti_unrel)
diploma_clean = summarize_clean_aadhaar(diploma_key_df, 'diploma', diploma_unrel)

final_clean_summary = (
    iti_clean.merge(diploma_clean, on='academic_year', how='outer')
             .rename(columns={'academic_year': 'year'})
             [[
                 'year',
                 'iti_unique_aadhar', 'iti_supposed_missing', 'iti_non_unique_aadhar',
                 'diploma_unique_aadhar', 'diploma_supposed_missing', 'diploma_non_unique_aadhar',
             ]]
             .sort_values('year')
             .reset_index(drop=True)
)

final_clean_summary.head(20)



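|   | year | iti_unique_aadhar | iti_supposed_missing | iti_non_unique_aadhar | diploma_unique_aadhar | diploma_supposed_missing | diploma_non_unique_aadhar |
|---|------|-------------------|----------------------|-----------------------|-----------------------|--------------------------|---------------------------|
| 0 | 2017 | 27336 | 3181 | 3631 | NaN | NaN | NaN |
| 1 | 2018 | 70612 | 1 | 4242 | 49933.0 | 10118.0 | 975.0 |
| 2 | 2019 | 64148 | 0 | 0 | 41589.0 | 13183.0 | 73.0 |
| 3 | 2020 | 67411 | 4 | 0 | 52958.0 | 5.0 | 51.0 |
| 4 | 2021 | 68000 | 0 | 0 | 54979.0 | 0.0 | 0.0 |
| 5 | 2022 | 74104 | 0 | 0 | 71768.0 | 0.0 | 0.0 |
| 6 | 2023 | 92085 | 0 | 0 | 73513.0 | 0.0 | 0.0 |
| 7 | 2024 | 83958 | 0 | 0 | 76694.0 | 0.0 | 0.0 |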


### 7.6 Final Combined Summary

Merged, per academic year, with columns:
- Year
- ITI and Diploma: applications, unique Aadhaar, 1-by-1 matches

1) Applications per year from the application datasets.
2) Missing Aadhaar counts from enrollments.
3) 1-by-1 Aadhaar ↔ student_key matches.
4) Merge everything by academic_year.
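
To spell out what a 1-by-1 match means in step 3: a pair counts only if its Aadhaar value links to exactly one student key and that key links to exactly one Aadhaar value. The plain-Python sketch below applies this definition to made-up pairs for a single year; it is an illustration, not the notebook's pandas code.

```python
from collections import Counter

# Distinct (aadhar_no, student_key) pairs within one year; values are made up
pairs = {("a1", "k1"), ("a2", "k2"), ("a2", "k3"), ("a3", "k4"), ("a4", "k4")}

aadhaar_counts = Counter(a for a, _ in pairs)
key_counts = Counter(k for _, k in pairs)

# Keep pairs where both sides occur exactly once
one_to_one = sorted(
    (a, k) for a, k in pairs
    if aadhaar_counts[a] == 1 and key_counts[k] == 1
)
print(one_to_one)
```

Here `a2` maps to two keys and `k4` is shared by two Aadhaar values, so only the `a1`/`k1` pair remains.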


In [38]:
# Yearly summary for ITI & Diploma with application counts
def create_yearly_summary(enrollment_df, application_df):
    """
    Generate yearly summary combining enrollment and application data.
    
    Parameters:
    - enrollment_df: DataFrame with student_key and aadhar_no (from key generation)
    - application_df: DataFrame with barcode for counting total applications
    """
    
    # Count total applications by year
    app_counts = (
        application_df.groupby('academic_year')
        .size()
        .reset_index(name='total_applications')
    )
    
    # Calculate Aadhaar and 1-by-1 match metrics from enrollment data
    def _metrics(g):
        mapping = g[['aadhar_no', 'student_key']].drop_duplicates()

        # Find unique Aadhaar and unique keys within the year
        unique_aadhar = mapping['aadhar_no'].value_counts() == 1
        unique_key = mapping['student_key'].value_counts() == 1

        # Count 1-by-1 matches (unique Aadhaar to unique key)
        one_to_one = mapping[
            mapping['aadhar_no'].isin(unique_aadhar[unique_aadhar].index) &
            mapping['student_key'].isin(unique_key[unique_key].index)
        ]

        return pd.Series({
            'aadhar': g['aadhar_no'].nunique(),
            '1by1_match': len(one_to_one),
        })

    enroll_summary = (
        enrollment_df.groupby('academic_year', group_keys=False)
          .apply(_metrics)
          .reset_index()
    )
    
    # Merge application counts with enrollment metrics
    summary = app_counts.merge(enroll_summary, on='academic_year', how='outer')
    
    return summary

In [39]:
# Create summaries for ITI and Diploma
iti_summary = create_yearly_summary(iti_key_df, iti_applications)
diploma_summary = create_yearly_summary(diploma_key_df, diploma_applications)

final_summary = (
    iti_summary.merge(
        diploma_summary, on='academic_year',how='outer', suffixes=('_iti', '_diploma'))
    )

# Rename columns for clarity
final_summary = final_summary.rename(columns={
    'academic_year': 'year',
    'total_applications_iti': 'iti_applications',
    'aadhar_iti': 'iti_aadhar',
    '1by1_match_iti': 'iti_1by1_match',
    'total_applications_diploma': 'diploma_applications',
    'aadhar_diploma': 'diploma_aadhar',
    '1by1_match_diploma': 'diploma_1by1_match',
})

# Final column ordering
final_summary = final_summary[
    [
        'year',
        'iti_applications', 'iti_aadhar', 'iti_1by1_match',
        'diploma_applications', 'diploma_aadhar', 'diploma_1by1_match', 
    ]
]

# Print total applications across all years
print(f"Total ITI Applications: {final_summary['iti_applications'].sum():,}")
print(f"Total Diploma Applications: {final_summary['diploma_applications'].sum():,}\n")

final_summary


Total ITI Applications: 2,218,985
Total Diploma Applications: 1,552,260.0




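|   | year | iti_applications | iti_aadhar | iti_1by1_match | diploma_applications | diploma_aadhar | diploma_1by1_match |
|---|------|------------------|------------|----------------|----------------------|----------------|--------------------|
| 0 | 2017 | 145612 | 27337 | 26065 | NaN | NaN | NaN |
| 1 | 2018 | 215662 | 70613 | 69617 | 203293.0 | 49934.0 | 48543.0 |
| 2 | 2019 | 261505 | 64148 | 64119 | 186285.0 | 41590.0 | 41484.0 |
| 3 | 2020 | 208581 | 67412 | 67407 | 181710.0 | 52959.0 | 52868.0 |
| 4 | 2021 | 300121 | 68000 | 67974 | 175738.0 | 54979.0 | 54949.0 |
| 5 | 2022 | 303082 | 74104 | 74077 | 244165.0 | 71768.0 | 71747.0 |
| 6 | 2023 | 388380 | 92085 | 92069 | 270782.0 | 73513.0 | 73503.0 |
| 7 | 2024 | 396042 | 83958 | 83958 | 290287.0 | 76694.0 | 76682.0 |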


#### Step 4 — Merge all metrics (by academic_year)

Combine ITI and Diploma into one table with:
- ITI: applications, unique Aadhaar, 1-by-1 matches
- Diploma: applications, unique Aadhaar, 1-by-1 matches

In [76]:
# Final ITI Summary Table
def compute_1by1(df):
    p = df[['academic_year', 'aadhar_no', 'student_key']].dropna().drop_duplicates()
    a = p.groupby(['academic_year', 'aadhar_no']).size()
    k = p.groupby(['academic_year', 'student_key']).size()
    m = p.merge(a.rename('a_cnt'), on=['academic_year', 'aadhar_no']).merge(k.rename('k_cnt'), on=['academic_year', 'student_key'])
    return m[(m['a_cnt'] == 1) & (m['k_cnt'] == 1)].groupby('academic_year').size().reset_index(name='1by1_match')

iti_final = (
    iti_applications.groupby('academic_year').size().reset_index(name='applications')
    .merge(iti_clean, on='academic_year', how='outer')
    .merge(compute_1by1(iti_key_df), on='academic_year', how='outer')
    .rename(columns={'academic_year': 'year', 'iti_unique_aadhar': 'unique_aadhar', 
                     'iti_supposed_missing': 'supposed_missing', 'iti_non_unique_aadhar': 'non_unique_aadhar'})
    [['year', 'applications', 'unique_aadhar', 'supposed_missing', 'non_unique_aadhar', '1by1_match']]
    .sort_values('year').fillna(0).reset_index(drop=True)
    .astype({c: int for c in ['applications', 'unique_aadhar', 'supposed_missing', 'non_unique_aadhar', '1by1_match']})
)

print("ITI Summary Table:")
print(iti_final.to_string(index=False))
print(f"\nTotals: {iti_final[['applications', 'unique_aadhar', 'supposed_missing', 'non_unique_aadhar', '1by1_match']].sum().to_dict()}")


ITI Summary Table:
 year  applications  unique_aadhar  supposed_missing  non_unique_aadhar  1by1_match
 2017        145612          27336              3181               3631       26065
 2018        215662          70612                 1               4242       69617
 2019        261505          64148                 0                  0       64119
 2020        208581          67411                 4                  0       67407
 2021        300121          68000                 0                  0       67974
 2022        303082          74104                 0                  0       74077
 2023        388380          92085                 0                  0       92069
 2024        396042          83958                 0                  0       83958

Totals: {'applications': 2218985, 'unique_aadhar': 547654, 'supposed_missing': 3186, 'non_unique_aadhar': 7873, '1by1_match': 545286}


In [None]:
# Final Diploma Summary Table
diploma_final = (
    diploma_applications.groupby('academic_year').size().reset_index(name='applications')
    .merge(diploma_clean, on='academic_year', how='outer')
    .merge(compute_1by1(diploma_key_df), on='academic_year', how='outer')
    .rename(columns={'academic_year': 'year', 'diploma_unique_aadhar': 'unique_aadhar',
                     'diploma_supposed_missing': 'supposed_missing', 'diploma_non_unique_aadhar': 'non_unique_aadhar'})
    [['year', 'applications', 'unique_aadhar', 'supposed_missing', 'non_unique_aadhar', '1by1_match']]
    .sort_values('year').fillna(0).reset_index(drop=True)
    .astype({c: int for c in ['applications', 'unique_aadhar', 'supposed_missing', 'non_unique_aadhar', '1by1_match']})
)

print("Diploma Final Summary Table:")
print(diploma_final.to_string(index=False))