# Student Data Creation and Initial Preprocessing

**Objective:** This notebook outlines the process for generating a large-scale synthetic dataset simulating student attributes relevant to predicting dropout in Indian government schools. The aim is to create a foundational dataset for subsequent exploratory data analysis (EDA), feature engineering, and model development.

**Process Overview:**
1.  Define the schema and characteristics for student data, drawing inspiration from common educational datasets (e.g., UDISE+) and domain knowledge.
2.  Programmatically generate synthetic records (e.g., 3 million students).
3.  Introduce controlled variations and relationships between features to mimic real-world scenarios.
4.  Perform very basic data cleaning (e.g., type conversions, renaming columns for consistency).
5.  Save the generated raw dataset (`synthetic_student_data_combined.csv`) for more detailed exploration and preprocessing.

In [32]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import time 
import glob
import math

In [None]:
df_enroll = pd.read_csv('/Users/macbookpro/Desktop/POC1/data/raw/100_enr1.csv')
df_enroll.head() 

Unnamed: 0,pseudocode,item_group,item_id,cpp_b,cpp_g,c1_b,c1_g,c2_b,c2_g,c3_b,...,c8_b,c8_g,c9_b,c9_g,c10_b,c10_g,c11_b,c11_g,c12_b,c12_g
0,1000002,1,2,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1000002,1,3,0,0,7,7,6,3,8,...,0,0,0,0,0,0,0,0,0,0
2,1000002,1,4,0,0,3,2,1,3,1,...,0,0,0,0,0,0,0,0,0,0
3,1000002,3,13,0,0,10,9,7,5,10,...,0,0,0,0,0,0,0,0,0,0
4,1000019,1,2,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [3]:
df_fac = pd.read_csv('/Users/macbookpro/Desktop/POC1/data/raw/100_fac.csv')
df_fac.head()

Unnamed: 0,pseudocode,building_status,no_building_blocks,pucca_building_blocks,boundary_wall,total_class_rooms,other_rooms,classrooms_in_good_condition,classrooms_needs_minor_repair,classrooms_needs_major_repair,...,desktop,digiboard,teachdev_tot,server_tot,smart_class_tv_tot,projector,printer,internet,dth,spl_educator_yn
0,1341742,1,15,15,3,13,7,13,0,0,...,5,0,1,0,1,1,1,1,2,1
1,4508465,1,1,1,1,3,6,3,0,0,...,1,0,0,0,0,0,0,1,2,3
2,1884303,3,1,1,1,2,1,0,2,0,...,0,0,0,0,0,0,0,2,2,3
3,2601515,3,4,2,5,5,2,5,0,0,...,0,0,0,0,0,0,0,1,2,3
4,7173432,3,2,2,1,4,2,0,4,0,...,0,0,0,0,0,0,0,2,2,2


In [4]:

df_tch = pd.read_csv('/Users/macbookpro/Desktop/POC1/data/raw/100_tch.csv')
df_tch.head()



Unnamed: 0,pseudocode,total_tch,male,female,transgender,gen_tch,sc_tch,st_tch,obc_tch,regular,...,bed_equivalent,med_equivalent,other,none,diploma_special_edu,pursuing_rpc,diploma_ele_edu,early_childhood_tch,bed_nursery,trained_cwsn
0,1000002,3,2,1,0,0,0,1,2,3,...,2,0,0,0,0,0,0,0,0,0
1,1000019,2,0,2,0,0,0,0,2,2,...,2,0,0,0,0,0,0,0,0,1
2,1000021,1,0,1,0,0,0,0,1,1,...,0,0,0,0,0,0,0,1,0,1
3,1000028,3,0,3,0,2,0,0,1,3,...,0,0,0,0,0,0,3,0,0,0
4,1000029,15,3,12,0,1,5,1,8,14,...,4,3,1,0,0,1,0,0,0,0


In [5]:
df_tch.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1471891 entries, 0 to 1471890
Data columns (total 38 columns):
 #   Column                                   Non-Null Count    Dtype
---  ------                                   --------------    -----
 0   pseudocode                               1471891 non-null  int64
 1   total_tch                                1471891 non-null  int64
 2   male                                     1471891 non-null  int64
 3   female                                   1471891 non-null  int64
 4   transgender                              1471891 non-null  int64
 5   gen_tch                                  1471891 non-null  int64
 6   sc_tch                                   1471891 non-null  int64
 7   st_tch                                   1471891 non-null  int64
 8   obc_tch                                  1471891 non-null  int64
 9   regular                                  1471891 non-null  int64
 10  contract                                 1

In [6]:
df_enroll.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8234734 entries, 0 to 8234733
Data columns (total 29 columns):
 #   Column      Dtype
---  ------      -----
 0   pseudocode  int64
 1   item_group  int64
 2   item_id     int64
 3   cpp_b       int64
 4   cpp_g       int64
 5   c1_b        int64
 6   c1_g        int64
 7   c2_b        int64
 8   c2_g        int64
 9   c3_b        int64
 10  c3_g        int64
 11  c4_b        int64
 12  c4_g        int64
 13  c5_b        int64
 14  c5_g        int64
 15  c6_b        int64
 16  c6_g        int64
 17  c7_b        int64
 18  c7_g        int64
 19  c8_b        int64
 20  c8_g        int64
 21  c9_b        int64
 22  c9_g        int64
 23  c10_b       int64
 24  c10_g       int64
 25  c11_b       int64
 26  c11_g       int64
 27  c12_b       int64
 28  c12_g       int64
dtypes: int64(29)
memory usage: 1.8 GB
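
The enrollment frame above stores every column as `int64` and occupies about 1.8 GB. As a minimal sketch (not part of the original pipeline), integer columns could be downcast to the smallest dtype that holds their values. The helper name `downcast_integers` and the toy frame are assumptions made for illustration; the toy values are made up.

```python
import numpy as np
import pandas as pd

def downcast_integers(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which each integer column uses the smallest signed integer dtype that fits."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
    return out

# Toy frame standing in for df_enroll (hypothetical values).
toy = pd.DataFrame({
    "pseudocode": np.array([1000002, 1000019, 1000021], dtype="int64"),
    "c1_b": np.array([7, 3, 10], dtype="int64"),
    "c1_g": np.array([7, 2, 9], dtype="int64"),
})
small = downcast_integers(toy)

print(small.dtypes)
print(toy.memory_usage(index=False).sum(), "bytes ->", small.memory_usage(index=False).sum(), "bytes")
```

Small per-grade counts fit comfortably in narrow integer types, so most of the savings would come from the `c*_b` / `c*_g` columns rather than from `pseudocode`.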


In [7]:
df_fac.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1471891 entries, 0 to 1471890
Data columns (total 70 columns):
 #   Column                         Non-Null Count    Dtype
---  ------                         --------------    -----
 0   pseudocode                     1471891 non-null  int64
 1   building_status                1471891 non-null  int64
 2   no_building_blocks             1471891 non-null  int64
 3   pucca_building_blocks          1471891 non-null  int64
 4   boundary_wall                  1471891 non-null  int64
 5   total_class_rooms              1471891 non-null  int64
 6   other_rooms                    1471891 non-null  int64
 7   classrooms_in_good_condition   1471891 non-null  int64
 8   classrooms_needs_minor_repair  1471891 non-null  int64
 9   classrooms_needs_major_repair  1471891 non-null  int64
 10  separate_room_for_hm           1471891 non-null  int64
 11  total_boys_toilet              1471891 non-null  int64
 12  total_boys_func_toilet         1471891 non

In [8]:
df_profile = pd.read_csv('/Users/macbookpro/Desktop/POC1/data/raw/100_prof1.csv')
df_profile.head()

Unnamed: 0,pseudocode,state,district,block,rural_urban,lgd_urban_local_body_name,lgd_ward_name,lgd_vill_name,lgd_vill_panchayat_name,lgd_block_name,...,pre_primary,anganwadi_yn,avg_instr_days,cce_yn,same_sch_b,same_sch_g,other_sch_b,other_sch_g,anganwadi_ecce_b,anganwadi_ecce_g
0,6313415,HARYANA,PANIPAT,PANIPAT,2,Panipat-Municipal Corporations,Panipat (M Cl ) - Ward No.25,,,,...,1,9,170,1,15,10,2,0,0,0
1,9511206,UTTAR PRADESH,KHERI,LAKHIMPUR,1,,,Tusaura,TUSAURA,LAKHIMPUR,...,2,2,246,1,0,0,0,0,4,11
2,8316226,BIHAR,PURNIA,JALALGARH,1,,,Dhanganwan Kankhudia,DANSAR,JALALGARH,...,2,2,220,1,0,0,0,0,9,11
3,1793728,KERALA,THRISSUR,PAZHAYANNUR,1,,,Kaniyarkode (CT),THIRUVILWAMALA,PAZHAYANNUR,...,1,2,197,1,5,6,0,0,0,0
4,2925597,CHHATTISGARH,SURGUJA,UDAIPUR,1,,,Khamhariya,KHAMHARIYA,UDAIPUR,...,2,9,235,1,4,5,1,0,0,0


In [9]:
df_profile.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1471891 entries, 0 to 1471890
Data columns (total 38 columns):
 #   Column                     Non-Null Count    Dtype 
---  ------                     --------------    ----- 
 0   pseudocode                 1471891 non-null  int64 
 1   state                      1471891 non-null  object
 2   district                   1471891 non-null  object
 3   block                      1471891 non-null  object
 4   rural_urban                1471891 non-null  int64 
 5   lgd_urban_local_body_name  255812 non-null   object
 6   lgd_ward_name              255690 non-null   object
 7   lgd_vill_name              1193885 non-null  object
 8   lgd_vill_panchayat_name    1190305 non-null  object
 9   lgd_block_name             1185878 non-null  object
 10  school_category            1471891 non-null  int64 
 11  school_type                1471891 non-null  int64 
 12  lowclass                   1471891 non-null  int64 
 13  highclass                  

In [10]:
df_profile_2 = pd.read_csv('/Users/macbookpro/Desktop/POC1/data/raw/100_prof2.csv')


In [11]:
df_profile_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1471891 entries, 0 to 1471890
Data columns (total 17 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   pseudocode             1471891 non-null  int64  
 1   balavatika_located_yn  1471891 non-null  int64  
 2   special_training       1471891 non-null  int64  
 3   material_training      1471891 non-null  int64  
 4   free_text_books_pr     1471891 non-null  int64  
 5   free_uniform_pr        1471891 non-null  int64  
 6   free_text_books_up     1471891 non-null  int64  
 7   free_uniform_up        1471891 non-null  int64  
 8   acad_inspections       1471891 non-null  int64  
 9   crc_coordinator        1471891 non-null  int64  
 10  block_level_officers   1471891 non-null  int64  
 11  district_officers      1471891 non-null  int64  
 12  smc_exists             1471891 non-null  int64  
 13  smc_smdc_same          1471891 non-null  int64  
 14  smc_smdc_meetings 

In [12]:
df_profile['school_category' ].unique()

array([ 2,  1,  4,  8,  3,  6,  7,  5, 10, 12, 11])

In [13]:

# Create a base school dataset with essential information
school_data = df_profile[['pseudocode', 'state', 'district', 'block', 'rural_urban', 
                          'school_category', 'school_type', 'lowclass', 'highclass', 'special_school_for_cwsn', 'approachable_road',
                          'avg_instr_days']].copy()

school_agg = df_profile_2.groupby('pseudocode').agg({
    'free_text_books_pr': 'first',
    'free_uniform_pr': 'first',
    'free_text_books_up': 'first',
    'free_uniform_up': 'first',
}).reset_index()



school_data = pd.merge(school_data, school_agg, on='pseudocode', how='left')


# Add teacher information
teacher_agg = df_tch.groupby('pseudocode').agg({
    'total_tch': 'first',
    'male': 'first',
    'female': 'first',
    'transgender': 'first',
    'gen_tch': 'first',
    'sc_tch': 'first',
    'st_tch': 'first',
    'obc_tch': 'first',
    'regular': 'first'
}).reset_index()

school_data = pd.merge(school_data, teacher_agg, on='pseudocode', how='left')




In [14]:

# Add facility information
facility_agg = df_fac.groupby('pseudocode').agg({
    'total_class_rooms': 'first',
    'total_boys_func_toilet': 'first', 
    'total_girls_func_toilet': 'first',
    'electricity_availability': 'first',
    'library_availability': 'first',
    'internet': 'first',
    'medical_checkups': 'first',
    'spl_educator_yn': 'first'
}).reset_index()

school_data = pd.merge(school_data, facility_agg, on='pseudocode', how='left')







In [15]:


# Add enrollment information
# First, identify the columns for each grade and gender
boy_cols = [col for col in df_enroll.columns if col.endswith('_b') and col.startswith('c')]
girl_cols = [col for col in df_enroll.columns if col.endswith('_g') and col.startswith('c')]

# Calculate total enrollments
df_enroll['total_boys'] = df_enroll[boy_cols].sum(axis=1)
df_enroll['total_girls'] = df_enroll[girl_cols].sum(axis=1)
df_enroll['total_enrollment'] = df_enroll['total_boys'] + df_enroll['total_girls']

# For grades 1-12, create separate enrollment counts


for grade in range(1, 13):
    df_enroll[f'grade_{grade}_boys'] = df_enroll[f'c{grade}_b']
    df_enroll[f'grade_{grade}_girls'] = df_enroll[f'c{grade}_g']
    df_enroll[f'grade_{grade}_total'] = df_enroll[f'c{grade}_b'] + df_enroll[f'c{grade}_g']

# Aggregate enrollment data
enroll_agg = df_enroll.groupby('pseudocode').agg({
    'total_boys': 'sum',
    'total_girls': 'sum',
    'total_enrollment': 'sum',
    **{f'grade_{i}_total': 'sum' for i in range(1, 13)}
}).reset_index()

school_data = pd.merge(school_data, enroll_agg, on='pseudocode', how='left')

# Calculate derived metrics
school_data['student_teacher_ratio'] = school_data['total_enrollment'] / school_data['total_tch']
school_data['girl_ratio'] = school_data['total_girls'] / school_data['total_enrollment']
school_data['female_teacher_ratio'] = school_data['female'] / school_data['total_tch']

# Create infrastructure score
infra_cols = ['electricity_availability', 'library_availability', 'internet', 
              'approachable_road', 'total_boys_func_toilet', 'total_girls_func_toilet']
school_data['infrastructure_score'] = school_data[infra_cols].fillna(0).sum(axis=1) / len(infra_cols)
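
The derived ratios above divide by `total_tch` and `total_enrollment`, either of which can be zero for some schools. Dividing a positive count by zero in pandas yields `inf`, which can distort later statistics. The sketch below shows one possible guard, assuming that a missing value is preferable to `inf`; `safe_ratio` is a hypothetical helper, not a function used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide element-wise, returning NaN wherever the denominator is zero or missing."""
    return numerator / denominator.replace(0, np.nan)

# Toy school table (hypothetical values).
toy = pd.DataFrame({
    "total_enrollment": [120, 45, 0],
    "total_tch": [4, 0, 2],
})
toy["student_teacher_ratio"] = safe_ratio(toy["total_enrollment"], toy["total_tch"])
print(toy)
```

The school with 45 students and no recorded teachers gets `NaN` instead of `inf`, so it can be handled together with other missing values.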


In [16]:
# Coding for these columns: 1 = Yes, 2 = No.

# 'free_text_books' is 1 if free_text_books_pr == 1 OR free_text_books_up == 1, else 0.
# Schools without a match in df_profile_2 (NaN after the left merge) also get 0.
school_data['free_text_books'] = (
    (school_data['free_text_books_pr'] == 1) | (school_data['free_text_books_up'] == 1)
).astype(int)

# 'free_uniforms' follows the same rule, using free_uniform_pr and free_uniform_up.
school_data['free_uniforms'] = (
    (school_data['free_uniform_pr'] == 1) | (school_data['free_uniform_up'] == 1)
).astype(int)
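
A small worked example of the flag logic above, on a made-up four-row frame, shows how the comparison treats the `2 = No` code and missing values. The toy data is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Toy input: 1 = Yes, 2 = No, NaN = school missing from the second profile file.
toy = pd.DataFrame({
    "free_text_books_pr": [1, 2, np.nan, 2],
    "free_text_books_up": [2, 2, 1, np.nan],
})

# Same expression as in the cell above: a comparison with NaN evaluates to False.
toy["free_text_books"] = (
    (toy["free_text_books_pr"] == 1) | (toy["free_text_books_up"] == 1)
).astype(int)

print(toy)
```

Rows with at least one `1` become `1`; rows containing only `2` or `NaN` become `0`, so "No" and "unknown" are not distinguishable in the resulting flag.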

In [17]:

# Percentage of rows to remove
percentage_to_remove = 0.5  # 50%

# Calculate the number of rows to remove
num_rows_to_remove = int(len(school_data) * percentage_to_remove)

# Generate random indices to drop
drop_indices = np.random.choice(school_data.index, num_rows_to_remove, replace=False)

# Drop the rows
school_data.drop(drop_indices, inplace=True)


print("DataFrame shape after dropping rows:", school_data.shape)

DataFrame shape after dropping rows: (735946, 54)
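
The random drop above uses NumPy's global random state without a seed, so each run keeps a different half of the schools. If reproducibility matters, a seeded generator is one option. The sketch below assumes a hypothetical `SEED` value and helper name `drop_random_fraction`; neither appears in the original notebook.

```python
import numpy as np
import pandas as pd

SEED = 42  # Hypothetical seed; any fixed integer gives a repeatable selection.

def drop_random_fraction(df: pd.DataFrame, fraction: float, seed: int) -> pd.DataFrame:
    """Return a copy of df with int(len(df) * fraction) randomly chosen rows removed."""
    rng = np.random.default_rng(seed)
    n_drop = int(len(df) * fraction)
    drop_idx = rng.choice(df.index.to_numpy(), size=n_drop, replace=False)
    return df.drop(drop_idx)

# Toy school table (hypothetical values).
toy = pd.DataFrame({"pseudocode": np.arange(10), "rural_urban": [1, 2] * 5})
kept_first = drop_random_fraction(toy, 0.5, seed=SEED)
kept_second = drop_random_fraction(toy, 0.5, seed=SEED)

print(kept_first.shape)
print(kept_first.equals(kept_second))
```

Returning a new frame instead of dropping in place also leaves the full school table available for later checks.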


In [41]:
school_data['free_text_books'].value_counts()

free_text_books
1    502518
0    233428
Name: count, dtype: int64

In [42]:
#Save school-level data
school_data.to_csv('/Users/macbookpro/Desktop/POC1/data/processed/comprehensive_school_data.csv', index=False)

In [20]:
school_data['rural_urban'].value_counts()

rural_urban
1    606032
2    129914
Name: count, dtype: int64

## 🧬 Generating Synthetic Student Records

This section details the generation of the synthetic student dataset. We simulate various features categorized into demographics, academic performance, socio-economic factors, and school-related attributes. The `dropout` target variable is also generated, with its likelihood influenced by a combination of these factors.

**Key Features Being Generated (Initial Set):**
* **Demographics:** `student_id`, `age`, `gender`, `caste_category` (later `caste`), `father_education`, `family_income`.
* **Academic:** `grade`, `attendance_rate`, `grade_performance`.
* **School & Access:** `midday_meal` (later `midday_meal_access`), `free_uniforms` (later `free_uniform_access`), `free_textbooks` (later `free_text_books_access`), `internet_access` (later `internet_access_home`), `distance_to_school`.
* **Target:** `dropout`.

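* Dropout assignment starts from a base rate for the student's education level and scales it with risk multipliers for low attendance, poor grades, and adverse socio-economic conditions; the result is capped before the outcome is sampled.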
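
The dropout rule can be illustrated with a short arithmetic sketch. The base rates, multipliers, cap (0.85), and floor (0.0005) below are copied from the configuration and `generate_dropout` logic in this notebook; the helper name `dropout_probability` is an assumption introduced only for this example.

```python
def dropout_probability(base_rate, multipliers, cap=0.85, floor=0.0005):
    """Scale a base rate by a list of risk multipliers, then clamp to [floor, cap]."""
    prob = base_rate
    for multiplier in multipliers:
        prob *= multiplier
    return max(min(prob, cap), floor)

# Secondary level (overall base rate 0.141) with attendance < 50 (x2.5),
# performance < 35 (x2.2), and EWS income (x1.6): 0.141 * 8.8 exceeds the cap.
high_risk = dropout_probability(0.141, [2.5, 2.2, 1.6])

# Primary level (overall base rate 0.019) with attendance between 70 and 85 (x1.2).
low_risk = dropout_probability(0.019, [1.2])

print(high_risk, low_risk)
```

Because the multipliers compound, a student with several risk factors reaches the cap quickly, so the cap value has a large influence on the simulated dropout rate among high-risk students.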

In [27]:

# --- Configuration Parameters ---
NUM_SCHOOLS_TO_PROCESS = 100000
TARGET_TOTAL_STUDENTS = 3000000
AVG_STUDENTS_PER_SCHOOL = int(TARGET_TOTAL_STUDENTS / NUM_SCHOOLS_TO_PROCESS) if NUM_SCHOOLS_TO_PROCESS > 0 else 0
CHUNK_SIZE_SCHOOLS = 5000 
OUTPUT_DIR = '/Users/macbookpro/Desktop/POC1/data/processed/student_data_udise_enhanced_chunks' 

os.makedirs(OUTPUT_DIR, exist_ok=True)

# Family Income Categories (Annual Household Income)
# Based on general understanding
INCOME_CATEGORIES_MAP = { 
    'EWS': '< ₹2 Lakhs',
    'LIG1': '₹2 - ₹3.5 Lakhs',
    'LIG2': '₹3.5 - ₹5 Lakhs',
    'MIG_HIG': '> ₹5 Lakhs' # Combined Middle & High Income Groups
}
INCOME_CATEGORY_KEYS = list(INCOME_CATEGORIES_MAP.keys())


# Base Annual Dropout Rates (Source: UDISE+ 2023-24 Report, Table 6.13, page 119)
# These are national averages. Rates can vary significantly by state and other factors.
BASE_DROPOUT_RATES = {
    'Primary': {'Boys': 0.021, 'Girls': 0.017, 'Total': 0.019},         # Grades 1-5
    'UpperPrimary': {'Boys': 0.052, 'Girls': 0.053, 'Total': 0.052},    # Grades 6-8
    'Secondary': {'Boys': 0.155, 'Girls': 0.126, 'Total': 0.141},      # Grades 9-10
    # UDISE+ Report Table 6.13 does not provide a separate rate for Higher Secondary (11-12).
    # Using Secondary rate as a proxy. Ideal: Find specific data for Higher Secondary.
    'HigherSecondary': {'Boys': 0.155, 'Girls': 0.126, 'Total': 0.141} # Grades 11-12 (Proxy)
}
MAX_DROPOUT_PROB_CAP = 0.85 # Cap final dropout probability

# Rural/Urban Mapping for display
rural_urban_map_display = {1: 'Rural', 2: 'Urban', np.nan: 'Unknown'}


# --- Feature generation functions ---
def generate_age(grade):
    return int(np.round(np.random.normal(6 + (grade - 1), 0.5)))

def generate_gender(school_girl_ratio):
    # school_girl_ratio: proportion of girls in that specific school (from your input school_data)
    # UDISE+ 2023-24, Table 1, pg 27: Overall girls 48.2% of enrolment (Primary to Hr.Sec.)
    valid_girl_ratio = np.clip(school_girl_ratio, 0, 1)
    if pd.isna(valid_girl_ratio): valid_girl_ratio = 0.482 # Default to national average if school specific is NaN
    return 'Female' if np.random.rand() < valid_girl_ratio else 'Male'

def generate_caste():
    # Based on UDISE+ 2023-24, Chart 2, page 12: Overall student distribution
    # General 27%, SC 18%, ST 10%, OBC 45%
    # Current probabilities [0.269, 0.18, 0.099, 0.452] are very close.
    return np.random.choice(['General', 'SC', 'ST', 'OBC'], p=[0.269, 0.180, 0.099, 0.452])

def generate_attendance():
    # Normal(mean=80, std=15) clipped to [30, 100]: most students attend regularly,
    # while the lower tail keeps a minority of poor attenders (minimum 30%).
    return round(np.clip(np.random.normal(80, 15), 30, 100), 1)

def generate_performance():
    # Normal(mean=55, std=18) clipped to [10, 95]: allows very low performers
    # while ruling out scores of exactly 0 or 100.
    return round(np.clip(np.random.normal(55, 18), 10, 95), 1)

def generate_father_education():
    # Illustrative distribution, skewed toward lower education levels on the assumption
    # that this reflects the parent demographic of government schools.
    return np.random.choice(['None', 'Primary', 'Secondary', 'HigherSecondary', 'Graduate', 'PostGraduate'],
                            p=[0.30, 0.30, 0.20, 0.10, 0.07, 0.03])

def generate_midday_meal(is_government_school=True):
    # UDISE+ focuses on government schools. For these, access is policy-driven and high.
    if is_government_school:
        return 1 if np.random.rand() < 0.98 else 0 # 98% access, allowing tiny chance of local issue/non-availment
    return 1 if np.random.rand() < 0.7 else 0 # Lower for other hypothetical school types

def generate_internet(rural_urban_code): 
    # UDISE+ 2023-24 Table 1 (pg 28) states that 53.9% of schools have internet; household access is a separate quantity.
    # Assuming lower household access, especially in rural areas for govt school children.
    prob = 0.50 if rural_urban_code == 2 else 0.15 # Illustrative: 50% Urban, 15% Rural household internet
    return 1 if np.random.rand() < prob else 0

def generate_distance(rural_urban_code):
    if rural_urban_code == 2:  # Urban
        return round(np.clip(np.random.normal(1.8, 1.2), 0.1, 10), 1)
    elif rural_urban_code == 1:  # Rural
        return round(np.clip(np.random.normal(4.5, 3.0), 0.2, 25), 1)
    else: # Default (if rural_urban_code from school_data is NaN or other)
        return round(np.clip(np.random.normal(3.0, 2.5), 0.1, 15), 1)

def generate_family_income(father_education, caste, rural_urban_code):
    """
    Generates family income category.
    !! The UDISE+ 2023-24 report does NOT provide direct student family income statistics. !!
    !! This function is ILLUSTRATIVE and needs to be informed by external sources. !!
    """
    # Base probabilities, heavily skewed to EWS/LIG1 for government school demographic assumption
    # Probabilities for: 'EWS', 'LIG1', 'LIG2', 'MIG_HIG'
    base_dist = {'EWS': 0.55, 'LIG1': 0.25, 'LIG2': 0.15, 'MIG_HIG': 0.05}

    # Adjustments (these are heuristics and need statistical backing)
    if rural_urban_code == 1: # Rural
        base_dist['EWS'] += 0.10; base_dist['LIG1'] += 0.05
        base_dist['LIG2'] -= 0.05; base_dist['MIG_HIG'] -= 0.10
    else: # Urban
        base_dist['EWS'] -= 0.05; base_dist['LIG1'] -= 0.03
        base_dist['LIG2'] += 0.03; base_dist['MIG_HIG'] += 0.05
    
    edu_impact = {
        'None': {'EWS': 0.10, 'LIG1': 0.05, 'MIG_HIG': -0.15},
        'Primary': {'EWS': 0.05, 'LIG1': 0.02, 'MIG_HIG': -0.07},
        'Secondary': {'EWS': -0.02, 'LIG2': 0.02},
        'HigherSecondary': {'EWS': -0.05, 'LIG1': -0.02, 'LIG2': 0.03, 'MIG_HIG': 0.04},
        'Graduate': {'EWS': -0.10, 'LIG1': -0.05, 'LIG2': 0.05, 'MIG_HIG': 0.10},
        'PostGraduate': {'EWS': -0.15, 'LIG1': -0.07, 'LIG2': 0.10, 'MIG_HIG': 0.12}
    }
    if father_education in edu_impact:
        for cat, val_adj in edu_impact[father_education].items():
            if cat in base_dist: base_dist[cat] += val_adj

    caste_impact = { # Based on general societal disadvantages
        'SC': {'EWS': 0.05, 'LIG1': 0.03, 'MIG_HIG': -0.08},
        'ST': {'EWS': 0.08, 'LIG1': 0.05, 'MIG_HIG': -0.13}
    }
    if caste in caste_impact:
        for cat, val_adj in caste_impact[caste].items():
            if cat in base_dist: base_dist[cat] += val_adj

    # Normalize probabilities
    final_probs = np.array([max(0.01, base_dist[cat]) for cat in INCOME_CATEGORY_KEYS]) # Ensure a tiny min probability
    final_probs = final_probs / final_probs.sum()
    
    return np.random.choice(INCOME_CATEGORY_KEYS, p=final_probs)

def get_education_level_category(grade):
    if 1 <= grade <= 5: return 'Primary'
    if 6 <= grade <= 8: return 'UpperPrimary'
    if 9 <= grade <= 10: return 'Secondary'
    if 11 <= grade <= 12: return 'HigherSecondary'
    return 'Primary' # Default for any edge cases

def generate_dropout(grade, attendance, performance, family_income_cat, caste, gender,
                     rural_urban_code, internet_access, distance_to_school):
    """
    Generates dropout status using UDISE+ 2023-24 base rates and illustrative risk factor adjustments.
    !! The risk adjustment logic is ILLUSTRATIVE. !!
    """
    level = get_education_level_category(grade)
    
    # Get the gender-specific base dropout rate (UDISE+ 2023-24 Report, Table 6.13, page 119).
    # generate_gender() returns 'Male'/'Female', so map those labels onto the rate-table keys;
    # any other value falls back to the overall ('Total') rate.
    rate_key = {'Male': 'Boys', 'Female': 'Girls'}.get(gender, 'Total')
    base_prob = BASE_DROPOUT_RATES.get(level, {}).get(rate_key, 0.05)

    # Risk factor accumulation (starts at 1.0, meaning base_prob)
    risk_multiplier = 1.0

    # Attendance (Source: General understanding)
    if attendance < 50: risk_multiplier *= 2.5  # Very significant risk
    elif attendance < 70: risk_multiplier *= 1.8 # Significant risk
    elif attendance < 85: risk_multiplier *= 1.2 # Moderate risk

    # Performance (Source: General understanding )
    if performance < 35: risk_multiplier *= 2.2  # Very high risk (likely failing)
    elif performance < 50: risk_multiplier *= 1.5 # High risk

    # Family Income (Source: PoC; logic needs real data for multipliers)
    # Assumes EWS is highest risk, then LIG1, LIG2. MIG might be baseline or slightly protective.
    if family_income_cat == 'EWS': risk_multiplier *= 1.6
    elif family_income_cat == 'LIG1': risk_multiplier *= 1.3
    elif family_income_cat == 'LIG2': risk_multiplier *= 1.1

    # Caste (Source: PoC; captures systemic disadvantages - small consistent factor)
    if caste in ['SC', 'ST']: risk_multiplier *= 1.15

    # Gender-specific adjustments ON TOP of gendered base rates if further evidence suggests
    # UDISE+ 2023-24 for secondary already shows higher for boys. So this is an additional factor.
    # Factors like early marriage for girls, or boys pulled into labor earlier.
    # This section needs strong statistical backing for specific additional multipliers.
    # For now, the gendered base rates handle the primary difference.
    # if gender == 'Female' and level in ['Secondary', 'HigherSecondary']: risk_multiplier *= 1.1 # Example
    # if gender == 'Male' and level in ['Secondary', 'HigherSecondary']: risk_multiplier *= 1.05 # Example

    # Internet Access (Minor factor, potentially more for older grades)
    if internet_access == 0 and grade >= 9:
        risk_multiplier *= 1.05

    # Distance (Minor factor, more relevant for rural & girls - complex)
    if rural_urban_code == 1: # Rural
        if (level == 'Primary' and distance_to_school > 3) or \
           (level != 'Primary' and distance_to_school > 5): # Stricter for primary
            risk_multiplier *= 1.05

    # Calculate final probability
    final_dropout_prob = base_prob * risk_multiplier
    final_dropout_prob = min(final_dropout_prob, MAX_DROPOUT_PROB_CAP) # Cap probability
    final_dropout_prob = max(final_dropout_prob, 0.0005) # Ensure a tiny minimum chance, not absolute zero

    return 1 if np.random.rand() < final_dropout_prob else 0


# --- Main Script ---
start_time = time.time()


school_data_full = school_data

actual_num_schools_to_process = NUM_SCHOOLS_TO_PROCESS
if len(school_data_full) < NUM_SCHOOLS_TO_PROCESS:
    print(f"Warning: Available schools ({len(school_data_full)}) is < requested ({NUM_SCHOOLS_TO_PROCESS}).")
    actual_num_schools_to_process = len(school_data_full)
    if actual_num_schools_to_process > 0:
        AVG_STUDENTS_PER_SCHOOL = int(TARGET_TOTAL_STUDENTS / actual_num_schools_to_process)
    else: AVG_STUDENTS_PER_SCHOOL = 0
if actual_num_schools_to_process == 0:
    # Raise SystemExit rather than calling exit(), which shuts down the notebook kernel.
    raise SystemExit("No schools available to process from the pre-loaded school_data. Exiting.")

school_data_to_process = school_data_full.head(actual_num_schools_to_process).copy()
print(f"Selected {len(school_data_to_process)} schools for processing.")
print(f"Targeting ~{AVG_STUDENTS_PER_SCHOOL} students per school.")

num_chunks = (len(school_data_to_process) + CHUNK_SIZE_SCHOOLS - 1) // CHUNK_SIZE_SCHOOLS
print(f"Total schools to process: {len(school_data_to_process)}. Chunk size: {CHUNK_SIZE_SCHOOLS} schools/file. Num chunks: {num_chunks}")

overall_student_count = 0
processed_schools_total_count = 0 

for chunk_idx in range(num_chunks):
    chunk_process_start_time = time.time()
    start_idx_school_chunk = chunk_idx * CHUNK_SIZE_SCHOOLS
    end_idx_school_chunk = min((chunk_idx + 1) * CHUNK_SIZE_SCHOOLS, len(school_data_to_process))
    
    current_school_data_chunk = school_data_to_process.iloc[start_idx_school_chunk:end_idx_school_chunk]
    
    print(f"\nProcessing Chunk {chunk_idx + 1}/{num_chunks} (Schools Index {start_idx_school_chunk}-{end_idx_school_chunk -1})...")
    students_for_current_chunk = []
    
    for school_df_idx, school_details_row in current_school_data_chunk.iterrows():
        processed_schools_total_count += 1
        
        # Extract school-level features 
        school_id = school_details_row.get('pseudocode', f'sch_{school_df_idx}')
        state = school_details_row.get('state', 'Unknown State')
        district = school_details_row.get('district', 'Unknown District')
        # rural_urban_code: 1 for Rural, 2 for Urban. Default to random if not present.
        rural_urban_val = school_details_row.get('rural_urban', np.random.choice([1,2]))
        if pd.isna(rural_urban_val) : rural_urban_val = np.random.choice([1,2]) # handle potential NaNs from .get
            
        school_specific_girl_ratio = school_details_row.get('girl_ratio', 0.482) # Default to national average
        if pd.isna(school_specific_girl_ratio) : school_specific_girl_ratio = 0.482

        # For PoC, assuming all schools in your list are government schools
        is_government_school_val = True 


        current_school_student_count = 0
        for _ in range(AVG_STUDENTS_PER_SCHOOL):
            grade = np.random.randint(1, 13) # Random grade from 1 to 12
            
            age = generate_age(grade)
            gender = generate_gender(school_specific_girl_ratio)
            caste = generate_caste() # Uses UDISE+ like proportions
            father_edu = generate_father_education() # Needs research for govt. school parent profile
            
            family_income_key = generate_family_income(father_edu, caste, rural_urban_val)
            
            attendance = generate_attendance() # Needs research on typical distributions
            performance = generate_performance() # Needs research on typical distributions
            
            # Assumes school provides these if it's a govt school
            midday_meal_access = generate_midday_meal(is_government_school=is_government_school_val)
            # The following could come from school_data if available, else default/generate
            free_textbooks = school_details_row.get('free_text_books', 1 if is_government_school_val else 0)
            free_uniforms = school_details_row.get('free_uniforms', 1 if is_government_school_val else 0)

            internet_access_home = generate_internet(rural_urban_val) # Student's home internet
            distance = generate_distance(rural_urban_val)
            
            dropout_status = generate_dropout(grade, attendance, performance, family_income_key, caste, gender,
                                              rural_urban_val, internet_access_home, distance)
            
            students_for_current_chunk.append({
                'school_pseudocode': school_id, 'state': state, 'district': district,
                'rural_urban': rural_urban_map_display.get(rural_urban_val, 'Unknown'),
                'grade': grade, 'age': age, 'gender': gender, 'caste': caste,
                'father_education': father_edu,
                'family_income': INCOME_CATEGORIES_MAP[family_income_key], # Descriptive label
                'attendance_rate': attendance, 'grade_performance': performance,
                'midday_meal_access': midday_meal_access,
                'free_text_books_access': free_textbooks,
                'free_uniform_access': free_uniforms,
                'internet_access_home': internet_access_home,
                'distance_to_school': distance,
                'dropout': dropout_status
            })
            current_school_student_count += 1
        overall_student_count += current_school_student_count

        # Progress Reporting
        if processed_schools_total_count % (CHUNK_SIZE_SCHOOLS // 20 or 1) == 0 or processed_schools_total_count == len(school_data_to_process):
            current_elapsed_time = time.time() - start_time
            avg_processing_time_per_school = current_elapsed_time / processed_schools_total_count if processed_schools_total_count > 0 else 0
            estimated_time_remaining = avg_processing_time_per_school * (len(school_data_to_process) - processed_schools_total_count)
            print(f"  Processed school {processed_schools_total_count}/{len(school_data_to_process)} ({school_id}). Students for this school: {current_school_student_count}.")
            print(f"  Total students so far: {overall_student_count}. Est. time remaining for schools: {estimated_time_remaining:.2f}s")

    if students_for_current_chunk:
        df_student_chunk = pd.DataFrame(students_for_current_chunk)
        chunk_output_filename = os.path.join(OUTPUT_DIR, f'synthetic_students_udise_enhanced_chunk_{chunk_idx + 1}.csv')
        df_student_chunk.to_csv(chunk_output_filename, index=False)
        chunk_duration = time.time() - chunk_process_start_time
        print(f"Chunk {chunk_idx + 1} (Schools {start_idx_school_chunk + 1}-{end_idx_school_chunk}) saved to '{chunk_output_filename}'.")
        print(f"   Contains {len(students_for_current_chunk)} students. Time for this chunk: {chunk_duration:.2f}s.")
    else:
        print(f"Chunk {chunk_idx + 1} had no students to save.")
    
    del students_for_current_chunk # Free memory
    if 'df_student_chunk' in locals(): del df_student_chunk


final_total_time = time.time() - start_time
print(f"\n--- Synthetic Data Generation Complete ---")
print(f"Processed {processed_schools_total_count} schools.")
print(f"Generated a total of {overall_student_count} student records.")
print(f"Data saved in chunks in directory: '{OUTPUT_DIR}'.")
print(f"Total script execution time: {final_total_time:.2f} seconds ({final_total_time/60:.2f} minutes).")


Selected 100000 schools for processing.
Targeting ~30 students per school.
Total schools to process: 100000. Chunk size: 5000 schools/file. Num chunks: 20

Processing Chunk 1/20 (Schools Index 0-4999)...
  Processed school 250/100000 (3940526). Students for this school: 30.
  Total students so far: 7500. Est. time remaining for schools: 901.64s
  Processed school 500/100000 (1486269). Students for this school: 30.
  Total students so far: 15000. Est. time remaining for schools: 886.96s
  Processed school 750/100000 (7459649). Students for this school: 30.
  Total students so far: 22500. Est. time remaining for schools: 884.66s
  Processed school 1000/100000 (3562621). Students for this school: 30.
  Total students so far: 30000. Est. time remaining for schools: 883.25s
  Processed school 1250/100000 (1779878). Students for this school: 30.
  Total students so far: 37500. Est. time remaining for schools: 879.32s
  Processed school 1500/100000 (2818533). Students for this school: 30.
  T
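
The "Est. time remaining" figure in the log is a linear extrapolation: average seconds per processed school, multiplied by the number of schools still to go. A minimal sketch of that arithmetic follows; the helper name `estimate_remaining_seconds` is hypothetical and does not appear in the generation loop.

```python
def estimate_remaining_seconds(elapsed_seconds, processed, total):
    """Linear ETA: average seconds per processed item times items left."""
    if processed <= 0:
        # Nothing processed yet, so there is no rate to extrapolate from.
        return 0.0
    seconds_per_item = elapsed_seconds / processed
    return seconds_per_item * (total - processed)


# Example: 10 seconds spent on the first 250 of 100,000 schools.
eta = estimate_remaining_seconds(10.0, 250, 100_000)
print(f"Estimated time remaining: {eta:.2f}s")
```

Because the estimate assumes a constant per-school cost, it drifts whenever later schools are slower or faster than the ones processed so far.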

## 💾 Saving Raw Synthetic Dataset

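The generation loop writes the synthetically generated and initially cleaned records as one CSV file per chunk of schools (5,000 schools per file in the run above). The next cell concatenates those chunk files into a single raw dataset, `synthetic_student_data_combined.csv`, which is the input for the `EDA.ipynb` notebook, where more in-depth analysis and feature engineering take place.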


In [34]:

print("\n--- Combining chunk files into a single CSV ---")

# Same directory the generation loop wrote its chunk files to.
OUTPUT_DIR = '/Users/macbookpro/Desktop/POC1/data/processed/student_data_udise_enhanced_chunks'
csv_files = glob.glob(os.path.join(OUTPUT_DIR, 'synthetic_students_udise_enhanced_chunk_*.csv'))
df_list = []
if not csv_files:
    print(f"No chunk files found in {OUTPUT_DIR}")
else:
    for file in csv_files:
        print(f"Reading {file}...")
        try:
            df_list.append(pd.read_csv(file))
        except Exception as e:
            print(f"Error reading {file}: {e}")

if df_list:
    combined_df = pd.concat(df_list, ignore_index=True)
    # Write to the parent of the chunks directory, which is where the
    # following cell reads the combined file from.
    combined_filename = os.path.join(os.path.dirname(OUTPUT_DIR), 'synthetic_student_data_combined.csv')
    combined_df.to_csv(combined_filename, index=False)
    print(f"Combined data saved to {combined_filename}")
elif csv_files:
    # Files were found, but none of them could be read.
    print("Could not read any chunk files to combine.")


--- Combining chunk files into a single CSV ---
Reading /Users/macbookpro/Desktop/POC1/data/processed/student_data_udise_enhanced_chunks/synthetic_students_udise_enhanced_chunk_12.csv...
Reading /Users/macbookpro/Desktop/POC1/data/processed/student_data_udise_enhanced_chunks/synthetic_students_udise_enhanced_chunk_13.csv...
Reading /Users/macbookpro/Desktop/POC1/data/processed/student_data_udise_enhanced_chunks/synthetic_students_udise_enhanced_chunk_11.csv...
Reading /Users/macbookpro/Desktop/POC1/data/processed/student_data_udise_enhanced_chunks/synthetic_students_udise_enhanced_chunk_10.csv...
Reading /Users/macbookpro/Desktop/POC1/data/processed/student_data_udise_enhanced_chunks/synthetic_students_udise_enhanced_chunk_14.csv...
Reading /Users/macbookpro/Desktop/POC1/data/processed/student_data_udise_enhanced_chunks/synthetic_students_udise_enhanced_chunk_15.csv...
Reading /Users/macbookpro/Desktop/POC1/data/processed/student_data_udise_enhanced_chunks/synthetic
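
`glob.glob` returns paths in arbitrary order, which is why the log above lists chunk 12 before chunk 10. The rows of the combined file follow that same order. If a reproducible row order matters, one option is to sort the paths by their numeric chunk index before reading them. The sketch below assumes the `..._chunk_N.csv` naming used by the generator; the helper `chunk_index` and the short file list are illustrative only.

```python
import re


def chunk_index(path):
    """Return the integer N from a name ending in '_chunk_N.csv', or -1 if absent."""
    match = re.search(r'_chunk_(\d+)\.csv$', path)
    return int(match.group(1)) if match else -1


# Illustrative file names only; in the notebook this would be `csv_files`.
files = [
    'synthetic_students_udise_enhanced_chunk_12.csv',
    'synthetic_students_udise_enhanced_chunk_2.csv',
    'synthetic_students_udise_enhanced_chunk_10.csv',
]

# A plain sorted() would place chunk_10 before chunk_2 (string comparison),
# so the numeric index is used as the sort key instead.
ordered = sorted(files, key=chunk_index)
for name in ordered:
    print(name)
```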

In [35]:
student_df = pd.read_csv('/Users/macbookpro/Desktop/POC1/data/processed/synthetic_student_data_combined.csv')

In [38]:
student_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000000 entries, 0 to 2999999
Data columns (total 18 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   school_pseudocode       int64  
 1   state                   object 
 2   district                object 
 3   rural_urban             object 
 4   grade                   int64  
 5   age                     int64  
 6   gender                  object 
 7   caste                   object 
 8   father_education        object 
 9   family_income           object 
 10  attendance_rate         float64
 11  grade_performance       float64
 12  midday_meal_access      int64  
 13  free_text_books_access  int64  
 14  free_uniform_access     int64  
 15  internet_access_home    int64  
 16  distance_to_school      float64
 17  dropout                 int64  
dtypes: float64(3), int64(8), object(7)
memory usage: 412.0+ MB
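
The `412.0+ MB` figure matches 3,000,000 rows × 18 columns × 8 bytes. The `+` indicates that the string contents of the seven `object` columns are not counted; `student_df.info(memory_usage='deep')` would report the fuller figure. One common way to shrink such a frame is to store low-cardinality text columns as `category` and to downcast small integers. The sketch below uses a tiny stand-in frame: the column names come from the listing above, but the values are made up, and whether each real column is low-cardinality is an assumption to verify first.

```python
import pandas as pd

# Stand-in data; values are invented for illustration only.
sample = pd.DataFrame({
    'gender': ['M', 'F', 'F', 'M'],
    'rural_urban': ['Rural', 'Urban', 'Rural', 'Rural'],
    'grade': [5, 8, 3, 10],
    'dropout': [0, 1, 0, 0],
})

before = sample.memory_usage(deep=True).sum()

compact = sample.copy()
for col in ['gender', 'rural_urban']:
    # Each distinct label is stored once, plus a small integer code per row.
    compact[col] = compact[col].astype('category')
for col in ['grade', 'dropout']:
    # Choose the smallest integer dtype that can hold the observed values.
    compact[col] = pd.to_numeric(compact[col], downcast='integer')

after = compact.memory_usage(deep=True).sum()
print(compact.dtypes)
print(f"{before} bytes before, {after} bytes after")
```

On a four-row frame the fixed overhead of the categorical dominates, so any saving only becomes meaningful at the scale of the full dataset.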


In [39]:
student_df.shape

(3000000, 18)
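
The row count agrees with the generation log: 100,000 schools at roughly 30 students each gives 3,000,000 records. A small sanity check along those lines, with the constants copied from the log and the shape output above (the variable names are illustrative):

```python
# Figures taken from the generation log and the shape output above.
n_schools = 100_000
students_per_school = 30  # the log says "~30", so treat this as approximate
observed_shape = (3_000_000, 18)

expected_rows = n_schools * students_per_school
rows_match = observed_shape[0] == expected_rows
print(f"expected {expected_rows} rows, observed {observed_shape[0]}: match={rows_match}")
```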