
# EduGap: AI-Powered Digital Readiness Analyzer

**Date:** 2025-07-23

This notebook demonstrates the methodology behind **EduGap**, a machine learning–based tool that predicts digital readiness from demographic features and assesses the effectiveness of interventions based on skill gain.

It supports a companion research paper and repository hosted on GitHub.


## Step 1: Import Required Libraries

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from statsmodels.stats.outliers_influence import variance_inflation_factor


## Step 2: Load Dataset

In [None]:

# Load CSV file
data = pd.read_csv("edugap_data.csv")
data.head()


## Step 3: Preprocess and Clean the Data
This includes handling missing values and standardizing category labels.

In [None]:

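# Drop rows with missing values, then remove records with an unknown education level
data = data.dropna()
data = data[data['education'] != 'unknown'].copy()  # copy so the assignment below does not act on a view
# Standardize gender labels to 'M' / 'F' (lowercasing first, so 'm' and 'f' must be mapped back too)
data['gender'] = data['gender'].str.lower().replace({'male': 'M', 'female': 'F', 'm': 'M', 'f': 'F'})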


### Transform Categorical Data to Numerical Format

In [None]:

data['gender'] = data['gender'].map({'M': 0, 'F': 1})
data['education'] = data['education'].map({'none': 0, 'high school': 1, 'some college': 2, 'associate': 3, 'bachelor': 4, 'master': 5})
data['employment'] = data['employment'].map({'unemployed': 0, 'part-time': 1, 'full-time': 2})
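
One thing worth checking after dictionary-based encoding is whether every category was covered, because `Series.map` returns `NaN` for any value missing from the dictionary. The following is a minimal sketch on made-up values (the toy frame and the `unmapped` variable are illustrative and not part of the EduGap dataset):

```python
import pandas as pd

# Toy data; these category values are illustrative, not taken from edugap_data.csv
toy = pd.DataFrame({"education": ["none", "bachelor", "Master", "phd"]})
education_codes = {"none": 0, "high school": 1, "some college": 2,
                   "associate": 3, "bachelor": 4, "master": 5}

encoded = toy["education"].map(education_codes)

# Series.map yields NaN for values that are not keys of the dictionary
unmapped = sorted(toy.loc[encoded.isna(), "education"].unique())
print("Unmapped categories:", unmapped)
```

Here `"Master"` fails because of its capitalization and `"phd"` because it has no code, so both would silently become missing values in the feature matrix unless they are caught at this point.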


Pre- and post-training skill scores are used to create columns that measure skill gains.

In [None]:
# Creating columns
data['Computer_Gain'] = data['Post_Training_Basic_Computer_Knowledge_Score'] - data['Basic_Computer_Knowledge_Score']
data['Internet_Gain'] = data['Post_Training_Internet_Usage_Score'] - data['Internet_Usage_Score']
data['Mobile_Gain'] = data['Post_Training_Mobile_Literacy_Score'] - data['Mobile_Literacy_Score']
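
To make the sign convention concrete, here is a small sketch of the same subtraction on three made-up learners (the scores are invented for illustration only):

```python
import pandas as pd

# Invented scores for three learners
toy = pd.DataFrame({
    "Basic_Computer_Knowledge_Score": [10, 25, 40],
    "Post_Training_Basic_Computer_Knowledge_Score": [30, 25, 35],
})

# Gain = post-training score minus baseline score
toy["Computer_Gain"] = (
    toy["Post_Training_Basic_Computer_Knowledge_Score"]
    - toy["Basic_Computer_Knowledge_Score"]
)
print(toy["Computer_Gain"].tolist())
```

A positive gain means the learner improved, zero means no change, and a negative gain means the post-training score was lower than the baseline.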

## Step 4: Generate Target Labels from Raw Scores

### Labeling for identifying underserved groups.

To prepare the data for analysis, the scores are converted into binary labels: "Below Average" (class 0) and "Above Average" (class 1). Note: for simplicity, the terms "above average" and "below average" are used even though the split point is the median, not the mean. The median is used to improve equity and to reduce the effect of skew in the data and/or results.

In [None]:

# Categorize and Labels (Access)
data['Computer_Skill_Label'] = data['Basic_Computer_Knowledge_Score'].apply(lambda x: 0 if x <= 25 else 1)
data['Internet_Usage_Label'] = data['Internet_Usage_Score'].apply(lambda x: 0 if x <= 25 else 1)
data['Mobile_Literacy_Label'] = data['Mobile_Literacy_Score'].apply(lambda x: 0 if x <= 26 else 1)

# Categorize and label (Skills): each gain column is split at its own median
data['Computer_Gain_Label'] = (data['Computer_Gain'] > data['Computer_Gain'].median()).astype(int)
data['Internet_Gain_Label'] = (data['Internet_Gain'] > data['Internet_Gain'].median()).astype(int)
data['Mobile_Gain_Label'] = (data['Mobile_Gain'] > data['Mobile_Gain'].median()).astype(int)
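
The difference between a median split and a mean split can be seen on a small skewed sample. This is only a sketch with invented scores; `scores`, `threshold`, and `labels` are illustrative names:

```python
import pandas as pd

# Invented, right-skewed scores: one very high value pulls the mean upward
scores = pd.Series([10, 20, 25, 25, 30, 90], name="score")

threshold = scores.median()   # middle of the sorted values
mean_value = scores.mean()    # sensitive to the outlier at 90

# Class 0 = at or below the median, class 1 = above it
labels = (scores > threshold).astype(int)

print("median:", threshold, "mean:", round(mean_value, 2))
print(labels.tolist())
```

With the median as the cut, the learner who scored 30 lands in class 1; a mean-based cut would have placed that learner in class 0 purely because of the single outlier. Note also that ties at the median fall into class 0, so the two classes are not guaranteed to be exactly equal in size.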



### Labeling for model performance improvements

Quantile-based labeling was used in a separate model to improve its performance and its ability to identify underserved or high-risk groups (class 0). Median-based labeling is better suited to equity-focused analysis and visualization because it is interpretable and robust to outliers. Quantile-based labeling lets the model learn patterns more effectively and improves its recognition of class 0 (underserved) groups.

In [None]:
# Categorize and label (Access): each quantile threshold is computed once per column
data['Computer_Skill_Label'] = (data['Basic_Computer_Knowledge_Score'] > data['Basic_Computer_Knowledge_Score'].quantile(0.63)).astype(int)
data['Internet_Usage_Label'] = (data['Internet_Usage_Score'] > data['Internet_Usage_Score'].quantile(0.56)).astype(int)
data['Mobile_Literacy_Label'] = (data['Mobile_Literacy_Score'] > data['Mobile_Literacy_Score'].quantile(0.69)).astype(int)

# Categorize and label (Skills): values and thresholds both come from the gain columns
data['Computer_Gain_Label'] = (data['Computer_Gain'] > data['Computer_Gain'].quantile(0.56)).astype(int)
data['Internet_Gain_Label'] = (data['Internet_Gain'] > data['Internet_Gain'].quantile(0.55)).astype(int)
data['Mobile_Gain_Label'] = (data['Mobile_Gain'] > data['Mobile_Gain'].quantile(0.6)).astype(int)
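
As a rough illustration of how the chosen quantile controls the size of class 0, the sketch below labels ten invented scores at the 0.6 quantile. The threshold is computed once and then compared against the whole column:

```python
import pandas as pd

# Ten invented scores, 1 through 10
scores = pd.Series(range(1, 11))

# Compute the cut point once, then compare the whole column against it
threshold = scores.quantile(0.6)
labels = (scores > threshold).astype(int)

class0_share = (labels == 0).mean()
print("share in class 0:", class0_share)
```

Raising the quantile above 0.5 enlarges class 0 relative to a median split, which gives the model more examples of the underserved group to learn from, at the cost of a label that no longer means "below the middle value".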

## Step 5: Create X and Y Variables and Check for Multicollinearity using VIF

Multicollinearity occurs when two or more explanatory variables are highly linearly related, which can lead to misleading conclusions. The variance inflation factor (VIF) quantifies it for each feature; values above five are generally taken to indicate high multicollinearity.

In [None]:

X = data[['gender', 'age', 'education', 'employment']]
y = data[['access_label', 'skills_label']].apply(lambda col: col.astype('category').cat.codes)

features = data[["Education_Level", "Household_Income", "Employment_Status", "Age_Group"]]
x_df = features.astype(float)  # VIF needs a purely numeric matrix
vif_data = pd.DataFrame()
vif_data["Feature"] = x_df.columns
vif_data["VIF"] = [variance_inflation_factor(x_df.values, i) for i in range(x_df.shape[1])]

print(vif_data)
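
For intuition about what the VIF measures, the following sketch computes it directly from its definition, VIF = 1 / (1 - R²), where R² comes from regressing one feature on all the others. It uses synthetic data and a hypothetical helper named `vif`; it is not the statsmodels implementation, and because this version always includes an intercept, its values can differ from `variance_inflation_factor` when that function is given a matrix without a constant column.

```python
import numpy as np

def vif(X):
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r2 = 1.0 - residuals.var() / target.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic features: c is almost a copy of a, b is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

values = vif(np.column_stack([a, b, c]))
print(values.round(1))
```

The two nearly duplicate columns receive VIFs far above five, while the independent column stays close to one, which is the pattern the threshold above is meant to catch.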

## Step 6: Train Random Forest Multi-Output Classifier

Several models were tested, and their accuracy scores and classification metrics were tracked to find the best fit for the task. The selected model is a `RandomForestClassifier` wrapped in scikit-learn's `MultiOutputClassifier`. Different hyperparameter values were also tested to tune its performance.

### Model Configuration
- `n_estimators=200` to increase the model's robustness
- `max_depth=15` to limit how deep trees grow, which helps prevent overfitting
- `max_features='sqrt'` to improve generalization by adding randomness to each split
- `min_samples_leaf=4` to prevent the model from creating tiny, noisy leaves (overfitting)
- `min_samples_split=5` to make sure splits are meaningful
- `class_weight='balanced'` to weight classes inversely to their frequency so the minority class is not ignored
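
To show what the balanced class weighting amounts to, the sketch below reproduces the formula given in the scikit-learn documentation, `n_samples / (n_classes * np.bincount(y))`, on an invented 80/20 label vector:

```python
import numpy as np

# Invented labels: 80 samples of class 0 and 20 of class 1
y = np.array([0] * 80 + [1] * 20)

counts = np.bincount(y)                      # samples per class
weights = len(y) / (len(counts) * counts)    # balanced weight per class

print(dict(enumerate(weights)))
```

The rarer class gets a weight four times larger than the common one, so errors on it cost proportionally more during training.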

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = MultiOutputClassifier(RandomForestClassifier(
    n_estimators=200,
    max_depth=15,
    min_samples_leaf=4,
    min_samples_split=5,
    class_weight='balanced',
    max_features='sqrt',
    bootstrap=True,
    random_state=42,
    n_jobs=-1
))
model.fit(X_train, y_train)

## Step 7: Evaluate the Model’s Performance

In [None]:

y_pred = model.predict(X_test)
print("Access Label:", classification_report(y_test.iloc[:, 0], y_pred[:, 0]), sep="\n")
print("Skills Label:", classification_report(y_test.iloc[:, 1], y_pred[:, 1]), sep="\n")
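
Because the goal is to recognize class 0 (underserved) groups, it can help to look at a confusion matrix and the class 0 recall for each output separately. The sketch below does this on small invented arrays; the output names in `names` and the `results` dictionary are hypothetical and stand in for the two label columns:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented true and predicted labels for two outputs (columns)
y_true = np.array([[0, 1], [0, 0], [1, 1], [1, 0], [0, 1]])
y_hat = np.array([[0, 1], [1, 0], [1, 1], [0, 0], [0, 0]])
names = ["access", "skills"]  # hypothetical output names

results = {}
for i, name in enumerate(names):
    # Rows are true classes, columns are predicted classes, in the order [0, 1]
    cm = confusion_matrix(y_true[:, i], y_hat[:, i], labels=[0, 1])
    acc = accuracy_score(y_true[:, i], y_hat[:, i])
    class0_recall = cm[0, 0] / cm[0].sum()  # share of true class 0 that was found
    results[name] = {"cm": cm, "accuracy": acc, "class0_recall": class0_recall}
    print(name, cm.tolist(), acc, round(class0_recall, 2))
```

Accuracy alone can hide weak detection of class 0, so the recall in the first row of each matrix is the figure that relates most directly to identifying underserved groups.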


## Step 8: Calculations and Visualizations (Graphs)

To identify underserved groups and guide future digital divide interventions, two functions were created: `averages_calculate` computes the share of each group that falls below "average" (the median, or middle value), and `bar_graph` plots the `top_n` groups per grouping column with the largest share below "average".

In [None]:
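# Plotting and Calculating
def averages_calculate(df, group_col, label_col, title_prefix=""):
    """Print the share of each group in class 0 (below) and class 1 (above)."""
    grouped = df.groupby(group_col)[label_col].value_counts(normalize=True).unstack().fillna(0)
    # Fix the column order so the names below always match classes 0 and 1
    grouped = grouped.reindex(columns=[0, 1], fill_value=0)
    grouped.columns = ['Below Average', 'Above Average']
    if title_prefix:
        print(title_prefix)
    print(grouped)


def bar_graph(df, group_cols, label_col, top_n, title_prefix):
    """Plot the top_n groups per grouping column with the largest share in class 0."""
    worst_groups = []
    proportions = []
    for group in group_cols:
        grouped = df.groupby(group)[label_col].value_counts(normalize=True).unstack().fillna(0)
        grouped = grouped.reindex(columns=[0, 1], fill_value=0)
        grouped.columns = ['Below Average', 'Above Average']
        # Keep the top_n categories of this column with the highest below-average share
        top_worst = grouped['Below Average'].sort_values(ascending=False).head(top_n)
        for idx, val in top_worst.items():
            worst_groups.append(f"{group}: {idx}")
            proportions.append(val)
    plt.figure()
    plt.bar(worst_groups, proportions, color='firebrick')
    plt.ylabel("Proportion Below Average")
    plt.title(f"{title_prefix} Most Below Average Groups")
    plt.xticks(rotation=25)
    plt.tight_layout()
    plt.show()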
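
The grouping step at the core of both functions can be tried on a tiny invented table. This is a sketch only; the `toy` frame, its `label` column, and the `shares` and `worst` variables are illustrative:

```python
import pandas as pd

# Invented records: one grouping column and one binary label
toy = pd.DataFrame({
    "employment": ["unemployed", "unemployed", "unemployed",
                   "full-time", "full-time", "part-time"],
    "label": [0, 0, 1, 1, 1, 0],
})

# Share of each class within each group, with a fixed column order
shares = (
    toy.groupby("employment")["label"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
    .reindex(columns=[0, 1], fill_value=0)
)
shares.columns = ["Below Average", "Above Average"]

# Rank groups by their below-average share, largest first
worst = shares["Below Average"].sort_values(ascending=False)
print(worst)
```

In this toy table the part-time group ranks first with a share of 1.0, but it contains a single record. Proportions from very small groups can dominate such a ranking, so it may be worth reporting group sizes alongside the shares.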
