Preprocess Data:

Standardize numerical features (e.g., Academic_Performance, k6_overall).

Encode categorical features (e.g., SchoolActivityNet as one-hot encoded columns).

In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Load data
df = pd.read_csv("synthetic_student_data.csv")

# Preprocess categorical data (activities)
activities_split = df["SchoolActivityNet"].str.get_dummies(", ")
df = pd.concat([df, activities_split], axis=1)

# Drop non-feature columns (StudentID, network columns)
feature_cols = [
    'Academic_Performance', 'isolated', 'WomenDifferent', 'language',
    'pwi_wellbeing', 'GrowthMindset', 'k6_overall', 'Manbox5_overall',
    'Masculinity_contrained', 'School_support_engage6', 'School_support_engage'
] + activities_split.columns.tolist()

X = df[feature_cols]

# Standardize numerical features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Reduce dimensionality (optional but recommended for high-dimensional data)
pca = PCA(n_components=0.95)  # Keep 95% variance
X_pca = pca.fit_transform(X_scaled)

Cluster Students:

Use K-means to group students into n_clusters = NUM_STUDENTS // 30 (e.g., 34 clusters for 1,000 students).

Assign each student to a cluster (class).

In [2]:
NUM_STUDENTS = len(df)
n_clusters = (NUM_STUDENTS // 30) + (1 if NUM_STUDENTS % 30 != 0 else 0)

kmeans = KMeans(n_clusters=n_clusters, random_state=42)
df["Class"] = kmeans.fit_predict(X_pca)

Balance Class Sizes:

Ensure each cluster has ~30 students. Adjust using cluster_size checks.

In [3]:
# Check cluster sizes
class_sizes = df["Class"].value_counts().sort_index()

# If clusters are uneven, redistribute students
# (Example: Split oversized clusters)
for class_id in class_sizes[class_sizes > 30].index:
    excess = class_sizes[class_id] - 30
    students_to_move = df[df["Class"] == class_id].sample(excess, random_state=42)
    df.loc[students_to_move.index, "Class"] = df["Class"].max() + 1

Validate Diversity:

Check key metrics (e.g., academic performance, wellbeing) across classes to ensure balance.

In [4]:
# Example: Compare means across classes
class_stats = df.groupby("Class")[['Academic_Performance', 'k6_overall', 'GrowthMindset']].mean()
print(class_stats)

       Academic_Performance  k6_overall  GrowthMindset
Class                                                 
0                 78.714286   17.380952       3.857143
1                 73.233333   13.233333       4.083333
2                 72.900000   14.733333       3.733333
3                 85.478261   16.434783       4.608696
4                 80.133333   18.166667       4.816667
...                     ...         ...            ...
465               73.222222   19.888889       4.333333
466               60.666667   19.666667       3.500000
467               79.461538   19.923077       5.230769
468               63.500000   19.500000       5.000000
469               79.428571   18.142857       4.714286

[470 rows x 3 columns]
