# Student Outcome Prediction

## Preliminary:

I am a young data scientist at the **Georgia Education Research Institute (GERI)**. For years, the institute has been working on several important projects, one of the most significant being the prediction of students' academic paths. The goal of my project is to identify students who need assistance early and to reduce their risk of dropping out.

Years ago, three different legendary research groups at the institute created unique but narrowly specialized AI-based models. Each model answered a different, specific question:

* **"Alpha" Model:** Exceptionally good at distinguishing whether a student will complete university successfully (`Graduate`) or leave their studies (`Dropout`).
* **"Beta" Model:** Its strength lies in distinguishing whether a student will complete their studies (`Graduate`) or remain an active student (`Enrolled`) in the following period.
* **"Gamma" Model:** This model focuses on predicting whether a student will leave their studies (`Dropout`) or continue with active status (`Enrolled`).

These models were revolutionary in their time, but unfortunately, the research groups disbanded, detailed documentation was lost, and the original code cannot be fully restored. The institute now needs me to create a unified, holistic system capable of accurately predicting one of the three possible statuses (`Graduate`, `Dropout`, `Enrolled`) for any student.

## My Mission:

I have been handed the working versions of these three "legacy" models and a portion of the training data. My task is to develop a mechanism that relies on the predictions of the "Alpha," "Beta," and "Gamma" models and, based on them, makes a final, unified decision regarding each student's future status. I must focus on how the "wisdom" of these three specialized modules can be combined and utilized in a single system.

## Formal Task:

I will create a final classification system that takes student data and uses the outputs of the three pre-existing binary classifiers (Alpha: Graduate/Dropout, Beta: Graduate/Enrolled, Gamma: Dropout/Enrolled) to classify the student's final status into three categories: **Graduate**, **Dropout**, or **Enrolled**.

## Important Constraint:

My final decision mechanism **must** use the predictions of these **three given models** as part of the decision-making process. It is **prohibited** for me to create a single, final classifier that uses **only the raw data** and entirely **ignores** the information (predictions/probabilities) provided by these three specialized modules. My task is specifically the **smart integration** of the existing modules' predictions and **not their total replacement** with a new, independent model based solely on the original data. I need to think about how the opinions of these three different "experts" can be weighed to reach a final conclusion.

## Provided Materials and Data:

I have been provided with the following files:

* `train_data.csv`: This file contains training data about students with the following columns: `Tuition fees up to date`, `Age at enrollment`, `Mother's qualification`, `Curricular units 1st sem (enrolled)`, `Curricular units 1st sem (without evaluations)`, and `Curricular units 2nd sem (grade)`. It also includes the target variable (`target`), which denotes the actual final status of the student. I can use this data to develop and test my integration strategy.
* `test_data.csv`: This file contains data for students (the same columns, excluding `target`) for whom I must make a final prediction.
* `models.pickle`: This file contains a Python dictionary storing the three pre-trained, specialized Logistic Regression models. The dictionary **key** is a tuple of the class pair the model distinguishes (e.g., `('Graduate', 'Dropout')`). The dictionary **value** is the corresponding scikit-learn `LogisticRegression` model. I must use these models to get predictions for their respective class pairs.

## Deliverables:

* **This Jupyter Notebook**, containing my full solution code describing the integration mechanism I developed.
* `predictions.json`: A JSON format file containing my system's predictions for each student in `test_data.csv`.
    * Each **key** must be the corresponding row index from the `test_data.csv` file (as a string, e.g., `"0"`, `"1"`, `"2"`, ...).
    * Each **value** must be the class predicted by my system for that student: `Graduate`, `Dropout`, or `Enrolled`.

## Setup

In [19]:
!pip install pandas==2.3.3 scikit-learn==1.8.0
import pandas as pd
import pickle






In [20]:
# Suppress InconsistentVersionWarning
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module='sklearn.base')

with open("models.pickle", 'rb') as f:
    models = pickle.load(f)

In [21]:
models

{('Graduate',
  'Dropout'): LogisticRegression(l1_ratio=None, penalty='l2', random_state=42),
 ('Graduate',
  'Enrolled'): LogisticRegression(l1_ratio=None, penalty='l2', random_state=42),
 ('Dropout',
  'Enrolled'): LogisticRegression(l1_ratio=None, penalty='l2', random_state=42)}

In [22]:
training_data = pd.read_csv("train_data.csv")
training_data

Unnamed: 0,Tuition fees up to date,Age at enrollment,Mother's qualification,Curricular units 1st sem (enrolled),Curricular units 1st sem (without evaluations),Curricular units 2nd sem (grade),target
0,1,19,38,6,0,14.857143,Graduate
1,1,18,1,6,0,11.200000,Enrolled
2,1,20,3,0,0,0.000000,Graduate
3,1,37,37,9,0,11.571429,Graduate
4,1,27,37,5,2,12.000000,Graduate
...,...,...,...,...,...,...,...
1411,1,19,1,5,0,12.400000,Graduate
1412,1,21,37,5,0,11.000000,Enrolled
1413,1,19,37,0,0,0.000000,Dropout
1414,1,28,38,14,0,12.000000,Graduate


In [23]:
test_x = pd.read_csv("test_data.csv")
test_x

Unnamed: 0,Tuition fees up to date,Age at enrollment,Mother's qualification,Curricular units 1st sem (enrolled),Curricular units 1st sem (without evaluations),Curricular units 2nd sem (grade)
0,1,18,19,5,0,12.000000
1,1,20,34,8,1,14.901429
2,1,18,19,8,0,13.814286
3,1,21,1,6,0,12.200000
4,1,18,3,6,0,0.000000
...,...,...,...,...,...,...
349,1,18,38,6,0,12.000000
350,1,18,1,6,0,13.666667
351,1,18,38,7,0,14.650000
352,0,21,1,5,0,11.428571


## Version 1: Majority Vote

In [24]:
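def predict_with_pairwise_models(X_df, models, classes):
    """Majority vote: each pairwise model casts one vote per student."""
    votes = pd.DataFrame(0, index=X_df.index, columns=classes)

    for (class_i, class_j), model in models.items():
        predictions = model.predict(X_df)

        # Label 0 is treated as the first class of the key, label 1 as the second.
        for idx, pred in zip(X_df.index, predictions):
            if pred == 0:
                votes.loc[idx, class_i] += 1
            else:
                votes.loc[idx, class_j] += 1

    # On a tie, idxmax returns the first of the tied classes in `classes` order.
    predicted_classes = votes.idxmax(axis=1)

    return predicted_classes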
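
To see how the vote count behaves, including the three-way tie case, here is a minimal sketch on hand-made predictions. It assumes, as the function above does, that each pairwise model outputs 0 for the first class in its key and 1 for the second. The `toy_predictions` dictionary is hypothetical and only stands in for the output of `model.predict`.

```python
import pandas as pd

classes = ["Graduate", "Dropout", "Enrolled"]

# Hypothetical hard predictions for 3 students from each pairwise model
# (0 = first class in the key, 1 = second class); stands in for model.predict.
toy_predictions = {
    ("Graduate", "Dropout"): [0, 1, 0],
    ("Graduate", "Enrolled"): [0, 1, 1],
    ("Dropout", "Enrolled"): [1, 0, 0],
}

votes = pd.DataFrame(0, index=range(3), columns=classes)
for (class_i, class_j), preds in toy_predictions.items():
    for idx, pred in enumerate(preds):
        # Each model adds one vote to the class it picked for this student.
        votes.loc[idx, class_i if pred == 0 else class_j] += 1

# idxmax returns the first column holding the maximum, so ties fall to
# whichever tied class appears first in `classes`.
toy_result = votes.idxmax(axis=1)
print(votes)
print(toy_result.tolist())
```

The third toy student receives one vote per class, so the result for that row depends only on the order of `classes`. With three pairwise models this circular tie is the only possible tie, and it is worth keeping in mind when interpreting the voting score.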

## Version 2: Let's make confidence count

In [25]:
def predict_with_confidence_scores(X_df, models, classes):
    """Maximum confidence: score each class by the highest probability any model gives it."""
    confidence_scores = pd.DataFrame(0.0, index=X_df.index, columns=classes)

    for (class_i, class_j), model in models.items():
        proba = model.predict_proba(X_df)

        # Column 0 is treated as P(class_i), column 1 as P(class_j).
        for idx, (prob_i, prob_j) in zip(X_df.index, proba):
            if prob_i > confidence_scores.loc[idx, class_i]:
                confidence_scores.loc[idx, class_i] = prob_i

            if prob_j > confidence_scores.loc[idx, class_j]:
                confidence_scores.loc[idx, class_j] = prob_j

    predicted_classes = confidence_scores.idxmax(axis=1)

    return predicted_classes
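
The same maximum-confidence rule can also be written without the per-row loop. The following is a minimal sketch on toy data, assuming that column 0 of each probability array belongs to the first class of the key and column 1 to the second. The `toy_proba` arrays are hypothetical and only stand in for `model.predict_proba`.

```python
import numpy as np
import pandas as pd

classes = ["Graduate", "Dropout", "Enrolled"]

# Hypothetical predict_proba outputs for 2 students; column 0 is assumed to be
# the first class of the key and column 1 the second.
toy_proba = {
    ("Graduate", "Dropout"): np.array([[0.9, 0.1], [0.3, 0.7]]),
    ("Graduate", "Enrolled"): np.array([[0.6, 0.4], [0.2, 0.8]]),
    ("Dropout", "Enrolled"): np.array([[0.5, 0.5], [0.45, 0.55]]),
}

scores = pd.DataFrame(0.0, index=range(2), columns=classes)
for (class_i, class_j), proba in toy_proba.items():
    # Keep, per class, the highest probability any model assigned to it.
    scores[class_i] = np.maximum(scores[class_i], proba[:, 0])
    scores[class_j] = np.maximum(scores[class_j], proba[:, 1])

toy_confident = scores.idxmax(axis=1)
print(scores)
print(toy_confident.tolist())
```

Note that this rule compares probabilities that come from different two-class problems, so the winning score is not a three-class probability.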

## Version 3: Even More Learners

In [26]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


In [27]:
# Hold out 20% of the training data, stratified by the target column (last column).
X_train, X_test, y_train, y_test = train_test_split(
    training_data.iloc[:, :-1],
    training_data.iloc[:, -1],
    test_size=0.2,
    random_state=42,
    stratify=training_data.iloc[:, -1],
)

In [28]:
def train_stacking(meta_features_train, y_train, models, meta_classifier, random_state=42):
    """Fit the meta-classifier on the stacked meta-features.

    `models` and `random_state` are not used here; they are kept so existing calls still work.
    """
    meta_classifier.fit(meta_features_train, y_train)
    return meta_classifier


def generate_meta_features(X_df, models):
    """Build one pair of probability columns per pairwise base model."""
    meta_features = pd.DataFrame(index=X_df.index)

    for (class_i, class_j), model in models.items():
        proba = model.predict_proba(X_df)

        # Column 0 is treated as P(class_i), column 1 as P(class_j).
        meta_features[f"model_{class_i}_vs_{class_j}_proba_{class_i}"] = proba[:, 0]
        meta_features[f"model_{class_i}_vs_{class_j}_proba_{class_j}"] = proba[:, 1]

    return meta_features


def predict_with_stacking_meta_learner(X_df, models, meta_classifier):
    """Predict the final class from the base models' probabilities."""
    meta_features = generate_meta_features(X_df, models)
    predictions = pd.Series(meta_classifier.predict(meta_features), index=X_df.index)
    return predictions

In [29]:
# Candidate meta-learners, in order: 18 KNN (k = 2..19), 15 decision trees
# (max_depth = 5..19), and 1 logistic regression, 34 in total.
meta_learners = []
meta_learners.extend([(f"KNN {k}", KNeighborsClassifier(n_neighbors=k, weights='distance')) for k in range(2,20)])
meta_learners.extend([(f"Decision Tree {d}", DecisionTreeClassifier(max_depth=d, random_state=42)) for d in range(5,20)])
meta_learners.append(('Logistic Regression', LogisticRegression(random_state=42)))
      
meta_features_train = generate_meta_features(X_train, models)
meta_features_train

Unnamed: 0,model_Graduate_vs_Dropout_proba_Graduate,model_Graduate_vs_Dropout_proba_Dropout,model_Graduate_vs_Enrolled_proba_Graduate,model_Graduate_vs_Enrolled_proba_Enrolled,model_Dropout_vs_Enrolled_proba_Dropout,model_Dropout_vs_Enrolled_proba_Enrolled
125,0.097511,0.902489,0.356773,0.643227,0.889178,0.110822
36,0.183277,0.816723,0.330841,0.669159,0.745062,0.254938
286,0.728539,0.271461,0.683545,0.316455,0.433313,0.566687
1198,0.821107,0.178893,0.714840,0.285160,0.326365,0.673635
848,0.066629,0.933371,0.318779,0.681221,0.925953,0.074047
...,...,...,...,...,...,...
830,0.936839,0.063161,0.825308,0.174692,0.193003,0.806997
366,0.893218,0.106782,0.785478,0.214522,0.279091,0.720909
190,0.894965,0.105035,0.789449,0.210551,0.261974,0.738026
156,0.619164,0.380836,0.800983,0.199017,0.729368,0.270632


In [None]:
learners = []
for name, meta_classifier in meta_learners:
    mdl = train_stacking(meta_features_train, y_train, models, meta_classifier)
    learners.append(mdl)
    print(f"{name} result:")
    # Score each meta-learner on the held-out 20% split with macro-averaged F1.
    test_predictions = predict_with_stacking_meta_learner(X_test, models, mdl)
    macro_f1 = f1_score(y_test, test_predictions, average='macro')
    print(macro_f1)

KNN 2 result:
0.522452636968766
KNN 3 result:
0.5376778569725232
KNN 4 result:
0.5316994633273703
KNN 5 result:
0.5480503271942252
KNN 6 result:
0.5461003460848001
KNN 7 result:
0.5618309675058165
KNN 8 result:
0.5428000313731037
KNN 9 result:
0.558804748669319
KNN 10 result:
0.5602201842299563
KNN 11 result:
0.569878104909952
KNN 12 result:
0.5629726110132026
KNN 13 result:
0.5558909853249476
KNN 14 result:
0.5646521827825377
KNN 15 result:
0.5652657084039278
KNN 16 result:
0.5624832246905825
KNN 17 result:
0.5676063067367415
KNN 18 result:
0.5767691050779286
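
Rather than picking a meta-learner by its position in `learners`, the held-out scores can be stored by name and the best entry chosen programmatically. The following is a minimal sketch of that idea, not the notebook's actual selection procedure. It runs on synthetic data from `make_classification` as a stand-in for the stacked meta-features, and the `candidates` and `val_scores` dictionaries are hypothetical names introduced for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the stacked meta-features (assumption).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=42
)
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# A small, hypothetical subset of the candidate meta-learners.
candidates = {
    "KNN 5": KNeighborsClassifier(n_neighbors=5, weights="distance"),
    "Decision Tree 5": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Logistic Regression": LogisticRegression(random_state=42, max_iter=1000),
}

# Fit each candidate and record its macro F1 on the validation split.
val_scores = {}
for name, clf in candidates.items():
    clf.fit(X_fit, y_fit)
    val_scores[name] = f1_score(y_val, clf.predict(X_val), average="macro")

# Select the candidate with the highest validation score by name.
best_name = max(val_scores, key=val_scores.get)
print(best_name, round(val_scores[best_name], 3))
```

Keeping the scores in a dictionary keyed by name avoids relying on the order in which the candidates were appended.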


## Check Scores

In [13]:
test_y = pd.read_csv("test_data_ans.csv")
test_y

Unnamed: 0,Target
0,Enrolled
1,Dropout
2,Graduate
3,Dropout
4,Dropout
...,...
349,Graduate
350,Graduate
351,Graduate
352,Dropout


In [14]:
classes = ['Graduate','Dropout','Enrolled']

In [15]:
test_predictions = predict_with_pairwise_models(test_x, models, classes)
macro_f1 = f1_score(test_y, test_predictions, average='macro')
macro_f1

0.5155226723801393

In [16]:
test_predictions = predict_with_confidence_scores(test_x, models, classes)
macro_f1 = f1_score(test_y, test_predictions, average='macro')
macro_f1

0.5049773952356112

In [17]:
# learners[-17] is index 17 of the 34 fitted meta-learners, i.e. the "KNN 19" model.
test_predictions = predict_with_stacking_meta_learner(test_x, models, learners[-17])
macro_f1 = f1_score(test_y, test_predictions, average='macro')
macro_f1

0.6415949797255084

In this project, we built a system that predicts whether a student will `Graduate`, `Dropout`, or stay `Enrolled` by combining the predictions of three pre-trained binary logistic regression models: Alpha (Graduate vs. Dropout), Beta (Graduate vs. Enrolled), and Gamma (Dropout vs. Enrolled). Training a new model from scratch on the raw data alone was not an option, because the task requires integrating the outputs of these existing models.

We experimented with three approaches:

* **Majority vote:** each model votes for one class in its pair, and the class with the most votes wins.
* **Maximum confidence:** each class receives the highest probability any model assigned to it, and the class with the top score wins.
* **Stacking:** we generated meta-features from the probability outputs of the three base models (for example `model_Graduate_vs_Dropout_proba_Graduate`), split the training data 80/20, trained meta-classifiers (K-Nearest Neighbors with different `k` values, Decision Trees with varying depths, and Logistic Regression) on these meta-features to learn how best to combine them, and then used the selected meta-classifier to make the final predictions.

We evaluated everything with the macro-averaged F1 score (`average='macro'`) because the classes are imbalanced: there are more Graduates than Dropouts or Enrolled students. Macro averaging treats each class equally by computing the F1 score for each class separately and then averaging them:

$$\text{Macro } F1 = \frac{F1_{\text{Graduate}} + F1_{\text{Dropout}} + F1_{\text{Enrolled}}}{3}$$

This avoids bias toward the majority class. Overall, the stacking method worked best, achieving a macro F1 of about 0.64 on the test data, compared with about 0.52 for majority voting and about 0.50 for maximum confidence. This suggests that learning how to weigh the base models' probabilities outperforms simple voting or maximum confidence.
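
The macro F1 formula can be checked by hand on a small example. The following sketch uses hypothetical labels for eight students (not the notebook's data) and compares the manual average of the per-class scores with scikit-learn's `average='macro'` result.

```python
from sklearn.metrics import f1_score

labels = ["Graduate", "Dropout", "Enrolled"]

# Hypothetical true and predicted labels for 8 students.
y_true = ["Graduate", "Graduate", "Graduate", "Graduate",
          "Dropout", "Dropout", "Enrolled", "Enrolled"]
y_pred = ["Graduate", "Graduate", "Graduate", "Enrolled",
          "Dropout", "Graduate", "Enrolled", "Dropout"]

# average=None returns one F1 score per class, in the order given by `labels`.
per_class = f1_score(y_true, y_pred, labels=labels, average=None)

# Macro F1 is the unweighted mean of the per-class scores.
manual_macro = per_class.sum() / len(labels)
sklearn_macro = f1_score(y_true, y_pred, labels=labels, average="macro")

print(dict(zip(labels, per_class.round(3))))
print(round(manual_macro, 3), round(sklearn_macro, 3))
```

In this toy case the per-class scores are 0.75, 0.5, and 0.5, so the macro F1 is about 0.583. The majority class counts for exactly one third of the result even though it makes up half of the samples.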

In [18]:
import json

def generate_predictions(method: int = 3):
    if method == 1:
        preds = predict_with_pairwise_models(test_x, models, classes)
    elif method == 2:
        preds = predict_with_confidence_scores(test_x, models, classes)
    elif method == 3:
        # learners[-17] is index 17 of the 34 fitted meta-learners, i.e. the "KNN 19" model.
        preds = predict_with_stacking_meta_learner(test_x, models, learners[-17])
    else:
        raise ValueError("Method must be 1, 2, or 3")

    pred_dict = {str(i): pred for i, pred in enumerate(preds)}

    with open('predictions.json', 'w') as f:
        json.dump(pred_dict, f)

    print(f"Predictions for method {method} saved to predictions.json")


generate_predictions(method=3)

Predictions for method 3 saved to predictions.json
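
Before submitting, it may help to confirm that the file follows the format described under Deliverables. The following is a minimal sketch of such a check. `validate_predictions` is a hypothetical helper, not part of the notebook above, and it is demonstrated on a toy file in a temporary directory rather than on the real `predictions.json`.

```python
import json
import os
import tempfile

allowed = {"Graduate", "Dropout", "Enrolled"}


def validate_predictions(path, n_rows):
    """Hypothetical check: keys are "0".."n_rows-1" and values are allowed classes."""
    with open(path) as f:
        data = json.load(f)
    expected_keys = {str(i) for i in range(n_rows)}
    return set(data) == expected_keys and set(data.values()) <= allowed


# Toy file written to a temporary directory, standing in for predictions.json.
toy = {"0": "Graduate", "1": "Dropout", "2": "Enrolled"}
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "predictions.json")
    with open(toy_path, "w") as f:
        json.dump(toy, f)
    is_valid = validate_predictions(toy_path, n_rows=len(toy))

print(is_valid)
```

For the real file, the same helper could be called with the path `predictions.json` and `n_rows=len(test_x)`.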
