# Fee Payment Default Prediction  
## Data Preprocessing & Model Training


### Notebook Overview

This notebook focuses on preparing the dataset for machine learning and training predictive models to identify students who are likely to delay fee payments or require reminders.


### Objectives

- Clean and preprocess the dataset  
- Handle missing values and categorical variables  
- Engineer relevant features  
- Train multiple machine learning models  
- Evaluate and compare model performance  


### Import Required Libraries

Importing essential Python libraries for data manipulation, preprocessing, model training, and evaluation.


In [1]:
# === Cell 1 ===
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import joblib
import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix, roc_auc_score
)
from imblearn.over_sampling import SMOTE


### Load the Dataset

The cleaned dataset from the EDA stage is loaded for preprocessing and model development.


### Initial Data Inspection

Checking dataset shape, column names, and basic structure before preprocessing.


In [2]:
# === Cell 2 ===
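df = pd.read_csv('../data/merged.csv')  # change path if needed
# a bare expression in the middle of a cell is not displayed, so print explicitly
print(df.shape)
print(df.columns.tolist())
# quick peek
display(df.head(10))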


Unnamed: 0.1,Unnamed: 0,student_id,student_name_x,email_x,course,admission_date,admission_status,total_fee,paid_fee,Old_payment_status,payment_status,balance,delayed_payment,needs_reminder
0,0,STU-1000,Kiran Iyer,kiran.39@mail.com,BBA,2024-06-01 00:00:00,Rejected,106092,106092,Paid,PAID,0,0,0
1,1,STU-1001,Rohit Reddy,rohit.46@mail.com,MBA,2024-06-01 01:00:00,Rejected,72408,72408,Paid,PAID,0,0,0
2,2,STU-1002,Emma Taylor,emma.86@mail.com,B.Tech,2024-06-01 02:00:00,Waitlisted,155578,155578,Paid,PAID,0,0,0
3,3,STU-1003,Vivek Nair,vivek.81@mail.com,B.Tech,2024-06-01 03:00:00,Waitlisted,97216,43360,Partially Paid,PENDING,53856,1,1
4,4,STU-1004,Amit Sharma,amit.36@mail.com,BBA,2024-06-01 04:00:00,Waitlisted,174295,86980,Pending,PENDING,87315,1,1
5,5,STU-1005,Manish Verma,manish.16@mail.com,B.Tech,2024-06-01 05:00:00,Waitlisted,179434,93158,Partially Paid,PENDING,86276,1,1
6,6,STU-1006,Anita Nair,anita.60@mail.com,BCA,2024-06-01 06:00:00,Rejected,149305,82440,Pending,PENDING,66865,1,1
7,7,STU-1007,Michael Thomas,michael.60@mail.com,B.Tech,2024-06-01 07:00:00,Waitlisted,166207,166207,Pending,PAID,0,0,0
8,8,STU-1008,Manish Iyer,manish.75@mail.com,MCA,2024-06-01 08:00:00,Waitlisted,196568,139594,Partially Paid,PENDING,56974,1,0
9,9,STU-1009,Karthik Das,karthik.73@mail.com,B.Tech,2024-06-01 09:00:00,Waitlisted,67394,61668,Pending,PENDING,5726,1,0


In [3]:
df.isnull().sum()

Unnamed: 0            0
student_id            0
student_name_x        0
email_x               0
course                0
admission_date        0
admission_status      0
total_fee             0
paid_fee              0
Old_payment_status    0
payment_status        0
balance               0
delayed_payment       0
needs_reminder        0
dtype: int64

### Feature and Target Separation

The feature columns and the two target variables (`delayed_payment` and `needs_reminder`) are identified here, ahead of preprocessing and model training.
One feature is recomputed from the raw fee columns so that it cannot be stale:
- `balance = total_fee - paid_fee`
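
As a rough illustration of this step, the sketch below recomputes `balance` on a toy frame whose values are taken from the data preview. The `paid_ratio` column is a hypothetical extra feature added only for illustration; it is not used anywhere in this notebook.

```python
import pandas as pd

# Toy rows mirroring the notebook's fee columns
toy = pd.DataFrame({
    "total_fee": [106092, 97216, 174295],
    "paid_fee":  [106092, 43360, 86980],
})

# Outstanding amount, recomputed from the raw fee columns
toy["balance"] = toy["total_fee"] - toy["paid_fee"]

# Hypothetical extra feature (assumption, not in the original notebook):
# the share of the total fee that has already been paid
toy["paid_ratio"] = toy["paid_fee"] / toy["total_fee"]

print(toy)
```

A ratio like this would be independent of the absolute fee amount, which may or may not help depending on the model.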

In [4]:
# === Cell 3 ===
# drop obvious index column if present
if 'Unnamed: 0' in df.columns:
    df = df.drop(columns=['Unnamed: 0'])

# Recompute balance to avoid stale/leaky values
df['balance'] = df['total_fee'] - df['paid_fee']

# Standardize column names (if any trailing spaces)
df.columns = df.columns.str.strip()

# Choose features to use (you can drop/modify later)
# NOTE: payment_status is left out of the feature list because it may leak the current payment state.
features_col = ['admission_status', 'total_fee', 'paid_fee', 'balance', 'course']

# Verify targets
assert 'delayed_payment' in df.columns and 'needs_reminder' in df.columns, "Targets missing"

# Quick counts
print("Delayed payment distribution:\n", df['delayed_payment'].value_counts())
print("\nNeeds reminder distribution:\n", df['needs_reminder'].value_counts())



Delayed payment distribution:
 delayed_payment
0    2620
1    2380
Name: count, dtype: int64

Needs reminder distribution:
 needs_reminder
0    3412
1    1588
Name: count, dtype: int64


### Encoding Categorical Variables

Categorical features such as course, admission status, and payment status are encoded using Label Encoding to convert them into numerical form.
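
A minimal sketch of what label encoding does, using a made-up `courses` series. It fills missing values before casting to string, since casting first can turn a missing value into an ordinary string that `fillna` no longer catches.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample with one missing entry
courses = pd.Series(["BBA", "MBA", "B.Tech", None, "BBA"])

# Fill missing values first, then cast to str
cleaned = courses.fillna("NA").astype(str)

le = LabelEncoder()
codes = le.fit_transform(cleaned.tolist())

# classes_ holds the sorted unique labels; each code is an index into it
print(list(le.classes_))   # ['B.Tech', 'BBA', 'MBA', 'NA']
print(codes.tolist())      # [1, 2, 0, 3, 1]

# inverse_transform maps the integer codes back to the original labels
restored = le.inverse_transform(codes)
```

Note that the integer codes carry an arbitrary ordering, which tree-based models usually tolerate but distance-based methods may not.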


In [5]:
# === Cell 4 ===
# We'll encode: admission_status, payment_status, course
categorical_cols = ['admission_status', 'payment_status', 'course']

label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    # fill missing values before casting; astype(str) first would hide them from fillna
    df[col] = df[col].fillna('NA').astype(str)
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

# Save encoders
with open('../models/label_encoders.pkl', 'wb') as f:
    pickle.dump(label_encoders, f)

print("Encoders saved: ", list(label_encoders.keys()))


Encoders saved:  ['admission_status', 'payment_status', 'course']


### Train-Test Split

The dataset is split into training and testing sets using an 80:20 ratio, stratified on `delayed_payment`, so that model performance can be evaluated on unseen data.
Each model is later trained on the training portion only to learn patterns in fee payment behavior.
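
The effect of stratification can be seen on a small synthetic example. The arrays below are made up for illustration; only `train_test_split` and its `stratify` argument correspond to what the notebook uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 30 positives and 70 negatives
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 30 + [0] * 70)

# stratify=y keeps the class ratio the same in both partitions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("train size:", len(X_train), "positives:", int(y_train.sum()))
print("test size: ", len(X_test), "positives:", int(y_test.sum()))
```

Both partitions keep the 30% positive rate, which makes test metrics less sensitive to an unlucky split.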


In [14]:
# === Cell 5 ===
X = df[features_col].copy()
y = df[['delayed_payment','needs_reminder']].copy()

# single split to keep alignment
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y['delayed_payment']  # stratify helps; use one target
)

# separate targets
y1_train = y_train['delayed_payment'].reset_index(drop=True)
y1_test  = y_test['delayed_payment'].reset_index(drop=True)
y2_train = y_train['needs_reminder'].reset_index(drop=True)
y2_test  = y_test['needs_reminder'].reset_index(drop=True)

# reset indices on X
X_train = X_train.reset_index(drop=True)
X_test  = X_test.reset_index(drop=True)

print("X_train:", X_train.shape, "X_test:", X_test.shape)


X_train: (4000, 5) X_test: (1000, 5)


### Feature Scaling

All feature columns, including the label-encoded categorical ones, are scaled using StandardScaler to bring them onto a common scale. The scaler is fitted on the training set only and then applied to both sets.
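
A small sketch of the fit-on-train, transform-both pattern, with made-up numbers. The point is that the test rows are scaled with the training mean and standard deviation, so no information from the test set reaches the scaler.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data
train = np.array([[100.0], [200.0], [300.0]])
test = np.array([[400.0]])

scaler = StandardScaler()
scaler.fit(train)  # mean and std are estimated from the training rows only

train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)  # reuses the training statistics

print("mean:", scaler.mean_, "std:", scaler.scale_)
print("scaled test row:", test_scaled)
```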


In [15]:
# === Cell 6 ===
scaler = StandardScaler()
scaler.fit(X_train)  # fit only on train

X_train_scaled = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns)
X_test_scaled  = pd.DataFrame(scaler.transform(X_test),  columns=X_test.columns)

# Save scaler
joblib.dump(scaler, "../models/scaler.pkl")
print("Scaler saved.")


Scaler saved.


In [16]:
# === Cell 7 ===
sm = SMOTE(random_state=42)

# For delayed_payment
X_train_sm1, y1_train_sm = sm.fit_resample(X_train_scaled, y1_train)
print("After SMOTE (delayed_payment):", pd.Series(y1_train_sm).value_counts().to_dict())

# For needs_reminder
X_train_sm2, y2_train_sm = sm.fit_resample(X_train_scaled, y2_train)
print("After SMOTE (needs_reminder):", pd.Series(y2_train_sm).value_counts().to_dict())


After SMOTE (delayed_payment): {1: 2096, 0: 2096}
After SMOTE (needs_reminder): {1: 2727, 0: 2727}


### Machine Learning Models Used

The following models are trained and evaluated:
- Random Forest Classifier  
- Decision Tree Classifier  


In [17]:
# === Cell 8 ===
rf1 = RandomForestClassifier(n_estimators=300, random_state=42)  # for delayed_payment
rf2 = RandomForestClassifier(n_estimators=300, random_state=42)  # for needs_reminder

# Train on SMOTE-balanced versions
rf1.fit(X_train_sm1, y1_train_sm)
rf2.fit(X_train_sm2, y2_train_sm)

print("Models trained.")


Models trained.


In [21]:
from sklearn.tree import DecisionTreeClassifier  # metric functions are already imported in Cell 1

# Decision Tree model
dt1 = DecisionTreeClassifier(
    criterion='gini',
    max_depth=6,        # You can tune
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=42
)

dt1.fit(X_train_sm1, y1_train_sm)

# Predictions (the tree was trained on scaled features, so predict on the scaled test set)
dt1_pred = dt1.predict(X_test_scaled)


dt2 = DecisionTreeClassifier(
    criterion='gini',
    max_depth=6,        # You can tune
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=42
)

dt2.fit(X_train_sm2, y2_train_sm)

# Predictions (the tree was trained on scaled features, so predict on the scaled test set)
dt2_pred = dt2.predict(X_test_scaled)
print("===  DECISION TREE Models Trained Successfully!! ===")


===  DECISION TREE Models Trained Successfully!! ===


### Model Evaluation

Models are evaluated using:
- Accuracy  
- Precision  
- Recall  
- F1-score  
- Confusion Matrix  
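
To make the relationship between these metrics concrete, the sketch below derives them from the four confusion-matrix counts for binary labels. `binary_metrics` is a hypothetical helper written for illustration; the notebook itself relies on the functions in `sklearn.metrics`.

```python
def binary_metrics(y_true, y_pred):
    """Derive accuracy, precision, recall and F1 from confusion counts (0/1 labels)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    return {
        "confusion": [[tn, fp], [fn, tp]],  # rows: actual 0/1, columns: predicted 0/1
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels and predictions
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
m = binary_metrics(y_true, y_pred)
print(m)
```

The confusion layout used here matches the row and column order of `sklearn.metrics.confusion_matrix` for labels 0 and 1.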


In [22]:
def evaluate(model, X_test, y_test, name):
    pred = model.predict(X_test)

    print("\n==============================")
    print("MODEL :", name)
    print("==============================")
    print("Accuracy :", accuracy_score(y_test, pred))
    print("Precision:", precision_score(y_test, pred))
    print("Recall   :", recall_score(y_test, pred))
    print("F1 Score :", f1_score(y_test, pred))
    print("\nClassification Report:")
    print(classification_report(y_test, pred))

# Random Forest
evaluate(rf1, X_test_scaled, y1_test, "Random Forest - Delayed Payment")
evaluate(rf2, X_test_scaled, y2_test, "Random Forest - Needs Reminder")

# Decision Tree (trained on scaled features, so evaluate on the scaled test set)
evaluate(dt1, X_test_scaled, y1_test, "Decision Tree - Delayed Payment")
evaluate(dt2, X_test_scaled, y2_test, "Decision Tree - Needs Reminder")



MODEL : Random Forest - Delayed Payment
Accuracy : 0.999
Precision: 1.0
Recall   : 0.9978991596638656
F1 Score : 0.9989484752891693

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       524
           1       1.00      1.00      1.00       476

    accuracy                           1.00      1000
   macro avg       1.00      1.00      1.00      1000
weighted avg       1.00      1.00      1.00      1000


MODEL : Random Forest - Needs Reminder
Accuracy : 0.998
Precision: 0.9968253968253968
Recall   : 0.9968253968253968
F1 Score : 0.9968253968253968

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       685
           1       1.00      1.00      1.00       315

    accuracy                           1.00      1000
   macro avg       1.00      1.00      1.00      1000
weighted avg       1.00      1.00      1.00      1000


MODEL : Decision 

### Best Model Selection

Based on the evaluation metrics, the Random Forest models are kept as the best-performing models for both targets and are the ones saved for deployment.


### Saving Trained Models and Preprocessing Objects

The trained models and the feature list are saved as `.pkl` files in the cell below; the scaler and label encoders were already saved in earlier cells. All of these files are needed during deployment.
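
A rough sketch of the save-and-reload round trip that deployment depends on, using toy data and a temporary directory. The file names and the toy arrays are assumptions for illustration; the notebook writes its real artifacts to `../models/`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Toy training data standing in for the real feature matrix
X = np.array([[0.0], [1.0], [10.0], [11.0]])
y = np.array([0, 0, 1, 1])

scaler = StandardScaler().fit(X)
model = RandomForestClassifier(n_estimators=10, random_state=42)
model.fit(scaler.transform(X), y)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names inside a throwaway directory
    scaler_path = os.path.join(tmp, "scaler.pkl")
    model_path = os.path.join(tmp, "model.pkl")
    joblib.dump(scaler, scaler_path)
    joblib.dump(model, model_path)

    # At inference time, load both objects back
    loaded_scaler = joblib.load(scaler_path)
    loaded_model = joblib.load(model_path)

# New rows must go through the same scaler before prediction
new_rows = np.array([[0.5], [10.5]])
preds = loaded_model.predict(loaded_scaler.transform(new_rows))
print(preds)
```

The reloaded objects should reproduce the predictions of the in-memory ones, provided the same library versions are used when loading.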


In [12]:
# === Cell 10 ===
joblib.dump(rf1, "../models/best_model_delayed_payment.pkl")
joblib.dump(rf2, "../models/best_model_needs_reminder.pkl")
joblib.dump(list(X_train.columns), "../models/feature_columns.pkl")
print("Saved models, scaler, and feature_columns.")

Saved models, scaler, and feature_columns.


### Conclusion

This notebook successfully preprocesses the dataset and trains machine learning models capable of predicting fee payment default behavior.  
The selected model is saved and prepared for real-time deployment using Streamlit.
