#### Purpose : D. MODEL DEVELOPMENT: HANDLING OUTLIERS AND MISSING VALUES, DATA NORMALIZATION, FEATURE SELECTION, HYPERPARAMETER TUNING VIA GRID SEARCH, AND PREVENTION OF DATA LEAKAGE

In this phase, the focus is on preparing the dataset and building predictive models to accurately forecast student academic performance. Key preprocessing steps include handling outliers and missing values, normalizing data where necessary, and performing feature selection to retain the most informative features while reducing redundancy. To optimize model performance, hyperparameter tuning is conducted using GridSearchCV with 5-fold cross-validation. The models selected for training include Support Vector Machines (SVM), Decision Tree, Random Forest, XGBoost, and LightGBM, providing a mix of linear, non-linear, and ensemble approaches to capture complex patterns in the data. Care is taken throughout to prevent data leakage, ensuring that all transformations and tuning steps are applied only on training data within the cross-validation framework.

In [1]:
%run 00_project_setup.ipynb
%run 01_data_import.ipynb 
%run 04_feature_engineering.ipynb

  from .autonotebook import tqdm as notebook_tqdm


Original: (4424, 15)
After outlier removal: (4335, 15)


In [2]:
# --------------------------------------------------
#  Define models and hyperparameters
# --------------------------------------------------
models = {
    "SVM": {
        "model": SVC(probability=True, random_state=42),
        "params": {
            "clf__C": [0.01, 0.1, 1, 10, 100],
            "clf__kernel": ["linear", "rbf", "poly"],
            "clf__gamma": ["scale", "auto"],
            "clf__degree": [2, 3, 4]  # only used for poly kernel
        }
    },

    "RandomForest": {
        "model": RandomForestClassifier(random_state=42),
        "params": {
            "clf__n_estimators": [100, 200, 300, 500],
            "clf__max_depth": [None, 5, 10, 20, 30],
            "clf__min_samples_split": [2, 5, 10],
            "clf__min_samples_leaf": [1, 2, 4],
            "clf__bootstrap": [True, False]
        }
    },

    "XGBoost": {
        "model": XGBClassifier(
            use_label_encoder=False,
            eval_metric="logloss",
            random_state=42,
            tree_method="auto"
        ),
        "params": {
            "clf__n_estimators": [100, 200, 300],
            "clf__max_depth": [3, 5, 7],
            "clf__learning_rate": [0.001, 0.01, 0.1],
            "clf__subsample": [0.6, 0.8, 1.0],
            "clf__colsample_bytree": [0.6, 0.8, 1.0]
        }
    },

    "DecisionTree": {
        "model": DecisionTreeClassifier(random_state=42),
        "params": {
            "clf__criterion": ["gini", "entropy", "log_loss"],
            "clf__max_depth": [None, 5, 10, 20, 30],
            "clf__min_samples_split": [2, 5, 10],
            "clf__min_samples_leaf": [1, 2, 4]
        }
    },

    "LightGBM": {
        "model": LGBMClassifier(
            random_state=42,
            boosting_type="gbdt",
            n_jobs=-1,          # use all CPU cores
            verbose=-1
        ),
        "params": {
            "clf__n_estimators": [30, 50],     # extremely fast
            "clf__learning_rate": [0.05, 0.1], # avoids very slow training
            "clf__max_depth": [3, 5],          # shallow but effective
            "clf__num_leaves": [15, 31],       # small leaf count = speed
            "clf__subsample": [0.8],           # fixed to reduce search space
            "clf__colsample_bytree": [0.8]     # fixed to reduce search space
        }
    }
}

In [3]:
# --------------------------------------------------
# Preprocessing pipeline (without SMOTE)
# --------------------------------------------------
def create_pipeline(model):
    return Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
        ("clf", model)
    ])

In [4]:
# --------------------------------------------------
#  Custom CV Strategy: SMOTE INSIDE CV LOOP
# --------------------------------------------------
class SMOTE_CV(GridSearchCV):
    def _fit_and_score(self, estimator, X, y, scorer, train, test, **fit_params):
        
        # Apply SMOTE ONLY to training fold
        sm = SMOTE(random_state=42)
        X_train_res, y_train_res = sm.fit_resample(X[train], y[train])
        
        # Fit model
        estimator.fit(X_train_res, y_train_res)
        
        # Predict on unmodified test fold
        y_pred = estimator.predict(X[test])
        
        # Score
        score = scorer(estimator, X[test], y[test])
        return score


In [5]:
# --------------------------------------------------
# GridSearch with SMOTE-enabled CV
# --------------------------------------------------
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
best_models = {}

for name, m in models.items():
    print(f"\nTraining {name}...")

    pipeline = create_pipeline(m["model"])

    grid = SMOTE_CV(
        estimator=pipeline,
        param_grid=m["params"],
        cv=skf,
        scoring="f1_weighted",
        n_jobs=-1,
        verbose=1
    )

    # Fit using SMOTE in CV
    grid.fit(X_train , y_train)

    best_models[name] = grid.best_estimator_

    print(f"Best parameters for {name}: {grid.best_params_}")
    print(f"Best CV Weighted F1 Score: {grid.best_score_:.4f}")



Training SVM...
Fitting 5 folds for each of 90 candidates, totalling 450 fits
Best parameters for SVM: {'clf__C': 10, 'clf__degree': 2, 'clf__gamma': 'scale', 'clf__kernel': 'linear'}
Best CV Weighted F1 Score: 0.7490

Training RandomForest...
Fitting 5 folds for each of 360 candidates, totalling 1800 fits
Best parameters for RandomForest: {'clf__bootstrap': False, 'clf__max_depth': 20, 'clf__min_samples_leaf': 1, 'clf__min_samples_split': 10, 'clf__n_estimators': 100}
Best CV Weighted F1 Score: 0.7554

Training XGBoost...
Fitting 5 folds for each of 243 candidates, totalling 1215 fits
Best parameters for XGBoost: {'clf__colsample_bytree': 1.0, 'clf__learning_rate': 0.1, 'clf__max_depth': 3, 'clf__n_estimators': 100, 'clf__subsample': 1.0}
Best CV Weighted F1 Score: 0.7575

Training DecisionTree...
Fitting 5 folds for each of 135 candidates, totalling 675 fits
Best parameters for DecisionTree: {'clf__criterion': 'gini', 'clf__max_depth': 5, 'clf__min_samples_leaf': 4, 'clf__min_sample

##### SAVING THE ENTIRE MODEL PIPELINE:

In [6]:
# Make sure the directory exists
os.makedirs("../outputs/models", exist_ok=True)

In [7]:
# Save best model (GridSearchCV)
best_model = grid.best_estimator_

In [8]:
best_model

0,1,2
,steps,"[('imputer', ...), ('scaler', ...), ...]"
,transform_input,
,memory,
,verbose,False

0,1,2
,missing_values,
,strategy,'median'
,fill_value,
,copy,True
,add_indicator,False
,keep_empty_features,False

0,1,2
,copy,True
,with_mean,True
,with_std,True

0,1,2
,boosting_type,'gbdt'
,num_leaves,15
,max_depth,3
,learning_rate,0.1
,n_estimators,50
,subsample_for_bin,200000
,objective,
,class_weight,
,min_split_gain,0.0
,min_child_weight,0.001


In [9]:
dump(best_model, "../outputs/models/best_model.joblib")

['../outputs/models/best_model.joblib']

In [10]:
# Save the entire best pipeline (not just the raw model)
best_pipeline = grid.best_estimator_

In [11]:
best_pipeline

0,1,2
,steps,"[('imputer', ...), ('scaler', ...), ...]"
,transform_input,
,memory,
,verbose,False

0,1,2
,missing_values,
,strategy,'median'
,fill_value,
,copy,True
,add_indicator,False
,keep_empty_features,False

0,1,2
,copy,True
,with_mean,True
,with_std,True

0,1,2
,boosting_type,'gbdt'
,num_leaves,15
,max_depth,3
,learning_rate,0.1
,n_estimators,50
,subsample_for_bin,200000
,objective,
,class_weight,
,min_split_gain,0.0
,min_child_weight,0.001


In [12]:
dump(best_pipeline, "../outputs/models/best_pipeline.joblib")
print("Pipeline saved to ../outputs/models/best_pipeline.joblib")

Pipeline saved to ../outputs/models/best_pipeline.joblib
