#### Purpose: D. MODEL DEVELOPMENT: HANDLING OUTLIERS AND MISSING VALUES, DATA NORMALIZATION, FEATURE SELECTION, HYPERPARAMETER TUNING VIA GRID SEARCH, AND PREVENTION OF DATA LEAKAGE

This phase prepares the dataset and builds predictive models to forecast student academic performance. Key preprocessing steps include handling outliers and missing values, normalizing data where necessary, and performing feature selection to retain the most informative features while reducing redundancy. Hyperparameters are tuned with GridSearchCV using stratified 5-fold cross-validation, scored by weighted F1. The candidate models are Support Vector Machines (SVM), Logistic Regression, Random Forest, XGBoost, and Gradient Boosting, which together cover linear, non-linear, and ensemble approaches. To prevent data leakage, imputation and scaling are wrapped in a single pipeline with the classifier, so within each cross-validation split they are fitted on the training folds only; the 20% hold-out test set plays no part in tuning.

In [1]:
%run 00_project_setup.ipynb
%run 01_data_import.ipynb 
%run 04_feature_engineering.ipynb

In [2]:
X = reduced_df.drop('Target', axis=1)
y = reduced_df['Target']

In [3]:
# Encode the string class labels as integers (needed by XGBoost, among others)
le = LabelEncoder()
y_copy = y.values.ravel()
y_encoded = le.fit_transform(y_copy)
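
As a rough illustration of what the encoding step produces: `LabelEncoder` sorts the distinct class labels and maps them to the integers 0, 1, 2, and so on. The plain-Python sketch below mimics that mapping. The label values are hypothetical placeholders, not read from the dataset.

```python
# Hypothetical target labels; the real values live in reduced_df['Target'].
example_labels = ["Graduate", "Dropout", "Enrolled", "Graduate", "Dropout"]

# LabelEncoder assigns integer codes in sorted order of the unique labels.
classes = sorted(set(example_labels))
label_to_code = {label: code for code, label in enumerate(classes)}
encoded = [label_to_code[label] for label in example_labels]

# Inverse mapping, analogous to calling inverse_transform on the encoder.
decoded = [classes[code] for code in encoded]

print(classes)
print(encoded)
```

Keeping the fitted encoder around matters later: predictions come back as integer codes and need the inverse mapping to be reported as class names.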

In [4]:
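# -----------------------------
# Split dataset (optional: holdout for final testing)
# -----------------------------
# Stratify on the encoded labels so both splits keep the class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded
)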
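
The effect of the `stratify` argument can be sketched in plain Python: each class is split separately, so the 80/20 ratio holds per class rather than only overall. This is an illustration of the idea, not scikit-learn's implementation. The helper `stratified_split` and the class counts are made up for the example.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Hypothetical helper: split indices so each class keeps the same test share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Made-up, imbalanced label vector: 50 / 30 / 20 samples per class.
labels = [0] * 50 + [1] * 30 + [2] * 20
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts, test_counts)
```

Without stratification, a minority class could by chance be under-represented in the hold-out set, which would make the final test metrics noisier.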

In [5]:
# -----------------------------
# Define models and hyperparameters
# -----------------------------
models = {
    "SVM": {
        "model": SVC(probability=True, random_state=42),
        "params": {
            "clf__C": [0.1, 1, 10],
            "clf__kernel": ["linear", "rbf"],
            "clf__gamma": ["scale", "auto"]
        }
    },
    "LogisticRegression": {
        "model": LogisticRegression(max_iter=1000, random_state=42),
        "params": {
            "clf__C": [0.1, 1, 10],
            "clf__penalty": ["l2"],
            "clf__solver": ["lbfgs"]
        }
    },
    "RandomForest": {
        "model": RandomForestClassifier(random_state=42),
        "params": {
            "clf__n_estimators": [100, 200],
            "clf__max_depth": [None, 5, 10],
            "clf__min_samples_split": [2, 5]
        }
    },
    "XGBoost": {
        "model": XGBClassifier(use_label_encoder=False, eval_metric="mlogloss", random_state=42),
        "params": {
            "clf__n_estimators": [100, 200],
            "clf__max_depth": [3, 5],
            "clf__learning_rate": [0.01, 0.1]
        }
    },
    "GradientBoosting": {
        "model": GradientBoostingClassifier(random_state=42),
        "params": {
            "clf__n_estimators": [100, 200],
            "clf__learning_rate": [0.05, 0.1],
            "clf__max_depth": [3, 5]
        }
    }
}
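
Grid search evaluates every combination in a grid, so its cost is the product of the list lengths times the number of folds. The sketch below counts candidates for the grids defined above. The dictionaries are copied by hand, and `n_folds = 5` is assumed to match the 5-fold cross-validation described in the introduction.

```python
from itertools import product

# Parameter grids copied from the `models` dictionary above.
param_grids = {
    "SVM": {"clf__C": [0.1, 1, 10], "clf__kernel": ["linear", "rbf"],
            "clf__gamma": ["scale", "auto"]},
    "LogisticRegression": {"clf__C": [0.1, 1, 10], "clf__penalty": ["l2"],
                           "clf__solver": ["lbfgs"]},
    "RandomForest": {"clf__n_estimators": [100, 200], "clf__max_depth": [None, 5, 10],
                     "clf__min_samples_split": [2, 5]},
    "XGBoost": {"clf__n_estimators": [100, 200], "clf__max_depth": [3, 5],
                "clf__learning_rate": [0.01, 0.1]},
    "GradientBoosting": {"clf__n_estimators": [100, 200], "clf__learning_rate": [0.05, 0.1],
                         "clf__max_depth": [3, 5]},
}

n_folds = 5  # assumed: matches the 5-fold cross-validation used for tuning

fit_counts = {}
for name, grid in param_grids.items():
    # One candidate per element of the Cartesian product of the value lists.
    n_candidates = len(list(product(*grid.values())))
    fit_counts[name] = (n_candidates, n_candidates * n_folds)
    print(f"{name}: {n_candidates} candidates, {n_candidates * n_folds} fits")
```

One thing the count makes visible: the SVM grid pairs `gamma` with the linear kernel, which ignores `gamma`, so some of its 12 candidates are effectively duplicates.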

In [6]:
# -----------------------------
# Define preprocessing pipeline
# -----------------------------
def create_pipeline(model):
    pipeline = Pipeline([
        # 1. Handle missing values
        ("imputer", SimpleImputer(strategy="median")),
        # 2. Feature scaling
        ("scaler", StandardScaler()),
        # 3. Classifier
        ("clf", model)
    ])
    return pipeline
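
Why the imputer and scaler sit inside the pipeline, rather than being applied to `X` up front, can be shown with a plain-Python sketch. The statistics (median for imputation, mean and population standard deviation for scaling) are learned from the training fold only and then reused on the validation fold. The helper names and the toy values are hypothetical; this mirrors the idea, not scikit-learn's internals.

```python
from statistics import mean, median, pstdev

def fit_preprocessor(train_values):
    """Learn imputation and scaling statistics from training values only."""
    observed = [v for v in train_values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in train_values]
    # Guard against zero spread so the division below stays defined.
    return {"fill": fill, "mean": mean(filled), "std": pstdev(filled) or 1.0}

def transform(values, stats):
    """Apply previously learned statistics; nothing is re-estimated here."""
    filled = [stats["fill"] if v is None else v for v in values]
    return [(v - stats["mean"]) / stats["std"] for v in filled]

# Toy single-feature folds; None marks a missing value.
train_fold = [10.0, 12.0, None, 14.0]
valid_fold = [100.0, None]

stats = fit_preprocessor(train_fold)
train_scaled = transform(train_fold, stats)
valid_scaled = transform(valid_fold, stats)

# Leaky variant for comparison: statistics computed on train + validation.
leaky_stats = fit_preprocessor(train_fold + valid_fold)

print(stats)
print(leaky_stats)
```

In the leaky variant the validation value 100.0 shifts the imputation median and the scaling statistics, so information from the held-out fold reaches the model before it is scored. Passing the whole pipeline to the cross-validation routine avoids this because the fit step is repeated per split.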

In [7]:
# -----------------------------
# Model Building: GridSearchCV with 5-fold CV
# -----------------------------
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
best_models = {}

for name, m in models.items():
    print(f"\nTraining {name}...")
    pipeline = create_pipeline(m["model"])
    
    grid = GridSearchCV(
        estimator=pipeline,
        param_grid=m["params"],
        cv=skf,
        scoring="f1_weighted",
        n_jobs=-1,
        verbose=1
    )
    
    grid.fit(X_train, y_train)
    best_models[name] = grid.best_estimator_
    print(f"Best parameters for {name}: {grid.best_params_}")
    print(f"Best cross-validation F1 score for {name}: {grid.best_score_:.4f}")


Training SVM...
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best parameters for SVM: {'clf__C': 10, 'clf__gamma': 'scale', 'clf__kernel': 'linear'}
Best cross-validation F1 score for SVM: 0.7527

Training LogisticRegression...
Fitting 5 folds for each of 3 candidates, totalling 15 fits
Best parameters for LogisticRegression: {'clf__C': 1, 'clf__penalty': 'l2', 'clf__solver': 'lbfgs'}
Best cross-validation F1 score for LogisticRegression: 0.7454

Training RandomForest...
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best parameters for RandomForest: {'clf__max_depth': 10, 'clf__min_samples_split': 2, 'clf__n_estimators': 100}
Best cross-validation F1 score for RandomForest: 0.7553

Training XGBoost...
Fitting 5 folds for each of 8 candidates, totalling 40 fits
Best parameters for XGBoost: {'clf__learning_rate': 0.1, 'clf__max_depth': 5, 'clf__n_estimators': 200}
Best cross-validation F1 score for XGBoost: 0.7592

Training GradientBoosting...
Fitting 5 fo