## CatBoost model implementation for exam prediction
**Created by:** Matsvei Makhnou<br><br>
**Short description:** <br>In this notebook I implement a CatBoost model for exam prediction. I use Optuna for hyperparameter tuning and MLflow for experiment tracking.<br>
**Libraries required to run this code:** <br>catboost, optuna, mlflow, pandas, numpy, scikit-learn; see the `pyproject.toml` file for details<br>
#### Notebook navigation:
[Data preprocessing](#data-preprocessing) - Data loading and preprocessing <br>
[Model training](#model_training) - Model training, hyperparameter tuning and training process tracking <br>
[Final model training](#final_model_training) - Final model training with best hyperparameters <br>
[Conclusion](#conclusion) - Conclusion and future work <br>

In [1]:
import pandas as pd

import sys
import os

sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..")))

from sklearn.model_selection import train_test_split
from models.catboost import run_optimization

2025/04/29 08:04:48 INFO mlflow.tracking.fluent: Experiment with name 'With_class_balanced_CV_Accuracy' does not exist. Creating a new experiment.


### <a id =  'data-preprocessing'> Data preprocessing </a> 
The dataset is imbalanced, so during model training I will use the class weight parameter to compensate. In this section I load the data and split it into a (train + validation) set and a test set.
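As a rough illustration of what class weighting does, the sketch below computes "balanced" weights that are inversely proportional to class frequency, using the class counts printed for the train + validation split in this notebook. The helper name `balanced_class_weights` is hypothetical, and whether `run_optimization` computes its weights this way is an assumption.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Hypothetical helper: weight_k = n_samples / (n_classes * count_k)."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Class counts from the train + validation split: 167 passed (1), 88 failed (0).
labels = [1] * 167 + [0] * 88
weights = balanced_class_weights(labels)
print(weights)
```

The minority class (0) receives the larger weight. A dictionary like this could be passed to `CatBoostClassifier` through its `class_weights` argument; CatBoost also offers `auto_class_weights="Balanced"`, which uses its own formula.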

In [2]:
df = pd.read_csv("/workspaces/ifortex/student_exam_data.csv")

# "Сдал" ("Passed") is the target column
X = df.drop(columns=["Сдал"])
y = df["Сдал"]

# Encode the target: "Да" ("Yes") -> 1, "Нет" ("No") -> 0
y_processed = y.map({"Да": 1, "Нет": 0})

X_train, X_test, y_train, y_test = train_test_split(
    X, y_processed, test_size=0.15, random_state=42
)

print(f"For train and validation: {y_train.value_counts()}")
print(f"For test: {y_test.value_counts()}")

For train and validation: Сдал
1    167
0     88
Name: count, dtype: int64
For test: Сдал
1    26
0    19
Name: count, dtype: int64
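
The counts above show a different share of passing students in the two splits (167/255, about 65%, versus 26/45, about 58%). One option, which this notebook does not use, is to pass `stratify` to `train_test_split` so that both splits keep roughly the same class ratio. The sketch below uses synthetic labels with the same class totals; the `*_demo` names and the placeholder feature column are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class totals as the notebook's data
# (193 passed, 107 failed); the single feature column is a placeholder.
y_demo = np.array([1] * 193 + [0] * 107)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=42, stratify=y_demo
)

print(f"Positive share, train: {y_tr.mean():.3f}")
print(f"Positive share, test:  {y_te.mean():.3f}")
```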


In this section I train the CatBoost model and tune its hyperparameters with Optuna. I use MLflow to track the experiments and save the best model.

In [3]:
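categorical_features = [
    "Сон накануне",  # sleep the night before
    "Настроение",  # mood
    "Энергетиков накануне",  # energy drinks the day before
    "Посещаемость занятий",  # class attendance
    "Время подготовки",  # preparation time
]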

### <a id =  'model_training'> Model training </a>
In this section I choose the best hyperparameters for the model.

In [4]:
study = run_optimization(
    X=X_train, y=y_train, cat_features=categorical_features, n_splits=5
)

print(f"Best params for balanced class weight: {study.best_params}")

[I 2025-04-29 08:04:57,470] A new study created in memory with name: no-name-8e2b0053-2f57-46b4-a989-6f42e6855676


[I 2025-04-29 08:05:43,478] Trial 0 finished with value: 0.968627450980392 and parameters: {'iterations': 947, 'learning_rate': 0.2103999440190995, 'depth': 5, 'l2_leaf_reg': 8.779327088092264, 'subsample': 0.559876418077939}. Best is trial 0 with value: 0.968627450980392.
[I 2025-04-29 08:06:09,863] Trial 1 finished with value: 0.9607843137254903 and parameters: {'iterations': 1000, 'learning_rate': 0.20854525342038105, 'depth': 9, 'l2_leaf_reg': 5.772996344817301, 'subsample': 0.7935134347096668}. Best is trial 0 with value: 0.968627450980392.
[I 2025-04-29 08:06:34,978] Trial 2 finished with value: 0.9764705882352942 and parameters: {'iterations': 638, 'learning_rate': 0.25245911404130733, 'depth': 5, 'l2_leaf_reg': 2.9332092985979656, 'subsample': 0.513473668073871}. Best is trial 2 with value: 0.9764705882352942.
[I 2025-04-29 08:07:01,021] Trial 3 finished with value: 0.9647058823529411 and parameters: {'iterations': 354, 'learning_rate': 0.29831483729903613, 'depth': 9, 'l2_leaf

Best params for balanced class weight: {'iterations': 554, 'learning_rate': 0.25486612921338103, 'depth': 5, 'l2_leaf_reg': 1.2920326786341383, 'subsample': 0.6876616820894427}


### <a id =  'final_model_training'> Final model training </a>
Final training with the best parameters. Here I combine the training and validation sets so that the model has more examples to train on.

In [5]:
from models.final_model import run_final_training

2025/04/29 08:49:23 INFO mlflow.tracking.fluent: Experiment with name 'final_training' does not exist. Creating a new experiment.


In [6]:
model, f1, auc, precision, recall, accuracy = run_final_training(
    X_train, y_train, X_test, y_test, categorical_features
)

print(f"Final model: {model}")
print(f"F1 score: {f1}")
print(f"AUC: {auc}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"Accuracy: {accuracy}")



Final model: <catboost.core.CatBoostClassifier object at 0x7f76744f9550>
F1 score: 0.9259259259259259
AUC: 0.9838056680161944
Precision: 0.8928571428571429
Recall: 0.9615384615384616
Accuracy: 0.9111111111111111
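
The reported metrics are consistent with a single confusion matrix on the 45-student test set. The counts below are inferred from the printed precision, recall, and test class sizes; the notebook itself does not print them. Recomputing the metrics from these counts reproduces the values above.

```python
# Test split from the notebook: 26 positives (passed), 19 negatives (failed).
n_pos, n_neg = 26, 19

# Inferred counts (an assumption, checked by the arithmetic below).
tp, fn = 25, 1   # recall = 25 / 26
fp = 3           # precision = 25 / 28
tn = n_neg - fp  # 16 true negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (n_pos + n_neg)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
```

Under this reading, the model misclassified 4 of the 45 test students: 3 false positives and 1 false negative.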


In [7]:
model.save_model("/workspaces/ifortex/models/catboost_model.cbm")
print("Model saved successfully!")

Model saved successfully!


### <a id = 'conclusion'> Conclusion </a>

The model performed well on the held-out test set of 45 students: accuracy is 91.1%, F1 is 0.926, and AUC is 0.984. This suggests that CatBoost, combined with proper handling of categorical features and class imbalance, is well suited to this task, although a test set this small leaves the estimates noisy.

#### Key Takeaways:
- **Model Performance**: Precision (0.893), recall (0.962), F1 (0.926), and AUC (0.984) are all high, which indicates that the model performs consistently across metrics and not only on accuracy.
- **Future Improvements**:
  1. Use **SHAP values** to interpret the model's predictions and provide insights into feature importance. This will help explain the results to stakeholders and improve trust in the model.
  2. Experiment with additional techniques for handling class imbalance, such as **SMOTE** or **undersampling**, to further validate the results.
  3. Consider testing other models or ensemble techniques to compare performance and ensure the best possible outcome.

By leveraging these insights, the model can be further refined and made more interpretable for real-world applications.