<span style="font-size: 30px;"><b>Model Training</b></span>

<b>Importing and Loading Data</b>

In [16]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score

X_train = pd.read_csv("../data/processed/X_train.csv")
X_test = pd.read_csv("../data/processed/X_test.csv")
y_train = pd.read_csv("../data/processed/y_train.csv").squeeze()
y_test = pd.read_csv("../data/processed/y_test.csv").squeeze()

<b>Define Models</b>

In [18]:
models = {
    "Logistic Regression": LogisticRegression(max_iter=500, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "XGBoost": XGBClassifier(use_label_encoder=False, eval_metric="logloss", random_state=42)
}
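
Logistic regression with the default lbfgs solver tends to converge more reliably when the features are standardized. The following is a minimal sketch of one way to do that, wrapping the scaler and the classifier in a single pipeline. The synthetic data from make_classification and the names X_demo, y_demo, and scaled_logreg are assumptions for illustration only; in this notebook the pipeline would be fit on X_train and y_train instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train (assumption, not the project data).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
# Blow up the scale of one feature to mimic unscaled inputs.
X_demo[:, 0] *= 1000.0

# Standardize the features, then fit logistic regression, as one estimator.
scaled_logreg = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=500, random_state=42),
)
scaled_logreg.fit(X_demo, y_demo)

demo_pred = scaled_logreg.predict(X_demo)
print("Training accuracy on synthetic data:", (demo_pred == y_demo).mean())
```

Because the scaler lives inside the pipeline, it is fit only on the data passed to fit, and the same transformation is reused automatically at prediction time.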

<b>Train & Evaluate Models</b>

In [20]:
results = []

for name, model in models.items():
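    # Fit the model on the training split
    model.fit(X_train, y_train)

    # Predict labels for the held-out test split
    y_pred = model.predict(X_test)

    # Score the predictions (precision, recall, and F1 are for the positive class, label 1)
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)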

    results.append({
        "Model": name,
        "Accuracy": acc,
        "Precision": prec,
        "Recall": rec,
        "F1 Score": f1
    })

    print(f"\n{name} Results:")
    print(classification_report(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



Logistic Regression Results:
              precision    recall  f1-score   support

           0       0.75      0.94      0.83     59284
           1       0.85      0.54      0.66     40155

    accuracy                           0.78     99439
   macro avg       0.80      0.74      0.75     99439
weighted avg       0.79      0.78      0.77     99439

Confusion Matrix:
 [[55531  3753]
 [18294 21861]]

Random Forest Results:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     59284
           1       1.00      1.00      1.00     40155

    accuracy                           1.00     99439
   macro avg       1.00      1.00      1.00     99439
weighted avg       1.00      1.00      1.00     99439

Confusion Matrix:
 [[59136   148]
 [    1 40154]]


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)



XGBoost Results:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     59284
           1       1.00      1.00      1.00     40155

    accuracy                           1.00     99439
   macro avg       1.00      1.00      1.00     99439
weighted avg       1.00      1.00      1.00     99439

Confusion Matrix:
 [[59181   103]
 [    9 40146]]
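
The scores above come from a single train/test split, and cross_val_score is imported at the top of the notebook but never called. As a rough sketch of how k-fold cross-validation could give a spread of scores rather than one number, the example below runs 5-fold cross-validation on a small synthetic dataset. The data, the smaller n_estimators value, and the names X_demo, y_demo, clf, and cv_scores are assumptions for illustration; in this notebook the call would use X_train and y_train.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train / y_train (assumption, not the project data).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# A smaller forest than in the notebook, only to keep the sketch fast.
clf = RandomForestClassifier(n_estimators=50, random_state=42)

# 5-fold cross-validation, scoring each fold by F1 on the positive class.
cv_scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="f1")

print("F1 per fold:", cv_scores)
print("Mean F1: %.3f (std %.3f)" % (cv_scores.mean(), cv_scores.std()))
```

A large gap between the cross-validated scores and the test-set scores, or a large spread across folds, would be a reason to look more closely at the data split.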


<b>Comparing Models</b>

In [26]:
result_df = pd.DataFrame(results)
print("\nModel Comparison:\n", result_df)


Model Comparison:
                  Model  Accuracy  Precision    Recall  F1 Score
0  Logistic Regression  0.778286   0.853479  0.544415  0.664781
1        Random Forest  0.998502   0.996328  0.999975  0.998148
2              XGBoost  0.998874   0.997441  0.999776  0.998607
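
The summary metrics can be recomputed by hand from a confusion matrix, which is a useful sanity check on the table. The sketch below does this for the Logistic Regression confusion matrix printed earlier, assuming scikit-learn's layout of [[TN, FP], [FN, TP]] for labels 0 and 1. The variable names are chosen here for illustration.

```python
# Logistic Regression confusion matrix from the notebook output,
# read as [[TN, FP], [FN, TP]] (scikit-learn's layout for labels 0, 1).
tn, fp = 55531, 3753
fn, tp = 18294, 21861

total = tn + fp + fn + tp

accuracy = (tp + tn) / total          # share of all predictions that are correct
precision = tp / (tp + fp)            # share of predicted positives that are correct
recall = tp / (tp + fn)               # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.6f} precision={precision:.6f} "
      f"recall={recall:.6f} f1={f1:.6f}")
```

The four values agree with the Logistic Regression row of the comparison table to six decimal places.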


<b>Saving Model Results</b>

In [33]:
result_df.to_csv("../outputs/model_results.csv", index=False)
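
DataFrame.to_csv does not create missing directories, so the call above fails if the outputs folder does not exist yet. A minimal sketch of guarding against that with pathlib follows. The temporary directory, the placeholder row in demo_df, and the names out_dir, out_path, and reloaded are assumptions for illustration and are not the project's real paths or results.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Placeholder frame standing in for result_df (illustrative values only).
demo_df = pd.DataFrame([{"Model": "Demo Model", "Accuracy": 0.9}])

# Use a temporary folder here; the notebook would use ../outputs instead.
out_dir = Path(tempfile.mkdtemp()) / "outputs"
out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing

out_path = out_dir / "model_results.csv"
demo_df.to_csv(out_path, index=False)

# Read the file back to confirm it was written as expected.
reloaded = pd.read_csv(out_path)
print(reloaded)
```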

<b>This concludes the modeling part of the project.</b>