# Model Evaluation: Binary Classification

This notebook evaluates the performance of five different base models:

- XGBoost
- CatBoost
- RandomForest
- LogisticRegression
- MLP

Five models are trained for each base model, each on a different data split.\
Each data split consists of 2 cities for training, 2 cities for validation, and 2 cities for evaluation.\
Each base model is scored by the mean ROC AUC of its five models.\
The models use a learning rate fixed at 0.01 and 10 early-stopping rounds (where applicable).
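
The split scheme described above can be sketched as follows. This is a minimal illustration, not the notebook's `random_data_generator`: the city names and the `make_city_splits` helper are hypothetical, and the sketch assumes exactly six cities so that each split assigns two cities to training, two to validation, and two to evaluation.

```python
import random

# Hypothetical city identifiers; the real cities are not listed in this notebook.
CITIES = ["city_a", "city_b", "city_c", "city_d", "city_e", "city_f"]


def make_city_splits(cities, n_splits=5, seed=42):
    """Return n_splits distinct assignments of six cities to train/val/test (2 each)."""
    if len(cities) != 6:
        raise ValueError("this sketch assumes exactly six cities")
    if n_splits > 90:  # 6! / (2! * 2! * 2!) distinct assignments exist
        raise ValueError("at most 90 distinct splits are possible")
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    seen = set()
    splits = []
    while len(splits) < n_splits:
        shuffled = rng.sample(cities, k=len(cities))  # shuffled copy of the cities
        train, val, test = shuffled[0:2], shuffled[2:4], shuffled[4:6]
        key = (frozenset(train), frozenset(val), frozenset(test))
        if key in seen:  # skip assignments that were already drawn
            continue
        seen.add(key)
        splits.append({"train": train, "val": val, "test": test})
    return splits


city_splits = make_city_splits(CITIES)
for k, split in enumerate(city_splits):
    print(k, split)
```

Keeping whole cities in a single partition means the evaluation cities are never seen during training or early stopping.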

**Results**

| Model              | Mean ROC AUC |
| ------------------ | ------------ |
| XGBoost            | 0.772        |
| CatBoost           |              |
| RandomForest       |              |
| LogisticRegression | 0.688        |
| MLP                |              |
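
For reference, the ROC AUC of a single split equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties count as one half), and the value intended for each base model is the mean of that quantity over the five splits. The pure-Python sketch below illustrates the definition on made-up labels and scores. `pairwise_auc` is a hypothetical helper written for illustration only; the notebook itself uses `sklearn.metrics.roc_auc_score`.

```python
from statistics import mean


def pairwise_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Made-up labels and scores for two splits (illustration only, not notebook data).
toy_splits = [
    ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    ([0, 1, 0, 1], [0.2, 0.9, 0.3, 0.6]),
]
toy_scores = [pairwise_auc(y, s) for y, s in toy_splits]
print(toy_scores, mean(toy_scores))  # [0.75, 1.0] 0.875
```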

## 1. Generate Random Data

In [1]:
import pandas as pd
import numpy as np
import random
from random_data_generator import random_data_generator

#randomized hyperparameter search
from sklearn.model_selection import RandomizedSearchCV

#models
from xgboost import XGBClassifier
from catboost import CatBoostClassifier, Pool
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

#metrics
from sklearn.metrics import roc_auc_score

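# Seed both Python's and NumPy's global RNGs; scikit-learn falls back to NumPy's when no random_state is given
random.seed(42)
np.random.seed(42)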

In [2]:
dataset = random_data_generator("binary", 5)

## 2. XGBoost

In [4]:
xgboost_params = {
    'learning_rate': np.arange(0.001, 0.1, 0.005),
    'max_depth': np.arange(2, 8),
    'n_estimators': np.arange(50, 150, 10),
    'subsample': np.arange(0.3, 0.9, 0.1),
    'colsample_bytree': np.arange(0.6, 1.0, 0.05),
    'gamma': np.arange(0.1, 5, 0.1),
    'early_stopping_rounds': np.arange(5, 15, 5),
    'eval_metric':['auc']
}

scores = []
for split in dataset:
    # Each split holds (x_train, y_train, x_val, y_val, x_test, y_test)
    x_train, y_train, x_val, y_val, x_test, y_test = split[:6]
    # Randomized search with 5-fold CV on the training split; the validation split drives early stopping
    random_search = RandomizedSearchCV(
        XGBClassifier(random_state=42),
        param_distributions=xgboost_params,
        cv=5,
        scoring='roc_auc',
        random_state=42,
    ).fit(x_train, y_train, eval_set=[(x_val, y_val)], verbose=False)
    best_params = random_search.best_params_
    # Refit with the best parameters and score on the held-out test split
    xgb_clf = XGBClassifier(**best_params, random_state=42).fit(x_train, y_train, eval_set=[(x_val, y_val)], verbose=False)
    y_pred = xgb_clf.predict_proba(x_test)
    scores.append(roc_auc_score(y_test, y_pred[:, 1]))
mean = np.mean(scores)
print(scores)
print(mean)

STARTED
DONE
STARTED


Exception ignored on calling ctypes callback function: <bound method DataIter._next_wrapper of <xgboost.data.SingleBatchInternalIter object at 0x0000026A99A7FC50>>
Traceback (most recent call last):
  File "c:\Users\caior\OneDrive\Documentos\GitHub\xai-nui-classification\venv\Lib\site-packages\xgboost\core.py", line 588, in _next_wrapper
    def _next_wrapper(self, this: None) -> int:  # pylint: disable=unused-argument

KeyboardInterrupt: 



## 3. CatBoost

In [3]:
catboost_params = {
    'learning_rate': np.arange(0.001, 0.1, 0.005),
    'depth': np.arange(4, 10),
    'iterations': np.arange(50, 150, 10),
    'l2_leaf_reg': np.arange(0.1, 1, 0.2),
    'eval_metric': ['AUC'],
    'early_stopping_rounds': np.arange(5, 15, 5),
}

scores = []
for split in dataset:
    # Each split holds (x_train, y_train, x_val, y_val, x_test, y_test)
    x_train, y_train, x_val, y_val, x_test, y_test = split[:6]
    print("STARTED")
    # Randomized search with 5-fold CV on the training split; the validation split drives early stopping
    random_search = RandomizedSearchCV(
        CatBoostClassifier(random_seed=42),
        param_distributions=catboost_params,
        cv=5,
        scoring='roc_auc',
        random_state=42,
    ).fit(x_train, y_train, eval_set=[(x_val, y_val)], verbose=False)
    best_params = random_search.best_params_
    # Refit with the best parameters and score on the held-out test split
    cat_clf = CatBoostClassifier(**best_params, random_seed=42).fit(x_train, y_train, eval_set=[(x_val, y_val)], verbose=False)
    y_pred = cat_clf.predict_proba(x_test)
    scores.append(roc_auc_score(y_test, y_pred[:, 1]))
    print("DONE")
mean = np.mean(scores)
print(scores)
print(mean)

STARTED


## 4. Random Forest

In [19]:
scores = []
for split in dataset:
    # Each split holds (x_train, y_train, x_val, y_val, x_test, y_test)
    x_train, y_train, x_val, y_val, x_test, y_test = split[:6]
    # Default hyperparameters; the validation split is not used here
    rf_clf = RandomForestClassifier(random_state=42, verbose=0)
    rf_clf.fit(x_train, y_train)
    y_pred = rf_clf.predict_proba(x_test)
    scores.append(roc_auc_score(y_test, y_pred[:, 1]))
mean = np.mean(scores)
print(scores)
print(mean)

[0.7479586532342586, 0.70459258589082, 0.7910507940324134, 0.7568966029590902, 0.7461569741946842]
0.7493311220622532
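
The per-split loop used in each section can be factored into a small helper. The sketch below is one possible shape for it, not code from the notebook: `mean_test_auc` and `toy_split` are hypothetical names, the synthetic data from `make_classification` stands in for `dataset`, and each split is assumed to be a 6-tuple ordered as `(x_train, y_train, x_val, y_val, x_test, y_test)`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score


def mean_test_auc(make_model, splits):
    """Fit a fresh model on each split and return (per-split AUCs, their mean)."""
    aucs = []
    for x_train, y_train, _x_val, _y_val, x_test, y_test in splits:
        model = make_model()  # new, unfitted model for every split
        model.fit(x_train, y_train)
        aucs.append(roc_auc_score(y_test, model.predict_proba(x_test)[:, 1]))
    return aucs, float(np.mean(aucs))


def toy_split(seed):
    """Synthetic stand-in for one entry of `dataset` (illustration only)."""
    X, y = make_classification(n_samples=300, n_features=8, random_state=seed)
    return X[:100], y[:100], X[100:200], y[100:200], X[200:], y[200:]


toy_dataset = [toy_split(seed) for seed in range(3)]
aucs, avg = mean_test_auc(
    lambda: RandomForestClassifier(n_estimators=50, random_state=42),
    toy_dataset,
)
print(aucs, avg)
```

Passing a factory rather than a fitted estimator keeps the splits independent, since no state carries over from one fit to the next.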


## 5. Logistic Regression

In [17]:
scores = []
for split in dataset:
    # Each split holds (x_train, y_train, x_val, y_val, x_test, y_test)
    x_train, y_train, x_val, y_val, x_test, y_test = split[:6]
    # Default regularization with the SAG solver; the validation split is not used here
    lr_clf = LogisticRegression(random_state=42, verbose=0, solver="sag")
    lr_clf.fit(x_train, y_train)
    y_pred = lr_clf.predict_proba(x_test)
    scores.append(roc_auc_score(y_test, y_pred[:, 1]))
mean = np.mean(scores)
print(scores)
print(mean)



[0.63691483025621, 0.6219127525835161, 0.4746259835525888, 0.5993430897283265, 0.7113281637915438]
0.608824963982437
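
The scikit-learn documentation notes that the `sag` solver converges quickly only when features have roughly the same scale. Whether that explains the spread in the scores above is not established here. As a hedged sketch, assuming the features are unscaled numeric columns, standardization can be fit on the training split only and chained with the classifier. The synthetic data from `make_classification` and the artificial rescaling of its columns are for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the column scales are deliberately made uneven.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X = X * np.logspace(0, 3, num=X.shape[1])  # column scales range from 1x to 1000x

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# The scaler is fit on the training data only and reused at prediction time.
lr_pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="sag", max_iter=1000, random_state=42),
)
lr_pipeline.fit(X_train, y_train)

proba = lr_pipeline.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)
print(auc)
```

In the notebook's loop, the same pipeline object could take the place of `lr_clf`, since it exposes `fit` and `predict_proba` like a plain estimator.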


## 6. MLP