# Unmasking Robots
### CRISP-DM Cycle 1
---

Imagine an online marketplace, a digital stage where many auctions unfold every second. In this environment, participants from all over the world place bids on the items they want, from jewelry to tech equipment. However, not every player on this field is human; some are robots programmed to manipulate auction results.

Your challenge is to dig into this data, explore the layers of auction activity, and build a model that reliably distinguishes humans from robots.

> Disclaimer: This is a fictional business case.

## 0. PREPARATION

### 0.1 Settings

In [1]:
# Settings imports
import os
import sys
import pandas as pd
from dotenv import load_dotenv

# Load .env file
env_path = "../.env"
load_dotenv(dotenv_path=env_path)

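# Random seed, read from the SEED environment variable
seed = int(os.getenv("SEED"))

# Project root, read from the HOMEPATH environment variable
path = os.getenv("HOMEPATH")

# Make the project's helper package importable
sys.path.append(path)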

In [2]:
from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
    AdaBoostClassifier,
    ExtraTreesClassifier,
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression

In [3]:
from helper.classes.Pipeline import MLPipeline

### 0.2 Data

**Train and Test**

- **id_participante**: Unique identifier of the participant
- **conta_pagamento**: Payment account associated with the participant (value obfuscated). Not used.
- **endereco**: Postal address of the participant. Not used.
- **resultado**: The target variable, indicating whether the participant is a robot or a human (robot = 1, human = 0).

- **Confirmed robots**: Participants with clear evidence of fraudulent activity, which led to a ban from the platform. They are labeled as robots in the dataset (resultado = 1).

- **Suspected robots**: Participants with atypical activity or statistics above the average, but without definitive proof of fraud. Their classification as robots is uncertain.

**Bids (Lances)**

- **id_lance**: Unique identifier of the bid
- **id_participante**: Unique identifier of the participant
- **leilao**: Unique identifier of the auction
- **mercadoria**: The category of the auctioned merchandise
- **dispositivo**: The device used by the visitor
- **tempo**: The time at which the bid was placed
- **pais**: The country the IP address belongs to
- **ip**: The participant's IP address
- **url**: The URL the participant was referred from
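
The bids table has one row per bid, while the label lives in the participant table, so the two presumably have to be joined on `id_participante`. The sketch below illustrates that join on toy data. Count features such as `contagem_participante` and `contagem_leilao` appear later in the notebook, but how they were computed is not shown, so the definitions used here are assumptions.

```python
import pandas as pd

# Toy bids table using column names from the data dictionary (values are made up).
bids = pd.DataFrame(
    {
        "id_lance": [1, 2, 3, 4, 5],
        "id_participante": ["a", "a", "b", "a", "b"],
        "leilao": ["x", "x", "x", "y", "y"],
    }
)

# Toy participant table carrying the target column.
participants = pd.DataFrame({"id_participante": ["a", "b"], "resultado": [1, 0]})

# Assumed definitions: number of bids per participant and per auction.
bids["contagem_participante"] = bids.groupby("id_participante")["id_lance"].transform("count")
bids["contagem_leilao"] = bids.groupby("leilao")["id_lance"].transform("count")

# Attach the participant-level label to every bid (left join keeps the bid order).
labelled = bids.merge(participants, on="id_participante", how="left")
print(labelled)
```

After the join, every bid carries its participant's label, which is why a single participant can contribute many rows to the training data.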

## 1. Modelling

### 1.1 Loading Data

In [4]:
X_train = pd.read_feather(path + "/data/processed/X_train.feather")
X_test = pd.read_feather(path + "/data/processed/X_test.feather")
X_val = pd.read_feather(path + "/data/processed/X_val.feather")


y_train = pd.read_pickle(path + "/data/processed/y_train.pkl")
y_test = pd.read_pickle(path + "/data/processed/y_test.pkl")
y_val = pd.read_pickle(path + "/data/processed/y_val.pkl")

for data in [X_train, X_test, X_val]:
    data.drop(
        columns=[
            "pais",
            "url",
            "endereco",
            "dispositivo",
            "leilao",
            "periodo_dia",
            "mercadoria",
            "conta_pagamento",
        ],
        inplace=True,
    )

In [5]:
feature_transformations = {
    "log": [
        "contagem_participante",
        "contagem_leilao",
        "contagem_conta_pagamento",
        "frequencia_dispositivo",
    ],
    # "one_hot": [
    #    "dispositivo",
    #    "leilao",
    #    "periodo_dia",
    #    "mercadoria",
    #    "conta_pagamento",
    # ],
    "ordinal": ["ip_classe"],
    # "hashing": ["pais", "url", "endereco"],
    "min_max_scaler": [
        "hora_sin",
        "hora_cos",
        "minuto_sin",
        "minuto_cos",
        "segundo_sin",
        "segundo_cos",
    ],
    "robust_scaler": ["hora", "minuto", "segundo"],
}
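
The `MLPipeline` helper is not shown in this notebook, so it is unclear how it interprets the dictionary above. As a rough sketch, a dictionary of this shape could be mapped onto a scikit-learn `ColumnTransformer` as follows. The function name `build_column_transformer`, the use of `np.log1p` for the `"log"` key, and the toy data are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
    FunctionTransformer,
    MinMaxScaler,
    OrdinalEncoder,
    RobustScaler,
)


def build_column_transformer(spec):
    """Hypothetical helper: map each key of the spec to a scikit-learn transformer."""
    available = {
        "log": FunctionTransformer(np.log1p),  # assumption: log(1 + x)
        "ordinal": OrdinalEncoder(),
        "min_max_scaler": MinMaxScaler(),
        "robust_scaler": RobustScaler(),
    }
    steps = [(name, available[name], cols) for name, cols in spec.items()]
    return ColumnTransformer(steps, remainder="passthrough")


# Toy spec and data reusing a few column names from the cell above.
spec = {
    "log": ["contagem_leilao"],
    "ordinal": ["ip_classe"],
    "min_max_scaler": ["hora_sin"],
    "robust_scaler": ["hora"],
}
toy = pd.DataFrame(
    {
        "contagem_leilao": [1, 10, 100],
        "ip_classe": ["A", "B", "A"],
        "hora_sin": [-1.0, 0.0, 1.0],
        "hora": [0, 12, 23],
    }
)
transformed = build_column_transformer(spec).fit_transform(toy)
print(transformed.shape)
```

The output columns follow the order of the keys in the spec, so the log-transformed column comes first and the robust-scaled column last.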

In [6]:
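# Ratio of negatives (humans) to positives (robots), passed as scale_pos_weight below
proportion = float(len(y_train[y_train == 0])) / len(y_train[y_train == 1])
# NOTE: these weights mirror the class frequencies (about 87% humans, 13% robots),
# so the majority class is weighted more heavily than the minority class
class_weights = {0: 0.87, 1: 0.13}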

models = [
    KNeighborsClassifier(n_neighbors=5),
    LogisticRegression(class_weight=class_weights, random_state=seed),
    DecisionTreeClassifier(class_weight=class_weights, random_state=seed),
    RandomForestClassifier(class_weight=class_weights, random_state=seed),
    AdaBoostClassifier(random_state=seed),
    GradientBoostingClassifier(random_state=seed),
    ExtraTreesClassifier(class_weight=class_weights, random_state=seed),
    XGBClassifier(
        scale_pos_weight=proportion,
        objective="binary:logistic",
        eval_metric="logloss",
        random_state=seed,
    ),
    LGBMClassifier(
        is_unbalance=True,
        objective="binary",
        metric="binary_logloss",
        random_state=seed,
    ),
    CatBoostClassifier(scale_pos_weight=proportion, random_state=seed),
]
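
The dictionary `{0: 0.87, 1: 0.13}` gives the larger weight to class 0, which is also the more frequent class. For comparison, the sketch below computes inverse-frequency weights with scikit-learn's `compute_class_weight`, which is the more common way to handle imbalance. The toy label vector is an assumption chosen to match the roughly 13% positive rate reported in the LightGBM log further down.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with about 13% positives (robots).
y = np.array([0] * 87 + [1] * 13)

# "balanced" weights are n_samples / (n_classes * count_per_class).
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
balanced_weights = dict(zip([0, 1], weights))

# Ratio of negatives to positives, the quantity used for scale_pos_weight.
scale_pos_weight = (y == 0).sum() / (y == 1).sum()
print(balanced_weights, scale_pos_weight)
```

With inverse-frequency weights the minority class receives the larger weight, which is the opposite of the hard-coded dictionary. Passing `class_weight="balanced"` directly to the scikit-learn estimators would have the same effect.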

In [7]:
pipeline = MLPipeline(feature_transformations, models)
results = pipeline.run(X_train, y_train, X_test, y_test)
results

Starting run...
Starting fit_transform...
Transformers configured.
ColumnTransformer created.
fit_transform completed.
Data transformation completed.
Training model KNeighborsClassifier...
Evaluating model KNeighborsClassifier...
Evaluation completed for model KNeighborsClassifier.
Model KNeighborsClassifier trained and evaluated.
Training model LogisticRegression...


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Evaluating model LogisticRegression...


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Evaluation completed for model LogisticRegression.
Model LogisticRegression trained and evaluated.
Training model DecisionTreeClassifier...
Evaluating model DecisionTreeClassifier...
Evaluation completed for model DecisionTreeClassifier.
Model DecisionTreeClassifier trained and evaluated.
Training model RandomForestClassifier...
Evaluating model RandomForestClassifier...
Evaluation completed for model RandomForestClassifier.
Model RandomForestClassifier trained and evaluated.
Training model AdaBoostClassifier...




Evaluating model AdaBoostClassifier...
Evaluation completed for model AdaBoostClassifier.
Model AdaBoostClassifier trained and evaluated.
Training model GradientBoostingClassifier...
Evaluating model GradientBoostingClassifier...
Evaluation completed for model GradientBoostingClassifier.
Model GradientBoostingClassifier trained and evaluated.
Training model ExtraTreesClassifier...
Evaluating model ExtraTreesClassifier...
Evaluation completed for model ExtraTreesClassifier.
Model ExtraTreesClassifier trained and evaluated.
Training model XGBClassifier...
Evaluating model XGBClassifier...
Evaluation completed for model XGBClassifier.
Model XGBClassifier trained and evaluated.
Training model LGBMClassifier...
[LightGBM] [Info] Number of positive: 233239, number of negative: 1539223
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.024532 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_

|  | model | brier_score_loss | log_loss | f1_score | recall | precision | roc_auc | balanced_accuracy |
|---|---|---|---|---|---|---|---|---|
| 0 | KNeighborsClassifier | 0.113633 | 1.596259 | 0.360416 | 0.309469 | 0.431444 | 0.735115 | 0.623836 |
| 1 | LogisticRegression | 0.121279 | 0.459746 | 0.0 | 0.0 | 0.0 | 0.712779 | 0.5 |
| 2 | DecisionTreeClassifier | 0.440123 | 15.863626 | 0.078831 | 0.143112 | 0.054398 | 0.383071 | 0.383071 |
| 3 | RandomForestClassifier | 0.36754 | 5.793164 | 0.040954 | 0.065831 | 0.029722 | 0.361888 | 0.370089 |
| 4 | AdaBoostClassifier | 0.255794 | 0.704891 | 0.013976 | 0.022749 | 0.010086 | 0.297411 | 0.342209 |
| 5 | GradientBoostingClassifier | 0.347278 | 1.266747 | 0.072743 | 0.119118 | 0.052358 | 0.341915 | 0.396214 |
| 6 | ExtraTreesClassifier | 0.243776 | 0.841097 | 0.038778 | 0.055672 | 0.029749 | 0.384523 | 0.390269 |
| 7 | XGBClassifier | 0.189034 | 1.167196 | 0.196991 | 0.201981 | 0.192242 | 0.56483 | 0.53669 |
| 8 | LGBMClassifier | 0.433089 | 3.321559 | 0.10476 | 0.19916 | 0.071072 | 0.345847 | 0.402358 |
| 9 | CatBoostClassifier | 0.429828 | 3.056711 | 0.078722 | 0.141672 | 0.054504 | 0.408109 | 0.384633 |
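
The evaluation code inside `MLPipeline` is not shown, so the sketch below is only an assumption about how the columns of the table above could be computed with scikit-learn. The function name `evaluate` and the 0.5 threshold are hypothetical. The toy example reproduces the pattern seen in the LogisticRegression row: a model that never predicts the positive class gets zero F1, recall, and precision and a balanced accuracy of 0.5, while its ROC AUC can still be well above 0.5 because that metric depends only on the ranking of the scores.

```python
import numpy as np
from sklearn.metrics import (
    balanced_accuracy_score,
    brier_score_loss,
    f1_score,
    log_loss,
    precision_score,
    recall_score,
    roc_auc_score,
)


def evaluate(y_true, y_proba, threshold=0.5):
    """Hypothetical helper: compute the table's metrics from positive-class probabilities."""
    y_pred = (np.asarray(y_proba) >= threshold).astype(int)
    return {
        "brier_score_loss": brier_score_loss(y_true, y_proba),
        "log_loss": log_loss(y_true, y_proba),
        "f1_score": f1_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "roc_auc": roc_auc_score(y_true, y_proba),
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
    }


# Toy case: positives are ranked highest, but no score crosses the threshold.
y_true = np.array([0, 0, 0, 1])
y_proba = np.array([0.1, 0.2, 0.3, 0.4])
metrics = evaluate(y_true, y_proba)
print(metrics)
```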


In [8]:
results_cv, metrics_cv = pipeline.run_cv(X_train, y_train)
results_cv

Starting cross-validation run...
Processing fold 1/5...
Starting fit_transform...
Transformers configured.
ColumnTransformer created.
fit_transform completed.
Data transformation completed.
Training model KNeighborsClassifier on fold 1...
Evaluating model KNeighborsClassifier...
Evaluation completed for model KNeighborsClassifier.
Model KNeighborsClassifier trained and evaluated on fold 1.
Training model LogisticRegression on fold 1...


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Evaluating model LogisticRegression...


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Evaluation completed for model LogisticRegression.
Model LogisticRegression trained and evaluated on fold 1.
Training model DecisionTreeClassifier on fold 1...
Evaluating model DecisionTreeClassifier...
Evaluation completed for model DecisionTreeClassifier.
Model DecisionTreeClassifier trained and evaluated on fold 1.
Training model RandomForestClassifier on fold 1...
Evaluating model RandomForestClassifier...
Evaluation completed for model RandomForestClassifier.
Model RandomForestClassifier trained and evaluated on fold 1.
Training model AdaBoostClassifier on fold 1...




Evaluating model AdaBoostClassifier...
Evaluation completed for model AdaBoostClassifier.
Model AdaBoostClassifier trained and evaluated on fold 1.
Training model GradientBoostingClassifier on fold 1...
Evaluating model GradientBoostingClassifier...
Evaluation completed for model GradientBoostingClassifier.
Model GradientBoostingClassifier trained and evaluated on fold 1.
Training model ExtraTreesClassifier on fold 1...
Evaluating model ExtraTreesClassifier...
Evaluation completed for model ExtraTreesClassifier.
Model ExtraTreesClassifier trained and evaluated on fold 1.
Training model XGBClassifier on fold 1...
Evaluating model XGBClassifier...
Evaluation completed for model XGBClassifier.
Model XGBClassifier trained and evaluated on fold 1.
Training model LGBMClassifier on fold 1...
[LightGBM] [Info] Number of positive: 186591, number of negative: 1231378
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.021215 seconds.
You can set `force_row_wis

Evaluating model LogisticRegression...
Evaluation completed for model LogisticRegression.
Model LogisticRegression trained and evaluated on fold 2.
Training model DecisionTreeClassifier on fold 2...
Evaluating model DecisionTreeClassifier...
Evaluation completed for model DecisionTreeClassifier.
Model DecisionTreeClassifier trained and evaluated on fold 2.
Training model RandomForestClassifier on fold 2...
Evaluating model RandomForestClassifier...
Evaluation completed for model RandomForestClassifier.
Model RandomForestClassifier trained and evaluated on fold 2.
Training model AdaBoostClassifier on fold 2...




Evaluating model AdaBoostClassifier...
Evaluation completed for model AdaBoostClassifier.
Model AdaBoostClassifier trained and evaluated on fold 2.
Training model GradientBoostingClassifier on fold 2...
Evaluating model GradientBoostingClassifier...
Evaluation completed for model GradientBoostingClassifier.
Model GradientBoostingClassifier trained and evaluated on fold 2.
Training model ExtraTreesClassifier on fold 2...
Evaluating model ExtraTreesClassifier...
Evaluation completed for model ExtraTreesClassifier.
Model ExtraTreesClassifier trained and evaluated on fold 2.
Training model XGBClassifier on fold 2...
Evaluating model XGBClassifier...
Evaluation completed for model XGBClassifier.
Model XGBClassifier trained and evaluated on fold 2.
Training model LGBMClassifier on fold 2...
[LightGBM] [Info] Number of positive: 186591, number of negative: 1231378
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.018100 seconds.
You can set `force_row_wis

Evaluating model LogisticRegression...
Evaluation completed for model LogisticRegression.
Model LogisticRegression trained and evaluated on fold 3.
Training model DecisionTreeClassifier on fold 3...
Evaluating model DecisionTreeClassifier...
Evaluation completed for model DecisionTreeClassifier.
Model DecisionTreeClassifier trained and evaluated on fold 3.
Training model RandomForestClassifier on fold 3...
Evaluating model RandomForestClassifier...
Evaluation completed for model RandomForestClassifier.
Model RandomForestClassifier trained and evaluated on fold 3.
Training model AdaBoostClassifier on fold 3...




Evaluating model AdaBoostClassifier...
Evaluation completed for model AdaBoostClassifier.
Model AdaBoostClassifier trained and evaluated on fold 3.
Training model GradientBoostingClassifier on fold 3...
Evaluating model GradientBoostingClassifier...
Evaluation completed for model GradientBoostingClassifier.
Model GradientBoostingClassifier trained and evaluated on fold 3.
Training model ExtraTreesClassifier on fold 3...
Evaluating model ExtraTreesClassifier...
Evaluation completed for model ExtraTreesClassifier.
Model ExtraTreesClassifier trained and evaluated on fold 3.
Training model XGBClassifier on fold 3...
Evaluating model XGBClassifier...
Evaluation completed for model XGBClassifier.
Model XGBClassifier trained and evaluated on fold 3.
Training model LGBMClassifier on fold 3...
[LightGBM] [Info] Number of positive: 186592, number of negative: 1231378
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.017978 seconds.
You can set `force_row_wis

Evaluating model LogisticRegression...
Evaluation completed for model LogisticRegression.
Model LogisticRegression trained and evaluated on fold 4.
Training model DecisionTreeClassifier on fold 4...
Evaluating model DecisionTreeClassifier...
Evaluation completed for model DecisionTreeClassifier.
Model DecisionTreeClassifier trained and evaluated on fold 4.
Training model RandomForestClassifier on fold 4...
Evaluating model RandomForestClassifier...
Evaluation completed for model RandomForestClassifier.
Model RandomForestClassifier trained and evaluated on fold 4.
Training model AdaBoostClassifier on fold 4...




Evaluating model AdaBoostClassifier...
Evaluation completed for model AdaBoostClassifier.
Model AdaBoostClassifier trained and evaluated on fold 4.
Training model GradientBoostingClassifier on fold 4...
Evaluating model GradientBoostingClassifier...
Evaluation completed for model GradientBoostingClassifier.
Model GradientBoostingClassifier trained and evaluated on fold 4.
Training model ExtraTreesClassifier on fold 4...
Evaluating model ExtraTreesClassifier...
Evaluation completed for model ExtraTreesClassifier.
Model ExtraTreesClassifier trained and evaluated on fold 4.
Training model XGBClassifier on fold 4...
Evaluating model XGBClassifier...
Evaluation completed for model XGBClassifier.
Model XGBClassifier trained and evaluated on fold 4.
Training model LGBMClassifier on fold 4...
[LightGBM] [Info] Number of positive: 186591, number of negative: 1231379
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.032052 seconds.
You can set `force_row_wis

Evaluating model LogisticRegression...
Evaluation completed for model LogisticRegression.
Model LogisticRegression trained and evaluated on fold 5.
Training model DecisionTreeClassifier on fold 5...
Evaluating model DecisionTreeClassifier...
Evaluation completed for model DecisionTreeClassifier.
Model DecisionTreeClassifier trained and evaluated on fold 5.
Training model RandomForestClassifier on fold 5...
Evaluating model RandomForestClassifier...
Evaluation completed for model RandomForestClassifier.
Model RandomForestClassifier trained and evaluated on fold 5.
Training model AdaBoostClassifier on fold 5...




Evaluating model AdaBoostClassifier...
Evaluation completed for model AdaBoostClassifier.
Model AdaBoostClassifier trained and evaluated on fold 5.
Training model GradientBoostingClassifier on fold 5...
Evaluating model GradientBoostingClassifier...
Evaluation completed for model GradientBoostingClassifier.
Model GradientBoostingClassifier trained and evaluated on fold 5.
Training model ExtraTreesClassifier on fold 5...
Evaluating model ExtraTreesClassifier...
Evaluation completed for model ExtraTreesClassifier.
Model ExtraTreesClassifier trained and evaluated on fold 5.
Training model XGBClassifier on fold 5...
Evaluating model XGBClassifier...
Evaluation completed for model XGBClassifier.
Model XGBClassifier trained and evaluated on fold 5.
Training model LGBMClassifier on fold 5...
[LightGBM] [Info] Number of positive: 186591, number of negative: 1231379
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.019544 seconds.
You can set `force_row_wis

|  | model | brier_score_loss | log_loss | f1_score | recall | precision | roc_auc | balanced_accuracy | fold |
|---|---|---|---|---|---|---|---|---|---|
| 0 | KNeighborsClassifier | 0.07084 | 0.938649 | 0.600583 | 0.514213 | 0.721826 | 0.866402 | 0.742092 | 1 |
| 1 | LogisticRegression | 0.123089 | 0.486483 | 0.0 | 0.0 | 0.0 | 0.710773 | 0.5 | 1 |
| 2 | DecisionTreeClassifier | 0.001142 | 0.041179 | 0.995665 | 0.996934 | 0.994398 | 0.998042 | 0.998042 | 1 |
| 3 | RandomForestClassifier | 0.004705 | 0.023925 | 0.983471 | 0.968745 | 0.998652 | 0.999889 | 0.984273 | 1 |
| 4 | AdaBoostClassifier | 0.19591 | 0.583898 | 0.810404 | 0.683609 | 0.994946 | 0.984377 | 0.841541 | 1 |
| 5 | GradientBoostingClassifier | 0.01917 | 0.078852 | 0.914443 | 0.845009 | 0.99631 | 0.997023 | 0.922268 | 1 |
| 6 | ExtraTreesClassifier | 0.015798 | 0.068478 | 0.932512 | 0.876329 | 0.996393 | 0.99875 | 0.937924 | 1 |
| 7 | XGBClassifier | 0.007049 | 0.023567 | 0.962987 | 0.995005 | 0.932965 | 0.999767 | 0.992086 | 1 |
| 8 | LGBMClassifier | 0.005511 | 0.01973 | 0.972248 | 0.998114 | 0.94769 | 0.999902 | 0.994883 | 1 |
| 9 | CatBoostClassifier | 0.001096 | 0.004802 | 0.9942 | 0.99955 | 0.988908 | 0.999994 | 0.998925 | 1 |
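
The fold 1 scores above are far higher than the hold-out scores from `pipeline.run` (CatBoostClassifier, for example, reaches a ROC AUC of 0.999994 here versus 0.408109 on the test set). One possible explanation, which this notebook does not confirm, is that bids from the same participant land in both the training and validation parts of a fold. The sketch below shows how a group-aware split could rule that out. It uses scikit-learn's `GroupKFold` rather than the `MLPipeline` helper, and the `groups` array standing in for `id_participante` is an assumption.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 12 bids from 6 participants (two bids each); labels are per participant.
groups = np.repeat(np.arange(6), 2)
X = np.arange(12).reshape(-1, 1)
y = np.repeat([0, 0, 0, 0, 1, 1], 2)

splitter = GroupKFold(n_splits=3)
for fold, (train_idx, val_idx) in enumerate(splitter.split(X, y, groups), start=1):
    shared = set(groups[train_idx]) & set(groups[val_idx])
    # With a group-aware split, no participant appears on both sides of a fold.
    print(f"fold {fold}: participants shared between train and validation = {len(shared)}")
```

If scores under a split like this drop toward the hold-out numbers, participant-level leakage would be a plausible cause of the gap.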


In [9]:
metrics_cv

|  | model | brier_score_loss_mean | brier_score_loss_std | log_loss_mean | log_loss_std | f1_score_mean | f1_score_std | recall_mean | recall_std | precision_mean | precision_std | roc_auc_mean | roc_auc_std | balanced_accuracy_mean | balanced_accuracy_std | fold_mean | fold_std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AdaBoostClassifier | 0.194713 | 0.001654 | 0.581418 | 0.003426 | 0.814009 | 0.006819 | 0.691535 | 0.013665 | 0.989553 | 0.007815 | 0.984804 | 0.000741 | 0.845205 | 0.006405467 | 3.0 | 1.581139 |
| 1 | CatBoostClassifier | 0.001087 | 0.000183 | 0.004589 | 0.000697 | 0.994243 | 0.00084 | 0.999507 | 0.00013 | 0.989035 | 0.001638 | 0.999993 | 3e-06 | 0.998914 | 0.0001514347 | 3.0 | 1.581139 |
| 2 | DecisionTreeClassifier | 0.0012 | 9.2e-05 | 0.043253 | 0.003317 | 0.995445 | 0.000349 | 0.996373 | 0.000494 | 0.994518 | 0.000644 | 0.99777 | 0.000238 | 0.99777 | 0.0002380565 | 3.0 | 1.581139 |
| 3 | ExtraTreesClassifier | 0.015569 | 0.000208 | 0.067926 | 0.000637 | 0.933674 | 0.001311 | 0.878176 | 0.002302 | 0.996662 | 0.000282 | 0.998793 | 5.6e-05 | 0.938865 | 0.001151351 | 3.0 | 1.581139 |
| 4 | GradientBoostingClassifier | 0.018997 | 0.000697 | 0.078762 | 0.001868 | 0.913559 | 0.004147 | 0.84332 | 0.006895 | 0.996589 | 0.000379 | 0.997194 | 0.000314 | 0.921441 | 0.003458907 | 3.0 | 1.581139 |
| 5 | KNeighborsClassifier | 0.071237 | 0.000321 | 0.942114 | 0.004345 | 0.597075 | 0.002466 | 0.509559 | 0.003131 | 0.720897 | 0.002111 | 0.865514 | 0.000872 | 0.739832 | 0.00152847 | 3.0 | 1.581139 |
| 6 | LGBMClassifier | 0.005257 | 0.000257 | 0.019187 | 0.000719 | 0.973762 | 0.001627 | 0.99741 | 0.000657 | 0.951216 | 0.003504 | 0.999903 | 1e-05 | 0.994829 | 0.0002397377 | 3.0 | 1.581139 |
| 7 | LogisticRegression | 0.12312 | 0.000119 | 0.486628 | 0.000309 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.709692 | 0.002618 | 0.5 | 7.263616e-07 | 3.0 | 1.581139 |
| 8 | RandomForestClassifier | 0.004671 | 6.5e-05 | 0.023802 | 0.000201 | 0.983996 | 0.0006 | 0.969589 | 0.001079 | 0.998838 | 0.000136 | 0.999896 | 5e-06 | 0.984709 | 0.0005459295 | 3.0 | 1.581139 |
| 9 | XGBClassifier | 0.007738 | 0.000451 | 0.026277 | 0.001664 | 0.959879 | 0.002131 | 0.994628 | 0.000393 | 0.927482 | 0.003938 | 0.999711 | 3.7e-05 | 0.991421 | 0.0004091379 | 3.0 | 1.581139 |
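
How `run_cv` builds this summary is not shown, but the column names and values are consistent with a per-model mean and sample standard deviation over the five folds. The sketch below reproduces that aggregation with pandas on made-up per-fold scores. The toy values and the model names `A` and `B` are assumptions.

```python
import pandas as pd

# Toy per-fold results with the same layout as the per-fold table (values are made up).
per_fold = pd.DataFrame(
    {
        "model": ["A"] * 5 + ["B"] * 5,
        "f1_score": [0.8, 0.9, 0.7, 0.8, 0.8, 0.5, 0.5, 0.5, 0.5, 0.5],
        "fold": [1, 2, 3, 4, 5] * 2,
    }
)

# Mean and sample standard deviation per model across folds.
summary = per_fold.groupby("model").agg(["mean", "std"])

# Flatten the two-level column index into names such as f1_score_mean.
summary.columns = ["_".join(col) for col in summary.columns]
summary = summary.reset_index()
print(summary)
```

Because the `fold` column is aggregated along with the metrics, the result contains `fold_mean` = 3.0 and `fold_std` ≈ 1.581139 for every model, exactly as in the table above. Those two columns carry no information about model quality; dropping `fold` before aggregating would remove them.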
