<a href="https://colab.research.google.com/github/Kaiziferr/-Miner_Detector/blob/master/05_booster_use_gbtree_gblinear.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np

from sklearn.datasets import (
    make_classification,
    make_friedman1,
    make_regression)

from sklearn.model_selection import (
    GridSearchCV,
    train_test_split)

from sklearn.metrics import (
    make_scorer,
    recall_score,
    f1_score
)

from xgboost import XGBClassifier, XGBRegressor

# **Info**

---


@By: Steven Bernal

@Nickname: Kaiziferr

@Git: https://github.com/Kaiziferr

# **Functions**
---

In [2]:
def split_test_train(
    X: pd.DataFrame,
    y: pd.Series,
    **kwargs) -> tuple:
  """Randomly split the data into training and validation sets.

  Any extra keyword arguments are forwarded to `train_test_split`.
  """
  X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    **kwargs
  )
  return X_train, X_test, y_train, y_test

# **Config**
---




In [3]:
random_seed=73

# **Data**
---

This notebook compares two settings of the XGBoost `booster` hyperparameter, `gbtree` and `gblinear`, on synthetic classification and regression datasets. For classification, the classes are imbalanced; for regression, one dataset has a nonlinear relationship and the other a linear one.

## **Config Metric**

In [4]:
scoring_clasification = make_scorer(
    f1_score,
    labels = [0,1,2,3,4],
    average='micro'

)
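One property of this scorer is worth keeping in mind: for single-label multiclass predictions where every class appears in `labels`, micro-averaged F1 works out to the same number as plain accuracy, because each misclassified sample counts once as a false positive and once as a false negative. The sketch below checks this by hand; the helper `micro_f1` and the toy labels are illustrative assumptions, not part of the notebook.

```python
def micro_f1(y_true, y_pred, labels):
    """Micro-averaged F1: pool TP/FP/FN over all labels, then compute F1."""
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1  # predicted this label, but the truth differs
            elif truth == label:
                fn += 1  # missed a sample that carries this label
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels for the five classes used in this notebook.
y_true = [0, 1, 2, 3, 4, 2, 2, 0]
y_pred = [0, 1, 2, 3, 0, 2, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = micro_f1(y_true, y_pred, labels=[0, 1, 2, 3, 4])
print(accuracy, score)
```

Since every sample carries equal weight, this score is driven mostly by the larger classes. If performance on the minority classes matters, a macro average would be one alternative to consider.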

The parameter grid is defined with:

- n_estimators: number of boosting rounds
- learning_rate: learning rate (shrinkage applied at each boosting round)
- booster: type of booster (`gbtree` or `gblinear`)

In [5]:
dict_params = {
    "n_estimators": [100, 300, 600, 900],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "booster": ["gbtree", "gblinear"]
}
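To get a feel for the cost of an exhaustive search over this grid, the candidates can be enumerated with the standard library. This is a rough sketch: it assumes the 5-fold cross-validation and `refit=True` used later in this notebook, and the order of the combinations may differ from the order scikit-learn uses internally.

```python
from itertools import product

dict_params = {
    "n_estimators": [100, 300, 600, 900],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "booster": ["gbtree", "gblinear"],
}

# Every combination of the listed values is one candidate configuration.
candidates = [
    dict(zip(dict_params, values))
    for values in product(*dict_params.values())
]

n_folds = 5  # assumption: matches the cv=5 used later in this notebook
n_fits = len(candidates) * n_folds + 1  # +1 for the final refit (refit=True)

print(len(candidates), n_fits)
```

The candidate count is consistent with the row indices seen in the result tables, which run from 0 to 31.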

## **Classification**
---

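A synthetic dataset with five classes and 1,500 records is generated. About 5% of the labels are assigned at random (`flip_y = 0.05`), which introduces label noise. There are 10 features: 8 informative and 2 redundant. The classes are imbalanced, with weights ranging from 0.10 to 0.30.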

In [6]:
X, y = make_classification(
    n_samples = 1500,
    n_features = 10,
    n_informative = 8,
    n_redundant = 2,
    n_classes = 5,
    flip_y = 0.05,
    weights = [0.25, 0.15, 0.30, 0.20, 0.10],
    random_state = random_seed
)
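As a quick sanity check on the degree of imbalance, the class sizes implied by `weights` can be computed directly. These are expected values only: the counts actually returned by `make_classification` will differ somewhat because of `flip_y` and internal rounding.

```python
n_samples = 1500
weights = [0.25, 0.15, 0.30, 0.20, 0.10]

# Expected samples per class before any label flipping is applied.
expected_counts = [round(n_samples * w) for w in weights]

# Ratio between the largest and the smallest class.
imbalance_ratio = max(expected_counts) / min(expected_counts)

print(expected_counts, imbalance_ratio)
```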

Randomly split the data into training and validation sets.

In [7]:
X_train, X_test, y_train, y_test = split_test_train(X, y, **{
    "test_size": 0.25,
    "shuffle": True,
    "stratify": y,
    "random_state": random_seed
})
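The `stratify` argument asks for the class proportions to be preserved in both subsets. The idea can be sketched in plain Python as below. This is not scikit-learn's implementation, only an illustration of the principle; the function `stratified_split` and the idealised label list are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size, seed):
    """Split indices so each class contributes about test_size of its samples."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within the class
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Idealised labels matching the class weights (label noise ignored).
labels = [0] * 375 + [1] * 225 + [2] * 450 + [3] * 300 + [4] * 150
train_idx, test_idx = stratified_split(labels, test_size=0.25, seed=73)

class_totals = Counter(labels)
test_counts = Counter(labels[i] for i in test_idx)
test_fractions = {c: test_counts[c] / class_totals[c] for c in class_totals}
print(len(train_idx), len(test_idx), test_fractions)
```

With imbalanced classes, a purely random split could leave the smallest class under-represented in the test set; stratifying avoids that.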

The model is instantiated.

In [8]:
model = XGBClassifier(
    random_state = random_seed
)

The `GridSearchCV` is defined with 5-fold cross-validation and the micro-averaged F1 scorer configured above.

In [9]:
grid = GridSearchCV(
    model,
    dict_params,
    refit=True,
    scoring=scoring_clasification,
    cv=5,
    return_train_score=True,
    verbose = 0
)

The grid search is fitted on the training set.

In [10]:
grid.fit(X_train, y_train)

In [11]:
resultados = pd.DataFrame(grid.cv_results_)
resultados.filter(regex = '(param.*|mean_t|std_t)')\
    .drop(columns = 'params')\
    .sort_values('mean_test_score', ascending = False).head()

| | param_booster | param_learning_rate | param_n_estimators | mean_test_score | std_test_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|
| 12 | gbtree | 0.2 | 100 | 0.728 | 0.036585 | 1.0 | 0.0 |
| 10 | gbtree | 0.1 | 600 | 0.722667 | 0.045724 | 1.0 | 0.0 |
| 9 | gbtree | 0.1 | 300 | 0.721778 | 0.039313 | 1.0 | 0.0 |
| 14 | gbtree | 0.2 | 600 | 0.72 | 0.038848 | 1.0 | 0.0 |
| 13 | gbtree | 0.2 | 300 | 0.72 | 0.036325 | 1.0 | 0.0 |
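The column selection in the cell above relies on `DataFrame.filter(regex=...)`, which keeps every label where `re.search` finds a match. The sketch below reproduces that selection on a plain list of column names of the kind found in `cv_results_` (split-level columns are abbreviated to the first fold, which is an assumption made for brevity).

```python
import re

# Column names of the kind found in GridSearchCV.cv_results_ for this setup.
columns = [
    "mean_fit_time", "std_fit_time", "mean_score_time", "std_score_time",
    "param_booster", "param_learning_rate", "param_n_estimators", "params",
    "split0_test_score", "mean_test_score", "std_test_score",
    "rank_test_score", "split0_train_score", "mean_train_score",
    "std_train_score",
]

pattern = re.compile(r"(param.*|mean_t|std_t)")

# Keep the labels for which re.search finds a match anywhere in the name.
kept = [name for name in columns if pattern.search(name)]
kept.remove("params")  # mirrors .drop(columns='params')

print(kept)
```

The `param.*` alternative also matches the `params` dictionary column, which is why the notebook drops it explicitly. The timing columns are excluded because `mean_t` and `std_t` do not occur in names such as `mean_fit_time`.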


In [12]:
print("-----------------------------------")
print("Best hyperparameters found")
print("-----------------------------------")
print(f"{grid.best_params_} : {grid.best_score_} ({grid.scoring})")

-----------------------------------
Best hyperparameters found
-----------------------------------
{'booster': 'gbtree', 'learning_rate': 0.2, 'n_estimators': 100} : 0.728 (make_scorer(f1_score, response_method='predict', labels=[0, 1, 2, 3, 4], average=micro))


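The top configurations all reach a mean training score of 1.0 while scoring about 0.72 on the held-out folds, so they fit the training folds perfectly and generalize noticeably worse. For the classification problem, the best-performing booster is `gbtree`, with a mean cross-validated F1 score of: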

In [13]:
grid.best_score_

0.728

# **Regression**
---



**Nonlinear**

Synthetic data is generated for a regression problem with a nonlinear relationship (the Friedman #1 benchmark). There are 1,500 records and 10 features, and Gaussian noise with a standard deviation of 3 is added to the target.

In [14]:
X, y = make_friedman1(
    n_samples = 1500,
    n_features = 10,
    noise=3,
    random_state = random_seed
)
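For reference, the Friedman #1 target depends on only the first five features, through the formula documented for `make_friedman1`; the remaining features are independent of the target. The NumPy sketch below writes the formula out. It will not reproduce the exact samples drawn by scikit-learn, and the names `friedman1_signal`, `X_demo` and `y_demo` are illustrative.

```python
import numpy as np


def friedman1_signal(X):
    """Noise-free Friedman #1 target; only columns 0-4 of X are used."""
    return (
        10 * np.sin(np.pi * X[:, 0] * X[:, 1])
        + 20 * (X[:, 2] - 0.5) ** 2
        + 10 * X[:, 3]
        + 5 * X[:, 4]
    )


rng = np.random.default_rng(73)          # seed chosen to echo random_seed
X_demo = rng.uniform(size=(1500, 10))    # features uniform on [0, 1]
y_demo = friedman1_signal(X_demo) + 3 * rng.standard_normal(1500)

# Replacing the last five columns leaves the noise-free signal untouched.
X_changed = X_demo.copy()
X_changed[:, 5:] = rng.uniform(size=(1500, 5))
unchanged = np.allclose(friedman1_signal(X_demo), friedman1_signal(X_changed))
print(unchanged)
```

The sine and squared terms are what make the relationship nonlinear, which is the property this part of the notebook uses to contrast `gbtree` with `gblinear`.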

Randomly split the data into training and validation sets.

In [15]:
X_train, X_test, y_train, y_test = split_test_train(X, y, **{
    "test_size": 0.25,
    "random_state": random_seed
})

The model is instantiated.

In [16]:
model = XGBRegressor(
    random_state = random_seed
)

The 'GridSearchCV' is defined with 5-fold cross-validation and a neg_mean_absolute_error metric.

In [17]:
grid = GridSearchCV(
    model,
    dict_params,
    refit=True,
    scoring="neg_mean_absolute_error",
    cv=5,
    return_train_score=True,
    verbose = 0
)
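The scorer name carries a `neg_` prefix because scikit-learn's model selection tools maximise the score, so an error metric is negated to make "larger is better" hold. The plain-Python sketch below illustrates the convention; the function is a hand-written stand-in and the numbers are made up.

```python
def neg_mean_absolute_error(y_true, y_pred):
    """Negated MAE, so that a larger value means a better model."""
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return -sum(errors) / len(errors)


# Hypothetical targets and two sets of predictions.
y_true = [3.0, 5.0, 8.0]
close_pred = [2.5, 5.5, 8.0]  # absolute errors: 0.5, 0.5, 0.0
far_pred = [1.0, 9.0, 8.0]    # absolute errors: 2.0, 4.0, 0.0

scores = {
    "close": neg_mean_absolute_error(y_true, close_pred),
    "far": neg_mean_absolute_error(y_true, far_pred),
}

best = max(scores, key=scores.get)  # maximising picks the smaller error
best_mae = -scores[best]            # negate again to report a plain MAE
print(best, best_mae)
```

This is why the notebook multiplies `grid.best_score_` by -1 before reporting it.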

The grid search is fitted on the training set.

In [18]:
grid.fit(X_train, y_train)

In [19]:
resultados = pd.DataFrame(grid.cv_results_)
resultados.filter(regex = '(param.*|mean_t|std_t)')\
    .drop(columns = 'params')\
    .sort_values('mean_test_score', ascending = False).head()

| | param_booster | param_learning_rate | param_n_estimators | mean_test_score | std_test_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|
| 5 | gbtree | 0.05 | 300 | -2.77167 | 0.160466 | -0.345594 | 0.043617 |
| 2 | gbtree | 0.01 | 600 | -2.776635 | 0.166085 | -0.84636 | 0.037368 |
| 4 | gbtree | 0.05 | 100 | -2.776953 | 0.137588 | -0.933703 | 0.036628 |
| 3 | gbtree | 0.01 | 900 | -2.779367 | 0.181602 | -0.634394 | 0.045391 |
| 8 | gbtree | 0.1 | 100 | -2.779665 | 0.167668 | -0.537783 | 0.04615 |


In [20]:
print("-----------------------------------")
print("Best hyperparameters found")
print("-----------------------------------")
print(f"{grid.best_params_} : {grid.best_score_} ({grid.scoring})")

-----------------------------------
Best hyperparameters found
-----------------------------------
{'booster': 'gbtree', 'learning_rate': 0.05, 'n_estimators': 300} : -2.7716704377998003 (neg_mean_absolute_error)


For the nonlinear regression problem, the best-performing booster is `gbtree`, with a mean absolute error (the negated `neg_mean_absolute_error` score) of:

In [21]:
-1*grid.best_score_

2.7716704377998003

**Linear**

Synthetic data is generated for a linear regression problem. The dataset consists of 1,500 records and 10 features, of which 8 are informative, with Gaussian noise of standard deviation 0.1 added to a single target.

In [23]:
X, y = make_regression(
    n_samples = 1500,
    n_features = 10,
    n_informative = 8,
    n_targets = 1,
    bias = 0.2,
    tail_strength = 0.3,
    noise = 0.1,
    random_state = random_seed
)

Randomly split the data into training and validation sets.

In [24]:
X_train, X_test, y_train, y_test = split_test_train(X, y, **{
    "test_size": 0.25,
    "random_state": random_seed
})

The `GridSearchCV` is defined with 5-fold cross-validation and the `neg_mean_absolute_error` metric. The `XGBRegressor` instantiated for the nonlinear case is reused as the base estimator.

In [25]:
grid = GridSearchCV(
    model,
    dict_params,
    refit=True,
    scoring="neg_mean_absolute_error",
    cv=5,
    return_train_score=True,
    verbose = 0
)

The grid search is fitted on the training set.

In [26]:
grid.fit(X_train, y_train)

For the linear regression problem, the best-performing booster is `gblinear`. The top-ranked configurations and the corresponding mean absolute error (the negated `neg_mean_absolute_error` score) are:

In [27]:
resultados = pd.DataFrame(grid.cv_results_)
resultados.filter(regex = '(param.*|mean_t|std_t)')\
    .drop(columns = 'params')\
    .sort_values('mean_test_score', ascending = False).head()

| | param_booster | param_learning_rate | param_n_estimators | mean_test_score | std_test_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|
| 23 | gblinear | 0.05 | 900 | -0.076779 | 0.003088 | -0.075858 | 0.000765 |
| 22 | gblinear | 0.05 | 600 | -0.076779 | 0.003088 | -0.075858 | 0.000765 |
| 29 | gblinear | 0.2 | 300 | -0.076779 | 0.003087 | -0.075856 | 0.000764 |
| 30 | gblinear | 0.2 | 600 | -0.076779 | 0.003087 | -0.075856 | 0.000764 |
| 31 | gblinear | 0.2 | 900 | -0.076779 | 0.003087 | -0.075856 | 0.000764 |


In [28]:
-1*grid.best_score_

0.0767785912184936
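One rough way to read these MAE values is to compare them with the error that the added noise alone would cause. For Gaussian noise with standard deviation sigma, the expected absolute value is sigma * sqrt(2 / pi), which is about what a model that recovered the underlying signal exactly would score. The sketch below evaluates this for the two noise levels used in this notebook and checks the closed form with a small simulation; the helper name and the simulation seed are arbitrary choices.

```python
import math
import random


def gaussian_noise_mae(sigma):
    """Expected |e| for e ~ N(0, sigma^2): sigma * sqrt(2 / pi)."""
    return sigma * math.sqrt(2 / math.pi)


floor_linear = gaussian_noise_mae(0.1)     # noise used with make_regression
floor_nonlinear = gaussian_noise_mae(3.0)  # noise used with make_friedman1

# Monte Carlo check of the closed form (seed is arbitrary).
rng = random.Random(0)
n_draws = 100_000
simulated = sum(abs(rng.gauss(0.0, 0.1)) for _ in range(n_draws)) / n_draws

print(round(floor_linear, 4), round(floor_nonlinear, 3), round(simulated, 4))
```

Under this reading, the `gblinear` MAE of about 0.077 on the linear data sits close to the noise level of roughly 0.080, which suggests the linear booster captures nearly all of the signal. The `gbtree` MAE of about 2.77 on the Friedman data is above the corresponding level of roughly 2.39, so some of the nonlinear signal may remain unexplained. Cross-validated estimates fluctuate from sample to sample, so these comparisons are indicative rather than exact.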