
In [None]:
import warnings

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

from sklearn.model_selection import (
    train_test_split,
    GridSearchCV,
    ParameterGrid)

from sklearn.datasets import make_regression, make_friedman1, make_classification
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    recall_score
)

This project demonstrates how to use the out-of-bag (OOB) score to evaluate a random forest.


# **Info**
---
@By: **Steven Bernal**

@Nickname: **Kaiziferr**

@Git: https://github.com/Kaiziferr

# **Config**
---

In [None]:
sns.set(style="darkgrid")
pd.set_option('display.float_format', '{:,.5f}'.format)
random_seed = 12354
warnings.filterwarnings('ignore')

A synthetic non-linear dataset is generated for the regression problem, and a separate synthetic dataset for the classification problem. Because this is a proof of concept, synthetic data keeps processing time low.

# **Regression**
---

## **Data**
---

The regression data comes from the make_friedman1 function in scikit-learn, which generates a non-linear dataset.



In [None]:
X, y = make_friedman1(
    n_samples=1000,
    n_features=8,
    noise=1.8,
    random_state=random_seed)
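
As a hedged aside that is not part of the original notebook, the non-linearity can be made explicit. According to the scikit-learn documentation, make_friedman1 builds the target from the first five columns only, and the remaining columns are pure noise features. The sketch below uses noise=0.0 and its own small sample (the names X_demo, y_demo and y_formula are illustrative) so the formula can be checked exactly.

```python
import numpy as np
from sklearn.datasets import make_friedman1

# noise=0.0 removes the Gaussian term so the target can be compared exactly.
X_demo, y_demo = make_friedman1(
    n_samples=200, n_features=8, noise=0.0, random_state=0
)

# Friedman #1 target: only columns 0..4 contribute, columns 5..7 are irrelevant.
y_formula = (
    10 * np.sin(np.pi * X_demo[:, 0] * X_demo[:, 1])
    + 20 * (X_demo[:, 2] - 0.5) ** 2
    + 10 * X_demo[:, 3]
    + 5 * X_demo[:, 4]
)

formula_matches = np.allclose(y_demo, y_formula)
print(formula_matches)
```

With noise=1.8, as in the cell above, a Gaussian term with that standard deviation is added on top of this deterministic part.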

In [None]:
pd.DataFrame(X).head()

Unnamed: 0,0,1,2,3,4,5,6,7
0,0.55196,0.10971,0.02975,0.42802,0.56019,0.79467,0.46556,0.34588
1,0.33711,0.20498,0.45069,0.14943,0.78478,0.29625,0.86917,0.4528
2,0.21115,0.90488,0.33384,0.78674,0.49532,0.44739,0.8207,0.3717
3,0.47144,0.02144,0.23761,0.70976,0.57599,0.41125,0.71222,0.16422
4,0.55229,0.84667,0.78529,0.98003,0.8633,0.05351,0.08885,0.50807


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    train_size=0.8,
    random_state=random_seed)

## **Model**

**Apply Cross Validation**

Cross-validation can be used to validate the performance of the random forest; however, it can be computationally expensive because every parameter combination is fitted once per fold. The search below covers the parameters I usually prioritize:

- n_estimators: number of estimators (trees) in the forest
- max_features: number of features considered when looking for the best split (the candidates are drawn at random)
- criterion: function used to measure the quality of a split

These are the ones I typically tune, but the right choice depends on the context of the problem and what I want to find.




In [None]:
dict_params = ParameterGrid(
    {
        "n_estimators": [50, 100, 150, 200],
        'max_features': [0.75, None, 'sqrt', 'log2'],
        'criterion': ['squared_error', 'friedman_mse', 'absolute_error']
    }
)

In [None]:
dict_params.param_grid[0]

{'n_estimators': [50, 100, 150, 200],
 'max_features': [0.75, None, 'sqrt', 'log2'],
 'criterion': ['squared_error', 'friedman_mse', 'absolute_error']}

The search is run with GridSearchCV using five-fold cross-validation.
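
To make the computational cost concrete, here is a rough count of model fits for this grid. It is a sketch that rebuilds the same grid under illustrative names (grid_demo, n_folds) and assumes refit=True, as in the search cell.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the regression search: 4 x 4 x 3 combinations.
grid_demo = ParameterGrid(
    {
        "n_estimators": [50, 100, 150, 200],
        "max_features": [0.75, None, "sqrt", "log2"],
        "criterion": ["squared_error", "friedman_mse", "absolute_error"],
    }
)

n_candidates = len(grid_demo)
n_folds = 5

# Cross-validation fits every candidate once per fold, plus one final refit.
cv_fits = n_candidates * n_folds + 1
# An OOB-based search fits every candidate exactly once.
oob_fits = n_candidates

print(n_candidates, cv_fits, oob_fits)  # 48 241 48
```

The OOB approach therefore needs roughly a fifth of the fits, at the price of an estimate that is only available for bagged models.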

In [None]:
grid = GridSearchCV(
    estimator=RandomForestRegressor(
        n_jobs = -1,
        random_state = random_seed,

    ),
    cv = 5,
    param_grid  = dict_params.param_grid[0],
    scoring = "neg_root_mean_squared_error",
    refit      = True,
    verbose    = 0,
    return_train_score = True,
  )
grid.fit(X_train, y_train)

In [None]:
results = pd.DataFrame(grid.cv_results_)
results = results.filter(regex = '(param.*|mean_t|std_t)') \
    .drop(columns = 'params') \
    .sort_values('mean_test_score', ascending = False) \
    .head(4)

results

Unnamed: 0,param_criterion,param_max_features,param_n_estimators,mean_test_score,std_test_score,mean_train_score,std_train_score
30,friedman_mse,log2,150,-2.55307,0.15691,-0.95681,0.0096
31,friedman_mse,log2,200,-2.55344,0.15825,-0.95166,0.00759
14,squared_error,log2,150,-2.55589,0.15466,-0.95696,0.0096
15,squared_error,log2,200,-2.55632,0.15678,-0.95208,0.00761


The best hyperparameters are:
- criterion: friedman_mse
- max_features: log2
- n_estimators: 150

This combination has the lowest mean cross-validated RMSE, 2.55307. The scorer returns the negated error, which is why the table shows -2.55307.

In [None]:
-1*grid.best_score_

2.5530737977107893

**Apply Oob score**

- To obtain the OOB score, the oob_score parameter must be set to True.
- The default metric for regression is the coefficient of determination (R²).
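
For context, the OOB score exists because each tree is trained on a bootstrap sample, so some training rows are left out of that tree and can serve as a validation set for it. The simulation below is a minimal sketch of a single bootstrap draw, assuming bootstrap=True (the scikit-learn default); the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

# One bootstrap sample: n_samples row indices drawn with replacement.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Mark the rows that were drawn at least once (the in-bag rows).
in_bag = np.zeros(n_samples, dtype=bool)
in_bag[bootstrap_idx] = True

# Rows never drawn are out-of-bag for this tree.
oob_fraction = 1.0 - in_bag.mean()

# Theoretical value: (1 - 1/n)^n, which tends to exp(-1), about 0.368.
expected_fraction = (1 - 1 / n_samples) ** n_samples

print(round(oob_fraction, 3), round(expected_fraction, 3))
```

Roughly a third of the rows are out-of-bag for any given tree, which is why the OOB estimate needs enough trees for every row to be left out at least once.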

In [None]:
results = {
    'params': [],
    'oob_r2': []
}

for params in dict_params:
  model_oobscore = RandomForestRegressor(
      oob_score = True,
      n_jobs = -1,
      random_state = random_seed,
      **params
  )
  model_oobscore.fit(X_train, y_train)
  results['params'].append(params)
  results['oob_r2'].append(model_oobscore.oob_score_)

In [None]:
results_score = pd.DataFrame(results)
results_score = pd.concat(
    [results_score, results_score['params'].apply(pd.Series)], axis=1
)

results_score = results_score.drop(columns = 'params')
results_score = results_score.sort_values('oob_r2', ascending=False)
results_score.head(4)

Unnamed: 0,oob_r2,criterion,max_features,n_estimators
15,0.77529,squared_error,log2,200
31,0.77473,friedman_mse,log2,200
14,0.77261,squared_error,log2,150
30,0.77221,friedman_mse,log2,150


The best hyperparameters, with an OOB R² of 0.77529, are:
- criterion: squared_error
- max_features: log2
- n_estimators: 200

**Apply Oob score other function**


Although the default metric is R², a custom metric can be supplied by passing a callable to oob_score. Here the callable computes the mean absolute error (MAE), which is then stored in oob_score_. Because MAE is an error measure, lower values are better, so the results are sorted in ascending order.

In [None]:
def metrica_oob_score(y, y_predict, **kwargs):
  # Custom OOB metric: mean absolute error between targets and OOB predictions.
  score = mean_absolute_error(y, y_predict, **kwargs)
  return score
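
As a cross-check that does not rely on passing a callable, a fitted regressor with oob_score=True also exposes oob_prediction_, from which any metric can be computed by hand. This is a sketch on a small illustrative sample (X_demo, y_demo and model_demo are not names from the notebook), not a replacement for the cells that follow.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score

X_demo, y_demo = make_friedman1(
    n_samples=300, n_features=8, noise=1.8, random_state=0
)

# oob_score=True stores per-row OOB predictions in oob_prediction_.
model_demo = RandomForestRegressor(
    n_estimators=100, oob_score=True, random_state=0
)
model_demo.fit(X_demo, y_demo)

# The default OOB score is the R² of those predictions.
oob_r2 = r2_score(y_demo, model_demo.oob_prediction_)
# Any other metric, such as MAE, can be computed from the same predictions.
oob_mae = mean_absolute_error(y_demo, model_demo.oob_prediction_)

print(np.isclose(oob_r2, model_demo.oob_score_), oob_mae)
```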

In [None]:
resultados = {
    'params': [],
    'mae': []
}

In [None]:
for params in dict_params:
  model_oobscore = RandomForestRegressor(
      oob_score       = metrica_oob_score,
      n_jobs          =-1,
      random_state    = random_seed,
      **params
  )

  model_oobscore.fit(X_train, y_train)
  resultados['params'].append(params)
  resultados['mae'].append(model_oobscore.oob_score_)

In [None]:
resultados_scores = pd.DataFrame(resultados)
resultados_scores = pd.concat(
    [resultados_scores, resultados_scores['params'].apply(pd.Series)], axis=1)

resultados_scores = resultados_scores.drop(columns = 'params')
resultados_scores = resultados_scores.sort_values('mae', ascending=True)
resultados_scores.head(4)

Unnamed: 0,mae,criterion,max_features,n_estimators
15,1.97455,squared_error,log2,200
31,1.97553,friedman_mse,log2,200
30,1.98761,friedman_mse,log2,150
14,1.98819,squared_error,log2,150


The best hyperparameters, with the lowest OOB MAE of 1.97455, are:
- criterion: squared_error
- max_features: log2
- n_estimators: 200

# **Classification**

## **Data**
---

A class-imbalanced dataset with three classes is generated. The imbalance motivates replacing the default oob_score metric with a custom one, although the same approach works for any classification problem.
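
The following sketch illustrates why accuracy can mislead on imbalanced data. It uses hypothetical labels that mirror the 50/35/15 class weights and a trivial predictor that always outputs the majority class; none of these names come from the notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels mirroring the 50/35/15 class weights.
y_true_demo = np.array([0] * 50 + [1] * 35 + [2] * 15)
# A trivial predictor that always outputs the majority class.
y_pred_demo = np.zeros_like(y_true_demo)

# Accuracy rewards the majority class: half of the labels are "correct".
baseline_accuracy = accuracy_score(y_true_demo, y_pred_demo)
# Macro recall averages per-class recall (1, 0 and 0), exposing the failure.
baseline_macro_recall = recall_score(y_true_demo, y_pred_demo, average="macro")

print(baseline_accuracy, round(baseline_macro_recall, 3))  # 0.5 0.333
```

Any model whose accuracy is close to the majority-class share should therefore be checked with a class-aware metric.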

In [None]:
X, y = make_classification(
    n_samples = 1000,
    n_features = 10,
    n_informative = 7,
    n_redundant = 2,
    n_repeated = 1,
    n_classes = 3,
    weights = [0.5, 0.35, 0.15],
    class_sep = 0.8,
    random_state=random_seed
)

In [None]:
pd.Series(y).value_counts() / 1000

Unnamed: 0,count
0,0.494
1,0.352
2,0.154


In [None]:
pd.DataFrame(X).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,-0.55972,3.05397,3.05397,3.18851,-1.03085,-0.23431,4.48372,0.15694,-0.81657,4.7132
1,1.58635,0.95818,0.95818,3.65735,4.07767,-1.47404,-2.08,0.98687,-1.78331,-0.60233
2,2.15958,1.33346,1.33346,-0.94225,2.20403,3.13712,0.20985,2.53849,-1.28125,0.48116
3,-0.26474,2.05311,2.05311,1.74693,-1.15899,-0.88624,2.35354,-2.32645,-3.21542,5.13135
4,0.24866,-1.63643,-1.63643,-3.2794,-0.86734,1.16182,-2.69982,1.89413,0.01785,-0.58824


In [None]:
# Capital X_train: the cells below fit on X_train, so the name must match.
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    train_size = 0.8,
    stratify=y,
    random_state = random_seed
)

## **Model**

**Apply Cross Validation**

As in the regression case, cross-validation can be used to validate the performance of the random forest, at the cost of fitting every parameter combination once per fold. The search covers the same parameters:

- n_estimators: number of estimators (trees) in the forest
- max_features: number of features considered when looking for the best split (the candidates are drawn at random)
- criterion: function used to measure the quality of a split

These are the ones I typically tune, but the right choice depends on the context of the problem and what I want to find.

In [None]:
dict_params = ParameterGrid(
    {
        "n_estimators": [50, 100, 150, 200],
        'max_features': [0.75, None, 'sqrt', 'log2'],
        'criterion': ['gini', 'entropy', 'log_loss']
    }
)

In [None]:
grid = GridSearchCV(
    estimator=RandomForestClassifier(
        n_jobs = -1,
        random_state = random_seed,

    ),
    cv = 5,
    param_grid  = dict_params.param_grid[0],
    refit      = True,
    verbose    = 0,
    return_train_score = True,
  )
grid.fit(X_train, y_train)

In [None]:
results = pd.DataFrame(grid.cv_results_)
results = results.filter(regex = '(param.*|mean_t|std_t)') \
    .drop(columns = 'params') \
    .sort_values('mean_test_score', ascending = False) \
    .head(4)
results

Unnamed: 0,param_criterion,param_max_features,param_n_estimators,mean_test_score,std_test_score,mean_train_score,std_train_score
8,gini,sqrt,50,0.475,0.0301,1.0,0.0
5,gini,,100,0.45875,0.02222,1.0,0.0
2,gini,0.75000,150,0.4575,0.02604,1.0,0.0
42,log_loss,sqrt,150,0.45625,0.01936,1.0,0.0


The best hyperparameters, with a mean cross-validated accuracy of 0.475 (GridSearchCV falls back to accuracy for classifiers when no scoring is given), are:
- criterion: gini
- max_features: sqrt
- n_estimators: 50

**Apply Oob score**

For classification problems, the default oob_score metric is accuracy. Accuracy may not be informative for imbalanced problems or when one or a few specific classes matter most. This section nevertheless uses the default metric.
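
To show where this accuracy comes from, the sketch below recomputes it from oob_decision_function_, the per-row class probabilities averaged over the trees for which the row was out-of-bag. The data and names (X_demo, y_demo, clf_demo) are illustrative, and it assumes enough trees that every row is out-of-bag at least once.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X_demo, y_demo = make_classification(
    n_samples=300,
    n_features=10,
    n_informative=7,
    n_redundant=2,
    n_repeated=1,
    n_classes=3,
    weights=[0.5, 0.35, 0.15],
    random_state=0,
)

clf_demo = RandomForestClassifier(
    n_estimators=100, oob_score=True, random_state=0
)
clf_demo.fit(X_demo, y_demo)

# Turn OOB class probabilities into predicted labels.
oob_labels = clf_demo.classes_[
    np.argmax(clf_demo.oob_decision_function_, axis=1)
]
manual_oob_accuracy = accuracy_score(y_demo, oob_labels)

print(np.isclose(manual_oob_accuracy, clf_demo.oob_score_))
```

The same oob_labels array can be passed to any other classification metric when accuracy is not the quantity of interest.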

In [None]:
results = {
    'params': [],
    'oob_score': []
}

for params in dict_params:
  model_oobscore = RandomForestClassifier(
      oob_score = True,
      n_jobs = -1,
      random_state = random_seed,
      **params
  )
  model_oobscore.fit(X_train, y_train)
  results['params'].append(params)
  results['oob_score'].append(model_oobscore.oob_score_)

In [None]:
results_score = pd.DataFrame(results)
results_score = pd.concat(
    [results_score, results_score['params'].apply(pd.Series)], axis=1
)

results_score = results_score.drop(columns = 'params')
results_score = results_score.sort_values('oob_score', ascending=False)
results_score.head(4)

Unnamed: 0,oob_score,criterion,max_features,n_estimators
24,0.46,entropy,sqrt,50
40,0.46,log_loss,sqrt,50
37,0.45875,log_loss,,100
36,0.45875,log_loss,,50


The best hyperparameters, with an OOB accuracy of 0.46, are:
- criterion: entropy
- max_features: sqrt
- n_estimators: 50

The log_loss criterion ties with the same settings, which is expected because scikit-learn treats entropy and log_loss as the same split criterion.

**Apply Oob score other function**

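A callable that computes recall is passed as the OOB metric. With average='micro' in a single-label multiclass problem, recall equals accuracy, so the ranking below is identical to the one obtained with the default metric; average='macro' would instead weight the three classes equally.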

In [None]:
def metrica_oob_score(y, y_predict, **kwargs):
  # Custom OOB metric: micro-averaged recall over the three classes.
  score = recall_score(y, y_predict, average='micro')
  return score
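
The choice of averaging matters for the callback above. The small example below uses made-up labels (not notebook data) to show that micro-averaged recall coincides with accuracy in single-label multiclass problems, while macro-averaged recall weights every class equally.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 4 samples of class 0, 3 of class 1, 3 of class 2.
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred_toy = [0, 0, 0, 1, 1, 1, 0, 2, 0, 0]

accuracy = accuracy_score(y_true_toy, y_pred_toy)
# Micro averaging pools all samples: total hits divided by total samples.
micro_recall = recall_score(y_true_toy, y_pred_toy, average="micro")
# Macro averaging takes the unweighted mean of the per-class recalls.
macro_recall = recall_score(y_true_toy, y_pred_toy, average="macro")

print(np.isclose(accuracy, micro_recall), round(macro_recall, 3))
```

Here the per-class recalls are 3/4, 2/3 and 1/3, so the macro average is lower than the accuracy because the weakest class counts as much as the largest one.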

In [None]:
resultados = {
    'params': [],
    'recall-score': []
}

for params in dict_params:
  model_oobscore = RandomForestClassifier(
      oob_score       = metrica_oob_score,
      n_jobs          =-1,
      random_state    = random_seed,
      **params
  )

  model_oobscore.fit(X_train, y_train)
  resultados['params'].append(params)
  resultados['recall-score'].append(model_oobscore.oob_score_)

In [None]:
results_score = pd.DataFrame(resultados)
results_score = pd.concat(
    [results_score, results_score['params'].apply(pd.Series)], axis=1
)

results_score = results_score.drop(columns = 'params')
results_score = results_score.sort_values('recall-score', ascending=False)
results_score.head(4)

Unnamed: 0,recall-score,criterion,max_features,n_estimators
24,0.46,entropy,sqrt,50
40,0.46,log_loss,sqrt,50
37,0.45875,log_loss,,100
36,0.45875,log_loss,,50


The best hyperparameters, with an OOB micro-averaged recall of 0.46, are:
- criterion: entropy
- max_features: sqrt
- n_estimators: 50
