<a href="https://colab.research.google.com/github/Kaiziferr/XGBoost/blob/main/02_random_fores_xgboost_sklearn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [107]:
from unicodedata import normalize
import warnings

import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from xgboost import XGBRFRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import (
    train_test_split,
    ParameterGrid,
    GridSearchCV
)

from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    root_mean_squared_error)

# **Info**
---
@By: **Steven Bernal**

@Nickname: **Kaiziferr**

@Git: https://github.com/Kaiziferr

# **Data**
---
- Gas volumes supplied by service stations in Colombia.


**Entity Information**

- Area or office: Dirección de Hidrocarburos
- Entity name: Ministerio de Minas y Energía
- Department: Bogotá D.C.
- Municipality: Bogotá D.C.
- Level: National
- Sector: Mines and Energy

**Data Information**

- Geographic coverage: National
- Update frequency: Daily
- Issue date (yyyy-mm-dd): 2023-08-17

Data provided by: Ministerio de Minas y Energía

Data source: https://www.datos.gov.co/Minas-y-Energ-a/Consulta-Ventas-de-Gas-Natural-Comprimido-Vehicula/v8jr-kywh/about_data

# **Data Dictionary**
---

- FECHA_VENTA: date of the transaction
- ANIO_VENTA: year of the transaction
- MES_VENTA: month of the transaction
- DIA_VENTA: day of the transaction
- CODIGO_MUNICIPIO_DANE: municipality code
- DEPARTAMENTO: department
- MUNICIPIO: municipality
- LATITUD: georeferencing coordinate (latitude)
- LONGITUD: georeferencing coordinate (longitude)
- TIPO_AGENTE: type of supplier agent
- TIPO_DE_COMBUSTIBLE: type of fuel supplied
- EDS_ACTIVAS: active service stations
- NUMERO_DE_VENTAS: number of fuel transactions
- VEHICULOS_ATENDIDOS: vehicles served
- CANTIDAD_VOLUMEN_SUMINISTRADO: volume supplied in the fuel transactions

**This exercise has two objectives**:

1. To understand how to configure a random forest using XGBoost.
2. To compare the scikit-learn implementation with the XGBoost one.
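As a rough orientation for the first objective, the sketch below renames scikit-learn random forest hyperparameters to their closest `XGBRFRegressor` counterparts. The helper `rf_to_xgbrf_params` and the mapping table are hypothetical (neither library provides them), and the correspondence is only approximate: for example, XGBoost's `subsample` draws rows without replacement, whereas scikit-learn's bootstrap draws with replacement.

```python
# Hypothetical helper: translate scikit-learn RandomForestRegressor-style
# hyperparameters into approximate XGBRFRegressor keyword arguments.
# The mapping is an assumption for illustration, not an official equivalence.
RF_TO_XGBRF = {
    "n_estimators": "n_estimators",      # number of trees in the forest
    "max_depth": "max_depth",            # maximum depth of each tree
    "max_features": "colsample_bynode",  # fraction of features tried per split
    "max_samples": "subsample",          # fraction of rows given to each tree
}


def rf_to_xgbrf_params(rf_params):
    """Rename the keys that have a close XGBRFRegressor counterpart."""
    translated = {}
    for name, value in rf_params.items():
        if name not in RF_TO_XGBRF:
            # e.g. 'criterion' has no direct counterpart here; skip it
            continue
        translated[RF_TO_XGBRF[name]] = value
    return translated


xgbrf_kwargs = rf_to_xgbrf_params(
    {"n_estimators": 150, "max_features": 0.45, "max_depth": 10,
     "criterion": "squared_error"}
)
print(xgbrf_kwargs)
```

This is why the XGBoost grid in this notebook tunes `subsample` and `colsample_bynode` where the scikit-learn grid tunes `max_features`.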




# **Config**
---



In [108]:
sns.set(style="darkgrid")
pd.set_option('display.float_format', '{:,.2f}'.format)
title_data = 'Fuel Stations'
paleta = sns.color_palette('Set2').as_hex()
random_seed=73
np.set_printoptions(precision=3, suppress=True)
warnings.filterwarnings('ignore')

# **Functions**
---

In [109]:
def normalize_word(word):
  """Normalize a word: lowercase, strip accents, join parts with underscores."""
  word = word.replace(' ', '_')
  # Split on underscores and drop empty parts. This also covers names that
  # start with an underscore, which the previous `if word.find('_')` check
  # mishandled (find returns 0 for a leading match and -1 for no match).
  list_word = [w.lower() for w in word.split('_') if w != '']
  # NFKD separates accents from letters; the ASCII encode then drops them
  list_word = [normalize('NFKD', c).encode('ASCII', 'ignore').decode() for c in list_word]
  return "_".join(list_word)


def normalize_name_columns(columns):
  """Normalize a list of column names."""
  return [normalize_word(x) for x in columns]
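A short usage sketch of the normalization step. The function is restated compactly so the cell runs on its own, and the input names are made-up examples rather than columns read from the dataset.

```python
from unicodedata import normalize


def normalize_word(word):
    """Lowercase, strip accents, and join the parts with underscores."""
    parts = [w.lower() for w in word.replace(" ", "_").split("_") if w]
    # NFKD splits accented letters into base letter + combining mark;
    # encoding to ASCII with 'ignore' then drops the combining mark
    parts = [normalize("NFKD", p).encode("ASCII", "ignore").decode() for p in parts]
    return "_".join(parts)


# Hypothetical column names, including accents and stray spaces
examples = ["FECHA_VENTA", "Código Municipio DANE", " Año  Venta "]
normalized = [normalize_word(name) for name in examples]
print(normalized)
```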

# **Data**
---

In [110]:
url_gas_data = 'https://drive.google.com/file/d/1d2zxaI8riPA7SJm3cCw_jrrUYCIc_63F/view?usp=sharing'
url_gas_data = 'https://drive.google.com/uc?id=' + url_gas_data.split('/')[-2]
data = pd.read_csv(url_gas_data, dtype='str')

In [111]:
data.columns = normalize_name_columns(
    data.columns)

- There are no null values

In [112]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125771 entries, 0 to 125770
Data columns (total 15 columns):
 #   Column                         Non-Null Count   Dtype 
---  ------                         --------------   ----- 
 0   fecha_venta                    125771 non-null  object
 1   anio_venta                     125771 non-null  object
 2   mes_venta                      125771 non-null  object
 3   dia_venta                      125771 non-null  object
 4   codigo_municipio_dane          125771 non-null  object
 5   departamento                   125771 non-null  object
 6   municipio                      125771 non-null  object
 7   latitud                        125771 non-null  object
 8   longitud                       125771 non-null  object
 9   tipo_agente                    125771 non-null  object
 10  tipo_de_combustible            125771 non-null  object
 11  eds_activas                    125771 non-null  object
 12  numero_de_ventas               125771 non-nu

In [113]:
data['anio_venta'].value_counts()

anio_venta,count
2023,34237
2022,29587
2021,26523
2020,22151
2024,13273


- The numeric columns are cast to their actual data type.

In [114]:
data[[
    'eds_activas',
    'numero_de_ventas',
    'vehiculos_atendidos',
    'cantidad_volumen_suministrado'
]] = data[[
    'eds_activas',
    'numero_de_ventas',
    'vehiculos_atendidos',
    'cantidad_volumen_suministrado'
]].astype('float64')

- The data from 2022 is selected for training and the data from 2023 for testing.

In [115]:
# .copy() avoids SettingWithCopyWarning when the 'key' column is added below
data_gas_train = data[data['anio_venta'] == "2022"].copy()
data_gas_test = data[data['anio_venta'] == "2023"].copy()

- Two columns, the department and the municipality, are concatenated into a new column that identifies the single entity each transaction belongs to.

In [116]:
data_gas_train['key'] = data_gas_train.departamento.str.cat(
    data_gas_train.municipio, sep='-'
)

data_gas_test['key'] = data_gas_test.departamento.str.cat(
    data_gas_test.municipio, sep='-'
)

In [117]:
data_gas_train.head()

Unnamed: 0,fecha_venta,anio_venta,mes_venta,dia_venta,codigo_municipio_dane,departamento,municipio,latitud,longitud,tipo_agente,tipo_de_combustible,eds_activas,numero_de_ventas,vehiculos_atendidos,cantidad_volumen_suministrado,key
0,2022-06-17,2022,6,17,68682,SANTANDER,FLORIDABLANCA,7.0797047615,-73.0679931641,ESTACION DE SERVICIO DE GNCV,GNV,2.0,671.0,576.0,4909.3,SANTANDER-FLORIDABLANCA
2,2022-04-16,2022,4,16,85850,CASANARE,YOPAL,5.2427449226,-72.258026123,ESTACION DE SERVICIO DE GNCV,GNV,7.0,1162.0,560.0,8883.95,CASANARE-YOPAL
3,2022-06-06,2022,6,6,68680,SANTANDER,BUCARAMANGA,7.1558337212,-73.1115722656,ESTACION DE SERVICIO DE GNCV,GNV,9.0,1957.0,1331.0,13073.23,SANTANDER-BUCARAMANGA
5,2022-12-06,2022,12,6,23231,CORDOBA,CERETE,8.8956670761,-75.8784255981,ESTACION DE SERVICIO DE GNCV,GNV,2.0,100.0,87.0,993.2,CORDOBA-CERETE
6,2022-10-27,2022,10,27,68685,SANTANDER,PIEDECUESTA,6.9708209038,-73.0148086548,ESTACION DE SERVICIO DE GNCV,GNV,1.0,227.0,189.0,1938.12,SANTANDER-PIEDECUESTA


- A function is defined to group the data; the aggregations are specified as a dictionary.

In [118]:
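def agrupamiento(
     function_dictionary:dict,
     filter_feature:list,
     new_val_col:list,
     data):
  """Group `data` by `filter_feature` and aggregate with `function_dictionary`.

  The aggregated columns are renamed to `new_val_col`, in order.
  Errors are raised instead of printed, so a failed grouping cannot
  silently return None.
  """
  data_group = data.groupby(
      filter_feature).aggregate(function_dictionary)

  # Replace the (column, function) MultiIndex with plain column names
  data_group.columns = new_val_col
  data_group = data_group.reset_index()
  return data_group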


In [119]:
function_dictionary = {
    "cantidad_volumen_suministrado": ['sum'],
    "vehiculos_atendidos": ['sum'],
    'numero_de_ventas': ['sum'],
    'eds_activas': ['sum']
}

filter_feature = [
    'key',
    'mes_venta']


new_val_col  = [
    "cantidad_volumen_suministrado",
    "vehiculos_atendidos",
    'numero_de_ventas',
    'eds_activas'
]

data_group = agrupamiento(
    function_dictionary,
    filter_feature,
    new_val_col,
    data_gas_train
)

In [120]:
function_dictionary = {
    "cantidad_volumen_suministrado": ['mean'],
    "vehiculos_atendidos": ['mean'],
    'numero_de_ventas': ['mean'],
    'eds_activas': ['mean'],
    'mes_venta': ['count']
}

filter_feature = [
    'key'
]


new_val_col  = [
    "cantidad_volumen_suministrado_mean",
    "vehiculos_atendidos_mean",
    'numero_de_ventas_mean',
    'eds_activas_mean',
    'meses_activos'
]

data_group2 = agrupamiento(
    function_dictionary,
    filter_feature,
    new_val_col,
    data_group
)

# **Data Split**
---

In [121]:
features = data_group2.select_dtypes(include=['float64', 'int64']).columns.to_list()
X = data_group2[features[1: ]]
y = data_group2[features[0]]

In [122]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=random_seed,
)

# **Model**
---

**Random forest oob_score**

In [123]:
dict_random = ParameterGrid(
    {
        'n_estimators': [100, 150],
        'max_features': [0.45, 0.85],
        'max_depth': [None, 5, 10],
        'criterion': ['squared_error', 'absolute_error']
    }
)

dict_random.param_grid

[{'n_estimators': [100, 150],
  'max_features': [0.45, 0.85],
  'max_depth': [None, 5, 10],
  'criterion': ['squared_error', 'absolute_error']}]

In [124]:
resultados = {
    'params': [],
    'oob_r2': []
}

In [125]:
for params in dict_random:
  model_oobscore = RandomForestRegressor(
      oob_score = True,
      n_jobs = -1,
      random_state = random_seed,
      **params
  )
  model_oobscore.fit(X_train, y_train)
  resultados['params'].append(params)
  resultados['oob_r2'].append(model_oobscore.oob_score_)


In [126]:
resultados = pd.DataFrame(resultados)
resultados = pd.concat(
    [resultados, resultados['params'].apply(pd.Series)], axis=1
)
resultados = resultados.drop(columns='params')
resultados = resultados.sort_values('oob_r2', ascending=False)
resultados.head(4)

Unnamed: 0,oob_r2,criterion,max_depth,max_features,n_estimators
11,0.14,squared_error,10.0,0.85,150
5,0.14,squared_error,5.0,0.45,150
7,0.14,squared_error,5.0,0.85,150
1,0.14,squared_error,,0.45,150


In [127]:
resultados.iloc[0, 1:].to_dict()

{'criterion': 'squared_error',
 'max_depth': 10.0,
 'max_features': 0.85,
 'n_estimators': 150}

In [128]:
model_oobscore = RandomForestRegressor(
      oob_score    = True,
      n_jobs       = -1,
      random_state = random_seed,
      max_depth = 10,
      criterion = "squared_error",
      max_features = 0.85,
      n_estimators = 150
  )

model_oobscore.fit(X_train, y_train)

**Random forest Grid**

In [129]:
model_forest_grid = RandomForestRegressor(
      oob_score    = False,
      n_jobs       = -1,
      random_state = random_seed
)


In [130]:
grid_search = GridSearchCV(
    model_forest_grid,
    dict_random.param_grid,
    cv = 3,
    scoring = 'neg_mean_absolute_error',
    verbose = 2,
    n_jobs = -1
)

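grid_search.fit(X_train, y_train)
# Keep the refitted best estimator itself; `grid_search` is reassigned later
model_forest_grid_best = grid_search.best_estimator_
best_params = grid_search.best_params_
best_score = grid_search.best_score_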

Fitting 3 folds for each of 24 candidates, totalling 72 fits


In [131]:
best_params

{'criterion': 'squared_error',
 'max_depth': None,
 'max_features': 0.45,
 'n_estimators': 150}

In [132]:
-1*best_score

np.float64(239976.84334813105)
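`GridSearchCV` maximizes its score, so error metrics are exposed negated (`neg_mean_absolute_error`), which is why the best score is multiplied by -1 above. A minimal sketch of that sign convention on synthetic data; the arrays and the `DummyRegressor` are assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import get_scorer, mean_absolute_error

# Synthetic data, not taken from the notebook
X_demo = np.arange(10).reshape(-1, 1)
y_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])

# A constant predictor (always the mean of y) is enough to show the sign
model_demo = DummyRegressor(strategy="mean").fit(X_demo, y_demo)

# The scorer returns the negated MAE so that "higher is better" holds
scorer = get_scorer("neg_mean_absolute_error")
neg_mae = scorer(model_demo, X_demo, y_demo)
mae = mean_absolute_error(y_demo, model_demo.predict(X_demo))

print(neg_mae, mae)
```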

In [133]:
score_model = pd.DataFrame(grid_search.cv_results_)
score_model.sort_values(by='mean_test_score', ascending=False).head()

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_criterion,param_max_depth,param_max_features,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
1,0.36,0.01,0.05,0.0,squared_error,,0.45,150,"{'criterion': 'squared_error', 'max_depth': No...",-53546.87,-37236.06,-629147.6,-239976.84,275265.83,1
9,0.68,0.06,0.06,0.0,squared_error,10.0,0.45,150,"{'criterion': 'squared_error', 'max_depth': 10...",-54230.43,-37145.95,-629200.27,-240192.21,275158.64,2
17,0.36,0.02,0.05,0.01,absolute_error,5.0,0.45,150,"{'criterion': 'absolute_error', 'max_depth': 5...",-57659.93,-33373.85,-630233.18,-240422.32,275816.16,3
0,0.24,0.01,0.04,0.0,squared_error,,0.45,100,"{'criterion': 'squared_error', 'max_depth': No...",-54700.06,-37536.91,-629791.5,-240676.15,275235.3,4
16,0.24,0.01,0.04,0.0,absolute_error,5.0,0.45,100,"{'criterion': 'absolute_error', 'max_depth': 5...",-57979.51,-33006.31,-631178.12,-240721.31,276282.83,5


**XGBoost random forest (XGBRFRegressor) Grid**

In [134]:
model_XGBRF_grid = XGBRFRegressor(
    booster='gbtree',
    random_state=random_seed
)

In [135]:
params = [
    {
        'n_estimators':  [50, 100, 150],
        'subsample': [0.45, 0.75, 0.85],
        'colsample_bynode': [0.45, 0.75, 0.85],
        'max_depth': [None, 5, 10, 15]

    }
]

In [136]:
grid_search = GridSearchCV(
    model_XGBRF_grid,
    params,
    cv = 3,
    scoring = 'neg_mean_absolute_error',
    verbose = 2,
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

Fitting 3 folds for each of 108 candidates, totalling 324 fits


In [137]:
model_XGBRF_grid_best = grid_search.best_estimator_
best_params = grid_search.best_params_
best_score = grid_search.best_score_

model_XGBRF_grid_best, best_params, -1*best_score

(XGBRFRegressor(base_score=None, booster='gbtree', callbacks=None,
                colsample_bylevel=None, colsample_bynode=0.75,
                colsample_bytree=None, device=None, early_stopping_rounds=None,
                enable_categorical=False, eval_metric=None, feature_types=None,
                gamma=None, grow_policy=None, importance_type=None,
                interaction_constraints=None, max_bin=None,
                max_cat_threshold=None, max_cat_to_onehot=None,
                max_delta_step=None, max_depth=None, max_leaves=None,
                min_child_weight=None, missing=nan, monotone_constraints=None,
                multi_strategy=None, n_estimators=50, n_jobs=None,
                num_parallel_tree=None, objective='reg:squarederror',
                random_state=73, ...),
 {'colsample_bynode': 0.75,
  'max_depth': None,
  'n_estimators': 50,
  'subsample': 0.85},
 np.float64(290777.24802798947))

In [138]:
scores_model = pd.DataFrame(grid_search.cv_results_)
scores_model.sort_values(by='mean_test_score', ascending=False).head(1)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_colsample_bynode,param_max_depth,param_n_estimators,param_subsample,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
38,0.05,0.02,0.01,0.0,0.75,,50,0.85,"{'colsample_bynode': 0.75, 'max_depth': None, ...",-57363.36,-181550.52,-633417.87,-290777.25,247531.22,1


In [139]:
y_predict_oobscore = model_oobscore.predict(X_test)
y_predict_random_grid = model_forest_grid_best.predict(X_test)
y_predict_xgboost_grid = model_XGBRF_grid_best.predict(X_test)

In [140]:
scors = {
    'MAE': [
        mean_absolute_error(y_test, y_predict_oobscore),
        mean_absolute_error(y_test, y_predict_random_grid),
        mean_absolute_error(y_test, y_predict_xgboost_grid)],
    'MSE': [
        mean_squared_error(y_test, y_predict_oobscore),
        mean_squared_error(y_test, y_predict_random_grid),
        mean_squared_error(y_test, y_predict_xgboost_grid),
    ],
    'RMSE': [
        root_mean_squared_error(y_test, y_predict_oobscore),
        root_mean_squared_error(y_test, y_predict_random_grid),
        root_mean_squared_error(y_test, y_predict_xgboost_grid)
    ]


}
pd.DataFrame(scors, ['RandomForest_oobscore', 'RandomForest_grid', 'XGBRFRegressor'])

Unnamed: 0,MAE,MSE,RMSE
RandomForest_oobscore,63945.44,12058436588.73,109810.91
RandomForest_grid,62229.85,10105878660.95,100528.0
XGBRFRegressor,84632.25,33442513711.88,182872.94


- The best-performing model on the test split is RandomForest_grid, which has the lowest error on all three metrics. Its margin over RandomForest_oobscore is small, while XGBRFRegressor is clearly worse (RMSE of 182,872.94 versus 100,528.00). Expressed in the units of the problem, RandomForest_grid has an RMSE of 100,528.00 when predicting the theoretical average of the volume supplied.

- The theoretical average is the assumed volume that should be supplied in a month for each department-municipality key. It is calculated as the total volume over the active period divided by the number of active months.
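The theoretical average described above can be reproduced on a toy table. This is a minimal sketch with made-up values; only the column names follow the notebook, and it uses pandas named aggregation rather than the `agrupamiento` helper.

```python
import pandas as pd

# Toy daily records (hypothetical values, not from the dataset)
toy = pd.DataFrame({
    "key": ["A-X", "A-X", "A-X", "B-Y", "B-Y"],
    "mes_venta": ["1", "1", "2", "1", "1"],
    "cantidad_volumen_suministrado": [100.0, 50.0, 250.0, 30.0, 10.0],
})

# Step 1: total volume per entity and month
monthly = (
    toy.groupby(["key", "mes_venta"], as_index=False)
       ["cantidad_volumen_suministrado"].sum()
)

# Step 2: mean of the monthly totals and number of active months per entity
theoretical = monthly.groupby("key").agg(
    cantidad_volumen_suministrado_mean=("cantidad_volumen_suministrado", "mean"),
    meses_activos=("mes_venta", "count"),
).reset_index()

print(theoretical)
```

Entity `A-X` is active in two months (totals 150 and 250), so its theoretical average is 200; `B-Y` is active in one month with a total of 40.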

# **Test**
---

In [141]:
def variacion_intervalos(y_test, y_predict):
  macf = {
      'q10': 0,
      'q20': 0,
      'q30': 0,
      'q40': 0,
      'q50': 0,
      'q60': 0,
      'q70': 0,
      'q80': 0,
      'q90': 0,
      'q100': 0,
      'erraticos':0
  }


  for i, j in zip(y_test, y_predict):
    variacion = abs((j-i)/i)
    if variacion <= 0.1:
      macf['q10'] = macf['q10'] + 1
    elif variacion <= 0.2:
      macf['q20'] = macf['q20'] + 1
    elif variacion <= 0.3:
      macf['q30'] = macf['q30'] + 1
    elif variacion <= 0.4:
      macf['q40'] = macf['q40'] + 1
    elif variacion<= 0.5:
      macf['q50'] = macf['q50'] + 1
    elif variacion <= 0.6:
      macf['q60'] = macf['q60'] + 1
    elif variacion <= 0.7:
      macf['q70'] = macf['q70'] + 1
    elif variacion <= 0.8:
      macf['q80'] = macf['q80'] + 1
    elif variacion <= 0.9:
      macf['q90'] = macf['q90'] + 1
    elif variacion <= 1:
      macf['q100'] = macf['q100'] + 1
    else:
      macf['erraticos'] = macf['erraticos'] + 1

  total = sum(macf.values())

  macf_formateado = {
        k: (v, f"{(v / total) * 100:.1f}%") for k, v in macf.items()
    }

  return macf_formateado

In [142]:
macf_oobscore = variacion_intervalos(y_test, y_predict_oobscore)
macf_random_grid = variacion_intervalos(y_test, y_predict_random_grid)
macf_xgboost_grid = variacion_intervalos(y_test, y_predict_xgboost_grid)

In [143]:
pd.DataFrame(macf_oobscore.values(), macf_oobscore.keys()).T

Unnamed: 0,q10,q20,q30,q40,q50,q60,q70,q80,q90,q100,erraticos
0,3,3,4,1,2,3,2,1,3,0,4
1,11.5%,11.5%,15.4%,3.8%,7.7%,11.5%,7.7%,3.8%,11.5%,0.0%,15.4%


In [144]:
pd.DataFrame(macf_random_grid.values(), macf_random_grid.keys()).T

Unnamed: 0,q10,q20,q30,q40,q50,q60,q70,q80,q90,q100,erraticos
0,3,4,3,2,3,0,0,1,3,0,7
1,11.5%,15.4%,11.5%,7.7%,11.5%,0.0%,0.0%,3.8%,11.5%,0.0%,26.9%


In [145]:
pd.DataFrame(macf_xgboost_grid.values(), macf_xgboost_grid.keys()).T

Unnamed: 0,q10,q20,q30,q40,q50,q60,q70,q80,q90,q100,erraticos
0,5,4,1,2,2,4,1,1,2,0,4
1,19.2%,15.4%,3.8%,7.7%,7.7%,15.4%,3.8%,3.8%,7.7%,0.0%,15.4%


- RandomForest_grid, the second option, has the largest share of records (15 of 26) with a relative error of at most 0.5. However, it also has 7 erratic records whose relative error exceeds 100%. The models are tested next on data from 2023.
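The relative-error bands used in this comparison can also be counted without the long `if`/`elif` chain. A minimal vectorized sketch with NumPy follows; `error_bands` is a hypothetical helper, not the notebook's `variacion_intervalos`, and the sample values are made up.

```python
import numpy as np


def error_bands(y_true, y_pred):
    """Count predictions per relative-error band (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rel_error = np.abs((y_pred - y_true) / y_true)

    # Upper edges of the bands; right=True makes each band (lower, upper]
    edges = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
    band_index = np.digitize(rel_error, edges, right=True)

    # q10 ... q100 plus a final band for errors above 100%
    labels = [f"q{p}" for p in range(10, 101, 10)] + ["erraticos"]
    counts = np.bincount(band_index, minlength=len(labels))
    return dict(zip(labels, counts.tolist()))


# Made-up values: relative errors of 0.05, 0.25, 0.55 and 2.5
bands = error_bands([100.0, 100.0, 100.0, 100.0], [105.0, 75.0, 45.0, 350.0])
print(bands)
```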

In [146]:
X_validation = data_gas_test
y_validation = data_gas_test['cantidad_volumen_suministrado']

In [147]:
function_dictionary = {
    "cantidad_volumen_suministrado": ['sum'],
    "vehiculos_atendidos": ['sum'],
    'numero_de_ventas': ['sum'],
    'eds_activas': ['sum']
}

filter_feature = [
    'key',
    'mes_venta']


new_val_col  = [
    "cantidad_volumen_suministrado",
    "vehiculos_atendidos",
    'numero_de_ventas',
    'eds_activas'
]

data_group_test = agrupamiento(
    function_dictionary,
    filter_feature,
    new_val_col,
    X_validation
)

In [148]:
function_dictionary = {
    "cantidad_volumen_suministrado": ['mean'],
    "vehiculos_atendidos": ['mean'],
    'numero_de_ventas': ['mean'],
    'eds_activas': ['mean'],
    'mes_venta': ['count']
}

filter_feature = [
    'key'
]


new_val_col  = [
    "cantidad_volumen_suministrado_mean",
    "vehiculos_atendidos_mean",
    'numero_de_ventas_mean',
    'eds_activas_mean',
    'meses_activos'
]

data_group_test2 = agrupamiento(
    function_dictionary,
    filter_feature,
    new_val_col,
    data_group_test
)

In [149]:
X_vali = data_group_test2.drop(['key', 'cantidad_volumen_suministrado_mean'], axis=1)
y_vali = data_group_test2['cantidad_volumen_suministrado_mean']

In [150]:
y_predict_oobscore_validation = model_oobscore.predict(X_vali)
y_predict_random_grid_validation = model_forest_grid_best.predict(X_vali)
y_predict_xgboost_grid_validation = model_XGBRF_grid_best.predict(X_vali)

In [151]:
scors = {
    'MAE': [
        mean_absolute_error(y_vali, y_predict_oobscore_validation),
        mean_absolute_error(y_vali, y_predict_random_grid_validation),
        mean_absolute_error(y_vali, y_predict_xgboost_grid_validation)],
    'MSE': [
        mean_squared_error(y_vali, y_predict_oobscore_validation),
        mean_squared_error(y_vali, y_predict_random_grid_validation),
        mean_squared_error(y_vali, y_predict_xgboost_grid_validation),
    ],
    'RMSE': [
        root_mean_squared_error(y_vali, y_predict_oobscore_validation),
        root_mean_squared_error(y_vali, y_predict_random_grid_validation),
        root_mean_squared_error(y_vali, y_predict_xgboost_grid_validation)
    ]


}
pd.DataFrame(scors, ['RandomForest_oobscore', 'RandomForest_grid', 'XGBRFRegressor'])

Unnamed: 0,MAE,MSE,RMSE
RandomForest_oobscore,188428.32,456280282242.22,675485.22
RandomForest_grid,177851.51,450135686953.75,670921.52
XGBRFRegressor,109926.39,84377137245.13,290477.43


In [152]:
macf_oobscore = variacion_intervalos(y_vali, y_predict_oobscore_validation)
macf_random_grid = variacion_intervalos(y_vali, y_predict_random_grid_validation)
macf_xgboost_grid = variacion_intervalos(y_vali, y_predict_xgboost_grid_validation)

- The RandomForest_grid model performs much worse on the 2023 validation data (RMSE of 670,921.52 versus 100,528.00 on the test split). In contrast, XGBRFRegressor clearly outperforms the other two options on this data: its RMSE (290,477.43) and MSE are far lower.

In [153]:
pd.DataFrame(macf_oobscore.values(), macf_oobscore.keys()).T

Unnamed: 0,q10,q20,q30,q40,q50,q60,q70,q80,q90,q100,erraticos
0,17,15,15,11,7,7,5,2,5,2,15
1,16.8%,14.9%,14.9%,10.9%,6.9%,6.9%,5.0%,2.0%,5.0%,2.0%,14.9%


In [154]:
pd.DataFrame(macf_random_grid.values(), macf_random_grid.keys()).T

Unnamed: 0,q10,q20,q30,q40,q50,q60,q70,q80,q90,q100,erraticos
0,13,15,14,6,9,5,5,7,7,1,19
1,12.9%,14.9%,13.9%,5.9%,8.9%,5.0%,5.0%,6.9%,6.9%,1.0%,18.8%


In [155]:
pd.DataFrame(macf_xgboost_grid.values(), macf_xgboost_grid.keys()).T

Unnamed: 0,q10,q20,q30,q40,q50,q60,q70,q80,q90,q100,erraticos
0,27,17,10,6,5,12,6,4,5,3,6
1,26.7%,16.8%,9.9%,5.9%,5.0%,11.9%,5.9%,4.0%,5.0%,3.0%,5.9%


- For XGBRFRegressor, most records (54 of 101) have a relative error of at most 30%. Note that the q10 to q100 columns are relative-error bands, not percentiles.

- The abrupt change in the errors may be due to an underestimation of the problem and to training on a small sample.

- Another possible reason is a significant change in the behavior of the average monthly volume in 2023.
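One way to probe the last hypothesis is to compare the distribution of the target between the two years. A minimal sketch with NumPy on made-up values; the arrays and the helper `quantile_shift` are hypothetical, and in this notebook the inputs would be the 2022 target `y` and the 2023 target `y_vali`.

```python
import numpy as np


def quantile_shift(reference, current, quantiles=(0.25, 0.5, 0.75)):
    """Relative change of selected quantiles from `reference` to `current`."""
    ref_q = np.quantile(reference, quantiles)
    cur_q = np.quantile(current, quantiles)
    return dict(zip(quantiles, ((cur_q - ref_q) / ref_q).tolist()))


# Made-up monthly averages for two years
volume_2022 = np.array([100.0, 200.0, 300.0, 400.0, 500.0])
volume_2023 = np.array([150.0, 300.0, 450.0, 600.0, 750.0])

shift = quantile_shift(volume_2022, volume_2023)
print(shift)
```

Large relative shifts across the quartiles would support the idea that the 2023 target behaves differently from the data the models were trained on.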
