<a href="https://colab.research.google.com/github/Kaiziferr/machine_learning/blob/main/XGBoost/02_random_fores_xgboost_sklearn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from unicodedata import normalize
import warnings

import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from xgboost import XGBRFRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import (
    train_test_split,
    ParameterGrid,
    GridSearchCV)
from sklearn.pipeline import Pipeline
from sklearn.metrics import (mean_absolute_error, mean_squared_error)

# **Info**
---
**@By**: Steven Bernal Tovar

**@Nickname**: Kaiziferr

**@Git**: https://github.com/Kaiziferr

# **Dataset**
---

Dataset of the gas volumes supplied by service stations in Colombia.


**Entity Information**

- Area or office: Dirección de Hidrocarburos
- Entity name: Ministerio de Minas y Energía
- Department: Bogotá D.C.
- Municipality: Bogotá D.C.
- Level: National
- Sector: Mines and Energy

**Data Information**

- Geographic coverage: National
- Update frequency: Daily
- Issue date (yyyy-mm-dd): 2023-08-17

Data provided by: Ministerio de Minas y Energía

Data source: https://www.datos.gov.co/Minas-y-Energ-a/Consulta-Ventas-de-Gas-Natural-Comprimido-Vehicula/v8jr-kywh/about_data

- FECHA_VENTA: transaction date
- ANIO_VENTA: year of the transaction
- MES_VENTA: month of the transaction
- DIA_VENTA: day of the transaction
- CODIGO_MUNICIPIO_DANE: municipality code (DANE)
- DEPARTAMENTO: department
- MUNICIPIO: municipality
- LATITUD: latitude (georeferencing coordinate)
- LONGITUD: longitude (georeferencing coordinate)
- TIPO_AGENTE: type of supplying agent
- TIPO_DE_COMBUSTIBLE: fuel supplied
- EDS_ACTIVAS: active service stations (EDS)
- NUMERO_DE_VENTAS: number of fill-ups at the service station
- VEHICULOS_ATENDIDOS: vehicles served
- CANTIDAD_VOLUMEN_SUMINISTRADO: volume supplied in the fill-ups

**This exercise has two goals**:
1. Understand how to configure a random forest with XGBoost
2. Compare the scikit-learn implementation with the XGBoost one

# **Config**
---



In [None]:
sns.set(style="darkgrid")
pd.set_option('display.float_format', '{:,.2f}'.format)
title_data = 'Materiales extraidos en Colombia'
paleta = sns.color_palette('Set2').as_hex()
random_seed=73
np.set_printoptions(precision=3, suppress=True)

# **Functions**
---

In [None]:
def normalize_word(word):
  """Normalize a word: lowercase, accent-free, underscore-separated."""
  # Splitting on '_' also covers words without a separator (single-item list)
  # and drops the empty pieces left by leading or repeated separators.
  list_word = [w for w in word.replace(' ', '_').split('_') if w != '']
  list_word = [w.lower() for w in list_word]
  # NFKD decomposes accented characters so the ASCII encode can drop the accents.
  list_word = [normalize('NFKD', w).encode('ASCII', 'ignore').decode() for w in list_word]
  return "_".join(list_word)


def normalize_name_columns(columns):
  """Normalize a list of column names."""
  return [normalize_word(c) for c in columns]
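
As a quick illustration of what the normalization does, the sketch below applies the same steps (replace spaces, lowercase, strip accents via NFKD) to a few made-up column names. The helper name `to_snake_ascii` and the sample headers are assumptions for illustration; they are not taken from the dataset.

```python
from unicodedata import normalize


def to_snake_ascii(name):
    """Hypothetical helper: lowercase, accent-free, underscore-separated name."""
    # Treat spaces like underscores and drop empty pieces (e.g. "A__B" or " A").
    parts = [p for p in name.replace(' ', '_').split('_') if p]
    # NFKD splits an accented letter into base letter + combining mark;
    # the ASCII encode with 'ignore' then drops the combining mark.
    parts = [normalize('NFKD', p.lower()).encode('ASCII', 'ignore').decode()
             for p in parts]
    return '_'.join(parts)


# Made-up headers, only to show the effect of each step.
sample_columns = ['AÑO VENTA', 'Código_Municipio', ' LATITUD ']
normalized_columns = [to_snake_ascii(c) for c in sample_columns]
print(normalized_columns)
```

Names that are already clean pass through unchanged, so the function is safe to apply to every column.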

# **Data**
---

In [None]:
url_gas_data = 'https://drive.google.com/file/d/1d2zxaI8riPA7SJm3cCw_jrrUYCIc_63F/view?usp=sharing'
url_gas_data = 'https://drive.google.com/uc?id=' + url_gas_data.split('/')[-2]
gas_data = pd.read_csv(url_gas_data, dtype='str')
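
The URL rewrite in this cell takes the file id from a Google Drive sharing link and builds a direct-download address that `pd.read_csv` can read. A minimal sketch of that string manipulation follows; the function name `drive_direct_url` and the placeholder id are illustrative assumptions.

```python
def drive_direct_url(share_url):
    """Hypothetical helper: turn a '/file/d/<id>/view' share link into a 'uc?id=' link."""
    # In a share link the file id is the second-to-last path segment:
    # https://drive.google.com/file/d/<id>/view?usp=sharing
    file_id = share_url.split('/')[-2]
    return 'https://drive.google.com/uc?id=' + file_id


# 'FILE_ID' is a placeholder, not a real file.
direct_url = drive_direct_url(
    'https://drive.google.com/file/d/FILE_ID/view?usp=sharing')
print(direct_url)
```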

In [None]:
gas_data.columns = normalize_name_columns(
    gas_data.columns)

In [None]:
gas_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125771 entries, 0 to 125770
Data columns (total 15 columns):
 #   Column                         Non-Null Count   Dtype 
---  ------                         --------------   ----- 
 0   fecha_venta                    125771 non-null  object
 1   anio_venta                     125771 non-null  object
 2   mes_venta                      125771 non-null  object
 3   dia_venta                      125771 non-null  object
 4   codigo_municipio_dane          125771 non-null  object
 5   departamento                   125771 non-null  object
 6   municipio                      125771 non-null  object
 7   latitud                        125771 non-null  object
 8   longitud                       125771 non-null  object
 9   tipo_agente                    125771 non-null  object
 10  tipo_de_combustible            125771 non-null  object
 11  eds_activas                    125771 non-null  object
 12  numero_de_ventas               125771 non-nu

The columns are cast to their actual data types.


In [None]:
gas_data['anio_venta'].value_counts()

anio_venta
2023    34237
2022    29587
2021    26523
2020    22151
2024    13273
Name: count, dtype: int64

- The 2022 data is selected.

In [None]:
gas_data[[
    'eds_activas',
    'numero_de_ventas',
    'vehiculos_atendidos',
    'cantidad_volumen_suministrado'
]] = gas_data[[
    'eds_activas',
    'numero_de_ventas',
    'vehiculos_atendidos',
    'cantidad_volumen_suministrado'
]].astype('float64')

In [None]:
gas_data_periodo = gas_data[gas_data['anio_venta'] == "2022"]

In [None]:
gas_data_periodo_test = gas_data[gas_data['anio_venta'] == "2023"]

In [None]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
gas_data_periodo_test = gas_data_periodo_test.assign(
    key=gas_data_periodo_test.departamento.str.cat(
        gas_data_periodo_test.municipio, sep='-'))


In [None]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
gas_data_periodo = gas_data_periodo.assign(
    key=gas_data_periodo.departamento.str.cat(
        gas_data_periodo.municipio, sep='-'))


In [None]:
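def agrupamiento(
    function_dictionary: dict,
    filter_feature: list,
    new_val_col: list,
    data):
  """Group `data` by `filter_feature`, aggregate, and rename the result columns."""
  try:
    data_group = data.groupby(
        filter_feature).aggregate(function_dictionary)

    # Aggregating with lists of functions yields two-level columns; flatten them
    data_group.columns = new_val_col
    data_group = data_group.reset_index()
    return data_group
  except Exception as e:
    print(e)
    # Re-raise so a failed aggregation does not silently return None
    raise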


In [None]:
function_dictionary = {
    "cantidad_volumen_suministrado": ['sum'],
    "vehiculos_atendidos": ['sum'],
    'numero_de_ventas': ['sum'],
    'eds_activas': ['sum']
}

filter_feature = [
    'key',
    'mes_venta']


new_val_col  = [
    "cantidad_volumen_suministrado",
    "vehiculos_atendidos",
    'numero_de_ventas',
    'eds_activas'
]

data_group = agrupamiento(
    function_dictionary,
    filter_feature,
    new_val_col,
    gas_data_periodo
)

In [None]:
function_dictionary = {
    "cantidad_volumen_suministrado": ['mean'],
    "vehiculos_atendidos": ['mean'],
    'numero_de_ventas': ['mean'],
    'eds_activas': ['mean'],
    'mes_venta': ['count']
}

filter_feature = [
    'key'
]


new_val_col  = [
    "cantidad_volumen_suministrado_mean",
    "vehiculos_atendidos_mean",
    'numero_de_ventas_mean',
    'eds_activas_mean',
    'meses_activos'
]

data_group2 = agrupamiento(
    function_dictionary,
    filter_feature,
    new_val_col,
    data_group
)

# **Data Split**
---

In [None]:
features = data_group2.select_dtypes(include=['float64', 'int64']).columns.to_list()
X = data_group2[features[1: ]]
y = data_group2[features[0]]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=random_seed,
)

# **Model**
---

**Random forest oob_score**

In [None]:
dict_random = ParameterGrid(
    {
        'n_estimators': [100, 150],
        'max_features': [0.45, 0.85],
        'max_depth': [None, 5, 10],
        'criterion': ['squared_error', 'absolute_error']
    }
)

params_random = dict_random.param_grid
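
`ParameterGrid` enumerates the Cartesian product of the listed values. The sketch below reproduces that expansion with `itertools.product`, only to show where the number of candidates comes from (2 * 2 * 3 * 2 = 24); it is an illustration, not scikit-learn's implementation.

```python
from itertools import product

grid = {
    'n_estimators': [100, 150],
    'max_features': [0.45, 0.85],
    'max_depth': [None, 5, 10],
    'criterion': ['squared_error', 'absolute_error'],
}

# One dict per combination, similar to what iterating over a ParameterGrid yields.
keys = sorted(grid)
candidates = [dict(zip(keys, values))
              for values in product(*(grid[k] for k in keys))]

print(len(candidates))
print(candidates[0])
```

Each extra value in any list multiplies the number of fits, which is worth keeping in mind before widening the grid.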

In [None]:
resultados = {
    'params': [],
    'oob_r2': []
}

In [None]:
for params in dict_random:
  model_oobscore = RandomForestRegressor(
      oob_score    = True,
      n_jobs       = -1,
      random_state = random_seed,
      **params
  )
  model_oobscore.fit(X_train, y_train)
  resultados['params'].append(params)
  resultados['oob_r2'].append(model_oobscore.oob_score_)

In [None]:
resultados = pd.DataFrame(resultados)
resultados = pd.concat(
    [resultados, resultados['params'].apply(pd.Series)], axis=1)

resultados = resultados.drop(columns = 'params')
resultados = resultados.sort_values('oob_r2', ascending=False)
resultados.head(4)

|  | oob_r2 | criterion | max_depth | max_features | n_estimators |
|---|---|---|---|---|---|
| 11 | 0.14 | squared_error | 10.0 | 0.85 | 150 |
| 5 | 0.14 | squared_error | 5.0 | 0.45 | 150 |
| 7 | 0.14 | squared_error | 5.0 | 0.85 | 150 |
| 1 | 0.14 | squared_error |  | 0.45 | 150 |


In [None]:
resultados.iloc[0, 1:].to_dict()

{'criterion': 'squared_error',
 'max_depth': 10.0,
 'max_features': 0.85,
 'n_estimators': 150}

In [None]:
model_oobscore = RandomForestRegressor(
      oob_score    = True,
      n_jobs       = -1,
      random_state = random_seed,
      max_depth = 10,
      criterion = "squared_error",
      max_features = 0.85,
      n_estimators = 150
  )

model_oobscore.fit(X_train, y_train)

**Random forest Grid**

In [None]:
model_forest_grid = RandomForestRegressor(
      oob_score    = False,
      n_jobs       = -1,
      random_state = random_seed
)


In [None]:
grid_search = GridSearchCV(model_forest_grid, params_random, cv = 3, scoring = 'neg_mean_absolute_error', verbose = 2, n_jobs= -1)
grid_search.fit(X_train, y_train)

model_forest_grid_best = grid_search.best_estimator_
best_params = grid_search.best_params_
best_score = grid_search.best_score_

Fitting 3 folds for each of 24 candidates, totalling 72 fits


In [None]:
best_score

-239976.84334813105

In [None]:
scores_model = pd.DataFrame(grid_search.cv_results_)
scores_model.sort_values(by='mean_test_score', ascending=False).head(1)

|  | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_criterion | param_max_depth | param_max_features | param_n_estimators | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.35 | 0.0 | 0.04 | 0.0 | squared_error |  | 0.45 | 150 | {'criterion': 'squared_error', 'max_depth': No... | -53546.87 | -37236.06 | -629147.6 | -239976.84 | 275265.83 | 1 |


**XGBRFRegressor Grid**

In [None]:
model_XGBRF_grid = XGBRFRegressor(booster='gbtree', random_state=random_seed)

In [None]:
params = [
  {
      'n_estimators': [50, 100, 150],
      'subsample': [0.45, 0.75, 0.85],
      'colsample_bynode': [0.45, 0.75, 0.85],
      'max_depth': [None, 5, 10, 15]
  }
]
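
This grid uses XGBoost's own names for the random-forest knobs. As a rough guide, `max_features` (a fraction of columns per split) corresponds to `colsample_bynode`, row sampling is controlled by `subsample`, and `criterion` has no direct counterpart. The sketch below encodes that correspondence in a small helper; the function `rf_grid_to_xgbrf` and the default `subsample` values are assumptions for illustration, not an official conversion.

```python
def rf_grid_to_xgbrf(rf_grid, subsample=(0.45, 0.75, 0.85)):
    """Hypothetical helper: rename RandomForest grid keys to XGBRFRegressor ones."""
    # Approximate correspondence, not an exact equivalence:
    # max_features (fraction of columns per split) -> colsample_bynode
    rename = {'max_features': 'colsample_bynode'}
    # criterion has no direct XGBRFRegressor counterpart, so it is dropped.
    dropped = {'criterion'}

    xgb_grid = {}
    for key, values in rf_grid.items():
        if key in dropped:
            continue
        xgb_grid[rename.get(key, key)] = list(values)
    # Row sampling is a separate XGBoost parameter; these values are assumed.
    xgb_grid['subsample'] = list(subsample)
    return xgb_grid


rf_grid = {
    'n_estimators': [100, 150],
    'max_features': [0.45, 0.85],
    'max_depth': [None, 5, 10],
    'criterion': ['squared_error', 'absolute_error'],
}
xgb_grid = rf_grid_to_xgbrf(rf_grid)
print(xgb_grid)
```

Passing a grid that still contains `criterion` or `max_features` to XGBoost does not fail; those keys are simply ignored, so the search repeats identical models.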


In [None]:
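# NOTE: the recorded run used `params_random` (the RandomForest grid), not the
# XGBoost-specific `params` defined above. XGBoost ignores `criterion` and
# `max_features`, which is what the "are not used" warning in the output reports.
grid_search = GridSearchCV(model_XGBRF_grid, params_random, cv = 3, scoring = 'neg_mean_absolute_error', verbose = 2, n_jobs= -1)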
grid_search.fit(X_train, y_train)

Fitting 3 folds for each of 24 candidates, totalling 72 fits


Parameters: { "criterion", "max_features" } are not used.



In [None]:
model_XGBRF_grid_best = grid_search.best_estimator_
best_params = grid_search.best_params_
best_score = grid_search.best_score_

In [None]:
best_score

-301143.47241115593

In [None]:
scores_model = pd.DataFrame(grid_search.cv_results_)
scores_model.sort_values(by='mean_test_score', ascending=False).head(1)

|  | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_criterion | param_max_depth | param_max_features | param_n_estimators | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23 | 0.1 | 0.01 | 0.0 | 0.0 | absolute_error | 10 | 0.85 | 150 | {'criterion': 'absolute_error', 'max_depth': 1... | -54508.55 | -218804.31 | -630117.56 | -301143.47 | 242096.72 | 1 |


In [None]:
y_predict_oobscore = model_oobscore.predict(X_test)
y_predict_random_grid = model_forest_grid_best.predict(X_test)
y_predict_xgboost_grid = model_XGBRF_grid_best.predict(X_test)

In [None]:
scors = {
    'MAE': [
        mean_absolute_error(y_test, y_predict_oobscore),
        mean_absolute_error(y_test, y_predict_random_grid),
        mean_absolute_error(y_test, y_predict_xgboost_grid)],
    'MSE': [
        mean_squared_error(y_test, y_predict_oobscore),
        mean_squared_error(y_test, y_predict_random_grid),
        mean_squared_error(y_test, y_predict_xgboost_grid),
    ],
    'RMSE': [
        # RMSE as the square root of MSE; recent scikit-learn releases deprecate `squared=False`
        np.sqrt(mean_squared_error(y_test, y_predict_oobscore)),
        np.sqrt(mean_squared_error(y_test, y_predict_random_grid)),
        np.sqrt(mean_squared_error(y_test, y_predict_xgboost_grid))
    ]


}
pd.DataFrame(scors, ['RandomForest_oobscore', 'RandomForest_grid', 'XGBRFRegressor'])

| Model | MAE | MSE | RMSE |
|---|---|---|---|
| RandomForest_oobscore | 63945.44 | 12058436588.73 | 109810.91 |
| RandomForest_grid | 62229.85 | 10105878660.95 | 100528.0 |
| XGBRFRegressor | 70468.2 | 16286978738.67 | 127620.45 |
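
For reference, the three metrics in this table can be computed by hand. The sketch below uses made-up numbers (not the notebook's data) to show that RMSE is the square root of MSE and is therefore expressed in the units of the target.

```python
import math

# Made-up targets and predictions, for illustration only.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 330.0, 360.0]

errors = [p - t for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
mse = sum(e ** 2 for e in errors) / len(errors)   # mean squared error
rmse = math.sqrt(mse)                             # back in the target's units

print(mae, mse, rmse)
```

Because the errors are squared before averaging, a few large misses raise MSE and RMSE much more than MAE.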


- RandomForest_grid performs best: it has the lowest error on all three metrics, although the gap to the other models is not large. Using RMSE, which is expressed in the units of the problem, the model is off by about 100,528.00 in the theoretical monthly average of supplied volume.

- The theoretical average is the volume a location is assumed to dispense in a month. It is computed as the total over the active period divided by the number of active months.
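
The two-step aggregation behind the theoretical average can be sketched in plain Python: first sum the volume per department-municipality key and month, then average those monthly totals over the months in which the key was active. The records below are made up and the key names are placeholders.

```python
from collections import defaultdict

# (key, month, volume) records; made-up values, several records per month.
records = [
    ('DEPT-A', '01', 100.0), ('DEPT-A', '01', 50.0),
    ('DEPT-A', '02', 250.0),
    ('DEPT-B', '01', 80.0),
]

# Step 1: total volume per (key, month).
monthly_total = defaultdict(float)
for key, month, volume in records:
    monthly_total[(key, month)] += volume

# Step 2: per key, sum of monthly totals divided by the number of active months.
totals, active_months = defaultdict(float), defaultdict(int)
for (key, _month), total in monthly_total.items():
    totals[key] += total
    active_months[key] += 1

theoretical_mean = {key: totals[key] / active_months[key] for key in totals}
print(theoretical_mean)
```

A key that was active for a single month keeps its one monthly total as its average, which is why the number of active months is kept as a feature.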

# **Test**
---

In [None]:
def variacion_intervalos(y_test, y_predict):
  macf = {
      'q10': 0,
      'q20': 0,
      'q30': 0,
      'q40': 0,
      'q50': 0,
      'q60': 0,
      'q70': 0,
      'q80': 0,
      'q90': 0,
      'q100': 0,
      'erraticos':0
  }


  for i, j in zip(y_test, y_predict):
    variacion = abs((j-i)/i)
    if variacion <= 0.1:
      macf['q10'] = macf['q10'] + 1
    elif variacion <= 0.2:
      macf['q20'] = macf['q20'] + 1
    elif variacion <= 0.3:
      macf['q30'] = macf['q30'] + 1
    elif variacion <= 0.4:
      macf['q40'] = macf['q40'] + 1
    elif variacion<= 0.5:
      macf['q50'] = macf['q50'] + 1
    elif variacion <= 0.6:
      macf['q60'] = macf['q60'] + 1
    elif variacion <= 0.7:
      macf['q70'] = macf['q70'] + 1
    elif variacion <= 0.8:
      macf['q80'] = macf['q80'] + 1
    elif variacion <= 0.9:
      macf['q90'] = macf['q90'] + 1
    elif variacion <= 1:
      macf['q100'] = macf['q100'] + 1
    else:
      macf['erraticos'] = macf['erraticos'] + 1
  return macf
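
The `if`/`elif` chain in `variacion_intervalos` assigns each relative error to a 10% bucket. An equivalent, more compact formulation is sketched below with `bisect`; the function name `bucket_label` and the sample values are assumptions for illustration, and the sketch expects non-zero actual values because it divides by them.

```python
from bisect import bisect_left
from collections import Counter

# Upper edges of the buckets: 0.1, 0.2, ..., 1.0
EDGES = [k / 10 for k in range(1, 11)]


def bucket_label(actual, predicted):
    """Hypothetical helper: relative-error bucket ('q10'..'q100' or 'erraticos')."""
    variation = abs((predicted - actual) / actual)
    # bisect_left finds the first edge >= variation, so upper bounds are inclusive.
    idx = bisect_left(EDGES, variation)
    return f'q{(idx + 1) * 10}' if idx < len(EDGES) else 'erraticos'


# Made-up values: relative errors of 5%, 25%, 100% and 250%.
y_true = [100, 100, 100, 100]
y_pred = [105, 125, 200, 350]
counts = Counter(bucket_label(t, p) for t, p in zip(y_true, y_pred))
print(dict(counts))
```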

In [None]:
macf_oobscore = variacion_intervalos(y_test, y_predict_oobscore)
macf_random_grid = variacion_intervalos(y_test, y_predict_random_grid)
macf_xgboost_grid = variacion_intervalos(y_test, y_predict_xgboost_grid)

In [None]:
pd.DataFrame(macf_oobscore.values(), macf_oobscore.keys()).T

|  | q10 | q20 | q30 | q40 | q50 | q60 | q70 | q80 | q90 | q100 | erraticos |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 3 | 4 | 1 | 2 | 3 | 2 | 1 | 3 | 0 | 4 |


In [None]:
pd.DataFrame(macf_random_grid.values(), macf_random_grid.keys()).T

|  | q10 | q20 | q30 | q40 | q50 | q60 | q70 | q80 | q90 | q100 | erraticos |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 4 | 3 | 2 | 3 | 0 | 0 | 1 | 3 | 0 | 7 |


In [None]:
pd.DataFrame(macf_xgboost_grid.values(), macf_xgboost_grid.keys()).T

|  | q10 | q20 | q30 | q40 | q50 | q60 | q70 | q80 | q90 | q100 | erraticos |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 5 | 2 | 1 | 2 | 2 | 3 | 1 | 2 | 1 | 3 |


RandomForest_grid, the second option, places more records at or below a relative variation of 0.5 than the other options (15 of 26, versus 13 and 14). However, it also has 7 erratic records, with a variation above 100%. Next, the models are evaluated on 2023 data.
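
To make the comparison concrete, the bucket counts printed for the test split can be turned into the share of records whose variation is at most 0.5. The sketch below copies the counts from the three output tables; the helper name `share_within` is an assumption.

```python
BUCKETS = ['q10', 'q20', 'q30', 'q40', 'q50', 'q60',
           'q70', 'q80', 'q90', 'q100', 'erraticos']

# Counts copied from the three output tables for the test split.
counts = {
    'RandomForest_oobscore': [3, 3, 4, 1, 2, 3, 2, 1, 3, 0, 4],
    'RandomForest_grid':     [3, 4, 3, 2, 3, 0, 0, 1, 3, 0, 7],
    'XGBRFRegressor':        [4, 5, 2, 1, 2, 2, 3, 1, 2, 1, 3],
}


def share_within(row, last_bucket='q50'):
    """Hypothetical helper: fraction of records in buckets up to `last_bucket`."""
    cutoff = BUCKETS.index(last_bucket) + 1
    return sum(row[:cutoff]) / sum(row)


shares = {model: round(share_within(row), 3) for model, row in counts.items()}
print(shares)
```

With only 26 test records, one record moves a share by almost four percentage points, so these differences should be read with caution.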

In [None]:
X_validation = gas_data_periodo_test
y_validation = gas_data_periodo_test['cantidad_volumen_suministrado']

In [None]:
function_dictionary = {
    "cantidad_volumen_suministrado": ['sum'],
    "vehiculos_atendidos": ['sum'],
    'numero_de_ventas': ['sum'],
    'eds_activas': ['sum']
}

filter_feature = [
    'key',
    'mes_venta']


new_val_col  = [
    "cantidad_volumen_suministrado",
    "vehiculos_atendidos",
    'numero_de_ventas',
    'eds_activas'
]

data_group_test = agrupamiento(
    function_dictionary,
    filter_feature,
    new_val_col,
    X_validation
)

In [None]:
function_dictionary = {
    "cantidad_volumen_suministrado": ['mean'],
    "vehiculos_atendidos": ['mean'],
    'numero_de_ventas': ['mean'],
    'eds_activas': ['mean'],
    'mes_venta': ['count']
}

filter_feature = [
    'key'
]


new_val_col  = [
    "cantidad_volumen_suministrado_mean",
    "vehiculos_atendidos_mean",
    'numero_de_ventas_mean',
    'eds_activas_mean',
    'meses_activos'
]

data_group_test2 = agrupamiento(
    function_dictionary,
    filter_feature,
    new_val_col,
    data_group_test
)

In [None]:
X_vali = data_group_test2.drop(['key', 'cantidad_volumen_suministrado_mean'], axis=1)
y_vali = data_group_test2['cantidad_volumen_suministrado_mean']

In [None]:
y_predict_oobscore_validation = model_oobscore.predict(X_vali)
y_predict_random_grid_validation = model_forest_grid_best.predict(X_vali)
y_predict_xgboost_grid_validation = model_XGBRF_grid_best.predict(X_vali)

In [None]:
scors = {
    'MAE': [
        mean_absolute_error(y_vali, y_predict_oobscore_validation),
        mean_absolute_error(y_vali, y_predict_random_grid_validation),
        mean_absolute_error(y_vali, y_predict_xgboost_grid_validation)],
    'MSE': [
        mean_squared_error(y_vali, y_predict_oobscore_validation),
        mean_squared_error(y_vali, y_predict_random_grid_validation),
        mean_squared_error(y_vali, y_predict_xgboost_grid_validation),
    ],
    'RMSE': [
        # RMSE as the square root of MSE; recent scikit-learn releases deprecate `squared=False`
        np.sqrt(mean_squared_error(y_vali, y_predict_oobscore_validation)),
        np.sqrt(mean_squared_error(y_vali, y_predict_random_grid_validation)),
        np.sqrt(mean_squared_error(y_vali, y_predict_xgboost_grid_validation))
    ]


}
pd.DataFrame(scors, ['RandomForest_oobscore', 'RandomForest_grid', 'XGBRFRegressor'])

| Model | MAE | MSE | RMSE |
|---|---|---|---|
| RandomForest_oobscore | 188428.32 | 456280282242.22 | 675485.22 |
| RandomForest_grid | 177851.51 | 450135686953.75 | 670921.52 |
| XGBRFRegressor | 116475.47 | 84391438804.63 | 290502.05 |


In [None]:
macf_oobscore = variacion_intervalos(y_vali, y_predict_oobscore_validation)
macf_random_grid = variacion_intervalos(y_vali, y_predict_random_grid_validation)
macf_xgboost_grid = variacion_intervalos(y_vali, y_predict_xgboost_grid_validation)

The RandomForest grid model performed markedly worse on the 2023 validation data. In contrast, XGBRFRegressor clearly outperformed the two RandomForest options: its RMSE (290,502.05) is less than half of theirs (670,921.52 and 675,485.22), and its MAE and MSE are also lower.

In [None]:
pd.DataFrame(macf_oobscore.values(), macf_oobscore.keys()).T

|  | q10 | q20 | q30 | q40 | q50 | q60 | q70 | q80 | q90 | q100 | erraticos |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17 | 15 | 15 | 11 | 7 | 7 | 5 | 2 | 5 | 2 | 15 |


In [None]:
pd.DataFrame(macf_random_grid.values(), macf_random_grid.keys()).T

|  | q10 | q20 | q30 | q40 | q50 | q60 | q70 | q80 | q90 | q100 | erraticos |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | 15 | 14 | 6 | 9 | 5 | 5 | 7 | 7 | 1 | 19 |


In [None]:
pd.DataFrame(macf_xgboost_grid.values(), macf_xgboost_grid.keys()).T

|  | q10 | q20 | q30 | q40 | q50 | q60 | q70 | q80 | q90 | q100 | erraticos |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26 | 15 | 16 | 6 | 5 | 8 | 2 | 4 | 4 | 4 | 11 |


- Most XGBRFRegressor records (57 of 101) have a variation of at most 30% (buckets q10 to q30); for the two RandomForest models that share is below half (47 and 42 of 101).

- One possible reason for the abrupt change in the errors is that the problem was underestimated and too small a sample was used.
- Another is that the average monthly volume may have changed significantly in 2023.
- Randomness may also play a part.
