![image info](https://raw.githubusercontent.com/davidzarruk/MIAD_ML_NLP_2023/main/images/banner_1.png)

# Project 1 - Used vehicle price prediction

In this project you will put into practice your knowledge of tree-based and ensemble predictive models, and of model deployment. While working on it, follow the instructions given in the "Guía del proyecto 1: Predicción de precios de vehículos usados" (Project 1 guide: used vehicle price prediction).

**Submission**: The project must be submitted during week 4. However, it is important to make progress in week 3 on modeling the problem and on part of the report, as indicated in the guide.

To submit, attach the self-contained report as a PDF to the project submission activity found in week 4, and upload the predictions file to the [Kaggle competition](https://www.kaggle.com/competitions/miad2024-12-prediccion-precio-vehiculos).

# Data processing and preliminary exploration

In [2]:
# Libraries
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

In [3]:
# Load the data (stored locally as .csv)
df_train = pd.read_csv("dataTrain_carListings.csv")
# The test data has an ID column that is just a running row number, so it is used as the index
df_test = pd.read_csv("dataTest_carListings.csv", index_col=0)
# Load the real data
# df_real = pd.read_csv("true_car_listings.csv")

In [3]:
# Drop additional columns
# df_real.drop(["City", "Vin"], axis=1, inplace=True)
# df_real.columns

## Training and tuning XGB

In [4]:
# pip install xgboost

In [4]:
import sklearn.datasets
import sklearn.metrics
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import xgboost as xgb
from xgboost import XGBRegressor

In [6]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400000 entries, 0 to 399999
Data columns (total 6 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   Price    400000 non-null  int64 
 1   Year     400000 non-null  int64 
 2   Mileage  400000 non-null  int64 
 3   State    400000 non-null  object
 4   Make     400000 non-null  object
 5   Model    400000 non-null  object
dtypes: int64(3), object(3)
memory usage: 18.3+ MB


In [5]:
# Encode categorical variables
categorical_columns = ['State', 'Make', 'Model']
df_train = pd.get_dummies(df_train, columns=categorical_columns).astype(int)
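
As a rough illustration of what the encoding step does to the column count, the sketch below one-hot encodes a small made-up frame (the rows are invented for illustration; only the column names follow the dataset). Each categorical column is replaced by one indicator column per distinct value, so the width grows with the cardinality of `State`, `Make`, and `Model`.

```python
import pandas as pd

# Toy rows invented for illustration; only the column names match the dataset
toy = pd.DataFrame({
    "Price": [15000, 22000, 9000, 18000],
    "Year": [2014, 2016, 2010, 2015],
    "Mileage": [60000, 30000, 120000, 45000],
    "State": [" FL", " TX", " FL", " CA"],
    "Make": ["Ford", "Toyota", "Ford", "Honda"],
    "Model": ["Focus", "Camry", "Focus", "Civic"],
})
categorical_columns = ["State", "Make", "Model"]

# Expected width: numeric columns + one indicator per distinct category value
n_numeric = toy.shape[1] - len(categorical_columns)
n_indicators = sum(toy[col].nunique() for col in categorical_columns)

encoded = pd.get_dummies(toy, columns=categorical_columns).astype(int)
print(encoded.shape[1], n_numeric + n_indicators)
```

With hundreds of distinct `Model` values, this is why the encoded training frame ends up with several hundred columns.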

In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split
data = df_train.drop(['Price'], axis=1)
target = df_train['Price']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.25, random_state=42)

# train_test_split already returns pandas objects when given pandas inputs;
# these conversions only make the types (and the 'Price' name) explicit
X_train = pd.DataFrame(X_train, columns=data.columns)
X_test = pd.DataFrame(X_test, columns=data.columns)
y_train = pd.Series(y_train, name='Price')
y_test = pd.Series(y_test, name='Price')

# Check the types of the resulting structures
print("Type of X_train:", type(X_train))
print("Type of X_test:", type(X_test))
print("Type of y_train:", type(y_train))
print("Type of y_test:", type(y_test))

Type of X_train: <class 'pandas.core.frame.DataFrame'>
Type of X_test: <class 'pandas.core.frame.DataFrame'>
Type of y_train: <class 'pandas.core.series.Series'>
Type of y_test: <class 'pandas.core.series.Series'>


In [11]:


import optuna
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error

# Objective function for the optimization
def objective(trial):
    n_estimators = trial.suggest_int("n_estimators", 10, 100)
    max_samples = trial.suggest_float("max_samples", 0.5, 1.0)
    max_features = trial.suggest_float("max_features", 0.5, 1.0)
    bootstrap = trial.suggest_categorical("bootstrap", [True, False])
    bootstrap_features = trial.suggest_categorical("bootstrap_features", [True, False])
    warm_start = trial.suggest_categorical("warm_start", [True, False])
    
    reg = BaggingRegressor(
        n_estimators=n_estimators,
        max_samples=max_samples,
        max_features=max_features,
        bootstrap=bootstrap,
        bootstrap_features=bootstrap_features,
        warm_start=warm_start,
        random_state=42
    )
    
    reg.fit(X_train, y_train)
    y_pred = reg.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    return rmse

# Create and run the optimization study with a time limit
study = optuna.create_study(direction="minimize")  # minimize, because the objective returns the RMSE
study.optimize(objective, n_trials=100, timeout=3600)  # 1-hour time limit

# Results
print("Best parameters:", study.best_params)
print("Best RMSE:", study.best_value)

# Refit the model with the best parameters
best_reg = BaggingRegressor(
    n_estimators=study.best_params['n_estimators'],
    max_samples=study.best_params['max_samples'],
    max_features=study.best_params['max_features'],
    bootstrap=study.best_params['bootstrap'],
    bootstrap_features=study.best_params['bootstrap_features'],
    warm_start=study.best_params['warm_start'],
    random_state=42
)
best_reg.fit(X_train, y_train)

# The model is now ready to be used or evaluated in more detail


[I 2024-04-25 09:17:05,749] A new study created in memory with name: no-name-e203564a-4421-4133-b0a3-d5e6558a7189
[I 2024-04-25 09:27:34,852] Trial 0 finished with value: 6249.331659738671 and parameters: {'n_estimators': 93, 'max_samples': 0.801568164623271, 'max_features': 0.5184037757417734, 'bootstrap': True, 'bootstrap_features': True, 'warm_start': False}. Best is trial 0 with value: 6249.331659738671.
[I 2024-04-25 09:41:45,557] Trial 1 finished with value: 4621.955223673426 and parameters: {'n_estimators': 79, 'max_samples': 0.5156798312823754, 'max_features': 0.8994471249644933, 'bootstrap': False, 'bootstrap_features': True, 'warm_start': False}. Best is trial 1 with value: 4621.955223673426.
[I 2024-04-25 09:58:29,754] Trial 2 finished with value: 4476.08850019158 and parameters: {'n_estimators': 62, 'max_samples': 0.5486829558076504, 'max_features': 0.6153976469360052, 'bootstrap': True, 'bootstrap_features': False, 'warm_start': False}. Best is trial 2 with value: 4476.088

Best parameters: {'n_estimators': 62, 'max_samples': 0.5486829558076504, 'max_features': 0.6153976469360052, 'bootstrap': True, 'bootstrap_features': False, 'warm_start': False}
Best RMSE: 4476.08850019158
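
The value each trial reports is the root mean squared error on the hold-out split. As a small worked sketch with made-up numbers (not taken from the dataset), the RMSE is the square root of the average squared residual, so it is expressed in the same units as `Price`:

```python
import numpy as np

# Made-up prices for illustration only
y_true = np.array([10000.0, 20000.0, 30000.0, 40000.0])
y_pred = np.array([12000.0, 19000.0, 27000.0, 44000.0])

residuals = y_true - y_pred          # [-2000, 1000, 3000, -4000]
mse = np.mean(residuals ** 2)        # mean of the squared residuals
rmse = np.sqrt(mse)                  # back in the units of the target

print(f"MSE = {mse:.1f}, RMSE = {rmse:.1f}")
```

Because the residuals are squared before averaging, a few large misses weigh more heavily on the RMSE than many small ones.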


In [18]:
import optuna
import numpy as np
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

# Objective function for the optimization
def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'max_depth': trial.suggest_int('max_depth', 3, 15),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
        'gamma': trial.suggest_float('gamma', 0.0, 1.0),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        # suggest_float(..., log=True) replaces the deprecated suggest_loguniform
        'lambda': trial.suggest_float('lambda', 1e-3, 10.0, log=True),
        'alpha': trial.suggest_float('alpha', 1e-3, 10.0, log=True),
        # scale_pos_weight targets class imbalance in classification; it is kept
        # here only because it was part of the recorded search
        'scale_pos_weight': trial.suggest_float('scale_pos_weight', 1, 100)
    }
    
    # Recent XGBoost releases take early_stopping_rounds in the constructor rather than in fit()
    reg = XGBRegressor(**params, early_stopping_rounds=50, random_state=42)
    reg.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
    
    y_pred = reg.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    return rmse

# Create and run the optimization study
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=100, timeout=7200)  # 2-hour limit (7200 s)

# Results
print("Best parameters:", study.best_params)
print("Best RMSE:", study.best_value)

# Refit the model with the best parameters
best_reg = XGBRegressor(**study.best_params, random_state=42)
best_reg.fit(X_train, y_train)


[I 2024-04-25 12:00:18,354] A new study created in memory with name: no-name-c6093811-fa8d-4333-92ab-9ca09d783c90
[I 2024-04-25 12:00:32,646] Trial 0 finished with value: 6107.411781863169 and parameters: {'n_estimators': 108, 'max_depth': 5, 'min_child_weight': 7, 'gamma': 0.32818886251806256, 'learning_rate': 0.07659074397579649, 'subsample': 0.7285260770421705, 'colsample_bytree': 0.594034521742388, 'lambda': 0.02053595551720402, 'alpha': 1.0386149535617004, 'scale_pos_weight': 61.525616251564394}. Best is trial 0 with value: 6107.411781863169.
[I 2024-04-25 12:01:39,316] Trial 1 finished with value: 3555.7563794012485 and parameters: {'n_estimators': 592, 'max_depth': 7, 'min_child_weight': 4, 'gamma': 0.8944539791912829, 'learning_rate': 0.13951754102998098, 'subsample': 0.7811533302283751, 'colsample_bytree': 0.902414643116183, 'lambda': 0.0010490887048911982, 'alpha': 0.08070787874148855, 'scale_pos_weight': 94.04380293979686}. Best is trial 1 with value: 3555.7563794012485.
[I 

Best parameters: {'n_estimators': 996, 'max_depth': 10, 'min_child_weight': 7, 'gamma': 0.048642411439236644, 'learning_rate': 0.29837432228713884, 'subsample': 0.9898373156271857, 'colsample_bytree': 0.8350961180480424, 'lambda': 0.005369078483468314, 'alpha': 0.0010458991368176762, 'scale_pos_weight': 34.055589403357416}
Best RMSE: 3458.9098166975778
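
The regularization terms `lambda` and `alpha` are searched on a logarithmic scale. As a rough sketch of what that means (plain NumPy, not Optuna's actual sampler, which by default is guided by earlier trials), a log-uniform draw is uniform in log space, so each decade of the range is about equally likely:

```python
import numpy as np

rng = np.random.default_rng(0)
low, high = 1e-3, 10.0  # same bounds as the lambda/alpha search range

# Uniform in log space, then mapped back with exp
log_uniform = np.exp(rng.uniform(np.log(low), np.log(high), size=100_000))
# Plain uniform draws over the same interval, for comparison
uniform = rng.uniform(low, high, size=100_000)

# Share of draws below 0.1, i.e. the two lowest decades out of four
share_log = np.mean(log_uniform < 0.1)
share_uniform = np.mean(uniform < 0.1)
print(f"log-uniform: {share_log:.3f}, uniform: {share_uniform:.3f}")
```

A plain uniform search over the same interval would spend almost all of its trials on values above 0.1, which is why a log scale is the usual choice for parameters that span several orders of magnitude.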


In [21]:
import optuna
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Objective function for the optimization
def objective(trial):
    # Random Forest parameters
    rf_params = {
        'n_estimators': trial.suggest_int('rf_n_estimators', 100, 500),
        'max_depth': trial.suggest_int('rf_max_depth', 3, 20),
        'min_samples_split': trial.suggest_int('rf_min_samples_split', 2, 10),
        'min_samples_leaf': trial.suggest_int('rf_min_samples_leaf', 1, 10)
    }
    
    # XGBoost parameters
    xgb_params = {
        'n_estimators': trial.suggest_int('xgb_n_estimators', 100, 500),
        'max_depth': trial.suggest_int('xgb_max_depth', 3, 15),
        'min_child_weight': trial.suggest_int('xgb_min_child_weight', 1, 10),
        'gamma': trial.suggest_float('xgb_gamma', 0.0, 1.0),
        'learning_rate': trial.suggest_float('xgb_learning_rate', 0.01, 0.3),
        'subsample': trial.suggest_float('xgb_subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('xgb_colsample_bytree', 0.5, 1.0)
    }

    # Create the base models (xgb_model avoids reusing the name of the xgb module)
    rf = RandomForestRegressor(**rf_params, random_state=42)
    xgb_model = XGBRegressor(**xgb_params, random_state=42)

    # Create the stacking ensemble
    stack_reg = StackingRegressor(
        estimators=[('rf', rf), ('xgb', xgb_model)],
        final_estimator=LinearRegression(),
        passthrough=False
    )
    
    # Train the stacking model
    stack_reg.fit(X_train, y_train)
    
    # Predict and compute the RMSE
    y_pred = stack_reg.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    return rmse

# Create and run the optimization study
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=100, timeout=3600)  # 1-hour limit

# Results
print("Best parameters:", study.best_trial.params)
print("Best RMSE:", study.best_trial.value)

# Rebuild the models with the best parameters
rf_best = RandomForestRegressor(**{k[3:]: v for k, v in study.best_params.items() if k.startswith('rf_')}, random_state=42)
xgb_best = XGBRegressor(**{k[4:]: v for k, v in study.best_params.items() if k.startswith('xgb_')}, random_state=42)
stack_reg_best = StackingRegressor(
    estimators=[('rf', rf_best), ('xgb', xgb_best)],
    final_estimator=LinearRegression(),
    passthrough=False  # same setting as in the objective, so the refit matches the tuned configuration
)

# Train the final model
stack_reg_best.fit(X_train, y_train)



[I 2024-04-25 14:19:27,759] A new study created in memory with name: no-name-46a70fa6-ba08-4939-9827-32c6c2a91fd1
[I 2024-04-25 16:05:49,557] Trial 0 finished with value: 5997.567046217543 and parameters: {'rf_n_estimators': 389, 'rf_max_depth': 9, 'rf_min_samples_split': 6, 'rf_min_samples_leaf': 7, 'xgb_n_estimators': 392, 'xgb_max_depth': 6, 'xgb_min_child_weight': 10, 'xgb_gamma': 0.6549283630830555, 'xgb_learning_rate': 0.012454017827420245, 'xgb_subsample': 0.8076163019864833, 'xgb_colsample_bytree': 0.5004985904294686}. Best is trial 0 with value: 5997.567046217543.


Best parameters: {'rf_n_estimators': 389, 'rf_max_depth': 9, 'rf_min_samples_split': 6, 'rf_min_samples_leaf': 7, 'xgb_n_estimators': 392, 'xgb_max_depth': 6, 'xgb_min_child_weight': 10, 'xgb_gamma': 0.6549283630830555, 'xgb_learning_rate': 0.012454017827420245, 'xgb_subsample': 0.8076163019864833, 'xgb_colsample_bytree': 0.5004985904294686}
Best RMSE: 5997.567046217543
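
Optuna stores every suggested value in one flat dictionary, so the `rf_` and `xgb_` prefixes are what keep the two models' hyperparameters apart. A minimal sketch of the prefix-stripping step, using a made-up dictionary and a hypothetical helper `split_by_prefix` (not part of Optuna or scikit-learn):

```python
# Hypothetical best_params dictionary with the same prefix convention as the study
best_params = {
    "rf_n_estimators": 300,
    "rf_max_depth": 12,
    "xgb_n_estimators": 400,
    "xgb_learning_rate": 0.05,
}

def split_by_prefix(params, prefix):
    """Return the entries whose key starts with prefix, with the prefix removed."""
    return {k[len(prefix):]: v for k, v in params.items() if k.startswith(prefix)}

rf_kwargs = split_by_prefix(best_params, "rf_")
xgb_kwargs = split_by_prefix(best_params, "xgb_")
print(rf_kwargs)
print(xgb_kwargs)
```

Slicing by `len(prefix)` rather than a hard-coded offset keeps the helper correct if a prefix is later renamed.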


In [7]:
import optuna
import numpy as np
from xgboost import XGBRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import cross_val_score, KFold

# Objective function for the optimization
def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'max_depth': trial.suggest_int('max_depth', 3, 15),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
        'gamma': trial.suggest_float('gamma', 0.0, 1.0),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        # suggest_float(..., log=True) replaces the deprecated suggest_loguniform
        'lambda': trial.suggest_float('lambda', 1e-3, 10.0, log=True),
        'alpha': trial.suggest_float('alpha', 1e-3, 10.0, log=True),
        'scale_pos_weight': trial.suggest_float('scale_pos_weight', 1, 100)
    }
    
    reg = XGBRegressor(**params, random_state=42)
    
    kf = KFold(n_splits=3, shuffle=True, random_state=42)
    
    # Mean RMSE over the 3 folds. The built-in 'neg_root_mean_squared_error'
    # scorer returns the negative RMSE, because scikit-learn scorers follow a
    # "greater is better" convention
    scores = cross_val_score(reg, data, target, cv=kf, scoring='neg_root_mean_squared_error')
    rmse_mean = -scores.mean()  # flip the sign to get a positive RMSE
    return rmse_mean

# Create and run the optimization study
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=100, timeout=7200)  # 2-hour limit

# Results
print("Best parameters:", study.best_params)
print("Best RMSE:", study.best_value)

# Refit the model with the best parameters on all the training data
best_reg = XGBRegressor(**study.best_params, random_state=42)
best_reg.fit(data, target)


[I 2024-04-26 00:35:47,494] A new study created in memory with name: no-name-8c8e7b7f-0527-48b9-aafa-b618b480340a
[I 2024-04-26 00:36:52,785] Trial 0 finished with value: 3578.2688753983525 and parameters: {'n_estimators': 615, 'max_depth': 10, 'min_child_weight': 3, 'gamma': 0.1519025210338012, 'learning_rate': 0.24665056488928921, 'subsample': 0.6433745179036909, 'colsample_bytree': 0.6242354505683942, 'lambda': 0.07041188874749668, 'alpha': 3.4568441069804634, 'scale_pos_weight': 95.15152827843455}. Best is trial 0 with value: 3578.2688753983525.
[I 2024-04-26 00:37:12,611] Trial 1 finished with value: 4182.837443299656 and parameters: {'n_estimators': 116, 'max_depth': 6, 'min_child_weight': 1, 'gamma': 0.03504795172814512, 'learning_rate': 0.29783498105540446, 'subsample': 0.8296440087760606, 'colsample_bytree': 0.8296347562054345, 'lambda': 0.4106009982636856, 'alpha': 0.024618634368798357, 'scale_pos_weight': 56.32348767128386}. Best is trial 0 with value: 3578.2688753983525.
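
The objective above averages the RMSE over three shuffled folds instead of relying on a single hold-out split. The sketch below reproduces that idea by hand on synthetic data, with an ordinary least-squares fit standing in for XGBoost (an assumption made only to keep the example dependency-light); the fold construction and the averaging are the parts that carry over.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic regression data, invented for illustration
n_samples = 300
X = rng.normal(size=(n_samples, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n_samples)

# Shuffle the row indices once and cut them into 3 folds
indices = rng.permutation(n_samples)
folds = np.array_split(indices, 3)

fold_rmse = []
for i, valid_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])

    # Least-squares fit with an intercept column (stand-in for the real model)
    A_train = np.column_stack([X[train_idx], np.ones(len(train_idx))])
    coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)

    # Score on the held-out fold only
    A_valid = np.column_stack([X[valid_idx], np.ones(len(valid_idx))])
    residuals = y[valid_idx] - A_valid @ coef
    fold_rmse.append(np.sqrt(np.mean(residuals ** 2)))

rmse_mean = float(np.mean(fold_rmse))
print(f"per-fold RMSE: {np.round(fold_rmse, 3)}, mean: {rmse_mean:.3f}")
```

Every row is used for validation exactly once, so the averaged score depends less on how one particular split happened to fall.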

In [None]:
# Untuned model (default hyperparameters)
xgb_1 = XGBRegressor(random_state=42)
xgb_1

In [None]:
# Train the XGBRegressor model and measure its performance
xgb_1.fit(X_train, y_train)
y_pred = xgb_1.predict(X_test)
# Compute the MSE, RMSE, and MAE
from sklearn.metrics import mean_squared_error, mean_absolute_error
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mse)
print("The XGBRegressor model has an RMSE of " + str(rmse) + " and an MAE of " + str(mae))

In [None]:
# pip install optuna

## Trial 1
Source: https://github.com/optuna/optuna-examples/blob/main/xgboost/xgboost_simple.py

In [None]:
import numpy as np
import optuna
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import xgboost as xgb

np.random.seed(42)

def objective(trial):
    data = df_train.drop(['Price'], axis=1).values
    target = df_train['Price'].values
    train_x, valid_x, train_y, valid_y = train_test_split(data, target, test_size=0.25, random_state=42)
    dtrain = xgb.DMatrix(train_x, label=train_y)
    dvalid = xgb.DMatrix(valid_x, label=valid_y)

    param = {
        "verbosity": 0,
        "objective": "reg:squarederror",
        # "exact" suits small datasets; "approx" is used here.
        "tree_method": "approx",
        # defines booster, gblinear for linear functions.
        "booster": "gbtree",  # fix the booster to gbtree
        # L2 regularization weight.
        "lambda": trial.suggest_float("lambda", 1e-8, 1.0, log=True),
        # L1 regularization weight.
        "alpha": trial.suggest_float("alpha", 1e-8, 1.0, log=True),
        # sampling ratio for training data.
        "subsample": trial.suggest_float("subsample", 0.2, 1.0),
        # sampling according to each tree.
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.2, 1.0),
    }

    # gbtree-specific parameters
    param["max_depth"] = trial.suggest_int("max_depth", 3, 9, step=2)
    param["min_child_weight"] = trial.suggest_int("min_child_weight", 2, 10)
    param["eta"] = trial.suggest_float("eta", 1e-8, 1.0, log=True)
    param["gamma"] = trial.suggest_float("gamma", 1e-8, 1.0, log=True)
    param["grow_policy"] = trial.suggest_categorical("grow_policy", ["depthwise", "lossguide"])

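    # xgb.train runs only 10 boosting rounds unless num_boost_round is given
    bst = xgb.train(param, dtrain)
    preds = bst.predict(dvalid)
    # Price is a regression target, so the raw predictions are scored directly
    # (the rounding step from the classification example is not needed)
    mse = mean_squared_error(valid_y, preds)
    return mse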

if __name__ == "__main__":
    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=50, timeout=600)

    print("Number of finished trials: ", len(study.trials))
    print("Best trial:")
    trial = study.best_trial

    print("  Value: {}".format(trial.value))
    print("  Params: ")
    for key, value in trial.params.items():
        print("    {}: {}".format(key, value))

In [None]:
# XGBoost model defined with the parameters found by Optuna
xgb_calibrado = XGBRegressor(
    booster='gbtree',
    reg_lambda=0.0006300622095494117,
    alpha=0.00014622406232897364,
    subsample=0.23255033200475195,
    colsample_bytree=0.4885554959213272,
    max_depth=9,
    min_child_weight=8,
    eta=0.7114903492018297,
    gamma=6.252368056908203e-05,
    grow_policy='lossguide',
    random_state=1
)
xgb_calibrado.fit(X_train, y_train)
y_pred = xgb_calibrado.predict(X_test)
mse_xgb_calibrado = mean_squared_error(y_test, y_pred)
mae_xgb_calibrado = mean_absolute_error(y_test, y_pred)
rmse_xgb_calibrado = np.sqrt(mse_xgb_calibrado)
print("The XGB model tuned with Optuna has an RMSE of " + str(rmse_xgb_calibrado) + " and an MAE of " + str(mae_xgb_calibrado))

In [None]:
# Attempt 2: more trials
np.random.seed(42)

def objective(trial):
    data = df_train.drop(['Price'], axis=1).values
    target = df_train['Price'].values
    train_x, valid_x, train_y, valid_y = train_test_split(data, target, test_size=0.25, random_state=42)
    dtrain = xgb.DMatrix(train_x, label=train_y)
    dvalid = xgb.DMatrix(valid_x, label=valid_y)

    param = {
        "verbosity": 0,
        "objective": "reg:squarederror",
        # "exact" suits small datasets; "approx" is used here.
        "tree_method": "approx",
        # defines booster, gblinear for linear functions.
        "booster": "gbtree",  # fix the booster to gbtree
        # L2 regularization weight.
        "lambda": trial.suggest_float("lambda", 1e-8, 1.0, log=True),
        # L1 regularization weight.
        "alpha": trial.suggest_float("alpha", 1e-8, 1.0, log=True),
        # sampling ratio for training data.
        "subsample": trial.suggest_float("subsample", 0.2, 1.0),
        # sampling according to each tree.
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.2, 1.0),
    }

    # gbtree-specific parameters
    param["max_depth"] = trial.suggest_int("max_depth", 3, 9, step=2)
    param["min_child_weight"] = trial.suggest_int("min_child_weight", 2, 10)
    param["eta"] = trial.suggest_float("eta", 1e-8, 1.0, log=True)
    param["gamma"] = trial.suggest_float("gamma", 1e-8, 1.0, log=True)
    param["grow_policy"] = trial.suggest_categorical("grow_policy", ["depthwise", "lossguide"])

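    # xgb.train runs only 10 boosting rounds unless num_boost_round is given
    bst = xgb.train(param, dtrain)
    preds = bst.predict(dvalid)
    # Price is a regression target, so the raw predictions are scored directly
    # (the rounding step from the classification example is not needed)
    mse = mean_squared_error(valid_y, preds)
    return mse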

if __name__ == "__main__":
    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=100, timeout=600)

    print("Number of finished trials: ", len(study.trials))
    print("Best trial:")
    trial = study.best_trial

    print("  Value: {}".format(trial.value))
    print("  Params: ")
    for key, value in trial.params.items():
        print("    {}: {}".format(key, value))

In [None]:
# XGBoost model defined with the parameters found by Optuna
xgb_cal = XGBRegressor(
    booster='gbtree',
    reg_lambda=1.3618109137815986e-08,
    alpha=0.0068473212560469085,
    subsample=0.5383447526406704,
    colsample_bytree=0.7375702235448874,
    max_depth=9,
    min_child_weight=4,
    eta=0.925350157501082,
    gamma=0.8591307262185451,
    grow_policy='lossguide',
    random_state=1
)
xgb_cal.fit(X_train, y_train)
y_pred = xgb_cal.predict(X_test)
mse_xgb_cal = mean_squared_error(y_test, y_pred)
mae_xgb_cal = mean_absolute_error(y_test, y_pred)
rmse_xgb_cal = np.sqrt(mse_xgb_cal)
print("The XGB model tuned with Optuna has an RMSE of " + str(rmse_xgb_cal) + " and an MAE of " + str(mae_xgb_cal))

In [None]:
# Attempt 3: adding the n_estimators parameter
import numpy as np
import optuna
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import xgboost as xgb

np.random.seed(42)

def objective(trial):
    data = df_train.drop(['Price'], axis=1).values
    target = df_train['Price'].values
    train_x, valid_x, train_y, valid_y = train_test_split(data, target, test_size=0.25, random_state=42)
    dtrain = xgb.DMatrix(train_x, label=train_y)
    dvalid = xgb.DMatrix(valid_x, label=valid_y)

    param = {
        "verbosity": 0,
        "objective": "reg:squarederror",
        # "exact" suits small datasets; "approx" is used here.
        "tree_method": "approx",
        # defines booster, gblinear for linear functions.
        "booster": "gbtree",  # fix the booster to gbtree
        # L2 regularization weight.
        "lambda": trial.suggest_float("lambda", 1e-8, 1.0, log=True),
        # L1 regularization weight.
        "alpha": trial.suggest_float("alpha", 1e-8, 1.0, log=True),
        # sampling ratio for training data.
        "subsample": trial.suggest_float("subsample", 0.2, 1.0),
        # sampling according to each tree.
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.2, 1.0),
        # number of trees in the ensemble
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
    }

    # gbtree-specific parameters
    param["max_depth"] = trial.suggest_int("max_depth", 3, 9, step=2)
    param["min_child_weight"] = trial.suggest_int("min_child_weight", 2, 10)
    param["eta"] = trial.suggest_float("eta", 1e-8, 1.0, log=True)
    param["gamma"] = trial.suggest_float("gamma", 1e-8, 1.0, log=True)
    param["grow_policy"] = trial.suggest_categorical("grow_policy", ["depthwise", "lossguide"])

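    # xgb.train does not read "n_estimators" from param; the number of boosting
    # rounds has to be passed explicitly as num_boost_round (default: 10)
    num_boost_round = param.pop("n_estimators")
    bst = xgb.train(param, dtrain, num_boost_round=num_boost_round)
    preds = bst.predict(dvalid)
    # Price is a regression target, so the raw predictions are scored directly
    mse = mean_squared_error(valid_y, preds)
    return mse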

if __name__ == "__main__":
    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=50, timeout=600)

    print("Number of finished trials: ", len(study.trials))
    print("Best trial:")
    trial = study.best_trial

    print("  Value: {}".format(trial.value))
    print("  Params: ")
    for key, value in trial.params.items():
        print("    {}: {}".format(key, value))

In [None]:
# XGBoost model defined with the parameters found by Optuna
xgb_cal_2 = XGBRegressor(
    booster='gbtree',
    reg_lambda=0.9501091374676599,
    alpha=2.2458201898736107e-06,
    subsample=0.33554148719968135,
    colsample_bytree=0.6131209517619073,
    n_estimators= 988,
    max_depth=3,
    min_child_weight=8,
    eta=0.9710181349940284,
    gamma=0.0031652899319421076,
    grow_policy='depthwise',
    random_state=1
)
xgb_cal_2.fit(X_train, y_train)
y_pred = xgb_cal_2.predict(X_test)
mse_xgb_cal_2 = mean_squared_error(y_test, y_pred)
mae_xgb_cal_2 = mean_absolute_error(y_test, y_pred)
rmse_xgb_cal_2 = np.sqrt(mse_xgb_cal_2)
print("The XGB model tuned with Optuna has an RMSE of " + str(rmse_xgb_cal_2) + " and an MAE of " + str(mae_xgb_cal_2))

In [None]:
# Attempt 4: changing the search range of the gamma parameter
import xgboost as xgb

np.random.seed(42)

def objective(trial):
    data = df_train.drop(['Price'], axis=1).values
    target = df_train['Price'].values
    train_x, valid_x, train_y, valid_y = train_test_split(data, target, test_size=0.25, random_state=42)
    dtrain = xgb.DMatrix(train_x, label=train_y)
    dvalid = xgb.DMatrix(valid_x, label=valid_y)

    param = {
        "verbosity": 0,
        "objective": "reg:squarederror",
        # "exact" suits small datasets; "approx" is used here.
        "tree_method": "approx",
        # defines booster, gblinear for linear functions.
        "booster": "gbtree",  # fix the booster to gbtree
        # L2 regularization weight.
        "lambda": trial.suggest_float("lambda", 1e-8, 1.0, log=True),
        # L1 regularization weight.
        "alpha": trial.suggest_float("alpha", 1e-8, 1.0, log=True),
        # sampling ratio for training data.
        "subsample": trial.suggest_float("subsample", 0.2, 1.0),
        # sampling according to each tree.
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.2, 1.0),
        # number of trees in the ensemble
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
    }

    # gbtree-specific parameters
    param["max_depth"] = trial.suggest_int("max_depth", 3, 9, step=2)
    param["min_child_weight"] = trial.suggest_int("min_child_weight", 2, 10)
    param["eta"] = trial.suggest_float("eta", 1e-8, 1.0, log=True)
    param["gamma"] = trial.suggest_float("gamma", 1e-8, 500, log=True)
    param["grow_policy"] = trial.suggest_categorical("grow_policy", ["depthwise", "lossguide"])

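    # xgb.train does not read "n_estimators" from param; the number of boosting
    # rounds has to be passed explicitly as num_boost_round (default: 10)
    num_boost_round = param.pop("n_estimators")
    bst = xgb.train(param, dtrain, num_boost_round=num_boost_round)
    preds = bst.predict(dvalid)
    # Price is a regression target, so the raw predictions are scored directly
    mse = mean_squared_error(valid_y, preds)
    return mse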

if __name__ == "__main__":
    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=50, timeout=600)

    print("Number of finished trials: ", len(study.trials))
    print("Best trial:")
    trial = study.best_trial

    print("  Value: {}".format(trial.value))
    print("  Params: ")
    for key, value in trial.params.items():
        print("    {}: {}".format(key, value))

In [None]:
# XGBoost model defined with the parameters found by Optuna
xgb_2 = XGBRegressor(
    reg_lambda=9.564561673448604e-07,
    alpha=1.6980413730845048e-05,
    subsample=0.2288913151339783,
    colsample_bytree=0.9932758875637113,
    n_estimators= 699,
    max_depth=9,
    min_child_weight=5,
    eta=0.9676347877899362,
    gamma=1.115418460534174e-08,
    grow_policy='lossguide',
    random_state=1
)
xgb_2.fit(X_train, y_train)
y_pred = xgb_2.predict(X_test)
mse_xgb_2 = mean_squared_error(y_test, y_pred)
mae_xgb_2 = mean_absolute_error(y_test, y_pred)
rmse_xgb_2 = np.sqrt(mse_xgb_2)
print("The XGB model tuned with Optuna has an RMSE of " + str(rmse_xgb_2) + " and an MAE of " + str(mae_xgb_2))

In [None]:
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
# Tuning gamma
gamma = np.arange(0, 100, 10)
MSE_2 = []
for valor in gamma:
    # A distinct name avoids overwriting the imported xgb module
    xgb_gamma = XGBRegressor(gamma=valor, random_state=1)
    # The scorer returns the negative MSE, so the sign is flipped to store a positive MSE
    scores = cross_val_score(xgb_gamma, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
    MSE_2.append(-scores.mean())

In [None]:
plt.plot(gamma, MSE_2)
plt.title("XGBoost performance for each value of gamma")
plt.xlabel('gamma')
plt.ylabel('MSE')

# Uploading predictions to Kaggle

In [None]:
X_train.info()

In [None]:
X_train.columns

In [19]:
# Encode the categorical variables in df_test
df_test_ = pd.get_dummies(df_test, columns=categorical_columns).astype(int)

# Align the columns of df_test with X_train (columns missing from the test set are filled with 0)
df_test_aligned, _ = df_test_.align(X_train, axis=1, fill_value=0)

column_order = X_train.columns.tolist()

# Keep only the X_train columns, in the same order as in X_train
df_test_ordenado = df_test_aligned[column_order]
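
One-hot encoding the test set on its own can produce a different set of columns than the training set: categories seen only in training are missing, and categories seen only in test are extra. A minimal sketch with toy frames (invented for illustration) suggesting that `DataFrame.reindex` with the training columns gives the same result as the align-and-reorder steps in a single call:

```python
import pandas as pd

# Toy encoded frames, invented for illustration
train_cols = ["Year", "Mileage", "Make_Ford", "Make_Honda", "Make_Toyota"]
X_train_toy = pd.DataFrame([[2014, 60000, 1, 0, 0]], columns=train_cols)

# The test frame lacks Make_Honda and Make_Toyota and has an unseen Make_Kia
test_toy = pd.DataFrame(
    [[2016, 30000, 1, 0], [2012, 90000, 0, 1]],
    columns=["Year", "Mileage", "Make_Ford", "Make_Kia"],
)

# Approach used in the notebook: outer align on columns, then reorder
aligned, _ = test_toy.align(X_train_toy, axis=1, fill_value=0)
via_align = aligned[train_cols]

# Equivalent single step: keep exactly the training columns, fill gaps with 0
via_reindex = test_toy.reindex(columns=train_cols, fill_value=0)

print(via_reindex)
```

In both versions the unseen `Make_Kia` column is dropped, so a car of a make the model never saw is scored with all `Make_` indicators set to 0.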

In [13]:
df_test_ordenado.columns

Index(['Year', 'Mileage', 'State_ AK', 'State_ AL', 'State_ AR', 'State_ AZ',
       'State_ CA', 'State_ CO', 'State_ CT', 'State_ DC',
       ...
       'Model_Yaris4dr', 'Model_YarisBase', 'Model_YarisLE', 'Model_Yukon',
       'Model_Yukon2WD', 'Model_Yukon4WD', 'Model_Yukon4dr', 'Model_tC2dr',
       'Model_xB5dr', 'Model_xD5dr'],
      dtype='object', length=616)

In [None]:
# The model submitted to Kaggle was:
xgb_calibrado_3 = XGBRegressor(
    booster='gbtree',
    reg_lambda=0.30212357583678445,
    alpha=0.18211410327397226,
    subsample=0.3521449713976327,
    colsample_bytree=0.44811953613366806,
    max_depth=9,
    min_child_weight=9,
    eta=0.8918833748094722,
    gamma=0.00034064735565798957,
    grow_policy='lossguide',
    random_state=1  # fixed random seed for reproducibility
)
xgb_calibrado_3.fit(X_train, y_train)
y_pred = xgb_calibrado_3.predict(X_test)
mse_xgb_calibrado_3 = mean_squared_error(y_test, y_pred)
mae_xgb_calibrado_3 = mean_absolute_error(y_test, y_pred)
rmse_xgb_calibrado_3 = np.sqrt(mse_xgb_calibrado_3)
print("The XGB model tuned with Optuna has an RMSE of " + str(rmse_xgb_calibrado_3) + " and an MAE of " + str(mae_xgb_calibrado_3))

In [20]:
# Predict with df_test_ordenado (the test features aligned to and ordered like X_train)
y_pred = best_reg.predict(df_test_ordenado)

predictions_df = pd.DataFrame(y_pred, index=df_test_ordenado.index, columns=['Price'])

predictions_df.to_csv('predicciones_2.csv')
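
Before uploading, it can help to read the file back and confirm that it has one prediction per test row and no missing values. A small sketch of such a check, using an in-memory buffer and made-up predictions in place of `predicciones_2.csv`. The index name `ID` follows the comment on the test file near the top of the notebook, but the exact format Kaggle expects is an assumption that should be checked against the competition page:

```python
import io

import numpy as np
import pandas as pd

# Stand-ins for the real objects: a test index and model predictions
test_index = pd.RangeIndex(5, name="ID")
y_pred_toy = np.array([15250.5, 8990.0, 23100.2, 17640.8, 30500.0])

submission = pd.DataFrame(y_pred_toy, index=test_index, columns=["Price"])

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
submission.to_csv(buffer)
buffer.seek(0)

# Read it back the way a grader might and run basic sanity checks
check = pd.read_csv(buffer, index_col=0)
assert list(check.columns) == ["Price"]
assert len(check) == len(test_index)
assert check["Price"].notna().all()
print(check.head())
```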