![image info](https://raw.githubusercontent.com/davidzarruk/MIAD_ML_NLP_2023/main/images/banner_1.png)

# Project 1 - Used Vehicle Price Prediction

In this project you will put into practice your knowledge of tree-based and ensemble predictive models, and of model deployment. As you work on it, follow the instructions given in the "Guía del proyecto 1: Predicción de precios de vehículos usados" (Project 1 guide: used vehicle price prediction).

**Submission**: The project must be submitted during week 4. However, it is important to make progress in week 3 on modeling the problem and on part of the report, as indicated in the guide.

To submit, attach the self-contained report as a PDF to the project submission activity found in week 4, and upload the predictions file to the [Kaggle competition](https://www.kaggle.com/competitions/miad2024-12-prediccion-precio-vehiculos).

# Data processing and preliminary exploration

In [1]:
# libraries
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

In [2]:
# load the data (stored locally as .csv files)
df_train = pd.read_csv("dataTrain_carListings.csv")
# the test data has a column named ID, which is just a sequential row number
df_test = pd.read_csv("dataTest_carListings.csv", index_col=0)
# load the real data
df_real=pd.read_csv("true_car_listings.csv")

In [3]:
# drop the extra columns
df_real.drop(["City", "Vin"], axis=1, inplace=True)
df_real.columns

Index(['Price', 'Year', 'Mileage', 'State', 'Make', 'Model'], dtype='object')

## Training and tuning XGB

In [4]:
pip install xgboost

Note: you may need to restart the kernel to use updated packages.


In [5]:
import sklearn.datasets
import sklearn.metrics
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import xgboost as xgb
from xgboost import XGBRegressor

In [6]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400000 entries, 0 to 399999
Data columns (total 6 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   Price    400000 non-null  int64 
 1   Year     400000 non-null  int64 
 2   Mileage  400000 non-null  int64 
 3   State    400000 non-null  object
 4   Make     400000 non-null  object
 5   Model    400000 non-null  object
dtypes: int64(3), object(3)
memory usage: 18.3+ MB


In [7]:
# Encode the categorical variables
categorical_columns = ['State', 'Make', 'Model']
df_train = pd.get_dummies(df_train, columns=categorical_columns).astype(int)
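
As a minimal sketch of what this encoding step produces, the toy frame below (hypothetical values, not taken from the dataset) goes through the same `pd.get_dummies(...).astype(int)` call. Each category becomes its own 0/1 column named `<column>_<category>`, which is why a column with many distinct values, such as `Model`, expands into many columns.

```python
import pandas as pd

# Hypothetical toy listings, for illustration only
toy = pd.DataFrame({
    "Price": [15000, 22000, 9000],
    "State": ["CA", "TX", "CA"],
    "Make": ["Ford", "Toyota", "Toyota"],
})

# Same call pattern as the cell above: one 0/1 column per category
toy_encoded = pd.get_dummies(toy, columns=["State", "Make"]).astype(int)

print(toy_encoded.columns.tolist())
print(toy_encoded)
```

Columns that are not listed in `columns=` (here `Price`) pass through unchanged.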

In [8]:
data = df_train.drop(['Price'], axis=1)
target = df_train['Price']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.25, random_state=42)

# Make sure the results are pandas objects
# (train_test_split already returns DataFrames/Series for pandas inputs, so this leaves them unchanged)
X_train = pd.DataFrame(X_train, columns=data.columns)
X_test = pd.DataFrame(X_test, columns=data.columns)
y_train = pd.Series(y_train, name='Price')
y_test = pd.Series(y_test, name='Price')

# Check the types of the resulting structures
print("Type of X_train:", type(X_train))
print("Type of X_test:", type(X_test))
print("Type of y_train:", type(y_train))
print("Type of y_test:", type(y_test))

Type of X_train: <class 'pandas.core.frame.DataFrame'>
Type of X_test: <class 'pandas.core.frame.DataFrame'>
Type of y_train: <class 'pandas.core.series.Series'>
Type of y_test: <class 'pandas.core.series.Series'>


In [9]:
# Untuned model
xgb_1 = XGBRegressor(random_state=42)
xgb_1

In [10]:
# Training and performance of the XGBRegressor model
xgb_1.fit(X_train, y_train)
y_pred = xgb_1.predict(X_test)
# Compute the MSE, the MAE and the RMSE
from sklearn.metrics import mean_squared_error, mean_absolute_error
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mse)
print("The XGBRegressor model has an RMSE of " + str(rmse) + " and an MAE of " + str(mae))

The XGBRegressor model has an RMSE of 4266.196601029799 and an MAE of 3039.4960325634765
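
To make the two metrics concrete, here is a small worked sketch with hypothetical prices (not values from the dataset). It computes MAE and RMSE directly from the errors with NumPy; RMSE squares the errors before averaging, so the single large miss weighs more in it than in the MAE.

```python
import numpy as np

# Hypothetical true and predicted prices, for illustration only
y_true = np.array([10000.0, 20000.0, 30000.0])
y_hat = np.array([11000.0, 18000.0, 30000.0])

errors = y_true - y_hat                 # [-1000, 2000, 0]
mae_toy = np.mean(np.abs(errors))       # average absolute error
mse_toy = np.mean(errors ** 2)          # average squared error
rmse_toy = np.sqrt(mse_toy)             # back in price units

print("MAE:", mae_toy)
print("RMSE:", rmse_toy)
```

Both values are in the same units as `Price`, which is what makes the RMSE easier to read than the raw MSE reported by the tuning runs.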


In [11]:
pip install optuna




## Attempt 1
Source: https://github.com/optuna/optuna-examples/blob/main/xgboost/xgboost_simple.py

In [24]:
import numpy as np
import optuna
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import xgboost as xgb

np.random.seed(42)

def objective(trial):
    data = df_train.drop(['Price'], axis=1).values
    target = df_train['Price'].values
    train_x, valid_x, train_y, valid_y = train_test_split(data, target, test_size=0.25, random_state=42)
    dtrain = xgb.DMatrix(train_x, label=train_y)
    dvalid = xgb.DMatrix(valid_x, label=valid_y)

    param = {
        "verbosity": 0,
        "objective": "reg:squarederror",
        # "approx" builds trees from approximate split candidates; "exact" is only practical for small datasets.
        "tree_method": "approx",
        # booster fixed to gbtree (gblinear would fit linear functions instead).
        "booster": "gbtree",
        # L2 regularization weight.
        "lambda": trial.suggest_float("lambda", 1e-8, 1.0, log=True),
        # L1 regularization weight.
        "alpha": trial.suggest_float("alpha", 1e-8, 1.0, log=True),
        # row sampling ratio for the training data.
        "subsample": trial.suggest_float("subsample", 0.2, 1.0),
        # column sampling ratio for each tree.
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.2, 1.0),
    }

    # gbtree-specific parameters
    param["max_depth"] = trial.suggest_int("max_depth", 3, 9, step=2)
    param["min_child_weight"] = trial.suggest_int("min_child_weight", 2, 10)
    param["eta"] = trial.suggest_float("eta", 1e-8, 1.0, log=True)
    param["gamma"] = trial.suggest_float("gamma", 1e-8, 1.0, log=True)
    param["grow_policy"] = trial.suggest_categorical("grow_policy", ["depthwise", "lossguide"])

    # No num_boost_round is passed, so xgb.train runs its default of 10 boosting rounds.
    bst = xgb.train(param, dtrain)
    preds = bst.predict(dvalid)
    # Rounding is carried over from the classification example cited as the source; here it only rounds prices to whole units.
    pred_labels = np.rint(preds)
    mse = mean_squared_error(valid_y, pred_labels)
    return mse

if __name__ == "__main__":
    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=50, timeout=600)

    print("Number of finished trials: ", len(study.trials))
    print("Best trial:")
    trial = study.best_trial

    print("  Value: {}".format(trial.value))
    print("  Params: ")
    for key, value in trial.params.items():
        print("    {}: {}".format(key, value))

[I 2024-04-21 07:40:42,498] A new study created in memory with name: no-name-a23ec987-b57a-4531-83f1-fc4ab3a67689
[I 2024-04-21 07:41:55,777] Trial 0 finished with value: 58266708.95231 and parameters: {'lambda': 1.971214082209539e-08, 'alpha': 3.891270848001614e-05, 'subsample': 0.9872978739878462, 'colsample_bytree': 0.30435408121285734, 'max_depth': 7, 'min_child_weight': 10, 'eta': 0.2503246399809418, 'gamma': 4.183199747011751e-06, 'grow_policy': 'lossguide'}. Best is trial 0 with value: 58266708.95231.
[I 2024-04-21 07:42:52,876] Trial 1 finished with value: 115477217.75042 and parameters: {'lambda': 2.5542694568391625e-07, 'alpha': 0.021023971090472056, 'subsample': 0.4163091603186153, 'colsample_bytree': 0.32815543454051077, 'max_depth': 5, 'min_child_weight': 7, 'eta': 3.166134818504197e-06, 'gamma': 9.197166660721089e-06, 'grow_policy': 'lossguide'}. Best is trial 0 with value: 58266708.95231.
[I 2024-04-21 07:44:10,687] Trial 2 finished with value: 29354150.29736 and paramet

Number of finished trials:  10
Best trial:
  Value: 29354150.29736
  Params: 
    lambda: 0.0006300622095494117
    alpha: 0.00014622406232897364
    subsample: 0.23255033200475195
    colsample_bytree: 0.4885554959213272
    max_depth: 9
    min_child_weight: 8
    eta: 0.7114903492018297
    gamma: 6.252368056908203e-05
    grow_policy: lossguide
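
The log above shows trials with a very small `eta` (for example about 3.17e-06 in Trial 1) scoring an MSE near 1.15e8, far above the best value. One plausible reading is that, with the few boosting rounds `xgb.train` runs by default, such a learning rate barely moves the predictions. The sketch below is an illustration under stated assumptions, not part of the original analysis: it assumes `eta` is drawn roughly uniformly in log space over the range used in the objective (which is what `log=True` asks for on random draws) and uses a hypothetical cutoff of 0.01 for a "usable" learning rate.

```python
import numpy as np

low, high = 1e-8, 1.0   # search range used for eta in the objective above
threshold = 1e-2        # hypothetical cutoff for a usable learning rate

# Analytic share of a log-uniform distribution at or above the threshold
analytic_share = (np.log10(high) - np.log10(threshold)) / (np.log10(high) - np.log10(low))

# Empirical check: draw uniformly in log10 space, then exponentiate
rng = np.random.default_rng(42)
samples = 10 ** rng.uniform(np.log10(low), np.log10(high), size=100_000)
empirical_share = float(np.mean(samples >= threshold))

print(analytic_share, empirical_share)
```

Under these assumptions only about a quarter of random draws land at or above the cutoff, so narrowing the lower bound of `eta` would be one way to spend a short trial budget on more promising configurations.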


In [25]:
# Define the XGBoost model with the parameters found by Optuna
xgb_calibrado = XGBRegressor(
    booster='gbtree',
    reg_lambda=0.0006300622095494117,
    alpha=0.00014622406232897364,
    subsample=0.23255033200475195,
    colsample_bytree=0.4885554959213272,
    max_depth=9,
    min_child_weight=8,
    eta=0.7114903492018297,
    gamma=6.252368056908203e-05,
    grow_policy='lossguide',
    random_state=1
)
xgb_calibrado.fit(X_train, y_train)
y_pred = xgb_calibrado.predict(X_test)
mse_xgb_calibrado = mean_squared_error(y_test, y_pred)
mae_xgb_calibrado = mean_absolute_error(y_test, y_pred)
rmse_xgb_calibrado = np.sqrt(mse_xgb_calibrado)
print("The XGB model tuned with Optuna has an RMSE of " + str(rmse_xgb_calibrado) + " and an MAE of " + str(mae_xgb_calibrado))

The XGB model tuned with Optuna has an RMSE of 3781.8293587339426 and an MAE of 2485.5054897171785


In [26]:
# Attempt #2: adding more trials (n_trials raised to 100; the 600 s timeout can still end the study earlier)
np.random.seed(42)

def objective(trial):
    data = df_train.drop(['Price'], axis=1).values
    target = df_train['Price'].values
    train_x, valid_x, train_y, valid_y = train_test_split(data, target, test_size=0.25, random_state=42)
    dtrain = xgb.DMatrix(train_x, label=train_y)
    dvalid = xgb.DMatrix(valid_x, label=valid_y)

    param = {
        "verbosity": 0,
        "objective": "reg:squarederror",
        # "approx" builds trees from approximate split candidates; "exact" is only practical for small datasets.
        "tree_method": "approx",
        # booster fixed to gbtree (gblinear would fit linear functions instead).
        "booster": "gbtree",
        # L2 regularization weight.
        "lambda": trial.suggest_float("lambda", 1e-8, 1.0, log=True),
        # L1 regularization weight.
        "alpha": trial.suggest_float("alpha", 1e-8, 1.0, log=True),
        # row sampling ratio for the training data.
        "subsample": trial.suggest_float("subsample", 0.2, 1.0),
        # column sampling ratio for each tree.
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.2, 1.0),
    }

    # gbtree-specific parameters
    param["max_depth"] = trial.suggest_int("max_depth", 3, 9, step=2)
    param["min_child_weight"] = trial.suggest_int("min_child_weight", 2, 10)
    param["eta"] = trial.suggest_float("eta", 1e-8, 1.0, log=True)
    param["gamma"] = trial.suggest_float("gamma", 1e-8, 1.0, log=True)
    param["grow_policy"] = trial.suggest_categorical("grow_policy", ["depthwise", "lossguide"])

    # No num_boost_round is passed, so xgb.train runs its default of 10 boosting rounds.
    bst = xgb.train(param, dtrain)
    preds = bst.predict(dvalid)
    # Rounding is carried over from the Optuna classification example this objective is based on; here it only rounds prices to whole units.
    pred_labels = np.rint(preds)
    mse = mean_squared_error(valid_y, pred_labels)
    return mse

if __name__ == "__main__":
    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=100, timeout=600)

    print("Number of finished trials: ", len(study.trials))
    print("Best trial:")
    trial = study.best_trial

    print("  Value: {}".format(trial.value))
    print("  Params: ")
    for key, value in trial.params.items():
        print("    {}: {}".format(key, value))

[I 2024-04-21 07:57:20,516] A new study created in memory with name: no-name-020af963-bbed-4a46-ad93-b4f8ed746878
[I 2024-04-21 07:58:17,335] Trial 0 finished with value: 115451793.12943 and parameters: {'lambda': 0.20040150100119045, 'alpha': 1.6913496888892257e-05, 'subsample': 0.6862576039507055, 'colsample_bytree': 0.6193776455033985, 'max_depth': 5, 'min_child_weight': 8, 'eta': 3.593276220728275e-05, 'gamma': 2.1162184437348128e-06, 'grow_policy': 'depthwise'}. Best is trial 0 with value: 115451793.12943.
[I 2024-04-21 07:59:09,854] Trial 1 finished with value: 115467205.22416 and parameters: {'lambda': 0.33238518085922236, 'alpha': 0.0035840099337023437, 'subsample': 0.7638912869257131, 'colsample_bytree': 0.4665646270373889, 'max_depth': 9, 'min_child_weight': 6, 'eta': 1.2251412424199472e-05, 'gamma': 0.3008654899419707, 'grow_policy': 'lossguide'}. Best is trial 0 with value: 115451793.12943.
[I 2024-04-21 07:59:54,080] Trial 2 finished with value: 114788166.83434 and paramet

Number of finished trials:  13
Best trial:
  Value: 25131732.68879
  Params: 
    lambda: 1.3618109137815986e-08
    alpha: 0.0068473212560469085
    subsample: 0.5383447526406704
    colsample_bytree: 0.7375702235448874
    max_depth: 9
    min_child_weight: 4
    eta: 0.925350157501082
    gamma: 0.8591307262185451
    grow_policy: lossguide


In [27]:
# Define the XGBoost model with the parameters found by Optuna
xgb_cal = XGBRegressor(
    booster='gbtree',
    reg_lambda=1.3618109137815986e-08,
    alpha=0.0068473212560469085,
    subsample=0.5383447526406704,
    colsample_bytree=0.7375702235448874,
    max_depth=9,
    min_child_weight=4,
    eta=0.925350157501082,
    gamma=0.8591307262185451,
    grow_policy='lossguide',
    random_state=1
)
xgb_cal.fit(X_train, y_train)
y_pred = xgb_cal.predict(X_test)
mse_xgb_cal = mean_squared_error(y_test, y_pred)
mae_xgb_cal = mean_absolute_error(y_test, y_pred)
rmse_xgb_cal = np.sqrt(mse_xgb_cal)
print("The XGB model tuned with Optuna has an RMSE of " + str(rmse_xgb_cal) + " and an MAE of " + str(mae_xgb_cal))

The XGB model tuned with Optuna has an RMSE of 3799.748768198024 and an MAE of 2402.751549249878


In [28]:
# Attempt #3: adding the n_estimators parameter
import numpy as np
import optuna
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import xgboost as xgb

np.random.seed(42)

def objective(trial):
    data = df_train.drop(['Price'], axis=1).values
    target = df_train['Price'].values
    train_x, valid_x, train_y, valid_y = train_test_split(data, target, test_size=0.25, random_state=42)
    dtrain = xgb.DMatrix(train_x, label=train_y)
    dvalid = xgb.DMatrix(valid_x, label=valid_y)

    param = {
        "verbosity": 0,
        "objective": "reg:squarederror",
        # "approx" builds trees from approximate split candidates; "exact" is only practical for small datasets.
        "tree_method": "approx",
        # booster fixed to gbtree (gblinear would fit linear functions instead).
        "booster": "gbtree",
        # L2 regularization weight.
        "lambda": trial.suggest_float("lambda", 1e-8, 1.0, log=True),
        # L1 regularization weight.
        "alpha": trial.suggest_float("alpha", 1e-8, 1.0, log=True),
        # row sampling ratio for the training data.
        "subsample": trial.suggest_float("subsample", 0.2, 1.0),
        # column sampling ratio for each tree.
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.2, 1.0),
        # intended number of trees in the ensemble; note that xgb.train takes the number of
        # rounds from num_boost_round, not from an "n_estimators" entry in the params dict.
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
    }

    # gbtree-specific parameters
    param["max_depth"] = trial.suggest_int("max_depth", 3, 9, step=2)
    param["min_child_weight"] = trial.suggest_int("min_child_weight", 2, 10)
    param["eta"] = trial.suggest_float("eta", 1e-8, 1.0, log=True)
    param["gamma"] = trial.suggest_float("gamma", 1e-8, 1.0, log=True)
    param["grow_policy"] = trial.suggest_categorical("grow_policy", ["depthwise", "lossguide"])

    # No num_boost_round is passed, so xgb.train runs its default of 10 boosting rounds.
    bst = xgb.train(param, dtrain)
    preds = bst.predict(dvalid)
    # Rounding is carried over from the Optuna classification example this objective is based on; here it only rounds prices to whole units.
    pred_labels = np.rint(preds)
    mse = mean_squared_error(valid_y, pred_labels)
    return mse

if __name__ == "__main__":
    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=50, timeout=600)

    print("Number of finished trials: ", len(study.trials))
    print("Best trial:")
    trial = study.best_trial

    print("  Value: {}".format(trial.value))
    print("  Params: ")
    for key, value in trial.params.items():
        print("    {}: {}".format(key, value))

[I 2024-04-21 08:17:11,925] A new study created in memory with name: no-name-e423e33d-9a4e-4db2-9856-f6b38c66cf1a
[I 2024-04-21 08:18:26,033] Trial 0 finished with value: 115405549.16559 and parameters: {'lambda': 8.483891491275393e-07, 'alpha': 0.04049792107786824, 'subsample': 0.5061438255278877, 'colsample_bytree': 0.9132630822622902, 'n_estimators': 719, 'max_depth': 7, 'min_child_weight': 6, 'eta': 7.218397156833382e-05, 'gamma': 1.012252817923622e-06, 'grow_policy': 'lossguide'}. Best is trial 0 with value: 115405549.16559.
[I 2024-04-21 08:19:21,170] Trial 1 finished with value: 115477217.75042 and parameters: {'lambda': 0.0006005291991613236, 'alpha': 0.007290525723887676, 'subsample': 0.5667400033442328, 'colsample_bytree': 0.9848274047218768, 'n_estimators': 214, 'max_depth': 9, 'min_child_weight': 10, 'eta': 2.919252606518446e-07, 'gamma': 4.14051509102989e-08, 'grow_policy': 'depthwise'}. Best is trial 0 with value: 115405549.16559.
[I 2024-04-21 08:20:12,515] Trial 2 finis

Number of finished trials:  13
Best trial:
  Value: 47623586.62675
  Params: 
    lambda: 0.9501091374676599
    alpha: 2.2458201898736107e-06
    subsample: 0.33554148719968135
    colsample_bytree: 0.6131209517619073
    n_estimators: 988
    max_depth: 3
    min_child_weight: 8
    eta: 0.9710181349940284
    gamma: 0.0031652899319421076
    grow_policy: depthwise


In [31]:
# Define the XGBoost model with the parameters found by Optuna
xgb_cal_2 = XGBRegressor(
    booster='gbtree',
    reg_lambda=0.9501091374676599,
    alpha=2.2458201898736107e-06,
    subsample=0.33554148719968135,
    colsample_bytree=0.6131209517619073,
    n_estimators= 988,
    max_depth=3,
    min_child_weight=8,
    eta=0.9710181349940284,
    gamma=0.0031652899319421076,
    grow_policy='depthwise',
    random_state=1
)
xgb_cal_2.fit(X_train, y_train)
y_pred = xgb_cal_2.predict(X_test)
mse_xgb_cal_2 = mean_squared_error(y_test, y_pred)
mae_xgb_cal_2 = mean_absolute_error(y_test, y_pred)
rmse_xgb_cal_2 = np.sqrt(mse_xgb_cal_2)
print("The XGB model tuned with Optuna has an RMSE of " + str(rmse_xgb_cal_2) + " and an MAE of " + str(mae_xgb_cal_2))

The XGB model tuned with Optuna has an RMSE of 3670.867236378481 and an MAE of 2405.6893865302277


In [34]:
# Attempt #4: changing the range of values for the gamma parameter
import xgboost as xgb

np.random.seed(42)

def objective(trial):
    data = df_train.drop(['Price'], axis=1).values
    target = df_train['Price'].values
    train_x, valid_x, train_y, valid_y = train_test_split(data, target, test_size=0.25, random_state=42)
    dtrain = xgb.DMatrix(train_x, label=train_y)
    dvalid = xgb.DMatrix(valid_x, label=valid_y)

    param = {
        "verbosity": 0,
        "objective": "reg:squarederror",
        # "approx" builds trees from approximate split candidates; "exact" is only practical for small datasets.
        "tree_method": "approx",
        # booster fixed to gbtree (gblinear would fit linear functions instead).
        "booster": "gbtree",
        # L2 regularization weight.
        "lambda": trial.suggest_float("lambda", 1e-8, 1.0, log=True),
        # L1 regularization weight.
        "alpha": trial.suggest_float("alpha", 1e-8, 1.0, log=True),
        # row sampling ratio for the training data.
        "subsample": trial.suggest_float("subsample", 0.2, 1.0),
        # column sampling ratio for each tree.
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.2, 1.0),
        # intended number of trees in the ensemble; note that xgb.train takes the number of
        # rounds from num_boost_round, not from an "n_estimators" entry in the params dict.
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
    }

    # gbtree-specific parameters
    param["max_depth"] = trial.suggest_int("max_depth", 3, 9, step=2)
    param["min_child_weight"] = trial.suggest_int("min_child_weight", 2, 10)
    param["eta"] = trial.suggest_float("eta", 1e-8, 1.0, log=True)
    # wider gamma range than in the previous attempts (upper bound raised from 1.0 to 500)
    param["gamma"] = trial.suggest_float("gamma", 1e-8, 500, log=True)
    param["grow_policy"] = trial.suggest_categorical("grow_policy", ["depthwise", "lossguide"])

    # No num_boost_round is passed, so xgb.train runs its default of 10 boosting rounds.
    bst = xgb.train(param, dtrain)
    preds = bst.predict(dvalid)
    # Rounding is carried over from the Optuna classification example this objective is based on; here it only rounds prices to whole units.
    pred_labels = np.rint(preds)
    mse = mean_squared_error(valid_y, pred_labels)
    return mse

if __name__ == "__main__":
    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=50, timeout=600)

    print("Number of finished trials: ", len(study.trials))
    print("Best trial:")
    trial = study.best_trial

    print("  Value: {}".format(trial.value))
    print("  Params: ")
    for key, value in trial.params.items():
        print("    {}: {}".format(key, value))

[I 2024-04-21 08:46:29,488] A new study created in memory with name: no-name-30e1af46-eca3-4406-89bb-0fda57c665e7
[I 2024-04-21 08:47:46,924] Trial 0 finished with value: 99690658.93237 and parameters: {'lambda': 1.881540728677821e-07, 'alpha': 0.017905189484565296, 'subsample': 0.9691611843188477, 'colsample_bytree': 0.6726434557487605, 'n_estimators': 121, 'max_depth': 9, 'min_child_weight': 7, 'eta': 0.01891843984008906, 'gamma': 0.0006103913174070533, 'grow_policy': 'lossguide'}. Best is trial 0 with value: 99690658.93237.
[I 2024-04-21 08:48:34,915] Trial 1 finished with value: 115420571.60924 and parameters: {'lambda': 0.03395876280124672, 'alpha': 0.00036198987123090963, 'subsample': 0.9084870230740569, 'colsample_bytree': 0.21853901055694616, 'n_estimators': 948, 'max_depth': 3, 'min_child_weight': 10, 'eta': 0.00027044891973931076, 'gamma': 0.002081607944394096, 'grow_policy': 'depthwise'}. Best is trial 0 with value: 99690658.93237.
[I 2024-04-21 08:49:20,296] Trial 2 finishe

Number of finished trials:  12
Best trial:
  Value: 26722531.74245
  Params: 
    lambda: 9.564561673448604e-07
    alpha: 1.6980413730845048e-05
    subsample: 0.2288913151339783
    colsample_bytree: 0.9932758875637113
    n_estimators: 699
    max_depth: 9
    min_child_weight: 5
    eta: 0.9676347877899362
    gamma: 1.115418460534174e-08
    grow_policy: lossguide


In [35]:
# Define the XGBoost model with the parameters found by Optuna
xgb_2 = XGBRegressor(
    reg_lambda=9.564561673448604e-07,
    alpha=1.6980413730845048e-05,
    subsample=0.2288913151339783,
    colsample_bytree=0.9932758875637113,
    n_estimators= 699,
    max_depth=9,
    min_child_weight=5,
    eta=0.9676347877899362,
    gamma=1.115418460534174e-08,
    grow_policy='lossguide',
    random_state=1
)
xgb_2.fit(X_train, y_train)
y_pred = xgb_2.predict(X_test)
mse_xgb_2 = mean_squared_error(y_test, y_pred)
mae_xgb_2 = mean_absolute_error(y_test, y_pred)
rmse_xgb_2 = np.sqrt(mse_xgb_2)
print("The XGB model tuned with Optuna has an RMSE of " + str(rmse_xgb_2) + " and an MAE of " + str(mae_xgb_2))

The XGB model tuned with Optuna has an RMSE of 3670.867236378481 and an MAE of 3129.3654176013756


In [37]:
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
# Tuning gamma
gamma = np.arange(0, 100, 10)
MSE_2 = []
for valor in gamma:
    # named `model` so that the `xgb` module imported earlier is not overwritten
    model = XGBRegressor(gamma=valor, random_state=1)
    # cross_val_score returns the negative MSE; negate it to store a positive MSE
    MSE_2.append(-cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error').mean())

In [None]:
plt.plot(gamma, MSE_2)
plt.title("XGBoost performance for each value of gamma")
plt.xlabel('gamma')
plt.ylabel('MSE')

# Upload predictions to Kaggle

In [None]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 300000 entries, 34894 to 121958
Columns: 616 entries, Year to Model_xD5dr
dtypes: int32(616)
memory usage: 707.2 MB


In [None]:
X_train.columns

Index(['Year', 'Mileage', 'State_ AK', 'State_ AL', 'State_ AR', 'State_ AZ',
       'State_ CA', 'State_ CO', 'State_ CT', 'State_ DC',
       ...
       'Model_Yaris4dr', 'Model_YarisBase', 'Model_YarisLE', 'Model_Yukon',
       'Model_Yukon2WD', 'Model_Yukon4WD', 'Model_Yukon4dr', 'Model_tC2dr',
       'Model_xB5dr', 'Model_xD5dr'],
      dtype='object', length=616)

In [None]:
# Encode the categorical variables in df_test
df_test_ = pd.get_dummies(df_test, columns=categorical_columns).astype(int)

# Align the columns of df_test with those of X_train (missing columns are filled with 0)
df_test_aligned, _ = df_test_.align(X_train, axis=1, fill_value=0)

column_order = X_train.columns.tolist()

# Reorder the columns of df_test_aligned to match the order in X_train
df_test_ordenado = df_test_aligned[column_order]
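
The same alignment can also be sketched in a single step with `DataFrame.reindex`, shown here on hypothetical toy columns rather than the real data: it keeps exactly the training columns in training order, fills columns missing from the test set with 0, and drops dummy columns for categories that never appeared in training.

```python
import pandas as pd

# Hypothetical column sets, for illustration only
train_columns = ["Year", "Mileage", "Make_Ford", "Make_Toyota"]
toy_test = pd.DataFrame({
    "Year": [2015, 2018],
    "Mileage": [40000, 12000],
    "Make_Toyota": [1, 0],
    "Make_Tesla": [0, 1],   # category never seen in training
})

# Keep exactly the training columns, in training order; add missing ones as 0
toy_aligned = toy_test.reindex(columns=train_columns, fill_value=0)

print(toy_aligned)
```

Whichever approach is used, the model expects the prediction frame to have the same columns, in the same order, as the frame it was fitted on.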

In [None]:
df_test_ordenado.columns

Index(['Year', 'Mileage', 'State_ AK', 'State_ AL', 'State_ AR', 'State_ AZ',
       'State_ CA', 'State_ CO', 'State_ CT', 'State_ DC',
       ...
       'Model_Yaris4dr', 'Model_YarisBase', 'Model_YarisLE', 'Model_Yukon',
       'Model_Yukon2WD', 'Model_Yukon4WD', 'Model_Yukon4dr', 'Model_tC2dr',
       'Model_xB5dr', 'Model_xD5dr'],
      dtype='object', length=616)

In [None]:
# The model submitted to Kaggle was:
xgb_calibrado_3 = XGBRegressor(
    booster='gbtree',
    reg_lambda=0.30212357583678445,
    alpha=0.18211410327397226,
    subsample=0.3521449713976327,
    colsample_bytree=0.44811953613366806,
    max_depth=9,
    min_child_weight=9,
    eta=0.8918833748094722,
    gamma=0.00034064735565798957,
    grow_policy='lossguide',
    random_state=1  # fixed random seed for reproducibility
)
xgb_calibrado_3.fit(X_train, y_train)
y_pred = xgb_calibrado_3.predict(X_test)
mse_xgb_calibrado_3 = mean_squared_error(y_test, y_pred)
mae_xgb_calibrado_3 = mean_absolute_error(y_test, y_pred)
rmse_xgb_calibrado_3 = np.sqrt(mse_xgb_calibrado_3)
print("The XGB model tuned with Optuna has an RMSE of " + str(rmse_xgb_calibrado_3) + " and an MAE of " + str(mae_xgb_calibrado_3))

The XGB model tuned with Optuna has an RMSE of 3767.236427865088 and an MAE of 2456.755887738342


In [None]:
# df_test_ordenado is used for prediction
y_pred = xgb_calibrado_3.predict(df_test_ordenado)

predictions_df = pd.DataFrame(y_pred, index=df_test_ordenado.index, columns=['Price'])

predictions_df.to_csv('predicciones.csv')
