Rusty Bargain, a used-car sales service, is developing an app to attract new customers. The app lets you quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. The task is to build a model that determines the market value.
Rusty Bargain is interested in:
- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

## Data preparation

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
import lightgbm as lgb
import xgboost as xgb 
import time
import gc

In [2]:
%%time
df = pd.read_csv('/datasets/car_data.csv')
df.info()
print(df.head(10))
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(9)

| | Price | RegistrationYear | Power | Mileage | RegistrationMonth | NumberOfPictures | PostalCode |
|---|---|---|---|---|---|---|---|
| count | 354369.0 | 354369.0 | 354369.0 | 354369.0 | 354369.0 | 354369.0 | 354369.0 |
| mean | 4416.656776 | 2004.234448 | 110.094337 | 128211.172535 | 5.714645 | 0.0 | 50508.689087 |
| std | 4514.158514 | 90.227958 | 189.850405 | 37905.34153 | 3.726421 | 0.0 | 25783.096248 |
| min | 0.0 | 1000.0 | 0.0 | 5000.0 | 0.0 | 0.0 | 1067.0 |
| 25% | 1050.0 | 1999.0 | 69.0 | 125000.0 | 3.0 | 0.0 | 30165.0 |
| 50% | 2700.0 | 2003.0 | 105.0 | 150000.0 | 6.0 | 0.0 | 49413.0 |
| 75% | 6400.0 | 2008.0 | 143.0 | 150000.0 | 9.0 | 0.0 | 71083.0 |
| max | 20000.0 | 9999.0 | 20000.0 | 150000.0 | 12.0 | 0.0 | 99998.0 |


In [3]:

def preprocess_data(df):
    # Work on a copy so the caller's DataFrame is not modified.
    df = df.copy()

    # Convert the date columns to datetime.
    date_cols = ["DateCrawled", "DateCreated", "LastSeen"]
    for col in date_cols:
        df[col] = pd.to_datetime(df[col], format="%d/%m/%Y %H:%M", errors="coerce")

    # Keep only records with a price greater than 0.
    df = df[df["Price"] > 0]

    # Remove extreme price outliers (outside the 1st to 99th percentile range).
    lower_price = df["Price"].quantile(0.01)
    upper_price = df["Price"].quantile(0.99)
    # .copy() avoids chained-assignment warnings in the steps below.
    df = df[(df["Price"] >= lower_price) & (df["Price"] <= upper_price)].copy()

    # Fill missing values in the categorical columns.
    cat_cols = ["VehicleType", "Gearbox", "FuelType", "Model", "NotRepaired", "Brand"]
    df[cat_cols] = df[cat_cols].fillna("missing")

    # Drop the date columns; no features are extracted from them in this notebook.
    df = df.drop(columns=date_cols)

    # Encode the categorical variables with one-hot encoding.
    df = pd.get_dummies(df, columns=cat_cols, drop_first=True)

    # Drop PostalCode and NumberOfPictures (the latter is 0 in every row).
    df = df.drop(columns=["PostalCode", "NumberOfPictures"])

    return df

In [4]:
df_procesado = preprocess_data(df)
df_procesado.info()
df_procesado.describe()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 337629 entries, 0 to 354368
Columns: 313 entries, Price to Brand_volvo
dtypes: int64(5), uint8(308)
memory usage: 114.6 MB


| | Price | RegistrationYear | Power | Mileage | RegistrationMonth | VehicleType_convertible | VehicleType_coupe | VehicleType_missing | VehicleType_other | VehicleType_sedan | ... | Brand_seat | Brand_skoda | Brand_smart | Brand_sonstige_autos | Brand_subaru | Brand_suzuki | Brand_toyota | Brand_trabant | Brand_volkswagen | Brand_volvo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 337629.0 | 337629.0 | 337629.0 | 337629.0 | 337629.0 | 337629.0 | 337629.0 | 337629.0 | 337629.0 | 337629.0 | ... | 337629.0 | 337629.0 | 337629.0 | 337629.0 | 337629.0 | 337629.0 | 337629.0 | 337629.0 | 337629.0 | 337629.0 |
| mean | 4437.357493 | 2003.874149 | 110.60592 | 128914.637072 | 5.793931 | 0.057267 | 0.04478 | 0.096304 | 0.00898 | 0.261148 | ... | 0.019664 | 0.015932 | 0.015236 | 0.007724 | 0.002121 | 0.006676 | 0.013302 | 0.001407 | 0.217019 | 0.009108 |
| std | 4278.302527 | 68.450397 | 188.994587 | 36902.743319 | 3.692011 | 0.232352 | 0.206821 | 0.295008 | 0.094338 | 0.439261 | ... | 0.138842 | 0.125211 | 0.122489 | 0.087549 | 0.046002 | 0.081434 | 0.114563 | 0.037482 | 0.412216 | 0.094998 |
| min | 100.0 | 1000.0 | 0.0 | 5000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1200.0 | 1999.0 | 69.0 | 125000.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 2850.0 | 2003.0 | 105.0 | 150000.0 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 6490.0 | 2008.0 | 141.0 | 150000.0 | 9.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 18850.0 | 9999.0 | 20000.0 | 150000.0 | 12.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
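
The summary above still shows `RegistrationYear` values from 1000 to 9999 and `Power` values up to 20000, which the preprocessing does not filter. A minimal sketch of one way to drop such rows follows. The helper name `filter_implausible` and both ranges are assumptions chosen for illustration, not values taken from this notebook, and treating `Power == 0` as invalid is also an assumption.

```python
import pandas as pd


def filter_implausible(df, year_range=(1900, 2020), power_range=(1, 1000)):
    """Keep rows whose RegistrationYear and Power fall inside assumed plausible ranges.

    Both ranges are illustrative assumptions; adjust them to the data at hand.
    """
    year_ok = df["RegistrationYear"].between(*year_range)   # inclusive bounds
    power_ok = df["Power"].between(*power_range)            # inclusive bounds
    return df[year_ok & power_ok].copy()


# Toy data reusing the extreme values visible in the summary table.
toy = pd.DataFrame({
    "Price": [1500, 3000, 800, 5000],
    "RegistrationYear": [1000, 2005, 9999, 2010],
    "Power": [75, 105, 60, 20000],
})
clean = filter_implausible(toy)
print(clean)
```

In the toy frame only the row with year 2005 and power 105 passes both filters. Whether such a filter helps the models in this notebook would need to be checked by rerunning them.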


## Model training

In [5]:
# Split into training and test sets
X = df_procesado.drop("Price", axis=1)
y = df_procesado["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [6]:
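# Note: "RECM" is the Spanish acronym for RMSE, but this function computes the
# mean absolute relative error, mean(|y_true - y_pred| / y_true), not an RMSE.
def recm_metric(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred) / y_true)

# Lower error is better, so the scorer negates the metric for GridSearchCV.
recm_scorer = make_scorer(recm_metric, greater_is_better=False)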
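
For comparison, here is a minimal sketch of a root mean squared error next to the relative error that `recm_metric` computes. The function names `rmse` and `mean_relative_error` are hypothetical and do not appear in the notebook. The results reported below were obtained with `recm_metric`, not with `rmse`.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target (price)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mean_relative_error(y_true, y_pred):
    """Mean of |error| / true value: the same quantity recm_metric returns."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred) / y_true))


# Toy prices: every prediction is off by exactly 10% of the true value.
y_true_toy = [1000, 2000, 4000]
y_pred_toy = [1100, 1800, 4400]
print("RMSE:", rmse(y_true_toy, y_pred_toy))
print("Mean relative error:", mean_relative_error(y_true_toy, y_pred_toy))
```

The two metrics weight errors differently: the relative error penalizes mistakes on cheap cars heavily, while RMSE is dominated by large absolute errors on expensive ones.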

In [7]:
# Linear regression
start_time = time.time()

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
y_pred_lin = lin_reg.predict(X_test)
recm_lin = recm_metric(y_test, y_pred_lin)
print("Linear regression - RECM: {:.4f}".format(recm_lin))

end_time = time.time()
print("Execution time: {:.2f} seconds".format(end_time - start_time))

Linear regression - RECM: 1.2491
Execution time: 10.44 seconds
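
The timing above covers fitting and prediction together, while the project brief lists training time and prediction speed as separate criteria. A minimal sketch of timing the two phases separately is shown below. The helper `time_fit_predict` and the toy class `MeanModel` are hypothetical names introduced for illustration; any object with scikit-learn-style `fit` and `predict` methods could be passed instead.

```python
import time


def time_fit_predict(model, X_train, y_train, X_test):
    """Return (predictions, fit_seconds, predict_seconds) for a fit/predict-style model."""
    t0 = time.perf_counter()
    model.fit(X_train, y_train)
    t1 = time.perf_counter()
    y_pred = model.predict(X_test)
    t2 = time.perf_counter()
    return y_pred, t1 - t0, t2 - t1


class MeanModel:
    """Toy stand-in for an estimator: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_] * len(X)


preds, fit_s, pred_s = time_fit_predict(
    MeanModel(), [[1], [2], [3]], [10, 20, 30], [[4], [5]]
)
print(preds)
print("fit: {:.6f} s, predict: {:.6f} s".format(fit_s, pred_s))
```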


In [8]:

# Training sample
sample_size = 10000  

X_train_sample = X_train.sample(n=sample_size, random_state=1234)
y_train_sample = y_train.loc[X_train_sample.index]

# Confirm the sample size
print("Shape of X_train_sample:", X_train_sample.shape)
print("Shape of y_train_sample:", y_train_sample.shape)

# Random forest
start_time = time.time()

rf = RandomForestRegressor(random_state=42)
param_grid_rf = {
    "n_estimators": [100, 150],
    "max_depth": [10, 20, None]
}

# Fit the grid search on the reduced sample
grid_rf = GridSearchCV(rf, param_grid_rf, cv=3, scoring=recm_scorer, n_jobs=-1)
grid_rf.fit(X_train_sample, y_train_sample)

best_rf = grid_rf.best_estimator_
y_pred_rf = best_rf.predict(X_test)
recm_rf = recm_metric(y_test, y_pred_rf)
print("Random Forest - Best parameters:", grid_rf.best_params_)
print("Random Forest - RECM: {:.4f}".format(recm_rf))

end_time = time.time()
print("Execution time: {:.2f} seconds".format(end_time - start_time))

Shape of X_train_sample: (10000, 312)
Shape of y_train_sample: (10000,)
Random Forest - Best parameters: {'max_depth': None, 'n_estimators': 150}
Random Forest - RECM: 0.5549
Execution time: 85.45 seconds


In [9]:
# Model: LightGBM
start_time = time.time()

lgb_model = lgb.LGBMRegressor(random_state=42)
param_grid_lgb = {
    "num_leaves": [31, 50],
    "n_estimators": [100, 150],
    "learning_rate": [0.1, 0.05]
}

# Fit the grid search on the same reduced sample used for the random forest
grid_lgb = GridSearchCV(lgb_model, param_grid_lgb, cv=3, scoring=recm_scorer, n_jobs=-1)
grid_lgb.fit(X_train_sample, y_train_sample)

best_lgb = grid_lgb.best_estimator_
y_pred_lgb = best_lgb.predict(X_test)
recm_lgb = recm_metric(y_test, y_pred_lgb)
print("LightGBM - Best parameters:", grid_lgb.best_params_)
print("LightGBM - RECM: {:.4f}".format(recm_lgb))

end_time = time.time()
print("Execution time: {:.2f} seconds".format(end_time - start_time))

LightGBM - Best parameters: {'learning_rate': 0.1, 'n_estimators': 150, 'num_leaves': 31}
LightGBM - RECM: 0.5457
Execution time: 14.29 seconds


## Model analysis

Linear regression had a much higher error (RECM 1.2491) than LightGBM (0.5457), so it works as a sanity check and suggests that a linear model does not capture the more complex relationships in the data. Note that RECM here is the value returned by `recm_metric`, which is a mean absolute relative error rather than a true RMSE.

For the random forest and LightGBM, the hyperparameter search was run on a random sample of 10,000 training rows, whereas linear regression was fitted on the full training set.

The random forest shows a significant improvement in accuracy over linear regression: its RECM of 0.5549 is less than half of the linear regression value.

LightGBM reaches a slightly lower RECM (0.5457) than the random forest, which indicates that, in terms of quality, boosting with LightGBM performs marginally better.

The random forest, despite its good accuracy, is considerably slower: 85.45 seconds versus 14.29 seconds for LightGBM. These timings cover the whole cell, that is, the grid search, the refit of the best model, and prediction on the test set, so they do not separate training time from prediction time.

LightGBM therefore not only achieves slightly better accuracy but also does so in considerably less time.
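
To put the two timings in context, the number of model fits each grid search performs can be counted from the grids defined in the cells above. The sketch below uses a hypothetical helper, `count_fits`. It assumes `GridSearchCV` keeps its default `refit=True`, which adds one final fit of the best candidate on the whole sample.

```python
from itertools import product


def count_fits(param_grid, cv, refit=True):
    """Number of fits in an exhaustive grid search: candidates * folds (+1 for the refit)."""
    n_candidates = len(list(product(*param_grid.values())))
    return n_candidates * cv + (1 if refit else 0)


# Grids copied from the random forest and LightGBM cells.
param_grid_rf = {"n_estimators": [100, 150], "max_depth": [10, 20, None]}
param_grid_lgb = {
    "num_leaves": [31, 50],
    "n_estimators": [100, 150],
    "learning_rate": [0.1, 0.05],
}

fits_rf = count_fits(param_grid_rf, cv=3)
fits_lgb = count_fits(param_grid_lgb, cv=3)
print("Random forest fits:", fits_rf)   # 6 candidates * 3 folds + 1 refit = 19
print("LightGBM fits:", fits_lgb)       # 8 candidates * 3 folds + 1 refit = 25
```

The LightGBM search runs more fits (25 versus 19) and still finishes sooner, so the gap in the reported times comes from the cost of each individual fit rather than from the size of the grid.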

In summary, linear regression, although fast (10.44 seconds on the full training set, with no hyperparameter search), serves only as a baseline: its error is by far the highest. The tree-based models achieve a much lower relative error (RECM), and LightGBM in particular is also significantly faster than the random forest tuned with GridSearchCV. If in some experiment a boosting model such as LightGBM produced a worse RECM than linear regression, the preprocessing or the hyperparameter selection should be reviewed, since a boosting model is expected to improve on the baseline.
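
The baseline check described above can be expressed as a small helper. This is only a sketch: the function name `check_against_baseline` and the dictionary keys are hypothetical, while the scores are the RECM values printed by the cells in this notebook.

```python
def check_against_baseline(scores, baseline="linear_regression"):
    """Return the names of models whose error is not lower than the baseline's."""
    base = scores[baseline]
    return [
        name for name, err in scores.items()
        if name != baseline and err >= base
    ]


# RECM values reported in the notebook outputs (lower is better).
scores = {
    "linear_regression": 1.2491,
    "random_forest": 0.5549,
    "lightgbm": 0.5457,
}

suspicious = check_against_baseline(scores)
print("Models that fail the sanity check:", suspicious)
```

With the reported scores the list is empty, meaning both tree-based models beat the baseline. A non-empty list would be the signal to revisit preprocessing or hyperparameters.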