## Rusty Bargain: Predicting the Market Value of Used Cars

Rusty Bargain, a used-car sales service, is developing an app to attract new customers. The app lets users quickly find out the market value of their car. The available data is historical: technical specifications, trim versions, and prices. The task is to build a model that determines the market value.

Rusty Bargain is interested in:
- the quality of the predictions;
- the speed of the predictions;
- the time required for training.
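
The three criteria above can all be measured with one small helper. The sketch below is an assumption about how such a benchmark might look, not code from this notebook: `benchmark` and `MeanModel` are hypothetical names, and `MeanModel` is only a toy stand-in for any estimator that exposes `fit` and `predict`.

```python
import math
import time


class MeanModel:
    """Toy stand-in estimator (hypothetical): always predicts the training mean."""

    def fit(self, features, target):
        self.mean_ = sum(target) / len(target)
        return self

    def predict(self, features):
        return [self.mean_ for _ in features]


def benchmark(model, train_x, train_y, valid_x, valid_y):
    """Return training time, prediction time (seconds) and validation RMSE."""
    start = time.perf_counter()
    model.fit(train_x, train_y)
    train_seconds = time.perf_counter() - start

    start = time.perf_counter()
    predictions = model.predict(valid_x)
    predict_seconds = time.perf_counter() - start

    squared_errors = [(y - p) ** 2 for y, p in zip(valid_y, predictions)]
    rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
    return {"train_s": train_seconds, "predict_s": predict_seconds, "rmse": rmse}


result = benchmark(MeanModel(), [[1], [2]], [100, 300], [[3], [4]], [150, 250])
print(result)
```

`time.perf_counter()` is used here instead of `time.time()` because it is a monotonic clock intended for measuring short durations.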

## Data preparation

In [2]:
# import libraries

import math
import time

import lightgbm as lgb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xgboost as xgb
from catboost import CatBoostRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


In [3]:
# load the dataset

data = pd.read_csv('/Users/luistorres/Downloads/car_data.csv')

# preview the data
data.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

In [5]:
# drop columns that are not relevant for predicting the price
clean_data = data.drop(['DateCrawled', 'RegistrationMonth', 'DateCreated', 'NumberOfPictures', 'PostalCode', 'LastSeen'], axis=1)

In [6]:
# check for anomalous values
clean_data['RegistrationYear'].value_counts().sort_index()

RegistrationYear
1000    37
1001     1
1039     1
1111     3
1200     1
        ..
9000     3
9229     1
9450     1
9996     1
9999    26
Name: count, Length: 151, dtype: int64

In [7]:
# keep only registration years after 1900 and before the current year (hard-coded as 2025)
clean_data = clean_data.query('RegistrationYear > 1900 and RegistrationYear < 2025').reset_index(drop=True)
clean_data

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,FuelType,Brand,NotRepaired
0,480,,1993,manual,0,golf,150000,petrol,volkswagen,
1,18300,coupe,2011,manual,190,,125000,gasoline,audi,yes
2,9800,suv,2004,auto,163,grand,125000,gasoline,jeep,
3,1500,small,2001,manual,75,golf,150000,petrol,volkswagen,no
4,3600,small,2008,manual,69,fabia,90000,gasoline,skoda,no
...,...,...,...,...,...,...,...,...,...,...
354193,0,,2005,manual,0,colt,150000,petrol,mitsubishi,yes
354194,2200,,2005,,0,,20000,,sonstige_autos,
354195,1199,convertible,2000,auto,101,fortwo,125000,petrol,smart,no
354196,9200,bus,1996,manual,102,transporter,150000,gasoline,volkswagen,no
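
The upper bound of 2025 in the filter above is hard-coded. Since a car cannot have been registered after its listing was crawled, one alternative is to derive the bound from `DateCrawled`. The sketch below is an assumption, not what the notebook does: it uses a toy frame and assumes the day/month/year format seen in the preview.

```python
import pandas as pd

# Toy rows; the date format mirrors the preview of the dataset.
toy = pd.DataFrame({
    "DateCrawled": ["24/03/2016 11:52", "14/03/2016 12:52", "31/03/2016 17:25"],
    "RegistrationYear": [1993, 9999, 2018],
})

crawled = pd.to_datetime(toy["DateCrawled"], format="%d/%m/%Y %H:%M")
latest_year = crawled.dt.year.max()  # most recent crawl year in the data

# Keep years after 1900 that do not exceed the latest crawl year.
year_ok = (toy["RegistrationYear"] > 1900) & (toy["RegistrationYear"] <= latest_year)
valid_years = toy[year_ok].reset_index(drop=True)
print(valid_years["RegistrationYear"].tolist())
```

In the real notebook this would have to run before `DateCrawled` is dropped.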


In [8]:
clean_data.describe()

Unnamed: 0,Price,RegistrationYear,Power,Mileage
count,354198.0,354198.0,354198.0,354198.0
mean,4417.651314,2003.084789,110.078242,128267.607383
std,4514.081022,7.536418,189.536766,37823.538557
min,0.0,1910.0,0.0,5000.0
25%,1050.0,1999.0,69.0,125000.0
50%,2700.0,2003.0,105.0,150000.0
75%,6400.0,2008.0,143.0,150000.0
max,20000.0,2019.0,20000.0,150000.0
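
The summary above shows a minimum `Price` of 0 and a maximum `Power` of 20000, values that are unlikely to describe real listings. A minimal sketch of an additional plausibility filter follows. The bounds `min_price` and `max_power` are assumptions chosen for illustration and are not used anywhere in this notebook.

```python
import pandas as pd

# Toy rows mimicking two of the columns above; values are illustrative only.
sample = pd.DataFrame({
    "Price": [0, 1500, 9800, 3600],
    "Power": [75, 20000, 163, 69],
})

min_price = 100   # assumed lower bound for a plausible price
max_power = 1000  # assumed upper bound for plausible engine power

mask = (
    (sample["Price"] >= min_price)
    & (sample["Power"] > 0)
    & (sample["Power"] <= max_power)
)
plausible = sample[mask].reset_index(drop=True)
print(plausible)
```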


In [9]:
# keep only cars whose Power is greater than 0
clean_data = clean_data.query('Power > 0').reset_index(drop=True)

In [10]:
# check for duplicates
clean_data.duplicated().sum()

40790

In [11]:
# drop duplicates
clean_data.drop_duplicates(inplace=True)

# count missing values per column
clean_data.isna().sum()

Price                   0
VehicleType         21502
RegistrationYear        0
Gearbox              6237
Power                   0
Model               12829
Mileage                 0
FuelType            20414
Brand                   0
NotRepaired         45884
dtype: int64

In [12]:
# replace missing values with the placeholder 'desconocido' ("unknown")
clean_data.fillna('desconocido', inplace=True)

# check the cleaned data
clean_data

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,FuelType,Brand,NotRepaired
0,18300,coupe,2011,manual,190,desconocido,125000,gasoline,audi,yes
1,9800,suv,2004,auto,163,grand,125000,gasoline,jeep,desconocido
2,1500,small,2001,manual,75,golf,150000,petrol,volkswagen,no
3,3600,small,2008,manual,69,fabia,90000,gasoline,skoda,no
4,650,sedan,1995,manual,102,3er,150000,petrol,bmw,yes
...,...,...,...,...,...,...,...,...,...,...
314095,5250,desconocido,2016,auto,150,159,150000,desconocido,alfa_romeo,no
314096,3200,sedan,2004,manual,225,leon,150000,petrol,seat,yes
314097,1199,convertible,2000,auto,101,fortwo,125000,petrol,smart,no
314098,9200,bus,1996,manual,102,transporter,150000,gasoline,volkswagen,no
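
All missing values reported above sit in categorical columns, so a single placeholder string works. A slightly more defensive variant, sketched here under assumptions with a toy frame and an explicit column list, fills only the categorical columns so that numeric columns would stay numeric if they ever contained gaps.

```python
import pandas as pd

toy = pd.DataFrame({
    "Price": [480, 18300, 9800],
    "VehicleType": [None, "coupe", "suv"],
    "NotRepaired": [None, "yes", None],
})

# Explicit list of categorical columns (illustrative subset).
categorical_cols = ["VehicleType", "NotRepaired"]
toy[categorical_cols] = toy[categorical_cols].fillna("desconocido")

print(toy)
```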


## Model training

In [13]:
# define the categorical variables

variables_categoricas = ['VehicleType', 'Gearbox', 'Model', 'FuelType', 'Brand', 'NotRepaired']

# loop to display each categorical variable
for variable_categorica in variables_categoricas:
    print(clean_data[variable_categorica])
    print()

0               coupe
1                 suv
2               small
3               small
4               sedan
             ...     
314095    desconocido
314096          sedan
314097    convertible
314098            bus
314099          wagon
Name: VehicleType, Length: 273310, dtype: object

0         manual
1           auto
2         manual
3         manual
4         manual
           ...  
314095      auto
314096    manual
314097      auto
314098    manual
314099    manual
Name: Gearbox, Length: 273310, dtype: object

0         desconocido
1               grand
2                golf
3               fabia
4                 3er
             ...     
314095            159
314096           leon
314097         fortwo
314098    transporter
314099           golf
Name: Model, Length: 273310, dtype: object

0            gasoline
1            gasoline
2              petrol
3            gasoline
4              petrol
             ...     
314095    desconocido
314096         petrol
314097       

In [14]:
# one-hot encode the categorical variables

clean_oh = pd.get_dummies(clean_data, drop_first=True)

clean_oh

Unnamed: 0,Price,RegistrationYear,Power,Mileage,VehicleType_convertible,VehicleType_coupe,VehicleType_desconocido,VehicleType_other,VehicleType_sedan,VehicleType_small,...,Brand_smart,Brand_sonstige_autos,Brand_subaru,Brand_suzuki,Brand_toyota,Brand_trabant,Brand_volkswagen,Brand_volvo,NotRepaired_no,NotRepaired_yes
0,18300,2011,190,125000,False,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
1,9800,2004,163,125000,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,1500,2001,75,150000,False,False,False,False,False,True,...,False,False,False,False,False,False,True,False,True,False
3,3600,2008,69,90000,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,True,False
4,650,1995,102,150000,False,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
314095,5250,2016,150,150000,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,True,False
314096,3200,2004,225,150000,False,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,True
314097,1199,2000,101,125000,True,False,False,False,False,False,...,True,False,False,False,False,False,False,False,True,False
314098,9200,1996,102,150000,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,True,False


In [15]:
# create the training, validation and test sets

train_valid_oh, test_oh = train_test_split(clean_oh, test_size=0.2, random_state=12345)
train_oh, valid_oh = train_test_split(train_valid_oh, test_size=0.2, random_state=12345)

# training features and target

train_oh_features = train_oh.drop('Price', axis=1)
train_oh_target = train_oh['Price']

# validation features and target

valid_oh_features = valid_oh.drop('Price', axis=1)
valid_oh_target = valid_oh['Price']

# test features and target

test_oh_features = test_oh.drop('Price', axis=1)
test_oh_target = test_oh['Price']
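
The two consecutive 80/20 splits leave roughly 64% of the rows for training, 16% for validation and 20% for testing. The arithmetic can be checked with a short sketch, assuming that scikit-learn rounds the held-out part up (ceiling) and taking the 273310 rows reported for `clean_data` above.

```python
import math

n_rows = 273310      # rows in clean_data after cleaning (from the output above)
test_percent = 20    # corresponds to test_size=0.2

n_test = math.ceil(n_rows * test_percent / 100)          # held-out test rows
n_train_valid = n_rows - n_test
n_valid = math.ceil(n_train_valid * test_percent / 100)  # validation rows
n_train = n_train_valid - n_valid

print(n_train, n_valid, n_test)
```

The resulting training size, 174918, matches the number of training points reported in the LightGBM log further down.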

In [16]:
# helper for the evaluation metric: root mean squared error (RMSE)
def rsme(real, prediccion):
    return math.sqrt(mean_squared_error(real, prediccion))
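
For reference, the metric computed by `rsme` is the root mean squared error. In the notation below (assumed here, not taken from the notebook), $y_i$ is the actual price, $\hat{y}_i$ the predicted price, and $n$ the number of listings:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
```

Because the errors are squared before averaging, RMSE is expressed in the same units as `Price` and penalizes large misses more heavily than small ones.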

In [49]:
# train the linear regression model

modelo_rl = LinearRegression()

start_time_rl = time.time()  # start timer

modelo_rl.fit(train_oh_features, train_oh_target)

end_time_rl = time.time()  # stop timer

# compute total time
training_time_rl = end_time_rl - start_time_rl
print(f"Elapsed time: {training_time_rl:.2f} seconds")


Elapsed time: 0.80 seconds


In [50]:
# generate predictions

pred_time_rl = time.time()  # start timer

pred_train = modelo_rl.predict(train_oh_features)
pred_valid = modelo_rl.predict(valid_oh_features)
pred_test = modelo_rl.predict(test_oh_features)

pred_time_rl_end = time.time()  # stop timer

# compute total time
total_time_rl_pred = pred_time_rl_end - pred_time_rl
print(f"Elapsed time: {total_time_rl_pred:.2f} seconds")


Elapsed time: 0.29 seconds


In [19]:
print('RMSE train: ', rsme(train_oh_target, pred_train))
print('RMSE validation: ', rsme(valid_oh_target, pred_valid))
print('RMSE test: ', rsme(test_oh_target, pred_test))

RMSE train:  2878.2674145060137
RMSE validation:  2922.8635002450524
RMSE test:  2901.430116956347


In [52]:
# train a random forest model

modelo_rf = RandomForestRegressor(n_estimators=100, max_depth=None, verbose=True, n_jobs=8)

start_time_rf = time.time()  # start timer

modelo_rf.fit(train_oh_features, train_oh_target)

end_time_rf = time.time()  # stop timer

# compute total time
total_time_rf = end_time_rf - start_time_rf
print(f"Elapsed time: {total_time_rf:.2f} seconds")

[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    6.3s


Elapsed time: 16.58 seconds


[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:   16.5s finished


In [53]:
# generate predictions

start_pred_rf = time.time()  # start timer

pred_train_rf = modelo_rf.predict(train_oh_features)
pred_valid_rf = modelo_rf.predict(valid_oh_features)
pred_test_rf = modelo_rf.predict(test_oh_features)

end_pred_rf = time.time()  # stop timer

# compute total time
total_pred_rf = end_pred_rf - start_pred_rf
print(f"Elapsed time: {total_pred_rf:.2f} seconds")

[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.3s
[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:    0.9s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.1s
[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:    0.2s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.1s


Elapsed time: 1.52 seconds


[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:    0.3s finished


In [22]:
print('RMSE train: ', rsme(train_oh_target, pred_train_rf))
print('RMSE validation: ', rsme(valid_oh_target, pred_valid_rf))
print('RMSE test: ', rsme(test_oh_target, pred_test_rf))

RMSE train:  995.6804235307849
RMSE validation:  1833.7725518057646
RMSE test:  1821.6217264773675
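
The forest's training error is far below its validation error, unlike linear regression. A quick way to quantify that gap is the ratio of validation to training RMSE, sketched below with the values copied from the outputs above. The ratio is only an informal indicator chosen for this illustration, not a metric used elsewhere in the notebook.

```python
# Reported RMSE values copied from the outputs above: (train, validation).
reported = {
    "linear regression": (2878.2674145060137, 2922.8635002450524),
    "random forest": (995.6804235307849, 1833.7725518057646),
}

# Ratio close to 1 means similar error on seen and unseen data.
ratios = {name: valid / train for name, (train, valid) in reported.items()}

for name, ratio in ratios.items():
    print(f"{name}: validation/train RMSE ratio = {ratio:.2f}")
```

A ratio well above 1 for the forest is consistent with overfitting: with `max_depth=None` the trees are allowed to grow very deep, so limiting the depth or the minimum leaf size would be a reasonable next experiment.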


In [54]:
# train the CatBoost model

model_cb = CatBoostRegressor(
    iterations=1200,
    learning_rate=0.12
)

start_time_cb = time.time()  # start timer

model_cb.fit(train_oh_features, train_oh_target)

end_time_cb = time.time()  # stop timer

# compute total time
total_train_cb = end_time_cb - start_time_cb
print(f"Elapsed time: {total_train_cb:.2f} seconds")

0:	learn: 4293.8039727	total: 4.61ms	remaining: 5.53s
1:	learn: 3987.3768197	total: 8.26ms	remaining: 4.95s
2:	learn: 3731.0519145	total: 11.8ms	remaining: 4.69s
3:	learn: 3511.7081818	total: 15.1ms	remaining: 4.5s
4:	learn: 3330.6516185	total: 18.4ms	remaining: 4.4s
5:	learn: 3172.9374621	total: 21.6ms	remaining: 4.31s
6:	learn: 3030.2351341	total: 24.9ms	remaining: 4.24s
7:	learn: 2918.9287916	total: 28.1ms	remaining: 4.18s
8:	learn: 2817.8291461	total: 31.3ms	remaining: 4.14s
9:	learn: 2730.3064180	total: 34.6ms	remaining: 4.12s
10:	learn: 2656.0425353	total: 37.9ms	remaining: 4.1s
11:	learn: 2586.6885685	total: 41.1ms	remaining: 4.07s
12:	learn: 2529.1424081	total: 44.4ms	remaining: 4.05s
13:	learn: 2477.6158808	total: 47.4ms	remaining: 4.01s
14:	learn: 2436.5518702	total: 50.6ms	remaining: 3.99s
15:	learn: 2398.6046845	total: 53.7ms	remaining: 3.97s
16:	learn: 2361.3059894	total: 56.6ms	remaining: 3.94s
17:	learn: 2330.2527794	total: 59.8ms	remaining: 3.92s
18:	learn: 2303.9219421

In [55]:
# generate predictions

start_pred_cb = time.time()  # start timer

pred_train_cb = model_cb.predict(train_oh_features)
pred_valid_cb = model_cb.predict(valid_oh_features)
pred_test_cb = model_cb.predict(test_oh_features)

end_pred_cb = time.time()  # stop timer
# compute total time
total_pred_cb = end_pred_cb - start_pred_cb
print(f"Elapsed time: {total_pred_cb:.2f} seconds")

Elapsed time: 0.07 seconds


In [25]:
print('RMSE train: ', rsme(train_oh_target, pred_train_cb))
print('RMSE validation: ', rsme(valid_oh_target, pred_valid_cb))
print('RMSE test: ', rsme(test_oh_target, pred_test_cb))

RMSE train:  1614.4519749156932
RMSE validation:  1766.2747747540286
RMSE test:  1742.900596337515


In [56]:
# define the hyperparameters of the XGBoost model
xgb_model = xgb.XGBRegressor(
    objective='reg:squarederror',  # regression objective
    n_estimators=2000,
    learning_rate=0.12,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=12345,
    early_stopping_rounds=50,
    eval_metric='rmse'
)

# train the model, evaluating on the validation set

start_train_xgb = time.time()  # start timer

xgb_model.fit(
    train_oh_features, train_oh_target,
    eval_set=[(valid_oh_features, valid_oh_target)],
    verbose=100
)

end_train_xgb = time.time()  # stop timer

# compute total time
total_train_xgb = end_train_xgb - start_train_xgb
print(f"Elapsed time: {total_train_xgb:.2f} seconds")

[0]	validation_0-rmse:4241.68769
[100]	validation_0-rmse:1858.92938
[200]	validation_0-rmse:1820.29659
[300]	validation_0-rmse:1799.65442
[400]	validation_0-rmse:1786.87422
[500]	validation_0-rmse:1779.24237
[600]	validation_0-rmse:1772.50220
[700]	validation_0-rmse:1767.52366
[800]	validation_0-rmse:1763.69068
[900]	validation_0-rmse:1759.22815
[1000]	validation_0-rmse:1757.12180
[1100]	validation_0-rmse:1754.71943
[1200]	validation_0-rmse:1754.36247
[1205]	validation_0-rmse:1754.24449
Elapsed time: 14.57 seconds
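
The log ends at round 1205 of the 2000 requested, which is consistent with `early_stopping_rounds=50` having triggered. The idea behind such a rule can be sketched in a few lines. This is a simplified illustration with made-up scores and a hypothetical helper `best_iteration`, not XGBoost's actual implementation.

```python
def best_iteration(validation_scores, patience):
    """Return (best_index, stop_index) for a simple patience-based rule.

    Training stops once `patience` consecutive rounds fail to improve
    on the best (lowest) validation score seen so far.
    """
    best_index = 0
    for index, score in enumerate(validation_scores):
        if score < validation_scores[best_index]:
            best_index = index
        elif index - best_index >= patience:
            return best_index, index
    # No early stop: every round was used.
    return best_index, len(validation_scores) - 1


# Illustrative validation RMSE per round (made-up values).
scores = [4241.7, 1858.9, 1820.3, 1799.7, 1801.2, 1800.5, 1802.0]
best, stopped = best_iteration(scores, patience=3)
print(best, stopped)
```

With a patience of 3, the sketch keeps round 3 as the best one and stops at round 6, after three rounds without improvement.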


In [57]:
# generate predictions

start_pred_xgb = time.time()  # start timer

pred_train_xb = xgb_model.predict(train_oh_features)
pred_valid_xb = xgb_model.predict(valid_oh_features)
pred_test_xb = xgb_model.predict(test_oh_features)

end_pred_xgb = time.time()  # stop timer

# compute total time
total_pred_xgb = end_pred_xgb - start_pred_xgb
print(f"Elapsed time: {total_pred_xgb:.2f} seconds")

Elapsed time: 0.40 seconds


In [28]:
print('RMSE train: ', rsme(train_oh_target, pred_train_xb))
print('RMSE validation: ', rsme(valid_oh_target, pred_valid_xb))
print('RMSE test: ', rsme(test_oh_target, pred_test_xb))

RMSE train:  1454.9304107069863
RMSE validation:  1754.018030123978
RMSE test:  1730.8454581504382


In [29]:
# build the training, validation and test sets without one-hot encoding
# (no random_state is set, so this split differs from the one-hot split above)
train_valid, test = train_test_split(clean_data, test_size=0.20)
train, valid = train_test_split(train_valid, test_size=0.20)

# features without one-hot encoding

train_features = train.drop(['Price'], axis=1)
valid_features = valid.drop(['Price'], axis=1)
test_features = test.drop(['Price'], axis=1)

# targets without one-hot encoding

train_target = train['Price']
valid_target = valid['Price']
test_target = test['Price']


In [58]:
# train the CatBoost model without one-hot encoding

model_cb_2 = CatBoostRegressor(
    iterations=3000,
    learning_rate=0.13,
    cat_features=variables_categoricas
)
start_train_cb2 = time.time()  # start timer

model_cb_2.fit(train_features, train_target, eval_set=(valid_features, valid_target))

end_train_cb2 = time.time()  # stop timer

# compute total time
total_train_cb2 = end_train_cb2 - start_train_cb2
print(f"Elapsed time: {total_train_cb2:.2f} seconds")

0:	learn: 4258.4139300	test: 4264.0797510	best: 4264.0797510 (0)	total: 18ms	remaining: 53.9s
1:	learn: 3935.8620000	test: 3941.3174017	best: 3941.3174017 (1)	total: 36.5ms	remaining: 54.7s
2:	learn: 3675.3116825	test: 3680.1609230	best: 3680.1609230 (2)	total: 52.5ms	remaining: 52.5s
3:	learn: 3451.2496397	test: 3454.7123462	best: 3454.7123462 (3)	total: 69.5ms	remaining: 52.1s
4:	learn: 3257.3159477	test: 3258.9347238	best: 3258.9347238 (4)	total: 82.2ms	remaining: 49.2s
5:	learn: 3080.5249084	test: 3083.1319770	best: 3083.1319770 (5)	total: 94.1ms	remaining: 47s
6:	learn: 2926.2519299	test: 2929.8389775	best: 2929.8389775 (6)	total: 110ms	remaining: 47s
7:	learn: 2805.4487513	test: 2809.1966264	best: 2809.1966264 (7)	total: 124ms	remaining: 46.4s
8:	learn: 2706.3791930	test: 2709.1836108	best: 2709.1836108 (8)	total: 139ms	remaining: 46.2s
9:	learn: 2624.5413675	test: 2627.7066902	best: 2627.7066902 (9)	total: 156ms	remaining: 46.6s
10:	learn: 2537.4145026	test: 2540.9850018	best: 2

In [59]:
# generate predictions without one-hot encoding

start_pred_cb2 = time.time()  # start timer

pred_train_cb_2 = model_cb_2.predict(train_features)
pred_valid_cb_2 = model_cb_2.predict(valid_features)
pred_test_cb_2 = model_cb_2.predict(test_features)

end_pred_cb2 = time.time()  # stop timer

# compute total time
total_pred_cb2 = end_pred_cb2 - start_pred_cb2
print(f"Elapsed time: {total_pred_cb2:.2f} seconds")

Elapsed time: 0.38 seconds


In [32]:
print('RMSE train: ', rsme(train_target, pred_train_cb_2))
print('RMSE validation: ', rsme(valid_target, pred_valid_cb_2))
print('RMSE test: ', rsme(test_target, pred_test_cb_2))

RMSE train:  1569.8115894949171
RMSE validation:  1726.5336971256236
RMSE test:  1722.684440594792


In [60]:
# copy the feature sets so the original categorical columns are preserved
X_train_lgbm = train_features.copy()
X_valid_lgbm = valid_features.copy()
X_test_lgbm = test_features.copy()

# convert categorical variables to the 'category' dtype for LightGBM
for col in variables_categoricas:
    X_train_lgbm[col] = X_train_lgbm[col].astype('category')
    X_valid_lgbm[col] = X_valid_lgbm[col].astype('category')
    X_test_lgbm[col] = X_test_lgbm[col].astype('category')  # the test copy needs the same dtype

# train the LightGBM model

lgbm_model = lgb.LGBMRegressor(
    objective='regression',
    n_estimators=5000,
    learning_rate=0.1,
    num_leaves=31,
    max_depth=-1,
    min_child_samples=20,
    subsample=0.8,
    reg_alpha=0.1,
    reg_lambda=0.1,
    random_state=42
)

start_train_lgb = time.time()  # start timer

lgbm_model.fit(X_train_lgbm, train_target, eval_set=[(X_valid_lgbm, valid_target)], eval_metric='rmse')

end_train_lgb = time.time()  # stop timer

# compute total time
total_train_lgb = end_train_lgb - start_train_lgb
print(f"Elapsed time: {total_train_lgb:.2f} seconds")



[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001438 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 656
[LightGBM] [Info] Number of data points in the train set: 174918, number of used features: 9
[LightGBM] [Info] Start training from score 4780.866377
Elapsed time: 35.24 seconds
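
Casting each split with `astype('category')` independently lets every frame infer its own set of categories. One way to make the encoding explicit is to derive the categories from the training split and reuse them for the other splits. The sketch below uses toy data and illustrative names, and is an assumption about how this could be done rather than part of the notebook.

```python
import pandas as pd

train = pd.DataFrame({"Brand": ["audi", "bmw", "audi"]})
valid = pd.DataFrame({"Brand": ["bmw", "volvo"]})  # "volvo" never appears in training

# Build the category set from the training data only, then reuse it everywhere.
known = sorted(train["Brand"].unique())
brand_dtype = pd.CategoricalDtype(categories=known)

train["Brand"] = train["Brand"].astype(brand_dtype)
# Values unseen in training are turned into missing values before the cast.
valid["Brand"] = valid["Brand"].where(valid["Brand"].isin(known)).astype(brand_dtype)

print(valid["Brand"].cat.codes.tolist())
```

Unseen values end up as missing (code `-1`), so the same integer code always refers to the same brand in every split.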


In [61]:
# evaluate the model on the validation set

start_pred_lgb = time.time()  # start timer

y_pred_lgbm = lgbm_model.predict(X_valid_lgbm)

end_pred_lgb = time.time()  # stop timer

# compute total time
total_pred_lgb = end_pred_lgb - start_pred_lgb
print(f"Elapsed time: {total_pred_lgb:.2f} seconds")

rmse_lgbm = math.sqrt(mean_squared_error(valid_target, y_pred_lgbm))
print(f"LightGBM RMSE: {rmse_lgbm}")

Elapsed time: 0.53 seconds
LightGBM RMSE: 1713.072361243481


## Model analysis

Analysis of the results (validation RMSE):

- LightGBM achieved the lowest validation RMSE (about 1713), so it predicted car prices most accurately among the models tried. It was evaluated only on the validation set, so no test RMSE is available for it.
- CatBoost with native categorical features came second (about 1727), followed by XGBoost (about 1754) and CatBoost on one-hot features (about 1766). This suggests that boosted tree ensembles are well suited to this kind of problem.
- RandomForestRegressor performed moderately (about 1834): better than linear regression, but behind the boosting models. Its training RMSE (about 996) is far below its validation RMSE, which points to overfitting.
- Linear regression performed worst (about 2923), which was expected given the many categorical variables and the non-linear relationships in the data.

The one-hot models and the native-categorical models were trained on different random splits, so small differences between the two groups should be read with caution.
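
The ranking can be reproduced directly from the reported numbers. The short sketch below copies the validation RMSE values from the cell outputs above and sorts them. The labels are descriptive names chosen for this illustration.

```python
# Validation RMSE values copied from the cell outputs above.
validation_rmse = {
    "LinearRegression (one-hot)": 2922.8635002450524,
    "RandomForestRegressor (one-hot)": 1833.7725518057646,
    "CatBoost (one-hot)": 1766.2747747540286,
    "XGBoost (one-hot)": 1754.018030123978,
    "CatBoost (native categorical)": 1726.5336971256236,
    "LightGBM (native categorical)": 1713.072361243481,
}

# Lower RMSE is better, so sort in ascending order.
ranking = sorted(validation_rmse.items(), key=lambda item: item[1])
for position, (model, rmse) in enumerate(ranking, start=1):
    print(f"{position}. {model}: {rmse:.1f}")
```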

Training time generally grew with model complexity, although it also depends on the amount of data and the number of iterations configured for each model. Linear regression trained in 0.80 seconds, the random forest in 16.58 seconds (using 8 parallel workers), XGBoost in 14.57 seconds, and LightGBM in 35.24 seconds. The CatBoost elapsed times are not visible in the truncated logs above. Prediction was fast for every model: the timed calls ranged from 0.07 seconds (CatBoost on one-hot features) to 1.52 seconds (random forest).

One takeaway is that simple models are cheap to train but can carry a larger margin of error. Gradient boosting models deliver more accurate results, typically at the cost of longer training.