# Rusty Bargain - Value model

The used-car sales service Rusty Bargain is developing an app to attract new customers. With the app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. The task is to build a model that determines the market value.
Rusty Bargain is interested in:
- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

## 0. Initialization

In [9]:
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

In [10]:
# Import LightGBM, CatBoost and XGBoost, installing any that are missing
try:
    import lightgbm as lgb
except ImportError:
    !pip install lightgbm --quiet
    import lightgbm as lgb

try:
    from catboost import CatBoostRegressor
except ImportError:
    !pip install catboost --quiet
    from catboost import CatBoostRegressor

try:
    import xgboost as xgb
except ImportError:
    !pip install xgboost --quiet
    import xgboost as xgb
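
Outside a notebook the `!pip` shell escape is not available, so it can help to check up front which of the optional libraries are importable. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of the original notebook.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Report which of the three boosting libraries would still need installing.
print(missing_packages(["lightgbm", "catboost", "xgboost"]))
```

An empty list means all three libraries are already installed.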

## 1. Data preparation

In [11]:
df = pd.read_csv('car_data.csv')
print(df.head())
print(df.info())
print(df.isnull().sum())

        DateCrawled  Price VehicleType  RegistrationYear Gearbox  Power  \
0  24/03/2016 11:52    480         NaN              1993  manual      0   
1  24/03/2016 10:58  18300       coupe              2011  manual    190   
2  14/03/2016 12:52   9800         suv              2004    auto    163   
3  17/03/2016 16:54   1500       small              2001  manual     75   
4  31/03/2016 17:25   3600       small              2008  manual     69   

   Model  Mileage  RegistrationMonth  FuelType       Brand NotRepaired  \
0   golf   150000                  0    petrol  volkswagen         NaN   
1    NaN   125000                  5  gasoline        audi         yes   
2  grand   125000                  8  gasoline        jeep         NaN   
3   golf   150000                  6    petrol  volkswagen          no   
4  fabia    90000                  7  gasoline       skoda          no   

        DateCreated  NumberOfPictures  PostalCode          LastSeen  
0  24/03/2016 00:00               

In [12]:
# Define features and target
features_col = ['VehicleType', 'RegistrationYear', 'Gearbox', 'Power',
                'Model', 'Mileage', 'RegistrationMonth', 'FuelType', 'Brand', 'NotRepaired']
target_col = 'Price'

features = df[features_col]
target = df[target_col]

In [13]:
# Split into training and test sets
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.2, random_state=12345)

In [14]:
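# Preprocessing: median imputation + scaling for numeric columns,
# constant imputation + one-hot encoding for categorical columns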
num_cols = features.select_dtypes(
    include=['int64', 'float64']).columns.tolist()
cat_cols = features.select_dtypes(
    include=['object', 'category']).columns.tolist()
num_transformer = make_pipeline(
    SimpleImputer(strategy='median'), StandardScaler())
cat_transformer = make_pipeline(SimpleImputer(
    strategy='constant', fill_value='missing'), OneHotEncoder(handle_unknown='ignore', sparse_output=False))
cat_nan_transformer = make_pipeline(SimpleImputer(
    strategy='constant', fill_value='missing'))

preprocessor = ColumnTransformer(
    [('num', num_transformer, num_cols), ('cat', cat_transformer, cat_cols)], remainder='drop')
preprocessor_nan = ColumnTransformer(
    [('num', num_transformer, num_cols), ('cat', cat_nan_transformer, cat_cols)], remainder='drop')
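
To see what the two branches of `preprocessor` do, here is a small sketch on a toy frame. The rows are made up for illustration (only the column names `Power` and `Gearbox` come from the dataset), and the `to_dense` helper is an assumption added so the result can be inspected regardless of whether the transformer returns a sparse or dense matrix.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training rows with one missing value per column (made-up values).
toy_train = pd.DataFrame({
    "Power": [75.0, np.nan, 190.0],
    "Gearbox": pd.Series(["manual", np.nan, "auto"], dtype=object),
})
# A test row whose category never appeared during training.
toy_test = pd.DataFrame({
    "Power": [100.0],
    "Gearbox": pd.Series(["semi-auto"], dtype=object),
})

toy_preprocessor = ColumnTransformer([
    ("num", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), ["Power"]),
    ("cat", make_pipeline(SimpleImputer(strategy="constant", fill_value="missing"),
                          OneHotEncoder(handle_unknown="ignore")), ["Gearbox"]),
])


def to_dense(matrix):
    """Convert a possibly sparse result into a plain NumPy array."""
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)


# Fit on the training rows only, then reuse the fitted statistics on the test row.
train_matrix = to_dense(toy_preprocessor.fit_transform(toy_train))
test_matrix = to_dense(toy_preprocessor.transform(toy_test))

print(train_matrix.shape)  # one scaled numeric column plus one column per category
print(test_matrix)         # the unseen category encodes as all zeros
```

Because the missing `Gearbox` value is replaced by the literal string `'missing'`, it becomes its own one-hot column, while `handle_unknown='ignore'` keeps prediction from failing on categories that appear only in the test set.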

## 2. Model training

In [15]:
# Train a model, time training and prediction, and report the RMSE
def evaluate_model(name, model, features_train, target_train, features_test, target_test, cat_features=None):
    # Train model
    start_train = time.perf_counter()
    if name in ['CatBoost'] and cat_features is not None:
        # CatBoost consumes the raw categorical columns, so only their gaps are filled.
        # Work on copies so the caller's DataFrames are not modified in place.
        features_train = features_train.copy()
        features_test = features_test.copy()
        # cat_features holds column positions, so map them back to column names
        for col in features_train.columns[cat_features]:
            features_train[col] = features_train[col].fillna(
                'missing').astype(str)
            features_test[col] = features_test[col].fillna(
                'missing').astype(str)
        model.fit(features_train, target_train,
                  cat_features=cat_features, verbose=0)
        pipe = model
    else:
        # Every other model goes through the shared preprocessor
        pipe = make_pipeline(preprocessor, model)
        pipe.fit(features_train, target_train)
    train_time = time.perf_counter() - start_train

    # Predict
    start_pred = time.perf_counter()
    pred = pipe.predict(features_test)
    pred_time = time.perf_counter() - start_pred
    pred_time_per_sample = pred_time / features_test.shape[0]

    total_time = train_time + pred_time
    rmse = root_mean_squared_error(target_test, pred)
    print(f"{name}: train={train_time:.3f}s, pred={pred_time:.3f}s ({pred_time_per_sample:.6f}s/sample), total={total_time:.3f}s, RECM(RMSE)={rmse:.2f}")
    return {'model': name, 'rmse': rmse, 'train_time_s': train_time, 'pred_time_per_sample_s': pred_time_per_sample, 'total_time_s': total_time, 'pipe': pipe}
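
The quality metric reported by `evaluate_model` is the root mean squared error: the square root of the average squared difference between the true and predicted prices, expressed in the same units as `Price`. As a minimal sketch of what `root_mean_squared_error` computes, the hand-written `rmse` helper below is illustrative only; the true prices are the first three from the data preview and the predictions are made up.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Errors of -1000, +1000 and 0 give sqrt((1e6 + 1e6 + 0) / 3).
example_rmse = rmse([480, 18300, 9800], [1480, 17300, 9800])
print(round(example_rmse, 2))
```

Because the errors are squared before averaging, a few large misses weigh more heavily on the score than many small ones.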

In [16]:
# Models
models = {
    'LinearRegression': LinearRegression(),
    'DecisionTree': DecisionTreeRegressor(random_state=12345),
    'RandomForest': RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=12345),
    'LightGBM': lgb.LGBMRegressor(n_estimators=200, learning_rate=0.1, n_jobs=-1, verbose=-1, force_row_wise=True, random_state=12345),
    'LightGBM_2': lgb.LGBMRegressor(n_estimators=200, learning_rate=0.05, max_depth=12, n_jobs=-1, verbose=-1, force_row_wise=True, random_state=12345),
    'CatBoost': CatBoostRegressor(iterations=200, learning_rate=0.1, verbose=0, random_seed=12345),
    'XGBoost': xgb.XGBRegressor(n_estimators=200, learning_rate=0.1, n_jobs=-1, random_state=12345)
}

In [17]:
# Identify the indices of the categorical features
cat_features_cols = features_train.select_dtypes(
    include=['object', 'category']).columns.tolist()
cat_features_idx = [features_train.columns.get_loc(
    c) for c in cat_features_cols] if cat_features_cols else None
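
As a quick illustration of the index lookup above, this sketch runs the same two steps on a toy frame. The values are taken from the data preview; the object dtype is set explicitly only so the toy frame behaves the same across pandas versions.

```python
import pandas as pd

toy = pd.DataFrame({
    "VehicleType": pd.Series(["coupe", "suv"], dtype=object),
    "Power": [190, 163],
    "Gearbox": pd.Series(["manual", "auto"], dtype=object),
})

# Select the text columns, then translate their names into column positions.
cat_cols_toy = toy.select_dtypes(include=["object"]).columns.tolist()
cat_idx_toy = [toy.columns.get_loc(c) for c in cat_cols_toy]
print(cat_cols_toy, cat_idx_toy)
```

The positions refer to the column order of the frame passed to `fit`, so they are only valid as long as that order does not change.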

## 3. Model analysis

In [18]:
# Evaluate every model and collect the results
results = []
for name, m in models.items():
    if name == 'CatBoost':
        res = evaluate_model(name, m, features_train, target_train,
                             features_test, target_test, cat_features=cat_features_idx)
    else:
        res = evaluate_model(name, m, features_train,
                             target_train, features_test, target_test)
    results.append(res)

LinearRegression: train=9.109s, pred=0.487s (0.000007s/sample), total=9.596s, RECM(RMSE)=3172.36
DecisionTree: train=46.351s, pred=0.456s (0.000006s/sample), total=46.807s, RECM(RMSE)=2191.97
RandomForest: train=1803.793s, pred=4.407s (0.000062s/sample), total=1808.199s, RECM(RMSE)=1736.32
LightGBM: train=6.863s, pred=0.762s (0.000011s/sample), total=7.624s, RECM(RMSE)=1792.44
LightGBM_2: train=21.870s, pred=0.625s (0.000009s/sample), total=22.494s, RECM(RMSE)=1847.40
CatBoost: train=39.998s, pred=0.266s (0.000004s/sample), total=40.265s, RECM(RMSE)=1847.02
XGBoost: train=10.107s, pred=0.448s (0.000006s/sample), total=10.556s, RECM(RMSE)=1814.79
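
The printed lines are easier to compare as a table. The sketch below copies the figures from the output above into a DataFrame sorted by RMSE; in the notebook itself the same table could presumably be built from the `results` list after dropping the `pipe` entry. The column name `pred_time_s` is an assumption, since the original dictionaries store only the per-sample prediction time.

```python
import pandas as pd

# (model, RMSE, training time in s, prediction time in s), copied from the printed output.
results_table = pd.DataFrame(
    [
        ("LinearRegression", 3172.36, 9.109, 0.487),
        ("DecisionTree", 2191.97, 46.351, 0.456),
        ("RandomForest", 1736.32, 1803.793, 4.407),
        ("LightGBM", 1792.44, 6.863, 0.762),
        ("LightGBM_2", 1847.40, 21.870, 0.625),
        ("CatBoost", 1847.02, 39.998, 0.266),
        ("XGBoost", 1814.79, 10.107, 0.448),
    ],
    columns=["model", "rmse", "train_time_s", "pred_time_s"],
).sort_values("rmse").reset_index(drop=True)

print(results_table.to_string(index=False))
```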


## Conclusions

- LinearRegression, used as a sanity check, had by far the highest error (RMSE=3172.36), confirming that the more complex models do add value.
- DecisionTree is the simplest tree-based model. It predicts quickly, but training took 46.4 s, longer than any of the gradient-boosting models, and its error (RMSE=2191.97) is the highest among the tree-based models, which suggests poor generalization.
- RandomForest achieves the lowest error (RMSE=1736.32), but at the cost of an extremely long training time (about 1,804 seconds, roughly 30 minutes) and the slowest prediction (4.4 s on the test set), which makes it inefficient compared with the others.
- LightGBM and XGBoost offer a very favorable balance between accuracy and speed: they train in about 7 and 10 seconds, predict the whole test set in under a second, and keep errors relatively low (1792.44 and 1814.79, respectively).
- LightGBM_2 and CatBoost perform similarly, with somewhat higher errors (1847.40 and 1847.02). CatBoost has the fastest prediction (0.266 s), but both train more slowly than LightGBM and XGBoost (21.9 s and 40.0 s) and neither improves on their accuracy.


- If accuracy is the priority, RandomForest wins, beating LightGBM by about 56 RMSE points (roughly 3%), although at a high computational cost.
- For the best balance between accuracy and efficiency, LightGBM and XGBoost are the best options.
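
The trade-off in these two recommendations can be made explicit with a simple rule: pick the lowest-RMSE model whose training time fits a budget. The sketch below uses the figures printed earlier; the helper `best_within_budget` and the 60-second budget are hypothetical choices for illustration, not requirements stated by Rusty Bargain.

```python
# (model, RMSE, training time in seconds), copied from the printed results.
RESULTS = [
    ("LinearRegression", 3172.36, 9.109),
    ("DecisionTree", 2191.97, 46.351),
    ("RandomForest", 1736.32, 1803.793),
    ("LightGBM", 1792.44, 6.863),
    ("LightGBM_2", 1847.40, 21.870),
    ("CatBoost", 1847.02, 39.998),
    ("XGBoost", 1814.79, 10.107),
]


def best_within_budget(results, max_train_s):
    """Return the lowest-RMSE entry whose training time fits the budget, or None."""
    candidates = [r for r in results if r[2] <= max_train_s]
    return min(candidates, key=lambda r: r[1]) if candidates else None


no_limit = best_within_budget(RESULTS, float("inf"))  # accuracy at any cost
one_minute = best_within_budget(RESULTS, 60.0)        # hypothetical 60 s budget

# How much accuracy the unconstrained winner buys, and at what training cost.
rmse_gain_pct = 100 * (one_minute[1] - no_limit[1]) / one_minute[1]
train_time_ratio = no_limit[2] / one_minute[2]
print(no_limit[0], one_minute[0], round(rmse_gain_pct, 1), round(train_time_ratio))
```

With these numbers the unconstrained choice is RandomForest and the budgeted choice is LightGBM: an RMSE reduction of about 3% in exchange for a training time more than 260 times longer.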