# Project description

Rusty Bargain, a used-car sales service, is developing an app to attract new customers. With the app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. The task is to build a model that determines the market value.
Rusty Bargain is interested in:
- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

# Contents

1. [Introduction](#titulo_principal)
2. [Data preparation](#titulo_principal_1)
3. [Model training](#titulo_principal_2)
4. [Sanity check](#titulo_principal_3)
5. [Model analysis](#titulo_principal_4)
6. [Conclusions](#titulo_principal_5)
7. [Checklist](#titulo_principal_6)


## Introduction<a id="titulo_principal"></a>

In [52]:


# Data manipulation
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Data preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Models
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor, Pool

# Hyperparameter search
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from scipy.stats import randint

# Model evaluation
from sklearn.metrics import mean_squared_error, make_scorer

# Execution-time measurement
import time

# Warning handling
import warnings
warnings.filterwarnings('ignore')

## Data preparation<a id="titulo_principal_1"></a>

In [127]:
# Load the dataset

df = pd.read_csv('car_data.csv')

In [128]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

In [55]:
df.describe()

| | Price | RegistrationYear | Power | Mileage | RegistrationMonth | NumberOfPictures | PostalCode |
| --- | --- | --- | --- | --- | --- | --- | --- |
| count | 354369.0 | 354369.0 | 354369.0 | 354369.0 | 354369.0 | 354369.0 | 354369.0 |
| mean | 4416.656776 | 2004.234448 | 110.094337 | 128211.172535 | 5.714645 | 0.0 | 50508.689087 |
| std | 4514.158514 | 90.227958 | 189.850405 | 37905.34153 | 3.726421 | 0.0 | 25783.096248 |
| min | 0.0 | 1000.0 | 0.0 | 5000.0 | 0.0 | 0.0 | 1067.0 |
| 25% | 1050.0 | 1999.0 | 69.0 | 125000.0 | 3.0 | 0.0 | 30165.0 |
| 50% | 2700.0 | 2003.0 | 105.0 | 150000.0 | 6.0 | 0.0 | 49413.0 |
| 75% | 6400.0 | 2008.0 | 143.0 | 150000.0 | 9.0 | 0.0 | 71083.0 |
| max | 20000.0 | 9999.0 | 20000.0 | 150000.0 | 12.0 | 0.0 | 99998.0 |


In [56]:
print(df.head())

        DateCrawled  Price VehicleType  RegistrationYear Gearbox  Power  \
0  24/03/2016 11:52    480         NaN              1993  manual      0   
1  24/03/2016 10:58  18300       coupe              2011  manual    190   
2  14/03/2016 12:52   9800         suv              2004    auto    163   
3  17/03/2016 16:54   1500       small              2001  manual     75   
4  31/03/2016 17:25   3600       small              2008  manual     69   

   Model  Mileage  RegistrationMonth  FuelType       Brand NotRepaired  \
0   golf   150000                  0    petrol  volkswagen         NaN   
1    NaN   125000                  5  gasoline        audi         yes   
2  grand   125000                  8  gasoline        jeep         NaN   
3   golf   150000                  6    petrol  volkswagen          no   
4  fabia    90000                  7  gasoline       skoda          no   

        DateCreated  NumberOfPictures  PostalCode          LastSeen  
0  24/03/2016 00:00               

In [129]:

df_names = ['date_crawled', 'price', 'vehicle_type', 'registration_year', 'gear_box', 'power', 
 'model', 'mileage', 'registration_month', 'fuel_type', 'brand', 'not_repaired', 
 'date_created', 'number_of_pictures', 'postal_code', 'last_seen']
df.columns = df_names

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   date_crawled        354369 non-null  object
 1   price               354369 non-null  int64 
 2   vehicle_type        316879 non-null  object
 3   registration_year   354369 non-null  int64 
 4   gear_box            334536 non-null  object
 5   power               354369 non-null  int64 
 6   model               334664 non-null  object
 7   mileage             354369 non-null  int64 
 8   registration_month  354369 non-null  int64 
 9   fuel_type           321474 non-null  object
 10  brand               354369 non-null  object
 11  not_repaired        283215 non-null  object
 12  date_created        354369 non-null  object
 13  number_of_pictures  354369 non-null  int64 
 14  postal_code         354369 non-null  int64 
 15  last_seen           354369 non-null  object
dtypes:

In [130]:
# Check for duplicates

print(df.duplicated().sum())

df = df.drop_duplicates().reset_index(drop=True)

print(df.duplicated().sum())

262
0


In [131]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354107 entries, 0 to 354106
Data columns (total 16 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   date_crawled        354107 non-null  object
 1   price               354107 non-null  int64 
 2   vehicle_type        316623 non-null  object
 3   registration_year   354107 non-null  int64 
 4   gear_box            334277 non-null  object
 5   power               354107 non-null  int64 
 6   model               334406 non-null  object
 7   mileage             354107 non-null  int64 
 8   registration_month  354107 non-null  int64 
 9   fuel_type           321218 non-null  object
 10  brand               354107 non-null  object
 11  not_repaired        282962 non-null  object
 12  date_created        354107 non-null  object
 13  number_of_pictures  354107 non-null  int64 
 14  postal_code         354107 non-null  int64 
 15  last_seen           354107 non-null  object
dtypes:

After dropping the duplicates and renaming the columns to make them easier to read, I see that several columns contain a substantial number of null entries, so I address that next.

In [133]:
# Function to fill null values

def null_full(dataset):
    """
    Iterate over the columns of a DataFrame and, for every column of dtype
    object that contains null entries, fill those entries with the string
    'Unknown'.

    Args:
        dataset: pandas DataFrame
    Returns:
        dataset: the same DataFrame with its string columns filled
    """
    for col in dataset.columns:
        # isnull().any() is True only when the column has at least one null
        if dataset[col].dtype == object and dataset[col].isnull().any():
            dataset[col] = dataset[col].fillna('Unknown')

    return dataset
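
The condition that decides whether a column needs filling is easy to get wrong: `isnull().count()` counts rows, not nulls, so it is positive for any non-empty column, whereas `isnull().sum()` and `isnull().any()` look at the nulls themselves. A minimal sketch on a hypothetical two-column frame (the `toy` name and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one column with a null, one without
toy = pd.DataFrame({
    "gear_box": ["manual", np.nan, "auto"],
    "brand": ["audi", "jeep", "skoda"],
})

# count() counts the entries of the boolean mask, so it equals the row count
rows_in_mask = toy["brand"].isnull().count()

# sum() counts the True values in the mask, i.e. the actual nulls
nulls_brand = toy["brand"].isnull().sum()
nulls_gear_box = toy["gear_box"].isnull().sum()

print(rows_in_mask, nulls_brand, nulls_gear_box)
```

Only the null counts tell the two columns apart; the row count is the same for both.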

In [134]:
# Fill null entries

df = null_full(df)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354107 entries, 0 to 354106
Data columns (total 16 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   date_crawled        354107 non-null  object
 1   price               354107 non-null  int64 
 2   vehicle_type        354107 non-null  object
 3   registration_year   354107 non-null  int64 
 4   gear_box            354107 non-null  object
 5   power               354107 non-null  int64 
 6   model               354107 non-null  object
 7   mileage             354107 non-null  int64 
 8   registration_month  354107 non-null  int64 
 9   fuel_type           354107 non-null  object
 10  brand               354107 non-null  object
 11  not_repaired        354107 non-null  object
 12  date_created        354107 non-null  object
 13  number_of_pictures  354107 non-null  int64 
 14  postal_code         354107 non-null  int64 
 15  last_seen           354107 non-null  object
dtypes:

Some columns hold dates with times stored as strings, so I convert them to datetime next.

In [135]:
# Convert the columns to datetime

df['date_crawled'] = pd.to_datetime(df['date_crawled'], format='%d/%m/%Y %H:%M')

df['date_created'] = pd.to_datetime(df['date_created'], format='%d/%m/%Y %H:%M')

df['last_seen'] = pd.to_datetime(df['last_seen'], format='%d/%m/%Y %H:%M')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354107 entries, 0 to 354106
Data columns (total 16 columns):
 #   Column              Non-Null Count   Dtype         
---  ------              --------------   -----         
 0   date_crawled        354107 non-null  datetime64[ns]
 1   price               354107 non-null  int64         
 2   vehicle_type        354107 non-null  object        
 3   registration_year   354107 non-null  int64         
 4   gear_box            354107 non-null  object        
 5   power               354107 non-null  int64         
 6   model               354107 non-null  object        
 7   mileage             354107 non-null  int64         
 8   registration_month  354107 non-null  int64         
 9   fuel_type           354107 non-null  object        
 10  brand               354107 non-null  object        
 11  not_repaired        354107 non-null  object        
 12  date_created        354107 non-null  datetime64[ns]
 13  number_of_pictures  354107 no
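
The three conversions share one format string, so they can also be written as a loop. A minimal sketch on two sample timestamps taken from the rows shown earlier; the `sample` frame is a stand-in for `df`, and `DATE_FORMAT` is a name introduced here for illustration:

```python
import pandas as pd

# Stand-in for df, using two timestamps from the rows shown earlier
sample = pd.DataFrame({
    "date_crawled": ["24/03/2016 11:52", "14/03/2016 12:52"],
    "date_created": ["24/03/2016 00:00", "14/03/2016 00:00"],
})

# Day-first format with hours and minutes, as in the raw CSV
DATE_FORMAT = "%d/%m/%Y %H:%M"

for col in ["date_crawled", "date_created"]:
    sample[col] = pd.to_datetime(sample[col], format=DATE_FORMAT)

print(sample.dtypes)
print(sample["date_crawled"].dt.day.tolist())
```

Passing an explicit `format` avoids any ambiguity between day-first and month-first dates.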

In [136]:
# Create a StandardScaler instance
scaler = StandardScaler()

# Identify the numeric columns
numeric_columns = ['registration_year', 'power', 'mileage', 'registration_month', 'number_of_pictures']

# Fit the scaler and transform the numeric columns
df[numeric_columns] = scaler.fit_transform(df[numeric_columns])

# Check the changes
print(df.head())



         date_crawled  price vehicle_type  registration_year gear_box  \
0 2016-03-24 11:52:00    480      Unknown          -0.124476   manual   
1 2016-03-24 10:58:00  18300        coupe           0.074945   manual   
2 2016-03-14 12:52:00   9800          suv          -0.002607     auto   
3 2016-03-17 16:54:00   1500        small          -0.035844   manual   
4 2016-03-31 17:25:00   3600        small           0.041708   manual   

      power    model   mileage  registration_month fuel_type       brand  \
0 -0.579679     golf  0.574787           -1.533319    petrol  volkswagen   
1  0.420770  Unknown -0.084730           -0.191641  gasoline        audi   
2  0.278601    grand -0.084730            0.613366  gasoline        jeep   
3 -0.184765     golf  0.574787            0.076695    petrol  volkswagen   
4 -0.216358    fabia -1.008053            0.345031  gasoline       skoda   

  not_repaired date_created  number_of_pictures  postal_code  \
0      Unknown   2016-03-24             
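
The cell above fits the scaler on the whole dataset, so the statistics of rows that later land in the validation and test sets influence the scaling. A common alternative is to fit the scaler on the training rows only and reuse it for the other sets. A minimal sketch under that assumption, on synthetic numbers (the `toy` frame is hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data standing in for columns such as power and mileage
rng = np.random.default_rng(12345)
toy = pd.DataFrame({
    "power": rng.integers(50, 200, size=100),
    "mileage": rng.integers(5000, 150000, size=100),
}).astype(float)

train, valid = train_test_split(toy, test_size=0.25, random_state=12345)

scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only
train_scaled = scaler.fit_transform(train)
# Reuse the training statistics for the validation rows
valid_scaled = scaler.transform(valid)

print(train_scaled.mean(axis=0).round(6))  # close to 0 by construction
print(valid_scaled.mean(axis=0).round(6))  # generally not exactly 0
```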

In [137]:
print(df.head(10))

         date_crawled  price vehicle_type  registration_year gear_box  \
0 2016-03-24 11:52:00    480      Unknown          -0.124476   manual   
1 2016-03-24 10:58:00  18300        coupe           0.074945   manual   
2 2016-03-14 12:52:00   9800          suv          -0.002607     auto   
3 2016-03-17 16:54:00   1500        small          -0.035844   manual   
4 2016-03-31 17:25:00   3600        small           0.041708   manual   
5 2016-04-04 17:36:00    650        sedan          -0.102318   manual   
6 2016-04-01 20:48:00   2200  convertible          -0.002607   manual   
7 2016-03-21 18:54:00      0        sedan          -0.268503   manual   
8 2016-04-04 23:42:00  14500          bus           0.108182   manual   
9 2016-03-17 10:53:00    999        small          -0.069081   manual   

      power    model   mileage  registration_month fuel_type       brand  \
0 -0.579679     golf  0.574787           -1.533319    petrol  volkswagen   
1  0.420770  Unknown -0.084730           -0.

Some columns are not needed to predict the price of a car, such as DateCrawled, NumberOfPictures, PostalCode, and LastSeen, so I drop them next.

In [157]:
df_prepared = df.drop(['date_crawled', 'number_of_pictures', 'last_seen', 'postal_code'], axis=1)
df_prepared

| | price | vehicle_type | registration_year | gear_box | power | model | mileage | registration_month | fuel_type | brand | not_repaired | date_created |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 480 | Unknown | -0.124476 | manual | -0.579679 | golf | 0.574787 | -1.533319 | petrol | volkswagen | Unknown | 2016-03-24 |
| 1 | 18300 | coupe | 0.074945 | manual | 0.420770 | Unknown | -0.084730 | -0.191641 | gasoline | audi | yes | 2016-03-24 |
| 2 | 9800 | suv | -0.002607 | auto | 0.278601 | grand | -0.084730 | 0.613366 | gasoline | jeep | Unknown | 2016-03-14 |
| 3 | 1500 | small | -0.035844 | manual | -0.184765 | golf | 0.574787 | 0.076695 | petrol | volkswagen | no | 2016-03-17 |
| 4 | 3600 | small | 0.041708 | manual | -0.216358 | fabia | -1.008053 | 0.345031 | gasoline | skoda | no | 2016-03-31 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 354102 | 0 | Unknown | 0.008471 | manual | -0.579679 | colt | 0.574787 | 0.345031 | petrol | mitsubishi | yes | 2016-03-21 |
| 354103 | 2200 | Unknown | 0.008471 | Unknown | -0.579679 | Unknown | -2.854701 | -1.264983 | Unknown | sonstige_autos | Unknown | 2016-03-14 |
| 354104 | 1199 | convertible | -0.046923 | auto | -0.047862 | fortwo | -0.084730 | -0.728312 | petrol | smart | no | 2016-03-05 |
| 354105 | 9200 | bus | -0.091239 | manual | -0.042596 | transporter | 0.574787 | -0.728312 | gasoline | volkswagen | no | 2016-03-19 |


In [158]:
# Split the dataset

df_train_valid, df_test = train_test_split(df_prepared, test_size=0.2, random_state=12345)

# Same random_state as the first split, so the partition is reproducible
df_train, df_valid = train_test_split(df_train_valid, test_size=0.25, random_state=12345)


# Check the size of each set
print("Training set size:", df_train.shape)
print("Validation set size:", df_valid.shape)
print("Test set size:", df_test.shape)

Training set size: (212463, 12)
Validation set size: (70822, 12)
Test set size: (70822, 12)
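
The two-step split yields a 60/20/20 partition: 20% is held out for testing, and 25% of the remaining 80% (that is, 20% of the total) goes to validation. The reported shapes can be checked by hand, assuming a fractional test size is rounded up, which is what the sizes above are consistent with:

```python
import math

n_rows = 354107  # rows left after dropping duplicates

# First split: 20% for the test set (rounded up)
n_test = math.ceil(0.2 * n_rows)
n_train_valid = n_rows - n_test

# Second split: 25% of the remainder for validation (rounded up)
n_valid = math.ceil(0.25 * n_train_valid)
n_train = n_train_valid - n_valid

print(n_train, n_valid, n_test)
```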


In [159]:
# Features and targets

# Training
features_train = df_train.drop('price', axis=1)
targets_train = df_train['price']

# Validation
features_valid = df_valid.drop('price', axis=1)
targets_valid = df_valid['price']

These data are ready for training most of the models. However, some models do not accept categorical data and need one-hot encoding (OHE), and libraries such as XGBoost do not accept datetime columns either, so I handle both next.
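
To make the idea concrete, here is a minimal sketch of one-hot encoding with `pd.get_dummies` on a hypothetical three-row frame. With `drop_first=True`, the first category of each column (in sorted order) is dropped and acts as the baseline:

```python
import pandas as pd

# Hypothetical toy frame with a single categorical column
toy = pd.DataFrame({
    "price": [480, 18300, 9800],
    "gear_box": ["manual", "auto", "Unknown"],
})

# Categories sort as Unknown < auto < manual, so 'Unknown' is the dropped baseline
encoded = pd.get_dummies(toy, columns=["gear_box"], drop_first=True)

# Depending on the pandas version the dummies are uint8 or bool; cast to int
dummy_cols = [c for c in encoded.columns if c.startswith("gear_box_")]
encoded[dummy_cols] = encoded[dummy_cols].astype(int)

print(encoded.columns.tolist())
print(encoded)
```

A row whose dummy columns are all zero belongs to the dropped baseline category.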

### Special dataset for XGBoost and models that require OHE

In [149]:
# Copy of the already processed dataset
data_ohe = df_prepared.copy()

data_ohe.head()

| | price | vehicle_type | registration_year | gear_box | power | model | mileage | registration_month | fuel_type | brand | not_repaired | date_created |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 480 | Unknown | -0.124476 | manual | -0.579679 | golf | 0.574787 | -1.533319 | petrol | volkswagen | Unknown | 2016-03-24 |
| 1 | 18300 | coupe | 0.074945 | manual | 0.42077 | Unknown | -0.08473 | -0.191641 | gasoline | audi | yes | 2016-03-24 |
| 2 | 9800 | suv | -0.002607 | auto | 0.278601 | grand | -0.08473 | 0.613366 | gasoline | jeep | Unknown | 2016-03-14 |
| 3 | 1500 | small | -0.035844 | manual | -0.184765 | golf | 0.574787 | 0.076695 | petrol | volkswagen | no | 2016-03-17 |
| 4 | 3600 | small | 0.041708 | manual | -0.216358 | fabia | -1.008053 | 0.345031 | gasoline | skoda | no | 2016-03-31 |


In [150]:
# Define the categorical columns
categorical_columns = ['vehicle_type', 'gear_box', 'model', 'fuel_type', 'brand', 'not_repaired']

# Split the date_created column into year, month, and day
data_ohe['year'] = data_ohe['date_created'].dt.year
data_ohe['month'] = data_ohe['date_created'].dt.month
data_ohe['day'] = data_ohe['date_created'].dt.day
data_ohe = data_ohe.drop('date_created', axis=1)

# Apply one-hot encoding to the categorical columns
data_ohe = pd.get_dummies(data_ohe, columns=categorical_columns, drop_first=True)

# Convert boolean columns to integers (0 and 1)
boolean_columns = data_ohe.select_dtypes(include=['bool']).columns
data_ohe[boolean_columns] = data_ohe[boolean_columns].astype(int)


# Scale the new date columns with a fresh StandardScaler
scaler = StandardScaler()
data_ohe[['day', 'month', 'year']] = scaler.fit_transform(data_ohe[['day', 'month', 'year']])

data_ohe


| | price | registration_year | power | mileage | registration_month | year | month | day | vehicle_type_bus | vehicle_type_convertible | ... | brand_smart | brand_sonstige_autos | brand_subaru | brand_suzuki | brand_toyota | brand_trabant | brand_volkswagen | brand_volvo | not_repaired_no | not_repaired_yes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 480 | -0.124476 | -0.579679 | 0.574787 | -1.533319 | 0.008426 | -0.425641 | 0.892231 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 18300 | 0.074945 | 0.420770 | -0.084730 | -0.191641 | 0.008426 | -0.425641 | 0.892231 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 9800 | -0.002607 | 0.278601 | -0.084730 | 0.613366 | 0.008426 | -0.425641 | -0.203918 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1500 | -0.035844 | -0.184765 | 0.574787 | 0.076695 | 0.008426 | -0.425641 | 0.124927 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 4 | 3600 | 0.041708 | -0.216358 | -1.008053 | 0.345031 | 0.008426 | -0.425641 | 1.659536 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 354102 | 0 | 0.008471 | -0.579679 | 0.574787 | 0.345031 | 0.008426 | -0.425641 | 0.563387 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 354103 | 2200 | 0.008471 | -0.579679 | -2.854701 | -1.264983 | 0.008426 | -0.425641 | -0.203918 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 354104 | 1199 | -0.046923 | -0.047862 | -0.084730 | -0.728312 | 0.008426 | -0.425641 | -1.190452 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 354105 | 9200 | -0.091239 | -0.042596 | 0.574787 | -0.728312 | 0.008426 | -0.425641 | 0.344157 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |


In [151]:
# Split the dataset

df_train_valid_ohe, df_test_ohe = train_test_split(data_ohe, test_size=0.2, random_state=12345)

df_train_ohe, df_valid_ohe = train_test_split(df_train_valid_ohe, test_size=0.25, random_state=12345)


# Check the size of each set
print("Training set size:", df_train_ohe.shape)
print("Validation set size:", df_valid_ohe.shape)
print("Test set size:", df_test_ohe.shape)

Training set size: (212463, 316)
Validation set size: (70822, 316)
Test set size: (70822, 316)


In [152]:
# Features and targets

# Training
features_train_ohe = df_train_ohe.drop('price', axis=1)
targets_train_ohe = df_train_ohe['price']

# Validation
features_valid_ohe = df_valid_ohe.drop('price', axis=1)
targets_valid_ohe = df_valid_ohe['price']

## Model training<a id="titulo_principal_2"></a>

#### Linear regression model

In [85]:
model_reg = LinearRegression()

start_time = time.time()
model_reg.fit(features_train_ohe, targets_train_ohe)
end_time = time.time()

training_time = end_time - start_time

print("Training time:", training_time)

Training time: 16.54839253425598
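
The start/stop timing pattern recurs in every training cell. One way to reduce the repetition is a small helper built on `time.perf_counter`, a clock intended for measuring elapsed durations. This is a sketch, not part of the original notebook; the `timed` name is hypothetical:

```python
import time

def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Usage with a stand-in workload; with a model it would be
# timed(model.fit, features, targets) or timed(model.predict, features)
total, seconds = timed(sum, range(1_000_000))
print(total, seconds)
```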


In [87]:
# Predictions
start_time_pred = time.time()
predictions_valid_ohe = model_reg.predict(features_valid_ohe)
end_time_pred = time.time()

predict_time = end_time_pred - start_time_pred

# Compute RMSE
rmse = np.sqrt(mean_squared_error(targets_valid_ohe, predictions_valid_ohe))

print("Root Mean Squared Error:", rmse)
print("Prediction time:", predict_time)

Root Mean Squared Error: 3172.1409738573657
Prediction time: 0.15569281578063965
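
RMSE is the square root of the mean squared difference between targets and predictions, so it is expressed in the same units as the target. A small worked sketch with made-up numbers, computing it by hand:

```python
import math

# Made-up prices and predictions, for illustration only
targets = [1000, 2000, 3000, 4000]
predictions = [1100, 1900, 3300, 3600]

# Squared errors: 100^2, 100^2, 300^2, 400^2
squared_errors = [(t - p) ** 2 for t, p in zip(targets, predictions)]
mse = sum(squared_errors) / len(squared_errors)
rmse = math.sqrt(mse)

print(mse, rmse)
```

Because the errors are squared before averaging, the two large misses dominate the result.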


### Helper function for models tuned with RandomizedSearchCV

In [156]:

def evaluate_model(model, param_dist, n_iter, model_name, features_train, targets_train, features_valid, targets_valid, cat_features=None):
    def preprocess_features(features, cat_features):
        features_copy = features.copy()  # Work on a copy so the original is not modified
        # Label-encode the categorical columns (none when cat_features is None)
        le = LabelEncoder()
        for col in (cat_features or []):
            features_copy[col] = le.fit_transform(features_copy[col].astype(str))
        
        # Convert datetime columns to numeric columns (seconds since the epoch)
        for col in features.select_dtypes(include=['datetime64']).columns:
            features_copy[col] = features_copy[col].astype('int64') / 10**9  
        
        return features_copy
    
    if model_name == 'CatBoostRegressor':
        features_train_processed = preprocess_features(features_train, cat_features)
        features_valid_processed = preprocess_features(features_valid, cat_features)
        
        random_search = RandomizedSearchCV(
            estimator=model,
            param_distributions=param_dist,
            n_iter=n_iter,
            scoring='neg_mean_squared_error',
            cv=3,
            verbose=2,
            n_jobs=-1,
            refit=True,
            random_state=12345
        )

        start_time = time.time()
        random_search.fit(features_train_processed, targets_train)
        end_time = time.time()

        best_params = random_search.best_params_

        # Train the model with the best parameters
        model.set_params(**best_params)
        train_pool = Pool(data=features_train_processed, label=targets_train, cat_features=cat_features)
        valid_pool = Pool(data=features_valid_processed, label=targets_valid, cat_features=cat_features)
        model.fit(train_pool)

        # Get predictions from the best model
        start_time_pred = time.time()
        predictions_valid = model.predict(valid_pool)
        end_time_pred = time.time()

    elif isinstance(model, LGBMRegressor):
        features_train_processed = preprocess_features(features_train, cat_features)
        features_valid_processed = preprocess_features(features_valid, cat_features)
        
        random_search = RandomizedSearchCV(
            model,
            param_distributions=param_dist,
            n_iter=n_iter,
            scoring='neg_mean_squared_error',
            cv=3,
            random_state=12345,
            n_jobs=-1
        )
        
        start_time = time.time()
        random_search.fit(features_train_processed, targets_train)
        end_time = time.time()
        
        # Get the best parameters
        best_params = random_search.best_params_

        # Get the best model
        best_model = random_search.best_estimator_

        # Get predictions from the best model
        start_time_pred = time.time()
        predictions_valid = best_model.predict(features_valid_processed)
        end_time_pred = time.time()
        
    else:
        features_train_processed = preprocess_features(features_train, cat_features)
        features_valid_processed = preprocess_features(features_valid, cat_features)

        random_search = RandomizedSearchCV(
            model,
            param_distributions=param_dist,
            n_iter=n_iter,
            scoring='neg_mean_squared_error',
            cv=3,
            random_state=12345,
            n_jobs=-1
        )
        
        start_time = time.time()
        random_search.fit(features_train_processed, targets_train)
        end_time = time.time()
        
        # Get the best parameters
        best_params = random_search.best_params_

        # Get the best model
        best_model = random_search.best_estimator_

        # Get predictions from the best model
        start_time_pred = time.time()
        predictions_valid = best_model.predict(features_valid_processed)
        end_time_pred = time.time()

    # Compute RMSE
    rmse = np.sqrt(mean_squared_error(targets_valid, predictions_valid))

    predict_time = end_time_pred - start_time_pred
    training_time = end_time - start_time  # Duration of the whole hyperparameter search

    print("Best parameters: ", best_params)
    print("Root Mean Squared Error:", rmse)
    print("Training time:", training_time)
    print("Prediction time:", predict_time)
    print("\n")
    
    return {
        "modelo": model_name,
        "rmse": rmse,
        "training_time": training_time,
        "predict_time": predict_time,
        "best_params": best_params
    }
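
Inside `preprocess_features`, the encoder is refit on every call, so the training and validation sets are encoded independently and the same category can receive different integer codes in each. One possible alternative, sketched here with scikit-learn's `OrdinalEncoder` (the `handle_unknown='use_encoded_value'` option is available from version 0.24), learns the mapping on the training data and reuses it; the toy frames are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training and validation features
train = pd.DataFrame({"brand": ["audi", "skoda", "volkswagen", "audi"]})
valid = pd.DataFrame({"brand": ["volkswagen", "jeep"]})  # 'jeep' is unseen

# Unseen categories are mapped to -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

train_codes = encoder.fit_transform(train[["brand"]])
valid_codes = encoder.transform(valid[["brand"]])

print(train_codes.ravel().tolist())
print(valid_codes.ravel().tolist())
```

With this approach `volkswagen` keeps the same code in both sets, which is what a tree-based model needs for the feature to carry a consistent meaning.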


#### Decision tree model (4 cores)

In [89]:

# Model definition
model_tree = DecisionTreeRegressor(random_state=12345)

# Reduced parameter space
param_dist = {
    # 'squared_error' is the scikit-learn >= 1.0 name for the former 'mse'
    'criterion' : ["friedman_mse", "squared_error"],
    'splitter' : ["best", "random"],
    "max_depth" : [5, 10, 15, 20],
    'min_samples_split': [2, 5, 10, 15],
    'min_samples_leaf': [1, 2, 4, 6],
    'max_features': [None, 'sqrt', 'log2']
}

# Scorer based on MSE, negated so that higher is better
scorer = make_scorer(mean_squared_error, greater_is_better=False)

# RandomizedSearchCV object
random_search = RandomizedSearchCV(
    estimator=model_tree,
    param_distributions=param_dist,
    n_iter=50,  # Number of parameter combinations to try
    scoring=scorer,
    cv=5,  # Number of cross-validation folds
    n_jobs=4,  
    random_state=12345
)

# Fit the RandomizedSearchCV
start_time = time.time()
random_search.fit(features_train_ohe, targets_train_ohe)
end_time = time.time()

# Get the best parameters
best_params = random_search.best_params_

# Get the best model
best_model = random_search.best_estimator_

# Get predictions from the best model
start_time_pred = time.time()
predictions_valid_ohe = best_model.predict(features_valid_ohe)
end_time_pred = time.time()

# Compute RMSE
rmse = np.sqrt(mean_squared_error(targets_valid_ohe, predictions_valid_ohe))

predict_time = end_time_pred - start_time_pred
training_time = end_time - start_time

print("Best parameters: ", best_params)
print("Root Mean Squared Error:", rmse)
print("Training time:", training_time)
print("Prediction time:", predict_time)

Best parameters:  {'splitter': 'best', 'min_samples_split': 15, 'min_samples_leaf': 4, 'max_features': None, 'max_depth': 20, 'criterion': 'friedman_mse'}
Root Mean Squared Error: 1988.5807453682728
Training time: 678.2854125499725
Prediction time: 0.26235151290893555
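
The custom scorer above negates the MSE so that larger values are better, which means `best_score_` is a negative MSE rather than an RMSE. scikit-learn also ships a built-in `'neg_root_mean_squared_error'` scoring string; a minimal sketch on synthetic data (the dataset and the small parameter space are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, for illustration only
rng = np.random.default_rng(12345)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

search = RandomizedSearchCV(
    estimator=DecisionTreeRegressor(random_state=12345),
    param_distributions={"max_depth": [2, 4, 6], "min_samples_leaf": [1, 2, 4]},
    n_iter=5,
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
    cv=3,
    random_state=12345,
)
search.fit(X, y)

# Flip the sign to read the cross-validated RMSE of the best candidate
cv_rmse = -search.best_score_
print(search.best_params_, cv_rmse)
```

The ranking of candidates is the same under either scorer, since the square root is monotonic per fold only approximately; the main benefit is that the reported score is directly comparable to the validation RMSE.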


In [90]:
# Note: splitter='random' differs from the 'best' value selected by the search, and no random_state is set
model_tree = DecisionTreeRegressor(splitter= 'random', min_samples_split = 15, min_samples_leaf = 4, max_features= None, max_depth= 20, criterion= 'friedman_mse')

start_time = time.time()
model_tree.fit(features_train_ohe, targets_train_ohe)
end_time = time.time()

# Get predictions from the refitted model
start_time_pred = time.time()
predictions_valid_ohe = model_tree.predict(features_valid_ohe)
end_time_pred = time.time()

# Compute RMSE
rmse = np.sqrt(mean_squared_error(targets_valid_ohe, predictions_valid_ohe))

predict_time = end_time_pred - start_time_pred
training_time = end_time - start_time

# best_params still holds the search result from the previous cell
print("Best parameters: ", best_params)
print("Root Mean Squared Error:", rmse)
print("Training time:", training_time)
print("Prediction time:", predict_time)

Best parameters:  {'splitter': 'best', 'min_samples_split': 15, 'min_samples_leaf': 4, 'max_features': None, 'max_depth': 20, 'criterion': 'friedman_mse'}
Root Mean Squared Error: 2229.997584405564
Training time: 8.918367147445679
Prediction time: 0.18118524551391602
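
Retyping the best parameters by hand makes it easy for the refit to drift from what the search selected. A sketch of passing the dictionary directly with `**`, using the parameter values printed by the search above; the synthetic data is an assumption for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Best parameters as printed by the search
best_params = {
    "splitter": "best",
    "min_samples_split": 15,
    "min_samples_leaf": 4,
    "max_features": None,
    "max_depth": 20,
    "criterion": "friedman_mse",
}

# Unpack the dictionary so every value comes from the search result
model = DecisionTreeRegressor(random_state=12345, **best_params)

# Synthetic data, for illustration only
rng = np.random.default_rng(12345)
X = rng.normal(size=(100, 2))
y = X[:, 0] + rng.normal(scale=0.1, size=100)
model.fit(X, y)

print(model.get_params()["splitter"], model.get_depth())
```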


#### Random forest model (4 cores)

In [106]:
# Model definition
model_forest = RandomForestRegressor(random_state=12345)

# Reduced parameter space
param_dist = {
    'n_estimators' : [15, 25, 50, 100],
    # 'squared_error' is the scikit-learn >= 1.0 name for the former 'mse'
    'criterion' : ["friedman_mse", "squared_error"],
    "max_depth" : [None, 5, 10, 15, 20],
    'min_samples_split': [2, 5, 10, 15],
    'min_samples_leaf': [1, 2, 4, 6],
    'max_features': [None, 'sqrt', 'log2']
}

# Scorer based on MSE, negated so that higher is better
scorer = make_scorer(mean_squared_error, greater_is_better=False)

# RandomizedSearchCV object
random_search = RandomizedSearchCV(
    estimator=model_forest,
    param_distributions=param_dist,
    n_iter=50,  # Number of parameter combinations to try
    scoring=scorer,
    cv=5,  # Number of cross-validation folds
    n_jobs=4,  
    random_state=12345
)

# Fit the RandomizedSearchCV
start_time = time.time()
random_search.fit(features_train_ohe, targets_train_ohe)
end_time = time.time()

# Get the best parameters
best_params = random_search.best_params_

# Get the best model
best_model = random_search.best_estimator_

# Get predictions from the best model
start_time_pred = time.time()
predictions_valid_ohe = best_model.predict(features_valid_ohe)
end_time_pred = time.time()

# Compute RMSE
rmse = np.sqrt(mean_squared_error(targets_valid_ohe, predictions_valid_ohe))

predict_time = end_time_pred - start_time_pred
training_time = end_time - start_time

print("Best parameters: ", best_params)
print("Root Mean Squared Error:", rmse)
print("Training time:", training_time)
print("Prediction time:", predict_time)

Best parameters:  {'n_estimators': 25, 'min_samples_split': 10, 'min_samples_leaf': 2, 'max_features': None, 'max_depth': 20, 'criterion': 'friedman_mse'}
Root Mean Squared Error: 3018.298650754492
Training time: 3890.6742703914642
Prediction time: 0.703648567199707


In [91]:
model_forest = RandomForestRegressor(n_estimators= 25, min_samples_split = 10, min_samples_leaf = 2, max_features= None, max_depth= 20, criterion= 'friedman_mse', random_state=12345)

start_time = time.time()
model_forest.fit(features_train_ohe, targets_train_ohe)
end_time = time.time()

# Get predictions from the refitted model
start_time_pred = time.time()
predictions_valid_ohe = model_forest.predict(features_valid_ohe)
end_time_pred = time.time()

# Compute RMSE
rmse = np.sqrt(mean_squared_error(targets_valid_ohe, predictions_valid_ohe))

predict_time = end_time_pred - start_time_pred
training_time = end_time - start_time


print("Root Mean Squared Error:", rmse)
print("Training time:", training_time)
print("Prediction time:", predict_time)

Root Mean Squared Error: 1777.2310322042965
Training time: 103.79491019248962
Prediction time: 0.5835905075073242


#### XGBoost model (4 cores)

In [95]:
import gc

# Free unused memory
gc.collect()

# Model definition
model_xgb = XGBRegressor(random_state=12345)

# Parameter space
param_dist = {
    'n_estimators': [50, 100],
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5],
    'min_child_weight': [1, 5],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'gamma': [0, 0.1],
    'reg_alpha': [0, 0.5],
    'reg_lambda': [0.01, 1]
}

# Use a smaller sample of the data
sample_size = 100000  # adjust the size as needed
features_train_ohe_sample = features_train_ohe[:sample_size]
targets_train_ohe_sample = targets_train_ohe[:sample_size]
features_valid_ohe_sample = features_valid_ohe[:sample_size]
targets_valid_ohe_sample = targets_valid_ohe[:sample_size]

# Evaluate the model on the reduced data
result = evaluate_model(model_xgb, param_dist, 50, 'XGBRegressor', 
                        features_train_ohe_sample, targets_train_ohe_sample, 
                        features_valid_ohe_sample, targets_valid_ohe_sample, 
                        None)

print(result)

Best parameters:  {'subsample': 1.0, 'reg_lambda': 1, 'reg_alpha': 0, 'n_estimators': 100, 'min_child_weight': 5, 'max_depth': 5, 'learning_rate': 0.1, 'gamma': 0, 'colsample_bytree': 0.8}
Root Mean Squared Error: 1899.2785044263526
Training time: 340.6750910282135
Prediction time: 0.2670562267303467


{'modelo': 'XGBRegressor', 'rmse': 1899.2785044263526, 'training_time': 340.6750910282135, 'predict_time': 0.2670562267303467, 'best_params': {'subsample': 1.0, 'reg_lambda': 1, 'reg_alpha': 0, 'n_estimators': 100, 'min_child_weight': 5, 'max_depth': 5, 'learning_rate': 0.1, 'gamma': 0, 'colsample_bytree': 0.8}}


#### CatBoost model (4 cores)

In [145]:
# Define the CatBoost model
model_cat = CatBoostRegressor(random_state=12345, silent=True)

# Define the hyperparameter search space
param_dist_catboost = {
    'iterations': [15, 25, 50, 100],
    'learning_rate': [0.01, 0.05, 0.1, 0.3],
    'depth': [3, 5, 7, 10],
    'l2_leaf_reg': [1, 3, 5, 7, 9],
    'bagging_temperature': [0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
    'border_count': [32, 50, 100, 200]
}

# Define the categorical columns
categorical_col = ['vehicle_type', 'gear_box', 'model', 'fuel_type', 'brand', 'not_repaired']

# Evaluate the model
result_catboost = evaluate_model(model_cat, param_dist_catboost, 50, 'CatBoostRegressor', features_train, targets_train, features_valid, targets_valid, categorical_col)

print(result_catboost)

Fitting 3 folds for each of 50 candidates, totalling 150 fits
Mejores parámetros:  {'learning_rate': 0.3, 'l2_leaf_reg': 5, 'iterations': 100, 'depth': 10, 'border_count': 50, 'bagging_temperature': 0.0}
Root Mean Squared Error: 1839.3396297999461
Tiempo de entrenamiento: 168.5906479358673
Tiempo de predicción: 0.021265268325805664


{'modelo': 'CatBoostRegressor', 'rmse': 1839.3396297999461, 'training_time': 168.5906479358673, 'predict_time': 0.021265268325805664, 'best_params': {'learning_rate': 0.3, 'l2_leaf_reg': 5, 'iterations': 100, 'depth': 10, 'border_count': 50, 'bagging_temperature': 0.0}}
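
The log line "Fitting 3 folds for each of 50 candidates, totalling 150 fits" shows that the search samples 50 parameter combinations rather than trying every one. A rough sketch of how much of each grid that covers is given below. The two dictionaries are copied from the cells above, `grid_size` is a hypothetical helper, and treating the `50` passed to `evaluate_model` as the number of sampled candidates for XGBoost as well is an assumption based on that log line.

```python
import math

# Search spaces copied from the XGBoost and CatBoost cells
param_dist_xgb = {
    'n_estimators': [50, 100],
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5],
    'min_child_weight': [1, 5],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'gamma': [0, 0.1],
    'reg_alpha': [0, 0.5],
    'reg_lambda': [0.01, 1],
}
param_dist_catboost = {
    'iterations': [15, 25, 50, 100],
    'learning_rate': [0.01, 0.05, 0.1, 0.3],
    'depth': [3, 5, 7, 10],
    'l2_leaf_reg': [1, 3, 5, 7, 9],
    'bagging_temperature': [0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
    'border_count': [32, 50, 100, 200],
}


def grid_size(param_dist):
    """Number of distinct combinations in a discrete search space."""
    return math.prod(len(values) for values in param_dist.values())


N_CANDIDATES = 50  # candidates sampled per search (assumed for XGBoost)
CV_FOLDS = 3       # folds reported in the CatBoost log

for name, space in [("XGBoost", param_dist_xgb), ("CatBoost", param_dist_catboost)]:
    total = grid_size(space)
    share = N_CANDIDATES / total
    print(f"{name}: {total} combinations, {share:.1%} sampled, "
          f"{N_CANDIDATES * CV_FOLDS} fits")
```

Because only a small share of each grid is explored, the reported best parameters are the best among the sampled candidates, not necessarily the best in the whole grid.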


In [164]:
model_cat = CatBoostRegressor(random_state=12345, silent=True, learning_rate=0.3, l2_leaf_reg=5, iterations=100, depth=10, border_count=50, bagging_temperature=0)

categorical_col = ['vehicle_type', 'gear_box', 'model', 'fuel_type', 'brand', 'not_repaired']

train_pool = Pool(data=features_train, label=targets_train, cat_features=categorical_col)
valid_pool = Pool(data=features_valid, label=targets_valid, cat_features=categorical_col)


start_time = time.time()
model_cat.fit(train_pool)
end_time = time.time()

# Get predictions from the best model
start_time_pred = time.time()
predictions_valid_cat = model_cat.predict(valid_pool)
end_time_pred = time.time()

# Compute RMSE
rmse = np.sqrt(mean_squared_error(targets_valid, predictions_valid_cat))

predict_time = end_time_pred - start_time_pred
training_time = end_time - start_time


print("Root Mean Squared Error:", rmse)
print("Tiempo de entrenamiento:", training_time)
print("Tiempo de predicción:", predict_time)

Root Mean Squared Error: 1762.9572284742167
Tiempo de entrenamiento: 14.530795097351074
Tiempo de predicción: 0.020435094833374023


#### LightGBM model (4 cores)

In [160]:
model_lgb = LGBMRegressor(random_state=12345)

param_dist_lgb = {
    'n_estimators': [15, 25, 50, 100],
    'learning_rate': [0.01, 0.05, 0.1, 0.3],
    'num_leaves': [20, 31, 40, 50],
    'min_child_samples': [10, 20, 30, 40],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0]
}

categorical_col = ['vehicle_type', 'gear_box', 'model', 'fuel_type', 'brand', 'not_repaired']

result_lgb = evaluate_model(model_lgb, param_dist_lgb, 50, 'LGBMRegressor', features_train, targets_train, features_valid, targets_valid, categorical_col)

print(result_lgb)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002915 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 768
[LightGBM] [Info] Number of data points in the train set: 212463, number of used features: 11
[LightGBM] [Info] Start training from score 4416.610182
Mejores parámetros:  {'subsample': 1.0, 'num_leaves': 40, 'n_estimators': 100, 'min_child_samples': 30, 'learning_rate': 0.3, 'colsample_bytree': 1.0}
Root Mean Squared Error: 1821.58560594096
Tiempo de entrenamiento: 103.99538230895996
Tiempo de predicción: 0.08893775939941406


{'modelo': 'LGBMRegressor', 'rmse': 1821.58560594096, 'training_time': 103.99538230895996, 'predict_time': 0.08893775939941406, 'best_params': {'subsample': 1.0, 'num_leaves': 40, 'n_estimators': 100, 'min_child_samples': 30, 'learning_rate': 0.3, 'colsample_bytree': 1.0}}


In [167]:
model_lgb = LGBMRegressor(random_state=12345, subsample=1, num_leaves=40, n_estimators=100, min_child_samples=30, learning_rate=0.3, colsample_bytree=1)

categorical_col = ['vehicle_type', 'gear_box', 'model', 'fuel_type', 'brand', 'not_repaired']

def preprocess_features(features, cat_features):
    features = features.copy()
    # Convert categorical columns to integer codes.
    # Note: the encoder is refit on every call, so the same category can
    # receive different codes in different splits.
    le = LabelEncoder()
    for col in cat_features:
        features[col] = le.fit_transform(features[col].astype(str))
    
    # Convert datetime columns to numerical columns
    for col in features.select_dtypes(include=['datetime64']).columns:
        features[col] = features[col].astype('int64') // 10**9  # Convert to seconds since epoch
    
    return features

features_train_lgbm = preprocess_features(features_train, categorical_col)
features_valid_lgbm = preprocess_features(features_valid, categorical_col)

start_time = time.time()
model_lgb.fit(features_train_lgbm, targets_train)
end_time = time.time()

# Get predictions from the best model
start_time_pred = time.time()
predictions_valid_lgbm = model_lgb.predict(features_valid_lgbm)
end_time_pred = time.time()

# Compute RMSE
rmse = np.sqrt(mean_squared_error(targets_valid, predictions_valid_lgbm))

predict_time = end_time_pred - start_time_pred
training_time = end_time - start_time


print("Root Mean Squared Error:", rmse)
print("Tiempo de entrenamiento:", training_time)
print("Tiempo de predicción:", predict_time)





[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001877 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 768
[LightGBM] [Info] Number of data points in the train set: 212463, number of used features: 11
[LightGBM] [Info] Start training from score 4416.610182
Root Mean Squared Error: 1821.58560594096
Tiempo de entrenamiento: 0.9690694808959961
Tiempo de predicción: 0.05707192420959473
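
`preprocess_features` fits a new `LabelEncoder` on each call, so the integer assigned to a category depends on which categories happen to be present in that split. A minimal sketch of the alternative, learning the mapping once on the training data and reusing it, is shown in plain Python below. `fit_category_codes` and `apply_category_codes` are hypothetical helpers, the brand lists are made up, and mapping unseen categories to -1 is an assumption, not something the notebook does.

```python
def fit_category_codes(values):
    """Learn a category -> integer mapping (sorted, like LabelEncoder)."""
    categories = sorted(set(values))
    return {category: code for code, category in enumerate(categories)}


def apply_category_codes(values, codes, unknown=-1):
    """Encode values with an existing mapping; unseen categories get `unknown`."""
    return [codes.get(value, unknown) for value in values]


# Made-up example columns
train_brand = ["bmw", "fiat", "opel", "volkswagen"]
valid_brand = ["opel", "volkswagen", "renault"]

# Fitting separately on the validation split changes the codes
independent_codes = fit_category_codes(valid_brand)
print(apply_category_codes(valid_brand, independent_codes))  # [0, 2, 1]

# Reusing the training mapping keeps codes consistent across splits
shared_codes = fit_category_codes(train_brand)
print(apply_category_codes(train_brand, shared_codes))  # [0, 1, 2, 3]
print(apply_category_codes(valid_brand, shared_codes))  # [2, 3, -1]
```

In scikit-learn, `OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)` fitted on the training features offers comparable behavior.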


## Sanity check <a id="titulo_principal_3"></a>

For this check I use the test dataset with both CatBoost and LightGBM, and compare them against a simple linear regression and a random forest.
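
A sanity check can also include a trivial baseline that ignores the features altogether. The sketch below uses made-up prices (not the Rusty Bargain data) and hypothetical helpers `baseline_rmse` and `passes_sanity_check`: under this assumption, a model is only worth keeping if its RMSE is lower than that of always predicting the training mean.

```python
import math
import statistics


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def baseline_rmse(y_train, y_test):
    """RMSE obtained by predicting the training mean for every test row."""
    mean_price = statistics.fmean(y_train)
    return rmse(y_test, [mean_price] * len(y_test))


def passes_sanity_check(model_rmse, y_train, y_test):
    """True when the model beats the constant-mean baseline."""
    return model_rmse < baseline_rmse(y_train, y_test)


# Made-up prices, only to illustrate the comparison
y_train = [1000, 3000, 5000, 7000]
y_test = [2000, 6000]

print(baseline_rmse(y_train, y_test))                 # 2000.0
print(passes_sanity_check(1500.0, y_train, y_test))   # True
print(passes_sanity_check(2500.0, y_train, y_test))   # False
```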

In [259]:
# Split the test dataset into features and target

features_test_ohe = df_test_ohe.drop('price', axis=1)
targets_test_ohe = df_test_ohe['price']

In [255]:
# Linear regression

model_reg = LinearRegression()

start_time = time.time()
model_reg.fit(features_train_ohe, targets_train_ohe)
end_time = time.time()

training_time = end_time - start_time

start_time_pred = time.time()
predictions_test_ohe = model_reg.predict(features_test_ohe)
end_time_pred = time.time()

predict_time = end_time_pred - start_time_pred

rmse = np.sqrt(mean_squared_error(targets_test_ohe, predictions_test_ohe))

print("Tiempo de entrenamiento:", training_time)
print("Root Mean Squared Error:", rmse)
print("Tiempo de predicción:", predict_time)
print('\n')

Tiempo de entrenamiento: 19.149163484573364
Root Mean Squared Error: 3190.9482078632705
Tiempo de predicción: 0.4272186756134033




In [256]:
# Random forest

model_forest = RandomForestRegressor(n_estimators= 25, min_samples_split = 10, min_samples_leaf = 2, max_features= None, max_depth= 20, criterion= 'friedman_mse', random_state=12345)

start_time = time.time()
model_forest.fit(features_train_ohe, targets_train_ohe)
end_time = time.time()

# Get predictions from the best model
start_time_pred = time.time()
predictions_test_ohe = model_forest.predict(features_test_ohe)
end_time_pred = time.time()

# Compute RMSE
rmse = np.sqrt(mean_squared_error(targets_test_ohe, predictions_test_ohe))

predict_time = end_time_pred - start_time_pred
training_time = end_time - start_time


print("Root Mean Squared Error:", rmse)
print("Tiempo de entrenamiento:", training_time)
print("Tiempo de predicción:", predict_time)

Root Mean Squared Error: 1809.042962409557
Tiempo de entrenamiento: 169.74538159370422
Tiempo de predicción: 1.1766457557678223


In [260]:
# Split the test dataset without one-hot encoding for CatBoost and LightGBM

features_test = df_test.drop('price', axis=1)
targets_test = df_test['price']

In [261]:
features_test

|  | vehicle_type | registration_year | gear_box | power | model | mileage | registration_month | fuel_type | brand | not_repaired | date_created |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 202320 | Unknown | 0.008471 | manual | 0.736701 | golf | 0.574787 | -1.533319 | petrol | volkswagen | no | 2016-03-09 |
| 23620 | convertible | -0.069081 | manual | 0.110104 | other | -0.084730 | 0.076695 | petrol | fiat | no | 2016-03-09 |
| 302262 | small | -0.069081 | manual | -0.263748 | polo | 0.574787 | -0.191641 | petrol | volkswagen | no | 2016-03-21 |
| 211528 | Unknown | 0.141419 | manual | -0.274279 | clio | 0.574787 | 0.076695 | petrol | renault | Unknown | 2016-03-20 |
| 26801 | bus | -0.002607 | manual | 0.157494 | touran | 0.574787 | 0.881702 | gasoline | volkswagen | no | 2016-03-27 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 75884 | sedan | 0.030629 | auto | 0.099573 | 3er | -1.271860 | -1.533319 | petrol | bmw | no | 2016-03-22 |
| 309044 | small | -0.069081 | manual | 38.969654 | Unknown | 0.574787 | -0.728312 | petrol | volkswagen | no | 2016-04-02 |
| 49754 | convertible | -0.168792 | manual | 0.320725 | 3er | -0.084730 | 0.345031 | petrol | bmw | Unknown | 2016-03-24 |
| 264831 | small | -0.013686 | manual | -0.184765 | corsa | 0.574787 | -0.728312 | petrol | opel | no | 2016-03-31 |


In [264]:
# CatBoost

model_cat = CatBoostRegressor(random_state=12345, silent=True, learning_rate=0.3, l2_leaf_reg=5, iterations=100, depth=10, border_count=50, bagging_temperature=0)

categorical_col = ['vehicle_type', 'gear_box', 'model', 'fuel_type', 'brand', 'not_repaired']

train_pool = Pool(data=features_train, label=targets_train, cat_features=categorical_col)
test_pool = Pool(data=features_test, label=targets_test, cat_features=categorical_col)


start_time = time.time()
model_cat.fit(train_pool)
end_time = time.time()

# Get predictions from the best model
start_time_pred = time.time()
predictions_test_cat = model_cat.predict(features_test)
end_time_pred = time.time()

# Compute RMSE
rmse = np.sqrt(mean_squared_error(targets_test, predictions_test_cat))

predict_time = end_time_pred - start_time_pred
training_time = end_time - start_time


print("Root Mean Squared Error:", rmse)
print("Tiempo de entrenamiento:", training_time)
print("Tiempo de predicción:", predict_time)

Root Mean Squared Error: 1796.223605539633
Tiempo de entrenamiento: 15.481641292572021
Tiempo de predicción: 0.19666552543640137


In [268]:
# LightGBM

model_lgb = LGBMRegressor(random_state=12345, subsample=1, num_leaves=40, n_estimators=100, min_child_samples=30, learning_rate=0.3, colsample_bytree=1)

categorical_col = ['vehicle_type', 'gear_box', 'model', 'fuel_type', 'brand', 'not_repaired']

def preprocess_features(features, cat_features):
    features = features.copy()
    # Convert categorical columns to integer codes.
    # Note: the encoder is refit on every call, so the same category can
    # receive different codes in different splits.
    le = LabelEncoder()
    for col in cat_features:
        features[col] = le.fit_transform(features[col].astype(str))
    
    # Convert datetime columns to numerical columns
    for col in features.select_dtypes(include=['datetime64']).columns:
        features[col] = features[col].astype('int64') // 10**9  # Convert to seconds since epoch
    
    return features

features_train_lgbm = preprocess_features(features_train, categorical_col)
features_test_lgbm = preprocess_features(features_test, categorical_col)

start_time = time.time()
model_lgb.fit(features_train_lgbm, targets_train)
end_time = time.time()

# Get predictions from the best model
start_time_pred = time.time()
predictions_test_lgbm = model_lgb.predict(features_test_lgbm)
end_time_pred = time.time()

# Compute RMSE
rmse = np.sqrt(mean_squared_error(targets_test, predictions_test_lgbm))

predict_time = end_time_pred - start_time_pred
training_time = end_time - start_time


print("Root Mean Squared Error:", rmse)
print("Tiempo de entrenamiento:", training_time)
print("Tiempo de predicción:", predict_time)


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001823 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 768
[LightGBM] [Info] Number of data points in the train set: 212463, number of used features: 11
[LightGBM] [Info] Start training from score 4416.610182
Root Mean Squared Error: 1842.0496122941915
Tiempo de entrenamiento: 0.6392080783843994
Tiempo de predicción: 0.06326889991760254


## Model analysis<a id="titulo_principal_4"></a>



#### Linear regression

    Training time: 16.54839253425598 s
    Root Mean Squared Error: 3172.1409738573657
    Prediction time: 0.15569281578063965 s

#### Decision tree regression

- Hyperparameter search (time taken and results returned):

    - Best parameters:  {'splitter': 'best', 'min_samples_split': 15, 'min_samples_leaf': 4, 'max_features': None, 'max_depth': 20, 'criterion': 'friedman_mse'}
    - Root Mean Squared Error: 1988.5807453682728
    - Training time: 678.2854125499725 s
    - Prediction time: 0.26235151290893555 s

- Model refit with the best parameters:

    - Root Mean Squared Error: 2229.997584405564
    - Training time: 8.918367147445679 s
    - Prediction time: 0.18118524551391602 s

#### Random forest regression

- Hyperparameter search (time taken and results returned):

    - Best parameters:  {'n_estimators': 25, 'min_samples_split': 10, 'min_samples_leaf': 2, 'max_features': None, 'max_depth': 20, 'criterion': 'friedman_mse'}
    - Root Mean Squared Error: 3018.298650754492
    - Training time: 3890.6742703914642 s
    - Prediction time: 0.703648567199707 s

- Model refit with the best parameters:

    - Root Mean Squared Error: 1777.2310322042965
    - Training time: 103.79491019248962 s
    - Prediction time: 0.5835905075073242 s

#### XGBoost regression

- Hyperparameter search on a 100,000-row sample (time taken and results returned):

    - Best parameters:  {'subsample': 1.0, 'reg_lambda': 1, 'reg_alpha': 0, 'n_estimators': 100, 'min_child_weight': 5, 'max_depth': 5, 'learning_rate': 0.1, 'gamma': 0, 'colsample_bytree': 0.8}
    - Root Mean Squared Error: 1899.2785044263526
    - Training time: 340.6750910282135 s
    - Prediction time: 0.2670562267303467 s


#### CatBoost regression

- Hyperparameter search (time taken and results returned):

    - Best parameters:  {'learning_rate': 0.3, 'l2_leaf_reg': 5, 'iterations': 100, 'depth': 10, 'border_count': 50, 'bagging_temperature': 0.0}
    - Root Mean Squared Error: 1839.3396297999461
    - Training time: 168.5906479358673 s
    - Prediction time: 0.021265268325805664 s

- Model refit with the best parameters:

    - Root Mean Squared Error: 1762.9572284742167
    - Training time: 14.530795097351074 s
    - Prediction time: 0.020435094833374023 s

#### LightGBM regression

- Hyperparameter search (time taken and results returned):

    - Best parameters:  {'subsample': 1.0, 'num_leaves': 40, 'n_estimators': 100, 'min_child_samples': 30, 'learning_rate': 0.3, 'colsample_bytree': 1.0}
    - Root Mean Squared Error: 1821.58560594096
    - Training time: 103.99538230895996 s
    - Prediction time: 0.08893775939941406 s

- Model refit with the best parameters:

    - Root Mean Squared Error: 1821.58560594096
    - Training time: 0.9690694808959961 s
    - Prediction time: 0.05707192420959473 s
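
The figures listed in this section can be compared side by side. The sketch below copies the refit (best-parameter) results, rounded, into a list and ranks the models on each of the three criteria. XGBoost is left out because only its search results are reported. The `results` structure and the `best_by` helper are illustrative names, not part of the notebook.

```python
# Refit results from this section, rounded (times in seconds)
results = [
    {"model": "LinearRegression", "rmse": 3172.14, "training_time": 16.55, "predict_time": 0.156},
    {"model": "DecisionTree", "rmse": 2230.00, "training_time": 8.92, "predict_time": 0.181},
    {"model": "RandomForest", "rmse": 1777.23, "training_time": 103.79, "predict_time": 0.584},
    {"model": "CatBoost", "rmse": 1762.96, "training_time": 14.53, "predict_time": 0.0204},
    {"model": "LightGBM", "rmse": 1821.59, "training_time": 0.969, "predict_time": 0.0571},
]


def best_by(metric):
    """Name of the model with the lowest value for the given metric."""
    return min(results, key=lambda row: row[metric])["model"]


# Lower is better for all three metrics, so sort ascending
for metric in ("rmse", "training_time", "predict_time"):
    ranking = sorted(results, key=lambda row: row[metric])
    print(f"{metric}: " + " < ".join(row["model"] for row in ranking))
```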

### Summary of results

- **Prediction quality**: In terms of RMSE on the validation set, the best result came from the CatBoost model refit with its best parameters:
    - Root Mean Squared Error: 1762.9572284742167

  It is followed by the random forest with an RMSE of 1777.2310322042965 and LightGBM with 1821.58560594096.

- **Prediction speed**: CatBoost is also the fastest at prediction:
    - Prediction time: 0.020435094833374023 s

  It is followed by LightGBM at 0.05707192420959473 s. The remaining models take between roughly 0.16 s and 0.58 s.

- **Training time**: For the hyperparameter search, LightGBM was the fastest at about 104 s, followed by CatBoost at about 169 s, XGBoost at about 341 s (on a 100,000-row sample), the decision tree at about 678 s, and the random forest at about 3,891 s. Refit with the best parameters, LightGBM trains in under 1 s and CatBoost in about 14.5 s.

Overall, LightGBM and CatBoost offer the best balance of the three criteria required by the project: LightGBM trains fastest, while CatBoost is the most accurate and predicts fastest on the validation set.


## Conclusions<a id="titulo_principal_5"></a>

- **Best overall model:** CatBoost offers a balance between accuracy (low RMSE) and efficiency (fast prediction and training).
- **Fastest model:** LightGBM is the best option when speed is required in both training and prediction, with a slight trade-off in accuracy compared with CatBoost.
- **Competitive models:** XGBoost and decision tree regression are also valid options, especially in scenarios that prioritize a balance between accuracy and training time.

The final test in the sanity check gave the expected results: speed and quality were very similar to the values obtained earlier on the validation set. The gradient boosting models were the fastest to compute and the most accurate.



# Checklist<a id="titulo_principal_6"></a>

- [x]  Jupyter Notebook is open
- [ ]  The code has no errors
- [ ]  The code cells are arranged in execution order
- [ ]  The data has been downloaded and prepared
- [ ]  The models have been trained
- [ ]  The speed and quality analysis of the models has been carried out