<div style="text-align: center;">
    <h2>Modeling - XGBoost</h2>
</div>

## Table of Contents

- [1 - Data Preparation](#preparaciondedatos)
- [1.1 - Library Installation](#instalacionlibrerias)
- [1.2 - Library Imports](#importarlibrerias)
- [1.3 - Data Loading](#cargadedatos)
- [2 - Data Transformation](#transformaciondedatos)
- [3 - Modeling - XGBoost](#xgboost)

### 1 - Data Preparation <a name="preparaciondedatos"></a>

#### 1.1 - Library Installation <a name="instalacionlibrerias"></a>

In [4]:
%pip install xgboost

Note: you may need to restart the kernel to use updated packages.


In [3]:
%pip install optuna

Note: you may need to restart the kernel to use updated packages.


#### 1.2 - Library Imports <a name="importarlibrerias"></a>

In [5]:
import warnings

import joblib
import numpy as np
import optuna
import pandas as pd
from optuna.samplers import TPESampler
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBRegressor

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')


Note: You have installed the 'manylinux2014' variant of XGBoost. Certain features such as GPU algorithms or federated learning are not available. To use these features, please upgrade to a recent Linux distro with glibc 2.28+, and install the 'manylinux_2_28' variant.


In [6]:
# Count the null values in each column of a DataFrame and compute
# the percentage of nulls per column.

def contar_valores_nulos_con_porcentaje(dataframe):

    nulos_por_columna = dataframe.isnull().sum()
    porcentaje_nulos_por_columna = (nulos_por_columna / len(dataframe)) * 100

    resultados = pd.DataFrame({
        'Cantidad de Nulos': nulos_por_columna,
        'Porcentaje de Nulos (%)': porcentaje_nulos_por_columna
    })

    return resultados

# Remove the outliers of a given column from a DataFrame.
# Uses the interquartile range (IQR) rule: a value is an outlier when it falls
# outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR], where Q1 and Q3 are the first and third quartiles.

def remove_outliers(df, column_name):
    Q1 = df[column_name].quantile(0.25)
    Q3 = df[column_name].quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    df_filtered = df[(df[column_name] >= lower_bound) & (df[column_name] <= upper_bound)]

    return df_filtered

# Repeatedly remove the outliers of a given column until a pass of "remove_outliers"
# drops no more rows under the IQR rule.

def remove_outliers_iteratively(df, column_name):
    df_clean = df.copy()
    while True:
        initial_len = len(df_clean)
        df_clean = remove_outliers(df_clean, column_name)
        final_len = len(df_clean)
        if initial_len == final_len:
            break
    return df_clean

In [7]:
def count_outliers(df, column):
    """
    Count the outliers in a given column of a DataFrame.

    Args:
    df (pandas.DataFrame): The DataFrame that holds the data.
    column (str): Name of the column in which outliers are counted.

    Returns:
    int: Number of outliers in the given column.
    """
    # Compute Q1 (first quartile), Q3 (third quartile) and the IQR (interquartile range)
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1

    # Define the outlier bounds
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Count the outliers
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    num_outliers = outliers.shape[0]

    return num_outliers
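
The outlier helpers are defined here but are not called later in the notebook. As a minimal sketch of how the IQR rule behaves, the example below applies the same bounds to a small made-up `milage` column (the values and the function names `iqr_filter` and `iqr_filter_until_stable` are illustrative, not part of the notebook). It shows that a second pass can drop rows that survived the first one, because the quartiles shift once the most extreme value is gone.

```python
import pandas as pd

def iqr_filter(df, column):
    """One pass of the IQR rule: keep rows inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    return df[df[column].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

def iqr_filter_until_stable(df, column):
    """Repeat the single pass until it no longer drops any row."""
    current = df
    while True:
        filtered = iqr_filter(current, column)
        if len(filtered) == len(current):
            return filtered
        current = filtered

# Illustrative values only (not taken from train.csv)
toy = pd.DataFrame({"milage": [10, 10, 10, 10, 11, 11, 12, 14, 100]})

single_pass = iqr_filter(toy, "milage")
stable = iqr_filter_until_stable(toy, "milage")

print(len(toy), len(single_pass), len(stable))
```

On this toy column the first pass removes only the value 100, and the repeated version additionally removes 14. Iterating is therefore more aggressive than a single pass, which is worth keeping in mind before applying it to a target such as `price`.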

#### 1.3 - Data Loading <a name="cargadedatos"></a>

In [8]:
train_df = pd.read_csv('train.csv')  
test_df = pd.read_csv('test.csv')  
sample_submission = pd.read_csv('sample_submission.csv')

In [9]:
train_df.head(5)

| | id | brand | model | model_year | milage | fuel_type | engine | transmission | ext_col | int_col | accident | clean_title | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | MINI | Cooper S Base | 2007 | 213000 | Gasoline | 172.0HP 1.6L 4 Cylinder Engine Gasoline Fuel | A/T | Yellow | Gray | None reported | Yes | 4200 |
| 1 | 1 | Lincoln | LS V8 | 2002 | 143250 | Gasoline | 252.0HP 3.9L 8 Cylinder Engine Gasoline Fuel | A/T | Silver | Beige | At least 1 accident or damage reported | Yes | 4999 |
| 2 | 2 | Chevrolet | Silverado 2500 LT | 2002 | 136731 | E85 Flex Fuel | 320.0HP 5.3L 8 Cylinder Engine Flex Fuel Capab... | A/T | Blue | Gray | None reported | Yes | 13900 |
| 3 | 3 | Genesis | G90 5.0 Ultimate | 2017 | 19500 | Gasoline | 420.0HP 5.0L 8 Cylinder Engine Gasoline Fuel | Transmission w/Dual Shift Mode | Black | Black | None reported | Yes | 45000 |
| 4 | 4 | Mercedes-Benz | Metris Base | 2021 | 7388 | Gasoline | 208.0HP 2.0L 4 Cylinder Engine Gasoline Fuel | 7-Speed A/T | Black | Beige | None reported | Yes | 97500 |


In [10]:
test_df.head(5)

| | id | brand | model | model_year | milage | fuel_type | engine | transmission | ext_col | int_col | accident | clean_title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 188533 | Land | Rover LR2 Base | 2015 | 98000 | Gasoline | 240.0HP 2.0L 4 Cylinder Engine Gasoline Fuel | 6-Speed A/T | White | Beige | None reported | Yes |
| 1 | 188534 | Land | Rover Defender SE | 2020 | 9142 | Hybrid | 395.0HP 3.0L Straight 6 Cylinder Engine Gasoli... | 8-Speed A/T | Silver | Black | None reported | Yes |
| 2 | 188535 | Ford | Expedition Limited | 2022 | 28121 | Gasoline | 3.5L V6 24V PDI DOHC Twin Turbo | 10-Speed Automatic | White | Ebony | None reported | |
| 3 | 188536 | Audi | A6 2.0T Sport | 2016 | 61258 | Gasoline | 2.0 Liter TFSI | Automatic | Silician Yellow | Black | None reported | |
| 4 | 188537 | Audi | A6 2.0T Premium Plus | 2018 | 59000 | Gasoline | 252.0HP 2.0L 4 Cylinder Engine Gasoline Fuel | A/T | Gray | Black | None reported | Yes |


In [11]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125690 entries, 0 to 125689
Data columns (total 12 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            125690 non-null  int64 
 1   brand         125690 non-null  object
 2   model         125690 non-null  object
 3   model_year    125690 non-null  int64 
 4   milage        125690 non-null  int64 
 5   fuel_type     122307 non-null  object
 6   engine        125690 non-null  object
 7   transmission  125690 non-null  object
 8   ext_col       125690 non-null  object
 9   int_col       125690 non-null  object
 10  accident      124058 non-null  object
 11  clean_title   111451 non-null  object
dtypes: int64(3), object(9)
memory usage: 11.5+ MB


### 2 - Data Transformation <a name="transformaciondedatos"></a>

In [12]:
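# Label-encode every categorical (object) column of the training set.
# astype(str) turns missing values into the string 'nan', which gets its own code.
object_cols = train_df.select_dtypes(include=['object']).columns

label_encoders = {}

for col in object_cols:
    le = LabelEncoder()
    train_df[col + '_encoded'] = le.fit_transform(train_df[col].astype(str))
    label_encoders[col] = le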

In [13]:
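# Label-encode the categorical columns of the test set.
# Note: a separate encoder is fitted on the test data, so the same category can
# receive a different integer code here than it did in train_df.
object_cols = test_df.select_dtypes(include=['object']).columns

label_encoders = {}

for col in object_cols:
    le = LabelEncoder()

    test_df[col + '_encoded'] = le.fit_transform(test_df[col].astype(str))
    label_encoders[col] = le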
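
Because `LabelEncoder` assigns codes from the sorted set of values it was fitted on, two encoders fitted on different data can map the same category to different integers. A minimal sketch of one way to keep the codes aligned, assuming it is acceptable to fit each encoder on the union of train and test values (the toy frames and the name `shared_encoders` are illustrative, not part of the notebook):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative frames; the real notebook would use train_df and test_df
toy_train = pd.DataFrame({"brand": ["Audi", "Ford", "MINI"]})
toy_test = pd.DataFrame({"brand": ["MINI", "Audi", "Tesla"]})

shared_encoders = {}
for col in ["brand"]:
    encoder = LabelEncoder()
    # Fit once on every value seen in either frame so both share one mapping
    all_values = pd.concat([toy_train[col], toy_test[col]]).astype(str)
    encoder.fit(all_values)
    toy_train[col + "_encoded"] = encoder.transform(toy_train[col].astype(str))
    toy_test[col + "_encoded"] = encoder.transform(toy_test[col].astype(str))
    shared_encoders[col] = encoder

print(toy_train)
print(toy_test)
```

With a shared mapping, `MINI` and `Audi` receive the same code in both frames, and a category that only appears in the test set (`Tesla` here) still gets a valid code instead of raising an error.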

In [14]:
contar_valores_nulos_con_porcentaje(train_df)

| | Cantidad de Nulos | Porcentaje de Nulos (%) |
|---|---|---|
| id | 0 | 0.0 |
| brand | 0 | 0.0 |
| model | 0 | 0.0 |
| model_year | 0 | 0.0 |
| milage | 0 | 0.0 |
| fuel_type | 5083 | 2.69608 |
| engine | 0 | 0.0 |
| transmission | 0 | 0.0 |
| ext_col | 0 | 0.0 |
| int_col | 0 | 0.0 |


In [15]:
train_df['clean_title'].value_counts()

clean_title
Yes    167114
Name: count, dtype: int64

In [16]:
# Find the most frequent value in the 'clean_title' column
most_frequent = train_df['clean_title'].value_counts().idxmax()

# Replace missing values (NaN) with the most frequent value.
# Note: the *_encoded columns were created earlier, so they still encode missing values as the string 'nan'.
train_df['clean_title'] = train_df['clean_title'].fillna(most_frequent)


In [17]:
train_df['accident'].value_counts()

accident
None reported                             144514
At least 1 accident or damage reported     41567
Name: count, dtype: int64

In [18]:
# Find the most frequent value in the 'accident' column
most_frequent = train_df['accident'].value_counts().idxmax()

# Replace missing values (NaN) with the most frequent value
train_df['accident'] = train_df['accident'].fillna(most_frequent)


In [19]:
train_df['fuel_type'].value_counts()

fuel_type
Gasoline          165940
Hybrid              6832
E85 Flex Fuel       5406
Diesel              3955
–                    781
Plug-In Hybrid       521
not supported         15
Name: count, dtype: int64

In [20]:
# Find the most frequent value in the 'fuel_type' column
most_frequent = train_df['fuel_type'].value_counts().idxmax()

# Replace missing values (NaN) with the most frequent value
train_df['fuel_type'] = train_df['fuel_type'].fillna(most_frequent)


In [21]:
contar_valores_nulos_con_porcentaje(test_df)

| | Cantidad de Nulos | Porcentaje de Nulos (%) |
|---|---|---|
| id | 0 | 0.0 |
| brand | 0 | 0.0 |
| model | 0 | 0.0 |
| model_year | 0 | 0.0 |
| milage | 0 | 0.0 |
| fuel_type | 3383 | 2.691543 |
| engine | 0 | 0.0 |
| transmission | 0 | 0.0 |
| ext_col | 0 | 0.0 |
| int_col | 0 | 0.0 |


In [22]:
test_df['clean_title'].value_counts()

clean_title
Yes    111451
Name: count, dtype: int64

In [23]:
# Find the most frequent value in the 'clean_title' column
most_frequent = test_df['clean_title'].value_counts().idxmax()

# Replace missing values (NaN) with the most frequent value
test_df['clean_title'] = test_df['clean_title'].fillna(most_frequent)


In [24]:
test_df['accident'].value_counts()

accident
None reported                             96263
At least 1 accident or damage reported    27795
Name: count, dtype: int64

In [25]:
# Find the most frequent value in the 'accident' column
most_frequent = test_df['accident'].value_counts().idxmax()

# Replace missing values (NaN) with the most frequent value
test_df['accident'] = test_df['accident'].fillna(most_frequent)


In [26]:
test_df['fuel_type'].value_counts()

fuel_type
Gasoline          110533
Hybrid              4676
E85 Flex Fuel       3523
Diesel              2686
–                    538
Plug-In Hybrid       337
not supported         14
Name: count, dtype: int64

In [27]:
# Find the most frequent value in the 'fuel_type' column
most_frequent = test_df['fuel_type'].value_counts().idxmax()

# Replace missing values (NaN) with the most frequent value
test_df['fuel_type'] = test_df['fuel_type'].fillna(most_frequent)
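
The six imputation cells above repeat one pattern per column and per dataset. A minimal sketch of the same idea as a single helper, under one stated assumption that differs from the notebook: the fill value is taken from the training set and reused for the test set, rather than computed separately on each. The function name `fill_with_train_mode` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_with_train_mode(train, test, columns):
    """Fill NaN in both frames with the most frequent training value per column."""
    train, test = train.copy(), test.copy()
    fill_values = {}
    for col in columns:
        most_frequent = train[col].value_counts().idxmax()
        train[col] = train[col].fillna(most_frequent)
        test[col] = test[col].fillna(most_frequent)
        fill_values[col] = most_frequent
    return train, test, fill_values

# Illustrative data only
toy_train = pd.DataFrame({"fuel_type": ["Gasoline", "Gasoline", "Hybrid", np.nan]})
toy_test = pd.DataFrame({"fuel_type": [np.nan, "Diesel"]})

filled_train, filled_test, used = fill_with_train_mode(toy_train, toy_test, ["fuel_type"])
print(used)
```

Returning the fill values makes it easy to log what was imputed. For the imputation to reach the model, a helper like this would have to run before the label-encoding step, since the features used for training are the `*_encoded` columns.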


In [28]:
train_df['price'] = pd.to_numeric(train_df['price'], errors='coerce')


### 3 - Modeling - XGBoost <a name="xgboost"></a>

In [29]:
# Features (numeric columns plus label-encoded categoricals) and target
X = train_df[['brand_encoded', 'model_encoded', 'model_year', 'milage', 'fuel_type_encoded',
              'engine_encoded', 'transmission_encoded', 'ext_col_encoded', 'int_col_encoded',
              'accident_encoded', 'clean_title_encoded']]
y = train_df['price']

# Hold out 20% of the training data for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [30]:
# Keep the same feature columns, in the same order, for the competition test set
test_df = test_df[['brand_encoded', 'model_encoded', 'model_year', 'milage', 'fuel_type_encoded',
                   'engine_encoded', 'transmission_encoded', 'ext_col_encoded', 'int_col_encoded',
                   'accident_encoded', 'clean_title_encoded']]

#### Optuna XGBoost

In [31]:
def objective_xgboost(trial):
    # Hyperparameter search space sampled by Optuna on each trial.
    # suggest_float(..., log=True) replaces the deprecated suggest_loguniform.
    xgboost_params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'max_depth': trial.suggest_int('max_depth', 3, 16),
        'learning_rate': trial.suggest_float('learning_rate', 1e-4, 1e-1, log=True),
        'subsample': trial.suggest_float('subsample', 0.4, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.4, 1.0),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 100),
        'gamma': trial.suggest_float('gamma', 0.0, 10.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-8, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 10.0, log=True),
    }

    model_xgboost = XGBRegressor(
        **xgboost_params,
        random_state=42,
        objective="reg:squarederror"
    )

    model_xgboost.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)

    y_pred = model_xgboost.predict(X_test)

    # Return the RMSE on the held-out split; Optuna minimizes this value.
    # np.sqrt is used because the squared=False argument was removed in recent scikit-learn releases.
    return np.sqrt(mean_squared_error(y_test, y_pred))
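
The value returned by the objective is the root mean squared error (RMSE) on the held-out split, expressed in the same units as `price`. As a small worked example with made-up numbers (not taken from the dataset), the metric can be computed by hand with NumPy:

```python
import numpy as np

# Illustrative prices and predictions
y_true = np.array([10000.0, 20000.0, 30000.0])
y_pred = np.array([12000.0, 19000.0, 27000.0])

errors = y_pred - y_true              # 2000, -1000, -3000
mse = np.mean(errors ** 2)            # mean of the squared errors
rmse = np.sqrt(mse)                   # back to price units

print(round(rmse, 2))
```

Because the errors are squared before averaging, the single 3000 miss contributes more than the other two combined, which is why a handful of very expensive cars can dominate the RMSE on this kind of data.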


In [32]:
study_xgboost = optuna.create_study(
    study_name="XGBoost_used_car", 
    sampler=TPESampler(), 
    direction="minimize"
)

optuna.logging.set_verbosity(optuna.logging.WARNING)

study_xgboost.optimize(objective_xgboost, n_trials=400, show_progress_bar=True)


[I 2024-09-29 17:29:45,500] A new study created in memory with name: XGBoost_used_car



In [33]:
print("Best trial:", study_xgboost.best_trial)

Best trial: FrozenTrial(number=233, state=TrialState.COMPLETE, values=[67706.12921503252], datetime_start=datetime.datetime(2024, 9, 29, 18, 29, 21, 360167), datetime_complete=datetime.datetime(2024, 9, 29, 18, 29, 35, 986520), params={'n_estimators': 700, 'max_depth': 7, 'learning_rate': 0.012525991125616226, 'subsample': 0.8196358516487999, 'colsample_bytree': 0.5147240850455552, 'min_child_weight': 44, 'gamma': 2.981639862905725, 'reg_alpha': 0.5212122480997886, 'reg_lambda': 0.1289786873295609}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'n_estimators': IntDistribution(high=1000, log=False, low=100, step=1), 'max_depth': IntDistribution(high=16, log=False, low=3, step=1), 'learning_rate': FloatDistribution(high=0.1, log=True, low=0.0001, step=None), 'subsample': FloatDistribution(high=1.0, log=False, low=0.4, step=None), 'colsample_bytree': FloatDistribution(high=1.0, log=False, low=0.4, step=None), 'min_child_weight': IntDistribution(high=100, log=False

In [34]:
print("Best parameters:", study_xgboost.best_params)

Best parameters: {'n_estimators': 700, 'max_depth': 7, 'learning_rate': 0.012525991125616226, 'subsample': 0.8196358516487999, 'colsample_bytree': 0.5147240850455552, 'min_child_weight': 44, 'gamma': 2.981639862905725, 'reg_alpha': 0.5212122480997886, 'reg_lambda': 0.1289786873295609}


In [35]:
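# Refit with the best hyperparameters found by Optuna.
# random_state is not set here, so the score can differ slightly from the best trial.
xgb_final = XGBRegressor(**study_xgboost.best_params)
xgb_final.fit(X_train, y_train)

y_pred_test = xgb_final.predict(X_test)
print("Root Mean squared error: ", np.sqrt(mean_squared_error(y_test, y_pred_test)))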

Root Mean squared error:  67752.77934332646
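
This RMSE is measured on `X_test`, the same split that `objective_xgboost` scored every trial on, so it probably understates the error on unseen data. A minimal sketch of one common alternative, assuming a three-way split in which tuning uses a validation part and the final score uses a holdout that the search never sees (the toy arrays and variable names are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 100 rows, 2 features
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# First split: set aside 20% as a final holdout
X_dev, X_holdout, y_dev, y_holdout = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Second split: 25% of the remaining 80% becomes the validation set used for tuning
X_fit, X_valid, y_fit, y_valid = train_test_split(
    X_dev, y_dev, test_size=0.25, random_state=42
)

print(len(X_fit), len(X_valid), len(X_holdout))
```

In such a setup the Optuna objective would train on `X_fit` and score on `X_valid`, and the holdout would be used once, after the search has finished.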


In [36]:
y_pred_test = xgb_final.predict(test_df)

In [37]:
sample_submission["price"] =  y_pred_test
sample_submission.to_csv('submission_xgboost.csv',index=False)
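
Before writing the file, it can help to confirm that the predictions line up with the submission template. A minimal sketch, assuming the template has an `id` column next to `price` (the notebook only shows `price` being assigned); `build_submission` and the toy values are hypothetical:

```python
import numpy as np
import pandas as pd

def build_submission(sample, predictions):
    """Return a copy of the template with 'price' replaced by the predictions."""
    predictions = np.asarray(predictions, dtype=float)
    if len(predictions) != len(sample):
        raise ValueError("prediction count does not match the submission template")
    if not np.isfinite(predictions).all():
        raise ValueError("predictions contain NaN or infinite values")
    submission = sample.copy()
    submission["price"] = predictions
    return submission

# Illustrative template and predictions
toy_sample = pd.DataFrame({"id": [188533, 188534, 188535], "price": [0.0, 0.0, 0.0]})
toy_submission = build_submission(toy_sample, [41000.5, 12999.0, 8750.25])

print(toy_submission)
```

The checks fail fast on a row-count mismatch or on non-finite predictions, two problems that would otherwise only surface when the file is uploaded.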

In [38]:
joblib.dump(xgb_final, 'xgb_final_model.pkl')

['xgb_final_model.pkl']
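
A saved model is only useful if it can be restored and produces the same predictions. A minimal sketch of the round trip with `joblib.load`, using a small scikit-learn `DecisionTreeRegressor` as a stand-in for `xgb_final` so the example stays lightweight; the toy rows reuse `model_year`, `milage` and `price` values from the `train_df.head(5)` output, and the file name is illustrative:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Columns: model_year, milage (values borrowed from the first training rows)
X_toy = np.array([[2007, 213000], [2002, 143250], [2017, 19500], [2021, 7388]])
y_toy = np.array([4200, 4999, 45000, 97500])

# Stand-in model; the notebook saves an XGBRegressor instead
model = DecisionTreeRegressor(random_state=42).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(model, path)        # same call the notebook uses
    restored = joblib.load(path)    # restore the fitted estimator

same_predictions = np.array_equal(model.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```

Pickled models are generally tied to the library versions that produced them, so it is worth recording the `xgboost` and `scikit-learn` versions alongside the `.pkl` file.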