## **Phase 6 - Machine Learning: Building and Evaluating the Predictive Model**
*In this phase we build a predictive model that estimates the **price per night** of each property from its characteristics. This step turns the processed data into something actionable: a way to price listings the model has not seen. We compare several machine learning algorithms and look for one that is accurate, interpretable, and generalizes well.*

Throughout this phase we train a set of algorithms, evaluate their performance, and select the one that best fits our needs. We then tune the hyperparameters of the selected model to improve its predictive performance and make it more robust.
Beyond predicting prices accurately, the goal is to understand which factors drive property prices, which can inform business decisions and future research.

In [None]:
# General
import pandas as pd
import numpy as np

# Scalers
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

# Train/test split
from sklearn.model_selection import train_test_split

# Models
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR

# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV

# Metrics (root_mean_squared_error requires scikit-learn >= 1.4)
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.metrics import root_mean_squared_error
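
`root_mean_squared_error` was added to `sklearn.metrics` in scikit-learn 1.4, so the last import above fails on older installations. A minimal sketch of a fallback, assuming NumPy is available; the fallback function reuses the scikit-learn name for convenience and is not part of the original notebook:

```python
import numpy as np

try:
    # Available in scikit-learn >= 1.4
    from sklearn.metrics import root_mean_squared_error
except ImportError:
    def root_mean_squared_error(y_true, y_pred):
        """Fallback RMSE: square root of the mean squared error."""
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# The errors are 0, 0 and 2, so RMSE = sqrt(4 / 3)
rmse_demo = root_mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(rmse_demo)
```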

In [2]:
df_encoded = pd.read_csv('data/df_processed_ML.csv')

In [3]:
df_encoded

| | prices_per_night | ratings | cleaning_fee | dormitorios | camas | baños | maximum_guests | check_in_hour | check_out_hour | total_hours_checkin | ... | dormitorio y lavandería | entretenimiento | exterior | internet y oficina | para familias | privacidad y seguridad | seguridad en el hogar | servicios | habitacion | alojamiento entero |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 115.0 | 0.00 | 0.0 | 1.0 | 1.0 | 1.0 | 2.0 | 900.0 | 720.0 | 9.0 | ... | 7.0 | 1.0 | 2.0 | 1.0 | 0.0 | 0.0 | 1.0 | 4.0 | 0.0 | 0.0 |
| 1 | 46.0 | 0.00 | 15.0 | 1.0 | 1.0 | 0.5 | 1.0 | 1020.0 | 660.0 | 7.0 | ... | 6.0 | 0.0 | 0.0 | 2.0 | 0.0 | 2.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 2 | 47.0 | 4.66 | 0.0 | 1.0 | 1.0 | 0.5 | 1.0 | 900.0 | 720.0 | 9.0 | ... | 8.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 5.0 | 1.0 | 0.0 |
| 3 | 100.0 | 4.89 | 35.0 | 1.0 | 1.0 | 1.0 | 1.0 | 960.0 | 720.0 | 8.0 | ... | 10.0 | 10.0 | 3.0 | 2.0 | 2.0 | 0.0 | 5.0 | 4.0 | 0.0 | 0.0 |
| 4 | 33.0 | 4.40 | 0.0 | 1.0 | 1.0 | 0.5 | 1.0 | 900.0 | 660.0 | 9.0 | ... | 4.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 3.0 | 1.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2492 | 55.0 | 4.74 | 10.0 | 1.0 | 1.0 | 0.5 | 3.0 | 900.0 | 660.0 | 9.0 | ... | 7.0 | 1.0 | 0.0 | 2.0 | 0.0 | 3.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 2493 | 60.0 | 4.78 | 0.0 | 1.0 | 1.0 | 0.5 | 2.0 | 900.0 | 600.0 | 9.0 | ... | 4.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 2494 | 104.0 | 4.96 | 0.0 | 2.0 | 3.0 | 2.0 | 4.0 | 900.0 | 720.0 | 9.0 | ... | 9.0 | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 2.0 | 3.0 | 0.0 | 0.0 |
| 2495 | 120.0 | 4.83 | 50.0 | 1.0 | 1.0 | 1.0 | 2.0 | 900.0 | 660.0 | 9.0 | ... | 8.0 | 3.0 | 3.0 | 2.0 | 0.0 | 0.0 | 2.0 | 1.0 | 0.0 | 0.0 |


**Train Test Split**

In [4]:
X = df_encoded.drop("prices_per_night", axis = 1)
y = df_encoded["prices_per_night"]
print(X.shape, y.shape)

(2497, 25) (2497,)


In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

print(f"X_train: {X_train.shape}, y_train: {y_train.shape}")
print(f"X_test: {X_test.shape}, y_test: {y_test.shape}")

X_train: (1997, 25), y_train: (1997,)
X_test: (500, 25), y_test: (500,)
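
The shapes above follow from how `train_test_split` rounds a fractional `test_size`: scikit-learn takes the ceiling of `test_size * n_samples` for the test set and, when `train_size` is not given, assigns the remaining rows to the training set. A quick check of that arithmetic using only the standard library (the variable names are illustrative):

```python
import math

n_samples = 2497   # rows in df_encoded
test_size = 0.2    # fraction passed to train_test_split

n_test = math.ceil(test_size * n_samples)   # ceiling of roughly 499.4
n_train = n_samples - n_test                # remaining rows

print(n_train, n_test)
```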


**Scalers**

In [6]:
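# Fit the feature scaler on the training set only, then apply it to both sets
x_scaler = MinMaxScaler()
X_train = x_scaler.fit_transform(X_train)
X_test = x_scaler.transform(X_test)

# Scale the target the same way; MinMaxScaler expects a 2D array
y_scaler = MinMaxScaler()
y_train = y_scaler.fit_transform(np.array(y_train).reshape(-1, 1))
y_test = y_scaler.transform(np.array(y_test).reshape(-1, 1))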
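
`MinMaxScaler` maps each column to `(x - min) / (max - min)`, with `min` and `max` learned from the data passed to `fit`. Because they come from the training set only, test values outside the training range land outside [0, 1]. A small NumPy sketch of the same arithmetic on made-up prices (the numbers and helper names are illustrative, not taken from the dataset):

```python
import numpy as np

train_prices = np.array([30.0, 50.0, 80.0, 130.0])
test_prices = np.array([40.0, 150.0])

# "fit": learn the range from the training data only
p_min, p_max = train_prices.min(), train_prices.max()

def scale(x):
    """Min-max scaling with the training range."""
    return (x - p_min) / (p_max - p_min)

def unscale(z):
    """Inverse of scale(): back to the original units."""
    return z * (p_max - p_min) + p_min

test_scaled = scale(test_prices)
print(test_scaled)            # 150 exceeds the training max, so its scaled value is above 1
print(unscale(test_scaled))   # recovers the original prices
```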

**Model Selection**
- We evaluate each model with standard regression metrics: **Mean Absolute Error (MAE)**, **Mean Squared Error (MSE)**, **Root Mean Squared Error (RMSE)**, and **R²**, and select the one with the best predictive performance.
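
For reference, with $y_i$ the observed price, $\hat{y}_i$ the predicted price, $\bar{y}$ the mean observed price, and $n$ the number of test listings, the metrics computed in this section have the standard definitions:

```latex
\begin{aligned}
\mathrm{MAE}  &= \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert \\
\mathrm{MSE}  &= \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \\
\mathrm{RMSE} &= \sqrt{\mathrm{MSE}} \\
R^2 &= 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
\end{aligned}
```

MAE and RMSE are expressed in the units of the target (price per night), MSE in squared units, and R² is unitless.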

In [7]:
# Define the candidate models
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Support Vector Regressor": SVR(),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "XGBoost": XGBRegressor(random_state=42),
    "LightGBM": LGBMRegressor(random_state=42),
    "MLP Regressor": MLPRegressor(random_state=42)
}

In [8]:
# List that collects one row of metrics per model
resultados_lista = []

# The test target is the same for every model, so undo its scaling once
y_test_inv = y_scaler.inverse_transform(y_test.reshape(-1, 1)).ravel()

# Train each model and compute its metrics on the test set
for model_name, model in models.items():
    model.fit(X_train, y_train.ravel())

    y_hat = model.predict(X_test)

    # Map the predictions back to the original price scale
    y_hat_inv = y_scaler.inverse_transform(y_hat.reshape(-1, 1)).ravel()

    # Compute the metrics
    mae = mean_absolute_error(y_test_inv, y_hat_inv)
    mse = mean_squared_error(y_test_inv, y_hat_inv)
    rmse = root_mean_squared_error(y_test_inv, y_hat_inv)
    r2 = r2_score(y_test_inv, y_hat_inv)

    # Store the results
    resultados_lista.append({
        "model_name": model_name,
        "mae": mae,
        "mse": mse,
        "rmse": rmse,
        "r2_score": r2
    })

# Build a DataFrame and sort by r2_score, best first
resultados = pd.DataFrame(resultados_lista)
resultados = resultados.sort_values(by="r2_score", ascending=False)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000984 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 582
[LightGBM] [Info] Number of data points in the train set: 1997, number of used features: 24
[LightGBM] [Info] Start training from score 0.143845


In [9]:
resultados

| | model_name | mae | mse | rmse | r2_score |
|---|---|---|---|---|---|
| 5 | LightGBM | 9.72744 | 244.834744 | 15.647196 | 0.860073 |
| 1 | Random Forest | 8.654486 | 265.400936 | 16.291131 | 0.848319 |
| 4 | XGBoost | 8.297636 | 292.500981 | 17.10266 | 0.832831 |
| 3 | Gradient Boosting | 13.716214 | 376.139506 | 19.394316 | 0.785031 |
| 0 | Linear Regression | 17.901984 | 615.842171 | 24.816168 | 0.648037 |
| 6 | MLP Regressor | 19.023449 | 636.52404 | 25.229428 | 0.636217 |
| 2 | Support Vector Regressor | 20.278318 | 652.793475 | 25.549823 | 0.626919 |
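
The ranking depends on the metric. A short standard-library check on the values in the table above (copied by hand, rounded as displayed) shows that LightGBM leads on RMSE and R² while XGBoost has the lowest MAE, and confirms that each RMSE is the square root of the corresponding MSE:

```python
import math

# (model, mae, mse, rmse, r2) copied from the results table
rows = [
    ("LightGBM", 9.72744, 244.834744, 15.647196, 0.860073),
    ("Random Forest", 8.654486, 265.400936, 16.291131, 0.848319),
    ("XGBoost", 8.297636, 292.500981, 17.10266, 0.832831),
    ("Gradient Boosting", 13.716214, 376.139506, 19.394316, 0.785031),
    ("Linear Regression", 17.901984, 615.842171, 24.816168, 0.648037),
    ("MLP Regressor", 19.023449, 636.52404, 25.229428, 0.636217),
    ("Support Vector Regressor", 20.278318, 652.793475, 25.549823, 0.626919),
]

# Lower is better for MAE and RMSE, higher is better for R²
best_by_mae = min(rows, key=lambda r: r[1])[0]
best_by_rmse = min(rows, key=lambda r: r[3])[0]
best_by_r2 = max(rows, key=lambda r: r[4])[0]
print(best_by_mae, best_by_rmse, best_by_r2)

# RMSE should equal sqrt(MSE) up to display rounding
rmse_consistent = all(
    math.isclose(math.sqrt(mse), rmse, abs_tol=1e-4)
    for _, _, mse, rmse, _ in rows
)
print(rmse_consistent)
```

Since the grid search later in this section is scored with `neg_mean_absolute_error`, MAE is the natural metric for comparing a tuned model against these baselines.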


**Hyperparameter Tuning**
- To improve on the default configuration, we tune the key hyperparameters of LightGBM, the model with the highest R² in the comparison above, using **Grid Search**, which fits the model on every combination of values in the grid.

In [None]:
lgb = LGBMRegressor(random_state=42)  # the objective could also be set here
param_grid = {
    'n_estimators': [100, 500, 700, 1000],     # more trees can help, at higher training cost
    'num_leaves': [15, 31, 63, 127],           # more flexibility (risk of overfitting)
    'min_child_samples': [10, 20, 50, 100],    # guards against overfitting
    'max_depth': [3, 5, 7, 10, 12],            # limits tree complexity
    'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.25],
    'min_split_gain': [0, 0.01, 0.1],
    'subsample': [0.7, 0.8, 1],                # alias of bagging_fraction in LightGBM
    'reg_lambda': [0, 0.1, 1, 10],             # L2 regularization, stabilizes the model
    'reg_alpha': [0, 0.1, 1, 10],              # L1 regularization, reduces complexity
    'feature_fraction': [0.6, 0.8, 1.0],       # column subsampling, reduces overfitting risk
    'bagging_fraction': [0.6, 0.8, 1.0],
    'bagging_freq': [0, 1, 5],
    'boosting_type': ['gbdt', 'dart', 'goss'], # 'goss' may conflict with bagging in some LightGBM versions
    'objective': ['regression', 'huber']
}
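
Grid search fits one model per combination of values and per cross-validation fold, so the cost is the product of the list lengths times the number of folds. A standard-library sketch that counts the fits for a grid with the same list lengths as `param_grid` above, assuming 3-fold cross-validation (`grid_sizes` and `cv_folds` are illustrative names):

```python
import math

# Number of candidate values per hyperparameter in param_grid
grid_sizes = {
    "n_estimators": 4, "num_leaves": 4, "min_child_samples": 4,
    "max_depth": 5, "learning_rate": 5, "min_split_gain": 3,
    "subsample": 3, "reg_lambda": 4, "reg_alpha": 4,
    "feature_fraction": 3, "bagging_fraction": 3, "bagging_freq": 3,
    "boosting_type": 3, "objective": 2,
}
cv_folds = 3

n_candidates = math.prod(grid_sizes.values())   # combinations in the grid
n_fits = n_candidates * cv_folds                # one fit per combination and fold
print(f"{n_candidates:,} candidates, {n_fits:,} fits")
```

At this size an exhaustive search is unlikely to finish in practical time. Trimming the grid to a few hyperparameters, or sampling it with scikit-learn's `RandomizedSearchCV`, are common ways to keep the search tractable.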

In [None]:
grid_search = GridSearchCV(estimator=lgb, param_grid=param_grid, 
                           scoring='neg_mean_absolute_error', cv=3, 
                           verbose=1, n_jobs=-1)

In [None]:
%%time

grid_search.fit(X_train, y_train.ravel())

# Show the best hyperparameters found
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

# Evaluate the best model on the test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Undo the target scaling so the metrics are in price units,
# comparable with the results table above
y_test_inv = y_scaler.inverse_transform(y_test.reshape(-1, 1)).ravel()
y_pred_inv = y_scaler.inverse_transform(y_pred.reshape(-1, 1)).ravel()

# Compute and print the metrics
mae = mean_absolute_error(y_test_inv, y_pred_inv)
mse = mean_squared_error(y_test_inv, y_pred_inv)
rmse = root_mean_squared_error(y_test_inv, y_pred_inv)
r2 = r2_score(y_test_inv, y_pred_inv)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R²:", r2)
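
Metrics computed on a min-max-scaled target are not in price units: MAE and RMSE shrink by the factor `max - min` of the training prices, while R² is unchanged by the rescaling. A NumPy sketch with made-up numbers (the prices, the training range, and the helper functions are illustrative assumptions):

```python
import numpy as np

y_true = np.array([50.0, 80.0, 120.0, 200.0])
y_pred = np.array([60.0, 70.0, 130.0, 180.0])
p_min, p_max = 30.0, 230.0   # hypothetical training-set price range

def mae(a, b):
    """Mean absolute error."""
    return float(np.mean(np.abs(a - b)))

def r2(a, b):
    """Coefficient of determination."""
    ss_res = np.sum((a - b) ** 2)
    ss_tot = np.sum((a - a.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def scale(x):
    """Min-max scaling with the assumed training range."""
    return (x - p_min) / (p_max - p_min)

mae_price = mae(y_true, y_pred)                  # in price units
mae_scaled = mae(scale(y_true), scale(y_pred))   # in scaled units
r2_price = r2(y_true, y_pred)
r2_scaled = r2(scale(y_true), scale(y_pred))
print(mae_price, mae_scaled)
print(r2_price, r2_scaled)
```

Applying `inverse_transform` to both the test target and the predictions before computing the metrics keeps them on the same scale as the baseline comparison.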