# Modeling with Machine Learning

## 1. Stage objective
Develop and evaluate supervised Machine Learning models to predict the listing price. Each model is validated with error metrics (MAE, RMSE, R²), several algorithms are compared, and the one with the best performance is selected.

## 2. Importing libraries and loading the processed dataset

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
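from sklearn.metrics import mean_absolute_error, root_mean_squared_error, r2_score

import sys
from pathlib import Path

# Add the project root (the parent of the current working directory) to sys.path
# so that the `src` package can be imported in later cells.
PROJECT_ROOT = Path.cwd().parents[0]
sys.path.append(str(PROJECT_ROOT))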


In [2]:
df = pd.read_csv('../data/processed/listings_processed.csv')

df.shape

(23491, 94)

In [3]:
df.head()

Unnamed: 0,price,accommodates,bedrooms,bathrooms,latitude,longitude,number_of_reviews,reviews_per_month,review_scores_rating,minimum_nights,...,neighbourhood_cleansed_Villa Gral. Mitre,neighbourhood_cleansed_Villa Lugano,neighbourhood_cleansed_Villa Luro,neighbourhood_cleansed_Villa Ortuzar,neighbourhood_cleansed_Villa Pueyrredon,neighbourhood_cleansed_Villa Real,neighbourhood_cleansed_Villa Riachuelo,neighbourhood_cleansed_Villa Santa Rita,neighbourhood_cleansed_Villa Soldati,neighbourhood_cleansed_Villa Urquiza
0,3983.0,2,1,1,-34.58184,-58.42415,26,0.27,95.0,2,...,False,False,False,False,False,False,False,False,False,False
1,1593.0,1,1,1,-34.59761,-58.39468,20,0.16,95.0,1,...,False,False,False,False,False,False,False,False,False,False
2,2987.0,2,1,1,-34.59382,-58.42994,1,0.06,100.0,1,...,False,False,False,False,False,False,False,False,False,False
3,2987.0,2,1,1,-34.59398,-58.42853,0,0.0,0.0,1,...,False,False,False,False,False,False,False,False,False,False
4,2987.0,2,1,1,-34.59348,-58.42949,66,1.89,99.0,1,...,False,False,False,False,False,False,False,False,False,False


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23491 entries, 0 to 23490
Data columns (total 94 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   price                                     23491 non-null  float64
 1   accommodates                              23491 non-null  int64  
 2   bedrooms                                  23491 non-null  int64  
 3   bathrooms                                 23491 non-null  int64  
 4   latitude                                  23491 non-null  float64
 5   longitude                                 23491 non-null  float64
 6   number_of_reviews                         23491 non-null  int64  
 7   reviews_per_month                         23491 non-null  float64
 8   review_scores_rating                      23491 non-null  float64
 9   minimum_nights                            23491 non-null  int64  
 10  availability_365                  

## 3. Defining the modeling problem

In [5]:
y = df['price']  # Target variable
X = df.drop(columns=['price'])  # Features

In [6]:
X.shape, y.shape

((23491, 93), (23491,))

## 4. Splitting the dataset into training and test sets

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [8]:
X_train.shape, y_train.shape

((18792, 93), (18792,))

## 5. Baseline model: Linear Regression

In [9]:
from src.modeling.linear_regression import train_linear_regression

linreg_model = train_linear_regression(X_train, y_train)

In [10]:
y_pred = linreg_model.predict(X_test)

## 6. Model evaluation

In [11]:
mae = mean_absolute_error(y_test, y_pred)
rmse = root_mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MAE: {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R²: {r2:.4f}")

MAE: 1313.54
RMSE: 2261.10
R²: 0.3024


The linear regression model explains roughly 30% of the variance in price (R² = 0.3024), with a mean absolute error of about 1,300 price units.
These results serve as the baseline against which the more complex models are compared.
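
To make the three metrics concrete, the sketch below computes them from their definitions with plain NumPy and compares a model against a baseline that always predicts the mean. The prices are made up for illustration and are not taken from the dataset; `regression_metrics` is a hypothetical helper, not part of the project code.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R² computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))                    # average absolute error
    rmse = np.sqrt(np.mean(errors ** 2))             # penalizes large errors more
    ss_res = np.sum(errors ** 2)                     # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                       # share of variance explained
    return {"MAE": mae, "RMSE": rmse, "R2": r2}

# Illustrative values only (not from the listings dataset)
y_true_demo = np.array([1000.0, 2000.0, 3000.0, 4000.0])
y_pred_demo = np.array([1500.0, 1500.0, 3500.0, 3500.0])

model_metrics = regression_metrics(y_true_demo, y_pred_demo)
baseline_metrics = regression_metrics(
    y_true_demo, np.full_like(y_true_demo, y_true_demo.mean())
)

print(model_metrics)     # MAE 500, RMSE 500, R2 0.8
print(baseline_metrics)  # MAE 1000, RMSE about 1118.03, R2 0.0
```

The mean predictor scores R² = 0 on the data whose mean it uses, which is why an R² of about 0.30 can be read as "30% better than predicting the average price" in terms of squared error.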


## 7. Second model: Decision Tree

### 7.1 Training

In [12]:
from src.modeling.decision_tree import train_decision_tree

dt_model = train_decision_tree(X_train, y_train)

### 7.2 Prediction

In [13]:
y_pred_dt = dt_model.predict(X_test)

### 7.3 Evaluation

In [14]:
mae_dt = mean_absolute_error(y_test, y_pred_dt)
rmse_dt = root_mean_squared_error(y_test, y_pred_dt)
r2_dt = r2_score(y_test, y_pred_dt)

print(f"MAE Decision Tree: {mae_dt:.2f}")
print(f"RMSE Decision Tree: {rmse_dt:.2f}")
print(f"R² Decision Tree: {r2_dt:.4f}")

MAE Decision Tree: 1223.32
RMSE Decision Tree: 2143.58
R² Decision Tree: 0.3730


### 7.4 Comparison with the baseline model

In [15]:
results = pd.DataFrame({
    'Model': ['Linear Regression', 'Decision Tree'],
    'MAE': [mae, mae_dt],
    'RMSE': [rmse, rmse_dt],
    'R²': [r2, r2_dt]
})

results

|   | Model | MAE | RMSE | R² |
|---|---|---|---|---|
| 0 | Linear Regression | 1313.541818 | 2261.096622 | 0.302368 |
| 1 | Decision Tree | 1223.321861 | 2143.579745 | 0.373 |


The Decision Tree's performance on the training and test sets is compared to check for signs of overfitting.


In [16]:
y_train_pred_dt = dt_model.predict(X_train)

mae_train = mean_absolute_error(y_train, y_train_pred_dt)
# root_mean_squared_error already returns the RMSE; wrapping it in np.sqrt
# would print its square root instead (44.90 for this run).
rmse_train = root_mean_squared_error(y_train, y_train_pred_dt)
r2_train = r2_score(y_train, y_train_pred_dt)

print(f'MAE Train Decision Tree: {mae_train:.2f}')
print(f'RMSE Train Decision Tree: {rmse_train:.2f}')
print(f'R² Train Decision Tree: {r2_train:.4f}')

MAE Train Decision Tree: 1151.56
RMSE Train Decision Tree: ≈2016
R² Train Decision Tree: 0.4788


The Decision Tree improves on the baseline model by capturing non-linear relationships between the variables. Its training scores (R² = 0.4788, RMSE of about 2016) are better than its test scores (R² = 0.3730, RMSE = 2143.58), a moderate gap that points to some overfitting.
Its performance also depends strongly on the depth of the tree, which motivates the use of ensemble models.
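
The dependence on depth can be illustrated with a small sweep over `max_depth`. This is a sketch on synthetic data (the features, target and depth values are assumptions chosen for illustration, not the project's data or settings); it shows how the gap between training and test R² is read as an overfitting signal.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: non-linear signal plus noise
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(600, 3))
y_demo = 100 * np.sin(X_demo[:, 0]) + 20 * X_demo[:, 1] + rng.normal(0, 30, size=600)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

scores_by_depth = {}
for depth in (2, 5, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(Xd_train, yd_train)
    r2_tr = r2_score(yd_train, tree.predict(Xd_train))
    r2_te = r2_score(yd_test, tree.predict(Xd_test))
    scores_by_depth[depth] = (r2_tr, r2_te)
    print(f"max_depth={depth}: R² train={r2_tr:.3f}, R² test={r2_te:.3f}")
```

An unrestricted tree memorizes the training set (training R² close to 1), while the test score typically stays well below it. Limiting depth or leaf size narrows that gap at the cost of a less flexible model.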


## 8. Third model: Random Forest


In [17]:
from src.modeling.random_forest import train_random_forest

### 8.1 Training

In [18]:
rf_model = train_random_forest(X_train,
                               y_train,
                               {'random_state': 42,
                                'n_estimators': 200,     # Number of trees in the forest
                                'max_depth': 12,         # Maximum depth of each tree
                                'min_samples_leaf': 20,  # Minimum samples per leaf, to limit overfitting
                                'n_jobs': -1             # Use all available cores to speed up training
                                })
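
The source of `src.modeling.random_forest` is not shown in this notebook. As a rough idea of what such a helper might look like, the sketch below assumes it simply forwards the parameter dictionary to scikit-learn's `RandomForestRegressor`; the name `train_random_forest_sketch` and the synthetic data are hypothetical, and the project's real function may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_random_forest_sketch(X_train, y_train, params=None):
    """Fit a RandomForestRegressor using `params` as keyword arguments.

    Hypothetical stand-in for the project's `train_random_forest`.
    """
    model = RandomForestRegressor(**(params or {}))
    model.fit(X_train, y_train)
    return model

# Small synthetic example (illustrative data only)
rng = np.random.default_rng(0)
X_rf_demo = rng.normal(size=(200, 4))
y_rf_demo = 3.0 * X_rf_demo[:, 0] - 2.0 * X_rf_demo[:, 1] ** 2 + rng.normal(0, 0.1, size=200)

rf_demo = train_random_forest_sketch(
    X_rf_demo,
    y_rf_demo,
    {"random_state": 42, "n_estimators": 10, "max_depth": 3, "min_samples_leaf": 5},
)
print(len(rf_demo.estimators_), "trees fitted")
```

Passing the hyperparameters as a dictionary keeps the notebook cell as the single place where the configuration is declared.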

### 8.2 Prediction

In [19]:
y_pred_rf = rf_model.predict(X_test)

### 8.3 Model evaluation

In [20]:
mae_rf = mean_absolute_error(y_test, y_pred_rf)
rmse_rf = root_mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print(f"MAE Random Forest: {mae_rf:.2f}")
print(f"RMSE Random Forest: {rmse_rf:.2f}")
print(f"R² Random Forest: {r2_rf:.4f}")

MAE Random Forest: 1157.91
RMSE Random Forest: 2057.11
R² Random Forest: 0.4226


### 8.4 Conclusions on the models

Random Forest performs best among the models evaluated so far: it lowers the error (MAE 1157.91, RMSE 2057.11) and raises R² to 0.4226. It captures non-linear relationships and, by averaging many trees, reduces the overfitting observed in the Decision Tree.

## 9. Fourth model: XGBoost

In [21]:
from src.modeling.xgboost import train_xgboost

In [22]:
xgb_model = train_xgboost(X_train, y_train)

In [23]:
y_pred_xgb = xgb_model.predict(X_test)

mae_xgb = mean_absolute_error(y_test, y_pred_xgb)
rmse_xgb = root_mean_squared_error(y_test, y_pred_xgb)
r2_xgb = r2_score(y_test, y_pred_xgb)

print(f"MAE XGBoost: {mae_xgb:.2f}")
print(f"RMSE XGBoost: {rmse_xgb:.2f}")
print(f"R² XGBoost: {r2_xgb:.4f}")

MAE XGBoost: 1116.99
RMSE XGBoost: 1994.52
R² XGBoost: 0.4572


### 9.1 Comparison of the models so far


In [24]:
final_results = pd.DataFrame({
    'Model': ["Linear Regression", "Decision Tree", "Random Forest", "XGBoost"],
    'MAE': [mae, mae_dt, mae_rf, mae_xgb],
    'RMSE': [rmse, rmse_dt, rmse_rf, rmse_xgb],
    'R2': [r2, r2_dt, r2_rf, r2_xgb]
})

final_results

|   | Model | MAE | RMSE | R2 |
|---|---|---|---|---|
| 0 | Linear Regression | 1313.541818 | 2261.096622 | 0.302368 |
| 1 | Decision Tree | 1223.321861 | 2143.579745 | 0.373 |
| 2 | Random Forest | 1157.911394 | 2057.114435 | 0.422562 |
| 3 | XGBoost | 1116.988858 | 1994.52394 | 0.457166 |


## 10. Hyperparameter optimization with GridSearchCV

The grid search is applied to XGBoost, the best model so far, to try to improve it further.

### 10.1 Defining the grid

In [25]:
from src.modeling.xgboost_grid import train_model_xgb_grid

param_grid = {
    "n_estimators": [200, 400],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0]
}


### 10.2 Running the grid search

In [26]:
best_xgb, best_params = train_model_xgb_grid(X_train, y_train, param_grid)

Fitting 3 folds for each of 72 candidates, totalling 216 fits
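
The 72 candidates and 216 fits reported above follow from the grid size (2 × 3 × 3 × 2 × 2 combinations) times the 3 cross-validation folds. The sketch below checks that arithmetic with `ParameterGrid` and outlines a generic grid-search helper. The project's `train_model_xgb_grid` is not shown, so `grid_search_sketch`, its `scoring` choice and the `DecisionTreeRegressor` stand-in estimator are assumptions made only to keep the example light and runnable.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.tree import DecisionTreeRegressor

# Same grid as in the notebook, used here only to count combinations
param_grid_xgb = {
    "n_estimators": [200, 400],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
}
n_candidates = len(ParameterGrid(param_grid_xgb))
n_fits = n_candidates * 3  # 3-fold cross-validation
print(n_candidates, n_fits)  # prints: 72 216

def grid_search_sketch(estimator, X, y, param_grid, cv=3):
    """Run an exhaustive grid search and return (best_estimator, best_params).

    Hypothetical helper; the scoring metric is an assumption.
    """
    search = GridSearchCV(
        estimator,
        param_grid,
        cv=cv,
        scoring="neg_mean_absolute_error",  # assumed metric: lower MAE is better
    )
    search.fit(X, y)
    return search.best_estimator_, search.best_params_

# Tiny demonstration with a stand-in estimator and synthetic data
rng = np.random.default_rng(0)
X_gs_demo = rng.uniform(0, 10, size=(60, 2))
y_gs_demo = 5.0 * X_gs_demo[:, 0] + rng.normal(0, 1.0, size=60)

best_demo_model, best_demo_params = grid_search_sketch(
    DecisionTreeRegressor(random_state=0), X_gs_demo, y_gs_demo, {"max_depth": [2, 4]}
)
print(best_demo_params)
```

Because the best parameters are chosen by cross-validation on the training set only, the test set remains untouched until the final evaluation.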


### 10.3 Best parameters

In [27]:
print("Best hyperparameters found:", best_params)

Best hyperparameters found: {'colsample_bytree': 0.8, 'learning_rate': 0.05, 'max_depth': 7, 'n_estimators': 200, 'subsample': 0.8}


## 11. Final model selection

### 11.1 Final evaluation on the test set

In [28]:
y_pred_best_xgb = best_xgb.predict(X_test)

mae_best_xgb = mean_absolute_error(y_test, y_pred_best_xgb)
rmse_best_xgb = root_mean_squared_error(y_test, y_pred_best_xgb)
r2_best_xgb = r2_score(y_test, y_pred_best_xgb)

print(f"MAE Optimized XGBoost: {mae_best_xgb:.2f}")
print(f"RMSE Optimized XGBoost: {rmse_best_xgb:.2f}")
print(f"R² Optimized XGBoost: {r2_best_xgb:.4f}")

MAE Optimized XGBoost: 1113.84
RMSE Optimized XGBoost: 1996.62
R² Optimized XGBoost: 0.4560


### 11.2 Final comparison

In [29]:
final_results = pd.DataFrame({
    'Model': ["Linear Regression", "Decision Tree", "Random Forest", "XGBoost", "Optimized XGBoost"],
    'MAE': [mae, mae_dt, mae_rf, mae_xgb, mae_best_xgb],
    'RMSE': [rmse, rmse_dt, rmse_rf, rmse_xgb, rmse_best_xgb],
    'R2': [r2, r2_dt, r2_rf, r2_xgb, r2_best_xgb]
})

final_results

|   | Model | MAE | RMSE | R2 |
|---|---|---|---|---|
| 0 | Linear Regression | 1313.541818 | 2261.096622 | 0.302368 |
| 1 | Decision Tree | 1223.321861 | 2143.579745 | 0.373 |
| 2 | Random Forest | 1157.911394 | 2057.114435 | 0.422562 |
| 3 | XGBoost | 1116.988858 | 1994.52394 | 0.457166 |
| 4 | Optimized XGBoost | 1113.837583 | 1996.621023 | 0.456024 |


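The XGBoost model tuned with GridSearchCV obtains the lowest MAE (1113.84, against 1116.99 for the first XGBoost model), while its RMSE (1996.62 vs 1994.52) and R² (0.4560 vs 0.4572) are marginally worse. The two XGBoost variants are therefore practically tied, and both clearly outperform Linear Regression, the Decision Tree and Random Forest.
The tuned XGBoost model was selected as the final model of the project.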


### 11.3 Saving the model

In [30]:
import joblib

joblib.dump(best_xgb, "../models/xgb_optimizado_v1.pkl")

['../models/xgb_optimizado_v1.pkl']

In [31]:
loaded_model = joblib.load("../models/xgb_optimizado_v1.pkl")

y_pred_loaded = loaded_model.predict(X_test)

print("Loaded model predictions OK:", y_pred_loaded[:5])


Loaded model predictions OK: [2182.6335 2215.5483 1527.3965 1670.221  3643.7832]
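
Printing the first few predictions shows that the file loads, but not that the reloaded model behaves like the one in memory. A stricter round-trip check compares the two sets of predictions. The sketch below does this with a small stand-in `LinearRegression` and a temporary directory (both are assumptions for illustration); in this notebook the equivalent comparison would be between `best_xgb.predict(X_test)` and `loaded_model.predict(X_test)`.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model trained on synthetic data (illustrative only)
rng = np.random.default_rng(0)
X_small = rng.normal(size=(50, 3))
y_small = X_small @ np.array([1.0, -2.0, 0.5]) + 3.0
model_before = LinearRegression().fit(X_small, y_small)

# Save and reload inside a temporary directory so nothing is left on disk
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model_demo.pkl"
    joblib.dump(model_before, model_path)
    model_after = joblib.load(model_path)

# The reloaded model should reproduce the original predictions
round_trip_ok = np.allclose(model_before.predict(X_small), model_after.predict(X_small))
print("Round trip OK:", round_trip_ok)
```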


In [32]:
# Single-row frame with every training column initialized to 0
nuevo = pd.DataFrame(
    np.zeros((1, len(X_train.columns))),
    columns=X_train.columns
)

In [33]:
import json

json_input = """
{
  "accommodates": 4,
  "bedrooms": 1,
  "bathrooms": 2,
  "number_of_reviews": 100230,
  "reviews_per_month": 10,
  "neighbourhood_cleansed_Boca": 1
}
"""

data = json.loads(json_input)


In [34]:
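# Copy the provided values into the row; keys that are not
# training columns are silently skipped.
for key, value in data.items():
    if key in nuevo.columns:
        nuevo.at[0, key] = value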


In [35]:
precio_estimado = loaded_model.predict(nuevo)

print("Estimated Airbnb price:", precio_estimado[0])

Estimated Airbnb price: 6049.3843


In [36]:
assert list(nuevo.columns) == list(X_train.columns)
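
The loop above ignores any key that is not a training column, so a misspelled dummy-variable name would be dropped without warning, and every feature that is not supplied stays at 0. A stricter alternative is sketched below: it rejects unknown keys and builds the row in training-column order. `build_input_row` and the column names in the demonstration are hypothetical and are not taken from the project code or the dataset.

```python
import pandas as pd

def build_input_row(payload, feature_columns):
    """Build a one-row DataFrame aligned to `feature_columns`.

    Missing features default to 0; unknown keys raise instead of being skipped.
    """
    unknown = sorted(set(payload) - set(feature_columns))
    if unknown:
        raise KeyError(f"Unknown feature(s): {unknown}")
    row = pd.DataFrame([payload]).reindex(columns=list(feature_columns), fill_value=0)
    return row.astype(float)

# Illustrative column names only
demo_columns = ["accommodates", "bedrooms", "neighbourhood_demo_A"]

demo_row = build_input_row({"accommodates": 4, "neighbourhood_demo_A": 1}, demo_columns)
print(demo_row.iloc[0].tolist())  # prints: [4.0, 0.0, 1.0]

try:
    build_input_row({"neighbourhood_demo_typo": 1}, demo_columns)
    unknown_key_rejected = False
except KeyError:
    unknown_key_rejected = True
print("Unknown key rejected:", unknown_key_rejected)
```

In this notebook the call would be `build_input_row(data, X_train.columns)`. Defaulting unspecified features to 0 remains a simplification: for features such as `latitude` and `longitude`, 0 lies far outside the range seen in training, so a representative value (for example the training median) may be a more reasonable default.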
