# Modeling with Machine Learning

## 1. Objective of this stage
Develop and evaluate supervised Machine Learning models to predict the price: validate each model with error metrics, compare several algorithms, and select the one with the best performance.

## 2. Importing libraries and loading the processed dataset

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, root_mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression

In [2]:
df = pd.read_csv('../data/processed/listings_processed.csv')

df.shape, df.head()

((23491, 56),
     price  accommodates  bedrooms  bathrooms  number_of_reviews  \
 0  3983.0             2         1          1                 26   
 1  1593.0             1         1          1                 20   
 2  2987.0             2         1          1                  1   
 3  2987.0             2         1          1                  0   
 4  2987.0             2         1          1                 66   
 
    reviews_per_month  room_type_Hotel room  room_type_Private room  \
 0               0.27                 False                   False   
 1               0.16                 False                    True   
 2               0.06                 False                    True   
 3               0.00                 False                    True   
 4               1.89                 False                    True   
 
    room_type_Shared room  neighbourhood_cleansed_Almagro  ...  \
 0                  False                           False  ...   
 1              

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23491 entries, 0 to 23490
Data columns (total 56 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   price                                     23491 non-null  float64
 1   accommodates                              23491 non-null  int64  
 2   bedrooms                                  23491 non-null  int64  
 3   bathrooms                                 23491 non-null  int64  
 4   number_of_reviews                         23491 non-null  int64  
 5   reviews_per_month                         23491 non-null  float64
 6   room_type_Hotel room                      23491 non-null  bool   
 7   room_type_Private room                    23491 non-null  bool   
 8   room_type_Shared room                     23491 non-null  bool   
 9   neighbourhood_cleansed_Almagro            23491 non-null  bool   
 10  neighbourhood_cleansed_Balvanera  

## 3. Defining the modeling problem

In [4]:
y = df['price']  # Target variable
X = df.drop(columns=['price'])  # Features

In [5]:
X.shape, y.shape

((23491, 55), (23491,))

## 4. Splitting the dataset into training and test sets

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [7]:
X_train.shape, y_train.shape

((18792, 55), (18792,))
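
The training shape above follows from simple arithmetic on the 23,491 rows. The sketch below reproduces it, assuming that a fractional `test_size` is rounded up on the test side and the remainder goes to training (which is how scikit-learn is understood to behave here). The helper `split_sizes` is hypothetical and only illustrates the calculation.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size (illustrative helper)."""
    # Assumption: the test side is rounded up, the rest is used for training.
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(23491, 0.2)
print(n_train, n_test)  # 18792 4699
```

The 18,792 training rows match the shape printed by the notebook, which leaves 4,699 rows for testing.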

## 5. Baseline model: Linear Regression

In [8]:
linreg_model = LinearRegression()
linreg_model.fit(X_train, y_train)

In [9]:
y_pred = linreg_model.predict(X_test)

## 6. Model evaluation

In [10]:
mae = mean_absolute_error(y_test, y_pred)
rmse = root_mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'MAE: {mae:.2f}')
print(f'RMSE: {rmse:.2f}')
print(f'R²: {r2:.4f}')

MAE: 1329.91
RMSE: 2295.89
R²: 0.2807


The linear regression model explains roughly 28% of the variance in price (R² = 0.2807), with a mean absolute error of about 1,330 price units.
These results serve as a baseline against which more complex models are evaluated.
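
To make the three metrics concrete, here is a minimal pure-Python sketch of how MAE, RMSE and R² are defined. The function names (`mae_manual`, `rmse_manual`, `r2_manual`) and the toy values are illustrative assumptions, not part of the notebook; the notebook itself uses the scikit-learn implementations.

```python
import math


def mae_manual(y_true, y_pred):
    """Mean absolute error: average size of the errors, in price units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse_manual(y_true, y_pred):
    """Root mean squared error: squaring penalizes large errors more heavily."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def r2_manual(y_true, y_pred):
    """R²: 1 minus the ratio of residual variation to total variation."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values chosen only for illustration.
y_true_toy = [100.0, 200.0, 300.0, 400.0]
y_pred_toy = [110.0, 190.0, 330.0, 360.0]

print(f"MAE: {mae_manual(y_true_toy, y_pred_toy):.2f}")    # MAE: 22.50
print(f"RMSE: {rmse_manual(y_true_toy, y_pred_toy):.2f}")  # RMSE: 25.98
print(f"R²: {r2_manual(y_true_toy, y_pred_toy):.4f}")      # R²: 0.9460
```

Because RMSE squares each error before averaging, it is never smaller than MAE; a wide gap between the two, as in the results above, suggests that a few large errors account for much of the total.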


## 7. Second model: Decision Tree

In [11]:
from sklearn.tree import DecisionTreeRegressor

### 7.1 Training

In [12]:
dt = DecisionTreeRegressor(random_state=42,
                           max_depth=10,  # Caps the tree depth to limit overfitting
                           min_samples_leaf=20  # Minimum samples per leaf, so the tree cannot fit tiny groups of training rows
                           )
dt.fit(X_train, y_train)
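
To illustrate how `min_samples_leaf` constrains the tree, the sketch below searches for the best split on a single feature by squared error and discards any split that would leave fewer than `min_samples_leaf` rows on either side. It is a simplified, assumed model of the idea, not scikit-learn's actual implementation; `best_split` and the toy data are hypothetical.

```python
def _sse(values):
    """Sum of squared deviations from the mean (error of predicting the mean)."""
    mean_value = sum(values) / len(values)
    return sum((v - mean_value) ** 2 for v in values)


def best_split(xs, ys, min_samples_leaf):
    """Return (threshold, sse) of the best split on one feature, or None."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best = None
    # Only positions that leave at least min_samples_leaf rows on each side.
    for i in range(min_samples_leaf, n - min_samples_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        sse = _sse(left) + _sse(right)
        if best is None or sse < best[1]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (threshold, sse)
    return best


# Toy data with one outlier target at x = 8.
xs_toy = [1, 2, 3, 4, 5, 6, 7, 8]
ys_toy = [10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 50.0]

print(best_split(xs_toy, ys_toy, min_samples_leaf=1))  # (7.5, 0.0)
print(best_split(xs_toy, ys_toy, min_samples_leaf=2))  # (6.5, 800.0)
print(best_split(xs_toy, ys_toy, min_samples_leaf=5))  # None
```

With `min_samples_leaf=1` the split isolates the single outlier and reaches zero training error; requiring two rows per leaf forces the outlier to share a leaf, which is the regularizing effect the parameter is used for.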

### 7.2 Prediction

In [13]:
y_pred_dt = dt.predict(X_test)

### 7.3 Evaluation

In [14]:
mae_dt = mean_absolute_error(y_test, y_pred_dt)
rmse_dt = root_mean_squared_error(y_test, y_pred_dt)
r2_dt = r2_score(y_test, y_pred_dt)

print(f'MAE Decision Tree: {mae_dt:.2f}')
print(f'RMSE Decision Tree: {rmse_dt:.2f}')
print(f'R² Decision Tree: {r2_dt:.4f}')

MAE Decision Tree: 1216.51
RMSE Decision Tree: 2153.05
R² Decision Tree: 0.3674


### 7.4 Comparison with the baseline model

In [15]:
results = pd.DataFrame({
    'Model': ['Linear Regression', 'Decision Tree'],
    'MAE': [mae, mae_dt],
    'RMSE': [rmse, rmse_dt],
    'R²': [r2, r2_dt]
})

results

               Model          MAE         RMSE        R²
0  Linear Regression  1329.910195  2295.892225  0.280731
1      Decision Tree  1216.511849  2153.048364  0.367448
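
The size of the improvement can be read off the table with a little arithmetic. The sketch below uses the reported test metrics (rounded to the printed precision); the helper `relative_reduction` is a hypothetical name introduced for illustration.

```python
# Test-set metrics as reported in the comparison table (rounded).
baseline = {"MAE": 1329.91, "RMSE": 2295.89, "R2": 0.2807}
tree = {"MAE": 1216.51, "RMSE": 2153.05, "R2": 0.3674}


def relative_reduction(before: float, after: float) -> float:
    """Percentage by which an error metric went down."""
    return (before - after) / before * 100


mae_gain = relative_reduction(baseline["MAE"], tree["MAE"])
rmse_gain = relative_reduction(baseline["RMSE"], tree["RMSE"])
r2_gain = tree["R2"] - baseline["R2"]

print(f"MAE reduction: {mae_gain:.1f}%")    # MAE reduction: 8.5%
print(f"RMSE reduction: {rmse_gain:.1f}%")  # RMSE reduction: 6.2%
print(f"R² gain: {r2_gain:+.4f}")           # R² gain: +0.0867
```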


The model's performance on the training and test sets is compared to check for signs of overfitting.


In [16]:
y_train_pred_dt = dt.predict(X_train)

mae_train = mean_absolute_error(y_train, y_train_pred_dt)
# root_mean_squared_error already returns the RMSE, so no extra np.sqrt is applied.
rmse_train = root_mean_squared_error(y_train, y_train_pred_dt)
r2_train = r2_score(y_train, y_train_pred_dt)

print(f'MAE Train Decision Tree: {mae_train:.2f}')
print(f'RMSE Train Decision Tree: {rmse_train:.2f}')
print(f'R² Train Decision Tree: {r2_train:.4f}')

MAE Train Decision Tree: 1188.76
RMSE Train Decision Tree: ≈2086 (the original run printed 45.67, the square root of the RMSE, because np.sqrt was applied on top of root_mean_squared_error)
R² Train Decision Tree: 0.4420
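
One way to quantify the train/test gap is sketched below, using only the MAE and R² values printed above together with the test metrics from section 7.3. The variable names are illustrative, and no particular gap is assumed to be a formal threshold for overfitting.

```python
# Decision Tree metrics as printed in the notebook.
train_scores = {"MAE": 1188.76, "R2": 0.4420}
test_scores = {"MAE": 1216.51, "R2": 0.3674}

# How much worse the error is on unseen data, relative to training.
mae_gap_pct = (test_scores["MAE"] - train_scores["MAE"]) / train_scores["MAE"] * 100
# How much explained variance is lost on unseen data.
r2_gap = train_scores["R2"] - test_scores["R2"]

print(f"MAE is {mae_gap_pct:.1f}% higher on test")  # MAE is 2.3% higher on test
print(f"R² drops by {r2_gap:.4f} on test")          # R² drops by 0.0746 on test
```

The tree fits the training data somewhat better than the test data, but the low training R² (0.44) shows it is far from memorizing the training set.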


The Decision Tree improves on the baseline model (test R² of 0.3674 versus 0.2807), capturing non-linear relationships between the variables.
However, its performance depends heavily on the depth of the tree, which motivates the use of ensemble models.


## 8. Third model: Random Forest


In [17]:
from sklearn.ensemble import RandomForestRegressor

### 8.1 Training

In [19]:
rf = RandomForestRegressor(random_state=42,
                           n_estimators=200,  # Number of trees in the forest
                           max_depth=12,
                           min_samples_leaf=20,
                           n_jobs=-1  # Uses all available cores to speed up training
                           )

rf.fit(X_train, y_train)
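
The core idea behind the forest is bagging: each tree is trained on a bootstrap sample of the rows, and the final prediction is the average across trees. The sketch below shows that idea with a deliberately tiny base learner (a one-split "stump" at the median). All names and the toy data are hypothetical, and real random forests can additionally sample features at each split, which is left out here.

```python
import random
from statistics import mean, median


def fit_stump(xs, ys):
    """Fit a one-split regressor with the threshold at the median of xs."""
    threshold = median(xs)
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    overall = mean(ys)
    # Fall back to the overall mean if one side of the split is empty.
    left_value = mean(left) if left else overall
    right_value = mean(right) if right else overall
    return threshold, left_value, right_value


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_bagged(xs, ys, n_estimators, seed):
    """Train n_estimators stumps, each on its own bootstrap sample."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_estimators):
        # Bootstrap: draw n row indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps


def predict_bagged(stumps, x):
    """Ensemble prediction: the average of the individual predictions."""
    return mean(predict_stump(s, x) for s in stumps)


# Toy data: the target grows linearly with the feature.
xs_toy = list(range(1, 21))
ys_toy = [2.0 * x for x in xs_toy]

stumps = fit_bagged(xs_toy, ys_toy, n_estimators=50, seed=42)
print(predict_bagged(stumps, 1), predict_bagged(stumps, 20))
```

Each stump places its threshold slightly differently because it sees a different resample, so the averaged prediction varies less from one training sample to another than any single stump would.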

### 8.2 Prediction