## Predictive Model - Regression

Using the data in **autocasion_procesado.csv**, we will try different regression models to predict the **precio** (price) column.

### X, y

Define **X** and **y**:
- **X** holds the independent variables, that is, the columns used to predict **y**. Choose the columns you consider most important; the **Feature Selection** from last week's exercise can serve as a guide.
- **y** is the dependent variable, that is, the variable we want to predict from the data in **X**. In this exercise it is the **precio** column.

Once **X** and **y** are defined, use the _**train_test_split()**_ function to create the variables:
- **X_train**, **y_train**
- **X_test**, **y_test**

Use the following parameters: **test_size = 0.2** and **random_state = 42**.
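As a minimal sketch of the split, assuming a small synthetic dataset in place of the real CSV (the names `X_demo` and `y_demo` are illustrative, not part of the exercise), the following shows the 80/20 proportions and that a fixed `random_state` reproduces the same partition:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 100 rows, 3 features (assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)

# Splitting again with the same random_state selects the same rows.
X_train_2, X_test_2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
same_split = np.array_equal(X_test, X_test_2)
print(same_split)
```

Fixing the seed matters when comparing models, since every model is then evaluated on exactly the same test rows.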

---

### Scalers

Define 2 scalers:
- **Scaler for X**: Define a scaler for **X**, name it **X_scaler**, and fit it on **X_train**. Transform **X_train** and **X_test** with this scaler.
- **Scaler for y**: Define a scaler for **y**, name it **y_scaler**, and fit it on **y_train**. Transform **y_train** and **y_test** with this scaler.

You can use **MinMaxScaler()** or **StandardScaler()**.
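The reason for fitting only on the training data is that the test set should not influence any learned parameter, including the scaler's minimum and maximum. A small sketch with toy values (all `*_demo` names and numbers are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data standing in for the real train/test split (assumption).
X_train_demo = np.array([[100.0, 1.0], [200.0, 3.0], [300.0, 5.0]])
X_test_demo = np.array([[150.0, 2.0], [400.0, 6.0]])
y_train_demo = np.array([10_000.0, 20_000.0, 30_000.0])

# Fit on the training data only, so nothing from the test set leaks in.
X_scaler_demo = MinMaxScaler().fit(X_train_demo)
X_train_s = X_scaler_demo.transform(X_train_demo)
X_test_s = X_scaler_demo.transform(X_test_demo)  # may fall outside [0, 1]

# Scalers expect 2-D input, so a 1-D target is reshaped to one column.
y_scaler_demo = MinMaxScaler().fit(y_train_demo.reshape(-1, 1))
y_train_s = y_scaler_demo.transform(y_train_demo.reshape(-1, 1))

# inverse_transform recovers the original units.
y_back = y_scaler_demo.inverse_transform(y_train_s).ravel()
print(X_test_s)
print(np.allclose(y_back, y_train_demo))
```

Note that the second test row scales to values above 1, because it lies outside the range seen during fitting; this is expected behaviour rather than an error.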

---

### Models and Metrics

Train different regression models using **X_train** and **y_train**.
Generate the predictions (**y_hat**) for each model using **X_test** and compute the following metrics comparing **y_test** and **y_hat**:
- **Mean Absolute Error** (**MAE**)
- **Mean Squared Error** (**MSE**)
- **R Squared** (**r2_score**)

Remember that, to compute the metrics, you must invert the scaler transformation for **y_test** and **y_hat**, so that the errors are expressed in the original price units.
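For reference, the three metrics can be computed by hand and checked against scikit-learn. The sketch below uses four made-up prices (illustrative values, not taken from the dataset):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values in original price units (illustrative, not from the dataset).
y_true = np.array([10_000.0, 20_000.0, 30_000.0, 40_000.0])
y_pred = np.array([12_000.0, 18_000.0, 33_000.0, 37_000.0])

errors = y_true - y_pred
mae_manual = np.mean(np.abs(errors))            # average absolute error
mse_manual = np.mean(errors ** 2)               # average squared error
ss_res = np.sum(errors ** 2)                    # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(mae_manual, mean_absolute_error(y_true, y_pred))
print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

MAE stays in the units of the target, while MSE is in squared units, which is why MSE values for car prices look very large next to the MAE.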

---

### Comparison

Generate a **DataFrame** with the results of the previous step. This **DataFrame** must have the following 4 columns:

|model_name|mae|mse|r2_score|
|----------|---|---|--------|

Sort the **DataFrame** by **r2_score** in descending order, so that the model with the best **r2_score** appears in the first row.
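One possible way to build and sort such a table is sketched here. The metric values are rounded from the results reported later in this notebook; the variable name `results` is an assumption.

```python
import pandas as pd

# Metric values rounded from the results reported later in this notebook.
results = pd.DataFrame({
    "model_name": ["Linear Regression", "Random Forest", "Decision Tree", "KNeighbors"],
    "mae": [9547.88, 3227.86, 6967.10, 5300.87],
    "mse": [4964702455.08, 964844649.35, 2383627601.64, 3681493110.25],
    "r2_score": [0.05, 0.81, 0.54, 0.29],
})

# Higher R2 is better, so sort in descending order.
results = results.sort_values(by="r2_score", ascending=False).reset_index(drop=True)
print(results)
```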

---

_Optional: try to reach an r2_score above 0.90._

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scalers
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

# Train, Test
from sklearn.model_selection import train_test_split

# Models
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Metrics
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

df = pd.read_csv("autocasion_procesado.csv")

df

Unnamed: 0,Kilómetros,Cambio,Potencia (cv),Garantía,largo,ancho,alto,batalla_mm,peso_masa_kg,puertas,...,combustible_Mixto Gasolina/Etanol,combustible_nan,sobrealimentacion_Compresor Lisholm,sobrealimentacion_Compresor de raices,sobrealimentacion_Compresor y turbo,sobrealimentacion_Doble turbo,sobrealimentacion_Tipo de sobrealimentador,sobrealimentacion_Turbo,sobrealimentacion_Turbo de geometría variable,sobrealimentacion_nan
0,3900.0,1.0,179.0,12.0,4366.333333,1795.666667,1476.666667,2688.333333,1829.000000,5.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,4000.0,1.0,180.0,24.0,4403.666667,1808.000000,1503.333333,2653.000000,1880.000000,5.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,10.0,1.0,140.0,36.0,3657.000000,1627.000000,1480.000000,2300.000000,1425.000000,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,10.0,1.0,140.0,36.0,3657.000000,1627.000000,1480.000000,2300.000000,1425.000000,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,10.0,1.0,180.0,12.0,4549.666667,1834.333333,1482.333333,2717.666667,1946.666667,5.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
87638,60000.0,0.0,390.0,12.0,4950.000000,2008.000000,1776.000000,2984.000000,2980.000000,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
87639,66070.0,0.0,390.0,12.0,4950.000000,2008.000000,1776.000000,2984.000000,2980.000000,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
87640,31800.0,0.0,390.0,12.0,4950.000000,2008.000000,1776.000000,2984.000000,2980.000000,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
87641,6500.0,0.0,343.0,0.0,4519.000000,1852.000000,1299.333333,2450.000000,2051.666667,2.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87643 entries, 0 to 87642
Data columns (total 56 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   Kilómetros                                     87643 non-null  float64
 1   Cambio                                         87643 non-null  float64
 2   Potencia (cv)                                  87643 non-null  float64
 3   Garantía                                       87643 non-null  float64
 4   largo                                          87643 non-null  float64
 5   ancho                                          87643 non-null  float64
 6   alto                                           87643 non-null  float64
 7   batalla_mm                                     87643 non-null  float64
 8   peso_masa_kg                                   87643 non-null  float64
 9   puertas                                        876

In [4]:
df.isna().sum()

Kilómetros                                       0
Cambio                                           0
Potencia (cv)                                    0
Garantía                                         0
largo                                            0
ancho                                            0
alto                                             0
batalla_mm                                       0
peso_masa_kg                                     0
puertas                                          0
plazas                                           0
cilindrada_cm3                                   0
cilindros                                        0
urbano                                           0
carretera                                        0
medio                                            0
co2                                              0
deposito                                         0
precio                                           0
month                          

## X, y
Define X and y:
- X holds the independent variables, that is, the columns used to predict y. Choose the columns you consider most important; the Feature Selection from last week's exercise can serve as a guide.
- y is the dependent variable, that is, the variable we want to predict from the data in X. In this exercise it is the precio column.

In [6]:
df[['Potencia (cv)','deposito','year','Garantía','Kilómetros']]

| |Potencia (cv)|deposito|year|Garantía|Kilómetros|
|-|-------------|--------|----|--------|----------|
|0|179.0|4866.666667|2023.0|12.0|3900.0|
|1|180.0|4533.333333|2023.0|24.0|4000.0|
|2|140.0|3500.000000|2023.0|36.0|10.0|
|3|140.0|3500.000000|2023.0|36.0|10.0|
|4|180.0|5066.666667|2023.0|12.0|10.0|
|...|...|...|...|...|...|
|87638|390.0|7000.000000|2021.0|12.0|60000.0|
|87639|390.0|7000.000000|2021.0|12.0|66070.0|
|87640|390.0|7000.000000|2020.0|12.0|31800.0|
|87641|343.0|6600.000000|2009.0|0.0|6500.0|


In [7]:
X = df.drop('precio', axis = 1)
y = df["precio"]

print(f"X: {X.shape}")
print(f"y: {y.shape}")

X: (87643, 55)
y: (87643,)


In [8]:
X = np.array(df[['Potencia (cv)','deposito','year','Garantía','Kilómetros']])
y = np.array(df['precio'])

In [9]:
X.shape,y.shape

((87643, 5), (87643,))

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [11]:
print(f"Train set: {X_train.shape, y_train.shape}")
print(f"Test set: {X_test.shape, y_test.shape}")

Train set: ((70114, 5), (70114,))
Test set: ((17529, 5), (17529,))


## Scalers
Define 2 scalers:

- Scaler for X: Define a scaler for X, name it X_scaler, and fit it on X_train. Transform X_train and X_test with this scaler.
- Scaler for y: Define a scaler for y, name it y_scaler, and fit it on y_train. Transform y_train and y_test with this scaler.

In [13]:
X_scaler = MinMaxScaler()
y_scaler = MinMaxScaler()

In [14]:
X_scaler.fit(X_train)
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)

In [15]:
y_scaler.fit(y_train.reshape(-1, 1))
y_train_scaled = y_scaler.transform(y_train.reshape(-1, 1))
y_test_scaled = y_scaler.transform(y_test.reshape(-1, 1))

## Models and Metrics
Train different regression models using X_train and y_train. Generate the predictions for each model using X_test and compute the following metrics comparing y_test and y_hat:

- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- R Squared (r2_score)

Remember that, to compute the metrics, you must invert the scaler transformation for y_test and y_hat.
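As an alternative to inverting the target scaler by hand, scikit-learn offers `TransformedTargetRegressor` in `sklearn.compose`, which scales y before fitting and inverts the scaling on prediction. This is not the approach used in the cells below; it is a sketch on synthetic data, and the `*_demo` names are assumptions.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic data with an exact linear relation (assumption, for illustration).
X_demo = np.arange(20, dtype=float).reshape(-1, 1)
y_demo = 1_000.0 * X_demo.ravel() + 5_000.0

# The wrapper scales y before fitting and applies inverse_transform
# to the predictions, so they come back in original units.
model_demo = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=MinMaxScaler(),
)
model_demo.fit(X_demo, y_demo)
y_hat_demo = model_demo.predict(np.array([[25.0]]))
print(y_hat_demo)
```

With this wrapper the metrics can be computed directly on the unscaled `y_test`, which removes the repeated `reshape` and `inverse_transform` steps.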

## Linear Regression

In [18]:
model_lr = LinearRegression(n_jobs=-1)
model_lr.fit(X_train_scaled, y_train_scaled)

In [19]:
y_hat_scaled_lr = model_lr.predict(X_test_scaled)
y_hat_lr = y_scaler.inverse_transform(y_hat_scaled_lr)
y_test_inv_lr = y_scaler.inverse_transform(y_test_scaled)

In [20]:
mae_lr = mean_absolute_error(y_test_inv_lr, y_hat_lr)
mse_lr = mean_squared_error(y_test_inv_lr, y_hat_lr)
r2_lr = r2_score(y_test_inv_lr, y_hat_lr)

In [21]:
print("Linear Regression:")
print(f"MAE: {mae_lr:.2f}")
print(f"MSE: {mse_lr:.2f}")
print(f"R2 Score: {r2_lr:.2f}")

Linear Regression:
MAE: 9547.88
MSE: 4964702455.08
R2 Score: 0.05


## Random Forest Regressor

In [23]:
model_rfr = RandomForestRegressor(n_estimators=200)
model_rfr.fit(X_train_scaled, y_train_scaled.flatten())

In [24]:
y_hat_scaled_rfr = model_rfr.predict(X_test_scaled)

In [25]:
y_hat_scaled_rfr_reshaped = y_hat_scaled_rfr.reshape(-1, 1)
y_test_scaled_reshaped = y_test_scaled.reshape(-1, 1)

In [26]:
y_hat_rfr = y_scaler.inverse_transform(y_hat_scaled_rfr_reshaped)
y_test_inv_rfr = y_scaler.inverse_transform(y_test_scaled_reshaped)

In [27]:
mae_rfr = mean_absolute_error(y_test_inv_rfr, y_hat_rfr)
mse_rfr = mean_squared_error(y_test_inv_rfr, y_hat_rfr)
r2_rfr = r2_score(y_test_inv_rfr, y_hat_rfr)

In [28]:
print("\nRandom Forest Regressor:")
print(f"MAE: {mae_rfr:.2f}")
print(f"MSE: {mse_rfr:.2f}")
print(f"R2 Score: {r2_rfr:.2f}")


Random Forest Regressor:
MAE: 3227.86
MSE: 964844649.35
R2 Score: 0.81


## Decision Tree Regressor

In [30]:
model_dtr = DecisionTreeRegressor(max_depth=6)
model_dtr.fit(X_train_scaled, y_train_scaled)

In [31]:
y_hat_scaled_dtr = model_dtr.predict(X_test_scaled)
y_hat_scaled_dtr_rshape = y_hat_scaled_dtr.reshape(-1,1)
y_hat_dtr = y_scaler.inverse_transform(y_hat_scaled_dtr_rshape)

In [32]:
y_test_scaled_rshape = y_test_scaled.reshape(-1, 1)
y_test_inv_dtr = y_scaler.inverse_transform(y_test_scaled_rshape)

In [33]:
mae_dtr = mean_absolute_error(y_test_inv_dtr, y_hat_dtr)
mse_dtr = mean_squared_error(y_test_inv_dtr, y_hat_dtr)
r2_dtr = r2_score(y_test_inv_dtr, y_hat_dtr)

In [34]:
print("\nDecision Tree Regressor:")
print(f"MAE: {mae_dtr:.2f}")
print(f"MSE: {mse_dtr:.2f}")
print(f"R2 Score: {r2_dtr:.2f}")


Decision Tree Regressor:
MAE: 6967.10
MSE: 2383627601.64
R2 Score: 0.54


## KNeighbors Regressor

In [35]:
model_knr = KNeighborsRegressor()
model_knr.fit(X_train_scaled, y_train_scaled)

In [36]:
y_hat_scaled_knr = model_knr.predict(X_test_scaled)
y_hat_scaled_knr_rshape = y_hat_scaled_knr.reshape(-1,1)
y_hat_knr = y_scaler.inverse_transform(y_hat_scaled_knr_rshape)

In [37]:
y_test_scaled_rshape = y_test_scaled.reshape(-1, 1)
y_test_inv_knr = y_scaler.inverse_transform(y_test_scaled_rshape)

In [38]:
mae_knr = mean_absolute_error(y_test_inv_knr, y_hat_knr)
mse_knr = mean_squared_error(y_test_inv_knr, y_hat_knr)
r2_knr = r2_score(y_test_inv_knr, y_hat_knr)

In [39]:
# Print the KNeighbors metrics (the *_knr variables, not the decision tree ones).
print("\nKNeighborsRegressor:")
print(f"MAE: {mae_knr:.2f}")
print(f"MSE: {mse_knr:.2f}")
print(f"R2 Score: {r2_knr:.2f}")


KNeighborsRegressor:
MAE: 5300.87
MSE: 3681493110.25
R2 Score: 0.29


## Comparison

In [40]:
resultado = {
    'Model': ['Random Forest', 'Decision Tree', 'Linear Regression','KNeighbors'],
    'MAE': [mae_rfr, mae_dtr, mae_lr, mae_knr],
    'MSE': [mse_rfr, mse_dtr, mse_lr, mse_knr],
    'R2 Score': [r2_rfr, r2_dtr, r2_lr, r2_knr]
}
resultado

{'Model': ['Random Forest',
  'Decision Tree',
  'Linear Regression',
  'KNeighbors'],
 'MAE': [3227.863609869077,
  6967.096037856533,
  9547.882616976194,
  5300.866385228289],
 'MSE': [964844649.3471539,
  2383627601.6400895,
  4964702455.077975,
  3681493110.247934],
 'R2 Score': [0.8148181735126636,
  0.5425123482458964,
  0.04712881061258811,
  0.29341410680201285]}

In [41]:
metricas_resultado = pd.DataFrame(resultado).sort_values(by='R2 Score', ascending=False)

In [42]:
metricas_resultado

| |Model|MAE|MSE|R2 Score|
|-|-----|---|---|--------|
|0|Random Forest|3227.86361|964844600.0|0.814818|
|1|Decision Tree|6967.096038|2383628000.0|0.542512|
|3|KNeighbors|5300.866385|3681493000.0|0.293414|
|2|Linear Regression|9547.882617|4964702000.0|0.047129|
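The fit, predict, inverse-transform and score steps are the same for every model above, so they could be wrapped in a single helper and run in a loop. The sketch below does this on synthetic data; the helper `evaluate`, the `*_demo` arrays and the choice of two models are assumptions for illustration, not the notebook's own code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the real dataset (assumption).
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0, 0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
X_sc = MinMaxScaler().fit(X_tr)
y_sc = MinMaxScaler().fit(y_tr.reshape(-1, 1))
X_tr_s, X_te_s = X_sc.transform(X_tr), X_sc.transform(X_te)
y_tr_s = y_sc.transform(y_tr.reshape(-1, 1)).ravel()

def evaluate(name, model):
    """Fit on scaled data and score in original units (hypothetical helper)."""
    model.fit(X_tr_s, y_tr_s)
    y_hat = y_sc.inverse_transform(model.predict(X_te_s).reshape(-1, 1)).ravel()
    return {
        "model_name": name,
        "mae": mean_absolute_error(y_te, y_hat),
        "mse": mean_squared_error(y_te, y_hat),
        "r2_score": r2_score(y_te, y_hat),
    }

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=6, random_state=42),
}
summary = pd.DataFrame([evaluate(n, m) for n, m in models.items()])
summary = summary.sort_values(by="r2_score", ascending=False).reset_index(drop=True)
print(summary)
```

Because each model's metrics come from the same function call, this layout also avoids printing one model's variables under another model's label.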

