<a href="https://colab.research.google.com/github/fralfaro/MAT281_2023/blob/main/docs/labs/lab_071.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MAT281 - Laboratory No. 071


<a id='p1'></a>
## I.- Problem 01


<img src="https://drive.google.com/uc?export=view&id=1cI62fPIKkkofrAHLQaWLfcIr3qlE1TAZ" width = "350" align="center"/>



The data describe the houses found in a given California district, together with summary statistics about them based on the 1990 census. Note that the data are not cleaned, so some preprocessing steps are required.

The columns of the original dataset are as follows, and their names are self-explanatory. Note that the scikit-learn loader used below exposes a different, derived set of columns (`MedInc`, `HouseAge`, `AveRooms`, and so on):

* longitude
* latitude
* housing_median_age
* total_rooms
* total_bedrooms
* population
* households
* median_income
* median_house_value
* ocean_proximity


The goal is to predict the average value of each property.
To complete this lab correctly, follow this work rubric:

1. Problem definition
2. Descriptive statistics
3. Descriptive visualization
4. Preprocessing
5. Model selection (compare at least four models)
6. Metrics and analysis of results
7. Model visualizations
8. Conclusions

> **Note**: Students are free to develop a more complete analysis of the problem. The following [link](https://www.kaggle.com/camnugent/california-housing-prices) can serve as a reference.

In [49]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn import tree
from sklearn import svm
from sklearn import neighbors
import time

In [20]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

def mean_absolute_percentage_error(y_true, y_pred):
    # Mean absolute percentage error, in percent (undefined if y_true contains zeros)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def regression_metrics(df):
    """
    Compute the regression metrics defined in this cell.
    :param df: DataFrame with the columns ['y', 'yhat']
    :return: single-row DataFrame with the computed metrics
    """
    df_result = pd.DataFrame()

    y_true = df['y']
    y_pred = df['yhat']
    # Compute MAPE once and reuse it below
    mape = mean_absolute_percentage_error(y_true, y_pred)

    df_result['mae'] = [round(mean_absolute_error(y_true, y_pred), 4)]
    df_result['mse'] = [round(mean_squared_error(y_true, y_pred), 4)]
    df_result['rmse'] = [round(np.sqrt(mean_squared_error(y_true, y_pred)), 4)]
    df_result['mape'] = [round(mape, 4)]
    # Note: this 'smape' is derived from MAPE as 2 * mape / (mape + 100);
    # it is not the standard symmetric MAPE definition
    df_result['smape'] = [round(2 * mape / (mape + 100), 4)]

    return df_result
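
The `smape` column in the cell above is computed from MAPE rather than from the usual symmetric formula. As a minimal sketch, assuming the common definition SMAPE = (100 / n) · Σ 2|y − ŷ| / (|y| + |ŷ|), a standalone helper could look like the following. The name `symmetric_mape` is hypothetical and is not part of the lab code.

```python
import numpy as np

def symmetric_mape(y_true, y_pred):
    """Symmetric MAPE in percent (assumed definition, bounded between 0 and 200)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = np.abs(y_true) + np.abs(y_pred)
    # Where both values are zero, the term is set to 0 to avoid dividing 0 by 0
    ratio = np.divide(
        2.0 * np.abs(y_true - y_pred),
        denom,
        out=np.zeros_like(denom),
        where=denom != 0,
    )
    return float(np.mean(ratio) * 100)

# Small hand-checkable examples
perfect = symmetric_mape([1.0, 2.0, 4.0], [1.0, 2.0, 4.0])
off = symmetric_mape([100.0, 200.0], [110.0, 180.0])
print(perfect, round(off, 4))
```

Unlike the MAPE-derived value, this version treats over- and under-prediction symmetrically and stays defined when a true value is zero but the prediction is not.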

In [21]:
from sklearn.datasets import fetch_california_housing

# Load the California housing data
housing_data = fetch_california_housing(as_frame=True)

# Take the feature DataFrame and append the target column
housing = housing_data['data']
housing['target'] = housing_data['target']

# Show the first rows of the DataFrame
housing.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | target |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


In [22]:
housing.describe()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | target |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | 3.870671 | 28.639486 | 5.429 | 1.096675 | 1425.476744 | 3.070655 | 35.631861 | -119.569704 | 2.068558 |
| std | 1.899822 | 12.585558 | 2.474173 | 0.473911 | 1132.462122 | 10.38605 | 2.135952 | 2.003532 | 1.153956 |
| min | 0.4999 | 1.0 | 0.846154 | 0.333333 | 3.0 | 0.692308 | 32.54 | -124.35 | 0.14999 |
| 25% | 2.5634 | 18.0 | 4.440716 | 1.006079 | 787.0 | 2.429741 | 33.93 | -121.8 | 1.196 |
| 50% | 3.5348 | 29.0 | 5.229129 | 1.04878 | 1166.0 | 2.818116 | 34.26 | -118.49 | 1.797 |
| 75% | 4.74325 | 37.0 | 6.052381 | 1.099526 | 1725.0 | 3.282261 | 37.71 | -118.01 | 2.64725 |
| max | 15.0001 | 52.0 | 141.909091 | 34.066667 | 35682.0 | 1243.333333 | 41.95 | -114.31 | 5.00001 |


In [23]:
# Drop rows with missing values
housing.dropna(inplace=True)

In [29]:
# Standardize every column (note: this includes 'target')
housing_scalar = StandardScaler().fit_transform(housing)
housing_scalar

array([[ 2.34476576,  0.98214266,  0.62855945, ...,  1.05254828,
        -1.32783522,  2.12963148],
       [ 2.33223796, -0.60701891,  0.32704136, ...,  1.04318455,
        -1.32284391,  1.31415614],
       [ 1.7826994 ,  1.85618152,  1.15562047, ...,  1.03850269,
        -1.33282653,  1.25869341],
       ...,
       [-1.14259331, -0.92485123, -0.09031802, ...,  1.77823747,
        -0.8237132 , -0.99274649],
       [-1.05458292, -0.84539315, -0.04021111, ...,  1.77823747,
        -0.87362627, -1.05860847],
       [-0.78012947, -1.00430931, -0.07044252, ...,  1.75014627,
        -0.83369581, -1.01787803]])

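In [53]:
# Caution: housing_scalar still contains the standardized 'target' column,
# so X leaks the target into the features; drop that column for a valid evaluation
X = housing_scalar.copy()
y = housing['target'].values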

In [54]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
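
Because the whole frame, `target` included, was standardized before `X` was defined, the features passed to `train_test_split` contain the answer. The following is a minimal sketch of one leak-free alternative, assuming a scikit-learn `Pipeline` is acceptable for the lab. It uses a small synthetic frame (the columns `f1`, `f2`, `f3` are hypothetical stand-ins, not the real housing data) so that it runs without downloading anything, and it contrasts the clean setup with the leaky one.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing frame (assumption: not the real data)
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(500, 3)), columns=["f1", "f2", "f3"])
df["target"] = 2.0 * df["f1"] - df["f2"] + rng.normal(scale=1.0, size=500)

# Separate features and target BEFORE any scaling
X = df.drop(columns="target")
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training features only
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)
mae_clean = mean_absolute_error(y_test, model.predict(X_test))

# For contrast: scaling the whole frame keeps 'target' among the features
X_leaky = StandardScaler().fit_transform(df)
Xl_train, Xl_test, yl_train, yl_test = train_test_split(
    X_leaky, y, test_size=0.2, random_state=42
)
leaky = LinearRegression().fit(Xl_train, yl_train)
mae_leaky = mean_absolute_error(yl_test, leaky.predict(Xl_test))

print(f"MAE without leak: {mae_clean:.4f}")
print(f"MAE with leak:    {mae_leaky:.2e}")
```

In the leaky variant the target is an exact linear function of one feature, so linear regression reproduces it up to floating-point error. An error of essentially zero is the symptom to look for.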

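In [63]:
# Regression models (these classes are not imported in the first cell)
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor

reg_linear = LinearRegression()
reg_svr = SVR()
rf_model = RandomForestRegressor()
reg_neig = neighbors.KNeighborsRegressor()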

In [60]:
# Linear regression
start_time = time.time()
reg_linear.fit(X_train, y_train)
linear_pred = reg_linear.predict(X_test)
linear_time = time.time() - start_time
linear_metrics = regression_metrics(pd.DataFrame({'y': y_test, 'yhat': linear_pred}))
print('Metrics for the linear regression model')
print(linear_metrics)
print('Execution time (s):', linear_time)

Metrics for the linear regression model
   mae  mse  rmse  mape  smape
0  0.0  0.0   0.0   0.0    0.0
Execution time (s): 0.036470651626586914


In [59]:
# Support vector regression (SVR)
start_time = time.time()
reg_svr.fit(X_train, y_train)
svr_pred = reg_svr.predict(X_test)
svr_time = time.time() - start_time
svr_metrics = regression_metrics(pd.DataFrame({'y': y_test, 'yhat': svr_pred}))
print('Metrics for the SVM model')
print(svr_metrics)
print('Execution time (s):', svr_time)

Metrics for the SVM model
      mae     mse    rmse    mape   smape
0  0.0497  0.0054  0.0736  3.4273  0.0663
Execution time (s): 1.5435314178466797


In [65]:
# Random forest
start_time = time.time()
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
rf_time = time.time() - start_time
rf_metrics = regression_metrics(pd.DataFrame({'y': y_test, 'yhat': rf_pred}))
print('Metrics for the random forest model')
print(rf_metrics)
print('Execution time (s):', rf_time)

Metrics for the random forest model
      mae  mse   rmse    mape   smape
0  0.0002  0.0  0.001  0.0239  0.0005
Execution time (s): 7.731144905090332


In [66]:
# K-nearest neighbors regression
start_time = time.time()
reg_neig.fit(X_train, y_train)
neig_pred = reg_neig.predict(X_test)
neig_time = time.time() - start_time
neig_metrics = regression_metrics(pd.DataFrame({'y': y_test, 'yhat': neig_pred}))
print('Metrics for the k-nearest neighbors model')
print(neig_metrics)
print('Execution time (s):', neig_time)

Metrics for the k-nearest neighbors model
      mae     mse    rmse    mape   smape
0  0.1006  0.0231  0.1519  6.9012  0.1291
Execution time (s): 0.5248086452484131
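
The four cells above repeat the same fit, predict, and time pattern. As a hedged sketch of how that comparison could be written once, the loop below wraps each estimator in a scaling pipeline and scores it with 5-fold cross-validation instead of a single split. The data are synthetic, and the dictionary keys and the `n_estimators=50` setting are illustrative assumptions, so the numbers it prints say nothing about the housing problem.

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data (assumption: stands in for the real features)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

models = {
    "linear_regression": LinearRegression(),
    "svr": SVR(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "k_neighbors": KNeighborsRegressor(),
}

results = {}
for name, estimator in models.items():
    # Scaling lives inside the pipeline, so it is refit on each training fold
    pipeline = make_pipeline(StandardScaler(), estimator)
    start = time.perf_counter()
    # The scorer returns negated MAE, so flip the sign back
    scores = -cross_val_score(
        pipeline, X, y, cv=5, scoring="neg_mean_absolute_error"
    )
    results[name] = {
        "mae": float(scores.mean()),
        "seconds": time.perf_counter() - start,
    }

for name, row in results.items():
    print(f"{name:>18}: MAE={row['mae']:.3f}  time={row['seconds']:.2f}s")
```

Keeping the models in a dictionary makes it easy to add a fifth candidate, and cross-validation gives a less split-dependent estimate than one train/test partition.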


In [147]:
# Collect each model's metric values into one list per metric
all_metrics = [linear_metrics, svr_metrics, rf_metrics, neig_metrics]

# Each metrics frame has a single row, so take its first value
mae_list = [float(m['mae'].iloc[0]) for m in all_metrics]
mse_list = [float(m['mse'].iloc[0]) for m in all_metrics]
rmse_list = [float(m['rmse'].iloc[0]) for m in all_metrics]
smape_list = [float(m['smape'].iloc[0]) for m in all_metrics]

In [148]:
# Compare the models side by side
models = ['linear_regression', 'svm', 'random_forest', 'k_neighbors_regression']
times = [linear_time, svr_time, rf_time, neig_time]
result = pd.DataFrame({'model': models, 'time': times, 'mae': mae_list, 'mse': mse_list, 'rmse': rmse_list, 'smape': smape_list})
result

| | model | time | mae | mse | rmse | smape |
|---|---|---|---|---|---|---|
| 0 | linear_regression | 0.036471 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | svm | 1.543531 | 0.0497 | 0.0054 | 0.0736 | 0.0663 |
| 2 | random_forest | 7.731145 | 0.0002 | 0.0 | 0.001 | 0.0005 |
| 3 | k_neighbors_regression | 0.524809 | 0.1006 | 0.0231 | 0.1519 | 0.1291 |


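Going by the metrics alone, linear regression appears to be the best model. That result is an artifact, however: `X` is built from `housing_scalar`, which still contains the standardized `target` column, so every model receives the answer as a feature and linear regression recovers it exactly. The errors reported above therefore do not measure real predictive performance, and the comparison should be repeated with the target excluded from the features.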