<a href="https://colab.research.google.com/github/GuiOSousa/HouseValueRegression/blob/main/HouseValue.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Setting Up the Environment

In [274]:
!pip install pandas



In [275]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


#Processing the Data

##Loading the dataset

In [276]:
import pandas as pd

dataset = pd.read_csv('/content/drive/MyDrive/Datasets/HouseRentData/house_prices.csv')

dataset

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity,median_house_value
0,-122.23,37.88,41,880,129.0,322,126,8.3252,NEAR BAY,452600
1,-122.22,37.86,21,7099,1106.0,2401,1138,8.3014,NEAR BAY,358500
2,-122.24,37.85,52,1467,190.0,496,177,7.2574,NEAR BAY,352100
3,-122.25,37.85,52,1274,235.0,558,219,5.6431,NEAR BAY,341300
4,-122.25,37.85,52,1627,280.0,565,259,3.8462,NEAR BAY,342200
...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25,1665,374.0,845,330,1.5603,INLAND,78100
20636,-121.21,39.49,18,697,150.0,356,114,2.5568,INLAND,77100
20637,-121.22,39.43,17,2254,485.0,1007,433,1.7000,INLAND,92300
20638,-121.32,39.43,18,1860,409.0,741,349,1.8672,INLAND,84700


##Handling NaN and null values

In [277]:
print(dataset.isna().sum())
print(dataset.isnull().sum())

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
ocean_proximity         0
median_house_value      0
dtype: int64
longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
ocean_proximity         0
median_house_value      0
dtype: int64


Because the NaN/null values affect only a small share of the dataset (207 rows, about 1%), we can simply drop them without a significant impact on model training.

In [278]:
dataset = dataset.dropna()
dataset = dataset.reset_index(drop=True)
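
As a quick sanity check on the reasoning above, the following sketch measures how many rows contain missing values before dropping them. It uses a small made-up DataFrame (`toy` and its values are illustrative assumptions, not the real dataset). It also shows that `isna()` and `isnull()` report the same counts, since pandas documents `isnull` as an alias of `isna`.

```python
import numpy as np
import pandas as pd

# Illustrative data only: a tiny stand-in for the housing dataset.
toy = pd.DataFrame({
    "total_rooms": [880, 7099, 1467, 1274],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# isnull() is an alias of isna(), so both give identical counts.
same_counts = toy.isna().sum().equals(toy.isnull().sum())

# Share of rows with at least one missing value.
missing_share = toy.isna().any(axis=1).mean()
print(f"Rows with missing values: {missing_share:.1%}")

# Drop those rows and rebuild a contiguous index.
cleaned = toy.dropna().reset_index(drop=True)
```

If the share turned out to be large, dropping rows would discard a lot of data and imputing the missing values might be worth considering instead.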

##Encoding text values as numeric values

Machine learning models generally require string values to be encoded as numbers before training.


In the cell below, we replace the ocean proximity attribute (`ocean_proximity`), which is stored as text, with integer values.

In [279]:
# Work on a copy so the original dataset is left untouched.
datasetEncoded = dataset.copy()

# Integer code assigned to each ocean_proximity category.
oceanProximityCodes = {
  "ISLAND": 0,
  "NEAR BAY": 1,
  "NEAR OCEAN": 2,
  "<1H OCEAN": 3,
  "INLAND": 4,
}
datasetEncoded["ocean_proximity"] = datasetEncoded["ocean_proximity"].map(oceanProximityCodes)

datasetEncoded

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity,median_house_value
0,-122.23,37.88,41,880,129.0,322,126,8.3252,1,452600
1,-122.22,37.86,21,7099,1106.0,2401,1138,8.3014,1,358500
2,-122.24,37.85,52,1467,190.0,496,177,7.2574,1,352100
3,-122.25,37.85,52,1274,235.0,558,219,5.6431,1,341300
4,-122.25,37.85,52,1627,280.0,565,259,3.8462,1,342200
...,...,...,...,...,...,...,...,...,...,...
20428,-121.09,39.48,25,1665,374.0,845,330,1.5603,4,78100
20429,-121.21,39.49,18,697,150.0,356,114,2.5568,4,77100
20430,-121.22,39.43,17,2254,485.0,1007,433,1.7000,4,92300
20431,-121.32,39.43,18,1860,409.0,741,349,1.8672,4,84700
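
The integer codes above impose an arbitrary order on the categories (`INLAND` = 4 is not "greater" than `NEAR BAY` = 1 in any meaningful sense). Tree-based models often cope with this, but one-hot encoding is a common alternative. The sketch below is not what this notebook does; it uses a made-up DataFrame (`toy`) and pandas' `get_dummies` to show the idea.

```python
import pandas as pd

# Illustrative data only: three rows with different categories.
toy = pd.DataFrame({
    "median_income": [8.3252, 1.5603, 4.0450],
    "ocean_proximity": ["NEAR BAY", "INLAND", "<1H OCEAN"],
})

# One 0/1 indicator column per category replaces the text column.
oneHot = pd.get_dummies(toy, columns=["ocean_proximity"], dtype=int)
print(oneHot.columns.tolist())
```

Each row ends up with exactly one indicator set to 1, so no ordering between categories is implied.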


##Location attributes

Latitude and longitude are not the best attributes to analyze when socioeconomic factors matter more than the physical position of the property. We keep both anyway, because the dataset contains no other location data.

In this context, information such as state, city/district, or neighborhood would support more relevant analyses. If it were available, it would be more prudent to remove the latitude and longitude columns.


#Splitting Training and Test Sets

In [280]:
trainingDataset = datasetEncoded.sample(frac=0.9)

trainingDataset

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity,median_house_value
16338,-120.98,37.79,20,2458,491.0,1227,481,2.7857,4,110900
400,-122.27,37.90,42,1650,274.0,645,256,5.6228,1,375400
9864,-122.24,38.31,38,1938,301.0,823,285,6.1089,1,280800
10424,-117.71,33.57,4,3289,753.0,1285,651,4.0450,3,226000
1980,-119.80,36.72,43,1286,360.0,972,345,0.9513,4,50400
...,...,...,...,...,...,...,...,...,...,...
17470,-121.92,37.27,29,5536,862.0,2651,881,5.6358,3,282100
7009,-118.02,33.93,33,4711,988.0,2984,931,3.6028,3,184700
10201,-117.78,33.87,16,5609,952.0,2624,934,5.3307,3,169600
148,-122.22,37.80,52,2286,464.0,1073,441,3.0298,1,199600


In [281]:
# Features: every column except the target.
X_trainingDataset = trainingDataset.drop("median_house_value", axis=1)
# Target: kept as a single-column DataFrame.
Y_trainingDataset = trainingDataset[["median_house_value"]]

In [282]:
testDataset = datasetEncoded.drop(trainingDataset.index)

testDataset

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity,median_house_value
1,-122.22,37.86,21,7099,1106.0,2401,1138,8.3014,1,358500
2,-122.24,37.85,52,1467,190.0,496,177,7.2574,1,352100
3,-122.25,37.85,52,1274,235.0,558,219,5.6431,1,341300
7,-122.25,37.84,52,3104,687.0,1157,647,3.1200,1,241400
29,-122.28,37.84,52,729,160.0,395,155,1.6875,1,132000
...,...,...,...,...,...,...,...,...,...,...
20365,-121.98,38.52,27,3044,565.0,1583,514,2.7989,4,126700
20366,-122.05,38.56,20,1005,168.0,457,157,5.6790,4,225000
20409,-121.53,39.08,15,1810,441.0,1157,375,2.0469,4,55100
20419,-121.43,39.18,36,1124,184.0,504,171,2.1667,4,93800


In [283]:
# Features: every column except the target.
X_testDataset = testDataset.drop("median_house_value", axis=1)
# Target: kept as a single-column DataFrame.
Y_testDataset = testDataset[["median_house_value"]]
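
The split above samples 90% of the rows for training and keeps the remaining rows for testing. The sketch below reproduces that pattern on a made-up DataFrame (`toy`) and checks that the two sets are disjoint and cover every row. The `random_state` argument is an assumption added here for reproducibility; the notebook itself does not set a seed, so its split changes on every run.

```python
import pandas as pd

# Illustrative data only: ten rows with a unique index.
toy = pd.DataFrame({
    "median_income": range(10),
    "median_house_value": range(100, 110),
})

# random_state fixes the shuffle so the split is reproducible.
train = toy.sample(frac=0.9, random_state=42)
# Dropping the sampled index labels leaves the rows that were not sampled.
test = toy.drop(train.index)

# The two sets should be disjoint and together cover every row.
overlap = train.index.intersection(test.index)
print(len(train), len(test), len(overlap))
```

This approach relies on the index labels being unique, which holds in the notebook because the index is reset after the rows with missing values are dropped.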

#Gradient Boosting Regressor
We use a generic Gradient Boosting Regressor (default hyperparameters) for the regression task.

The Gradient Boosting Regressor is an ensemble model built from decision trees: during training it combines many weak learners into a single stronger model.
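
To make the ensemble idea concrete, here is a minimal from-scratch sketch of gradient boosting for squared error. It is a simplified illustration, not scikit-learn's implementation: it assumes a single feature, uses one-split "stumps" as weak learners, and all function names (`fit_stump`, `fit_boosting`, and so on) and the synthetic data are hypothetical.

```python
import numpy as np

def fit_stump(x, residuals):
    """Best single-threshold split of a 1-D feature under squared error."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so both sides are non-empty
        left, right = residuals[x <= threshold], residuals[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]  # (threshold, left_value, right_value)

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def fit_boosting(x, y, n_rounds=50, learning_rate=0.1):
    base = y.mean()                      # start from a constant prediction
    prediction = np.full(len(y), base)
    stumps = []
    for _ in range(n_rounds):
        # Residuals are the negative gradient of squared error (up to a constant).
        residuals = y - prediction
        stump = fit_stump(x, residuals)
        prediction = prediction + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, stumps, learning_rate

def predict_boosting(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

model = fit_boosting(x, y)
baseline_mse = np.mean((y - y.mean()) ** 2)
boosted_mse = np.mean((y - predict_boosting(model, x)) ** 2)
print(f"baseline MSE: {baseline_mse:.3f}, boosted MSE: {boosted_mse:.3f}")
```

Each round fits a weak learner to the errors left by the previous rounds, so the summed model improves on the constant baseline even though every individual stump is a poor predictor.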

In [284]:
from sklearn.ensemble import GradientBoostingRegressor

# Default hyperparameters.
model = GradientBoostingRegressor()

# scikit-learn expects a 1-D target, so flatten the single-column DataFrame.
model.fit(X_trainingDataset, Y_trainingDataset.values.ravel())

prediction = model.predict(X_testDataset)


In [285]:
predictionDataFrame = pd.DataFrame(prediction)

predictionDataFrame

Unnamed: 0,0
0,413484.397991
1,379062.297359
2,338870.805400
3,247446.092652
4,147064.858308
...,...
2038,96916.209262
2039,212550.594109
2040,78027.663277
2041,77574.338863


##Evaluating the predictions

Common metrics such as MAE and MSE are hard to interpret here because our values are not normalized, so MAPE, RMSLE, and R² were chosen instead.

Because the targets are property prices in the hundreds of thousands, MAE and MSE would be very large in absolute terms regardless of model quality, which makes them a poor indicator of how good the predictions are.
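
The scale dependence described above can be seen in a small sketch. The helper functions `mae` and `mape` and the price values are illustrative assumptions, not the notebook's data: the same predictions are evaluated once in full units and once in thousands.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the target."""
    return np.mean(np.abs(y_true - y_pred))

def mape(y_true, y_pred):
    """Mean absolute percentage error, as a unitless fraction."""
    return np.mean(np.abs((y_true - y_pred) / y_true))

# Illustrative values only: three prices with predictions that are each 10% off.
y_true = np.array([100_000.0, 250_000.0, 400_000.0])
y_pred = y_true * np.array([1.1, 0.9, 1.1])

# The same prices and predictions expressed in thousands.
y_true_k, y_pred_k = y_true / 1000, y_pred / 1000

print(f"MAE:  {mae(y_true, y_pred):.1f} vs {mae(y_true_k, y_pred_k):.1f}")
print(f"MAPE: {mape(y_true, y_pred):.2f} vs {mape(y_true_k, y_pred_k):.2f}")
```

MAE changes by a factor of 1000 when the unit changes, while MAPE stays the same, which is why a relative metric is easier to read for prices of this magnitude.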

In [286]:
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_log_error, r2_score
import numpy as np


mape = mean_absolute_percentage_error(Y_testDataset, predictionDataFrame)
msle = mean_squared_log_error(Y_testDataset, predictionDataFrame)
r2 = r2_score(Y_testDataset, predictionDataFrame)
rmsle = np.sqrt(msle)

print(f"MAPE: {mape:.2f}")
print(f"RMSLE: {rmsle:.2f}")
print(f"R2: {r2:.2f}")

MAPE: 0.21
RMSLE: 0.27
R2: 0.76


Looking at the MAPE, RMSLE, and R² values, we can conclude that the results are acceptable for a generic model, keeping in mind that better location variables would likely lead to more accurate predictions.

RMSLE (Root Mean Squared Logarithmic Error) squares the errors and therefore penalizes large misses more heavily than MAPE (Mean Absolute Percentage Error). The small difference between the two suggests that there were few critical errors, that is, cases where the predicted and true values are far apart.
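
A more direct way to check for critical errors is to count how many predictions miss by more than some relative threshold. The sketch below uses made-up arrays (`y_true`, `y_pred`) and a hypothetical 50% cutoff; neither comes from the notebook's actual results.

```python
import numpy as np

# Illustrative values only, not the notebook's actual predictions.
y_true = np.array([452600.0, 358500.0, 132000.0, 93800.0, 55100.0])
y_pred = np.array([413484.0, 379062.0, 147064.0, 77574.0, 110200.0])

# Relative error of each prediction.
relative_error = np.abs(y_pred - y_true) / y_true

# Share of predictions that miss by more than 50% (a hypothetical cutoff).
large_error_share = np.mean(relative_error > 0.5)
print(f"Predictions off by more than 50%: {large_error_share:.0%}")
```

With the notebook's variables, the same calculation could be run on `Y_testDataset.iloc[:, 0].to_numpy()` and `prediction`.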

#Plotting the Map

In [287]:
import folium

# Map centered on central California.
map = folium.Map(tiles='openstreetmap', location=[36.69245689448622, -119.7175285756223], zoom_start=7, zoom_control=True)

In [288]:
housesLayer = folium.FeatureGroup(name="Houses", show=True)

# Iterate over every house in the test set.
for i in range(len(testDataset)):
  house = testDataset.iloc[i]
  coordinates = [house['latitude'], house['longitude']]
  truePrice = Y_testDataset.iloc[i, 0]
  predictedPrice = int(predictionDataFrame.iloc[i, 0])

  # Relative error of the prediction, used to pick the marker color.
  error = abs((predictedPrice - truePrice) / truePrice)
  if error <= 0.05:
    color = 'blue'
  elif error <= 0.1:
    color = 'green'
  elif error <= 0.2:
    color = 'yellow'
  elif error <= 0.5:
    color = 'orange'
  else:
    color = 'red'

  housePopup = folium.Popup(f"Predicted Price: {predictedPrice}<br>Real Price: {int(truePrice)}", max_width=500)
  # Add each marker to the "Houses" layer rather than directly to the map.
  folium.CircleMarker(location=coordinates,
                      popup=housePopup,
                      fill_color=color,
                      fill_opacity=0.6,
                      stroke=False,
                      ).add_to(housesLayer)

housesLayer.add_to(map)

<folium.map.FeatureGroup at 0x7adb9013e5f0>

In [289]:
map
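
The color thresholds used in the plotting loop above can be factored into a small function, which makes them easier to test and to reuse (for example, in a legend). This is a sketch; the function name `error_color` is hypothetical and not part of the notebook.

```python
def error_color(relative_error):
    """Map a relative prediction error to a marker color."""
    # Upper bound of each error band and the color assigned to it.
    thresholds = [(0.05, 'blue'), (0.1, 'green'), (0.2, 'yellow'), (0.5, 'orange')]
    for limit, color in thresholds:
        if relative_error <= limit:
            return color
    # Anything above the last band is treated as a large error.
    return 'red'

# Example: a prediction of 413484 for a true price of 452600.
print(error_color(abs(413484 - 452600) / 452600))
```

Inside the loop, the `if`/`elif` chain could then be replaced with `color = error_color(error)`.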