<a href="https://colab.research.google.com/github/GuiOSousa/HouseValueRegression/blob/main/HouseValue.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Setting Up the Environment

In [290]:
!pip install pandas



In [291]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


#Preparing the Data

##Loading the dataset

In [292]:
import pandas as pd

dataset = pd.read_csv('/content/drive/MyDrive/Datasets/HouseRentData/house_prices.csv')

dataset

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity,median_house_value
0,-122.23,37.88,41,880,129.0,322,126,8.3252,NEAR BAY,452600
1,-122.22,37.86,21,7099,1106.0,2401,1138,8.3014,NEAR BAY,358500
2,-122.24,37.85,52,1467,190.0,496,177,7.2574,NEAR BAY,352100
3,-122.25,37.85,52,1274,235.0,558,219,5.6431,NEAR BAY,341300
4,-122.25,37.85,52,1627,280.0,565,259,3.8462,NEAR BAY,342200
...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25,1665,374.0,845,330,1.5603,INLAND,78100
20636,-121.21,39.49,18,697,150.0,356,114,2.5568,INLAND,77100
20637,-121.22,39.43,17,2254,485.0,1007,433,1.7000,INLAND,92300
20638,-121.32,39.43,18,1860,409.0,741,349,1.8672,INLAND,84700


##Handling NaN and null values

In [293]:
print(dataset.isna().sum())
print(dataset.isnull().sum())

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
ocean_proximity         0
median_house_value      0
dtype: int64
longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
ocean_proximity         0
median_house_value      0
dtype: int64


Because the NaN/null values make up a very small share of the dataset (207 missing `total_bedrooms` entries out of more than 20,000 rows, about 1%), we can simply drop those rows without a significant impact on model training.

In [294]:
dataset = dataset.dropna()
dataset = dataset.reset_index(drop=True)
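
Before dropping rows, it can help to confirm how much data would be lost. The following is a minimal sketch on a toy DataFrame with made-up values (the names `toy`, `missing_share`, and `cleaned` are illustrative and not part of the notebook); the same two steps apply to `dataset`.

```python
import pandas as pd

# Toy frame standing in for the housing data (hypothetical values)
toy = pd.DataFrame({
    "total_rooms": [880, 7099, 1467, 1274],
    "total_bedrooms": [129.0, None, 190.0, 235.0],
})

# Share of rows that contain at least one missing value
missing_share = toy.isna().any(axis=1).mean()

# Drop incomplete rows and renumber the index, as in the cell above
cleaned = toy.dropna().reset_index(drop=True)

print(f"rows with missing values: {missing_share:.1%}")
print(f"rows kept: {len(cleaned)} of {len(toy)}")
```

If the share turned out to be large, imputing the missing values would probably be a safer choice than dropping rows.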

##Encoding text values as numeric values

Machine learning models generally require string values to be encoded as numbers.


In the block below, we replace "ocean proximity", an attribute stored as text, with integer values.

In [295]:
# Integer code assigned to each ocean_proximity category
oceanProximityCodes = {
    "ISLAND": 0,
    "NEAR BAY": 1,
    "NEAR OCEAN": 2,
    "<1H OCEAN": 3,
    "INLAND": 4,
}

# Work on a copy so the original dataset keeps its text labels
datasetEncoded = dataset.copy()
datasetEncoded["ocean_proximity"] = datasetEncoded["ocean_proximity"].map(oceanProximityCodes)

datasetEncoded

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity,median_house_value
0,-122.23,37.88,41,880,129.0,322,126,8.3252,1,452600
1,-122.22,37.86,21,7099,1106.0,2401,1138,8.3014,1,358500
2,-122.24,37.85,52,1467,190.0,496,177,7.2574,1,352100
3,-122.25,37.85,52,1274,235.0,558,219,5.6431,1,341300
4,-122.25,37.85,52,1627,280.0,565,259,3.8462,1,342200
...,...,...,...,...,...,...,...,...,...,...
20428,-121.09,39.48,25,1665,374.0,845,330,1.5603,4,78100
20429,-121.21,39.49,18,697,150.0,356,114,2.5568,4,77100
20430,-121.22,39.43,17,2254,485.0,1007,433,1.7000,4,92300
20431,-121.32,39.43,18,1860,409.0,741,349,1.8672,4,84700
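
Integer codes are compact, but they also imply an ordering between categories that may not reflect reality. As a point of comparison, the sketch below applies both the integer mapping and a one-hot encoding to a toy column. The one-hot variant is an alternative and not what this notebook uses, and the names `toy`, `codes`, and `one_hot` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "INLAND", "<1H OCEAN", "INLAND"]})

# Integer codes, matching the values used in this notebook
codes = {"ISLAND": 0, "NEAR BAY": 1, "NEAR OCEAN": 2, "<1H OCEAN": 3, "INLAND": 4}
toy["ocean_proximity_code"] = toy["ocean_proximity"].map(codes)

# One-hot alternative: one 0/1 column per category, with no implied ordering
one_hot = pd.get_dummies(toy["ocean_proximity"], prefix="ocean", dtype=int)

print(toy)
print(one_hot)
```

Tree-based models such as the one used later tend to cope reasonably well with integer codes, so the simpler encoding is a defensible choice here.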


##Location attributes

Although latitude and longitude are not the best attributes to analyze when socioeconomic factors matter more than the physical position of the property, we keep both because the dataset contains no other location data.

In this context, information such as state, city/district, or neighborhood would support more relevant analyses; if it were available, it would be more prudent to remove the latitude and longitude columns.


#Splitting Training and Test Sets

In [296]:
trainingDataset = datasetEncoded.sample(frac=0.9)

trainingDataset

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity,median_house_value
18993,-122.66,38.42,14,5315,1037.0,2228,950,4.0230,3,208400
5972,-117.72,34.06,32,2209,654.0,1718,569,1.9643,4,113200
10313,-117.62,33.42,23,2656,515.0,998,435,4.0294,2,500001
15257,-117.23,33.23,13,2899,657.0,1946,579,2.9875,3,172000
13733,-117.25,34.41,13,3682,668.0,1606,668,2.1875,4,119700
...,...,...,...,...,...,...,...,...,...,...
579,-122.08,37.72,32,2476,368.0,1048,367,5.6194,1,274700
10899,-117.76,33.79,4,8974,1268.0,3754,1241,8.2653,3,374000
17766,-121.94,37.34,42,2174,420.0,1304,464,3.1429,3,286500
4308,-118.36,34.11,35,3946,695.0,1361,620,6.5195,3,500001


In [297]:
# Features: every column except the target
X_trainingDataset = trainingDataset.drop("median_house_value", axis=1)
# Target: kept as a single-column DataFrame
Y_trainingDataset = trainingDataset[["median_house_value"]]

In [298]:
testDataset = datasetEncoded.drop(trainingDataset.index)

testDataset

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity,median_house_value
2,-122.24,37.85,52,1467,190.0,496,177,7.2574,1,352100
8,-122.26,37.84,42,2555,665.0,1206,595,2.0804,1,226700
9,-122.25,37.84,52,3549,707.0,1551,714,3.6912,1,261100
18,-122.26,37.84,50,2239,455.0,990,419,1.9911,1,158700
47,-122.27,37.82,43,1007,312.0,558,253,1.7348,1,137500
...,...,...,...,...,...,...,...,...,...,...
20354,-121.77,38.67,42,2670,518.0,1548,534,2.2794,4,108900
20371,-121.81,38.84,37,352,65.0,238,67,2.8542,4,275000
20383,-121.59,39.14,41,1492,350.0,804,353,1.6840,4,71300
20402,-121.56,39.11,18,2171,480.0,1527,447,2.3011,4,57500


In [299]:
# Features: every column except the target
X_testDataset = testDataset.drop("median_house_value", axis=1)
# Target: kept as a single-column DataFrame
Y_testDataset = testDataset[["median_house_value"]]
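
Because `sample` draws rows at random, each run of the notebook produces a different split and slightly different metrics. The sketch below repeats the same sample-then-drop approach on a toy DataFrame with a fixed seed. The `random_state=42` value is an assumption added for reproducibility, and the names `toy`, `train`, and `test` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "median_income": range(10),
    "median_house_value": range(100, 110),
})

# random_state is an assumed seed so the same rows are sampled on every run
train = toy.sample(frac=0.9, random_state=42)

# Everything that was not sampled for training becomes the test set
test = toy.drop(train.index)

X_train = train.drop("median_house_value", axis=1)
y_train = train["median_house_value"]  # 1-D Series target

print(len(train), len(test))
```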

#Gradient Boosting Regressor
We use a generic Gradient Boosting Regressor (default hyperparameters) for the regression task.

The Gradient Boosting Regressor is an ensemble model of decision trees: during training it builds trees sequentially, each one correcting the errors of the ensemble so far, so that many weak models combine into a stronger one.
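
To make the idea of combining weak models concrete, here is a simplified, pure-Python sketch of boosting for squared error on one feature, using one-split "stumps" as the weak learners. It is an illustration only and not scikit-learn's implementation. The helper names (`fit_stump`, `boost`, `mse`), the toy data, and the `n_rounds` and `learning_rate` values are all assumptions.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on xs that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


def mse(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [100.0, 110.0, 150.0, 300.0, 320.0, 330.0]

boosted_predict = boost(xs, ys)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [boosted_predict(x) for x in xs])

print(f"baseline MSE: {baseline_mse:.1f}")
print(f"boosted MSE:  {boosted_mse:.1f}")
```

Each stump on its own is a poor predictor, but every round removes part of the remaining error, which is the behavior the ensemble description above refers to.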

In [300]:
from sklearn.ensemble import GradientBoostingRegressor

# Default hyperparameters
model = GradientBoostingRegressor()

# fit() expects a 1-D target, so flatten the single-column DataFrame
model.fit(X_trainingDataset, Y_trainingDataset.values.ravel())

prediction = model.predict(X_testDataset)


In [301]:
predictionDataFrame = pd.DataFrame(prediction)

predictionDataFrame

Unnamed: 0,0
0,380794.827376
1,181707.299450
2,276819.406091
3,160616.541516
4,174864.527453
...,...
2038,89892.142521
2039,107084.803470
2040,82134.987817
2041,76075.793248


##Evaluating the predictions

Common metrics such as MAE and MSE are hard to interpret because our values are not normalized, so MAPE, RMSLE, and R² were chosen instead.

Because our results are property price predictions, with values in the hundreds of thousands, MAE and MSE would be very large in absolute terms regardless of model quality, which makes them unreliable indicators of prediction quality.
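
The scale argument can be checked with a small pure-Python sketch. The helper functions below are hypothetical re-implementations written for illustration (the notebook itself uses scikit-learn's versions), and the prices are made up. Expressing the same prices in thousands of dollars changes MAE by a factor of 1,000, while MAPE and R² stay the same.

```python
import math


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmsle(y_true, y_pred):
    # Uses log(1 + y), the same convention as scikit-learn's mean_squared_log_error
    squared_logs = [(math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_logs) / len(y_true))


def r2(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up prices in dollars
prices = [200000.0, 350000.0, 500000.0]
predicted = [220000.0, 315000.0, 550000.0]

# The same prices expressed in thousands of dollars
prices_k = [p / 1000 for p in prices]
predicted_k = [p / 1000 for p in predicted]

print(f"MAE:   {mae(prices, predicted):.2f} vs {mae(prices_k, predicted_k):.2f}")
print(f"MAPE:  {mape(prices, predicted):.4f} vs {mape(prices_k, predicted_k):.4f}")
print(f"RMSLE: {rmsle(prices, predicted):.4f} vs {rmsle(prices_k, predicted_k):.4f}")
print(f"R2:    {r2(prices, predicted):.4f} vs {r2(prices_k, predicted_k):.4f}")
```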

In [302]:
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_log_error, r2_score
import numpy as np


mape = mean_absolute_percentage_error(Y_testDataset, predictionDataFrame)
msle = mean_squared_log_error(Y_testDataset, predictionDataFrame)
r2 = r2_score(Y_testDataset, predictionDataFrame)
rmsle = np.sqrt(msle)

print(f"MAPE: {mape:.2f}")
print(f"RMSLE: {rmsle:.2f}")
print(f"R2: {r2:.2f}")

MAPE: 0.23
RMSLE: 0.27
R2: 0.78


Looking at the MAPE, RMSLE, and R² values, we can conclude that the results are acceptable for a generic model, keeping in mind that better location variables would likely lead to more accurate predictions.

The small difference between MAPE (Mean Absolute Percentage Error) and RMSLE (Root Mean Squared Logarithmic Error) suggests that there were few critical errors, in which the predicted and true values diverge completely.

#Plotting the Map

In [303]:
import folium

# Note: the name `map` shadows Python's built-in map() for the rest of the notebook
map = folium.Map(tiles='openstreetmap',
                 location=[36.69245689448622, -119.7175285756223],
                 zoom_start=7,
                 zoom_control=True)

In [304]:
import math

housesLayer = folium.FeatureGroup(name="Houses", show=True)

# Draw one marker per house in the test set
for i in range(len(testDataset)):
  house = testDataset.iloc[i]
  coordinates = [house['latitude'], house['longitude']]
  truePrice = Y_testDataset.iloc[i, 0]
  predictedPrice = int(predictionDataFrame.iloc[i, 0])

  # Relative error of the prediction decides the marker color
  error = math.fabs((predictedPrice - truePrice) / truePrice)
  if error <= 0.05:
    color = 'blue'
  elif error <= 0.1:
    color = 'green'
  elif error <= 0.2:
    color = 'yellow'
  elif error <= 0.5:
    color = 'orange'
  else:
    color = 'red'

  housePopup = folium.Popup(f"Predicted Price: {predictedPrice}<br>Real Price: {int(truePrice)}", max_width=500)
  # Add the marker to the layer (not directly to the map) so the layer holds the houses
  folium.CircleMarker(location=coordinates,
                      popup=housePopup,
                      fill_color=color,
                      fill_opacity=0.6,
                      stroke=False,
                      ).add_to(housesLayer)

housesLayer.add_to(map)

<folium.map.FeatureGroup at 0x7adb8ef2bdc0>

In [305]:
map.save('map.html')
map
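
The color thresholds used when drawing the markers can also be factored into a small helper, which makes the mapping from relative error to color easier to read and test. This is a sketch: `error_color` is a hypothetical function name, and the thresholds are copied from the loop in the notebook.

```python
def error_color(true_price, predicted_price):
    """Return the marker color for a prediction, based on its relative error."""
    error = abs(predicted_price - true_price) / true_price
    # (upper bound on the relative error, color), checked in ascending order
    thresholds = [(0.05, 'blue'), (0.1, 'green'), (0.2, 'yellow'), (0.5, 'orange')]
    for limit, color in thresholds:
        if error <= limit:
            return color
    return 'red'


print(error_color(342200, 350000))
print(error_color(100000, 200000))
```

With this helper, the loop body would only need `color = error_color(truePrice, predictedPrice)` in place of the if/elif chain.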