<h1><center><b>Study of the São Paulo Real Estate Market</b></center></h1>

<h2>Introduction</h2>
<font size=3>São Paulo is the financial center of Brazil and covers an area of 1,521 km², which makes it the 10th largest city in the world. From this we can infer that the dataset will contain many properties, with plenty of variety for our analysis.</font>

<a name='INDICE'></a>
<h2>Table of Contents</h2>
<ol>
    <li><a href='#PREPRO'>Preprocessing</a></li>
    <ol>
        <li><a href='#BIBLIO'>Importing the libraries</a></li>
        <li><a href='#DATASET'>Importing the dataset</a></li>
        <li><a href='#SPLIT'>Splitting into <i>train</i> and <i>test</i></a></li>
        <li><a href='#FEAT'>Feature engineering</a></li>
    </ol>
    <li><a href='#MODEL'>Model</a></li>
    <ol>
        <li><a href='#RANDF'>Random Forest</a></li>
    </ol>
</ol>

<a name='PREPRO'></a>
<h2>Preprocessing</h2>
<font size=3>In this part the dataset is <i>"cleaned"</i> so that it can be used by the model.</font>
<a name='BIBLIO'></a>
<h3>Importing the libraries</h3>

In [172]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_squared_log_error

warnings.filterwarnings("ignore")
pd.options.display.max_columns = 999

<a name='DATASET'></a>
<h3>Importing the dataset</h3>

In [137]:
df = pd.read_csv('./imoveis-sp.csv')
df.head()

index,Price,Condo,Size,Rooms,Toilets,Suites,Parking,Elevator,Furnished,Swimming Pool,New,District,Negotiation Type,Property Type,Latitude,Longitude,Subway Station,Dist2Subway
0,930,220,47,2,2,1,1,0,0,0,0,Artur Alvim,rent,apartment,-23.543138,-46.479486,Artur Alvim,0.621993
1,1000,148,45,2,2,1,1,0,0,0,0,Artur Alvim,rent,apartment,-23.550239,-46.480718,Artur Alvim,1.179514
2,1000,100,48,2,2,1,1,0,0,0,0,Artur Alvim,rent,apartment,-23.542818,-46.485665,Artur Alvim,0.301435
3,1000,200,48,2,2,1,1,0,0,0,0,Artur Alvim,rent,apartment,-23.547171,-46.483014,Artur Alvim,0.786418
4,1300,410,55,2,2,1,1,1,0,0,0,Artur Alvim,rent,apartment,-23.525025,-46.482436,Artur Alvim,1.701374


<font size=3>Adding the target variable, which is the sum of the <i>price</i> and the <i>condo fee</i>.</font>

In [138]:
df['y'] = df['Price'] + df['Condo']

<a name='SPLIT'></a>
<h3>Splitting into <i>train</i> and <i>test</i></h3>

In [139]:
X = df.drop(labels=['Price', 'Condo', 'y'], axis=1)
y = df['y']

In [140]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=75)
print('X_train:{} \ny_train:{} \nX_test:\t{} \ny_test:\t{}'.format(X_train.shape,y_train.shape,X_test.shape,y_test.shape))

X_train:(8931, 16) 
y_train:(8931,) 
X_test:	(3828, 16) 
y_test:	(3828,)
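
<font size=3>The shapes printed above follow from the 70/30 split. As a minimal sketch, assuming only the total row count (8931 + 3828 = 12759) and placeholder arrays in place of the real data, the arithmetic can be reproduced like this. The names <i>X_demo</i> and <i>y_demo</i> are hypothetical.</font>

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 12759  # 8931 + 3828, the row counts printed above

# Placeholder arrays standing in for the real features and target
X_demo = np.arange(n_rows).reshape(-1, 1)
y_demo = np.arange(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=75
)

# With a fractional test_size, the test share is rounded up
# and the remaining rows go to the training set
expected_test = math.ceil(0.3 * n_rows)
expected_train = n_rows - expected_test

print(len(X_tr), len(X_te))
```

<font size=3>Since 0.3 × 12759 = 3827.7, the test set gets 3828 rows and the training set the remaining 8931.</font>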


In [141]:
X_train.tail()

index,Size,Rooms,Toilets,Suites,Parking,Elevator,Furnished,Swimming Pool,New,District,Negotiation Type,Property Type,Latitude,Longitude,Subway Station,Dist2Subway
8128,47,2,2,1,1,1,0,1,0,São Mateus,sale,apartment,-23.594348,-46.46743,Vila Uniao,5.004922
5585,67,3,2,1,1,0,0,0,0,Campo Limpo,rent,apartment,-23.652268,-46.766401,Capao Redondo,0.781525
2067,57,2,2,1,1,1,0,0,0,Rio Pequeno,rent,apartment,-23.559315,-46.748244,Sao Paulo Morumbi,3.872227
8560,46,2,2,1,1,0,0,0,0,Cidade Ademar,sale,apartment,-23.674722,-46.653257,Jabaquara,3.381422
4344,92,3,2,1,1,0,0,0,0,Pinheiros,rent,apartment,-23.562718,-46.674759,Oscar Freire,0.353806


<a name='FEAT'></a>
<h3>Feature engineering</h3>
<font size=3>Checking the number of distinct values in each feature.</font>

In [142]:
for feature in X_train.columns:
    print('{}: {}'.format(feature, X_train[feature].nunique()))

Size: 314
Rooms: 8
Toilets: 8
Suites: 6
Parking: 10
Elevator: 2
Furnished: 2
Swimming Pool: 2
New: 2
District: 96
Negotiation Type: 2
Property Type: 1
Latitude: 6352
Longitude: 6395
Subway Station: 79
Dist2Subway: 6466


<font size=3>Since the feature <i>Property Type</i> contains only one distinct value, it will be removed. <i>Subway Station</i> is, most of the time, equal to the <i>District</i> field, so we will remove it as well.</font>

In [143]:
X_train.drop(['Property Type','Subway Station'], axis=1, inplace=True)
X_test.drop(['Property Type', 'Subway Station'], axis=1, inplace=True)
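
<font size=3>Both reasons for dropping a column can be checked directly in pandas. The sketch below uses a hypothetical four-row frame (<i>toy</i>) rather than the real dataset, so the printed rate illustrates the technique only and says nothing about the actual data.</font>

```python
import pandas as pd

# Hypothetical miniature of the dataset, for illustration only
toy = pd.DataFrame({
    'District': ['Artur Alvim', 'Pinheiros', 'São Mateus', 'Artur Alvim'],
    'Subway Station': ['Artur Alvim', 'Oscar Freire', 'Vila Uniao', 'Artur Alvim'],
    'Property Type': ['apartment'] * 4,
})

# Columns with a single distinct value carry no information for the model
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]

# Share of rows where the two columns hold exactly the same string
match_rate = (toy['District'] == toy['Subway Station']).mean()

print(constant_cols, match_rate)
```

<font size=3>Running the same two expressions on <i>X_train</i> before the drop would show how often <i>Subway Station</i> and <i>District</i> really coincide.</font>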

In [144]:
X_train.dtypes

Size                  int64
Rooms                 int64
Toilets               int64
Suites                int64
Parking               int64
Elevator              int64
Furnished             int64
Swimming Pool         int64
New                   int64
District             object
Negotiation Type     object
Latitude            float64
Longitude           float64
Dist2Subway         float64
dtype: object

<font size=3><i>OneHotEncoder</i> converts categorical variables into numeric ones. We first transform the variables of the training set.</font>

In [157]:
cat_cols = ['Negotiation Type', 'District']  # each categorical column listed once
# Requires scikit-learn >= 1.2; older releases use sparse=False and get_feature_names()
enc = OneHotEncoder(drop='first', sparse_output=False)
train_cat_feat = enc.fit_transform(X_train[cat_cols])
train_cat_feat = pd.DataFrame(train_cat_feat)
train_cat_feat.columns = enc.get_feature_names_out(cat_cols)
train_cat_feat.index = X_train.index
train_num_feat = X_train.drop(cat_cols, axis=1)
X_train_feat = pd.merge(train_cat_feat, train_num_feat, left_index=True, right_index=True)

<font size=3>Applying the same procedure to the test data, reusing the encoder fitted on the training set.</font>

In [158]:
test_cat_feat = enc.transform(X_test[cat_cols])
test_cat_feat = pd.DataFrame(test_cat_feat)
test_cat_feat.columns = enc.get_feature_names_out(cat_cols)
test_cat_feat.index = X_test.index
test_num_feat = X_test.drop(cat_cols, axis=1)
X_test_feat = pd.merge(test_cat_feat, test_num_feat, left_index=True, right_index=True)

<a href='#INDICE'>Back to the table of contents</a>
<a name='MODEL'></a>
<h2>Model</h2>
<a name='RANDF'></a>
<h3>Random Forest</h3>

In [177]:
rf = RandomForestRegressor(n_estimators=300, max_depth=300, min_samples_leaf=2, random_state=75)
rf.fit(X_train_feat, y_train)
y_train_pred = rf.predict(X_train_feat)

In [191]:
def metrics(y_true, y_pred):
    mae = mean_absolute_error(y_true, y_pred)
    rmse = mean_squared_error(y_true, y_pred)**(1/2)
    msle = mean_squared_log_error(y_true, y_pred)
    print('Mean Absolute Error: \t {}\nRoot Mean Squared Error: {}\nMean Squared Log Error:  {}'
          .format(mae,rmse,msle))
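
<font size=3>For reference, the three quantities reported by <i>metrics</i> can be written out by hand. The sketch below uses made-up values (<i>y_true_toy</i>, <i>y_pred_toy</i>) and compares the NumPy formulas with the scikit-learn functions. Note that the squared log error is computed on log(1 + y).</font>

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    mean_squared_log_error,
)

# Made-up values, for illustration only
y_true_toy = np.array([1000.0, 1500.0, 3000.0])
y_pred_toy = np.array([1100.0, 1400.0, 2500.0])

errors = y_true_toy - y_pred_toy

mae = np.mean(np.abs(errors))          # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))   # root mean squared error
# mean squared log error: squared difference of log(1 + y)
msle = np.mean((np.log1p(y_true_toy) - np.log1p(y_pred_toy)) ** 2)

print(mae, rmse, msle)
print(
    mean_absolute_error(y_true_toy, y_pred_toy),
    mean_squared_error(y_true_toy, y_pred_toy) ** 0.5,
    mean_squared_log_error(y_true_toy, y_pred_toy),
)
```

<font size=3>MAE and RMSE are expressed in the units of the target, whereas the log-based error depends on the ratio between prediction and true value, so it is less dominated by the most expensive properties.</font>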

In [192]:
metrics(y_train, y_train_pred)

Mean Absolute Error: 	 21791.1878981434
Root Mean Squared Error: 74125.02334204428
Mean Squared Log Error:  0.012674304165928886
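
<font size=3>The figures above are computed on the training data, which the forest has already seen. A common next step is to compare them with the held-out set, which in this notebook would presumably be <i>metrics(y_test, rf.predict(X_test_feat))</i>. The sketch below shows the idea on synthetic data. All names ending in <i>_syn</i> and the data-generating formula are assumptions made for illustration.</font>

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem: linear signal plus Gaussian noise
rng = np.random.default_rng(75)
X_syn = rng.uniform(0, 1, size=(500, 3))
y_syn = 1000 * X_syn[:, 0] + 500 * X_syn[:, 1] + rng.normal(0, 100, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=75
)

rf_syn = RandomForestRegressor(n_estimators=50, min_samples_leaf=2, random_state=75)
rf_syn.fit(X_tr, y_tr)

# Error on data the model was fitted on versus data it has never seen
mae_train = mean_absolute_error(y_tr, rf_syn.predict(X_tr))
mae_test = mean_absolute_error(y_te, rf_syn.predict(X_te))

print(mae_train, mae_test)
```

<font size=3>A training error well below the test error suggests that the forest is partly memorising the training rows, so the test figure is the one to treat as the estimate of real-world performance.</font>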
