In [346]:
import pandas as pd
import numpy as np
import re
from ast import literal_eval

# Exploratory Analysis of Real Estate Data

## 01. Dataset: Context


Write a brief description of the data and some hypotheses.

In [347]:
pd.set_option('display.max_columns', 30)

In [348]:
df = pd.read_csv('dados_imoveis_sp.csv')
df.head()

Unnamed: 0,amenities,usableAreas,id,parkingSpaces,address,suites,bathrooms,totalAreas,bedrooms,pricingInfos
0,"['PETS_ALLOWED', 'ELEVATOR', 'GARDEN', 'ELECTR...",['101'],2574084550,[1],"{'country': 'BR', 'zipCode': '04734003', 'geoJ...",[],[2],['111'],[2],"[{'rentalInfo': {'period': 'MONTHLY', 'warrant..."
1,"['POOL', 'FURNISHED', 'BARBECUE_GRILL', 'ELEVA...",['140'],2583748663,[2],"{'country': 'BR', 'zipCode': '01307000', 'geoJ...",[2],[4],[],[2],"[{'rentalInfo': {'period': 'MONTHLY', 'warrant..."
2,"['POOL', 'FURNISHED', 'BARBECUE_GRILL', 'ELEVA...",['50'],2562971980,[1],"{'country': 'BR', 'zipCode': '01209010', 'geoJ...",[0],[1],['50'],[2],"[{'rentalInfo': {'period': 'MONTHLY', 'warrant..."
3,"['POOL', 'BARBECUE_GRILL', 'GATED_COMMUNITY', ...",['58'],2580478200,[1],"{'country': 'BR', 'zipCode': '01127000', 'geoJ...",[],[1],[],[2],"[{'rentalInfo': {'period': 'MONTHLY', 'warrant..."
4,"['PETS_ALLOWED', 'GATED_COMMUNITY', 'ELECTRONI...",['64'],2583729583,[1],"{'country': 'BR', 'zipCode': '05435001', 'geoJ...",[],[1],['80'],[2],"[{'rentalInfo': {'period': 'MONTHLY', 'warrant..."


## 02. Data Cleaning and Preprocessing

The columns holding numeric values in this dataset are stored as **strings** (for example `['101']` or `[2]`). These numeric fields need a **type conversion**.

In [349]:
cols = ['usableAreas','parkingSpaces','suites','bathrooms','totalAreas','bedrooms']

for var in cols:
    # Raw string avoids invalid-escape warnings; the pattern captures the digits inside [ ] with optional quotes
    s_extracted_digits = df[var].str.extract(r"\['?(\d*)'?\]").squeeze()
    df[var] = s_extracted_digits.apply(lambda x: int(x) if x.isdigit() else np.nan)
    
df.head()

Unnamed: 0,amenities,usableAreas,id,parkingSpaces,address,suites,bathrooms,totalAreas,bedrooms,pricingInfos
0,"['PETS_ALLOWED', 'ELEVATOR', 'GARDEN', 'ELECTR...",101,2574084550,1.0,"{'country': 'BR', 'zipCode': '04734003', 'geoJ...",,2,111.0,2,"[{'rentalInfo': {'period': 'MONTHLY', 'warrant..."
1,"['POOL', 'FURNISHED', 'BARBECUE_GRILL', 'ELEVA...",140,2583748663,2.0,"{'country': 'BR', 'zipCode': '01307000', 'geoJ...",2.0,4,,2,"[{'rentalInfo': {'period': 'MONTHLY', 'warrant..."
2,"['POOL', 'FURNISHED', 'BARBECUE_GRILL', 'ELEVA...",50,2562971980,1.0,"{'country': 'BR', 'zipCode': '01209010', 'geoJ...",0.0,1,50.0,2,"[{'rentalInfo': {'period': 'MONTHLY', 'warrant..."
3,"['POOL', 'BARBECUE_GRILL', 'GATED_COMMUNITY', ...",58,2580478200,1.0,"{'country': 'BR', 'zipCode': '01127000', 'geoJ...",,1,,2,"[{'rentalInfo': {'period': 'MONTHLY', 'warrant..."
4,"['PETS_ALLOWED', 'GATED_COMMUNITY', 'ELECTRONI...",64,2583729583,1.0,"{'country': 'BR', 'zipCode': '05435001', 'geoJ...",,1,80.0,2,"[{'rentalInfo': {'period': 'MONTHLY', 'warrant..."
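To see what this conversion does value by value, here is a minimal sketch on a toy Series. The sample strings are made up to mimic the raw format in the preview, and `pd.to_numeric` is swapped in for the `lambda` used above, so this is an illustration rather than the notebook's exact method.

```python
import numpy as np
import pandas as pd

# Toy values mimicking the raw format seen in the dataset preview (assumed samples).
raw = pd.Series(["['101']", "[2]", "[]", "[0]"])

# Capture the digits between the brackets, with or without quotes.
digits = raw.str.extract(r"\['?(\d*)'?\]", expand=False)

# Empty captures become NaN; everything else becomes a number.
converted = pd.to_numeric(digits, errors="coerce")
print(converted.tolist())
```

Because a column with any NaN cannot be stored as plain integers, such columns come out as floats, which matches the `1.0` and `2.0` values shown in the preview.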


Several columns also carry a lot of information that will help us characterize each property better.
However, this information is in raw form ("raw data") and needs to be processed.

The *'address'* column looks like a dictionary, but its type is string.
We need to parse these rows so that we can navigate them and extract their attributes easily.

In [350]:
print(df['address'][0],'\n')
print('Data type:',type(df['address'][0]))

{'country': 'BR', 'zipCode': '04734003', 'geoJson': '', 'city': 'São Paulo', 'streetNumber': '1850', 'level': 'STREET', 'precision': 'ROOFTOP', 'confidence': 'VALID_STREET', 'stateAcronym': 'SP', 'source': 'CORREIOS', 'point': {'lon': -46.695829, 'source': 'GOOGLE', 'lat': -23.638282}, 'ibgeCityId': '', 'zone': 'Zona Sul', 'street': 'Avenida Adolfo Pinheiro', 'locationId': 'BR>Sao Paulo>NULL>Sao Paulo>Zona Sul>Santo Amaro', 'district': '', 'name': '', 'state': 'São Paulo', 'neighborhood': 'Santo Amaro', 'poisList': ['BS:Graham Bell C/B', 'BS:Graham Bell B/C', 'BS:Rua Verbo Divino, 61', 'BS:Américo Brasiliense C/B', 'BS:Parada Marechal Deodoro 2 - B/C', 'TS:Graham Bell C/B', 'TS:Graham Bell B/C', 'TS:Rua Verbo Divino, 61', 'TS:Américo Brasiliense C/B', 'TS:Parada Marechal Deodoro 2 - B/C', 'CS:7 Molinos', 'CS:Casa de Pães Neblina Paulista', 'CS:Casa de Bolo', 'CS:Berna', 'CS:Gêmel', 'VP:Kennel Club'], 'pois': [], 'valuableZones': [{'city': 'São Paulo', 'zone': 'Zona Sul', 'name': 'Cháca

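The **literal_eval** function is a handy tool from the **ast (Abstract Syntax Trees)** module.

It safely evaluates a string containing a Python literal (strings, numbers, tuples, lists, dicts, sets, booleans, `None`) and returns the corresponding object. Unlike `eval`, it does not execute arbitrary code.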
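As a small illustration of that behavior, the sketch below parses a dictionary-like string and then tries a string containing a function call. The sample strings are invented for the example.

```python
from ast import literal_eval

# A string that looks like a dictionary, similar to the 'address' column (made-up sample).
raw = "{'zone': 'Zona Sul', 'point': {'lat': -23.638282, 'lon': -46.695829}}"

parsed = literal_eval(raw)
print(type(parsed).__name__)   # dict
print(parsed['point']['lat'])  # -23.638282

# Anything that is not a plain literal is rejected instead of being executed.
try:
    literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print(rejected)                # True
```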

In [351]:
print(literal_eval(df['address'][0]),'\n')
print('Data type:',type(literal_eval(df['address'][0])))

{'country': 'BR', 'zipCode': '04734003', 'geoJson': '', 'city': 'São Paulo', 'streetNumber': '1850', 'level': 'STREET', 'precision': 'ROOFTOP', 'confidence': 'VALID_STREET', 'stateAcronym': 'SP', 'source': 'CORREIOS', 'point': {'lon': -46.695829, 'source': 'GOOGLE', 'lat': -23.638282}, 'ibgeCityId': '', 'zone': 'Zona Sul', 'street': 'Avenida Adolfo Pinheiro', 'locationId': 'BR>Sao Paulo>NULL>Sao Paulo>Zona Sul>Santo Amaro', 'district': '', 'name': '', 'state': 'São Paulo', 'neighborhood': 'Santo Amaro', 'poisList': ['BS:Graham Bell C/B', 'BS:Graham Bell B/C', 'BS:Rua Verbo Divino, 61', 'BS:Américo Brasiliense C/B', 'BS:Parada Marechal Deodoro 2 - B/C', 'TS:Graham Bell C/B', 'TS:Graham Bell B/C', 'TS:Rua Verbo Divino, 61', 'TS:Américo Brasiliense C/B', 'TS:Parada Marechal Deodoro 2 - B/C', 'CS:7 Molinos', 'CS:Casa de Pães Neblina Paulista', 'CS:Casa de Bolo', 'CS:Berna', 'CS:Gêmel', 'VP:Kennel Club'], 'pois': [], 'valuableZones': [{'city': 'São Paulo', 'zone': 'Zona Sul', 'name': 'Cháca

The *'pricingInfos'* column may hold two dictionaries: one with the sale price and another with the rental price.
We are only interested in the rental price.

In [352]:
literal_eval(df['pricingInfos'][91])

[{'rentalInfo': {'period': 'MONTHLY',
   'warranties': ['INSURANCE_GUARANTEE', 'GUARANTOR']},
  'yearlyIptu': '120',
  'price': '410000',
  'businessType': 'SALE',
  'monthlyCondoFee': '700'},
 {'rentalInfo': {'period': 'MONTHLY',
   'warranties': ['INSURANCE_GUARANTEE', 'GUARANTOR'],
   'monthlyRentalTotalPrice': '2400'},
  'yearlyIptu': '120',
  'price': '1700',
  'businessType': 'RENTAL',
  'monthlyCondoFee': '700'}]

The *'amenities'* column lists the amenities each property offers.
- How can we count their frequency across our dataset?

In [353]:
df['amenities'][:5]

0    ['PETS_ALLOWED', 'ELEVATOR', 'GARDEN', 'ELECTR...
1    ['POOL', 'FURNISHED', 'BARBECUE_GRILL', 'ELEVA...
2    ['POOL', 'FURNISHED', 'BARBECUE_GRILL', 'ELEVA...
3    ['POOL', 'BARBECUE_GRILL', 'GATED_COMMUNITY', ...
4    ['PETS_ALLOWED', 'GATED_COMMUNITY', 'ELECTRONI...
Name: amenities, dtype: object

In [354]:
def extract_neighborhood(address):
    address = literal_eval(address)
    neighborhood = address['neighborhood']
    
    return neighborhood

In [355]:
def extract_zone(address):
    address = literal_eval(address)
    zone = address['zone']
    
    return zone

In [356]:
def extract_zipcode(address):
    address = literal_eval(address)
    zipCode = address['zipCode']
    
    return zipCode

In [357]:
def get_rental_price(pricingInfos):
    # Keep only the price of the entry whose business type is RENTAL
    price = [info['price']
             for info in literal_eval(pricingInfos)
             if info['businessType'] == 'RENTAL'][0]
    
    return float(price)
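The function above raises an `IndexError` if a listing has no `RENTAL` entry. A more defensive variant could look like the sketch below. The name `get_rental_price_safe` and the sample strings are hypothetical, and returning NaN for missing prices is a choice made here, not something the notebook does.

```python
from ast import literal_eval
import math

def get_rental_price_safe(pricing_infos):
    """Return the first RENTAL price as a float, or NaN if there is none."""
    for info in literal_eval(pricing_infos):
        if info.get('businessType') == 'RENTAL' and info.get('price'):
            return float(info['price'])
    return math.nan

# Made-up samples: one with both entries, one with a sale entry only.
both = "[{'price': '410000', 'businessType': 'SALE'}, {'price': '1700', 'businessType': 'RENTAL'}]"
sale_only = "[{'price': '410000', 'businessType': 'SALE'}]"

print(get_rental_price_safe(both))       # 1700.0
print(get_rental_price_safe(sale_only))  # nan
```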

In [358]:
def strings_para_lista(string):
    lista = string.replace(' ','').replace("'","")
    lista = lista.replace('[','').replace(']','').split(',')
    
    return lista

In [359]:
df['amenities'] = df['amenities'].apply(strings_para_lista)

In [360]:
print('Top 10 amenities of São Paulo properties:')
print(df['amenities'].explode().value_counts()[:10])

Top 10 amenities of São Paulo properties:
ELEVATOR           5129
POOL               4613
PARTY_HALL         4110
BARBECUE_GRILL     3961
SERVICE_AREA       3729
GYM                3687
PLAYGROUND         3357
GARDEN             3203
INTERCOM           2899
GATED_COMMUNITY    2854
Name: amenities, dtype: int64
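The counting step can be reproduced on toy data. This sketch parses each string with `literal_eval` instead of the string replacement used in `strings_para_lista`, so an empty list stays empty rather than becoming a list holding one empty string. The sample values are invented.

```python
from ast import literal_eval
import pandas as pd

# Made-up sample in the same raw format as the 'amenities' column.
toy = pd.Series(["['POOL', 'ELEVATOR']", "['ELEVATOR']", "[]"])

# Parse each string into a real list, then put one amenity per row.
parsed = toy.apply(literal_eval)
counts = parsed.explode().value_counts()
print(counts.to_dict())
```

`explode` turns an empty list into a missing value, and `value_counts` ignores missing values by default, so properties without amenities do not add a spurious category.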


In [361]:
top10_amenities = list(df['amenities'].explode().value_counts()[:10].index)

In [362]:
def has_amenity(amenities,amenity):
    if amenity in amenities:
        return 1
    else:
        return 0

In [363]:
# Create one column for each of the top 10 amenities,
# with a binary value marking whether the property has it (1) or not (0)
for amenity in top10_amenities:
    df[amenity.lower()] = df['amenities'].apply(has_amenity,amenity=amenity)

In [364]:
df['zipCode'] = df['address'].apply(extract_zipcode)
df['zone'] = df['address'].apply(extract_zone)
df['neighborhood'] = df['address'].apply(extract_neighborhood)

In [365]:
df.head()

Unnamed: 0,amenities,usableAreas,id,parkingSpaces,address,suites,bathrooms,totalAreas,bedrooms,pricingInfos,elevator,pool,party_hall,barbecue_grill,service_area,gym,playground,garden,intercom,gated_community,zipCode,zone,neighborhood
0,"[PETS_ALLOWED, ELEVATOR, GARDEN, ELECTRONIC_GA...",101,2574084550,1.0,"{'country': 'BR', 'zipCode': '04734003', 'geoJ...",,2,111.0,2,"[{'rentalInfo': {'period': 'MONTHLY', 'warrant...",1,0,0,0,0,0,0,1,0,0,4734003,Zona Sul,Santo Amaro
1,"[POOL, FURNISHED, BARBECUE_GRILL, ELEVATOR, GY...",140,2583748663,2.0,"{'country': 'BR', 'zipCode': '01307000', 'geoJ...",2.0,4,,2,"[{'rentalInfo': {'period': 'MONTHLY', 'warrant...",1,1,1,1,0,1,1,1,0,0,1307000,Centro,Consolação
2,"[POOL, FURNISHED, BARBECUE_GRILL, ELEVATOR, GA...",50,2562971980,1.0,"{'country': 'BR', 'zipCode': '01209010', 'geoJ...",0.0,1,50.0,2,"[{'rentalInfo': {'period': 'MONTHLY', 'warrant...",1,1,1,1,0,1,1,1,0,1,1209010,Centro,Santa Efigênia
3,"[POOL, BARBECUE_GRILL, GATED_COMMUNITY, GYM, G...",58,2580478200,1.0,"{'country': 'BR', 'zipCode': '01127000', 'geoJ...",,1,,2,"[{'rentalInfo': {'period': 'MONTHLY', 'warrant...",0,1,1,1,0,1,1,1,0,1,1127000,Centro,Bom Retiro
4,"[PETS_ALLOWED, GATED_COMMUNITY, ELECTRONIC_GAT...",64,2583729583,1.0,"{'country': 'BR', 'zipCode': '05435001', 'geoJ...",,1,80.0,2,"[{'rentalInfo': {'period': 'MONTHLY', 'warrant...",1,0,0,0,0,0,0,0,0,1,5435001,Zona Oeste,Sumarezinho


In [366]:
df['rental_price'] = df['pricingInfos'].apply(get_rental_price)

In [367]:
df.drop(['amenities','address','pricingInfos'], axis=1,inplace=True)

In [368]:
df.head()

Unnamed: 0,usableAreas,id,parkingSpaces,suites,bathrooms,totalAreas,bedrooms,elevator,pool,party_hall,barbecue_grill,service_area,gym,playground,garden,intercom,gated_community,zipCode,zone,neighborhood,rental_price
0,101,2574084550,1.0,,2,111.0,2,1,0,0,0,0,0,0,1,0,0,4734003,Zona Sul,Santo Amaro,2300.0
1,140,2583748663,2.0,2.0,4,,2,1,1,1,1,0,1,1,1,0,0,1307000,Centro,Consolação,9500.0
2,50,2562971980,1.0,0.0,1,50.0,2,1,1,1,1,0,1,1,1,0,1,1209010,Centro,Santa Efigênia,3000.0
3,58,2580478200,1.0,,1,,2,0,1,1,1,0,1,1,1,0,1,1127000,Centro,Bom Retiro,1900.0
4,64,2583729583,1.0,,1,80.0,2,1,0,0,0,0,0,0,0,0,1,5435001,Zona Oeste,Sumarezinho,2400.0


There are repeated 'id' values, which may indicate duplicated records in the dataset. In fact, the dataset contains at most 72 repeated rows.

In [369]:
df['id'].unique().shape

(9928,)

However, `duplicated` applied to the whole dataframe finds only 71 fully duplicated rows, whereas applied to the 'id' column alone it finds 72 repeats. Looking closer, the property with 'id' = 2583627481 appears twice with different rental prices, which is why the two counts differ. We chose to drop every repeated occurrence of an 'id', keeping only the first one.
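The difference between the two checks is easy to see on a toy frame. The values below are made up; only the calls mirror what the notebook uses.

```python
import pandas as pd

toy = pd.DataFrame({
    'id': [1, 1, 2, 2, 3],
    'rental_price': [3000.0, 3000.0, 3000.0, 2900.0, 1500.0],
})

full_row_dupes = toy.duplicated().sum()   # rows identical in every column
id_dupes = toy['id'].duplicated().sum()   # rows whose 'id' was already seen
print(full_row_dupes, id_dupes)           # 1 2

# Dropping by 'id' keeps the first occurrence of each listing.
deduped = toy.drop_duplicates(subset='id', keep='first')
print(len(deduped))                       # 3
```

Listing 2 has the same 'id' but a different price, so it counts as a repeated 'id' without being a fully duplicated row.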

In [370]:
df_duplicated_rows = df[df.duplicated()]
df_duplicated_rows.shape

(71, 21)

In [371]:
df_duplicated_id = df[df['id'].duplicated()]
df_duplicated_id.shape

(72, 21)

In [372]:
get_idx = 0
for i in df_duplicated_id.index:
    if i not in df_duplicated_rows.index:
        get_idx = i
df[df['id']==df_duplicated_id.loc[get_idx,:]['id']]

Unnamed: 0,usableAreas,id,parkingSpaces,suites,bathrooms,totalAreas,bedrooms,elevator,pool,party_hall,barbecue_grill,service_area,gym,playground,garden,intercom,gated_community,zipCode,zone,neighborhood,rental_price
6707,92,2583627481,2.0,1.0,3,92.0,3,1,1,0,1,1,0,1,1,0,1,1153000,Centro,Barra Funda,3000.0
7099,92,2583627481,2.0,1.0,3,92.0,3,1,1,0,1,1,0,1,1,0,1,1153000,Centro,Barra Funda,2900.0


In [373]:
df = df.drop(df_duplicated_id.index)
df.drop(['id'], axis=1,inplace=True)
df.head()

Unnamed: 0,usableAreas,parkingSpaces,suites,bathrooms,totalAreas,bedrooms,elevator,pool,party_hall,barbecue_grill,service_area,gym,playground,garden,intercom,gated_community,zipCode,zone,neighborhood,rental_price
0,101,1.0,,2,111.0,2,1,0,0,0,0,0,0,1,0,0,4734003,Zona Sul,Santo Amaro,2300.0
1,140,2.0,2.0,4,,2,1,1,1,1,0,1,1,1,0,0,1307000,Centro,Consolação,9500.0
2,50,1.0,0.0,1,50.0,2,1,1,1,1,0,1,1,1,0,1,1209010,Centro,Santa Efigênia,3000.0
3,58,1.0,,1,,2,0,1,1,1,0,1,1,1,0,1,1127000,Centro,Bom Retiro,1900.0
4,64,1.0,,1,80.0,2,1,0,0,0,0,0,0,0,0,1,5435001,Zona Oeste,Sumarezinho,2400.0


# Outlier Treatment

In [374]:
df.describe()

Unnamed: 0,usableAreas,parkingSpaces,suites,bathrooms,totalAreas,bedrooms,elevator,pool,party_hall,barbecue_grill,service_area,gym,playground,garden,intercom,gated_community,rental_price
count,9928.0,9534.0,8427.0,9928.0,7999.0,9928.0,9928.0,9928.0,9928.0,9928.0,9928.0,9928.0,9928.0,9928.0,9928.0,9928.0,9928.0
mean,83.856366,1.32232,0.856889,1.930399,99.252407,2.083199,0.513699,0.461523,0.411765,0.396555,0.372683,0.369359,0.335919,0.320608,0.29029,0.285556,4484.447623
std,71.716484,1.114758,0.982001,1.190308,170.231681,0.839207,0.499837,0.498542,0.492178,0.489207,0.483543,0.482656,0.472334,0.466734,0.453919,0.451702,10535.670046
min,10.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,500.0
25%,49.0,1.0,0.0,1.0,50.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1800.0
50%,65.0,1.0,1.0,2.0,70.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2760.0
75%,96.0,2.0,1.0,2.0,103.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4500.0
max,3300.0,45.0,20.0,10.0,6000.0,7.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,710000.0


Identifying outliers with the interquartile range (IQR) rule and replacing them with NaN

In [375]:
cols = ['usableAreas','parkingSpaces','suites','bathrooms','totalAreas','bedrooms', 'rental_price']
for col in cols:
    median = df[col].median()
    Q1 = df[col].quantile(q=0.25)
    Q3 = df[col].quantile(q=0.75)
    IQ = Q3-Q1
    lim_sup = Q3+1.5*IQ
    lim_inf = Q1-1.5*IQ

    df[col]=np.where(((df[col]<lim_inf)|(df[col]>lim_sup)),np.nan,df[col])
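A worked example of the same IQR rule on a toy Series may help. The prices are invented, and `Series.where` with `between` is used here as an equivalent way of writing the `np.where` condition.

```python
import numpy as np
import pandas as pd

# Made-up rental prices with one obvious outlier.
prices = pd.Series([1000.0, 1500.0, 2000.0, 2500.0, 3000.0, 50000.0])

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Series.where keeps values inside the fences and sets the rest to NaN.
cleaned = prices.where(prices.between(lower, upper))
print(lower, upper)
print(cleaned.tolist())
```

Values exactly on a fence are kept, which matches the strict `<` and `>` comparisons in the loop above.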

In [376]:
df.describe()

Unnamed: 0,usableAreas,parkingSpaces,suites,bathrooms,totalAreas,bedrooms,elevator,pool,party_hall,barbecue_grill,service_area,gym,playground,garden,intercom,gated_community,rental_price
count,9154.0,9202.0,7658.0,8816.0,7343.0,9927.0,9928.0,9928.0,9928.0,9928.0,9928.0,9928.0,9928.0,9928.0,9928.0,9928.0,9008.0
mean,69.885842,1.200174,0.618699,1.590404,74.268283,2.082704,0.513699,0.461523,0.411765,0.396555,0.372683,0.369359,0.335919,0.320608,0.29029,0.285556,3009.685613
std,31.357299,0.757478,0.61971,0.701505,35.442299,0.837797,0.499837,0.498542,0.492178,0.489207,0.483543,0.482656,0.472334,0.466734,0.453919,0.451702,1677.39035
min,10.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,500.0
25%,47.0,1.0,0.0,1.0,49.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1750.0
50%,63.0,1.0,1.0,1.0,65.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2500.0
75%,85.0,2.0,1.0,2.0,91.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3800.0
max,166.0,3.0,2.0,3.0,182.0,6.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,8540.0


# Handling Missing Values

According to the dataframe info, after the outlier step there are 7 columns with missing values that need to be handled.

In [377]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9928 entries, 0 to 9999
Data columns (total 20 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   usableAreas      9154 non-null   float64
 1   parkingSpaces    9202 non-null   float64
 2   suites           7658 non-null   float64
 3   bathrooms        8816 non-null   float64
 4   totalAreas       7343 non-null   float64
 5   bedrooms         9927 non-null   float64
 6   elevator         9928 non-null   int64  
 7   pool             9928 non-null   int64  
 8   party_hall       9928 non-null   int64  
 9   barbecue_grill   9928 non-null   int64  
 10  service_area     9928 non-null   int64  
 11  gym              9928 non-null   int64  
 12  playground       9928 non-null   int64  
 13  garden           9928 non-null   int64  
 14  intercom         9928 non-null   int64  
 15  gated_community  9928 non-null   int64  
 16  zipCode          9928 non-null   object 
 17  zone          

In [378]:
(df.isnull().sum() / df.shape[0])*100

usableAreas         7.796132
parkingSpaces       7.312651
suites             22.864625
bathrooms          11.200645
totalAreas         26.037470
bedrooms            0.010073
elevator            0.000000
pool                0.000000
party_hall          0.000000
barbecue_grill      0.000000
service_area        0.000000
gym                 0.000000
playground          0.000000
garden              0.000000
intercom            0.000000
gated_community     0.000000
zipCode             0.000000
zone                0.000000
neighborhood        0.000000
rental_price        9.266720
dtype: float64

Listing the columns that contain missing data and filling each one with its median or mode

In [379]:
import statistics

columns_with_nan = ['usableAreas','parkingSpaces','suites','bathrooms','totalAreas','bedrooms','rental_price']
replace_by = ['median','mode','mode','mode','median','mode','median']

# Maps each strategy name to the function that computes the fill value
replacement_function = {
    'median':statistics.median,
    'mode':statistics.mode,
    'mean':statistics.mean
}

for col,rpl_by in zip(columns_with_nan,replace_by):
    fill_value = replacement_function[rpl_by](df[col])
    # np.nan is the supported spelling (the np.NaN alias was removed in NumPy 2.0)
    df[col] = df[col].replace(np.nan, fill_value)
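One caveat, offered as a reading of the outputs rather than a confirmed diagnosis: Python's `statistics.median` does not skip NaN, and the standard library documentation warns that NaN values cause surprising or undefined behavior in functions that sort data. That would be consistent with the filled values shown later in the notebook (39.0 for totalAreas and 2575.0 for rental_price), which differ from the medians reported by `describe()` (65.0 and 2500.0). A NaN-aware alternative using pandas' own `median` and `mode` could look like this sketch, with made-up data and a hypothetical `strategy` mapping.

```python
import numpy as np
import pandas as pd

# Made-up data with gaps in both columns.
toy = pd.DataFrame({
    'totalAreas': [50.0, np.nan, 70.0, 90.0, np.nan],
    'bedrooms': [2.0, 2.0, np.nan, 3.0, 1.0],
})

# Hypothetical mapping from column name to imputation strategy.
strategy = {'totalAreas': 'median', 'bedrooms': 'mode'}

for col, how in strategy.items():
    if how == 'median':
        fill_value = toy[col].median()         # skips NaN by default
    else:
        fill_value = toy[col].mode().iloc[0]   # most frequent non-NaN value
    toy[col] = toy[col].fillna(fill_value)

print(toy)
```

Note that switching the notebook to this approach would change the imputed values and therefore the aggregates computed afterwards.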


Some neighborhoods appear in two or more different zones, depending on the record. Does that make sense?

In [380]:
for neighborhood in df['neighborhood'].unique():
    unique_zone_name = df[df['neighborhood'] == neighborhood]['zone'].unique()
    if len(unique_zone_name)>1:
        print(neighborhood,':',unique_zone_name)
    
    

Consolação : ['Centro' 'Zona Oeste']
Itaim Bibi : ['Zona Oeste' 'Zona Sul']
Jardim Paulista : ['Zona Oeste' 'Centro' 'Zona Sul']
Bela Vista : ['Centro' 'Zona Oeste']
Cerqueira César : ['Zona Oeste' 'Centro' 'Zona Sul']
Perdizes : ['Zona Oeste' 'Centro']
Vila Mariana : ['Zona Sul' 'Centro']
Brooklin Paulista : ['Zona Oeste' 'Zona Sul' 'Zona Leste']
Aclimação : ['Centro' 'Zona Sul']
Cambuci : ['Centro' 'Zona Sul']
Jardim Londrina : ['Zona Sul' 'Zona Oeste']
Vila Pirajussara : ['Zona Sul' 'Zona Oeste']
Vila do Encontro : ['Zona Sul' '']
Vila Deodoro : ['Centro' 'Zona Sul']
Jardim Celeste : ['Zona Sul' 'Zona Oeste']
Vila Suzana : ['Zona Sul' 'Zona Oeste']
Barra Funda : ['Centro' 'Zona Oeste']
Jardim Íris : ['Zona Norte' 'Zona Oeste']
Vila Nova Conceição : ['Zona Oeste' 'Zona Sul']
Paraíso : ['Zona Sul' 'Centro']
Mooca : ['Zona Leste' 'Centro']
Chácara Inglesa : ['Zona Sul' 'Zona Norte']
Jardim da Glória : ['Centro' 'Zona Sul']
Jardim Vazani : ['Zona Oeste' 'Zona Sul']
Jardim das Acácias : 
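The same check can be written without an explicit loop. The sketch below uses `groupby` with `nunique` on invented rows; it is an alternative formulation, not the notebook's code.

```python
import pandas as pd

# Made-up rows: one neighborhood listed under two zones.
toy = pd.DataFrame({
    'neighborhood': ['Mooca', 'Mooca', 'Pinheiros', 'Pinheiros', 'Paraíso'],
    'zone': ['Zona Leste', 'Centro', 'Zona Oeste', 'Zona Oeste', 'Zona Sul'],
})

# Number of distinct zones recorded for each neighborhood.
zones_per_neighborhood = toy.groupby('neighborhood')['zone'].nunique()
ambiguous = zones_per_neighborhood[zones_per_neighborhood > 1]
print(ambiguous.to_dict())   # {'Mooca': 2}
```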

Missing values in zone

In [381]:
instances_without_zone = df[df['zone']=='']

In [382]:
for nghb in instances_without_zone['neighborhood'].unique():
    true_zone = df[(df['neighborhood']==nghb) & (df['zone']!='')]['zone'].to_list()[0]
    df['zone']=np.where((df['neighborhood']==nghb) & (df['zone']==''),true_zone,df['zone'])
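The loop above fills an empty zone with the first non-empty zone found for the same neighborhood, and it would fail with an `IndexError` if a neighborhood had no known zone at all. A sketch of a variant that uses the most frequent zone and tolerates that case is shown below. The rows are made up, and 'Hypothetical Neighborhood' is a placeholder name.

```python
import pandas as pd

toy = pd.DataFrame({
    'neighborhood': ['Vila do Encontro', 'Vila do Encontro',
                     'Vila do Encontro', 'Hypothetical Neighborhood'],
    'zone': ['Zona Sul', '', 'Zona Sul', ''],
})

# Most frequent non-empty zone for each neighborhood.
known = toy[toy['zone'] != '']
zone_by_neighborhood = known.groupby('neighborhood')['zone'].agg(lambda s: s.mode().iloc[0])

# Fill only the empty entries; neighborhoods with no known zone stay empty.
missing = toy['zone'] == ''
filled = toy.loc[missing, 'neighborhood'].map(zone_by_neighborhood).fillna('')
toy.loc[missing, 'zone'] = filled
print(toy['zone'].tolist())
```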

Casting columns to their appropriate types

In [383]:
df = df.astype({'parkingSpaces': 'int32','suites': 'int32','bathrooms': 'int32','bedrooms': 'int32','zone':'category','neighborhood':'category'})
df.head()

Unnamed: 0,usableAreas,parkingSpaces,suites,bathrooms,totalAreas,bedrooms,elevator,pool,party_hall,barbecue_grill,service_area,gym,playground,garden,intercom,gated_community,zipCode,zone,neighborhood,rental_price
0,101.0,1,1,2,111.0,2,1,0,0,0,0,0,0,1,0,0,4734003,Zona Sul,Santo Amaro,2300.0
1,140.0,2,2,1,39.0,2,1,1,1,1,0,1,1,1,0,0,1307000,Centro,Consolação,2575.0
2,50.0,1,0,1,50.0,2,1,1,1,1,0,1,1,1,0,1,1209010,Centro,Santa Efigênia,3000.0
3,58.0,1,1,1,39.0,2,0,1,1,1,0,1,1,1,0,1,1127000,Centro,Bom Retiro,1900.0
4,64.0,1,1,1,80.0,2,1,0,0,0,0,0,0,0,0,1,5435001,Zona Oeste,Sumarezinho,2400.0


In [384]:
df.dtypes

usableAreas         float64
parkingSpaces         int32
suites                int32
bathrooms             int32
totalAreas          float64
bedrooms              int32
elevator              int64
pool                  int64
party_hall            int64
barbecue_grill        int64
service_area          int64
gym                   int64
playground            int64
garden                int64
intercom              int64
gated_community       int64
zipCode              object
zone               category
neighborhood       category
rental_price        float64
dtype: object

Computing the percentage of missing values in each column, relative to the total number of rows
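A compact way to compute these percentages is sketched below on made-up data. It is equivalent to dividing `isnull().sum()` by the number of rows.

```python
import numpy as np
import pandas as pd

# Made-up frame: half of 'suites' is missing, 'bedrooms' is complete.
toy = pd.DataFrame({
    'suites': [1.0, np.nan, 0.0, np.nan],
    'bedrooms': [2.0, 3.0, 1.0, 2.0],
})

# isna() gives booleans; their mean is the fraction of missing rows.
missing_pct = toy.isna().mean() * 100
print(missing_pct.to_dict())   # {'suites': 50.0, 'bedrooms': 0.0}
```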

## Data Analysis

### Univariate Analysis

In [395]:
df['neighborhood'].value_counts().sort_values(ascending=True)

Fazenda da Juta                        1
Jardim Satélite                        1
Jardim Brasília (Zona Norte)           1
Jardim Santa Terezinha (Pedreira)      1
Jardim Santa Cruz (Campo Grande)       1
                                    ... 
Jardim Paulista                      281
Indianópolis                         293
Bela Vista                           317
Pinheiros                            362
Vila Mariana                         404
Name: neighborhood, Length: 600, dtype: int64

In [396]:
df['zone'].value_counts().sort_values(ascending=True)

Zona Norte     647
Zona Leste    1171
Centro        1495
Zona Oeste    2994
Zona Sul      3621
Name: zone, dtype: int64

### Bivariate Analysis

In [387]:
df.pivot_table(index='neighborhood', values='rental_price', aggfunc='mean').sort_values(by='rental_price')

Unnamed: 0_level_0,rental_price
neighborhood,Unnamed: 1_level_1
Jardim dos Francos,500.000000
Vila Popular,600.000000
Vila Nova Curuçá,700.000000
Jardim Jaraguá,700.000000
Cidade Tiradentes,727.500000
...,...
Jardim Guarapiranga,5000.000000
Paineiras do Morumbi,5000.000000
Jurubatuba,5033.333333
Granja Julieta,5750.000000


In [391]:
df.pivot_table(index='zone', values='rental_price', aggfunc='mean').sort_values(by='rental_price')

Unnamed: 0_level_0,rental_price
zone,Unnamed: 1_level_1
Zona Norte,1933.81762
Zona Leste,2082.16994
Centro,2553.708361
Zona Sul,3021.919635
Zona Oeste,3684.262525
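When reading group averages like these, it can help to see how many listings sit behind each mean, since several neighborhoods in the counts above have a single listing. A minimal sketch with `groupby` on made-up data follows; it is a suggestion, not part of the original analysis.

```python
import pandas as pd

# Made-up listings.
toy = pd.DataFrame({
    'zone': ['Centro', 'Centro', 'Zona Sul'],
    'rental_price': [2000.0, 3000.0, 3500.0],
})

# Mean price and number of listings per zone, sorted by mean.
summary = (
    toy.groupby('zone')['rental_price']
       .agg(['mean', 'count'])
       .sort_values(by='mean')
)
print(summary)
```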


### Correlation Analysis