In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

# São Paulo House Prices

House prices are influenced by a variety of factors including location, size, age, condition, number of bedrooms and bathrooms, amenities, and neighborhood. The location of the house is particularly important, as it determines access to transportation, schools, shopping centers, and other amenities. Size and number of bedrooms and bathrooms also play a significant role in determining the price, as larger homes with more rooms are generally more expensive.

In this notebook, we will perform an Exploratory Data Analysis (EDA) of [Kaggle's São Paulo House Price dataset](https://www.kaggle.com/datasets/ex0ticone/house-prices-of-sao-paulo-city), which contains information on various features of properties located in São Paulo, Brazil, along with their sale prices. Our goal is to get a better understanding of the dataset and identify any patterns or correlations between the features and the target variable (the house price). We will use various visualization techniques and statistical methods to explore the data and gain insights.

Once we have a good understanding of the data, we will move on to training a regression machine learning model on the dataset, and evaluate its performance using various metrics.

## The Dataset

### Dataset description

This is a dataset of real estate properties in São Paulo, Brazil. It includes information such as the address, type of property, number of bedrooms, bathrooms, and parking spots, sale or rental price, among other features. Some examples of the features are:

|**Feature name**|**Description**|
|---|---|
|**logradouro (street)**|the name of the street where the property is located.|
|**numero (number)**|the number of the building where the property is located.|
|**bairro (neighborhood)**|the name of the neighborhood where the property is located.|
|**cep (zip code)**|the zip code of the property.|
|**cidade (city)**|the city where the property is located.|
|**tipo_imovel (property type)**|the type of property (house, apartment, flat, etc.).|
|**area_util (usable area)**|the usable area of the property in square meters.|
|**banheiros (bathrooms)**|the number of bathrooms in the property.|
|**suites (suites)**|the number of suites in the property.|
|**quartos (bedrooms)**|the number of bedrooms in the property.|
|**vagas_garagem (parking spots)**|the number of parking spots in the property.|
|**anuncio_criado (created ad)**|the date when the ad was created.|
|**tipo_anuncio (ad type)**|the type of ad (sale or rental).|
|**preco_venda (sale price)**|the price of the property if it is for sale.|
|**taxa_condominio (condominium fee)**|the monthly fee charged by the condominium, if the property is a condominium.|
|**periodicidade (periodicity)**|the frequency with which the rental price is charged (monthly, weekly, daily, etc.).|
|**preco_aluguel (rental price)**|the rental price of the property.|
|**iptu_ano (annual IPTU tax)**|the annual tax charged by the municipality.|
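
Since the column names are in Portuguese, it may help to keep the translations from the table above in a small lookup. The sketch below is only an illustration: the names `COLUMN_TRANSLATIONS` and `translate_columns`, and the exact English identifiers, are assumptions and not part of the dataset.

```python
# Hypothetical lookup: Portuguese column names -> English names,
# following the feature table above.
COLUMN_TRANSLATIONS = {
    "logradouro": "street",
    "numero": "number",
    "bairro": "neighborhood",
    "cep": "zip_code",
    "cidade": "city",
    "tipo_imovel": "property_type",
    "area_util": "usable_area",
    "banheiros": "bathrooms",
    "suites": "suites",
    "quartos": "bedrooms",
    "vagas_garagem": "parking_spots",
    "anuncio_criado": "ad_created",
    "tipo_anuncio": "ad_type",
    "preco_venda": "sale_price",
    "taxa_condominio": "condominium_fee",
    "periodicidade": "periodicity",
    "preco_aluguel": "rental_price",
    "iptu_ano": "annual_iptu_tax",
}


def translate_columns(columns):
    """Replace known Portuguese column names; leave unknown names untouched."""
    return [COLUMN_TRANSLATIONS.get(name, name) for name in columns]


print(translate_columns(["bairro", "quartos", "preco_venda", "Unnamed: 0"]))
# ['neighborhood', 'bedrooms', 'sale_price', 'Unnamed: 0']
```

With a Pandas DataFrame, the same dictionary could be passed to `DataFrame.rename(columns=COLUMN_TRANSLATIONS)`.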

### Dataset Loading

This dataset comes in **.csv** format and can be downloaded with Kaggle's command line utility. Follow the [instructions to install and configure it](https://www.kaggle.com/docs/api) first. The cell below then downloads the dataset, but only if the file is not already present.

In [2]:
dataset_root_dir = os.path.join(os.path.dirname(os.path.abspath("")), "data", "raw")
if not os.path.isdir(dataset_root_dir):
    os.makedirs(dataset_root_dir)
    print("Data folder created!")

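# Download only when the extracted CSV is not already in the data folder.
if "housing_sp_city.csv" not in os.listdir(dataset_root_dir):
    # Paths are quoted so that directories containing spaces do not break the commands.
    kaggle_cmd = f'kaggle datasets download -d ex0ticone/house-prices-of-sao-paulo-city -p "{dataset_root_dir}"'
    os.system(kaggle_cmd)

    # Unpack the archive into the data folder.
    zip_file = os.path.join(dataset_root_dir, "house-prices-of-sao-paulo-city.zip")
    unzip_cmd = f'unzip "{zip_file}" -d "{dataset_root_dir}"'
    os.system(unzip_cmd)

    # The archive is no longer needed once the CSV is extracted.
    rm_cmd = f'rm "{zip_file}"'
    os.system(rm_cmd)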
    
    print("Dataset downloaded")
else:
    print("Dataset already exists!")

Data folder created!
Downloading house-prices-of-sao-paulo-city.zip to /mnt/3273eabb-9e14-47b4-8ddd-ddb77dddcd30/workspace/HousePricePredictionApp/data/raw
Archive:  /mnt/3273eabb-9e14-47b4-8ddd-ddb77dddcd30/workspace/HousePricePredictionApp/data/raw/house-prices-of-sao-paulo-city.zip
  inflating: /mnt/3273eabb-9e14-47b4-8ddd-ddb77dddcd30/workspace/HousePricePredictionApp/data/raw/housing_sp_city.csv
Dataset downloaded
100%|██████████| 3.63M/3.63M [00:00<00:00, 5.80MB/s]
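
The download cell relies on the `unzip` and `rm` shell commands, which may not be available on every system. As a hedged alternative, the sketch below does the extraction and cleanup with Python's standard library. The function names `extract_and_remove` and `download_dataset` are hypothetical; the Kaggle command and its `-d` and `-p` flags are the same ones used in the cell above.

```python
import subprocess
import zipfile
from pathlib import Path


def extract_and_remove(zip_path, target_dir):
    """Extract all members of zip_path into target_dir, delete the archive,
    and return the list of extracted member names."""
    zip_path = Path(zip_path)
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(target_dir)
    zip_path.unlink()  # remove the archive once its contents are extracted
    return names


def download_dataset(target_dir, dataset="ex0ticone/house-prices-of-sao-paulo-city"):
    """Hypothetical wrapper: run the Kaggle CLI without going through a shell.
    Requires the Kaggle CLI to be installed and configured."""
    subprocess.run(
        ["kaggle", "datasets", "download", "-d", dataset, "-p", str(target_dir)],
        check=True,  # raise CalledProcessError if the download fails
    )


# Possible usage (not executed here, since it needs Kaggle credentials):
# download_dataset(dataset_root_dir)
# extract_and_remove(
#     Path(dataset_root_dir) / "house-prices-of-sao-paulo-city.zip",
#     dataset_root_dir,
# )
```

Passing the command as a list avoids quoting problems with paths that contain spaces, and `check=True` makes a failed download raise an error instead of passing silently.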


Now we can load the dataset using [Pandas](https://pandas.pydata.org/), but there is a catch. If we call `read_csv` without any additional arguments, we get the following decode error:
```
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 166141: invalid continuation byte
```
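The file contains accented characters (such as those in *São Paulo*), and at least one byte sequence in it is not valid UTF-8. So we explicitly supply the file encoding and instruct Pandas to ignore any encoding errors through `encoding_errors="ignore"` (available since pandas 1.3). Note that this silently drops the bytes that cannot be decoded.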
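
To make the effect of the error handler concrete, here is a minimal sketch using plain Python decoding, which follows the same error-handler names that `encoding_errors` accepts. The byte string is an illustrative example and is not taken from the dataset.

```python
# Illustrative bytes (an assumption, not from the dataset): 0xc3 starts a
# two-byte UTF-8 sequence, but the next byte, "o", is not a continuation byte.
raw = b"S\xc3o Paulo"

try:
    raw.decode("utf-8")  # errors="strict" is the default
except UnicodeDecodeError as exc:
    print(exc.reason)  # invalid continuation byte

ignored = raw.decode("utf-8", errors="ignore")    # drops the offending byte
replaced = raw.decode("utf-8", errors="replace")  # substitutes U+FFFD for it

print(ignored)                   # So Paulo
print(replaced.count("\ufffd"))  # 1
```

With `"ignore"` the damaged character disappears without a trace, whereas `"replace"` leaves a U+FFFD marker that can be searched for afterwards. Passing `encoding_errors="replace"` to `read_csv` is therefore one option if you want to find out which values were affected.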

In [4]:
dataset_file = os.path.join(dataset_root_dir, "housing_sp_city.csv")
sp_house_prices = pd.read_csv(dataset_file, encoding="utf-8", encoding_errors="ignore")
sp_house_prices

|Unnamed: 0|logradouro|numero|bairro|cep|cidade|tipo_imovel|area_util|banheiros|suites|quartos|vagas_garagem|anuncio_criado|tipo_anuncio|preco_venda|taxa_condominio|periodicidade|preco_aluguel|iptu_ano|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|Rua Juvenal Galeno|53|Jardim da Saúde|4290030.0|São Paulo|Casa de dois andares|388.0|3.0|1.0|4.0|6.0|2017-02-07|Venda|700000|||||
|1|Rua Juruaba|16|Vila Santa Teresa (Zona Sul)|4187320.0|São Paulo|Casa|129.0|2.0|1.0|3.0|2.0|2016-03-21|Venda|336000|||||
|2|Avenida Paulista|402|Bela Vista|1311000.0|São Paulo|Comercial|396.0|4.0|0.0|0.0|5.0|2018-12-18|Locação|24929|4900.0|MONTHLY|29829.0|4040.0|
|3|Rua Alvorada|1190|Vila Olímpia|4550004.0|São Paulo|Apartamento|80.0|2.0|1.0|3.0|2.0|2018-10-26|Venda|739643|686.0|||1610.0|
|4|Rua Curitiba|380|Paraíso|4005030.0|São Paulo|Apartamento|3322.0|5.0|4.0|4.0|5.0|2018-12-14|Venda|7520099|6230.0|||18900.0|
|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|
|133959|Rua Glicério|255|Liberdade|1514000.0|São Paulo|Apartamento|53.0|2.0|1.0|2.0|1.0|2018-11-28|Venda|249782|210.0|||0.0|
|133960|Rua Laboriosa||Jardim das Bandeiras|5434060.0|São Paulo|Escritório|450.0|3.0|1.0|3.0|4.0|2018-08-08|Venda|1085000||||507.0|
|133961|Rua José Pereira de Carvalho|10|Vila Lageado|5337090.0|São Paulo|Apartamento|20.0|3.0|2.0|3.0|2.0|2019-02-06|Venda|623000|||||
|133962|Rua Evangelista Rodrigues|234|Alto de Pinheiros|5463000.0|São Paulo|Casa de dois andares|357.0|4.0|1.0|4.0|4.0|2018-04-14|Venda|1820000|0.0|||665.0|
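
One detail worth noting in this preview: the `cep` column is parsed as a float, so values such as `4290030.0` show only seven digits, while Brazilian CEPs have eight. This suggests that a leading zero was lost when the values were stored or parsed as numbers. That is an inference from the preview, not something the dataset documents. A minimal sketch of restoring the eight-digit form, using a hypothetical helper named `format_cep`:

```python
import math


def format_cep(value):
    """Hypothetical helper: convert a float-parsed CEP into an 8-digit,
    zero-padded string. Missing values (None or NaN) are returned as None."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    return f"{int(value):08d}"


print(format_cep(4290030.0))  # 04290030
print(format_cep(1311000.0))  # 01311000
```

On the loaded DataFrame this could be applied with `sp_house_prices["cep"].map(format_cep)`.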
