# Data Science Challenge - ZAP

Author: [Douglas Trajano](https://dougtrajano.github.io/resume/)

## Description

This notebook downloads, extracts, and processes the raw files (Grupo ZAP, IBGE) and saves the processed datasets.

## Index

- [Imports](#Imports)
- [Parameters](#Parameters)
- [Download and extract zip files](#Download-and-extract-zip-files)
  - [Train dataset by Grupo ZAP](#Train-dataset-by-Grupo-ZAP)
  - [IBGE Censo 2010 - Agregados por setor censitário](#IBGE-Censo-2010---Agregados-por-setor-censitário)
  - [IBGE Censo 2010 - Shapefile](#IBGE-Censo-2010---Shapefile)
  - [IBGE Censo 2010 - Parameters](#IBGE-Censo-2010---Parameters)
- [Load train dataset](#Load-train-dataset)
  - [Filtering](#Filtering)
  - [Processing train](#Processing-train)
- [Load test dataset](#Load-test-dataset)
  - [Processing test](#Processing-test)

## Imports

In [None]:
from processing import (download_url, extract_zip, get_files_path,
                        load_json, flatten_dict, processing)
import pandas as pd
import numpy as np

## Parameters

In [None]:
files = {
    "train": {
        "url": "https://s3.amazonaws.com/grupozap-data-challenge/data/source-4-ds-train.json.zip",
        "zip_name": "source-4-ds-train.json.zip",
        "json_name": "source-4-ds-train.json",
        "output_path": "../data/raw/"
    },
    "test": {
        "url": "https://s3.amazonaws.com/grupozap-data-challenge/data/source-4-ds-test.json.zip",
        "zip_name": "source-4-ds-test.json.zip",
        "json_name": "source-4-ds-test.json",
        "output_path": "../data/raw/"
    },
    "ibge": {
        "censo": {
            "url": "https://ftp.ibge.gov.br/Censos/Censo_Demografico_2010/Resultados_do_Universo/Agregados_por_Setores_Censitarios/SP_Capital_20190823.zip",
            "zip_name": "SP_Capital_20190823.zip",
            "output_path": "../data/raw/",
            "zip_path": "Base informaçoes setores2010 universo SP_Capital/CSV",
            "ext_files": ".csv"
        },
        "shapefile": {
            "url": "http://geoftp.ibge.gov.br/organizacao_do_territorio/malhas_territoriais/malhas_de_setores_censitarios__divisoes_intramunicipais/censo_2010/setores_censitarios_shp/sp/sp_setores_censitarios.zip",
            "zip_name": "sp_setores_censitarios.zip",
            "shp_name": "33SEE250GC_SIR.shp",
            "output_path": "../data/raw/"
        }
    },
    "download_files": True
}


converted_features = {
    "pricingInfos_price": "price",
    "pricingInfos_businessType": "businessType",
    "pricingInfos_yearlyIptu": "yearlyIptu",
    "pricingInfos_monthlyCondoFee": "monthlyCondoFee",
}

censo_config = {
    "DomicilioRenda_SP1.csv": {
        "V001": "total_dom_part_improvisados",
        "V002": "renda_nom_dom_part",
        "V003": "renda_nom_dom_part_perm",
        "V004": "renda_nom_dom_part_imp",
        "V005": "renda_nom_dom_sal_baixo1",
        "V006": "renda_nom_dom_sal_baixo2",
        "V007": "renda_nom_dom_sal_baixo3",
        "V008": "renda_nom_dom_sal_baixo4",
        "V009": "renda_nom_dom_sal_medio1",
        "V010": "renda_nom_dom_sal_medio2",
        "V011": "renda_nom_dom_sal_medio3",
        "V012": "renda_nom_dom_sal_alto1",
        "V013": "renda_nom_dom_sal_alto2",
        "V014": "renda_nom_dom_sem_rendimento"
    },
    "Entorno01_SP1.csv": {
        "V1005": "rural_urbano",
        "V002": "ident_logradouro_proprios",
        "V003": "nao_ident_logradouro_proprios",
        "V004": "ident_logradouro_alugados",
        "V005": "nao_ident_logradouro_alugados",
        "V008": "ilum_publica_proprios",
        "V009": "nao_ilum_publica_proprios",
        "V010": "ilum_publica_alugados",
        "V011": "nao_ilum_publica_alugados"
    }
}
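
The `censo_config` mapping above pairs each IBGE file with the columns to keep and their readable names. Note that the same code (for example `V002`) means different things in different files, so the renaming has to happen per file. The sketch below shows one way such a mapping could be applied with pandas; it is not the `processing` module's implementation. The `select_and_rename` helper, the `Cod_setor` key column, the `;` separator, and the sample row are all assumptions made for illustration.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for one census CSV (values are invented).
sample_csv = io.StringIO(
    "Cod_setor;V001;V002;V003\n"
    "355030801000001;10;2500;2400\n"
)

# Subset of a per-file mapping in the same shape as censo_config[file_name].
sample_mapping = {
    "V001": "total_dom_part_improvisados",
    "V002": "renda_nom_dom_part",
}


def select_and_rename(csv_source, mapping, key_column="Cod_setor", sep=";"):
    """Read only the key column plus the mapped columns, then rename them."""
    columns = [key_column, *mapping]
    frame = pd.read_csv(csv_source, sep=sep, usecols=columns,
                        dtype={key_column: str})  # keep the sector code as text
    return frame.rename(columns=mapping)


renamed = select_and_rename(sample_csv, sample_mapping)
print(renamed.columns.tolist())
```

Reading the sector code as a string avoids losing precision or leading zeros when the frames are later joined on it.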

## Download and extract zip files

### Train dataset by Grupo ZAP

Download and extract

In [None]:
if files["download_files"]:
    # Download and extract both Grupo ZAP archives (train and test)
    for name in ("train", "test"):
        download_url(url=files[name]["url"], file_name=files[name]["zip_name"],
                     to_path=files[name]["output_path"])

        file_path = files[name]["output_path"] + files[name]["zip_name"]
        extract_zip(file_path=file_path, to_path=files[name]["output_path"])
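
The `download_url` and `extract_zip` helpers come from the local `processing` module, which is not shown in this notebook. As a rough idea of what such helpers might look like, here is a minimal standard-library sketch. The `_sketch` function names and their return values are assumptions, not the module's actual code.

```python
import zipfile
from pathlib import Path
from urllib.request import urlretrieve


def download_url_sketch(url: str, file_name: str, to_path: str) -> Path:
    """Download `url` to `to_path/file_name`, creating the folder if needed."""
    target_dir = Path(to_path)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / file_name
    urlretrieve(url, str(target))
    return target


def extract_zip_sketch(file_path: str, to_path: str) -> list:
    """Extract every member of the archive and return the member names."""
    with zipfile.ZipFile(file_path) as archive:
        archive.extractall(to_path)
        return archive.namelist()
```

Returning the member names makes it easy to confirm that the expected JSON or CSV files were actually inside the archive.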

### IBGE Censo 2010 - Agregados por setor censitário

Download and extract

In [None]:
if files["download_files"]:
    download_url(url=files["ibge"]["censo"]["url"],
                 file_name=files["ibge"]["censo"]["zip_name"],
                 to_path=files["ibge"]["censo"]["output_path"])
    
    file_path = files["ibge"]["censo"]["output_path"] + files["ibge"]["censo"]["zip_name"]
    
    extract_zip(file_path=file_path, to_path=files["ibge"]["censo"]["output_path"])

### IBGE Censo 2010 - Shapefile

Download and extract

In [None]:
if files["download_files"]:
    download_url(url=files["ibge"]["shapefile"]["url"],
                 file_name=files["ibge"]["shapefile"]["zip_name"],
                 to_path=files["ibge"]["shapefile"]["output_path"])
    
    file_path = files["ibge"]["shapefile"]["output_path"] + files["ibge"]["shapefile"]["zip_name"]
    
    extract_zip(file_path=file_path, to_path=files["ibge"]["shapefile"]["output_path"])

### IBGE Censo 2010 - Parameters

`ibge_paths` is a dict that maps each file name inside the IBGE folder to its path.

In [None]:
ibge_path = files["ibge"]["censo"]["output_path"] + files["ibge"]["censo"]["zip_path"]
ibge_paths = get_files_path(path=ibge_path, file_extension=files["ibge"]["censo"]["ext_files"])

print("IBGE Censo files:", len(ibge_paths))

ibge_paths
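
For reference, a helper that builds a file-name-to-path dict like `ibge_paths` could be sketched as follows. This is an assumption about how `get_files_path` behaves, based only on the description above and on the file-name keys used in `censo_config`; the `get_files_path_sketch` name and the case-insensitive extension match are illustrative choices.

```python
import os


def get_files_path_sketch(path: str, file_extension: str) -> dict:
    """Map each file name under `path` with the given extension to its full path."""
    paths = {}
    for root, _dirs, file_names in os.walk(path):
        for file_name in sorted(file_names):
            # Compare case-insensitively so ".CSV" and ".csv" both match.
            if file_name.lower().endswith(file_extension.lower()):
                paths[file_name] = os.path.join(root, file_name)
    return paths
```

With keys such as `"DomicilioRenda_SP1.csv"`, the result can be indexed directly with the keys of `censo_config`.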

In [None]:
shapefile_path = files["ibge"]["shapefile"]["output_path"] + files["ibge"]["shapefile"]["shp_name"]
shapefile_path

---

## Load train dataset

In [None]:
file_path = files["train"]["output_path"] + files["train"]["json_name"]

raw_train = load_json(file_path)

print("Training set size:", len(raw_train))

### Filtering

As requested in the project description, we'll only work with `"APARTMENT"` items.

In [None]:
raw_train = [item for item in raw_train if item["unitTypes"] in ["APARTMENT"]]

print("Training set size (after filter):", len(raw_train))
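
To see how much data a filter like this discards, it can help to count the distinct `unitTypes` values first. The sketch below does that on a small invented sample. The `"HOME"` value and the sample records are assumptions for illustration only, and the code assumes `unitTypes` holds a single string per item.

```python
from collections import Counter

# Invented records standing in for the raw listings.
sample_items = [
    {"id": "a", "unitTypes": "APARTMENT"},
    {"id": "b", "unitTypes": "HOME"},
    {"id": "c", "unitTypes": "APARTMENT"},
]

# Count each unit type, then keep only the apartments.
unit_type_counts = Counter(item.get("unitTypes") for item in sample_items)
apartments = [item for item in sample_items if item.get("unitTypes") == "APARTMENT"]

print(unit_type_counts.most_common())
print("Kept:", len(apartments), "of", len(sample_items))
```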

### Processing train

Convert nested dictionary into flattened dictionary

In [None]:
%%time
raw_train_flatten = [flatten_dict(item) for item in raw_train]
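
The flattening step can be pictured with a small recursive function. This is a sketch, not the `flatten_dict` from the `processing` module. The underscore separator is inferred from keys such as `pricingInfos_price` in `converted_features`, and the sample listing is invented.

```python
def flatten_dict_sketch(nested: dict, parent_key: str = "", sep: str = "_") -> dict:
    """Flatten nested dicts by joining parent and child keys with `sep`."""
    flat = {}
    for key, value in nested.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the joined key as the prefix.
            flat.update(flatten_dict_sketch(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat


listing = {"id": "abc", "pricingInfos": {"price": 500000, "businessType": "SALE"}}
flat_listing = flatten_dict_sketch(listing)
print(flat_listing)
# {'id': 'abc', 'pricingInfos_price': 500000, 'pricingInfos_businessType': 'SALE'}
```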

Apply processing steps

In [None]:
%%time
train_processed = processing(raw_train_flatten, converted_features, ibge_paths, shapefile_path, censo_config)
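
One of the inputs to `processing` is `converted_features`, which maps flattened key names to shorter ones. How `processing` uses it internally is not shown here. A plausible sketch of such a renaming step on a single flattened record is below. The `rename_keys` helper and the sample record are hypothetical.

```python
def rename_keys(record: dict, mapping: dict) -> dict:
    """Return a copy of `record` with keys renamed according to `mapping`."""
    return {mapping.get(key, key): value for key, value in record.items()}


sample_mapping = {
    "pricingInfos_price": "price",
    "pricingInfos_businessType": "businessType",
}
sample_record = {"id": "abc", "pricingInfos_price": 500000,
                 "pricingInfos_businessType": "SALE"}

renamed_record = rename_keys(sample_record, sample_mapping)
print(renamed_record)
# {'id': 'abc', 'price': 500000, 'businessType': 'SALE'}
```

Keys that are absent from the mapping pass through unchanged, so the step is safe to apply to every record.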

## Load test dataset

In [None]:
file_path = files["test"]["output_path"] + files["test"]["json_name"]

raw_test = load_json(file_path)

print("Test set size:", len(raw_test))

Check for `unitTypes != "APARTMENT"`

In [None]:
# Collect any test items that are not apartments (expected to be empty)
wrong_test = [item for item in raw_test if item["unitTypes"] != "APARTMENT"]

wrong_test

### Processing test

Convert nested dictionary into flattened dictionary

In [None]:
%%time
raw_test_flatten = [flatten_dict(item) for item in raw_test]

Apply processing steps

In [None]:
%%time
# Process the full test set (the original call sliced it to the first 10 items)
test_processed = processing(raw_test_flatten, converted_features, ibge_paths, shapefile_path, censo_config)