# Dataset Builder

This notebook takes the raw data and builds a single dataset ready for preprocessing.

## Importing

In [None]:
import pandas as pd

In [None]:
# Load the raw CSV files into pandas DataFrames
df_pesagens = pd.read_csv('../data/raw/pesagens2016.csv', sep=';', index_col='PES_ID')
df_rotas = pd.read_csv('../data/raw/rotas.csv', sep=';', index_col='ROTA_ID')

## Info about the raw data

In [None]:
df_pesagens.info()

In [None]:
df_rotas.info()

In [None]:
df_pesagens.head()

In [None]:
df_rotas.head()

Looking at both tables and at the [documentation of the data](http://dados.recife.pe.gov.br/dataset/pesagem-de-coletas-de-residuos), we can see that the `ROTA_ID` attribute links the two datasets.

Every row in the `pesagens` dataset must have a valid `ROTA_ID` that is also present in the `roteirizacao` dataset (loaded above as `df_rotas`). To guarantee this, we keep only the rows whose `ROTA_ID` has a match and discard the rest.
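
One way to see how many rows would be discarded is to test each `ROTA_ID` in the weighings against the index of the routes table before merging. This is a minimal sketch on made-up frames that mirror how `df_pesagens` and `df_rotas` are indexed. The values and the `PESO` and `SETOR` columns are hypothetical and do not come from the real files.

```python
import pandas as pd

# Toy stand-ins for df_pesagens and df_rotas; all values are made up.
pesagens = pd.DataFrame(
    {"ROTA_ID": [10, 11, 99], "PESO": [1200, 950, 400]},
    index=pd.Index([1, 2, 3], name="PES_ID"),
)
rotas = pd.DataFrame(
    {"SETOR": ["A", "B"]},
    index=pd.Index([10, 11], name="ROTA_ID"),
)

# True where the weighing points at a route that exists in the routes table.
has_route = pesagens["ROTA_ID"].isin(rotas.index)

valid = pesagens[has_route]      # rows an inner merge would keep
orphans = pesagens[~has_route]   # rows an inner merge would drop

print(f"{len(valid)} valid rows, {len(orphans)} rows without a matching route")
```

Inspecting `orphans` on the real data shows which weighings are lost, instead of letting the merge drop them silently.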

## Merging Rotas Data into Pesagens 

Now we merge both datasets into one using the `merge` method from **pandas**. As our tests showed, duplicate index values can cause rows to be duplicated in the merged result.

To avoid this, we first drop the rows with duplicated index values, so that the merge succeeds without introducing duplicate data.
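
The effect is easy to reproduce on a small example. In this sketch, which uses made-up values and hypothetical `PESO` and `SETOR` columns, a route ID that appears twice in the routes table makes the matching weighing appear twice after the merge. Dropping the repeated index values first keeps the row count stable. It uses `Index.duplicated`, which should be equivalent to the `reset_index` / `drop_duplicates` / `set_index` approach for this purpose.

```python
import pandas as pd

# Toy data; values and the PESO / SETOR columns are made up for illustration.
pesagens = pd.DataFrame(
    {"ROTA_ID": [10, 11], "PESO": [1200, 950]},
    index=pd.Index([1, 2], name="PES_ID"),
)
# ROTA_ID 10 appears twice in the routes table.
rotas = pd.DataFrame(
    {"SETOR": ["A", "A", "B"]},
    index=pd.Index([10, 10, 11], name="ROTA_ID"),
)

# Merging against the duplicated key yields one output row per match.
inflated = pesagens.merge(rotas, left_on="ROTA_ID", right_index=True)

# Keep only the first row for each index value, then merge again.
rotas_unique = rotas[~rotas.index.duplicated(keep="first")]
clean = pesagens.merge(rotas_unique, left_on="ROTA_ID", right_index=True)

print(len(pesagens), len(inflated), len(clean))
```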

In [None]:
df_pesagens = df_pesagens.reset_index().drop_duplicates(subset='PES_ID', keep='first').set_index('PES_ID')
df_rotas = df_rotas.reset_index().drop_duplicates(subset='ROTA_ID', keep='first').set_index('ROTA_ID')

With the duplicates removed, we can now merge the two into one dataset.

In [None]:
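# Carry PES_ID through the merge as a column, then restore it as the index; the inner join drops unmatched ROTA_IDs
df = df_pesagens.reset_index().merge(df_rotas, on='ROTA_ID', how='inner').set_index('PES_ID')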

In [None]:
df.info()
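
Beyond `info()`, pandas can check the merge itself. This sketch assumes toy data with hypothetical `PESO` and `SETOR` columns and uses two optional `merge` parameters. `validate='many_to_one'` raises an error if `ROTA_ID` is not unique in the routes table. `indicator=True` adds a `_merge` column that records whether each row found a match.

```python
import pandas as pd

# Toy data; values and the PESO / SETOR columns are made up for illustration.
pesagens = pd.DataFrame(
    {"PES_ID": [1, 2, 3], "ROTA_ID": [10, 11, 99], "PESO": [1200, 950, 400]}
)
rotas = pd.DataFrame({"ROTA_ID": [10, 11], "SETOR": ["A", "B"]})

# A left join keeps every weighing so unmatched rows can be inspected.
# validate="many_to_one" raises pandas.errors.MergeError if ROTA_ID repeats in rotas.
checked = pesagens.merge(
    rotas, on="ROTA_ID", how="left", validate="many_to_one", indicator=True
)

# Rows present only on the left had no matching route.
unmatched = checked[checked["_merge"] == "left_only"]

# Rows found on both sides are what an inner join would return.
merged = checked[checked["_merge"] == "both"].drop(columns="_merge")

print(len(merged), "matched;", len(unmatched), "unmatched")
```

Running the left join with an indicator first and filtering afterwards gives the same rows as an inner join, and it also shows what was discarded.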

## Exporting our dataset

Finally, we export the dataset to a `.csv` file.

In [None]:
df.to_csv('../data/dataset.csv', sep=',', index=True, index_label='PES_ID')
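
A quick round trip confirms that the exported file can be read back with `PES_ID` as the index. This sketch writes to an in-memory buffer instead of `../data/dataset.csv` so it does not touch the disk. The frame contents are made up and the `PESO` column is hypothetical.

```python
import io

import pandas as pd

# Toy stand-in for the merged dataset; values are made up.
df = pd.DataFrame(
    {"ROTA_ID": [10, 11], "PESO": [1200, 950]},
    index=pd.Index([1, 2], name="PES_ID"),
)

# Write with the same options as the export cell, but to memory.
buffer = io.StringIO()
df.to_csv(buffer, sep=",", index=True, index_label="PES_ID")
buffer.seek(0)

# Read it back, telling pandas which column holds the index.
restored = pd.read_csv(buffer, sep=",", index_col="PES_ID")

print(restored.index.name, restored.shape)
```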