## Import packages

In [1]:
import pandas as pd
import numpy as np
import sys 
from tqdm import tqdm
tqdm.pandas() 

sys.path.append("..")

from src.support_cleaning import normalize, apply_fill, fill_value_bidirectional

ModuleNotFoundError: No module named 'src.support_cleaning'

## Data import

In [None]:
brasil_public = pd.read_parquet("../data/concatenated_data.parquet")
brasil_public.info()
brasil_public.head()

## Data cleaning

The exploration phase identified the following problems to correct first:
- Data types for numerical features valor_previsto_atualizado, valor_lancado, valor_realizado, percentual_realizado and datetime data_lancamento
- Missing values in nome_orgao_superior, as well as other columns, to be inferred from other columns and rows
- Duplicated values

### 2.1 Correcting data types

In [42]:
data_types_dict = {
    "codigo_orgao_superior": object,
    "codigo_orgao": object,  
    "codigo_unidade_gestora": object,      
    "valor_previsto_atualizado": float,
    "valor_lancado": float,  
    "valor_realizado": float,      
    "percentual_realizado": float,
    "data_lancamento": "datetime64[ns]"
}

#### 2.1.1 Replacing decimal commas with decimal points

In [43]:
for column, data_type in data_types_dict.items():
    if data_type == float:
        brasil_public[column] = brasil_public[column].str.replace(",",".")
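
As a small illustration of this step, the sketch below runs the same replacement on a toy Series. The sample values are made up, and it assumes the raw strings use a comma only as the decimal separator, with no thousands separators.

```python
import pandas as pd

# Toy values (assumed for illustration, not taken from the dataset).
raw = pd.Series(["1234,56", "-0,5", "100"])

# regex=False treats "," as a literal character rather than a pattern.
converted = raw.str.replace(",", ".", regex=False).astype(float)
print(converted.tolist())
```

If the real data also contained thousands separators, those would have to be removed before the cast.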

#### 2.1.2 Correcting data type

In [None]:
brasil_public = brasil_public.astype(data_types_dict)
brasil_public.info()

In [None]:
brasil_public.describe().T.assign(missing_values= lambda x: brasil_public.shape[0] - x["count"]).T

Now that the numerical columns have the correct types, the revenue value columns can be explored. The negative minimum values stand out immediately. In theory, revenue should always be positive and expenditures should appear in separate reports, so one possible explanation is that these entries are corrections applied to the same category.
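
One way to probe that hypothesis is to count the negative entries per value column and pull out the affected rows for inspection. The sketch below does this on a toy frame; the rows are invented for illustration and only the column names come from the dataset.

```python
import pandas as pd

value_cols = ["valor_previsto_atualizado", "valor_lancado", "valor_realizado"]

# Invented rows for illustration; only the column names match the dataset.
toy = pd.DataFrame({
    "valor_previsto_atualizado": [100.0, 0.0, 50.0],
    "valor_lancado": [0.0, -20.0, 10.0],
    "valor_realizado": [90.0, -20.0, -5.0],
})

# Number of negative entries in each value column.
negative_counts = (toy[value_cols] < 0).sum()
print(negative_counts)

# Rows with at least one negative value, for manual inspection.
negative_rows = toy[(toy[value_cols] < 0).any(axis=1)]
print(negative_rows)
```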

In [None]:
brasil_public.describe(include=['O']).T.assign(missing_values= lambda x: brasil_public.shape[0] - x["count"])

### 2.2 Missing values

In [None]:
brasil_public.isna().sum()

#### 2.2.1 codigo_orgao_superior & nome_orgao_superior

Since codigo_orgao_superior and nome_orgao_superior should have a one-to-one relationship, each can be used to fill the gaps in the other. A support function does this, using an equivalence dictionary generated from the two columns like so:

In [None]:
codigo_nome_orgao_superior = brasil_public[["codigo_orgao_superior","nome_orgao_superior"]].value_counts().index.to_list()
codigo_nome_orgao_superior_dict = {codigo: nome for codigo, nome  in codigo_nome_orgao_superior}
codigo_nome_orgao_superior_dict

The function is then applied row by row with a pandas apply (here progress_apply, to get a tqdm progress bar):

In [None]:
brasil_public[['codigo_orgao_superior_filled', 'nome_orgao_superior_filled']] = brasil_public.progress_apply(
    lambda row: fill_value_bidirectional(row['codigo_orgao_superior'], row['nome_orgao_superior'], codigo_nome_orgao_superior_dict),
    axis=1,
    result_type='expand'
)
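
`fill_value_bidirectional` is defined in `src.support_cleaning` and its body is not shown in this notebook. As a rough idea of what such a helper could look like, here is a hypothetical `fill_pair_sketch`. It assumes the helper receives a code, a name and a code-to-name dictionary and returns a (code, name) pair, which matches the call above but may differ from the real implementation.

```python
import pandas as pd

def fill_pair_sketch(code, name, code_to_name):
    """Hypothetical stand-in: fill whichever side of a (code, name) pair is missing."""
    # Reverse lookup, valid only if the relationship really is one-to-one.
    name_to_code = {n: c for c, n in code_to_name.items()}
    if pd.isna(code) and not pd.isna(name):
        code = name_to_code.get(name)   # None if the name is unknown
    elif pd.isna(name) and not pd.isna(code):
        name = code_to_name.get(code)   # None if the code is unknown
    return code, name

# Illustrative mapping; the codes and names are made up.
mapping = {"001": "Ministry A", "002": "Ministry B"}
print(fill_pair_sketch(None, "Ministry B", mapping))
print(fill_pair_sketch("001", None, mapping))
print(fill_pair_sketch(None, None, mapping))
```

Rebuilding the reverse dictionary on every call is wasteful in a row-wise apply; it is done here only to keep the sketch self-contained.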

Now, checking that the filling works correctly:

In [None]:
# check filled vs original missing
display(brasil_public.isna().sum())


missing_filter_codigo = brasil_public["codigo_orgao_superior"].isna()
missing_filter_nome = brasil_public["nome_orgao_superior"].isna()

# printing missing code
display(brasil_public.loc[missing_filter_codigo,["codigo_orgao_superior","nome_orgao_superior",'codigo_orgao_superior_filled', 'nome_orgao_superior_filled']])

# printing missing name
brasil_public.loc[missing_filter_nome,["codigo_orgao_superior","nome_orgao_superior",'codigo_orgao_superior_filled', 'nome_orgao_superior_filled']]

The check confirms that rows where 'nome_orgao_superior' was missing now have the correct name. None values are essentially equivalent to NaN, so they are converted to NaN.

The remaining pairs can now be filled with the same technique.

#### 2.2.2 codigo_orgao & nome_orgao

To keep the code cleaner, building the equivalence dictionary and applying the filling function have been wrapped in a higher-level support function.

In [None]:
brasil_public[['codigo_orgao_filled', 'nome_orgao_filled']] = apply_fill(brasil_public[['codigo_orgao', 'nome_orgao']])
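
`apply_fill` is also part of `src.support_cleaning` and is not reproduced here. For comparison, the sketch below shows one vectorized way to get a similar result with `Series.map` and `Series.fillna`, avoiding a row-wise apply. The function name `fill_pair_vectorized` and the toy rows are assumptions for illustration, and the `direction` argument of the real helper is not modelled.

```python
import pandas as pd

def fill_pair_vectorized(pair: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: fill gaps in a two-column (code, name) frame."""
    code_col, name_col = pair.columns
    # Equivalences learned from the rows where both values are present.
    known = pair.dropna().drop_duplicates()
    code_to_name = dict(zip(known[code_col], known[name_col]))
    name_to_code = dict(zip(known[name_col], known[code_col]))
    filled = pair.copy()
    # Fill each column from the other one through the learned mapping.
    filled[code_col] = pair[code_col].fillna(pair[name_col].map(name_to_code))
    filled[name_col] = pair[name_col].fillna(pair[code_col].map(code_to_name))
    return filled

# Made-up rows: one missing name, one missing code.
toy = pd.DataFrame({
    "codigo_orgao": ["10", "10", None, "20"],
    "nome_orgao": ["Org A", None, "Org B", "Org B"],
})
result = fill_pair_vectorized(toy)
print(result)
```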

In [None]:
# check filled vs original missing
display(brasil_public.isna().sum())


missing_filter_codigo = brasil_public["codigo_orgao"].isna()
missing_filter_nome = brasil_public["nome_orgao"].isna()

# printing missing code
display(brasil_public.loc[missing_filter_codigo,["codigo_orgao","nome_orgao",'codigo_orgao_filled', 'nome_orgao_filled']])

# printing missing name
brasil_public.loc[missing_filter_nome,["codigo_orgao","nome_orgao",'codigo_orgao_filled', 'nome_orgao_filled']]

#### 2.2.3 codigo_unidade_gestora & nome_unidade_gestora

In [None]:
brasil_public[['codigo_unidade_gestora_filled', 'nome_unidade_gestora_filled']] = apply_fill(
    brasil_public[['codigo_unidade_gestora', 'nome_unidade_gestora']]
)

In [None]:
# check filled vs original missing
display(brasil_public.isna().sum())


missing_filter_codigo = brasil_public["codigo_unidade_gestora"].isna()
missing_filter_nome = brasil_public["nome_unidade_gestora"].isna()

# printing missing code
display(brasil_public.loc[missing_filter_codigo,["codigo_unidade_gestora","nome_unidade_gestora",'codigo_unidade_gestora_filled', 'nome_unidade_gestora_filled']])

# printing missing name
brasil_public.loc[missing_filter_nome,["codigo_unidade_gestora","nome_unidade_gestora",'codigo_unidade_gestora_filled', 'nome_unidade_gestora_filled']]

#### 2.2.4 categoria_economica & origem_receita & especie_receita 

These revenue groupings should form a strict hierarchy with no shared children, meaning that each origem_receita belongs to only one categoria_economica. The same could hold for especie_receita with respect to origem_receita, so let's check by manually inspecting the value counts, starting with origem_receita against categoria_economica, sorted by origem_receita.

In [None]:
pd.set_option("display.max_rows",85)
brasil_public[['categoria_economica', 'origem_receita']].value_counts().reset_index().sort_values(by="origem_receita")
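
The same check can also be done programmatically rather than by eye: group by the child column and count the distinct parents. The sketch below uses invented category values and a hypothetical helper name, `children_with_multiple_parents`.

```python
import pandas as pd

def children_with_multiple_parents(df, parent, child):
    """Return child values that appear under more than one parent value."""
    # nunique ignores missing parents by default.
    parents_per_child = df.groupby(child)[parent].nunique()
    return parents_per_child[parents_per_child > 1]

# Invented values for illustration.
toy = pd.DataFrame({
    "categoria_economica": ["Cat 1", "Cat 1", "Cat 2", "Cat 2", None],
    "origem_receita": ["Orig A", "Orig B", "Orig C", "Orig A", "Orig B"],
})
violations = children_with_multiple_parents(toy, "categoria_economica", "origem_receita")
print(violations)
```

An empty result would mean that every child value sits under a single parent, which is the condition the unidirectional fill relies on.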

As suspected, each origem_receita belongs to a single categoria_economica. Only one modification is needed for this to work: creating a separate boolean column for the 'intra-orçamentárias' revenues, so that the marker no longer lives inside categoria_economica.

In [None]:
brasil_public['categoria_economica'].str.split(" - ",expand=True)

In [None]:
brasil_public[['categoria_economica','intra_orcamentaria']] = brasil_public['categoria_economica'].str.split(" - ",expand=True)


In [None]:
# Assumption: the second split column holds the text after " - " (or a missing value),
# so casting it to int would fail; the flag is True wherever a suffix was present.
brasil_public['intra_orcamentaria'] = brasil_public['intra_orcamentaria'].notna()
brasil_public[['categoria_economica','intra_orcamentaria']]
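
To make the split concrete, here is a toy run. The category labels are assumptions about what the column looks like (a base category optionally followed by ` - ` and a suffix), not values taken from the dataset. In this sketch the flag is derived from whether a suffix exists.

```python
import pandas as pd

# Assumed label format for illustration: "<category>" or "<category> - <suffix>".
toy = pd.DataFrame({
    "categoria_economica": [
        "Receitas Correntes",
        "Receitas Correntes - intra-orçamentárias",
        "Receitas de Capital",
    ],
})

# n=1 splits on the first " - " only, so the result has two columns here.
parts = toy["categoria_economica"].str.split(" - ", n=1, expand=True)
toy["categoria_economica"] = parts[0]
toy["intra_orcamentaria"] = parts[1].notna()  # True where a suffix was present
print(toy)
```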

With that modification done, the filling method can be used unidirectionally from origem_receita to categoria_economica.

In [None]:
brasil_public[['categoria_economica_filled', 'origem_receita_filled']] = apply_fill(
    brasil_public[['categoria_economica', 'origem_receita']], direction="left"
)

In [None]:
display(brasil_public[['origem_receita', 'especie_receita']].value_counts().reset_index().sort_values(by="especie_receita"))
pd.set_option("display.max_rows",None)