## Importing libraries

---

In [None]:
# Data treatment
# -----------------------------------------------------------------------
import pandas as pd
import numpy as np

# Path
import sys
sys.path.append('../')

# Config
# -----------------------------------------------------------------------
pd.set_option('display.max_columns', None)

from src.support import check_columns

## Loading data

---

In [2]:
# Build a list of paths, read them and store dataframes in otrher list
years = [i for i in range(2013, 2022)]
paths = [f"../data/raw/datos-{year}.csv" for year in years]
data = []

for path in paths:
    data.append(pd.read_csv(path, sep = ';'))

### Inspecting datasets

In [3]:
# We check all columns are the same for all dataframes
check_columns(data)

True

Renaming columns to understandable English

In [6]:
# Name dictionary
new_columns = {
    'CÓDIGO ÓRGÃO SUPERIOR': 'Superior Agency Code',
    'NOME ÓRGÃO SUPERIOR': 'Superior Agency Name',
    'CÓDIGO ÓRGÃO': 'Agency Code',
    'NOME ÓRGÃO': 'Agency Name',
    'CÓDIGO UNIDADE GESTORA': 'Managing Unit Code',
    'NOME UNIDADE GESTORA': 'Managing Unit Name',
    'CATEGORIA ECONÔMICA': 'Economic Category',
    'ORIGEM RECEITA': 'Revenue Source',
    'ESPÉCIE RECEITA': 'Revenue Type',
    'DETALHAMENTO': 'Detailing',
    'VALOR PREVISTO ATUALIZADO': 'Updated Budgeted Amount',
    'VALOR LANÇADO': 'Posted Amount',
    'VALOR REALIZADO': 'Actual Amount',
    'PERCENTUAL REALIZADO': 'Realization Percentage',
    'DATA LANÇAMENTO': 'Posting Date',
    'ANO EXERCÍCIO': 'Fiscal Year'
}

# Dataframe renaming
[df.rename(columns= new_columns, inplace=True) for df in data]

# Check new column names for all dataframes
# print([df.columns for df in data])

[None, None, None, None, None, None, None, None, None]

We can concatenate the datasets since they share the same structure.

Before concatenating datasets we check dates both in `Fiscal Year` and `Posting Date`.

+ `Fiscal Year`

In [7]:
# Checking Fiscal Year
for df in data:
    print(df['Fiscal Year'].unique())

[  nan 2013.]
[2014.   nan]
[2015.   nan]
[2016.   nan]
[2017.   nan]
[  nan 2018.]
[2019.   nan]
[2020.   nan]
[2021.   nan]


+ `Posting Date`

In [8]:
# Checking Posting Date
# Since date format is DD/MM/YYYY we can extract it using a regex pattern 
pattern = r'/(\d{4})'

for df in data:
    print(df['Posting Date'].str.extract(pattern)[0].unique())

['2013' nan]
['2014' nan]
['2015' nan]
['2016' nan]
['2017' nan]
['2018' nan]
['2019' nan]
['2020' nan]
['2021' nan]


We can check number of missing year data

In [9]:
for df in data:
    print(f'Missing {df[df['Posting Date'].isna() & data[0]['Fiscal Year'].isna()].shape[0]} out of {df.shape[0]}')

Missing 31 out of 4498
Missing 42 out of 4553
Missing 31 out of 4523
Missing 11 out of 194533
Missing 41 out of 190479
Missing 29 out of 173944
Missing 14 out of 176828
Missing 21 out of 142348
Missing 37 out of 134593


We see that we have some NaN values, but not mixed data from years so we will assume that NaN can be replaced by the year

In [10]:
i = 0

for df in data:

    df['Fiscal Year'] = years[i]
    i += 1

## Concatenate dataframes

---

We can concatenate the list of dataframes to stack them in a single one

In [11]:
# We use pd.concat() since we have the same columns
df_full = pd.concat(data, ignore_index = True)
df_full.sample()

Unnamed: 0,Superior Agency Code,Superior Agency Name,Agency Code,Agency Name,Managing Unit Code,Managing Unit Name,Economic Category,Revenue Source,Revenue Type,Detailing,Updated Budgeted Amount,Posted Amount,Actual Amount,Realization Percentage,Posting Date,Fiscal Year
829779,26000.0,,26241.0,Universidade Federal do Paraná,153079.0,UNIVERSIDADE FEDERAL DO PARANA,Receitas Correntes,Outras Receitas Correntes,,OUTRAS RESTITUICOES-PRINCIPAL,0,0,1269700,0,01/09/2020,2020


In [12]:
df_full.shape

(1026299, 16)

We see that we have 1026299 entries over the 2013-2021 period.

Now we can seva the dataframe

In [13]:
# Now we save the dataframe
df_full.to_csv("../data/output/data_full.csv", index = False)