## Importing libraries

---

In [1]:
# Data treatment
# -----------------------------------------------------------------------
import pandas as pd
import numpy as np

# Path
import sys
sys.path.append('../')

# Config
# -----------------------------------------------------------------------
pd.set_option('display.max_columns', None) # show every column when displaying DataFrames

from src.support import cleaning_columns

import warnings
warnings.filterwarnings('ignore')

## Data loading

---

In [2]:
path = "../data/output/data_full.csv"

df = pd.read_csv(path)

We check whether there are any duplicates.

In [3]:
df.duplicated().value_counts()

False    1026214
True          85
Name: count, dtype: int64

There are 85 duplicates. That is a very small share of the 1,026,299 rows, so we simply drop them.

In [4]:
# Drop duplicates
df.drop_duplicates(inplace=True)
df.duplicated().value_counts()

False    1026214
Name: count, dtype: int64
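
As a small illustration of the check above, here is a minimal sketch on a made-up toy frame (the values are invented, not taken from the dataset). It shows that `duplicated()` only flags the repeats of an earlier row, and that `drop_duplicates()` keeps the original index labels, so the index has gaps afterwards unless it is reset.

```python
import pandas as pd

# Toy frame with invented values: rows 0 and 2 are identical
toy = pd.DataFrame({
    "Agency": ["A", "B", "A"],
    "Actual Amount": ["10,5", "3,0", "10,5"],
})

# duplicated() marks each repeat of an earlier row (keep="first" is the default)
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence and the original index labels
deduped = toy.drop_duplicates()

print(n_duplicates)
print(deduped.index.tolist())
```

If a gap-free index is needed later, `deduped.reset_index(drop=True)` would renumber the rows.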

## Data cleaning

---

We inspect the dataframe's structure to decide which columns contain relevant information.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1026214 entries, 0 to 1026298
Data columns (total 16 columns):
 #   Column                   Non-Null Count    Dtype  
---  ------                   --------------    -----  
 0   Superior Agency Code     995855 non-null   float64
 1   Superior Agency Name     667027 non-null   object 
 2   Agency Code              1001097 non-null  float64
 3   Agency Name              991328 non-null   object 
 4   Managing Unit Code       992633 non-null   float64
 5   Managing Unit Name       1006733 non-null  object 
 6   Economic Category        1007236 non-null  object 
 7   Revenue Source           987797 non-null   object 
 8   Revenue Type             994287 non-null   object 
 9   Detailing                996878 non-null   object 
 10  Updated Budgeted Amount  974899 non-null   object 
 11  Posted Amount            999795 non-null   object 
 12  Actual Amount            986772 non-null   object 
 13  Realization Percentage   1002080 non-null  obje

We see the following info:

* We have 1026214 entries

* `Superior Agency Code` and `Superior Agency Name` appear to describe the same entity. The code column has more non-null entries, so we want to keep its information. However, names are easier to read in the analysis, so we use the codes to look up the corresponding names and fill in the missing ones.

* The same applies to `Agency Code` and `Agency Name`.

* The same applies to `Managing Unit Code` and `Managing Unit Name`.
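
The `cleaning_columns` helper from `src.support` is not shown in this notebook, so the following is only a hypothetical sketch of the idea described in the bullets: build a code-to-name lookup from the rows where both are known, use it to fill missing names, and keep a single merged column. The function name `fill_names_from_codes` and the toy data are assumptions for illustration, and the real helper may work differently.

```python
import numpy as np
import pandas as pd

def fill_names_from_codes(df, prefix):
    """Hypothetical stand-in for `cleaning_columns`; the real helper may differ."""
    code_col, name_col = f"{prefix} Code", f"{prefix} Name"

    # Build a code -> name lookup from rows where both values are present
    known = df.dropna(subset=[code_col, name_col])
    lookup = known.drop_duplicates(subset=code_col).set_index(code_col)[name_col]

    # Fill missing names via the lookup, then keep one merged column
    df[prefix] = df[name_col].fillna(df[code_col].map(lookup))
    df.drop(columns=[code_col, name_col], inplace=True)
    return df

# Toy data with invented agencies
toy = pd.DataFrame({
    "Agency Code": [1.0, 1.0, 2.0, np.nan],
    "Agency Name": ["Agency One", np.nan, "Agency Two", np.nan],
})
fill_names_from_codes(toy, "Agency")
print(toy)
```

Rows that have neither a code nor a name stay null, which would be consistent with the merged columns still showing some missing values afterwards.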


In [6]:
for col in ['Superior Agency', 'Agency', 'Managing Unit']:
    cleaning_columns(df, col)

In [8]:
df.sample()

| | Superior Agency | Agency | Managing Unit | Economic Category | Revenue Source | Revenue Type | Detailing | Updated Budgeted Amount | Posted Amount | Actual Amount | Realization Percentage | Posting Date | Fiscal Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 825211 | Ministério da Educação | Instituto Federal do Piauí | INST.FED.DE EDUC.,CIENC.E TEC.DO PIAUI | Receitas Correntes | Outras Receitas Correntes | Indenizações, restituições e ressarcimentos | REST.DESPESAS EXERC.ANT.FIN.FTE.PRIM.-PRINC. | 0 | 0 | 4470535 | 0 | 03/02/2020 | 2020 |


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1026214 entries, 0 to 1026298
Data columns (total 13 columns):
 #   Column                   Non-Null Count    Dtype 
---  ------                   --------------    ----- 
 0   Superior Agency          1015669 non-null  object
 1   Agency                   1025309 non-null  object
 2   Managing Unit            1025600 non-null  object
 3   Economic Category        1007236 non-null  object
 4   Revenue Source           987797 non-null   object
 5   Revenue Type             994287 non-null   object
 6   Detailing                996878 non-null   object
 7   Updated Budgeted Amount  974899 non-null   object
 8   Posted Amount            999795 non-null   object
 9   Actual Amount            986772 non-null   object
 10  Realization Percentage   1002080 non-null  object
 11  Posting Date             1002467 non-null  object
 12  Fiscal Year              1026214 non-null  int64 
dtypes: int64(1), object(12)
memory usage: 141.9+ MB


---

Now we convert the following columns to their proper types:

* Updated Budgeted Amount (numeric)

* Posted Amount (numeric)

* Actual Amount (numeric)

* Realization Percentage (numeric)

* Posting Date (datetime)

In [9]:
# Convert dates to datetime (the strings are day/month/year)
df['Posting Date'] = pd.to_datetime(df['Posting Date'], dayfirst=True)
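
To see why `dayfirst=True` matters here, the sketch below parses the date string from the sampled row (`03/02/2020`) both ways. The explicit `format` variant is an optional, stricter alternative and is not what the notebook uses.

```python
import pandas as pd

# Same date string as in the sampled row above
raw = pd.Series(["03/02/2020"])

day_first = pd.to_datetime(raw, dayfirst=True)        # read as 3 February 2020
month_first = pd.to_datetime(raw)                     # default reading: 2 March 2020
explicit = pd.to_datetime(raw, format="%d/%m/%Y")     # stricter: fails on other layouts

print(day_first.iloc[0], month_first.iloc[0], explicit.iloc[0])
```

Without `dayfirst=True`, any date whose day is 12 or lower could be silently read with day and month swapped.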

We need to convert these `object` columns to `float`. The values use a decimal comma, so we replace commas with dots before casting.

We also convert 0 to NaN so that zero amounts are treated as missing values.

In [10]:
categories = ['Updated Budgeted Amount', 'Posted Amount', 'Actual Amount', 'Realization Percentage']

for cat in categories:
    df[cat] = df[cat].str.replace(',', '.').astype(float).replace(0, np.nan)
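
The conversion chain can be followed step by step on a toy Series. The values below are invented for illustration, and `regex=False` is an optional addition that makes the comma a literal character rather than a pattern.

```python
import numpy as np
import pandas as pd

# Invented values using a decimal comma, plus one missing entry
raw = pd.Series(["1234,56", "0,00", np.nan, "7,5"])

# Swap the decimal comma for a dot, then cast the strings to float
as_float = raw.str.replace(",", ".", regex=False).astype(float)

# Zeros become NaN, so they are counted together with the originally missing values
cleaned = as_float.replace(0, np.nan)

print(cleaned.tolist())
print(int(cleaned.isna().sum()))
```

Note that after this step a true zero can no longer be told apart from a value that was missing in the source file.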

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1026214 entries, 0 to 1026298
Data columns (total 13 columns):
 #   Column                   Non-Null Count    Dtype         
---  ------                   --------------    -----         
 0   Superior Agency          1015669 non-null  object        
 1   Agency                   1025309 non-null  object        
 2   Managing Unit            1025600 non-null  object        
 3   Economic Category        1007236 non-null  object        
 4   Revenue Source           987797 non-null   object        
 5   Revenue Type             994287 non-null   object        
 6   Detailing                996878 non-null   object        
 7   Updated Budgeted Amount  18850 non-null    float64       
 8   Posted Amount            7042 non-null     float64       
 9   Actual Amount            966555 non-null   float64       
 10  Realization Percentage   7241 non-null     float64       
 11  Posting Date             1002467 non-null  datetime64[ns]
 12  Fisca

---

We'll analyze `Economic Category`, so we first clean its values.

In [12]:
df['Economic Category'].unique()

array(['Receitas Correntes', 'Receitas de Capital', nan,
       'Receitas Correntes - intra-orçamentárias', 'Sem informação',
       'Receitas de Capital - intra-orçamentárias'], dtype=object)

In [13]:
# Map the Portuguese category labels to their English equivalents

economic_category_dict = {
    'Receitas Correntes': 'Current Revenues',
    'Receitas de Capital': 'Capital Revenues',
    'Receitas Correntes - intra-orçamentárias': 'Intra-Budgetary Current Revenues',
    'Sem informação': 'No Information',
    'Receitas de Capital - intra-orçamentárias': 'Intra-Budgetary Capital Revenues'
}

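# Assign back instead of using inplace=True on a selected column, which newer pandas versions warn about
df['Economic Category'] = df['Economic Category'].replace(economic_category_dict)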

In [14]:
df['Economic Category'].unique()

array(['Current Revenues', 'Capital Revenues', nan,
       'Intra-Budgetary Current Revenues', 'No Information',
       'Intra-Budgetary Capital Revenues'], dtype=object)

In [15]:
# Replace NaN with 'No Information' (assigned back rather than modified in place)
df['Economic Category'] = df['Economic Category'].fillna('No Information')

In [16]:
df['Economic Category'].unique()

array(['Current Revenues', 'Capital Revenues', 'No Information',
       'Intra-Budgetary Current Revenues',
       'Intra-Budgetary Capital Revenues'], dtype=object)
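
One optional safeguard for this kind of relabelling is to check that the dictionary covers every label before replacing. The sketch below uses a shortened mapping and a made-up extra label (`'Outra'`) purely to show the check; neither is part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy column with one invented label that the shortened mapping does not cover
col = pd.Series(["Receitas Correntes", "Sem informação", np.nan, "Outra"])
mapping = {
    "Receitas Correntes": "Current Revenues",
    "Sem informação": "No Information",
}

# Labels present in the data but absent from the mapping (NaN excluded)
unmapped = set(col.dropna().unique()) - set(mapping)

# Translate known labels, then fold NaN into the 'No Information' category
translated = col.replace(mapping).fillna("No Information")

print(unmapped)
print(translated.tolist())
```

An empty `unmapped` set would confirm that no original label is left untranslated.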

---

Now we can check the percentage of null values in the numeric columns.

In [17]:
for cat in categories:
    # The mean of a boolean mask is the fraction of null values
    print(f'{cat}: {round(df[cat].isna().mean() * 100, 2)}%')

Updated Budgeted Amount: 98.16%
Posted Amount: 99.31%
Actual Amount: 5.81%
Realization Percentage: 99.29%
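
The same percentages can be computed for several columns at once, since the mean of a boolean mask is the fraction of `True` values. A minimal sketch on invented data:

```python
import numpy as np
import pandas as pd

# Invented values: 3 of 4 missing in one column, 1 of 4 in the other
toy = pd.DataFrame({
    "Posted Amount": [np.nan, np.nan, np.nan, 5.0],
    "Actual Amount": [1.0, 2.0, np.nan, 4.0],
})

# isna() yields booleans; their column-wise mean is the share of missing values
null_pct = (toy.isna().mean() * 100).round(2)

print(null_pct)
```

On the real dataframe this would correspond to something like `df[categories].isna().mean() * 100`.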


Every numeric column except `Actual Amount` is almost entirely null, largely because of the zeros we converted to NaN. Unfortunately there's little we can do about this, so let's save the dataframe.

In [18]:
# Now we save the dataframe
df.to_csv("../data/output/data_clean.csv", index = False)
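
One thing to keep in mind when reloading the saved file: CSV stores no dtype information, so the datetime column comes back as text unless it is parsed again. The sketch below demonstrates this with an in-memory buffer and invented values; passing `parse_dates` when reading is a suggestion, not something the notebook does.

```python
import io
import pandas as pd

# Invented one-row frame with a datetime column
toy = pd.DataFrame({
    "Posting Date": pd.to_datetime(["2020-02-03"]),
    "Actual Amount": [10.5],
})

# Write to an in-memory buffer instead of disk to keep the sketch self-contained
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)                                  # date comes back as text
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["Posting Date"])   # date restored as datetime

print(plain.dtypes)
print(parsed.dtypes)
```

With `index=False`, the row index is not written, so the reloaded frame has no extra unnamed index column.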