### Importing libraries

In [187]:
# Importing libraries

# Data treatment
# -----------------------------------------------------------------------
import pandas as pd
import numpy as np

# Path
import sys
sys.path.append('../')

# Config
# -----------------------------------------------------------------------
pd.set_option('display.max_columns', None) # show all columns when displaying DataFrames

### Data loading

In [188]:
path = "../data/output/data_full.csv"

df = pd.read_csv(path, index_col= None)

In [189]:
# Check that there are no duplicated rows
df.duplicated().sum()

0

### Data cleaning

We can remove irrelevant columns:

* "Unnamed: 0": Does not provide relevant information

* 'Superior Agency Code': We can use directly 'Superior Agency Name'

* 'Agency Code': We can use directly 'Agency Name'

* 'Managing Unit Code': We can use directly 'Managing Unit Name'
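Dropping a code column in favour of its name column is only lossless if each code corresponds to a single name. A minimal sketch of how that could be checked before dropping, using a small hypothetical frame (`toy` and the helper `maps_to_single_name` are illustrative names, not part of this notebook):

```python
import pandas as pd

def maps_to_single_name(frame, code_col, name_col):
    """Return True if every code in code_col is paired with at most one name."""
    # Count distinct names per code; missing codes and names are ignored.
    names_per_code = frame.groupby(code_col)[name_col].nunique()
    return bool((names_per_code <= 1).all())

# Hypothetical data, only to illustrate the check.
toy = pd.DataFrame({
    'Agency Code': [1, 1, 2],
    'Agency Name': ['Agency A', 'Agency A', 'Agency B'],
})
toy_ok = maps_to_single_name(toy, 'Agency Code', 'Agency Name')

# A frame where one code carries two different names fails the check.
toy_conflict = pd.DataFrame({
    'Agency Code': [1, 1],
    'Agency Name': ['Agency A', 'Agency C'],
})
toy_bad = maps_to_single_name(toy_conflict, 'Agency Code', 'Agency Name')
```

On the real data, the same helper could be called once per code/name pair before the columns are removed.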

In [190]:
# Remove columns
df.drop(columns=["Unnamed: 0", "Superior Agency Code", "Agency Code", "Managing Unit Code"], inplace = True)

In [191]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1026299 entries, 0 to 1026298
Data columns (total 13 columns):
 #   Column                   Non-Null Count    Dtype 
---  ------                   --------------    ----- 
 0   Superior Agency Name     667093 non-null   object
 1   Agency Name              991412 non-null   object
 2   Managing Unit Name       1006818 non-null  object
 3   Economic Category        1007321 non-null  object
 4   Revenue Source           987881 non-null   object
 5   Revenue Type             994372 non-null   object
 6   Detailing                996962 non-null   object
 7   Updated Budgeted Amount  974984 non-null   object
 8   Posted Amount            999880 non-null   object
 9   Actual Amount            986834 non-null   object
 10  Realization Percentage   1002165 non-null  object
 11  Posting Date             1002468 non-null  object
 12  Fiscal Year              1026299 non-null  int64 
dtypes: int64(1), object(12)
memory usage: 101.8+ MB


Next, we convert the following columns to their proper types:

* Updated Budgeted Amount (numeric)

* Posted Amount (numeric)

* Actual Amount (numeric)

* Realization Percentage (numeric)

* Posting Date (datetime)
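For the date column, the order of day and month matters when both values are 12 or below. A short sketch of what `dayfirst=True` changes, assuming dates written as day/month/year (the sample strings are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical day-first date strings.
sample_dates = pd.Series(['03/04/2021', '25/12/2020'])

# With dayfirst=True, '03/04/2021' is read as 3 April, not 4 March.
parsed = pd.to_datetime(sample_dates, dayfirst=True)
```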

In [192]:
categories = ['Updated Budgeted Amount', 'Posted Amount', 'Actual Amount', 'Realization Percentage', 'Posting Date']

for cat in categories:

    if cat == 'Posting Date':
        # Parse dates with the day first
        df[cat] = pd.to_datetime(df[cat], dayfirst=True)

    else:
        # Decimal comma -> dot, cast to float, then treat 0 as missing
        df[cat] = df[cat].str.replace(',', '.').astype(float).replace(0, np.nan)

The conversion above has to handle two issues:

* Replacing the decimal comma with a dot (e.g. '0,00' becomes '0.00') so the strings can be converted to float

* Replacing 0 with NaN so that zero amounts are treated as missing values during data cleaning
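As a small illustration of the numeric conversion used in the loop above (decimal comma to dot, cast to float, then `.replace(0, np.nan)`), here is the same chain applied to a hypothetical Series. The values are made up, and the sketch assumes the strings contain no thousands separators:

```python
import numpy as np
import pandas as pd

# Hypothetical raw values with decimal commas and one missing entry.
raw = pd.Series(['0,00', '1234,56', '-0,01', None])

converted = (
    raw.str.replace(',', '.', regex=False)  # '1234,56' -> '1234.56'
       .astype(float)                       # strings -> float, missing stays NaN
       .replace(0, np.nan)                  # zero amounts become NaN
)
```

After this step both the original missing entry and the '0,00' entry are NaN, which is why the non-null counts of the amount columns drop sharply after conversion.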

In [193]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1026299 entries, 0 to 1026298
Data columns (total 13 columns):
 #   Column                   Non-Null Count    Dtype         
---  ------                   --------------    -----         
 0   Superior Agency Name     667093 non-null   object        
 1   Agency Name              991412 non-null   object        
 2   Managing Unit Name       1006818 non-null  object        
 3   Economic Category        1007321 non-null  object        
 4   Revenue Source           987881 non-null   object        
 5   Revenue Type             994372 non-null   object        
 6   Detailing                996962 non-null   object        
 7   Updated Budgeted Amount  18850 non-null    float64       
 8   Posted Amount            7042 non-null     float64       
 9   Actual Amount            966613 non-null   float64       
 10  Realization Percentage   7241 non-null     float64       
 11  Posting Date             1002468 non-null  datetime64[ns]
 12  

In [194]:
(df[(df['Posted Amount'].isnull()) | (df['Posted Amount'] == 0)].shape[0]) / (df.shape[0])

0.993138451854674

In [195]:
df['Posted Amount'].value_counts()

Posted Amount
-1.000000e-02    31
 1.000000e-02    24
 1.100000e-01    10
 2.000000e-02     8
 4.920760e+03     7
                 ..
-4.960937e+06     1
 7.417899e+05     1
-3.440536e+08     1
-7.440405e+08     1
 4.151808e+04     1
Name: count, Length: 6632, dtype: int64

In [196]:
df_sin_nulos = df[df.notnull().all(axis=1)]

In [197]:
df_sin_nulos['Posted Amount'].value_counts()

Posted Amount
159089.35    2
8559.31      1
Name: count, dtype: int64

In [198]:
df[df['Posted Amount'] == 0].shape[0]

0