## Importing libraries

---

In [83]:
# Importing libraries

# Data treatment
# -----------------------------------------------------------------------
import pandas as pd
import numpy as np

# Path
import sys
sys.path.append('../')

# Config
# -----------------------------------------------------------------------
pd.set_option('display.max_columns', None)  # show all DataFrame columns when displaying

## Load data

---

In [84]:
path = "../data/output/data_clean.csv"

df = pd.read_csv(path)

In [85]:
df.sample()

Unnamed: 0,Superior Agency,Agency,Managing Unit Code,Economic Category,Revenue Source,Revenue Type,Detailing,Updated Budgeted Amount,Posted Amount,Actual Amount,Realization Percentage,Posting Date,Fiscal Year
856669,Ministério da Infraestrutura,Agência Nacional de Transportes Terrestres,AGENCIA NACIONAL DE TRANSPORTES TERRESTRES,Current Revenues,Outras Receitas Correntes,"Multas administrativas, contratuais e judicia",MULTAS E JUROS PREVISTOS EM CONTRATOS-PRINC.,,,756405.16,,2020-01-13,2020
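The sample row appears to carry an `Unnamed: 0` column, which typically shows up when a CSV saved with its index is read back without `index_col`. Assuming that is what happened here, the following minimal sketch shows how `index_col=0` and `parse_dates` could be passed to `pd.read_csv`. The in-memory CSV and its rows are hypothetical stand-ins for `data_clean.csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for data_clean.csv: the first (unnamed) field is the saved index
csv_text = """,Economic Category,Actual Amount,Posting Date,Fiscal Year
0,Current Revenues,756405.16,2020-01-13,2020
1,Capital Revenues,,2020-02-03,2020
"""

# index_col=0 reuses the saved index instead of creating an "Unnamed: 0" column;
# parse_dates converts "Posting Date" to datetime values while reading
demo = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=["Posting Date"])

print(demo.columns.tolist())
print(demo.dtypes)
```

Parsing the date column at load time also makes later month-level grouping simpler.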


#### Currency observation

Although the data do not explicitly state the currency, these are Brazilian government figures, so it is reasonable to assume that the amounts are in Brazilian reais (BRL), the official currency of Brazil. This would be consistent with how that government usually publishes financial data.

## Revenue Distribution by Economic Category

---

Let's start by analyzing the most significant revenue categories and their contribution to total revenues. First, a quick look at the distinct values of Economic Category:

In [86]:
df['Economic Category'].unique()

array(['Current Revenues', 'Capital Revenues', 'No Information',
       'Intra-Budgetary Current Revenues',
       'Intra-Budgetary Capital Revenues'], dtype=object)

Next, the number of operations with a registered Actual Amount, grouped by Economic Category (`count()` only counts non-null values):

In [87]:
df.groupby('Economic Category')['Actual Amount'].count().round(0)

Economic Category
Capital Revenues                     27065
Current Revenues                    907002
Intra-Budgetary Capital Revenues        86
Intra-Budgetary Current Revenues     14567
No Information                       17893
Name: Actual Amount, dtype: int64

These counts give some initial insights into the data:

1. **Current Revenues** is by far the most represented category, with 907,002 entries. Current revenues probably include taxes, fees, and other regular government income.

2. **Intra-Budgetary Current Revenues** has far fewer entries, 14,567 in total. Intra-budgetary revenues, those generated within the government's own budget, therefore account for a much smaller portion of the entries.

3. **Capital Revenues** has 27,065 entries, so capital revenues (which might include revenue from asset sales, investments, or financing) are present but far less frequent than current revenues. However, these transactions are usually much larger in value than those under Current Revenues, as the totals by category show further on.

4. **Intra-Budgetary Capital Revenues**, with only 86 entries, is rare. This could imply that few events or sources of capital income occur within the government's own budget.

5. **No Information** has 17,893 entries. Whether this is a significant loss of information depends on how large the amounts in these entries are.
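To judge whether a category with few entries still carries large amounts, the entry count, mean, and total can be computed in one pass. The following is a minimal sketch on a small hypothetical DataFrame (the rows are made up). Only the column names match the notebook.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the dataset
demo = pd.DataFrame({
    "Economic Category": [
        "Current Revenues", "Current Revenues", "Capital Revenues", "No Information",
    ],
    "Actual Amount": [100.0, 300.0, 5000.0, None],
})

# count ignores missing amounts; mean and sum are computed over the non-missing ones
summary = demo.groupby("Economic Category")["Actual Amount"].agg(["count", "mean", "sum"])
print(summary)
```

On the real data the same call would be `df.groupby('Economic Category')['Actual Amount'].agg(['count', 'mean', 'sum'])`.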

Keep in mind that some entries have no 'Actual Amount' available. Entries with a registered value represent 94.18% of the data, so roughly 5.82% are missing it:

In [90]:
round(df.groupby('Economic Category')['Actual Amount'].count().sum() / df.shape[0] * 100, 2)

94.18
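The same share can be read directly off the column without grouping, since the mean of a boolean mask is the fraction of `True` values. A minimal sketch, using a hypothetical four-row column in place of the real one:

```python
import pandas as pd

# Hypothetical column with one missing value out of four
demo = pd.DataFrame({"Actual Amount": [10.0, None, 30.0, 40.0]})

# Mean of a boolean mask = fraction of rows where the mask is True
share_present = demo["Actual Amount"].notna().mean() * 100
share_missing = demo["Actual Amount"].isna().mean() * 100

print(round(share_present, 2), round(share_missing, 2))
```

Unlike the grouped version, this does not depend on 'Economic Category' being filled in for every row.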

Now let's look at the total value of operations with a registered Actual Amount, grouped by Economic Category:

In [91]:
df.groupby('Economic Category')['Actual Amount'].sum().round(0)

Economic Category
Capital Revenues                    1.200415e+13
Current Revenues                    1.202508e+13
Intra-Budgetary Capital Revenues    2.141127e+10
Intra-Budgetary Current Revenues    2.805787e+11
No Information                      3.271665e+11
Name: Actual Amount, dtype: float64

Again, we can extract some valuable insights:

1. **Current Revenues**:
   - The total actual revenues amount to **1.2025 × 10¹³** (approximately 12.03 trillion Brazilian reais). This confirms that current revenues are a huge source of income for the government, as the number of entries in this category already suggested.

2. **Intra-Budgetary Current Revenues**:
   - The total actual revenues amount to **2.8058 × 10¹¹** (280.58 billion reais). Although much smaller than Current Revenues, it is still a significant amount.

3. **Capital Revenues**:
   - The total actual revenues amount to **1.2004 × 10¹³** (approximately 12.00 trillion reais). Capital revenues are almost equal in magnitude to current revenues, despite having far fewer entries, so this category also plays a crucial role in government revenue.

4. **Intra-Budgetary Capital Revenues**:
   - The total actual amount is **2.141 × 10¹⁰** (21.41 billion reais). As with Intra-Budgetary Current Revenues, this is a small fraction of the corresponding non-intra-budgetary total.

5. **No Information**:
   - The total amount is **3.2717 × 10¹¹** (327.17 billion reais), which is 1.33% of the overall total and has no known category. The share looks small, but it is a large sum that should have been labeled.

In [96]:
# Share (in %) of the total Actual Amount that falls under 'No Information'
round(df[df['Economic Category'] == 'No Information']['Actual Amount'].sum() / df['Actual Amount'].sum() * 100, 2)

1.33
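The same percentage can be obtained for every category at once by dividing the per-category totals by their sum. The sketch below re-enters the rounded totals printed above as a literal Series instead of reading the CSV, so its results are only as precise as those printed figures.

```python
import pandas as pd

# Per-category totals copied from the printed output above (rounded values)
totals = pd.Series(
    {
        "Capital Revenues": 1.200415e13,
        "Current Revenues": 1.202508e13,
        "Intra-Budgetary Capital Revenues": 2.141127e10,
        "Intra-Budgetary Current Revenues": 2.805787e11,
        "No Information": 3.271665e11,
    },
    name="Actual Amount",
)

# Each category's share of the overall total, in percent
shares = (totals / totals.sum() * 100).round(2)
print(shares.sort_values(ascending=False))
```

On the full DataFrame, `totals` would instead come from `df.groupby('Economic Category')['Actual Amount'].sum()`.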

---

Now we can compute the average difference between budgeted and actual revenues for each category. This only makes sense for entries that have both figures available, which is only 8,837 entries:

In [113]:
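# .copy() makes df_diff an independent DataFrame, so adding a column later does not trigger SettingWithCopyWarning
df_diff = df[df['Actual Amount'].notna() & df['Updated Budgeted Amount'].notna()][['Economic Category', 'Actual Amount', 'Updated Budgeted Amount']].copy()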
df_diff.shape[0]

8837

In [115]:
df_diff['Difference'] = df_diff['Actual Amount'] - df_diff['Updated Budgeted Amount']
df_diff.groupby('Economic Category')['Difference'].mean().round(0)

Economic Category
Capital Revenues                   -2.129977e+10
Current Revenues                   -4.509515e+08
Intra-Budgetary Current Revenues   -1.193434e+08
No Information                     -4.687883e+09
Name: Difference, dtype: float64

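Every category with data shows a negative mean difference, meaning that actual amounts fall below the updated budgeted amounts on average. Capital Revenues has the largest average shortfall (about 21.3 billion reais per entry). Intra-Budgetary Capital Revenues does not appear in the result, so none of its entries has both figures.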
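A mean difference is sensitive to a few very large entries and says nothing about how many rows back it. As a hedged sketch, the per-category entry count, median difference, and a realization ratio (total actual over total budgeted) could be reported alongside it. The DataFrame and the `report` column names below are hypothetical.

```python
import pandas as pd

# Hypothetical rows; only the three source column names come from the dataset
demo = pd.DataFrame({
    "Economic Category": [
        "Current Revenues", "Current Revenues", "Capital Revenues", "Capital Revenues",
    ],
    "Actual Amount": [90.0, 110.0, 400.0, 100.0],
    "Updated Budgeted Amount": [100.0, 100.0, 500.0, 500.0],
})

# Keep only rows where both figures are present, then compute the gap
both = demo.dropna(subset=["Actual Amount", "Updated Budgeted Amount"]).copy()
both["Difference"] = both["Actual Amount"] - both["Updated Budgeted Amount"]

grouped = both.groupby("Economic Category")
report = pd.DataFrame({
    "entries": grouped["Difference"].size(),
    "mean_diff": grouped["Difference"].mean(),
    "median_diff": grouped["Difference"].median(),
    # Total actual as a percentage of total budgeted within each category
    "realization_pct": grouped["Actual Amount"].sum()
    / grouped["Updated Budgeted Amount"].sum() * 100,
})
print(report)
```

The ratio puts categories of very different sizes on a comparable scale, which the raw mean difference does not.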

---

In [117]:
# Now, let's evaluate the temporal trends by calculating the sum of actual income per year.
df.groupby('Fiscal Year')['Actual Amount'].sum()

Fiscal Year
2013    1.663430e+12
2014    2.211514e+12
2015    2.634219e+12
2016    2.787181e+12
2017    2.476074e+12
2018    2.865977e+12
2019    2.880424e+12
2020    3.460501e+12
2021    3.679059e+12
Name: Actual Amount, dtype: float64
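To make the yearly trend easier to read, the year-over-year percentage change can be derived with `Series.pct_change()`. The sketch below re-enters the rounded yearly totals printed above as a literal Series, so the percentages are approximate.

```python
import pandas as pd

# Yearly totals copied from the printed output above (rounded values)
yearly = pd.Series(
    [1.663430e12, 2.211514e12, 2.634219e12, 2.787181e12, 2.476074e12,
     2.865977e12, 2.880424e12, 3.460501e12, 3.679059e12],
    index=pd.Index(range(2013, 2022), name="Fiscal Year"),
    name="Actual Amount",
)

# Percentage change relative to the previous fiscal year (NaN for the first year)
yoy_pct = (yearly.pct_change() * 100).round(2)
print(yoy_pct)

# Fiscal years in which the total fell compared with the previous year
print(yoy_pct[yoy_pct < 0].index.tolist())
```

With these figures, 2017 is the only year in which the total declined from the year before.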

In [50]:
df['Superior Agency'].unique().shape

(26,)

In [49]:
df['Agency'].unique().shape

(288,)

In [48]:
df['Managing Unit Code'].unique().shape

(357,)
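The three distinct-value counts can also be obtained in a single call with `DataFrame.nunique()`. Note that `unique()` counts a missing value as one of the distinct values, while `nunique()` skips it unless `dropna=False` is passed. A minimal sketch with a hypothetical DataFrame:

```python
import pandas as pd

# Hypothetical rows; only the column names come from the dataset
demo = pd.DataFrame({
    "Superior Agency": ["A", "A", "B", None],
    "Agency": ["a1", "a2", "b1", "b1"],
    "Managing Unit Code": ["u1", "u2", "u3", "u4"],
})

cols = ["Superior Agency", "Agency", "Managing Unit Code"]

# dropna=False counts a missing value as its own category, matching unique().shape
counts = demo[cols].nunique(dropna=False)
print(counts)
```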

Phase 3: Exploratory Data Analysis (EDA)  


Temporal Analysis:

- Evaluate trends over time, such as how actual revenues change from month to month or year to year.

Identification of Discrepancies:

- Investigate the categories with the largest differences between projected and actual revenues, identifying patterns of underperformance or overperformance.
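For the month-to-month part of the temporal analysis, one possible approach is to convert 'Posting Date' to datetime and group by calendar month. This is a sketch on hypothetical rows, not the notebook's own method. It assumes the dates follow the `YYYY-MM-DD` format seen in the sample row.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the dataset
demo = pd.DataFrame({
    "Posting Date": ["2020-01-13", "2020-01-20", "2020-02-03", "2020-03-15"],
    "Actual Amount": [100.0, 50.0, 200.0, None],
})

# Convert the text dates, then bucket each row into its calendar month
demo["Posting Date"] = pd.to_datetime(demo["Posting Date"])
month = demo["Posting Date"].dt.to_period("M")

# Monthly totals (missing amounts are skipped, so an all-missing month sums to 0)
monthly = demo.groupby(month)["Actual Amount"].sum()

# Change versus the previous month
mom_change = monthly.diff()
print(monthly)
print(mom_change)
```

Because a month with only missing amounts sums to 0, it may be worth checking the per-month entry count before reading a drop as a real decline.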