## Importing libraries

---

In [132]:
# Importing libraries

# Data treatment
# -----------------------------------------------------------------------
import pandas as pd
import numpy as np

# Path
import sys
sys.path.append('../')

# Config
# -----------------------------------------------------------------------
pd.set_option('display.max_columns', None) # show all columns when displaying DataFrames

In [195]:
from src.support import find_outliers_iqr

## Load data

---

In [133]:
path = "../data/output/data_clean.csv"

df = pd.read_csv(path)

#### Currency observation

Although the data does not explicitly state the currency, it comes from the Brazilian government, so it is reasonable to assume that the amounts are in Brazilian reais (BRL), the official currency of Brazil and the one the government normally uses when publishing financial data.

## Revenue Distribution by Economic Category

---

Let's start by analyzing the most significant revenue categories and their contribution to total revenues. First, a quick look at the distinct values of Economic Category:

In [135]:
df['Economic Category'].unique()

array(['Current Revenues', 'Capital Revenues', 'No Information',
       'Intra-Budgetary Current Revenues',
       'Intra-Budgetary Capital Revenues'], dtype=object)

Number of operations with a registered Actual Amount, grouped by Economic Category:

In [136]:
df.groupby('Economic Category')['Actual Amount'].count().round(0)

Economic Category
Capital Revenues                     27065
Current Revenues                    907002
Intra-Budgetary Capital Revenues        86
Intra-Budgetary Current Revenues     14567
No Information                       17893
Name: Actual Amount, dtype: int64

With this information, we can show some initial insights about the data:

1. **Current Revenues** is the most represented category with 907,002 entries. Current revenues probably include taxes, fees, and other regular government income.

2. **Intra-Budgetary Current Revenues** has far fewer entries, with a total of 14,567. This difference indicates that intra-budgetary revenues, those generated within the government's own budget, represent a much smaller portion of total entries.

3. **Capital Revenues** has 27,065 entries, indicating that capital revenues (which might include revenue from asset sales, investments, or financing) are also important but not as common as current revenues. However, these transactions tend to be much larger on average than Current Revenues entries, as we will see in the next step.

4. **Intra-Budgetary Capital Revenues**, with only 86 entries, is rare. This could imply that there are few events or sources of capital income occurring within the government's own budget.

5. **No Information** has 17,893 entries. Whether this is a significant loss of information depends on how large the amounts in these entries are.

Note that these counts only cover entries with an 'Actual Amount' available, which represent 94.18% of the data; the remaining 5.82% have no 'Actual Amount' registered:

In [137]:
round(df.groupby('Economic Category')['Actual Amount'].count().sum() / df.shape[0] * 100, 2)

94.18
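The complementary figure, the share of rows with no `Actual Amount`, can be read directly from the null mask. A minimal sketch on a toy frame (the values are made up for illustration; only the column name comes from the dataset):

```python
import numpy as np
import pandas as pd

# Toy data: 5 rows, 1 of them missing 'Actual Amount' (illustrative values only)
toy = pd.DataFrame({"Actual Amount": [10.0, np.nan, 5.0, 7.5, 2.0]})

# isna()/notna() return boolean masks; the mean of a mask is the fraction of True values
missing_pct = round(toy["Actual Amount"].isna().mean() * 100, 2)
available_pct = round(toy["Actual Amount"].notna().mean() * 100, 2)

print(f"missing: {missing_pct}%, available: {available_pct}%")
```

Applied to `df`, the `notna()` version should reproduce the percentage computed above without going through `groupby`, provided no row has a missing Economic Category.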

Let's now move to the total value of operations with a registered Actual Amount, grouped by Economic Category:

In [138]:
df.groupby('Economic Category')['Actual Amount'].sum().round(0)

Economic Category
Capital Revenues                    1.200415e+13
Current Revenues                    1.202508e+13
Intra-Budgetary Capital Revenues    2.141127e+10
Intra-Budgetary Current Revenues    2.805787e+11
No Information                      3.271665e+11
Name: Actual Amount, dtype: float64

We can extract again some valuable insights:

1. **Current Revenues**:
   - The total actual revenues amount to **1.202 x 10¹³** (approximately 12.02 trillion Brazilian reais). This confirms that current revenues are a huge source of income for the government, as already suggested by the number of entries in this category.

2. **Intra-Budgetary Current Revenues**:
   - The total actual revenues amount to **2.8058 x 10¹¹** (280.58 billion reais). Although much smaller than Current Revenues, it is still a significant amount.

3. **Capital Revenues**:
   - The total actual revenues amount to **1.20 x 10¹³** (approximately 12 trillion reais). Capital revenues are almost equivalent in magnitude to current revenues, indicating that this category also plays a crucial role in government revenue.

4. **Intra-Budgetary Capital Revenues**:
   - The total actual amount is **2.141 x 10¹⁰** (21.41 billion reais). As with Intra-Budgetary Current Revenues, these revenues are a small fraction of the total capital revenues.

5. **No Information**:
   - The total amount is **3.27166 x 10¹¹** (327.17 billion reais). This is 1.33% of the overall total, recorded with an unknown category. Although the share seems low, it is a huge sum that should have been labeled.
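Dividing each total by its entry count gives the average amount per entry, which makes the contrast between Capital and Current Revenues explicit. A sketch using the rounded figures printed in the two outputs above, so the results are approximate:

```python
import pandas as pd

# Entry counts copied from the count() output above
counts = pd.Series({
    "Capital Revenues": 27065,
    "Current Revenues": 907002,
    "Intra-Budgetary Capital Revenues": 86,
    "Intra-Budgetary Current Revenues": 14567,
    "No Information": 17893,
})

# Totals copied from the sum() output above (rounded to 7 significant digits)
sums = pd.Series({
    "Capital Revenues": 1.200415e13,
    "Current Revenues": 1.202508e13,
    "Intra-Budgetary Capital Revenues": 2.141127e10,
    "Intra-Budgetary Current Revenues": 2.805787e11,
    "No Information": 3.271665e11,
})

# Series align on their index labels, so this is a per-category division
avg_per_entry = (sums / counts).sort_values(ascending=False)
print(avg_per_entry)
```

On the full DataFrame, `df.groupby('Economic Category')['Actual Amount'].mean()` should give the same quantity directly, since both `count()` and `sum()` ignore missing values.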

In [139]:
# Share (in %) of the total Actual Amount that falls under 'No Information'
round(df[df['Economic Category'] == 'No Information']['Actual Amount'].sum() / df['Actual Amount'].sum() * 100, 2)

1.33
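The same ratio can be computed for every category at once. A sketch using the rounded totals printed by the `sum()` cell earlier in this section, so the shares are approximate:

```python
import pandas as pd

# Totals per Economic Category, copied from the sum() output (rounded values)
sums = pd.Series({
    "Capital Revenues": 1.200415e13,
    "Current Revenues": 1.202508e13,
    "Intra-Budgetary Capital Revenues": 2.141127e10,
    "Intra-Budgetary Current Revenues": 2.805787e11,
    "No Information": 3.271665e11,
})

# Each category's share of the overall total, in percent
share = (sums / sums.sum() * 100).round(2)
print(share.sort_values(ascending=False))
```

On the real data the equivalent would be to divide `df.groupby('Economic Category')['Actual Amount'].sum()` by `df['Actual Amount'].sum()`.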

---

Now we can compute the average difference between projected (updated budgeted) and actual revenues for each category. This only makes sense for entries that have both amounts available, which is only 8837 entries:

In [140]:
df_diff = df[df['Actual Amount'].notna() & df['Updated Budgeted Amount'].notna()][['Economic Category', 'Actual Amount', 'Updated Budgeted Amount']]
df_diff

| | Economic Category | Actual Amount | Updated Budgeted Amount |
|---|---|---|---|
| 20 | Capital Revenues | 985939.65 | 640279.0 |
| 21 | Capital Revenues | 194307.50 | 358413.0 |
| 22 | No Information | 68323.99 | 111097.0 |
| 23 | Current Revenues | 107548.62 | 350042.0 |
| 25 | Current Revenues | 5145837.84 | 4639104.0 |
| ... | ... | ... | ... |
| 1024650 | Current Revenues | 1098.41 | 907063.0 |
| 1025386 | Current Revenues | 480.00 | 517775.0 |
| 1025615 | Current Revenues | 1483.23 | 1323033.0 |
| 1025765 | Current Revenues | 135.90 | 407371.0 |


In [169]:
df_diff['Difference'] = df_diff['Actual Amount'] - df_diff['Updated Budgeted Amount']
df_diff['Porcentual_difference'] = df_diff['Difference'] / df_diff['Updated Budgeted Amount'] * 100

df_diff.groupby('Economic Category')['Difference'].mean().round(0)

Economic Category
Capital Revenues                   -2.129977e+10
Current Revenues                   -4.509515e+08
Intra-Budgetary Current Revenues   -1.193434e+08
No Information                     -4.687883e+09
Name: Difference, dtype: float64

Because the amounts vary so much in scale, it makes more sense to look at the percentage difference:

In [194]:
df_diff.groupby('Economic Category')['Porcentual_difference'].agg(['mean', 'median'])

| Economic Category | mean | median |
|---|---|---|
| Capital Revenues | 2785.083178 | -46.251895 |
| Current Revenues | 9900.145758 | -48.355853 |
| Intra-Budgetary Current Revenues | 199.616772 | -38.589202 |
| No Information | 1613.218577 | -31.114211 |


The mean and median differ enormously. This is usually caused by outliers, so let's check:

In [196]:
# Getting outliers

find_outliers_iqr(df_diff, 'Porcentual_difference').sort_values(by = 'Porcentual_difference', ascending=False)

| | Economic Category | Actual Amount | Updated Budgeted Amount | Difference | Porcentual_difference |
|---|---|---|---|---|---|
| 5190 | Current Revenues | 1.447275e+06 | 2.0 | 1.447273e+06 | 7.236367e+07 |
| 6193 | Current Revenues | 2.453838e+06 | 131.0 | 2.453707e+06 | 1.873059e+06 |
| 4289 | Current Revenues | 8.027243e+05 | 50.0 | 8.026743e+05 | 1.605349e+06 |
| 10427 | Capital Revenues | 2.319754e+10 | 3111822.0 | 2.319442e+10 | 7.453647e+05 |
| 5762 | Current Revenues | 3.183964e+06 | 842.0 | 3.183122e+06 | 3.780430e+05 |
| ... | ... | ... | ... | ... | ... |
| 5599 | Intra-Budgetary Current Revenues | -6.322226e+04 | 29019.0 | -9.224126e+04 | -3.178651e+02 |
| 12179 | Current Revenues | -3.253205e+05 | 132367.0 | -4.576875e+05 | -3.457716e+02 |
| 679663 | Current Revenues | -3.803480e+03 | 1349.0 | -5.152480e+03 | -3.819481e+02 |
| 10077 | Intra-Budgetary Current Revenues | -2.124860e+05 | 52900.0 | -2.653860e+05 | -5.016750e+02 |
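`find_outliers_iqr` is imported from `src.support`, whose source is not shown here. As a rough idea of what such a helper typically does, here is a minimal sketch assuming the usual 1.5 x IQR rule. The function name `find_outliers_iqr_sketch`, the `k` parameter and the toy values are hypothetical, and the real helper may differ.

```python
import pandas as pd

def find_outliers_iqr_sketch(df, column, k=1.5):
    """Return rows whose `column` value lies outside [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical stand-in for src.support.find_outliers_iqr (assumption).
    """
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[(df[column] < lower) | (df[column] > upper)]

# Toy data: five ordinary percentage differences and one extreme value
toy = pd.DataFrame(
    {"Porcentual_difference": [-50.0, -45.0, -40.0, -35.0, -30.0, 1_000_000.0]}
)

outliers = find_outliers_iqr_sketch(toy, "Porcentual_difference")
print(outliers)

# A single extreme value drags the mean far away from the median
print(toy["Porcentual_difference"].mean(), toy["Porcentual_difference"].median())
```

The toy frame also shows why the mean and median diverge: one extreme row is enough to pull the mean into positive territory while the median stays with the bulk of the data.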


In [193]:
# 5190 is an index label taken from df_diff, so look it up by label
df.loc[5190]

Superior Agency                                       Ministério da Economia
Agency                     Ministério da Economia - Unidades com vínculo ...
Managing Unit Code                   SETORIAL ORCAMENTARIA E FINANCEIRA / ME
Economic Category                                           Current Revenues
Revenue Source                                                 Contribuições
Revenue Type                                           Contribuições sociais
Detailing                                       OUTRAS CONTRIBUICOES SOCIAIS
Updated Budgeted Amount                                                  2.0
Posted Amount                                                            NaN
Actual Amount                                                     1447275.45
Realization Percentage                                            72363772.0
Posting Date                                                             NaT
Fiscal Year                                                             2014
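This row explains the extreme value: with a budgeted amount of only 2.0, any realistic actual amount yields an enormous percentage. The arithmetic, using the two amounts shown in the row above:

```python
# Amounts copied from row 5190 above
actual = 1447275.45
budgeted = 2.0

difference = actual - budgeted                # same formula as the 'Difference' column
pct_difference = difference / budgeted * 100  # same formula as 'Porcentual_difference'
realization_pct = actual / budgeted * 100     # actual as a percentage of budget

print(f"{pct_difference:.1f}")   # percentage difference, about 7.24e7
print(f"{realization_pct:.1f}")  # exactly 100 points above the percentage difference
```

The second figure is consistent with the row's Realization Percentage, which suggests (as an assumption, since the dataset's definition is not given here) that this column is actual divided by budgeted, times 100.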

---

### Temporal analysis

In [147]:
df['Posting Date'] = pd.to_datetime(df['Posting Date'], dayfirst=True)


In [143]:
# Now, let's evaluate the temporal trends by calculating the sum of actual income per year.
df.groupby('Fiscal Year')['Actual Amount'].sum()

Fiscal Year
2013    1.663430e+12
2014    2.211514e+12
2015    2.634219e+12
2016    2.787181e+12
2017    2.476074e+12
2018    2.865977e+12
2019    2.880424e+12
2020    3.460501e+12
2021    3.679059e+12
Name: Actual Amount, dtype: float64

Revenue grows over time, except for a dip in 2017, which may be related to Brazil's 2014-2017 economic crisis.
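To quantify the trend, the year-over-year change can be computed with `pct_change`. A sketch using the rounded yearly totals printed above:

```python
import pandas as pd

# Yearly totals of 'Actual Amount', copied from the output above (rounded values)
totals = pd.Series({
    2013: 1.663430e12,
    2014: 2.211514e12,
    2015: 2.634219e12,
    2016: 2.787181e12,
    2017: 2.476074e12,
    2018: 2.865977e12,
    2019: 2.880424e12,
    2020: 3.460501e12,
    2021: 3.679059e12,
})

# Percentage change relative to the previous year (the first year has no predecessor)
yoy = (totals.pct_change() * 100).round(2)
print(yoy)

# Years in which the total fell compared to the previous year
print(list(yoy[yoy < 0].index))
```

On the DataFrame itself, the same call can be chained after `df.groupby('Fiscal Year')['Actual Amount'].sum()`.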

In [163]:
# Now, let's evaluate the temporal trends by calculating the sum of actual income per year and month.
for year in range(2013, 2022):
    df_year = df[df['Fiscal Year'] == year]
    print(year, df_year.groupby(df_year['Posting Date'].dt.month)['Actual Amount'].sum())

2013 Posting Date
12.0    1.617267e+12
Name: Actual Amount, dtype: float64
2014 Posting Date
12.0    2.192906e+12
Name: Actual Amount, dtype: float64
2015 Posting Date
12.0    2.630488e+12
Name: Actual Amount, dtype: float64
2016 Posting Date
1.0     2.445406e+11
2.0     1.443923e+11
3.0     3.234955e+11
4.0     2.215033e+11
5.0     1.662483e+11
6.0     1.752807e+11
7.0     2.347508e+11
8.0     1.788412e+11
9.0     1.913556e+11
10.0    2.877622e+11
11.0    1.512721e+11
12.0    4.413237e+11
Name: Actual Amount, dtype: float64
2017 Posting Date
1.0     3.554494e+11
2.0     1.604895e+11
3.0     1.949693e+11
4.0     1.909670e+11
5.0     1.517739e+11
6.0     1.733170e+11
7.0     1.801296e+11
8.0     2.202914e+11
9.0     2.212834e+11
10.0    2.109477e+11
11.0    1.686580e+11
12.0    1.703508e+11
Name: Actual Amount, dtype: float64
2018 Posting Date
1.0     3.058722e+11
2.0     1.599130e+11
3.0     2.478358e+11
4.0     2.647583e+11
5.0     1.734651e+11
6.0     1.754082e+11
7.0     2.402284e+1

1. **2013-2015**: The data only includes the total amount for December, and these amounts are significantly higher compared to individual months in subsequent years. This suggests that income may have been reported in a lump sum at the end of the year during these early years.

2. **2016 onwards**: The income is reported monthly, providing a clearer picture of income flows. Notable trends include:
   - **2016**: A peak in December with an actual amount of approximately 4.41 x 10¹¹.

   - **2017-2019**: Monthly income appears fairly consistent.

   - **2020**: Significant spikes in August (6.22 x 10¹¹) and December, possibly reflecting special circumstances such as COVID-19-related budget adjustments.

   - **2021**: Spikes again in April and September, but otherwise fairly homogeneous across the year.
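The year-by-year loop prints one Series per year, which makes months hard to compare across years. One alternative is a single month x year table built with `pivot_table`. A minimal sketch on toy data (the amounts, the dates and the helper column `Month` are made up for illustration; the other column names follow the dataset):

```python
import pandas as pd

# Toy data with day-first dates, mimicking the dataset's layout (illustrative values only)
toy = pd.DataFrame({
    "Fiscal Year": [2016, 2016, 2016, 2017, 2017],
    "Posting Date": pd.to_datetime(
        ["15/01/2016", "20/01/2016", "10/02/2016", "05/01/2017", "12/03/2017"],
        format="%d/%m/%Y",
    ),
    "Actual Amount": [100.0, 50.0, 30.0, 80.0, 20.0],
})

# Hypothetical helper column holding the month number of each posting
toy["Month"] = toy["Posting Date"].dt.month

# Rows = month, columns = fiscal year, cells = summed Actual Amount
monthly = toy.pivot_table(
    index="Month", columns="Fiscal Year", values="Actual Amount", aggfunc="sum"
)
print(monthly)
```

Month and year combinations with no postings show up as `NaN`, which would make the December-only reporting of 2013-2015 visible at a glance.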