## Setup & Dependencies
This cell imports the libraries used throughout the notebook: pandas and NumPy for data manipulation, scikit-learn components (OneHotEncoder, RobustScaler, FunctionTransformer, ColumnTransformer, Pipeline, and SimpleImputer) for building the preprocessing pipeline, and the custom functions from our local functions package (functions.analise_exploratoria).

We also append the parent directory to sys.path so the notebook can find the custom modules within the project structure.

In [22]:
# Imports

# Standard library
import os
import sys

# Third-party
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, RobustScaler

# Make the parent directory importable before loading the local package
sys.path.append(os.path.abspath('..'))
from functions.analise_exploratoria import (load_and_combine_csvs,
                                            clean_dataframe,
                                            add_confidential_flags,
                                            _estimate_state_row,
                                            apply_state_estimation)

## Data Ingestion
This step uses the custom function load_and_combine_csvs to read and consolidate the individual monthly CPGF CSV files located in the raw_data directory. The resulting df_raw DataFrame is the initial, unprocessed dataset for the entire pipeline. The function also adds the ARQUIVO_ORIGEM column, which records the source file of each row for traceability.

In [2]:
csv_path = '../raw_data'
df_raw = load_and_combine_csvs(csv_path)
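
The implementation of load_and_combine_csvs lives in the local functions package and is not shown in this notebook. As a rough illustration of the behavior described above, a minimal sketch might look like the following. The function name `combine_csvs_sketch`, the `;` separator, and the `latin-1` encoding are assumptions made for this example, not details confirmed by the project.

```python
from pathlib import Path

import pandas as pd


def combine_csvs_sketch(folder, sep=";", encoding="latin-1"):
    """Hypothetical stand-in for load_and_combine_csvs (not the project's code)."""
    frames = []
    # Sort the paths so the row order is reproducible across runs.
    for path in sorted(Path(folder).glob("*.csv")):
        frame = pd.read_csv(path, sep=sep, encoding=encoding)
        frame["ARQUIVO_ORIGEM"] = path.name  # traceability: source file of each row
        frames.append(frame)
    if not frames:
        raise FileNotFoundError(f"No CSV files found in {folder!r}")
    return pd.concat(frames, ignore_index=True)
```

Raising on an empty folder is a design choice of this sketch: a wrong path then fails immediately instead of producing an empty DataFrame that breaks a later cell.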

## Data Cleaning and Initial Feature Engineering
This cell sequentially applies the foundational cleaning and transformation functions:

clean_dataframe: Converts the 'VALOR TRANSAÇÃO' column to numeric (handling the Brazilian decimal format) and converts 'DATA TRANSAÇÃO' to datetime format. It also removes duplicate rows.

add_confidential_flags: Creates the binary (0/1) feature 'SIGILOSO', which marks confidential transactions.

apply_state_estimation: Executes the custom logic to infer the ESTADO_ESTIMADO (Estimated State) of the transaction based on text patterns in the agency names, which is critical for geospatial risk analysis.

In [16]:
df_limpo = clean_dataframe(df_raw)
df_conf = add_confidential_flags(df_limpo)
df_final = apply_state_estimation(df_conf)
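
The cleaning functions themselves are defined in the local package, so the conversion logic is not visible here. The sketch below approximates what clean_dataframe is described as doing. The name `clean_sketch`, the raw value format (`1.352,00`), and the day-first date format (`21/02/2024`) are assumptions for illustration only.

```python
import pandas as pd


def clean_sketch(df):
    """Hypothetical approximation of clean_dataframe (not the project's code)."""
    out = df.copy()
    # Brazilian format: '.' groups thousands and ',' marks decimals, e.g. '1.352,00'.
    valor = (
        out["VALOR TRANSAÇÃO"]
        .astype(str)
        .str.replace(".", "", regex=False)
        .str.replace(",", ".", regex=False)
    )
    out["VALOR TRANSAÇÃO"] = pd.to_numeric(valor, errors="coerce")
    # Assumes day-first dates; values that cannot be parsed become NaT.
    out["DATA TRANSAÇÃO"] = pd.to_datetime(
        out["DATA TRANSAÇÃO"], format="%d/%m/%Y", errors="coerce"
    )
    # drop_duplicates keeps the original index labels, so the index may have gaps.
    return out.drop_duplicates()


# Illustrative raw values; the third row duplicates the first.
raw = pd.DataFrame({
    "VALOR TRANSAÇÃO": ["8,42", "1.352,00", "8,42"],
    "DATA TRANSAÇÃO": ["21/02/2024", "07/02/2024", "21/02/2024"],
})
cleaned = clean_sketch(raw)
```

Using errors="coerce" turns malformed values into NaN or NaT rather than raising, which makes them visible in a later df.info() check.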

## Validation Checks
These commands are executed after the cleaning and initial feature engineering to perform essential data validation.

df_final.info(): Displays the non-null counts and data types for the final DataFrame, confirming that the VALOR TRANSAÇÃO column is now a numeric type (float64) and checking the fill rate of the newly engineered columns like ESTADO_ESTIMADO.

df_final.head(): Shows the first few rows of the processed data, allowing a quick visual inspection of the results from the clean_dataframe and apply_state_estimation functions.

In [21]:
# Example checks
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Index: 307307 entries, 0 to 392088
Data columns (total 18 columns):
 #   Column                  Non-Null Count   Dtype         
---  ------                  --------------   -----         
 0   CÓDIGO ÓRGÃO SUPERIOR   307307 non-null  int64         
 1   NOME ÓRGÃO SUPERIOR     307307 non-null  object        
 2   CÓDIGO ÓRGÃO            307307 non-null  int64         
 3   NOME ÓRGÃO              307307 non-null  object        
 4   CÓDIGO UNIDADE GESTORA  307307 non-null  int64         
 5   NOME UNIDADE GESTORA    307307 non-null  object        
 6   ANO EXTRATO             307307 non-null  int64         
 7   MÊS EXTRATO             307307 non-null  int64         
 8   CPF PORTADOR            289417 non-null  object        
 9   NOME PORTADOR           307307 non-null  object        
 10  CNPJ OU CPF FAVORECIDO  307307 non-null  int64         
 11  NOME FAVORECIDO         307307 non-null  object        
 12  TRANSAÇÃO               307307 non-

In [20]:
# Example checks
df_final.head()

Unnamed: 0,CÓDIGO ÓRGÃO SUPERIOR,NOME ÓRGÃO SUPERIOR,CÓDIGO ÓRGÃO,NOME ÓRGÃO,CÓDIGO UNIDADE GESTORA,NOME UNIDADE GESTORA,ANO EXTRATO,MÊS EXTRATO,CPF PORTADOR,NOME PORTADOR,CNPJ OU CPF FAVORECIDO,NOME FAVORECIDO,TRANSAÇÃO,DATA TRANSAÇÃO,VALOR TRANSAÇÃO,ARQUIVO_ORIGEM,SIGILOSO,ESTADO_ESTIMADO
0,63000,Advocacia-Geral da União,63000,Advocacia-Geral da União - Unidades com víncul...,110161,SUPERINTENDENCIA REG. DE ADMIN. DA 1ª REGIAO,2024,3,***.725.752-**,VIVIANE CORREA LIMA,2179328000142,FREITAS & CIA LTDA,COMPRA A/V - R$ - APRES,2024-02-21,8.42,202403_CPGF.csv,0,UNIÃO
1,63000,Advocacia-Geral da União,63000,Advocacia-Geral da União - Unidades com víncul...,110161,SUPERINTENDENCIA REG. DE ADMIN. DA 1ª REGIAO,2024,3,***.866.951-**,JONAS SCHOTTZ DA SILVA,37828985000158,CASA DO SINDICO LTDA,COMPRA A/V - R$ - APRES,2024-02-20,72.0,202403_CPGF.csv,0,UNIÃO
2,63000,Advocacia-Geral da União,63000,Advocacia-Geral da União - Unidades com víncul...,110161,SUPERINTENDENCIA REG. DE ADMIN. DA 1ª REGIAO,2024,3,***.866.951-**,JONAS SCHOTTZ DA SILVA,33216995000181,RADEL ELETRONICA LTDA,COMPRA A/V - R$ - APRES,2024-02-21,26.0,202403_CPGF.csv,0,UNIÃO
3,63000,Advocacia-Geral da União,63000,Advocacia-Geral da União - Unidades com víncul...,110161,SUPERINTENDENCIA REG. DE ADMIN. DA 1ª REGIAO,2024,3,***.866.951-**,JONAS SCHOTTZ DA SILVA,10421458000178,T-9 ELETRONICA E INFORMATICA LTDA,COMPRA A/V - R$ - APRES,2024-02-07,1352.0,202403_CPGF.csv,0,UNIÃO
4,63000,Advocacia-Geral da União,63000,Advocacia-Geral da União - Unidades com víncul...,110161,SUPERINTENDENCIA REG. DE ADMIN. DA 1ª REGIAO,2024,3,***.725.752-**,VIVIANE CORREA LIMA,16577772000120,SO FILTROS RONDONIA LTDA,COMPRA A/V - R$ - APRES,2024-02-21,570.0,202403_CPGF.csv,0,UNIÃO
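
The visual checks above can be complemented with programmatic ones, so that a regression in the cleaning step is caught without reading the output by eye. The helper below is a sketch; the name `validation_report` and the toy DataFrame are hypothetical and only stand in for df_final.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_float_dtype


def validation_report(df):
    """Return dtype checks and the non-null share of ESTADO_ESTIMADO."""
    return {
        "valor_is_float": is_float_dtype(df["VALOR TRANSAÇÃO"]),
        "data_is_datetime": is_datetime64_any_dtype(df["DATA TRANSAÇÃO"]),
        "estado_fill_rate": float(df["ESTADO_ESTIMADO"].notna().mean()),
    }


# Toy frame standing in for df_final; the values are illustrative only.
toy = pd.DataFrame({
    "VALOR TRANSAÇÃO": [8.42, 72.0],
    "DATA TRANSAÇÃO": pd.to_datetime(["2024-02-21", "2024-02-20"]),
    "ESTADO_ESTIMADO": ["UNIÃO", None],
})
report = validation_report(toy)
```

In the notebook, the same function could be called as `validation_report(df_final)`.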


## Feature Categorization

This cell organizes the DataFrame columns into three logical lists for use in the subsequent Scikit-learn preprocessing pipeline (ColumnTransformer). These lists define the specialized transformations applied to each feature group:

Categorical Columns (cat_columns): These are identifier and nominal features (e.g., NOME ÓRGÃO, CNPJ OU CPF FAVORECIDO). CNPJ OU CPF FAVORECIDO is included here because, despite being stored as an integer, it represents a unique supplier identifier and must be encoded, not scaled. These features will be imputed (to handle missing categories like CPF PORTADOR) and One-Hot Encoded.

Numerical Value (num_value): This contains a single column, VALOR TRANSAÇÃO, the continuous and heavily skewed transaction amount.

Numerical Columns (num_columns): This contains discrete codes and time variables (e.g., CÓDIGO ÓRGÃO, MÊS EXTRATO).

Transformation Rationale (Why Separate VALOR TRANSAÇÃO?)
We separate VALOR TRANSAÇÃO because its extreme positive skew (many small transactions, few large ones) calls for a two-step treatment:
1. Log Transformation ($\log(1+x)$): Applied first, this compresses the huge range of values and reduces the skew, bringing the distribution closer to symmetric. This matters most for distance-based models such as LOF, and for the Autoencoder, whose reconstruction error can otherwise be dominated by the largest amounts.
2. Robust Scaling (RobustScaler): Applied second, and also used on its own for the remaining numerical features (num_columns). RobustScaler centers on the median and divides by the Interquartile Range (IQR), so a few extreme values do not determine the scale.

This is essential for Project Jacurutu: the anomalies we are trying to detect are the outliers themselves, so they must not distort the scaling of the ordinary transactions.

Note: The datetime column (DATA TRANSAÇÃO) is not assigned to any of the three lists. Coarse temporal information is carried by the discrete columns ANO EXTRATO and MÊS EXTRATO, which record the statement period rather than the exact transaction date.
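
To make the rationale above concrete, the following sketch compares the log-plus-robust treatment with plain standardization on a handful of made-up amounts. The values are illustrative and are not taken from the CPGF data.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Illustrative values only: several small amounts and one very large one.
values = np.array([[5.0], [8.0], [12.0], [20.0], [35.0], [50000.0]])

logged = np.log1p(values)  # log(1 + x) compresses the large value
robust = RobustScaler().fit_transform(logged)
standard = StandardScaler().fit_transform(values)

# RobustScaler centers on the median, so typical amounts stay near zero while
# the large one remains clearly separated.
print(np.round(robust.ravel(), 2))
# StandardScaler on the raw values lets the single large amount set the scale,
# squeezing all typical amounts into almost the same value.
print(np.round(standard.ravel(), 2))
```

With standardization alone, the differences among the ordinary amounts almost vanish, which is the effect the two-step treatment is meant to avoid.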

In [None]:
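# Identifying columns
# Categorical columns: nominal features and identifiers (encoded, not scaled).
# NOTE: 'ID_PORTADOR' is not among the 18 columns listed by df_final.info() above;
# it must be created before the preprocessor is fitted.
cat_columns = [
    'NOME ÓRGÃO SUPERIOR', 'NOME ÓRGÃO', 'NOME UNIDADE GESTORA', 'CPF PORTADOR',
    'NOME PORTADOR', 'NOME FAVORECIDO', 'TRANSAÇÃO', 'ARQUIVO_ORIGEM', 'ESTADO_ESTIMADO',
    'ID_PORTADOR', 'CNPJ OU CPF FAVORECIDO'
]
# Numerical columns
# The continuous, heavily skewed financial value
num_value = ['VALOR TRANSAÇÃO']

# Discrete codes and less-skewed numerical features.
# NOTE: 'FIM_DE_SEMANA' is likewise absent from the df_final output shown above.
num_columns = [
    'CÓDIGO ÓRGÃO SUPERIOR', 'CÓDIGO ÓRGÃO', 'CÓDIGO UNIDADE GESTORA', 'ANO EXTRATO',
    'MÊS EXTRATO', 'SIGILOSO', 'FIM_DE_SEMANA'
]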
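
A ColumnTransformer fails at fit time if any listed column is missing from the DataFrame, so it can help to check the lists against the data first. The helper below is a sketch; the name `missing_columns` and the toy DataFrame are hypothetical.

```python
import pandas as pd


def missing_columns(df, *column_lists):
    """Return listed column names that are absent from df, in list order."""
    present = set(df.columns)
    return [col for cols in column_lists for col in cols if col not in present]


# Toy frame with only some of the expected columns (illustrative).
toy = pd.DataFrame(columns=["VALOR TRANSAÇÃO", "SIGILOSO", "ESTADO_ESTIMADO"])
absent = missing_columns(
    toy,
    ["ESTADO_ESTIMADO", "ID_PORTADOR"],
    ["VALOR TRANSAÇÃO"],
    ["SIGILOSO", "FIM_DE_SEMANA"],
)
```

In the notebook, the call would be `missing_columns(df_final, cat_columns, num_value, num_columns)`; an empty list means every listed column is available.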

## Final Preprocessing Pipeline: preprocessor
The preprocessor object, built using Scikit-learn's ColumnTransformer, orchestrates the final necessary transformations for all features before they are input into the anomaly detection models (e.g., Isolation Forest, LOF and Autoencoder). The pipeline is optimized to handle the unique challenges of financial data: extreme skew, high cardinality, and outlier sensitivity.

1. Categorical Pipeline (cat_pipeline)
This pipeline handles all nominal and identifier columns (cat_columns), including CNPJ OU CPF FAVORECIDO.

Imputation (SimpleImputer): Missing values (like in CPF PORTADOR) are imputed using a constant value ('Desconhecido'). This is a common approach for categorical data, as it preserves missingness as its own category instead of merging it into an existing one.

Encoding (OneHotEncoder): Transforms nominal variables (strings) into binary vectors that models can use. The setting handle_unknown='ignore' prevents the pipeline from failing if a new, unseen category appears in the test or prediction set; such a category is encoded as all zeros. Note that sparse_output=False returns a dense array. With high-cardinality columns such as NOME FAVORECIDO or CNPJ OU CPF FAVORECIDO this can require a very large amount of memory, so a sparse output may be preferable.

2. Specialized Value Pipeline (num_value_pipeline)
This pipeline is applied exclusively to the heavily skewed feature VALOR TRANSAÇÃO.

Log Transformation (FunctionTransformer(np.log1p)): This is the crucial step for mitigating skew. Applying the natural logarithm of (1 + x) compresses the extremely wide range of financial values, bringing the distribution closer to normal. This improves the performance and stability of distance-based models. Because np.log1p is only defined for values greater than -1, this step assumes that VALOR TRANSAÇÃO contains no amounts at or below -1 (for example, reversals recorded as negative values).

Robust Scaling (RobustScaler): Scales the log-transformed data. This scaler is preferred over StandardScaler because it uses the median and Interquartile Range (IQR), making it highly resistant to outliers. Since anomalies are the outliers we are looking for, preserving their position relative to the majority of the data is essential.

3. General Numerical Pipeline (num__columns_pipeline)
This pipeline handles the remaining numerical and discrete features (num_columns), such as code numbers, ANO EXTRATO, and flag variables (SIGILOSO).

Robust Scaling (RobustScaler): Since these columns are assumed to have no missing values, they go directly to the scaler. RobustScaler is used here again so that extreme values in these secondary features do not determine their scale. If a binary flag such as SIGILOSO has an IQR of zero, scikit-learn leaves its scale at 1, so the column is only centered on its median.

4. ColumnTransformer Integration
The ColumnTransformer applies each specialized pipeline to its designated column subset (num_columns, num_value, cat_columns).

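remainder='passthrough': This setting keeps any column not named in the feature lists (e.g., specific transaction IDs, which are vital for merging results later) and appends it, untransformed, to the processed output. Because DATA TRANSAÇÃO is not in any list, it is passed through as well unless it is dropped beforehand. Pass-through columns must be separated from the feature matrix before it is handed to a model.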

In [None]:
# Numerical columns pipeline: only scaling (assuming no NaNs)
num__columns_pipeline = RobustScaler()

# Numerical value pipeline: Log-transform and scale (assuming no NaNs)
num_value_pipeline = Pipeline(steps=[
    ('log_transform', FunctionTransformer(np.log1p, validate=True)),
    ('scaler', RobustScaler())
])

# Categorical pipeline: impute and encode
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='Desconhecido')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Combine in a column transformer
preprocessor = ColumnTransformer(
    [
        ('num_columns', num__columns_pipeline, num_columns),
        ('num_value', num_value_pipeline, num_value),
        ('cat', cat_pipeline, cat_columns),
    ],
    # Columns not listed above (including DATA TRANSAÇÃO) are appended untransformed
    remainder='passthrough'
)