## 🕵️ Anomaly Detection Pipeline: Execution and Scoring
This notebook serves as the final stage of the anomaly detection process. It executes the full data and model pipeline, merges the results from the four model runs (LOF and Isolation Forest, both on normal and classified data), normalizes the anomaly scores, and calculates a final Priority Score used for manual review and triage.

## 1. Setup and Core Functions
We import all necessary libraries and define the three core functions that manage the pipeline's execution and scoring.

### Imports

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os
import sys

In [4]:
%reload_ext autoreload
sys.path.append(os.path.abspath('..'))
from functions.clean_df import load_and_combine_csvs,clean_dataframe
from functions.state_imput import apply_state_estimation
from functions.feature_engineering import feature_engineering
from functions.preprocessing import get_preprocessor
from functions.models import run_lof_normal, run_lof_classified, run_if_normal, run_if_classified
from functions.pipeline import run_pipeline, combine_dataframes
from sklearn.preprocessing import MinMaxScaler

## 2. Run Pipeline

The cells below set the path to the raw data and run each pipeline stage in turn: loading and cleaning the CSVs, state estimation, feature engineering, and the four anomaly detection models (LOF and Isolation Forest, each on the normal and classified subsets). Section 2.1 wraps these same steps in a single run_pipeline function.

The resulting result_dict holds the four DataFrames required for the final scoring and merging stage.

In [6]:
raw_path = '../raw_data/'

In [7]:
df_raw = load_and_combine_csvs(raw_path)
df_clean = clean_dataframe(df_raw)
df_state = apply_state_estimation(df_clean)
df_feature_engineering = feature_engineering(df_state)

In [8]:
result_dict = {
    'lof_classified': run_lof_classified(df_feature_engineering),
    'lof_normal': run_lof_normal(df_feature_engineering),
    'if_classified': run_if_classified(df_feature_engineering),
    'if_normal': run_if_normal(df_feature_engineering)
}

In [9]:
df_lof_classified = result_dict['lof_classified']
df_lof_normal = result_dict['lof_normal']
df_if_classified = result_dict['if_classified']
df_if_normal = result_dict['if_normal']
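
Before moving on, it can be useful to confirm that every model run returned the columns the later merge relies on. The sketch below is one possible check, not part of the project code: check_results is a hypothetical helper, and the toy DataFrames stand in for the real model outputs. The column names LOF_LABEL, LOF_SCORE, IF_LABEL and IF_SCORE are the ones used in Section 3.

```python
import pandas as pd

# Score/label columns each model output is expected to carry (names from Section 3).
REQUIRED = {
    "lof_classified": ["LOF_LABEL", "LOF_SCORE"],
    "lof_normal": ["LOF_LABEL", "LOF_SCORE"],
    "if_classified": ["IF_LABEL", "IF_SCORE"],
    "if_normal": ["IF_LABEL", "IF_SCORE"],
}

def check_results(result_dict, required=REQUIRED):
    """Hypothetical helper: return a list of problems found in result_dict."""
    problems = []
    for key, cols in required.items():
        if key not in result_dict:
            problems.append(f"missing result: {key}")
            continue
        absent = [c for c in cols if c not in result_dict[key].columns]
        if absent:
            problems.append(f"{key} lacks columns {absent}")
    return problems

# Toy stand-ins for the real model outputs (illustrative values only).
toy_lof = pd.DataFrame({"LOF_LABEL": [1, -1], "LOF_SCORE": [-1.0, -3.2]})
toy_if = pd.DataFrame({"IF_LABEL": [1, -1], "IF_SCORE": [0.1, -0.2]})
toy_results = {
    "lof_classified": toy_lof,
    "lof_normal": toy_lof,
    "if_classified": toy_if,
    "if_normal": toy_if.drop(columns=["IF_SCORE"]),  # deliberately incomplete
}

problems = check_results(toy_results)
print(problems)
```

On the real data the call would be check_results(result_dict); an empty list means all four outputs carry the expected columns.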

## 2.1 Final run_pipeline function

In [None]:
def run_pipeline(raw_data):
    """
    Runs the entire data processing and modeling pipeline.

    Parameters
    raw_data : Path to directory with CSVs.

    Returns
    Dictionary of length 4:
        - Output of run_lof_classified(df_feature_engineering)
        - Output of run_lof_normal(df_feature_engineering)
        - Output of run_if_classified(df_feature_engineering)
        - Output of run_if_normal(df_feature_engineering)
    """
    df_raw = load_and_combine_csvs(raw_data)
    df_clean = clean_dataframe(df_raw)
    df_state = apply_state_estimation(df_clean)
    df_feature_engineering = feature_engineering(df_state)

    return {
        'lof_classified': run_lof_classified(df_feature_engineering),
        'lof_normal': run_lof_normal(df_feature_engineering),
        'if_classified': run_if_classified(df_feature_engineering),
        'if_normal': run_if_normal(df_feature_engineering)
    }

## 3. Combining dataframes and normalizing their score
This section executes the combine_dataframes function, which is the consolidation and score normalization step.

Consolidation and Score Unification
This stage is crucial because the four model results are based on two different data partitions (classified vs. normal), so each model's output arrives in two disjoint pieces.

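Combination: The combine helper function first concatenates the classified and normal subsets for each method (e.g., df_lof_classified + df_lof_normal), creating a unified, full-size result for LOF and for IF. It then generates a sequential ID that serves as the key for the final join. Because the ID is simply the row position after concatenation and de-duplication, it is a reliable key only if the LOF and IF results contain the same rows in the same order.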

Merging: The function then performs the final merge, combining the full feature set from the LOF results (df_lof_full) with only the specific scores and labels from the IF results (df_if_specific) using the generated ID.

Score Alignment & Normalization: The raw scores are transformed so that a higher value always means higher risk. The IF score is inverted (1 - Score), and the absolute value of the LOF score is taken.

Finally, both scores are normalized to the $[-1, 1]$ range using MinMaxScaler. This puts both models on the same scale before they are averaged, although each score keeps the shape of its original distribution.

The output, df_combined, is the single, integrated DataFrame containing all original features, the two model labels, and the two normalized model scores.

1. Setup and Helper Function Definition - This cell defines the combine helper function, which is essential for merging the disjoint data subsets.

In [12]:
def combine(df1, df2, label_cols):
    """Combines two disjoint model result DFs, creates a sequential ID, and returns full and specific DFs."""
    # Concatenate, drop duplicates, and reset index
    df = pd.concat([df1, df2], ignore_index=True).drop_duplicates().reset_index(drop=True)

    # Create Sequential ID (used as the stable merge key)
    df["ID"] = df.index + 1

    # Rearrange columns
    cols = ["ID"] + [c for c in df.columns if c != "ID"]
    df = df[cols]

    # df_specific is used for merging only the score/label columns
    df_specific = df[["ID"] + label_cols]
    return df, df_specific

print("Setup complete. Extracted 4 model result DataFrames and defined the 'combine' helper.")

Setup complete. Extracted 4 model result DataFrames and defined the 'combine' helper.


2. Process and Merge LOF and IF Results - This cell uses the combine function to prepare the LOF and IF data and then performs the central merge operation.

In [13]:
# Processing LOF: Creates a full DataFrame with all features and LOF scores
df_lof_full, df_lof_specific = combine(
    df_lof_classified, df_lof_normal,
    label_cols=["LOF_LABEL", "LOF_SCORE"]
)

# Processing IF: Creates a specific DataFrame with ID and IF scores/labels
df_if_full, df_if_specific = combine(
    df_if_classified, df_if_normal,
    label_cols=["IF_LABEL", "IF_SCORE"]
)

# Final merge: Merges the IF scores/labels onto the LOF full DataFrame using the common "ID".
df_combined = df_lof_full.merge(df_if_specific, on="ID", how="inner")

print(f"LOF and IF results merged into df_combined. Total rows: {len(df_combined)}.")

LOF and IF results merged into df_combined. Total rows: 307307.
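
Since the merge key is a positional ID, the join is only meaningful when the LOF and IF frames line up row for row. The sketch below illustrates one way to check this on toy data. ids_align is a hypothetical helper, the VALUE column and all numbers are invented for illustration, and combine is copied from the cell above so the example runs on its own. It also shows a failure mode worth keeping in mind: drop_duplicates compares whole rows, so a row that is duplicated in one model's output but not in the other shifts every later ID.

```python
import pandas as pd

def combine(df1, df2, label_cols):
    """Same logic as the notebook's combine helper."""
    df = pd.concat([df1, df2], ignore_index=True).drop_duplicates().reset_index(drop=True)
    df["ID"] = df.index + 1
    df = df[["ID"] + [c for c in df.columns if c != "ID"]]
    return df, df[["ID"] + label_cols]

def ids_align(df_a, df_b, feature_cols):
    """Hypothetical check: same length and identical feature values per ID."""
    if len(df_a) != len(df_b):
        return False
    a = df_a.sort_values("ID")[feature_cols].reset_index(drop=True)
    b = df_b.sort_values("ID")[feature_cols].reset_index(drop=True)
    return a.equals(b)

# Toy features; rows 1 and 2 share the same feature value.
features = pd.DataFrame({"VALUE": [10.0, 20.0, 20.0, 30.0]})
if_res = features.assign(IF_SCORE=[0.10, 0.20, 0.25, 0.30])

# Case 1: no fully identical rows in either result, so the IDs line up.
lof_ok = features.assign(LOF_SCORE=[-1.0, -1.1, -1.2, -2.0])
lof_full, _ = combine(lof_ok.iloc[:2], lof_ok.iloc[2:], ["LOF_SCORE"])
if_full, _ = combine(if_res.iloc[:2], if_res.iloc[2:], ["IF_SCORE"])
aligned = ids_align(lof_full, if_full, ["VALUE"])

# Case 2: rows 1 and 2 are identical in the LOF result only, so one is dropped there.
lof_dup = features.assign(LOF_SCORE=[-1.0, -1.1, -1.1, -2.0])
lof_dup_full, _ = combine(lof_dup.iloc[:2], lof_dup.iloc[2:], ["LOF_SCORE"])
misaligned = not ids_align(lof_dup_full, if_full, ["VALUE"])

print(aligned, misaligned)
```

On the real data, the equivalent call would compare df_lof_full and df_if_full on their shared feature columns before trusting the inner merge.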


3. Align Risk Magnitude - This step prepares the raw scores for normalization by ensuring that a higher numerical value always represents a higher risk.

In [14]:
# IF_SCORE: Assuming lower score = higher risk. Invert to align risk magnitude (1 - Score).
df_combined['RISK_IF_SCORE'] = 1 - df_combined['IF_SCORE']

# LOF_SCORE: More negative is higher risk. Take the absolute value for magnitude.
df_combined['RISK_LOF_SCORE'] = df_combined['LOF_SCORE'].abs()

print("Raw scores transformed to RISK_IF_SCORE and RISK_LOF_SCORE (higher = higher risk).")

Raw scores transformed to RISK_IF_SCORE and RISK_LOF_SCORE (higher = higher risk).
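
To make the direction of the two transformations concrete, here is a small sketch on invented scores. It assumes the scores follow scikit-learn's conventions (Isolation Forest: lower means more anomalous; LOF negative outlier factor: more negative means more anomalous). The variable names and values are illustrative and not taken from the project data.

```python
import numpy as np

# Toy raw scores for four transactions; the last one is the most anomalous
# under both conventions (illustrative values only).
if_score = np.array([0.12, 0.08, 0.02, -0.25])   # lower = more anomalous
lof_score = np.array([-1.0, -1.1, -1.3, -9.5])   # more negative = more anomalous

# Same transformations as in the cell above.
risk_if = 1 - if_score
risk_lof = np.abs(lof_score)

# Alternative for LOF that would stay correct even if a score were positive.
risk_lof_neg = -lof_score

# After the transformation, the most anomalous row has the highest risk value.
print(np.argmax(risk_if) == np.argmin(if_score))
print(np.argmax(risk_lof) == np.argmin(lof_score))
```

Note that neither transformation bounds the result to a fixed interval; that is the job of the normalization step that follows.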


4. Normalize Scores - This cell applies the MinMaxScaler to normalize both risk scores to the range $[-1, 1]$.

In [15]:
# Initialize scaler for the range [-1, 1]
scaler = MinMaxScaler(feature_range=(-1, 1))

# Normalize LOF Risk Score
df_combined['LOF_SCORE_NORM'] = scaler.fit_transform(df_combined[['RISK_LOF_SCORE']])

# Normalize IF Risk Score
df_combined['IF_SCORE_NORM'] = scaler.fit_transform(df_combined[['RISK_IF_SCORE']])

print("LOF_SCORE_NORM and IF_SCORE_NORM successfully created (range [-1, 1]).")

LOF_SCORE_NORM and IF_SCORE_NORM successfully created (range [-1, 1]).
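
Min-max scaling is sensitive to extreme values, which is worth knowing when reading the normalized scores. The sketch below applies the same linear rescaling that MinMaxScaler performs, written in plain NumPy so it runs without scikit-learn. The min_max function and the numbers are illustrative assumptions. With a single extreme value in the input, every other point is pushed very close to -1, which is the kind of pattern visible in the LOF_SCORE_NORM values around -0.999998 shown in the cleanup step.

```python
import numpy as np

def min_max(x, low=-1.0, high=1.0):
    """Linear rescaling of x so that min(x) -> low and max(x) -> high."""
    x = np.asarray(x, dtype=float)
    x_std = (x - x.min()) / (x.max() - x.min())
    return x_std * (high - low) + low

# Four ordinary risk values and one extreme outlier (illustrative values only).
risk = np.array([1.0, 1.2, 1.5, 2.0, 1000.0])
scaled = min_max(risk)

# The outlier maps to +1; the ordinary values are compressed next to -1.
print(np.round(scaled, 4))
```

If that compression turns out to hide useful differences between ordinary rows, a rank-based or log transform before scaling is a common alternative, though the notebook itself does not do this.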


5. Cleanup - The final step cleans up the temporary and redundant score columns.

In [16]:
# Clean up intermediate columns (RISK_...) and the raw score columns (LOF_SCORE, IF_SCORE)
df_combined = df_combined.drop(columns=['RISK_LOF_SCORE', 'RISK_IF_SCORE', 'LOF_SCORE', 'IF_SCORE'])

print("Intermediate and raw score columns dropped.")
print("df_combined is ready for the final priority score calculation.")
display(df_combined[['ID', 'LOF_SCORE_NORM', 'IF_SCORE_NORM', 'VALOR TRANSAÇÃO']].head())

Intermediate and raw score columns dropped.
df_combined is ready for the final priority score calculation.


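|   | ID | LOF_SCORE_NORM | IF_SCORE_NORM | VALOR TRANSAÇÃO |
|---|----|----------------|---------------|-----------------|
| 0 | 1 | -0.999998 | -0.159904 | 1000.0 |
| 1 | 2 | -0.999999 | -0.239186 | 350.0 |
| 2 | 3 | -0.999998 | -0.354873 | 250.0 |
| 3 | 4 | -0.999999 | -0.093496 | 300.0 |
| 4 | 5 | -0.999999 | -0.151828 | 500.0 |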


## 3.1 Final combine_dataframes function

In [None]:
def combine_dataframes(df_lof_classified, df_lof_normal, df_if_classified, df_if_normal):
    """
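    Combines the LOF and IF result DataFrames: concatenates the classified and
    normal subsets of each model, creates a sequential ID, merges the IF
    labels and scores onto the full LOF DataFrame, and normalizes both scores
    to the range [-1, 1].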
    """
    def combine(df1, df2, label_cols):
        df = pd.concat([df1, df2], ignore_index=True).drop_duplicates().reset_index(drop=True)

        df["ID"] = df.index + 1

        cols = ["ID"] + [c for c in df.columns if c != "ID"]
        df = df[cols]

        df_specific = df[["ID"] + label_cols]
        return df, df_specific

    # Processing LOF
    df_lof_full, df_lof_specific = combine(
        df_lof_classified, df_lof_normal,
        label_cols=["LOF_LABEL", "LOF_SCORE"]
    )
    # Processing IF
    df_if_full, df_if_specific = combine(
        df_if_classified, df_if_normal,
        label_cols=["IF_LABEL", "IF_SCORE"]
    )

    # Final merge: LOF (all columns) + IF (specific columns)
    df_final = df_lof_full.merge(df_if_specific, on="ID", how="inner")

    # Normalization step
    # IF_SCORE: Assuming lower score = higher risk. Invert to: Higher score = Higher risk.
    df_final['RISK_IF_SCORE'] = 1 - df_final['IF_SCORE']

    # LOF_SCORE: Already negative, where more negative is higher risk.
    df_final['RISK_LOF_SCORE'] = df_final['LOF_SCORE'].abs()

    # Normalize both risk scores to the range [-1, 1]
    scaler = MinMaxScaler(feature_range=(-1, 1))

    # Reshape scores for MinMaxScaler and fit/transform
    df_final['LOF_SCORE_NORM'] = scaler.fit_transform(df_final[['RISK_LOF_SCORE']])
    df_final['IF_SCORE_NORM'] = scaler.fit_transform(df_final[['RISK_IF_SCORE']])

    # Clean up intermediate columns
    df_final = df_final.drop(columns=['RISK_LOF_SCORE', 'RISK_IF_SCORE'])

    return df_final

## 4. Priority Score Calculation
This section implements the risk engine of the pipeline. It takes the merged, normalized scores and weights them alongside the financial magnitude of the transaction to produce a final ranking score.

1. Technical Score (Model Consensus)

The first step calculates the TECHNICAL_SCORE by taking the mean of the two normalized anomaly scores (LOF_SCORE_NORM and IF_SCORE_NORM). This score reflects the consensus of our machine learning models:

It is highest when both models agree the transaction is statistically anomalous; a strong signal from only one model yields an intermediate value.

In [18]:
# Combine Technical Scores (Mean)
df_combined['TECHNICAL_SCORE'] = df_combined[['LOF_SCORE_NORM', 'IF_SCORE_NORM']].mean(axis=1)

2. Financial Risk

The FINANCIAL_RISK component is derived from the LOG_VALOR (log-transformed transaction value).

MinMaxScaler is used to scale LOG_VALOR to the range $[0, 1]$. This converts the magnitude of the transaction into a standardized risk factor, where the largest transaction gets a risk score of 1.

In [19]:
# Normalize LOG_VALOR to range [0, 1] to use it as a weighting factor
log_scaler = MinMaxScaler(feature_range=(0, 1))
df_combined['FINANCIAL_RISK'] = log_scaler.fit_transform(df_combined[['LOG_VALOR']])

3. Final Priority Score and Ranking

The final PRIORITY_SCORE is a weighted average of the two components:

70% Weight (0.7): Applied to the TECHNICAL_SCORE (Statistical Evidence).

30% Weight (0.3): Applied to the FINANCIAL_RISK (Impact/Magnitude).

This prioritizes anomalies that are both statistically strange and have a significant financial impact. The DataFrame is then sorted by this score in descending order, making the highest-risk transactions appear at the top for immediate manual audit.

In [21]:
# Final Priority Calculation: (0.7 * Technical) + (0.3 * Financial Risk)
df_combined['PRIORITY_SCORE'] = (
    (0.7 * df_combined['TECHNICAL_SCORE']) +
    (0.3 * df_combined['FINANCIAL_RISK'])
)

# Order by priority score
df = df_combined.sort_values(by='PRIORITY_SCORE', ascending=False).reset_index(drop=True)
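
A small worked example may help in reading the resulting numbers. Since TECHNICAL_SCORE lies in $[-1, 1]$ and FINANCIAL_RISK in $[0, 1]$, the weights 0.7 and 0.3 imply that PRIORITY_SCORE can range from -0.7 to 1.0. The sketch below uses four invented transactions (the CASE labels and all values are illustrative, not project data) to show how the two components trade off.

```python
import pandas as pd

# Four hypothetical transactions covering the extremes of each component.
toy = pd.DataFrame({
    "CASE": ["no_signal_small", "both_high_large", "one_model_mid", "no_signal_large"],
    "LOF_SCORE_NORM": [-1.0, 1.0, 1.0, -1.0],
    "IF_SCORE_NORM": [-1.0, 1.0, -1.0, -1.0],
    "FINANCIAL_RISK": [0.0, 1.0, 0.5, 1.0],
})

# Same formulas as in the cells above.
toy["TECHNICAL_SCORE"] = toy[["LOF_SCORE_NORM", "IF_SCORE_NORM"]].mean(axis=1)
toy["PRIORITY_SCORE"] = 0.7 * toy["TECHNICAL_SCORE"] + 0.3 * toy["FINANCIAL_RISK"]

ranked = toy.sort_values(by="PRIORITY_SCORE", ascending=False).reset_index(drop=True)
print(ranked[["CASE", "TECHNICAL_SCORE", "FINANCIAL_RISK", "PRIORITY_SCORE"]])
```

In this toy ranking, the largest possible transaction with no model signal still ranks below a mid-sized transaction flagged by only one model, which reflects the 70/30 weighting in favor of statistical evidence.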

## 4.1 Final calculate_priority_score function

In [None]:
def calculate_priority_score(df: pd.DataFrame) -> pd.DataFrame:
    """
    Calculates the Technical Score, Financial Risk, and the final weighted
    Priority Score for manual review.

    Args:
        df: Combined DataFrame containing LOF_SCORE_NORM, IF_SCORE_NORM and LOG_VALOR.

    Returns:
        pd.DataFrame: The DataFrame with the final PRIORITY_SCORE, sorted descending.
    """
    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()

    # Combine Technical Scores (Mean)
    df['TECHNICAL_SCORE'] = df[['LOF_SCORE_NORM', 'IF_SCORE_NORM']].mean(axis=1)

    # Normalize LOG_VALOR to range [0, 1] to use it as a weighting factor
    log_scaler = MinMaxScaler(feature_range=(0, 1))
    df['FINANCIAL_RISK'] = log_scaler.fit_transform(df[['LOG_VALOR']])

    # Final Priority Calculation: (0.7 * Technical) + (0.3 * Financial Risk)
    df['PRIORITY_SCORE'] = (
        (0.7 * df['TECHNICAL_SCORE']) +
        (0.3 * df['FINANCIAL_RISK'])
    )

    # Order by priority score, highest risk first
    df = df.sort_values(by='PRIORITY_SCORE', ascending=False).reset_index(drop=True)

    return df

## 5. Inspect Final Structure and Missing Values
This checks the column types, non-null counts, and confirms that the final score columns were created.

In [22]:
print("--- DataFrame Information (df) ---")
df.info()

print("\n--- Check New Score Columns ---")
display(df[['PRIORITY_SCORE', 'TECHNICAL_SCORE', 'FINANCIAL_RISK']].describe().T)

--- DataFrame Information (df) ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307307 entries, 0 to 307306
Data columns (total 34 columns):
 #   Column                  Non-Null Count   Dtype         
---  ------                  --------------   -----         
 0   ID                      307307 non-null  int64         
 1   CÓDIGO ÓRGÃO SUPERIOR   307307 non-null  int64         
 2   NOME ÓRGÃO SUPERIOR     307307 non-null  object        
 3   CÓDIGO ÓRGÃO            307307 non-null  int64         
 4   NOME ÓRGÃO              307307 non-null  object        
 5   CÓDIGO UNIDADE GESTORA  307307 non-null  int64         
 6   NOME UNIDADE GESTORA    307307 non-null  object        
 7   ANO EXTRATO             307307 non-null  int64         
 8   MÊS EXTRATO             307307 non-null  int64         
 9   CPF PORTADOR            289417 non-null  object        
 10  NOME PORTADOR           307307 non-null  object        
 11  CNPJ OU CPF FAVORECIDO  307307 non-null  int

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|-------|------|-----|-----|-----|-----|-----|-----|
| PRIORITY_SCORE | 307307.0 | -0.392042 | 0.108171 | -0.580749 | -0.475343 | -0.417315 | -0.325732 | 0.62027 |
| TECHNICAL_SCORE | 307307.0 | -0.752262 | 0.138142 | -0.999999 | -0.858596 | -0.782119 | -0.667689 | 0.644897 |
| FINANCIAL_RISK | 307307.0 | 0.448471 | 0.105913 | 0.0 | 0.378076 | 0.451681 | 0.52101 | 1.0 |
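
The df.info() listing shows CPF PORTADOR with 289417 non-null values out of 307307 rows, i.e. 17890 missing entries. To get such counts for every column at once, a short summary like the one sketched below could be used. missing_report is a hypothetical helper and the toy DataFrame is invented for illustration; only the column names are borrowed from the output.

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Hypothetical helper: count and share of missing values per column."""
    counts = df.isna().sum()
    report = pd.DataFrame({"missing": counts, "share": counts / len(df)})
    # Keep only columns that actually have gaps, worst first.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)

# Toy data with gaps in two columns (illustrative values only).
toy = pd.DataFrame({
    "CPF PORTADOR": ["123", None, "456", None],
    "NOME PORTADOR": ["A", "B", "C", "D"],
    "PRIORITY_SCORE": [0.6, 0.1, np.nan, -0.4],
})

report = missing_report(toy)
print(report)
```

On the real data the call would be missing_report(df); any gap in the score columns would be worth investigating before the ranking is used for triage.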
