# **Detecção de Fraudes em Transações Financeiras**

Luiz Henrique Rigo Faccio | CCR de `Inteligência Artifical`

*Ciência da Computação - Universidade Federal Da Fronteira Sul*

Dataset disponível em: [https://www.kaggle.com/datasets/aryan208/financial-transactions-dataset-for-fraud-detection](https://www.kaggle.com/datasets/aryan208/financial-transactions-dataset-for-fraud-detection)

## **Importando bibliotecas e o dataset**

In [55]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import datetime as dt
from sklearn.preprocessing import StandardScaler, OneHotEncoder

In [56]:
path = "archive/financial_fraud_detection_dataset.csv"
dataSet = pd.read_csv(path)

## **Vizualizando as informações**

In [57]:
def numeric_columns(dataSet):
    return dataSet.select_dtypes(include=['int64', 'float64']).columns
def categorical_columns(dataSet): 
    return dataSet.select_dtypes(include=['object']).columns

def analyze_data(dataSet):
    info = pd.DataFrame({"Tipos":dataSet.dtypes, "Valores únicos": dataSet.nunique(), "Valores Nulos": dataSet.isnull().sum()})
    
    print("Dimensão do dataset: ", dataSet.shape)
    display(info)

    print("Informações contínuas:")
    display(dataSet[numeric_columns(dataSet)].describe())

    if (categorical_columns(dataSet).size > 0):
        print("Informações categóricas:")
        display(dataSet[categorical_columns(dataSet)].describe())

    print("Amostra do dataset:")
    display(dataSet.sample(5))

In [58]:
#analyze_data(dataSet)

## **Tratando os dados**

Algumas informações como IDs, tipo de fraude, números de contas e números de dispositivos são inúteis nesta situação

time_since_last_transaction tem muitos valores vazios

In [59]:
useLess = ["transaction_id", "sender_account", "receiver_account", "ip_address", "device_hash", "fraud_type", "time_since_last_transaction"]

dataSet = dataSet.drop(columns=useLess)

Para não descartar os horários das transações, eles serão agrupados em madrugada, manha, tarde e noite.

In [60]:
def categorize_timestamp(timestamps : pd.Series):
    """Função para categorizar timestap em períodos: manhã, tarde, noite e madrugada

    Args:
        timestamps (pd.Series): Coluna de timestamp do DataSet
    
    Returns:
        periodos (pd.Series): Coluna de timestamps já categorizada
    """
    
    def get_period(hour):
        if 6 <= hour < 9:
            return "manha_1"
        if 9 <= hour < 12:
            return "manha_2"
        elif 12 <= hour < 15:
            return "tarde_1"
        elif 15 <= hour < 18:
            return "tarde_2"
        elif 18 <= hour < 21:
            return "noite_1"
        elif 18 <= hour < 21:
            return "noite_2"
        elif 21 <= hour < 23:
            return "tarde_2"
        elif 23 <= hour < 2:
            return "tarde_2"
        elif 2 <= hour < 5:
            return "madrugada_1"
        else:
            return "madrugada_2"
        
    periodos = timestamps.apply(lambda x: get_period(dt.datetime.fromisoformat(x).hour))
    return periodos
    

In [61]:
dataSet["timestamp"] = categorize_timestamp(dataSet["timestamp"])

Como não existem mais dados faltantes, não será necessário imputá-los

As informações categóricas serão tratadas com o uso de OneHotEncoding e os dados numéricos serão escalados com o uso de StandartScaler

Os valores de y (is_fraud) serão transformados em inteiros

In [62]:
sacaler = StandardScaler()
dataSet[numeric_columns(dataSet)] = sacaler.fit_transform(dataSet[numeric_columns(dataSet)])

dataSet["is_fraud"] = dataSet["is_fraud"].map(lambda x: 1 if x == True else 0)

In [63]:
encoder = OneHotEncoder()
encoder.fit(dataSet[categorical_columns(dataSet)])
encoded_data = encoder.transform(dataSet[categorical_columns(dataSet)])
encoded_df = pd.DataFrame(encoded_data.toarray(), columns=encoder.get_feature_names_out(categorical_columns(dataSet)))
dataSet = dataSet.drop(columns=categorical_columns(dataSet))
dataSet = pd.concat([dataSet, encoded_df], axis=1)

In [64]:
analyze_data(dataSet)

Dimensão do dataset:  (5000000, 40)


Unnamed: 0,Tipos,Valores únicos,Valores Nulos
amount,float64,217069,0
is_fraud,int64,2,0
spending_deviation_score,float64,917,0
velocity_score,float64,20,0
geo_anomaly_score,float64,101,0
timestamp_madrugada_1,float64,2,0
timestamp_madrugada_2,float64,2,0
timestamp_manha_1,float64,2,0
timestamp_manha_2,float64,2,0
timestamp_noite_1,float64,2,0


Informações contínuas:


Unnamed: 0,amount,is_fraud,spending_deviation_score,velocity_score,geo_anomaly_score,timestamp_madrugada_1,timestamp_madrugada_2,timestamp_manha_1,timestamp_manha_2,timestamp_noite_1,...,location_Tokyo,location_Toronto,device_used_atm,device_used_mobile,device_used_pos,device_used_web,payment_channel_ACH,payment_channel_UPI,payment_channel_card,payment_channel_wire_transfer
count,5000000.0,5000000.0,5000000.0,5000000.0,5000000.0,5000000.0,5000000.0,5000000.0,5000000.0,5000000.0,...,5000000.0,5000000.0,5000000.0,5000000.0,5000000.0,5000000.0,5000000.0,5000000.0,5000000.0,5000000.0
mean,7.016183e-17,0.0359106,1.947598e-17,-6.677539e-17,-8.86061e-16,0.1250936,0.1666556,0.1250348,0.1251578,0.1250794,...,0.1251988,0.1248698,0.249928,0.2502262,0.2498316,0.2500142,0.2500482,0.2497694,0.2499386,0.2502438
std,1.0,0.1860673,1.0,1.0,1.0,0.330825,0.3726681,0.3307584,0.3308978,0.330809,...,0.3309442,0.3305713,0.4329712,0.4331433,0.4329155,0.4330209,0.4330406,0.4328795,0.4329773,0.4331534
min,-0.7637771,0.0,-5.255371,-1.647578,-1.732394,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,-0.7072584,0.0,-0.679064,-0.9539571,-0.8662475,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,-0.4687139,0.0,0.0003878031,0.08647374,-0.0001013599,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.3084602,0.0,0.6698476,0.9534994,0.8660447,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0
max,6.72784,1.0,5.016341,1.64712,1.732191,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Amostra do dataset:


Unnamed: 0,amount,is_fraud,spending_deviation_score,velocity_score,geo_anomaly_score,timestamp_madrugada_1,timestamp_madrugada_2,timestamp_manha_1,timestamp_manha_2,timestamp_noite_1,...,location_Tokyo,location_Toronto,device_used_atm,device_used_mobile,device_used_pos,device_used_web,payment_channel_ACH,payment_channel_UPI,payment_channel_card,payment_channel_wire_transfer
1185237,-0.519359,0,-0.529185,-0.086931,0.138482,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
923252,-0.655123,0,0.280162,1.30031,-1.143414,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
250428,-0.222743,0,2.238582,1.64712,-0.866247,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
2297954,-0.306542,0,0.280162,-0.953957,1.524316,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4400115,-0.286454,0,-0.649088,-1.474172,-0.485143,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0


## **Treinando Modelos**