# Notebook description

Uses the Kaggle API to list and download the dataset for the credit card fraud study, then splits the data into training, validation, and test sets.

# Packages and functions

In [39]:
#!pip install kaggle
#!pip install fastparquet

In [18]:
from kaggle.api.kaggle_api_extended import KaggleApi
from sklearn.model_selection import train_test_split
import zipfile
import os
import pandas as pd

In [4]:
api = KaggleApi()
api.authenticate()

In [17]:
def Ajusta_Formatos(dataframe):
    # Takes a pandas DataFrame and casts the binary indicator columns to int.
    # Returns the DataFrame with the adjusted columns (the input is modified in place).
    
    dataframe['repeat_retailer'] = dataframe['repeat_retailer'].astype(int)
    dataframe['used_chip'] = dataframe['used_chip'].astype(int)
    dataframe['used_pin_number'] = dataframe['used_pin_number'].astype(int)
    dataframe['online_order'] = dataframe['online_order'].astype(int)
    dataframe['fraud'] = dataframe['fraud'].astype(int)

    return dataframe
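
Because `Ajusta_Formatos` assigns back into the DataFrame it receives, the caller's object is changed as well. The sketch below is a hypothetical variant (the name `ajusta_formatos_copia` and the `BINARY_COLUMNS` constant are not part of the notebook) that works on a copy and checks that each column really holds only 0 and 1 before casting. It runs on a tiny synthetic frame rather than the real data.

```python
import pandas as pd

# Binary indicator columns of card_transdata.csv (names taken from the notebook)
BINARY_COLUMNS = ['repeat_retailer', 'used_chip', 'used_pin_number',
                  'online_order', 'fraud']


def ajusta_formatos_copia(dataframe, columns=BINARY_COLUMNS):
    # Hypothetical variant of Ajusta_Formatos: leaves the input untouched and
    # refuses to cast columns containing anything other than 0 or 1 (or NaN).
    result = dataframe.copy()
    for column in columns:
        if not result[column].isin([0, 1]).all():
            raise ValueError(f"Column {column!r} has values other than 0 and 1")
        result[column] = result[column].astype(int)
    return result


# Tiny synthetic frame standing in for the real dataset
exemplo = pd.DataFrame({
    'distance_from_home': [57.9, 10.8],
    'repeat_retailer': [1.0, 1.0],
    'used_chip': [1.0, 0.0],
    'used_pin_number': [0.0, 0.0],
    'online_order': [0.0, 1.0],
    'fraud': [0.0, 0.0],
})

ajustado = ajusta_formatos_copia(exemplo)
print(ajustado.dtypes)
```

The original `exemplo` keeps its float columns, which can be convenient when the raw data is needed again later in the same session.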

# Listing datasets with credit card and fraud information

In [5]:
!kaggle datasets list -s credit-card-fraud

ref                                                        title                                              size  lastUpdated          downloadCount  voteCount  usabilityRating  
---------------------------------------------------------  ------------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
mlg-ulb/creditcardfraud                                    Credit Card Fraud Detection                        66MB  2018-03-23 01:17:27         825408      11911  0.85294116       
dhanushnarayananr/credit-card-fraud                        Credit Card Fraud                                  29MB  2022-05-07 15:09:29          20504        198  0.9411765        
mishra5001/credit-card                                     Credit Card Fraud Detection                       112MB  2019-07-15 06:36:02          21591        240  0.88235295       
joebeachcapital/credit-card-fraud                          Credit Card Fraud                   

## Downloading the data

Downloads the data, which is available at the link given in the README.

In [6]:
!kaggle datasets download dhanushnarayananr/credit-card-fraud

Dataset URL: https://www.kaggle.com/datasets/dhanushnarayananr/credit-card-fraud
License(s): CC0-1.0
Downloading credit-card-fraud.zip to /home/hugo/Documents/Git_GitHub/Estudo_Fraude_CC/vFraude_CC/Base_de_dados
 93%|███████████████████████████████████▌  | 27.0M/28.9M [00:02<00:00, 20.9MB/s]
100%|██████████████████████████████████████| 28.9M/28.9M [00:02<00:00, 14.4MB/s]
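
Once the archive is on disk, it can be useful to check which files it contains before reading any of them. The following is a minimal sketch using only the standard library. To stay runnable without the real download, it builds a small in-memory archive; the member name `card_transdata.csv` matches the notebook, but the two-column content is made up for illustration.

```python
import io
import zipfile

# Build a tiny in-memory archive so the sketch runs without the real download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    zf.writestr('card_transdata.csv', 'distance_from_home,fraud\n57.877857,0.0\n')

# List each member with its uncompressed size, then peek at the header line.
with zipfile.ZipFile(buffer) as zf:
    members = {info.filename: info.file_size for info in zf.infolist()}
    with zf.open('card_transdata.csv') as f:
        header = f.readline().decode('utf-8').strip()

print(members)
print(header)
```

With the real file, `zipfile.ZipFile('credit-card-fraud.zip')` would take the place of the in-memory buffer.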


## Reading and inspecting the data

distance_from_home - the distance from home to where the transaction happened.

distance_from_last_transaction - the distance from where the last transaction happened.

ratio_to_median_purchase_price - the ratio of the transaction's purchase price to the median purchase price.

repeat_retailer - whether the transaction happened at the same retailer as before.

used_chip - whether the transaction was made with the chip (credit card).

used_pin_number - whether the transaction was made using a PIN.

online_order - whether the transaction was an online order.

fraud - whether the transaction is fraudulent.

In [7]:
zf = zipfile.ZipFile('credit-card-fraud.zip') 
dados = pd.read_csv(zf.open('card_transdata.csv'))
dados.head()

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
0,57.877857,0.31114,1.94594,1.0,1.0,0.0,0.0,0.0
1,10.829943,0.175592,1.294219,1.0,0.0,0.0,0.0,0.0
2,5.091079,0.805153,0.427715,1.0,0.0,0.0,1.0,0.0
3,2.247564,5.600044,0.362663,1.0,1.0,0.0,1.0,0.0
4,44.190936,0.566486,2.222767,1.0,1.0,0.0,1.0,0.0


In [8]:
dados.shape

(1000000, 8)

In [10]:
dados.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
 #   Column                          Non-Null Count    Dtype  
---  ------                          --------------    -----  
 0   distance_from_home              1000000 non-null  float64
 1   distance_from_last_transaction  1000000 non-null  float64
 2   ratio_to_median_purchase_price  1000000 non-null  float64
 3   repeat_retailer                 1000000 non-null  float64
 4   used_chip                       1000000 non-null  float64
 5   used_pin_number                 1000000 non-null  float64
 6   online_order                    1000000 non-null  float64
 7   fraud                           1000000 non-null  float64
dtypes: float64(8)
memory usage: 61.0 MB


In [16]:
dados = Ajusta_Formatos(dados)
dados.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
 #   Column                          Non-Null Count    Dtype  
---  ------                          --------------    -----  
 0   distance_from_home              1000000 non-null  float64
 1   distance_from_last_transaction  1000000 non-null  float64
 2   ratio_to_median_purchase_price  1000000 non-null  float64
 3   repeat_retailer                 1000000 non-null  int64  
 4   used_chip                       1000000 non-null  int64  
 5   used_pin_number                 1000000 non-null  int64  
 6   online_order                    1000000 non-null  int64  
 7   fraud                           1000000 non-null  int64  
dtypes: float64(3), int64(5)
memory usage: 61.0 MB
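
The reported memory usage is 61.0 MB both before and after the cast because `float64` and `int64` each take 8 bytes per value: 8 columns × 8 bytes × 1,000,000 rows = 64,000,000 bytes, which is about 61.0 MiB. The sketch below shows how a narrower type such as `int8`, which is enough for 0/1 flags, would change that. It is an illustration on a synthetic 1,000-row frame and not something the notebook does.

```python
import numpy as np
import pandas as pd

n = 1000
frame = pd.DataFrame({
    'ratio_to_median_purchase_price': np.linspace(0.1, 5.0, n),
    'fraud': np.zeros(n),  # float64, as in the raw CSV
})

# Cast the flag column to two different integer widths
as_int64 = frame.astype({'fraud': 'int64'})
as_int8 = frame.astype({'fraud': 'int8'})

# Bytes used by the 'fraud' column alone (index excluded)
bytes_float64 = frame['fraud'].memory_usage(index=False)
bytes_int64 = as_int64['fraud'].memory_usage(index=False)
bytes_int8 = as_int8['fraud'].memory_usage(index=False)

print(bytes_float64, bytes_int64, bytes_int8)  # 8000 8000 1000
```

Whether the smaller type is worth it depends on what consumes the saved files downstream; some libraries convert back to 64-bit types on load.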


## Splitting the data into training, validation, and test sets (~78%, ~12%, 10%)

In [19]:
X = dados.drop(['fraud'], axis=1)
y = dados['fraud']

In [31]:
X_treino, X_teste, y_treino, y_teste = train_test_split(X, y, test_size=0.1, stratify=y)
X_treino, X_val, y_treino, y_val = train_test_split(X_treino, y_treino, test_size=0.13, stratify=y_treino)

In [32]:
X_treino.shape, X_val.shape, X_teste.shape

((783000, 7), (117000, 7), (100000, 7))
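
The shapes follow from the two-stage split: the first call holds out 10% of 1,000,000 rows (100,000) for testing, and the second takes 13% of the remaining 900,000 (117,000) for validation, leaving 783,000 for training. That is 78.3% / 11.7% / 10% of the full dataset. The sketch below shows one way to get an exact 80/10/10 split by passing absolute row counts as `test_size`, and adds a `random_state` so the partition can be reproduced. The seed value 42 and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data standing in for the real features and labels
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
y = (rng.random(10_000) < 0.1).astype(int)

# An integer test_size is read as an absolute number of rows
n_holdout = len(X) // 10

X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=n_holdout, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=n_holdout, stratify=y_rest, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 8000 1000 1000
```

Using counts avoids having to work out the second fraction (0.1 / 0.9 of the remainder) and any rounding that comes with it.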

In [35]:
dados_treino = pd.concat([X_treino, y_treino], axis = 1)
dados_val = pd.concat([X_val, y_val], axis = 1)
dados_teste = pd.concat([X_teste, y_teste], axis = 1)

### Saving the datasets

Saves the training, validation, and test sets as separate files.

In [41]:
dados_treino.to_parquet('treino.parquet', engine='fastparquet')
dados_val.to_parquet('validacao.parquet', engine='fastparquet')
dados_teste.to_parquet('teste.parquet', engine='fastparquet')