# Background
## IEEE-CIS Fraud Detection

### Citation

[Click here for Kaggle page](https://www.kaggle.com/competitions/ieee-fraud-detection/data)
```
@misc{ieee-fraud-detection,
    author = {Addison Howard and Bernadette Bouchon-Meunier and IEEE CIS and inversion and John Lei and Lynn@Vesta and Marcus2010 and Prof. Hussein Abbass},
    title = {IEEE-CIS Fraud Detection},
    year = {2019},
    howpublished = {\url{https://kaggle.com/competitions/ieee-fraud-detection}},
    note = {Kaggle}
}
```

### Description

Imagine standing at the check-out counter at the grocery store with a long line behind you and the cashier not-so-quietly announces that your card has been declined. In this moment, you probably aren’t thinking about the data science that determined your fate.

Embarrassed, and certain you have the funds to cover everything needed for an epic nacho party for 50 of your closest friends, you try your card again. Same result. As you step aside and allow the cashier to tend to the next customer, you receive a text message from your bank. “Press 1 if you really tried to spend $500 on cheddar cheese.”

While perhaps cumbersome (and often embarrassing) in the moment, this fraud prevention system is actually saving consumers millions of dollars per year. Researchers from the IEEE Computational Intelligence Society (IEEE-CIS) want to improve this figure, while also improving the customer experience. With higher accuracy fraud detection, you can get on with your chips without the hassle.

IEEE-CIS works across a variety of AI and machine learning areas, including deep neural networks, fuzzy systems, evolutionary computation, and swarm intelligence. Today they’re partnering with the world’s leading payment service company, Vesta Corporation, seeking the best solutions for fraud prevention industry, and now you are invited to join the challenge.

In this competition, you’ll benchmark machine learning models on a challenging large-scale dataset. The data comes from Vesta's real-world e-commerce transactions and contains a wide range of features from device type to product features. You also have the opportunity to create new features to improve your results.

If successful, you’ll improve the efficacy of fraudulent transaction alerts for millions of people around the world, helping hundreds of thousands of businesses reduce their fraud loss and increase their revenue. And of course, you will save party people just like you the hassle of false positives.

### Data

In this competition you are predicting the probability that an online transaction is fraudulent, as denoted by the binary target isFraud.

The data is broken into two files identity and transaction, which are joined by TransactionID. Not all transactions have corresponding identity information.
Categorical Features - Transaction

    ProductCD
    card1 - card6
    addr1, addr2
    P_emaildomain
    R_emaildomain
    M1 - M9

Categorical Features - Identity

    DeviceType
    DeviceInfo
    id_12 - id_38

The TransactionDT feature is a timedelta from a given reference datetime (not an actual timestamp).

You can read more about the data from this post by the competition host.
Files

    train_{transaction, identity}.csv - the training set
    test_{transaction, identity}.csv - the test set (you must predict the isFraud value for these observations)
    sample_submission.csv - a sample submission file in the correct format


# Analysis

In [None]:
import sklearn
import xgboost
import catboost
import torch

import pandas as pd
import duckdb as db

import plotly

In [32]:
workDir = 'D:/GitRepos/Brandon-Pipher.github.io/_temp'

transaction_cats = [
    'ProductCD', 'card1', 'card2', 'card3', 'card4', 'card5', 'card6',
    'addr1', 'addr2', 'P_emaildomain', 'R_emaildomain',
    'M1', 'M2', 'M3', 'M4', 'M5', 'M6', 'M7', 'M8', 'M9'
]

identity_cats = ['DeviceType', 'DeviceInfo'] + [f'id_{i}' for i in range(12, 39)]

# Function to load with automatic dtype assignment
def load_csv_with_categoricals(path, cat_cols):
    cols = pd.read_csv(path, nrows=0).columns
    dtypes = {col: 'category' if col in cat_cols else 'float32' for col in cols}
    return pd.read_csv(path, dtype=dtypes)

# Load both datasets
transaction_df = load_csv_with_categoricals(f'{workDir}/train_transaction.csv', transaction_cats)
identity_df    = load_csv_with_categoricals(f'{workDir}/train_identity.csv',    identity_cats)

In [36]:
def quick_column_summary(df):
    summary = pd.DataFrame({
        'dtype': df.dtypes,
        'n_unique': df.nunique(dropna=True),
        'missing': df.isna().sum(),
        'missing_pct': df.isna().mean() * 100,
        'mode': df.mode(dropna=True).iloc[0]
    })
    return summary
print("train_identity")
quick_column_summary(train_identity)

train_identity


Unnamed: 0,dtype,n_unique,missing,missing_pct,mode
TransactionID,int64,144233,0,0.0,2987004
id_01,float64,77,0,0.0,-5.0
id_02,float64,115655,3361,2.330257,1102.0
id_03,float64,24,77909,54.016071,0.0
id_04,float64,15,77909,54.016071,0.0
id_05,float64,93,7368,5.108401,0.0
id_06,float64,101,7368,5.108401,0.0
id_07,float64,84,139078,96.425922,0.0
id_08,float64,94,139078,96.425922,-100.0
id_09,float64,46,69307,48.05211,0.0


In [37]:
print("train_identity")
quick_column_summary(train_identity)

train_identity


Unnamed: 0,dtype,n_unique,missing,missing_pct,mode
TransactionID,int64,144233,0,0.0,2987004
id_01,float64,77,0,0.0,-5.0
id_02,float64,115655,3361,2.330257,1102.0
id_03,float64,24,77909,54.016071,0.0
id_04,float64,15,77909,54.016071,0.0
id_05,float64,93,7368,5.108401,0.0
id_06,float64,101,7368,5.108401,0.0
id_07,float64,84,139078,96.425922,0.0
id_08,float64,94,139078,96.425922,-100.0
id_09,float64,46,69307,48.05211,0.0
