- **Task**: Predict whether a credit card transaction is fraud.
- **Performance**: Maximize recall subject to a precision of at least 5%
- **Experience**: Credit card transactions
- **Similar Projects**: Detecting churn
- **Assumptions**: Amount and the first principal components will be the most important variables.  SHAP feature importance will be a good start.  I don't think two days of time data is enough to yield a generalizable time feature, since I believe time mainly helps detect fraud through day of the week and hour of the day.
- **Why**: Banks are liable for credit card fraud over $50, so they must minimize this cost by catching it with machines.
- **Benefits**: Being able to stop a transaction suspected of fraud.  We can freeze a card and investigate it for fraud.  The model will improve as we find more fraud using the model.
- **Handoff**: We will hand off this model by saving it to a file and uploading it to our server.  We will run a prediction after every transaction, so we can reject fraudulent transactions, freeze the account, and update the database, which alerts the investigations department.
- **Buy-in**: We need buy-in from the engineering department responsible for code that runs after every transaction.  We also need buy-in from investigators, to make sure they will be willing to investigate and resolve these frozen accounts.
- **Data Available**: time in seconds (over the course of 2 days), amount, 28 principal components from PCA
- **Data I Wish I Had**: whether the chip reader was used, whether use of the chip reader was attempted, item purchased, time of day purchased, user history data
- **Data to Ignore**: Drop time in seconds.  Drop many of the principal components, but first check the SHAP importance of all of them
- **Development Data Format**: CSV
- **Missing Data**: None
- **Anonymize Data**: Already done
- **Change Datatypes**: Not necessary, small file of numbers
- **Test Set Size**: Dev set and test set will be approximately 20,000 samples each (7% of the data each)
- **Training Set Sampling Plan**: No sampling.  Training will be approximately 245,000 samples
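
The performance goal in the checklist (maximize recall while keeping precision at or above 5%) can be turned into a single number for comparing models. The following is a minimal sketch, not the method used in this project: the helper name `recall_at_min_precision` and the toy labels and scores are made up for illustration, and it assumes the model outputs a fraud score per transaction.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def recall_at_min_precision(y_true, y_score, min_precision=0.05):
    """Return (recall, threshold) for the best operating point whose
    precision is at least min_precision. Hypothetical helper."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # The final (precision=1, recall=0) point has no threshold, so drop it.
    precision, recall = precision[:-1], recall[:-1]
    ok = precision >= min_precision
    if not ok.any():
        return 0.0, None
    best_recall = recall[ok].max()
    # Thresholds are sorted ascending; among ties on recall, take the
    # highest threshold, which flags the fewest transactions.
    idx = np.flatnonzero(ok & (recall == best_recall))[-1]
    return float(best_recall), float(thresholds[idx])


# Toy data, made up for illustration: 2 frauds among 10 transactions.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.35, 0.9])

best_recall, threshold = recall_at_min_precision(y_true, y_score)
print(best_recall, threshold)
```

On real data, the returned threshold would be chosen on the dev set and then applied unchanged to the test set.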


In [2]:
import re

import pandas as pd
from numpy.random import seed
from sklearn.model_selection import train_test_split

# Seed NumPy's global RNG so the datasets are split the same way if re-run
# (train_test_split falls back to the global RNG when random_state is not passed)
seed(32)

# Read in data
df = pd.read_csv("../../data/raw/creditcard.csv")

# Convert CamelCase column names to snake_case (e.g. "Amount" -> "amount")
def camel_to_snake_case(name):
    s1 = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1).lower()


df.columns = [camel_to_snake_case(x) for x in df.columns]
# Rename the label column to "target"
df.rename(columns={'class': 'target'}, inplace=True)

# Drop columns
df.drop(["time"], axis=1, inplace=True)

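# Split datasets: hold out 14% of the rows, then halve that into dev and test
# (about 7% each). Stratifying on target keeps the fraud rate similar in every split.
train, temp = train_test_split(df, test_size=0.14, stratify=df['target'])
dev, test = train_test_split(temp, test_size=0.50, stratify=temp['target'])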

# Write results to files
train.to_csv("../../data/interim/train.csv", index=False)
dev.to_csv("../../data/interim/dev.csv", index=False)
test.to_csv("../../data/interim/test.csv", index=False)

# Print the dataframe shapes and show the rows per target value
print(train.shape, dev.shape, test.shape)
print("\nTrain\n", train['target'].value_counts(), "\n\nDev\n", dev['target'].value_counts(), "\n\nTest\n", test['target'].value_counts())


(244934, 30) (19936, 30) (19937, 30)

Train
 0    244511
1       423
Name: target, dtype: int64 

Dev
 0    19902
1       34
Name: target, dtype: int64 

Test
 0    19902
1       35
Name: target, dtype: int64
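
As a quick sanity check on the stratified split, the share of rows and the fraud rate in each split can be computed from the counts printed above. This is a small sketch; the `counts` and `summary` dictionaries are introduced here for illustration, with the numbers copied by hand from the output.

```python
# (legitimate, fraud) row counts per split, copied from the printed output.
counts = {
    "train": (244511, 423),
    "dev": (19902, 34),
    "test": (19902, 35),
}

total_rows = sum(neg + pos for neg, pos in counts.values())

summary = {}
for name, (neg, pos) in counts.items():
    rows = neg + pos
    summary[name] = {
        "share_of_data": rows / total_rows,  # fraction of all rows in this split
        "fraud_rate": pos / rows,            # fraction of this split that is fraud
    }
    print(f"{name:5s} rows={rows:6d} share={rows / total_rows:.2%} "
          f"fraud_rate={pos / rows:.3%}")
```

The fraud rate comes out at roughly 0.17% in all three splits, which is what stratifying on `target` is meant to achieve. With only 34 and 35 fraud cases in dev and test, recall estimates on those sets will be coarse: each missed fraud moves recall by about three percentage points.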
