# Classifying Credit Card Fraud

In [21]:
if True:
    # Standard Packages
    import pandas as pd
    import numpy as np
    import markdown
    import os
    import haversine

    # Viz Packages
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Scipy Stats
    import scipy.stats as stats
    from scipy.special import logit, expit

    # Statsmodel Api
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    # SKLearn Modules
    from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, LogisticRegression
    from sklearn.feature_selection import RFE
    from sklearn.preprocessing import StandardScaler, OneHotEncoder, normalize
    from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, log_loss, confusion_matrix, RocCurveDisplay, classification_report, accuracy_score, recall_score, precision_score, f1_score
    from sklearn.preprocessing import PolynomialFeatures, LabelEncoder
    from sklearn.datasets import load_diabetes
    from sklearn.model_selection import train_test_split, cross_validate, KFold, cross_val_score
    from sklearn import datasets

    # Suppress future, deprecation, and SettingWithCopy warnings
    import warnings
    warnings.filterwarnings("ignore", category=FutureWarning)
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    pd.options.mode.chained_assignment = None

    # make all columns in a df viewable
    pd.options.display.max_columns = None
    pd.options.display.width = None

## The Business Problem

We've been hired by **Insert Credit Agency** to create a screener to help protect their clients from potentially fraudulent purchases.

## Data Understanding

The data comes from Kaggle, is synthetically generated, and is already split into training and test sets for us. Synthetic data is useful here because real transaction data is usually encrypted or not available to the public. There are no missing values, so the data should be easier to clean and prepare for analysis.

In [22]:
# Load in Fraud Test and Train
fraudTrain = pd.read_csv('data/fraudTrain.csv')
fraudTest = pd.read_csv('data/fraudTest.csv')
# Concatenate them for sake of EDA
fraudDF = pd.concat([fraudTrain, fraudTest], axis = 0)

In [51]:
fraudDF.head()

Unnamed: 0,full_name_street,trans_date_trans_time,amt,gender,zip,lat,long,city_pop,unix_time,merch_lat,merch_long,is_fraud,food_dining,gas_transport,grocery_net,grocery_pos,health_fitness,home,kids_pets,misc_net,misc_pos,personal_care,shopping_net,shopping_pos,travel,distance_from_home,last_purchase_distance
1296670,Erik Patterson on 162 Jessica Row Apt. 072,2020-06-21 12:12:08,15.56,1,84735,37.7175,-112.4777,258,1371816728,36.841266,-111.690765,0,0,0,0,0,0,0,0,0,0,0,0,0,0,119.752302,
1296671,Jeffrey White on 8617 Holmes Terrace Suite 651,2020-06-21 12:12:19,51.7,1,21790,39.2667,-77.5101,100,1371816739,38.906881,-78.246528,0,1,0,0,0,0,0,0,0,0,0,0,0,0,75.104189,
1296672,Christopher Castaneda on 1632 Cohen Drive Suit...,2020-06-21 12:12:32,105.93,1,88325,32.9396,-105.8189,899,1371816752,33.619513,-105.130529,0,1,0,0,0,0,0,0,0,0,0,0,0,0,99.04787,
1296673,Joseph Murray on 42933 Ryan Underpass,2020-06-21 12:13:36,74.9,1,57756,43.3526,-102.5411,1126,1371816816,42.78894,-103.24116,0,1,0,0,0,0,0,0,0,0,0,0,0,0,84.627769,
1296674,Jeffrey Smith on 135 Joseph Mountains,2020-06-21 12:13:37,4.3,1,59871,45.8433,-113.8748,218,1371816817,46.565983,-114.18611,0,1,0,0,0,0,0,0,0,0,0,0,0,0,83.853771,


In [24]:
fraudDF.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1852394 entries, 0 to 555718
Data columns (total 23 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   Unnamed: 0             int64  
 1   trans_date_trans_time  object 
 2   cc_num                 int64  
 3   merchant               object 
 4   category               object 
 5   amt                    float64
 6   first                  object 
 7   last                   object 
 8   gender                 object 
 9   street                 object 
 10  city                   object 
 11  state                  object 
 12  zip                    int64  
 13  lat                    float64
 14  long                   float64
 15  city_pop               int64  
 16  job                    object 
 17  dob                    object 
 18  trans_num              object 
 19  unix_time              int64  
 20  merch_lat              float64
 21  merch_long             float64
 22  is_fraud           

We have no missing values, as expected for artificially generated data. However, there are some things we need to set up before we can conduct our analysis.
1. Since we will be using Unix time, which is much easier to manipulate, we can drop trans_date_trans_time.
2. Convert the gender column into a boolean.
3. Convert cc_num to a string.
4. Find the best identifier for transaction tracking (name? credit card number? address? some sort of mix?).
5. Drop useless columns such as Unnamed: 0, which does not affect whether or not a purchase is fraudulent.

Further analysis (not included here) suggests that one of the people did not make any purchases with a credit card, so we only have 999 unique people in our dataset.

## Data Preparation

In [25]:
fraudTrain.columns

Index(['Unnamed: 0', 'trans_date_trans_time', 'cc_num', 'merchant', 'category',
       'amt', 'first', 'last', 'gender', 'street', 'city', 'state', 'zip',
       'lat', 'long', 'city_pop', 'job', 'dob', 'trans_num', 'unix_time',
       'merch_lat', 'merch_long', 'is_fraud'],
      dtype='object')

In [26]:
# Drop columns that will not be used as features
dropCols = ['Unnamed: 0', 'merchant', 'dob', 'trans_num', 'job', 'cc_num', 'city', 'state']
for df in (fraudTrain, fraudTest, fraudDF):
    df.drop(columns=dropCols, inplace=True)

In [27]:
fraudDF.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1852394 entries, 0 to 555718
Data columns (total 15 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   trans_date_trans_time  object 
 1   category               object 
 2   amt                    float64
 3   first                  object 
 4   last                   object 
 5   gender                 object 
 6   street                 object 
 7   zip                    int64  
 8   lat                    float64
 9   long                   float64
 10  city_pop               int64  
 11  unix_time              int64  
 12  merch_lat              float64
 13  merch_long             float64
 14  is_fraud               int64  
dtypes: float64(5), int64(4), object(6)
memory usage: 226.1+ MB


In [28]:
# Build one identifier per cardholder from name and street, then drop the parts
for df in (fraudDF, fraudTest, fraudTrain):
    df['full_name_street'] = df['first'] + ' ' + df['last'] + ' on ' + df['street']
    df.drop(columns=['first', 'last', 'street'], inplace=True)

In [29]:
# Move full_name_street to the first column and give each frame a fresh RangeIndex
for df in (fraudTrain, fraudTest, fraudDF):
    df.set_index("full_name_street", inplace=True)
    df.reset_index(inplace=True)

In [30]:
fraudDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1852394 entries, 0 to 1852393
Data columns (total 13 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   full_name_street       object 
 1   trans_date_trans_time  object 
 2   category               object 
 3   amt                    float64
 4   gender                 object 
 5   zip                    int64  
 6   lat                    float64
 7   long                   float64
 8   city_pop               int64  
 9   unix_time              int64  
 10  merch_lat              float64
 11  merch_long             float64
 12  is_fraud               int64  
dtypes: float64(5), int64(4), object(4)
memory usage: 183.7+ MB


Convert trans_date_trans_time to Datetime Object

In [31]:
fraudDF['trans_date_trans_time'] = pd.to_datetime(fraudDF['trans_date_trans_time'])
fraudTrain['trans_date_trans_time'] = pd.to_datetime(fraudTrain['trans_date_trans_time'])
fraudTest['trans_date_trans_time'] = pd.to_datetime(fraudTest['trans_date_trans_time'])

Male or Female Mapping

In [32]:
fraudTrain['gender'][0]

'F'

In [33]:
if fraudTrain['gender'][0] == 'F': # Guard so re-running this cell does not remap already-encoded values
    fraudTrain['gender'] = fraudTrain['gender'].map({'F': 0, 'M': 1})
    fraudTest['gender'] = fraudTest['gender'].map({'F': 0, 'M': 1})
    fraudDF['gender'] = fraudDF['gender'].map({'F': 0, 'M': 1})

One Hot Encoding Category of Purchase

In [34]:
if fraudTrain.columns[2] == 'category':
    categoryOHE = pd.get_dummies(fraudDF['category'], drop_first = True)
    fraudDF = pd.concat([fraudDF, categoryOHE], axis = 1)
    categoryOHETrain = pd.get_dummies(fraudTrain['category'], drop_first = True)
    fraudTrain = pd.concat([fraudTrain, categoryOHETrain], axis = 1)
    categoryOHETest = pd.get_dummies(fraudTest['category'], drop_first = True)
    fraudTest = pd.concat([fraudTest, categoryOHETest], axis = 1)
    fraudTrain.drop(columns=['category'], inplace=True)
    fraudTest.drop(columns=['category'], inplace=True)
    fraudDF.drop(columns=['category'], inplace=True)
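
Because the three frames are encoded separately, the dummy columns only line up if every category shows up in each frame. The sketch below shows one possible guard against a mismatch: encode the training frame, then reindex the test dummies onto the training columns. The frames `train` and `test` are toy stand-ins invented for illustration, not the notebook's data.

```python
import pandas as pd

# Toy stand-ins for the real frames; 'travel' never appears in the test split.
train = pd.DataFrame({"category": ["food_dining", "home", "travel"]})
test = pd.DataFrame({"category": ["home", "food_dining", "home"]})

train_ohe = pd.get_dummies(train["category"], drop_first=True)
test_ohe = pd.get_dummies(test["category"], drop_first=True)

# Force the test dummies onto the training layout: categories unseen in the
# test split become all-zero columns, and test-only columns would be dropped.
test_ohe = test_ohe.reindex(columns=train_ohe.columns, fill_value=0)
```

With the full Kaggle files every category most likely appears in both splits, so this is a safety net rather than a required step.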

In [35]:
fraudTrain.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1296675 entries, 0 to 1296674
Data columns (total 25 columns):
 #   Column                 Non-Null Count    Dtype         
---  ------                 --------------    -----         
 0   full_name_street       1296675 non-null  object        
 1   trans_date_trans_time  1296675 non-null  datetime64[ns]
 2   amt                    1296675 non-null  float64       
 3   gender                 1296675 non-null  int64         
 4   zip                    1296675 non-null  int64         
 5   lat                    1296675 non-null  float64       
 6   long                   1296675 non-null  float64       
 7   city_pop               1296675 non-null  int64         
 8   unix_time              1296675 non-null  int64         
 9   merch_lat              1296675 non-null  float64       
 10  merch_long             1296675 non-null  float64       
 11  is_fraud               1296675 non-null  int64         
 12  food_dining            12966

In [36]:
fraudTest.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 555719 entries, 0 to 555718
Data columns (total 25 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   full_name_street       555719 non-null  object        
 1   trans_date_trans_time  555719 non-null  datetime64[ns]
 2   amt                    555719 non-null  float64       
 3   gender                 555719 non-null  int64         
 4   zip                    555719 non-null  int64         
 5   lat                    555719 non-null  float64       
 6   long                   555719 non-null  float64       
 7   city_pop               555719 non-null  int64         
 8   unix_time              555719 non-null  int64         
 9   merch_lat              555719 non-null  float64       
 10  merch_long             555719 non-null  float64       
 11  is_fraud               555719 non-null  int64         
 12  food_dining            555719 non-null  uint

New Features: Distance from Home, Distance from Last Purchase, Time of Purchase, Time Since Last Purchase

Merchant's Distance From Home

In [37]:
# Haversine Function for Calculating Distance Between Place of Purchase, Home of Customer
def distance(lat1, lon1, lat2, lon2):
    coords1 = (lat1, lon1)
    coords2 = (lat2, lon2)
    return haversine.haversine(coords1, coords2)
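
For reference, here is a minimal standard-library sketch of the great-circle (haversine) formula that a helper like the one above relies on. The function name `haversine_km` and the radius constant are assumptions of this sketch, and the result is in kilometres, which as far as we know is also the default unit of the `haversine` package.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude along the equator
one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)

# Home and merchant coordinates from the first row of the head() output above
first_row = haversine_km(37.7175, -112.4777, 36.841266, -111.690765)
```

The second call should land close to the `distance_from_home` value shown for that row in the `head()` output, which is a quick sanity check on the units.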

In [39]:
fraudDF['distance_from_home'] = fraudDF.apply(lambda row: distance(row['lat'], row['long'], row['merch_lat'], row['merch_long']), axis=1)
fraudTrain['distance_from_home'] = fraudTrain.apply(lambda row: distance(row['lat'], row['long'], row['merch_lat'], row['merch_long']), axis=1)
fraudTest['distance_from_home'] = fraudTest.apply(lambda row: distance(row['lat'], row['long'], row['merch_lat'], row['merch_long']), axis=1)

In [40]:
fraudTrain.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1296675 entries, 0 to 1296674
Data columns (total 26 columns):
 #   Column                 Non-Null Count    Dtype         
---  ------                 --------------    -----         
 0   full_name_street       1296675 non-null  object        
 1   trans_date_trans_time  1296675 non-null  datetime64[ns]
 2   amt                    1296675 non-null  float64       
 3   gender                 1296675 non-null  int64         
 4   zip                    1296675 non-null  int64         
 5   lat                    1296675 non-null  float64       
 6   long                   1296675 non-null  float64       
 7   city_pop               1296675 non-null  int64         
 8   unix_time              1296675 non-null  int64         
 9   merch_lat              1296675 non-null  float64       
 10  merch_long             1296675 non-null  float64       
 11  is_fraud               1296675 non-null  int64         
 12  food_dining            12966

Time of Purchase

In [53]:
fraudTrain['Time'] = fraudTrain['trans_date_trans_time'].dt.time
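
Note that `.dt.time` yields Python `datetime.time` objects, which most estimators cannot consume directly. The sketch below shows one possible numeric encoding: seconds since midnight, plus a sine/cosine pair so that times just before and just after midnight end up close together. The helper names are hypothetical and this is not necessarily the encoding the final model should use.

```python
import math
from datetime import time

SECONDS_PER_DAY = 24 * 60 * 60


def seconds_since_midnight(t):
    """Convert a datetime.time into an integer number of seconds."""
    return t.hour * 3600 + t.minute * 60 + t.second


def cyclical_time_features(t):
    """Map a time of day onto the unit circle so 23:59 and 00:00 are neighbours."""
    angle = 2 * math.pi * seconds_since_midnight(t) / SECONDS_PER_DAY
    return math.sin(angle), math.cos(angle)


midday = seconds_since_midnight(time(12, 12, 8))
sin_t, cos_t = cyclical_time_features(time(6, 0, 0))
```

On the real frame, a similar result could be built from `fraudTrain['trans_date_trans_time'].dt.hour` and its `.dt.minute` and `.dt.second` counterparts.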

In [54]:
fraudTrain.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1296675 entries, 0 to 1296674
Data columns (total 28 columns):
 #   Column                  Non-Null Count    Dtype         
---  ------                  --------------    -----         
 0   full_name_street        1296675 non-null  object        
 1   trans_date_trans_time   1296675 non-null  datetime64[ns]
 2   amt                     1296675 non-null  float64       
 3   gender                  1296675 non-null  int64         
 4   zip                     1296675 non-null  int64         
 5   lat                     1296675 non-null  float64       
 6   long                    1296675 non-null  float64       
 7   city_pop                1296675 non-null  int64         
 8   unix_time               1296675 non-null  int64         
 9   merch_lat               1296675 non-null  float64       
 10  merch_long              1296675 non-null  float64       
 11  is_fraud                1296675 non-null  int64         
 12  food_dining   

Time Since Last Purchase

In [55]:
fraudTrain['TimeSinceLast'] = fraudTrain.groupby(by = 'full_name_street')['unix_time'].diff()

In [58]:
fraudTrain['TimeSinceLast'] = fraudTrain['TimeSinceLast'].fillna(-1)
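
`groupby(...).diff()` compares each row with the previous row of the same group, so it only measures time since the last purchase if the rows are in chronological order. A small sketch on made-up transactions (the frame `tx` and the `is_first_purchase` flag are illustrative assumptions) that sorts first and records which rows received the `-1` placeholder:

```python
import pandas as pd

# Toy transactions for two hypothetical cardholders, deliberately out of order.
tx = pd.DataFrame({
    "full_name_street": ["A", "B", "A", "A", "B"],
    "unix_time": [100, 50, 160, 130, 80],
})

# Sort chronologically within each cardholder so diff() compares neighbours in time.
tx = tx.sort_values(["full_name_street", "unix_time"]).reset_index(drop=True)
tx["TimeSinceLast"] = tx.groupby("full_name_street")["unix_time"].diff()

# Each cardholder's first purchase has no predecessor, so diff() leaves NaN there.
tx["is_first_purchase"] = tx["TimeSinceLast"].isna()
tx["TimeSinceLast"] = tx["TimeSinceLast"].fillna(-1)
```

Keeping a separate flag lets a linear model tell "no previous purchase" apart from "a very short gap", which the `-1` value alone may blur.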

## Modeling

##### The business problem specifies that we are creating a screener, so we want to catch almost all fraudulent purchases and are less concerned about false positives. We will therefore use recall as our scoring metric.

Obviously, if we classified every purchase as fraudulent we would get a recall of 1, but that would be a useless screener, so we have to do more work than that. Let's make a baseline logistic regression model using just the columns we currently have, without taking anything else into account. We expect this to be a bad model, but it is a good place to start.
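
To make the trade-off concrete, here is a small pure-Python sketch with made-up labels (2 fraudulent purchases out of 10, not taken from the dataset). It shows that flagging everything reaches perfect recall while precision collapses.

```python
def recall(y_true, y_pred):
    """Share of actual positives that were flagged: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(y_true, y_pred):
    """Share of flagged items that were actually positive: TP / (TP + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Made-up labels: 2 fraudulent purchases out of 10.
y_true = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
flag_everything = [1] * len(y_true)

always_recall = recall(y_true, flag_everything)        # every fraud is caught
always_precision = precision(y_true, flag_everything)  # but most flags are wrong
```

This is why recall is worth reading alongside precision or the confusion matrix, even when recall is the metric being optimized.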

In [99]:
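# Use only features present in both frames: 'Time' holds datetime.time objects,
# and the time-based features above were only added to fraudTrain.
drop_cols = ['is_fraud', 'trans_date_trans_time', 'full_name_street']
feature_cols = [c for c in fraudTest.columns if c in fraudTrain.columns and c not in drop_cols]
X_train = fraudTrain[feature_cols]
y_train = fraudTrain['is_fraud']
X_test = fraudTest[feature_cols]
y_test = fraudTest['is_fraud']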

In [100]:
BaselineModel = LogisticRegression(random_state = 42, solver = 'saga')
BaselineModel.fit(X_train, y_train)
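
The `saga` solver tends to converge slowly when features sit on very different scales, as `unix_time` and the 0/1 category columns do here. Below is a hedged sketch of one way to scale inside a pipeline and score recall on held-out data. It runs on synthetic data from `make_classification` rather than the fraud frames, and the `max_iter` value is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (about 5% positives); not the fraud dataset.
X, y = make_classification(n_samples=2000, n_features=8,
                           weights=[0.95, 0.05], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Scale the features before fitting so the solver sees comparable magnitudes.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="saga", max_iter=1000, random_state=42),
)
model.fit(X_tr, y_tr)

test_recall = recall_score(y_te, model.predict(X_te))
```

Putting the scaler inside the pipeline means it is fitted on the training split only, which avoids leaking test statistics into the model. Passing `class_weight='balanced'` to `LogisticRegression` is another option worth trying for a recall-focused screener.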



## Evaluation

## Code Quality???