# Classifying Credit Card Fraud

In [71]:
try: # Wrapper to save time when running all cells
    # Standard Packages
    import pandas as pd
    import numpy as np
    import markdown
    import os

    # Viz Packages
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Scipy Stats
    import scipy.stats as stats
    from scipy.special import logit, expit

    # Statsmodel Api
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    # SKLearn Modules
    from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, LogisticRegression
    from sklearn.feature_selection import RFE
    from sklearn.preprocessing import StandardScaler, OneHotEncoder, normalize
    # plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it
    from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, log_loss, confusion_matrix, RocCurveDisplay, classification_report, accuracy_score, recall_score, precision_score, f1_score
    from sklearn.preprocessing import PolynomialFeatures, LabelEncoder
    from sklearn.datasets import load_diabetes
    from sklearn.model_selection import train_test_split, cross_validate, KFold, cross_val_score
    from sklearn import datasets

    # Suppress future, deprecation, and SettingWithCopy warnings
    import warnings
    warnings.filterwarnings("ignore", category= FutureWarning)
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    pd.options.mode.chained_assignment = None

    # make all columns in a df viewable
    pd.options.display.max_columns = None
    pd.options.display.width = None
except ImportError as err:
    # Report the missing package instead of failing silently
    print(f"Import failed: {err}")

## The Business Problem

We've been hired by **Insert Credit Agency** to create a screener to help protect their clients from potentially fraudulent purchases.

## Data Understanding

The data comes from Kaggle, is synthetically generated, and is already split into training and test sets for us. Synthetic data is useful here because real transaction data is usually encrypted or not available to the public. There are no missing values, so the data should be easier to clean and prepare for analysis.
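The no-missing-values claim is cheap to confirm, and because fraud labels are often rare, the class share is worth a look at the same time. The following is a minimal sketch on a toy frame standing in for the Kaggle files; `toy` and its values are invented for illustration and only the column names come from the dataset.

```python
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "amt": [4.97, 107.23, 220.11, 45.00],
    "category": ["misc_net", "grocery_pos", "entertainment", "gas_transport"],
    "is_fraud": [0, 0, 0, 1],
})

# Count missing values per column; all zeros means there is nothing to impute.
missing_per_column = toy.isna().sum()
total_missing = int(missing_per_column.sum())

# Share of each class in the target, to see how imbalanced it is.
class_share = toy["is_fraud"].value_counts(normalize=True)

print(missing_per_column)
print(class_share)
```

The same two calls should work unchanged on `fraudDF` once it is loaded.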

In [72]:
# Load in Fraud Test and Train
fraudTrain = pd.read_csv('data/fraudTrain.csv')
fraudTest = pd.read_csv('data/fraudTest.csv')
# Concatenate them for sake of EDA
fraudDF = pd.concat([fraudTrain, fraudTest], axis = 0)
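One thing to watch with this concatenation: by default `pd.concat` keeps each frame's own row labels, which is why the `info()` output reports 1852394 entries labelled only "0 to 555718". The sketch below, on invented toy frames, shows the duplicate labels and one possible way around them; the `source` column is a hypothetical addition, not something the notebook uses.

```python
import pandas as pd

# Toy stand-ins for fraudTrain and fraudTest; the values are invented.
train = pd.DataFrame({"amt": [4.97, 107.23], "is_fraud": [0, 0]})
test = pd.DataFrame({"amt": [220.11], "is_fraud": [1]})

# Default concat keeps each frame's own labels, so label 0 appears twice.
stacked = pd.concat([train, test], axis=0)
has_duplicate_labels = stacked.index.has_duplicates

# ignore_index=True renumbers the rows 0..n-1. The hypothetical `source`
# column records which file each row came from.
combined = pd.concat(
    [train.assign(source="train"), test.assign(source="test")],
    axis=0,
    ignore_index=True,
)
print(combined)
```

Duplicate labels are harmless for summary statistics but can surprise label-based lookups such as `.loc[0]`, which would return more than one row.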

In [73]:
fraudDF.head()

| | Unnamed: 0 | trans_date_trans_time | cc_num | merchant | category | amt | first | last | gender | street | ... | lat | long | city_pop | job | dob | trans_num | unix_time | merch_lat | merch_long | is_fraud |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2019-01-01 00:00:18 | 2703186189652095 | fraud_Rippin, Kub and Mann | misc_net | 4.97 | Jennifer | Banks | F | 561 Perry Cove | ... | 36.0788 | -81.1781 | 3495 | Psychologist, counselling | 1988-03-09 | 0b242abb623afc578575680df30655b9 | 1325376018 | 36.011293 | -82.048315 | 0 |
| 1 | 1 | 2019-01-01 00:00:44 | 630423337322 | fraud_Heller, Gutmann and Zieme | grocery_pos | 107.23 | Stephanie | Gill | F | 43039 Riley Greens Suite 393 | ... | 48.8878 | -118.2105 | 149 | Special educational needs teacher | 1978-06-21 | 1f76529f8574734946361c461b024d99 | 1325376044 | 49.159047 | -118.186462 | 0 |
| 2 | 2 | 2019-01-01 00:00:51 | 38859492057661 | fraud_Lind-Buckridge | entertainment | 220.11 | Edward | Sanchez | M | 594 White Dale Suite 530 | ... | 42.1808 | -112.262 | 4154 | Nature conservation officer | 1962-01-19 | a1a22d70485983eac12b5b88dad1cf95 | 1325376051 | 43.150704 | -112.154481 | 0 |
| 3 | 3 | 2019-01-01 00:01:16 | 3534093764340240 | fraud_Kutch, Hermiston and Farrell | gas_transport | 45.0 | Jeremy | White | M | 9443 Cynthia Court Apt. 038 | ... | 46.2306 | -112.1138 | 1939 | Patent attorney | 1967-01-12 | 6b849c168bdad6f867558c3793159a81 | 1325376076 | 47.034331 | -112.561071 | 0 |
| 4 | 4 | 2019-01-01 00:03:06 | 375534208663984 | fraud_Keeling-Crist | misc_pos | 41.96 | Tyler | Garcia | M | 408 Bradley Rest | ... | 38.4207 | -79.4629 | 99 | Dance movement psychotherapist | 1986-03-28 | a41d7549acf90789359a9aa5346dcb46 | 1325376186 | 38.674999 | -78.632459 | 0 |


In [74]:
fraudDF.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1852394 entries, 0 to 555718
Data columns (total 23 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   Unnamed: 0             int64  
 1   trans_date_trans_time  object 
 2   cc_num                 int64  
 3   merchant               object 
 4   category               object 
 5   amt                    float64
 6   first                  object 
 7   last                   object 
 8   gender                 object 
 9   street                 object 
 10  city                   object 
 11  state                  object 
 12  zip                    int64  
 13  lat                    float64
 14  long                   float64
 15  city_pop               int64  
 16  job                    object 
 17  dob                    object 
 18  trans_num              object 
 19  unix_time              int64  
 20  merch_lat              float64
 21  merch_long             float64
 22  is_fraud           

As expected for artificially generated data, we have no missing values. However, there are some things we need to set up in order to conduct our analysis:
1. Since we will be using unix time, which is much easier to manipulate, we can drop trans_date_trans_time.
2. Convert the gender column into a boolean.
3. Convert cc_num to a string.
4. Find the best identifier for transaction tracking (Name? Credit card number? Address? Some sort of mix?).
5. Drop columns such as Unnamed: 0 and dob, which we do not expect to affect whether or not a purchase is fraudulent.
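Steps 1 to 3 can be sketched roughly as follows. This is an illustration under assumptions, not the notebook's actual preparation code: the toy rows reuse values from the first and third rows of the `head()` output, and the new column names `is_female` and `trans_hour` are hypothetical.

```python
import pandas as pd

# Toy rows using the dataset's column names; values come from the first
# and third rows of the head() output shown earlier.
toy = pd.DataFrame({
    "cc_num": [2703186189652095, 38859492057661],
    "gender": ["F", "M"],
    "unix_time": [1325376018, 1325376051],
})

# Step 2: encode gender as a boolean (`is_female` is a hypothetical name).
toy["is_female"] = toy["gender"].eq("F")

# Step 3: card numbers are identifiers, not quantities, so store them as text.
toy["cc_num"] = toy["cc_num"].astype(str)

# Step 1: unix_time is seconds since the epoch, so time-of-day features
# can be derived from it (`trans_hour` is a hypothetical name).
toy["trans_hour"] = pd.to_datetime(toy["unix_time"], unit="s").dt.hour

print(toy)
```

One caveat before dropping trans_date_trans_time: 1325376018 decodes to 2012-01-01 00:00:18 UTC, while the matching trans_date_trans_time reads 2019-01-01 00:00:18. The time of day agrees but the year does not, so the two columns may not be interchangeable for calendar-based features.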

Further analysis (not included here) suggests that one of the people did not make any purchases with a credit card, so we only have 999 unique people in our dataset.

## Data Preparation

In [75]:
# Drop blatantly useless columns
fraudDF.drop(['Unnamed: 0','trans_date_trans_time'], axis = 1, inplace = True)

In [76]:
fraudDF['full_name'] = fraudDF['first'] + ' ' + fraudDF['last']

In [77]:
fraudDF['full_name'].nunique()

989

The full name alone does not distinguish every person: we get 989 unique names for 999 people. Concatenating the street as well may separate the 10 people who share a name.
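To see which names are shared, one option is to count distinct streets per name. The sketch below uses invented toy data, and it assumes that two different streets under the same name mean two different people, which may not hold if someone has moved.

```python
import pandas as pd

# Toy data: two different people share the name "Ann Lee". Values are invented.
toy = pd.DataFrame({
    "full_name": ["Ann Lee", "Ann Lee", "Bo Kim", "Bo Kim"],
    "street": ["1 Oak St", "9 Elm Rd", "5 Pine Ave", "5 Pine Ave"],
})

# Count distinct streets per name; more than one suggests a shared name.
streets_per_name = toy.groupby("full_name")["street"].nunique()
shared_names = streets_per_name[streets_per_name > 1].index.tolist()

print(shared_names)
```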

In [78]:
fraudDF['full_name_street'] = fraudDF['full_name'] + ' on ' + fraudDF['street']

In [79]:
fraudDF['full_name_street'].nunique()

999

That worked! Now that we have a working identifier, we can drop all of the columns used to build it (first, last, street), as well as other columns that could be used to identify people, such as cc_num, job, and dob.

We will also apply the manipulations above to the pre-split dataframes, since we will be using those going forward.

In [80]:
fraudTrain['full_name_street'] = fraudTrain['first'] + ' ' + fraudTrain['last'] + ' on ' + fraudTrain['street']
fraudTest['full_name_street'] = fraudTest['first'] + ' ' + fraudTest['last'] + ' on ' + fraudTest['street']

In [82]:
fraudTrain.drop(['first','last','street','cc_num','job','dob'], axis = 1, inplace=True)
fraudTest.drop(['first','last','street','cc_num','job','dob'], axis = 1, inplace=True)
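Because the same steps are applied to both frames, they could be wrapped in one function so the train and test sets cannot drift apart. The helper below is a hypothetical sketch, not part of the notebook: `add_identifier`, `ID_SOURCE_COLS`, and the `extra_drop` parameter are names made up for illustration.

```python
import pandas as pd

# Columns that are combined into the identifier and then dropped.
ID_SOURCE_COLS = ["first", "last", "street"]


def add_identifier(df: pd.DataFrame, extra_drop=()) -> pd.DataFrame:
    """Return a copy with `full_name_street` added and its source columns dropped."""
    out = df.copy()
    out["full_name_street"] = (
        out["first"] + " " + out["last"] + " on " + out["street"]
    )
    return out.drop(columns=ID_SOURCE_COLS + list(extra_drop))


# Toy row using values from the head() output shown earlier.
toy = pd.DataFrame({
    "first": ["Jennifer"],
    "last": ["Banks"],
    "street": ["561 Perry Cove"],
    "job": ["Psychologist, counselling"],
    "amt": [4.97],
})
prepared = add_identifier(toy, extra_drop=["job"])
print(prepared)
```

Returning a copy instead of using `inplace=True` leaves the original frame untouched, which makes the cell safe to re-run.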

Let's also make our identifier the first column in our dataframes

In [83]:
fraudTrain.set_index("full_name_street", inplace=True)
fraudTest.set_index("full_name_street", inplace=True)

In [84]:
fraudTrain.reset_index(inplace=True)
fraudTest.reset_index(inplace=True)
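The `set_index` / `reset_index` pair works because `reset_index` puts the former index back as the first column. An alternative sketch that moves the column directly, shown on an invented one-row frame, uses `pop` and `insert`:

```python
import pandas as pd

# Toy frame with the identifier as the last column; values are illustrative.
toy = pd.DataFrame({
    "amt": [4.97],
    "is_fraud": [0],
    "full_name_street": ["Jennifer Banks on 561 Perry Cove"],
})

# pop removes the column and returns it; insert then places it at position 0.
toy.insert(0, "full_name_street", toy.pop("full_name_street"))

print(toy.columns.tolist())
```

Unlike the index round trip, this leaves the existing row index as it is.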

In [86]:
fraudTrain.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1296675 entries, 0 to 1296674
Data columns (total 18 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   full_name_street       1296675 non-null  object 
 1   Unnamed: 0             1296675 non-null  int64  
 2   trans_date_trans_time  1296675 non-null  object 
 3   merchant               1296675 non-null  object 
 4   category               1296675 non-null  object 
 5   amt                    1296675 non-null  float64
 6   gender                 1296675 non-null  object 
 7   city                   1296675 non-null  object 
 8   state                  1296675 non-null  object 
 9   zip                    1296675 non-null  int64  
 10  lat                    1296675 non-null  float64
 11  long                   1296675 non-null  float64
 12  city_pop               1296675 non-null  int64  
 13  trans_num              1296675 non-null  object 
 14  unix_time         

## Modeling

## Evaluation

## Code Quality???