# Project_M3_Finley_Daniel

# Section 0 - Application and data resource description

### Objective
The goal of this application is to explore the card identity data and develop a model that predicts fraud based on the data. 

### Stakeholders
The stakeholders in this case are both the cardholders and the credit card companies. Cardholders benefit from an accurate model because fraudulent activity is flagged and addressed quickly. The company benefits by offering its customers a strong model and by reducing the complaints and fraud claims raised against it.

### Reference
https://www.kaggle.com/c/ieee-fraud-detection

# Section 1 - Initial Set Up, Data Import, and Inspection

In [2]:
# suppress the display of warning messages
import warnings    
warnings.filterwarnings('ignore')

## Install and import packages

Install the imbalanced-learn package using one of the options described at https://imbalanced-learn.readthedocs.io/en/stable/install.html. The installation only needs to be done once and can be run as a shell command from the notebook; see https://colab.research.google.com/notebooks/snippets/importing_libraries.ipynb for how to do this in Colab.
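Before running the install command, it can be useful to check whether the package is already available. The snippet below is a minimal sketch of such a check; the helper name `is_installed` is hypothetical and not part of the original notebook. It relies only on the standard library.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# imbalanced-learn is imported under the module name `imblearn`
if not is_installed("imblearn"):
    print("imbalanced-learn not found; install it once from a notebook cell "
          "with: !pip install -U imbalanced-learn")
```

The check avoids reinstalling the package on every run of the notebook.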


In [3]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
import sklearn as sk

from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier 
from sklearn.tree import export_text

from sklearn.model_selection import train_test_split, cross_validate,\
GridSearchCV, cross_val_score, KFold, ParameterGrid

from sklearn import metrics
from sklearn.metrics import make_scorer
from sklearn.metrics import precision_recall_fscore_support,\
accuracy_score, recall_score, precision_score, f1_score,\
confusion_matrix, classification_report

# from sklearn.naive_bayes import GaussianNB
# from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC, LinearSVC
# from sklearn.neighbors import KNeighborsClassifier
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn import ensemble
from sklearn.ensemble import RandomForestClassifier,\
BaggingClassifier, AdaBoostClassifier

In [5]:
# Import libraries for additional metrics, plots, and GridSearchCV
# Use help() or search for these APIs in https://scikit-learn.org/stable/
# from sklearn import metrics
from sklearn.metrics import roc_curve, plot_roc_curve, roc_auc_score,\
  plot_precision_recall_curve, precision_recall_curve, average_precision_score,\
  balanced_accuracy_score

from sklearn.model_selection import GridSearchCV, cross_val_score, ParameterGrid
# If interested, explore and compare RandomizedSearchCV as an alternative to GridSearchCV
# from scipy.stats import uniform
# from sklearn.model_selection import RandomizedSearchCV


In [6]:
# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>
# License: MIT

from imblearn.ensemble import (BalanceCascade,
                               BalancedBaggingClassifier,
                               BalancedRandomForestClassifier,
                               EasyEnsemble,
                               EasyEnsembleClassifier,
                               RUSBoostClassifier)


# from imblearn.pipeline import make_pipeline 
from imblearn.over_sampling import (RandomOverSampler, ADASYN, 
                                    SMOTE, BorderlineSMOTE, SVMSMOTE)
from imblearn.under_sampling import (RandomUnderSampler,
                                     ClusterCentroids,
                                     NearMiss,
                                     InstanceHardnessThreshold,
                                     CondensedNearestNeighbour,
                                     EditedNearestNeighbours,
                                     RepeatedEditedNearestNeighbours,
                                     AllKNN,
                                     NeighbourhoodCleaningRule,
                                     OneSidedSelection)
from imblearn.combine import (SMOTEENN, SMOTETomek)

In [7]:
# Method for anomaly detection
from sklearn import svm
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
# from sklearn import neighbors
# from sklearn.neighbors import LocalOutlierFactor

In [8]:
# Install memory_profiler to monitor memory usage
!pip install -U memory_profiler

Collecting memory_profiler
  Downloading memory_profiler-0.58.0.tar.gz (36 kB)
Building wheels for collected packages: memory-profiler
  Building wheel for memory-profiler (setup.py) ... done
  Created wheel for memory-profiler: filename=memory_profiler-0.58.0-py3-none-any.whl size=30189 sha256=9ae335155bf7a9135d2690f8aa723c65d4533c5869c5e62ce0ec6bdc43044ec2
  Stored in directory: /root/.cache/pip/wheels/56/19/d5/8cad06661aec65a04a0d6785b1a5ad035cb645b1772a4a0882
Successfully built memory-profiler
Installing collected packages: memory-profiler
Successfully installed memory-profiler-0.58.0


In [9]:
import memory_profiler
import time
m1 = memory_profiler.memory_usage()
t1 = time.process_time()  # time.clock() was removed in Python 3.8; process_time() also reports CPU time
print(f' memory_usage: {m1}\n time.clock:{t1}\n')

 memory_usage: [158.921875]
 time.clock:2.336929



## Load and inspect train_identity

The data is split into two datasets: transaction information and information about the identity of the customer. Not every transaction has identity information available. It may be possible to use the additional identity information to generate new features.
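The relationship between the two tables can be sketched on toy data. The frames `toy_trans` and `toy_id` below are made-up stand-ins, not the competition files; the sketch assumes that both tables share a `TransactionID` key, as the merge later in this notebook does. The `indicator=True` option adds a `_merge` column that shows which transactions found a matching identity row.

```python
import pandas as pd

# Toy stand-ins for train_transaction and train_identity (values are made up)
toy_trans = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4],
    "TransactionAmt": [68.5, 29.0, 59.0, 50.0],
})
toy_id = pd.DataFrame({
    "TransactionID": [2, 4],
    "DeviceType": ["mobile", "desktop"],
})

# A left join on the key keeps every transaction; rows without identity get NaN
toy_merged = toy_trans.merge(toy_id, how="left", on="TransactionID", indicator=True)
has_identity = toy_merged["_merge"] == "both"

print(toy_merged)
print(f"share of transactions with identity data: {has_identity.mean():.2f}")
```

Joining on the key rather than on row position matters here because the identity table has fewer rows than the transaction table.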

In [10]:
# Retrieve csv file from google drive by mapping the folder from google drive. 
# Must be done each time session expires.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [21]:
#folder_path = '/content/drive/My Drive/Data sets/IEEE credit card fraud/ieee-fraud-detection/'
train_id = pd.read_csv("/content/drive/My Drive/Colab Notebooks/train_identity.csv")

#transaction data
train_trans = pd.read_csv("/content/drive/My Drive/train_transaction.csv")

In [25]:

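# Left-join identity onto transactions using the TransactionID key only.
# Do not combine `on` with left_index/right_index: the two request different
# alignments (key vs. row position).
df_train = train_trans.merge(train_id, how='left', on='TransactionID')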


print(df_train.shape)

# y_train = df_train['isFraud'].copy()
#del df_trans, df_id, df_test_trans, df_test_id

(590540, 434)


In [26]:
df_train.info

<bound method DataFrame.info of         TransactionID  isFraud  ...  DeviceType                     DeviceInfo
0             2987000        0  ...      mobile  SAMSUNG SM-G892A Build/NRD90M
1             2987001        0  ...      mobile                     iOS Device
2             2987002        0  ...     desktop                        Windows
3             2987003        0  ...     desktop                            NaN
4             2987004        0  ...     desktop                          MacOS
...               ...      ...  ...         ...                            ...
590535        3577535        0  ...         NaN                            NaN
590536        3577536        0  ...         NaN                            NaN
590537        3577537        0  ...         NaN                            NaN
590538        3577538        0  ...         NaN                            NaN
590539        3577539        0  ...         NaN                            NaN

[590540 rows x 434 columns]>

In [27]:
df_train.dtypes

TransactionID       int64
isFraud             int64
TransactionDT       int64
TransactionAmt    float64
ProductCD          object
                   ...   
id_36              object
id_37              object
id_38              object
DeviceType         object
DeviceInfo         object
Length: 434, dtype: object

In [28]:
df_train.columns

Index(['TransactionID', 'isFraud', 'TransactionDT', 'TransactionAmt',
       'ProductCD', 'card1', 'card2', 'card3', 'card4', 'card5',
       ...
       'id_31', 'id_32', 'id_33', 'id_34', 'id_35', 'id_36', 'id_37', 'id_38',
       'DeviceType', 'DeviceInfo'],
      dtype='object', length=434)

In [29]:
df_train.head(40)

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,card6,addr1,addr2,dist1,dist2,P_emaildomain,R_emaildomain,C1,C2,C3,C4,C5,C6,C7,C8,C9,C10,C11,C12,C13,C14,D1,D2,D3,D4,D5,D6,D7,D8,D9,...,id_01,id_02,id_03,id_04,id_05,id_06,id_07,id_08,id_09,id_10,id_11,id_12,id_13,id_14,id_15,id_16,id_17,id_18,id_19,id_20,id_21,id_22,id_23,id_24,id_25,id_26,id_27,id_28,id_29,id_30,id_31,id_32,id_33,id_34,id_35,id_36,id_37,id_38,DeviceType,DeviceInfo
0,2987000,0,86400,68.5,W,13926,,150.0,discover,142.0,credit,315.0,87.0,19.0,,,,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,2.0,0.0,1.0,1.0,14.0,,13.0,,,,,,,...,0.0,70787.0,,,,,,,,,100.0,NotFound,,-480.0,New,NotFound,166.0,,542.0,144.0,,,,,,,,New,NotFound,Android 7.0,samsung browser 6.2,32.0,2220x1080,match_status:2,T,F,T,T,mobile,SAMSUNG SM-G892A Build/NRD90M
1,2987001,0,86401,29.0,W,2755,404.0,150.0,mastercard,102.0,credit,325.0,87.0,,,gmail.com,,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,,,0.0,,,,,,...,-5.0,98945.0,,,0.0,-5.0,,,,,100.0,NotFound,49.0,-300.0,New,NotFound,166.0,,621.0,500.0,,,,,,,,New,NotFound,iOS 11.1.2,mobile safari 11.0,32.0,1334x750,match_status:1,T,F,F,T,mobile,iOS Device
2,2987002,0,86469,59.0,W,4663,490.0,150.0,visa,166.0,debit,330.0,87.0,287.0,,outlook.com,,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,,,0.0,,,,,,...,-5.0,191631.0,0.0,0.0,0.0,0.0,,,0.0,0.0,100.0,NotFound,52.0,,Found,Found,121.0,,410.0,142.0,,,,,,,,Found,Found,,chrome 62.0,,,,F,F,T,T,desktop,Windows
3,2987003,0,86499,50.0,W,18132,567.0,150.0,mastercard,117.0,debit,476.0,87.0,,,yahoo.com,,2.0,5.0,0.0,0.0,0.0,4.0,0.0,0.0,1.0,0.0,1.0,0.0,25.0,1.0,112.0,112.0,0.0,94.0,0.0,,,,,...,-5.0,221832.0,,,0.0,-6.0,,,,,100.0,NotFound,52.0,,New,NotFound,225.0,,176.0,507.0,,,,,,,,New,NotFound,,chrome 62.0,,,,F,F,T,T,desktop,
4,2987004,0,86506,50.0,H,4497,514.0,150.0,mastercard,102.0,credit,420.0,87.0,,,gmail.com,,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,,,,,,,,,...,0.0,7460.0,0.0,0.0,1.0,0.0,,,0.0,0.0,100.0,NotFound,,-300.0,Found,Found,166.0,15.0,529.0,575.0,,,,,,,,Found,Found,Mac OS X 10_11_6,chrome 62.0,24.0,1280x800,match_status:2,T,F,T,T,desktop,MacOS
5,2987005,0,86510,49.0,W,5937,555.0,150.0,visa,226.0,debit,272.0,87.0,36.0,,gmail.com,,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,,,0.0,,,,,,...,-5.0,61141.0,3.0,0.0,3.0,0.0,,,3.0,0.0,100.0,NotFound,52.0,-300.0,Found,Found,166.0,18.0,529.0,600.0,,,,,,,,Found,Found,Windows 10,chrome 62.0,24.0,1366x768,match_status:2,T,F,T,T,desktop,Windows
6,2987006,0,86522,159.0,W,12308,360.0,150.0,visa,166.0,debit,126.0,87.0,0.0,,yahoo.com,,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,,,0.0,,,,,,...,-15.0,,,,,,,,,,,NotFound,14.0,,,,,,,,,,,,,,,,,,,,,,,,,,,
7,2987007,0,86529,422.5,W,12695,490.0,150.0,visa,226.0,debit,325.0,87.0,,,mail.com,,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,,,0.0,,,,,,...,0.0,31964.0,0.0,0.0,0.0,-10.0,,,0.0,0.0,100.0,Found,,-300.0,Found,Found,166.0,15.0,352.0,533.0,,,,,,,,Found,Found,Android,chrome 62.0,32.0,1920x1080,match_status:2,T,F,T,T,mobile,
8,2987008,0,86535,15.0,H,2803,100.0,150.0,visa,226.0,debit,337.0,87.0,,,anonymous.com,,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,,,,,,,,,...,-10.0,116098.0,0.0,0.0,0.0,0.0,,,0.0,0.0,100.0,NotFound,52.0,,Found,Found,121.0,,410.0,142.0,,,,,,,,Found,Found,,chrome 62.0,,,,F,F,T,T,desktop,Windows
9,2987009,0,86536,117.0,W,17399,111.0,150.0,mastercard,224.0,debit,204.0,87.0,19.0,,yahoo.com,,2.0,2.0,0.0,0.0,0.0,3.0,0.0,0.0,3.0,0.0,1.0,0.0,12.0,2.0,61.0,61.0,30.0,318.0,30.0,,,,,...,-5.0,257037.0,,,0.0,0.0,,,,,100.0,NotFound,52.0,,New,NotFound,225.0,,484.0,507.0,,,,,,,,New,NotFound,,chrome 62.0,,,,F,F,T,T,desktop,Windows


In [30]:
len(df_train.TransactionID.unique())

590540

In [31]:
df_train.value_counts('isFraud', normalize=True) 

isFraud
0    0.96501
1    0.03499
dtype: float64

In [32]:
df_train['TransactionDT'].quantile(np.linspace(0,1,11))

0.0       86400.0
0.1     1361004.4
0.2     2310159.6
0.3     3864163.9
0.4     5592303.6
0.5     7306527.5
0.6     8745782.4
0.7    10437998.1
0.8    12192853.6
0.9    13990907.7
1.0    15811131.0
Name: TransactionDT, dtype: float64

In [33]:
df_train['TransactionID'].quantile(np.linspace(0,1,11))

0.0    2987000.0
0.1    3046053.9
0.2    3105107.8
0.3    3164161.7
0.4    3223215.6
0.5    3282269.5
0.6    3341323.4
0.7    3400377.3
0.8    3459431.2
0.9    3518485.1
1.0    3577539.0
Name: TransactionID, dtype: float64

In [34]:
# Note: the printed value is the fraction of missing values (0 to 1), not a percentage
for i in df_train.columns:
  print(f'{i} has {df_train[i].isnull().sum()/df_train.shape[0]} % of NaN\n')

TransactionID has 0.0 % of NaN

isFraud has 0.0 % of NaN

TransactionDT has 0.0 % of NaN

TransactionAmt has 0.0 % of NaN

ProductCD has 0.0 % of NaN

card1 has 0.0 % of NaN

card2 has 0.015126833068039422 % of NaN

card3 has 0.0026501168422122124 % of NaN

card4 has 0.00267043722694483 % of NaN

card5 has 0.007212043214684865 % of NaN

card6 has 0.0026602770345785214 % of NaN

addr1 has 0.1112642666034477 % of NaN

addr2 has 0.1112642666034477 % of NaN

dist1 has 0.596523520845328 % of NaN

dist2 has 0.9362837403054831 % of NaN

P_emaildomain has 0.1599485216920107 % of NaN

R_emaildomain has 0.7675161716395164 % of NaN

C1 has 0.0 % of NaN

C2 has 0.0 % of NaN

C3 has 0.0 % of NaN

C4 has 0.0 % of NaN

C5 has 0.0 % of NaN

C6 has 0.0 % of NaN

C7 has 0.0 % of NaN

C8 has 0.0 % of NaN

C9 has 0.0 % of NaN

C10 has 0.0 % of NaN

C11 has 0.0 % of NaN

C12 has 0.0 % of NaN

C13 has 0.0 % of NaN

C14 has 0.0 % of NaN

D1 has 0.0021488806854743116 % of NaN

D2 has 0.4754919226470688 % of NaN
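The per-column loop above can also be written as a single vectorized call. The sketch below uses a small made-up frame named `toy` (not the real data) and reports the result as a true percentage, sorted so the sparsest columns come first.

```python
import numpy as np
import pandas as pd

# Made-up frame with different amounts of missing data per column
toy = pd.DataFrame({
    "card1": [1, 2, 3, 4],
    "card2": [1.0, np.nan, 3.0, 4.0],
    "dist2": [np.nan, np.nan, np.nan, 7.0],
})

# isnull().mean() gives the fraction of missing values per column (0 to 1)
missing_fraction = toy.isnull().mean()
missing_percent = (100 * missing_fraction).sort_values(ascending=False)

print(missing_percent.round(2))
```

Sorting the result makes it easier to pick a missingness threshold than scanning several hundred printed lines.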

# Section 2 - Feature reduction, missing data handling, and data exploration

## Reduction

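Reduce the data to 50 columns (the target isFraud plus 49 predictors) by dropping the ID, timestamp, and e-mail domain columns and keeping only the columns whose fraction of missing values is below 0.0005.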

In [35]:
# create trans to store columns with low missingness
trans = pd.DataFrame()
# drop columns that would generate a large number of dummy variables
features = list(df_train.columns)
to_remove = ['TransactionID','TransactionDT','R_emaildomain','P_emaildomain']
for i in to_remove:
  features.remove(i)
# drop columns with high missingness
for i in features:
  if df_train[i].isnull().sum()/df_train.shape[0] < 0.0005:
    trans[i] = df_train[i].copy()
print(trans.shape[1],'\n', trans.columns)

50 
 Index(['isFraud', 'TransactionAmt', 'ProductCD', 'card1', 'C1', 'C2', 'C3',
       'C4', 'C5', 'C6', 'C7', 'C8', 'C9', 'C10', 'C11', 'C12', 'C13', 'C14',
       'V279', 'V280', 'V284', 'V285', 'V286', 'V287', 'V290', 'V291', 'V292',
       'V293', 'V294', 'V295', 'V297', 'V298', 'V299', 'V302', 'V303', 'V304',
       'V305', 'V306', 'V307', 'V308', 'V309', 'V310', 'V311', 'V312', 'V316',
       'V317', 'V318', 'V319', 'V320', 'V321'],
      dtype='object')
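The same selection can be expressed without an explicit loop. The following is a sketch only: the helper `select_low_missing` and the toy frame are hypothetical, and the threshold of 0.5 is chosen for the toy data rather than taken from the cell above.

```python
import numpy as np
import pandas as pd


def select_low_missing(df, max_missing_fraction, drop_cols=()):
    """Keep the columns whose fraction of missing values is below the threshold."""
    candidates = df.drop(columns=list(drop_cols))
    keep = candidates.isnull().mean() < max_missing_fraction
    return candidates.loc[:, keep].copy()


# Made-up frame: dist2 is mostly missing, TransactionID is an identifier
toy = pd.DataFrame({
    "TransactionID": [10, 11, 12, 13],
    "isFraud": [0, 0, 1, 0],
    "C1": [1.0, 2.0, 1.0, 1.0],
    "dist2": [np.nan, np.nan, np.nan, 7.0],
})

reduced = select_low_missing(toy, max_missing_fraction=0.5,
                             drop_cols=["TransactionID"])
print(list(reduced.columns))
```

Selecting all columns at once also avoids growing a DataFrame one column at a time.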


In [36]:
trans.info

<bound method DataFrame.info of         isFraud  TransactionAmt ProductCD  ...        V319        V320        V321
0             0           68.50         W  ...    0.000000    0.000000    0.000000
1             0           29.00         W  ...    0.000000    0.000000    0.000000
2             0           59.00         W  ...    0.000000    0.000000    0.000000
3             0           50.00         W  ...    0.000000    0.000000    0.000000
4             0           50.00         H  ...    0.000000    0.000000    0.000000
...         ...             ...       ...  ...         ...         ...         ...
590535        0           49.00         W  ...    0.000000    0.000000    0.000000
590536        0           39.50         W  ...    0.000000    0.000000    0.000000
590537        0           30.95         W  ...    0.000000    0.000000    0.000000
590538        0          117.00         W  ...    0.000000    0.000000    0.000000
590539        0          279.95         W  ...  279.950

590540 rows x 394 columns to 590540 rows x 50 columns

## Missing data

In [37]:
# Encode categorical variables and fill missing values.
# Categorical NaN is replaced with cat_c (e.g., 'none'); numeric NaN is replaced
# with the column mean, the column median, or the constant num_c.
def fun_fillna_enc(df_name, df, y_col, cat_c, num_fillna, num_c):
  # Separate target and features
  df_target = df[y_col].copy().reset_index(drop=True)
  df_features = df.drop(y_col, axis=1)
  # Separate numeric and categorical predictors
  df_num_features = df_features.select_dtypes(exclude=['object']).reset_index(drop=True)
  df_cat_features = df_features.select_dtypes(include=['object'])
  # Impute categorical NaN with cat_c, e.g., cat_c = 'none'
  df_cat_features = df_cat_features.fillna(cat_c)
  # num_fillna = 'a': replace numeric NaN with the column mean
  # num_fillna = 'm': replace numeric NaN with the column median
  # num_fillna = 'c': replace numeric NaN with the constant num_c, e.g., num_c = -99
  if num_fillna == 'c':
    df_num_features = df_num_features.fillna(num_c)
  elif num_fillna == 'a':
    df_num_features = df_num_features.fillna(df_num_features.mean())
  elif num_fillna == 'm':
    df_num_features = df_num_features.fillna(df_num_features.median())
  else:
    print('Invalid num_fillna input\n')

  # Create a transformer object and fit it to the categorical features
  enc = OneHotEncoder(dtype=np.int8)
  enc_f = enc.fit(df_cat_features)
  # Print dummy variable names. Original categorical names are lost.
  # (scikit-learn 1.0+ provides get_feature_names_out() in place of get_feature_names())
  print(f'Dummy variable names for {df_name}\n')
  print(enc_f.get_feature_names(),'\n')
  mat = enc_f.transform(df_cat_features)
  # mat is a scipy sparse matrix. Convert it to a pandas sparse dataframe
  df_cat_enc = pd.DataFrame.sparse.from_spmatrix(mat).reset_index(drop=True)
  # Join the sparse dataframe of dummy variables with the dataframe of numeric features.
  # Both frames have a reset index, so rows align by position.
  df_features_enc = pd.concat([df_cat_enc,df_num_features],axis=1)
  print(f'The encoded {df_name} has {df_features_enc.shape[0]} rows, {df_features_enc.shape[1]} columns\n')
  return df_features_enc, df_target
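As an alternative to a hand-written function, the same two steps (fill missing values, then one-hot encode) can be sketched with scikit-learn's `ColumnTransformer`. This is not the approach used in this notebook; the toy frame `toy_X` and the step names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up features: one categorical and one numeric column, each with a gap
toy_X = pd.DataFrame({
    "ProductCD": pd.Series(["W", "H", np.nan, "W"], dtype=object),
    "TransactionAmt": [68.5, np.nan, 59.0, 50.0],
})
cat_cols = ["ProductCD"]
num_cols = ["TransactionAmt"]

preprocess = ColumnTransformer(
    transformers=[
        # Categorical: fill NaN with the label 'none', then one-hot encode
        ("cat", Pipeline([
            ("fill", SimpleImputer(strategy="constant", fill_value="none")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), cat_cols),
        # Numeric: fill NaN with the column mean
        ("num", SimpleImputer(strategy="mean"), num_cols),
    ],
    sparse_threshold=0,  # always return a dense array for this small example
)

toy_encoded = preprocess.fit_transform(toy_X)
print(toy_encoded.shape)
```

A fitted transformer of this kind can be reused on a test set, so the imputation values learned from the training data are applied consistently.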

In [38]:
t2 = time.process_time()  # replaces time.clock(), which was removed in Python 3.8
m2 = memory_profiler.memory_usage()
time_diff = round(t2 - t1,6)
mem_diff = round(m2[0] - m1[0],6)
print(f"It took {time_diff} Secs to run the code above and consumed {mem_diff} Mb additional memory")

It took 61.153502 Secs to run the code above and consumed 10107.710938 Mb additional memory
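The start/stop bookkeeping used in these cells can be wrapped in a small context manager. The sketch below is an assumption-laden alternative, not the notebook's method: the name `track` is hypothetical, and it uses the standard library's `tracemalloc`, which reports memory traced by the Python allocator rather than the process-level figure that memory_profiler reports.

```python
import time
import tracemalloc
from contextlib import contextmanager


@contextmanager
def track(label):
    """Print elapsed wall-clock time and peak traced memory for a code block."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        print(f"{label}: {elapsed:.3f} s, peak traced memory {peak / 1024**2:.2f} MiB")


# Example usage on a small workload
with track("build list"):
    squares = [n * n for n in range(100_000)]
```

Scoping the measurement to a `with` block avoids accidentally including earlier cells, such as data loading, in the reported time.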


In [39]:
print(f'memory_usage().sum() shows:  {round(df_train.memory_usage().sum()/1024**2,4)} MB in memory usage')

memory_usage().sum() shows:  1955.3709 MB in memory usage


In [40]:
import sys

In [41]:
# Track memory and time consumption
m1 = memory_profiler.memory_usage()
t1 = time.process_time()  # replaces time.clock(), which was removed in Python 3.8
print(f' memory_usage: {m1}\n time.clock:{t1}\n')
# encode and replace missing values
trans_nafill_enc, trans_y = fun_fillna_enc('trans', trans,'isFraud','none','a', -99)
#
t2 = time.process_time()
m2 = memory_profiler.memory_usage()
time_diff = round(t2 - t1,6)
mem_diff = round(m2[0] - m1[0],6)
print(f"It took {time_diff} Secs and {mem_diff} Mb to encode trans")

 memory_usage: [10266.6328125]
 time.clock:63.527456

Dummy variable names for trans

['x0_C' 'x0_H' 'x0_R' 'x0_S' 'x0_W'] 

The encoded trans has 590540 rows, 53 columns

It took 21.392491 Secs and -813.460938 Mb to encode trans


In [42]:
trans_nafill_enc.head()

Unnamed: 0,0,1,2,3,4,TransactionAmt,card1,C1,C2,C3,C4,C5,C6,C7,C8,C9,C10,C11,C12,C13,C14,V279,V280,V284,V285,V286,V287,V290,V291,V292,V293,V294,V295,V297,V298,V299,V302,V303,V304,V305,V306,V307,V308,V309,V310,V311,V312,V316,V317,V318,V319,V320,V321
0,0,0,0,0,1,68.5,13926,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,2.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,117.0,0.0,0.0,0.0,0.0,0.0,0.0,117.0,0.0,0.0,0.0,0.0
1,0,0,0,0,1,29.0,2755,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0,0,0,0,1,59.0,4663,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0,0,0,0,1,50.0,18132,2.0,5.0,0.0,0.0,0.0,4.0,0.0,0.0,1.0,0.0,1.0,0.0,25.0,1.0,1.0,28.0,0.0,10.0,0.0,4.0,1.0,1.0,1.0,1.0,38.0,24.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,50.0,1758.0,925.0,0.0,354.0,0.0,135.0,50.0,1404.0,790.0,0.0,0.0,0.0
4,0,1,0,0,0,50.0,4497,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [43]:
# Before handling missing data

for i in trans.columns:
  print(f'{i} has {df_train[i].isnull().sum()/df_train.shape[0]} % of NaN\n')

isFraud has 0.0 % of NaN

TransactionAmt has 0.0 % of NaN

ProductCD has 0.0 % of NaN

card1 has 0.0 % of NaN

C1 has 0.0 % of NaN

C2 has 0.0 % of NaN

C3 has 0.0 % of NaN

C4 has 0.0 % of NaN

C5 has 0.0 % of NaN

C6 has 0.0 % of NaN

C7 has 0.0 % of NaN

C8 has 0.0 % of NaN

C9 has 0.0 % of NaN

C10 has 0.0 % of NaN

C11 has 0.0 % of NaN

C12 has 0.0 % of NaN

C13 has 0.0 % of NaN

C14 has 0.0 % of NaN

V279 has 2.0320384732617603e-05 % of NaN

V280 has 2.0320384732617603e-05 % of NaN

V284 has 2.0320384732617603e-05 % of NaN

V285 has 2.0320384732617603e-05 % of NaN

V286 has 2.0320384732617603e-05 % of NaN

V287 has 2.0320384732617603e-05 % of NaN

V290 has 2.0320384732617603e-05 % of NaN

V291 has 2.0320384732617603e-05 % of NaN

V292 has 2.0320384732617603e-05 % of NaN

V293 has 2.0320384732617603e-05 % of NaN

V294 has 2.0320384732617603e-05 % of NaN

V295 has 2.0320384732617603e-05 % of NaN

V297 has 2.0320384732617603e-05 % of NaN

V298 has 2.0320384732617603e-05 % of NaN

V2

In [44]:
# After handling missing data

for i in trans_nafill_enc.columns:
  print(f'{i} has {trans_nafill_enc[i].isnull().sum()/trans_nafill_enc.shape[0]} % of NaN\n')

0 has 0.0 % of NaN

1 has 0.0 % of NaN

2 has 0.0 % of NaN

3 has 0.0 % of NaN

4 has 0.0 % of NaN

TransactionAmt has 0.0 % of NaN

card1 has 0.0 % of NaN

C1 has 0.0 % of NaN

C2 has 0.0 % of NaN

C3 has 0.0 % of NaN

C4 has 0.0 % of NaN

C5 has 0.0 % of NaN

C6 has 0.0 % of NaN

C7 has 0.0 % of NaN

C8 has 0.0 % of NaN

C9 has 0.0 % of NaN

C10 has 0.0 % of NaN

C11 has 0.0 % of NaN

C12 has 0.0 % of NaN

C13 has 0.0 % of NaN

C14 has 0.0 % of NaN

V279 has 0.0 % of NaN

V280 has 0.0 % of NaN

V284 has 0.0 % of NaN

V285 has 0.0 % of NaN

V286 has 0.0 % of NaN

V287 has 0.0 % of NaN

V290 has 0.0 % of NaN

V291 has 0.0 % of NaN

V292 has 0.0 % of NaN

V293 has 0.0 % of NaN

V294 has 0.0 % of NaN

V295 has 0.0 % of NaN

V297 has 0.0 % of NaN

V298 has 0.0 % of NaN

V299 has 0.0 % of NaN

V302 has 0.0 % of NaN

V303 has 0.0 % of NaN

V304 has 0.0 % of NaN

V305 has 0.0 % of NaN

V306 has 0.0 % of NaN

V307 has 0.0 % of NaN

V308 has 0.0 % of NaN

V309 has 0.0 % of NaN

V310 has 0.0 % of NaN

# Section 3 - Model fitting and feature importance



In [45]:
DTC_ent4 = DecisionTreeClassifier(criterion='entropy',max_depth=4,random_state=42)
DTC_ent8 = DecisionTreeClassifier(criterion='entropy',max_depth=8,random_state=42)
lr_lbfgs = LogisticRegression(random_state=42) 
rf_max3 = RandomForestClassifier(max_depth=3,random_state=42)

In [46]:
clf = rf_max3
clf_f = clf.fit(trans_nafill_enc,trans_y)
pred = clf_f.predict(trans_nafill_enc)
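The cell above fits and predicts on the same rows, so the resulting scores describe the training fit only. A hedged sketch of a held-out evaluation follows; it uses synthetic data from `make_classification` as a stand-in for the fraud data, and the split size and variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the fraud data (roughly 5% positives)
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)

# stratify keeps the class ratio nearly identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

# Fit on the training split only, then score on rows the model has not seen
rf_toy = RandomForestClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)
pred_te = rf_toy.predict(X_te)
print(classification_report(y_te, pred_te, zero_division=0))
```

With a rare positive class, stratifying the split helps ensure the test set contains enough fraud cases to estimate recall.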

In [47]:
print(clf,'\n confusion matrix:\n',metrics.confusion_matrix(trans_y,pred),'\n, classification report:\n',classification_report(trans_y, pred),'\n')

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=3, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False) 
 confusion matrix:
 [[569711    166]
 [ 19027   1636]] 
, classification report:
               precision    recall  f1-score   support

           0       0.97      1.00      0.98    569877
           1       0.91      0.08      0.15     20663

    accuracy                           0.97    590540
   macro avg       0.94      0.54      0.56    590540
weighted avg       0.97      0.97      0.95    590540
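
The class-1 figures in the report can be reproduced by hand from the confusion matrix printed above, which also shows why accuracy alone is misleading for this data. The snippet is a worked calculation using the four counts from that matrix; the variable names are chosen here for illustration.

```python
# Counts from the confusion matrix printed above for RandomForestClassifier(max_depth=3).
# Rows are true classes (0, 1); columns are predicted classes (0, 1).
tn, fp = 569711, 166
fn, tp = 19027, 1636
total = tn + fp + fn + tp

precision = tp / (tp + fp)   # share of flagged transactions that are fraud
recall = tp / (tp + fn)      # share of fraud transactions that are flagged
accuracy = (tp + tn) / total
# Accuracy of a model that always predicts "not fraud"
majority_baseline = (tn + fp) / total

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"accuracy={accuracy:.4f} baseline={majority_baseline:.4f}")
```

The accuracy is only slightly above the always-negative baseline, while recall shows that most fraud cases are missed.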
 



In [48]:
clf.feature_importances_

array([2.56847933e-02, 3.26818076e-04, 7.18699662e-04, 1.01965806e-03,
       2.06407806e-02, 1.50487005e-03, 5.02710155e-04, 9.30200495e-02,
       3.12105916e-02, 0.00000000e+00, 8.70841453e-02, 8.75043098e-03,
       2.62542438e-02, 1.28911202e-01, 5.41967646e-02, 4.45769246e-03,
       5.70093210e-02, 6.37217000e-02, 1.12330098e-01, 7.02840195e-02,
       5.83050051e-02, 2.21203609e-03, 2.33964248e-03, 4.01790799e-04,
       5.33621655e-03, 3.02401580e-05, 6.53707933e-03, 2.25528901e-05,
       9.07935221e-05, 8.16410357e-04, 9.47012555e-04, 1.10463064e-02,
       2.20468991e-02, 0.00000000e+00, 2.50340086e-04, 0.00000000e+00,
       6.43398517e-03, 7.58468423e-03, 1.99987458e-02, 0.00000000e+00,
       4.25861209e-03, 1.56388015e-03, 2.82423353e-03, 1.36556535e-03,
       1.80541301e-02, 0.00000000e+00, 2.10171740e-03, 5.93556768e-03,
       2.16336233e-02, 9.98196859e-03, 0.00000000e+00, 4.14940881e-05,
       2.10878028e-04])

In [49]:
rf_feature_importances = pd.DataFrame(clf.feature_importances_,
    index = trans_nafill_enc.columns,
     columns=['importance']).sort_values('importance',ascending=False)

In [50]:
rf_feature_importances.head(40)

Unnamed: 0,importance
C7,0.128911
C12,0.11233
C1,0.09302
C4,0.087084
C13,0.070284
C11,0.063722
C14,0.058305
C10,0.057009
C8,0.054197
C2,0.031211


In [51]:
clf = DTC_ent4
clf_f = clf.fit(trans_nafill_enc,trans_y)
pred = clf_f.predict(trans_nafill_enc)

In [52]:
print(clf,'\n confusion matrix:\n',metrics.confusion_matrix(trans_y,pred),'\n, classification report:\n',classification_report(trans_y, pred),'\n')

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=4, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=42, splitter='best') 
 confusion matrix:
 [[568879    998]
 [ 16324   4339]] 
, classification report:
               precision    recall  f1-score   support

           0       0.97      1.00      0.99    569877
           1       0.81      0.21      0.33     20663

    accuracy                           0.97    590540
   macro avg       0.89      0.60      0.66    590540
weighted avg       0.97      0.97      0.96    590540
 



In [53]:
clf.feature_importances_

array([0.02750414, 0.        , 0.        , 0.        , 0.00736919,
       0.        , 0.        , 0.21374198, 0.        , 0.        ,
       0.29290029, 0.03191981, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.01223606, 0.10264581,
       0.17184324, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.0159523 , 0.        , 0.00269799,
       0.        , 0.        , 0.        , 0.1211892 , 0.        ,
       0.        , 0.        , 0.        ])

In [54]:
DTC_feature_importances = pd.DataFrame(clf.feature_importances_,
    index = trans_nafill_enc.columns,
     columns=['importance']).sort_values('importance',ascending=False)

In [55]:
DTC_feature_importances.head(40)

Unnamed: 0,importance
C4,0.2929
C1,0.213742
C14,0.171843
V317,0.121189
C13,0.102646
C5,0.03192
0,0.027504
V308,0.015952
C12,0.012236
4,0.007369


In [56]:
print(f'DTC_ent4 Textual Rules for trans_nafill_enc:\n',export_text(clf_f, feature_names=list(trans_nafill_enc.columns)))

DTC_ent4 Textual Rules for trans_nafill_enc:
 |--- C4 <= 1.50
|   |--- C14 <= 0.50
|   |   |--- C1 <= 2.50
|   |   |   |--- 0 <= 0.50
|   |   |   |   |--- class: 0
|   |   |   |--- 0 >  0.50
|   |   |   |   |--- class: 0
|   |   |--- C1 >  2.50
|   |   |   |--- 4 <= 0.50
|   |   |   |   |--- class: 1
|   |   |   |--- 4 >  0.50
|   |   |   |   |--- class: 0
|   |--- C14 >  0.50
|   |   |--- V317 <= 7.86
|   |   |   |--- C5 <= 0.50
|   |   |   |   |--- class: 0
|   |   |   |--- C5 >  0.50
|   |   |   |   |--- class: 0
|   |   |--- V317 >  7.86
|   |   |   |--- V308 <= 314.96
|   |   |   |   |--- class: 0
|   |   |   |--- V308 >  314.96
|   |   |   |   |--- class: 0
|--- C4 >  1.50
|   |--- C1 <= 5.50
|   |   |--- C13 <= 1.50
|   |   |   |--- C1 <= 2.50
|   |   |   |   |--- class: 0
|   |   |   |--- C1 >  2.50
|   |   |   |   |--- class: 0
|   |   |--- C13 >  1.50
|   |   |   |--- C12 <= 13.50
|   |   |   |   |--- class: 0
|   |   |   |--- C12 >  13.50
|   |   |   |   |--- class: 1
|   |-

In [57]:
clf = DTC_ent8
clf_f = clf.fit(trans_nafill_enc,trans_y)
pred = clf_f.predict(trans_nafill_enc)

In [58]:
print(clf,'\n confusion matrix:\n',metrics.confusion_matrix(trans_y,pred),'\n, classification report:\n',classification_report(trans_y, pred),'\n')

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=8, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=42, splitter='best') 
 confusion matrix:
 [[568415   1462]
 [ 14131   6532]] 
, classification report:
               precision    recall  f1-score   support

           0       0.98      1.00      0.99    569877
           1       0.82      0.32      0.46     20663

    accuracy                           0.97    590540
   macro avg       0.90      0.66      0.72    590540
weighted avg       0.97      0.97      0.97    590540
 



In [59]:
clf.feature_importances_

array([3.22389178e-02, 4.24667129e-03, 3.78332398e-03, 1.11376564e-03,
       5.48007095e-03, 2.15548877e-02, 1.47869710e-02, 1.82863043e-01,
       1.37299130e-02, 0.00000000e+00, 2.24330055e-01, 3.32094502e-02,
       1.21163773e-02, 2.03035695e-03, 8.67441880e-03, 3.04179134e-03,
       3.95734361e-03, 1.78014061e-02, 1.57505334e-03, 1.06992524e-01,
       1.43023637e-01, 1.05350982e-03, 1.08393022e-04, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 3.56522313e-04, 0.00000000e+00,
       3.97572222e-03, 0.00000000e+00, 0.00000000e+00, 8.47133180e-03,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 7.52949757e-04, 0.00000000e+00, 0.00000000e+00,
       4.41718701e-03, 1.03814312e-03, 2.68639805e-02, 5.80675732e-04,
       1.21057091e-02, 0.00000000e+00, 9.91964522e-03, 0.00000000e+00,
       9.07397590e-02, 7.15630169e-04, 4.75495295e-05, 2.09068341e-03,
       2.12629909e-04])

In [60]:
DTC_feature_importances = pd.DataFrame(clf.feature_importances_,
    index = trans_nafill_enc.columns,
     columns=['importance']).sort_values('importance',ascending=False)

In [61]:
DTC_feature_importances.head(40)

Unnamed: 0,importance
C4,0.22433
C1,0.182863
C14,0.143024
C13,0.106993
V317,0.09074
C5,0.033209
0,0.032239
V308,0.026864
TransactionAmt,0.021555
C11,0.017801


In [62]:
print(f'DTC_ent8 Textual Rules for trans_nafill_enc:\n',export_text(clf_f, feature_names=list(trans_nafill_enc.columns)))

DTC_ent8 Textual Rules for trans_nafill_enc:
 |--- C4 <= 1.50
|   |--- C14 <= 0.50
|   |   |--- C1 <= 2.50
|   |   |   |--- 0 <= 0.50
|   |   |   |   |--- C5 <= 0.50
|   |   |   |   |   |--- C2 <= 1.50
|   |   |   |   |   |   |--- TransactionAmt <= 149.75
|   |   |   |   |   |   |   |--- C4 <= 0.50
|   |   |   |   |   |   |   |   |--- class: 0
|   |   |   |   |   |   |   |--- C4 >  0.50
|   |   |   |   |   |   |   |   |--- class: 0
|   |   |   |   |   |   |--- TransactionAmt >  149.75
|   |   |   |   |   |   |   |--- 1 <= 0.50
|   |   |   |   |   |   |   |   |--- class: 0
|   |   |   |   |   |   |   |--- 1 >  0.50
|   |   |   |   |   |   |   |   |--- class: 0
|   |   |   |   |   |--- C2 >  1.50
|   |   |   |   |   |   |--- V291 <= 1.50
|   |   |   |   |   |   |   |--- C4 <= 0.50
|   |   |   |   |   |   |   |   |--- class: 0
|   |   |   |   |   |   |   |--- C4 >  0.50
|   |   |   |   |   |   |   |   |--- class: 1
|   |   |   |   |   |   |--- V291 >  1.50
|   |   |   |   |   |   |   |---

In [None]:
clf = lr_lbfgs
clf_f = clf.fit(trans_nafill_enc,trans_y)
pred = clf_f.predict(trans_nafill_enc)
print(f'model by {clf}\n')
# Print model info
print(f'model intercept:{clf_f.intercept_}\n')
print(f'model coefficients:{clf_f.coef_}\n')

In [None]:
pd.DataFrame(trans_nafill_enc.columns)

In [None]:
print(clf,'\n confusion matrix:\n',metrics.confusion_matrix(trans_y,pred),'\n, classification report:\n',classification_report(trans_y, pred),'\n')

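Top predictors (five highest importances per model):

RF vars (rf_max3)
C7
C12
C1
C4
C13

DTC 1 vars (DTC_ent4)
C4
C1
C14
V317
C13

DTC 2 vars (DTC_ent8)
C4
C1
C14
C13
V317

Overall top 10: C1, C4, C7, C11, C12, C13, C14, V317, C5, 0

Here 0 is the first one-hot dummy column, which corresponds to ProductCD = C (x0_C in the encoder output).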
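
One simple way to compare the three rankings is to count how often each feature appears across the per-model lists. The sketch below uses the five features listed for each model; counting appearances is an illustrative choice and not necessarily how the overall list was assembled.

```python
from collections import Counter

# Top five features per model, as listed in this section
top5 = {
    "rf_max3": ["C7", "C12", "C1", "C4", "C13"],
    "DTC_ent4": ["C4", "C1", "C14", "V317", "C13"],
    "DTC_ent8": ["C4", "C1", "C14", "C13", "V317"],
}

# Count in how many of the three lists each feature appears
appearances = Counter(feat for feats in top5.values() for feat in feats)

for feat, n_models in appearances.most_common():
    print(f"{feat}: appears in {n_models} of {len(top5)} lists")
```

Features that appear in every list (C1, C4, C13) are the ones all three models agree on.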


# Generate HTML File


In [None]:
# Run the following commands in your notebook to generate an HTML file
# of the notebook with nbconvert.
# First, use a shell command (prefixed with !) to copy the notebook from the
# "Colab Notebooks" folder to the local working directory (./).

!cp "/content/drive/My Drive/Colab Notebooks/Project_M3_Finley_Daniel.ipynb" ./

# run the second shell command, jupyter nbconvert --to html "file name of the notebook"
# create html from ipynb

!jupyter nbconvert --to html "Project_M3_Finley_Daniel.ipynb"