# Introduction

It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.
# Content


The dataset contains transactions made by credit cards in September 2013 by European cardholders.
This dataset presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
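As a quick check of the imbalance figure, the fraud share can be recomputed from the two counts quoted above. This is just arithmetic on those counts, not a read of the CSV.

```python
# Class balance implied by the counts quoted in the dataset description.
n_transactions = 284_807
n_frauds = 492

fraud_share = n_frauds / n_transactions
legit_per_fraud = (n_transactions - n_frauds) / n_frauds

print(f"fraud share: {fraud_share:.3%}")
print(f"legitimate transactions per fraud: {legit_per_fraud:.0f}")
```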


It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA. The only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; it can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable: it takes the value 1 in case of fraud and 0 otherwise.
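One way to read "example-dependent cost-sensitive learning" is that each row gets its own misclassification cost. The sketch below is an assumption about how that could look, not something this notebook does later: a missed fraud is charged its `Amount`, and every other row gets a flat cost. The helper `example_dependent_weights` and the `review_cost` value are hypothetical, and the toy rows are made up.

```python
import numpy as np

def example_dependent_weights(amount, is_fraud, review_cost=1.0):
    """Per-row weights: fraud rows weigh their amount, other rows a flat cost.

    `review_cost` is a hypothetical flat cost, not a value from the dataset.
    """
    amount = np.asarray(amount, dtype=float)
    is_fraud = np.asarray(is_fraud).astype(bool)
    return np.where(is_fraud, amount, review_cost)

# Toy rows (made-up values, not taken from creditcard.csv).
amount = [149.62, 2.69, 378.66, 25.00]
is_fraud = [0, 0, 1, 1]

weights = example_dependent_weights(amount, is_fraud)
print(weights)
```

An array like this could be passed as `sample_weight` to the `fit` method of estimators that accept it. It should be built from the raw `Amount`, before any scaling is applied.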


# About
In this notebook we will cover:
- Why accuracy sucks in this case
- Random undersampling (Random Forest, cross-validation: 93.0% (std 3.0) f1_micro, 87.0% (std 5.0) recall)
- SMOTETomek (Random Forest, train/test split: 80% recall, 74% F1 for Class = 1)

### If you found this helpful please upvote

# Imports

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import RobustScaler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.feature_selection import mutual_info_classif, SelectPercentile
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split, cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier

# plot_confusion_matrix is not imported: it is unused here and was removed in scikit-learn 1.2
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score, recall_score, precision_score, make_scorer

from sklearn.compose import ColumnTransformer
# from imblearn.pipeline import Pipeline
# from imblearn import FunctionSampler
from sklearn.pipeline import Pipeline

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


/kaggle/input/creditcardfraud/creditcard.csv


# Pandas Helper functions

In [2]:
!pip install pandas_flavor

Collecting pandas_flavor
  Downloading pandas_flavor-0.2.0-py2.py3-none-any.whl (6.6 kB)
Installing collected packages: pandas-flavor
Successfully installed pandas-flavor-0.2.0


In [3]:
from pandas_flavor import register_dataframe_method,register_series_method
from IPython.display import display, HTML

@register_dataframe_method
def get_missing(df):        
    tmp =  sorted(
                [(col , str(df[col].dtypes) ,df[col].isna().sum(), np.round( df[col].isna().sum() / len(df) * 100,2) ) for col in df.columns if df[col].isna().sum() !=0 ],
                key = lambda x: x[2], reverse=True)
    
    return pd.DataFrame(tmp).rename({0:"Feature", 1:"dtype", 2:"count", 3:"percent"},axis=1)  

@register_dataframe_method
def get_numeric_df(df):
    return df.select_dtypes(np.number)

@register_dataframe_method
def get_numeric_cols(df):
    return list(df.select_dtypes(np.number).columns)

@register_dataframe_method
def get_object_cols(df):
    return list(df.select_dtypes(exclude = np.number).columns)

@register_dataframe_method
def get_object_df(df):
    return df.select_dtypes(exclude = np.number)

@register_dataframe_method
def get_discrete_cols(df,thresold):
#     thresold in number of unique values
    return [feature for feature in df.columns if len(df[feature].unique()) < thresold]

@register_dataframe_method
def get_discrete_df(df,thresold):
#     thresold in number of unique values
    return df[ get_discrete_cols(df=df,thresold=thresold) ]

@register_dataframe_method
def describe_discrete_cols(df,thresold, ascending=True):
    
    values = pd.DataFrame()
    
    for col in df.get_discrete_cols(thresold=thresold):
        values[col] = [df[col].unique(), df[col].nunique()]
        
    return values.transpose().sort_values(by = 1,ascending=ascending).rename({0:"Values",1:"cardinality"},axis=1)

@register_dataframe_method
def get_continuous_cols(df,thresold):
    #     thresold in number of unique values
    return [feature for feature in df.columns if len(df[feature].unique()) >= thresold]

@register_dataframe_method
def get_continuous_df(df,thresold):
    #     thresold in number of unique values
    return df[ get_continuous_cols(df=df,thresold=thresold) ]


@register_dataframe_method
def describe_continuous_cols(df,thresold, ascending=True):
    return df[df.get_continuous_cols(thresold=thresold)].describe().T

@register_dataframe_method
def dtypes_of_cols(df):
    return pd.DataFrame(df.dtypes).reset_index().rename(columns={'index':"Columns",0: "dtype"})


@register_series_method
def IQR_range(df):
    if isinstance(df, pd.Series):
        Q3 = np.quantile(df, 0.75)
        Q1 = np.quantile(df, 0.25)
        IQR = Q3 - Q1

        lower_range = Q1 - 1.5 * IQR
        upper_range = Q3 + 1.5 * IQR

        return (lower_range,upper_range)
    else:
        assert False, "df must be of type pandas.Series"
        
@register_dataframe_method
def IQR_range(df):
    if isinstance(df, pd.DataFrame):
        cols = df.get_numeric_cols()
        features = {}
        for i in cols:
            Q3 = np.quantile(df[i], 0.75)
            Q1 = np.quantile(df[i], 0.25)
            IQR = Q3 - Q1

            lower_range = Q1 - 1.5 * IQR
            upper_range = Q3 + 1.5 * IQR


            features[i] = (lower_range,upper_range)
            
        return pd.DataFrame.from_dict(features,orient='index').rename({0: 'IQR_Low',1: 'IQR_High'}, axis=1)
    else:
        assert False, "df must be of type pandas.DataFrame"
        
@register_series_method
def IQR_percent(df):
    if isinstance(df, pd.Series):
        
        lower_range, upper_range = df.IQR_range()

        length = len(df)
        return np.round((length - df.between(lower_range,upper_range).sum())/length * 100, 2)
    else:
        assert False, "df must be of type pandas.Series"

@register_dataframe_method
def IQR_percent(df):
    if isinstance(df, pd.DataFrame):
        cols = df.get_numeric_cols()
        features = {}
        for i in cols:
            lower_range, upper_range = df[i].IQR_range()
#             length - Number of NON outliers
            length = len(df[i])
            outlier_count = length - df[i].between(lower_range,upper_range).sum()
            
            percent = np.round( outlier_count /length * 100, 2)
            if outlier_count != 0:
                features[i] = [percent, outlier_count]
#             features[i] = IQR_percent(df[i])
            
        return pd.DataFrame.from_dict(features,orient='index').rename({0: 'Outlier percent', 1:"Count"}, axis=1)
    else:
        assert False, "df must be of type pandas.DataFrame"

@register_dataframe_method
def get_outlier_cols(df):
    return df.IQR_percent().reset_index()["index"].to_list()
        
@register_dataframe_method
def drop_row_outlier(df, cols, inplace=False):
#     init empty index
    indices = pd.Series(np.zeros(len(df), dtype=bool), index=df.index)

    for col in cols:
        low, top = df[col].IQR_range()
        indices |= (df[col] > top) | (df[col] < low)
        
    
    return df.drop(df[ indices ].index, inplace=inplace)

@register_series_method
def drop_row_outlier(df, inplace=False):
#     init empty index

    low, top = df.IQR_range()
    indices = (df > top) | (df < low)
        
    
    return df.drop(df[ indices ].index, inplace=inplace)
        
@register_dataframe_method
def compare_cols(df,l_feat,r_feat, percent=False, percent_of_total=False):
    
#     [L_feat] {R_feat1: agg1, R_feat2: agg2}

    
    if percent or percent_of_total:
        
        comp = []
        for key, val in r_feat.items():
            tmp = pd.DataFrame()
            tmp[key + " " + val] =  df.groupby(l_feat,sort=True).agg({key: val})
            
            if percent: tmp[key +" %"] = tmp.groupby(level=0).apply(lambda x: np.round(100 * x / float(x.sum()),2))

            if percent_of_total: tmp[key+" % of total"] = np.round(tmp[key + " " + val] / tmp[key + " " + val].sum() * 100 , 2)
            
            comp.append(tmp)
            
        return comp
    
    else:
        comp = []
        for key, val in r_feat.items():
            tmp = pd.DataFrame()
            tmp[key + " " + val] =  df.groupby(l_feat,sort=True).agg({key: val})           
            comp.append(tmp)
            
        return comp  
    
    

@register_dataframe_method
def count_dtypes(df, ascending=False):
    return pd.DataFrame(df.dtypes.value_counts(ascending=ascending)).rename({0:"Count"},axis=1)

@register_dataframe_method
def about(df):

    display(HTML('<h1 style="color:green"> <b> Shape of data </b> </h1>'))
    print(df.shape)    

    display(HTML('<h1 style="color:green"> <b> Datatypes in data </b> </h1> '))
    display(pd.DataFrame(df.dtypes.value_counts(ascending=False) ).rename({0:"count"},axis=1))

    display(HTML('<h1 style="color:green"> <b> dtypes of columns </b> </h1> '))
    display(df.dtypes_of_cols())

    display(HTML('<h1 style="color:green"> <b> Percentage of missing values </b> </h1> '))
    tmp = get_missing(df)
    display(tmp) if len(tmp) != 0 else display(HTML("<h2> <b> None </b> </h2>"))

    display(HTML('<h1 style="color:green"> <b> Data description </b> </h1> '))
    display(df.describe().T)
    
    display(HTML('<h1 style="color:green"> <b> Outlier Percentage(IQR) </b> </h1> '))
    tmp = df.IQR_percent()
    display(tmp) if len(tmp) != 0 else display(HTML("<h2> <b> None </b> </h2>"))

    display(HTML('<h1 style="color:green"> <b> Example of data </b> </h1> '))
    display(df.head())
    
    
import itertools
def display_multiple_tables(table_list):
    table_list = list(itertools.chain(*table_list) )
    return HTML(
        '<table><tr style="background-color:white;">' + 
        ''.join(['<td>' + table._repr_html_() + '</td>' for table in table_list]) +
        '</tr></table>')


# Importing Data

In [4]:
df = pd.read_csv("../input/creditcardfraud/creditcard.csv")
df.about()

(284807, 31)


Unnamed: 0,count
float64,30
int64,1


Unnamed: 0,Columns,dtype
0,Time,float64
1,V1,float64
2,V2,float64
3,V3,float64
4,V4,float64
5,V5,float64
6,V6,float64
7,V7,float64
8,V8,float64
9,V9,float64


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Time,284807.0,94813.86,47488.145955,0.0,54201.5,84692.0,139320.5,172792.0
V1,284807.0,3.918649e-15,1.958696,-56.40751,-0.920373,0.018109,1.315642,2.45493
V2,284807.0,5.682686e-16,1.651309,-72.715728,-0.59855,0.065486,0.803724,22.057729
V3,284807.0,-8.761736e-15,1.516255,-48.325589,-0.890365,0.179846,1.027196,9.382558
V4,284807.0,2.811118e-15,1.415869,-5.683171,-0.84864,-0.019847,0.743341,16.875344
V5,284807.0,-1.552103e-15,1.380247,-113.743307,-0.691597,-0.054336,0.611926,34.801666
V6,284807.0,2.04013e-15,1.332271,-26.160506,-0.768296,-0.274187,0.398565,73.301626
V7,284807.0,-1.698953e-15,1.237094,-43.557242,-0.554076,0.040103,0.570436,120.589494
V8,284807.0,-1.893285e-16,1.194353,-73.216718,-0.20863,0.022358,0.327346,20.007208
V9,284807.0,-3.14764e-15,1.098632,-13.434066,-0.643098,-0.051429,0.597139,15.594995


Unnamed: 0,Outlier percent,Count
V1,2.48,7062
V2,4.75,13526
V3,1.18,3363
V4,3.91,11148
V5,4.32,12295
V6,8.06,22965
V7,3.14,8948
V8,8.47,24134
V9,2.91,8283
V10,3.33,9496


Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


# Distribution of Features

In [5]:
import altair as alt
alt.data_transformers.enable('json')

DataTransformerRegistry.enable('json')

In [6]:
alt.Chart(df).mark_bar().encode(
    alt.X("Time",  bin=alt.Bin(maxbins=30)),
    alt.Y('count()',),
    tooltip="count()"
).properties(
    width=900,
    height=400
).interactive()

In [7]:
alt.Chart(df).mark_bar(opacity=0.8).encode(
    alt.X("Time",  bin=alt.Bin(maxbins=30)),
    alt.Y('count()', scale=alt.Scale(type='log'), ),
    column="Class",
    tooltip="count()"
).properties(
    width=700,
    height=400
).interactive()

In [8]:
alt.Chart(df).mark_bar().encode(
    alt.X("Amount",  bin=alt.Bin(maxbins=30), ),
    alt.Y("count()", scale=alt.Scale(type='log')),
    tooltip="count()"
).properties(
    width=800,
    height=400
).interactive()



In [9]:
alt.Chart(df).mark_bar().encode(
    alt.X("Amount",  bin=alt.Bin(maxbins=30), ),
    alt.Y("count()", scale=alt.Scale(type='log'),),
    column="Class",
    tooltip="count()"
).properties(
    width=800,
    height=400
).interactive()



In [10]:
alt.Chart(df).mark_bar().encode(
    alt.X("Class:O",  ),
    alt.Y("count()", scale=alt.Scale(type='log')),
    tooltip="count()"
).properties(
    width=500,
    height=400
).interactive()

# Feature scaling

In [11]:
r_scaler = RobustScaler()

df["Amount"] = r_scaler.fit_transform(df["Amount"].values.reshape(-1,1))
df["Time"] = r_scaler.fit_transform(df["Time"].values.reshape(-1,1))
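With its default settings, `RobustScaler` subtracts the median and divides by the interquartile range, which is why a few extreme amounts do not dominate the scale. A minimal sketch on made-up values, comparing the manual formula with the scaler:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy amounts with one extreme value (made-up numbers).
amounts = np.array([1.0, 2.0, 3.0, 4.0, 1000.0]).reshape(-1, 1)

# Manual version: centre on the median, scale by the IQR (Q3 - Q1).
median = np.median(amounts)
q1, q3 = np.percentile(amounts, [25, 75])
manual = (amounts - median) / (q3 - q1)

# Same transformation through scikit-learn with default parameters.
scaled = RobustScaler().fit_transform(amounts)

print(np.allclose(manual, scaled))
```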

# Super secret model 

#### **Don't unfold the code just yet!!**

**Try to see what's the issue here**

In [12]:
class super_model:
    def train(self, X, y):
        # "Predict" class 0 (not fraud) for every row, ignoring X entirely.
        return np.zeros(len(y), dtype=int)
    

In [13]:
model = super_model()

X = df.drop("Class", axis=1)
y = df["Class"]

y_trained = model.train(X=X, y=y)

# Fraction of rows labelled correctly.
accuracy_score(y, y_trained)

That's one hell of an accuracy!

Seems too good to be true!?

What do you think is the issue here?

Now you could unfold the code!

Notice the issue? Comment if you did!
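To make the issue concrete, here is a small standalone sketch (not the notebook's hidden cell). It builds labels with the same class counts as the dataset and scores a "model" that never predicts fraud. Accuracy looks excellent, while recall and F1 for the fraud class are zero.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Labels with the same class counts as the dataset: 492 frauds in 284,807 rows.
y_true = np.zeros(284_807, dtype=int)
y_true[:492] = 1

# A "model" that never predicts fraud.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"accuracy={acc:.4f} recall={rec:.2f} f1={f1:.2f}")
```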

# Random Under Sampler

In [14]:
X = df.drop("Class", axis=1)
y = df["Class"]

In [15]:
rus = RandomUnderSampler(random_state=42)
x_tmp, y_tmp = rus.fit_resample(X, y)

print(np.round(y_tmp.value_counts()/len(y_tmp) * 100, 2).to_dict())
print(len(y_tmp))

undersampled_df = pd.concat([x_tmp, y_tmp], axis=1)
undersampled_df["Class"].value_counts()

{0: 50.0, 1: 50.0}
984


0    492
1    492
Name: Class, dtype: int64
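Conceptually, random undersampling keeps every minority row and draws an equally sized random sample from the majority class. The sketch below illustrates that idea with NumPy only. It is a simplified illustration on toy labels, not imblearn's implementation, and `random_undersample` is a hypothetical helper.

```python
import numpy as np

def random_undersample(y, seed=42):
    """Return sorted row indices with the same number of rows per class.

    Each class is sampled without replacement down to the size of the
    smallest class, so every minority row is kept.
    """
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    kept = [
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ]
    return np.sort(np.concatenate(kept))

# Toy labels: 95 legitimate rows, 5 frauds.
y_toy = np.array([0] * 95 + [1] * 5)
idx = random_undersample(y_toy)
print(np.bincount(y_toy[idx]))
```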

In [16]:
X = undersampled_df.drop("Class", axis=1)
y = undersampled_df["Class"]



cols = X.columns[SelectPercentile(mutual_info_classif, percentile=25).fit(X,y).get_support()]
print(f"columns are {list(cols)}")
X_new = X[cols]
X_new.head()

undersampled_df = pd.concat([X_new, y], axis=1)


print(f"Treating outliers for {cols}")
print(f"length of data before {len(undersampled_df)}")

undersampled_df.drop_row_outlier(cols= cols, inplace=True)

print(f"length of data After {len(undersampled_df)}")

columns are ['V3', 'V4', 'V10', 'V11', 'V12', 'V14', 'V16', 'V17']
Treating outliers for Index(['V3', 'V4', 'V10', 'V11', 'V12', 'V14', 'V16', 'V17'], dtype='object')
length of data before 984
length of data After 867
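The outlier step relies on the IQR rule from the helper functions: a value is an outlier if it falls below Q1 - 1.5·IQR or above Q3 + 1.5·IQR, and a row is dropped if any selected column is outside its fences. A minimal sketch of the rule on a made-up series (`iqr_fences` is a hypothetical stand-in for the `IQR_range` helper):

```python
import pandas as pd

def iqr_fences(s, k=1.5):
    """Lower and upper fences of the IQR rule for a numeric Series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy values with one obvious outlier.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
low, high = iqr_fences(s)

# Keep only the values inside the fences.
kept = s[s.between(low, high)]
print(low, high, len(kept))
```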


https://www.youtube.com/watch?v=DF-rJA-eOUQ

In [17]:
classifiers = {
    "LogisiticRegression": LogisticRegression(),
    "KNearest": KNeighborsClassifier(),
#     "Support Vector Classifier": SVC(),
    "RandomForestClassifier": RandomForestClassifier(class_weight=dict({0:1,1:100})),
    "LGBMClassifier":LGBMClassifier()
}

X = undersampled_df.drop("Class",axis=1)
y = undersampled_df["Class"]


scoring = {'accuracy' : make_scorer(accuracy_score), 
           'precision' : make_scorer(precision_score),
           'recall' : make_scorer(recall_score), 
           'f1_micro' : make_scorer(f1_score, average="micro")}


for key, classifier in classifiers.items():

    training_score = cross_validate(classifier, X, y, cv=10, scoring=scoring, n_jobs=-1 )
    for score in scoring.keys():
        print(f'''{classifier.__class__.__name__ :<25} {round(training_score["test_"+score].mean(), 2) * 100} % ({round(training_score["test_"+score].std(), 2) * 100}){score}''')

    print("\n")


LogisticRegression        93.0 % (4.0)accuracy
LogisticRegression        97.0 % (3.0)precision
LogisticRegression        87.0 % (7.000000000000001)recall
LogisticRegression        93.0 % (4.0)f1_micro


KNeighborsClassifier      93.0 % (3.0)accuracy
KNeighborsClassifier      97.0 % (3.0)precision
KNeighborsClassifier      86.0 % (7.000000000000001)recall
KNeighborsClassifier      93.0 % (3.0)f1_micro


RandomForestClassifier    93.0 % (4.0)accuracy
RandomForestClassifier    96.0 % (4.0)precision
RandomForestClassifier    86.0 % (6.0)recall
RandomForestClassifier    93.0 % (4.0)f1_micro


LGBMClassifier            93.0 % (4.0)accuracy
LGBMClassifier            95.0 % (4.0)precision
LGBMClassifier            88.0 % (7.000000000000001)recall
LGBMClassifier            93.0 % (4.0)f1_micro
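Note that the `accuracy` and `f1_micro` rows above are identical for every model. That is expected: for single-label classification, micro-averaged F1 pools true positives, false positives, and false negatives over both classes, which reduces to accuracy. A small check on made-up labels, with the class-1 F1 shown for contrast:

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels and predictions.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
f1_micro = f1_score(y_true, y_pred, average="micro")
f1_fraud = f1_score(y_true, y_pred)  # default: F1 of class 1 only

print(acc, f1_micro, f1_fraud)
```

If the goal is to track performance on the fraud class, the default binary F1 (or recall) is the more informative number.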




|                      | Actual  Fraud | Actual  NOT Fraud |     |
|:--------------------:|---------------|:-----------------:|-----|
| Predicted  Fraud     | TP            | FP                | PPV |
| Predicted  NOT Fraud | FN            | TN                | NPV |
|                      | TPR           | TNR               |     |

TP = True Positive

FP = False Positive

FN = False Negative

TN = True Negative

<br>

TPR(Recall) = True Positive rate = $\dfrac{TP}{TP + FN}$

TNR = True Negative rate = $\dfrac{TN}{TN + FP}$

PPV(precision) = Positive Predictive value = $\dfrac{TP}{TP + FP}$

NPV = Negative predictive value = $\dfrac{TN}{TN + FN}$

F1 = $2 \cdot \dfrac{precision \cdot recall}{precision + recall}$
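The formulas above translate directly into code. The counts in this sketch are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute the rates defined above from raw confusion-matrix counts."""
    recall = tp / (tp + fn)      # TPR
    tnr = tn / (tn + fp)         # true negative rate
    precision = tp / (tp + fp)   # PPV
    npv = tn / (tn + fn)         # negative predictive value
    f1 = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "tnr": tnr, "precision": precision, "npv": npv, "f1": f1}

# Hypothetical counts, not results from this notebook.
m = confusion_metrics(tp=80, fp=20, fn=20, tn=880)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```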

# SMOTE

In [18]:
# https://towardsdatascience.com/the-right-way-of-using-smote-with-cross-validation-92a8d09d00c7
from imblearn.pipeline import Pipeline as imbpipeline
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTETomek

In [19]:
X = df.drop("Class", axis=1)
y = df["Class"]
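The core idea of SMOTE is to create synthetic minority rows by interpolating between a minority row and one of its nearest minority neighbours. The sketch below shows only that interpolation step with NumPy on toy points. It is a simplified illustration, not imblearn's `SMOTE` or `SMOTETomek`, and `smote_like_samples` is a hypothetical helper.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows between minority rows and their neighbours."""
    X = np.asarray(X_minority, dtype=float)
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances; a row is never its own neighbour.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a base row, one of its k neighbours, and a random point between them.
    base = rng.integers(0, len(X), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X[base] + gap * (X[picked] - X[base])

# Toy minority points at the corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(X_min, n_new=6)
print(synthetic.shape)
```

Because every synthetic row lies on a segment between two existing minority rows, resampling has to happen inside each training fold only; otherwise information from validation rows leaks into training.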



**Time consuming method with Cross Val**

In [20]:
# class_weight=dict({0:1,1:100})
# classifiers = {
#     "LogisiticRegression": LogisticRegression(),
#     "KNearest": KNeighborsClassifier(),
# #     "Support Vector Classifier": SVC(),
#     "RandomForestClassifier": RandomForestClassifier(class_weight=class_weight),
#     "LGBMClassifier":LGBMClassifier()
# }

# scoring = {'accuracy' : make_scorer(accuracy_score), 
#            'precision' : make_scorer(precision_score),
#            'recall' : make_scorer(recall_score), 
#            'f1_micro' : make_scorer(f1_score, average="micro")}

# for key, classifier in classifiers.items():

#     feat_select = SelectPercentile(mutual_info_classif, percentile=25)
#     pipe = imbpipeline(steps=[
#         ("select",feat_select),
#         ("smote",SMOTETomek(random_state=22,sampling_strategy=0.5)),
#         (key, classifier)

#     ])

#     training_score = cross_validate(pipe, X, y, cv=5, scoring=scoring, n_jobs=-1 )
#     for score in scoring.keys():
#         print(f'''{key :<25} {round(training_score["test_"+score].mean(), 2) * 100} % ({round(training_score["test_"+score].std(), 2) * 100}){score}''')
        
#     print("\n")
    
    


**Faster method with train test split**

In [21]:
class_weight=dict({0:1,1:100})
classifiers = {
    "LogisiticRegression": LogisticRegression(),
    "KNearest": KNeighborsClassifier(),
#     "Support Vector Classifier": SVC(),
    "RandomForestClassifier": RandomForestClassifier(class_weight=class_weight),
    "LGBMClassifier":LGBMClassifier()
}

scoring = {'accuracy' : make_scorer(accuracy_score), 
           'precision' : make_scorer(precision_score),
           'recall' : make_scorer(recall_score), 
           'f1_micro' : make_scorer(f1_score, average="micro")}

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)
print(f"len of X {len(y)}\nlen of train {len(y_train)}\nlen of test {len(y_test)}")

for key, classifier in classifiers.items():

    feat_select = SelectPercentile(mutual_info_classif, percentile=25)
    pipe = imbpipeline(steps=[
        ("select",feat_select),
        ("smote",SMOTETomek(random_state=22,sampling_strategy=0.5)),
        (key, classifier)

    ])

    print(f"-----------{key}-----------------")
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    
    print(classification_report(y_test, y_pred))
    print("\n")


len of X 284807
len of train 227845
len of test 56962
-----------LogisiticRegression-----------------
              precision    recall  f1-score   support

           0       1.00      0.99      0.99     56865
           1       0.11      0.89      0.20        97

    accuracy                           0.99     56962
   macro avg       0.56      0.94      0.60     56962
weighted avg       1.00      0.99      0.99     56962



-----------KNearest-----------------
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56865
           1       0.30      0.87      0.44        97

    accuracy                           1.00     56962
   macro avg       0.65      0.93      0.72     56962
weighted avg       1.00      1.00      1.00     56962



-----------RandomForestClassifier-----------------
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56865
           1       0.69      0.84      0.76

Refer to my previous work on pipelines and more: [link](https://www.kaggle.com/imams2000/pipelines-clearly-explained-why-you-should-use)