# CREDIT CARD FRAUD DETECTION

It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.

I downloaded a dataset from Kaggle that contains credit card transactions made by European cardholders in September 2013. The 'Class' feature is the response variable: it takes the value 1 in case of fraud and 0 otherwise.

In this exercise we will create a function to split our data, and we will build custom transformers and pipelines.

### Data visualization

In [24]:
import os  # The os module is used to build the path to the data file

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
%matplotlib inline

In [25]:
path = os.path.join(os.getcwd(), 'data', 'creditcard.csv')  # Avoids backslash escape issues in the path string

column_names = ['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
                'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
                'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount', 'Class']
data = pd.read_csv(path, header=None, names=column_names)

data = data.drop(data.columns[0], axis=1)  # Delete the first column ('Time')
data = data.drop(0, axis=0)  # Delete the first row, which holds the CSV's own header because of header=None

data.head()





Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
1,-1.3598071336738,-0.0727811733098497,2.53634673796914,1.37815522427443,-0.338320769942518,0.462387777762292,0.239598554061257,0.0986979012610507,0.363786969611213,0.0907941719789316,...,-0.018306777944153,0.277837575558899,-0.110473910188767,0.0669280749146731,0.128539358273528,-0.189114843888824,0.133558376740387,-0.0210530534538215,149.62,0
2,1.19185711131486,0.26615071205963,0.16648011335321,0.448154078460911,0.0600176492822243,-0.0823608088155687,-0.0788029833323113,0.0851016549148104,-0.255425128109186,-0.166974414004614,...,-0.225775248033138,-0.638671952771851,0.101288021253234,-0.339846475529127,0.167170404418143,0.125894532368176,-0.0089830991432281,0.0147241691924927,2.69,0
3,-1.35835406159823,-1.34016307473609,1.77320934263119,0.379779593034328,-0.503198133318193,1.80049938079263,0.791460956450422,0.247675786588991,-1.51465432260583,0.207642865216696,...,0.247998153469754,0.771679401917229,0.909412262347719,-0.689280956490685,-0.327641833735251,-0.139096571514147,-0.0553527940384261,-0.0597518405929204,378.66,0
4,-0.966271711572087,-0.185226008082898,1.79299333957872,-0.863291275036453,-0.0103088796030823,1.24720316752486,0.23760893977178,0.377435874652262,-1.38702406270197,-0.0549519224713749,...,-0.108300452035545,0.0052735967825345,-0.190320518742841,-1.17557533186321,0.647376034602038,-0.221928844458407,0.0627228487293033,0.0614576285006353,123.5,0
5,-1.15823309349523,0.877736754848451,1.548717846511,0.403033933955121,-0.407193377311653,0.0959214624684256,0.592940745385545,-0.270532677192282,0.817739308235294,0.753074431976354,...,-0.0094306971323291,0.79827849458971,-0.137458079619063,0.141266983824769,-0.206009587619756,0.502292224181569,0.219422229513348,0.215153147499206,69.99,0
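
As a side note on the loading step: reading with `header=None` treats the file's header row as data, which is why the first row has to be dropped and why the columns come out as strings. The sketch below illustrates the difference on a tiny in-memory CSV. The values and the reduced column set are made up for illustration and are not the real dataset.

```python
import io

import pandas as pd

# Hypothetical miniature stand-in for creditcard.csv: header row first, then data
csv_text = "Time,V1,Amount,Class\n0,-1.36,149.62,0\n1,1.19,2.69,0\n"

# header=None keeps the header row as a data row, so the columns hold strings
as_data = pd.read_csv(io.StringIO(csv_text), header=None,
                      names=["Time", "V1", "Amount", "Class"])

# The default (header=0) uses the first row as column names and infers numeric dtypes
as_header = pd.read_csv(io.StringIO(csv_text))

print(as_data.dtypes)
print(as_header.dtypes)
```

With the default header handling there is no extra row to drop and no later cast to float is needed; only the 'Time' column would still have to be removed.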


### Data splitting and null values

In [26]:
# Function that splits the data into a training set (60%), a validation set (20%) and a test set (20%)

def train_val_test_split(data, rstate=42, shuffle=True, stratify=None):
    strat = data[stratify] if stratify else None
    train_set, test_set = train_test_split(
        data, test_size=0.4, random_state=rstate, shuffle=shuffle, stratify=strat)
    strat = test_set[stratify] if stratify else None
    val_set, test_set = train_test_split(
        test_set, test_size=0.5, random_state=rstate, shuffle=shuffle, stratify=strat)
    return (train_set, val_set, test_set)
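
A quick way to sanity-check the split is to run it on a small synthetic frame. The sketch below repeats the function so that it is self-contained and uses a made-up, imbalanced `toy` DataFrame (an assumption for illustration only). It also passes `stratify="Class"`, which this notebook does not do later, to show how the fraud ratio can be preserved in every subset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def train_val_test_split(data, rstate=42, shuffle=True, stratify=None):
    # First cut: 60% train, 40% held out
    strat = data[stratify] if stratify else None
    train_set, test_set = train_test_split(
        data, test_size=0.4, random_state=rstate, shuffle=shuffle, stratify=strat)
    # Second cut: split the held-out 40% into two halves (20% + 20%)
    strat = test_set[stratify] if stratify else None
    val_set, test_set = train_test_split(
        test_set, test_size=0.5, random_state=rstate, shuffle=shuffle, stratify=strat)
    return (train_set, val_set, test_set)


# Hypothetical toy data: 1000 rows, 5% labelled as fraud
rng = np.random.default_rng(0)
toy = pd.DataFrame({"V1": rng.normal(size=1000), "Class": [1] * 50 + [0] * 950})

train, val, test = train_val_test_split(toy, stratify="Class")
print(len(train), len(val), len(test))
print(train["Class"].mean(), val["Class"].mean(), test["Class"].mean())
```

For a dataset as imbalanced as fraud data, stratifying is usually worth considering, since a purely random split can leave very few fraud cases in the validation or test set.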


In [27]:
data.isnull().values.any()  # Returns False, meaning that there are no null (NaN) values in the dataset


False

Since there are no null values, we are going to create some in order to make the example more challenging and to apply some *transformers* and *pipelines*.

In [81]:
# First we call the function created before and split our data

train_set, val_set, test_set = train_val_test_split(data, stratify=None)

X_train = train_set.drop('Class', axis=1)  # Drop the 'Class' column: it is the label (fraud or not), not a feature
y_train = train_set['Class'].copy()  # y_train holds the 'Class' labels for the training rows


164407    0
220384    0
49923     0
181643    0
210716    0
49810     0
88178     0
235279    0
228397    0
133844    0
142865    0
221890    0
210571    0
257712    0
103348    0
253802    0
109710    0
218414    0
30676     0
144602    0
139917    0
181387    0
104279    0
167448    0
1374      0
124553    0
257837    0
127793    0
52559     0
93881     0
173516    0
154074    0
162371    0
92934     0
278321    0
78884     0
175285    0
138576    0
171731    0
91227     0
104523    0
221915    0
28662     0
84751     0
725       0
234267    0
253897    0
135066    0
107849    0
37251     0
Name: Class, dtype: object

In [56]:
# The "V24" and "V2" columns contain string values (note the object dtype above), so we convert them to float
# before filtering on whether the values are greater than a float threshold.

X_train["V24"] = X_train["V24"].astype(float)
X_train.loc[(X_train["V24"] > 0.9), "V24"] = np.nan

X_train["V2"] = X_train["V2"].astype(float)
X_train.loc[(X_train["V2"] > 0.9), "V2"] = np.nan


X_train.isnull().sum()  # Count the null values in each column


V1            0
V2        82882
V3            0
V4            0
V5            0
V6            0
V7            0
V8            0
V9            0
V10           0
V11           0
V12           0
V13           0
V14           0
V15           0
V16           0
V17           0
V18           0
V19           0
V20           0
V21           0
V22           0
V23           0
V24       74816
V25           0
V26           0
V27           0
V28           0
Amount        0
dtype: int64

### Custom transformer creation

We are going to create a transformer to delete the rows that contain the null values we introduced before.


In [57]:
from sklearn.base import BaseEstimator, TransformerMixin

# The dropna method of a Pandas DataFrame is used to remove rows with NaN values. By default, dropna removes any row with at 
#least one NaN value. The returned DataFrame has the same columns as the input DataFrame X but with fewer rows. 

class DeleteNanRows(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        return X.dropna()
    

In [58]:
# We apply the transformer created before

delete_nan = DeleteNanRows()
X_train_prep = delete_nan.fit_transform(X_train)

In [59]:
X_train_prep

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
164407,-1.509348,-0.987335,1.089287,-0.779271,1.213151,-0.250687,0.1767,0.211197,0.22693,-0.975273,...,0.502244,0.369591,0.422263,0.222516,0.030641,0.173874,0.28808,-1.509348,0.156923,160.0
181643,1.898722,-0.321038,-1.771837,0.672408,0.115019,-1.267347,0.61281,-0.44107,0.450298,0.107004,...,-0.082106,0.015111,0.006269,-0.029094,-0.071333,0.179444,0.378225,1.898722,-0.059506,104.36
221890,2.207557,-0.591985,-2.104402,-0.866674,-0.131505,-1.804666,0.317702,-0.633757,-0.940007,0.923388,...,0.005309,0.363225,0.971828,-0.105365,0.036243,0.436013,0.078374,2.207557,-0.075659,49.9
257712,1.998537,-0.114150,-0.568839,0.593704,-0.34037,-0.817538,-0.082697,-0.194737,0.719569,-0.005457,...,-0.188603,-0.480196,-1.240859,0.51433,-0.050203,-0.59731,-0.241303,1.998537,-0.042653,7.99
253802,2.021351,-0.022914,-1.761375,1.221473,0.381788,-0.711667,0.318994,-0.077629,0.423742,0.519871,...,-0.489465,0.040648,0.182425,-0.011857,-0.473213,0.372188,-0.47561,2.021351,-0.083353,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
194028,1.8872,-0.310380,-1.424391,0.582389,0.672704,1.240693,-0.532214,0.361545,1.222261,-0.737905,...,-0.241503,0.222073,1.2344,0.034773,-0.311531,0.058957,0.919369,1.887200,-0.045539,2.36
262914,1.876679,-0.155204,-1.748219,0.498272,0.003679,-1.178741,0.137847,-0.157584,0.550154,-0.295628,...,-0.174858,0.27917,0.733576,-0.052769,-0.084763,0.12155,-0.122577,1.876679,-0.033335,66.48
175204,-0.210465,-1.057217,0.201472,0.389173,-0.118451,-0.547436,0.765765,-0.308306,0.901931,-0.305594,...,0.664043,0.635084,1.633047,0.928664,-0.105168,-2.017903,-0.775345,-0.210465,0.198168,260.0
207893,1.95284,-0.965820,-1.325992,-0.97928,-0.026764,0.205408,-0.411692,0.012614,-0.598558,0.615366,...,0.02415,0.072499,0.180845,0.321661,-1.031229,-0.441599,-0.337488,1.952840,-0.055575,77.0
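
One thing to keep in mind when dropping rows: `DeleteNanRows` only sees the features, so the label vector still contains the rows that were removed. A minimal sketch of realigning the labels through the index is shown below. The `X_toy` and `y_toy` names and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical features with a non-default index, as produced by a shuffled split
X_toy = pd.DataFrame(
    {"V2": [0.1, np.nan, 0.3, 0.4], "V24": [0.5, 0.6, np.nan, 0.8]},
    index=[10, 11, 12, 13],
)
y_toy = pd.Series([0, 1, 0, 0], index=X_toy.index, name="Class")

X_clean = X_toy.dropna()             # same operation as DeleteNanRows.transform
y_clean = y_toy.loc[X_clean.index]   # keep only the labels of the surviving rows

print(X_clean.shape, y_clean.shape)
```

Without this step the features and labels would have different lengths, and a model could not be fitted on them together.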


In [60]:
# We will create another transformer that scales the columns modified before using RobustScaler,
# which is resistant to outliers because it relies on the median and the interquartile range.

class CustomScaler(BaseEstimator, TransformerMixin):
    def __init__(self, attributes):
        self.attributes = attributes
    def fit(self, X, y=None):
        # Learn the scaling statistics once, from the data passed to fit (the training set)
        self.scaler_ = RobustScaler().fit(X[self.attributes])
        return self
    def transform(self, X, y=None):
        X_copy = X.copy()
        # Reuse the statistics learned in fit so that new data is scaled consistently
        X_copy[self.attributes] = self.scaler_.transform(X_copy[self.attributes])
        return X_copy

In [61]:
custom_scaler = CustomScaler(["V24", "V2"])
X_train_prep = custom_scaler.fit_transform(X_train_prep)

In [62]:
X_train_prep.head(20)

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
164407,-1.509348,-0.430391,1.089287,-0.779271,1.213151,-0.250687,0.1767,0.211197,0.22693,-0.975273,...,0.502244,0.369591,0.422263,0.222516,0.530152,0.173874,0.28808,-1.509348,0.156923,160.0
181643,1.898722,0.252384,-1.771837,0.672408,0.115019,-1.267347,0.61281,-0.44107,0.450298,0.107004,...,-0.082106,0.015111,0.006269,-0.029094,0.371707,0.179444,0.378225,1.898722,-0.059506,104.36
221890,2.207557,-0.025263,-2.104402,-0.866674,-0.131505,-1.804666,0.317702,-0.633757,-0.940007,0.923388,...,0.005309,0.363225,0.971828,-0.105365,0.538856,0.436013,0.078374,2.207557,-0.075659,49.9
257712,1.998537,0.464389,-0.568839,0.593704,-0.34037,-0.817538,-0.082697,-0.194737,0.719569,-0.005457,...,-0.188603,-0.480196,-1.240859,0.51433,0.404538,-0.59731,-0.241303,1.998537,-0.042653,7.99
253802,2.021351,0.557881,-1.761375,1.221473,0.381788,-0.711667,0.318994,-0.077629,0.423742,0.519871,...,-0.489465,0.040648,0.182425,-0.011857,-0.252724,0.372188,-0.47561,2.021351,-0.083353,1.0
218414,-0.334332,0.295166,0.558407,-2.721258,0.222245,0.105259,0.424721,-0.808027,-1.190487,2.396678,...,0.074775,-0.332232,0.203322,-0.152383,0.506053,-0.389804,-0.234411,-0.334332,-0.354962,20.0
139917,-0.552311,-0.314721,1.612646,-2.662326,-0.318061,-0.053011,-0.868999,0.228058,-2.240834,0.95971,...,-0.214437,0.026717,0.236322,-0.256758,-0.879256,0.074152,-0.190813,-0.552311,0.098293,12.0
104279,-2.117529,-0.024523,1.054192,0.020196,2.234944,-0.296009,0.534111,0.063122,-0.472544,-0.625054,...,-0.274365,-0.053201,0.152503,0.651114,-1.599976,1.119583,-0.25911,-2.117529,-0.044974,6.2
127793,-0.709215,0.169415,2.30018,-2.461564,-0.759022,-0.127262,-0.333105,0.132349,-2.577745,0.506961,...,-0.261923,-0.210459,-0.314094,0.044342,0.552925,-0.004456,-0.458079,-0.709215,0.072301,29.4
173516,2.036382,0.091892,-1.219139,-0.39333,-0.325108,-1.089611,0.036977,-0.363482,1.151787,-0.342488,...,-0.08978,0.065222,0.48436,0.017664,0.513982,0.060489,1.002667,2.036382,-0.06948,35.0
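
To make the scaling step less of a black box, the sketch below reproduces by hand what `RobustScaler` does with its default settings: subtract the median and divide by the interquartile range. The array `x` is a made-up example with one deliberate outlier.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical single-column data with one large outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

median = np.median(x, axis=0)
q1, q3 = np.percentile(x, [25, 75], axis=0)
manual = (x - median) / (q3 - q1)        # (value - median) / IQR

scaled = RobustScaler().fit_transform(x)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

Because the median and the quartiles barely move when the outlier changes, the typical values keep a similar scale regardless of how extreme the outlier is.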


### Custom Pipeline creation

Pipelines allow us to group all the transformation operations we need to perform on a dataset into a single execution flow. This greatly facilitates transformations for different datasets.

In [63]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.impute import SimpleImputer  # SimpleImputer is a transformer class that
                                          # imputes missing values in a dataset.
                                          # It replaces missing values with the mean, median, most frequent value
                                          # of each column, or a constant, based on the strategy chosen by the user.

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('rbst_scaler', RobustScaler()),
    ])




In [65]:
X_train_prep = num_pipeline.fit_transform(X_train)  # Fit the pipeline on the training data and transform it (returns a NumPy array)
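
The benefit of a pipeline shows up when the same fitted steps are reused on another subset. The following is a minimal sketch under assumed toy data (`X_train_toy` and `X_val_toy` are hypothetical names and values): the pipeline is fitted on the training frame only, then applied to the validation frame, and the NumPy output is wrapped back into a DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("rbst_scaler", RobustScaler()),
])

# Hypothetical toy frames with missing values
X_train_toy = pd.DataFrame({"V2": [0.0, 1.0, np.nan, 3.0], "V24": [2.0, np.nan, 4.0, 6.0]})
X_val_toy = pd.DataFrame({"V2": [np.nan, 2.0], "V24": [4.0, np.nan]})

# fit_transform learns the medians and IQRs from the training data only
train_arr = num_pipeline.fit_transform(X_train_toy)
# transform reuses those training statistics on the validation data
val_arr = num_pipeline.transform(X_val_toy)

# The pipeline returns NumPy arrays; restore column names and index if needed
X_train_toy_prep = pd.DataFrame(train_arr, columns=X_train_toy.columns, index=X_train_toy.index)
X_val_toy_prep = pd.DataFrame(val_arr, columns=X_val_toy.columns, index=X_val_toy.index)

print(X_val_toy_prep)
```

Calling `transform` rather than `fit_transform` on the validation and test sets keeps their statistics from leaking into the preprocessing.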