## Import essential libraries that will be used in the project


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Reading the dataset using pandas


In [2]:
df = pd.read_csv("Fraud.csv")

In [3]:
df.head()

| | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 | 0 |
| 2 | 1 | TRANSFER | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | 1 | 0 |
| 3 | 1 | CASH_OUT | 181.0 | C840083671 | 181.0 | 0.0 | C38997010 | 21182.0 | 0.0 | 1 | 0 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | 0 | 0 |


## 1. Data cleaning including missing values, outliers and multi-collinearity
### Checking for missing values and outliers

In [4]:
df.isnull().sum()

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

The output above shows that there are no missing values in the dataset.
As for outliers, we do not remove any. This is financial data, where a balance or transaction amount can legitimately take any real value, so there is no principled cutoff for labelling a value an outlier.

### Checking for Multicollinearity:
For this we will use the variance inflation factor (VIF).
VIF needs numeric inputs, so we encode the type column as integers and drop the identifier columns (nameOrig, nameDest) together with the two fraud labels (isFraud, isFlaggedFraud).
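To make the quantity concrete, here is a minimal sketch of the textbook definition, VIF = 1 / (1 - R²), where R² comes from regressing one column on all the others. The function name `vif_from_scratch` and the synthetic columns `a`, `b`, `c` are made up for illustration. This version fits an intercept, so its numbers may differ from what `variance_inflation_factor` returns when the input has no constant column, because statsmodels regresses on exactly the columns it is given.

```python
import numpy as np

def vif_from_scratch(X):
    """Return 1 / (1 - R^2) for each column, regressing it on the others plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for i in range(k):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + remaining columns
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic data (an assumption, not from Fraud.csv): b is almost a copy of a, c is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = a + rng.normal(scale=0.05, size=1000)
c = rng.normal(size=1000)

demo_vifs = vif_from_scratch(np.column_stack([a, b, c]))
print(demo_vifs)
```

In this toy setup the two near-duplicate columns get very large values while the independent column stays close to 1, which is the pattern to look for in the real output.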

In [5]:
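# Work on a deep copy so the original dataframe stays untouched
copy_df = df.copy(deep=True)
# Encode the transaction type as integer codes (the ordering is arbitrary)
copy_df['type'] = copy_df['type'].map({'PAYMENT':1 ,'TRANSFER':2, 'CASH_OUT':3, 'DEBIT':4, 'CASH_IN':5})
# Drop the identifier columns and the two fraud labels
copy_df = copy_df.drop(columns = ['nameOrig','nameDest','isFraud','isFlaggedFraud'])
copy_df.head()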

| | step | type | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 9839.64 | 170136.0 | 160296.36 | 0.0 | 0.0 |
| 1 | 1 | 1 | 1864.28 | 21249.0 | 19384.72 | 0.0 | 0.0 |
| 2 | 1 | 2 | 181.0 | 181.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1 | 3 | 181.0 | 181.0 | 0.0 | 21182.0 | 0.0 |
| 4 | 1 | 1 | 11668.14 | 41554.0 | 29885.86 | 0.0 | 0.0 |


In [6]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [7]:
variance_data = pd.DataFrame()
variance_data["feature"] = copy_df.columns
variance_data["VIF"] = [variance_inflation_factor(copy_df.values, i )
                        for i in range(len(copy_df.columns))]

print(variance_data)

          feature         VIF
0            step    2.466060
1            type    3.251976
2          amount    4.129854
3   oldbalanceOrg  501.282300
4  newbalanceOrig  508.906801
5  oldbalanceDest   73.377939
6  newbalanceDest   84.656570


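From the results we can observe that two features, oldbalanceOrg and newbalanceOrig, have very high variance inflation factors (about 501 and 509), which indicates that they are highly collinear. oldbalanceDest and newbalanceDest are also elevated (about 73 and 85), well above the value of 10 that is commonly used as a rule-of-thumb warning level.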
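A quick way to cross-check a suspicious pair is its pairwise correlation. The sketch below uses only the five rows printed by `df.head()` earlier, so it is illustrative rather than conclusive; the names `head_rows` and `pair_corr` are made up for this example.

```python
import pandas as pd

# The five origin-balance rows shown in the df.head() output; far too few to draw conclusions.
head_rows = pd.DataFrame({
    "oldbalanceOrg": [170136.0, 21249.0, 181.0, 181.0, 41554.0],
    "newbalanceOrig": [160296.36, 19384.72, 0.0, 0.0, 29885.86],
})

# Pearson correlation between the balance before and after the transaction
pair_corr = head_rows["oldbalanceOrg"].corr(head_rows["newbalanceOrig"])
print(round(pair_corr, 3))
```

On the full dataframe the same call would be `df["oldbalanceOrg"].corr(df["newbalanceOrig"])`.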

The dataset has two fraud-related columns: isFraud marks transactions that are actually fraudulent, whereas isFlaggedFraud marks transactions that the system blocked because they exceeded some threshold.

### isFraud:

In [8]:
print("Type of transactions which are fraud: {}".format(list(df.loc[df.isFraud == 1].type.drop_duplicates())))

fraud_transfer = df.loc[(df.isFraud == 1) & (df.type == "TRANSFER")]
fraud_cashout = df.loc[(df.isFraud == 1) & (df.type == "CASH_OUT")]

print("Number of transfers are fraud: {}".format(len(fraud_transfer)))
print("Number of cashouts are fraud: {}".format(len(fraud_cashout)))


Type of transactions which are fraud: ['TRANSFER', 'CASH_OUT']
Number of transfers are fraud: 4097
Number of cashouts are fraud: 4116
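The same per-type breakdown can be obtained in one step with a group-by. This is a sketch on the five rows shown in the `df.head()` output, not on the full data; the names `head_labels` and `fraud_by_type` are made up for this example.

```python
import pandas as pd

# Transaction type and fraud label for the five rows shown by df.head()
head_labels = pd.DataFrame({
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "isFraud": [0, 0, 1, 1, 0],
})

# "sum" counts fraudulent rows per type, "mean" gives the fraud rate per type
fraud_by_type = head_labels.groupby("type")["isFraud"].agg(["sum", "mean"])
print(fraud_by_type)
```

Applied to `df` instead of `head_labels`, this would list every transaction type at once, including those with zero fraud cases.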


### isFlaggedFraud:

In [9]:
flagfraud1 = list(df.loc[(df.isFlaggedFraud == 1)].amount.values)
flagfraud0 = list(df.loc[(df.isFlaggedFraud == 0)].amount.values)

print('minimum amount transfer where isFlaggedFraud is 1 = {}'.format(min(flagfraud1)))
print('maximum amount transfer where isFlaggedFraud is 0 = {}'.format(max(flagfraud0)))
print('Number of isFlaggedFraud == 1 : {}'.format(len(list(df.loc[df.isFlaggedFraud ==1 ].isFlaggedFraud.values))))

minimum amount transfer where isFlaggedFraud is 1 = 353874.22
maximum amount transfer where isFlaggedFraud is 0 = 92445516.64
Number of isFlaggedFraud == 1 : 16


The isFlaggedFraud variable is supposed to be set to 1 when more than 200,000 is transferred in a single transaction, but the analysis above shows that this rule is not applied consistently: the largest amount with isFlaggedFraud equal to 0 is 92445516.64, far above the threshold. Since the flag is set in only 16 rows, it carries almost no information, so we will drop this feature.
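The check and the drop can be sketched as follows. The rows in `toy` are invented for illustration and are not taken from Fraud.csv; `THRESHOLD`, `missed` and `toy_reduced` are likewise hypothetical names.

```python
import pandas as pd

# Hypothetical rows, invented for illustration only (not from Fraud.csv)
toy = pd.DataFrame({
    "type": ["TRANSFER", "TRANSFER", "CASH_OUT", "TRANSFER"],
    "amount": [150_000.0, 450_000.0, 900_000.0, 5_000_000.0],
    "isFlaggedFraud": [0, 1, 0, 0],
})

THRESHOLD = 200_000  # the threshold stated in the text

# Transfers above the threshold that were nevertheless not flagged
over_threshold = (toy["type"] == "TRANSFER") & (toy["amount"] > THRESHOLD)
missed = toy[over_threshold & (toy["isFlaggedFraud"] == 0)]
print(len(missed))

# Drop the uninformative column
toy_reduced = toy.drop(columns=["isFlaggedFraud"])
print(list(toy_reduced.columns))
```

Running the same filter on `df` would count how many large transfers escaped the flag before the column is dropped.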