<a href="https://colab.research.google.com/github/ItshMoh/fraud_transaction/blob/main/Fraud_detection_final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Importing Libraries

In [137]:
import pandas as pd
import numpy as np
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.feature_selection import VarianceThreshold
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif

Importing the dataset from Google Drive

In [138]:
data=pd.read_csv('/content/drive/MyDrive/Fraud.csv')

Overview of the Data

In [139]:
data.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


In [140]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   step            int64  
 1   type            object 
 2   amount          float64
 3   nameOrig        object 
 4   oldbalanceOrg   float64
 5   newbalanceOrig  float64
 6   nameDest        object 
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         int64  
 10  isFlaggedFraud  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB


In [141]:
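# Note: .count() counts the entries of the boolean mask, so this shows the row total
# per column, not the number of missing values; data.isnull().sum() counts missing values.
data.isnull().count()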

step              6362620
type              6362620
amount            6362620
nameOrig          6362620
oldbalanceOrg     6362620
newbalanceOrig    6362620
nameDest          6362620
oldbalanceDest    6362620
newbalanceDest    6362620
isFraud           6362620
isFlaggedFraud    6362620
dtype: int64
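
The difference between counting rows and counting missing values can be seen on a small example. This is a minimal sketch on a hypothetical toy frame (the `toy` DataFrame and its values are made up for illustration and are not part of the fraud dataset):

```python
import pandas as pd

# Hypothetical toy frame with two missing values in 'amount'.
toy = pd.DataFrame({
    "amount": [10.0, None, 30.0, None],
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"],
})

# isnull() returns a boolean mask that itself has no missing entries,
# so count() simply reports the number of rows in every column.
row_totals = toy.isnull().count()

# sum() adds up the True values, which is the number of missing cells.
missing = toy.isnull().sum()

print(row_totals.to_dict())
print(missing.to_dict())
```

On the full dataset, `data.isnull().sum()` is the call that would report how many values are actually missing in each column.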

In [142]:
data.columns

Index(['step', 'type', 'amount', 'nameOrig', 'oldbalanceOrg', 'newbalanceOrig',
       'nameDest', 'oldbalanceDest', 'newbalanceDest', 'isFraud',
       'isFlaggedFraud'],
      dtype='object')

Data Preprocessing

In [143]:
# Check whether there are any duplicate columns
def get_duplicate_columns(df):

    duplicate_columns = {}
    seen_columns = {}

    for column in df.columns:
        current_column = df[column]

        # Convert column data to bytes
        try:
            current_column_hash = current_column.values.tobytes()
        except AttributeError:
            current_column_hash = current_column.to_string().encode()

        if current_column_hash in seen_columns:
            if seen_columns[current_column_hash] in duplicate_columns:
                duplicate_columns[seen_columns[current_column_hash]].append(column)
            else:
                duplicate_columns[seen_columns[current_column_hash]] = [column]
        else:
            seen_columns[current_column_hash] = column

    return duplicate_columns

In [10]:
get_duplicate_columns(data)
#It has no duplicate columns.

{}

Checking the distribution of the target variable, 'isFraud'.

In [154]:
data.isFraud.value_counts()  # The data is highly imbalanced.

0    6354407
1       8213
Name: isFraud, dtype: int64

In [148]:
fr_trans= data[data.isFraud==1]

In [150]:
len(fr_trans)

8213

Checking some statistics of the amounts of fraudulent transactions

In [151]:
fr_trans.amount.describe()  # Summary statistics of the fraudulent transaction amounts.

count    8.213000e+03
mean     1.467967e+06
std      2.404253e+06
min      0.000000e+00
25%      1.270913e+05
50%      4.414234e+05
75%      1.517771e+06
max      1.000000e+07
Name: amount, dtype: float64

Checking some statistics of the amounts of legitimate transactions.

In [152]:
leg_trans= data[data.isFraud==0]

In [153]:
leg_trans.amount.describe()

count    6.354407e+06
mean     1.781970e+05
std      5.962370e+05
min      1.000000e-02
25%      1.336840e+04
50%      7.468472e+04
75%      2.083648e+05
max      9.244552e+07
Name: amount, dtype: float64

It is clear that the dataset is highly imbalanced: only 8,213 of the 6,362,620 transactions are fraudulent. We have to balance it, so here we apply random undersampling with RandomUnderSampler.

In [155]:
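# RandomUnderSampler is already imported above; it is re-imported so this cell runs on its own.
from imblearn.under_sampling import RandomUnderSampler
X = data.drop('isFraud', axis = 1)
y = data.isFraud
# sampling_strategy=0.8 keeps every fraud row and samples legit rows until fraud/legit = 0.8.
# No random_state is set, so a different subset of legit rows is drawn on every run.
rus = RandomUnderSampler(sampling_strategy=0.8)
X_res, y_res = rus.fit_resample(X, y)
print(X_res.shape, y_res.shape)
print(y_res.value_counts())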

(18479, 10) (18479,)
0    10266
1     8213
Name: isFraud, dtype: int64
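
The resampled class counts follow directly from the ratio passed as sampling_strategy. The sketch below reproduces the arithmetic from the class counts printed earlier; it assumes the number of majority rows kept is the minority count divided by the ratio and truncated to an integer, which agrees with the output above:

```python
# Class counts taken from the value_counts() output of the full dataset.
n_fraud = 8213
n_legit = 6354407
sampling_strategy = 0.8  # desired fraud / legit ratio after resampling

# Ratio before resampling, for comparison.
original_ratio = n_fraud / n_legit

# Assumption: majority rows kept = minority count / ratio, truncated to an integer.
n_legit_kept = int(n_fraud / sampling_strategy)
total_rows = n_fraud + n_legit_kept

print(round(original_ratio, 5))
print(n_legit_kept, total_rows)
print(round(n_fraud / n_legit_kept, 4))
```

All fraud rows are kept, so undersampling discards the vast majority of the legitimate transactions.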


Checking which features to keep and which to discard.

Variance Threshold

In [156]:
sel = VarianceThreshold(threshold=0.05)
X_res = X_res.drop(['type','nameOrig','nameDest'], axis=1)  # Drop the non-numeric features.

In [157]:
from sklearn.preprocessing import MinMaxScaler

In [158]:
scaler = MinMaxScaler()
X_res1= scaler.fit_transform(X_res)

In [159]:
X_res= pd.DataFrame(X_res1,columns=X_res.columns)

In [160]:
X_res.head()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFlaggedFraud
0,0.060647,0.000586,0.000827,0.000693,0.0,0.0,0.0
1,0.498652,0.00609,0.000863,0.0,0.002286,0.002937,0.0
2,0.473046,0.000462,0.001544,0.001618,0.0,0.0,0.0
3,0.408356,0.00074,0.002672,0.00283,0.004415,0.004485,0.0
4,0.17655,5.6e-05,0.000836,0.000976,0.0,0.0,0.0


In [161]:
sel.fit(X_res)

In [162]:
columns=X_res.columns[sel.get_support()]

In [163]:
columns

Index(['step'], dtype='object')
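
A variance threshold keeps the features whose variance is above the threshold, and after min-max scaling a column with a few extreme values ends up with a very small variance. The sketch below illustrates this with a hand-rolled version of the check on hypothetical toy columns (the data, the `min_max` helper and the column ranges are assumptions for illustration; it does not call scikit-learn):

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical toy columns: 'step' is spread evenly, 'amount' has one extreme outlier.
step = [rng.uniform(1, 100) for _ in range(1000)]
amount = [rng.uniform(0, 1e5) for _ in range(999)] + [1e7]

def min_max(values):
    """Scale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

threshold = 0.05
# Population variance of each scaled column.
variances = {
    "step": statistics.pvariance(min_max(step)),
    "amount": statistics.pvariance(min_max(amount)),
}

# Features whose variance exceeds the threshold are the ones that are kept.
kept = [name for name, var in variances.items() if var > threshold]

print(variances)
print(kept)
```

In this toy case only 'step' is kept: the single outlier squeezes almost all scaled 'amount' values close to zero, so its variance falls below the threshold.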

Using the Correlation Matrix for selecting features.

In [164]:
corr_matrix= X_res.corr()

In [165]:
# Get the column names of the DataFrame
columns = corr_matrix.columns

# Create an empty list to keep track of columns to drop
columns_to_drop = []

# Loop over the columns
for i in range(len(columns)):
    for j in range(i + 1, len(columns)):
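        # Flag the second column of any pair whose correlation exceeds 0.95.
        # Only positive correlations are checked; wrapping the value in abs()
        # would also catch strong negative correlations.
        if corr_matrix.loc[columns[i], columns[j]] > 0.95:
            columns_to_drop.append(columns[j])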

print(len(columns_to_drop))

0


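The correlation matrix shows no columns to drop, so the "step" feature is not removed here. Note that `get_support()` marks the features that VarianceThreshold keeps, so the result above means that "step" is the only feature whose scaled variance exceeds 0.05; the notebook nevertheless treats "step" as the feature to remove.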

Checking for multicollinearity

In [166]:
Y= y_res
iv= X_res.columns
# iv=iv.delete(0)
X=X_res[iv]

In [167]:
iv

Index(['step', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest',
       'newbalanceDest', 'isFlaggedFraud'],
      dtype='object')

In [168]:
[vif(X_res[iv].values, index) for index in range(len(iv))]

[1.2359382940604318,
 24.35745162640952,
 67.54379644316828,
 38.68735076636686,
 16.061303708732495,
 18.6084903083912,
 1.1263868995966957]
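
For intuition, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on the other features. With only two features and an intercept, R² is the squared Pearson correlation between them. The sketch below is a hand-rolled illustration of that two-feature case on hypothetical values (the lists and helper names are made up); it is not the statsmodels routine used above, which, as far as its documentation indicates, works on the design matrix as given and can therefore return different numbers when no constant column is included:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def vif_two_features(xs, ys):
    """VIF for the two-feature case: 1 / (1 - r^2)."""
    r = pearson_r(xs, ys)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical values: the new balance tracks the old balance closely.
old_balance = [100.0, 250.0, 400.0, 800.0, 1500.0, 3000.0]
new_balance = [90.0, 260.0, 380.0, 790.0, 1520.0, 2950.0]
unrelated = [5.0, 1.0, 4.0, 2.0, 6.0, 3.0]

vif_collinear = vif_two_features(old_balance, new_balance)
vif_unrelated = vif_two_features(old_balance, unrelated)

print(vif_collinear)
print(vif_unrelated)
```

Two features that move together produce a VIF far above the cutoff of 10 used in this notebook, while an unrelated feature stays close to 1.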

In [169]:
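# Repeatedly drop the feature with the highest VIF while that VIF exceeds 10.
# The loop runs once per original feature, so once nothing exceeds 10 the
# remaining iterations only re-print the same maximum.
for i in range(len(iv)):
    vif_list = [vif(X_res[iv].values, index) for index in range(len(iv))]
    maxvif = max(vif_list)
    print("Max VIF value is ", maxvif)
    drop_index = vif_list.index(maxvif)
    print("For Independent variable", iv[drop_index])

    if maxvif > 10:
        print("Deleting", iv[drop_index])
        iv = iv.delete(drop_index)
        print("Final Independent_variables ", iv)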

Max VIF value is  67.54379644316828
For Independent variable oldbalanceOrg
Deleting oldbalanceOrg
Final Independent_variables  Index(['step', 'amount', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest',
       'isFlaggedFraud'],
      dtype='object')
Max VIF value is  17.396742346956778
For Independent variable newbalanceDest
Deleting newbalanceDest
Final Independent_variables  Index(['step', 'amount', 'newbalanceOrig', 'oldbalanceDest', 'isFlaggedFraud'], dtype='object')
Max VIF value is  1.2342963409718601
For Independent variable step
Max VIF value is  1.2342963409718601
For Independent variable step
Max VIF value is  1.2342963409718601
For Independent variable step
Max VIF value is  1.2342963409718601
For Independent variable step
Max VIF value is  1.2342963409718601
For Independent variable step


In [170]:
iv

Index(['step', 'amount', 'newbalanceOrig', 'oldbalanceDest', 'isFlaggedFraud'], dtype='object')

After the multicollinearity check, 'oldbalanceOrg' has the highest VIF and is the first feature flagged for removal. From the variance threshold step we take 'step' as the feature to remove. That makes both features candidates for removal.

However, removing 'oldbalanceOrg' decreases the accuracy and the roc_auc_score, so we will keep 'oldbalanceOrg' and drop only 'step'.

Training Phase of the model

Here we apply random undersampling to the data again from scratch, now that we have found the features to remove.

In [171]:
from imblearn.under_sampling import RandomUnderSampler
X = data.drop('isFraud', axis = 1)
y = data.isFraud
# No random_state is set, so this draws a different subset of legit rows
# than the sample used for feature selection above.
rus = RandomUnderSampler(sampling_strategy=0.8)
X_res, y_res = rus.fit_resample(X, y)
print(X_res.shape, y_res.shape)
print(y_res.value_counts())

(18479, 10) (18479,)
0    10266
1     8213
Name: isFraud, dtype: int64


Here we will first remove the features that we don't want.

In [172]:
X_res=X_res.drop(['step','type','nameOrig','nameDest'],axis=1)

In [173]:
def train_validation_test_split(
    X, y, train_size=0.8, val_size=0.1, test_size=0.1,
    random_state=None, shuffle=True):
    # The three fractions must sum to 1 (within floating-point tolerance).
    assert abs(train_size + val_size + test_size - 1.0) < 1e-6
    # First split the test set off from the full data.
    X_train_val, X_test, y_train_val, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, shuffle=shuffle)
    # Then split the remainder into train and validation sets, rescaling the
    # validation fraction relative to the rows that are left.
    X_train, X_val, y_train, y_val = train_test_split(
        X_train_val, y_train_val, test_size=val_size/(train_size+val_size),
        random_state=random_state, shuffle=shuffle)
    return X_train, X_val, X_test, y_train, y_val, y_test

In [174]:
X_train, X_val, X_test, y_train, y_val, y_test = train_validation_test_split(
    X_res, y_res, train_size=0.8, val_size=0.1, test_size=0.1, random_state=1)
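
The sizes of the three subsets can be worked out by hand from the two-stage split. The sketch below repeats the arithmetic for the 18,479 resampled rows; it assumes that the held-out fraction is rounded up to a whole number of rows at each stage, which agrees with the support of 1,848 test rows in the classification report:

```python
import math

n_rows = 18479  # rows after undersampling, from the shape printed above
train_size, val_size, test_size = 0.8, 0.1, 0.1

# Stage 1: split off the test set.
# Assumption: the held-out fraction is rounded up to a whole number of rows.
n_test = math.ceil(n_rows * test_size)
n_train_val = n_rows - n_test

# Stage 2: the validation fraction is rescaled to the remaining rows (0.1 / 0.9).
val_fraction = val_size / (train_size + val_size)
n_val = math.ceil(n_train_val * val_fraction)
n_train = n_train_val - n_val

print(n_train, n_val, n_test)
```

Rescaling the validation fraction in the second stage is what keeps the validation and test sets the same size.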


In [191]:
model = RandomForestClassifier(n_estimators=100)

In [192]:
model.fit(X_train,y_train)

In [193]:
y_pred=model.predict(X_test)

In [194]:
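print(classification_report(y_test, y_pred))
print('accuracy', accuracy_score(y_test, y_pred))
# ROC AUC is computed here from the hard 0/1 predictions; passing
# model.predict_proba(X_test)[:, 1] would score the ranking of the predicted probabilities instead.
roc_auc_score(y_test, y_pred)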

              precision    recall  f1-score   support

           0       1.00      0.99      0.99      1019
           1       0.98      1.00      0.99       829

    accuracy                           0.99      1848
   macro avg       0.99      0.99      0.99      1848
weighted avg       0.99      0.99      0.99      1848

accuracy 0.9902597402597403


0.9907179748825394
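
ROC AUC measures how well the scores rank fraud cases above legitimate ones, so it can change when probabilities are replaced by thresholded 0/1 labels. The sketch below uses a hand-rolled pairwise AUC on hypothetical labels and scores (the `roc_auc` helper and the toy values are made up for illustration; it is not scikit-learn's implementation):

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical true labels and predicted fraud probabilities.
y_true = [0, 0, 0, 1, 1, 1]
proba = [0.1, 0.4, 0.6, 0.55, 0.8, 0.9]

# Hard predictions obtained with a 0.5 cutoff.
labels = [1 if p >= 0.5 else 0 for p in proba]

auc_from_proba = roc_auc(y_true, proba)
auc_from_labels = roc_auc(y_true, labels)

print(auc_from_proba)
print(auc_from_labels)
```

Thresholding collapses the scores into two values and creates ties, so the AUC from hard labels reflects a single operating point rather than the full ranking.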

The github link to this notebook: