# **Fraud Transaction Detection Using Random Forest Classifier**


**Project Overview**


This project builds a Random Forest classifier to detect fraudulent transactions in a dataset. The objective is to learn the patterns that separate fraudulent from legitimate transactions and to measure how reliably the model classifies each.

**1. Cleaning the Dataset.**

In [28]:
# Importing Useful Libraries for EDA and model building 
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split

In [4]:
# Loading and viewing the dataset
df = pd.read_csv('Fraud.csv')
df

In [6]:
# Dataset Opening Values 
df.head(10)

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0
5,1,PAYMENT,7817.71,C90045638,53860.0,46042.29,M573487274,0.0,0.0,0,0
6,1,PAYMENT,7107.77,C154988899,183195.0,176087.23,M408069119,0.0,0.0,0,0
7,1,PAYMENT,7861.64,C1912850431,176087.23,168225.59,M633326333,0.0,0.0,0,0
8,1,PAYMENT,4024.36,C1265012928,2671.0,0.0,M1176932104,0.0,0.0,0,0
9,1,DEBIT,5337.77,C712410124,41720.0,36382.23,C195600860,41898.0,40348.79,0,0


In [7]:
df.tail(10)

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
6362610,742,TRANSFER,63416.99,C778071008,63416.99,0.0,C1812552860,0.0,0.0,1,0
6362611,742,CASH_OUT,63416.99,C994950684,63416.99,0.0,C1662241365,276433.18,339850.17,1,0
6362612,743,TRANSFER,1258818.82,C1531301470,1258818.82,0.0,C1470998563,0.0,0.0,1,0
6362613,743,CASH_OUT,1258818.82,C1436118706,1258818.82,0.0,C1240760502,503464.5,1762283.33,1,0
6362614,743,TRANSFER,339682.13,C2013999242,339682.13,0.0,C1850423904,0.0,0.0,1,0
6362615,743,CASH_OUT,339682.13,C786484425,339682.13,0.0,C776919290,0.0,339682.13,1,0
6362616,743,TRANSFER,6311409.28,C1529008245,6311409.28,0.0,C1881841831,0.0,0.0,1,0
6362617,743,CASH_OUT,6311409.28,C1162922333,6311409.28,0.0,C1365125890,68488.84,6379898.11,1,0
6362618,743,TRANSFER,850002.52,C1685995037,850002.52,0.0,C2080388513,0.0,0.0,1,0
6362619,743,CASH_OUT,850002.52,C1280323807,850002.52,0.0,C873221189,6510099.11,7360101.63,1,0


In [8]:
#  Checking the datatypes and info.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   step            int64  
 1   type            object 
 2   amount          float64
 3   nameOrig        object 
 4   oldbalanceOrg   float64
 5   newbalanceOrig  float64
 6   nameDest        object 
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         int64  
 10  isFlaggedFraud  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB


In [9]:
#  Summary statistics for the numeric columns of the dataset.
df.describe()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
count,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0
mean,243.3972,179861.9,833883.1,855113.7,1100702.0,1224996.0,0.00129082,2.514687e-06
std,142.332,603858.2,2888243.0,2924049.0,3399180.0,3674129.0,0.0359048,0.001585775
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,156.0,13389.57,0.0,0.0,0.0,0.0,0.0,0.0
50%,239.0,74871.94,14208.0,0.0,132705.7,214661.4,0.0,0.0
75%,335.0,208721.5,107315.2,144258.4,943036.7,1111909.0,0.0,0.0
max,743.0,92445520.0,59585040.0,49585040.0,356015900.0,356179300.0,1.0,1.0


# **Statistical Analysis Summary**

The dataset contains about 6.36 million transactions. Amounts and balances (amount, oldbalanceOrg, newbalanceDest, and the other balance columns) vary widely and are strongly right-skewed: the median amount is about 74,872, while the maximum is about 92.4 million. Fraudulent transactions (isFraud) are rare, with a mean of 0.0013 (about 0.13% of rows), and flagged cases (isFlaggedFraud) are rarer still, with a mean of about 2.5e-06. This rarity of the positive class is why class imbalance has to be addressed before modeling.
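Because isFraud is a 0/1 column, its mean is the share of fraudulent rows, so the imbalance can be read off directly. The sketch below shows this on a small toy frame with made-up values (the name `toy` is illustrative only); on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Toy stand-in for the real data: 1 fraudulent row out of 1,000 (hypothetical values)
toy = pd.DataFrame({"isFraud": [1] + [0] * 999})

# Absolute counts and relative frequencies of each class
counts = toy["isFraud"].value_counts()
shares = toy["isFraud"].value_counts(normalize=True)

# For a 0/1 column the mean equals the share of positive rows
fraud_rate = toy["isFraud"].mean()

print(counts)
print(shares)
print(f"Fraud rate: {fraud_rate:.4%}")
```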

In [10]:
#  Checking for Null Values
missing_values = df.isnull().sum()
print("Missing values per column:\n", missing_values)

Missing values per column:
 step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64


**According to the project guideline, no fraud data is available for merchant transactions, so the rows that involve a merchant account (names starting with 'M') are removed.**

In [11]:
#  Removing the rows where either party is a merchant account (name starts with 'M').
df2 = df[~df['nameOrig'].str.startswith('M') & ~df['nameDest'].str.startswith('M')]
print("Original dataframe shape:", df.shape)
print("Dataframe shape after dropping merchant transactions:", df2.shape)

Original dataframe shape: (6362620, 11)
Dataframe shape after dropping merchant transactions: (4211125, 11)
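Before dropping rows, it can be worth confirming that no fraud labels are lost with them. The following is a minimal sketch on toy rows with invented account names; `is_merchant` and `fraud_in_merchant_rows` are hypothetical helper names, not part of the notebook.

```python
import pandas as pd

# Toy rows (hypothetical values); an 'M' prefix marks a merchant account
toy = pd.DataFrame({
    "nameOrig": ["C1", "C2", "C3", "C4"],
    "nameDest": ["M10", "C20", "M30", "C40"],
    "isFraud":  [0, 1, 0, 0],
})

# True where either party is a merchant account
is_merchant = toy["nameOrig"].str.startswith("M") | toy["nameDest"].str.startswith("M")

# Fraud labels among the rows about to be dropped; 0 means no fraud case is lost
fraud_in_merchant_rows = int(toy.loc[is_merchant, "isFraud"].sum())

# Keep only the rows without a merchant on either side
customers_only = toy[~is_merchant]

print("Fraud rows among merchant transactions:", fraud_in_merchant_rows)
print("Rows kept:", len(customers_only))
```

The mask `~is_merchant` selects the same rows as the two negated `startswith` conditions combined with `&` in the cell above.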



**Balancing the Dataset.**

The isFraud variable is highly imbalanced, with far fewer fraudulent transactions than legitimate ones. To address this, the majority class is randomly undersampled: every fraudulent row is kept, and an equal number of legitimate rows is drawn at random, so both classes (1 and 0) have the same number of entries. This makes the dataset more suitable for training and reduces bias towards the majority class.

In [29]:
# df2 is the dataframe without merchant transactions
# Separate the two classes in the isFraud column
class_0 = df2[df2['isFraud'] == 0]
class_1 = df2[df2['isFraud'] == 1]

# Get the minimum sample size
min_sample_size = min(len(class_0), len(class_1))

# Take random samples from each class
balanced_class_0 = class_0.sample(n=min_sample_size, random_state=42)
balanced_class_1 = class_1.sample(n=min_sample_size, random_state=42)

# Combine the samples to form a balanced dataframe
df2  = pd.concat([balanced_class_0, balanced_class_1])

# Shuffle the resulting dataframe
df2 = df2.sample(frac=1, random_state=42).reset_index(drop=True)

print(f"Balanced dataset size: {df2.shape}")
print(df2['isFraud'].value_counts())


Balanced dataset size: (16426, 11)
isFraud
1    8213
0    8213
Name: count, dtype: int64


df2 is now a balanced dataset with equal numbers of fraudulent and non-fraudulent transactions.
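The cell above balances the data before any train/test split, so a test set taken later is balanced as well. One alternative ordering, sketched here on toy data and not what this notebook does, is to hold out the test rows first and undersample only the training rows. The helper `undersample` and the toy frame are assumptions made for illustration.

```python
import pandas as pd

def undersample(frame: pd.DataFrame, target: str, seed: int = 42) -> pd.DataFrame:
    """Randomly downsample every class to the size of the smallest class."""
    smallest = frame[target].value_counts().min()
    sampled = frame.groupby(target).sample(n=smallest, random_state=seed)
    # Shuffle so the classes are interleaved
    return sampled.sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy data (hypothetical): 5 fraudulent rows among 100
toy = pd.DataFrame({"amount": range(100), "isFraud": [1] * 5 + [0] * 95})

# Hold out 20% of each class first, so the test set keeps the original class ratio
test = toy.groupby("isFraud").sample(frac=0.2, random_state=42)
train = toy.drop(test.index)

# Balance the training rows only
train_balanced = undersample(train, "isFraud")

print(train_balanced["isFraud"].value_counts())
print(test["isFraud"].value_counts())
```

With this ordering the test metrics are measured at the original class ratio, at the cost of very few positive test rows when fraud is rare.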

In [31]:
df2

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,671,TRANSFER,61505.06,C1590661770,61505.06,0.00,C1045739215,0.00,0.00,1,0
1,380,TRANSFER,799797.88,C479893074,799797.88,0.00,C2076141378,0.00,0.00,1,0
2,226,CASH_OUT,283485.64,C743973427,252043.00,0.00,C1153134931,0.00,283485.64,0,0
3,349,CASH_OUT,353465.78,C490855557,0.00,0.00,C55500962,2667535.09,3021000.87,0,0
4,236,CASH_OUT,271588.03,C212428730,42138.00,0.00,C993613585,0.00,271588.03,0,0
...,...,...,...,...,...,...,...,...,...,...,...
16421,411,TRANSFER,427373.65,C985224754,427373.65,0.00,C2110832776,0.00,0.00,1,0
16422,645,TRANSFER,1576874.04,C1859428383,1576874.04,0.00,C1424577358,0.00,0.00,1,0
16423,681,CASH_IN,20092.65,C791271593,9151396.68,9171489.33,C1215233957,54978.47,34885.82,0,0
16424,333,CASH_IN,122019.26,C1364847276,0.00,122019.26,C1978196925,303069.37,181050.11,0,0


**4. Creating a new column, type_numeric, that encodes the categorical column type as numbers.**

In [37]:

type_mapping = {
    'CASH_OUT': 1,
    'TRANSFER': 2,
    'DEBIT': 3,
    'PAYMENT': 4,
    'CASH_IN': 5
}


df2['type_numeric'] = df2['type'].map(type_mapping)
print("DataFrame with numeric 'type' column:\n", df2[['type', 'type_numeric']].head())


DataFrame with numeric 'type' column:
        type  type_numeric
0  TRANSFER             2
1  TRANSFER             2
2  CASH_OUT             1
3  CASH_OUT             1
4  CASH_OUT             1


**Using the Variance Inflation Factor (VIF) to check multicollinearity among the variables.**

In [36]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm 
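# Note: this copies the original, unbalanced df (not the balanced df2), and the
# target isFraud stays among the columns, so it also receives a VIF value.
df2_cleaned = df.copy()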
X = df2_cleaned.select_dtypes(include=[float, int])
X = sm.add_constant(X)
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)

          feature         VIF
0           const    4.137111
1            step    1.003137
2          amount    3.771634
3   oldbalanceOrg  502.913267
4  newbalanceOrig  504.282321
5  oldbalanceDest   66.101079
6  newbalanceDest   76.200749
7         isFraud    1.186855
8  isFlaggedFraud    1.002562


# Variance Inflation Factor (VIF) Analysis Summary


**The VIF analysis evaluates multicollinearity among features:**

**High VIF Values:** oldbalanceOrg (502.91) and newbalanceOrig (504.28) indicate severe multicollinearity. The two columns record the same account's balance before and after the transaction, so they carry largely redundant information.


**Elevated VIF Values:** oldbalanceDest (66.10) and newbalanceDest (76.20) are lower, but still far above the commonly used thresholds of 5 to 10, so they also show significant multicollinearity.


**Low VIF Values:** Features like step (1.00), amount (3.77), isFraud (1.19), and isFlaggedFraud (1.00) have acceptable VIF values, indicating minimal collinearity.

This analysis highlights variables that may need removal or transformation to improve model stability and performance.
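To make the VIF numbers easier to interpret: the VIF of feature j is 1 / (1 - R_j²), where R_j² comes from regressing feature j on all the other features. The sketch below recomputes this with plain NumPy on synthetic columns; the function `vif` and the toy variables are illustrative assumptions, not the statsmodels implementation.

```python
import numpy as np

def vif(features: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    n_rows, n_cols = features.shape
    result = np.empty(n_cols)
    for j in range(n_cols):
        target = features[:, j]
        # Remaining columns plus an intercept term
        others = np.column_stack([np.ones(n_rows), np.delete(features, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        r_squared = 1.0 - residual.var() / target.var()
        result[j] = 1.0 / (1.0 - r_squared)
    return result

rng = np.random.default_rng(0)
old_balance = rng.uniform(0, 1_000, size=500)
# New balance = old balance minus a small payment, so the two columns move together
new_balance = old_balance - rng.uniform(0, 10, size=500)
# A column generated independently of the two balances
unrelated = rng.uniform(0, 1_000, size=500)

values = vif(np.column_stack([old_balance, new_balance, unrelated]))
print(values)
```

The two balance columns get very large VIF values because each is almost perfectly predicted by the other, while the independent column stays close to 1.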

**Checking the correlation of each variable in the dataset with the target (isFraud).**

In [40]:
import pandas as pd

# Select numerical variables only
numerical_columns = ['step', 'amount', 'oldbalanceOrg', 'newbalanceOrig',
                     'oldbalanceDest', 'newbalanceDest', 'isFraud', 'isFlaggedFraud', 'type_numeric']

# Compute the correlation matrix
correlation_matrix = df2[numerical_columns].corr()

# Get correlations with the 'isFraud' column
fraud_correlations = correlation_matrix['isFraud'].sort_values(ascending=False)

# Print correlations sorted by importance
print("Correlations with 'isFraud':")
fraud_correlations


Correlations with 'isFraud':


isFraud           1.000000
step              0.321213
amount            0.315449
oldbalanceOrg     0.060008
isFlaggedFraud    0.031225
newbalanceDest   -0.071121
oldbalanceDest   -0.151967
newbalanceOrig   -0.184499
type_numeric     -0.343490
Name: isFraud, dtype: float64

# Correlation Analysis Summary


The correlation analysis measures the linear relationship between each feature and the target variable isFraud in the balanced dataset (df2). Key observations include:

Positive Correlations: step (0.32) and amount (0.32) show moderate positive correlations with isFraud, so in this sample fraudulent transactions tend to occur at later steps and involve larger amounts.


Negative Correlations: type_numeric (-0.34) and newbalanceOrig (-0.18) have moderate to weak negative correlations with isFraud. Because the type_numeric codes are arbitrary labels, its coefficient depends on the chosen mapping; the negative sign is consistent with the fraud rows displayed above, which are all of type CASH_OUT or TRANSFER (codes 1 and 2).

This analysis helps prioritize variables for feature selection and modeling.
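For a nominal column such as type, the fraud rate per category is often easier to read than a correlation coefficient on integer codes. A minimal sketch on toy rows (the values are invented; on the notebook's data the same `groupby` would be applied to `df2`):

```python
import pandas as pd

# Toy rows (hypothetical values)
toy = pd.DataFrame({
    "type":    ["TRANSFER", "TRANSFER", "CASH_OUT", "CASH_OUT", "PAYMENT", "CASH_IN"],
    "isFraud": [1, 0, 1, 1, 0, 0],
})

# Mean of a 0/1 target within each group = fraud rate of that transaction type
fraud_rate_by_type = (
    toy.groupby("type")["isFraud"].mean().sort_values(ascending=False)
)
print(fraud_rate_by_type)
```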

In [41]:
# Select the feature columns (double square brackets return a DataFrame)
X = df2[['type_numeric', 'oldbalanceOrg','newbalanceOrig','amount']]  # Select specific columns as features
y = df2['isFraud']  # Select the target column

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Print the shapes of training and testing sets
print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)


Training set shape: (13140, 4)
Test set shape: (3286, 4)
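The `stratify=y` argument keeps the class proportions the same in the training and test splits. The sketch below checks this on a toy frame with an even class split; the `*_toy`, `*_tr`, and `*_te` names are illustrative and do not overwrite the notebook's variables.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 rows, half of them labelled as fraud
toy = pd.DataFrame({"amount": range(100), "isFraud": [0, 1] * 50})
X_toy, y_toy = toy[["amount"]], toy["isFraud"]

# Same call pattern as above: 20% test split, stratified on the target
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Share of fraud rows in each split
print("Train fraud share:", y_tr.mean())
print("Test fraud share:", y_te.mean())
```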


In [44]:
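# Random Forest with 100 trees; random_state fixes the seed for reproducibility
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Fit the model on the training split
rf_model.fit(X_train, y_train)

# Predict class labels for the held-out test split
y_pred = rf_model.predict(X_test)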

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

print("Accuracy Score:", accuracy_score(y_test, y_pred))

Confusion Matrix:
[[1623   20]
 [   3 1640]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.99      0.99      1643
           1       0.99      1.00      0.99      1643

    accuracy                           0.99      3286
   macro avg       0.99      0.99      0.99      3286
weighted avg       0.99      0.99      0.99      3286

Accuracy Score: 0.9930006086427268


# Model Evaluation Summary.


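The model is evaluated on the 3,286 held-out test transactions using the confusion matrix, classification report, and accuracy score. The confusion matrix shows 1,623 legitimate transactions classified correctly, 20 legitimate transactions flagged as fraud (false positives), 3 fraudulent transactions missed (false negatives), and 1,640 fraudulent transactions detected. This gives precision, recall, and F1-scores between 0.99 and 1.00 for both classes and an overall accuracy of 0.993. Because the test set is drawn from the balanced sample (1,643 transactions per class), these figures describe performance at a 50/50 class ratio rather than at the original fraud rate of about 0.13%.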
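The reported scores can be recomputed by hand from the four cells of the confusion matrix, which makes clear what each metric counts. This sketch uses only the numbers printed in the output above, treating fraud (class 1) as the positive class; scikit-learn lays the matrix out as [[TN, FP], [FN, TP]].

```python
# Counts copied from the confusion matrix printed above
tn, fp = 1623, 20   # actual class 0: correctly passed / wrongly flagged
fn, tp = 3, 1640    # actual class 1: missed / correctly detected

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of fraud predictions that are correct
recall = tp / (tp + fn)      # share of actual fraud that is detected
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```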