In [18]:
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn import metrics

In [2]:
data = pd.read_csv('Fraud_pr.csv')


In [3]:
data.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


In [4]:
data.isna().sum()

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

In [20]:
data.isFraud.value_counts()

isFraud
0    1047433
1       1142
Name: count, dtype: int64

In [21]:
data.isFlaggedFraud.value_counts()


isFlaggedFraud
0    1048575
Name: count, dtype: int64

In [5]:
data=data.drop(['nameOrig','nameDest'],axis=1)

In [6]:
label_encoder = preprocessing.LabelEncoder()
data['type'] = label_encoder.fit_transform(data['type'])

In [7]:

X = data.drop('isFraud', axis=1)
y = data['isFraud']

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state=42)


In [9]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [21]:
#Using GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred_nb = gnb.predict(X_test)

print("Naive Bayes Results:")
print("Accuracy:", accuracy_score(y_test, y_pred_nb))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_nb))
print("Classification Report:\n", classification_report(y_test, y_pred_nb))

Naive Bayes Results:
Accuracy: 0.9717664449371767
Confusion Matrix:
 [[407439  11546]
 [   296    149]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.97      0.99    418985
           1       0.01      0.33      0.02       445

    accuracy                           0.97    419430
   macro avg       0.51      0.65      0.51    419430
weighted avg       1.00      0.97      0.98    419430



In [23]:
# Using Logistic Regression

lr = LogisticRegression(random_state=0, max_iter=1000)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

print("Logistic Regression Results:")
print("Accuracy:", accuracy_score(y_test, y_pred_lr))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_lr))
print("Classification Report:\n", classification_report(y_test, y_pred_lr))

Logistic Regression Results:
Accuracy: 0.9990892401592638
Confusion Matrix:
 [[418980      5]
 [   377     68]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    418985
           1       0.93      0.15      0.26       445

    accuracy                           1.00    419430
   macro avg       0.97      0.58      0.63    419430
weighted avg       1.00      1.00      1.00    419430



**Fraud Detection in a Financial Firm**

1. Data Cleaning (Missing Values, Outliers, Multicollinearity)

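    Missing Values: data.isna().sum() returned 0 for every column, so no imputation was needed.

    Outliers: Outliers in transaction amounts can distort results. I reviewed the distributions and standardized the features with StandardScaler. Standardization puts the features on a common scale but does not by itself limit the influence of outliers, because the mean and standard deviation it relies on are sensitive to extreme values. A RobustScaler or a log transform of amount would be better suited to that purpose.

    Multicollinearity: I removed nameOrig and nameDest because they are identifiers rather than features. This step is not a multicollinearity fix, but it keeps high-cardinality strings out of the model. I also checked feature correlations for redundancy (that check is not shown in the cells above). The balance pairs, oldbalanceOrg/newbalanceOrig and oldbalanceDest/newbalanceDest, are the most likely candidates for strong correlation, and all four columns were kept.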
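A correlation check of the kind described above could look like the following minimal sketch. It runs on synthetic stand-in data, not on Fraud_pr.csv, and the 0.9 cutoff is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for three of the numeric columns (NOT the real dataset).
rng = np.random.default_rng(0)
old_balance = rng.uniform(0, 100_000, size=1_000)
amount = rng.uniform(0, 10_000, size=1_000)
demo = pd.DataFrame({
    "amount": amount,
    "oldbalanceOrg": old_balance,
    # The new balance is derived from the old one, so the two are nearly collinear.
    "newbalanceOrig": np.clip(old_balance - amount, 0, None),
})

corr = demo.corr()  # pairwise Pearson correlations

CUTOFF = 0.9  # hypothetical threshold for "too correlated"
high_pairs = [
    (a, b, round(float(corr.loc[a, b]), 3))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > CUTOFF
]
print(high_pairs)
```

On the real data, the same loop over `data.corr(numeric_only=True)` would list the pairs worth reviewing before deciding whether to drop or combine columns.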

2. Fraud Detection Model Description

   I implemented two models:

    * Naive Bayes Classifier

    * Logistic Regression

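    I used LabelEncoder to convert the type column to integer codes and standardized all features with a StandardScaler fitted on the training set only. The dataset was split into 60% training and 40% testing for evaluation. The split was not stratified, so the share of fraud cases in each part is left to chance.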
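To make the encoding step concrete, the sketch below applies LabelEncoder to the five type values shown in the data.head() output. LabelEncoder assigns codes in sorted order of the class labels, which implies an ordering between transaction types that has no real meaning. The one-hot variant with pd.get_dummies is shown only as a possible alternative, not as what the notebook does.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# The "type" values from the five rows printed by data.head().
types = pd.Series(["PAYMENT", "PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"])

encoder = LabelEncoder()
codes = encoder.fit_transform(types)

# Classes are sorted alphabetically, then numbered from 0.
print(list(encoder.classes_))
print(codes.tolist())

# Alternative (not used in the notebook): one indicator column per type,
# which avoids implying CASH_OUT < PAYMENT < TRANSFER.
one_hot = pd.get_dummies(types, prefix="type")
print(list(one_hot.columns))
```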

3. Variable Selection

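      All numerical and encoded categorical features were included:

      step, amount, oldbalanceOrg, newbalanceOrig, oldbalanceDest, newbalanceDest, and type.

      isFlaggedFraud was also left in X, but it is 0 for every row in this dataset, so it carries no information for the models.

      Columns like nameOrig and nameDest were removed because, as near-unique identifiers, they don’t contribute predictive value and may lead to overfitting.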

4. Model Performance

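   Accuracy:

    Naive Bayes: ~97.18%

    Logistic Regression: ~99.91%

    I used confusion_matrix and classification_report to evaluate precision, recall, and F1-score. Accuracy alone is misleading here: only 445 of the 419,430 test transactions are fraudulent, so a model that always predicts "not fraud" would already score about 99.89%.

    Logistic Regression is the more precise model on the fraud class (precision 0.93, 5 false positives) but misses most fraud (recall 0.15, 377 false negatives). Naive Bayes catches more fraud (recall 0.33, 296 false negatives) at the cost of 11,546 false positives (precision 0.01). Neither model balances the two error types well.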
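The fraud-class metrics can be recomputed directly from the two confusion matrices printed above, which makes the gap between accuracy and recall easy to see. The helper name `summarize` is introduced here for illustration only.

```python
def summarize(tn, fp, fn, tp):
    """Accuracy plus precision, recall, and F1 for the positive (fraud) class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Counts copied from the confusion matrices in the notebook output:
# [[TN, FP], [FN, TP]]
nb = summarize(tn=407439, fp=11546, fn=296, tp=149)
lr = summarize(tn=418980, fp=5, fn=377, tp=68)

# Accuracy of a trivial model that labels every test transaction "not fraud".
baseline_accuracy = 418985 / 419430

for name, m in [("Naive Bayes", nb), ("Logistic Regression", lr)]:
    print(name, {k: round(v, 4) for k, v in m.items()})
print("Always-negative baseline accuracy:", round(baseline_accuracy, 4))
```

Logistic Regression beats the trivial baseline on accuracy by only a small margin, and Naive Bayes falls below it, so recall and precision on the fraud class are the more informative numbers.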

5. Key Factors That Predict Fraudulent Transactions

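    The factors below are based on domain reasoning and the sample rows; the fitted model coefficients were not inspected in the notebook.

    Transaction Type: Certain transaction types (e.g., TRANSFER, CASH_OUT) are expected to be associated with fraud. Both fraudulent rows in the data.head() output are of these types.

    Amount: Unusually high transaction amounts can be suspicious, although the two fraudulent rows shown above are small (181.0 each).

    Old and New Balances: Fraudulent transactions often empty the origin account or show unusual balances.

    Step: The time step of the transaction, included so the model can pick up time-based patterns.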
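One way to check which features a fitted logistic regression actually relies on is to rank its coefficients by absolute value. The sketch below does this on synthetic data with two made-up columns, `informative` and `noise`. It is not run on the fraud dataset and says nothing about its real coefficients.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data (NOT the fraud dataset): the label depends only on "informative".
rng = np.random.default_rng(0)
n = 2_000
X_demo = pd.DataFrame({
    "informative": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y_demo = (3.0 * X_demo["informative"] + rng.logistic(size=n) > 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Both columns are on the same scale, so coefficient magnitudes are comparable.
coefs = pd.Series(model.coef_[0], index=X_demo.columns)
ranked = coefs.abs().sort_values(ascending=False)
print(ranked)
```

In the notebook the equivalent would be `pd.Series(lr.coef_[0], index=X.columns)`, which is meaningful there because the features were standardized before fitting.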

6. Do These Factors Make Sense?

    Yes:

    Fraud is more common in TRANSFER and CASH_OUT because these types are used to move funds quickly.

    Zero balances before/after transactions are red flags.

    Large amounts or sudden balance changes are suspicious behavior.
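The claim about transaction types can be checked with a simple group-by. The sketch below uses only the five rows printed by data.head(), so it shows the mechanics rather than a real result. On the full dataset the same call on `data` would give the actual fraud rate per type, provided it runs before the type column is label-encoded.

```python
import pandas as pd

# The five rows shown by data.head(), reduced to the two columns needed.
sample = pd.DataFrame({
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "isFraud": [0, 0, 1, 1, 0],
})

# Fraud rate (mean), fraud count (sum), and row count per transaction type.
fraud_by_type = sample.groupby("type")["isFraud"].agg(["mean", "sum", "count"])
print(fraud_by_type)
```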

7. Prevention Recommendations

    Real-Time Monitoring: Flag high-value transactions or unusual behavior using rules or ML models.

    Multi-Factor Authentication (MFA): Add identity verification steps for risky operations.

    User Behavior Profiling: Learn normal user patterns and detect deviations.

    Limit High-Risk Transactions: Impose thresholds or extra checks on TRANSFER or CASH_OUT.
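A rule of the kind described under "Limit High-Risk Transactions" might be sketched as follows. The function name, the 10,000 threshold, and the "drains the account" condition are all assumptions made for illustration, not rules derived from the data.

```python
HIGH_RISK_TYPES = {"TRANSFER", "CASH_OUT"}
AMOUNT_THRESHOLD = 10_000.0  # hypothetical limit, would need tuning

def needs_extra_check(tx_type: str, amount: float, old_balance: float) -> bool:
    """Return True if a transaction should get an additional verification step."""
    if tx_type not in HIGH_RISK_TYPES:
        return False
    # Hypothetical rule: moving the entire available balance is treated as risky.
    drains_account = old_balance > 0 and amount >= old_balance
    return amount >= AMOUNT_THRESHOLD or drains_account

# Rows taken from the data.head() output: (type, amount, oldbalanceOrg).
print(needs_extra_check("TRANSFER", 181.0, 181.0))      # empties the account
print(needs_extra_check("PAYMENT", 9839.64, 170136.0))  # not a high-risk type
```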

8. How to Measure Success of Preventive Actions

    Reduction in Fraud Rate: Track how many frauds are prevented vs. previous baseline.

    False Positive Rate: Ensure that legitimate transactions are not wrongly flagged.

    Customer Feedback: Monitor for complaints due to blocked transactions.

    Precision/Recall: Re-evaluate model performance after changes.
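The false positive rate and the related alert workload can be tracked with a few ratios. As a sketch, the helper below (a hypothetical name, `alert_stats`) is applied to the two confusion matrices from the notebook output. After a preventive change, the same ratios would be recomputed on new data and compared with these baselines.

```python
def alert_stats(tn, fp, fn, tp):
    """Operational ratios derived from a confusion matrix."""
    return {
        # Share of legitimate transactions wrongly flagged.
        "false_positive_rate": fp / (fp + tn),
        # How many alerts must be reviewed for each fraud actually caught.
        "alerts_per_caught_fraud": (tp + fp) / tp,
        # Share of fraud that slips through unflagged.
        "missed_fraud_share": fn / (fn + tp),
    }

# Counts copied from the notebook's confusion matrices.
nb_stats = alert_stats(tn=407439, fp=11546, fn=296, tp=149)
lr_stats = alert_stats(tn=418980, fp=5, fn=377, tp=68)

print("Naive Bayes:", nb_stats)
print("Logistic Regression:", lr_stats)
```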