
# Fraud Transaction Detection

## Objective
The objective of this project is to build a machine learning model to detect fraudulent financial transactions and to derive actionable business insights that can help a financial company prevent fraud proactively.


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

## Dataset Description
The dataset used is the PaySim simulated mobile money transaction dataset.  
It contains millions of transactions and is highly imbalanced, with fraudulent transactions forming a very small percentage of the total data.


In [None]:
df = pd.read_csv("/content/PS_20174392719_1491204439457_log.csv")

In [None]:
df.sample()


Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
2864215,227,CASH_OUT,17921.97,C1905211569,3565.0,0.0,C286556195,87309.89,105231.86,0.0,0.0


In [None]:
df.shape


(4707100, 11)

In [None]:
df['isFraud'].value_counts()


isFraud,count
0.0,4703372
1.0,3727
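To make the imbalance concrete, the counts above can be turned into a fraud rate. This is a minimal sketch that re-enters the two counts by hand instead of reloading the CSV; the variable names are illustrative only.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({0.0: 4703372, 1.0: 3727}, name="count")

fraud_rate = counts.loc[1.0] / counts.sum()          # share of fraudulent rows
imbalance_ratio = counts.loc[0.0] / counts.loc[1.0]  # legitimate rows per fraud row

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Legitimate transactions per fraud: {imbalance_ratio:.0f}")
```

Fraud makes up roughly 0.08% of the rows, which is why accuracy alone is a poor yardstick for this problem.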


## 1. Data cleaning including missing values, outliers and multi-collinearity.

## Data Cleaning
### Missing Values
Missing values were checked to ensure data quality before model building.


In [None]:
df.isnull().sum()

column,missing_count
step,0
type,0
amount,0
nameOrig,1
oldbalanceOrg,1
newbalanceOrig,1
nameDest,1
oldbalanceDest,1
newbalanceDest,1
isFraud,1


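Each affected column has exactly one missing value, which suggests a single incomplete row (for example, a truncated final line in the CSV). Since this is negligible against roughly 4.7 million rows, the affected rows were dropped rather than imputed, which avoids introducing bias through imputation.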
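Before dropping rows, it can help to look at which rows are incomplete. The sketch below uses a small made-up frame (`toy`) in place of `df`; the values are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; the last row is cut off part-way through
toy = pd.DataFrame({
    "step": [1, 1, 2],
    "amount": [9839.64, 181.00, 250.00],
    "nameOrig": ["C1", "C2", None],
    "isFraud": [0.0, 1.0, np.nan],
})

# Keep only the rows that contain at least one missing value
incomplete = toy[toy.isnull().any(axis=1)]
print(incomplete)
print("Rows affected:", len(incomplete), "of", len(toy))

# Dropping them removes whole rows, not individual cells
cleaned = toy.dropna()
```

If the incomplete rows turn out to be a single truncated record, dropping them is safe. If they cluster in one class or one transaction type, that would be worth investigating first.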


In [None]:
df = df.dropna()
df.isnull().sum()

column,missing_count
step,0
type,0
amount,0
nameOrig,0
oldbalanceOrg,0
newbalanceOrig,0
nameDest,0
oldbalanceDest,0
newbalanceDest,0
isFraud,0


### Outliers
Transaction amount was found to be highly right-skewed.  
A `log1p` transformation, log(1 + amount), was applied to compress extreme values while still handling zero amounts. No rows were removed as outliers.


In [None]:
df['log_amount'] = np.log1p(df['amount'])
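The effect of the transformation can be checked by comparing skewness before and after. The sketch below uses synthetic log-normal amounts as a stand-in for `df['amount']`; the distribution parameters are assumptions chosen only to produce a heavy right tail.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed amounts (assumption: log-normal, not the real data)
rng = np.random.default_rng(42)
amount = pd.Series(rng.lognormal(mean=10, sigma=2, size=100_000), name="amount")

log_amount = np.log1p(amount)  # same transform as in the notebook

skew_raw = amount.skew()
skew_log = log_amount.skew()
print(f"Skewness before: {skew_raw:.2f}")
print(f"Skewness after:  {skew_log:.2f}")
```

On the real data the same two calls, `df['amount'].skew()` and `df['log_amount'].skew()`, would show how much of the skew the transform removes.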


### Multicollinearity
Correlation between balance-related variables was analyzed to detect multicollinearity.

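The origin pair (`oldbalanceOrg`, `newbalanceOrig`) and the destination pair (`oldbalanceDest`, `newbalanceDest`) are each very strongly correlated (about 0.999 and 0.966 in the matrix below), while correlations across the two pairs are weak. All four variables were nevertheless retained because they represent meaningful financial behavior.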


In [None]:
df[['oldbalanceOrg','newbalanceOrig','oldbalanceDest','newbalanceDest']].corr()

Unnamed: 0,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest
oldbalanceOrg,1.0,0.999106,0.070548,0.04151
newbalanceOrig,0.999106,1.0,0.072091,0.041344
oldbalanceDest,0.070548,0.072091,1.0,0.966439
newbalanceDest,0.04151,0.041344,0.966439,1.0


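The highly correlated pairs were reviewed and all four balance variables were kept. This has little effect on predictive performance, but it makes the individual logistic regression coefficients for these variables unstable and harder to interpret.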
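One way to quantify the multicollinearity is the variance inflation factor (VIF). For standardized predictors, the VIFs are the diagonal entries of the inverse correlation matrix, so they can be sketched directly from the matrix reported above. This calculation is an addition for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

cols = ["oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest"]

# Correlation matrix copied from the output above
corr = pd.DataFrame(
    [
        [1.0,      0.999106, 0.070548, 0.04151],
        [0.999106, 1.0,      0.072091, 0.041344],
        [0.070548, 0.072091, 1.0,      0.966439],
        [0.04151,  0.041344, 0.966439, 1.0],
    ],
    index=cols,
    columns=cols,
)

# VIF_i is the i-th diagonal entry of the inverse correlation matrix
vif = pd.Series(np.diag(np.linalg.inv(corr.to_numpy())), index=cols, name="VIF")
print(vif.round(1))
```

All four values land well above the common rule-of-thumb threshold of 10, with the origin balances far higher than the destination balances. This only considers the four balance columns; including the other model features could raise the VIFs further.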


## 2. Describe your fraud detection model in detail.
## Fraud Detection Model
Logistic Regression was used as the primary fraud detection model.
It was chosen for its interpretability and simplicity, and because it is a natural baseline for binary classification.
Logistic regression does not handle class imbalance on its own, so class weighting (`class_weight='balanced'`) was applied to give more importance to fraudulent transactions.
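To show what class weighting does in practice, the sketch below reproduces the formula scikit-learn documents for `class_weight='balanced'`, `n_samples / (n_classes * count_per_class)`, using the full-dataset class counts. The model itself is fitted on the training split only, so the actual weights differ slightly; stratification keeps the proportions nearly identical.

```python
import numpy as np

# Class counts from the full dataset: [legitimate, fraud]
class_counts = np.array([4_703_372, 3_727])

n_samples = class_counts.sum()
n_classes = len(class_counts)

# "balanced" heuristic: n_samples / (n_classes * count_per_class)
weights = n_samples / (n_classes * class_counts)

print("Weight for class 0 (legitimate):", round(weights[0], 4))
print("Weight for class 1 (fraud):     ", round(weights[1], 2))
print("Relative weight of a fraud row: ", round(weights[1] / weights[0]))
```

Each fraudulent row counts for over a thousand legitimate rows in the loss, which is what pushes recall up and precision down.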


## 3. How did you select variables to be included in the model?
## Feature Selection
Variables were selected based on:
- Domain understanding of transaction behavior
- Correlation analysis
- Ability to capture balance changes and transaction patterns


In [None]:

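# Drop account identifiers: they are high-cardinality strings, not numeric features
df = df.drop(columns=['nameOrig', 'nameDest'])

# One-hot encode the transaction type; drop_first=True drops the first category
# (CASH_IN), which becomes the baseline for the remaining type_* columns
df = pd.get_dummies(df, columns=['type'], drop_first=True)

# step, raw amount, and isFlaggedFraud are left out of the feature set
features = [
    'log_amount',
    'oldbalanceOrg', 'newbalanceOrig',
    'oldbalanceDest', 'newbalanceDest'
] + [c for c in df.columns if c.startswith('type_')]

X = df[features]
y = df['isFraud']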
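The feature list above passes the raw old and new balances and leaves it to the model to infer balance changes. A possible alternative is to make those changes explicit, as sketched below. The derived columns (`orig_delta`, `dest_delta`, `orig_emptied`) are hypothetical names and were not used in the model reported here. The first toy row is the sampled transaction shown earlier; the second is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "amount":         [17921.97, 5000.00],
    "oldbalanceOrg":  [3565.00, 5000.00],
    "newbalanceOrig": [0.00, 0.00],
    "oldbalanceDest": [87309.89, 0.00],
    "newbalanceDest": [105231.86, 0.00],
})

# Amount that left the origin account and amount that reached the destination
toy["orig_delta"] = toy["oldbalanceOrg"] - toy["newbalanceOrig"]
toy["dest_delta"] = toy["newbalanceDest"] - toy["oldbalanceDest"]

# Flag origin accounts that were emptied by the transaction
toy["orig_emptied"] = (toy["oldbalanceOrg"] > 0) & (toy["newbalanceOrig"] == 0)

print(toy[["orig_delta", "dest_delta", "orig_emptied"]])
```

Whether such features help would need to be checked on the held-out data; they are also linear combinations of existing columns, so they would add to the multicollinearity discussed earlier.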


In [None]:
df.sample()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,log_amount,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
1432364,139,316845.08,627867.22,944712.29,590626.72,273781.64,0.0,0.0,12.666171,False,False,False,False


## 4. Demonstrate the performance of the model using the best set of tools.
## Model Performance
The model performance was evaluated using Precision, Recall, F1-score, and ROC-AUC.
Recall was prioritized because missing a fraudulent transaction is more costly than flagging a genuine one.


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

# Stratify so the train and test sets keep the same (tiny) fraud proportion
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# class_weight='balanced' reweights classes inversely to their frequency
model = LogisticRegression(max_iter=1000, class_weight='balanced')
model.fit(X_train, y_train)

# Hard labels use the default 0.5 threshold; probabilities feed ROC-AUC
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, y_prob))


              precision    recall  f1-score   support

         0.0       1.00      0.95      0.98    940675
         1.0       0.02      0.90      0.03       745

    accuracy                           0.95    941420
   macro avg       0.51      0.93      0.50    941420
weighted avg       1.00      0.95      0.98    941420

ROC-AUC: 0.9877641954593865


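Because of the extreme class imbalance, precision for the fraud class is very low (0.02): at the default 0.5 decision threshold, only about 2 in 100 flagged transactions are actually fraudulent. This trade-off was accepted because the model recovers 90% of fraudulent transactions (recall 0.90) with a ROC-AUC of 0.988, and minimizing missed fraud was the priority. The overall accuracy of 0.95 is not informative here, since labeling every transaction as legitimate would score above 0.99.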
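The precision and recall figures above depend on the decision threshold, so a threshold sweep is a natural follow-up. The sketch below uses scikit-learn's `precision_recall_curve` and `average_precision_score` on synthetic labels and scores; `y_true` and `y_score` are stand-ins for the notebook's `y_test` and `y_prob`, and the 0.90 recall target is an assumption.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Synthetic, imbalanced stand-in data (about 1% positives)
rng = np.random.default_rng(0)
n = 20_000
y_true = (rng.random(n) < 0.01).astype(int)
# Positives tend to score higher than negatives; scores are clipped to [0, 1]
y_score = np.clip(rng.normal(0.2 + 0.6 * y_true, 0.15), 0.0, 1.0)

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
pr_auc = average_precision_score(y_true, y_score)

# Highest threshold that still meets the recall target.
# recall[:-1] aligns with thresholds; recall falls as the threshold rises.
target_recall = 0.90
meets_target = np.flatnonzero(recall[:-1] >= target_recall)
best = meets_target[-1]

print(f"Average precision (PR-AUC): {pr_auc:.3f}")
print(f"Threshold: {thresholds[best]:.3f}, "
      f"precision: {precision[best]:.3f}, recall: {recall[best]:.3f}")
```

Under heavy imbalance, average precision is usually a more telling summary than ROC-AUC, because it is driven by how many flagged transactions are truly fraudulent.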


## 5. What are the key factors that predict a fraudulent customer?
## Key Factors Predicting Fraud
Model coefficients were analyzed to identify the most important predictors of fraudulent transactions.


In [None]:
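# Coefficients are sorted by signed value: positive values raise the
# predicted odds of fraud, negative values lower them
importance = pd.Series(
    model.coef_[0],
    index=X.columns
).sort_values(ascending=False)

# The model has nine features, so head(10) shows all of them
importance.head(10)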


feature,coefficient
type_TRANSFER,3.105234
type_CASH_OUT,1.010427
oldbalanceOrg,2.3e-05
oldbalanceDest,6e-06
newbalanceDest,-6e-06
newbalanceOrig,-2.8e-05
type_DEBIT,-0.066201
log_amount,-0.172828
type_PAYMENT,-4.035114


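TRANSFER and CASH_OUT transactions carry the largest positive coefficients (3.11 and 1.01, relative to the dropped baseline type, CASH_IN), while PAYMENT carries a strongly negative one (-4.04). The balance coefficients look tiny only because those features are in raw currency units and were not standardized, so their magnitudes cannot be compared directly with the one-hot coefficients: a coefficient of 2.3e-05 on a balance of 100,000 still shifts the log-odds by about 2.3. The opposite signs on the origin balances (positive on `oldbalanceOrg`, negative on `newbalanceOrig`) are consistent with fraud being associated with accounts that are drained by the transaction.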
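Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below does this for the transaction-type coefficients reported above, re-entered by hand. Each ratio is relative to the dropped baseline type (CASH_IN) with the other features held fixed.

```python
import numpy as np
import pandas as pd

# Transaction-type coefficients copied from the output above
coefs = pd.Series({
    "type_TRANSFER": 3.105234,
    "type_CASH_OUT": 1.010427,
    "type_DEBIT": -0.066201,
    "type_PAYMENT": -4.035114,
})

# exp(coefficient) = multiplicative change in the odds of fraud
odds_ratios = np.exp(coefs)
print(odds_ratios.round(3))
```

In this model, a TRANSFER has roughly 22 times the odds of being fraudulent compared with an otherwise identical CASH_IN, while a PAYMENT has under 2% of those odds.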


## 6. Do these factors make sense? If yes, how? If not, why not?
Yes, these factors make sense. Fraudsters typically try to transfer or cash out money quickly, before the fraud is detected. That behavior shows up as a concentration of fraud in TRANSFER and CASH_OUT transactions and as sudden drops in the origin account balance, both of which the model captures.

## 7. What kind of prevention should be adopted while the company updates its infrastructure?
## Fraud Prevention Strategies
- Real-time monitoring of high-risk transactions
- Velocity checks for rapid consecutive transactions
- Multi-factor authentication for large or suspicious transactions
- Threshold-based alerts for abnormal balance changes
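As an illustration of the velocity check listed above, the sketch below flags accounts that send more than a set number of transactions within one time step. The toy data and the `MAX_TX_PER_STEP` limit are hypothetical; a production rule would use real timestamps and a tuned limit.

```python
import pandas as pd

# Toy transactions: account C1 sends three transactions within step 1
tx = pd.DataFrame({
    "nameOrig": ["C1", "C1", "C1", "C1", "C2", "C2", "C3"],
    "step":     [1, 1, 1, 2, 1, 5, 3],
    "amount":   [900.0, 950.0, 990.0, 50.0, 120.0, 80.0, 4000.0],
})

MAX_TX_PER_STEP = 2  # hypothetical limit, chosen only for this example

# Number of transactions each account made within the same step
tx["tx_in_step"] = tx.groupby(["nameOrig", "step"])["amount"].transform("count")

# Raise an alert on every transaction that is part of a burst
tx["velocity_alert"] = tx["tx_in_step"] > MAX_TX_PER_STEP

print(tx)
```

Only the three step-1 transactions from C1 are flagged. In a real-time system the same count would be maintained incrementally per account instead of being recomputed over a full table.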


## 8. Assuming these actions have been implemented, how would you determine if they work?
## Measuring Effectiveness of Fraud Prevention
- Monitor reduction in fraud rate over time
- Track false positive rates
- Perform A/B testing on fraud detection rules
- Periodically retrain the model with recent data
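For the A/B testing point above, one simple way to compare fraud rates between a control group and a group with the new rules is a two-proportion z-test. The function name and the counts below are made up for illustration. A real evaluation would also need random assignment and a sample size fixed in advance.

```python
from math import erfc, sqrt


def two_proportion_z_test(x_a, n_a, x_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    x_a, x_b: number of fraud cases in each group
    n_a, n_b: number of transactions in each group
    """
    p_a, p_b = x_a / n_a, x_b / n_b
    p_pool = (x_a + x_b) / (n_a + n_b)  # pooled rate under the null hypothesis
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = erfc(abs(z) / sqrt(2))    # two-sided p-value from the normal CDF
    return z, p_value


# Hypothetical counts: control group (A) vs. group with new prevention rules (B)
z, p_value = two_proportion_z_test(x_a=80, n_a=100_000, x_b=50, n_b=100_000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the drop in fraud rate is unlikely to be chance alone. The false positive rate should be compared between the groups in the same way, so that a lower fraud rate is not bought with a flood of blocked legitimate transactions.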
