# Fraudulent Transaction Detection

Cell 1 - Reading Dataset

In [None]:
import pandas as pd
dataset = pd.read_csv('Fraud.csv')

Cell 2 - Dropping Unnecessary Features

In [None]:
dataset = dataset.drop(columns=['nameOrig','nameDest'])

Cell 3 - Checking for missing values

In [None]:
print(dataset.isnull().sum())

Cell 4 - Checking for outliers

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

sns.boxplot(x=dataset['amount'])
plt.show()

# Keep only non-negative amounts before deriving features; .copy() avoids
# SettingWithCopyWarning on the column assignments below.
dataset = dataset[dataset['amount'] >= 0].copy()

# IQR fences used to flag outliers
Q1 = dataset['amount'].quantile(0.25)
Q3 = dataset['amount'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# log1p reduces right skew and is defined at amount == 0
dataset['log_amount'] = np.log1p(dataset['amount'])

# Cap the amount at the 1st and 99th percentiles
lower_cap = dataset['amount'].quantile(0.01)
upper_cap = dataset['amount'].quantile(0.99)
dataset['amount_capped'] = dataset['amount'].clip(lower=lower_cap, upper=upper_cap)

# Binary flag for amounts outside the IQR fences
dataset['is_outlier_amount'] = ((dataset['amount'] < lower_bound) | (dataset['amount'] > upper_bound)).astype(int)

Cell 5 - Check for multi-collinearity

In [None]:
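# Temporarily encode 'type' as integers so that it appears in the correlation matrix.
# The keys must match the labels in Fraud.csv exactly (check dataset['type'].unique());
# any label missing from this map is silently turned into NaN.
dataset['type'] = dataset['type'].map({'CASH-IN':1, 'CASH-OUT':2,'DEBIT':3,'PAYMENT':4,'TRANSFER':5})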

corr_matrix = dataset.corr()
plt.figure(figsize=(10,8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()

Cell 6 - Dropping Features due to multi-collinearity

In [None]:
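# Drop the pre-transaction balances identified as highly correlated in the heatmap
dataset = dataset.drop(columns=['oldbalanceOrg','oldbalanceDest'])

# Restore the string labels of 'type' so it can be one-hot encoded in Cell 8
dataset['type'] = dataset['type'].map({1:'CASH-IN',2:'CASH-OUT',3:'DEBIT',4:'PAYMENT',5:'TRANSFER'})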

X = dataset.drop(columns=['isFraud'])
y = dataset['isFraud']

Cell 7 - Splitting data for train and test

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=28)

Cell 8 - Pipelining

In [None]:
from xgboost import XGBClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

num_features = ['step','amount','newbalanceOrig','newbalanceDest','isFlaggedFraud','log_amount','amount_capped', 'is_outlier_amount']
cat_features = ['type']

categorical_pipeline = Pipeline([
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

numeric_pipeline = Pipeline([
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, num_features),
    ('cat', categorical_pipeline, cat_features)
])

pipe = Pipeline([
    ('processor',preprocessor),
    ('model', XGBClassifier())
])

pipe.fit(X_train,y_train)

y_pred = pipe.predict(X_test)

Cell 9 - Evaluation Of Model

In [None]:
from sklearn.metrics import precision_score,recall_score,f1_score
print('Precision:',precision_score(y_test,y_pred))
print('Recall:',recall_score(y_test,y_pred))
print('F1-score:',f1_score(y_test,y_pred))

## Results of Different Classification Algorithms

| Model | Precision | Recall | F1-score |
|---|---|---|---|
| Logistic Regression + SMOTE | 0.0059 | 0.878 | 0.0119 |
| LGBMClassifier(min_data_in_leaf=20) | 0.5110 | 0.456 | 0.4821 |
| XGBClassifier | 0.9194 | 0.641 | 0.7558 |

## 1. Data Cleaning: Missing Values, Outliers, and Multi-Collinearity

Missing Values:
Missing values were checked with dataset.isnull().sum() (Cell 3). Had any been found, median or mean imputation for numeric variables and mode imputation for categorical variables would be appropriate. No imputation was applied, which assumes the check reported no meaningful missingness.
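
If Cell 3 had reported missing values, the imputation strategies mentioned above could be sketched roughly as follows. The toy frame and the helper name `impute_simple` are illustrative assumptions and are not part of the notebook.

```python
import numpy as np
import pandas as pd


def impute_simple(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the median and all other columns with the mode."""
    out = df.copy()
    for col in out.columns:
        if out[col].isnull().sum() == 0:
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


# Toy data standing in for Fraud.csv (values are made up for illustration).
toy = pd.DataFrame({
    "amount": [100.0, np.nan, 300.0, 500.0],
    "type": ["PAYMENT", "TRANSFER", None, "PAYMENT"],
})
filled = impute_simple(toy)
print(filled)
```

In a real workflow the medians and modes would normally be learned from the training split only and then reused on the test split.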

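Outliers:
Outliers in the amount column were inspected with a boxplot and identified with the IQR (interquartile range) rule, which flags values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR. Rather than removing them, three derived features were added alongside the raw amount: amount_capped (clipped to the 1st and 99th percentiles), log_amount (log1p, to reduce right skew), and is_outlier_amount (a binary flag from the IQR rule).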
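
To make the outlier handling concrete, here is a small worked example on made-up amounts. It mirrors the logic of Cell 4 on five values; the numbers are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up amounts with one obvious extreme value.
amount = pd.Series([10.0, 20.0, 30.0, 40.0, 1000.0], name="amount")

# IQR fences: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged.
q1, q3 = amount.quantile(0.25), amount.quantile(0.75)
iqr = q3 - q1
lower_bound, upper_bound = q1 - 1.5 * iqr, q3 + 1.5 * iqr

features = pd.DataFrame({
    "amount": amount,
    # log1p compresses the long right tail and is defined at zero.
    "log_amount": np.log1p(amount),
    # Clip to the 1st and 99th percentiles.
    "amount_capped": amount.clip(amount.quantile(0.01), amount.quantile(0.99)),
    # 1 if the amount lies outside the IQR fences, else 0.
    "is_outlier_amount": ((amount < lower_bound) | (amount > upper_bound)).astype(int),
})
print(features)
```

The notebook computes its quantiles on the full dataset before the train/test split. Computing them on the training rows only would avoid carrying test-set statistics into the features, though the effect is probably small for a dataset of this kind.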

Multi-Collinearity:
Checked visually with a correlation heatmap (Cell 5), after temporarily encoding the transaction type as integers. The highly correlated features 'oldbalanceOrg' and 'oldbalanceDest' were dropped to reduce redundancy and improve model stability.
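
A heatmap has to be read by eye. As a complement, the same correlation matrix can be turned into an explicit list of highly correlated pairs. The sketch below assumes a threshold of 0.9 and uses synthetic data; the helper name `high_corr_pairs` is hypothetical.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.select_dtypes("number").corr().abs()
    # Keep the strict upper triangle so each pair appears once, without the diagonal.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Synthetic data: newbalanceOrig is almost a copy of oldbalanceOrg, step is unrelated.
rng = np.random.default_rng(0)
old_balance = rng.uniform(0, 1000, 200)
toy = pd.DataFrame({
    "oldbalanceOrg": old_balance,
    "newbalanceOrig": old_balance - rng.uniform(0, 10, 200),
    "step": rng.integers(1, 744, 200),
})
result = high_corr_pairs(toy)
print(result)
```

Dropping one column from each reported pair is then a deliberate choice rather than a visual judgment.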

## 2. Fraud Detection Model Description

Model Used:
    The best-performing model was XGBoost (Extreme Gradient Boosting), a tree-based ensemble algorithm well suited to tabular data, including imbalanced classification tasks such as this one. On the test split it gave the highest precision and F1-score of the three models compared.

Pipeline:
    Preprocessing:
        Numeric features (StandardScaler): step, amount and its derived features, post-transaction balances, isFlaggedFraud.
        Categorical feature (OneHotEncoder): transaction type.
        Combined via ColumnTransformer.
    Model Fitting:
        XGBClassifier with default parameters.
    Prediction and Evaluation:
        Predictions on the 20% test split, measured with precision, recall, and F1-score.

## 3. Variable Selection

Removal of uninformative or high-cardinality features:
Dropped the account IDs (nameOrig, nameDest).

Kept features with predictive value:
Step, amount, post-transaction balances, transaction type, isFlaggedFraud, and the engineered features (log_amount, amount_capped, is_outlier_amount). Note that log_amount and amount_capped are both derived from amount, so these three columns overlap heavily.

Dropped multi-collinear or redundant columns:
oldbalanceOrg and oldbalanceDest, based on the correlation heatmap.

The categorical variable (type) was one-hot encoded with OneHotEncoder before being passed to XGBoost.

## 4. Model Performance Demonstration
XGBoost Results:

Precision: 0.92

Recall: 0.64

F1-score: 0.76
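
The F1-score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a quick consistency check, the sketch below recomputes it from the XGBoost precision and recall reported in the results section (0.9194 and 0.641). Because those inputs are rounded, the recomputed value should be close to the reported F1 but need not match it to the last digit.

```python
def f1_from(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall for XGBClassifier as reported in the results section.
xgb_f1 = f1_from(0.9194, 0.641)
print(f"Recomputed F1: {xgb_f1:.4f}")
```

The same function makes the trade-off visible: with recall fixed at 0.641, even perfect precision would cap the F1-score well below 1, so improving recall is where most of the remaining gain lies.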

## 5. Key Predictive Factors
Transaction amount (especially extreme or unusual amounts).

Transaction type (certain types are more typical for fraud).

Account balances after the transaction (newbalanceOrig, newbalanceDest); the pre-transaction balances were dropped in Cell 6.

Flag indicators (e.g., isFlaggedFraud).

Step (time) (certain patterns over time may matter).

Engineered features (is_outlier_amount, log_amount).

## 6. Does this Make Sense?
Yes. Fraudulent activity often involves:

Large, sudden, or atypical transactions.

Specific transaction types that are easier to exploit (e.g., TRANSFER or CASH-OUT).

Balance inconsistencies (e.g., zero balance immediately after a transfer).

Repeated transactions or patterns within short time windows.

The factors listed above align with known fraud risk factors in real-world banking and finance.
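
These expectations can be checked directly against the data rather than taken on trust. The sketch below computes the fraud rate per transaction type and among transactions that leave the origin account empty. The rows are made up and only reuse the notebook's column names.

```python
import pandas as pd

# Made-up rows using the notebook's column names; not real data.
toy = pd.DataFrame({
    "type": ["TRANSFER", "TRANSFER", "TRANSFER", "PAYMENT", "PAYMENT", "PAYMENT"],
    "newbalanceOrig": [0.0, 0.0, 120.0, 50.0, 10.0, 0.0],
    "isFraud": [1, 1, 0, 0, 0, 0],
})

# Share of fraudulent rows per transaction type.
fraud_rate_by_type = toy.groupby("type")["isFraud"].mean().sort_values(ascending=False)

# Share of fraudulent rows among transactions that emptied the origin account.
emptied = toy["newbalanceOrig"] == 0
fraud_rate_when_emptied = toy.loc[emptied, "isFraud"].mean()

print(fraud_rate_by_type)
print("Fraud rate when origin balance ends at zero:", fraud_rate_when_emptied)
```

Running the same two aggregations on the real dataset would show whether fraud is in fact concentrated in particular transaction types and in account-emptying transactions.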

## 7. Prevention Recommendations for Company Infrastructure
Real-time fraud detection:
Integrate the XGBoost model (and/or rule-based filters) for live transaction monitoring.

Automated flagging and alert systems based on model predictions.

Regular model retraining and evaluation to adapt to new fraud tactics.

Augment features with additional context: historical behavior, device/location info, etc.

Educate staff and customers on typical fraud warning signs.

Deploy anomaly detection as an early-warning complement to classification.
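
Automated flagging usually comes down to scoring each transaction and comparing the score with an alert threshold. The sketch below is a minimal illustration of that idea: `flag_transactions` is a hypothetical helper, the data are synthetic, and LogisticRegression stands in for the fitted pipeline from Cell 8.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def flag_transactions(model, X: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Score transactions and flag those at or above the alert threshold."""
    proba = model.predict_proba(X)[:, 1]  # probability of the positive (fraud) class
    return pd.DataFrame(
        {"fraud_probability": proba, "flagged": proba >= threshold},
        index=X.index,
    )


# Stand-in model and data (made up); in the notebook, `pipe` would play this role.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"amount": rng.uniform(0, 1000, 300)})
y_toy = (X_toy["amount"] > 700).astype(int)
stand_in = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

strict = flag_transactions(stand_in, X_toy, threshold=0.9)
lenient = flag_transactions(stand_in, X_toy, threshold=0.1)
print("Flagged at 0.9:", int(strict["flagged"].sum()))
print("Flagged at 0.1:", int(lenient["flagged"].sum()))
```

Lowering the threshold flags more transactions, which tends to raise recall at the cost of precision. With the reported recall of 0.64, a threshold below the default 0.5 may be worth evaluating if missed fraud costs more than investigating false alarms.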

## 8. How to Measure Post-Implementation Success
Key indicators to track:

Reduction in fraud losses (total $ prevented).

Fraud catch rate (recall) and false-alarm burden (tracked via precision).

Customer complaint and reversal rates.

Model F1-score and ROC AUC on live and holdout data.

Case handling time from flag to investigation to resolution.
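
The model-quality indicators in this list can be computed on each batch of confirmed cases with scikit-learn. The sketch below assumes that true labels and model scores are available for the batch; `monitoring_report` is a hypothetical helper, the numbers are made up, and average precision (PR AUC) is an addition here because it is often more informative than ROC AUC when fraud is rare.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, f1_score,
                             precision_score, recall_score, roc_auc_score)


def monitoring_report(y_true, y_score, threshold: float = 0.5) -> dict:
    """Summarize threshold-dependent and ranking metrics for one labeled batch."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    y_pred = (y_score >= threshold).astype(int)
    return {
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        "roc_auc": roc_auc_score(y_true, y_score),
        "pr_auc": average_precision_score(y_true, y_score),
    }


# Made-up labels and scores for illustration.
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.2, 0.6, 0.3, 0.9, 0.4, 0.05, 0.8]
report = monitoring_report(y_true, y_score)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

Tracking these values per week or per month, alongside the holdout figures from the notebook, would show whether live performance is drifting.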

A/B testing:
Split traffic to old vs. new systems and compare fraud rates, customer satisfaction, and operational costs.
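
Whether the fraud rates of the two arms really differ can be assessed with a two-proportion z-test. The sketch below uses only the standard library; the counts are invented, and the pooled-variance z-test is one reasonable choice among several, not a method taken from the notebook.

```python
import math


def two_proportion_z_test(fraud_a: int, n_a: int, fraud_b: int, n_b: int):
    """Return (z, two-sided p-value) for H0: both arms have the same fraud rate."""
    p_a, p_b = fraud_a / n_a, fraud_b / n_b
    pooled = (fraud_a + fraud_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("No variation in outcomes; the test is undefined.")
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution, written with erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: undetected fraud cases out of transactions handled by each arm.
z, p_value = two_proportion_z_test(fraud_a=130, n_a=100_000, fraud_b=90, n_b=100_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

Because fraud is rare, each arm needs a large number of transactions before a realistic difference becomes detectable, so the test duration should be planned with that in mind.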