In [None]:
# Project 3 – Identity Fraud Detection

## 🎯 Objective

To build and evaluate a machine learning model that detects fraudulent behavior by leveraging both transaction and identity-related data. 
This project focuses on identifying hidden fraud patterns linked to user identities, devices, and behavioral signals.

---

## 📁 Dataset Description

This project uses a real-world fraud dataset combining two complementary files:

- `train_transaction.csv` – Contains transaction-level features along with the binary target variable `isFraud`.
- `train_identity.csv` – Includes additional identity and device metadata linked to each transaction via the `TransactionID` key.

In [None]:
### Starter Code – Load & Merge Data

In [1]:
import pandas as pd
from IPython.display import display, Markdown

# Load datasets
transaction_df = pd.read_csv("train_transaction.csv")
identity_df = pd.read_csv("train_identity.csv")

# Merge on TransactionID
df = pd.merge(transaction_df, identity_df, how='left', on='TransactionID')

# Preview
# Display preview of transaction_df
display(Markdown("### First 5 Rows of `transaction_df`"))
display(transaction_df.head())

# Display preview of identity_df
display(Markdown("### First 5 Rows of `identity_df`"))
display(identity_df.head())

# (Optional) Preview of the merged dataset, if already merged
display(Markdown("### First 5 Rows of Merged Dataset"))
display(df.head())

### First 5 Rows of `transaction_df`

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,V330,V331,V332,V333,V334,V335,V336,V337,V338,V339
0,2987000,0,86400,68.5,W,13926,,150.0,discover,142.0,...,,,,,,,,,,
1,2987001,0,86401,29.0,W,2755,404.0,150.0,mastercard,102.0,...,,,,,,,,,,
2,2987002,0,86469,59.0,W,4663,490.0,150.0,visa,166.0,...,,,,,,,,,,
3,2987003,0,86499,50.0,W,18132,567.0,150.0,mastercard,117.0,...,,,,,,,,,,
4,2987004,0,86506,50.0,H,4497,514.0,150.0,mastercard,102.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### First 5 Rows of `identity_df`

Unnamed: 0,TransactionID,id_01,id_02,id_03,id_04,id_05,id_06,id_07,id_08,id_09,...,id_31,id_32,id_33,id_34,id_35,id_36,id_37,id_38,DeviceType,DeviceInfo
0,2987004,0.0,70787.0,,,,,,,,...,samsung browser 6.2,32.0,2220x1080,match_status:2,T,F,T,T,mobile,SAMSUNG SM-G892A Build/NRD90M
1,2987008,-5.0,98945.0,,,0.0,-5.0,,,,...,mobile safari 11.0,32.0,1334x750,match_status:1,T,F,F,T,mobile,iOS Device
2,2987010,-5.0,191631.0,0.0,0.0,0.0,0.0,,,0.0,...,chrome 62.0,,,,F,F,T,T,desktop,Windows
3,2987011,-5.0,221832.0,,,0.0,-6.0,,,,...,chrome 62.0,,,,F,F,T,T,desktop,
4,2987016,0.0,7460.0,0.0,0.0,1.0,0.0,,,0.0,...,chrome 62.0,24.0,1280x800,match_status:2,T,F,T,T,desktop,MacOS


### First 5 Rows of Merged Dataset

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,id_31,id_32,id_33,id_34,id_35,id_36,id_37,id_38,DeviceType,DeviceInfo
0,2987000,0,86400,68.5,W,13926,,150.0,discover,142.0,...,,,,,,,,,,
1,2987001,0,86401,29.0,W,2755,404.0,150.0,mastercard,102.0,...,,,,,,,,,,
2,2987002,0,86469,59.0,W,4663,490.0,150.0,visa,166.0,...,,,,,,,,,,
3,2987003,0,86499,50.0,W,18132,567.0,150.0,mastercard,117.0,...,,,,,,,,,,
4,2987004,0,86506,50.0,H,4497,514.0,150.0,mastercard,102.0,...,samsung browser 6.2,32.0,2220x1080,match_status:2,T,F,T,T,mobile,SAMSUNG SM-G892A Build/NRD90M


## Dataset Overview and Context

The dataset used in this assignment is derived from a real-world identity fraud detection challenge. It includes two primary files:

- `train_transaction.csv`: Contains transaction-related features and the fraud label (`isFraud`).
- `train_identity.csv`: Contains additional identity-related features for a subset of transactions.
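
Because identity rows exist only for a subset of transactions, it can help to measure that coverage at merge time. The following is a minimal sketch on toy frames (`transaction_toy` and `identity_toy` are hypothetical stand-ins, not the real files); it uses the `validate` and `indicator` options of `pd.merge` to confirm that `TransactionID` is unique on both sides and to count how many transactions found an identity match.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (hypothetical data, for illustration only)
transaction_toy = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4],
    "isFraud": [0, 0, 1, 0],
})
identity_toy = pd.DataFrame({
    "TransactionID": [2, 4],
    "DeviceType": ["mobile", "desktop"],
})

merged_toy = pd.merge(
    transaction_toy,
    identity_toy,
    how="left",
    on="TransactionID",
    validate="one_to_one",  # raises MergeError if TransactionID repeats on either side
    indicator=True,         # adds a _merge column: "both" or "left_only"
)

# Share of transactions that have a matching identity row
identity_coverage = (merged_toy["_merge"] == "both").mean()
print(f"Rows kept: {len(merged_toy)} | share with identity data: {identity_coverage:.0%}")
```

A left merge keeps every transaction, so the row count should equal that of the transaction file; rows marked `left_only` are the ones whose identity columns end up entirely missing.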

### Why Are the Variable Names Not Intuitive?

Many variables in the dataset are anonymized for privacy and proprietary reasons. For example:

- Features are labeled generically (e.g., `V1`, `C1`, `D9`, `id_12`, etc.).
- This protects sensitive financial information and reflects the kind of datasets you might encounter in industry where full data dictionaries aren't always available.

### How Can We Still Use This Data?

Even though the variable names lack clear definitions, the dataset still contains:

- **Rich signals**: Numeric, categorical, and timestamp-based features.
- **Ground truth**: A clear target variable (`isFraud`) allows us to train supervised machine learning models.
- **Structure**: Enough consistency for preprocessing, model training, and evaluation.

We’ll apply standard AI/ML techniques to preprocess, explore, and model the data—just like you would in a real-world fraud analytics environment.


### Task 1. Data Merging & Basic Exploration

In [2]:
# Basic exploration and check of missing values in the data

# Display shape and preview merged dataset
print("\nShape of the merged dataset (rows, columns):")
print(df.shape)

print("\nPreview of the merged dataset:")
display(df.head())

# Calculate and print the percentage of missing values per column for Merged Dataset -
missing_percentage = df.isnull().mean() * 100
print("\nPercentage of missing values per column:")
print(missing_percentage)

# Get columns with >75% missing in the merged dataset
missing_fraction = df.isnull().mean()
columns_over_75_missing = missing_fraction[missing_fraction > 0.75].index

# Get the list of columns that originally belonged to the identity dataset
identity_columns = identity_df.columns.difference(transaction_df.columns)

# Find intersection — how many of the >75% missing columns are from identity
missing_identity_columns = set(columns_over_75_missing).intersection(set(identity_columns))

# Output

print(f"\nTotal columns with >75% missing: {len(columns_over_75_missing)}")
print(f"Out of these, identity columns: {len(missing_identity_columns)}")



Shape of the merged dataset (rows, columns):
(590540, 434)

Preview of the merged dataset:


Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,id_31,id_32,id_33,id_34,id_35,id_36,id_37,id_38,DeviceType,DeviceInfo
0,2987000,0,86400,68.5,W,13926,,150.0,discover,142.0,...,,,,,,,,,,
1,2987001,0,86401,29.0,W,2755,404.0,150.0,mastercard,102.0,...,,,,,,,,,,
2,2987002,0,86469,59.0,W,4663,490.0,150.0,visa,166.0,...,,,,,,,,,,
3,2987003,0,86499,50.0,W,18132,567.0,150.0,mastercard,117.0,...,,,,,,,,,,
4,2987004,0,86506,50.0,H,4497,514.0,150.0,mastercard,102.0,...,samsung browser 6.2,32.0,2220x1080,match_status:2,T,F,T,T,mobile,SAMSUNG SM-G892A Build/NRD90M



Percentage of missing values per column:
TransactionID      0.000000
isFraud            0.000000
TransactionDT      0.000000
TransactionAmt     0.000000
ProductCD          0.000000
                    ...    
id_36             76.126088
id_37             76.126088
id_38             76.126088
DeviceType        76.155722
DeviceInfo        79.905510
Length: 434, dtype: float64

Total columns with >75% missing: 208
Out of these, identity columns: 40


- We explored the merged dataset to assess missing data and identified 208 columns with more than 75% missing values.
  Among these, we focus on the numeric features and compute their correlation with the target variable `isFraud`.

- This helps detect sparse features that still carry a strong linear signal (correlation > 0.7 or < -0.7),
  so that we can retain them for modeling instead of dropping them outright.

#### Check the correlation of numeric features with >75% missing values before dropping them

In [3]:
# To avoid dropping valuable information, check whether these mostly-missing
# numeric columns are strongly correlated with isFraud.

# Step: Filter numeric columns from the >75% missing columns
numeric_missing_cols = df[columns_over_75_missing].select_dtypes(include=['float64', 'int64'])

# Add isFraud for correlation computation
temp_df = numeric_missing_cols.copy()
temp_df['isFraud'] = df['isFraud']

# Compute correlation with isFraud
correlation_series = temp_df.corr()['isFraud'].drop('isFraud')

# Filter for strong correlation >0.7 or <-0.7
strong_corr = correlation_series[correlation_series.abs() > 0.7]

# Display results
print("\nColumns with >75% missing AND strong correlation with isFraud (>|0.7|):")
print(strong_corr.sort_values(ascending=False))



Columns with >75% missing AND strong correlation with isFraud (>|0.7|):
Series([], Name: isFraud, dtype: float64)



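#### Insights
None of the numeric columns with more than 75% missing values shows a strong linear correlation (|r| > 0.7) with the target variable `isFraud`.
Pearson correlation is computed only on the rows where a feature is present and captures only linear relationships, so this does not prove the columns carry no signal; given their sparsity, however, we drop them to keep preprocessing simple.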
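
One check that correlation on observed values cannot provide is whether the missingness itself is informative. The sketch below is an illustration on toy data rather than part of the original analysis; the helper `fraud_rate_by_missingness` and the column `id_sparse` are hypothetical names. It compares the fraud rate between rows where a sparse column is missing and rows where it is present.

```python
import numpy as np
import pandas as pd

# Toy data: a sparse column that is only populated for a few rows (hypothetical)
toy = pd.DataFrame({
    "isFraud":   [0, 0, 0, 0, 1, 1, 0, 1],
    "id_sparse": [np.nan, np.nan, np.nan, np.nan, 3.0, np.nan, np.nan, 7.0],
})

def fraud_rate_by_missingness(frame, column, target="isFraud"):
    """Return the mean of `target` for rows where `column` is missing vs. present."""
    is_missing = frame[column].isnull()
    rates = frame.groupby(is_missing)[target].mean()
    return rates.rename(index={True: "missing", False: "present"})

rates = fraud_rate_by_missingness(toy, "id_sparse")
print(rates)
```

If the two rates differ clearly on the real data, a binary "is missing" indicator may be worth keeping even when the column itself is dropped.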


#### Subtask- Briefly describe how much of the identity data was missing and how you handled it.


Dataset Overview - 
The merged dataset provides complete transaction data and nearly complete card-related data. 
However, identity and device-related features are very sparse, largely because the left merge leaves them empty for every transaction that has no matching row in `train_identity.csv`.

Key identity/device fields with substantial missing values include:
- id_36, id_37, id_38: ~76% missing
- DeviceType: ~76% missing
- DeviceInfo: ~80% missing

<b>Missing Data Handling Strategy:</b>
- Drop features: Remove all 208 columns with >75% missing data; 40 of them come from the identity file, including the fields listed above.
- Impute remaining features: Fill numerical columns with the mean and categorical columns with the mode.
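
One caveat with mean/mode imputation is leakage: if the fill values are computed on the full dataset, the held-out rows influence them. A minimal sketch of the alternative, assuming a toy frame with hypothetical columns `amount` and `card_type`, computes the fill values on the training split only and reuses them for the test split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with gaps in one numeric and one categorical column (hypothetical)
toy = pd.DataFrame({
    "amount": [10.0, np.nan, 30.0, 40.0, np.nan, 60.0, 70.0, 80.0],
    "card_type": ["visa", "visa", np.nan, "mastercard", "visa", np.nan, "visa", "discover"],
    "isFraud": [0, 0, 0, 1, 0, 0, 1, 0],
})
X_toy = toy.drop(columns="isFraud")
y_toy = toy["isFraud"]

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

# Learn the fill values from the training rows only
fill_values = {
    "amount": X_tr["amount"].mean(),           # numerical: training mean
    "card_type": X_tr["card_type"].mode()[0],  # categorical: training mode
}

# Apply the same values to both splits
X_tr_filled = X_tr.fillna(fill_values)
X_te_filled = X_te.fillna(fill_values)

print(fill_values)
print("Remaining gaps:", int(X_tr_filled.isnull().sum().sum()), int(X_te_filled.isnull().sum().sum()))
```

On a dataset of this size the difference is usually small, but fitting preprocessing on the training split keeps the test metrics an honest estimate.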




### Task 2. Feature Preprocessing 

### Drop columns with >75% missing values 

In [4]:
# Step 1: Set threshold
threshold = 0.75

# Step 2: Calculate missing fraction
missing_fraction = df.isnull().mean()

# Step 3: Identify columns to drop
columns_to_drop = missing_fraction[missing_fraction > threshold].index

# Step 4: Drop columns from the merged DataFrame
df = df.drop(columns=columns_to_drop)

# Step 5: Output result
print(f"\nDropped {len(columns_to_drop)} columns with more than {threshold*100}% missing values.")
print(f"Remaining columns in the dataset: {df.shape[1]}")



Dropped 208 columns with more than 75.0% missing values.
Remaining columns in the dataset: 226


### Impute missing values 

In [5]:
# Step 1: Separate column types
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = df.select_dtypes(include='object').columns

# Step 2: Impute numerical columns with mean
for col in numerical_cols:
    if df[col].isnull().sum() > 0:
        df[col] = df[col].fillna(df[col].mean())

# Step 3: Impute categorical columns with mode
for col in categorical_cols:
    if df[col].isnull().sum() > 0:
        df[col] = df[col].fillna(df[col].mode()[0])

print("Missing values imputed: numerical → mean, categorical → mode.")

print(df.head())


Missing values imputed: numerical → mean, categorical → mode.
   TransactionID  isFraud  TransactionDT  TransactionAmt ProductCD  card1  \
0        2987000        0          86400            68.5         W  13926   
1        2987001        0          86401            29.0         W   2755   
2        2987002        0          86469            59.0         W   4663   
3        2987003        0          86499            50.0         W  18132   
4        2987004        0          86506            50.0         H   4497   

        card2  card3       card4  card5  ...   V312  V313  V314  V315  V316  \
0  362.555488  150.0    discover  142.0  ...    0.0   0.0   0.0   0.0   0.0   
1  404.000000  150.0  mastercard  102.0  ...    0.0   0.0   0.0   0.0   0.0   
2  490.000000  150.0        visa  166.0  ...    0.0   0.0   0.0   0.0   0.0   
3  567.000000  150.0  mastercard  117.0  ...  135.0   0.0   0.0   0.0  50.0   
4  514.000000  150.0  mastercard  102.0  ...    0.0   0.0   0.0   0.0   0.0   



### Cross-checking for successful imputation 

In [6]:
# Check if any missing values remain
missing_summary = df.isnull().sum().sort_values(ascending=False)

# Display top variables (if any) with missing values
print("\nRemaining missing values (if any):")
print(missing_summary[missing_summary > 0])



Remaining missing values (if any):
Series([], dtype: int64)


### Label encoding the categorical variables 

In [7]:
from sklearn.preprocessing import LabelEncoder

# Identify categorical columns
categorical_cols = df.select_dtypes(include='object').columns

# Initialize LabelEncoder
le = LabelEncoder()

# Encode each categorical column
for col in categorical_cols:
    df[col] = le.fit_transform(df[col].astype(str))

print("All categorical variables encoded as 0,1,2,... for modeling.")

# Preview first few rows after encoding
print("\n Preview of DataFrame after label encoding:")
print(df.head())


All categorical variables encoded as 0,1,2,... for modeling.

 Preview of DataFrame after label encoding:
   TransactionID  isFraud  TransactionDT  TransactionAmt  ProductCD  card1  \
0        2987000        0          86400            68.5          4  13926   
1        2987001        0          86401            29.0          4   2755   
2        2987002        0          86469            59.0          4   4663   
3        2987003        0          86499            50.0          4  18132   
4        2987004        0          86506            50.0          1   4497   

        card2  card3  card4  card5  ...   V312  V313  V314  V315  V316  \
0  362.555488  150.0      1  142.0  ...    0.0   0.0   0.0   0.0   0.0   
1  404.000000  150.0      2  102.0  ...    0.0   0.0   0.0   0.0   0.0   
2  490.000000  150.0      3  166.0  ...    0.0   0.0   0.0   0.0   0.0   
3  567.000000  150.0      2  117.0  ...  135.0   0.0   0.0   0.0  50.0   
4  514.000000  150.0      2  102.0  ...    0.0   0.0   

### Task 3. Model Training

In [8]:
## Split the data into training and testing sets (80/20 split)

from sklearn.model_selection import train_test_split

# Separate features and target
X = df.drop('isFraud', axis=1)
y = df['isFraud']

# Perform 80/20 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42, stratify=y)

print("\n Data split completed. Training samples:", X_train.shape[0], " | Testing samples:", X_test.shape[0])

# Display shapes
print("\n Data split completed.")
print(f"Training set shape: {X_train.shape}, Target: {y_train.shape}")
print(f"Test set shape:     {X_test.shape}, Target: {y_test.shape}")

# Display class distribution
print("\n Class distribution in training set:")
print(y_train.value_counts(normalize=True).rename(lambda x: f"isFraud={x}"))

print("\n Class distribution in test set:")
print(y_test.value_counts(normalize=True).rename(lambda x: f"isFraud={x}"))



 Data split completed. Training samples: 472432  | Testing samples: 118108

 Data split completed.
Training set shape: (472432, 225), Target: (472432,)
Test set shape:     (118108, 225), Target: (118108,)

 Class distribution in training set:
isFraud
isFraud=0    0.965011
isFraud=1    0.034989
Name: proportion, dtype: float64

 Class distribution in test set:
isFraud
isFraud=0    0.965007
isFraud=1    0.034993
Name: proportion, dtype: float64


In [9]:
## Use a Random Forest classifier to train the model and make predictions:

from sklearn.ensemble import RandomForestClassifier

# Define and train model
model = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=42)
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]


### Task 4. Model Evaluation

In [34]:
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, roc_curve

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("\n Confusion Matrix:")
print(cm)

# Classification report
print("\n Classification Report:")
print(classification_report(y_test, y_pred))

# ROC-AUC
roc_auc = roc_auc_score(y_test, y_prob)
print(f" ROC-AUC Score: {roc_auc:.4f}")



 Confusion Matrix:
[[113870    105]
 [  3289    844]]

 Classification Report:
              precision    recall  f1-score   support

           0       0.97      1.00      0.99    113975
           1       0.89      0.20      0.33      4133

    accuracy                           0.97    118108
   macro avg       0.93      0.60      0.66    118108
weighted avg       0.97      0.97      0.96    118108

 ROC-AUC Score: 0.8661


### Interpretation: Precision vs. Recall (Fraud Detection Context)

- Precision (Fraud class = 1): 0.89
  Interpretation: When this model predicts a transaction as fraud, it's correct 89% of the time. 
  This means few false alarms — good for user experience and reputation.

- Recall (Fraud class = 1): 0.20
  Interpretation: This model only caught 20% of all actual fraud cases.
  This means it's missing 80% of frauds — not acceptable in real-world fraud detection where recall is more critical.
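
The precision and recall figures can be reproduced directly from the confusion matrix printed above. The short sketch below only redoes that arithmetic; the matrix values are copied from the output, and scikit-learn's layout `[[TN, FP], [FN, TP]]` is assumed.

```python
import numpy as np

# Confusion matrix from the baseline Random Forest: [[TN, FP], [FN, TP]]
cm = np.array([[113870, 105],
               [3289, 844]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # share of flagged transactions that are fraud
recall = tp / (tp + fn)      # share of fraud cases that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.89 recall=0.20 f1=0.33
```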

Summary:
The model has high precision (few false positives) but low recall (many missed frauds).
A ROC-AUC of 0.8661 indicates that the predicted probabilities rank fraud above non-fraud reasonably well, so the low recall stems largely from applying the default 0.5 decision threshold to a model trained on heavily imbalanced data (about 3.5% fraud).

Therefore, to improve recall, we will now undersample the majority class, retrain the model, and compare performance.
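
Resampling is one option; lowering the decision threshold is another. The sketch below is a hedged illustration on synthetic data, not a result from this dataset: `target_recall` is a hypothetical business requirement, and the threshold is chosen on a validation split with `precision_recall_curve`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 5% positives), standing in for the real features
X_syn, y_syn = make_classification(n_samples=5000, n_features=10,
                                   weights=[0.95, 0.05], random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_syn, y_syn, test_size=0.3, stratify=y_syn, random_state=42)

clf = RandomForestClassifier(n_estimators=50, max_depth=8, random_state=42)
clf.fit(X_tr, y_tr)
val_prob = clf.predict_proba(X_val)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_val, val_prob)

target_recall = 0.60  # hypothetical requirement, chosen for illustration
# precision/recall have one more entry than thresholds; drop the final point
meets_target = recall[:-1] >= target_recall
# Highest threshold that still reaches the target recall (fewest alerts)
chosen_threshold = thresholds[meets_target].max()

y_val_pred = (val_prob >= chosen_threshold).astype(int)
achieved_recall = recall_score(y_val, y_val_pred)
print(f"threshold={chosen_threshold:.3f} recall={achieved_recall:.2f}")
```

The threshold should be selected on validation data and then applied unchanged to the test set; otherwise the reported test recall is optimistic.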

### Task 5. Imbalanced Data Handling 

#### Apply Undersampling the majority class to handle the imbalance.
#### Re-train and Re-Evaluate the model performance after applying undersampling technique 

In [18]:
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)

print("Undersampling Applied:")
print(y_train_rus.value_counts())

# Retrain
model_rus = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=42)
model_rus.fit(X_train_rus, y_train_rus)

# Predict
y_pred_rus = model_rus.predict(X_test)
y_prob_rus = model_rus.predict_proba(X_test)[:, 1]

# Evaluate
print("\n Evaluation After Undersampling:")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_rus))

print("\nClassification Report:")
print(classification_report(y_test, y_pred_rus))

print(f"🎯 ROC-AUC After Undersampling: {roc_auc_score(y_test, y_prob_rus):.4f}")


Undersampling Applied:
isFraud
0    16530
1    16530
Name: count, dtype: int64

 Evaluation After Undersampling:
Confusion Matrix:
[[95812 18163]
 [  964  3169]]

Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.84      0.91    113975
           1       0.15      0.77      0.25      4133

    accuracy                           0.84    118108
   macro avg       0.57      0.80      0.58    118108
weighted avg       0.96      0.84      0.89    118108

🎯 ROC-AUC After Undersampling: 0.8773


#### Comparison and Interpretation of model performance results - 

| Metric                | Original Model (Imbalanced) | Undersampled Model |
| --------------------- | --------------------------- | ------------------ |
| **Accuracy**          | 97%                         | 84%                |
| **Precision (Fraud)** | 0.89                        | 0.15               |
| **Recall (Fraud)**    | 0.20                        | 0.77               |
| **F1-Score (Fraud)**  | 0.33                        | 0.25               |
| **ROC-AUC**           | 0.8661                      | **0.8773**         |


Key Insights:
Recall (Fraud) improved significantly (0.20 → 0.77)
→ The model now detects 77% of fraud cases, compared to just 20% before.

ROC-AUC also improved slightly (from 0.8661 → 0.8773), indicating better overall separation of fraud vs non-fraud.

Trade-Off:
Precision (Fraud) dropped (0.89 → 0.15)
→ This means the model is flagging more legitimate transactions as fraud (i.e., more false positives).

Accuracy decreased from 97% to 84% — expected, as the model now focuses more on catching frauds.

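Conclusion:
If a missed fraud costs more than reviewing a false alarm, the undersampled model is preferable because of its much higher recall. The price is a lower fraud F1-score (0.25 vs. 0.33) and 18,163 legitimate test transactions flagged as fraud.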
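
To make the trade-off concrete, the two confusion matrices above can be turned into alert volumes. This is a small sketch of that arithmetic; `alert_summary` is a hypothetical helper, and the matrices are copied from the printed outputs.

```python
import numpy as np

def alert_summary(cm):
    """Summarise alert volume from a 2x2 confusion matrix [[TN, FP], [FN, TP]]."""
    tn, fp, fn, tp = np.asarray(cm).ravel()
    flagged = fp + tp
    return {
        "flagged": int(flagged),                   # transactions sent for review
        "false_positive_rate": fp / (fp + tn),     # share of legitimate transactions flagged
        "alerts_per_caught_fraud": flagged / tp,   # review effort per detected fraud
    }

original = alert_summary([[113870, 105], [3289, 844]])
undersampled = alert_summary([[95812, 18163], [964, 3169]])

for name, summary in [("original", original), ("undersampled", undersampled)]:
    print(
        f"{name:>12}: flagged={summary['flagged']}, "
        f"FPR={summary['false_positive_rate']:.2%}, "
        f"alerts per caught fraud={summary['alerts_per_caught_fraud']:.2f}"
    )
```

Whether the larger review queue is acceptable depends on the cost of a manual review relative to the cost of a missed fraud, which this dataset does not provide.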


### Enhancement Task 1. Use a Different Model
Here we train a gradient boosting model (XGBoost) on the same undersampled training data that we used for the Random Forest classifier above.



In [20]:
## Installing XGBoost for applying Gradient Boosting Model 

!pip install xgboost




In [27]:

#Step 1: Create undersampled training set
rus = RandomUnderSampler(random_state=42)
X_train_bal, y_train_bal = rus.fit_resample(X_train, y_train)

# Optional check
print(" Undersampled Class Distribution:")
print(y_train_bal.value_counts())


 Undersampled Class Distribution:
isFraud
0    16530
1    16530
Name: count, dtype: int64


In [28]:
#### Train XGBoost on Undersampled Data

from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

xgb_model = XGBClassifier(n_estimators=100, max_depth=6, learning_rate=0.1,
                          eval_metric='logloss', random_state=42)

xgb_model.fit(X_train_bal, y_train_bal)

# Predict on original test set
y_pred_xgb = xgb_model.predict(X_test)
y_prob_xgb = xgb_model.predict_proba(X_test)[:, 1]

# Evaluate
print(" Evaluation for XGBoost (Undersampled Training):")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_xgb))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_xgb))
print(f" ROC-AUC Score: {roc_auc_score(y_test, y_prob_xgb):.4f}")


 Evaluation for XGBoost (Undersampled Training):
Confusion Matrix:
[[100398  13577]
 [   774   3359]]

Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.88      0.93    113975
           1       0.20      0.81      0.32      4133

    accuracy                           0.88    118108
   macro avg       0.60      0.85      0.63    118108
weighted avg       0.96      0.88      0.91    118108

 ROC-AUC Score: 0.9225


#### Comparison between XGBoost and Random Forest Classifier(Undersampled Training Set)

| Model            | Precision (Fraud) | Recall (Fraud) | F1-Score (Fraud) | ROC-AUC |
|------------------|-------------------|----------------|------------------|---------|
| Random Forest     | 0.15              | 0.77           | 0.25             | 0.8773  |
| XGBoost           | 0.20              | **0.81**       | **0.32**         | **0.9225** |

**Conclusion:**  
XGBoost outperformed Random Forest on every reported metric: higher recall (0.81 vs. 0.77), higher precision (0.20 vs. 0.15), and higher ROC-AUC (0.9225 vs. 0.8773). 
It therefore catches more fraud while raising fewer false alarms (13,577 vs. 18,163 false positives on the test set), although precision remains low in absolute terms.
    

### Enhancement Task 2. Ensemble Modeling

- Combine two or more models using a voting ensemble or stacked model.


In [30]:
from sklearn.ensemble import VotingClassifier

# Create ensemble using soft voting (averages predicted probabilities)
# Note: VotingClassifier.fit() refits clones of the estimators passed in,
# so both members are retrained on the undersampled data below.
ensemble_model = VotingClassifier(estimators=[
    ('rf', model_rus),     # Random Forest (same hyperparameters as before)
    ('xgb', xgb_model)     # XGBoost
], voting='soft')

# Train ensemble on undersampled data
ensemble_model.fit(X_train_bal, y_train_bal)

# Predict on original test set
y_pred_ensemble = ensemble_model.predict(X_test)
y_prob_ensemble = ensemble_model.predict_proba(X_test)[:, 1]

# Evaluate
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

print("\n Evaluation for Voting Ensemble:")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_ensemble))

print("\nClassification Report:")
print(classification_report(y_test, y_pred_ensemble))

print(f" ROC-AUC Score: {roc_auc_score(y_test, y_prob_ensemble):.4f}")



 Evaluation for Voting Ensemble:
Confusion Matrix:
[[99400 14575]
 [  842  3291]]

Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.87      0.93    113975
           1       0.18      0.80      0.30      4133

    accuracy                           0.87    118108
   macro avg       0.59      0.83      0.61    118108
weighted avg       0.96      0.87      0.91    118108

 ROC-AUC Score: 0.9091


#### Model Performance Comparison (Undersampled Training Set)

| Model            | Precision (Fraud) | Recall (Fraud) | F1-Score (Fraud) | ROC-AUC |
|------------------|-------------------|----------------|------------------|---------|
| Random Forest     | 0.15              | 0.77           | 0.25             | 0.8773  |
| XGBoost           | 0.20              | **0.81**       | **0.32**         | **0.9225** |
| Voting Ensemble   | 0.18              | 0.80           | 0.30             | 0.9091  |

**Conclusion:**  
The Voting Ensemble outperformed Random Forest on every metric but fell slightly short of XGBoost on precision, recall, F1-score, and ROC-AUC. Because it adds complexity without beating its best member, XGBoost alone is the more efficient and effective choice in this case.


Downside - 

- Low precision: the ensemble still has a fraud precision of only 0.18, meaning many legitimate transactions are incorrectly flagged as fraud.
- Cost and interpretability: ensembles are more computationally expensive and harder to interpret.
- No gain over the best member: the ensemble improved on Random Forest but did not surpass XGBoost on any reported metric,
  so it added complexity without a matching benefit.
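
Soft voting sits between its members because it simply averages their predicted probabilities. The sketch below illustrates this on synthetic data, with `LogisticRegression` as a stand-in for the second model (an assumption made to keep the example dependency-free; the notebook itself uses XGBoost).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

X_syn, y_syn = make_classification(n_samples=1000, n_features=8, random_state=0)

rf = RandomForestClassifier(n_estimators=25, max_depth=4, random_state=0)
lr = LogisticRegression(max_iter=1000)  # stand-in for the second ensemble member

ensemble = VotingClassifier(estimators=[("rf", rf), ("lr", lr)], voting="soft")
ensemble.fit(X_syn, y_syn)  # fits clones; rf and lr themselves stay unfitted

# Average the class probabilities of the fitted members by hand
member_probs = [est.predict_proba(X_syn) for est in ensemble.estimators_]
manual_avg = np.mean(member_probs, axis=0)

matches = np.allclose(ensemble.predict_proba(X_syn), manual_avg)
print("Ensemble probability equals the member average:", matches)
```

With equal weights, one strong and one weaker member tend to produce an ensemble between the two, which is consistent with the results in the comparison table. The `weights` parameter of `VotingClassifier` can shift the average toward the stronger model.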

### Enhancement Task 3. Anomaly Detection Approach

Here we use an unsupervised method, Isolation Forest, to detect outliers.
Goal: detect fraud without using labels during training, and evaluate how well the model identifies fraudulent transactions.

Isolation Forest scores each transaction by how easily random splits isolate it from the rest of the data, which makes it a candidate for pre-filtering suspicious activity, 
especially in early-stage fraud detection pipelines.

In [32]:
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# 1. Train Isolation Forest on full training features (unsupervised)
iso_model = IsolationForest(n_estimators=100, contamination=0.035, random_state=42)
iso_model.fit(X_train)

# 2. Predict anomalies on the test set
y_pred_iso = iso_model.predict(X_test)   # Output: -1 = anomaly (fraud), 1 = normal

# 3. Convert to binary format: 1 = fraud, 0 = non-fraud
y_pred_iso_binary = (y_pred_iso == -1).astype(int)

# 4. Evaluate against true labels
print("\n Evaluation for Isolation Forest (Unsupervised):")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_iso_binary))

print("\nClassification Report:")
print(classification_report(y_test, y_pred_iso_binary))

# Note: the AUC below is computed from hard 0/1 predictions, not continuous anomaly scores
print(f" ROC-AUC Score: {roc_auc_score(y_test, y_pred_iso_binary):.4f}")



 Evaluation for Isolation Forest (Unsupervised):
Confusion Matrix:
[[110565   3410]
 [  3367    766]]

Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.97      0.97    113975
           1       0.18      0.19      0.18      4133

    accuracy                           0.94    118108
   macro avg       0.58      0.58      0.58    118108
weighted avg       0.94      0.94      0.94    118108

 ROC-AUC Score: 0.5777


Q. How well did the anomaly detection model identify fraud?

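Ans. The Isolation Forest was trained without labels and flags transactions that look like outliers. On the test set it flagged 4,176 transactions, of which 766 were actual fraud, giving a precision of 0.18 and a recall of 0.19; it therefore missed most fraud, and most of its alerts were false positives. The reported ROC-AUC of 0.5777 was computed from the hard 0/1 predictions rather than from continuous anomaly scores, so it reflects a single operating point, and it is only modestly above the 0.5 expected from random guessing. The model is not suitable as a standalone detector, but it can be useful as a supporting tool in fraud detection pipelines, especially when labels are unavailable.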
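
ROC-AUC is a ranking metric, so it is usually computed from a continuous score rather than from thresholded labels. The sketch below shows how that could look for Isolation Forest, using `score_samples` on synthetic data (the two Gaussian clusters are an assumption for illustration and say nothing about the real dataset).

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
normal = rng.normal(0, 1, size=(950, 4))
outliers = rng.normal(4, 1, size=(50, 4))  # hypothetical "fraud" cluster
X_syn = np.vstack([normal, outliers])
y_syn = np.r_[np.zeros(950, dtype=int), np.ones(50, dtype=int)]

iso = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
iso.fit(X_syn)  # labels are not used for fitting

# Hard labels: -1 = anomaly, 1 = normal, converted to 1 = anomaly, 0 = normal
hard_labels = (iso.predict(X_syn) == -1).astype(int)

# Continuous score: score_samples is lower for anomalies, so negate it
anomaly_score = -iso.score_samples(X_syn)

auc_hard = roc_auc_score(y_syn, hard_labels)
auc_score = roc_auc_score(y_syn, anomaly_score)
print(f"AUC from hard labels: {auc_hard:.4f} | AUC from anomaly scores: {auc_score:.4f}")
```

Applied to the notebook, the equivalent change would be to pass `-iso_model.score_samples(X_test)` to `roc_auc_score`; the resulting value would need to be recomputed and may differ from the 0.5777 reported above.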