## Enhanced Fraud Detection System Model for E-Commerce and Banking Transactions Using Machine Learning
## Overview
This challenge focuses on building robust machine learning models for fraud detection in e-commerce and banking transactions. The project will involve analyzing, preprocessing, and integrating transaction data from both domains, engineering fraud-related features, and training multiple machine learning and deep learning models to detect fraudulent activity. The challenge also includes using geolocation and transaction pattern analysis to improve detection accuracy.

Participants will develop real-time fraud detection systems, ensuring efficiency and scalability in deployment. The project aims to deliver a fully operational fraud detection pipeline that includes model explainability, API deployment, and interactive dashboards.

## Objective
The primary objective of this challenge is to improve the detection of fraud cases in e-commerce and bank credit transactions by:

- Developing machine learning models to identify fraudulent patterns in both e-commerce and credit card transactions.
- Implementing geolocation analysis using IP address mappings and transaction pattern recognition.
- Enhancing model explainability through tools like SHAP and LIME for transparency in fraud detection.
- Deploying models for real-time prediction with Flask and Docker, and exposing APIs that serve fraud predictions.
- Creating an interactive dashboard with Dash to visualize fraud trends, fraud hotspots, and summary insights from transaction data.

## 1. Data Collection

### Import Necessary Libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


In [2]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Conv1D, Flatten

### Load Dataset

In [3]:
fraud_data = pd.read_csv("../data/Fraud_cleaned_Data.csv")
fraud_data.head()

Unnamed: 0.1,Unnamed: 0,purchase_value,age,ip_address,class,frequency,velocity,hour_of_day,day_of_week,time_diff,signup_hour,signup_day_of_week,purchase_day_of_week,source_Direct,source_SEO,browser_FireFox,browser_IE,browser_Opera,browser_Safari,sex_M
0,0,0.172414,0.362069,0.170603,0,1,34,0.086957,0.833333,4506682.0,22,1,5,False,True,False,False,False,False,True
1,1,0.048276,0.603448,0.081554,0,1,16,0.043478,0.0,17944.0,20,6,0,False,False,False,False,False,False,False
2,2,0.041379,0.603448,0.610371,1,1,15,0.782609,0.5,1.0,18,3,3,False,True,False,False,True,False,True
3,3,0.241379,0.396552,0.894219,0,1,44,0.565217,0.0,492085.0,21,1,0,False,True,False,False,False,True,True
4,4,0.206897,0.465517,0.096752,0,1,39,0.782609,0.333333,4361461.0,7,1,2,False,False,False,False,False,True,True


In [4]:
creditcard_data = pd.read_csv("../data/creditcard.csv")
creditcard_data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0
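The `head()` output for the fraud data lists `Unnamed: 0.1` and `Unnamed: 0` columns. Names like these usually appear when a DataFrame is saved with its index and read back, in which case they are row counters rather than features. That reading is an assumption about how the CSV files were produced, not something the notebook states. Under that assumption, a minimal sketch for removing them before modelling could look like this (the small frame below is a stand-in, not the real data):

```python
import pandas as pd

# Tiny stand-in frame; the column names mirror the fraud data head() output.
df = pd.DataFrame({
    "Unnamed: 0.1": [0, 1],
    "Unnamed: 0": [0, 1],
    "purchase_value": [0.172414, 0.048276],
    "class": [0, 0],
})

# Collect columns that look like leftover row indices (assumption: any
# column whose name starts with "Unnamed" carries no predictive signal).
index_like = [col for col in df.columns if col.startswith("Unnamed")]
df_clean = df.drop(columns=index_like)

print(df_clean.columns.tolist())
```

If these columns do turn out to be stale indices, dropping them keeps the models from learning row order instead of transaction behaviour.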


## 2. Train-Test Split

### 2.1 Separate dependent and independent features

In [5]:
# Credit card data: 'Class' is the fraud label
X_creditcard = creditcard_data.drop(columns=['Class'])  # independent features
y_creditcard = creditcard_data['Class']                 # target variable

# E-commerce fraud data: 'class' is the fraud label
X_fraud = fraud_data.drop(columns=['class'])  # independent features
y_fraud = fraud_data['class']                 # target variable


### 2.2 Train-test split (80/20)

In [6]:
# Train-test split for Credit Card Data
X_train_creditcard, X_test_creditcard, y_train_creditcard, y_test_creditcard = train_test_split(
    X_creditcard, y_creditcard, test_size=0.2, random_state=42, stratify=y_creditcard
)

# Train-test split for Fraud Data
X_train_fraud, X_test_fraud, y_train_fraud, y_test_fraud = train_test_split(
    X_fraud, y_fraud, test_size=0.2, random_state=42, stratify=y_fraud
)

In [7]:
print(y_train_fraud.value_counts())
print(y_train_creditcard.value_counts())

class
0    109568
1     11321
Name: count, dtype: int64
Class
0    227451
1       394
Name: count, dtype: int64


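The counts above show that both datasets are highly imbalanced: fraud makes up about 9.4% of the e-commerce training set (11,321 of 120,889 rows) and about 0.17% of the credit card training set (394 of 227,845 rows). We apply SMOTE to the training split only, to balance the classes and reduce the models' bias toward the majority class.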

### 2.3 Class balancing (Data resampling)

In [8]:
# Apply SMOTE for Credit Card Data
smote_creditcard = SMOTE(random_state=42)
X_train_creditcard_resampled, y_train_creditcard_resampled = smote_creditcard.fit_resample(X_train_creditcard, y_train_creditcard)

# Apply SMOTE for Fraud Data
smote_fraud = SMOTE(random_state=42)
X_train_fraud_resampled, y_train_fraud_resampled = smote_fraud.fit_resample(X_train_fraud, y_train_fraud)
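To make the resampling step less of a black box, here is a rough NumPy sketch of the core idea behind SMOTE: each synthetic row is placed at a random point on the line between a minority sample and one of its nearest minority neighbours. The function name `smote_like_sample` and its parameters are hypothetical, and this is a simplified illustration, not the `imblearn` implementation used above.

```python
import numpy as np

def smote_like_sample(X_minority, k=5, n_new=10, seed=42):
    """Create synthetic minority rows by interpolating toward a near neighbour."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)  # cannot have more neighbours than other samples

    # Pairwise Euclidean distances between all minority samples
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest per sample

    base_idx = rng.integers(0, n, size=n_new)  # random starting samples
    nb_idx = neighbors[base_idx, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    return X_minority[base_idx] + gap * (X_minority[nb_idx] - X_minority[base_idx])

# Stand-in minority class: 20 random points in 3 dimensions
rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 3))
synthetic = smote_like_sample(X_min, k=5, n_new=15)
print(synthetic.shape)
```

Because every synthetic row is a blend of two real minority rows, resampling only the training split (as done above) keeps the test set free of artificial samples.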


## 3. Model Building and Training

### 3.1 Machine Learning Models

#### Define a Function for Model Training and Evaluation

In [9]:
def evaluate_model(model, X_train, y_train, X_test, y_test):
    """Fit `model` on the training data and print test-set metrics."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    print(f"Model: {model.__class__.__name__}")
    print(f"Accuracy: {accuracy:.4f}")
    # Per-class precision, recall, and F1
    print(classification_report(y_test, y_pred))
    # Rows are true classes, columns are predicted classes
    print(confusion_matrix(y_test, y_pred))
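`evaluate_model` scores hard 0/1 predictions. With classes as imbalanced as these, it can also help to score the predicted fraud probabilities with threshold-free metrics such as average precision (area under the precision-recall curve) and ROC AUC. The sketch below is one possible extension, not part of the original notebook; it uses synthetic data from `make_classification` as a stand-in for the real features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 2% positives); NOT the fraud datasets.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.98, 0.02], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Probability of the positive (fraud) class for each test row
fraud_scores = model.predict_proba(X_te)[:, 1]

pr_auc = average_precision_score(y_te, fraud_scores)
roc_auc = roc_auc_score(y_te, fraud_scores)
print(f"Average precision: {pr_auc:.4f}")
print(f"ROC AUC: {roc_auc:.4f}")
```

All five classifiers in this notebook expose `predict_proba`, so the same two calls could be added inside `evaluate_model` if ranking quality matters more than a fixed decision threshold.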

#### Train and Evaluate the Models

In [10]:
# List of models to evaluate
models = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    GradientBoostingClassifier(),
    MLPClassifier(max_iter=1000)
]

# Evaluate models for Credit Card Data
for model in models:
    print(f"\nEvaluating {model.__class__.__name__} on Credit Card Data:")
    evaluate_model(model, X_train_creditcard_resampled, y_train_creditcard_resampled, X_test_creditcard, y_test_creditcard)

# Evaluate the same models on the Fraud Data (each fit() call retrains from scratch)
for model in models:
    print(f"\nEvaluating {model.__class__.__name__} on Fraud Data:")
    evaluate_model(model, X_train_fraud_resampled, y_train_fraud_resampled, X_test_fraud, y_test_fraud)



Evaluating LogisticRegression on Credit Card Data:


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Model: LogisticRegression
Accuracy: 0.9894
              precision    recall  f1-score   support

           0       1.00      0.99      0.99     56864
           1       0.13      0.90      0.23        98

    accuracy                           0.99     56962
   macro avg       0.56      0.94      0.61     56962
weighted avg       1.00      0.99      0.99     56962

[[56271   593]
 [   10    88]]
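The convergence message above recommends either raising `max_iter` or scaling the data. One hedged way to follow the second suggestion is to wrap the scaler and the classifier in a single scikit-learn pipeline, so that the scaler is fitted on the training data only. The sketch below uses synthetic data with one deliberately large-valued column to mimic unscaled features such as `Amount` or `Time`; it is an illustration, not a re-run of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit card features (not the real data)
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X[:, 0] *= 10_000  # mimic a column on a much larger scale than the rest

# Standardize features, then fit logistic regression on the scaled values
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X, y)

# Number of solver iterations actually used (well below max_iter if converged)
n_iter = scaled_logreg.named_steps["logisticregression"].n_iter_[0]
print(f"Solver iterations: {n_iter}")
```

A pipeline like this can be passed to `evaluate_model` in place of the bare `LogisticRegression`; the printed model name would then be `Pipeline`.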

Evaluating DecisionTreeClassifier on Credit Card Data:
Model: DecisionTreeClassifier
Accuracy: 0.9978
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.42      0.79      0.55        98

    accuracy                           1.00     56962
   macro avg       0.71      0.89      0.77     56962
weighted avg       1.00      1.00      1.00     56962

[[56758   106]
 [   21    77]]
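The fraud-class figures in the two reports above can be reproduced by hand from the confusion matrices, which scikit-learn lays out as `[[TN, FP], [FN, TP]]`. The short sketch below does that arithmetic and also computes the accuracy of a trivial model that labels every transaction as legitimate; `summarize` is a hypothetical helper name.

```python
def summarize(cm):
    """Return fraud-class metrics from a 2x2 confusion matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
        # Accuracy of a model that predicts class 0 for every row
        "majority_baseline": (tn + fp) / total,
    }

# Confusion matrices copied from the credit card results above
logreg = summarize([[56271, 593], [10, 88]])
tree = summarize([[56758, 106], [21, 77]])

for name, metrics in [("LogisticRegression", logreg), ("DecisionTreeClassifier", tree)]:
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

Both accuracies (0.9894 and 0.9978) fall below the all-legitimate baseline of about 0.9983, which is why precision and recall on class 1 are the more informative numbers here: logistic regression catches 88 of 98 frauds at the cost of 593 false alarms, while the decision tree catches 77 with 106 false alarms.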

Evaluating RandomForestClassifier on Credit Card Data:
Model: RandomForestClassifier
Accuracy: 0.9994
              precision    recall  