<a href="https://colab.research.google.com/github/Habeebhassan/Online_Fraud_Detection/blob/main/Fraud_Detection_for_Online_Payment_Platform.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Introduction

Online transactions have become increasingly popular, and surveys and research suggest that this trend will continue. The growth has also brought an increase in fraudulent transactions: despite the security systems already in place, a significant amount of money is still lost to fraud. An online fraud transaction occurs when a person uses someone else’s credit card without the owner or the card-issuing authority being aware of it. This project aims to detect such transactions.

**Project Scope**

The Online Fraud Transaction Detection System is an extension of an existing system. The algorithms built in this project process the dataset and flag each transaction as fraudulent or legitimate. In the long run, the system offers an efficient way to analyze transactions and detect fraud as part of a secure transaction platform. The proposed algorithm is XGBoost, a popular and efficient open-source implementation of gradient-boosted trees. Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining the estimates of a set of simpler, weaker models. Accuracy can often be improved further by training on a larger dataset. The scope of this application is far-reaching: it can be used to detect the features of fraudulent transactions in datasets from sectors such as banking, insurance, e-commerce, money transfer, and bill payments.
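
To make the "combining weaker models" idea concrete, here is a minimal sketch of gradient boosting with squared-error loss, where each weak model is a one-split decision stump fitted to the current residuals. It is not XGBoost itself (which adds regularization and second-order gradient information), and the helper names `fit_stump`, `predict_stump`, and `boost`, as well as the toy sine data, are assumptions made purely for illustration.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on x that best fits the residual (least squares)."""
    best = None
    # Exclude the largest value so the right-hand side of the split is never empty
    for threshold in np.unique(x)[:-1]:
        left = residual[x <= threshold]
        right = residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    """Predict the left or right leaf value depending on the threshold."""
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Squared-error gradient boosting: each stump fits the current residuals."""
    prediction = np.full_like(y, y.mean(), dtype=float)
    losses = []
    for _ in range(n_rounds):
        residual = y - prediction
        stump = fit_stump(x, residual)
        prediction = prediction + learning_rate * predict_stump(stump, x)
        losses.append(float(np.mean((y - prediction) ** 2)))
    return prediction, losses

# Toy regression data (illustrative only, unrelated to the fraud dataset)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(0, 0.1, size=200)

prediction, losses = boost(x, y)
print(f"MSE after round 1: {losses[0]:.4f}, after round 50: {losses[-1]:.4f}")
```

Each round only nudges the prediction by `learning_rate` times the stump's output, which is why many weak models are needed and why the training loss shrinks gradually rather than in one step.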

**Work Flow**

- Load DataSet
- Data Preprocessing
- Feature Selection/Feature Engineering
- Classification
- XGBoost Model Training
- Prediction
- Evaluation

In [1]:
print("Installing necessary libraries")
!pip install pyforest
!pip install category_encoders


Installing necessary libraries
Collecting pyforest
  Downloading pyforest-1.1.0.tar.gz (15 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyforest
  Building wheel for pyforest (setup.py) ... done
  Created wheel for pyforest: filename=pyforest-1.1.0-py2.py3-none-any.whl size=14605 sha256=c0abbecf6e3932891921c5434edbb3facca7551347aefbaf0eabb78273442c1f
  Stored in directory: /root/.cache/pip/wheels/9e/7d/2c/5d2f5e62de376c386fd3bf5a8e5bd119ace6a9f48f49df6017
Successfully built pyforest
Installing collected packages: pyforest
Successfully installed pyforest-1.1.0
Collecting category_encoders
  Downloading category_encoders-2.6.2-py2.py3-none-any.whl (81 kB)
     81.8/81.8 kB 1.1 MB/s eta 0:00:00
Installing collected packages: category_encoders
Successfully installed category_encoders-2.6.2


In [2]:
print("Importing libraries")
from pyforest import *

Importing libraries


In [3]:
print("Mounting Google drive to load data")
from google.colab import drive
drive.mount('/content/gdrive')

Mounting Google drive to load data
Mounted at /content/gdrive


In [4]:
data = pd.read_csv('gdrive/MyDrive/Fraud_Detection_Dataset.csv')


In [5]:

data.shape

(6000000, 32)

In [6]:
# Check for missing values
data.isnull().sum()

Transaction ID                       0
User ID                              0
Transaction Amount                   0
Transaction Date and Time            0
Merchant ID                          0
Payment Method                       0
Country Code                         0
Transaction Type                     0
Device Type                          0
IP Address                           0
Browser Type                         0
Operating System                     0
Merchant Category                    0
User Age                             0
User Occupation                      0
User Income                          0
User Gender                          0
User Account Status                  0
Transaction Status                   0
Location Distance                    0
Time Taken for Transaction           0
Transaction Time of Day              0
User's Transaction History           0
Merchant's Reputation Score          0
User's Device Location               0
Transaction Currency     

In [7]:
data.head(2).T

Unnamed: 0,0,1
Transaction ID,51595306,85052974
User ID,9822,4698
Transaction Amount,163.08,430.74
Transaction Date and Time,2023-01-02 07:47:54,2021-09-12 15:15:41
Merchant ID,4044,4576
Payment Method,ACH Transfer,2Checkout
Country Code,KOR,VNM
Transaction Type,Charity,Cashback
Device Type,GPS Device,Medical Device
IP Address,42.23.223.120,39.52.212.120


In [8]:
# Check the data type of each feature
data.dtypes

Transaction ID                         int64
User ID                                int64
Transaction Amount                   float64
Transaction Date and Time             object
Merchant ID                            int64
Payment Method                        object
Country Code                          object
Transaction Type                      object
Device Type                           object
IP Address                            object
Browser Type                          object
Operating System                      object
Merchant Category                     object
User Age                               int64
User Occupation                       object
User Income                          float64
User Gender                           object
User Account Status                   object
Transaction Status                    object
Location Distance                    float64
Time Taken for Transaction           float64
Transaction Time of Day               object
User's Tra

In [9]:
# Create a copy of the dataset
X = data.copy()

The `Transaction Date and Time` column is read as an `object` dtype, so I converted it to a datetime type and then extracted the year, month, and day into separate columns.

In [10]:
# convert to date time series
X['Transaction Date and Time'] = pd.to_datetime(X['Transaction Date and Time'])


In [11]:
# Extract year, month, day in each column

X['Year'] = X['Transaction Date and Time'].dt.year
X['Month'] = X['Transaction Date and Time'].dt.month
X['Day'] = X['Transaction Date and Time'].dt.day


In [12]:
# Remove the User Gender column
#X.drop('User Gender', inplace=True, axis=1)

Because the dataset has so many observations, visualizing all of it to check the relationship between the features and the target variable would be resource-intensive, so a subset of the observations is extracted.

In [13]:
#sample_X = X.sample(n=10000, random_state=0)
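
The subsampling step can be sketched as follows. The commented cell draws a plain random sample; the variant below is a stratified draw, which keeps the share of each `Fraudulent Flag` class the same as in the full data. The helper name `stratified_sample` and the toy frame are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

def stratified_sample(frame, target, n, seed=0):
    """Draw roughly n rows while keeping each class's share of the target column."""
    fraction = n / len(frame)
    return frame.groupby(target, group_keys=False).sample(frac=fraction, random_state=seed)

# Toy data with a 10% positive rate (illustrative values only)
toy = pd.DataFrame({
    "Transaction Amount": range(1000),
    "Fraudulent Flag": [1 if i % 10 == 0 else 0 for i in range(1000)],
})

sample_toy = stratified_sample(toy, "Fraudulent Flag", n=100)
print(sample_toy["Fraudulent Flag"].value_counts(normalize=True))
```

With a rare positive class, a stratified draw guarantees that the small sample still contains fraud cases in the original proportion, which a plain random sample only does on average.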

In [14]:
# # Split categorical variables and check their unique values
# cat_col = sample_X.dtypes[sample_X.dtypes == 'object'].index.tolist()
# for col in cat_col:
#   sample_X[col] = sample_X[col].astype('category')
# cat_col = sample_X.dtypes[sample_X.dtypes == 'category'].index.tolist()

In [15]:
# Split categorical variables and convert from object to category
cat_col = X.dtypes[X.dtypes == 'object'].index.tolist()
for col in cat_col:
  X[col] = X[col].astype('category')
cat_col = X.dtypes[X.dtypes == 'category'].index.tolist()

In [16]:
cat_col

['Payment Method',
 'Country Code',
 'Transaction Type',
 'Device Type',
 'IP Address',
 'Browser Type',
 'Operating System',
 'Merchant Category',
 'User Occupation',
 'User Gender',
 'User Account Status',
 'Transaction Status',
 'Transaction Time of Day',
 "User's Device Location",
 'Transaction Currency',
 'Transaction Purpose',
 "User's Email Domain",
 'Transaction Authentication Method']

In [17]:
# # visualize the relationship between the categorical variables and the target variable
# for cat_col in cat_col:
#   contingency_table = pd.crosstab(sample_X[cat_col], sample_X['Fraudulent Flag'])

#   plt.figure(figsize=(10, 6))  # Adjust the figure size as needed
#   sns.heatmap(contingency_table, annot=True, fmt='d', cmap='YlGnBu')
#   plt.title('Heatmap of Relationship for '+cat_col)
#   plt.xlabel('fraudulent Flag')
#   plt.ylabel(cat_col)
#   plt.show()



Most of the categorical variables have a large number of unique values, especially the `IP Address` column. Still, it is worth checking the relationship between the categorical variables and the target variable.
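
A lightweight way to run both checks without plotting is sketched below: count the unique values per categorical column, then compute the fraud rate per category for a low-cardinality column. The toy rows are invented for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Toy rows (illustrative values only)
toy = pd.DataFrame({
    "Payment Method": ["ACH Transfer", "2Checkout", "ACH Transfer",
                       "2Checkout", "ACH Transfer", "2Checkout"],
    "IP Address": ["42.23.223.120", "39.52.212.120", "10.0.0.1",
                   "10.0.0.2", "10.0.0.3", "10.0.0.4"],
    "Fraudulent Flag": [0, 1, 0, 1, 1, 0],
})
cat_cols = ["Payment Method", "IP Address"]

# Number of distinct values per categorical column, highest first
cardinality = toy[cat_cols].nunique().sort_values(ascending=False)
print(cardinality)

# Share of fraudulent rows within each payment method
fraud_rate = toy.groupby("Payment Method")["Fraudulent Flag"].mean()
print(fraud_rate)
```

A column such as `IP Address`, where nearly every row has its own value, carries almost no per-category signal, so a per-category fraud rate is only informative for columns with repeated values.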

In [18]:
# sample_X.drop('Transaction Date and Time', axis =1, inplace=True)
X.drop('Transaction Date and Time', axis =1, inplace=True)

In [19]:
# # split features from target

# X_data = sample_X.drop('Fraudulent Flag', axis=1)
# y_data = sample_X.pop('Fraudulent Flag')

# split features from target

X_data = X.drop('Fraudulent Flag', axis=1)
y_data = X['Fraudulent Flag']  # select the column instead of pop() so X is not mutated

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.5, stratify=y_data, random_state=0)


In [21]:
X_train_small = X_train.iloc[:1000000]
y_train_small = y_train.iloc[:1000000]
X_test_small = X_test.iloc[:500000]
y_test_small = y_test.iloc[:500000]

In [22]:
# dtrain = xgb.DMatrix(data=X_train, label= y_train, enable_categorical=True)
# dtest = xgb.DMatrix(data=X_test, label= y_test, enable_categorical=True)

# category_col = [cat_col]

# dtrain.set_info(feature_types=[('f' if col in category_col else 'q') for col in X_train.columns])
# #dtest.set_info(feature_types=[('f' if col in category_col else 'q') for col in X_test.columns])

# # params = {'objective': 'binary:logistic', 'eval_metric': 'logloss'}

# params = {
#     'min_child_weight': 7,
#     'subsample': 0.577,
#     'max_depth': 10,
#     'reg_lambda': 0.377,
#     'learning_rate': 0.017,
#     'colsample_bytree': 0.974,
#     'reg_alpha': 0.206
# }

# #params = {'objective': 'binary:logistic', 'eval_metric': 'logloss', 'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 3, 'min_child_weight': 1, 'subsample': 1.0}

# model = xgb.train(params, dtrain, num_boost_round=100)

Tuning the parameters

In [41]:
import category_encoders as ce
from sklearn.metrics import accuracy_score

# Initialize target encoder
encoder = ce.TargetEncoder(cols=cat_col)  # Specify categorical columns

# Fit the encoder on the training data, then transform both training and testing data
X_train_encoded = encoder.fit_transform(X_train_small, y_train_small)
X_test_encoded = encoder.transform(X_test_small)  # test labels are not passed to the encoder

from imblearn.over_sampling import SMOTE
X_resampled, y_resampled = SMOTE().fit_resample(X_train_encoded, y_train_small)
X_test_resampled, y_test_resampled = SMOTE().fit_resample(X_test_encoded, y_test_small)

X_resampled.drop(['Year', 'Month', 'Day'], axis=1, inplace=True)
X_test_resampled.drop(['Year', 'Month', 'Day'], axis=1, inplace=True)


dtrain = xgb.DMatrix(data=X_resampled, label= y_resampled)
dtest = xgb.DMatrix(data=X_test_resampled, label= y_test_resampled)

# # Assuming 'X_train' and 'y_train' are your training data

# # Initialize the XGBoost model with categorical support enabled
# model = xgb.XGBClassifier(verbosity=2)

# # Define the parameter grid for hyperparameter tuning
# param_grid = {
#     'max_depth': [3, 7],
#     'min_child_weight': [1, 5],
#     'subsample': [0.8, 1.0],
#     'colsample_bytree': [0.8, 1.0],
#     'learning_rate': [0.1, 0.01, 0.5, 0.9],
#     'n_estimators': [200]
# }

scale_pos_weight = len(y_resampled[y_resampled == 0]) / len(y_resampled[y_resampled == 1])

params = {
    'min_child_weight': 7,
    'subsample': 0.577,
    'max_depth': 10,
    'reg_lambda': 0.377,
    'learning_rate': 0.017,
    'colsample_bytree': 0.974,
    'reg_alpha': 0.206,
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'scale_pos_weight': scale_pos_weight

}
# # Initialize GridSearchCV
# grid_search = GridSearchCV(estimator=model, param_grid=param_grid, scoring='roc_auc', cv=3, verbose=2, n_jobs=-1,  error_score='raise')

# # Perform the hyperparameter tuning
# grid_search.fit(X_train_encoded, y_train_small)

# # Get the best hyperparameters
# best_params = grid_search.best_params_

# # Train the model with the best hyperparameters
#best_model = xgb.XGBClassifier(params, verbosity=2)
best_model = xgb.train(params, dtrain, num_boost_round=100)

# # Evaluate the model
y_pred = best_model.predict(dtest)
# accuracy = accuracy_score(y_test_resampled, y_pred)

# # Display results
# #print(f'Best Hyperparameters: {best_params}')
# print(f'Accuracy: {accuracy}')



ValueError: ignored
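
The idea behind target encoding in the cell above can be sketched in plain pandas: each category is replaced by a smoothed mean of the target, learned from the training rows only and then applied to the test rows. This is a simplified additive-smoothing version; the formula used inside `category_encoders.TargetEncoder` may differ. The helper names, the `smoothing` value, and the toy rows are assumptions for illustration.

```python
import pandas as pd

def fit_target_encoding(categories, target, smoothing=10.0):
    """Learn a smoothed per-category mean of the target from training data only."""
    prior = target.mean()
    stats = target.groupby(categories).agg(["mean", "count"])
    # Categories with few rows are pulled towards the global mean (the prior)
    weight = stats["count"] / (stats["count"] + smoothing)
    mapping = weight * stats["mean"] + (1 - weight) * prior
    return mapping, prior

def apply_target_encoding(categories, mapping, prior):
    """Map categories to learned values; unseen categories fall back to the prior."""
    return categories.astype(object).map(mapping).fillna(prior).astype(float)

# Toy rows (illustrative values only)
train = pd.DataFrame({
    "Country Code": ["KOR", "KOR", "VNM", "VNM", "VNM", "KOR"],
    "Fraudulent Flag": [1, 0, 0, 0, 1, 1],
})
test = pd.DataFrame({"Country Code": ["KOR", "VNM", "USA"]})

mapping, prior = fit_target_encoding(train["Country Code"], train["Fraudulent Flag"])
test_encoded = apply_target_encoding(test["Country Code"], mapping, prior)
print(test_encoded.tolist())
```

The test labels never enter the computation: the mapping is built from `train` alone, and a country that was not seen during fitting (`USA` here) receives the global training mean.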

In [None]:
#X_test_resampled, y_test_resampled = SMOTE().fit_resample(X_test_encoded, y_test)
#dtest = xgb.DMatrix(data=X_test_resampled, label= y_test_resampled)
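
One detail of the tuning cell is worth illustrating. There, `scale_pos_weight` is computed from `y_resampled`, that is, after SMOTE has balanced the classes, so the ratio comes out as 1 and the parameter has no effect. The sketch below, using a hypothetical helper `compute_scale_pos_weight` and made-up label arrays, shows the difference between computing the ratio on imbalanced and on balanced labels.

```python
import numpy as np

def compute_scale_pos_weight(labels):
    """Ratio of negative to positive examples, the usual starting value for scale_pos_weight."""
    labels = np.asarray(labels)
    n_negative = int((labels == 0).sum())
    n_positive = int((labels == 1).sum())
    if n_positive == 0:
        raise ValueError("no positive examples in labels")
    return n_negative / n_positive

# Made-up label arrays for illustration
imbalanced = np.array([0] * 95 + [1] * 5)   # original-style class imbalance
balanced = np.array([0] * 50 + [1] * 50)    # what oversampling produces

print(compute_scale_pos_weight(imbalanced))  # 19.0
print(compute_scale_pos_weight(balanced))    # 1.0
```

In practice one would typically pick a single imbalance strategy: either oversample the training set, or keep the original training labels and pass the ratio computed from them as `scale_pos_weight`.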


In [None]:
# best_model.save_model('xgboost_model.json')


In [None]:
# # Create the parameter grid: gbm_param_grid
# gbm_param_grid = {
#     'n_estimators': [25],
#     'max_depth': range(2, 12)
# }

# # Instantiate the regressor: gbm
# gbm = xgb.XGBRegressor(n_estimators=10)

# # Perform random search: grid_mse
# randomized_mse = RandomizedSearchCV(param_distributions=gbm_param_grid, estimator=gbm, scoring="roc_auc", n_iter=5, cv=4, verbose=1)


# # Fit randomized_mse to the data
# randomized_mse.fit(X_train_encoded, y_train_small)

# best_params = randomized_mse.best_params_

# # Train the model with the best hyperparameters
# best_model = xgb.XGBClassifier(**best_params, verbosity=2)
# best_model.fit(X_train_encoded, y_train_small)

# # Print the best parameters and the best ROC AUC (scoring="roc_auc" above)
# print("Best parameters found: ", randomized_mse.best_params_)
# print("Best ROC AUC found: ", randomized_mse.best_score_)

In [None]:
# feature_importances = model.get_score(importance_type='weight')


In [None]:
# # Assuming 'threshold' is a chosen threshold for feature importance
# threshold = 10
# relevant_features = [feature for feature, score in feature_importances.items() if score >= threshold]

# # Filter out irrelevant features
# X_train_filtered = X_train[relevant_features]
# X_test_filtered = X_test[relevant_features]


In [None]:
# dtrain = xgb.DMatrix(data=X_train_filtered, label= y_train, enable_categorical=True)
# dtest = xgb.DMatrix(data=X_test_filtered, label= y_test, enable_categorical=True)

# category_col = [cat_col]

# dtrain.set_info(feature_types=[('f' if col in category_col else 'q') for col in X_train.columns])
# #dtest.set_info(feature_types=[('f' if col in category_col else 'q') for col in X_test.columns])

# params = {'objective': 'binary:logistic', 'eval_metric': 'logloss'}

# #params = {'objective': 'binary:logistic', 'eval_metric': 'logloss', 'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 3, 'min_child_weight': 1, 'subsample': 1.0}

# model = xgb.train(params, dtrain, num_boost_round=100)

In [None]:
# # 'importance_scores' is the dictionary obtained from model.get_score()
# importance_scores = model.get_score()

# # Sort the features by importance score in descending order
# sorted_scores = sorted(importance_scores.items(), key=lambda x: x[1], reverse=True)
# features, scores = zip(*sorted_scores)

# # Create a bar plot
# plt.figure(figsize=(10, 6))
# plt.bar(features, scores)
# plt.xlabel('Features')
# plt.ylabel('Importance Score')
# plt.title('Feature Importance')
# plt.xticks(rotation=45)
# plt.show()


In [42]:
from sklearn.metrics import accuracy_score, classification_report

y_pred = best_model.predict(dtest)

# Convert predicted probabilities to binary predictions (0 or 1)
y_pred_binary = [1 if p > 0.3 else 0 for p in y_pred]

# Evaluate model performance
accuracy = accuracy_score(y_test_resampled, y_pred_binary)
report = classification_report(y_test_resampled, y_pred_binary)

# Display results
print(f'Accuracy: {accuracy}')
print('Classification Report:')
print(report)

Accuracy: 0.5000179536795069
Classification Report:
              precision    recall  f1-score   support

           0       0.52      0.00      0.00    250645
           1       0.50      1.00      0.67    250645

    accuracy                           0.50    501290
   macro avg       0.51      0.50      0.33    501290
weighted avg       0.51      0.50      0.33    501290
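
The report above shows recall of 1.00 for class 1 and 0.00 for class 0, which means that at the 0.3 cutoff almost every row is labeled as fraud. A small sketch of how the cutoff changes precision and recall is given below; the function names and the toy probabilities are assumptions for illustration and are not outputs of the trained model.

```python
import numpy as np

def confusion_at_threshold(y_true, y_prob, threshold):
    """Return (tn, fp, fn, tp) after thresholding predicted probabilities."""
    y_true = np.asarray(y_true)
    y_hat = (np.asarray(y_prob) > threshold).astype(int)
    tp = int(((y_hat == 1) & (y_true == 1)).sum())
    tn = int(((y_hat == 0) & (y_true == 0)).sum())
    fp = int(((y_hat == 1) & (y_true == 0)).sum())
    fn = int(((y_hat == 0) & (y_true == 1)).sum())
    return tn, fp, fn, tp

def precision_recall(y_true, y_prob, threshold):
    """Precision and recall of the positive class at a given threshold."""
    tn, fp, fn, tp = confusion_at_threshold(y_true, y_prob, threshold)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy labels and probabilities (illustrative values only)
y_true = [0, 0, 0, 1, 1, 0]
y_prob = [0.35, 0.40, 0.10, 0.80, 0.45, 0.55]

for threshold in (0.3, 0.5, 0.7):
    precision, recall = precision_recall(y_true, y_prob, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

Scanning a few thresholds like this, ideally on test data that has not been resampled, gives a clearer picture than a single accuracy figure when one class dominates the predictions.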



Identify the categorical variables with fewer than 20 unique values and encode them with LabelEncoder; use a target encoder for the categorical variables with more than 20 unique values.
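
Splitting the columns by cardinality can be sketched as below. Note that the commented cells that follow use one-hot encoding (`pd.get_dummies`) for the low-cardinality group and frequency encoding for the high-cardinality group, so the sketch follows those cells rather than the LabelEncoder and target encoder named in the text. The toy frame and the cutoff of 3 (instead of 20) are assumptions chosen to keep the example small.

```python
import pandas as pd

# Toy rows (illustrative values only)
toy = pd.DataFrame({
    "User Gender": ["F", "M", "F", "M", "F"],
    "IP Address": ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.1"],
})
cat_cols = ["User Gender", "IP Address"]
max_unique = 3  # hypothetical cutoff for the toy data; the text uses 20

# Split the categorical columns by their number of unique values
n_unique = toy[cat_cols].nunique()
low_card = n_unique[n_unique < max_unique].index.tolist()
high_card = n_unique[n_unique >= max_unique].index.tolist()

# One-hot encode the low-cardinality columns
encoded = pd.get_dummies(toy, columns=low_card)

# Frequency-encode the high-cardinality columns: each value becomes its share of rows
for col in high_card:
    frequency = toy[col].value_counts(normalize=True)
    encoded[col + "_freq"] = toy[col].map(frequency)
encoded = encoded.drop(columns=high_card)

print(encoded)
```

Using `<` and `>=` for the two groups ensures that a column with exactly the cutoff number of unique values lands in one of them instead of being dropped silently.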


In [None]:

# unique_cat_val_count = X[cat_col].nunique()
# cat_val_less_20 = unique_cat_val_count[unique_cat_val_count < 20].index.tolist()
# cat_val_grter_20 = unique_cat_val_count[unique_cat_val_count >= 20].index.tolist()

In [None]:
# X[cat_val_less_20].nunique()

In [None]:
# One-hot encode the categorical variables with fewer than 20 unique values (pd.get_dummies)
# X = pd.get_dummies(X, columns=cat_val_less_20)

In [None]:
# y_data

In [None]:
# cat_val_grter_20

In [None]:
# # Frequency-encode the categorical variables with 20 or more unique values
# for col in cat_val_grter_20:
#   X[col + '_freq'] = X[col].astype(object).map(X[col].value_counts(normalize=True))

In [None]:
# X_train, X_test, y_train, y_test = train_test_split(X, y_data, test_size=0.3)

In [None]:
# model = xgb()
# xgb.train()