<a href="https://colab.research.google.com/github/AnnetteNakiwala/Business-Intelligence/blob/main/Fraud_detection_on_online_bank_transactions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**FRAUD DETECTION ON ONLINE BANK TRANSACTIONS**

**Introduction:**

The purpose of this project is to develop a fraud detection system for online transactions that uses machine learning models. The goal is to build a system that can accurately detect fraudulent transactions and prevent financial losses for both consumers and businesses.

**Data:**

The first step in this project is to collect and preprocess data from online transactions. This data should include information such as the transaction amount, the time and date of the transaction, the location of the transaction, and the type of payment method used. Other relevant data could include the user's location, device information, and behavior patterns.

**Model Development:**

The next step is to develop machine learning models that can accurately detect fraudulent transactions. This involves training the models on historical data that contains both fraudulent and legitimate transactions. The models should be able to learn patterns in the data that are associated with fraudulent activity and use those patterns to identify suspicious transactions.

**Evaluation:**

Once the models have been developed, they need to be evaluated to determine their effectiveness at detecting fraud. This involves testing the models on a separate set of data that contains both fraudulent and legitimate transactions. The performance of the models should be evaluated using metrics such as precision, recall, and F1 score.
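
As a minimal sketch of the metrics named above, the block below computes precision, recall, and F1 by hand for the fraud class. The labels are made up for illustration, and the helper name `precision_recall_f1` is hypothetical rather than part of any library.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, f1) for the positive class (label 1 = fraud)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # fraud caught
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # fraud missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Made-up labels: 1 = fraud, 0 = legitimate
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy example 7 of 10 predictions are correct, yet only half of the fraud cases are caught, which is why recall and precision are reported alongside accuracy.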

**Deployment:**

Once the models have been developed and evaluated, they can be deployed in a real-world environment. This involves integrating the models into an online payment system and monitoring transactions in real-time. If a transaction is flagged as potentially fraudulent, it can be reviewed by a human analyst who can make the final determination about whether it is fraudulent or not.

**Conclusion:**

In conclusion, this project aims to develop a fraud detection system for online transactions that uses machine learning models. By accurately detecting fraudulent transactions, this system can prevent financial losses for both consumers and businesses and help to maintain the integrity of online payment systems.

**Task Note:**
Machine learning models to use:
1. LogisticRegression
2. KNeighborsClassifier
3. RandomForestClassifier
4. XGBClassifier
5. Support Vector Machine classifier (`svm.SVC`)

**Data:**
Download the dataset from this link below:
https://www.kaggle.com/ntnu-testimon/paysim1

**The dataset has the following columns:**

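**step** - maps a unit of time in the real world. In this case 1 step is 1 hour of time. There are 744 steps in total, which at one hour per step is 31 days (the dataset description calls it a 30-day simulation).

**type** - the transaction type: CASH_IN, CASH_OUT, DEBIT, PAYMENT or TRANSFER.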

**amount** - amount of the transaction in local currency.

**nameOrig** - customer who started the transaction

**oldbalanceOrg** - initial balance before the transaction

**newbalanceOrig** - new balance after the transaction

**nameDest** - customer who is the recipient of the transaction

**oldbalanceDest** - the recipient's initial balance before the transaction. Note that there is no information for customers whose names start with M (Merchants).

**newbalanceDest** - the recipient's new balance after the transaction. Note that there is no information for customers whose names start with M (Merchants).

**isFraud** - marks the transactions made by the fraudulent agents inside the simulation. In this dataset the fraudulent agents aim to profit by taking control of customers' accounts and trying to empty the funds by transferring them to another account and then cashing out of the system.

**isFlaggedFraud** - The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction.

In [None]:
# Import Libraries




In [None]:
#Load data
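
A minimal sketch of the loading step, assuming the Kaggle CSV has been downloaded and saved locally. The file name in `DATA_PATH` and the helper `load_transactions` are hypothetical; a tiny made-up sample with the documented columns stands in for the real file so the cell runs without the download.

```python
import io

import pandas as pd

# Assumption: the downloaded CSV was saved under this (hypothetical) name.
DATA_PATH = "paysim.csv"


def load_transactions(source):
    """Read PaySim-style transactions from a file path or a file-like object."""
    return pd.read_csv(source)


# Made-up sample rows with the documented columns.
sample_csv = io.StringIO(
    "step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,"
    "nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud\n"
    "1,PAYMENT,9839.64,C1,170136.0,160296.36,M1,0.0,0.0,0,0\n"
    "1,TRANSFER,181.0,C2,181.0,0.0,C3,0.0,0.0,1,0\n"
    "1,CASH_OUT,181.0,C4,181.0,0.0,C5,21182.0,0.0,1,0\n"
)

df = load_transactions(sample_csv)  # with the real file: load_transactions(DATA_PATH)
print(df.head())
```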




**Exploratory Data Analysis (EDA)**

In [None]:
#check the shape of the dataset




In [None]:
#examine the dataset using describe function




In [None]:
#Check if there are any null values




In [None]:
#check for duplicate values
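
The four checks in this section (shape, summary statistics, null values, duplicates) could look roughly like the sketch below. The small DataFrame is made up so the cell is self-contained; with the real data, `df` would be the loaded dataset.

```python
import pandas as pd

# Made-up rows standing in for the real dataset; the last row repeats the first.
df = pd.DataFrame({
    "step": [1, 1, 2, 1],
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "amount": [100.0, 250.0, None, 100.0],
    "isFraud": [0, 1, 1, 0],
})

n_rows, n_cols = df.shape                   # number of rows and columns
summary = df.describe()                     # count, mean, std, min, quartiles, max (numeric columns)
nulls_per_column = df.isnull().sum()        # missing values in each column
n_duplicates = int(df.duplicated().sum())   # rows identical to an earlier row

print(n_rows, n_cols)
print(summary)
print(nulls_per_column)
print(n_duplicates)
```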




**Distribution of all Transactions**

In [None]:
#Distribution of the frequency of all transactions 




In [None]:
#safe transactions amount distribution plot




In [None]:
# Fraud transactions amount distribution plot




In [None]:
#Fraud transaction boxplot for amount distribution
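
One possible way to draw the amount distributions, sketched with matplotlib on synthetic data. The log-normal amounts, the 5% fraud rate, and the choice of a log10 axis are all assumptions made for illustration, not properties taken from the dataset.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs as a script; not needed in Colab
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "amount": rng.lognormal(mean=10, sigma=1.5, size=500),
    "isFraud": rng.choice([0, 1], size=500, p=[0.95, 0.05]),
})

safe = df.loc[df["isFraud"] == 0, "amount"].to_numpy()
fraud = df.loc[df["isFraud"] == 1, "amount"].to_numpy()

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].hist(np.log10(safe), bins=30)        # amounts span several orders of magnitude
axes[0].set_title("Safe transactions: log10(amount)")
axes[1].hist(np.log10(fraud), bins=30)
axes[1].set_title("Fraud transactions: log10(amount)")
axes[2].boxplot(fraud)                       # median, quartiles and outliers of fraud amounts
axes[2].set_title("Fraud transactions: amount")
fig.tight_layout()
```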




**Type of Transactions**

In [None]:
# check the type of safe transactions using the value_counts function




In [None]:
# check the type of fraud transactions using the value_counts function
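
A short sketch of counting transaction types separately for safe and fraudulent rows with pandas `value_counts`. The rows below are made up; only the column names `type` and `isFraud` come from the dataset description.

```python
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "CASH_OUT", "CASH_IN", "TRANSFER", "PAYMENT"],
    "isFraud": [0, 1, 1, 0, 0, 0, 0],
})

safe_types = df.loc[df["isFraud"] == 0, "type"].value_counts()
fraud_types = df.loc[df["isFraud"] == 1, "type"].value_counts()

print(safe_types)
print(fraud_types)
```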




**Machine learning**

In [None]:
#drop the name columns



In [None]:
#Binary-encoding of labelled data in 'type' for 'CASH_OUT' and 'TRANSFER' Columns
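
One reading of this step, sketched under the assumption that the goal is a 0/1 indicator column for each of the two types of interest, after which the original string column is dropped. The sample rows are made up.

```python
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "DEBIT", "CASH_IN"],
    "amount": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# One 0/1 indicator column per transaction type of interest.
for label in ["CASH_OUT", "TRANSFER"]:
    df[label] = (df["type"] == label).astype(int)

# The string column is no longer needed once it has been encoded.
df = df.drop(columns=["type"])
print(df)
```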





In [None]:
# Split the target and features from the dataset





In [None]:
# split the data into train and test
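
A sketch of separating the target from the features and splitting into train and test sets with scikit-learn. The toy DataFrame, the 25% test size, and the use of `stratify` are assumptions; stratifying keeps the share of fraud rows the same in both parts, which matters when fraud is rare.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, 4 of them fraud.
df = pd.DataFrame({
    "amount": [float(i) for i in range(20)],
    "CASH_OUT": [i % 2 for i in range(20)],
    "isFraud": [1 if i % 5 == 0 else 0 for i in range(20)],
})

X = df.drop(columns=["isFraud"])  # features
y = df["isFraud"]                 # target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```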





In [None]:
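# General function to run a classifier with default parameters to get a baseline model.
# Assumes X_train, X_test, y_train and y_test were created in the train/test split cell above.
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report

# one row per model, filled in by ml_func
results = pd.DataFrame(columns=['model', 'train_accuracy', 'test_accuracy'])

def ml_func(algorithm):
  # train and fit the classification model
  model = algorithm()
  model.fit(X_train, y_train)

  # predict
  train_preds = model.predict(X_train)
  test_preds = model.predict(X_test)

  # Evaluate
  train_accuracy = accuracy_score(y_train, train_preds)
  test_accuracy = accuracy_score(y_test, test_preds)
  print(algorithm.__name__)
  print(f'Train accuracy: {train_accuracy:.4f}, test accuracy: {test_accuracy:.4f}')
  print(classification_report(y_test, test_preds))

  # store accuracy in a new dataframe
  results.loc[len(results)] = [algorithm.__name__, train_accuracy, test_accuracy]
  return model
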


**Machine Learning Algorithms Explained**

**Logistic Regression:**

Logistic Regression is a linear classification algorithm used to predict binary or multi-class outcomes. It uses the logistic (sigmoid) function to turn a weighted sum of the predictor variables into an estimated probability of the positive class. It is commonly used in fields such as finance, marketing, and healthcare. The coefficients are estimated by maximum likelihood, and the model is evaluated with metrics such as accuracy, precision, recall, and F1-score.
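
To make the logistic function concrete, here is a minimal sketch of how a fitted logistic regression turns features into a probability. The weights and bias are made-up numbers, not coefficients estimated from the dataset, and `predict_proba` here is a hypothetical helper, not the scikit-learn method.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class for one observation."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical coefficients for two features.
weights = [0.8, -0.5]
bias = -0.2

p = predict_proba([1.0, 2.0], weights, bias)  # z = -0.2 + 0.8 - 1.0 = -0.4
label = int(p >= 0.5)                          # classify as positive when p >= 0.5
print(p, label)
```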

**K-Nearest Neighbors Classifier:**

The K-Nearest Neighbors (KNN) Classifier is a non-parametric algorithm that is used for classification and regression. KNN works by calculating the distances between a new observation and every observation in the training data, and then selecting the K closest neighbors. For classification, the new observation is assigned the majority class of its K nearest neighbors. The value of K is usually chosen by cross-validation, and the resulting model is evaluated with metrics such as accuracy, precision, recall, and F1-score.
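
The neighbor vote described above can be sketched in a few lines of plain Python. This is an illustration of the idea on made-up points, not how scikit-learn implements `KNeighborsClassifier`; the helper name `knn_predict` is hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up 2-D points: class 0 clusters near (1, 1), class 1 near (8, 8).
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
train_labels = [0, 0, 0, 1, 1, 1]

pred_low = knn_predict(train_points, train_labels, (2.0, 2.0), k=3)
pred_high = knn_predict(train_points, train_labels, (8.0, 9.0), k=3)
print(pred_low, pred_high)
```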

**Random Forest Classifier:**

Random Forest Classifier is an ensemble learning algorithm that is used for classification and regression. It works by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes predicted by the individual trees. Random Forest Classifier is widely used because it is relatively easy to use, performs well on many datasets, and is less prone to overfitting than a single decision tree. The performance of Random Forest Classifier is evaluated using metrics such as accuracy, precision, recall, and F1-score.
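
The sketch below illustrates the voting idea on synthetic, well-separated data: each tree in a fitted forest is asked for its prediction and the most common class is taken. Note that scikit-learn's `RandomForestClassifier.predict` averages the trees' class probabilities rather than counting hard votes; on clear-cut data like this the two agree.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: two well-separated clusters.
rng = np.random.default_rng(0)
X_legit = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
X_fraud = rng.normal(loc=6.0, scale=1.0, size=(50, 2))
X = np.vstack([X_legit, X_fraud])
y = np.array([0] * 50 + [1] * 50)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X, y)

query = np.array([[6.0, 6.0]])

# Hard vote: ask every tree separately and take the most common class.
tree_votes = np.array([int(tree.predict(query)[0]) for tree in forest.estimators_])
majority_class = int(np.bincount(tree_votes).argmax())

# The forest's own prediction for the same point.
forest_class = int(forest.predict(query)[0])
print(majority_class, forest_class)
```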

**XGBClassifier:**

XGBClassifier is the scikit-learn-compatible classifier from the XGBoost library, an implementation of gradient boosting designed for efficient and scalable tree boosting. Gradient boosting is a machine learning technique for regression and classification problems that builds a prediction model as an ensemble of weak prediction models, typically decision trees, added one after another so that each new model corrects the errors of the ensemble so far. XGBClassifier is widely used in Kaggle competitions and is known for its performance and speed. The performance of XGBClassifier is evaluated using metrics such as accuracy, precision, recall, and F1-score.
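
To illustrate the idea of an ensemble of weak models, here is a deliberately simplified boosting loop: one-split "stumps" are fitted to the current residuals under squared loss and added with a learning rate. This is a teaching sketch on made-up 1-D data; XGBoost itself uses a logistic loss for classification, regularization, and many optimizations not shown here.

```python
import numpy as np


def fit_stump(x, residuals):
    """Find the single threshold on 1-D x that best fits the residuals (squared error)."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left = residuals[x <= threshold]
        right = residuals[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


# Made-up data: the label mostly switches from 0 to 1 as x grows.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0])

learning_rate = 0.5
prediction = np.full_like(y, y.mean())  # start from a constant model
errors = []
for _ in range(20):
    residuals = y - prediction           # negative gradient of the squared loss
    stump = fit_stump(x, residuals)      # weak learner fitted to the current errors
    prediction = prediction + learning_rate * stump_predict(stump, x)
    errors.append(float(((y - prediction) ** 2).mean()))

print(errors[0], errors[-1])
```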

**Support Vector Machine (SVM) Classifier:**

Support Vector Machine (SVM) Classifier is a classification algorithm that is used for binary and multi-class classification. In its basic form it is linear: it finds the hyperplane that separates the classes with the largest margin. SVM can also be used for non-linear classification by using the kernel trick, which implicitly maps the data into a higher-dimensional space where a separating hyperplane can be found; scikit-learn's `svm.SVC` uses the RBF kernel by default. SVM is commonly used in fields such as image recognition, text classification, and bioinformatics. The performance of SVM is evaluated using metrics such as accuracy, precision, recall, and F1-score.
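
The effect of the kernel trick can be seen on the classic XOR pattern, where no straight line separates the two classes. The sketch below fits `svm.SVC` with a linear and an RBF kernel on the four XOR points; the `C` and `gamma` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn import svm

# XOR pattern: opposite corners share a class, so the data is not linearly separable.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

linear_model = svm.SVC(kernel="linear", C=10.0).fit(X, y)
rbf_model = svm.SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y)

linear_accuracy = linear_model.score(X, y)  # a single hyperplane gets at most 3 of 4 right
rbf_accuracy = rbf_model.score(X, y)        # the RBF kernel can separate all four points
print(linear_accuracy, rbf_accuracy)
```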

In [None]:
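#list of all classifiers that I will run for base models
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

algorithms=[LogisticRegression,KNeighborsClassifier,RandomForestClassifier,XGBClassifier,svm.SVC]

#run each model and print accuracy scores
for algorithm in algorithms:
    ml_func(algorithm)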

**Deployment of the most accurate model**

Pickle is a Python module used for serializing and de-serializing Python object structures. It is used for data storage, data transfer, and object persistence. It can also be used to store trained machine learning models in Python, allowing you to save your model to disk and reload it at a later time.

Here are the steps to save and reload a model with pickle in Python:

1. Train your machine learning model using your dataset. For example, you might train a decision tree classifier to predict the class of an object based on its features.

2. Once you have trained your model, import the pickle module and use the dump() method to save the trained model to disk.

3. To reuse the model later, open the saved file and use the load() method to restore the trained model.

For example, you might train, save, and reload a decision tree classifier as follows:

In [None]:
import pickle

# train decision tree classifier




In [None]:
# save the model to disk using pickle





In [None]:
# load the model using pickle
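
Putting the train, save, and load cells together, a minimal sketch could look like this. The toy training data and the file name `fraud_model.pkl` are assumptions; in the notebook the model would be the best performer from the comparison above.

```python
import os
import pickle
import tempfile

from sklearn.tree import DecisionTreeClassifier

# Made-up training data: class 0 near the origin, class 1 near (5, 5).
X = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]
y = [0, 0, 0, 1, 1, 1]

# train decision tree classifier
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# save the model to disk (hypothetical file name inside a temporary folder)
model_path = os.path.join(tempfile.mkdtemp(), "fraud_model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# load the model back and check that it predicts the same as the original
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

same_predictions = list(loaded_model.predict(X)) == list(model.predict(X))
print(same_predictions)
```

Pickle files can execute arbitrary code when loaded, so only unpickle model files that come from a source you trust.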



