# Online Payment Fraud Detection
Author: [Mohd Mudassir Ansari](https://www.linkedin.com/in/mudassir-ia/)
## About Data:
Dataset: ([Kaggle-Online Payment Fraud Detection Dataset](https://www.kaggle.com/datasets/rupakroy/online-payments-fraud-detection-dataset/data))

Column reference:

- step: represents a unit of time where 1 step equals 1 hour
- type: type of online transaction
- amount: the amount of the transaction
- nameOrig: customer starting the transaction
- oldbalanceOrg: balance of the originating customer before the transaction
- newbalanceOrig: balance of the originating customer after the transaction
- nameDest: recipient of the transaction
- oldbalanceDest: initial balance of recipient before the transaction
- newbalanceDest: the new balance of recipient after the transaction
- isFraud: whether the transaction is fraudulent (1) or not (0)

## Reading the Data

In [None]:
# importing important data science libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
# reading data
data = pd.read_csv('../data/online-payment-fraud.csv')
data.head()

In [None]:
# basic information about the data
data.info()

## Numeric Values

In [None]:
numeric_features = [x for x in data.columns if data[x].dtype != 'object']
numeric_features

## Correlation between Features

Checking how the numeric features relate to one another, i.e. how strongly each pair of variables is correlated

In [None]:
sns.heatmap(data[numeric_features].corr(), annot=True)
plt.show()
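
As a reference for reading the heatmap, here is a minimal sketch of what `DataFrame.corr()` returns. The toy columns below are made up for illustration and are not part of the dataset; values close to 1 or -1 indicate a strong linear relationship, values near 0 a weak one.

```python
import pandas as pd

# Toy frame (illustrative values only, not taken from the dataset)
toy = pd.DataFrame({
    'amount': [10.0, 20.0, 30.0, 40.0],
    'double_amount': [20.0, 40.0, 60.0, 80.0],  # rises exactly with amount
    'reversed': [40.0, 30.0, 20.0, 10.0],       # falls exactly as amount rises
})

# Pearson correlation between every pair of columns
toy_corr = toy.corr()
print(toy_corr.round(2))
```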

## Categorical Values

In [None]:
data['type'].unique()

## Distribution of transaction type

In [None]:
transaction_type = data.type.value_counts()
transaction = transaction_type.index
quantity = transaction_type.values

plt.pie(quantity, labels=transaction, autopct='%.1f%%', startangle=90)
plt.legend(transaction, loc='best')
plt.axis('equal')
plt.tight_layout()
plt.show()

## Converting categorical data to numerical values

In [None]:
# type of transaction: ['PAYMENT', 'TRANSFER', 'CASH_OUT', 'DEBIT', 'CASH_IN']
# alphabetically assigning all values a number from 0 to 4

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
data['type'] = encoder.fit_transform(data['type'])
data.head()
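
To check which integer each transaction type receives, the fitted encoder can be inspected through its `classes_` attribute. A minimal sketch on a toy Series: the five type names come from the comment above, while `toy_types`, `toy_encoder` and `type_mapping` are names introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column containing the five transaction types (illustrative data)
toy_types = pd.Series(['PAYMENT', 'TRANSFER', 'CASH_OUT', 'DEBIT', 'CASH_IN', 'PAYMENT'])

toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy_types)

# classes_ is sorted, so the position of each label is its integer code
type_mapping = {label: code for code, label in enumerate(toy_encoder.classes_)}
print(type_mapping)
print(toy_codes)
```

Note that the integer codes imply an ordering that the transaction types do not really have. Tree-based models are usually not sensitive to this, while for linear models one-hot encoding (for example with `pd.get_dummies`) is a common alternative.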

## Splitting Training and Test data

In [None]:
from sklearn.model_selection import train_test_split

X = data[['type', 'amount', 'oldbalanceOrg', 'newbalanceOrig']]
y = data['isFraud']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

In [None]:
print(X_train.shape, y_train.shape)

In [None]:
print(X_test.shape, y_test.shape)
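
If the fraud class turns out to be rare, a purely random split can leave the two sets with different fraud ratios. One option, shown here only as a sketch on made-up data, is the `stratify` parameter of `train_test_split`, which keeps the class proportions the same in both splits. The arrays `X_toy` and `y_toy` are hypothetical stand-ins for `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 rows with 2% positive labels (made up for illustration)
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 4))
y_toy = np.zeros(1000, dtype=int)
y_toy[:20] = 1

# stratify=y_toy preserves the class ratio in the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=42, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```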

## Standardization

Standardizing the features so that they are on a comparable scale, which generally helps models such as Logistic Regression train and predict better

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
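
The scaler is fitted on the training data only and then reused for the test data, so the test set is transformed with the training mean and standard deviation. A small sketch with made-up numbers shows that `StandardScaler` computes `(x - mean) / std` per column; `train_toy` and `test_toy` are illustrative arrays, not dataset values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test rows with two columns on very different scales
train_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test_toy = np.array([[4.0, 400.0]])

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(train_toy)  # learns mean and std from train
test_scaled = toy_scaler.transform(test_toy)        # reuses the training statistics

# Manual z-score using the training mean and population standard deviation
manual_test = (test_toy - train_toy.mean(axis=0)) / train_toy.std(axis=0)
print(np.allclose(test_scaled, manual_test))
```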

## Model Training

Since this is a binary classification problem (Fraud: 1, Not Fraud: 0), we will first try:
### Logistic Regression Model

In [None]:
from sklearn.linear_model import LogisticRegression
regressor = LogisticRegression()
regressor.fit(X_train_scaled, y_train)
y_pred = regressor.predict(X_test_scaled)

In [None]:
np.unique(y_pred)

## Accuracy of Model

The Logistic Regression model is evaluated using a confusion matrix and a classification report.  
The accuracy score is the overall fraction of predictions that are correct.

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

def accuracy_report(y_test, y_pred):
    print(f'Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}\n')
    print(f'Accuracy: {accuracy_score(y_test, y_pred)}\n')
    print(f'Classification Report:\n{classification_report(y_test, y_pred)}\n')

accuracy_report(y_test, y_pred)
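
Accuracy alone can be misleading when one class is much rarer than the other, which is why the per-class figures in the classification report matter. The sketch below uses made-up labels (not results from this dataset) to show a "model" that never predicts fraud yet still scores 99% accuracy.

```python
from sklearn.metrics import accuracy_score, recall_score

# Toy labels: 990 legitimate transactions (0) and 10 fraudulent ones (1)
y_true_toy = [0] * 990 + [1] * 10
# A trivial "model" that always predicts "not fraud"
y_pred_toy = [0] * 1000

toy_accuracy = accuracy_score(y_true_toy, y_pred_toy)
toy_recall = recall_score(y_true_toy, y_pred_toy)  # share of fraud cases detected
print(toy_accuracy, toy_recall)
```

Here the recall for the fraud class is 0, even though the accuracy looks excellent.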

In [None]:
X.head()

In [None]:
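# the model was trained on standardized features, so the sample must be scaled with the same scaler
regressor.predict(scaler.transform(pd.DataFrame([[4, 9839, 170136, 0]], columns=X.columns)))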
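
The Logistic Regression model above was fitted on standardized features, so every new row has to pass through the same scaler before prediction. One way to make this harder to forget, sketched here on synthetic data, is to bundle both steps with scikit-learn's `make_pipeline`. The data and the name `toy_pipeline` are assumptions for illustration and not part of the original workflow.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 4 unscaled features, label depends on the second one
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=1000.0, scale=500.0, size=(200, 4))
y_toy = (X_toy[:, 1] > 1000.0).astype(int)

# The pipeline scales the input and then applies the classifier
toy_pipeline = make_pipeline(StandardScaler(), LogisticRegression())
toy_pipeline.fit(X_toy, y_toy)

# Raw (unscaled) rows can be passed directly; scaling happens inside
pipeline_pred = toy_pipeline.predict(X_toy[:5])
print(pipeline_pred)
```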

The Logistic Regression results should be read with care: on a dataset this large, where fraud is presumably rare, a high accuracy can hide poor detection of the fraud class.  
Hence we will next try using:
### Decision Tree Model

In [None]:
from sklearn.tree import DecisionTreeClassifier
treeclassifier = DecisionTreeClassifier()
treeclassifier.fit(X_train, y_train)

In [None]:
# prediction
y_pred = treeclassifier.predict(X_test)
y_pred

In [None]:
# Accuracy (reusing the accuracy_report function defined earlier)
accuracy_report(y_test, y_pred)

Since the accuracy of the Decision Tree model is slightly higher than that of Logistic Regression, we will choose the Decision Tree

## Pickling

Pickling converts a model into a byte stream, which can then be stored in a file, transmitted over a network, or kept in memory

In [None]:
import pickle

# pickle.dump(scaler, open('../models/scaler.pkl', 'wb'))
# the with-block closes the file even if pickling fails
with open('../models/treeclassifier.pkl', 'wb') as f:
    pickle.dump(treeclassifier, f)
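
To use a pickled model later, it has to be loaded back. The sketch below shows the round trip in memory with a tiny stand-in classifier trained on made-up data; `toy_model` and `restored_model` are illustrative names. `pickle.dump` and `pickle.load` work the same way with a file opened in `'wb'` or `'rb'` mode.

```python
import pickle
from sklearn.tree import DecisionTreeClassifier

# Tiny stand-in model trained on made-up data, used only to show the round trip
toy_model = DecisionTreeClassifier(random_state=0)
toy_model.fit([[0, 0], [1, 1], [2, 2], [3, 3]], [0, 0, 1, 1])

# Serialize the model to bytes, then restore it from those bytes
model_bytes = pickle.dumps(toy_model)
restored_model = pickle.loads(model_bytes)

restored_pred = restored_model.predict([[0, 0], [3, 3]])
print(restored_pred)
```

Pickle files should only be loaded from trusted sources, and a model is best unpickled with the same scikit-learn version that created it.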