## FindDefault (Prediction of Credit Card fraud)

### Context
Credit card companies need to spot fraud to avoid charging customers for unauthorized transactions.

### Content
This dataset has credit card transactions from European cardholders in September 2013. It includes 492 frauds out of 284,807 transactions, with frauds making up just 0.172% of all transactions.

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report, accuracy_score
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM
import pickle

In [2]:
# Load data
data = pd.read_csv('../data/raw/creditcard.csv', sep=',')
data1 = data.sample(frac=0.1, random_state=1)
columns = data1.columns.tolist()
columns = [c for c in columns if c not in ["Class"]]
target = "Class"
X = data1[columns]
Y = data1[target]

In [3]:
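# Estimate the outlier fraction as the ratio of fraud rows to valid rows
# (close to the fraud share of all rows, because fraud is so rare)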
fraud = data1[data1['Class'] == 1]
valid = data1[data1['Class'] == 0]
outlier_fraction = len(fraud) / float(len(valid))
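
The cell above divides the fraud count by the number of valid rows, whereas scikit-learn documents `contamination` as the proportion of outliers in the whole data set. The following sketch compares the two definitions using the class counts that appear in the classification reports further down (49 fraud, 28,432 valid); the variable names are illustrative and not part of the original notebook.

```python
# Class counts taken from the classification reports for the 10% sample
n_fraud = 49
n_valid = 28_432

# Ratio of fraud rows to valid rows (what the cell above computes)
ratio_to_valid = n_fraud / n_valid

# Share of fraud rows among all rows (the documented meaning of "contamination")
share_of_total = n_fraud / (n_fraud + n_valid)

print(f"fraud / valid: {ratio_to_valid:.6f}")
print(f"fraud / total: {share_of_total:.6f}")
print(f"relative gap:  {(ratio_to_valid - share_of_total) / share_of_total:.4%}")
```

With fraud this rare the two values differ only slightly, so the choice is unlikely to change the results much, but dividing by the total row count matches the parameter's definition.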

In [4]:
# Define classifiers
classifiers = {
    "Isolation Forest": IsolationForest(n_estimators=100, max_samples=len(X), contamination=outlier_fraction, random_state=42),
    "Local Outlier Factor": LocalOutlierFactor(n_neighbors=20, contamination=outlier_fraction),
    "Support Vector Machine": OneClassSVM(kernel='rbf', gamma=0.1, nu=0.05)
}

## Model Prediction

We will use three anomaly-detection algorithms on this dataset: Isolation Forest, Local Outlier Factor (LOF), and a One-Class SVM. The first two are described below.

### Isolation Forest Algorithm
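Isolation Forest detects anomalies by exploiting the fact that they are few and different from the rest of the data. It builds an ensemble of isolation trees that repeatedly split the data on randomly chosen features and thresholds. Anomalies tend to be isolated after fewer splits than normal points, so a short average path length marks a point as anomalous. The method is efficient in both memory and processing time.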
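
As a rough illustration of the isolation idea, the sketch below fits an `IsolationForest` on synthetic two-dimensional data rather than the credit card dataset. The data, the `contamination=0.01` setting, and all variable names are assumptions chosen for the demo.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)

# Synthetic data (assumption): a tight cluster plus a few far-away points
normal_points = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
anomalous_points = rng.uniform(low=6.0, high=8.0, size=(5, 2))
X_demo = np.vstack([normal_points, anomalous_points])

forest = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
forest.fit(X_demo)

# score_samples: lower values mean the point is easier to isolate,
# i.e. it has a shorter average path length and looks more anomalous
scores = forest.score_samples(X_demo)
print("mean score, cluster points: ", scores[:500].mean())
print("mean score, far-away points:", scores[500:].mean())

# predict returns +1 for inliers and -1 for outliers
labels = forest.predict(X_demo)
print("points flagged as outliers:", int((labels == -1).sum()))
```

The far-away points should receive clearly lower scores than the cluster, which is the behavior the notebook relies on when it treats `-1` predictions as fraud.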

### Local Outlier Factor (LOF) Algorithm
LOF detects outliers by comparing the local density of a data point with the local densities of its nearest neighbors. Points whose density is significantly lower than that of their neighbors are considered outliers. This notebook uses 20 neighbors (`n_neighbors=20`) for the comparison, which is a common choice.
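
The density comparison can be sketched on a small synthetic example: one dense cluster and a single isolated point. The data and variable names are assumptions for illustration only, and the example leaves `contamination` at scikit-learn's default instead of the notebook's `outlier_fraction`.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)

# Synthetic data (assumption): a dense cluster and one point far away from it
dense_cluster = rng.normal(loc=0.0, scale=0.5, size=(200, 2))
isolated_point = np.array([[5.0, 5.0]])
X_demo = np.vstack([dense_cluster, isolated_point])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X_demo)  # +1 for inliers, -1 for outliers

# negative_outlier_factor_ is the negated LOF score:
# values near -1 are inliers, much lower values are outliers
lof_scores = lof.negative_outlier_factor_
print("score of isolated point:", lof_scores[-1])
print("median score in cluster:", np.median(lof_scores[:-1]))
print("label of isolated point:", labels[-1])
```

The isolated point sits in a region far sparser than its 20 nearest neighbors, so its score falls well below the cluster's and it is labeled `-1`.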

In [5]:
# Train and evaluate models
for clf_name, clf in classifiers.items():
    if clf_name == "Local Outlier Factor":
        y_pred = clf.fit_predict(X)
        scores_prediction = clf.negative_outlier_factor_
    elif clf_name == "Support Vector Machine":
        clf.fit(X)
        y_pred = clf.predict(X)
    else:
        clf.fit(X)
        scores_prediction = clf.decision_function(X)
        y_pred = clf.predict(X)
    
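    # Map scikit-learn's labels (1 = inlier, -1 = outlier)
    # to the dataset's labels (0 = valid, 1 = fraud)
    y_pred[y_pred == 1] = 0
    y_pred[y_pred == -1] = 1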
    
    n_errors = (y_pred != Y).sum()
    print(f"{clf_name}: {n_errors}")
    print("Accuracy Score:")
    print(accuracy_score(Y, y_pred))
    print("Classification Report:")
    print(classification_report(Y, y_pred))
    
    # Save the model
    with open(f'../models/{clf_name.lower().replace(" ", "_")}.pkl', 'wb') as model_file:
        pickle.dump(clf, model_file)



Isolation Forest: 73
Accuracy Score:
0.9974368877497279
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     28432
           1       0.26      0.27      0.26        49

    accuracy                           1.00     28481
   macro avg       0.63      0.63      0.63     28481
weighted avg       1.00      1.00      1.00     28481

Local Outlier Factor: 97
Accuracy Score:
0.9965942207085425
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     28432
           1       0.02      0.02      0.02        49

    accuracy                           1.00     28481
   macro avg       0.51      0.51      0.51     28481
weighted avg       1.00      1.00      1.00     28481

Support Vector Machine: 8515
Accuracy Score:
0.7010287560127805
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.70      0.8

#### Observations:
- Isolation Forest misclassified 73 transactions, LOF misclassified 97, and SVM misclassified 8515.
- Isolation Forest is 99.74% accurate, better than LOF at 99.66% and SVM at 70.10%.
- Isolation Forest detects about 27% of fraud cases, much better than LOF's 2% and SVM's 0%.
- Overall, Isolation Forest is the best of the three at identifying fraud, although its precision and recall on the fraud class are both below 30%.
- These results might be improved by training on a larger sample or by using deep learning models, though at a higher computational cost.
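
Because fraud is so rare, the accuracy figures above need a baseline to be meaningful. The sketch below scores a hypothetical model that labels every transaction as valid, using the class counts from the classification reports (28,432 valid, 49 fraud). The baseline model and the variable names are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Class counts from the classification reports above
n_valid, n_fraud = 28_432, 49
y_true = np.concatenate([np.zeros(n_valid, dtype=int), np.ones(n_fraud, dtype=int)])

# Hypothetical baseline: predict "valid" (0) for every transaction
y_all_valid = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_all_valid)
baseline_recall = recall_score(y_true, y_all_valid)  # recall on the fraud class

print(f"baseline accuracy:     {baseline_accuracy:.4%}")
print(f"baseline fraud recall: {baseline_recall:.0%}")
```

This baseline reaches a higher accuracy than any of the three models while catching no fraud at all, which suggests that precision and recall on the fraud class are the more informative numbers for comparing them.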