#### Paysim dataset description:

PaySim is a synthetic dataset. It records transactions from an agent-based simulation of a financial system, augmented with anonymized real financial data.

Because financial transactions are private, few datasets of this kind are publicly available.
 
More info at https://www.kaggle.com/ntnu-testimon/paysim1

#### Your plan for this experiment could look like this:

    0) Install all dependencies, configure the environment, etc.
    1) Look at the data
    2) What is our goal? Classification/regression/outlier detection/...
    3) Decide what's important and what's not. What could help us classify/predict/... better?
    4) Choose the columns
    5) Define the preprocessing for these columns (based on their type)
    6) Define how we will measure the performance of the future model (evaluation metrics)
    7) Split the data into train/test subsets (+ a validation set if we want to use something like EarlyStopping)
    8) Apply the preprocessing
    9) Build a simple model (using the train subset)
    10) Save the model via Pickle to speed up the research
    11) Test the model on unseen data (test subset)

    If the model's performance is not good enough, go back to steps 1/2/3/...
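Step 10 of the plan can be sketched as follows. This is a minimal illustration, not the notebook's actual model: the toy arrays, the file name `simple_model.pkl` and the choice of `LogisticRegression` are assumptions made only for the example.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the preprocessed training subset (hypothetical).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

model = LogisticRegression().fit(X_toy, y_toy)

# Save the fitted model to disk (the file name is an arbitrary choice).
model_path = Path(tempfile.gettempdir()) / "simple_model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load it back later instead of retraining.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

# The restored model should give the same predictions as the original one.
same_predictions = bool((restored.predict(X_toy) == model.predict(X_toy)).all())
print(same_predictions)
```

Only unpickle files that you created yourself or that come from a trusted source, since loading a pickle can execute arbitrary code.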

#### Let's start with some common imports:

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
# Import the dataset from an external CSV (comma-separated values) file
dataset = pd.read_csv('data/PS_20174392719_1491204439457_log.csv')

### Analysis 

#### What's inside our dataset?

In [3]:
dataset.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


In [4]:
nb_classes = dataset['isFraud'].value_counts()
print("total nb samples in dataset : ", nb_classes.sum())

total nb samples in dataset :  6362620


#### Ratio of "fraudulent" transactions

In [5]:
nb_classes.min()/nb_classes.sum()

0.001290820448180152

#### Ratio of "regular" transactions

In [6]:
nb_classes.max()/nb_classes.sum()

0.9987091795518198

This is a common problem for such datasets: **imbalanced classes**.

Only about 0.13% of the samples belong to the "fraud" class. Most Machine Learning algorithms struggle to extract meaningful information from so few samples when the remaining 99.87% belong to the dominant class.

#### Let's check the distribution of "fraud" by transaction type

In this example, only CASH_OUT and TRANSFER transactions can be fraudulent. This could be useful later.
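One possible way to check this is a `groupby` on the `type` column. The sketch below runs on a small made-up frame that only reuses the column names of the dataset; on the real data, replace `toy` with `dataset`.

```python
import pandas as pd

# Toy stand-in for `dataset` with the same column names (values are made up).
toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "CASH_OUT", "PAYMENT", "TRANSFER", "DEBIT"],
    "isFraud": [0, 1, 1, 0, 0, 0, 0],
})

# Per transaction type: number of frauds, fraud rate and number of samples.
fraud_by_type = toy.groupby("type")["isFraud"].agg(["sum", "mean", "count"])
print(fraud_by_type)

# Transaction types that contain at least one fraudulent sample.
fraud_types = sorted(fraud_by_type.index[fraud_by_type["sum"] > 0])
print(fraud_types)
```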

## Data preparation

#### Data preparation : main ideas

In [None]:
# Understand column type - if it is categorical/numerical/part of an image/etc ...

# In general, categorical columns are transformed to numerical via some trick like One-Hot encoding
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html

# Numerical columns, on the other hand, are often standardized to have zero mean and unit variance (avg(x)=0, std(x)=1)
# https://stackoverflow.com/questions/26414913
# normalized_data=(data-data.mean())/data.std()

# All this is explained in great detail here : 
# https://scikit-learn.org/stable/modules/preprocessing.html
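The two ideas above can be sketched with pandas alone. The frame below is a made-up sample that only reuses the `type` and `amount` column names; which columns to encode or standardize on the real data is a choice left to the reader.

```python
import pandas as pd

# Toy sample with one categorical and one numerical column (values are illustrative).
toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "amount": [9839.64, 181.0, 181.0, 11668.14],
})

# One-hot encode the categorical column: one indicator column per category.
encoded = pd.get_dummies(toy, columns=["type"])

# Standardize the numerical column to zero mean and unit variance.
amount_mean = encoded["amount"].mean()
amount_std = encoded["amount"].std()
encoded["amount"] = (encoded["amount"] - amount_mean) / amount_std

print(encoded)
```

In a real experiment the mean and standard deviation should be computed on the training subset only and then reused for the test subset, otherwise information from the test data leaks into the preprocessing.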

In [None]:
# We could remove several columns. Columns such as IDs or free-text descriptions can often be omitted.
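As a small illustration, the account identifiers `nameOrig` and `nameDest` are candidates for removal. Dropping exactly these two columns is an assumption made for the example, not a recommendation taken from the dataset documentation.

```python
import pandas as pd

# Two made-up rows using the dataset's column names.
toy = pd.DataFrame({
    "step": [1, 1],
    "type": ["PAYMENT", "TRANSFER"],
    "amount": [9839.64, 181.0],
    "nameOrig": ["C1231006815", "C1305486145"],
    "nameDest": ["M1979787155", "C553264065"],
    "isFraud": [0, 1],
})

# Identifier columns are almost unique per row, so a simple model can learn little from them.
id_columns = ["nameOrig", "nameDest"]
features = toy.drop(columns=id_columns)
print(list(features.columns))
```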

#### Train/test data split

In [7]:
from sklearn.model_selection import train_test_split

In [None]:
# split the data: often it's 80% for training and 20% for evaluation (test)
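A minimal sketch of an 80/20 split on toy arrays. The `stratify=y` argument is an addition worth considering for imbalanced data: it keeps the share of the rare class the same in both subsets. The array sizes and `random_state` are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and labels: 1000 samples, 2% of them in the positive class.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.zeros(1000, dtype=int)
y[:20] = 1

# 80% train / 20% test, preserving the class ratio in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```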

## Model training

In [None]:
# There are a lot of models available in the sklearn Python library
# Linear Regression, Decision Tree, Random Forest, ...
# https://sklearn.org/user_guide.html

In [8]:
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

In [None]:
# the most important part -> let's train the model!
# model = model.fit(train_data)
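A hedged sketch of what training could look like with the two estimators imported above. The data comes from `make_classification`, a toy generator used here in place of the preprocessed PaySim training subset, and the hyperparameters are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

# Toy imbalanced data: roughly 95% of the samples belong to class 0.
X_train, y_train = make_classification(
    n_samples=2000, n_features=5, weights=[0.95], random_state=0
)

# Supervised model: needs both the features and the labels.
model = LogisticRegression(max_iter=1000)
model = model.fit(X_train, y_train)

# Unsupervised anomaly detector: fitted on the features only.
iso = IsolationForest(random_state=0).fit(X_train)

# IsolationForest.predict returns +1 for inliers and -1 for outliers.
iso_labels = iso.predict(X_train[:10])
print(iso_labels)
```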

In [None]:
# After that, the model can be used to make predictions.
# Important: do not use the same data for training and evaluation.
# Such a data leak will corrupt the results of your experiment.

In [None]:
# the second most important part will look like this :
# test_prediction_result = model.predict(test_data)

#### Model evaluation 

In [9]:
from sklearn.metrics import confusion_matrix

In [None]:
# after that, test_prediction_result is compared with the actual labels of the test subset
# (sklearn expects the true labels first, then the predictions)
# cm = confusion_matrix(test_real_labels, test_prediction_result)
# print(cm)
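The evaluation step can be sketched end to end on toy data. `make_classification` and `LogisticRegression` are stand-ins chosen for the example. Note the argument order: `confusion_matrix` takes the true labels first and the predictions second.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Toy imbalanced data in place of the real dataset.
X, y = make_classification(n_samples=2000, n_features=5, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
test_prediction_result = model.predict(X_test)

# Rows are the true classes, columns are the predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_test, test_prediction_result, labels=[0, 1])
print(cm)
```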

In [None]:
# If it's performing poorly, what could we do about that?
# We could try something from this list:
# feature engineering, another algorithm, bigger/smaller model, more data, another preprocessing, ...
# If the data is unbalanced:
# class_weights with a supervised algorithm, undersampling, synthetic samples, ...
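Two of the options for unbalanced data, sketched on toy data: class weights and random undersampling of the majority class. Both are illustrations under assumed settings (`class_weight="balanced"`, a 1:1 ratio after undersampling), not tuned choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy imbalanced training data: roughly 98% of the samples belong to class 0.
X, y = make_classification(n_samples=5000, n_features=5, weights=[0.98], random_state=0)

# Option 1: class weights. Errors on the rare class are penalized more heavily.
weighted_model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Option 2: random undersampling. Keep all minority samples and an equally
# sized random subset of the majority samples.
rng = np.random.default_rng(0)
minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
balanced_idx = np.concatenate([minority_idx, kept_majority])

X_balanced, y_balanced = X[balanced_idx], y[balanced_idx]
undersampled_model = LogisticRegression(max_iter=1000).fit(X_balanced, y_balanced)

print(y.mean(), y_balanced.mean())
```

Resampling should be applied to the training subset only, so that the test subset keeps the original class ratio.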

#### Metrics which could be used in the context of anomaly detection:

In [None]:
# AUC_PR, AUC_ROC, recall, precision, ...
# but not accuracy alone
# (you could reach 99.9% accuracy just by predicting ALL of the data samples as "normal")
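The point about accuracy can be checked on toy data. In the sketch below, `average_precision_score` is used as one common summary of the precision-recall curve (an assumption about what AUC_PR means here), and the "all normal" baseline is built by hand.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Toy imbalanced data in place of the real dataset.
X, y = make_classification(n_samples=2000, n_features=5, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
preds = model.predict(X_test)
scores = model.predict_proba(X_test)[:, 1]  # probability of the positive class

metrics = {
    "accuracy": accuracy_score(y_test, preds),
    "precision": precision_score(y_test, preds, zero_division=0),
    "recall": recall_score(y_test, preds, zero_division=0),
    "auc_roc": roc_auc_score(y_test, scores),
    "auc_pr": average_precision_score(y_test, scores),
}
print(metrics)

# Baseline that labels every sample as "normal": high accuracy, zero recall.
all_normal = np.zeros_like(y_test)
baseline_accuracy = accuracy_score(y_test, all_normal)
baseline_recall = recall_score(y_test, all_normal, zero_division=0)
print(baseline_accuracy, baseline_recall)
```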

## Additional information

### If you want to try real-world data with real fraud / money laundering : 

https://www.elliptic.co/blog/elliptic-dataset-cryptocurrency-financial-crime

### More advanced techniques for those who are interested:

In [None]:
# Neural net model + focal loss : 
# https://www.dlology.com/blog/multi-class-classification-with-focal-loss-for-imbalanced-datasets/
# https://towardsdatascience.com/lightgbm-with-the-focal-loss-for-imbalanced-datasets-9836a9ae00ca

In [None]:
# A particular type of neural net, autoencoder
# https://keras.io/examples/timeseries/timeseries_anomaly_detection/

In [None]:
# to better represent categorical features there are alternatives to one-hot encoding
# Embeddings, target encoding, ...
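A minimal hand-written sketch of target encoding with pandas: each category is replaced by the mean of the target observed for it in the training data. The frames and the fallback to the global rate for unseen categories are assumptions made for the example. Dedicated implementations usually add smoothing or cross-fitting, which is omitted here.

```python
import pandas as pd

# Made-up training and test samples using the dataset's column names.
train = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "TRANSFER", "CASH_OUT",
             "CASH_OUT", "PAYMENT", "PAYMENT", "PAYMENT"],
    "isFraud": [0, 1, 0, 1, 1, 0, 0, 0],
})
test = pd.DataFrame({"type": ["TRANSFER", "DEBIT"]})

# Fraud rate per category, computed on the training data only.
global_rate = train["isFraud"].mean()
type_rate = train.groupby("type")["isFraud"].mean()

# Replace each category by its fraud rate; unseen categories get the global rate.
train["type_encoded"] = train["type"].map(type_rate)
test["type_encoded"] = test["type"].map(type_rate).fillna(global_rate)

print(test)
```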

In [None]:
# aggregate several models into one : 
# model blending, model stacking, ...
# https://www.kaggle.com/anuragbantu/stacking-ensemble-learning-beginner-s-guide
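Stacking can be sketched with scikit-learn's `StackingClassifier`: the base models' cross-validated predictions become the input features of a final estimator. The choice of base models, `cv=3` and the toy data are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy imbalanced data in place of the real dataset.
X, y = make_classification(n_samples=1000, n_features=5, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Two base models whose predictions feed a logistic regression on top.
stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
        ("forest", RandomForestClassifier(n_estimators=20, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)
stack.fit(X_train, y_train)

stacked_predictions = stack.predict(X_test)
print(stacked_predictions[:10])
```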

In [None]:
# Can be useful: a comparison of different anomaly detection models
# https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_anomaly_comparison.html