# Fraud Detection
## Ubisoft home assignment
Arnaud Baumann

#### Some statistics about the dataset

We first look at some summary statistics of the dataset.

In [None]:
import numpy as np
import pandas as pd
from collections import Counter
import sys
pd.options.mode.chained_assignment = None

from sklearn.tree import DecisionTreeClassifier
import seaborn as sns; sns.set(style="ticks", color_codes=True)
from assembleAdaboost import AssembleAdaBoost
import pickle
from utils import *
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Read the CSV file into a pandas DataFrame
df = pd.read_csv("mle_fraud_test.csv", sep=";")
df.describe()

I kept most of the features unchanged. I dropped the `order_id` column because every order has a unique ID.
Of the remaining features, I transformed only `user_id` and `order_created_datetime`, because
their cardinality was too high:
* Each `user_id` is replaced by the number of times it appears in the train set (`user_id_count`).
* `order_created_datetime` is reduced to the day of the month (`day_number`).
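
The two transformations above can be sketched as follows. This is a minimal illustration, not the notebook's actual `preprocess` helper (which lives in `utils` and is not shown here). The function name `encode_row`, the toy rows, the `YYYY-MM-DD HH:MM:SS` timestamp format and the count of 0 for unseen users are all assumptions.

```python
from collections import Counter
from datetime import datetime

# Hypothetical toy rows: (user_id, order_created_datetime)
train_rows = [
    ("u1", "2019-03-05 10:15:00"),
    ("u2", "2019-03-17 08:00:00"),
    ("u1", "2019-03-28 23:59:59"),
]

# Count how often each user_id appears in the train set
user_id_counter = Counter(user_id for user_id, _ in train_rows)

def encode_row(user_id, created, counter):
    """Return (user_id_count, day_number) for one order."""
    # Assumption: users never seen in the train set get a count of 0
    user_id_count = counter.get(user_id, 0)
    # Assumed timestamp format; keep only the day of the month
    day_number = datetime.strptime(created, "%Y-%m-%d %H:%M:%S").day
    return user_id_count, day_number

encoded = [encode_row(u, d, user_id_counter) for u, d in train_rows]
print(encoded)  # [(2, 5), (1, 17), (2, 28)]
```

Building the counter from the train set only and reusing it for the test set keeps information from the test set out of the features.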

Let's visualize the data distributions with a pairplot. We sample `n_samples = 300` points from each class and
plot a subset of the features per class using seaborn's `pairplot`.


In [None]:
# Sample 300 points from each class
n_samples = 300
legit_df = df[df["transaction_status"]=="LEGIT"].sample(n=n_samples)
fraud_df = df[df["transaction_status"]=="FRAUD"].sample(n=n_samples)
blocked_df = df[df["transaction_status"]=="BLOCKED"].sample(n=n_samples)
X_legit = get_df_features(legit_df)
X_fraud = get_df_features(fraud_df)
X_blocked = get_df_features(blocked_df)

data = np.concatenate([X_legit, X_fraud, X_blocked])
# Count how often each user_id appears in the sample
user_id_counter = dict(Counter(data[:, 0]))

# Preprocess the data
preprocess(data, user_id_counter)
labs = np.array([['LEGIT'] for i in range(n_samples)]+
                [['FRAUD'] for i in range(n_samples)]+
                [['BLOCKED'] for i in range(n_samples)])

data = np.append(data, labs, axis=1)

data_df = pd.DataFrame(data=data)
data_df.columns = ["user_id_count", "day_number", "amount", "total_amount_14days", "email_handle_length", \
         "email_handle_dst_char", "total_nb_orders_player", "player_seniority", \
         "total_nb_play_sessions", "geographic_distance_risk", "label"]
g = sns.pairplot(data_df[['amount','day_number', 'total_amount_14days', 'email_handle_dst_char', 'geographic_distance_risk','label']], hue='label')

### Dataset imbalance

Datasets for fraud detection are usually highly imbalanced, as the vast majority of samples are valid
(LEGIT in our case). Here we have an additional third class, 'BLOCKED', meaning that the existing fraud
management tool stopped the transaction, so we have no final label for it. In this notebook, variables
whose names contain 'blocked' refer to data with the 'BLOCKED' label.

In [None]:
num_legits = len(df[df["transaction_status"]=="LEGIT"])
num_blocked = len(df[df["transaction_status"]=="BLOCKED"])
num_frauds = len(df[df["transaction_status"]=="FRAUD"])
num_payments = len(df["transaction_status"])
# The {:.4%} format multiplies the ratio by 100 and appends the percent sign
print("Number of legit payments: {} out of {} ({:.4%})".format(num_legits, num_payments, num_legits/num_payments))
print("Number of blocked payments: {} out of {} ({:.4%})".format(num_blocked, num_payments, num_blocked/num_payments))
print("Number of fraud payments: {} out of {} ({:.4%})".format(num_frauds, num_payments, num_frauds/num_payments))
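
One practical consequence of this imbalance is that plain accuracy can look good for a useless model, which is one reason to inspect a confusion matrix rather than accuracy alone. The sketch below uses made-up class counts (not the counts of this dataset) and a hypothetical baseline that always predicts LEGIT.

```python
# Made-up labels for illustration only: 990 legit, 10 fraud
y_true = ["LEGIT"] * 990 + ["FRAUD"] * 10

# Hypothetical baseline that never flags a fraud
y_pred = ["LEGIT"] * len(y_true)

# Accuracy: fraction of labels predicted correctly
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Fraud recall: fraction of actual frauds that were caught
n_fraud = sum(t == "FRAUD" for t in y_true)
caught = sum(t == "FRAUD" and p == "FRAUD" for t, p in zip(y_true, y_pred))
fraud_recall = caught / n_fraud

print("accuracy: {:.2%}, fraud recall: {:.2%}".format(accuracy, fraud_recall))
```

The baseline is right on 99% of these toy samples yet catches no fraud at all, so per-class metrics are the more informative view.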


# ASSEMBLE.Adaboost algorithm


The ASSEMBLE.AdaBoost algorithm is implemented in the file `assembleAdaBoost.py` as a class named `AssembleAdaBoost`
<b>(Question 1)</b>.

### Answer to question 2:

The code cell below trains the ASSEMBLE.AdaBoost algorithm and
outputs the resulting confusion matrix on the test set. I split the labeled data into 75%
for training and 25% for testing.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from collections import Counter
from time import time, strftime

# Read the dataframe and get labeled and unlabeled data
df = pd.read_csv("mle_fraud_test.csv", sep=";")
not_blocked_df = df[df["transaction_status"]!="BLOCKED"]
blocked_df = df[df["transaction_status"]=="BLOCKED"]


y_labeled = get_df_labels(not_blocked_df)
X_train_blocked = get_df_features(blocked_df)
X_labeled = get_df_features(not_blocked_df)

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X_labeled, y_labeled, \
                                                    stratify=y_labeled, \
                                                    test_size=0.25, \
                                                    shuffle=True)

# Create the user_id counter from the train data (labeled and blocked)
user_id_counter = dict(Counter(np.concatenate([X_train[:,0], X_train_blocked[:,0]])))
# Save the dictionary, as we are going to reuse it in the app
with open('user_id_counter.pkl', 'wb') as file:
    pickle.dump(user_id_counter, file)

# Transform user_id and order_created_datetime features
print('Preprocessing data...')
preprocess(X_train, user_id_counter)
preprocess(X_test, user_id_counter)
preprocess(X_train_blocked, user_id_counter)
# Initialize pseudo-labels for blocked data
y_train_blocked = get_initial_blocked_labels(X_train, y_train, X_train_blocked)

print("Starting model fitting...")
start = time()
clf = AssembleAdaBoost(n_estimators=50, sample=False)
clf.fit(X_train.astype(float), X_train_blocked.astype(float),y_train, y_train_blocked)
elapsed = time()-start
print("Done training, took {:.2f}s. doing prediction...\n".format(elapsed))
y_pred = clf.predict(X_test)


# Save the fitted model with pickle
with open('assemble_adaboost_model.pkl', 'wb') as file:
    pickle.dump(clf, file)

# Print metrics and the confusion matrix
print_metrics(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
plot_confusion_matrix(cm, ["FRAUD", "LEGIT"], normalize=False)

### Assemble Adaboost benchmark

To check whether the algorithm performs well, we compare it to a closely related model, AdaBoost. To do so, we use sklearn's
`AdaBoostClassifier` class and run stratified 4-fold cross-validation on the dataset for both classifiers.

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import AdaBoostClassifier

from sklearn.metrics import confusion_matrix
df = pd.read_csv("mle_fraud_test.csv", sep=";")

not_blocked_df = df[df["transaction_status"]!="BLOCKED"]
blocked_df = df[df["transaction_status"]=="BLOCKED"]

y_labeled = get_df_labels(not_blocked_df)
X_labeled = get_df_features(not_blocked_df)

skf = StratifiedKFold(n_splits=4, shuffle=True)
skf.get_n_splits(X_labeled, y_labeled)
print(skf)

for idx, (train_index, test_index) in enumerate(skf.split(X_labeled, y_labeled)):
    print("\ncurrent k-fold:", idx+1)

    X_train_labeled, X_test = X_labeled[train_index], X_labeled[test_index]
    y_train_labeled, y_test = y_labeled[train_index], y_labeled[test_index]


    user_id_counter = dict(Counter(X_train_labeled[:,0]))

    preprocess(X_train_labeled, user_id_counter)
    preprocess(X_test, user_id_counter)
    X_train_blocked = get_df_features(blocked_df)
    preprocess(X_train_blocked, user_id_counter)

    y_train_blocked = get_initial_blocked_labels(X_train_labeled, y_train_labeled, X_train_blocked)

    clf = AssembleAdaBoost(n_estimators=50, sample=False)

    clf.fit(X_train_labeled.astype(float), X_train_blocked.astype(float),y_train_labeled, y_train_blocked)
    y_pred = clf.predict(X_test)
    print("\nASSEMBLE Adaboost classification metrics:")
    print_metrics(y_test, y_pred)

    clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1, max_leaf_nodes=2),algorithm='SAMME')
    clf.fit(X_train_labeled, y_train_labeled)
    y_pred = clf.predict(X_test)
    print("\nAdaboost classification metrics:")
    print_metrics(y_test, y_pred)




We can see that the ASSEMBLE.AdaBoost classifier performs similarly to, or sometimes slightly worse than,
AdaBoost. One possible reason is that blocked samples are distributed very similarly
to legit samples (see the pairplot) and therefore contribute little to improving fraud classification.

### Answer to question 3

Taking the optimal decision here means maximizing the expected profit of the transaction given the transaction amount, the fraud
fee and the probability `p` that the transaction is a fraud. We can calculate the potential profit with
the formula:

 `potential_profit = amount*(1-p) - p*fraud_fee`

If this potential profit is negative, we block the transaction, as we expect to lose money on it.
Otherwise, we allow it.

Below is a plot showing how this function behaves for several amounts, a fixed fraud fee of 15€ and a varying
probability of fraud.
* When the amount increases, we can tolerate a higher probability of fraud.
* At an amount of 15€ and p=0.5 the potential profit is 0, which makes sense: the expected gain and the expected loss are both 7.5€.
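
Setting the potential profit to zero gives the break-even fraud probability `p* = amount / (amount + fraud_fee)`, so a transaction is worth accepting whenever `p <= p*`. The sketch below evaluates this threshold for a few example amounts. The helper name `break_even_probability` is introduced here for illustration and is not part of the notebook, and it assumes `amount + fraud_fee > 0`.

```python
def break_even_probability(amount, fraud_fee):
    """Fraud probability at which amount*(1-p) - p*fraud_fee equals 0."""
    return amount / (amount + fraud_fee)

fraud_fee = 15.0
thresholds = {a: break_even_probability(a, fraud_fee) for a in [5, 10, 15, 25, 50]}
for amount, p_star in thresholds.items():
    print("amount: {:>2}  accept while p <= {:.3f}".format(amount, p_star))
```

The threshold grows with the amount (0.25 at 5€, 0.5 at 15€), which matches both observations in the list above.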

The code cell after the plot implements the associated decision method.

In [None]:
ax = plt.subplot(111)
fraud_fee = 15.0
amounts = [5, 10, 15, 25, 50]
for amount in amounts:
    p = np.arange(0.0, 1.0, 0.01)
    potential_profit = amount*(1-p) - p*fraud_fee
    line, = plt.plot(p, potential_profit, lw=2,label="amount: {}".format(amount))
plt.grid(True)
plt.legend()
ax.set_ylabel("Potential profit")
ax.set_xlabel("Probability of being fraud")
plt.ylim(-20, 20)
plt.show()


In [None]:
# Optimal decision method
def optimal_decision(amount, fraud_fee, p):
    """Computes the optimal decision between blocking and accepting the transaction

    Parameters
    ----------
    amount : amount of the transaction
    fraud_fee : fraud fee
    p : probability that the transaction is a fraud

    Returns
    -------
    decision : str
        'BLOCK' if the transaction must be blocked
        'ACCEPT' if the transaction must not be blocked
    """
    potential_profit = amount*(1-p) - p*fraud_fee
    if potential_profit < 0:
        return 'BLOCK'
    else:
        return 'ACCEPT'
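
As a usage sketch, the same rule can be applied to a batch of transactions. The function is restated in compact form so the snippet runs on its own. The amounts and probabilities are invented inputs for illustration, not outputs of the trained model.

```python
def optimal_decision(amount, fraud_fee, p):
    """Return 'BLOCK' if the potential profit is negative, else 'ACCEPT'."""
    potential_profit = amount * (1 - p) - p * fraud_fee
    return 'BLOCK' if potential_profit < 0 else 'ACCEPT'

fraud_fee = 15.0
# Invented (amount, probability of fraud) pairs for illustration
transactions = [(5, 0.10), (5, 0.30), (50, 0.30), (50, 0.80)]
decisions = [optimal_decision(a, fraud_fee, p) for a, p in transactions]
print(decisions)  # ['ACCEPT', 'BLOCK', 'ACCEPT', 'BLOCK']
```

With a 15€ fee, a 5€ order is already blocked at p=0.3, while a 50€ order is accepted at the same probability and only blocked once p exceeds roughly 0.77.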
