# Isolation Forest

We now try a completely different approach: we treat the _True_ samples as outliers/novelties and try to detect them with an unsupervised ensemble method, the Isolation Forest.

In [19]:
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import recall_score, f1_score, balanced_accuracy_score
from preprocessing import load_dataset, split_train_test_validation, corresponding_features_interaction
from utilities import print_log, print_full

First, we load the dataset and split it into training, test, and validation sets.

In [20]:
test_size = 0.1        # fraction of the dataset to use as the test set
val_size = 0.2         # fraction of the dataset to use as the validation set
stratify = True

# first, we load the dataset
X, y = load_dataset('./data/data.pkl')
# then, we split it
X_tr, y_tr, X_te, y_te, X_val, y_val = split_train_test_validation(X, y, test=test_size, val=val_size, stratify=stratify)

# generate the dataset versions with interactions, with and without drop
X_tr_drop = corresponding_features_interaction(X_tr, drop=True)
X_tr_int = corresponding_features_interaction(X_tr, drop=False)

X_val_drop = corresponding_features_interaction(X_val, drop=True)
X_val_int = corresponding_features_interaction(X_val, drop=False)
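
`split_train_test_validation` is a project helper whose implementation is not shown here. As a rough sketch only, a stratified three-way split with the same return order could be built on scikit-learn's `train_test_split` as follows. The function name `three_way_split`, the `seed` argument, and the synthetic data are assumptions made for illustration and may differ from what the helper actually does.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def three_way_split(X, y, test=0.1, val=0.2, stratify=True, seed=0):
    """Hypothetical stand-in for a train/test/validation split helper."""
    # first carve out the test set
    X_rest, X_te, y_rest, y_te = train_test_split(
        X, y, test_size=test, stratify=y if stratify else None, random_state=seed
    )
    # `val` is a fraction of the full dataset, so rescale it to the remainder
    val_rel = val / (1.0 - test)
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_rest, y_rest, test_size=val_rel,
        stratify=y_rest if stratify else None, random_state=seed
    )
    return X_tr, y_tr, X_te, y_te, X_val, y_val


# synthetic, imbalanced demo data (10% positives)
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.array([0] * 900 + [1] * 100)
X_tr_d, y_tr_d, X_te_d, y_te_d, X_val_d, y_val_d = three_way_split(X_demo, y_demo)
print(len(X_tr_d), len(X_te_d), len(X_val_d))
print(y_tr_d.mean(), y_te_d.mean(), y_val_d.mean())
```

With stratification, each split keeps roughly the same share of positive samples as the full dataset, which matters here because the positive class is rare.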

Then, we create the Isolation Forest model.

In [21]:
# we use our a priori knowledge of the dataset to tell the model the expected fraction of outliers
# (scikit-learn requires a float contamination to lie in the range (0, 0.5])
match_ratio = np.sum(y)/len(y)
# match_ratio = 0.5

if_model = IsolationForest(contamination=match_ratio)
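
To illustrate what `contamination` does, here is a minimal sketch on synthetic data. The data, the seed, and all `demo_*` names are assumptions made for illustration and are unrelated to our dataset. The sketch shows that the fraction of training points flagged as outliers is approximately equal to the contamination value, because scikit-learn uses it to place the decision threshold on the anomaly scores.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# synthetic data: 1000 points drawn from a standard normal in 4 dimensions
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 4))

contamination = 0.1
demo_model = IsolationForest(contamination=contamination, random_state=0)

# predict on the same data the model was fitted on
demo_pred = demo_model.fit(X_demo).predict(X_demo)

# fraction of points labelled -1 (outlier)
flagged_fraction = np.mean(demo_pred == -1)
print(f"flagged fraction: {flagged_fraction:.3f}")
```

In other words, the parameter fixes how many points get flagged, not which ones: whether the flagged points are the true matches depends entirely on how isolated those samples are in feature space.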

## No interactions
First, we apply the model to the standard dataset (without interactions). We fit the model on the training set and then predict on the validation set.

The model predicts **+1** for an _inlier_, which we label as 0, and **-1** for an _outlier_, which we label as 1.
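
As a small illustration of this mapping, the sketch below applies it to a hand-written prediction vector (the array `demo_pred` is made up for the example). It shows the arithmetic form (1 - prediction) / 2 next to an equivalent comparison that yields integer labels directly.

```python
import numpy as np

# hypothetical output of IsolationForest.predict: +1 = inlier, -1 = outlier
demo_pred = np.array([1, -1, 1, 1, -1])

# arithmetic mapping: (1 - 1) / 2 = 0 and (1 - (-1)) / 2 = 1 (float result)
labels_float = (1 - demo_pred) / 2

# equivalent mapping through a comparison (integer result)
labels_int = (demo_pred == -1).astype(int)

print(labels_float)  # [0. 1. 0. 0. 1.]
print(labels_int)    # [0 1 0 0 1]
```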

In [22]:
# fit the model on the training set
if_fit = if_model.fit(X_tr.values)
# predict +1 (inlier) or -1 (outlier) for each validation sample
is_inlier = if_fit.predict(X_val.values)
# map the result to integer class labels as follows
#   inlier (+1) -> 0
#   outlier (-1) -> 1
y_pred = (1 - is_inlier) // 2
# compute metrics
res = {
    'best_recall': recall_score(y_val, y_pred),
    'best_balanced_accuracy': balanced_accuracy_score(y_val, y_pred),
    'best_f1': f1_score(y_val, y_pred),
}
# print results
print_log(res, is_grid=False)

-----------------------------------------
Recall : 0.12162162162162163
Balanced accuracy: 0.4615645796550319
f1: 0.11111111111111113
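
To help read these scores, here is a small worked sketch of how the three metrics follow from confusion-matrix counts. The counts are made up for illustration and are not the counts behind the results in this notebook. Balanced accuracy is the mean of the recall on each class, so a binary classifier that guesses at random scores around 0.5.

```python
# hypothetical confusion-matrix counts, chosen only for illustration
tp, fn = 20, 80     # positive samples: correctly flagged / missed
tn, fp = 810, 90    # negative samples: correctly kept / wrongly flagged

recall = tp / (tp + fn)               # recall on the positive class
specificity = tn / (tn + fp)          # recall on the negative class
precision = tp / (tp + fp)

balanced_accuracy = (recall + specificity) / 2
f1 = 2 * precision * recall / (precision + recall)

print(f"recall: {recall:.3f}")
print(f"balanced accuracy: {balanced_accuracy:.3f}")
print(f"f1: {f1:.3f}")
```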


## Interactions, no drop

In [23]:
# fit the model on the training set
if_fit = if_model.fit(X_tr_int.values)
# predict +1 (inlier) or -1 (outlier) for each validation sample
is_inlier = if_fit.predict(X_val_int.values)
# map the result to integer class labels as follows
#   inlier (+1) -> 0
#   outlier (-1) -> 1
y_pred = (1 - is_inlier) // 2
# compute metrics
res = {
    'best_recall': recall_score(y_val, y_pred),
    'best_balanced_accuracy': balanced_accuracy_score(y_val, y_pred),
    'best_f1': f1_score(y_val, y_pred),
}
# print results
print_log(res, is_grid=False)

-----------------------------------------
Recall : 0.16216216216216217
Balanced accuracy: 0.49439766399565394
f1: 0.15483870967741936


## Interactions, drop

In [24]:
# fit the model on the training set
if_fit = if_model.fit(X_tr_drop.values)
# predict +1 (inlier) or -1 (outlier) for each validation sample
is_inlier = if_fit.predict(X_val_drop.values)
# map the result to integer class labels as follows
#   inlier (+1) -> 0
#   outlier (-1) -> 1
y_pred = (1 - is_inlier) // 2
# compute metrics
res = {
    'best_recall': recall_score(y_val, y_pred),
    'best_balanced_accuracy': balanced_accuracy_score(y_val, y_pred),
    'best_f1': f1_score(y_val, y_pred),
}
# print results
print_log(res, is_grid=False)

-----------------------------------------
Recall : 0.1891891891891892
Balanced accuracy: 0.5192177101724841
f1: 0.18918918918918917
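
The three experiments repeat the same fit, predict, map, and score steps. As a minimal sketch, these steps could be wrapped in a helper such as the hypothetical `evaluate_isolation_forest` below. The function name, the `random_state` argument, and the synthetic demo data are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import recall_score, f1_score, balanced_accuracy_score


def evaluate_isolation_forest(X_train, X_val, y_val, contamination, random_state=None):
    """Fit an Isolation Forest on X_train and score its outlier flags on X_val."""
    model = IsolationForest(contamination=contamination, random_state=random_state)
    pred = model.fit(X_train).predict(X_val)
    # outlier (-1) -> 1, inlier (+1) -> 0
    y_pred = (pred == -1).astype(int)
    # same keys as the result dictionaries used in the cells above
    return {
        'best_recall': recall_score(y_val, y_pred),
        'best_balanced_accuracy': balanced_accuracy_score(y_val, y_pred),
        'best_f1': f1_score(y_val, y_pred),
    }


# synthetic demo: the positive validation points are shifted far from the rest
rng = np.random.default_rng(0)
X_train_demo = rng.normal(size=(500, 3))
X_val_demo = np.vstack([
    rng.normal(size=(90, 3)),
    rng.normal(loc=6.0, size=(10, 3)),
])
y_val_demo = np.array([0] * 90 + [1] * 10)

demo_res = evaluate_isolation_forest(
    X_train_demo, X_val_demo, y_val_demo, contamination=0.1, random_state=0
)
print(demo_res)
```

With such a helper, each experiment reduces to a single call, for example `evaluate_isolation_forest(X_tr_drop.values, X_val_drop.values, y_val, match_ratio)`, and passing a fixed `random_state` makes the scores reproducible between runs.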
