# Isolation Forest

We now take a completely different approach and check whether _True_ samples can be treated as outliers/novelties by applying an unsupervised ensemble method, the Isolation Forest.

In [12]:
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import recall_score, f1_score, balanced_accuracy_score
from preprocessing import load_dataset, split_train_test_validation, interactions
from utilities import print_log, print_full

First, we load the dataset and split it into training, test, and validation sets.

In [13]:
test_size = 0.1        # fraction of the dataset used as the test set
val_size = 0.2         # fraction of the dataset used as the validation set
stratify = True

# first, we load the dataset
X, y = load_dataset('./data/data.pkl')
# then, we split it
X_tr, y_tr, X_te, y_te, X_val, y_val = split_train_test_validation(X, y, test=test_size, val=val_size, stratify=stratify)

# generate the dataset versions with interactions, with and without drop
X_tr_drop = interactions(X_tr, drop=True)
X_tr_int = interactions(X_tr, drop=False)

X_val_drop = interactions(X_val, drop=True)
X_val_int = interactions(X_val, drop=False)

Then, we create the Isolation Forest model.

In [14]:
if_model = IsolationForest()
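
As a side note, `IsolationForest()` with default arguments draws its random splits from an unseeded generator, so repeated runs can give slightly different predictions. The sketch below is only an illustration on synthetic data (the array `X_demo` and the value `contamination=0.05` are assumptions, not part of this notebook's dataset or settings): it shows that fixing `random_state` makes two fits agree, and that `predict` returns only +1 and -1.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# synthetic stand-in for a training set: 200 samples, 4 features
X_demo = rng.normal(size=(200, 4))

# random_state fixes the random splits, so repeated fits give the same model;
# contamination is the expected share of outliers (0.05 is an arbitrary choice here)
model_a = IsolationForest(random_state=0, contamination=0.05).fit(X_demo)
model_b = IsolationForest(random_state=0, contamination=0.05).fit(X_demo)

pred_a = model_a.predict(X_demo)   # +1 for inliers, -1 for outliers
pred_b = model_b.predict(X_demo)

same = bool(np.array_equal(pred_a, pred_b))
print(same)
print(np.unique(pred_a))
```

Whether a fixed seed or a non-default `contamination` would help on the real data is not something this sketch can show; it only illustrates the two parameters.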

## No interactions
First, we apply the model to the standard dataset (without interactions). We fit the model on the training set and then apply it to the validation set.

If the predicted result is **+1**, the sample is an _inlier_ and should be labeled 0. If the result is **-1**, the sample is an _outlier_ and should be labeled 1.
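
The mapping described above can be checked on a small toy array. This is a minimal sketch with made-up predictions (`is_inlier_demo` is a hypothetical array, not model output): the arithmetic form `(1 - p) / 2` sends +1 to 0 and -1 to 1 but returns floats, while a comparison against -1 gives the same labels as integers.

```python
import numpy as np

# hypothetical predictions in the +1 / -1 convention of IsolationForest.predict
is_inlier_demo = np.array([1, -1, 1, 1, -1])

# arithmetic form: (1 - 1) / 2 = 0 and (1 - (-1)) / 2 = 1 (float result)
labels_arith = (1 - is_inlier_demo) / 2

# comparison form: same labels, stored as integers
labels_cmp = (is_inlier_demo == -1).astype(int)

print(labels_arith)  # [0. 1. 0. 0. 1.]
print(labels_cmp)    # [0 1 0 0 1]
```

Both forms yield the same 0/1 labels, so either can be passed to the metric functions.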

In [15]:
# fit the model
if_fit = if_model.fit(X_tr)
# predict if it is an inlier
is_inlier = np.array(if_fit.predict(X_val))
# map the result as follows
#   inlier (+1) -> 0
#   outlier (-1) -> 1
y_pred = (1-is_inlier)/2
# compute metrics
res = {
    'best_recall': recall_score(y_val, y_pred),
    'best_balanced_accuracy': balanced_accuracy_score(y_val, y_pred),
    'best_f1': f1_score(y_val, y_pred),
}
# print results
print_log(res, is_grid=False)

-----------------------------------------
Recall : 0.0
Balanced accuracy: 0.4974905897114178
f1: 0.0
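
To read these numbers: balanced accuracy is the mean of the recall on the positive class and the recall on the negative class. With a positive recall of 0.0, a balanced accuracy of about 0.4975 means roughly 99.5% of the negative samples were kept as inliers, while none of the _True_ samples were flagged. The sketch below reproduces that arithmetic with hypothetical counts (the confusion-matrix values are made up for illustration, not taken from the validation set).

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Mean of the recall on the positive and on the negative class."""
    recall_pos = tp / (tp + fn)
    recall_neg = tn / (tn + fp)
    return (recall_pos + recall_neg) / 2

# hypothetical counts: no positives detected, 5 of 1000 negatives flagged as outliers
score = balanced_accuracy(tp=0, fn=100, tn=995, fp=5)
print(score)  # 0.4975
```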


## Interactions, no drop

In [16]:
# fit the model
if_fit = if_model.fit(X_tr_int)
# predict if it is an inlier
is_inlier = np.array(if_fit.predict(X_val_int))
# map the result as follows
#   inlier (+1) -> 0
#   outlier (-1) -> 1
y_pred = (1-is_inlier)/2
# compute metrics
res = {
    'best_recall': recall_score(y_val, y_pred),
    'best_balanced_accuracy': balanced_accuracy_score(y_val, y_pred),
    'best_f1': f1_score(y_val, y_pred),
}
# print results
print_log(res, is_grid=False)

-----------------------------------------
Recall : 0.006802721088435374
Balanced accuracy: 0.501519302827781
f1: 0.01324503311258278


## Interactions, drop

In [17]:
# fit the model
if_fit = if_model.fit(X_tr_drop)
# predict if it is an inlier
is_inlier = np.array(if_fit.predict(X_val_drop))
# map the result as follows
#   inlier (+1) -> 0
#   outlier (-1) -> 1
y_pred = (1-is_inlier)/2
# compute metrics
res = {
    'best_recall': recall_score(y_val, y_pred),
    'best_balanced_accuracy': balanced_accuracy_score(y_val, y_pred),
    'best_f1': f1_score(y_val, y_pred),
}
# print results
print_log(res, is_grid=False)

-----------------------------------------
Recall : 0.034013605442176874
Balanced accuracy: 0.5075965141389053
f1: 0.05988023952095809
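
The three experiments above repeat the same fit, predict, map, and score steps. As a sketch only, they could be wrapped in one helper; the function name `evaluate_isolation_forest`, the fixed `random_state`, and the synthetic arrays below are assumptions introduced for illustration and are not part of the notebook's `preprocessing` or `utilities` modules.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import recall_score, f1_score, balanced_accuracy_score


def evaluate_isolation_forest(X_train, X_valid, y_valid, random_state=0):
    """Fit a fresh Isolation Forest and score its outlier predictions."""
    model = IsolationForest(random_state=random_state).fit(X_train)
    # outlier (-1) -> 1, inlier (+1) -> 0
    y_hat = (model.predict(X_valid) == -1).astype(int)
    return {
        'recall': recall_score(y_valid, y_hat),
        'balanced_accuracy': balanced_accuracy_score(y_valid, y_hat),
        'f1': f1_score(y_valid, y_hat),
    }


# synthetic stand-in data: 95 "normal" validation points plus 5 shifted ones
rng = np.random.default_rng(0)
X_train_demo = rng.normal(size=(300, 3))
X_valid_demo = np.vstack([rng.normal(size=(95, 3)),
                          rng.normal(loc=6.0, size=(5, 3))])
y_valid_demo = np.array([0] * 95 + [1] * 5)

scores = evaluate_isolation_forest(X_train_demo, X_valid_demo, y_valid_demo)
print(scores)
```

With the notebook's data, the same helper would be called once per variant, for example with `X_tr_int`, `X_val_int`, and `y_val`.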
