This notebook checks how many, and which, of the important features are actually used by each isolation tree in a given Isolation Forest.

In [1]:
import os
import numpy as np
import pickle as pkl 
import time
import matplotlib.pyplot as plt 
%matplotlib inline
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score, f1_score, average_precision_score
from sklearn.utils import shuffle
import shap
import diffi.interpretability_module as interp
from diffi.utils import *



## Load dataset

In [2]:
with open(os.path.join(os.getcwd(), 'dataset', 'syn_train.pkl'), 'rb') as f:
    data_tr = pkl.load(f)
with open(os.path.join(os.getcwd(), 'dataset', 'syn_test.pkl'), 'rb') as f:
    data_te = pkl.load(f)

X_tr = data_tr.iloc[:, :-1]
y_tr = data_tr.iloc[:, -1]

X_tr, y_tr = shuffle(X_tr, y_tr, random_state=0)

X_te = data_te.iloc[:, :-1]
y_te = data_te.iloc[:, -1]


In [3]:
print('Training set size: ', X_tr.shape)
print('Test set size: ', X_te.shape)

print('Training label size: ', y_tr.shape)
print('Test label size: ', y_te.shape)

Training set size:  (1000, 20)
Test set size:  (100, 20)
Training label size:  (1000,)
Test label size:  (100,)


## Train the Isolation Forest

In [24]:
iforest = IsolationForest(n_estimators=100, max_samples=256, contamination=0.1, random_state=0, bootstrap=False)
iforest.fit(X_tr)
y_tr_pred = np.array(iforest.decision_function(X_tr) < 0).astype(int)    # score < 0 -> anomaly -> 1; score >= 0 -> inlier -> 0
f1 = f1_score(y_tr, y_tr_pred)
print('F1 score on training data: {:.4f}'.format(f1))

F1 score on training data: 0.4300
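As a sanity check on the thresholding used above, the sketch below confirms on synthetic data that flagging points with `decision_function(X) < 0` gives the same labels as `predict(X) == -1`. The names `X_demo` and `forest_demo` and the toy data are assumptions made for illustration; they are not part of this notebook's dataset.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Toy data (an assumption for illustration): 300 Gaussian points, 15 shifted far away
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(300, 5))
X_demo[:15] += 6.0

forest_demo = IsolationForest(n_estimators=50, max_samples=128,
                              contamination=0.1, random_state=0)
forest_demo.fit(X_demo)

# Negative decision scores are anomalies; predict() encodes anomalies as -1
flags_from_scores = (forest_demo.decision_function(X_demo) < 0).astype(int)
flags_from_predict = (forest_demo.predict(X_demo) == -1).astype(int)

agree = bool(np.array_equal(flags_from_scores, flags_from_predict))
flagged_fraction = float(flags_from_scores.mean())
print('Same labels:', agree)
print('Flagged fraction: {:.3f}'.format(flagged_fraction))
```

Because the threshold is derived from `contamination`, roughly that fraction of the training points is flagged regardless of how many true anomalies the data contains.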


In [33]:
y_te_pred = np.array(iforest.decision_function(X_te) < 0).astype(int)    
print('Detected anomalies: {} out of {}'.format(int(sum(y_te_pred)), len(y_te)))

Detected anomalies: 50 out of 100


In [34]:
sorted_idx, avg_f1 = diffi_ranks(X=X_tr.to_numpy(), y=y_tr.to_numpy(), n_trees=100, max_samples=256, n_iter=10, contamination=0.1)
print('Average F1 score: {:.4f}'.format(avg_f1))

Average F1 score: 0.4330


This cell prints the feature ranking computed by global DIFFI.

In [45]:
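# sorted_idx holds 0-based feature indices, most important first
for rank, feature in enumerate(sorted_idx):
    print('Rank {}: feature {}'.format(rank + 1, feature))

Rank 1: feature 1
Rank 2: feature 0
Rank 3: feature 10
Rank 4: feature 13
Rank 5: feature 3
Rank 6: feature 8
Rank 7: feature 15
Rank 8: feature 9
Rank 9: feature 17
Rank 10: feature 4
Rank 11: feature 11
Rank 12: feature 7
Rank 13: feature 18
Rank 14: feature 16
Rank 15: feature 19
Rank 16: feature 14
Rank 17: feature 6
Rank 18: feature 12
Rank 19: feature 5
Rank 20: feature 2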


Note that the two top-ranked features (indices 1 and 0) are the meaningful ones, while all the others are noise features.

In [46]:
two_most_important_features = sorted_idx[:2]
print('Two most important features:', two_most_important_features)

Two most important features: [1 0]


Next, we check how many times each tree has used these two meaningful features.

In [69]:
# for i, tree in enumerate(iforest.estimators_):
    # print(f"Tree {i}:")
    # print(f"  Depth: {tree.tree_.max_depth}")
    # print(f"  Number of nodes: {tree.tree_.node_count}")

    # features_per_tree.append(tree.tree_.feature)
    # print(f"  Features used: {tree.tree_.feature}")    

# tree_.feature lists the split feature of every node; -2 marks a leaf
features_per_tree = [tree.tree_.feature for tree in iforest.estimators_]
print('Number of trees:', len(features_per_tree))
print('Features per tree:', features_per_tree)


Number of trees: 100
Features per tree: [array([11,  9, -2, 16, -2, -2,  5,  9, 12, 16, 12, 11, 14, -2, -2, 11, -2,
       -2,  6,  2, -2, -2, -2, -2,  6, 17,  9, 13, -2, -2,  8, -2, -2, -2,
       19, 12, 15, -2, -2,  0, -2, -2,  9, 14, -2, -2, -2,  0,  3, -2,  8,
       -2, -2,  0,  4, 18,  8, -2, -2, -2,  5, 17, -2, -2, 17, -2, -2, 17,
       -2,  3, 18, -2, -2, -2,  7, 13, 19, -2, -2,  9, 12, 19, -2, 15, -2,
       -2, -2,  9, 10, -2, -2, -2,  6,  7,  3, 16,  6, -2, -2, 13, -2, -2,
       -2, -2, -2], dtype=int64), array([ 4,  5, 14,  2, 16, -2,  9, -2, -2, -2,  5, -2, 14, -2,  1, -2,  0,
       19, -2, -2,  8, -2, -2, 17, 17, 11, -2,  1, -2, 13, -2, -2, -2,  4,
       18, 14, 15, -2, -2, -2,  3,  4, -2,  4, -2, -2, -2, 11,  0, 15,  9,
       -2, -2, -2, -2, -2, 17, -2,  8, 19,  5,  5,  7,  7, -2, -2, 16, -2,
       -2,  6,  3, -2, -2,  1, -2, -2,  0, -2, -2, 17, 16, -2, 19,  6, -2,
       -2, 16, -2, -2,  2, -2, 18, 19, -2, -2,  8, -2, -2,  5,  9, 13, -2,
       -2, 19

TO-DO:
- decide which statistics to extract from `features_per_tree`:
    * how many times the `two_most_important_features` were used in each tree (usage percentage)
    * anything else?
- understand (or ask) why the performance is low and, consequently, whether the synthetic dataset we created is appropriate
- anything else?
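The first item in the list above (how often each tree uses the two important features) could be approached along the following lines. This is only a sketch: the helper `important_feature_usage` and the toy arrays are hypothetical names introduced here. It assumes each entry of `features_per_tree` is a `tree_.feature` array in which negative values (`-2`) mark leaves.

```python
import numpy as np

def important_feature_usage(features_per_tree, important_features):
    """Hypothetical helper: per tree, return (n_splits, n_hits, share).

    n_splits: number of internal (split) nodes in the tree
    n_hits:   number of those splits made on one of the important features
    share:    n_hits / n_splits (0.0 for a tree without splits)
    """
    stats = []
    for node_features in features_per_tree:
        node_features = np.asarray(node_features)
        # Leaves are encoded with a negative feature index (-2), so drop them
        split_features = node_features[node_features >= 0]
        n_splits = int(split_features.size)
        n_hits = int(np.isin(split_features, important_features).sum())
        share = n_hits / n_splits if n_splits else 0.0
        stats.append((n_splits, n_hits, share))
    return stats

# Toy input (an assumption for illustration), shaped like features_per_tree
toy_features_per_tree = [
    np.array([11, 9, -2, 0, -2, -2, 1, -2, -2]),
    np.array([4, -2, 5, -2, -2]),
]
toy_stats = important_feature_usage(toy_features_per_tree, important_features=[1, 0])
for i, (n_splits, n_hits, share) in enumerate(toy_stats):
    print('Tree {}: {} of {} splits use an important feature ({:.1%})'.format(
        i, n_hits, n_splits, share))
```

In this notebook the call would be `important_feature_usage(features_per_tree, two_most_important_features)`. With 2 meaningful features out of 20 and uniformly random split selection, a share near 10% per tree is what chance alone would give, which offers a baseline for reading the result.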