In [1]:
import numpy as np

from sklearn.datasets import load_boston

from sklearn.ensemble import IsolationForest, RandomForestRegressor
from bench_isolation_forest_parallel_predict import ParallelPredIsolationForest

In this notebook, we run a very simple experiment comparing the prediction runtime of Random Forest Regression with parallel jobs against that of the Isolation Forest proposed in the PR, which also uses parallel jobs during prediction.

Disclaimer: I understand that these algorithms may not be comparable in terms of principles, but we are only trying to understand how much time each one spends handling parallel prediction in the regime of **1-sample prediction**.

**Methodology**

1) We use the Boston house-prices dataset.

2) We train the RF Regression using the data and the target.

3) We train the current single-job prediction Isolation Forest using the data only.

4) We train the proposed parallel-job prediction Isolation Forest using the data only.

5) We repeat this for different numbers of parallel jobs, and we use the `%time` magic to evaluate the runtime of a single `predict` call on one sample.
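The cells in this notebook time one call per setting, so individual numbers can be noisy. As a minimal sketch of a repeated measurement, the standard-library `timeit` module could be used along these lines. The helper name `time_call` and the stand-in workload are hypothetical; in this notebook the callable would be something like `lambda: rf.predict(X[0:1])`.

```python
import timeit


def time_call(func, repeat=5, number=10):
    """Return the best per-call wall time (seconds) over `repeat` runs.

    Each run calls `func` `number` times; taking the minimum across runs
    reduces the influence of background noise on the estimate.
    """
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


# Stand-in workload (an assumption for illustration, not the notebook's model).
best = time_call(lambda: sum(range(1000)))
print("best per-call time: {:.3f} ms".format(best * 1e3))
```

Note that repeated calls may also amortize one-off costs such as worker start-up, so the result answers a slightly different question than a single cold call.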

In [2]:
X, y = load_boston(return_X_y=True)
X.shape

(506, 13)

In [3]:
n_jobs = [1, 2, 4, 8, 12]

#### Random Forest Regression

In [4]:
for n in n_jobs:
    print("Number of jobs: {}.".format(n))
    rf = RandomForestRegressor(n_estimators=100, n_jobs=n, random_state=123)
    rf.fit(X, y)
    %time rf.predict(X[0:1])

Number of jobs: 1.
CPU times: user 5.43 ms, sys: 31 µs, total: 5.46 ms
Wall time: 5.46 ms
Number of jobs: 2.
CPU times: user 12.3 ms, sys: 2.85 ms, total: 15.1 ms
Wall time: 102 ms
Number of jobs: 4.
CPU times: user 13.7 ms, sys: 2.7 ms, total: 16.4 ms
Wall time: 102 ms
Number of jobs: 8.
CPU times: user 15.3 ms, sys: 3.37 ms, total: 18.7 ms
Wall time: 107 ms
Number of jobs: 12.
CPU times: user 14 ms, sys: 2.99 ms, total: 17 ms
Wall time: 105 ms


#### Single-job Isolation Forest

In [5]:
for n in n_jobs:
    print("Number of jobs: {}.".format(n))
    iforest = IsolationForest(n_estimators=100, n_jobs=n, random_state=123)
    iforest.fit(X)
    %time iforest.predict(X[0:1])

Number of jobs: 1.
CPU times: user 28.3 ms, sys: 41 µs, total: 28.4 ms
Wall time: 28.4 ms
Number of jobs: 2.
CPU times: user 45.6 ms, sys: 2.93 ms, total: 48.6 ms
Wall time: 46.8 ms
Number of jobs: 4.
CPU times: user 34.8 ms, sys: 17 µs, total: 34.8 ms
Wall time: 34.8 ms
Number of jobs: 8.
CPU times: user 33.8 ms, sys: 28 µs, total: 33.8 ms
Wall time: 33.9 ms
Number of jobs: 12.
CPU times: user 35.2 ms, sys: 163 µs, total: 35.4 ms
Wall time: 35.5 ms


#### Parallel-job Isolation Forest

In [6]:
for n in n_jobs:
    print("Number of jobs: {}.".format(n))
    iforest = ParallelPredIsolationForest(n_estimators=100, n_jobs=n, random_state=123)
    iforest.fit(X)
    %time iforest.predict(X[0:1])

Number of jobs: 1.
CPU times: user 30.4 ms, sys: 50 µs, total: 30.5 ms
Wall time: 30.6 ms
Number of jobs: 2.
CPU times: user 51 ms, sys: 8.43 ms, total: 59.4 ms
Wall time: 107 ms
Number of jobs: 4.
CPU times: user 45.4 ms, sys: 9.31 ms, total: 54.7 ms
Wall time: 106 ms
Number of jobs: 8.
CPU times: user 57.5 ms, sys: 12.8 ms, total: 70.3 ms
Wall time: 103 ms
Number of jobs: 12.
CPU times: user 54.6 ms, sys: 11.5 ms, total: 66.2 ms
Wall time: 103 ms


## Final remarks

We can observe that the RF regressor with 1 job achieves a predict time of around 5 ms. As soon as we increase the number of jobs, however, the predict call takes at least about 100 ms of wall time, even though its CPU time stays below 20 ms. With the single-job Isolation Forest, which ignores `n_jobs` during prediction, the time for 1 sample stays at around 30 to 45 ms in this scenario. When we use parallel jobs during predict with the proposed Isolation Forest, we also obtain around 100 ms of predict time. So I conclude that this issue is not really specific to Isolation Forest, but is more related to the overhead of handling the parallel jobs.
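The general effect, a fixed dispatch cost that dominates when each unit of work is tiny, can be illustrated without scikit-learn. The sketch below is only an analogy: it uses the standard-library `ThreadPoolExecutor`, not the parallel backend that the estimators use, and the function names are hypothetical. It compares a serial loop over trivial tasks with the same tasks sent through a freshly created pool.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def tiny_task(x):
    # Deliberately cheap, so scheduling cost dominates the runtime.
    return x * x


def run_serial(values):
    return [tiny_task(v) for v in values]


def run_pooled(values, n_workers):
    # A new pool is created on every call, so pool setup and teardown
    # are included in the measured time.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(tiny_task, values))


def wall_time(func, *args):
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


values = list(range(100))
serial_result, serial_s = wall_time(run_serial, values)
pooled_result, pooled_s = wall_time(run_pooled, values, 4)

print("serial: {:.3f} ms".format(serial_s * 1e3))
print("pooled: {:.3f} ms".format(pooled_s * 1e3))
```

Both variants return the same values; only the wall time differs. The absolute numbers depend on the machine and are not expected to match the roughly 100 ms seen above.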