Skip to content

Latest commit

 

History

History
105 lines (80 loc) · 4.82 KB

extension_level.rst

File metadata and controls

105 lines (80 loc) · 4.82 KB

extension_level

  • Available in: Extended Isolation Forest
  • Hyperparameter: yes

Description

For Extended Isolation Forest the extension_level hyperparameter allows you to leverage the generalization of Isolation Forest. The number 0 corresponds to Isolation Forest's behavior because the split points are not randomized with slope. For the dataset with P features, the maximum extension_level is P-1 and means full-extension. As the extension_level is increased, the bias of the standard Isolation Forest is reduced. A lower extension is suitable for a domain where the range of the minimum and maximum for each feature highly differs (for example, when one feature is measured in millimeters, and the second one in meters). The following paragraphs deliver a more detailed explanation.

The branching criteria in Extended Isolation Forest for the data splitting for a given data point x is as follows:

(x - p) * n \leq  0
where:
  • x, p, and n are vectors with P features
  • p is random intercept generated from the uniform distribution with bounds coming from the sub-sample of data to be split.
  • n is random slope for the branching cut generated from \mathcal{N(0,1)} distribution.

The function of extension_level is to force random items of n to be zero. The extension_level hyperparameter value is between 0 and P-1. A value of 0 means that all slopes will be parallel with all of the axes, which corresponds to Isolation Forest's behavior. A higher number of extension level indicates that the split will be parallel with extension_level-number of axes. The full-extension means extension_level is equal to P - 1. This indicates that the slope of the branching point will always be randomized.

For a full insight into the extension_level hyperparameter, please read the section High Dimensional Data and Extension Levels from the original paper.

Example

.. tabs::
   .. code-tab:: r R

        library(h2o)
        h2o.init()

        # Import the prostate dataset
        prostate <- h2o.importFile(path = "https://raw.github.com/h2oai/h2o/master/smalldata/logreg/prostate.csv")

        # Set the predictors
        predictors <- c("AGE","RACE","DPROS","DCAPS","PSA","VOL","GLEASON")

        # Build an Extended Isolation forest model
        model_if <- h2o.extendedIsolationForest(x = predictors,
                                                training_frame = prostate,
                                                model_id = "eif_if.hex",
                                                ntrees = 100,
                                                sample_size = 256,
                                                extension_level = 0)

        # Use full-extension
        model_eif <- h2o.extendedIsolationForest(x = predictors,
                                                 training_frame = prostate,
                                                 model_id = "eif.hex",
                                                 ntrees = 100,
                                                 sample_size = 256,
                                                 extension_level = length(predictors) - 1)

        # Calculate score
        score_if <- h2o.predict(model_if, prostate)
        anomaly_score_if <- score_if$anomaly_score
        score_eif <- h2o.predict(model_eif, prostate)
        anomaly_score_eif <- score_eif$anomaly_score


   .. code-tab:: python

        import h2o
        from h2o.estimators import H2OExtendedIsolationForestEstimator
        h2o.init()

        # Import the prostate dataset
        h2o_df = h2o.import_file("https://raw.github.com/h2oai/h2o/master/smalldata/logreg/prostate.csv")

        # Set the predictors
        predictors = ["AGE","RACE","DPROS","DCAPS","PSA","VOL","GLEASON"]

        # Simulate Isolation Forest behavior with Extended Isolation Forest algorithm
        eif_if = H2OExtendedIsolationForestEstimator(model_id = "eif_if.hex",
                                                     ntrees = 100,
                                                     extension_level = 0)

        # Use full-extension
        eif_full = H2OExtendedIsolationForestEstimator(model_id = "eif_full.hex",
                                                       ntrees = 100,
                                                       extension_level = len(predictors) - 1)

        eif_if.train(x = predictors, training_frame = h2o_df)
        eif_full.train(x = predictors, training_frame = h2o_df)

        # Calculate score
        eif_if_result = eif_if.predict(h2o_df)
        eif_full_result = eif_full.predict(h2o_df)
        print(eif_if_result["anomaly_score"])
        print(eif_full_result["anomaly_score"])