Isolation Forest Implementation

The goal of this project is to implement the original Isolation Forest algorithm by Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. (A later version of this work is also available: Isolation-based Anomaly Detection.) There are two general approaches to anomaly detection:

model what normal looks like and then look for nonnormal observations
focus on the anomalies, which are few and different. This is the interesting and relatively-new approach taken by the authors of isolation forests.

The isolation forest algorithm is original and beautiful in its simplicity; and also seems to work very well, with a few known weaknesses. The academic paper is extremely readable so is the best start to know everything.

Data sets

The datasets we use are

Kaggle credit card fraud competition data set; download, unzip to get creditcard.csv
Get cancer data into cancer.csv from savecancer.csv .
http.zip; download, unzip to get http.csv.

The code assumes the data files are in the same directory as the code.

Visualization of normal versus anomaly separation

Using plot_anomalies.py, you can see the results of the isolation forest trying to detect anomalies. These data sets all have known targets indicating normal versus anomaly, but this information is only used during testing and not during training. In other words, we use this information to discover how well we can separate the distribution of normal versus anomalous observations. The section provides a number of results, but yours might look different because of the inherent randomness involved in selecting subsets of the data and constructing random trees. (click on the images to enlarge.)

http.csv, 200 trees, 99% desired TPR

creditcard.csv, 200 trees, 80% desired TPR	creditcard.csv, 200 trees, 90% desired TPR

cancer, 300 trees, 70% desired TPR	cancer, 300 trees, 80% desired TPR

Algorithm

The algorithms extracted from the Liu et al paper:

Please use this version of average path length c(), not the one in the original paper:

Then finally here's the scoring formula:

where "H(i) is the harmonic number and it can be estimated by ln(i) + 0.5772156649 (Euler’s constant)."

Increasing speed

We also have an improved version that reduces the run time.
The changes done are :

Sample the columns from a beta-binomial distribution which favors values on the extremes

Add additional condition - Accept the random split only if the number of elements in left node is atmost 25% of right node or the converse

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
images		images
IsolationForestPaper.pdf		IsolationForestPaper.pdf
README.md		README.md
iforest.py		iforest.py
plot_anomalies.py		plot_anomalies.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

images

images

IsolationForestPaper.pdf

IsolationForestPaper.pdf

README.md

README.md

iforest.py

iforest.py

plot_anomalies.py

plot_anomalies.py

Repository files navigation

Isolation Forest Implementation

Data sets

Visualization of normal versus anomaly separation

Algorithm

Increasing speed

About

Releases

Packages

Languages

Divya-Bhargavi/isolation-forest

Folders and files

Latest commit

History

Repository files navigation

Isolation Forest Implementation

Data sets

Visualization of normal versus anomaly separation

Algorithm

Increasing speed

About

Topics

Resources

Stars

Watchers

Forks

Languages