# Isolation Forest

#### Experimenting with the [Isolation Forest](https://en.wikipedia.org/wiki/Isolation_forest) algorithm for anomaly detection

Links:
- [Isolation Forest Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html)
- [Anomaly Detection using Isolation Forest - A Complete Guide](https://www.analyticsvidhya.com/blog/2021/07/anomaly-detection-using-isolation-forest-a-complete-guide/)
- [Anomaly Detection Analysis - Isolation Forest](https://deepnote.com/@christopher-hui/Anomaly-Detection-Analysis-Isolation-Forest-wBLaaICBTi6byIvFmhwtbA)
- [Extended Isolation Forest](https://github.com/sahandha/eif)
- [Feature Importance in Isolation Forest](https://stats.stackexchange.com/questions/386558/feature-importance-in-isolation-forest)

Setting up the dataset:
- Read in CSV and reformat the columns

In [None]:
from dotenv import load_dotenv
import os

import pandas as pd
import matplotlib.pyplot as plt
import ipywidgets as widgets
from ipywidgets import interact
from sklearn.ensemble import IsolationForest

load_dotenv()
DATASET_PATH = os.environ.get("DATASET_PATH")

# Read in CSV and reformat columns
# os.path.join works whether or not DATASET_PATH ends with a path separator
df = pd.read_csv(os.path.join(DATASET_PATH, "Day 1 Conversion Rate Motor.csv"))
df = df.rename(columns={"Grouping": "State", "Selected Measure1": "Day 1 Conversion Rate"})
# dates are parsed as day/month/two-digit year
df["Month"] = pd.to_datetime(df["Month"], format="%d/%m/%y")

states = sorted(set(df["State"]))
df.head()
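
To make the reformatting step easier to check without the real file, here is a minimal sketch on an in-memory CSV. The column names come from the cell above; the two data rows and the state labels are hypothetical, since the actual file contents are not shown in this notebook.

```python
import io

import pandas as pd

# Hypothetical two-row extract; only the column names are taken from the notebook.
csv_text = (
    "Month,Grouping,Selected Measure1\n"
    "01/02/21,NSW,0.31\n"
    "01/03/21,VIC,0.28\n"
)

sample = pd.read_csv(io.StringIO(csv_text))
sample = sample.rename(
    columns={"Grouping": "State", "Selected Measure1": "Day 1 Conversion Rate"}
)

# "%d/%m/%y" reads the first field as the day, so 01/02/21 is 1 February 2021
sample["Month"] = pd.to_datetime(sample["Month"], format="%d/%m/%y")

first_month = sample["Month"].iloc[0]
columns = list(sample.columns)
print(sample)
```

Passing an explicit `format` avoids pandas guessing month-first for ambiguous dates such as `01/02/21`.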

Investigating NSW data:

In [None]:
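# create new dataframe to filter for NSW
# (assumes the NSW label is the first entry of the sorted `states` list)
# .copy() avoids SettingWithCopyWarning when columns are added later;
# reset_index returns a new dataframe, so its result must be assigned
df_nsw = df.loc[df.State == states[0]].copy()
df_nsw = df_nsw.reset_index(drop=True)
df_nsw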

Run the Isolation Forest algorithm:

In [None]:
# create iso forest model
model = IsolationForest(
    n_estimators=100,
    max_samples="auto",
    contamination=0.2,  # expected fraction of anomalies in the data
)

# fit the model to the conversion rate values
model.fit(df_nsw[["Day 1 Conversion Rate"]].values)

# add two new columns to the dataframe:
# "scores": decision_function output; negative values indicate anomalies
# "anomaly_score": predict output; -1 for an anomaly, 1 for a normal point
df_nsw.insert(len(df_nsw.columns), "scores", model.decision_function(df_nsw[["Day 1 Conversion Rate"]].values))
df_nsw.insert(len(df_nsw.columns), "anomaly_score", model.predict(df_nsw[["Day 1 Conversion Rate"]].values))

# filter for anomalies only
df_nsw[df_nsw["anomaly_score"]==-1].head()
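
The two model outputs used above are closely related, and a small experiment may make that clearer. The sketch below uses a synthetic one-dimensional series (hypothetical values, not the conversion rate data) and a fixed `random_state`, which is an assumption added here for reproducibility. It checks that `predict` returns -1 exactly where `decision_function` is negative, and that `contamination` roughly controls the share of points flagged.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for a single numeric column (hypothetical values).
rng = np.random.default_rng(0)
X = rng.normal(loc=0.5, scale=0.05, size=(100, 1))
X[:3] = [[0.95], [0.05], [0.90]]  # three planted outliers

model = IsolationForest(n_estimators=100, contamination=0.2, random_state=0)
model.fit(X)

scores = model.decision_function(X)  # continuous; lower means more anomalous
labels = model.predict(X)            # -1 for an anomaly, 1 for a normal point

# predict() thresholds decision_function() at zero
labels_match_sign = bool(np.array_equal(labels == -1, scores < 0))

# contamination places the threshold so roughly this share is flagged
flagged_fraction = float(np.mean(labels == -1))

print(labels_match_sign, round(flagged_fraction, 2))
```

Because `contamination` fixes the share of flagged points in advance, a value of 0.2 will label about a fifth of the training rows as anomalies whether or not they are unusual, so the continuous `scores` column is often the more informative one to inspect.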