# How to Perform N Times Faster Outlier Detection in UMAP (Python) on Million-row Datasets
![](images/pixabay.jpg)
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://pixabay.com/users/publicdomainpictures-14/?utm_source=link-attribution&utm_medium=referral&utm_campaign=image&utm_content=2205'>PublicDomainPictures</a>
        on 
        <a href='https://pixabay.com/?utm_source=link-attribution&utm_medium=referral&utm_campaign=image&utm_content=2205'>Pixabay</a>
    </strong>
</figcaption>

# Setup

In [1]:
import logging
import time
import warnings

import catboost as cb
import datatable as dt
import joblib
import lightgbm as lgbm
import matplotlib.pyplot as plt
import numpy as np
import optuna
import pandas as pd
import seaborn as sns
import xgboost as xgb
from optuna.samplers import TPESampler
from sklearn.compose import (
    ColumnTransformer,
    make_column_selector,
    make_column_transformer,
)
from sklearn.impute import SimpleImputer
from sklearn.metrics import log_loss, mean_squared_error
from sklearn.model_selection import (
    KFold,
    StratifiedKFold,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

logging.basicConfig(
    format="%(asctime)s - %(message)s", datefmt="%d-%b-%y %H:%M:%S", level=logging.INFO
)
optuna.logging.set_verbosity(optuna.logging.WARNING)
warnings.filterwarnings("ignore")
pd.set_option("float_format", "{:.5f}".format)

# Introduction

We've all used those simple techniques - plot a scatterplot or a KDE, and the data points farthest from the group are outliers. Now, tell me - how would you use these methods if you had to find outliers in, say, a 100-dimensional dataset? Right off the bat, visual outlier detection methods are out of the question. So, fancy machine learning algorithms like Local Outlier Factor or Isolation Forest come to mind, which are fairly effective against outliers in high-dimensional data.

But there are many caveats to using ML methods to detect outliers. A method called Elliptic Envelope uses covariance estimation but assumes the data is normally distributed (which is rarely the case). Local Outlier Factor is much faster than Isolation Forest, at the risk of lower accuracy.
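
To make the comparison concrete, here is a minimal sketch of how the three detectors mentioned above are called in Scikit-learn. The data is synthetic and the `contamination=0.02` setting is an assumption chosen for illustration, not something taken from the TPS dataset:

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption for this sketch):
# 1,000 Gaussian rows plus 20 rows placed far away from them
inliers = rng.normal(loc=0.0, scale=1.0, size=(1000, 10))
far_points = rng.normal(loc=8.0, scale=1.0, size=(20, 10))
X_demo = np.vstack([inliers, far_points])

# contamination=0.02 is an illustrative guess at the share of outliers
detectors = {
    "EllipticEnvelope": EllipticEnvelope(contamination=0.02, random_state=0),
    "LocalOutlierFactor": LocalOutlierFactor(n_neighbors=20, contamination=0.02),
    "IsolationForest": IsolationForest(contamination=0.02, random_state=0),
}

flagged = {}
for name, detector in detectors.items():
    # All three return -1 for rows flagged as outliers and 1 for inliers
    labels_demo = detector.fit_predict(X_demo)
    flagged[name] = int(np.sum(labels_demo == -1))
    print(f"{name}: flagged {flagged[name]} of {len(X_demo)} rows")
```

All three share the same `fit_predict` interface, so swapping one detector for another is a one-line change. What differs is the assumptions each makes about the data and how long each takes to run.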

So, how can we excel at both speed and accuracy at outlier detection when dealing with massive, million-row datasets that are so common? That's where UMAP comes in.

# What is UMAP?

UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction algorithm introduced in 2018. It combines the best features of PCA and t-SNE - it scales to large datasets easily and competes with PCA in terms of speed, while projecting data to a low-dimensional space much more effectively and beautifully than t-SNE:

<p float="left">
  <img src="https://miro.medium.com/max/1250/1*OMXhwgFxgwn5fLEkGrV_Pw.png" width="300" height="300"/>
  <img src="https://miro.medium.com/max/3750/1*GEeKKJET7WzzrGcWhL3H-A.png" width="300" height="300"/> 
  <img src="https://miro.medium.com/max/1250/1*rlfn-CugxKhmZ8G7sLJQJQ.png" width="300" height="300"/>
</p>

The UMAP Python package has a familiar Scikit-learn API. Below is an example that projects the data from the Kaggle TPS September competition to 2D:

In [2]:
import datatable as dt
import pandas as pd
import umap  # pip install umap-learn

tps = dt.fread("data/train.csv").to_pandas()
tps.shape

(957919, 120)

```python
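from sklearn.impute import SimpleImputer

# Separate the features from the target
X, y = tps.drop("claim", axis=1), tps["claim"].values

# By default, UMAP rejects missing values, so fill them with column means first
X_filled = SimpleImputer(strategy="mean").fit_transform(X)

# Initialize the manifold and fit it, using the target as supervision
manifold = umap.UMAP(n_components=2)
manifold.fit(X_filled, y)

X_2d = manifold.transform(X_filled)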
```

UMAP is designed so that, whatever dimension you project the data to, it preserves as much of the variance and topological structure of the data as possible. But what does this have to do with outlier detection?

Well, since UMAP preserves many of the peculiarities and attributes of the dataset even in lower dimensions, we can first project the data to a lower-dimensional space and then run any other outlier detection algorithm much faster! In the coming sections, we will look at an example of this using the TPS September data above.

> If you want to learn more about UMAP and its awesome features, I have covered it in-depth in a previous article:

https://towardsdatascience.com/beginners-guide-to-umap-for-reducing-dimensionality-and-visualizing-100-dimensional-datasets-ff5590fb17be


# Setting a baseline with pure Isolation Forest

Before we move on, let's establish a baseline by running a plain Isolation Forest on the whole dataset:

In [3]:
from sklearn.ensemble import IsolationForest

tps = dt.fread("data/train.csv").to_pandas()
X, y = tps.drop("claim", axis=1), tps[["claim"]].values.flatten()

Right before we fit Isolation Forest, we will impute missing values with the mean:

In [4]:
from sklearn.impute import SimpleImputer

# Impute
X_imputed = SimpleImputer().fit_transform(X.copy())
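
To see what mean imputation does, here is a small sketch on a toy array. The array and the names `X_small` and `X_filled_small` are made up for illustration. `SimpleImputer` uses `strategy="mean"` by default, which is why the call above needs no arguments:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy array (hypothetical) with one missing value in each column
X_small = np.array(
    [
        [1.0, 10.0],
        [np.nan, 20.0],
        [3.0, np.nan],
    ]
)

imputer = SimpleImputer(strategy="mean")
X_filled_small = imputer.fit_transform(X_small)

# Column means learned from the non-missing values: 2.0 and 15.0
print(imputer.statistics_)

# Each NaN is replaced by the mean of its own column
print(X_filled_small)
```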

In [5]:
%%time

# Init Isolation Forest
iso = IsolationForest(n_estimators=3000, n_jobs=9)
labels = iso.fit_predict(X_imputed)

Wall time: 44min 33s


In [10]:
np.sum(labels == -1)

2713

Even though it is powerful, Isolation Forest has only a few parameters to tune. The most important one is `n_estimators`, which controls the number of trees to be built. We set it to 3000 considering the dataset size.
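
To get a feel for `n_estimators` without waiting 45 minutes, here is a rough sketch on a small synthetic dataset. The data, the tree counts (50 and 200), and the variable names are assumptions for illustration only. It also uses `score_samples`, which returns a per-row score where lower means more anomalous:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic stand-in data: 2,000 Gaussian rows plus 20 rows scattered over a wide range
X_toy = np.vstack(
    [
        rng.normal(0.0, 1.0, size=(2000, 5)),
        rng.uniform(-10.0, 10.0, size=(20, 5)),
    ]
)

results = {}
for n_trees in (50, 200):
    forest = IsolationForest(n_estimators=n_trees, random_state=0)
    labels_toy = forest.fit_predict(X_toy)  # -1 = outlier, 1 = inlier
    scores = forest.score_samples(X_toy)  # lower score = more anomalous

    results[n_trees] = {
        "trees_built": len(forest.estimators_),
        "n_flagged": int(np.sum(labels_toy == -1)),
        "mean_score_scattered": float(scores[2000:].mean()),
        "mean_score_gaussian": float(scores[:2000].mean()),
    }
    print(n_trees, results[n_trees])
```

More trees generally make the scores more stable from run to run, at the cost of a longer fit. Timing a few values of `n_estimators` on a small sample is a cheap way to estimate the cost before fitting on the full dataset.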

After waiting for about 45 minutes, we discover that Isolation Forest found 2713 outliers in the data. Now, let's perform the same operation after projecting the data with UMAP.