# How to Perform N Times Faster Outlier Detection in UMAP (Python) on Million-row Datasets
![](images/pixabay.jpg)
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://pixabay.com/users/publicdomainpictures-14/?utm_source=link-attribution&utm_medium=referral&utm_campaign=image&utm_content=2205'>PublicDomainPictures</a>
        on 
        <a href='https://pixabay.com/?utm_source=link-attribution&utm_medium=referral&utm_campaign=image&utm_content=2205'>Pixabay</a>
    </strong>
</figcaption>

# Setup

```python
import logging
import time
import warnings

import catboost as cb
import datatable as dt
import joblib
import lightgbm as lgbm
import matplotlib.pyplot as plt
import numpy as np
import optuna
import pandas as pd
import seaborn as sns
import xgboost as xgb
from optuna.samplers import TPESampler
from sklearn.compose import (
    ColumnTransformer,
    make_column_selector,
    make_column_transformer,
)
from sklearn.impute import SimpleImputer
from sklearn.metrics import log_loss, mean_squared_error
from sklearn.model_selection import (
    KFold,
    StratifiedKFold,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

logging.basicConfig(
    format="%(asctime)s - %(message)s", datefmt="%d-%b-%y %H:%M:%S", level=logging.INFO
)
optuna.logging.set_verbosity(optuna.logging.WARNING)
warnings.filterwarnings("ignore")
pd.set_option("float_format", "{:.5f}".format)
```

# Introduction

We've all used those simple techniques - plot a scatterplot or a KDE, and the data points farthest from the group are outliers. Now, tell me - how would you use these methods to find outliers in, say, a 100-dimensional dataset? Right off the bat, visual outlier detection methods are out of the question. So, machine learning algorithms like Local Outlier Factor or Isolation Forest come to mind, which are fairly effective against outliers in high-dimensional data.

But there are many caveats to using ML methods to detect outliers. A method called Elliptic Envelope fits a robust covariance estimate but assumes the data is normally distributed (which is rarely the case). Local Outlier Factor can be much faster than Isolation Forest, but at the risk of lower accuracy.
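
To make the three detectors concrete, here is a minimal sketch of how each one is called in scikit-learn. The synthetic dataset, the 20-feature shape, and the `contamination=0.05` setting are assumptions chosen only for illustration, not values from this article's experiments.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical toy data: 950 Gaussian inliers plus 50 widely scattered points.
rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(950, 20))
scattered = rng.uniform(low=-6.0, high=6.0, size=(50, 20))
X = np.vstack([inliers, scattered])

# contamination is the assumed share of outliers; here it matches the toy data.
detectors = {
    "EllipticEnvelope": EllipticEnvelope(contamination=0.05, random_state=0),
    "LocalOutlierFactor": LocalOutlierFactor(n_neighbors=20, contamination=0.05),
    "IsolationForest": IsolationForest(contamination=0.05, random_state=0),
}

# fit_predict returns 1 for inliers and -1 for outliers.
labels = {name: det.fit_predict(X) for name, det in detectors.items()}

for name, y in labels.items():
    print(f"{name}: flagged {(y == -1).sum()} of {len(X)} rows")
```

All three estimators share the same `fit_predict` interface, so swapping one for another is cheap. The differences are in the assumptions each makes and in how they scale with the number of rows.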

So, how can we get both speed and accuracy in outlier detection when dealing with the massive, million-row datasets that are now so common? That's where UMAP comes in.