<center>
    <h1 id='detecting-outliers-with-pyod' style='color:#7159c1'>⚙️ Detecting Outliers with PyOD ⚙️</h1>
    <i>Outliers Identification</i>
</center>

---

`PyOD` is a Python package to identify outliers in datasets. In a nutshell, it contains the following models:

> **Angle-Based Outlier Detection (ABOD)** - `It considers the relationship between each point and its neighbor(s). It does not consider the relationships among these neighbors. The variance of its weighted cosine scores to all neighbors could be viewed as the outlying score`;

>> `ABOD performs well on multi-dimensional data`;

>>> `PyOD provides two versions of ABOD: 1) Fast ABOD: uses k-nearest neighbors to approximate the score; 2) Original ABOD: considers all training points, at a high time complexity`.
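
As a rough illustration of the angle-based idea, the sketch below computes, for one point, the variance of the distance-weighted cosines over every pair of the other points, as in the original ABOD variant described above. The function name `abod_variance` and the toy data are assumptions made for this example, and PyOD's own implementation and sign convention are not reproduced here. A point far from the rest sees all other points in roughly the same direction, so its variance is small.

```python
from itertools import combinations
from statistics import pvariance

def abod_variance(points, index):
    """Variance of the distance-weighted cosines between points[index]
    and every pair of the remaining points (hypothetical helper)."""
    p = points[index]
    others = [q for i, q in enumerate(points) if i != index]
    weighted_cosines = []
    for a, b in combinations(others, 2):
        pa = [ai - pi for ai, pi in zip(a, p)]
        pb = [bi - pi for bi, pi in zip(b, p)]
        dot = sum(x * y for x, y in zip(pa, pb))
        norm_a_sq = sum(x * x for x in pa)
        norm_b_sq = sum(x * x for x in pb)
        # Dividing by the squared norms weights the cosine by the distances
        weighted_cosines.append(dot / (norm_a_sq * norm_b_sq))
    return pvariance(weighted_cosines)

# A 3x3 grid of inliers plus one far-away point (index 9)
points = [(x, y) for x in (-1, 0, 1) for y in (-1, 0, 1)] + [(10, 10)]
variances = [abod_variance(points, i) for i in range(len(points))]

# The smallest variance marks the most outlying point
most_outlying = min(range(len(points)), key=variances.__getitem__)
```

Restricting `others` to the k nearest neighbors of the point would give a sketch of the fast variant.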

<br />

> **k-Nearest Neighbors Detector** - `For any data point, the distance to its kth nearest neighbor could be viewed as the outlying score`;

>> `PyOD supports three kNN detectors: 1) Largest: uses the distance to the kth neighbor as the outlier score; 2) Mean: uses the average distance to the k neighbors as the outlier score; 3) Median: uses the median distance to the k neighbors as the outlier score`.
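
The three scoring rules can be sketched in a few lines of plain Python. The function `knn_scores` and the toy points are assumptions for illustration only. This is a brute-force version, not PyOD's implementation.

```python
import math
from statistics import mean, median

def knn_scores(points, k, method="largest"):
    """Outlier score of each point from the distances to its k nearest neighbors."""
    aggregators = {"largest": max, "mean": mean, "median": median}
    aggregate = aggregators[method]
    scores = []
    for i, p in enumerate(points):
        # Distances to all other points, nearest first
        distances = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(aggregate(distances[:k]))
    return scores

points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
largest = knn_scores(points, k=2, method="largest")
average = knn_scores(points, k=2, method="mean")
```

With these points, the isolated point `(8, 8)` receives the highest score under every method, while the four corner points score close to 1.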

<br />

> **Isolation Forest** - `It uses the scikit-learn library internally. In this method, data partitioning is done using a set of trees. Isolation Forest provides an anomaly score looking at how isolated the point is in the structure. The anomaly score is then used to identify outliers from normal observations`;

>> `Isolation Forest performs well on multi-dimensional data`.
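
To make "how isolated the point is" concrete, here is a minimal one-dimensional sketch. It repeatedly splits the sample at a random value and counts the splits needed until the point is alone. The helper names, the fixed seed, and the data are assumptions. A real Isolation Forest also picks a random feature at each split, works on sub-samples, and converts the path length into a normalized score.

```python
import random

def isolation_depth(value, sample, rng, depth=0, max_depth=50):
    """Number of random splits needed to isolate `value` within a 1-D sample."""
    low, high = min(sample), max(sample)
    if len(sample) <= 1 or low == high or depth >= max_depth:
        return depth
    split = rng.uniform(low, high)
    # Keep only the values that fall on the same side of the split as `value`
    same_side = [x for x in sample if (x < split) == (value < split)]
    return isolation_depth(value, same_side, rng, depth + 1, max_depth)

def mean_isolation_depth(value, sample, n_trees=200, seed=0):
    """Average isolation depth over several random 'trees'."""
    rng = random.Random(seed)
    return sum(isolation_depth(value, sample, rng) for _ in range(n_trees)) / n_trees

data = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 9.0]
depth_outlier = mean_isolation_depth(9.0, data)
depth_inlier = mean_isolation_depth(1.3, data)
```

The value `9.0` is usually separated by the very first split, so its average depth is much smaller than that of `1.3`. A short path is what signals an anomaly.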

<br />

> **Histogram-based Outlier Detection (HBOS)** - `It is an efficient unsupervised method that assumes feature independence and calculates the outlier score by building a histogram for each feature`;

>> `It is much faster than multivariate approaches, but at the cost of less precision`.
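
A simplified sketch of the histogram idea follows. Each feature gets its own equal-width histogram, and a point's score is the sum, over features, of the log of the inverse relative frequency of its bin. The function `hbos_scores`, the bin count, and the data are assumptions. PyOD's HBOS normalizes its histograms differently, so treat this as an illustration only.

```python
import math

def hbos_scores(rows, n_bins=5):
    """Sum over features of log(1 / relative frequency of the point's bin)."""
    n = len(rows)
    scores = [0.0] * n
    for column in zip(*rows):  # one histogram per feature (independence assumption)
        low, high = min(column), max(column)
        width = (high - low) / n_bins or 1.0  # avoid a zero width for constant columns
        bins = [min(int((x - low) / width), n_bins - 1) for x in column]
        counts = [bins.count(b) for b in range(n_bins)]
        for i, b in enumerate(bins):
            scores[i] += math.log(n / counts[b])  # rare bin -> large contribution
    return scores

rows = [(1.0, 5.0), (1.2, 5.1), (1.1, 4.9), (0.9, 5.2), (1.3, 5.0), (9.0, 0.5)]
scores = hbos_scores(rows)
```

Because each feature is handled separately, the cost grows linearly with the number of features, which is where the speed advantage comes from.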

<br />

> **Feature Bagging** - `A feature bagging detector fits a number of base detectors on various sub-samples of the dataset. It uses averaging or other combination methods to improve the prediction accuracy`;

>> `By default, Local Outlier Factor (LOF) is used as the base estimator. However, any estimator could be used as the base estimator, such as kNN and ABOD`;

>>> `Feature bagging first constructs n sub-samples by randomly selecting a subset of features. This brings out the diversity of base estimators. Finally, the prediction score is generated by averaging or taking the maximum of all base detectors`.
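
The procedure can be sketched as follows. Note the substitution: instead of LOF, this example uses the distance to the k-th nearest neighbor as the base detector to keep the code short. The function names, the rule "at least half of the features", and the seed are assumptions for illustration.

```python
import math
import random

def kth_neighbor_distance(points, k):
    """Base detector: distance from each point to its k-th nearest neighbor."""
    return [
        sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)[k - 1]
        for i, p in enumerate(points)
    ]

def feature_bagging_scores(points, n_estimators=5, k=2, seed=0):
    rng = random.Random(seed)
    n_features = len(points[0])
    totals = [0.0] * len(points)
    for _ in range(n_estimators):
        # Each base detector sees a random subset with at least half of the features
        size = rng.randint(math.ceil(n_features / 2), n_features)
        features = rng.sample(range(n_features), size)
        projected = [[p[f] for f in features] for p in points]
        for i, score in enumerate(kth_neighbor_distance(projected, k)):
            totals[i] += score
    # Combine the base detectors by averaging their scores
    return [total / n_estimators for total in totals]

points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1), (9, 9, 9)]
scores = feature_bagging_scores(points)
```

Replacing the final average with a per-point maximum would give the other combination rule mentioned above.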

<br />

> **Clustering Based Local Outlier Factor** - `It classifies the data into small clusters and large clusters. The anomaly score is then calculated based on the size of the cluster the point belongs to, as well as the distance to the nearest large cluster`.
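
A simplified sketch of this scoring rule is shown below. It assumes that the clustering has already been done, so labels and centers are given. It also treats a cluster as "large" when it has at least `min_large_size` points, a hypothetical threshold standing in for the parameters PyOD uses to make that decision. Points in large clusters are scored by the distance to their own center, and points in small clusters by the distance to the nearest large-cluster center.

```python
import math
from collections import Counter

def cblof_like_scores(points, labels, centers, min_large_size):
    sizes = Counter(labels)
    large = [c for c, size in sizes.items() if size >= min_large_size]
    scores = []
    for point, label in zip(points, labels):
        if label in large:
            # Member of a large cluster: distance to its own center
            scores.append(math.dist(point, centers[label]))
        else:
            # Member of a small cluster: distance to the nearest large cluster
            scores.append(min(math.dist(point, centers[c]) for c in large))
    return scores

points = [(0, 0), (0, 1), (1, 0), (1, 1), (6, 6), (6, 7)]
labels = [0, 0, 0, 0, 1, 1]                      # assumed clustering result
centers = {0: (0.5, 0.5), 1: (6.0, 6.5)}         # assumed cluster centers
scores = cblof_like_scores(points, labels, centers, min_large_size=3)
```

The two points of the small cluster end up with much higher scores than the four points of the large one.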

<br />

> **Extra Utilities provided by PyOD** - `The function "generate_data" can be used to generate random data with outliers. Inliers are drawn from a multivariate Gaussian distribution and outliers from a uniform distribution`;

>> `We can provide our own outlier fraction and the total number of samples that we want in the dataset. The implementation below uses a real dataset (autos.csv) instead of this utility`.
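
The idea behind such a generator can be sketched with the standard library alone. The function `make_toy_data`, its parameters, and the value ranges are assumptions for illustration. It mimics the description above and is not PyOD's `generate_data`.

```python
import random

def make_toy_data(n_samples=100, contamination=0.1, n_features=2, seed=42):
    """Gaussian inliers plus uniformly distributed outliers, with 0/1 labels."""
    rng = random.Random(seed)
    n_outliers = round(n_samples * contamination)
    n_inliers = n_samples - n_outliers
    # Independent standard normal features (a Gaussian with identity covariance)
    inliers = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_inliers)]
    # Outliers spread uniformly over a much wider box
    outliers = [[rng.uniform(-8.0, 8.0) for _ in range(n_features)] for _ in range(n_outliers)]
    X = inliers + outliers
    y = [0] * n_inliers + [1] * n_outliers  # 0: inlier, 1: outlier
    return X, y

X, y = make_toy_data(n_samples=200, contamination=0.15)
```

Having ground-truth labels `y` is what makes synthetic data useful for comparing detectors.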

---

In particular, we can use PyOD in two ways:

```
- Single Model
- Combining Multiple Models
```

<h1 id='0-single-model' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>0 | Single Model</h1>

In [45]:
# ---- Settings ----
#
# pip install pyod
# pip install combo
#
from pyod.models.abod import ABOD
from pyod.models.cblof import CBLOF
from pyod.models.iforest import IForest
from pyod.models.knn import KNN
from pyod.models.lof import LOF
from pyod.models.ocsvm import OCSVM
from pyod.models.pca import PCA
from pyod.utils.data import evaluate_print

# ---- Defining Fraction for Outliers ----
from random import randrange # standard library, no installation needed
out_frac = randrange(1, 45) / 100 # starts at 1 because contamination must be greater than 0

# ---- Reading Dataset and Splitting into Training and Validation ----
import pandas as pd # pip install pandas
import numpy as np # pip install numpy
from sklearn.model_selection import train_test_split # pip install scikit-learn

autos_df = pd.read_csv('./datasets/autos.csv')
autos_df = autos_df.select_dtypes(exclude=['object'])
X_train, X_valid, y_train, y_valid = train_test_split(
    autos_df.loc[:, 'symboling':'highway_mpg']
    , autos_df.loc[:, 'price']
    , train_size=0.70
    , test_size=0.30
)

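# Drop the shuffled original index of the features so that it is not used as a feature
X_train.reset_index(drop=True, inplace=True)
X_valid.reset_index(drop=True, inplace=True)
y_train = y_train.reset_index() # keeps the original row number in an 'index' column
y_valid = y_valid.reset_index()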

In [46]:
# ---- Defining Methods/Models ----
rs = np.random.RandomState(20241901)

clf = { 
    'Angle-based Outlier Detector (ABOD)': ABOD(contamination=out_frac),
    'Cluster-based Local Outlier Factor (CBLOF)': CBLOF(contamination=out_frac,check_estimator=False, random_state=rs),
    'Isolation Forest': IForest(contamination=out_frac,random_state=rs),
    'K Nearest Neighbors (KNN)': KNN(contamination=out_frac, method='largest', n_neighbors=5, n_jobs=4),
    'Average KNN': KNN(method='mean', contamination=out_frac),
    'Local Outlier Factor (LOF)':LOF(n_neighbors=35, contamination=out_frac),
    'One-class SVM (OCSVM)': OCSVM(contamination=out_frac),
    'Principal Component Analysis (PCA)': PCA(contamination=out_frac, random_state=rs),
}

# ---- Training K-Nearest Neighbors Model ----
clf_name = 'K Nearest Neighbors (KNN)'
clf[clf_name].fit(X_train)

KNN(algorithm='auto', contamination=0.37, leaf_size=30, method='largest',
  metric='minkowski', metric_params=None, n_jobs=4, n_neighbors=5, p=2,
  radius=1.0)

In [47]:
# ---- Predictions and Scores ----

# - Getting the prediction labels and outlier scores of the training data
y_train_pred = clf[clf_name].labels_ # binary labels (0: inliers, 1: outliers)
y_train_scores = clf[clf_name].decision_scores_ # raw outlier scores (distances)
y_train_number_outliers = np.unique(y_train_pred, return_counts=True) # (labels, counts) of inliers and outliers

# - Getting the prediction labels and outlier scores of the validation data
y_valid_pred = clf[clf_name].predict(X_valid) # outlier labels (0 or 1)
y_valid_scores = clf[clf_name].decision_function(X_valid) # outlier scores
y_valid_number_outliers = np.unique(y_valid_pred, return_counts=True) # (labels, counts) of inliers and outliers

# - Prediction Confidence
# - Outlier Labels (0 or 1) and Confidence in the Range of [0.0, 1.0]
y_valid_pred, y_valid_pred_confidence = clf[clf_name].predict(X_valid, return_confidence=True)
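
The binary labels depend on the `contamination` value passed to the detector. The sketch below shows one way raw scores can be turned into labels: compute the score at the `100 * (1 - contamination)` percentile of the training scores and flag everything above it. This percentile rule matches my understanding of PyOD's behavior, but treat it as an assumption. The helper names and the scores are made up for the example.

```python
def percentile(values, q):
    """Percentile with linear interpolation between the two closest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

def labels_from_scores(scores, contamination):
    """Flag as outliers (1) the scores above the contamination-based threshold."""
    threshold = percentile(scores, 100 * (1 - contamination))
    return [int(score > threshold) for score in scores], threshold

scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 5.0]
labels, threshold = labels_from_scores(scores, contamination=0.2)
```

With a contamination of 0.2 and ten scores, the two highest scores are flagged. A larger contamination lowers the threshold and flags more points.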

In [50]:
# ---- Getting Outliers ----
# Positions of the training rows labeled as outliers (label 1)
outliers_indexes = np.flatnonzero(y_train_pred == 1)

outlier_rows = y_train.loc[outliers_indexes]
outlier_rows.head()

|    | index | price |
|----|-------|-------|
| 4  | 35    | 7895  |
| 6  | 120   | 11850 |
| 8  | 34    | 7295  |
| 14 | 57    | 18344 |
| 15 | 105   | 17075 |


<h1 id='1-combining-multiple-models' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>1 | Combining Multiple Models</h1>

In [55]:
# ---- Settings ----
from pyod.models.knn import KNN
from pyod.models.combination import aom, moa, average, maximization
from pyod.utils.utility import standardizer

# ---- Models ----
k_list = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
n_clf = len(k_list) # Number of classifiers being trained

# ---- Calculating Scores ----
train_scores = np.zeros([X_train.shape[0], n_clf])
valid_scores = np.zeros([X_valid.shape[0], n_clf])

for index in range(n_clf):
    k = k_list[index]
    clf = KNN(n_neighbors=k, method='largest')
    clf.fit(X_train)
    
    train_scores[:, index] = clf.decision_scores_
    valid_scores[:, index] = clf.decision_function(X_valid)
    
# ---- Standardizing Scores before Combination ----
train_scores_norm, valid_scores_norm = standardizer(train_scores, valid_scores)
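
Scores from detectors with different `k` live on different scales, so they are standardized before being combined. A minimal sketch of that step is a per-column z-score whose mean and standard deviation come from the training scores only and are then reused for the validation scores. The function `standardize` and the numbers are assumptions for illustration.

```python
from statistics import mean, pstdev

def standardize(train, valid):
    """Z-score each column using statistics computed on the training rows only."""
    columns = list(zip(*train))
    means = [mean(column) for column in columns]
    stds = [pstdev(column) or 1.0 for column in columns]  # guard against constant columns

    def scale(rows):
        return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

    return scale(train), scale(valid)

train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]  # rows: samples, columns: detectors
valid = [[2.0, 40.0]]
train_norm, valid_norm = standardize(train, valid)
```

Reusing the training statistics keeps the validation scores comparable to the training ones.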

---

Types of Combination:

> **Average** - `average scores of all detectors`;

> **Maximization** - `maximum score across all detectors`;

> **Average of Maximum (AOM)** - `divide base detectors into subgroups and take the maximum score for each subgroup. The final score is the average of all subgroup scores`;

> **Maximum of Average (MOA)** - `divide base detectors into subgroups and take the average score for each subgroup. The final score is the maximum of all subgroup scores`.
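
The four rules can be sketched on a tiny score matrix, with one row per sample and one column per detector. The function `combine` is an assumption for illustration. It splits the detectors into contiguous, equally sized subgroups, whereas a library implementation may assign detectors to subgroups differently.

```python
from statistics import mean

def combine(scores, method, n_buckets=2):
    """Combine per-detector scores (one row per sample) into one score per sample."""
    combined = []
    for row in scores:
        size = len(row) // n_buckets  # assumes the detectors split evenly into buckets
        buckets = [row[i * size:(i + 1) * size] for i in range(n_buckets)]
        if method == "average":
            combined.append(mean(row))
        elif method == "maximization":
            combined.append(max(row))
        elif method == "aom":  # average of the per-bucket maxima
            combined.append(mean(max(bucket) for bucket in buckets))
        elif method == "moa":  # maximum of the per-bucket averages
            combined.append(max(mean(bucket) for bucket in buckets))
        else:
            raise ValueError(f"unknown method: {method}")
    return combined

scores = [[1.0, 3.0, 2.0, 6.0], [0.0, 0.0, 4.0, 0.0]]
results = {m: combine(scores, m) for m in ("average", "maximization", "aom", "moa")}
```

For the first sample the buckets are `[1.0, 3.0]` and `[2.0, 6.0]`, so AOM averages the maxima 3.0 and 6.0, while MOA takes the larger of the averages 2.0 and 4.0.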

In [57]:
# ---- Combinations ----
comb_by_average = average(valid_scores_norm)
comb_by_maximization = maximization(valid_scores_norm)
comb_by_aom = aom(valid_scores_norm, 5) # 5 groups
comb_by_moa = moa(valid_scores_norm, 5) # 5 groups

comb_by_average

array([ 1.75755171,  0.44058761, -0.02151253,  1.0988025 ,  0.85565728,
       -0.51147912, -0.61755266, -0.5316046 ,  1.8179753 , -0.12353809,
       -0.25160918, -0.71355965, -0.69535552,  0.82597223, -0.8461039 ,
       -0.07791699, -0.47237863, -0.99544482,  0.81987963,  3.25990955,
       -0.84951258, -0.23601159,  0.82567168, -0.90915232,  0.28004228,
        0.43757886, -0.8023385 ,  1.75055444, -0.82385452, -0.76821742,
       -0.91879264, -0.65483715, -0.273635  , -0.34674645, -0.3022741 ,
        0.98822911, -0.81200354, -0.18906244, -0.79227208, -0.79244855,
        0.01632175, -0.36433127, -0.64715941, -1.01245975, -0.68943522,
       -0.74030481, -0.34715944, -0.38982401, -0.73156442, -0.36422027,
       -0.66006785, -0.58021869, -0.63825257,  3.0170915 , -0.02651536,
       -0.5966944 , -0.65948665, -0.62706118])

---

<h1 id='reach-me' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📫 | Reach Me</h1>

> **Email** - [csfelix08@gmail.com](mailto:csfelix08@gmail.com?)

> **Linkedin** - [linkedin.com/in/csfelix/](https://www.linkedin.com/in/csfelix/)

> **GitHub** - [CSFelix](https://github.com/CSFelix)

> **Kaggle** - [DSFelix](https://www.kaggle.com/dsfelix)

> **Portfolio** - [CSFelix.io](https://csfelix.github.io/).