<div align="center">

# Method Ranking & Optimal Selection

</div>
<br>

The last notebook showed how to compute confidences around our thresholding methods. But one question looms over every unsupervised task: "What is the best method for my data?" The `RANK` utility in `PyThresh` tries to help by ranking every combination of outlier detector and thresholder you select, from predicted best to predicted worst. For an in-depth look at how it does this, see [Ranking](https://pythresh.readthedocs.io/en/latest/ranking.html).


# Let's get started!

To begin, install `pythresh` and `xgboost`, which this notebook needs.


In [None]:
!pip install pythresh "xgboost>=2.0.0"

We can now load a dataset to work with.

In [5]:
import os
from scipy.io import loadmat
from pyod.utils.utility import standardizer

file = os.path.join('data', 'cardio.mat')
mat = loadmat(file)

X = mat['X'].astype(float)
y = mat['y'].ravel().astype(int)

X = standardizer(X)

To rank, we must first select all the outlier detection methods and thresholders that we want to compare.

In [6]:
import numpy as np

from pyod.models.knn import KNN
from pyod.models.iforest import IForest
from pyod.models.pca import PCA

from pythresh.thresholds.karch import KARCH
from pythresh.thresholds.iqr import IQR
from pythresh.thresholds.clf import CLF
from pythresh.utils.rank import RANK


# Initialize models
clfs = [KNN(), IForest(random_state=1234), PCA()]
thres = [KARCH(), IQR(), CLF()]

# Get rankings
ranker = RANK(clfs, thres)
rankings = ranker.eval(X)


In [7]:
print(f'The predicted combo performance from best to worst are {rankings}')

The predicted combo performance from best to worst are [('PCA', 'KARCH'), ('PCA', 'IQR'), ('KNN', 'IQR'), ('IForest', 'IQR'), ('KNN', 'KARCH'), ('PCA', 'CLF'), ('IForest', 'KARCH'), ('IForest', 'CLF'), ('KNN', 'CLF')]
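
Because `rankings` prints as an ordinary list of `(detector, thresholder)` tuples, it can be summarized with plain Python. As a rough illustration (not a `PyThresh` feature), the sketch below hard-codes the list printed above, picks out the top-ranked combo, and averages the position of each detector and each thresholder. `mean_position` is a hypothetical helper name introduced here.

```python
from collections import defaultdict

# Ranking copied from the output above (best predicted combo first).
rankings = [('PCA', 'KARCH'), ('PCA', 'IQR'), ('KNN', 'IQR'),
            ('IForest', 'IQR'), ('KNN', 'KARCH'), ('PCA', 'CLF'),
            ('IForest', 'KARCH'), ('IForest', 'CLF'), ('KNN', 'CLF')]


def mean_position(rankings, index):
    """Average 1-based rank position per name; index 0 = detector, 1 = thresholder."""
    positions = defaultdict(list)
    for pos, combo in enumerate(rankings, start=1):
        positions[combo[index]].append(pos)
    return {name: sum(p) / len(p) for name, p in positions.items()}


# The first tuple is the predicted best combination
best_clf, best_thres = rankings[0]
print(f'Top-ranked combo: {best_clf} + {best_thres}')
print('Mean position per detector:', mean_position(rankings, 0))
print('Mean position per thresholder:', mean_position(rankings, 1))
```

A lower mean position means the method tends to appear nearer the top of the predicted ranking. This only summarizes the predicted order; it says nothing about actual performance.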


<br>

We get a list of tuples ordered from the predicted best to the predicted worst combo. Let's validate this by checking how these combos actually perform against the ground-truth labels.

In [4]:
from sklearn.metrics import f1_score, matthews_corrcoef

clfs_dict = {'KNN': KNN(), 'IForest': IForest(random_state=1234), 'PCA': PCA()}
thres_dict = {'KARCH': KARCH(), 'IQR': IQR(), 'CLF': CLF()}

for comb in rankings:
    
    clf = clfs_dict[comb[0]]
    clf.fit(X)
    
    scores = clf.decision_scores_ 
    
    thresh = thres_dict[comb[1]]
    thresh.fit(scores)
    
    fit_labels = thresh.labels_
    
    # How did the unsupervised task perform? Let's check the stats
    metric1 = round(f1_score(y, fit_labels), 2)
    metric2 = round(matthews_corrcoef(y, fit_labels), 2)
    
    print(f'\nThe f1 and mcc score of comb {comb} are {metric1} and {metric2} respectively')


The f1 and mcc score of comb ('PCA', 'KARCH') are 0.65 and 0.61 respectively

The f1 and mcc score of comb ('PCA', 'IQR') are 0.44 and 0.42 respectively

The f1 and mcc score of comb ('KNN', 'IQR') are 0.33 and 0.32 respectively

The f1 and mcc score of comb ('IForest', 'IQR') are 0.38 and 0.38 respectively

The f1 and mcc score of comb ('KNN', 'KARCH') are 0.35 and 0.27 respectively

The f1 and mcc score of comb ('PCA', 'CLF') are 0.34 and 0.35 respectively

The f1 and mcc score of comb ('IForest', 'KARCH') are 0.55 and 0.51 respectively

The f1 and mcc score of comb ('IForest', 'CLF') are 0.44 and 0.41 respectively

The f1 and mcc score of comb ('KNN', 'CLF') are 0.25 and 0.26 respectively


<br>

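The top-ranked combo, `('PCA', 'KARCH')`, was indeed the best performer, with the highest F1 and MCC scores. That's great! The rest of the ordering matches less closely, though: `('IForest', 'KARCH')` was ranked seventh but had the second-best scores. Keep in mind that `RANK` is still experimental and may not work well for all datasets and combos. It is an active area of research for `PyThresh`, and further upgrades are planned.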
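
To quantify how closely the predicted ordering tracks the measured scores, one option is a Spearman rank correlation between each combo's predicted position and its F1 score. The sketch below is an illustration, not part of `PyThresh`: it hard-codes the F1 values printed above, and `average_ranks` and `spearman` are hypothetical helper names introduced here.

```python
from math import sqrt

# F1 scores copied from the validation output above, in predicted order
# (best predicted combo first).
f1_in_predicted_order = [0.65, 0.44, 0.33, 0.38, 0.35, 0.34, 0.55, 0.44, 0.25]


def average_ranks(values, descending=True):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of two rank lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


predicted_ranks = list(range(1, len(f1_in_predicted_order) + 1))
actual_ranks = average_ranks(f1_in_predicted_order)

rho = spearman(predicted_ranks, actual_ranks)
print(f'Spearman correlation between predicted and actual ranks: {rho:.2f}')
```

A value near 1 means the predicted order matches the measured order, while a value near 0 means little agreement. This check needs ground-truth labels, so it is only possible on benchmark data such as this dataset, not in a truly unsupervised setting.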