# Flat

* Only the leaves are of interest
* A simple solution
* Loses the information carried by the data hierarchy (the parent-child relations between nodes)
* Works well as a baseline
![flat](images/flat.png)
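
As a minimal sketch of the flat setup, assume every sample's label is a root-to-leaf path stored as a tuple (an assumption based on the `labels` column of the dataset used in this notebook). The flat target is then simply the last element of the path. The helper name `to_flat_targets` is hypothetical.

```python
from typing import List, Sequence, Tuple


def to_flat_targets(paths: Sequence[Tuple[int, ...]]) -> List[int]:
    """Keep only the leaf (last node) of every root-to-leaf label path."""
    return [path[-1] for path in paths]


# Toy label paths: (level 1, level 2, leaf)
paths = [(2, 3, 21), (2, 3, 22), (17, 47, 48)]
flat_y = to_flat_targets(paths)
print(flat_y)  # [21, 22, 48]
```

Any ordinary multi-class model can be trained on `flat_y`; the intermediate nodes (2, 3, 17, 47 here) never reach the model, which is exactly the information loss mentioned above.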

# LCN - Local classifier per node
* We build a binary classifier for every node type, e.g. for Cats, Dogs, Pugs (children of Dog), Alley cats (children of Cat). The root is excluded (e.g. Animals, the parent of Cats, Dogs, etc.)
* Inconsistencies such as Cat-Pug can arise: a misclassification in which the sample is a cat at the parent level but a pug at the leaf level (whose actual parent is Dog)
* Many classifiers
* Little work is needed to set up the flow through the classifiers
![LCN](images/node.png)
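
The sketch below illustrates, on the toy animal hierarchy from the bullets, how per-node binary targets could be derived and how a Cat-Pug inconsistency could be detected. All names (`ROOT`, `PARENT`, `path_nodes`, `binary_targets`, `is_consistent`) are hypothetical and do not come from the notebook's `src` package.

```python
# Toy hierarchy: child -> parent. The root itself gets no classifier.
ROOT = "Animal"
PARENT = {"Cat": ROOT, "Dog": ROOT, "AlleyCat": "Cat", "Pug": "Dog"}


def path_nodes(leaf):
    """Return the leaf and all of its ancestors, excluding the root."""
    nodes = []
    node = leaf
    while node != ROOT:
        nodes.append(node)
        node = PARENT[node]
    return nodes


def binary_targets(leaf_labels):
    """One 0/1 target list per non-root node: 1 if the node lies on the sample's path."""
    return {
        node: [int(node in path_nodes(leaf)) for leaf in leaf_labels]
        for node in PARENT
    }


def is_consistent(predicted):
    """Consistent if the parent of every predicted node is the root or is also predicted."""
    return all(PARENT[n] == ROOT or PARENT[n] in predicted for n in predicted)


targets = binary_targets(["Pug", "AlleyCat", "Pug"])
print(targets["Dog"])                 # [1, 0, 1]
print(is_consistent({"Dog", "Pug"}))  # True
print(is_consistent({"Cat", "Pug"}))  # False: the Cat-Pug inconsistency
```

Each entry of `targets` would feed one binary classifier; because those classifiers are trained and queried independently, nothing prevents the inconsistent combination unless a check like `is_consistent` is applied afterwards.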

# LCPL - Local classifier per level
* A separate classifier is built for each level (e.g. one for {Cats, Dogs, Horses} and one for {Alley cats, Couch cats, Pugs, Huskies, Mustangs, Unicorns})
* Still suffers from Cat-Pug style inconsistencies
* Fairly intuitive and able to generalize
* Large, complex models :(
* Error propagation problem (how should an error at one level affect the level below?)
![LCPL](images/level.png)
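
A minimal sketch of the per-level idea, assuming every label is a path with exactly one node per level. The helpers `per_level_targets` and `valid_pairs` are hypothetical names used only for illustration.

```python
# Each label is a path with one node per level (toy data).
paths = [("Cat", "AlleyCat"), ("Dog", "Pug"), ("Dog", "Husky")]


def per_level_targets(paths):
    """Split the paths column-wise: one target list per hierarchy level."""
    return [list(level) for level in zip(*paths)]


def valid_pairs(paths):
    """Collect the (parent, child) pairs that occur in the training paths."""
    return {(p[i], p[i + 1]) for p in paths for i in range(len(p) - 1)}


levels = per_level_targets(paths)
print(levels[0])  # ['Cat', 'Dog', 'Dog']
print(levels[1])  # ['AlleyCat', 'Pug', 'Husky']

# The per-level classifiers predict independently, so their outputs need a check.
allowed = valid_pairs(paths)
print(("Dog", "Pug") in allowed)  # True
print(("Cat", "Pug") in allowed)  # False: an inconsistent Cat-Pug prediction
```

One classifier would be trained on each element of `levels`. Checking predicted pairs against `allowed` only detects an inconsistency; how to resolve it (trust the upper level, trust the lower level, or combine scores) is a separate design decision.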

# LCPN - Local classifier per parent node
* A multi-class classifier for every parent node (it chooses among that node's children)
* To avoid inconsistencies, a sample that has been assigned a parent's label can then be passed to the classifier that considers only that parent's children, e.g. {Parent: Cat, Children: "Alley cat, Couch cat"}. A Cat-Pug inconsistency cannot occur, because Pug is not among the children of "Cat"
* This does require building such a system of classifiers by hand (a bit of a cascade)
![LCPN](images/parent.png)
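
The cascade described above can be sketched as a top-down walk through the hierarchy. This is only an illustration: the "classifiers" are plain functions over a made-up feature dictionary, and `CHILDREN` and `predict_top_down` are hypothetical names, not part of the notebook's code.

```python
# Toy hierarchy: parent -> children. Leaves have no entry.
CHILDREN = {
    "Animal": ["Cat", "Dog"],
    "Cat": ["AlleyCat", "CouchCat"],
    "Dog": ["Pug", "Husky"],
}


def predict_top_down(x, classifiers, root="Animal"):
    """Walk from the root, letting each parent's classifier pick one of its own children."""
    node, path = root, []
    while node in CHILDREN:
        choice = classifiers[node](x)
        # A parent's classifier may only answer with one of its children.
        if choice not in CHILDREN[node]:
            raise ValueError(f"{choice!r} is not a child of {node!r}")
        path.append(choice)
        node = choice
    return path


# Stand-in classifiers: one per parent node, each restricted to that parent's children.
classifiers = {
    "Animal": lambda x: "Dog" if x["barks"] else "Cat",
    "Cat": lambda x: "AlleyCat" if x["outdoor"] else "CouchCat",
    "Dog": lambda x: "Pug" if x["small"] else "Husky",
}

dog_path = predict_top_down({"barks": True, "outdoor": False, "small": True}, classifiers)
print(dog_path)  # ['Dog', 'Pug']
```

Because the "Cat" classifier can only return "AlleyCat" or "CouchCat", a Cat-Pug path is impossible by construction. The price is that a mistake at the top level cannot be corrected further down.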

# Big-Bang (global classifier)
* One classifier to rule them all and in the darkness bind them.
* Relatively high complexity.
* All classes are predicted in a single pass
* Often hand-tailored to the data
* Pros and cons depend on the hand-picked model
* Faster inference
![Big-Bang](images/global.png)
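
There are many ways to build a global classifier. One possible target encoding, shown here purely as an assumption and not as the approach used in this notebook, is a multi-hot vector over all non-root nodes, so that a single model can output the whole path in one pass. `NODES`, `encode_path` and `decode` are hypothetical names.

```python
# All non-root nodes of a toy hierarchy, in a fixed order.
NODES = ["Cat", "Dog", "AlleyCat", "CouchCat", "Pug", "Husky"]
INDEX = {name: i for i, name in enumerate(NODES)}


def encode_path(path):
    """Multi-hot vector with a 1 for every node on the sample's path."""
    vec = [0] * len(NODES)
    for node in path:
        vec[INDEX[node]] = 1
    return vec


def decode(vec):
    """Recover the names of the nodes marked in a multi-hot vector."""
    return [name for name, bit in zip(NODES, vec) if bit]


y = encode_path(["Dog", "Pug"])
print(y)          # [0, 1, 0, 0, 1, 0]
print(decode(y))  # ['Dog', 'Pug']
```

A single multi-output model trained on such vectors sees every class at once, which is where the faster inference comes from. Whether its outputs respect the hierarchy depends entirely on the chosen model.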

# Sources:
https://towardsdatascience.com/https-medium-com-noa-weiss-the-hitchhikers-guide-to-hierarchical-classification-f8428ea1e076
https://towardsdatascience.com/hierarchical-classification-with-local-classifiers-down-the-rabbit-hole-21cdf3bd2382
https://towardsdatascience.com/hierarchical-classification-by-local-classifiers-your-must-know-tweaks-tricks-f7297702f8fc
https://towardsdatascience.com/hierarchical-performance-metrics-and-where-to-find-them-7090aaa07183


# Dataset - Imclef07a
I. Dimitrovski, D. Kocev, S. Loskovska, S. Dzeroski. Hierchical annotation of medical images. Proceedings of the 11th International Multiconference - Information Society IS 2008, pp. 174-181, 2008
[https://sites.google.com/site/hrsvmproject/datasets-hier]

In [2]:
%load_ext autoreload
%autoreload 2
# Extend the import path before importing the local `src` package.
import sys; sys.path.append('../')
from pathlib import Path
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn_hierarchical_classification.classifier import HierarchicalClassifier
from tqdm import tqdm
from src.data_loader import Data
from src.visualize import visualize_hierarchy
from src.experiment_runner import run_flat, run_LCPN, run_LCN, run_BigBang, run_LCPL
DATA_PATH = Path("./data/imclef07a")
#NAME = "imclef07a"
# Load the dataset (features plus hierarchical labels).
dataset = Data(DATA_PATH)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [3]:
dataset.records

| | labels | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (2, 3, 21) | 0.601321 | -0.282028 | 0.253823 | -0.163729 | -0.007874 | 0.010721 | -0.021936 | -0.131422 | -0.488617 | ... | 0.228252 | 0.211184 | 0.522663 | 0.276693 | 0.708575 | 0.175502 | 0.007736 | 0.440046 | -0.407414 | 0.166299 |
| 1 | (2, 3, 21) | 0.029893 | 0.003686 | 0.396680 | 0.121985 | -0.007874 | -0.132136 | 0.406635 | 0.011435 | 0.082812 | ... | 0.228252 | 0.354041 | 0.379806 | 0.276693 | 0.422860 | 0.318359 | -0.135121 | 0.154331 | 0.592586 | 0.166299 |
| 2 | (2, 3, 21) | -0.255821 | 0.003686 | 0.539537 | 0.407700 | -0.007874 | 0.010721 | 0.120921 | 0.011435 | 0.082812 | ... | 0.228252 | 0.211184 | 0.236949 | 0.419550 | 0.422860 | 0.175502 | 0.007736 | 0.582903 | 0.306872 | 0.023442 |
| 3 | (2, 3, 21) | -0.398679 | -0.282028 | 0.110965 | -0.163729 | 0.158792 | -0.132136 | -0.021936 | -0.131422 | 0.225669 | ... | -0.200319 | 0.211184 | 0.236949 | -0.294735 | -0.148568 | -0.395927 | 0.007736 | 0.725760 | -0.121700 | 0.023442 |
| 4 | (2, 3, 21) | 0.172750 | -0.139171 | 0.253823 | 0.121985 | -0.007874 | -0.417850 | 0.120921 | -0.559993 | -0.345759 | ... | 0.085395 | -0.074531 | 0.379806 | 0.419550 | -0.005711 | -0.538784 | -0.135121 | 0.011474 | -0.264557 | -0.119415 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11001 | (17, 47, 48) | 0.315607 | -0.282028 | -0.317606 | 0.407700 | -0.174541 | -0.417850 | -0.307650 | -0.559993 | 0.511383 | ... | -0.343176 | -0.074531 | 0.522663 | -0.437592 | -0.005711 | -0.538784 | -0.135121 | -0.274240 | -0.407414 | -0.119415 |
| 11002 | (17, 47, 48) | 0.315607 | -0.282028 | -0.460463 | 0.693414 | -0.007874 | -0.417850 | -0.307650 | -0.559993 | -0.202902 | ... | -0.343176 | -0.217388 | -0.334480 | -0.437592 | -0.148568 | -0.538784 | -0.135121 | -0.274240 | -0.407414 | -0.119415 |
| 11003 | (17, 47, 48) | 0.029893 | -0.282028 | -0.460463 | -0.163729 | -0.174541 | -0.417850 | -0.307650 | -0.559993 | -0.488617 | ... | -0.343176 | -0.217388 | -0.334480 | -0.437592 | -0.148568 | -0.538784 | -0.135121 | -0.274240 | -0.407414 | -0.119415 |
| 11004 | (17, 47, 48) | 0.172750 | -0.139171 | 0.539537 | 0.121985 | 0.158792 | -0.132136 | -0.307650 | -0.559993 | -0.488617 | ... | -0.343176 | -0.217388 | -0.334480 | -0.437592 | -0.148568 | -0.538784 | -0.135121 | -0.274240 | -0.407414 | -0.119415 |


# Flat - only the leaves are classified, using ordinary models

In [4]:
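results_flat = pd.DataFrame()
for clf in tqdm(["NN", "KNN", "RANDOM_FOREST"]):
    # NOTE: DataFrame.append was removed in pandas 2.0, so this cell needs pandas < 2.0 as written.
    results_flat = results_flat.append(run_flat(clf, Data(DATA_PATH)), ignore_index=True)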

100%|██████████| 3/3 [00:24<00:00,  8.07s/it]


In [6]:
results_flat.groupby('clf').mean()

| clf | h_fscore |
|---|---|
| KNN | 0.902958 |
| NN | 0.907686 |
| RANDOM_FOREST | 0.882694 |
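
The `h_fscore` values come from the notebook's own `src` code, which is not shown here. Assuming the column denotes the commonly used hierarchical F-measure (precision and recall computed over the sets of nodes on the true and predicted paths), a micro-averaged version could be sketched as follows. The function name `hierarchical_f1` is hypothetical.

```python
def hierarchical_f1(true_paths, pred_paths):
    """Micro-averaged hierarchical F1 over the sets of nodes on each path (root excluded)."""
    overlap = sum(len(set(t) & set(p)) for t, p in zip(true_paths, pred_paths))
    n_pred = sum(len(set(p)) for p in pred_paths)
    n_true = sum(len(set(t)) for t in true_paths)
    if overlap == 0:
        return 0.0
    h_precision = overlap / n_pred
    h_recall = overlap / n_true
    return 2 * h_precision * h_recall / (h_precision + h_recall)


true_paths = [(2, 3, 21), (17, 47, 48)]
pred_paths = [(2, 3, 22), (17, 47, 48)]  # first sample: right branch, wrong leaf
score = hierarchical_f1(true_paths, pred_paths)
print(round(score, 4))  # 0.8333
```

Unlike plain accuracy, which would give 0.5 on this toy example, the hierarchical score rewards the first prediction for getting two of the three path nodes right.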


# Hierarchical

## LCN

In [10]:
results_lcn = pd.DataFrame()
for clf in tqdm(["KNN", "RANDOM_FOREST"]):
    results_lcn = results_lcn.append(run_LCN(clf, Data(DATA_PATH)), ignore_index=True)

100%|██████████| 2/2 [04:40<00:00, 140.16s/it]


In [11]:
results_lcn.groupby('clf').mean()

| clf | h_fscore |
|---|---|
| KNN | 0.870512 |
| RANDOM_FOREST | 0.761675 |


## LCPL

In [12]:
results_lcpl = pd.DataFrame()
for clf in tqdm(["KNN", "RANDOM_FOREST"]):
    results_lcpl = results_lcpl.append(run_LCPL(clf, Data(DATA_PATH)), ignore_index=True)

100%|██████████| 2/2 [00:14<00:00,  7.43s/it]


In [13]:
results_lcpl.groupby('clf').mean()

| clf | h_fscore |
|---|---|
| KNN | 0.853037 |
| RANDOM_FOREST | 0.808084 |


## LCPN

In [14]:
results_lcpn = pd.DataFrame()
for clf in tqdm(["KNN", "RANDOM_FOREST"]):
    results_lcpn = results_lcpn.append(run_LCPN(clf, Data(DATA_PATH)), ignore_index=True)

100%|██████████| 2/2 [00:53<00:00, 26.94s/it]


In [15]:
results_lcpn.groupby('clf').mean()

| clf | h_fscore |
|---|---|
| KNN | 0.912106 |
| RANDOM_FOREST | 0.883185 |


## Big-Bang

In [16]:
results_bang = pd.DataFrame()
for clf in tqdm(["KNN", "RANDOM_FOREST"]):
    results_bang = results_bang.append(run_BigBang(clf, Data(DATA_PATH)), ignore_index=True)

100%|██████████| 2/2 [00:48<00:00, 24.04s/it]


In [17]:
results_bang.groupby('clf').mean()

| clf | h_fscore |
|---|---|
| KNN | 0.857632 |
| RANDOM_FOREST | 0.735282 |


In [18]:
final_res = pd.concat([results_bang, results_lcpn, results_lcpl, results_lcn, results_flat], axis=0)

In [21]:
final_res.sort_values(["h_fscore"],ascending=False)

| | h_fscore | clf | cls_type |
|---|---|---|---|
| 0 | 0.912106 | KNN | LCPN |
| 0 | 0.907686 | NN | FLAT |
| 1 | 0.902958 | KNN | FLAT |
| 1 | 0.883185 | RANDOM_FOREST | LCPN |
| 2 | 0.882694 | RANDOM_FOREST | FLAT |
| 0 | 0.870512 | KNN | LCN |
| 0 | 0.857632 | KNN | BIGBANG |
| 0 | 0.853037 | KNN | LCPL |
| 1 | 0.808084 | RANDOM_FOREST | LCPL |
| 1 | 0.761675 | RANDOM_FOREST | LCN |


In [20]:
final_res.to_csv("res.csv")