# 11. Finding a stable base classifier

## 11.1. Context

All the multilabel classifiers studied in this project require a base classifier, which is a regular binary classifier.

Throughout the development of this project, however, it was noted that some of these regular binary classifiers exhibit specific behaviors:
* The `SVC`, as shown in notebook `8` (in which I debugged why the basic stacking was no different from the regular Binary Relevance, in spite of using more features), is **apparently insensitive to additional features** (these features being the predicted labels) and also **insensitive to different random states**.
  * This makes it unsuitable, as the `SVC` ends up **discarding any information regarding label correlations**.
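  * Also, since changing the random state makes no difference, the base classifier is probably converging to the **same solution** on every run. Note that scikit-learn documents `random_state` as being ignored by `SVC` unless `probability=True`, so with the default settings this is the expected behavior.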
    * Perhaps tuning the parameters of the `SVC` could solve this. However, that is not the focus of this project. I much prefer a very basic base classifier that works out of the box.
* The `RandomForestClassifier` is **sensitive to the order of the columns** in the input dataset (as explained [here](https://github.com/scikit-learn/scikit-learn/issues/5394)).
  * This makes it unsuitable, as results obtained via `RandomForestClassifier` cannot be fully trusted.
  * The multilabel models being studied here are interesting, but none of them is _groundbreaking_. They represent **modest improvements** obtained by leveraging label correlations.
  * Since the improvements are small, we must be certain that they are _not_ related to the order of the columns in the input dataset.
  * The models based on `ClassifierChain` will **surely change the order of input columns** (at least for the columns that represent the predicted labels obtained from previous steps in the chain). Knowing that different orders will result in only modest improvements, it is important to have a base classifier that is not sensitive to the order of the columns.
* The k-nearest neighbors classifier (`KNeighborsClassifier`) might be a suitable base classifier, as it is usually fast to run. We just need to make sure that:
  * Different random states lead to different results.
  * It will not require specific tuning.
  * It is not sensitive to the order of the columns in the input dataset.

This notebook aims to find a suitable base classifier by testing and comparing `SVC`, `RandomForestClassifier` and `KNeighborsClassifier`.
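To make the `ClassifierChain` point above concrete, here is a minimal NumPy sketch. It is not the project's code: the helper name `chain_inputs` and the toy data are assumptions made for illustration, and true labels stand in for the predicted ones. It builds the input that the last classifier in a chain would see under two chain orders; the same columns are present, but in different positions.

```python
import numpy as np

def chain_inputs(X, Y, order):
    """Return the input matrix seen by the last classifier of a chain.

    Hypothetical helper: the original features are followed by the labels
    that precede the last label in `order`, kept in chain order.
    """
    previous_labels = Y[:, order[:-1]]
    return np.hstack([X, previous_labels])

rng = np.random.default_rng(0)
X = rng.random((4, 3))                 # 4 rows, 3 features
Y = rng.integers(0, 2, size=(4, 3))    # 4 rows, 3 labels

a = chain_inputs(X, Y, order=[0, 1, 2])  # label 2 sees labels 0, then 1
b = chain_inputs(X, Y, order=[1, 0, 2])  # label 2 sees labels 1, then 0

print(a.shape, b.shape)
# Same information, different column order: swapping the last two
# columns of `a` reproduces `b`.
print(np.array_equal(a[:, [0, 1, 2, 4, 3]], b))
```

A base classifier whose output depends on column order would treat `a` and `b` as different problems, even though they carry the same information.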


## 11.2. Setup

In [1]:
import os
import sys

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [2]:
from skmultilearn.problem_transform import BinaryRelevance
from skmultilearn.dataset import load_dataset
from sklearn.svm import SVC
from skmultilearn.base.problem_transformation import ProblemTransformationBase
from typing import List, Optional, Any, Tuple, Dict
import numpy as np
import sklearn.metrics as metrics
import json
import pandas as pd
from sklearn.feature_selection import f_classif
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
import copy

from metrics.evaluation import EvaluationPipeline
from sklearn.neighbors import KNeighborsClassifier

from lib.base_models import StackedGeneralization, DependantBinaryRelevance

## 11.3. Datasets

In [3]:
desired_datasets = ["scene", "emotions", "birds"]

datasets = {}
for dataset_name in desired_datasets:
    print(f"getting dataset `{dataset_name}`")
    
    full_dataset = load_dataset(dataset_name, "undivided")
    X, y, _, _ = full_dataset

    train_dataset = load_dataset(dataset_name, "train")
    X_train, y_train, _, _ = train_dataset

    test_dataset = load_dataset(dataset_name, "test")
    X_test, y_test, _, _ = test_dataset

    datasets[dataset_name] = {
        "X": X,
        "y": y,
        "X_train": X_train,
        "y_train": y_train,
        "X_test": X_test,
        "y_test": y_test,
        "rows": X.shape[0],
        "labels_count": y.shape[1]
    }


for name, info in datasets.items():
    print("===")
    print(f"information for dataset `{name}`")
    print(f"rows: {info['rows']}, labels: {info['labels_count']}")


getting dataset `scene`
scene:undivided - exists, not redownloading
scene:train - exists, not redownloading
scene:test - exists, not redownloading
getting dataset `emotions`
emotions:undivided - exists, not redownloading
emotions:train - exists, not redownloading
emotions:test - exists, not redownloading
getting dataset `birds`
birds:undivided - exists, not redownloading
birds:train - exists, not redownloading
birds:test - exists, not redownloading
===
information for dataset `scene`
rows: 2407, labels: 6
===
information for dataset `emotions`
rows: 593, labels: 6
===
information for dataset `birds`
rows: 645, labels: 19


In [4]:
X_train = datasets["scene"]["X_train"]
X_test = datasets["scene"]["X_test"]

shuffled_order = np.random.permutation(X_train.shape[1])
shuffled_X_train = X_train[:, shuffled_order]
shuffled_X_test = X_test[:, shuffled_order]

y_train = datasets["scene"]["y_train"]
y_test = datasets["scene"]["y_test"]

display(X_train.todense())
display(shuffled_X_train.todense())

matrix([[0.646467, 0.666435, 0.685047, ..., 0.247298, 0.014025, 0.029709],
        [0.770156, 0.767255, 0.761053, ..., 0.137833, 0.082672, 0.03632 ],
        [0.793984, 0.772096, 0.76182 , ..., 0.051125, 0.112506, 0.083924],
        ...,
        [0.85639 , 1.      , 1.      , ..., 0.019464, 0.022167, 0.043738],
        [0.805592, 0.80417 , 0.811438, ..., 0.346736, 0.231481, 0.332623],
        [0.855064, 0.858896, 0.911177, ..., 0.262119, 0.104471, 0.34728 ]])

matrix([[1.22000e-04, 1.44418e-01, 9.40810e-02, ..., 6.85047e-01,
         2.36605e-01, 4.77202e-01],
        [1.60470e-02, 4.07510e-02, 2.85809e-01, ..., 7.61053e-01,
         3.66136e-01, 4.49931e-01],
        [7.93500e-03, 1.31860e-02, 2.91597e-01, ..., 7.61820e-01,
         1.96590e-01, 4.50110e-01],
        ...,
        [1.73600e-02, 8.09100e-03, 1.03240e-01, ..., 1.00000e+00,
         4.73331e-01, 3.59929e-01],
        [1.66000e-04, 3.98600e-03, 4.34940e-02, ..., 8.11438e-01,
         4.00372e-01, 3.29456e-01],
        [9.20880e-02, 1.26706e-01, 1.35640e-02, ..., 9.11177e-01,
         6.31460e-02, 4.25771e-01]])
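The permutation above is drawn without a seed, so each run of the notebook shuffles the columns differently. If reproducibility matters, a seeded generator can be used instead. The sketch below uses a small dense array and an arbitrary seed (both assumptions). It also checks that the permutation only reorders columns, since applying its inverse restores the original matrix.

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed, chosen for illustration
X_demo = np.arange(12, dtype=float).reshape(3, 4)   # toy stand-in for X_train

order = rng.permutation(X_demo.shape[1])
X_shuffled = X_demo[:, order]

# argsort of a permutation gives its inverse
inverse = np.argsort(order)
restored = X_shuffled[:, inverse]

print(sorted(order.tolist()))            # every column index appears once
print(np.array_equal(restored, X_demo))  # shuffling loses no information
```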

## 11.4. Procedures

In [21]:
def _fit_and_report(model, X_fit, X_eval):
    """Wrap `model` in Binary Relevance, fit it, and print the test metrics."""
    br_model = BinaryRelevance(
        classifier=model,
        require_dense=[False, True]
    )

    br_model.fit(X_fit, y_train)
    predictions = br_model.predict(X_eval)

    print("accuracy")
    print(metrics.accuracy_score(y_test, predictions))

    print("hamming loss")
    print(metrics.hamming_loss(y_test, predictions))

    print("f1 score")
    print(metrics.f1_score(y_test, predictions, average="macro"))


def run_regular_order(model):
    """Evaluate `model` on the original column order."""
    _fit_and_report(model, X_train, X_test)


def run_shuffled_order(model):
    """Evaluate `model` on the shuffled column order."""
    _fit_and_report(model, shuffled_X_train, shuffled_X_test)
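For multilabel targets, `accuracy_score` reports subset accuracy (a row counts as correct only if every label matches), while `hamming_loss` is the fraction of individual label assignments that are wrong. The toy arrays below are made up for illustration; the NumPy expressions mirror those two definitions.

```python
import numpy as np

y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 1],    # all three labels correct
                   [0, 1, 1],    # one label wrong
                   [1, 1, 0],    # all three labels correct
                   [1, 0, 0]])   # two labels wrong

subset_accuracy = np.all(y_true == y_pred, axis=1).mean()
hamming = (y_true != y_pred).mean()

print(subset_accuracy)  # 2 of 4 rows fully correct -> 0.5
print(hamming)          # 3 of 12 label assignments wrong -> 0.25
```

Because the two metrics count different things, a model can gain subset accuracy while its Hamming loss stays flat or gets worse.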

## 11.5. Testing

### 11.5.1. KNN

Note: `KNeighborsClassifier` has **no** `random_state` parameter, so we will only test the ordering of the columns.

In [22]:
run_regular_order(KNeighborsClassifier())

accuracy
0.596989966555184
hamming loss
0.10451505016722408
f1 score
0.6809836443612469


In [23]:
run_shuffled_order(KNeighborsClassifier())

accuracy
0.596989966555184
hamming loss
0.10451505016722408
f1 score
0.6809836443612469
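The identical scores are what one would expect from a classifier that relies on the Euclidean distance (the default for `KNeighborsClassifier`, Minkowski with `p=2`): a sum of squared per-feature differences does not depend on the order of its terms. A minimal NumPy check on random data follows; the toy data, the seed and the value of `k` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.random((50, 8))
query = rng.random(8)
order = rng.permutation(train.shape[1])

# Euclidean distance from the query to every training row
d_original = np.linalg.norm(train - query, axis=1)
d_shuffled = np.linalg.norm(train[:, order] - query[order], axis=1)

k = 5
neighbors_original = np.argsort(d_original)[:k]
neighbors_shuffled = np.argsort(d_shuffled)[:k]

print(np.allclose(d_original, d_shuffled))
print(np.array_equal(neighbors_original, neighbors_shuffled))
```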


### 11.5.2. Random Forest Classifier

In [32]:
run_regular_order(RandomForestClassifier(random_state=42))

accuracy
0.5367892976588629
hamming loss
0.08974358974358974
f1 score
0.684223851314326


In [33]:
run_regular_order(RandomForestClassifier(random_state=123))

accuracy
0.5225752508361204
hamming loss
0.09211259754738016
f1 score
0.673047351701591


In [34]:
run_shuffled_order(RandomForestClassifier(random_state=42))

accuracy
0.5359531772575251
hamming loss
0.09044035674470458
f1 score
0.682232461024746


In [35]:
run_shuffled_order(RandomForestClassifier(random_state=123))

accuracy
0.5367892976588629
hamming loss
0.09002229654403568
f1 score
0.6842674712540301
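One plausible explanation (an assumption here, not something verified in this notebook) is that each split considers a random subset of feature *positions*. With a fixed seed the same positions are drawn, but after shuffling those positions hold different features, so the trees can differ. The sketch below imitates that draw with NumPy only. The feature names and the subset size are made up for illustration, and this is not scikit-learn's actual splitter.

```python
import numpy as np

features = np.array(["f0", "f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8"])
shuffled = features[np.random.default_rng(7).permutation(len(features))]

def draw_candidates(names, seed, n_candidates=3):
    """Pick `n_candidates` column positions with a seeded generator."""
    rng = np.random.default_rng(seed)
    positions = rng.choice(len(names), size=n_candidates, replace=False)
    return positions, names[positions]

pos_a, picked_a = draw_candidates(features, seed=42)
pos_b, picked_b = draw_candidates(shuffled, seed=42)

# The positions are identical because the seed is the same; the features
# found at those positions depend on the column order.
print(pos_a, picked_a)
print(pos_b, picked_b)
```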


### 11.5.3. SVC

In [36]:
run_regular_order(SVC(random_state=42))

accuracy
0.5869565217391305
hamming loss
0.08416945373467112
f1 score
0.7237789962754925


In [37]:
run_regular_order(SVC(random_state=123))

accuracy
0.5869565217391305
hamming loss
0.08416945373467112
f1 score
0.7237789962754925


In [38]:
run_shuffled_order(SVC(random_state=42))

accuracy
0.5869565217391305
hamming loss
0.08416945373467112
f1 score
0.7237789962754925


In [39]:
run_shuffled_order(SVC(random_state=123))

accuracy
0.5869565217391305
hamming loss
0.08416945373467112
f1 score
0.7237789962754925
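These identical scores are consistent with the scikit-learn documentation, which states that the `random_state` of `SVC` is only used to shuffle the data for probability estimates and is ignored when `probability=False` (the default). A small check on synthetic data follows; the dataset from `make_classification` and its parameters are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, random_state=0
)

pred_42 = SVC(random_state=42).fit(X_demo, y_demo).predict(X_demo)
pred_123 = SVC(random_state=123).fit(X_demo, y_demo).predict(X_demo)

# With probability=False the seed plays no role in fitting
print(np.array_equal(pred_42, pred_123))
```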


## 11.6. Notes so far

* `SVC` is **not** affected by the order of the columns, but it is also **not** affected by the random state, which is bad.
* `RandomForestClassifier` _is_ affected by the order of the columns, which is bad: with `random_state=42`, shuffling moved the accuracy from 0.5368 to 0.5360, and with `random_state=123` from 0.5226 to 0.5368.
* `KNN` is **not** affected by the order of the columns. We cannot test the random state, so let's run one final test inspired by notebook `8`.

## 11.7. Testing the effects of using more features on KNN

We must be certain that KNN is sensitive to the added label correlation information.

In [5]:
baseline_binary_relevance_model = BinaryRelevance(
    classifier=KNeighborsClassifier(),
    require_dense=[False, True]
)

basic_stacking_model = StackedGeneralization(
    classifier=KNeighborsClassifier(),
)

dbr = DependantBinaryRelevance(
    classifier=KNeighborsClassifier(),
)

models = {
    "baseline_binary_relevance_model": baseline_binary_relevance_model,
    "basic_stacking_model": basic_stacking_model,
    "dependant_binary_relevance": dbr,
}

In [7]:
for model_name, model in models.items():
    print(f"# running model `{model_name}`")

    model.fit(X_train, y_train)

    predictions = model.predict(X_test)

    print("accuracy")
    print(metrics.accuracy_score(y_test, predictions))

    print("hamming loss")
    print(metrics.hamming_loss(y_test, predictions))

    print("f1 score")
    print(metrics.f1_score(y_test, predictions, average="macro"))

    print()


# running model `baseline_binary_relevance_model`
accuracy
0.596989966555184
hamming loss
0.10451505016722408
f1 score
0.6809836443612469

# running model `basic_stacking_model`
FIT: X shape is (1211, 294)
FIT: X_extended shape is (1211, 300)
accuracy
0.6078595317725752
hamming loss
0.10451505016722408
f1 score
0.6860946721513695

# running model `dependant_binary_relevance`
FIT: X shape is (1211, 294)
FIT: X_extended shape, for label 0, is (1211, 299)
FIT: X_extended shape, for label 1, is (1211, 299)
FIT: X_extended shape, for label 2, is (1211, 299)
FIT: X_extended shape, for label 3, is (1211, 299)
FIT: X_extended shape, for label 4, is (1211, 299)
FIT: X_extended shape, for label 5, is (1211, 299)
accuracy
0.6137123745819398
hamming loss
0.11259754738015608
f1 score
0.6913398386498302
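The shapes printed in the log match a simple picture of the two models, sketched below with NumPy. This is an assumption about what `StackedGeneralization` and `DependantBinaryRelevance` do internally, since their source is not shown here: stacking appends all 6 label columns to the 294 features, while dependent binary relevance appends every label except the one being predicted. Random arrays stand in for the real data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_features, n_labels = 1211, 294, 6
X_demo = rng.random((n_rows, n_features))
Y_demo = rng.integers(0, 2, size=(n_rows, n_labels))

# Stacking: features + all label columns
X_stacked = np.hstack([X_demo, Y_demo])
print(X_stacked.shape)

# Dependent binary relevance: features + all labels except the target one
dbr_shapes = []
for label in range(n_labels):
    others = np.delete(Y_demo, label, axis=1)
    dbr_shapes.append(np.hstack([X_demo, others]).shape)
print(dbr_shapes[0])
```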



## 11.8. Conclusion

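The `KNN` base classifier also proved to be sensitive to extra features, as using label correlations changed the performance metrics. Accuracy and macro F1 improved for both `basic_stacking_model` and `dependant_binary_relevance`, while the Hamming loss stayed the same for stacking and got slightly worse for `dependant_binary_relevance` (0.1045 to 0.1126). The sensitivity itself is what we were looking for. This is great!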
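The size and direction of each change can be read off the scores printed in section 11.7. The snippet below only restates those numbers (rounded to four decimals) and subtracts the baseline. Keep in mind that a lower Hamming loss is better, whereas higher accuracy and F1 are better.

```python
# Scores copied from the output in section 11.7, rounded to 4 decimals
scores = {
    "baseline_binary_relevance_model": {"accuracy": 0.5970, "hamming_loss": 0.1045, "f1": 0.6810},
    "basic_stacking_model": {"accuracy": 0.6079, "hamming_loss": 0.1045, "f1": 0.6861},
    "dependant_binary_relevance": {"accuracy": 0.6137, "hamming_loss": 0.1126, "f1": 0.6913},
}

baseline = scores["baseline_binary_relevance_model"]
deltas = {
    name: {metric: round(value - baseline[metric], 4) for metric, value in result.items()}
    for name, result in scores.items()
    if name != "baseline_binary_relevance_model"
}

for name, delta in deltas.items():
    print(name, delta)
```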

**It will therefore be chosen as the base classifier for the multilabel models**.