# 3. Evaluation pipeline

The initial results for the second model are mixed at best and, in general, quite bad. It is hard to understand why it performs so poorly without a proper evaluation pipeline, one that uses cross-validation to mitigate the randomness of a single train-test split.

## 3.1. Setup

In [26]:
from sklearn.model_selection import KFold, cross_validate
from skmultilearn.dataset import load_dataset
from skmultilearn.problem_transform import ClassifierChain
from sklearn.svm import SVC
import sklearn.metrics as metrics
from typing import Any, Dict


## 3.2. Experimentation

In [3]:
full_data = load_dataset("scene", "undivided")
train_data = load_dataset("scene", "train")
test_data = load_dataset("scene", "test")


scene:undivided - exists, not redownloading
scene:train - exists, not redownloading
scene:test - exists, not redownloading


In [4]:
display(train_data[:2])
display(test_data[:2])
display(full_data[:2])


(<1211x294 sparse matrix of type '<class 'numpy.float64'>'
 	with 351805 stored elements in List of Lists format>,
 <1211x6 sparse matrix of type '<class 'numpy.int64'>'
 	with 1286 stored elements in List of Lists format>)

(<1196x294 sparse matrix of type '<class 'numpy.float64'>'
 	with 347724 stored elements in List of Lists format>,
 <1196x6 sparse matrix of type '<class 'numpy.int64'>'
 	with 1299 stored elements in List of Lists format>)

(<2407x294 sparse matrix of type '<class 'numpy.float64'>'
 	with 699529 stored elements in List of Lists format>,
 <2407x6 sparse matrix of type '<class 'numpy.int64'>'
 	with 2585 stored elements in List of Lists format>)

Just a side note here: this dataset is quite small (2,407 samples in total). That might explain the wild results we were getting with the second model.

Anyway, let's keep going forward with the evaluation pipeline.

In [5]:
X_full, y_full, _, _ = full_data

In [6]:
dense_X_full = X_full.toarray()
dense_y_full = y_full.toarray()


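# Plain (unstratified) K-fold: 5 contiguous folds, no shuffling by default.
kf = KFold(n_splits=5)
for i, (train_index, test_index) in enumerate(kf.split(dense_X_full, dense_y_full)):
    # Slice the dense arrays into this fold's train and test partitions.
    X_train = dense_X_full[train_index]
    y_train = dense_y_full[train_index]

    X_test = dense_X_full[test_index]
    y_test = dense_y_full[test_index]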


**Note**: `StratifiedKFold` **cannot** be used here because it does not support multilabel targets ([source](https://stackoverflow.com/questions/48508036/sklearn-stratifiedkfold-valueerror-supported-target-types-are-binary-mul)).
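
As a quick check of the point above, here is a minimal sketch on made-up toy data (not the scene dataset). It shows `StratifiedKFold` rejecting a 2-D label-indicator target, while plain `KFold` works because it never looks at `y`. The `shuffle=True` and `random_state=0` settings are my own choice for illustration; the loop earlier in this notebook uses the unshuffled default.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy multilabel data (made up for illustration): 10 samples, 3 binary labels.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(10, 4))
y_toy = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0],
                  [0, 1, 1], [1, 1, 1], [0, 0, 0], [1, 0, 1], [0, 1, 0]])

# StratifiedKFold only accepts binary or multiclass targets,
# so a 2-D label-indicator matrix raises a ValueError.
try:
    list(StratifiedKFold(n_splits=5).split(X_toy, y_toy))
    stratified_failed = False
except ValueError as err:
    stratified_failed = True
    print("StratifiedKFold raised:", err)

# KFold splits on sample indices only; shuffle=True randomises fold membership.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
folds = [test_idx for _, test_idx in kf.split(X_toy)]
print([fold.tolist() for fold in folds])
```

Shuffling does not stratify anything; it only avoids folds made of contiguous blocks of rows, which can matter if the dataset happens to be stored in some meaningful order.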

Instead of setting everything up by hand, let's use something that scikit-learn already provides: the `cross_validate` function, as described [here](https://scikit-learn.org/stable/modules/cross_validation.html#the-cross-validate-function-and-multiple-metric-evaluation).
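
To get a feel for what `cross_validate` returns before running it on the real model, here is a small sketch that depends only on scikit-learn. The synthetic data from `make_multilabel_classification` and the `KNeighborsClassifier` are stand-ins I picked for illustration; they are not the dataset or the model used in this work.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
import sklearn.metrics as metrics

# Synthetic stand-in data (not the scene dataset): 200 samples, 4 labels.
X_demo, y_demo = make_multilabel_classification(
    n_samples=200, n_features=10, n_classes=4, random_state=0)

# One scorer per metric; the dict keys become suffixes of the result keys.
scoring = {
    "accuracy": metrics.make_scorer(metrics.accuracy_score),
    "hamming_loss": metrics.make_scorer(
        metrics.hamming_loss, greater_is_better=False),
}

demo_result = cross_validate(
    KNeighborsClassifier(n_neighbors=5),  # handles multilabel targets natively
    X_demo, y_demo,
    cv=5,
    scoring=scoring,
    return_train_score=True,
)

# The result is a dict of arrays, one entry per fold.
print(sorted(demo_result.keys()))
```

Each key maps to an array with one value per fold: timings under `fit_time` and `score_time`, plus a `test_<name>` and `train_<name>` entry for every scorer in the dict.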

In [7]:
clf = ClassifierChain(
    classifier=SVC(),
    require_dense=[False, True]
)

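# metrics.make_scorer wraps a metric function into a scorer;
# greater_is_better=False makes the scorer return the negated loss.
s = {"accuracy": metrics.make_scorer(metrics.accuracy_score),
     "hamming_loss": metrics.make_scorer(metrics.hamming_loss, greater_is_better=False)}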

r = cross_validate(clf, X_full, y_full, cv=5, scoring=s, return_train_score=True)
r


In [8]:
r["test_accuracy"].mean(), r["test_accuracy"].std()

(0.6319200144926287, 0.06657267886278025)

In [9]:
r["test_hamming_loss"].mean(), r["test_hamming_loss"].std()

(-0.11195857523658352, 0.023752236934311968)
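
The negative Hamming loss above is expected: a scorer built with `greater_is_better=False` reports the negated loss so that "higher is better" holds for every scorer. The sketch below works both metrics out by hand with NumPy on two made-up samples, to make clear what each number means for multilabel targets.

```python
import numpy as np

# Two samples, three labels each (values made up for illustration).
y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0]])

# Subset accuracy: a sample counts only if every one of its labels matches.
subset_accuracy = np.all(y_true == y_pred, axis=1).mean()

# Hamming loss: fraction of individual label slots that are wrong.
hamming = (y_true != y_pred).mean()

# What a scorer built with greater_is_better=False would report.
reported = -hamming

print(subset_accuracy, round(hamming, 4), round(reported, 4))
```

Here one of the two samples is predicted exactly (subset accuracy 1/2) and one of the six label slots is wrong (Hamming loss 1/6). Subset accuracy is the stricter of the two, which is part of why the accuracy figures in this notebook look so much worse than the Hamming loss.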

In [10]:
# quick testing, using the proposed order based on F-test

clf2 = ClassifierChain(
    classifier=SVC(),
    require_dense=[False, True],
    order=[3, 0, 2, 5, 4, 1]
)

s2 = {"accuracy": metrics.make_scorer(metrics.accuracy_score),
      "hamming_loss": metrics.make_scorer(metrics.hamming_loss, greater_is_better=False)}

r2 = cross_validate(clf2, X_full, y_full, cv=5,
                   scoring=s2, return_train_score=True)
r2

{'fit_time': array([4.06053257, 4.29102755, 3.58891082, 4.0080204 , 3.63190055]),
 'score_time': array([0.65458941, 0.65905857, 0.6446743 , 0.67406464, 0.62596703]),
 'test_accuracy': array([0.1473029 , 0.39626556, 0.05613306, 0.28274428, 0.3035343 ]),
 'train_accuracy': array([0.31324675, 0.20727273, 0.3556594 , 0.26272066, 0.27050883]),
 'test_hamming_loss': array([-0.28284924, -0.20020747, -0.31185031, -0.23354123, -0.22903673]),
 'train_hamming_loss': array([-0.22995671, -0.26554113, -0.21581862, -0.25034614, -0.24575978])}

OK, the initial results here show that the proposed Classifier Chain is indeed **consistently bad**. Regardless, I now have a better idea of how to set up this evaluation pipeline, which I should finish tomorrow.

## 3.3. Evaluation pipeline implementation

The idea is to have a class with a `run()` method that applies the same set of parameters and metrics to every model in this work.

For now, this pipeline is defined directly in this notebook, but I plan to move it to a `.py` file of its own so it can be reused in other notebooks.

In [21]:
class EvaluationPipelineResult:
    cross_validate_result: Dict[Any, Any]

    def __init__(self, cross_validate_result: Dict[Any, Any]) -> None:
        self.cross_validate_result = cross_validate_result
    
    def describe(self) -> None:
        print("Accuracy: {:.4f} ± {:.2f}".format(
            self.cross_validate_result["test_accuracy"].mean(),
            self.cross_validate_result["test_accuracy"].std()
        ))

        print("Hamming Loss: {:.4f} ± {:.2f}".format(
            self.cross_validate_result["test_hamming_loss"].mean(),
            self.cross_validate_result["test_hamming_loss"].std()
        ))
    
    def raw(self) -> Dict[Any, Any]:
        return self.cross_validate_result

class EvaluationPipeline:
    model: Any
    n_folds: int

    def __init__(self, model: Any, n_folds: int = 5) -> None:
        # TODO: establish the model type
        self.model = model
        self.n_folds = n_folds

    def run(self, X: Any, y: Any) -> EvaluationPipelineResult:
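        # accuracy_score on multilabel targets is subset accuracy (exact match).
        accuracy_scorer = metrics.make_scorer(metrics.accuracy_score)
        # greater_is_better=False makes the scorer return the negated loss.
        hamming_loss_scorer = metrics.make_scorer(
            metrics.hamming_loss, greater_is_better=False)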

        scoring_set = {
            "accuracy": accuracy_scorer,
            "hamming_loss": hamming_loss_scorer,
        }

        validate_result = cross_validate(
            self.model,
            X, y,
            cv=self.n_folds,
            scoring=scoring_set,
            return_train_score=True)

        return EvaluationPipelineResult(validate_result)


In [24]:
clf3 = ClassifierChain(
    classifier=SVC(),
    require_dense=[False, True],
    order=[1, 3, 2, 5, 4, 0]  # arbitrary label order
)

evaluation_pipeline = EvaluationPipeline(clf3)
result = evaluation_pipeline.run(X_full, y_full)
result.describe()

Accuracy: 0.2563 ± 0.12
Hamming Loss: -0.2482 ± 0.04


In [25]:
result.raw()

{'fit_time': array([4.27550077, 4.38973761, 3.88086176, 3.95994496, 3.68254423]),
 'score_time': array([0.73751068, 0.69558883, 0.63721609, 0.70021701, 0.63118529]),
 'test_accuracy': array([0.13900415, 0.38381743, 0.1039501 , 0.28274428, 0.37214137]),
 'train_accuracy': array([0.3225974 , 0.22597403, 0.36137072, 0.28089304, 0.28504673]),
 'test_hamming_loss': array([-0.28181189, -0.20746888, -0.3000693 , -0.23769924, -0.21413721]),
 'train_hamming_loss': array([-0.22649351, -0.2578355 , -0.21183801, -0.24048114, -0.23901004])}
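
The Hamming loss values are negative because the scorer negates the loss. A possible way to report them on their natural scale is sketched below, using the fold scores copied from the raw output above. The `summarize` helper and its `negated` flag are hypothetical names for illustration; they are not part of the pipeline class as currently written.

```python
import numpy as np

# Fold scores copied from the raw cross_validate output above.
test_accuracy = np.array(
    [0.13900415, 0.38381743, 0.1039501, 0.28274428, 0.37214137])
test_hamming_loss = np.array(
    [-0.28181189, -0.20746888, -0.3000693, -0.23769924, -0.21413721])


def summarize(scores: np.ndarray, negated: bool = False) -> str:
    """Format mean ± std, undoing the scorer's sign flip when negated=True."""
    values = -scores if negated else scores
    return "{:.4f} ± {:.2f}".format(values.mean(), values.std())


print("Accuracy:", summarize(test_accuracy))
print("Hamming Loss:", summarize(test_hamming_loss, negated=True))
```

The standard deviation is unaffected by the sign flip, so only the mean changes from the figure printed by `describe()`.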