# 3. Evaluation pipeline

The initial results for the second model are mixed at best; in general, they are quite bad. It is hard to understand why it performs so poorly without a proper evaluation pipeline, which should use cross-validation to mitigate the randomness of a single train-test split.

## 3.1. Setup

In [2]:
from sklearn.model_selection import KFold, cross_validate
from skmultilearn.dataset import load_dataset
from skmultilearn.problem_transform import ClassifierChain
from sklearn.svm import SVC
import sklearn.metrics as metrics


## 3.2. Experimentation

In [3]:
full_data = load_dataset("scene", "undivided")
train_data = load_dataset("scene", "train")
test_data = load_dataset("scene", "test")


scene:undivided - exists, not redownloading
scene:train - exists, not redownloading
scene:test - exists, not redownloading


In [4]:
display(train_data[:2])
display(test_data[:2])
display(full_data[:2])


(<1211x294 sparse matrix of type '<class 'numpy.float64'>'
 	with 351805 stored elements in List of Lists format>,
 <1211x6 sparse matrix of type '<class 'numpy.int64'>'
 	with 1286 stored elements in List of Lists format>)

(<1196x294 sparse matrix of type '<class 'numpy.float64'>'
 	with 347724 stored elements in List of Lists format>,
 <1196x6 sparse matrix of type '<class 'numpy.int64'>'
 	with 1299 stored elements in List of Lists format>)

(<2407x294 sparse matrix of type '<class 'numpy.float64'>'
 	with 699529 stored elements in List of Lists format>,
 <2407x6 sparse matrix of type '<class 'numpy.int64'>'
 	with 2585 stored elements in List of Lists format>)

Just a side note here: this dataset is quite small (2,407 instances in total, split into 1,211 for training and 1,196 for testing). That might explain the wild results we were getting with the second model.
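The shapes and stored-element counts printed above also allow a quick sanity check of the dataset. The sketch below only redoes the arithmetic on those printed numbers; it assumes that every stored element is a nonzero entry, which I have not verified for this `lil_matrix`.

```python
# Shapes and stored-element counts copied from the cell output above.
n_samples, n_features, n_labels = 2407, 294, 6
stored_X, stored_y = 699529, 2585

# Assumption: every stored element is a nonzero entry (no explicit zeros).
feature_density = stored_X / (n_samples * n_features)
label_cardinality = stored_y / n_samples       # average labels per instance
label_density = label_cardinality / n_labels   # fraction of active label slots

# The train and test parts should add up to the undivided dataset.
assert 1211 + 1196 == n_samples
assert 351805 + 347724 == stored_X
assert 1286 + 1299 == stored_y

print(f"feature density:   {feature_density:.3f}")
print(f"label cardinality: {label_cardinality:.3f}")
print(f"label density:     {label_density:.3f}")
```

Under that assumption, the feature matrix is almost fully dense (about 0.99) and each instance carries only about 1.07 labels on average, so the problem is close to single-label.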

Anyway, let's keep going forward with the evaluation pipeline.

In [5]:
X_full, y_full, _, _ = full_data

In [6]:
dense_X_full = X_full.toarray()
dense_y_full = y_full.toarray()


# Plain (unstratified, unshuffled) 5-fold split over the dense arrays
kf = KFold(n_splits=5)
for i, (train_index, test_index) in enumerate(kf.split(dense_X_full, dense_y_full)):
    X_train = dense_X_full[train_index]
    y_train = dense_y_full[train_index]

    X_test = dense_X_full[test_index]
    y_test = dense_y_full[test_index]
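For intuition about what this loop iterates over: without `shuffle=True`, `KFold` cuts the data into consecutive blocks, and the first `n_samples % n_splits` folds get one extra sample. The helper below, `contiguous_folds`, is a hypothetical name written only for illustration; it mirrors my reading of the documented behaviour and is not scikit-learn's actual code.

```python
def contiguous_folds(n_samples, n_splits):
    """Yield (train_indices, test_indices) for unshuffled k-fold splitting."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds are one sample larger than the rest.
        size = base + 1 if fold < extra else base
        stop = start + size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


# Same numbers as the notebook: 2407 samples, 5 folds.
fold_sizes = [len(test) for _, test in contiguous_folds(2407, 5)]
print(fold_sizes)  # [482, 482, 481, 481, 481]
```

Because the blocks are consecutive, any ordering in the dataset (for example, rows grouped by label) carries over into the folds. Passing `shuffle=True` together with a `random_state` to `KFold` is the usual way to avoid that.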


**Note**: `StratifiedKFold` **cannot** be used here, as it does not support multilabel targets ([source](https://stackoverflow.com/questions/48508036/sklearn-stratifiedkfold-valueerror-supported-target-types-are-binary-mul)).

Instead of trying to set everything up myself, let's use something that scikit-learn already provides: the `cross_validate` function, as described [here](https://scikit-learn.org/stable/modules/cross_validation.html#the-cross-validate-function-and-multiple-metric-evaluation).

In [7]:
clf = ClassifierChain(
    classifier=SVC(),
    require_dense=[False, True]
)

# make_scorer lives in sklearn.metrics, which is imported above as `metrics`
s = {"accuracy": metrics.make_scorer(metrics.accuracy_score),
     "hamming_loss": metrics.make_scorer(metrics.hamming_loss, greater_is_better=False)}

r = cross_validate(clf, X_full, y_full, cv=5, scoring=s, return_train_score=True)
r


In [8]:
r["test_accuracy"].mean(), r["test_accuracy"].std()

(0.6319200144926287, 0.06657267886278025)

In [9]:
r["test_hamming_loss"].mean(), r["test_hamming_loss"].std()

(-0.11195857523658352, 0.023752236934311968)
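The negative value is not a bug: a scorer built with `greater_is_better=False` flips the sign so that higher is always better, which means the actual Hamming loss here is about 0.112. Also, for multilabel targets `accuracy_score` is the subset accuracy (the whole label vector has to match). The sketch below reimplements both metrics in plain Python on made-up toy data, just to make the two definitions and the sign flip concrete; the function names are mine, not scikit-learn's.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose whole label vector is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted wrongly."""
    wrong = sum(a != b for t, p in zip(y_true, y_pred) for a, b in zip(t, p))
    return wrong / (len(y_true) * len(y_true[0]))


# Made-up toy data: 4 samples, 3 labels each.
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1], [0, 0, 1], [1, 0, 0]]

acc = subset_accuracy(y_true, y_pred)  # 2 of 4 rows match exactly
loss = hamming_loss(y_true, y_pred)    # 2 of 12 label slots are wrong
score = -loss  # what a scorer with greater_is_better=False would report

print(acc, round(loss, 4), round(score, 4))  # 0.5 0.1667 -0.1667
```

Subset accuracy is the harsher of the two: a single wrong label makes the whole row count as a miss, while the Hamming loss only charges for that one slot.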

In [10]:
# quick testing, using the proposed order based on F-test

clf2 = ClassifierChain(
    classifier=SVC(),
    require_dense=[False, True],
    order=[3, 0, 2, 5, 4, 1]
)

s2 = {"accuracy": metrics.make_scorer(metrics.accuracy_score),
      "hamming_loss": metrics.make_scorer(metrics.hamming_loss, greater_is_better=False)}

r2 = cross_validate(clf2, X_full, y_full, cv=5,
                   scoring=s2, return_train_score=True)
r2

{'fit_time': array([4.06053257, 4.29102755, 3.58891082, 4.0080204 , 3.63190055]),
 'score_time': array([0.65458941, 0.65905857, 0.6446743 , 0.67406464, 0.62596703]),
 'test_accuracy': array([0.1473029 , 0.39626556, 0.05613306, 0.28274428, 0.3035343 ]),
 'train_accuracy': array([0.31324675, 0.20727273, 0.3556594 , 0.26272066, 0.27050883]),
 'test_hamming_loss': array([-0.28284924, -0.20020747, -0.31185031, -0.23354123, -0.22903673]),
 'train_hamming_loss': array([-0.22995671, -0.26554113, -0.21581862, -0.25034614, -0.24575978])}

OK, the initial results here show that the proposed Classifier Chain is indeed **consistently bad**. Regardless, I now have a better idea of how to set up this evaluation pipeline, which I should finish tomorrow.
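To put numbers on that comparison, the fold scores printed for the reordered chain can be summarized the same way as in cells 8 and 9. This is only a sketch over the values copied from the output of cell 10, using the standard library instead of NumPy.

```python
from statistics import mean, pstdev

# Fold scores copied from the output of cell 10.
test_accuracy = [0.1473029, 0.39626556, 0.05613306, 0.28274428, 0.3035343]
test_hamming = [-0.28284924, -0.20020747, -0.31185031, -0.23354123, -0.22903673]

# pstdev is the population standard deviation, matching NumPy's default std().
acc_mean, acc_std = mean(test_accuracy), pstdev(test_accuracy)
# Undo the sign flip applied by the scorer to recover the actual loss.
ham_mean = -mean(test_hamming)

print(f"accuracy:     {acc_mean:.3f} +/- {acc_std:.3f}")
print(f"hamming loss: {ham_mean:.3f}")
```

That gives a mean subset accuracy of roughly 0.24 (against 0.63 for the default order) and a mean Hamming loss of roughly 0.25 (against 0.11).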