# 8. Debugging the `BasicStacking` model

Since the first notebook, the `BasicStacking` has been producing exactly the same results as the `BinaryRelevance` model, which is really weird. Also, the `StackingWithFTest`, when used with `alpha=1` (and therefore working just like a `BasicStacking`), does **not** produce the same results.

The `BasicStacking` is most likely not leveraging its second layer of classifiers as it should, and this is what we are going to debug in this notebook.

## 8.1. Setup

In [1]:
from skmultilearn.problem_transform import BinaryRelevance
from skmultilearn.dataset import load_dataset
from sklearn.svm import SVC
from skmultilearn.base.problem_transformation import ProblemTransformationBase
from typing import List, Optional, Any, Tuple, Dict
import numpy as np
import sklearn.metrics as metrics
import json
import pandas as pd
from sklearn.feature_selection import f_classif
from evaluation import EvaluationPipeline


## 8.2. `BasicStacking` code

After this code is successfully debugged, it should be moved to a Python file of its own.

In [2]:
# TODO: move this to an actual python file

class BasicStacking(ProblemTransformationBase):
    first_layer_classifiers: BinaryRelevance
    second_layer_classifiers: BinaryRelevance

    def __init__(self, classifier: Any = None, require_dense: Optional[List[bool]] = None):
        super(BasicStacking, self).__init__(classifier, require_dense)

        self.first_layer_classifiers = BinaryRelevance(
            classifier=SVC(),
            require_dense=[False, True]
        )

        self.second_layer_classifiers = BinaryRelevance(
            classifier=SVC(),
            require_dense=[False, True]
        )
    
    def fit(self, X: Any, y: Any):
        self.first_layer_classifiers.fit(X, y)

        first_layer_predictions = self.first_layer_classifiers.predict(X)
        X_expanded = np.hstack([X.todense(), first_layer_predictions.todense()])

        self.second_layer_classifiers.fit(X_expanded, y)
    
    def predict(self, X: Any):
        first_layer_predictions = self.first_layer_classifiers.predict(X)
        X_expanded = np.hstack([X.todense(), first_layer_predictions.todense()])
        return self.second_layer_classifiers.predict(X_expanded)


## 8.3. Baseline results

Let's get the results again for the `BinaryRelevance` and the `BasicStacking`.

In [3]:
desired_datasets = ["scene", "emotions", "birds"]

datasets = {}
for dataset_name in desired_datasets:
    print(f"getting dataset `{dataset_name}`")
    
    full_dataset = load_dataset(dataset_name, "undivided")
    X, y, _, _ = full_dataset

    train_dataset = load_dataset(dataset_name, "train")
    X_train, y_train, _, _ = train_dataset

    test_dataset = load_dataset(dataset_name, "test")
    X_test, y_test, _, _ = test_dataset

    datasets[dataset_name] = {
        "X": X,
        "y": y,
        "X_train": X_train,
        "y_train": y_train,
        "X_test": X_test,
        "y_test": y_test,
        "rows": X.shape[0],
        "labels_count": y.shape[1]
    }


for name, info in datasets.items():
    print("===")
    print(f"information for dataset `{name}`")
    print(f"rows: {info['rows']}, labels: {info['labels_count']}")


getting dataset `scene`
scene:undivided - exists, not redownloading
scene:train - exists, not redownloading
scene:test - exists, not redownloading
getting dataset `emotions`
emotions:undivided - exists, not redownloading
emotions:train - exists, not redownloading
emotions:test - exists, not redownloading
getting dataset `birds`
birds:undivided - exists, not redownloading
birds:train - exists, not redownloading
birds:test - exists, not redownloading
===
information for dataset `scene`
rows: 2407, labels: 6
===
information for dataset `emotions`
rows: 593, labels: 6
===
information for dataset `birds`
rows: 645, labels: 19


In [4]:
baseline_binary_relevance_model = BinaryRelevance(
    classifier=SVC(),
    require_dense=[False, True]
)

basic_stacking_model = BasicStacking()

models = {
    "baseline_binary_relevance_model": baseline_binary_relevance_model,
    "basic_stacking_model": basic_stacking_model,
}

In [5]:
evaluation_results = {}

for model_name, model in models.items():
    print(f"# running model `{model_name}`")

    evaluation_results[model_name] = {}

    n_folds = 5
    evaluation_pipeline = EvaluationPipeline(model, n_folds)

    for dataset_name, info in datasets.items():
        print(f"## running dataset `{dataset_name}`")

        result = evaluation_pipeline.run(info["X"], info["y"])
        evaluation_results[model_name][dataset_name] = result

        print(f"results obtained:")
        result.describe()


# running model `baseline_binary_relevance_model`
## running dataset `scene`
results obtained:
Accuracy: 0.5268 ± 0.13
Hamming Loss: -0.1020 ± 0.03
F1 score: 0.4207 ± 0.09
## running dataset `emotions`
results obtained:
Accuracy: 0.0135 ± 0.01
Hamming Loss: -0.3033 ± 0.02
F1 score: 0.0576 ± 0.01
## running dataset `birds`


  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


results obtained:
Accuracy: 0.4636 ± 0.05
Hamming Loss: -0.0534 ± 0.00
F1 score: 0.0128 ± 0.00
# running model `basic_stacking_model`
## running dataset `scene`
results obtained:
Accuracy: 0.5268 ± 0.13
Hamming Loss: -0.1020 ± 0.03
F1 score: 0.4207 ± 0.09
## running dataset `emotions`
results obtained:
Accuracy: 0.0135 ± 0.01
Hamming Loss: -0.3033 ± 0.02
F1 score: 0.0576 ± 0.01
## running dataset `birds`


  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


results obtained:
Accuracy: 0.4636 ± 0.05
Hamming Loss: -0.0534 ± 0.00
F1 score: 0.0128 ± 0.00


In [6]:
for model, data in evaluation_results.items():
    for dataset, result in data.items():
        print(f"model `{model}`, dataset `{dataset}`")
        result.describe()
        print()

model `baseline_binary_relevance_model`, dataset `scene`
Accuracy: 0.5268 ± 0.13
Hamming Loss: -0.1020 ± 0.03
F1 score: 0.4207 ± 0.09

model `baseline_binary_relevance_model`, dataset `emotions`
Accuracy: 0.0135 ± 0.01
Hamming Loss: -0.3033 ± 0.02
F1 score: 0.0576 ± 0.01

model `baseline_binary_relevance_model`, dataset `birds`
Accuracy: 0.4636 ± 0.05
Hamming Loss: -0.0534 ± 0.00
F1 score: 0.0128 ± 0.00

model `basic_stacking_model`, dataset `scene`
Accuracy: 0.5268 ± 0.13
Hamming Loss: -0.1020 ± 0.03
F1 score: 0.4207 ± 0.09

model `basic_stacking_model`, dataset `emotions`
Accuracy: 0.0135 ± 0.01
Hamming Loss: -0.3033 ± 0.02
F1 score: 0.0576 ± 0.01

model `basic_stacking_model`, dataset `birds`
Accuracy: 0.4636 ± 0.05
Hamming Loss: -0.0534 ± 0.00
F1 score: 0.0128 ± 0.00



As we can see, the results are truly identical between the two models.
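Matching aggregate metrics do not by themselves prove that two models predict the same labels, so a cell-by-cell comparison can be a useful complement. The sketch below is a minimal illustration on toy arrays. `count_disagreements` is a hypothetical helper (it is not part of this notebook or of `scikit-multilearn`), and sparse predictions would need to be converted with `.toarray()` first.

```python
import numpy as np

def count_disagreements(pred_a, pred_b):
    """Return how many (row, label) cells differ between two prediction matrices."""
    a = np.asarray(pred_a)
    b = np.asarray(pred_b)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    return int(np.sum(a != b))

# Toy predictions: same number of positive labels, but not in the same cells.
br_preds = np.array([[1, 0, 0], [0, 1, 0]])
stacking_preds = np.array([[0, 1, 0], [0, 1, 0]])

print(count_disagreements(br_preds, br_preds))        # 0
print(count_disagreements(br_preds, stacking_preds))  # 2
```

A result of zero on real predictions would mean the two models agree on every single label, which is a stronger statement than equal averages.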

## 8.4. Actually debugging the `BasicStacking`

Let's start by making sure that the second layer is being used, and that it receives more features than the first layer (it is supposed to get the predictions of the first layer as features).

In [84]:
import copy

class DebuggingBasicStacking(ProblemTransformationBase):
    first_layer_classifiers: BinaryRelevance
    second_layer_classifiers: BinaryRelevance

    def __init__(self, classifier: Any = SVC(), require_dense: Optional[List[bool]] = None):
        super().__init__()

        first_base_classifier = copy.deepcopy(classifier)
        second_base_classifier = copy.deepcopy(classifier)

        print("Same object check", first_base_classifier is second_base_classifier)

        self.first_layer_classifiers = BinaryRelevance(
            classifier=first_base_classifier,
            require_dense=[False, True]
        )

        self.second_layer_classifiers = BinaryRelevance(
            classifier=second_base_classifier,
            require_dense=[False, True]
        )
    
    def fit(self, X: Any, y: Any):
        print(f"FIT: X shape is {X.shape}")
        self.first_layer_classifiers.fit(X, y)

        first_layer_predictions = self.first_layer_classifiers.predict(X)
        formatted_first_layer_predictions = first_layer_predictions.todense()
        X_expanded = np.hstack([X.todense(), formatted_first_layer_predictions])

        first_layer_sum = np.sum(np.sum(formatted_first_layer_predictions, axis=1))
        print("FIT: summing the values (for first layer):", first_layer_sum)

        print(f"FIT: X_extended shape is {X_expanded.shape}")
        self.second_layer_classifiers.fit(X_expanded, y)
    
    def predict(self, X: Any): # type: ignore
        print(f"PREDICT: X shape is {X.shape}")
        first_layer_predictions = self.first_layer_classifiers.predict(X)
        formatted_first_layer_predictions = first_layer_predictions.todense()

        X_expanded = np.hstack([X.todense(), formatted_first_layer_predictions])

        print("PREDICT: summing the values (for first layer):", np.sum(np.sum(formatted_first_layer_predictions, axis=1)))
        print(f"PREDICT: X_extended shape is {X_expanded.shape}")

        second_layer_predictions = self.second_layer_classifiers.predict(X_expanded)
        formatted_second_layer_predictions = second_layer_predictions.todense()

        print("PREDICT: summing the values (for second layer):", np.sum(np.sum(formatted_second_layer_predictions, axis=1)))

        return second_layer_predictions

In [8]:
# first test

m = DebuggingBasicStacking()
m.fit(datasets["scene"]["X_train"], datasets["scene"]["y_train"])

X shape is (1211, 294)
X_extended shape is (1211, 300)


In [15]:
# second test

m = DebuggingBasicStacking()
m.fit(datasets["scene"]["X_train"], datasets["scene"]["y_train"])

X shape is (1211, 294)
X_extended shape is (1211, 300)


In [21]:
print("first layer")
for clf in m.first_layer_classifiers.classifiers_:
    print(clf.n_features_in_)

print("second layer")
for clf in m.second_layer_classifiers.classifiers_:
    print(clf.n_features_in_)

first layer
294
294
294
294
294
294
second layer
300
300
300
300
300
300


In [67]:
# third test

m = DebuggingBasicStacking()
m.fit(datasets["scene"]["X_train"], datasets["scene"]["y_train"])
m.predict(datasets["scene"]["X_test"])

FIT: X shape is (1211, 294)
FIT: Summing the values: 1019
FIT: X_extended shape is (1211, 300)
PREDICT: X shape is (1196, 294)
Summing the values: 901
PREDICT: X_extended shape is (1196, 300)


<1196x6 sparse matrix of type '<class 'numpy.int64'>'
	with 901 stored elements in Compressed Sparse Column format>

In [68]:
1019/(1211 * 6)

0.14024222405725295

In [65]:
901/(1196 * 6)

0.12555741360089187

In [70]:
# -- checking amount of values found in the actual dataset
# that is, values for the labels

np.sum(np.sum(datasets["scene"]["y"].todense(), axis=1))

2585

In [71]:
datasets["scene"]["y"].shape

(2407, 6)

In [72]:
2585/(2407 * 6)

0.17899182938651156

In [73]:
# fourth test

pure_br = BinaryRelevance(
    classifier=SVC(random_state=42),
    require_dense=[False, True]
)

pure_br.fit(datasets["scene"]["X_train"], datasets["scene"]["y_train"])
preds = pure_br.predict(datasets["scene"]["X_test"])

display(np.sum(np.sum(preds.todense(), axis=1)))
display(preds.shape)

901

(1196, 6)

In [55]:
901/(1196*6)

0.12555741360089187

In [76]:
pure_br = BinaryRelevance(
    classifier=SVC(random_state=94589045),
    require_dense=[False, True]
)

pure_br.fit(datasets["scene"]["X_train"], datasets["scene"]["y_train"])
preds = pure_br.predict(datasets["scene"]["X_test"])

display(np.sum(np.sum(preds.todense(), axis=1)))
display(preds.shape)

901

(1196, 6)

This is really weird. Regardless of the `random_state`, the results are always the same. Let's try another base classifier.

In [62]:
from sklearn.ensemble import RandomForestClassifier

rf_br = BinaryRelevance(
    classifier=RandomForestClassifier(random_state=42),
    require_dense=[False, True]
)

rf_br.fit(datasets["scene"]["X_train"], datasets["scene"]["y_train"])
preds = rf_br.predict(datasets["scene"]["X_test"])

display(np.sum(np.sum(preds.todense(), axis=1)))
display(preds.shape)

791

(1196, 6)

In [63]:
rf_br = BinaryRelevance(
    classifier=RandomForestClassifier(random_state=43434242),
    require_dense=[False, True]
)

rf_br.fit(datasets["scene"]["X_train"], datasets["scene"]["y_train"])
preds = rf_br.predict(datasets["scene"]["X_test"])

display(np.sum(np.sum(preds.todense(), axis=1)))
display(preds.shape)

811

(1196, 6)

In [75]:
rf_br = BinaryRelevance(
    classifier=RandomForestClassifier(),
    require_dense=[False, True]
)

rf_br.fit(datasets["scene"]["X_train"], datasets["scene"]["y_train"])
preds = rf_br.predict(datasets["scene"]["X_test"])

display(np.sum(np.sum(preds.todense(), axis=1)))
display(preds.shape)

792

(1196, 6)

In [83]:
m = DebuggingBasicStacking()
m.fit(datasets["scene"]["X_train"], datasets["scene"]["y_train"])
m.predict(datasets["scene"]["X_test"])

FIT: X shape is (1211, 294)
FIT: summing the values (for first layer): 1019
FIT: X_extended shape is (1211, 300)
PREDICT: X shape is (1196, 294)
PREDICT: summing the values (for first layer): 901
PREDICT: X_extended shape is (1196, 300)
PREDICT: summing the values (for second layer): 901


<1196x6 sparse matrix of type '<class 'numpy.int64'>'
	with 901 stored elements in Compressed Sparse Column format>

In [87]:
# fifth test

m = DebuggingBasicStacking(classifier=RandomForestClassifier(random_state=42))
m.fit(datasets["scene"]["X_train"], datasets["scene"]["y_train"])
m.predict(datasets["scene"]["X_test"])

Same object check False
FIT: X shape is (1211, 294)
FIT: summing the values (for first layer): 1286
FIT: X_extended shape is (1211, 300)
PREDICT: X shape is (1196, 294)
PREDICT: summing the values (for first layer): 797
PREDICT: X_extended shape is (1196, 300)
PREDICT: summing the values (for second layer): 793


<1196x6 sparse matrix of type '<class 'numpy.int64'>'
	with 793 stored elements in Compressed Sparse Column format>

In [89]:
m = DebuggingBasicStacking(classifier=RandomForestClassifier(random_state=94389473))
m.fit(datasets["scene"]["X_train"], datasets["scene"]["y_train"])
m.predict(datasets["scene"]["X_test"])

Same object check False
FIT: X shape is (1211, 294)
FIT: summing the values (for first layer): 1286
FIT: X_extended shape is (1211, 300)
PREDICT: X shape is (1196, 294)
PREDICT: summing the values (for first layer): 791
PREDICT: X_extended shape is (1196, 300)
PREDICT: summing the values (for second layer): 792


<1196x6 sparse matrix of type '<class 'numpy.int64'>'
	with 792 stored elements in Compressed Sparse Column format>

In [103]:
# sixth test

from sklearn.ensemble import StackingClassifier
from sklearn.multiclass import OneVsRestClassifier

# class FixedBR(BinaryRelevance):
#     def fit(self, X: Any, y: Any):
#         self.classes_ = [0,1,2,3,4,5]
#         super().fit(X, y)

base_rf_br = FixedBR(
    classifier=RandomForestClassifier(random_state=42),
    require_dense=[False, True]
)
# base_rf_br.classes_ = [0,1,2,3,4,5]

final_rf_br = FixedBR(
    classifier=RandomForestClassifier(random_state=42),
    require_dense=[False, True]
)
# final_rf_br.classes_ = base_rf_br.classes_
# notice: these commented out lines show attempts to set the `classes_` property
# of the BinaryRelevance class, but it does not work as expected

clf = StackingClassifier(
    estimators=[ ("base_rf_br", base_rf_br) ],
    final_estimator= final_rf_br,
    passthrough=True
    )

wrapped_clf = OneVsRestClassifier(clf)
# StackingClassifier does not natively support multilabel classification
# so we have to wrap it in a OneVsRestClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html)
# seen here: https://stackoverflow.com/questions/61309527/unable-to-do-stacking-for-a-multi-label-classifier

wrapped_clf.fit(datasets["scene"]["X_train"], datasets["scene"]["y_train"])
preds = wrapped_clf.predict(datasets["scene"]["X_test"])



ValueError: setting an array element with a sequence.

In [104]:
# another attempt, without using `BinaryRelevance`

starting_classifier = RandomForestClassifier(random_state=42)
final_classifier = RandomForestClassifier(random_state=42)

clf = StackingClassifier(
    estimators=[ ("rf", starting_classifier) ],
    final_estimator= final_classifier,
    passthrough=True
    )

wrapped_clf = OneVsRestClassifier(clf)
# StackingClassifier does not natively support multilabel classification
# so we have to wrap it in a OneVsRestClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html)
# seen here: https://stackoverflow.com/questions/61309527/unable-to-do-stacking-for-a-multi-label-classifier

wrapped_clf.fit(datasets["scene"]["X_train"], datasets["scene"]["y_train"])
preds = wrapped_clf.predict(datasets["scene"]["X_test"])

In [107]:
display(metrics.hamming_loss(datasets["scene"]["y_test"], preds))
display(metrics.f1_score(datasets["scene"]["y_test"], preds, average="macro"))

0.0907190635451505

0.6830985019980916

In [127]:
# trying to better understand the "architecture" being used here

starting_classifier = RandomForestClassifier(random_state=42)
final_classifier = RandomForestClassifier(random_state=42)

clf = StackingClassifier(
    estimators=[ ("rf", starting_classifier) ],
    final_estimator= final_classifier,
    passthrough=True,
    verbose=1,
    )

wrapped_clf = OneVsRestClassifier(clf)
# StackingClassifier does not natively support multilabel classification
# so we have to wrap it in a OneVsRestClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html)
# seen here: https://stackoverflow.com/questions/61309527/unable-to-do-stacking-for-a-multi-label-classifier

wrapped_clf.fit(datasets["scene"]["X_train"], datasets["scene"]["y_train"])
preds = wrapped_clf.predict(datasets["scene"]["X_test"])

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    5.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    5.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    4.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    4.9s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    5.9s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    5.0s finished


In [142]:
wrapped_clf.estimators_

[StackingClassifier(estimators=[('rf', RandomForestClassifier(random_state=42))],
                    final_estimator=RandomForestClassifier(random_state=42),
                    passthrough=True, verbose=1),
 StackingClassifier(estimators=[('rf', RandomForestClassifier(random_state=42))],
                    final_estimator=RandomForestClassifier(random_state=42),
                    passthrough=True, verbose=1),
 StackingClassifier(estimators=[('rf', RandomForestClassifier(random_state=42))],
                    final_estimator=RandomForestClassifier(random_state=42),
                    passthrough=True, verbose=1),
 StackingClassifier(estimators=[('rf', RandomForestClassifier(random_state=42))],
                    final_estimator=RandomForestClassifier(random_state=42),
                    passthrough=True, verbose=1),
 StackingClassifier(estimators=[('rf', RandomForestClassifier(random_state=42))],
                    final_estimator=RandomForestClassifier(random_state=42),
     

In [143]:
wrapped_clf.estimators_[0].estimators_

[RandomForestClassifier(random_state=42)]

In [144]:
wrapped_clf.estimators_[0].estimators_[0].n_features_in_

294

In [141]:
wrapped_clf.estimators_[0].final_estimator_.n_features_in_

295

In [115]:
# trying to replicate simple binary relevance results

base_classifier = RandomForestClassifier(random_state=42)
br_simulator = OneVsRestClassifier(base_classifier)
br_simulator.fit(datasets["scene"]["X_train"], datasets["scene"]["y_train"])
preds = br_simulator.predict(datasets["scene"]["X_test"])

display(metrics.hamming_loss(datasets["scene"]["y_test"], preds))
display(metrics.f1_score(datasets["scene"]["y_test"], preds, average="macro"))

0.08974358974358974

0.684223851314326

In [126]:
for clf in br_simulator.estimators_:
    print(clf.n_features_in_)

294
294
294
294
294
294


In [112]:
br_simulator.multilabel_

True

In [113]:
actual_br = BinaryRelevance(
    classifier=RandomForestClassifier(random_state=42),
    require_dense=[False, True]
)

actual_br.fit(datasets["scene"]["X_train"], datasets["scene"]["y_train"])
preds = actual_br.predict(datasets["scene"]["X_test"])

display(metrics.hamming_loss(datasets["scene"]["y_test"], preds))
display(metrics.f1_score(datasets["scene"]["y_test"], preds, average="macro"))

0.08974358974358974

0.684223851314326

In [114]:
my_stacking = DebuggingBasicStacking(
    classifier=RandomForestClassifier(random_state=42),
    require_dense=[False, True]
)

my_stacking.fit(datasets["scene"]["X_train"], datasets["scene"]["y_train"])
preds = my_stacking.predict(datasets["scene"]["X_test"])

display(metrics.hamming_loss(datasets["scene"]["y_test"], preds))
display(metrics.f1_score(datasets["scene"]["y_test"], preds, average="macro"))

Same object check False
FIT: X shape is (1211, 294)
FIT: summing the values (for first layer): 1286
FIT: X_extended shape is (1211, 300)
PREDICT: X shape is (1196, 294)
PREDICT: summing the values (for first layer): 791
PREDICT: X_extended shape is (1196, 300)
PREDICT: summing the values (for second layer): 790


0.08960423634336678

0.684628539219014

### First test: check shape of the features (shape of X)

A few `print`s were added to the code to reveal the shape of `X`, both at the first layer and at the second layer. Result:

```
X shape is (1211, 294)
X_extended shape is (1211, 300)
```

**So, the second layer is indeed receiving more features than the first layer. This is good.**

### Second test: check if the base classifier itself is being trained with the new features

The existing properties being set by `BinaryRelevance` already make it possible to investigate each classifier of either the first or the second layer.

The property `n_features_in_`, from the [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html), informs how many features were seen during `fit()`. This is the property we are going to use to check if the base classifiers are being trained with the new features.

```
first layer
294
294
294
294
294
294
second layer
300
300
300
300
300
300
```

**The base classifiers from the second layer are really receiving all the new features. This is good.**

### Third test: check if the new features have values

```
FIT: X shape is (1211, 294)
FIT: Summing the values: 1019
FIT: X_extended shape is (1211, 300)
PREDICT: X shape is (1196, 294)
Summing the values: 901
PREDICT: X_extended shape is (1196, 300)
```

**The extended features really have values, and they are not all zeros. This is good**. We can also see that they differ from `fit` to `predict`, which shows that we are _not_ making a simple mistake such as using the same data in both places.

The share of non-zero values, however, is something to consider. For the `fit` phase, only about 14% of the first-layer predictions are different from zero. For the `predict` phase, this share is about 13%. Both are lower than the share found in the labels of the full dataset, which is around 18%.
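The shares quoted above come from dividing the number of positive labels by the total number of cells in the label matrix. A small sketch of that arithmetic, using the counts printed in the cells above (`label_density` is a hypothetical helper name introduced only for this illustration):

```python
def label_density(positive_labels: int, n_rows: int, n_labels: int) -> float:
    """Fraction of cells in an (n_rows x n_labels) label matrix that are 1."""
    return positive_labels / (n_rows * n_labels)

# Counts taken from the cells above (scene dataset, 6 labels).
fit_density = label_density(1019, 1211, 6)     # first-layer predictions, training split
predict_density = label_density(901, 1196, 6)  # first-layer predictions, test split
true_density = label_density(2585, 2407, 6)    # ground-truth labels, full dataset

print(f"{fit_density:.3f} {predict_density:.3f} {true_density:.3f}")  # 0.140 0.126 0.179
```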

Is it possible that the "stacking part" is making the model's performance drop? **It might be worth checking the output of a pure `BinaryRelevance`**.

At first, it might be tempting to think that the problem is the dataset being used. But remember that in the first notebook we conducted a very similar test, comparing the output of `BinaryRelevance` and `BasicStacking`, and the results were identical for all datasets tested.

### Fourth test: testing a pure `BinaryRelevance` model

~A pure `BinaryRelevance` manages to be _worse_ than the first layer of the `BasicStacking` model. This is really weird. It only gets 901 labels, instead of 1019.~ A pure `BinaryRelevance` performs exactly like the first layer of the `BasicStacking`, which is actually expected. **What is weird is that both the first and second layer return the exact same number of labels (901), regardless of the extra features being added to the second layer.**

Could it be something with the `random_state`? **But even after changing the `random_state` to absurd numbers, the results are always the same**. This is totally weird. It is as if `SVC` were caching the results. Or maybe `BinaryRelevance` is keeping a cache of some sort. That would be surprising, as the source code of `scikit-multilearn` does _not_ show any cache. It uses `copy.deepcopy`. Maybe this function is the one doing the caching? I did a quick Google search about this, but found nothing conclusive.

Or maybe this is down to how `SVC` works. With all its default parameters, it really _is_ stopping at the same point: according to the scikit-learn documentation, `SVC` only uses `random_state` for probability estimates and ignores it when `probability` is `False`, which is the default. ~But why it finds more labels when being used via the `BasicStacking` model? This cannot be explained by the `SVC` internal workings alone.~ Actually it is finding the same number of labels. I got confused because I compared the labels found during `fit` to those found during `predict`. If I compare the labels from `predict` only, then the results are exactly the same.
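A quick way to test the "it is just how `SVC` works" hypothesis is to fit two `SVC` instances that differ only in `random_state` and compare their predictions. The sketch below uses a synthetic dataset from `make_classification` (an assumption made for illustration, not the `scene` data). The scikit-learn documentation states that `random_state` is ignored when `probability` is `False`, so no caching is needed to explain identical outputs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic binary problem, used only for illustration.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# With the default probability=False, SVC does not use random_state.
preds_a = SVC(random_state=42).fit(X, y).predict(X)
preds_b = SVC(random_state=94589045).fit(X, y).predict(X)

print(np.array_equal(preds_a, preds_b))  # True
```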

When using different base classifiers, such as the `RandomForestClassifier`, the results are different. **It finds _fewer_ labels than `SVC`, but at least the results differ when running with different `random_state`s**.

Perhaps the way forward is to use `RandomForestClassifier` instead of `SVC` as the base classifier.

### Fifth test: using `RandomForestClassifier` as the base classifier

Using `RandomForestClassifier` shows that, with another base classifier, it is possible to get different results for the first and second layer. In this case, the `BasicStacking` really differs from the `BinaryRelevance` model, which is what we wanted to achieve in the first place.

The results are still **not** that great, as even the `RandomForestClassifier` produces very little difference between the first and the second layer.

### Sixth test: use scikit's `StackingClassifier`

The purpose of this is to sort of "sanity check" my own implementation of _stacking_. However, it is **not** possible to conduct such a test, as the `BinaryRelevance` does not seem to be really ready to be used in the `StackingClassifier` pipeline. Or, as the `StackingClassifier` does not support multilabel and we have to wrap it in a `OneVsRestClassifier`, maybe it does not really work for the `BinaryRelevance` because we are _not_ supposed to train one `BinaryRelevance` for each label (each class). Instead, we should be training one `BinaryRelevance` for all labels.

With that in mind, I also tried to just use the base classifier, which is the `RandomForestClassifier`, instead of the `BinaryRelevance`. In this case the pipeline works and, in theory, the `OneVsRest` makes it behave just like a `BinaryRelevance`.

The final scores are actually quite good. Much better than either the `BinaryRelevance` or my `BasicStacking`, or the `StackingWithFTest`. However, the actual architecture being used in this case is **not** immediately clear, so I am not sure if I am comparing apples to apples here.

To actually find out, I tried to replicate the simple `BinaryRelevance` model with `OneVsRestClassifier`, then replicate the actual stacking, and check if the results are similar in all places.

When investigating the `OneVsRestClassifier`, by using it as a wrapper for the `StackingClassifier`, it shows that it is not doing stacking in the regular way of having a first and a second layer. Instead, it trains one "stacked classifier" for each label. Therefore, we have 6 "stacked classifiers". And on each "stacked classifier", the first base classifier (in this case, a `RandomForestClassifier`) receives 294 features, while the final base classifier takes 295 features, the additional feature being the prediction of the first base classifier.
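The difference between the two architectures shows up in the width of the second-stage input. The sketch below reproduces the 300 versus 295 feature counts with random stand-in arrays. The data is made up, and the single extra column is only a placeholder for whatever the first-stage estimator of that label outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_features, n_labels = 8, 294, 6

X = rng.random((n_rows, n_features))
# Stand-in for first-layer predictions: one binary column per label.
first_layer = rng.integers(0, 2, size=(n_rows, n_labels))

# Stacking as intended in this notebook: every second-layer classifier
# sees the original features plus the predictions for all labels.
X_all_labels = np.hstack([X, first_layer])

# One-vs-rest around a stacked classifier: the final estimator for label j
# sees the original features plus a single column for label j only.
j = 0
X_single_label = np.hstack([X, first_layer[:, [j]]])

print(X_all_labels.shape[1], X_single_label.shape[1])  # 300 295
```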

When attempting to replicate the `BinaryRelevance` with `OneVsRest`, we actually get the very same results. For `BinaryRelevance`, we get 0.684223851314326. For `OneVsRest` (with a single layer), we get 0.684223851314326. Which is the same.

For the stacking, using `OneVsRest` we get 0.6830985019980916. For my own stacking implementation, we get 0.684628539219014, which is slightly higher. It is also slightly above the `BinaryRelevance`.

This alone does _not_ confirm that I am doing stacking correctly. But I can confirm that the `StackingClassifier` / `OneVsRest` **cannot** replicate the stacking expected, where all the predicted labels from the first layer are used as input for the second layer.

However, the fact that the metrics are adding up so far (the results are different on each iteration), and that my `BasicStacking` is so far doing better (even if just a bit) than all the rest, suggests that I am on the right track.

Considering that all of this is not really the focus of my research, right now I see no reason to try many other alternatives. The only two things I still want to do are to compare metrics to `StackingWithFTest` with `alpha=1`, and to make sure that the `BasicStacking` is really expanding all labels but the label being predicted.
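For reference, the `n_features_in_` check above reported 300 features for every second-layer classifier, so as written each one receives all six prediction columns, including the one for its own label. A minimal sketch of what "all labels but the label being predicted" could look like is shown below. `expand_without_own_label` is a hypothetical helper, not code from this notebook, and the arrays are toy data.

```python
import numpy as np

def expand_without_own_label(X, first_layer_predictions, label_index):
    """Append every first-layer prediction column except `label_index` to X."""
    others = np.delete(first_layer_predictions, label_index, axis=1)
    return np.hstack([X, others])

X = np.zeros((4, 3))
preds = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])

# Features for the classifier of label 1: original columns plus labels 0 and 2.
X_for_label_1 = expand_without_own_label(X, preds, label_index=1)
print(X_for_label_1.shape)  # (4, 5)
```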

### Seventh test: make sure that the `BasicStacking` expands all labels but the label being predicted

...

### Eighth test: compare metrics to `StackingWithFTest` when `alpha=1`

...