# 8. Debugging the `BasicStacking` model

As has been known since the first notebook, `BasicStacking` produces exactly the same results as the `BinaryRelevance` model, which is unexpected. In addition, `StackingWithFTest` with `alpha=1`, a setting that should make it behave just like `BasicStacking`, does **not** produce the same results.

The `BasicStacking` is most likely not leveraging its second layer of classifiers as it should, and this is what we are going to debug in this notebook.

## 8.1. Setup

In [1]:
from skmultilearn.problem_transform import BinaryRelevance
from skmultilearn.dataset import load_dataset
from sklearn.svm import SVC
from skmultilearn.base.problem_transformation import ProblemTransformationBase
from typing import List, Optional, Any, Tuple, Dict
import numpy as np
import sklearn.metrics as metrics
import json
import pandas as pd
from sklearn.feature_selection import f_classif
from evaluation import EvaluationPipeline


## 8.2. `BasicStacking` code

Once this code has been debugged, it should be moved to a Python file of its own.

In [2]:
# TODO: move this to an actual python file

class BasicStacking(ProblemTransformationBase):
    first_layer_classifiers: BinaryRelevance
    second_layer_classifiers: BinaryRelevance

    def __init__(self, classifier: Any = None, require_dense: Optional[List[bool]] = None):
        super(BasicStacking, self).__init__(classifier, require_dense)

        self.first_layer_classifiers = BinaryRelevance(
            classifier=SVC(),
            require_dense=[False, True]
        )

        self.second_layer_classifiers = BinaryRelevance(
            classifier=SVC(),
            require_dense=[False, True]
        )
    
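    def fit(self, X: Any, y: Any):
        # First layer: one classifier per label, trained on the original features.
        self.first_layer_classifiers.fit(X, y)

        # Append the first-layer predictions (one column per label) to the features.
        first_layer_predictions = self.first_layer_classifiers.predict(X)
        X_expanded = np.hstack([X.todense(), first_layer_predictions.todense()])

        # Second layer: trained on the original features plus the first-layer predictions.
        self.second_layer_classifiers.fit(X_expanded, y)

    def predict(self, X: Any):
        # Rebuild the expanded feature matrix exactly as in fit().
        first_layer_predictions = self.first_layer_classifiers.predict(X)
        X_expanded = np.hstack([X.todense(), first_layer_predictions.todense()])
        return self.second_layer_classifiers.predict(X_expanded)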


## 8.3. Baseline results

Let's get the results again for `BinaryRelevance` and `BasicStacking`.

In [3]:
desired_datasets = ["scene", "emotions", "birds"]

datasets = {}
for dataset_name in desired_datasets:
    print(f"getting dataset `{dataset_name}`")
    
    full_dataset = load_dataset(dataset_name, "undivided")
    X, y, _, _ = full_dataset

    train_dataset = load_dataset(dataset_name, "train")
    X_train, y_train, _, _ = train_dataset

    test_dataset = load_dataset(dataset_name, "test")
    X_test, y_test, _, _ = test_dataset

    datasets[dataset_name] = {
        "X": X,
        "y": y,
        "X_train": X_train,
        "y_train": y_train,
        "X_test": X_test,
        "y_test": y_test,
        "rows": X.shape[0],
        "labels_count": y.shape[1]
    }


for name, info in datasets.items():
    print("===")
    print(f"information for dataset `{name}`")
    print(f"rows: {info['rows']}, labels: {info['labels_count']}")


getting dataset `scene`
scene:undivided - exists, not redownloading
scene:train - exists, not redownloading
scene:test - exists, not redownloading
getting dataset `emotions`
emotions:undivided - exists, not redownloading
emotions:train - exists, not redownloading
emotions:test - exists, not redownloading
getting dataset `birds`
birds:undivided - exists, not redownloading
birds:train - exists, not redownloading
birds:test - exists, not redownloading
===
information for dataset `scene`
rows: 2407, labels: 6
===
information for dataset `emotions`
rows: 593, labels: 6
===
information for dataset `birds`
rows: 645, labels: 19


In [4]:
baseline_binary_relevance_model = BinaryRelevance(
    classifier=SVC(),
    require_dense=[False, True]
)

basic_stacking_model = BasicStacking()

models = {
    "baseline_binary_relevance_model": baseline_binary_relevance_model,
    "basic_stacking_model": basic_stacking_model,
}

In [5]:
evaluation_results = {}

for model_name, model in models.items():
    print(f"# running model `{model_name}`")

    evaluation_results[model_name] = {}

    n_folds = 5
    evaluation_pipeline = EvaluationPipeline(model, n_folds)

    for dataset_name, info in datasets.items():
        print(f"## running dataset `{dataset_name}`")

        result = evaluation_pipeline.run(info["X"], info["y"])
        evaluation_results[model_name][dataset_name] = result

        print(f"results obtained:")
        result.describe()


# running model `baseline_binary_relevance_model`
## running dataset `scene`
results obtained:
Accuracy: 0.5268 ± 0.13
Hamming Loss: -0.1020 ± 0.03
F1 score: 0.4207 ± 0.09
## running dataset `emotions`
results obtained:
Accuracy: 0.0135 ± 0.01
Hamming Loss: -0.3033 ± 0.02
F1 score: 0.0576 ± 0.01
## running dataset `birds`


  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


results obtained:
Accuracy: 0.4636 ± 0.05
Hamming Loss: -0.0534 ± 0.00
F1 score: 0.0128 ± 0.00
# running model `basic_stacking_model`
## running dataset `scene`
results obtained:
Accuracy: 0.5268 ± 0.13
Hamming Loss: -0.1020 ± 0.03
F1 score: 0.4207 ± 0.09
## running dataset `emotions`
results obtained:
Accuracy: 0.0135 ± 0.01
Hamming Loss: -0.3033 ± 0.02
F1 score: 0.0576 ± 0.01
## running dataset `birds`


  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


results obtained:
Accuracy: 0.4636 ± 0.05
Hamming Loss: -0.0534 ± 0.00
F1 score: 0.0128 ± 0.00


In [6]:
for model, data in evaluation_results.items():
    for dataset, result in data.items():
        print(f"model `{model}`, dataset `{dataset}`")
        result.describe()
        print()

model `baseline_binary_relevance_model`, dataset `scene`
Accuracy: 0.5268 ± 0.13
Hamming Loss: -0.1020 ± 0.03
F1 score: 0.4207 ± 0.09

model `baseline_binary_relevance_model`, dataset `emotions`
Accuracy: 0.0135 ± 0.01
Hamming Loss: -0.3033 ± 0.02
F1 score: 0.0576 ± 0.01

model `baseline_binary_relevance_model`, dataset `birds`
Accuracy: 0.4636 ± 0.05
Hamming Loss: -0.0534 ± 0.00
F1 score: 0.0128 ± 0.00

model `basic_stacking_model`, dataset `scene`
Accuracy: 0.5268 ± 0.13
Hamming Loss: -0.1020 ± 0.03
F1 score: 0.4207 ± 0.09

model `basic_stacking_model`, dataset `emotions`
Accuracy: 0.0135 ± 0.01
Hamming Loss: -0.3033 ± 0.02
F1 score: 0.0576 ± 0.01

model `basic_stacking_model`, dataset `birds`
Accuracy: 0.4636 ± 0.05
Hamming Loss: -0.0534 ± 0.00
F1 score: 0.0128 ± 0.00



As we can see, the results of the two models are identical on all three datasets and for all three metrics.
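Identical aggregate metrics do not by themselves show that every single prediction is identical. A minimal sketch of a direct, cell-by-cell comparison follows. The helper names `to_dense_array` and `count_prediction_differences` are hypothetical and not part of this notebook, and the toy arrays merely stand in for real model outputs.

```python
import numpy as np


def to_dense_array(predictions):
    """Return predictions as a plain NumPy array (sparse inputs are densified)."""
    if hasattr(predictions, "toarray"):  # scipy.sparse matrices expose toarray()
        predictions = predictions.toarray()
    return np.asarray(predictions)


def count_prediction_differences(predictions_a, predictions_b):
    """Count the (row, label) cells where two prediction matrices disagree."""
    a = to_dense_array(predictions_a)
    b = to_dense_array(predictions_b)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    return int(np.sum(a != b))


# Toy data standing in for the predictions of two models on the same test set.
toy_a = np.array([[1, 0, 0], [0, 1, 0]])
toy_b = np.array([[1, 0, 0], [0, 1, 1]])

n_different = count_prediction_differences(toy_a, toy_b)
print(n_different)
```

With the models above, one could fit both on `X_train` and `y_train` and pass the two `predict(X_test)` outputs to this helper. A count of zero would mean the second layer reproduces the first layer's predictions exactly.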

## 8.4. Actually debugging the `BasicStacking`

Let's start by making sure that the second layer is being used and that it receives more features than the first layer (it is supposed to get the predictions of the first layer as additional features).

In [14]:
class DebuggingBasicStacking(ProblemTransformationBase):
    first_layer_classifiers: BinaryRelevance
    second_layer_classifiers: BinaryRelevance

    def __init__(self, classifier: Any = None, require_dense: Optional[List[bool]] = None):
        super().__init__()

        self.first_layer_classifiers = BinaryRelevance(
            classifier=SVC(),
            require_dense=[False, True]
        )

        self.second_layer_classifiers = BinaryRelevance(
            classifier=SVC(),
            require_dense=[False, True]
        )
    
    def fit(self, X: Any, y: Any):
        print(f"X shape is {X.shape}")
        self.first_layer_classifiers.fit(X, y)

        first_layer_predictions = self.first_layer_classifiers.predict(X)
        X_expanded = np.hstack([X.todense(), first_layer_predictions.todense()])

        print(f"X_extended shape is {X_expanded.shape}")
        self.second_layer_classifiers.fit(X_expanded, y)
    
    def predict(self, X: Any):
        print(f"PREDICT: X shape is {X.shape}")
        first_layer_predictions = self.first_layer_classifiers.predict(X)
        X_expanded = np.hstack([X.todense(), first_layer_predictions.todense()])
        print(f"PREDICT: X_extended shape is {X_expanded.shape}")
        return self.second_layer_classifiers.predict(X_expanded)

In [8]:
# first test
# m = DebuggingBasicStacking()
# m.fit(datasets["scene"]["X_train"], datasets["scene"]["y_train"])

X shape is (1211, 294)
X_extended shape is (1211, 300)


In [15]:
# second test

m = DebuggingBasicStacking()
m.fit(datasets["scene"]["X_train"], datasets["scene"]["y_train"])

X shape is (1211, 294)
X_extended shape is (1211, 300)


In [21]:
print("first layer")
for clf in m.first_layer_classifiers.classifiers_:
    print(clf.n_features_in_)

print("second layer")
for clf in m.second_layer_classifiers.classifiers_:
    print(clf.n_features_in_)

first layer
294
294
294
294
294
294
second layer
300
300
300
300
300
300


### First test: check shape of the features (shape of X)

A few `print`s were added to the code to reveal the shape of `X`, both at the first layer and at the second layer. Result:

```
X shape is (1211, 294)
X_extended shape is (1211, 300)
```

**So, the second layer is indeed receiving more features than the first layer. This is good.**
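The expected width of the expanded matrix can be checked with simple arithmetic: the `scene` training set has 294 features and 6 labels, so stacking the predictions next to the features should give 294 + 6 = 300 columns. A small sketch with dense stand-in arrays (the zero and one matrices are placeholders, not real data):

```python
import numpy as np

n_rows, n_features, n_labels = 1211, 294, 6

# Stand-ins for the dense feature matrix and the first-layer predictions.
X_dense = np.zeros((n_rows, n_features))
first_layer_predictions = np.ones((n_rows, n_labels))

# Same horizontal stacking as in fit() and predict().
X_expanded = np.hstack([X_dense, first_layer_predictions])

print(X_expanded.shape)
```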

### Second test: check if the base classifier itself is being trained with the new features

`BinaryRelevance` keeps its fitted base classifiers in the `classifiers_` attribute, which makes it possible to inspect each classifier of either the first or the second layer.

The `n_features_in_` attribute of [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) reports how many features were seen during `fit()`. This is the attribute we use to check whether the base classifiers are being trained with the new features.

```
first layer
294
294
294
294
294
294
second layer
300
300
300
300
300
300
```

**The base classifiers from the second layer are really receiving all the new features. This is good.**
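The same check can be turned into an assertion instead of reading the printed numbers by eye. The sketch below uses hypothetical helpers (`feature_counts`, `all_trained_on`) and `SimpleNamespace` stand-ins in place of fitted classifiers, so it runs on its own.

```python
from types import SimpleNamespace


def feature_counts(classifiers):
    """Collect n_features_in_ from each fitted classifier."""
    return [clf.n_features_in_ for clf in classifiers]


def all_trained_on(classifiers, expected_n_features):
    """True if every classifier saw exactly expected_n_features during fit()."""
    return all(n == expected_n_features for n in feature_counts(classifiers))


# Stand-ins mimicking six fitted classifiers per layer (hypothetical objects).
first_layer = [SimpleNamespace(n_features_in_=294) for _ in range(6)]
second_layer = [SimpleNamespace(n_features_in_=300) for _ in range(6)]

print(all_trained_on(first_layer, 294))
print(all_trained_on(second_layer, 294 + 6))
```

In the notebook, the stand-in lists would be replaced by `m.first_layer_classifiers.classifiers_` and `m.second_layer_classifiers.classifiers_`.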

### Third test: check if the new features have values

...
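One possible way to approach this check, sketched under assumptions: look only at the columns appended after the original features and report, for each one, how many entries are non-zero and whether the column is constant. A constant column carries no information for the second layer. The helper `summarize_appended_columns` is hypothetical, and the toy matrix stands in for `X_expanded`.

```python
import numpy as np


def summarize_appended_columns(X_expanded, n_original_features):
    """Describe each column that was appended after the original features."""
    X_expanded = np.asarray(X_expanded)
    appended = X_expanded[:, n_original_features:]
    summary = []
    for column_index in range(appended.shape[1]):
        column = appended[:, column_index]
        summary.append({
            "column": n_original_features + column_index,
            "nonzero": int(np.count_nonzero(column)),
            # A column with at most one distinct value is constant.
            "is_constant": bool(np.unique(column).size <= 1),
        })
    return summary


# Toy matrix: 2 original features followed by 2 appended prediction columns.
X_toy = np.array([
    [0.5, 0.1, 1, 0],
    [0.2, 0.9, 0, 0],
    [0.7, 0.3, 1, 0],
])

toy_summary = summarize_appended_columns(X_toy, n_original_features=2)
for entry in toy_summary:
    print(entry)
```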

### Fourth test: attempt to use `Stacking` from scikit-learn to reimplement the `BasicStacking`

...