# 2. Second model experiments

This second model is directly influenced by the **first model**, which uses a stacking architecture with a label selection step between the first and second levels of binary classifiers. That label selection uses the **F-test** to pick the most relevant labels, which are then used as _features_ in the second level.

Since the initial results of that model were quite interesting, I had the idea of applying the same concept to a **classifier chain** architecture.

## 2.1. Setup

In [2]:
from skmultilearn.problem_transform import BinaryRelevance
from skmultilearn.problem_transform import ClassifierChain
from skmultilearn.dataset import load_dataset, available_data_sets
from sklearn.svm import SVC
from skmultilearn.base.problem_transformation import ProblemTransformationBase
from typing import List, Optional, Any, Tuple, Dict
import numpy as np
import sklearn.metrics as metrics
import json
import pandas as pd
from sklearn.feature_selection import f_classif


## 2.2. Baseline model

The baseline model will be the default **Classifier Chain** available in [the scikit-multilearn package](http://scikit.ml/api/skmultilearn.problem_transform.cc.html).

In [3]:
train_data = load_dataset("scene", "train")
test_data = load_dataset("scene", "test")
# let's use the same "scene" dataset, that was used in the previous notebook (`1_first_model_experiments.ipynb`)

X_train, y_train, _, _ = train_data
X_test, y_test, _, _ = test_data

classifier = ClassifierChain(
    classifier=SVC(),
    require_dense=[False, True]
)

classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)

baseline_cc_accuracy = metrics.accuracy_score(y_test, predictions)
baseline_cc_hamming_loss = metrics.hamming_loss(y_test, predictions)

print("Accuracy score: ", baseline_cc_accuracy)
print("Hamming loss: ", baseline_cc_hamming_loss)

scene:train - exists, not redownloading
scene:test - exists, not redownloading
Accuracy score:  0.6780936454849499
Hamming loss:  0.09085841694537347


Taking this baseline classifier chain implementation, and also the same dataset, let's check if using the **F-test** to order the classifiers will result in better scores.

It is worth mentioning that this baseline is _already better_ than the **Stacking With F-test** model.

The documentation for this `ClassifierChain` class states that "for L labels it trains L classifiers ordered in a chain according to the **Bayesian chain rule**". However, nothing in the source code does this. The ordering is simply the sequence in which the labels are already organized in the training data. Here is the code:

```python
# class ClassifierChain(ProblemTransformationBase):
# ...
    def _order(self):
        if self.order is not None:
            return self.order

        try:
            return list(range(self._label_count))
        except AttributeError:
            raise NotFittedError("This Classifier Chain has not been fit yet")
```

So, the ordering is just the order of the labels in the dataset. This is weird. Anyway, the good thing is that we can use this `ClassifierChain` as the base class for our implementation, and simply pass a custom ordering to it.

Something else to consider, however, is that it will simply iterate over the classifiers following the order we pass to it. I was thinking of something akin to a graph, where the path taken in the chain would be dependent on the output of the previous classifier (instead of simply accumulating the predictions of all classifiers). For instance:

* The start of the chain is the label `l1`.
* For a given instance, it predicts `l1 = 1`. Then take a path towards `l2`, with which it has a high correlation according to the **F-test**.
* For another instance, it predicts `l1 = 0`. Then take a path towards `l3`, with which it has no correlation, according to the **F-test**.

This would be more similar to what is done in the stacking method.

It is something to consider, but for now, let's just use the default implementation.
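
The branching idea above could be sketched roughly as follows. This is only a toy illustration: `next_label`, the `toy_scores` dictionary, and its values are hypothetical. They are not taken from the F-test results in this notebook and are not part of scikit-multilearn.

```python
from typing import Dict, Optional, Set, Tuple


def next_label(
    current: int,
    prediction: int,
    scores: Dict[Tuple[int, int], float],
    visited: Set[int],
) -> Optional[int]:
    """Pick the next label to visit, depending on the current prediction.

    If the current label was predicted as 1, move to the unvisited label
    with the HIGHEST score against it; if it was predicted as 0, move to
    the one with the LOWEST score. Returns None when nothing is left.
    """
    candidates = {
        other: score
        for (base, other), score in scores.items()
        if base == current and other not in visited
    }
    if not candidates:
        return None

    choose = max if prediction == 1 else min
    return choose(candidates, key=candidates.get)


# made-up pairwise scores for three labels, for illustration only
toy_scores = {
    (1, 2): 50.0, (1, 3): 2.0,
    (2, 1): 50.0, (2, 3): 10.0,
    (3, 1): 2.0, (3, 2): 10.0,
}

print(next_label(1, prediction=1, scores=toy_scores, visited={1}))  # 2
print(next_label(1, prediction=0, scores=toy_scores, visited={1}))  # 3
```

With a routing function like this, each instance could follow its own path through the labels, instead of every instance sharing one fixed chain.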

## 2.3. Ordering labels via F-test

First, let's study the F-test calculation a bit, reusing what was built in the previous notebook. Then we can proceed to actually implement the model.

In [32]:
class CalculateLabelsCorrelationWithFTest():
    def __init__(
        self,
        alpha: float = 0.5,
    ):
        if alpha < 0.0 or alpha > 1.0:
            raise ValueError("alpha must be >= 0.0 and <= 1.0")

        self.alpha = alpha
        self.correlated_labels_map = pd.DataFrame()
        self.labels_count = 0

    def fit(self, X: Any, y: Any):
        self.labels_count = y.shape[1]

        f_tested_label_pairs = self.calculate_f_test_for_all_label_pairs(y)
        
        self.correlated_labels_map = self.get_map_of_correlated_labels(
            f_tested_label_pairs)

        return self.correlated_labels_map

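    def calculate_f_test_for_all_label_pairs(self, label_classifications: Any) -> List[Dict[str, Any]]:
        results = []
        # convert to dense once, instead of on every iteration of the nested loop
        dense_labels = label_classifications.todense()

        for i in range(0, self.labels_count):
            for j in range(0, self.labels_count):
                if i == j:
                    continue

                # label i plays the role of the single feature column
                base_label = self.convert_matrix_to_array(dense_labels[:, i])

                # label j plays the role of the class vector
                against_label = self.convert_matrix_to_vector(dense_labels[:, j])

                # f_classif returns (F values, p-values); keep only the F values
                f_test_result = f_classif(base_label, against_label)[0]

                results.append({
                    "label_being_tested": i,
                    "against_label": j,
                    # a single feature column gives a one-element array
                    "f_test_result": float(f_test_result[0])
                })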

        return results

    def convert_matrix_to_array(self, matrix: Any):
        return np.asarray(matrix).reshape(-1, 1)

    def convert_matrix_to_vector(self, matrix: Any):
        return np.asarray(matrix).reshape(-1)

    def get_map_of_correlated_labels(self, f_test_results: List[Dict[str, Any]]) -> pd.DataFrame:
        temp_df = pd.DataFrame(f_test_results)

        sorted_temp_df = temp_df.sort_values(
            by=["label_being_tested", "f_test_result"],
            ascending=[True, False])
        # ordering in descending order by the F-test result,
        # following what the main article describes

        selected_features = []

        for i in range(0, self.labels_count):
            mask = sorted_temp_df["label_being_tested"] == i
            split_df = sorted_temp_df[mask].reset_index(drop=True)

            big_f = split_df["f_test_result"].sum()
            max_cum_f = self.alpha * big_f

            cum_f = 0
            for _, row in split_df.iterrows():
                cum_f += row["f_test_result"]
                if cum_f > max_cum_f:
                    break

                selected_features.append({
                    "for_label": i,
                    "expand_this_label": int(row["against_label"]),
                    "f_test_result": float(row["f_test_result"]),
                })

        cols = ["for_label", "expand_this_label", "f_test_result"]
        return pd.DataFrame(selected_features, columns=cols)



In [14]:
ccf = CalculateLabelsCorrelationWithFTest(alpha=1)
res = ccf.fit(X_train, y_train)
res

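| | for_label | expand_this_label | f_test_result |
|---|---|---|---|
| 0 | 0 | 2 | 56.728266 |
| 1 | 0 | 3 | 56.36868 |
| 2 | 0 | 1 | 45.657072 |
| 3 | 0 | 5 | 33.173567 |
| 4 | 0 | 4 | 30.068018 |
| 5 | 1 | 4 | 59.336173 |
| 6 | 1 | 0 | 45.657072 |
| 7 | 1 | 5 | 44.889245 |
| 8 | 1 | 2 | 38.222988 |
| 9 | 1 | 3 | 37.984223 |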


In [16]:
res.sort_values("f_test_result", ascending=True)

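| | for_label | expand_this_label | f_test_result |
|---|---|---|---|
| 19 | 3 | 4 | 11.054443 |
| 24 | 4 | 3 | 11.054443 |
| 18 | 3 | 2 | 25.999023 |
| 14 | 2 | 3 | 25.999023 |
| 4 | 0 | 4 | 30.068018 |
| 23 | 4 | 0 | 30.068018 |
| 29 | 5 | 0 | 33.173567 |
| 3 | 0 | 5 | 33.173567 |
| 17 | 3 | 1 | 37.984223 |
| 9 | 1 | 3 | 37.984223 |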
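
Every score shows up twice in this listing, once per direction of the pair. That is expected: with two binary labels, the one-way ANOVA F statistic that `f_classif` computes comes out the same whichever label is treated as the feature and whichever as the class. A minimal pure-Python sketch of that statistic on made-up vectors (the `anova_f` helper is hypothetical and only meant to illustrate the formula):

```python
from typing import List


def anova_f(values: List[float], groups: List[int]) -> float:
    """One-way ANOVA F statistic of `values`, grouped by `groups`."""
    n = len(values)
    grand_mean = sum(values) / n

    by_group = {}
    for value, group in zip(values, groups):
        by_group.setdefault(group, []).append(value)

    between = 0.0  # variation of the group means around the grand mean
    within = 0.0   # variation of the samples around their own group mean
    for members in by_group.values():
        mean = sum(members) / len(members)
        between += len(members) * (mean - grand_mean) ** 2
        within += sum((v - mean) ** 2 for v in members)

    k = len(by_group)
    return (between / (k - 1)) / (within / (n - k))


# two made-up binary label columns
label_a = [1, 1, 1, 0, 0, 1, 0, 0]
label_b = [1, 1, 1, 1, 0, 0, 0, 0]

print(anova_f(label_a, label_b))  # 2.0
print(anova_f(label_b, label_a))  # 2.0
```

Because of this symmetry, each pair only carries one distinct score, so an ordering can treat the scores as weights on undirected pairs of labels.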


The initial idea was to start with the "most uncorrelated label of all". That is: the label that shares the lowest correlation with all other labels.

However, this does not make sense, as all label correlations are calculated in pairs. So I can simply start with either label from the pair that presents the highest correlation. From there, walk through the pairs, always taking the next pair that shares the current label, has the highest correlation, and has not been used yet.

Let's try to do the ordering manually, then think about how to implement it.
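
For reference, the manual procedure can be sketched as a small greedy walk. This is an illustration under assumptions, not the final implementation: `greedy_chain_order` and `symmetric` are hypothetical helpers, the scores are assumed to be symmetric, and the toy values are made up rather than taken from the F-test results.

```python
from typing import Dict, List, Tuple


def greedy_chain_order(scores: Dict[Tuple[int, int], float], start: int) -> List[int]:
    """Walk the labels greedily, always moving to the unused label that
    has the highest score against the label visited last."""
    labels = {label for pair in scores for label in pair}
    order = [start]

    while len(order) < len(labels):
        current = order[-1]
        remaining = [label for label in labels if label not in order]
        order.append(max(remaining, key=lambda label: scores[(current, label)]))

    return order


def symmetric(pairs: Dict[Tuple[int, int], float]) -> Dict[Tuple[int, int], float]:
    """Store every score under both directions of its pair."""
    both = dict(pairs)
    both.update({(b, a): score for (a, b), score in pairs.items()})
    return both


# made-up scores for four labels; (0, 1) is the strongest pair
toy_scores = symmetric({
    (0, 1): 60.0, (0, 2): 20.0, (0, 3): 40.0,
    (1, 2): 50.0, (1, 3): 10.0,
    (2, 3): 30.0,
})

# starting from either end of the strongest pair gives a different chain
print(greedy_chain_order(toy_scores, start=0))  # [0, 1, 2, 3]
print(greedy_chain_order(toy_scores, start=1))  # [1, 0, 3, 2]
```

Even on this toy input, the choice of which end of the strongest pair to start from changes the rest of the chain.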

In [25]:
res.sort_values("f_test_result", ascending=False)

m = res["for_label"] == 2
res[m]

# 5 -> 4 -> 1 -> 0 -> 2 -> 3

| | for_label | expand_this_label | f_test_result |
|---|---|---|---|
| 10 | 2 | 0 | 56.728266 |
| 11 | 2 | 5 | 55.765976 |
| 12 | 2 | 4 | 54.713588 |
| 13 | 2 | 1 | 38.222988 |
| 14 | 2 | 3 | 25.999023 |


For this dataset, it seems like the expected ordering would be `5 -> 4 -> 1 -> 0 -> 2 -> 3`.

The initial pair, `5 -> 4`, could also be `4 -> 5`. What if we start with `4 -> 5`? Let's see how the ordering would end up being.

In [31]:
res.sort_values("f_test_result", ascending=False)

m = res["for_label"] == 3
res[m]

# 4 -> 5 -> 2 -> 0 -> 3 -> 1


| | for_label | expand_this_label | f_test_result |
|---|---|---|---|
| 15 | 3 | 0 | 56.36868 |
| 16 | 3 | 5 | 52.267693 |
| 17 | 3 | 1 | 37.984223 |
| 18 | 3 | 2 | 25.999023 |
| 19 | 3 | 4 | 11.054443 |


Ok, the order changes considerably. Let's test each of them on the model and see if it makes a difference.

In [33]:
first_chain = [5, 4, 1, 0, 2, 3]

first_chain_classifier = ClassifierChain(
    classifier=SVC(),
    require_dense=[False, True],
    order=first_chain,
)

first_chain_classifier.fit(X_train, y_train)
first_chain_predictions = first_chain_classifier.predict(X_test)

first_chain_accuracy = metrics.accuracy_score(y_test, first_chain_predictions)
first_chain_hamming_loss = metrics.hamming_loss(y_test, first_chain_predictions)

print("Accuracy score: ", first_chain_accuracy)
print("Hamming loss: ", first_chain_hamming_loss)


Accuracy score:  0.027591973244147156
Hamming loss:  0.3241360089186176


Terrible results...

In [34]:
second_chain = [4, 5, 2, 0, 3, 1]

second_chain_classifier = ClassifierChain(
    classifier=SVC(),
    require_dense=[False, True],
    order=second_chain,
)

second_chain_classifier.fit(X_train, y_train)
second_chain_predictions = second_chain_classifier.predict(X_test)

second_chain_accuracy = metrics.accuracy_score(y_test, second_chain_predictions)
second_chain_hamming_loss = metrics.hamming_loss(
    y_test, second_chain_predictions)

print("Accuracy score: ", second_chain_accuracy)
print("Hamming loss: ", second_chain_hamming_loss)


Accuracy score:  0.15802675585284282
Hamming loss:  0.2653288740245262


Slightly better but still way worse than the default ordering. 
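
Before reading too much into these numbers, one thing that may be worth ruling out is a column misalignment. This is an assumption, not something verified in this notebook: if a chain implementation returns its prediction columns in chain order rather than in the dataset's label order, then comparing them against `y_test` would penalize a custom ordering even when the individual classifiers are fine. A minimal sketch of how the columns could be put back (the `restore_label_order` helper is hypothetical and works on plain lists of rows):

```python
from typing import List, Sequence


def restore_label_order(rows: Sequence[Sequence[int]], order: Sequence[int]) -> List[List[int]]:
    """Rearrange columns that are in chain order back into label order.

    Assumes column k of every row holds the prediction for label order[k].
    """
    restored = []
    for row in rows:
        fixed = [0] * len(order)
        for position, label in enumerate(order):
            fixed[label] = row[position]
        restored.append(fixed)
    return restored


order = [2, 0, 1]
# true labels for two instances, in the dataset's label order
y_true = [[1, 0, 0], [0, 1, 1]]
# the same values as they would look if returned in chain order
chain_ordered = [[row[label] for label in order] for row in y_true]

print(chain_ordered)                              # [[0, 1, 0], [1, 0, 1]]
print(restore_label_order(chain_ordered, order))  # [[1, 0, 0], [0, 1, 1]]
```

If predictions restored this way scored close to the baseline, the poor numbers above would be an artifact of the evaluation rather than of the ordering itself.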

## 2.4. Understanding the initial results

Let's try to dive deeper into these results, first by attempting to reproduce the initial results by passing the simplest ordering possible, which in theory is the one that `ClassifierChain` already uses.

In [36]:
label_count = y_train.shape[1]
simple_order = list(range(label_count))

print(f"this is the simple order: {simple_order}")

simple_order_classifier = ClassifierChain(
    classifier=SVC(),
    require_dense=[False, True],
    order=simple_order,
)

simple_order_classifier.fit(X_train, y_train)
simple_order_predictions = simple_order_classifier.predict(X_test)

simple_order_accuracy = metrics.accuracy_score(
    y_test, simple_order_predictions)
simple_order_hamming_loss = metrics.hamming_loss(
    y_test, simple_order_predictions)

print("Accuracy score: ", simple_order_accuracy)
print("Hamming loss: ", simple_order_hamming_loss)

this is the simple order: [0, 1, 2, 3, 4, 5]
Accuracy score:  0.6780936454849499
Hamming loss:  0.09085841694537347


Really interesting results: the scores match the baseline exactly, so the default ordering really is just the dataset's label order. What else can we try here? Maybe reversing the order? Maybe also reversing the F-test orderings?

In [38]:
def test_ordering(order: List[int]):
    print(f"testing order: {order}")

    classifier = ClassifierChain(
        classifier=SVC(),
        require_dense=[False, True],
        order=order,
    )

    classifier.fit(X_train, y_train)
    preds = classifier.predict(X_test)

    acc = metrics.accuracy_score(
        y_test, preds)
    hamming_loss = metrics.hamming_loss(
        y_test, preds)

    print("Accuracy score: ", acc)
    print("Hamming loss: ", hamming_loss)
    print("===")


In [46]:
# simple orders
test_ordering([0, 1, 2, 3, 4, 5]) # default
test_ordering([5, 4, 3, 2, 1, 0]) # reversed

# f-test orders
test_ordering(first_chain) # default
test_ordering(list(reversed(first_chain)))  # reversed

test_ordering(second_chain)  # default
test_ordering(list(reversed(second_chain)))  # reversed


testing order: [0, 1, 2, 3, 4, 5]
Accuracy score:  0.6780936454849499
Hamming loss:  0.09085841694537347
===
testing order: [5, 4, 3, 2, 1, 0]
Accuracy score:  0.051839464882943144
Hamming loss:  0.3171683389074693
===
testing order: [5, 4, 1, 0, 2, 3]
Accuracy score:  0.027591973244147156
Hamming loss:  0.3241360089186176
===
testing order: [3, 2, 0, 1, 4, 5]
Accuracy score:  0.22240802675585283
Hamming loss:  0.26184503901895206
===
testing order: [4, 5, 2, 0, 3, 1]
Accuracy score:  0.15802675585284282
Hamming loss:  0.2653288740245262
===
testing order: [1, 3, 0, 2, 5, 4]
Accuracy score:  0.06438127090301003
Hamming loss:  0.31981605351170567
===


No good results at all. Let's try to brute force this.

In [47]:
from itertools import permutations


In [52]:
len(list(permutations([0, 1, 2, 3, 4, 5])))

720

That is probably too many combinations to try, since each one means training a full chain. Let's first try other datasets. It could be something specific to this dataset.
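
If the brute-force search is picked up again later, it could look roughly like the sketch below. The `best_order` helper and the `score_order` callback are hypothetical; in this notebook the callback would have to train a chain for the given order and return its accuracy, which is what makes 720 runs expensive. The toy scoring function only exists so that the example runs on its own.

```python
from itertools import permutations
from typing import Callable, Sequence, Tuple


def best_order(
    labels: Sequence[int],
    score_order: Callable[[Tuple[int, ...]], float],
) -> Tuple[Tuple[int, ...], float]:
    """Score every permutation of `labels` and return the best one."""
    best, best_score = None, float("-inf")
    for order in permutations(labels):
        score = score_order(order)
        if score > best_score:
            best, best_score = order, score
    return best, best_score


def toy_score(order: Tuple[int, ...]) -> float:
    """Stand-in for training and evaluating a chain: it simply counts
    how many labels sit at their own index."""
    return float(sum(1 for index, label in enumerate(order) if index == label))


print(best_order(range(4), toy_score))  # ((0, 1, 2, 3), 4.0)
```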

In [53]:
emotions_train_data = load_dataset("emotions", "train")
emotions_test_data = load_dataset("emotions", "test")

X_train_emotions, y_train_emotions, _, _ = emotions_train_data
X_test_emotions, y_test_emotions, _, _ = emotions_test_data

emotions:train - exists, not redownloading
emotions:test - exists, not redownloading


In [54]:
emotions_classifier = ClassifierChain(
    classifier=SVC(),
    require_dense=[False, True],
)

emotions_classifier.fit(X_train_emotions, y_train_emotions)
emotions_preds = emotions_classifier.predict(X_test_emotions)

emotions_accuracy_score = metrics.accuracy_score(
    y_test_emotions, emotions_preds)
emotions_hamming_loss = metrics.hamming_loss(
    y_test_emotions, emotions_preds)

print("Accuracy score: ", emotions_accuracy_score)
print("Hamming loss: ", emotions_hamming_loss)
print("===")


Accuracy score:  0.01485148514851485
Hamming loss:  0.3250825082508251
===


Bad initial result. Let's try with the F-test ordering.

In [55]:
emotions_ccf = CalculateLabelsCorrelationWithFTest(alpha=1)
emotions_res = emotions_ccf.fit(X_train_emotions, y_train_emotions)
emotions_res

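| | for_label | expand_this_label | f_test_result |
|---|---|---|---|
| 0 | 0 | 2 | 112.144625 |
| 1 | 0 | 3 | 57.578285 |
| 2 | 0 | 4 | 45.272937 |
| 3 | 0 | 5 | 25.675458 |
| 4 | 0 | 1 | 0.01143 |
| 5 | 1 | 5 | 58.952362 |
| 6 | 1 | 4 | 48.962573 |
| 7 | 1 | 3 | 26.183449 |
| 8 | 1 | 2 | 7.703987 |
| 9 | 1 | 0 | 0.01143 |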


In [69]:
emotions_res.sort_values("f_test_result", ascending=False)

m = emotions_res["for_label"] == 1
emotions_res[m]

# first chain:  2 -> 5 -> 1 -> 4 -> 3 -> 0
# second chain: 5 -> 2 -> 0 -> 3 -> 4 -> 1


| | for_label | expand_this_label | f_test_result |
|---|---|---|---|
| 5 | 1 | 5 | 58.952362 |
| 6 | 1 | 4 | 48.962573 |
| 7 | 1 | 3 | 26.183449 |
| 8 | 1 | 2 | 7.703987 |
| 9 | 1 | 0 | 0.01143 |


In [70]:
def test_emotions_order(order: List[int]):
    print(f"Testing order: {order}")

    emotions_classifier = ClassifierChain(
        classifier=SVC(),
        require_dense=[False, True],
        order=order,
    )

    emotions_classifier.fit(X_train_emotions, y_train_emotions)
    emotions_preds = emotions_classifier.predict(X_test_emotions)

    emotions_accuracy_score = metrics.accuracy_score(
        y_test_emotions, emotions_preds)
    emotions_hamming_loss = metrics.hamming_loss(
        y_test_emotions, emotions_preds)

    print("Accuracy score: ", emotions_accuracy_score)
    print("Hamming loss: ", emotions_hamming_loss)
    print("===")


In [71]:
test_emotions_order([2, 5, 1, 4, 3, 0])  # default
test_emotions_order(list(reversed([2, 5, 1, 4, 3, 0])))  # reversed

test_emotions_order([5, 2, 0, 3, 4, 1])  # default
test_emotions_order(list(reversed([5, 2, 0, 3, 4, 1])))  # reversed


Testing order: [2, 5, 1, 4, 3, 0]
Accuracy score:  0.0
Hamming loss:  0.35148514851485146
===
Testing order: [0, 3, 4, 1, 5, 2]
Accuracy score:  0.0049504950495049506
Hamming loss:  0.34983498349834985
===
Testing order: [5, 2, 0, 3, 4, 1]
Accuracy score:  0.0049504950495049506
Hamming loss:  0.3432343234323432
===
Testing order: [1, 4, 3, 0, 2, 5]
Accuracy score:  0.0
Hamming loss:  0.3250825082508251
===


The initial results were bad, but these are even worse. Let's check one more dataset.

In [72]:
birds_train_data = load_dataset("birds", "train")
birds_test_data = load_dataset("birds", "test")

X_train_birds, y_train_birds, _, _ = birds_train_data
X_test_birds, y_test_birds, _, _ = birds_test_data

birds_classifier = ClassifierChain(
    classifier=SVC(),
    require_dense=[False, True],
)

birds_classifier.fit(X_train_birds, y_train_birds)
birds_preds = birds_classifier.predict(X_test_birds)

birds_accuracy_score = metrics.accuracy_score(
    y_test_birds, birds_preds)
birds_hamming_loss = metrics.hamming_loss(
    y_test_birds, birds_preds)

print("Accuracy score: ", birds_accuracy_score)
print("Hamming loss: ", birds_hamming_loss)
print("===")


birds:train - does not exists downloading
Downloaded birds-train
birds:test - does not exists downloading
Downloaded birds-test
Accuracy score:  0.47058823529411764
Hamming loss:  0.05328336320677855
===


In [95]:
birds_ccf = CalculateLabelsCorrelationWithFTest(alpha=1)
birds_res = birds_ccf.fit(X_train_birds, y_train_birds)

birds_res.sort_values("f_test_result", ascending=False)

m = birds_res["for_label"] == 0
birds_res[m]

# first chain:  14 -> 13 -> 16 -> 9 -> 18 -> 10 -> 4 -> 15 -> 17 -> 8 -> 2 -> 3 -> 7 -> 12 -> 1 -> 11 -> 6 -> 5 -> 0
# second chain: 13 -> 14


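| | for_label | expand_this_label | f_test_result |
|---|---|---|---|
| 0 | 0 | 2 | 5.773432 |
| 1 | 0 | 5 | 4.14065 |
| 2 | 0 | 10 | 1.773774 |
| 3 | 0 | 1 | 0.869565 |
| 4 | 0 | 7 | 0.621812 |
| 5 | 0 | 6 | 0.5997 |
| 6 | 0 | 9 | 0.422915 |
| 7 | 0 | 12 | 0.421607 |
| 8 | 0 | 14 | 0.421607 |
| 9 | 0 | 18 | 0.372255 |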


In [96]:
def test_birds_order(order: List[int]):
    print(f"Testing order: {order}")

    birds_classifier = ClassifierChain(
        classifier=SVC(),
        require_dense=[False, True],
        order=order,
    )

    birds_classifier.fit(X_train_birds, y_train_birds)
    birds_preds = birds_classifier.predict(X_test_birds)

    birds_accuracy_score = metrics.accuracy_score(
        y_test_birds, birds_preds)
    birds_hamming_loss = metrics.hamming_loss(
        y_test_birds, birds_preds)

    print("Accuracy score: ", birds_accuracy_score)
    print("Hamming loss: ", birds_hamming_loss)
    print("===")


In [98]:
test_birds_order([14, 13, 16, 9, 18, 10, 4, 15, 17, 8, 2, 3, 7, 12, 1, 11, 6, 5, 0])  # default

Testing order: [14, 13, 16, 9, 18, 10, 4, 15, 17, 8, 2, 3, 7, 12, 1, 11, 6, 5, 0]
Accuracy score:  0.4674922600619195
Hamming loss:  0.05426103959589376
===


Since there are far too many labels, I only manually ordered a single chain.

We can see that, for the birds dataset, the performance decreased a bit. But **just a bit**. It wasn't nearly as terrible as in the previous datasets.

Given that this dataset has far more labels, I am starting to believe that the problem is the **number of labels**. For a dataset with very few labels, the results are quite sensitive to the ordering, and they might be explained by simple luck. It could be that this particular dataset, with this particular split between train and test, led to that particular result with that order of labels.

So, I see a few necessary improvements to take:
* Establish a proper evaluation pipeline that uses cross validation. This should mitigate the problem of datasets being too sensitive to a particular split and ordering.
* Implement a custom `ClassifierChain` base method that logs the model performance on each label. This should help us to understand if the performance is increasing or decreasing as the chain progresses.
* Fully implement the proposed ordering method. Even if the results are still bad, this study should count towards the final report. It is an effort that I am making, it is being documented, and it is a valid result, even if negative.
* Finally, just for the sake of testing, do an alternative ordering method that actually considers the lowest correlation when building the chain.
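
As a starting point for the per-label logging mentioned above, a sketch like the following could report how each position of the chain performs. The `per_label_accuracy` helper is hypothetical and works on plain lists; it assumes both matrices use the dataset's label order, and the data is made up.

```python
from typing import Dict, List, Sequence


def per_label_accuracy(
    y_true: Sequence[Sequence[int]],
    y_pred: Sequence[Sequence[int]],
    order: Sequence[int],
) -> List[Dict[str, float]]:
    """Accuracy of every label, listed in the order the chain visits them."""
    n_samples = len(y_true)
    report = []
    for position, label in enumerate(order):
        hits = sum(1 for true_row, pred_row in zip(y_true, y_pred)
                   if true_row[label] == pred_row[label])
        report.append({
            "chain_position": position,
            "label": label,
            "accuracy": hits / n_samples,
        })
    return report


# made-up labels and predictions for four instances and three labels
y_true = [[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 0]]

for entry in per_label_accuracy(y_true, y_pred, order=[2, 0, 1]):
    print(entry)
```

Reading the report from top to bottom would show whether accuracy tends to drop as the chain progresses, which is the question the second item in the list raises.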