# 2. Second model experiments

This second model is directly influenced by the **first model**, which uses a stacking architecture with a label selection step between the first and the second level of binary classifiers. That step uses the **F-test** to select the most relevant labels, which are then used as _features_ in the second level.

Since the initial results of that model are quite interesting, I had the idea of applying the same concept to a **classifier chain** architecture.

## 2.1. Setup

In [2]:
from skmultilearn.problem_transform import BinaryRelevance
from skmultilearn.problem_transform import ClassifierChain
from skmultilearn.dataset import load_dataset, available_data_sets
from sklearn.svm import SVC
from skmultilearn.base.problem_transformation import ProblemTransformationBase
from typing import List, Optional, Any, Tuple, Dict
import numpy as np
import sklearn.metrics as metrics
import json
import pandas as pd
from sklearn.feature_selection import f_classif


## 2.2. Baseline model

The baseline model will be the default **Classifier Chain** available in [the scikit-multilearn package](http://scikit.ml/api/skmultilearn.problem_transform.cc.html).

In [3]:
train_data = load_dataset("scene", "train")
test_data = load_dataset("scene", "test")
# let's use the same "scene" dataset, that was used in the previous notebook (`1_first_model_experiments.ipynb`)

X_train, y_train, _, _ = train_data
X_test, y_test, _, _ = test_data

classifier = ClassifierChain(
    classifier=SVC(),
    require_dense=[False, True]
)

classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)

baseline_cc_accuracy = metrics.accuracy_score(y_test, predictions)
baseline_cc_hamming_loss = metrics.hamming_loss(y_test, predictions)

print("Accuracy score: ", baseline_cc_accuracy)
print("Hamming loss: ", baseline_cc_hamming_loss)

scene:train - exists, not redownloading
scene:test - exists, not redownloading
Accuracy score:  0.6780936454849499
Hamming loss:  0.09085841694537347


Taking this baseline classifier chain implementation, and also the same dataset, let's check if using the **F-test** to order the classifiers will result in better scores.

It is worth mentioning that this baseline is _already better_ than the **Stacking With F-test** model.
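To read the two scores, it helps to recall what they measure in the multi-label case: `accuracy_score` counts a sample as correct only when every label matches (subset accuracy), while `hamming_loss` is the fraction of individual label slots that are wrong. The sketch below recomputes both by hand in plain Python. The function names and the 3x3 arrays are made up for illustration and are not taken from the `scene` dataset.

```python
from typing import List


def subset_accuracy(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Fraction of samples whose whole label vector is predicted exactly."""
    exact_matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return exact_matches / len(y_true)


def hamming_loss_by_hand(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Fraction of individual label slots that are predicted wrongly."""
    wrong_slots = sum(
        ti != pi
        for t, p in zip(y_true, y_pred)
        for ti, pi in zip(t, p)
    )
    total_slots = len(y_true) * len(y_true[0])
    return wrong_slots / total_slots


# Illustrative data only: 3 samples, 3 labels.
toy_true = [[1, 0, 0], [0, 1, 1], [1, 1, 0]]
toy_pred = [[1, 0, 0], [0, 1, 0], [0, 1, 0]]

toy_accuracy = subset_accuracy(toy_true, toy_pred)      # 1 of 3 rows matches exactly
toy_hamming = hamming_loss_by_hand(toy_true, toy_pred)  # 2 of 9 slots are wrong

print("Subset accuracy:", toy_accuracy)
print("Hamming loss:", toy_hamming)
```

A single wrong label is enough to make a whole row count as a miss for accuracy, which is why the accuracy can be low while the Hamming loss stays moderate.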

The documentation for this `ClassifierChain` class states that "for L labels it trains L classifiers ordered in a chain according to the **Bayesian chain rule**". However, nothing in the source code implements this: the default ordering is simply the sequence in which the labels are already organized in the training data. Here is the code:

```python
# class ClassifierChain(ProblemTransformationBase):
# ...
    def _order(self):
        if self.order is not None:
            return self.order

        try:
            return list(range(self._label_count))
        except AttributeError:
            raise NotFittedError("This Classifier Chain has not been fit yet")
```

So, the ordering is just the order of the labels in the dataset. This is weird. Anyway, the good thing is that we can use this `ClassifierChain` as the base class for our implementation, and simply pass a custom ordering to it.

Something else to consider, however, is that it will simply iterate over the classifiers following the order we pass to it. I was thinking of something akin to a graph, where the path taken in the chain would be dependent on the output of the previous classifier (instead of simply accumulating the predictions of all classifiers). For instance:

* The start of the chain is the label `l1`.
* For a given instance, it predicts `l1 = 1`. Then take a path towards `l2`, with which it has a high correlation according to the **F-test**.
* For another instance, it predicts `l1 = 0`. Then take a path towards `l3`, with which it has no correlation, according to the **F-test**.

This would be more similar to what is done in the stacking method.
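As a rough illustration of the graph idea in the bullets above, the path could be driven by a routing table keyed on the current label and its predicted value. Everything in this sketch is hypothetical: `walk_routed_chain`, the routing table, and the stub predictors are invented for illustration and are not part of scikit-multilearn.

```python
from typing import Callable, Dict, List, Optional, Tuple

# (label, predicted value) -> next label to visit; a missing entry ends the path.
Routes = Dict[Tuple[str, int], str]
Predictor = Callable[[str, Dict[str, int]], int]


def walk_routed_chain(start: str, predict: Predictor, routes: Routes) -> List[Tuple[str, int]]:
    """Follow the routing table, letting each prediction choose the next label."""
    known: Dict[str, int] = {}          # predictions made so far on this path
    path: List[Tuple[str, int]] = []
    label: Optional[str] = start
    while label is not None and label not in known:
        value = predict(label, known)   # a real model would also receive the features
        known[label] = value
        path.append((label, value))
        label = routes.get((label, value))
    return path


# Hypothetical routes mirroring the bullets: l1 = 1 leads to l2, l1 = 0 leads to l3.
toy_routes: Routes = {("l1", 1): "l2", ("l1", 0): "l3"}


def says_l1_present(label: str, known: Dict[str, int]) -> int:
    """Stub predictor: only l1 is predicted as present."""
    return 1 if label == "l1" else 0


def says_l1_absent(label: str, known: Dict[str, int]) -> int:
    """Stub predictor: every label is predicted as absent."""
    return 0


path_when_present = walk_routed_chain("l1", says_l1_present, toy_routes)
path_when_absent = walk_routed_chain("l1", says_l1_absent, toy_routes)
print(path_when_present)
print(path_when_absent)
```

Unlike the standard chain, a walk like this does not necessarily visit every label, so a full design would still need a rule for the labels left off the path.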

It is something to consider, but for now, let's just use the default implementation.

## 2.3. Ordering labels via F-test

First, let's take a closer look at the F-test calculation, reusing what was built in the previous notebook. Then we can proceed with actually implementing the model.
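As a reference point, `f_classif` computes the one-way ANOVA F-value of each feature against the classes. The sketch below reproduces that statistic in plain Python for a single feature and checks that, for two binary columns, the score is the same in both directions. This is consistent with the output tables, where the pairs `(0, 4)` and `(4, 0)` share the same value. The helper name `anova_f` and the toy columns are made up for illustration.

```python
from typing import Sequence


def anova_f(values: Sequence[float], groups: Sequence[int]) -> float:
    """One-way ANOVA F statistic of `values`, grouped by the classes in `groups`."""
    n = len(values)
    classes = sorted(set(groups))
    k = len(classes)
    grand_mean = sum(values) / n

    ss_between = 0.0  # spread of the group means around the grand mean
    ss_within = 0.0   # spread of the values around their own group mean
    for c in classes:
        members = [v for v, g in zip(values, groups) if g == c]
        group_mean = sum(members) / len(members)
        ss_between += len(members) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in members)

    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Illustrative binary label columns (not taken from the "scene" dataset).
label_a = [1, 1, 1, 0, 0, 0, 1, 0]
label_b = [1, 1, 0, 0, 0, 1, 1, 0]

f_a_given_b = anova_f(label_a, label_b)  # label_a as feature, label_b as class
f_b_given_a = anova_f(label_b, label_a)  # roles swapped

print(f_a_given_b, f_b_given_a)
```

Because the score is symmetric for binary labels, each pair only carries one piece of information, regardless of which label is treated as the feature.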

In [32]:
class CalculateLabelsCorrelationWithFTest():
    def __init__(
        self,
        alpha: float = 0.5,
    ):
        if alpha < 0.0 or alpha > 1.0:
            raise ValueError("alpha must be >= 0.0 and <= 1.0")

        self.alpha = alpha
        self.correlated_labels_map = pd.DataFrame()
        self.labels_count = 0

    def fit(self, X: Any, y: Any):
        self.labels_count = y.shape[1]

        f_tested_label_pairs = self.calculate_f_test_for_all_label_pairs(y)
        
        self.correlated_labels_map = self.get_map_of_correlated_labels(
            f_tested_label_pairs)

        return self.correlated_labels_map

    def calculate_f_test_for_all_label_pairs(self, label_classifications: Any) -> List[Dict[str, Any]]:
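        results = []

        # convert the sparse label matrix to a dense array once,
        # instead of on every iteration of the inner loop
        dense_labels = np.asarray(label_classifications.todense())

        for i in range(0, self.labels_count):
            for j in range(0, self.labels_count):
                if i == j:
                    continue

                # label i plays the role of the single feature (column vector)
                base_label = self.convert_matrix_to_array(dense_labels[:, i])

                # label j plays the role of the target (flat vector)
                against_label = self.convert_matrix_to_vector(dense_labels[:, j])

                # f_classif returns (F values, p-values);
                # keep the F value of the single feature
                f_test_result = f_classif(base_label, against_label)[0][0]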

                results.append({
                    "label_being_tested": i,
                    "against_label": j,
                    "f_test_result": float(f_test_result)
                })

        return results

    def convert_matrix_to_array(self, matrix: Any):
        return np.asarray(matrix).reshape(-1, 1)

    def convert_matrix_to_vector(self, matrix: Any):
        return np.asarray(matrix).reshape(-1)

    def get_map_of_correlated_labels(self, f_test_results: List[Dict[str, Any]]) -> pd.DataFrame:
        temp_df = pd.DataFrame(f_test_results)

        sorted_temp_df = temp_df.sort_values(
            by=["label_being_tested", "f_test_result"],
            ascending=[True, False])
        # ordering in descending order by the F-test result,
        # following what the main article describes

        selected_features = []

        for i in range(0, self.labels_count):
            mask = sorted_temp_df["label_being_tested"] == i
            split_df = sorted_temp_df[mask].reset_index(drop=True)

            big_f = split_df["f_test_result"].sum()
            max_cum_f = self.alpha * big_f

            cum_f = 0
            for _, row in split_df.iterrows():
                cum_f += row["f_test_result"]
                if cum_f > max_cum_f:
                    break

                selected_features.append({
                    "for_label": i,
                    "expand_this_label": int(row["against_label"]),
                    "f_test_result": float(row["f_test_result"]),
                })

        cols = ["for_label", "expand_this_label", "f_test_result"]
        return pd.DataFrame(selected_features, columns=cols)
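The cumulative cut-off inside `get_map_of_correlated_labels` can be easier to follow in isolation. The sketch below applies the same rule (stop as soon as the running sum of F values exceeds `alpha` times the total) to the five F values that the next cell prints for label 0. The function name `select_by_cumulative_f` is made up for illustration.

```python
from typing import List, Tuple


def select_by_cumulative_f(scored: List[Tuple[int, float]], alpha: float) -> List[int]:
    """Keep top-scoring labels while the running F sum stays within alpha * total."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    budget = alpha * sum(score for _, score in ranked)

    kept: List[int] = []
    running = 0.0
    for label, score in ranked:
        running += score
        if running > budget:
            break  # the label that crosses the budget is left out
        kept.append(label)
    return kept


# (other label, F value) pairs for label 0, copied from the notebook output.
label_0_scores = [
    (2, 56.728266),
    (3, 56.36868),
    (1, 45.657072),
    (5, 33.173567),
    (4, 30.068018),
]

kept_half = select_by_cumulative_f(label_0_scores, alpha=0.5)
kept_all = select_by_cumulative_f(label_0_scores, alpha=1.0)
print(kept_half)
print(kept_all)
```

With `alpha=0.5` only label 2 survives, because adding label 3 already pushes the running sum past half of the total. With `alpha=1` every pair is kept, which is why the cells below use it to inspect all pairwise scores.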



In [14]:
ccf = CalculateLabelsCorrelationWithFTest(alpha=1)
res = ccf.fit(X_train, y_train)
res

index,for_label,expand_this_label,f_test_result
0,0,2,56.728266
1,0,3,56.36868
2,0,1,45.657072
3,0,5,33.173567
4,0,4,30.068018
5,1,4,59.336173
6,1,0,45.657072
7,1,5,44.889245
8,1,2,38.222988
9,1,3,37.984223


In [16]:
res.sort_values("f_test_result", ascending=True)

index,for_label,expand_this_label,f_test_result
19,3,4,11.054443
24,4,3,11.054443
18,3,2,25.999023
14,2,3,25.999023
4,0,4,30.068018
23,4,0,30.068018
29,5,0,33.173567
3,0,5,33.173567
17,3,1,37.984223
9,1,3,37.984223


The initial idea was to start with the "most uncorrelated label" overall. That is: the label that shares the lowest correlation with all other labels.

However, this does not make sense, as all label correlations are calculated in pairs. So I can simply start with either label from the pair that presents the highest correlation. From there, walk through the pairs, always taking the next pair that shares the current label, has the highest correlation, and leads to a label that was not yet used.

Let's try to do the ordering manually, then we can think about how to implement it.

In [25]:
res.sort_values("f_test_result", ascending=False)

m = res["for_label"] == 2
res[m]

# 5 -> 4 -> 1 -> 0 -> 2 -> 3

index,for_label,expand_this_label,f_test_result
10,2,0,56.728266
11,2,5,55.765976
12,2,4,54.713588
13,2,1,38.222988
14,2,3,25.999023


For this dataset, it seems like the expected ordering would be `5 -> 4 -> 1 -> 0 -> 2 -> 3`.

The initial pair, `5 -> 4`, could also be `4 -> 5`. What if we start with `4 -> 5`? Let's see how the ordering would end up being.

In [31]:
res.sort_values("f_test_result", ascending=False)

m = res["for_label"] == 3
res[m]

# 4 -> 5 -> 2 -> 0 -> 3 -> 1


index,for_label,expand_this_label,f_test_result
15,3,0,56.36868
16,3,5,52.267693
17,3,1,37.984223
18,3,2,25.999023
19,3,4,11.054443


Ok, the order changes considerably. Let's test each of them on the model and see if it makes a difference.
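The manual walk used for both orderings can be written down as a small greedy routine, so that the two chains are reproducible. This is a minimal sketch in plain Python: the function name `greedy_chain_order` is made up, and the starting pair is passed in explicitly because the score of the pair `(4, 5)` does not appear in the outputs shown here. The other pair scores are copied from the tables above.

```python
from typing import Dict, List, Tuple

# Pairwise F values from the notebook outputs, keyed by (smaller label, larger label).
# The pair (4, 5) is not shown in the outputs, so it is left out on purpose.
pair_scores: Dict[Tuple[int, int], float] = {
    (0, 1): 45.657072, (0, 2): 56.728266, (0, 3): 56.36868,
    (0, 4): 30.068018, (0, 5): 33.173567,
    (1, 2): 38.222988, (1, 3): 37.984223, (1, 4): 59.336173, (1, 5): 44.889245,
    (2, 3): 25.999023, (2, 4): 54.713588, (2, 5): 55.765976,
    (3, 4): 11.054443, (3, 5): 52.267693,
}


def score(a: int, b: int) -> float:
    """Symmetric lookup: the F value of a pair does not depend on its direction."""
    return pair_scores[(min(a, b), max(a, b))]


def greedy_chain_order(first_pair: Tuple[int, int], n_labels: int) -> List[int]:
    """Start from a given pair, then always move to the unused label
    that has the highest F value with the current end of the chain."""
    order = list(first_pair)
    while len(order) < n_labels:
        current = order[-1]
        unused = [label for label in range(n_labels) if label not in order]
        order.append(max(unused, key=lambda label: score(current, label)))
    return order


chain_from_5_4 = greedy_chain_order((5, 4), n_labels=6)
chain_from_4_5 = greedy_chain_order((4, 5), n_labels=6)
print(chain_from_5_4)
print(chain_from_4_5)
```

Both results match the orderings found by hand, which confirms that the direction of the first pair alone is enough to change the rest of the chain.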

In [33]:
first_chain = [5, 4, 1, 0, 2, 3]

first_chain_classifier = ClassifierChain(
    classifier=SVC(),
    require_dense=[False, True],
    order=first_chain,
)

first_chain_classifier.fit(X_train, y_train)
first_chain_predictions = first_chain_classifier.predict(X_test)

first_chain_accuracy = metrics.accuracy_score(y_test, first_chain_predictions)
first_chain_hamming_loss = metrics.hamming_loss(y_test, first_chain_predictions)

print("Accuracy score: ", first_chain_accuracy)
print("Hamming loss: ", first_chain_hamming_loss)


Accuracy score:  0.027591973244147156
Hamming loss:  0.3241360089186176


Terrible results...

In [34]:
second_chain = [4, 5, 2, 0, 3, 1]

second_chain_classifier = ClassifierChain(
    classifier=SVC(),
    require_dense=[False, True],
    order=second_chain,
)

second_chain_classifier.fit(X_train, y_train)
second_chain_predictions = second_chain_classifier.predict(X_test)

second_chain_accuracy = metrics.accuracy_score(y_test, second_chain_predictions)
second_chain_hamming_loss = metrics.hamming_loss(
    y_test, second_chain_predictions)

print("Accuracy score: ", second_chain_accuracy)
print("Hamming loss: ", second_chain_hamming_loss)


Accuracy score:  0.15802675585284282
Hamming loss:  0.2653288740245262


Slightly better but still way worse than the default ordering.
Let's try to dive deeper into that.