# Understanding Multilabel Classification

`Multi-Label Classification: An Overview. Grigorios Tsoumakas, Ioannis Katakis`

> Traditional single-label classification is concerned with learning from a set of examples that are
> associated with a single label.

- If the number of labels is 2, it is a binary classification problem.
- If the number of labels is greater than 2, it is a multi-class problem.

>In multi-label classification, the examples are associated with a set of labels Y ⊆ L. In the past, multilabel classification was mainly motivated by the tasks of text categorization and medical diagnosis.
Text documents usually belong to more than one conceptual class. For example, a newspaper article
concerning the reactions of the Christian church to the release of the Da Vinci Code film can be
classified into both of the categories Society\Religion and Arts\Movies. Similarly in medical diagnosis,
a patient may be suffering for example from diabetes and prostate cancer at the same time.

![multi-label-table](external_images_for_notebooks/multi-label-table.png)
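
A label set Y ⊆ L is commonly stored as a row of 0/1 indicators, one column per label in L. The snippet below is a minimal sketch of that encoding in plain Python. The label space, the three example documents, and the helper name `to_indicator_matrix` are all hypothetical and chosen only for illustration.

```python
def to_indicator_matrix(label_sets, label_space):
    """Encode each label set as a 0/1 row with one column per label in label_space."""
    return [
        [1 if label in labels else 0 for label in label_space]
        for labels in label_sets
    ]


# Hypothetical label space L (column order of the matrix).
label_space = ["Arts\\Movies", "Science", "Society\\Religion", "Sports"]

# Hypothetical documents, each associated with a set of labels Y ⊆ L.
label_sets = [
    {"Society\\Religion", "Arts\\Movies"},  # e.g. the Da Vinci Code article
    {"Sports"},
    {"Science", "Society\\Religion"},
]

Y = to_indicator_matrix(label_sets, label_space)
for row in Y:
    print(row)
```

scikit-learn offers `MultiLabelBinarizer` in `sklearn.preprocessing` for the same purpose. The hand-written version is shown here only to make the encoding explicit.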

## Methods

https://scikit-learn.org/stable/modules/multiclass.html

We can group the existing methods for multi-label classification into two main categories: a) problem
transformation methods, and b) algorithm adaptation methods. 

Problem transformation methods:
- The first method (dubbed PT1) subjectively or randomly selects one of the multiple labels of each multi-label instance and discards the rest, while the second (dubbed PT2) simply discards every multi-label instance from the multi-label data set.
- These two problem transformation methods discard a lot of the information content of the original multilabel data set and are therefore not considered further in this work.
- The third problem transformation method that we will mention (dubbed PT3), considers each different set of labels that exist in the multi-label data set as a single label. 
    - For example: one label for sports, one for sports and politics, one for science and politics, etc.
- The **most common** problem transformation method (dubbed PT4) learns |L| binary classifiers H_l: X → {l, ¬l}, one for each different label l in L. It transforms the original data set into |L| data sets D_l that contain all examples of the original data set, labelled as l if the labels of the original example contained l and as ¬l otherwise. It is the same solution used in order to deal with a single-label multiclass problem using a binary classifier.

![multilabel_method_pt4](external_images_for_notebooks/multilabel_methd_pt4.png)
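
The PT3 and PT4 transformations described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not code from the paper: the function names, the tiny data set, and the choice to join label names with `+` for PT3 are assumptions.

```python
def pt3_label_powerset(label_sets):
    """PT3: map each distinct label set to one class of a single-label problem."""
    return ["+".join(sorted(labels)) for labels in label_sets]


def pt4_binary_relevance(label_sets, label_space):
    """PT4: build one binary target vector per label l (True = l, False = not l)."""
    return {
        label: [label in labels for labels in label_sets]
        for label in label_space
    }


# Hypothetical label space and label sets for four examples.
label_space = ["Politics", "Science", "Sports"]
label_sets = [
    {"Sports"},
    {"Sports", "Politics"},
    {"Science", "Politics"},
    {"Sports"},
]

pt3_targets = pt3_label_powerset(label_sets)
pt4_targets = pt4_binary_relevance(label_sets, label_space)

print(pt3_targets)
for label, target in pt4_targets.items():
    print(label, target)
```

With PT3 the four examples collapse into three classes, and a single multi-class classifier is trained on them. With PT4 every example appears in all three binary data sets, and one binary classifier is trained per label. scikit-learn's `OneVsRestClassifier` follows the same one-classifier-per-label idea when it is fitted on a binary indicator matrix.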

## Performance Measures

`A Unified View of Multi-Label Performance Measures: Xi-Zhu Wu, Zhi-Hua Zhou`

> Hamming loss: the fraction of misclassified labels.

https://stats.stackexchange.com/questions/336820/what-is-a-hamming-loss-will-we-consider-it-for-an-imbalanced-binary-classifier

> The hamming loss (HL) is the fraction of the wrong labels to the total number of labels. Hence, for the binary case (imbalanced or not), HL=1-Accuracy as you wrote.

> The HL thus presents one clear single-performance-value for multiple-label case in contrast to the precision/recall/f1 that can be evaluated only for independent binary classifiers for each label.
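
Written as a formula, the definition quoted above looks roughly as follows. The notation is an assumption introduced here and does not come from the quoted sources: N is the number of examples, |L| the number of labels, and y_ij and ŷ_ij the true and predicted 0/1 indicators of label j for example i.

$$
\mathrm{HL} = \frac{1}{N\,|L|} \sum_{i=1}^{N} \sum_{j=1}^{|L|} \mathbb{1}\left[\, y_{ij} \neq \hat{y}_{ij} \,\right]
$$

For a single binary label (|L| = 1) the double sum reduces to the fraction of misclassified examples, which is why HL = 1 - Accuracy in that case.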

https://towardsdatascience.com/journey-to-the-center-of-multi-label-classification-384c40229bff

https://medium.com/towards-artificial-intelligence/understanding-multi-label-classification-model-and-accuracy-metrics-1b2a8e2648ca

> Let’s see the scenario for the multi-label case using our example dataset. If question with id 241465 is classified with labels: ‘modeling’, ‘central-limit-theorem’, ‘degrees-of-freedom’ then what we can say? Actual class labels in the dataset were ‘statistical-significance’, ‘modeling’, ‘central-limit-theorem’, ‘degrees-of-freedom’ and ‘spurious-correlation’. Neither it is completely wrong prediction nor it is completely right. If we go for traditional correct vs total ratio based accuracy metric, definitely we won’t be able to judge the classifier. We need something to judge the partial correctness of a multi-label classifier.

### Hamming Loss Metric

> Instead of counting the number of correctly classified data instances, Hamming Loss calculates the loss generated in the bit string of class labels during prediction. It does an XOR operation between the original binary string of class labels and the predicted class labels for a data instance and calculates the average across the dataset.

> The value of Hamming loss ranges from 0 to 1. As it is a loss metric, its interpretation is the reverse of a normal accuracy ratio: a lower Hamming loss indicates a better classifier.
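
As a rough worked example, the sketch below computes the Hamming loss with XOR for the question quoted earlier (three predicted tags against five actual tags) and compares it with exact-match (subset) accuracy. It assumes the label space consists of only those five tags. The real dataset has more tags, which would change the denominator, and the function names are hypothetical.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of label positions where the true and predicted bits differ (XOR)."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t ^ p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


def subset_accuracy(y_true, y_pred):
    """Fraction of examples whose whole label set is predicted exactly."""
    exact = sum(true_row == pred_row for true_row, pred_row in zip(y_true, y_pred))
    return exact / len(y_true)


# Assumption: the label space is restricted to the five tags from the example.
label_space = [
    "statistical-significance",
    "modeling",
    "central-limit-theorem",
    "degrees-of-freedom",
    "spurious-correlation",
]
actual = set(label_space)
predicted = {"modeling", "central-limit-theorem", "degrees-of-freedom"}

y_true = [[int(label in actual) for label in label_space]]
y_pred = [[int(label in predicted) for label in label_space]]

print(hamming_loss(y_true, y_pred))     # 0.4 (2 of 5 label bits are wrong)
print(subset_accuracy(y_true, y_pred))  # 0.0 (the label set is not an exact match)
```

Subset accuracy treats the prediction as completely wrong, while Hamming loss gives credit for the three correctly predicted tags. For real data, scikit-learn provides `hamming_loss` in `sklearn.metrics`.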