<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" align="left" src="https://i.creativecommons.org/l/by-nc-sa/4.0/80x15.png" /></a>&nbsp;| Luca Mossina and [Emmanuel Rachelson](https://personnel.isae-supaero.fr/emmanuel-rachelson?lang=en) | <a href="https://supaerodatascience.github.io/machine-learning/">https://supaerodatascience.github.io/machine-learning/</a>

<div style="font-size:22pt; line-height:25pt; font-weight:bold; text-align:center;">An application of SVMs in Multi-Label Classification (MLC)</div>

We now turn to a problem that is both harder and less common than binary classification: **multi-label classification** (MLC).  
Given a list of possible labels, the task is to find the one **or more** labels associated with a data point.  
For instance, imagine extracting the key topics from a newspaper article, or classifying the elements that make up an image. Each item can carry several labels.

Given a set of labels $\mathcal{L} = \{l_1, l_2, ..., l_k\}$, where the presence of each label is encoded by a binary indicator in $\{0,1\}$, we want to map elements of a feature space $\mathcal{X}$ to a subset of $\mathcal{L}$:  

$$h : \mathcal{X} \longrightarrow \mathcal{P}(\mathcal{L})$$

The two typical approaches for such problems are known as **Binary Relevance** (BR) and **Label Powerset** (LP).  

 - BR: each label in $\mathcal{L}$ defines its own binary classification problem, $h_{i} : \mathcal{X} \longrightarrow \{0,1\}, i = 1, ..., |\mathcal{L}|$, where $h_{i}(x) = 1$ means that label $l_{i}$ is assigned to $x$.  
 This method ignores any correlation between labels (it treats them as independent).

 - LP: transforms an MLC problem into a multiclass classification problem, mapping elements $x \in \mathcal{X}$ directly to $s \in \mathcal{P}(\mathcal{L})$.  
 This method rapidly becomes impractical, since the number of elements in $\mathcal{P}(\mathcal{L})$ grows exponentially with the number of labels ($2^{|\mathcal{L}|}$ possible subsets).
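
To make the two transformations concrete, here is a minimal sketch on a toy label matrix (the matrix `Y` and all variable names are illustrative assumptions, not part of the yeast dataset). BR yields one binary target per label, while LP assigns one class id per distinct label subset observed in the data.

```python
import numpy as np

# Toy label matrix: 5 samples, k = 3 labels (1 = label present).
Y = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
])

# Binary Relevance: one binary target vector per label (the columns of Y).
br_targets = [Y[:, i] for i in range(Y.shape[1])]

# Label Powerset: each distinct row (label subset) becomes one class id.
subsets, lp_target = np.unique(Y, axis=0, return_inverse=True)
lp_target = lp_target.ravel()  # some NumPy versions return a 2-D inverse

print("BR binary problems:", len(br_targets))
print("LP classes seen in the data:", len(subsets))
print("LP classes possible:", 2 ** Y.shape[1])
```

With 14 labels, as in the dataset used below, LP could in principle face up to $2^{14} = 16384$ classes, whereas BR always needs exactly 14 binary classifiers.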
 
If you are curious about MLC, you are encouraged to read these references:  
J. Read, P. Reutemann, B. Pfahringer, and Geoff Holmes. **MEKA: A multi-label/multi-target extension to Weka**. Journal of Machine Learning Research, 17(21):1-5, 2016.  
G. Tsoumakas and I. Katakis. **Multi-label classification: An overview**. International Journal on Data Warehousing and Mining, 3(3):1-13, 2007.  
G. Tsoumakas, I. Katakis, and I. Vlahavas. **Mining multi-label data**. Data mining and knowledge discovery handbook, pages 667-685. Springer, 2010.
 
Many other variations exist, but for today we'll focus on BR, the most straightforward to implement. What we implement below is a good starting point if you want to explore the approach in:  
J. Read, B. Pfahringer, G. Holmes, and E. Frank. **Classifier chains for multi-label classification**. Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 254-269, 2009.

The equivalent approach for LP is found in:  
G. Tsoumakas, I. Katakis, and I. Vlahavas. **Random k-labelsets for multi-label classification**. IEEE Transactions on Knowledge and Data Engineering, 23(7):1079-1089, 2011.

For this exercise, we will use a biology dataset from [Elisseeff and Weston 2001]: this dataset contains micro-array expressions and phylogenetic profiles for 2417 yeast genes. Each gene is annotated with a subset of 14 functional categories (e.g. metabolism, energy, etc.) of the top level of the functional catalogue.

<div class="alert alert-warning">

**Exercise**<br>
<ul>

<li> Find a suitable package to load the file `yeast.arff`.  <br>
    Hint: <a href="https://docs.scipy.org/doc/scipy/reference/io.html">scipy.io</a> and _read the doc_.<br>
<li> Store the data in a pandas dataframe.<br>
    Hint: the class columns are loaded as byte strings (e.g. b'0'), but we need integers; look for 'str.decode('utf-8')'.
<li> Check the dataset: you should have 2417 samples $\times$ 117 columns (103 features + 14 labels).
</ul>
</div>
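
As an illustration of the second hint, the sketch below loads a tiny in-memory ARFF document and decodes its class column. The ARFF content, `toy_arff` and `toy_df` are made up for the example; only `scipy.io.arff.loadarff` and the pandas `str.decode` accessor are real API calls.

```python
from io import StringIO

import pandas as pd
from scipy.io import arff

# Hypothetical two-row ARFF content, standing in for yeast.arff.
toy_arff = """@relation toy
@attribute Att1 numeric
@attribute Class1 {0,1}
@data
0.5,1
-0.2,0
"""

# loadarff accepts a file name or a file-like object and returns (data, metadata).
data, meta = arff.loadarff(StringIO(toy_arff))
toy_df = pd.DataFrame(data)

# Nominal attributes are loaded as byte strings (b'0', b'1'):
# decode them to text, then cast to integers.
toy_df["Class1"] = toy_df["Class1"].str.decode("utf-8").astype(int)

print(toy_df.dtypes)
print(toy_df)
```

The solution below takes a different but equivalent route, replacing the byte strings directly with `df.replace`.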

In [34]:
import pandas as pd

from scipy.io import arff
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

from utils import shuffle_and_split

In [35]:
path_to_yeast = '../data/yeast/yeast.arff'

arff_file = arff.loadarff(path_to_yeast)
df = pd.DataFrame(arff_file[0])

assert(df.shape == (2417, 117))
df

Unnamed: 0,Class1,Class2,Class3,Class4,Class5,Class6,Class7,Class8,Class9,Class10,...,Att94,Att95,Att96,Att97,Att98,Att99,Att100,Att101,Att102,Att103
0,b'0',b'0',b'1',b'1',b'0',b'0',b'0',b'0',b'0',b'0',...,0.039048,-0.018712,-0.034711,-0.038675,-0.039102,0.017429,-0.052659,-0.042402,0.118473,0.125632
1,b'0',b'0',b'0',b'0',b'0',b'0',b'1',b'1',b'0',b'0',...,-0.001198,0.030594,-0.021814,0.010430,-0.013809,-0.009248,-0.027318,-0.014191,0.022783,0.123785
2,b'0',b'1',b'1',b'0',b'0',b'0',b'0',b'0',b'0',b'0',...,0.195777,0.022294,0.012583,0.002233,-0.002072,-0.010981,0.007615,-0.063378,-0.084181,-0.034402
3,b'0',b'0',b'1',b'1',b'0',b'0',b'0',b'0',b'0',b'0',...,0.001189,-0.066241,-0.046999,-0.066604,-0.055773,-0.041941,0.051066,0.004976,0.193972,0.131866
4,b'1',b'1',b'0',b'0',b'0',b'0',b'0',b'0',b'0',b'0',...,-0.035045,-0.080882,0.028468,-0.073576,0.050630,0.084832,-0.019570,-0.021650,-0.068326,-0.091155
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2412,b'1',b'1',b'1',b'0',b'0',b'0',b'0',b'0',b'0',b'0',...,0.030530,-0.025955,-0.023820,-0.027757,-0.021961,0.007489,-0.027194,-0.023020,0.021652,0.137082
2413,b'1',b'1',b'1',b'0',b'0',b'0',b'0',b'0',b'0',b'0',...,0.000522,-0.007117,-0.025083,-0.011406,-0.017818,-0.004101,-0.046161,-0.011562,0.060844,0.137526
2414,b'1',b'1',b'0',b'0',b'0',b'0',b'0',b'0',b'0',b'0',...,-0.053868,-0.057654,0.006976,-0.057880,0.171701,0.045545,-0.051639,-0.038713,-0.026947,0.005620
2415,b'0',b'1',b'1',b'0',b'0',b'0',b'1',b'1',b'0',b'0',...,0.040829,-0.055524,0.018662,-0.053545,-0.056904,-0.045714,-0.039205,-0.019985,0.280843,0.143382


<div class="alert alert-warning">

**Exercise**<br>
<ul>
<li> Manually fit an SVM classifier for each label in the dataset.
<li> Apply a 60 ∕ 40 train/test split: 60% of the data points to train the model, 40% to test it.  <br>
   Remember: it is good practice to <b>randomly shuffle</b> the data, in case they are ordered according to some data-dependent criterion.
<li> Report some performance measure.
</ul>
</div>
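
The shuffle-then-split step can be sketched as follows. The function `shuffle_and_split_sketch` is a hypothetical stand-in written for this example: the `shuffle_and_split` helper imported from `utils` in this notebook is not shown, so its actual behaviour may differ. Only the return order (train features, train labels, test features, test labels) is taken from how the helper is called below.

```python
import numpy as np


def shuffle_and_split_sketch(X, y, train_ratio, seed=0):
    """Hypothetical helper: shuffle the rows, then split them into train and test parts."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))              # random order of the row indices
    n_train = int(round(train_ratio * len(X)))   # number of training samples
    train_idx, test_idx = order[:n_train], order[n_train:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]


# Toy data: 10 samples, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

X_tr, y_tr, X_te, y_te = shuffle_and_split_sketch(X_demo, y_demo, 0.6)
print(X_tr.shape, X_te.shape)
```

Applying the same permutation to features and labels keeps each sample paired with its own label, which is the point to check in any hand-written split.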

In [36]:
df = df.replace({b'0': 0, b'1': 1})

label_names = [column for column in df.columns if 'Class' in column]
feature_names = [column for column in df.columns if 'Att' in column]

# Kernels to compare for every label
available_kernel = ['poly', 'linear', 'rbf', 'sigmoid']

# Test accuracy of each kernel, for each label
accuracy_kernel_label = {}

for label in label_names:
    accuracy_kernel = {}

    X = df[feature_names]
    y = df[label]

    # Shuffle, then keep 60% of the samples for training and 40% for testing
    X_train, y_train, X_test, y_test = shuffle_and_split(X, y, 0.6)

    for kernel in available_kernel:
        # One binary SVM per (label, kernel) pair, with C fixed to 1
        svm_model = svm.SVC(kernel=kernel, C=1.)
        svm_model.fit(X_train, y_train)

        y_pred = svm_model.predict(X_test)
        accuracy_kernel[kernel] = accuracy_score(y_test, y_pred)

    accuracy_kernel_label[label] = accuracy_kernel

print(accuracy_kernel_label)

{'Class1': {'poly': 0.7807652533609101, 'linear': 0.7859358841778697, 'rbf': 0.7942088934850051, 'sigmoid': 0.749741468459152}, 'Class2': {'poly': 0.656670113753878, 'linear': 0.6204756980351603, 'rbf': 0.6546018614270941, 'sigmoid': 0.5873836608066184}, 'Class3': {'poly': 0.7197518097207859, 'linear': 0.7094105480868665, 'rbf': 0.7290589451913133, 'sigmoid': 0.6659772492244054}, 'Class4': {'poly': 0.7280248190279214, 'linear': 0.7228541882109617, 'rbf': 0.7445708376421923, 'sigmoid': 0.656670113753878}, 'Class5': {'poly': 0.781799379524302, 'linear': 0.7538779731127198, 'rbf': 0.7942088934850051, 'sigmoid': 0.6597724922440538}, 'Class6': {'poly': 0.7766287487073423, 'linear': 0.749741468459152, 'rbf': 0.7724922440537746, 'sigmoid': 0.7383660806618407}, 'Class7': {'poly': 0.828335056876939, 'linear': 0.8200620475698035, 'rbf': 0.827300930713547, 'sigmoid': 0.8159255429162358}, 'Class8': {'poly': 0.8086866597724922, 'linear': 0.8035160289555325, 'rbf': 0.8066184074457083, 'sigmoid': 0.7

In [37]:
best_kernel_scores = {}

for label, kernel_scores in accuracy_kernel_label.items():
    best_kernel = max(kernel_scores, key=kernel_scores.get)
    best_score = kernel_scores[best_kernel]
    
    best_kernel_scores[label] = (best_kernel, best_score)

print(best_kernel_scores)

{'Class1': ('rbf', 0.7942088934850051), 'Class2': ('poly', 0.656670113753878), 'Class3': ('rbf', 0.7290589451913133), 'Class4': ('rbf', 0.7445708376421923), 'Class5': ('rbf', 0.7942088934850051), 'Class6': ('poly', 0.7766287487073423), 'Class7': ('poly', 0.828335056876939), 'Class8': ('poly', 0.8086866597724922), 'Class9': ('poly', 0.9286452947259566), 'Class10': ('poly', 0.8934850051706308), 'Class11': ('poly', 0.8800413650465356), 'Class12': ('linear', 0.7425025853154085), 'Class13': ('linear', 0.734229576008273), 'Class14': ('poly', 0.9917269906928645)}
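
When reading accuracies like these, it can help to compare them with a trivial baseline, since a label that is rarely present can yield a high accuracy even for a model that never predicts it. The sketch below uses synthetic data (not the yeast labels) and scikit-learn's `DummyClassifier`; whether this effect explains any of the scores above is not verified here.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic rare label: 2 positives out of 200 samples (not the yeast data).
X_demo = np.zeros((200, 1))
y_demo = np.zeros(200, dtype=int)
y_demo[:2] = 1

# Baseline that always predicts the most frequent class (here: 0).
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
y_pred = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, y_pred)
baseline_recall = recall_score(y_demo, y_pred)

print("Accuracy:", baseline_accuracy)  # 198 correct out of 200
print("Recall:", baseline_recall)      # no positive sample is recovered
```

This is one reason to also report precision, recall and F1 per label, as done further down.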


In [38]:
# Define the kernels to search over
available_kernel = ['poly', 'linear', 'rbf', 'sigmoid']

# Create a dictionary to store the best kernel for each label
best_kernel_values = {}

for label in label_names:
    # Define X and y for the current label
    X_label = df[feature_names]
    y_label = df[label]
    
    # Define the SVM model (C is kept fixed; the kernel is set by the grid search)
    svm_model = svm.SVC(C=1.)

    # Specify the desired training-to-testing ratio
    train_size = 0.6
    
    # Set up Stratified Shuffle Split with the desired ratio
    cv = StratifiedShuffleSplit(n_splits=5, train_size=train_size, test_size=1-train_size, random_state=42)
    
    # Perform GridSearchCV to find the best kernel
    param_grid = {'kernel': available_kernel}
    grid_search = GridSearchCV(estimator=svm_model, param_grid=param_grid, cv=cv, n_jobs=-1, scoring='accuracy')
    grid_search.fit(X_label, y_label)
    
    # Store the best kernel in the dictionary
    best_kernel_values[label] = grid_search.best_params_['kernel']

print(best_kernel_values)


{'Class1': 'rbf', 'Class2': 'rbf', 'Class3': 'rbf', 'Class4': 'rbf', 'Class5': 'rbf', 'Class6': 'poly', 'Class7': 'poly', 'Class8': 'poly', 'Class9': 'poly', 'Class10': 'poly', 'Class11': 'poly', 'Class12': 'poly', 'Class13': 'poly', 'Class14': 'poly'}


In [39]:
# Define the range of C values to search over
C_values = [0.01, 0.1, 1, 10, 100]

# Create a dictionary to store the best C values for each label
best_C_values = {}

for label in label_names:
    # Define X and y for the current label
    X_label = df[feature_names]
    y_label = df[label]
    
    # Define the SVM model
    svm_model = svm.SVC(kernel=best_kernel_values[label])  # Use the best kernel

    # Specify the desired training-to-testing ratio
    train_size = 0.6
    
    # Set up Stratified Shuffle Split with the desired ratio
    cv = StratifiedShuffleSplit(n_splits=5, train_size=train_size, test_size=1-train_size, random_state=42)
    
    # Perform GridSearchCV to find the best C value
    param_grid = {'C': C_values}
    grid_search = GridSearchCV(estimator=svm_model, param_grid=param_grid, cv=cv, n_jobs=-1, scoring='accuracy')
    grid_search.fit(X_label, y_label)
    
    # Store the best C value in the dictionary
    best_C_values[label] = grid_search.best_params_['C']

print(best_C_values)

{'Class1': 1, 'Class2': 1, 'Class3': 1, 'Class4': 1, 'Class5': 1, 'Class6': 1, 'Class7': 1, 'Class8': 1, 'Class9': 0.01, 'Class10': 1, 'Class11': 1, 'Class12': 1, 'Class13': 1, 'Class14': 0.01}


We now search over the kernel and C jointly, using a finer grid of C values:

In [40]:
# Define the C values and kernels to search over jointly
C_values = [0.005, 0.01, 0.05, 0.08, 0.1, 0.5, 0.75, 1, 2, 3]
available_kernel = ['poly', 'linear', 'rbf', 'sigmoid']

# Create a dictionary to store the best (kernel, C) pair for each label
best_kernel_and_C_values = {}

for label in label_names:
    # Define X and y for the current label
    X_label = df[feature_names]
    y_label = df[label]
    
    # Define the SVM model
    svm_model = svm.SVC()

    # Specify the desired training-to-testing ratio
    train_size = 0.6
    
    # Set up Stratified Shuffle Split with the desired ratio
    cv = StratifiedShuffleSplit(n_splits=5, train_size=train_size, test_size=1-train_size, random_state=42)
    
    # Perform GridSearchCV to find the best (kernel, C) combination
    param_grid = {'C': C_values, 'kernel': available_kernel}
    grid_search = GridSearchCV(estimator=svm_model, param_grid=param_grid, cv=cv, n_jobs=-1, scoring='accuracy')
    grid_search.fit(X_label, y_label)
    
    # Store the best (kernel, C) pair in the dictionary
    best_kernel_and_C_values[label] = (grid_search.best_params_['kernel'], grid_search.best_params_['C'])

In [41]:
print(best_kernel_and_C_values)

{'Class1': ('rbf', 1), 'Class2': ('rbf', 1), 'Class3': ('rbf', 0.5), 'Class4': ('rbf', 1), 'Class5': ('rbf', 1), 'Class6': ('rbf', 2), 'Class7': ('rbf', 2), 'Class8': ('poly', 1), 'Class9': ('poly', 2), 'Class10': ('rbf', 2), 'Class11': ('poly', 2), 'Class12': ('poly', 1), 'Class13': ('poly', 1), 'Class14': ('poly', 0.005)}


In [42]:
from sklearn.metrics import precision_score, recall_score, f1_score, hamming_loss
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

import numpy as np

# Initialize dictionaries to store performance metrics for each label
precision_scores = {}
recall_scores = {}
f1_scores = {}
hamming_losses = {}

# Iterate over each label
for label in label_names:
    X_label = df[feature_names]
    y_label = df[label]

    X_label, y_label = shuffle(X_label, y_label, random_state=42)
    
    # Split the dataset into a training and test set
    X_train, X_test, y_train, y_test = train_test_split(X_label, y_label, test_size=0.4, random_state=42)
    
    best_kernel, best_C = best_kernel_and_C_values[label]
    
    # Create a separate SVM model for the current label
    svm_model = svm.SVC(kernel=best_kernel, C=best_C)
    
    # Train the model on the training set for the current label
    svm_model.fit(X_train, y_train)
    
    # Make predictions on the test set for the current label
    y_pred = svm_model.predict(X_test)
    
    # Calculate precision, recall, F1-score, and Hamming Loss for the current label
    precision_scores[label] = precision_score(y_test, y_pred, zero_division=1)
    recall_scores[label] = recall_score(y_test, y_pred, zero_division=1)
    f1_scores[label] = f1_score(y_test, y_pred, zero_division=1)
    hamming_losses[label] = hamming_loss(y_test, y_pred)

# Average each metric over the labels
# (this is macro-averaging: every label has the same weight)
macro_precision = np.mean(list(precision_scores.values()))
macro_recall = np.mean(list(recall_scores.values()))
macro_f1 = np.mean(list(f1_scores.values()))
mean_hamming = np.mean(list(hamming_losses.values()))

print("Macro-Averaged Precision:", macro_precision)
print("Macro-Averaged Recall:", macro_recall)
print("Macro-Averaged F1-Score:", macro_f1)
print("Mean Hamming Loss:", mean_hamming)
print('\n\n')

# Print performance metrics for each label
for label in label_names:
    print(f"Label: {label}")
    print(f"Model chosen : (kernel: {best_kernel_and_C_values[label][0]}, C: {best_kernel_and_C_values[label][1]})")
    print("Precision:", precision_scores[label])
    print("Recall:", recall_scores[label])
    print("F1-Score:", f1_scores[label])
    print("Hamming Loss:", hamming_losses[label])
    print('\n-----------------------------\n')


Macro-Averaged Precision: 0.8028199574029516
Macro-Averaged Recall: 0.357655718431777
Macro-Averaged F1-Score: 0.39793720199452565
Mean Hamming Loss: 0.18496085093810016



Label: Class1
Model chosen : (kernel: rbf, C: 1)
Precision: 0.770949720670391
Recall: 0.4524590163934426
F1-Score: 0.5702479338842975
Hamming Loss: 0.21509824198552224

-----------------------------

Label: Class2
Model chosen : (kernel: rbf, C: 1)
Precision: 0.6886792452830188
Recall: 0.522673031026253
F1-Score: 0.5943012211668929
Hamming Loss: 0.30920372285418823

-----------------------------

Label: Class3
Model chosen : (kernel: rbf, C: 0.5)
Precision: 0.7197802197802198
Recall: 0.6717948717948717
F1-Score: 0.6949602122015915
Hamming Loss: 0.2378490175801448

-----------------------------

Label: Class4
Model chosen : (kernel: rbf, C: 1)
Precision: 0.7981220657276995
Recall: 0.4748603351955307
F1-Score: 0.5954465849387041
Hamming Loss: 0.23888314374353672

-----------------------------

Label: Class5


**Congratulations**, you reached the end of the practice session! 