<center><h1>SVM-Multi-class & Multi-Label Classification</h1></center>

## 1. Multi-class and Multi-Label Classification Using Support Vector Machines

Import packages

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC, LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, make_scorer, hamming_loss
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline as make_pipeline_imb
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from tqdm import tqdm
from scipy.stats import mode

### (a) Download the Anuran Calls (MFCCs) Data Set

In [2]:
Frogs = pd.read_csv('../Data/Frogs_MFCCs.csv')

In [3]:
print(Frogs.shape)
Frogs.head()

(7195, 26)


Unnamed: 0,MFCCs_ 1,MFCCs_ 2,MFCCs_ 3,MFCCs_ 4,MFCCs_ 5,MFCCs_ 6,MFCCs_ 7,MFCCs_ 8,MFCCs_ 9,MFCCs_10,...,MFCCs_17,MFCCs_18,MFCCs_19,MFCCs_20,MFCCs_21,MFCCs_22,Family,Genus,Species,RecordID
0,1.0,0.152936,-0.105586,0.200722,0.317201,0.260764,0.100945,-0.150063,-0.171128,0.124676,...,-0.108351,-0.077623,-0.009568,0.057684,0.11868,0.014038,Leptodactylidae,Adenomera,AdenomeraAndre,1
1,1.0,0.171534,-0.098975,0.268425,0.338672,0.268353,0.060835,-0.222475,-0.207693,0.170883,...,-0.090974,-0.05651,-0.035303,0.02014,0.082263,0.029056,Leptodactylidae,Adenomera,AdenomeraAndre,1
2,1.0,0.152317,-0.082973,0.287128,0.276014,0.189867,0.008714,-0.242234,-0.219153,0.232538,...,-0.050691,-0.02359,-0.066722,-0.025083,0.099108,0.077162,Leptodactylidae,Adenomera,AdenomeraAndre,1
3,1.0,0.224392,0.118985,0.329432,0.372088,0.361005,0.015501,-0.194347,-0.098181,0.270375,...,-0.136009,-0.177037,-0.130498,-0.054766,-0.018691,0.023954,Leptodactylidae,Adenomera,AdenomeraAndre,1
4,1.0,0.087817,-0.068345,0.306967,0.330923,0.249144,0.006884,-0.265423,-0.1727,0.266434,...,-0.048885,-0.053074,-0.08855,-0.031346,0.10861,0.079244,Leptodactylidae,Adenomera,AdenomeraAndre,1


In [4]:
Frogs_train, Frogs_test = train_test_split(Frogs, test_size=0.3, random_state=42)

In [5]:
# Frogs['RecordID'].unique()

### (b) Train a classifier for each label

#### (i) Research

* **Exact Match**: the Exact Match Ratio extends single-label accuracy to multi-label prediction. A sample counts as correct only when its predicted set of labels matches the true set exactly, so a single wrong label makes the whole sample wrong. <br>

* **Hamming score**: a less strict metric derived from the Hamming Loss. The Hamming Loss is the fraction of individual labels predicted incorrectly, averaged over all $N$ samples and $L$ labels. The Hamming Score is the complementary fraction of labels predicted correctly, which makes it useful for judging how well a model classifies each label separately.
  $$ \text{Hamming Score} = 1 - \text{Hamming Loss} $$
  $$ \text{Hamming Loss} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{L} \sum_{l=1}^{L} \mathbb{1}(y_{i,l} \neq \hat{y}_{i,l}) $$

    A Hamming Score closer to 1 indicates better performance, reflecting a higher proportion of correctly predicted labels.
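
As a small worked example of the two metrics (the numbers are made up for illustration and are not taken from the data set), suppose $N = 2$ samples with $L = 3$ labels each (Family, Genus, Species). The first sample has all three labels predicted correctly, and the second has only its Species wrong:

$$ \text{Hamming Loss} = \frac{1}{2}\left(\frac{0}{3} + \frac{1}{3}\right) = \frac{1}{6} \approx 0.167 $$
$$ \text{Hamming Score} = 1 - \frac{1}{6} = \frac{5}{6} \approx 0.833 $$
$$ \text{Exact Match Ratio} = \frac{1}{2} = 0.5 $$

The single wrong label costs the second sample its entire exact-match credit, but only one third of its Hamming credit.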

#### (ii) Train a SVM for each of the labels

In [9]:
# Separating features and labels 
X_train = Frogs_train.iloc[:, :-4]
X_test = Frogs_test.iloc[:, :-4]
labels = ['Family', 'Genus', 'Species']

In [10]:
# parameter grid
param_grid = {
    'estimator__C': np.logspace(-1, 4, 5),
    'estimator__gamma': np.linspace(0.1, 2, 5)
}

# scorer
scorer = make_scorer(accuracy_score)

raw_results = {}
y_pred_full = np.empty((X_test.shape[0], 0), dtype=int)


# Training SVM for each label on raw data
for label in labels:
    print(f"Processing {label}")

    y_train = Frogs_train[label]
    y_test = Frogs_test[label]
    
    svm_raw = OneVsRestClassifier(SVC(kernel='rbf'))
    grid_search_raw = GridSearchCV(svm_raw, param_grid, scoring=scorer, cv=10)
    grid_search_raw.fit(X_train, y_train)
    raw_best_params = grid_search_raw.best_params_
    
    # Evaluate on test set
    y_pred_raw = grid_search_raw.predict(X_test)
    raw_test_score = accuracy_score(y_test, y_pred_raw)
    raw_hamming_loss = hamming_loss(y_test, y_pred_raw)
    
    # results 
    raw_results[label] = {
        'best_params': raw_best_params,
        'test_score': raw_test_score,
        'hamming_loss': raw_hamming_loss
    }
    print(f"Done with {label}")

Processing Family
Done with Family
Processing Genus
Done with Genus
Processing Species
Done with Species


In [11]:
# results
for label, result in raw_results.items():
    print(f"Label: {label}, Best Params: {result['best_params']}, Test Score: {result['test_score']}, Hamming Loss: {result['hamming_loss']}")

Label: Family, Best Params: {'estimator__C': 31.622776601683793, 'estimator__gamma': 2.0}, Test Score: 0.9958314034275128, Hamming Loss: 0.0041685965724872626
Label: Genus, Best Params: {'estimator__C': 31.622776601683793, 'estimator__gamma': 2.0}, Test Score: 0.9911996294580825, Hamming Loss: 0.008800370541917554
Label: Species, Best Params: {'estimator__C': 31.622776601683793, 'estimator__gamma': 2.0}, Test Score: 0.9916628068550255, Hamming Loss: 0.008337193144974525


#### (iii) Repeat 1(b)ii with L1-penalized SVMs

In [15]:
# Parameter grid
param_grid = {
    'onevsrestclassifier__estimator__C': np.logspace(-1, 4, 5)
}

l1_results = {}
y_pred_full = np.empty((X_test.shape[0], 0))  # For collecting all label predictions


# Training L1-penalized SVM for each label on standardized data
for label in labels:
    print(f"Processing {label}")

    y_train = Frogs_train[label]
    y_test = Frogs_test[label]
    
    # Create a pipeline for standardization and L1-penalized SVM
    l1_svm = make_pipeline(StandardScaler(), OneVsRestClassifier(LinearSVC(penalty='l1', dual=False, max_iter=30000)))
    grid_search_l1 = GridSearchCV(l1_svm, param_grid, scoring='accuracy', cv=10)
    grid_search_l1.fit(X_train, y_train)
    l1_best_params = grid_search_l1.best_params_
    
    # Evaluate on test set
    y_pred = grid_search_l1.predict(X_test)
    y_pred_full = np.column_stack((y_pred_full, y_pred))  # Collecting predictions
    
    l1_test_score = accuracy_score(y_test, y_pred)
    l1_hamming_loss = hamming_loss(y_test, y_pred)
    
    # Results for L1-penalized SVM
    l1_results[label] = {
        'best_params': l1_best_params,
        'test_score': l1_test_score,
        'hamming_loss': l1_hamming_loss
    }
    
    print(f'Label: {label}, Best Params: {l1_best_params}, Test Score: {l1_test_score}, Hamming Loss: {l1_hamming_loss}')

Processing Family
Label: Family, Best Params: {'onevsrestclassifier__estimator__C': 1.7782794100389228}, Test Score: 0.9282075034738305, Hamming Loss: 0.07179249652616952
Processing Genus
Label: Genus, Best Params: {'onevsrestclassifier__estimator__C': 31.622776601683793}, Test Score: 0.9416396479851783, Hamming Loss: 0.058360352014821676
Processing Species
Label: Species, Best Params: {'onevsrestclassifier__estimator__C': 1.7782794100389228}, Test Score: 0.9587772116720704, Hamming Loss: 0.041222788327929596


#### (iv) Repeat 1(b)iii by using SMOTE or any other method for imbalance

In [16]:
# Parameter grid
param_grid = {
    'onevsrestclassifier__estimator__C': np.logspace(-1, 4, 5)
}

# Scorer
scorer = make_scorer(accuracy_score)


smote_results = {}

# using SMOTE for class balancing
for label in labels:
    print(f"Processing {label}")
   
    y_train_label = Frogs_train[label]
    y_test_label = Frogs_test[label]

    # Create a pipeline with SMOTE and L1-penalized SVM
    pipeline = make_pipeline_imb(StandardScaler(), SMOTE(random_state=42), 
                                 OneVsRestClassifier(LinearSVC(penalty='l1', dual=False, max_iter=30000)))
    grid_search = GridSearchCV(pipeline, param_grid, scoring=scorer, cv=10)
    grid_search.fit(X_train, y_train_label)
    
    best_params = grid_search.best_params_
    y_pred = grid_search.predict(X_test)
    test_score = accuracy_score(y_test_label, y_pred)
    hamming_loss_value = hamming_loss(y_test_label, y_pred)
    
    # results
    smote_results[label] = {
        'best_params': best_params,
        'test_score': test_score,
        'hamming_loss': hamming_loss_value
    }
    print(f"Done with {label}")

# Print results
for label, result in smote_results.items():
    print(f"Label: {label}")
    print(f"  Best Params: {result['best_params']}, Test Score: {result['test_score']}, Hamming Loss: {result['hamming_loss']}")

Processing Family
Done with Family
Processing Genus
Done with Genus
Processing Species
Done with Species
Label: Family
  Best Params: {'onevsrestclassifier__estimator__C': 31.622776601683793}, Test Score: 0.9092172301991662, Hamming Loss: 0.09078276980083372
Label: Genus
  Best Params: {'onevsrestclassifier__estimator__C': 562.341325190349}, Test Score: 0.9018063918480778, Hamming Loss: 0.09819360815192218
Label: Species
  Best Params: {'onevsrestclassifier__estimator__C': 1.7782794100389228}, Test Score: 0.9573876794812413, Hamming Loss: 0.042612320518758684


## 2. K-Means Clustering on a Multi-Class and Multi-Label Data Set

### Results based on 50 Monte Carlo simulations; parts (a), (b), and (c) are combined

In [38]:
# Monte Carlo version: 50 runs, each with a different K-Means random seed
X = Frogs.drop(['Family', 'Genus', 'Species', 'RecordID'], axis=1)
true_labels = Frogs[['Family', 'Genus', 'Species']]

max_k = 50
optimal_k_list, hamming_distance, hamming_score = [], [], []

for simulation in tqdm(range(50), desc="Finding optimal k"):
    # Choose k in {2, ..., max_k} by the highest silhouette score
    silhouette_scores = []
    for k in range(2, max_k + 1):
        kmeans = KMeans(n_clusters=k, random_state=simulation, n_init=1).fit(X)
        silhouette_scores.append(silhouette_score(X, kmeans.labels_))
    optimal_k = np.argmax(silhouette_scores) + 2  # index 0 corresponds to k = 2
    optimal_k_list.append(optimal_k)
    # Re-cluster with this run's optimal k
    kmeans = KMeans(n_clusters=optimal_k, random_state=simulation, n_init=1).fit(X)
    # Keep cluster ids in their own Series so that the `labels` list from
    # Part 1 is not overwritten and Frogs gains no extra column
    cluster_ids = pd.Series(kmeans.labels_, index=Frogs.index)
    # Majority Family, Genus, and Species within each cluster
    majority_labels = true_labels.groupby(cluster_ids).agg(lambda x: x.value_counts().idxmax())
    # Average Hamming distance (and score) over the three labels
    total_hamming_distance = 0
    for label in ['Family', 'Genus', 'Species']:
        # Each sample is assigned the majority label of its cluster
        predicted_labels = cluster_ids.map(majority_labels[label])
        total_hamming_distance += hamming_loss(true_labels[label], predicted_labels)
    hamming_distance.append(total_hamming_distance / 3)
    hamming_score.append(1.0 - total_hamming_distance / 3)

Finding optimal k: 100%|██████████| 50/50 [29:38<00:00, 35.57s/it]


In [44]:
results = pd.DataFrame()
results['optimal_k'] = optimal_k_list
results['hamming_distance'] = hamming_distance
results['hamming_score'] = hamming_score
results

Unnamed: 0,optimal_k,hamming_distance,hamming_score
0,2,0.352791,0.647209
1,3,0.28126,0.71874
2,2,0.352791,0.647209
3,3,0.28126,0.71874
4,3,0.28126,0.71874
5,3,0.28126,0.71874
6,2,0.352791,0.647209
7,2,0.352791,0.647209
8,2,0.352791,0.647209
9,2,0.352791,0.647209


In [48]:
np.mean(results['hamming_distance']), np.std(results['hamming_distance'])

(0.3141644660643965, 0.035650944390119566)

The optimal k, Hamming distance, and Hamming score for each Monte Carlo run are reported in the table above (rows 0 to 9 are displayed). Across the runs, the average Hamming distance is 0.31 and its standard deviation is 0.036.
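
The majority-label step in the loop above can be illustrated on a toy example. This is a minimal sketch in plain Python rather than the pandas code used in the notebook. The helper name `majority_label_per_cluster`, the cluster assignments, and the labels `"A"`/`"B"` are all hypothetical.

```python
from collections import Counter, defaultdict


def majority_label_per_cluster(cluster_ids, labels):
    """Map each cluster id to the most frequent true label among its members."""
    counts_by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        counts_by_cluster[cluster][label] += 1
    return {cluster: counts.most_common(1)[0][0]
            for cluster, counts in counts_by_cluster.items()}


# Toy data (hypothetical): 7 samples in 2 clusters, one label type
cluster_ids = [0, 0, 0, 1, 1, 1, 1]
families = ["A", "A", "B", "B", "B", "B", "A"]

majority = majority_label_per_cluster(cluster_ids, families)

# Every sample is "predicted" to have the majority label of its cluster
predicted = [majority[c] for c in cluster_ids]

# Hamming distance for this label: fraction of samples whose true label
# differs from their cluster's majority label
toy_hamming_distance = sum(p != t for p, t in zip(predicted, families)) / len(families)
print(majority, toy_hamming_distance)
```

In the notebook the same quantity is computed once per label (Family, Genus, Species) and then averaged over the three.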

### References

* [multilabel-classification](https://stats.stackexchange.com/questions/233275/multilabel-classification-metrics-on-scikit)
* [multilabel_classification_metrics](https://mmuratarat.github.io/2020-01-25/multilabel_classification_metrics)
* [multi-label-classifiers](https://towardsdatascience.com/evaluating-multi-label-classifiers-a31be83da6ea)
* [tqdm](https://pypi.org/project/tqdm/)
* [LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)
* [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
* [make_scorer](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html)
* [hamming_loss](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.hamming_loss.html)
* [OneVsRestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html)
* [make_pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html)
* [Over_sampling.SMOTE](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html)
* [silhouette_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html)
* [squareform](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.squareform.html)
* [linkage](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html)
* [dendrogram](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html)
* [dendrogram II](https://stackoverflow.com/questions/41416498/dendrogram-or-other-plot-from-distance-matrix)