<center><h1>SVM</h1></center>

## 1. Multi-class and Multi-Label Classification Using Support Vector Machines

Import packages

In [None]:
import numpy as np
import pandas as pd

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score, hamming_loss, silhouette_samples, silhouette_score
from sklearn.model_selection import train_test_split, GridSearchCV 
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

import warnings
warnings.filterwarnings("ignore")
from sklearn.exceptions import ConvergenceWarning
# warnings.filterwarnings("ignore", category=ConvergenceWarning)

### (a) Download the Anuran Calls (MFCCs) Data Set

In [2]:
# Read data
df = pd.read_csv("../../data/Anuran_Calls(MFCCs)/Frogs_MFCCs.csv")
# drop the RecordID column
df.drop('RecordID', axis=1, inplace=True)
df

Unnamed: 0,MFCCs_ 1,MFCCs_ 2,MFCCs_ 3,MFCCs_ 4,MFCCs_ 5,MFCCs_ 6,MFCCs_ 7,MFCCs_ 8,MFCCs_ 9,MFCCs_10,...,MFCCs_16,MFCCs_17,MFCCs_18,MFCCs_19,MFCCs_20,MFCCs_21,MFCCs_22,Family,Genus,Species
0,1.0,0.152936,-0.105586,0.200722,0.317201,0.260764,0.100945,-0.150063,-0.171128,0.124676,...,-0.024017,-0.108351,-0.077623,-0.009568,0.057684,0.118680,0.014038,Leptodactylidae,Adenomera,AdenomeraAndre
1,1.0,0.171534,-0.098975,0.268425,0.338672,0.268353,0.060835,-0.222475,-0.207693,0.170883,...,0.012022,-0.090974,-0.056510,-0.035303,0.020140,0.082263,0.029056,Leptodactylidae,Adenomera,AdenomeraAndre
2,1.0,0.152317,-0.082973,0.287128,0.276014,0.189867,0.008714,-0.242234,-0.219153,0.232538,...,0.083536,-0.050691,-0.023590,-0.066722,-0.025083,0.099108,0.077162,Leptodactylidae,Adenomera,AdenomeraAndre
3,1.0,0.224392,0.118985,0.329432,0.372088,0.361005,0.015501,-0.194347,-0.098181,0.270375,...,-0.050224,-0.136009,-0.177037,-0.130498,-0.054766,-0.018691,0.023954,Leptodactylidae,Adenomera,AdenomeraAndre
4,1.0,0.087817,-0.068345,0.306967,0.330923,0.249144,0.006884,-0.265423,-0.172700,0.266434,...,0.062837,-0.048885,-0.053074,-0.088550,-0.031346,0.108610,0.079244,Leptodactylidae,Adenomera,AdenomeraAndre
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7190,1.0,-0.554504,-0.337717,0.035533,0.034511,0.443451,0.093889,-0.100753,0.037087,0.081075,...,-0.000861,0.069430,0.071001,0.021591,0.052449,-0.021860,-0.079860,Hylidae,Scinax,ScinaxRuber
7191,1.0,-0.517273,-0.370574,0.030673,0.068097,0.402890,0.096628,-0.116460,0.063727,0.089034,...,0.006457,0.061127,0.068978,0.017745,0.046461,-0.015418,-0.101892,Hylidae,Scinax,ScinaxRuber
7192,1.0,-0.582557,-0.343237,0.029468,0.064179,0.385596,0.114905,-0.103317,0.070370,0.081317,...,0.008696,0.082474,0.077771,-0.009688,0.027834,-0.000531,-0.080425,Hylidae,Scinax,ScinaxRuber
7193,1.0,-0.519497,-0.307553,-0.004922,0.072865,0.377131,0.086866,-0.115799,0.056979,0.089316,...,0.001924,0.051796,0.069073,0.017963,0.041803,-0.027911,-0.096895,Hylidae,Scinax,ScinaxRuber


In [3]:
# train test split
df_train, df_test = train_test_split(df, 
                                     train_size = 0.7,
                                     shuffle = True,
                                     random_state = 50 )
print(df_train.shape)
print(df_test.shape)

# Xs and y label
X_train = df_train.iloc[:, :-3]
y_train = df_train.iloc[:, -3:]
y_Family_train = df_train.iloc[:, -3]
y_Genus_train = df_train.iloc[:, -2]
y_Species_train = df_train.iloc[:, -1]
y_labels_train = [y_Family_train, y_Genus_train, y_Species_train]

X_test = df_test.iloc[:, :-3]
y_test = df_test.iloc[:, -3:]
y_Family_test = df_test.iloc[:, -3]
y_Genus_test = df_test.iloc[:, -2]
y_Species_test = df_test.iloc[:, -1]
y_labels_test = [y_Family_test, y_Genus_test, y_Species_test]

(5036, 25)
(2159, 25)


### (b) Train a classifier for each label

In [4]:
# summary for all methods in Question 1
summary = {}

#### (i) Research

**References**:

1. [Metrics for Multilabel Classification](https://mmuratarat.github.io/2020-01-25/multilabel_classification_metrics)
2. [Multilabel classification metrics on scikit](https://stats.stackexchange.com/questions/233275/multilabel-classification-metrics-on-scikit)

---
1. **Exact Match**
    * Ignore the partially correct predictions (consider them incorrect) , and extend the *accuracy* used in single label case for multi-label prediction.
    ![equation](./pics/ExactMatchRatio.png)
  
1. **Hamming Loss**
    * Hamming loss takes into account the prediction error (an incorrect label is predicted) and missing error (a relevant label not predicted), normalized over total number of classes and total number of examples.
    ![equation](./pics/HammingLoss.png)
    

#### (ii) Train a SVM for each of the labels

**References**:

1. [SVC doc](https://scikit-learn.org/dev/modules/generated/sklearn.svm.SVC.html)
2. [LinearSVC Doc](https://scikit-learn.org/dev/modules/generated/sklearn.svm.LinearSVC.html)

##### Gaussian SVC without Standardization

In [5]:
# Define a function to train SVM model
def get_SVC_model(C_values, gamma_values, X_train):
    # Define parameters
    params = {
        'C': C_values,           # param C
        'gamma': gamma_values,   # param gamma
        'decision_function_shape': ['ovr'],
        'kernel': ['rbf'],
        'max_iter': [10000]
    }
    # GridSearchCV model
    model_SVC_cv = GridSearchCV(estimator=SVC(), param_grid=params, cv=10, n_jobs=-1, verbose=0)
    # Fit classifier for 3 labels
    models = {}
    for y in y_labels_train:
        model_SVC_cv.fit(X_train, y)
        cv_score = model_SVC_cv.best_score_
        best_params = model_SVC_cv.best_params_  # get the best parameters
        print("Label: ", y.name)
        print("Cross-validation score: ", cv_score)
        print("Best parameters: \n\t", best_params)
        # store best SVC model
        models[y.name] = model_SVC_cv.best_estimator_
        
    return models

##### 
In my code, I do the grid search for 3 labels separately, it might get 3 different classifiers (with different params)  for the 3 labels.

Now start experimenting on large and small parameters 

In [None]:
# SVM for large params
# parameters
C_large = [10**i for i in range(0, 5)]
gamma_large = [10**i for i in range(0, 5)]
# print(C_large)

# run model # for each label 
get_SVC_model(C_large, gamma_large, X_train)

# time: 2m

Label:  Family
Cross-validation score:  0.9922540155890056
Best parameters: 
	 {'C': 100, 'decision_function_shape': 'ovr', 'gamma': 1, 'kernel': 'rbf', 'max_iter': 10000}
Label:  Genus
Cross-validation score:  0.9906663195430591
Best parameters: 
	 {'C': 100, 'decision_function_shape': 'ovr', 'gamma': 1, 'kernel': 'rbf', 'max_iter': 10000}
Label:  Species
Cross-validation score:  0.989475448893938
Best parameters: 
	 {'C': 10, 'decision_function_shape': 'ovr', 'gamma': 1, 'kernel': 'rbf', 'max_iter': 10000}


{'Family': SVC(C=100, gamma=1, max_iter=10000),
 'Genus': SVC(C=100, gamma=1, max_iter=10000),
 'Species': SVC(C=10, gamma=1, max_iter=10000)}

In [None]:
# SVM for small params
# parameters
C_small = [10**i for i in range(-5, 0)]
gamma_small = [10**i for i in range(-5, 0)]
# print(C_small)

# run model # for each label 
get_SVC_model(C_small, gamma_small, X_train)

# time: 37s

Label:  Family
Cross-validation score:  0.8761017229953613
Best parameters: 
	 {'C': 0.1, 'decision_function_shape': 'ovr', 'gamma': 0.1, 'kernel': 'rbf', 'max_iter': 10000}
Label:  Genus
Cross-validation score:  0.8065949540850138
Best parameters: 
	 {'C': 0.1, 'decision_function_shape': 'ovr', 'gamma': 0.1, 'kernel': 'rbf', 'max_iter': 10000}
Label:  Species
Cross-validation score:  0.8445221527975008
Best parameters: 
	 {'C': 0.1, 'decision_function_shape': 'ovr', 'gamma': 0.1, 'kernel': 'rbf', 'max_iter': 10000}


{'Family': SVC(C=0.1, gamma=0.1, max_iter=10000),
 'Genus': SVC(C=0.1, gamma=0.1, max_iter=10000),
 'Species': SVC(C=0.1, gamma=0.1, max_iter=10000)}

In [None]:
# Decide final model
C_values = np.logspace(-1, 2, 20)
# print(C)
gamma_values = np.linspace(0.1, 1, 20)
# print(gamma)
model_SVC_final = get_SVC_model(C_values, gamma_values, X_train)

# time: 2m

Label:  Family
Cross-validation score:  0.9922540155890056
Best parameters: 
	 {'C': np.float64(48.32930238571752), 'decision_function_shape': 'ovr', 'gamma': np.float64(0.9526315789473684), 'kernel': 'rbf', 'max_iter': 10000}
Label:  Genus
Cross-validation score:  0.9906671084603491
Best parameters: 
	 {'C': np.float64(11.288378916846883), 'decision_function_shape': 'ovr', 'gamma': np.float64(0.8578947368421053), 'kernel': 'rbf', 'max_iter': 10000}
Label:  Species
Cross-validation score:  0.9896742560509957
Best parameters: 
	 {'C': np.float64(16.23776739188721), 'decision_function_shape': 'ovr', 'gamma': np.float64(0.9526315789473684), 'kernel': 'rbf', 'max_iter': 10000}


In [None]:
# calculate metrics
def Calc_metrics(models):
    # predict 3 labels separately
    pred_cols = {}
    for y in y_labels_train:
        # train the final model
        models[y.name].fit(X_train, y)
        # predictions
        pred_cols[y.name] = models[y.name].predict(X_test)
    # combine into a dataFrame y_pred
    y_pred = pd.DataFrame(pred_cols, index=y_test.index)

    # calculate metrics with y_test
    EMscore_test = (y_test == y_pred).all(axis=1).mean() # Exact Match 
    # HLscore_test = hamming_loss(y_test, y_pred)# Hamming Loss
    # manually calculate haaming loss
    HLscore_test = (y_test != y_pred).sum().sum() / y_test.shape[0] * y_test.shape[1]

    return [EMscore_test.item(), HLscore_test.item()]

summary["Gaussian SVC"] = Calc_metrics(model_SVC_final)
print("Exact Match Ratio and Hamming Loss for Gaussian SVC: \n", summary["Gaussian SVC"])

# time: 0s

Exact Match Ratio and Hamming Loss for Gaussian SVC: 
 [0.9851783232978231, 0.09587772116720704]


##### Gaussian SVC with Standardization

In [None]:
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.fit_transform(X_test)

# SVM for large params
get_SVC_model(C_large, gamma_large, X_train_std)
# SVM for small params
get_SVC_model(C_small, gamma_small, X_train_std)

# time 4m

Label:  Family
Cross-validation score:  0.9104464482943608
Best parameters: 
	 {'C': 10, 'decision_function_shape': 'ovr', 'gamma': 1, 'kernel': 'rbf', 'max_iter': 10000}
Label:  Genus
Cross-validation score:  0.8629887342611001
Best parameters: 
	 {'C': 10, 'decision_function_shape': 'ovr', 'gamma': 1, 'kernel': 'rbf', 'max_iter': 10000}
Label:  Species
Cross-validation score:  0.8419384486730411
Best parameters: 
	 {'C': 10, 'decision_function_shape': 'ovr', 'gamma': 1, 'kernel': 'rbf', 'max_iter': 10000}
Label:  Family
Cross-validation score:  0.9718029126826343
Best parameters: 
	 {'C': 0.1, 'decision_function_shape': 'ovr', 'gamma': 0.1, 'kernel': 'rbf', 'max_iter': 10000}
Label:  Genus
Cross-validation score:  0.9491665088832086
Best parameters: 
	 {'C': 0.1, 'decision_function_shape': 'ovr', 'gamma': 0.1, 'kernel': 'rbf', 'max_iter': 10000}
Label:  Species
Cross-validation score:  0.9525411025908044
Best parameters: 
	 {'C': 0.1, 'decision_function_shape': 'ovr', 'gamma': 0.1, '

{'Family': SVC(C=0.1, gamma=0.1, max_iter=10000),
 'Genus': SVC(C=0.1, gamma=0.1, max_iter=10000),
 'Species': SVC(C=0.1, gamma=0.1, max_iter=10000)}

In [None]:
# Decide final model
C = np.logspace(-1, 1, 20)
# print(C)
gamma = np.linspace(0.05, 1, 20)
# print(gamma)
model_SVC_std_final = get_SVC_model(C, gamma, X_train_std)

# metrics
summary["Gaussian SVC with std"] = Calc_metrics(model_SVC_std_final)
print("Exact Match Ratio and Hamming Loss for Gaussian SVC with standarzation: \n", summary["Gaussian SVC with std"])

# time 10m

Label:  Family
Cross-validation score:  0.9928492536842437
Best parameters: 
	 {'C': np.float64(10.0), 'decision_function_shape': 'ovr', 'gamma': np.float64(0.05), 'kernel': 'rbf', 'max_iter': 10000}
Label:  Genus
Cross-validation score:  0.9906659250844143
Best parameters: 
	 {'C': np.float64(6.158482110660261), 'decision_function_shape': 'ovr', 'gamma': np.float64(0.05), 'kernel': 'rbf', 'max_iter': 10000}
Label:  Species
Cross-validation score:  0.9898730632080535
Best parameters: 
	 {'C': np.float64(6.158482110660261), 'decision_function_shape': 'ovr', 'gamma': np.float64(0.05), 'kernel': 'rbf', 'max_iter': 10000}
Exact Match Ratio and Hamming Loss for Gaussian SVC with standarzation: 
 [0.936544696618805, 0.37378415933302456]


In [None]:
# # based on the previous result, narrow down again
# NO NEED
# # Decide final model
# C = np.linspace(7, 10, 20)
# # print(C)
# gamma = np.linspace(0.005, 1, 20)
# # print(gamma)
# model_SVC_std_final = get_SVC_model(C, gamma, X_train_std)

# # metrics
# summary["Gaussian SVC with std"] = Calc_metrics(model_SVC_std_final)
# print("Exact Match Ratio and Hamming Loss for Gaussian SVC with standarzation: \n", summary["Gaussian SVC with std"])

# # time 12m

#### (iii) Repeat 1(b)ii with L1-penalized SVMs

In [13]:
# Define a function to train SVM-L1 model
def get_SVC_L1_model(C, X_train):
    # Define parameters
    params = {
        'C': C,           # param C
        'multi_class': ['ovr'], # same as before
        "penalty": ['l1'],
        'dual': [False], # using L1, n_samples > n_features => have to set 'dual' to false
        'max_iter': [20000]
    }
    model_SVC_L1 = GridSearchCV(estimator=LinearSVC(), param_grid=params, cv=10, n_jobs=8)
    # Fit classifier for 3 labels
    models = {}
    for y in y_labels_train:
        model_SVC_L1.fit(X_train, y)
        cv_score = model_SVC_L1.best_score_
        best_params = model_SVC_L1.best_params_  # get the best parameters
        print("Label: ", y.name)
        print("Cross-validation score: ", cv_score)
        print("Best parameters: \n\t", best_params)
        # store best SVC L1 model
        models[y.name] = model_SVC_L1.best_estimator_
        
    return models

In [None]:
# Large params # re-use large C from Q1 ii.
get_SVC_L1_model(C_large, X_train_std)
# small params 
get_SVC_L1_model(C_small, X_train_std)

# time 1m

Label:  Family
Cross-validation score:  0.9378554072391051
Best parameters: 
	 {'C': 1, 'dual': False, 'max_iter': 20000, 'multi_class': 'ovr', 'penalty': 'l1'}
Label:  Genus
Cross-validation score:  0.9491700590110133
Best parameters: 
	 {'C': 10, 'dual': False, 'max_iter': 20000, 'multi_class': 'ovr', 'penalty': 'l1'}
Label:  Species
Cross-validation score:  0.9590954274353877
Best parameters: 
	 {'C': 1, 'dual': False, 'max_iter': 20000, 'multi_class': 'ovr', 'penalty': 'l1'}
Label:  Family
Cross-validation score:  0.9362669222758686
Best parameters: 
	 {'C': 0.1, 'dual': False, 'max_iter': 20000, 'multi_class': 'ovr', 'penalty': 'l1'}
Label:  Genus
Cross-validation score:  0.9380479030578434
Best parameters: 
	 {'C': 0.1, 'dual': False, 'max_iter': 20000, 'multi_class': 'ovr', 'penalty': 'l1'}
Label:  Species
Cross-validation score:  0.9499633153460192
Best parameters: 
	 {'C': 0.1, 'dual': False, 'max_iter': 20000, 'multi_class': 'ovr', 'penalty': 'l1'}


{'Family': LinearSVC(C=0.1, dual=False, max_iter=20000, penalty='l1'),
 'Genus': LinearSVC(C=0.1, dual=False, max_iter=20000, penalty='l1'),
 'Species': LinearSVC(C=0.1, dual=False, max_iter=20000, penalty='l1')}

In [None]:
# find final model
C = np.logspace(-1, 1, 20)
model_SVC_L1_final = get_SVC_L1_model(C, X_train_std)
# metrics
summary["SVC with L1"] = Calc_metrics(model_SVC_L1_final)
print("Exact Match Ratio and Hamming Loss for SVC with L1 penalty: \n", summary["SVC with L1"])

# time 2m 37s

Label:  Family
Cross-validation score:  0.9378554072391051
Best parameters: 
	 {'C': np.float64(0.8858667904100825), 'dual': False, 'max_iter': 20000, 'multi_class': 'ovr', 'penalty': 'l1'}
Label:  Genus
Cross-validation score:  0.9491700590110133
Best parameters: 
	 {'C': np.float64(10.0), 'dual': False, 'max_iter': 20000, 'multi_class': 'ovr', 'penalty': 'l1'}
Label:  Species
Cross-validation score:  0.9590954274353877
Best parameters: 
	 {'C': np.float64(0.8858667904100825), 'dual': False, 'max_iter': 20000, 'multi_class': 'ovr', 'penalty': 'l1'}
Exact Match Ratio and Hamming Loss for SVC with L1 penalty: 
 [0.9087540528022232, 0.5016211208893007]


#### (iv) Repeat 1(b)iii by using SMOTE or any other method for imbalance

In [16]:
# check if imbalanced
for y in y_labels_train:
    print(y.value_counts())

Family
Leptodactylidae    3119
Hylidae            1486
Dendrobatidae       386
Bufonidae            45
Name: count, dtype: int64
Genus
Adenomera        2921
Hypsiboas        1094
Ameerega          386
Dendropsophus     216
Leptodactylus     198
Scinax             97
Osteocephalus      79
Rhinella           45
Name: count, dtype: int64
Species
AdenomeraHylaedactylus    2443
HypsiboasCordobae          778
AdenomeraAndre             478
Ameeregatrivittata         386
HypsiboasCinerascens       316
HylaMinuta                 216
LeptodactylusFuscus        198
ScinaxRuber                 97
OsteocephalusOophagus       79
Rhinellagranulosa           45
Name: count, dtype: int64


In [None]:
def get_smote_SVM_model(C_values, X_train):
    params = {
        'svm__penalty': ['l1'],
        'svm__dual': [False],
        'svm__C': C_values,
        'svm__multi_class': ['ovr'],
        'svm__max_iter': [20000]
    }

    # pipeline: smote and SVM
    pipeline = Pipeline(steps=[['smote', SMOTE(random_state=42)],
                               ['svm', LinearSVC(penalty = 'l1', dual = False, max_iter = 6000)]])

    # GridSearchCV model
    model_smote_SVM_cv = GridSearchCV(estimator=pipeline, param_grid=params, cv=10, verbose=0, n_jobs=-1)

    # Fit  classifier for 3 labels
    models = {}
    for y in y_labels_train:
        model_smote_SVM_cv.fit(X_train, y)
        cv_score = model_smote_SVM_cv.best_score_
        best_params = model_smote_SVM_cv.best_params_  # get the best parameters
        print("Label: ", y.name)
        print("Cross-validation score: ", cv_score)
        print("Best parameters: \n\t", best_params)
        # store best SVC model
        models[y.name] = model_smote_SVM_cv.best_estimator_
        
    return models

get_smote_SVM_model(C_large, X_train_std)
get_smote_SVM_model(C_small, X_train_std)

# time: 11m

Label:  Family
Cross-validation score:  0.9191868629492885
Best parameters: 
	 {'svm__C': 10, 'svm__dual': False, 'svm__max_iter': 20000, 'svm__multi_class': 'ovr', 'svm__penalty': 'l1'}
Label:  Genus
Cross-validation score:  0.915604783994446
Best parameters: 
	 {'svm__C': 10, 'svm__dual': False, 'svm__max_iter': 20000, 'svm__multi_class': 'ovr', 'svm__penalty': 'l1'}
Label:  Species
Cross-validation score:  0.9577045662532739
Best parameters: 
	 {'svm__C': 10, 'svm__dual': False, 'svm__max_iter': 20000, 'svm__multi_class': 'ovr', 'svm__penalty': 'l1'}
Label:  Family
Cross-validation score:  0.9172015525892265
Best parameters: 
	 {'svm__C': 0.1, 'svm__dual': False, 'svm__max_iter': 20000, 'svm__multi_class': 'ovr', 'svm__penalty': 'l1'}
Label:  Genus
Cross-validation score:  0.9126262267663858
Best parameters: 
	 {'svm__C': 0.1, 'svm__dual': False, 'svm__max_iter': 20000, 'svm__multi_class': 'ovr', 'svm__penalty': 'l1'}
Label:  Species
Cross-validation score:  0.9563140995298053
Best 

{'Family': Pipeline(steps=[('smote', SMOTE(random_state=42)),
                 ['svm',
                  LinearSVC(C=0.1, dual=False, max_iter=20000, penalty='l1')]]),
 'Genus': Pipeline(steps=[('smote', SMOTE(random_state=42)),
                 ['svm',
                  LinearSVC(C=0.1, dual=False, max_iter=20000, penalty='l1')]]),
 'Species': Pipeline(steps=[('smote', SMOTE(random_state=42)),
                 ['svm',
                  LinearSVC(C=0.1, dual=False, max_iter=20000, penalty='l1')]])}

In [18]:
# Deicde final model
C_values = np.logspace(-1, 1, 20)
model_smote_SVM_final = get_smote_SVM_model(C_values, X_train_std)

summary["SVM with L1 and smote"] = Calc_metrics(model_smote_SVM_final)
print("Exact Match Ratio and Hamming Loss for Gaussian SVC: \n", summary["SVM with L1 and smote"])

# time: 26m

Label:  Family
Cross-validation score:  0.9195840828047588
Best parameters: 
	 {'svm__C': np.float64(2.3357214690901213), 'svm__dual': False, 'svm__max_iter': 20000, 'svm__multi_class': 'ovr', 'svm__penalty': 'l1'}
Label:  Genus
Cross-validation score:  0.915604783994446
Best parameters: 
	 {'svm__C': np.float64(10.0), 'svm__dual': False, 'svm__max_iter': 20000, 'svm__multi_class': 'ovr', 'svm__penalty': 'l1'}
Label:  Species
Cross-validation score:  0.9577045662532739
Best parameters: 
	 {'svm__C': np.float64(6.158482110660261), 'svm__dual': False, 'svm__max_iter': 20000, 'svm__multi_class': 'ovr', 'svm__penalty': 'l1'}
Exact Match Ratio and Hamming Loss for Gaussian SVC: 
 [0.8550254747568319, 0.6808707735062529]


In [21]:
summary = pd.DataFrame(summary)
summary.index = ["Exact Match Ratio", "Hamming Loss"]
summary

Unnamed: 0,Gaussian SVC,Gaussian SVC with std,SVC with L1,SVM with L1 and smote
Exact Match Ratio,0.985178,0.936545,0.908754,0.855025
Hamming Loss,0.095878,0.373784,0.501621,0.680871


##### Observations:

* *Gaussian SVC* has the best metrics result (both for Exact Ratio and Hamming Loss).
* For *SVC with L1* and *SVM with L1 and smote*, the Exact Ratio score is high but Hamming Loss is also high. This might be because  when predictions are wrong, they tend to be very wrong.
* Conclusion:  *Gaussian SVC* has the best performance.

## 2. K-Means Clustering on a Multi-Class and Multi-Label Data Set

### (a) Use k-means clustering

**References**:

* [KMeans Doc](https://scikit-learn.org/1.5/modules/generated/sklearn.cluster.KMeans.html)

In [19]:
X = df.iloc[:, :-3]
y = df.iloc[:, -3:]

In [70]:
# find the best # of clusters using silhouette score.
def find_best_k(X_train, max_clusters, random_state=None):
    best_k, max_score = 2, -1
    best_model = None
    best_labels = None
    
    for n in range(2, max_clusters + 1):
        model_KMeans = KMeans(n_clusters=n, random_state=random_state)
        labels = model_KMeans.fit_predict(X_train)
        sil_result = silhouette_score(X_train, labels)
        
        if sil_result > max_score:
            max_score = sil_result 
            best_k = n
            best_model = model_KMeans
            best_labels = labels    # best cluster assignments
            
    return best_model, best_k, best_labels

for trial in range(5):
    print("Trial", trial + 1)
    model_KMeans, best_k, cluster_labels = find_best_k(X, 50, random_state=trial)
    print("Best K: ", best_k)
    print(cluster_labels)

Trial 1
Best K:  4
[1 1 1 ... 1 1 1]
Trial 2
Best K:  4
[1 1 1 ... 1 1 1]
Trial 3
Best K:  4
[3 3 3 ... 3 3 3]
Trial 4
Best K:  6
[3 3 3 ... 3 3 3]
Trial 5
Best K:  3
[2 2 2 ... 2 2 2]


### (b) Determine which family is the majority

In [78]:
def get_majority_labels(cluster_labels, y_train, n_clusters):
    # Get majority labels for each cluster
    major_labels = {} # pd.DataFrame(columns=y_train.columns)
    
    for n in range(n_clusters):
        idx, = np.where(cluster_labels == n)
        cluster_samples = y_train.iloc[idx, :]
        pred_rows = {}
        # loop thru 3 labels
        for label in y_train.columns:
            # Count the most common class for this label in curr cluster
            value_counts = cluster_samples[label].value_counts()
            if not value_counts.empty:
                cur_major = value_counts.index[0]
            else:
                cur_major = None  # If cluster empty
            pred_rows[label] = cur_major
            
        major_labels[n] = pred_rows  # Store predictions for current cluster
    
    # Convert to DataFrame
    major_labels = pd.DataFrame.from_dict(major_labels, orient='index')

    # Calculate hamming metrics
    # predicted_labels = major_labels.iloc[cluster_labels].values
    # ham_loss = hamming_loss(y_train, predicted_labels)    # error
    # ham_dist = ham_loss * y_train.shape[1]  # Convert loss to distance
    miss_labels = 0
    n_samples = y_train.shape[0]
    n_labels = y_train.shape[1]
    # loop thru all instances
    for c in range(len(major_labels)):
        idx, = np.where(cluster_labels == c)
        for label in y_train.loc[idx].values:
            miss = (label != major_labels.loc[c].values)
            miss_labels += np.sum(miss)
    
    ham_dist = miss_labels / n_samples
    ham_loss = miss_labels / (n_samples * n_labels)
    
    return major_labels, ham_loss, ham_dist


In [89]:
def run_clustering_analysis(X_train, y_train, max_clusters=50, trials=50):
    best_results = {'ham_loss': float('inf')}
    
    for trial in range(trials):
        print(f"Trial {trial + 1}/{trials}")
        # Find best k, best model and clustering
        model_KMeans, best_k, cluster_labels = find_best_k(X_train, max_clusters, random_state=trial)
        print("Best K: ", best_k)
        major_labels, ham_loss, ham_dist = get_majority_labels(cluster_labels, y_train, best_k)
        display(major_labels)
        print("Hamming Distance: ", round(ham_dist, 4))
        print("Hamming Loss: ", round(ham_loss, 4), "\n")

        # Record and refresh best results
        if ham_loss < best_results['ham_loss']:
            best_results = {
                'ham_loss': ham_loss,
                'ham_dist': ham_dist,
                'k': best_k,
                'model': model_KMeans,
                'major_labels': major_labels,
                'trial': trial + 1
            }

    return best_results

# Run analysis
results = {}
best_results = run_clustering_analysis(X, y)

Trial 1/50
Best K:  4


Unnamed: 0,Family,Genus,Species
0,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
1,Hylidae,Hypsiboas,HypsiboasCordobae
2,Hylidae,Hypsiboas,HypsiboasCinerascens
3,Leptodactylidae,Adenomera,AdenomeraAndre


Hamming Distance:  0.7022
Hamming Loss:  0.2341 

Trial 2/50
Best K:  4


Unnamed: 0,Family,Genus,Species
0,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
1,Hylidae,Hypsiboas,HypsiboasCordobae
2,Leptodactylidae,Adenomera,AdenomeraAndre
3,Hylidae,Hypsiboas,HypsiboasCordobae


Hamming Distance:  0.7358
Hamming Loss:  0.2453 

Trial 3/50
Best K:  4


Unnamed: 0,Family,Genus,Species
0,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
1,Hylidae,Hypsiboas,HypsiboasCinerascens
2,Leptodactylidae,Adenomera,AdenomeraAndre
3,Hylidae,Hypsiboas,HypsiboasCordobae


Hamming Distance:  0.7012
Hamming Loss:  0.2337 

Trial 4/50
Best K:  6


Unnamed: 0,Family,Genus,Species
0,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
1,Hylidae,Hypsiboas,HypsiboasCordobae
2,Leptodactylidae,Adenomera,AdenomeraAndre
3,Leptodactylidae,Adenomera,AdenomeraAndre
4,Hylidae,Hypsiboas,HypsiboasCinerascens
5,Hylidae,Leptodactylus,LeptodactylusFuscus


Hamming Distance:  0.5632
Hamming Loss:  0.1877 

Trial 5/50
Best K:  3


Unnamed: 0,Family,Genus,Species
0,Hylidae,Hypsiboas,HypsiboasCordobae
1,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
2,Dendrobatidae,Ameerega,Ameeregatrivittata


Hamming Distance:  0.7155
Hamming Loss:  0.2385 

Trial 6/50
Best K:  4


Unnamed: 0,Family,Genus,Species
0,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
1,Leptodactylidae,Adenomera,AdenomeraAndre
2,Hylidae,Hypsiboas,HypsiboasCinerascens
3,Hylidae,Hypsiboas,HypsiboasCordobae


Hamming Distance:  0.7012
Hamming Loss:  0.2337 

Trial 7/50
Best K:  4


Unnamed: 0,Family,Genus,Species
0,Hylidae,Hypsiboas,HypsiboasCordobae
1,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
2,Leptodactylidae,Adenomera,AdenomeraAndre
3,Dendrobatidae,Ameerega,Ameeregatrivittata


Hamming Distance:  0.5582
Hamming Loss:  0.1861 

Trial 8/50
Best K:  3


Unnamed: 0,Family,Genus,Species
0,Hylidae,Hypsiboas,HypsiboasCordobae
1,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
2,Hylidae,Hypsiboas,HypsiboasCordobae


Hamming Distance:  0.8903
Hamming Loss:  0.2968 

Trial 9/50
Best K:  3


Unnamed: 0,Family,Genus,Species
0,Hylidae,Hypsiboas,HypsiboasCordobae
1,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
2,Dendrobatidae,Ameerega,Ameeregatrivittata


Hamming Distance:  0.7155
Hamming Loss:  0.2385 

Trial 10/50
Best K:  4


Unnamed: 0,Family,Genus,Species
0,Dendrobatidae,Ameerega,Ameeregatrivittata
1,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
2,Leptodactylidae,Adenomera,AdenomeraAndre
3,Hylidae,Hypsiboas,HypsiboasCordobae


Hamming Distance:  0.5582
Hamming Loss:  0.1861 

Trial 11/50
Best K:  4


Unnamed: 0,Family,Genus,Species
0,Hylidae,Hypsiboas,HypsiboasCordobae
1,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
2,Leptodactylidae,Adenomera,AdenomeraAndre
3,Hylidae,Hypsiboas,HypsiboasCinerascens


Hamming Distance:  0.7012
Hamming Loss:  0.2337 

Trial 12/50
Best K:  3


Unnamed: 0,Family,Genus,Species
0,Hylidae,Hypsiboas,HypsiboasCordobae
1,Hylidae,Hypsiboas,HypsiboasCordobae
2,Leptodactylidae,Adenomera,AdenomeraHylaedactylus


Hamming Distance:  0.8903
Hamming Loss:  0.2968 

Trial 13/50
Best K:  4


Unnamed: 0,Family,Genus,Species
0,Dendrobatidae,Ameerega,Ameeregatrivittata
1,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
2,Leptodactylidae,Adenomera,AdenomeraAndre
3,Hylidae,Hypsiboas,HypsiboasCordobae


Hamming Distance:  0.5582
Hamming Loss:  0.1861 

Trial 14/50
Best K:  4


Unnamed: 0,Family,Genus,Species
0,Hylidae,Hypsiboas,HypsiboasCordobae
1,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
2,Dendrobatidae,Ameerega,Ameeregatrivittata
3,Hylidae,Hypsiboas,HypsiboasCinerascens


Hamming Distance:  0.6673
Hamming Loss:  0.2224 

Trial 15/50
Best K:  3


Unnamed: 0,Family,Genus,Species
0,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
1,Hylidae,Hypsiboas,HypsiboasCordobae
2,Hylidae,Hypsiboas,HypsiboasCordobae


Hamming Distance:  0.8905
Hamming Loss:  0.2968 

Trial 16/50
Best K:  2


Unnamed: 0,Family,Genus,Species
0,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
1,Hylidae,Hypsiboas,HypsiboasCordobae


Hamming Distance:  0.8956
Hamming Loss:  0.2985 

Trial 17/50
Best K:  4


Unnamed: 0,Family,Genus,Species
0,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
1,Hylidae,Hypsiboas,HypsiboasCinerascens
2,Hylidae,Hypsiboas,HypsiboasCordobae
3,Dendrobatidae,Ameerega,Ameeregatrivittata


Hamming Distance:  0.6673
Hamming Loss:  0.2224 

Trial 18/50
Best K:  4


Unnamed: 0,Family,Genus,Species
0,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
1,Leptodactylidae,Adenomera,AdenomeraAndre
2,Hylidae,Hypsiboas,HypsiboasCordobae
3,Hylidae,Hypsiboas,HypsiboasCinerascens


Hamming Distance:  0.7012
Hamming Loss:  0.2337 

Trial 19/50
Best K:  2


Unnamed: 0,Family,Genus,Species
0,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
1,Hylidae,Hypsiboas,HypsiboasCordobae


Hamming Distance:  0.8956
Hamming Loss:  0.2985 

Trial 20/50
Best K:  4


Unnamed: 0,Family,Genus,Species
0,Hylidae,Hypsiboas,HypsiboasCordobae
1,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
2,Leptodactylidae,Adenomera,AdenomeraAndre
3,Hylidae,Hypsiboas,HypsiboasCinerascens


Hamming Distance:  0.7022
Hamming Loss:  0.2341 

Trial 21/50
Best K:  6


Unnamed: 0,Family,Genus,Species
0,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
1,Hylidae,Hypsiboas,HypsiboasCordobae
2,Leptodactylidae,Adenomera,AdenomeraAndre
3,Hylidae,Hypsiboas,HypsiboasCinerascens
4,Leptodactylidae,Adenomera,AdenomeraAndre
5,Hylidae,Leptodactylus,LeptodactylusFuscus


Hamming Distance:  0.5622
Hamming Loss:  0.1874 

Trial 22/50
Best K:  4


Unnamed: 0,Family,Genus,Species
0,Leptodactylidae,Adenomera,AdenomeraAndre
1,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
2,Hylidae,Hypsiboas,HypsiboasCordobae
3,Hylidae,Hypsiboas,HypsiboasCordobae


Hamming Distance:  0.7358
Hamming Loss:  0.2453 

Trial 23/50
Best K:  4


Unnamed: 0,Family,Genus,Species
0,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
1,Hylidae,Hypsiboas,HypsiboasCordobae
2,Leptodactylidae,Adenomera,AdenomeraAndre
3,Hylidae,Hypsiboas,HypsiboasCinerascens


Hamming Distance:  0.7012
Hamming Loss:  0.2337 

Trial 24/50
Best K:  4


Unnamed: 0,Family,Genus,Species
0,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
1,Hylidae,Hypsiboas,HypsiboasCordobae
2,Dendrobatidae,Ameerega,Ameeregatrivittata
3,Hylidae,Hypsiboas,HypsiboasCinerascens


Hamming Distance:  0.6673
Hamming Loss:  0.2224 

Trial 25/50
Best K:  6


Unnamed: 0,Family,Genus,Species
0,Hylidae,Hypsiboas,HypsiboasCinerascens
1,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
2,Dendrobatidae,Ameerega,Ameeregatrivittata
3,Leptodactylidae,Adenomera,AdenomeraAndre
4,Hylidae,Hypsiboas,HypsiboasCordobae
5,Leptodactylidae,Adenomera,AdenomeraAndre


Hamming Distance:  0.497
Hamming Loss:  0.1657 

Trial 26/50
Best K:  3


Unnamed: 0,Family,Genus,Species
0,Hylidae,Hypsiboas,HypsiboasCordobae
1,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
2,Hylidae,Hypsiboas,HypsiboasCordobae


Hamming Distance:  0.8903
Hamming Loss:  0.2968 

Trial 27/50
Best K:  4


Unnamed: 0,Family,Genus,Species
0,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
1,Hylidae,Hypsiboas,HypsiboasCinerascens
2,Hylidae,Hypsiboas,HypsiboasCordobae
3,Leptodactylidae,Adenomera,AdenomeraAndre


Hamming Distance:  0.7002
Hamming Loss:  0.2334 

Trial 28/50
Best K:  4


Unnamed: 0,Family,Genus,Species
0,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
1,Hylidae,Hypsiboas,HypsiboasCordobae
2,Dendrobatidae,Ameerega,Ameeregatrivittata
3,Leptodactylidae,Adenomera,AdenomeraAndre


Hamming Distance:  0.5582
Hamming Loss:  0.1861 

Trial 29/50
Best K:  3


Unnamed: 0,Family,Genus,Species
0,Hylidae,Hypsiboas,HypsiboasCordobae
1,Hylidae,Hypsiboas,HypsiboasCordobae
2,Leptodactylidae,Adenomera,AdenomeraHylaedactylus


Hamming Distance:  0.8905
Hamming Loss:  0.2968 

Trial 30/50
Best K:  4


Unnamed: 0,Family,Genus,Species
0,Hylidae,Hypsiboas,HypsiboasCordobae
1,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
2,Hylidae,Hypsiboas,HypsiboasCinerascens
3,Hylidae,Hypsiboas,HypsiboasCordobae


Hamming Distance:  0.8402
Hamming Loss:  0.2801 

Trial 31/50
Best K:  2


Unnamed: 0,Family,Genus,Species
0,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
1,Hylidae,Hypsiboas,HypsiboasCordobae


Hamming Distance:  0.8956
Hamming Loss:  0.2985 

Trial 32/50
Best K:  4


Unnamed: 0,Family,Genus,Species
0,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
1,Hylidae,Hypsiboas,HypsiboasCordobae
2,Leptodactylidae,Adenomera,AdenomeraAndre
3,Dendrobatidae,Ameerega,Ameeregatrivittata


Hamming Distance:  0.5582
Hamming Loss:  0.1861 

Trial 33/50
Best K:  3


Unnamed: 0,Family,Genus,Species
0,Hylidae,Hypsiboas,HypsiboasCordobae
1,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
2,Hylidae,Hypsiboas,HypsiboasCinerascens


Hamming Distance:  0.8473
Hamming Loss:  0.2824 

Trial 34/50
Best K:  4


Unnamed: 0,Family,Genus,Species
0,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
1,Dendrobatidae,Ameerega,Ameeregatrivittata
2,Leptodactylidae,Adenomera,AdenomeraAndre
3,Hylidae,Hypsiboas,HypsiboasCordobae


Hamming Distance:  0.5582
Hamming Loss:  0.1861 

Trial 35/50
Best K:  6


Unnamed: 0,Family,Genus,Species
0,Leptodactylidae,Adenomera,AdenomeraAndre
1,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
2,Hylidae,Leptodactylus,LeptodactylusFuscus
3,Leptodactylidae,Adenomera,AdenomeraAndre
4,Hylidae,Hypsiboas,HypsiboasCinerascens
5,Hylidae,Hypsiboas,HypsiboasCordobae


Hamming Distance:  0.5626
Hamming Loss:  0.1875 

Trial 36/50
Best K:  3


Unnamed: 0,Family,Genus,Species
0,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
1,Hylidae,Hypsiboas,HypsiboasCordobae
2,Hylidae,Hypsiboas,HypsiboasCordobae


Hamming Distance:  0.8903
Hamming Loss:  0.2968 

Trial 37/50
Best K:  6


Unnamed: 0,Family,Genus,Species
0,Hylidae,Hypsiboas,HypsiboasCinerascens
1,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
2,Hylidae,Hypsiboas,HypsiboasCordobae
3,Leptodactylidae,Adenomera,AdenomeraAndre
4,Leptodactylidae,Leptodactylus,LeptodactylusFuscus
5,Leptodactylidae,Adenomera,AdenomeraAndre


Hamming Distance:  0.5619
Hamming Loss:  0.1873 

Trial 38/50
Best K:  6


Unnamed: 0,Family,Genus,Species
0,Hylidae,Leptodactylus,LeptodactylusFuscus
1,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
2,Hylidae,Hypsiboas,HypsiboasCinerascens
3,Leptodactylidae,Adenomera,AdenomeraAndre
4,Hylidae,Hypsiboas,HypsiboasCordobae
5,Leptodactylidae,Adenomera,AdenomeraAndre


Hamming Distance:  0.5644
Hamming Loss:  0.1881 

Trial 39/50
Best K:  2


Unnamed: 0,Family,Genus,Species
0,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
1,Hylidae,Hypsiboas,HypsiboasCordobae


Hamming Distance:  0.8956
Hamming Loss:  0.2985 

Trial 40/50
Best K:  2


Unnamed: 0,Family,Genus,Species
0,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
1,Hylidae,Hypsiboas,HypsiboasCordobae


Hamming Distance:  0.8956
Hamming Loss:  0.2985 

Trial 41/50
Best K:  2


Unnamed: 0,Family,Genus,Species
0,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
1,Hylidae,Hypsiboas,HypsiboasCordobae


Hamming Distance:  0.8956
Hamming Loss:  0.2985 

Trial 42/50
Best K:  4


Unnamed: 0,Family,Genus,Species
0,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
1,Dendrobatidae,Ameerega,Ameeregatrivittata
2,Hylidae,Hypsiboas,HypsiboasCordobae
3,Hylidae,Hypsiboas,HypsiboasCinerascens


Hamming Distance:  0.667
Hamming Loss:  0.2223 

Trial 43/50
Best K:  4


Unnamed: 0,Family,Genus,Species
0,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
1,Hylidae,Hypsiboas,HypsiboasCinerascens
2,Dendrobatidae,Ameerega,Ameeregatrivittata
3,Hylidae,Hypsiboas,HypsiboasCordobae


Hamming Distance:  0.6673
Hamming Loss:  0.2224 

Trial 44/50
Best K:  5


Unnamed: 0,Family,Genus,Species
0,Dendrobatidae,Ameerega,Ameeregatrivittata
1,Hylidae,Hypsiboas,HypsiboasCinerascens
2,Leptodactylidae,Adenomera,AdenomeraAndre
3,Hylidae,Hypsiboas,HypsiboasCordobae
4,Leptodactylidae,Adenomera,AdenomeraHylaedactylus


Hamming Distance:  0.5979
Hamming Loss:  0.1993 

Trial 45/50
Best K:  4


Unnamed: 0,Family,Genus,Species
0,Hylidae,Hypsiboas,HypsiboasCordobae
1,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
2,Leptodactylidae,Adenomera,AdenomeraAndre
3,Hylidae,Hypsiboas,HypsiboasCinerascens


Hamming Distance:  0.7009
Hamming Loss:  0.2336 

Trial 46/50
Best K:  4


Unnamed: 0,Family,Genus,Species
0,Hylidae,Hypsiboas,HypsiboasCordobae
1,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
2,Dendrobatidae,Ameerega,Ameeregatrivittata
3,Leptodactylidae,Adenomera,AdenomeraAndre


Hamming Distance:  0.5582
Hamming Loss:  0.1861 

Trial 47/50
Best K:  3


Unnamed: 0,Family,Genus,Species
0,Hylidae,Hypsiboas,HypsiboasCordobae
1,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
2,Hylidae,Hypsiboas,HypsiboasCordobae


Hamming Distance:  0.8903
Hamming Loss:  0.2968 

Trial 48/50
Best K:  4


Unnamed: 0,Family,Genus,Species
0,Dendrobatidae,Ameerega,Ameeregatrivittata
1,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
2,Hylidae,Hypsiboas,HypsiboasCordobae
3,Hylidae,Hypsiboas,HypsiboasCinerascens


Hamming Distance:  0.6673
Hamming Loss:  0.2224 

Trial 49/50
Best K:  3


Unnamed: 0,Family,Genus,Species
0,Dendrobatidae,Ameerega,Ameeregatrivittata
1,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
2,Hylidae,Hypsiboas,HypsiboasCordobae


Hamming Distance:  0.7155
Hamming Loss:  0.2385 

Trial 50/50
Best K:  3


Unnamed: 0,Family,Genus,Species
0,Hylidae,Hypsiboas,HypsiboasCordobae
1,Hylidae,Hypsiboas,HypsiboasCordobae
2,Leptodactylidae,Adenomera,AdenomeraHylaedactylus


Hamming Distance:  0.8905
Hamming Loss:  0.2968 



### (n) Calculate the average Hamming distance, Hamming score, and Hamming loss

In [90]:
print("avg hamming distance: ", round(np.mean(best_results['ham_dist']), 4))
print("avg hamming Loss: ", round(np.mean(best_results['ham_loss']), 4))
print("avg hamming score: ", round(1-np.mean(best_results['ham_loss']), 4))

avg hamming distance:  0.497
avg hamming Loss:  0.1657
avg hamming score:  0.8343


## 3. ISLR 12.6.2

##### Question
![12.6.2](./pics/12.6.2.png)

Answer

![answer](./pics/answer.jpg)