# Introduction

DSCI 552 - Machine Learning for Data Science

Homework 5

Matheus Schmitz

USC ID: 5039286453

![2.JPG](attachment:2.JPG)

# Imports

In [1]:
# tqdm is a progress bar
# Useful for confirming that things are still running when the processing time is long
!pip install tqdm



In [2]:
# Data Manipulation
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# K-Means
from sklearn.cluster import KMeans, MiniBatchKMeans

# Metrics
from sklearn.metrics import silhouette_score, hamming_loss

# Label Encoding
from sklearn.preprocessing import LabelEncoder

# Progress Bar
from tqdm.notebook import tqdm

# Warnings
import warnings
warnings.filterwarnings('ignore')

### Dataset

In [3]:
# Read csv
df = pd.read_csv('../data/Frogs_MFCCs.csv')
print(f'df.shape: {df.shape}')
df.head(3)

df.shape: (7195, 26)


,MFCCs_ 1,MFCCs_ 2,MFCCs_ 3,MFCCs_ 4,MFCCs_ 5,MFCCs_ 6,MFCCs_ 7,MFCCs_ 8,MFCCs_ 9,MFCCs_10,...,MFCCs_17,MFCCs_18,MFCCs_19,MFCCs_20,MFCCs_21,MFCCs_22,Family,Genus,Species,RecordID
0,1.0,0.152936,-0.105586,0.200722,0.317201,0.260764,0.100945,-0.150063,-0.171128,0.124676,...,-0.108351,-0.077623,-0.009568,0.057684,0.11868,0.014038,Leptodactylidae,Adenomera,AdenomeraAndre,1
1,1.0,0.171534,-0.098975,0.268425,0.338672,0.268353,0.060835,-0.222475,-0.207693,0.170883,...,-0.090974,-0.05651,-0.035303,0.02014,0.082263,0.029056,Leptodactylidae,Adenomera,AdenomeraAndre,1
2,1.0,0.152317,-0.082973,0.287128,0.276014,0.189867,0.008714,-0.242234,-0.219153,0.232538,...,-0.050691,-0.02359,-0.066722,-0.025083,0.099108,0.077162,Leptodactylidae,Adenomera,AdenomeraAndre,1


In [4]:
# Split features and labels
df_features = df.iloc[:, :-4]
# Copy the label slice so that columns added later are not written to a view of df
df_labels = df.iloc[:, -4:-1].copy()

### (a) K-Means Clustering

![2a.JPG](attachment:2a.JPG)

In [5]:
# KMeans: Takes about 3 minutes to train on all k values
# MiniBatchKMeans: Takes about 1 minute to train on all k values

# Dictionary to store silhouette score for each k
silhouettes = {}

# Train, predict and score KMeans on each k
for k in tqdm(range(2,51)):
    kmeans = KMeans(n_clusters=k)
    #kmeans = MiniBatchKMeans(n_clusters=k)
    clusters = kmeans.fit_predict(df_features)
    silhouettes[k] = silhouette_score(df_features, clusters)



In [6]:
# Get the best K value and the associated Silhouette Score
best_k = max(silhouettes, key=lambda key: silhouettes[key])
print(f'Best K: {best_k}')
best_silhouette = silhouettes[best_k]
print(f'Silhouette Score: {best_silhouette:.5f}')

Best K: 4
Silhouette Score: 0.37885
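
For reference, the silhouette score that drives the choice of k above can be sketched directly in NumPy. This is a minimal illustration on a made-up four-point dataset, not the notebook's pipeline: `silhouette_sketch`, `toy_points` and the two candidate labelings are hypothetical names, and the function is meant to follow the usual definition, (b - a) / max(a, b) averaged over samples, with singleton clusters scored as 0. It assumes at least two clusters.

```python
import numpy as np

def silhouette_sketch(X, labels):
    """Mean silhouette over all samples (Euclidean distance, >= 2 clusters)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all samples
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton cluster: score stays 0
        # a: mean distance to the other members of the sample's own cluster
        # (the distance to itself is 0, so it only affects the divisor)
        a = dists[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

# Hypothetical toy data: two tight pairs of points far apart
toy_points = [[0, 0], [0, 1], [10, 0], [10, 1]]
candidates = {
    "compact": [0, 0, 1, 1],  # each pair in its own cluster
    "mixed": [0, 1, 0, 1],    # each cluster spans both pairs
}

scores = {name: silhouette_sketch(toy_points, lab) for name, lab in candidates.items()}
# Same selection rule as in the notebook: keep the candidate with the highest score
best_name = max(scores, key=scores.get)
print(scores, best_name)
```

A labeling that keeps nearby points together scores close to 1, while one that mixes distant points scores below 0, which is why the maximum over k is taken.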


### (b) Majority Labels per Cluster

![2b.JPG](attachment:2b.JPG)

In [7]:
# Instantiate a K-Means clusterer using the best_k
kmeans = KMeans(n_clusters=best_k)
#kmeans = MiniBatchKMeans(n_clusters=best_k)

# Train the K-Means and predict the clusters
clusters = kmeans.fit_predict(df_features)

# Add the predicted clusters to the dataframe with labels
df_labels['Cluster'] = clusters

# Group the dataframe by cluster
df_clusters = df_labels.groupby('Cluster')

# For each of the labels, check the most frequent class (the mode)
cluster_family = df_clusters['Family'].agg(pd.Series.mode)
cluster_genus = df_clusters['Genus'].agg(pd.Series.mode)
cluster_species = df_clusters['Species'].agg(pd.Series.mode)

# Summarize all on a dataframe
majority_classes = pd.DataFrame(data=[cluster_family, cluster_genus, cluster_species]).T
majority_classes

Cluster,Family,Genus,Species
0,Hylidae,Hypsiboas,HypsiboasCinerascens
1,Hylidae,Hypsiboas,HypsiboasCordobae
2,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
3,Dendrobatidae,Ameerega,Ameeregatrivittata
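
One caveat with the mode-based approach: `Series.mode` returns all tied values, so a cluster with two equally frequent classes could end up with several values instead of one label. Below is a minimal sketch of one way to force a single majority label per cluster, on made-up data. `toy` and `first_mode` are hypothetical names, and breaking ties by taking the first value of the sorted mode is an assumption, not something the assignment specifies.

```python
import pandas as pd

# Hypothetical toy labels: cluster 0 has a clear majority, cluster 1 has a tie
toy = pd.DataFrame({
    "Cluster": [0, 0, 0, 1, 1],
    "Family": ["A", "A", "B", "B", "C"],
})

def first_mode(series):
    """Return a single majority label; ties resolve to the first sorted mode."""
    return series.mode().iloc[0]

majority = toy.groupby("Cluster")["Family"].agg(first_mode)
print(majority)
```

Because the result is always one scalar per cluster, it can be passed to `Series.map` to assign predicted labels without special handling for ties.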


### (c) Hamming Distance, Hamming Score, Hamming Loss

![2c.JPG](attachment:2c.JPG)


**Hamming Distance** - From Scipy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.hamming.html

The Hamming distance between 1-D arrays u and v, is simply the proportion of disagreeing components in u and v.

From this I assume the hamming distance between N-D arrays is the sum of the distances between their "inner" 1-D arrays.
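
A small NumPy sketch of the two statements above, on made-up label triples. `hamming_1d` mirrors the quoted SciPy definition (proportion of disagreeing components), while `hamming_nd` encodes the assumption made in this notebook (sum over rows); both names are hypothetical helpers, not SciPy functions.

```python
import numpy as np

def hamming_1d(u, v):
    """Proportion of positions where two 1-D arrays disagree."""
    u, v = np.asarray(u), np.asarray(v)
    return float(np.mean(u != v))

def hamming_nd(U, V):
    """Assumed extension to 2-D arrays: sum of the row-wise 1-D distances."""
    return float(sum(hamming_1d(u, v) for u, v in zip(U, V)))

# Made-up (Family, Genus, Species) triples for two samples
true_rows = [["Hylidae", "Hypsiboas", "HypsiboasCordobae"],
             ["Hylidae", "Hypsiboas", "HypsiboasCinerascens"]]
pred_rows = [["Hylidae", "Hypsiboas", "HypsiboasCinerascens"],
             ["Hylidae", "Hypsiboas", "HypsiboasCinerascens"]]

d_first = hamming_1d(true_rows[0], pred_rows[0])  # one of three labels differs
d_total = hamming_nd(true_rows, pred_rows)        # second row matches exactly
print(d_first, d_total)
```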

&nbsp;

**Hamming Loss** - From Scikit-Learn: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.hamming_loss.html

The Hamming loss is the fraction of labels that are incorrectly predicted.

Scikit-Learn therefore averages the loss over all labels, so multiplying sklearn's `hamming_loss` by the number of labels per sample (three here: Family, Genus and Species) gives the average Hamming distance per sample.

&nbsp;

**Hamming Score** is the complement of the Hamming loss, i.e. score = 1 - loss.
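
How the three quantities relate can be checked on a tiny made-up example. The sketch below uses plain NumPy rather than sklearn's `hamming_loss`, on the assumption that the loss is the fraction of mismatched label slots, and mirrors the calculation used later in this notebook (distance = loss times the number of labels). The arrays `true` and `pred` are hypothetical integer-encoded labels.

```python
import numpy as np

# Hypothetical encoded (Family, Genus, Species) labels for four samples
true = np.array([[0, 1, 2],
                 [0, 1, 3],
                 [1, 4, 5],
                 [1, 4, 5]])
pred = np.array([[0, 1, 2],
                 [0, 1, 2],
                 [0, 1, 2],
                 [1, 4, 5]])

mismatch = true != pred

loss = mismatch.mean()           # fraction of all label slots that are wrong
score = 1 - loss                 # fraction of all label slots that are right
distance = loss * true.shape[1]  # average number of wrong labels per sample

# The same distance computed directly as a per-sample count, then averaged
distance_direct = mismatch.sum(axis=1).mean()
print(loss, score, distance, distance_direct)
```

With three labels per sample the distance ranges from 0 (every label right) to 3 (every label wrong), whereas the loss and score stay within [0, 1].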



In [8]:
# Use the majority_classes dataframe to assign predicted classes to each sample
df_labels['pred_Family'] = df_labels['Cluster'].map(majority_classes['Family'])
df_labels['pred_Genus'] = df_labels['Cluster'].map(majority_classes['Genus'])
df_labels['pred_Species'] = df_labels['Cluster'].map(majority_classes['Species'])

# Convert labels from strings to integers in order to calculate the Hamming metrics
# LabelBinarizer and OneHotEncoder are not used because they would double the error: [0, 1] and [1, 0] have a Hamming distance of 2,
# while [0] and [3] have a Hamming distance of 1, which is correct since the classes within a label have no ordering
LE = LabelEncoder()
df_labels['true_Family_encoded'] = LE.fit_transform(df_labels['Family'])
df_labels['pred_Family_encoded'] = LE.transform(df_labels['pred_Family'])
df_labels['true_Genus_encoded'] = LE.fit_transform(df_labels['Genus'])
df_labels['pred_Genus_encoded'] = LE.transform(df_labels['pred_Genus'])
df_labels['true_Species_encoded'] = LE.fit_transform(df_labels['Species'])
df_labels['pred_Species_encoded'] = LE.transform(df_labels['pred_Species'])

# Extract the true and predicted labels as arrays so they can be compared
true_labels_encoded = [data[['true_Family_encoded', 'true_Genus_encoded', 'true_Species_encoded']].values for _, data in df_labels.groupby('Cluster')]
pred_labels_encoded = [data[['pred_Family_encoded', 'pred_Genus_encoded', 'pred_Species_encoded']].values for _, data in df_labels.groupby('Cluster')]

# Calculate metrics
cluster_hamming_loss = [hamming_loss(np.vstack(true_labels_encoded).flatten(), np.vstack(pred_labels_encoded).flatten())]
cluster_hamming_score = [1-loss for loss in cluster_hamming_loss]
cluster_hamming_dist = [loss*len(majority_classes.columns) for loss in cluster_hamming_loss]

# Print average metrics
print(f'Average Hamming Loss:     {np.mean(cluster_hamming_loss):.5f}')
print(f'Average Hamming Score:    {np.mean(cluster_hamming_score):.5f}')
print(f'Average Hamming Distance: {np.mean(cluster_hamming_dist):.5f}')

Average Hamming Loss:     0.22242
Average Hamming Score:    0.77758
Average Hamming Distance: 0.66727
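
The comment in the cell above argues that one-hot encoding would double-count a wrong class. A minimal NumPy check of that claim, with a hypothetical label that has four classes where class 0 is mistaken for class 3:

```python
import numpy as np

n_classes = 4
true_cls, pred_cls = 0, 3

# One-hot rows: class c becomes a vector with a single 1 at position c
one_hot = np.eye(n_classes, dtype=int)
onehot_mismatches = int((one_hot[true_cls] != one_hot[pred_cls]).sum())

# Integer encoding: a single slot is compared, so one error counts once
label_mismatches = int(true_cls != pred_cls)

print(onehot_mismatches, label_mismatches)
```

The one-hot vectors differ in two positions (the 1 that moved and the slot it moved to), while the integer codes differ in one, which matches the reasoning for using `LabelEncoder`.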


### Monte-Carlo Simulation

![2.JPG](attachment:2.JPG)

In [9]:
# Read csv
df = pd.read_csv('../data/Frogs_MFCCs.csv')

# List to store the hamming distance in each iteration
hammings = []

# Perform the previous procedure (a + b + c) 50 times:
for iteration in tqdm(range(1, 51), desc='Monte-Carlo Simulation', ncols='90%'):
    
    # Split features and labels
    df_features = df.iloc[:, :-4]
    df_labels = df.iloc[:, -4:-1].copy()
    
    #-------------------------------------------------------#
    #   (A) K-MEANS CLUSTERING                              #
    #-------------------------------------------------------#
    
    # Dictionary to store silhouette score for each k
    silhouettes = {}

    # Train, predict and score KMeans on each k
    # Note: the highest k is lowered to 10 here, based on the previous finding of best_k = 4
    for k in range(2,11):
    #for k in tqdm(range(2,51), desc='K-Means K ∈ {2, 3, ..., 50}', ncols='66%'):
        kmeans = KMeans(n_clusters=k, random_state=iteration)
        #kmeans = MiniBatchKMeans(n_clusters=k, random_state=iteration)
        clusters = kmeans.fit_predict(df_features)
        silhouettes[k] = silhouette_score(df_features, clusters)
        
    # Get the best K value and the associated Silhouette Score
    best_k = max(silhouettes, key=lambda key: silhouettes[key])
    print(f'Iteration {iteration} | Best K: {best_k}')
    best_silhouette = silhouettes[best_k]
    print(f'Iteration {iteration} | Silhouette Score: {best_silhouette:.5f}')  

    
    #-------------------------------------------------------#
    #    (B) MAJORITY LABELS PER CLUSTER                    #
    #-------------------------------------------------------#
    
    # Instantiate a K-Means clusterer using the best_k
    kmeans = KMeans(n_clusters=best_k, random_state=iteration)
    #kmeans = MiniBatchKMeans(n_clusters=best_k, random_state=iteration)

    # Train the K-Means and predict the clusters
    clusters = kmeans.fit_predict(df_features)

    # Add the predicted clusters to the dataframe with labels
    df_labels['Cluster'] = clusters

    # Group the dataframe by cluster
    df_clusters = df_labels.groupby('Cluster')

    # For each of the labels, check the most frequent class (the mode)
    cluster_family = df_clusters['Family'].agg(pd.Series.mode)
    cluster_genus = df_clusters['Genus'].agg(pd.Series.mode)
    cluster_species = df_clusters['Species'].agg(pd.Series.mode)

    # Summarize all on a dataframe
    majority_classes = pd.DataFrame(data=[cluster_family, cluster_genus, cluster_species]).T
    
    
    #-------------------------------------------------------#
    #   (C) HAMMING DISTANCE, HAMMING SCORE, HAMMING LOSS   #
    #-------------------------------------------------------#

    # Use the majority_classes dataframe to assign predicted classes to each sample
    df_labels['pred_Family'] = df_labels['Cluster'].map(majority_classes['Family'])
    df_labels['pred_Genus'] = df_labels['Cluster'].map(majority_classes['Genus'])
    df_labels['pred_Species'] = df_labels['Cluster'].map(majority_classes['Species'])

    # Convert labels from strings to integers in order to calculate the Hamming metrics
    # LabelBinarizer and OneHotEncoder are not used because they would double the error: [0, 1] and [1, 0] have a Hamming distance of 2,
    # while [0] and [3] have a Hamming distance of 1, which is correct since the classes within a label have no ordering
    LE = LabelEncoder()
    df_labels['true_Family_encoded'] = LE.fit_transform(df_labels['Family'])
    df_labels['pred_Family_encoded'] = LE.transform(df_labels['pred_Family'])
    df_labels['true_Genus_encoded'] = LE.fit_transform(df_labels['Genus'])
    df_labels['pred_Genus_encoded'] = LE.transform(df_labels['pred_Genus'])
    df_labels['true_Species_encoded'] = LE.fit_transform(df_labels['Species'])
    df_labels['pred_Species_encoded'] = LE.transform(df_labels['pred_Species'])

    # Extract the true and predicted labels as arrays so they can be compared
    true_labels_encoded = [data[['true_Family_encoded', 'true_Genus_encoded', 'true_Species_encoded']].values for _, data in df_labels.groupby('Cluster')]
    pred_labels_encoded = [data[['pred_Family_encoded', 'pred_Genus_encoded', 'pred_Species_encoded']].values for _, data in df_labels.groupby('Cluster')]

    # Calculate metrics
    cluster_hamming_loss = [hamming_loss(np.vstack(true_labels_encoded).flatten(), np.vstack(pred_labels_encoded).flatten())]
    cluster_hamming_score = [1-loss for loss in cluster_hamming_loss]
    cluster_hamming_dist = [loss*len(majority_classes.columns) for loss in cluster_hamming_loss]

    # Print average metrics
    print(f'Iteration {iteration} | Average Hamming Loss:     {np.mean(cluster_hamming_loss):.5f}')
    print(f'Iteration {iteration} | Average Hamming Score:    {np.mean(cluster_hamming_score):.5f}')
    print(f'Iteration {iteration} | Average Hamming Distance: {np.mean(cluster_hamming_dist):.5f}')
    print()
    
    #-------------------------------------------------------#
    #   ITERATION METRICS                                   #
    #-------------------------------------------------------# 
    
    mean_hamming_distance = np.mean(cluster_hamming_dist)
    hammings.append(mean_hamming_distance)    


Iteration 1 | Best K: 4
Iteration 1 | Silhouette Score: 0.37875
Iteration 1 | Average Hamming Loss:     0.22242
Iteration 1 | Average Hamming Score:    0.77758
Iteration 1 | Average Hamming Distance: 0.66727

Iteration 2 | Best K: 4
Iteration 2 | Silhouette Score: 0.37875
Iteration 2 | Average Hamming Loss:     0.22242
Iteration 2 | Average Hamming Score:    0.77758
Iteration 2 | Average Hamming Distance: 0.66727

Iteration 3 | Best K: 4
Iteration 3 | Silhouette Score: 0.37875
Iteration 3 | Average Hamming Loss:     0.22242
Iteration 3 | Average Hamming Score:    0.77758
Iteration 3 | Average Hamming Distance: 0.66727

Iteration 4 | Best K: 4
Iteration 4 | Silhouette Score: 0.37875
Iteration 4 | Average Hamming Loss:     0.22242
Iteration 4 | Average Hamming Score:    0.77758
Iteration 4 | Average Hamming Distance: 0.66727

Iteration 5 | Best K: 4
Iteration 5 | Silhouette Score: 0.37875
Iteration 5 | Average Hamming Loss:     0.22242
Iteration 5 | Average Hamming Score:    0.77758
...

Iteration 40 | Best K: 4
Iteration 40 | Silhouette Score: 0.37885
Iteration 40 | Average Hamming Loss:     0.22164
Iteration 40 | Average Hamming Score:    0.77836
Iteration 40 | Average Hamming Distance: 0.66491

Iteration 41 | Best K: 4
Iteration 41 | Silhouette Score: 0.38405
Iteration 41 | Average Hamming Loss:     0.23336
Iteration 41 | Average Hamming Score:    0.76664
Iteration 41 | Average Hamming Distance: 0.70007

Iteration 42 | Best K: 4
Iteration 42 | Silhouette Score: 0.37875
Iteration 42 | Average Hamming Loss:     0.22242
Iteration 42 | Average Hamming Score:    0.77758
Iteration 42 | Average Hamming Distance: 0.66727

Iteration 43 | Best K: 4
Iteration 43 | Silhouette Score: 0.37875
Iteration 43 | Average Hamming Loss:     0.22242
Iteration 43 | Average Hamming Score:    0.77758
Iteration 43 | Average Hamming Distance: 0.66727

Iteration 44 | Best K: 4
Iteration 44 | Silhouette Score: 0.37875
Iteration 44 | Average Hamming Loss:     0.22242
...

In [10]:
print('Monte-Carlo Simulation Results:')
print(f'Hamming Distance |            average = {np.mean(hammings):.5f}')
print(f'Hamming Distance | standard deviation = {np.std(hammings):.5f}')

Monte-Carlo Simulation Results:
Hamming Distance |            average = 0.67240
Hamming Distance | standard deviation = 0.01722
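
A note on the standard deviation reported above: `np.std` defaults to `ddof=0`, i.e. the population formula. If the 50 runs are treated as a sample, `ddof=1` may be the more appropriate estimate. The sketch below illustrates the difference on five distances copied from the printed iterations (1, 2, 40, 41 and 42); it is not a recomputation of the full 50-run result, and `subset` is a hypothetical name.

```python
import numpy as np

# Five Hamming distances taken from the printed iterations, for illustration only
subset = np.array([0.66727, 0.66727, 0.66491, 0.70007, 0.66727])

mean_dist = subset.mean()
pop_std = np.std(subset)             # ddof=0: divides by n
sample_std = np.std(subset, ddof=1)  # ddof=1: divides by n - 1

print(f'mean = {mean_dist:.5f}')
print(f'population std = {pop_std:.5f}')
print(f'sample std     = {sample_std:.5f}')
```

The two estimates differ by a factor of sqrt(n / (n - 1)), which is about 1% for n = 50, so the choice barely changes the reported figure here.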
