# LECTURE 3: Intro Machine Learning

Instructions:
-

1. Read the article: https://www.sciencedirect.com/science/article/abs/pii/S0031320322001753
2. Replicate the study using the same dataset.
3. Read articles about the Adjusted Rand Index, Normalized Mutual Information, and the Fowlkes-Mallows Index (use only papers published by IEEE, ScienceDirect, SpringerLink, or Taylor & Francis).
4. In addition to the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI), compute the Fowlkes-Mallows Index (FMI) and compare the results of the three performance indices.
5. Compare and contrast the performance indices: what are the advantages and disadvantages of ARI, NMI, and FMI, and when should each be used?
6. Using K-Modes and hierarchical clustering on the same dataset, perform categorical data clustering and compare performance using FMI, ARI, and NMI.
7. Write your report in LaTeX. The report should focus on the "whys and whats" of each performance metric (e.g., why is FMI always greater than ARI and NMI? What is the problem with ARI and NMI?).
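
For reference when comparing the three indices, their usual definitions can be written as below. This is a summary sketch, not taken from the assigned article: the symbols TP, FP, FN, TN (pair counts), U (ground-truth partition) and V (predicted partition) are notation assumed here, and the NMI normalization shown is the arithmetic mean, which is understood to be the scikit-learn default.

```latex
% Pair counts over all \binom{n}{2} pairs of samples:
%   TP: same cluster in both partitions    FP: same cluster in the prediction only
%   FN: same cluster in the ground truth only    TN: separated in both
\[
\mathrm{FMI} = \frac{TP}{\sqrt{(TP+FP)\,(TP+FN)}}
\]
\[
\mathrm{RI} = \frac{TP+TN}{\binom{n}{2}}, \qquad
\mathrm{ARI} = \frac{\mathrm{RI}-\mathbb{E}[\mathrm{RI}]}{1-\mathbb{E}[\mathrm{RI}]}
\]
\[
\mathrm{NMI}(U,V) = \frac{I(U;V)}{\tfrac{1}{2}\bigl(H(U)+H(V)\bigr)}
\]
```

Only ARI subtracts the agreement expected by chance, which is one reason it tends to sit lower than FMI on the same pair of partitions.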

In [2]:
from ucimlrepo import fetch_ucirepo
import numpy as np
import pandas as pd

# fetch dataset 
iris = fetch_ucirepo(id=53) 
  
# data (as pandas dataframes) 
X = iris.data.features 
y = iris.data.targets 

df = pd.DataFrame(iris.data.original, columns=iris.headers)
df


,sepal length,sepal width,petal length,petal width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


In [7]:
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score, fowlkes_mallows_score

# Extract features
X = df[['sepal length', 'sepal width', 'petal length', 'petal width']]

# Fit KMeans (n_init is set explicitly to avoid the FutureWarning about its changing default)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)

# Predicted clusters
y_pred = kmeans.labels_

# Actual labels, encoded as integers (map avoids the downcasting warning that replace can raise)
y_true = df['class'].map({'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2})




In [8]:
# Compute ARI
ari = adjusted_rand_score(y_true, y_pred)

# Compute NMI
nmi = normalized_mutual_info_score(y_true, y_pred)

# Compute FMI
fmi = fowlkes_mallows_score(y_true, y_pred)

print(f"Adjusted Rand Index (ARI): {ari}")
print(f"Normalized Mutual Information (NMI): {nmi}")
print(f"Fowlkes-Mallows Index (FMI): {fmi}")


Adjusted Rand Index (ARI): 0.7302382722834697
Normalized Mutual Information (NMI): 0.7581756800057784
Fowlkes-Mallows Index (FMI): 0.8208080729114153
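
To see why FMI comes out higher than ARI here, it can help to compute both from raw pair counts. The following is a minimal pure-Python sketch, not the scikit-learn implementation. The helper names (`pair_counts`, `fowlkes_mallows`, `adjusted_rand`) and the toy label lists are made up for illustration.

```python
from itertools import combinations
from math import sqrt


def pair_counts(labels_true, labels_pred):
    """Classify every pair of samples by whether each labeling groups it together."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1      # together in both
        elif same_pred:
            fp += 1      # together only in the prediction
        elif same_true:
            fn += 1      # together only in the ground truth
        else:
            tn += 1      # separated in both
    return tp, fp, fn, tn


def fowlkes_mallows(tp, fp, fn, tn):
    """Geometric mean of pairwise precision and recall (no chance correction)."""
    denom = sqrt((tp + fp) * (tp + fn))
    return tp / denom if denom else 0.0


def adjusted_rand(tp, fp, fn, tn):
    """Pair-count form of ARI: observed agreement minus its chance expectation."""
    n_pairs = tp + fp + fn + tn
    if n_pairs == 0:
        return 1.0
    expected = (tp + fp) * (tp + fn) / n_pairs
    max_index = ((tp + fp) + (tp + fn)) / 2
    if max_index == expected:
        return 1.0
    return (tp - expected) / (max_index - expected)


# Hypothetical toy labelings (not the iris results above)
y_true_toy = [0, 0, 0, 1, 1, 1]
imperfect = [0, 0, 1, 1, 2, 2]
one_cluster = [0, 0, 0, 0, 0, 0]

for name, y_pred_toy in [("imperfect", imperfect), ("one cluster", one_cluster)]:
    counts = pair_counts(y_true_toy, y_pred_toy)
    print(name, counts,
          "FMI:", round(fowlkes_mallows(*counts), 3),
          "ARI:", round(adjusted_rand(*counts), 3))
```

In the second toy case every sample is put into a single cluster, which carries no information about the classes. ARI is 0 for it, while FMI stays well above 0 because it does not subtract the agreement expected by chance. This illustrates the gap on two toy cases; it is not a proof that FMI exceeds ARI in general.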


In [9]:
from kmodes.kmodes import KModes
from scipy.cluster.hierarchy import linkage, fcluster

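# K-Modes clustering
# Note: K-Modes compares values by exact match, so the continuous iris
# measurements are treated here as unordered categories.
km = KModes(n_clusters=3, init='Huang', random_state=42)
km_clusters = km.fit_predict(X)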

# Hierarchical clustering
Z = linkage(X, method='ward')
hier_clusters = fcluster(Z, 3, criterion='maxclust')

# Compare using ARI, NMI, and FMI for K-Modes
ari_kmodes = adjusted_rand_score(y_true, km_clusters)
nmi_kmodes = normalized_mutual_info_score(y_true, km_clusters)
fmi_kmodes = fowlkes_mallows_score(y_true, km_clusters)

# Compare using ARI, NMI, and FMI for Hierarchical
ari_hier = adjusted_rand_score(y_true, hier_clusters)
nmi_hier = normalized_mutual_info_score(y_true, hier_clusters)
fmi_hier = fowlkes_mallows_score(y_true, hier_clusters)

print(f"KModes ARI: {ari_kmodes}, NMI: {nmi_kmodes}, FMI: {fmi_kmodes}")
print(f"Hierarchical ARI: {ari_hier}, NMI: {nmi_hier}, FMI: {fmi_hier}")


KModes ARI: 0.09216062573782456, NMI: 0.13269189892685265, FMI: 0.43880184250711013
Hierarchical ARI: 0.7311985567707746, NMI: 0.7700836616487869, FMI: 0.8221697785442927
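
The low K-Modes scores are plausibly linked to the input type: K-Modes matches values by equality, and continuous measurements rarely repeat exactly. One possible preprocessing step, which is an assumption and not part of the original notebook, is to discretize each feature into a few bins first. The sketch below uses a hypothetical helper, `equal_width_bins`, written in plain Python. Whether binning improves the scores was not tested here.

```python
def equal_width_bins(values, n_bins=3):
    """Map numeric values to integer bin labels 0..n_bins-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # A constant column has a single category
        return [0] * len(values)
    # The maximum value would land in bin n_bins, so clamp it into the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Illustrative values only (hypothetical sample, not a notebook result)
sample = [4.3, 5.0, 6.0, 7.9]
print(equal_width_bins(sample))
```

The same idea can be applied column by column before calling `KModes`. `pandas.cut` with `labels=False` offers a comparable binning if pandas is preferred.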
