# Lesson 3
Instructions:
-

1. Read the article: https://www.sciencedirect.com/science/article/abs/pii/S0031320322001753
2. Replicate the study using the same dataset.
3. Read articles about the Adjusted Rand Index, Normalized Mutual Information, and the Fowlkes-Mallows Index (use only papers published in IEEE, ScienceDirect, SpringerLink, or Taylor & Francis).
4. In addition to the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI), use the Fowlkes-Mallows Index (FMI), and compare the results of the three performance indices.
5. Compare and contrast the performance indices: what are the advantages and disadvantages of ARI, NMI, and FMI, and when should each be used?
6. Using K-modes and hierarchical clustering, perform categorical data clustering on the same dataset, and use FMI, ARI, and NMI to compare performance.
7. Write your report in LaTeX. The report should focus on the "whys and whats" of each performance metric (e.g., why is FMI always greater than ARI and NMI? What is the problem with ARI and NMI?).

### Fetch Data

In [None]:
from ucimlrepo import fetch_ucirepo
import pandas as pd

# Fetch each dataset from the UCI Machine Learning Repository by its id
soybean = fetch_ucirepo(id=91)
zoo = fetch_ucirepo(id=111)
heart_disease = fetch_ucirepo(id=45)
dermatology = fetch_ucirepo(id=33)
breast_cancer = fetch_ucirepo(id=15)
mushroom = fetch_ucirepo(id=73)

In [2]:
import pandas as pd

datasets = [soybean, zoo, heart_disease, dermatology, breast_cancer, mushroom]
dataset_names = ['soybean', 'zoo', 'heart_disease', 'dermatology', 'breast_cancer', 'mushroom']
dfs = []

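for dataset, name in zip(datasets, dataset_names):
    X = dataset.data.features
    y = dataset.data.targets
    # Join features and targets on the row index; the target column(s) end up last
    df = pd.merge(X, y, left_index=True, right_index=True)
    # Drop rows with missing values
    df = df.dropna()
    dfs.append((name, df))

# dfs holds (name, DataFrame) pairs, so unpack only the DataFrames
soybean_df, zoo_df, heart_disease_df, dermatology_df, breast_cancer_df, mushroom_df = (df for _, df in dfs)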

In [4]:
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix
import numpy as np

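def folkes_mallows_index(true_labels, pred_labels):
    """Fowlkes-Mallows index: geometric mean of pairwise precision and recall."""
    # Rows are true classes, columns are predicted clusters
    contingency = contingency_matrix(true_labels, pred_labels)
    # Pairs of samples placed together in both labelings
    tp = np.sum(contingency * (contingency - 1)) / 2
    # Pairs placed together in the predicted clusters (column sums)
    tp_fp = np.sum(np.sum(contingency, axis=0) * (np.sum(contingency, axis=0) - 1)) / 2
    # Pairs placed together in the true classes (row sums)
    tp_fn = np.sum(np.sum(contingency, axis=1) * (np.sum(contingency, axis=1) - 1)) / 2
    precision = tp / tp_fp
    recall = tp / tp_fn
    fm_index = np.sqrt(precision * recall)
    return fm_index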

results = []

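for name, df in dfs:
    # Last column: the class label taken from the dataset's targets
    true_labels = df.iloc[:, -1].values
    # Second-to-last column: the last feature, used here as the second partition
    # (a raw attribute, not the output of a clustering algorithm)
    pred_labels = df.iloc[:, -2].values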
    
    ari = adjusted_rand_score(true_labels, pred_labels)
    
    nmi = normalized_mutual_info_score(true_labels, pred_labels)
    
    fmi = folkes_mallows_index(true_labels, pred_labels)
    
    results.append((name, ari, nmi, fmi))

results_df = pd.DataFrame(results, columns=['Dataset', 'ARI', 'NMI', 'FMI'])

results_df


| Dataset | ARI | NMI | FMI |
|---|---|---|---|
| soybean | 0.443357 | 0.590101 | 0.673148 |
| zoo | 0.129071 | 0.182616 | 0.437065 |
| heart_disease | 0.248801 | 0.157248 | 0.551633 |
| dermatology | 0.006635 | 0.186523 | 0.080122 |
| breast_cancer | 0.275454 | 0.205396 | 0.71976 |
| mushroom | 0.068546 | 0.069612 | 0.453842 |

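For reference, the three indices can be written in terms of the contingency table that `folkes_mallows_index` builds. The notation is introduced here for illustration and does not come from the notebook: $n_{ij}$ is the number of samples in true class $i$ and predicted cluster $j$, $a_i$ and $b_j$ are the row and column sums, $n$ is the number of samples, and $U$, $V$ are the two labelings. The NMI normalization shown is the arithmetic mean of the entropies, which is assumed to match the scikit-learn default.

$$
\mathrm{TP} = \sum_{i,j} \binom{n_{ij}}{2}, \qquad
\mathrm{TP} + \mathrm{FN} = \sum_{i} \binom{a_i}{2}, \qquad
\mathrm{TP} + \mathrm{FP} = \sum_{j} \binom{b_j}{2}
$$

$$
\mathrm{FMI} = \frac{\mathrm{TP}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})}}
$$

$$
\mathrm{ARI} = \frac{\mathrm{TP} - E}{\tfrac{1}{2}\left[(\mathrm{TP} + \mathrm{FP}) + (\mathrm{TP} + \mathrm{FN})\right] - E},
\qquad
E = \frac{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})}{\binom{n}{2}}
$$

$$
\mathrm{NMI} = \frac{I(U; V)}{\tfrac{1}{2}\left[H(U) + H(V)\right]}
$$

Here $E$ is the number of co-clustered pairs expected by chance, $I(U; V)$ is the mutual information, and $H(\cdot)$ is the entropy. Only ARI subtracts a chance term.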

### When to use each? 

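The Adjusted Rand Index (ARI) measures how similar two clusterings are by counting the pairs of points on which they agree, then correcting that count for the agreement expected by chance. It is close to 0 for random labelings, equals 1 for identical clusterings, and can be negative, which makes it a good default for comparing predicted clusters with true classes. Normalized Mutual Information (NMI) measures how much information the two clusterings share, scaled to lie between 0 and 1. It can compare clusterings with different numbers of clusters, but it is not corrected for chance and tends to increase as the number of clusters grows. The Fowlkes-Mallows Index (FMI) is the geometric mean of pairwise precision and recall, so it balances the two; it also lies between 0 and 1 and is not corrected for chance. In short, use ARI when scores must be comparable across datasets or cluster counts, NMI when an information-theoretic view of the shared structure is wanted, and FMI when the result should be read in terms of precision and recall.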

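One caveat is worth checking when the two labelings have different numbers of clusters. The sketch below is a minimal standard-library implementation of NMI, assuming arithmetic-mean normalization; the helper names (`entropy`, `mutual_information`, `nmi`) and the toy labels are made up for illustration and are not part of the notebook.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same samples."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum(
        (n_ij / n) * log(n * n_ij / (count_a[x] * count_b[y]))
        for (x, y), n_ij in joint.items()
    )


def nmi(a, b):
    """NMI with arithmetic-mean normalization (an assumption of this sketch)."""
    denom = (entropy(a) + entropy(b)) / 2
    # Assumption: two single-cluster labelings are treated as a perfect match
    return mutual_information(a, b) / denom if denom > 0 else 1.0


# Toy data: two balanced classes versus a labeling with one cluster per sample
true_classes = [0, 0, 0, 0, 1, 1, 1, 1]
singletons = list(range(8))

nmi_singletons = nmi(true_classes, singletons)
nmi_identical = nmi(true_classes, true_classes)
```

In this toy case `nmi_singletons` is about 0.5, even though putting every sample in its own cluster says nothing useful about the two classes. That is the sense in which NMI can reward a larger number of clusters.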
### Data Clustering & Compare Performance

In [6]:
from kmodes.kmodes import KModes
from sklearn.cluster import AgglomerativeClustering

# `df` and `true_labels` still refer to the last dataset from the loop above (mushroom).
# Cluster on the feature columns only, so the class label does not leak into the clustering.
X_cat = df.iloc[:, :-1].astype(str)

# Perform clustering using K-modes
km = KModes(n_clusters=5, init='Cao', n_init=5, verbose=1)
km_labels = km.fit_predict(X_cat)

# Perform clustering using Hierarchical Clustering
# Ward linkage needs numeric input, so the categorical features are one-hot encoded first
X_onehot = pd.get_dummies(X_cat, dtype=float)
hc = AgglomerativeClustering(n_clusters=5, linkage='ward')
hc_labels = hc.fit_predict(X_onehot)

# Compute FMI, ARI, and NMI scores for K-modes
fmi_km = folkes_mallows_index(true_labels, km_labels)
ari_km = adjusted_rand_score(true_labels, km_labels)
nmi_km = normalized_mutual_info_score(true_labels, km_labels)

# Compute FMI, ARI, and NMI scores for Hierarchical Clustering
fmi_hc = folkes_mallows_index(true_labels, hc_labels)
ari_hc = adjusted_rand_score(true_labels, hc_labels)
nmi_hc = normalized_mutual_info_score(true_labels, hc_labels)

# Display the results
results = {
    "Kmodes": {"FMI": fmi_km, "ARI": ari_km, "NMI": nmi_km},
    "Hierarchical Clustering": {"FMI": fmi_hc, "ARI": ari_hc, "NMI": nmi_hc}
}

results

### FMI, ARI, & NMI Evaluation

FMI (Fowlkes-Mallows Index) differs from ARI (Adjusted Rand Index) and NMI (Normalized Mutual Information) because the three indices are computed differently and measure different things. FMI is the geometric mean of pairwise precision and recall. ARI also counts agreeing pairs but subtracts the agreement expected by chance, while NMI measures the mutual dependence between the two labelings, normalized to lie between 0 and 1. FMI tends to be higher than ARI because it is not corrected for chance: any pairs that two labelings happen to group together raise FMI, whereas ARI subtracts the number of such pairs expected from unrelated labelings. FMI is not always the largest, though: for dermatology in the results above, FMI (0.080122) is lower than NMI (0.186523). The metrics should therefore be interpreted in light of the dataset's properties, such as the number and balance of classes, and of the clustering algorithm used.

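The gap between FMI and ARI can be reproduced on a small hand-made example. The sketch below uses only the standard library; the function names (`pair_counts`, `fmi`, `ari`) and the toy labels are illustrative assumptions, not the notebook's code or scikit-learn's implementation.

```python
from collections import Counter
from math import comb, sqrt


def pair_counts(true_labels, pred_labels):
    """Count sample pairs grouped together in both, in true, and in pred."""
    n = len(true_labels)
    together_both = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    together_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    together_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    return together_both, together_true, together_pred, comb(n, 2)


def fmi(true_labels, pred_labels):
    """Geometric mean of pairwise precision and recall."""
    tp, t_true, t_pred, _ = pair_counts(true_labels, pred_labels)
    # Assumption: return 0 when either labeling has no co-clustered pairs
    return tp / sqrt(t_true * t_pred) if t_true and t_pred else 0.0


def ari(true_labels, pred_labels):
    """Pair agreement minus its chance expectation, rescaled so the maximum is 1."""
    tp, t_true, t_pred, total = pair_counts(true_labels, pred_labels)
    expected = t_true * t_pred / total
    max_index = (t_true + t_pred) / 2
    # Assumption: degenerate case where no correction is possible counts as a match
    if max_index == expected:
        return 1.0
    return (tp - expected) / (max_index - expected)


# Toy data: the second labeling splits every true class evenly,
# so it carries no information about the classes
true_classes = [0, 0, 0, 0, 1, 1, 1, 1]
unrelated = [0, 1, 0, 1, 0, 1, 0, 1]

fmi_value = fmi(true_classes, unrelated)
ari_value = ari(true_classes, unrelated)
```

For these labels FMI is about 1/3 while ARI is about -1/6: the four pairs that happen to be grouped together in both labelings lift FMI above zero, while ARI subtracts the roughly 5.14 such pairs expected by chance and ends up negative.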

