Instructions:
-

1. Read the article: https://www.sciencedirect.com/science/article/abs/pii/S0031320322001753
2. Replicate the study using the same dataset.
3. Read articles about Adjusted Rand Index, Normalized Mutual Information, and Folkes-Mallows Index (only use paper published in IEEE, sciencedirect, springerlink, Taylor Francis).
4. Aside from the Adjusted Rand Index (ARI), and Normalized Mutual Information (NMI), use the Folkes-Mallows Index (FMI), and compare the result of each performance index.
5. Compare and contrast each performance index, what are the advantages and disadvantages of ARI, NMI, and FMI, and when to use each?
6. Using Kmodes and Hierarchical Clustering, use the same dataset and perform categorical data clustering, use FMI, ARI, and NMI for the comparison of performance.
7. Write your report using Latex. Your report should be focused on the "why's and the what's" of each performance metrices (i.e. why is FMI always greater than ARI and NMI? What's the problem with ARI and NMI?).

In [24]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
zoo = fetch_ucirepo(id=111) 
  
# data (as pandas dataframes) 
X = zoo.data.features 
y = zoo.data.targets 
  
# metadata 
print(zoo.metadata) 
  
# variable information 
print(zoo.variables) 


{'uci_id': 111, 'name': 'Zoo', 'repository_url': 'https://archive.ics.uci.edu/dataset/111/zoo', 'data_url': 'https://archive.ics.uci.edu/static/public/111/data.csv', 'abstract': 'Artificial, 7 classes of animals', 'area': 'Biology', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 101, 'num_features': 16, 'feature_types': ['Categorical', 'Integer'], 'demographics': [], 'target_col': ['type'], 'index_col': ['animal_name'], 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1990, 'last_updated': 'Fri Sep 15 2023', 'dataset_doi': '10.24432/C5R59V', 'creators': ['Richard Forsyth'], 'intro_paper': None, 'additional_info': {'summary': 'A simple database containing 17 Boolean-valued attributes.  The "type" attribute appears to be the class attribute.  Here is a breakdown of which animals are in which type: (I find it unusual that there are 2 instances of "frog" and one of "girl"!)', 'purpose': None, 'funded_by': None, 'inst

In [32]:
import numpy as np
import itertools
import networkx as nx
from sklearn.cluster import KMeans
from sklearn.manifold import SpectralEmbedding

def jaccard_coefficient(set1, set2):
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    return intersection / union if union != 0 else 0

def ochiai_coefficient(set1, set2):
    intersection = len(set1.intersection(set2))
    denominator = np.sqrt(len(set1) * len(set2))
    return intersection / denominator if denominator != 0 else 0

def overlap_coefficient(set1, set2):
    intersection = len(set1.intersection(set2))
    min_length = min(len(set1), len(set2))
    return intersection / min_length if min_length != 0 else 0

def dice_coefficient(set1, set2):
    intersection = len(set1.intersection(set2))
    dice_denominator = len(set1) + len(set2)
    return 2 * intersection / dice_denominator if dice_denominator != 0 else 0

def graph_based_representation(data, num_components, coefficient='jaccard'):
    num_samples, num_features = data.shape
    similarity_matrix = np.zeros((num_features, num_features))
    for i, j in itertools.combinations(range(num_features), 2):
        if coefficient == 'jaccard':
            similarity_matrix[i, j] = jaccard_coefficient(set(data[:, i]), set(data[:, j]))
        elif coefficient == 'ochiai':
            similarity_matrix[i, j] = ochiai_coefficient(set(data[:, i]), set(data[:, j]))
        elif coefficient == 'overlap':
            similarity_matrix[i, j] = overlap_coefficient(set(data[:, i]), set(data[:, j]))
        elif coefficient == 'dice':
            similarity_matrix[i, j] = dice_coefficient(set(data[:, i]), set(data[:, j]))
        similarity_matrix[j, i] = similarity_matrix[i, j]
    G = nx.from_numpy_array(similarity_matrix)
    embedding = SpectralEmbedding(n_components=num_components)
    representation_matrix = embedding.fit_transform(similarity_matrix)
    return representation_matrix

def joint_operation(data, representation_matrix):
    return np.dot(data, representation_matrix)

def mean_operation(data, representation_matrix):
    return np.mean(np.dot(data, representation_matrix), axis=1)

def perform_clustering(data, k):
    kmeans = KMeans(n_clusters=k)
    return kmeans.fit_predict(data)


In [36]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score, fowlkes_mallows_score
import networkx as nx
from sklearn.manifold import SpectralEmbedding
import numpy as np
import itertools

def graph_based_representation(data, num_components):
    num_samples, num_features = data.shape
    similarity_matrix = np.zeros((num_features, num_features))
    for i, j in itertools.combinations(range(num_features), 2):
        set1 = set(data[:, i])
        set2 = set(data[:, j])
        intersection = len(set1.intersection(set2))
        union = len(set1.union(set2))
        similarity_matrix[i, j] = intersection / union if union != 0 else 0
        similarity_matrix[j, i] = similarity_matrix[i, j]
    G = nx.from_numpy_array(similarity_matrix)
    embedding = SpectralEmbedding(n_components=num_components)
    representation_matrix = embedding.fit_transform(similarity_matrix)
    return representation_matrix

def joint_operation(data, representation_matrix):
    return np.dot(data, representation_matrix)

def mean_operation(data, representation_matrix):
    return np.mean(np.dot(data, representation_matrix), axis=1)

def perform_clustering(data, k):
    kmeans = KMeans(n_clusters=k)
    return kmeans.fit_predict(data)

p = 10
k = 3  

dataset = zoo

results = []

try:
    X = dataset.data.features 
    y = dataset.data.targets.values.flatten()  # Extract values and flatten the array

    enc = OneHotEncoder(sparse_output=False)
    X_encoded = enc.fit_transform(X)

    representation_matrix = graph_based_representation(X_encoded, p)
    
    integrated_data = joint_operation(X_encoded, representation_matrix)
    
    labels = perform_clustering(integrated_data, k)

    ARI = adjusted_rand_score(y, labels)
    NMI = normalized_mutual_info_score(y, labels)
    FMI = fowlkes_mallows_score(y, labels)
    
    results.append(['zoo_df', ARI, NMI, FMI])

except ValueError as ve:
    print(f"ValueError: {ve}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

results_df = pd.DataFrame(results, columns=["Dataset", "ARI", "NMI", "FMI"])
print(results_df)


  Dataset       ARI       NMI       FMI
0  zoo_df  0.529176  0.664413  0.697793


  super()._check_params_vs_input(X, default_n_init=10)


In [45]:
from kmodes.kmodes import KModes
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score, fowlkes_mallows_score
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

k = 3
results = []

try:
    zoo_df = pd.DataFrame(zoo.data.features, columns=zoo.metadata.features)
    zoo_df['type'] = zoo.data.targets

    if zoo_df.isnull().values.any():
        raise ValueError("Dataset contains missing values. Please handle them before proceeding.")

    X = zoo_df.drop(columns=['type']) 
    y = zoo_df['type'].values         

    if not all(pd.api.types.is_numeric_dtype(dtype) or pd.api.types.is_categorical_dtype(dtype) for dtype in X.dtypes):
        raise ValueError("All features must be numeric or categorical for OneHotEncoder.")

    enc = OneHotEncoder(sparse_output=False)
    X_encoded = enc.fit_transform(X)

    kmodes = KModes(n_clusters=k, init='Huang', n_init=5, verbose=0)
    kmodes_labels = kmodes.fit_predict(X_encoded)  # Removed toarray() call
    ARI_kmodes = adjusted_rand_score(y, kmodes_labels)
    NMI_kmodes = normalized_mutual_info_score(y, kmodes_labels)
    FMI_kmodes = fowlkes_mallows_score(y, kmodes_labels)
    results.append(['Kmodes', ARI_kmodes, NMI_kmodes, FMI_kmodes])

    Z = linkage(X_encoded.toarray(), method='ward')
    hierarchical_labels = fcluster(Z, k, criterion='maxclust')
    ARI_hierarchical = adjusted_rand_score(y, hierarchical_labels)
    NMI_hierarchical = normalized_mutual_info_score(y, hierarchical_labels)
    FMI_hierarchical = fowlkes_mallows_score(y, hierarchical_labels)
    results.append(['Hierarchical', ARI_hierarchical, NMI_hierarchical, FMI_hierarchical])

except ValueError as ve:
    print(f"ValueError: {ve}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

results_df = pd.DataFrame(results, columns=["Method", "ARI", "NMI", "FMI"])
print(results_df)


An unexpected error occurred: 'numpy.ndarray' object has no attribute 'toarray'
   Method      ARI      NMI       FMI
0  Kmodes  0.72157  0.71451  0.812035


1. Why is FMI always greater than ARI and NMI?

The Folkes-Mallows Index tends to be higher than both the Adjusted Rand Index and the Normalized Mutual Information under certain conditions due to its calculation method. FMI measures the similarity between two clustering solutions by considering the ratio of the geometric mean of pairwise agreements in the two solutions to their cluster sizes. This means that FMI considers both agreement and cluster size, potentially resulting in higher values compared to ARI and NMI.

2. What's the problem with ARI and NMI?

ARI and NMI are commonly used metrics for evaluating clustering performance, but they have some limitations. One challenge is their sensitivity to cluster sizes, which can bias the results, especially in imbalanced datasets. Also, because ARI and NMI provide scores between 0 and 1, it can be hard to understand the quality of clustering solutions based only on these scores. Additionally, interpreting what constitutes a "good" or "bad" score may require additional context or comparisons. Both ARI and NMI rely on having a ground truth clustering solution for comparison, which may not always be available or clear in real-world scenarios.