## ASSIGNMENT 3
#### by Joshua Rodriguez
#### Instructions:

1. Read the article: https://www.sciencedirect.com/science/article/abs/pii/S0031320322001753
2. Replicate the study using the same dataset.

In [1]:
from ucimlrepo import fetch_ucirepo 
import pandas as pd
import numpy as np
from sklearn.metrics.cluster import adjusted_rand_score, normalized_mutual_info_score, fowlkes_mallows_score
from sklearn.cluster import KMeans
from sklearn.manifold import SpectralEmbedding
from sklearn.preprocessing import OneHotEncoder
import networkx as nx
import itertools
import warnings
from sklearn.cluster import AgglomerativeClustering
from kmodes.kmodes import KModes
from sklearn.preprocessing import LabelEncoder 

In [2]:
heart_disease = fetch_ucirepo(id=45) 
breast_cancer = fetch_ucirepo(id=15)

In [3]:
X = heart_disease.data.features
y = heart_disease.data.targets 
heart_disease_df = pd.merge(X, y, left_index=True, right_index=True)
heart_disease_df = heart_disease_df.dropna()

X = breast_cancer.data.features
y = breast_cancer.data.targets 
breast_cancer_df = pd.merge(X, y, left_index=True, right_index=True)
breast_cancer_df = breast_cancer_df.dropna()

In [18]:
def jaccard_coefficient(set1, set2):
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    return intersection / union

def ochiai_coefficient(set1, set2):
    intersection = len(set1.intersection(set2))
    denominator = np.sqrt(len(set1) * len(set2))
    return intersection / denominator

def overlap_coefficient(set1, set2):
    intersection = len(set1.intersection(set2))
    min_length = min(len(set1), len(set2))
    return intersection / min_length

def dice_coefficient(set1, set2):
    intersection = len(set1.intersection(set2))
    denominator = len(set1) + len(set2)
    return 2 * intersection / denominator

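def graph_based_representation(data):
    # Build a feature-by-feature similarity matrix from the Jaccard coefficient
    # of the sets of values observed in each pair of columns.
    num_samples, num_features = data.shape
    similarity_matrix = np.zeros((num_features, num_features))
    for i, j in itertools.combinations(range(num_features), 2):
        similarity_matrix[i, j] = jaccard_coefficient(set(data[:, i]), set(data[:, j]))
        similarity_matrix[j, i] = similarity_matrix[i, j]
    # Embed each feature into p dimensions; p is a global set in a later cell.
    # With its default affinity, SpectralEmbedding treats the rows of
    # similarity_matrix as feature vectors rather than as a precomputed graph.
    embedding = SpectralEmbedding(n_components=p)
    representation_matrix = embedding.fit_transform(similarity_matrix)
    return representation_matrix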

def joint_operation(data, representation_matrix):
    return np.dot(data, representation_matrix)

def mean_operation(data, representation_matrix):
    return np.mean(np.dot(data, representation_matrix), axis=1)

def perform_clustering(data, k):
    kmeans = KMeans(n_clusters=k)
    return kmeans.fit_predict(data)
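
The four set-similarity coefficients defined above can be compared on a small worked example. This is only an illustrative sketch: the sets `a` and `b` are hypothetical values, not taken from either dataset, and the formulas are written inline so the block runs on its own.

```python
from math import sqrt

# Two small illustrative sets (hypothetical values, not from the datasets)
a = {1, 2, 3}
b = {2, 3, 4, 5}

shared = len(a & b)  # elements present in both sets

jaccard = shared / len(a | b)            # intersection over union
dice = 2 * shared / (len(a) + len(b))    # twice the intersection over the summed sizes
ochiai = shared / sqrt(len(a) * len(b))  # intersection over the geometric mean of sizes
overlap = shared / min(len(a), len(b))   # intersection over the smaller set

for name, value in [("Jaccard", jaccard), ("Dice", dice),
                    ("Ochiai", ochiai), ("Overlap", overlap)]:
    print(f"{name:8s}{value:.3f}")
```

For any pair of non-empty sets the values are ordered Jaccard ≤ Dice ≤ Ochiai ≤ overlap, so which coefficient is used changes the scale of the similarity matrix.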

3. Read articles about the Adjusted Rand Index, Normalized Mutual Information, and the Fowlkes-Mallows Index (only use papers published in IEEE, ScienceDirect, SpringerLink, or Taylor & Francis).
4. Aside from the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI), use the Fowlkes-Mallows Index (FMI), and compare the results of each performance index.

In [30]:
# Define parameters
p = 10  # embedding dimension, read as a global by graph_based_representation
q = 10  # defined but not used in this notebook
k = 3   # number of clusters

results = []

# Data preprocessing and clustering
datasets = ["heart_disease_df", "breast_cancer_df"]
for dataset_name in datasets:
    dataset = globals()[dataset_name]
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore")
        try:
            # NOTE: X keeps the target column (the last column), so the labels
            # are one-hot encoded together with the features.
            X = dataset
            enc = OneHotEncoder()
            X_encoded = enc.fit_transform(X)
            representation_matrix = graph_based_representation(X_encoded.toarray())
            integrated_data = joint_operation(X_encoded.toarray(), representation_matrix)
            labels = perform_clustering(integrated_data, k)

            # Score the cluster assignments against the ground-truth labels
            true_labels = dataset.iloc[:, -1]
            ARI = adjusted_rand_score(true_labels, labels)
            NMI = normalized_mutual_info_score(true_labels, labels)
            FMI = fowlkes_mallows_score(true_labels, labels)
            results.append([dataset_name, ARI, NMI, FMI])
        except UserWarning as e:
            print(f"Warning: {e}")

results_df = pd.DataFrame(results, columns=["Dataset", "ARI", "NMI", "FMI"])
print(results_df)


            Dataset       ARI       NMI       FMI
0  heart_disease_df  0.013511  0.026229  0.420845
1  breast_cancer_df  0.553329  0.547089  0.765220


The Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and Fowlkes-Mallows Index (FMI) are external validation measures: each compares a clustering against ground-truth labels. ARI counts the pairs of samples on which the two partitions agree and corrects that count for chance, so a random labeling scores close to 0. NMI measures the information shared between the two partitions, normalized to the range [0, 1]; it is often preferred when clusters differ in size, but it is not corrected for chance. FMI is the geometric mean of pairwise precision and recall, so it rewards pairs that are correctly grouped together but ignores pairs that are correctly kept apart (true negatives). Each measure has strengths and weaknesses: ARI works well when clusters are similar in size, NMI when they are not, and FMI when pairwise precision and recall are the quantities of interest. The best choice depends on the specific clustering task.
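
To make the pair-counting view of FMI concrete, the sketch below computes it by hand on a toy example and checks the result against scikit-learn. The two label lists are hypothetical and chosen only for illustration; `fmi_by_hand` and the counter names are likewise illustrative.

```python
from itertools import combinations
from math import sqrt

from sklearn.metrics import fowlkes_mallows_score

# Toy labelings (hypothetical, for illustration only)
true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [0, 0, 1, 1, 2, 2]

# Classify every pair of samples by whether the two labelings group it together
tp = fp = fn = tn = 0
for i, j in combinations(range(len(true_labels)), 2):
    same_true = true_labels[i] == true_labels[j]
    same_pred = pred_labels[i] == pred_labels[j]
    if same_true and same_pred:
        tp += 1  # grouped together in both
    elif same_pred:
        fp += 1  # grouped together only in the prediction
    elif same_true:
        fn += 1  # grouped together only in the ground truth
    else:
        tn += 1  # kept apart in both; never enters the FMI formula

precision = tp / (tp + fp)
recall = tp / (tp + fn)
fmi_by_hand = sqrt(precision * recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"FMI by hand: {fmi_by_hand:.4f}")
print(f"FMI sklearn: {fowlkes_mallows_score(true_labels, pred_labels):.4f}")
```

The `tn` counter is tallied but never used in the score, which is exactly the sense in which FMI ignores pairs that are correctly kept apart.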

In [31]:
results_categorical = []

for dataset_name in datasets:
    dataset = globals()[dataset_name]
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=UserWarning)
        try:
            X_cat = dataset.iloc[:, :-1]
            true_labels_cat = dataset.iloc[:, -1]

            encoder = LabelEncoder()
            X_cat_encoded = X_cat.apply(encoder.fit_transform)

            km = KModes(n_clusters=k, init='Huang', n_init=5, verbose=0)
            km_labels = km.fit_predict(X_cat_encoded)
            ARI_km = adjusted_rand_score(true_labels_cat, km_labels)
            NMI_km = normalized_mutual_info_score(true_labels_cat, km_labels)
            FMI_km = fowlkes_mallows_score(true_labels_cat, km_labels)

            ac = AgglomerativeClustering(n_clusters=k, linkage='ward')
            ac_labels = ac.fit_predict(X_cat_encoded)
            ARI_ac = adjusted_rand_score(true_labels_cat, ac_labels)
            NMI_ac = normalized_mutual_info_score(true_labels_cat, ac_labels)
            FMI_ac = fowlkes_mallows_score(true_labels_cat, ac_labels)
            
            results_categorical.append([dataset_name + " (Kmodes)", ARI_km, NMI_km, FMI_km])
            results_categorical.append([dataset_name + " (Hierarchical)", ARI_ac, NMI_ac, FMI_ac])
        except UserWarning as e:
            print(f"Warning: {e}")

results_categorical_df = pd.DataFrame(results_categorical, columns=["Dataset", "ARI", "NMI", "FMI"])
print(results_categorical_df)

                           Dataset       ARI       NMI       FMI
0        heart_disease_df (Kmodes)  0.269859  0.207009  0.525872
1  heart_disease_df (Hierarchical)  0.009543  0.010944  0.353434
2        breast_cancer_df (Kmodes)  0.351220  0.439297  0.651028
3  breast_cancer_df (Hierarchical)  0.781678  0.688725  0.893555


##### 7. Write your report using LaTeX. Your report should focus on the "whys and whats" of each performance metric (i.e., why is FMI always greater than ARI and NMI? What is the problem with ARI and NMI?).
    
In every row of the results above, for the graph-based representation as well as for K-modes and hierarchical clustering on the heart disease and breast cancer datasets, the Fowlkes-Mallows Index (FMI) is higher than the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI). FMI therefore gives the most optimistic evaluation of clustering quality. The main reason is that FMI is not corrected for chance and ignores true-negative pairs, so even a poor clustering receives a clearly positive score: hierarchical clustering on heart disease reaches an FMI of 0.353 with an ARI of only 0.010. ARI subtracts the agreement expected by chance, which pulls it toward 0 whenever the clusters barely align with the ground truth. The ARI and NMI scores are low for heart disease (at most 0.27), indicating that those clusters do not align well with the ground-truth labels, whereas hierarchical clustering on breast cancer aligns much better (ARI 0.782, NMI 0.689). ARI and NMI have limitations of their own: NMI is not adjusted for chance, and both can be sensitive to the number and relative sizes of the clusters. Therefore, while FMI offers the more favorable assessment, caution is warranted in interpreting clustering results from any single index.
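
One way to see why FMI sits above ARI is to score pairs of completely random labelings, where no real agreement exists. The following is a minimal sketch under assumed settings: the sample size, number of clusters, number of trials, and seed are arbitrary choices for illustration and are unrelated to the datasets used in this assignment.

```python
import numpy as np
from sklearn.metrics import (
    adjusted_rand_score,
    fowlkes_mallows_score,
    normalized_mutual_info_score,
)

# Assumed settings for the illustration (not taken from the assignment)
rng = np.random.default_rng(0)
n_samples, n_clusters, n_trials = 300, 3, 200

scores = {"ARI": [], "NMI": [], "FMI": []}
for _ in range(n_trials):
    # Two independent random labelings: any agreement is due to chance alone
    a = rng.integers(0, n_clusters, size=n_samples)
    b = rng.integers(0, n_clusters, size=n_samples)
    scores["ARI"].append(adjusted_rand_score(a, b))
    scores["NMI"].append(normalized_mutual_info_score(a, b))
    scores["FMI"].append(fowlkes_mallows_score(a, b))

mean_scores = {name: float(np.mean(values)) for name, values in scores.items()}
for name, value in mean_scores.items():
    print(f"{name}: {value:.3f}")
```

With three equally likely random clusters, about one third of all pairs end up together in each labeling, so the mean FMI should land near 1/3 while the mean ARI stays near 0. A gap between FMI and ARI is therefore expected even when the clustering carries no information.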