## Cluster Analysis
This notebook presents a cluster analysis of the complete dataset (Complete_data) and its derived datasets (Influencers_taxa and Selected_data), as defined in earlier notebooks. The pipeline is described below:

__Data Preparation:__  
    Only scaling and normalization were applied, as data cleaning was already performed in earlier steps.  
__Dimensionality Reduction:__  
    Principal Component Analysis (PCA) was used to reduce dimensionality while retaining approximately 90% of the variance.  
    This step resulted in transformed datasets with reduced feature dimensions, which were subsequently used for clustering.  
__Clustering Analysis:__  
    Clustering was performed on the transformed datasets using:   
    K-Means, with k=5 determined by the elbow method.  
    DBSCAN, for detecting clusters and outliers.  
    Gaussian Mixture Models (GMM), to allow soft cluster assignments and elliptical cluster shapes.  
__Performance evaluation included:__   
    Internal metrics, such as Silhouette Score and Davies-Bouldin Index.  
    External metrics (where labels were available), including Adjusted Rand Index (ARI) and Homogeneity Score.  
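
The 90% variance target in the dimensionality reduction step can be requested directly from scikit-learn by passing a float to `PCA`. The snippet below is a minimal sketch on synthetic data: the matrix `X` is a hypothetical stand-in for an abundance table, not the Jointax data, and `svd_solver="full"` is set explicitly so that the fractional `n_components` is honoured.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-in for an abundance table: 70 sites x 30 correlated features
latent = rng.normal(size=(70, 5))
X = latent @ rng.normal(size=(5, 30)) + 0.1 * rng.normal(size=(70, 30))

X_scaled = StandardScaler().fit_transform(X)

# A float in (0, 1) asks PCA for the smallest number of components whose
# cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.90, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

retained = pca.explained_variance_ratio_.sum()
print(f"components kept: {pca.n_components_}, variance retained: {retained:.3f}")
```

The number of components kept depends on the data, so it is worth printing `pca.n_components_` before clustering rather than assuming a fixed value.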

In [200]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score, adjusted_rand_score, davies_bouldin_score

Because the abundance data are not Gaussian-distributed, MinMaxScaler is nominally the more appropriate scaler. In practice it only worked well with DBSCAN, so StandardScaler is used together with the K-Means cluster analysis. The first pipeline scales the data, reduces dimensionality with PCA (the stated target is 90% of the variance; the call below keeps 2 components), and then applies three clustering methods (K-Means, DBSCAN, and GMM) to the PCA-reduced data. The methods are evaluated with the Silhouette Score and the Davies-Bouldin Index for internal evaluation, and with the Adjusted Rand Index (ARI) for external evaluation, since true labels are available.
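
As an illustration of the external metrics, the sketch below compares known categories with cluster assignments using `adjusted_rand_score` and `homogeneity_score` from scikit-learn. The label lists are hypothetical toy values chosen for clarity, not results from this dataset.

```python
from sklearn.metrics import adjusted_rand_score, homogeneity_score

# Hypothetical labels for eight sites: known categories vs. cluster assignments
true_categories = [1, 1, 1, 2, 2, 3, 3, 3]
cluster_labels = [0, 0, 0, 1, 1, 2, 2, 2]  # same grouping, different label names
merged_labels = [0, 0, 0, 0, 0, 1, 1, 1]   # categories 1 and 2 merged into one cluster

# ARI ignores how clusters are named and only compares the groupings
ari_perfect = adjusted_rand_score(true_categories, cluster_labels)
ari_merged = adjusted_rand_score(true_categories, merged_labels)

# Homogeneity drops below 1 when a cluster mixes several true categories
hom_merged = homogeneity_score(true_categories, merged_labels)

print(f"ARI (perfect): {ari_perfect:.3f}")
print(f"ARI (merged): {ari_merged:.3f}")
print(f"Homogeneity (merged): {hom_merged:.3f}")
```

In this notebook the same calls would presumably take `abundance_all["Category"]` as the true labels and `clustering_results[0]['kmeans']['cluster_labels']` as the predicted ones.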

In [201]:
# Loading the data
pd.options.display.float_format = '{:.4f}'.format
# Read the Excel file: the first two columns are strings, the remaining 882 are floats
Jointax = pd.read_excel(
    'data/Jointax.xlsx',
    sheet_name='Biotot_jointax',
    header=[0,1,2,3,4,5,6,7],
    dtype={**{i: str for i in range(0,2)}, **{i: float for i in range(2, 884)}},
    skiprows=[8],  # An empty row appeared after the headers; skiprows removes it
)

# Making sure the sites and categories get read as they should
Jointax["Sites"]= Jointax["Sites"].astype(str)
Jointax["Category"]= Jointax["Category"].values.astype(int)
# Drop the Kingdom level, which is not needed here
Jointax.columns = Jointax.columns.droplevel(1)
Jointax = Jointax.reset_index(drop=True)
#Setting the sites as index
Jointax = Jointax.set_index("Sites").reset_index()
# Deleting headers names of unnamed levels
Jointax.columns = Jointax.columns.map(lambda x: tuple('' if 'Unnamed' in str(level) else level for level in x))
#Drop column 1
Jointax =Jointax.drop(Jointax.columns[1], axis=1)
#Correcting the Tuple-like Index
Jointax['Sites'] = Jointax['Sites'].map(lambda x: x[0] if isinstance(x, tuple) else x)
Jointax = Jointax.set_index("Sites")

In [202]:
# Only the abundance values are used in this notebook; the taxa columns are kept
Jointax.columns = Jointax.columns.droplevel([0,1,2,3,4,5])
# Reset the index so that "Sites" becomes a regular column
abundance_all = Jointax.reset_index(drop=False)

In [203]:
if abundance_all.columns[1] == "":
    abundance_all.rename(columns = {abundance_all.columns[1]: "Category"}, inplace=True)
abundance_all= abundance_all.set_index("Sites")

In [204]:
abundance_all.head()

| Sites | Category | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 873 | 874 | 875 | 876 | 877 | 878 | 879 | 880 | 881 | 882 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| site_1 | 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.4308 | 0.517 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0215 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0215 | 0.0 |
| site_2 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.019 | 0.3415 | 0.0 | 0.0 | ... | 0.019 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.019 | 0.0 |
| site_3 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0246 | 0.3192 | 0.0 | 0.0 | ... | 0.0123 | 0.0 | 0.0 | 0.0123 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0246 | 0.0123 |
| site_4 | 1 | 0.0 | 0.0 | 0.0154 | 0.0 | 0.0 | 0.0176 | 0.2512 | 0.0 | 0.0 | ... | 0.0022 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0066 | 0.0022 |
| site_5 | 1 | 0.0 | 0.0 | 0.0037 | 0.0 | 0.0 | 0.0221 | 0.5098 | 0.0 | 0.0 | ... | 0.0037 | 0.0 | 0.0037 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.011 | 0.0037 |


In [205]:
def clustering_pipeline_all(df, n_clusters=5, eps=0.5, min_samples=5, n_components=2):
    """
    Performs clustering using K-Means, DBSCAN, and GMM with PCA for dimensionality reduction.
    
    Parameters:
    - df: Input DataFrame of numeric features plus a 'Category' label column (dropped before fitting).
    - n_clusters: Number of clusters for K-Means and GMM (default=5).
    - eps: DBSCAN's epsilon parameter (default=0.5).
    - min_samples: Minimum samples for DBSCAN (default=5).
    - n_components: Number of components for PCA (default=2). A float in (0, 1),
      e.g. 0.90, keeps enough components to explain that fraction of the variance.
    
    Returns:
    - results: Dictionary with clustering results for K-Means, DBSCAN, and GMM, as well as PCA data and metrics.
    """
    results = {}
    df = df.drop(columns=['Category'])  # Drop the label column so only features are clustered

    # Step 1: Scaling the data
    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(df)
    
    # Step 2: PCA for dimensionality reduction
    pca = PCA(n_components=n_components)
    pca_data = pca.fit_transform(scaled_data)
    results['explained_variance'] = pca.explained_variance_ratio_
    
    # Step 3: K-Means Clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    kmeans_labels = kmeans.fit_predict(pca_data)
    kmeans_silhouette = silhouette_score(pca_data, kmeans_labels)
    kmeans_db_score = davies_bouldin_score(pca_data, kmeans_labels)
    results['kmeans'] = {
        'cluster_labels': kmeans_labels,
        'silhouette_score': kmeans_silhouette,
        'davies_bouldin_score': kmeans_db_score
    }
    
    # Step 4: DBSCAN Clustering
    dbscan = DBSCAN(eps=eps, min_samples=min_samples)
    dbscan_labels = dbscan.fit_predict(pca_data)
    valid_indices = dbscan_labels != -1
    if len(set(dbscan_labels[valid_indices])) > 1:
        dbscan_silhouette = silhouette_score(pca_data[valid_indices], dbscan_labels[valid_indices])
        dbscan_db_score = davies_bouldin_score(pca_data[valid_indices], dbscan_labels[valid_indices])
    else:
        dbscan_silhouette = None
        dbscan_db_score = None
    results['dbscan'] = {
        'cluster_labels': dbscan_labels,
        'silhouette_score': dbscan_silhouette,
        'davies_bouldin_score': dbscan_db_score
    }
    
    # Step 5: GMM Clustering
    gmm = GaussianMixture(n_components=n_clusters, random_state=42)
    gmm_labels = gmm.fit_predict(pca_data)
    gmm_silhouette = silhouette_score(pca_data, gmm_labels)
    gmm_db_score = davies_bouldin_score(pca_data, gmm_labels)
    results['gmm'] = {
        'cluster_labels': gmm_labels,
        'silhouette_score': gmm_silhouette,
        'davies_bouldin_score': gmm_db_score
    }
    
    # Store PCA-transformed data
    results['pca_data'] = pca_data
    
    return results


In [206]:
# DataFrames to run through the pipeline (currently only abundance_all)
dataframes = [abundance_all]

# Running the  pipeline for each DataFrame
clustering_results = [clustering_pipeline_all(df, n_clusters=5, eps=0.5, min_samples=5, n_components=2) for df in dataframes]

# K-Means silhouette score for abundance_all
kmeans_silhouette_abundance_all = clustering_results[0]['kmeans']['silhouette_score']

kmeans_labels_abundance_all = clustering_results[0]['kmeans']['cluster_labels']

In [207]:
print(kmeans_labels_abundance_all , kmeans_silhouette_abundance_all)

[1 1 1 1 1 1 1 1 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 0 1 2 1 1 1 2 2 1 1 2 2
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 3 1 4 4 1 4 4 4 1 1 1 1 1 1 1] 0.7610374381507409
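
The labels above use k=5, which the introduction attributes to the elbow method. The notebook does not show that step, so the following is only a sketch of how an elbow curve is typically tabulated: K-Means is fitted for a range of k and the inertia (within-cluster sum of squares) is recorded. The data come from `make_blobs` and are a hypothetical stand-in for the PCA scores.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical 2-D data with 4 well-separated groups, standing in for PCA scores
X, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.6, random_state=42)

k_values = list(range(1, 9))
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

# The "elbow" is the k after which inertia stops dropping sharply
for k, inertia in zip(k_values, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Replacing `X` with `clustering_results[0]['pca_data']` would give the corresponding curve for this dataset; reading the elbow off the curve remains a judgment call.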


In [208]:
import matplotlib.pyplot as plt
import seaborn as sns

def plot_clusters(pca_data, cluster_labels, method_name="Clustering"):
    """
    Visualize clusters using PCA-transformed data.
    """
    plt.figure(figsize=(10, 6))
    sns.scatterplot(
        x=pca_data[:, 0],
        y=pca_data[:, 1],
        hue=cluster_labels,
        palette="viridis",
        s=50,
        alpha=0.8,
        edgecolor="k"
    )
    plt.title(f"{method_name} Visualization (PCA Reduced)", fontsize=16)
    plt.xlabel("Principal Component 1")
    plt.ylabel("Principal Component 2")
    plt.legend(title="Cluster", loc="best")
    plt.grid(True)
    plt.show()

# Example: Plot K-Means Clusters
# clustering_results is a list with one results dict per DataFrame, so index it first
pca_data = clustering_results[0]['pca_data']
kmeans_labels = clustering_results[0]['kmeans']['cluster_labels']

plot_clusters(pca_data, kmeans_labels, method_name="K-Means Clustering")

The second pipeline compares both scalers: MinMaxScaler, which is better suited to DBSCAN, and StandardScaler, which is better suited to K-Means.

In [None]:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Features only: "Sites" is already the index, so only the label column is dropped
X = abundance_all.drop(columns=["Category"])

# Define scalers and clustering methods
scalers = [("StandardScaler", StandardScaler()), ("MinMaxScaler", MinMaxScaler())]
clustering_methods = [
    ("DBSCAN", DBSCAN(eps=0.5, min_samples=5)),
    ("KMeans", KMeans(n_clusters=5, random_state=42))
]

results = []

# Loop through scalers and clustering methods
for scaler_name, scaler in scalers:
    for cluster_name, clusterer in clustering_methods:
        pipeline = Pipeline([
            ("scaler", scaler),
            ("clustering", clusterer)
        ])
        # Fit the pipeline
        pipeline.fit(X)

        # Retrieve cluster labels and the scaled data the clusterer actually saw
        labels = pipeline["clustering"].labels_
        X_scaled = pipeline["scaler"].transform(X)

        # Exclude DBSCAN noise points (label -1) from both the labels and the data
        mask = labels != -1

        # Evaluate metrics only if more than one cluster remains
        if len(np.unique(labels[mask])) > 1:
            silhouette = silhouette_score(X_scaled[mask], labels[mask])
            davies_bouldin = davies_bouldin_score(X_scaled[mask], labels[mask])
        else:
            silhouette, davies_bouldin = None, None

        # Store results
        results.append({
            "Scaler": scaler_name,
            "Clustering Method": cluster_name,
            "Silhouette Score": silhouette,
            "Davies-Bouldin Index": davies_bouldin
        })

# Display results
results_df = pd.DataFrame(results)
print(results_df)
