In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score, davies_bouldin_score, v_measure_score, homogeneity_score, completeness_score

# Load and preprocess the Iris dataset
print("Loading dataset...")
iris = load_iris()
data = pd.DataFrame(data=iris.data, columns=iris.feature_names)
target = pd.Series(iris.target, name='target')

# Standardize the features
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# Clustering with K-Means
print("Applying K-Means...")
kmeans = KMeans(n_clusters=3, random_state=42)
clusters_kmeans = kmeans.fit_predict(data_scaled)

# Clustering with DBSCAN
print("Applying DBSCAN...")
dbscan = DBSCAN(eps=0.5, min_samples=5)
clusters_dbscan = dbscan.fit_predict(data_scaled)

# Clustering with Hierarchical Clustering
print("Applying Agglomerative Clustering...")
agg_clustering = AgglomerativeClustering(n_clusters=3)
clusters_agg = agg_clustering.fit_predict(data_scaled)

# Evaluation Metrics

# Homogeneity and Completeness
homogeneity_kmeans = homogeneity_score(target, clusters_kmeans)
completeness_kmeans = completeness_score(target, clusters_kmeans)

homogeneity_dbscan = homogeneity_score(target, clusters_dbscan)
completeness_dbscan = completeness_score(target, clusters_dbscan)

homogeneity_agg = homogeneity_score(target, clusters_agg)
completeness_agg = completeness_score(target, clusters_agg)

# V-measure
v_measure_kmeans = v_measure_score(target, clusters_kmeans)
v_measure_dbscan = v_measure_score(target, clusters_dbscan)
v_measure_agg = v_measure_score(target, clusters_agg)

# Silhouette Coefficient
# Both internal metrics need at least two distinct labels. DBSCAN marks noise
# points with the label -1, and these are scored here as if they formed a cluster.
silhouette_kmeans = silhouette_score(data_scaled, clusters_kmeans)
silhouette_dbscan = silhouette_score(data_scaled, clusters_dbscan) if len(set(clusters_dbscan)) > 1 else 'N/A'
silhouette_agg = silhouette_score(data_scaled, clusters_agg)

# Davies-Bouldin Index (lower is better)
davies_bouldin_kmeans = davies_bouldin_score(data_scaled, clusters_kmeans)
davies_bouldin_dbscan = davies_bouldin_score(data_scaled, clusters_dbscan) if len(set(clusters_dbscan)) > 1 else 'N/A'
davies_bouldin_agg = davies_bouldin_score(data_scaled, clusters_agg)

# Print Results
print(f"K-Means Homogeneity: {homogeneity_kmeans:.2f}")
print(f"K-Means Completeness: {completeness_kmeans:.2f}")
print(f"K-Means V-Measure: {v_measure_kmeans:.2f}")
print(f"K-Means Silhouette Coefficient: {silhouette_kmeans:.2f}")
print(f"K-Means Davies-Bouldin Index: {davies_bouldin_kmeans:.2f}")

print(f"DBSCAN Homogeneity: {homogeneity_dbscan:.2f}")
print(f"DBSCAN Completeness: {completeness_dbscan:.2f}")
print(f"DBSCAN V-Measure: {v_measure_dbscan:.2f}")
print(f"DBSCAN Silhouette Coefficient: {silhouette_dbscan}")
print(f"DBSCAN Davies-Bouldin Index: {davies_bouldin_dbscan}")

print(f"Agglomerative Clustering Homogeneity: {homogeneity_agg:.2f}")
print(f"Agglomerative Clustering Completeness: {completeness_agg:.2f}")
print(f"Agglomerative Clustering V-Measure: {v_measure_agg:.2f}")
print(f"Agglomerative Clustering Silhouette Coefficient: {silhouette_agg:.2f}")
print(f"Agglomerative Clustering Davies-Bouldin Index: {davies_bouldin_agg:.2f}")

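# Raw string: otherwise Python reads \t, \f, \b and \n inside \text, \frac, \beta and \ne as escape sequences
print(r"""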
Q1. **Homogeneity and Completeness**:
   - **Homogeneity** measures whether each cluster contains only data points from a single ground-truth class. It is 1 minus the conditional entropy of the true labels given the cluster assignments, normalized by the entropy of the true labels.
   - **Completeness** measures whether all data points of a given class are assigned to the same cluster. It is 1 minus the conditional entropy of the cluster assignments given the true labels, normalized by the entropy of the cluster assignments.
   - Both metrics range from 0 to 1 and are calculated as:
     \[
     \text{Homogeneity} = 1 - \frac{H(T|C)}{H(T)}
     \]
     \[
     \text{Completeness} = 1 - \frac{H(C|T)}{H(C)}
     \]
     where \(H\) denotes entropy, \(C\) is the cluster labels, and \(T\) is the true labels.

Q2. **V-Measure**:
   - The V-measure combines homogeneity and completeness into a single score. It is defined as their weighted harmonic mean:
     \[
     \text{V-Measure} = \frac{(1 + \beta) \cdot \text{Homogeneity} \cdot \text{Completeness}}{\beta \cdot \text{Homogeneity} + \text{Completeness}}
     \]
   - Here, \(\beta\) weights the two terms: \(\beta > 1\) gives more weight to completeness, and \(\beta < 1\) gives more weight to homogeneity. With the usual choice \(\beta = 1\), the V-measure is the plain harmonic mean.

Q3. **Silhouette Coefficient**:
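   - The Silhouette Coefficient measures how similar a point is to its own cluster compared to the nearest other cluster. It ranges from -1 to 1:
     - **Near 1**: The point sits well inside its cluster and far from the others.
     - **Near 0**: The point lies on or near the boundary between two clusters (overlapping clusters).
     - **Near -1**: The point is probably assigned to the wrong cluster.
   - For a single point it is calculated as:
     \[
     \text{Silhouette} = \frac{b - a}{\max(a, b)}
     \]
     where \(a\) is the mean distance to the other points in the same cluster and \(b\) is the mean distance to the points in the nearest other cluster. The score for a whole clustering is the mean over all points.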

Q4. **Davies-Bouldin Index**:
   - The Davies-Bouldin Index averages, over all clusters, the similarity between each cluster and the cluster most similar to it. It is calculated as:
     \[
     \text{DBI} = \frac{1}{n} \sum_{i=1}^n \max_{j \ne i} \frac{s_i + s_j}{d_{ij}}
     \]
     where \(n\) is the number of clusters, \(s_i\) is the average distance from the points in cluster \(i\) to its centroid, and \(d_{ij}\) is the distance between the centroids of clusters \(i\) and \(j\).
   - A lower DBI indicates better clustering quality.

Q5. **High Homogeneity, Low Completeness**:
   - Homogeneity is high when each cluster contains points from only one class. Completeness is low when the points of a class are spread across several clusters. Over-segmenting the data produces exactly this combination.
   - Example: If class A is split into two clusters that each contain only class A points, and class B has a cluster of its own, homogeneity is 1 but completeness is below 1.

Q6. **V-Measure for Optimal Number of Clusters**:
   - To determine the optimal number of clusters, calculate the V-measure for different numbers of clusters and choose the number that maximizes it. This requires ground-truth labels, so it only applies when labeled data is available.

Q7. **Advantages and Disadvantages of Silhouette Coefficient**:
   - **Advantages**:
     - Provides a single metric to evaluate clustering quality.
     - Ranges from -1 to 1, making it easy to interpret.
   - **Disadvantages**:
     - May not work well with clusters of different densities.
     - Can be sensitive to the number of clusters chosen.

Q8. **Limitations of Davies-Bouldin Index**:
   - **Limitations**:
     - Assumes clusters are compact and spherical.
     - May not perform well with clusters of different densities.
   - **Overcoming Limitations**:
     - Use in conjunction with other metrics like silhouette coefficient.

Q9. **Relationship Between Metrics**:
   - **Homogeneity, Completeness, and V-Measure**:
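     - Homogeneity and completeness are the two components of V-measure. They often pull in opposite directions: splitting clusters tends to raise homogeneity and lower completeness, while merging clusters tends to do the opposite.
     - V-measure balances the two, so it is high only when both are high.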

Q10. **Comparing Clustering Algorithms with Silhouette Coefficient**:
   - Use the Silhouette Coefficient to compare different algorithms by evaluating their clustering quality on the same dataset.
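   - Potential Issues: The score is generally higher for convex, well-separated clusters, so density-based results such as those from DBSCAN can score lower even when they fit the data well.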

Q11. **Davies-Bouldin Index and Cluster Quality**:
   - The index combines separation (distance between cluster centroids) and compactness (average distance from the points in a cluster to its centroid).
   - It assumes roughly spherical clusters of similar density.

Q12. **Silhouette Coefficient and Hierarchical Clustering**:
   - Yes, the Silhouette Coefficient can be used to evaluate hierarchical clustering. It assesses how well-separated and well-formed the clusters are, regardless of the clustering method used.
""")

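The formulas in Q1 to Q4 can be checked by hand on a toy example. The sketch below is a minimal, standard-library-only illustration and not scikit-learn's implementation. It assumes homogeneity is `1 - H(T|C)/H(T)` and completeness is `1 - H(C|T)/H(C)` with `T` the true labels and `C` the cluster labels, the `(1 + beta)` weighting for V-measure, and Euclidean distances for the silhouette and Davies-Bouldin scores. All function names and the toy data are hypothetical.

```python
from collections import Counter
from math import dist, log
from statistics import mean


def entropy(labels):
    """Shannon entropy H(labels), using the natural log."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(x, given):
    """H(x | given) for two equal-length label sequences."""
    n = len(x)
    joint = Counter(zip(given, x))
    marginal = Counter(given)
    return -sum((c / n) * log(c / marginal[g]) for (g, _), c in joint.items())


def homogeneity_completeness_v(true_labels, cluster_labels, beta=1.0):
    """Return (homogeneity, completeness, V-measure)."""
    h_t, h_c = entropy(true_labels), entropy(cluster_labels)
    # Assumed convention: a labeling with zero entropy scores 1.0.
    hom = 1.0 if h_t == 0 else 1.0 - conditional_entropy(true_labels, cluster_labels) / h_t
    com = 1.0 if h_c == 0 else 1.0 - conditional_entropy(cluster_labels, true_labels) / h_c
    v = 0.0 if hom + com == 0 else (1 + beta) * hom * com / (beta * hom + com)
    return hom, com, v


def group_by_cluster(points, labels):
    """Map each cluster label to the list of points assigned to it."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def silhouette(points, labels):
    """Mean silhouette over all points; needs at least two clusters."""
    clusters = group_by_cluster(points, labels)
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:  # singleton clusters score 0 in this sketch
            scores.append(0.0)
            continue
        # a: mean distance to the other points in the same cluster
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the points of the nearest other cluster
        b = min(mean(dist(point, q) for q in other)
                for key, other in clusters.items() if key != label)
        scores.append((b - a) / max(a, b))
    return mean(scores)


def davies_bouldin(points, labels):
    """Davies-Bouldin index with centroid-based scatter s_i."""
    clusters = group_by_cluster(points, labels)
    centroids = {k: tuple(mean(axis) for axis in zip(*pts)) for k, pts in clusters.items()}
    scatter = {k: mean(dist(p, centroids[k]) for p in pts) for k, pts in clusters.items()}
    return mean(
        max((scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
            for j in clusters if j != i)
        for i in clusters
    )


# Over-segmentation: every point in its own cluster (the Q5 scenario).
hom, com, v = homogeneity_completeness_v([0, 0, 1, 1], [0, 1, 2, 3])
print(f"homogeneity={hom:.2f} completeness={com:.2f} v_measure={v:.2f}")

# Two tight, well-separated pairs: a sensible labeling and a scrambled one.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
for name, labels in [("good", [0, 0, 1, 1]), ("bad", [0, 1, 0, 1])]:
    print(f"{name}: silhouette={silhouette(points, labels):.2f} "
          f"DBI={davies_bouldin(points, labels):.2f}")
```

On this toy data the over-segmented labeling gives homogeneity 1.00 and completeness 0.50, and the sensible labeling of the four points scores a silhouette near 0.90 with a Davies-Bouldin index of 0.10, while the scrambled labeling scores a negative silhouette and a much higher index.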