
# Clustering Analysis using K-Means and K-Medoids (Wine Dataset)

**Name:** Bhawesh Shrestha  
**Course:** MSCS 634 - Machine Learning  
**Lab:** Clustering Analysis using K-Means and K-Medoids

---

### Objectives
- Load and explore the Wine dataset from `sklearn`.
- Standardize features (z-score normalization).
- Apply **K-Means** and **K-Medoids** clustering with `k=3`.
- Evaluate with **Silhouette Score** and **Adjusted Rand Index (ARI)**.
- Visualize clusters in 2D (PCA) and compare results.


In [None]:

# If K-Medoids (scikit-learn-extra) is not installed, uncomment the line below.
# You may need internet access in your environment for this install.
# If your environment blocks installs, consult your instructor for alternatives.
# !pip -q install scikit-learn-extra


In [None]:

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, adjusted_rand_score
from sklearn.cluster import KMeans
# KMedoids lives in the optional scikit-learn-extra package, so guard the import.
try:
    from sklearn_extra.cluster import KMedoids
except ImportError:
    KMedoids = None
    print("Note: scikit-learn-extra (KMedoids) not available. Run the install cell above if needed.")
from sklearn.decomposition import PCA

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np


In [None]:

# Load dataset
wine = load_wine()
X = wine.data
y = wine.target
feature_names = wine.feature_names

# DataFrame for quick inspection
df = pd.DataFrame(X, columns=feature_names)
df['target'] = y

print("Dataset shape:", df.shape)
print("Features:", feature_names)
print("\nClass distribution (target):")
print(df['target'].value_counts().sort_index())
df.head()


In [None]:

# Standardize features (z-score)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Keep a PCA-projected 2D version for plotting
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)


In [None]:

# K-Means (k=3)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans_labels = kmeans.fit_predict(X_scaled)

kmeans_silhouette = silhouette_score(X_scaled, kmeans_labels)
kmeans_ari = adjusted_rand_score(y, kmeans_labels)

print("K-Means Results")
print(f"Silhouette Score: {kmeans_silhouette:.4f}")
print(f"Adjusted Rand Index (ARI): {kmeans_ari:.4f}")


In [None]:

# K-Medoids (k=3)
if KMedoids is None:
    print("K-Medoids is unavailable because scikit-learn-extra is not installed.")
    kmedoids_labels = None
    kmedoids_silhouette = None
    kmedoids_ari = None
else:
    kmedoids = KMedoids(n_clusters=3, random_state=42)
    kmedoids_labels = kmedoids.fit_predict(X_scaled)

    kmedoids_silhouette = silhouette_score(X_scaled, kmedoids_labels)
    kmedoids_ari = adjusted_rand_score(y, kmedoids_labels)

    print("K-Medoids Results")
    print(f"Silhouette Score: {kmedoids_silhouette:.4f}")
    print(f"Adjusted Rand Index (ARI): {kmedoids_ari:.4f}")


In [None]:

# Visualize the K-Means clusters in the 2D PCA projection (matplotlib default colors)
plt.figure(figsize=(6, 5))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=kmeans_labels, s=30)
# Transform centroids into PCA space for plotting
centroids_pca = pca.transform(kmeans.cluster_centers_)
plt.scatter(centroids_pca[:, 0], centroids_pca[:, 1], marker='X', s=200, label='Centroids')
plt.title('K-Means Clustering (PCA 2D)')
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend()
plt.show()


In [None]:

# Visualize the K-Medoids clusters in the 2D PCA projection (matplotlib default colors)
if KMedoids is not None and kmedoids_labels is not None:
    plt.figure(figsize=(6, 5))
    plt.scatter(X_pca[:, 0], X_pca[:, 1], c=kmedoids_labels, s=30)
    # Transform medoids (cluster centers in feature space) to PCA space
    medoids_pca = pca.transform(kmedoids.cluster_centers_)
    plt.scatter(medoids_pca[:, 0], medoids_pca[:, 1], marker='D', s=200, label='Medoids')
    plt.title('K-Medoids Clustering (PCA 2D)')
    plt.xlabel('PC 1')
    plt.ylabel('PC 2')
    plt.legend()
    plt.show()
else:
    print("Skipping K-Medoids plot (package not installed).")


In [None]:

# Summary table
rows = []
rows.append({"Algorithm": "K-Means",
             "Silhouette Score": kmeans_silhouette,
             "Adjusted Rand Index (ARI)": kmeans_ari})

if KMedoids is not None and kmedoids_labels is not None:
    rows.append({"Algorithm": "K-Medoids",
                 "Silhouette Score": kmedoids_silhouette,
                 "Adjusted Rand Index (ARI)": kmedoids_ari})

results_df = pd.DataFrame(rows)
results_df



## Discussion

**Which algorithm produced better-defined clusters?**  
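Compare the **Silhouette Score** and **ARI** in the summary table. The Silhouette Score (range -1 to 1) uses only the features and measures how compact and well separated the clusters are, so a higher value indicates better-defined clusters. ARI measures agreement with the true class labels, corrected for chance: 1 is a perfect match, and values near 0 are what random labeling would produce. The two metrics answer different questions and need not favor the same algorithm.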
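
To make the difference between the two metrics concrete, here is a minimal sketch on made-up toy data (the arrays `X_toy`, `true_labels`, `swapped_labels`, and `mixed_labels` are hypothetical and not part of the lab). It illustrates that ARI ignores how the cluster IDs are numbered, and that the Silhouette Score is computed without any ground-truth labels.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Hypothetical toy data: two tight, well-separated groups in 2D.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
true_labels = np.array([0, 0, 0, 1, 1, 1])

# Same partition as the truth, but with the cluster IDs swapped.
swapped_labels = np.array([1, 1, 1, 0, 0, 0])
# A poor partition that mixes points from both groups.
mixed_labels = np.array([0, 1, 0, 1, 0, 1])

# ARI compares partitions, so renaming the clusters does not change it.
ari_swapped = adjusted_rand_score(true_labels, swapped_labels)
ari_mixed = adjusted_rand_score(true_labels, mixed_labels)

# Silhouette needs only the features and the predicted labels.
sil_swapped = silhouette_score(X_toy, swapped_labels)
sil_mixed = silhouette_score(X_toy, mixed_labels)

print(f"ARI (swapped IDs): {ari_swapped:.3f}, ARI (mixed): {ari_mixed:.3f}")
print(f"Silhouette (swapped IDs): {sil_swapped:.3f}, Silhouette (mixed): {sil_mixed:.3f}")
```

This is why the cluster numbers assigned by K-Means or K-Medoids do not need to match the values in `wine.target` for ARI to be meaningful.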

**What differences do you observe in cluster shapes or positioning?**  
- K-Means: Centers are means of clusters; sensitive to outliers; works well with roughly spherical clusters.  
- K-Medoids: Centers are actual data points (medoids); more robust to outliers, so its centers and cluster boundaries can differ from those of K-Means when a cluster contains extreme points.
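
The mean-versus-medoid distinction above can be sketched on a single made-up cluster (the `points` array is hypothetical, chosen only to include one extreme value). The medoid is taken here as the point with the smallest total Euclidean distance to all other points, which is the usual definition; this is an illustration, not the internals of `KMedoids`.

```python
import numpy as np

# Hypothetical cluster: four nearby points plus one outlier.
points = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [2.0, 2.0], [50.0, 50.0]])

# Centroid as used by K-Means: the feature-wise mean, pulled toward the outlier.
centroid = points.mean(axis=0)

# Medoid: the data point minimizing the summed distance to all other points.
pairwise = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
medoid_index = int(np.argmin(pairwise.sum(axis=1)))
medoid = points[medoid_index]

print("Centroid:", centroid)
print("Medoid:  ", medoid, "(row", medoid_index, "of points)")
```

The centroid lands far from the four nearby points, while the medoid stays among them because it must be one of the observed rows.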

**When is K-Means vs. K-Medoids preferable?**  
- **K-Means**: Faster on larger datasets; good when clusters are compact and spherical with few outliers.  
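- **K-Medoids**: Preferable when robustness to outliers is important, when cluster centers should be actual observations, or when only pairwise dissimilarities (not necessarily Euclidean) are meaningful, since medoids do not require averaging features.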
