# Vejandla Asrith

**Course:** MSCS 634 – Data Mining  

**Lab Assignment:** Wine Clustering Walk‑Through  

---  

In this notebook I roll up my sleeves and get comfortable with two clustering heavy‑hitters—Hierarchical (Agglomerative) and DBSCAN—using the classic Wine dataset. The goal is to see how the algorithms behave, how parameter tweaks shift the story, and what the evaluation metrics whisper back about cluster quality.

## 1. Data Preparation & Exploration

In [None]:
# Pull the Wine dataset straight from sklearn—no hunting for CSVs today.
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

wine_bunch = load_wine()
X_raw = pd.DataFrame(wine_bunch.data, columns=wine_bunch.feature_names)
y_target = wine_bunch.target  # We will not cluster on this, but it helps later for scores

# A quick peek at the shape and first rows
display(X_raw.head())
X_raw.info()
display(X_raw.describe())

# Standardizing so every feature plays on the same field
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)
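
As a quick sanity check on the scaling step, the sketch below confirms that each standardized column ends up with a mean near 0 and a standard deviation near 1. It reloads the data so it runs on its own; the names `X_check`, `col_means`, and `col_stds` are illustrative and not part of the notebook cells.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Reload and standardize independently of the notebook variables (illustrative names)
X_check = StandardScaler().fit_transform(load_wine().data)

col_means = X_check.mean(axis=0)
col_stds = X_check.std(axis=0)  # ddof=0, the same convention StandardScaler uses

print("shape:", X_check.shape)
print("largest |column mean|:", np.abs(col_means).max())
print("column std range:", col_stds.min(), "to", col_stds.max())
```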


## 2. Hierarchical Clustering

In [None]:
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Reduce dimensionality for clean visuals
pca = PCA(n_components=2, random_state=42)
X_2d = pca.fit_transform(X_scaled)

cluster_options = [2, 3, 4, 5]
for k in cluster_options:
    model = AgglomerativeClustering(n_clusters=k)
    labels = model.fit_predict(X_scaled)

    plt.figure()
    plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels)
    plt.title(f'Agglomerative Clustering (k = {k})')
    plt.xlabel('PC1')
    plt.ylabel('PC2')
    plt.show()

# Dendrogram for a bird’s‑eye view of linkage distances
link_matrix = linkage(X_scaled, method='ward')
plt.figure(figsize=(10, 6))
dendrogram(link_matrix, truncate_mode='level', p=5)
plt.title('Wine Hierarchical Clustering Dendrogram (truncated)')
plt.xlabel('Sample index or (cluster size)')
plt.ylabel('Distance')
plt.show()
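
The scatter plots in this section show clusters that were fit on all 13 standardized features but drawn on only two principal components, so it helps to know how much variance that 2-D view retains. A minimal sketch, assuming the same scaling as above; `pca_check` and `ratios` are names introduced here for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_wine().data)
pca_check = PCA(n_components=2).fit(X_std)

# Fraction of total variance carried by each of the two plotted components
ratios = pca_check.explained_variance_ratio_
print(f"PC1: {ratios[0]:.1%}, PC2: {ratios[1]:.1%}, total: {ratios.sum():.1%}")
```

If the total is well below 100%, clusters that overlap in the plot may still be separated along the dimensions the projection drops.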


## 3. DBSCAN Clustering

In [None]:
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score, homogeneity_score, completeness_score
import numpy as np

db_params = [(0.5, 5), (0.7, 5), (0.5, 10)]
results = []

for eps, min_samples in db_params:
    db = DBSCAN(eps=eps, min_samples=min_samples)
    db_labels = db.fit_predict(X_scaled)

    # Visualize clusters (+ noise in label -1)
    plt.figure()
    plt.scatter(X_2d[:, 0], X_2d[:, 1], c=db_labels)
    plt.title(f'DBSCAN (eps={eps}, min_samples={min_samples})')
    plt.xlabel('PC1')
    plt.ylabel('PC2')
    plt.show()

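    # Calculate metrics; mind the single‑cluster or all‑noise edge cases
    unique = set(db_labels)
    cluster_count = len([lab for lab in unique if lab != -1])
    # Score silhouette on clustered points only: noise (-1) is not a real cluster
    clustered = db_labels != -1
    sil = silhouette_score(X_scaled[clustered], db_labels[clustered]) if cluster_count > 1 else np.nan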
    homo = homogeneity_score(y_target, db_labels)
    comp = completeness_score(y_target, db_labels)
    results.append((eps, min_samples, cluster_count, sil, homo, comp))

# Display metric table
results_df = pd.DataFrame(results, columns=['eps', 'min_samples', 'clusters', 'silhouette', 'homogeneity', 'completeness'])
display(results_df)


## 4. Analysis & Insights

After sampling different settings, the hierarchical approach gives its most intuitive grouping at k = 3, matching the number of wine classes in the original dataset. The dendrogram visually reinforces a three‑cluster cut: there’s a pronounced jump in linkage distance beyond that point.
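
The "jump in linkage distance" can also be read off the linkage matrix numerically. The sketch below is one way to do that under the same Ward setup; `heights`, `gaps`, and `flat` are illustrative names, and the printout is meant to be inspected alongside the dendrogram rather than treated as an automatic choice of k.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_wine().data)
Z = linkage(X_std, method="ward")
heights = Z[:, 2]  # merge distance of each step; non-decreasing for Ward linkage

# Cutting between the merge that leaves k clusters and the next merge yields
# k clusters; the width of that interval is the jump visible in the dendrogram.
gaps = {k: float(heights[-(k - 1)] - heights[-k]) for k in range(2, 7)}
for k, gap in gaps.items():
    print(f"k={k}: distance gap = {gap:.2f}")

# Cut the tree into three flat clusters and report their sizes
flat = fcluster(Z, t=3, criterion="maxclust")
print("cluster sizes at k=3:", np.bincount(flat)[1:].tolist())
```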

DBSCAN behaves as expected: with a modest `eps` of 0.5 it struggles to gather points into more than one dense region, flagging much of the space as noise. Easing the neighborhood radius to 0.7 gives us two clusters plus outliers, but silhouette drops, reflecting overlapping regions in the feature space. Raising `min_samples` to 10 tightens density requirements, again yielding sparse grouping and more noise.
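
A common heuristic for choosing `eps` is to look at each point's distance to its `min_samples`-th nearest neighbor. The sketch below computes that k-distance for the standardized data and reports what share of points would qualify as core points at the two radii tried above. It is an illustration under the same scaling assumptions, not a tuned procedure; `k_dist` and `core_share` are names introduced here.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_wine().data)
min_samples = 5

# When the training points are passed to kneighbors, each point comes back as its
# own nearest neighbor, which matches DBSCAN counting the point itself.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_std)
distances, _ = nn.kneighbors(X_std)
k_dist = np.sort(distances[:, -1])  # distance to the min_samples-th neighbor

for q in (10, 50, 90):
    print(f"{q}th percentile of k-distance: {np.percentile(k_dist, q):.2f}")

# A point is a DBSCAN core point when its k-distance is at most eps
core_share = {eps: float(np.mean(k_dist <= eps)) for eps in (0.5, 0.7)}
for eps, share in core_share.items():
    print(f"eps={eps}: {share:.0%} of points would be core points")
```

If the percentiles sit well above the `eps` values being tested, few or no points can become core points, and DBSCAN will label most of the data as noise.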

**Strengths observed**  
*Hierarchical* provides that useful dendrogram narrative, revealing potential cluster counts without guesswork. Its deterministic nature also ensures reproducibility.  
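*DBSCAN* excels at identifying arbitrarily‑shaped clusters and calling out noise, whereas agglomerative clustering with a fixed k assigns every point to some cluster. That said, DBSCAN's sensitivity to `eps` can be finicky on standardized but still high‑dimensional data: with 13 features, distances between points grow and a small radius captures few neighbors.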

**Weaknesses noted**  
Hierarchical’s Ward linkage leans toward compact, roughly spherical clusters and scales poorly to large datasets, because the cost of comparing pairs of samples grows roughly quadratically with the sample count. DBSCAN, while robust to outliers, can collapse to “everything is noise” if `eps` is a hair off, making parameter tuning part science, part art.

In this wine experiment, hierarchical wins on interpretability and stability, whereas DBSCAN’s magic is muted because the data naturally separates into fairly compact groups rather than intricate shapes.
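
To put the comparison on the same footing as the DBSCAN table, the agglomerative runs can be scored with the same three metrics. This is a sketch rather than part of the original analysis; `agglo_df` is an illustrative name, and the true labels are used only for scoring, never for fitting.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_wine
from sklearn.metrics import completeness_score, homogeneity_score, silhouette_score
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_std = StandardScaler().fit_transform(wine.data)

rows = []
for k in (2, 3, 4, 5):
    # Ward linkage is the AgglomerativeClustering default
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(X_std)
    rows.append({
        "k": k,
        "silhouette": silhouette_score(X_std, labels),
        "homogeneity": homogeneity_score(wine.target, labels),
        "completeness": completeness_score(wine.target, labels),
    })

agglo_df = pd.DataFrame(rows)
print(agglo_df.round(3))
```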