# Session 19-20 Support Vector Machine

# Exercise: Clustering Penguin Species

You are a data scientist working on an ecological study of penguins. Your task is to develop a clustering model to group penguins based on their physical characteristics.

## Dataset

Use the penguin dataset available at:

https://www.kaggle.com/datasets/youssefaboelwafa/clustering-penguins-species

The dataset includes attributes such as:
* Bill length
* Bill depth
* Flipper length
* Body mass
* (and possibly other ecological features)

# Step 0 – Import Libraries

In [None]:
# Data handling
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Clustering models
from sklearn.cluster import KMeans, DBSCAN
from sklearn.cluster import AgglomerativeClustering

# Evaluation
from sklearn.metrics import silhouette_score

# Step 1 – Load the Dataset

In [None]:
# Load the dataset (update path if needed)
df = pd.read_csv("penguins.csv")

# Display first few rows
df.head()

# Step 2 – Preprocess the Data
(Handle missing values, select features, scale data)

In [None]:
# Select numerical features only (exclude 'sex' for clustering)
features = [
    "culmen_length_mm",
    "culmen_depth_mm",
    "flipper_length_mm",
    "body_mass_g"
]

X = df[features]

# Handle missing values using mean imputation
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)

# Feature scaling is important for distance-based clustering
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)
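Before clustering, it can help to confirm that imputation and scaling behaved as expected. The sketch below is only an illustration: it uses a small hypothetical table (`toy`) in place of the penguin data and checks that no missing values remain and that each scaled column has mean close to 0 and standard deviation close to 1.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the penguin measurements
toy = pd.DataFrame({
    "flipper_length_mm": [181.0, 186.0, np.nan, 195.0, 210.0],
    "body_mass_g": [3750.0, 3800.0, 3250.0, np.nan, 4500.0],
})

# Missing values per column before imputation
print(toy.isna().sum())

# Same two preprocessing steps as in Step 2
toy_imputed = SimpleImputer(strategy="mean").fit_transform(toy)
toy_scaled = StandardScaler().fit_transform(toy_imputed)

# After scaling, every column should have mean ~0 and standard deviation ~1
col_means = toy_scaled.mean(axis=0)
col_stds = toy_scaled.std(axis=0)
print("means:", col_means.round(6))
print("stds: ", col_stds.round(6))
```

The same checks can be run on `X_imputed` and `X_scaled` from the cell above. Running `df[features].describe()` beforehand is also a quick way to spot implausible measurements, since extreme values distort both the mean used for imputation and the scaling.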

# Step 3 – Train a Clustering Model (K-Means Example)

In [None]:
# Choose number of clusters (e.g., 3 penguin species)
k = 3

# Initialize K-Means (n_init is set explicitly because its default differs between scikit-learn versions)
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)

# Fit the model
cluster_labels = kmeans.fit_predict(X_scaled)

# Add cluster labels to the dataframe
df["cluster_kmeans"] = cluster_labels

# Step 4 – Analyze the Clustering Results

## 4.1 Silhouette Score

In [None]:
# Evaluate clustering quality
sil_score = silhouette_score(X_scaled, cluster_labels)
print("Silhouette Score (K-Means):", sil_score)
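The choice `k = 3` above comes from prior knowledge about the number of species. When that knowledge is not available, one common approach is to fit K-Means for several values of `k` and compare silhouette scores. The sketch below assumes synthetic data (`X_demo`, generated with `make_blobs`) as a stand-in for `X_scaled`; the range of `k` values is an arbitrary choice for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_scaled: three well-separated groups in 4 dimensions
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0, 0, 0], [10, 10, 10, 10], [-10, 10, -10, 10]],
    cluster_std=1.0,
    random_state=42,
)
X_demo = StandardScaler().fit_transform(X_demo)

# Fit K-Means for each candidate k and record the silhouette score
scores = {}
for k_candidate in range(2, 7):
    model = KMeans(n_clusters=k_candidate, n_init=10, random_state=42)
    labels = model.fit_predict(X_demo)
    scores[k_candidate] = silhouette_score(X_demo, labels)
    print(f"k={k_candidate}: inertia={model.inertia_:.1f}, silhouette={scores[k_candidate]:.3f}")

# The k with the highest silhouette score is a reasonable candidate
best_k = max(scores, key=scores.get)
print("Best k by silhouette:", best_k)
```

To apply this to the penguins, replace `X_demo` with `X_scaled`. The silhouette score ranges from -1 to 1, and higher values indicate tighter, better-separated clusters. It is a heuristic, so the result is worth checking against the plots in 4.2.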

## 4.2 Visualize Clusters

In [None]:
# Visualize clusters using two important features
plt.figure(figsize=(8, 6))
sns.scatterplot(
    x=df["flipper_length_mm"],
    y=df["body_mass_g"],
    hue=df["cluster_kmeans"],
    palette="viridis"
)

plt.title("K-Means Clustering of Penguins")
plt.xlabel("Flipper Length (mm)")
plt.ylabel("Body Mass (g)")
plt.show()

In [None]:
# Limit both axis ranges to zoom in on the main body of the data
plt.figure(figsize=(8, 6))
sns.scatterplot(
    x=df["flipper_length_mm"],
    y=df["body_mass_g"],
    hue=df["cluster_kmeans"],
    palette="viridis"
)

plt.title("K-Means Clustering of Penguins")
plt.xlabel("Flipper Length (mm)")
plt.xlim(150, 250)
plt.ylabel("Body Mass (g)")
plt.ylim(2000, 7000)
plt.show()


# Step 5 – DBSCAN

## 5.1 Fit DBSCAN

In [None]:
# DBSCAN does not require the number of clusters; eps and min_samples set the density threshold
dbscan = DBSCAN(eps=0.9, min_samples=5)

db_labels = dbscan.fit_predict(X_scaled)

# -1 indicates noise points
df["cluster_dbscan"] = db_labels

# Count noise points
print("Number of noise points:", np.sum(db_labels == -1))
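The value `eps=0.9` above is a starting guess, and DBSCAN results are sensitive to it. A common heuristic for picking `eps` is to look at the sorted distance from each point to its `min_samples`-th nearest neighbor (counting the point itself) and choose a value near the point where that curve bends sharply upward. The sketch below assumes synthetic data (`X_demo`) in place of `X_scaled` and prints percentiles instead of plotting the curve.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_scaled
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [8, 8], [-8, 8]],
    cluster_std=1.0,
    random_state=0,
)
X_demo = StandardScaler().fit_transform(X_demo)

min_samples = 5

# Query min_samples neighbors; because the query points are the training
# points, each point's first neighbor is itself (distance 0)
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)

# Distance to the furthest of those neighbors, sorted ascending
k_distances = np.sort(distances[:, -1])

# Candidate eps values usually sit where these distances start to climb quickly
for q in (50, 90, 95, 99):
    print(f"{q}th percentile: {np.percentile(k_distances, q):.3f}")
```

Plotting `k_distances` with `plt.plot(k_distances)` makes the bend easier to see. Points whose distance lies far above the chosen `eps` are the ones DBSCAN is likely to mark as noise.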

## 5.2 Visualize DBSCAN result

In [None]:
plt.figure(figsize=(8, 6))

sns.scatterplot(
    x=df["flipper_length_mm"],
    y=df["body_mass_g"],
    hue=df["cluster_dbscan"],
    palette="tab10"
)

plt.title("DBSCAN Clustering of Penguins")
plt.xlabel("Flipper Length (mm)")
plt.ylabel("Body Mass (g)")
plt.show()

## 5.3 Highlight Noise Clearly

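Noise points (label `-1`) do not belong to any cluster, so plot them separately with a distinct marker instead of letting them share the cluster palette.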

In [None]:
plt.figure(figsize=(8, 6))

# Plot clusters (exclude noise)
sns.scatterplot(
    data=df[df["cluster_dbscan"] != -1],
    x="flipper_length_mm",
    y="body_mass_g",
    hue="cluster_dbscan",
    palette="tab10",
    legend="brief"
)

# Plot noise points separately
sns.scatterplot(
    data=df[df["cluster_dbscan"] == -1],
    x="flipper_length_mm",
    y="body_mass_g",
    color="black",
    marker="X",
    label="Noise"
)

plt.title("DBSCAN Clustering of Penguins (Noise Highlighted)")
plt.xlabel("Flipper Length (mm)")
plt.ylabel("Body Mass (g)")
plt.show()

## 5.4 Remove Noise for Cleaner Cluster View

In [None]:
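# Keep only the points DBSCAN assigned to a cluster
df_no_noise = df[df["cluster_dbscan"] != -1]

plt.figure(figsize=(8, 6))
sns.scatterplot(
    data=df_no_noise,
    x="flipper_length_mm",
    y="body_mass_g",
    hue="cluster_dbscan",
    palette="tab10"
)

plt.title("DBSCAN Clusters (Noise Removed)")
plt.xlabel("Flipper Length (mm)")
plt.ylabel("Body Mass (g)")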
plt.show()
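Step 0 also imports `AgglomerativeClustering`, which the steps above do not use. As an optional comparison, the sketch below fits hierarchical clustering with Ward linkage and scores it with the silhouette score. It assumes synthetic data (`X_demo`) in place of `X_scaled`, and `n_clusters=3` simply mirrors the K-Means setting rather than being a tuned value.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_scaled
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0, 0, 0], [10, 10, 10, 10], [-10, 10, -10, 10]],
    cluster_std=1.0,
    random_state=42,
)
X_demo = StandardScaler().fit_transform(X_demo)

# Ward linkage merges the pair of clusters that least increases within-cluster variance
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
agg_labels = agg.fit_predict(X_demo)

agg_score = silhouette_score(X_demo, agg_labels)
print("Silhouette Score (Agglomerative):", round(agg_score, 3))
```

On the penguin data, the labels could be stored as a new column (for example a hypothetical `df["cluster_agg"]`) and plotted the same way as the K-Means and DBSCAN results.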

# Step 6 – Interpretation

In [None]:
# Examine cluster centers (K-Means)
cluster_centers = scaler.inverse_transform(kmeans.cluster_centers_)
centers_df = pd.DataFrame(cluster_centers, columns=features)

centers_df
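A per-cluster summary is a useful companion to the cluster centers: it shows how many points each cluster holds and lets you cross-check the centers against the group means. The sketch below assumes synthetic data with hypothetical column names (`feature_a`, `feature_b`) in place of the penguin features.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Hypothetical column names and synthetic data standing in for the penguin features
cols = ["feature_a", "feature_b"]
raw, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [8, 8], [-8, 8]],
    cluster_std=1.0,
    random_state=0,
)
demo = pd.DataFrame(raw, columns=cols)

# Scale, cluster, and attach the labels
demo_scaler = StandardScaler()
scaled = demo_scaler.fit_transform(demo)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(scaled)
demo["cluster"] = km.labels_

# Cluster centers mapped back to the original units
centers = pd.DataFrame(demo_scaler.inverse_transform(km.cluster_centers_), columns=cols)

# Per-cluster means and sizes computed directly from the data
summary = demo.groupby("cluster")[cols].mean()
sizes = demo["cluster"].value_counts().sort_index()

print(centers)
print(summary)
print(sizes)
```

On this toy data the two tables should agree, because each K-Means center is the mean of its assigned points. On the penguin data, `df.groupby("cluster_kmeans")[features].mean()` can differ slightly from `centers_df`, since `df` still contains the original missing values while the model was fitted on imputed data.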

## Interpretation Guide:

* Larger **body mass and flipper length** → likely a larger penguin species

* Differences in **culmen length/depth** help separate species

* DBSCAN may label points in sparse regions as **noise** (label `-1`)

* K-Means assigns every point to a cluster, including outliers
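To see how the two methods relate, a contingency table of K-Means labels against DBSCAN labels shows which K-Means clusters DBSCAN splits, merges, or marks as noise. The sketch below assumes synthetic data (`X_demo`), and its `eps` value is illustrative only.

```python
import pandas as pd
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_scaled
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [8, 8], [-8, 8]],
    cluster_std=1.0,
    random_state=0,
)
X_demo = StandardScaler().fit_transform(X_demo)

km_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_demo)
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_demo)

# Rows: K-Means clusters; columns: DBSCAN clusters (-1 is noise, if any)
table = pd.crosstab(
    pd.Series(km_labels, name="kmeans"),
    pd.Series(db_labels, name="dbscan"),
)
print(table)
```

With the penguin dataframe, the equivalent call is `pd.crosstab(df["cluster_kmeans"], df["cluster_dbscan"])`. The cluster numbers themselves are arbitrary, so what matters is how the counts are distributed across each row.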