# Week 6: Laboratory Work on Clustering with K-Means and DBSCAN, Evaluated with the Silhouette Coefficient
   * Recall what K-Means and DBSCAN are and how they work. You can refer to the slides or online materials.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Learning objectives:

1.   Understand and implement K-Means clustering.
2.   Understand and implement DBSCAN clustering.
3.   Evaluate both models using the Silhouette Coefficient.
4.   Compare the results and discuss the strengths and weaknesses of each method.

# 1- Load and Preprocess Iris Dataset

In [None]:
# Import libraries: numpy, pandas, load_iris, StandardScaler


In [None]:
# Load the Iris dataset into the 'data' variable using load_iris() ("from sklearn.datasets import load_iris")


In [None]:
# Standardize the data
# Standardization ensures all features have a mean of 0 and a standard deviation of 1, which is crucial for distance-based clustering methods like K-Means and DBSCAN.
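
The effect of standardization can be checked on a small array before applying it to Iris. The sketch below is only an illustration: the `toy` array and its values are made up for this example and are not part of the lab data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0],
                [4.0, 400.0]])

# fit_transform learns each column's mean and standard deviation, then rescales
scaler = StandardScaler()
toy_scaled = scaler.fit_transform(toy)

print("Means after scaling:", toy_scaled.mean(axis=0))
print("Std devs after scaling:", toy_scaled.std(axis=0))
```

After scaling, both columns contribute equally to Euclidean distances, which is the property K-Means and DBSCAN rely on.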


In [None]:
# Display the first few rows


# 2- K-means Clustering

In [None]:
# Import KMeans and silhouette_score from sklearn - search for them online


In [None]:
# Apply K-Means clustering using KMeans() and fit_predict()


In [None]:
# Evaluate the clustering performance
# Change the number of clusters to 2, 4, and 5 and observe how the change affects the Silhouette Score.
# Discuss your observations with another student next to you.



# Helper for discussion:
*The Silhouette Score tends to peak when the number of clusters matches the natural structure in the data.*

Changing **k** affects both cohesion and separation:

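*   Too few clusters: Distinct groups are merged into one cluster, so points lie far from their cluster-mates (poor cohesion).

*   Optimal clusters: Balance of cohesion and separation.

*   Too many clusters: Natural groups are split, so neighboring clusters sit close together (poor separation).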
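
The pattern described above can be tried on synthetic data where the true number of groups is known. This is a minimal sketch, not the Iris exercise: the three blob centers, the sample size, and the names `X_demo`, `scores`, and `best_k` are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical data: three well-separated blobs (not the Iris dataset)
X_demo, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 0], [0, 10]],
                       cluster_std=1.0, random_state=0)

# Compute the Silhouette Score for several values of k
scores = {}
for k in (2, 3, 4, 5):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels_k)
    print(f"k={k}: silhouette = {scores[k]:.3f}")

# The k with the highest score
best_k = max(scores, key=scores.get)
print("Best k:", best_k)
```

On real data such as Iris the peak is often less clear-cut, which is worth keeping in mind during the discussion.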

# 3- DBSCAN Clustering





DBSCAN is a density-based clustering algorithm, and its behavior is significantly influenced by two parameters:

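*   eps (epsilon): The maximum distance between two points for them to be considered neighbors. It is not a limit on the distance between any two points in the same cluster.

*   min_samples: The minimum number of points (including the point itself) that must lie within eps of a point for it to count as a core point, i.e., part of a dense region.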

When you experiment with these parameters, the number and quality of clusters, as measured by the Silhouette Score, will change.
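
To make the two parameters concrete, the sketch below counts neighbors by hand and compares the result with the core points reported by scikit-learn. The dataset and the names `X_demo`, `db_demo`, and `manual_core` are assumptions made up for this illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Hypothetical data: three blobs (not the Iris dataset)
X_demo, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.8, random_state=0)
eps, min_samples = 0.5, 5

db_demo = DBSCAN(eps=eps, min_samples=min_samples).fit(X_demo)

# Pairwise Euclidean distances between all points
diffs = X_demo[:, None, :] - X_demo[None, :, :]
dists = np.sqrt((diffs ** 2).sum(axis=-1))

# A point is a core point if at least min_samples points (itself included)
# lie within eps of it
neighbor_counts = (dists <= eps).sum(axis=1)
manual_core = np.flatnonzero(neighbor_counts >= min_samples)

n_clusters = len(set(db_demo.labels_) - {-1})
n_noise = int((db_demo.labels_ == -1).sum())
print(f"Clusters: {n_clusters}, noise points: {n_noise}")
print("Manual core points match scikit-learn:",
      np.array_equal(manual_core, db_demo.core_sample_indices_))
```

Increasing `eps` or lowering `min_samples` in this sketch makes more points qualify as core points, which is the mechanism behind the changes you will observe in the experiments.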

In [None]:
# Import DBSCAN from sklearn - search online for this


In [None]:
# Apply DBSCAN clustering with default parameters (eps=0.5, min_samples=5). Use DBSCAN() and fit_predict() functions.


In [None]:
# Filter out noise points (label -1) and store the remaining labels and data in 'filtered_labels' and 'filtered_data' for the Silhouette Score calculation
filtered_labels =
filtered_data =

In [None]:
# Evaluate DBSCAN clustering
# The Silhouette Score requires at least 2 clusters, so 'filtered_labels' must contain more than 1 distinct label.
# Experiment with different eps values, e.g., 0.3, 0.7 and min_samples, e.g., 3, 10 to observe how the clusters and Silhouette Scores change.
if len(set(filtered_labels)) > 1:  # Silhouette Score requires at least 2 clusters

    #### Your one-line code goes HERE

    print(f"DBSCAN Silhouette Score: {dbscan_silhouette}")
else:
    print("DBSCAN could not form enough clusters for Silhouette evaluation.")

## 3- Visualise K-means and DBSCAN Clusters
*   Visualize the clusters formed by K-Means and DBSCAN using the first two features (*sepal length* and *sepal width*).

In [None]:
# Import matplotlib.pyplot as plt


In [None]:
# K-Means Visualization
# Use Scatterplot

#### Write your code HERE

#---------------------------------------------
# DBSCAN Visualization
# Use Scatterplot

#### Write your code HERE

plt.show()

# 4- Discussions
  1. Compare the Silhouette Scores of K-Means and DBSCAN. Which algorithm performed better and why?
      * Like-for-like comparison: The DBSCAN score is computed after noise points are removed, whereas the K-Means score covers every point. Keep this in mind when comparing the two values.
      * Feature scaling: Standardization gives every feature equal weight, including less informative ones. If this increases the overlap between groups, K-Means produces less compact, less separated clusters and a lower Silhouette Score.
      * Cluster overlap: K-Means always splits the data into exactly k clusters, even where groups overlap. DBSCAN keeps densely connected points together, so overlapping groups may end up in a single cluster that is well separated from the rest.
      * Noise handling: DBSCAN can label ambiguous points as noise, while K-Means forces every point into a cluster, which can distort its clusters.
      * Assumptions: DBSCAN does not assume roughly spherical clusters, which can help on datasets like Iris.

  2. Discuss the strengths and weaknesses of both algorithms.
      * K-Means is a simple, efficient, and scalable algorithm, but it assumes spherical clusters and struggles with noise.
      * DBSCAN excels in identifying arbitrary-shaped clusters and handling noise, but it requires careful parameter tuning and may struggle with varying cluster densities.
  3. Discuss scenarios where one algorithm might outperform the other.
    * Scenarios Where K-Means Outperforms DBSCAN:
      * Clusters are spherical, well-separated, and of similar size.
      * Large datasets with no significant noise or outliers.
      * The number of clusters is known in advance.
    * Scenarios Where DBSCAN Outperforms K-Means:
      * Clusters are irregularly shaped or vary in size.
      * The dataset has significant noise or outliers.
      * The number of clusters is unknown.
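
One of the scenarios above, irregularly shaped clusters, can be illustrated with two interleaving half-moons. This is a rough sketch under stated assumptions: the dataset comes from `make_moons`, the DBSCAN setting `eps=0.2` was picked for this synthetic data only, and `adjusted_rand_score` (which compares predicted labels with the known true labels) is used here purely as a convenience because the ground truth is available; it is not part of the lab tasks.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Hypothetical data: two crescent-shaped clusters with a little noise
X_moons, y_moons = make_moons(n_samples=500, noise=0.05, random_state=0)

# K-Means is told the correct number of clusters
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)

# DBSCAN parameters chosen by hand for this dataset (an assumption)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_moons)

# Agreement with the true moon membership (1.0 means a perfect match)
ari_kmeans = adjusted_rand_score(y_moons, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_moons, dbscan_labels)
print(f"K-Means ARI: {ari_kmeans:.3f}")
print(f"DBSCAN ARI:  {ari_dbscan:.3f}")
```

K-Means separates the points with a straight boundary, so it cuts across both crescents, while DBSCAN follows the dense curve of each one.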

# 5 - Deep understanding of parameters in DBSCAN
## Task-1: Run the following code cell and observe the output with the initial values eps_value=1 and min_samples_value=10.

## Task-2: Change eps_value from 1 to 5 while keeping min_samples_value=10. Observe how the results change and discuss.

## Task-3: Keep eps_value=1 and change min_samples_value from 10 to 100. Observe how the results change and discuss.

## Task-4: Keep eps_value=1 and set min_samples_value to 60, 70, and 80 in turn. Observe how the results change. What is the largest value at which clusters still form, rather than every point being treated as noise?

## Task-5: Keep eps_value=1 and change min_samples_value to 2. Observe how the results change and discuss.
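
When working through the tasks, it can help to put numbers next to the plots. The sketch below is an optional aid, not part of the original cell: `summarize_dbscan` is a hypothetical helper, and the dataset is regenerated with the same `make_blobs` settings as the cell that follows so that the block runs on its own.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs


def summarize_dbscan(X, eps, min_samples):
    """Return (number of clusters, number of noise points) for one DBSCAN run."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    n_clusters = len(set(labels) - {-1})  # -1 marks noise, not a cluster
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise


# Same synthetic dataset as in the lab cell
centers = [[2, 2], [8, 8], [5, 2], [2, 8]]
X_lab, _ = make_blobs(n_samples=1000, centers=centers, cluster_std=1.2, random_state=42)

# (eps, min_samples) pairs taken from the tasks
settings = [(1, 10), (5, 10), (1, 100), (1, 2)]
results = {}
for eps_value, min_samples_value in settings:
    results[(eps_value, min_samples_value)] = summarize_dbscan(X_lab, eps_value, min_samples_value)
    n_clusters, n_noise = results[(eps_value, min_samples_value)]
    print(f"eps={eps_value}, min_samples={min_samples_value}: "
          f"{n_clusters} clusters, {n_noise} noise points")
```

The counts complement the plot: the figure shows where clusters and noise sit, while the summary makes it easier to compare settings side by side.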


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Generate synthetic dataset
n_samples = 1000  # More data points for better visualization
centers = [[2, 2], [8, 8], [5, 2], [2, 8]]  # Cluster centers
X, _ = make_blobs(n_samples=n_samples, centers=centers, cluster_std=1.2, random_state=42)

# Apply DBSCAN clustering
eps_value = 1  # Defines the max distance for neighbors
min_samples_value = 10  # Minimum points required to form a dense region
db = DBSCAN(eps=eps_value, min_samples=min_samples_value)
labels = db.fit_predict(X)

# Identify core, border, and noise points
core_samples_mask = np.zeros_like(labels, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True  # Mark core points
unique_labels = set(labels)

# Define colors and markers for visualization
color_map = {0: "blue", 1: "green", 2: "purple", 3: "orange"}
marker_map = {0: "o", 1: "s", 2: "^", 3: "P"}

plt.figure(figsize=(10, 6))

for label in sorted(unique_labels):
    # Select points with this label
    class_member_mask = labels == label

    if label == -1:
        # Noise points (outliers) are neither core nor border points, so plot them once
        xy = X[class_member_mask]
        plt.scatter(xy[:, 0], xy[:, 1], s=50, c="red", marker="x", label="Noise")
        continue

    # Assign colors and markers to clusters, falling back to gray diamonds for extra clusters
    color = color_map.get(label, "gray")
    marker = marker_map.get(label, "D")
    label_name = f"Cluster {label}"

    # Core points (big solid markers)
    xy = X[class_member_mask & core_samples_mask]
    plt.scatter(xy[:, 0], xy[:, 1], s=80, c=color, marker=marker, label=f"{label_name} (Core)", edgecolors='k')

    # Border points (smaller semi-transparent markers); skip the legend entry if there are none
    xy = X[class_member_mask & ~core_samples_mask]
    if len(xy) > 0:
        plt.scatter(xy[:, 0], xy[:, 1], s=50, c=color, marker=marker, label=f"{label_name} (Border)", alpha=0.6, edgecolors='r')

# Add a title showing the DBSCAN parameters, axis labels, and a legend
plt.title(f"DBSCAN Clustering (eps={eps_value}, min_samples={min_samples_value})")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend(loc="upper right", fontsize=10)
plt.show()
