# Clustering with Machine Learning

This notebook provides an introduction to clustering techniques in machine learning. 
We will explore **KMeans**, **DBSCAN**, and **Hierarchical Clustering** algorithms to partition datasets into homogeneous groups. 

Clustering is essential for various applications, from market segmentation to data analysis. Let's dive in!


In [37]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

##### Unzip the dataset if needed

In [38]:
import zipfile
with zipfile.ZipFile('dataset/dataset_clustering.zip', 'r') as zip_ref:
    zip_ref.extractall('dataset/')

### Data Loading and Exploration 

In [None]:
?

In [None]:
# Data shape
?

#### Visualize data using scatter plot


In [None]:
?

## *Clustering Techniques*

## **KMeans**

See the [KMeans documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans).

Write a function that calls the `KMeans` constructor and its `fit` method, then returns the fitted model.

In [42]:
from sklearn.cluster import KMeans 

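# Parameters inferred from the calls further down; the default values are assumptions
def build_kmeans(n_clusters, data, init="k-means++", n_init=10, max_iter=300):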
  ?
  return kmeans
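One possible shape for this helper is sketched below. It is not the reference solution: the name `build_kmeans_sketch`, the parameter names and the default values are assumptions, and the data come from `make_blobs` rather than from the notebook's dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def build_kmeans_sketch(n_clusters, data, init="k-means++", n_init=10, max_iter=300):
    """Construct a KMeans model and fit it on `data` (hypothetical helper)."""
    kmeans = KMeans(
        n_clusters=n_clusters,
        init=init,
        n_init=n_init,
        max_iter=max_iter,
        random_state=0,  # fixed seed so repeated runs give the same centers
    )
    kmeans.fit(data)
    return kmeans


# Synthetic stand-in for the notebook's data (assumption)
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

model = build_kmeans_sketch(4, X_demo)
print(model.cluster_centers_.shape)  # (4, 2): one 2-D center per cluster
```

Returning the fitted estimator keeps `cluster_centers_`, `labels_` and `inertia_` available to the cells that follow.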

Convert the DataFrame to a NumPy array and delete one column to obtain 2-dimensional data
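The conversion could look like the sketch below. The three-column frame `demo_df` is made up for illustration, and dropping column 0 is an assumption; which column to remove depends on the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical three-column frame; the real dataset's columns may differ
demo_df = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0], "c": [7.0, 8.0, 9.0]}
)

demo_array = demo_df.to_numpy()             # shape (3, 3)
demo_2d = np.delete(demo_array, 0, axis=1)  # drop column 0, shape (3, 2)

print(demo_2d.shape)
```

`np.delete` returns a new array and leaves `demo_array` unchanged.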


In [None]:
?

Create 4 clusters from data using the previously defined function

In [None]:
kmeans4 = build_kmeans(4, my_dummy_data)

# Print the centers of the clusters
kmeans4.cluster_centers_

In [None]:
# Visualize the data with the final centroids

plt.scatter(my_dummy_data[:, 0], my_dummy_data[:,1])
plt.scatter(kmeans4.cluster_centers_[:,0], kmeans4.cluster_centers_[:,1], s= 250, marker="*", 
            c="yellow", edgecolors="black")

plt.show()

In [46]:
# Create 3 clusters from data using the previously defined function
kmeans3 = build_kmeans(3, my_dummy_data)

In [None]:
# Visualize the data with the final centroids for both the models
plt.scatter(my_dummy_data[:, 0], my_dummy_data[:,1])
plt.scatter(kmeans3.cluster_centers_[:,0], kmeans3.cluster_centers_[:,1], s= 250, marker="*", 
            c="yellow", edgecolors="black", label='3 cluster')
plt.scatter(kmeans4.cluster_centers_[:,0], kmeans4.cluster_centers_[:,1], s= 10, marker="o", 
            c="red", edgecolors="black", label='4 cluster')

plt.legend()
plt.show()

In [None]:
# The labels_ attribute holds the cluster label assigned to each point
?

### **Selecting the optimal number of clusters**

#### Elbow Method for Determining the Optimal Number of Clusters in K-means

The elbow method for determining the optimal number of clusters in K-means involves:

1. Computing the Within-Cluster Sum of Squares (WCSS) for different values of k (the number of clusters).
2. Plotting k against WCSS.
3. Identifying the "elbow" point in the plot, where the rate of decrease of WCSS slows down significantly.
4. Selecting the number of clusters corresponding to this elbow point as the optimal number of clusters for the dataset.

- Note that the `inertia_` attribute of a fitted `KMeans` model is the WCSS
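The four steps above can be sketched as follows. The variable names `number_of_cluster` and `error` mirror the cell below, while the synthetic data and the range of k values are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the notebook's data (assumption)
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

number_of_cluster = 11  # evaluate k = 1 .. 10
error = []              # WCSS (inertia_) for each k

for k in range(1, number_of_cluster):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    error.append(model.inertia_)

# Plotting range(1, number_of_cluster) against error shows the elbow;
# here the values are only printed
for k, wcss in zip(range(1, number_of_cluster), error):
    print(k, round(wcss, 1))
```

The elbow is read by eye, so it is a heuristic rather than an exact criterion.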


In [None]:
?
error = [] # array to collect the measure values
?

plt.plot(range(1,number_of_cluster),error )
plt.show()

In [None]:
error[-1]

#### Selecting the number of clusters with silhouette analysis on KMeans clustering
Silhouette analysis can be used to study the separation distance between the resulting clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters and thus provides a way to assess parameters like number of clusters visually. This measure has a range of `[-1, 1]`.

Silhouette coefficients (as these values are called) near `+1` indicate that the sample is far away from the neighboring clusters. A value of `0` indicates that the sample is on or very close to the decision boundary between two neighboring clusters, and negative values indicate that the sample might have been assigned to the wrong cluster.

The thickness of each band in the silhouette plot also shows the size of the corresponding cluster.

[see here](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html#sphx-glr-auto-examples-cluster-plot-kmeans-silhouette-analysis-py)
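For a single sample, the silhouette coefficient is `s = (b - a) / max(a, b)`, where `a` is the mean distance to the other points in its own cluster and `b` is the mean distance to the points of the nearest other cluster. The toy example below (four made-up 1-D points, not the notebook's data) computes it by hand and compares the result with `silhouette_samples`.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Four 1-D points in two obvious clusters (toy data)
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_toy = np.array([0, 0, 1, 1])

# Silhouette of the first point, computed by hand
a = abs(0.0 - 1.0)                               # mean distance within its own cluster
b = np.mean([abs(0.0 - 10.0), abs(0.0 - 11.0)])  # mean distance to the other cluster
s_manual = (b - a) / max(a, b)

s_sklearn = silhouette_samples(X_toy, labels_toy)[0]
print(s_manual, s_sklearn)
```

Here `a = 1` and `b = 10.5`, so the coefficient is `9.5 / 10.5`, close to `+1` as expected for a well-separated point.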

In [None]:
import matplotlib.cm as cm

from sklearn.metrics import silhouette_samples, silhouette_score

X = my_dummy_data.copy()

range_n_clusters = [2, 3, 4, 5, 6]

for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters
    clusterer = build_kmeans(n_clusters, X)
    cluster_labels = clusterer.predict(X)


    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print(
        "For n_clusters =",
        n_clusters,
        "The average silhouette_score is :",
        silhouette_avg,
    )

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(
            np.arange(y_lower, y_upper),
            0,
            ith_cluster_silhouette_values,
            facecolor=color,
            edgecolor=color,
            alpha=0.7,
        )

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(
        X[:, 0], X[:, 1], marker=".", s=30, lw=0, alpha=0.7, c=colors, edgecolor="k"
    )

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(
        centers[:, 0],
        centers[:, 1],
        marker="o",
        c="white",
        alpha=1,
        s=200,
        edgecolor="k",
    )

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker="$%d$" % i, alpha=1, s=50, edgecolor="k")

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(
        "Silhouette analysis for KMeans clustering on sample data with n_clusters = %d"
        % n_clusters,
        fontsize=14,
        fontweight="bold",
    )

plt.show()

#### `Random` vs `k-means++` initialization

- `k-means++` : selects initial cluster centroids using sampling based on an empirical probability distribution of the points’ contribution to the overall inertia. This technique **speeds up convergence**. The algorithm implemented is “greedy k-means++”. It differs from the vanilla k-means++ by making several trials at each sampling step and choosing the best centroid among them.
- `random`: chooses `n_clusters` observations (rows) at random from the data as the initial centroids.
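A quick way to compare the two strategies is to run each with a single initialization and inspect `n_iter_` and `inertia_`. The sketch below uses synthetic data (an assumption), and the outcome depends on the data and the seed, so it illustrates the comparison rather than proving that one strategy always wins.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the notebook's data (assumption)
X_demo, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=0)

results = {}
for init in ("k-means++", "random"):
    # n_init=1 compares a single initialization, not the best of several
    model = KMeans(n_clusters=4, init=init, n_init=1, random_state=0).fit(X_demo)
    results[init] = (model.n_iter_, model.inertia_)
    print(init, "iterations:", model.n_iter_, "inertia:", round(model.inertia_, 1))
```

Repeating the loop over several values of `random_state` gives a fairer picture than a single run.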

In [None]:
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(10,6))
ax1.set_title("k means++ 4")
ax2.set_title("k means++ 3")

ax1.scatter(my_dummy_data[:,0], my_dummy_data[:,1], c = kmeans4.labels_, cmap ="brg" )
ax1.scatter(kmeans4.cluster_centers_[:,0], kmeans4.cluster_centers_[:,1], s = 250, marker = "*", c="yellow", edgecolors="black")

ax2.scatter(my_dummy_data[:,0], my_dummy_data[:,1], c = kmeans3.labels_, cmap ="brg" )
ax2.scatter(kmeans3.cluster_centers_[:,0], kmeans3.cluster_centers_[:,1], s = 80, marker = "o", c="yellow", edgecolors="black")

##################################################################################

kmeans4_rand = build_kmeans(4, my_dummy_data, "random", 1, 2)
kmeans3_rand = build_kmeans(3, my_dummy_data, "random", 1, 2)

f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(10,6))
ax1.set_title("k means 4 random_init")
ax2.set_title("k means 3 random_init")

ax1.scatter(my_dummy_data[:,0], my_dummy_data[:,1], c = kmeans4_rand.labels_, cmap ="brg" )
ax1.scatter(kmeans4_rand.cluster_centers_[:,0], kmeans4_rand.cluster_centers_[:,1], s = 250, marker = "*", c="yellow", edgecolors="black")

ax2.scatter(my_dummy_data[:,0], my_dummy_data[:,1], c = kmeans3_rand.labels_, cmap ="brg" )
ax2.scatter(kmeans3_rand.cluster_centers_[:,0], kmeans3_rand.cluster_centers_[:,1], s = 80, marker = "o", c="yellow", edgecolors="black")
plt.show()

#### Re-run the code with `3D data`

In [None]:
?

In [None]:
?
error = [] # array to collect the measure values


plt.plot(range(1,number_of_cluster),error )
plt.show()

In [55]:
kmeans4 = build_kmeans(4, my_dummy_data)

In [None]:
fig = plt.figure(figsize = (10,10))
ax = plt.axes(projection='3d')
ax.grid()

ax.scatter(my_dummy_data[:, 0], my_dummy_data[:, 1], my_dummy_data[:, 2], c = 'b', s = 1)
ax.scatter(kmeans4.cluster_centers_[:,0], kmeans4.cluster_centers_[:,1], kmeans4.cluster_centers_[:,2], s= 10000, marker="*", 
            c="yellow", edgecolors="black")
ax.set_title('3D Scatter Plot')

# Set axes label
ax.set_xlabel('x', labelpad=20)
ax.set_ylabel('y', labelpad=20)
ax.set_zlabel('z', labelpad=20)

plt.show()

## **DBSCAN**

To tune the DBSCAN parameters **Eps** and **MinPts**, we can use the elbow method: fix a candidate MinPts, plot the sorted distance from each point to its `k`-th nearest neighbor (with `k` = MinPts), and read Eps off the elbow of the curve.

The class [`NearestNeighbors`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html#sklearn.neighbors.NearestNeighbors) provides the methods needed for this.

In particular, we can call the `kneighbors` method on the fitted model to find the **k nearest neighbors** of each point.

Look at DBSCAN [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN)
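A minimal sketch of the k-distance computation, on synthetic data (an assumption), is given below. When `kneighbors` is called on the training data itself, each point is returned as its own first neighbor, so the last column is taken as the distance to the `k`-th neighbor.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for the notebook's data (assumption)
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

k = 8  # candidate MinPts
nn = NearestNeighbors(n_neighbors=k).fit(X_demo)
distances, indices = nn.kneighbors(X_demo)

# Column 0 is each point's distance to itself (0); the last column is the
# distance to the k-th neighbor when the point itself counts as the first
k_distances = np.sort(distances[:, -1])

print(k_distances[:3], k_distances[-3:])
```

Plotting `k_distances` gives the curve whose elbow suggests a value for Eps.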

In [None]:
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

my_dummy_data = dummy_dataset.to_numpy()
print(my_dummy_data[:5],'\n')

my_dummy_data = np.delete(my_dummy_data, 0, axis=1)
print(my_dummy_data[:5])

In [None]:
neighbors = NearestNeighbors(n_neighbors=8) # set n_neighbors to the MinPts you want to analyze (try 8 for good results)
neighbors_fit = neighbors.fit(my_dummy_data)
distances, indices = neighbors_fit.kneighbors(my_dummy_data)

# Sort the distance to the k-th neighbor (last column) in ascending order;
# column 0 is each point's distance to itself (0)
distances = np.sort(distances[:, -1])
plt.plot(distances)
plt.show()

In [89]:
# Call DBSCAN and fit a model on the same dataset

?

Analyze the assigned labels in order to retrieve the number of clusters and noise points
- `-1` is the label assigned to noise points
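The sketch below shows how the labels can be read after fitting. The data, `eps=0.5` and `min_samples=5` are illustrative assumptions, not the values intended for the notebook's dataset; one isolated point is appended so that at least one noise label appears.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Two tight blobs plus one isolated point (synthetic data, an assumption)
X_demo, _ = make_blobs(
    n_samples=200, centers=[[0, 0], [5, 5]], cluster_std=0.3, random_state=0
)
X_demo = np.vstack([X_demo, [[50.0, 50.0]]])

dbscan_demo = DBSCAN(eps=0.5, min_samples=5).fit(X_demo)
labels = dbscan_demo.labels_

# Noise is labelled -1 and is not counted as a cluster
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

print("clusters:", n_clusters, "noise points:", n_noise)
```

The isolated point has no neighbors within `eps`, so it cannot be a core point or a border point and is labelled `-1`.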

In [None]:
labels=dbscan.labels_
print("type(labels)=",type(labels))
print(labels)

nl=np.unique(labels)
print("Assigned labels: ",nl)

n_clusters = len(np.unique(labels)) - (1 if -1 in labels else 0)

n_noise = list(labels).count(-1)

print("Estimated number of clusters:", n_clusters)
print("Estimated number of noise points:",n_noise, "representing ",n_noise*100/dummy_dataset.shape[0],"%")

In [None]:
# Visualize clusters using different colors for different labels
plt.scatter(my_dummy_data[:, 0], my_dummy_data[:,1], c = dbscan.labels_, cmap ="brg")
plt.show()

In [None]:
# Read dummy_dataset.csv file and visualize the first records
dummy_dataset_bis = pd.read_csv("dataset/dummy_dataset.csv", sep = ';')
print(dummy_dataset_bis.head())
print(dummy_dataset_bis.shape)

my_dummy_data_bis = dummy_dataset_bis.to_numpy()
print(my_dummy_data_bis[:5])

In [None]:
#  Visualize the data (2D scatter plot)
plt.scatter(my_dummy_data_bis[:,0],my_dummy_data_bis[:,1])
plt.show()

In [94]:
# Apply kmeans with k=5 
?

In [None]:
# Visualize centers and colored clusters
plt.scatter(my_dummy_data_bis[:,0],my_dummy_data_bis[:,1], c = kmeans_bis.labels_)
plt.scatter(kmeans_bis.cluster_centers_[:,0], kmeans_bis.cluster_centers_[:,1], marker = "*", c = "yellow", s=150, edgecolors="black")
plt.show()

In [None]:
# Use the elbow method defined above to plot, for each point, the k-th neighbor distance 
# n_neighbors=3 is a good value for this dataset

neighbors = NearestNeighbors(n_neighbors=3)
neighbors_fit = neighbors.fit(my_dummy_data_bis)
distances, indices = neighbors_fit.kneighbors(my_dummy_data_bis)

# Sort the distance to the k-th neighbor (last column) in ascending order;
# column 0 is each point's distance to itself (0)
distances = np.sort(distances[:, -1])
plt.plot(distances)
plt.show()

In [96]:
# Apply DBSCAN (eps=0.4 works well)
?

In [None]:
# visualize the colored clusters
plt.scatter(my_dummy_data_bis[:,0],my_dummy_data_bis[:,1], c = dbscan_bis.labels_, cmap = "plasma")
plt.show()

## **Agglomerative Clustering**

Look at the Agglomerative Clustering [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering)

In [None]:
from sklearn.cluster import AgglomerativeClustering

?

To visualize the dendrogram we can use the SciPy library. In this case the clustering itself is also performed by SciPy's methods.
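As a rough illustration of how the two libraries relate, the sketch below builds an average-linkage tree with SciPy, cuts it into flat clusters with `fcluster`, and fits scikit-learn's `AgglomerativeClustering` with the same linkage. The synthetic data and the choice of three clusters are assumptions.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Three well-separated blobs (synthetic data, an assumption)
X_demo, _ = make_blobs(
    n_samples=150, centers=[[0, 0], [6, 6], [0, 6]], cluster_std=0.5, random_state=0
)

# SciPy: build the merge tree, then cut it into at most 3 flat clusters
Z = sch.linkage(X_demo, method="average")
labels_scipy = sch.fcluster(Z, t=3, criterion="maxclust")  # labels start at 1

# scikit-learn: same linkage, number of clusters fixed up front
labels_sklearn = AgglomerativeClustering(
    n_clusters=3, linkage="average"
).fit_predict(X_demo)

print(np.unique(labels_scipy), np.unique(labels_sklearn))
```

The linkage matrix `Z` is the same object that `sch.dendrogram` takes as input, so a horizontal cut of the dendrogram corresponds to a call to `fcluster`.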

In [None]:
import scipy.cluster.hierarchy as sch

out_linkage=sch.linkage(my_dummy_data, method = "average")
d = sch.dendrogram (out_linkage)
plt.title("Dendrogram")
plt.xlabel("Samples")
plt.ylabel("Euclidean distance")
plt.axhline(y = 4.5, color = "r", linestyle = "-")
plt.axhline(y = 3.5, color = "black", linestyle = "-")
plt.axhline(y = 2.5, color = "yellow", linestyle = "-")
plt.show()


In [None]:
?

plt.scatter(my_dummy_data_bis[:,0],my_dummy_data_bis[:,1], c = y_agglo_bis,cmap = "plasma")
plt.show()