A Comparison of Clustering Algorithms (K-means, MeanShift, DBSCAN) in Python

This article compares three clustering algorithms from scikit-learn, Python's machine learning library. You'll see how each algorithm is used and how it can be tuned to your needs.
The dataset used here can be found at Kaggle. More details about each module can be found on the scikit-learn website.

Definition of Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). (Wikipedia)
Clustering is basically grouping observations based on given characteristics.
Normally, clustering gives better results when combined with principal components, but for the sake of this comparison things were kept simple: the algorithms here run directly on the columns of the dataset.
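The principal-components idea mentioned above can be sketched as follows. This is a minimal illustration on synthetic data (the mall dataset isn't loaded here, and the blob parameters are made up for the example):

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic stand-in: 200 observations, 4 features, 5 underlying groups
X, _ = make_blobs(n_samples=200, centers=5, n_features=4, random_state=42)

X_scaled = StandardScaler().fit_transform(X)              # PCA is scale-sensitive
components = PCA(n_components=2).fit_transform(X_scaled)  # keep 2 components
labels = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(components)
```

Reducing to two components first often denoises the input and makes the clusters easier to separate and to plot.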

Dataset

The dataset used here is a customer dataset from a mall. It includes customer profile information and a previously computed spending score, which makes it well suited for a clustering analysis.

Analysis

Let's start by importing the modules and loading the dataset into Python.

import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('C:/Users/Emir/Desktop/Mall Customers.csv')
df.head()

CustomerID Gender Age Annual Income (k$) Spending Score (1-100)
0 1 Male 19 15 39
1 2 Male 21 15 81
2 3 Female 20 16 6
3 4 Female 23 16 77
4 5 Female 31 17 40

A quick look at pairwise scatter plots gives us what we need: spending score and annual income make a natural pair for clustering.
df.plot.scatter(x='Spending Score (1-100)', y='Annual Income (k$)',  figsize=(15,7))

We can already see the approximate groups in this plot:

Let's see how scikit-learn's clustering algorithms will group these observations.

K-means

The K-means algorithm requires the number of clusters to be specified beforehand.
First we load the K-means module, then we create a dataframe that contains only the two variables we selected.
from sklearn.cluster import KMeans
x = df.filter(['Annual Income (k$)','Spending Score (1-100)'])
Because the plot clearly shows five groups, we will ask K-means for exactly 5 clusters.
kmeans = KMeans(n_clusters=5)
Now we can fit our data into the model.
clusters = kmeans.fit(x)
We can get the results by typing ".labels_" after the name of the model.
In[7]:  clusters.labels_
Out[7]: array([3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2,
        3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 0,
        3, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 4, 1, 0, 1, 4, 1, 4, 1,
        0, 1, 4, 1, 4, 1, 4, 1, 4, 1, 0, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1,
        4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1,
        4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1,
        4, 1])
The next step is turning this array into a dataframe and attaching it to our original one. We also rename the column that holds the cluster numbers.
ClusterDataset = pd.DataFrame(data=clusters.labels_)
dfClustered = pd.concat([df, ClusterDataset], axis=1)
dfClustered.rename(columns={0:'Cluster'}, inplace=True)
CustomerID Gender Age Annual Income (k$) Spending Score (1-100) Cluster
0 1 Male 19 15 39 3
1 2 Male 21 15 81 2
2 3 Female 20 16 6 3
3 4 Female 23 16 77 2
4 5 Female 31 17 40 3
Time to get our first results. "c=" colors the points according to the 'Cluster' column, and "cmap=" selects the color scheme.
dfClustered.plot.scatter(x='Spending Score (1-100)', y='Annual Income (k$)', c='Cluster', cmap="gist_rainbow", figsize=(15,7))

It looks pretty good. One could argue that three observations belong to the purple cluster rather than the red one, and that another observation fits the green cluster better than the red.
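Here k=5 was read off the scatter plot. When the cluster count isn't visually obvious, the "elbow" of the K-means inertia curve is a common heuristic. A sketch on synthetic data (the blob parameters are made up, since the mall CSV isn't bundled here):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=200, centers=5, cluster_std=1.0, random_state=0)

# Inertia (within-cluster sum of squares) for a range of candidate k values
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 9)}
# Inertia always decreases as k grows; the "elbow" is where the drop flattens,
# which on data like this tends to sit near the true number of groups.
```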
Let's look at how other algorithms would do the job.

DBSCAN

DBSCAN creates clusters in a different way than K-means. "min_samples=" sets the minimum number of observations required to form a dense region (and thus a cluster), and "eps=" is the maximum distance between two observations for them to count as neighbors. This approach allows a more flexible clustering operation, shifting control from the algorithm to the analyst. Based on these two inputs, DBSCAN can also identify some observations as outliers and leave them out of every cluster; those observations are labeled with a cluster number of -1.

We will use the same code as above, changing only these 2 lines.
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=12, min_samples=10)
Remember to adjust the epsilon (eps) and minimum cluster size to your own needs, rerunning the code with different values until you get a good result. eps accepts floating-point values, so decimals are allowed.
After some trial and error, eps=12 with min_samples=10 gave a reasonable result.
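Spelled out, the full DBSCAN version of the pipeline looks like this. Since the mall CSV path above is local to the author's machine, this sketch substitutes synthetic data with made-up centers on a roughly similar scale:

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

# Synthetic stand-in for the two mall columns (made-up centers, similar scale)
data, _ = make_blobs(n_samples=200,
                     centers=[[20, 20], [20, 80], [60, 50], [100, 20], [100, 80]],
                     cluster_std=8.0, random_state=1)
x = pd.DataFrame(data, columns=['Annual Income (k$)', 'Spending Score (1-100)'])

clusters = DBSCAN(eps=12, min_samples=10).fit(x)
labels = pd.Series(clusters.labels_, name='Cluster')
# -1 marks outliers that DBSCAN refused to place in any cluster
print(labels.value_counts().sort_index())
```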

dfClustered.groupby('Cluster').size()
Cluster
-1    28
 0    16
 1    10
 2    92
 3    31
 4    23
dtype: int64
As you can see, the smallest group indeed has only 10 observations, and 28 observations were not assigned to any group.

MeanShift

MeanShift finds centroids by iteratively shifting candidate points toward the mean of the observations around them, then assigns each observation to its nearest centroid. By default it labels all observations, though this behavior can be changed.
The bandwidth parameter sets the size of the region used to calculate each mean, and therefore how many observations contribute to it. You should tune the bandwidth for the algorithm to behave as you want. You might find scikit-learn's bandwidth estimator useful, or read up on dynamically weighted bandwidths.
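scikit-learn's estimate_bandwidth helper mentioned above can supply a data-driven starting value. A sketch on synthetic data (the quantile value and blob parameters are assumptions to tune):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import MeanShift, estimate_bandwidth

# Synthetic stand-in on a scale similar to the income/spending columns
X, _ = make_blobs(n_samples=200, centers=5, cluster_std=8.0,
                  center_box=(10, 130), random_state=3)

bandwidth = estimate_bandwidth(X, quantile=0.2)  # data-driven starting point
clusters = MeanShift(bandwidth=bandwidth).fit(X)
```

A smaller quantile gives a smaller bandwidth and therefore more, tighter clusters; treat the estimate as a starting point for the trial-and-error described above.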
Again, we'll only adjust two lines of our first code.
from sklearn.cluster import MeanShift
clusters = MeanShift(bandwidth=25).fit(x)
After some more trial and error, a bandwidth of 25 produced the best result here.

Conclusion

Note that among the three methods on this page, MeanShift's solution best matches the visually obvious grouping. This, however, can change with every dataset examined.
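One way to back up the visual comparison with a number is the silhouette score. This sketch runs all three algorithms, with the parameter values used above, on synthetic data of a similar scale (the data itself is made up):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, MeanShift, DBSCAN
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=200, centers=5, cluster_std=6.0,
                  center_box=(10, 130), random_state=7)

scores = {}
for name, model in [('K-means', KMeans(n_clusters=5, n_init=10, random_state=7)),
                    ('MeanShift', MeanShift(bandwidth=25)),
                    ('DBSCAN', DBSCAN(eps=12, min_samples=10))]:
    labels = model.fit_predict(X)
    if len(set(labels)) > 1:  # silhouette needs at least 2 distinct labels
        scores[name] = silhouette_score(X, labels)
# Higher silhouette (closer to 1) means tighter, better-separated clusters.
# Note: DBSCAN's -1 outliers are treated as one extra group by this metric.
```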

Have a good day,
Emir Korkut Unal
