The dataset used here can be found on Kaggle, where it can be downloaded directly. More details about each module can be found on the scikit-learn website. "Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters)." (Wikipedia)
In short, clustering groups observations based on shared characteristics.
Normally, clustering provides better results when used together with principal components; but for the sake of this comparison, things were kept simple - this example uses clustering algorithms directly on the columns of the given dataset. The dataset used here is a customer dataset from a mall. It includes customer profile information and a previously determined spending score, which makes things even easier for a clustering analysis. Let's start by importing the modules and the data into Python.
```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('C:/Users/Emir/Desktop/Mall Customers.csv')
df.head()
```
| | CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|---|---|
| 0 | 1 | Male | 19 | 15 | 39 |
| 1 | 2 | Male | 21 | 15 | 81 |
| 2 | 3 | Female | 20 | 16 | 6 |
| 3 | 4 | Female | 23 | 16 | 77 |
| 4 | 5 | Female | 31 | 17 | 40 |
A quick look at pairs of variables with a scatter plot gives us what we need: spending score and annual income make a perfect pair for clustering.
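One quick way to inspect every pairwise combination at once is pandas' scatter_matrix. Below is a minimal sketch on synthetic stand-in data - the column names mirror the mall dataset, but the values are random, so only the mechanics carry over:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the mall dataset (random values, real column names)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Age': rng.integers(18, 70, 200),
    'Annual Income (k$)': rng.integers(15, 140, 200),
    'Spending Score (1-100)': rng.integers(1, 100, 200),
})

# One scatter plot for every pair of numeric columns,
# with histograms of each column on the diagonal
axes = pd.plotting.scatter_matrix(df, figsize=(10, 10))
```

Scanning the off-diagonal panels makes it easy to spot which pair of variables shows visible grouping.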
```python
df.plot.scatter(x='Spending Score (1-100)', y='Annual Income (k$)', figsize=(15,7))
```

We can already see the approximate groups in this plot. Let's see how scikit-learn's clustering algorithms will group these observations.
The K-means algorithm requires the number of clusters to be specified beforehand.
First we load the K-means module, then we create a dataframe that consists only of the two variables we selected.
```python
from sklearn.cluster import KMeans

x = df.filter(['Annual Income (k$)', 'Spending Score (1-100)'])
```

Because we can clearly see that there are 5 clusters, we will force K-means to create exactly 5 clusters for us.
```python
kmeans = KMeans(n_clusters=5)
```

Now we can fit our data to the model.
```python
clusters = kmeans.fit(x)
```

We can get the results through the model's ".labels_" attribute.
```python
clusters.labels_
```

```
array([3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2,
       3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 0,
       3, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 4, 1, 0, 1, 4, 1, 4, 1,
       0, 1, 4, 1, 4, 1, 4, 1, 4, 1, 0, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1,
       4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1,
       4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1,
       4, 1])
```

The next step is turning this array into a dataframe and adding it to our original dataframe. We also rename the column that contains the cluster numbers.
```python
ClusterDataset = pd.DataFrame(data=clusters.labels_)
dfClustered = pd.concat([df, ClusterDataset], axis=1)
dfClustered.rename(columns={0: 'Cluster'}, inplace=True)
```
| | CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) | Cluster |
|---|---|---|---|---|---|---|
| 0 | 1 | Male | 19 | 15 | 39 | 3 |
| 1 | 2 | Male | 21 | 15 | 81 | 2 |
| 2 | 3 | Female | 20 | 16 | 6 | 3 |
| 3 | 4 | Female | 23 | 16 | 77 | 2 |
| 4 | 5 | Female | 31 | 17 | 40 | 3 |
```python
dfClustered.plot.scatter(x='Spending Score (1-100)', y='Annual Income (k$)', c='Cluster', cmap="gist_rainbow", figsize=(15,7))
```

It looks pretty good. There are three observations one could argue belong to the purple cluster rather than the red, and one observation that could arguably be classified within the green cluster instead of the red.
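We picked k = 5 by eye, but when the cluster count is less obvious, the elbow method gives a quick check: fit K-means for a range of k and look for the point where the inertia stops dropping sharply. A minimal sketch, using synthetic make_blobs data as a stand-in for the two mall columns:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with 5 groups, standing in for income vs. spending score
x, _ = make_blobs(n_samples=200, centers=5, random_state=42)

# Fit K-means for k = 1..10 and record the inertia
# (within-cluster sum of squared distances to the centroid)
inertias = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(x)
    inertias.append(model.inertia_)

# The "elbow" - the k past which inertia flattens out - suggests a good choice
plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
```

On data with five well-separated groups, the curve bends visibly around k = 5.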
Let's look at how other algorithms would do the job. DBSCAN creates clusters in a different way than K-means: "min_samples=" lets you specify a minimum cluster size, and "eps=" is the maximum distance between two observations for them to be considered part of the same cluster. This approach allows a more flexible clustering operation, shifting control from the algorithm to the analyst. Based on these two inputs, DBSCAN can also identify some observations as outliers and leave them out of every cluster; these observations are labeled with a cluster number of -1.
We will use the same code as above, changing only these two lines.
```python
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=12, min_samples=10)
```

Remember to adjust the epsilon (eps) and minimum cluster size to your own needs, rerunning the code with different values until you get the optimum result. Note that eps accepts floats, so decimal values are allowed for this parameter.
After some trial and error, using eps as 12 and min_samples as 10 gave a reasonable result.
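A common heuristic for narrowing down eps (not used in the original analysis, but worth knowing) is the k-distance plot: sort each point's distance to its min_samples-th nearest neighbour and look for the knee in the curve. A sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic 2-D stand-in for the income / spending-score columns
x, _ = make_blobs(n_samples=200, centers=5, random_state=42)

# Distance from each point to its k-th nearest neighbour, with k = min_samples
# (kneighbors includes the point itself at distance 0, which is fine
# for a rough heuristic)
min_samples = 10
nn = NearestNeighbors(n_neighbors=min_samples).fit(x)
distances, _ = nn.kneighbors(x)
k_distances = np.sort(distances[:, -1])

# Plotting k_distances and reading off the "knee" suggests an eps value:
# points past the knee sit in sparse regions and are likely outliers.
```

Choosing eps near the knee keeps dense regions connected while still flagging sparse points as noise.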
```python
dfClustered.groupby('Cluster').size()
```

```
Cluster
-1    28
 0    16
 1    10
 2    92
 3    31
 4    23
dtype: int64
```

As you can see, the smallest group indeed has only 10 observations, and 28 observations are not included in any group.

MeanShift aims to find centroids in the data and assigns each observation to its nearest centroid. By default it labels all observations, though this behavior can be changed.
The bandwidth parameter determines how many observations the algorithm uses to calculate the means (source). You should adjust the bandwidth for the algorithm to work as you want it to. You might find it useful to take a look at scikit-learn's bandwidth estimator, or to read more about dynamically weighted bandwidths.
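The bandwidth estimator mentioned above is available as sklearn.cluster.estimate_bandwidth; it derives a starting bandwidth from the data itself. A minimal sketch on synthetic stand-in data (the quantile value below is illustrative, not tuned to the mall dataset):

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic stand-in for the two selected columns
x, _ = make_blobs(n_samples=200, centers=5, random_state=42)

# estimate_bandwidth derives a bandwidth from pairwise distances;
# a smaller quantile gives a smaller bandwidth and hence more clusters
bw = estimate_bandwidth(x, quantile=0.2, random_state=42)

clusters = MeanShift(bandwidth=bw).fit(x)
n_found = len(set(clusters.labels_))
```

From there you can nudge the estimated value up or down rather than guessing from scratch.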
Again, we'll adjust only two lines of our first code.
```python
from sklearn.cluster import MeanShift

clusters = MeanShift(bandwidth=25).fit(x)
```

After some more trial and error, a bandwidth of 25 produced the optimum results. Note that among the three methods on this page, MeanShift's solution best matches what is visually obvious. This, however, is subject to change with every dataset examined.
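A visual impression like this can also be checked numerically, for example with the silhouette score (closer to 1 is better). The sketch below compares the three algorithms on synthetic stand-in data; the eps and bandwidth values are tuned to the synthetic scale, not to the mall dataset:

```python
from sklearn.cluster import DBSCAN, KMeans, MeanShift
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D stand-in for the income / spending-score columns
x, _ = make_blobs(n_samples=200, centers=5, random_state=42)

scores = {}
for name, model in [
    ('KMeans', KMeans(n_clusters=5, n_init=10, random_state=42)),
    ('DBSCAN', DBSCAN(eps=1.5, min_samples=10)),
    ('MeanShift', MeanShift(bandwidth=3)),
]:
    labels = model.fit_predict(x)
    # silhouette_score needs at least two distinct labels
    if len(set(labels)) > 1:
        scores[name] = silhouette_score(x, labels)
```

Printing `scores` gives one number per method, which makes the "which solution looks best" question less subjective.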
Have a good day,
Emir Korkut Unal