The dataset used here can be found on Kaggle, where it can be downloaded directly. More details about each module can be found on the scikit-learn website. "Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters)." (Wikipedia)
In short, clustering groups observations based on shared characteristics.
Normally, clustering provides better results when used together with principal components; but for the sake of this comparison, things were kept simple - this example uses clustering algorithms directly on the columns of the given dataset. The dataset used here is a customer dataset from a mall. It includes customer profile information and a previously determined spending score, which makes things even easier for a clustering analysis. Let's start by importing the modules and the data into Python.
```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('C:/Users/Emir/Desktop/Mall Customers.csv')
df.head()
```
| | CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|---|---|
| 0 | 1 | Male | 19 | 15 | 39 |
| 1 | 2 | Male | 21 | 15 | 81 |
| 2 | 3 | Female | 20 | 16 | 6 |
| 3 | 4 | Female | 23 | 16 | 77 |
| 4 | 5 | Female | 31 | 17 | 40 |
A quick look at pairs of variables with a scatter plot gives us what we need: spending score and annual income make a perfect pair for clustering.
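One quick way to inspect every pairwise combination at once is pandas' scatter_matrix. Below is a minimal sketch on synthetic stand-in data - the column names mirror the mall dataset, but the values are random, so only the mechanics carry over:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the mall dataset (random values, real column names)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Age': rng.integers(18, 70, 200),
    'Annual Income (k$)': rng.integers(15, 140, 200),
    'Spending Score (1-100)': rng.integers(1, 100, 200),
})

# One scatter plot for every pair of numeric columns,
# with histograms of each column on the diagonal
axes = pd.plotting.scatter_matrix(df, figsize=(10, 10))
```

Scanning the off-diagonal panels makes it easy to spot which pair of variables shows visible grouping.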
```python
df.plot.scatter(x='Spending Score (1-100)', y='Annual Income (k$)', figsize=(15,7))
```

We can already see the approximate groups in this plot. Let's see how scikit-learn's clustering algorithms will group these observations.
The K-means algorithm requires the number of clusters to be specified beforehand.
First we load the K-means module, then we create a dataframe that consists only of the two variables we selected.
```python
from sklearn.cluster import KMeans

x = df.filter(['Annual Income (k$)', 'Spending Score (1-100)'])
```

Because we can clearly see that there are 5 clusters, we will force K-means to create exactly 5 clusters for us.
```python
kmeans = KMeans(n_clusters=5)
```

Now we can fit our data to the model.
```python
clusters = kmeans.fit(x)
```

We can get the results through the model's ".labels_" attribute.
```python
clusters.labels_
```

```
array([3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2,
       3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 0,
       3, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 4, 1, 0, 1, 4, 1, 4, 1,
       0, 1, 4, 1, 4, 1, 4, 1, 4, 1, 0, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1,
       4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1,
       4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1,
       4, 1])
```

The next step is turning this array into a dataframe and adding it to our original dataframe. We also rename the column that contains the cluster numbers.
```python
ClusterDataset = pd.DataFrame(data=clusters.labels_)
dfClustered = pd.concat([df, ClusterDataset], axis=1)
dfClustered.rename(columns={0: 'Cluster'}, inplace=True)
```
| | CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) | Cluster |
|---|---|---|---|---|---|---|
| 0 | 1 | Male | 19 | 15 | 39 | 3 |
| 1 | 2 | Male | 21 | 15 | 81 | 2 |
| 2 | 3 | Female | 20 | 16 | 6 | 3 |
| 3 | 4 | Female | 23 | 16 | 77 | 2 |
| 4 | 5 | Female | 31 | 17 | 40 | 3 |
```python
dfClustered.plot.scatter(x='Spending Score (1-100)', y='Annual Income (k$)', c='Cluster', cmap="gist_rainbow", figsize=(15,7))
```

It looks pretty good. There are three observations one could argue belong to the purple cluster rather than the red, and one observation that could arguably be classified within the green cluster instead of the red.
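We picked k = 5 by eye, but when the cluster count is less obvious, the elbow method gives a quick check: fit K-means for a range of k and look for the point where the inertia stops dropping sharply. A minimal sketch, using synthetic make_blobs data as a stand-in for the two mall columns:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with 5 groups, standing in for income vs. spending score
x, _ = make_blobs(n_samples=200, centers=5, random_state=42)

# Fit K-means for k = 1..10 and record the inertia
# (within-cluster sum of squared distances to the centroid)
inertias = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(x)
    inertias.append(model.inertia_)

# The "elbow" - the k past which inertia flattens out - suggests a good choice
plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
```

On data with five well-separated groups, the curve bends visibly around k = 5.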
Let's look at how other algorithms would do the job. DBSCAN creates clusters in a different way than K-means: "min_samples=" lets you specify a minimum cluster size, and "eps=" is the maximum distance between two observations for them to be considered part of the same cluster. This approach allows a more flexible clustering operation, shifting control from the algorithm to the analyst. Based on these two inputs, DBSCAN can also identify some observations as outliers and leave them out of every cluster; these observations are labeled with a cluster number of -1.
We will use the same code as above, changing only these two lines.
```python
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=12, min_samples=10)
```

Remember to adjust the epsilon (eps) and minimum cluster size to your own needs, rerunning the code with different values until you get the optimum result. Note that eps accepts floats, so decimal values are allowed for this parameter.
After some trial and error, using eps as 12 and min_samples as 10 gave a reasonable result.
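A common heuristic for narrowing down eps (not used in the original analysis, but worth knowing) is the k-distance plot: sort each point's distance to its min_samples-th nearest neighbour and look for the knee in the curve. A sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic 2-D stand-in for the income / spending-score columns
x, _ = make_blobs(n_samples=200, centers=5, random_state=42)

# Distance from each point to its k-th nearest neighbour, with k = min_samples
# (kneighbors includes the point itself at distance 0, which is fine
# for a rough heuristic)
min_samples = 10
nn = NearestNeighbors(n_neighbors=min_samples).fit(x)
distances, _ = nn.kneighbors(x)
k_distances = np.sort(distances[:, -1])

# Plotting k_distances and reading off the "knee" suggests an eps value:
# points past the knee sit in sparse regions and are likely outliers.
```

Choosing eps near the knee keeps dense regions connected while still flagging sparse points as noise.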
```python
dfClustered.groupby('Cluster').size()
```

```
Cluster
-1    28
 0    16
 1    10
 2    92
 3    31
 4    23
dtype: int64
```

As you can see, the smallest group indeed has only 10 observations, and 28 observations are not included in any group.

MeanShift aims to find centroids in the data and assigns each observation to its nearest centroid. By default it labels all observations, though this behavior can be changed.
The bandwidth parameter determines how many observations the algorithm uses to calculate the means (source). You should adjust the bandwidth for the algorithm to work as you want it to. You might find it useful to take a look at scikit-learn's bandwidth estimator, or to read more about dynamically weighted bandwidths.
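The bandwidth estimator mentioned above is available as sklearn.cluster.estimate_bandwidth; it derives a starting bandwidth from the data itself. A minimal sketch on synthetic stand-in data (the quantile value below is illustrative, not tuned to the mall dataset):

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic stand-in for the two selected columns
x, _ = make_blobs(n_samples=200, centers=5, random_state=42)

# estimate_bandwidth derives a bandwidth from pairwise distances;
# a smaller quantile gives a smaller bandwidth and hence more clusters
bw = estimate_bandwidth(x, quantile=0.2, random_state=42)

clusters = MeanShift(bandwidth=bw).fit(x)
n_found = len(set(clusters.labels_))
```

From there you can nudge the estimated value up or down rather than guessing from scratch.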
Again, we'll adjust only two lines of our first code.
```python
from sklearn.cluster import MeanShift

clusters = MeanShift(bandwidth=25).fit(x)
```

After some more trial and error, a bandwidth of 25 produced the optimum results. Note that among the three methods on this page, MeanShift's solution best matches what is visually obvious. This, however, is subject to change with every dataset examined.
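A visual impression like this can also be checked numerically, for example with the silhouette score (closer to 1 is better). The sketch below compares the three algorithms on synthetic stand-in data; the eps and bandwidth values are tuned to the synthetic scale, not to the mall dataset:

```python
from sklearn.cluster import DBSCAN, KMeans, MeanShift
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D stand-in for the income / spending-score columns
x, _ = make_blobs(n_samples=200, centers=5, random_state=42)

scores = {}
for name, model in [
    ('KMeans', KMeans(n_clusters=5, n_init=10, random_state=42)),
    ('DBSCAN', DBSCAN(eps=1.5, min_samples=10)),
    ('MeanShift', MeanShift(bandwidth=3)),
]:
    labels = model.fit_predict(x)
    # silhouette_score needs at least two distinct labels
    if len(set(labels)) > 1:
        scores[name] = silhouette_score(x, labels)
```

Printing `scores` gives one number per method, which makes the "which solution looks best" question less subjective.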
Have a good day,
Emir Korkut Unal