A mistake in modes' initialisation leading to incorrect incremental k-modes implementation #1
I don't see why this would lead to "the" wrong results of the clustering, though I agree that the centroids of each partition should be sampled from that partition. I agree mainly because I think the dataset may be distributed abnormally / unevenly across partitions. But think twice: random sampling means there could be samples resulting in an empty cluster, even in k-means.
This repo looks dead to me, but there is still more work to do, like you said, e.g. sampling evenly from every partition. And there is a fatal bug in the code. If you can fork it somewhere, maybe we could contribute something to it again.
Well, basically, it may still return some clusters, but this will not be k-means or k-modes anymore. As far as I know, this new kind of algorithm was never scientifically proven / checked to produce reasonable clustering results, so I would treat it as wrong. At least, it should not be called k-modes anymore.
Thank you for your suggestion to fork it. Actually, I have already refactored the code a lot and added more distance functions, for example the frequency-based dissimilarity described in "Improving K-Modes Algorithm Considering Frequencies of Attribute Values in Mode" by Dong et al. However, as my work is subject to an NDA, unfortunately there is no decision yet on whether I can publish it.
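For readers unfamiliar with that distance, here is a minimal sketch of one common formulation of frequency-based dissimilarity (illustrative only, with assumed names; this is not the refactored code mentioned above): a matching attribute contributes 1 minus the frequency of the mode's value within the cluster, instead of 0 as in classic k-modes, so modes backed by few records are penalised.

```python
# Illustrative sketch of a frequency-based dissimilarity for k-modes
# (assumed names, not code from the refactored repository).
def frequency_based_dissimilarity(record, mode, cluster_members):
    """record, mode: sequences of categorical values (one per attribute);
    cluster_members: records currently assigned to this mode's cluster."""
    n = len(cluster_members)
    distance = 0.0
    for j, (x, z) in enumerate(zip(record, mode)):
        if x != z:
            distance += 1.0                      # plain mismatch, as in classic k-modes
        elif n > 0:
            freq = sum(1 for m in cluster_members if m[j] == z) / n
            distance += 1.0 - freq               # match, weighted by the mode value's frequency
    return distance
```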
Well, I think I will read the paper you mentioned, and refactor and push...
Hi asapegin, thank you for pointing out the problem. I solved it by implementing the K_mode_partition method so that the model follows each step from the paper (Ensemble-based Distributed K-Modes Clustering). Briefly, what I did is: I first randomly selected clusters in each partition and then ran k-modes within each partition. After that, I ran a local k-modes on the master node over the cluster centres from each partition. The problem is that it works fine with one master node and two worker nodes (silhouette score above 0.7) but gets much worse if I choose more than two worker nodes (silhouette score below 0.3). Do you know why this happens? Thanks in advance.
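For clarity, a minimal PySpark sketch of that two-stage ensemble scheme could look as follows (assumed names such as local_kmodes; this is not the actual implementation discussed here):

```python
# Two-stage ensemble k-modes sketch: run k-modes inside every partition, then
# cluster the collected per-partition modes once more on the driver.
# `local_kmodes(records, n_clusters)` is a hypothetical single-machine k-modes
# routine returning the fitted modes.
def kmodes_per_partition(records_iter, n_clusters, local_kmodes):
    records = list(records_iter)
    if not records:
        return iter([])                  # an empty partition contributes no modes
    # Stage 1: initial modes are drawn from THIS partition only, then refined locally.
    return iter(local_kmodes(records, n_clusters))

def ensemble_kmodes(rdd, n_clusters, local_kmodes):
    # Stage 1: n_clusters modes per partition.
    partition_modes = rdd.mapPartitions(
        lambda it: kmodes_per_partition(it, n_clusters, local_kmodes)
    ).collect()
    # Stage 2: find n_clusters final modes among n_clusters * n_partitions candidates.
    return local_kmodes(partition_modes, n_clusters)
```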
Do you also increase the number of partitions when you have more than 2 worker nodes? If yes, then you will have more "clusters" in the intermediate step (n_clusters for each partition; e.g., if n_clusters = 10, with 2 partitions you will have 20 clusters before the local k-modes, and with 10 partitions you will have 100), and the silhouette score after the local k-modes may indeed become smaller, because the local k-modes has to find n_clusters among n_clusters * partitions points. More partitions = more points = maybe(!) lower silhouette score.
Thank you so much for the reply! I did not increase the number of partitions; I fixed it to the same number (32) using the coalesce and repartition methods. Actually, I also tried to increase the number of partitions, and the result was again very bad. I also tried many different hardware settings, and I found that 1) with 1 worker node, no matter what hardware I choose, the result is always good; 2) with 2 worker nodes, only the m4.2xlarge hardware gives the perfect result; 3) with more than 2 worker nodes, I never get the perfect result... Do you by any chance know why this happens?
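For reference, pinning the RDD to a fixed number of partitions as mentioned above could look roughly like this (32 is the number from the comment):

```python
# Fix the number of partitions regardless of the number of worker nodes.
rdd = rdd.repartition(32)   # full shuffle into exactly 32 partitions
# or, when only reducing the partition count, avoid the full shuffle:
rdd = rdd.coalesce(32)
```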
Hi all! As I mentioned earlier, I have already performed a major refactoring of the code, fixing this and many other issues, as well as adding more distance functions. I have now got permission to publish the updated and fixed code. Please feel free to look at https://github.com/asapegin/pyspark-kmetamodes
pyspark-distributed-kmodes/pyspark_kmodes/pyspark_kmodes.py, line 253 in 98b27d7
On line 253, modes are initialised by randomly selecting records from the whole RDD (and not from the corresponding partition).
As a result, on line 96, the random modes initialised from the whole dataset are used instead of the modes for the particular partition.
In most cases, this results in undefined behaviour of the algorithm, i.e. in an attempt to use records from one partition with a mode from another partition. In turn, this leads to empty clusters (which should never happen in incremental k-means/k-modes; please see https://stackoverflow.com/questions/43445863/k-meansupdaing-centroids-successively-after-each-addition for details). In the code, this is worked around with the function "check_for_empty_cluster(clusters, rdd)", which replaces the mode of an empty cluster with a random one in the middle of the k-modes execution.
All in all, this leads to wrong clustering results.
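A minimal sketch of what per-partition initialisation could look like (assumed names and a simplified incremental update; this is not the repository's actual code at line 253): each partition seeds its modes from its own records, so records are never compared against modes initialised from another partition.

```python
import random

# Sketch of per-partition incremental k-modes with modes seeded from the same
# partition (illustrative and simplified).
def cluster_partition(records_iter, n_clusters, dissimilarity):
    records = list(records_iter)
    if len(records) < n_clusters:
        return iter([])                                  # too few records to seed modes
    modes = [list(r) for r in random.sample(records, n_clusters)]  # seeds from THIS partition
    members = [[] for _ in range(n_clusters)]
    for record in records:
        # incremental step: assign the record, then update the chosen mode right away
        best = min(range(n_clusters), key=lambda k: dissimilarity(record, modes[k]))
        members[best].append(record)
        for j in range(len(modes[best])):
            values = [r[j] for r in members[best]]
            modes[best][j] = max(set(values), key=values.count)  # most frequent category
    return iter(modes)

# Hypothetical usage, with `matching_distance` standing in for a dissimilarity function:
# partition_modes = rdd.mapPartitions(
#     lambda it: cluster_partition(it, n_clusters, matching_distance)
# ).collect()
```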