
A mistake in modes' initialisation leading to incorrect incremental k-modes implementation #1

Open
asapegin opened this issue Jan 9, 2018 · 9 comments



asapegin commented Jan 9, 2018

clusters = [Cluster(centroid.record) for centroid in rdd.takeSample(False, n_partitions * n_clusters, seed=None)]

On line 253, the modes are initialised by randomly selecting records from the whole RDD (and not from the corresponding partition).

As a result, on line 96, random modes initialised from the whole dataset are used instead of the modes for the particular partition.

In most cases, this results in undefined behaviour of the algorithm, i.e. an attempt to use records from one partition with a mode from another partition. In turn, this leads to empty clusters (which should never happen in incremental k-means/k-modes; please see https://stackoverflow.com/questions/43445863/k-meansupdaing-centroids-successively-after-each-addition for details). In the code, this is papered over by the function "check_for_empty_cluster(clusters, rdd)", which replaces the mode of an empty cluster with a random one in the middle of the k-modes execution.

All in all, this leads to wrong clustering results.
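For illustration, per-partition initialisation could look roughly like this (a minimal PySpark sketch assuming the repository's `Cluster` class and an `rdd` of records; not the actual code in the repo):

```python
import random

def init_modes(partition, n_clusters):
    # Sample n_clusters modes from THIS partition only,
    # instead of rdd.takeSample(...) over the whole dataset.
    records = list(partition)
    sampled = random.sample(records, min(n_clusters, len(records)))
    # Wrap in an outer list so each partition yields one list of modes.
    return [[Cluster(record) for record in sampled]]

# One list of initial modes per partition.
modes_per_partition = rdd.mapPartitions(
    lambda part: init_modes(part, n_clusters)
).collect()
```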


cinqs commented Nov 6, 2018

I don't see why this would lead to "the" wrong clustering results, though I agree that the centroids for each partition should be sampled from that partition.

I agree mainly because the dataset could be distributed unevenly across partitions.

But on second thought: random sampling means some initialisations will produce an empty cluster, even in k-means. check_for_empty_cluster is a fair enough mechanism for keeping enough clusters, and the updated clusters will find a better centroid for every record in every partition.
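As a rough illustration of that repair mechanism, based on the description earlier in this thread (a hypothetical sketch, not the repository's actual check_for_empty_cluster implementation):

```python
def check_for_empty_cluster(clusters, rdd):
    # Replace the mode of any cluster that ended up with no members
    # by a randomly sampled record. The `size` and `mode` attributes
    # on Cluster are assumptions for this sketch.
    for cluster in clusters:
        if cluster.size == 0:
            cluster.mode = rdd.takeSample(False, 1)[0]
```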


cinqs commented Nov 6, 2018

This repo looks dead as far as I can see, but there is still more work to do: like you said, sampling evenly from every partition. There is also a fatal bug in the code of check_for_empty_cluster. I have worked on adding weights to features when computing the Hamming distance, and on using edit distance for features where it makes sense, like strings.
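A minimal sketch of what such a weighted Hamming dissimilarity could look like; the per-feature `weights` vector is a hypothetical parameter, not this repo's API:

```python
def weighted_hamming(record_a, record_b, weights):
    # Sum the weight of every feature on which the two records disagree.
    return sum(w for a, b, w in zip(record_a, record_b, weights) if a != b)
```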

You could fork it somewhere; maybe we could contribute something to it again.


asapegin commented Nov 6, 2018

Well, basically, it may still return some clusters, but the result will not be k-means or k-modes anymore. As far as I know, this new kind of algorithm has never been scientifically shown to produce reasonable clustering results, so I would treat it as wrong. At the very least, it should not be called k-modes anymore.


asapegin commented Nov 6, 2018

Thank you for the suggestion to fork it. Actually, I have already refactored the code a lot and added more distance functions, for example the frequency-based dissimilarity described in "Improving K-Modes Algorithm Considering Frequencies of Attribute Values in Mode" by Dong et al. However, as my work is subject to an NDA, there is unfortunately no decision yet on whether I can publish it.
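For readers unfamiliar with that measure, a minimal sketch of the frequency-based idea (my reading of the cited paper, not the NDA'd implementation): a match on an attribute contributes 1 - f, where f is the relative frequency of the mode's value within the cluster, while a mismatch contributes 1.

```python
def frequency_based_dissimilarity(record, mode, value_freqs):
    # value_freqs[j][v]: relative frequency of value v for attribute j
    # among the records currently assigned to this cluster (assumed input).
    d = 0.0
    for j, (x, m) in enumerate(zip(record, mode)):
        d += (1.0 - value_freqs[j][m]) if x == m else 1.0
    return d
```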


cinqs commented Nov 7, 2018

> Thank you for the suggestion to fork it. Actually, I have already refactored the code a lot and added more distance functions, for example the frequency-based dissimilarity described in "Improving K-Modes Algorithm Considering Frequencies of Attribute Values in Mode" by Dong et al. However, as my work is subject to an NDA, there is unfortunately no decision yet on whether I can publish it.

Well, I think I will read the paper you mentioned, then refactor and push...

hughshaoqz commented

Hi asapegin, thank you for pointing out the problem. I solved it by implementing a K_mode_partition method so that the model follows each step from the paper "Ensemble based Distributed K-Modes Clustering".

Briefly, I first randomly selected modes in each partition and then ran k-modes within each partition. After that, I ran a local k-modes on the master node over the cluster centres collected from each partition. A sketch of this two-phase scheme follows below.

The problem is that it works fine with one master node and two worker nodes (silhouette score above 0.7) but gets much worse if I choose more than two worker nodes (silhouette score below 0.3). Do you know why this happens? Thanks in advance.
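A minimal sketch of the two-phase scheme described above (`local_kmodes` is a hypothetical stand-in for a serial k-modes routine, not this repo's API):

```python
def two_phase_kmodes(rdd, n_clusters):
    # Phase 1: run k-modes independently inside each partition;
    # each partition yields n_clusters local modes.
    local_modes = rdd.mapPartitions(
        lambda part: local_kmodes(list(part), n_clusters)
    ).collect()  # n_clusters * n_partitions candidate modes on the driver

    # Phase 2: cluster the candidate modes themselves on the driver
    # to obtain the final n_clusters global modes.
    return local_kmodes(local_modes, n_clusters)
```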

asapegin commented

Do you also increase the number of partitions when you have more than 2 worker nodes? If so, you will end up with more candidate clusters before the local k-modes (n_clusters per partition; e.g., with n_clusters = 10, 2 partitions give 20 candidate modes, while 10 partitions give 100), and the silhouette score after the local k-modes may indeed become smaller, because the local k-modes has to find n_clusters among n_clusters * n_partitions points. More partitions = more points = possibly(!) a lower silhouette score.

hughshaoqz commented

Thank you so much for the reply! I did not increase the number of partitions; I fixed it at the same number (32) using the coalesce and repartition methods. Actually, I also tried increasing the number of partitions, and the result was again very bad. I also tried many different hardware settings, and I found that: 1) with 1 worker node, no matter what hardware I choose, the result is always good; 2) with 2 worker nodes, only m4.2xlarge instances give a perfect result; 3) with more than 2 worker nodes, I never get a perfect result. Do you by any chance know why this happens?


asapegin commented Jan 2, 2019

Hi all! As I mentioned earlier, I have already performed a major refactoring of the code, fixing this and many other issues, as well as adding more distance functions. I have now got permission to publish the updated and fixed code. Please feel free to look at https://github.com/asapegin/pyspark-kmetamodes
@hughshaoqz you can now try my implementation and see if it solves your problem.
@cinqs You are welcome to join me as a co-developer of https://github.com/asapegin/pyspark-kmetamodes and to contribute.
