
A mistake in modes' initialisation leading to incorrect incremental k-modes implementation #1

Open
asapegin opened this issue Jan 9, 2018 · 9 comments



asapegin commented Jan 9, 2018

clusters = [Cluster(centroid.record) for centroid in rdd.takeSample(False, n_partitions * n_clusters, seed=None)]

On line 253, the modes are initialised by randomly selecting records from the whole RDD (and not from the corresponding partition).

As a result, on line 96, random modes initialised from the whole dataset are used instead of the modes for the particular partition.

In most cases, this results in undefined behaviour of the algorithm, i.e. an attempt to use records from one partition with a mode from another partition. In turn, this leads to empty clusters (which should never happen in incremental k-means/k-modes; please see https://stackoverflow.com/questions/43445863/k-meansupdaing-centroids-successively-after-each-addition for details). In the code, this is papered over by the function "check_for_empty_cluster(clusters, rdd)", which replaces the mode of an empty cluster with a random one in the middle of the k-modes execution.

All in all, this leads to wrong clustering results.
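For illustration, per-partition initialisation could look roughly like this (a minimal PySpark sketch assuming the repository's `Cluster` class and an `rdd` of records; not the actual code in the repo):

```python
import random

def init_modes(partition, n_clusters):
    # Sample n_clusters modes from THIS partition only,
    # instead of rdd.takeSample(...) over the whole dataset.
    records = list(partition)
    sampled = random.sample(records, min(n_clusters, len(records)))
    # Wrap in an outer list so each partition yields one list of modes.
    return [[Cluster(record) for record in sampled]]

# One list of initial modes per partition.
modes_per_partition = rdd.mapPartitions(
    lambda part: init_modes(part, n_clusters)
).collect()
```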


cinqs commented Nov 6, 2018

I don't see why this would lead to "the" wrong clustering results, though I agree that the centroids for each partition should be sampled from that partition.

I agree mainly because the dataset could be distributed unevenly across partitions.

But on second thought: random sampling means some initialisations will produce an empty cluster, even in k-means. check_for_empty_cluster is a fair enough mechanism for keeping enough clusters, and the updated clusters will find a better centroid for every record in every partition.
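As a rough illustration of that repair mechanism, based on the description earlier in this thread (a hypothetical sketch, not the repository's actual check_for_empty_cluster implementation):

```python
def check_for_empty_cluster(clusters, rdd):
    # Replace the mode of any cluster that ended up with no members
    # by a randomly sampled record. The `size` and `mode` attributes
    # on Cluster are assumptions for this sketch.
    for cluster in clusters:
        if cluster.size == 0:
            cluster.mode = rdd.takeSample(False, 1)[0]
```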


cinqs commented Nov 6, 2018

This repo looks dead as far as I can see, but there is still more work to do: like you said, sampling evenly from every partition. There is also a fatal bug in the code of check_for_empty_cluster. I have worked on adding weights to features when computing the Hamming distance, and on using edit distance for features where it makes sense, like strings.
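A minimal sketch of what such a weighted Hamming dissimilarity could look like; the per-feature `weights` vector is a hypothetical parameter, not this repo's API:

```python
def weighted_hamming(record_a, record_b, weights):
    # Sum the weight of every feature on which the two records disagree.
    return sum(w for a, b, w in zip(record_a, record_b, weights) if a != b)
```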

You could fork it somewhere; maybe we could contribute something to it again.


asapegin commented Nov 6, 2018

Well, basically, it may still return some clusters, but the result will not be k-means or k-modes anymore. As far as I know, this new kind of algorithm has never been scientifically shown to produce reasonable clustering results, so I would treat it as wrong. At the very least, it should not be called k-modes anymore.


asapegin commented Nov 6, 2018

Thank you for the suggestion to fork it. Actually, I have already refactored the code a lot and added more distance functions, for example the frequency-based dissimilarity described in "Improving K-Modes Algorithm Considering Frequencies of Attribute Values in Mode" by Dong et al. However, as my work is subject to an NDA, there is unfortunately no decision yet on whether I can publish it.
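For readers unfamiliar with that measure, a minimal sketch of the frequency-based idea (my reading of the cited paper, not the NDA'd implementation): a match on an attribute contributes 1 - f, where f is the relative frequency of the mode's value within the cluster, while a mismatch contributes 1.

```python
def frequency_based_dissimilarity(record, mode, value_freqs):
    # value_freqs[j][v]: relative frequency of value v for attribute j
    # among the records currently assigned to this cluster (assumed input).
    d = 0.0
    for j, (x, m) in enumerate(zip(record, mode)):
        d += (1.0 - value_freqs[j][m]) if x == m else 1.0
    return d
```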


cinqs commented Nov 7, 2018

> Thank you for the suggestion to fork it. Actually, I have already refactored the code a lot and added more distance functions, for example the frequency-based dissimilarity described in "Improving K-Modes Algorithm Considering Frequencies of Attribute Values in Mode" by Dong et al. However, as my work is subject to an NDA, there is unfortunately no decision yet on whether I can publish it.

Well, I think I will read the paper you mentioned, then refactor and push...

hughshaoqz commented

Hi asapegin, thank you for pointing out the problem. I solved it by implementing a K_mode_partition method so that the model follows each step from the paper "Ensemble based Distributed K-Modes Clustering".

Briefly, I first randomly selected modes in each partition and then ran k-modes within each partition. After that, I ran a local k-modes on the master node over the cluster centres collected from each partition. A sketch of this two-phase scheme follows below.

The problem is that it works fine with one master node and two worker nodes (silhouette score above 0.7) but gets much worse if I choose more than two worker nodes (silhouette score below 0.3). Do you know why this happens? Thanks in advance.
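A minimal sketch of the two-phase scheme described above (`local_kmodes` is a hypothetical stand-in for a serial k-modes routine, not this repo's API):

```python
def two_phase_kmodes(rdd, n_clusters):
    # Phase 1: run k-modes independently inside each partition;
    # each partition yields n_clusters local modes.
    local_modes = rdd.mapPartitions(
        lambda part: local_kmodes(list(part), n_clusters)
    ).collect()  # n_clusters * n_partitions candidate modes on the driver

    # Phase 2: cluster the candidate modes themselves on the driver
    # to obtain the final n_clusters global modes.
    return local_kmodes(local_modes, n_clusters)
```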

asapegin commented

Do you also increase the number of partitions when you have more than 2 worker nodes? If so, you will end up with more candidate clusters before the local k-modes (n_clusters per partition; e.g., with n_clusters = 10, 2 partitions give 20 candidate modes, while 10 partitions give 100), and the silhouette score after the local k-modes may indeed become smaller, because the local k-modes has to find n_clusters among n_clusters * n_partitions points. More partitions = more points = possibly(!) a lower silhouette score.

hughshaoqz commented

Thank you so much for the reply! I did not increase the number of partitions; I fixed it at the same number (32) using the coalesce and repartition methods. Actually, I also tried increasing the number of partitions, and the result was again very bad. I also tried many different hardware settings, and I found that: 1) with 1 worker node, no matter what hardware I choose, the result is always good; 2) with 2 worker nodes, only m4.2xlarge instances give a perfect result; 3) with more than 2 worker nodes, I never get a perfect result. Do you by any chance know why this happens?


asapegin commented Jan 2, 2019

Hi all! As I mentioned earlier, I have already performed a major refactoring of the code, fixing this and many other issues, as well as adding more distance functions. I have now got permission to publish the updated and fixed code. Please feel free to look at https://github.com/asapegin/pyspark-kmetamodes
@hughshaoqz you can now try my implementation and see if it solves your problem.
@cinqs You are welcome to join me as a co-developer of https://github.com/asapegin/pyspark-kmetamodes and to contribute.
