This repository has been archived by the owner on Oct 3, 2019. It is now read-only.

Evaluation of k-means for clustering #18

Open
Ladas opened this issue Oct 8, 2018 · 8 comments

Comments

@Ladas

Ladas commented Oct 8, 2018

Plotting SSE and the silhouette coefficient (the silhouette coefficient is multiplied by 3M so both fit in one chart):

[image]

[image]
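The sweep behind these charts can be sketched with scikit-learn. This is a minimal illustration on synthetic blobs, not the actual notebook (the real input is the rule data from 2018-09-05), so the data and parameters are stand-ins:

```python
# Sketch: sweep k, recording SSE (KMeans inertia) and the silhouette
# coefficient for each k. Synthetic blobs stand in for the rule data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.7, random_state=0)

sse, silhouette = {}, {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_                      # sum of squared distances to centroids
    silhouette[k] = silhouette_score(X, km.labels_)

best_k = max(silhouette, key=silhouette.get)  # k with the highest silhouette
```

Plotting `sse` and `silhouette` against `k` reproduces the elbow-plus-silhouette view shown above.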


Looking at how the variance is distributed across the features, we can see that around 245 components hold 99% of the variance.

[image]
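The 99%-variance cutoff can be read off PCA's cumulative explained-variance ratio. A minimal sketch on random data (the 245-component figure comes from the real dataset, not from this example):

```python
# Sketch: find how many principal components explain 99% of the variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))                 # stand-in for the real feature matrix

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_99 = int(np.searchsorted(cumvar, 0.99) + 1)  # first component count reaching 99%

X_reduced = PCA(n_components=n_99).fit_transform(X)
```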


Let's look at the SSE and silhouette coefficient when keeping only 245 components, transformed by PCA:

[image]

[image]


Result: the maximum silhouette coefficient is still only around 0.43, at just 4 clusters, which has a bad SSE. Beyond that it stays around 0.3.
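Combining the two steps, a sketch of the PCA-then-k-means experiment (synthetic data again; 245 is the component count measured on the real dataset):

```python
# Sketch: reduce to 245 components with PCA, then cluster and score.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 300))                 # stand-in for the real feature matrix

X_red = PCA(n_components=245).fit_transform(X)  # ~99% of variance in the real data

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_red)
score = silhouette_score(X_red, km.labels_)     # ~0.43 was observed on the real data
```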

What sources say about the silhouette coefficient:

| Silhouette coefficient | Interpretation |
| --- | --- |
| 0.71–1.00 | A strong structure has been found |
| 0.51–0.70 | A reasonable structure has been found |
| 0.26–0.50 | The structure is weak and could be artificial |
| < 0.25 | No substantial structure has been found |

This puts us in the "the structure is weak and could be artificial" band, meaning we should try a different clustering method, since k-means seems to score badly here.

@Ladas
Author

Ladas commented Oct 8, 2018

cc @durandom @tumido @MichaelClifford does this make sense to you? The silhouette coefficient suggests that k-means is probably not a good method in this case.

@durandom
Member

durandom commented Oct 8, 2018

Maybe this is also because of the input data we are using.
I've written up the proposed next steps here.

@durandom
Member

durandom commented Oct 8, 2018

Is this a measurement you could integrate into the metric_tracking package?
What's the input of the silhouette coefficient? Just the clusters? Could you add the notebook or code that produces this?

@Ladas
Author

Ladas commented Oct 8, 2018

@durandom yes, I plan to add it in https://github.com/RedHatInsights/aicoe-insights-clustering/pull/14/files#diff-d0301332bd6fef353ec35837646aa49e once it is merged. Also, I'll need to figure out how to run it as separate jobs; right now it takes hours to compute one day of data on 4 cores.

Using the rule data from 2018-09-05 as input

@durandom
Member

durandom commented Oct 8, 2018

> Also, I'll need to figure out how to run it as separate jobs, now it takes hours to compute for 1 day on 4 cores.

That's what the upshift environment is meant for. Create a build config and let it run there.

@tumido
Member

tumido commented Oct 8, 2018

@Ladas thank you for the metrics! It makes total sense and just proves that we all know, that we know nothing. 😄

There are probably multiple factors in effect:

  • data preprocessing (ironing out the structure too much)
  • applied clustering method which doesn't fit the use case
  • input data composition (extracting wrong markers, missing important attributes, etc.)

I think this is the kind of insight into the clustering Marcel was looking for and these graphs would make it easier to compare different solutions. I'd love to see them as a part of the metrics tracking thingy Marcel's team is working on. 👍

@Ladas
Author

Ladas commented Oct 9, 2018

FYI: it seems DBSCAN also performs badly; we'll need to munge the data before clustering.

[image]

[image]
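For reference, a minimal DBSCAN run with silhouette scoring might look like this. The `eps`/`min_samples` values and the synthetic data are illustrative, not the parameters used above; noise points are excluded before scoring, since silhouette is undefined for them:

```python
# Sketch: DBSCAN clustering, then silhouette score on the non-noise points.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=400, centers=3, cluster_std=0.6, random_state=0)

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

mask = labels != -1                       # -1 marks noise points
n_clusters = len(set(labels[mask]))
score = (silhouette_score(X[mask], labels[mask])
         if n_clusters > 1 else float("nan"))
```

Unlike k-means, DBSCAN picks the number of clusters itself, so the natural sweep here is over `eps` rather than over `k`.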

@durandom
Member

cc @TreeinRandomForest
