Evaluation of k-means for clustering #18
cc @durandom @tumido @MichaelClifford does this make sense to you? The silhouette coefficient shows that k-means is probably not a good method in this case.
Maybe this is also a consequence of the input data we are using.
Is this a measurement you could integrate into the metric_tracking package?
@durandom yes, I plan to add it in https://github.com/RedHatInsights/aicoe-insights-clustering/pull/14/files#diff-d0301332bd6fef353ec35837646aa49e once it is merged. I'll also need to figure out how to run it as separate jobs; right now it takes hours to compute a single day of data on 4 cores. I am using the rule data from 2018-09-05 as input.
That's what the upshift environment is meant for. Create a build config and let it run there.
@Ladas thank you for the metrics! It makes total sense and just proves that, as we all know, we know nothing. 😄 There are probably multiple factors in effect:
I think this is the kind of insight into the clustering Marcel was looking for, and these graphs would make it easier to compare different solutions. I'd love to see them as part of the metrics tracking thingy Marcel's team is working on. 👍
Plotting SSE and the silhouette coefficient (multiplying the silhouette coefficient by 3M so both fit on one chart).
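A minimal sketch of how these two metrics can be computed across a range of cluster counts; the synthetic `make_blobs` data is a stand-in for the actual rule data, and the k range is an assumption (the issue does not state which values of k were tried):

```python
# Sketch: compute SSE (KMeans inertia) and the silhouette coefficient
# for a range of k, to be plotted together on one chart (the issue
# rescales the silhouette values by 3M so both curves are visible).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Stand-in for the real rule data from 2018-09-05.
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

ks = range(2, 11)  # assumed range of cluster counts
sse, sil = [], []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)                      # within-cluster sum of squares
    sil.append(silhouette_score(X, km.labels_))  # mean silhouette, in [-1, 1]
```

The two lists can then be plotted on a shared x-axis (k), with `sil` rescaled as in the charts above.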
Looking at the variance in the features, we can see that around 245 principal components hold 99% of the variance.
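Finding that component count can be sketched with a cumulative sum over the PCA explained-variance ratios; the random matrix here is a placeholder for the real feature matrix:

```python
# Sketch: how many principal components are needed to retain 99% of
# the variance. The random matrix stands in for the real features.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))  # placeholder feature matrix

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)
# First index where the cumulative explained variance reaches 99%.
n_components_99 = int(np.searchsorted(cumvar, 0.99)) + 1
```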
Let's look at the SSE and silhouette coefficient when keeping only 245 components, transformed by PCA.
Result: the maximum silhouette coefficient is still only around 0.43, at just 4 clusters, which has a bad SSE; beyond that it stays around 0.3.
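The PCA-then-cluster step above can be sketched as follows; the blob data and the k range are assumptions, and `PCA(n_components=0.99)` lets scikit-learn pick however many components cover 99% of the variance (245 in the actual analysis):

```python
# Sketch: project the features onto the components holding 99% of the
# variance, then recompute the silhouette coefficient per k.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Placeholder for the real feature matrix.
X, _ = make_blobs(n_samples=500, n_features=50, centers=4, random_state=0)

# A float n_components asks PCA to keep enough components to
# explain that fraction of the variance.
X_reduced = PCA(n_components=0.99).fit_transform(X)

scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_reduced)
    scores[k] = silhouette_score(X_reduced, labels)
best_k = max(scores, key=scores.get)
```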
According to common interpretations of the silhouette coefficient, a value in this range points to:
"The structure is weak and could be artificial."
This means we should try a different clustering method, since k-means scores poorly here.