This repository has been archived by the owner on Oct 3, 2019. It is now read-only.

Evaluation of k-means for clustering #18

Open
Ladas opened this issue Oct 8, 2018 · 8 comments

Comments

@Ladas

Ladas commented Oct 8, 2018

Plotting SSE and the silhouette coefficient (the silhouette coefficient is multiplied by 3M so both fit in one chart):

[image]

[image]
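The sweep behind these charts can be sketched with scikit-learn. This is a minimal illustration on synthetic blobs, not the actual notebook (the real input is the rule data from 2018-09-05), so the data and parameters are stand-ins:

```python
# Sketch: sweep k, recording SSE (KMeans inertia) and the silhouette
# coefficient for each k. Synthetic blobs stand in for the rule data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.7, random_state=0)

sse, silhouette = {}, {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_                      # sum of squared distances to centroids
    silhouette[k] = silhouette_score(X, km.labels_)

best_k = max(silhouette, key=silhouette.get)  # k with the highest silhouette
```

Plotting `sse` and `silhouette` against `k` reproduces the elbow-plus-silhouette view shown above.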


Looking at how the variance is distributed across the features, we can see that around 245 components hold 99% of the variance.

[image]
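The 99%-variance cutoff can be read off PCA's cumulative explained-variance ratio. A minimal sketch on random data (the 245-component figure comes from the real dataset, not from this example):

```python
# Sketch: find how many principal components explain 99% of the variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))                 # stand-in for the real feature matrix

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_99 = int(np.searchsorted(cumvar, 0.99) + 1)  # first component count reaching 99%

X_reduced = PCA(n_components=n_99).fit_transform(X)
```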


Let's look at the SSE and silhouette coefficient when keeping only 245 components, transformed by PCA:

[image]

[image]


Result: the maximum silhouette coefficient is still only around 0.43, at just 4 clusters, which has a bad SSE. Beyond that it stays around 0.3.
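Combining the two steps, a sketch of the PCA-then-k-means experiment (synthetic data again; 245 is the component count measured on the real dataset):

```python
# Sketch: reduce to 245 components with PCA, then cluster and score.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 300))                 # stand-in for the real feature matrix

X_red = PCA(n_components=245).fit_transform(X)  # ~99% of variance in the real data

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_red)
score = silhouette_score(X_red, km.labels_)     # ~0.43 was observed on the real data
```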

What sources say about the silhouette coefficient:

| Silhouette coefficient | Interpretation |
| --- | --- |
| 0.71–1.00 | A strong structure has been found |
| 0.51–0.70 | A reasonable structure has been found |
| 0.26–0.50 | The structure is weak and could be artificial |
| < 0.25 | No substantial structure has been found |

This puts us in the "the structure is weak and could be artificial" band, meaning we should try a different clustering method, since k-means seems to score badly here.

@Ladas
Author

Ladas commented Oct 8, 2018

cc @durandom @tumido @MichaelClifford does this make sense to you? The silhouette coefficient suggests that k-means is probably not a good method in this case.

@durandom
Member

durandom commented Oct 8, 2018

Maybe this is also because of the input data we are using.
I've written up the proposed next steps here.

@durandom
Member

durandom commented Oct 8, 2018

Is this a measurement you could integrate into the metric_tracking package?
What's the input of the silhouette coefficient? Just the clusters? Could you add the notebook or code that produces this?

@Ladas
Author

Ladas commented Oct 8, 2018

@durandom yes, I plan to add it in https://github.com/RedHatInsights/aicoe-insights-clustering/pull/14/files#diff-d0301332bd6fef353ec35837646aa49e once it is merged. Also, I'll need to figure out how to run it as separate jobs; right now it takes hours to compute one day of data on 4 cores.

Using the rule data from 2018-09-05 as input

@durandom
Member

durandom commented Oct 8, 2018

> Also, I'll need to figure out how to run it as separate jobs, now it takes hours to compute for 1 day on 4 cores.

That's what the upshift environment is meant for. Create a build config and let it run there.

@tumido
Member

tumido commented Oct 8, 2018

@Ladas thank you for the metrics! It makes total sense and just proves that we all know, that we know nothing. 😄

There are probably multiple factors in effect:

  • data preprocessing (ironing out the structure too much)
  • applied clustering method which doesn't fit the use case
  • input data composition (extracting wrong markers, missing important attributes, etc.)

I think this is the kind of insight into the clustering Marcel was looking for and these graphs would make it easier to compare different solutions. I'd love to see them as a part of the metrics tracking thingy Marcel's team is working on. 👍

@Ladas
Author

Ladas commented Oct 9, 2018

FYI: it seems DBSCAN also performs badly; we'll need to munge the data before clustering.

[image]

[image]
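For reference, a minimal DBSCAN run with silhouette scoring might look like this. The `eps`/`min_samples` values and the synthetic data are illustrative, not the parameters used above; noise points are excluded before scoring, since silhouette is undefined for them:

```python
# Sketch: DBSCAN clustering, then silhouette score on the non-noise points.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=400, centers=3, cluster_std=0.6, random_state=0)

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

mask = labels != -1                       # -1 marks noise points
n_clusters = len(set(labels[mask]))
score = (silhouette_score(X[mask], labels[mask])
         if n_clusters > 1 else float("nan"))
```

Unlike k-means, DBSCAN picks the number of clusters itself, so the natural sweep here is over `eps` rather than over `k`.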

@durandom
Member

cc @TreeinRandomForest
