New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add elbow detection using the "kneedle" method to Elbow Visualizer #813
Conversation
@pswaldia Welcome and Thanks for opening a PR. We are wading through our backlog of issues and will get to your PR asap. |
@lwgray Sure. Thanks for the response. |
Hey there @pswaldia and thanks for jumping in to address #764! After taking a look through Full disclosure: My rationale for reimplementation over optional dependency is also partly in the interest of the long-term maintainability of our codebase. It will be much easier on our end to maintain the code if it's part of YB. Basically, from the maintainer perspective, fewer dependencies == fewer headaches down the line :D Let us know if you're game to try the reimplementation @pswaldia! |
@rebeccabilbro Thanks for the response. I am ready to reimplement the knee_locator functionality in yb. I'll be updating the pull request soon. |
This is so cool to see I'm not an expert on OSS licenses, but if you go the reimplementation route it might be best to include the license and copyright attribution? Only mentioning this based on what I've picked up on in my OS adventures. Thanks for the shoutout 👍 |
@arvkevi thanks for chiming in - I was going to reach out to you about this, but I'm super excited that you're onboard to have us include You've licensed
as follows: # yellowbrick.utils.kneed
# A port of the knee-point detection package, kneed.
#
# Author: Kevin Arvai
# Author: Pradeep Singh
# Created: Mon Apr 15 09:43:18 2019 -0400
#
# Copyright (C) 2017 Kevin Arvai
# All rights reserved.
# Redistribution and use in source and binary forms, with or without modification,
# are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this list
# of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice, this
# list of conditions and the following disclaimer in the documentation and/or other
# materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its contributors may
# be used to endorse or promote products derived from this software without specific
# prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
# ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS
# OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
# NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN
# IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# ID: kneed.py [] pswaldia@no-reply.github.com $
"""
This package contains a port of the knee-point detection package, kneed, by
Kevin Arvai and hosted at https://github.com/arvkevi/kneed. This port is maintained
with permission by the Yellowbrick contributors.
""" We will also link to you and say thank you in our documentation when we get to that point. Thoughts about the above proposal? |
@pswaldia thank you for taking on this challenge! I will be reviewing your PR when you push the kneedle code and we can work together to make sure this is implemented correctly. As noted above, I think the best way to port the code would be to include the Note that the two tests that rely on the Finally, to answer your question about the legend, I think that it is helpful, but with more information, e.g. remove the Adding the knee annotation should be an optional kwarg, wtih default=True so our users can turn it off if they don't want to use it. Let me know if you have any other questions! And thank you again so much for contributing to Yellowbrick! |
Added preample as well as code.
@bbengfort I have made some commits implementing changes as directed by you. Before moving to tests I would like to recieve feedback regarding the changes.
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(4,12)) #default knee=True
visualizer.fit(X) # Fit the data to the visualizer
visualizer.poof() # Draw/show/poof the data from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(4,12), knee=False) #setting knee=False
visualizer.fit(X) # Fit the data to the visualizer
visualizer.poof() # Draw/show/poof the data Thanks. |
@bbengfort this is great, I'm excited to see this feature hit master 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pswaldia thank you so much for your patience and hard work on this PR. Things are looking really good and I'm excited to include the port of the kneed
package!
I think I've addressed all of your questions in the PR in my comments below (very nice PR description, by the way, thank you for your detail) - but please let me know if I skipped anything.
I also took a look at the tests, and right now it seems that it's just image comparison failures right now. Depending on how you want to approach this, we can either:
- Set the
knee
/locate_elbow
parameter toFalse
by default, as a prototype feature, then merge this PR and work on the tests in another PR - Work on the tests/docs as we keep going in this PR.
It's up to you! Thank you again!
yellowbrick/cluster/elbow.py
Outdated
@@ -170,6 +171,9 @@ class KElbowVisualizer(ClusteringScoreVisualizer): | |||
Display the fitting time per k to evaluate the amount of time required | |||
to train the clustering model. | |||
|
|||
knee : bool, default=True | |||
Display the vertical line corresponding to the optimal value of k. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hehe, I do think it's funny to have a knee
parameter in an "elbow" visualizer. I really do want to keep calling this "knee", but I'm worried it might be a bit confusing particularly to students. May I propose the following:
locate_elbow : bool, default: True
Automatically find the "elbow" or "knee" which likely corresponds to the optimal
value of k using the "knee point detection algorithm". The knee point detection
algorithm finds the point of maximum curvature, which in a well-behaved clustering
problem also represents the pivot of the elbow curve. The point is labeled with a
dashed line and annotated with the score and k values.
Does that seem a bit more understandable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes , this is definitely a good name and a nice description too. I'll replace that with the suggested one.
Thanks.
yellowbrick/cluster/elbow.py
Outdated
@@ -219,6 +223,7 @@ def __init__(self, model, ax=None, k=10, | |||
# Store the arguments | |||
self.scoring_metric = KELBOW_SCOREMAP[metric] | |||
self.timings = timings | |||
self.knee=knee |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do change the name of the argument, this should also be updated to match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah! Sure.
yellowbrick/cluster/elbow.py
Outdated
@@ -247,6 +252,9 @@ def fit(self, X, y=None, **kwargs): | |||
|
|||
self.k_scores_ = [] | |||
self.k_timers_ = [] | |||
self.kneedle=None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the KneeLocator
store any state that we may need to keep around? If so, we should store it in a variable self._knee_locator
(just to make things obvious to other YB contributors). If not, there is probably no reason to keep this object on the visualizer, simply make it a variable local to fit()
which will be discarded when we're done fitting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bbengfort For the moment I don't think there's any need to include knee locator
object in the visualizer. We can make it local to fit()
.I also think the name kneedle
is not that intuitive , I will change that to knee_locator
or elbow_locator
as you wish.
Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, also after more consideration; the KneeLocator
object holds all the data, which is something we don't want to do generally, so I think it's good that we're not storing it on the Visualizer. I'm fine with either knee_locator
or elbow_locator
-- your choice!
yellowbrick/cluster/elbow.py
Outdated
@@ -247,6 +252,9 @@ def fit(self, X, y=None, **kwargs): | |||
|
|||
self.k_scores_ = [] | |||
self.k_timers_ = [] | |||
self.kneedle=None | |||
self.knee_value=None | |||
self.score=None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great, thank you so much for storing these on the visualizer! I think users will be interested in directly accessing them. I've got two requests:
-
We should make these "learned" attributes and suffix them with an underscore, e.g.
self.knee_value_
andself.score_
; also it is probably more informative if we name themself.elbow_value_
andself.elbow_score_
if that's ok -- see discussion above. -
We should document these in the docstring of the class. Also, it seems that
k_scores_
andk_timers_
are not documented, so would you mind also documenting them? See icdm.py Line 123 for an example of how to document the learned attributes.
Finally (more on this later), these properties should only exist iff self.locate_elbow==True
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bbengfort Thanks so much for your kind words.
- I was not aware of that naming convention of 'learned' attributes in yb but I found that in every visualizer class. I would change them as sugggested by you. Thanks.
- I will document those attributes. They are important for other contributors too. Thanks.
And as suggested it makes sense for these properties to exist iff self.locate_elbow==True
. I will make sure that happens.
One more question...I am not sure what do you mean by "learned" attributes , so will KneeLocator
object be considered as a "learned" attribute or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate it! The learned attributes are more a scikit-learn thing than a Yellowbrick thing. If you're interested in learning more, check out the sklearn developer guide. I guess they call them "estimated attributes" there.
A learned/estimated attribute is any data that is created when fit()
is called - e.g. the coef_
of a linear model. They only exist because data has been passed into the estimator. Because of this, the convention is to omit them from the class and only set them in fit()
-- this also is what we use to check if an estimator is fitted or not.
Because you're making the KneeLocator
a local variable, not storing it on the visualizer, it is not a learned attribute. However, elbow_value_
, elbow_score_
, k_scores_
, and k_timers_
are all learned attributes because they do not/cannot exist until fit()
.
yellowbrick/utils/kneed.py
Outdated
warnings.warn("No local maxima found in the distance curve\n" | ||
"The line is probably not polynomial, try plotting\n" | ||
"the distance curve with plt.plot(knee.xd, knee.yd)\n" | ||
"Also check that you aren't mistakenly setting the curve argument", RuntimeWarning) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, I think we may have to do something about this warning, otherwise, it could be very confusing to our users.
First, it is probably preferable to raise an exception here - that way we can catch the exception in the KElbowVisualizer
and either keep drawing without the detected elbow, issue an elbow-specific warning, or advise the user to set locate_elbow=False
. I think this warning will be issued given the convex/decreasing or concave/increasing issue that I mentioned above. So we may want to do things differently depending on if the user has the ability to control these params or something has gone wrong with what we expected for the metric.
If we do go with a warning, instead of a RuntimeWarning, could we please issue a YellowbrickWarning
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whew, this is very, very legacy... the warning started as a sanity check long ago.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are running into it because we can be a bit all over the place with convex/increasing or concave/decreasing depending on the metric we're using -- and if the clustering is terrible (e.g. bad features, wrong algorithm, no actual clusters) then things get really wild at that point. Any advice or thoughts you have would be welcome!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bbengfort I am trying to get familiar with the exceptions handling. Once done I'll accommodate these changes in a PR.
Thanks.
Hello @bbengfort I have made changes suggested by you. Please tell me what further changes need to be done.
Demo: from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(4,12)) #default knee=True, metric='distortion'
visualizer.fit(X) # Fit the data to the visualizer
visualizer.poof() # Draw/show/poof the data from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
model = KMeans()
visualizer = KElbowVisualizer(model,metric="calinski_harabaz", k=(4,12)) #default knee=True, metric=''
visualizer.fit(X) # Fit the data to the visualizer
visualizer.poof() # Draw/show/poof the data
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
model = KMeans()
visualizer = KElbowVisualizer(model,metric="calinski_harabaz", k=(4,12)) #default knee=True, metric=''
visualizer.fit(X) # Fit the data to the visualizer
visualizer.poof() # Draw/show/poof the data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great @pswaldia -- we're getting there! I had a few minor, non-blocking comments. But now it's on to tests! Feeling up for it?
Basically, my thoughts are that we need the current batch of tests to continue with locate_elbow=False
then we should have a second batch of tests that test all the metrics and do image similarity with locate_elbow=True
; does this sound like a reasonable testing strategy?
Additionally, we should also check the documentation to make sure it's updated correctly with the new visual features.
yellowbrick/cluster/elbow.py
Outdated
self.curve_direction = 'decreasing' | ||
elif self.metric=='silhouette' or self.metric=='calinski_harabaz': | ||
self.curve_nature = 'concave' | ||
self.curve_direction = 'increasing' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a big deal and won't block merging of this code, but I wanted to make a recommendation because I'm hoping you will be a regular contributor to Yellowbrick, and this is both a code style thing and an user-facing API thing.
I mentioned that learned attributes, suffixed with an underscore, are only set in fit()
(nice work with those attrs, by the way!) In YB/sklearn terms, the other type of attributes are hyperparameters - the properties that the user passes into __init__
- unlike other python programming, sklearn and YB treat these specially. All other attributes of the class should be methods, e.g. fit()
or they should be marked as private, e.g. prefixed with an _
.
Based on this, technically (again not a big deal) curve direction and nature should be self._curve_direction
and self._curve_nature
.
There is another little item here, though, and that's that the curve nature and direction should depend only on the metric. What happens if the user changes these values? (nothing in the case of your code) It could become confusing for the user.
Therefore I would propose that we simply keep these as local variables when instantiating the KneeLocator
. Simply removing the self.
is fine - but I also wanted to show you how you might see this in other YB code:
locator_kwargs = {
'distortion': {'curve_nature': 'convex', 'curve_direction': 'decreasing'},
'silhouette': {'curve_nature': 'concave', 'curve_direction': 'increasing'},
'calinski_harabaz': {'curve_nature': 'concave', 'curve_direction': 'increasing'},
}.get(self.metric, {})
elbow_locator = KneeLoator(self.k_values_, self.k_scores, **locator_kwargs)
This is a jump table pattern and we often use it over if/elif so that it's easy to make modifications on a per-metric basis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This jump table pattern is excellent. I always wondered what **kwargs
used for , today I learnt how they can be used. It's lot better over the if/else , looks tidy too.
Thank you!
yellowbrick/cluster/elbow.py
Outdated
elif self.metric=='silhouette' or self.metric=='calinski_harabaz': | ||
self.curve_nature = 'concave' | ||
self.curve_direction = 'increasing' | ||
self.elbow_locator = KneeLocator(self.k_values_,self.k_scores_,curve_nature=self.curve_nature,curve_direction=self.curve_direction) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we agreed that this would be a local variable and not stored on the class? I still think that's the right way to go.
|
||
elbow_score_ : float | ||
The silhouette score corresponding to the optimal value of k. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice, thank you!
warning_message=\ | ||
"No 'knee' or 'elbow' point detected, " \ | ||
"pass `locate_elbow=False` to remove the warning" | ||
warnings.warn(warning_message,YellowbrickWarning) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice, thank you for changing this warning into a much more understandable signal to the user!
yellowbrick/cluster/elbow.py
Outdated
|
||
self.ax.plot(self.k_values_, self.k_scores_, marker="D") | ||
if self.locate_elbow and self.elbow_value_!=None: | ||
elbow_label = "$elbow\ at\ k={}, score={}$".format(self.elbow_value_, np.round(self.elbow_score_,3)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, not a big deal in this case but my preference is "{:0.3f}".format(self.elbow_score_)
rather than np.round
- I think this might be more expected by other developers. It won't be a blocker for this PR though.
|
||
#set the legend if locate_elbow=True | ||
if self.locate_elbow and self.elbow_value_!=None: | ||
self.ax.legend(loc='best', fontsize='medium') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
----- | ||
The KneeLocator is implemented using the "knee point detection algorithm" which can be read at | ||
`<https://www1.icsi.berkeley.edu/~barath/papers/kneedle-simplex11.pdf>` | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect!
@bbengfort I am done with those minor yet important changes suggested. |
Thanks for doing that @pswaldia - again, it was no big deal at all, but I do appreciate you making the changes; you're ensuring that this PR is absolutely top notch! A couple of tips for docs and tests follow: Documentation
Let me know if there are any errors building the documentation or if the docs look weird. Tests
I suggest the following: For any tests that are currently failing, update them with |
@bbengfort As you suggested I followed the approach for tests , there were some failures related to image similarity, but I was able to complete the tests by updating the tests with
What I expected was I changed the documentation a bit so please have a look at it and suggest changes if any. I will be completing the tests today Thanks.. |
@pswaldia , I am facing the same issue that you mentioned here. This seems to be an environment issue I feel. Deactivating virtualenv and then running make command works fine. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pswaldia awesome, this looks great - and thank you for adding a reference to kneed
in the documentation. As for the build docs weirdness, I'll take a look and see how it builds on Read the Docs to see if we need to make any changes.
I'm going to go ahead and merge this in -- thank you so much for your contribution! But let's make sure we get those tests in for both kneed
and the locate_elbow
param in soon! Will you be able to work on that next? If not, we can always set the default locate_elbow=False
to give us more time (e.g. marking it as an experimental feature).
Thanks @bbengfort I will be working on writing tests. Though it will not be that fast this time as I will be busy with my exams. But I will open a pull request doing the required. Thanks again. It was really nice having this pull request merged. |
Nice work @pswaldia 😃 |
Thanks sir @arvkevi for your kind words. |
@pswaldia it is important for you to be able to focus on your exams, therefore I think there are two options here:
I'm fine with either, but option 1 requires a bit more git wrangling for you (e.g. pull from upstream develop, push into your fork, delete your old feature branch, and create a new feature branch). Let me know what you'd prefer! |
@bbengfort we need to complete the tests sooner or the later , let's go with option 1. I will open a new PR soon. |
@bbengfort there was a bug in It prompted a code refactor, but there are no changes to the API. The refactored code now reports the first knee and
I'm happy to open a new issue and/or PR and modify the yb source code myself. I see you are nearing a v1.0 release 🎉 so you may want to forge ahead and put this on the backburner. Either way, I wanted to bring it to your attention. |
Hi @arvkevi — thank you so much for reaching out! Your timing is perfect because yesterday while reviewing PR #891 I noticed that number being returned for I had planned on reaching out to you today to see if perhaps there was a bug in our port, so your timing couldn't be better. We greatly appreciate your help opening a PR to fix this; we are all pretty focused on our v1.0 release, but this is a great bug to squash before it goes out! |
I. Are you merging from a feature branch into develop?
Yes
II. Summarize your PR
This PR fixes #764 that aimed to include a feature to annotate the optimal value of 'k' using kneedle method described in https://github.com/arvkevi/kneed .
I have solved the issue in the following way:
kneed
library which can be added as an optional dependancy.III. Include a sample plot
In the above plot k=7 is the optimal K.
IV. List any TODOs or questions
Still to do:
Questions for the @DistrictDataLabs/team-oz-maintainers: