Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add elbow detection using the "kneedle" method to Elbow Visualizer #813

Merged
merged 17 commits into from Apr 22, 2019

Conversation

pswaldia
Copy link
Contributor

@pswaldia pswaldia commented Apr 12, 2019

I. Are you merging from a feature branch into develop?

Yes

II. Summarize your PR

This PR fixes #764 that aimed to include a feature to annotate the optimal value of 'k' using kneedle method described in https://github.com/arvkevi/kneed .

I have solved the issue in the following way:

  1. By using kneed library which can be added as an optional dependancy.
  2. By plotting a vertical line (black dashed) showing optimal K value.

III. Include a sample plot

plot

In the above plot k=7 is the optimal K.

IV. List any TODOs or questions

Still to do:

  • Upon getting feedback from the maintainers , it can be decided on whether it is to be added as an optional feature or regular feature.

Questions for the @DistrictDataLabs/team-oz-maintainers:

  • Is the legend required or not. ? It can be removed if not required.

@lwgray
Copy link
Contributor

lwgray commented Apr 12, 2019

@pswaldia Welcome and Thanks for opening a PR. We are wading through our backlog of issues and will get to your PR asap.

@pswaldia
Copy link
Contributor Author

pswaldia commented Apr 12, 2019

@lwgray Sure. Thanks for the response.

@rebeccabilbro
Copy link
Member

Hey there @pswaldia and thanks for jumping in to address #764!

After taking a look through kneed, I think this will be a very useful functionality to integrate into our KElbowVisualizer! So useful, in fact, that I am going to propose that instead of using kneed as an optional dependency, we instead reimplement the code in knee_locator.py and integrate it directly into our KElbowVisualizer (we should also make sure to give @arvkevi a big shout-out in our docs!). Since kneed's dependencies are all already dependencies of YB, and since the code required to generate the elbows is only ~100 lines (not including tests), I'm hoping this won't be too heavy of a lift.

Full disclosure: My rationale for reimplementation over optional dependency is also partly in the interest of the long-term maintainability of our codebase. It will be much easier on our end to maintain the code if it's part of YB. Basically, from the maintainer perspective, fewer dependencies == fewer headaches down the line :D

Let us know if you're game to try the reimplementation @pswaldia!

@pswaldia
Copy link
Contributor Author

pswaldia commented Apr 14, 2019

@rebeccabilbro Thanks for the response. I am ready to reimplement the knee_locator functionality in yb. I'll be updating the pull request soon.
Thanks.

@arvkevi
Copy link
Contributor

arvkevi commented Apr 14, 2019

This is so cool to see kneed even mentioned by yb 😄
The reimplementation rationale makes sense, case in point, I'm planning a release soon that will implement some significant changes (although they wouldn't affect this PR).

I'm not an expert on OSS licenses, but if you go the reimplementation route it might be best to include the license and copyright attribution? Only mentioning this based on what I've picked up on in my OS adventures.

Thanks for the shoutout 👍

@bbengfort
Copy link
Member

bbengfort commented Apr 15, 2019

@arvkevi thanks for chiming in - I was going to reach out to you about this, but I'm super excited that you're onboard to have us include knee_locater.py in our source. Let me know if you'd like to chat more directly over email or elsewhere but here is what I propose:

You've licensed kneed under a very permissive BSD-3 license. Yellowbrick is also licensed in a permissive fashion, using the Apache 2 license. Because our license is slightly more restrictive than yours, I believe that our license will cover yours in an acceptable fashion. Further to meet the obligations of your license we propose to add the following preamble to:

  • yellowbrick/utils/kneed.py
  • tests/test_utils/test_kneed.py

as follows:

# yellowbrick.utils.kneed
# A port of the knee-point detection package, kneed.
#
# Author:   Kevin Arvai
# Author:   Pradeep Singh 
# Created:  Mon Apr 15 09:43:18 2019 -0400
#
# Copyright (C) 2017 Kevin Arvai
# All rights reserved.
# Redistribution and use in source and binary forms, with or without modification, 
# are permitted provided that the following conditions are met:
# 
# 1. Redistributions of source code must retain the above copyright notice, this list
# of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice, this 
# list of conditions and the following disclaimer in the documentation and/or other 
# materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its contributors may 
# be used to endorse or promote products derived from this software without specific
# prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
# ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS 
# OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
# NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN
# IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# ID: kneed.py [] pswaldia@no-reply.github.com $

"""
This package contains a port of the knee-point detection package, kneed, by 
Kevin Arvai and hosted at https://github.com/arvkevi/kneed. This port is maintained 
with permission by the Yellowbrick contributors.
"""

We will also link to you and say thank you in our documentation when we get to that point. Thoughts about the above proposal?

@bbengfort
Copy link
Member

@pswaldia thank you for taking on this challenge! I will be reviewing your PR when you push the kneedle code and we can work together to make sure this is implemented correctly. As noted above, I think the best way to port the code would be to include the knee_locator.py file with the preamble above in yellowbrick/utils/kneed.py -- the file rename gives a stronger nod to @arvkevi. We can then place test_sample.py into tests/test_utils/test_kneed.py with the preamble modified for that file.

Note that the two tests that rely on the DataGenerator can probably be simplified as we don't have the same requirement to model the original paper. We can either simply move the noisy_gaussian function over, or simply omit those tests.

Finally, to answer your question about the legend, I think that it is helpful, but with more information, e.g. remove the score label and instead simply have a label that says optimal k=7, score=18232 or something like that.

Adding the knee annotation should be an optional kwarg, wtih default=True so our users can turn it off if they don't want to use it.

Let me know if you have any other questions! And thank you again so much for contributing to Yellowbrick!

@pswaldia
Copy link
Contributor Author

pswaldia commented Apr 15, 2019

@bbengfort I have made some commits implementing changes as directed by you. Before moving to tests I would like to recieve feedback regarding the changes.
Changes done:

  1. Added kneed.py to yellowbrick/utils/ .
  2. Added the knee annotation as an optional kwarg, wtih default=True so our users can turn it off if they don't want to use it. ( Please let me know if the name of the variable chosen is alright.)
  3. Changes been made to legend. ( I am not sure how to include independent texts to the legend without any symbols. Would like to recieve feedback on that.)
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

model = KMeans()

visualizer = KElbowVisualizer(model, k=(4,12)) #default knee=True

visualizer.fit(X)    # Fit the data to the visualizer
visualizer.poof()    # Draw/show/poof the data

plt2

from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

model = KMeans()

visualizer = KElbowVisualizer(model, k=(4,12), knee=False) #setting knee=False 

visualizer.fit(X)    # Fit the data to the visualizer
visualizer.poof()    # Draw/show/poof the data

plot1

Thanks.

@arvkevi
Copy link
Contributor

arvkevi commented Apr 15, 2019

@bbengfort this is great, I'm excited to see this feature hit master 😄

Copy link
Member

@bbengfort bbengfort left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pswaldia thank you so much for your patience and hard work on this PR. Things are looking really good and I'm excited to include the port of the kneed package!

I think I've addressed all of your questions in the PR in my comments below (very nice PR description, by the way, thank you for your detail) - but please let me know if I skipped anything.

I also took a look at the tests, and right now it seems that it's just image comparison failures right now. Depending on how you want to approach this, we can either:

  1. Set the knee/locate_elbow parameter to False by default, as a prototype feature, then merge this PR and work on the tests in another PR
  2. Work on the tests/docs as we keep going in this PR.

It's up to you! Thank you again!

@@ -170,6 +171,9 @@ class KElbowVisualizer(ClusteringScoreVisualizer):
Display the fitting time per k to evaluate the amount of time required
to train the clustering model.

knee : bool, default=True
Display the vertical line corresponding to the optimal value of k.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hehe, I do think it's funny to have a knee parameter in an "elbow" visualizer. I really do want to keep calling this "knee", but I'm worried it might be a bit confusing particularly to students. May I propose the following:

    locate_elbow : bool, default: True 
        Automatically find the "elbow" or "knee" which likely corresponds to the optimal 
        value of k using the "knee point detection algorithm". The knee point detection 
        algorithm finds the point of maximum curvature, which in a well-behaved clustering
        problem also represents the pivot of the elbow curve. The point is labeled with a
        dashed line and annotated with the score and k values.

Does that seem a bit more understandable?

Copy link
Contributor Author

@pswaldia pswaldia Apr 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes , this is definitely a good name and a nice description too. I'll replace that with the suggested one.
Thanks.

@@ -219,6 +223,7 @@ def __init__(self, model, ax=None, k=10,
# Store the arguments
self.scoring_metric = KELBOW_SCOREMAP[metric]
self.timings = timings
self.knee=knee
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do change the name of the argument, this should also be updated to match.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah! Sure.

@@ -247,6 +252,9 @@ def fit(self, X, y=None, **kwargs):

self.k_scores_ = []
self.k_timers_ = []
self.kneedle=None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the KneeLocator store any state that we may need to keep around? If so, we should store it in a variable self._knee_locator (just to make things obvious to other YB contributors). If not, there is probably no reason to keep this object on the visualizer, simply make it a variable local to fit() which will be discarded when we're done fitting.

Copy link
Contributor Author

@pswaldia pswaldia Apr 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bbengfort For the moment I don't think there's any need to include knee locator object in the visualizer. We can make it local to fit().I also think the name kneedle is not that intuitive , I will change that to knee_locator or elbow_locator as you wish.
Thanks

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, also after more consideration; the KneeLocator object holds all the data, which is something we don't want to do generally, so I think it's good that we're not storing it on the Visualizer. I'm fine with either knee_locator or elbow_locator -- your choice!

@@ -247,6 +252,9 @@ def fit(self, X, y=None, **kwargs):

self.k_scores_ = []
self.k_timers_ = []
self.kneedle=None
self.knee_value=None
self.score=None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thank you so much for storing these on the visualizer! I think users will be interested in directly accessing them. I've got two requests:

  1. We should make these "learned" attributes and suffix them with an underscore, e.g. self.knee_value_ and self.score_; also it is probably more informative if we name them self.elbow_value_ and self.elbow_score_ if that's ok -- see discussion above.

  2. We should document these in the docstring of the class. Also, it seems that k_scores_ and k_timers_ are not documented, so would you mind also documenting them? See icdm.py Line 123 for an example of how to document the learned attributes.

Finally (more on this later), these properties should only exist iff self.locate_elbow==True.

Copy link
Contributor Author

@pswaldia pswaldia Apr 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bbengfort Thanks so much for your kind words.

  1. I was not aware of that naming convention of 'learned' attributes in yb but I found that in every visualizer class. I would change them as sugggested by you. Thanks.
  2. I will document those attributes. They are important for other contributors too. Thanks.

And as suggested it makes sense for these properties to exist iff self.locate_elbow==True. I will make sure that happens.

One more question...I am not sure what do you mean by "learned" attributes , so will KneeLocator object be considered as a "learned" attribute or not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate it! The learned attributes are more a scikit-learn thing than a Yellowbrick thing. If you're interested in learning more, check out the sklearn developer guide. I guess they call them "estimated attributes" there.

A learned/estimated attribute is any data that is created when fit() is called - e.g. the coef_ of a linear model. They only exist because data has been passed into the estimator. Because of this, the convention is to omit them from the class and only set them in fit() -- this also is what we use to check if an estimator is fitted or not.

Because you're making the KneeLocator a local variable, not storing it on the visualizer, it is not a learned attribute. However, elbow_value_, elbow_score_, k_scores_, and k_timers_ are all learned attributes because they do not/cannot exist until fit().

yellowbrick/cluster/elbow.py Outdated Show resolved Hide resolved
warnings.warn("No local maxima found in the distance curve\n"
"The line is probably not polynomial, try plotting\n"
"the distance curve with plt.plot(knee.xd, knee.yd)\n"
"Also check that you aren't mistakenly setting the curve argument", RuntimeWarning)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, I think we may have to do something about this warning, otherwise, it could be very confusing to our users.

First, it is probably preferable to raise an exception here - that way we can catch the exception in the KElbowVisualizer and either keep drawing without the detected elbow, issue an elbow-specific warning, or advise the user to set locate_elbow=False. I think this warning will be issued given the convex/decreasing or concave/increasing issue that I mentioned above. So we may want to do things differently depending on if the user has the ability to control these params or something has gone wrong with what we expected for the metric.

If we do go with a warning, instead of a RuntimeWarning, could we please issue a YellowbrickWarning ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whew, this is very, very legacy... the warning started as a sanity check long ago.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are running into it because we can be a bit all over the place with convex/increasing or concave/decreasing depending on the metric we're using -- and if the clustering is terrible (e.g. bad features, wrong algorithm, no actual clusters) then things get really wild at that point. Any advice or thoughts you have would be welcome!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bbengfort I am trying to get familiar with the exceptions handling. Once done I'll accommodate these changes in a PR.
Thanks.

yellowbrick/utils/kneed.py Outdated Show resolved Hide resolved
yellowbrick/utils/kneed.py Show resolved Hide resolved
yellowbrick/utils/kneed.py Show resolved Hide resolved
yellowbrick/utils/kneed.py Show resolved Hide resolved
yellowbrick/utils/kneed.py Outdated Show resolved Hide resolved
@pswaldia
Copy link
Contributor Author

pswaldia commented Apr 18, 2019

Hello @bbengfort I have made changes suggested by you. Please tell me what further changes need to be done.

  • In elbow.py
  1. Learned attributes are documented.
  2. Parameters/Attributes names are changed to make them more understandanble.
  3. Set the curve_nature and curve directionaccording to the metric we have. It was pretty straighforward.
  4. Modified legend in mathLatEx style and sent the legend drawing to finalize().
  5. Warn the user and advise them to set locate_elbow=False when no knee point deteced.
  • In kneed.py
  1. Changed the docstring to match with yellowbrick style.
  2. Modified the warning message and used YellowBrickWarning instead of RunTimeWarning. Warning message for 2nd warning has not been changed , it will be made more meaningful after going through the entire implementation of kneed.

Demo:
1.

from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

model = KMeans()

visualizer = KElbowVisualizer(model, k=(4,12))  #default knee=True, metric='distortion'

visualizer.fit(X)    # Fit the data to the visualizer
visualizer.poof()    # Draw/show/poof the data

dist

from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

model = KMeans()

visualizer = KElbowVisualizer(model,metric="calinski_harabaz", k=(4,12))  #default knee=True, metric=''

visualizer.fit(X)    # Fit the data to the visualizer
visualizer.poof()    # Draw/show/poof the data

calinski

  1. yellowBrickWarning will be displayed when knee point is not detected. Curve will be drawn without elbow point and warning will advice user to pass locate_elbow=False to remove warning.Note: The below plot has a elbow point but just for demonstration purpose the knee_ value has been set to None.
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

model = KMeans()

visualizer = KElbowVisualizer(model,metric="calinski_harabaz", k=(4,12))  #default knee=True, metric=''

visualizer.fit(X)    # Fit the data to the visualizer
visualizer.poof()    # Draw/show/poof the data

warn

Copy link
Member

@bbengfort bbengfort left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @pswaldia -- we're getting there! I had a few minor, non-blocking comments. But now it's on to tests! Feeling up for it?

Basically, my thoughts are that we need the current batch of tests to continue with locate_elbow=False then we should have a second batch of tests that test all the metrics and do image similarity with locate_elbow=True; does this sound like a reasonable testing strategy?

Additionally, we should also check the documentation to make sure it's updated correctly with the new visual features.

self.curve_direction = 'decreasing'
elif self.metric=='silhouette' or self.metric=='calinski_harabaz':
self.curve_nature = 'concave'
self.curve_direction = 'increasing'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a big deal and won't block merging of this code, but I wanted to make a recommendation because I'm hoping you will be a regular contributor to Yellowbrick, and this is both a code style thing and an user-facing API thing.

I mentioned that learned attributes, suffixed with an underscore, are only set in fit() (nice work with those attrs, by the way!) In YB/sklearn terms, the other type of attributes are hyperparameters - the properties that the user passes into __init__ - unlike other python programming, sklearn and YB treat these specially. All other attributes of the class should be methods, e.g. fit() or they should be marked as private, e.g. prefixed with an _.

Based on this, technically (again not a big deal) curve direction and nature should be self._curve_direction and self._curve_nature.

There is another little item here, though, and that's that the curve nature and direction should depend only on the metric. What happens if the user changes these values? (nothing in the case of your code) It could become confusing for the user.

Therefore I would propose that we simply keep these as local variables when instantiating the KneeLocator. Simply removing the self. is fine - but I also wanted to show you how you might see this in other YB code:

locator_kwargs = {
     'distortion': {'curve_nature': 'convex', 'curve_direction': 'decreasing'},
     'silhouette': {'curve_nature': 'concave', 'curve_direction': 'increasing'},
     'calinski_harabaz': {'curve_nature': 'concave', 'curve_direction': 'increasing'},
}.get(self.metric, {})

elbow_locator = KneeLoator(self.k_values_, self.k_scores, **locator_kwargs)

This is a jump table pattern and we often use it over if/elif so that it's easy to make modifications on a per-metric basis.

Copy link
Contributor Author

@pswaldia pswaldia Apr 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This jump table pattern is excellent. I always wondered what **kwargs used for , today I learnt how they can be used. It's lot better over the if/else , looks tidy too.
Thank you!

elif self.metric=='silhouette' or self.metric=='calinski_harabaz':
self.curve_nature = 'concave'
self.curve_direction = 'increasing'
self.elbow_locator = KneeLocator(self.k_values_,self.k_scores_,curve_nature=self.curve_nature,curve_direction=self.curve_direction)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we agreed that this would be a local variable and not stored on the class? I still think that's the right way to go.


elbow_score_ : float
The silhouette score corresponding to the optimal value of k.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, thank you!

warning_message=\
"No 'knee' or 'elbow' point detected, " \
"pass `locate_elbow=False` to remove the warning"
warnings.warn(warning_message,YellowbrickWarning)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, thank you for changing this warning into a much more understandable signal to the user!


self.ax.plot(self.k_values_, self.k_scores_, marker="D")
if self.locate_elbow and self.elbow_value_!=None:
elbow_label = "$elbow\ at\ k={}, score={}$".format(self.elbow_value_, np.round(self.elbow_score_,3))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, not a big deal in this case but my preference is "{:0.3f}".format(self.elbow_score_) rather than np.round - I think this might be more expected by other developers. It won't be a blocker for this PR though.


#set the legend if locate_elbow=True
if self.locate_elbow and self.elbow_value_!=None:
self.ax.legend(loc='best', fontsize='medium')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

-----
The KneeLocator is implemented using the "knee point detection algorithm" which can be read at
`<https://www1.icsi.berkeley.edu/~barath/papers/kneedle-simplex11.pdf>`

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect!

@pswaldia
Copy link
Contributor Author

pswaldia commented Apr 18, 2019

@bbengfort I am done with those minor yet important changes suggested.
And yes that testing strategy seems good to me too. (Although I am not experienced writing tests, it will be first one for me, but i am really excited.). I will also look into the documentation part.
Thanks.

@bbengfort
Copy link
Member

Thanks for doing that @pswaldia - again, it was no big deal at all, but I do appreciate you making the changes; you're ensuring that this PR is absolutely top notch!

A couple of tips for docs and tests follow:

Documentation

  • Make sure you have installed the documentation requirements pip install -r docs/requirements.txt (preferably inside a virtual environment)
  • To build the docs, change into the docs directory and run make html, this should create a folder called _build/html inside of the docs directory. Note this only works in the docs directory! Let me know if there are any errors.
  • If you're on a Mac or a Linux machine you can open _build/html/index.html on the command line, otherwise, open a browser and go to File -> Open and select that file.
  • Navigate to Visualizers and API -> Clustering Visualizers -> Elbow Method
  • To edit the docs, edit the reStructuredText file at docs/api/cluster/elbow.rst

Let me know if there are any errors building the documentation or if the docs look weird.

Tests

  • Make sure you have the test requirements installed pip install -r tests/requirements.txt (preferably inside a virtual environment)
  • Run the tests with pytest ... this will take a while. Note that the tests must be run from the root project directory.
  • Run the elbow visualizer tests with pytest tests/test_cluster/test_elbow.py -- this is probably what you'll want to do most when testing.
  • You'll be writing/updating tests in TestElbowVisualizer, note that the method name must start with test_ and that the method name and the single line docstring are used in the test reporting.

I suggest the following:

For any tests that are currently failing, update them with locate_elbow=False; if those tests pass, let's merge this PR in and get to the tests and the updates to the docs in another PR. If those tests don't pass, then we'll have to sort something out.

@pswaldia
Copy link
Contributor Author

pswaldia commented Apr 19, 2019

@bbengfort As you suggested I followed the approach for tests , there were some failures related to image similarity, but I was able to complete the tests by updating the tests with locate_elbow=False. So at the end there were no issue with the tests regarding the KElbowVisualizer. So I think this PR can now be merged and we can get to further tests in another PR.

  • Regarding the documentation it was a smooth sail but I did noticed some weird things that did not match with the code written.
    Screenshot 2019-04-20 at 02 24 36

What I expected was locate_elbow : bool, default: True but it was not so. This was same with every visualizer not only KElbowVisualizer that we updated. Is it okay to have this bcoz in the official documentation available on net its fine. ?

I changed the documentation a bit so please have a look at it and suggest changes if any. I will be completing the tests today

Thanks..

@saurabhdaalia
Copy link

@bbengfort Regarding the documentation it was a smooth sail but I did noticed some wierd things that did not match with the code written.
Screenshot 2019-04-20 at 02 24 36

What I expected was locate_elbow : bool, default: True but it was not so. This was same with every visualizer not only KElbowVisualizer that we updated. Is it okay to have this bcoz in the official documentation available on net its fine. ?

I changed the documentation a bit so please have a look at it and suggest changes if any. I will be completing the tests today

Thanks..

@pswaldia , I am facing the same issue that you mentioned here. This seems to be an environment issue I feel. Deactivating virtualenv and then running make command works fine.

Copy link
Member

@bbengfort bbengfort left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pswaldia awesome, this looks great - and thank you for adding a reference to kneed in the documentation. As for the build docs weirdness, I'll take a look and see how it builds on Read the Docs to see if we need to make any changes.

I'm going to go ahead and merge this in -- thank you so much for your contribution! But let's make sure we get those tests in for both kneed and the locate_elbow param in soon! Will you be able to work on that next? If not, we can always set the default locate_elbow=False to give us more time (e.g. marking it as an experimental feature).

@bbengfort bbengfort merged commit 9a3587a into DistrictDataLabs:develop Apr 22, 2019
@pswaldia
Copy link
Contributor Author

pswaldia commented Apr 22, 2019

Thanks @bbengfort I will be working on writing tests. Though it will not be that fast this time as I will be busy with my exams. But I will open a pull request doing the required. Thanks again. It was really nice having this pull request merged.

@arvkevi
Copy link
Contributor

arvkevi commented Apr 22, 2019

Nice work @pswaldia 😃

@pswaldia
Copy link
Contributor Author

Thanks sir @arvkevi for your kind words.

@bbengfort
Copy link
Member

@pswaldia it is important for you to be able to focus on your exams, therefore I think there are two options here:

  1. If you create a PR - I can fill in the test stubs to make sure we don't lose the thread while you're taking your exam.
  2. We can set the default locate_elbow=False

I'm fine with either, but option 1 requires a bit more git wrangling for you (e.g. pull from upstream develop, push into your fork, delete your old feature branch, and create a new feature branch). Let me know what you'd prefer!

@pswaldia pswaldia deleted the feature branch April 22, 2019 19:24
@pswaldia
Copy link
Contributor Author

@bbengfort we need to complete the tests sooner or the later , let's go with option 1. I will open a new PR soon.

@arvkevi
Copy link
Contributor

arvkevi commented Jul 22, 2019

@bbengfort there was a bug in knee_locator.py.
The algorithm was reporting the last (most recent) knee point detected instead of the first. This caused the sensitivity parameter S to have an insignificant effect.

It prompted a code refactor, but there are no changes to the API. The refactored code now reports the first knee and S now has a significant impact on the knee point detected. See code snippet and plot in the README.
Also included in the 0.4+ release:

  1. removed an additional for loop
  2. improved variable naming
  3. KneeLocator saves all_knees
  4. more tests...

I'm happy to open a new issue and/or PR and modify the yb source code myself. I see you are nearing a v1.0 release 🎉 so you may want to forge ahead and put this on the backburner. Either way, I wanted to bring it to your attention.

@Kautumn06
Copy link
Contributor

Hi @arvkevi — thank you so much for reaching out! Your timing is perfect because yesterday while reviewing PR #891 I noticed that number being returned for k appeared to be one less than the optimal value. I tested the KElbowVisualizer again using make_blobs to create a synthetic dataset with 8 clusters and it also showed the optimal value as one less than what I was expecting:
make-blobs

I had planned on reaching out to you today to see if perhaps there was a bug in our port, so your timing couldn't be better. We greatly appreciate your help opening a PR to fix this; we are all pretty focused on our v1.0 release, but this is a great bug to squash before it goes out!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add elbow detection using the "kneedle" method to Elbow Visualizer
7 participants