SilhouetteVisualizer
add support for more estimators
#1294
…ilhouette_score()` and `silhouette_samples()`
Codecov Report
@@            Coverage Diff             @@
##           develop    #1294      +/-   ##
===========================================
- Coverage    90.89%   90.70%   -0.19%
===========================================
  Files           93       93
  Lines         5303     5327      +24
===========================================
+ Hits          4820     4832      +12
- Misses         483      495      +12
…` from the labels.
@stergion thank you so much for your interest in Yellowbrick and for opening this PR; we really appreciate all contributions to Yellowbrick. We'll find a reviewer for this PR as soon as possible so that we can include it in our next release.
This is a great PR. I like how you simply added support for more estimators. The only thing left is to write some tests. Let me know if you need help with that. After that I will approve.
yellowbrick/cluster/silhouette.py
Outdated
if check_fitted(self.estimator, is_fitted_by=self.is_fitted) and hasattr(self.estimator, "predict"):
    labels = self.estimator.predict(X)
else:  # if estimator is NOT fitted, OR estimator does NOT implement predict()
    labels = self.estimator.fit_predict(X, y, **kwargs)
Great 👍 way to cover fit_predict here.
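For readers following the thread, the fallback in the diff above can be sketched in isolation. This is a minimal illustration, not the Yellowbrick implementation: `resolve_labels`, `PredictingClusterer`, and `FitPredictOnlyClusterer` are hypothetical names, and the stub classes only imitate the estimator interface so the snippet runs without scikit-learn.

```python
class PredictingClusterer:
    """Hypothetical stand-in for an estimator that offers fit() and predict()."""

    def __init__(self):
        self.fit_calls = 0

    def fit(self, X, y=None):
        self.fit_calls += 1
        return self

    def predict(self, X):
        # Toy rule: values below 5 go to cluster 0, the rest to cluster 1.
        return [0 if x < 5 else 1 for x in X]

    def fit_predict(self, X, y=None):
        return self.fit(X).predict(X)


class FitPredictOnlyClusterer:
    """Hypothetical stand-in for an estimator without predict()."""

    def fit_predict(self, X, y=None):
        return [0 if x < 5 else 1 for x in X]


def resolve_labels(estimator, X, y=None, is_fitted=False):
    """Use predict() only when the estimator is fitted and has it."""
    if is_fitted and hasattr(estimator, "predict"):
        return estimator.predict(X)
    # Not fitted, or no predict(): fit and label in one step.
    return estimator.fit_predict(X, y)


X = [1, 2, 8, 9]
already_fitted = PredictingClusterer().fit(X)
labels_a = resolve_labels(already_fitted, X, is_fitted=True)
labels_b = resolve_labels(FitPredictOnlyClusterer(), X)
```

In this sketch an already fitted estimator is not refit, while an estimator lacking `predict()` always goes through `fit_predict()`.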
@lwgray Thanks for the comments and the review. Sorry I took so long to reply. If you could help me with the tests, that would be great, since I don't have any experience with that.
I went back and reviewed this again and realized that all but two clustering algorithms work immediately with this fix. I am unsure why I get the ValueError with SpectralClustering. FeatureAgglomeration raises an AttributeError because it doesn't have a fit_predict method. See table below:
A couple of questions...
@lwgray I think it is simple enough to add a test for all the clusterers using pytest parametrize -- we do that for a lot of tests, and I agree it should be part of this PR.
We do need to fix SpectralClustering since the bug was introduced in this PR. FeatureAgglomeration is a transformer, not a model -- so I think we can omit that from the tests.
yellowbrick/cluster/silhouette.py
Outdated
elif hasattr(self.estimator, "affinity"):
    metric = self.estimator.affinity
@lwgray this is where the error is occurring for SpectralClustering. SpectralClustering does have an attribute `affinity`, which is used to compute the adjacency matrix between instances. For spectral clustering the attribute `affinity` is not a distance metric and defaults to "rbf", which is why that test is failing.
@stergion what model prompted you to add this metric selector?
It was for AffinityPropagation, AgglomerativeClustering, and FeatureAgglomeration.
Since sklearn version 1.2, `affinity` is deprecated in AgglomerativeClustering and FeatureAgglomeration, and `metric` is used instead, like the other clustering algorithms.
AffinityPropagation was not updated; it still uses `affinity`. However, AffinityPropagation uses the negative squared Euclidean distance between points when `affinity='euclidean'`.
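One way to reconcile the two comments above is to accept an estimator's `metric` or `affinity` only when the value names a distance metric, and otherwise fall back to "euclidean", which is the default metric of `silhouette_score()`. The sketch below is an assumption about how such a guard could look, not the condition that was merged in this PR: `pick_silhouette_metric` and `KNOWN_DISTANCE_METRICS` are hypothetical names, the metric set is an illustrative subset, and the `SimpleNamespace` objects only mimic estimator attributes.

```python
from types import SimpleNamespace

# Illustrative, non-exhaustive subset of distance metric names (assumption).
KNOWN_DISTANCE_METRICS = {"euclidean", "manhattan", "cityblock", "cosine", "l1", "l2"}


def pick_silhouette_metric(estimator, default="euclidean"):
    """Prefer `metric`, then `affinity`, but only if it names a distance metric."""
    for attr in ("metric", "affinity"):
        value = getattr(estimator, attr, None)
        if isinstance(value, str) and value in KNOWN_DISTANCE_METRICS:
            return value
    # Kernel names such as "rbf", None, or placeholder strings end up here.
    return default


# Attribute-only stand-ins for the estimators discussed in the thread.
spectral_like = SimpleNamespace(affinity="rbf")
agglomerative_like = SimpleNamespace(metric="manhattan", affinity="deprecated")
affinity_propagation_like = SimpleNamespace(affinity="euclidean")
kmeans_like = SimpleNamespace()

chosen = {
    "spectral": pick_silhouette_metric(spectral_like),
    "agglomerative": pick_silhouette_metric(agglomerative_like),
    "affinity_propagation": pick_silhouette_metric(affinity_propagation_like),
    "kmeans": pick_silhouette_metric(kmeans_like),
}
```

With this guard, an `affinity` of "rbf" is never forwarded as a silhouette metric, while an explicit `metric` on the estimator still takes precedence over `affinity`.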
…d are implementing fit_predict(). Added condition to make sure Spectral Clustering metric is not being set to
@stergion @bbengfort Can you review my changes?
Co-authored-by: stergion <35434161+stergion@users.noreply.github.com> Signed-off-by: Benjamin Bengfort <benjamin@bengfort.com>
@lwgray RE: the conda tests; in a separate PR we're going to have to update our Python versions for testing. The Conda 3.8 and 3.9 test failures can be ignored.
Looks good to me. Your changes also made the code cleaner and easier to understand.
@stergion thank you so much for your contribution to Yellowbrick!
This PR closes #1182.
… `predict()` method,
… `n_clusters` … infer clusters from labels,
… `metric` or `affinity` as the silhouette metric.
I decided not to implement special handling for estimators that produce outlier values, e.g. DBSCAN, as sklearn doesn't either in their examples.
I wasn't sure whether to use the estimator's metric attribute for the silhouette metric or to add a `metric` parameter to the `SilhouetteVisualizer` constructor. I chose the first option because it didn't alter the class signature and seemed like the safer option, although I do believe the second option to be the better one.
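The idea of inferring the number of clusters from the labels can be sketched as follows. This is a hedged illustration under stated assumptions, not the code from the PR: `infer_n_clusters` is a hypothetical helper, and the `SimpleNamespace` objects stand in for estimators with and without an `n_clusters` attribute. Because no special handling is applied to outliers, the noise label `-1` that DBSCAN assigns is counted as its own group here.

```python
from types import SimpleNamespace


def infer_n_clusters(estimator, labels):
    """Use the estimator's n_clusters if set, else count distinct labels."""
    n_clusters = getattr(estimator, "n_clusters", None)
    if n_clusters is not None:
        return n_clusters
    # No usable attribute: every distinct label counts, including -1.
    return len(set(labels))


with_attribute = SimpleNamespace(n_clusters=3)
without_attribute = SimpleNamespace()

count_a = infer_n_clusters(with_attribute, [0, 1, 2, 0])
count_b = infer_n_clusters(without_attribute, [0, 0, 1, -1])
```

Whether the noise label should instead be excluded from the count is exactly the kind of special handling the description says was left out.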