Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor dbscan() #248

Merged
merged 12 commits into from
Mar 24, 2023
Merged

Refactor dbscan() #248

merged 12 commits into from
Mar 24, 2023

Conversation

alyst
Copy link
Member

@alyst alyst commented Mar 20, 2023

We have two implementations of dbscan(): one is distance matrix-based, another is NNTree-based. As noted in #221, they have inconsistent APIs and return different type.
This PR consolidates the two implementations (based on NNTree one), fixes some behaviour differences and sets deprecations:

  • dbscan(dist, eps, minpts) is deprecated in favor of dbscan(dist, radius, metric=nothihg, min_neighbors=minpts)
  • dbscan() adds return_type= parameter, which is Vector{DbscanCluster} for dbscan(points, ...), but this behaviour is deprecated in favor of return_type=DbscanResult.
    dbscan() now always returns DbscanResult. This is a breaking change for NN-tree-based implementation, which returned Vector{DbscanCluster}.
  • fixes a bug in NNTree-based implementation -- the check that the cluster core is not empty was missing. See e.g. DBscan Adjacency Lists have repeated clusters #200
  • radius=eps=0 is allowed for both implementations
  • min_neighbors now includes the point itself (as in the original pseudocode and distance matrix-based impl)
  • metric allows specifying non-Euclidean metrics; when metric=nothing, it signifies that the matrix provided is the distance matrix.

as shown in wiki pseudocode and
implemented in distance-based
dbscan()
@alyst
Copy link
Member Author

alyst commented Mar 20, 2023

@ararslan Sorry for bothering again. I've tried to fix the inconsistencies of the two dbscan() implementations we have: they differ in return type and positional arguments. It required creating two deprecations to maintain backward compatibility.
But potentially there are other alternatives for the consolidated API and deprecation path. Let me know what you think.

@alyst alyst requested a review from ararslan March 20, 2023 22:01
@codecov-commenter
Copy link

codecov-commenter commented Mar 21, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: -0.11 ⚠️

Comparison is base (04b0705) 95.22% compared to head (7782248) 95.11%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #248      +/-   ##
==========================================
- Coverage   95.22%   95.11%   -0.11%     
==========================================
  Files          18       18              
  Lines        1339     1310      -29     
==========================================
- Hits         1275     1246      -29     
  Misses         64       64              
Impacted Files Coverage Δ
src/deprecate.jl 44.44% <ø> (ø)
src/dbscan.jl 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@ararslan
Copy link
Member

Sorry for bothering again

No worries! Apologies for the delay, I've been insanely busy lately and have fallen quite far behind on GitHub notifications...

It's been so long since I last looked at this package that I don't feel like I'm well suited to judge correctness. I do wonder whether an easier deprecation path would be one of these:

  • Deprecate the method that doesn't return a DbscanResult in favor of simply wrapping the call in DbscanResult, i.e.
    @deprecate dbscan(...) DbscanResult(dbscan(...))
  • Just change the return type of the method that doesn't return a DbscanResult without a deprecation and tag a breaking release. A description of user-facing changes can be added to the GitHub release so people can determine how to be compatible with the new version.

I feel like the latter might actually be easiest for both users and maintainers. Thoughts?

@alyst alyst force-pushed the ast/refactor_dbscan branch 2 times, most recently from 7782248 to 858dcd8 Compare March 23, 2023 22:46
@alyst
Copy link
Member Author

alyst commented Mar 23, 2023

I feel like the latter might actually be easiest for both users and maintainers. Thoughts?

Thanks for allowing me the breaking change :)
It is a much more straightforward path + NN-tree based dbscan() (the one that returned Vector{DbscanCluster}) gave incorrect results anyway,
so the breaking change will alert the users that they should expect very different clustering.

alyst and others added 11 commits March 23, 2023 20:17
counts has the same length as seeds
it just means exact match, nntree-based impl allows it
as in wiki pseudocode and nntree-based impl
instead of restricting to leafsize=
* merge the two dbscan() implementations
* deprecate dbscan(dist, radius) in favor of
  dbscan(dist, radius, metric=nothing)
* dbscan(points, ...) returns DbscanResult instead of
  Vector{DbscanCluster} (breaking change)
@alyst alyst merged commit 3d64e14 into master Mar 24, 2023
@alyst alyst deleted the ast/refactor_dbscan branch March 24, 2023 07:48
alyst added a commit that referenced this pull request Mar 24, 2023
update w.r.t. #248 changes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants