Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clustering: User defined distances and keyers #6612

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

zyadtaha
Copy link
Contributor

@zyadtaha zyadtaha commented May 18, 2024

The first pr in the GSoC project #4301

Changes in this pull request:

  • Added UserDefinedKeyer class + unit tests for it
  • Added UserDefinedDistance class + unit tests for it

@zyadtaha zyadtaha requested a review from wetneb May 18, 2024 03:20
Copy link
Sponsor Member

@wetneb wetneb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks really great!
The next step would be to register the keyer and distance so that they can be used in the compute-clusters HTTP command. This registration is done in the controller.js file (main/webapp/modules/core/MOD-INF/controller.js).
One problem that you will encounter is how to pass the clustering expression to the keyer. You can have a look at how it's currently done for the ngram clusterer for instance: it's quite ad-hoc and not extensible. It's worth thinking about ways to improve that, or keep using the same strategy if we think it's still the best approach.

Copy link
Member

@tfmorris tfmorris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was it intentional that this branch was created in the organization's repository? I think it would be preferable to do the development on a private fork of the repo and submit the PR from there, but I don't know if @wetneb told you to do it this way for some reason.

It would also be useful to reference the issue that this addresses so that the PR and the issue are linked together.

@zyadtaha
Copy link
Contributor Author

@tfmorris no, it wasn't intentional. Sorry for that mistake. I will work on my private fork in the future PRs.

@wetneb
Copy link
Sponsor Member

wetneb commented May 21, 2024

The failure is likely due to the fact that I have just merged #6574, which means that you'll need to explicitly register GREL in your test (see that PR for examples)

@zyadtaha
Copy link
Contributor Author

Thank you so much. You've saved me time.

@zyadtaha
Copy link
Contributor Author

I thought about the idea of passing the clustering expression to the keyer like how it's currently done for the n-gram clusterer but this will need adapting to:

  • BinningClusterer and kNNClusterer (add the expression to the parameter field of the classes and when initializing the visitor use it)
  • UserDefinedKeyer and UserDefinedDistance (not passing expression through constructor but the keying/distance function)

So I have decided to only check if there is an expression in the request. If yes, then pass it to the constructors of the Keyer or Distance. The changes are only made in ComputeClustersCommand

I also tested passing expressions in the request manually and it works fine. Should we have a unit test for it ?

@wetneb
Copy link
Sponsor Member

wetneb commented May 23, 2024

I think the approach you used to configure the keyer / distance is fine. Our existing architecture does not make it possible to register configurable keyers or distances, so some workaround like you used is necessary.

It would be good to clean this up in the future, but I would say this can be part of a separate refactoring, especially since it will impact other clustering methods (such as ngrams).

@wetneb
Copy link
Sponsor Member

wetneb commented May 24, 2024

I'm changing the title of this pull request to make its topic clearer (there is nothing in it that indicates that it's about clustering so far).

@wetneb wetneb changed the title Added new backend classes + unit tests clustering: User defined distances and keyers May 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants