Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

geomask.match(): check for duplicates #811

Open
araichoor opened this issue Jan 3, 2024 · 2 comments
Open

geomask.match(): check for duplicates #811

araichoor opened this issue Jan 3, 2024 · 2 comments

Comments

@araichoor
Copy link

the desitarget.geomask.match() function only works if there are no duplicates in the input arguments.
this is correctly written in the docstr; however, not everyone carefully reads a docstr :)

if I m correct, with the current version, when the arguments contain duplicates, the code just proceeds, but the outputs are not meaningful; so it s likely the user will be hurt by that.

that s why it would be better/safer to add a check that none of the arguments have duplicates; and just return an errror if they do.
something like:

if np.unique(A).size != len(A):
    msg = "A contains duplicates"
    log.error(msg)
    raise ValueError(msg)
if np.unique(B).size != len(B):
    msg = "B contains duplicates"
    log.error(msg)
    raise ValueError(msg)

also, with writing the code snippet above, I m wondering if A and B should only be 1d-array...

@araichoor
Copy link
Author

@geordie666, another suggestion for this one:

actually, maybe:

  • add an argument to say if the check for duplicates should be done or not;
  • set the default to True;
    e.g.: match(A, B, check_for_dups=True); and also for the "twin" function: match_to(A, B, check_for_dups=True).

would that make sense?

the motivation for that is that this check can take few seconds for 1e8 arrays, and if one is really looking for speed -- and already knows that the inputs don t have duplicates, this could be a "loss" of time.

[context: I ve a in-development fiberassign update using match(), to speed up the code when running at once several hundreds of tiles with several tens of millions of targets]

@geordie666
Copy link
Contributor

Ah, yes, sorry, this fell off my plate for a while.

Would it be better to set the default to check_for_dups=False for strict backward-compatibility? It's likely people use this function in other places where speed is important.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants