-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Highly Duplicated Filters #5
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved with suggestions. See comments about exposing occurrences as well for a future PR.
""" | ||
return Counter(token_indices.apply(lambda x: _concat_token_indices(x, delimiter=delimiter))) | ||
|
||
def get_highly_duplicated_filter_func(histogram: Counter[str, int], frequency_threshold: int = 1, delimiter: str = '_') -> Callable[[List[int]], bool]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the motivation for having a function that returns another function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This filter works a bit differently than others since it requires a pre-computation of calculating the histogram of sequences. Ideally, we only need to calculate the histogram once.
When it comes to the final stage of running all filters and joining them together, we could just have a list of filter_funcs
and iteratively run pd.apply
to generate each metric column, and having this function allows us to tune the threshold as different variants of the filters, e.g., highly_duplicated_filter_freq_2
vs. highly_duplicated_filter_freq_5
.
Returns: | ||
Callable[[List[int]], bool]: Filter function that checks if a list of token indices is highly duplicated. | ||
""" | ||
def _highly_duplicated_filter_func(token_indices: List[int]) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, we'll want both the number of occurrences for a given sequence and whether it's highly duplicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I can add another filter function for generating the frequency as well.
Summary
requirements.txt
to includepytest
and re-ordered package namesTest Output