Highly Duplicated Filters #5

Merged
merged 4 commits into from
Jun 1, 2023

Conversation

alvin319
Collaborator

Summary

  • Updated requirements.txt to include pytest and re-ordered package names
  • Added core functions for both histogram generation and frequency-based duplication filters
  • Added some simple test cases for the filter

Test Output

> pytest test_highly_duplicated_filter.py

================================================= test session starts ==================================================
platform linux -- Python 3.10.9, pytest-7.3.1, pluggy-1.0.0
rootdir: /home/alvin/research/semantic-memorization/filters
plugins: anyio-3.6.2
collected 3 items

test_highly_duplicated_filter.py ...                                                                             [100%]

================================================== 3 passed in 0.14s ===================================================

@alvin319 alvin319 requested a review from Kyle1668 May 29, 2023 03:02
@CLAassistant

CLAassistant commented May 29, 2023

CLA assistant check
All committers have signed the CLA.

Collaborator

@Kyle1668 Kyle1668 left a comment

Approved with suggestions. See the comments about also exposing occurrence counts, which can go in a future PR.

"""
return Counter(token_indices.apply(lambda x: _concat_token_indices(x, delimiter=delimiter)))

def get_highly_duplicated_filter_func(histogram: Counter[str, int], frequency_threshold: int = 1, delimiter: str = '_') -> Callable[[List[int]], bool]:
Collaborator

What is the motivation for having a function that returns another function?

Collaborator Author

This filter works a bit differently from the others because it requires a precomputation step: building the histogram of sequences. Ideally, we calculate that histogram only once.

When it comes to the final stage of running all filters and joining their results, we can keep a list of filter_funcs and run pandas apply over each one to generate its metric column. Returning a function also lets us bind different thresholds to produce variants of the filter, e.g., highly_duplicated_filter_freq_2 vs. highly_duplicated_filter_freq_5.
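
A minimal sketch of that pattern, not the PR's actual code: the histogram is built once, and each threshold yields its own filter function. The names make_filter and concat_token_indices are hypothetical, plain lists stand in for the pandas column, and the strict "greater than" comparison is an assumption about how the threshold is applied.

```python
from collections import Counter
from typing import Callable, Dict, List


def concat_token_indices(token_indices: List[int], delimiter: str = "_") -> str:
    """Join token indices into one hashable string key (hypothetical helper)."""
    return delimiter.join(str(t) for t in token_indices)


def make_filter(
    histogram: Counter,
    frequency_threshold: int = 1,
    delimiter: str = "_",
) -> Callable[[List[int]], bool]:
    """Return a filter bound to a precomputed histogram and a threshold."""

    def _filter(token_indices: List[int]) -> bool:
        key = concat_token_indices(token_indices, delimiter)
        # Assumption: "highly duplicated" means the count is strictly
        # above the threshold; the real filter may use >= instead.
        return histogram[key] > frequency_threshold

    return _filter


# Toy data standing in for a column of tokenized sequences.
sequences = [[1, 2, 3], [1, 2, 3], [4, 5], [1, 2, 3], [4, 5], [6]]

# The expensive step runs once and is shared by every variant below.
histogram = Counter(concat_token_indices(s) for s in sequences)

filter_funcs: Dict[str, Callable[[List[int]], bool]] = {
    "highly_duplicated_filter_freq_2": make_filter(histogram, frequency_threshold=2),
    "highly_duplicated_filter_freq_5": make_filter(histogram, frequency_threshold=5),
}

# One metric column per filter variant.
columns = {
    name: [func(s) for s in sequences] for name, func in filter_funcs.items()
}
```

With a pandas Series of token lists, the last step would presumably become one Series.apply call per entry in filter_funcs, each writing its own column.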

Returns:
Callable[[List[int]], bool]: Filter function that checks if a list of token indices is highly duplicated.
"""
def _highly_duplicated_filter_func(token_indices: List[int]) -> bool:
Collaborator

Ideally, we'll want both the number of occurrences for a given sequence and whether it's highly duplicated.

Collaborator Author

Yeah, I can add another filter function for generating the frequency as well.
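
One possible shape for that follow-up, sketched under assumptions: get_frequency_func is a hypothetical name that does not appear in this PR, and the key format simply mirrors the delimiter-joined strings used by the histogram.

```python
from collections import Counter
from typing import Callable, List


def get_frequency_func(
    histogram: Counter, delimiter: str = "_"
) -> Callable[[List[int]], int]:
    """Return a function that looks up how often a sequence occurs."""

    def _frequency(token_indices: List[int]) -> int:
        key = delimiter.join(str(t) for t in token_indices)
        # Counter returns 0 for keys it has never seen, so unseen
        # sequences need no special handling.
        return histogram[key]

    return _frequency


# Toy histogram keyed by delimiter-joined token indices.
histogram = Counter({"1_2_3": 3, "4_5": 1})
frequency = get_frequency_func(histogram)

seen_count = frequency([1, 2, 3])
unseen_count = frequency([9, 9])
```

Exposing the raw count alongside the boolean filter would let downstream analysis choose thresholds after the fact, without recomputing the histogram.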

@Kyle1668 Kyle1668 merged commit 9059bea into master Jun 1, 2023
@alvin319 alvin319 deleted the alvin/duplicated-filters branch July 21, 2023 06:10
3 participants