Highly Duplicated Filters #5

alvin319 · 2023-05-29T03:02:17Z

Summary

Updated requirements.txt to include pytest and re-ordered package names
Added core functions for both histogram generation and frequency-based duplication filters
Added some simple test cases for the filter

Test Output

> pytest test_highly_duplicated_filter.py

================================================= test session starts ==================================================
platform linux -- Python 3.10.9, pytest-7.3.1, pluggy-1.0.0
rootdir: /home/alvin/research/semantic-memorization/filters
plugins: anyio-3.6.2
collected 3 items

test_highly_duplicated_filter.py ...                                                                             [100%]

================================================== 3 passed in 0.14s ===================================================

CLAassistant · 2023-05-29T03:02:22Z

All committers have signed the CLA.

Kyle1668

Approved with suggestions. See the comments about also exposing occurrence counts in a future PR.

Kyle1668 · 2023-06-01T20:28:05Z

filters/highly_duplicated_filter.py

+    """
+    return Counter(token_indices.apply(lambda x: _concat_token_indices(x, delimiter=delimiter)))
+
+def get_highly_duplicated_filter_func(histogram: Counter[str, int], frequency_threshold: int = 1, delimiter: str = '_') -> Callable[[List[int]], bool]:


What is the motivation for having a function that returns another function?

This filter works a bit differently from the others because it requires precomputing a histogram of sequences. Ideally, we only need to compute the histogram once.

In the final stage, when we run all the filters and join their results, we could keep a list of filter_funcs and call Series.apply with each one to generate its metric column. Returning a function also lets us tune the threshold to produce different variants of the filter, e.g., highly_duplicated_filter_freq_2 vs. highly_duplicated_filter_freq_5.
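As a rough illustration of the pattern described above, here is a minimal sketch, not the code merged in this PR. The signature of get_highly_duplicated_filter_func and the name _concat_token_indices come from the diff; their bodies, the helper name generate_histogram, the strictly-greater-than comparison, and the sample data are all assumptions. A plain list stands in for the pandas Series so the sketch runs with the standard library only.

```python
from collections import Counter
from typing import Callable, Iterable, List


def _concat_token_indices(token_indices: List[int], delimiter: str = "_") -> str:
    # Assumed body: join token indices into one hashable key, e.g. [1, 2] -> "1_2".
    return delimiter.join(str(t) for t in token_indices)


def generate_histogram(sequences: Iterable[List[int]], delimiter: str = "_") -> Counter:
    # Hypothetical helper name: the one-time pre-computation over all sequences.
    return Counter(_concat_token_indices(seq, delimiter) for seq in sequences)


def get_highly_duplicated_filter_func(
    histogram: Counter, frequency_threshold: int = 1, delimiter: str = "_"
) -> Callable[[List[int]], bool]:
    # The returned closure captures the histogram and the threshold, so the
    # histogram is built once and shared by every variant of the filter.
    def _highly_duplicated_filter_func(token_indices: List[int]) -> bool:
        key = _concat_token_indices(token_indices, delimiter)
        # Assumption: "highly duplicated" means strictly more occurrences
        # than the threshold.
        return histogram[key] > frequency_threshold

    return _highly_duplicated_filter_func


# Made-up sample data: "1_2_3" occurs 3 times, "4_5" twice, "6" once.
sequences = [[1, 2, 3], [1, 2, 3], [4, 5], [1, 2, 3], [4, 5], [6]]
histogram = generate_histogram(sequences)

# One histogram, several threshold variants, one metric column per variant.
filter_funcs = {
    "highly_duplicated_filter_freq_2": get_highly_duplicated_filter_func(histogram, frequency_threshold=2),
    "highly_duplicated_filter_freq_5": get_highly_duplicated_filter_func(histogram, frequency_threshold=5),
}
columns = {name: [func(seq) for seq in sequences] for name, func in filter_funcs.items()}
```

With a pandas Series of token lists, each list comprehension would presumably become a Series.apply call with the same function.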

Kyle1668 · 2023-06-01T20:34:16Z

filters/highly_duplicated_filter.py

+    Returns:
+        Callable[[List[int]], bool]: Filter function that checks if a list of token indices is highly duplicated.
+    """
+    def _highly_duplicated_filter_func(token_indices: List[int]) -> bool:


Ideally, we'll want both the number of occurrences for a given sequence and whether it's highly duplicated.

Yeah, I can add another filter function for generating the frequency as well.
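One possible shape for that follow-up, sketched under assumptions: the name get_sequence_frequency_func, its signature, and the sample histogram are hypothetical and do not appear in this PR. It mirrors the factory pattern of the boolean filter but returns the raw occurrence count.

```python
from collections import Counter
from typing import Callable, List


def get_sequence_frequency_func(
    histogram: Counter, delimiter: str = "_"
) -> Callable[[List[int]], int]:
    # Hypothetical companion to the boolean filter: same pre-computed
    # histogram, but the closure returns the count itself.
    def _sequence_frequency_func(token_indices: List[int]) -> int:
        key = delimiter.join(str(t) for t in token_indices)
        # A Counter returns 0 for keys it has never seen.
        return histogram[key]

    return _sequence_frequency_func


# Made-up histogram keyed by delimiter-joined token indices.
histogram = Counter({"1_2_3": 3, "4_5": 2})
frequency_of = get_sequence_frequency_func(histogram)
counts = [frequency_of(seq) for seq in ([1, 2, 3], [4, 5], [9])]
```

Because both functions read from the same histogram, the count column and the boolean column could be generated in the same pass without recomputing anything.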

alvin319 added 4 commits May 28, 2023 19:44

add filters (0aaa982)
update reqs (188e5ab)
comments (cb2d6f9)
update func names (f7f9e94)

alvin319 requested a review from Kyle1668 May 29, 2023 03:02

Kyle1668 approved these changes Jun 1, 2023

Kyle1668 merged commit 9059bea into master Jun 1, 2023

alvin319 deleted the alvin/duplicated-filters branch July 21, 2023 06:10
