Add method to remove any entries from a .voc file from a string. #2987

krisgesling · 2021-08-31T11:03:36Z

Description

The remove_voc method complements the existing vocab matching tools.

It is useful to clean utterances of common articles or other noise when parsing for specific information.

This method is not case sensitive unlike the other vocab matching methods. Is the case sensitivity intentional? Should we make all methods case insensitive?

Context: I was needing this type of method today, then saw the same need had been identified in #2986 and an existing method to do it already existed in OVOS

How to test

Unit tests included

Contributor license agreement signed?

CLA

Useful to clean utterances of common articles or other noise when parsing for specific information. This method is not case sensitive.

devops-mycroft · 2021-08-31T12:19:07Z

Voight Kampff Integration Test Succeeded (Results)

krisgesling · 2021-08-31T19:53:52Z

I think I've missed something... where are the other 2 approaches?
If there are better options, that's great - let's do that!

I noticed the extraneous spaces thing too but didn't see a reason to remove them. In one way it helps to show that things have been removed, rather than operating on what is now probably at least grammatically incorrect, and potentially has even changed the meaning slightly. This is kind of silly but the best I can come up with at 5am...

cleaned_string = self.remove_voc("john likes the football", "football")
if "john the football" in cleaned_string:
    ...

On the flipside, I could see that extra characters may impact a confidence rating based on keyword frequency etc.

In terms of case sensitivity the docstring says "is not case sensitive" but maybe this should be worded as "is case insensitive" to match the flag.

forslund · 2021-09-01T19:28:56Z

There were some discussion around the implementation when I tried to point out a bug in #2986. I couldn't find Ken's versions there now but I put up a gist with a collection of the ones I could remember: https://gist.github.com/forslund/c01636a18ddef0a9bef84dae609f4511

The results for 100000 runs of each:
str_replace 0.8078192858956754
for_regex 6.423368333023973
single_regex 0.6766295690322295
precalculated_single_regex 0.42126812401693314
list_join 0.4742607999360189
dont_preserve_order 0.4839391199639067

The str_replace() is the implementation chosen in above mentioned PR since it's pretty readable and pretty fast.

for_regex() is basically what is done here and it gets a lot of overhead from the regex compilation. The good thing about of the regex versions is that capitalization of the input string can be kept without extra logic keeping it simple.

list_join() is a really nifty and fast function if I got it right...To handle capitalization it can be changed to

' '.join(w for w in phrase.split() if w.lower() not in words)

The lowering makes it lose about 0.1 seconds in the test case but it's still one of the fastest of the versions above.

Edit: Ken provided the missing version and I have updated with the dont_preserve_order one. Not sure it's ideal since it will rearrange the sentence.

krisgesling · 2021-09-13T06:04:54Z

Nice comparison - thanks!

I love the simplicity of the list_join comprehension. However I just did some quick testing and there's a few extra pitfalls.

list_join() won't handle multi-word vocab
and
str_replace() is case-sensitive which I think we should handle.
Eg:

remove_voc("Never Forget it", "forget it")  # Never

I've added a "help wanted" flag to see if anyone else has a good suggestion. You can find unit tests in this PR to validate it.
Note - I've just removed the spaces that were left in from the original implementation so the unit tests will currently fail.

krisgesling requested a review from ken-mycroft August 31, 2021 11:03

devops-mycroft added the CLA: Yes Contributor License Agreement exists (see https://github.com/MycroftAI/contributors) label Aug 31, 2021

Add method to remove any entries from a .voc file from a string.

25156b8

Useful to clean utterances of common articles or other noise when parsing for specific information. This method is not case sensitive.

krisgesling force-pushed the feature/remove-voc branch from 8d27785 to 25156b8 Compare August 31, 2021 11:50

krisgesling added the help wanted label Sep 13, 2021

remove undesirable spaces from tests

96ef673

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add method to remove any entries from a .voc file from a string. #2987

Add method to remove any entries from a .voc file from a string. #2987

krisgesling commented Aug 31, 2021

devops-mycroft commented Aug 31, 2021 •

edited

krisgesling commented Aug 31, 2021

forslund commented Sep 1, 2021 •

edited

krisgesling commented Sep 13, 2021

Add method to remove any entries from a .voc file from a string. #2987

Are you sure you want to change the base?

Add method to remove any entries from a .voc file from a string. #2987

Conversation

krisgesling commented Aug 31, 2021

Description

How to test

Contributor license agreement signed?

devops-mycroft commented Aug 31, 2021 • edited

krisgesling commented Aug 31, 2021

forslund commented Sep 1, 2021 • edited

krisgesling commented Sep 13, 2021

devops-mycroft commented Aug 31, 2021 •

edited

forslund commented Sep 1, 2021 •

edited