Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add method to remove any entries from a .voc file from a string. #2987

Open
wants to merge 2 commits into
base: dev
Choose a base branch
from

Conversation

krisgesling
Copy link
Contributor

Description

The remove_voc method complements the existing vocab matching tools.

It is useful to clean utterances of common articles or other noise when parsing for specific information.

This method is not case sensitive unlike the other vocab matching methods. Is the case sensitivity intentional? Should we make all methods case insensitive?

Context: I was needing this type of method today, then saw the same need had been identified in #2986 and an existing method to do it already existed in OVOS

How to test

Unit tests included

Contributor license agreement signed?

  • CLA

@devops-mycroft devops-mycroft added the CLA: Yes Contributor License Agreement exists (see https://github.com/MycroftAI/contributors) label Aug 31, 2021
Useful to clean utterances of common articles or other noise when
parsing for specific information. This method is not case
sensitive.
@devops-mycroft
Copy link

devops-mycroft commented Aug 31, 2021

Voight Kampff Integration Test Succeeded (Results)

@krisgesling
Copy link
Contributor Author

I think I've missed something... where are the other 2 approaches?
If there are better options, that's great - let's do that!

I noticed the extraneous spaces thing too but didn't see a reason to remove them. In one way it helps to show that things have been removed, rather than operating on what is now probably at least grammatically incorrect, and potentially has even changed the meaning slightly. This is kind of silly but the best I can come up with at 5am...

cleaned_string = self.remove_voc("john likes the football", "football")
if "john the football" in cleaned_string:
    ...

On the flipside, I could see that extra characters may impact a confidence rating based on keyword frequency etc.

In terms of case sensitivity the docstring says "is not case sensitive" but maybe this should be worded as "is case insensitive" to match the flag.

@forslund
Copy link
Collaborator

forslund commented Sep 1, 2021

There were some discussion around the implementation when I tried to point out a bug in #2986. I couldn't find Ken's versions there now but I put up a gist with a collection of the ones I could remember: https://gist.github.com/forslund/c01636a18ddef0a9bef84dae609f4511

The results for 100000 runs of each:
str_replace 0.8078192858956754
for_regex 6.423368333023973
single_regex 0.6766295690322295
precalculated_single_regex 0.42126812401693314
list_join 0.4742607999360189
dont_preserve_order 0.4839391199639067

The str_replace() is the implementation chosen in above mentioned PR since it's pretty readable and pretty fast.

for_regex() is basically what is done here and it gets a lot of overhead from the regex compilation. The good thing about of the regex versions is that capitalization of the input string can be kept without extra logic keeping it simple.

list_join() is a really nifty and fast function if I got it right...To handle capitalization it can be changed to

' '.join(w for w in phrase.split() if w.lower() not in words)

The lowering makes it lose about 0.1 seconds in the test case but it's still one of the fastest of the versions above.

Edit: Ken provided the missing version and I have updated with the dont_preserve_order one. Not sure it's ideal since it will rearrange the sentence.

@krisgesling
Copy link
Contributor Author

Nice comparison - thanks!

I love the simplicity of the list_join comprehension. However I just did some quick testing and there's a few extra pitfalls.

list_join() won't handle multi-word vocab
and
str_replace() is case-sensitive which I think we should handle.
Eg:

remove_voc("Never Forget it", "forget it")  # Never

I've added a "help wanted" flag to see if anyone else has a good suggestion. You can find unit tests in this PR to validate it.
Note - I've just removed the spaces that were left in from the original implementation so the unit tests will currently fail.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA: Yes Contributor License Agreement exists (see https://github.com/MycroftAI/contributors) help wanted
Projects
No open projects
Status: Inbox
Development

Successfully merging this pull request may close these issues.

None yet

4 participants