New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add keep_types
for filtering by token type
#7120
Conversation
LGTM |
👍 |
Will this work for custom patterns? I thought it was great that there existed a UAXURLTokenizer but what if one isn't so lucky in the future and must write one using a custom pattern. What would be the type of those tokens? |
Well this is related to token type, which is like a "tag" for the token that is produced by the analysis chain. Typically the lucene tokenizers provide tags if they recognize different types of tokens, but there is nothing limiting it to that. For example it could contain part-of-speech or whatever is useful. This filter just filters by tag, it doesnt do tagging itself. It couldn't meet all the possible use cases for token types :) So to tag by "pattern", we would just need a filter that does that. Its separate from what action to do with the actual tags... |
Adding review just for another opinion on the API before pushing it. |
LGTM, except minor naming nit: it's called "keep_types" publicly (I like this name) but in the code it's KeepType (without the s). |
@@ -0,0 +1,37 @@ | |||
[[analysis-keep-words-tokenfilter]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the id of the section should be
[[analysis-keep-types-tokenfilter]]
Thanks @clintongormley and @mikemccand I updated the PR. |
LGTM |
LGTM, thanks for the rename! |
keep_types
for filtering by token type
For example, you can use this combined with UAXURLTokenizer to extract urls from text.
This patch works, but I am unsure about the naming of the filter and the parameters.
In fact this filter in lucene supports two modes "stop" and "keep".