Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add keep_types for filtering by token type #7120

Closed
wants to merge 4 commits into from

Conversation

rmuir
Copy link
Contributor

@rmuir rmuir commented Aug 1, 2014

For example, you can use this combined with UAXURLTokenizer to extract urls from text.

This patch works, but I am unsure about the naming of the filter and the parameters.
In fact this filter in lucene supports two modes "stop" and "keep".

@jpountz
Copy link
Contributor

jpountz commented Aug 1, 2014

LGTM

@dweinstein
Copy link

👍

@dweinstein
Copy link

Will this work for custom patterns? I thought it was great that there existed a UAXURLTokenizer but what if one isn't so lucky in the future and must write one using a custom pattern. What would be the type of those tokens?

@rmuir
Copy link
Contributor Author

rmuir commented Aug 1, 2014

Well this is related to token type, which is like a "tag" for the token that is produced by the analysis chain. Typically the lucene tokenizers provide tags if they recognize different types of tokens, but there is nothing limiting it to that. For example it could contain part-of-speech or whatever is useful.

This filter just filters by tag, it doesnt do tagging itself. It couldn't meet all the possible use cases for token types :)

So to tag by "pattern", we would just need a filter that does that. Its separate from what action to do with the actual tags...

@rmuir rmuir added the review label Aug 1, 2014
@rmuir
Copy link
Contributor Author

rmuir commented Aug 1, 2014

Adding review just for another opinion on the API before pushing it.

@mikemccand
Copy link
Contributor

LGTM, except minor naming nit: it's called "keep_types" publicly (I like this name) but in the code it's KeepType (without the s).

@@ -0,0 +1,37 @@
[[analysis-keep-words-tokenfilter]]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the id of the section should be

[[analysis-keep-types-tokenfilter]]

@rmuir
Copy link
Contributor Author

rmuir commented Aug 15, 2014

Thanks @clintongormley and @mikemccand

I updated the PR.

@mikemccand
Copy link
Contributor

LGTM

@mikemccand
Copy link
Contributor

LGTM, thanks for the rename!

@rmuir rmuir changed the title Add keep_types for filtering by token type Analysis: Add keep_types for filtering by token type Aug 15, 2014
@rmuir rmuir closed this Aug 15, 2014
@jpountz jpountz removed the review label Aug 18, 2014
@clintongormley clintongormley added the :Search/Analysis How text is split into tokens label Jun 6, 2015
@clintongormley clintongormley changed the title Analysis: Add keep_types for filtering by token type Add keep_types for filtering by token type Jun 6, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants