HTML tokenizer #301
Comments
Does the recently added HTML char filter (#315), which lets you construct your own analyzer that strips out HTML, cover this? If so, can I close this issue?
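For illustration, a custom analyzer using such a char filter might be configured roughly as below. This is a sketch only: it assumes the filter is registered under the name `html_strip` (as in current Elasticsearch documentation), and the analyzer name `html_text` and the choice of tokenizer and token filter are hypothetical.

```python
import json

# Index settings for a custom analyzer that strips HTML before tokenizing.
# Assumption: the char filter is registered as "html_strip".
# "html_text" is a made-up analyzer name used only for this example.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "html_text": {
                    "type": "custom",
                    "char_filter": ["html_strip"],  # runs before the tokenizer
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                }
            }
        }
    }
}

# Print the JSON body that would be sent when creating the index.
print(json.dumps(index_settings, indent=2))
```

Because the char filter runs before tokenization, the tokenizer only ever sees the stripped text.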
Closing the issue; the HTML strip char filter has been added.
williamrandolph pushed a commit to williamrandolph/elasticsearch that referenced this issue Jun 4, 2020
mindw pushed a commit to mindw/elasticsearch that referenced this issue Sep 5, 2022
Remove vladimir Also add igor and fabien to gen students just in case Approved-by: Gideon Avida
This issue was closed.
Indexing fields that contain HTML is a frequent use case. I think it is important to have a built-in tokenizer that can handle HTML (i.e., remove tags and decode entities).
At the moment, in my client code I do the following:
This is sufficient to convert HTML to text suitable for indexing by the default analyzer; it doesn't need to do any more than this.
Any chance of getting this built in?
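The conversion described above (remove tags, decode entities) could be sketched on the client side roughly as follows. This is not the code from the original post; it is a minimal illustration using Python's standard `html.parser`, and the names `html_to_text` and `_TextExtractor` are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text content and discards tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities like &amp;
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Replace each tag with a space so adjacent blocks do not fuse.
        self._chunks.append(" ")

    def handle_endtag(self, tag):
        self._chunks.append(" ")

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace into single spaces.
        return " ".join("".join(self._chunks).split())


def html_to_text(html: str) -> str:
    """Return plain text with tags removed and entities decoded."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


print(html_to_text("<p>Fish &amp; chips</p><p>caf&eacute;</p>"))
```

Replacing every tag with a space is a simplification: it keeps words in neighbouring block elements apart, but it also splits a word that is interrupted by an inline tag. The sketch also does not drop the contents of `<script>` or `<style>` elements.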