HTML tokenizer #301

Closed
clintongormley opened this issue Aug 8, 2010 · 2 comments

The need to index fields containing HTML is a frequent use case. I think it is important to have a built-in tokenizer that can handle HTML (i.e., remove tags and decode entities).

At the moment, in my client code I do the following:

use HTML::Entities;        # exports decode_entities()

$value =~ s/<[^>]+>/ /g;   # replace each <.....> tag with a single space
$value =~ s/\s+/ /g;       # collapse runs of whitespace into a single space
$value =~ s/^ //;          # trim leading whitespace
$value =~ s/ $//;          # trim trailing whitespace
decode_entities($value);   # decode all HTML entities in place to the equivalent Unicode characters

This is sufficient to convert HTML to text suitable for indexing by the default analyzer; it doesn't need to do any more than this.

Any chance of getting this built in?

kimchy commented Aug 18, 2010

Does the recently added HTML char filter (#315), which lets you construct your own analyzer that strips out HTML, cover this? If so, can I close this issue?

kimchy commented Sep 21, 2010

Closing the issue; the HTML strip char filter has been added.
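
For readers landing here later, a custom analyzer that strips HTML before tokenizing might be configured roughly as follows. This is a minimal sketch, not the configuration from #315: it assumes the `html_strip` char filter name and the index-settings layout used by current Elasticsearch releases (the syntax available in 2010 may have differed), and the analyzer name `html_text` is hypothetical.

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "html_text": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```

With settings along these lines, a field mapped to the `html_text` analyzer would have tags removed and entities decoded before tokenization, which covers the same ground as the client-side cleanup in the original request.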
