HTML tokenizer #301

clintongormley · 2010-08-08T14:59:37Z

Indexing fields that contain HTML is a frequent use case. I think it is important to have a built-in tokenizer that can handle HTML (i.e. remove tags and decode entities).

At the moment, in my client code I do the following:

use HTML::Entities;        # provides decode_entities()

$value =~ s/<[^>]+>/ /g;   # replace each <.....> tag with a single space
$value =~ s/\s+/ /g;       # collapse runs of whitespace into a single space
$value =~ s/^ //;          # trim leading whitespace
$value =~ s/ $//;          # trim trailing whitespace
decode_entities($value);   # decode all HTML entities in place to the equivalent Unicode characters

This is sufficient to convert HTML to text suitable for indexing by the default analyzer; it doesn't need to do any more than this.

Any chance of getting this built in?

kimchy · 2010-08-18T21:54:10Z

Does the recent HTML char filter addition (#315), which lets you construct your own analyzer that strips out HTML, cover this? If so, can I close this issue?

kimchy · 2010-09-21T21:16:58Z

Closing the issue; the HTML strip char filter has been added.
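
For readers landing here, the "construct your own analyzer" approach mentioned above can be sketched as index settings along the following lines. This is a minimal sketch, not taken from the thread: it assumes the char filter is registered under the name `html_strip` (as in current Elasticsearch documentation), the analyzer name `html_text` is hypothetical, and the exact settings syntax may differ in the 2010-era releases discussed here.

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "html_text": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```

A field mapped with this analyzer would have tags removed and entities decoded before tokenization, which is roughly what the client-side Perl snippet in the original request does by hand.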

williamrandolph pushed a commit to williamrandolph/elasticsearch that referenced this issue Jun 4, 2020

[DOC] Fix typo

d24f2bf

Fix elastic#301

mindw pushed a commit to mindw/elasticsearch that referenced this issue Sep 5, 2022

Merged in dev/can/remove-vladimir (pull request elastic#301)

803ccca

Remove vladimir Also add igor and fabien to gen students just in case Approved-by: Gideon Avida

costin pushed a commit that referenced this issue Dec 6, 2022

Merge pull request #301 from elastic/main

44a073d

🤖 ESQL: Merge upstream

This issue was closed.
