
Analysis: Add char_filter on top of tokenizer, filter, and analyzer. Add an html_strip char filter and standard_html_strip analyzer #315

Closed
kimchy opened this issue Aug 12, 2010 · 2 comments

Comments


kimchy commented Aug 12, 2010

The analysis process in Lucene also supports char_filters: filters applied to the raw character stream before tokenization. Allow custom char_filters to be configured, and provide an HTML-stripping implementation called html_strip.

Also, add a standard_html_strip analyzer that combines the standard analyzer with an html_strip char filter.
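
As a rough illustration of what the char filtering step does, here is a minimal Java sketch that runs text through Lucene's HTMLStripCharFilter directly. The class is real, but its package and constructors have moved between Lucene versions, so treat the imports as an assumption rather than the exact code behind this change:

import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;

public class HtmlStripSketch {
    public static void main(String[] args) throws Exception {
        Reader raw = new StringReader("<p>some <b>bold</b> text</p>");
        // The char filter rewrites the character stream before any tokenizer sees it.
        // (In current Lucene versions a second constructor also accepts a set of
        // escaped tags to leave intact, mirroring the escaped_tags setting below.)
        Reader stripped = new HTMLStripCharFilter(raw);
        StringBuilder out = new StringBuilder();
        int c;
        while ((c = stripped.read()) != -1) {
            out.append((char) c);
        }
        // Tags are stripped (block tags become newlines), so this prints
        // roughly "some bold text".
        System.out.println(out.toString().trim());
    }
}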

Here are examples of how to configure it using both YAML and JSON:

YAML:

index :
  analysis :
    tokenizer :
      standard :
        type : standard
    char_filter :
      my_html :
        type : html_strip
        escaped_tags : [xxx, yyy]
        read_ahead : 1024
    filter :
      stop :
        type : stop
        stopwords : [test-stop]
      stop2 :
        type : stop
        stopwords : [stop2-1, stop2-2]
    analyzer :
      standard :
        type : standard
        stopwords : [test1, test2, test3]
      custom1 :
        tokenizer : standard
        filter : [stop, stop2]
      custom2 :
        tokenizer : standard
        char_filter : [html_strip, my_html]

JSON:

{
    "index" : {
        "analysis" : {
            "tokenizer" : {
                "standard" : {
                    "type" : "standard"
                }
            },
            "char_filter" : {
                "my_html" : {
                    "type" : "html_strip",
                    "escaped_tags" : ["xxx", "yyy"],
                    "read_ahead" : 1024
                }
            },
            "filter" : {
                "stop" : {
                    "type" : "stop",
                    "stopwords" : ["test-stop"]
                },
                "stop2" : {
                    "type" : "stop",
                    "stopwords" : ["stop2-1", "stop2-2"]
                }
            },
            "analyzer" : {
                "standard" : {
                    "type" : "standard",
                    "stopwords" : ["test1", "test2", "test3"]
                },
                "custom1" : {
                    "tokenizer" : "standard",
                    "filter" : ["stop", "stop2"]
                },
                "custom2" : {
                    "tokenizer" : "standard",
                    "char_filter" : ["html_strip", "my_html"]
                }
            }
        }
    }
}
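
For readers mapping this configuration back to Lucene, the custom2 analyzer above is just a char filter wrapped around a tokenizer. Below is a hedged Java sketch of that chain; it assumes modern Lucene class names (the 2010-era API differed), so it is an illustration of the technique, not the code this change ships:

import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class Custom2ChainSketch {
    public static void main(String[] args) throws Exception {
        Reader raw = new StringReader("<p>Hello <b>World</b></p>");
        Reader filtered = new HTMLStripCharFilter(raw); // the char_filter step
        Tokenizer tokenizer = new StandardTokenizer();  // the tokenizer step
        tokenizer.setReader(filtered);
        TokenStream ts = tokenizer;                     // custom2 defines no token filters
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString());        // prints "Hello", then "World"
        }
        ts.end();
        ts.close();
    }
}

Note that custom1 shows the tokenizer/filter combination and custom2 the tokenizer/char_filter combination; a single custom analyzer can of course specify both.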

kimchy commented Aug 12, 2010

Analysis: Add char_filter on top of tokenizer, filter, and analyzer. Add an html_strip char filter, closed by 98bc828.

clintongormley commented

This works well. What does the escaped_tags arg do? I experimented a bit, but it didn't seem to make any difference.

Given that this html_strip filter (plus standard analyser) will be a frequent requirement, any chance of making it one of the analysers available by default?

This issue was closed.