
Analysis: Add char_filter on top of tokenizer, filter, and analyzer. Add an html_strip char filter and standard_html_strip analyzer #315

Closed
kimchy opened this issue Aug 12, 2010 · 2 comments

Comments


kimchy commented Aug 12, 2010

The analysis process in Lucene also supports char_filters: filters applied to the raw character stream before tokenization. Allow custom char_filters to be configured, and provide an HTML-stripping implementation called html_strip.

Also, add a standard_html_strip analyzer that combines the standard analyzer with an html_strip char filter.
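
As a rough illustration of what the char filtering step does, here is a minimal Java sketch that runs text through Lucene's HTMLStripCharFilter directly. The class is real, but its package and constructors have moved between Lucene versions, so treat the imports as an assumption rather than the exact code behind this change:

import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;

public class HtmlStripSketch {
    public static void main(String[] args) throws Exception {
        Reader raw = new StringReader("<p>some <b>bold</b> text</p>");
        // The char filter rewrites the character stream before any tokenizer sees it.
        // (In current Lucene versions a second constructor also accepts a set of
        // escaped tags to leave intact, mirroring the escaped_tags setting below.)
        Reader stripped = new HTMLStripCharFilter(raw);
        StringBuilder out = new StringBuilder();
        int c;
        while ((c = stripped.read()) != -1) {
            out.append((char) c);
        }
        // Tags are stripped (block tags become newlines), so this prints
        // roughly "some bold text".
        System.out.println(out.toString().trim());
    }
}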

Here are examples of how to configure it using both YAML and JSON:

YAML:

index :
  analysis :
    tokenizer :
      standard :
        type : standard
    char_filter :
      my_html :
        type : html_strip
        escaped_tags : [xxx, yyy]
        read_ahead : 1024
    filter :
      stop :
        type : stop
        stopwords : [test-stop]
      stop2 :
        type : stop
        stopwords : [stop2-1, stop2-2]
    analyzer :
      standard :
        type : standard
        stopwords : [test1, test2, test3]
      custom1 :
        tokenizer : standard
        filter : [stop, stop2]
      custom2 :
        tokenizer : standard
        char_filter : [html_strip, my_html]

JSON:

{
    "index" : {
        "analysis" : {
            "tokenizer" : {
                "standard" : {
                    "type" : "standard"
                }
            },
            "char_filter" : {
                "my_html" : {
                    "type" : "html_strip",
                    "escaped_tags" : ["xxx", "yyy"],
                    "read_ahead" : 1024
                }
            },
            "filter" : {
                "stop" : {
                    "type" : "stop",
                    "stopwords" : ["test-stop"]
                },
                "stop2" : {
                    "type" : "stop",
                    "stopwords" : ["stop2-1", "stop2-2"]
                }
            },
            "analyzer" : {
                "standard" : {
                    "type" : "standard",
                    "stopwords" : ["test1", "test2", "test3"]
                },
                "custom1" : {
                    "tokenizer" : "standard",
                    "filter" : ["stop", "stop2"]
                },
                "custom2" : {
                    "tokenizer" : "standard",
                    "char_filter" : ["html_strip", "my_html"]
                }
            }
        }
    }
}
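
For readers mapping this configuration back to Lucene, the custom2 analyzer above is just a char filter wrapped around a tokenizer. Below is a hedged Java sketch of that chain; it assumes modern Lucene class names (the 2010-era API differed), so it is an illustration of the technique, not the code this change ships:

import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class Custom2ChainSketch {
    public static void main(String[] args) throws Exception {
        Reader raw = new StringReader("<p>Hello <b>World</b></p>");
        Reader filtered = new HTMLStripCharFilter(raw); // the char_filter step
        Tokenizer tokenizer = new StandardTokenizer();  // the tokenizer step
        tokenizer.setReader(filtered);
        TokenStream ts = tokenizer;                     // custom2 defines no token filters
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString());        // prints "Hello", then "World"
        }
        ts.end();
        ts.close();
    }
}

Note that custom1 shows the tokenizer/filter combination and custom2 the tokenizer/char_filter combination; a single custom analyzer can of course specify both.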

kimchy commented Aug 12, 2010

Analysis: Add char_filter on top of tokenizer, filter, and analyzer. Add an html_strip char filter, closed by 98bc828.

clintongormley commented

This works well. What does the escaped_tags arg do? I experimented a bit, but it didn't seem to make any difference.

Given that this html_strip filter (plus standard analyser) will be a frequent requirement, any chance of making it one of the analysers available by default?

This issue was closed.