The analysis process in Lucene also allows for a `char_filter` to be used, which is a filter applied to the actual character stream before the tokenization process. Allow configuring custom `char_filter` settings, and provide an implementation for HTML stripping called `html_strip`.
Also, add a `standard_html_strip` analyzer that combines the standard analyzer with an `html_strip` char filter.
Here are some examples of how to configure it using both YAML and JSON:
YAML:
```yaml
index :
    analysis :
        tokenizer :
            standard :
                type : standard
        char_filter :
            my_html :
                type : html_strip
                escaped_tags : [xxx, yyy]
                read_ahead : 1024
        filter :
            stop :
                type : stop
                stopwords : [test-stop]
            stop2 :
                type : stop
                stopwords : [stop2-1, stop2-2]
        analyzer :
            standard :
                type : standard
                stopwords : [test1, test2, test3]
            custom1 :
                tokenizer : standard
                filter : [stop, stop2]
            custom2 :
                tokenizer : standard
                char_filter : [html_strip, my_html]
```
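Since the settings can also be supplied as JSON, here is an equivalent JSON form of the example above (a sketch that simply mirrors the YAML keys; the tag names, stopwords, and `read_ahead` value are the same placeholders as before):

```json
{
    "index" : {
        "analysis" : {
            "tokenizer" : {
                "standard" : { "type" : "standard" }
            },
            "char_filter" : {
                "my_html" : {
                    "type" : "html_strip",
                    "escaped_tags" : ["xxx", "yyy"],
                    "read_ahead" : 1024
                }
            },
            "filter" : {
                "stop" : { "type" : "stop", "stopwords" : ["test-stop"] },
                "stop2" : { "type" : "stop", "stopwords" : ["stop2-1", "stop2-2"] }
            },
            "analyzer" : {
                "standard" : { "type" : "standard", "stopwords" : ["test1", "test2", "test3"] },
                "custom1" : { "tokenizer" : "standard", "filter" : ["stop", "stop2"] },
                "custom2" : { "tokenizer" : "standard", "char_filter" : ["html_strip", "my_html"] }
            }
        }
    }
}
```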
This works well. What does the escaped_tags arg do? I experimented a bit, but it didn't seem to make any difference.
Given that this `html_strip` filter (plus the standard analyser) will be a frequent requirement, any chance of making it one of the analysers available by default?