Add additional Analyzers, Tokenizers, and TokenFilters from Lucene #6693
- Add `irish` analyzer
- Add `sorani` analyzer (Kurdish)
- Add `classic` tokenizer: specific to English text; tries to recognize hostnames, companies, acronyms, etc.
- Add `thai` tokenizer: segments Thai text into words
- Add `classic` tokenfilter: cleans up acronyms and possessives from the `classic` tokenizer
- Add `apostrophe` tokenfilter: removes text after an apostrophe, and the apostrophe itself
- Add `german_normalization` tokenfilter: umlaut/sharp-S normalization
- Add `hindi_normalization` tokenfilter: accounts for Hindi spelling differences
- Add `indic_normalization` tokenfilter: accounts for different Unicode representations in Indian languages
- Add `sorani_normalization` tokenfilter: normalizes Kurdish text
- Add `scandinavian_normalization` tokenfilter: normalizes Norwegian, Danish, and Swedish text
- Add `scandinavian_folding` tokenfilter: a much more aggressive form of `scandinavian_normalization`
- Add additional languages to the `stemmer` tokenfilter: `galician`, `minimal_galician`, `irish`, `sorani`, `light_nynorsk`, `minimal_nynorsk`
- Add access to the default Thai stopword set `thai`
- Fix some bugs and broken links in the documentation
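As a sketch of how the additions above would be wired together, the index settings below define a custom analyzer combining one of the new normalization filters with one of the new stemmer languages. The index and analyzer/filter names (`my_index`, `folded_scandinavian`, `nynorsk_stemmer`) are illustrative, not part of this PR:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "nynorsk_stemmer": {
          "type": "stemmer",
          "language": "light_nynorsk"
        }
      },
      "analyzer": {
        "folded_scandinavian": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "scandinavian_folding", "nynorsk_stemmer"]
        }
      }
    }
  }
}
```

The standalone analyzers (e.g. `irish`, `sorani`) and tokenizers (e.g. `classic`, `thai`) can instead be referenced directly by name in a field mapping, with no custom-analyzer definition required.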
Closes #5935