
Add additional Analyzers, Tokenizers, and TokenFilters from Lucene #6693

Closed
wants to merge 2 commits

Conversation

@rmuir (Contributor) commented Jul 2, 2014

Add `irish` analyzer
Add `sorani` analyzer (Kurdish)

Add `classic` tokenizer: specific to English text; tries to recognize hostnames, companies, acronyms, etc.
Add `thai` tokenizer: segments Thai text into words.

Add `classic` tokenfilter: cleans up acronyms and possessives from the `classic` tokenizer
Add `apostrophe` tokenfilter: removes text after an apostrophe, and the apostrophe itself
Add `german_normalization` tokenfilter: umlaut/sharp-s normalization
Add `hindi_normalization` tokenfilter: accounts for Hindi spelling differences
Add `indic_normalization` tokenfilter: accounts for different Unicode representations in Indian languages
Add `sorani_normalization` tokenfilter: normalizes Kurdish text
Add `scandinavian_normalization` tokenfilter: normalizes Norwegian, Danish, and Swedish text
Add `scandinavian_folding` tokenfilter: a much more aggressive form of `scandinavian_normalization`
Add additional languages to the `stemmer` tokenfilter: `galician`, `minimal_galician`, `irish`, `sorani`, `light_nynorsk`, `minimal_nynorsk`

Add access to the default Thai stopword set `_thai_`

Fix some bugs and broken links in the documentation.

Closes #5935
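As a rough sketch of how the components listed above might be wired together, the following index-settings fragment assumes the standard Elasticsearch `analysis` configuration syntax; the analyzer and filter names (`classic_english`, `nordic_search`, `irish_stemmed`, `thai_text`, `irish_stemmer`, `thai_stop`) are illustrative, not names defined by this pull request:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "classic_english": {
          "tokenizer": "classic",
          "filter": ["classic", "lowercase"]
        },
        "nordic_search": {
          "tokenizer": "standard",
          "filter": ["lowercase", "scandinavian_folding"]
        },
        "irish_stemmed": {
          "tokenizer": "standard",
          "filter": ["lowercase", "irish_stemmer"]
        },
        "thai_text": {
          "tokenizer": "thai",
          "filter": ["lowercase", "thai_stop"]
        }
      },
      "filter": {
        "irish_stemmer": {
          "type": "stemmer",
          "language": "irish"
        },
        "thai_stop": {
          "type": "stop",
          "stopwords": "_thai_"
        }
      }
    }
  }
}
```

Here `classic_english` pairs the `classic` tokenizer with its companion `classic` tokenfilter, `irish_stemmer` uses one of the newly added `stemmer` languages, and `thai_stop` references the new `_thai_` default stopword set.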

@nik9000 (Member) commented Jul 2, 2014

Yay!

@jpountz (Contributor) commented Jul 2, 2014

LGTM this is great!

@s1monw (Contributor) commented Jul 3, 2014

LGTM good stuff

@s1monw s1monw removed the review label Jul 3, 2014
@rmuir rmuir closed this Jul 3, 2014
rmuir added a commit to rmuir/elasticsearch-definitive-guide that referenced this pull request Jul 3, 2014
@clintongormley clintongormley added the :Search/Analysis How text is split into tokens label Jun 7, 2015
@clintongormley clintongormley changed the title Analysis: Add additional Analyzers, Tokenizers, and TokenFilters from Lucene Add additional Analyzers, Tokenizers, and TokenFilters from Lucene Jun 7, 2015
Successfully merging this pull request may close these issues.

Analysis: Add factories for some Tokenizers/TokenFilters in Lucene
5 participants