
Add additional Analyzers, Tokenizers, and TokenFilters from Lucene #6693

Closed
wants to merge 2 commits

Conversation

@rmuir (Contributor) commented Jul 2, 2014

Add `irish` analyzer
Add `sorani` analyzer (Kurdish)

Add `classic` tokenizer: specific to English text; tries to recognize hostnames, companies, acronyms, etc.
Add `thai` tokenizer: segments Thai text into words.

Add `classic` tokenfilter: cleans up acronyms and possessives from the `classic` tokenizer
Add `apostrophe` tokenfilter: removes text after an apostrophe, and the apostrophe itself
Add `german_normalization` tokenfilter: umlaut/sharp-s normalization
Add `hindi_normalization` tokenfilter: accounts for Hindi spelling differences
Add `indic_normalization` tokenfilter: accounts for different Unicode representations in Indian languages
Add `sorani_normalization` tokenfilter: normalizes Kurdish text
Add `scandinavian_normalization` tokenfilter: normalizes Norwegian, Danish, and Swedish text
Add `scandinavian_folding` tokenfilter: a much more aggressive form of `scandinavian_normalization`
Add additional languages to the `stemmer` tokenfilter: `galician`, `minimal_galician`, `irish`, `sorani`, `light_nynorsk`, `minimal_nynorsk`

Add access to the default Thai stopword set `_thai_`

Fix some bugs and broken links in the documentation.

Closes #5935
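As a rough sketch of how the components listed above might be wired together, the following index-settings fragment assumes the standard Elasticsearch `analysis` configuration syntax; the analyzer and filter names (`classic_english`, `nordic_search`, `irish_stemmed`, `thai_text`, `irish_stemmer`, `thai_stop`) are illustrative, not names defined by this pull request:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "classic_english": {
          "tokenizer": "classic",
          "filter": ["classic", "lowercase"]
        },
        "nordic_search": {
          "tokenizer": "standard",
          "filter": ["lowercase", "scandinavian_folding"]
        },
        "irish_stemmed": {
          "tokenizer": "standard",
          "filter": ["lowercase", "irish_stemmer"]
        },
        "thai_text": {
          "tokenizer": "thai",
          "filter": ["lowercase", "thai_stop"]
        }
      },
      "filter": {
        "irish_stemmer": {
          "type": "stemmer",
          "language": "irish"
        },
        "thai_stop": {
          "type": "stop",
          "stopwords": "_thai_"
        }
      }
    }
  }
}
```

Here `classic_english` pairs the `classic` tokenizer with its companion `classic` tokenfilter, `irish_stemmer` uses one of the newly added `stemmer` languages, and `thai_stop` references the new `_thai_` default stopword set.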

@nik9000 (Member) commented Jul 2, 2014

Yay!

@jpountz (Contributor) commented Jul 2, 2014

LGTM this is great!

@s1monw (Contributor) commented Jul 3, 2014

LGTM good stuff

@s1monw s1monw removed the review label Jul 3, 2014
@rmuir rmuir closed this Jul 3, 2014
rmuir added a commit to rmuir/elasticsearch-definitive-guide that referenced this pull request Jul 3, 2014
@clintongormley clintongormley added the :Search/Analysis How text is split into tokens label Jun 7, 2015
@clintongormley clintongormley changed the title Analysis: Add additional Analyzers, Tokenizers, and TokenFilters from Lucene Add additional Analyzers, Tokenizers, and TokenFilters from Lucene Jun 7, 2015
Successfully merging this pull request may close these issues.

Analysis: Add factories for some Tokenizers/TokenFilters in Lucene
5 participants