This repository has been archived by the owner. It is now read-only.

Elasticsearch with custom English analyzer #20

Closed
5hirish opened this issue May 1, 2018 · 1 comment

@5hirish 5hirish commented May 1, 2018

The built-in English analyzer in Elasticsearch seems to use a weak stemmer (the Porter stemmer). As a result, a token like 'friendly' gets stemmed to 'friendli' rather than 'friend'. A lemmatizer would actually be perfect for such use cases.

Lemmatization is a much more complicated and expensive process that needs to understand the context in which words appear in order to make decisions about what they mean.

Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. Source
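To make the distinction above concrete, here is a deliberately tiny sketch. It is not the Porter stemmer and not a real lemmatizer: the suffix list and the vocabulary are hypothetical, hand-picked values chosen only to contrast "chop off the ending" with "look the word up in a vocabulary".

```python
# Toy illustration only: neither function is a real stemmer or lemmatizer.

# Hypothetical suffix rules, tried in order.
SUFFIXES = ("ing", "ly", "ed", "s")

def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

# Hypothetical hand-written vocabulary mapping surface forms to lemmas.
LEMMAS = {
    "friends": "friend",
    "mice": "mouse",
    "better": "good",
    "running": "run",
}

def toy_lemmatize(token):
    """Return the dictionary form if the token is in the vocabulary."""
    return LEMMAS.get(token, token)

if __name__ == "__main__":
    for word in ("friends", "running", "mice", "better"):
        print(word, "stem:", toy_stem(word), "lemma:", toy_lemmatize(word))
```

The suffix stripper handles regular endings but produces non-words such as 'runn' and cannot relate 'mice' to 'mouse', while the lookup only works for words that are in its vocabulary. That is the trade-off the quoted passage describes.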

@5hirish 5hirish added the enhancement label May 1, 2018
@5hirish 5hirish self-assigned this May 1, 2018
@5hirish 5hirish added this to To do in Issues Board May 4, 2018

@5hirish 5hirish commented May 8, 2018

To choose among the different stemmers available in ES, refer to: Choosing a stemmer in ES
To override the stop words list in ES, see: Using a custom stop words list. We can use the stop words from spaCy: All stopwords here

A good choice would be the Porter2 algorithm: Snowball Porter2

Should we enable ASCII folding? More on it here. If yes, should we store the original token too?
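The options discussed above could be combined roughly as follows. This is a sketch under assumptions, not the settings that were actually merged: the names `custom_english`, `custom_english_stop`, `porter2_stemmer` and `folding` are made up for illustration, the stop word list is a tiny placeholder for the spaCy list, and the filter order is one plausible choice. It only builds the settings body as a Python dict; sending it to a cluster is left out.

```python
import json

# Placeholder subset; the issue proposes using the spaCy stop word list instead.
CUSTOM_STOPWORDS = ["a", "an", "and", "of", "the"]

def build_index_settings(stopwords, fold_ascii=True, preserve_original=True):
    """Build an index settings body for a custom English analyzer.

    All analyzer and filter names here are hypothetical.
    """
    filters = {
        # Stop filter with an explicit word list overriding the default one.
        "custom_english_stop": {"type": "stop", "stopwords": list(stopwords)},
        # Stemmer token filter configured for the Porter2 (Snowball) algorithm.
        "porter2_stemmer": {"type": "stemmer", "language": "porter2"},
    }
    chain = ["lowercase", "custom_english_stop"]
    if fold_ascii:
        # preserve_original keeps the unfolded token alongside the folded one.
        filters["folding"] = {
            "type": "asciifolding",
            "preserve_original": preserve_original,
        }
        chain.append("folding")
    chain.append("porter2_stemmer")
    return {
        "settings": {
            "analysis": {
                "filter": filters,
                "analyzer": {
                    "custom_english": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": chain,
                    }
                },
            }
        }
    }

if __name__ == "__main__":
    print(json.dumps(build_index_settings(CUSTOM_STOPWORDS), indent=2))
```

Setting `preserve_original` to true answers the "store the original too" question by emitting both the folded and the unfolded token, at the cost of a larger index.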

@5hirish 5hirish moved this from To do to In progress in Issues Board May 9, 2018
@5hirish 5hirish mentioned this issue May 19, 2018
2 of 2 tasks complete
@5hirish 5hirish closed this May 20, 2018
Issues Board automation moved this from In progress to Done May 20, 2018