AggressiveTokenizer for Spanish (aggressive_tokenizer_es.js) #132

Closed
walgarcia opened this Issue Mar 7, 2014 · 2 comments

2 participants

@walgarcia

Hello

The tokenizer for Spanish should treat the following symbols as word characters: the accented vowels áéíóúü ÁÉÍÓÚÜ and the letter "eñe" ñÑ.

So in aggressive_tokenizer_es.js,

AggressiveTokenizer.prototype.tokenize = function(text) {
    // break a string up into an array of tokens by anything non-word
    return this.trim(text.split(/\W+/));
}; 

should be changed to

AggressiveTokenizer.prototype.tokenize = function(text) {
    // break a string up into an array of tokens by anything non-word
    return this.trim(text.split(/[a-zA-Zá-úÁ-ÚñÑüÜ]+/));
};

Thanks
W.

@kkoch986
NaturalNode member

Thanks for this! Based on testing out what you provided, I think we will need this instead:
return this.trim(text.split(/[^a-zA-Zá-úÁ-ÚñÑüÜ]+/));

The only change is the ^ at the start of the character class, which negates it, so the string is split on any run of characters that are not in the set. Going to commit this ASAP.
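
For anyone comparing the three patterns in this thread, here is a minimal standalone sketch of how each one splits a Spanish sentence. The helper names (`splitByNonWord`, `splitByLetters`, `splitByNonLetters`) and the sample sentence are hypothetical, and `filter(Boolean)` is only a stand-in for the tokenizer's `trim`, whose exact behavior is assumed here rather than copied from the library.

```javascript
// Character set proposed in this issue: ASCII letters plus accented
// vowels, ü/Ü and ñ/Ñ. (Hypothetical constant name, for illustration only.)
const SPANISH_LETTERS = 'a-zA-Zá-úÁ-ÚñÑüÜ';

// Stand-in for the tokenizer's trim step: drop empty strings.
function dropEmpty(tokens) {
  return tokens.filter(Boolean);
}

// Original behavior: \W treats accented letters as separators.
function splitByNonWord(text) {
  return dropEmpty(text.split(/\W+/));
}

// First proposal: without ^ the letters themselves become the separators,
// so only the material between words is left.
function splitByLetters(text) {
  return dropEmpty(text.split(new RegExp('[' + SPANISH_LETTERS + ']+')));
}

// Corrected proposal: ^ negates the class, so everything that is not a
// Spanish letter is a separator and the words survive intact.
function splitByNonLetters(text) {
  return dropEmpty(text.split(new RegExp('[^' + SPANISH_LETTERS + ']+')));
}

const sample = 'El pingüino comió en España.';

console.log(splitByNonWord(sample));
// [ 'El', 'ping', 'ino', 'comi', 'en', 'Espa', 'a' ]
console.log(splitByLetters(sample));
// [ ' ', ' ', ' ', ' ', '.' ]
console.log(splitByNonLetters(sample));
// [ 'El', 'pingüino', 'comió', 'en', 'España' ]
```

The first output shows the problem reported in the issue (words broken at ü, ó and ñ), the second shows why the class needs to be negated, and the third is the result of the pattern with ^.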

@kkoch986 kkoch986 pushed a commit that referenced this issue Mar 7, 2014
Ken Koch implemented #132, possible fix for #125 as well. 443692b
@kkoch986
NaturalNode member

Added. Thanks!

@kkoch986 kkoch986 closed this Mar 7, 2014