Improve the documentation on how to add new languages #8
Comments
Hi @BraveyJS, @vidiemme-brainy, and @serafinomb, Maybe an interesting possibility is to use a port of the Snowball stemmers, jssnowball, specifically the one implemented in snowball.babel.js. It currently supports the following languages (according to an example here):
I observed that the library can be used like this:
function stem( lang, word ) {
  var stemmer = snowballFactory.newStemmer( lang );
  return stemmer.stem( word );
}
console.log( stem( 'portuguese', 'bocado' ) ); // prints 'boc'
What do you think? |
We already used Snowball stemmers for the first two languages, so we can do the same with the others as well. You can start putting together the Brazilian Portuguese using
/**
* Brazilian Portuguese language functions.
* @namespace
*/
Bravey.Language.PT_BR = {};
/**
* Creates a brazilian portuguese words stemmer (i.e. stemmed version of "bocado" and variants is always "boc").
* @constructor
*/
Bravey.Language.PT_BR.Stemmer = function(word) {
  // Placeholder: returns the word unchanged until the real stemming logic is added.
  var stemmedWord = word;
  return stemmedWord;
}
You can both extract the stemming code from
You can start creating your class
Since this is the first time we are adding a new language to the initial release, I suggest you work on a single language and then, if you want or need to, add the others gradually. Obviously, feel free to do the same with the other languages whenever you want :) |
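To make the stemmer contract described above concrete (every variant of "bocado" should map to the same stem), here is a minimal sketch. It assumes a stand-in `Bravey` object so that it runs on its own, and it uses a hypothetical toy suffix list (`TOY_SUFFIXES`) that is not the Snowball Portuguese algorithm; it only illustrates the shape of a function assigned to `Bravey.Language.PT_BR.Stemmer`.

```javascript
// Stand-in namespace so the sketch runs on its own; in the real project
// the Bravey object comes from the library itself.
var Bravey = { Language: {} };
Bravey.Language.PT_BR = {};

// Hypothetical toy suffix list, longest first.
// This is NOT the Snowball Portuguese algorithm.
var TOY_SUFFIXES = ['ados', 'adas', 'ado', 'ada'];

Bravey.Language.PT_BR.Stemmer = function(word) {
  var lower = word.toLowerCase();
  for (var i = 0; i < TOY_SUFFIXES.length; i++) {
    var suffix = TOY_SUFFIXES[i];
    // Strip the first matching suffix, keeping a stem of at least 3 letters.
    if (lower.length - suffix.length >= 3 && lower.slice(-suffix.length) === suffix) {
      return lower.slice(0, -suffix.length);
    }
  }
  return lower;
};

console.log(Bravey.Language.PT_BR.Stemmer('bocado'));  // boc
console.log(Bravey.Language.PT_BR.Stemmer('bocados')); // boc
```

A real language module would replace the toy loop with an actual stemming algorithm; the point here is only that the stemmer takes a word and returns its stem as a string.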
Hi @BraveyJS, unfortunately the code from
Bravey.Language.PT.Stemmer = (function() {
  return function(word) {
    return Bravey.Stemmer.stem( 'portuguese', word );
  }
})();
I tried to make a pt.js (attached, untested) according to |
I agree with you on leaving I know that Snowball is cryptic and it's easier to include a Anyway, you can still work on your language support module, leaving the stemmer returning the We will try to add the stemmer from Snowball to your language and the others ASAP. Just a closing note about the file you attached: consider that sentences are cleaned before being processed by entity recognizers, so you can skip language-specific accents in regular expressions. That can help improve entity recognition for people who don't use accents while writing on mobile, or for foreign people. |
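As an illustration of the closing note above, here is a minimal sketch of why accent-free regular expressions are enough once sentences are cleaned first. The `removeAccents` helper is hypothetical and is not Bravey's own cleaning routine; it relies only on the standard `String.prototype.normalize` method.

```javascript
// Hypothetical helper, not part of Bravey: strips diacritics from a sentence.
function removeAccents(text) {
  // NFD splits "ã" into "a" plus a combining tilde;
  // the regex then drops the combining marks.
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Accent-free pattern for the Portuguese word "amanhã" (tomorrow).
var tomorrow = /\bamanha\b/i;

var cleaned = removeAccents('Reunião amanhã às 10h');
console.log(cleaned);                // Reuniao amanha as 10h
console.log(tomorrow.test(cleaned)); // true

// Input typed without accents matches the same pattern.
console.log(tomorrow.test(removeAccents('Reuniao amanha as 10h'))); // true
```

With this kind of cleaning in front, a single pattern covers both accented and unaccented input.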
I understand your concerns about the JS size and memory use, although they can make Bravey less friendly for developers who want to add new languages. Do you know a port of Snowball less cryptic than I think the sentences could have the graphic accentuation removed, but, as I suggested in Issue #6, it's better not to transform them to lowercase, because some rules need to be case-sensitive, such as those for extracting people's names, place names, and the like. |
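The point about case can be sketched as follows. The pattern and the sentence are hypothetical examples, not rules taken from Bravey: a rule that finds a person's name after the Portuguese preposition "com" depends on capital letters, so it stops working once the sentence has been lowercased.

```javascript
// Hypothetical rule: one or more capitalized words after "com" ("with").
// No "i" flag on purpose: the capital letters carry the information.
var personAfterCom = /\bcom ([A-Z][a-z]+(?: [A-Z][a-z]+)*)/;

var sentence = 'Falei com Joao Silva em Lisboa';

var withCase = personAfterCom.exec(sentence);
console.log(withCase ? withCase[1] : null); // Joao Silva

var lowered = personAfterCom.exec(sentence.toLowerCase());
console.log(lowered ? lowered[1] : null);   // null
```

Removing accents keeps this rule usable, while lowercasing discards the only signal it relies on.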
Good source for stemmers here: https://github.com/snowballstem/snowball-website/tree/master/js |
Portuguese version done. Later on I can contribute to improve the docs on how to add a new language, if you want. |
Thank you for your contribution! Together with your work, I've just added the Portuguese sample you provided to the documentation, along with other samples in the unit tests. |
Here they are ;) |
Added! Thanks! Feel free to test and tune up the portuguese examples. |
Okay! |
It would be great to have more information about the functions inside en.js and it.js, including their purposes. BTW, I'm interested in creating a version for Brazilian Portuguese (pt-br.js).