Improve the documentation on how to add new languages #8

Open
thiagodp opened this issue Jun 20, 2017 · 11 comments

Comments

@thiagodp
Contributor

It would be great to have more information about the functions inside en.js and it.js, including their purposes.

BTW, I'm interested in creating a version for Brazilian Portuguese (pt-br.js).

@thiagodp
Contributor Author

Hi @BraveyJS, @vidiemme-brainy, and @serafinomb,

One interesting possibility is to use a port of the Snowball stemmers, jssnowball, specifically the one implemented in snowball.babel.js.

It currently supports the following languages (according to an example here):

  • arabic
  • armenian
  • basque
  • catalan
  • czech
  • danish
  • dutch
  • english
  • finnish
  • french
  • german
  • hungarian
  • italian
  • irish
  • norwegian
  • porter
  • portuguese
  • romanian
  • russian
  • spanish
  • slovene
  • swedish
  • tamil
  • turkish

I observed that the library can be used like this:

function stem( lang, word ) {
	var stemmer = snowballFactory.newStemmer( lang );
	return stemmer.stem( word );
}
console.log( stem( 'portuguese', 'bocado' ) ); // prints 'boc'

What do you think?

@BraveyJS
Owner

We already used Snowball stemmers for the first two languages, so we can do the same for the others as well.
Stemmers only improve intent recognition. Entity recognizers are very important in many chatbot scenarios and should be written at some point, but stemmers should work as initial stubs anyway.

You can start putting together the Brazilian Portuguese support using src/languages/en.js as an example, with something like this:

/**
 * Brazilian Portuguese language functions.
 * @namespace
 */
Bravey.Language.PT_BR = {};

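/**
 * Creates a Brazilian Portuguese word stemmer (i.e. the stemmed version of "bocado" and its variants is always "boc").
 * @constructor
 */
Bravey.Language.PT_BR.Stemmer = function(word) {
  // Placeholder: put the stemming logic here. Until then, return the word as-is.
  var stemmedWord = word;
  return stemmedWord;
};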

You can either extract the stemming code from snowball.babel.js and nest it in your stemmer, or include snowball.babel.js as a dependency and call it.
In your own project you can choose either way, but I suggest the first one if you want it included in the Bravey core. That's the approach we used for Italian and English, since it lets you make your own Bravey build with only the languages you want to support in your project, which reduces JS size and memory usage.

You can start by creating your class in src/languages/pt-br.js, including it in the tests file src/unit.html like the other languages, and writing unit tests that use your stemmer both stand-alone and together with the NLP objects and the basic entity recognizers (the ones you can find in src/entityrecognizers).
I'll update this issue and the wiki over time with the information you need, in order to help others.
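
A stand-alone stemmer test along those lines could look roughly like the sketch below. It is only an illustration: the `Bravey` object is a stand-in so the snippet runs on its own, the stemmer is a stub that returns the word unchanged, and the `check` helper is hypothetical (the real suite in src/unit.html uses its own test framework).

```javascript
// Stand-in for the global Bravey namespace (assumption: the real one comes from the Bravey build).
var Bravey = { Language: {} };

Bravey.Language.PT_BR = {};

// Stub stemmer: returns the word unchanged until real stemming code is nested here.
Bravey.Language.PT_BR.Stemmer = function(word) {
  return word;
};

// Hypothetical check helper, used only to keep this sketch self-contained.
function check(description, actual, expected) {
  var ok = actual === expected;
  console.log((ok ? 'PASS' : 'FAIL') + ': ' + description);
  return ok;
}

var results = [
  check('stemmer returns a string', typeof Bravey.Language.PT_BR.Stemmer('bocado'), 'string'),
  check('stub leaves the word unchanged', Bravey.Language.PT_BR.Stemmer('bocado'), 'bocado')
];
var allPassed = results.every(Boolean);
```

Once a real stemmer replaces the stub, the second check would change to expect the stem instead of the original word.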

Since this is the first new language we are adding to the initial release, I suggest you work on a single language first and then, if you want or need to, add the others gradually. Obviously, feel free to do the same with the other languages whenever you want :)

@thiagodp
Contributor Author

thiagodp commented Jun 21, 2017

Hi @BraveyJS ,

unfortunately the code from snowball.babel.js is a bit cryptic. I think it would be much easier if Bravey provided a stemmers/Stemmer.js with a Stemmer object containing a method like stem( lang, word ). This way, each language namespace could just define its own stemmer:

Bravey.Language.PT.Stemmer = (function() {
  return function(word) {
	return Bravey.Stemmer.stem( 'portuguese', word );
  }
})();

I tried to make a pt.js (attached, untested) based on en.js and it.js. However, as I said before, it would be nice if new developers only had to worry about defining the EntityRecognizers. Don't you think?
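
To make the proposal concrete, here is a rough sketch of what such a shared Stemmer object could look like. Everything in it is hypothetical: `Bravey.Stemmer`, its `register` and `stem` methods, and the placeholder Portuguese function are assumptions for illustration, not part of Bravey, and the placeholder does not implement the Snowball algorithm.

```javascript
// Stand-in for the global Bravey namespace, so the sketch runs on its own.
var Bravey = { Language: {} };

// Hypothetical shared stemmer registry (not an existing Bravey API).
Bravey.Stemmer = (function() {
  var stemmers = {};
  return {
    // Associates a stemming function with a language name.
    register: function(lang, fn) {
      stemmers[lang] = fn;
    },
    // Stems a word, falling back to the unchanged word when no stemmer is registered.
    stem: function(lang, word) {
      var fn = stemmers[lang];
      return fn ? fn(word) : word;
    }
  };
})();

// Placeholder only: a real build would register the Snowball Portuguese stemmer here.
Bravey.Stemmer.register('portuguese', function(word) {
  return word.toLowerCase();
});

// The language namespace then only delegates to the shared object.
Bravey.Language.PT = {};
Bravey.Language.PT.Stemmer = function(word) {
  return Bravey.Stemmer.stem('portuguese', word);
};
```

The trade-off is that a shared registry pulls every registered stemmer into the build, whereas nesting the stemmer in each language file keeps per-language builds small.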

@BraveyJS
Owner

I agree with you on leaving only the EntityRecognizers to developers but, as you can see, some effort is needed to split the various parts into their language packages.

I know that Snowball is cryptic and that it would be easier to include in Bravey a Stemmer object that works like Snowball, but we'd like to keep the same design decisions for the other languages and keep the optimization we've planned and will use in our projects.

Anyway, you can still work on your language support module, leaving the stemmer as a stub that returns the word argument as-is, and complete the entity recognizers, which are very valuable and can be tested well by someone who knows the language. Write the unit tests as I suggested.

We will try to add the stemmer from Snowball to your language and the others ASAP.

Just a closing note about the file you attached: consider that sentences are cleaned before being processed by entity recognizers, so you can skip language-specific accents in regular expressions. That can help improve entity recognition for people who don't use accents when writing on mobile, or for foreign speakers.
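
As a rough illustration of why accent-free patterns are enough, the sketch below strips accents with the standard `String.prototype.normalize` and then applies a plain ASCII regular expression. The `removeAccents` function and the `tomorrow` pattern are hypothetical; Bravey's own cleaning step may be implemented differently.

```javascript
// Hypothetical cleaning step, assumed to run before entity recognizers see the text.
function removeAccents(text) {
  // NFD splits "ã" into "a" plus a combining tilde; the regex then drops the combining marks.
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Accent-free pattern: no need to write /amanh[aã]/ once the input is cleaned.
var tomorrow = /\bamanha\b/;

var withAccent = removeAccents('reunião amanhã');
var withoutAccent = removeAccents('reuniao amanha');

console.log(tomorrow.test(withAccent), tomorrow.test(withoutAccent));
```

Both inputs end up as the same cleaned string, so one pattern covers users who type accents and users who do not.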

@thiagodp
Contributor Author

thiagodp commented Jun 21, 2017

I understand your concerns about JS size and memory use, although they can make it less friendly for developers to add new languages to Bravey.

Do you know of a Snowball port that is less cryptic than snowball.babel.js? I would like to include a Portuguese stemmer inside pt.js, as you suggested. I'll also try to create some tests for the EntityRecognizers.

I think the sentences could have their accents removed but, as I suggested in Issue #6, it's better not to transform them to lowercase, because case-sensitive rules are needed, such as those for extracting people's names, place names, and the like.
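
A small sketch of the point about case: if cleaning removes accents but keeps the original case, a case-sensitive rule can still find capitalized names, while the same rule finds nothing in lowercased text. The function name and the name pattern are hypothetical and deliberately simplistic (two or more consecutive capitalized words); they are not taken from Bravey.

```javascript
// Hypothetical cleaning that strips accents but preserves case.
function removeAccentsKeepCase(text) {
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Simplistic case-sensitive rule: two or more consecutive capitalized words.
var capitalizedName = /\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b/;

var sentence = 'Quero falar com João Gonçalves amanhã';
var cleaned = removeAccentsKeepCase(sentence);

// With case preserved, the rule finds the name.
var match = cleaned.match(capitalizedName);

// After lowercasing, the same rule has nothing to match.
var lowerMatch = cleaned.toLowerCase().match(capitalizedName);

console.log(match && match[0], lowerMatch);
```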

@thiagodp
Contributor Author

Good source for stemmers here: https://github.com/snowballstem/snowball-website/tree/master/js

@thiagodp
Contributor Author

thiagodp commented Jul 4, 2017

Portuguese version done. Later on I can help improve the docs on how to add a new language, if you want.

@BraveyJS
Owner

BraveyJS commented Jul 4, 2017

Thank you for your contribution! Together with your work, I've just added the Portuguese sample you provided to the documentation, along with other samples in the unit tests.
Feel free to improve the docs (I've seen that you worked in a Microsoft environment, given the new batch builder files ;) ) and, if you want, you can port the two multilingual sample chatbots to Portuguese too. You can find the localized files in samples/browser/chatterbox/data/medbot.en.js and samples/browser/chatterbox/data/prices.en.js.

@thiagodp
Contributor Author

thiagodp commented Jul 4, 2017

Here they are ;)

@BraveyJS
Owner

Added! Thanks! Feel free to test and tune up the Portuguese examples.

@thiagodp
Contributor Author

Okay!


2 participants