Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RegexEntityRecognizer should not clean the text #6

Open
thiagodp opened this issue Jun 6, 2017 · 3 comments
Open

RegexEntityRecognizer should not clean the text #6

thiagodp opened this issue Jun 6, 2017 · 3 comments

Comments

@thiagodp
Copy link
Contributor

thiagodp commented Jun 6, 2017

In RegexEntityRecognizer the text is "cleaned" and transformed to lowercase before being processed by the regex. This causes case sensitive regexes not to work, for example. IMO, the text should not be modified before being checked by the regex, so it is better not to perform Bravey.Text.clean( text ) on it.

@thiagodp thiagodp changed the title Change Bravey.Text.clean() to have an optional parameter for not converting to lowercase RegexEntityRecognizer should not clean the text Jun 6, 2017
@thiagodp
Copy link
Contributor Author

thiagodp commented Jun 6, 2017

Oh, it looks like the text is also cleaned by Bravey.Nlp.Fuzzy.test() before being given to RegexEntityRecognizer.

@BraveyJS
Copy link
Owner

BraveyJS commented Jun 7, 2017

RegexEntityRecognizer and other entity recognizers are designed to be eventually used stand-alone as much as possibile, without depending on an NLP object, so you can use just what you need in your chatbot.
That's why you've found Bravey.Text.clean( text ) in two often sequential places - and in most of the others entity recognizers. You can find some of these stand-alone usages in the unit tests.

RegexEntityRecognizer is thought mostly for matching parts of text via regexp and converting them to machine-readable data via callback, like language specific DateEntityRecognizer, TimeEntityRecognizer ...
It works on a cleaned string in order to simplify regexp definition and its callback: since double spaces, diatrics, case and so on are cleaned, you can ignore them when creating your regexp and reduce the cases of the callback.

Whatever, what you're saying about case sensitive regexps is still right. We can make a brand new and more strict entity recognizer for manipulating the text as-is or adding an argument on constructor as you were originally proposed. What do you think?

@thiagodp
Copy link
Contributor Author

thiagodp commented Jun 7, 2017

A strict entity recognizer would be great. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants