RegexEntityRecognizer should not clean the text #6

thiagodp · 2017-06-06T18:18:29Z

In RegexEntityRecognizer, the text is "cleaned" and converted to lowercase before the regex is applied. As a result, case-sensitive regexes, for example, do not work. IMO, the text should not be modified before it is matched against the regex, so it would be better not to call Bravey.Text.clean( text ) on it.
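
To illustrate the problem, here is a minimal sketch in plain JavaScript. It does not call Bravey at all: `cleanStandIn` is a hypothetical stand-in that only reproduces the lowercasing described above, and the product-code pattern is an invented example.

```javascript
// Hypothetical stand-in for the lowercasing step of Bravey.Text.clean();
// the real function is assumed to do more than this.
function cleanStandIn(text) {
  return text.trim().toLowerCase();
}

// A case-sensitive pattern: an uppercase product code such as "AB-123".
const productCode = /\b[A-Z]{2}-\d{3}\b/;

const input = "Please ship AB-123 today";

// Matching the raw text succeeds; matching the cleaned text cannot,
// because no uppercase letters are left for [A-Z] to match.
const matchesRaw = productCode.test(input);
const matchesCleaned = productCode.test(cleanStandIn(input));

console.log(matchesRaw, matchesCleaned); // true false
```

Adding the `i` flag would make this particular pattern match again, but only by giving up the case distinction the regex was written to enforce.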

thiagodp · 2017-06-06T18:38:52Z

Oh, it looks like the text is also cleaned by Bravey.Nlp.Fuzzy.test() before being given to RegexEntityRecognizer.

BraveyJS · 2017-06-07T09:55:52Z

RegexEntityRecognizer and the other entity recognizers are designed to be usable stand-alone as much as possible, without depending on an NLP object, so you can use just what you need in your chatbot.
That's why you've found Bravey.Text.clean( text ) in two places that often run in sequence, and in most of the other entity recognizers. You can find some of these stand-alone usages in the unit tests.

RegexEntityRecognizer is intended mostly for matching parts of text via regexp and converting them to machine-readable data via a callback, like the language-specific DateEntityRecognizer, TimeEntityRecognizer ...
It works on a cleaned string in order to simplify the regexp definition and its callback: since double spaces, diacritics, case and so on are normalized, you can ignore them when writing your regexp and reduce the number of cases the callback has to handle.
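
As a rough illustration of that simplification, the sketch below normalizes a few input variants and matches them all with one small pattern. `cleanStandIn` is a hypothetical approximation of a cleaning step (whitespace, diacritics, case), not Bravey's actual implementation, and the "cafe" pattern is invented for the example.

```javascript
// Hypothetical approximation of a cleaning step; not Bravey's actual code.
function cleanStandIn(text) {
  return text
    .normalize("NFD")                 // split letters from combining marks
    .replace(/[\u0300-\u036f]/g, "")  // drop the combining diacritics
    .replace(/\s+/g, " ")             // collapse runs of whitespace
    .trim()
    .toLowerCase();
}

// One simple pattern covers every variant once the text is cleaned.
const cafePattern = /\bcafe (\d+)\b/;

const variants = ["Café 12", "CAFE   12", "  cafe 12 "];

// The "callback" part: turn the captured digits into a number.
const results = variants.map((v) => {
  const m = cafePattern.exec(cleanStandIn(v));
  return m ? Number(m[1]) : null;
});

console.log(results); // [ 12, 12, 12 ]
```

Without the cleaning step, the same pattern would need the `i` flag, an `[eé]` alternative and `\s+` between the tokens to cover these three inputs.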

Anyway, your point about case-sensitive regexps is still valid. We can either create a brand new, stricter entity recognizer that works on the text as-is, or add a constructor argument, as you originally proposed. What do you think?
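
A constructor argument along those lines might look like the sketch below. Everything in it is hypothetical: the class name `SketchRegexRecognizer`, the `cleanText` option and the `find` method are invented for illustration and do not mirror Bravey's real classes or signatures.

```javascript
// Hypothetical recognizer; it does not mirror Bravey's real class or API.
class SketchRegexRecognizer {
  // cleanText defaults to true to keep the current (cleaning) behaviour.
  constructor(regex, { cleanText = true } = {}) {
    this.regex = regex;
    this.cleanText = cleanText;
  }

  // Returns the first match as a string, or null when nothing matches.
  find(text) {
    // Lowercasing stands in for the full cleaning step here.
    const subject = this.cleanText ? text.trim().toLowerCase() : text;
    const m = this.regex.exec(subject);
    return m ? m[0] : null;
  }
}

const acronym = /\b[A-Z]{3,}\b/; // case-sensitive: three or more capitals

const lenient = new SketchRegexRecognizer(acronym);
const strict = new SketchRegexRecognizer(acronym, { cleanText: false });

const lenientHit = lenient.find("Send it via HTTP");
const strictHit = strict.find("Send it via HTTP");

console.log(lenientHit, strictHit); // null HTTP
```

Defaulting the option to the cleaning behaviour would keep existing recognizers working unchanged, while strict callers opt out explicitly.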

thiagodp · 2017-06-07T17:17:26Z

A strict entity recognizer would be great. Thanks.

thiagodp changed the title ~~Change Bravey.Text.clean() to have an optional parameter for not converting to lowercase~~ RegexEntityRecognizer should not clean the text Jun 6, 2017

thiagodp mentioned this issue Jun 21, 2017

Improve the documentation on how to add new languages #8

Open

BraveyJS mentioned this issue Sep 15, 2017

Result text should keep the upper case on words without Entity #9

Open

thiagodp mentioned this issue May 26, 2018

Diacritics are being removed thiagodp/concordialang#3

Closed
