work on regex intent classifier #417
Conversation
    def process(self, text):
        # type: (Text) -> Dict[Text, Any]

        return {
If no intent is matched with any of the regex expressions, the function shouldn't return any intent.
I tried to do this in line 58 - if there's no match between any regexp and the input, intent = None. Should I change it?
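The behaviour being discussed could look roughly like this (a minimal sketch; the class name, the regex_dict shape, and the return format are assumptions for illustration, not the PR's actual code):

```python
import re


class RegexIntentClassifier(object):
    """Minimal sketch of the classifier discussed above (not the PR's code)."""

    def __init__(self, regex_dict=None):
        # maps regex pattern -> intent name
        self.regex_dict = regex_dict if regex_dict is not None else {}

    def parse(self, text):
        for exp, intent in self.regex_dict.items():
            if re.search(exp, text) is not None:
                return intent
        return None  # no match: return None, not the string "None"

    def process(self, text):
        intent = self.parse(text)
        if intent is None:
            return {}  # no regex matched, so don't return any intent
        return {"intent": {"name": intent, "confidence": 1.0}}
```

The key point from the review: when nothing matches, `process` returns no intent at all rather than an intent literally named "None".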
        self.clf = clf
        self.regex_dict = {}

    def train(self, training_data):
The incoming training_data parameter is of type TrainingData (can be found in training_data.py). Before the data gets passed through the components in the pipeline, it will be read here: https://github.com/RasaHQ/rasa_nlu/blob/master/rasa_nlu/converters.py#L205
That function loads the different parts of the training data file, e.g. the common_examples section (as seen in https://github.com/RasaHQ/rasa_nlu/blob/master/data/examples/rasa/demo-rasa.json#L3).
The regex expressions should be put into the training data file (demo-rasa.json) and read within load_rasa_data in converters.py. The read regex expressions should then be passed to the TrainingData object (https://github.com/RasaHQ/rasa_nlu/blob/master/rasa_nlu/training_data.py#L33) and stored. You can then access these stored regexes in this function.
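A sketch of what reading such a regex_features section could look like (the real load_rasa_data in converters.py reads from a file and handles more sections; the field names follow the demo-rasa.json format discussed above):

```python
import json


def load_rasa_data(file_content):
    """Sketch: split a rasa training file into examples and regex features.

    The real load_rasa_data in converters.py differs; this only illustrates
    where a regex_features section would be picked up.
    """
    data = json.loads(file_content)["rasa_nlu_data"]
    common_examples = data.get("common_examples", [])
    # the new section discussed in this review:
    regex_features = data.get("regex_features", [])
    return common_examples, regex_features
```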
Ah! Hadn't realised that training_data was its own data type... So should the regexp be in the "text" part of the training data file?
…data. Updated the train function for the intent classifier accordingly and wrote the entity extractor component. Added both components to registry.py.
Nice work! Already looks pretty good. I have added a couple of remarks and there are a couple of things that need to be done before we can merge this:
- persist components: currently it is not possible to save the regex components (i.e. the regex_dict will be lost after storing and loading the component). The components should implement load and persist, somewhat along the lines of https://github.com/RasaHQ/rasa_nlu/blob/master/rasa_nlu/extractors/duckling_extractor.py#L102
- parse regex expressions from luis data
- add some tests (have a look at the existing tests, e.g. _pytest/test_extractors.py)
- add some information to the documentation about the training data format and the component
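For the persist/load point, a minimal sketch of what those two methods could look like (the file name and the metadata return value are assumptions; see the duckling_extractor.py link above for the real pattern):

```python
import json
import os


class RegexIntentClassifier(object):
    """Sketch of persist/load for a regex component (not the PR's code)."""

    def __init__(self, regex_dict=None):
        self.regex_dict = regex_dict if regex_dict is not None else {}

    def persist(self, model_dir):
        # write the regex -> intent mapping next to the rest of the model
        file_name = os.path.join(model_dir, "regex_intent_classifier.json")
        with open(file_name, "w") as f:
            json.dump(self.regex_dict, f)
        # metadata so the component can find its file again on load
        return {"regex_file": "regex_intent_classifier.json"}

    @classmethod
    def load(cls, model_dir, regex_file="regex_intent_classifier.json"):
        with open(os.path.join(model_dir, regex_file)) as f:
            return cls(json.load(f))
```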
        self.regex_dict = {}

    def train(self, training_data):
        # build regex: intent dict from training data
Should be a function docstring, e.g. """Build regex: ...""" instead of # ..., describing the functionality of the function.
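A sketch of the suggested style, with the filtering on "intent" mirroring the quoted train loop (the free-function form and input shape are just for illustration):

```python
def train(regex_features):
    """Build a regex: intent dict from the training data's regex features."""
    # keep only the regex features that carry an intent label
    return {ex["pattern"]: ex["intent"]
            for ex in regex_features if "intent" in ex}
```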
    def train(self, training_data):
        # build regex: intent dict from training data
        for example in training_data.regex_features:
            if ("intent" in example):
No brackets around the if condition are necessary.
    def process(self, text):
        # type: (Text) -> Dict[Text, Any]
        return {
it should rather be something along the lines of:

result = self.parse(text)
if result is not None:
    return ...
            search = (re.search(exp, text) != None)
            if search:
                return intent
        return "None"
should rather be None instead of "None"
        #def is_present(x): return x in _text

        for exp, intent in self.regex_dict.items():
            search = (re.search(exp, text) != None)
avoid the intermediate variable and use the expression directly in the following if
    def train(self, training_data):
        # build regex: intent dict from training data
        for example in training_data.regex_features:
            if ("entity" in example):
no ( and ) needed
    output_provides = ["entities"]

    def __init__(self, clf=None, regex_dict=None):
        self.clf = clf
needed?
        # type: (Doc, Language, List[Dict[Text, Any]]) -> Dict[Text, Any]

        entities = self.extract_entities(text)
        return {
To be able to compose this with other entity extractors, this should look something like this (feel free to have a look at the other extractors):

def process(self, text, entities):
    extracted = self.extract_entities(text)
    entities.extend(extracted)
    return {
        "entities": entities
    }
            ent = regexp.search(text)
            if ent != None:
                entity = {
                    "entity": str(entity),
str is not available automatically in python 3, it needs to be imported from builtins, i.e. from builtins import str. Thinking about it, is the str(...) even necessary?
                entity = {
                    "entity": str(entity),
                    "value": str(ent.group()),
                    "start": int(ent.start()),
isn't .start already an int?
Sort of! The function is returning unicode strings and I wasn't sure if this was an issue?
^ same for the use of str(...) for "entity" and "value"
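Putting the remarks above together, a sketch of the extraction loop without the redundant casts (the regex_dict shape is an assumption): re.search already yields int offsets from .start()/.end() and a string from .group().

```python
import re


def extract_entities(regex_dict, text):
    """Sketch: collect an entity for every regex that matches the text."""
    entities = []
    for exp, entity_name in regex_dict.items():
        match = re.search(exp, text)
        if match is not None:  # prefer "is not None" over "!= None"
            entities.append({
                "entity": entity_name,
                "value": match.group(),  # already a string, no str(...) needed
                "start": match.start(),  # already an int
                "end": match.end(),
            })
    return entities
```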
… pipeline templates
looks really good - left some comments / questions for clarification.
One thing, have we actually tested whether this performs better on any particular dataset?
I'm not sure if zip codes are the best example, since the 'digit' feature will presumably already account for this.
Maybe something a little different: hyphenated words maybe? Or words ending in 'ing' (verbs)?
        return message.get("text_features")

    def features_for_patterns(self, message):
        """Given a sentence, returns a vector of {1,0} values indicating which regexes match"""
I guess this is no longer a pure function, right? It has side effects in the call to t.set?
yes :(
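A side-effect-free sketch of the {1,0} feature vector itself (the pattern list is a made-up example; the real component additionally tags tokens, which is the side effect discussed above):

```python
import re

# example patterns, not the ones from the PR
known_patterns = ["[0-9]+", "hey[^\\s]*"]


def features_for_patterns(text):
    """Return a {1,0} vector indicating which of the known regexes match."""
    return [1.0 if re.search(p, text) is not None else 0.0
            for p in known_patterns]
```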
        ("hey how are you today", [0., 1.], [0]),
        ("hey 123 how are you", [1., 1.], [0, 1]),
        ("blah balh random eh", [0., 0.], []),
        ("looks really like 123 today", [1., 0.], [3]),
so what do we give to the CRF in terms of features? just a binary feature for each token which says whether any regex matched it?
No, it's a text feature saying which regex matched (can only be one though; it will always be the last matched one if there are multiple).
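That "last matched one wins" behaviour can be sketched per token like this (the pattern names and the "N/A" fallback are assumptions for illustration):

```python
import re

# example name -> pattern mapping, not the PR's actual patterns
patterns = {"number": "[0-9]+", "greet": "hey[^\\s]*"}


def pattern_feature_for_tokens(tokens):
    """For each token, keep the name of the last pattern that matched it."""
    features = []
    for token in tokens:
        matched = "N/A"
        for name, exp in patterns.items():
            if re.search(exp, token) is not None:
                matched = name  # later matches overwrite earlier ones
        features.append(matched)
    return features
```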
docs/dataformat.rst
Outdated
this regex is used for. As you can see in the above example, you can also use the regex features to improve the intent
classification performance.

Try to create your regular expressions in a wy that they match as few words as possible. E.g. using ``hey[^\s]*``
typo : 'wy'
docs/pipeline.rst
Outdated
@@ -173,6 +173,18 @@ intent_classifier_sklearn
to other classifiers it also provides rankings of the labels that did not "win". The spacy intent classifier
needs to be preceded by a featurizer in the pipeline. This featurizer creates the features used for the classification.

intent_featurizer_regex
not something we have to resolve right now, but with this change featurizers are no longer strictly intent_featurizers
I know, I was thinking about changing it to featurizer_regex. Either one has its advantages. What do you think, rename?
I can see two sensible solutions:
- keep others as intent_featurizer_x and call this either featurizer_regex or intent_entity_featurizer_regex
- call everything featurizer_x and rely on docs to explain the difference
I do like the intent_entity_featurizer_regex, didn't think of that. Don't want to rename all the others as that breaks any trained model.
docs/pipeline.rst
Outdated
:Short: regex feature creation to support intent and entity classification
:Outputs: ``text_features`` and ``tokens.pattern``
:Description:
    During training, the regex intent featurizer creates a list of `regular expressions` collected from the training data.
'collected from' is awkward wording I think. Should we just say 'defined in the training data format' ?
@@ -153,12 +154,24 @@ def rasa_nlu_data_schema():
        "required": ["text"]
    }

    regex_feature_schema = {
do we know if the LUIS regex syntax is valid python regex?
will have a look. need to implement the reading in of luis regular expressions from their format anyway.
yes, format looks good 👍
    'bias': lambda doc: 'bias',
    'upper': lambda doc: str(doc[0].isupper()),
    'digit': lambda doc: str(doc[0].isdigit()),
    'pattern': lambda doc: doc[2],
so just a binary feature to indicate that it matched one of the patterns, right?
class Featurizer(object):
    pass

class Featurizer(Component):
nice! this is much cleaner than what we had before
I went for the zip code example because it fit into the restaurant example data. It does work better with the regex feature (you are right, the CRF would probably pick them up anyway, but this way I only needed 2 examples; removing the regex features results in the CRF not finding the zip codes ;))
🎉
Proposed changes:
Regex features can be used as training data in rasa_data.json with the following format (e.g.):
"regex_features": [
    {
        "name": "number_matcher",
        "intent": "provide_number",
        "pattern": "[0-9]+"
    },
    {
        "name": "number",
        "entity": "insurance_number",
        "pattern": "[0-9]+"
    }
]
Both converters.py and training_data.py have been modified to handle this.
regex_intent_classifier.py returns the intent associated with a regexp that matches the input, in the same format as the keyword component.
regex_entity_extractor.py returns the set of entities whose regexps, as specified in the training data, match the input.
Both components have been added to registry.py
Status: