work on regex intent classifier #417
Conversation
    def process(self, text):
        # type: (Text) -> Dict[Text, Any]

        return {
If no intent is matched with any of the regex expressions, the function shouldn't return any intent.
I tried to do this in line 58 - if there's no match between any regexp and the input, intent = None. Should I change it?
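The behaviour being discussed could look roughly like this (a minimal sketch; the class name, the regex_dict shape, and the return format are assumptions for illustration, not the PR's actual code):

```python
import re


class RegexIntentClassifier(object):
    """Minimal sketch of the classifier discussed above (not the PR's code)."""

    def __init__(self, regex_dict=None):
        # maps regex pattern -> intent name
        self.regex_dict = regex_dict if regex_dict is not None else {}

    def parse(self, text):
        for exp, intent in self.regex_dict.items():
            if re.search(exp, text) is not None:
                return intent
        return None  # no match: return None, not the string "None"

    def process(self, text):
        intent = self.parse(text)
        if intent is None:
            return {}  # no regex matched, so don't return any intent
        return {"intent": {"name": intent, "confidence": 1.0}}
```

The key point from the review: when nothing matches, `process` returns no intent at all rather than an intent literally named "None".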
        self.clf = clf
        self.regex_dict = {}

    def train(self, training_data):
The incoming training_data parameter is of type TrainingData (can be found in training_data.py). Before the data gets passed through the components in the pipeline, it will be read here: https://github.com/RasaHQ/rasa_nlu/blob/master/rasa_nlu/converters.py#L205
That function loads the different parts of the training data file, e.g. the common_examples section (as seen in https://github.com/RasaHQ/rasa_nlu/blob/master/data/examples/rasa/demo-rasa.json#L3).
The regex expressions should be put into the training data file (demo-rasa.json) and read within load_rasa_data in converters.py. The read regex expressions should then be passed to the TrainingData object (https://github.com/RasaHQ/rasa_nlu/blob/master/rasa_nlu/training_data.py#L33) and stored. You can then access these stored regexes in this function.
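A sketch of what reading such a regex_features section could look like (the real load_rasa_data in converters.py reads from a file and handles more sections; the field names follow the demo-rasa.json format discussed above):

```python
import json


def load_rasa_data(file_content):
    """Sketch: split a rasa training file into examples and regex features.

    The real load_rasa_data in converters.py differs; this only illustrates
    where a regex_features section would be picked up.
    """
    data = json.loads(file_content)["rasa_nlu_data"]
    common_examples = data.get("common_examples", [])
    # the new section discussed in this review:
    regex_features = data.get("regex_features", [])
    return common_examples, regex_features
```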
Ah! Hadn't realised that training_data was its own data type... So should the regexp be in the "text" part of the training data file?
…data. Updated the train function for the intent classifier accordingly and wrote the entity extractor component. Added both components to registry.py.
Nice work! Already looks pretty good. I have added a couple of remarks and there are a couple of things that need to be done before we can merge this:
- persist components: currently it is not possible to save the regex components (i.e. the regex_dict will be lost after storing and loading the component). The components should implement load and persist, somewhat along the lines of https://github.com/RasaHQ/rasa_nlu/blob/master/rasa_nlu/extractors/duckling_extractor.py#L102
- parse regex expressions from luis data
- add some tests (have a look at the existing tests, e.g. _pytest/test_extractors.py)
- add some information to the documentation about the training data format and the component
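For the persist/load point, a minimal sketch of what those two methods could look like (the file name and the metadata return value are assumptions; see the duckling_extractor.py link above for the real pattern):

```python
import json
import os


class RegexIntentClassifier(object):
    """Sketch of persist/load for a regex component (not the PR's code)."""

    def __init__(self, regex_dict=None):
        self.regex_dict = regex_dict if regex_dict is not None else {}

    def persist(self, model_dir):
        # write the regex -> intent mapping next to the rest of the model
        file_name = os.path.join(model_dir, "regex_intent_classifier.json")
        with open(file_name, "w") as f:
            json.dump(self.regex_dict, f)
        # metadata so the component can find its file again on load
        return {"regex_file": "regex_intent_classifier.json"}

    @classmethod
    def load(cls, model_dir, regex_file="regex_intent_classifier.json"):
        with open(os.path.join(model_dir, regex_file)) as f:
            return cls(json.load(f))
```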
        self.regex_dict = {}

    def train(self, training_data):
        # build regex: intent dict from training data
Should be a function docstring, e.g. """Build regex: ...""" instead of # ..., describing the functionality of the function.
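A sketch of the suggested style, with the filtering on "intent" mirroring the quoted train loop (the free-function form and input shape are just for illustration):

```python
def train(regex_features):
    """Build a regex: intent dict from the training data's regex features."""
    # keep only the regex features that carry an intent label
    return {ex["pattern"]: ex["intent"]
            for ex in regex_features if "intent" in ex}
```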
    def train(self, training_data):
        # build regex: intent dict from training data
        for example in training_data.regex_features:
            if ("intent" in example):
No brackets around the if condition are necessary.
    def process(self, text):
        # type: (Text) -> Dict[Text, Any]
        return {
it should rather be something along the lines of:

result = self.parse(text)
if result is not None:
    return ...
            search = (re.search(exp, text) != None)
            if search:
                return intent
        return "None"
should rather be None instead of "None"
        #def is_present(x): return x in _text

        for exp, intent in self.regex_dict.items():
            search = (re.search(exp, text) != None)
avoid the intermediate variable and use the expression directly in the following if
    def train(self, training_data):
        # build regex: intent dict from training data
        for example in training_data.regex_features:
            if ("entity" in example):
no ( and ) needed
    output_provides = ["entities"]

    def __init__(self, clf=None, regex_dict=None):
        self.clf = clf
needed?
        # type: (Doc, Language, List[Dict[Text, Any]]) -> Dict[Text, Any]

        entities = self.extract_entities(text)
        return {
To be able to compose this with other entity extractors, this should look something like this (feel free to have a look at the other extractors):

def process(self, text, entities):
    extracted = self.extract_entities(text)
    entities.extend(extracted)
    return {
        "entities": entities
    }
            ent = regexp.search(text)
            if ent != None:
                entity = {
                    "entity": str(entity),
str is not available automatically in python 3, it needs to be imported from builtins, i.e. from builtins import str. Thinking about it, is the str(...) even necessary?
                entity = {
                    "entity": str(entity),
                    "value": str(ent.group()),
                    "start": int(ent.start()),
isn't .start already an int?
Sort of! The function is returning unicode strings and I wasn't sure if this was an issue?
^ same for the use of str(...) for "entity" and "value"
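Putting the remarks above together, a sketch of the extraction loop without the redundant casts (the regex_dict shape is an assumption): re.search already yields int offsets from .start()/.end() and a string from .group().

```python
import re


def extract_entities(regex_dict, text):
    """Sketch: collect an entity for every regex that matches the text."""
    entities = []
    for exp, entity_name in regex_dict.items():
        match = re.search(exp, text)
        if match is not None:  # prefer "is not None" over "!= None"
            entities.append({
                "entity": entity_name,
                "value": match.group(),  # already a string, no str(...) needed
                "start": match.start(),  # already an int
                "end": match.end(),
            })
    return entities
```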
… pipeline templates
looks really good - left some comments / questions for clarification.
One thing, have we actually tested whether this performs better on any particular dataset?
I'm not sure if zip codes are the best example, since the 'digit' feature will presumably already account for this.
Maybe something a little different: hyphenated words maybe? Or words ending in 'ing' (verbs)?
        return message.get("text_features")

    def features_for_patterns(self, message):
        """Given a sentence, returns a vector of {1,0} values indicating which regexes match"""
I guess this is no longer a pure function, right? It has side effects in the call to t.set?
yes :(
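A side-effect-free sketch of the {1,0} feature vector itself (the pattern list is a made-up example; the real component additionally tags tokens, which is the side effect discussed above):

```python
import re

# example patterns, not the ones from the PR
known_patterns = ["[0-9]+", "hey[^\\s]*"]


def features_for_patterns(text):
    """Return a {1,0} vector indicating which of the known regexes match."""
    return [1.0 if re.search(p, text) is not None else 0.0
            for p in known_patterns]
```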
        ("hey how are you today", [0., 1.], [0]),
        ("hey 123 how are you", [1., 1.], [0, 1]),
        ("blah balh random eh", [0., 0.], []),
        ("looks really like 123 today", [1., 0.], [3]),
so what do we give to the CRF in terms of features? just a binary feature for each token which says whether any regex matched it?
No, it's a text feature saying which regex matched (can only be one though; it will always be the last matched one if there are multiple).
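That "last matched one wins" behaviour can be sketched per token like this (the pattern names and the "N/A" fallback are assumptions for illustration):

```python
import re

# example name -> pattern mapping, not the PR's actual patterns
patterns = {"number": "[0-9]+", "greet": "hey[^\\s]*"}


def pattern_feature_for_tokens(tokens):
    """For each token, keep the name of the last pattern that matched it."""
    features = []
    for token in tokens:
        matched = "N/A"
        for name, exp in patterns.items():
            if re.search(exp, token) is not None:
                matched = name  # later matches overwrite earlier ones
        features.append(matched)
    return features
```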
docs/dataformat.rst
Outdated
this regex is used for. As you can see in the above example, you can also use the regex features to improve the intent
classification performance.

Try to create your regular expressions in a wy that they match as few words as possible. E.g. using ``hey[^\s]*``
typo : 'wy'
docs/pipeline.rst
Outdated
@@ -173,6 +173,18 @@ intent_classifier_sklearn
to other classifiers it also provides rankings of the labels that did not "win". The spacy intent classifier
needs to be preceded by a featurizer in the pipeline. This featurizer creates the features used for the classification.

intent_featurizer_regex
not something we have to resolve right now, but with this change featurizers are no longer strictly intent_featurizers
I know, I was thinking about changing it to featurizer_regex. Either one has its advantages. What do you think, rename?
I can see two sensible solutions:
- keep others as intent_featurizer_x and call this either featurizer_regex or intent_entity_featurizer_regex
- call everything featurizer_x and rely on docs to explain the difference
I do like the intent_entity_featurizer_regex, didn't think of that. Don't want to rename all the others as that breaks any trained model.
docs/pipeline.rst
Outdated
:Short: regex feature creation to support intent and entity classification
:Outputs: ``text_features`` and ``tokens.pattern``
:Description:
    During training, the regex intent featurizer creates a list of `regular expressions` collected from the training data.
'collected from' is awkward wording I think. Should we just say 'defined in the training data format' ?
@@ -153,12 +154,24 @@ def rasa_nlu_data_schema():
        "required": ["text"]
    }

    regex_feature_schema = {
do we know if the LUIS regex syntax is valid python regex?
will have a look. need to implement the reading in of luis regular expressions from their format anyway.
yes, format looks good 👍
    'bias': lambda doc: 'bias',
    'upper': lambda doc: str(doc[0].isupper()),
    'digit': lambda doc: str(doc[0].isdigit()),
    'pattern': lambda doc: doc[2],
so just a binary feature to indicate that it matched one of the patterns, right?
class Featurizer(object):
    pass

class Featurizer(Component):
nice! this is much cleaner than what we had before
I went for the zip code example because it fit into the restaurant example data. It does work better with the regex feature (you are right, the CRF would probably pick them up anyway, but this way I only needed 2 examples; removing the regex features results in the CRF not finding the zip codes ;))
🎉
Proposed changes:
Regex features can be used as training data in rasa_data.json with the following format (e.g.):
"regex_features": [
    {
        "name": "number_matcher",
        "intent": "provide_number",
        "pattern": "[0-9]+"
    },
    {
        "name": "number",
        "entity": "insurance_number",
        "pattern": "[0-9]+"
    }
]
Both converters.py and training_data.py have been modified to handle this.
regex_intent_classifier.py returns the intent associated with a regexp that matches the input, in the same format as the keyword component.
regex_entity_extractor.py returns the set of entities whose regexps, as specified in the training data, match the input.
Both components have been added to registry.py
Status: