Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

work on regex intent classifier #417

Merged
merged 29 commits into from
Jul 6, 2017
Merged

work on regex intent classifier #417

merged 29 commits into from
Jul 6, 2017

Conversation

Sarenne
Copy link

@Sarenne Sarenne commented Jun 6, 2017

Proposed changes:
Regex features can be used as training data in rasa_data.json with the following format (eg.): "regex_features": [ { "name": "number_matcher", "intent": "provide_number", "pattern": "[0-9]+" }, { "name": "number", "entity": "insurance_number", "pattern": "[0-9]+" }
Both converters.py and training_data.py have been modified to handle this.

  • regex_intent_classifier.py returns the respective intent of a regexp present in input in the same format as the keyword component.

  • regex_entity_extractor.py returns the set of entities that match regexps that are specified in the training data and present in input.
    Both components have been added to registry.py

Status:

  • ready for code review
  • there are tests for the functionality
  • documentation updated
  • changelog updated

@coveralls
Copy link

coveralls commented Jun 6, 2017

Coverage Status

Coverage decreased (-0.6%) to 92.176% when pulling 9b661ae on regex-component into d945abf on master.

def process(self, text):
# type: (Text) -> Dict[Text, Any]

return {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If no intent is matched with any of the regex expressions, the function shouldn't return any intent.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to do this in line 58 - if there's no match between any regexp and the input, intent = None. Should I change it?

self.clf = clf
self.regex_dict = {}

def train(self, training_data):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The incoming trianing_data parameter is of type TrainingData (can be found in training_data.py). Before the data gets passed through the components in the pipeline the data will be read here: https://github.com/RasaHQ/rasa_nlu/blob/master/rasa_nlu/converters.py#L205

That function loads the different parts of the training data file, e.g. the common_examples section from the training data file (e.g. as seen in https://github.com/RasaHQ/rasa_nlu/blob/master/data/examples/rasa/demo-rasa.json#L3).

The regex expressions should be put into the training data file (demo-rasa.json) and they should be read within the load_rasa_data in the converters.py. The read regex expressions should be passed to the TrainingData object https://github.com/RasaHQ/rasa_nlu/blob/master/rasa_nlu/training_data.py#L33 and stored. You can then access these stored regexes in this function.

Copy link
Author

@Sarenne Sarenne Jun 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! Hadn't realised that training_data was its own data type... So should the regexp should be in the "text" part of the training data file?

@coveralls
Copy link

coveralls commented Jun 7, 2017

Coverage Status

Coverage decreased (-0.4%) to 92.349% when pulling df341c0 on regex-component into d945abf on master.

Sarenne and others added 2 commits June 9, 2017 10:59
…data. Updated the train function for the intent classifier accordingly and wrote the entity extractor component. Added both components to registry.py.
Copy link
Member

@tmbo tmbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! Already looks pretty good. I have added a couple of remarks and there are a couple of things that need to be done before we can merge this:

  • persist components. currently it is not possible to save the regex components (i.e. the regex_dict will be lost after storing and loading the component). The components should implement load and persist, somewhat along the lines of https://github.com/RasaHQ/rasa_nlu/blob/master/rasa_nlu/extractors/duckling_extractor.py#L102
  • parse regex expressions from luis data
  • add some tests (have a look at the existing tests, e.g. _pytest/test_extractors.py )
  • add some information to the documation about the training data format and the component

self.regex_dict = {}

def train(self, training_data):
# build regex: intent dict from training data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be a function comment, e.g. """Build regex: ...""" instead of # ... describing the functionality of the function.

def train(self, training_data):
# build regex: intent dict from training data
for example in training_data.regex_features:
if ("intent" in example):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no brackets around if condition necessary.


def process(self, text):
# type: (Text) -> Dict[Text, Any]
return {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should rather be something along the lines of:

result = self.parse(text)
if result is not None:
  return ...

search = (re.search(exp, text) != None)
if search:
return intent
return "None"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should rather be None instead of "None"

#def is_present(x): return x in _text

for exp, intent in self.regex_dict.items():
search = (re.search(exp, text) != None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

avoid the intermediate variable and use the expression directly in the following if

def train(self, training_data):
# build regex: intent dict from training data
for example in training_data.regex_features:
if ("entity" in example):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no ( & ) needed

output_provides = ["entities"]

def __init__(self, clf=None, regex_dict=None):
self.clf = clf
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needed?

# type: (Doc, Language, List[Dict[Text, Any]]) -> Dict[Text, Any]

entities = self.extract_entities(text)
return {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be able to compose this with other entity extractors, this should look something like this (feel free to have a look at the other extractors):

def process(self, text, entities):
       extracted = self.extract_entities(text)
       entities.extend(extracted)
       return {
           "entities": entities
       }

ent = regexp.search(text)
if ent != None:
entity = {
"entity": str(entity),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

str is not available automatically in python 3, needs to be imported from builtins, i.e. from builtins import str. Thinking about it, is the str(...) even necessary?

entity = {
"entity": str(entity),
"value": str(ent.group()),
"start": int(ent.start()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't .start already an int?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sort of! the function is returning unicode strings and I wasn't sure if this was an issue?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ same for the use of str(...) for "entity" and "value"

Copy link
Contributor

@amn41 amn41 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks really good - left some comments / questions for clarification.

One thing, have we actually tested whether this performs better on any particular dataset?

I'm not sure if zip codes are the best example, since the 'digit' feature will presumably already account for this.

Maybe something a little different, hyphenated words maybe? or words ending in 'ing' (verbs)?

return message.get("text_features")

def features_for_patterns(self, message):
"""Given a sentence, returns a vector of {1,0} values indicating which regexes match"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is no longer a pure function, right? it has side effects in the call to t.set ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes :(

("hey how are you today", [0., 1.], [0]),
("hey 123 how are you", [1., 1.], [0, 1]),
("blah balh random eh", [0., 0.], []),
("looks really like 123 today", [1., 0.], [3]),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so what do we give to the CRF in terms of features? just a binary feature for each token which says whether any regex matched it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no its a text feature saying which regex matched (can only be one though, will always be the lasted matched one if there are multiple)

this regex is used for. As you can see in the above example, you can also use the regex features to improve the intent
classification performance.

Try to create your regular expressions in a wy that they match as few words as possible. E.g. using ``hey[^\s]*``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo : 'wy'

@@ -173,6 +173,18 @@ intent_classifier_sklearn
to other classifiers it also provides rankings of the labels that did not "win". The spacy intent classifier
needs to be preceded by a featurizer in the pipeline. This featurizer creates the features used for the classification.

intent_featurizer_regex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not something we have to resolve right now, but with this change featurizers are no longer strictly intent_featurizers

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know, I was thinking about changing it to featurizer_regex either one has its advantages. what do you think, rename?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see two sensible solutions,

  1. keep others as intent_featurizer_x and call this either featurizer_regex or intent_entity_featurizer_regex

  2. call everything featurizer_x and rely on docs to explain the difference

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do like the intent_entity_featurizer_regex . didn't think of that. don't want to rename all the others as that breaks any trained model.

:Short: regex feature creation to support intent and entity classification
:Outputs: ``text_features`` and ``tokens.pattern``
:Description:
During training, the regex intent featurizer creates a list of `regular expressions` collected from the training data.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'collected from' is awkward wording I think. Should we just say 'defined in the training data format' ?

@@ -153,12 +154,24 @@ def rasa_nlu_data_schema():
"required": ["text"]
}

regex_feature_schema = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we know if the LUIS regex syntax is valid python regex?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will have a look. need to implement the reading in of luis regular expressions from their format anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, format looks good 👍

'bias': lambda doc: 'bias',
'upper': lambda doc: str(doc[0].isupper()),
'digit': lambda doc: str(doc[0].isdigit()),
'pattern': lambda doc: doc[2],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so just a binary feature to indicate that it matched one of the patterns, right?

class Featurizer(object):
pass

class Featurizer(Component):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice ! this is much cleaner than what we had before

@tmbo
Copy link
Member

tmbo commented Jul 6, 2017

I went for the zip code example because it fit into the restaurant example data. It does work better with the regex feature (you are right, the crf would probably pick them up anyway but this way I only needed 2 examples, removing the regex features results in the crf not finding the zipcodes ;))

@tmbo tmbo merged commit 76fe1e3 into master Jul 6, 2017
@tmbo tmbo deleted the regex-component branch July 6, 2017 14:55
@tmbo
Copy link
Member

tmbo commented Jul 6, 2017

🎉

taytzehao pushed a commit to taytzehao/rasa that referenced this pull request Jul 14, 2023
* Test more if env variants

* The correct negation syntax is !=

* Make the Interpolate function support negated booleans from envs

* Move assert := a.New(t) into t.Run

This uncovered that some of the test premisses was wrong and the
Eval Bool function also had flaws

* Remove a stray logrus import

* Add an ACT env set to true

This can be used to skip certain steps that you don't want to run locally
when testing. E.g. steps that sends messages to Slack channels on successful
builds etc.

* Add a description about env.ACT to the readme

* A new attempt at Interpolation and EvalBool

* One small merge fix

* Remove some fmt.Printfs

* Fix some merge conflicts
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants