Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flexible handling of non-existent message_attribute for given training sample #4445

Closed
JulianGerhard21 opened this issue Sep 12, 2019 · 6 comments
Labels
area:rasa-oss 🎡 Anything related to the open source Rasa framework type:enhancement ✨ Additions of new features or changes to existing ones, should be doable in a single PR

Comments

@JulianGerhard21
Copy link
Contributor

Description of Problem:

Currently, the SpacyNLP component provides the following:

provides = ["spacy_doc", "spacy_nlp", "intent_spacy_doc", "response_spacy_doc"]

which caused the necessity to handle non-existent / None-valued attributes for a given training sample. Currently this is realized by converting None values to empty strings since spaCy can't handle None values while creating its Doc-objects upon them.

Since simply filtering out those training samples and therefore disobey their order would cause consecutive problems, we need to find a more flexible solution.

Overview of the Solution:
I am going to think about a robust solution and update this issue likewise.

Examples:
If there are no samples for the response-attribute, currently this results in a list of empty Doc-objects while calling pipe on:

docs = [doc for doc in self.nlp.pipe(texts, batch_size=50)]

[, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ]

since an empty string is valid for Doc-objects but in fact is a problem for e.g. libraries like spacy-pytorch-transformers or other custom-components which can't handle this cases properly.

The coresponding forum entry to this conversation can be found here @dakshvar22

@JulianGerhard21 JulianGerhard21 added the type:enhancement ✨ Additions of new features or changes to existing ones, should be doable in a single PR label Sep 12, 2019
@dakshvar22 dakshvar22 added the area:rasa-oss 🎡 Anything related to the open source Rasa framework label Sep 12, 2019
@sara-tagger
Copy link
Collaborator

sara-tagger commented Sep 13, 2019

Thanks for submitting this feature request 🚀@dakshvar22 will get back to you about it soon!✨

@JulianGerhard21
Copy link
Contributor Author

Hi @dakshvar22 ,

I broke my head about a solution at this point. I tried several scenarios and I am now able to comprehend your problems with the architectural decision in this situation.

At least for now I'd say that it is more or less impossible to change things on the spaCy-pytorch-transformers side without the help of spaCy. There are several problems with the pipe-method of that library, one of them is, that they extended the normal spaCy library by ._.pytt_word_pieces which relies on the fact that those transformers use masking.

The simple conclusion is: They are doing things with their Doc-object that simply can't be done with an empty Doc - at the moment, e.g. because actually there are no sents or Token objects in this Doc.

I then came back to the idea to question this part:

        if text is None:
            # converted to empty string so that it can still be passed to spacy.
            # Another option could be to neglect tokenization of the attribute of this example, but since we are
            # processing in batch mode, it would get complex to collect all processed and neglected examples.
            text = ""

The more I thought about that the more I could feel with your struggle. On idea was to change "" into " " - which would at least pass in a regular Doc-object that actually has a one-lengthed-sentence Span object and that solution worked until:

ids = torch.as_tensor(ids, dtype=torch.float64)

in pytt, which brought up the error of pytorch's inability to handle zero entries in tensors.

I understand your thoughts about obeying the order of the attribute_docs entries but I think it at least would be possible to filter out None/"" entries, handle them properly and merging this back into the former order. The question then would be:

What should/could we do with them?

I am running out of ideas.

@dakshvar22
Copy link
Contributor

Hey @JulianGerhard21 , thanks for giving this a detailed look. I think, going by your observations, we can't rely on spacy or spacy-pytorch-transformers to help us out here.
So, as you said maybe we should just find an elegant solution to filter the None/"" entries, pass other entries to pipe and merge the empty documents back in the correct order. I think it isn't extremely nasty, so we could do something like -

  • convert each element of texts list inside docs_for_training_data as a tuple of (index,text)
  • Get two lists from this - one without empty texts and one with empty texts.
  • pass non-empty texts list to nlp.pipe, get a docs list and convert them back to a tuple of (index,Doc). Here index would be the original index. A simple zip would do it.
  • Create empty docs for empty text list. Also create a tuple of (index,Doc)
  • Merge the two lists, sort them and get the docs list.

I think it makes sense to do this because people can build custom components based on pre-trained BERT using spacy docs or integrating any other library that comes up and relies on spacy docs. Since we already have a SpacyNLP component, it would make sense to adapt to their Doc object functionality.

What do you think?

@JulianGerhard21
Copy link
Contributor Author

Hi @dakshvar22,

allright - I agree with you and I am going to start to work on this this afternoon. I will get back to you with a code proposal as soon as it is ready.

Thanks for your help!

Regards
Julian

@stale
Copy link

stale bot commented Feb 3, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the status:stale label Feb 3, 2020
@tmbo tmbo removed the status:stale label Feb 3, 2020
@joejuzl
Copy link
Contributor

joejuzl commented Jan 28, 2021

Closing as this is in a minor release around 1.3.x.

@joejuzl joejuzl closed this as completed Jan 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:rasa-oss 🎡 Anything related to the open source Rasa framework type:enhancement ✨ Additions of new features or changes to existing ones, should be doable in a single PR
Projects
None yet
Development

No branches or pull requests

5 participants