Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flexible handling of non-existent message_attribute for given training sample #4445

JulianGerhard21 opened this issue Sep 12, 2019 · 4 comments


Copy link

@JulianGerhard21 JulianGerhard21 commented Sep 12, 2019

Description of Problem:

Currently, the SpacyNLP component provides the following:

provides = ["spacy_doc", "spacy_nlp", "intent_spacy_doc", "response_spacy_doc"]

which caused the necessity to handle non-existent / None-valued attributes for a given training sample. Currently this is realized by converting None values to empty strings since spaCy can't handle None values while creating its Doc-objects upon them.

Since simply filtering out those training samples and therefore disobey their order would cause consecutive problems, we need to find a more flexible solution.

Overview of the Solution:
I am going to think about a robust solution and update this issue likewise.

If there are no samples for the response-attribute, currently this results in a list of empty Doc-objects while calling pipe on:

docs = [doc for doc in self.nlp.pipe(texts, batch_size=50)]

[, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ]

since an empty string is valid for Doc-objects but in fact is a problem for e.g. libraries like spacy-pytorch-transformers or other custom-components which can't handle this cases properly.

The coresponding forum entry to this conversation can be found here @dakshvar22


This comment has been minimized.

Copy link

@sara-tagger sara-tagger commented Sep 13, 2019

Thanks for submitting this feature request 🚀@dakshvar22 will get back to you about it soon!


This comment has been minimized.

Copy link
Contributor Author

@JulianGerhard21 JulianGerhard21 commented Sep 13, 2019

Hi @dakshvar22 ,

I broke my head about a solution at this point. I tried several scenarios and I am now able to comprehend your problems with the architectural decision in this situation.

At least for now I'd say that it is more or less impossible to change things on the spaCy-pytorch-transformers side without the help of spaCy. There are several problems with the pipe-method of that library, one of them is, that they extended the normal spaCy library by ._.pytt_word_pieces which relies on the fact that those transformers use masking.

The simple conclusion is: They are doing things with their Doc-object that simply can't be done with an empty Doc - at the moment, e.g. because actually there are no sents or Token objects in this Doc.

I then came back to the idea to question this part:

        if text is None:
            # converted to empty string so that it can still be passed to spacy.
            # Another option could be to neglect tokenization of the attribute of this example, but since we are
            # processing in batch mode, it would get complex to collect all processed and neglected examples.
            text = ""

The more I thought about that the more I could feel with your struggle. On idea was to change "" into " " - which would at least pass in a regular Doc-object that actually has a one-lengthed-sentence Span object and that solution worked until:

ids = torch.as_tensor(ids, dtype=torch.float64)

in pytt, which brought up the error of pytorch's inability to handle zero entries in tensors.

I understand your thoughts about obeying the order of the attribute_docs entries but I think it at least would be possible to filter out None/"" entries, handle them properly and merging this back into the former order. The question then would be:

What should/could we do with them?

I am running out of ideas.


This comment has been minimized.

Copy link

@dakshvar22 dakshvar22 commented Sep 13, 2019

Hey @JulianGerhard21 , thanks for giving this a detailed look. I think, going by your observations, we can't rely on spacy or spacy-pytorch-transformers to help us out here.
So, as you said maybe we should just find an elegant solution to filter the None/"" entries, pass other entries to pipe and merge the empty documents back in the correct order. I think it isn't extremely nasty, so we could do something like -

  • convert each element of texts list inside docs_for_training_data as a tuple of (index,text)
  • Get two lists from this - one without empty texts and one with empty texts.
  • pass non-empty texts list to nlp.pipe, get a docs list and convert them back to a tuple of (index,Doc). Here index would be the original index. A simple zip would do it.
  • Create empty docs for empty text list. Also create a tuple of (index,Doc)
  • Merge the two lists, sort them and get the docs list.

I think it makes sense to do this because people can build custom components based on pre-trained BERT using spacy docs or integrating any other library that comes up and relies on spacy docs. Since we already have a SpacyNLP component, it would make sense to adapt to their Doc object functionality.

What do you think?


This comment has been minimized.

Copy link
Contributor Author

@JulianGerhard21 JulianGerhard21 commented Sep 13, 2019

Hi @dakshvar22,

allright - I agree with you and I am going to start to work on this this afternoon. I will get back to you with a code proposal as soon as it is ready.

Thanks for your help!


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
None yet
3 participants
You can’t perform that action at this time.