
Question about GRU-D implementation #1

Closed
ducnx opened this issue Aug 26, 2020 · 2 comments

@ducnx

ducnx commented Aug 26, 2020

Hi there, I have a question about how dp_mask for x_t and m_dp_mask for m_t are computed in your GRU-D implementation (file gru_d.py).

First, dp_mask is generated by the GRUCell built-in method get_dropout_mask_for_cell: code
Then, the dropout mask m_dp_mask for the masking vector m_t is generated by calling _generate_dropout_mask: code
Because the two masks are drawn independently, dp_mask and m_dp_mask zero out different elements of the two inputs x_t and m_t. I can reproduce your results; however, I think the dropout masks should be identical for x_t and m_t. Could you please clarify this for me? Did I misunderstand something in the core TensorFlow implementation or in your implementation?
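To illustrate what I mean, here is a minimal standalone sketch (not the actual gru_d.py code; the shapes and rate are made up) of how two independently drawn masks end up zeroing out different positions:

```python
import tensorflow as tf

rate = 0.3
x_t = tf.random.normal([8, 16])                               # input features at step t
m_t = tf.cast(tf.random.uniform([8, 16]) > 0.5, tf.float32)   # observation mask at step t

# Mask for x_t, analogous to what the cell's dropout helper produces.
dp_mask = tf.cast(tf.random.uniform(tf.shape(x_t)) >= rate, x_t.dtype) / (1.0 - rate)
# A second, independently drawn mask for m_t.
m_dp_mask = tf.cast(tf.random.uniform(tf.shape(m_t)) >= rate, m_t.dtype) / (1.0 - rate)

x_dropped = x_t * dp_mask
m_dropped = m_t * m_dp_mask   # zeros out different elements than x_dropped
```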

Thanks for the great work!

@ExpectationMax
Collaborator

Hey ducnx,

Good catch! It seems this behavior is not fully in line with the original GRU-D publication:

We apply recurrent dropout with rate of 0.3 with same dropout samples at each time step on weights W, U, V.

Here $V$ is the weight matrix applied to the input mask.

It looks like we retained the behavior of the implementation this code is based on (see https://github.com/PeterChe1990/GRU-D) and did not notice the discrepancy with the original GRU-D implementation.
Unfortunately, our publication is already in print, so we will not be able to update the description of GRU-D there.

To my understanding, dropping out the input mask separately does not cause values that are actually present to be imputed. Instead, it reduces the model's reliance on observation patterns relative to the original implementation, because the independent dropout mask makes it harder for the model to differentiate observed from imputed values. Overall, I think it should not have a detrimental effect on GRU-D performance and might even be beneficial to a certain degree.
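For comparison, a rough sketch of how the masks could be tied (assuming, as in our code, that dropout is applied to the inputs x_t and m_t rather than to the weight matrices themselves; the names and shapes are illustrative only):

```python
import tensorflow as tf

rate = 0.3
x_t = tf.random.normal([8, 16])                               # input features at step t
m_t = tf.cast(tf.random.uniform([8, 16]) > 0.5, tf.float32)   # observation mask at step t

# Draw a single dropout mask and reuse it, so the same elements are
# zeroed out in both x_t and m_t.
shared_mask = tf.cast(tf.random.uniform(tf.shape(x_t)) >= rate, x_t.dtype) / (1.0 - rate)
x_dropped = x_t * shared_mask
m_dropped = m_t * shared_mask
```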

Anyway, thanks for pointing this out. I will think of a way to make this apparent to readers of the paper and users of the code. Until then I will leave this issue open.

Cheers,
Max

@ducnx
Author

ducnx commented Aug 27, 2020

Thanks for the comment!

I agree that the behavior of the original implementation is similar to yours. As you explained, using a different dropout mask for m_t leads to stronger regularization than reusing the mask for x_t. Maybe the performance won't differ much from using the same dropout mask; I'll check that out.
