
Question about GRU-D implementation #1

Closed
ducnx opened this issue Aug 26, 2020 · 2 comments

@ducnx

ducnx commented Aug 26, 2020

Hi there, I have a question about how dp_mask for x_t and m_dp_mask for m_t are computed in your GRU-D implementation (file gru_d.py).

First, dp_mask is generated by the GRUCell built-in method get_dropout_mask_for_cell: code
Then, the dropout mask m_dp_mask for the masking vector m_t is generated by calling _generate_dropout_mask: code
Because the two masks are drawn independently, dp_mask and m_dp_mask zero out different elements of the two inputs x_t and m_t. I can reproduce your results; however, I think the dropout masks should be identical for x_t and m_t. Could you please clarify this for me? Did I misunderstand something in the core TensorFlow implementation or in your implementation?
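To illustrate what I mean, here is a minimal standalone sketch (not the actual gru_d.py code; the shapes and rate are made up) of how two independently drawn masks end up zeroing out different positions:

```python
import tensorflow as tf

rate = 0.3
x_t = tf.random.normal([8, 16])                               # input features at step t
m_t = tf.cast(tf.random.uniform([8, 16]) > 0.5, tf.float32)   # observation mask at step t

# Mask for x_t, analogous to what the cell's dropout helper produces.
dp_mask = tf.cast(tf.random.uniform(tf.shape(x_t)) >= rate, x_t.dtype) / (1.0 - rate)
# A second, independently drawn mask for m_t.
m_dp_mask = tf.cast(tf.random.uniform(tf.shape(m_t)) >= rate, m_t.dtype) / (1.0 - rate)

x_dropped = x_t * dp_mask
m_dropped = m_t * m_dp_mask   # zeros out different elements than x_dropped
```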

Thanks for the great work!

@ExpectationMax
Collaborator

Hey ducnx,

Good catch! It seems this behavior is not fully in line with the original GRU-D publication:

We apply recurrent dropout with rate of 0.3 with same dropout samples at each time step on weights W, U, V.

Here $V$ is the weight matrix applied to the input mask.

It looks like we retained the behavior of the implementation this code is based on (see https://github.com/PeterChe1990/GRU-D) and did not notice the discrepancy with the original GRU-D implementation.
Unfortunately, our publication is already in print, so we will not be able to update the description of GRU-D there.

To my understanding, dropping out the input mask separately does not cause values that are actually present to be imputed. Instead, it reduces the model's reliance on observation patterns relative to the original implementation, because the independent dropout mask makes it harder for the model to differentiate observed from imputed values. Overall, I think it should not have a detrimental effect on GRU-D performance and might even be beneficial to a certain degree.
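For comparison, a rough sketch of how the masks could be tied (assuming, as in our code, that dropout is applied to the inputs x_t and m_t rather than to the weight matrices themselves; the names and shapes are illustrative only):

```python
import tensorflow as tf

rate = 0.3
x_t = tf.random.normal([8, 16])                               # input features at step t
m_t = tf.cast(tf.random.uniform([8, 16]) > 0.5, tf.float32)   # observation mask at step t

# Draw a single dropout mask and reuse it, so the same elements are
# zeroed out in both x_t and m_t.
shared_mask = tf.cast(tf.random.uniform(tf.shape(x_t)) >= rate, x_t.dtype) / (1.0 - rate)
x_dropped = x_t * shared_mask
m_dropped = m_t * shared_mask
```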

Anyway, thanks for pointing this out. I will think of a way to make this apparent to readers of the paper and users of the code. Until then I will leave this issue open.

Cheers,
Max

@ducnx
Author

ducnx commented Aug 27, 2020

Thanks for the comment!

I agree that the behavior of the original implementation is similar to yours. As you explained, using a different dropout mask for m_t leads to stronger regularization than reusing the mask for x_t. Maybe the performance won't differ much from using the same dropout mask; I'll check that out.
