
Dropout before max pooling killing embedding components during training #10

Closed
KieranLitschel opened this issue Jan 3, 2021 · 1 comment
KieranLitschel commented Jan 3, 2021

When a unit is dropped out, its value is set to 0. As we apply dropout directly to the word embeddings, for long input sequences it becomes increasingly likely that at least one value in each dimension will be set to zero. This means that negative components can often die: once a dropout zero is present in a dimension, that zero is taken as the maximum instead of any negative value, so those components are never selected, never receive a gradient, and get stuck at their negative values.
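A minimal, self-contained sketch of this failure mode (plain NumPy, not the project's actual model code): once dropout puts a single zero into a dimension, the max over the sequence is at least 0, so a negative component in that dimension can never win the max.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, emb_dim, drop_rate = 50, 8, 0.8  # hypothetical sizes

# Hypothetical word embeddings for one input sequence, initialized around zero.
embedded_seq = rng.normal(loc=0.0, scale=0.1, size=(seq_len, emb_dim))

# Standard inverted dropout: zero out units, scale the survivors.
keep = rng.random((seq_len, emb_dim)) >= drop_rate
dropped = np.where(keep, embedded_seq / (1.0 - drop_rate), 0.0)

# SWEM-max style pooling over the sequence dimension.
pooled = dropped.max(axis=0)

# For a sequence this long, every dimension almost surely contains at least one
# dropout zero, so the pooled value is never negative and negative components
# are never the maximum (and hence never updated).
print("dims containing a dropout zero:", int((~keep).any(axis=0).sum()), "of", emb_dim)
print("min pooled value:", pooled.min())  # >= 0 whenever a zero is present
```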

This is particularly problematic as our distribution for initializing embeddings is centred at zero, meaning around half of the components are initialized to values below zero. The histogram below exemplifies this issue.

[Image: histogram of embedding component values with a dropout rate of 0.8]

One possible solution is to initialize all embedding weights with values greater than zero. This should significantly reduce the number of dying units, but units will still die if a gradient update pushes them below zero.
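For illustration, a sketch of what that could look like (assuming a Keras embedding layer with a uniform initializer; the project's actual framework and initializer are not shown in this issue):

```python
import tensorflow as tf

vocab_size, emb_dim = 10_000, 300  # hypothetical sizes

embedding = tf.keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=emb_dim,
    # Uniform over (0, 0.1] instead of a zero-centred distribution, so every
    # component starts above zero. Components can still die later if an
    # update pushes them below zero.
    embeddings_initializer=tf.keras.initializers.RandomUniform(minval=1e-6, maxval=0.1),
)
```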

A better solution would be to ignore zeros during the max-pooling operation, but this may slow down training significantly, which would make the first solution preferable.
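A sketch of how zeros could be ignored (again only an illustration, assuming TensorFlow; the helper name is made up): replace exact zeros with a very large negative number before taking the max, so a dropout zero can never be selected.

```python
import tensorflow as tf

def max_pool_ignoring_zeros(x, axis=1):
    """Max over `axis`, treating exact zeros (dropped units) as missing.

    x: float tensor, e.g. of shape (batch, seq_len, emb_dim) with axis=1.
    Caveat: if every value in a dimension happens to be dropped, the pooled
    value falls back to -1e9 rather than 0, and the extra masking work is
    done on every forward pass, which is the likely source of the slowdown.
    """
    very_negative = tf.zeros_like(x) - 1e9
    masked = tf.where(tf.equal(x, 0.0), very_negative, x)
    return tf.reduce_max(masked, axis=axis)
```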

@KieranLitschel KieranLitschel self-assigned this Jan 3, 2021
@KieranLitschel KieranLitschel changed the title Dropout before max pooling killing units during training Dropout before max pooling killing embedding components during training Jan 3, 2021
KieranLitschel commented

It seems like the main cause of the above distribution was too high a dropout rate. We were using a dropout rate of 0.8; after switching to a dropout rate of 0.2, we get the distribution below, which looks much better.

[Image: histogram of embedding component values with a dropout rate of 0.2]
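One rough way to see why the rate matters so much (a back-of-the-envelope sketch, not an analysis from the issue): if dropout happens to zero out every value of a dimension for a given example, that dimension pools to exactly 0 and its embeddings receive no gradient at all for that example. The chance of this is drop_rate ** seq_len, which is non-negligible at 0.8 but vanishing at 0.2.

```python
seq_len = 10  # hypothetical short input sequence

for drop_rate in (0.8, 0.2):
    # Probability that every value of a dimension is dropped for one example.
    p_all_dropped = drop_rate ** seq_len
    print(f"drop_rate={drop_rate}: P(whole dimension dropped) = {p_all_dropped:.1e}")

# drop_rate=0.8: P(whole dimension dropped) = 1.1e-01  (roughly 1 in 10)
# drop_rate=0.2: P(whole dimension dropped) = 1.0e-07  (essentially never)
```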

We explored shifting the centre of the initialization right by 0.05 so that all initialized values would be greater than or equal to zero. The distribution with this modification is shown below.

[Image: histogram of embedding component values with the initialization centre shifted right by 0.05]

We observe the same pattern as with the zero-centred distribution, with half the values appearing to have stayed at their initialized values. Surprisingly, the distributions look very similar, just with the centre shifted.

So it now seems more likely that this behaviour is caused by the max-pooling layer, with a lot of embedding components simply never being selected by the max, and so never updated, during training.
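This is consistent with how the gradient of a max pool behaves: only the position that wins the max receives any gradient. A small illustrative check (assuming TensorFlow; not code from this repository):

```python
import tensorflow as tf

# Hypothetical embeddings for one sequence of 50 tokens with 8 dimensions.
emb = tf.Variable(tf.random.normal((50, 8), stddev=0.1))

with tf.GradientTape() as tape:
    pooled = tf.reduce_max(emb, axis=0)  # SWEM-max over the sequence
    loss = tf.reduce_sum(pooled)         # dummy loss

grad = tape.gradient(loss, emb)
# Only one entry per dimension (the argmax) gets a non-zero gradient, so at
# most 8 of the 400 components are touched by this example's update.
print("non-zero gradient entries:",
      int(tf.math.count_nonzero(grad).numpy()), "of", emb.shape[0] * emb.shape[1])
```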

Hence this seems to be more a property of SWEM-max than a bug, so we are closing this issue.
