
Residual connection fix (and highway layer) #11

Merged
merged 2 commits on May 29, 2017

Conversation

Spotlight0xff
Contributor

Hi again,

This PR is two-fold.

Residual connection

In my opinion, the residual connection in the CBHG has to be applied from the pre-net output, not from the inputs (the character embedding) directly.

There are two arguments that speak for this interpretation:

  • The residual connection is mentioned in section 3.1 on the CBHG module, not in the description of the overall model architecture. This indicates that it is part of the CBHG module, so the connection should be applied from the CBHG's own input, which is the pre-net output.
  • You have used width=256 in the encoder CBHG, but the paper mentions 128. As I understand it, 128 doesn't work if the residual connection comes from the inputs (the 256-dimensional character embedding). But if we take the connection from the pre-net output, it works as described in the paper and we can use 128 in the whole encoder (the last pre-net layer has 128 units); see the sketch below.
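
To make the size argument concrete, here is a minimal TF 1.x sketch of where the residual would be taken from under this reading. The function and variable names are mine for illustration, not the repo's exact code, and the conv1d bank/max-pooling stages are left out:

```python
import tensorflow as tf

def encoder_cbhg_sketch(embedded_inputs, is_training=True):
    """Minimal sketch (TF 1.x): take the residual from the pre-net output,
    not from the 256-dim character embedding, so the encoder can use 128 units
    throughout as in the paper. The conv1d bank and max-pooling are omitted."""
    # Pre-net: 256 -> 128 units, with dropout as in the paper.
    h = tf.layers.dense(embedded_inputs, 256, activation=tf.nn.relu)
    h = tf.layers.dropout(h, rate=0.5, training=is_training)
    prenet_out = tf.layers.dense(h, 128, activation=tf.nn.relu)
    prenet_out = tf.layers.dropout(prenet_out, rate=0.5, training=is_training)

    # Conv1D projections back down to 128 units (conv bank/pooling omitted).
    proj = tf.layers.conv1d(prenet_out, filters=128, kernel_size=3,
                            padding="same", activation=tf.nn.relu)
    proj = tf.layers.conv1d(proj, filters=128, kernel_size=3, padding="same")

    # Residual connection from the pre-net output: both tensors are 128-dim,
    # so no width=256 workaround is needed.
    return proj + prenet_out
```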

Hope this convinces you, if you have any concerns please comment.

Highway-Net

The second part of this PR is aimed at the code quality of the highway-net construction.
In particular, it uses tf.layers.dense instead of the custom dense function, and it defaults to the number of units of the input if not specified otherwise.

Using tf.layers.dense simplifies the code, in my opinion (we could replace the custom dense in the pre-net as well).
Also, I haven't added batch normalization to the highway-net construction: I haven't seen batchnorm used much in fully-connected layers (and didn't notice much difference), and the paper only mentions batch norm for the conv nets.
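
For reference, a rough sketch (not the exact diff) of what a tf.layers.dense-based highway layer looks like, with the default number of units taken from the input's last dimension:

```python
import tensorflow as tf

def highwaynet(inputs, num_units=None, scope="highwaynet"):
    """Rough sketch of a highway layer built on tf.layers.dense.
    If num_units is not given, it defaults to the input's last dimension."""
    if num_units is None:
        num_units = inputs.get_shape().as_list()[-1]
    with tf.variable_scope(scope):
        # Transform path H and gate T. Initializing the gate bias negative
        # (as in the highway-networks paper) makes the layer start out close
        # to an identity mapping; this detail is optional.
        H = tf.layers.dense(inputs, num_units, activation=tf.nn.relu, name="H")
        T = tf.layers.dense(inputs, num_units, activation=tf.nn.sigmoid, name="T",
                            bias_initializer=tf.constant_initializer(-1.0))
        return H * T + inputs * (1.0 - T)
```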

Oh, and I dropped the hard-coded hp.embed_size, so the code is more modular (and it's also wrong IMO; see the first part of this PR).

Cheers,
André

@Kyubyong
Owner

For Residual Connection:
Yes, I think you're right. I also noticed that the layer size doesn't match, which I couldn't understand at that time.

For Highway-Net:
According to the original paper (https://arxiv.org/abs/1502.03167), batch normalization works well for dense layers, too. However, I agree that it doesn't make a big difference.

Thanks so much, André!

@Kyubyong Kyubyong merged commit 9adf6ba into Kyubyong:master May 29, 2017
@ethson

ethson commented May 30, 2017

Have you guys tested these changes and seen an improvement on results?

@Spotlight0xff
Contributor Author

I've made a few changes, including this one (and increased the mini-batch size to 32 as in the paper), and got to about loss=0.15. The samples were still incomprehensible, but I only trained for a few hours (it seemed to converge...).

I guess we need to do more work/debugging.

@onyedikilo

I've also made some changes to the variables, trying to match them to the paper after applying the code as it is in this PR. The loss is around 0.10 at the moment; I'll run it longer to see if it gets better. My batch size is 6, and the learning rate is currently 0.0005.

With André's changes, I think we also need to update the code wherever embed_size is used. It was 256 before and the code is based on that; after changing it to 128, some modifications are needed to reproduce the paper's values for the network.

@Spotlight0xff Spotlight0xff deleted the residual_connection_fix branch May 31, 2017 07:19