
add conv sequence to sequence model #686

Closed

Conversation

@Superjomn (Contributor) commented Mar 6, 2018

fixes: #687

@Superjomn Superjomn requested a review from lcy-seso March 6, 2018 06:57
@lcy-seso lcy-seso requested a review from guoshengCS March 6, 2018 07:05
        self.atom = Atom(pd.fc, "fc", size=size, dropout=dropout)

    def __call__(self, x):
        # pd.fc takes dims[1:] as the projection's input; reshape to avoid that
Collaborator:

The num_flatten_dims attribute of fc can be used instead of reshaping the input, but it doesn't matter much; this is just a reminder.
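A minimal sketch of that hint (the sizes are hypothetical, and this assumes the old fluid API): with num_flatten_dims=2 on a [B, T, C] input, fc projects only the last dimension, so no reshape is needed around it.

    import paddle.fluid as fluid

    T, C, hidden_size = 20, 128, 256  # hypothetical sizes
    # shape omits the batch dimension; the full input is [B, T, C]
    x = fluid.layers.data(name='x', shape=[T, C], dtype='float32')
    # num_flatten_dims=2 treats [B, T] as the rows and projects only C,
    # producing a [B, T, hidden_size] output without an explicit reshape.
    y = fluid.layers.fc(input=x, size=hidden_size, num_flatten_dims=2)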


        self.fc2 = Linear(embed_dim)

    def forward(self, src_tokens, src_positions):
Collaborator:

Maybe we can also add __call__ in ConvEncoder to unify the interface with other modules and layers; it is also the way modules are called in PyTorch and fairseq-py. Just a personal preference, so feel free to ignore it.
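A minimal sketch of that idea, reusing the forward signature from this PR:

    class ConvEncoder(object):
        def forward(self, src_tokens, src_positions):
            ...  # existing implementation

        def __call__(self, src_tokens, src_positions):
            # Delegating to forward lets the encoder be invoked like any
            # other module or layer, as in PyTorch and fairseq-py.
            return self.forward(src_tokens, src_positions)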

        self.fc2 = Linear(out_embed_dim)
        self.fc3 = Linear(dict_size + 1, dropout=dropout)

    def forward(self, prev_output_tokens, prev_positions, encoder_out):
Collaborator:

Likewise, maybe we can add __call__ in ConvDecoder to unify the interface with other modules and layers, as sketched above for the encoder. Again, just a personal preference, so feel free to ignore it.

                 pos_pad_id,
                 dropout=0.1):
        self.dropout = dropout
        self.embed_tokens = Embedding(dict_size + 1, embed_dim, pad_id)
Collaborator:
I guess dict_size + 1 is used to include the padding index. I would personally suggest handling this special detail outside the module, to keep the module generic.
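For example (a sketch; the constructor signature is assumed from the surrounding diff):

    # The caller folds the padding index into the vocabulary size once,
    # and the module receives a plain size:
    vocab_size = dict_size + 1  # +1 for the padding token, decided by the caller
    decoder = ConvDecoder(vocab_size, embed_dim, pad_id, pos_pad_id, dropout=0.1)
    # ...and inside the module, simply:
    #     self.embed_tokens = Embedding(vocab_size, embed_dim, pad_id)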

            in_channels,
            out_channels * 2,
            kernel_size,
            # padding=pad,
Collaborator:

In the decoder, to guarantee that no future information is available, left_pad, right_pad = [kernel_size - 1, 0] might be the proper setting according to the PyTorch code and the paper. But currently layers.conv can only pad the same size on the left and the right, a limitation PyTorch also suffers from (it seems that layers.sequence_conv supports this padding scheme, but the Python wrapper needs to be enhanced). To achieve left_pad, right_pad = [kernel_size - 1, 0], we may need to use pad = kernel_size - 1 for the conv, followed by a slice/split to remove the future time steps created by the padding, which is also what fairseq-py does. The following is the corresponding code, though I can't find it in the latest fairseq-py repo.

        for i, (out_channels, kernel_size) in enumerate(convolutions):
            pad = kernel_size - 1
            self.projections.append(Linear(in_channels, out_channels)
                                    if in_channels != out_channels else None)
            self.convolutions.append(
                LinearizedConv1d(in_channels, out_channels * 2, kernel_size,
                                 padding=pad, dropout=dropout))
            self.attention.append(AttentionLayer(out_channels, embed_dim)
                                  if attention[i] else None)
            in_channels = out_channels
        for proj, conv, attention in zip(self.projections, self.convolutions, self.attention):
            residual = x if proj is None else proj(x)

            x = F.dropout(x, p=self.dropout, training=self.training)
            x = conv(x)
            x = conv.remove_future_timesteps(x)
            x = F.glu(x)
    def remove_future_timesteps(self, x):
        """Remove future time steps created by padding."""
        if not self._is_incremental_eval and self.kernel_size[0] > 1 and self.padding[0] > 0:
            x = x[:-self.padding[0], :, :]
        return x
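For reference, here is the same pad-then-trim trick in plain PyTorch (a minimal sketch; the class name CausalConv1d is hypothetical, and the layout is batch-first B x C x T rather than fairseq-py's T x B x C):

    import torch.nn as nn

    class CausalConv1d(nn.Module):
        def __init__(self, in_channels, out_channels, kernel_size):
            super(CausalConv1d, self).__init__()
            self.trim = kernel_size - 1
            # Pad both sides by kernel_size - 1; output length becomes T + trim.
            self.conv = nn.Conv1d(in_channels, out_channels, kernel_size,
                                  padding=self.trim)

        def forward(self, x):
            x = self.conv(x)
            # Drop the trailing steps created by the right padding, so the
            # output at position t depends only on inputs at positions <= t.
            return x[:, :, :-self.trim] if self.trim > 0 else x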


        x = self.atom(x)

        x = Op.reshape(x, (B, T, -1))
Collaborator:

Will it encounter the same problem as Transformer, since both B and T may change? Maybe we can use (B, -1, C) to relax the sequence length, but I am not sure whether it will affect the infer-shape of conv. Also, the reshape op has been enhanced: PaddlePaddle/Paddle#9008
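A sketch of the suggested relaxation (assuming Op.reshape forwards to fluid's reshape op, which can infer at most one dimension from -1):

    # Fix B and C and let the time dimension be inferred, so the sequence
    # length may vary across batches.
    x = Op.reshape(x, (B, -1, C))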

                self.attention):
            residual = x if proj is None else proj(x)
            x = Op.dropout(x, self.dropout)
            x = conv(x)
Collaborator:

It seems that the wrapped Conv1D expects input with shape B x T x C, but the preceding x = self._transpose_if_training(x) seems to change the shape to T x B x C; I am a little confused about this. I can see that Conv1D works correctly in the encoder, since the input shape there is B x T x C, while the shape of the conv input here is T x B x C, which may not be suitable for the wrapped Conv1D. Since the wrapped Conv1D uses a keep-length padding scheme, which makes the input and output have the same shape, it may run, but I suspect the computation is incorrect.

For reference, here are the links in fairseq-py: ConvTBC (the encoder convolution) and LinearizedConv1d (the decoder convolution) both expect input with shape T x B x C, which differs from our wrapped Conv1D.
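If the wrapped Conv1D really requires B x T x C, one hypothetical workaround (assuming Op.transpose(x, [i, j]) swaps axes i and j, as in the encoder-out trick later in this PR) would be:

    x = Op.transpose(x, [0, 1])  # T x B x C -> B x T x C
    x = conv(x)
    x = Op.transpose(x, [0, 1])  # back to T x B x C for the remaining steps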

        x = fluid.nets.glu(x, dim=2)

        if attention is not None:
            x = self._transpose_if_training(x)
Collaborator:

I think this is also used when the shape of the input x is T x B x C, to convert it to B x T x C for the attention calculation.
Maybe we can try removing _transpose_if_training and the transpose inside _split_encoder_out, and unify the shapes to B x T x C for now, if we can confirm that our wrapped Conv1D is not suitable for T x B x C.
I am not sure why fairseq-py uses the T x B x C shape for convolution; according to the annotations, it may be faster.

    if sent[-1] != end_id:
        sent.append(end_id)
    if len(sent) < max_len:
        sent += [pad_id for i in xrange(max_len - len(sent))]
Collaborator:

Maybe we can pad the batch data according to the max length in the current batch and allow the max sequence length to vary across batches, but this may also affect model configurations such as the x = Op.reshape(x, (B, T, -1)) above.
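A sketch of per-batch padding (pad_batch is a hypothetical helper; end_id and pad_id follow this PR's code):

    def pad_batch(batch, pad_id, end_id):
        for sent in batch:
            if sent[-1] != end_id:
                sent.append(end_id)
        max_len = max(len(sent) for sent in batch)  # longest in this batch
        return [sent + [pad_id] * (max_len - len(sent)) for sent in batch]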

        encoder_a, encoder_b = encoder_out
        # here just a trick
        encoder_a = Op.transpose(encoder_a, [1, 2])
        encoder_b = Op.transpose(encoder_b, [1, 2])
Collaborator:

Is this because encoder_a, encoder_b = self._split_encoder_out(encoder_out) converts the shape from B x T x C to B x C x T? We may be able to remove these two transposes and use x = pd.matmul(x, encoder_a, transpose_y=False) for the dot-product calculation.
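That is, a sketch of the suggestion (shapes assumed from the discussion above):

    # encoder_a keeps the B x C x T shape produced by _split_encoder_out,
    # so the attention dot product needs no extra transpose:
    # (B, T_dec, C) x (B, C, T_enc) -> (B, T_dec, T_enc)
    attn_scores = pd.matmul(x, encoder_a, transpose_y=False)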

@lcy-seso (Collaborator):

Because the organization of the directory has changed, can you move the code from fluid into fluid/neural_machine_translation? Then we can merge this PR as soon as possible (once it passes code review).

@CLAassistant commented Mar 24, 2020

CLA assistant check
All committers have signed the CLA.

Successfully merging this pull request may close these issues: Need convolutional sequence to sequence model.
4 participants