Modifying the seqToseq objective (cost function) #1104

coollip · 2017-01-10T03:00:14Z

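Hi, I previously tried the seqToseq example in the demo and successfully trained an NMT (machine translation) model. It works fine.

I recently read a paper that brings alignment information into the cost function: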
Wenhu Chen, “guided alignment training for topic-aware neural machine translation”.

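The idea of the paper is fairly simple:
The attention in seqToseq can be viewed as alignment information, but this alignment is not as good as the strong alignment produced by fast_align. The authors want the model to consult the fast_align result during training and adjust the attention accordingly. To do that, they first align the corpus offline with fast_align, then define a cost between the fast_align result and the network's attention, and add it to the cross-entropy cost.

In the resulting objective:
HD is the cross entropy currently used in the seqToseq demo.
G is the align cost defined by the authors.
The final cost function is the weighted sum of HD and G, with w1 and w2 controlling the proportion of each.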

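G, the align cost, is simply a mean squared error. The A in G is a two-dimensional matrix holding the fast_align result: Ati describes the alignment between the t-th token of the target sentence and the i-th token of the src sentence. If fast_align considers the two aligned, Ati is set to 1 (in the actual computation A is normalized so that each row sums to 1).
alpha is the attention in the network.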
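
Written out from the description above, the objective would look roughly like the following. This is a sketch only: the normalization of G (here, averaging over the T target tokens of one sentence pair with I src tokens) and which of w1, w2 multiplies which term are assumptions, not something stated in this post.

```latex
H = w_1 \, G(A, \alpha) + w_2 \, H_D,
\qquad
G(A, \alpha) = \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{I} \left( A_{ti} - \alpha_{ti} \right)^2
```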

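OK, that covers the background. Now for how I tried to implement this in paddle.

1. First I aligned the corpus with fast_align and generated the corresponding A for each sentence pair. A has as many rows as the target sentence has tokens and as many columns as the src sentence has tokens. Based on the fast_align result, I set the corresponding positions of A to 1 and then normalized each row.
2. After step 1, A has to be passed into paddle as training information through the dataprovider.
3. Since the align cost between A and the attention has to be computed later, I first looked at the attention in the demo. Its code is in simple_attention: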
attention_weight = fc_layer(input=m, size=1, act=SequenceSoftmaxActivation(), param_attr=softmax_param_attr, name="%s_softmax" % name, bias_attr=False)

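My understanding is this: simple_attention is executed once per time step on the decoder side. At time step t, attention_weight is a sequence whose length is the src len of the current sentence pair. Each element is a one-dimensional float, and the i-th element is the attention value between the target token at decoding step t and the i-th src token.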
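
To make the shape of attention_weight concrete, here is a minimal plain-Python sketch of a softmax taken over the positions of one sequence, which is what the name SequenceSoftmaxActivation suggests. It is not Paddle code; `sequence_softmax` and the scores are made up for illustration.

```python
import math


def sequence_softmax(scores):
    """Turn one score per src token into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical unnormalized scores for a 3-token src sentence at decoder step t.
scores_t = [2.0, 1.0, 0.1]
attention_weight_t = sequence_softmax(scores_t)

# One weight per src token; together they form a distribution over src positions.
print(len(attention_weight_t), round(sum(attention_weight_t), 6))
```

Each decoder step produces one such list, so a whole sentence pair yields a (target len) x (src len) grid of weights, which is the same shape as A.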

Accordingly, I put the fast_align result A into a similar format, namely
dense_vector_sub_sequence(1)
For example, if the src of a training sentence pair has 3 tokens and the target has 2 tokens, A looks like this:
[
[ [0.5],[0.5],[0] ], //alignment of target token 1 with each src token
[ [0],[0.5],[0.5] ], //alignment of target token 2 with each src token
]
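
The preprocessing described above can be sketched in plain Python as follows. It assumes fast_align's usual output of space-separated, 0-based `src-trg` index pairs per sentence pair; the helper names `alignment_matrix` and `to_sub_sequence` are hypothetical, and leaving unaligned target tokens as all-zero rows is my own assumption.

```python
def alignment_matrix(pharaoh_line, src_len, trg_len):
    """Build the row-normalized matrix A (trg_len x src_len) from "i-j" pairs."""
    a = [[0.0] * src_len for _ in range(trg_len)]
    for pair in pharaoh_line.split():
        src_idx, trg_idx = (int(x) for x in pair.split("-"))
        a[trg_idx][src_idx] = 1.0
    for row in a:
        total = sum(row)
        if total > 0:  # assumption: unaligned target tokens keep an all-zero row
            for i in range(src_len):
                row[i] /= total
    return a


def to_sub_sequence(a):
    """Wrap every weight in a 1-dim vector, matching dense_vector_sub_sequence(1)."""
    return [[[w] for w in row] for row in a]


# Target token 0 aligns with src 0 and 1; target token 1 aligns with src 1 and 2.
sample = to_sub_sequence(alignment_matrix("0-0 1-0 1-1 2-1", src_len=3, trg_len=2))
print(sample)
```

With these inputs the result has the same nesting as the example above: one outer entry per target token, one 1-dim vector per src token.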

I passed A into paddle in this format, as follows:
```python
# Alignment matrix A from fast_align, one 1-dim value per (target, src) pair
a = data_layer(name='target_source_align', size=1)

decoder = recurrent_group(name=decoder_group_name,
                          step=gru_decoder_with_attention,
                          input=[
                              StaticInput(input=encoded_vector,
                                          is_seq=True),
                              StaticInput(input=encoded_proj,
                                          is_seq=True),
                              trg_embedding
                          ])
```

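Now I have a few questions:

1. Are my understanding and processing flow above correct?
2. How do I pass a sub_sequence such as align_info into recurrent_group->gru_decoder_with_attention?
3. If attention_weight is a sequence of length src len, how do I compute the align cost (i.e. G) defined in the formula above against a?
4. How do I combine the two costs with weights for training?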


coollip · 2017-01-10T03:20:38Z

I tried passing a directly into the input of recurrent_group, the same way as trg_embedding:
```python
decoder = recurrent_group(name=decoder_group_name,
                          step=gru_decoder_with_attention,
                          input=[
                              StaticInput(input=encoded_vector,
                                          is_seq=True),
                              StaticInput(input=encoded_proj,
                                          is_seq=True),
                              trg_embedding,
                              a  # alignment matrix added as an extra input
                          ])
```

Correspondingly, I also added the a variable to gru_decoder_with_attention:
def gru_decoder_with_attention(enc_vec, enc_proj, current_word, a):

This fails with:
F /home/nmt/paddle/paddle/gserver/gradientmachines/RecurrentGradientMachine.cpp:407] Check failed: ((size_t)input1.getNumSequences()) == (numSequences)
/home/nmt/paddle/install/bin/paddle: line 81: 29166 Aborted ${DEBUGGER} $MYDIR/../opt/paddle/bin/paddle_trainer ${@:2}

Changing a to SubsequenceInput does not work either:
[CRITICAL 2017-01-10 11:16:42,404 layers.py:2493] The sequence type of in_links should be the same in RecurrentLayerGroup

I looked at the corresponding place in layers.py. The cause seems to be that
trg_embedding is a LayerOutput
while a is a SubsequenceInput,
which makes the assert in RecurrentLayerGroupWithoutOutLinksBegin fail:
config_assert(in_links_has_subseq == has_subseq, "The sequence type of in_links should be the same in RecurrentLayerGroup")

coollip · 2017-01-10T03:29:27Z

Also, to use the attention_weight inside simple_attention, I changed the return value of simple_attention in networks.py so that it returns attention_weight as well:
return pooling_layer(input=scaled, pooling_type=SumPooling(), name="%s_pooling" % name), attention_weight

Then, where seqToseq_net.py calls simple_attention, I bind attention_weight:
context,attention_weight = simple_attention(encoded_sequence=enc_vec, encoded_proj=enc_proj, decoder_state=decoder_mem, )

My plan is to compute the cost between a and attention_weight inside gru_decoder_with_attention and return it together with the out variable (i.e. the softmax). Is that the right idea?

lcy-seso · 2017-01-10T07:08:21Z

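@coollip Hello~ Your understanding of the flow is broadly correct.

The conclusion first: the model you need cannot be set up through configuration alone without modifying Paddle's C++ code.

Here are answers to the 4 questions above.

1. The right approach is to take the attention weight outside the recurrent_layer_group and then compute the G part of the error with a cost layer (paddle has an MSE cost layer), comparing it against the “correct” alignment from fast_align that is fed in through the data_provider. However, the attention weight currently cannot be taken outside the recurrent_layer_group. This is explained in detail below.
2. In your example, the align info cannot be passed into the recurrent_layer_group.
3. Given the answer to 2, the approach in 3 does not work.
4. paddle directly supports attaching 2 costs: just define 2 costs. The cost layer interface lets you specify a weight.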
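
As a purely numerical illustration of what two weighted costs amount to, here is a plain-Python sketch. It is not Paddle configuration: `alignment_mse`, `combined_cost`, the weights, and the numbers are all made up, and averaging the squared error over target steps is an assumption.

```python
def alignment_mse(a, alpha):
    """Squared difference summed over src positions, averaged over target steps."""
    per_step = [
        sum((x - y) ** 2 for x, y in zip(row_a, row_alpha))
        for row_a, row_alpha in zip(a, alpha)
    ]
    return sum(per_step) / len(per_step)


def combined_cost(h_d, g, w1=0.5, w2=0.5):
    """Weighted sum of the align cost g and the cross-entropy cost h_d."""
    return w1 * g + w2 * h_d


a = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]      # normalized fast_align matrix
alpha = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # hypothetical attention weights
h_d = 2.0                                   # hypothetical cross-entropy value

g = alignment_mse(a, alpha)
total = combined_cost(h_d, g)
print(g, total)
```

Because the total is a plain weighted sum, the gradient of each term is simply scaled by its weight, which is why per-cost weights are enough to express the combination.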

coollip · 2017-01-10T09:15:24Z

@lcy-seso
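Thanks for your reply.

If I wanted to modify paddle's C++ code, which places exactly would need to change? If it is fairly complicated, I will have to give up on trying this algorithm.
How are the two costs weighted together, and which cost layer should I use? Could you give an example?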
http://www.paddlepaddle.org/doc/ui/api/trainer_config_helpers/layers.html#cost-layers
I don't see an MSE cost layer here.
Thanks!

luotao1 · 2017-01-10T09:30:46Z

> If I wanted to modify paddle's C++ code, which places exactly would need to change?

Mainly RecurrentGradientMachine.cpp and the code related to AgentLayer.cpp. This part is fairly complicated.

lcy-seso · 2017-01-10T10:34:22Z

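@coollip I got halfway through writing this in the afternoon. Let me roughly explain how recurrent_layer_group works and some of its current limitations.
The feature you need is very simple, but changing the code may require some time spent reading the layer group processing logic.

recurrent_layer_group is a mechanism for defining custom RNNs. It can be seen as an “opened-up RNN”: you define a step function that specifies the RNN's computation within one time step, and the recurrent_layer_group framework itself unrolls it over the whole sequence. Although recurrent_layer_group is an opened-up RNN, layers defined inside the step function cannot always be referenced directly from outside the layer group.
The inputs in the recurrent_layer_group call are the inputs of the layer group, and the output of the layer returned by the step function is the output of this custom RNN unit.
Although recurrent_layer_group is an opened-up RNN, not every kind of sequence can be fed to it. It still has to follow some principles of an ordinary RNN: an RNN takes a sequence as input and produces one output per time step, that is, “as long going in, as long coming out”. This is currently a very important principle of recurrent_layer_group.
In the nmt example, the target embedding sequence is the real input of recurrent_layer_group, and the output is the sequence of decoder hidden vectors. This example shows that the layer group needs “as long going in, as long coming out”. This restriction is why the attention weight cannot be taken out of the layer_group.
In the attention example, the source-language sequence should be called unbounded memory from the point of view of recurrent_layer_group. It is a special kind of input: every time step references the content stored in the unbounded memory. In other words, like an ordinary rnn, recurrent_layer_group can have one or more input sequences that are the rnn's real data inputs, and the step function is unrolled repeatedly over these data inputs. The data inputs must have equal length; an input of a different length can only be brought in as read-only memory.
Your example has 3 kinds of inputs with unequal lengths. recurrent layer group has one more restriction: all inputs must have the same sequence type, that is, either all sequence or all subsequence. This restriction is why the fast_align weight cannot be fed into recurrent_layer_group.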
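
The “as long going in, as long coming out” rule can be illustrated with a toy unrolling loop in plain Python. This only models the constraint described above and is not how Paddle implements recurrent_layer_group; `unroll`, `step`, and the `static_inputs` argument are hypothetical names.

```python
def unroll(step, inputs, static_inputs=()):
    """Toy unrolled RNN: equal-length stepped inputs in, one output per step."""
    lengths = {len(seq) for seq in inputs}
    if len(lengths) != 1:
        raise ValueError(f"stepped inputs must have equal length, got {sorted(lengths)}")
    outputs = []
    for frame in zip(*inputs):
        # Static inputs are passed whole and unchanged at every time step.
        outputs.append(step(*static_inputs, *frame))
    return outputs


def step(src_tokens, trg_word):
    # The whole src sentence is visible, read-only, at each decoder step.
    return (trg_word, len(src_tokens))


src = ["a", "b", "c"]  # different length: only usable as a static input
trg = ["x", "y"]       # the real stepped input

out = unroll(step, inputs=[trg], static_inputs=[src])
print(out)
```

In this toy model the output has exactly one entry per target token, so a per-step result of length src len (such as the attention weights) has no natural slot in the output sequence, and a second stepped input of a different length is rejected outright.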

lcy-seso · 2017-01-10T10:40:39Z

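@coollip If you need to change the code, it may be a bit of a hassle: AgentLayer.cpp has to be changed so that the attention weight layer can leave the layer_group... We will think about this problem some more.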

coollip · 2017-01-10T11:00:13Z

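@lcy-seso, thanks for your detailed answer.

You mentioned:
> recurrent layer group has a restriction: all inputs must have the same sequence type, that is, either all sequence or all subsequence. This restriction is why the fast_align weight cannot be fed into recurrent_layer_group

Would it work to change trg_ids to integer_value_sub_sequence, for example turning an integer_value_sequence like trg_ids=[1,3,2,15] into the following format: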
[[1],[3],[2],[15]]
and then, like a, pass it into recurrent_group with SubsequenceInput:

```python
decoder = recurrent_group(name=decoder_group_name,
                          step=gru_decoder_with_attention,
                          input=[
                              StaticInput(input=encoded_vector,
                                          is_seq=True),
                              StaticInput(input=encoded_proj,
                                          is_seq=True),
                              SubsequenceInput(trg_embedding),
                              SubsequenceInput(a)
                          ])
```

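This way, gru_decoder_with_attention receives the current_word and a for the current decoder step, where current_word has the form [1] and a is a sequence of length src len.
Because current_word is [1] rather than 1, decoder_inputs += full_matrix_projection(current_word) in gru_decoder_with_attention is changed to: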
decoder_inputs += full_matrix_projection(first_seq(current_word))

This does not run either, so my understanding must still be off. For now I can only shelve the attempt...

lcy-seso · 2017-01-10T11:19:25Z

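Turning the target embedding into a subsequence does not work either.

Say the “bringing it in” problem is solved. Is the plan then to compute the MSE inside the layer group with math? Everything up to this point can be done, and the result can also be taken directly out of the layer_group. But... at that point there is no way to attach a cost layer anymore. The network must end in a cost layer.

If it is only brought in... because the output is also subject to length restrictions, it cannot be taken out of the layer_group. (Both the “bringing in” and “taking out” steps involve some batch reorganization, so input and output lengths are restricted.)

The conclusion is... without changing the code... it cannot be configured...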

coollip · 2017-01-10T11:22:05Z

@lcy-seso OK, thanks for the reply!

lcy-seso · 2017-01-10T11:28:43Z

@coollip I will think more about this configuration and whether there is a relatively simple way to solve it. Thanks for your interest ~

lcy-seso · 2017-08-19T11:50:22Z

Closing this issue due to inactivity. Please feel free to reopen it if more information is available.


luotao1 assigned lcy-seso Jan 10, 2017

hedaoyuan mentioned this issue Jan 12, 2017

Question about how to experiment with Embedding in paddle #1138

Closed

qingqing01 added the enhancement label Aug 7, 2017

lcy-seso closed this as completed Aug 19, 2017

