add DiffCSE model #2643

1649759610 · 2022-06-24T13:56:46Z

PR types

New features

PR changes

Models

Description

Commit the DiffCSE model to the PaddleNLP repo.

tianxin1860

Leave some comments

tianxin1860 · 2022-06-29T02:47:14Z

examples/text_matching/diffcse/README.md

+export CUDA_VISIBLE_DEVICES=0,1,2,3
+
+python -u -m paddle.distributed.launch --gpus ${gpu_ids} \


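When starting a job with launch, please add the --log_dir parameter to specify the log output directory. Otherwise, when several jobs are started, they all write to the same log directory and their logs get interleaved.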
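
As an illustration of the suggestion above, here is a minimal Python sketch that gives every job its own log directory. `--gpus` and `--log_dir` are flags of `paddle.distributed.launch`; the helper name `build_launch_command`, the `job_name` argument, and the `log/<job_name>` layout are assumptions made only for this example.

```python
import sys


def build_launch_command(job_name, gpus, script, script_args):
    """Build an argv list that launches `script` with a per-job log directory.

    Hypothetical helper: only `--gpus` and `--log_dir` are real launcher flags.
    """
    log_dir = "log/{}".format(job_name)  # assumed layout: one directory per job
    return [
        sys.executable, "-u", "-m", "paddle.distributed.launch",
        "--gpus", gpus,
        "--log_dir", log_dir,
        script,
    ] + list(script_args)


train_cmd = build_launch_command(
    "diffcse_train", "0,1,2,3", "run_diffcse.py", ["--mode", "train"])
print(" ".join(train_cmd[1:]))
```

Two jobs built with different `job_name` values end up with different `--log_dir` values, so their logs no longer mix.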

tianxin1860 · 2022-06-29T02:50:49Z

examples/text_matching/diffcse/README.md

+python -u -m paddle.distributed.launch --gpus ${gpu_ids} \
+    run_diffcse.py \
+    --mode "train" \
+    --extractor_name "rocketqa-zh-dureader-query-encoder" \


Variable names should follow the paper's terminology: extractor_name -> sentence encoder

Changed to: encoder_name

tianxin1860 · 2022-06-29T02:51:53Z

examples/text_matching/diffcse/README.md

+
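+Configurable parameters:
+* `mode`: optional; specifies whether this run performs model training, evaluation, or prediction. Only the three modes [train, eval, infer] are supported; defaults to infer.
+* `extractor_name`: optional; name of the model used for embedding extraction in DiffCSE; defaults to ernie-1.0.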


Please standardize the variable naming.

Changed to: encoder_name

tianxin1860 · 2022-06-29T02:52:53Z

examples/text_matching/diffcse/README.md

+
+python run_diffcse.py \
+    --mode "eval" \
+    --extractor_name "rocketqa-zh-dureader-query-encoder" \


Changed to: encoder_name

tianxin1860 · 2022-06-29T03:02:59Z

examples/text_matching/diffcse/model.py

+        if not with_pooler:
+            ori_cls_embedding = sequence_output[:, 0, :]


This branch logic is missing an else.
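
A sketch of what the complete branch might look like. The quoted diff only shows the `if not with_pooler` arm, so the `else` arm below (falling back to the encoder's pooled output) is an assumption, as are the function name and the `pooled_output` argument. Nested Python lists stand in for tensors of shape `[batch, seq_len, hidden]`.

```python
def select_cls_embedding(sequence_output, pooled_output, with_pooler):
    """Pick the sentence embedding for each example in the batch."""
    if not with_pooler:
        # Take the hidden state at position 0 ([CLS]) of every sequence,
        # the list equivalent of sequence_output[:, 0, :].
        cls_embedding = [tokens[0] for tokens in sequence_output]
    else:
        # Assumed else-branch: reuse the encoder's pooled [CLS] output.
        cls_embedding = pooled_output
    return cls_embedding


sequence_output = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6], [0.7, 0.8]]]
pooled_output = [[1.0, 1.0], [2.0, 2.0]]
print(select_cls_embedding(sequence_output, pooled_output, with_pooler=False))
print(select_cls_embedding(sequence_output, pooled_output, with_pooler=True))
```

With both arms present, the embedding variable is always bound before it is used, whichever value `with_pooler` takes.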

tianxin1860 · 2022-06-29T03:57:23Z

examples/text_matching/diffcse/run_diffcse.py

+                         key_token_type_ids=key_token_type_ids,
+                         query_attention_mask=query_attention_mask,
+                         key_attention_mask=key_attention_mask,
+                         cls_token=tokenizer.cls_token_id)


What is cls_token used for?

Same explanation as above.

tianxin1860 · 2022-06-29T03:59:38Z

examples/text_matching/diffcse/run_diffcse.py

+            if global_step % (args.eval_steps // 10) == 0 and rank == 0:
+                print(
+                    "global step {}, epoch: {}, batch: {}, loss: {:.5f}, speed: {:.2f} step/s"
+                    .format(global_step, epoch, step, loss.item(),
+                            (args.eval_steps // 10) /
+                            (time.time() - tic_train)))


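The log should also output the loss of the RTD task and the accuracy of the discriminator's predictions; without them the results are hard to analyze.

Some of the samples produced by the generator could be saved to a local file for analysis and debugging.

Added the relevant metrics and a plotting feature.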
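
For reference, a framework-free sketch of how a discriminator accuracy for replaced token detection (RTD) could be computed and logged. The function names, the toy token ids, and the assumption that padding positions are excluded from the metric are illustrative and not taken from the PR.

```python
def rtd_labels(original_ids, generated_ids):
    """1 where the generator changed the token, 0 where it is unchanged."""
    return [[int(o != g) for o, g in zip(orig_row, gen_row)]
            for orig_row, gen_row in zip(original_ids, generated_ids)]


def rtd_accuracy(predictions, labels, attention_mask):
    """Fraction of non-padding tokens whose replaced/original label is correct."""
    correct = 0
    total = 0
    for pred_row, label_row, mask_row in zip(predictions, labels, attention_mask):
        for pred, label, mask in zip(pred_row, label_row, mask_row):
            if mask == 1:  # skip padding positions
                total += 1
                correct += int(pred == label)
    return correct / total if total else 0.0


original_ids = [[1, 15, 27, 2, 0]]
generated_ids = [[1, 15, 99, 2, 0]]   # the generator replaced token 27 with 99
attention_mask = [[1, 1, 1, 1, 0]]    # last position is padding
labels = rtd_labels(original_ids, generated_ids)
predictions = [[0, 1, 1, 0, 0]]       # discriminator output after thresholding
print("rtd_acc: {:.2f}".format(rtd_accuracy(predictions, labels, attention_mask)))
```

Logging this value next to the contrastive loss makes it easier to tell whether the discriminator is actually learning to spot replaced tokens.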

tianxin1860 · 2022-06-29T04:02:49Z

examples/text_matching/diffcse/model.py

+
+        with paddle.no_grad():
+            # mask tokens for query and key input_ids and then predict mask token with generator
+            input_ids = paddle.concat([query_input_ids, key_input_ids], axis=0)


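Why does the text need to be repeated twice and concatenated here? Does this amount to masking the same sample in two different ways?

This follows the setup of the original paper.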
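
To make the question concrete, here is a small sketch of concatenating the query and key batches and masking them independently. It assumes the query and key hold the same sentence; the token ids, the masking probability, and the helper `random_mask` are illustrative choices, not the PR's implementation.

```python
import random


def random_mask(input_ids, mask_id, special_ids, mask_prob, rng):
    """Replace each non-special token with mask_id with probability mask_prob."""
    return [[mask_id if tok not in special_ids and rng.random() < mask_prob else tok
             for tok in row]
            for row in input_ids]


PAD_ID, CLS_ID, SEP_ID, MASK_ID = 0, 1, 2, 3  # illustrative ids, not a real vocabulary

query_input_ids = [[CLS_ID, 11, 12, 13, SEP_ID]]
key_input_ids = [[CLS_ID, 11, 12, 13, SEP_ID]]  # same sentence as the query

# List equivalent of paddle.concat([query_input_ids, key_input_ids], axis=0).
input_ids = query_input_ids + key_input_ids

rng = random.Random(0)
mlm_input_ids = random_mask(input_ids, MASK_ID, {PAD_ID, CLS_ID, SEP_ID}, 0.3, rng)
print(mlm_input_ids)  # each of the two copies is masked independently
```

Because every row draws its own random numbers, the two copies of a sentence generally receive different masks, which matches the reviewer's reading.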

tianxin1860 · 2022-06-29T04:09:38Z

examples/text_matching/diffcse/model.py

+            pred_tokens = self.generator(
+                mlm_input_ids, attention_mask=attention_mask).argmax(-1)


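Please add the necessary comments here: an example input for mlm_input_ids and an example output for pred_tokens.

I suggest using the paddle.argmax API here and naming the axis parameter that -1 corresponds to explicitly, so the intent of the code is clearer.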
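
One way the requested example input and output could be documented, shown with plain Python lists rather than Paddle tensors. In Paddle the same step would be written with the keyword form the reviewer asks for, `paddle.argmax(logits, axis=-1)`. The toy vocabulary size, the ids, and the helper `argmax_last_axis` are assumptions for illustration.

```python
def argmax_last_axis(logits):
    """Index of the largest score along the last axis of a [batch, seq_len, vocab] list."""
    return [[max(range(len(scores)), key=scores.__getitem__) for scores in row]
            for row in logits]


MASK_ID = 3  # illustrative id for [MASK]

# mlm_input_ids: shape [batch, seq_len], with one masked position.
mlm_input_ids = [[1, MASK_ID, 2]]

# Generator scores over a toy 5-token vocabulary: shape [batch, seq_len, vocab].
logits = [[
    [0.0, 9.0, 0.0, 0.0, 0.0],   # position 0 -> token 1
    [0.1, 0.0, 0.2, 0.3, 5.0],   # position 1 ([MASK]) -> token 4
    [0.0, 0.0, 7.0, 0.0, 0.0],   # position 2 -> token 2
]]

pred_tokens = argmax_last_axis(logits)
print(pred_tokens)  # [[1, 4, 2]]
```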

tianxin1860 · 2022-06-29T04:10:17Z

examples/text_matching/diffcse/model.py

+                mlm_input_ids, attention_mask=attention_mask).argmax(-1)
+
+        pred_tokens[:, 0] = cls_token
+        e_inputs = pred_tokens * attention_mask


What is the expected attention_mask input here?

This attention mask is the one returned by the tokenizer; its purpose is to mask out the padding positions.
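
A list-based sketch of the two quoted lines, restoring `[CLS]` at position 0 and multiplying by the attention mask. Note that the multiplication writes 0 into padding positions, which only reproduces the padding token if the pad id is 0; that, the toy ids, and the helper name are assumptions of this example.

```python
def build_discriminator_inputs(pred_tokens, attention_mask, cls_token_id):
    """Restore [CLS] at position 0 and zero out padding positions."""
    e_inputs = []
    for token_row, mask_row in zip(pred_tokens, attention_mask):
        row = list(token_row)
        row[0] = cls_token_id  # mirrors pred_tokens[:, 0] = cls_token
        # Mirrors pred_tokens * attention_mask: padding positions become 0.
        row = [tok * m for tok, m in zip(row, mask_row)]
        e_inputs.append(row)
    return e_inputs


pred_tokens = [[57, 15, 99, 2, 41]]   # generator output; last position is padding
attention_mask = [[1, 1, 1, 1, 0]]    # 1 = real token, 0 = padding
print(build_discriminator_inputs(pred_tokens, attention_mask, cls_token_id=1))
# [[1, 15, 99, 2, 0]]
```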

tianxin1860

The cls_token parameter does not seem necessary; the special token's ID should be obtainable from the tokenizer.

tianxin1860 · 2022-06-29T07:12:37Z

examples/text_matching/diffcse/data.py

+            encoded_inputs = tokenizer(text=text,
+                                       max_seq_len=max_seq_length,
+                                       return_attention_mask=True)
+            # print(encoded_inputs)


Leftover comment; please remove it.

tianxin1860 · 2022-06-29T09:22:22Z

examples/text_matching/diffcse/model.py

+        cosine_sim = cosine_sim - paddle.diag(margin_diag)
+
+        # scale cosine to ease training converge
+        cosine_sim *= self.scale


To stay consistent with the official DiffCSE code, please remove the Normalize operation on the embeddings and the scale parameter.

tianxin1860 · 2022-06-29T09:49:18Z

examples/text_matching/diffcse/model.py

+            pred_tokens = self.generator(
+                mlm_input_ids, attention_mask=attention_mask).argmax(-1)


I suggest using the paddle.argmax API here and naming the axis parameter that -1 corresponds to explicitly, so the intent of the code is clearer.

tianxin1860 · 2022-06-29T09:53:12Z

examples/text_matching/diffcse/model.py

+                      key_token_type_ids=None,
+                      query_attention_mask=None,
+                      key_attention_mask=None,
+                      cls_token=1):


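Is the cls_token parameter required here? As I understand it, the ID of the CLS special token can be obtained from the tokenizer.

"To stay consistent with the official DiffCSE code, please remove the Normalize operation on the embeddings and the scale parameter.": Removed.

"I suggest using the paddle.argmax API here and naming the axis parameter that -1 corresponds to explicitly, so the intent of the code is clearer.": axis=-1 is now specified.

"Is the cls_token parameter required here? As I understand it, the ID of the CLS special token can be obtained from the tokenizer.": Changed to obtain it from the tokenizer.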

tianxin1860

Leave some comments

tianxin1860 · 2022-07-19T12:01:19Z

examples/text_matching/diffcse/data.py

+def read_text_pair(data_path, is_infer=False):
+    with open(data_path, "r", encoding="utf-8") as f:
+        for line in f:
+            data = line.rstrip().split("\t")
+            if is_infer:
+                if len(data[0]) == 0 or len(data[1]) == 0:
+                    continue
+                yield {"text_a": data[0], "text_b": data[1]}
+            else:
+                if len(data[0]) == 0 or len(data[1]) == 0 or len(data[2]) == 0:
+                    continue
+                yield {"text_a": data[0], "text_b": data[1], "label": data[2]}


This function looks redundant. Is it unused?

read_text_pair is used when loading the evaluation dataset.

tianxin1860 · 2022-07-19T12:02:01Z

examples/text_matching/diffcse/data.py

+                yield {"text_a": data[0], "text_b": data[1], "label": data[2]}
+
+
+def word_repetition(input_ids, token_type_ids, dup_rate=0.32):


This function also appears to be unused and can be deleted.

word_repetition has been deleted.

tianxin1860 · 2022-07-19T12:04:12Z

examples/text_matching/diffcse/eval_metrics.py

+from sklearn import metrics
+
+
+def eval_metrics(labels, sims):


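Evaluation logic written for internal business use does not need to be open-sourced; for DiffCSE, releasing the evaluation metric used in the paper is enough.

Removed the current evaluation logic and switched to the Spearman coefficient throughout.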
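
For readers unfamiliar with the metric, a dependency-free sketch of the Spearman rank correlation between gold labels and predicted similarities. This is not the implementation used in the PR, which the thread does not show; in practice a library routine would normally be used. The tie handling (average ranks) and the 0.0 return value for constant input are conventions chosen for this example.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0  # average rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(labels, sims):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(labels), average_ranks(sims)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for constant input; 0.0 is a convention chosen here
    return cov / (var_x * var_y) ** 0.5


labels = [0, 1, 2, 3, 4]
sims = [0.10, 0.35, 0.40, 0.80, 0.95]
print(spearman(labels, sims))
```

Since only the ranking matters, the metric rewards a model whose similarity scores order the pairs the same way the labels do, regardless of the absolute score values.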

tianxin1860 · 2022-07-19T12:40:12Z

examples/text_matching/diffcse/model.py

+        y = y.unsqueeze(0)
+        sim = self.cos(x, y)
+        self.record = sim.detach()
+        min_size = min(self.record.shape[0], self.record.shape[1])


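Can the two inputs x and y contain different numbers of vectors?

In test mode, where we want the similarity between x and y, the number of vectors in x and y must be equal.
In training mode, x and y are allowed to contain different numbers of vectors, but in our input data pipeline they are always equal.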

tianxin1860 · 2022-07-19T12:41:41Z

examples/text_matching/diffcse/model.py

+        self.pos_avg = paddle.diag(self.record).sum().item() / min_size
+        self.neg_avg = (self.record.sum().item() - paddle.diag(


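What are these two variables for?

pos_avg records the average similarity of the positive pairs in an input batch.
neg_avg records the average similarity of the negative pairs in an input batch.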
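
A small sketch of the two statistics on a similarity matrix stored as nested lists, where entry `[i][j]` is the similarity between query `i` and key `j`. The quoted diff is cut off in the middle of the `neg_avg` expression, so the denominator used below (the count of off-diagonal entries) is an assumption, as is the function name.

```python
def pos_neg_avg(sim):
    """Average similarity of positives (diagonal) and negatives (off-diagonal)."""
    rows, cols = len(sim), len(sim[0])
    min_size = min(rows, cols)  # diagonal length, also valid for non-square input
    diag_sum = sum(sim[i][i] for i in range(min_size))
    total_sum = sum(sum(row) for row in sim)
    pos_avg = diag_sum / min_size
    # Assumed denominator: every entry that is not on the diagonal.
    neg_count = rows * cols - min_size
    neg_avg = (total_sum - diag_sum) / neg_count if neg_count else 0.0
    return pos_avg, neg_avg


sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.1],
    [0.2, 0.0, 0.7],
]
pos_avg, neg_avg = pos_neg_avg(sim)
print("pos_avg={:.3f} neg_avg={:.3f}".format(pos_avg, neg_avg))
```

A widening gap between the two averages during training suggests that positives are being pulled together while in-batch negatives are pushed apart.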

tianxin1860

LGTM

1649759610 added 5 commits June 24, 2022 13:43

initial commit

37d27f9

refine readme

d65208e

refine codestyle

ac4d644

refine readme

3f433b9

refine readme

d3f6ada

tianxin1860 self-requested a review June 24, 2022 15:18

fix model saving bug

54ed34b

tianxin1860 reviewed Jun 29, 2022

tianxin and others added 6 commits July 6, 2022 10:36

Merge branch 'develop' into develop

63b0a76

initial commit

4669194

Merge branch 'PaddlePaddle:develop' into develop

f6f93e1

Merge branch 'develop' of https://github.com/1649759610/PaddleNLP int…

fb3ade1

…o develop

initial commit

a83a902

initial commit

7bd988a

1649759610 closed this Jul 12, 2022

1649759610 reopened this Jul 12, 2022

1649759610 closed this Jul 12, 2022

Merge branch 'PaddlePaddle:develop' into develop

700810a

1649759610 reopened this Jul 12, 2022

1649759610 self-assigned this Jul 19, 2022

tianxin1860 reviewed Jul 19, 2022

1649759610 and others added 3 commits July 21, 2022 22:49

Merge branch 'PaddlePaddle:develop' into develop

68e025a

use common metric instead of eval_metrics.py and remove unuseful code

02a997b

Merge branch 'develop' of https://github.com/1649759610/PaddleNLP int…

1500e5f

…o develop

tianxin1860 approved these changes Jul 28, 2022

Merge branch 'develop' into develop

faaf5f5

1649759610 merged commit 7932dd2 into PaddlePaddle:develop Aug 1, 2022

1649759610 mentioned this pull request Aug 1, 2022

PaddleNLP 2.3.5 Release Note Candidate #2907

Closed
