Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MPNet Model #869

Merged
merged 11 commits into from
Sep 11, 2021
Merged

Add MPNet Model #869

merged 11 commits into from
Sep 11, 2021

Conversation

JunnYu
Copy link
Member

@JunnYu JunnYu commented Aug 10, 2021

飞桨论文复现挑战赛(第四期)MPNet: Masked and Permuted Pre-training for Language Understanding 论文复现提交。

@yingyibiao yingyibiao self-assigned this Aug 10, 2021
@gongel gongel self-requested a review August 31, 2021 02:34
@gongel gongel self-assigned this Aug 31, 2021
Copy link
Member

@gongel gongel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add docstrings for all your classes and methods which might be utilized by users. You can refer to paddlenlp.transformers.bert.

[MPNet: Masked and Permuted Pre-training for Language Understanding - Microsoft Research](https://www.microsoft.com/en-us/research/publication/mpnet-masked-and-permuted-pre-training-for-language-understanding/)

**摘要:**
BERT adopts masked language modeling (MLM) for pre-training and is one of the most successful pre-training models. Since BERT neglects dependency among predicted tokens, XLNet introduces permuted language modeling (PLM) for pretraining to address this problem. However, XLNet does not leverage the full position information of a sentence and thus suffers from position discrepancy between pre-training and fine-tuning. In this paper, we propose MPNet, a novel pre-training method that inherits the advantages of BERT and XLNet and avoids their limitations. MPNet leverages the dependency among predicted tokens through permuted language modeling (vs. MLM in BERT), and takes auxiliary position information as input to make the model see a full sentence and thus reducing the position discrepancy (vs. PLM in XLNet). We pre-train MPNet on a large-scale dataset (over 160GB text corpora) and fine-tune on a variety of down-streaming tasks (GLUE, SQuAD, etc). Experimental results show that MPNet outperforms MLM and PLM by a large margin, and achieves better results on these tasks compared with previous state-of-the-art pre-trained methods (e.g., BERT, XLNet, RoBERTa) under the same model setting. The code and the pre-trained models are available at: https://github.com/microsoft/MPNet.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please translate to Chinese.


## 快速开始

### 模型精度对齐
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove.

@@ -0,0 +1,50 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please delete compare.py

Copy link
Member

@gongel gongel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution and please give us feedback.

pretrained_resource_files_map = {
"vocab_file": {
"mpnet-base":
"https://paddlenlp.bj.bcebos.com/models/transformers/mpnet/mpnet-base/vocab.txt",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

词表貌似没有ready?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

另外Tokenizer默认会返回token_type_ids,但是这个在MPNet是不需要的

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

词表貌似没有ready?

done

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

embedding_output = self.embeddings(input_ids, position_ids)

encoder_outputs, _ = self.encoder(embedding_output,
extended_attention_mask)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extended_attention_mask --->attention_mask

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

mpnet (:class:`MPNetModel`):
An instance of MPNetModel.
num_classes (int, optional):
The number of classes. Defaults to `2`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

下面的默认值没给哈

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

@yingyibiao yingyibiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

图片可以删了哈,保留表格就好

@JunnYu
Copy link
Member Author

JunnYu commented Sep 10, 2021

好的,过会改

gongel
gongel previously approved these changes Sep 11, 2021
Copy link
Member

@gongel gongel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

处理下冲突

return len(cls + token_ids_0 + sep) * [0]
return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]

def __call__(self,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个放在def __init__后面吧

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

gongel
gongel previously approved these changes Sep 11, 2021
Copy link
Member

@gongel gongel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM😊

@@ -0,0 +1,903 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Google AI Language Team?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我偷懒从别的文件复制的,待会改

@gongel gongel merged commit b9a4cb1 into PaddlePaddle:develop Sep 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants