Add model Rembert #1701

Merged
merged 14 commits into from
Apr 14, 2022

Beacontownfc
Contributor

Description
Add a new model, RemBert.
Model weights:
Link: https://aistudio.baidu.com/aistudio/datasetdetail/129105

@gongel gongel self-requested a review April 8, 2022 11:25
paddlenlp/transformers/rembert/tokenizer.py (outdated, resolved)
paddlenlp/transformers/rembert/modeling.py (outdated, resolved)
paddlenlp/transformers/rembert/modeling.py (outdated, resolved)
paddlenlp/transformers/rembert/modeling.py (outdated, resolved)
@gongel
Member

gongel commented Apr 8, 2022

Thanks for the contribution. Please revise it according to the review comments 😊 @Beacontownfc

@Beacontownfc
Contributor Author

I have made the changes you requested. @gongel

gongel
gongel previously approved these changes Apr 12, 2022
Member

@gongel gongel left a comment

LGTM

@Beacontownfc
Contributor Author

@yingyibiao Everything is OK; requesting approval to merge 😊

@gongel gongel dismissed their stale review April 12, 2022 13:06

Batch input with padding may have a problem.

@gongel
Member

gongel commented Apr 12, 2022

@Beacontownfc There is a problem with padded input; please check it with the script below.

import os

# Set the cache locations before importing the libraries,
# which read these variables at import time.
os.environ["TRANSFORMERS_CACHE"] = "./hf/"
os.environ["PPNLP_HOME"] = "./pdnlp/"

import numpy as np
import paddle
import torch
import transformers as hfnlp
from paddlenlp.data import Pad
import paddlenlp.transformers as ppnlp


def compute_diff(torch_data, paddle_data):
	# Convert both outputs to NumPy arrays and summarize the
	# element-wise absolute difference between them.
	torch_data = torch_data.detach().numpy()
	paddle_data = paddle_data.numpy()
	diff = np.abs(torch_data - paddle_data)
	return "max: {}    mean: {}    min: {}".format(diff.max(), diff.mean(), diff.min())


def compare_base(model_id):
	sentences = [
		"This is an example sentence.", 
		"Each sentence is converted .", 
		"欢迎使用 PaddlePaddle  。",
		"欢迎使用 PaddleNLP 。"
	]
	
	# Calculate HF output
	hf_tokenizer = hfnlp.RemBertTokenizer.from_pretrained('google/rembert') # google/rembert
	hf_model = hfnlp.RemBertModel.from_pretrained('google/rembert') # google/rembert
	hf_model.eval()
	with torch.no_grad():
		hf_inputs = hf_tokenizer(sentences, padding=True, return_tensors="pt")
		print(hf_inputs)
		hf_out = hf_model(**hf_inputs).last_hidden_state
	
	# Calculate Paddle output (model_id is the PaddleNLP model name, e.g. 'rembert')
	pd_tokenizer = ppnlp.RemBertTokenizer.from_pretrained(model_id)
	pd_model = ppnlp.RemBertModel.from_pretrained(model_id)
	pd_model.eval()
	with paddle.no_grad():
		pd_inputs = pd_tokenizer(sentences)
		input_ids = paddle.to_tensor(Pad(axis=0, pad_val=pd_tokenizer.pad_token_id)([pd_input for pd_input in pd_inputs["input_ids"]]))
		token_type_ids = paddle.to_tensor(Pad(axis=0, pad_val=pd_tokenizer.pad_token_type_id)([pd_input for pd_input in pd_inputs["token_type_ids"]]))
		print(input_ids)
		print(token_type_ids)
		pd_out = pd_model(input_ids, token_type_ids)[0]

	return compute_diff(hf_out, pd_out)

print(compare_base("rembert"))
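
In the script above, the Hugging Face call passes everything in hf_inputs (which, with padding=True, includes an attention mask), while the Paddle call passes only input_ids and token_type_ids. Whether the Paddle model derives a padding mask internally is not stated in this thread, so the following is only a minimal sketch, in plain Python, of how a padding mask could be built from padded ids. The function name build_attention_mask is hypothetical and is not a PaddleNLP API.

```python
from typing import List


def build_attention_mask(input_ids: List[List[int]], pad_token_id: int) -> List[List[int]]:
    """Return 1 for real tokens and 0 for padding positions.

    Hypothetical helper for illustration; it assumes that pad_token_id
    never appears as a real token inside a sentence.
    """
    return [[0 if tok == pad_token_id else 1 for tok in row] for row in input_ids]


if __name__ == "__main__":
    padded = [[312, 45, 7, 0, 0], [312, 9, 88, 21, 7]]
    for row in build_attention_mask(padded, pad_token_id=0):
        print(row)
```

A mask like this could then be converted to a tensor and compared against the attention_mask printed from hf_inputs to confirm that both sides treat the same positions as padding.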

@Beacontownfc
Contributor Author

Beacontownfc commented Apr 12, 2022

input_ids = paddle.to_tensor(Pad(axis=0, pad_val=pd_tokenizer.pad_token_id)([pd_input for pd_input in pd_inputs["input_ids"]]))

Hello. This line of code
input_ids = paddle.to_tensor(Pad(axis=0, pad_val=pd_tokenizer.pad_token_id)([pd_input for pd_input in pd_inputs["input_ids"]]))
also raises an error when I run it with BertTokenizer. I think it should be changed to
input_ids = paddle.to_tensor( Pad(axis=0, pad_val=pd_tokenizer.pad_token_id)([pd_input["input_ids"] for pd_input in pd_inputs]))
With that change, the code runs correctly.
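
The two indexing styles above correspond to two possible layouts of the tokenizer's batch output: a single mapping of field name to per-sentence lists, or a list with one mapping per sentence. Which layout a given paddlenlp version returns is an assumption here, not something this thread confirms. The sketch below, in plain Python, shows one way a comparison script could accept either layout before padding; collect_field and pad_batch are hypothetical helpers, not PaddleNLP functions.

```python
from collections.abc import Mapping
from typing import List


def collect_field(encoded, field: str) -> List[List[int]]:
    """Return one list of ids per sentence for either output layout."""
    if isinstance(encoded, Mapping):
        # Layout A (assumed): {"input_ids": [[...], [...]], ...}
        return [list(seq) for seq in encoded[field]]
    # Layout B (assumed): [{"input_ids": [...]}, {"input_ids": [...]}]
    return [list(item[field]) for item in encoded]


def pad_batch(sequences: List[List[int]], pad_val: int) -> List[List[int]]:
    """Right-pad every sequence with pad_val to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_val] * (max_len - len(seq)) for seq in sequences]


if __name__ == "__main__":
    layout_a = {"input_ids": [[1, 2, 3], [4, 5]]}
    layout_b = [{"input_ids": [1, 2, 3]}, {"input_ids": [4, 5]}]
    print(pad_batch(collect_field(layout_a, "input_ids"), pad_val=0))
    print(pad_batch(collect_field(layout_b, "input_ids"), pad_val=0))
```

Both calls produce the same padded batch, so the rest of the script would not need to know which layout the installed version uses.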

@gongel
Member

gongel commented Apr 13, 2022

Which version of paddlenlp are you using? The code on develop currently works without this problem.
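
To answer a version question like this one, the installed version can be read from package metadata without importing the library. This is a small sketch using only the Python standard library (3.8 or later); the distribution name "paddlenlp" is assumed to match the name used at install time, and a source install of develop may report a development version string.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "paddlenlp" is the assumed distribution name.
    print("paddlenlp:", installed_version("paddlenlp"))
```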

@yingyibiao yingyibiao merged commit 70649b1 into PaddlePaddle:develop Apr 14, 2022
ZeyuChen pushed a commit to ZeyuChen/PaddleNLP that referenced this pull request Apr 17, 2022
* add rembert

* add rembert

* Update tokenizer.py

* update rembert

* modify

* modify according to gongel

* Update tokenizer.py

* Update tokenizer.py

* Update modeling.py

* fix bug

Co-authored-by: gongenlei <gongel@qq.com>
@gongel
Member

gongel commented Jun 8, 2022

@Beacontownfc Could you also adapt it to work with Auto?
