
Add NLP model interpretation #1752

Merged: 32 commits merged into PaddlePaddle:develop on Mar 25, 2022
Conversation

@binlinquge binlinquge (Contributor) commented Mar 10, 2022

PR types

Upload of a new module

PR changes

New module

Description

This module is used for interpreting NLP models. Please see the README for details.

@CLAassistant commented Mar 10, 2022

CLA assistant check
All committers have signed the CLA.

@ZeyuChen ZeyuChen self-requested a review March 11, 2022 13:42
@ZeyuChen ZeyuChen added the enhancement (New feature or request) label Mar 11, 2022
@ZeyuChen ZeyuChen changed the title upload NLP interpretation Add NLP interpretation Mar 14, 2022
@ZeyuChen ZeyuChen changed the title Add NLP interpretation Add NLP model interpretation Mar 14, 2022
@ZeyuChen ZeyuChen added this to the PaddleNLP v2.3 milestone Mar 15, 2022
@ZeyuChen ZeyuChen added this to In progress in PaddleNLP v2.3 via automation Mar 15, 2022
@ZeyuChen ZeyuChen (Member) left a comment

Delete all empty placeholder files such as [.gitkeep].

[//]: shenyaozong(shenyaozong@baidu.com)


讨论
Member

Delete the following section; since this is public-facing, internal information such as icode and the Hi group should not be exposed.

Contributor Author

done

return args


def dataLoad(args):
Member

dataLoad -> data_load
The function naming style needs to be unified (use snake_case).

Contributor Author

done

@ZeyuChen ZeyuChen (Member) left a comment

PaddleNLP will ship a patch release that fixes loading of the English RoBERTa model; once it lands, the usage here can be simplified so users no longer have to copy a large chunk of code and model files into their working directory.

@@ -0,0 +1,33 @@
backports.entry-points-selectable==1.1.1
Member

For users who get as far as the examples directory, you can assume paddlenlp and the CPU or GPU build of paddle are already installed successfully.
So I'd like to confirm: after paddlepaddle and paddlenlp are installed, which additional dependencies does a user still need to install?
BTW, pinning exact versions in requirements makes version conflicts and incompatibilities with the user's environment very likely; it is not a good way to handle requirements. Unless it is unavoidable, use >= version specifiers.
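For illustration, the suggested style uses a version floor instead of an exact pin; e.g. for the line quoted above (the floor simply reuses the version already listed in this PR):

```text
# before: exact pin, likely to conflict with the user's environment
backports.entry-points-selectable==1.1.1
# after: version floor, as suggested
backports.entry-points-selectable>=1.1.1
```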

Contributor Author

Removed the paddle-related dependencies and changed == to >=.

@@ -0,0 +1,33 @@
backports.entry-points-selectable==1.1.1
Member

Where is the backports dependency actually needed?

Contributor Author

It is needed when loading lru_cache in tokenizer_util.py under each task.

query_att = attention[0]
title_att = attention[1]

model.clear_gradients()
Member

Since paddle 2.0 the recommended API is model.clear_grad(), consistent with torch; please replace this function globally.

Contributor Author

Replaced globally.

print('query_att: %s' % query_att.shape)
print('title_att: %s' % title_att.shape)

# print([query_att, query, title_att, title])
Member

Remove meaningless comments.

Contributor Author

done

# yapf: enable


def interpreter(model,
Member

Function names are usually a verb or a verb phrase; using a plain noun here does not express the function's intent precisely.

Contributor Author

Changed to the plain verb interpret.

@@ -0,0 +1,57 @@
TASK=similarity
Member

Script file names need to be uniformly lowercase, per the Baidu code style guidelines.

Contributor Author

Script file names are now uniformly lowercase.


dev_ds = Senti_data().read(
os.path.join(args.data_dir, 'dev'), args.language)
dev_ds.map(map_fn, batched=True)
Member

Generally, Python code is organized by module; the code directory does not form a meaningful module from a code-hosting point of view, so I suggest removing this directory level.

Contributor Author

Removed the code directory.

Contributor Author

The README has been updated accordingly.


sys.path.append('../task/similarity')
from LIME.lime_text import LimeTextExplainer
from roberta.tokenizer import RobertaTokenizer
Member

This part can be called directly from PaddleNLP later on.

return args


class Similarity_data(DatasetBuilder):
Member

For class naming style, this should be:

class SimilarityData

Contributor Author

done

import logging
import argparse

import paddle as P
Member

Official paddle code does not recommend

import paddle as P

This part is open for discussion; using something like import paddle.nn.functional as F is fine. Mainly from the perspective of the API conventions and user experience, one rarely sees import torch as T, so this comment is up for discussion.

Contributor Author

Changed to a plain import paddle.

return args


class Senti_data(DatasetBuilder):
Member

Python class names follow the Google coding style and use CamelCase:
Senti_data -> SentiData

Contributor Author

done

@@ -0,0 +1,30 @@
backports.entry-points-selectable>=1.1.1
Member

Most of these dependencies are pulled in automatically when paddlenlp is installed, so users do not need to install them separately. Please create a clean Python 3.8 environment in conda and confirm which additional dependencies the code actually introduces.
I understand that visualdl, LAC, and spacy are genuinely new dependencies,
but things like numpy and pandas are installed automatically with a correct paddle/paddlenlp installation and do not need to be listed.
For example, six is unnecessary now that everything targets Python 3. So please review and simplify the dependency list here: only emphasize what must be installed, on top of paddlenlp and paddle, in order to run this module, rather than exporting every dependency of your current environment.

Contributor Author

done

@@ -0,0 +1,608 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
Member

Let me confirm: did you do any additional development or insertion in this tokenizer file? If this code is accessible inside the paddlenlp library, do you still need to copy it here? It seems redundant.

Contributor Author

All local roberta tokenizers have been replaced with calls to the released paddlenlp version.

@ZeyuChen ZeyuChen (Member) left a comment

Next week's 2.2.5 patch release fixes the missing RoBERTa en models; you can try a local setup install of the latest paddlenlp develop branch, which would let you drop a lot of code and documentation.

@ZeyuChen ZeyuChen (Member) left a comment

For the modeling.py, generation_utils.py, and similar files under every roberta directory: can these all reuse the files inside PaddleNLP, instead of committing a duplicate, very large copy for each task?

@@ -0,0 +1,630 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
Member

Can this modeling code also reuse PaddleNLP's in the same way?

@@ -0,0 +1,1023 @@
# !/usr/bin/env python3
Member

I am wondering whether generation_utils.py and modeling.py here can both be deleted as well? My understanding is that they duplicate files already in the paddlenlp library?

# -*- coding:utf-8 -*-
##########################################################
# Copyright (c) 2019 Baidu.com, Inc. All Rights Reserved #
##########################################################
Collaborator

Please also keep the copyright headers consistent.

Contributor Author

done

eos_token_id = eos_token_id if eos_token_id is not None else getattr(
self, 'eos_token_id', None)
pad_token_id = pad_token_id if pad_token_id is not None else getattr(
self, 'pad_token_id', None)
Collaborator

Please also confirm whether paddlenlp.transformers.generation_utils can be used directly; this looks like a slightly older version of it.

Contributor Author

Removed generation_utils.py, model_utils.py, utils.py, and vocab.py from the roberta directory; everything now calls the paddlenlp interfaces instead.
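For reference, a brief sketch of what calling the paddlenlp interfaces can look like (the pretrained-model name is illustrative; the exact classes each task needs may differ):

```python
# Reuse the library implementations instead of keeping local copies of
# generation_utils.py / model_utils.py under each task's roberta directory.
from paddlenlp.transformers import RobertaModel, RobertaTokenizer
from paddlenlp.transformers.model_utils import PretrainedModel, register_base_model

tokenizer = RobertaTokenizer.from_pretrained("roberta-wwm-ext")  # example model name
model = RobertaModel.from_pretrained("roberta-wwm-ext")
```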

return None # Overwrite for models with output embeddings

@classmethod
def from_pretrained(cls, pretrained_model_name_or_path, *args, **kwargs):
Collaborator

Please also confirm whether this file can directly use paddlenlp.transformers.model_utils.

Contributor Author

Removed generation_utils.py, model_utils.py, utils.py, and vocab.py from the roberta directory; everything now calls the paddlenlp interfaces instead.

questions,
contexts,
stride=args.doc_stride,
max_seq_len=args.max_seq_len)
Collaborator

The tokenizer in paddlenlp received some updates with breaking changes to align with HF behavior: the tokenizer used to return a list of dicts and now returns a dict of lists, https://github.com/PaddlePaddle/PaddleNLP/pull/1713/files#diff-86a1b461121c41dca0e85147910f19e6018e3e23aa374b28ddc3f60751c0fd3e

If you want to avoid touching the rest of the code, you can simply add return_dict=False here so the other code keeps working; it still needs a small adaptation.

Contributor Author

Added a type check on the tokenizer output and manually convert the result to the old format.
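A minimal sketch of that kind of compatibility shim (an assumption of how the check could be written, not code from this PR; it normalizes the new dict-of-lists output back to the old list-of-dicts format):

```python
def to_list_of_dicts(encoded):
    """Normalize tokenizer output to the old list-of-dicts format."""
    if isinstance(encoded, dict):
        # New behaviour: {"input_ids": [...], "token_type_ids": [...], ...}
        keys = list(encoded.keys())
        return [dict(zip(keys, values)) for values in zip(*(encoded[k] for k in keys))]
    return encoded  # already a list of dicts

# Alternatively, as suggested above, pass return_dict=False to the tokenizer call
# so that downstream code keeps receiving the old format.
```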

max_position_embeddings=512,
type_vocab_size=16,
initializer_range=0.02,
pad_token_id=0):
Collaborator

Note the differences from paddlenlp/transformers/roberta/modeling.py; you could also add layer_norm_eps here.
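For context, the point is to thread a layer_norm_eps argument through to the LayerNorm layers; a tiny illustration (the 1e-12 default and hidden size are assumptions, mirror whatever paddlenlp/transformers/roberta/modeling.py uses):

```python
import paddle.nn as nn

layer_norm_eps = 1e-12  # assumed default; take the real value from paddlenlp's RobertaModel
layer_norm = nn.LayerNorm(768, epsilon=layer_norm_eps)  # 768 = hidden size, for illustration
```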

Contributor Author

Added.

loss = scaler.scale(loss)
loss.backward()
scaler.minimize(opt, loss)
model.clear_gradients()
Collaborator

If an optimizer is used, the recommendation is still optimizer.clear_grad().

Contributor Author

Replaced model.clear_gradients() with opt.clear_grad().
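For illustration, a minimal sketch of the corrected AMP training step (the toy model and data are hypothetical; the relevant change is clearing gradients through the optimizer):

```python
import paddle

model = paddle.nn.Linear(8, 2)  # stand-in for the real model
opt = paddle.optimizer.AdamW(learning_rate=1e-3, parameters=model.parameters())
scaler = paddle.amp.GradScaler(init_loss_scaling=1024)

x = paddle.randn([4, 8])
y = paddle.randint(0, 2, [4])

with paddle.amp.auto_cast():
    loss = paddle.nn.functional.cross_entropy(model(x), y)

scaled = scaler.scale(loss)   # scale the loss before backward
scaled.backward()
scaler.minimize(opt, scaled)  # unscale gradients and take the optimizer step
opt.clear_grad()              # clear grads via the optimizer, replacing model.clear_gradients()
```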

# start_logits: (bsz, seq); end_logits: (bsz, seq); cls_logits: (bsz, 2)
# attention: list((bsz, head, seq, seq) * 12); embedded: (bsz, seq, emb)
_, start_logits, end_logits, cls_logits, attentions, embedded = model.forward_interpret(
*fwd_args, **fwd_kwargs)
Collaborator

What we originally had in mind was for users to grab any intermediate result they need via register_forward_post_hook, for example by appending the required tensors to a shared list inside the hook. That turns the various "fetch an intermediate result" needs into plugins and reuses the existing model code better.

We have a similar need for distillation and put together a rough plugin-style version here: https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/transformers/distill_utils.py#L189

We will also look at exposing this kind of functionality as an API later to make it easier to use.
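For illustration, a minimal sketch of the hook-based approach described above (toy model; in the real use case the hook would be attached to an attention or embedding layer):

```python
import paddle

collected = []

def save_output(layer, inputs, output):
    # Stash the intermediate result in a shared list, as suggested in the comment.
    collected.append(output)

linear1 = paddle.nn.Linear(8, 8)
model = paddle.nn.Sequential(linear1, paddle.nn.ReLU(), paddle.nn.Linear(8, 2))

handle = linear1.register_forward_post_hook(save_output)
logits = model(paddle.randn([4, 8]))
handle.remove()  # detach the hook once the intermediate output has been captured

print(len(collected), collected[0].shape)  # 1 [4, 8]
```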

Contributor Author

We built a similar interface in our second open-source release; it is not available in this version yet.

version_2_with_negative: bool=False,
n_best_size: int=20,
max_answer_length: int=30,
cls_threshold: float=0.5):
Collaborator

Will these differ from what is currently in paddlenlp.metric? If they are fairly common metrics, we can look at adding them to the paddlenlp package later.

Contributor Author

Added some classes specific to this project and modified two existing classes in paddlenlp.metric.

answerable_probs[1]
])

# Only keep the best `n_best_size` predictions.
@guoshengCS guoshengCS (Collaborator) commented Mar 24, 2022

Please also mind the formatting, and please confirm that pre-commit has been run.

Contributor Author

I usually run pre-commit once at the end of each review round; the code style issues have now been fixed with pre-commit.

@@ -0,0 +1,12089 @@
[PAD]
Collaborator

For somewhat large data and vocabulary files like this one, could they be hosted on BOS and provided as a download link?

Contributor Author

Since providing a link means the user has to download the file separately and put it in the right directory, which is more cumbersome, the vocabulary was not moved out. Could this be improved in the next version? Time is tight right now.
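As a possible follow-up, a sketch of the download-on-first-use approach (the URL below is a placeholder, not a real BOS link):

```python
from paddlenlp.utils.downloader import get_path_from_url

# Placeholder URL: in a real setup this would point at the vocab file hosted on BOS.
VOCAB_URL = "https://bj.bcebos.com/paddlenlp/<path-to-this-module>/vocab.txt"
vocab_path = get_path_from_url(VOCAB_URL, root_dir="./download")
```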

from paddlenlp.transformers.model_utils import PretrainedModel, register_base_model
from roberta.transformer import TransformerEncoderLayer, TransformerEncoder

__all__ = [
Collaborator

I also see multiple copies of roberta/modeling.py and roberta/transformer.py; are they identical?

Contributor Author

The transformer.py files are identical; the modeling.py files are not.

@guoshengCS guoshengCS merged commit 93cae49 into PaddlePaddle:develop Mar 25, 2022
PaddleNLP v2.3 automation moved this from In progress to Done Mar 25, 2022
ZeyuChen added a commit to ZeyuChen/PaddleNLP that referenced this pull request Apr 17, 2022
* upload NLP interpretation

* fix problems and relocate project

* remove abandoned picture

* remove abandoned picture

* fix dead link in README

* fix dead link in README

* fix code style problems

* fix CR round 1

* remove .gitkeep files

* fix code style

* fix file encoding problem

* fix code style

* delete duplicated files due to directory rebuild

* fix CR round 2

* fix code style

* fix ernie tokenizer

* fix code style

* fix problem from CR round 1

* fix bugs

* fix README

* remove duplicated files

* deal with diff of old and new tokenizer results

* fix CR round 4

* fix code style

* add missing dependence

* fix broken import path

* move some data file to cloud

* MRC upper case to lower case

Co-authored-by: Zeyu Chen <chenzeyu01@baidu.com>
Co-authored-by: binlinquge <xxx>
Co-authored-by: Guo Sheng <guosheng@baidu.com>
Labels: enhancement (New feature or request)
4 participants