Data truncation issue #10

Open
Accagain2014 opened this issue Jan 20, 2020 · 2 comments

Comments

@Accagain2014

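Hi, while using the data I found many truncation problems in the training set: some samples are missing a closing ")", some are missing a closing "》", and some are simply cut off mid-phrase. What rule was used to split the raw data into samples? Was it split on punctuation such as ",。!", or by character count?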

(charles 1
{"text": "怎样做才能避免拒签?新浪出国频道邀请到美国使馆签证处的签证官江德力(charles", "label": {"government": {"美国使馆": [[19, 22]]}, "name": {"江德力": [[30, 32]], "(charles": [[33, 40]]}, "company": {"新浪": [[10, 11]]}, "position": {"签证官": [[27, 29]]}}}

《新老金融街争锋— 1
{"text": "【吴海花】:我代表北京《百姓地产》杂志感谢大家赶来参加今天《新老金融街争锋—", "label": {"book": {"《百姓地产》": [[11, 16]], "《新老金融街争锋—": [[29, 37]]}, "name": {"【吴海花】": [[0, 4]]}}}

《伊苏1&2历代记(YsI&II 1
{"text": "系列2款新作,分别为1、2代强化移植合辑的《伊苏1&2历代记(YsI&II", "label": {"game": {"《伊苏1&2历代记(YsI&II": [[21, 36]]}}}

《最后的神迹(The 1
{"text": "由SQUAREENIX制作,预定4月9日推出的PC角色扮演游戏《最后的神迹(The", "label": {"game": {"《最后的神迹(The": [[31, 40]]}, "company": {"SQUAREENIX": [[1, 10]]}}}

@ConnieTong

Hello, the data was preprocessed by splitting long texts into short ones, and that process may have dropped some symbols.

@ConnieTong

We will handle these cases later. For now, you can remove such samples and check whether that affects the model.
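One possible way to apply this workaround is to drop every sample whose `text` has unbalanced brackets. This is only a minimal sketch, not the maintainers' preprocessing. It assumes one JSON object per line with a `text` field, as in the samples quoted above. The bracket set in `PAIRS` and the helper names `has_unbalanced_brackets` and `filter_samples` are hypothetical and chosen for illustration. Samples that were cut off without involving any bracket will not be detected.

```python
import json

# Opening -> closing brackets to check. This set is an assumption based on
# the examples in this issue (ASCII and full-width parentheses, book-title
# marks, lenticular brackets); extend it as needed.
PAIRS = {"(": ")", "(": ")", "《": "》", "【": "】"}
CLOSERS = {close: open_ for open_, close in PAIRS.items()}


def has_unbalanced_brackets(text):
    """Return True if any bracket in `text` is unclosed or unopened."""
    stack = []
    for ch in text:
        if ch in PAIRS:
            stack.append(ch)
        elif ch in CLOSERS:
            # A closer must match the most recent unclosed opener.
            if stack and stack[-1] == CLOSERS[ch]:
                stack.pop()
            else:
                return True
    # Anything left on the stack was opened but never closed.
    return bool(stack)


def filter_samples(lines):
    """Split JSON-lines samples into (kept, dropped) by bracket balance."""
    kept, dropped = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        sample = json.loads(line)
        if has_unbalanced_brackets(sample["text"]):
            dropped.append(sample)
        else:
            kept.append(sample)
    return kept, dropped
```

To use it, open the training file with `encoding="utf-8"`, pass the file object to `filter_samples`, and write the `kept` samples back out with `json.dumps(sample, ensure_ascii=False)`. Keeping the `dropped` list makes it easy to compare model results with and without the truncated samples.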
