CSL sample noise issue #115
Comments
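When we built the dataset, the forged-keyword step already excluded any generated keyword that turned out to be a real keyword. When the label is 0, at least one keyword in the sequence is forged.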
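The maintainers did not publish their mixing code, so the following is only a minimal sketch of the invariant described in this reply: rank candidate terms by tf-idf, discard any candidate that is already a real keyword, and swap one real keyword for a forged one. The function names `tfidf_candidates` and `make_negative`, the pre-tokenized input, and the "replace exactly one keyword" policy are all assumptions made for illustration, not the CLUE implementation.

```python
import math
import random
from collections import Counter

def tfidf_candidates(docs, doc_index, top_k=10):
    """Rank the tokens of one tokenized document by a plain tf-idf score."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each token occurs.
    df = Counter(tok for doc in docs for tok in set(doc))
    tf = Counter(docs[doc_index])
    total = len(docs[doc_index])
    scores = {
        tok: (count / total) * math.log(n_docs / df[tok])
        for tok, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]

def make_negative(real_keywords, candidates, rng):
    """Hypothetical label-0 builder: swap one real keyword for a forged one."""
    real = set(real_keywords)
    # Exclusion step: a candidate that is a real keyword can never be "fake".
    fakes = [c for c in candidates if c not in real]
    if not fakes or not real_keywords:
        return None  # no safe forged keyword available, skip this sample
    forged = list(real_keywords)
    forged[rng.randrange(len(forged))] = rng.choice(fakes)
    return forged

# Toy, pre-tokenized abstracts (illustrative data only).
docs = [
    ["bond_length", "molecule", "structure", "calculation"],
    ["molecule", "spectrum", "experiment"],
    ["structure", "optimization", "algorithm"],
]
negative = make_negative(
    ["bond_length", "molecule"], tfidf_candidates(docs, 0), random.Random(0)
)
print(negative)
```

Because every forged term is filtered against the real keyword set, a sample produced this way always contains at least one keyword that is not real and cannot contain a duplicated real keyword.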
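________________________________
From: poke ***@***.***>
Sent: Friday, April 9, 2021 11:28:41 AM
To: CLUEbenchmark/CLUE ***@***.***>
Cc: Subscribed ***@***.***>
Subject: [CLUEbenchmark/CLUE] CSL sample noise issue (#115)
Keyword recognition task:
“csl_public.zip is drawn from the abstracts of Chinese papers and their keywords; the papers are selected from a number of core Chinese journals in the social and natural sciences. Forged keywords generated with tf-idf are mixed with the papers' real keywords to construct abstract-keyword pairs, and the task for a machine learning model is to judge from the abstract whether all of the keywords are real.”
There is a problem: a keyword generated by tf-idf may in fact be a real keyword. I found some noise in the training and validation sets:
[image]<https://user-images.githubusercontent.com/37020799/114124225-8a1e9880-9926-11eb-93b3-b208f49299ee.png>
The test set may contain the same noise. How should it be handled? Could the keyword-mixing method be made public?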
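One way to surface this kind of noise could be sketched as follows: collect the keywords of every label-1 row per abstract, then flag label-0 rows whose keywords all fall inside that set. The field names `abst`, `keyword`, and `label`, and the assumption that the same abstract appears in both label-1 and label-0 rows, are assumptions about the data layout; `find_suspect_negatives` is a hypothetical helper, not part of CLUE.

```python
from collections import defaultdict

def find_suspect_negatives(samples):
    """Return label-0 samples whose keywords all occur in label-1 rows
    that share the same abstract (assumed fields: abst, keyword, label)."""
    real_by_abst = defaultdict(set)
    for s in samples:
        if str(s["label"]) == "1":
            real_by_abst[s["abst"]].update(s["keyword"])

    suspects = []
    for s in samples:
        if str(s["label"]) == "0":
            real = real_by_abst.get(s["abst"], set())
            # Only flag when there is a label-1 reference to compare against.
            if real and set(s["keyword"]) <= real:
                suspects.append(s)
    return suspects

# Toy rows (illustrative data only).
samples = [
    {"abst": "A", "keyword": ["k1", "k2", "k3"], "label": "1"},
    {"abst": "A", "keyword": ["k1", "k2", "k1"], "label": "0"},  # all real
    {"abst": "A", "keyword": ["k1", "zz", "k3"], "label": "0"},  # one forged
]
for row in find_suspect_negatives(samples):
    print(row["keyword"])
```

A flagged row is only a candidate for manual review: the check cannot tell whether the label or the keyword list is the part that is wrong.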
In my screenshot, the highlighted row has label=0, but all of its keywords can be found in the label=1 rows below it.
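OK, I understand what you mean. This version of the dataset does appear to have a problem. We will release a new CSL dataset later. Thank you for pointing this out.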
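________________________________
From: poke ***@***.***>
Sent: Friday, April 9, 2021 11:40:11 AM
To: CLUEbenchmark/CLUE ***@***.***>
Cc: Li Yudong ***@***.***>; Comment ***@***.***>
Subject: Re: [CLUEbenchmark/CLUE] CSL sample noise issue (#115)
When we built the dataset, the forged-keyword step already excluded any generated keyword that turned out to be a real keyword. When the label is 0, at least one keyword in the sequence is forged.
Here is a new screenshot. In the original dataset, row 3 has label=0 and the keyword 键长 ("bond length") appears twice. One of the two occurrences was presumably constructed by tf-idf, and the keywords were not deduplicated. In fact, every keyword in this row is a real keyword.
[image]<https://user-images.githubusercontent.com/37020799/114125010-2b5a1e80-9928-11eb-9526-89c7bb8f2bbb.png>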
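The duplicate described in this reply suggests a simple check that could be run over each row: list the keywords that occur more than once, and optionally deduplicate while preserving order. This is a sketch on a plain Python list; `repeated_keywords` and `dedupe_keep_order` are hypothetical helper names, and the sample list is illustrative rather than copied from the dataset.

```python
from collections import Counter

def repeated_keywords(keywords):
    """Return the keywords that occur more than once, in first-seen order."""
    counts = Counter(keywords)
    return [k for k, n in counts.items() if n > 1]

def dedupe_keep_order(keywords):
    """Drop repeated keywords while keeping the original order."""
    return list(dict.fromkeys(keywords))

# Illustrative row with the same keyword listed twice.
row = ["键长", "分子", "键长"]
print(repeated_keywords(row))
print(dedupe_keep_order(row))
```

A label-0 row that contains a repeated keyword and no keyword outside the real set is a likely instance of the noise reported here.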
Has the new dataset been released? Where is it published?
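Hello. The dataset contains a small amount of noise, but this does not affect its difficulty or discriminative power. Please run your evaluation on the current version of CSL. Any update to the dataset will be published on CLUE.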
________________________________
From: brightmart ***@***.***>
Sent: Saturday, May 8, 2021 11:19:06 PM
To: CLUEbenchmark/CLUE ***@***.***>
Cc: Li Yudong ***@***.***>; Assign ***@***.***>
Subject: Re: [CLUEbenchmark/CLUE] CSL sample noise issue (#115)
Assigned #115<#115> to @P01son6415<https://github.com/P01son6415>.