Problems with ERNIE-Layout when extracting information (names and email addresses) #4031

Closed
dingidng opened this issue Dec 7, 2022 · 3 comments
dingidng commented Dec 7, 2022

Please ask your question

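Since I cannot attach an image, I am pasting the text extracted from the PDF page I want to process in order to explain the problem. This is the first page of a paper: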
MahnazEkhlasi-Hundrieser,ShahlaChaichian,AbolfazlMehdizadehkashi, andAtefehVaezi Contents 1 Introduction 2 CurcuminintheTreatmentofGynecologicalCancers 3 CurcumininRestoringPlatinumDrug-InducedResistance 4 MolecularEvidencesofSynergisticAnticancerFeaturesofCurcuminandPaclitaxel InVivo 5 CurcuminandPaclitaxelintheFormofNanoformulations A.A.Momtazi-Borojeni(*) NanotechnologyResearchCenter,Bu-AliResearchInstitute,MashhadUniversityofMedical Sciences,Mashhad,Iran DepartmentofMedicalBiotechnology,StudentResearchCommittee,FacultyofMedicine, MashhadUniversityofMedicalSciences,Mashhad,Iran e-mail:momtaziaa921@mums.ac.ir;abbasmomtazi@yahoo.com J.Mosafer ResearchCenterofAdvancedTechnologiesinMedicine,TorbatHeydariehUniversityof MedicalSciences,TorbatHeydarieh,Iran B.Nikfar(*) ParsAdvancedandMinimallyInvasiveMedicalMannersResearchCenter,ParsHospital,Iran UniversityofMedicalSciences,Tehran,Iran e-mail:banafsheh.nikfar@gmail.com M.Ekhlasi-Hundrieser Werlhof-Institut,Hannover,Germany S.Chaichian MinimallyInvasiveTechniquesResearchCenterinWomen,TehranMedicalSciencesBranch, IslamicAzadUniversity,Tehran,Iran A.Mehdizadehkashi EndometriosisandGynecologicDisordersResearchCenter,IranUniversityofMedical Sciences,Tehran,Iran A.Vaezi DepartmentofCommunityMedicine,SchoolofMedicine,IsfahanUniversityofMedical Sciences,Isfahan,Iran
Method 1: ERNIE-Layout. It looked powerful, so I tried it, but I ran into wrong extractions and incomplete extractions.
from paddlenlp import Taskflow

docprompt_en = Taskflow("document_intelligence", lang="en")  # Set OCR language to English
docprompt_en({"doc": "./images/paper_1.jpg", "prompt": ["作者是谁并且邮箱是什么", "人名邮箱"]})

Result: when both entities are asked for in one prompt (the prompts mean "who is the author and what is the email" and "name and email"), only the email is extracted.
[{'prompt': '作者是谁并且邮箱是什么', 'result': [{'value': 'momtaziaa921@mums.ac.irabbasmomtazi@yahoo.com', 'prob': 1.0, 'start': 69, 'end': 79}]}, {'prompt': '人名邮箱', 'result': [{'value': 'momtaziaa921@mums.ac.irabbasmomtazi@yahoo.com', 'prob': 1.0, 'start': 69, 'end': 79}]}]
Next I asked for each field separately (the prompts mean "name", "email", "other names", "other emails"):
docprompt_en({"doc": "./images/paper_1.jpg", "prompt": ["姓名","邮箱","其他人名","其他邮箱" ]})

Only the first person's information is extracted:
[{'prompt': '姓名', 'result': [{'value': 'AA.Momtazi-Borojeni', 'prob': 0.76, 'start': 0, 'end': 4}]}, {'prompt': '邮箱', 'result': [{'value': 'momtaziaa921@mums.ac.irabbasmomtazi@yahoo.com', 'prob': 1.0, 'start': 69, 'end': 79}]}, {'prompt': '其他人名', 'result': [{'value': 'AA.Momtazi-', 'prob': 0.59, 'start': 0, 'end': 3}]}, {'prompt': '其他邮箱', 'result': [{'value': 'momtaziaa921@mums.ac.irabbasmomtazi@yahoo.com', 'prob': 0.99, 'start': 69, 'end': 79}]}]
I then changed the wording of the prompts (they now mean "name", "email", "second name", "second email"):
docprompt_en({"doc": "./images/paper_1.jpg", "prompt": ["姓名","邮箱","第二个姓名","第二个邮箱" ]})

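The second name is extracted, but the email is shifted: the address returned belongs to a later author. The model appears to force a match and does not return an empty result even though the second author has no email.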
[{'prompt': '姓名', 'result': [{'value': 'AA.Momtazi-Borojeni', 'prob': 0.76, 'start': 0, 'end': 4}]}, {'prompt': '邮箱', 'result': [{'value': 'momtaziaa921@mums.ac.irabbasmomtazi@yahoo.com', 'prob': 1.0, 'start': 69, 'end': 79}]}, {'prompt': '第二个姓名', 'result': [{'value': 'J.Mosafer', 'prob': 0.83, 'start': 80, 'end': 82}]}, {'prompt': '第二个邮箱', 'result': [{'value': 'banafshch.nikfar@gmail.com', 'prob': 1.0, 'start': 153, 'end': 159}]}]
I then tried a longer prompt list (name and email for the first through seventh author):
docprompt_en({"doc": "./images/paper_1.jpg", "prompt": ["姓名","邮箱","第二个姓名","第二个邮箱","第三个姓名","第三个邮箱","第四个姓名","第四个邮箱","第五个姓名","第五个邮箱","第六个姓名","第六个邮箱","第七个姓名","第七个邮箱" ]})

The results are completely scrambled:
[{'prompt': '姓名', 'result': [{'value': 'AA.Momtazi-Borojeni', 'prob': 0.76, 'start': 0, 'end': 4}]}, {'prompt': '邮箱', 'result': [{'value': 'momtaziaa921@mums.ac.irabbasmomtazi@yahoo.com', 'prob': 1.0, 'start': 69, 'end': 79}]}, {'prompt': '第二个姓名', 'result': [{'value': 'J.Mosafer', 'prob': 0.83, 'start': 80, 'end': 82}]}, {'prompt': '第二个邮箱', 'result': [{'value': 'banafshch.nikfar@gmail.com', 'prob': 1.0, 'start': 153, 'end': 159}]}, {'prompt': '第三个姓名', 'result': [{'value': 'J.Mosafer', 'prob': 0.97, 'start': 80, 'end': 82}]}, {'prompt': '第三个邮箱', 'result': [{'value': 'banafshch.nikfar@gmail.com', 'prob': 1.0, 'start': 153, 'end': 159}]}, {'prompt': '第四个姓名', 'result': [{'value': 'J.Mosafer', 'prob': 0.85, 'start': 80, 'end': 82}]}, {'prompt': '第四个邮箱', 'result': [{'value': 'banafshch.nikfar@gmail.com', 'prob': 1.0, 'start': 153, 'end': 159}]}, {'prompt': '第五个姓名', 'result': [{'value': 'J.Mosafer', 'prob': 0.85, 'start': 80, 'end': 82}]}, {'prompt': '第五个邮箱', 'result': [{'value': 'banafshch.nikfar@gmail.com', 'prob': 0.99, 'start': 153, 'end': 159}]}, {'prompt': '第六个姓名', 'result': [{'value': 'J.Mosafer', 'prob': 0.88, 'start': 80, 'end': 82}]}, {'prompt': '第六个邮箱', 'result': [{'value': 'banafshch.nikfar@gmail.com', 'prob': 1.0, 'start': 153, 'end': 159}]}, {'prompt': '第七个姓名', 'result': [{'value': 'AVaezi', 'prob': 0.87, 'start': 225, 'end': 225}]}, {'prompt': '第七个邮箱', 'result': [{'value': 'banafshch.nikfar@gmail.com', 'prob': 0.95, 'start': 153, 'end': 159}]}]
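One way to make this scrambling visible is to group the prompts by the answer they received: if several prompts come back with the same value and the same start/end, the model has most likely fallen back to one region of the page instead of finding a distinct answer for each prompt. The following is only a minimal sketch, assuming the result structure printed above; the helper names `prompts_by_span` and `repeated_answers` are hypothetical and not part of PaddleNLP.

```python
from collections import defaultdict


def prompts_by_span(results):
    """Map each (start, end, value) answer to the prompts that returned it."""
    spans = defaultdict(list)
    for item in results:
        for answer in item["result"]:
            key = (answer["start"], answer["end"], answer["value"])
            spans[key].append(item["prompt"])
    return dict(spans)


def repeated_answers(results):
    """Keep only the answers that were returned for more than one prompt."""
    return {k: v for k, v in prompts_by_span(results).items() if len(v) > 1}


# Abbreviated copy of the output above (four of the fourteen prompts):
# second name, third name, seventh name, second email.
sample = [
    {"prompt": "第二个姓名", "result": [{"value": "J.Mosafer", "prob": 0.83, "start": 80, "end": 82}]},
    {"prompt": "第三个姓名", "result": [{"value": "J.Mosafer", "prob": 0.97, "start": 80, "end": 82}]},
    {"prompt": "第七个姓名", "result": [{"value": "AVaezi", "prob": 0.87, "start": 225, "end": 225}]},
    {"prompt": "第二个邮箱", "result": [{"value": "banafshch.nikfar@gmail.com", "prob": 1.0, "start": 153, "end": 159}]},
]

duplicates = repeated_answers(sample)
```

On the abbreviated sample, only the `J.Mosafer` span is reported, because it was returned for both the "second name" and the "third name" prompt.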

Method 2: extract with the PDFPlumber library and the PaddleNLP UIE model. Problem: the names and emails cannot be matched one-to-one.
A.A.Momtazi-Borojeni(*) NanotechnologyResearchCenter,Bu-AliResearchInstitute,MashhadUniversityofMedical Sciences,Mashhad,Iran DepartmentofMedicalBiotechnology,StudentResearchCommittee,FacultyofMedicine, MashhadUniversityofMedicalSciences,Mashhad,Iran e-mail:momtaziaa921@mums.ac.ir;abbasmomtazi@yahoo.com
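After grabbing the text with PDFPlumber, I do not yet know what logic to use to combine this passage into a single sentence that can be fed to UIE for extraction. I would appreciate some advice.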

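I fell back on the second method reluctantly and ran into the problem described here, but overall it feels more feasible and more controllable than the first approach.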
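For the pairing question in Method 2, one possible rule-based starting point is to split the PDFPlumber text at each author token and attach to that author whatever email addresses appear before the next author token. The sketch below is an assumption about this particular layout, not a general solution: it assumes author names look like initials followed by a surname (`A.A.Momtazi-Borojeni`, `J.Mosafer`), optionally followed by `(*)`, and the names `AUTHOR_RE`, `EMAIL_RE` and `pair_authors_with_emails` are hypothetical.

```python
import re

# Assumed author pattern: one or more initials ("A.", "A.A.") followed by a surname.
AUTHOR_RE = re.compile(r"\b((?:[A-Z]\.)+[A-Z][A-Za-z-]+)(\(\*\))?")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def pair_authors_with_emails(text):
    """Split text at each author token and collect the emails in that author's block."""
    matches = list(AUTHOR_RE.finditer(text))
    pairs = []
    for i, match in enumerate(matches):
        # An author's block runs from the end of the name to the next author token.
        block_end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        block = text[match.end():block_end]
        pairs.append({"name": match.group(1), "emails": EMAIL_RE.findall(block)})
    return pairs


# Shortened excerpt of the extracted text above (affiliations abbreviated).
text = (
    "A.A.Momtazi-Borojeni(*) NanotechnologyResearchCenter,Mashhad,Iran "
    "e-mail:momtaziaa921@mums.ac.ir;abbasmomtazi@yahoo.com "
    "J.Mosafer ResearchCenterofAdvancedTechnologiesinMedicine,TorbatHeydarieh,Iran "
    "B.Nikfar(*) ParsHospital,Tehran,Iran e-mail:banafsheh.nikfar@gmail.com"
)

pairs = pair_authors_with_emails(text)
```

With this split, an author without an address (here `J.Mosafer`) simply gets an empty list instead of inheriting the next author's email. Each per-author block could also be passed to UIE on its own rather than the whole page at once, although that is untested here.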

@dingidng dingidng added the question Further information is requested label Dec 7, 2022
@github-actions github-actions bot added the triage label Dec 7, 2022
@linjieccc linjieccc self-assigned this Dec 7, 2022
@linjieccc
Contributor

@dingidng Hi,

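1) DocPrompt

You can try changing the topn parameter so that multiple answers are returned.

For example:

Taskflow("document_intelligence", topn=10)

2) UIE

You can look forward to the UIE-X cross-modal model we are about to release. It fuses image and layout information for end-to-end document extraction.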
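As a rough illustration of the topn suggestion in point 1, the sketch below wraps the call and flattens the output into the distinct values per prompt above a probability threshold. It assumes that topn only lengthens each prompt's `result` list and keeps the same fields as the output shown earlier in this thread, and that `lang` and `topn` can be passed together; `collect_answers`, `run_docprompt` and `min_prob` are hypothetical names, not part of PaddleNLP.

```python
def collect_answers(results, min_prob=0.5):
    """Flatten DocPrompt output into {prompt: [distinct values]} above a threshold."""
    collected = {}
    for item in results:
        values = []
        for answer in item["result"]:
            if answer["prob"] >= min_prob and answer["value"] not in values:
                values.append(answer["value"])
        collected[item["prompt"]] = values
    return collected


def run_docprompt(image_path, prompts, topn=10):
    """Call DocPrompt with topn answers per prompt (requires paddlenlp)."""
    # Imported lazily so collect_answers can be used without paddlenlp installed.
    from paddlenlp import Taskflow

    docprompt = Taskflow("document_intelligence", lang="en", topn=topn)
    return docprompt({"doc": image_path, "prompt": prompts})


# Two entries copied from the output posted earlier in this thread
# (prompts mean "name" and "other names").
sample = [
    {"prompt": "姓名", "result": [{"value": "AA.Momtazi-Borojeni", "prob": 0.76, "start": 0, "end": 4}]},
    {"prompt": "其他人名", "result": [{"value": "AA.Momtazi-", "prob": 0.59, "start": 0, "end": 3}]},
]

filtered = collect_answers(sample, min_prob=0.7)
```

With a threshold of 0.7, the truncated low-confidence answer for "other names" is dropped and that prompt maps to an empty list.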

@sijunhe sijunhe removed the triage label Dec 8, 2022

github-actions bot commented Feb 7, 2023

This issue is stale because it has been open for 60 days with no activity.

@github-actions github-actions bot added the stale label Feb 7, 2023
@github-actions

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Feb 22, 2023