Option to "disable sentence segmentation" needed #13

Closed
KoichiYasuoka opened this issue Mar 30, 2020 · 2 comments
Labels
enhancement (New feature or request)

Comments

@KoichiYasuoka (Contributor) commented Mar 30, 2020

I used "lzh" model, but its performance for sentence segmentation seems rather worse. So I tried to disable sentence segmentation:

import spacy_udpipe

class M(spacy_udpipe.UDPipeModel):
    def tokenize(self, text: str):
        # Build a presegmented tokenizer: each input line is treated as a
        # single sentence, so UDPipe performs no sentence splitting.
        tokenizer = self.model.newTokenizer(self.model.TOKENIZER_PRESEGMENTED)
        return self._read(text=text, input_format=tokenizer)

m = M("lzh")
lzh = spacy_udpipe.UDPipeLanguage(m, m._meta)
doc = lzh("不入虎穴不得虎子")

This quick hack works well, and I think we need an option for spacy_udpipe.load to disable sentence segmentation. What do you think, @asajatovic?
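For example, something along these lines (the presegmented keyword is only a sketch of the proposed option, not an existing spacy_udpipe API):

import spacy_udpipe

# Hypothetical flag: switch the underlying UDPipe tokenizer to
# TOKENIZER_PRESEGMENTED so each input line is taken as one sentence.
lzh = spacy_udpipe.load("lzh", presegmented=True)
doc = lzh("不入虎穴不得虎子")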

@asajatovic (Collaborator) commented

@KoichiYasuoka after some consideration, I think it could work, as there are a few ancient languages that would benefit from pre-segmented text input.
What concerns me is how the end user will be able to provide the pre-segmented text this requires. I'd like to know what you think about this.
Unfortunately, I am too busy to work on this, so I'd like to encourage you to do it (if you are up for it)! 😄

@asajatovic (Collaborator) commented May 9, 2020

@KoichiYasuoka, I initially thought this would be much harder to enable than it turned out to be in #19.
It works now! 😅
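For reference, a minimal usage sketch, assuming the change from #19 treats a list of strings passed to the pipeline as pre-segmented sentences (one string per sentence):

import spacy_udpipe

spacy_udpipe.download("lzh")
nlp = spacy_udpipe.load("lzh")

# Assumption: a list of strings is taken as already-segmented sentences,
# so no further sentence splitting is performed.
doc = nlp(["不入虎穴", "不得虎子"])
print([sent.text for sent in doc.sents])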
