Option to "disable sentence segmentation" needed #13

Closed
KoichiYasuoka opened this issue Mar 30, 2020 · 2 comments
Labels
enhancement (New feature or request)

Comments

@KoichiYasuoka (Contributor) commented Mar 30, 2020

I used "lzh" model, but its performance for sentence segmentation seems rather worse. So I tried to disable sentence segmentation:

import spacy_udpipe

class M(spacy_udpipe.UDPipeModel):
    def tokenize(self, text: str):
        # Build a presegmented tokenizer: each input line is treated as a
        # single sentence, so UDPipe performs no sentence splitting.
        tokenizer = self.model.newTokenizer(self.model.TOKENIZER_PRESEGMENTED)
        return self._read(text=text, input_format=tokenizer)

m = M("lzh")
lzh = spacy_udpipe.UDPipeLanguage(m, m._meta)
doc = lzh("不入虎穴不得虎子")

This quick hack works well, and I think we need an option for spacy_udpipe.load to disable sentence segmentation. What do you think, @asajatovic?
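For example, something along these lines (the presegmented keyword is only a sketch of the proposed option, not an existing spacy_udpipe API):

import spacy_udpipe

# Hypothetical flag: switch the underlying UDPipe tokenizer to
# TOKENIZER_PRESEGMENTED so each input line is taken as one sentence.
lzh = spacy_udpipe.load("lzh", presegmented=True)
doc = lzh("不入虎穴不得虎子")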

@asajatovic (Collaborator) commented

@KoichiYasuoka after some consideration, I think it could work, as there are a few ancient languages that would benefit from pre-segmented text input.
What concerns me is how the end user will be able to provide the pre-segmented text this requires. I'd like to know what you think about this.
Unfortunately, I am too busy to work on this, so I'd like to encourage you to do it (if you are up for it)! 😄

@asajatovic (Collaborator) commented May 9, 2020

@KoichiYasuoka, I initially thought this would be much harder to enable than it turned out to be in #19.
It works now! 😅
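For reference, a minimal usage sketch, assuming the change from #19 treats a list of strings passed to the pipeline as pre-segmented sentences (one string per sentence):

import spacy_udpipe

spacy_udpipe.download("lzh")
nlp = spacy_udpipe.load("lzh")

# Assumption: a list of strings is taken as already-segmented sentences,
# so no further sentence splitting is performed.
doc = nlp(["不入虎穴", "不得虎子"])
print([sent.text for sent in doc.sents])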
