
Incorrect sentence parsing using ja_core_news_trf #12099

Closed
e-e opened this issue Jan 12, 2023 · 1 comment
Labels

  • feat / parser (Feature: Dependency Parser)
  • lang / ja (Japanese language data and models)
  • perf / accuracy (Performance: accuracy)

Comments


e-e commented Jan 12, 2023

I'm not sure if this is the proper place to report this, and it's the first time I've seen something like this, but I wanted to open an issue in case it is something that should in fact be reported.

How to reproduce the behaviour

import spacy

nlp = spacy.load("ja_core_news_trf")
doc = nlp("息子のためとあれば、火の中でも飛び込みます。")

for sent in doc.sents:
    print(sent.text)

outputs:

息子のためとあれば、火の中でも飛び
込み
ます
。

Your Environment

  • spaCy version: 3.4.3
  • Platform: macOS-12.5.1-x86_64-i386-64bit
  • Python version: 3.10.8
  • Pipelines: ja_core_news_lg (3.4.0), ja_core_news_trf (3.4.0)
@e-e e-e changed the title from "Improper sentence parsing using ja_core_news_trf" to "Incorrect sentence parsing using ja_core_news_trf" Jan 12, 2023
@polm polm added the lang / ja (Japanese language data and models) and perf / accuracy (Performance: accuracy) labels Jan 13, 2023
Contributor

polm commented Jan 13, 2023

In general, issues like this fall under #3052, which basically amounts to "the models make mistakes sometimes". If the mistake is common and follows a clear pattern, that might point to a fixable issue. In this case, there does seem to be something weird about how compound verbs are handled, so we'll take a closer look at that.

Note that if your goal is actually just sentence segmentation for Japanese, you should get high quality results with a punctuation-based sentencizer instead of relying on the default sentence boundaries, which are based on the parse tree.
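As a minimal sketch of that sentencizer approach (assuming ja_core_news_trf is installed; excluding the parser and adding the rule-based sentencizer is one possible setup, not something spelled out in this thread):

import spacy

# Load the pipeline without the parser so its boundaries are not used,
# and let the rule-based sentencizer split on punctuation (its defaults
# include 。). If you still need dependency parses, you could instead
# keep the parser and add the sentencizer before it.
nlp = spacy.load("ja_core_news_trf", exclude=["parser"])
nlp.add_pipe("sentencizer")

doc = nlp("息子のためとあれば、火の中でも飛び込みます。")
for sent in doc.sents:
    print(sent.text)
# expected: the example text comes back as a single sentence, split only at 。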

@polm polm added the feat / parser Feature: Dependency Parser label Jan 13, 2023
@explosion explosion locked and limited conversation to collaborators Jan 16, 2023
@adrianeboyd adrianeboyd converted this issue into discussion #12106 Jan 16, 2023

This issue was moved to a discussion.

You can continue the conversation there.
