
Incorrect sentence parsing using ja_core_news_trf #12099

Closed
e-e opened this issue Jan 12, 2023 · 1 comment
Labels

  • feat / parser (Feature: Dependency Parser)
  • lang / ja (Japanese language data and models)
  • perf / accuracy (Performance: accuracy)

Comments


e-e commented Jan 12, 2023

I'm not sure if this is the proper place to report this, and it's the first time I've seen something like this, but I wanted to open an issue in case it is something that should in fact be reported.

How to reproduce the behaviour

import spacy

nlp = spacy.load("ja_core_news_trf")
doc = nlp("息子のためとあれば、火の中でも飛び込みます。")

for sent in doc.sents:
    print(sent.text)

outputs:

息子のためとあれば、火の中でも飛び
込み
ます
。

Your Environment

  • spaCy version: 3.4.3
  • Platform: macOS-12.5.1-x86_64-i386-64bit
  • Python version: 3.10.8
  • Pipelines: ja_core_news_lg (3.4.0), ja_core_news_trf (3.4.0)
@e-e e-e changed the title from "Improper sentence parsing using ja_core_news_trf" to "Incorrect sentence parsing using ja_core_news_trf" Jan 12, 2023
@polm polm added the lang / ja (Japanese language data and models) and perf / accuracy (Performance: accuracy) labels Jan 13, 2023
Contributor

polm commented Jan 13, 2023

In general, issues like this fall under #3052, which basically amounts to "the models make mistakes sometimes". If the mistake is common and follows a clear pattern, that might point to a fixable issue. In this case, there does seem to be something weird about how compound verbs are handled, so we'll take a closer look at that.

Note that if your goal is actually just sentence segmentation for Japanese, you should get high quality results with a punctuation-based sentencizer instead of relying on the default sentence boundaries, which are based on the parse tree.
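As a minimal sketch of that sentencizer approach (assuming ja_core_news_trf is installed; excluding the parser and adding the rule-based sentencizer is one possible setup, not something spelled out in this thread):

import spacy

# Load the pipeline without the parser so its boundaries are not used,
# and let the rule-based sentencizer split on punctuation (its defaults
# include 。). If you still need dependency parses, you could instead
# keep the parser and add the sentencizer before it.
nlp = spacy.load("ja_core_news_trf", exclude=["parser"])
nlp.add_pipe("sentencizer")

doc = nlp("息子のためとあれば、火の中でも飛び込みます。")
for sent in doc.sents:
    print(sent.text)
# expected: the example text comes back as a single sentence, split only at 。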

@polm polm added the feat / parser Feature: Dependency Parser label Jan 13, 2023
@explosion explosion locked and limited conversation to collaborators Jan 16, 2023
@adrianeboyd adrianeboyd converted this issue into discussion #12106 Jan 16, 2023

This issue was moved to a discussion.

You can continue the conversation there.
