
Sentence scanning updates #286

Closed

Conversation

toasted-nutbread
Collaborator

Prompted by some recent comments about sentence scanning, I have tried to make it more consistent. The changes are mostly to TextSourceRange, and the sentence extraction feature is largely unchanged. This change also includes some refactoring to reduce redundant code.

A few things to note:

  • Newlines in Text nodes are now only treated as newlines if the CSS white-space mode lets newline characters break text (see the sketch after this list).
  • Element boundaries that cause line breaks now add newline characters to the text. This makes sentence detection stop more reliably when it encounters a newline, since the newlines in the returned text are more accurate.
  • This does change how sentence scanning behaves. IMO it is likely to feel more consistent, but there are potentially some situations where the old method would return more text than the new one.
  • This also "fixes" a potential issue where text split across lines was scanned as a single term because there was no separating text. For example:
    <div><div>小ぢん</div>まり</div>
    <div>小ぢん<div>まり</div></div>
    Scanning starting at 小 would always detect the full word 小ぢんまり despite there being a line break. This example is contrived, but I've seen this happen with two text nodes that only incidentally form a larger phrase.
  • This change also includes an optimization: sentence scanning used to scan docSentenceExtract.extent three times, when only twice (once forward and once backward) should be necessary.
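To make the white-space point above concrete, here is a minimal sketch, not the PR's actual TextSourceRange code, of how newline handling can be keyed off the computed white-space mode (the function name is hypothetical):

// Hypothetical helper: should newline characters in this Text node be
// treated as line breaks? Under 'normal' and 'nowrap', CSS collapses
// newlines into spaces; the pre-style modes preserve them.
function newlinesBreakText(textNode) {
    const element = textNode.parentElement;
    if (element === null) { return false; }
    switch (window.getComputedStyle(element).whiteSpace) {
        case 'pre':
        case 'pre-wrap':
        case 'pre-line':
        case 'break-spaces':
            return true;  // newlines are preserved and break the line
        default:
            return false; // newlines collapse into spaces
    }
}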

There is potentially another improvement that can be made to sentence extraction. Currently the algorithm terminates a sentence at newline characters, which leads to partial phrases being detected at <br> elements or at line breaks in elements with white-space set to pre, pre-wrap, pre-line, or break-spaces. This could be improved by only terminating when there are two sequential newline characters, combined with treating element boundaries as '\n\n' rather than just '\n'. I have not yet made this change, as I first wanted to get the base changes out of the way. A sketch of the idea follows.
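As a hedged sketch of that follow-up idea, assuming element boundaries have already been encoded as '\n\n' in the extracted text (all names are illustrative, not the project's API):

// Illustrative only: terminate sentence extraction at a blank line
// ('\n\n') rather than at every single '\n'.
function findSentenceTerminator(text, start) {
    for (let i = start; i + 1 < text.length; ++i) {
        if (text[i] === '\n' && text[i + 1] === '\n') {
            return i; // paragraph-level break: stop the sentence here
        }
    }
    return text.length; // no hard break found
}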

This addresses #221, #273, and #276.

@toasted-nutbread
Collaborator Author

38ff0d8 resolves #134.

@siikamiika
Collaborator

I tested this a bit, and it looks pretty cool. One improvement is that you can now scan the readings on the kanji info page. I can review the code some day next week.

@siikamiika siikamiika self-requested a review December 1, 2019 04:52
Review comment on ext/fg/js/source.js (outdated, resolved)
@siikamiika
Collaborator

Apart from the Google Docs bug, I didn't notice any issues. ed19a8b looked like it could have an off-by-one bug, but after closer inspection it seems to work the same.

@toasted-nutbread
Collaborator Author

Due to how Google Docs works, lines are effectively formatted as follows:

<div>... This is a single sentence</div><div>which word wraps.</div>

In English, this works (mostly) fine since there are whitespace/newline word boundaries. However, this causes issues in Japanese since characters can wrap mid-word. Take for example:

<div>...吹</div><div>き...</div>

The previous algorithm handled this "correctly", in that the text would be detected as a single word. The updated algorithm inserts a \n character between 吹 and き, which prevents the word scanner from picking it up.

The issue is related both to the HTML markup and to how we decide to treat the visual line breaks produced by the HTML presentation. For example, despite being presented identically to the example above, the following HTML would not detect 吹き as a single word, neither with the old algorithm nor with the updated one:

<div>...吹</div>
<div>き...</div>

This looks to be more complicated than I had originally thought.
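For concreteness, here is a simplified sketch, not the PR's actual code, of collecting text while inserting '\n' at block element boundaries, which is what separates 吹 and き above:

// Hypothetical illustration of the updated boundary handling.
function collectText(root) {
    let result = '';
    for (const child of root.childNodes) {
        if (child.nodeType === Node.TEXT_NODE) {
            result += child.nodeValue;
        } else if (child.nodeType === Node.ELEMENT_NODE) {
            const display = window.getComputedStyle(child).display;
            if (display === 'block' && result.length > 0) {
                result += '\n'; // element boundary causes a line break
            }
            result += collectText(child);
        }
    }
    return result;
}
// On <div><div>...吹</div><div>き...</div></div> this yields '...吹\nき...',
// so a scan starting at 吹 stops at the inserted '\n' and never reaches き.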

@siikamiika
Collaborator

It could also make sense to add an option for this, so that line break autodetection can be toggled on and off, letting users disable the detection on Google Docs or other problematic sites. Most of the time you will be scanning regular paragraphs. The default could also go the other way, because most of the issues caused by overly eager scanning are related to sentence extraction, which is probably used less than plain scanning. One possible shape for such an option is sketched below.
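Illustrative only; the actual options schema in this project may differ:

// Hypothetical setting, not an existing option in this project.
const scanningOptions = {
    // When false, element boundaries do not insert '\n' into the scanned
    // text, restoring the old behavior on sites like Google Docs.
    detectLineBreaks: true
};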

@toasted-nutbread toasted-nutbread marked this pull request as draft April 12, 2020 21:46
@toasted-nutbread toasted-nutbread mentioned this pull request Apr 17, 2020
@toasted-nutbread
Collaborator Author

Replaced by #536.
