
Sentence scanning updates #286

Closed

Conversation

toasted-nutbread
Collaborator

Prompted by some recent comments about sentence scanning, I have tried to make it more consistent. The changes are mostly to TextSourceRange, and the sentence extraction feature is largely unchanged. This change also includes some refactoring to reduce redundant code.

A few things to note:

  • Newlines in Text nodes are now only treated as newlines if the CSS white-space mode lets newline characters break text (see the sketch after this list).
  • Element boundaries that cause line breaks now add newline characters to the text. This makes sentence detection stop more reliably when it encounters a newline, since the newlines in the returned text are more accurate.
  • This does change how sentence scanning behaves. IMO it is likely to feel more consistent, but there are potentially some situations where the old method would return more text than the new one.
  • This also "fixes" a potential issue where text split across lines was scanned as a single term because there was no separating text. For example:
    <div><div>小ぢん</div>まり</div>
    <div>小ぢん<div>まり</div></div>
    Scanning starting at 小 would always detect the full word 小ぢんまり despite there being a line break. This example is contrived, but I've seen this happen with two text nodes that only incidentally form a larger phrase.
  • This change also includes an optimization: sentence scanning used to scan docSentenceExtract.extent three times, when only twice (once forward and once backward) should be necessary.
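To make the white-space point above concrete, here is a minimal sketch, not the PR's actual TextSourceRange code, of how newline handling can be keyed off the computed white-space mode (the function name is hypothetical):

// Hypothetical helper: should newline characters in this Text node be
// treated as line breaks? Under 'normal' and 'nowrap', CSS collapses
// newlines into spaces; the pre-style modes preserve them.
function newlinesBreakText(textNode) {
    const element = textNode.parentElement;
    if (element === null) { return false; }
    switch (window.getComputedStyle(element).whiteSpace) {
        case 'pre':
        case 'pre-wrap':
        case 'pre-line':
        case 'break-spaces':
            return true;  // newlines are preserved and break the line
        default:
            return false; // newlines collapse into spaces
    }
}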

There is potentially another improvement that can be made to sentence extraction. Currently the algorithm terminates a sentence at newline characters, which leads to partial phrases being detected at <br> elements or at line breaks in elements with white-space set to pre, pre-wrap, pre-line, or break-spaces. This could be improved by only terminating when there are two sequential newline characters, combined with treating element boundaries as '\n\n' rather than just '\n'. I have not yet made this change, as I first wanted to get the base changes out of the way. A sketch of the idea follows.
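As a hedged sketch of that follow-up idea, assuming element boundaries have already been encoded as '\n\n' in the extracted text (all names are illustrative, not the project's API):

// Illustrative only: terminate sentence extraction at a blank line
// ('\n\n') rather than at every single '\n'.
function findSentenceTerminator(text, start) {
    for (let i = start; i + 1 < text.length; ++i) {
        if (text[i] === '\n' && text[i + 1] === '\n') {
            return i; // paragraph-level break: stop the sentence here
        }
    }
    return text.length; // no hard break found
}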

This addresses #221, #273, and #276.

@toasted-nutbread
Collaborator Author

38ff0d8 resolves #134.

@siikamiika
Collaborator

I tested this a bit, and it looks pretty cool. One improvement is that you can now scan the readings on the kanji info page. I can review the code some day next week.

@siikamiika siikamiika self-requested a review December 1, 2019 04:52
Review comment on ext/fg/js/source.js (outdated, resolved)
@siikamiika
Collaborator

Apart from the Google Docs bug, I didn't notice any issues. ed19a8b looked like it could have an off-by-one bug, but after closer inspection it seems to work the same.

@toasted-nutbread
Collaborator Author

Due to how Google Docs works, lines are effectively formatted as follows:

<div>... This is a single sentence</div><div>which word wraps.</div>

In English, this works (mostly) fine since there are whitespace/newline word boundaries. However, this causes issues in Japanese since characters can wrap mid-word. Take for example:

<div>...吹</div><div>き...</div>

The previous algorithm handled this "correctly", in that the text would be detected as a single word. The updated algorithm inserts a \n character between 吹 and き, which prevents the word scanner from picking it up.

The issue is related both to the HTML markup and to how we decide to treat the visual line breaks produced by the HTML presentation. For example, despite being presented identically to the example above, the following HTML would not detect 吹き as a single word, neither with the old algorithm nor with the updated one:

<div>...吹</div>
<div>き...</div>

This looks to be more complicated than I had originally thought.
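For concreteness, here is a simplified sketch, not the PR's actual code, of collecting text while inserting '\n' at block element boundaries, which is what separates 吹 and き above:

// Hypothetical illustration of the updated boundary handling.
function collectText(root) {
    let result = '';
    for (const child of root.childNodes) {
        if (child.nodeType === Node.TEXT_NODE) {
            result += child.nodeValue;
        } else if (child.nodeType === Node.ELEMENT_NODE) {
            const display = window.getComputedStyle(child).display;
            if (display === 'block' && result.length > 0) {
                result += '\n'; // element boundary causes a line break
            }
            result += collectText(child);
        }
    }
    return result;
}
// On <div><div>...吹</div><div>き...</div></div> this yields '...吹\nき...',
// so a scan starting at 吹 stops at the inserted '\n' and never reaches き.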

@siikamiika
Collaborator

It could also make sense to add an option for this, so that line break autodetection can be toggled on and off, letting users disable the detection on Google Docs or other problematic sites. Most of the time you will be scanning regular paragraphs. The default could also go the other way, because most of the issues caused by overly eager scanning are related to sentence extraction, which is probably used less than plain scanning. One possible shape for such an option is sketched below.
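Illustrative only; the actual options schema in this project may differ:

// Hypothetical setting, not an existing option in this project.
const scanningOptions = {
    // When false, element boundaries do not insert '\n' into the scanned
    // text, restoring the old behavior on sites like Google Docs.
    detectLineBreaks: true
};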

@toasted-nutbread toasted-nutbread marked this pull request as draft April 12, 2020 21:46
@toasted-nutbread toasted-nutbread mentioned this pull request Apr 17, 2020
@toasted-nutbread
Collaborator Author

Replaced by #536.
