Sentence scanning updates #286
Conversation
Force-pushed from 5077ece to 48259fb.
I tested this a bit, and it looks pretty cool. One improvement is that you can now scan the readings on the kanji info page. I can review the code some day next week.
Apart from the Google Docs bug, I didn't notice any issues. ed19a8b looked like it could have an off-by-one bug, but after closer inspection it seems to work the same.
Force-pushed from 48259fb to a50fb2c.
Due to how Google Docs works, lines are effectively formatted as follows:

<div>... This is a single sentence</div><div>which word wraps.</div>

In English, this works (mostly) fine since there are whitespace/newline word boundaries. However, this causes issues in Japanese, since characters can wrap mid-word. Take for example:

<div>...吹</div><div>き</div>

The previous algorithm handled this "correctly", in that the text would be detected as a single word. The updated algorithm inserts a newline at the element boundary, which splits the word. The issue is related to both the HTML markup and how we decide to treat visual line breaks due to the HTML presentation. For example, despite being presented identically to the example above, the following HTML would not detect the text as a single word:

<div>...吹</div>
<div>き</div>

This looks to be more complicated than I had originally thought.
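To make the markup-versus-presentation distinction concrete, here is a minimal sketch of one way to decide whether the gap between two sibling elements should act as a word boundary: only when the DOM actually contains whitespace between them. This is not the extension's code, and the helper name isWordBoundaryBetween is hypothetical.

```javascript
// Hypothetical helper: report whether the DOM contains whitespace between two
// sibling elements, i.e. whether the markup itself separates the words.
// Google Docs emits <div>...吹</div><div>き</div> with nothing between the
// divs, so this returns false and the word could be kept joined.
function isWordBoundaryBetween(firstElement, secondElement) {
    for (let node = firstElement.nextSibling; node !== null && node !== secondElement; node = node.nextSibling) {
        if (node.nodeType === Node.TEXT_NODE && /\s/.test(node.nodeValue)) {
            return true; // Whitespace text node in the markup: treat as a boundary
        }
    }
    return false; // No whitespace between the elements: do not split the word
}
```

This only consults the markup; basing the decision on the rendered presentation instead is what makes the problem harder, as described above.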
It could make sense to add an option for this as well, so that you could toggle line break autodetection on and off, and users could disable the detection on Google Docs or other problematic sites. Most of the time you will be scanning regular paragraphs. The option could also default the other way around, because most of the issues caused by too-eager scanning are related to sentence extraction, which is probably less used than just scanning.
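As a rough sketch of what such a toggle could look like (the option name detectLineBreaks and the surrounding structure are invented for illustration, not Yomichan's actual options schema):

```javascript
// Hypothetical scanning option; the default could go either way, as noted above.
const scanningOptions = {
    detectLineBreaks: true // Users could disable this on problematic sites such as Google Docs
};

// Hypothetical point of use: only treat a visual line break at an element
// boundary as a newline when the option is enabled.
function getBoundarySeparator(options, boundaryIsVisualLineBreak) {
    return (options.detectLineBreaks && boundaryIsVisualLineBreak) ? '\n' : '';
}
```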
Force-pushed from dea9c1a to 0d8f9ce.
Force-pushed from 0d8f9ce to 8b5cfca.
Force-pushed from 8b5cfca to 57e9526.
Force-pushed from 17eaf1b to 72936e2.
Replaced by #536.
Due to some recent comments related to sentence scanning, I have tried to make sentence scanning more consistent. The changes are mostly made to TextSourceRange, and the sentence extraction feature is mostly the same. This change also includes some refactoring to reduce redundant code. A few things to note:

- Text nodes are now only treated as newlines if the CSS white-space mode lets newline characters break text. Previously, scanning 小 would always detect the full word 小ぢんまり despite there being a line break. This example is contrived, but I've seen this happen to two text nodes which only incidentally form a larger phrase. (See the sketch after this list.)
- docSentenceExtract previously scanned the extent 3x instead of 2x (only once forward and once backward should be necessary).
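As a rough illustration of the first note above (a sketch, not the actual TextSourceRange code), the check amounts to looking at the computed white-space of the text node's parent and only treating newline characters as line breaks when that mode preserves them:

```javascript
// Sketch: newline characters in a text node only act as line breaks when the
// effective white-space mode preserves newlines; under 'normal' and 'nowrap'
// they collapse into ordinary spaces and should not break scanning.
const NEWLINE_PRESERVING_MODES = new Set(['pre', 'pre-wrap', 'pre-line', 'break-spaces']);

function textNodeNewlinesBreakText(textNode) {
    const element = textNode.parentElement;
    if (element === null) { return false; }
    const whiteSpace = window.getComputedStyle(element).whiteSpace;
    return NEWLINE_PRESERVING_MODES.has(whiteSpace);
}
```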
There is potentially another improvement that could be made to sentence extraction. Currently the algorithm will terminate a sentence at newline characters, which leads to partial phrases being detected at <br> elements or at line breaks in elements with white-space set to pre, pre-wrap, pre-line, or break-spaces. This could be improved by only terminating when there are two sequential newline characters, combined with forcing element boundaries to be treated as '\n\n' rather than just '\n'. I have not yet made this change, as I first wanted to get the base changes out of the way.

This addresses #221, #273, #276.