Add new structured content features: lists and the HTML `lang` attribute #2129

stephenmk · 2022-05-13T20:16:48Z

Structured Content `lang` Attribute

Authors of dictionaries for Yomichan may want to include content from various languages, so I think this is a valuable feature. JMDict, for example, includes Japanese loanword source-words from over 60 different languages.

[Screenshot: characters 直次茶冷 displayed in structured content glosses]

⚠ The current version of Yomichan seems to apply lang="ja" attributes to standard glosses which contain Japanese characters, but not to structured content glosses. By default, most browsers will render the characters "直次茶冷" in simplified Chinese.
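As a rough illustration of what the new attribute is meant to achieve, here is a minimal sketch of a `lang` value on a structured content node carrying through to rendered markup. The `renderNode` and `escapeHtml` functions and the string-based output are hypothetical; Yomichan builds DOM nodes, and its real generator is not shown in this thread.

```javascript
// Hypothetical renderer: turns a structured content node into an HTML string.
// Only `tag`, `content`, and `lang` are handled; this is not Yomichan's code.
function escapeHtml(text) {
    return text
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/"/g, '&quot;');
}

function renderNode(node) {
    if (typeof node === 'string') { return escapeHtml(node); }
    if (Array.isArray(node)) { return node.map(renderNode).join(''); }
    // Emit a lang attribute only when the node declares one.
    const langAttr = typeof node.lang === 'string' ? ` lang="${escapeHtml(node.lang)}"` : '';
    const inner = typeof node.content === 'undefined' ? '' : renderNode(node.content);
    return `<${node.tag}${langAttr}>${inner}</${node.tag}>`;
}

const gloss = {tag: 'span', lang: 'ja', content: '直次茶冷'};
console.log(renderNode(gloss));
// prints: <span lang="ja">直次茶冷</span>
```

With the attribute present, the browser can pick Japanese glyph variants for the span instead of falling back to its default CJK font.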

Structured Content Lists

The current way in which I have inserted supplemental information into JMdict glossaries (see: #1165) is a little awkward. When the "compact glossaries" option is enabled, all of the information is grouped and compacted together. It will be easier for the user to parse the information if the different types of information are broken into separate sections. My idea is to break them up into separate unordered lists, each with its own list-style-type.
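The idea of one list per section can be sketched as follows. The `makeList` helper is hypothetical (the actual generator lives in a yomichan-import branch that is not shown here); the node shape (`tag`, `content`, `data`, `lang`, `style`) follows the term bank entries used elsewhere in this PR.

```javascript
// Hypothetical helper: builds one "ul" node per section of a glossary.
function makeList(sectionName, items, listStyleType, lang) {
    const listItems = items.map((item) => ({tag: 'li', content: item}));
    return {
        tag: 'ul',
        // A single item is stored as an object, several items as an array.
        content: listItems.length === 1 ? listItems[0] : listItems,
        data: {content: sectionName},
        lang,
        style: {listStyleType}
    };
}

const glossary = {
    type: 'structured-content',
    content: [
        makeList('notes', ['now mostly used in idioms'], "'📝 '", 'en'),
        makeList('glossary', ['to count', 'to estimate'], 'circle', 'en')
    ]
};
console.log(JSON.stringify(glossary, null, 2));
```

Each section gets its own marker through `listStyleType`, so the sections stay visually distinct even when the surrounding layout is compacted.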

[Screenshots: 読む; 読む (compact glossaries); 元 ("compact glossaries" and "group related terms" mode); アルバイト; 欠席; ちりも積もれば山となる]

I've included tests for these new structured content features in the file test/data/dictionaries/valid-dictionary1/term_bank_1.json.

Additionally, here is a new version of JMdict for Yomichan which uses the new features.
jmdict_english_info_glosses_2022_05_13.zip

This version takes over 30 minutes to validate during the import process, so it is probably not viable for distribution unless Yomichan's validation procedure is optimized. Glosses that do not contain supplemental information are not inserted into structured content containers (they are formatted identically to the current production version of the dictionary), so I don't think there are any additional optimizations I can make to the dictionary without cutting content.

Here is a version of the new JMdict dictionary that does not contain external reference notes (i.e. notes that indicate when an entry is referenced by another entry). On my PC it takes about 10 minutes to validate.
jmdict_english_info_glosses_no_ext_xrefs_2022_05_13.zip

I'm open to suggestions on how to improve the appearance of the new JMdict dictionary. I still need to clean up the code in my branch of yomichan-import a bit, but I think I'm out of ideas for additional features to add.

A full list of supported style types is documented here: https://developer.mozilla.org/en-US/docs/Web/CSS/list-style-type

There's nothing in this code preventing a term bank from assigning, for example, a `list-style-type` style to a `div` element, but it doesn't seem like browsers will complain about things like that.

Support added for the following node types: "ruby", "rt", "rp", "table", "thead", "tbody", "tfoot", "tr", "td", "th", "span", "div", "ol", "ul", "li", "a". I couldn't get it to work for the alt-hover text on "img" tags. Tests are included in the file "test/data/dictionaries/valid-dictionary/term_bank_1.json".

toasted-nutbread · 2022-05-13T21:16:32Z

ext/data/schemas/dictionary-term-bank-v3-schema.json

+                                },
+                                "lang": {
+                                    "type": "string",
+                                    "description": "Defines the language of an element in the format defined by RFC 5646"


Nitpick: add period to the end of description value (other descriptions are punctuated).

Apply to other locations in the file also.
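One way to find the other locations is a small script that walks the schema and reports every `description` lacking a final period. This is only a sketch; `findUnpunctuatedDescriptions` is a hypothetical helper, and the object below is a stand-in fragment rather than the real schema file.

```javascript
// Recursively collects the path of every "description" string in a JSON
// schema object that does not end with a period.
function findUnpunctuatedDescriptions(schema, path = '') {
    const results = [];
    if (schema === null || typeof schema !== 'object') { return results; }
    for (const [key, value] of Object.entries(schema)) {
        const childPath = path ? `${path}.${key}` : key;
        if (key === 'description' && typeof value === 'string') {
            if (!value.trimEnd().endsWith('.')) { results.push(childPath); }
        } else {
            results.push(...findUnpunctuatedDescriptions(value, childPath));
        }
    }
    return results;
}

// Stand-in fragment; the second description is invented for illustration.
const fragment = {
    properties: {
        lang: {
            type: 'string',
            description: 'Defines the language of an element in the format defined by RFC 5646'
        },
        tag: {type: 'string', description: 'An example description that is punctuated.'}
    }
};
console.log(findUnpunctuatedDescriptions(fragment));
// prints: [ 'properties.lang.description' ]
```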

toasted-nutbread · 2022-05-13T21:33:10Z

Looks good, just a few comments on CSS:

The way that Anki cards have styling information generated for structured content is a special case which needs to be handled.

Modify dev/data/structured-content-overrides.css to create overrides for new additions to the ext/css/structured-content.css file. You'll want to hardcode var() values (just use its computed value) and to flag the :root... properties using /* remove-rule */, since they aren't necessary for Anki.

You'll have to run this to update the ext/data/structured-content-style.json file:
node dev/generate-css-json.js
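To make "hardcode var() values" concrete, here is a minimal sketch of substituting known values for `var()` references in a declaration. The `resolveVars` function, the variable name, and its value are all hypothetical; in the overrides file the substitution is done by hand, and this is not how dev/generate-css-json.js works.

```javascript
// Replaces var(--name) or var(--name, fallback) references with known values.
// Nested var() calls inside fallbacks are not handled in this sketch.
function resolveVars(value, variables) {
    const pattern = /var\(\s*(--[\w-]+)\s*(?:,\s*([^)]*))?\)/g;
    return value.replace(pattern, (match, name, fallback) => {
        if (Object.prototype.hasOwnProperty.call(variables, name)) { return variables[name]; }
        // Unknown variable: use the fallback if present, else leave untouched.
        return typeof fallback === 'string' ? fallback.trim() : match;
    });
}

// Hypothetical variable and value, for illustration only.
const variables = {'--example-spacing': '1.4em'};
console.log(resolveVars('padding-left: var(--example-spacing);', variables));
// prints: padding-left: 1.4em;
```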

toasted-nutbread · 2022-05-13T22:18:13Z

> ⚠ The current version of Yomichan seems to apply lang="ja" attributes to standard glosses which contain Japanese characters, but not to structured content glosses. By default, most browsers will render the characters "直次茶冷" in simplified Chinese.

Out of scope for this PR, but this may be something that's worth updating at some point also, since this seems like an oversight on my part. (not saying this is something you should do)

toasted-nutbread · 2022-05-13T22:40:31Z

> This version takes over 30 minutes to validate during the import process, so it is probably not viable for distribution unless Yomichan's validation procedure is optimized. Glosses that do not contain supplemental information are not inserted into structured content containers (they are formatted identically to the current production version of the dictionary), so I don't think there are any additional optimizations I can make to the dictionary without cutting content.

And to this point, yes I agree that the validation is unfortunately slow. This is partially due to how non-specific JSON schemas are technically allowed to be, and I made a few optimizations a while back, but I should revisit this.

It's also potentially a motivation for having a different way to represent the type of data mentioned in #1165. While structured content can accomplish it, it is by no means the optimal way of doing it. Compare something like:

Current

[
  "読む",
  "よむ",
  "5 v5m vt",
  "v5",
  1440001,
  [
    {
      "content": [
        {
          "content": {
            "content": "now mostly used in idioms",
            "tag": "li"
          },
          "data": {
            "content": "notes"
          },
          "lang": "ja",
          "style": {
            "listStyleType": "'📝 '"
          },
          "tag": "ul"
        },
        {
          "content": [
            {
              "content": "to count",
              "tag": "li"
            },
            {
              "content": "to estimate",
              "tag": "li"
            }
          ],
          "data": {
            "content": "glossary"
          },
          "lang": "en",
          "style": {
            "listStyleType": "circle"
          },
          "tag": "ul"
        },
        {
          "content": {
            "content": [
              "see: ",
              {
                "content": "さばを読む",
                "href": "?query=さばを読む\u0026wildcards=off",
                "lang": "ja",
                "tag": "a"
              },
              {
                "content": " to manipulate figures to one's advantage; to count wrongly on purpose; to inflate or deflate one's age",
                "data": {
                  "content": "refGlosses"
                },
                "style": {
                  "fontSize": "x-small",
                  "verticalAlign": "middle"
                },
                "tag": "span"
              }
            ],
            "tag": "li"
          },
          "data": {
            "content": "references"
          },
          "lang": "en",
          "style": {
            "listStyleType": "'➡ '"
          },
          "tag": "ul"
        }
      ],
      "type": "structured-content"
    }
  ],
  1456360,
  "P ichi news6k"
]

Minimized

[
  "読む",
  "よむ",
  "5 v5m vt",
  "v5",
  1440001,
  [
    {
      "content": "now mostly used in idioms",
      "type": "note"
    },
    "to count",
    "to estimate",
    {
      "content": "さばを読む",
      "brief": "to manipulate figures to one's advantage; to count wrongly on purpose; to inflate or deflate one's age", // optional
      "href": "?query=さばを読む\u0026wildcards=off", // optional
      "type": "references"
    }
  ],
  1456360,
  "P ichi news6k"
]
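To show that the two forms carry the same information, here is a sketch that expands the proposed minimized glosses into list-based structured content like the "Current" example. Everything here is hypothetical: the `note` and `references` types exist only in the proposal above, and `expandGlosses` is not part of Yomichan.

```javascript
// Hypothetical expansion of minimized glosses into structured content lists.
function list(section, items, listStyleType) {
    return {tag: 'ul', data: {content: section}, style: {listStyleType}, content: items};
}

function expandGlosses(glosses) {
    const notes = [];
    const plain = [];
    const references = [];
    for (const gloss of glosses) {
        if (typeof gloss === 'string') {
            plain.push({tag: 'li', content: gloss});
        } else if (gloss.type === 'note') {
            notes.push({tag: 'li', content: gloss.content});
        } else if (gloss.type === 'references') {
            // href and brief are optional in the proposal.
            const link = typeof gloss.href === 'string' ?
                {tag: 'a', href: gloss.href, lang: 'ja', content: gloss.content} :
                {tag: 'span', lang: 'ja', content: gloss.content};
            const parts = ['see: ', link];
            if (typeof gloss.brief === 'string') {
                parts.push({tag: 'span', data: {content: 'refGlosses'}, content: ` ${gloss.brief}`});
            }
            references.push({tag: 'li', content: parts});
        }
    }
    const content = [];
    if (notes.length > 0) { content.push(list('notes', notes, "'📝 '")); }
    if (plain.length > 0) { content.push(list('glossary', plain, 'circle')); }
    if (references.length > 0) { content.push(list('references', references, "'➡ '")); }
    return {type: 'structured-content', content};
}

const expanded = expandGlosses([
    {type: 'note', content: 'now mostly used in idioms'},
    'to count',
    'to estimate',
    {type: 'references', content: 'さばを読む', brief: 'to manipulate figures to one\'s advantage'}
]);
console.log(expanded.content.map((node) => node.data.content));
// prints: [ 'notes', 'glossary', 'references' ]
```

In this arrangement the formatting decisions live in the application rather than in every dictionary entry, which is where both the savings and the coupling come from.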

For comparison, I also ran the validation on the dictionary you provided and it took roughly 18 minutes:

node dev\dictionary-validate.js jmdict_english_info_glosses_2022_05_13.zip
Validating jmdict_english_info_glosses_2022_05_13.zip...
No issues detected (1102.21s)

see: #2129

stephenmk · 2022-05-13T23:03:33Z

Thanks for the detailed feedback. The files have now been updated.

> You'll want to hardcode var() values (just use its computed value) and to flag the :root... properties using /* remove-rule */, since they aren't necessary for Anki.

I added the --padding-left rule to the structured content lists so that they would align with the regular glosses within the Yomichan popup, but it doesn't seem that the regular glosses have this padding when exported to Anki. So I think the correct move is to just drop the rule (rather than hardcode a value). If that doesn't work for some reason I didn't think of, it's no problem to update it again. I made a card to test and it seemed to come out alright.

[Screenshot: 読む glossary in Anki]

toasted-nutbread · 2022-05-13T23:12:52Z

dev/data/structured-content-overrides.css

+.gloss-sc-ul {
+    /* remove-property padding-left */
+}
+:root[data-glossary-layout-mode=compact] .gloss-sc-ul[data-sc-content=glossary] {


Group these together for simplicity:

:root[data-glossary-layout-mode=compact] .gloss-sc-ul[data-sc-content=glossary],
:root[data-glossary-layout-mode=compact] .gloss-sc-ul[data-sc-content=glossary] .gloss-sc-li,
:root[data-glossary-layout-mode=compact] .gloss-sc-ul[data-sc-content=glossary] .gloss-sc-li:not(:first-child)::before {
    /* remove-rule */
}

Ha, that's actually what I tried first. But the tests do not like that. I think that only works if they are defined that way (as a group) in the ext/css/structured-content.css file. But the rules are all different there, so they can't be combined.

Running test-css-json.js...
Error: Could not find rule with matching selectors
    at generateRules (/home/stephen/Code/yomichan/dev/css-to-json-util.js:139:48)
    at main (/home/stephen/Code/yomichan/test/test-css-json.js:28:42)
    at testMain (/home/stephen/Code/yomichan/dev/util.js:127:15)
    at Object.<anonymous> (/home/stephen/Code/yomichan/test/test-css-json.js:35:5)
    at Module._compile (node:internal/modules/cjs/loader:1105:14)
    at Module._extensions..js (node:internal/modules/cjs/loader:1159:10)
    at Module.load (node:internal/modules/cjs/loader:981:32)
    at Module._load (node:internal/modules/cjs/loader:827:12)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:77:12)
    at node:internal/main/run_main_module:17:47

Hmm, I thought I handled that differently, but I guess not; my bad! (I should really be more familiar with my own code)
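The failure is consistent with overrides being matched against a rule's whole selector list. The sketch below assumes exact selector-list matching, which is a guess based on the error message and not the actual logic in dev/css-to-json-util.js; `findRule` is a hypothetical name.

```javascript
// Assumed behaviour: an override applies only to a rule whose selector list
// is identical, selector by selector. Not the real implementation.
function findRule(rules, selectors) {
    return rules.find((rule) =>
        rule.selectors.length === selectors.length &&
        rule.selectors.every((selector, i) => selector === selectors[i])
    );
}

const prefix = ':root[data-glossary-layout-mode=compact] .gloss-sc-ul[data-sc-content=glossary]';
// Three separate rules, as described for ext/css/structured-content.css.
// The style objects are left empty because their contents are not shown here.
const rules = [
    {selectors: [prefix], styles: {}},
    {selectors: [`${prefix} .gloss-sc-li`], styles: {}},
    {selectors: [`${prefix} .gloss-sc-li:not(:first-child)::before`], styles: {}}
];

// A grouped override lists all three selectors at once.
const grouped = rules.map((rule) => rule.selectors[0]);
console.log(findRule(rules, grouped) !== undefined); // prints: false
console.log(findRule(rules, [prefix]) !== undefined); // prints: true
```

Under that assumption, a grouped override only matches if the source stylesheet declares the same group, so three separate override rules are needed here.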

toasted-nutbread · 2022-05-14T15:02:08Z

> Out of scope for this PR, but this may be something that's worth updating at some point also, since this seems like an oversight on my part. (not saying this is something you should do)

#2131

stephenmk · 2022-05-14T17:58:42Z

> It's also potentially a motivation for having a different way to represent the type of data mentioned in #1165. While structured content can accomplish it, it is by no means the optimal way of doing it.

I think this is an interesting idea. It seems simple enough that even I could probably implement it, and it would probably cut back on a lot of validation time. I think a major drawback to the idea (as I understand it) is that it would very tightly couple the dictionary's display logic with Yomichan, which would require [1] Yomichan to be updated whenever major changes occur in the dictionary format and [2] Yomichan to maintain compatibility with all prior dictionary formats (to support users who do not upgrade to the newest version of the dictionary).

If the new data type representations are sufficiently modular, perhaps this would not be an issue. However, I'm not sure we're going to be able to strike a good balance between modularity and conciseness with this JMdict data that would be worth the effort. Here's an example:

Minimized JMdict term bank entry with hypothetical data types

[
  "読む",
  "よむ",
  "5 v5m vt",
  "v5",
  1440001,
  {
    "type": "jmdictglossary",
    "sources": [
      {
        "lang": "zh",
        "language": "Chinese",
        "wasei": false,
        "content": "讀・dú",
        "type": "partial"
      }
    ],
    "notes": [
      "now mostly used in idioms"
    ],
    "glosses": [
      "to count",
      "to estimate"
    ],
    "infoGlosses": [
      {
        "type": "expl",
        "content": "The Yomi in Yomichan comes from 読む"
      },
      {
        "type": "tm",
        "content": "FooSoft® 2022"
      }
    ],
    "references": [
      {
        "kanji": "さばを読む",
        "brief": " to manipulate figures to one's advantage; to count wrongly on purpose; to inflate or deflate one's age",
        "type": "reference"
      },
      {
        "kanji": "欠",
        "reading": "あくび",
        "brief": " 欠 can be read as けつ, あくび, or かけ, so a reading has to be specified",
        "type": "antonym"
      }
    ]
  },
  1456360,
  "P ichi news6k"
]

Note that the entire glossary is contained within a single jmdictglossary typed gloss object. If the different sections are broken up into separate glosses (as in your Minimized example), then each section will be displayed in a separate list item. This wouldn't look very good in the Yomichan popup without some other major style modifications, and compact glossaries mode would group them together in a big mess.

This hypothetical jmdictglossary type isn't modular, is unlikely to be useful to any other dictionary authors, and is likely to require changes in the future. So I think that updating Yomichan to be able to use such a data type would probably end up causing a lot of hassle.

I don't know anything about JSON schema validation, and I haven't looked into how Yomichan does it or what engine it uses. My naive hope was that maybe Yomichan could be updated to use a faster validator (ajv claims to be the fastest) and perhaps that would be enough to solve the problem.

toasted-nutbread · 2022-05-14T21:51:05Z

Yeah, I agree overall that it's difficult to balance modularity, efficiency, and compatibility, which is why I haven't yet taken the initiative to implement something like what I mentioned in #2129 (comment).

> I don't know anything about JSON schema validation, and I haven't looked into how Yomichan does it or what engine it uses.

A custom one I wrote, which supports a limited subset. I can look into doing a comparison vs ajv. The other downside of complex structured content vs native definitions is in the database storage overhead, since all of the formatting needs to be stored.
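The storage point can be made tangible by comparing serialized sizes. This is only a rough proxy, since the database does not necessarily store entries as JSON text; `byteSize` is a hypothetical helper, and the fragments are taken from the "Current" and "Minimized" examples earlier in the thread.

```javascript
// Compares the serialized UTF-8 size of one glossary section in both forms.
function byteSize(value) {
    return new TextEncoder().encode(JSON.stringify(value)).length;
}

// Structured content form: formatting is stored with every entry.
const structured = {
    tag: 'ul',
    lang: 'en',
    data: {content: 'glossary'},
    style: {listStyleType: 'circle'},
    content: [
        {tag: 'li', content: 'to count'},
        {tag: 'li', content: 'to estimate'}
    ]
};
// Native form: only the gloss text is stored.
const minimized = ['to count', 'to estimate'];

console.log(byteSize(structured), byteSize(minimized));
console.log(byteSize(structured) > byteSize(minimized)); // prints: true
```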

stephenmk added 3 commits April 29, 2022 09:32

Add styles for structured content lists

5d57f58

toasted-nutbread reviewed May 13, 2022

Add override rules for new structured-content list styles

51f6189

see: #2129

toasted-nutbread reviewed May 13, 2022

toasted-nutbread merged commit 6a74746 into FooSoft:master May 14, 2022

This was referenced May 14, 2022

Structured content doesn't apply lang=ja #2130

Closed

Structured content auto language #2131

Merged

This was referenced May 17, 2022

Dictionary validate updates #2137

Merged

JSON schema validation #2138

Open
