This repository has been archived by the owner on Feb 25, 2023. It is now read-only.

Add new structured content features: lists and the HTML lang attribute #2129

Merged 4 commits on May 14, 2022


stephenmk
Contributor

Structured Content lang Attribute

Authors of dictionaries for Yomichan may want to include content from various languages, so I think this is a valuable feature. JMDict, for example, includes Japanese loanword source-words from over 60 different languages.

Characters 直次茶冷 displayed in structured content glosses

⚠ The current version of Yomichan seems to apply lang="ja" attributes to standard glosses which contain Japanese characters, but not to structured content glosses. By default, most browsers will render the characters "直次茶冷" in simplified Chinese.
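
As a minimal sketch of what the proposed attribute does, the snippet below turns a structured content node into an HTML string and copies its `lang` value onto the element. `renderNode` and `escapeHtml` are hypothetical helpers written for illustration only; they are not Yomichan's renderer.

```javascript
// Hypothetical helpers for illustration; this is not Yomichan's rendering code.
function escapeHtml(text) {
    return text
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/"/g, '&quot;');
}

// Renders a string, an array of nodes, or a {tag, lang, content} object.
function renderNode(node) {
    if (typeof node === 'string') { return escapeHtml(node); }
    if (Array.isArray(node)) { return node.map(renderNode).join(''); }
    // Only emit the lang attribute when the node declares one.
    const langAttr = typeof node.lang === 'string' ? ` lang="${escapeHtml(node.lang)}"` : '';
    const inner = typeof node.content === 'undefined' ? '' : renderNode(node.content);
    return `<${node.tag}${langAttr}>${inner}</${node.tag}>`;
}

console.log(renderNode({tag: 'span', lang: 'ja', content: '直次茶冷'}));
// prints: <span lang="ja">直次茶冷</span>
```

With the attribute present, the browser can pick Japanese glyph variants for the characters instead of falling back to its default CJK font.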

Structured Content Lists

The current way in which I have inserted supplemental information into JMdict glossaries (see: #1165) is a little awkward. When the "compact glossaries" option is enabled, all of the information is grouped and compacted together. It will be easier for the user to parse the information if the different types of information are broken into separate sections. My idea is to break them up into separate unordered lists, each with its own list-style-type.
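
A rough sketch of how a dictionary generator might build these sections follows. The node shape (`tag`, `lang`, `data`, `style.listStyleType`, `content`) mirrors the term bank JSON quoted later in this thread; the `makeList` helper and its parameter names are assumptions made for illustration.

```javascript
// Hypothetical generator helper; the function name and signature are illustrative.
function makeList(sectionName, listStyleType, lang, items) {
    return {
        tag: 'ul',
        lang,
        data: {content: sectionName}, // surfaces as data-sc-content in the popup
        style: {listStyleType},
        content: items.map((item) => ({tag: 'li', content: item}))
    };
}

// One structured content gloss with a separate list per kind of information.
const gloss = {
    type: 'structured-content',
    content: [
        makeList('notes', "'📝 '", 'en', ['now mostly used in idioms']),
        makeList('glossary', 'circle', 'en', ['to count', 'to estimate'])
    ]
};

console.log(JSON.stringify(gloss, null, 2));
```

This helper always emits an array of `li` items, even for a single item, which keeps the generator simple at the cost of a few extra bytes per entry.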

Screenshots (images not preserved):
読む
読む (compact glossaries)
元 ("compact glossaries" and "group related terms" mode)
アルバイト
欠席
ちりも積もれば山となる

I've included tests for these new structured content features in the file test/data/dictionaries/valid-dictionary1/term_bank_1.json.

Additionally, here is a new version of JMdict for Yomichan which uses the new features.
jmdict_english_info_glosses_2022_05_13.zip

This version takes over 30 minutes to validate during the import process, so it is probably not viable for distribution unless Yomichan's validation procedure is optimized. Glosses that do not contain supplemental information are not inserted into structured content containers (they are formatted identically to the current production version of the dictionary), so I don't think there are any additional optimizations I can make to the dictionary without cutting content.

Here is a version of the new JMdict dictionary that does not contain external reference notes (i.e. notes that indicate when an entry is referenced by another entry). On my PC it takes about 10 minutes to validate.
jmdict_english_info_glosses_no_ext_xrefs_2022_05_13.zip

I'm open to suggestions on how to improve the appearance of the new JMdict dictionary. I still need to clean up the code in my branch of yomichan-import a bit, but I think I'm out of ideas for additional features to add.

A full list of supported style types is documented here:
https://developer.mozilla.org/en-US/docs/Web/CSS/list-style-type

There's nothing in this code preventing a term bank from assigning, for example, a `list-style-type` style to a `div` element, but it doesn't seem like browsers will complain about things like that.

Support added for the following node types:

"ruby", "rt", "rp", "table", "thead", "tbody", "tfoot", "tr", "td",
"th", "span", "div", "ol", "ul", "li", "a"

I couldn't get it to work for the alt-hover text on "img" tags.

Tests are included in the file
"test/data/dictionaries/valid-dictionary/term_bank_1.json"
},
"lang": {
"type": "string",
"description": "Defines the language of an element in the format defined by RFC 5646"

Collaborator

Nitpick: add period to the end of description value (other descriptions are punctuated).

Apply to other locations in the file also.
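
For dictionary authors who want to sanity-check `lang` values before import, one possible approach is to lean on the built-in `Intl.getCanonicalLocales`, which throws a `RangeError` for tags that are not structurally valid. This is only an approximation of RFC 5646 (the ECMAScript check follows Unicode locale identifier rules, so a few legacy tags may be treated differently), and `isWellFormedLanguageTag` is a hypothetical name, not part of Yomichan or its schema validator.

```javascript
// Hypothetical pre-import check; Yomichan's schema only requires "lang" to be a string.
function isWellFormedLanguageTag(tag) {
    if (typeof tag !== 'string') { return false; }
    try {
        // Throws RangeError when the tag is not structurally valid.
        Intl.getCanonicalLocales(tag);
        return true;
    } catch (e) {
        if (e instanceof RangeError) { return false; }
        throw e;
    }
}

for (const tag of ['ja', 'zh-Hans', 'en_US', '']) {
    console.log(JSON.stringify(tag), isWellFormedLanguageTag(tag));
}
```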

@toasted-nutbread
Collaborator

Looks good, just a few comments on CSS:

The way that Anki cards have styling information generated for structured content is a special case which needs to be handled.

  1. Modify dev/data/structured-content-overrides.css to create overrides for new additions to the ext/css/structured-content.css file. You'll want to hardcode var() values (just use its computed value) and to flag the :root... properties using
    /* remove-rule */, since they aren't necessary for Anki.
  2. You'll have to run this to update the ext/data/structured-content-style.json file:
    node dev/generate-css-json.js

@toasted-nutbread
Collaborator

⚠ The current version of Yomichan seems to apply lang="ja" attributes to standard glosses which contain Japanese characters, but not to structured content glosses. By default, most browsers will render the characters "直次茶冷" in simplified Chinese.

Out of scope for this PR, but this may be something that's worth updating at some point also, since this seems like an oversight on my part. (not saying this is something you should do)

@toasted-nutbread
Collaborator

This version takes over 30 minutes to validate during the import process, so it is probably not viable for distribution unless Yomichan's validation procedure is optimized. Glosses that do not contain supplemental information are not inserted into structured content containers (they are formatted identically to the current production version of the dictionary), so I don't think there are any additional optimizations I can make to the dictionary without cutting content.

And to this point, yes I agree that the validation is unfortunately slow. This is partially due to how non-specific JSON schemas are technically allowed to be, and I made a few optimizations a while back, but I should revisit this.

It's also potentially a motivation for having a different way to represent the type of data mentioned in #1165. While structured content can accomplish it, it is by no means the optimal way of doing it. Compare something like:

Current
[
  "読む",
  "よむ",
  "5 v5m vt",
  "v5",
  1440001,
  [
    {
      "content": [
        {
          "content": {
            "content": "now mostly used in idioms",
            "tag": "li"
          },
          "data": {
            "content": "notes"
          },
          "lang": "ja",
          "style": {
            "listStyleType": "'📝 '"
          },
          "tag": "ul"
        },
        {
          "content": [
            {
              "content": "to count",
              "tag": "li"
            },
            {
              "content": "to estimate",
              "tag": "li"
            }
          ],
          "data": {
            "content": "glossary"
          },
          "lang": "en",
          "style": {
            "listStyleType": "circle"
          },
          "tag": "ul"
        },
        {
          "content": {
            "content": [
              "see: ",
              {
                "content": "さばを読む",
                "href": "?query=さばを読む\u0026wildcards=off",
                "lang": "ja",
                "tag": "a"
              },
              {
                "content": " to manipulate figures to one's advantage; to count wrongly on purpose; to inflate or deflate one's age",
                "data": {
                  "content": "refGlosses"
                },
                "style": {
                  "fontSize": "x-small",
                  "verticalAlign": "middle"
                },
                "tag": "span"
              }
            ],
            "tag": "li"
          },
          "data": {
            "content": "references"
          },
          "lang": "en",
          "style": {
            "listStyleType": "'➡ '"
          },
          "tag": "ul"
        }
      ],
      "type": "structured-content"
    }
  ],
  1456360,
  "P ichi news6k"
]
Minimized
[
  "読む",
  "よむ",
  "5 v5m vt",
  "v5",
  1440001,
  [
    {
      "content": "now mostly used in idioms",
      "type": "note"
    },
    "to count",
    "to estimate",
    {
      "content": "さばを読む",
      "brief": "to manipulate figures to one's advantage; to count wrongly on purpose; to inflate or deflate one's age", // optional
      "href": "?query=さばを読む\u0026wildcards=off", // optional
      "type": "references"
    }
  ],
  1456360,
  "P ichi news6k"
]
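
To make the comparison concrete, here is a sketch of how a renderer could expand the Minimized form back into the same list structure as the Current form. Everything here is hypothetical: the `note` and `references` gloss types are only a proposal in this comment, and `expandGlossary` is an illustrative name, not an existing Yomichan function.

```javascript
// Hypothetical expansion of the proposed "minimized" glossary into structured content.
function expandGlossary(glossary) {
    const notes = [];
    const glosses = [];
    const references = [];
    for (const item of glossary) {
        if (typeof item === 'string') {
            glosses.push(item);
        } else if (item.type === 'note') {
            notes.push(item.content);
        } else if (item.type === 'references') {
            references.push(item);
        }
    }

    // Builds one <ul> section; each entry of items becomes the content of an <li>.
    const list = (name, listStyleType, items) => ({
        tag: 'ul',
        data: {content: name},
        style: {listStyleType},
        content: items.map((content) => ({tag: 'li', content}))
    });

    const sections = [];
    if (notes.length > 0) { sections.push(list('notes', "'📝 '", notes)); }
    if (glosses.length > 0) { sections.push(list('glossary', 'circle', glosses)); }
    if (references.length > 0) {
        sections.push(list('references', "'➡ '", references.map((ref) => {
            // href is optional in the proposal; fall back to a plain span without it.
            const target = typeof ref.href === 'string' ?
                {tag: 'a', lang: 'ja', href: ref.href, content: ref.content} :
                {tag: 'span', lang: 'ja', content: ref.content};
            const parts = ['see: ', target];
            if (typeof ref.brief === 'string') {
                parts.push({tag: 'span', data: {content: 'refGlosses'}, content: ` ${ref.brief}`});
            }
            return parts;
        })));
    }
    return [{type: 'structured-content', content: sections}];
}

const expanded = expandGlossary([
    {content: 'now mostly used in idioms', type: 'note'},
    'to count',
    'to estimate',
    {content: 'さばを読む', href: '?query=さばを読む&wildcards=off', type: 'references'}
]);
console.log(JSON.stringify(expanded, null, 2));
```

The point of the sketch is that the formatting decisions (list styles, section names) would live in the renderer rather than being repeated in every term bank entry.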

For comparison, I also ran the validation on the dictionary you provided and it took roughly 18 minutes:

node dev\dictionary-validate.js jmdict_english_info_glosses_2022_05_13.zip
Validating jmdict_english_info_glosses_2022_05_13.zip...
No issues detected (1102.21s)

@stephenmk
Contributor Author

Thanks for the detailed feedback. The files have now been updated.

You'll want to hardcode var() values (just use its computed value) and to flag the :root... properties using /* remove-rule */, since they aren't necessary for Anki.

I added the --padding-left rule to the structured content lists so that they would align with the regular glosses within the Yomichan popup, but it doesn't seem that the regular glosses have this padding when exported to Anki. So I think the correct move is to just drop the rule (rather than hardcode a value). If that doesn't work for some reason I didn't think of, it's no problem to update it again. I made a card to test and it seemed to come out alright.

[Screenshot: 読む glossary in Anki]

.gloss-sc-ul {
/* remove-property padding-left */
}
:root[data-glossary-layout-mode=compact] .gloss-sc-ul[data-sc-content=glossary] {

Collaborator

Group these together for simplicity:

:root[data-glossary-layout-mode=compact] .gloss-sc-ul[data-sc-content=glossary],
:root[data-glossary-layout-mode=compact] .gloss-sc-ul[data-sc-content=glossary] .gloss-sc-li,
:root[data-glossary-layout-mode=compact] .gloss-sc-ul[data-sc-content=glossary] .gloss-sc-li:not(:first-child)::before {
    /* remove-rule */
}

Contributor Author

Ha, that's actually what I tried first. But the tests do not like that. I think that only works if they are defined that way (as a group) in the ext/css/structured-content.css file. But the rules are all different there, so they can't be combined.

Running test-css-json.js...
Error: Could not find rule with matching selectors
    at generateRules (/home/stephen/Code/yomichan/dev/css-to-json-util.js:139:48)
    at main (/home/stephen/Code/yomichan/test/test-css-json.js:28:42)
    at testMain (/home/stephen/Code/yomichan/dev/util.js:127:15)
    at Object.<anonymous> (/home/stephen/Code/yomichan/test/test-css-json.js:35:5)
    at Module._compile (node:internal/modules/cjs/loader:1105:14)
    at Module._extensions..js (node:internal/modules/cjs/loader:1159:10)
    at Module.load (node:internal/modules/cjs/loader:981:32)
    at Module._load (node:internal/modules/cjs/loader:827:12)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:77:12)
    at node:internal/main/run_main_module:17:47
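
The behavior guessed at above can be illustrated with a small sketch, assuming that an override is matched against a rule's whole selector list rather than against individual selectors. This is not the code in dev/css-to-json-util.js; `findRule`, `selectorsEqual`, the shortened selectors, and the rule shape are all made up for illustration.

```javascript
// Illustrative only: an assumed matching strategy, not Yomichan's implementation.
function selectorsEqual(a, b) {
    return a.length === b.length && a.every((selector, i) => selector === b[i]);
}

function findRule(rules, selectors) {
    const rule = rules.find((r) => selectorsEqual(r.selectors, selectors));
    if (typeof rule === 'undefined') {
        throw new Error('Could not find rule with matching selectors');
    }
    return rule;
}

// In the source stylesheet, each selector has its own rule with different styles.
const sourceRules = [
    {selectors: ['.gloss-sc-ul[data-sc-content=glossary]'], styles: [['display', 'inline']]},
    {selectors: ['.gloss-sc-ul[data-sc-content=glossary] .gloss-sc-li'], styles: [['display', 'inline']]}
];

// An override written per selector finds its rule.
console.log(findRule(sourceRules, ['.gloss-sc-ul[data-sc-content=glossary]']).selectors.length);

// An override that groups both selectors has no single rule to match.
try {
    findRule(sourceRules, [
        '.gloss-sc-ul[data-sc-content=glossary]',
        '.gloss-sc-ul[data-sc-content=glossary] .gloss-sc-li'
    ]);
} catch (e) {
    console.log(e.message);
}
```

Under that assumption, grouping overrides only works when the source stylesheet groups the same selectors in one rule.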

Collaborator

Hmm, I thought I handled that differently, but I guess not; my bad! (I should really be more familiar with my own code)

@toasted-nutbread merged commit 6a74746 into FooSoft:master on May 14, 2022
@toasted-nutbread
Collaborator

Out of scope for this PR, but this may be something that's worth updating at some point also, since this seems like an oversight on my part. (not saying this is something you should do)

#2131

@stephenmk
Contributor Author

It's also potentially a motivation for having a different way to represent the type of data mentioned in #1165. While structured content can accomplish it, it is by no means the optimal way of doing it.

I think this is an interesting idea. It seems simple enough that even I could probably implement it, and it would probably cut back on a lot of validation time. I think a major drawback to the idea (as I understand it) is that it would very tightly couple the dictionary's display logic with Yomichan, which would require [1] Yomichan to be updated whenever major changes occur in the dictionary format and [2] Yomichan to maintain compatibility with all prior dictionary formats (to support users who do not upgrade to the newest version of the dictionary).

If the new data type representations are sufficiently modular, perhaps this would not be an issue. However, I'm not sure we're going to be able to strike a good balance between modularity and conciseness with this JMdict data that would be worth the effort. Here's an example:

Minimized JMdict term bank entry with hypothetical data types
[
  "読む",
  "よむ",
  "5 v5m vt",
  "v5",
  1440001,
  {
    "type": "jmdictglossary",
    "sources": [
      {
        "lang": "zh",
        "language": "Chinese",
        "wasei": false,
        "content": "讀・dú",
        "type": "partial"
      }
    ],
    "notes": [
      "now mostly used in idioms"
    ],
    "glosses": [
      "to count",
      "to estimate"
    ],
    "infoGlosses": [
      {
        "type": "expl",
        "content": "The Yomi in Yomichan comes from 読む"
      },
      {
        "type": "tm",
        "content": "FooSoft® 2022"
      }
    ],
    "references": [
      {
        "kanji": "さばを読む",
        "brief": " to manipulate figures to one's advantage; to count wrongly on purpose; to inflate or deflate one's age",
        "type": "reference"
      },
      {
        "kanji": "",
        "reading": "あくび",
        "brief": " 欠 can be read as けつ, あくび, or かけ, so a reading has to be specified",
        "type": "antonym"
      }
    ]
  },
  1456360,
  "P ichi news6k"
]

Note that the entire glossary is contained within a single jmdictglossary typed gloss object. If the different sections are broken up into separate glosses (as in your Minimized example), then each section will be displayed in a separate list item. This wouldn't look very good in the Yomichan popup without some other major style modifications, and compact glossaries mode would group them together in a big mess.

This hypothetical jmdictglossary type isn't modular, is unlikely to be useful to any other dictionary authors, and is likely to require changes in the future. So I think that updating Yomichan to be able to use such a data type would probably end up causing a lot of hassle.

I don't know anything about JSON schema validation, and I haven't looked into how Yomichan does it or what engine it uses. My naive hope was that maybe Yomichan could be updated to use a faster validator (ajv claims to be the fastest) and perhaps that would be enough to solve the problem.

@toasted-nutbread
Collaborator

Yeah, I agree overall that it's difficult to balance modularity, efficiency, and compatibility, which is why I haven't yet taken the initiative to implement something like what I mentioned in #2129 (comment).

I don't know anything about JSON schema validation, and I haven't looked into how Yomichan does it or what engine it uses.

A custom one I wrote, which supports a limited subset. I can look into doing a comparison vs ajv. The other downside of complex structured content vs native definitions is in the database storage overhead, since all of the formatting needs to be stored.
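
As a rough illustration of the storage point, the sketch below compares the serialized size of one glossary section stored as structured content with the same glosses stored as plain strings. The data is adapted from the 読む example earlier in this thread, and the serialized JSON length is only a proxy: how the browser database actually stores these values is not measured here. `serializedBytes` is a hypothetical helper.

```javascript
// Illustrative data adapted from the 読む example earlier in this thread.
const structured = [{
    type: 'structured-content',
    content: [{
        tag: 'ul',
        lang: 'en',
        data: {content: 'glossary'},
        style: {listStyleType: 'circle'},
        content: [
            {tag: 'li', content: 'to count'},
            {tag: 'li', content: 'to estimate'}
        ]
    }]
}];
const plain = ['to count', 'to estimate'];

// UTF-8 size of the JSON serialization, used here as a stand-in for storage cost.
function serializedBytes(value) {
    return new TextEncoder().encode(JSON.stringify(value)).length;
}

console.log('structured:', serializedBytes(structured), 'bytes');
console.log('plain:', serializedBytes(plain), 'bytes');
```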

This was referenced May 17, 2022