thai pronunciation data missing #231

Open
guttenberger opened this issue Apr 13, 2023 · 4 comments
Comments

@guttenberger

Hi, first of all, thank you for this project. I downloaded the Thai dictionary from https://kaikki.org/dictionary/Thai/index.html and noticed that the romanization is missing from the JSON entries.

For example, the romanization shown on https://en.wiktionary.org/wiki/เขียว is missing from the JSON entry:

{ 
...
    "word": "เขียว",
    "lang": "Thai",
    "lang_code": "th",
    "sounds": [
        {
            "ipa": "/kʰia̯w˩˩˦/",
            "tags": [
                "standard"
            ]
        }
    ],
...
}

even though the romanization is present for synonyms:

"synonyms": [
        {
            "roman": "kǐao",
            "word": "ขยว",
            "_dis1": "0 0 0 0 0"
        }
    ],

I find the Paiboon romanization particularly useful because it clearly indicates which tone to use. That matters because Thai is a tonal language, and correct tones are essential to conveying meaning accurately.

@jmviz
Contributor

jmviz commented Apr 14, 2023

The romanization appears to be intact in forms:

"forms": [
    {
      "form": "kǐao",
      "tags": [
        "romanization"
      ]
    }
  ],
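
For anyone who needs to pull that value out of the downloaded data, here is a minimal sketch. It assumes each entry is a JSON object shaped like the snippets quoted in this thread; the `romanizations` helper and the sample entry are hypothetical illustrations, not part of the project.

```python
import json


def romanizations(entry):
    """Return every form tagged "romanization" in one parsed entry."""
    return [
        form["form"]
        for form in entry.get("forms", [])
        if "romanization" in form.get("tags", [])
    ]


# Hypothetical sample, built from the fields quoted in this thread.
sample_line = json.dumps(
    {
        "word": "เขียว",
        "lang_code": "th",
        "forms": [{"form": "kǐao", "tags": ["romanization"]}],
    },
    ensure_ascii=False,
)

entry = json.loads(sample_line)
print(entry["word"], romanizations(entry))
```

Entries without a `forms` key simply yield an empty list, so the helper can be applied to every entry in the file without special-casing.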

@kristian-clausal
Collaborator

Romanizations belong in forms, and I don't think we'll start adding them to pronunciation data just for Thai. Ideally, romanization data is collected and added to forms, just as we separate hyphenation out of pronunciation sections. The alternative Royal Institute romanization is missing, though, which shows that we don't currently collect romanization data from Pronunciation sections; that might be something to look at later.

The IPA has tones at the end: ˩˩˦.
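
If the tone is needed on its own, it can be read off the IPA string. The sketch below assumes the tone is written with Unicode tone letters (U+02E5 to U+02E9), as in the `ipa` value quoted earlier in this thread; `tone_contour` is a hypothetical helper name.

```python
import re

# Tone letters occupy U+02E5..U+02E9 (˥ ˦ ˧ ˨ ˩).
TONE_LETTERS = re.compile("[\u02e5-\u02e9]+")


def tone_contour(ipa):
    """Return the last run of tone letters in an IPA string, or "" if none."""
    runs = TONE_LETTERS.findall(ipa)
    return runs[-1] if runs else ""


# The "ipa" value quoted in the original report.
print(tone_contour("/kʰia̯w˩˩˦/"))
```

This only isolates the contour; mapping it to a named Thai tone or to a romanization scheme would need a separate lookup table.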

@guttenberger
Author

guttenberger commented Apr 14, 2023

OK, thank you :)
I also noticed that audio is missing from the JSON. For instance,
https://en.wiktionary.org/wiki/ฝรั่ง
has audio, but no reference to it can be found in
https://kaikki.org/dictionary/All%20languages%20combined/meaning/ฝ/ฝร/ฝรั่ง.html

@kristian-clausal
Collaborator

Thai uses its own special table formatting that no other language uses, so this will have to go on the back burner. It isn't even a well-formatted table: the cells are visually combined (at least they're still separate cells), the second column's header is part of a normal cell, and the first column has no header information at all. I am tempted to sneak in and rewrite the whole thing so that it's more like other pronunciation sections. It's also all generated in a Lua module, which doesn't make any of this easier.
