Latin dialects #146

kylebgorman · 2020-04-18T12:50:11Z

This is not working: the dialect files are identical. I am sharing it anyway in case I have missed something.

Note that this is not working as expected: the Classical and Ecclesiastical files are byte-for-byte identical and the former contains clear Ecclesiastical pronunciations (e.g., with affricates). Closes #143, or it will when/if it works.

This is not working as expected, once again. The Classical and Ecclesiastical files are bit-for-bit identical and the former contains affricates (which are Ecclesiastical only). But when it does work it'll close #143.

jacksonllee

I'm probably missing something --

"This is not working" (in the pull request description) -- what is not working?
How does the whitelist work?

lfashby · 2020-04-18T14:44:42Z

Do you mean that you scraped Latin after adding these dialects to languages.json, and the contents of the TSVs for the different dialects are all the same?

kylebgorman · 2020-04-18T14:58:29Z

Sorry, this is a terribly nondescript PR!

Yes, when I run with a (truncated for speed) languages.json, the Ecclesiastical and Classical files are unexpectedly identical: both contain the prons labeled "Ecclesiastical" as well as those labeled "Classical". Sample word with both of those dialects here.
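For anyone who wants to reproduce the "identical files" check, here is a minimal sketch using only the standard library. The helper name and the file paths are hypothetical; substitute whatever TSVs your scrape actually wrote.

```python
import filecmp


def files_identical(path_a: str, path_b: str) -> bool:
    """Returns True if the two files have exactly the same bytes.

    shallow=False forces a content comparison instead of trusting
    os.stat() metadata such as size and modification time.
    """
    return filecmp.cmp(path_a, path_b, shallow=False)
```

If this returns True for the Classical and Ecclesiastical TSVs, the dialect setting had no effect on the scrape.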

kylebgorman · 2020-04-18T14:59:44Z

Also a bunch of unrelated changes have ended up in this PR...I have too many PRs at once. But the relevant one is to languages.json.

The whitelist file is for a separate project we're doing of...well developing whitelists for languages/dialects where there are a lot of non-native pronunciations. We're really just focusing on English for now but I did a Latin one as an example. Once we have a few of them we'll add them to the post-processing procedure and generate "filtered" files as part of the big scrape.
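A rough sketch of what such a filtering step might look like, not the project's actual post-processing code. It assumes each TSV row is a word, a tab, and a pronunciation made of space-separated segments, and that the whitelist is simply a set of allowed segments; the function name and the sample data are made up for illustration.

```python
from typing import Iterable, Iterator, Set, Tuple


def filter_by_whitelist(
    rows: Iterable[str], whitelist: Set[str]
) -> Iterator[Tuple[str, str]]:
    """Yields (word, pron) pairs whose segments are all whitelisted."""
    for line in rows:
        word, pron = line.rstrip("\n").split("\t", 1)
        # Drops the whole entry if any one segment is not allowed.
        if all(segment in whitelist for segment in pron.split()):
            yield word, pron


# Illustrative values only, not scraped data.
sample_rows = ["aa\ta b\n", "bb\ta x\n"]
kept = list(filter_by_whitelist(sample_rows, {"a", "b"}))
```

Here the second row is dropped because "x" is not in the whitelist, so `kept` holds only the first entry.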

lfashby · 2020-04-18T15:29:21Z

I think when we added the Latin extraction function (and our other extraction functions that build their own pron selectors and don't rely on _PRON_XPATH_SELECTOR_TEMPLATE from config.py) we did not consider adding dialect support.

kylebgorman · 2020-04-18T16:06:30Z

Okay, so if there's a language customization in extract, what exactly happens to the dialect specifications? Are they completely ignored, or have to be added to the language customization, or something else?

lfashby · 2020-04-18T16:15:31Z

As far as I can tell, languages that do not interact with extract_word_pron_default (and in particular _yield_phn and its use of config.pron_xpath_selector) in extract/default.py do not have any dialect support. This means all languages for which we have extraction functions just ignore dialects (except Japanese, which does use config.pron_xpath_selector).
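A toy model of the behavior described above, to make the failure mode concrete. Nothing here is the real wikipron code: the entry format, the function names, and the registry are all hypothetical stand-ins, and the pronunciations are placeholder strings.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# Hypothetical stand-in for a scraped entry: (dialect label, pronunciation).
Entry = Tuple[str, str]
Extractor = Callable[[List[Entry], Optional[str]], Iterator[str]]


def extract_default(entries: List[Entry], dialect: Optional[str]) -> Iterator[str]:
    """Dialect-aware path: keeps only entries whose label matches."""
    for label, pron in entries:
        if dialect is None or label == dialect:
            yield pron


def extract_custom(entries: List[Entry], dialect: Optional[str]) -> Iterator[str]:
    """Language-specific path: does its own selection and never reads `dialect`."""
    for _label, pron in entries:
        yield pron


# Hypothetical registry of languages with their own extraction function.
CUSTOM_EXTRACTORS: Dict[str, Extractor] = {"Latin": extract_custom}


def scrape(language: str, entries: List[Entry], dialect: Optional[str] = None) -> List[str]:
    extractor = CUSTOM_EXTRACTORS.get(language, extract_default)
    return list(extractor(entries, dialect))


# Placeholder values, not scraped data.
ENTRIES = [("Classical", "pron-classical"), ("Ecclesiastical", "pron-ecclesiastical")]
```

With this setup, `scrape("Latin", ENTRIES, "Classical")` and `scrape("Latin", ENTRIES, "Ecclesiastical")` return the same list, which is the symptom reported in this PR, while a language on the default path gets filtered as expected.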

I'm not sure what the optimal solution to this might be - but whatever it is will also help me refine the Vietnamese extraction function I put together, which handled dialects for Vietnamese by basically rebuilding the pron and dialect selectors in the extraction function.

kylebgorman · 2020-04-18T16:44:04Z

At the very least we should document (or log loudly?) this. And while it's not super important, I'm not sure how useful the Latin data is without dialect specifications. I feel like I'm out of my depth with figuring out how that works given how elaborate the Latin extraction function is.
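One possible shape for the "log loudly" option, as a sketch only. The function name, the logger name, and the idea of passing in a set of languages with custom extraction functions are assumptions for illustration, not existing wikipron API.

```python
import logging
from typing import Optional, Set

logger = logging.getLogger("dialect_check")  # Hypothetical logger name.


def warn_if_dialect_ignored(
    language: str, dialect: Optional[str], custom_languages: Set[str]
) -> bool:
    """Logs a warning when a dialect is requested but will be ignored.

    Returns True if a warning was emitted, False otherwise.
    """
    if dialect and language in custom_languages:
        logger.warning(
            "Dialect %r requested for %s, but its custom extraction "
            "function ignores dialects; output will be unfiltered.",
            dialect,
            language,
        )
        return True
    return False
```

Calling this once at configuration time would at least surface the problem before a long scrape produces identical files.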

lfashby · 2020-04-18T17:28:36Z

I created an issue to track this.

jacksonllee · 2020-04-18T17:38:54Z

Lucas is correct that languages which require a non-default extraction treatment ignore the dialect labels unless they are specifically used in the respective extraction function. We should figure out how the extraction module for Latin can make use of config.pron_xpath_selector somehow (as the Japanese one does, as Lucas has pointed out). If I'm not mistaken, one possible approach would be to combine the Latin-specific pron_xpath_selector here with config.pron_xpath_selector that has the dialect info.
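A hedged sketch of the "combine the two selectors" idea, done as plain string building. The language-specific selector, the "qualifier" class name, and the "A | B" dialect string format are assumptions about the markup and configuration, so treat this as an illustration of the approach rather than a drop-in fix.

```python
from typing import Optional


def dialect_predicate(dialects: str) -> str:
    """Builds an XPath predicate body from a dialect string like 'A | B'."""
    names = [name.strip() for name in dialects.split("|") if name.strip()]
    tests = " or ".join(f'text() = "{name}"' for name in names)
    # Assumed markup: the dialect label is a link inside a qualifier span.
    return f'span[contains(@class, "qualifier") and a[{tests}]]'


def combine_selectors(language_selector: str, dialects: Optional[str]) -> str:
    """Appends the dialect predicate to a language-specific selector."""
    if not dialects:
        return language_selector
    # XPath allows stacking predicates on a step: //li[p1][p2].
    return f"{language_selector}[{dialect_predicate(dialects)}]"


# Hypothetical Latin-specific selector, for illustration only.
LATIN_SELECTOR = '//li[span[@class = "IPA"]]'
combined = combine_selectors(LATIN_SELECTOR, "Classical")
```

Stacking the dialect predicate onto the existing step keeps the Latin-specific logic untouched and only narrows which list items it sees.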

kylebgorman · 2020-10-13T20:03:25Z

This is quite out of date but ongoing work is happening on #143.

kylebgorman added 5 commits April 16, 2020 16:34

Add typing to big scrape.

ce6cba9

Update changelog

b85b92b

[lat] Adds whitelist.

0528c17

[lat] Add Latin dialects

5229cf7

Note that this is not working as expected: the Classical and Ecclesiastical files are byte-for-byte identical and the former contains clear Ecclesiastical pronunciations (e.g., with affricates). Closes #143, or it will when/if it works.

[lat] Add dialects.

891b328

This is not working as expected, once again. The Classical and Ecclesiastical files are bit-for-bit identical and the former contains affricates (which are Ecclesiastical only). But when it does work it'll close #143.

kylebgorman requested review from jacksonllee and lfashby April 18, 2020 12:50

jacksonllee reviewed Apr 18, 2020

lfashby mentioned this pull request Apr 18, 2020

Dialect support in extraction functions #149

Closed

kylebgorman added the "language support" label Oct 12, 2020

kylebgorman closed this Oct 13, 2020

jacksonllee deleted the latin_dialects branch October 16, 2020 22:06
