Latin dialects #146

Closed
wants to merge 5 commits into from

kylebgorman
Collaborator

This is not working: the dialect files are identical. I am just sharing this in case I have missed something.

Note that this is not working as expected: the Classical and
Ecclesiastical files are byte-for-byte identical and the former contains
clear Ecclesiastical pronunciations (e.g., with affricates).

Closes #143, or it will when/if it works.
This is not working as expected, once again. The Classical and
Ecclesiastical files are bit-for-bit identical and the former contains
affricates (which are Ecclesiastical only).

But when it does work it'll close #143.
Collaborator

@jacksonllee jacksonllee left a comment

I'm probably missing something --

  • "This is not working" (in the pull request description) -- what is not working?
  • How does the whitelist work?

@lfashby
Collaborator

lfashby commented Apr 18, 2020

Do you mean that you scraped Latin after adding these dialects to languages.json and the contents of the TSVs for the different dialects are all the same?

@kylebgorman
Collaborator Author

Sorry, this is a terribly nondescript PR!

Yes, when I run with a (truncated for speed) languages.json, the Ecclesiastical and Classical files are unexpectedly identical, and both contain the prons labeled "Ecclesiastical" as well as those labeled "Classical". Sample word with both of those dialects here.

@kylebgorman
Collaborator Author

kylebgorman commented Apr 18, 2020

Also a bunch of unrelated changes have ended up in this PR...I have too many PRs at once. But the relevant one is to languages.json.

The whitelist file is for a separate project we're doing of...well developing whitelists for languages/dialects where there are a lot of non-native pronunciations. We're really just focusing on English for now but I did a Latin one as an example. Once we have a few of them we'll add them to the post-processing procedure and generate "filtered" files as part of the big scrape.
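
A minimal sketch of what such a filtering step could look like. The thread does not say how the whitelist is structured, so this assumes a whitelist is a set of permitted phones and that each TSV line holds a word, a tab, and a space-separated pronunciation. The names `filter_entries` and `LATIN_WHITELIST`, and the sample entries, are hypothetical.

```python
from typing import Iterable, List, Set

# Hypothetical whitelist: illustrative only, not an actual Latin inventory.
LATIN_WHITELIST: Set[str] = {"k", "a", "s", "r", "e"}


def filter_entries(lines: Iterable[str], whitelist: Set[str]) -> List[str]:
    """Keeps only the TSV lines whose pronunciation uses whitelisted phones."""
    kept = []
    for line in lines:
        # Assumed layout: word, tab, space-separated phones.
        _word, pron = line.rstrip("\n").split("\t", 1)
        if all(phone in whitelist for phone in pron.split()):
            kept.append(line)
    return kept


# Made-up entries: the second one contains an affricate that is not
# in the whitelist, so it is dropped from the "filtered" output.
entries = ["casa\tk a s a", "cera\tt͡ʃ e r a"]
filtered = filter_entries(entries, LATIN_WHITELIST)
```

Under these assumptions, an entry is rejected as soon as a single phone falls outside the whitelist, which is the simplest policy for screening out non-native pronunciations.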

@lfashby
Collaborator

lfashby commented Apr 18, 2020

I think when we added the Latin extraction function (and our other extraction functions that build their own pron selectors and don't rely on _PRON_XPATH_SELECTOR_TEMPLATE from config.py) we did not consider adding dialect support.

@kylebgorman
Collaborator Author

Okay, so if there's a language customization in extract, what exactly happens to the dialect specifications? Are they completely ignored, or have to be added to the language customization, or something else?

@lfashby
Collaborator

lfashby commented Apr 18, 2020

As far as I can tell, languages that do not go through extract_word_pron_default in extract/default.py (and in particular _yield_phn and its use of config.pron_xpath_selector) do not have any dialect support. That means all languages for which we have extraction functions just ignore dialects (except Japanese, which does use config.pron_xpath_selector).

I'm not sure what the optimal solution to this might be - but whatever it is will also help me refine the Vietnamese extraction function I put together, which handled dialects for Vietnamese by basically rebuilding the pron and dialect selectors in the extraction function.
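
To illustrate the behavior described above, here is a self-contained sketch that does not use the project's actual code. The markup in `SNIPPET`, the class names, the transcriptions, and the functions `extract_custom` and `extract_with_dialect` are all hypothetical stand-ins. The point is only that an extractor which builds its own selector returns the same pronunciations no matter which dialect was requested.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a pronunciation list on a page.
# The transcriptions are illustrative values, not scraped data.
SNIPPET = """
<ul>
  <li><span class="qualifier">Classical</span> <span class="IPA">/ˈkae̯.sar/</span></li>
  <li><span class="qualifier">Ecclesiastical</span> <span class="IPA">/ˈt͡ʃe.sar/</span></li>
</ul>
"""


def extract_custom(root):
    """Mimics a language-specific extractor that never looks at dialects."""
    return [span.text for span in root.findall(".//li/span[@class='IPA']")]


def extract_with_dialect(root, dialect):
    """Mimics a dialect-aware extractor: keeps matching qualifiers only."""
    prons = []
    for li in root.findall(".//li"):
        qualifier = li.find("span[@class='qualifier']")
        ipa = li.find("span[@class='IPA']")
        if qualifier is not None and ipa is not None and qualifier.text == dialect:
            prons.append(ipa.text)
    return prons


root = ET.fromstring(SNIPPET)
# The dialect-unaware extractor yields both pronunciations for any dialect,
# so per-dialect output files built from it would be identical.
all_prons = extract_custom(root)
classical_only = extract_with_dialect(root, "Classical")
```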

@kylebgorman
Collaborator Author

kylebgorman commented Apr 18, 2020 via email

@lfashby
Collaborator

lfashby commented Apr 18, 2020

I created an issue to track this.

@jacksonllee
Collaborator

jacksonllee commented Apr 18, 2020

Lucas is correct that languages which require a non-default extraction treatment ignore the dialect labels unless they are specifically used in the respective extraction function. We should figure out how the extraction module for Latin can make use of config.pron_xpath_selector somehow (as the Japanese one does, as Lucas has pointed out). If I'm not mistaken, one possible approach would be to combine the Latin-specific pron_xpath_selector here with config.pron_xpath_selector that has the dialect info.
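
A rough sketch of that combining idea, under stated assumptions: the templates below are hypothetical and much simpler than whatever config.py and the Latin extraction function really contain, and `dialect_selector` and `build_selector` are names invented for illustration. The language-specific template leaves a slot that is filled with a dialect condition when dialects are configured and left empty otherwise.

```python
from typing import List

# Hypothetical language-specific template with a slot for dialect info.
LATIN_PRON_TEMPLATE = '//li[span[@class = "IPA"]{dialect_selector}]'


def dialect_selector(dialects: List[str]) -> str:
    """Builds an extra XPath condition, or nothing if no dialects are given."""
    if not dialects:
        return ""
    text = " or ".join(f'text() = "{d}"' for d in dialects)
    return f' and span[@class = "qualifier"][{text}]'


def build_selector(template: str, dialects: List[str]) -> str:
    """Fills the language-specific template with the dialect condition."""
    return template.format(dialect_selector=dialect_selector(dialects))


classical = build_selector(LATIN_PRON_TEMPLATE, ["Classical"])
no_dialect = build_selector(LATIN_PRON_TEMPLATE, [])
```

With this shape, a language customization keeps control of its own base selector while the dialect filtering is still driven by the configuration.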

@kylebgorman kylebgorman added the language support Language-specific issues label Oct 12, 2020
@kylebgorman
Collaborator Author

This is quite out of date but ongoing work is happening on #143.

@jacksonllee jacksonllee deleted the latin_dialects branch October 16, 2020 22:06