DoReCo 1.1 #5
Comments
@LuPaschen you mention "core speakers" above, when explaining the word token count. I guess, "core speakers" are the ones listed in the metadata for a corpus under For some corpora this is different, though:
A couple more speaker references may have typos, too. So, if my assumption about core speakers is correct, should I try to fix the speaker references when creating CLDF data or will DoReCo fix these in a new release?
@xrotwang There may be some confusion about the use of the term "core", so let me clarify. There are "core speakers" and "core texts". The former are speakers for whom time alignment at the phone level exists (i.e. the speakers appearing in the _ph CSV files). The latter are all texts that contain at least one such speaker. This means that a "core" file can contain multiple speakers, some core and some non-core. The "core_extended" column in the _wd CSV files gives information on core texts, not core speakers. A column for core speakers could easily be created, though, by checking whether a given speaker appears in the _ph CSV for the respective file. Also, thanks for picking up on the discrepancies in yuca1254 and apah1238. The former will be hot-fixed as soon as possible; the latter we'll add to our to-do list for the next major update.
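For anyone who wants to derive that per-speaker flag themselves, here is a minimal sketch of the check described above. It assumes that both the _ph and _wd CSVs have `file` and `speaker` columns; the `core_speaker` column name, the helper names, and the tiny in-memory tables are hypothetical and only stand in for the real files.

```python
import csv
import io


def core_speaker_pairs(ph_rows):
    """Collect the (file, speaker) pairs that occur in the _ph rows,
    i.e. the speakers with phone-level time alignment in each file."""
    return {(row["file"], row["speaker"]) for row in ph_rows}


def flag_core_speakers(wd_rows, ph_rows):
    """Return copies of the _wd rows with an added 'core_speaker' column
    (hypothetical name) set to 'yes' or 'no'."""
    core = core_speaker_pairs(ph_rows)
    flagged = []
    for row in wd_rows:
        new_row = dict(row)
        is_core = (row["file"], row["speaker"]) in core
        new_row["core_speaker"] = "yes" if is_core else "no"
        flagged.append(new_row)
    return flagged


# Made-up stand-ins for one file's _ph and _wd CSVs (assumed column names).
PH_CSV = """file,speaker,ph
text01,spk_A,a
text01,spk_A,b
"""
WD_CSV = """file,speaker,wd
text01,spk_A,ab
text01,spk_B,cd
"""

ph_rows = list(csv.DictReader(io.StringIO(PH_CSV)))
wd_rows = list(csv.DictReader(io.StringIO(WD_CSV)))

flagged = flag_core_speakers(wd_rows, ph_rows)
# Number of word rows that belong to a core speaker.
core_word_count = sum(row["core_speaker"] == "yes" for row in flagged)
```

Matching on the (file, speaker) pair rather than on the speaker alone keeps the check tied to "the respective file", so a speaker is only flagged in files where their phone-level alignment actually exists.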
Ah, ok. So if I only count words that are referenced in
Yes, as the ph CSVs contain only core speakers by definition (no time alignment at the phone level -> not included in the ph CSV -> not a core speaker).
DoReCo version 1.1: Summary of changes
This post gives an overview of the changes made from DoReCo 1.0 to DoReCo 1.1, released 23 August 2022.
Metadata
CSV files
Changes to individual datasets
Global improvements
Contents of the corpus
Website