DoReCo 1.1 #5

Open
LuPaschen opened this issue Aug 24, 2022 · 4 comments

LuPaschen commented Aug 24, 2022

DoReCo version 1.1: Summary of changes

This post gives an overview of the changes made from DoReCo 1.0 to DoReCo 1.1, released 23 August 2022.

Metadata

  • Updated word counts on the website and in metadata.csv files. Counts are calculated as follows:
    1. for core files: count all entries on wd tiers for core speakers, excluding ''<p:>'' (silent pauses), tokens starting with ''<<'' (labels), and ''****'' (fillers) (How are word tokens counted? #1)
    2. for extended files: count all entries on wd tiers for all speakers
  • Several corrections to various entries throughout the entire DoReCo corpus
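
The counting rules above can be sketched as follows. This is only an illustration, not the script used for the release: the column names `speaker` and `wd`, the helper names, and the sample rows are assumptions, and the set of core speakers is passed in by hand.

```python
import csv
import io


def is_word_token(wd: str) -> bool:
    """Apply the three exclusions from the release notes to one wd entry."""
    wd = wd.strip()  # assumption: tolerate stray whitespace around the entry
    if wd == "<p:>":          # silent pause
        return False
    if wd.startswith("<<"):   # label
        return False
    if wd == "****":          # filler
        return False
    return True


def count_core_words(rows, core_speakers):
    """Core files: entries by core speakers, minus pauses, labels and fillers."""
    return sum(
        1 for row in rows
        if row["speaker"] in core_speakers and is_word_token(row["wd"])
    )


def count_extended_words(rows):
    """Extended files: every entry on the wd tier, for all speakers."""
    return sum(1 for _ in rows)


# Made-up sample in the shape of a _wd CSV (only two columns shown).
SAMPLE = """speaker,wd
A,<p:>
A,hello
A,<<ui>>
A,****
B,world
A,there
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(count_core_words(rows, {"A"}))   # 2
print(count_extended_words(rows))      # 6
```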

CSV files

  • Reworked the lang column, which sometimes contained ambiguous language names or entries that did not belong there. The lang column now always contains the glottocode
  • Fixed a bug that caused extra whitespace to appear in some cells
  • Added a core_extended column that indicates which set (''core'' or ''extended'') a text belongs to
  • Added a ph column to the _wd CSV
  • Added a ph_ID column, analogous to the existing mb_ID and wd_ID columns
  • Changed the formatting of entries in the ID columns to have a leading ''p'', ''m'' or ''w'' according to the tier they refer to, instead of a generic ''a''
  • Fixed a bug whereby IDs were sometimes not unique within a dataset
  • IDs are now consistent across the _ph and _wd CSVs
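
For anyone who wants to verify the ID-related changes on their own copy, here is a minimal sketch of three checks: the tier prefix, uniqueness, and agreement between the _ph and _wd CSVs. The function names are hypothetical and the sample ID values (`p1`, `w1`, ...) are made up; the actual ID format may look different.

```python
from collections import Counter

# Expected leading letter per ID column, as described in the list above.
ID_PREFIX = {"ph_ID": "p", "mb_ID": "m", "wd_ID": "w"}


def badly_prefixed(rows, column):
    """Return IDs in `column` that lack the expected leading letter."""
    prefix = ID_PREFIX[column]
    return [r[column] for r in rows if not r[column].startswith(prefix)]


def duplicate_ids(rows, column):
    """Return IDs occurring more than once in `column`.

    Only meaningful for the primary ID of a file (ph_ID in _ph, wd_ID in _wd);
    a wd_ID naturally repeats in _ph, once per phone of the word.
    """
    counts = Counter(r[column] for r in rows)
    return sorted(i for i, n in counts.items() if n > 1)


def wd_ids_missing_from_wd(ph_rows, wd_rows):
    """Return wd_IDs referenced in _ph that do not occur in _wd."""
    known = {r["wd_ID"] for r in wd_rows}
    return sorted({r["wd_ID"] for r in ph_rows} - known)


# Made-up rows, deliberately containing one problem of each kind.
ph_rows = [
    {"ph_ID": "p1", "wd_ID": "w1"},
    {"ph_ID": "p2", "wd_ID": "w1"},
    {"ph_ID": "p3", "wd_ID": "w2"},
    {"ph_ID": "a4", "wd_ID": "w9"},
    {"ph_ID": "p3", "wd_ID": "w2"},
]
wd_rows = [{"wd_ID": "w1"}, {"wd_ID": "w2"}]

print(badly_prefixed(ph_rows, "ph_ID"))          # ['a4']
print(duplicate_ids(ph_rows, "ph_ID"))           # ['p3']
print(wd_ids_missing_from_wd(ph_rows, wd_rows))  # ['w9']
```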

Changes to individual datasets

  • Fixed a character encoding bug that caused some long vowels to be mapped to sequences of short vowels in the Kakabe dataset
  • Fixed a bug affecting the ph tier in files from the Pnar dataset
  • Fixed a bug creating an unbounded ft tier in one file from the Cashinahua dataset (Unbounded ft tier in Cashinahua, MB_Autobiography #2)
  • Added tokenization and morph-level alignment for one Teop file from the core dataset
  • Fixed a tier metadata mismatch in the Light Warlpiri and Sadu datasets that caused non-tokenized entries from extra tiers to be displayed in the gl column of CSV export files
  • Fixed a tier metadata mismatch in the Komnzo dataset that caused the contents of the gl tier to not be displayed in CSV export files
  • Fixed a tier metadata mismatch in the Dalabon dataset that caused wrong tiers to be displayed in CSV export files
  • Fixed a tier name issue in one file from the Movima dataset
  • Fixed a rare bug in one file from the Yucatec Maya dataset that resulted from unexpected content in the transcription_conventions mapping file

Global improvements

  • In DoReCo 1.0, entries on the ph@ tiers sometimes included <notProcessedChunk> and <usb> entries, which could result from faulty input to the MAUS forced aligner. These items occurred in almost half of our 51 datasets. For DoReCo 1.1, input files were amended where possible, and most remaining <notProcessedChunk> and <usb> entries were replaced by <<ui>>, the generic placeholder label. We are aware of a few cases of <notProcessedChunk> that still occur in the data, which we will address in a later update.
  • Removed zero-length intervals.
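
The effect of the two global changes on a single tier can be illustrated like this. It is a sketch under assumptions, not the pipeline used for the release: the `Interval` type and `clean_tier` function are hypothetical, and only the replacement step is shown (amending the aligner input files is not covered).

```python
from typing import NamedTuple


class Interval(NamedTuple):
    start: float
    end: float
    label: str


ALIGNER_ARTIFACTS = {"<notProcessedChunk>", "<usb>"}
PLACEHOLDER = "<<ui>>"  # generic placeholder label named in the notes above


def clean_tier(intervals):
    """Drop zero-length intervals and relabel aligner artifacts as <<ui>>."""
    cleaned = []
    for iv in intervals:
        if iv.start == iv.end:  # zero-length interval
            continue
        if iv.label in ALIGNER_ARTIFACTS:
            iv = iv._replace(label=PLACEHOLDER)
        cleaned.append(iv)
    return cleaned


# Made-up tier with one zero-length interval and two aligner artifacts.
tier = [
    Interval(0.0, 0.5, "a"),
    Interval(0.5, 0.5, "b"),
    Interval(0.5, 1.0, "<usb>"),
    Interval(1.0, 1.4, "<notProcessedChunk>"),
]

for iv in clean_tier(tier):
    print(iv.start, iv.end, iv.label)
```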

Contents of the corpus

  • Six files from the extended set were removed from the corpus due to alignment issues:
    • doreco_port1286_08-02-13_Neli1
    • doreco_port1286_08-02-13_Neli2
    • doreco_port1286_08-21-10_JemisGarden
    • doreco_port1286_09-03-11_Abel_Germany
    • doreco_bora1263_vuurihii
    • doreco_trin1278_37
  • One file from the core set was removed from the corpus due to alignment issues:
    • doreco_sout2856_073_cut
  • Added gloss_abbreviation documentation for Dolgan, Nisvai, and Tabasaran

Website

  • Added an option to download corpus-level metadata
  • Added an option to download dataset citations in bulk
  • Added DOIs and URLs to dataset citations (Citation Download is missing doi #3)
@LuPaschen LuPaschen self-assigned this Aug 24, 2022
@LuPaschen LuPaschen pinned this issue Aug 24, 2022
@LuPaschen LuPaschen added the documentation Improvements or additions to documentation label Aug 24, 2022
@xrotwang

@LuPaschen you mention "core speakers" above, when explaining the word token count. I guess, "core speakers" are the ones listed in the metadata for a corpus under spk_code - correct? For most corpora this does seem to work - i.e. omitting words by non-core speakers only skips <10% of words in *_wd.csv.

For some corpora this is different, though:

  • yuca1254 lists 2 and 4 as spk_code, but references 02 and 04 from wd.csv.
  • apah1238 has spk_code Lince but 153 words attributed to fince.

A couple more speaker references may have typos, too.

So, if my assumption about core speakers is correct, should I try to fix the speaker references when creating CLDF data or will DoReCo fix these in a new release?
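
A mismatch report of the kind described above could be produced roughly as follows. This is a sketch: the `speaker` column name, the function name, and the way spk_code values are passed in are assumptions, and the rows are made up to mirror the yuca1254 case. It only reports unlisted codes and does not try to guess the intended spelling.

```python
from collections import Counter


def unlisted_speakers(wd_rows, spk_codes):
    """Count words per speaker code that is not listed under spk_code."""
    counts = Counter(r["speaker"] for r in wd_rows)
    return {spk: n for spk, n in counts.items() if spk not in spk_codes}


# Made-up rows: metadata lists "2" and "4", the CSV uses "02" and "04".
wd_rows = [
    {"speaker": "02", "wd": "x"},
    {"speaker": "02", "wd": "y"},
    {"speaker": "04", "wd": "z"},
]

print(unlisted_speakers(wd_rows, {"2", "4"}))    # {'02': 2, '04': 1}
print(unlisted_speakers(wd_rows, {"02", "04"}))  # {}
```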

@LuPaschen
Author

@xrotwang There may be some confusion about the use of the term "core", so let me clarify. There are "core speakers" and "core texts". The former refers to speakers for whom time alignment at the phone level exists (i.e. the contents of the _ph CSV files). The latter refers to all texts that contain at least one such speaker. This means that there can be "core" files that contain multiple speakers, some core and some non-core. The "core_extended" column in the _wd CSVs gives information on core texts, not core speakers. A column for core speakers could easily be created, though, by checking whether a given speaker appears in the _ph CSV for the respective file.
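
The check described in the last sentence might look like this. It is a minimal sketch: the column names `file` and `speaker`, the new column name `core_speaker`, and the sample rows are assumptions.

```python
def core_speaker_keys(ph_rows):
    """Collect (file, speaker) pairs that have phone-level alignment."""
    return {(r["file"], r["speaker"]) for r in ph_rows}


def add_core_speaker_flag(wd_rows, ph_rows, column="core_speaker"):
    """Return copies of the _wd rows with a boolean core-speaker column."""
    core = core_speaker_keys(ph_rows)
    return [{**r, column: (r["file"], r["speaker"]) in core} for r in wd_rows]


# Made-up rows: speaker B occurs in a core text but has no phone alignment.
ph_rows = [{"file": "text1", "speaker": "A"}]
wd_rows = [
    {"file": "text1", "speaker": "A", "wd": "hello"},
    {"file": "text1", "speaker": "B", "wd": "world"},
]

for row in add_core_speaker_flag(wd_rows, ph_rows):
    print(row["speaker"], row["core_speaker"])
```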

Also, thx for picking up on the discrepancies in yuca1254 and apah1238. The former will be hot-fixed asap, the latter we'll add to our to-do-list for the next major update.

@xrotwang

Ah, ok. So if I only count words that are referenced in _ph.csv, I'll already limit the count to core speakers, correct?

@LuPaschen
Author

Yes, as the _ph CSVs contain only core speakers by definition (no time alignment at the phone level -> not included in the _ph CSV -> not a core speaker).
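
Counting only the words referenced in the _ph CSV can be sketched as below, assuming the _ph CSV carries a `wd_ID` column that matches the one in the _wd CSV (the function names and sample rows are made up). Note that this restricts the count to core speakers only; the exclusions for pauses, labels and fillers from the release notes would still have to be applied on top.

```python
def wd_ids_in_ph(ph_rows):
    """Collect the distinct wd_IDs referenced from the _ph CSV."""
    return {r["wd_ID"] for r in ph_rows}


def count_referenced_words(wd_rows, ph_rows):
    """Count _wd rows whose wd_ID is referenced from the _ph CSV."""
    referenced = wd_ids_in_ph(ph_rows)
    return sum(1 for r in wd_rows if r["wd_ID"] in referenced)


# Made-up rows: w3 belongs to a speaker without phone-level alignment.
ph_rows = [
    {"ph_ID": "p1", "wd_ID": "w1"},
    {"ph_ID": "p2", "wd_ID": "w1"},
    {"ph_ID": "p3", "wd_ID": "w2"},
]
wd_rows = [{"wd_ID": "w1"}, {"wd_ID": "w2"}, {"wd_ID": "w3"}]

print(count_referenced_words(wd_rows, ph_rows))  # 2
```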
