This post gives an overview of the changes made from DoReCo 1.0 to DoReCo 1.1, released 23 August 2022.
Metadata
Updated word counts on the website and in metadata.csv files. Counts are calculated as follows:
for core files: count all entries on wd tiers for core speakers, minus "<p:>" (silent pauses), tokens starting with "<<" (labels), and "****" (fillers) (How are word tokens counted? #1)
for extended files: count all entries on wd tiers for all speakers
Several corrections to various entries throughout the entire DoReCo corpus
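The counting rules above can be illustrated with a minimal Python sketch. It is not the script used to produce the published counts. It assumes that wd-tier entries are available as (speaker, entry) pairs and that pauses and fillers match "<p:>" and "****" exactly; the function names and the data layout are hypothetical.

```python
def is_word_token(entry):
    """Return True if a wd-tier entry counts as a word token in a core file."""
    if entry == "<p:>":         # silent pause (assumed to be an exact match)
        return False
    if entry.startswith("<<"):  # label, e.g. "<<ui>>"
        return False
    if entry == "****":         # filler (assumed to be an exact match)
        return False
    return True


def count_core_words(wd_entries, core_speakers):
    """Core files: count word tokens uttered by core speakers only."""
    return sum(
        1
        for speaker, entry in wd_entries
        if speaker in core_speakers and is_word_token(entry)
    )


def count_extended_words(wd_entries):
    """Extended files: count every wd-tier entry for every speaker."""
    return sum(1 for _ in wd_entries)


# Hypothetical wd-tier content as (speaker, entry) pairs.
entries = [
    ("A", "hello"),
    ("A", "<p:>"),
    ("A", "<<ui>>"),
    ("B", "world"),
    ("A", "****"),
    ("A", "there"),
]
core_count = count_core_words(entries, core_speakers={"A"})
extended_count = count_extended_words(entries)
```

With speaker A as the only core speaker, only "hello" and "there" count towards the core total, while the extended total includes all six entries.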
CSV files
Reworked the lang column, which sometimes contained ambiguous language names or entries that did not belong there. The lang column now always contains the glottocode
Fixed a bug that caused extra whitespace to appear in some cells
Added a core_extended column that indicates which set ("core" or "extended") a text belongs to
Added a ph column to the _wd CSV
Added a ph_ID column, in line with the already existing mb_ID and wd_ID columns
Changed the formatting of entries in the ID columns to have a leading "p", "m" or "w" according to the tier they refer to, instead of a generic "a"
Fixed a bug whereby IDs were sometimes not unique within a dataset
IDs are now consistent across the _ph and _wd CSVs
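Users who want to verify these properties in their own copy of the CSV files could run a check along the following lines. This is a rough sketch, not an official DoReCo tool. The column names lang, core_extended, ph_ID, mb_ID and wd_ID come from the notes above; everything else is an assumption, including the one-row-per-phone layout of the _ph CSV, the made-up ID values in the sample, and the function name check_rows. The glottocode pattern (four alphanumeric characters followed by four digits) matches codes such as port1286.

```python
import csv
import io
import re
from collections import Counter

GLOTTOCODE = re.compile(r"^[a-z0-9]{4}[0-9]{4}$")
ID_PREFIX = {"ph_ID": "p", "mb_ID": "m", "wd_ID": "w"}


def check_rows(rows):
    """Return a list of human-readable problems found in CSV rows (dicts)."""
    problems = []
    for n, row in enumerate(rows, start=1):
        # Extra whitespace in any cell.
        for column, value in row.items():
            if value is not None and value != value.strip():
                problems.append(f"row {n}: whitespace in {column}")
        # lang should hold a glottocode, not a language name.
        if not GLOTTOCODE.match(row.get("lang") or ""):
            problems.append(f"row {n}: lang is not a glottocode")
        if row.get("core_extended") not in {"core", "extended"}:
            problems.append(f"row {n}: unexpected core_extended value")
        # Each ID column should start with the letter of its tier.
        for column, prefix in ID_PREFIX.items():
            value = row.get(column)
            if value and not value.startswith(prefix):
                problems.append(f"row {n}: {column} lacks prefix {prefix!r}")
    # Assumption: one row per phone, so ph_ID must not repeat.
    counts = Counter(row["ph_ID"] for row in rows if row.get("ph_ID"))
    for ph_id, count in counts.items():
        if count > 1:
            problems.append(f"ph_ID {ph_id} occurs {count} times")
    return problems


# Hypothetical sample with three deliberate problems.
sample = """lang,core_extended,ph_ID,wd_ID
port1286,core,p1,w1
port1286,core,p2,w1
Portuguese,core,a3,w2
port1286,extended,p2,w3
"""
rows = list(csv.DictReader(io.StringIO(sample)))
problems = check_rows(rows)
```

In the sample, the third row has a language name instead of a glottocode and an ID with the old generic "a" prefix, and the ID p2 occurs twice, so three problems are reported.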
Changes to individual datasets
Fixed a character encoding bug that caused some long vowels to be mapped to sequences of short vowels in the Kakabe dataset
Fixed a bug affecting the ph tier in files from the Pnar dataset
Added tokenization and morph-level alignment for one Teop file from the core dataset
Fixed a tier metadata mismatch in the Light Warlpiri and Sadu datasets that caused non-tokenized entries from extra tiers to be displayed in the gl column of CSV export files
Fixed a tier metadata mismatch in the Komnzo dataset that caused the contents of the gl tier to not be displayed in CSV export files
Fixed a tier metadata mismatch in the Dalabon dataset that caused wrong tiers to be displayed in CSV export files
Fixed a tier name issue in one file from the Movima dataset
Fixed a rare bug in one file from the Yucatec Maya dataset that resulted from unexpected content in the transcription_conventions mapping file
Global improvements
In DoReCo 1.0, entries on the ph@ tiers sometimes included <notProcessedChunk> and <usb> entries, which could result from faulty input to the MAUS forced aligner. These items occurred in almost half of our 51 datasets. For DoReCo 1.1, input files were amended where possible, and most remaining <notProcessedChunk> and <usb> entries were replaced by <<ui>>, the generic placeholder label. We are aware of a few cases of <notProcessedChunk> that still occur in the data, which we will address in a later update.
Removed zero-length intervals.
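The two global clean-up steps above (replacing aligner artifacts with the placeholder label and dropping zero-length intervals) amount to something like the following Python sketch. It only illustrates the idea and is not the procedure actually applied to the corpus, where input files were also amended by hand. The Interval type and the clean_ph_tier function are hypothetical, and the sketch assumes that the faulty labels match exactly.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    start: float  # start time in seconds
    end: float    # end time in seconds
    label: str


FAULTY_LABELS = {"<notProcessedChunk>", "<usb>"}
PLACEHOLDER = "<<ui>>"  # generic placeholder label


def clean_ph_tier(intervals):
    """Drop zero-length intervals and relabel aligner artifacts."""
    cleaned = []
    for interval in intervals:
        if interval.end == interval.start:  # zero-length interval
            continue
        if interval.label in FAULTY_LABELS:
            interval = interval._replace(label=PLACEHOLDER)
        cleaned.append(interval)
    return cleaned


# Hypothetical ph-tier content.
tier = [
    Interval(0.0, 0.5, "a"),
    Interval(0.5, 0.5, "b"),
    Interval(0.5, 1.0, "<usb>"),
    Interval(1.0, 1.2, "<notProcessedChunk>"),
]
cleaned_tier = clean_ph_tier(tier)
```

Here the zero-length interval labelled "b" is dropped, and the two artifact labels are replaced by <<ui>>, leaving three intervals.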
Contents of the corpus
Six files from the extended set were removed from the corpus due to alignment issues:
doreco_port1286_08-02-13_Neli1
doreco_port1286_08-02-13_Neli2
doreco_port1286_08-21-10_JemisGarden
doreco_port1286_09-03-11_Abel_Germany
doreco_bora1263_vuurihii
doreco_trin1278_37
One file from the core set was removed from the corpus due to alignment issues:
doreco_sout2856_073_cut
Added gloss_abbreviation documentation for Dolgan, Nisvai, and Tabasaran
Website
Added an option to download corpus-level metadata
Added an option to download dataset citations in bulk
DoReCo version 1.1: Summary of changes