DoReCo 1.1 #5

@LuPaschen

DoReCo version 1.1: Summary of changes

This post gives an overview of the changes made from DoReCo 1.0 to DoReCo 1.1, released 23 August 2022.

Metadata

  • Updated word counts on the website and in metadata.csv files. Counts are calculated as follows:
    1. for core files: count all entries on wd tiers for core speakers, minus ''<p:>'' (silent pauses), tokens starting with ''<<'' (labels), and ''****'' (fillers) (How are word tokens counted? #1)
    2. for extended files: count all entries on wd tiers for all speakers
  • Several corrections to various entries throughout the entire DoReCo corpus
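
The counting rules above can be sketched in Python as follows. This is a minimal illustration, not the script used to produce the published counts: the function names are hypothetical, and the sketch assumes the wd-tier entries are already available as a plain list of strings (restricted to core speakers in the core case) and that fillers are matched exactly as ''****''.

```python
def count_core_words(wd_entries):
    """Count word tokens under the core-file rule.

    Assumption: wd_entries is a list of strings taken from the wd tiers
    of core speakers (hypothetical input format).
    """
    count = 0
    for entry in wd_entries:
        if entry == "<p:>":          # silent pause
            continue
        if entry.startswith("<<"):   # label, e.g. <<ui>>
            continue
        if entry == "****":          # filler (exact match is an assumption)
            continue
        count += 1
    return count


def count_extended_words(wd_entries):
    """Extended-file rule: every wd-tier entry counts, for all speakers."""
    return len(wd_entries)
```

For a list such as ["a", "<p:>", "<<ui>>", "****", "b"], the core rule counts only the two word tokens, while the extended rule counts all five entries.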

CSV files

  • Reworked the lang column, which previously sometimes contained ambiguous language names or entries that did not belong there. The lang column now always contains the glottocode
  • Fixed a bug that caused extra whitespace to appear in some cells
  • Added a core_extended column that indicates which set (''core'' or ''extended'') a text belongs to
  • Added a ph column to the _wd CSV
  • Added a ph_ID column, analogous to the existing mb_ID and wd_ID columns
  • Changed the formatting of entries in the ID columns to have a leading ''p'', ''m'' or ''w'' according to the tier they refer to, instead of a generic ''a''
  • Fixed a bug whereby IDs were sometimes not unique within a dataset
  • IDs are now consistent across the _ph and _wd CSVs
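
Users who want to verify the new ID conventions in their own copy of the CSVs could use a check along these lines. It is only a sketch under stated assumptions: the helper name check_id_column is hypothetical, rows are assumed to come from csv.DictReader, and which ID column is expected to be unique per row depends on the file (ph_ID is assumed here for a _ph CSV). The sample data is invented for illustration.

```python
import csv
import io

# Expected leading letter for each ID column, as described in the changelog.
EXPECTED_PREFIX = {"ph_ID": "p", "mb_ID": "m", "wd_ID": "w"}


def check_id_column(rows, column):
    """Return (bad_prefix_ids, duplicate_ids) for one ID column.

    rows: iterable of dicts, e.g. from csv.DictReader (assumed format).
    """
    prefix = EXPECTED_PREFIX[column]
    seen = set()
    bad_prefix = []
    duplicates = []
    for row in rows:
        value = row[column]
        if not value.startswith(prefix):
            bad_prefix.append(value)   # e.g. an old-style generic "a" prefix
        if value in seen:
            duplicates.append(value)   # ID repeated within the dataset
        seen.add(value)
    return bad_prefix, duplicates


# Invented sample: one old-style ID ("a3") and one repeated ID ("p2").
sample_csv = "ph_ID,ph\np1,a\np2,b\na3,c\np2,d\n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))
bad, dup = check_id_column(rows, "ph_ID")
```

With the invented sample, bad holds the single old-style ID and dup holds the single repeated ID; on a DoReCo 1.1 file both lists would be expected to be empty.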

Changes to individual datasets

  • Fixed a character encoding bug that caused some long vowels to be mapped to sequences of short vowels in the Kakabe dataset
  • Fixed a bug affecting the ph tier in files from the Pnar dataset
  • Fixed a bug creating an unbounded ft tier in one file from the Cashinahua dataset (Unbounded ft tier in Cashinahua, MB_Autobiography #2)
  • Added tokenization and morph-level alignment for one Teop file from the core dataset
  • Fixed a tier metadata mismatch in the Light Warlpiri and Sadu datasets that caused non-tokenized entries from extra tiers to be displayed in the gl column of CSV export files
  • Fixed a tier metadata mismatch in the Komnzo dataset that caused the contents of the gl tier to not be displayed in CSV export files
  • Fixed a tier metadata mismatch in the Dalabon dataset that caused wrong tiers to be displayed in CSV export files
  • Fixed a tier name issue in one file from the Movima dataset
  • Fixed a rare bug in one file from the Yucatec Maya dataset that resulted from unexpected content in the transcription_conventions mapping file

Global improvements

  • In DoReCo 1.0, entries on the ph@ tiers sometimes included <notProcessedChunk> and <usb> entries, which could result from faulty input to the MAUS forced aligner. These items occurred in almost half of our 51 datasets. For DoReCo 1.1, input files were amended where possible, and most remaining <notProcessedChunk> and <usb> entries were replaced by <<ui>>, the generic placeholder label. We are aware of a few cases of <notProcessedChunk> that still occur in the data, which we will address in a later update.
  • Removed zero-length intervals.
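
The two clean-up steps above can be illustrated with a short Python sketch. It does not reproduce the actual DoReCo procedure (which first amended the input files where possible); it only shows the label replacement and interval filtering on a simplified representation. The function name clean_intervals and the (start, end, label) tuple format are assumptions made for this example.

```python
PLACEHOLDER = "<<ui>>"  # generic placeholder label named in the changelog
FAULTY_LABELS = {"<notProcessedChunk>", "<usb>"}


def clean_intervals(intervals):
    """Drop zero-length intervals and replace faulty aligner labels.

    intervals: list of (start, end, label) tuples with times in seconds
    (a hypothetical, simplified stand-in for a ph tier).
    """
    cleaned = []
    for start, end, label in intervals:
        if end == start:               # zero-length interval: remove
            continue
        if label in FAULTY_LABELS:     # faulty aligner output: relabel
            label = PLACEHOLDER
        cleaned.append((start, end, label))
    return cleaned
```

Applied to a tier containing a zero-length interval and a <usb> entry, the function returns the tier without the empty interval and with the <usb> label replaced by <<ui>>.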

Contents of the corpus

  • Six files from the extended set were removed from the corpus due to alignment issues:
    • doreco_port1286_08-02-13_Neli1
    • doreco_port1286_08-02-13_Neli2
    • doreco_port1286_08-21-10_JemisGarden
    • doreco_port1286_09-03-11_Abel_Germany
    • doreco_bora1263_vuurihii
    • doreco_trin1278_37
  • One file from the core set was removed from the corpus due to alignment issues:
    • doreco_sout2856_073_cut
  • Added gloss_abbreviation documentation for Dolgan, Nisvai, and Tabasaran

Website

  • Added an option to download corpus-level metadata
  • Added an option to download dataset citations in bulk
  • Added DOIs and URLs to dataset citations (Citation Download is missing doi #3)
