Forms, i.e. written denotations of the linguistic sign (cf. GOLD's FormUnit), are stored in a FormTable in CLDF datasets (typically Wordlists).
Each form is stored as a separate row in this table. Some analyses, e.g. alignments, require segmented lexical forms. If these can be supplied, they should be added in a Segments column, by default as space-separated strings.
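As a minimal sketch of what the default space-separated convention means for a consumer, the following Python snippet splits the content of a Segments cell into a list. The function name `parse_segments` and the sample value are illustrative assumptions, not part of CLDF.

```python
def parse_segments(cell: str) -> list[str]:
    """Split a space-separated Segments cell into a list of segments."""
    # str.split() without arguments splits on runs of whitespace and
    # returns an empty list for an empty cell.
    return cell.split()


# Hypothetical cell content, for illustration only.
segments = parse_segments("t ʃ a j")
print(segments)
```

A dataset that declares a different separator in its metadata would need that separator passed to `split` instead.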
The CLDF Ontology provides further properties which may be supplied in corresponding columns of the FormTable:
- A value column may be used to supply the raw value as found in the source, if this differs from Form. This is particularly useful for "retro-digitized" datasets, where the CLDF dataset is already the result of data clean-up.
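To illustrate the relation between the raw value and the cleaned-up form, here is a small Python sketch that lists the rows in which the two differ. The column name `Value`, the function name `rows_with_cleanup` and all sample rows are assumptions made up for this example.

```python
import csv
import io

# Hypothetical forms.csv content; the Value column holds the raw source
# string, the Form column the cleaned-up form.
SAMPLE = """ID,Language_ID,Parameter_ID,Value,Form
1,lang1,hand,"mano, la",mano
2,lang1,foot,pie,pie
"""


def rows_with_cleanup(text: str) -> list[str]:
    """Return the IDs of rows whose raw Value differs from the Form."""
    reader = csv.DictReader(io.StringIO(text))
    return [row["ID"] for row in reader if row["Value"] != row["Form"]]


print(rows_with_cleanup(SAMPLE))
```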
As with any CLDF component,
- comments and references to sources can be added via comment and source columns respectively,
- additional data can be supplied in additional columns.
Many examples of FormTable can be found in the datasets of the lexibank community.
The one for the Intercontinental Dictionary Series is described here:
https://github.com/intercontinental-dictionary-series/ids/blob/v4.3/cldf/cldf-metadata.json#L59-L171
Datasets created using the lexibank workflow (implemented in the pylexibank package) derive the segmentation of a form using orthography profiles (see Moran and Cysouw 2018), and the name of the profile used for a particular form is kept in the custom (non-CLDF) profile column.
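The idea behind profile-based segmentation can be sketched as a greedy longest-match over a grapheme inventory. This is not the pylexibank implementation; the function `segment`, the grapheme set `PROFILE` and the sample word are hypothetical, and the sketch only splits a form without applying any further mapping a real profile might define.

```python
def segment(form: str, graphemes: set[str]) -> list[str]:
    """Greedy longest-match segmentation of a form with a grapheme inventory."""
    longest = max(map(len, graphemes), default=1)
    segments = []
    i = 0
    while i < len(form):
        # Try the longest candidate first, then fall back to shorter ones.
        for size in range(min(longest, len(form) - i), 0, -1):
            chunk = form[i:i + size]
            if chunk in graphemes:
                segments.append(chunk)
                i += size
                break
        else:
            # Character not covered by the inventory: keep it as its own segment.
            segments.append(form[i])
            i += 1
    return segments


# Hypothetical grapheme inventory, for illustration only.
PROFILE = {"tsch", "sch", "ch", "a", "e", "i", "n", "t"}

# Joined with spaces, the result has the shape expected in a Segments column.
print(" ".join(segment("tschechen", PROFILE)))
```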
FormTable: forms.csv
Name/Property | Datatype | Cardinality | Description |
---|---|---|---|
ID | string | singlevalued | A unique identifier for a row in a table. To allow usage of identifiers as path components of URLs, IDs must only contain alphanumeric characters, underscore and hyphen. |
Language_ID | string | singlevalued | A reference to a language (or variety) the form belongs to. References LanguageTable. |
Parameter_ID | string | unspecified | A reference to the meaning denoted by the form. References ParameterTable. |
Form | string | singlevalued | The written expression of the form. If possible, the transcription system used for the written form should be described in CLDF metadata (e.g. via adding a common property dc:conformsTo to the column description using concept URLs of the GOLD Ontology (such as phonemicRep or phoneticRep) as values). |
Segments | list of string (separated by space) | multivalued | A list of segments (aka a sound sequence) is understood as the strict segmental representation of a form unit of a language, which is usually given in phonetic transcription. Suprasegmental elements, like tone or accent, of sound sequences are usually represented in a sequential form, although they are usually co-articulated along with the segmental elements of a sound sequence. Alternatively, suprasegmental aspects could also be represented as part of the prosodic structure of a word form. |
Comment | string | unspecified | A human-readable comment on a resource, providing additional context. |
Source | list of string (separated by ;) | multivalued | List of source specifications, of the form <source_ID>[], e.g. http://glottolog.org/resource/reference/id/318814[34], or meier2015[3-12] where meier2015 is a citation key in the accompanying BibTeX file. |
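
The constraints on ID and the shape of Source values described in the table can be checked with a few lines of Python. This is a sketch under stated assumptions: "alphanumeric" is read as ASCII letters and digits, the bracketed part of a source specification is treated as an optional free-text context, and the names `is_valid_id` and `parse_sources` are made up for this example.

```python
import re
from typing import Optional

# Assumption: "alphanumeric" means ASCII letters and digits.
ID_PATTERN = re.compile(r"[A-Za-z0-9_\-]+")
# A source key, optionally followed by a bracketed context such as pages.
SOURCE_PATTERN = re.compile(r"(?P<key>[^\[\]]+)(?:\[(?P<context>[^\[\]]*)\])?")


def is_valid_id(value: str) -> bool:
    """Check that an ID contains only alphanumerics, underscore and hyphen."""
    return ID_PATTERN.fullmatch(value) is not None


def parse_sources(cell: str) -> list[tuple[str, Optional[str]]]:
    """Split a Source cell on ';' and return (key, context) pairs."""
    sources = []
    # Limitation of this sketch: a ';' inside the brackets is not supported.
    for spec in cell.split(";"):
        spec = spec.strip()
        if not spec:
            continue
        match = SOURCE_PATTERN.fullmatch(spec)
        if match is None:
            raise ValueError(f"Malformed source specification: {spec!r}")
        sources.append((match.group("key"), match.group("context")))
    return sources


print(is_valid_id("form-1_a"))
print(parse_sources("meier2015[3-12];http://glottolog.org/resource/reference/id/318814[34]"))
```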