Specification of form manipulation

Specification of the value-to-form processing in Lexibank datasets:

The value-to-form processing is divided into two steps, implemented as methods:

  • FormSpec.split: Splits a string into individual form chunks.
  • FormSpec.clean: Normalizes a form chunk.
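
The two steps can be pictured as a small pipeline. The following is a minimal sketch in plain Python, not the actual FormSpec implementation: the function names split_value, clean_chunk and value_to_forms are hypothetical, and the separator default is only assumed for illustration.

```python
def split_value(value, separators=";,"):
    """Step 1: split a raw value into form chunks at any separator character."""
    chunks = [""]
    for char in value:
        if char in separators:
            chunks.append("")  # a separator starts a new chunk
        else:
            chunks[-1] += char
    # Drop chunks that are empty or consist of whitespace only.
    return [chunk for chunk in chunks if chunk.strip()]


def clean_chunk(chunk):
    """Step 2: normalize one chunk by trimming it and collapsing whitespace."""
    return " ".join(chunk.split())


def value_to_forms(value):
    """Run step 1 (split), then step 2 (clean) on every resulting chunk."""
    return [clean_chunk(chunk) for chunk in split_value(value)]


print(value_to_forms("  hand ;  big   toe, foot "))  # ['hand', 'big toe', 'foot']
```

Keeping the two steps separate means that splitting decisions (what counts as a separate form) stay independent of normalization decisions (what a single form should look like).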

These methods use the attributes of a FormSpec instance to configure their behaviour.

  • brackets (default: {'(': ')'}): Pairs of strings that should be recognized as brackets, specified as a dict mapping each opening string to its closing string.
  • separators (default: ;,): Iterable of single-character tokens that should be recognized as word separators.
  • missing_data (default: ('?', '-')): Iterable of strings that are used to mark missing data.
  • strip_inside_brackets (default: False): Flag signaling whether to strip content in brackets (and strip leading and trailing whitespace).
  • replacements (default: [('[sub]1[/sub]', '₁'), ('[sub]2[/sub]', '₂'), ('[sub]3[/sub]', '₃'), ('[sup]h[/sup]', 'ʰ'), ('[sup]w[/sup]', 'ʷ'), ('[sup]y[/sup]', 'ʸ'), ('[sup][/sup]', '')]): List of pairs (source, target) used to replace occurrences of source in forms with target (applied before stripping content in brackets).
  • first_form_only (default: True): Flag signaling whether at most one form should be returned from split, effectively ignoring any spelling variants, etc.
  • normalize_whitespace (default: True): Flag signaling whether to normalize whitespace, that is, to strip leading and trailing whitespace and collapse runs of whitespace to single spaces.
  • normalize_unicode (default: None): Unicode normalization form to apply to the input of split (None, 'NFD' or 'NFC').
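
To show how these attributes could interact, here is a rough, self-contained sketch. It is not the FormSpec class itself: FormSpecSketch is a hypothetical stand-in, its method signatures are assumptions, replacements is left empty for brevity, and the splitting is deliberately naive (it also splits on separators that occur inside brackets).

```python
import re
import unicodedata
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FormSpecSketch:
    """Hypothetical stand-in for FormSpec, using the attribute names listed above."""
    brackets: dict = field(default_factory=lambda: {"(": ")"})
    separators: str = ";,"
    missing_data: tuple = ("?", "-")
    strip_inside_brackets: bool = False
    replacements: list = field(default_factory=list)  # left empty in this sketch
    first_form_only: bool = True
    normalize_whitespace: bool = True
    normalize_unicode: Optional[str] = None

    def clean(self, form):
        """Normalize one chunk; return None for missing-data markers."""
        if form in self.missing_data:
            return None
        # Replacements run before bracket stripping, as described above.
        for source, target in self.replacements:
            form = form.replace(source, target)
        if self.strip_inside_brackets:
            for opening, closing in self.brackets.items():
                pattern = re.escape(opening) + ".*?" + re.escape(closing)
                form = re.sub(pattern, "", form)
        if self.normalize_whitespace:
            form = " ".join(form.split())
        return form

    def split(self, value):
        """Split a value into cleaned forms, honouring first_form_only."""
        if self.normalize_unicode:
            value = unicodedata.normalize(self.normalize_unicode, value)
        pattern = "[" + re.escape(self.separators) + "]"
        forms = []
        for chunk in re.split(pattern, value):
            form = self.clean(chunk.strip())
            if form:  # skip missing data and empty chunks
                forms.append(form)
        return forms[:1] if self.first_form_only else forms


spec = FormSpecSketch(first_form_only=False, strip_inside_brackets=True)
print(spec.split("ama (mother); ?, baba"))  # ['ama', 'baba']
```

With the default first_form_only=True, the same input would yield only the first form, which is how spelling variants get dropped.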