Commit ea8aea1 (1 parent: a928860), showing 33 changed files with 1,016 additions and 455 deletions.
@@ -0,0 +1,16 @@ | ||
.. _acoustics: | ||
|
||
***************** | ||
Acoustic measures | ||
***************** | ||
|
||
TODO blurb | ||
|
||
Contents: | ||
|
||
.. toctree:: | ||
:maxdepth: 2 | ||
|
||
acoustics_encoding.rst | ||
acoustics_querying.rst | ||
acoustics_backend.rst |
@@ -0,0 +1,5 @@ | ||
.. _acoustics_backend: | ||
|
||
**************** | ||
Acoustic backend | ||
**************** |
@@ -0,0 +1,6 @@ | ||
.. _acoustics_encoding: | ||
|
||
************************** | ||
Encoding acoustic measures | ||
************************** | ||
|
@@ -0,0 +1,6 @@ | ||
.. _acoustics_querying: | ||
|
||
************************** | ||
Querying acoustic measures | ||
************************** | ||
|
@@ -0,0 +1,19 @@ | ||
.. _enrichment: | ||
|
||
********** | ||
Enrichment | ||
********** | ||
|
||
Following import, the corpus is often fairly bare, with just word and phone annotations. An important step in analyzing
a corpus is therefore enriching it with other information. Most of the methods here run automatically once a function is called.
|
||
|
||
Contents: | ||
|
||
.. toctree:: | ||
:maxdepth: 2 | ||
|
||
enrichment_syllables.rst | ||
enrichment_utterances.rst | ||
enrichment_csvs.rst | ||
enrichment_queries.rst |
@@ -0,0 +1,76 @@ | ||
.. _enrichment_csvs: | ||
|
||
************************ | ||
Enrichment via CSV files | ||
************************ | ||
|
||
PolyglotDB can add arbitrary information to annotations, as well as metadata about speakers and files, from a local
CSV file. When constructing this CSV file, the first column should contain the label used to identify which element
should be enriched; all subsequent columns are used as properties to add to the corpus.
|
||
:: | ||
|
||
ID_column,property_one,property_two | ||
first_item,first_item_value_one,first_item_value_two | ||
second_item,,second_item_value_two | ||
|
||
Enriching using this file would look up elements based on the `ID_column`, and the one matching `first_item` would get
both `property_one` and `property_two` (with the respective values). The one matching `second_item` would only get
`property_two` (because the value for `property_one` is empty).
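
The lookup described above can be sketched with the standard library alone. This is only an illustration of the CSV layout and of how empty cells are treated, not PolyglotDB's own implementation; the helper name `read_enrichment_csv` is hypothetical.

```python
import csv
import io


def read_enrichment_csv(handle):
    """Map each identifier (first column) to its non-empty properties.

    Hypothetical helper for illustration; not part of PolyglotDB.
    """
    reader = csv.reader(handle)
    header = next(reader)
    property_names = header[1:]  # every column after the first is a property
    enrichment = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        identifier, values = row[0], row[1:]
        # Empty cells are dropped, so that element never receives the property.
        enrichment[identifier] = {
            name: value
            for name, value in zip(property_names, values)
            if value != ''
        }
    return enrichment


example = io.StringIO(
    "ID_column,property_one,property_two\n"
    "first_item,first_item_value_one,first_item_value_two\n"
    "second_item,,second_item_value_two\n"
)
parsed = read_enrichment_csv(example)
print(parsed['second_item'])
```

Here `second_item` ends up with a single property, mirroring the behavior described for the empty `property_one` cell.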
|
||
.. _enrich_lexicon: | ||
|
||
Enriching the lexicon | ||
===================== | ||
|
||
.. code-block:: python

    from polyglotdb import CorpusContext

    lexicon_csv_path = '/full/path/to/lexicon/data.csv'

    # `config` is the corpus configuration created elsewhere
    with CorpusContext(config) as c:
        c.enrich_lexicon_from_csv(lexicon_csv_path)

.. note::

   The function `enrich_lexicon_from_csv` accepts an optional keyword argument `case_sensitive`, which defaults to `False`.
   Setting it to `True` will respect capitalization when looking up words.
|
||
|
||
.. _enrich_inventory: | ||
|
||
Enriching the phonological inventory | ||
==================================== | ||
|
||
The phone inventory can be enriched with arbitrary properties via: | ||
|
||
.. code-block:: python

    inventory_csv_path = '/full/path/to/inventory/data.csv'

    with CorpusContext(config) as c:
        c.enrich_inventory_from_csv(inventory_csv_path)

.. _enrich_speakers:
|
||
Enriching speaker information | ||
============================= | ||
|
||
Speaker information can be added via: | ||
|
||
.. code-block:: python

    speaker_csv_path = '/full/path/to/speaker/data.csv'

    with CorpusContext(config) as c:
        c.enrich_speakers_from_csv(speaker_csv_path)

.. _enrich_discourses:
|
||
Enriching discourse information | ||
=============================== | ||
|
||
Metadata about the discourses or sound files can be added via: | ||
|
||
.. code-block:: python

    discourse_csv_path = '/full/path/to/discourse/data.csv'

    with CorpusContext(config) as c:
        c.enrich_discourses_from_csv(discourse_csv_path)
@@ -0,0 +1,65 @@ | ||
.. _enrichment_queries: | ||
|
||
********************** | ||
Enrichment via queries | ||
********************** | ||
|
||
Queries can also set properties and create subsets of elements based on their results.
|
||
For instance, if you wanted to make word initial phones more easily queryable, you could perform the following: | ||
|
||
.. code-block:: python

    from polyglotdb import CorpusContext

    with CorpusContext(config) as c:
        q = c.query_graph(c.phone)
        # Keep only phones that start where their word starts
        q = q.filter(c.phone.begin == c.phone.word.begin)
        q.create_subset('word-initial')

Once that code completes, the subset can be used in a subsequent query:
|
||
.. code-block:: python

    with CorpusContext(config) as c:
        q = c.query_graph(c.phone)
        q = q.filter(c.phone.subset == 'word-initial')
        print(q.all())

Or instead of a subset, a property could be encoded as:
|
||
.. code-block:: python

    with CorpusContext(config) as c:
        q = c.query_graph(c.phone)
        q = q.filter(c.phone.begin == c.phone.word.begin)
        q.set_properties(position='word-initial')

This property can then be exported as a column in a CSV file:
|
||
.. code-block:: python

    some_csv_path = '/full/path/to/output.csv'  # placeholder output path

    with CorpusContext(config) as c:
        q = c.query_graph(c.phone)
        # The property lives on the phone annotation, so access it via c.phone
        q = q.columns(c.phone.position)
        q.to_csv(some_csv_path)

Lexicon queries can be used in the same way to create subsets and encode properties that do not vary on a token-by-token basis.
|
||
For instance, a subset for high vowels can be created as follows: | ||
|
||
.. code-block:: python

    with CorpusContext(config) as c:
        high_vowels = ['iy', 'ih', 'uw', 'uh']
        q = c.query_lexicon(c.lexicon_phone)
        q = q.filter(c.lexicon_phone.label.in_(high_vowels))
        q.create_subset('high_vowel')

This subset can then be used to query phone annotations:
|
||
.. code-block:: python

    with CorpusContext(config) as c:
        q = c.query_graph(c.phone)
        q = q.filter(c.phone.subset == 'high_vowel')
        print(q.all())
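
The difference between a token-level subset (word-initial phones) and a type-level subset (high vowels) can be illustrated with plain Python objects. This is only a toy model of the idea, with no database involved; the `Phone` class and the sample data are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Phone:
    """Toy stand-in for a phone token; not a PolyglotDB class."""
    label: str
    begin: float
    word_begin: float
    subsets: set = field(default_factory=set)


# Two invented words: "beat" (b iy t) and "ooze" (uw z)
phones = [
    Phone('b', 0.00, 0.00),
    Phone('iy', 0.05, 0.00),
    Phone('t', 0.20, 0.00),
    Phone('uw', 0.30, 0.30),
    Phone('z', 0.45, 0.30),
]

# Token-level: membership depends on where this particular token occurs.
for p in phones:
    if p.begin == p.word_begin:
        p.subsets.add('word-initial')

# Type-level: membership depends only on the label, so every token
# of a matching phone type belongs to the subset.
high_vowels = {'iy', 'ih', 'uw', 'uh'}
for p in phones:
    if p.label in high_vowels:
        p.subsets.add('high_vowel')

word_initial = [p.label for p in phones if 'word-initial' in p.subsets]
high = [p.label for p in phones if 'high_vowel' in p.subsets]
print(word_initial, high)
```

The `uw` token lands in both subsets, which is why the two kinds of enrichment can be combined freely in later queries.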
@@ -0,0 +1,108 @@ | ||
.. _enrichment_syllables: | ||
|
||
*********************** | ||
Creating syllable units | ||
*********************** | ||
|
||
Syllables are groupings of phones into larger units, within words. PolyglotDB enforces a strict hierarchy, with the boundaries | ||
of words aligning with syllable boundaries (i.e., syllables cannot stretch across words). | ||
|
||
At the moment, only one algorithm is supported (`maximal onset`), because its simplicity makes it largely language-agnostic.
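
As a rough sketch of the maximal onset idea: consonants between two syllabic segments are assigned to the following syllable for as long as they form a permissible onset, and the rest stay with the preceding syllable. The function below is a textbook-style illustration, not PolyglotDB's implementation; the name `syllabify` and the requirement to pass in `legal_onsets` explicitly are assumptions made for the example.

```python
def syllabify(phones, syllabics, legal_onsets):
    """Group one word's phones into syllables using maximal onset.

    Hypothetical helper: `legal_onsets` is a set of tuples of phone labels.
    """
    nuclei = [i for i, p in enumerate(phones) if p in syllabics]
    if not nuclei:
        return [list(phones)]  # no syllabic segment: keep the word whole
    starts = [0]  # index at which each syllable begins
    for prev, nxt in zip(nuclei, nuclei[1:]):
        cluster = phones[prev + 1:nxt]
        # Try the longest candidate onset first; an empty onset always works.
        split = len(cluster)
        for onset_start in range(len(cluster) + 1):
            candidate = tuple(cluster[onset_start:])
            if onset_start == len(cluster) or candidate in legal_onsets:
                split = onset_start
                break
        starts.append(prev + 1 + split)
    ends = starts[1:] + [len(phones)]
    return [list(phones[s:e]) for s, e in zip(starts, ends)]


syllabics = {'aa', 'ae', 'ih'}
word = ['ae', 's', 't', 'r', 'ih', 'd']
with_str = syllabify(word, syllabics, {('r',), ('t', 'r'), ('s', 't', 'r')})
without_str = syllabify(word, syllabics, {('r',), ('t', 'r')})
print(with_str)
print(without_str)
```

When `s t r` is a permissible onset the whole cluster joins the second syllable; when it is not, `s` stays behind as a coda and only `t r` moves forward.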
|
||
To encode syllables, there are two steps: | ||
|
||
1. :ref:`encoding_syllabics` | ||
2. :ref:`encoding_syllables` | ||
|
||
|
||
.. _encoding_syllabics: | ||
|
||
Encoding syllabic segments | ||
========================== | ||
|
||
Syllabic segments are encoded via a specialized function:
|
||
|
||
|
||
.. code-block:: python

    from polyglotdb import CorpusContext

    syllabic_segments = ['aa', 'ae', 'ih']

    with CorpusContext(config) as c:
        c.encode_syllabic_segments(syllabic_segments)

Following this code, all phones with the labels `aa`, `ae`, or `ih` will belong to the subset `syllabic`. This subset can
then be queried directly, and it is also what allows syllables to be encoded.
|
||
.. _encoding_syllables: | ||
|
||
Encoding syllables | ||
================== | ||
|
||
.. code-block:: python

    with CorpusContext(config) as c:
        c.encode_syllables()

.. note::

   The function `encode_syllables` can be given a keyword argument for `call_back`, which is a function like `print` that
   allows for progress to be output to the console.
|
||
Following encoding, syllables are available to be queried and used like any other linguistic unit. For example, to get a list of
all the instances of syllables at the beginnings of words:
|
||
|
||
.. code-block:: python

    with CorpusContext(config) as c:
        q = c.query_graph(c.syllable).filter(c.syllable.begin == c.syllable.word.begin)
        print(q.all())

.. _stress_tone:
|
||
Encoding syllable properties from syllabics | ||
=========================================== | ||
|
||
Corpora often carry information about syllables on the vowels. For instance, if the transcription contains
stress levels, they will be specified as numbers 0-2 on the vowels (as in Arpabet). Tone is encoded similarly
in some transcription systems. This section details functions that strip this information from the vowel and place it on
the syllable unit instead.
|
||
.. note::

   Removing the stress/tone information from the vowel makes queries easier, as getting all `AA` tokens no longer requires
   specifying that the label is in the set of `AA1, AA2, AA0`. This functionality can be disabled by specifying `clean_phone_label=False`
   in the two functions that follow.
|
||
.. _stress_enrichment: | ||
|
||
Encoding stress | ||
--------------- | ||
|
||
.. code-block:: python

    with CorpusContext(config) as c:
        c.encode_stress_to_syllables()

.. note::

   By default, stress is taken to be numbers in the vowel label (e.g., `AA1` would have a stress of `1`). A different
   pattern to use for stress information can be specified through the optional `regex` keyword argument.
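
To make the label handling concrete, here is a minimal sketch of separating a stress digit from a vowel label with Python's `re` module. The pattern `[0-9]` and the helper `split_stress` are assumptions chosen for illustration; the pattern PolyglotDB actually uses by default may differ.

```python
import re

# Assumed pattern for illustration: a single digit anywhere in the label.
STRESS_PATTERN = re.compile(r'[0-9]')


def split_stress(label, pattern=STRESS_PATTERN):
    """Return (cleaned_label, stress), where stress is None if absent.

    Hypothetical helper; not part of PolyglotDB.
    """
    match = pattern.search(label)
    if match is None:
        return label, None
    # Remove the matched stress marker and keep the rest of the label.
    cleaned = label[:match.start()] + label[match.end():]
    return cleaned, match.group(0)


print(split_stress('AA1'))
print(split_stress('T'))
```

With labels cleaned this way, `AA1`, `AA2`, and `AA0` all collapse to `AA`, while the stress value is free to be stored on the syllable.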
|
||
|
||
.. _tone_enrichment: | ||
|
||
Encoding tone | ||
------------- | ||
|
||
.. code-block:: python

    with CorpusContext(config) as c:
        c.encode_tone_to_syllables()

.. note::

   As for stress, a different regex can be specified with the `regex` keyword argument.