DOCS: Reogranization WIP
mmcauliffe committed Aug 6, 2017
1 parent a928860 commit ea8aea1
Showing 33 changed files with 1,016 additions and 455 deletions.
2 changes: 1 addition & 1 deletion bin/pgdb.py
@@ -50,7 +50,7 @@ def save_config(c):

TEMP_DIR = os.path.join(CONFIG_DIR, 'downloads')

NEO4J_VERSION = '3.0.7'
NEO4J_VERSION = '3.2.3'

INFLUXDB_VERSION = '1.1.0'

16 changes: 16 additions & 0 deletions docs/source/acoustics.rst
@@ -0,0 +1,16 @@
.. _acoustics:

*****************
Acoustic measures
*****************

TODO blurb

Contents:

.. toctree::
   :maxdepth: 2

   acoustics_encoding.rst
   acoustics_querying.rst
   acoustics_backend.rst
5 changes: 5 additions & 0 deletions docs/source/acoustics_backend.rst
@@ -0,0 +1,5 @@
.. _acoustics_backend:

****************
Acoustic backend
****************
6 changes: 6 additions & 0 deletions docs/source/acoustics_encoding.rst
@@ -0,0 +1,6 @@
.. _acoustics_encoding:

**************************
Encoding acoustic measures
**************************

6 changes: 6 additions & 0 deletions docs/source/acoustics_querying.rst
@@ -0,0 +1,6 @@
.. _acoustics_querying:

**************************
Querying acoustic measures
**************************

10 changes: 5 additions & 5 deletions docs/source/api_graph.rst
@@ -8,7 +8,7 @@ Graph API

Queries
-------
.. currentmodule:: polyglotdb.graph.query
.. currentmodule:: polyglotdb.query.annotations.query

.. autosummary::
   :toctree: generated/
@@ -20,20 +20,20 @@ Queries

Attributes
----------
.. currentmodule:: polyglotdb.graph.attributes
.. currentmodule:: polyglotdb.query.annotations.attributes.base

.. autosummary::
   :toctree: generated/
   :template: class.rst

   Attribute
   AnnotationNode
   AnnotationAttribute

.. _graph_clauses_api:

Clause elements
---------------
.. currentmodule:: polyglotdb.graph.elements
.. currentmodule:: polyglotdb.query.annotations.elements

.. autosummary::
   :toctree: generated/
@@ -54,7 +54,7 @@ Clause elements

Aggregate functions
-------------------
.. currentmodule:: polyglotdb.graph.func
.. currentmodule:: polyglotdb.query.base.func

.. autosummary::
   :toctree: generated/
19 changes: 19 additions & 0 deletions docs/source/enrichment.rst
@@ -0,0 +1,19 @@
.. _enrichment:

**********
Enrichment
**********

Following import, a corpus is often fairly bare, with just word and phone annotations. An important step in analyzing
a corpus is therefore enriching it with other information. Most of the enrichment methods described here run
automatically once the relevant function is called.


Contents:

.. toctree::
   :maxdepth: 2

   enrichment_syllables.rst
   enrichment_utterances.rst
   enrichment_csvs.rst
   enrichment_queries.rst
76 changes: 76 additions & 0 deletions docs/source/enrichment_csvs.rst
@@ -0,0 +1,76 @@
.. _enrichment_csvs:

************************
Enrichment via CSV files
************************

PolyglotDB can add arbitrary information to annotations, or metadata about speakers and files, from a local CSV file.
When constructing this CSV file, the first column should contain the label used to identify which element should be
enriched, and all subsequent columns are used as properties to add to the corpus.

::

   ID_column,property_one,property_two
   first_item,first_item_value_one,first_item_value_two
   second_item,,second_item_value_two

Enriching using this file would look up elements based on the `ID_column`, and the one matching `first_item` would get
both `property_one` and `property_two` (with the respective values). The one matching `second_item` would only get
`property_two` (because the value for `property_one` is empty).
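As an illustration of the file format only, the following sketch parses the example above with Python's standard
`csv` module and shows which properties each row would contribute. The helper name `properties_by_id` is hypothetical
and is not part of PolyglotDB; it assumes that empty cells are simply skipped, as described above.

.. code-block:: python

   import csv
   import io

   EXAMPLE_CSV = """ID_column,property_one,property_two
   first_item,first_item_value_one,first_item_value_two
   second_item,,second_item_value_two
   """.replace('\n   ', '\n')  # undo the indentation of this documentation block


   def properties_by_id(csv_text):
       """Map each ID (first column) to its non-empty properties.

       Hypothetical helper for illustration; not a PolyglotDB function.
       """
       reader = csv.reader(io.StringIO(csv_text))
       header = next(reader)
       property_names = header[1:]  # every column after the ID column is a property
       result = {}
       for row in reader:
           if not row:
               continue  # ignore blank lines
           identifier, values = row[0], row[1:]
           # Keep only the cells that actually contain a value
           result[identifier] = {name: value
                                 for name, value in zip(property_names, values)
                                 if value != ''}
       return result


   props = properties_by_id(EXAMPLE_CSV)
   print(props['second_item'])

Here `second_item` ends up with a single property because its `property_one` cell is empty.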

.. _enrich_lexicon:

Enriching the lexicon
=====================

.. code-block:: python

   lexicon_csv_path = '/full/path/to/lexicon/data.csv'
   with CorpusContext(config) as c:
       c.enrich_lexicon_from_csv(lexicon_csv_path)

.. note::

   The function `enrich_lexicon_from_csv` accepts an optional keyword argument `case_sensitive`, which defaults to
   `False`. Setting it to `True` will respect capitalization when looking up words.


.. _enrich_inventory:

Enriching the phonological inventory
====================================

The phone inventory can be enriched with arbitrary properties via:

.. code-block:: python

   inventory_csv_path = '/full/path/to/inventory/data.csv'
   with CorpusContext(config) as c:
       c.enrich_inventory_from_csv(inventory_csv_path)

.. _enrich_speakers:

Enriching speaker information
=============================

Speaker information can be added via:

.. code-block:: python

   speaker_csv_path = '/full/path/to/speaker/data.csv'
   with CorpusContext(config) as c:
       c.enrich_speakers_from_csv(speaker_csv_path)

.. _enrich_discourses:

Enriching discourse information
===============================

Metadata about the discourses or sound files can be added via:

.. code-block:: python

   discourse_csv_path = '/full/path/to/discourse/data.csv'
   with CorpusContext(config) as c:
       c.enrich_discourses_from_csv(discourse_csv_path)
65 changes: 65 additions & 0 deletions docs/source/enrichment_queries.rst
@@ -0,0 +1,65 @@
.. _enrichment_queries:

**********************
Enrichment via queries
**********************

Queries can also be used to set properties on, and create subsets of, the elements they match.

For instance, if you wanted to make word-initial phones more easily queryable, you could perform the following:

.. code-block:: python

   with CorpusContext(config) as c:
       q = c.query_graph(c.phone)
       q = q.filter(c.phone.begin == c.phone.word.begin)
       q.create_subset('word-initial')

Once that code completes, the subset can be used in a subsequent query:

.. code-block:: python

   with CorpusContext(config) as c:
       q = c.query_graph(c.phone)
       q = q.filter(c.phone.subset == 'word-initial')
       print(q.all())

Or, instead of a subset, a property could be encoded as:

.. code-block:: python

   with CorpusContext(config) as c:
       q = c.query_graph(c.phone)
       q = q.filter(c.phone.begin == c.phone.word.begin)
       q.set_properties(position='word-initial')

This property can then be exported as a column in a CSV file:

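.. code-block:: python

   some_csv_path = '/full/path/to/output.csv'  # placeholder output path
   with CorpusContext(config) as c:
       q = c.query_graph(c.phone)
       # The property was set on phones, so it is accessed through c.phone
       q = q.columns(c.phone.position)
       q.to_csv(some_csv_path)

Lexicon queries can be used in the same way to create subsets and encode properties that do not vary on a
token-by-token basis.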

For instance, a subset for high vowels can be created as follows:

.. code-block:: python

   with CorpusContext(config) as c:
       high_vowels = ['iy', 'ih', 'uw', 'uh']
       q = c.query_lexicon(c.lexicon_phone)
       q = q.filter(c.lexicon_phone.label.in_(high_vowels))
       q.create_subset('high_vowel')

This subset can then be used to query phone annotations:

.. code-block:: python

   with CorpusContext(config) as c:
       q = c.query_graph(c.phone)
       q = q.filter(c.phone.subset == 'high_vowel')
       print(q.all())
108 changes: 108 additions & 0 deletions docs/source/enrichment_syllables.rst
@@ -0,0 +1,108 @@
.. _enrichment_syllables:

***********************
Creating syllable units
***********************

Syllables are groupings of phones into larger units within words. PolyglotDB enforces a strict hierarchy, in which
word boundaries always align with syllable boundaries (i.e., syllables cannot stretch across words).

At the moment, only one syllabification algorithm is supported (`maximal onset`), because its simplicity makes it
largely language-agnostic.
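To make the idea concrete, here is a minimal, self-contained sketch of maximal-onset syllabification for a single
word. It is an illustration under stated assumptions, not PolyglotDB's implementation: the function name
`maximal_onset_syllabify` and the explicit `legal_onsets` argument are hypothetical, and the phone labels are made up
for the example.

.. code-block:: python

   def maximal_onset_syllabify(phones, syllabic, legal_onsets):
       """Split one word's phones into syllables using the maximal onset principle.

       phones: list of phone labels for a single word
       syllabic: set of labels that can act as syllable nuclei
       legal_onsets: set of tuples of labels that may begin a syllable
       (Hypothetical helper for illustration only.)
       """
       nuclei = [i for i, p in enumerate(phones) if p in syllabic]
       if not nuclei:
           # No nucleus: treat the whole word as one unit
           return [list(phones)] if phones else []

       boundaries = [0]  # word-initial consonants always join the first syllable
       for prev, nxt in zip(nuclei, nuclei[1:]):
           split = nxt  # default: the next syllable has an empty onset
           # Try the longest candidate onset first, then progressively shorter ones
           for start in range(prev + 1, nxt):
               if tuple(phones[start:nxt]) in legal_onsets:
                   split = start
                   break
           boundaries.append(split)
       boundaries.append(len(phones))  # word-final consonants join the last syllable

       return [list(phones[b:e]) for b, e in zip(boundaries, boundaries[1:])]


   word = ['ae', 'k', 't', 'r', 'ih', 's']
   onsets = {('k',), ('t',), ('r',), ('s',), ('t', 'r')}
   syllables = maximal_onset_syllabify(word, {'aa', 'ae', 'ih'}, onsets)
   print(syllables)

In this example the cluster `k t r` is split so that the second syllable receives `t r`, the longest legal onset,
leaving `k` as the coda of the first syllable.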

To encode syllables, there are two steps:

1. :ref:`encoding_syllabics`
2. :ref:`encoding_syllables`


.. _encoding_syllabics:

Encoding syllabic segments
==========================

Syllabic segments are encoded via a specialized function:



.. code-block:: python

   syllabic_segments = ['aa', 'ae', 'ih']
   with CorpusContext(config) as c:
       c.encode_syllabic_segments(syllabic_segments)

After this code runs, all phones with the labels `aa`, `ae`, or `ih` belong to the subset `syllabic`. This subset can
then be queried in the future, and it is also what allows syllables to be encoded.

.. _encoding_syllables:

Encoding syllables
==================

.. code-block:: python

   with CorpusContext(config) as c:
       c.encode_syllables()

.. note::

   The function `encode_syllables` accepts a `call_back` keyword argument: a function such as `print` that is used
   to report progress to the console.

Following encoding, syllables are available to be queried and used like any other linguistic unit. For example, to get
a list of all the instances of syllables at the beginnings of words:


.. code-block:: python

   with CorpusContext(config) as c:
       q = c.query_graph(c.syllable).filter(c.syllable.begin == c.syllable.word.begin)
       print(q.all())

.. _stress_tone:

Encoding syllable properties from syllabics
===========================================

Corpora often carry information about syllables on the vowels. For instance, if the transcription contains
stress levels, they are typically specified as numbers 0-2 on the vowels (as in Arpabet). Tone is encoded similarly
in some transcription systems. This section details functions that strip this information from the vowel and place it on
the syllable unit instead.
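The label-splitting step can be pictured with the small sketch below. It assumes Arpabet-style labels in which a
trailing digit from 0 to 2 marks stress; the pattern and the helper name `split_stress` are illustrative assumptions
and may differ from what PolyglotDB uses internally.

.. code-block:: python

   import re

   # Assumed pattern: any label followed by a single trailing stress digit (0-2)
   STRESS_PATTERN = re.compile(r'^(?P<label>.*?)(?P<stress>[0-2])$')


   def split_stress(phone_label):
       """Return (cleaned_label, stress), with stress None if no digit is present.

       Hypothetical helper for illustration; not a PolyglotDB function.
       """
       match = STRESS_PATTERN.match(phone_label)
       if match is None:
           return phone_label, None
       return match.group('label'), match.group('stress')


   print(split_stress('AA1'))
   print(split_stress('K'))

The cleaned label would stay on the phone, while the stress value would be stored on the syllable containing it.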

.. note::

   Removing the stress/tone information from the vowel makes queries easier, as getting all `AA` tokens no longer requires
   specifying that the label is in the set of `AA1, AA2, AA0`. This functionality can be disabled by specifying
   `clean_phone_label=False` in the two functions that follow.

.. _stress_enrichment:

Encoding stress
---------------

.. code-block:: python

   with CorpusContext(config) as c:
       c.encode_stress_to_syllables()

.. note::

   By default, stress is taken to be numbers in the vowel label (i.e., `AA1` would have a stress of `1`). A different
   pattern to use for stress information can be specified through the optional `regex` keyword argument.


.. _tone_enrichment:

Encoding tone
-------------

.. code-block:: python

   with CorpusContext(config) as c:
       c.encode_tone_to_syllables()

.. note::

   As for stress, a different regex can be specified with the `regex` keyword argument.