DOCS: Reorganization WIP

MontrealCorpusTools · Aug 6, 2017 · ea8aea1
1 parent a928860
commit ea8aea1
Showing 33 changed files with 1,016 additions and 455 deletions.
diff --git a/bin/pgdb.py b/bin/pgdb.py
@@ -50,7 +50,7 @@ def save_config(c):
 
 TEMP_DIR = os.path.join(CONFIG_DIR, 'downloads')
 
-NEO4J_VERSION = '3.0.7'
+NEO4J_VERSION = '3.2.3'
 
 INFLUXDB_VERSION = '1.1.0'
 

diff --git a/docs/source/acoustics.rst b/docs/source/acoustics.rst
@@ -0,0 +1,16 @@
+.. _acoustics:
+
+*****************
+Acoustic measures
+*****************
+
+TODO blurb
+
+Contents:
+
+.. toctree::
+   :maxdepth: 2
+
+   acoustics_encoding.rst
+   acoustics_querying.rst
+   acoustics_backend.rst
diff --git a/docs/source/acoustics_backend.rst b/docs/source/acoustics_backend.rst
@@ -0,0 +1,5 @@
+.. _acoustics_backend:
+
+****************
+Acoustic backend
+****************
diff --git a/docs/source/acoustics_encoding.rst b/docs/source/acoustics_encoding.rst
@@ -0,0 +1,6 @@
+.. _acoustics_encoding:
+
+**************************
+Encoding acoustic measures
+**************************
+
diff --git a/docs/source/acoustics_querying.rst b/docs/source/acoustics_querying.rst
@@ -0,0 +1,6 @@
+.. _acoustics_querying:
+
+**************************
+Querying acoustic measures
+**************************
+
diff --git a/docs/source/api_graph.rst b/docs/source/api_graph.rst
@@ -8,7 +8,7 @@ Graph API
 
 Queries
 -------
-.. currentmodule:: polyglotdb.graph.query
+.. currentmodule:: polyglotdb.query.annotations.query
 
 .. autosummary::
    :toctree: generated/
@@ -20,20 +20,20 @@ Queries
 
 Attributes
 ----------
-.. currentmodule:: polyglotdb.graph.attributes
+.. currentmodule:: polyglotdb.query.annotations.attributes.base
 
 .. autosummary::
    :toctree: generated/
    :template: class.rst
 
-   Attribute
+   AnnotationNode
    AnnotationAttribute
 
 .. _graph_clauses_api:
 
 Clause elements
 ---------------
-.. currentmodule:: polyglotdb.graph.elements
+.. currentmodule:: polyglotdb.query.annotations.elements
 
 .. autosummary::
    :toctree: generated/
@@ -54,7 +54,7 @@ Clause elements
 
 Aggregate functions
 -------------------
-.. currentmodule:: polyglotdb.graph.func
+.. currentmodule:: polyglotdb.query.base.func
 
 .. autosummary::
    :toctree: generated/

diff --git a/docs/source/enrichment.rst b/docs/source/enrichment.rst
@@ -0,0 +1,19 @@
+.. _enrichment:
+
+**********
+Enrichment
+**********
+
+Following import, a corpus is often fairly bare, with just word and phone annotations.  An important step in analyzing
+a corpus is therefore enriching it with other information.  Most of the methods here run automatically once a function is called.
+
+
+Contents:
+
+.. toctree::
+   :maxdepth: 2
+
+   enrichment_syllables.rst
+   enrichment_utterances.rst
+   enrichment_csvs.rst
+   enrichment_queries.rst
diff --git a/docs/source/enrichment_csvs.rst b/docs/source/enrichment_csvs.rst
@@ -0,0 +1,76 @@
+.. _enrichment_csvs:
+
+************************
+Enrichment via CSV files
+************************
+
+PolyglotDB supports adding arbitrary information to annotations, as well as metadata about speakers and files, from
+a local CSV file.  When constructing this CSV file, the first column should be the label used to
+identify which element should be enriched, and all subsequent columns are used as properties to add to the corpus.
+
+::
+
+   ID_column,property_one,property_two
+   first_item,first_item_value_one,first_item_value_two
+   second_item,,second_item_value_two
+
+Enriching using this file would look up elements based on the `ID_column`, and the one matching `first_item` would get
+both `property_one` and `property_two` (with the respective values).  The one matching `second_item` would only get a
+`property_two` (because the value for `property_one` is empty).
+
+.. _enrich_lexicon:
+
+Enriching the lexicon
+=====================
+
+.. code-block:: python
+
+   lexicon_csv_path = '/full/path/to/lexicon/data.csv'
+   with CorpusContext(config) as c:
+       c.enrich_lexicon_from_csv(lexicon_csv_path)
+
+
+.. note::
+
+   The function `enrich_lexicon_from_csv` accepts an optional keyword argument `case_sensitive`, which defaults to `False`.  Setting it
+   to `True` will respect capitalization when looking up words.
+
+
+.. _enrich_inventory:
+
+Enriching the phonological inventory
+====================================
+
+The phone inventory can be enriched with arbitrary properties via:
+
+.. code-block:: python
+
+   inventory_csv_path = '/full/path/to/inventory/data.csv'
+   with CorpusContext(config) as c:
+       c.enrich_inventory_from_csv(inventory_csv_path)
+
+.. _enrich_speakers:
+
+Enriching speaker information
+=============================
+
+Speaker information can be added via:
+
+.. code-block:: python
+
+   speaker_csv_path = '/full/path/to/speaker/data.csv'
+   with CorpusContext(config) as c:
+       c.enrich_speakers_from_csv(speaker_csv_path)
+
+.. _enrich_discourses:
+
+Enriching discourse information
+===============================
+
+Metadata about the discourses or sound files can be added via:
+
+.. code-block:: python
+
+   discourse_csv_path = '/full/path/to/discourse/data.csv'
+   with CorpusContext(config) as c:
+       c.enrich_discourses_from_csv(discourse_csv_path)
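Comment on docs/source/enrichment_csvs.rst: the lookup behaviour described for the sample CSV (first column as the identifying label, empty cells adding no property) can be illustrated without PolyglotDB. The following is a minimal sketch using only the standard `csv` module; `parse_enrichment_csv` is a hypothetical helper written for illustration, not part of PolyglotDB, and the real `enrich_*_from_csv` functions may treat value types and edge cases differently.

```python
import csv
import io

# The sample file from the documentation, held in memory for illustration.
SAMPLE = """ID_column,property_one,property_two
first_item,first_item_value_one,first_item_value_two
second_item,,second_item_value_two
"""


def parse_enrichment_csv(handle):
    """Map each first-column label to its non-empty properties.

    Hypothetical helper: it only mirrors the behaviour described in the
    documentation and is not PolyglotDB's implementation.
    """
    reader = csv.reader(handle)
    header = next(reader)
    property_names = header[1:]  # every column after the first is a property
    data = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        label, values = row[0], row[1:]
        # Empty cells are dropped, so the element gets no such property.
        data[label] = {
            name: value
            for name, value in zip(property_names, values)
            if value != ''
        }
    return data


parsed = parse_enrichment_csv(io.StringIO(SAMPLE))
print(parsed)
```

Here `second_item` ends up with only `property_two`, matching the description in the added documentation.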
diff --git a/docs/source/enrichment_queries.rst b/docs/source/enrichment_queries.rst
@@ -0,0 +1,65 @@
+.. _enrichment_queries:
+
+**********************
+Enrichment via queries
+**********************
+
+Queries can set properties and create subsets of elements based on their results.
+
+For instance, if you wanted to make word-initial phones more easily queryable, you could perform the following:
+
+.. code-block:: python
+
+   with CorpusContext(config) as c:
+       q = c.query_graph(c.phone)
+       q = q.filter(c.phone.begin == c.phone.word.begin)
+       q.create_subset('word-initial')
+
+Once that code completes, the subset can be used in a subsequent query:
+
+.. code-block:: python
+
+   with CorpusContext(config) as c:
+       q = c.query_graph(c.phone)
+       q = q.filter(c.phone.subset == 'word-initial')
+       print(q.all())
+
+Alternatively, instead of a subset, a property can be encoded:
+
+.. code-block:: python
+
+   with CorpusContext(config) as c:
+       q = c.query_graph(c.phone)
+       q = q.filter(c.phone.begin == c.phone.word.begin)
+       q.set_properties(position='word-initial')
+
+This property can then be exported as a column in a CSV file:
+
+.. code-block:: python
+
+   with CorpusContext(config) as c:
+       q = c.query_graph(c.phone)
+       q = q.columns(c.phone.position)
+       q.to_csv(some_csv_path)
+
+
+Lexicon queries can be used in the same way to create subsets and encode properties that do not vary on a token-by-token basis.
+
+For instance, a subset for high vowels can be created as follows:
+
+.. code-block:: python
+
+   with CorpusContext(config) as c:
+       high_vowels = ['iy', 'ih', 'uw', 'uh']
+       q = c.query_lexicon(c.lexicon_phone)
+       q = q.filter(c.lexicon_phone.label.in_(high_vowels))
+       q.create_subset('high_vowel')
+
+Which can then be used to query phone annotations:
+
+.. code-block:: python
+
+   with CorpusContext(config) as c:
+       q = c.query_graph(c.phone)
+       q = q.filter(c.phone.subset == 'high_vowel')
+       print(q.all())
diff --git a/docs/source/enrichment_syllables.rst b/docs/source/enrichment_syllables.rst
@@ -0,0 +1,108 @@
+.. _enrichment_syllables:
+
+***********************
+Creating syllable units
+***********************
+
+Syllables are groupings of phones into larger units, within words. PolyglotDB enforces a strict hierarchy, with the boundaries
+of words aligning with syllable boundaries (i.e., syllables cannot stretch across words).
+
+At the moment, only one algorithm is supported (`maximal onset`), because its simplicity makes it largely language-agnostic.
+
+To encode syllables, there are two steps:
+
+1. :ref:`encoding_syllabics`
+2. :ref:`encoding_syllables`
+
+
+.. _encoding_syllabics:
+
+Encoding syllabic segments
+==========================
+
+Syllabic segments are encoded via a specialized function:
+
+
+
+.. code-block:: python
+
+   syllabic_segments = ['aa', 'ae', 'ih']
+   with CorpusContext(config) as c:
+        c.encode_syllabic_segments(syllabic_segments)
+
+
+After this code runs, all phones with labels of `aa, ae, ih` will belong to the subset `syllabic`.  This subset can
+then be queried like any other, and it is also what allows syllables to be encoded.
+
+.. _encoding_syllables:
+
+Encoding syllables
+==================
+
+.. code-block:: python
+
+   with CorpusContext(config) as c:
+        c.encode_syllables()
+
+.. note::
+
+   The function `encode_syllables` can be given a keyword argument for `call_back`, which is a function like `print` that
+   allows for progress to be output to the console.
+
+Following encoding, syllables are available to be queried and used like any other linguistic unit. For example, to get a list of
+all the instances of syllables at the beginnings of words:
+
+
+.. code-block:: python
+
+   with CorpusContext(config) as c:
+        q = c.query_graph(c.syllable).filter(c.syllable.begin == c.syllable.word.begin)
+        print(q.all())
+
+.. _stress_tone:
+
+Encoding syllable properties from syllabics
+===========================================
+
+Corpora often carry information about syllables on the vowels.  For instance, if the transcription contains
+stress levels, they are typically specified as numbers 0-2 on the vowels (as in Arpabet).  Tone is encoded similarly
+in some transcription systems.  This section details functions that strip this information from the vowel and place it on
+the syllable unit instead.
+
+.. note::
+
+   Removing the stress/tone information from the vowel makes queries easier, as getting all `AA` tokens no longer requires
+   specifying that the label is in the set of `AA1, AA2, AA0`.  This functionality can be disabled by specifying `clean_phone_label=False`
+   in the two functions that follow.
+
+.. _stress_enrichment:
+
+Encoding stress
+---------------
+
+.. code-block:: python
+
+   with CorpusContext(config) as c:
+
+        c.encode_stress_to_syllables()
+
+.. note::
+
+   By default, stress is taken to be numbers in the vowel label (e.g., `AA1` would have a stress of `1`).  A different
+   pattern to use for stress information can be specified through the optional `regex` keyword argument.
+
+
+.. _tone_enrichment:
+
+Encoding tone
+-------------
+
+.. code-block:: python
+
+   with CorpusContext(config) as c:
+
+        c.encode_tone_to_syllables()
+
+.. note::
+
+   As with stress, a different regex can be specified with the `regex` keyword argument.
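
Comment on docs/source/enrichment_syllables.rst: the stress note describes taking the numbers in a vowel label as the stress value and cleaning the label. A rough sketch of that idea with the standard `re` module follows. The pattern and the `split_stress` helper are assumptions made for illustration only; they are not PolyglotDB's actual default regex or implementation.

```python
import re

# Assumed pattern: one or more digits at the end of the label (Arpabet-style,
# e.g. AA1). PolyglotDB's real default pattern may differ.
STRESS_PATTERN = re.compile(r'[0-9]+$')


def split_stress(label, pattern=STRESS_PATTERN):
    """Return (cleaned_label, stress) for a phone label.

    Hypothetical helper: stress is None when the pattern does not match,
    in which case the label is returned unchanged.
    """
    match = pattern.search(label)
    if match is None:
        return label, None
    cleaned = label[:match.start()] + label[match.end():]
    return cleaned, match.group(0)


for label in ['AA1', 'IH0', 'K']:
    print(label, split_stress(label))
```

With labels cleaned this way, all `AA` tokens share one label, while the stress value is available to be stored on the syllable instead.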