# Tutorial 1: First steps

## Downloading the tutorial corpus

The tutorial corpus used here is a version of the [LibriSpeech](http://www.openslr.org/12/) test-clean subset, force-aligned with the
[Montreal Forced Aligner](https://montreal-forced-aligner.readthedocs.io/en/latest/) ([tutorial corpus download link](https://mcgill-my.sharepoint.com/:u:/g/personal/michael_haaf_mcgill_ca/EfocNOr3o7xJuCrG_-OrR3MBh_-vmQaHtkV2J7vJq61c1w?e=UEhQg7)).  Extract the files to a location on your local machine.

## Importing the tutorial corpus

We begin by importing the necessary classes and functions from polyglotdb and defining the variables used below.  Change ``corpus_root`` to the directory where you extracted the tutorial corpus on your local machine.

In [2]:
from polyglotdb import CorpusContext
import polyglotdb.io as pgio

corpus_root = '/mnt/e/Data/pg_tutorial'

The import statements bring in the classes and functions needed for importing, namely the CorpusContext class and
the polyglotdb IO module.  All interactions with the database go through CorpusContext objects.  A CorpusContext is
created as a context manager in Python (the ``with ... as ...`` pattern), so that cleanup and closing of connections are
handled automatically, both when the code completes successfully and when errors are encountered.
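
To illustrate this cleanup behaviour, here is a minimal sketch of the general Python context manager pattern.  ``DemoContext`` is a hypothetical stand-in written for this example and is not part of polyglotdb; how CorpusContext manages its connections internally is not shown here.

```python
class DemoContext:
    """Hypothetical stand-in for an object that holds an open connection."""

    def __init__(self, name):
        self.name = name
        self.is_open = False

    def __enter__(self):
        # Runs when the ``with`` block is entered.
        self.is_open = True
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs when the block is left, whether normally or via an exception.
        self.is_open = False
        return False  # Do not suppress the exception, if any.


# Normal completion: the resource is released when the block ends.
with DemoContext('demo') as c:
    assert c.is_open

# Failure inside the block: the resource is still released.
ctx = DemoContext('demo')
try:
    with ctx:
        raise RuntimeError('simulated failure')
except RuntimeError:
    pass

print(ctx.is_open)  # prints False
```

In both cases ``__exit__`` runs, which is why code inside a ``with`` block does not need its own explicit cleanup step.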

The IO module handles all import and export functionality in polyglotdb.  The principal functions that a user will encounter
are the ``inspect_X`` functions that generate parsers for corpus formats.  In the code below, the MFA parser is used because
the tutorial corpus was aligned using the MFA.  See [Importing corpora](https://polyglotdb.readthedocs.io/en/latest/import.html) for more information on the inspect functions and the parser
objects they generate for various formats.
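
Before running an import, it can help to confirm that ``corpus_root`` actually points at the extracted files.  The helper below is a hypothetical sketch that uses only the Python standard library; it is not a polyglotdb function.  It assumes the corpus consists of ``.TextGrid`` alignment files (the format MFA produces) alongside ``.wav`` audio files, possibly nested in subdirectories.

```python
from pathlib import Path


def summarize_corpus_dir(root):
    """Count TextGrid and wav files under ``root`` (hypothetical helper).

    Raises FileNotFoundError if ``root`` is not an existing directory.
    """
    root = Path(root)
    if not root.is_dir():
        raise FileNotFoundError('Not a directory: {}'.format(root))
    counts = {'textgrid': 0, 'wav': 0}
    for path in root.rglob('*'):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()  # Compare case-insensitively.
        if suffix == '.textgrid':
            counts['textgrid'] += 1
        elif suffix == '.wav':
            counts['wav'] += 1
    return counts
```

Calling ``summarize_corpus_dir(corpus_root)`` and seeing zero TextGrid files would suggest that the path is wrong or that the archive was extracted somewhere else.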


Once the proper path to the tutorial corpus is set, it can be imported via the following code:

In [17]:
parser = pgio.inspect_mfa(corpus_root)
parser.call_back = print # To show progress output

with CorpusContext('pg_tutorial') as c:
    c.load(parser, corpus_root)

loading /mnt/e/Data/pg_tutorial with <polyglotdb.io.parsers.mfa.MfaParser object at 0x7f8d740a0208>
Finding  files...
0 0
Parsing types...
0 87
Parsing types from file 1 of 87...
0
Parsing types from file 2 of 87...
1
Parsing types from file 3 of 87...
2
Parsing types from file 4 of 87...
3
Parsing types from file 5 of 87...
4
Parsing types from file 6 of 87...
5
Parsing types from file 7 of 87...
6
Parsing types from file 8 of 87...
7
Parsing types from file 9 of 87...
8
Parsing types from file 10 of 87...
9
Parsing types from file 11 of 87...
10
Parsing types from file 12 of 87...
11
Parsing types from file 13 of 87...
12
Parsing types from file 14 of 87...
13
Parsing types from file 15 of 87...
14
Parsing types from file 16 of 87...
15
Parsing types from file 17 of 87...
16
Parsing types from file 18 of 87...
17
Parsing types from file 19 of 87...
18
Parsing types from file 20 of 87...
19
Parsing types from file 21 of 87...
20
Parsing types from file 22 of 87...
21
...

Importing data for speaker 28 of 40 (6829)...
Importing data for speaker 29 of 40 (6930)...
Importing data for speaker 30 of 40 (7021)...
Importing data for speaker 31 of 40 (7127)...
Importing data for speaker 32 of 40 (7176)...
Importing data for speaker 33 of 40 (7729)...
Importing data for speaker 34 of 40 (8224)...
Importing data for speaker 35 of 40 (8230)...
Importing data for speaker 36 of 40 (8455)...
Importing data for speaker 37 of 40 (8463)...
Importing data for speaker 38 of 40 (8555)...
Importing data for speaker 39 of 40 (908)...


---
#### Important

If a ``neo4j.exceptions.ServiceUnavailable`` error is raised while the import code is running, double-check
that the pgdb database is running.  Once polyglotdb is installed, simply call ``pgdb start``, assuming ``pgdb install``
has already been called.  See [the relevant documentation](https://polyglotdb.readthedocs.io/en/latest/getting_started.html#set-up-local-database) for more information.

---
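
One rough way to tell whether the database is reachable before running the import is to test whether anything is listening on its port.  The sketch below uses only the Python standard library.  ``port_is_open`` is a hypothetical helper, and the port number 7687 mentioned afterwards is Neo4j's default Bolt port, which is an assumption here and may not match your ``pgdb`` configuration.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds.

    Hypothetical helper: it only shows that something is listening,
    not that the listener is a healthy database.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For example, ``port_is_open('localhost', 7687)`` returning ``False`` would be consistent with the database not having been started.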

### Resetting the corpus 

If an error or interruption occurs during import or at a later stage of the tutorial, the corpus can be reset to a
fresh state via the following code:

In [16]:
with CorpusContext('pg_tutorial') as c:
    c.reset()

---
#### Warning

Be careful when running this code, as it will delete all information in the corpus.  For smaller corpora such
as the one presented here, setup does not take long, but for larger corpora it can take several hours
to reimport and re-enrich the corpus.

---

## Testing some simple queries 

To ensure that data import completed successfully, we can print the list of speakers, discourses, and phone types in the corpus, via:

In [11]:
with CorpusContext('pg_tutorial') as c:
    print('Speakers:', c.speakers)
    print('Discourses:', c.discourses)
    q = c.query_lexicon(c.lexicon_phone)
    
    q = q.order_by(c.lexicon_phone.label)
    q = q.columns(c.lexicon_phone.label.column_name('phone'))
    results = q.all()
    print(results)

Speakers: ['2300', '1580', '237', '260', '1995', '2830', '2961', '3570', '2094', '1089', '1188', '121', '1221', '1284', '1320', '3575', '3729', '4077', '4446', '4507', '4970', '4992', '5105', '5142', '5639', '5683', '61', '672', '6829', '6930', '7021', '7127', '7176', '7729', '8224', '8230', '8455', '8463', '8555', '908']
Discourses: ['5696', '122617', '142345', '131720', '141083', '126133', '134493', '134500', '141084', '123286', '123288', '123440', '1826', '3979', '3980', '1836', '960', '1837', '961', '5694', '5695', '134686', '134691', '133604', '121726', '123852', '123859', '127105', '135766', '135767', '1180', '1181', '134647', '122612', '170457', '6852', '13751', '13754', '2271', '2273', '2275', '16021', '29093', '29095', '23283', '41797', '41806', '28233', '28240', '28241', '33396', '36377', '36586', '36600', '40744', '32865', '32866', '32879', '70968', '70970', '122797', '68769', '68771', '75918', '76324', '81414', '79730', '79740', '79759', '85628', '75946', '75947', '88083', 
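
As a quick sanity check on this output, the import log reports 40 speakers, so the speaker list should contain 40 unique entries.  The sketch below copies the list from the sample output rather than querying the database, and ``check_speakers`` is a hypothetical helper, not a polyglotdb function.  In practice the same check could be applied to ``c.speakers`` inside the ``with`` block.

```python
# Speaker list copied from the sample output above.
speakers = ['2300', '1580', '237', '260', '1995', '2830', '2961', '3570',
            '2094', '1089', '1188', '121', '1221', '1284', '1320', '3575',
            '3729', '4077', '4446', '4507', '4970', '4992', '5105', '5142',
            '5639', '5683', '61', '672', '6829', '6930', '7021', '7127',
            '7176', '7729', '8224', '8230', '8455', '8463', '8555', '908']


def check_speakers(speakers, expected_count):
    """Return a list of problem descriptions; an empty list means no issues."""
    problems = []
    if len(speakers) != expected_count:
        problems.append('expected {} speakers, found {}'.format(
            expected_count, len(speakers)))
    duplicates = sorted({s for s in speakers if speakers.count(s) > 1})
    if duplicates:
        problems.append('duplicate speakers: {}'.format(duplicates))
    return problems


print(check_speakers(speakers, 40))  # prints []
```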

A more interesting summary query is perhaps looking at the count and average duration of different phone types across the corpus, via:

In [7]:
from polyglotdb.query.base.func import Count, Average

with CorpusContext('pg_tutorial') as c:
    # Optional: Use order_by to enforce ordering on the output for easier comparison with the sample output.
    q = c.query_graph(c.phone).order_by(c.phone.label).group_by(c.phone.label.column_name('phone'))
    results = q.aggregate(Count().column_name('count'), Average(c.phone.duration).column_name('average_duration'))
    for r in results:
        print('The phone {} had {} occurrences and an average duration of {}.'.format(r['phone'], r['count'], r['average_duration']))

The phone <SIL> had 166 occurrences and an average duration of 0.017469503012048297.
The phone AA0 had 1 occurrences and an average duration of 0.08999999999999986.
The phone AA1 had 136 occurrences and an average duration of 0.1100735294117647.
The phone AA2 had 4 occurrences and an average duration of 0.06749999999999995.
The phone AE0 had 1 occurrences and an average duration of 0.10000000000000009.
The phone AE1 had 200 occurrences and an average duration of 0.10715000000000004.
The phone AE2 had 9 occurrences and an average duration of 0.10333333333333337.
The phone AH0 had 660 occurrences and an average duration of 0.05842424242424243.
The phone AH1 had 207 occurrences and an average duration of 0.07285024154589363.
The phone AH2 had 3 occurrences and an average duration of 0.04666666666666669.
The phone AO0 had 3 occurrences and an average duration of 0.07333333333333325.
The phone AO1 had 124 occurrences and an average duration of 0.11306451612903229.
The phone AO2 had 3 occurr