email: betterment Labs at gmail for an invite to the Slack discussion group
- Create tools that make it easy to generate Expanded Corpora
- Create tools that make it easy to generate Phoneme Corpora
Corpus (plural: corpora): a collection of written or spoken material in machine-readable form, assembled for the purpose of studying linguistic structures, frequencies, etc.
Expanded Corpora:
- Word (written or verbal* )
- Word occurrence frequency
- Word broken down into phonemes (component sounds), secondary priority
- Use indices [?] (where the word was used, à la Strong's Concordance numbers)
- Words occurring before & after word:
- two words before, two words after (and frequency of occurrences)
- count each end-of-sentence punctuation mark as a word (e.g. period "."; exclamation point "!"; question mark "?"; or others, depending on language)
*verbal only will be added later
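
The Expanded Corpora fields listed above (word, frequency, use indices, and the two words on either side, with end-of-sentence punctuation counted as a word) could be sketched roughly as follows. This is only an illustration under assumptions: the name `build_expanded_corpus`, the regex tokenizer, the lowercasing, and the use of token positions as "use indices" are hypothetical choices, not decisions the project has made.

```python
import re
from collections import Counter, defaultdict

# A token is a run of letters (optionally with one internal apostrophe)
# or a single end-of-sentence mark, which is counted as a word.
# Assumption: ".", "!" and "?" are the only sentence enders.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)?|[.!?]")


def build_expanded_corpus(text, window=2):
    """Return (frequency, positions, context) for the given text.

    frequency: Counter mapping word -> number of occurrences
    positions: dict mapping word -> list of token indices where it occurs
    context:   dict mapping word -> Counter of (offset, neighbor) pairs,
               where offset ranges over -window..-1 and 1..window
    """
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    frequency = Counter(tokens)
    positions = defaultdict(list)
    context = defaultdict(Counter)

    for i, token in enumerate(tokens):
        positions[token].append(i)
        for offset in range(-window, window + 1):
            j = i + offset
            if offset != 0 and 0 <= j < len(tokens):
                context[token][(offset, tokens[j])] += 1

    return frequency, positions, context


if __name__ == "__main__":
    freq, pos, ctx = build_expanded_corpus("The cat sat. The dog sat!")
    print(freq.most_common(3))
    print(pos["the"])
    print(ctx["sat"].most_common())
```

Keeping the offset in the context key preserves the difference between "one word before" and "two words before"; collapsing it to a plain neighbor count would be a simpler alternative if that distinction turns out not to matter.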
Phoneme Corpora?
- primary spelling of phoneme
- alternative spellings
- word associations
- either as its own database or as part of Word corpora above
- associated phonemes (preceding and succeeding ... one or two in either direction?)
- phoneme frequencies?
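
One possible shape for a Phoneme Corpora record, covering the fields listed above, is sketched below. Everything here is an assumption for illustration: the `PhonemeEntry` class, the `build_phoneme_corpus` function, and especially `TOY_LEXICON`, a hand-made three-word lexicon in which each phoneme is already aligned with the letters that spell it. A real lexicon and alignment step would have to come from elsewhere, and frequencies here count lexicon entries rather than occurrences in running text.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass
class PhonemeEntry:
    """Hypothetical record for one phoneme."""
    frequency: int = 0
    spellings: Counter = field(default_factory=Counter)  # grapheme -> count
    words: set = field(default_factory=set)              # word associations
    preceding: Counter = field(default_factory=Counter)  # phoneme just before
    following: Counter = field(default_factory=Counter)  # phoneme just after

    @property
    def primary_spelling(self):
        # Most frequent spelling, or None if nothing has been recorded.
        return self.spellings.most_common(1)[0][0] if self.spellings else None

    @property
    def alternative_spellings(self):
        # Every spelling except the most frequent one.
        return [s for s, _ in self.spellings.most_common()[1:]]


def build_phoneme_corpus(aligned_lexicon):
    """aligned_lexicon maps word -> list of (phoneme, grapheme) pairs."""
    corpus = defaultdict(PhonemeEntry)
    for word, pairs in aligned_lexicon.items():
        phonemes = [p for p, _ in pairs]
        for i, (phoneme, grapheme) in enumerate(pairs):
            entry = corpus[phoneme]
            entry.frequency += 1
            entry.spellings[grapheme] += 1
            entry.words.add(word)
            if i > 0:
                entry.preceding[phonemes[i - 1]] += 1
            if i + 1 < len(phonemes):
                entry.following[phonemes[i + 1]] += 1
    return corpus


# Toy data, invented for this sketch only.
TOY_LEXICON = {
    "cat": [("k", "c"), ("æ", "a"), ("t", "t")],
    "kit": [("k", "k"), ("ɪ", "i"), ("t", "t")],
    "cot": [("k", "c"), ("ɒ", "o"), ("t", "t")],
}

if __name__ == "__main__":
    corpus = build_phoneme_corpus(TOY_LEXICON)
    k = corpus["k"]
    print(k.primary_spelling, k.alternative_spellings, sorted(k.words))
```

This sketch keeps the phoneme data as its own structure; folding it into the Word corpora instead would mean storing the `(phoneme, grapheme)` pairs on each word record and deriving these counts on demand.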
- Particular language concerns:
- non-phonetic languages
- unwritten languages
- database structure:
- Mark Davies's corpora use a SQL relational database structure
- Would a graph database be appropriate?
- how to handle accents
- how to account for conjugations/declensions/etc in corpora
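
To make the database-structure, accent, and conjugation/declension questions above more concrete, here is a minimal relational sketch using Python's built-in `sqlite3` and `unicodedata` modules. The table and column names (`word`, `context`, `surface`, `normalized`, `lemma`, `rel_pos`, `occurrences`) and the helper functions are assumptions made up for this example, and stripping accents for lookups is only one possible policy. The `lemma` column is left empty because no lemmatizer has been chosen.

```python
import sqlite3
import unicodedata

# Hypothetical schema: one row per distinct written form, plus one row per
# (word, neighbor, relative position) combination for the context counts.
SCHEMA = """
CREATE TABLE word (
    id          INTEGER PRIMARY KEY,
    surface     TEXT NOT NULL UNIQUE,      -- form as written, accents kept
    normalized  TEXT NOT NULL,             -- accent-stripped form for lookups
    lemma       TEXT,                      -- dictionary form, if known
    frequency   INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE context (
    word_id     INTEGER NOT NULL REFERENCES word(id),
    neighbor_id INTEGER NOT NULL REFERENCES word(id),
    rel_pos     INTEGER NOT NULL,          -- -2, -1, 1 or 2
    occurrences INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (word_id, neighbor_id, rel_pos)
);
"""


def strip_accents(text):
    """Remove combining marks (accents, breathings) but keep base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def add_word(conn, surface, lemma=None):
    """Insert the word if it is new, then bump its frequency by one."""
    conn.execute(
        "INSERT OR IGNORE INTO word (surface, normalized, lemma) VALUES (?, ?, ?)",
        (surface, strip_accents(surface), lemma),
    )
    conn.execute(
        "UPDATE word SET frequency = frequency + 1 WHERE surface = ?",
        (surface,),
    )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    for token in ["λόγος", "θεός", "λόγος"]:
        add_word(conn, token)
    for row in conn.execute("SELECT surface, normalized, frequency FROM word"):
        print(row)
```

The `context` table is effectively an edge list between words, which is the part of the data a graph database would store natively; whether that justifies a graph database over plain SQL is still the open question noted above.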
- Mark Davies's Corpora
- Graph Databases:
- [Google paper on generating Corpora](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36801.pdf)
- English Phonemes:
- Two passes at Expanded Corpora:
- First Pass: Letters/Words
- Second Pass: Phonemes/Graphemes
- Use New Testament Greek
- Reasons:
- Strong’s Concordance has data we can check our results against
- Source texts are easily accessible online
- Team members have experience with the language
- Its non-Roman alphabet gives us a starting point for other languages with non-Roman alphabets
- Reasons: