Expanded Corpora Generation

Email betterment Labs at gmail for an invite to the Slack discussion group.

Open Source Tool

Project Definition

Create tools that make it easy to generate Expanded Corpora
Create tools that make it easy to generate Phoneme Corpora

Definitions

Corpus (plural: corpora): a collection of written or spoken material in machine-readable form, assembled for the purpose of studying linguistic structures, frequencies, etc.

Expanded Corpora:

Word (written or verbal* )
Word occurrence frequency
Word broken down into phonemes (component sounds), secondary priority
Use indices [?] (where word was used — à la Strong's Concordance Numbers)
Words occurring before & after word:
- two words before, two words after (and frequency of occurrences)
- count each end-of-sentence punctuation mark as a word (e.g. period "."; exclamation point "!"; question mark "?"; or others, depending on language)

*Verbal-only words will be added later.
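
As a rough illustration of the Expanded Corpora fields listed above, here is a minimal Python sketch. It is not the project's implementation: the names `tokenize` and `build_expanded_corpus` are hypothetical, "use indices" are assumed to be plain token positions, and each context is assumed to be stored as an ordered tuple of up to two neighbouring tokens. Phonemes are left out.

```python
import re
from collections import Counter, defaultdict

# A word (letters/digits, with internal apostrophes) or a single
# end-of-sentence mark. Other punctuation is dropped (an assumption).
TOKEN_RE = re.compile(r"\w+(?:'\w+)*|[.!?]")


def tokenize(text):
    """Lowercase the text and split it into words and end-of-sentence marks."""
    return TOKEN_RE.findall(text.lower())


def build_expanded_corpus(text, window=2):
    """Map each token to its frequency, positions, and neighbouring contexts."""
    tokens = tokenize(text)
    entries = defaultdict(lambda: {
        "frequency": 0,       # word occurrence frequency
        "indices": [],        # token positions where the word was used
        "before": Counter(),  # contexts of up to `window` tokens before
        "after": Counter(),   # contexts of up to `window` tokens after
    })
    for i, token in enumerate(tokens):
        entry = entries[token]
        entry["frequency"] += 1
        entry["indices"].append(i)
        before = tuple(tokens[max(0, i - window):i])
        after = tuple(tokens[i + 1:i + 1 + window])
        if before:
            entry["before"][before] += 1
        if after:
            entry["after"][after] += 1
    return dict(entries)
```

For example, `build_expanded_corpus("The cat sat. The cat ran!")["cat"]` reports a frequency of 2, and the period is counted as a token in its own right.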

Phoneme Corpora?

primary spelling of phoneme
alternative spellings
word associations
- either as its own database or as part of Word corpora above
associated phonemes (preceding and succeeding ... one or two in either direction?)
phoneme frequencies?
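
One possible shape for a Phoneme Corpora entry is sketched below. Everything here is an assumption for illustration: the `PhonemeEntry` class, the `build_phoneme_corpus` function, and the idea that a pronunciation lexicon (word to phoneme list) is already available. Primary and alternative spellings are omitted because they would need a grapheme-to-phoneme alignment that this sketch does not attempt.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class PhonemeEntry:
    symbol: str
    frequency: int = 0                                   # weighted by word counts
    words: set = field(default_factory=set)              # word associations
    preceding: Counter = field(default_factory=Counter)  # phoneme just before
    following: Counter = field(default_factory=Counter)  # phoneme just after


def build_phoneme_corpus(lexicon, word_counts):
    """Aggregate per-phoneme statistics.

    lexicon:     word -> list of phoneme symbols (hypothetical input)
    word_counts: word -> frequency taken from the word corpus
    """
    entries = {}
    for word, phonemes in lexicon.items():
        count = word_counts.get(word, 0)
        for i, symbol in enumerate(phonemes):
            entry = entries.setdefault(symbol, PhonemeEntry(symbol))
            entry.frequency += count
            entry.words.add(word)
            if i > 0:
                entry.preceding[phonemes[i - 1]] += count
            if i + 1 < len(phonemes):
                entry.following[phonemes[i + 1]] += count
    return entries
```

This version looks one phoneme in either direction. Widening it to two would follow the same pattern as the word contexts.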

Concerns/Questions:

Particular language concerns:
- non-phonetic languages
- unwritten languages
database structure:
- Mark Davies's corpora use a relational (SQL) database structure
- Would a graph database be appropriate?
how to handle accents
how to account for conjugations/declensions/etc in corpora
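
To make the database-structure question more concrete, here is a small relational sketch using Python's built-in `sqlite3` module. The table names (`word`, `occurrence`) and helper functions are hypothetical and are not taken from Mark Davies's corpora. The sketch only shows that frequency and "next word" counts can be derived from a table of token positions. In a graph database, the same adjacency would presumably be stored as edges between word nodes.

```python
import sqlite3

SCHEMA = """
CREATE TABLE word (
    id    INTEGER PRIMARY KEY,
    form  TEXT NOT NULL UNIQUE
);
CREATE TABLE occurrence (
    position  INTEGER PRIMARY KEY,   -- token index in the source text
    word_id   INTEGER NOT NULL REFERENCES word(id)
);
"""


def load_tokens(conn, tokens):
    """Create the tables and store one occurrence row per token."""
    conn.executescript(SCHEMA)
    for position, form in enumerate(tokens):
        conn.execute("INSERT OR IGNORE INTO word (form) VALUES (?)", (form,))
        (word_id,) = conn.execute(
            "SELECT id FROM word WHERE form = ?", (form,)
        ).fetchone()
        conn.execute(
            "INSERT INTO occurrence (position, word_id) VALUES (?, ?)",
            (position, word_id),
        )
    conn.commit()


def frequency(conn, form):
    """Number of occurrences of a word form."""
    return conn.execute(
        """SELECT COUNT(*) FROM occurrence
           JOIN word ON word.id = occurrence.word_id
           WHERE word.form = ?""",
        (form,),
    ).fetchone()[0]


def following_words(conn, form):
    """(next word, count) pairs for a word form, most frequent first."""
    return conn.execute(
        """SELECT nxt_word.form, COUNT(*) AS n
           FROM occurrence AS cur
           JOIN word AS cur_word ON cur_word.id = cur.word_id
           JOIN occurrence AS nxt ON nxt.position = cur.position + 1
           JOIN word AS nxt_word ON nxt_word.id = nxt.word_id
           WHERE cur_word.form = ?
           GROUP BY nxt_word.form
           ORDER BY n DESC, nxt_word.form""",
        (form,),
    ).fetchall()
```

A quick way to try it is `conn = sqlite3.connect(":memory:")` followed by `load_tokens(conn, [...])` with any token list.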

Relevant Materials & Resources:

Mark Davies's Corpora
Graph Databases:
[Google paper on generating Corpora](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36801.pdf)
English Phonemes:

Proposed Approach:

Two passes at Expanded Corpora:
- First Pass: Letters/Words
- Second Pass: Phonemes/Graphemes
Use New Testament Greek
- Reasons:
  - Strong’s Concordance has data we can check our results against
  - Source texts are easily accessible online
  - Team members have experience with the language
  - Non-Roman alphabet gives a start for non-Roman alphabet languages
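
For the first pass over Greek text, one option for dealing with accents and breathing marks is to index words under an accent-folded key while keeping the original surface forms. The sketch below uses Python's standard `unicodedata` module. The function names `fold_greek` and `group_surface_forms` are hypothetical, and whether folding is the right choice for this project is still an open question.

```python
import unicodedata
from collections import defaultdict


def fold_greek(word):
    """Return a lowercase copy of `word` with combining marks removed."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    # Category "Mn" covers accents, breathings, and the iota subscript.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", stripped).lower()


def group_surface_forms(words):
    """Map each folded key to the set of original spellings seen for it."""
    groups = defaultdict(set)
    for word in words:
        groups[fold_greek(word)].add(word)
    return dict(groups)
```

Keeping the surface forms alongside the folded key means no accent information is thrown away, so results can still be compared with an accented reference such as Strong's Concordance.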
