BettermentLabs/READ-Expanded-Corpora-Generation

Expanded Corpora Generation

Email: Betterment Labs at gmail, for an invite to the Slack discussion group

Open Source Tool

by Betterment Labs

Project Definition

  1. Create tools that make Expanded Corpora easy to generate
  2. Create tools that make Phoneme Corpora easy to generate

Definitions

Corpus (plural: corpora): a collection of written or spoken material in machine-readable form, assembled for the purpose of studying linguistic structures, frequencies, etc.

Expanded Corpora:

  • Word (written or verbal*)
  • Word occurrence frequency
  • Word broken down into phonemes (component sounds), secondary priority
  • Usage indices [?] (where the word was used, à la Strong's Concordance numbers)
  • Words occurring before & after word:
    • two words before, two words after (and frequency of occurrences)
    • count each end-of-sentence punctuation mark as a word (e.g. period "."; exclamation point "!"; question mark "?"; or others, depending on the language)

*Support for verbal-only words will be added later
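
The entry described above could be sketched roughly as follows. This is only an illustration, not the project's implementation: the names `WordEntry`, `tokenize`, and `build_expanded_corpus` are hypothetical, usage indices are approximated by plain token positions (a real index in the style of Strong's numbers would more likely be a book/chapter/verse reference), and the regex tokenizer is a naive assumption that splits on apostrophes and expects precomposed (NFC) text.

```python
import re
from collections import Counter, defaultdict
from dataclasses import dataclass, field

# Assumption: these marks end a sentence. Pass a different set for other languages.
SENTENCE_END = ".!?"


@dataclass
class WordEntry:
    """Hypothetical record for one word in an Expanded Corpus."""
    frequency: int = 0
    positions: list = field(default_factory=list)     # token indices where the word occurs
    before: Counter = field(default_factory=Counter)  # (offset, token) -> count, offset < 0
    after: Counter = field(default_factory=Counter)   # (offset, token) -> count, offset > 0


def tokenize(text, sentence_end=SENTENCE_END):
    """Split text into lowercase words, keeping each end-of-sentence mark as its own token."""
    pattern = r"\w+|[" + re.escape(sentence_end) + "]"
    return [token.lower() for token in re.findall(pattern, text)]


def build_expanded_corpus(text, window=2, sentence_end=SENTENCE_END):
    """Count each token, where it occurs, and its neighbours up to `window` tokens away."""
    tokens = tokenize(text, sentence_end)
    corpus = defaultdict(WordEntry)
    for i, token in enumerate(tokens):
        entry = corpus[token]
        entry.frequency += 1
        entry.positions.append(i)
        for offset in range(1, window + 1):
            if i - offset >= 0:
                entry.before[(-offset, tokens[i - offset])] += 1
            if i + offset < len(tokens):
                entry.after[(offset, tokens[i + offset])] += 1
    return dict(corpus)


corpus = build_expanded_corpus("The cat sat. The dog sat!")
print(corpus["sat"].frequency)            # 2
print(corpus["sat"].positions)            # [2, 6]
print(corpus["sat"].before[(-2, "the")])  # 2
```

Keying the neighbour counts by `(offset, token)` keeps "one word before" distinct from "two words before"; whether that distinction is wanted is an open design choice.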

Phoneme Corpora?

  • primary spelling of phoneme
  • alternative spellings
  • word associations
    • either as its own database or as part of Word corpora above
  • associated phonemes (preceding and succeeding ... one or two in either direction?)
  • phoneme frequencies?
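
As a starting point for discussion, the open questions above could be explored with a record like the one below. Everything here is an assumption: `PhonemeEntry` and `build_phoneme_corpus` are hypothetical names, the "primary" spelling is simply taken to be the most frequent one, the neighbour window defaults to one phoneme, and the input is hand-segmented toy data rather than output from any real grapheme-to-phoneme tool.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass
class PhonemeEntry:
    """Hypothetical record for one phoneme."""
    spellings: Counter = field(default_factory=Counter)  # spelling -> count
    words: set = field(default_factory=set)              # word associations
    frequency: int = 0
    preceding: Counter = field(default_factory=Counter)
    following: Counter = field(default_factory=Counter)

    @property
    def primary_spelling(self):
        # Assumption: "primary" means the most frequently observed spelling.
        return self.spellings.most_common(1)[0][0]

    @property
    def alternative_spellings(self):
        return set(self.spellings) - {self.primary_spelling}


def build_phoneme_corpus(segmented_words, window=1):
    """segmented_words maps each word to its ordered (phoneme, spelling) pairs.

    Each word is counted once, so frequencies are per word type, not per occurrence.
    """
    corpus = defaultdict(PhonemeEntry)
    for word, segments in segmented_words.items():
        phonemes = [phoneme for phoneme, _ in segments]
        for i, (phoneme, spelling) in enumerate(segments):
            entry = corpus[phoneme]
            entry.frequency += 1
            entry.spellings[spelling] += 1
            entry.words.add(word)
            for offset in range(1, window + 1):
                if i - offset >= 0:
                    entry.preceding[phonemes[i - offset]] += 1
                if i + offset < len(phonemes):
                    entry.following[phonemes[i + offset]] += 1
    return dict(corpus)


# Toy, hand-segmented input for illustration only.
TOY_WORDS = {
    "fish": [("f", "f"), ("ɪ", "i"), ("ʃ", "sh")],
    "photo": [("f", "ph"), ("oʊ", "o"), ("t", "t"), ("oʊ", "o")],
    "if": [("ɪ", "i"), ("f", "f")],
}

phonemes = build_phoneme_corpus(TOY_WORDS)
print(phonemes["f"].primary_spelling)       # f
print(phonemes["f"].alternative_spellings)  # {'ph'}
```

This version keeps phonemes in their own structure; the alternative raised above, storing them as part of the word corpus, would instead attach the segment list to each word entry.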

Concerns/Questions:

  • Particular language concerns:
    • non-phonetic languages
    • unwritten languages
  • database structure:
  • how to handle accents
  • how to account for conjugations/declensions/etc in corpora
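
On the accent question, one option (an assumption, not a decision the project has made) is to store an accent-stripped key alongside the original spelling, so that accented and unaccented forms can be grouped without losing the source text. The sketch below uses Python's standard `unicodedata` module; `strip_accents` is a hypothetical helper name.

```python
import unicodedata


def strip_accents(text):
    """Return text with combining marks (accents, breathings, etc.) removed."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the combining diacritics.
    # Note: this also drops the Greek iota subscript, which may be too lossy.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


print(strip_accents("λόγος"))  # λογος
print(strip_accents("ἐν"))     # εν
```

Keeping both the original and the stripped form leaves room to decide later whether accents are significant for a given language.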

Relevant Materials & Resources:

Proposed Approach:

  • Two passes at Expanded Corpora:
    • First Pass: Letters/Words
    • Second Pass: Phonemes/Graphemes
  • Use New Testament Greek
    • Reasons:
      • Strong’s Concordance has data we can check our results against
      • Source texts are easily accessible online
      • Team members have experience with the language
      • Its non-Roman alphabet gives us a head start on supporting other non-Roman-alphabet languages
