This repository contains lexical pronunciation resources and modules for use in text-to-speech (TTS) systems.
Specifically, it was originally set up to track work on updating and enhancing existing resources for the NTTS project funded by the Department of Arts and Culture (DAC) of the Government of South Africa.
The copyright and licence information for scripts in ./scripts/
can be found in ./COPYRIGHT
and ./LICENCE-APACHE
/./LICENCE-MIT
. This repository also contains data from various sources under different licences in the ./data/*
directories. Copyright and licence information for data and third-party components is contained in each individual sub-directory or source file.
For more information contact: Daniel van Niekerk (http://www.nwu.ac.za/must).
- OpenFST 1.5.0 or higher with Python bindings.
- PyICU.
The top level directory structure is summarised as follows:
.
|-- data
| |-- afr
| |-- eng
| |-- sot
| |-- tsn
| |-- xho
| `-- zul
|-- examples
|-- scripts
|-- COPYRIGHT
|-- LICENCE-APACHE
|-- LICENCE-MIT
`-- README.md
- The
data
directory contains core language resources organised by language, each associated with its own LICENCE and README. - The
examples
directory contains some example outputs when running scripts as described below. - The
scripts
directory contains implementations and UNIX tools for grapheme-to-phoneme (G2P) conversion, syllabification, word decompounding and morphological analysis (some usage examples are given below).
The decompounder decomp_simple.py
requires a word list and can be run for example on the Afrikaans data as follows:
cut -d " " -f 1 data/afr/pronundict.txt | scripts/decomp_simple.py examples/afr.words5.txt > examples/afr.decomp.txt
The Zulu morphological analyser can be run as follows (simplified output):
cut -f 1 data/zul/ref/nchlt_release_20130328/nchlt_isizulu.dict | scripts/morph_dcg.py data/zul/morphrules.descr.json data/zul/morphrules.dcg.txt --simpleguess > examples/zul.morphsimple.txt
G2P conversion can be run as follows:
cut -f 1 data/zul/ref/nchlt_release_20130328/nchlt_isizulu.dict | scripts/g2p_icu.py data/zul/phonemeset.json data/zul/g2p.translit.txt > examples/zul.simple.pronun.txt
cut -f 1 data/tsn/ref/nchlt_release_20130328/nchlt_setswana.dict | scripts/g2p_icu.py data/tsn/phonemeset.json data/tsn/g2p.translit.txt > examples/tsn.simple.pronun.txt
The syllabification modules can be run on the resulting pronunciation dictionaries:
cat examples/zul.simple.pronun.txt | scripts/syl_zul.py data/zul/phonemeset.json | cut -f 1,3 > examples/zul.syll.pronun.txt
cat examples/tsn.simple.pronun.txt | scripts/syl_tsn.py data/tsn/phonemeset.json | cut -f 1,3 > examples/tsn.syll.pronun.txt