Lexical pronunciation resources for TTS in South African languages

NWU-MuST/za_lex

ZA_LEX: lexical resources for South African languages

This repository contains lexical pronunciation resources and modules for use in text-to-speech (TTS) systems.

Specifically, it was originally set up to track work on updating and enhancing existing resources for the NTTS project funded by the Department of Arts and Culture (DAC) of the Government of South Africa.

The copyright and licence information for the scripts in ./scripts/ can be found in ./COPYRIGHT, ./LICENCE-APACHE and ./LICENCE-MIT. This repository also contains data from various sources, under different licences, in the ./data/* directories. Copyright and licence information for data and third-party components is given in each individual sub-directory or source file.

For more information contact: Daniel van Niekerk (http://www.nwu.ac.za/must).

Software dependencies

Description of contents

The top level directory structure is summarised as follows:

.
|-- data
|   |-- afr
|   |-- eng
|   |-- sot
|   |-- tsn
|   |-- xho
|   `-- zul
|-- examples
|-- scripts
|-- COPYRIGHT
|-- LICENCE-APACHE
|-- LICENCE-MIT
`-- README.md
  • The data directory contains core language resources organised by language, each associated with its own LICENCE and README.
  • The examples directory contains some example outputs when running scripts as described below.
  • The scripts directory contains implementations and UNIX tools for grapheme-to-phoneme (G2P) conversion, syllabification, word decompounding and morphological analysis (some usage examples are given below).

Usage examples

Decompounding

The decompounder decomp_simple.py requires a word list. For example, it can be run on the Afrikaans data as follows:

cut -d " " -f 1 data/afr/pronundict.txt | scripts/decomp_simple.py examples/afr.words5.txt > examples/afr.decomp.txt
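
The internals of decomp_simple.py are not described here. As a rough illustration of what word-list-based decompounding involves, the following minimal Python sketch splits a word into the fewest parts that all occur in a given vocabulary. The function name `split_compound`, the `min_len` threshold and the toy vocabulary are assumptions made for this example; they are not taken from the repository, and the actual script may use a different strategy (for instance, to handle linking morphemes).

```python
from functools import lru_cache


def split_compound(word, vocabulary, min_len=3):
    """Split `word` into the fewest vocabulary entries; return [word] if none fit.

    Illustrative sketch only, not the algorithm used by decomp_simple.py.
    """
    vocab = frozenset(vocabulary)

    @lru_cache(maxsize=None)
    def best(start):
        # Fewest-parts segmentation of word[start:], or None if impossible.
        if start == len(word):
            return ()
        candidates = []
        for end in range(start + min_len, len(word) + 1):
            part = word[start:end]
            if part in vocab:
                rest = best(end)
                if rest is not None:
                    candidates.append((part,) + rest)
        return min(candidates, key=len) if candidates else None

    parts = best(0)
    return list(parts) if parts else [word]


# Hypothetical toy vocabulary, not read from examples/afr.words5.txt.
toy_vocab = {"skool", "hoof", "kinder", "tuin"}
for w in ("skoolhoof", "kindertuin", "boek"):
    print(w, " ".join(split_compound(w, toy_vocab)))
```

Words that cannot be covered entirely by vocabulary entries are returned unsplit, which keeps the sketch conservative.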

Morphological analysis

The Zulu morphological analyser can be run as follows (simplified output):

cut -f 1 data/zul/ref/nchlt_release_20130328/nchlt_isizulu.dict | scripts/morph_dcg.py data/zul/morphrules.descr.json data/zul/morphrules.dcg.txt --simpleguess > examples/zul.morphsimple.txt

Pronunciation prediction

G2P conversion can be run as follows:

cut -f 1 data/zul/ref/nchlt_release_20130328/nchlt_isizulu.dict | scripts/g2p_icu.py data/zul/phonemeset.json data/zul/g2p.translit.txt > examples/zul.simple.pronun.txt
cut -f 1 data/tsn/ref/nchlt_release_20130328/nchlt_setswana.dict | scripts/g2p_icu.py data/tsn/phonemeset.json data/tsn/g2p.translit.txt > examples/tsn.simple.pronun.txt
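
To give a feel for rule-based G2P conversion in languages with fairly regular orthographies, the sketch below applies a grapheme-to-phoneme table by greedy longest match, so that digraphs are consumed before single letters. It is written in plain Python and does not use ICU; the function name `g2p_longest_match`, the rule table and the placeholder phoneme labels are all assumptions for illustration and are not the contents of data/zul/g2p.translit.txt or data/zul/phonemeset.json.

```python
def g2p_longest_match(word, rules):
    """Map a word to a list of phoneme labels by greedy longest-match lookup.

    Illustrative sketch only, not the method implemented in g2p_icu.py.
    """
    max_len = max(len(grapheme) for grapheme in rules)
    phonemes = []
    i = 0
    while i < len(word):
        # Try the longest grapheme first so digraphs win over single letters.
        for size in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in rules:
                phonemes.append(rules[chunk])
                i += size
                break
        else:
            raise ValueError(f"no rule for {word[i]!r} in {word!r}")
    return phonemes


# Hypothetical toy rules with placeholder phoneme labels.
toy_rules = {
    "hl": "HL", "sh": "SH",
    "h": "h", "l": "l", "s": "s", "n": "n", "m": "m",
    "a": "a", "e": "e", "i": "i", "o": "o", "u": "u",
}
for w in ("hlala", "shisa"):
    print(w, " ".join(g2p_longest_match(w, toy_rules)))
```

Raising an error on an unknown grapheme is a design choice for the sketch: it makes gaps in the rule table visible rather than silently dropping letters.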

The syllabification modules can be run on the resulting pronunciation dictionaries:

cat examples/zul.simple.pronun.txt | scripts/syl_zul.py data/zul/phonemeset.json | cut -f 1,3 > examples/zul.syll.pronun.txt
cat examples/tsn.simple.pronun.txt | scripts/syl_tsn.py data/tsn/phonemeset.json | cut -f 1,3 > examples/tsn.syll.pronun.txt
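
As a simplified illustration of syllabifying a phoneme sequence, the sketch below closes a syllable after every vowel, which assumes that all syllables are open (vowel-final). This is an assumption made for the example; the function name `syllabify_open` and the placeholder phoneme labels are hypothetical, and syl_zul.py and syl_tsn.py may treat cases such as syllabic nasals differently.

```python
def syllabify_open(phonemes, vowels):
    """Group phonemes into syllables, ending a syllable after each vowel.

    Illustrative sketch only, not the logic of syl_zul.py or syl_tsn.py.
    """
    syllables = []
    current = []
    for phoneme in phonemes:
        current.append(phoneme)
        if phoneme in vowels:
            syllables.append(current)
            current = []
    if current:
        # Trailing consonants: attach to the last syllable if there is one.
        if syllables:
            syllables[-1].extend(current)
        else:
            syllables.append(current)
    return syllables


# Placeholder phoneme labels, not taken from a phonemeset.json file.
toy_vowels = {"a", "e", "i", "o", "u"}
sylls = syllabify_open(["HL", "a", "l", "a"], toy_vowels)
print(".".join(" ".join(s) for s in sylls))
```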
