ZA_LEX: lexical resources for South African languages

This repository contains lexical pronunciation resources and modules for use in text-to-speech (TTS) systems.

Specifically, it was originally set up to track work on updating and enhancing existing resources for the NTTS project funded by the Department of Arts and Culture (DAC) of the Government of South Africa.

Copyright and licence information for the scripts in ./scripts/ can be found in ./COPYRIGHT, ./LICENCE-APACHE and ./LICENCE-MIT. This repository also contains data from various sources under different licences in the ./data/* directories. Copyright and licence information for data and third-party components is contained in each individual sub-directory or source file.

For more information contact: Daniel van Niekerk (http://www.nwu.ac.za/must).

Software dependencies

OpenFST 1.5.0 or higher with Python bindings.
PyICU.

Description of contents

The top level directory structure is summarised as follows:

.
|-- data
|   |-- afr
|   |-- eng
|   |-- sot
|   |-- tsn
|   |-- xho
|   `-- zul
|-- examples
|-- scripts
|-- COPYRIGHT
|-- LICENCE-APACHE
|-- LICENCE-MIT
`-- README.md

The data directory contains core language resources organised by language, each associated with its own LICENCE and README.
The examples directory contains some example outputs when running scripts as described below.
The scripts directory contains implementations and UNIX tools for grapheme-to-phoneme (G2P) conversion, syllabification, word decompounding and morphological analysis (some usage examples are given below).

Usage examples

Decompounding

The decompounder decomp_simple.py requires a word list and can be run, for example, on the Afrikaans data as follows:

cut -d " " -f 1 data/afr/pronundict.txt | scripts/decomp_simple.py examples/afr.words5.txt > examples/afr.decomp.txt
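To illustrate the general idea of word-list-based decompounding, the following is a minimal sketch in Python. It is not the algorithm implemented in decomp_simple.py; the function name decompound, the min_len parameter and the toy lexicon are assumptions made purely for illustration.

```python
from functools import lru_cache


def decompound(word, lexicon, min_len=3):
    """Split `word` into known lexicon entries; return [word] if no split is found.

    Hypothetical helper for illustration only (not taken from decomp_simple.py).
    """
    lexicon = frozenset(lexicon)

    @lru_cache(maxsize=None)
    def split(rest):
        # A remainder that is itself a known word needs no further splitting.
        if rest in lexicon:
            return (rest,)
        # Try the longest candidate prefix first, then recurse on the remainder.
        # Both parts must be at least `min_len` characters long.
        for i in range(len(rest) - min_len, min_len - 1, -1):
            head, tail = rest[:i], rest[i:]
            if head in lexicon:
                parts = split(tail)
                if parts is not None:
                    return (head,) + parts
        return None

    parts = split(word)
    return list(parts) if parts is not None else [word]


if __name__ == "__main__":
    toy_lexicon = {"boek", "winkel"}  # toy word list, not the repository data
    print(decompound("boekwinkel", toy_lexicon))
```

A word that cannot be covered entirely by lexicon entries is returned unsplit, which keeps the sketch conservative on unknown material.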

Morphological analysis

The Zulu morphological analyser can be run as follows (simplified output):

cut -f 1 data/zul/ref/nchlt_release_20130328/nchlt_isizulu.dict | scripts/morph_dcg.py data/zul/morphrules.descr.json data/zul/morphrules.dcg.txt --simpleguess > examples/zul.morphsimple.txt

Pronunciation prediction

G2P conversion can be run as follows:

cut -f 1 data/zul/ref/nchlt_release_20130328/nchlt_isizulu.dict | scripts/g2p_icu.py data/zul/phonemeset.json data/zul/g2p.translit.txt > examples/zul.simple.pronun.txt
cut -f 1 data/tsn/ref/nchlt_release_20130328/nchlt_setswana.dict | scripts/g2p_icu.py data/tsn/phonemeset.json data/tsn/g2p.translit.txt > examples/tsn.simple.pronun.txt
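As a rough illustration of rule-based G2P for languages with fairly regular orthographies, the sketch below applies greedy longest-match rewriting from graphemes to phoneme symbols. It is a pure-Python stand-in and does not reproduce g2p_icu.py or the rule format of g2p.translit.txt; the function name g2p, the rule table and the phoneme labels are all made up for the example.

```python
def g2p(word, rules):
    """Convert `word` to a list of phoneme symbols by greedy longest-match.

    `rules` maps grapheme strings to phoneme symbols. Hypothetical helper,
    not the interface of g2p_icu.py.
    """
    max_len = max(len(grapheme) for grapheme in rules)
    word = word.lower()
    phones = []
    i = 0
    while i < len(word):
        # Prefer the longest grapheme that matches at position i.
        for n in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + n]
            if chunk in rules:
                phones.append(rules[chunk])
                i += n
                break
        else:
            raise ValueError(f"no rule for {word[i]!r} in {word!r}")
    return phones


if __name__ == "__main__":
    # Toy rules with placeholder phoneme labels (not from data/*/phonemeset.json).
    toy_rules = {"a": "a", "h": "h", "t": "t", "th": "th"}
    print(" ".join(g2p("thatha", toy_rules)))
```

Matching the longest grapheme first ensures that digraphs such as "th" are treated as a single unit rather than as "t" followed by "h".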

The syllabification modules can be run on the resulting pronunciation dictionaries:

cat examples/zul.simple.pronun.txt | scripts/syl_zul.py data/zul/phonemeset.json | cut -f 1,3 > examples/zul.syll.pronun.txt
cat examples/tsn.simple.pronun.txt | scripts/syl_tsn.py data/tsn/phonemeset.json | cut -f 1,3 > examples/tsn.syll.pronun.txt
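The following minimal sketch shows one simple way to syllabify a phoneme sequence, under the simplifying assumption that every syllable is open (ends in a vowel). It is not the logic of syl_zul.py or syl_tsn.py, and it ignores cases such as syllabic consonants; the function name syllabify and the toy vowel set are assumptions for illustration.

```python
def syllabify(phones, vowels):
    """Group phoneme symbols into syllables, closing a syllable after each vowel.

    Hypothetical helper assuming open syllables only; trailing consonants
    are attached to the last syllable.
    """
    syllables = []
    current = []
    for phone in phones:
        current.append(phone)
        if phone in vowels:
            syllables.append(current)
            current = []
    if current:
        # Leftover consonants after the final vowel.
        if syllables:
            syllables[-1].extend(current)
        else:
            syllables.append(current)
    return syllables


if __name__ == "__main__":
    toy_vowels = {"a", "e", "i", "o", "u"}  # placeholder vowel set
    sylls = syllabify(["th", "a", "th", "a"], toy_vowels)
    print(".".join(" ".join(s) for s in sylls))
```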
