varnamproject/libvarnam

Introduction

libvarnam is a cross-platform, self-learning, open source library that supports transliteration and reverse transliteration for Indian languages. At its core is a C shared library providing algorithms and patterns for transliteration. libvarnam has a simple built-in learning module that can learn words to improve the transliteration experience.

Installing libvarnam

wget http://download.savannah.gnu.org/releases/varnamproject/libvarnam/source/libvarnam-$VERSION.tar.gz
tar -xvf libvarnam-$VERSION.tar.gz
cd libvarnam-$VERSION
cmake . && make
sudo make install

This installs the libvarnam shared libraries and the varnamc command-line utility. varnamc can be used to quickly try out varnam.

Installation on Windows

On Windows, you can compile libvarnam using Visual Studio. Use the following cmake command to generate the project files:

cmake -DBUILD_TESTS=false -DBUILD_VST=false -DRUN_TESTS=false .

Usage

Transliterate

Usage: varnamc -s lang_code -t word

varnamc -s ml -t varnam
 വർണം
 വർണമേറിയത്

Reverse Transliterate

Usage: varnamc -s lang_code -r word

varnamc -s ml -r വർണം
 varnam

Word corpus

libvarnam is a learning system. It works better with a word corpus. You can obtain the word corpus and make varnam learn all the words. This will enable libvarnam to provide intelligent suggestions.

Here is an example of loading the Malayalam word corpus:

mkdir words
cd words
wget http://download.savannah.gnu.org/releases/varnamproject/words/ml/ml.tar.gz
tar -xvf ml.tar.gz
varnamc -s ml --learn-from .

This will take some time, depending on how many words you are loading.

More word corpora are also available.

There is also an --import-learnings-from option, which imports files that already contain the learned parameters. Importing these files takes much less time than learning from a word corpus.

What next?

If you just want to use varnam for input, you have the following options

If you are a programmer, you will be interested in libvarnam. You can use it to provide Indian language support in your applications. libvarnam can be used from different programming languages.

How Varnam works

  1. Scheme files and symbol tables
  2. Transliteration
  3. Learning

Scheme files and symbol tables

A scheme file maps English letters to their phonetically equivalent Indic letters. All vowels, consonants and consonant clusters are mapped to their Indic equivalents. Varnam uses the scheme file mapping to perform transliteration.

Scheme files are plain text but use a custom DSL to make the mapping easier. The DSL is implemented in Ruby, so a scheme file can contain any valid Ruby code. The DSL also provides many helper functions to make the mapping easier.

The schemes/ directory contains the scheme files for all supported languages. Each language is represented by its ISO language code.

Symbol tables

The compiled version of a scheme file is called a Varnam Symbol Table (vst). Compilation is done using the varnamc command-line utility:

varnamc --compile schemes/ml

A symbol table is a binary representation of a plain text scheme file. It also contains other metadata to make lookups easier.

libvarnam understands only the symbol table format, so every scheme file must be compiled into vst format before it can be used with varnam.

To compile all scheme files present in the schemes directory, run:

make vst

Symbol table lookup

Varnam can be initialized with just the ISO language code. When this happens, varnam scans the following directories for a matching symbol table file. If one is found, it is loaded and used for all operations.

  • "/usr/local/share/varnam/vst"
  • "/usr/share/varnam/vst"
  • "schemes"
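
The lookup described above can be sketched in a few lines of C. This is a minimal illustration, not libvarnam's actual implementation: the function names, the file_exists_fn callback type, and the "<lang_code>.vst" file name are assumptions made for the example, and the directories are assumed to be tried in the order listed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Directories searched, in the order given in the list above. */
static const char *const VST_SEARCH_DIRS[] = {
    "/usr/local/share/varnam/vst",
    "/usr/share/varnam/vst",
    "schemes", /* relative to the current working directory */
};

/* Hypothetical callback type: returns nonzero when `path` exists. */
typedef int (*file_exists_fn)(const char *path);

/* Simple existence check that tries to open the file for reading. */
int file_exists_on_disk(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return 0;
    fclose(f);
    return 1;
}

/*
 * Writes the path of the first matching symbol table into `out` and
 * returns 1. Returns 0 (and an empty `out`) when nothing matches.
 * Assumption: the symbol table file is named "<lang_code>.vst".
 */
int find_symbol_table(const char *lang_code, file_exists_fn exists,
                      char *out, size_t out_size)
{
    size_t i;
    size_t dir_count = sizeof(VST_SEARCH_DIRS) / sizeof(VST_SEARCH_DIRS[0]);

    assert(lang_code != NULL && exists != NULL);
    assert(out != NULL && out_size > 0);

    for (i = 0; i < dir_count; i++) {
        int n = snprintf(out, out_size, "%s/%s.vst",
                         VST_SEARCH_DIRS[i], lang_code);
        if (n < 0 || (size_t)n >= out_size)
            continue; /* path did not fit in the buffer; skip it */
        if (exists(out))
            return 1;
    }
    out[0] = '\0';
    return 0;
}
```

Calling find_symbol_table("ml", file_exists_on_disk, buf, sizeof buf) fills buf with the first existing candidate path. Passing the existence check as a callback keeps the search order easy to exercise without touching the file system.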

Transliteration

varnam_transliterate(varnam *handle, const char *input, varray **output);

This function is the entry point for transliteration. It converts the input to the phonetically equivalent Indic text and also provides the set of possible matches for the given input.

Transliteration performs the following steps under the hood:

  1. Tokenization. Varnam uses a greedy tokenizer that processes the input from left to right. The tokenizer tries all possible combinations to generate the longest possible tokens for the given input, using the symbol table provided to varnam.

  2. Assembly. The generated tokens are assembled and varnam computes all possible combinations of them. For example, if the input is malayalam, varnam generates tokens such as മ, ല, യാ, ളം ([ma], [la], [ya], [lam]) and many others. Once these tokens are generated, they are combined and tested against the learning model to discard garbage values and find the most used words. The words are sorted by frequency and returned to the caller.
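
The greedy, left-to-right tokenization step can be illustrated with a small self-contained sketch. It is not libvarnam's tokenizer: the four-entry TOY_TABLE stands in for a real symbol table, the function names are made up for the example, and only the single longest match is kept at each position (the real library also considers alternative tokens and combines them).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One toy symbol-table entry: an English pattern and its Indic value. */
struct symbol {
    const char *pattern;
    const char *value;
};

/* Illustrative entries only; real tables are compiled from scheme files. */
static const struct symbol TOY_TABLE[] = {
    {"ma", "മ"}, {"la", "ല"}, {"ya", "യാ"}, {"lam", "ളം"},
};
#define TOY_TABLE_SIZE (sizeof(TOY_TABLE) / sizeof(TOY_TABLE[0]))

/* Returns the entry with the longest pattern that prefixes `input`. */
const struct symbol *longest_match(const char *input)
{
    const struct symbol *best = NULL;
    size_t best_len = 0;
    size_t i;

    for (i = 0; i < TOY_TABLE_SIZE; i++) {
        size_t len = strlen(TOY_TABLE[i].pattern);
        if (len > best_len &&
            strncmp(input, TOY_TABLE[i].pattern, len) == 0) {
            best = &TOY_TABLE[i];
            best_len = len;
        }
    }
    return best;
}

/*
 * Tokenizes `input` from left to right, appending each token's value
 * to `out`. Bytes with no matching pattern are copied unchanged.
 * Returns the number of tokens, or -1 if `out` is too small.
 */
int tokenize_greedy(const char *input, char *out, size_t out_size)
{
    size_t used = 0;
    int count = 0;

    assert(input != NULL && out != NULL && out_size > 0);
    out[0] = '\0';

    while (*input != '\0') {
        const struct symbol *s = longest_match(input);
        const char *piece = s ? s->value : input;
        size_t piece_len = s ? strlen(s->value) : 1;
        size_t consumed = s ? strlen(s->pattern) : 1;

        if (used + piece_len + 1 > out_size)
            return -1;
        memcpy(out + used, piece, piece_len);
        used += piece_len;
        out[used] = '\0';
        input += consumed;
        count++;
    }
    return count;
}
```

With this toy table, the input malayalam is split into [ma], [la], [ya], [lam]: at the last position both "la" and "lam" match, and the greedy rule picks the longer one.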

Renderer

Most of the processing in varnam is language agnostic and should work for all Indian languages. However, language-specific fixes are sometimes required. Varnam handles this using renderers. Any language can register renderers, and varnam invokes them just before rendering the final output. A renderer can contain language-specific rules that cannot be generalized otherwise.
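
The registration idea can be pictured as a small table of function pointers keyed by language code. The sketch below is purely illustrative: renderer_fn, register_renderer, apply_renderers and the placeholder rule are hypothetical names invented for this example and are not part of libvarnam's API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_RENDERERS 8

/* Hypothetical renderer signature: may rewrite `text` in place. */
typedef void (*renderer_fn)(char *text);

struct renderer_entry {
    const char *lang_code;
    renderer_fn render;
};

struct renderer_registry {
    struct renderer_entry entries[MAX_RENDERERS];
    size_t count;
};

/* Registers a renderer for a language; returns 0 if the registry is full. */
int register_renderer(struct renderer_registry *reg, const char *lang_code,
                      renderer_fn render)
{
    assert(reg != NULL && lang_code != NULL && render != NULL);
    if (reg->count == MAX_RENDERERS)
        return 0;
    reg->entries[reg->count].lang_code = lang_code;
    reg->entries[reg->count].render = render;
    reg->count++;
    return 1;
}

/*
 * Runs every renderer registered for `lang_code` on `text`, as the last
 * step before output. Returns the number of renderers invoked.
 */
int apply_renderers(const struct renderer_registry *reg,
                    const char *lang_code, char *text)
{
    size_t i;
    int invoked = 0;

    assert(reg != NULL && lang_code != NULL && text != NULL);
    for (i = 0; i < reg->count; i++) {
        if (strcmp(reg->entries[i].lang_code, lang_code) == 0) {
            reg->entries[i].render(text);
            invoked++;
        }
    }
    return invoked;
}

/* Placeholder rule for illustration only: drops a trailing '_'. */
void strip_trailing_underscore(char *text)
{
    size_t len = strlen(text);
    if (len > 0 && text[len - 1] == '_')
        text[len - 1] = '\0';
}
```

The generic pipeline stays unaware of any single language: it only calls apply_renderers with the active language code, and text for languages without a registered renderer passes through unchanged.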

Learning

varnam_learn(varnam *handle, const char *word);

Varnam can learn new words. The more words it learns, the better it performs. The learning process learns each word and its patterns.

The learning process persists the following data:

  1. Patterns: All English combinations that can be used to input the given Indic text
  2. Words: The Indic text itself
  3. Prefixes: Prefixes of patterns and words

When an Indic word is learned, varnam tokenizes the word using the symbol table and tries to learn all possible patterns that can be used to input the word. Internally, varnam keeps a prefix tree and the frequencies of all patterns. This storage structure allows varnam to retrieve matching words efficiently when a pattern is presented. Basic stemming is also performed while learning words.

When the same word/pattern combination is learned again, varnam computes the frequency with which it has seen this pattern. This frequency is used to sort and pick the best candidate while performing transliteration.
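
The frequency bookkeeping can be sketched with a small in-memory table. This is a simplified illustration under stated assumptions: a fixed-size array stands in for varnam's persistent store and prefix tree, and the names learn_pair and best_candidate are hypothetical, not library functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ENTRIES 64
#define MAX_TEXT 64

/* One learned pattern/word pair and how often it has been seen. */
struct learned {
    char pattern[MAX_TEXT];
    char word[MAX_TEXT];
    unsigned frequency;
};

struct learn_store {
    struct learned entries[MAX_ENTRIES];
    size_t count;
};

/*
 * Records one sighting of (pattern, word) and returns its updated
 * frequency. Returns 0 if the store is full or the text is too long.
 */
unsigned learn_pair(struct learn_store *store, const char *pattern,
                    const char *word)
{
    struct learned *e;
    size_t i;

    assert(store != NULL && pattern != NULL && word != NULL);
    for (i = 0; i < store->count; i++) {
        e = &store->entries[i];
        if (strcmp(e->pattern, pattern) == 0 && strcmp(e->word, word) == 0)
            return ++e->frequency;
    }
    if (store->count == MAX_ENTRIES ||
        strlen(pattern) >= MAX_TEXT || strlen(word) >= MAX_TEXT)
        return 0;

    e = &store->entries[store->count++];
    strcpy(e->pattern, pattern);
    strcpy(e->word, word);
    e->frequency = 1;
    return 1;
}

/* Returns the most frequently seen word for `pattern`, or NULL. */
const char *best_candidate(const struct learn_store *store,
                           const char *pattern)
{
    const struct learned *best = NULL;
    size_t i;

    assert(store != NULL && pattern != NULL);
    for (i = 0; i < store->count; i++) {
        const struct learned *e = &store->entries[i];
        if (strcmp(e->pattern, pattern) == 0 &&
            (best == NULL || e->frequency > best->frequency))
            best = e;
    }
    return best ? best->word : NULL;
}
```

If the pattern varnam is learned once with വർണമേറിയത് and twice with വർണം, best_candidate returns വർണം because it has the higher frequency. The linear scan is only for brevity; a prefix tree, as described above, is what makes this lookup efficient at scale.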

Learning can be initiated by calling Varnam APIs directly or using varnamc.

Input tools like ibus-engine will automatically learn the words that you are typing.

Learned data is kept in one of the following locations:

  • APPDATA\varnam\suggestions (Windows)
  • XDG_DATA_HOME/varnam/suggestions
  • HOME/.local/share/varnam/suggestions
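
One way to pick between these locations is sketched below. It is an illustration rather than libvarnam's actual logic: the function name suggestions_dir is hypothetical, and the precedence (APPDATA first, then XDG_DATA_HOME, then HOME) is an assumption based on the order of the list above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns nonzero when an environment value is set and non-empty. */
static int is_set(const char *value)
{
    return value != NULL && value[0] != '\0';
}

/*
 * Builds the learned-data directory from the three environment values.
 * Pass NULL (or "") for a variable that is not set.
 * Assumption: the locations are tried in the order listed above.
 * Returns 1 on success, 0 if no variable is usable or `out` is too small.
 */
int suggestions_dir(const char *appdata, const char *xdg_data_home,
                    const char *home, char *out, size_t out_size)
{
    int n = -1;

    assert(out != NULL && out_size > 0);

    if (is_set(appdata))
        n = snprintf(out, out_size, "%s\\varnam\\suggestions", appdata);
    else if (is_set(xdg_data_home))
        n = snprintf(out, out_size, "%s/varnam/suggestions", xdg_data_home);
    else if (is_set(home))
        n = snprintf(out, out_size, "%s/.local/share/varnam/suggestions",
                     home);

    if (n < 0 || (size_t)n >= out_size) {
        out[0] = '\0';
        return 0;
    }
    return 1;
}
```

A caller would typically pass getenv("APPDATA"), getenv("XDG_DATA_HOME") and getenv("HOME") (from stdlib.h). Taking the values as parameters instead of reading the environment inside the function keeps the selection logic easy to test.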

Mozilla Public License

Copyright (c) 2016 Navaneeth.K.N

This Source Code Form is subject to the terms of the Mozilla Public License, v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain one at https://mozilla.org/MPL/2.0/.