If you find this repository useful, please consider giving it a star.
GAWA is a word-level morphological autoencoder that encodes any word — including unseen or morphologically complex words — into a dense embedding vector (eword) using character-level representations weighted by a Gaussian positional prior.
Unlike subword tokenizers (BPE, WordPiece, SentencePiece), GAWA treats each word as a sequence of characters and compresses it into a single fixed-size vector. This makes it:
- Language-agnostic: Works on any character-based language without a pretrained vocabulary
- Morphology-aware: Positional weighting captures prefix/suffix importance naturally
- Compact: The output sequence length equals the number of words, not subword tokens
GAWA is designed to plug in as the front-end morphological module of a Global Transformer, replacing the tokenizer entirely.
If you find this useful for your research, you can cite it:

```bibtex
@misc{gawa2026,
  author = {Abdul Wahid Rukua},
  title = {GAWA: Gaussian-Augmented for Word Architecture},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/Airukua/gawa}
}
```

GAWA can distinguish words by their morphological footprint while still placing misspellings and variants near their correct forms. Below are examples of OOV (out-of-vocabulary) queries using Indonesian words.
```text
OOV: makann
  makanan    sim=0.9846
  makan      sim=0.9779
  mkan       sim=0.8618

OOV: mkan
  mknn       sim=0.8653
  makan      sim=0.8627
  makann     sim=0.8618

OOV: berlarr
  berlari    sim=0.9857
  permaenan  sim=0.5540
  permainan  sim=0.4860

OOV: permaenan
  permainan  sim=0.9255
  memakan    sim=0.5783
  berlarr    sim=0.5540
```
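Rankings like the ones above boil down to cosine similarity over word embeddings. A minimal sketch of the lookup, using random placeholder vectors where GAWA's `encode_words` would supply real embeddings in practice:

```python
import numpy as np

def rank_by_similarity(query_vec, vocab_words, vocab_vecs):
    """Return (word, cosine similarity) pairs sorted by similarity to query_vec."""
    q = query_vec / np.linalg.norm(query_vec)
    v = vocab_vecs / np.linalg.norm(vocab_vecs, axis=1, keepdims=True)
    sims = v @ q
    order = np.argsort(-sims)
    return [(vocab_words[i], float(sims[i])) for i in order]

# Placeholder embeddings for illustration only; in practice these would come
# from the model's encode_words(...) call shown later in this README.
rng = np.random.default_rng(0)
words = ["makan", "makanan", "berlari"]
vecs = rng.normal(size=(3, 768))               # eword_dim = 768 in the default config
query = vecs[1] + 0.05 * rng.normal(size=768)  # a slightly perturbed "makanan"

ranked = rank_by_similarity(query, words, vecs)
print(ranked[0][0])  # nearest neighbour is "makanan"
```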
The pretrained model was trained on Indonesian language data (~8.2 million unique words extracted from Indo4B: https://huggingface.co/datasets/taufiqdp/Indo4B).
- Decoder training: 2 epochs
- Accuracy: 94%
- Dataset: ~8.2 million words extracted from Indo4B
- Training time: ~12 hours
- Hardware: NVIDIA T4 (Kaggle)
```text
Input Word (characters)
        │
        ├──► Char Embedding (trainable)
        │
        ├──► Gaussian Positional Encoding (fixed, non-trainable)
        │        μ_j = j,  σ_j = √j
        │
        └──► Concat → Fusion MLP
                  │
          Weighted Pooling
   (Gaussian Prior + Learnable Δ)
                  │
          Output Projection
                  │
  EWORD Vector ──────────────────────────┐
                                         │
                              ┌──────────▼──────────┐
                              │     GAWA Decoder    │
                              │   Init GRU Hidden   │
                              │  Char Emb + Concat  │
                              │      GRU Cell       │
                              │   Cross-Attention   │
                              │  Residual + Logits  │
                              └─────────────────────┘
```
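The Gaussian positional prior in the diagram (μ_j = j, σ_j = √j) can be sketched as below. The exact normalization and indexing inside GAWA may differ, so treat this as an illustration of the idea rather than the reference implementation:

```python
import numpy as np

def gaussian_position_weights(word_len: int, num_positions: int) -> np.ndarray:
    """Gaussian prior over character positions: component j is centred at
    mu_j = j with sigma_j = sqrt(j), following the architecture diagram.
    Returns a (num_positions, word_len) matrix of normalized weights."""
    chars = np.arange(word_len)                  # character indices 0..L-1
    weights = np.empty((num_positions, word_len))
    for j in range(1, num_positions + 1):
        mu, sigma = j, np.sqrt(j)
        w = np.exp(-0.5 * ((chars - mu) / sigma) ** 2)
        weights[j - 1] = w / w.sum()             # normalize over characters
    return weights

W = gaussian_position_weights(word_len=8, num_positions=4)
print(W.shape)        # (4, 8)
print(W[0].argmax())  # component 1 peaks at character position 1
```

Because σ_j grows with j, later components spread their mass over more characters, which is one way the prior can emphasise prefixes sharply while covering suffixes more softly.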
Install from PyPI:

```bash
pip install gawa
```

Or install the latest version from GitHub:

```bash
pip install git+https://github.com/AiRukua/gawa.git
```

Or clone and install in editable mode:

```bash
git clone https://github.com/AiRukua/gawa.git
cd gawa
pip install -e .
```

For development dependencies:

```bash
pip install -e ".[dev]"
```

GAWA expects a word list (one word per line). You can build it from raw text:
```bash
gawa-prepare --input data/raw.txt --output data/processed/train.txt --lower
```

Use the YAML configs in configs/:
```bash
gawa-train --config configs/gawa_small.yaml
```

Checkpoints are saved to the directory defined in the config (default: checkpoints/).
```bash
gawa-encode \
  --checkpoint checkpoints/gawa_small/best.pt \
  --words "makan,memakan,makanan"
```

Default output is JSONL. Use --output to write to a file.
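The JSONL output can be consumed line by line with the standard library. A minimal sketch — note that the `word` and `embedding` field names here are assumptions for illustration; check your actual output for the exact schema:

```python
import json

# Hypothetical two-line JSONL output from gawa-encode; the field names are
# assumptions, not the documented schema.
jsonl = '{"word": "makan", "embedding": [0.1, 0.2]}\n' \
        '{"word": "makanan", "embedding": [0.3, 0.4]}\n'

records = [json.loads(line) for line in jsonl.splitlines() if line.strip()]
embeddings = {r["word"]: r["embedding"] for r in records}
print(sorted(embeddings))  # ['makan', 'makanan']
```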
```bash
gawa-evaluate --config configs/gawa_small.yaml --checkpoint checkpoints/gawa_small/best.pt
```

```python
from gawa import load_config, train_from_config

# 1) Load YAML config
cfg = load_config("configs/gawa_small.yaml")

# 2) Train from config (checkpoints saved to the directory in the YAML)
train_from_config(cfg)
```

If you want to use the pretrained GAWA model, you can load it directly from Hugging Face:
```python
from gawa import GAWAModel

model = GAWAModel.from_pretrained("AiRukua/gawa")
kept_words, embs = model.encode_words(["makan", "memakan", "makanan"])
kept_words, recs = model.decode_words(["makan", "memakan", "makanan"])
```

GAWA uses a YAML config file for training (see configs/). The key sections are:
data

- `train_path`: Path to a text file with one word per line.
- `max_word_len`: Maximum word length (characters). Words longer than this are filtered. Must match `model.max_word_len`.

model

- `char_emb_dim`: Character embedding dimension.
- `pos_enc_dim`: Gaussian positional encoding dimension.
- `hidden_dim`: Fusion MLP & decoder GRU hidden size.
- `eword_dim`: Output word embedding dimension.
- `max_word_len`: Must match `data.max_word_len` (set both to the same value to avoid a length mismatch error).
- `encoder_lambda_adjust`: Weight for the learnable position delta.
- `decoder_num_layers`: Number of GRU layers in the decoder.
- `decoder_num_heads`: Number of cross-attention heads.

training

- `batch_size`: Training batch size.
- `epochs`: Number of training epochs.
- `lr`: Learning rate.
- `sample_every`: How often to log reconstructions.
Example snippet:

```yaml
data:
  train_path: data/processed/train.txt
  max_word_len: 32

model:
  char_emb_dim: 64
  pos_enc_dim: 64
  hidden_dim: 256
  eword_dim: 768
  max_word_len: 32
  encoder_lambda_adjust: 0.3
  decoder_num_layers: 1
  decoder_num_heads: 2

training:
  batch_size: 256
  epochs: 20
  lr: 3.0e-4
  sample_every: 1
```

To train with your config:
```bash
gawa-train --config configs/gawa_small.yaml
```

| Parameter | Default | Description |
|---|---|---|
| `char_emb_dim` | 64 | Character embedding size |
| `pos_enc_dim` | 64 | Gaussian PE dimension |
| `hidden_dim` | 256 | Fusion MLP & GRU hidden size |
| `eword_dim` | 768 | Output word embedding dimension |
| `max_word_len` | 32 | Maximum word length in characters |
| `lambda_adjust` | 0.3 | Weight of learnable position delta |
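Since `data.max_word_len` and `model.max_word_len` must agree, a quick consistency check before launching training can save a failed run. A minimal sketch using a plain dict in place of the loaded config object (whose exact type is an assumption here):

```python
def check_word_len_consistency(cfg: dict) -> int:
    """Raise if data.max_word_len and model.max_word_len disagree."""
    data_len = cfg["data"]["max_word_len"]
    model_len = cfg["model"]["max_word_len"]
    if data_len != model_len:
        raise ValueError(
            f"max_word_len mismatch: data={data_len}, model={model_len}"
        )
    return data_len

# Consistent values pass through; mismatched values raise ValueError.
cfg = {"data": {"max_word_len": 32}, "model": {"max_word_len": 32}}
print(check_word_len_consistency(cfg))  # 32
```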
| Feature | BPE / WordPiece | GAWA |
|---|---|---|
| Handles unseen words | ✗ (UNK/fallback) | ✓ (char-based) |
| Morphology-aware | Partial | ✓ Explicit |
| Sequence length | Longer (subwords) | Shorter (words) |
| Language-specific vocab needed | ✓ | ✗ |
| Trainable end-to-end | ✓ | ✓ |
| Positional character weighting | ✗ | ✓ Gaussian |
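The sequence-length row can be illustrated directly: GAWA emits one vector per word, while a subword tokenizer emits one token per subword. A toy example with a hand-written segmentation (the splits below are illustrative only, not the output of any real BPE model):

```python
sentence = ["memakan", "makanan", "berlari"]

# GAWA: one embedding per word
gawa_len = len(sentence)

# Hypothetical subword segmentation (illustrative splits only)
subwords = {"memakan": ["me", "makan"],
            "makanan": ["makan", "an"],
            "berlari": ["ber", "lari"]}
bpe_len = sum(len(subwords[w]) for w in sentence)

print(gawa_len, bpe_len)  # 3 6
```

For morphologically rich languages the gap tends to widen, since affixed forms split into more subwords.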
- `model/`: Encoder, decoder, and core model.
- `training/`: Training loop, scheduler, and checkpointing.
- `data/`: Data prep utilities.
- `eval/`: Evaluation and encoding helpers.
- `scripts/`: CLI entrypoints.
- `configs/`: YAML configuration examples.
MIT License. See LICENSE for details.
