GAWA — Gaussian-Weighted Abstraction for Word Architecture

If you find this repository useful, please consider giving it a star.

Overview

GAWA is a word-level morphological autoencoder that encodes any word — including unseen or morphologically complex words — into a dense embedding vector (eword) using character-level representations weighted by a Gaussian positional prior.

Unlike subword tokenizers (BPE, WordPiece, SentencePiece), GAWA treats each word as a sequence of characters and compresses it into a single fixed-size vector. This makes it:

Language-agnostic: Works on any character-based language without a pretrained vocabulary
Morphology-aware: Positional weighting captures prefix/suffix importance naturally
Compact: The output sequence length equals the number of words, not subword tokens

GAWA is designed to plug in as the front-end morphological module of a Global Transformer, replacing the tokenizer entirely.

Citation (BibTeX)

If you use GAWA in your research, please cite it:

@misc{gawa2026,
  author = {Abdul Wahid Rukua},
  title  = {GAWA: Gaussian-Weighted Abstraction for Word Architecture},
  year   = {2026},
  publisher = {GitHub},
  url = {https://github.com/Airukua/gawa}
}

OOV Similarity (Indonesian)

GAWA can distinguish words by their morphological footprint while still placing misspellings and variants near their correct forms. Below are the nearest neighbors returned for several Indonesian OOV (out-of-vocabulary) queries.

OOV: makann
  makanan       sim=0.9846
  makan         sim=0.9779
  mkan          sim=0.8618

OOV: mkan
  mknn          sim=0.8653
  makan         sim=0.8627
  makann        sim=0.8618

OOV: berlarr
  berlari       sim=0.9857
  permaenan     sim=0.5540
  permainan     sim=0.4860

OOV: permaenan
  permainan     sim=0.9255
  memakan       sim=0.5783
  berlarr       sim=0.5540

The pretrained model was trained on Indonesian language data (~8.2 million unique words extracted from Indo4B: https://huggingface.co/datasets/taufiqdp/Indo4B).
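
The rankings above can be reproduced in spirit with a few lines of Python. This is a minimal sketch that assumes `sim` is cosine similarity between word vectors (the README does not name the metric). The helper names and the toy 3-dimensional vectors are made up for illustration and are not real GAWA embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def nearest_words(query_vec, vocab_vecs, top_k=3):
    """Rank vocabulary words by cosine similarity to the query vector."""
    scored = [(word, cosine_similarity(query_vec, vec))
              for word, vec in vocab_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors, invented for this example (assumption: not model output).
toy_vocab = {
    "makan": [0.9, 0.1, 0.0],
    "makanan": [0.8, 0.3, 0.1],
    "berlari": [0.0, 0.2, 0.9],
}
toy_query = [0.85, 0.15, 0.05]  # stand-in for an OOV embedding

for word, sim in nearest_words(toy_query, toy_vocab):
    print(f"{word:<12} sim={sim:.4f}")
```

With real embeddings, the vectors would come from the model's encoder rather than from a hand-written dictionary.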

Training Details

Decoder training: 2 epochs
Accuracy: 94%
Dataset: ~8.2 million words extracted from Indo4B
Training time: ~12 hours
Hardware: NVIDIA T4 (Kaggle)

Architecture

Input Word (characters)
        │
        ├──► Char Embedding  (trainable)
        │
        ├──► Gaussian Positional Encoding  (fixed, non-trainable)
        │         μ_j = j,   σ_j = √j
        │
        └──► Concat → Fusion MLP
                          │
                    Weighted Pooling
                    (Gaussian Prior + Learnable Δ)
                          │
                    Output Projection
                          │
                       EWORD Vector  ──────────────────────────┐
                                                               │
                                                    ┌──────────▼──────────┐
                                                    │    GAWA Decoder     │
                                                    │  Init GRU Hidden    │
                                                    │  Char Emb + Concat  │
                                                    │  GRU Cell           │
                                                    │  Cross-Attention    │
                                                    │  Residual + Logits  │
                                                    └─────────────────────┘
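
The diagram gives μ_j = j and σ_j = √j for the fixed Gaussian positional encoding but does not spell out how they are applied. One plausible reading, sketched below, is that encoding dimension j is a Gaussian bump centered at position j with variance j, evaluated at each character position. The function name, the 1-based indexing, and the exact formula are assumptions, not the repository's implementation.

```python
import math


def gaussian_positional_encoding(max_word_len, pos_enc_dim):
    """Return a max_word_len x pos_enc_dim table of fixed Gaussian features.

    Assumed form: table[pos - 1][j - 1] = exp(-(pos - mu_j)^2 / (2 * sigma_j^2))
    with mu_j = j and sigma_j = sqrt(j), for pos = 1..max_word_len
    and j = 1..pos_enc_dim.
    """
    table = []
    for pos in range(1, max_word_len + 1):
        row = []
        for j in range(1, pos_enc_dim + 1):
            mu = float(j)
            variance = float(j)  # sigma_j squared equals j
            row.append(math.exp(-((pos - mu) ** 2) / (2.0 * variance)))
        table.append(row)
    return table


pe = gaussian_positional_encoding(max_word_len=5, pos_enc_dim=4)
for row in pe:
    print(" ".join(f"{value:.3f}" for value in row))
```

Because the table depends only on `max_word_len` and `pos_enc_dim`, it can be computed once and reused, which matches the "fixed, non-trainable" label in the diagram.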

Installation

1. Install via PyPI (pip)

pip install gawa

2. Install via GitHub (pip)

pip install git+https://github.com/AiRukua/gawa.git

3. Local Development Install

git clone https://github.com/AiRukua/gawa.git
cd gawa
pip install -e .

4. Optional Dev Dependencies

pip install -e ".[dev]"

Quick Start (CLI)

1. Prepare Data

GAWA expects a word list (one word per line). You can build it from raw text:

gawa-prepare --input data/raw.txt --output data/processed/train.txt --lower
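
For readers who want to see what a one-word-per-line list looks like before running the command, here is a rough Python sketch of the same idea. It is not the `gawa-prepare` implementation: the regex tokenization, the de-duplication, and both function names are assumptions made for illustration.

```python
import re
from pathlib import Path


def build_word_list(raw_text, lower=True):
    """Extract unique words from raw text, in order of first appearance."""
    if lower:
        raw_text = raw_text.lower()
    # \w+ matches runs of Unicode letters, digits, and underscores.
    words = re.findall(r"\w+", raw_text)
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys(words))


def write_word_list(words, output_path):
    """Write one word per line, creating parent directories if needed."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(words) + "\n", encoding="utf-8")


sample = "Saya makan makanan. Dia memakan MAKANAN itu."
print(build_word_list(sample))
```

Calling `write_word_list(words, "data/processed/train.txt")` would then produce a file in the layout the training step expects.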

2. Train

Use the YAML configs in configs/:

gawa-train --config configs/gawa_small.yaml

Checkpoints are saved to the directory defined in the config (default: checkpoints/).

3. Encode Word Embeddings

gawa-encode \
  --checkpoint checkpoints/gawa_small/best.pt \
  --words "makan,memakan,makanan"

Default output is JSONL. Use --output to write to a file.
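
JSONL stores one JSON object per line, so the output can be read back with the standard library alone. The sketch below parses any iterable of lines. The field names `word` and `embedding` in the demo are hypothetical, since the README does not document the record schema.

```python
import json


def parse_jsonl(lines):
    """Parse an iterable of JSON Lines strings, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


# Demo records with hypothetical field names (assumption, not the real schema).
demo_lines = [
    '{"word": "makan", "embedding": [0.1, 0.2]}',
    "",
    '{"word": "makanan", "embedding": [0.3, 0.4]}',
]
records = parse_jsonl(demo_lines)
print(len(records), sorted(records[0].keys()))
```

An open file handle is also an iterable of lines, so `parse_jsonl(open(path, encoding="utf-8"))` works on a file written with --output. Inspecting `records[0].keys()` is a quick way to discover the actual field names.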

4. Evaluate / Reconstructions

gawa-evaluate --config configs/gawa_small.yaml --checkpoint checkpoints/gawa_small/best.pt

Quick Start (Python - Training)

from gawa import load_config, train_from_config

# 1) Load YAML config
cfg = load_config("configs/gawa_small.yaml")

# 2) Train from config (checkpoints saved to the directory in the YAML)
train_from_config(cfg)

Pretrained Model (Hugging Face)

If you want to use the pretrained GAWA model, you can load it directly from Hugging Face:

from gawa import GAWAModel

model = GAWAModel.from_pretrained("AiRukua/gawa")

kept_words, embs = model.encode_words(["makan", "memakan", "makanan"])
kept_words, recs = model.decode_words(["makan", "memakan", "makanan"])
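
A simple way to gauge reconstruction quality is the fraction of words reproduced exactly. The sketch below assumes `recs` is a list of reconstructed strings aligned with `kept_words`, which the README suggests but does not state. The function name is hypothetical, the reconstructions in the demo are invented, and this is not necessarily the metric behind the accuracy figure reported under Training Details.

```python
def exact_match_rate(words, reconstructions):
    """Fraction of words whose reconstruction is identical to the input."""
    if len(words) != len(reconstructions):
        raise ValueError("words and reconstructions must have the same length")
    if not words:
        return 0.0
    matches = sum(1 for w, r in zip(words, reconstructions) if w == r)
    return matches / len(words)


# Made-up reconstructions for illustration; real ones would come from the decoder.
demo_words = ["makan", "memakan", "makanan"]
demo_recs = ["makan", "memakan", "makanam"]
print(f"exact match: {exact_match_rate(demo_words, demo_recs):.2%}")
```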

Config Guide (YAML)

GAWA uses a YAML config file for training (see configs/). The key sections are:

data

train_path: Path to a text file with one word per line.
max_word_len: Maximum word length (characters). Words longer than this are filtered. Must match model.max_word_len.

model

char_emb_dim: Character embedding dimension.
pos_enc_dim: Gaussian positional encoding dimension.
hidden_dim: Fusion MLP & decoder GRU hidden size.
eword_dim: Output word embedding dimension.
max_word_len: Must match data.max_word_len (set both to the same value to avoid a length mismatch error).
encoder_lambda_adjust: Weight for learnable position delta.
decoder_num_layers: Number of GRU layers in the decoder.
decoder_num_heads: Number of cross-attention heads.

training

batch_size: Training batch size.
epochs: Number of training epochs.
lr: Learning rate.
sample_every: How often to log reconstructions.

Example snippet:

data:
  train_path: data/processed/train.txt
  max_word_len: 32

model:
  char_emb_dim: 64
  pos_enc_dim: 64
  hidden_dim: 256
  eword_dim: 768
  max_word_len: 32
  encoder_lambda_adjust: 0.3
  decoder_num_layers: 1
  decoder_num_heads: 2

training:
  batch_size: 256
  epochs: 20
  lr: 3.0e-4
  sample_every: 1

To train with your config:

gawa-train --config configs/gawa_small.yaml
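
Because data.max_word_len and model.max_word_len must agree, a quick check before training can catch a mismatch early. The sketch below works on a plain dictionary that mirrors the YAML layout. Whether `load_config` returns such a dictionary is not documented here, so both helper names and the dictionary shape are assumptions.

```python
def check_max_word_len(cfg):
    """Raise ValueError if the data and model sections disagree on max_word_len."""
    data_len = cfg["data"]["max_word_len"]
    model_len = cfg["model"]["max_word_len"]
    if data_len != model_len:
        raise ValueError(
            f"data.max_word_len ({data_len}) != model.max_word_len ({model_len})"
        )
    return data_len


def filter_by_length(words, max_word_len):
    """Keep only words with at most max_word_len characters."""
    return [w for w in words if len(w) <= max_word_len]


# Dictionary mirroring the example snippet (only the fields used here).
cfg = {
    "data": {"train_path": "data/processed/train.txt", "max_word_len": 32},
    "model": {"max_word_len": 32},
}
limit = check_max_word_len(cfg)
print(filter_by_length(["makan", "x" * 40], limit))
```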

Model Dimensions

Parameter	Default	Description
`char_emb_dim`	64	Character embedding size
`pos_enc_dim`	64	Gaussian PE dimension
`hidden_dim`	256	Fusion MLP & GRU hidden size
`eword_dim`	768	Output word embedding dimension
`max_word_len`	32	Maximum word length in characters
`encoder_lambda_adjust`	0.3	Weight of learnable position delta
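
To make the role of the position-delta weight (0.3 by default) concrete, here is one way a fixed prior and a learnable delta could be combined during weighted pooling. The README does not give the formula, so everything below is an assumption: per-character scores are formed as prior + λ·Δ, normalized with a softmax, and used to average the character vectors. All function and argument names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_pool(char_vectors, prior_scores, deltas, lambda_adjust=0.3):
    """Pool per-character vectors into one vector.

    Assumed scoring: score_i = prior_i + lambda_adjust * delta_i.
    """
    scores = [p + lambda_adjust * d for p, d in zip(prior_scores, deltas)]
    weights = softmax(scores)
    dim = len(char_vectors[0])
    pooled = [
        sum(w * vec[k] for w, vec in zip(weights, char_vectors))
        for k in range(dim)
    ]
    return pooled, weights


# Three characters with 2-dimensional features (toy values).
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = weighted_pool(vectors, [0.0, 0.0, 0.0], [1.0, 0.0, 0.0])
print([round(w, 3) for w in weights], [round(v, 3) for v in pooled])
```

Under this reading, setting the weight to 0 makes pooling depend on the fixed prior alone, while larger values let the learned delta shift attention toward particular character positions.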

Why GAWA?

Feature	BPE / WordPiece	GAWA
Handles unseen words	✗ (UNK/fallback)	✓ (char-based)
Morphology-aware	Partial	✓ Explicit
Sequence length	Longer (subwords)	Shorter (words)
Language-specific vocab needed	✓	✗
Trainable end-to-end	✓	✓
Positional character weighting	✗	✓ Gaussian

Project Structure

model/: Encoder, decoder, and core model.
training/: Training loop, scheduler, and checkpointing.
data/: Data prep utilities.
eval/: Evaluation and encoding helpers.
scripts/: CLI entrypoints.
configs/: YAML configuration examples.

License

MIT License. See LICENSE for details.
