Lazarus NLP

Lazarus NLP is a collective initiative to revive the dying languages of Indonesia through speech and language technology.

Projects

IndoT5: T5 Language Models for the Indonesian Language

IndoT5 is a T5-based language model trained specifically for the Indonesian language. With just 8 hours of training on a limited budget, we developed a sequence-to-sequence, encoder-decoder model that can be fine-tuned for tasks such as summarization, chit-chat, and question answering. Despite these training constraints, our model is competitive when evaluated on the IndoNLG (text generation) benchmark.

Indonesian Sentence Embedding Models

We trained open-source sentence embedding models for Indonesian, enabling applications such as information retrieval (useful for retrieval-augmented generation!), semantic text similarity, and zero-shot text classification. We leverage existing pre-trained Indonesian language models like IndoBERT, state-of-the-art unsupervised techniques, and established sentence embedding benchmarks.
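As a rough illustration of how sentence embeddings support retrieval, the sketch below ranks documents by cosine similarity to a query. The vectors are made-up toy values standing in for real model outputs, and the helper names `cosine_similarity` and `rank_documents` are hypothetical, not part of any Lazarus NLP package.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (document index, score) pairs, most similar first."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional "embeddings" (assumed values; a real sentence
# embedding model would produce much higher-dimensional vectors).
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
query_vec = [0.9, 0.1, 0.0]

ranking = rank_documents(query_vec, doc_vecs)
for index, score in ranking:
    print(index, round(score, 3))
```

In a retrieval-augmented generation setup, the top-ranked documents would then be passed to a generator model as context.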

Indonesian Natural Language Inference Models

Open-source, lightweight NLI models that are competitive with larger models on the IndoNLI benchmark while using significantly fewer parameters. We applied knowledge distillation methods to small existing pre-trained language models like IndoBERT Lite. These models offer efficient solutions for tasks that require natural language inference, such as cross-encoder-based semantic search, while minimizing computational resources.
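For readers unfamiliar with knowledge distillation, here is a minimal sketch of one common formulation: the student is trained to match the teacher's temperature-softened class probabilities via a KL-divergence term. This is a generic textbook variant, assumed for illustration only; it is not necessarily the exact method used for these models, and the function names and the temperature value are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2


# Toy logits for the three NLI classes (entailment, neutral, contradiction);
# the numbers are invented for the example.
teacher_logits = [4.0, 1.0, -2.0]
student_logits = [2.5, 1.5, -1.0]

print(distillation_loss(student_logits, teacher_logits))
```

In practice this term is often combined with the usual cross-entropy loss on the gold labels, weighted by a mixing coefficient.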

Many-to-Many Multilingual Translation Models

By adapting mT5 to 45 languages of Indonesia, we developed a robust baseline model for multilingual translation. This facilitates further fine-tuning for niche domains and low-resource languages, contributing to greater linguistic inclusivity. Our models are competitive with existing multilingual translation models on the NusaX benchmark.

Pinned

  1. indonesian-sentence-embeddings (Public)

    Embedding Representation for Indonesian Sentences!

    Jupyter Notebook 4 2

  2. machine-translation (Public)

    Many-to-Many Multilingual Translation Model for Languages of Indonesia

    Python 1

  3. IndoT5 (Public)

    T5 Language Models for the Indonesian Language!

    Python 1

  4. NusaBERT (Public)

    NusaBERT: Teaching IndoBERT to be multilingual and multicultural!

    Python
