Indonesian Sentence Embeddings

Inspired by the Thai Sentence Vector Benchmark, we decided to embark on the journey of training Indonesian sentence embedding models!

Evaluation

Semantic Textual Similarity

We believe that a synthetic baseline is better than no baseline. Therefore, we followed the approach taken in the Thai Sentence Vector Benchmark project and translated the STS-B dev and test sets to Indonesian via the Google Translate API. We use the translated test set to evaluate our models' Spearman correlation scores.
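As a rough illustration of the metric, the sketch below scores sentence pairs by cosine similarity and computes Spearman's correlation against gold similarity scores. It is a minimal standard-library version, not the evaluation code used in this project; the function names and the toy embeddings in the usage note are assumptions made for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def sts_spearman(embeddings_a, embeddings_b, gold_scores):
    """Correlate predicted cosine similarities with gold similarity scores."""
    predicted = [cosine_similarity(u, v) for u, v in zip(embeddings_a, embeddings_b)]
    return spearman(predicted, gold_scores)
```

Calling `sts_spearman([[1, 0], [1, 0], [1, 0]], [[1, 0], [1, 1], [0, 1]], [5.0, 3.0, 0.5])` on these toy vectors gives a correlation of about 1.0, since the predicted similarities rank the pairs in the same order as the gold scores.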

You can find the translated dataset on 🤗 HuggingFace Hub.

Retrieval

To evaluate our models' capability to perform retrieval tasks, we evaluate them on the Indonesian subsets of the MIRACL and TyDiQA datasets. Both datasets test a model's ability to retrieve relevant documents given a query. We employ R@1 (top-1 accuracy), MRR@10, and nDCG@10 to measure our models' performance.
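For readers unfamiliar with these metrics, here is a minimal per-query sketch. It assumes binary relevance (a document is either relevant or not) and that `ranked_ids` is the list of document IDs sorted by descending similarity to the query; the function names are illustrative and are not taken from the project's evaluation scripts. The reported numbers would be averages of these per-query values.

```python
import math


def recall_at_1(ranked_ids, relevant_ids):
    """Top-1 accuracy: 1.0 if the highest-ranked document is relevant."""
    return 1.0 if ranked_ids and ranked_ids[0] in relevant_ids else 0.0


def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """nDCG@k with binary gains: DCG of the ranking over the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

For example, with `ranked_ids = ["d3", "d1", "d7"]` and `relevant_ids = {"d1"}`, the relevant document sits at rank 2, so R@1 is 0.0, MRR@10 is 0.5, and nDCG@10 is 1 / log2(3).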

Classification

For text classification, we perform emotion classification and sentiment analysis on the EmoT and SmSA subsets of IndoNLU, respectively. We follow the same approach as the Thai Sentence Vector Benchmark and simply fit a Linear SVC on the sentence representations of our texts with their corresponding labels. Thus, unlike conventional fine-tuning, where the backbone model is also updated, the Sentence Transformer stays frozen in our case, and only the classification head is trained.

Further, we evaluate our models using the official MTEB code, which contains two Indonesian classification subtasks: MassiveIntentClassification (id) and MassiveScenarioClassification (id).

Pair Classification

We followed MTEB's PairClassification evaluation procedure for pair classification. Specifically, for zero-shot natural language inference tasks, all neutral pairs are dropped, while contradictions and entailments are re-mapped to 0s and 1s. The maximum average precision (AP) score is then obtained by searching for the best threshold value.

We leverage the IndoNLI dataset's two test subsets: test_lay and test_expert.
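The label re-mapping and the AP computation can be sketched as follows. This is a simplified standard-library illustration, not MTEB's implementation: the string label names are assumptions (the dataset may encode labels differently), ties between scores are not treated specially, and MTEB's threshold search and choice of similarity function are not reproduced here.

```python
def to_binary_pairs(examples):
    """Drop neutral pairs; map contradiction to 0 and entailment to 1.

    `examples` is a list of (premise, hypothesis, label) tuples with string
    labels. The label names are assumed for this sketch.
    """
    label_map = {"contradiction": 0, "entailment": 1}
    return [
        (premise, hypothesis, label_map[label])
        for premise, hypothesis, label in examples
        if label in label_map
    ]


def average_precision(labels, scores):
    """Mean of precision@rank taken at the rank of each positive pair."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0
```

Here `scores` would be the similarity between the premise and hypothesis embeddings of each remaining pair, so a model scores well when entailment pairs receive higher similarity than contradiction pairs.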

Methods

(Unsupervised) SimCSE

We followed SimCSE: Simple Contrastive Learning of Sentence Embeddings and trained a sentence embedding model in an unsupervised fashion. Unsupervised SimCSE allows us to leverage unlabeled corpora, which are plentiful, and contrastively learn sentence representations by encoding each sentence with different dropout masks. This fits the current situation, in which supervised Indonesian sentence similarity datasets are scarce, so SimCSE is a natural first move into this field. We used the Sentence Transformers implementation of SimCSE.
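To make the contrastive objective concrete, the sketch below computes an in-batch contrastive (InfoNCE-style) loss over two views of the same batch. In unsupervised SimCSE the two views come from encoding each sentence twice with different dropout masks; here they are simply passed in as lists of vectors. The function names and the temperature value are assumptions for illustration, and this is not the Sentence Transformers code that was actually used for training.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def simcse_loss(view_a, view_b, temperature=0.05):
    """Mean cross-entropy where view_b[i] is the positive for view_a[i].

    All other rows of view_b act as in-batch negatives for view_a[i].
    """
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine_similarity(anchor, other) / temperature for other in view_b]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

The loss is small when each sentence's two views are closer to each other than to the views of other sentences in the batch, and it grows when they are confused with other sentences.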

ConGen

Like SimCSE, ConGen: Unsupervised Control and Generalization Distillation For Sentence Representation is another unsupervised technique to train a sentence embedding model. Since it is in part a distillation method, ConGen relies on a teacher model, which is distilled into a student model. The original paper proposes back-translation as the best data augmentation technique. However, due to limited resources, we implemented word deletion, which was found to be on par with back-translation despite being trivial. We used the official ConGen implementation, which is written on top of the Sentence Transformers library.

SCT

SCT: An Efficient Self-Supervised Cross-View Training For Sentence Embedding is another unsupervised technique to train a sentence embedding model. It is very similar to ConGen in its knowledge distillation methodology, but it also supports a self-supervised training procedure without a teacher model. The original paper proposes back-translation as its data augmentation technique, but we implemented single-word deletion and found it to perform better than our back-translated corpus. We used the official SCT implementation, which is written on top of the Sentence Transformers library.
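Single-word deletion can be sketched in a few lines. The version below assumes whitespace tokenization and uniform random choice of the word to drop; both are assumptions for illustration, and the function name is hypothetical. The augmentation code actually used in training may differ.

```python
import random


def delete_one_word(sentence, rng=random):
    """Return a copy of `sentence` with one randomly chosen word removed.

    Sentences with fewer than two words are returned unchanged so that the
    augmented text is never empty.
    """
    words = sentence.split()
    if len(words) < 2:
        return sentence
    drop = rng.randrange(len(words))
    return " ".join(words[:drop] + words[drop + 1:])
```

Passing a seeded generator, for example `delete_one_word("saya suka makan nasi goreng", random.Random(0))`, makes the augmentation reproducible across runs.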

Pretrained Models

Model	#params	Base/Student Model	Teacher Model	Train Dataset	Supervised
SimCSE-IndoBERT Base	125M	IndoBERT Base	N/A	Wikipedia
ConGen-IndoBERT Lite Base	12M	IndoBERT Lite Base	paraphrase-multilingual-mpnet-base-v2	Wikipedia
ConGen-IndoBERT Base	125M	IndoBERT Base	paraphrase-multilingual-mpnet-base-v2	Wikipedia
ConGen-SimCSE-IndoBERT Base	125M	SimCSE-IndoBERT Base	paraphrase-multilingual-mpnet-base-v2	Wikipedia
ConGen-Indo-e5 Small	118M	multilingual-e5-small	paraphrase-multilingual-mpnet-base-v2	Wikipedia
SCT-IndoBERT Base	125M	IndoBERT Base	paraphrase-multilingual-mpnet-base-v2	Wikipedia
all-IndoBERT Base	125M	IndoBERT Base	N/A	See: README	✅
all-IndoBERT Base-v2	125M	IndoBERT Base	N/A	See: README	✅
all-Indo-e5 Small-v2	118M	multilingual-e5-small	N/A	See: README	✅
all-Indo-e5 Small-v3	118M	multilingual-e5-small	N/A	See: README	✅
distiluse-base-multilingual-cased-v2	134M	DistilBERT Base Multilingual	mUSE	See: SBERT	✅
paraphrase-multilingual-mpnet-base-v2	125M	XLM-RoBERTa Base	paraphrase-mpnet-base-v2	See: SBERT	✅
multilingual-e5-small	118M	Multilingual-MiniLM-L12-H384	See: arXiv	See: 🤗	✅
multilingual-e5-base	278M	XLM-RoBERTa Base	See: arXiv	See: 🤗	✅
multilingual-e5-large	560M	XLM-RoBERTa Large	See: arXiv	See: 🤗	✅

??? example "Deprecated Models"

| Model                                                                                    | #params | Base/Student Model                                                                | Teacher Model | Train Dataset                                                                 | Supervised |
| ---------------------------------------------------------------------------------------- | :-----: | --------------------------------------------------------------------------------- | ------------- | ----------------------------------------------------------------------------- | :--------: |
| [SimCSE-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/simcse-indobert-lite-base) |   12M   | [IndoBERT Lite Base](https://huggingface.co/indobenchmark/indobert-lite-base-p1)  | N/A           | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) |            |
| [SimCSE-IndoRoBERTa Base](https://huggingface.co/LazarusNLP/simcse-indoroberta-base)     |  125M   | [IndoRoBERTa Base](https://huggingface.co/flax-community/indonesian-roberta-base) | N/A           | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) |            |
| [S-IndoBERT Base mMARCO](https://huggingface.co/LazarusNLP/s-indobert-base-mmarco)       |  125M   | [IndoBERT Base](https://huggingface.co/indobenchmark/indobert-base-p1)            | N/A           | [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco)                   |     ✅      |
| [all-IndoBERT Base p2](https://huggingface.co/LazarusNLP/all-indobert-base-p2)           |  125M   | [IndoBERT Base p2](https://huggingface.co/indobenchmark/indobert-base-p2)         | N/A           | See: [README](./training/all/)                                                |     ✅      |

Results

Semantic Textual Similarity

Machine Translated Indonesian STS-B

Model	Spearman's Correlation (%) ↑
SimCSE-IndoBERT Base	70.13
ConGen-IndoBERT Lite Base	79.97
ConGen-IndoBERT Base	80.47
ConGen-SimCSE-IndoBERT Base	81.16
ConGen-Indo-e5 Small	80.94
SCT-IndoBERT Base	74.56
all-IndoBERT Base	73.84
all-IndoBERT Base-v2	76.03
all-Indo-e5 Small-v2	79.57
all-Indo-e5 Small-v3	79.95
distiluse-base-multilingual-cased-v2	75.08
paraphrase-multilingual-mpnet-base-v2	83.83
multilingual-e5-small	78.89
multilingual-e5-base	79.72
multilingual-e5-large	79.44

Retrieval

MIRACL

Model	R@1 (%) ↑	MRR@10 (%) ↑	nDCG@10 (%) ↑
SimCSE-IndoBERT Base	36.04	48.25	39.70
ConGen-IndoBERT Lite Base	46.04	59.06	51.01
ConGen-IndoBERT Base	45.93	58.58	49.95
ConGen-SimCSE-IndoBERT Base	45.83	58.27	49.91
ConGen-Indo-e5 Small	55.00	66.74	58.95
SCT-IndoBERT Base	40.41	47.29	40.68
all-IndoBERT Base	65.52	75.92	70.13
all-IndoBERT Base-v2	67.18	76.59	70.16
all-Indo-e5 Small-v2	68.33	78.33	73.04
all-Indo-e5 Small-v3	68.12	78.22	73.09
distiluse-base-multilingual-cased-v2	41.35	54.93	48.79
paraphrase-multilingual-mpnet-base-v2	52.81	65.07	57.97
multilingual-e5-small	70.20	79.61	74.80
multilingual-e5-base	70.00	79.50	75.16
multilingual-e5-large	70.83	80.58	76.16

TyDiQA

Model	R@1 (%) ↑	MRR@10 (%) ↑	nDCG@10 (%) ↑
SimCSE-IndoBERT Base	61.94	69.89	73.52
ConGen-IndoBERT Lite Base	75.22	81.55	84.13
ConGen-IndoBERT Base	73.09	80.32	83.29
ConGen-SimCSE-IndoBERT Base	72.38	79.37	82.51
ConGen-Indo-e5 Small	84.60	89.30	91.27
SCT-IndoBERT Base	76.81	83.16	85.87
all-IndoBERT Base	88.14	91.47	92.91
all-IndoBERT Base-v2	87.61	90.91	92.31
all-Indo-e5 Small-v2	93.27	95.63	96.46
all-Indo-e5 Small-v3	93.27	95.72	96.58
distiluse-base-multilingual-cased-v2	70.44	77.94	81.56
paraphrase-multilingual-mpnet-base-v2	81.41	87.05	89.44
multilingual-e5-small	91.50	94.34	95.39
multilingual-e5-base	93.45	95.88	96.69
multilingual-e5-large	94.69	96.71	97.44

Classification

MTEB - Massive Intent Classification `(id)`

Model	Accuracy (%) ↑	F1 Macro (%) ↑
SimCSE-IndoBERT Base	59.71	57.70
ConGen-IndoBERT Lite Base	62.41	60.94
ConGen-IndoBERT Base	61.14	60.02
ConGen-SimCSE-IndoBERT Base	60.93	59.50
ConGen-Indo-e5 Small	62.92	60.18
SCT-IndoBERT Base	55.66	54.48
all-IndoBERT Base	58.40	57.21
all-IndoBERT Base-v2	58.31	57.11
all-Indo-e5 Small-v2	61.51	59.24
all-Indo-e5 Small-v3	61.63	59.29
distiluse-base-multilingual-cased-v2	55.99	52.44
paraphrase-multilingual-mpnet-base-v2	65.43	63.55
multilingual-e5-small	64.16	61.33
multilingual-e5-base	66.63	63.88
multilingual-e5-large	70.04	67.66

MTEB - Massive Scenario Classification `(id)`

Model	Accuracy (%) ↑	F1 Macro (%) ↑
SimCSE-IndoBERT Base	66.14	65.56
ConGen-IndoBERT Lite Base	67.25	66.53
ConGen-IndoBERT Base	67.72	67.32
ConGen-SimCSE-IndoBERT Base	67.12	66.64
ConGen-Indo-e5 Small	66.92	66.29
SCT-IndoBERT Base	61.89	60.97
all-IndoBERT Base	66.37	66.31
all-IndoBERT Base-v2	66.02	65.97
all-Indo-e5 Small-v2	67.02	66.86
all-Indo-e5 Small-v3	67.27	67.13
distiluse-base-multilingual-cased-v2	65.25	63.45
paraphrase-multilingual-mpnet-base-v2	70.72	70.58
multilingual-e5-small	67.92	67.23
multilingual-e5-base	70.70	70.26
multilingual-e5-large	74.11	73.82

IndoNLU - Emotion Classification (EmoT)

Model	Accuracy (%) ↑	F1 Macro (%) ↑
SimCSE-IndoBERT Base	55.45	55.78
ConGen-IndoBERT Lite Base	58.18	58.84
ConGen-IndoBERT Base	57.04	57.06
ConGen-SimCSE-IndoBERT Base	59.54	60.37
ConGen-Indo-e5 Small	60.00	60.52
SCT-IndoBERT Base	61.13	61.70
all-IndoBERT Base	57.27	57.47
all-IndoBERT Base-v2	58.86	59.31
all-Indo-e5 Small-v2	58.18	57.99
all-Indo-e5 Small-v3	56.81	56.46
distiluse-base-multilingual-cased-v2	63.63	64.13
paraphrase-multilingual-mpnet-base-v2	63.18	63.78
multilingual-e5-small	64.54	65.04
multilingual-e5-base	68.63	69.07
multilingual-e5-large	74.77	74.66

IndoNLU - Sentiment Analysis (SmSA)

Model	Accuracy (%) ↑	F1 Macro (%) ↑
SimCSE-IndoBERT Base	85.6	81.50
ConGen-IndoBERT Lite Base	81.2	75.59
ConGen-IndoBERT Base	85.4	82.12
ConGen-SimCSE-IndoBERT Base	83.0	78.74
ConGen-Indo-e5 Small	84.2	80.21
SCT-IndoBERT Base	82.0	76.92
all-IndoBERT Base	84.4	79.79
all-IndoBERT Base-v2	83.4	79.04
all-Indo-e5 Small-v2	82.0	78.15
all-Indo-e5 Small-v3	82.6	78.98
distiluse-base-multilingual-cased-v2	78.8	73.64
paraphrase-multilingual-mpnet-base-v2	89.6	86.56
multilingual-e5-small	83.6	79.51
multilingual-e5-base	89.4	86.22
multilingual-e5-large	90.0	86.50

Pair Classification

IndoNLI

Model	`test_lay` AP (%) ↑	`test_expert` AP (%) ↑
SimCSE-IndoBERT Base	56.06	50.72
ConGen-IndoBERT Lite Base	69.44	53.74
ConGen-IndoBERT Base	71.14	56.35
ConGen-SimCSE-IndoBERT Base	70.80	56.59
ConGen-Indo-e5 Small	70.51	55.67
SCT-IndoBERT Base	59.82	53.41
all-IndoBERT Base	72.01	56.79
all-IndoBERT Base-v2	71.36	56.83
all-Indo-e5 Small-v2	76.29	57.05
all-Indo-e5 Small-v3	75.21	56.62
distiluse-base-multilingual-cased-v2	58.48	50.50
paraphrase-multilingual-mpnet-base-v2	74.87	57.96
multilingual-e5-small	63.97	51.85
multilingual-e5-base	60.25	50.91
multilingual-e5-large	61.39	51.62

Credits

Indonesian Sentence Embeddings is developed with love by:

