
LazarusNLP/indonesian-sentence-embeddings


Indonesian Sentence Embeddings


Inspired by the Thai Sentence Vector Benchmark, we decided to embark on the journey of training Indonesian sentence embedding models!


Evaluation

Semantic Textual Similarity

We believe that a synthetic baseline is better than no baseline. Therefore, we followed the approach of the Thai Sentence Vector Benchmark project and translated the STS-B dev and test sets into Indonesian via the Google Translate API. We evaluate each model by its Spearman correlation score on the translated test set.

You can find the translated dataset on 🤗 HuggingFace Hub.
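As a rough illustration of the STS metric, the sketch below computes Spearman's rank correlation between gold similarity scores and model-predicted similarities using only the standard library. The function names and the toy scores are hypothetical and are not taken from this repository's evaluation code.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (assumed): gold STS-B style scores and cosine similarities.
gold = [5.0, 3.2, 0.5, 4.1]
predicted = [0.91, 0.40, 0.55, 0.80]
print(round(spearman(gold, predicted) * 100, 2))
```

Because only the ranking matters, the predicted similarities do not need to be on the same scale as the gold scores.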

Retrieval

To evaluate our models' capability to perform retrieval tasks, we evaluate them on the Indonesian subsets of the MIRACL and TyDiQA datasets. Both datasets test a model's ability to retrieve relevant documents given a query. We employ R@1 (top-1 accuracy), MRR@10, and nDCG@10 to measure our models' performance.
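The three retrieval metrics can be sketched for a single query as follows. This is a minimal illustration that assumes binary relevance judgments and hypothetical document IDs; the reported scores would be averages over all queries, and the actual evaluation scripts may differ in details.

```python
import math
from typing import List, Set


def recall_at_1(ranked: List[str], relevant: Set[str]) -> float:
    """Top-1 accuracy: 1.0 if the first retrieved document is relevant."""
    return 1.0 if ranked and ranked[0] in relevant else 0.0


def mrr_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """nDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical ranking for one query: the only relevant document is second.
ranked_docs = ["d3", "d1", "d7"]
relevant_docs = {"d1"}
print(recall_at_1(ranked_docs, relevant_docs))
print(mrr_at_k(ranked_docs, relevant_docs))
print(ndcg_at_k(ranked_docs, relevant_docs))
```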

Classification

For text classification, we perform emotion classification and sentiment analysis on the EmoT and SmSA subsets of IndoNLU, respectively. Following the Thai Sentence Vector Benchmark, we simply fit a linear SVC on the sentence representations of our texts and their corresponding labels. Thus, unlike conventional fine-tuning, where the backbone model is also updated, the Sentence Transformer stays frozen in our case, and only the classification head is trained.
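The frozen-encoder setup can be sketched with scikit-learn as below. To keep the example self-contained, the embeddings are synthetic clusters generated with NumPy (an assumption for illustration); in practice they would come from encoding the EmoT or SmSA texts once with a Sentence Transformer, for example via its `encode` method, and the helper names here are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.svm import LinearSVC


def fit_and_score(train_x, train_y, test_x, test_y):
    """Fit a linear SVC on fixed embeddings; return (accuracy, macro F1)."""
    classifier = LinearSVC()
    classifier.fit(train_x, train_y)
    predictions = classifier.predict(test_x)
    accuracy = accuracy_score(test_y, predictions)
    macro_f1 = f1_score(test_y, predictions, average="macro")
    return accuracy, macro_f1


def synthetic_embeddings(num_per_class, rng):
    """Hypothetical stand-in for encoder output: 3 noisy clusters in 16 dims."""
    centers = np.eye(3, 16) * 5.0
    labels = np.repeat(np.arange(3), num_per_class)
    features = centers[labels] + rng.normal(scale=0.5, size=(labels.size, 16))
    return features, labels


rng = np.random.default_rng(0)
train_x, train_y = synthetic_embeddings(50, rng)
test_x, test_y = synthetic_embeddings(20, rng)
accuracy, macro_f1 = fit_and_score(train_x, train_y, test_x, test_y)
print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f}")
```

Since the encoder is never updated, the embeddings can be computed once and reused, which makes this kind of evaluation cheap compared with full fine-tuning.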

Further, we will evaluate our models using the official MTEB code that contains two Indonesian classification subtasks: MassiveIntentClassification (id) and MassiveScenarioClassification (id).

Pair Classification

We followed MTEB's PairClassification evaluation procedure for pair classification. Specifically, for zero-shot natural language inference tasks, all neutral pairs are dropped, while contradictions and entailments are re-mapped to 0 and 1, respectively. The maximum average precision (AP) score is then obtained by searching for the best threshold value.

We leverage the IndoNLI dataset's two test subsets: test_lay and test_expert.
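The label re-mapping and the AP metric can be illustrated with the standalone sketch below. It drops neutral pairs, maps contradiction to 0 and entailment to 1, and computes AP by ranking pairs by a similarity score. The label strings, function names, and toy data are assumptions, and the sketch does not reproduce MTEB's full procedure; it only shows a single AP computation from one set of scores.

```python
import math
from typing import List, Sequence, Tuple


def to_binary_pairs(examples: List[Tuple[str, str, str]]) -> List[Tuple[str, str, int]]:
    """Drop neutral pairs; map contradiction -> 0 and entailment -> 1."""
    mapping = {"contradiction": 0, "entailment": 1}
    return [(p, h, mapping[label]) for p, h, label in examples if label in mapping]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of precision values at each positive, ranked by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for position, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / position
    return total / hits if hits else 0.0


# Toy NLI examples (assumed): (premise, hypothesis, label).
examples = [
    ("p1", "h1", "entailment"),
    ("p2", "h2", "neutral"),
    ("p3", "h3", "contradiction"),
]
print(to_binary_pairs(examples))

# Toy similarity scores for four already-binarized pairs.
print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```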

Methods

(Unsupervised) SimCSE

We followed SimCSE: Simple Contrastive Learning of Sentence Embeddings and trained a sentence embedding model in an unsupervised fashion. Unsupervised SimCSE allows us to leverage unlabeled corpora, which are plentiful, and contrastively learn sentence representations by passing each sentence through the encoder with different dropout masks. This suits the current situation, in which supervised Indonesian sentence similarity datasets are scarce, so SimCSE is a natural first step into this field. We used the Sentence Transformers implementation of SimCSE.
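To make the contrastive objective concrete, here is a minimal pure-Python sketch of the in-batch loss used by unsupervised SimCSE: each sentence is encoded twice, the two views of the same sentence form the positive pair, and the other sentences in the batch act as negatives. This is not the Sentence Transformers implementation used for training; the function names are hypothetical, the toy vectors stand in for encoder outputs, and the temperature of 0.05 is an assumed default.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def simcse_loss(view_a: List[Vector], view_b: List[Vector], temperature: float = 0.05) -> float:
    """Mean cross-entropy where view_b[i] is the positive for view_a[i]."""
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (assumed): two sentences, each encoded under two dropout masks.
first_pass = [[1.0, 0.0], [0.0, 1.0]]
second_pass = [[0.9, 0.1], [0.1, 0.9]]

aligned = simcse_loss(first_pass, second_pass)
mismatched = simcse_loss(first_pass, list(reversed(second_pass)))
print(aligned < mismatched)
```

The loss is small when each sentence is closest to its own second view and large when it is closer to another sentence's view.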

ConGen

Like SimCSE, ConGen: Unsupervised Control and Generalization Distillation For Sentence Representation is another unsupervised technique to train a sentence embedding model. Since it is in part a distillation method, ConGen relies on a teacher model whose knowledge is distilled into a student model. The original paper proposes back-translation as the best data augmentation technique. However, due to limited resources, we implemented word deletion, which was found to be on par with back-translation despite being trivial. We used the official ConGen implementation, which is written on top of the Sentence Transformers library.
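Word deletion as a data augmentation can be sketched as follows. This is a hypothetical helper, not the repository's actual augmentation code: it assumes whitespace tokenization and deletes a chosen number of randomly selected words, leaving very short sentences unchanged.

```python
import random
from typing import Optional


def delete_words(sentence: str, num_words: int = 1, rng: Optional[random.Random] = None) -> str:
    """Return a copy of the sentence with `num_words` random words removed."""
    rng = rng or random.Random()
    words = sentence.split()
    if len(words) <= num_words:
        # Too short to delete from without emptying the sentence.
        return sentence
    dropped = set(rng.sample(range(len(words)), num_words))
    return " ".join(word for i, word in enumerate(words) if i not in dropped)


original = "saya suka makan nasi goreng"
augmented = delete_words(original, num_words=1, rng=random.Random(0))
print(augmented)
```

Passing a seeded `random.Random` instance makes the augmentation reproducible, which is convenient when building a fixed augmented corpus.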

SCT

SCT: An Efficient Self-Supervised Cross-View Training For Sentence Embedding is another unsupervised technique to train a sentence embedding model. It is very similar to ConGen in its knowledge distillation methodology, but it also supports a self-supervised training procedure without a teacher model. The original paper proposes back-translation as its data augmentation technique, but we implemented single-word deletion and found it to perform better than our back-translated corpus. We used the official SCT implementation, which is written on top of the Sentence Transformers library.

Pretrained Models

| Model | #params | Base/Student Model | Teacher Model | Train Dataset | Supervised |
| --- | :---: | --- | --- | --- | :---: |
| SimCSE-IndoBERT Base | 125M | IndoBERT Base | N/A | Wikipedia | |
| ConGen-IndoBERT Lite Base | 12M | IndoBERT Lite Base | paraphrase-multilingual-mpnet-base-v2 | Wikipedia | |
| ConGen-IndoBERT Base | 125M | IndoBERT Base | paraphrase-multilingual-mpnet-base-v2 | Wikipedia | |
| ConGen-SimCSE-IndoBERT Base | 125M | SimCSE-IndoBERT Base | paraphrase-multilingual-mpnet-base-v2 | Wikipedia | |
| ConGen-Indo-e5 Small | 118M | multilingual-e5-small | paraphrase-multilingual-mpnet-base-v2 | Wikipedia | |
| SCT-IndoBERT Base | 125M | IndoBERT Base | paraphrase-multilingual-mpnet-base-v2 | Wikipedia | |
| all-IndoBERT Base | 125M | IndoBERT Base | N/A | See: README | |
| all-IndoBERT Base-v2 | 125M | IndoBERT Base | N/A | See: README | |
| all-Indo-e5 Small-v2 | 118M | multilingual-e5-small | N/A | See: README | |
| all-Indo-e5 Small-v3 | 118M | multilingual-e5-small | N/A | See: README | |
| distiluse-base-multilingual-cased-v2 | 134M | DistilBERT Base Multilingual | mUSE | See: SBERT | |
| paraphrase-multilingual-mpnet-base-v2 | 125M | XLM-RoBERTa Base | paraphrase-mpnet-base-v2 | See: SBERT | |
| multilingual-e5-small | 118M | Multilingual-MiniLM-L12-H384 | See: arXiv | See: 🤗 | |
| multilingual-e5-base | 278M | XLM-RoBERTa Base | See: arXiv | See: 🤗 | |
| multilingual-e5-large | 560M | XLM-RoBERTa Large | See: arXiv | See: 🤗 | |

Deprecated Models

| Model                                                                                    | #params | Base/Student Model                                                                | Teacher Model | Train Dataset                                                                 | Supervised |
| ---------------------------------------------------------------------------------------- | :-----: | --------------------------------------------------------------------------------- | ------------- | ----------------------------------------------------------------------------- | :--------: |
| [SimCSE-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/simcse-indobert-lite-base) |   12M   | [IndoBERT Lite Base](https://huggingface.co/indobenchmark/indobert-lite-base-p1)  | N/A           | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) |            |
| [SimCSE-IndoRoBERTa Base](https://huggingface.co/LazarusNLP/simcse-indoroberta-base)     |  125M   | [IndoRoBERTa Base](https://huggingface.co/flax-community/indonesian-roberta-base) | N/A           | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) |            |
| [S-IndoBERT Base mMARCO](https://huggingface.co/LazarusNLP/s-indobert-base-mmarco)       |  125M   | [IndoBERT Base](https://huggingface.co/indobenchmark/indobert-base-p1)            | N/A           | [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco)                   |     ✅      |
| [all-IndoBERT Base p2](https://huggingface.co/LazarusNLP/all-indobert-base-p2)           |  125M   | [IndoBERT Base p2](https://huggingface.co/indobenchmark/indobert-base-p2)         | N/A           | See: [README](./training/all/)                                                |     ✅      |

Results

Semantic Textual Similarity

Machine Translated Indonesian STS-B

| Model | Spearman's Correlation (%) ↑ |
| --- | :---: |
| SimCSE-IndoBERT Base | 70.13 |
| ConGen-IndoBERT Lite Base | 79.97 |
| ConGen-IndoBERT Base | 80.47 |
| ConGen-SimCSE-IndoBERT Base | 81.16 |
| ConGen-Indo-e5 Small | 80.94 |
| SCT-IndoBERT Base | 74.56 |
| all-IndoBERT Base | 73.84 |
| all-IndoBERT Base-v2 | 76.03 |
| all-Indo-e5 Small-v2 | 79.57 |
| all-Indo-e5 Small-v3 | 79.95 |
| distiluse-base-multilingual-cased-v2 | 75.08 |
| paraphrase-multilingual-mpnet-base-v2 | 83.83 |
| multilingual-e5-small | 78.89 |
| multilingual-e5-base | 79.72 |
| multilingual-e5-large | 79.44 |

Retrieval

MIRACL

| Model | R@1 (%) ↑ | MRR@10 (%) ↑ | nDCG@10 (%) ↑ |
| --- | :---: | :---: | :---: |
| SimCSE-IndoBERT Base | 36.04 | 48.25 | 39.70 |
| ConGen-IndoBERT Lite Base | 46.04 | 59.06 | 51.01 |
| ConGen-IndoBERT Base | 45.93 | 58.58 | 49.95 |
| ConGen-SimCSE-IndoBERT Base | 45.83 | 58.27 | 49.91 |
| ConGen-Indo-e5 Small | 55.00 | 66.74 | 58.95 |
| SCT-IndoBERT Base | 40.41 | 47.29 | 40.68 |
| all-IndoBERT Base | 65.52 | 75.92 | 70.13 |
| all-IndoBERT Base-v2 | 67.18 | 76.59 | 70.16 |
| all-Indo-e5 Small-v2 | 68.33 | 78.33 | 73.04 |
| all-Indo-e5 Small-v3 | 68.12 | 78.22 | 73.09 |
| distiluse-base-multilingual-cased-v2 | 41.35 | 54.93 | 48.79 |
| paraphrase-multilingual-mpnet-base-v2 | 52.81 | 65.07 | 57.97 |
| multilingual-e5-small | 70.20 | 79.61 | 74.80 |
| multilingual-e5-base | 70.00 | 79.50 | 75.16 |
| multilingual-e5-large | 70.83 | 80.58 | 76.16 |

TyDiQA

| Model | R@1 (%) ↑ | MRR@10 (%) ↑ | nDCG@10 (%) ↑ |
| --- | :---: | :---: | :---: |
| SimCSE-IndoBERT Base | 61.94 | 69.89 | 73.52 |
| ConGen-IndoBERT Lite Base | 75.22 | 81.55 | 84.13 |
| ConGen-IndoBERT Base | 73.09 | 80.32 | 83.29 |
| ConGen-SimCSE-IndoBERT Base | 72.38 | 79.37 | 82.51 |
| ConGen-Indo-e5 Small | 84.60 | 89.30 | 91.27 |
| SCT-IndoBERT Base | 76.81 | 83.16 | 85.87 |
| all-IndoBERT Base | 88.14 | 91.47 | 92.91 |
| all-IndoBERT Base-v2 | 87.61 | 90.91 | 92.31 |
| all-Indo-e5 Small-v2 | 93.27 | 95.63 | 96.46 |
| all-Indo-e5 Small-v3 | 93.27 | 95.72 | 96.58 |
| distiluse-base-multilingual-cased-v2 | 70.44 | 77.94 | 81.56 |
| paraphrase-multilingual-mpnet-base-v2 | 81.41 | 87.05 | 89.44 |
| multilingual-e5-small | 91.50 | 94.34 | 95.39 |
| multilingual-e5-base | 93.45 | 95.88 | 96.69 |
| multilingual-e5-large | 94.69 | 96.71 | 97.44 |

Classification

MTEB - Massive Intent Classification (id)

| Model | Accuracy (%) ↑ | F1 Macro (%) ↑ |
| --- | :---: | :---: |
| SimCSE-IndoBERT Base | 59.71 | 57.70 |
| ConGen-IndoBERT Lite Base | 62.41 | 60.94 |
| ConGen-IndoBERT Base | 61.14 | 60.02 |
| ConGen-SimCSE-IndoBERT Base | 60.93 | 59.50 |
| ConGen-Indo-e5 Small | 62.92 | 60.18 |
| SCT-IndoBERT Base | 55.66 | 54.48 |
| all-IndoBERT Base | 58.40 | 57.21 |
| all-IndoBERT Base-v2 | 58.31 | 57.11 |
| all-Indo-e5 Small-v2 | 61.51 | 59.24 |
| all-Indo-e5 Small-v3 | 61.63 | 59.29 |
| distiluse-base-multilingual-cased-v2 | 55.99 | 52.44 |
| paraphrase-multilingual-mpnet-base-v2 | 65.43 | 63.55 |
| multilingual-e5-small | 64.16 | 61.33 |
| multilingual-e5-base | 66.63 | 63.88 |
| multilingual-e5-large | 70.04 | 67.66 |

MTEB - Massive Scenario Classification (id)

| Model | Accuracy (%) ↑ | F1 Macro (%) ↑ |
| --- | :---: | :---: |
| SimCSE-IndoBERT Base | 66.14 | 65.56 |
| ConGen-IndoBERT Lite Base | 67.25 | 66.53 |
| ConGen-IndoBERT Base | 67.72 | 67.32 |
| ConGen-SimCSE-IndoBERT Base | 67.12 | 66.64 |
| ConGen-Indo-e5 Small | 66.92 | 66.29 |
| SCT-IndoBERT Base | 61.89 | 60.97 |
| all-IndoBERT Base | 66.37 | 66.31 |
| all-IndoBERT Base-v2 | 66.02 | 65.97 |
| all-Indo-e5 Small-v2 | 67.02 | 66.86 |
| all-Indo-e5 Small-v3 | 67.27 | 67.13 |
| distiluse-base-multilingual-cased-v2 | 65.25 | 63.45 |
| paraphrase-multilingual-mpnet-base-v2 | 70.72 | 70.58 |
| multilingual-e5-small | 67.92 | 67.23 |
| multilingual-e5-base | 70.70 | 70.26 |
| multilingual-e5-large | 74.11 | 73.82 |

IndoNLU - Emotion Classification (EmoT)

| Model | Accuracy (%) ↑ | F1 Macro (%) ↑ |
| --- | :---: | :---: |
| SimCSE-IndoBERT Base | 55.45 | 55.78 |
| ConGen-IndoBERT Lite Base | 58.18 | 58.84 |
| ConGen-IndoBERT Base | 57.04 | 57.06 |
| ConGen-SimCSE-IndoBERT Base | 59.54 | 60.37 |
| ConGen-Indo-e5 Small | 60.00 | 60.52 |
| SCT-IndoBERT Base | 61.13 | 61.70 |
| all-IndoBERT Base | 57.27 | 57.47 |
| all-IndoBERT Base-v2 | 58.86 | 59.31 |
| all-Indo-e5 Small-v2 | 58.18 | 57.99 |
| all-Indo-e5 Small-v3 | 56.81 | 56.46 |
| distiluse-base-multilingual-cased-v2 | 63.63 | 64.13 |
| paraphrase-multilingual-mpnet-base-v2 | 63.18 | 63.78 |
| multilingual-e5-small | 64.54 | 65.04 |
| multilingual-e5-base | 68.63 | 69.07 |
| multilingual-e5-large | 74.77 | 74.66 |

IndoNLU - Sentiment Analysis (SmSA)

| Model | Accuracy (%) ↑ | F1 Macro (%) ↑ |
| --- | :---: | :---: |
| SimCSE-IndoBERT Base | 85.6 | 81.50 |
| ConGen-IndoBERT Lite Base | 81.2 | 75.59 |
| ConGen-IndoBERT Base | 85.4 | 82.12 |
| ConGen-SimCSE-IndoBERT Base | 83.0 | 78.74 |
| ConGen-Indo-e5 Small | 84.2 | 80.21 |
| SCT-IndoBERT Base | 82.0 | 76.92 |
| all-IndoBERT Base | 84.4 | 79.79 |
| all-IndoBERT Base-v2 | 83.4 | 79.04 |
| all-Indo-e5 Small-v2 | 82.0 | 78.15 |
| all-Indo-e5 Small-v3 | 82.6 | 78.98 |
| distiluse-base-multilingual-cased-v2 | 78.8 | 73.64 |
| paraphrase-multilingual-mpnet-base-v2 | 89.6 | 86.56 |
| multilingual-e5-small | 83.6 | 79.51 |
| multilingual-e5-base | 89.4 | 86.22 |
| multilingual-e5-large | 90.0 | 86.50 |

Pair Classification

IndoNLI

| Model | `test_lay` AP (%) ↑ | `test_expert` AP (%) ↑ |
| --- | :---: | :---: |
| SimCSE-IndoBERT Base | 56.06 | 50.72 |
| ConGen-IndoBERT Lite Base | 69.44 | 53.74 |
| ConGen-IndoBERT Base | 71.14 | 56.35 |
| ConGen-SimCSE-IndoBERT Base | 70.80 | 56.59 |
| ConGen-Indo-e5 Small | 70.51 | 55.67 |
| SCT-IndoBERT Base | 59.82 | 53.41 |
| all-IndoBERT Base | 72.01 | 56.79 |
| all-IndoBERT Base-v2 | 71.36 | 56.83 |
| all-Indo-e5 Small-v2 | 76.29 | 57.05 |
| all-Indo-e5 Small-v3 | 75.21 | 56.62 |
| distiluse-base-multilingual-cased-v2 | 58.48 | 50.50 |
| paraphrase-multilingual-mpnet-base-v2 | 74.87 | 57.96 |
| multilingual-e5-small | 63.97 | 51.85 |
| multilingual-e5-base | 60.25 | 50.91 |
| multilingual-e5-large | 61.39 | 51.62 |

Credits

Indonesian Sentence Embeddings is developed with love by: