# Indonesian Sentence Embeddings

Inspired by the [Thai Sentence Vector Benchmark](https://github.com/mrpeerat/Thai-Sentence-Vector-Benchmark), we decided to embark on the journey of training Indonesian sentence embedding models!

To the best of our knowledge, there is no official benchmark for Indonesian sentence embeddings. We hope this repository can serve as a benchmark for future research on Indonesian sentence embeddings.

## Evaluation

### Machine Translated STS-B

We believe that a synthetic baseline is better than no baseline. Therefore, we followed the approach used in the Thai Sentence Vector Benchmark project and translated the [STS-B](https://github.com/facebookresearch/SentEval) dev and test sets to Indonesian via the Google Translate API. We use the translated test set to evaluate each model's Spearman correlation score.

> You can find the translated dataset on the [🤗 HuggingFace Hub](https://huggingface.co/datasets/LazarusNLP/stsb_mt_id).
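
For reference, below is a minimal sketch of how a model can be scored on the translated test set with Sentence Transformers and SciPy. The dataset column names (`text_1`, `text_2`, `correlation`) and the model checkpoint are assumptions for illustration; verify them against the dataset card before running.

```python
from datasets import load_dataset
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util

# Load the machine-translated STS-B test split
# NOTE: column names are assumed; check the dataset card
test = load_dataset("LazarusNLP/stsb_mt_id", split="test")

model = SentenceTransformer("LazarusNLP/congen-indobert-base")
emb1 = model.encode(test["text_1"], convert_to_tensor=True)
emb2 = model.encode(test["text_2"], convert_to_tensor=True)

# Spearman correlation between pairwise cosine similarities and gold scores
cosine_scores = util.cos_sim(emb1, emb2).diagonal().cpu().numpy()
print(f"Spearman: {spearmanr(cosine_scores, test['correlation']).correlation * 100:.2f}")
```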

Moreover, we will further evaluate the transferability of our models on downstream tasks (e.g. text classification, natural language inference, etc.) and compare them with existing pre-trained language models (PLMs).

### Text Classification

For text classification, we perform emotion classification and sentiment analysis on the EmoT and SmSA subsets of [IndoNLU](https://huggingface.co/datasets/indonlp/indonlu), respectively. Following the approach of the Thai Sentence Vector Benchmark, we simply fit a Linear SVC on the sentence representations of our texts and their corresponding labels, as sketched below. Thus, unlike conventional fine-tuning, where the backbone model is also updated, the Sentence Transformer stays frozen in our case; only the classification head is trained.
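
A minimal sketch of this frozen-encoder setup on EmoT, assuming the subset's column names are `tweet` and `label` (check the IndoNLU dataset card) and using one of our checkpoints as an example:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.svm import LinearSVC

# Column names ("tweet", "label") are assumptions; check the dataset card
dataset = load_dataset("indonlp/indonlu", "emot")
model = SentenceTransformer("LazarusNLP/congen-indobert-base")

# The encoder stays frozen: encode() only runs inference, no gradient updates
X_train = model.encode(dataset["train"]["tweet"])
X_test = model.encode(dataset["test"]["tweet"])

# Only this linear classification head is trained
clf = LinearSVC().fit(X_train, dataset["train"]["label"])
preds = clf.predict(X_test)

print("Accuracy:", accuracy_score(dataset["test"]["label"], preds))
print("F1 Macro:", f1_score(dataset["test"]["label"], preds, average="macro"))
```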

## Methods

### (Unsupervised) SimCSE

We followed [SimCSE: Simple Contrastive Learning of Sentence Embeddings](https://arxiv.org/abs/2104.08821) and trained a sentence embedding model in an unsupervised fashion. Unsupervised SimCSE allows us to leverage unsupervised corpora -- which are plentiful -- by encoding each sentence twice with different dropout masks and contrastively learning sentence representations from the resulting pairs. Since supervised Indonesian sentence similarity datasets are scarce, SimCSE is a natural first step into this field. We used the [Sentence Transformers implementation](https://www.sbert.net/examples/unsupervised_learning/README.html#simcse) of [SimCSE](https://github.com/princeton-nlp/SimCSE); a training sketch follows.
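
The sketch below follows the unsupervised SimCSE recipe as exposed by Sentence Transformers: each sentence is paired with itself, and because dropout is active during training, the two encodings differ, forming the positive pair that `MultipleNegativesRankingLoss` contrasts against in-batch negatives. The base checkpoint, example sentences, and hyperparameters here are illustrative, not our exact training configuration.

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses, models

# Build a Sentence Transformer from a pre-trained Indonesian encoder
word_embedding = models.Transformer("indobenchmark/indobert-base-p1", max_seq_length=32)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding, pooling])

# Unsupervised corpus: each sentence is paired with itself; dropout
# provides the two distinct "views" that form a positive pair
sentences = ["Ibu kota Indonesia adalah Jakarta.", "Saya suka nasi goreng."]
train_examples = [InputExample(texts=[s, s]) for s in sentences]

train_dataloader = DataLoader(train_examples, batch_size=128, shuffle=True)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```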

### ConGen

Like SimCSE, [ConGen: Unsupervised Control and Generalization Distillation For Sentence Representation](https://github.com/KornWtp/ConGen) is an unsupervised technique for training a sentence embedding model. Since it is in part a distillation method, ConGen relies on a teacher model whose knowledge is then distilled into a student model. The original paper proposes back-translation as the best data augmentation technique; however, due to limited resources, we implemented word deletion (sketched below), which the authors found to be on par with back-translation despite being trivial. We used the [official ConGen implementation](https://github.com/KornWtp/ConGen), which is written on top of the Sentence Transformers library.
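
Word deletion can be as simple as the following sketch: randomly drop a fraction of tokens to create a perturbed view of each sentence. The 10% deletion rate is an arbitrary example, not the value used in our training runs.

```python
import random

def word_deletion(sentence: str, deletion_rate: float = 0.1) -> str:
    """Randomly drop a fraction of whitespace-separated tokens."""
    tokens = sentence.split()
    if len(tokens) <= 1:
        return sentence
    kept = [t for t in tokens if random.random() >= deletion_rate]
    # Never return an empty sentence: fall back to one random token
    return " ".join(kept) if kept else random.choice(tokens)

print(word_deletion("saya suka makan nasi goreng di pagi hari"))
```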

## Results

### Machine Translated Indonesian Semantic Textual Similarity Benchmark (STSB-MT-ID)

| Model | Spearman's Correlation (%) | #params | Base/Student Model | Teacher Model | Train Dataset | Supervised |
| ----- | :------------------------: | :-----: | ------------------ | ------------- | ------------- | :--------: |
| [SimCSE-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/simcse-indobert-lite-base) | 44.08 | 12M | [IndoBERT Lite Base](https://huggingface.co/indobenchmark/indobert-lite-base-p1) | N/A | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
| [SimCSE-IndoRoBERTa Base](https://huggingface.co/LazarusNLP/simcse-indoroberta-base) | 61.26 | 125M | [IndoRoBERTa Base](https://huggingface.co/flax-community/indonesian-roberta-base) | N/A | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
| [SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/simcse-indobert-base) | 70.13 | 125M | [IndoBERT Base](https://huggingface.co/indobenchmark/indobert-base-p1) | N/A | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
| [ConGen-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/congen-indobert-lite-base) | 79.97 | 12M | [IndoBERT Lite Base](https://huggingface.co/indobenchmark/indobert-lite-base-p1) | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
| [ConGen-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-indobert-base) | 80.47 | 125M | [IndoBERT Base](https://huggingface.co/indobenchmark/indobert-base-p1) | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
| [ConGen-SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-simcse-indobert-base) | 81.16 | 125M | [SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/simcse-indobert-base) | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
| [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 75.08 | 134M | [DistilBERT Base Multilingual](https://huggingface.co/distilbert-base-multilingual-cased) | mUSE | See: [SBERT](https://www.sbert.net/docs/pretrained_models.html#model-overview) | ✅ |
| [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 83.83 | 125M | [XLM-RoBERTa Base](https://huggingface.co/xlm-roberta-base) | [paraphrase-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-mpnet-base-v2) | See: [SBERT](https://www.sbert.net/docs/pretrained_models.html#model-overview) | ✅ |

### Emotion Classification (EmoT)

| Model | Accuracy (%) | F1 Macro (%) |
| ----- | :----------: | :----------: |
| [SimCSE-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/simcse-indobert-lite-base) | 41.13 | 40.70 |
| [SimCSE-IndoRoBERTa Base](https://huggingface.co/LazarusNLP/simcse-indoroberta-base) | 50.45 | 50.75 |
| [SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/simcse-indobert-base) | 55.45 | 55.78 |
| [ConGen-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/congen-indobert-lite-base) | 58.18 | 58.84 |
| [ConGen-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-indobert-base) | 57.04 | 57.06 |
| [ConGen-SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-simcse-indobert-base) | 59.54 | 60.37 |
| [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 63.63 | 64.13 |
| [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 63.18 | 63.78 |

### Sentiment Analysis (SmSA)

| Model | Accuracy (%) | F1 Macro (%) |
| ----- | :----------: | :----------: |
| [SimCSE-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/simcse-indobert-lite-base) | 68.8 | 63.37 |
| [SimCSE-IndoRoBERTa Base](https://huggingface.co/LazarusNLP/simcse-indoroberta-base) | 76.2 | 70.42 |
| [SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/simcse-indobert-base) | 85.6 | 81.50 |
| [ConGen-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/congen-indobert-lite-base) | 81.2 | 75.59 |
| [ConGen-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-indobert-base) | 85.4 | 82.12 |
| [ConGen-SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-simcse-indobert-base) | 83.0 | 78.74 |
| [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 78.8 | 73.64 |
| [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 89.6 | 86.56 |

## References

```bibtex
@misc{Thai-Sentence-Vector-Benchmark-2022,
  author       = {Limkonchotiwat, Peerat},
  title        = {Thai-Sentence-Vector-Benchmark},
  year         = {2022},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/mrpeerat/Thai-Sentence-Vector-Benchmark}}
}
```

```bibtex
@inproceedings{reimers-2019-sentence-bert,
  title     = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author    = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  month     = "11",
  year      = "2019",
  publisher = "Association for Computational Linguistics",
  url       = "https://arxiv.org/abs/1908.10084",
}
```

```bibtex
@inproceedings{gao2021simcse,
  title     = {{SimCSE}: Simple Contrastive Learning of Sentence Embeddings},
  author    = {Gao, Tianyu and Yao, Xingcheng and Chen, Danqi},
  booktitle = {Empirical Methods in Natural Language Processing (EMNLP)},
  year      = {2021}
}
```

```bibtex
@inproceedings{limkonchotiwat-etal-2022-congen,
  title     = "{ConGen}: Unsupervised Control and Generalization Distillation For Sentence Representation",
  author    = "Limkonchotiwat, Peerat and
    Ponwitayarat, Wuttikorn and
    Lowphansirikul, Lalita and
    Udomcharoenchaikit, Can and
    Chuangsuwanich, Ekapol and
    Nutanong, Sarana",
  booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
  year      = "2022",
  publisher = "Association for Computational Linguistics",
}
```