# Indonesian Sentence Embeddings

Inspired by the [Thai Sentence Vector Benchmark](https://github.com/mrpeerat/Thai-Sentence-Vector-Benchmark), we decided to embark on the journey of training Indonesian sentence embedding models!

To the best of our knowledge, there is no official benchmark on Indonesian sentence embeddings. We hope this repository can serve as a benchmark for future research in this area.

## Evaluation

### Machine Translated STS-B

We believe that a synthetic baseline is better than no baseline. We therefore followed the approach of the Thai Sentence Vector Benchmark project and translated the [STS-B](https://github.com/facebookresearch/SentEval) dev and test sets to Indonesian via the Google Translate API. We use the translated test set to evaluate our models' Spearman correlation scores; a sketch of this evaluation is shown below.

> You can find the translated dataset on the [🤗 HuggingFace Hub](https://huggingface.co/datasets/LazarusNLP/stsb_mt_id).
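
As a concrete sketch of this evaluation (the dataset's column and split names here are assumptions about its schema, and the model choice is illustrative):

```python
# Minimal STSB-MT-ID evaluation sketch; column/split names are assumptions.
from datasets import load_dataset
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import paired_cosine_distances

model = SentenceTransformer("LazarusNLP/simcse-indobert-base")
test = load_dataset("LazarusNLP/stsb_mt_id", split="test")

# Embed both sides of each sentence pair and score them by cosine similarity.
emb1 = model.encode(test["text_1"])
emb2 = model.encode(test["text_2"])
cosine_scores = 1 - paired_cosine_distances(emb1, emb2)

# Spearman correlation between model similarities and gold scores.
print(spearmanr(cosine_scores, test["correlation"]).correlation * 100)
```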

Moreover, we will further evaluate the transferability of our models to downstream tasks (e.g., text classification and natural language inference) and compare them with existing pre-trained language models (PLMs).

### Text Classification

For text classification, we perform emotion classification and sentiment analysis on the EmoT and SmSA subsets of [IndoNLU](https://huggingface.co/datasets/indonlp/indonlu), respectively. Following the Thai Sentence Vector Benchmark, we simply fit a linear SVC on the sentence representations of our texts and their corresponding labels, as sketched below. Thus, unlike conventional fine-tuning, where the backbone model is also updated, the Sentence Transformer stays frozen; only the classification head is trained.
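
A minimal sketch of this frozen-encoder setup, assuming the EmoT subset exposes `tweet` and `label` columns (illustrative, not the exact benchmark script):

```python
# Frozen-encoder classification sketch; EmoT column names are assumptions.
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.svm import LinearSVC

# The Sentence Transformer is only used for inference and is never updated.
model = SentenceTransformer("LazarusNLP/congen-indobert-base")
emot = load_dataset("indonlp/indonlu", "emot")

X_train = model.encode(emot["train"]["tweet"])
X_test = model.encode(emot["test"]["tweet"])
y_train, y_test = emot["train"]["label"], emot["test"]["label"]

# Only this linear classification head is trained.
clf = LinearSVC().fit(X_train, y_train)
preds = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, preds))
print("F1 macro:", f1_score(y_test, preds, average="macro"))
```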

## Methods

### (Unsupervised) SimCSE

We followed [SimCSE: Simple Contrastive Learning of Sentence Embeddings](https://arxiv.org/abs/2104.08821) and trained a sentence embedding model in an unsupervised fashion. Unsupervised SimCSE lets us leverage unlabeled corpora -- which are plentiful -- and learn sentence representations contrastively, using different dropout masks in the encoder to produce two views of the same sentence. This suits our setting well: supervised Indonesian sentence similarity datasets are scarce, so SimCSE is a natural first step. We used the [Sentence Transformers implementation](https://www.sbert.net/examples/unsupervised_learning/README.html#simcse) of [SimCSE](https://github.com/princeton-nlp/SimCSE), as sketched below.
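
A minimal training sketch following the Sentence Transformers SimCSE recipe, where each sentence is paired with itself and dropout provides the two views (the corpus file and hyperparameters are illustrative):

```python
# Unsupervised SimCSE sketch with Sentence Transformers.
from sentence_transformers import SentenceTransformer, InputExample, losses, models
from torch.utils.data import DataLoader

# Build a sentence embedding model from an Indonesian encoder with mean pooling.
word_embedding = models.Transformer("indobenchmark/indobert-base-p1", max_seq_length=32)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding, pooling])

# Hypothetical file with one Indonesian sentence per line (e.g. from Wikipedia).
sentences = open("wikipedia_id.txt").read().splitlines()

# Pair each sentence with itself; dropout yields two different embeddings,
# which the loss pulls together while pushing apart other in-batch sentences.
train_examples = [InputExample(texts=[s, s]) for s in sentences]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=128)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```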

### ConGen

Like SimCSE, [ConGen: Unsupervised Control and Generalization Distillation For Sentence Representation](https://github.com/KornWtp/ConGen) is an unsupervised technique for training a sentence embedding model. Since it is in part a distillation method, ConGen relies on a teacher model whose knowledge is distilled into a student model. The original paper proposes back-translation as the best data augmentation technique; due to limited resources, however, we implemented word deletion (sketched below), which the paper found to be on par with back-translation despite being trivial. We used the [official ConGen implementation](https://github.com/KornWtp/ConGen), which is written on top of the Sentence Transformers library.
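
A hypothetical sketch of the word-deletion augmentation; the deletion probability is illustrative, and the actual ConGen objective consumes the resulting original/augmented sentence pairs:

```python
# Word-deletion augmentation sketch; the probability p is an assumption.
import random

def word_deletion(sentence: str, p: float = 0.1) -> str:
    """Randomly delete each word with probability p, keeping at least one."""
    words = sentence.split()
    kept = [w for w in words if random.random() > p]
    return " ".join(kept) if kept else random.choice(words)

print(word_deletion("saya suka membaca buku di perpustakaan"))
```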

## Results

### Machine Translated Indonesian Semantic Textual Similarity Benchmark (STSB-MT-ID)

| Model | Spearman's Correlation (%) | #params | Base/Student Model | Teacher Model | Train Dataset | Supervised |
| ----- | :------------------------: | :-----: | ------------------ | ------------- | ------------- | :--------: |
| [SimCSE-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/simcse-indobert-lite-base) | 44.08 | 12M | [IndoBERT Lite Base](https://huggingface.co/indobenchmark/indobert-lite-base-p1) | N/A | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
| [SimCSE-IndoRoBERTa Base](https://huggingface.co/LazarusNLP/simcse-indoroberta-base) | 61.26 | 125M | [IndoRoBERTa Base](https://huggingface.co/flax-community/indonesian-roberta-base) | N/A | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
| [SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/simcse-indobert-base) | 70.13 | 125M | [IndoBERT Base](https://huggingface.co/indobenchmark/indobert-base-p1) | N/A | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
| [ConGen-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/congen-indobert-lite-base) | 79.97 | 12M | [IndoBERT Lite Base](https://huggingface.co/indobenchmark/indobert-lite-base-p1) | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
| [ConGen-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-indobert-base) | 80.47 | 125M | [IndoBERT Base](https://huggingface.co/indobenchmark/indobert-base-p1) | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
| [ConGen-SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-simcse-indobert-base) | 81.16 | 125M | [SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/simcse-indobert-base) | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
| [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 75.08 | 134M | [DistilBERT Base Multilingual](https://huggingface.co/distilbert-base-multilingual-cased) | mUSE | See: [SBERT](https://www.sbert.net/docs/pretrained_models.html#model-overview) | ✅ |
| [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 83.83 | 125M | [XLM-RoBERTa Base](https://huggingface.co/xlm-roberta-base) | [paraphrase-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-mpnet-base-v2) | See: [SBERT](https://www.sbert.net/docs/pretrained_models.html#model-overview) | ✅ |

### Emotion Classification (EmoT)

| Model | Accuracy (%) | F1 Macro (%) |
| ----- | :----------: | :----------: |
| [SimCSE-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/simcse-indobert-lite-base) | 41.13 | 40.70 |
| [SimCSE-IndoRoBERTa Base](https://huggingface.co/LazarusNLP/simcse-indoroberta-base) | 50.45 | 50.75 |
| [SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/simcse-indobert-base) | 55.45 | 55.78 |
| [ConGen-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/congen-indobert-lite-base) | 58.18 | 58.84 |
| [ConGen-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-indobert-base) | 57.04 | 57.06 |
| [ConGen-SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-simcse-indobert-base) | 59.54 | 60.37 |
| [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 63.63 | 64.13 |
| [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 63.18 | 63.78 |

### Sentiment Analysis (SmSA)

| Model | Accuracy (%) | F1 Macro (%) |
| ----- | :----------: | :----------: |
| [SimCSE-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/simcse-indobert-lite-base) | 68.8 | 63.37 |
| [SimCSE-IndoRoBERTa Base](https://huggingface.co/LazarusNLP/simcse-indoroberta-base) | 76.2 | 70.42 |
| [SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/simcse-indobert-base) | 85.6 | 81.50 |
| [ConGen-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/congen-indobert-lite-base) | 81.2 | 75.59 |
| [ConGen-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-indobert-base) | 85.4 | 82.12 |
| [ConGen-SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-simcse-indobert-base) | 83.0 | 78.74 |
| [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 78.8 | 73.64 |
| [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 89.6 | 86.56 |

## References

```bibtex
@misc{Thai-Sentence-Vector-Benchmark-2022,
  author       = {Limkonchotiwat, Peerat},
  title        = {Thai-Sentence-Vector-Benchmark},
  year         = {2022},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/mrpeerat/Thai-Sentence-Vector-Benchmark}}
}
```

```bibtex
@inproceedings{reimers-2019-sentence-bert,
  title     = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author    = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  month     = "11",
  year      = "2019",
  publisher = "Association for Computational Linguistics",
  url       = "https://arxiv.org/abs/1908.10084",
}
```

```bibtex
@inproceedings{gao2021simcse,
  title     = {{SimCSE}: Simple Contrastive Learning of Sentence Embeddings},
  author    = {Gao, Tianyu and Yao, Xingcheng and Chen, Danqi},
  booktitle = {Empirical Methods in Natural Language Processing (EMNLP)},
  year      = {2021}
}
```

```bibtex
@inproceedings{limkonchotiwat-etal-2022-congen,
  title     = "{ConGen}: Unsupervised Control and Generalization Distillation For Sentence Representation",
  author    = "Limkonchotiwat, Peerat and Ponwitayarat, Wuttikorn and Lowphansirikul, Lalita and Udomcharoenchaikit, Can and Chuangsuwanich, Ekapol and Nutanong, Sarana",
  booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
  year      = "2022",
  publisher = "Association for Computational Linguistics",
}
```
