Detecting Unintentional Bilingual and Translation Instances in NLP Datasets

A Python implementation of the analysis in Google's "Searching for Needles in a Haystack: On the Role of Incidental Bilingualism in PaLM’s Translation Capability" (Briakou et al., 2023), built with open-source tools. It differs from the paper in two ways:

Per-token language detection uses Kevers (2022)'s CoSwID model instead of Google's CMX model (Zhang et al., 2018).
When the CoSwID model is very uncertain about a run of consecutive tokens, we use fasttext-langdetect, which builds on Facebook's fastText (Joulin et al., 2016), to label the entire uncertain sequence.
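
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the `(text, language, confidence)` token layout, the `threshold` and `min_run` parameters, and the `detect` callable are all assumptions made for the example. In practice `detect` could wrap fasttext-langdetect.

```python
from typing import Callable, List, Tuple

# (token text, language label, confidence): an assumed layout for this sketch.
Token = Tuple[str, str, float]


def relabel_uncertain_runs(
    tokens: List[Token],
    detect: Callable[[str], str],
    threshold: float = 0.5,  # hypothetical confidence cut-off
    min_run: int = 3,        # hypothetical minimum run length
) -> List[Tuple[str, str]]:
    """Relabel runs of low-confidence tokens with a sequence-level detector."""
    labels = [(text, lang) for text, lang, _ in tokens]
    i = 0
    while i < len(tokens):
        if tokens[i][2] >= threshold:
            i += 1
            continue
        # Extend the run while confidence stays below the threshold.
        j = i
        while j < len(tokens) and tokens[j][2] < threshold:
            j += 1
        if j - i >= min_run:
            # Label the whole uncertain span in one call.
            span_lang = detect(" ".join(t[0] for t in tokens[i:j]))
            for k in range(i, j):
                labels[k] = (tokens[k][0], span_lang)
        i = j
    return labels
```

Passing the detector in as a callable keeps the sketch independent of any particular language identification library; runs shorter than `min_run` keep their per-token labels.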

Installation

1. Install CoSwID as indicated in the CoSwID repository:

Install the prerequisites.

Required system tools:

unzip
g++
python3

Required Python packages:

python-daemon
numpy
fasttext-langdetect
iso-639
levenshtein
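
Before continuing, it can help to confirm that the tools and packages listed above are available. The following is a small sketch using only the standard library. The import names in `MODULES` are assumptions, since a package's import name can differ from its pip name; adjust them if your installation differs.

```python
import importlib.util
import shutil
from typing import Iterable, List

# System tools listed above.
TOOLS = ["unzip", "g++", "python3"]
# Import names are assumptions: they may differ from the pip package names.
MODULES = ["daemon", "numpy", "ftlangdetect", "iso639", "Levenshtein"]


def missing_tools(tools: Iterable[str]) -> List[str]:
    """Return the tools that cannot be found on PATH."""
    return [t for t in tools if shutil.which(t) is None]


def missing_modules(modules: Iterable[str]) -> List[str]:
    """Return the top-level modules that cannot be located for import."""
    return [m for m in modules if importlib.util.find_spec(m) is None]


if __name__ == "__main__":
    print("Missing tools:", missing_tools(TOOLS) or "none")
    print("Missing modules:", missing_modules(MODULES) or "none")
```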

Clone the CoSwID repositories. With <WORKING_DIR> as the directory you want to clone them into, run:

cd <WORKING_DIR>
git clone https://github.com/lkevers/ldig-python3.git
git clone https://github.com/lkevers/dicServer.git
git clone https://github.com/lkevers/coswid.git

Generate the language model following the instructions in the CoSwID repository. For example, the FILTER2 model proposed in the CoSwID paper detects English, Italian, German, French, Portuguese, Spanish, Dutch, Romanian and Corsican, and works well for our purposes. You can generate FILTER2 with the default settings as follows:

cd <WORKING_DIR>/coswid/data_lgID_learn
unzip filter2.zip
cat filter2/LEARN_data_filter2_* >>filter2/LEARN_data_filter2_ALL.txt
mkdir ../models/filter2
cd ../../ldig-python3/maxsubst
g++ -o maxsubst maxsubst.cpp -Icybozulib/include
chmod +x maxsubst
cd ../
python3 ldig.py -m ../coswid/models/filter2 -x maxsubst/maxsubst --init ../coswid/data_lgID_learn/filter2/LEARN_data_filter2_ALL.txt
python3 ldig.py -m ../coswid/models/filter2 --learn ../coswid/data_lgID_learn/filter2/LEARN_data_filter2_ALL.txt -e 0.5
python3 ldig.py -m ../coswid/models/filter2 --shrink
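
If you prefer to script the three `ldig.py` steps, a sketch like the one below builds the same command lines as argument lists. The helper name `ldig_commands` and its parameters are hypothetical, and the assumption that other models follow the `LEARN_data_<model>_ALL.txt` naming used by filter2 is not confirmed by the CoSwID repository.

```python
from typing import List


def ldig_commands(model: str = "filter2", e_value: float = 0.5) -> List[List[str]]:
    """Build the init, learn and shrink commands shown above.

    Paths are relative to <WORKING_DIR>/ldig-python3. The data file naming
    for models other than filter2 is an assumption.
    """
    model_dir = f"../coswid/models/{model}"
    data = f"../coswid/data_lgID_learn/{model}/LEARN_data_{model}_ALL.txt"
    base = ["python3", "ldig.py", "-m", model_dir]
    return [
        base + ["-x", "maxsubst/maxsubst", "--init", data],
        base + ["--learn", data, "-e", str(e_value)],
        base + ["--shrink"],
    ]
```

Each list can be passed to `subprocess.run(cmd, cwd="<WORKING_DIR>/ldig-python3", check=True)` so that a failing step stops the sequence.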

Modify the script <WORKING_DIR>/coswid/src/coswid.py as instructed in the CoSwID repository's README.
Run the dicServer with:

cd <WORKING_DIR>/dicServer
python3 dicServer.py .

Test if the coswid.py installation was successful:

cd <WORKING_DIR>/coswid/src
python3 coswid.py -m FILTER2 -t "Voici un texte à analyser in order to predict the languages" -c 2 -f 0 -g 0.1 -v dico

The results will be written to the default.out file.
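
The same check can be run from Python. The sketch below reuses the exact flags from the command above without interpreting them. The helper names are hypothetical, and it assumes that default.out is written to the directory the script is run from, which you should verify against your installation. The dicServer must already be running.

```python
import subprocess
from pathlib import Path
from typing import List


def coswid_argv(text: str, model: str = "FILTER2") -> List[str]:
    """Build the test command shown above, with the same flag values."""
    return ["python3", "coswid.py", "-m", model, "-t", text,
            "-c", "2", "-f", "0", "-g", "0.1", "-v", "dico"]


def run_smoke_test(src_dir: str, text: str) -> bool:
    """Run coswid.py in src_dir and report whether default.out was produced.

    Assumes default.out lands in src_dir (not confirmed by the README).
    """
    out_file = Path(src_dir) / "default.out"
    # Remove a stale result so an old file cannot mask a failure.
    out_file.unlink(missing_ok=True)
    subprocess.run(coswid_argv(text), cwd=src_dir, check=True)
    return out_file.is_file() and out_file.stat().st_size > 0
```

For example, `run_smoke_test("<WORKING_DIR>/coswid/src", "Voici un texte à analyser in order to predict the languages")` would raise if the script exits with an error and return `False` if no output file appears.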

Once the coswid.py script is working properly, you can move on to the next step.

2. TODO

Usage

TODO

References

Briakou, E., Cherry, C., & Foster, G. (2023). Searching for Needles in a Haystack: On the Role of Incidental Bilingualism in PaLM’s Translation Capability. arXiv preprint arXiv:2305.10266.
Kevers, L. (2022). CoSwID, a Code Switching Identification Method Suitable for Under-Resourced Languages. In Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages (SIGUL 2022) (pp. 112-121). Marseille, France.
Zhang, Y., Riesa, J., Gillick, D., Bakalov, A., Baldridge, J., & Weiss, D. (2018). A Fast, Compact, Accurate Model for Language Identification of Codemixed Text. arXiv preprint arXiv:1810.04142.
Joulin, A., Grave, E., Bojanowski, P., Douze, M., Jégou, H., & Mikolov, T. (2016). FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.
Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.

License

The source code in this repository falls under the MIT License. For CoSwID's license, please refer to its repository.
