Python program for detecting unintentional bilingual and translation instances in NLP datasets.

RaiBP/incidental-bilingualism

Detecting Unintentional Bilingual and Translation Instances in NLP Datasets

A Python implementation of Google's "Searching for Needles in a Haystack: On the Role of Incidental Bilingualism in PaLM’s Translation Capability" (Briakou et al., 2023), built with open-source tools. It differs from the paper in two ways:

  1. Per-token language detection uses the CoSwID model (Kevers, 2022) instead of Google's CMX model (Zhang et al., 2018).
  2. When CoSwID is highly uncertain over a run of consecutive tokens, we fall back to fasttext-langdetect, built on Facebook's fastText (Joulin et al., 2016), to label the entire uncertain sequence.
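
The fallback in item 2 could look roughly like the sketch below. It is not taken from this repository: the function name, the confidence threshold, and the minimum run length are all hypothetical, and the per-token confidences are assumed to come from CoSwID's output in some form. The sequence-level detector is passed in as a callable so the sketch runs without any model installed.

```python
from typing import Callable, List, Sequence


def relabel_uncertain_runs(
    tokens: Sequence[str],
    labels: Sequence[str],
    confidences: Sequence[float],
    detect_lang: Callable[[str], str],
    threshold: float = 0.5,  # hypothetical cut-off for "very unsure"
    min_run: int = 3,        # hypothetical minimum length of an uncertain run
) -> List[str]:
    """Return a copy of `labels` in which every run of at least `min_run`
    consecutive low-confidence tokens gets one label from `detect_lang`."""
    if not (len(tokens) == len(labels) == len(confidences)):
        raise ValueError("tokens, labels and confidences must have equal length")

    out = list(labels)
    i, n = 0, len(tokens)
    while i < n:
        if confidences[i] >= threshold:
            i += 1
            continue
        # Extend j to the end of the current low-confidence run.
        j = i
        while j < n and confidences[j] < threshold:
            j += 1
        if j - i >= min_run:
            # Label the whole uncertain span with a single sequence-level guess.
            lang = detect_lang(" ".join(tokens[i:j]))
            out[i:j] = [lang] * (j - i)
        i = j
    return out
```

With fasttext-langdetect installed, the callable could plausibly be `lambda text: detect(text=text)["lang"]` after `from ftlangdetect import detect`. Treat that as an assumption about the package's interface and check it against the package's own documentation.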

Installation

1. Install CoSwID as indicated in the CoSwID repository:

  1. Install the pre-requisites.

Required system tools:

unzip
g++
python3

Required Python packages:

python-daemon
numpy
fasttext-langdetect
iso-639
levenshtein
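
Before continuing, the prerequisites listed above can be checked with a short script such as the sketch below. The import names in `REQUIRED_MODULES` are assumptions, since pip package names and import names often differ. Adjust them if a check reports a package that is actually installed.

```python
import importlib.util
import shutil
from typing import Iterable, List

# System tools from the list above.
REQUIRED_TOOLS = ["unzip", "g++", "python3"]

# pip package name -> import name (the import names are assumptions).
REQUIRED_MODULES = {
    "python-daemon": "daemon",
    "numpy": "numpy",
    "fasttext-langdetect": "ftlangdetect",
    "iso-639": "iso639",
    "levenshtein": "Levenshtein",
}


def missing_tools(tools: Iterable[str]) -> List[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def missing_modules(modules: Iterable[str]) -> List[str]:
    """Return the top-level modules that Python cannot locate."""
    return [mod for mod in modules if importlib.util.find_spec(mod) is None]


if __name__ == "__main__":
    print("Missing tools:", missing_tools(REQUIRED_TOOLS))
    absent = set(missing_modules(REQUIRED_MODULES.values()))
    print("Missing packages:",
          [pkg for pkg, mod in REQUIRED_MODULES.items() if mod in absent])
```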
  2. Clone the CoSwID repositories. Taking <WORKING_DIR> as the directory to clone them into, run:
cd <WORKING_DIR>
git clone https://github.com/lkevers/ldig-python3.git
git clone https://github.com/lkevers/dicServer.git
git clone https://github.com/lkevers/coswid.git
  3. Generate the language model following the instructions in the CoSwID repository. For example, the FILTER2 model proposed in the CoSwID paper detects English, Italian, German, French, Portuguese, Spanish, Dutch, Romanian and Corsican, and works well for our purposes. To generate FILTER2 with the default settings, run:
cd <WORKING_DIR>/coswid/data_lgID_learn
unzip filter2.zip
cat filter2/LEARN_data_filter2_* >>filter2/LEARN_data_filter2_ALL.txt
mkdir ../models/filter2
cd ../../ldig-python3/maxsubst
g++ -o maxsubst maxsubst.cpp -Icybozulib/include
chmod +x maxsubst
cd ../
python3 ldig.py -m ../coswid/models/filter2 -x maxsubst/maxsubst --init ../coswid/data_lgID_learn/filter2/LEARN_data_filter2_ALL.txt
python3 ldig.py -m ../coswid/models/filter2 --learn ../coswid/data_lgID_learn/filter2/LEARN_data_filter2_ALL.txt -e 0.5
python3 ldig.py -m ../coswid/models/filter2 --shrink
  4. Modify the script <WORKING_DIR>/coswid/src/coswid.py as instructed in the CoSwID repository's README.

  5. Start the dicServer:

cd <WORKING_DIR>/dicServer
python3 dicServer.py .
  6. Test whether the coswid.py installation works:
cd <WORKING_DIR>/coswid/src
python3 coswid.py -m FILTER2 -t "Voici un texte à analyser in order to predict the languages" -c 2 -f 0 -g 0.1 -v dico

The results will be written to the default.out file.
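
To run the same check from Python, a thin wrapper such as the sketch below may help. The flags mirror the test command above. The helper names are hypothetical, and the code assumes that default.out is created in the directory coswid.py is run from, that it is UTF-8 encoded, and that the dicServer is already running.

```python
import subprocess
from pathlib import Path
from typing import List


def build_coswid_command(text: str, model: str = "FILTER2") -> List[str]:
    """Build the argument list for the coswid.py test command shown above."""
    return [
        "python3", "coswid.py",
        "-m", model,
        "-t", text,
        "-c", "2", "-f", "0", "-g", "0.1", "-v", "dico",
    ]


def run_coswid(text: str, src_dir: Path, model: str = "FILTER2") -> str:
    """Run coswid.py inside `src_dir` and return the contents of default.out.

    Assumes default.out is written to `src_dir` (not confirmed by this README).
    """
    subprocess.run(build_coswid_command(text, model), cwd=src_dir, check=True)
    return (src_dir / "default.out").read_text(encoding="utf-8")
```

For example, `run_coswid("Voici un texte à analyser in order to predict the languages", Path("<WORKING_DIR>/coswid/src"))` with <WORKING_DIR> replaced by your own path. Passing the arguments as a list avoids shell quoting problems with the input text.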

Once the coswid.py script is working properly, you can move on to the next step.

2. TODO

Usage

TODO

References

  • Briakou, E., Cherry, C., & Foster, G. (2023). Searching for Needles in a Haystack: On the Role of Incidental Bilingualism in PaLM’s Translation Capability. arXiv preprint arXiv:2305.10266.
  • Kevers, L. (2022). CoSwID, a Code Switching Identification Method Suitable for Under-Resourced Languages. In Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages (SIGUL 2022) (pp. 112-121). Marseille, France.
  • Zhang, Y., Riesa, J., Gillick, D., Bakalov, A., Baldridge, J., & Weiss, D. (2018). A Fast, Compact, Accurate Model for Language Identification of Codemixed Text. arXiv preprint arXiv:1810.04142.
  • Joulin, A., Grave, E., Bojanowski, P., Douze, M., Jégou, H., & Mikolov, T. (2016). FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.
  • Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.

License

The source code in this repository is released under the MIT License. For CoSwID's license, please refer to its repository.
