
Collection for WikiNLI Corpus

This is the code we used to collect the WikiNLI corpus by applying pre-trained Opus-MT models to the WikiMatrix data.

Data

We select the Indonesian, Czech, French, Japanese, and German corpora from the WikiMatrix data, and build our corpus by sampling sentence pairs across the full range of alignment scores.
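
The WikiMatrix releases are tab-separated files of the form score<TAB>source<TAB>target, where the score is the mining margin. A minimal sampling sketch in Python, assuming that file layout (the bin edges and per-bin sample size below are illustrative, not the values used in this repository):

import random
from collections import defaultdict

def sample_pairs(tsv_path, per_bin=1000, bin_edges=(1.04, 1.06, 1.08, 1.10)):
    # Bucket pairs by alignment score so the sample covers the whole
    # score range, not only the most confident alignments.
    buckets = defaultdict(list)
    with open(tsv_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 3:  # skip malformed lines
                continue
            score, src, tgt = float(parts[0]), parts[1], parts[2]
            buckets[sum(score >= e for e in bin_edges)].append((score, src, tgt))
    sample = []
    for bucket in buckets.values():
        sample.extend(random.sample(bucket, min(per_bin, len(bucket))))
    return sample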

Models

We use the pre-processor in the Opus-MT framework to load the public OPUS-MT models (including their SentencePiece-based pre-processors). The models translate the WikiMatrix sentences into English with Marian-NMT, whose command-line options can be used to change the translation settings.
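
The same public models are also available through their Hugging Face port; a minimal translation sketch under that assumption (the repository itself drives the Opus-MT/Marian-NMT tools directly, and the checkpoint name below is just the French-to-English example):

from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-fr-en"  # public OPUS-MT French-to-English model
tokenizer = MarianTokenizer.from_pretrained(model_name)  # applies SentencePiece pre-processing
model = MarianMTModel.from_pretrained(model_name)

def translate(sentences):
    batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

print(translate(["Le chat est assis sur le tapis."]))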

The post-processor resolves encoding issues in the translations and filters out sentence pairs that are detected as being in other languages. The datasets are finally saved as CSV files.
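
A minimal sketch of this kind of post-processing step, assuming the ftfy package for encoding repair and langdetect for language filtering (the repository may use different tools for both):

import csv
import ftfy
from langdetect import detect, LangDetectException

def postprocess(pairs, out_path):
    # pairs: iterable of (source_sentence, english_translation) tuples
    with open(out_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["source", "translation"])
        for src, hyp in pairs:
            hyp = ftfy.fix_text(hyp)  # repair mojibake and other encoding artifacts
            try:
                if detect(hyp) != "en":  # drop pairs whose translation is not English
                    continue
            except LangDetectException:  # raised on empty or undetectable text
                continue
            writer.writerow([src, hyp])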

Implementation

To run the collection, execute the launch.sbatch.sh file with a language code. There are five languages to choose from: Indonesian (id), Japanese (ja), French (fr), Czech (cs), and German (de). For example:

chmod +x launch.sbatch.sh
./launch.sbatch.sh fr

Evaluation

To evaluate the data, configure CoreNLP by following its setup instructions and start the client, then run the evaluation.py file with the JSON data file path as input to get the results.
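
One way to start the CoreNLP client is through the stanza wrapper; a minimal sketch, assuming a local CoreNLP install pointed to by CORENLP_HOME (the annotators and paths below are illustrative, not necessarily the ones evaluation.py requires):

import os
from stanza.server import CoreNLPClient

os.environ.setdefault("CORENLP_HOME", "/path/to/corenlp")  # hypothetical local CoreNLP install

with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "parse"],
                   timeout=30000, memory="4G") as client:
    ann = client.annotate("The cat sat on the mat.")
    print(ann.sentence[0].parseTree)

With the client running, the evaluation is then invoked with the JSON data file path as its input, e.g.:

python evaluation.py path/to/data.json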

References

[1] Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong and Paco Guzman. WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia. arXiv, July 11, 2019.

[2] Rico Sennrich, Barry Haddow and Alexandra Birch. Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.
