
JP Tasks

JSQuAD

JSQuAD is a Japanese version of SQuAD (Rajpurkar+, 2016), a reading comprehension dataset. Each instance consists of a question about a given context (a Wikipedia article) and its answer. JSQuAD is based on SQuAD 1.1, so there are no unanswerable questions. We used the Japanese Wikipedia dump as of 20211101.

sample script

python main.py \
    --model hf-causal \
    --model_args $MODEL_ARGS \
    --tasks "jsquad-1.1-0.2" \
    --num_fewshot "2" \
    --output_path "result.json"

JCommonsenseQA

JCommonsenseQA is a Japanese version of CommonsenseQA (Talmor+, 2019), which is a multiple-choice question answering dataset that requires commonsense reasoning ability. It is built using crowdsourcing with seeds extracted from the knowledge base ConceptNet.

sample script

python main.py \
    --model hf-causal \
    --model_args $MODEL_ARGS \
    --tasks "jcommonsenseqa-1.1-0.2" \
    --num_fewshot "3" \
    --output_path "result.json"

JNLI

JNLI is a Japanese NLI (Natural Language Inference) dataset. NLI is the task of recognizing the inference relation that a premise sentence holds to a hypothesis sentence. The labels are 含意 (entailment), 矛盾 (contradiction), and 中立 (neutral).

sample script

python main.py \
    --model hf-causal \
    --model_args $MODEL_ARGS \
    --tasks "jnli-1.1-0.2" \
    --num_fewshot "3" \
    --output_path "result.json"

MARC-ja

MARC-ja is a text classification dataset based on the Japanese portion of the Multilingual Amazon Reviews Corpus (MARC) (Keung+, 2020).

sample script

python main.py \
    --model hf-causal \
    --model_args $MODEL_ARGS \
    --tasks "marc_ja-1.1-0.2" \
    --num_fewshot "3" \
    --output_path "result.json"

JCoLA

JCoLA is a Japanese version of CoLA (Warstadt+, 2019), a dataset for targeted syntactic evaluation of language models in Japanese. It consists of 10,020 sentences with acceptability judgments by linguists, manually extracted from linguistics journals, handbooks, and textbooks.

sample script

python main.py \
    --model hf-causal \
    --model_args $MODEL_ARGS \
    --tasks "jcola-0.0-0.1" \
    --num_fewshot "3" \
    --output_path "result.json"

JaQuAD

Japanese Question Answering Dataset (JaQuAD), released in 2022, is a human-annotated dataset created for Japanese machine reading comprehension. JaQuAD was developed to provide a SQuAD-like QA dataset in Japanese.

sample script

python main.py \
    --model hf-causal \
    --model_args $MODEL_ARGS \
    --tasks "jaquad-1.1-0.2" \
    --num_fewshot "2" \
    --output_path "result.json"

JBLiMP

JBLiMP is a dataset for targeted syntactic evaluations of language models in Japanese. It consists of 331 minimal pairs, created based on acceptability judgments extracted from journal articles in theoretical linguistics. These minimal pairs are grouped into 11 categories, each covering a different linguistic phenomenon.

NOTE: JBLiMP is not used in official evaluations because it is too small compared to other datasets.

sample script

python main.py \
    --model hf-causal \
    --model_args $MODEL_ARGS \
    --tasks "jblimp" \
    --num_fewshot "0" \
    --output_path "result.json"

WikiLingua

WikiLingua is a summarization task over documents constructed from WikiHow articles. While the original dataset is multilingual, only the Japanese portion has been added for now.

sample script

python main.py \
    --model hf-causal \
    --model_args $MODEL_ARGS \
    --tasks "wikilingua_ja" \
    --num_fewshot "1" \
    --output_path "result.json"

XLSum-ja

This is a filtered Japanese subset of XLSum, using the ROUGE-2-based filtering that PaLM 2 uses.

sample scripts

python main.py \
    --model hf-causal \
    --model_args $MODEL_ARGS \
    --tasks "xlsum_ja" \
    --num_fewshot "1" \
    --output_path "result.json"

XWinograd

XWinograd is a set of Winograd Schema sentence pairs. For example:

  • ボブはトムに尋ねた。トムはお金をいくらか貸してくれるかと。 (Bob asked Tom whether Tom would lend him some money.)
  • ボブはトムに尋ねた。ボブはお金をいくらか貸してくれるかと。 (Bob asked Tom whether Bob would lend him some money.)

In this case the first sentence is correct, because it does not make sense for Bob to ask Tom whether Bob himself will lend the money.

The task is for the model to assign the higher log likelihood to the reasonable sentence. Because of the way the task is defined, it's always zero-shot with no prompt.

While XWinograd is a multilingual task, this only uses the Japanese subset, which has 959 pairs.

sample scripts

python main.py \
    --model hf-causal \
    --model_args $MODEL_ARGS \
    --tasks xwinograd_ja \
    --num_fewshot 0 \
    --output_path result.json

JAQKET v2

Japanese Questions on Knowledge of Entity (JAQKET) is a Japanese open-domain QA dataset where the answers are Wikipedia article titles. In version 2 (v2), the dataset is designed to be more similar to a typical quiz competition by excluding answer choices and not providing the answer list used in v1.

sample script

python main.py \
    --model hf-causal \
    --model_args $MODEL_ARGS \
    --tasks "jaqket_v2" \
    --num_fewshot "1" \
    --output_path "result.json"

JAQKET v1

Japanese Questions on Knowledge of Entity (JAQKET) is a Japanese open-domain QA dataset where the answers are Wikipedia article titles. In version 1 (v1), each question is accompanied by 20 candidate answers, making it a 20-way multiple-choice task.

sample script

python main.py \
    --model hf-causal \
    --model_args $MODEL_ARGS \
    --tasks "jaqket_v1" \
    --num_fewshot "1" \
    --output_path "result.json"

MGSM

Multilingual Grade School Math (MGSM) is a set of 250 math word problems in Japanese, and the task is to produce the correct integer solution to each problem. The dataset includes step-by-step solutions to example problems, which are included in the prompt to induce Chain-of-Thought reasoning.

The task is rather hard, with gpt3/text-davinci-002 achieving 26% accuracy at 4-shot. Though this dataset was translated to Japanese by a human, some errors nonetheless occur, including in the few-shot examples that are included in the prompt at every generation.

The implementation here features an auto-reduction of the number of examples included in the prompt if the prompt becomes larger than the model's context length.

Use mgsm-1.0-0.0 by default and mgsm-1.0-0.4 for the rinna instruction prompt.

sample scripts

Hugging Face models

python main.py \
    --model hf-causal \
    --model_args $MODEL_ARGS \
    --tasks mgsm-1.0-0.4 \
    --num_fewshot "4" \
    --device "cuda" \
    --output_path "result.mgsm.json"