Data

This directory contains all data files used for the experiments and analysis in the paper. The data is organized into the following subdirectories:

data/evaluation_benchmarks_afr_release/: Contains the full translations of Winogrande, Belebele, and MMLU ("college medicine", "clinical knowledge", and "virology"). This is the folder to use if you just want the benchmarks.

data/gpt_performance/: Contains the raw OpenAI Batch API response .jsonl files for the out-of-the-box GPT-3.5, GPT-4, and GPT-4o runs on Winogrande, Belebele, and MMLU. Each model was run 3 times on each benchmark, and each file is named according to the following format: <model>_generations_<run_number>.jsonl, where <model> is one of gpt-3.5, gpt-4, or gpt-4o and <run_number> indicates the trial number for the experiment, beginning at 0 (Run #1) and ending at 2 (Run #3).
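
For convenience, here is a minimal sketch of how one might iterate over these files. It assumes the standard OpenAI Batch API output layout (one JSON object per line, with a custom_id field and a response whose body is a chat completion); check the files themselves if the key paths differ.

```python
import json
from pathlib import Path

def load_batch_responses(path):
    """Yield (custom_id, answer_text) pairs from an OpenAI Batch API output file.

    Assumes the standard Batch API output layout; adjust the key paths below
    if the files in data/gpt_performance/ differ.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            body = record["response"]["body"]  # chat completion payload
            answer = body["choices"][0]["message"]["content"]
            yield record["custom_id"], answer

# Example: Run #1 of the GPT-4o generations (filename follows the scheme above).
for custom_id, answer in load_batch_responses(Path("data/gpt_performance/gpt-4o_generations_0.jsonl")):
    print(custom_id, answer[:80])
```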

data/translations_and_llm_responses/ contains the following data files:

  • 1. Data Dictionary.pdf: A data dictionary describing the contents of each data file within this subdirectory and the meaning of each column within each .csv file.
  • 2. Winogrande Data.csv: The raw human translation results for the Winogrande dataset.
  • 3. Winogrande Cultural Surveys.csv: The raw human survey results for the quality and cultural appropriateness assessment of the Winogrande translations.
  • 4. Winogrande Upworker Profiles.csv: The anonymized profiles and qualifications of each Upworker hired for any of the Winogrande translation/assessment tasks. This file was used to produce Appendix Table 25.
  • 5. LLM Responses.csv: The complete set of LLM responses to every question asked during the experiments conducted for the AAAI paper released with this code repository. This includes responses from the out-of-the-box experiments, the fine-tuning experiments using the full fine-tuning datasets, and the fine-tuning experiments using quality x quantity sampling on the fine-tuning datasets.
  • 6. Evaluation Data.csv: The complete set of all evaluation benchmark questions, including machine-translated versions.
  • 7. Fine-Tuning Datasets.csv: The actual fine-tuning datasets used for our experiments, given as lists of evaluation benchmark IDs that match those in 6. Evaluation Data.csv. Note that the quality x quantity fine-tuning datasets reproduced for this repository (i.e. in results/fine-tuning_datasets/quality_x_quantity/) may not match those given in this CSV file (due to the randomness of GPT-4o responses). As such, use this CSV file to select fine-tuning dataset rows if you want the exact rows we used, rather than just the same method used to generate them (see the sketch after this list for one way to do that selection).
  • 8. MMLU Data.csv: The raw human translation results for the MMLU dataset.
  • 9. Belebele Data.csv: The raw human translation results for the Belebele dataset.
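
For illustration, the sketch below shows one way to select the exact fine-tuning rows via a pandas join of 7. Fine-Tuning Datasets.csv against 6. Evaluation Data.csv. The column names used here (e.g. benchmark_id) are placeholders, not the real schema; consult 1. Data Dictionary.pdf for the actual column names.

```python
import pandas as pd

# NOTE: the column names used here ("benchmark_id") are placeholders;
# consult "1. Data Dictionary.pdf" for the actual column names in each file.
eval_data = pd.read_csv("data/translations_and_llm_responses/6. Evaluation Data.csv")
ft_datasets = pd.read_csv("data/translations_and_llm_responses/7. Fine-Tuning Datasets.csv")

# Keep only the evaluation rows whose IDs appear in the fine-tuning dataset,
# reproducing the exact rows used in the paper rather than regenerating them.
selected_ids = set(ft_datasets["benchmark_id"])
fine_tuning_rows = eval_data[eval_data["benchmark_id"].isin(selected_ids)]
print(f"{len(fine_tuning_rows)} fine-tuning rows selected")
```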

Note that this is the folder to use if you want to conduct additional analyses with our translation results or raw LLM responses (e.g. to check whether ROUGE-1 score correlates with LLM performance).

data/parquet_ready_release/ contains a version of data/evaluation_benchmarks_afr_release/ with just our human-translated Winogrande and MMLU contributions, formatted so it is suitable for HuggingFace's Dataset Viewer (i.e. it can be automatically converted to Parquet by HuggingFace). It was generated with create_parquet_ready_release.py.
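
As a quick illustration, this release can also be loaded locally with the Hugging Face datasets library. The sketch below is an assumption-laden example, not part of the repository: it assumes the folder's data files are in a format load_dataset can infer on its own (e.g. CSV, JSON, or Parquet).

```python
from datasets import load_dataset

# Point `load_dataset` at the local folder; it infers the builder from the
# file extensions (assumption: the files are CSV, JSON, or Parquet).
release = load_dataset("data/parquet_ready_release")
print(release)
```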