Annotations on a Budget: Leveraging Geo-Data Similarity to Balance Model Performance and Annotation Cost
[Paper] [ACL Anthology page] [Poster]
This work proposes methods for identifying which data to annotate, in order to balance model performance against annotation cost.
Vision-language models work poorly on data from underrepresented countries. This is primarily because the appearance of topics (objects and actions, e.g., "toothbrush") varies widely across countries. However, collecting diverse global data is very expensive. To make the most of a limited annotation budget, we propose to: (1) annotate the images that are visually different from the ones in high-resource datasets such as LAION or ImageNet; (2) supplement data from low-resource countries with data from visually similar countries.
We hope our work contributes to building more inclusive and affordable vision-language models and datasets to help democratize AI globally.
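As a rough illustration of proposal (1), the sketch below flags candidate images whose embeddings are, on average, dissimilar to a set of reference embeddings from a high-resource dataset. It is not the paper's implementation: the function names, the toy two-dimensional embeddings, the use of mean cosine similarity, and the `threshold` value are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_for_annotation(candidates, reference, threshold=0.5):
    """Return ids of candidates whose mean similarity to the reference set is below threshold.

    candidates: dict mapping image id -> embedding (hypothetical format)
    reference:  list of embeddings from a high-resource dataset
    threshold:  assumed cut-off; it would need tuning on real data
    """
    selected = []
    for image_id, embedding in candidates.items():
        mean_sim = sum(cosine(embedding, ref) for ref in reference) / len(reference)
        if mean_sim < threshold:
            selected.append(image_id)
    return selected


# Toy data: img_a looks like the reference images, img_b does not.
candidates = {"img_a": [1.0, 0.0], "img_b": [0.0, 1.0]}
reference = [[1.0, 0.1], [0.9, 0.0]]
print(select_for_annotation(candidates, reference))
```

In practice the embeddings would come from an image encoder, and the images selected this way would be the ones prioritized for annotation.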
For more information, read our COLING 2024 paper:
By Oana Ignat, Longju Bai, Joan Nwatu, and Rada Mihalcea.
This repository contains the results we obtained:
- The data before and after pre-processing, together with the topic mapping, are shown in data/data_pre-processing.csv
- The removed (topic, country) pairs, i.e., those with fewer than 10 images, are shown in data/data_removed.csv
- The answer to RQ1, i.e., all the (topic, country) pairs that are consistently dissimilar to the high-resource data, is in data/output_RQ1.csv
- The answer to RQ2, i.e., all the (topic, country) pairs and their most similar countries, is in data/output_RQ2.csv
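The result files listed above can be inspected with a few lines of Python. The following is a minimal sketch using only the standard library. The helper name `load_results` and the short keys in `RESULT_FILES` are made up for this example, and the column names are not assumed: the script simply prints whatever header each file has. The `utf-8-sig` encoding is also an assumption, chosen because it reads UTF-8 files with or without a byte-order mark.

```python
import csv
from pathlib import Path

# Paths as listed in this README; the short keys are illustrative only.
RESULT_FILES = {
    "pre_processing": "data/data_pre-processing.csv",
    "removed": "data/data_removed.csv",
    "rq1": "data/output_RQ1.csv",
    "rq2": "data/output_RQ2.csv",
}


def load_results(repo_root="."):
    """Read each result CSV that exists under repo_root into a list of row dicts."""
    tables = {}
    for name, rel_path in RESULT_FILES.items():
        path = Path(repo_root) / rel_path
        if not path.is_file():
            continue  # skip files that are missing from this checkout
        with path.open(newline="", encoding="utf-8-sig") as handle:
            tables[name] = list(csv.DictReader(handle))
    return tables


if __name__ == "__main__":
    for name, rows in load_results().items():
        header = list(rows[0].keys()) if rows else []
        print(f"{name}: {len(rows)} rows, columns = {header}")
```

Run it from the repository root, or pass the path to your clone as `repo_root`.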
@inproceedings{ignat-etal-2024-budget,
title = "Annotations on a Budget: Leveraging Geo-Data Similarity to Balance Model Performance and Annotation Cost",
author = "Ignat, Oana and
Bai, Longju and
Nwatu, Joan and
Mihalcea, Rada",
booktitle = "TODO",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/TODO",
pages = "TODO",
series = {COLING '24}
}