Annotations on a Budget: Leveraging Geo-Data Similarity to Balance Model Performance and Annotation Cost

This work proposes methods for selecting which data to annotate, so as to balance model performance against annotation cost.

Vision-language models perform poorly on data from underrepresented countries, primarily because topics (objects and actions) look different across countries (e.g., "toothbrush"). However, collecting diverse global data is very expensive. To make the most of a limited annotation budget, we propose to: (1) annotate images that are visually different from those in high-resource datasets such as LAION or ImageNet; (2) supplement data from low-resource countries with data from visually similar countries.
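As a rough illustration of idea (2), the sketch below ranks candidate countries by how visually similar their images of a topic are to those of a low-resource country. It is a minimal sketch under stated assumptions, not the paper's implementation: it assumes each image has already been turned into an embedding vector by some vision model, that each country is summarized by the mean of its embeddings, and that cosine similarity is the comparison measure. The function names, the toy vectors, and the labels `country_a` and `country_b` are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def rank_by_similarity(target_vectors, candidates):
    """Rank candidate countries by similarity to the target country.

    target_vectors: image embeddings for one (topic, country) pair.
    candidates: dict mapping a country name to its image embeddings
                for the same topic.
    Returns (country, similarity) pairs, most similar first.
    """
    target_mean = mean_vector(target_vectors)
    scores = {
        name: cosine_similarity(target_mean, mean_vector(vectors))
        for name, vectors in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy 2-dimensional "embeddings"; real image embeddings are much larger.
low_resource = [[1.0, 0.0], [0.8, 0.2]]
candidates = {
    "country_a": [[0.9, 0.1], [1.0, 0.0]],
    "country_b": [[0.0, 1.0], [0.1, 0.9]],
}
ranking = rank_by_similarity(low_resource, candidates)
```

The same ranking read from the bottom gives a rough handle on idea (1): the entries least similar to a high-resource reference set would be the ones to prioritize for annotation, again under the assumptions above.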

We hope our work contributes to building more inclusive and affordable vision-language models and datasets to help democratize AI globally.

For more information, read our COLING 2024 paper:

Annotations on a Budget: Leveraging Geo-Data Similarity to Balance Model Performance and Annotation Cost

By Oana Ignat, Longju Bai, Joan Nwatu, and Rada Mihalcea.

This repository includes the obtained results.

Obtained Results

The data before and after pre-processing, along with the topic mapping, are shown in data/data_pre-processing.csv
The removed (topic, country) pairs, those with fewer than 10 images, are listed in data/data_removed.csv
The RQ1 answer, all the (topic, country) pairs that are consistently dissimilar to the high-resource data, is in data/output_RQ1.csv
The RQ2 answer, all the (topic, country) pairs and their most similar countries, is in data/output_RQ2.csv
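The result files can be inspected with Python's standard `csv` module. The sketch below is a hedged starting point: it assumes each file is a UTF-8 CSV with a header row, and it makes no assumption about column names, since they are not documented here. The helper `summarize` is hypothetical and only reports the header and the row count.

```python
import csv
from pathlib import Path

# File names as listed above, relative to the repository root.
RESULT_FILES = (
    "data_pre-processing.csv",
    "data_removed.csv",
    "output_RQ1.csv",
    "output_RQ2.csv",
)


def summarize(path):
    """Return (column names, number of data rows) for a CSV with a header."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        # fieldnames is None for an empty file, so fall back to an empty list.
        columns = list(reader.fieldnames or [])
    return columns, len(rows)


if __name__ == "__main__":
    for name in RESULT_FILES:
        path = Path("data") / name
        if path.exists():
            columns, row_count = summarize(path)
            print(f"{name}: {row_count} rows, columns = {columns}")
        else:
            print(f"{name}: not found (run from the repository root)")
```

Printing the header first is a deliberate choice: it shows the actual column names before any filtering or joining is written against them.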

Citation

@inproceedings{ignat-etal-2024-budget,
    title = "Annotations on a Budget: Leveraging Geo-Data Similarity to Balance Model Performance and Annotation Cost",
    author = "Ignat, Oana  and
      Bai, Longju  and
      Nwatu, Joan  and
      Mihalcea, Rada",
    booktitle = "TODO",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/TODO",
    pages = "TODO",
    series = {COLING '24}
}

Repository: MichiganNLP/visual_diversity_budget