Annotations on a Budget: Leveraging Geo-Data Similarity to Balance Model Performance and Annotation Cost

[Paper] [ACL Anthology page] [Poster]

This work proposes methods to identify which data to annotate in order to balance model performance and annotation cost.

Vision-language models perform poorly on data from underrepresented countries, primarily due to the diverse appearance of topics (objects and actions) across countries (e.g., "toothbrush"). However, collecting diverse global data is very expensive. To budget annotations, we propose to: (1) annotate the images that are visually different from the ones in high-resource datasets such as LAION or ImageNet; (2) supplement data from low-resource countries with data from visually similar countries.

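The sketch below illustrates the two strategies with plain cosine similarity over precomputed image embeddings (e.g., CLIP features). It is not the repository's actual code; the embedding source, the data layout, and the 0.6 threshold are illustrative assumptions.

```python
# Minimal sketch (not the repository's code) of the two proposed strategies,
# assuming precomputed, per-(topic, country) image embeddings such as CLIP features.
import numpy as np


def mean_cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Average cosine similarity between two sets of image embeddings."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float((a @ b.T).mean())


def pairs_to_annotate(country_topic_embs: dict, high_resource_embs: dict,
                      threshold: float = 0.6):
    """Strategy 1: flag (topic, country) pairs whose images are visually
    dissimilar to the high-resource data (e.g., LAION or ImageNet).
    The threshold value is an illustrative assumption, not from the paper."""
    flagged = []
    for (topic, country), embs in country_topic_embs.items():
        sim = mean_cosine_similarity(embs, high_resource_embs[topic])
        if sim < threshold:
            flagged.append((topic, country, sim))
    return flagged


def most_similar_country(topic: str, target_country: str, country_topic_embs: dict):
    """Strategy 2: find the country whose images for the same topic are most
    similar, as a candidate source of supplemental data."""
    target = country_topic_embs[(topic, target_country)]
    best, best_sim = None, -1.0
    for (t, c), embs in country_topic_embs.items():
        if t != topic or c == target_country:
            continue
        sim = mean_cosine_similarity(target, embs)
        if sim > best_sim:
            best, best_sim = c, sim
    return best, best_sim
```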

We hope our work contributes to building more inclusive and affordable vision-language models and datasets to help democratize AI globally.

For more information, read our COLING 2024 paper:

Annotations on a Budget: Leveraging Geo-Data Similarity to Balance Model Performance and Annotation Cost

By Oana Ignat, Longju Bai, Joan Nwatu, and Rada Mihalcea.

This repository contains the results reported in the paper.

Obtained Results

  1. The data before and after pre-processing, together with the topic mapping, are in data/data_pre-processing.csv

  2. The removed (topic, country) pairs with fewer than 10 images are in data/data_removed.csv

  3. The answer to RQ1: all the (topic, country) pairs that are consistently dissimilar to the high-resource data are in data/output_RQ1.csv

  4. The answer to RQ2: all the (topic, country) pairs and their most similar countries are in data/output_RQ2.csv
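
The result files above are plain CSVs and can be inspected with pandas; the snippet below only prints the headers and row counts, since the exact column names depend on the files in the repository.

```python
# Minimal sketch for inspecting the released result files with pandas.
import pandas as pd

rq1 = pd.read_csv("data/output_RQ1.csv")  # (topic, country) pairs dissimilar to high-resource data
rq2 = pd.read_csv("data/output_RQ2.csv")  # most similar countries per (topic, country) pair

print(rq1.columns.tolist(), len(rq1))
print(rq2.columns.tolist(), len(rq2))
```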

Citation

@inproceedings{ignat-etal-2024-budget,
    title = "Annotations on a Budget: Leveraging Geo-Data Similarity to Balance Model Performance and Annotation Cost",
    author = "Ignat, Oana  and
      Bai, Longju  and
      Nwatu, Joan  and
      Mihalcea, Rada",
    booktitle = "TODO",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/TODO",
    pages = "TODO",
    series = {COLING '24}
}
