FigCaps-HF: A Figure-to-Caption Generative Framework and Benchmark with Human Feedback

To enable the generation of high-quality figure captions, we introduce FigCaps-HF, a new framework for figure-caption generation that can incorporate domain expert feedback to generate captions optimized for reader preferences. Our framework comprises 1) an automatic method for evaluating the quality of figure-caption pairs, and 2) a novel reinforcement learning with human feedback (RLHF) method to optimize a generative figure-to-caption model for reader preferences.

We release a large-scale benchmark dataset with human feedback on figure-caption pairs to enable further evaluation and development of RLHF techniques for this problem.

Benchmark Dataset

The benchmark dataset can be downloaded here: [Download Link](8.34 GB)

Folder Structure

├── No-Subfig-Img                       # figure-image files for each split of the dataset
│   ├── Train
│   ├── Val
│   └── Test
├── Caption-All                         # corresponding figure captions and (precomputed) inferred human-feedback metadata
│   ├── Train
│   ├── Val
│   └── Test
├── human-feedback.csv                  # human evaluations of a subset of figure image-caption pairs
├── arxiv-metadata-oai-snapshot.json    # arXiv paper metadata (from the arXiv dataset)
└── List-of-Files-for-Each-Experiments  # list of figure names used in each experiment
    ├── Single-Sentence-Caption
    │   ├── No-Subfig
    │   │   ├── Train
    │   │   ├── Val
    │   │   └── Test
    │   └── Yes-Subfig
    │       ├── Train
    │       ├── Val
    │       └── Test
    ├── First-Sentence                  # same structure as Single-Sentence-Caption
    └── Caption-No-More-Than-100-Tokens # same structure as Single-Sentence-Caption
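
As a quick sanity check after unpacking, the layout above can be walked to count the files in each split. The following is a minimal sketch, not part of the repository: the helper name `count_files_per_split` is hypothetical, and it assumes the split folders contain the files directly (no nested subfolders).

```python
from pathlib import Path

# Split folder names as shown in the tree above.
SPLITS = ("Train", "Val", "Test")


def count_files_per_split(benchmark_root, subfolder="No-Subfig-Img"):
    """Return {split: number of regular files} for one top-level folder.

    Hypothetical helper: assumes files sit directly inside each split folder.
    """
    root = Path(benchmark_root) / subfolder
    counts = {}
    for split in SPLITS:
        split_dir = root / split
        if split_dir.is_dir():
            counts[split] = sum(1 for p in split_dir.iterdir() if p.is_file())
        else:
            # A missing split folder is reported as 0 rather than raising.
            counts[split] = 0
    return counts


if __name__ == "__main__":
    # "benchmark" is an assumed location for the unpacked dataset.
    print(count_files_per_split("benchmark"))
```

Passing `subfolder="Caption-All"` would count the caption files the same way, which makes it easy to confirm that images and captions line up per split.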

Human Feedback Benchmark Data

The included human-feedback.csv contains human evaluations of 439 figure image-caption pairs from the dataset. Each pair is rated on “helpfulness”, “OCR (quality)”, “takeaway” and “visual (descriptiveness)”, each on a 1-5 scale (5 being the highest). The annotations also include boolean flags indicating whether each pair “has-image-error”, “has-caption-error”, “has-classification-error” or “has-subfigure-error”. For convenience, the image file name and the URL of the originating arXiv paper are also included.
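
The ratings can be summarized with the standard library alone. The sketch below computes the median of one rating column; the function name `median_rating` is hypothetical, and the exact column headers in human-feedback.csv are an assumption here (check the file's header row before relying on a name such as `helpfulness`).

```python
import csv
from statistics import median


def median_rating(csv_path, column):
    """Median of a numeric rating column, skipping blank or non-numeric cells.

    `column` must match a header in the CSV; the header names are not
    documented above, so treat any specific name as an assumption.
    """
    values = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            cell = (row.get(column) or "").strip()
            try:
                values.append(float(cell))
            except ValueError:
                continue  # unrated or malformed entry
    if not values:
        raise ValueError(f"no numeric values found in column {column!r}")
    return median(values)
```

For example, `median_rating("benchmark/human-feedback.csv", "helpfulness")` would return the median helpfulness rating, provided the path and column name match the downloaded file.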

Number of Figures in Each Subset

	Train	Validate	Test
Benchmark	106,834	13,354	13,355

JSON Data Format (for each figure-caption in Caption-All)

Example JSON

{
  "contains-subfigure": true, 
  "Img-text": ["(b)", "s]", "[m", "fs", "et", "e", "of", "T", "im", "Attack", "duration", "[s]", "350", "300", "250", "200", "150", "100", "50", "0", "50", "100", "150", "200", "250", "300", "0", "(a)", "]", "[", "m", "fs", "et", "e", "of", "ta", "nc", "D", "is", "Attack", "duration", "[s]", "10000", "9000", "8000", "7000", "6000", "5000", "4000", "3000", "2000", "1000", "0", "50", "100", "150", "200", "250", "300", "0"], 
  "paper-ID": "1001.0025v1", 
  "figure-ID": "1001.0025v1-Figure2-1.png", 
  "figure-type": "Graph Plot", 
  "human-feedback":{
    "helpfulness": {
      "score": XXXX,
      "label": "[GOOD]/[BAD]",
      "caption-prepend": "[GOOD]/[BAD] actual caption..."
    },
    "ocr": {
      "score": XXXX,
      "label": "[GOOD]/[BAD]",
      "caption-prepend": "[GOOD]/[BAD] actual caption..."
    },
    "visual": {
      "score": XXXX,
      "label": "[GOOD]/[BAD]",
      "caption-prepend": "[GOOD]/[BAD] actual caption..."
    },
    "takeaway": {
      "score": XXXX,
      "label": "[GOOD]/[BAD]",
      "caption-prepend": "[GOOD]/[BAD] actual caption..."
    }
  },
  "0-originally-extracted": "Figure 2: Impact of the replay attack, as a function of the spoofing attack duration. (a) Location offset or error: Distance between the attack-induced and the actual victim receiver position. (b) Time offset or error: Time difference between the attack-induced clock value and the actual time.", 
  "1-lowercase-and-token-and-remove-figure-index": {
    "caption": "impact of the replay attack , as a function of the spoofing attack duration . ( a ) location offset or error : distance between the attack-induced and the actual victim receiver position . ( b ) time offset or error : time difference between the attack-induced clock value and the actual time .", 
    "sentence": ["impact of the replay attack , as a function of the spoofing attack duration .", "( a ) location offset or error : distance between the attack-induced and the actual victim receiver position .", "( b ) time offset or error : time difference between the attack-induced clock value and the actual time ."], 
    "token": ["impact", "of", "the", "replay", "attack", ",", "as", "a", "function", "of", "the", "spoofing", "attack", "duration", ".", "(", "a", ")", "location", "offset", "or", "error", ":", "distance", "between", "the", "attack-induced", "and", "the", "actual", "victim", "receiver", "position", ".", "(", "b", ")", "time", "offset", "or", "error", ":", "time", "difference", "between", "the", "attack-induced", "clock", "value", "and", "the", "actual", "time", "."]
  }, 
  "2-normalized": {
    "2-1-basic-num": {
      "caption": "impact of the replay attack , as a function of the spoofing attack duration . ( a ) location offset or error : distance between the attack-induced and the actual victim receiver position . ( b ) time offset or error : time difference between the attack-induced clock value and the actual time .", 
      "sentence": ["impact of the replay attack , as a function of the spoofing attack duration .", "( a ) location offset or error : distance between the attack-induced and the actual victim receiver position .", "( b ) time offset or error : time difference between the attack-induced clock value and the actual time ."], 
      "token": ["impact", "of", "the", "replay", "attack", ",", "as", "a", "function", "of", "the", "spoofing", "attack", "duration", ".", "(", "a", ")", "location", "offset", "or", "error", ":", "distance", "between", "the", "attack-induced", "and", "the", "actual", "victim", "receiver", "position", ".", "(", "b", ")", "time", "offset", "or", "error", ":", "time", "difference", "between", "the", "attack-induced", "clock", "value", "and", "the", "actual", "time", "."]
    }, 
    "2-2-advanced-equation-bracket": {
      "caption": "impact of the replay attack , as a function of the spoofing attack duration . BRACKET-TK location offset or error : distance between the attack-induced and the actual victim receiver position . BRACKET-TK time offset or error : time difference between the attack-induced clock value and the actual time .", 
      "sentence": ["impact of the replay attack , as a function of the spoofing attack duration .", "BRACKET-TK location offset or error : distance between the attack-induced and the actual victim receiver position .", "BRACKET-TK time offset or error : time difference between the attack-induced clock value and the actual time ."], 
      "tokens": ["impact", "of", "the", "replay", "attack", ",", "as", "a", "function", "of", "the", "spoofing", "attack", "duration", ".", "BRACKET-TK", "location", "offset", "or", "error", ":", "distance", "between", "the", "attack-induced", "and", "the", "actual", "victim", "receiver", "position", ".", "BRACKET-TK", "time", "offset", "or", "error", ":", "time", "difference", "between", "the", "attack-induced", "clock", "value", "and", "the", "actual", "time", "."]
    }
  }
}

JSON Schema

contains-subfigure: boolean (if figure-image contains subfigures)
paper-ID: the unique paper ID in the arXiv dataset
figure-ID: the extracted figure ID of paper (the index is not the same as the label in the caption)
figure-type: the figure type
0-originally-extracted: original figure-caption
1-lowercase-and-token-and-remove-figure-index: caption lowercased and tokenized, with the figure index removed
2-normalized:
- 2-1-basic-num: caption after replacing numbers
- 2-2-advanced-euqation-bracket: caption after replacing equations and bracketed content with placeholder tokens (e.g. BRACKET-TK)
Img-text: text extracted from the figure, such as axis labels and legends

Within the caption content, we have three attributes:

caption: caption after each normalization
sentence: a list of segmented sentences
token: a list of tokenized words
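
To illustrate how these fields fit together, here is a minimal sketch that reads one record and pulls out the caption at a chosen normalization level. It assumes each figure-caption is stored as its own JSON file under Caption-All, as the heading above suggests; the helper name `load_caption` and the "/"-separated `level` path are illustrative choices, not part of the repository.

```python
import json


def load_caption(json_path, level="1-lowercase-and-token-and-remove-figure-index"):
    """Return the caption, sentences, and tokens stored at one normalization level.

    `level` is a "/"-separated path of keys, e.g. "2-normalized/2-1-basic-num".
    Hypothetical helper; assumes one JSON record per file.
    """
    with open(json_path, encoding="utf-8") as f:
        record = json.load(f)

    node = record
    for key in level.split("/"):
        node = node[key]  # raises KeyError if the level is absent

    return {
        "figure_id": record["figure-ID"],
        "caption": node["caption"],
        "sentences": node["sentence"],
        # The example record uses both "token" and "tokens", so accept either.
        "tokens": node.get("token", node.get("tokens", [])),
    }
```

Accepting either spelling of the token key keeps the sketch usable regardless of which one a given record contains.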

Within the human-feedback field, we have the inferred human feedback for each metric (helpfulness, ocr, takeaway, and visual). The token for each metric is assigned by comparing the predicted score with the dataset's median score on that metric.

human-feedback:

helpfulness: expert rating of how helpful the caption is for understanding the scientific figure
- score: predicted score
- label: [GOOD]/[BAD] token
- caption-prepend: the 1-lowercase-and-token-and-remove-figure-index caption with the token prepended
takeaway: expert rating of the takeaway from the scientific figure
- score: predicted score
- label: [GOOD]/[BAD] token
- caption-prepend: the 1-lowercase-and-token-and-remove-figure-index caption with the token prepended
ocr: expert rating of the expressiveness of the OCR text
- score: predicted score
- label: [GOOD]/[BAD] token
- caption-prepend: the 1-lowercase-and-token-and-remove-figure-index caption with the token prepended
visual: expert rating of the visual descriptiveness of the scientific figure
- score: predicted score
- label: [GOOD]/[BAD] token
- caption-prepend: the 1-lowercase-and-token-and-remove-figure-index caption with the token prepended
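
The median-based token assignment described above can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: the function name `label_by_median` is hypothetical, and treating scores at or above the median as [GOOD] (rather than strictly above) is a guess about tie handling.

```python
from statistics import median


def label_by_median(scores, captions, good="[GOOD]", bad="[BAD]"):
    """Attach a [GOOD]/[BAD] token to each caption by comparing its score
    with the median of all scores.

    Assumption: scores equal to the median are labelled good.
    """
    threshold = median(scores)
    labelled = []
    for score, caption in zip(scores, captions):
        label = good if score >= threshold else bad
        labelled.append({
            "score": score,
            "label": label,
            # Mirrors the "caption-prepend" field: token followed by the caption.
            "caption-prepend": f"{label} {caption}",
        })
    return labelled
```

Each returned dictionary has the same three keys as one metric entry in the human-feedback field of the example JSON.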

Installation

#We first need to clone this repository, install the requirements, and download the benchmark dataset
pip install --upgrade pip
git clone https://github.com/FigCapsHF/FigCapsHF
cd FigCapsHF
pip install -r requirements.txt
wget https://figshare.com/ndownloader/files/41222934 -O benchmark.zip
unzip benchmark.zip

Example Usage

RLHF Fine-tuning

#Code edits to implement a baseline are also included in train_blip.py
#Training on GPUs is preferred. To train on CPU, add the "--cpu" flag.
python train_blip.py --mixed_precision fp16 --hf_score_type helpfulness --benchmark_path /benchmark

Inference

Our RLHF fine-tuned BLIP model can be downloaded here: [Download Link](2.5 GB), or with the command below

wget https://figshare.com/ndownloader/files/41359434 -O checkpoint_09.pth

#Generate caption for a single image
python inference.py --figure_path /Figures/sample.png --model_path /checkpoint_09.pth

#Generate evaluation metrics on the test dataset
python test_blip.py --benchmark_path /benchmark --model_path /checkpoint_09.pth

Visualization

#For the following sections, we initialize a FigCapsHF object
from FigCapsHF import FigCapsHF
FigCapsHF = FigCapsHF("/benchmark")

#Visualize sample from dataset
FigCapsHF.get_image_caption_pair(data_split = "train", image_name = "1001.0025v1-Figure5-1")

#Visualize sample from the human annotated dataset and associated metadata
FigCapsHF.get_image_caption_pair_hf(image_name = "1907.11521v1-Figure6-1")

Human Feedback Generation

#Generate human-feedback metadata for the dataset
inferred_hf_df = FigCapsHF.infer_hf_training_set(hf_score_type = "helpfulness", embedding_model = "BERT", max_num_samples = 100, quantization_levels = 3, mapped_hf_labels = ["Bad", "Neutral", "Good"])

#Generate a human-feedback score for a single figure-caption pair
hf_ds_embeddings, scores = FigCapsHF.generate_embeddings_hf_anno(hf_score_type = "helpfulness", embedding_model = "BERT")
scoring_model = FigCapsHF.train_scoring_model(hf_ds_embeddings, scores)

image_path = "/Figures/sample.png"
caption = "the graph indicates the loss of the model over successive generations"

embedding = FigCapsHF.generate_embeddings([image_path], [caption], embedding_model = "BERT")
inferred_hf_score = scoring_model.predict(embedding)
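
The `quantization_levels` and `mapped_hf_labels` arguments above suggest that predicted scores are bucketed into a small number of ordered labels. One plausible way to do this is quantile-based binning, sketched below with the standard library. This is an assumption for illustration only: `quantize_scores` is a hypothetical helper, and `infer_hf_training_set` may bin scores differently (for example, with equal-width intervals).

```python
from bisect import bisect_right
from statistics import quantiles


def quantize_scores(scores, labels=("Bad", "Neutral", "Good")):
    """Map each score to one of len(labels) ordered labels using quantile cuts.

    Assumption: bins are defined by quantiles of the scores themselves, so
    each label covers roughly the same number of samples.
    """
    n_levels = len(labels)
    if n_levels < 2:
        raise ValueError("need at least two labels")
    # quantiles() returns n_levels - 1 cut points in ascending order.
    cuts = quantiles(scores, n=n_levels)
    # bisect_right gives the index of the bin each score falls into.
    return [labels[bisect_right(cuts, s)] for s in scores]
```

With three labels this yields a lower, middle, and upper third of the score distribution, which matches the shape of the `["Bad", "Neutral", "Good"]` mapping passed in the call above.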

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

This dataset uses data in the arXiv dataset. The arXiv dataset uses the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication license for the metadata, which grants permission to remix, remake, annotate, and publish the metadata.
