FigCaps-HF: A Figure-to-Caption Generative Framework and Benchmark with Human Feedback

[Website] [Benchmark Dataset]

To enable the generation of high-quality figure captions, we introduce FigCaps-HF, a new framework for figure-caption generation that can incorporate domain-expert feedback to generate captions optimized for reader preferences. Our framework comprises 1) an automatic method for evaluating the quality of figure-caption pairs, and 2) a novel reinforcement learning with human feedback (RLHF) method to optimize a generative figure-to-caption model for reader preferences.

We release a large-scale benchmark dataset with human feedback on figure-caption pairs to enable further evaluation and development of RLHF techniques for this problem.

Benchmark Dataset

The benchmark dataset can be downloaded here: [Download Link](8.34 GB)

Folder Structure

├── No-Subfig-Img                       # figure-image files for each split of the dataset
│   ├── Train
│   ├── Val
│   └── Test
├── Caption-All                         # corresponding figure-captions and (precomputed) inferred human-feedback metadata
│   ├── Train
│   ├── Val
│   └── Test
├── human-feedback.csv                  # human evaluations of a subset of figure image-caption pairs
├── arxiv-metadata-oai-snapshot.json    # arXiv paper metadata (from the arXiv dataset)
└── List-of-Files-for-Each-Experiments  # lists of figure names used in each experiment
    ├── Single-Sentence-Caption
    │   ├── No-Subfig
    │   │   ├── Train
    │   │   ├── Val
    │   │   └── Test
    │   └── Yes-Subfig
    │       ├── Train
    │       ├── Val
    │       └── Test
    ├── First-Sentence                  # same structure as Single-Sentence-Caption
    └── Caption-No-More-Than-100-Tokens # same structure as Single-Sentence-Caption
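As a sketch of how these pieces fit together, a split can be enumerated by pairing each caption record in Caption-All with its image in No-Subfig-Img. This is a minimal example, not the repository's API; the assumption that records are stored as one .json file per figure is ours:

import json
from pathlib import Path

benchmark = Path("/benchmark")  # wherever benchmark.zip was unpacked

def iter_split(split):
    # Yield (image_path, caption_record) pairs for one split: "Train", "Val", or "Test"
    for json_path in sorted((benchmark / "Caption-All" / split).glob("*.json")):
        record = json.loads(json_path.read_text())
        # figure-ID (e.g. "1001.0025v1-Figure2-1.png") names the matching image file
        yield benchmark / "No-Subfig-Img" / split / record["figure-ID"], record

for image_path, record in iter_split("Train"):
    print(image_path, record["0-originally-extracted"][:60])
    break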

Human Feedback Benchmark Data

The included human-feedback.csv contains human evaluations of 439 figure image-caption pairs from the dataset. Each pair is rated on “helpfulness”, “OCR (quality)”, “takeaway”, and “visual (descriptiveness)”, each scored on a 1-5 scale (5 being the highest). The annotations also include booleans indicating whether each pair “has-image-error”, “has-caption-error”, “has-classification-error”, or “has-subfigure-error”. For convenience, the image-file name and the URL of the originating arXiv paper are also included.
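A quick way to inspect these annotations with pandas (a minimal sketch; the column names used below are assumptions based on the description above, so check hf.columns against the actual file):

import pandas as pd

hf = pd.read_csv("/benchmark/human-feedback.csv")
print(len(hf))                      # expect 439 annotated pairs
print(hf.columns.tolist())          # inspect the actual schema
# distribution of 1-5 ratings for one metric (column name assumed)
print(hf["helpfulness"].value_counts().sort_index())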

Number of Figures in Each Subset

            Train     Validate   Test
Benchmark   106,834   13,354     13,355

JSON Data Format (for each figure-caption in Caption-All)

Example JSON

{
  "contains-subfigure": true, 
  "Img-text": ["(b)", "s]", "[m", "fs", "et", "e", "of", "T", "im", "Attack", "duration", "[s]", "350", "300", "250", "200", "150", "100", "50", "0", "50", "100", "150", "200", "250", "300", "0", "(a)", "]", "[", "m", "fs", "et", "e", "of", "ta", "nc", "D", "is", "Attack", "duration", "[s]", "10000", "9000", "8000", "7000", "6000", "5000", "4000", "3000", "2000", "1000", "0", "50", "100", "150", "200", "250", "300", "0"], 
  "paper-ID": "1001.0025v1", 
  "figure-ID": "1001.0025v1-Figure2-1.png", 
  "figure-type": "Graph Plot", 
  "human-feedback":{
    "helpfulness": {
      "score": XXXX,
      "label": "[GOOD]/[BAD]",
      "caption-prepend": "[GOOD]/[BAD] actual caption...",
    },
    "ocr": {
      "score": XXXX,
      "label": "[GOOD]/[BAD]",
      "caption-prepend": "[GOOD]/[BAD] actual caption...",
    },
    "visual": {
      "score": XXXX,
      "label": "[GOOD]/[BAD]",
      "caption-prepend": "[GOOD]/[BAD] actual caption...",
    },
    "takeaway": {
      "score": XXXX,
      "label": "[GOOD]/[BAD]",
      "caption-prepend": "[GOOD]/[BAD] actual caption...",
    },
  }
  "0-originally-extracted": "Figure 2: Impact of the replay attack, as a function of the spoofing attack duration. (a) Location offset or error: Distance between the attack-induced and the actual victim receiver position. (b) Time offset or error: Time difference between the attack-induced clock value and the actual time.", 
  "1-lowercase-and-token-and-remove-figure-index": {
    "caption": "impact of the replay attack , as a function of the spoofing attack duration . ( a ) location offset or error : distance between the attack-induced and the actual victim receiver position . ( b ) time offset or error : time difference between the attack-induced clock value and the actual time .", 
    "sentence": ["impact of the replay attack , as a function of the spoofing attack duration .", "( a ) location offset or error : distance between the attack-induced and the actual victim receiver position .", "( b ) time offset or error : time difference between the attack-induced clock value and the actual time ."], 
    "token": ["impact", "of", "the", "replay", "attack", ",", "as", "a", "function", "of", "the", "spoofing", "attack", "duration", ".", "(", "a", ")", "location", "offset", "or", "error", ":", "distance", "between", "the", "attack-induced", "and", "the", "actual", "victim", "receiver", "position", ".", "(", "b", ")", "time", "offset", "or", "error", ":", "time", "difference", "between", "the", "attack-induced", "clock", "value", "and", "the", "actual", "time", "."]
  }, 
  "2-normalized": {
    "2-1-basic-num": {
      "caption": "impact of the replay attack , as a function of the spoofing attack duration . ( a ) location offset or error : distance between the attack-induced and the actual victim receiver position . ( b ) time offset or error : time difference between the attack-induced clock value and the actual time .", 
      "sentence": ["impact of the replay attack , as a function of the spoofing attack duration .", "( a ) location offset or error : distance between the attack-induced and the actual victim receiver position .", "( b ) time offset or error : time difference between the attack-induced clock value and the actual time ."], 
      "token": ["impact", "of", "the", "replay", "attack", ",", "as", "a", "function", "of", "the", "spoofing", "attack", "duration", ".", "(", "a", ")", "location", "offset", "or", "error", ":", "distance", "between", "the", "attack-induced", "and", "the", "actual", "victim", "receiver", "position", ".", "(", "b", ")", "time", "offset", "or", "error", ":", "time", "difference", "between", "the", "attack-induced", "clock", "value", "and", "the", "actual", "time", "."]
    }, 
    "2-2-advanced-equation-bracket": {
      "caption": "impact of the replay attack , as a function of the spoofing attack duration . BRACKET-TK location offset or error : distance between the attack-induced and the actual victim receiver position . BRACKET-TK time offset or error : time difference between the attack-induced clock value and the actual time .", 
      "sentence": ["impact of the replay attack , as a function of the spoofing attack duration .", "BRACKET-TK location offset or error : distance between the attack-induced and the actual victim receiver position .", "BRACKET-TK time offset or error : time difference between the attack-induced clock value and the actual time ."], 
      "tokens": ["impact", "of", "the", "replay", "attack", ",", "as", "a", "function", "of", "the", "spoofing", "attack", "duration", ".", "BRACKET-TK", "location", "offset", "or", "error", ":", "distance", "between", "the", "attack-induced", "and", "the", "actual", "victim", "receiver", "position", ".", "BRACKET-TK", "time", "offset", "or", "error", ":", "time", "difference", "between", "the", "attack-induced", "clock", "value", "and", "the", "actual", "time", "."]
    }
  }
}

JSON Schema

  • contains-subfigure: boolean (whether the figure-image contains subfigures)
  • paper-ID: the unique paper ID in the arXiv dataset
  • figure-ID: the extracted figure ID of the paper (the index is not necessarily the same as the figure number in the caption)
  • figure-type: the figure type
  • 0-originally-extracted: the original figure caption
  • 1-lowercase-and-token-and-remove-figure-index: the caption lowercased and tokenized, with the figure index removed
  • 2-normalized:
    • 2-1-basic-num: caption after replacing numbers
    • 2-2-advanced-equation-bracket: caption after replacing equations and bracketed content with the BRACKET-TK placeholder
  • Img-text: text extracted from the figure, such as labels, legends, etc.

Within each caption field, we have three attributes:

  • caption: the caption text after the given normalization
  • sentence: a list of segmented sentences
  • token: a list of tokenized words
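For example, these attributes can be read off a single record (a minimal sketch; the file path and the one-JSON-file-per-figure layout are assumptions):

import json

# Hypothetical path; records are assumed to live under Caption-All/<split>/
with open("/benchmark/Caption-All/Train/1001.0025v1-Figure2-1.json") as f:
    record = json.load(f)

norm = record["2-normalized"]["2-2-advanced-equation-bracket"]
print(norm["caption"])     # caption with BRACKET-TK placeholders
print(norm["sentence"])    # list of segmented sentences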

Within the human-feedback field, we have the inferred human feedback for the different metrics (helpfulness, ocr, takeaway, and visual). The [GOOD]/[BAD] labels are assigned by thresholding each predicted score at the dataset median for that metric.
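Concretely, the labeling rule amounts to a per-metric median split (a minimal sketch, not the repository's exact implementation; whether a score exactly at the median maps to [GOOD] is an assumption):

import numpy as np

scores = np.array([2.0, 3.5, 4.0, 1.5, 3.0])  # toy predicted helpfulness scores
median = np.median(scores)

def to_label(score):
    # [GOOD] if at or above the dataset median for this metric, else [BAD]
    return "[GOOD]" if score >= median else "[BAD]"

# caption-prepend stitches the label onto the normalized caption
caption = "impact of the replay attack , as a function of the spoofing attack duration ."
print(f"{to_label(3.5)} {caption}")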

human-feedback:

  • helpfulness: expert rating of how helpful the caption is for understanding the scientific figure
  • takeaway: expert rating of how well the caption conveys the takeaway of the scientific figure
  • ocr: expert rating of the expressiveness of the text extracted (OCR) from the figure
  • visual: expert rating of the visual descriptiveness of the caption

Each metric carries the same three fields:

  • score: the predicted score
  • label: [GOOD] or [BAD]
  • caption-prepend: the 1-lowercase-and-token-and-remove-figure-index caption with the label prepended

Installation

# Clone this repository, install the requirements, and download the benchmark dataset
pip install --upgrade pip
git clone https://github.com/FigCapsHF/FigCapsHF
cd FigCapsHF
pip install -r requirements.txt
wget https://figshare.com/ndownloader/files/41222934 -O benchmark.zip
unzip benchmark.zip

Example Usage

RLHF Fine-tuning

# Code edits to implement a baseline are also included in train_blip.py
# Training on GPU is preferred; to train on CPU, add the --cpu flag
python train_blip.py --mixed_precision fp16 --hf_score_type helpfulness --benchmark_path /benchmark
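The same command can presumably target any of the four feedback metrics; for example, to optimize against the OCR signal instead (that --hf_score_type also accepts ocr, visual, and takeaway is our assumption based on the metric names above):

python train_blip.py --mixed_precision fp16 --hf_score_type ocr --benchmark_path /benchmark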

Inference

Our RLHF fine-tuned BLIP model can be downloaded here: [Download Link](2.5 GB), or with the commands below.

wget https://figshare.com/ndownloader/files/41359434 -O checkpoint_09.pth
# Generate a caption for a single image
python inference.py --figure_path /Figures/sample.png --model_path /checkpoint_09.pth
# Generate evaluation metrics on the test dataset
python test_blip.py --benchmark_path /benchmark --model_path /checkpoint_09.pth
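To caption a whole directory of figures, the single-image script can simply be looped over (a shell sketch reusing only the flags shown above):

for f in /Figures/*.png; do
  python inference.py --figure_path "$f" --model_path /checkpoint_09.pth
done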

Visualization

# For the following sections, we first initialize a FigCapsHF object
# (bound to a new name so it does not shadow the imported class)
from FigCapsHF import FigCapsHF
fig_caps_hf = FigCapsHF("/benchmark")
# Visualize a sample from the dataset
fig_caps_hf.get_image_caption_pair(data_split = "train", image_name = "1001.0025v1-Figure5-1")
# Visualize a sample from the human-annotated dataset and its associated metadata
fig_caps_hf.get_image_caption_pair_hf(image_name = "1907.11521v1-Figure6-1")

Human Feedback Generation

# Generate human-feedback metadata for the full dataset
inferred_hf_df = fig_caps_hf.infer_hf_training_set(hf_score_type = "helpfulness", embedding_model = "BERT", max_num_samples = 100, quantization_levels = 3, mapped_hf_labels = ["Bad", "Neutral", "Good"])

# Generate a human-feedback score for a single figure-caption pair:
# first embed the annotated pairs and fit a scoring model on them
hf_ds_embeddings, scores = fig_caps_hf.generate_embeddings_hf_anno(hf_score_type = "helpfulness", embedding_model = "BERT")
scoring_model = fig_caps_hf.train_scoring_model(hf_ds_embeddings, scores)

# then embed the new figure-caption pair and predict its score
image_path = "/Figures/sample.png"
caption = "the graph indicates the loss of the model over successive generations"

embedding = fig_caps_hf.generate_embeddings([image_path], [caption], embedding_model = "BERT")
inferred_hf_score = scoring_model.predict(embedding)
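The resulting inferred_hf_score can then be mapped to a [GOOD]/[BAD] label the same way the benchmark's labels are derived, by thresholding at the median of the annotated scores (a sketch; it assumes the scoring model returns an array-like of scores, and the tie-handling at the median is a guess):

import numpy as np

# "scores" are the annotated scores returned by generate_embeddings_hf_anno above
median = np.median(scores)
label = "[GOOD]" if inferred_hf_score[0] >= median else "[BAD]"
print(label)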

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

This dataset builds on the arXiv dataset. The arXiv dataset releases its metadata under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication license, which grants permission to remix, remake, annotate, and publish the metadata.
