# Make an Image Dataset on Hugging Face Datasets

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/EvolvingLMMs-Lab/lmms-eval/blob/main/tools/make_image_hf_dataset.ipynb)

This notebook shows how to build a Hugging Face dataset in the correct format: stored as Parquet and viewable in the dataset viewer on the Hugging Face Hub.

We use the dataset [`pufanyi/VQAv2_Example`](https://huggingface.co/datasets/lmms-lab/VQAv2) as the running example and convert it to this format.

## Download Dataset

The zipped source data is available on [Hugging Face](https://huggingface.co/datasets/pufanyi/VQAv2_TOY/tree/main/source_data). It is a subset of the [VQAv2](https://visualqa.org/) dataset with 10 entries from each of the `val`, `test`, and `test-dev` splits, which keeps the download small.

In [45]:
!wget https://huggingface.co/datasets/pufanyi/VQAv2_TOY/resolve/main/source_data/sample_data.zip -P data
!unzip data/sample_data.zip -d data


Open `data/questions` to see how the dataset is organized. Each question file in the toy `VQAv2` dataset has the following structure:

```json
{
    "info": { /* some information */ },
    "task_type": "TASK_TYPE", "data_type": "mscoco",
    "license": { /* some license */ },
    "questions": [
        {
            "image_id": 262144, // integer id of the image
            "question": "Is the ball flying towards the batter?",
            "question_id": 262144000
        },
        /* ... */
    ]
}
```
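
To check this structure programmatically, a small helper can summarize a loaded question file. The sketch below is only an illustration: `summarize_questions` is a hypothetical helper, and the inline `sample` dictionary stands in for the result of calling `json.load` on one of the files in `data/questions`.

```python
def summarize_questions(data):
    """Return the top-level keys, the number of questions, and the per-question keys."""
    questions = data["questions"]
    # Collect every key that appears in at least one question entry
    question_keys = sorted({key for q in questions for key in q})
    return {
        "top_level_keys": sorted(data),
        "num_questions": len(questions),
        "question_keys": question_keys,
    }


# Stand-in for a loaded question file (assumed to follow the layout shown above)
sample = {
    "info": {},
    "task_type": "TASK_TYPE",
    "data_type": "mscoco",
    "license": {},
    "questions": [
        {
            "image_id": 262144,
            "question": "Is the ball flying towards the batter?",
            "question_id": 262144000,
        },
    ],
}

summary = summarize_questions(sample)
print(summary["question_keys"])  # ['image_id', 'question', 'question_id']
```

Note that the question entries carry no answer fields, which is why missing columns have to be handled later when the splits are built.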

## Define Dataset Features _(Optional<sup>*</sup>)_

You can define the features of the dataset. For more details, please refer to the [official documentation](https://huggingface.co/docs/datasets/en/about_dataset_features).

<sup>*</sup> _Note that if the dataset features are consistent and all entries in your dataset table are non-null **for all splits of data**, you can skip this step._

In [None]:
import datasets

features = datasets.Features(
    {
        "question": datasets.Value("string"),
        "question_id": datasets.Value("int64"),
        "image_id": datasets.Value("string"),
        "image": datasets.Image(),
        "answers": datasets.Sequence(
            datasets.Sequence(
                feature={
                    "answer": datasets.Value("string"),
                    "answer_confidence": datasets.Value("string"),
                    "answer_id": datasets.Value("int64"),
                }
            )
        ),
        "answer_type": datasets.Value("string"),
        "multiple_choice_answer": datasets.Value("string"),
        "question_type": datasets.Value("string"),
    }
)

## Define Data Generator

We use [`datasets.Dataset.from_generator`](https://huggingface.co/docs/datasets/v2.20.0/en/package_reference/main_classes#datasets.Dataset.from_generator) to create the dataset.

The generator function should `yield` dictionaries whose keys match the dataset features. Because examples are produced one at a time, this can save memory when loading large datasets.

For the image data, we can load each image as a [`PIL.Image`](https://pillow.readthedocs.io/en/stable/reference/Image.html) object.

Note that if some columns are missing in some splits of the dataset (for example, the `answers` column is usually missing in the `test` split), we need to set these columns to null so that all splits have the same features.
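
The null-filling step can be sketched in isolation. In the snippet below, `fill_missing` is a hypothetical helper (not part of this notebook or of `datasets`), and `KEYS` mirrors the column list used in the generator cell. Unlike the generator, which updates each entry in place, this variant returns a copy.

```python
KEYS = [
    "question",
    "question_id",
    "image_id",
    "answers",
    "answer_type",
    "multiple_choice_answer",
    "question_type",
]


def fill_missing(record, keys=KEYS):
    """Return a copy of `record` with every key in `keys` present (None if missing)."""
    filled = dict(record)
    for key in keys:
        # Keep existing values; only add the key when it is absent
        filled.setdefault(key, None)
    return filled


# A test-split style entry: it has a question but no answer fields
test_entry = {
    "image_id": 262144,
    "question": "Is the ball flying towards the batter?",
    "question_id": 262144000,
}

filled = fill_missing(test_entry)
print(filled["answers"])  # None
```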

In [None]:
import os
import json
from PIL import Image

KEYS = ["question", "question_id", "image_id", "answers", "answer_type", "multiple_choice_answer", "question_type"]

def generator(qa_file, image_folder, image_prefix):
    # Open and load the question-answer file
    with open(qa_file, "r") as f:
        data = json.load(f)
        qa = data["questions"]

    for q in qa:
        # Get the image id
        image_id = q["image_id"]
        # Construct the image path
        image_path = os.path.join(image_folder, f"{image_prefix}_{image_id:012}.jpg")
        # Open the image and add it to the question-answer dictionary
        q["image"] = Image.open(image_path)
        # Check if all keys are present in the question-answer dictionary, if not add them with None value
        for key in KEYS:
            if key not in q:
                q[key] = None
        # Yield the question-answer dictionary
        yield q
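
The image path construction in the generator relies on the `:012` format specifier, which zero-pads the integer `image_id` to 12 digits. A minimal sketch of that step follows; `coco_image_path` is a hypothetical helper, and it uses `posixpath.join` rather than `os.path.join` only so that the printed path is the same on every platform.

```python
import posixpath


def coco_image_path(image_folder, image_prefix, image_id):
    """Build the file path for an image from its folder, file prefix, and integer id."""
    # `:012` pads the integer id with leading zeros to a width of 12
    return posixpath.join(image_folder, f"{image_prefix}_{image_id:012}.jpg")


path = coco_image_path("data/images/val2014", "COCO_val2014", 262144)
print(path)  # data/images/val2014/COCO_val2014_000000262144.jpg
```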

## Generate Dataset

We generate the dataset using the generator function.

Note that if you skipped the step of defining dataset features, there is no need to pass the `features` argument. The features are then inferred from the data automatically.

In [None]:
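# Number of processes to use for multiprocessing; set to 1 for no multiprocessing.
# Note: `datasets` parallelizes by splitting list-valued entries of `gen_kwargs` across processes,
# so with the scalar `gen_kwargs` used below it is expected to fall back to a single process.
NUM_PROC = 32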

data_val = datasets.Dataset.from_generator(
    generator,
    gen_kwargs={
        "qa_file": "data/questions/v2_OpenEnded_mscoco_val2014_questions.json",
        "image_folder": "data/images/val2014",
        "image_prefix": "COCO_val2014",
    },
    features=features,
    num_proc=NUM_PROC,
)

data_test = datasets.Dataset.from_generator(
    generator,
    gen_kwargs={
        "qa_file": "data/questions/v2_OpenEnded_mscoco_test2015_questions.json",
        "image_folder": "data/images/test2015",
        "image_prefix": "COCO_test2015",
    },
    features=features,
    num_proc=NUM_PROC,
)

data_test_dev = datasets.Dataset.from_generator(
    generator,
    gen_kwargs={
        "qa_file": "data/questions/v2_OpenEnded_mscoco_test-dev2015_questions.json",
        "image_folder": "data/images/test2015",
        "image_prefix": "COCO_test2015",
    },
    features=features,
    num_proc=NUM_PROC,
)

## Dataset Upload

Finally, we group the splits into a `DatasetDict` and upload it to the Hugging Face Hub.

In [None]:
data = datasets.DatasetDict({"val": data_val, "test": data_test, "test_dev": data_test_dev})

In [None]:
data.push_to_hub("pufanyi/VQAv2")

In [44]:
from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="/data/pufanyi/project/lmms-eval-public/tools/data/sample_data.zip",
    path_in_repo="source_data/sample_data.zip",
    repo_id="pufanyi/VQAv2_TOY",
    repo_type="dataset",
)

CommitInfo(commit_url='https://huggingface.co/datasets/pufanyi/VQAv2_TOY/commit/b057eff450520a6e3fc7e6be88c3a172c4b5d99b', commit_message='Upload source_data/sample_data.zip with huggingface_hub', commit_description='', oid='b057eff450520a6e3fc7e6be88c3a172c4b5d99b', pr_url=None, pr_revision=None, pr_num=None)