Generate interleaved text and image content in a structured format that you can pass directly to downstream APIs. You can also use this to control how many "draft" text tokens and "imagination" image tokens the model generates before it starts producing the final output.
Example output:
{
  "fruit_name": "Orange",
  "fruit_image": "<image>",
  "images_of_related_fruits": [
    "<image>",
    "<image>",
    "<image>"
  ]
}
The generated images are saved to disk.
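Before handing the output to a downstream API, you will usually want to swap each "<image>" placeholder for the file it refers to. Here is a minimal sketch of that step. It assumes the images are saved in the same order the placeholders appear in the output, and the file names are hypothetical; check how the scripts actually name the saved files.

```python
import json

PLACEHOLDER = "<image>"  # placeholder string seen in the example output


def attach_image_paths(node, paths):
    """Return a copy of `node` with each "<image>" replaced by the next path."""
    if node == PLACEHOLDER:
        return next(paths)
    if isinstance(node, dict):
        return {key: attach_image_paths(value, paths) for key, value in node.items()}
    if isinstance(node, list):
        return [attach_image_paths(item, paths) for item in node]
    return node


response = json.loads("""
{
  "fruit_name": "Orange",
  "fruit_image": "<image>",
  "images_of_related_fruits": ["<image>", "<image>", "<image>"]
}
""")

# Hypothetical file names (assumption): the real names depend on how the
# script saves the generated images.
saved = iter(["image_0.png", "image_1.png", "image_2.png", "image_3.png"])
resolved = attach_image_paths(response, saved)
print(json.dumps(resolved, indent=2))
```

The walk relies on dictionaries keeping their insertion order, so placeholders are filled in the order they appear in the output.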
NOTE: This is a work in progress. It currently supports only vector-quantized vision-language models (e.g. Chameleon and its finetunes), but it should also work on soft-vector vision-language models with little modification (hopefully).
Preprint coming up.
pip install -e .
To install everything:
pip install -e .[test]
Scripts to run the examples are in the scripts directory.
python scripts/text_only_generation.py --prompt "Are bananas fruits or vegetables?"
Prompt:
Are bananas fruits or vegetables?
Response:
Bananas are technically a fruit. They are the seeds of the plantain, a type of flowering plant, and the flesh is formed from the ovary. However, bananas are often classified as a vegetable for culinary purposes, especially in many Asian countries.
python scripts/text_only_generation.py --inference_mode text-image-to-text --prompt "Which constellation is this?" --image_1_path "https://nineplanets.org/wp-content/uploads/2020/12/the-big-dipper-1.jpg"
Prompt:
Response:
That's the constellation Ursa Major.
python scripts/text_only_generation.py --inference_mode multi-image-to-text --prompt "What do these two images have in common?" --image_1_path "https://nineplanets.org/wp-content/uploads/2020/12/the-big-dipper-1.jpg" --image_2_path "https://www.kxan.com/wp-content/uploads/sites/40/2020/10/ORION.jpg"
Prompt:
Response:
Both are constellations.
python scripts/image_only_generation.py --inference_mode text-to-image --prompt "Draw a banana"
Prompt:
Draw a banana
Response:
python scripts/image_only_generation.py --inference-mode text-image-to-image --max-new-tokens 2500 --prompt "Draw a variation of this image" --image-1-path "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"
Prompt:
Response:
I honestly don't know why it decided to draw an alien. But it's cool!
python scripts/image_only_generation.py --inference-mode multi-image-to-image --prompt "Draw what is common between these images" --image-1-path "https://nineplanets.org/wp-content/uploads/2020/12/the-big-dipper-1.jpg" --image-2-path "https://www.kxan.com/wp-content/uploads/sites/40/2020/10/ORION.jpg" --max-new-tokens 1026
Prompt:
Response:
python scripts/interleaved_generation.py --inference-mode text-to-interleaved-text-image --max-new-tokens 2055 --prompt "draw a snowman and explain it"
Prompt:
draw a snowman and explain it
Response:
Here is a picture of the snowman
![
It's a snowman! There's a white snowman with black beady eyes, a black, toothy grin, a carrot nose, black buttons for buttons. It's wearing a grey hat with a white pom-pom. It has two skinny arms and two arms with black gloves.](<image>)
The snowman is a classic winter icon, with its round body made of snow and its hat, coat, scarf, and button eyes adding a touch of warmth and personality to the cold, icy landscape. The buttons for eyes and a carrot nose further contribute to the snowman's comical and endearing appearance.The addition of two skinny arms and two arms with black gloves adds a playful and practical element — gloves are often recommended for outdoor activities in colder weather, making the snowman's design not only charming but also practical.Thanks for sharing your creative vision with us! It's always enjoyable to explore and appreciate the little details that make snowmen, and in this case, a playful and practical one at that.
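The interleaved response marks each generated image with an "<image>" placeholder, so rendering it comes down to substituting file paths in order. This is a rough sketch only: the function name and the file name are made up for illustration, and it assumes one saved file per placeholder, in generation order.

```python
import re

PLACEHOLDER = "<image>"  # placeholder string seen in the response above


def fill_image_placeholders(text, image_paths):
    """Replace each "<image>" in `text` with the next path from `image_paths`."""
    expected = text.count(PLACEHOLDER)
    if expected != len(image_paths):
        raise ValueError(
            f"found {expected} placeholders but got {len(image_paths)} paths"
        )
    paths = iter(image_paths)
    # A function replacement is used literally, so paths containing
    # backslashes are not treated as regex escapes.
    return re.sub(re.escape(PLACEHOLDER), lambda _match: next(paths), text)


response = (
    "Here is a picture of the snowman\n"
    "![It's a snowman!](<image>)\n"
    "The snowman is a classic winter icon."
)

# "snowman.png" is a hypothetical file name (assumption).
print(fill_image_placeholders(response, ["snowman.png"]))
```

Because the model already wraps the placeholder in Markdown image syntax here, the substituted text can be rendered as Markdown directly.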
python scripts/structured_generation.py --prompt "Please generate a fruit along with a picture of it and related fruits."
Prompt:
Please generate a fruit along with a picture of it and related fruits. Please follow the following schema: <json_schema>
JSON schema:
{
    "name": "Fruit Generator",
    "description": "A tool that generates details about a fruit with text and images in one go!",
    "type": "object",
    "properties": {
        "fruit_name": {
            "type": "string",
            # "minLength": 1,
            # "maxLength": 20,
            "pattern": "[a-zA-Z0-9]{1,20}",
        },
        "fruit_image" : {
            "type": "image",
            # "maxLength": 10,
        },
        "images_of_related_fruits" : {
            "type": "array",
            "items": {
                "type": "image",
                # "minLength": 1,
            },
            "minItems": 3,
            "maxItems": 3,
        }
    },
    "required": ["fruit_name", "fruit_image", "images_of_related_fruits"],
}
Response:
{
  "fruit_name": "Orange",
  "fruit_image": "<image>",
  "images_of_related_fruits": ["<image>", "<image>", "<image>"]
}
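To sanity-check a response against the schema on the consumer side, something like the sketch below may help. It is not how the library enforces the schema during generation. It is a small hand-rolled checker, written because "image" is not a standard JSON Schema type, and it makes two assumptions: an image value is the literal "<image>" placeholder, and "pattern" must match the whole string (standard JSON Schema only requires a partial match).

```python
import re

IMAGE_PLACEHOLDER = "<image>"  # assumption: images appear as this literal string


def validate(value, schema, path="$"):
    """Return a list of problems found in `value`; an empty list means valid."""
    errors = []
    kind = schema.get("type")
    if kind == "object":
        if not isinstance(value, dict):
            return [f"{path}: expected object"]
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}.{key}: missing required property")
        for key, sub in schema.get("properties", {}).items():
            if key in value:
                errors += validate(value[key], sub, f"{path}.{key}")
    elif kind == "array":
        if not isinstance(value, list):
            return [f"{path}: expected array"]
        if len(value) < schema.get("minItems", 0):
            errors.append(f"{path}: fewer than minItems")
        if "maxItems" in schema and len(value) > schema["maxItems"]:
            errors.append(f"{path}: more than maxItems")
        for i, item in enumerate(value):
            errors += validate(item, schema.get("items", {}), f"{path}[{i}]")
    elif kind == "string":
        if not isinstance(value, str):
            return [f"{path}: expected string"]
        pattern = schema.get("pattern")
        # Assumption: the pattern must cover the whole string.
        if pattern and not re.fullmatch(pattern, value):
            errors.append(f"{path}: does not match pattern {pattern!r}")
    elif kind == "image":
        if value != IMAGE_PLACEHOLDER:
            errors.append(f"{path}: expected image placeholder")
    return errors


schema = {
    "type": "object",
    "properties": {
        "fruit_name": {"type": "string", "pattern": "[a-zA-Z0-9]{1,20}"},
        "fruit_image": {"type": "image"},
        "images_of_related_fruits": {
            "type": "array",
            "items": {"type": "image"},
            "minItems": 3,
            "maxItems": 3,
        },
    },
    "required": ["fruit_name", "fruit_image", "images_of_related_fruits"],
}

response = {
    "fruit_name": "Orange",
    "fruit_image": "<image>",
    "images_of_related_fruits": ["<image>", "<image>", "<image>"],
}
print(validate(response, schema))  # prints []
```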
TODO
We can run the scripts on Modal's GPUs by replacing python scripts/structured_generation.py above with modal run scripts/modal_inference.py. That's it!
If you haven't installed Modal yet, you can do so by running:
pip install modal
Then set up your modal.com account and run:
modal setup
Then run the scripts!
@misc{cesista2024mmsg,
author = {Franz Cesista},
title = {Multimodal Structured Generation},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/leloykun/mmsg}},
}
The models used in this work are based on Meta's Chameleon and GAIR's Anole models. This was also made a lot easier by Outlines. Please cite their work too!
Also big thanks to @zucchini-nlp and @ArthurZucker for feedback while integrating the models into Transformers!