# Super Rapid Annotator - Proof of Concept

This Jupyter notebook is a proof of concept for the Super Rapid Annotator, a system that uses a multimodal vision model to process images and produce structured JSON annotations. It walks through the workflow on a single image: input handling, model inference with BakLlava, annotation structuring with JsonFormer, and output presentation. An interactive Gradio interface at the end allows real-time image annotation.

## Objective

The primary goal of this notebook is to demonstrate a proposed workflow for developing the Super Rapid Annotator system. The system is conceptualized in two main stages: image-to-text generation using a multimodal vision model, followed by text-to-JSON structuring.

1. **Image-to-Text Generation:** BakLlava interprets the visual content of the image and generates descriptive annotations.
2. **Text-to-JSON Generation:** JsonFormer, supported by additional helper functions, maps the generated text to the appropriate JSON values.

## Notebook Overview

1. **Library Installation:** Installation of all necessary Python libraries.
2. **Model Initialization:** Initialization of the BakLlava model for image-to-text generation and preparation for JSON structuring.
3. **Helper Functions:** Definition of helper functions to assist in processing and structuring data.
4. **Gradio Interface:** Introduction of a Gradio interface to enable users to interact with the model, upload images, and receive annotations in real-time.

## Workflow Overview

The notebook employs a two-step process: first, it generates textual annotations from the image using the BakLlava model. Then, it structures those descriptions into a JSON format using JsonFormer and additional parsing functions.

1. **Input:** Upload or insert an image.
2. **Generate Annotations:** The BakLlava model interprets the image and generates textual annotations.
3. **Parse Model Outputs and Generate JSON:** JsonFormer, along with helper functions, structures the model's output into the predefined JSON schema.
4. **Output:** The final output is presented as a structured JSON string, akin to a chatbot's response.
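
The four steps above can be sketched as a single function. This is a minimal illustration rather than the notebook's implementation: `annotate`, `fake_describe`, and `fake_structure` are hypothetical names, and the two stubs stand in for BakLlava and JsonFormer so that the flow can run without loading either model.

```python
import json

def annotate(image, questions, describe, structure) -> str:
    """Run the two-stage flow: image -> text answers -> JSON string."""
    # Stage 1: ask the vision model one question per schema entry.
    answers = {q["value"]: describe(image, q["description"]) for q in questions}
    # Stage 2: map the free-text answers onto typed JSON values.
    return json.dumps(structure(answers), indent=2)

# Stubs standing in for the real models (assumptions for illustration only).
def fake_describe(image, question):
    return "Yes, the person is standing."

def fake_structure(answers):
    return {key: text.startswith("Yes") for key, text in answers.items()}

questions = [{"description": "Is the person in the image standing?", "value": "standup"}]
result = annotate(None, questions, fake_describe, fake_structure)
print(result)
```

Passing the two stages in as callables keeps the sketch independent of any particular model; in the notebook itself both stages are wired together inside `bot_inference`.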


## Library Installation

In [1]:
# Install libraries
!pip install -q transformers gradio
!pip install -q -i https://pypi.org/simple/ bitsandbytes
!pip install -q accelerate
!pip install -q jsonformer


## Model Initialization

In this section, we initialize the two models used in the annotation process. Both are loaded through the `transformers` library from Hugging Face:

1. **BakLlava Model:** `llava-hf/bakLlava-v1-hf`, loaded with 4-bit quantization, handles the image-to-text generation phase.
2. **Dolly Model:** `databricks/dolly-v2-7b` is the language model that JsonFormer drives to produce the JSON output.
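
BakLlava is loaded with `load_in_4bit=True` in the cell below. As a rough illustration of why, the following sketch estimates the memory needed for the weights alone at two precisions. The figure of 7 billion parameters is a round-number assumption for illustration, and `weight_memory_gb` is a hypothetical helper; the estimate ignores activations, the vision encoder, and quantization overhead.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

# Assumption: roughly 7 billion parameters (illustrative round number).
n_params = 7e9
fp16_gb = weight_memory_gb(n_params, 16)
int4_gb = weight_memory_gb(n_params, 4)
print(f"float16: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
# prints "float16: 14.0 GB, 4-bit: 3.5 GB"
```

Under these assumptions, 4-bit weights take about a quarter of the float16 footprint, which is what makes the model practical on a single Colab GPU.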


In [2]:
from transformers import BitsAndBytesConfig, pipeline
from PIL import Image
import torch

model_id = "llava-hf/bakLlava-v1-hf"

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)
pipe = pipeline("image-to-text", model=model_id, model_kwargs={"quantization_config": quantization_config})


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


`low_cpu_mem_usage` was None, now set to True since model is quantized.

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

In [3]:
from jsonformer import Jsonformer
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load the Dolly model and tokenizer that Jsonformer will use for generation
jsonformer_model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-7b")
jsonformer_tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-7b")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## Helper Functions


In [4]:
# Define the RedHenLab example data
example_image_url = "https://images.unsplash.com/photo-1494959764136-6be9eb3c261e?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
example_json = [
    {
        "description": "Is the person in the image standing?",
        "value": "standup"
    },
    {
        "description": "Can you see the hands of the person?",
        "value": "hands"

    },
    {
        "description": "Is it inside or outside?",
        "value": "inside"
    }
]

In [5]:
# Generate expected output Json
def generate_json_schema_from_descriptions(json_schema):
    properties = {}
    for item in json_schema:
        description = item.get("description", "")
        value = item.get("value", "")
        if description.startswith(("Is", "Can")):
            properties[value] = {"type": "boolean"}
        else:
            # TODO: Generate appropriate JSON type value from description
            properties[value] = {"type": "string"}
    return {
        "type": "object",
        "properties": properties
    }

# Helper function to create correct JSON output values based on the prompt responses
def adjust_values_based_on_prompt(prompt, schema, output):
    """Overwrite Jsonformer's values with answers parsed from "question: answer" lines.

    A schema key is matched to a line only if the key appears literally in the
    question text; all other keys keep the value Jsonformer generated.
    """
    for line in prompt.split('\n'):
        parts = line.split(':', 1)
        if len(parts) != 2:
            continue  # Skip lines that don't have the expected format

        user_prompt, bot_output = parts[0].strip(), parts[1].strip()

        for key, value in schema['properties'].items():
            if key not in user_prompt:
                continue
            # Check the expected type from the schema and set the value accordingly
            if value.get('type') == 'string':
                output[key] = bot_output  # Directly set the string output
            elif value.get('type') == 'boolean':
                if 'Yes' in bot_output:
                    output[key] = True
                elif 'No' in bot_output:
                    output[key] = False
                elif key in bot_output:
                    output[key] = True  # Default to True if the key is in bot_output

    return output
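
To make the type inference concrete, the sketch below applies the same prefix rule as `generate_json_schema_from_descriptions` to the example data. The helper is repeated in condensed form under the hypothetical name `infer_schema` so that the snippet runs on its own, and the fourth entry is an invented question added to show the string fallback. Note that the either/or question "Is it inside or outside?" also starts with "Is", so it is typed as a boolean.

```python
import json

def infer_schema(items):
    """Condensed copy of the prefix rule: 'Is...'/'Can...' questions become booleans."""
    properties = {}
    for item in items:
        is_yes_no = item.get("description", "").startswith(("Is", "Can"))
        properties[item.get("value", "")] = {"type": "boolean" if is_yes_no else "string"}
    return {"type": "object", "properties": properties}

example_items = [
    {"description": "Is the person in the image standing?", "value": "standup"},
    {"description": "Can you see the hands of the person?", "value": "hands"},
    {"description": "Is it inside or outside?", "value": "inside"},
    # Invented entry (not in the notebook) to show the string fallback.
    {"description": "What is the person wearing?", "value": "clothing"},
]

schema = infer_schema(example_items)
print(json.dumps(schema, indent=2))
```

Here `standup`, `hands`, and `inside` are typed as booleans and `clothing` as a string, which suggests that either/or questions may need a more specific rule than the sentence prefix.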


In [6]:
import gradio as gr
import os
import json

# Gradio Chatbot helper functions
def add_text(history, text):
    return history + [(text, None)]

def bot_inference(image, json_schema_str):
    try:
        # Parse the JSON schema string into a Python object
        json_schema = json.loads(json_schema_str)

        # Ensure json_schema is a list
        if not isinstance(json_schema, list):
            return "JSON schema must be a list of objects."

        json_schema_for_jsonformer = generate_json_schema_from_descriptions(json_schema)

        # Build the prompt that Jsonformer will receive; each answered
        # question is appended below as a "description: answer" line
        prompt_input = "Generate information based on the following schema, the values should be extracted from the data;\n"
        for item in json_schema:
            if not isinstance(item, dict):
                return "Each item in the JSON schema list must be an object."

            description = item.get("description", "")
            if description and image:
                # Generate the response from the image
                prompt = f"USER: Annotate this image <image>\n{description}\n"
                outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 50})
                chat_response = outputs[0]["generated_text"].split("\n")[-1].strip()

                # Add the response to the prompt input in a way Jsonformer expects
                prompt_input += f"{description}: {chat_response}.\n"

        # Use the structured prompt input to generate JSON
        jsonformer = Jsonformer(jsonformer_model, jsonformer_tokenizer, json_schema_for_jsonformer, prompt_input)
        generated_data = jsonformer()
        #generated_data = {item['value']: None for item in json_schema}

        adjusted_output = adjust_values_based_on_prompt(prompt_input, json_schema_for_jsonformer, generated_data)

        return adjusted_output


    except json.JSONDecodeError:
        return "Invalid JSON schema."


def bot(history, image, text):
    response_json = bot_inference(image, text)

    response_str = json.dumps(response_json, indent=2) if isinstance(response_json, dict) else str(response_json)

    # The history may be None after the chatbot has been cleared, so fall back to an empty list
    history = history or []
    # Append the user's message and bot's response to the history
    history.append((text, response_str))
    # Return the updated history and reset the textbox
    return history, ""
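
`bot_inference` assembles one "description: answer" line per question, and those lines are later parsed back into values. The sketch below illustrates that round trip with a condensed, boolean-only version of the key matching; `parse_answers` is a hypothetical name and the two answers are invented for illustration, not real model output.

```python
def parse_answers(prompt_text, schema_keys):
    """Map 'question: answer' lines to booleans when the key appears in the question."""
    result = {}
    for line in prompt_text.split("\n"):
        parts = line.split(":", 1)
        if len(parts) != 2:
            continue  # the header line and blank lines have no 'question: answer' shape
        question, answer = parts[0].strip(), parts[1].strip()
        for key in schema_keys:
            if key in question:
                if "Yes" in answer:
                    result[key] = True
                elif "No" in answer:
                    result[key] = False
    return result

# Invented answers in the same layout that bot_inference produces.
prompt_text = (
    "Generate information based on the following schema, the values should be extracted from the data;\n"
    "Is the person in the image standing?: Yes, the person is standing.\n"
    "Can you see the hands of the person?: No, the hands are hidden.\n"
)
parsed = parse_answers(prompt_text, ["standup", "hands"])
print(parsed)  # prints {'hands': False}
```

Only `hands` is recovered, because the key `standup` never appears literally in its own question. With this matching rule, such keys keep whatever value Jsonformer generated.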


## Gradio Interface

Run the cell below to launch a Gradio app and try out the Super Rapid Annotator PoC.

In [12]:
# Create a Gradio interface
with gr.Blocks() as demo:
    with gr.Row():
        with gr.Column(scale=2):
            imagebox = gr.Image(type="pil", label="Upload an Image")
            json_schema_input = gr.Textbox(label="Enter JSON Schema",
                                           placeholder = "[\n    {\n        \"description\": \"Is it inside or outside?\",\n        \"value\": \"inside\"\n    }\n]",
                                           lines=10,
                                           scale=4)
            submit_btn = gr.Button(value="Send", variant="primary")
            example_btn = gr.Button("Load RedHenLab Example")

        with gr.Column(scale=8):
            chatbot = gr.Chatbot([], elem_id="chatbot", label="BakLLaVA Chatbot", height=650, layout="panel")

    # Handle text and image submission together and reset the textbox
    submit_btn.click(
        bot,
        inputs=[chatbot, imagebox, json_schema_input],
        outputs=[chatbot, json_schema_input]
    )

    # Define what happens when the example button is clicked
    def load_example():
        return None, example_image_url, json.dumps(example_json, indent=2)

    example_btn.click(
        fn=load_example,
        inputs=[],
        outputs=[chatbot, imagebox, json_schema_input]
    )

demo.launch(share=True, debug=True)


Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://3c63d3b31f3a798122.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://3c63d3b31f3a798122.gradio.live


