# Caption Images with Large MultiModal Models

This notebook lets you caption your images using different Large MultiModal Models.

Once you have installed the packages listed in the requirements file, run all the cells to open a user interface where you can upload your own images and test the models.

In [1]:
!nvidia-smi

Wed Mar 20 17:08:39 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000000:00:04.0 Off |                    0 |
| N/A   59C    P8              10W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
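
As a rough sanity check against the 15360MiB reported above, the footprint of a model's weights can be estimated as parameter count times bytes per parameter. The sketch below rests on assumptions: the helper name `weight_footprint_mib` is hypothetical, the figure of about 1.3 billion parameters is read off the `deepseek-vl-1.3b-chat` model name used later in this notebook, and activations, the KV cache, and any framework overhead are ignored.

```python
def weight_footprint_mib(n_params: float, bytes_per_param: int) -> float:
    """Estimate the memory taken by model weights alone, in MiB."""
    return n_params * bytes_per_param / (1024 ** 2)


T4_MEMORY_MIB = 15360  # total memory reported by nvidia-smi above

# Assumption: about 1.3e9 parameters, stored as bfloat16 (2 bytes each)
bf16_mib = weight_footprint_mib(1.3e9, 2)
# Same parameter count stored as float32 (4 bytes each), for comparison
fp32_mib = weight_footprint_mib(1.3e9, 4)

print(f"bfloat16: {bf16_mib:.0f} MiB, float32: {fp32_mib:.0f} MiB")
print("fits on the T4 (weights only):", fp32_mib < T4_MEMORY_MIB)
```

This only bounds the weights; actual usage during generation will be higher.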
                                                                    

In [2]:
!conda info --envs

# conda environments:
#
base                     /opt/conda
deepseek              *  /opt/conda/envs/deepseek
jupyterlab               /opt/conda/envs/jupyterlab
moondream                /opt/conda/envs/moondream
pytorch                  /opt/conda/envs/pytorch
tensorflow               /opt/conda/envs/tensorflow



In [3]:
import os
import time
import pathlib
import gradio as gr
from tqdm import tqdm
import torch
import spacy

from transformers import AutoModelForCausalLM, CodeGenTokenizerFast
from PIL import Image

from deepseek_vl.models import VLChatProcessor, MultiModalityCausalLM
from deepseek_vl.utils.io import load_pil_images

torch.manual_seed(0)

<torch._C.Generator at 0x7f7236101610>

In [4]:
def get_boorus_spacy(description, nlp):
    """Use traditional NLP to convert caption to tags."""
    doc = nlp(description)
    notions_detailed = []
    notions_base = []

    for chunk in doc.noun_chunks:
        # Detailed notions, keep adjectives
        detailed_phrase_tokens = [token.text for token in chunk if
                                  not token.is_stop or token.pos_ in ['ADJ', 'NOUN', 'PROPN']]
        detailed_phrase = ' '.join(detailed_phrase_tokens)
        if detailed_phrase:  
            notions_detailed.append(detailed_phrase)
        # Base notions
        base_phrase_tokens = [token.text for token in chunk if token.pos_ in ['NOUN', 'PROPN'] and not token.is_stop]
        base_phrase = ' '.join(base_phrase_tokens)
        if base_phrase: 
            notions_base.append(base_phrase)

    notions_detailed_str = ', '.join(set(notions_detailed))
    notions_base_str = ', '.join(set(notions_base))
    # print("base:", notions_base_str)
    # print("detailed:", notions_detailed_str)
    return notions_detailed_str, notions_base_str
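
`get_boorus_spacy` deduplicates tags with `set`, which does not keep the order in which the noun chunks appeared in the caption. If a stable, caption-ordered tag list is preferred, one possible alternative is to deduplicate with `dict.fromkeys`. This is a sketch rather than the notebook's method, and `join_unique` is a hypothetical helper.

```python
def join_unique(phrases, sep=", "):
    """Join phrases, dropping repeats while keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return sep.join(dict.fromkeys(phrases))


# Hypothetical noun-chunk phrases, standing in for spaCy output
phrases = ["woman", "hat", "wall", "hat", "woman"]
print(join_unique(phrases))
```

With this helper, the two `', '.join(set(...))` calls could be swapped for `join_unique(...)` without changing the function's return type.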

In [5]:
def caption_image(image_path, model_name, prompt, output_type, nlp):
    """Caption one image with the chosen model and return the answer and timing info."""

    if "DeepSeek" == model_name:
        # Load the model
        ds_model_id = ("deepseek-ai/deepseek-vl-1.3b-chat")
        ds_vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(ds_model_id)
        ds_vl_tokenizer = ds_vl_chat_processor.tokenizer
        ds_vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(ds_model_id, trust_remote_code=True)
        ds_vl_gpt = ds_vl_gpt.to(torch.bfloat16).cuda().eval()
        print("Loaded DeepSeek weights and set inference mode.")
        # Embed input image
        conversation = [{"role": "User",
                         "content": "<image_placeholder>" + prompt,
                         "images": [image_path]},
                        {"role": "Assistant", "content": ""}]       
        pil_images = load_pil_images(conversation)
        prepare_inputs = ds_vl_chat_processor(conversations=conversation,
            images=pil_images, force_batchify=True).to(ds_vl_gpt.device)        
        inputs_embeds = ds_vl_gpt.prepare_inputs_embeds(**prepare_inputs)
        # Infer      
        start_time = time.time()
        outputs = ds_vl_gpt.language_model.generate(
            inputs_embeds=inputs_embeds,
            attention_mask=prepare_inputs.attention_mask,
            pad_token_id=ds_vl_tokenizer.eos_token_id,
            bos_token_id=ds_vl_tokenizer.bos_token_id,
            eos_token_id=ds_vl_tokenizer.eos_token_id,
            max_new_tokens=512,
            do_sample=False,
            use_cache=True
        )
        answer = ds_vl_tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
        duration = time.time() - start_time
        info = f"Inference time: {duration:.3f} seconds"
    elif "MoonDream" in model_name:
        # Load the model
        md_model_id = "vikhyatk/moondream2"
        md_revision = "2024-03-06"
        md_model: AutoModelForCausalLM = AutoModelForCausalLM.from_pretrained(
            md_model_id, trust_remote_code=True, revision=md_revision)
        # Pin the tokenizer to the same revision as the model weights
        md_tokenizer: CodeGenTokenizerFast = CodeGenTokenizerFast.from_pretrained(
            md_model_id, revision=md_revision)
        print("Loaded MoonDream weights.")
        # Embed input image
        image = Image.open(image_path)
        enc_image = md_model.encode_image(image)
        # Infer
        start_time = time.time()
        answer = md_model.answer_question(enc_image, prompt, md_tokenizer)
        duration = time.time() - start_time
        info = f"Inference time: {duration:.3f} seconds"
    else:
        # Fail early: answer and info are undefined for an unknown model
        raise ValueError(f"{model_name} is not an available LVM.")
    print(model_name, ":", answer)
    
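    # Get the tag version
    if "tag" == output_type:
        answer = answer.replace(" and", ",").replace(",,", ",")
        # Reuse the spaCy pipeline passed in by the caller instead of loading it again
        detailed, base = get_boorus_spacy(answer, nlp)
        # Neither current output type contains "base", so the detailed tags are returned
        answer = base if "base" in output_type else detailed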
        
    return answer, info
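
`caption_image` reloads the model weights on every call, so the "Loaded ... weights" message is printed for each request. One way to avoid this, sketched here under assumptions, is to cache the loaded objects per model name with `functools.lru_cache`. The `load_model` function and its placeholder return value are hypothetical stand-ins for the `from_pretrained` calls above.

```python
from functools import lru_cache

load_count = 0  # counts real loads, only to illustrate the caching


@lru_cache(maxsize=None)
def load_model(model_name: str):
    """Load a model once per name; later calls return the cached object."""
    global load_count
    load_count += 1
    # Placeholder: in the notebook this would be the from_pretrained calls
    return {"name": model_name}


first = load_model("MoonDream")
second = load_model("MoonDream")   # served from the cache, no reload
other = load_model("DeepSeek")     # different key, loaded separately
print(load_count, first is second)
```

Keeping both models resident trades GPU memory for lower latency, so whether this is worthwhile depends on the memory left on the card.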

In [6]:
model_names = ["DeepSeek", "MoonDream"]
output_types = ["tag", "caption"]


DEFAULT_PROMPT = "Analyze this image, highlighting the most significant components and their meanings."


def demo_caption_image(im_in, model_name="DeepSeek", output_type="tag", prompt=DEFAULT_PROMPT):
    """Wrap image captioning function to hide some arguments from the user."""
    # Gradio passes an empty string when the optional prompt box is left blank
    if not prompt:
        prompt = DEFAULT_PROMPT
    # Save the uploaded image (a NumPy array) to disk so the models can load it by path
    cache_dir = os.path.join(".", "temp")
    os.makedirs(cache_dir, exist_ok=True)
    im_in_path = os.path.join(cache_dir, time.strftime("%Y%m%d_%H%M%S") + ".png")
    print("im_in_path:", im_in_path)
    pil_im = Image.fromarray(im_in)
    pil_im.save(im_in_path)
    caption, info = caption_image(image_path=im_in_path, model_name=model_name,
                                  prompt=prompt, output_type=output_type,
                                  nlp=spacy.load("en_core_web_sm"))
    return caption, info
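
The file name built in `demo_caption_image` has one-second resolution, so two uploads within the same second would overwrite each other. A possible alternative, shown as a sketch with a hypothetical helper `unique_image_path`, appends a short random suffix from `uuid`. The example writes into a fresh temporary directory rather than `./temp` to stay self-contained.

```python
import os
import tempfile
import time
import uuid


def unique_image_path(cache_dir: str, ext: str = ".png") -> str:
    """Build a timestamped path that stays unique within the same second."""
    os.makedirs(cache_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%d_%H%M%S")
    # uuid4 adds a random component so same-second uploads do not collide
    return os.path.join(cache_dir, f"{stamp}_{uuid.uuid4().hex[:8]}{ext}")


demo_dir = tempfile.mkdtemp()  # stand-in for the notebook's ./temp folder
path_a = unique_image_path(demo_dir)
path_b = unique_image_path(demo_dir)
print(path_a)
print(path_a != path_b)
```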

In [7]:
demo = gr.Interface(
    fn=demo_caption_image,
    inputs=[
        gr.Image(label="Input Image"),        
        gr.Dropdown(model_names, label="Choose an LVM", value="DeepSeek"),
        gr.Dropdown(output_types, label="Choose output type", value="caption"),
        gr.Textbox(label="Captioning Prompt (optional)"),
    ],
    outputs=[
        gr.Textbox(label="Result"),
        gr.Textbox(label="Information"),
    ],
    allow_flagging="never",
    title="Image Captioning with Large Vision Models",
    description="Upload an image, pick an LVM and a captioning style."
    "\n\nThe generated captions and the runtime expressed in seconds will be shown."
    # "\n\nNote: If there is an error, your image might be too big for the GPU.",
)
demo.launch(share=True)  

Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://209c1719ed0cdd7189.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




im_in_path: ./temp/20240320_170946.png


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loaded MoonDream weights.
MoonDream : A woman is positioned in the center of the image, wearing a hat. The background features a wall, and the image appears to have been taken in a dimly lit environment.
im_in_path: ./temp/20240320_171305.png


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loaded DeepSeek weights and set inference mode.
DeepSeek : The image is a close-up portrait of a woman with a profile view. She has long, dark hair that cascades down her shoulders. The woman is wearing a large, wide-brimmed hat with a beige or light brown color and a feathery, purple-colored trim. The hat appears to be of a vintage style, possibly reminiscent of the 1920s or 1930s. The background is blurred, but it suggests an indoor setting with warm lighting, possibly a room with wooden beams or a similar architectural feature. The woman's expression is neutral, and she is looking directly at the camera. There are no visible texts or discernible brands in the image. The style of the image is reminiscent of fashion photography, focusing on the woman's attire and the composition of the background.
