# Visual LLMs with Moondream and Daft

In this tutorial, we use the Moondream 2 model (`vikhyatk/moondream2`) from HuggingFace to ask questions about images. We use a small subset of the popular ImageNet dataset (`timm/mini-imagenet`) and show how you can get up and running with Daft and visual LLMs in a few minutes!


First, let's install dependencies:

In [1]:
!pip install daft transformers huggingface_hub pillow torch accelerate datasets



In [2]:
%env DAFT_PROGRESS_BAR=0

env: DAFT_PROGRESS_BAR=0


Now let's import everything that we'll need in this notebook:

In [3]:
import os

import torch
from PIL import Image
from transformers import AutoModelForCausalLM

import daft
from daft import col
from daft.io import HuggingFaceConfig, IOConfig



This particular dataset requires you to sign in with your HuggingFace credentials. You should put these into an environment variable (`HF_TOKEN`) when you run this notebook.

Alternatively, you can paste your token directly into the cell below, but this is risky. If you do, remove the token and clear the cell outputs before sharing the notebook. Otherwise, you'll leak your key!

In [4]:
if "HF_TOKEN" in os.environ:
    token = os.environ["HF_TOKEN"]
else:
    raise ValueError("Need HF_TOKEN as environment variable! Or supply directly here!")
    # token=...
io_config = IOConfig(hf=HuggingFaceConfig(token=token))
del token
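
As a minimal sketch of the same idea, the lookup can be wrapped in a small helper that reads `HF_TOKEN` and optionally falls back to an interactive prompt, so the token never has to be typed into a cell. The helper name `resolve_hf_token` and its `prompt` parameter are hypothetical and not part of Daft or HuggingFace.

```python
import os
from typing import Callable, Mapping, Optional


def resolve_hf_token(
    env: Mapping[str, str] = os.environ,
    prompt: Optional[Callable[[], str]] = None,
) -> str:
    """Return a HuggingFace token from `env`, falling back to `prompt` if given.

    `resolve_hf_token` is a hypothetical helper written for this tutorial.
    """
    token = env.get("HF_TOKEN", "").strip()
    if token:
        return token
    if prompt is not None:
        # For example, prompt=lambda: getpass.getpass("HF token: ") asks for the
        # token at run time without echoing it or storing it in the notebook.
        token = prompt().strip()
        if token:
            return token
    raise ValueError("Set HF_TOKEN in the environment or supply a prompt callable.")
```

The returned string can then be passed to `HuggingFaceConfig(token=...)` exactly as in the cell above.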

We now load the Moondream v2 model from HuggingFace.

In [5]:
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
print(f"Using device: {device}")

model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-06-21",
    trust_remote_code=True,
    device_map={"": device},  # place the whole model on the device selected above
)

Using device: mps


We can perform inference with this model in Daft by defining a UDF (user-defined function). Here, we'll use the new scalar UDFs (`daft.func`), which operate on one row at a time.

We make two UDFs:
- one to generate a caption for the image
- one to predict what is in the image

Our second UDF does zero-shot image classification :) Let's see how it turns out!

In [6]:
@daft.func(return_dtype=daft.DataType.string())
def moonbeam_caption(image) -> str:
    return model.caption(Image.fromarray(image), length="short")["caption"]


@daft.func(return_dtype=daft.DataType.string())
def moonbeam_predict_imagenet_class(image) -> str:
    answer = model.query(
        Image.fromarray(image),
        "What main object is in the image? Be concise and limit answer to a word or a very short phrase.",
    )["answer"]
    return answer.strip().lower()
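
The second UDF above only strips whitespace and lowercases the answer. Free-form model answers can also carry trailing punctuation or a leading article, so a slightly stronger normalization step may help if you later compare predictions against class names. The following is a minimal sketch in plain Python. `normalize_answer` is a hypothetical helper, not part of Moondream or Daft.

```python
import string

# Leading articles to drop, each followed by a space so that words such as
# "antelope" are left untouched.
_ARTICLES = ("a ", "an ", "the ")


def normalize_answer(answer: str) -> str:
    """Lowercase, trim punctuation, drop one leading article, collapse spaces."""
    text = answer.strip().lower()
    # Remove punctuation and whitespace from both ends only.
    text = text.strip(string.punctuation + string.whitespace)
    for article in _ARTICLES:
        if text.startswith(article):
            text = text[len(article):]
            break
    # Collapse any runs of internal whitespace into single spaces.
    return " ".join(text.split())
```

A function like this could replace the `answer.strip().lower()` call inside the UDF. Whether the extra cleanup is worthwhile depends on how the answers will be used downstream.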

Let's get some data! We'll load a small part of the ImageNet dataset as a Daft DataFrame.

In [7]:
# Sometimes we get rate-limited and this fails :( We can use a specific partition instead when this happens.
df = daft.read_huggingface("timm/mini-imagenet", io_config=io_config)
# df = daft.read_parquet("https://huggingface.co/api/datasets/timm/mini-imagenet/parquet/default/train/2.parquet")
df.show()

| image Struct[bytes: Binary, path: Utf8] | label Int64 |
| --- | --- |
| `{bytes: b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"..., path: n01532829_10032.JPEG, }` | 0 |
| `{bytes: b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"..., path: n01532829_10127.JPEG, }` | 0 |
| `{bytes: b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"..., path: n01532829_1021.JPEG, }` | 0 |
| `{bytes: b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"..., path: n01532829_1106.JPEG, }` | 0 |
| `{bytes: b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"..., path: n01532829_1114.JPEG, }` | 0 |
| `{bytes: b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"..., path: n01532829_11202.JPEG, }` | 0 |
| `{bytes: b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"..., path: n01532829_114.JPEG, }` | 0 |
| `{bytes: b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"..., path: n01532829_1179.JPEG, }` | 0 |


Next, we extract the image bytes and the path (the image file name) from the struct column, decode the bytes into an image, and discard the raw byte array. After this step, we have something that Moondream can use.


In [8]:
df = (
    df.into_batches(32)
    .with_column("path", col("image").struct.get("path"))
    .with_column("bytes", col("image").struct.get("bytes"))
    .with_column("image", col("bytes").image.decode())
)
df = df[["path", "image", "label"]]
df_data_only = df
df.show()

| path Utf8 | image Image[MIXED] | label Int64 |
| --- | --- | --- |
| n01532829_10032.JPEG | | 0 |
| n01532829_10127.JPEG | | 0 |
| n01532829_1021.JPEG | | 0 |
| n01532829_1106.JPEG | | 0 |
| n01532829_1114.JPEG | | 0 |
| n01532829_11202.JPEG | | 0 |
| n01532829_114.JPEG | | 0 |
| n01532829_1179.JPEG | | 0 |


Now let's run inference on this data! We'll apply our Moondream UDFs to the "image" column.

In [9]:
def moonbeam_inference(df: daft.DataFrame) -> daft.DataFrame:
    return df.with_column("caption", moonbeam_caption(col("image"))).with_column(
        "predict_label", moonbeam_predict_imagenet_class(col("image"))
    )

In [10]:
moonbeam_inference(df.limit(10)).show()

| path Utf8 | image Image[MIXED] | label Int64 | caption Utf8 | predict_label Utf8 |
| --- | --- | --- | --- | --- |
| ILSVRC2012_val_00000873.JPEG | | 0 | A green bird feeder with an open seed box attracts at least six birds during winter, with a red bird perched on the side and a bird in mid-flight nearby. | bird feeder |
| ILSVRC2012_val_00001556.JPEG | | 0 | A small bird with a vibrant red head and tail perches on a black wire, facing away from the viewer with wings slightly spread. | bird |
| ILSVRC2012_val_00002357.JPEG | | 0 | A red-headed bird with a white belly pecks at a pile of mixed nuts on a concrete surface, drawing attention with its vibrant feathers. | bird |
| ILSVRC2012_val_00004375.JPEG | | 0 | A small red-headed bird with a gray body and brown wings perches on a light brown branch, facing right. | bird |
| ILSVRC2012_val_00004747.JPEG | | 0 | A small light brown bird perches on a black wire, facing the camera with wings slightly spread against a clear blue sky. | bird |
| ILSVRC2012_val_00004749.JPEG | | 0 | A small brown bird with a red head perches on a jagged gray rock, facing right with a curious expression. | bird |
| ILSVRC2012_val_00005336.JPEG | | 0 | A house sparrow, with a brown body, red breast, and white wing stripes, perches on a branch amidst orange and yellow autumn leaves. | bird |
| ILSVRC2012_val_00005848.JPEG | | 0 | A red bird perches on a yellow bird feeder, filled with black seeds, against a gray background. | bird feeder |


Wonderful! The captions make sense, and the zero-shot classifier is doing a pretty good job! Now you can go off and run on the whole dataset or write new queries for Moondream!
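
To put a number on "pretty good", one option is to compare each predicted label with a list of accepted names for the true class. The sketch below does this in plain Python for the eight predictions shown in the output above. The accepted names for label 0 are an assumption made for illustration. Check the dataset's label metadata for the real class names. The helpers `matches` and `accuracy` are hypothetical.

```python
from typing import Iterable, Sequence


def matches(predicted: str, accepted: Sequence[str]) -> bool:
    """True if any accepted name occurs as a whole-word phrase in the prediction."""
    padded = f" {predicted.strip().lower()} "
    return any(f" {name.lower()} " in padded for name in accepted)


def accuracy(predictions: Iterable[str], accepted: Sequence[str]) -> float:
    """Fraction of predictions that match at least one accepted name."""
    preds = list(predictions)
    if not preds:
        return 0.0
    return sum(matches(p, accepted) for p in preds) / len(preds)


# The eight predict_label values from the output above (all rows have label 0).
predictions = [
    "bird feeder", "bird", "bird", "bird",
    "bird", "bird", "bird", "bird feeder",
]

# Assumed names for label 0, for illustration only.
strict_names = ["house finch"]
coarse_names = ["house finch", "bird"]

strict = accuracy(predictions, strict_names)
coarse = accuracy(predictions, coarse_names)
print(f"strict: {strict:.2f}, coarse: {coarse:.2f}")
```

Under these assumptions, the score depends heavily on how forgiving the match is: a coarse name such as "bird" accepts every row, including "bird feeder", while a fine-grained species name accepts none. A more specific prompt, or a prompt that lists the candidate classes, is one way to explore closing that gap.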