# Visual LLMs with Moondream and Daft

In this tutorial, we use the Moondream v2 model (`vikhyatk/moondream2`) from HuggingFace to ask questions about images. We use a small subset of the popular ImageNet dataset and show how you can get up and running with Daft and visual LLMs in a few minutes!


First, let's install dependencies:

In [1]:
!pip install daft transformers huggingface_hub pillow torch accelerate datasets



Now let's import everything that we'll need in this notebook:

In [2]:
import os

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import daft
from daft import Series, col, udf
from daft.io import IOConfig, HuggingFaceConfig



This particular dataset requires you to sign in with your HuggingFace credentials. You should put these into an environment variable (`HF_TOKEN`) when you run this notebook.

A riskier alternative is to paste your token directly into the cell below. But be warned: if you do that, don't share this notebook, and clear its outputs when you're done. Otherwise, you'll leak your token!

In [4]:
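if "HF_TOKEN" in os.environ:
    token = os.environ["HF_TOKEN"]
else:
    # Riskier alternative: replace the `raise` below with `token = ...` (your own token).
    raise ValueError("Need HF_TOKEN as an environment variable, or supply the token directly here!")
io_config = IOConfig(hf=HuggingFaceConfig(token=token))
del token  # don't leave the raw token lying around in the notebook namespace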
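
If `HF_TOKEN` isn't set and you'd rather not paste the token into a cell, one middle ground is to prompt for it at runtime with Python's standard `getpass` module, which doesn't echo what you type. The sketch below shows one way to do that. `resolve_hf_token` and its `env` and `prompt` parameters are hypothetical names for illustration, not part of Daft or HuggingFace.

```python
import os
from getpass import getpass


def resolve_hf_token(env=os.environ, prompt=getpass):
    """Return the token from HF_TOKEN, or prompt for it without echoing the input."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        # getpass hides the typed value, so it never shows up in the cell output.
        token = prompt("HuggingFace token: ").strip()
    if not token:
        raise ValueError("No HuggingFace token provided.")
    return token
```

You could then write `token = resolve_hf_token()` and build the `IOConfig` exactly as in the cell above.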

We now load the Moondream v2 model from HuggingFace.

In [5]:
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'
print(f"Using device: {device}")

model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-06-21",
    trust_remote_code=True,
    device_map={"": device}  # place the whole model on the device selected above
)


Using device: mps


We can perform inference with this model in Daft by defining a UDF (user-defined function). Here, we'll use the new scalar UDFs (`daft.func`), which operate on a single row at a time.

We make two UDFs:
- one to generate a caption for the image
- one to predict what is in the image

Our second UDF is doing zero-shot image classification :) Let's see how it turns out!

In [6]:
@daft.func(return_dtype=daft.DataType.string())
def moonbeam_caption(image) -> str:
    return model.caption(Image.fromarray(image), length="short")['caption']


@daft.func(return_dtype=daft.DataType.string())
def moonbeam_predict_imagenet_class(image) -> str:
    prompt = "What main object is in the image? Be concise and limit answer to a word or a very short phrase."
    answer = model.query(Image.fromarray(image), prompt)['answer']
    return answer.strip().lower()
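
The prediction UDF above only strips whitespace and lowercases the answer. Free-form model answers can vary in other small ways, such as a trailing period or a leading article. If you see that in your own runs, a slightly stricter normalizer might help. The sketch below is one possibility. `normalize_answer` is a hypothetical helper, and the idea that Moondream emits such variants is an assumption (the outputs shown in this notebook are already clean).

```python
import re


def normalize_answer(answer: str) -> str:
    """Lowercase a free-form answer and strip punctuation, articles, and extra spaces."""
    text = answer.strip().lower()
    text = re.sub(r"[.!?,;:]+$", "", text)      # drop trailing punctuation
    text = re.sub(r"^(a|an|the)\s+", "", text)  # drop one leading article
    return re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace


print(normalize_answer("  A Dog. "))
```

To use it, you would return `normalize_answer(answer)` from the UDF instead of `answer.strip().lower()`.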


Let's get some data! We'll load a small part of the ImageNet dataset as a Daft DataFrame.

In [None]:
# Sometimes we get rate-limited and this fails :( We can use a specific partition instead when this happens.
# df = daft.read_huggingface("timm/mini-imagenet", io_config=io_config)
df = daft.read_parquet("https://huggingface.co/api/datasets/timm/mini-imagenet/parquet/default/train/2.parquet")
df.show()



"image Struct[bytes: Binary, path: Utf8]",label Int64
"{bytes: b""\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01""..., path: n02101006_3062.JPEG, }",15
"{bytes: b""\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01""..., path: n02101006_3076.JPEG, }",15
"{bytes: b""\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01""..., path: n02101006_3090.JPEG, }",15
"{bytes: b""\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01""..., path: n02101006_3099.JPEG, }",15
"{bytes: b""\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01""..., path: n02101006_311.JPEG, }",15
"{bytes: b""\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01""..., path: n02101006_3114.JPEG, }",15
"{bytes: b""\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01""..., path: n02101006_3166.JPEG, }",15
"{bytes: b""\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01""..., path: n02101006_3185.JPEG, }",15
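
Every `bytes` value shown above begins with `\xff\xd8\xff`, the marker that opens a JPEG file, so these are still-encoded images rather than pixel arrays. As a quick sanity check before decoding, you could sniff the format from the first few bytes. This is a minimal sketch in plain Python. `sniff_image_format` is a hypothetical helper and only knows about JPEG and PNG.

```python
def sniff_image_format(data: bytes) -> str:
    """Guess the image container from its leading magic bytes."""
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    return "unknown"


# The prefix printed in the table above.
sample_prefix = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"
print(sniff_image_format(sample_prefix))
```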


Extract and decode the image, then discard the byte array. Also pull out the path, which serves as the image name. Once we do this, we will have something that Moondream can use.


In [8]:
df = (
    df
    .with_column("path", col('image').struct.get('path'))
    .with_column('bytes', col('image').struct.get('bytes'))
    .with_column('image', col('bytes').image.decode())
)
df = df[['path', 'image', 'label']]
df_data_only = df
df.show()

path Utf8,image Image[MIXED],label Int64
n02101006_3062.JPEG,,15
n02101006_3076.JPEG,,15
n02101006_3090.JPEG,,15
n02101006_3099.JPEG,,15
n02101006_311.JPEG,,15
n02101006_3114.JPEG,,15
n02101006_3166.JPEG,,15
n02101006_3185.JPEG,,15
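
The paths above look like they follow the usual ImageNet naming scheme: a WordNet ID (such as `n02101006`), an underscore, and an image number. If you want the class ID as its own value, a small parser is enough. The sketch below assumes that naming scheme holds for every row. `parse_imagenet_path` is a hypothetical helper.

```python
import re

# "n" + 8 digits, underscore, image number, ".JPEG" extension.
_WNID_PATTERN = re.compile(r"^(n\d{8})_(\d+)\.JPEG$", re.IGNORECASE)


def parse_imagenet_path(path: str):
    """Return (wnid, image_number) for an ImageNet-style file name, else None."""
    match = _WNID_PATTERN.match(path)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


print(parse_imagenet_path("n02101006_3062.JPEG"))
```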


Now let's run inference on this data! We'll apply our Moondream-using UDFs to the "image" column.

In [9]:
def moonbeam_inference(df: daft.DataFrame) -> daft.DataFrame:
    return (
        df
        .with_column("caption", moonbeam_caption(col("image")))
        .with_column("predict_label", moonbeam_predict_imagenet_class(col("image")))
    )

In [10]:
moonbeam_inference(df.limit(10)).show()

path Utf8,image Image[MIXED],label Int64,caption Utf8,predict_label Utf8
n02101006_3062.JPEG,,15,"A black and tan dog sits on a gravel path, gazing at the camera with a red collar, surrounded by fallen leaves and a green bush.",dog
n02101006_3076.JPEG,,15,"A black and brown dog runs on a sandy beach, its tongue out, with rocks and plants scattered around.",dog
n02101006_3090.JPEG,,15,"A black dog stands on a black mat, facing left with a slight tilt, in a room with a yellow wall, white door, and black shelf.",dog
n02101006_3099.JPEG,,15,"A black and brown dog sits on a grassy lawn, its head tilted and mouth open, revealing its tongue, with a collar and tag visible.",dog
n02101006_311.JPEG,,15,"A black and brown dog stands alert in a field of yellow flowers, gazing off to the side with a curious and attentive demeanor.",dog
n02101006_3114.JPEG,,15,"A black and brown dog with long fur stands alert on a lush green lawn, facing the camera with a wooden fence in the background.",dog
n02101006_3166.JPEG,,15,"A black and brown dog lies on a beige carpet, eyes closed and tongue out, with a white tire in the background.",dog
n02101006_3185.JPEG,,15,"A black dog with a pink collar sleeps peacefully on a green couch, head resting on the armrest and legs stretched out.",dog


Wonderful! It looks like the captions make sense and the zero-shot classifier is doing a pretty good job! Let's take a small random sample and inspect the predictions on that:

In [11]:
df_sample = df_data_only.sample(fraction=0.25).limit(10)
df_sample = moonbeam_inference(df_sample)
df_sample.collect()




path Utf8,image Image[MIXED],label Int64,caption Utf8,predict_label Utf8
n02101006_3954.JPEG,,15,"A black and tan dog, wearing a purple collar, sits attentively in front of a wooden fence, with a green plant and pink flowers in the background.",dog
n02101006_3860.JPEG,,15,"A black and tan dog, wearing a red collar, bends down in a snowy field, eyes focused on an unseen object.",dog
n02101006_3803.JPEG,,15,"A black and tan dog with long fur gazes directly at the viewer, its head tilted slightly left, against a soft green and pale yellow background.",dog
n02101006_4803.JPEG,,15,"A black and brown dog rests its head on a carpeted floor, eyes open and ears folded back, with a wooden shelf and yellow wall in the background.",dog
n02101006_4589.JPEG,,15,"A black and tan dog runs with tongue out in a lush green field, its ears perked up and tail caught in the wind.",dog
n02101006_404.JPEG,,15,"A black dog with a red collar sits on a white sidewalk, gazing at a calm body of water with a red railing and a yellow pole.",dog
n02101006_311.JPEG,,15,"A black and brown dog stands alert in a field of yellow flowers, gazing off to the side with a curious and attentive demeanor.",dog
n02101006_4919.JPEG,,15,"A black dog with brown spots bends down to sniff the ground in a green, grassy area with a wooden fence and tree trunk in the background.",dog
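
Eyeballing works for a handful of rows, but a quick tally makes it easier to judge a larger sample. Note that `label` is an integer class index while `predict_label` is free text, so the two can't be compared directly. The sketch below compares predictions against a coarse expected string instead. `summarize_predictions` is a hypothetical helper, and using `"dog"` as the expected answer for this class is an assumption made for illustration.

```python
from collections import Counter


def summarize_predictions(predicted, expected):
    """Return (counts per predicted label, fraction equal to `expected`)."""
    counts = Counter(predicted)
    hits = sum(1 for label in predicted if label == expected)
    accuracy = hits / len(predicted) if predicted else 0.0
    return counts, accuracy


# The eight predictions shown in the table above.
counts, accuracy = summarize_predictions(["dog"] * 8, expected="dog")
print(counts, accuracy)
```

With the sample above, you could pull the predictions out of the collected DataFrame with `df_sample.to_pydict()["predict_label"]` and pass that list in.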


You can run inference on the entire dataset by skipping the sample and limit steps. The results would be hard to look at all at once because there are so many images, so instead let's write the full dataset and inference results to disk as Parquet:

In [None]:
df_full = moonbeam_inference(df.into_batches(32))
df_full.write_parquet("./small_imagenet_moonbeam_v2_inference")
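
The `into_batches(32)` call above regroups the rows into batches of 32 before inference runs. As a rough mental model only, and not a description of how Daft implements it, batching a stream of rows looks like this in plain Python. `batched` is a hypothetical helper written for this illustration.

```python
from itertools import islice


def batched(rows, batch_size):
    """Yield consecutive lists of at most `batch_size` items from `rows`."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 70 rows in batches of 32: the last batch is smaller than the others.
print([len(batch) for batch in batched(range(70), 32)])
```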


