URL download hangs and doesn't properly timeout for certain URLs #4518

@jaychia

Describe the bug

I'm running .url.download() on a dataset (a sample of LAION 400M), but certain specific URLs cause the download to hang without ever timing out.

Note that this dataset is fairly messy and has really crappy URLs, which could 404, return HTML instead of an image, or exhibit other undefined behavior.

To Reproduce

import daft
import os

from daft.io import IOConfig, HTTPConfig, S3Config

# Replace with your own huggingface token
io_config = IOConfig(http=HTTPConfig(bearer_token="xxx"))

df = daft.read_parquet("hf://datasets/laion/laion400m/**/*.parquet", io_config=io_config)

# This specific URL seems to be causing an issue
# df = df.where("url != 'http://img2.imagesbn.com/p/9781432717599_p0_v1_s260x420.jpg'")

# Trigger some computation for URL download
df = df.with_column("image", df["url"].url.download(on_error="null").image.decode(on_error="null"))
df = df.with_column("image_jpeg_bytes", df["image"].image.to_mode("rgb").image.encode("jpeg"))
df = df.select(df["image_jpeg_bytes"].bytes.length())

df.write_csv("laion-400M-sample")

Expected behavior

No response

Component(s)

Expressions

Additional context

I will dump a CSV to make this easier to reproduce.
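
To help triage which URLs hang, independently of Daft, here is a minimal stdlib-only sketch that labels each URL as ok, not_image, http_<code>, timeout, or error. The names classify_url and classify_many, the label strings, and the default timeouts are all hypothetical choices for illustration; this is not how Daft's downloader works. It assumes HTTP(S) URLs and combines a per-socket-operation timeout with a wall-clock deadline, since a socket timeout alone does not bound a server that trickles bytes slowly.

```python
import functools
import http.client
import socket
import time
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def classify_url(url, socket_timeout=5.0, deadline_s=15.0):
    """Fetch one URL and return a coarse label (hypothetical helper, not a Daft API)."""
    start = time.monotonic()
    try:
        # socket_timeout bounds each blocking socket operation (connect, recv).
        with urllib.request.urlopen(url, timeout=socket_timeout) as resp:
            content_type = resp.headers.get("Content-Type", "")
            while True:
                # The wall-clock deadline bounds the whole download.
                if time.monotonic() - start > deadline_s:
                    return "timeout"
                # read1 performs at most one underlying read per call,
                # so the deadline is re-checked between reads.
                if not resp.read1(65536):
                    break
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses, e.g. "http_404".
        return f"http_{exc.code}"
    except urllib.error.URLError as exc:
        # Connection-phase failures; a connect timeout is wrapped in URLError.
        return "timeout" if isinstance(exc.reason, socket.timeout) else "error"
    except socket.timeout:
        # Timeouts while waiting for or reading the response.
        return "timeout"
    except (OSError, ValueError, http.client.HTTPException):
        # Malformed URLs, connection resets, bad status lines, and similar.
        return "error"
    return "ok" if content_type.startswith("image/") else "not_image"


def classify_many(urls, max_workers=16, **kwargs):
    """Classify URLs concurrently and return a {url: label} dict."""
    urls = list(urls)
    probe = functools.partial(classify_url, **kwargs)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(urls, pool.map(probe, urls)))


# Example usage (performs real network requests):
# labels = classify_many(
#     ["http://img2.imagesbn.com/p/9781432717599_p0_v1_s260x420.jpg"]
# )
```

URLs that come back as timeout here are reasonable candidates for the ones that make .url.download() hang, though that correlation is an assumption and not something this sketch can confirm.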

Labels

bug (Something isn't working), p1 (Important to tackle soon, but preemptable by p0)