Closed
Labels
bug (Something isn't working), p1 (Important to tackle soon, but preemptable by p0)
Description
Describe the bug
I'm running .url.download() on a dataset (a sample of LAION 400M), but certain specific URLs seem to cause issues.
Note that this dataset is fairly messy and has really crappy URLs, which could return a 404, return HTML instead of an image, or exhibit other undefined behavior.
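To illustrate the kinds of payloads I mean, here is a rough, stdlib-only sketch that classifies raw response bytes by their leading magic bytes. `sniff_payload` is a hypothetical helper made up for this illustration. It is not part of Daft and is not how Daft decodes images; it only shows how an "image" URL can hand back something that isn't an image at all.

```python
def sniff_payload(data):
    """Roughly classify raw response bytes (hypothetical helper, not a Daft API)."""
    if not data:
        return "empty"
    # Common image formats, identified by their file signatures.
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "gif"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    # Error pages are often served as HTML, sometimes even with a 200 status.
    head = data[:256].lstrip().lower()
    if head.startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"


if __name__ == "__main__":
    print(sniff_payload(b"\xff\xd8\xff\xe0" + b"\x00" * 16))
    print(sniff_payload(b"<!DOCTYPE html><html><body>Not Found</body></html>"))
```

With on_error="null", I'd expect anything that isn't a decodable image to simply end up as a null in the "image" column.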
To Reproduce
import daft
from daft.io import IOConfig, HTTPConfig
# Replace with your own huggingface token
io_config = IOConfig(http=HTTPConfig(bearer_token="xxx"))
df = daft.read_parquet("hf://datasets/laion/laion400m/**/*.parquet", io_config=io_config)
# This specific URL seems to be causing an issue
# df = df.where("url != 'http://img2.imagesbn.com/p/9781432717599_p0_v1_s260x420.jpg'")
# Trigger some computation for URL download
df = df.with_column("image", df["url"].url.download(on_error="null").image.decode(on_error="null"))
df = df.with_column("image_jpeg_bytes", df["image"].image.to_mode("rgb").image.encode("jpeg"))
df = df.select(df["image_jpeg_bytes"].bytes.length())
df.write_csv("laion-400M-sample")
Expected behavior
No response
Component(s)
Expressions
Additional context
I will dump a CSV to make this easier to reproduce.