In [None]:
%pip install daft




Custom Modalities

Custom modalities let you define:

*   How data is loaded from or saved to storage
*   How it is processed or transformed within your pipeline
*   What it means, and how to act on it programmatically

Two ways

*   Custom Connector




Working with URLs and Files

Daft works with URLs, file paths, and remote resources. Whether the data lives in local files, cloud storage, or behind web URLs, you refer to it with a path or URL string and Daft handles the access.

Daft supports working with:

*   Local file paths: file:///path/to/file, /path/to/file
*   S3: s3://bucket/path, s3a://bucket/path, s3n://bucket/path
*   GCS: gs://bucket/path
*   Azure: az://container/path, abfs://container/path, abfss://container/path
*   HTTP/HTTPS URLs: http://example.com/path, https://example.com/path
*   Hugging Face datasets: hf://dataset/name
*   Unity Catalog volumes: vol+dbfs:/Volumes/unity/path
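
To make the scheme list concrete, here is a minimal sketch that maps a path to the storage family it belongs to. It uses only Python's standard library. `SCHEME_FAMILIES` and `storage_family` are hypothetical names introduced for illustration; they are not part of Daft, and Daft's own path resolution may differ.

```python
from urllib.parse import urlparse

# Hypothetical lookup table built from the scheme list above; not a Daft API.
SCHEME_FAMILIES = {
    "file": "local",
    "s3": "s3", "s3a": "s3", "s3n": "s3",
    "gs": "gcs",
    "az": "azure", "abfs": "azure", "abfss": "azure",
    "http": "http", "https": "http",
    "hf": "huggingface",
    "vol+dbfs": "unity",
}

def storage_family(path: str) -> str:
    """Return the storage family for a path, based only on its URL scheme."""
    scheme = urlparse(path).scheme.lower()
    if not scheme:
        # A bare path such as /path/to/file has no scheme: treat it as local.
        return "local"
    return SCHEME_FAMILIES.get(scheme, "unknown")

for p in ["/path/to/file", "s3a://bucket/path", "vol+dbfs:/Volumes/unity/path"]:
    print(p, "->", storage_family(p))
```

A check like this can be handy for validating user-supplied paths before handing them to a pipeline, though it only inspects the scheme and says nothing about whether the resource exists.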

Two Ways to Work with Files in Daft

1. URL Functions - Use these when the downloaded data fits in memory all at once

In [None]:
import daft

df = daft.from_pydict({
    "urls": [
        "https://www.google.com",
        "https://images.unsplash.com/photo-1503023345310-bd7c1de61c7d",  # sample photo
    ],
})

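# Download the full contents of each URL into a Binary column (held in memory)
df = df.with_column("data", df["urls"].url.download())

# Materialize the result, then display it
df.collect()

df.show()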


| urls (String) | data (Binary) |
| --- | --- |
| https://www.google.com | b"<!doctype html><html itemscope=\"\""... |
| https://images.unsplash.com/photo-1503023345310-bd7c1de61c7d | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... |


2. File Datatype - Use this when files are too large to fit in memory, or when you only need to access specific portions of a file

In [None]:
import daft
from daft.functions import file
from daft.io import IOConfig

io_config = IOConfig()

df = daft.from_pydict(
    {
        "urls": [
            "https://upload.wikimedia.org/wikipedia/commons/4/47/PNG_transparency_demonstration_1.png",  # PNG
            "https://upload.wikimedia.org/wikipedia/commons/3/3f/JPEG_example_flower.jpg",                # JPEG
            "https://media.giphy.com/media/ICOgUNjpvO0PC/giphy.gif",                                      # GIF
        ],
    }
)

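@daft.func
def detect_file_type(file: daft.File) -> str:
    # Read only the first 12 bytes: enough to match the magic numbers below
    # without loading the whole file into memory.
    with file.open() as f:
        header = f.read(12)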

    if header.startswith(b"\xff\xd8\xff"):
        return "JPEG"
    elif header.startswith(b"\x89PNG\r\n\x1a\n"):
        return "PNG"
    elif header.startswith(b"GIF87a") or header.startswith(b"GIF89a"):
        return "GIF"
    elif header.startswith(b"<!") or header.startswith(b"<html"):
        return "HTML"
    elif header.startswith(b"HTTP/"):
        return "HTTP"
    else:
        return "Unknown"

df = df.with_column(
    "file_type",
    detect_file_type(file(df["urls"], io_config=io_config))
)

df.collect()
df.show()


| urls (String) | file_type (String) |
| --- | --- |
| https://upload.wikimedia.org/wikipedia/commons/4/47/PNG_transparency_demonstration_1.png | PNG |
| https://upload.wikimedia.org/wikipedia/commons/3/3f/JPEG_example_flower.jpg | JPEG |
| https://media.giphy.com/media/ICOgUNjpvO0PC/giphy.gif | GIF |
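
The header check inside `detect_file_type` can be tried without network access by running the same magic-byte comparisons on any binary file-like object. The sketch below is a stand-in under stated assumptions: `sniff_type` is a hypothetical plain-Python helper (not a Daft function), and `io.BytesIO` substitutes for the handle returned by `file.open()`. It reads only the first 12 bytes, however large the payload is.

```python
import io
from typing import BinaryIO

def sniff_type(f: BinaryIO) -> str:
    """Classify a binary stream by its leading magic bytes (hypothetical helper)."""
    header = f.read(12)  # only the first 12 bytes are ever read
    if header.startswith(b"\xff\xd8\xff"):
        return "JPEG"
    if header.startswith(b"\x89PNG\r\n\x1a\n"):
        return "PNG"
    if header.startswith((b"GIF87a", b"GIF89a")):
        return "GIF"
    return "Unknown"

# A fake 1 MB "PNG": a valid signature followed by padding bytes.
png = io.BytesIO(b"\x89PNG\r\n\x1a\n" + b"\x00" * 1_000_000)
print(sniff_type(png))  # PNG
print(png.tell())       # 12 (the stream position shows how little was consumed)
```

Keeping the byte-matching logic in a small function like this makes it easy to unit test separately from the pipeline that supplies the file handles.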
