Website • Docs • Installation • 10-minute tour of Daft • Community and Support
Daft: the distributed Python dataframe for complex data
Daft is a fast, Pythonic and scalable open-source dataframe library built for Python and Machine Learning workloads.
Daft is currently in its Beta release phase - please expect bugs and rapid improvements to the project. We welcome user feedback and feature requests in our Discussions forum
About Daft
The Daft dataframe is a table of data with rows and columns. Columns can contain arbitrary Python objects, which allows Daft to support rich complex data types such as images, audio, video, and more.
- Any Data: Columns can contain any Python objects, which means that the Python libraries you already use for running machine learning or custom data processing will work natively with Daft!
- Notebook Computing: Daft is built for the interactive developer experience in a notebook - intelligent caching and query optimizations accelerate your experimentation and data exploration.
- Distributed Computing: Rich complex formats such as images can quickly outgrow your local laptop's computational resources - Daft integrates natively with Ray for running dataframes on large clusters of machines with thousands of CPUs/GPUs.
Getting Started
Installation
Install Daft with pip install getdaft.
For more advanced installations (e.g. installing from source, or with extra dependencies such as Ray and AWS utilities), please see our Installation Guide.
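For reference, the install commands look like this (the extras names shown are illustrative; check the Installation Guide for the exact set of supported extras):

```shell
# Basic install from PyPI (the package is named getdaft; the import name is daft)
pip install getdaft

# Install with optional extras, e.g. Ray and AWS support
# (extras names are illustrative; see the Installation Guide)
pip install "getdaft[aws,ray]"
```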
Quickstart
Check out our 10-minute quickstart!
In this example, we load images from an AWS S3 bucket and run a simple function to generate thumbnails for each image:
import io

import daft
from PIL import Image
def get_thumbnail(img: Image.Image) -> Image.Image:
"""Simple function to make an image thumbnail"""
imgcopy = img.copy()
imgcopy.thumbnail((48, 48))
return imgcopy
# Load a dataframe from files in an S3 bucket
df = daft.from_glob_path("s3://daft-public-data/laion-sample-images/*")
# Get the AWS S3 url of each image
df = df.select(df["path"].alias("s3_url"))
# Download images and load as a PIL Image object
df = df.with_column(
    "image",
    df["s3_url"].url.download().apply(
        lambda data: Image.open(io.BytesIO(data)),
        return_dtype=daft.DataType.python(),
    ),
)
# Generate thumbnails from images
df = df.with_column("thumbnail", df["image"].apply(get_thumbnail, return_dtype=daft.DataType.python()))
df.show(3)
Benchmarks
To see the full benchmarks, detailed setup, and logs, check out our benchmarking page.
More Resources
- 10-minute tour of Daft - learn more about Daft's full range of capabilities, including data loading from URLs, joins, user-defined functions (UDFs), groupbys, aggregations, and more.
- User Guide - take a deep-dive into each topic within Daft
- API Reference - API reference for public classes/functions of Daft
Contributing
To start contributing to Daft, please read CONTRIBUTING.md
Telemetry
To help improve Daft, we collect non-identifiable data.
To disable this behavior, set the following environment variable: DAFT_ANALYTICS_ENABLED=0
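For example, in a shell session or shell profile:

```shell
# Opt out of Daft's analytics before importing the library
export DAFT_ANALYTICS_ENABLED=0
```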
The data that we collect is:
- Non-identifiable: events are keyed by a session ID that is generated when Daft is imported
- Metadata-only: we do not collect any of our users’ proprietary code or data
- For development only: we do not buy or sell any user data
Please see our documentation for more details.
Related Projects
Dataframe | Query Optimizer | Complex Types | Distributed | Arrow Backed | Vectorized Execution Engine | Out-of-core |
---|---|---|---|---|---|---|
Daft | Yes | Yes | Yes | Yes | Yes | Yes |
Pandas | No | Python object | No | optional >= 2.0 | Some (NumPy) | No |
Polars | Yes | Python object | No | Yes | Yes | Yes |
Modin | Eager | Python object | Yes | No | Some (Pandas) | Yes |
PySpark | Yes | No | Yes | Pandas UDF/IO | Pandas UDF | Yes |
Dask DF | No | Python object | Yes | No | Some (Pandas) | Yes |
Check out our dataframe comparison page for more details!
License
Daft has an Apache 2.0 license - please see the LICENSE file.