Website • Docs • Installation • 10-minute tour of Daft • Community and Support
Daft: the distributed Python dataframe for complex data
Daft is a fast, Pythonic and scalable open-source dataframe library built for Python and Machine Learning workloads.
Daft is currently in its Alpha release phase - please expect bugs and rapid improvements to the project. We welcome user feedback and feature requests in our Discussions forums.
About Daft
The Daft dataframe is a table of data with rows and columns. Columns can contain any Python objects, which allows Daft to support rich complex data types such as images, audio, video and more.
- Any Data: Columns can contain any Python objects, which means that the Python libraries you already use for running machine learning or custom data processing will work natively with Daft!
- Notebook Computing: Daft is built for the interactive developer experience on a notebook - intelligent caching and query optimizations accelerate your experimentation and data exploration.
- Distributed Computing: Rich complex formats such as images can quickly outgrow your local laptop's computational resources - Daft integrates natively with Ray for running dataframes on large clusters of machines with thousands of CPUs/GPUs.
Getting Started
Installation
Install Daft with pip: pip install getdaft
For more advanced installations (e.g. installing from source or with extra dependencies such as Ray and AWS utilities), please see our Installation Guide
Quickstart
Check out our 10-minute quickstart!
In this example, we load images from an AWS S3 bucket and run a simple function to generate thumbnails for each image:
from daft import DataFrame, lit
import io
from PIL import Image
def get_thumbnail(img: Image.Image) -> Image.Image:
    """Simple function to make an image thumbnail"""
    # Work on a copy, because Image.thumbnail() resizes in place
    imgcopy = img.copy()
    imgcopy.thumbnail((48, 48))
    return imgcopy
# Load a dataframe from files in an S3 bucket
df = DataFrame.from_files("s3://daft-public-data/laion-sample-images/*")
# Get the AWS S3 url of each image
df = df.select(lit("s3://").str.concat(df["name"]).alias("s3_url"))
# Download images and load as a PIL Image object
df = df.with_column("image", df["s3_url"].url.download().apply(lambda data: Image.open(io.BytesIO(data))))
# Generate thumbnails from images
df = df.with_column("thumbnail", df["image"].apply(get_thumbnail))
df.show(3)
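The thumbnail function above does not depend on Daft or S3, so it can be sanity-checked locally before running it over a dataframe. This is a minimal sketch, assuming only that Pillow is installed. The 640x480 in-memory image is a made-up stand-in for a downloaded file.

```python
from PIL import Image


def get_thumbnail(img: Image.Image) -> Image.Image:
    """Simple function to make an image thumbnail"""
    # Work on a copy, because Image.thumbnail() resizes in place
    imgcopy = img.copy()
    imgcopy.thumbnail((48, 48))
    return imgcopy


# Hypothetical stand-in for an image downloaded from S3
original = Image.new("RGB", (640, 480), (70, 130, 180))
thumb = get_thumbnail(original)

# thumbnail() preserves aspect ratio, so the result fits within 48x48
print(original.size, thumb.size)
```

Because the function copies its input first, the original image keeps its size, and only the returned copy is shrunk.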
More Resources
- 10-minute tour of Daft - learn more about Daft's full range of capabilities including dataloading from URLs, joins, user-defined functions (UDF), groupby, aggregations and more.
- User Guide - take a deep-dive into each topic within Daft
- API Reference - API reference for public classes/functions of Daft
Contributing
To start contributing to Daft, please read CONTRIBUTING.md
Telemetry
To help improve Daft, we collect non-identifiable data.
To disable this behavior, set the following environment variable: DAFT_ANALYTICS_ENABLED=0
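The variable can be set in your shell, or from Python itself. The sketch below takes the Python route and assumes Daft reads the variable when it is imported. That is a guess based on the session ID being generated on import, so the variable is set before the import.

```python
import os

# Opt out of telemetry; set this before Daft is imported
# (assumption: the variable is read at import time)
os.environ["DAFT_ANALYTICS_ENABLED"] = "0"

# import daft  # import Daft only after the variable is set
```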
The data that we collect is:
- Non-identifiable: events are keyed by a session ID which is generated on import of Daft
- Metadata-only: we do not collect any of our users’ proprietary code or data
- For development only: we do not buy or sell any user data
Please see our documentation for more details.
License
Daft has an Apache 2.0 license - please see the LICENSE file.