# Quickstart

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Eventual-Inc/Daft/blob/tools%2Fdocs-to-notebook-converter/docs/notebooks/quickstart.ipynb)

Daft is a multimodal data processing engine that lets you load data from anywhere, transform it with a DataFrame API and AI functions, and store it in the destination of your choice. This quickstart shows what that looks like in practice with a realistic e-commerce data workflow.

### Requirements

Daft requires **Python 3.10 or higher**.
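
If you are unsure which interpreter your notebook or terminal is using, a quick check like the following can confirm it before you install anything. This is a minimal sketch using only the standard library; the helper name `python_is_supported` is made up for illustration and is not part of Daft.

```python
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the requirement above


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the stated minimum."""
    return tuple(version_info[:2]) >= minimum


# Report on the interpreter that is running this cell.
status = "supported" if python_is_supported() else "too old for Daft"
print(f"Python {sys.version_info.major}.{sys.version_info.minor} is {status}")
```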

### Install Daft

You can install Daft using `pip`. Run the following command in your terminal or notebook:

In [None]:
!pip install -U "daft[openai]"  # Includes OpenAI extras needed for this quickstart

Additionally, install these packages for image processing (used later in this quickstart):

In [None]:
!pip install numpy pillow

### Load Your Data

Let's start by loading an e-commerce dataset from Hugging Face. [This dataset](https://huggingface.co/datasets/calmgoose/amazon-product-data-2020) contains 10,000+ Amazon products from diverse categories including electronics, toys, home goods, and more. Each product includes details like names, prices, descriptions, technical specifications, and product images.

In [7]:
import daft

df_original = daft.read_huggingface("calmgoose/amazon-product-data-2020")

<div style="background-color: #448aff22; border-left: 4px solid #448aff; padding: 12px; margin: 16px 0;">
<strong style="color: #448aff;">Load from anywhere</strong><br/>
Daft can load data from many sources including <a href="https://docs.daft.ai/en/stable/connectors/aws/">S3</a>, <a href="https://docs.daft.ai/en/stable/connectors/iceberg/">Iceberg</a>, <a href="https://docs.daft.ai/en/stable/connectors/delta_lake/">Delta Lake</a>, <a href="https://docs.daft.ai/en/stable/connectors/hudi/">Hudi</a>, and <a href="https://docs.daft.ai/en/stable/connectors/">more</a>. We're using Hugging Face here as a demonstration.
</div>

### Inspect Your Data

Now let's take a look at what we loaded. You can inspect the DataFrame by simply printing it:

In [8]:
df_original

Uniq Id String,Product Name String,Category String,Upc Ean Code String,Selling Price String,Model Number String,About Product String,Product Specification String,Technical Details String,Shipping Weight String,Product Dimensions String,Image String,Variants String,Product Url String,Is Amazon Seller String


You see the above output because **Daft is lazy by default** - it displays the schema (column names and types) but doesn't actually load or process your data until you explicitly tell it to. This allows Daft to optimize your entire workflow before executing anything.

To actually view your data, you have two options:

**Option 1: Preview with `.show()`** - View the first few rows:

In [9]:
df_original.show(2)

Uniq Id String,Product Name String,Category String,Upc Ean Code String,Selling Price String,Model Number String,About Product String,Product Specification String,Technical Details String,Shipping Weight String,Product Dimensions String,Image String,Variants String,Product Url String,Is Amazon Seller String
4c69b61db1fc16e7013b43fc926e502d,"DB Longboards CoreFlex Crossbow 41"" Bamboo Fiberglass Longboard Complete","Sports & Outdoors | Outdoor Recreation | Skates, Skateboards & Scooters | Skateboarding | Standard Skateboards & Longboards | Longboards",,$237.68,,"Make sure this fits by entering your model number. | RESPONSIVE FLEX: The Crossbow features a bamboo core encased in triaxial fiberglass and HD plastic for a responsive flex pattern that’s second to none. Pumping & carving have never been so satisfying! Flex 2 is recommended for people 120 to 170 pounds. | COREFLEX TECH: CoreFlex construction is water resistant, impact resistant, scratch resistant and has a flex like you won’t believe. These boards combine fiberglass, epoxy, HD plastic and bamboo to create a perfect blend of performance and strength. | INSPIRED BY THE NORTHWEST: Our founding ideal is chasing adventure & riding the best boards possible, inspired by the hills, waves, beaches & mountains all around our headquarters in the Northwest | BEST IN THE WORLD: DB was founded out of sheer love of longboarding with a mission to create the best custom longboards in the world, to do it sustainably, & to treat customers & employees like family | BEYOND COMPARE: Try our skateboards & accessories if you've tried similar products by Sector 9, Landyachtz, Arbor, Loaded, Globe, Orangatang, Hawgs, Powell-Peralta, Blood Orange, Caliber or Gullwing",Shipping Weight: 10.7 pounds (View shipping rates and policies)|ASIN: B07KMVJJK7| #474 in Longboards Skateboard,,10.7 
pounds,,https://images-na.ssl-images-amazon.com/images/I/51j3fPQTQkL.jpg|https://images-na.ssl-images-amazon.com/images/I/31hKM3cSoSL.jpg|https://images-na.ssl-images-amazon.com/images/I/51WlHdwghfL.jpg|https://images-na.ssl-images-amazon.com/images/I/51FsyLRBzwL.jpg|https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/transparent-pixel.jpg,https://www.amazon.com/DB-Longboards-CoreFlex-Fiberglass-Longboard/dp/B07KMVJJK7|https://www.amazon.com/DB-Longboards-CoreFlex-Fiberglass-Longboard/dp/B07KMN5KS7|https://www.amazon.com/DB-Longboards-CoreFlex-Fiberglass-Longboard/dp/B07KMXK857|https://www.amazon.com/DB-Longboards-CoreFlex-Fiberglass-Longboard/dp/B07KMW2VFR,https://www.amazon.com/DB-Longboards-CoreFlex-Fiberglass-Longboard/dp/B07KMVJJK7,Y
66d49bbed043f5be260fa9f7fbff5957,"Electronic Snap Circuits Mini Kits Classpack, FM Radio, Motion Detector, Music Box (Set of 5)",Toys & Games | Learning & Education | Science Kits & Toys,,$99.95,55324.0,"Make sure this fits by entering your model number. | Snap circuits mini kits classpack provides basic electronic circuitry activities for students in grades 2-6 | Includes 5 separate mini building kits- an FM radio, a motion detector, music box, space battle sound effects, and a flying saucer | Each kit includes separate components and instructions to build | Each component represents one function in a circuit; components snap together to create working models of everyday electronic devices | Activity guide provides additional projects to teach students how circuitry works",Product Dimensions: 14.7 x 11.1 x 10.2 inches ; 4.06 pounds |Shipping Weight: 4 pounds (View shipping rates and policies)|Domestic Shipping: Item can be shipped within U.S.|International Shipping: This item can be shipped to select countries outside of the U.S. Learn More|ASIN: B008AK6DAS|Item model number: 55324| #3032 in Science Kits & Toys,"The snap circuits mini kits classpack provides basic electric circuitry information for students in grades 2-6. This classpack includes 5 snap-together building kits. Components snap together to create working models of everyday electronic devices. Kits included are an FM radio, a motion detector, a music box, space battle sound effects, and a flying saucer. Each mini kit comes with individual components, and an activity guide which includes instructions and additional project ideas. Each primary-colored component represents one function in a circuit. Activity kits are used by teachers and students in classroom and homeschool settings for educational and research applications in science, math, and for a variety of additional disciplines. 
Science education products and manipulatives incorporate applied math and science principles into classroom or homeschool projects. Teachers in pre-K, elementary, and secondary classrooms use science education kits, manipualtives, and products alongside science, technology, engineering, and math (STEM) curriculum to demonstrate STEM concepts and real-world applications through hands-on activities. Science education projects include a broad range of activities, such as practical experiments in engineering, aeronautics, robotics, chemistry, physics, biology, and geology.",4 pounds,14.7 x 11.1 x 10.2 inches 4.06 pounds,https://images-na.ssl-images-amazon.com/images/I/51M0KnJxjKL.jpg|https://images-na.ssl-images-amazon.com/images/I/5166GD8OkXL.jpg|https://images-na.ssl-images-amazon.com/images/I/61o5S1VnaNL.jpg|https://images-na.ssl-images-amazon.com/images/I/61t4Q0rPYjL.jpg|https://images-na.ssl-images-amazon.com/images/I/61NASUAyqcL.jpg|https://images-na.ssl-images-amazon.com/images/I/51OMrADdyJL.jpg|https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/transparent-pixel.jpg,,https://www.amazon.com/Electronic-Circuits-Classpack-Motion-Detector/dp/B008AK6DAS,Y


This materializes and displays just the first 2 rows, which is perfect for quickly inspecting your data without loading the entire dataset.

**Option 2: Materialize with `.collect()`** - Load the entire dataset:

In [10]:
# df_original.collect()

This would materialize the entire DataFrame (all 10,000+ rows in this case) into memory. Use `.collect()` when you need to work with the full dataset in memory.
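
To make the difference between building a plan, previewing it, and collecting it more concrete, here is a toy illustration in plain Python. It is not how Daft is implemented (Daft also optimizes the plan before running it, which this sketch does not attempt), and the `LazyRows` class is hypothetical. It only shows the idea that recording a step costs nothing, and that a preview reads far fewer rows than a full collect.

```python
from itertools import islice


class LazyRows:
    """Toy stand-in for a lazy DataFrame: records steps, runs them on demand."""

    def __init__(self, source, steps=()):
        self._source = source        # zero-argument callable that yields rows
        self._steps = tuple(steps)   # deferred per-row transformations

    def with_step(self, fn):
        # Returns a new plan; nothing is executed here.
        return LazyRows(self._source, self._steps + (fn,))

    def _iterate(self):
        for row in self._source():
            for fn in self._steps:
                row = fn(row)
            yield row

    def show(self, n):
        # Pull only the first n rows through the recorded steps.
        return list(islice(self._iterate(), n))

    def collect(self):
        # Pull every row through the recorded steps.
        return list(self._iterate())


counter = {"rows_read": 0}


def source():
    for i in range(10_000):
        counter["rows_read"] += 1
        yield {"id": i}


plan = LazyRows(source).with_step(lambda row: {**row, "double": row["id"] * 2})
print(counter["rows_read"])  # 0: building the plan reads nothing
print(plan.show(2))          # [{'id': 0, 'double': 0}, {'id': 1, 'double': 2}]
print(counter["rows_read"])  # 2: the preview read only two rows
```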

### Working with a Smaller Dataset

For quick experimentation, let's create a smaller, simplified version of the dataframe with just the essential columns:

In [11]:
# Select only the columns we need and limit to 5 rows for faster iteration
df = df_original.select("Product Name", "About Product", "Image").limit(5)

Now we have a manageable dataset of 5 products with just the product name, description, and image URLs. This simplified dataset lets us explore Daft's features without the overhead of unnecessary columns.

### Downloading Images

Let's extract and download product images. The `Image` column contains pipe-separated URLs. We'll extract the first URL and download it:

In [12]:
# Extract the first image URL from the pipe-separated list
# The pattern captures everything before the first pipe or the entire string if no pipe
df = df.with_column(
    "first_image_url",
    daft.functions.regexp_extract(
        df["Image"],
        r"^([^|]+)",  # Extract everything before the first pipe
        1,  # Get the first capture group
    ),
)

# Download the image data
df = df.with_column("image_data", daft.functions.download(df["first_image_url"], on_error="null"))

# Decode images for visual display (in Jupyter notebooks, this shows actual images!)
df = df.with_column("image", daft.functions.decode_image(df["image_data"], on_error="null"))

# Check what we have - in Jupyter notebooks, the 'image' column shows actual images!
df.select("Product Name", "first_image_url", "image_data", "image").show(3)

Product Name String,first_image_url String,image_data Binary,image Image[MIXED]
"DB Longboards CoreFlex Crossbow 41"" Bamboo Fiberglass Longboard Complete",https://images-na.ssl-images-amazon.com/images/I/51j3fPQTQkL.jpg,"b""\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01""...",
"Electronic Snap Circuits Mini Kits Classpack, FM Radio, Motion Detector, Music Box (Set of 5)",https://images-na.ssl-images-amazon.com/images/I/51M0KnJxjKL.jpg,"b""\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01""...",
"3Doodler Create Flexy 3D Printing Filament Refill Bundle (X5 Pack, Over 1000'. of Extruded Plastics! - Innovate",https://images-na.ssl-images-amazon.com/images/I/513cBC8PqpL.jpg,"b""\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01""...",


<div style="background-color: #448aff22; border-left: 4px solid #448aff; padding: 12px; margin: 16px 0;">
<strong style="color: #448aff;">Visual Display in Notebooks</strong><br/>
In Jupyter notebooks, the <code>image</code> column will display actual thumbnail images instead of <code>&lt;Image&gt;</code> text.
</div>

This demonstrates Daft's multimodal capabilities:

- **Native regex support**: Use `regexp_extract()` to parse structured text with Rust-powered regex
- **URL handling**: Download content directly with `daft.functions.download()`
- **Image decoding**: Convert binary data to images with `decode_image()` for visual display

The decoded images are now ready for further processing.
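
To see what the pattern `^([^|]+)` captures, you can try it on a plain string outside of Daft. The sketch below uses Python's built-in `re` module rather than Daft's regex engine. The pattern relies only on an anchor, a negated character class, and a capture group, so it should behave the same way in both. The helper name `first_url` is made up for this example.

```python
import re

# Same pattern as in the pipeline: one or more non-pipe characters at the start.
FIRST_URL = re.compile(r"^([^|]+)")


def first_url(pipe_separated):
    """Return the text before the first pipe, or None if nothing matches."""
    match = FIRST_URL.match(pipe_separated)
    return match.group(1) if match else None


# Two of the URLs from the longboard row shown earlier.
sample = (
    "https://images-na.ssl-images-amazon.com/images/I/51j3fPQTQkL.jpg"
    "|https://images-na.ssl-images-amazon.com/images/I/31hKM3cSoSL.jpg"
)
print(first_url(sample))
# https://images-na.ssl-images-amazon.com/images/I/51j3fPQTQkL.jpg
```

A string without any pipe is returned whole, and an empty string (or one that starts with a pipe) gives no match in this Python version.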

### Batch AI Inference on Images

Let's use AI to analyze product materials at scale. Daft automatically parallelizes AI operations across your local machine's cores, making it efficient to process multiple images concurrently.

Suppose you want to create a new column that shows whether each product is made of wood. This could power, for example, a filtering feature on your website.

In [13]:
from pydantic import BaseModel, Field

from daft.functions import prompt


# Define a simple structured output model
class WoodAnalysis(BaseModel):
    is_wooden: bool = Field(description="Whether the product appears to be made of wood")


# Run AI inference on each image - Daft automatically batches and parallelizes this
# Note: You can pass api_key explicitly here, or set the OPENAI_API_KEY environment variable
df = df.with_column(
    "wood_analysis",
    prompt(
        ["Is this product made of wood? Look at the material.", df["image"]],
        return_format=WoodAnalysis,
        model="gpt-4o-mini",  # Using mini for cost-efficiency
        provider="openai",
        api_key="your-openai-api-key-here",  # Or omit this to use OPENAI_API_KEY env var
    ),
)

# Extract the boolean value from the structured output
# The result is a struct, so we extract the 'is_wooden' field
df = df.with_column("is_wooden", df["wood_analysis"]["is_wooden"])

# Materialize the dataframe to compute all transformations
df = df.collect()

# View results
df.select("Product Name", "image", "is_wooden").show()

Running the cell with the placeholder key left unchanged fails with an authentication error from OpenAI:

Error when running pipeline node Async UDF prompt-a120ddcf-ba46-4ba0-99ac-cb98a8a47eff
RuntimeStatsManager finished with active nodes {7}

AuthenticationError: Error code: 401 - {'error': {'message': 'Incorrect API key provided: your-ope************here. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}

Replace the placeholder with a real key, or omit `api_key` and set the `OPENAI_API_KEY` environment variable, then rerun the cell.

With a valid API key, the AI analyzes each product image to determine whether it's made of wood. Notice that the longboard is identified as wooden (true), while the electronic circuits, design studio, puzzle, and 3D printing filament are identified as not wooden (false).

<div style="background-color: #448aff22; border-left: 4px solid #448aff; padding: 12px; margin: 16px 0;">
<strong style="color: #448aff;">Improving Accuracy</strong><br/>
Looking at the actual product data, the longboard is made of bamboo and fiberglass, not wood. A human judging from the image alone might well make the same call, though. To improve accuracy, you could give the AI additional context such as the product name, category, and description alongside the image. This example demonstrates how to get started with image-based analysis.
</div>
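
Because the inference cell accepts either an explicit `api_key` or the `OPENAI_API_KEY` environment variable, it can help to resolve the key once and fail early with a clear message if only the placeholder is present. The following is a minimal sketch using the standard library. `resolve_openai_key` is a hypothetical helper, not part of Daft or the OpenAI client.

```python
import os

# The placeholder string used in this quickstart's code cells.
PLACEHOLDER = "your-openai-api-key-here"


def resolve_openai_key(explicit_key=None, env=os.environ):
    """Prefer an explicit key, fall back to OPENAI_API_KEY, reject the placeholder."""
    for candidate in (explicit_key, env.get("OPENAI_API_KEY")):
        if candidate and candidate != PLACEHOLDER:
            return candidate
    raise ValueError(
        "No usable OpenAI key found: pass a real key or set OPENAI_API_KEY."
    )


# Demonstration with a made-up environment mapping instead of the real one.
print(resolve_openai_key(env={"OPENAI_API_KEY": "sk-example"}))  # sk-example
```

You could then pass the returned value as `api_key`, so a missing key fails before any rows are sent for inference.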

### Expanding the Analysis

Now, suppose you're satisfied with the results from your small subset and want to scale up. Instead of analyzing just 5 products, let's run the same analysis on 100 products to get more meaningful insights:

In [1]:
from pydantic import BaseModel, Field

from daft.functions import prompt


# Define a simple structured output model (same as before)
class WoodAnalysis(BaseModel):
    is_wooden: bool = Field(description="Whether the product appears to be made of wood")


# Start fresh with the first 100 products
df_large = df_original.select("Product Name", "About Product", "Image").limit(100)

# Apply the same image processing pipeline
# 1. Extract first image URL
df_large = df_large.with_column("first_image_url", daft.functions.regexp_extract(df_large["Image"], r"^([^|]+)", 1))

# 2. Download images
df_large = df_large.with_column("image_data", daft.functions.download(df_large["first_image_url"], on_error="null"))

# 3. Decode images
df_large = df_large.with_column("image", daft.functions.decode_image(df_large["image_data"], on_error="null"))

# 4. Run AI analysis on all 100 products
# Note: You can pass api_key explicitly here, or set the OPENAI_API_KEY environment variable
df_large = df_large.with_column(
    "wood_analysis",
    prompt(
        ["Is this product made of wood? Look at the material.", df_large["image"]],
        return_format=WoodAnalysis,
        model="gpt-4o-mini",  # Using mini for cost-efficiency
        provider="openai",
        api_key="your-openai-api-key-here",  # Or omit this to use OPENAI_API_KEY env var
    ),
)

# 5. Extract the boolean value
df_large = df_large.with_column("is_wooden", df_large["wood_analysis"]["is_wooden"])

# Materialize the dataframe to compute all transformations
df_large = df_large.collect()

# Count wooden products
wooden_count = df_large.where(df_large["is_wooden"] == True).count_rows()
total_count = df_large.count_rows()

print(f"Out of {total_count} products analyzed:")
print(f"  - {wooden_count} are made of wood")
print(f"  - {total_count - wooden_count} are not made of wood")
print(f"  - Percentage of wooden products: {(wooden_count / total_count * 100):.1f}%")

Out of 100 products analyzed:
  - 4 are made of wood
  - 96 are not made of wood
  - Percentage of wooden products: 4.0%


<div style="background-color: #448aff22; border-left: 4px solid #448aff; padding: 12px; margin: 16px 0;">
<strong style="color: #448aff;">Results May Vary</strong><br/>
AI models are non-deterministic, so you may see slightly different numbers when running this analysis.
</div>
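
One detail of the summary above is that "not made of wood" is computed as `total_count - wooden_count`, so any row whose `is_wooden` value is missing would be counted as not wooden. Whether missing values actually occur in your run is an assumption here (for example, if an image could not be downloaded). The sketch below works on a plain Python list with made-up values to show how the three cases can be reported separately. `summarize` is a hypothetical helper.

```python
def summarize(flags):
    """Count True, False, and missing (None) values in a list of flags."""
    total = len(flags)
    wooden = sum(1 for flag in flags if flag is True)
    not_wooden = sum(1 for flag in flags if flag is False)
    unknown = total - wooden - not_wooden
    percent_wooden = wooden * 100 / total if total else 0.0
    return {
        "total": total,
        "wooden": wooden,
        "not_wooden": not_wooden,
        "unknown": unknown,
        "percent_wooden": percent_wooden,
    }


# Made-up flags for illustration only; these are not results from the dataset.
example_flags = [True] * 4 + [False] * 94 + [None] * 2
summary = summarize(example_flags)
print(summary["wooden"], summary["not_wooden"], summary["unknown"])  # 4 94 2
```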

### Storing Your Results

After processing your data, you'll often want to save it for later use. Let's store our analyzed dataset as Parquet files:

In [None]:
# Write the analyzed data to local Parquet files
df_large.write_parquet("product_analysis", write_mode="overwrite")

This writes your data to the `product_analysis/` directory. Daft automatically handles file naming using UUIDs to prevent conflicts. The `write_mode="overwrite"` parameter ensures that any existing data in the directory is replaced.

<div style="background-color: #448aff22; border-left: 4px solid #448aff; padding: 12px; margin: 16px 0;">
<strong style="color: #448aff;">Write anywhere</strong><br/>
Just like reading, Daft can write data to many destinations including <a href="https://docs.daft.ai/en/stable/connectors/aws/">S3</a>, <a href="https://docs.daft.ai/en/stable/connectors/iceberg/">Iceberg</a>, <a href="https://docs.daft.ai/en/stable/connectors/delta_lake/">Delta Lake</a>, and <a href="https://docs.daft.ai/en/stable/connectors/">more</a>.
</div>
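
If you want to confirm what was written before reading it back, you can list the Parquet files in the output directory with the standard library. This is a small sketch. `list_parquet_files` is a hypothetical helper, and the demonstration uses a temporary directory with empty stand-in files and made-up names rather than real Daft output.

```python
import tempfile
from pathlib import Path


def list_parquet_files(directory):
    """Return the sorted names of *.parquet files directly inside a directory."""
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(path.name for path in root.glob("*.parquet"))


# Demonstration on a throwaway directory with stand-in files.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "product_analysis"
    out.mkdir()
    (out / "part-0.parquet").touch()
    (out / "notes.txt").touch()
    print(list_parquet_files(out))  # ['part-0.parquet']
```

After running the write cell, calling `list_parquet_files("product_analysis")` should show the files that a `product_analysis/*.parquet` glob would match.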

### Loading Your Stored Data

Let's verify the stored data by loading it back from those Parquet files:

In [1]:
# Read the data back from Parquet files
df_loaded = daft.read_parquet("product_analysis/*.parquet")

# Verify the data loaded correctly
df_loaded.show(5)

╭────────────────────┬────────────────────┬───────────────────┬───────────────────┬────────────┬──────────────┬───────────────────┬───────────╮
│ Product Name       ┆ About Product      ┆ Image             ┆ first_image_url   ┆      …     ┆ image        ┆ wood_analysis     ┆ is_wooden │
│ ---                ┆ ---                ┆ ---               ┆ ---               ┆            ┆ ---          ┆ ---               ┆ ---       │
│ String             ┆ String             ┆ String            ┆ String            ┆ (1 hidden) ┆ Image[MIXED] ┆ Struct[is_wooden: ┆ Bool      │
│                    ┆                    ┆                   ┆                   ┆            ┆              ┆ Bool]             ┆           │
╞════════════════════╪════════════════════╪═══════════════════╪═══════════════════╪════════════╪══════════════╪═══════════════════╪═══════════╡
│ Flash Furniture    ┆ Collaborative      ┆ https://images-na ┆ https://images-na ┆ …          ┆ <Image>      ┆ {is_wooden:       ┆ fals

### What's Next?

Now that you have a basic sense of Daft's functionality and features, here are some more resources to help you get the most out of Daft:

<div style="background-color: #00c85322; border-left: 4px solid #00c853; padding: 12px; margin: 16px 0;">
<strong style="color: #00c853;">Scaling Further</strong><br/>
This same pipeline can process thousands or millions of products by leveraging Daft's distributed computing capabilities. Check out our <a href="https://docs.daft.ai/en/stable/distributed/">distributed computing guide</a> to run this analysis at scale on Ray or Kubernetes clusters. Alternatively, <a href="https://www.daft.ai/cloud">Daft Cloud</a> provides a fully managed serverless experience.
</div>

**Work with your favorite table and catalog formats:**

- [Apache Hudi](https://docs.daft.ai/en/stable/connectors/hudi/)
- [Apache Iceberg](https://docs.daft.ai/en/stable/connectors/iceberg/)
- [AWS Glue](https://docs.daft.ai/en/stable/connectors/glue/)
- [AWS S3 Tables](https://docs.daft.ai/en/stable/connectors/s3tables/)
- [Delta Lake](https://docs.daft.ai/en/stable/connectors/delta_lake/)
- [Hugging Face Datasets](https://docs.daft.ai/en/stable/connectors/huggingface/)
- [Unity Catalog](https://docs.daft.ai/en/stable/connectors/unity_catalog/)

**Explore our [Examples](https://docs.daft.ai/en/stable/examples/) to see Daft in action:**

- [MNIST Digit Classification](https://docs.daft.ai/en/stable/examples/mnist/)
- [Running LLMs on the Red Pajamas Dataset](https://docs.daft.ai/en/stable/examples/llms-red-pajamas/)
- [Querying Images with UDFs](https://docs.daft.ai/en/stable/examples/querying-images/)
- [Image Generation on GPUs](https://docs.daft.ai/en/stable/examples/image-generation/)
- [Window Functions in Daft](https://docs.daft.ai/en/stable/examples/window-functions/)