```{hint}
✨✨✨ **Run this notebook on Google Colab** ✨✨✨

You can [run this notebook yourself with Google Colab](https://colab.research.google.com/github/Eventual-Inc/Daft/blob/main/docs/source/10-min.ipynb)!
```

# 10 minutes Quickstart

This is a short introduction to all the main functionality in Daft, geared towards new users.

## What is Daft?
Daft is a distributed query engine built for running ETL, analytics, and ML/AI workloads at scale. Daft is implemented in Rust (fast!) and exposes a familiar Python dataframe API (friendly!). 

In this Quickstart you will learn the basics of Daft’s familiar DataFrame API and the features that set it apart from frameworks like pandas, PySpark, Dask and Ray. You will build a small database of dog owners and their fluffy companions and see how you can use Daft to download images from URLs, run an ML classifier and call custom UDFs, all within an interactive DataFrame interface. Woof! 🐶

## When Should I Use Daft?

Daft is the right tool for you if you are working with any of the following:
- **Large datasets** that don't fit into memory or would benefit from parallelization
- **Multimodal data types** such as images, JSON, vector embeddings, and tensors
- **Formats that support data skipping** through automatic partition pruning and stats-based file pruning for filter predicates
- **ML workloads** that would benefit from interactive computation within DataFrame (via UDFs)

Read more about how Daft compares to other DataFrames [here](https://www.getdaft.io/projects/docs/en/latest/faq/dataframe_comparison.html).

Let's jump in! 🪂

## Install and Import Daft

You can install Daft using `pip`:

In [1]:
!pip install -U getdaft

Collecting getdaft
  Using cached getdaft-0.2.20-cp37-abi3-macosx_11_0_arm64.whl.metadata (10 kB)
Downloading getdaft-0.2.20-cp37-abi3-macosx_11_0_arm64.whl (17.5 MB)
Installing collected packages: getdaft
  Attempting uninstall: getdaft
    Found existing installation: getdaft 0.2.19
    Uninstalling getdaft-0.2.19:
      Successfully uninstalled getdaft-0.2.19
Successfully installed getdaft-0.2.20


And then import Daft and one of its classes which we'll need later on:

In [1]:
import daft
from daft import DataType

## Create your first Daft DataFrame

See also: [API Reference: DataFrame Construction](df-input-output)

To begin, let's create a DataFrame from a dictionary of columns:

In [2]:
import datetime

df = daft.from_pydict({
    "integers": [1, 2, 3, 4],
    "floats": [1.5, 2.5, 3.5, 4.5],
    "bools": [True, True, False, False],
    "strings": ["a", "b", "c", "d"],
    "bytes": [b"a", b"b", b"c", b"d"],
    "dates": [datetime.date(1994, 1, 1), datetime.date(1994, 1, 2), datetime.date(1994, 1, 3), datetime.date(1994, 1, 4)],
    "lists": [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]],
    "nulls": [None, None, None, None],
})

df

integers Int64,floats Float64,bools Boolean,strings Utf8,bytes Binary,dates Date,lists List[Int64],nulls Null
1,1.5,True,a,"b""a""",1994-01-01,"[1, 1, 1]",
2,2.5,True,b,"b""b""",1994-01-02,"[2, 2, 2]",
3,3.5,False,c,"b""c""",1994-01-03,"[3, 3, 3]",
4,4.5,False,d,"b""d""",1994-01-04,"[4, 4, 4]",


### Multimodal Data Types

Daft is built for multimodal data type support. Daft DataFrames can contain more data types than other DataFrame APIs like pandas, Spark or Dask. Daft columns can contain URLs, images, tensors and Python classes. You'll get to work with some of these data types in a moment.

For a complete list of supported data types see: [API Reference: DataTypes](datatypes)

### Data Sources

You can also load DataFrames from other sources, such as:

1. CSV files: {func}`daft.read_csv("s3://bucket/*.csv") <daft.read_csv>`
2. Parquet files: {func}`daft.read_parquet("/path/*.parquet") <daft.read_parquet>`
3. JSON line-delimited files: {func}`daft.read_json("/path/*.jsonl") <daft.read_json>`
4. Files on disk: {func}`daft.from_glob_path("/path/*.jpeg") <daft.from_glob_path>`

Daft automatically supports local paths as well as paths to object storage such as AWS S3:

```
df = daft.read_json("s3://path/to/bucket/file.jsonl")
```

See [User Guide: Integrations]() to learn more about working with other formats like Delta Lake and Iceberg.

## Who likes puppies? 😍🐶 

Let's find some more fun data to work with :)

We'll read in a Parquet file from a public S3 bucket. Note that this Parquet file is partitioned on the `country` column. This will be important later on, when we look at:
- predicate pushdown filtering
- parallel partition processing

In [None]:
# Read partitioned Parquet file from S3 
# will show no contents >>

In [4]:
df = daft.from_pydict({
    "first_name": ["Ernesto", "Sari", "Wolfgang", "Jackie", "Zoya"],
    "last_name":["Evergreen", "Salama", "Winter", "Jale", "Zee"],
    "age": [34, 57, 23, 62, 40],
    "DoB": [datetime.date(1990,4,3), datetime.date(1967,1,2), datetime.date(2001,2,12), datetime.date(1962,3,24), datetime.date(1984,4,7)],
    "country": ["Canada", "United Kingdom", "Germany", "Canada", "United Kingdom"],
    "has_dog": [True, True, False, True, True],
})

df

first_name Utf8,last_name Utf8,age Int64,DoB Date,country Utf8,has_dog Boolean
Ernesto,Evergreen,34,1990-04-03,Canada,True
Sari,Salama,57,1967-01-02,United Kingdom,True
Wolfgang,Winter,23,2001-02-12,Germany,False
Jackie,Jale,62,1962-03-24,Canada,True
Zoya,Zee,40,1984-04-07,United Kingdom,True


In [5]:
df.write_parquet("owners", partition_cols=["country"])


path Utf8,country Utf8
owners/country=Canada/5429d152-bc0b-4b5d-a8c0-4f678b6c2881-0.parquet,Canada
owners/country=Germany/92e4b038-d29e-474c-a3fc-8c5e236f2982-0.parquet,Germany
owners/country=United Kingdom/693165f7-ffe9-4cb0-9c8d-f40ad7e14e58-0.parquet,United Kingdom


In [3]:
# change this to s3 read for final version
df = daft.read_parquet("owners/*/*")
df

first_name Utf8,last_name Utf8,age Int64,DoB Date,country Utf8,has_dog Boolean


Daft DataFrames are lazy by default. This means that the contents will not be computed ("materialized") unless you explicitly tell Daft to do so. This is best practice for working with larger-than-memory datasets and parallel/distributed architectures.
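
To make lazy evaluation concrete, here is a toy sketch in plain Python. It is not how Daft is implemented; the `LazyFrame` class and its methods are hypothetical names invented for this illustration. Each method call only records a step, and nothing runs until `collect()` is called:

```python
class LazyFrame:
    """Toy stand-in for a lazy DataFrame (hypothetical, not Daft's API)."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # recorded steps, not yet executed

    def where(self, predicate):
        # Record the filter instead of running it.
        return LazyFrame(self._rows, self._plan + (("filter", predicate),))

    def select(self, *names):
        # Record the projection instead of running it.
        return LazyFrame(self._rows, self._plan + (("project", names),))

    def collect(self):
        # Only now is the recorded plan executed, step by step.
        rows = self._rows
        for op, arg in self._plan:
            if op == "filter":
                rows = [r for r in rows if arg(r)]
            else:
                rows = [{n: r[n] for n in arg} for r in rows]
        return rows


owners = [
    {"first_name": "Ernesto", "age": 34},
    {"first_name": "Sari", "age": 57},
    {"first_name": "Wolfgang", "age": 23},
]

query = LazyFrame(owners).where(lambda r: r["age"] > 35).select("first_name")
result = query.collect()
print(result)  # [{'first_name': 'Sari'}]
```

Because the whole plan is known before anything runs, an engine has the chance to reorder or skip work before touching the data.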

The file we have just loaded only has 5 rows. You can materialize the whole DataFrame in memory easily using the `.collect` method:

In [7]:
df.collect()

                                                                                                                         

first_name Utf8,last_name Utf8,age Int64,DoB Date,country Utf8,has_dog Boolean
Ernesto,Evergreen,34,1990-04-03,Canada,True
Jackie,Jale,62,1962-03-24,Canada,True
Wolfgang,Winter,23,2001-02-12,Germany,False
Sari,Salama,57,1967-01-02,United Kingdom,True
Zoya,Zee,40,1984-04-07,United Kingdom,True


You can also take a look at just the first few rows with the `.show` method:

In [8]:
df.show(3)

first_name Utf8,last_name Utf8,age Int64,DoB Date,country Utf8,has_dog Boolean
Ernesto,Evergreen,34,1990-04-03,Canada,True
Jackie,Jale,62,1962-03-24,Canada,True
Wolfgang,Winter,23,2001-02-12,Germany,False


Use `.show` for quick visualisation in an interactive notebook. To use a limited number of rows for further transformation, use the {meth}`.limit <daft.DataFrame.limit>` method.

## Basic DataFrame Operations

Let's take a look at some of the most common DataFrame operations.

You can **select** specific columns from your DataFrame with the `.select` method:

In [9]:
df.select("first_name", "has_dog").show()

first_name Utf8,has_dog Boolean
Ernesto,True
Jackie,True
Wolfgang,False
Sari,True
Zoya,True


You can **limit** the number of rows in a dataframe by calling {meth}`df.limit() <daft.DataFrame.limit>`:

In [10]:
df.limit(1).collect()


first_name Utf8,last_name Utf8,age Int64,DoB Date,country Utf8,has_dog Boolean
Ernesto,Evergreen,34,1990-04-03,Canada,True


To **drop** columns from the dataframe, call {meth}`df.exclude() <daft.DataFrame.exclude>`:

In [12]:
df.exclude("DoB").show()

first_name Utf8,last_name Utf8,age Int64,country Utf8,has_dog Boolean
Ernesto,Evergreen,34,Canada,True
Jackie,Jale,62,Canada,True
Wolfgang,Winter,23,Germany,False
Sari,Salama,57,United Kingdom,True
Zoya,Zee,40,United Kingdom,True


You can **sort** a dataframe with {meth}`df.sort() <daft.DataFrame.sort>`. Here we sort by `age` in ascending order (`desc=False`):

In [15]:
df.sort(df["age"], desc=False).show()

first_name Utf8,last_name Utf8,age Int64,DoB Date,country Utf8,has_dog Boolean
Wolfgang,Winter,23,2001-02-12,Germany,False
Ernesto,Evergreen,34,1990-04-03,Canada,True
Zoya,Zee,40,1984-04-07,United Kingdom,True
Sari,Salama,57,1967-01-02,United Kingdom,True
Jackie,Jale,62,1962-03-24,Canada,True


You can **filter** rows in your DataFrame with a predicate using the {meth}`df.where() <daft.DataFrame.where>` method:

In [16]:
df.where(df["age"] > 35).show()

first_name Utf8,last_name Utf8,age Int64,DoB Date,country Utf8,has_dog Boolean
Jackie,Jale,62,1962-03-24,Canada,True
Sari,Salama,57,1967-01-02,United Kingdom,True
Zoya,Zee,40,1984-04-07,United Kingdom,True


Filtering can give you powerful optimization when you are working with partitioned files or tables. Daft will use the predicate to only read in the necessary partitions.

For example, our Parquet file is partitioned on the `country` column. This means that queries with a `country` predicate will benefit from query optimization:

In [17]:
df.where(df["country"] == "Canada").show()

first_name Utf8,last_name Utf8,age Int64,DoB Date,country Utf8,has_dog Boolean
Ernesto,Evergreen,34,1990-04-03,Canada,True
Jackie,Jale,62,1962-03-24,Canada,True


Daft only needs to read in 1 file for this query, instead of 3.
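
The idea behind partition pruning can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not Daft's actual reader: the helper names `partition_values` and `prune` are made up here, and the file paths are the ones returned by `write_parquet` earlier. The partition value is encoded in the directory name (`country=Canada`), so files can be ruled out from their paths alone, without opening them:

```python
from pathlib import PurePosixPath


def partition_values(path):
    """Extract key=value partition segments from a Hive-style path."""
    values = {}
    for part in PurePosixPath(path).parts[:-1]:  # skip the file name itself
        if "=" in part:
            key, _, value = part.partition("=")
            values[key] = value
    return values


def prune(paths, column, wanted):
    """Keep only the files whose partition value matches the predicate."""
    return [p for p in paths if partition_values(p).get(column) == wanted]


files = [
    "owners/country=Canada/5429d152-bc0b-4b5d-a8c0-4f678b6c2881-0.parquet",
    "owners/country=Germany/92e4b038-d29e-474c-a3fc-8c5e236f2982-0.parquet",
    "owners/country=United Kingdom/693165f7-ffe9-4cb0-9c8d-f40ad7e14e58-0.parquet",
]

to_read = prune(files, "country", "Canada")
print(len(to_read), "of", len(files), "files need to be read")
# 1 of 3 files need to be read
```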

## Query Planning

As mentioned earlier, Daft is lazy: computations on your DataFrame are not executed immediately. Instead, Daft creates a `LogicalPlan` which defines the operations that need to happen to materialize the requested result. Think of this LogicalPlan as a recipe. 

You can examine this logical plan using {meth}`df.explain() <daft.DataFrame.explain>`:

In [4]:
df.where(df["country"] == "Canada").explain(show_all=True)

== Unoptimized Logical Plan ==

* Filter: col(country) == lit("Canada")
|
* GlobScanOperator
|   Glob paths = [owners/*/*]
|   Coerce int96 timestamp unit = Nanoseconds
|   IO config = S3 config = { Max connections = 8, Retry initial backoff ms = 1000, Connect timeout ms = 30000, Read timeout ms = 30000, Max retries = 25, Retry mode = adaptive, Anonymous = false, Use SSL = true, Verify SSL = true, Check hostname SSL = true, Requester pays = false }, Azure config = { Anoynmous = false, Use SSL = true }, GCS config = { Anoynmous = false }
|   Use multithreading = true
|   File schema = first_name#Utf8, last_name#Utf8, age#Int64, DoB#Date, country#Utf8, has_dog#Boolean
|   Partitioning keys = []
|   Output schema = first_name#Utf8, last_name#Utf8, age#Int64, DoB#Date, country#Utf8, has_dog#Boolean


== Optimized Logical Plan ==

* GlobScanOperator
|   Glob paths = [owners/*/*]
|   Coerce int96 timestamp unit = Nanoseconds
|   IO config = S3 config = { Max connections = 8, Retry initial ba

Daft creates 3 types of plans:
1. an **unoptimized Logical Plan**, to sketch out the rough steps
2. an **optimized Logical Plan**, to maximise performance
3. a **Physical Plan**, which maps the logical plan to the physical files

Because we are filtering our DataFrame on the partition column `country`, Daft can optimize the Logical Plan and save us time and computing resources by only reading a single partition from disk.

Use {meth}`df.collect() <daft.DataFrame.collect>` to execute computations on **all** your data and get a little preview of the materialized results. The results are kept in memory so that subsequent operations will avoid recomputations.

## Expressions

See: [Expressions](user_guide/basic_concepts/expressions.rst)

Expressions are an API for defining computation that needs to happen over your columns.

For example, use the `daft.col()` expression together with the `with_column` method to create a new column `full_name`, joining the contents of the `last_name` column to the `first_name` column:

In [6]:
df_full = df.with_column("full_name", daft.col('first_name') + ' ' + daft.col('last_name'))
df_full.select("full_name", "age", "country", "has_dog").show()

full_name Utf8,age Int64,country Utf8,has_dog Boolean
Wolfgang Winter,23,Germany,False
Ernesto Evergreen,34,Canada,True
Jackie Jale,62,Canada,True
Sari Salama,57,United Kingdom,True
Zoya Zee,40,United Kingdom,True
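
One way to picture an expression is as a small description of a computation that is evaluated later against each row. The sketch below is a toy model in plain Python, not Daft's implementation: the `Expr` class and the `col` and `lit` helpers are hypothetical stand-ins that only mimic the shape of `daft.col(...) + ' ' + daft.col(...)`:

```python
class Expr:
    """Toy expression: a deferred computation over a row (hypothetical)."""

    def __init__(self, fn, label):
        self._fn = fn
        self.label = label  # human-readable description of the computation

    def __add__(self, other):
        # Plain Python values are wrapped as literals so they can be combined.
        other = other if isinstance(other, Expr) else lit(other)
        return Expr(
            lambda row: self._fn(row) + other._fn(row),
            f"({self.label} + {other.label})",
        )

    def evaluate(self, row):
        return self._fn(row)


def col(name):
    return Expr(lambda row: row[name], f"col({name})")


def lit(value):
    return Expr(lambda row: value, f"lit({value!r})")


full_name = col("first_name") + " " + col("last_name")
print(full_name.label)
# ((col(first_name) + lit(' ')) + col(last_name))
print(full_name.evaluate({"first_name": "Zoya", "last_name": "Zee"}))
# Zoya Zee
```

Building the expression does no work on the data; the computation only happens when the expression is evaluated against rows.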


Some Expression methods are only allowed on certain types and are accessible through "method accessors" such as the {meth}`.str <daft.expressions.Expression.str>` accessor (see: [Expression Accessor Properties](expression-accessor-properties)).

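For example, the `.dt.year()` expression used below is only valid on a temporal column such as a Date, in the same way that the {meth}`.str.length() <daft.expressions.expressions.ExpressionStringNamespace.length>` expression is only valid on a String column: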

In [7]:
df_full_year = df_full.with_column("DoB_year", df["DoB"].dt.year())
df_full_year.show()

first_name Utf8,last_name Utf8,age Int64,DoB Date,country Utf8,has_dog Boolean,full_name Utf8,DoB_year Int32
Ernesto,Evergreen,34,1990-04-03,Canada,True,Ernesto Evergreen,1990
Jackie,Jale,62,1962-03-24,Canada,True,Jackie Jale,1962
Sari,Salama,57,1967-01-02,United Kingdom,True,Sari Salama,1967
Zoya,Zee,40,1984-04-07,United Kingdom,True,Zoya Zee,1984
Wolfgang,Winter,23,2001-02-12,Germany,False,Wolfgang Winter,2001


### Merging DataFrames

DataFrames can be joined with {meth}`df.join() <daft.DataFrame.join>`.

In [None]:
# join df_full to df_dogs > df_family
# puppy time as really the practical application that brings it all together


In [None]:
# read in Parquet file with image URLs (Flickr?)


In [None]:
# missing data

In [8]:
# grouping and aggregations

## Puppy time!

You've made it half-way! Time to bring in some fluffy beings 🐶

Let's bring all of the elements you've learned together to see how you can use Daft to:
- work with **multimodal data** like Python classes, URLs, and Images,
- apply **custom User-Defined Functions** to your columns,
- and **run ML workloads** within your DataFrame.

In [None]:
# run ML classifier on dog images?
# or point to separate tutorial where we do that?

In [3]:
import datetime

df = daft.from_pydict({
    "integers": [1, 2, 3, 4],
    "floats": [1.5, 2.5, 3.5, 4.5],
    "bools": [True, True, False, False],
    "strings": ["a", "b", "c", "d"],
    "bytes": [b"a", b"b", b"c", b"d"],
    "dates": [datetime.date(1994, 1, 1), datetime.date(1994, 1, 2), datetime.date(1994, 1, 3), datetime.date(1994, 1, 4)],
    "lists": [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]],
    "nulls": [None, None, None, None],
})
df

integers Int64,floats Float64,bools Boolean,strings Utf8,bytes Binary,dates Date,lists List[Int64],nulls Null
1,1.5,True,a,"b""a""",1994-01-01,"[1, 1, 1]",
2,2.5,True,b,"b""b""",1994-01-02,"[2, 2, 2]",
3,3.5,False,c,"b""c""",1994-01-03,"[3, 3, 3]",
4,4.5,False,d,"b""d""",1994-01-04,"[4, 4, 4]",


Let's select every column from our DataFrame except the all-null `nulls` column:

In [26]:
df = df.select("integers", "floats", "bools", "strings", "bytes", "dates", "lists")
df

integers Int64,floats Float64,bools Boolean,strings Utf8,bytes Binary,dates Date,lists List[Int64]


Another example of a useful method accessor is the {meth}`.url <daft.expressions.Expression.url>` accessor. You can use {meth}`.url.download() <daft.expressions.expressions.ExpressionUrlNamespace.download>` to download data from a column of URLs like so:

In [14]:
image_url_df = daft.from_pydict({
    "urls": [
        "http://farm9.staticflickr.com/8186/8119368305_4e622c8349_z.jpg",
        "http://farm1.staticflickr.com/1/127244861_ab0c0381e7_z.jpg",
        "http://farm3.staticflickr.com/2169/2118578392_1193aa04a0_z.jpg",
    ],
})
image_downloaded_df = image_url_df.with_column("image_bytes", image_url_df["urls"].url.download())
image_downloaded_df.collect()

urls Utf8,image_bytes Binary
http://farm9.staticflickr.com/8186/8119368305_4e622c8349_...,b'\xff\xd8\xff\xe1\x00TExif\x00\x00MM\x00*\x00\x00\x00\x0...
http://farm1.staticflickr.com/1/127244861_ab0c0381e7_z.jpg,b'\xff\xd8\xff\xe1\x00(Exif\x00\x00MM\x00*\x00\x00\x00\x0...
http://farm3.staticflickr.com/2169/2118578392_1193aa04a0_...,b'\xff\xd8\xff\xe1\x00\x16Exif\x00\x00MM\x00*\x00\x00\x00...


For a full list of all Expression methods and operators, see: [Expressions API Docs](api_docs/expressions.rst)

## Python object columns

Daft DataFrames can also contain Python objects. Here is an example of how to create a DataFrame with Python objects.

In [15]:
# Let's define a toy example of a Python class!
class Dog:
    def __init__(self, name):
        self.name = name
        
    def bark(self):
        return f"{self.name}!"

py_df = daft.from_pydict({
    "dogs": [Dog("ruffles"), Dog("waffles"), Dog("doofus")],
    "owner": ["russell", "william", "david"],
})

Now, when we print our dataframe we can see that it contains our `Dog` Python objects! Also note that the type of the column is {meth}`Python <daft.DataType.python>`.

In [16]:
py_df.collect()

dogs Python,owner Utf8
<__main__.Dog object at 0x11ef78ac0>,russell
<__main__.Dog object at 0x11ef78430>,william
<__main__.Dog object at 0x11ef78040>,david


To work with {meth}`Python <daft.DataType.python>` type columns, Daft provides a few useful Expression methods.

{meth}`.apply() <daft.expressions.Expression.apply>` is useful to work on each Dog individually and apply a function.

Here's an example where we extract a string from each `Dog` by calling `.bark()` on each `Dog` object, returning a new `Utf8` column.

In [17]:
py_df.with_column(
    "dogs_bark_name",
    py_df["dogs"].apply(lambda dog: dog.bark(), return_dtype=DataType.string()),
).collect()

dogs Python,owner Utf8,dogs_bark_name Utf8
<__main__.Dog object at 0x11ef78ac0>,russell,ruffles!
<__main__.Dog object at 0x11ef78430>,william,waffles!
<__main__.Dog object at 0x11ef78040>,david,doofus!


### User-Defined Functions

{meth}`.apply() <daft.expressions.Expression.apply>` makes it really easy to map a function on a single column, but is limited in 2 main ways:

1. Only runs on a single column: some algorithms require multiple columns as inputs
2. Only runs on a single row: some algorithms run much more efficiently when run on a batch of rows instead

To overcome these limitations, you can use User-Defined Functions (UDFs).

See Also: [UDF User Guide](user_guide/daft_in_depth/udf)

In [18]:
from daft import udf

@udf(return_dtype=DataType.string())
def custom_bark(dog_series, owner_series):
    return [
        f"{dog.name} loves {owner_name}!"
        for dog, owner_name
        in zip(dog_series.to_pylist(), owner_series.to_pylist())
    ]

py_df.with_column("custom_bark", custom_bark(py_df["dogs"], py_df["owner"])).collect()

dogs Python,owner Utf8,custom_bark Utf8
<__main__.Dog object at 0x11ef78ac0>,russell,ruffles loves russell!
<__main__.Dog object at 0x11ef78430>,william,waffles loves william!
<__main__.Dog object at 0x11ef78040>,david,doofus loves david!


## Missing Data

All columns in Daft are "nullable" by default. Unlike other frameworks such as pandas, Daft differentiates between "null" (a missing value) and "nan" (short for "not a number", a special floating-point value indicating an invalid result).

In [20]:
missing_data_df = daft.from_pydict({
    "floats": [1.5, None, float("nan")],
})
missing_data_df = missing_data_df \
    .with_column("floats_is_null", missing_data_df["floats"].is_null()) \
    .with_column("floats_is_nan", missing_data_df["floats"].float.is_nan())

missing_data_df.collect()

floats Float64,floats_is_null Boolean,floats_is_nan Boolean
1.5,False,false
,True,none
,False,true


To fill in missing values, a useful Expression is the {meth}`.if_else <daft.expressions.Expression.if_else>` expression which can be used to fill in values if the value is null:

In [21]:
missing_data_df = missing_data_df.with_column("filled_in_floats", (missing_data_df["floats"].is_null()).if_else(0.0, missing_data_df["floats"]))
missing_data_df.collect()

floats Float64,floats_is_null Boolean,floats_is_nan Boolean,filled_in_floats Float64
1.5,False,false,1.5
,True,none,0.0
,False,true,
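
The null versus NaN distinction mirrors the difference between `None` and `float("nan")` in plain Python. The following sketch uses only the standard library (no Daft calls) and reproduces the three columns computed above for the same input values; the variable names are chosen for this illustration only:

```python
import math

floats = [1.5, None, float("nan")]

# A null is a missing value; NaN is a real float that is "not a number".
is_null = [v is None for v in floats]

# NaN-ness is undefined for a missing value, so the null stays null.
is_nan = [None if v is None else math.isnan(v) for v in floats]

# Filling nulls leaves NaN untouched, because NaN is not missing.
filled = [0.0 if v is None else v for v in floats]

print(is_null)  # [False, True, False]
print(is_nan)   # [False, None, True]
print(filled)   # [1.5, 0.0, nan]
```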


## Merging DataFrames

DataFrames can be joined with {meth}`df.join() <daft.DataFrame.join>`. Here is a naive example of a self-join where we join `df` on itself with the `integers` column as the join key.

In [22]:
joined_df = df.join(df, on="integers")

In [23]:
joined_df.collect()

integers Int64,floats Float64,bools Boolean,strings Utf8,bytes Binary,dates Date,lists List[Int64],right.floats Float64,right.bools Boolean,right.strings Utf8,right.bytes Binary,right.dates Date,right.lists List[Int64]
1,1.5,True,a,b'a',1994-01-01,"[1, 1, 1]",1.5,True,a,b'a',1994-01-01,"[1, 1, 1]"
2,2.5,True,b,b'b',1994-01-02,"[2, 2, 2]",2.5,True,b,b'b',1994-01-02,"[2, 2, 2]"
3,3.5,False,c,b'c',1994-01-03,"[3, 3, 3]",3.5,False,c,b'c',1994-01-03,"[3, 3, 3]"
4,4.5,False,d,b'd',1994-01-04,"[4, 4, 4]",4.5,False,d,b'd',1994-01-04,"[4, 4, 4]"
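
Conceptually, an inner join pairs up rows from both sides that share the same key. The sketch below shows one common way to do this, a hash join, in plain Python. It is an illustration only and makes no claim about which join strategy Daft picks for a given query; `hash_join` is a hypothetical helper, and the `right.` prefix simply imitates the column names in the output above:

```python
def hash_join(left, right, on):
    """Inner-join two lists of row dicts on a shared key column."""
    # Build phase: index the right-hand rows by their join key.
    index = {}
    for row in right:
        index.setdefault(row[on], []).append(row)

    # Probe phase: look up each left-hand row's key in the index.
    joined = []
    for lrow in left:
        for rrow in index.get(lrow[on], []):
            merged = dict(lrow)
            for name, value in rrow.items():
                if name == on:
                    continue  # the key column appears only once
                # Prefix clashing column names, as in the output above.
                merged[f"right.{name}" if name in lrow else name] = value
            joined.append(merged)
    return joined


rows = [
    {"integers": 1, "strings": "a"},
    {"integers": 2, "strings": "b"},
]

joined = hash_join(rows, rows, on="integers")
for row in joined:
    print(row)
# {'integers': 1, 'strings': 'a', 'right.strings': 'a'}
# {'integers': 2, 'strings': 'b', 'right.strings': 'b'}
```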


## Grouping and Aggregations

Groupby aggregation operations over a dataset happen in 2 phases:

1. Splitting the data into groups based on some criteria using {meth}`df.groupby() <daft.DataFrame.groupby>`
2. Specifying how to aggregate the data for each group using {meth}`GroupedDataFrame.agg() <daft.dataframe.dataframe.GroupedDataFrame.agg>`

Let's take a look at an example:

In [24]:
grouping_df = daft.from_pydict(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["a", "a", "b", "c", "b", "b", "a", "c"],
        "C": [i for i in range(8)],
        "D": [i for i in range(8)],
    }
)
grouping_df.collect()

A Utf8,B Utf8,C Int64,D Int64
foo,a,0,0
bar,a,1,1
foo,b,2,2
bar,c,3,3
foo,b,4,4
bar,b,5,5
foo,a,6,6
foo,c,7,7


First we group by "A", so that we will evaluate rows with `A=foo` and `A=bar` separately in their respective groups.

In [25]:
grouped_df = grouping_df.groupby(grouping_df["A"])
grouped_df

GroupedDataFrame(df=+--------+--------+---------+---------+
| A      | B      |       C |       D |
| Utf8   | Utf8   |   Int64 |   Int64 |
| foo    | a      |       0 |       0 |
+--------+--------+---------+---------+
| bar    | a      |       1 |       1 |
+--------+--------+---------+---------+
| foo    | b      |       2 |       2 |
+--------+--------+---------+---------+
| bar    | c      |       3 |       3 |
+--------+--------+---------+---------+
| foo    | b      |       4 |       4 |
+--------+--------+---------+---------+
| bar    | b      |       5 |       5 |
+--------+--------+---------+---------+
| foo    | a      |       6 |       6 |
+--------+--------+---------+---------+
| foo    | c      |       7 |       7 |
+--------+--------+---------+---------+
(Showing first 8 of 8 rows), group_by=<daft.expressions.expressions.ExpressionsProjection object at 0x11f58ab90>)

Now we can specify the aggregations we want to compute over columns C and D. Here we compute the sum over column C, and the mean over column D for each group:

In [26]:
aggregated_df = grouped_df.agg([
    (grouped_df["C"].alias("C_sum"), "sum"),
    (grouped_df["D"].alias("D_mean"), "mean"),
])
aggregated_df.collect()

A Utf8,C_sum Int64,D_mean Float64
bar,9,3.0
foo,19,3.8
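
The two phases can also be written out by hand. The following plain-Python sketch (standard library only, with variable names chosen for this illustration) splits the same rows by column A and then aggregates each group, which makes it easy to check the numbers above:

```python
from collections import defaultdict

A = ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"]
C = list(range(8))
D = list(range(8))

# Phase 1: split the rows into groups keyed by the value of column A.
groups = defaultdict(list)
for a, c, d in zip(A, C, D):
    groups[a].append((c, d))

# Phase 2: aggregate each group (sum of C, mean of D).
aggregated = {
    a: (sum(c for c, _ in rows), sum(d for _, d in rows) / len(rows))
    for a, rows in groups.items()
}

print(aggregated)  # {'foo': (19, 3.8), 'bar': (9, 3.0)}
```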


These operations also work over multiple groupby columns, producing one row for each combination of values that occurs in the DataFrame:

In [27]:
grouping_df \
    .groupby(grouping_df["A"], grouping_df["B"]) \
    .agg([
        (grouping_df["C"].alias("C_sum"), "sum"),
        (grouping_df["D"].alias("D_mean"), "mean"),
    ]) \
    .collect()

A Utf8,B Utf8,C_sum Int64,D_mean Float64
bar,a,1,1
foo,b,6,3
foo,a,6,3
bar,b,5,5
foo,c,7,7
bar,c,3,3


## Writing Data

See: [Writing Data](df-writing-data)

Writing data will execute your DataFrame and write the results out to the specified backend. For example, to write data out to Parquet with {meth}`df.write_parquet() <daft.DataFrame.write_parquet>`:


In [5]:
written_df = df.write_parquet("my-dataframe.parquet")

                                                                  

Note that writing your dataframe is a **blocking** operation that executes your DataFrame. It will return a new `DataFrame` that contains the filepaths to the written data:

In [6]:
written_df

path Utf8
my-dataframe.parquet/d796131c-0c31-4688-a5ee-48ca500498e3-0.parquet
