# 10 minutes to Daft

This is a short introduction to all the main functionality in Daft, geared towards new users.

We import from daft as follows:

In [None]:
from daft import DataFrame, col

## DataFrame creation

See also: [DataFrames User Guide: Loading Data](dataframe-loading-data)

We can create a DataFrame from a dictionary of columns - this is a dictionary where the keys are strings representing the columns' names and the values are equal-length lists representing the columns' values.

In [None]:
import datetime

df = DataFrame.from_pydict({
    "A": [1, 2, 3, 4],
    "B": [1.5, 2.5, 3.5, 4.5],
    "C": [True, True, False, False],
    "D": ["a", "b", "c", "d"],
    "E": [b"a", b"b", b"c", b"d"],
    "F": [datetime.date(1994, 1, 1), datetime.date(1994, 1, 2), datetime.date(1994, 1, 3), datetime.date(1994, 1, 4)],
    "G": [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]],
})

Check the schema of your dataframe by inspecting the `df` variable:

In [None]:
df

You can also load DataFrames from other sources, such as:

1. CSV files: `DataFrame.from_csv("s3://bucket/*.csv")`
2. Parquet files: `DataFrame.from_parquet("/path/*.parquet")`
3. JSON line-delimited files: `DataFrame.from_json("/path/*.json")`
4. Files on disk: `DataFrame.from_files("/path/*.jpeg")`

Daft automatically supports local paths as well as paths to object storage such as AWS S3.

## Viewing Data

Use `DataFrame.show(N)` to view N rows of the DataFrame.

In [None]:
df.show(2)

You can sort a dataframe with `DataFrame.sort`. Here we sort on column A in descending order and then show the first 2 rows:

In [None]:
df.sort(df["A"], desc=True).show(2)

To retrieve all the column names:

In [None]:
df.column_names

You can also convert a DataFrame to a Pandas DataFrame:

In [None]:
df.to_pandas()

## Data Selection

You can limit the number of rows in a dataframe by calling `DataFrame.limit`.

In [None]:
df_limited = df.limit(1)
df_limited.show()

To select just a few columns, you can use `DataFrame.select`:

In [None]:
df_selected = df.select(df["A"], df["B"])
df_selected.show()

Column selection also allows you to rename columns using `Expression.alias`:

In [None]:
df_renamed = df.select(df["A"].alias("A2"), df["B"])
df_renamed.show()

To drop columns from the dataframe, call `DataFrame.exclude`:

In [None]:
df_excluded = df.exclude("A")
df_excluded.show()

## Operations

See: [Expressions](user_guides/expressions.rst)

Expressions are an API for defining computation that needs to happen over your columns. Daft is **lazy**, which means that it does not execute computation when you define it with expressions, but only when you [execute the DataFrame](user-guide-execution) with something such as a `DataFrame.show()`.

For example, to create a new column that is just the column A incremented by 1:

In [None]:
df_A_plus1 = df.with_column("A_plus1", df["A"] + 1)  # does not run any computation
df_A_plus1.show()  # runs all the computations defined on the dataframe, and displays it
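
To make the idea of laziness concrete, here is a minimal plain-Python sketch. It is not how Daft is implemented; `LazyColumn`, `map` and `collect` are hypothetical names used only for illustration. Steps are recorded when they are defined and only run when the result is requested:

```python
class LazyColumn:
    """Toy stand-in for a lazily evaluated column (hypothetical, not a Daft class)."""

    def __init__(self, values):
        self._values = list(values)
        self._steps = []  # functions recorded but not yet applied

    def map(self, fn):
        # Record the step only; no computation happens here.
        self._steps.append(fn)
        return self

    def collect(self):
        # Execution point: apply every recorded step in order.
        out = self._values
        for fn in self._steps:
            out = [fn(v) for v in out]
        return out


calls = []

def add_one(v):
    calls.append(v)  # track when the function actually runs
    return v + 1

lazy_a = LazyColumn([1, 2, 3, 4]).map(add_one)
calls_before_collect = len(calls)  # 0: defining the step ran nothing
a_plus1 = lazy_a.collect()         # the recorded step runs here
```

In this sketch `collect()` plays the role that `DataFrame.show()` plays above: it is the point at which the deferred work is actually performed.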

### Method Accessors

Some Expression methods are only allowed on certain types and are accessible through "method accessors" such as the `Expression.str` accessor (see: [Expression Accessor Properties](expression-accessor-properties)).

For example, the `.str.length()` expression is only valid when run on a STRING column:

In [None]:
df_D_length = df.with_column("D_length", df["D"].str.length())
df_D_length.show()

Another example of a useful method accessor is the `.url` accessor. You can use `.url.download()` to download data from a column of URLs like so:

In [None]:
image_url_df = DataFrame.from_pydict({
    "urls": [
        "http://farm9.staticflickr.com/8186/8119368305_4e622c8349_z.jpg",
        "http://farm1.staticflickr.com/1/127244861_ab0c0381e7_z.jpg",
        "http://farm3.staticflickr.com/2169/2118578392_1193aa04a0_z.jpg",
    ],
})
image_downloaded_df = image_url_df.with_column("image_bytes", image_url_df["urls"].url.download())
image_downloaded_df.show()

For a full list of all Expression methods and operators, see: [Expressions API Docs](../api_docs/expressions.rst)

## Operations on PY columns

PY columns contain Python objects, and operations called on these columns are applied to each object individually.

To work with such columns, Daft provides a few useful Expression methods.

For example, to repeat each list in column `G` 3 times, we can use the `*` operator, which is applied to each Python `list` natively:

In [None]:
df_G_extend_0 = df.with_column("G_repeat", df["G"] * 3)
df_G_extend_0.show()

To call a method on each list in column `G`, we can use the `.as_py` method. For example, here we use the Python `list`'s `.count()` method to count the number of occurrences in each list of the integer from column A:

In [None]:
df_G_count_A = df.with_column("G_count_A", df["G"].as_py(list).count(df["A"]))
df_G_count_A.show()

For more complicated functions, you can use `.apply(f)` to call a function `f` on every object in the column. For example, here we construct a Numpy array for every list in column G.

```{note}
It is good practice to supply the `return_type=` keyword argument in `.apply`, as this lets Daft effectively optimize your data and operations under the hood. For example, in this case we specify `return_type=np.ndarray` which tells Daft that each row in this column contains a Numpy array object.
```

In [None]:
import numpy as np

df_G_to_numpy = df.with_column("G_to_numpy", df["G"].apply(lambda l: np.array(l), return_type=np.ndarray))
df_G_to_numpy.show()

Iterable types such as a PY[list] column can be exploded with `DataFrame.explode`, splitting each list into a row of its own and repeating the other columns:

In [None]:
df_G_exploded = df.explode(df["G"])
df_G_exploded.show()
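
The row-level effect of exploding can be sketched in plain Python. This is only an illustration of the semantics described above, not Daft's implementation, and the `explode` helper here is a hypothetical function operating on a list of dictionaries:

```python
def explode(rows, column):
    """Return one output row per element of row[column], copying the other fields."""
    exploded = []
    for row in rows:
        for item in row[column]:
            new_row = dict(row)     # repeat the other columns
            new_row[column] = item  # replace the list with a single element
            exploded.append(new_row)
    return exploded


rows = [
    {"A": 1, "G": [1, 1, 1]},
    {"A": 2, "G": [2, 2, 2]},
]
exploded_rows = explode(rows, "G")  # 2 input rows become 6 output rows
```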

### User-Defined Functions

`.apply` makes it really easy to map a function on a single column, but is limited in 2 main ways:

1. Only runs on a single column: some algorithms require multiple columns as inputs
2. Only runs on a single row: some algorithms run much more efficiently when run on a batch of rows instead

To overcome these limitations, you can use User-Defined Functions (UDFs).

See Also: [UDF User Guide](https://getdaft.io/learn/user_guides/udf.html)

In [None]:
import datetime
from daft import polars_udf
import polars as pl

@polars_udf(return_type=datetime.date)
def add_days(f_date_data: pl.Series, a_days_data: pl.Series):
    return f_date_data + pl.duration(days=a_days_data)

df.with_column("F_add_A_days", add_days(df["F"], df["A"])).show()

The simple UDF demonstrated above is a "stateless UDF": no state is maintained between invocations of the function. In certain use-cases, it can be important to maintain some state with a "stateful UDF", which you can write using a class instead of a function. For example, running machine learning models often requires downloading some trained weights and initializing the model in memory or on a GPU, which is an expensive operation and should be cached between UDF invocations.

In [None]:
@polars_udf(return_type=float)
class RunExpensiveModel:

    def __init__(self):
        # Initialize and cache an "expensive" model between invocations of the UDF
        self.model = np.array([1.23, 4.56])
        
    def __call__(self, a_data: pl.Series, b_data: pl.Series):
        return np.matmul(self.model, np.array([a_data.to_numpy(), b_data.to_numpy()]))

df.with_column("expensive_model_results", RunExpensiveModel(df["A"], df["B"])).show()
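
The pattern behind a stateful UDF can be sketched without Daft or Polars. The class below is a hypothetical plain-Python stand-in (`ScaleByWeights` is not part of any library): the expensive setup lives in `__init__` and runs once, while `__call__` is invoked once per batch of rows and reuses the cached state.

```python
class ScaleByWeights:
    """Hypothetical stateful callable: setup runs once, calls run per batch."""

    init_count = 0  # counts how often the expensive setup ran

    def __init__(self):
        # Stand-in for an expensive step such as loading model weights.
        ScaleByWeights.init_count += 1
        self.weights = (1.23, 4.56)

    def __call__(self, a_batch, b_batch):
        # Weighted sum of the two input columns, one result per row.
        w_a, w_b = self.weights
        return [w_a * a + w_b * b for a, b in zip(a_batch, b_batch)]


model = ScaleByWeights()            # setup happens here, once
first = model([1, 2], [1.5, 2.5])   # first batch reuses self.weights
second = model([3, 4], [3.5, 4.5])  # second batch reuses it again
```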

## Filtering Data

You can filter rows in a dataframe using `DataFrame.where`, which accepts a LOGICAL type Expression as an argument:

In [None]:
# Keep only rows where values in column "A" are less than 3
df_filtered = df.where(df["A"] < 3)
df_filtered.show()

## Missing Data

All columns in Daft are "nullable" by default. Unlike other frameworks such as Pandas, Daft differentiates between "null" (missing) and "nan" (stands for not a number - a special value indicating an invalid float).

In [None]:
missing_data_df = DataFrame.from_pydict({
    "floats": [1.5, None, float("nan")],
})
missing_data_df = missing_data_df \
    .with_column("floats_is_null", missing_data_df["floats"].is_null()) \
    .with_column("floats_is_nan", missing_data_df["floats"].is_nan())

# NOTE: there is an open issue with display of null vs NaNs, see: https://github.com/Eventual-Inc/Daft/issues/241
missing_data_df.show()

To fill in missing values, use the `.if_else` expression, which selects a replacement value wherever the original value is null:

In [None]:
missing_data_df = missing_data_df.with_column("filled_in_floats", (missing_data_df["floats"].is_null()).if_else(0.0, missing_data_df["floats"]))
missing_data_df.show()
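
The distinction between null and NaN can also be seen in plain Python, where `None` plays the role of a missing value and `float("nan")` is an invalid float. The sketch below mirrors the three values used above. As a simplification, it treats the missing value as "not NaN"; this is an assumption of the sketch and not a statement about what Daft returns for that row.

```python
import math

floats = [1.5, None, float("nan")]

# Null check: only the missing value is flagged.
is_null = [v is None for v in floats]

# NaN check: only the invalid float is flagged (None treated as not-NaN here).
is_nan = [v is not None and math.isnan(v) for v in floats]

# Filling nulls replaces None but leaves NaN untouched.
filled = [0.0 if v is None else v for v in floats]
```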

## Merging Dataframes

DataFrames can be joined with `.join`. Here is a naive example of a self-join where we join `df` with itself using column "A" as the join key. Notice that Daft automatically assigns the prefix `"right."` to all columns from the right side whose names conflict:

In [None]:
joined_df = df.join(df, on="A")
joined_df

In [None]:
joined_df.show()
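
As a rough illustration of the naming behaviour described above, here is a plain-Python sketch of an inner join over lists of dictionaries. The `inner_join` helper and its `right_prefix` parameter are hypothetical and only mimic the `"right."` prefixing; they are not Daft's implementation.

```python
def inner_join(left, right, on, right_prefix="right."):
    """Inner-join two lists of row dicts on a shared key column."""
    # Index the right side by join key for fast lookup.
    index = {}
    for r in right:
        index.setdefault(r[on], []).append(r)

    joined = []
    for l in left:
        for r in index.get(l[on], []):
            row = dict(l)
            for name, value in r.items():
                if name == on:
                    continue  # the join key appears only once
                # Prefix right-side columns whose names clash with the left side.
                key = right_prefix + name if name in l else name
                row[key] = value
            joined.append(row)
    return joined


rows = [{"A": 1, "D": "a"}, {"A": 2, "D": "b"}]
joined_rows = inner_join(rows, rows, on="A")
```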

## Grouping

Grouping operations over a dataset happen in 2 phases:

1. Splitting the data into groups based on some criteria using `DataFrame.groupby`
2. Specifying how to aggregate the data for each group using `GroupedDataFrame.agg`

Let's take a look at an example:

In [None]:
grouping_df = DataFrame.from_pydict(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["a", "a", "b", "c", "b", "b", "a", "c"],
        "C": [i for i in range(8)],
        "D": [i for i in range(8)],
    }
)
grouping_df.show()

First we group by "A", so that we will evaluate rows with `A=foo` and `A=bar` separately in their respective groups.

In [None]:
grouped_df = grouping_df.groupby(grouping_df["A"])
grouped_df

Now we can specify the aggregations we want to compute over columns C and D. Here we compute the sum over column C, and the mean over column D for each group:

In [None]:
aggregated_df = grouped_df.agg([
    (col("C").alias("C_sum"), "sum"),
    (col("D").alias("D_mean"), "mean"),
])
aggregated_df.show()

These operations also work over multiple groupby columns, producing one row for each combination of values that occurs in the DataFrame:

In [None]:
grouping_df \
    .groupby(grouping_df["A"], grouping_df["B"]) \
    .agg([
        (col("C").alias("C_sum"), "sum"),
        (col("D").alias("D_mean"), "mean"),
    ]) \
    .show()
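
The two phases (split into groups, then aggregate each group) can be sketched in plain Python using the same `A`, `C` and `D` values as `grouping_df`. This only illustrates the semantics of grouping by a single column and is not how Daft executes the query:

```python
from collections import defaultdict

a = ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"]
c = list(range(8))
d = list(range(8))

# Phase 1: split the rows into groups keyed by column A.
groups = defaultdict(lambda: {"C": [], "D": []})
for key, c_val, d_val in zip(a, c, d):
    groups[key]["C"].append(c_val)
    groups[key]["D"].append(d_val)

# Phase 2: aggregate each group (sum of C, mean of D).
aggregated = {
    key: {"C_sum": sum(cols["C"]), "D_mean": sum(cols["D"]) / len(cols["D"])}
    for key, cols in groups.items()
}
# aggregated["foo"] -> {"C_sum": 19, "D_mean": 3.8}
# aggregated["bar"] -> {"C_sum": 9, "D_mean": 3.0}
```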

## Writing Data

See: [Writing Data](dataframe-writing-data)

Writing data will execute your DataFrame and write the results out to the specified backend. For example, to write data out to CSV:


In [None]:
# NOTE: Daft does not write PY columns at the moment.
# This is a feature that is on the roadmap as various options for implementation are being designed.
write_df = df.exclude("G")

written_df = write_df.write_csv("my-dataframe.csv")

Note that writing your dataframe is a **blocking** operation that executes your DataFrame. It will return a new `DataFrame` that contains the filepaths to the written data:

In [None]:
written_df.show()