# Quickstart

In this quickstart, you will learn the basics of Daft's DataFrame and SQL API and the features that set it apart from frameworks like Pandas, PySpark, Dask, and Ray.

## Install Daft

You can install Daft using `pip`. Run the following command in your terminal or notebook:

In [None]:
pip install getdaft

## Create Your First Daft DataFrame

Let's create a DataFrame from a dictionary of columns:

In [18]:
import daft

df = daft.from_pydict(
    {
        "A": [1, 2, 3, 4],
        "B": [1.5, 2.5, 3.5, 4.5],
        "C": [True, True, False, False],
        "D": [None, None, None, None],
    }
)

df

A Int64,B Float64,C Boolean,D Null
1,1.5,True,
2,2.5,True,
3,3.5,False,
4,4.5,False,


You just created your first DataFrame!

## Read From a Data Source

Daft supports both local paths and paths to object storage such as AWS S3:

- CSV files: [`daft.read_csv("s3://path/to/bucket/*.csv")`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/io_functions/daft.read_csv.html#daft.read_csv)
- Parquet files: [`daft.read_parquet("/path/*.parquet")`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/io_functions/daft.read_parquet.html#daft.read_parquet)
- JSON line-delimited files: [`daft.read_json("/path/*.json")`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/io_functions/daft.read_json.html#daft.read_json)
- Files on disk: [`daft.from_glob_path("/path/*.jpeg")`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/io_functions/daft.from_glob_path.html#daft.from_glob_path)

<div class="admonition tip">
    <p class="admonition-title">Note</p>
    <p>
        See <a href="https://www.getdaft.io/projects/docs/en/stable/user_guide/integrations.html">Integrations</a> to learn more about working with other formats like Delta Lake and Iceberg.
    </p>
</div>

Let’s read in a Parquet file from a public S3 bucket. Note that this Parquet file is partitioned on the column `country`. This will be important later on.

In [19]:
# Set IO Configurations to use anonymous data access mode
daft.set_planning_config(default_io_config=daft.io.IOConfig(s3=daft.io.S3Config(anonymous=True)))

df = daft.read_parquet("s3://daft-public-data/tutorials/10-min/sample-data-dog-owners-partitioned.pq/**")
df

first_name Utf8,last_name Utf8,age Int64,DoB Date,country Utf8,has_dog Boolean
(No data to display: Dataframe not materialized)

Why does it say `(No data to display: Dataframe not materialized)` and where are the rows?

## Execute Your DataFrame and View Data

Daft DataFrames are **lazy** by default. This means that the contents will not be computed (“materialized”) unless you explicitly tell Daft to do so. This is best practice for working with larger-than-memory datasets and parallel/distributed architectures.
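
To make the idea of laziness concrete, here is a minimal plain-Python sketch. The `LazyFrame` class is hypothetical and is not how Daft is implemented; it only mimics the pattern of recording operations first and running them when `collect()` is called.

```python
# Toy illustration of lazy evaluation. LazyFrame is a hypothetical class,
# not part of Daft; it only mimics "plan first, compute on demand".
class LazyFrame:
    def __init__(self, rows, steps=()):
        self._rows = rows    # source data (a list of dicts)
        self._steps = steps  # recorded, not-yet-executed operations

    def where(self, predicate):
        # Record the filter; no rows are touched here.
        return LazyFrame(self._rows, self._steps + (("where", predicate),))

    def limit(self, n):
        # Record the limit; still no computation.
        return LazyFrame(self._rows, self._steps + (("limit", n),))

    def collect(self):
        # Only now is the recorded plan executed against the data.
        rows = list(self._rows)
        for kind, arg in self._steps:
            if kind == "where":
                rows = [r for r in rows if arg(r)]
            elif kind == "limit":
                rows = rows[:arg]
        return rows


people = [
    {"first_name": "Wolfgang", "age": 23},
    {"first_name": "Shandra", "age": 57},
    {"first_name": "Zaya", "age": 40},
]

calls = []  # records every row the predicate is actually evaluated on


def is_40_or_older(row):
    calls.append(row["first_name"])
    return row["age"] >= 40


plan = LazyFrame(people).where(is_40_or_older).limit(1)
evaluated_before_collect = len(calls)  # building the plan ran nothing
result = plan.collect()                # execution happens here
```

In this sketch the predicate is never called while the plan is being built; it runs only inside `collect()`. A real engine can additionally inspect the recorded plan and reorder or prune work before executing it.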

The file we have just loaded only has 5 rows. You can materialize the whole DataFrame in memory easily using the [`df.collect()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.collect.html#daft.DataFrame.collect) method:

In [20]:
df.collect()

first_name Utf8,last_name Utf8,age Int64,DoB Date,country Utf8,has_dog Boolean
Wolfgang,Winter,23,2001-02-12,Germany,
Shandra,Shamas,57,1967-01-02,United Kingdom,True
Zaya,Zaphora,40,1984-04-07,United Kingdom,True
Ernesto,Evergreen,34,1990-04-03,Canada,True
James,Jale,62,1962-03-24,Canada,True


To view just the first few rows, you can use the [`df.show()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.show.html#daft.DataFrame.show) method:

In [21]:
df.show(3)

first_name Utf8,last_name Utf8,age Int64,DoB Date,country Utf8,has_dog Boolean
Wolfgang,Winter,23,2001-02-12,Germany,
Shandra,Shamas,57,1967-01-02,United Kingdom,True
Zaya,Zaphora,40,1984-04-07,United Kingdom,True


Now let's take a look at some common DataFrame operations.

## Selecting Columns

You can **select** specific columns from your DataFrame with the [`df.select()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.select.html#daft.DataFrame.select) method.

In [22]:
df.select("first_name", "has_dog").show()

first_name Utf8,has_dog Boolean
Wolfgang,
Shandra,True
Zaya,True
Ernesto,True
James,True


## Selecting Rows

You can **filter** rows using the [`df.where()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.where.html#daft.DataFrame.where) method, which takes a boolean Expression as its predicate. In this case, we call the [`daft.col()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/expression_methods/daft.col.html#daft.col) function to refer to the column named `age`:

In [23]:
df.where(daft.col("age") >= 40).show()

first_name Utf8,last_name Utf8,age Int64,DoB Date,country Utf8,has_dog Boolean
Shandra,Shamas,57,1967-01-02,United Kingdom,True
Zaya,Zaphora,40,1984-04-07,United Kingdom,True
James,Jale,62,1962-03-24,Canada,True


Filtering is especially efficient when you are working with partitioned files or tables. Daft uses the predicate to read only the necessary partitions and skips any data that is not relevant.

<div class="admonition tip">
    <p class="admonition-title">Note</p>
    <p>
        As mentioned earlier, our Parquet file is partitioned on the <code>country</code> column, so queries with a <code>country</code> predicate benefit from this optimization.
    </p>
</div>
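
The following plain-Python sketch illustrates why a predicate on the partition column allows data to be skipped. It is a toy model under stated assumptions, not Daft's reader: the file paths are hypothetical, and the actual layout of the sample dataset is not shown in this quickstart.

```python
# Toy model of partition pruning; not Daft's reader. The file paths below
# are hypothetical and only mimic a dataset partitioned on "country".
partitions = {
    "Germany": "country=Germany/part-0.parquet",
    "United Kingdom": "country=United Kingdom/part-0.parquet",
    "Canada": "country=Canada/part-0.parquet",
}


def files_to_read(partitions, country_predicate):
    """Return only the files whose partition value satisfies the predicate."""
    return [path for value, path in partitions.items() if country_predicate(value)]


# A predicate on the partition column can be decided from the partition
# value alone, so the other files never need to be opened.
selected = files_to_read(partitions, lambda country: country == "Canada")
skipped = len(partitions) - len(selected)
```

Here only one of the three hypothetical files is selected. A predicate on a non-partition column such as `age` could not be decided this way, because the values are only known after reading the file contents.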

## Excluding Data

You can **limit** the number of rows in a DataFrame by calling the [`df.limit()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.limit.html#daft.DataFrame.limit) method:

In [24]:
df.limit(1).show()

first_name Utf8,last_name Utf8,age Int64,DoB Date,country Utf8,has_dog Boolean
Wolfgang,Winter,23,2001-02-12,Germany,


To **drop** columns from the DataFrame, use the [`df.exclude()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.exclude.html#daft.DataFrame.exclude) method.

In [25]:
df.exclude("DoB").show()

first_name Utf8,last_name Utf8,age Int64,country Utf8,has_dog Boolean
Wolfgang,Winter,23,Germany,
Shandra,Shamas,57,United Kingdom,True
Zaya,Zaphora,40,United Kingdom,True
Ernesto,Evergreen,34,Canada,True
James,Jale,62,Canada,True


## Transforming Columns with Expressions

[Expressions](core_concepts/expressions.md) are an API for defining computation over columns. For example, use [`daft.col()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/expression_methods/daft.col.html#daft.col) expressions together with the [`with_column`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.with_column.html#daft.DataFrame.with_column) method to create a new column called `full_name` that joins the contents of the `first_name` column with the `last_name` column:

In [26]:
df = df.with_column("full_name", daft.col("first_name") + " " + daft.col("last_name"))
df.select("full_name", "age", "country", "has_dog").show()

full_name Utf8,age Int64,country Utf8,has_dog Boolean
Wolfgang Winter,23,Germany,
Shandra Shamas,57,United Kingdom,True
Zaya Zaphora,40,United Kingdom,True
Ernesto Evergreen,34,Canada,True
James Jale,62,Canada,True


Alternatively, you can also run your column transformation using Expressions directly inside your [`df.select()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.select.html#daft.DataFrame.select) method:

In [27]:
df.select((daft.col("first_name") + " " + daft.col("last_name")).alias("full_name"), "age", "country", "has_dog").show()

full_name Utf8,age Int64,country Utf8,has_dog Boolean
Wolfgang Winter,23,Germany,
Shandra Shamas,57,United Kingdom,True
Zaya Zaphora,40,United Kingdom,True
Ernesto Evergreen,34,Canada,True
James Jale,62,Canada,True
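
As a rough mental model of what an expression is, the sketch below builds a tiny deferred-computation object in plain Python. `Expr` and `col` here are hypothetical stand-ins written for illustration; they are not Daft's classes and say nothing about how Daft evaluates expressions internally.

```python
# Toy expression objects. Expr and col are hypothetical stand-ins,
# not Daft classes; they only show computation being described, then run.
class Expr:
    def __init__(self, fn, name):
        self.fn = fn      # how to compute the value for one row
        self.name = name  # name of the resulting column

    def __add__(self, other):
        # Combine with another expression or a literal, without evaluating yet.
        other_fn = other.fn if isinstance(other, Expr) else (lambda row: other)
        return Expr(lambda row: self.fn(row) + other_fn(row), self.name)

    def alias(self, name):
        # Same computation, different output column name.
        return Expr(self.fn, name)


def col(name):
    # An expression that reads one field from a row.
    return Expr(lambda row: row[name], name)


full_name = (col("first_name") + " " + col("last_name")).alias("full_name")

rows = [
    {"first_name": "Wolfgang", "last_name": "Winter"},
    {"first_name": "Shandra", "last_name": "Shamas"},
]
values = [full_name.fn(row) for row in rows]  # evaluation happens here
```

The expression is only a description of the work until it is applied to rows, which is what allows the same expression to be passed to different methods such as `with_column` or `select`.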


## Sorting Data

You can **sort** a DataFrame with the [`df.sort()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.sort.html#daft.DataFrame.sort) method. In this example, we sort by `age` in ascending order:

In [31]:
df.sort(daft.col("age"), desc=False).show()

first_name Utf8,last_name Utf8,age Int64,DoB Date,country Utf8,has_dog Boolean,full_name Utf8
Wolfgang,Winter,23,2001-02-12,Germany,,Wolfgang Winter
Ernesto,Evergreen,34,1990-04-03,Canada,True,Ernesto Evergreen
Zaya,Zaphora,40,1984-04-07,United Kingdom,True,Zaya Zaphora
Shandra,Shamas,57,1967-01-02,United Kingdom,True,Shandra Shamas
James,Jale,62,1962-03-24,Canada,True,James Jale


## Grouping and Aggregating Data

You can **group** and **aggregate** your data using the [`df.groupby()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.groupby.html#daft.DataFrame.groupby) and the [`df.agg()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.agg.html#daft.DataFrame.agg) methods. A groupby aggregation operation over a dataset happens in 2 steps:

1. Split the data into groups based on some criteria using [`df.groupby()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.groupby.html#daft.DataFrame.groupby)
2. Specify how to aggregate the data for each group using [`df.agg()`](https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/dataframe_methods/daft.DataFrame.agg.html#daft.DataFrame.agg)

In [29]:
# show() returns None, so keep the aggregated DataFrame before displaying it
grouped = df.groupby("country").agg(daft.col("age").mean().alias("avg_age"), daft.col("has_dog").count())
grouped.show()

country Utf8,avg_age Float64,has_dog UInt64
Canada,48.0,2
Germany,23.0,0
United Kingdom,48.5,2


<div class="admonition tip">
    <p class="admonition-title">Note</p>
    <p>
    The <a href="https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/expression_methods/daft.Expression.alias.html#daft.Expression.alias"><code>Expression.alias()</code></a> method renames the column produced by an expression, here <code>avg_age</code>.
    </p>
    </p>
</div>
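
The two-step split-then-aggregate model described in this section can be sketched in plain Python. This is an illustration of the concept rather than Daft's implementation; the rows are copied from the table shown earlier, and the null-skipping count is an assumption chosen to match the `has_dog` result for Germany in the output above.

```python
from collections import defaultdict

# Rows copied from the table shown earlier; None marks a missing value.
rows = [
    {"country": "Germany", "age": 23, "has_dog": None},
    {"country": "United Kingdom", "age": 57, "has_dog": True},
    {"country": "United Kingdom", "age": 40, "has_dog": True},
    {"country": "Canada", "age": 34, "has_dog": True},
    {"country": "Canada", "age": 62, "has_dog": True},
]

# Step 1: split the rows into groups keyed by country.
groups = defaultdict(list)
for row in rows:
    groups[row["country"]].append(row)

# Step 2: aggregate each group independently.
summary = {}
for country, members in groups.items():
    ages = [m["age"] for m in members]
    summary[country] = {
        "avg_age": sum(ages) / len(ages),
        # Assumption: count only non-null values, matching the 0 for Germany.
        "has_dog": sum(1 for m in members if m["has_dog"] is not None),
    }
```

Because each group is aggregated independently in step 2, the groups can in principle be processed in parallel, which is one reason this pattern suits distributed engines.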