In [None]:
%pip install daft

In [None]:
CI = False

In [None]:
# Skip this notebook execution in CI because it hits data in a relative path
if CI:
    import sys

    sys.exit()

In [None]:
!uv pip install daft
!uv pip install validators matplotlib Pillow torch torchvision


import daft

# Read parquet file containing sample dog owners
df = daft.read_parquet("s3://daft-public-data/tutorials/10-min/sample-data-dog-owners-partitioned.pq/**")

# Combine "first_name" and "last_name" to create new column "full_name"
df = df.with_column("full_name", daft.col("first_name") + " " + daft.col("last_name"))
df.select("full_name", "age", "country", "has_dog").show()

# Create dataframe of dogs
df_dogs = daft.from_pydict(
    {
        "urls": [
            "https://live.staticflickr.com/65535/53671838774_03ba68d203_o.jpg",
            "https://live.staticflickr.com/65535/53671700073_2c9441422e_o.jpg",
            "https://live.staticflickr.com/65535/53670606332_1ea5f2ce68_o.jpg",
            "https://live.staticflickr.com/65535/53671838039_b97411a441_o.jpg",
            "https://live.staticflickr.com/65535/53671698613_0230f8af3c_o.jpg",
        ],
        "full_name": [
            "Ernesto Evergreen",
            "James Jale",
            "Wolfgang Winter",
            "Shandra Shamas",
            "Zaya Zaphora",
        ],
        "dog_name": ["Ernie", "Jackie", "Wolfie", "Shaggie", "Zadie"],
    }
)

# Join owners with dogs
df_family = df.join(df_dogs, on="full_name").exclude("first_name", "last_name", "DoB", "country", "age")

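# Download the image bytes from each URL; failed downloads become nulls
df_family = df_family.with_column("image_bytes", df_family["urls"].url.download(on_error="null"))

# Decode the downloaded bytes into images
df_family = df_family.with_column("image", df_family["image_bytes"].image.decode())

df_family.show()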

# #00 - Data Access

This Feature-of-the-Week tutorial shows the canonical way of accessing data with Daft. 

Daft reads from 3 main data sources:
1. Files (local and remote)
2. SQL Databases
3. Data Catalogs

Let's dive into each type of data access in more detail 🪂

In [1]:
import daft

## Files

You can read many different file types with Daft.

The most common file formats are:
- CSV
- JSON
- Parquet

You can read these file types from local and remote filesystems.

### CSV

Use the `read_csv` method to read CSV files from your local filesystem.

Here, we'll read in some synthetic US Census data:

In [2]:
# Read a single CSV file from your local filesystem
df = daft.read_csv("data/census001.csv")

In [None]:
df.show()

In [None]:
df.count_rows()

You can also read a folder of CSV files, or use wildcards to match a pattern of file paths.

All matching files must share the same schema.
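Before reading a wildcard pattern, it can help to confirm that the matching files really do share a header. The sketch below uses only the Python standard library. The file names, columns, and the `csv_headers` helper are made up for illustration, and it compares header rows only, not column types.

```python
import csv
import glob
import os
import tempfile


def csv_headers(pattern):
    """Map each file matching `pattern` to its header row."""
    headers = {}
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as f:
            headers[path] = next(csv.reader(f), [])
    return headers


# Hypothetical sample files, written to a temporary directory
tmp_dir = tempfile.mkdtemp()
sample_files = {
    "census001.csv": [["age", "education"], ["39", "Bachelors"]],
    "census002.csv": [["age", "education"], ["50", "HS-grad"]],
}
for name, rows in sample_files.items():
    with open(os.path.join(tmp_dir, name), "w", newline="") as f:
        csv.writer(f).writerows(rows)

headers = csv_headers(os.path.join(tmp_dir, "census*.csv"))

# All files share a schema (by header) if there is exactly one distinct header row
same_schema = len({tuple(h) for h in headers.values()}) == 1
print(same_schema)
```

If the check passes, the same wildcard pattern can be passed to `daft.read_csv`.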

Here, we'll read in multiple CSV files containing synthetic US Census data:

In [5]:
# Read multiple CSV files into one DataFrame
df = daft.read_csv("data/census*.csv")

In [None]:
df.show()

In [None]:
df.count_rows()

### JSON

You can read line-delimited JSON using the `read_json` method:

In [8]:
# Read a JSON file from your local filesystem
df = daft.read_json("data/sampled-tpch.jsonl")

In [None]:
df.show()
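Line-delimited JSON (often saved with a `.jsonl` extension) stores one JSON object per line instead of one large array. As a minimal sketch of that layout, the snippet below writes and re-reads such a file with the standard library. The records and the file name are invented for illustration.

```python
import json
import os
import tempfile

# Hypothetical records; each one becomes a single line in the file
records = [
    {"order_id": 1, "status": "shipped"},
    {"order_id": 2, "status": "pending"},
]

path = os.path.join(tempfile.mkdtemp(), "orders.jsonl")
with open(path, "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Reading it back: parse each non-empty line as its own JSON document
with open(path) as f:
    parsed = [json.loads(line) for line in f if line.strip()]
```

A file in this layout can then be passed to `daft.read_json`.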

### Parquet

Use the `read_parquet` method to read Parquet files.

In [10]:
# Read a Parquet file from your local filesystem
df = daft.read_parquet("data/sample_taxi.parquet")

In [None]:
df.show()

### Remote Reads, e.g. S3

You can read files from remote filesystems such as AWS S3:

```python
# Read multiple Parquet files from s3
df = daft.read_parquet("s3://mybucket/path/to/*.parquet")
```

The protocol prefix of the path, such as `s3://`, tells Daft which remote filesystem to read from.

#### Reading from Public Buckets

You can read from public buckets using an "anonymous" `IOConfig`.

An anonymous IOConfig will access storage **without credentials**, and can only access fully public data.

In [12]:
# create anonymous config
MY_ANONYMOUS_IO_CONFIG = daft.io.IOConfig(s3=daft.io.S3Config(anonymous=True))

# Read this file using `MY_ANONYMOUS_IO_CONFIG`
df = daft.read_csv("s3://daft-public-data/melbourne-airbnb/melbourne_airbnb.csv", io_config=MY_ANONYMOUS_IO_CONFIG)

In [None]:
df.select("id", "text", "price", "review_scores_rating").show()

#### Remote IO Configuration
Use the `IOConfig` class to configure access to remote data.

Daft automatically detects S3 credentials from your local environment. If your current session is already authenticated for your private bucket, you can access it without explicitly passing credentials.

Substitute the path below with a path to a private bucket you have access to with your credentials.

In [14]:
bucket = "s3://avriiil/yellow_tripdata_2023-12.parquet"

Now configure your IOConfig object:

In [15]:
from daft.io import IOConfig, S3Config

io_config = IOConfig(
    s3=S3Config(
        region_name="eu-north-1",
    )
)

In [16]:
df = daft.read_parquet(bucket, io_config=io_config)

In [None]:
df.show()

The IOConfig object has many more configuration options. You can use this object to configure access to:
* [AWS S3](https://www.getdaft.io/projects/docs/en/latest/api_docs/doc_gen/io_configs/daft.io.S3Config.html)
* [GCP](https://www.getdaft.io/projects/docs/en/latest/api_docs/doc_gen/io_configs/daft.io.GCSConfig.html)
* [Azure](https://www.getdaft.io/projects/docs/en/latest/api_docs/doc_gen/io_configs/daft.io.AzureConfig.html)

All the cloud-specific Config options follow the standard protocols of the respective cloud platforms. See the documentation links above for more information.

There is a dedicated section for the `IOConfig` object at the end of this tutorial.

## SQL Databases
You can use Daft to read the results of SQL queries from databases, data warehouses, and query engines, into a Daft DataFrame via the `daft.read_sql()` function.

```python
# Read from a PostgreSQL database
uri = "postgresql://user:password@host:port/database"
df = daft.read_sql("SELECT * FROM my_table", uri)
```

To partition the data, specify a partition column:

```python
# Read with a partition column
df = daft.read_sql("SELECT * FROM my_table", uri, partition_col="date")
```

Partitioning lets Daft read the data in parallel, which can speed up reads of large tables.
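To build some intuition for what a partitioned read involves, here is a conceptual sketch using only `sqlite3`. It is not Daft's implementation. It assumes an integer partition column (a hypothetical `id`), and the `partition_queries` helper is made up for illustration. The idea is that one query becomes several range-bounded queries that could be executed independently.

```python
import sqlite3


def partition_queries(conn, table, col, num_partitions):
    """Split `SELECT * FROM table` into range-bounded queries on an integer column."""
    lo, hi = conn.execute(f"SELECT MIN({col}), MAX({col}) FROM {table}").fetchone()
    span = hi - lo + 1
    # Integer boundaries: partition i covers [bounds[i], bounds[i + 1])
    bounds = [lo + (span * i) // num_partitions for i in range(num_partitions + 1)]
    sql = f"SELECT * FROM {table} WHERE {col} >= ? AND {col} < ?"
    return [(sql, (bounds[i], bounds[i + 1])) for i in range(num_partitions)]


# Hypothetical table with 100 rows, ids 1..100
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER, value TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(1, 101)],
)

# Each query is independent of the others, so a real engine could run them in parallel
partitions = [
    conn.execute(sql, params).fetchall()
    for sql, params in partition_queries(conn, "my_table", "id", 4)
]
sizes = [len(rows) for rows in partitions]
print(sizes)
```

Every row lands in exactly one partition, so combining the partitions gives the same result as the original query.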

### ConnectorX vs SQLAlchemy

Daft uses [ConnectorX](https://sfu-db.github.io/connector-x/databases.html) under the hood to read SQL data. ConnectorX is a fast, Rust-based SQL connector that reads directly into Arrow Tables, enabling zero-copy transfer into Daft DataFrames. If the database is [not supported](https://sfu-db.github.io/connector-x/intro.html#supported-sources-destinations) by ConnectorX, Daft will fall back to using [SQLAlchemy](https://docs.sqlalchemy.org/en/20/orm/quickstart.html).

You can also provide a SQLAlchemy connection directly via a connection factory. This gives you the flexibility to pass additional parameters to the engine.

### Example

Let's look at an example with a simple local database.

You will need some extra dependencies installed. The easiest way to do so is using the `[sql]` extras:

In [None]:
#!pip install "getdaft[sql]"

### Create a local SQL database

Let's start by creating a local SQLite database with a small `books` table, using Python's built-in `sqlite3` module:

In [19]:
import sqlite3

connection = sqlite3.connect("example.db")
connection.execute("CREATE TABLE IF NOT EXISTS books (title TEXT, author TEXT, year INTEGER)")
connection.execute(
    """
INSERT INTO books (title, author, year)
VALUES
    ('The Great Gatsby', 'F. Scott Fitzgerald', 1925),
    ('To Kill a Mockingbird', 'Harper Lee', 1960),
    ('1984', 'George Orwell', 1949),
    ('The Catcher in the Rye', 'J.D. Salinger', 1951)
"""
)
connection.commit()
connection.close()

In [None]:
# Read SQL query into Daft DataFrame
df = daft.read_sql(
    "SELECT * FROM books",
    "sqlite://example.db",
)

df.show()

As mentioned above, you can also work with a SQLAlchemy connection, which lets you pass additional parameters to the engine.

Let's use a SQLAlchemy engine to load a CSV file into a local SQLite database and then query it with Daft:

In [21]:
from sqlalchemy import create_engine

# substitute the uri below with the engine path on your local machine
engine_uri = "sqlite:////Users/rpelgrim/daft_sql"
engine = create_engine(engine_uri, echo=True)
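SQLAlchemy-style SQLite URLs consist of `sqlite:///` followed by the database path, so an absolute POSIX path ends up with four slashes, as in the cell above. As a small sketch, the hypothetical helper `sqlite_uri` below builds such a URL from any path. The temporary directory and file name are placeholders.

```python
import sqlite3
import tempfile
from pathlib import Path


def sqlite_uri(db_path):
    """Build a SQLAlchemy-style SQLite URL for `db_path`, resolved to an absolute path."""
    return f"sqlite:///{Path(db_path).resolve()}"


# Placeholder location for the database file
db_file = Path(tempfile.mkdtemp()) / "daft_sql.db"

# Connecting creates the (empty) database file if it does not exist yet
sqlite3.connect(db_file).close()

engine_uri = sqlite_uri(db_file)
print(engine_uri)
```

The resulting string can be used in place of the hard-coded `engine_uri`.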

In [None]:
import pandas as pd

csv_file_path = "data/census-01.csv"
df = pd.read_csv(csv_file_path)

# Write the pandas DataFrame to the `censustable` table, replacing it if it already exists
rows_written = df.to_sql(name="censustable", con=engine, index=False, if_exists="replace")

### Access SQL Database with Daft

Great, now let's see how we can access this data with Daft.

In [23]:
# Read from local SQLite database
uri = "sqlite:////Users/rpelgrim/daft_sql"  # replace with your local uri

df = daft.read_sql("SELECT * FROM censustable", uri)

In [None]:
df.show()

### Parallel and Distributed Reads
Supply a partition column and optionally the number of partitions to enable parallel reads:

In [None]:
df = daft.read_sql(
    "SELECT * FROM censustable",
    uri,
    partition_col="education",
    #    num_partitions=12
)

df.show()

### Data Skipping Optimizations
Filter, projection, and limit pushdown optimizations can be used to reduce the amount of data read from the database.

In the example below, Daft reads the top-ranked terms from the BigQuery Google Trends dataset. The `where` and `select` expressions in this example are pushed down into the SQL query itself. You can see this by calling the `df.explain()` method:

```python
import daft, sqlalchemy, datetime

def create_conn():
    engine = sqlalchemy.create_engine(
        "bigquery://", credentials_path="path/to/service_account_credentials.json"
    )
    return engine.connect()


df = daft.read_sql("SELECT * FROM `bigquery-public-data.google_trends.top_terms`", create_conn)

df = df.where((df["refresh_date"] >= datetime.date(2024, 4, 1)) & (df["refresh_date"] < datetime.date(2024, 4, 8)))
df = df.where(df["rank"] == 1)
df = df.select(df["refresh_date"].alias("Day"), df["term"].alias("Top Search Term"), df["rank"])
df = df.distinct()
df = df.sort(df["Day"], desc=True)

df.explain(show_all=True)

# Output
# ..
# == Physical Plan ==
# ..
# |   SQL Query = SELECT refresh_date, term, rank FROM
#  (SELECT * FROM `bigquery-public-data.google_trends.top_terms`)
#  AS subquery WHERE rank = 1 AND refresh_date >= CAST('2024-04-01' AS DATE)
#  AND refresh_date < CAST('2024-04-08' AS DATE)
```

You could code the SQL query to add the filters and projections yourself, but this may become lengthy and error-prone, particularly with many expressions. Daft automatically handles these performance optimizations for you.
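To make the idea concrete, here is a hand-written version of the kind of query shown in the plan above, run against an in-memory SQLite table. The `push_down` helper is hypothetical and only composes strings. It is a sketch of the resulting query shape, not of how Daft builds its plans.

```python
import sqlite3


def push_down(base_query, columns, predicates):
    """Wrap `base_query` in a subquery and apply the projection and filters in SQL.

    Inputs are interpolated directly, so only use this with trusted strings.
    """
    select_list = ", ".join(columns)
    where = " AND ".join(predicates)
    return f"SELECT {select_list} FROM ({base_query}) AS subquery WHERE {where}"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, author TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [
        ("The Great Gatsby", "F. Scott Fitzgerald", 1925),
        ("To Kill a Mockingbird", "Harper Lee", 1960),
        ("1984", "George Orwell", 1949),
        ("The Catcher in the Rye", "J.D. Salinger", 1951),
    ],
)

# Only two columns and the matching rows ever leave the database
query = push_down(
    "SELECT * FROM books",
    columns=["title", "year"],
    predicates=["year >= 1949", "year < 1960"],
)
rows = conn.execute(query).fetchall()
print(query)
```

Because the filter and projection run inside the database, less data has to be transferred and processed on the client.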

## Data Catalogs

Daft is built for efficient data access from Data Catalogs using open table formats like Delta Lake, Iceberg and Hudi.

### Delta Lake

You can easily read Delta Lake tables using the `read_deltalake()` method.

In [None]:
df = daft.read_deltalake("data/delta_table")
df.show()

To access Delta tables on S3 you will have to pass some more config options:
- the AWS Region name
- your access credentials

In [6]:
import boto3

session = boto3.session.Session()
creds = session.get_credentials()

# set io configs
io_config = daft.io.IOConfig(
    s3=daft.io.S3Config(
        region_name="eu-north-1",
        key_id=creds.access_key,
        access_key=creds.secret_key,
        session_token=creds.token,  # needed when the credentials are temporary
    )
)

# Read Delta Lake table in S3 into a Daft DataFrame.
table_uri = "s3://avriiil/delta-test-daft/"

df = daft.read_deltalake(table_uri, io_config=io_config)

In [None]:
df.show()

### Iceberg
Daft is integrated with [PyIceberg](https://py.iceberg.apache.org/), the official Python implementation for Apache Iceberg.

This means you can easily read Iceberg tables into Daft DataFrames in 2 steps:
1. Load your Iceberg table from your Iceberg catalog using PyIceberg
2. Read your Iceberg table into Daft

We'll use a simple local SQLite Catalog implementation for this toy example.

In [8]:
# initialize your catalog

from pyiceberg.catalog.sql import SqlCatalog

warehouse_path = "data/iceberg-warehouse/"
catalog = SqlCatalog(
    "default",
    **{
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)

In [9]:
# load your table
table = catalog.load_table("default.taxi_dataset")

In [None]:
# Read into Daft
df = daft.read_iceberg(table)
df.show()

Any subsequent filter operations on the Daft `df` DataFrame object will be correctly optimized to take advantage of Iceberg features such as hidden partitioning and file-level statistics for efficient reads.

### Hudi
To read from an Apache Hudi table, use the `daft.read_hudi()` function. 

The following snippet loads an example table:

In [None]:
# Read Apache Hudi table into a Daft DataFrame.
import daft

df = daft.read_hudi("data/hudi-data")
df.show()

Reading Hudi tables currently has the following limitations:
- Only snapshot reads of Copy-on-Write tables are supported
- Only table versions 5 and 6 are supported (tables created using release 0.12.x - 0.15.x)
- The table must not have `hoodie.datasource.write.drop.partition.columns=true`

## IOConfig Deep Dive

Let's dive a little deeper into the IOConfig options for tweaking your remote data access.

`IOConfig` is Daft's mechanism for controlling the behavior of data input/output from storage. It is useful for:

1. **Providing credentials** for authenticating with cloud storage services
2. **Tuning performance** or reducing load on storage services

### Default IOConfig Behavior

By default, `IOConfig` automatically detects credentials on the machine(s) that run your code.

In [None]:
import daft

# By default, calls to AWS S3 will use credentials retrieved from the machine(s) that they are called from
#
# For AWS S3 services, the default mechanism is to look through a chain of possible "providers":
# https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html#configuring-credentials
df = daft.read_csv("s3://daft-public-data/file.csv")
df.collect()

### Overriding the IOConfig
#### Setting a Global Override

Often you may want Daft to just use a certain configuration by default whenever it has to access storage such as S3, GCS or Azure Blob Store.

> **Example:**
>
> A very common use case is to create a set of temporary credentials once and share them across all data access calls in Daft.
>
> The example below demonstrates this with AWS S3's `boto3` Python SDK.

In [None]:
# Use the boto3 library to generate temporary credentials which can be used for S3 access
import boto3

session = boto3.session.Session()
creds = session.get_credentials()

# Attach temporary credentials to a Daft IOConfig object
MY_IO_CONFIG = daft.io.IOConfig(
    s3=daft.io.S3Config(
        key_id=creds.access_key,
        access_key=creds.secret_key,
        session_token=creds.token,
    )
)

# Set the default config to `MY_IO_CONFIG` so that it is used in the absence of any overrides
daft.set_planning_config(default_io_config=MY_IO_CONFIG)

#### Overriding IOConfigs per-API call

Daft also allows for more granular per-call overrides through the use of keyword arguments.

This is extremely flexible, allowing you to use a different set of credentials to read from two different locations!

Here we use `daft.read_csv` as an example, but the same `io_config=...` keyword arg also exists for other I/O related functionality such as:

1. `daft.read_parquet`
2. `daft.read_json`
3. `Expression.url.download()`

In [None]:
# An "Anonymous" IOConfig will access storage **without credentials**, and can only access fully public data
MY_ANONYMOUS_IO_CONFIG = daft.io.IOConfig(s3=daft.io.S3Config(anonymous=True))

# Read this file using `MY_ANONYMOUS_IO_CONFIG` instead of the overridden global config `MY_IO_CONFIG`
df1 = daft.read_csv("s3://daft-public-data/melbourne-airbnb/melbourne_airbnb.csv", io_config=MY_ANONYMOUS_IO_CONFIG)

For more, see the [IOConfig documentation](https://www.getdaft.io/projects/docs/en/latest/api_docs/doc_gen/io_configs/daft.io.IOConfig.html?highlight=IOConfig).


## Data Access with Daft
In this tutorial you have seen the canonical ways of accessing data with Daft.

We've seen how to access:
* local and remote files, incl. CSV, JSON and Parquet
* data in SQL databases
* data in Data Catalogs, incl. Delta Lake, Iceberg and Hudi

Take a look at our hands-on [Use Case tutorials](https://www.getdaft.io/projects/docs/en/latest/user_guide/tutorials.html) if you feel ready to start building workflows with Daft.