# Scalable and versatile databasing for ML

Here we introduce DuckDB, an in-process SQL database, and Parquet, a columnar file format. Together they let us store and query data in a scalable and versatile way. We show how to use them from Python and how to integrate them with other tools such as Pandas.

## Iris dataset in Pandas: recap

In [None]:
import pandas as pd
from sklearn import datasets

In [None]:
def sklearnds2df(ds):
    df = pd.DataFrame(data=ds.data, columns=ds.feature_names)
    df['target'] = pd.Series(pd.Categorical.from_codes(ds.target,
                                                       categories=ds.target_names))
    return df

In [None]:
iris_ds = datasets.load_iris()
iris = sklearnds2df(iris_ds)
iris.columns = iris.columns.str.removesuffix(" (cm)").str.replace(" ","_")
iris

In [None]:
iris_sepals = iris[['sepal_length','sepal_width','target']]
iris_petals = iris[['petal_length','petal_width','target']]

In [None]:
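# 'target' is not unique, so this pairs every sepal row with every petal row
# of the same species: 50 x 50 rows per species, 7500 rows in total
pd.merge(iris_sepals, iris_petals, how='inner', on='target')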

In [None]:
pd.merge(iris_sepals, iris_petals, left_index=True, right_index=True, how='inner')

In [None]:
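def create_col_subset(df, prefix):
    # Keep the columns starting with the prefix, plus any categorical column (the target)
    dfs = df.filter(regex=f"^{prefix}")
    dfs = pd.concat([dfs, df.select_dtypes(include='category')], axis=1)
    # Build IDs from the index of the dataframe passed in, not from the global `iris`
    dfs.index = [f"P{i:03d}" for i in df.index.values]
    dfs.columns = dfs.columns.str.removeprefix(prefix)
    # Turn the index into a regular 'ID' column that can serve as a join key
    return dfs.reset_index(names='ID')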
iris_sepals = create_col_subset(iris, 'sepal_')
iris_petals = create_col_subset(iris, 'petal_')
iris_sepals

In [None]:
pd.merge(iris_sepals, iris_petals, on='ID', how='inner', suffixes=('_sepal','_petal'))

Pandas can do all of this, so feasibility is not the point. The point is that we are using the wrong tool for the job. Pandas is great for manipulating data in memory, but it is not a database: it is not designed to store and query data efficiently. _(Adapted from a Copilot autocompletion, which says something about how common this problem is.)_
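To make the pitfall in the merges above concrete, here is a minimal sketch on a toy stand-in for the two iris subsets. The values and the names `sepals` and `petals` are made up for illustration. It compares a merge on the non-unique `target` column with a merge on a unique `ID`, and shows that Pandas only guards against the row explosion if we explicitly ask for it with `validate`.

```python
import pandas as pd
from pandas.errors import MergeError

# Toy stand-in for the iris subsets: two classes, three rows each (made-up values)
ids = [f"P{i:03d}" for i in range(6)]
targets = ["setosa"] * 3 + ["versicolor"] * 3
sepals = pd.DataFrame({"ID": ids, "length": [5.1, 4.9, 4.7, 7.0, 6.4, 6.9], "target": targets})
petals = pd.DataFrame({"ID": ids, "length": [1.4, 1.4, 1.3, 4.7, 4.5, 4.9], "target": targets})

# Non-unique key: every sepal row is paired with every petal row of the same class
by_target = pd.merge(sepals, petals, on="target", how="inner")

# Unique key: exactly one output row per flower
by_id = pd.merge(sepals, petals, on="ID", how="inner", suffixes=("_sepal", "_petal"))

print(len(sepals), len(by_target), len(by_id))

# Pandas checks key uniqueness only when asked to
try:
    pd.merge(sepals, petals, on="target", validate="one_to_one")
    guarded = False
except MergeError:
    guarded = True
print("one_to_one validation rejected the merge on 'target':", guarded)
```

In a database, this kind of constraint would normally live in the schema (for example as a primary key) rather than in each individual query.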

## The database way: DuckDB

In [None]:
import duckdb

In [None]:
db = duckdb.from_df(iris)
db

In [None]:
(type(db), db.shape, db.columns)

In [None]:
duckdb.sql("select * from db limit 5")

In [None]:
duckdb.sql("select min(sepal_width), max(sepal_width) from db")

In [None]:
duckdb.sql("select target, avg(sepal_length), avg(sepal_width) from db group by target")

### DuckDB can access dataframes directly

In [None]:
duckdb.sql("select target, avg(sepal_length), avg(sepal_width) from iris group by target")

In [None]:
duckdb.sql("select s.ID as sID, p.ID as pID, s.length as sepal_length, s.width as sepal_width, "
           "p.length as petal_length, p.width as petal_width, p.target as species "
           "from iris_sepals as s positional join iris_petals as p ")

### DuckDB can read online datasets

#### [Hugging Face](https://hf.co) (HF)

- One of the most widely used ML commons for models and datasets, if not the most widely used
- Hosts a large number of datasets, and they are easy to use

DuckDB can read datasets directly from HF. (In fact, Pandas can, too.) We can start with the [Iris dataset from scikit-learn](https://hf.co/datasets/scikit-learn/iris/) on HF.

#### fsspec and HF

In [None]:
import fsspec
fsspec.filesystem('hf').ls('hf://datasets/scikit-learn/iris@~parquet')

In [None]:
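# Assumption: the cell that defined `df` is missing; since Pandas can also read from HF,
# we load the same CSV that DuckDB reads in the next cell
df = pd.read_csv("hf://datasets/scikit-learn/iris/Iris.csv")
df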

In [None]:
# Keep the relation in `rel` so that the queries further down can refer to it
rel = duckdb.read_csv("hf://datasets/scikit-learn/iris/Iris.csv")
rel

### Parquet format



In [None]:
pdb = duckdb.read_parquet("hf://datasets/scikit-learn/iris@~parquet/default/train/0000.parquet")
pdb

In [None]:
rel

In [None]:
duckdb.sql("select * from rel where species = 'Iris-setosa'")