## Spaceship Quickstart
This guide shows how to use Spaceship to create a (Delta Lake) dataset, add some data, and query it.

This first example creates and interacts with datasets locally. For cloud object storage, see [Using Cloud Object Storage](#using-cloud-object-storage).

Let's first create a client, which provides access to most of the commands we want.

In [1]:
from spaceship import Client

client = Client()

We can use the client to list datasets both locally and in cloud storage such as S3 or DigitalOcean Spaces. But let's first try with local datasets.

In [3]:
client.list_datasets()  # without a path or bucket name, this looks for datasets in the cwd

[]

There is no dataset in the cwd, so let's create one.

To create a dataset, we will need to provide a dataset schema.
Spaceship supports PyArrow's Schema object.

The schema will be enforced every time new data is added to the dataset.

In [4]:
import pyarrow as pa

fields = [
    pa.field("id", pa.int64(), metadata={"description": "user id"}, nullable=False),
    pa.field("date", pa.date32(), metadata={"description": "register date"}, nullable=False),
    pa.field("name", pa.string(), metadata={"description": "user name"}, nullable=False),
    pa.field("age", pa.string(), metadata={"description": "user age"}, nullable=True),
]

schema = pa.schema(fields)
schema

id: int64 not null
  -- field metadata --
  description: 'user id'
date: date32[day] not null
  -- field metadata --
  description: 'register date'
name: string not null
  -- field metadata --
  description: 'user name'
age: string
  -- field metadata --
  description: 'user age'
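
To make "enforced" more concrete, here is a minimal, library-free sketch of the kind of check a schema implies: every declared column must be present, values must match the declared type, and only nullable columns may be empty. `FieldSpec` and `validate_row` are hypothetical names invented for this illustration; this is not how Spaceship or Delta Lake implement enforcement.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical stand-in for a schema field: name, Python type, nullability."""
    name: str
    py_type: type
    nullable: bool


# Mirrors the columns of the PyArrow schema above (illustration only).
USER_FIELDS = [
    FieldSpec("id", int, nullable=False),
    FieldSpec("date", date, nullable=False),
    FieldSpec("name", str, nullable=False),
    FieldSpec("age", str, nullable=True),
]


def validate_row(row, fields):
    """Return a list of violations; an empty list means the row conforms."""
    errors = []
    for spec in fields:
        if spec.name not in row:
            errors.append(f"missing column: {spec.name}")
            continue
        value = row[spec.name]
        if value is None:
            if not spec.nullable:
                errors.append(f"null in non-nullable column: {spec.name}")
        elif not isinstance(value, spec.py_type):
            errors.append(f"wrong type for {spec.name}: {type(value).__name__}")
    return errors


good = {"id": 100, "date": date(2025, 1, 1), "name": "Eric", "age": None}
bad = {"id": None, "date": date(2025, 1, 2), "name": "Julia", "age": 40}

good_errors = validate_row(good, USER_FIELDS)
bad_errors = validate_row(bad, USER_FIELDS)
print(good_errors)
print(bad_errors)
```

The first row conforms because `age` is nullable; the second is flagged twice, once for the null `id` and once for the integer `age` in a string column.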

Now we can use the schema to create a dataset.

In [None]:
client.create_dataset(
    "my_dataset",  # dataset name, or path to the dataset in local mode
    schema=schema,
    description="My new dataset",
)

client.list_datasets()

['my_dataset']

### Adding data to the dataset.

Spaceship supports adding data from multiple sources such as CSV or Parquet files, pandas DataFrames, and PyArrow Tables and Datasets.

For now, let's pass a pandas DataFrame directly.

In [None]:
from datetime import date

import pandas as pd

df = pd.DataFrame({
    "id": [100, 101, 102],
    "date": [date(2025, 1, 1), date(2025, 1, 2), date(2025, 1, 3),],
    "name": ["Eric", "Julia", "Mark"],
    "age": [30, 40, None]
})

client.append(df, "my_dataset")  # The second argument is the name of the dataset to append to

### Querying the data.

Spaceship uses the DuckDB query engine under the hood, so we can easily get the data using SQL.

In [8]:
client.query("""

    SELECT * FROM lc.my_dataset

""").df()

|   | id | date | name | age | load_partition_date |
|---|-----|------------|-------|------|------------|
| 0 | 100 | 2025-01-01 | Eric | 30.0 | 2025-01-31 |
| 1 | 101 | 2025-01-02 | Julia | 40.0 | 2025-01-31 |
| 2 | 102 | 2025-01-03 | Mark |  | 2025-01-31 |


Notes:
- The `lc.` prefix refers to a local dataset. A dataset can also be referred to by its path, as `lc."path/to/your/dataset"`.
- The object returned from `client.query` is a DuckDB object, similar to what you get when you call `duckdb.query`.
- `load_partition_date` is added automatically as a partition column if no partition column is provided when creating a dataset.
  Its value is the date the data was added. This provides a default way to query the data efficiently.
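
The default partition column can be pictured as follows. By convention, partitioned Delta Lake data is laid out in Hive-style `column=value` directories, so rows appended on the same day end up under the same `load_partition_date=...` directory. The sketch below is a stdlib-only illustration of that idea; `add_load_partition_date` and `partition_dir` are hypothetical helpers, not Spaceship functions.

```python
from collections import defaultdict
from datetime import date


def add_load_partition_date(rows, load_date):
    """Return copies of the rows with a load_partition_date column added."""
    return [{**row, "load_partition_date": load_date} for row in rows]


def partition_dir(column, value):
    """Hive-style directory name for one partition value."""
    return f"{column}={value}"


rows = [
    {"id": 100, "name": "Eric"},
    {"id": 101, "name": "Julia"},
]

stamped = add_load_partition_date(rows, date(2025, 1, 31))

# Group row ids by the directory they would be written under.
by_dir = defaultdict(list)
for row in stamped:
    key = partition_dir("load_partition_date", row["load_partition_date"])
    by_dir[key].append(row["id"])

print(dict(by_dir))
```

A query that filters on `load_partition_date` only needs to read the matching directories, which is why a partition column helps with efficient queries.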

Let's try adding more data from a CSV file, `somefile.csv`, that looks like this.

```csv
id,date,name,age
201,2025-01-04,Nolan,30
202,2025-01-05,Amy,
203,2025-01-06,Dan,25
```
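
To follow along, the file can be created with Python's standard library. This sketch simply writes the listing above to `somefile.csv` in the current directory; the variable names are only for illustration.

```python
import csv
from pathlib import Path

rows = [
    ["id", "date", "name", "age"],
    [201, "2025-01-04", "Nolan", 30],
    [202, "2025-01-05", "Amy", ""],  # empty age; the age column is nullable
    [203, "2025-01-06", "Dan", 25],
]

csv_path = Path("somefile.csv")
with csv_path.open("w", newline="") as f:
    # Use "\n" line endings so the file matches the listing above.
    csv.writer(f, lineterminator="\n").writerows(rows)

print(csv_path.read_text())
```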

In [10]:
client.append("./somefile.csv", "my_dataset")

client.query("""

    SELECT * FROM lc.my_dataset

""").df()

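|   | id | date | name | age | load_partition_date |
|---|-----|------------|-------|------|------------|
| 0 | 201 | 2025-01-04 | Nolan | 30.0 | 2025-01-31 |
| 1 | 202 | 2025-01-05 | Amy |  | 2025-01-31 |
| 2 | 203 | 2025-01-06 | Dan | 25.0 | 2025-01-31 |
| 3 | 100 | 2025-01-01 | Eric | 30.0 | 2025-01-31 |
| 4 | 101 | 2025-01-02 | Julia | 40.0 | 2025-01-31 |
| 5 | 102 | 2025-01-03 | Mark |  | 2025-01-31 |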


### Using Cloud Object Storage

Usually we will want to store our data in cloud object storage for better connectivity to other services.

This can be done in pretty much the same way as the local example.

Let's first define a new client. I'll use DigitalOcean Spaces as an example.

In [5]:
from spaceship import Client

client = Client(
    access_key="<your-access-key>",  # This can be set with ACCESS_KEY env variable
    secret_key="<your-secret-key>",  # This can be set with SECRET_KEY env variable
    region="nyc3",
    endpoint="digitaloceanspaces.com",
)
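
Since the keys can come from the `ACCESS_KEY` and `SECRET_KEY` environment variables, it can be handy to check that both are set before building the client. The sketch below is one way to do that with the standard library; `load_credentials` is a hypothetical helper, not part of Spaceship, and the `Client` call is left as a comment so the snippet runs without the library installed.

```python
import os


def load_credentials(env=os.environ):
    """Read both keys from the environment, failing early if either is missing."""
    missing = [name for name in ("ACCESS_KEY", "SECRET_KEY") if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return {"access_key": env["ACCESS_KEY"], "secret_key": env["SECRET_KEY"]}


# Demonstration with a fake environment so the sketch runs anywhere.
fake_env = {"ACCESS_KEY": "example-access", "SECRET_KEY": "example-secret"}
creds = load_credentials(fake_env)
print(sorted(creds))

# With the real variables exported, the values could be passed to the client
# using the parameters shown above:
# client = Client(**load_credentials(), region="nyc3", endpoint="digitaloceanspaces.com")
```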

We will need to create a new bucket first. Here I've created a bucket called `spaceshiptestbucket`.

Let's list datasets in the bucket.

In [6]:
client.list_datasets(bucket="spaceshiptestbucket")

[]

There is nothing there as this is a new bucket. Let's create a new dataset. We can start by defining a PyArrow schema.

In [7]:
import pyarrow as pa

fields = [
    pa.field("id", pa.int64(), metadata={"description": "product id"}, nullable=False),
    pa.field("product", pa.string(), metadata={"description": "product name"}, nullable=False),
    pa.field("price", pa.float64(), metadata={"description": "product price"}, nullable=False),
    pa.field("quantity", pa.int64(), metadata={"description": "product quantity"}, nullable=False),
    pa.field("company", pa.string(), metadata={"description": "product maker"}, nullable=False),
]

schema = pa.schema(fields)
schema

id: int64 not null
  -- field metadata --
  description: 'product id'
product: string not null
  -- field metadata --
  description: 'product name'
price: double not null
  -- field metadata --
  description: 'product price'
quantity: int64 not null
  -- field metadata --
  description: 'product quantity'
company: string not null
  -- field metadata --
  description: 'product maker'

Now we can create a dataset with bucket name provided.

In [9]:
client.create_dataset(
    "product_dataset",  # dataset name only for object storage
    schema=schema,
    description="Product dataset",
    bucket="spaceshiptestbucket",
    partition_columns=["company"],  # Here I will define company as a partition column
    constraints={                   # We can define constraints to be enforced as well
        "price_non_negative": "price >= 0",
        "quantity_non_negative": "quantity >= 0",
    }
)

client.list_datasets(bucket="spaceshiptestbucket")

['product_dataset']
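
Each constraint above is a named SQL boolean expression that every row is expected to satisfy. To illustrate what expressions such as `price >= 0` mean in practice, the sketch below uses Python's built-in `sqlite3` module as a stand-in: SQLite `CHECK` constraints reject rows for which the expression is false. How Spaceship reports a violation is not covered in this guide, so treat this only as a picture of the semantics.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE product (
        id INTEGER NOT NULL,
        price REAL NOT NULL,
        quantity INTEGER NOT NULL,
        CONSTRAINT price_non_negative CHECK (price >= 0),
        CONSTRAINT quantity_non_negative CHECK (quantity >= 0)
    )
    """
)

# A row that satisfies both constraints is accepted.
conn.execute("INSERT INTO product VALUES (1, 100.50, 13)")

# A row with a negative price violates price_non_negative and is rejected.
try:
    conn.execute("INSERT INTO product VALUES (2, -5.0, 1)")
    rejected = False
except sqlite3.IntegrityError as exc:
    rejected = True
    print("rejected:", exc)

# Only the valid row was stored.
row_count = conn.execute("SELECT COUNT(*) FROM product").fetchone()[0]
print(row_count)
```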

Now that the dataset is created, we can add data the same way we did with the local dataset.

Let's create a pandas DataFrame to add data there, but this time I will store it as a Parquet file and provide the file path instead.

In [10]:
import pandas as pd

pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "product": ['chair', "table", "laptop", "lamp"],
        "price": [100.50, 220.26, 549.0, 59.99],
        "quantity": [13, 5, 10, 32],
        "company": ["AA", "AA", "BB", "CC"]
    }
).to_parquet("product.parquet")

Then we can provide the file path to add data to the dataset.

In [11]:
client.append(
    "./product.parquet", 
    "product_dataset", 
    bucket="spaceshiptestbucket"
)

And we can query the data to check as usual.

In [16]:
client.query("""

    SELECT * FROM do.spaceshiptestbucket.product_dataset  /* do means DigitalOcean. */

""").df()

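|   | id | product | price | quantity | company |
|---|----|---------|--------|----------|---------|
| 0 | 3 | laptop | 549.0 | 10 | BB |
| 1 | 4 | lamp | 59.99 | 32 | CC |
| 2 | 1 | chair | 100.5 | 13 | AA |
| 3 | 2 | table | 220.26 | 5 | AA |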


Notes:
- If a bucket name or dataset name contains `-` or any other invalid SQL character, we can wrap the name in `"`, such as `do."some_invalid_n@ame".my_dataset`.
- Spaceship can push a file larger than memory to a dataset by partitioning the data automatically.
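
When bucket or dataset names come from variables, the quoting rule in the first note can be applied programmatically. The following is a minimal sketch with hypothetical helpers (`quote_identifier` and `table_reference` are not Spaceship functions): plain names are left as-is, anything else is wrapped in double quotes, and embedded double quotes are doubled as in standard SQL.

```python
import re

# Names made only of letters, digits, and underscores (not starting with a digit)
# are used unquoted; everything else is wrapped in double quotes.
_PLAIN_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def quote_identifier(name):
    """Quote a SQL identifier if needed, doubling any embedded double quotes."""
    if _PLAIN_IDENTIFIER.match(name):
        return name
    return '"' + name.replace('"', '""') + '"'


def table_reference(prefix, *parts):
    """Build a dotted reference from a prefix such as lc or do plus name parts."""
    return ".".join([prefix, *(quote_identifier(p) for p in parts)])


plain_ref = table_reference("do", "spaceshiptestbucket", "product_dataset")
quoted_ref = table_reference("do", "some-bucket", "my_dataset")
print(plain_ref)
print(quoted_ref)
```

This sketch does not handle names that collide with SQL reserved words; those would also need quoting.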