# Introduction to TileDB Arrays

* This notebook is part 1 of the FOSS4G workshop "Universal data management for all geospatial data in TileDB"
* Find the notebook on [GitHub]() or [TileDB Cloud]()

## Outline

This notebook contains the following examples:

* [Simple table](#table)
    * [Load a table](#table1)
    * [Convert the table to a TileDB array](#table2)
    * [Explore the TileDB array](#table3)
* [Dense array](#dense)
* [Sparse array](#sparse)
* [Raster map](#raster)
* [Timeseries](#time)
* [TileDB Cloud](#cloud)  

In [1]:
import numpy as np
import pandas as pd

import tiledb

<a id="table"></a>
## A simple table

<a id="table1"></a>
### Load a table 

> The [original dataset](https://simplemaps.com/data/world-cities) is cleaned up in [this notebook]()

In [2]:
capitals = pd.read_csv("./data/capitals.csv")
capitals.head()

Unnamed: 0.1,Unnamed: 0,city,lat,lon,country,iso3,population
0,0,Tokyo,35.6897,139.6922,Japan,JPN,37977000.0
1,1,Jakarta,-6.2146,106.8451,Indonesia,IDN,34540000.0
2,4,Manila,14.5958,120.9772,Philippines,PHL,23088000.0
3,7,Seoul,37.5833,127.0,"Korea, South",KOR,21794000.0
4,8,Mexico City,19.4333,-99.1333,Mexico,MEX,20996000.0


<a id="table2"></a>
### Convert the table to a TileDB array

With [pandas](https://tiledb-inc-tiledb.readthedocs-hosted.com/projects/tiledb-py/en/stable/python-api.html#tiledb.from_pandas):

In [3]:
uri = "arrays/capitals1"

tiledb.from_pandas(uri, capitals)

Or directly [from the csv file](https://tiledb-inc-tiledb.readthedocs-hosted.com/projects/tiledb-py/en/stable/python-api.html#tiledb.from_csv):

In [4]:
uri = "arrays/capitals2"

tiledb.from_csv(uri, "data/capitals.csv")

That is all! You have created your first TileDB array! 

<a id="table3"></a>
### Explore the TileDB array

But what do these arrays look like now? And how can you work with the data in them?

> Find more info about the TileDB format specification [here](https://github.com/TileDB-Inc/TileDB/blob/dev/format_spec/FORMAT_SPEC.md) or in the [docs](https://docs.tiledb.com/main/basic-concepts/data-format)

An array is stored in a directory. For `capitals1` this looks like: 

In [5]:
%ls arrays/capitals1

__1627395398865_1627395398865_15dc744a67264743a498dfa3ff4c74ba_8/
__1627395398865_1627395398865_15dc744a67264743a498dfa3ff4c74ba_8.ok
__array_schema.tdb
__lock.tdb
__meta/


Also have a look inside the first folder. Its name contains a timestamp and a unique identifier, so it differs for every run: **update the path in the cell below to match the folder name on your system**:

In [6]:
%ls arrays/capitals1/__1626970989823_1626970989823_b9cf74a787674969adb4664bbc1034ef_8/

ls: arrays/capitals1/__1626970989823_1626970989823_b9cf74a787674969adb4664bbc1034ef_8/: No such file or directory


In the listing for your own folder you will recognise the column names from the table: each column is stored in its own file.

An array is defined by its schema, which describes the dimensions, the attributes with their data types, and how the cells are laid out on disk.

Load the schema:

In [7]:
uri = "arrays/capitals1"
A = tiledb.open(uri)
print(A.schema)

ArraySchema(
  domain=Domain(*[
    Dim(name='__tiledb_rows', domain=(0, 207), tile='207', dtype='uint64'),
  ]),
  attrs=[
    Attr(name='Unnamed: 0', dtype='int64', var=False, nullable=False),
    Attr(name='city', dtype='<U0', var=True, nullable=False),
    Attr(name='lat', dtype='float64', var=False, nullable=False),
    Attr(name='lon', dtype='float64', var=False, nullable=False),
    Attr(name='country', dtype='<U0', var=True, nullable=False),
    Attr(name='iso3', dtype='<U0', var=True, nullable=False),
    Attr(name='population', dtype='float64', var=False, nullable=False),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=10000,
  sparse=False,
  coords_filters=FilterList([ZstdFilter(level=-1)]),
)



Load all data:

Load a slice of the data:

Load filtered data, etc...

Note `sparse=False` in the schema above: the table was written as a dense array, with the row number as its only dimension. Dense is one of the two types of arrays:

* dense arrays
* sparse arrays

Let's first explore dense arrays before coming back to sparse arrays later in this notebook.

<a id="dense"></a>
## Dense arrays

A dense array contains values for every cell. Let's start by creating a 4 by 4 array with some data:

In [8]:
# Use int32 so the values match the attribute dtype defined in the schema below
data = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 10, 11, 12],
                 [13, 14, 15, 16]], dtype=np.int32)

To write this data to a dense array, first create the [Schema](https://tiledb-inc-tiledb-py.readthedocs-hosted.com/en/stable/python-api.html#array-schema). It combines the [Domain](https://tiledb-inc-tiledb-py.readthedocs-hosted.com/en/stable/python-api.html#domain), whose [Dimensions](https://tiledb-inc-tiledb-py.readthedocs-hosted.com/en/stable/python-api.html#dimension) define the shape and size of the array, with the [Attributes](https://tiledb-inc-tiledb-py.readthedocs-hosted.com/en/stable/python-api.html#tiledb.Attr), which define the data type of the stored values:

In [9]:
rows = tiledb.Dim(name="rows", domain=(1, 4), tile=4, dtype=np.int32)
cols = tiledb.Dim(name="cols", domain=(1, 4), tile=4, dtype=np.int32)

dom = tiledb.Domain(rows, cols)
attr = tiledb.Attr(name="a", dtype=np.int32)

schema = tiledb.ArraySchema(domain=dom, attrs=[attr])
print(schema)

ArraySchema(
  domain=Domain(*[
    Dim(name='rows', domain=(1, 4), tile='4', dtype='int32'),
    Dim(name='cols', domain=(1, 4), tile='4', dtype='int32'),
  ]),
  attrs=[
    Attr(name='a', dtype='int32', var=False, nullable=False),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=10000,
  sparse=False,
  coords_filters=FilterList([ZstdFilter(level=-1)]),
)



Now, create the (empty) array on disk:

In [10]:
array_dense = "arrays/dense"
tiledb.Array.create(array_dense, schema)

And then [open](https://tiledb-inc-tiledb.readthedocs-hosted.com/projects/tiledb-py/en/stable/python-api.html?highlight=uri#tiledb.open) and write the data to the array:

In [11]:
with tiledb.open(array_dense, mode="w") as A:
    A[:] = data

Now you can read all data:

In [12]:
with tiledb.open(array_dense, mode="r") as A:
    data = A[:, :]
    print(data)

OrderedDict([('a', array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12],
       [13, 14, 15, 16]], dtype=int32))])


Or read a subset or slice. For instance, only read rows 1 and 2 and columns 2, 3 and 4, and print the values of `a`:

In [13]:
with tiledb.open(array_dense, mode="r") as A:
    data = A[1:3, 2:5]
    print(data["a"])

[[2 3 4]
 [6 7 8]]
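
The slice bounds follow Python's half-open convention, but they are coordinates in the domain `(1, 4)` rather than zero-based positions. A NumPy-only sketch of that mapping, assuming the 4 by 4 values written earlier (the names `dense_values`, `origin` and `window` are illustrative):

```python
import numpy as np

dense_values = np.arange(1, 17).reshape(4, 4)  # same values as written to the array
origin = 1  # lower bound of both dimensions in the schema

# TileDB coordinates [1:3, 2:5] become zero-based positions [0:2, 1:4].
window = dense_values[1 - origin:3 - origin, 2 - origin:5 - origin]
print(window)
# [[2 3 4]
#  [6 7 8]]
```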


<a id="sparse"></a>
## Sparse arrays

A TileDB sparse array does not require a value for every cell. Before writing any data, first define the schema of the sparse array. The only difference compared to the dense array is that you now add `sparse=True` (the default is `False`):

In [14]:
rows = tiledb.Dim(name="rows", domain=(1, 4), tile=4, dtype=np.int32)
cols = tiledb.Dim(name="cols", domain=(1, 4), tile=4, dtype=np.int32)

dom = tiledb.Domain(rows, cols)
attr = tiledb.Attr(name="a", dtype=np.int32)

schema = tiledb.ArraySchema(domain=dom, sparse=True, attrs=[attr])
print(schema)

ArraySchema(
  domain=Domain(*[
    Dim(name='rows', domain=(1, 4), tile='4', dtype='int32'),
    Dim(name='cols', domain=(1, 4), tile='4', dtype='int32'),
  ]),
  attrs=[
    Attr(name='a', dtype='int32', var=False, nullable=False),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=10000,
  sparse=True,
  allows_duplicates=False,
  coords_filters=FilterList([ZstdFilter(level=-1)]),
)



The next step is to create the (empty) array on disk, and then open it and write data to it. Let's write the three values in `data` to the three cells whose row and column coordinates are given by `I` and `J`:

In [15]:
array_sparse = "arrays/sparse"

tiledb.SparseArray.create(array_sparse, schema)

with tiledb.open(array_sparse, mode="w") as A:
    # I holds the row coordinates, J the column coordinates of the three cells
    I, J = [1, 2, 2], [1, 4, 3]
    data = np.array([1, 2, 3], dtype=np.int32)
    A[I, J] = data

That is it, you have now also created a TileDB sparse array! 
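
To picture which cells were written, the same coordinates can be scattered into a 4 by 4 NumPy grid. This is only a visual aid, not how TileDB stores sparse data. Here `0` marks an empty cell, and the names `grid`, `values` and `origin` are illustrative.

```python
import numpy as np

I, J = [1, 2, 2], [1, 4, 3]
values = np.array([1, 2, 3])
origin = 1  # both dimensions start at coordinate 1

grid = np.zeros((4, 4), dtype=np.int32)  # 0 stands in for "no value"
grid[np.array(I) - origin, np.array(J) - origin] = values
print(grid)
# [[1 0 0 0]
#  [0 0 3 2]
#  [0 0 0 0]
#  [0 0 0 0]]
```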

Read all data from a sparse array in the exact same way as reading it from a dense array:

In [16]:
with tiledb.open(array_sparse, mode="r") as A:
    data = A[:,:]

print(data)

OrderedDict([('a', array([1, 3, 2], dtype=int32)), ('rows', array([1, 2, 2], dtype=int32)), ('cols', array([1, 3, 4], dtype=int32))])


Notice that this result differs from the dense one, where `data` only contained the values for `a` because the coordinates of the cells follow from the schema. For the sparse array, `data` also contains the values of the dimensions `rows` and `cols`. Iterate over the data to print the values for each cell:

In [17]:
for row, col, value in zip(data["rows"], data["cols"], data["a"]):
    print("Cell (%d, %d) has data %d" % (row, col, value))

Cell (1, 1) has data 1
Cell (2, 3) has data 3
Cell (2, 4) has data 2
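
Because the result is a dictionary of equally long NumPy arrays, it also converts directly to a pandas DataFrame. A sketch with a stand-in for `data`, copied from the output above so that it runs without the array on disk (the names `result` and `cells` are illustrative):

```python
from collections import OrderedDict

import numpy as np
import pandas as pd

# Stand-in for the result of A[:, :] on the sparse array.
result = OrderedDict([
    ("a", np.array([1, 3, 2], dtype=np.int32)),
    ("rows", np.array([1, 2, 2], dtype=np.int32)),
    ("cols", np.array([1, 3, 4], dtype=np.int32)),
])

# One row per non-empty cell, indexed by its coordinates.
cells = pd.DataFrame(result).set_index(["rows", "cols"])
print(cells)
```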


Finally, slice a subset of the sparse array in exactly the same way as for the dense array. This now returns a single cell with a single value, because all other cells in the slice are empty:

In [18]:
with tiledb.open(array_sparse, mode="r") as A:
    data = A[1:3, 1:3]

print(data["a"])

[1]
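
Which cells fall inside a slice can be checked by hand with a boolean mask over the coordinates. The sketch uses the coordinates and values read back from the sparse array earlier. Like the slice, it excludes the upper bounds. The name `inside` is illustrative.

```python
import numpy as np

a = np.array([1, 3, 2])
rows = np.array([1, 2, 2])
cols = np.array([1, 3, 4])

# Keep cells with 1 <= row < 3 and 1 <= col < 3, mirroring A[1:3, 1:3].
inside = (rows >= 1) & (rows < 3) & (cols >= 1) & (cols < 3)
print(a[inside])
# [1]
```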
