# II. Use SpatialData with your data: SpatialElements and tables

In the previous notebook we saw how to load and save `SpatialData` objects from Zarr, how to construct them from scratch (assuming that the `SpatialElement`s and tables are available) and how to manipulate them.

In this notebook we show how to create the `SpatialElement`s and tables themselves from scratch. We will see this for images, labels, points, shapes (circles, polygons/multipolygons) and tables.

This notebook is technically detailed; if you prefer a more practical introduction first, we suggest skipping the text and reading only the commented code.

## Data models: validation and parsing

In order to represent `SpatialElement`s and annotation tables, we decided not to introduce new Python classes. Instead, we opted for representing the data using existing classes that are already widely developed and used. Still, we sometimes needed extra structure for those objects. 

To accomplish this, for each type of element (e.g. images) that we support, we provide two functions: a validation function and a parser.

- The validation function determines if the element adheres to the extra structure that we need. Practically, when a `SpatialData` object is constructed, or when a `SpatialElement` or an annotation table is added to an existing `SpatialData` object, the object will be *validated*. If validation fails, the user will receive an error message with instructions on how to fix the problem.
- The parser function takes input data in various formats, converts them to the standard representation for the element and returns an object that is always guaranteed to be valid.
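
To make the division of labour between the two functions concrete, here is a minimal sketch of the pattern in plain Python. `ToyPointsModel` is a hypothetical class written only for illustration; it is not part of `spatialdata` and does not reflect how the real models are implemented.

```python
from typing import Any


class ToyPointsModel:
    """Hypothetical model: points stored as a dict with 'x' and 'y' lists."""

    REQUIRED = ("x", "y")

    @classmethod
    def validate(cls, data: Any) -> None:
        # Raise a descriptive error when the required structure is missing.
        if not isinstance(data, dict):
            raise ValueError(f"Unsupported data type: {type(data)}. Use ToyPointsModel.parse().")
        missing = [key for key in cls.REQUIRED if key not in data]
        if missing:
            raise ValueError(f"Missing columns: {missing}")
        if len(data["x"]) != len(data["y"]):
            raise ValueError("Columns 'x' and 'y' must have the same length")

    @classmethod
    def parse(cls, data: Any) -> dict:
        # Accept an already structured dict or an iterable of (x, y) pairs.
        if isinstance(data, dict):
            parsed = {key: list(values) for key, values in data.items()}
        else:
            pairs = list(data)
            parsed = {
                "x": [float(pair[0]) for pair in pairs],
                "y": [float(pair[1]) for pair in pairs],
            }
        cls.validate(parsed)  # whatever the parser returns has passed validation
        return parsed


raw = [(1, 1), (2, 3), (4, 5)]

# The raw list does not pass validation...
try:
    ToyPointsModel.validate(raw)
    raw_is_valid = True
except ValueError:
    raw_is_valid = False

# ...but the parsed object does.
toy_points = ToyPointsModel.parse(raw)
ToyPointsModel.validate(toy_points)
print(raw_is_valid, toy_points)
```

The parser calls the validator on its own output, which is one simple way to guarantee that a parsed object is always valid.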

Before diving into the details, let's show a simple example of how to parse and validate a 2D image.

In [29]:
from scipy.datasets import face

# raw data
image = face()
print(type(image))
image.shape

<class 'numpy.ndarray'>


(768, 1024, 3)

In [144]:
import pytest
from spatialdata.models import Image2DModel

# numpy arrays do not pass validation
with pytest.raises(ValueError, match="Unsupported data type: <class 'numpy.ndarray'>."):
    Image2DModel().validate(image)

# let's parse the data
parsed_image = Image2DModel.parse(image, dims=("y", "x", "c"))

# now passes validation (=can be placed inside a SpatialData object)
Image2DModel().validate(parsed_image)

# parsed_image is a regular spatial_image.SpatialImage object (discussed later), and as such it can be used outside
# the SpatialData library (e.g. with the libraries dask-image, xarray-spatial, ...)
parsed_image

INFO     Transposing `data` of type: <class 'dask.array.core.Array'> to ('c', 'y', 'x').

| | Array | Chunk |
|---|---|---|
| Bytes | 2.25 MiB | 2.25 MiB |
| Shape | (3, 768, 1024) | (3, 768, 1024) |
| Dask graph | 1 chunks in 2 graph layers | 1 chunks in 2 graph layers |
| Data type | uint8 numpy.ndarray | uint8 numpy.ndarray |


Before delving into the details of each specific element type, let's first see some common characteristics of the data models.

- **The correct syntax for single-dispatch**. Most of the parsers and some of the validators are implemented using Python's single-dispatch construct, which automatically calls an appropriate secondary function based on the type of the first argument. As such, the first argument must be passed positionally. In other words, calling `Image2DModel.parse(data=image)` is incorrect, while `Image2DModel.parse(image)` is the correct syntax.
- **Passing coordinate transformations**. All the parsers, except for the one for tables, accept the argument `transformations`, to specify a dict of coordinate transformations for the element.
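
The positional-argument requirement of single-dispatch can be reproduced with the standard library alone. The sketch below uses `functools.singledispatch` on a hypothetical function `toy_parse`; it illustrates the mechanism and is not the actual `spatialdata` parser.

```python
from functools import singledispatch


@singledispatch
def toy_parse(data, dims=None):
    # Fallback used when no implementation is registered for type(data).
    raise ValueError(f"Unsupported data type: {type(data)}")


@toy_parse.register
def _(data: list, dims=None):
    return ("list branch", dims)


@toy_parse.register
def _(data: tuple, dims=None):
    return ("tuple branch", dims)


# The type of the first positional argument selects the implementation.
positional = toy_parse([1, 2, 3], dims=("x",))
print(positional)

# Passing the first argument by keyword leaves nothing to dispatch on.
try:
    toy_parse(data=[1, 2, 3], dims=("x",))
    keyword_error = None
except TypeError as err:
    keyword_error = str(err)
print(keyword_error)
```

Dispatch inspects only `args[0]`, so a call that passes everything by keyword fails before any registered implementation is reached.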

In [154]:
from spatialdata.transformations import Scale, get_transformation, remove_transformation

print(type(image))
parsed_image = Image2DModel.parse(image, dims=("y", "x", "c"))

# the default transformation is an identity
get_transformation(parsed_image)

# to set custom transformations, let's first remove the default one
remove_transformation(parsed_image)

# to add new ones, let's re-parse parsed_image; note that parsed_image is now a SpatialImage, not a numpy.ndarray as before.
# This is not a problem, as the single-dispatch construct takes care of handling the different data types.
# There is no need to pass dims now, as they are already specified in the parsed_image object
parsed_image = Image2DModel.parse(
    parsed_image,
    transformations={"scale_space": Scale([2.0], axes=("x",)), "another_space": Scale([2.0, 3.0], axes=("y", "x"))},
)
print(type(parsed_image))

# let's check that the axes are correct
from spatialdata.models import get_axes_names

get_axes_names(parsed_image)

<class 'numpy.ndarray'>
INFO     Transposing `data` of type: <class 'dask.array.core.Array'> to ('c', 'y', 'x').
<class 'spatial_image.SpatialImage'>


('c', 'y', 'x')
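
As an intuition for the two `Scale` transformations passed above, the following toy sketch applies a per-axis scaling to a single coordinate. `apply_scale` is a hypothetical helper, not a `spatialdata` function, and it assumes that axes not listed in the transformation are left unchanged.

```python
def apply_scale(point, factors, axes):
    """Multiply each listed axis by its factor; other axes are kept as-is (assumption)."""
    scale = dict(zip(axes, factors))
    return {axis: value * scale.get(axis, 1.0) for axis, value in point.items()}


pixel = {"y": 10.0, "x": 20.0}

# Analogue of Scale([2.0], axes=("x",)), registered as "scale_space" above
in_scale_space = apply_scale(pixel, [2.0], ("x",))

# Analogue of Scale([2.0, 3.0], axes=("y", "x")), registered as "another_space" above
in_another_space = apply_scale(pixel, [2.0, 3.0], ("y", "x"))

print(in_scale_space)    # {'y': 10.0, 'x': 40.0}
print(in_another_space)  # {'y': 20.0, 'x': 60.0}
```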

## Images

We support 2D and 3D images. The axes (in this order) for 2D images are `('c', 'y', 'x')`. The axes for 3D images are `('c', 'z', 'y', 'x')`. 
We support both single-scale images (aka regular images) and multi-scale images (i.e. the same image downscaled multiple times).

We use the classes `spatial_image.SpatialImage` and `multiscale_spatial_image.MultiscaleSpatialImage` respectively for single-scale and multi-scale images.

The `SpatialImage` class is a subclass of the popular `xarray.DataArray` class. The `MultiscaleSpatialImage` class is a subclass of `datatree.DataTree`.

> 📝 **Technical note**
> 
> Future versions of `spatialdata` will see a simplification of how images are stored. In detail:
> - starting from `spatial_image==1.0.0`, the `spatial_image` library uses `xarray.DataArray` directly instead of a subclass;
> - similarly, from `multiscale_spatial_image==1.0.0` the `multiscale_spatial_image` library uses `datatree.DataTree` directly instead of a subclass;
> - the `datatree.DataTree` class is currently being moved to the `xarray` package; in the future it will be accessible as `xarray.DataTree`, without the need for the `datatree` package.

Since the above objects are `xarray` objects, they inherit the benefits of `xarray`. In particular:
- they contain coordinates;
- they can represent the data in chunks;
- they support storing lazy-loaded data with Dask.

> 📝 **Technical note**
> 
> Currently, the `xarray` coordinates are not transformed by coordinate transformations. This is tracked in https://github.com/scverse/spatialdata/issues/308 and will be addressed in a future release.

The examples below cover the main use cases; please refer to the documentation for all the details.

In [35]:
# when the dimensions of the input data are not ordered as ('c', 'y', 'x'), we can use the dims argument
# of .parse() to transpose the data

# image is ('y', 'x', 'c')
Image2DModel.parse(image, dims=("y", "x", "c"))

INFO     Transposing `data` of type: <class 'dask.array.core.Array'> to ('c', 'y', 'x').

| | Array | Chunk |
|---|---|---|
| Bytes | 2.25 MiB | 2.25 MiB |
| Shape | (3, 768, 1024) | (3, 768, 1024) |
| Dask graph | 1 chunks in 2 graph layers | 1 chunks in 2 graph layers |
| Data type | uint8 numpy.ndarray | uint8 numpy.ndarray |


As we briefly saw before, the parser accepts different data types. Currently supported types are `numpy.ndarray`, `xarray.DataArray` and `dask.array.core.Array`.

In [160]:
# types supported by the single-dispatch construct of the parser

# numpy array (we saw in examples before)
print(type(image))

# xarray data arrays (we saw it in an example before). Remember that spatial_image.SpatialImage is a subclass of xarray.DataArray
print(type(parsed_image))

# dask ("lazy") arrays
from dask.array import from_array

lazy_array = from_array(image)
print(type(lazy_array))
Image2DModel.parse(lazy_array, dims=("y", "x", "c"))

<class 'numpy.ndarray'>
<class 'spatial_image.SpatialImage'>
<class 'dask.array.core.Array'>
INFO     Transposing `data` of type: <class 'dask.array.core.Array'> to ('c', 'y', 'x').

| | Array | Chunk |
|---|---|---|
| Bytes | 2.25 MiB | 2.25 MiB |
| Shape | (3, 768, 1024) | (3, 768, 1024) |
| Dask graph | 1 chunks in 2 graph layers | 1 chunks in 2 graph layers |
| Data type | uint8 numpy.ndarray | uint8 numpy.ndarray |


In [46]:
# we can specify the chunk sizes of the image with the argument `chunks`
# warning: this currently doesn't work because of a bug https://github.com/scverse/spatialdata/issues/406
parsed = Image2DModel.parse(image, dims=("y", "x", "c"), chunks=(1, 100, 100))

# please use this workaround instead
parsed.data = parsed.data.rechunk((1, 100, 100))
parsed

INFO     Transposing `data` of type: <class 'dask.array.core.Array'> to ('c', 'y', 'x').

| | Array | Chunk |
|---|---|---|
| Bytes | 2.25 MiB | 9.77 kiB |
| Shape | (3, 768, 1024) | (1, 100, 100) |
| Dask graph | 264 chunks in 3 graph layers | 264 chunks in 3 graph layers |
| Data type | uint8 numpy.ndarray | uint8 numpy.ndarray |
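
The chunk count and chunk size reported above follow directly from the array shape and the requested chunk shape. A minimal sketch of the arithmetic, using only the standard library (`count_chunks` is a hypothetical helper, not a Dask function):

```python
import math


def count_chunks(shape, chunks):
    """Number of chunks needed to tile `shape`; edge chunks may be smaller."""
    return math.prod(math.ceil(size / chunk) for size, chunk in zip(shape, chunks))


n_chunks = count_chunks((3, 768, 1024), (1, 100, 100))

# One full chunk holds 1 * 100 * 100 uint8 values, i.e. one byte each.
chunk_kib = round(1 * 100 * 100 / 1024, 2)

print(n_chunks)   # 264
print(chunk_kib)  # 9.77
```

Along `y` and `x` the image is not an exact multiple of 100, so the last chunk along each of these axes is smaller than the requested size.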


In [81]:
# the c coordinates can be set by passing the c_coords argument
parsed = Image2DModel.parse(image, dims=("y", "x", "c"), c_coords=["r", "g", "b"])
display(parsed)

INFO     Transposing `data` of type: <class 'dask.array.core.Array'> to ('c', 'y', 'x').

| | Array | Chunk |
|---|---|---|
| Bytes | 2.25 MiB | 2.25 MiB |
| Shape | (3, 768, 1024) | (3, 768, 1024) |
| Dask graph | 1 chunks in 2 graph layers | 1 chunks in 2 graph layers |
| Data type | uint8 numpy.ndarray | uint8 numpy.ndarray |


In [84]:
# let's access the c coords directly using the xarray syntax
print(parsed.c.values)

['r' 'g' 'b']


In [65]:
# a multiscale image can be derived by passing a list of scale factors (chunking works here)
msi = Image2DModel.parse(image, dims=("y", "x", "c"), chunks=(1, 100, 100), scale_factors=[2, 2])
print(msi)

INFO     Transposing `data` of type: <class 'dask.array.core.Array'> to ('c', 'y', 'x').
DataTree('None', parent=None)
├── DataTree('scale0')
│       Dimensions:  (c: 3, y: 768, x: 1024)
│       Coordinates:
│         * c        (c) int64 0 1 2
│         * y        (y) float64 0.5 1.5 2.5 3.5 4.5 ... 763.5 764.5 765.5 766.5 767.5
│         * x        (x) float64 0.5 1.5 2.5 3.5 ... 1.022e+03 1.022e+03 1.024e+03
│       Data variables:
│           image    (c, y, x) uint8 dask.array<chunksize=(1, 100, 100), meta=np.ndarray>
├── DataTree('scale1')
│       Dimensions:  (c: 3, y: 384, x: 512)
│       Coordinates:
│         * c        (c) int64 0 1 2
│         * y        (y) float64 1.0 3.0 5.0 7.0 9.0 ... 759.0 761.0 763.0 765.0 767.0
│         * x        (x) float64 1.0 3.0 5.0 7.0 ... 1.019e+03 1.021e+03 1.023e+03
│       Data variables:
│           image    (c, y, 

In [74]:
# note: here scale_factors=[2, 2] gives an object with 3 scales:
# scale0: full resolution
# scale1: scale0 downscaled by 2
# scale2: scale0 downscaled by 4 (i.e. scale1 downscaled by 2)
#
# using for instance scale_factors = [1, 2, 4, 8, 16] is incorrect, as it would lead to
# scale0: full resolution
# scale1: full resolution
# scale2: scale0 downscaled by 2
# scale3: scale0 downscaled by 8
# scale4: scale0 downscaled by 64
# scale5: scale0 downscaled by 1024

# let's list the scales
print(list(msi.keys()))

['scale0', 'scale1', 'scale2']
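
The cumulative behaviour of `scale_factors` described in the comments above can be checked with a few lines of plain Python. The helpers below are hypothetical and only reproduce the arithmetic; the use of integer division for the shapes is an assumption that happens to match the shapes printed in this notebook.

```python
from itertools import accumulate
from operator import mul


def cumulative_factors(scale_factors):
    """Downscaling of each scale relative to scale0 (which always has factor 1)."""
    return [1, *accumulate(scale_factors, mul)]


def first_pixel_centers(factor, count=3):
    """Coordinates of the first pixel centers at a given cumulative factor."""
    return [(i + 0.5) * factor for i in range(count)]


good = cumulative_factors([2, 2])
bad = cumulative_factors([1, 2, 4, 8, 16])

# (y, x) shape of each scale for the 768 x 1024 image used in this notebook
shapes = [(768 // factor, 1024 // factor) for factor in good]

print(good)    # [1, 2, 4]
print(bad)     # [1, 1, 2, 8, 64, 1024]
print(shapes)  # [(768, 1024), (384, 512), (192, 256)]
print(first_pixel_centers(2))  # [1.0, 3.0, 5.0]
```

The pixel centers for factor 2 match the `y` and `x` coordinates printed for `scale1`, which start at 1.0 and advance in steps of 2.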


In [75]:
# let's access scale1. This is a DataTree
print(msi["scale1"])

DataTree('scale1', parent="None")
    Dimensions:  (c: 3, y: 384, x: 512)
    Coordinates:
      * c        (c) int64 0 1 2
      * y        (y) float64 1.0 3.0 5.0 7.0 9.0 ... 759.0 761.0 763.0 765.0 767.0
      * x        (x) float64 1.0 3.0 5.0 7.0 ... 1.019e+03 1.021e+03 1.023e+03
    Data variables:
        image    (c, y, x) uint8 dask.array<chunksize=(1, 100, 100), meta=np.ndarray>


In [85]:
# we are interested in the `image` contained in it
msi["scale1"]["image"]

| | Array | Chunk |
|---|---|---|
| Bytes | 576.00 kiB | 9.77 kiB |
| Shape | (3, 384, 512) | (1, 100, 100) |
| Dask graph | 72 chunks in 8 graph layers | 72 chunks in 8 graph layers |
| Data type | uint8 numpy.ndarray | uint8 numpy.ndarray |


### Lazy loading of images

The parser returns a `SpatialImage`/`MultiscaleSpatialImage` object that internally relies on a lazy representation using Dask. You can use this to construct lazy-loaded and lazy-computable objects. You can also use chunked data + lazy computations to specify and then run distributed computations on the object.

To load the data in memory, you can use `.compute()`.

In [134]:
# the parsed data is represented lazily with Dask
print(parsed.data)

dask.array<transpose, shape=(3, 768, 1024), dtype=uint8, chunksize=(3, 768, 1024), chunktype=numpy.ndarray>


In [135]:
# you can compute the data: .compute() returns an in-memory numpy array, while `parsed` itself remains lazy
print(parsed.data.compute())

[[[121 138 153 ... 119 131 139]
  [ 89 110 130 ... 118 134 146]
  [ 73  94 115 ... 117 133 144]
  ...
  [ 87  94 107 ... 120 119 119]
  [ 85  95 112 ... 121 120 120]
  [ 85  97 111 ... 120 119 118]]

 [[112 129 144 ... 126 136 144]
  [ 82 103 122 ... 125 141 153]
  [ 66  87 108 ... 126 142 153]
  ...
  [106 110 124 ... 158 157 158]
  [101 111 127 ... 157 156 156]
  [101 113 126 ... 156 155 154]]

 [[131 148 165 ...  74  82  90]
  [100 121 143 ...  71  87  99]
  [ 84 105 126 ...  71  87  98]
  ...
  [ 76  81  92 ...  97  96  95]
  [ 72  82  96 ...  96  94  94]
  [ 74  84  97 ...  95  93  92]]]


## Labels

The data types for labels are identical to the ones used for images. Most of the considerations and code examples above remain valid, with the following distinctions:
- labels don't have the `c` (channel) axis
    - therefore their valid axes are `('y', 'x')` for 2D labels and `('z', 'y', 'x')` for 3D labels;
    - therefore they don't accept the `c_coords` argument in the parser.
- the `dtype` for labels needs to be an integer type.

In [171]:
# the models for labels
import numpy as np

from spatialdata.models import Labels2DModel, Labels3DModel

# let's create some 3D data with two classes: 0 (background) and 1
# note: np.zeros returns float64 values unless a dtype is passed
labels_data = np.zeros((5, 10, 15))
labels_data[:2, :, :] = 1

labels = Labels3DModel.parse(labels_data)
labels

INFO     no axes information specified in the object, setting `dims` to: ('z', 'y', 'x')

| | Array | Chunk |
|---|---|---|
| Bytes | 5.86 kiB | 5.86 kiB |
| Shape | (5, 10, 15) | (5, 10, 15) |
| Dask graph | 1 chunks in 1 graph layer | 1 chunks in 1 graph layer |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |
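
The array displayed above has dtype `float64`, because `np.zeros` defaults to floating point, whereas the list at the start of this section asks for an integer dtype. As a minimal sketch that uses only NumPy (and does not call the `spatialdata` parser), this is one way to check the dtype and cast the array before parsing it:

```python
import numpy as np

# Same toy data as above: two classes, 0 (background) and 1
labels_float = np.zeros((5, 10, 15))
labels_float[:2, :, :] = 1

# np.zeros defaults to float64, so this check is False
is_int_before = bool(np.issubdtype(labels_float.dtype, np.integer))

# Cast to an integer dtype; the values 0.0 and 1.0 convert exactly
labels_int = labels_float.astype(np.int64)
is_int_after = bool(np.issubdtype(labels_int.dtype, np.integer))

print(is_int_before, is_int_after)
print(np.unique(labels_int).tolist())
```

Passing `dtype=np.int64` directly to `np.zeros` gives the same result without the intermediate floating-point array.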


## Points

Points are represented as a dataframe object of type `dask.dataframe.core.DataFrame`, which is the Dask (i.e. lazy) version of a regular `pandas.DataFrame`.

Points can be 2D or 3D; this is determined by which coordinate columns are present: `x`, `y` for 2D points and `x`, `y`, `z` for 3D points. These coordinate columns must be of numerical type.

Points can have any number of additional columns, for instance a column storing the gene id for each point location.

In [177]:
import numpy as np

from spatialdata.models import PointsModel

coords = np.array([[1, 1], [2, 3], [4, 5]])
points = PointsModel.parse(coords)

# the type of the parsed points is a "lazy" Dask dataframe
print(type(points))

<class 'dask.dataframe.core.DataFrame'>


In [178]:
points

| npartitions=1 | x | y |
|---|---|---|
| 0 | int64 | int64 |
| 2 | ... | ... |


In [179]:
# we can compute its values
points.compute()

| | x | y |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 2 | 3 |
| 2 | 4 | 5 |


The single-dispatch mechanism for `.parse()` allows constructing points from NumPy arrays, from pandas dataframes and from Dask dataframes. Let's see examples of this, and also how to specify additional columns beyond `x`, `y`, `z`.
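
As a minimal sketch of the pandas case, the block below builds a table with the column layout described above and checks it with a hypothetical helper, `infer_point_axes`. The gene names are made up, and the block deliberately stops short of calling `PointsModel.parse`, because the keyword arguments that map columns to coordinates are not shown in this notebook; please refer to the documentation for those.

```python
import pandas as pd

# Hypothetical table: one row per point, coordinates plus an annotation column
df = pd.DataFrame(
    {
        "x": [1.0, 2.0, 4.0],
        "y": [1.0, 3.0, 5.0],
        "gene": ["gene_a", "gene_b", "gene_a"],  # made-up identifiers
    }
)


def infer_point_axes(frame):
    """Return ('x', 'y') or ('x', 'y', 'z') and check that these columns are numeric."""
    axes = ("x", "y", "z") if "z" in frame.columns else ("x", "y")
    for axis in axes:
        if axis not in frame.columns:
            raise ValueError(f"Missing coordinate column: {axis}")
        if not pd.api.types.is_numeric_dtype(frame[axis]):
            raise ValueError(f"Column {axis} must be numeric")
    return axes


axes_2d = infer_point_axes(df)
axes_3d = infer_point_axes(df.assign(z=[0.0, 0.0, 1.0]))
extra_columns = [column for column in df.columns if column not in axes_2d]

print(axes_2d, axes_3d, extra_columns)
```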

## Shapes

### Circles

### Polygons/multipolygons

## Tables

## Final considerations

### TODO: lazy vs non-lazy

### TODO: Single-dispatch caveats
### TODO: validate vs parse (classmethod) for raster
### get_model()
### circles vs polygon/multipolygons

### Relationship between single-scale vs multi-scale
This is a more advanced topic. Let's now examine the relationship between the `SpatialImage` object and the `MultiscaleSpatialImage` object.

In [127]:
# this is a more technical example
# the image in each scale of a MultiscaleSpatialImage object is actually an xarray.DataArray and not a SpatialImage
print(type(msi["scale1"]["image"]))

# we can fix this easily
from spatial_image import SpatialImage

si = SpatialImage(msi["scale1"]["image"])
print(type(si))

# future versions of spatialdata will use "spatial_image >= 1.0.0" which uses `DataArray` directly, removing the
# need for the last extra step

<class 'xarray.core.dataarray.DataArray'>
<class 'spatial_image.SpatialImage'>


In [108]:
# let's slice a DataTree
sliced = msi.sel(x=slice(10, 100))
print(sliced)

DataTree('None', parent=None)
├── DataTree('scale0')
│       Dimensions:  (c: 3, y: 768, x: 90)
│       Coordinates:
│         * c        (c) int64 0 1 2
│         * y        (y) float64 0.5 1.5 2.5 3.5 4.5 ... 763.5 764.5 765.5 766.5 767.5
│         * x        (x) float64 10.5 11.5 12.5 13.5 14.5 ... 95.5 96.5 97.5 98.5 99.5
│       Data variables:
│           image    (c, y, x) uint8 dask.array<chunksize=(1, 100, 90), meta=np.ndarray>
├── DataTree('scale1')
│       Dimensions:  (c: 3, y: 384, x: 45)
│       Coordinates:
│         * c        (c) int64 0 1 2
│         * y        (y) float64 1.0 3.0 5.0 7.0 9.0 ... 759.0 761.0 763.0 765.0 767.0
│         * x        (x) float64 11.0 13.0 15.0 17.0 19.0 ... 91.0 93.0 95.0 97.0 99.0
│       Data variables:
│           image    (c, y, x) uint8 dask.array<chunksize=(1, 100, 45), meta=np.ndarray>
└── DataTree('scale2')
        Dimensions:  (c: 3, y: 192, x: 23)
        Coordinates:
          * c        (c) int64 0 1 2
          * y        (y)
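
The different `x` sizes in the sliced output above (90, 45 and 23) come from label-based selection: `.sel(x=slice(10, 100))` keeps the pixels whose center coordinate lies between 10 and 100, with both ends included. The sketch below reproduces the counts in plain Python; the helpers are hypothetical, and the width of 256 for `scale2` is an assumption derived from 1024 / 4, since the output above is truncated.

```python
def pixel_centers(n_pixels, factor):
    """Center coordinate of each pixel along one axis, in scale0 units."""
    return [(i + 0.5) * factor for i in range(n_pixels)]


def count_in_slice(coords, start, stop):
    """Number of coordinates in [start, stop], both ends included."""
    return sum(start <= coord <= stop for coord in coords)


# (number of pixels along x, cumulative downscaling factor) for each scale
x_axes = {"scale0": (1024, 1), "scale1": (512, 2), "scale2": (256, 4)}

selected = {
    name: count_in_slice(pixel_centers(n_pixels, factor), 10, 100)
    for name, (n_pixels, factor) in x_axes.items()
}
print(selected)  # {'scale0': 90, 'scale1': 45, 'scale2': 23}
```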

In [123]:
# sliced is actually a DataTree, not a MultiscaleSpatialImage
type(sliced)

# we can fix this with a convenience function from spatialdata
from spatialdata._utils import multiscale_spatial_image_from_data_tree

type(multiscale_spatial_image_from_data_tree(sliced))

# future versions of spatialdata will use "multiscale_spatial_image >= 1.0.0" which uses `DataTree` directly, removing the
# need for the last extra step

multiscale_spatial_image.multiscale_spatial_image.MultiscaleSpatialImage