# The `Dataset` Class

As previously discussed, a "DKIST dataset" comprises many files: one ASDF file and many FITS files.
The user tools represent all these files with the {obj}`dkist.Dataset` class.

A `Dataset` object is constructed from the ASDF file for that dataset.
This ASDF file contains the following information:
* A table containing all the headers from all FITS files that comprise the dataset.
* A copy of the Data Center's inventory record for the dataset.
* A `gwcs` object which provides coordinate information for the whole dataset.
* A list of all the component FITS files and the required order to combine them into a single array.

If a `Dataset` object is created from just the ASDF file, without access to the arrays in the FITS files, then all the data will be missing, but everything else will function the same.

## Constructing `Dataset` Objects

There are two ways to construct a `Dataset`: by providing the path to an ASDF file, or by providing a directory containing an ASDF file.
Here we first fetch an ASDF file with Fido and then pass it to `Dataset.from_asdf`:

In [2]:
from astropy.time import Time

import dkist
import dkist.net
from sunpy.net import Fido, attrs as a

In [3]:
res = Fido.search(a.dkist.Dataset("AYDEW"))
res

| Start Time | End Time | Instrument | Wavelength | Bounding Box | Dataset ID | Dataset Size | Exposure Time | Primary Experiment ID | Primary Proposal ID | Stokes Parameters | Target Types | Number of Frames | Average Fried Parameter | Embargoed | Downloadable | Has Spectral Axis | Has Temporal Axis | Average Spectral Sampling | Average Spatial Sampling | Average Temporal Sampling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | nm | | | Gibyte | s | | | | | | | | | | | nm | arcsec | s |
| Time | Time | str4 | float64[2] | str32 | str5 | float64 | float64 | str9 | str9 | str4 | str7[1] | int64 | float64 | bool | bool | bool | bool | float64 | float64 | float64 |
| 2022-06-02T17:21:59.648 | 2022-06-02T17:48:47.688 | VISP | 630.2392098893586 .. 634.3897538455635 | (-405.09,37.71),(-469.15,-93.61) | AYDEW | 4.0 | 48.00811267605634 | eid_1_118 | pid_1_118 | IQUV | unknown | 1960 | 0.0916508252856398 | False | True | True | True | 0.0016251150963997 | 0.2134568481952311 | 804.0201081632624 |


In [4]:
files = Fido.fetch(res, path="~/sunpy/data/{instrument}/{dataset_id}")
files

<parfive.results.Results object at 0x7f8c71cbbc10>
['/home/stuart/sunpy/data/VISP/AYDEW/VISP_L1_20220602T172159_AYDEW.asdf']

Remember that the file we have downloaded is a single ASDF file, **not** the whole dataset.
We can use this file to construct the `Dataset`:

In [5]:
ds = dkist.Dataset.from_asdf(files[0])

The other constructor, `Dataset.from_directory`, is documented as follows:

In [8]:
help(dkist.Dataset.from_directory)

Help on method from_directory in module dkist.dataset.dataset:

from_directory(directory) method of abc.ABCMeta instance
    Construct a `~dkist.dataset.Dataset` from a directory containing one
    asdf file and a collection of FITS files.
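
As the help text says, `from_directory` expects a directory holding exactly one ASDF file. A minimal sketch of that lookup using only the standard library is shown below. Note that `find_single_asdf` is a hypothetical helper written for illustration; it is not part of `dkist`, and the real method may behave differently.

```python
from pathlib import Path


def find_single_asdf(directory):
    """Return the only ``*.asdf`` file in ``directory`` (hypothetical helper)."""
    base = Path(directory).expanduser()
    candidates = sorted(base.glob("*.asdf"))
    if len(candidates) != 1:
        # Zero or several ASDF files would make the choice ambiguous.
        raise ValueError(
            f"Expected exactly one ASDF file in {base}, found {len(candidates)}"
        )
    return candidates[0]
```

The returned path could then be passed to `dkist.Dataset.from_asdf`.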



Now we have a `Dataset` object which describes the shape, size and physical dimensions of the array, but doesn't yet contain any of the actual data.
This may sound unhelpful but we'll see how it can be very powerful.

First let's have a look at the basic representation of the `Dataset`.

In [6]:
ds

<dkist.dataset.dataset.Dataset object at 0x7f8c605e69d0>
This Dataset has 4 pixel and 5 world dimensions

dask.array<reshape, shape=(4, 490, 976, 2555), dtype=float64, chunksize=(1, 1, 976, 2555), chunktype=numpy.ndarray>

Pixel Dim  Axis Name                Data size  Bounds
        0  polarization state               4  None
        1  raster scan step number        490  None
        2  spatial along slit             976  None
        3  dispersion axis               2555  None

World Dim  Axis Name                  Physical Type                   Units
        0  stokes                     phys.polarization.stokes        unknown
        1  time                       time                            s
        2  helioprojective longitude  custom:pos.helioprojective.lon  arcsec
        3  helioprojective latitude   custom:pos.helioprojective.lat  arcsec
        4  wavelength                 em.wl                           nm

Correlation between pixel and world axes:

               Pi

In [9]:
ds.dimensions

<Quantity [   4.,  490.,  976., 2555.] pix>

In [10]:
ds.data.shape

(4, 490, 976, 2555)

These are the **array** dimensions.
We can get the corresponding **pixel** axis names with:

In [11]:
ds.wcs.pixel_axis_names

('dispersion axis',
 'spatial along slit',
 'raster scan step number',
 'polarization state')

Note how these are reversed from one another. We can print them together with:

In [21]:
for name, length in zip(ds.wcs.pixel_axis_names[::-1], ds.dimensions):
    print(f"{name}: {length}")

polarization state: 4.0 pix
raster scan step number: 490.0 pix
spatial along slit: 976.0 pix
dispersion axis: 2555.0 pix


These axes map onto world axes via the axis correlation matrix we saw in the first session:

In [24]:
ds.wcs.axis_correlation_matrix

array([[ True, False, False, False],
       [False,  True,  True, False],
       [False,  True,  True, False],
       [False, False,  True, False],
       [False, False, False,  True]])

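To make this matrix easier to read, `pp_matrix` prints it with labelled rows and columns, and `array_axis_physical_types` lists the world axes which correspond to each array axis: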

In [27]:
from dkist.dataset.utils import pp_matrix

In [29]:
pp_matrix(ds.wcs)

[['                         ' 'dispersion axis' 'spatial along slit' 'raster scan step number' 'polarization state']
 ['               wavelength' '           True' '             False' '                  False' '             False']
 [' helioprojective latitude' '          False' '              True' '                   True' '             False']
 ['helioprojective longitude' '          False' '              True' '                   True' '             False']
 ['                     time' '          False' '             False' '                   True' '             False']
 ['                   stokes' '          False' '             False' '                  False' '              True']]


In [26]:
ds.dimensions

<Quantity [   4.,  490.,  976., 2555.] pix>

In [25]:
ds.array_axis_physical_types

[('phys.polarization.stokes',),
 ('custom:pos.helioprojective.lat', 'custom:pos.helioprojective.lon', 'time'),
 ('custom:pos.helioprojective.lat', 'custom:pos.helioprojective.lon'),
 ('em.wl',)]
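
To see how the correlation matrix and this list relate, here is a small sketch in plain Python that rebuilds the list from the matrix values printed above. The order of `world_types` is an assumption read off the row labels of the `pp_matrix` output; this is not how the library itself computes the result.

```python
# Rows are world axes and columns are pixel axes, both in WCS order,
# copied from the `ds.wcs.axis_correlation_matrix` output above.
correlation = [
    [True, False, False, False],
    [False, True, True, False],
    [False, True, True, False],
    [False, False, True, False],
    [False, False, False, True],
]
# Physical type of each world axis in WCS order (assumed from the row
# labels printed by `pp_matrix`).
world_types = [
    "em.wl",
    "custom:pos.helioprojective.lat",
    "custom:pos.helioprojective.lon",
    "time",
    "phys.polarization.stokes",
]

n_pixel = len(correlation[0])
array_axis_types = []
# Array axes are the pixel axes in reverse order.
for pixel_axis in reversed(range(n_pixel)):
    types = tuple(
        world_types[world_axis]
        for world_axis, row in enumerate(correlation)
        if row[pixel_axis]
    )
    array_axis_types.append(types)

for axis, types in enumerate(array_axis_types):
    print(axis, types)
```

Each array axis picks up every world axis whose row is `True` in the matching column, which is why the raster axis carries latitude, longitude and time together.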

As we saw in the first session, we can also convert between pixel or array coordinates and world coordinates:

In [30]:
ds.wcs.pixel_to_world(30, 20, 10, 0)

[<SpectralCoord 630.28796334 nm>,
 <SkyCoord (Helioprojective: obstime=2022-06-02T17:35:22.024, rsun=695700.0 km, observer=<HeliographicStonyhurst Coordinate (obstime=2022-06-02T17:35:22.024, rsun=695700.0 km): (lon, lat, radius) in (deg, deg, m)
     (0.00194034, -0.48783239, 1.51723472e+11)>): (Tx, Ty) in arcsec
     (-60241.01861878, -390.04462929)>,
 <Time object: scale='utc' format='isot' value=2022-06-02T17:22:32.532>,
 'I']

In [33]:
world = ds.wcs.array_index_to_world(0, 10, 20, 30)
world

[<SpectralCoord 630.28796334 nm>,
 <SkyCoord (Helioprojective: obstime=2022-06-02T17:35:22.024, rsun=695700.0 km, observer=<HeliographicStonyhurst Coordinate (obstime=2022-06-02T17:35:22.024, rsun=695700.0 km): (lon, lat, radius) in (deg, deg, m)
     (0.00194034, -0.48783239, 1.51723472e+11)>): (Tx, Ty) in arcsec
     (-60241.01861878, -390.04462929)>,
 <Time object: scale='utc' format='isot' value=2022-06-02T17:22:32.532>,
 'I']

We can also do the reverse:

In [36]:
ds.wcs.world_to_pixel(world[0], world[1], world[2], world[3])

[30.000000000058208, 19.99999999967399, 10.0, 0.0]

In [35]:
ds.wcs.world_to_array_index(*world)

(array(0), array(10), array(20), array(30))
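
The two results above differ only in ordering and rounding. The sketch below reproduces the array indices from the pixel values in plain Python. The rounding rule used here (floor of the value plus one half) is an assumption about how the nearest index is chosen, and `pixel_to_array_index` is a name introduced for illustration.

```python
import math

# Pixel coordinates returned by `world_to_pixel` above, in pixel (WCS) order.
pixel = [30.000000000058208, 19.99999999967399, 10.0, 0.0]


def pixel_to_array_index(pixel_coords):
    """Round each value to the nearest pixel and reverse into array order."""
    return tuple(math.floor(p + 0.5) for p in reversed(pixel_coords))


array_index = pixel_to_array_index(pixel)
print(array_index)  # (0, 10, 20, 30)
```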

Finally, it's possible to get all the axis coordinates along one or more axes:

```{warning}
This might eat all your <del>cat</del> RAM.

The algorithm used to calculate these coordinates in ndcube isn't as memory efficient as it could be, and when working with the large multi-dimensional DKIST data you can really notice it!
```
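
For a sense of the scale involved when anything is computed over the whole dataset, the sketch below estimates the in-memory size of the data array alone, using the shape and dtype from the `Dataset` representation above. It says nothing about how much memory the coordinate calculation itself needs.

```python
import math

shape = (4, 490, 976, 2555)  # from the `Dataset` representation above
bytes_per_value = 8          # float64

n_values = math.prod(shape)
n_bytes = n_values * bytes_per_value
print(f"{n_values:,} values, about {n_bytes / 2**30:.1f} GiB as float64")
```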

In [40]:
ds.axis_world_coords()

(<SpectralCoord [630.23920989, 630.240835  , 630.24246012, ..., 634.38650362,
    634.38812873, 634.38975385] nm>,
 <SkyCoord (Helioprojective: obstime=2022-06-02T17:35:22.024, rsun=695700.0 km, observer=<HeliographicStonyhurst Coordinate (obstime=2022-06-02T17:35:22.024, rsun=695700.0 km): (lon, lat, radius) in (deg, deg, m)
     (0.00194034, -0.48783239, 1.51723472e+11)>): (Tx, Ty) in arcsec
     [[(-61445.67185749, -389.93244527), (-61445.67184108, -389.90382666),
       (-61445.67182467, -389.87520805), ...,
       (-61445.65588814, -362.08653746), (-61445.65587173, -362.05791885),
       (-61445.65585532, -362.02930024)],
      [(-61325.39684985, -390.00144497), (-61325.39683347, -389.97282131),
       (-61325.39681709, -389.94419765), ...,
       (-61325.38090835, -362.15062342), (-61325.38089196, -362.12199976),
       (-61325.38087558, -362.0933761 )],
      [(-61205.08124703, -390.0705606 ), (-61205.08123067, -390.0419319 ),
       (-61205.08121431, -390.0133032 ), ...,
      

In [41]:
ds.axis_world_coords('time')

(<Time object: scale='utc' format='isot' value=['2022-06-02T17:21:59.648' '2022-06-02T17:22:02.937'
  '2022-06-02T17:22:06.225' '2022-06-02T17:22:09.513'
  '2022-06-02T17:22:12.802' '2022-06-02T17:22:16.090'
  '2022-06-02T17:22:19.379' '2022-06-02T17:22:22.667'
  '2022-06-02T17:22:25.956' '2022-06-02T17:22:29.244'
  '2022-06-02T17:22:32.532' '2022-06-02T17:22:35.821'
  '2022-06-02T17:22:39.109' '2022-06-02T17:22:42.398'
  '2022-06-02T17:22:45.686' '2022-06-02T17:22:48.975'
  '2022-06-02T17:22:52.263' '2022-06-02T17:22:55.551'
  '2022-06-02T17:22:58.840' '2022-06-02T17:23:02.128'
  '2022-06-02T17:23:05.417' '2022-06-02T17:23:08.705'
  '2022-06-02T17:23:11.994' '2022-06-02T17:23:15.282'
  '2022-06-02T17:23:18.570' '2022-06-02T17:23:21.859'
  '2022-06-02T17:23:25.147' '2022-06-02T17:23:28.436'
  '2022-06-02T17:23:31.724' '2022-06-02T17:23:35.013'
  '2022-06-02T17:23:38.301' '2022-06-02T17:23:41.589'
  '2022-06-02T17:23:44.878' '2022-06-02T17:23:48.166'
  '2022-06-02T17:23:51.455' '2022-06

### Slicing Datasets

Another useful feature of the `Dataset` class, which it inherits from `NDCube`, is the ability to "slice" the dataset and get a smaller dataset with the array and coordinate information intact.

For example, to extract the Stokes I component of the dataset we would do:

In [42]:
ds[0]

<dkist.dataset.dataset.Dataset object at 0x7f8c5edf0550>
This Dataset has 3 pixel and 4 world dimensions

dask.array<getitem, shape=(490, 976, 2555), dtype=float64, chunksize=(1, 976, 2555), chunktype=numpy.ndarray>

Pixel Dim  Axis Name                Data size  Bounds
        0  raster scan step number        490  None
        1  spatial along slit             976  None
        2  dispersion axis               2555  None

World Dim  Axis Name                  Physical Type                   Units
        0  time                       time                            s
        1  helioprojective longitude  custom:pos.helioprojective.lon  arcsec
        2  helioprojective latitude   custom:pos.helioprojective.lat  arcsec
        3  wavelength                 em.wl                           nm

Correlation between pixel and world axes:

             Pixel Dim
World Dim    0    1    2
        0  yes   no   no
        1  yes  yes   no
        2  yes  yes   no
        3   no   no  yes

This works because the Stokes axis is the first array axis, and the "I" profile is its first element (index 0).

Note how we have dropped a world coordinate; this information is preserved in the `.global_coords` attribute, which contains the coordinate information which applies to the whole dataset:

In [43]:
ds[0].global_coords

<ndcube.global_coords.GlobalCoords object at 0x7f8c5ed7b810>
GlobalCoords(stokes ['phys.polarization.stokes']:
'I')

We can also slice out a section of an axis of the dataset:

In [49]:
ds[:, 10:2000, :, :]

<dkist.dataset.dataset.Dataset object at 0x7f8c5ec71e10>
This Dataset has 4 pixel and 5 world dimensions

dask.array<getitem, shape=(4, 480, 976, 2555), dtype=float64, chunksize=(1, 1, 976, 2555), chunktype=numpy.ndarray>

Pixel Dim  Axis Name                Data size  Bounds
        0  polarization state               4  None
        1  raster scan step number        480  None
        2  spatial along slit             976  None
        3  dispersion axis               2555  None

World Dim  Axis Name                  Physical Type                   Units
        0  stokes                     phys.polarization.stokes        unknown
        1  time                       time                            s
        2  helioprojective longitude  custom:pos.helioprojective.lon  arcsec
        3  helioprojective latitude   custom:pos.helioprojective.lat  arcsec
        4  wavelength                 em.wl                           nm

Correlation between pixel and world axes:

               Pi

In [47]:
ds[0, 0].global_coords

<ndcube.global_coords.GlobalCoords object at 0x7f8c5eff5890>
GlobalCoords(time ['time']:
<Time object: scale='utc' format='isot' value=2022-06-02T17:21:59.648>,
             stokes ['phys.polarization.stokes']:
'I')

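Indexing two axes at once, as in `ds[0, 0]`, moves both the Stokes and the time coordinates into `.global_coords`.
The range slice `ds[:, 10:2000, :, :]` keeps 480 of the 490 raster step points, because the stop index is clipped to the length of the axis.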
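
The array shapes in the sliced outputs above follow the usual Python indexing rules: an integer index drops an axis, while a slice keeps it and clips to the axis length. A minimal sketch of that bookkeeping is below; `sliced_shape` is a hypothetical helper for illustration and handles only integers and slices.

```python
def sliced_shape(shape, item):
    """Shape after indexing with ``item`` (a tuple of ints and slices)."""
    result = []
    for axis, length in enumerate(shape):
        # Axes not mentioned in ``item`` are kept whole.
        index = item[axis] if axis < len(item) else slice(None)
        if isinstance(index, int):
            continue  # an integer index drops the axis
        # slice.indices clips start and stop to the axis length
        result.append(len(range(*index.indices(length))))
    return tuple(result)


full = (4, 490, 976, 2555)
print(sliced_shape(full, (0,)))                            # (490, 976, 2555)
print(sliced_shape(full, (slice(None), slice(10, 2000))))  # (4, 480, 976, 2555)
```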


## TiledDataset

So far we have been working with VISP data, which is continuous in the sense that there are no gaps or overlaps in the coordinate axes.
However, instruments like VBI take multiple images at different locations with the intention of tiling them together to form a larger image.
In this case, those images do not share a common pixel grid and therefore cannot be simply stacked together.
It is possible to use `reproject` to regrid the images into a larger array, but since this would interpolate the data, it is not done by default.
We will cover an example of how to do this later in the workshop.

This kind of tiled data cannot be stored in a single `Dataset` object.
There is therefore a wrapper object called `TiledDataset`, which is essentially an array of `Dataset` objects.
Let's demonstrate this with a VBI dataset.

In [50]:
res = Fido.search(a.dkist.Dataset("BLKGA"))
files = Fido.fetch(res, path="~/sunpy/data/{instrument}/{dataset_id}")
tds = dkist.Dataset.from_asdf(files[0])
tds

<dkist.dataset.tiled_dataset.TiledDataset at 0x7f8c5f2a54d0>

To access the individual tiles, we can then index this normally to get back the `Dataset` objects.

In [61]:
type(tds)

dkist.dataset.tiled_dataset.TiledDataset

In [62]:
type(ds)

dkist.dataset.dataset.Dataset

In [52]:
tds.shape

(3, 3)

In [55]:
tds.inventory

{'asdfObjectKey': 'pid_1_118/BLKGA/VBI_L1_20220602T172250_BLKGA.asdf',
 'averageDatasetSpatialSampling': 0.01099999994039536,
 'averageDatasetSpectralSampling': None,
 'averageDatasetTemporalSampling': 82.26000000000653,
 'boundingBox': '(-560.88,-346.83),(-677.78,-466.44)',
 'browseMovieObjectKey': 'pid_1_118/BLKGA/BLKGA.mp4',
 'bucket': 'data',
 'calibrationDocumentationUrl': 'https://docs.dkist.nso.edu/projects/vbi/en/v0.16.0/l0_to_l1_vbi_summit-calibrated.html',
 'contributingExperimentIds': [],
 'contributingProposalIds': [],
 'datasetId': 'BLKGA',
 'datasetInventoryId': 5264,
 'datasetSize': 2,
 'endTime': '2022-06-02T17:47:30.855505',
 'exposureTime': 5.009,
 'frameCount': 171,
 'hasAllStokes': False,
 'hasSpectralAxis': False,
 'hasTemporalAxis': True,
 'headerDataUnitCreationDate': '2022-12-08T17:46:16.402000',
 'headerDocumentationUrl': 'https://docs.dkist.nso.edu/projects/data-products/en/v3.0.0',
 'headerVersion': '3.0.0',
 'highLevelSoftwareVersion': 'Alakai_5-1',
 'infoUr

In [58]:
tds[0, 1]

<dkist.dataset.dataset.Dataset object at 0x7f8c5d702310>
This Dataset has 3 pixel and 3 world dimensions

dask.array<stack, shape=(19, 4096, 4096), dtype=float32, chunksize=(1, 4096, 4096), chunktype=numpy.ndarray>

Pixel Dim  Axis Name                  Data size  Bounds
        0  time                              19  None
        1  helioprojective longitude       4096  None
        2  helioprojective latitude        4096  None

World Dim  Axis Name                  Physical Type                   Units
        0  time                       time                            s
        1  helioprojective longitude  custom:pos.helioprojective.lat  arcsec
        2  helioprojective latitude   custom:pos.helioprojective.lon  arcsec

Correlation between pixel and world axes:

             Pixel Dim
World Dim    0    1    2
        0  yes   no   no
        1  yes  yes  yes
        2  yes  yes  yes
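
As a quick consistency check on the numbers shown so far, the sketch below compares the tile grid and the per-tile array shape with the frame count in the inventory. It assumes that every tile has the same number of time steps as `tds[0, 1]`, which the outputs above do not confirm.

```python
import math

tile_grid_shape = (3, 3)            # `tds.shape`
tile_data_shape = (19, 4096, 4096)  # array shape of `tds[0, 1]`
inventory_frame_count = 171         # `frameCount` in `tds.inventory`

n_tiles = math.prod(tile_grid_shape)
# Assumption: every tile has the same number of time steps as tds[0, 1].
total_frames = n_tiles * tile_data_shape[0]
print(n_tiles, total_frames, total_frames == inventory_frame_count)
```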

```{error}
Due to a known issue with the VBI level 1 FITS headers, the ordering of these tiles in the array is likely incorrect.
```

The `TiledDataset` stores the FITS headers for all the files of the individual `Dataset`s in the `combined_headers` attribute.
This means that the metadata can still be inspected in many of the ways we will see in later sessions.
Later releases of the user tools will also include helper functions for regridding a `TiledDataset` into a single `Dataset` object.