# Downloading Data

## Setting up and using Globus

<!--
- login flow
- GCP
- paths
- endpoints
- monitoring jobs w/ web interface
- download single file
-->

We've seen how to search and download the ASDF metadata files with Fido.
However, the actual data files are distributed using [Globus](https://www.globus.org/).
For the next portion of the workshop you will need to be running Globus Connect Personal, so follow the installation instructions for your platform [here](https://www.globus.org/globus-connect-personal) if you haven't already.
During the setup, you will need to log in to Globus.
For this you can use your institutional login, or alternatively you can log in with Google or ORCID.

Once Globus is installed and set up, you will need to run Globus Connect Personal (GCP) as described on the installation page.
You will need to do this every time you want to download data, either through the user tools or through the Globus web app.
When you start GCP, you may also want to define the location or locations on your computer that Globus is allowed to access.
On Linux you can do this using the `-restrict-paths` command line argument, or by editing the config file.
On Windows and Mac OS this option is in the "Access" tab of the configuration options.
Globus will only be able to transfer files onto your machine in the specified paths.
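
The effect of restricting paths can be pictured with a small, purely illustrative check.
This is not Globus code: `is_transfer_allowed` and the example directory are hypothetical names, used only to show the idea that a destination is accepted only if it lies inside one of the permitted directories.

```python
import os


def is_transfer_allowed(destination, allowed_dirs):
    """Return True if `destination` lies inside one of `allowed_dirs`.

    Hypothetical helper for illustration only; GCP performs its own checks.
    """
    dest = os.path.abspath(destination)
    for allowed in allowed_dirs:
        root = os.path.abspath(allowed)
        # commonpath gives the longest shared parent of the two paths
        if os.path.commonpath([dest, root]) == root:
            return True
    return False


allowed = ["/home/user/dkist_data"]  # assumed example directory
print(is_transfer_allowed("/home/user/dkist_data/BEOGN/BEOGN.mp4", allowed))
print(is_transfer_allowed("/home/user/Documents/notes.txt", allowed))
```

Comparing whole path components, rather than string prefixes, means a sibling directory such as `/home/user/dkist_data_other` is not mistaken for an allowed location.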

### The Globus web app

Many of you will already be familiar with using the [Globus web app](https://app.globus.org/) to download data.
If you are not, you should read through the [getting started docs here](https://docs.globus.org/how-to/get-started/).
We will not be using the web app significantly for this workshop, and generally we don't recommend downloading data this way, since the user tools are better suited to navigating the quantities of data that DKIST provides.
However, we will be going over how to use the web app now so that we can demonstrate some of the underlying concepts.

**Endpoints** (also called **Collections** in the web app) are locations registered with Globus for data transfer.
For example, you may want to define an endpoint for both your desktop machine in the office and your laptop, so that you can download data on each depending on where you're working.
You would then be able to transfer data directly from one to the other using Globus.
Many institutions will have their own Globus endpoints, such as a computing cluster, that you may have access to.
DKIST has an endpoint called "DKIST Data Transfer", which is where DKIST data will be made available.

**Paths** are ...?

To start a data transfer from one endpoint to another, go to the "File Manager" tab of the web app.
Here you will find a split screen - on either side you can select an endpoint in the "Collection" search box.
Select "DKIST Data Transfer" on the left hand side and the endpoint corresponding to your local machine on the right.
Then you can navigate the file system on either machine (remembering that Globus will only have access to whichever local directories you've specified).

Let's demonstrate a simple file transfer by grabbing the preview movie for a dataset.
On the right hand side in your local endpoint, navigate to a suitable place to download the movie.
Then on the left hand side, in the DKIST endpoint, navigate to `/data/pid_1_123/BEOGN/`.
We will use this dataset for this and some other examples later in this session.
You should see a list of the files available in this dataset, mostly the data stored in `.fits` format.
Select the preview movie, `BEOGN.mp4`, by clicking the checkbox next to it, then click the "Start" button above the file list to begin the download.

You can check the progress of your transfer by going to the "Activity" tab, which shows both active and previous transfers.
Various useful information is displayed here but for now the most important is whether the transfer task has failed or succeeded.
In either case Globus will also send an email to your registered email address when the task finishes.
Of course in this trivial example this is unnecessary, but if you're transferring a whole large dataset it will likely take some time to download, and it may be useful to be notified when it's complete.
You do not need to leave the web app open for the transfer to continue, but remember that you do need to have GCP running - so if you stop it then your data download will stop as well.

If you try transferring the same file again to the same location, you will find that the task completes successfully but the file is not actually transferred.
This is to save download time and avoid duplication.
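
As a rough picture of how a transfer tool can decide to skip a file, here is a minimal sketch that compares size and checksum before copying.
It is an assumption for illustration only: `needs_transfer` is a hypothetical helper, and the rules Globus actually applies may differ.

```python
import hashlib
import os
import tempfile


def needs_transfer(source, destination):
    """Return True if `destination` is missing or differs from `source`.

    Hypothetical helper; the real rules Globus applies may differ.
    """
    if not os.path.exists(destination):
        return True
    if os.path.getsize(source) != os.path.getsize(destination):
        return True

    def digest(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    return digest(source) != digest(destination)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "source.mp4")
    dst = os.path.join(tmp, "copy.mp4")
    with open(src, "wb") as f:
        f.write(b"preview movie bytes")

    first_attempt = needs_transfer(src, dst)   # destination does not exist yet
    with open(dst, "wb") as f:
        f.write(b"preview movie bytes")        # simulate the completed transfer
    second_attempt = needs_transfer(src, dst)  # identical copy already present

print(first_attempt, second_attempt)
```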

## `dkist.Dataset` basics

In DKIST data parlance, a "dataset" is the smallest unit of data that is searchable from the data centre, and represents a single self-contained observation.
The user tools represent this unit of data with the `Dataset` class.
Within this class the data are stored as many FITS files, each containing a single frame of the observation, and an ASDF file describing how the frames relate to each other.
For VTF data, for example, one FITS file would contain a single narrowband image in one Stokes profile at a single time.
Since there will be very many of these files, each with their own FITS header, manually tracking and inspecting them would be unmanageable.
The `Dataset` class combines these many files into one object, allowing you to inspect the properties and combined headers of the whole dataset.

There are a few ways to construct a `Dataset` object.
For the first we will need the ASDF file for the dataset, which we can get using `Fido` as we saw yesterday.

In [None]:
# Imports
import dkist
from sunpy.net import Fido, attrs as a
import dkist.net

In [None]:
# Search for the dataset by its ID using the DKIST Fido client
res = Fido.search(a.dkist.Dataset('BEOGN'))

res

In [None]:
files = Fido.fetch(res)
files

Notice that the file we have downloaded is a single ASDF file, **not** the whole dataset.
We can use this file to construct the `Dataset`:

In [None]:
ds = dkist.Dataset.from_asdf(files[0])

Now we have a `Dataset` object which describes the shape, size and physical dimensions of the array, but doesn't yet contain any of the actual data.
This may sound unhelpful but we'll see how it can be very powerful.

First let's have a look at the basic representation of the `Dataset`.

In [None]:
ds

This tells us that we have a 4-dimensional data cube and what values the axes correspond to.
Importantly, it not only gives us information about the *pixel* axes (the actual dimensions of the array itself), but also the *world* axes (the physical quantities related to the observation).
It also gives us a correlation matrix showing how the pixel axes relate to the world axes.

### The correlation matrix

Finally the correlation matrix tells us which pixel axes correspond to which world axes.
In this case the first three pixel axes align exactly with three of the world axes.
However, the slit axis maps to both longitude *and* latitude, since the slit is unlikely to be aligned to either one.
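
To make this concrete, here is a minimal sketch of such a correlation matrix built by hand in plain Python.
The axis names and the matrix values are assumptions chosen to match the description above; they are not read from the real dataset, and `world_axes_for` is a hypothetical helper.

```python
# Axis names are assumptions chosen to match the description above.
pixel_axes = ["stokes", "raster step", "wavelength", "slit"]
world_axes = ["stokes", "time", "wavelength", "latitude", "longitude"]

# Rows are world axes, columns are pixel axes; True means "depends on".
correlation = [
    # stokes  step   wave   slit
    [True,  False, False, False],  # stokes
    [False, True,  False, False],  # time
    [False, False, True,  False],  # wavelength
    [False, False, False, True],   # latitude
    [False, False, False, True],   # longitude
]


def world_axes_for(pixel_axis):
    """List the world axes that vary along the given pixel axis."""
    column = pixel_axes.index(pixel_axis)
    return [world for world, row in zip(world_axes, correlation) if row[column]]


for name in pixel_axes:
    print(f"{name:12s} -> {world_axes_for(name)}")
```

Reading down a column answers "which physical quantities change if I move along this pixel axis?", which is why the slit column has two entries set.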

### Inspecting the dataset

The `Dataset` object allows us to do some basic inspection of the dataset as a whole without having to download the entire thing, using the metadata in the FITS headers.
This will save you a good amount of time and also ease the load on the DKIST servers.
For example, we can check the seeing conditions during the observation.

In [None]:
# matplotlib is needed for the plots below
import matplotlib.pyplot as plt

# URL of the documentation describing the FITS header keywords
ds.meta['inventory']['headerDocumentationUrl']

In [None]:
# Just look at the headers for Stokes I so there aren't 4 lots of the same values
I_headers = ds.headers[ds.headers['DINDEX4'] == 1]
plt.plot(I_headers['ATMOS_R0'])
plt.show()

This information allows us to select the parts of the data where the seeing is good, and only download those files.
We will see a more detailed demonstration of how to do this later.

There is an important point to note about slicing the array to reduce the number of files, which is that you need to keep in mind how the data are stored across those files.
We can see a little more information about the files with the `files` attribute of the `Dataset`:

In [None]:
ds.files

So in this case we can see that each FITS file contains effectively a 2D image - a single raster step at one polarisation state - and that we have 4000 of these files to make a full 4D dataset.
What this means is that if we look at a subset of the scan steps or polarisation states, we will reduce the number of files across which the array is stored.

In [None]:
ds[0]

First, notice that when we slice a `Dataset` like this, the output we get here shows us not just the updated array shape but also the updated dimensions.
Because we're looking at a single polarisation state, that axis and the corresponding physical axis have been removed.

In [None]:
ds[0].files

However, if we decide we want to look at a single wavelength, we are taking a row of pixels from every single file.
So although we reduce the dimensions of the array, we are not reducing the number of files we need to reference - and therefore download.

In [None]:
ds[:, :, 500, :].data.shape

In [None]:
ds[:, :, 500, :].files
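
The relationship between slicing and file count can be sketched without any real data.
The layout below is an assumption for illustration (4 Stokes states by 1000 raster steps, giving the 4000 files mentioned above), and the file names and the `files_needed` helper are hypothetical.

```python
# Assumed layout: 4 Stokes states x 1000 raster steps = 4000 files,
# each file holding one 2D (wavelength, slit) frame.
N_STOKES, N_STEPS = 4, 1000

# Hypothetical file names, laid out as a (stokes, raster step) grid
file_grid = [
    [f"frame_s{s}_r{r:04d}.fits" for r in range(N_STEPS)]
    for s in range(N_STOKES)
]


def files_needed(stokes=slice(None), steps=slice(None)):
    """Return the files touched by a slice over the two file-level axes.

    Wavelength and slit slices are not arguments because they select pixels
    inside each file and therefore never change which files are needed.
    """
    rows = file_grid[stokes]
    if isinstance(stokes, int):
        rows = [rows]  # a single Stokes state is one row of the grid
    needed = []
    for row in rows:
        selected = row[steps]
        needed.extend([selected] if isinstance(steps, int) else selected)
    return needed


print(len(files_needed()))                               # 4000, like ds
print(len(files_needed(stokes=0)))                       # 1000, like ds[0]
print(len(files_needed(stokes=0, steps=slice(0, 100))))  # 100
```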

### Downloading the quality report and preview movie

For each dataset a quality report is produced during calibration which gives useful information about the quality of the data.
This is accessible through the `quality_report()` method of the `Dataset`'s `files` attribute, which will download a PDF of the quality report to the base path of the dataset.
This uses parfive underneath, which is the same library `Fido` uses, so it will return the same kind of `results` object.
If the download has been successful, this can be treated as a list of filenames.

In [None]:
qr = ds.files.quality_report()
qr

This method takes the optional arguments `path` and `overwrite`.
`path` allows you to specify a different location for the download, and `overwrite` is a boolean which tells the method whether or not to download a new copy if the file already exists.

Similarly, each dataset also has a short preview movie showing the data.
This can be downloaded in exactly the same way as the quality report but using the `preview_movie()` method:

In [None]:
pm = ds.files.preview_movie()
pm

## Dataset and downloading

<!--
- Download from globus
- Download whole dataset/many datasets
- Download to remote endpoints
-->

## Downloading Data Based on Seeing

In [None]:
import matplotlib.pyplot as plt
from sunpy.net import Fido, attrs as a

import dkist

Let's find the dataset with the highest average value of $r_0$, the Fried parameter.
First we'll search for all unembargoed VISP data, as embargoed data is no use to us for this exercise.

In [None]:
res = Fido.search(a.Instrument("VISP"), a.dkist.Embargoed(False))

Next, since we want to use the highest average $r_0$, we can have Fido sort the results and output just the useful columns.

In [None]:
res['dkist'].sort("Average Fried Parameter", reverse=True)
res['dkist'].show("Dataset ID", "Start Time", "Average Fried Parameter", "Primary Proposal ID")

This gives us the dataset `BEOGN`, which happens to be the same one we have been using already.
(You should therefore already have the ASDF file on your machine, but we'll go through the motions for completeness.
Fido won't re-download an existing file unless told to.)
We can download the dataset ASDF with Fido to inspect it in more detail.
Remember that this only downloads a single ASDF file with some more metadata about the dataset, not the actual science data.

In [None]:
asdf_files = Fido.fetch(res['dkist'][0], path="~/sunpy/data/{dataset_id}/")
ds = dkist.Dataset.from_asdf(asdf_files[0])

Now that we have access to the FITS headers we can inspect the $r_0$ more closely, just as we did in the previous session.
Remember that `DINDEX4` is the Stokes index, so we can plot the $r_0$ for just Stokes I like so:

In [None]:
plt.plot(ds.headers[ds.headers["DINDEX4"] == 1]["ATMOS_R0"])

Now let's slice down our dataset based on the first frame where $r_0$ is high:

In [None]:
# Select headers for only frames with bad r0
bad_headers = ds.headers[ds.headers["ATMOS_R0"] > 1]

# Make sure headers are sorted by date
bad_headers.sort("DATE-AVG")

# Slice up to the index of the first bad frame
sds = ds[0, :bad_headers[0]["DINDEX3"]-1, :, :]
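
The index arithmetic in the cell above can be checked on a toy example.
The header rows below are invented purely for illustration (they are not values from `BEOGN`), and plain dictionaries stand in for the header table.

```python
R0_THRESHOLD = 1  # same cut as the cell above

# Invented header rows; DINDEX3 is the 1-based raster step index
headers = [
    {"DATE-AVG": "2000-01-01T00:00:00", "DINDEX3": 1, "ATMOS_R0": 0.10},
    {"DATE-AVG": "2000-01-01T00:00:10", "DINDEX3": 2, "ATMOS_R0": 0.12},
    {"DATE-AVG": "2000-01-01T00:00:20", "DINDEX3": 3, "ATMOS_R0": 0.09},
    {"DATE-AVG": "2000-01-01T00:00:30", "DINDEX3": 4, "ATMOS_R0": 5.30},
    {"DATE-AVG": "2000-01-01T00:00:40", "DINDEX3": 5, "ATMOS_R0": 0.11},
    {"DATE-AVG": "2000-01-01T00:00:50", "DINDEX3": 6, "ATMOS_R0": 7.80},
]

bad = [h for h in headers if h["ATMOS_R0"] > R0_THRESHOLD]
bad.sort(key=lambda h: h["DATE-AVG"])  # ISO timestamps sort chronologically

# DINDEX3 counts from 1, so subtracting 1 gives the 0-based array index of
# the first bad frame; used as a slice stop, that frame itself is excluded.
stop = bad[0]["DINDEX3"] - 1
kept = headers[:stop]

print(stop, [h["DINDEX3"] for h in kept])
```

Note that this keeps only the contiguous run before the first bad frame, so any acceptable frames that come later (step 5 in this toy example) are discarded as well.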

We can now download only these files. Remember that you need Globus Connect Personal running for this.

In [None]:
sds.files.download(path="~/sunpy/data/{dataset_id}/")

Now let's plot the Stokes I data at some wavelength:

In [None]:
ds[0, :, 466, :].plot(plot_axes=['x', 'y'], aspect="auto")

You will notice that a lot of it is missing.
This is because we have deliberately only downloaded those frames with an acceptably low $r_0$.
You may also notice though, that the `Dataset` object continues to function normally without the rest of the data.
When we try to access the data, if the file is missing then `Dataset` fills in the corresponding portions of the array with NaNs.
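
The NaN-filling behaviour can be pictured with a minimal sketch in plain Python.
This is not how `Dataset` is implemented: the toy layout, the `downloaded` dictionary and the `load_array` helper are all hypothetical stand-ins.

```python
import math

# Assumed toy layout: 5 files, each holding one row of 3 pixel values
N_FILES, PIXELS_PER_FILE = 5, 3

# Hypothetical stand-in for files on disk: only the first two "downloaded"
downloaded = {
    0: [1.0, 2.0, 3.0],
    1: [4.0, 5.0, 6.0],
}


def load_array(available):
    """Stack one row per file, using NaNs where a file is not available."""
    rows = []
    for index in range(N_FILES):
        if index in available:
            rows.append(list(available[index]))
        else:
            rows.append([math.nan] * PIXELS_PER_FILE)
    return rows


array = load_array(downloaded)
for row in array:
    print(row)
```

The array keeps its full shape regardless of how many files are present, which is why plotting and slicing continue to work on a partially downloaded dataset.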

Since the seeing is bad for a significant contiguous portion of the data, we may simply want to discount that part and look only at the useful data.
In this case we can use the sub-dataset we made earlier:

In [None]:
sds[:, 466, :].plot(plot_axes=['x', 'y'], aspect="auto")

Or of course we can make any arbitrary slice to look at whatever subset of the data we prefer.


## Plotting again

<!--
- put data in the plot
- animations
- single image w/ coord overlay
- line plots
- plotting (ndcube)
-->