# Downloading Data

## Setting up and using Globus

<!--
- login flow
- GCP
- paths
- endpoints
- monitoring jobs w/ web interface
- download single file
-->

We've seen how to search and download the ASDF metadata files with Fido.
However, the actual data files are distributed using [Globus](https://www.globus.org/).
For the next portion of the workshop, you will need to be running Globus Connect Personal, so follow the installation instructions for your platform [here](https://www.globus.org/globus-connect-personal) if you haven't already.
During the setup, you will need to log in to Globus.
For this you can use your institutional login, or alternatively you can log in with Google or ORCID.

Once Globus is installed and set up, you will need to run Globus Connect Personal (GCP) as described on the installation page.
You will need to do this every time you want to download data, either through the user tools or through the Globus web app.
When you start GCP, you may also want to define the location or locations on your computer that Globus should have access to.
On Linux you can do this using the `-restrict-paths` command line argument, or by editing the config file.
On Windows and macOS this option is in the "Access" tab of the configuration options.
Globus will only be able to transfer files onto your machine in the specified paths.
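
As a rough illustration of the Linux command-line route, the sketch below only builds the command and does not run it. The helper name `gcp_start_command`, the example directory `~/dkist_data`, and the `r`/`rw` prefix format of the `-restrict-paths` value are assumptions here, so check them against the GCP documentation for your version before relying on them.

```python
def gcp_start_command(paths, executable="./globusconnectpersonal"):
    """Build the command line that starts GCP with restricted paths.

    `paths` maps a directory to "r" (read-only) or "rw" (read-write).
    """
    for path, mode in paths.items():
        if mode not in ("r", "rw"):
            raise ValueError(f"mode for {path!r} must be 'r' or 'rw', got {mode!r}")
    # Assumed format: each path is prefixed with its permission and the
    # entries are joined with commas, e.g. "rw~/dkist_data,r/tmp".
    restrict = ",".join(f"{mode}{path}" for path, mode in paths.items())
    return [executable, "-start", "-restrict-paths", restrict]


command = gcp_start_command({"~/dkist_data": "rw"})
print(" ".join(command))
# ./globusconnectpersonal -start -restrict-paths rw~/dkist_data
```

Returning a list rather than a single string means the result could be passed straight to `subprocess.run` without any shell quoting concerns.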

### The Globus web app

Many of you will already be familiar with using the [Globus web app](https://app.globus.org/) to download data.
If you are not, you should read through the [getting started docs here](https://docs.globus.org/how-to/get-started/).
We will not be using the web app significantly for this workshop, and generally we don't recommend downloading data this way, since the user tools are better suited to navigating the quantities of data that DKIST provides.
However, we will be going over how to use the web app now so that we can demonstrate some of the underlying concepts.

**Endpoints** (also called **Collections** in the web app) are locations registered with Globus for data transfer.
For example, you may want to define an endpoint for both your desktop machine in the office and your laptop, so that you can download data on each depending on where you're working.
You would then be able to transfer data directly from one to the other using Globus.
Many institutions will have their own Globus endpoints, such as a computing cluster, that you may have access to.
DKIST has an endpoint called "DKIST Data Transfer", which is where DKIST data will be made available.

**Paths** are ...?

To start a data transfer from one endpoint to another, go to the "File Manager" tab of the web app.
Here you will find a split screen; on each side you can select an endpoint in the "Collection" search box.
Select "DKIST Data Transfer" on the left-hand side and the endpoint corresponding to your local machine on the right.
Then you can navigate the file system on either machine (remembering that Globus will only have access to whichever local directories you've specified).

Let's demonstrate a simple file transfer by grabbing the preview movie for a dataset.
On the right-hand side, in your local endpoint, navigate to a suitable place to download the movie.
Then on the left-hand side, in the DKIST endpoint, navigate to `/data/pid_1_123/BEOGN/`.
We will use this dataset for this and some other examples later in this session.
You should see a list of the files available in this dataset, mostly the data stored in `.fits` format.
Select the preview movie, `BEOGN.mp4`, by clicking the checkbox next to it, then click the "Start" button above the file list to begin the download.

You can check the progress of your transfer by going to the "Activity" tab, which shows both active and previous transfers.
Various useful information is displayed here, but for now the most important is whether the transfer task has failed or succeeded.
In either case Globus will also send an email to your registered email address when the task finishes.
Of course, in this trivial example this is unnecessary, but if you're transferring a whole large dataset it will likely take some time to download, and it may be useful to be notified when it's complete.
You do not need to leave the web app open for the transfer to continue, but you do need to keep GCP running: if you stop it, your data download will stop as well.

If you try transferring the same file again to the same location, you will find that the task completes successfully but the file is not actually transferred.
This is to save download time and avoid duplication.

## Dataset and downloading

<!--
- Download from globus
- Download whole dataset/many datasets
- Download to remote endpoints
-->

## Downloading Data Based on Seeing

Let's find a dataset with the highest average value of $r_0$ (this is bad?).
First we'll search for all unembargoed VISP data, as embargoed data is no use to us for this exercise.

Next, since we want to use the highest average $r_0$, we can have Fido sort the results and output just the useful columns.
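
A sketch of how the search and sort described above might look with the DKIST user tools is given below. The function name `find_visp_by_r0` is made up for this example, and the column names in `COLUMNS` (in particular `"Average Fried Parameter"`) are assumed to match the columns of the DKIST result table, so adjust them if your results differ. The imports sit inside the function so that the block can be loaded without `dkist` or a network connection.

```python
# Columns to display; these names are assumed to match the DKIST result table.
COLUMNS = ("Dataset ID", "Start Time", "Average Fried Parameter", "Primary Proposal ID")


def find_visp_by_r0(sort_column="Average Fried Parameter"):
    """Search for unembargoed VISP datasets, highest average r0 first."""
    # Importing dkist.net registers the DKIST client and its attrs with Fido.
    import dkist.net  # noqa: F401
    from sunpy.net import Fido, attrs as a

    results = Fido.search(a.Instrument("VISP"), a.dkist.Embargoed(False))
    table = results["dkist"]
    # Astropy tables sort in place; reverse=True puts the highest value first.
    table.sort(sort_column, reverse=True)
    return table
```

Calling `find_visp_by_r0().show(*COLUMNS)` in a notebook should then display just the listed columns, with the dataset of interest in the first row.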

This gives us the dataset `BEOGN`, which happens to be the same one we have been using already.
(You should therefore already have the ASDF file on your machine, but we'll go through the motions for completeness.
Fido won't re-download an existing file unless told to.)
We can download the dataset ASDF with Fido to inspect it in more detail.
Remember that this only downloads a single ASDF file with some more metadata about the dataset, not the actual science data.
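
One possible way to wrap this step is sketched below. The helper name `fetch_and_load` and the download directory are illustrative choices, and `result_row` is assumed to be a single row of the DKIST search results (for example the first row after sorting).

```python
def fetch_and_load(result_row, path="~/sunpy/data/"):
    """Download the ASDF file for one search result and load it as a Dataset."""
    # Imported lazily so the sketch can be loaded without dkist installed.
    import dkist
    from sunpy.net import Fido

    # Only the small ASDF metadata file is fetched here, not the FITS data.
    asdf_files = Fido.fetch(result_row, path=path)
    return dkist.load_dataset(asdf_files[0])
```

The returned `Dataset` gives access to the FITS headers and coordinate information even though none of the science data has been transferred yet.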

Now that we have access to the FITS headers we can inspect the $r_0$ more closely, just as we did in the previous session.
Remember that `DINDEX4` is the Stokes index, so we can plot the $r_0$ for just Stokes I like so:
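
The selection itself can be sketched on plain Python sequences, as below. The values are made up, the helper name `stokes_i_r0` is hypothetical, and the sketch assumes Stokes I corresponds to `DINDEX4 == 1`.

```python
def stokes_i_r0(dindex4, r0, stokes_i_index=1):
    """Return the r0 values for frames whose Stokes index marks Stokes I."""
    return [value for index, value in zip(dindex4, r0) if index == stokes_i_index]


# Tiny made-up example: two raster steps with four Stokes states each.
dindex4 = [1, 2, 3, 4, 1, 2, 3, 4]
r0 = [0.10, 0.10, 0.11, 0.11, 0.12, 0.12, 0.13, 0.13]
print(stokes_i_r0(dindex4, r0))  # [0.1, 0.12]
```

With a loaded dataset, the two sequences would come from the header table, for instance `ds.headers["DINDEX4"]` and `ds.headers["ATMOS_R0"]` (the latter keyword name is an assumption here), and the result could be passed to `matplotlib.pyplot.plot`.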

Now let's slice down our dataset based on the first frame where $r_0$ is high:
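
Finding that first frame amounts to locating the first value above some cut-off. The sketch below shows the idea on a made-up list; the helper name `first_high_r0_index` and the threshold of 1.0 are illustrative assumptions, not values taken from the dataset.

```python
def first_high_r0_index(r0_values, threshold):
    """Return the index of the first frame whose r0 exceeds `threshold`.

    Returns None when no frame exceeds the threshold.
    """
    for index, value in enumerate(r0_values):
        if value > threshold:
            return index
    return None


r0 = [0.10, 0.12, 0.11, 2.50, 3.00]
cut = first_high_r0_index(r0, threshold=1.0)
good = r0[:cut]  # frames before the first high value
print(cut, good)  # 3 [0.1, 0.12, 0.11]
```

Because slicing with `None` keeps everything, `r0[:cut]` also behaves sensibly when no frame exceeds the threshold. The same index could then be used to slice the raster axis of the `Dataset`, provided you know which array axis that is for your data.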

We can now download only these files. Remember that you need Globus Connect Personal running for this.
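
A minimal sketch of this step is shown below, assuming the sliced dataset exposes the `files.download` method of the DKIST user tools. The wrapper name `download_subset` and the destination directory in the usage note are hypothetical.

```python
def download_subset(sub_dataset, path):
    """Start a Globus transfer for only the files behind `sub_dataset`."""
    # `path` must lie inside a directory that GCP is allowed to access,
    # and GCP must already be running for the transfer to reach this machine.
    sub_dataset.files.download(path=path)
```

For a sliced dataset called `sds`, this would be used as `download_subset(sds, "~/dkist_data")`. Only the FITS files that back the slice are requested, which is what keeps the download small.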

Now let's plot the Stokes I data at some wavelength:

You will notice that a lot of it is missing.
This is because we have deliberately only downloaded those frames with an acceptably low $r_0$.
You may also notice, though, that the `Dataset` object continues to function normally without the rest of the data.
When we try to access the data, if the file is missing then `Dataset` fills in the corresponding portions of the array with NaNs.
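
The idea can be illustrated with a toy example in plain Python. This is not how `Dataset` is implemented; it is only a conceptual sketch in which a text file with one number per line stands in for a FITS file, and both function names are made up.

```python
import math
from pathlib import Path


def read_numbers(path):
    """Toy reader: one float per line (stands in for reading a FITS file)."""
    return [float(line) for line in Path(path).read_text().split()]


def read_frame_or_nan(path, n_values, reader):
    """Return the values from `path`, or a NaN-filled frame if it is absent."""
    path = Path(path)
    if not path.exists():
        # The frame keeps its expected size, so the overall array shape holds.
        return [math.nan] * n_values
    return reader(path)


# A path that does not exist yields a NaN-filled frame of the requested size.
print(read_frame_or_nan("no_such_frame.txt", 3, read_numbers))  # [nan, nan, nan]
```

Keeping the frame size fixed is what allows slicing and plotting to work on a partially downloaded dataset: the missing regions simply show up as gaps.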

Since the seeing is bad for a significant contiguous portion of the data, we may simply want to discount that part and look only at the useful data.
In this case we can use the sub-dataset we made earlier:

Or of course we can make any arbitrary slice to look at whatever subset of the data we prefer.


## Plotting again

<!--
- put data in the plot
- animations
- single image w/ coord overlay
- line plots
- plotting (ndcube)
-->