# Searching the Catalog

This notebook shows how to use the data catalog ([catalog_blue.csv](https://github.com/NOAA-GFDL/spear-flp/blob/main/catalog_blue.csv) and [catalog_blue.json](https://github.com/NOAA-GFDL/spear-flp/blob/main/catalog_blue.json)) to find files and transfer them from Globus. We will walk through building a list of where each file you're interested in is stored on the Globus endpoint.

The SPEAR-MED data is available on Globus [at this link](https://app.globus.org/file-manager?origin_id=20987ccd-15d0-4719-a60c-86ab17d6a393&origin_path=%2F).

The packages [intake-esm](https://github.com/intake/intake-esm) and pandas are required to run this notebook.

In [1]:
import intake
import intake_esm
import pandas as pd
cat_url = "https://raw.githubusercontent.com/NOAA-GFDL/spear-flp/refs/heads/main/catalog_blue.json"

The first step is to load the catalog:

In [2]:
cat = intake.open_esm_datastore(cat_url)

Then we can run a search. Each argument has the form `<column>="<value>"` or `<column>=["<value1>", "<value2>"]`; a list matches any of the listed values:

In [3]:
subcat = cat.search(variable_id="WVP", experiment_id="SPEAR_c192_o1_Hist_AllForc_IC1921_K50",
        time_range=["192101-193012", "193101-194012"])
subcat

column,unique
activity_id,0
institution_id,0
source_id,0
experiment_id,1
frequency,1
realm,1
table_id,0
member_id,30
grid_label,0
variable_id,1
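
To make the search semantics concrete, here is a minimal sketch of a roughly equivalent filter on a plain pandas DataFrame: conditions on different columns are combined with AND, while a list of values for one column matches any of them. The table is an invented stand-in for `cat.df`, and `filter_like_search` is a hypothetical helper, not part of intake-esm.

```python
import pandas as pd

# Toy stand-in for cat.df; the rows are invented for illustration only.
toy = pd.DataFrame({
    "variable_id": ["WVP", "WVP", "WVP", "precip"],
    "time_range": ["192101-193012", "193101-194012",
                   "194101-195012", "192101-193012"],
})

def filter_like_search(df, **query):
    """Hypothetical helper mimicking the basic behaviour of cat.search()."""
    mask = pd.Series(True, index=df.index)
    for column, wanted in query.items():
        # A single value is treated as a one-element list.
        values = wanted if isinstance(wanted, list) else [wanted]
        # Rows must satisfy every column condition (AND across columns).
        mask &= df[column].isin(values)
    return df[mask]

result = filter_like_search(
    toy,
    variable_id="WVP",
    time_range=["192101-193012", "193101-194012"],
)
print(result)
```

Only the two `WVP` rows whose `time_range` is in the list survive the filter.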


A full list of column names can be obtained from the catalog's dataframe:

In [4]:
cat.df.columns

Index(['activity_id', 'institution_id', 'source_id', 'experiment_id',
       'frequency', 'realm', 'table_id', 'member_id', 'grid_label',
       'variable_id', 'time_range', 'chunk_freq', 'platform', 'dimensions',
       'cell_methods', 'standard_name', 'pass_qc', 'who_qc', 'path'],
      dtype='object')

The unique values available in a given column can be listed in the same way:

In [5]:
set(cat.df['realm'])

{'atmos',
 'atmos_4xdaily',
 'atmos_4xdaily_avg',
 'atmos_daily',
 'land_daily',
 'ocean'}
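
If you also want to know how many catalog entries fall under each option, `value_counts()` is one alternative to `set(...)`. The following is a sketch on an invented toy table standing in for `cat.df`:

```python
import pandas as pd

# Toy stand-in for cat.df; the values are invented for illustration only.
toy = pd.DataFrame({"realm": ["atmos", "ocean", "atmos", "land_daily", "atmos"]})

# Sorted list of the distinct values in the column.
unique_realms = sorted(toy["realm"].unique())

# Number of rows per distinct value.
counts = toy["realm"].value_counts()

print(unique_realms)
print(counts)
```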

To get the paths, we can read the `path` column directly from the search result's dataframe:

In [6]:
subcat.df.path

0     /collab1/data_untrusted/Aria.Radick/TFTEST/SPE...
1     /collab1/data_untrusted/Aria.Radick/TFTEST/SPE...
2     /collab1/data_untrusted/Aria.Radick/TFTEST/SPE...
3     /collab1/data_untrusted/Aria.Radick/TFTEST/SPE...
4     /collab1/data_untrusted/Aria.Radick/TFTEST/SPE...
5     /collab1/data_untrusted/Aria.Radick/TFTEST/SPE...
6     /collab1/data_untrusted/Aria.Radick/TFTEST/SPE...
7     /collab1/data_untrusted/Aria.Radick/TFTEST/SPE...
8     /collab1/data_untrusted/Aria.Radick/TFTEST/SPE...
9     /collab1/data_untrusted/Aria.Radick/TFTEST/SPE...
10    /collab1/data_untrusted/Aria.Radick/TFTEST/SPE...
11    /collab1/data_untrusted/Aria.Radick/TFTEST/SPE...
12    /collab1/data_untrusted/Aria.Radick/TFTEST/SPE...
13    /collab1/data_untrusted/Aria.Radick/TFTEST/SPE...
14    /collab1/data_untrusted/Aria.Radick/TFTEST/SPE...
15    /collab1/data_untrusted/Aria.Radick/TFTEST/SPE...
16    /collab1/data_untrusted/Aria.Radick/TFTEST/SPE...
17    /collab1/data_untrusted/Aria.Radick/TFTEST
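
pandas abbreviates long strings when it displays a column. One way to see the complete paths is to convert the column to a plain Python list. This sketch uses invented paths in place of `subcat.df.path`:

```python
import pandas as pd

# Hypothetical paths standing in for subcat.df.path.
paths = pd.Series([
    "/hypothetical/prefix/experiment_a/atmos/WVP.192101-193012.nc",
    "/hypothetical/prefix/experiment_a/atmos/WVP.193101-194012.nc",
])

# A plain list holds the full strings, with no display truncation.
full_paths = paths.to_list()
for p in full_paths:
    print(p)
```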

Next, we can save the paths to a text file for a batch Globus transfer. Each line of the file pairs a source path with a destination path:
```
</path/to/source_file_1.ext> </path/to/destination_file_1.ext>
</path/to/source_file_2.ext> </path/to/destination_file_2.ext>
...
```

In [None]:
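# Strip the local storage prefix from each catalog path
prefix = "/collab1/data_untrusted/Aria.Radick/TFTEST/"
rel_paths = [p.replace(prefix, "") for p in subcat.df.path]
# Use the same relative path for both source and destination
df2 = pd.DataFrame({"source": rel_paths, "dest": rel_paths})
# One "<source> <destination>" pair per line, with no header or index
df2.to_csv("my_search.txt", sep=" ", index=False, header=False)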
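
If you would rather place the files under a different directory at the destination, the two columns can differ. The following is a sketch only: `write_batch_file`, the prefix, the example paths, and the `dest_dir` argument are all hypothetical, and it assumes that no path contains spaces.

```python
from pathlib import Path
import tempfile

def write_batch_file(paths, prefix, out_file, dest_dir=""):
    """Write '<source> <destination>' lines; return the lines written."""
    lines = []
    for path in paths:
        # Drop the leading storage prefix if it is present.
        relative = path[len(prefix):] if path.startswith(prefix) else path
        lines.append(f"{relative} {dest_dir}{relative}")
    Path(out_file).write_text("\n".join(lines) + "\n")
    return lines

# Hypothetical catalog paths, for illustration only.
example_paths = [
    "/hypothetical/prefix/exp/atmos/WVP.192101-193012.nc",
    "/hypothetical/prefix/exp/atmos/WVP.193101-194012.nc",
]

out_file = Path(tempfile.mkdtemp()) / "my_search.txt"
batch_lines = write_batch_file(
    example_paths, "/hypothetical/prefix/", out_file, dest_dir="spear/"
)
print(out_file.read_text())
```

Writing the lines by hand avoids any quoting that a CSV writer might add, at the cost of not handling paths with spaces.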

This file is then passed to the Globus CLI by running

```
$ SPEAR_FLP_RESTRICTED="20987ccd-15d0-4719-a60c-86ab17d6a393"
$ GLOBUS_DEST="$(globus endpoint local-id):/path/to/directory/"
$ cat /path/to/my_search.txt | globus transfer $SPEAR_FLP_RESTRICTED $GLOBUS_DEST --batch -
```

in a terminal on the destination machine. See [the readme on GitHub](https://github.com/NOAA-GFDL/spear-flp/tree/main?tab=readme-ov-file#recommended-accessing-the-data) for further elaboration.

The [Globus CLI](https://docs.globus.org/cli/) is required to run these commands. If you don't already have a Globus endpoint to transfer to, [Globus Connect Personal](https://www.globus.org/globus-connect-personal) lets you set one up on your own machine.