# 4. Collections and Queries in DataFed
In this notebook, we will cover creating Collections, viewing their contents, organizing Collections, downloading Collections, and searching for data.

## Before we begin:
Import necessary libraries

In [None]:
import os
import random
from datafed.CommandLib import API

Instantiate the DataFed API and set ``context`` to the Training project

In [None]:
df_api = API()
df_api.setContext("p/trn001")

### <span style="color:green"> Exercise </span>
<span style="color:green"> Reset this variable to your username or Globus ID so that you work within your own collection by default </span>

In [None]:
parent_collection = "breetju"  # Name of this user

## Overview:
In this notebook, let us assume that we are assembling training data for a machine learning model. For illustration purposes, we will assume that the goal is to train a classifier that classifies animals.

### Create Collection
First, let us create a collection to hold all our data. 

We will be using the ``collectionCreate()`` function:

In [None]:
coll_resp = df_api.collectionCreate(
    "Image classification training data", parent_id=parent_collection
)
print(coll_resp)

In this case we got back a ``CollDataReply`` object. This is similar to the reply from the ``dataCreate()`` function we saw earlier.


### <span style="color:green"> Exercise </span>
<span style="color:green"> Extract the ``id`` of this newly created collection: </span>

In [None]:
train_coll_id = None
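
For reference, one way to pull the ID out of the response is sketched below. It assumes that DataFed functions return a `(message, message_type)` tuple and that a ``CollDataReply`` message keeps its collections in a repeated ``coll`` field, which is worth confirming against the printed response above. ``first_collection_id`` is a hypothetical helper name, not part of DataFed.

```python
def first_collection_id(coll_resp):
    """Return the ID of the first collection in a collectionCreate() response.

    Assumption: coll_resp is a (message, message_type) tuple whose message
    has a repeated ``coll`` field, as in the printed CollDataReply.
    """
    message = coll_resp[0]
    if len(message.coll) == 0:
        raise ValueError("Response contains no collections")
    return message.coll[0].id
```

With this helper, the exercise could be completed as `train_coll_id = first_collection_id(coll_resp)`.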

## Populate with training data
Now that we have a place to put the training data, let us populate this collection with examples of animals
### Define a function to generate (fake) training data:
We need a function to:
* Create a Data Record
* Put data into this Data Record

For simplicity, we will use some dummy data from a public Globus Endpoint. This information has been filled in for you via the ``raw_data_path`` variable.

### <span style="color:green"> Exercise </span>
<span style="color:green"> We have a skeleton function prepared for you along with comments to guide you. </span>

In [None]:
def generate_animal_data(is_dog=True):
    # this_animal = "cat"
    if is_dog:
        this_animal = "dog"
    # To mimic a real-life scenario, we append a number to the animal
    # type to denote the N-th example of a cat or dog. In this case, we
    # use a random integer.
    # suffix = "_" + str(random.randint(1, 1000))

    # Create the record here:

    # Extract the ID of the Record:

    # path to the file containing the (dummy) raw data
    raw_data_path = (
        "sdss#public/uufs/chpc.utah.edu/common/home/sdss/dr10/"
        "apogee/spectro/data/55574/55574.md5sum"
    )

    # Put the raw data into the record you just created:

    # Only return the ID of the Data Record you created:
    return this_rec_id
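
If you get stuck, the sketch below shows one possible shape for the completed function. It is not the official solution. It assumes that ``dataCreate()`` accepts a title plus a ``parent_id`` keyword and returns a `(message, message_type)` tuple whose message lists new records in a repeated ``data`` field, and that ``dataPut()`` takes a record ID and a source path. The name ``generate_animal_record`` and its extra parameters are hypothetical; they are passed in explicitly so that the function does not depend on notebook globals.

```python
import random


def generate_animal_record(api, parent_coll_id, raw_data_path, is_dog=True):
    """Create a Data Record titled e.g. "dog_42", start its upload, return its ID.

    ``api`` is expected to behave like a datafed.CommandLib.API instance.
    """
    this_animal = "dog" if is_dog else "cat"
    # A random suffix stands in for the N-th example of this animal
    title = this_animal + "_" + str(random.randint(1, 1000))

    # Create the record inside the given collection
    create_resp = api.dataCreate(title, parent_id=parent_coll_id)
    # Assumption: the reply message lists new records in a ``data`` field
    rec_id = create_resp[0].data[0].id

    # Start the upload of the raw data into the new record
    api.dataPut(rec_id, raw_data_path)
    return rec_id
```

Inside the notebook this might be called as `generate_animal_record(df_api, train_coll_id, raw_data_path, is_dog=False)`.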

#### Generate 5 examples of cats and dogs:

In [None]:
cat_records = list()
dog_records = list()
for _ in range(5):
    dog_records.append(generate_animal_data(is_dog=True))
for _ in range(5):
    cat_records.append(generate_animal_data(is_dog=False))

print(cat_records)
print(dog_records)

## Listing items in a Collection:
Let us take a look at the training data we have assembled so far using the ``collectionItemsList()`` function:

In [None]:
coll_list_resp = df_api.collectionItemsList(train_coll_id)
print(coll_list_resp)

### <span style="color:blue"> Note </span>
> <span style="color:blue"> If we had several dozens, hundreds, or even thousands of items in a Collection, we would need to call ``collectionItemsList()`` multiple times by stepping up the ``offset`` keyword argument each time to get the next “page” of results. </span>
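
The paging described in the note can be sketched as a small loop. This is a minimal illustration, assuming that ``collectionItemsList()`` accepts a ``count`` keyword alongside ``offset`` and that the returned message exposes its entries through a repeated ``item`` field. ``list_all_items`` is a hypothetical helper name.

```python
def list_all_items(api, coll_id, page_size=20):
    """Collect every item in a Collection by paging through the listing."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    items = []
    offset = 0
    while True:
        reply = api.collectionItemsList(coll_id, offset=offset, count=page_size)
        # Assumption: reply is a (message, message_type) tuple
        page = list(reply[0].item)
        items.extend(page)
        # A short (or empty) page means there is nothing left to fetch
        if len(page) < page_size:
            break
        offset += page_size
    return items
```

A call such as `list_all_items(df_api, train_coll_id)` would then return all entries regardless of how many pages the Collection spans.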

### <span style="color:green"> Discussion </span>
<span style="color:green"> Let's say that we are only interested in finding records that have cats in this (potentially) large collection of training data. How do we go about doing that? </span>

# Data Query / Search
### <span style="color:red"> Caution </span>
> <span style="color:red"> Search vocabulary is likely to change with newer versions of DataFed </span>

Use the DataFed web interface to:
* Search for cats
* Specifically in your collection
* Save the query

### <span style="color:blue"> Note </span>
> <span style="color:blue"> Saved queries can be found at the bottom of the navigation (left) pane under ``Project Data`` and ``Saved Queries`` </span>

# Find saved queries:
We can list all saved queries via ``queryList()``:

In [None]:
ql_resp = df_api.queryList()
print(ql_resp)

Notice that we again received the familiar ``ListingReply`` object as the response.

### <span style="color:green"> Exercise </span>
<span style="color:green"> Get the ``id`` of the desired query out of the response: </span>

In [None]:
query_id = None  # Replace None with your answer
print(query_id)
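
If you have several saved queries, picking one by position is fragile. The sketch below looks a query up by its title instead. It assumes that each entry in the ``ListingReply`` exposes ``id`` and ``title`` attributes through a repeated ``item`` field, which you can check against the printed response. ``find_query_id`` is a hypothetical helper, and the title you pass should be whatever name you gave the query in the web interface.

```python
def find_query_id(ql_resp, title):
    """Return the ID of the first saved query whose title matches ``title``."""
    # Assumption: ql_resp is a (message, message_type) tuple
    for item in ql_resp[0].item:
        if item.title == title:
            return item.id
    raise LookupError(f"No saved query titled {title!r}")
```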

# View the saved query
Use the ``queryView()`` function:

In [None]:
df_api.queryView(query_id)

# Run the saved query
Use the ``queryExec()`` function:

In [None]:
query_resp = df_api.queryExec(query_id)
print(query_resp)

Yet again, we get back the ``ListingReply`` message. 

### <span style="color:green"> Exercise </span>
<span style="color:green"> Extract just the ``id``s from each of the items in the message: </span>

In [None]:
# First get IDs from query result
cat_rec_ids = None  # Replace None with your answer

We already have the ground truth in ``cat_records``. Is this the same as what we got from the query?

In [None]:
print(set(cat_rec_ids) == set(cat_records))
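
If the comparison prints ``False``, it helps to see which IDs differ. The following sketch uses plain Python set operations and does not call DataFed at all. ``compare_ids`` is a hypothetical helper name.

```python
def compare_ids(found_ids, expected_ids):
    """Return (missing, unexpected) as sorted lists of record IDs."""
    found, expected = set(found_ids), set(expected_ids)
    missing = sorted(expected - found)  # expected but not returned by the query
    unexpected = sorted(found - expected)  # returned but not in the ground truth
    return missing, unexpected
```

For example, `missing, unexpected = compare_ids(cat_rec_ids, cat_records)` leaves both lists empty when the query result matches the ground truth exactly.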

# Separating cats from dogs
Our goal now is to gather all cat Data Records into a dedicated Collection

### <span style="color:green"> Exercise </span>
<span style="color:green"> Create a new Collection to hold the cat Data Records </span>

### <span style="color:green"> Exercise </span>
<span style="color:green"> Extract the ``id`` for this Collection: </span>

In [None]:
cat_coll_id = None  # Replace None with your answer

# Adding Items to Collection
Now let us add only the cat Data Records into this new collection using the ``collectionItemsUpdate()`` function:

In [None]:
cup_resp = df_api.collectionItemsUpdate(cat_coll_id, add_ids=cat_rec_ids)
print(cup_resp)

Unlike most DataFed functions, this function returns little information in its response.

### <span style="color:green"> Exercise </span>
<span style="color:green"> View the contents of the Cats Collection to make sure that all Cat Data Records are present in this Collection </span>

### <span style="color:green"> Exercise </span>
<span style="color:green"> View the contents of the main training data Collection: </span>

### <span style="color:blue"> Note </span>
> <span style="color:blue"> Data Records can exist in **multiple** Collections, just as videos or songs can exist on multiple playlists </span>

### <span style="color:green"> Exercise </span>
<span style="color:green"> Remove the cat Data Records from the training data collection. They already exist in the "Cats" Collection </span>

### <span style="color:green"> Exercise </span>
<span style="color:green"> View the contents of the training data Collection. Do you see the individual cat Data Records in this collection? </span>
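
The add-then-remove pattern from the last few exercises can be wrapped into one helper, sketched below. It assumes that ``collectionItemsUpdate()`` accepts a ``rem_ids`` keyword as the counterpart to the ``add_ids`` keyword used earlier. ``move_records`` is a hypothetical name.

```python
def move_records(api, rec_ids, src_coll_id, dst_coll_id):
    """Link records into dst_coll_id, then unlink them from src_coll_id."""
    rec_ids = list(rec_ids)
    # Adding before removing means each record always belongs to at least
    # one of the two collections while the move is in progress
    api.collectionItemsUpdate(dst_coll_id, add_ids=rec_ids)
    # Assumption: rem_ids is the keyword for unlinking items
    api.collectionItemsUpdate(src_coll_id, rem_ids=rec_ids)
```

In this notebook the call might look like `move_records(df_api, cat_rec_ids, train_coll_id, cat_coll_id)`.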

## Search or Organize?
If you could always search for your data, what is the benefit of organizing it into Collections?

# Download entire Collection

### <span style="color:blue"> Note </span>
> <span style="color:blue"> Recall that DataFed can download an arbitrarily large number of Records regardless of the physical locations of the DataFed repositories containing the data. </span>

Let us first make sure we don't already have a directory with the desired name:

In [None]:
dest_dir = "./cat_data"

if os.path.exists(dest_dir):
    import shutil

    shutil.rmtree(dest_dir)

### <span style="color:green"> Exercise </span>
<span style="color:green"> Download the entire Cat Collection with a single DataFed function call </span>

Let's verify that we did in fact download the data:

In [None]:
os.listdir(dest_dir)
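
The clean-up, download, and verification steps can be combined into one function, sketched below. It assumes that ``dataGet()`` accepts a Collection ID as its first argument, a destination path as its second, and a ``wait`` keyword that blocks until the transfer finishes; it also assumes, as the cells above imply, that DataFed creates the destination directory. ``download_collection`` is a hypothetical helper name. Note that it deletes any existing directory at ``dest_dir``.

```python
import os
import shutil


def download_collection(api, coll_id, dest_dir):
    """Download every record in a Collection and list the resulting files."""
    # Start from a clean slate, removing any earlier download
    if os.path.exists(dest_dir):
        shutil.rmtree(dest_dir)
    # Assumption: wait=True blocks until the transfer task has finished
    api.dataGet(coll_id, dest_dir, wait=True)
    return sorted(os.listdir(dest_dir))
```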

## <span style="color:green"> Optional Exercise </span>
<span style="color:green">1. Create a new Collection to hold the simulation data you created in the previous notebook <br>2. Use the functions you saw above to ensure that the Data Records only exist in the Simulation Collection </span>