# 4. Collections and Queries in DataFed
This notebook covers creating Collections, viewing the items they contain, organizing Collections, downloading Collections, and searching for data.

## Before we begin:
Import necessary libraries

In [1]:
import os
import random
import json
from datafed.CommandLib import API

Instantiate the DataFed API and set ``context`` to the Training project

In [2]:
df_api = API()
df_api.setContext("p/trn001")

### <span style="color:green"> Exercise </span>
<span style="color:green"> Reset this variable to your username or Globus ID so that you work within your own collection by default </span>

In [3]:
parent_collection = "somnaths"  # Name of this user

## Overview:
In this notebook, let us assume that we are assembling training data for a machine learning model. For illustration purposes, the model is a classifier that distinguishes between kinds of animals.

### Create Collection
First, let us create a collection to hold all our data. 

We will be using the ``collectionCreate()`` function:

In [4]:
coll_resp = df_api.collectionCreate(
    "Image classification training data", parent_id=parent_collection
)
print(coll_resp)

(coll {
  id: "c/43657980"
  title: "Image classification training data"
  owner: "p/trn001"
  ct: 1615405612
  ut: 1615405612
  parent_id: "c/34558900"
}
, 'CollDataReply')


In this case we got back a ``CollDataReply`` object. This is similar to the reply from ``dataCreate()`` that we saw earlier.

Now, let's extract the ``id`` of this newly created collection:

In [5]:
train_coll_id = coll_resp[0].coll[0].id
print(train_coll_id)

c/43657980
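The indexing in ``coll_resp[0].coll[0].id`` is easier to follow once the shape of the reply is clear: a two-element tuple holding the message and the name of its type. The sketch below mimics that shape with ``types.SimpleNamespace`` stand-ins. These are not real DataFed message objects, and ``fake_reply`` is a hypothetical name used only for illustration.

```python
from types import SimpleNamespace

# Stand-in for the reply printed above; NOT a real DataFed message object.
fake_reply = (
    SimpleNamespace(
        coll=[
            SimpleNamespace(
                id="c/43657980",
                title="Image classification training data",
            )
        ]
    ),
    "CollDataReply",
)

# Element 0 is the message, element 1 is the name of the message type.
message, reply_type = fake_reply

# "coll" behaves like a list of collections, so we index into it.
first_coll = message.coll[0]
print(reply_type, first_coll.id)
```

The same unpacking pattern applies to the other replies in this notebook, with ``data`` or ``item`` in place of ``coll``.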


## Populate with training data
Now that we have a place to put the training data, let us populate this collection with examples of animals.

### Define a function to generate (fake) training data:
We need a function to:
* Create a Data Record
* Put data into this Data Record

For simplicity, we will use some dummy data from a public Globus Endpoint. Its location has been filled in for you via the ``raw_data_path`` variable.

### <span style="color:green"> Exercise </span>
<span style="color:green"> We have a skeleton function prepared for you along with comments to guide you. </span>

In [6]:
def generate_animal_data(is_dog=True):
    this_animal = "cat"
    if is_dog:
        this_animal = "dog"
    # Ideally, we would have more sensible titles such as "Doberman",
    # "Poodle", etc. instead of "Dog_234"
    # To mimic a real-life scenario, we append a number to the animal
    # type to denote the N-th example of a cat or dog. In this case,
    # we use a random integer.
    title = this_animal + "_" + str(random.randint(1, 1000))
    # Create the record here:
    rec_resp = df_api.dataCreate(
        title,
        metadata=json.dumps({"animal": this_animal}),
        parent_id=train_coll_id,
    )

    # Extract the ID of the Record:
    this_rec_id = rec_resp[0].data[0].id

    # path to the file containing the (dummy) raw data
    raw_data_path = (
        "sdss#public/uufs/chpc.utah.edu/common/home/sdss/dr10/"
        "apogee/spectro/data/55574/55574.md5sum"
    )

    # Put the raw data into the record you just created:
    put_resp = df_api.dataPut(this_rec_id, raw_data_path)

    # Only return the ID of the Data Record you created:
    return this_rec_id
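The function above passes ``metadata`` as a JSON string rather than as a Python dictionary, which is why it calls ``json.dumps()``. The following minimal sketch isolates that step so the string can be inspected without contacting DataFed. ``build_metadata`` is a hypothetical helper introduced only for this illustration.

```python
import json


def build_metadata(is_dog):
    """Return the JSON metadata string for one example (hypothetical helper)."""
    return json.dumps({"animal": "dog" if is_dog else "cat"})


cat_meta = build_metadata(is_dog=False)
dog_meta = build_metadata(is_dog=True)
print(cat_meta)

# Decoding the string recovers the original key and value.
print(json.loads(dog_meta)["animal"])
```

Keeping the key name (``animal``) consistent across records matters, because metadata searches match on that key.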

#### Generate 5 examples of cats and dogs:

In [7]:
cat_records = list()
dog_records = list()
for _ in range(5):
    dog_records.append(generate_animal_data(is_dog=True))
for _ in range(5):
    cat_records.append(generate_animal_data(is_dog=False))

In [8]:
print(cat_records)

['d/43659103', 'd/43659126', 'd/43659149', 'd/43659172', 'd/43659195']


In [9]:
print(dog_records)

['d/43658988', 'd/43659011', 'd/43659034', 'd/43659057', 'd/43659080']


## Listing items in a Collection:
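Let us take a look at the training data we have assembled so far using the ``collectionItemsList()`` function. Passing ``offset=5`` skips the first five items, so only the remaining five of the ten records are listed: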

In [11]:
coll_list_resp = df_api.collectionItemsList(train_coll_id, offset=5)
print(coll_list_resp)

(item {
  id: "d/43658988"
  title: "dog_196"
  owner: "p/trn001"
  creator: "u/somnaths"
  size: 2842.0
  notes: 0
}
item {
  id: "d/43659034"
  title: "dog_57"
  owner: "p/trn001"
  creator: "u/somnaths"
  size: 2842.0
  notes: 0
}
item {
  id: "d/43659080"
  title: "dog_686"
  owner: "p/trn001"
  creator: "u/somnaths"
  size: 2842.0
  notes: 0
}
item {
  id: "d/43659011"
  title: "dog_689"
  owner: "p/trn001"
  creator: "u/somnaths"
  size: 2842.0
  notes: 0
}
item {
  id: "d/43659057"
  title: "dog_778"
  owner: "p/trn001"
  creator: "u/somnaths"
  size: 2842.0
  notes: 0
}
offset: 5
count: 20
total: 10
, 'ListingReply')


### <span style="color:blue"> Note </span>
> <span style="color:blue"> If we had several dozens, hundreds, or even thousands of items in a Collection, we would need to call ``collectionItemsList()`` multiple times by stepping up the ``offset`` keyword argument each time to get the next “page” of results. </span>
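The paging loop described in the note can be sketched as follows. This is a minimal illustration that runs against an in-memory stand-in rather than a DataFed server. ``list_all_items``, ``fetch_page``, and ``fake_fetch_page`` are hypothetical names, and the assumption that each page reports a ``total`` is taken from the ``ListingReply`` output shown above.

```python
def list_all_items(fetch_page, page_size=20):
    """Collect every item by requesting pages until `total` is reached.

    `fetch_page(offset, count)` is a hypothetical callable that returns
    a tuple (items, total) for a single page.
    """
    items = []
    offset = 0
    while True:
        page, total = fetch_page(offset, page_size)
        items.extend(page)
        offset += len(page)
        # Stop once everything has been seen or a page comes back empty.
        if not page or offset >= total:
            return items


# Stand-in data source so that the sketch runs without a DataFed server.
fake_collection = [f"d/{n}" for n in range(45)]


def fake_fetch_page(offset, count):
    return fake_collection[offset:offset + count], len(fake_collection)


all_items = list_all_items(fake_fetch_page)
print(len(all_items))
```

With a live session, ``fetch_page`` could wrap ``df_api.collectionItemsList(coll_id, offset=offset, count=count)`` and return ``(list(resp[0].item), resp[0].total)``, assuming the ``count`` keyword and the ``total`` field behave as the output above suggests.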

### <span style="color:green"> Discussion </span>
<span style="color:green"> Let's say that we are only interested in finding records that have cats in this (potentially) large collection of training data. How do we go about doing that? </span>

# Data Query / Search
### <span style="color:red"> Caution </span>
> <span style="color:red"> Search vocabulary is likely to change with newer versions of DataFed </span>

Use the DataFed web interface to:
* Search for cats
* Specifically in your collection
* Save the query

### <span style="color:blue"> Note </span>
> <span style="color:blue"> Saved queries can be found at the bottom of the navigation (left) pane under ``Project Data`` and ``Saved Queries``. </span>

# Find saved queries:
We can list all saved queries via ``queryList()``:

In [12]:
ql_resp = df_api.queryList()
print(ql_resp)

(item {
  id: "q/43664114"
  title: "is_cat"
}
offset: 0
count: 20
total: 1
, 'ListingReply')


Notice that we again received the familiar ``ListingReply`` object as the response.

### <span style="color:green"> Exercise </span>
<span style="color:green"> Get the ``id`` of the desired query out of the response: </span>

In [13]:
query_id = ql_resp[0].item[0].id
print(query_id)

q/43664114


# View the saved query
Use the ``queryView()`` function:

In [14]:
df_api.queryView(query_id)

(query {
   id: "q/43664114"
   title: "is_cat"
   query: "{\"meta\":\"md.animal == \\\"cat\\\"\",\"scopes\":[{\"scope\":4,\"id\":\"c/43657980\",\"recurse\":true}]}"
   owner: "u/somnaths"
   ct: 1615406400
   ut: 1615406400
 },
 'QueryDataReply')
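The ``query`` field in this reply is itself a JSON string, so it can be decoded to see exactly what the web interface saved. The sketch below copies the string from the output above instead of reading it from a live reply, so it runs without a DataFed connection.

```python
import json

# Query string copied from the queryView() output above.
saved_query = (
    r'{"meta":"md.animal == \"cat\"",'
    r'"scopes":[{"scope":4,"id":"c/43657980","recurse":true}]}'
)

parsed = json.loads(saved_query)

# The metadata expression that the search evaluates:
print(parsed["meta"])

# Each scope names where to search; here, our training collection.
for scope in parsed["scopes"]:
    print(scope["id"], scope["recurse"])
```

The decoded query shows two things: the metadata expression on the ``animal`` key, and a scope restricted to the collection ``c/43657980`` with recursion enabled.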

# Run the saved query
Use the ``queryExec()`` function:

In [15]:
query_resp = df_api.queryExec(query_id)
print(query_resp)

(item {
  id: "d/43659103"
  title: "cat_209"
  owner: "p/trn001"
  creator: "u/somnaths"
  size: 2842.0
  notes: 0
}
item {
  id: "d/43659126"
  title: "cat_511"
  owner: "p/trn001"
  creator: "u/somnaths"
  size: 2842.0
  notes: 0
}
item {
  id: "d/43659149"
  title: "cat_341"
  owner: "p/trn001"
  creator: "u/somnaths"
  size: 2842.0
  notes: 0
}
item {
  id: "d/43659172"
  title: "cat_558"
  owner: "p/trn001"
  creator: "u/somnaths"
  size: 2842.0
  notes: 0
}
item {
  id: "d/43659195"
  title: "cat_821"
  owner: "p/trn001"
  creator: "u/somnaths"
  size: 2842.0
  notes: 0
}
, 'ListingReply')


Yet again, we get back the ``ListingReply`` message. 

Now let us extract just the ``id``s from each of the items in the message:

In [16]:
cat_rec_ids = list()
for record in query_resp[0].item:
    cat_rec_ids.append(record.id)
# one could also use list comprehensions to get the answer in one line:
# cat_rec_ids = [record.id for record in query_resp[0].item]
print(cat_rec_ids)

['d/43659103', 'd/43659126', 'd/43659149', 'd/43659172', 'd/43659195']


We already have the ground truth in ``cat_records``. Is this the same as what we got from the query?

In [17]:
print(set(cat_rec_ids) == set(cat_records))

True
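When such a comparison returns ``False``, set differences show which IDs are responsible. The following is a minimal sketch using the IDs printed above. ``compare_ids`` is a hypothetical helper, and the shortened list in the second call is made up purely to demonstrate a mismatch.

```python
def compare_ids(expected, found):
    """Return (missing, unexpected) ID sets (hypothetical helper)."""
    expected, found = set(expected), set(found)
    return expected - found, found - expected


ground_truth = [
    "d/43659103", "d/43659126", "d/43659149", "d/43659172", "d/43659195",
]
from_query = list(ground_truth)

missing, unexpected = compare_ids(ground_truth, from_query)
print(missing, unexpected)

# Made-up mismatch: pretend the query returned only the first four records.
missing_demo, unexpected_demo = compare_ids(ground_truth, ground_truth[:4])
print(missing_demo)
```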


# Separating cats from dogs
Our goal now is to gather all cat Data Records into a dedicated Collection

### <span style="color:green"> Exercise </span>
<span style="color:green"> Create a new collection to hold the cat Data Records </span>

In [18]:
coll_resp = df_api.collectionCreate("Cats", parent_id=train_coll_id)


### <span style="color:green"> Exercise </span>
<span style="color:green"> Extract the ``id`` for this Collection: </span>

In [19]:
cat_coll_id = coll_resp[0].coll[0].id
print(cat_coll_id)

c/43666045


# Adding Items to Collection
Now let us add only the cat Data Records into this new collection using the ``collectionItemsUpdate()`` function:

In [20]:
cup_resp = df_api.collectionItemsUpdate(cat_coll_id, add_ids=cat_rec_ids)
print(cup_resp)

(, 'ListingReply')


Unlike most DataFed functions, ``collectionItemsUpdate()`` returned an empty ``ListingReply`` here, so there is nothing to extract from the response.

Now, let us view the contents of the Cats Collection to make sure that all Cat Data Records are present in this Collection. 

Just to keep the output clean and short, we will only extract the ID and title of the items

In [21]:
ls_resp = df_api.collectionItemsList(cat_coll_id)
# Iterating through the items in the Collection and only extracting
# a few items:
for obj in ls_resp[0].item:
    print(obj.id, obj.title)

d/43659103 cat_209
d/43659149 cat_341
d/43659126 cat_511
d/43659172 cat_558
d/43659195 cat_821


### <span style="color:green"> Exercise </span>
<span style="color:green"> View the contents of the main training data Collection. <br> You may use the snippet above if you like and modify it accordingly </span>

In [None]:
ls_resp = df_api.collectionItemsList(train_coll_id)
# Iterating through the items in the Collection and only extracting a
# few items:
for obj in ls_resp[0].item:
    print(obj.id, obj.title)

### <span style="color:blue"> Note </span>
> <span style="color:blue"> Data Records can exist in **multiple** Collections, just as videos or songs can exist on multiple playlists. </span>
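The playlist analogy can be made concrete with a toy model in which each collection is a set of record IDs: a link to the record, not a copy of it. This sketch is plain Python, not DataFed code, and the collection names and IDs are invented for illustration.

```python
# Toy in-memory model: each collection holds links (IDs), not copies.
collections = {
    "training": {"d/1", "d/2", "d/3"},
    "cats": set(),
}
cat_ids = {"d/1", "d/2"}

# Link the cat records into "cats"; they remain in "training" as well.
collections["cats"] |= cat_ids
in_both = collections["training"] & collections["cats"]

# Unlink them from "training"; the links in "cats" are unaffected.
collections["training"] -= cat_ids

print(sorted(in_both))
print(sorted(collections["training"]), sorted(collections["cats"]))
```

In this model, adding and removing links corresponds to the ``add_ids`` and ``rem_ids`` arguments of ``collectionItemsUpdate()``.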

### <span style="color:green"> Exercise </span>
<span style="color:green"> Remove the cat Data Records from the training data collection. They already exist in the "Cats" Collection. <br> **Hint**: The function call is very similar to the function call for adding cats to the "Cats" collection </span>

In [22]:
cup_resp = df_api.collectionItemsUpdate(train_coll_id, rem_ids=cat_rec_ids)
print(cup_resp)

(, 'ListingReply')


### <span style="color:green"> Exercise </span>
<span style="color:green"> View the contents of the training data Collection. <br> You may reuse a code snippet from an earlier cell. <br> Do you see the individual cat Data Records in this collection? </span>

In [None]:
ls_resp = df_api.collectionItemsList(train_coll_id)
# Iterating through the items in the Collection and only extracting
# a few items:
for obj in ls_resp[0].item:
    print(obj.id, obj.title)

## Search or Organize?
If you can always search for your data, what is the benefit of also organizing it into Collections?

# Download entire Collection

### <span style="color:blue"> Note </span>
> <span style="color:blue"> Recall that DataFed can download an arbitrarily large number of Records, regardless of the physical locations of the DataFed repositories containing the data. </span>

Let us first make sure we don't already have a directory with the desired name:

In [23]:
dest_dir = "./cat_data"

if os.path.exists(dest_dir):
    import shutil

    shutil.rmtree(dest_dir)

### <span style="color:green"> Exercise </span>
<span style="color:green"> Download the entire Cat Collection with a single DataFed function call. <br> **Hint:** You may want to look at a function we used in the third notebook </span>

In [24]:
df_api.dataGet(cat_coll_id, dest_dir, wait=True)

(task {
   id: "task/43667334"
   type: TT_DATA_GET
   status: TS_SUCCEEDED
   client: "u/somnaths"
   step: 2
   steps: 3
   msg: "Finished"
   ct: 1615407332
   ut: 1615407338
   source: "d/43659103, d/43659126, d/43659149, d/43659172, d/43659195, ..."
   dest: "1646e89e-f4f0-11e9-9944-0a8c187e8c12/Users/syz/OneDrive - Oak Ridge National Laboratory/DataFed_Tutorial/Notebooks/cat_data"
 },
 'TaskDataReply')

Let's verify that we did in fact download the data:

In [25]:
os.listdir(dest_dir)

['43659195.md5sum',
 '43659103.md5sum',
 '43659172.md5sum',
 '43659126.md5sum',
 '43659149.md5sum']
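Rather than comparing the listing by eye, the check can be automated. The sketch below assumes, based only on the listing above, that each file is named after the numeric part of its record ID plus the source file's extension. ``expected_filenames`` is a hypothetical helper, and the file list is copied from the output above so that the example runs without the download.

```python
def expected_filenames(record_ids, extension=".md5sum"):
    """Map IDs such as 'd/43659103' to names such as '43659103.md5sum'.

    The naming pattern is an assumption inferred from the listing above.
    """
    return {rec_id.split("/", 1)[1] + extension for rec_id in record_ids}


record_ids = [
    "d/43659103", "d/43659126", "d/43659149", "d/43659172", "d/43659195",
]

# Copied from the os.listdir() output above.
downloaded = [
    "43659195.md5sum",
    "43659103.md5sum",
    "43659172.md5sum",
    "43659126.md5sum",
    "43659149.md5sum",
]

missing_files = expected_filenames(record_ids) - set(downloaded)
print(missing_files)
```

In a live session, ``record_ids`` would be ``cat_rec_ids`` and ``downloaded`` would be ``os.listdir(dest_dir)``.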

## <span style="color:green"> Optional Exercise </span>
<span style="color:green">1. Create a new Collection to hold the simulation data you created in the previous notebook <br>2. Use the functions you saw above to ensure that the Data Records only exist in the Simulation Collection </span>