# Working with data collections

<table align="left">

  <td>
    <a href="https://github.com/DataBiosphere/terra-axon-examples/blob/main/first_hour_on_vwb/working_with_data_collections.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://github.com/DataBiosphere/terra-axon-examples/main/first_hour_on_vwb/working_with_data_collections.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in a Verily Workbench cloud environment
    </a>
  </td>                                                                                               
</table>

## Overview

This notebook provides a walkthrough of creating and using a [data collection](https://terra-docs.api.verily.com/docs/reference/glossary/#data-collection) in Verily Workbench. A data collection is a grouping of cloud-based resources related to a specific project, study, or purpose. To interact with the data in a data collection, you must have access at the policy level, via group membership, and you must [add the data collection to your workspace](https://terra-docs.api.verily.com/docs/how_to_guides/work_with_data/#add-a-data-collection).
Build upon the best practices described in this notebook to create and share your own data collections. 

### Objective

Perform common workspace resource operations including:

1. Create a new data collection from cloud data.
1. Share the data collection with collaborators.
1. Add the data collection as a resource to a new workspace.

#### How to run this notebook

Please run the setup section before running any other section in this notebook.

#### Costs

This notebook takes less than a minute to run, which will typically cost less than $0.01 of compute time on your cloud environment.

### Setup

Run the cell below to capture the ID of the current workspace. You'll use this value to return to the current workspace after you've created a new workspace as part of the process of creating a data collection.

In [None]:
import json
import ipywidgets as widgets
import subprocess

def get_current_workspace_id():
    """Resolve the ID of the current workspace."""
    # The `!` shell escape is IPython syntax, so this cell runs only in a notebook.
    cmd_output = !terra workspace describe --format=json | jq --raw-output ".id"
    return cmd_output[0]

CURRENT_WORKSPACE_ID = get_current_workspace_id()
print(f"Current workspace ID is {CURRENT_WORKSPACE_ID}")

# Widget utilities
input_style = {'description_width': 'initial'}
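
The `json` module imported above can also parse the CLI output directly, which avoids the dependency on `jq`. The following is a minimal sketch, assuming that `terra workspace describe --format=json` prints a single JSON object with a top-level `id` field (the same field the `jq` filter above reads). The helper names `parse_workspace_id` and `describe_current_workspace_id` are hypothetical and are not part of the CLI.

```python
import json
import subprocess


def parse_workspace_id(describe_json: str) -> str:
    """Extract the workspace ID from the JSON text printed by the CLI."""
    workspace = json.loads(describe_json)
    return workspace["id"]


def describe_current_workspace_id() -> str:
    """Run the same describe command as above and parse its output in Python."""
    result = subprocess.run(
        ["terra", "workspace", "describe", "--format=json"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the CLI exits with an error
    )
    return parse_workspace_id(result.stdout)


# Example with a hand-written payload (not real CLI output):
sample = '{"id": "my_study_230101_dc_ws", "name": "My Study"}'
print(parse_workspace_id(sample))
```

Keeping the parsing step separate from the subprocess call makes it possible to check the parsing logic without access to a workspace.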

## Create a data collection

The process of creating a data collection requires you to specify the following:
1. What data do you want to share? What type of resources (Cloud Storage buckets or objects, BigQuery tables, GitHub repositories) will be made available via this data collection?
1. With whom do you wish to share this data? Will you be sharing the data collection with all members of an existing Verily Workbench group (e.g. for your organization or team), or will you need to create a new Verily Workbench group in order to restrict access to the data collection?
1. Will you update the data collection by releasing future versions?

<div class="alert alert-block alert-success">
<b>Note:</b> 
    If you'd like to restrict access to your data collection to members of a specific group, you'll need to provide the <a href="https://et-docs-tests.googleplex.com/docs/reference/glossary/#policy">group policy constraint</a> at the time of workspace creation. See <a href="../creating_a_group.ipynb">../creating_a_group.ipynb</a> for details on how to create a Verily Workbench group that can be used for group policy constraints on workspaces and data collections. </div>

### Create a new workspace

In order to create a data collection, you must first create a new workspace. 
Run the cell below to create a widget, then populate the widget's input fields and click the button to create your new workspace.

Widget input parameters include:
- `workspace_id`: Must be unique and consist only of lowercase letters, numbers and underscores. Provide a workspace ID that suggests something about the contents of the data collection you'd like to create and include the date of its creation, such as `<STUDY_NAME>_<YYMMDD>_dc_ws`. *You cannot change the workspace ID after workspace creation.*
- `description`: Must be a string. This description will stay with the workspace after it becomes a data collection. Before you convert a workspace to a data collection, you can update this value in the UI.
- `name`: This is the name displayed in the Data Collection modal once the workspace is converted to a data collection, so the value should communicate the purpose of the workspace (e.g. `<STUDY_NAME> Data Collection`). Before you convert a workspace to a data collection, you can update this value in the UI.

The output should resemble:

```
Workspace successfully created.
ID:                <STUDY_NAME>_<YYMMDD>_dc_ws
Name:              <STUDY_NAME>-Data-<YYMMDD>
Description:       A new workspace which I will transform into a data collection.
Cloud Platform:    GCP
Google project:    <GOOGLE_PROJECT_ID>
Cloud console:     https://console.cloud.google.com/home/dashboard?project=<GOOGLE_PROJECT_ID>
Properties:
  terra-type: workspace
Created:           YYYY-MM-DD
Last updated:      YYYY-MM-DD
# Resources:       0
```

In [None]:
buttonOutput = 'Please provide appropriate values in the input boxes.'

input_workspace_id = widgets.Text(
 placeholder="<WORKSPACE_ID>",
 description="Workspace ID:",
 style=input_style
)
output_workspace_id = widgets.Text()
input_description = widgets.Text(
 placeholder="<DESCRIPTION>",
 description="Description:",
 style=input_style
 )
input_name = widgets.Text(
 placeholder="<RESOURCE_NAME>",
 description="Resource Name:",
 style=input_style
 )
display(input_workspace_id)
display(input_description)
display(input_name)

def bind_input_to_output(e):
    output_workspace_id.value = input_workspace_id.value

# define a function for the button to call
def button_click_event(b):
    with output:
        # Display form of the command; the quotes are for readability only.
        terraCommand = (
            f"terra workspace create --id={input_workspace_id.value} "
            f"--description=\"{input_description.value}\" --name={input_name.value}"
        )
        print("Running command to create workspace...")
        print(terraCommand)
        # Each argument is its own list item and no shell is involved, so the
        # description must not be wrapped in literal quote characters.
        result = subprocess.run(["terra", "workspace", "create",
                                 f"--id={input_workspace_id.value}",
                                 f"--description={input_description.value}",
                                 f"--name={input_name.value}"],
                                capture_output=True, text=True)
        print("Your workspace will be ready momentarily...")
        # Show stdout when present; otherwise show the error output.
        print(result.stdout if result.stdout else result.stderr)

# get a reference to the widget output
output = widgets.Output()

input_workspace_id.observe(bind_input_to_output)

button = widgets.Button(
    description='Create workspace',
    disabled=False,
    button_style='',
    tooltip='Click to create a new workspace',
    icon='check',
    layout=widgets.Layout(width='50%', height='40px')
)

#bind the button_click_event to the button call event
button.on_click(button_click_event)

# show the current state of the output
print(buttonOutput)

#display the button
display(button, output)
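
The workspace ID rule listed in the widget parameters (lowercase letters, numbers and underscores only) can be checked before the command is sent. The following is a small sketch of such a check, plus a helper that builds an ID in the suggested `<STUDY_NAME>_<YYMMDD>_dc_ws` shape. Both function names are hypothetical, and the sketch assumes the character rule is the only constraint; the service may enforce others, such as a length limit.

```python
import re
from datetime import date

# Assumption: only the character rule stated in this notebook is checked.
_WORKSPACE_ID_PATTERN = re.compile(r"[a-z0-9_]+")


def is_valid_workspace_id(workspace_id: str) -> bool:
    """Return True if the ID uses only lowercase letters, digits and underscores."""
    return _WORKSPACE_ID_PATTERN.fullmatch(workspace_id) is not None


def suggest_workspace_id(study_name: str, created: date) -> str:
    """Build an ID shaped like <STUDY_NAME>_<YYMMDD>_dc_ws."""
    # Lowercase the study name and replace any other character with "_".
    slug = re.sub(r"[^a-z0-9_]", "_", study_name.lower())
    return f"{slug}_{created:%y%m%d}_dc_ws"


candidate = suggest_workspace_id("My-Study", date(2023, 1, 5))
print(candidate, is_valid_workspace_id(candidate))
```

A check like this could be called at the start of the button handler so that an invalid ID is reported in the widget output instead of being passed to the CLI.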

### Add resources to data collection

Add new controlled and/or referenced resources to the workspace created in the previous step using the Terra CLI or the Verily Workbench Resources tab. To easily add resources, use the widgets provided in [../workspace_resource_examples.ipynb](../workspace_resource_examples.ipynb).
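
For the CLI route, resource commands can be assembled as argument lists in the same way as the workspace commands in this notebook. The sketch below assumes the Terra CLI subcommands `terra resource add-ref gcs-bucket` (referenced resource) and `terra resource create gcs-bucket` (controlled resource) with `--name` and `--bucket-name` flags; confirm the exact flags with `terra resource --help` before relying on them. The function name, resource name and bucket name are placeholders.

```python
import shlex
from typing import List


def gcs_bucket_command(kind: str, resource_name: str, bucket_name: str) -> List[str]:
    """Build the argument list for adding a Cloud Storage bucket resource.

    kind is "add-ref" for a referenced bucket or "create" for a controlled one.
    The subcommand and flag names are assumptions; verify them with the CLI help.
    """
    if kind not in ("add-ref", "create"):
        raise ValueError(f"kind must be 'add-ref' or 'create', got {kind!r}")
    return [
        "terra", "resource", kind, "gcs-bucket",
        f"--name={resource_name}",
        f"--bucket-name={bucket_name}",
    ]


command = gcs_bucket_command("add-ref", "raw_reads", "my-existing-bucket")
# Print the command for review; pass `command` to subprocess.run to execute it.
print(shlex.join(command))
```

Printing the command first gives a chance to confirm that the cloud environment points at the intended workspace before any resource is added.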

### Reset your current workspace ID

When you create a new workspace via the Verily Workbench CLI, your cloud environment will automatically point to that newly-created workspace as your current workspace. This behavior means you can jump right into adding resources once you've created a new workspace without having to point the cloud environment to the new workspace explicitly. However, since you will next convert your newly created workspace into a data collection, you should ensure that you point your cloud environment back to your original, non-data-collection workspace before proceeding further in this notebook.

Run the cell below to point your cloud environment to the original workspace as your current workspace.

In [None]:
!terra workspace set --id={CURRENT_WORKSPACE_ID}

### Convert workspace to data collection

Now you'll convert your newly created workspace, to which you have added resources, into a data collection which can be shared with others and added to other workspaces. 
Run the cell below to create a widget, then populate the widget's input fields and click the button to convert the workspace to a data collection.

Widget input parameters include:
- `workspace_id`: Must match the workspace ID of the workspace created above.
- `short-description`: Must be a string. This description will be visible in the Add a Data Collection modal and should summarize the purpose and/or contents of your data collection.
- `version`: Must be a string. This field is useful in particular if future releases of the data in this data collection are planned. It's suggested that the same versioning strategy be used for all releases of a data collection to make it transparent for users (e.g., if the first release is version `1.0`, the next release should be `2.0`, not `Version 2`).

The output should resemble:
```
Workspace properties successfully updated.
ID:                <STUDY_NAME>_<YYMMDD>_dc_ws
Name:              <STUDY_NAME> Data <YYMMDD>
Description:       <DESCRIPTION>
Cloud Platform:    GCP
Google project:    <GOOGLE_PROJECT_ID>
Cloud console:     https://console.cloud.google.com/home/dashboard?project=<GOOGLE_PROJECT_ID>
Properties:
  terra-workspace-short-description: <SHORT_DESCRIPTION>
  terra-workspace-version: <VERSION>
  terra-type: data-collection
Created:           YYYY-MM-DD
Last updated:      YYYY-MM-DD
# Resources:       0
```
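
One way to keep the versioning strategy for the `version` parameter consistent is to derive each release string from the previous one instead of typing it by hand. The sketch below assumes versions follow the `MAJOR.MINOR` pattern from the example above (`1.0`, then `2.0`). `next_major_version` is a hypothetical helper, not part of the Workbench CLI.

```python
import re


def next_major_version(current: str) -> str:
    """Return the next major release string, e.g. '1.0' becomes '2.0'."""
    match = re.fullmatch(r"(\d+)\.(\d+)", current)
    if match is None:
        raise ValueError(
            f"Expected a MAJOR.MINOR version such as '1.0', got {current!r}"
        )
    major = int(match.group(1))
    return f"{major + 1}.0"


print(next_major_version("1.0"))
```

A free-form value such as `Version 2` raises a `ValueError` here, which surfaces an inconsistent version string before it is written to the data collection.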

In [None]:
input_dc_workspace_id = widgets.Text(
 placeholder="<WORKSPACE_ID>",
 description="Workspace ID:",
 style=input_style,
 value=output_workspace_id.value
)
input_short_description = widgets.Text(
 placeholder="<SHORT_DESCRIPTION>",
 description="Short Description:",
 style=input_style
 )
input_version = widgets.Text(
 placeholder="<VERSION>",
 description="Version:",
 style=input_style
 )
display(input_dc_workspace_id)
display(input_short_description)
display(input_version)

# message shown until the button is clicked
buttonOutput = 'Please provide appropriate values in the input boxes.'

# define a function for the button to call
def button_click_event(b):
    with output:
        # Comma-separated key=value pairs that mark the workspace as a data collection.
        properties = (
            "terra-type=data-collection,"
            f"terra-workspace-short-description={input_short_description.value},"
            f"terra-workspace-version={input_version.value}"
        )
        # Display form of the command; the quotes are for readability only.
        terraCommand = f"""terra workspace set-property \\
        --workspace={input_dc_workspace_id.value} \\
        --properties=\"{properties}\"
        """
        print("Running command to convert workspace to data collection...")
        print(terraCommand)
        result = subprocess.run(["terra", "workspace", "set-property",
                                 f"--workspace={input_dc_workspace_id.value}",
                                 f"--properties={properties}"],
                                capture_output=True, text=True)
        print("Your data collection will be ready momentarily...")
        # Show stdout when present; otherwise show the error output.
        print(result.stdout if result.stdout else result.stderr)

# get a reference to the widget output
output = widgets.Output()

button = widgets.Button(
    description='Convert to data collection',
    disabled=False,
    button_style='',
    tooltip='Click to convert to data collection',
    icon='check',
    layout=widgets.Layout(width='50%', height='40px')
)

#bind the button_click_event to the button call event
button.on_click(button_click_event)

# show the current state of the output
print(buttonOutput)

#display the button
display(button, output)
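
Because `--properties` takes comma-separated `key=value` pairs, a short description or version that itself contains a comma or an equals sign could plausibly be misparsed. The sketch below builds the argument and rejects such values up front. How the CLI actually handles these characters is an assumption, and `build_properties_arg` is a hypothetical helper.

```python
def build_properties_arg(short_description: str, version: str) -> str:
    """Build the --properties argument used to mark a workspace as a data collection."""
    properties = {
        "terra-type": "data-collection",
        "terra-workspace-short-description": short_description,
        "terra-workspace-version": version,
    }
    for key, value in properties.items():
        if not value.strip():
            raise ValueError(f"{key} must not be empty")
        # Assumption: "," and "=" are treated as separators by the CLI.
        if "," in value or "=" in value:
            raise ValueError(f"{key} must not contain ',' or '=': {value!r}")
    pairs = ",".join(f"{key}={value}" for key, value in properties.items())
    return f"--properties={pairs}"


print(build_properties_arg("Example cohort data", "1.0"))
```

The returned string can be passed as a single list item to `subprocess.run`, in the same way as the `--properties` argument in the cell above.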

### Add data collection to workspace

Follow the steps below to add your new data collection to a Verily Workbench workspace.<br>The video below provides a visual walkthrough of these steps.

1. In the [Verily Workbench workspace UI](https://terra.verily.com/workspaces), select a workspace that is NOT the data collection workspace.
1. Navigate to the Resources tab.
1. Click the "+ Data catalog" button.
1. Select your newly created data collection from those listed in the modal.
1. Navigate through the steps in the modal to complete the addition of the data collection to your workspace.

<video controls src="screencasts/add_data_collection_to_workspace.mp4" width=600>Add data collection to workspace</video>

## Provenance

Generate information about this notebook environment and the packages installed.

In [None]:
!date

Conda and pip installed packages:

In [None]:
!conda env export

JupyterLab extensions:

In [None]:
!jupyter labextension list

Number of cores:

In [None]:
!grep ^processor /proc/cpuinfo | wc -l

Memory:

In [None]:
!grep "^MemTotal:" /proc/meminfo

---
Copyright 2022 Verily Life Sciences LLC

Use of this source code is governed by a BSD-style   
license that can be found in the LICENSE file or at   
https://developers.google.com/open-source/licenses/bsd