# Working with data collections

<table align="left">

  <td>
    <a href="https://github.com/DataBiosphere/terra-axon-examples/blob/main/first_hour_on_vwb/working_with_data_collections.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
</table>

## Overview

This notebook provides a walkthrough of creating and using a [data collection](https://terra-docs.api.verily.com/docs/reference/glossary/#data-collection) in Verily Workbench. A data collection is a group of cloud-based resources related to a specific project, study or purpose. To interact with the data in a data collection, you must have access at the policy level, via group membership, and you must [add the data collection to your workspace](https://terra-docs.api.verily.com/docs/how_to_guides/work_with_data/#add-a-data-collection).
Build upon the best practices demonstrated in this notebook to create and share your own data collections. 

### Objective

This notebook will guide you through creating a new data collection or converting an existing workspace into a data collection which you can then share with collaborators and use in your workspaces.

- [Create a new data collection](#create-new-dc)
    - [Create a new workspace](#create-new-ws)
    - [Add referenced and controlled resources to your new workspace](#add-resources)
    - [Transform the workspace into a data collection](#convert-to-dc)
- [Convert an existing workspace to a data collection](#convert-existing)
- [Add a data collection to your workspace](#add-dc-to-ws)

#### How to run this notebook

Please run the [Setup](#setup) section before running any other section in this notebook.

#### Costs

This notebook takes less than a minute to run, which will typically cost less than $0.01 of compute time on your cloud environment.

### Setup
<a id="setup"></a>

Run the cell below to capture the ID of the current workspace. You'll use this value to return to the current workspace after you've created a new workspace as part of the process of creating a data collection.

In [None]:
import json
import ipywidgets as widgets
import subprocess
import widget_utils as wu

def get_current_workspace_id():
    """Resolve the ID of the current workspace (requires IPython and jq)."""
    # `!command` is IPython syntax; it returns the command's output as a list of lines.
    cmd_output = !terra workspace describe --format=json | jq --raw-output ".id"
    return cmd_output[0]

CURRENT_WORKSPACE_ID = get_current_workspace_id()
print(f"Current workspace ID is {CURRENT_WORKSPACE_ID}")
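
The cell above depends on IPython's `!` shell syntax and on `jq` being installed. As a minimal sketch of an alternative, assuming `terra workspace describe --format=json` prints a JSON object with a top-level `id` field (the field the `jq` filter above reads), the same lookup can be written with `subprocess` and `json`. The function names here are illustrative and not part of any Workbench library.

```python
import json
import subprocess


def parse_workspace_id(describe_json: str) -> str:
    """Extract the top-level "id" field from `terra workspace describe` JSON output."""
    workspace = json.loads(describe_json)
    return workspace["id"]


def get_current_workspace_id_via_subprocess() -> str:
    """Run the CLI and return the current workspace ID.

    Raises subprocess.CalledProcessError if the CLI exits with a non-zero status.
    """
    result = subprocess.run(
        ["terra", "workspace", "describe", "--format=json"],
        capture_output=True, text=True, check=True,
    )
    return parse_workspace_id(result.stdout)
```

Keeping the parsing step in its own function means it can be checked against a saved JSON string without calling the CLI.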

## Create a new data collection
<a id="create-new-dc"></a>

Before creating your data collection, consider the following:
1. What data do you want to share? What type of resources--Cloud Storage buckets or objects, BigQuery tables, GitHub repositories--will be made available via this data collection?
1. With whom do you wish to share this data? Will you be sharing the data collection with all members of an existing Verily Workbench group (e.g. for your organization or team), or will you need to create a new Verily Workbench group in order to restrict access to the data collection?
1. Will you update the data collection by releasing future versions? What versioning scheme is most appropriate?

<div class="alert alert-block alert-success">
<b>Note:</b> 
    If you'd like to restrict access to your data collection to members of a specific group, contact <a href="mailto:workbench-support@verily.com">Verily Workbench Support</a> <b>before</b> creating your workspace so that a <a href="https://terra-docs.api.verily.com/docs/reference/glossary/#policy">group policy constraint</a> can be applied at the time of workspace creation. See <a href="../creating_a_group.ipynb">../creating_a_group.ipynb</a> for details on how to create a Verily Workbench group that can be used for group policy constraints on workspaces and data collections. </div>

### Create a new workspace
<a id="create-new-ws"></a>

In order to create a data collection, you must first create a new workspace. 
Run the cell below to create a widget, then populate the widget's input fields and click the button to create your new workspace.

Widget input parameters include:
- `Workspace Name`: Must be a string. This value is displayed in the Data Collection modal once the workspace is converted to a data collection, so the value should communicate the intended purpose (e.g. `<STUDY_NAME> Data Collection`).<br> While the Verily Workbench UI and this widget require a workspace name to be provided, the CLI does not; if no workspace name is provided to the CLI, a UUID is generated instead. 
- `Description`: Must be a string. This description will stay with the workspace after it becomes a data collection. Before you convert a workspace to a data collection, you can update this value in the UI.
- `Workspace ID`: Must be unique and consist only of lowercase letters, numbers and underscores. Provide a workspace ID that suggests something about the contents of the data collection you'd like to create and include the date of its creation, such as `<STUDY_NAME>_<YYMMDD>_dc_ws`. *You cannot change the workspace ID after workspace creation.* 


The output should resemble:

```
Workspace successfully created.
ID:                <WORKSPACE_ID>
Name:              <WORKSPACE_NAME>
Description:       <DESCRIPTION>
Cloud Platform:    GCP
Google project:    <GOOGLE_PROJECT_ID>
Cloud console:     https://console.cloud.google.com/home/dashboard?project=<GOOGLE_PROJECT_ID>
Properties:
  terra-type: workspace
Created:           YYYY-MM-DD
Last updated:      YYYY-MM-DD
# Resources:       0
```
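
The workspace ID rule from the parameter list above (lowercase letters, numbers and underscores only) can be checked before the CLI is called. The sketch below encodes only that stated rule and the suggested `<STUDY_NAME>_<YYMMDD>_dc_ws` naming pattern; the service may enforce further constraints such as length or uniqueness, and both function names are hypothetical.

```python
import re
from datetime import date

# Assumption: this pattern covers only the rule stated in this notebook.
# The service itself may reject IDs for other reasons (length, uniqueness).
_WORKSPACE_ID_PATTERN = re.compile(r"[a-z0-9_]+")


def is_valid_workspace_id(workspace_id: str) -> bool:
    """Return True if the ID uses only lowercase letters, numbers and underscores."""
    return _WORKSPACE_ID_PATTERN.fullmatch(workspace_id) is not None


def suggest_workspace_id(study_name: str, created: date) -> str:
    """Build an ID of the form <STUDY_NAME>_<YYMMDD>_dc_ws."""
    # Lowercase the name and collapse every run of other characters to one underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", study_name.lower()).strip("_")
    return f"{slug}_{created:%y%m%d}_dc_ws"
```

Because the workspace ID cannot be changed after creation, validating it up front avoids creating a workspace with an ID you did not intend.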

In [None]:
class CreateWorkspaceWidget(object):
    def __init__(self):
        self.label = widgets.Label(value='Please provide appropriate values in the input boxes.')
        self.warning = wu.WarningWidget('The workspace name provided will be shown in the data catalog once the workspace is converted to a data collection.').get()
        self.input_name = wu.TextInputWidget("<WORKSPACE_NAME>","Workspace Name:").get()
        self.input_description = wu.TextInputWidget("<DESCRIPTION>","Description:").get()
        self.input_workspace_id = wu.TextInputWidget("<WORKSPACE_ID>","Workspace ID:").get()
        self.output_workspace_id = widgets.Text()
        self.output_workspace_id.value = self.input_workspace_id.value
        self.button = wu.StyledButton('Create workspace','Click to create a new workspace','plus').get()
        self.button.on_click(self.create_workspace)
        self.output = widgets.Output()
        self.vb = widgets.VBox(
            children = [self.label, self.warning,
                        self.input_name, self.input_description,
                        self.input_workspace_id,
                        self.button, self.output],
            layout = wu.vbox_layout)

    def get_workspace_id(self):
        return self.input_workspace_id.value

    def create_workspace(self,b):
        with self.output:
            commandList = [
                "terra", "workspace", "create",
                f"--id={self.input_workspace_id.value.strip()}",
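                # No embedded quotes: subprocess passes each list element as one argument,
                # so literal quote characters would become part of the description.
                f"--description={self.input_description.value}",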
                f"--name={self.input_name.value.strip()}"
            ]

            print('Running command to create workspace...')
            print('\n'.join(commandList))
            print('')
            print("Your workspace will be ready in less than one minute...")

            result = subprocess.run(commandList, capture_output = True, text = True)
            print(result.stdout or result.stderr)

create_ws_widget = CreateWorkspaceWidget()
display(create_ws_widget.vb)
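
The widget passes the command to `subprocess.run` as a list, so each element reaches the program as a single argument with no shell parsing, and values containing spaces need no extra quoting. The sketch below shows one way to wrap such a call so that success is judged by the exit code rather than by whether stdout is empty. It assumes the CLI follows the usual convention of a non-zero exit code on failure; `run_cli` is a hypothetical helper, and the Python interpreter stands in for `terra` so the example runs anywhere.

```python
import shlex
import subprocess
import sys
from typing import Sequence, Tuple


def run_cli(command: Sequence[str]) -> Tuple[bool, str]:
    """Run a command given as an argument list and return (succeeded, message)."""
    # Print a shell-quoted form so the command can be pasted into a terminal.
    print(shlex.join(command))
    result = subprocess.run(list(command), capture_output=True, text=True)
    succeeded = result.returncode == 0
    message = result.stdout if succeeded else result.stderr
    return succeeded, message.strip()


# Demonstration: the Python interpreter stands in for the `terra` CLI.
ok, message = run_cli([sys.executable, "-c", "print('value with spaces arrives intact')"])
failed_ok, _ = run_cli([sys.executable, "-c", "import sys; sys.exit(3)"])
```

`shlex.join` adds quoting only to the printed copy of the command; the list passed to `subprocess.run` is unchanged.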

### Add resources to data collection
<a id="add-resources"></a>

Add new controlled and/or referenced resources to the workspace created in the previous step using the Terra CLI or the Verily Workbench Resources tab. To easily add resources, use the widgets provided in [../workspace_resource_examples.ipynb](../workspace_resource_examples.ipynb).

### Reset your current workspace ID

When you create a new workspace via the Verily Workbench CLI, your cloud environment will automatically point to that newly-created workspace as your current workspace. This behavior means you can jump right into adding resources once you've created a new workspace without having to point the cloud environment to the new workspace explicitly. However, since you will next convert your newly created workspace into a data collection, you should ensure that you point your cloud environment back to your original, non-data-collection workspace before proceeding further in this notebook.

Run the cell below to point your cloud environment to the original workspace as your current workspace.

In [None]:
!terra workspace set --id={CURRENT_WORKSPACE_ID}
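
If you script workspace creation instead of using the widget, switching back is easy to forget. A minimal sketch of one way to make the reset automatic is a context manager that runs `terra workspace set --id=...` (the same command as the cell above) when the block exits, even if an error occurs inside it. The function names and the `run` parameter are hypothetical; `run` exists only so the behavior can be exercised without the CLI.

```python
import subprocess
from contextlib import contextmanager


def set_current_workspace(workspace_id, run=subprocess.run):
    """Point the CLI at `workspace_id` using `terra workspace set`."""
    return run(
        ["terra", "workspace", "set", f"--id={workspace_id}"],
        capture_output=True, text=True,
    )


@contextmanager
def restore_workspace_afterwards(original_id, run=subprocess.run):
    """Switch back to `original_id` when the block exits, even if it raises."""
    try:
        yield
    finally:
        set_current_workspace(original_id, run=run)


# Intended use (not executed here):
# with restore_workspace_afterwards(CURRENT_WORKSPACE_ID):
#     ...  # create the new workspace and add resources to it
```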

### Convert new workspace to data collection
<a id="convert-to-dc"></a>

Now you'll convert your newly created workspace, to which you have added resources, into a data collection which can be shared with others and added to other workspaces. 
Run the cell below to create a widget, then populate the widget's input fields and click the button to convert the workspace to a data collection.

Widget input parameters include:
- `Workspace ID`: Must match the workspace ID of the workspace created above.
- `Short Description`: Must be a string. This description will be visible in the Add a Data Collection modal and should summarize the purpose and/or contents of your data collection.
- `Version`: Must be a string. This field is particularly useful if you plan future releases of the data in this data collection. Use the same versioning scheme for every release so that versions are easy for users to compare (e.g., if the first release is version `1.0`, the next release should be `2.0`, not `Version 2`).

The output should resemble:
```
Workspace properties successfully updated.
ID:                <STUDY_NAME>_<YYMMDD>_dc_ws
Name:              <STUDY_NAME> Data <YYMMDD>
Description:       <DESCRIPTION>
Cloud Platform:    GCP
Google project:    <GOOGLE_PROJECT_ID>
Cloud console:     https://console.cloud.google.com/home/dashboard?project=<GOOGLE_PROJECT_ID>
Properties:
  terra-workspace-short-description: <SHORT_DESCRIPTION>
  terra-workspace-version: <VERSION>
  terra-type: data-collection
Created:           YYYY-MM-DD
Last updated:      YYYY-MM-DD
# Resources:       0
```
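
To confirm the conversion programmatically, the `Properties:` section of the printed output can be parsed and checked for `terra-type: data-collection`. This is a sketch that assumes the human-readable layout shown above (a `Properties:` line followed by indented `key: value` lines); that layout could change between CLI versions, and the function names are illustrative.

```python
def parse_properties(cli_output: str) -> dict:
    """Collect the indented `key: value` lines that follow the `Properties:` line."""
    properties = {}
    in_properties = False
    for line in cli_output.splitlines():
        if line.strip() == "Properties:":
            in_properties = True
        elif in_properties and line.startswith(" ") and ":" in line:
            # Split on the first colon only, so values may contain colons.
            key, _, value = line.strip().partition(":")
            properties[key.strip()] = value.strip()
        elif in_properties:
            break  # the first non-indented line ends the section
    return properties


def is_data_collection(cli_output: str) -> bool:
    """Return True if the output reports terra-type as data-collection."""
    return parse_properties(cli_output).get("terra-type") == "data-collection"
```

In the widgets in this notebook, the text to check would be `result.stdout` from the `subprocess.run` call.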

In [None]:
class ConvertToDataCollectionWidget(object):
    def __init__(self, prev_widget):
        self.label = widgets.Label(value = 'Please provide appropriate values in the input boxes.')
        self.input_workspace_id = wu.TextInputWidget("<WORKSPACE_ID>","Workspace ID:").get()
        if prev_widget and not self.input_workspace_id.value:
            self.input_workspace_id.value = prev_widget.get_workspace_id()
        self.input_short_description = wu.TextInputWidget("<SHORT_DESCRIPTION>","Short Description:").get()
        self.input_version = wu.TextInputWidget("<VERSION>","Version:").get()
        self.button = wu.StyledButton('Convert to data collection','Click to convert to data collection','check').get()
        self.button.on_click(self.convert_to_data_collection)
        self.output = widgets.Output()
        self.vb = widgets.VBox([
            self.label,
            self.input_workspace_id,
            self.input_short_description,
            self.input_version,
            self.button,
            self.output
        ], layout=wu.vbox_layout)
        
    def convert_to_data_collection(self,b):
        with self.output:
            terraCommand = f"""terra workspace set-property \\
            --workspace={self.input_workspace_id.value} \\
            --properties=\"terra-type=data-collection,terra-workspace-short-description={self.input_short_description.value},terra-workspace-version={self.input_version.value}\"
            """
            print("Running command to convert workspace to data collection...")
            print(terraCommand)
            print("Your data collection will be ready in less than one minute...")
            result = subprocess.run(["terra","workspace","set-property",
                                     f"--workspace={self.input_workspace_id.value}",
                                     f"--properties=terra-type=data-collection,terra-workspace-short-description={self.input_short_description.value},terra-workspace-version={self.input_version.value}"],
                                    capture_output=True,text=True)
            print(result.stdout or result.stderr)

convert_to_dc_widget = ConvertToDataCollectionWidget(create_ws_widget)
display(convert_to_dc_widget.vb)
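
The widget builds the `--properties` value by joining `key=value` pairs with commas. As a hedged sketch, the helper below assembles the same string and rejects empty values or values containing a comma. It assumes the CLI splits `--properties` on commas, which the format implies but which is not confirmed here, and `build_properties_argument` is a hypothetical name.

```python
def build_properties_argument(short_description: str, version: str) -> str:
    """Return the --properties argument that marks a workspace as a data collection."""
    properties = {
        "terra-type": "data-collection",
        "terra-workspace-short-description": short_description.strip(),
        "terra-workspace-version": version.strip(),
    }
    for key, value in properties.items():
        if not value:
            raise ValueError(f"{key} must not be empty")
        # Assumption: the CLI splits --properties on commas, so a comma inside a
        # value would be read as the start of another key=value pair.
        if "," in value:
            raise ValueError(f"{key} must not contain a comma: {value!r}")
    pairs = ",".join(f"{key}={value}" for key, value in properties.items())
    return f"--properties={pairs}"
```

The returned string can be passed as one element of the argument list given to `subprocess.run`, in place of the inline f-string.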

## Convert an existing workspace to a data collection
<a id="convert-existing"></a>

You can also convert an existing workspace into a data collection. 
Run the cell below to create a widget that lists your workspaces which are eligible for conversion, then populate the widget's input fields and click the button to convert the workspace to a data collection.

Widget input parameters include:
- `Workspace ID`: Select from a dropdown list of all of your existing workspaces that are not data collections.
- `Short Description`: Must be a string. This description will be visible in the data catalog modal and should summarize the purpose and/or contents of your data collection.
- `Version`: Must be a string. This field is useful in particular if future releases of the data in this data collection are planned. It's suggested that the same versioning strategy be used for all releases of a data collection to make it transparent for users (e.g., if the first release is version `1.0`, the next release should be `2.0`, not `Version 2`).

The output should resemble:
```
Workspace properties successfully updated.
ID:                <STUDY_NAME>_<YYMMDD>_dc_ws
Name:              <STUDY_NAME> Data <YYMMDD>
Description:       <DESCRIPTION>
Cloud Platform:    GCP
Google project:    <GOOGLE_PROJECT_ID>
Cloud console:     https://console.cloud.google.com/home/dashboard?project=<GOOGLE_PROJECT_ID>
Properties:
  terra-workspace-short-description: <SHORT_DESCRIPTION>
  terra-workspace-version: <VERSION>
  terra-type: data-collection
Created:           YYYY-MM-DD
Last updated:      YYYY-MM-DD
# Resources:       0
```

In [None]:
class ConvertExistingWsToDataCollectionWidget(object):
    def __init__(self):
        self.label = widgets.Label(value = 'Please provide appropriate values in the input boxes.')
        self.workspace_ids = self.get_workspace_ids()
        self.input_workspace_id = wu.DropdownInputWidget(self.workspace_ids,self.workspace_ids[0],"Workspace ID:").get()
        self.input_short_description = wu.TextInputWidget("<SHORT_DESCRIPTION>","Short Description:").get()
        self.input_version = wu.TextInputWidget("<VERSION>","Version:").get()
        self.button = wu.StyledButton('Convert to data collection','Click to convert to data collection','check').get()
        self.button.on_click(self.convert_to_data_collection)
        self.output = widgets.Output()
        self.vb = widgets.VBox([
            self.label,
            self.input_workspace_id,
            self.input_short_description,
            self.input_version,
            self.button,
            self.output
        ], layout=wu.vbox_layout)
        
    def get_workspace_ids(self):
        result = subprocess.run(["terra","workspace","list","--format=JSON"],capture_output=True,text=True)
        ids_list = wu.list_workspace_ids(result.stdout)
        return ids_list

    def convert_to_data_collection(self,b):
        with self.output:
            terraCommand = f"""terra workspace set-property \\
            --workspace={self.input_workspace_id.value} \\
            --properties=\"terra-type=data-collection,terra-workspace-short-description={self.input_short_description.value},terra-workspace-version={self.input_version.value}\"
            """
            print("Running command to convert workspace to data collection...")
            print(terraCommand)
            print("Your data collection will be ready in less than one minute...")
            result = subprocess.run(
                ["terra","workspace","set-property",
                 f"--workspace={self.input_workspace_id.value}",
                 f"--properties=terra-type=data-collection,terra-workspace-short-description={self.input_short_description.value},terra-workspace-version={self.input_version.value}"],
                capture_output=True,text=True)
            print(result.stdout or result.stderr)

convert_existing_to_dc_widget = ConvertExistingWsToDataCollectionWidget()
display(convert_existing_to_dc_widget.vb)
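
The dropdown relies on `wu.list_workspace_ids`, whose implementation is not shown in this notebook. The sketch below illustrates the kind of filtering it would need to do, under a loudly stated assumption: that `terra workspace list --format=JSON` returns a JSON array of objects, each with an `id` string and an optional `properties` object. That schema is a guess for illustration only, so check it against real CLI output before relying on it.

```python
import json


def eligible_workspace_ids(list_json: str) -> list:
    """Return IDs of workspaces whose `terra-type` property is not `data-collection`.

    Assumption (hypothetical schema): each entry is an object with an "id"
    string and an optional "properties" object mapping names to values.
    """
    workspaces = json.loads(list_json) if list_json.strip() else []
    return [
        ws["id"]
        for ws in workspaces
        if (ws.get("properties") or {}).get("terra-type") != "data-collection"
    ]
```

An empty result is worth handling explicitly, since the widget above indexes the first element of the list to set the dropdown's initial value.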

## Add data collection to workspace
<a id="add-dc-to-ws"></a>

Follow the steps below to add your new data collection to a Verily Workbench workspace.<br>The video below provides a visual walkthrough of these steps.

1. In the [Verily Workbench workspace UI](https://terra.verily.com/workspaces), select a workspace that is NOT the data collection workspace.
1. Navigate to the Resources tab.
1. Click the "+ Data catalog" button.
1. Select your newly created data collection from those listed in the modal.
1. Navigate through the steps in the modal to complete the addition of the data collection to your workspace.

<video controls src="screencasts/add_data_collection_to_workspace.mp4" width=600>Add data collection to workspace</video>

## Provenance

Generate information about this notebook environment and the packages installed.

In [None]:
!date

Conda and pip installed packages:

In [None]:
!conda env export

JupyterLab extensions:

In [None]:
!jupyter labextension list

Number of cores:

In [None]:
!grep ^processor /proc/cpuinfo | wc -l

Memory:

In [None]:
!grep "^MemTotal:" /proc/meminfo
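
The core and memory figures above can also be gathered from Python, which is convenient if you want to store them alongside results. This is a minimal sketch using only the standard library; the function names and dictionary keys are illustrative, and reading `/proc/meminfo` works only on Linux.

```python
import os
import platform
import sys


def total_memory_kb(meminfo_text: str):
    """Return the MemTotal figure (in kB, as reported) from /proc/meminfo text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    return None


def collect_provenance() -> dict:
    """Gather basic facts about the interpreter and machine."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "cores": os.cpu_count(),
    }
    try:
        with open("/proc/meminfo") as handle:
            info["mem_total_kb"] = total_memory_kb(handle.read())
    except OSError:
        info["mem_total_kb"] = None  # /proc is Linux-specific
    return info
```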

---
Copyright 2022 Verily Life Sciences LLC

Use of this source code is governed by a BSD-style   
license that can be found in the LICENSE file or at   
https://developers.google.com/open-source/licenses/bsd