# Creating a data collection

<table align="left">

  <td>
    <a href="https://github.com/DataBiosphere/terra-axon-examples/blob/main/first_hour_on_vwb/working_with_data_collections.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
</table>

## Overview

This notebook provides a walkthrough of creating and publishing versions of a <a href="https://support.workbench.verily.com/docs/how_to_guides/creating_data_collections/#what-is-a-data-collection">data collection</a> in Workbench. Each section in this notebook creates a widget that takes in the required inputs to complete a step in the <a href="https://support.workbench.verily.com/docs/how_to_guides/creating_data_collections/#how-to-create-a-data-collection-and-manage-its-versions">data collection creation process</a> which would otherwise be performed manually via the Workbench UI and CLI.

### About data collections
<p>A data collection is a group of cloud-based resources related to a specific project, study, or purpose. To interact with the data in a data collection, you must have access at the policy level, via group membership, and you must <a href="#add-dc-to-ws">add the data collection to your workspace</a>.</p>

### Objective

This notebook will guide you through creating a new data collection which you can then share with collaborators and use in your workspaces.

- [Create a new data collection](#create-new-dc)
    - [Create a new workspace](#create-new-ws)
    - [Convert the workspace into a data collection](#convert-to-dc)
    - [Publish an initial version](#publish-version)
    - [Add referenced and controlled resources to your new workspace](#add-resources)
- [Add a data collection to your workspace](#add-dc-to-ws)

#### How to run this notebook

Please run the [Setup](#setup) section before running any other section in this notebook.

#### Costs

This notebook takes less than a minute to run, which will typically cost less than $0.01 of compute time on your cloud environment.

### Setup
<a id="setup"></a>

Run the cell below to capture the ID of the current workspace. You'll use this value to return to the current workspace after you've created a new workspace as part of the process of creating a data collection.
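
The cell below uses IPython's `!` shell syntax and pipes the CLI output through `jq`, so it only works inside a notebook where `jq` is installed. As a rough sketch, the same lookup could be done with `subprocess` and `json`. The helper names `parse_workspace_id` and `describe_current_workspace` are illustrative and not part of this notebook's utilities; the only assumption about the CLI output is the top-level `id` field that the `jq` filter already reads.

```python
import json
import subprocess


def parse_workspace_id(describe_json: str) -> str:
    """Extract the top-level "id" field from `wb workspace describe` JSON output."""
    return json.loads(describe_json)["id"]


def describe_current_workspace() -> str:
    """Run the same CLI command as the cell below and return its raw JSON output."""
    result = subprocess.run(
        ["wb", "workspace", "describe", "--format=json"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Parse a stand-in payload (not real CLI output) to show the shape of the call.
print(parse_workspace_id('{"id": "example_ws", "name": "Example"}'))
```

In a Workbench environment, `parse_workspace_id(describe_current_workspace())` would return the same value as the `jq` pipeline.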

In [None]:
import json
import ipywidgets as widgets
import subprocess
import widget_utils as wu
import vwb_folder_utils as vfu
from datetime import date

def get_current_workspace_id():
    '''Resolves ID of current workspace.'''
    # The `!` syntax runs a shell command and returns its output lines as a list.
    cmd_output = !wb workspace describe --format=json | jq --raw-output ".id"
    return cmd_output[0]

CURRENT_WORKSPACE_ID = get_current_workspace_id()
print(f"Current workspace ID is {CURRENT_WORKSPACE_ID}")

## Create a new data collection
<a id="create-new-dc"></a>
Before creating your data collection, consider the following:
1. What data do you want to share? What type of resources--Cloud Storage buckets or objects, BigQuery tables or datasets--will be made available via this data collection?
1. With whom do you wish to share this data? Will you be sharing the data collection with all members of an existing Workbench group (e.g. for your organization or team), or will you need to create a new Workbench group in order to restrict access to the data collection?
1. Will you update the data collection by releasing future versions? What versioning scheme is most appropriate?

### Restricting discovery access to a data collection

<div class="alert alert-block alert-success">
<b>Note:</b> Unless a group membership <a href="https://support.workbench.verily.com/docs/technical_reference/workspaces/access_control_and_sharing/#limiting-workspace-access-with-a-group-policy">policy</a> is applied <b>at the creation time for the data collection's underlying workspace</b>, your data collection will be discoverable to all Workbench users.</div> 
<p>If a user has discovery access to a data collection, they are able to see its name and short description in the <a href="https://support.workbench.verily.com/docs/technical_reference/data_resources/#data-catalog-and-collections">data catalog</a>. In order to have read access to the resources in a data collection, users and/or groups must still be explicitly granted access to the data collection-backing workspace via the Workbench UI or CLI by the data collection owner.</p>
<h4>Adding a group membership policy</h4>
<p>When a <a href="https://support.workbench.verily.com/docs/how_to_guides/creating_data_collections/#group-policies">group membership policy</a> is added to a data collection-backing workspace at creation time, only members of that group will see the data collection in the data catalog. </p>
<p>To create a data collection with a group membership policy for discovery, <a href="https://support.workbench.verily.com/docs/how_to_guides/creating_data_collections/#add-a-data-collection-policy">create the data collection-backing workspace via the Workbench UI</a> (providing the group to which to restrict discovery access), skip the next section, and proceed directly to <a href="#convert-to-dc">Convert the workspace into a data collection</a>.</p>
<p><i>Don't have an existing Workbench group to use?</i> Run <a href="../working_with_groups.ipynb">../working_with_groups.ipynb</a> for tooling to help you create and manage Workbench groups and their members.</p>
<h4>Adding a region constraint policy</h4>
A <a href="https://support.workbench.verily.com/docs/how_to_guides/creating_data_collections/#region-constraint-policies">region constraint policy</a> restricts which regions may be used to create cloud resources and environments in workspaces to which your data collection is added. Before sharing your data collection's underlying workspace, reach out to <a href="mailto:workbench-support@verily.com">workbench-support@verily.com</a>, or your primary Verily Workbench contact, for support in setting the data collection’s region constraint policy.


### Create a new workspace for the data collection
<a id="create-new-ws"></a>

In order to create a data collection, you must first create a new workspace. 
Run the cell below to create a widget, then populate the widget's input fields and click the button to create your new workspace.

Widget input parameters include:
- `Workspace Name`: Must be a string. This value is displayed in the Data Collection modal once the workspace is converted to a data collection, so the value should communicate the intended purpose (e.g. `<STUDY_NAME> Data Collection`).<br> While the Workbench UI and this widget require a workspace name to be provided, the CLI does not; if no workspace name is provided to the CLI, a UUID is generated instead. 
- `Description`: Must be a string. This description will stay with the workspace after it becomes a data collection. Before you convert a workspace to a data collection, you can update this value in the UI.
- `Workspace ID`: Must be unique and consist only of lowercase letters, numbers and underscores. Provide a workspace ID that suggests something about the contents of the data collection you'd like to create and include the date of its creation, such as `<STUDY_NAME>_<YYMMDD>_dc_ws`. *You cannot change the workspace ID after workspace creation.* 
- `Version`: Represents the first version of the data collection. Each version name must be unique within a data collection (e.g., you cannot have two versions of a given data collection both named "1.0").


The output should resemble:

```
Workspace successfully created.
ID:                <WORKSPACE_ID>
Name:              <WORKSPACE_NAME>
Description:       <DESCRIPTION>
Cloud Platform:    GCP
Google project:    <GOOGLE_PROJECT_ID>
Cloud console:     https://console.cloud.google.com/home/dashboard?project=<GOOGLE_PROJECT_ID>
Properties:
  terra-type: workspace
Created:           YYYY-MM-DD
Last updated:      YYYY-MM-DD
# Resources:       0
Successfully converted <WORKSPACE_ID> to data collection.
Created initial version of data collection: <VERSION>
```
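The `Workspace ID` rules listed above (lowercase letters, numbers, and underscores, with a suggested `<STUDY_NAME>_<YYMMDD>_dc_ws` shape) can be checked before clicking the button. The following is a minimal sketch under those stated rules only; `is_valid_workspace_id` and `suggest_workspace_id` are hypothetical helpers, and uniqueness cannot be verified locally because it depends on the IDs that already exist in Workbench.

```python
import re
from datetime import date

# Rule taken from the parameter description above: lowercase letters, digits, underscores.
_WORKSPACE_ID_PATTERN = re.compile(r"[a-z0-9_]+")


def is_valid_workspace_id(workspace_id: str) -> bool:
    """Return True if the ID uses only the characters the description allows."""
    return _WORKSPACE_ID_PATTERN.fullmatch(workspace_id) is not None


def suggest_workspace_id(study_name: str, created: date) -> str:
    """Build an ID shaped like <STUDY_NAME>_<YYMMDD>_dc_ws."""
    # Lowercase the name and collapse every run of other characters into "_".
    slug = re.sub(r"[^a-z0-9]+", "_", study_name.lower()).strip("_")
    return f"{slug}_{created.strftime('%y%m%d')}_dc_ws"


print(suggest_workspace_id("My Study", date(2024, 1, 31)))  # my_study_240131_dc_ws
print(is_valid_workspace_id("My-Study"))                    # False
```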

In [None]:
class CreateWorkspaceWidget(object):
    def __init__(self):
        self.label = widgets.Label(value='Please provide appropriate values in the input boxes.')
        self.warning = wu.WarningWidget('The workspace name provided will be shown in the data catalog once the workspace is converted to a data collection.').get()
        self.input_name = wu.TextInputWidget("<WORKSPACE_NAME>","Workspace Name:").get()
        self.input_description = wu.TextInputWidget("<DESCRIPTION>","Description:").get()
        self.input_workspace_id = wu.TextInputWidget("<WORKSPACE_ID>","Workspace ID:").get()
        self.output_workspace_id = widgets.Text()
        self.output_workspace_id.value = self.input_workspace_id.value
        self.button = wu.StyledButton('Create workspace','Click to create a new workspace','plus').get()
        self.button.on_click(self.create_workspace)
        self.output = widgets.Output()
        self.vb = widgets.VBox(
            children = [self.label, self.warning,
                        self.input_name, self.input_description,
                        self.input_workspace_id,
                        self.button, self.output],
            layout = wu.vbox_layout)
        
    def get_workspace_id(self):
        return self.input_workspace_id.value.strip()

    def create_workspace(self,b):
        with self.output:
            createWorkspaceCommandList = [
                "wb", "workspace", "create",
                f"--id={self.input_workspace_id.value.strip()}",
                f"--description={self.input_description.value.strip()}",
                f"--name={self.input_name.value.strip()}",
            ]
            print('Running command to create workspace...')
            print('\n'.join(createWorkspaceCommandList))
            print('')
            print("Your workspace will be ready in less than one minute...")
            result = subprocess.run(createWorkspaceCommandList, capture_output = True, text = True, timeout=180, check=True)
            print(result.stderr) if not result.stdout else print(result.stdout)

create_ws_widget = CreateWorkspaceWidget()
display(create_ws_widget.vb)

### Convert new workspace to data collection
<a id="convert-to-dc"></a>

Now you'll convert your newly created workspace into a data collection which can be shared with others and added to other workspaces. 
Run the cell below to create a widget, then populate the widget's input fields and click the button to convert the workspace to a data collection. Please note that until you <a href="#publish-version">publish a version</a> in the next section, your data collection will not appear in the data catalog.

Widget input parameters include:
- `Workspace ID`: Automatically populated with the workspace ID of the workspace created in the previous step.
- `Short Description`: Must be a string. This description will be visible in the Add a Data Collection modal and should summarize the purpose and/or contents of your data collection.

The output should resemble:
```
Workspace properties successfully updated.
ID:                <STUDY_NAME>_<YYMMDD>_dc_ws
Name:              <STUDY_NAME>-Data-<YYMMDD>
Description:       <DESCRIPTION>
Cloud Platform:    GCP
Google project:    <GOOGLE_PROJECT_ID>
Cloud console:     https://console.cloud.google.com/home/dashboard?project=<GOOGLE_PROJECT_ID>
Properties:
  terra-workspace-short-description: <DESCRIPTION>
  terra-type: data-collection
Created:           YYYY-MM-DD
Last updated:      YYYY-MM-DD
# Resources:       <NUMBER_OF_RESOURCES>
Workspace properties successfully updated.
```
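The conversion passes both properties to `wb workspace set-property` as one comma-separated `--properties` value. A small sketch of assembling that argument list from a dictionary is shown here; `build_set_property_args` is a hypothetical helper, and rejecting commas inside values is a cautious assumption about how the CLI might split the list, not documented behavior.

```python
def build_set_property_args(workspace_id: str, properties: dict) -> list:
    """Return the argument list for `wb workspace set-property`."""
    for key, value in properties.items():
        # Assumption: a comma inside a value could be read as a property separator.
        if "," in value:
            raise ValueError(f"Value for {key!r} contains a comma: {value!r}")
    joined = ",".join(f"{key}={value}" for key, value in properties.items())
    return [
        "wb", "workspace", "set-property",
        f"--workspace={workspace_id}",
        f"--properties={joined}",
    ]


args = build_set_property_args(
    "mystudy_240131_dc_ws",
    {
        "terra-type": "data-collection",
        "terra-workspace-short-description": "Example study data",
    },
)
print(args)
```

Passing the list to `subprocess.run(args, capture_output=True, text=True)` avoids shell quoting issues when the short description contains spaces.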

In [None]:
class ConvertToDataCollectionWidget(object):
    def __init__(self,prev_widget):
        self.label = widgets.Label(value = 'Please provide appropriate values in the input boxes.')
        self.workspace_ids = self.get_workspace_ids()
        self.new_ws_id = prev_widget.get_workspace_id()
        self.input_workspace_id = wu.DropdownInputWidget([self.new_ws_id],self.new_ws_id,"Workspace ID:").get()
        self.input_short_description = wu.TextInputWidget("<SHORT_DESCRIPTION>","Short Description:").get()
        self.button = wu.StyledButton('Convert to data collection','Click to convert to data collection','check').get()
        self.button.on_click(self.convert_to_data_collection)
        self.output = widgets.Output()
        self.vb = widgets.VBox([
            self.label,
            self.input_workspace_id,
            self.input_short_description,
            self.button,
            self.output
        ], layout=wu.vbox_layout)
    
    def get_workspace_id(self):
        return self.input_workspace_id.value.strip()

    def get_workspace_ids(self):
        result = subprocess.run(["wb","workspace","list","--format=JSON"],capture_output=True,text=True)
        ids_list = wu.list_workspace_ids(result.stdout)
        # Insert empty string to display as value of dropdown until changed by user.
        ids_list.insert(0, " ")
        return ids_list
    
    def convert_to_data_collection(self,b):
        # Strip stray whitespace, as the other widgets in this notebook do.
        workspace_id = self.input_workspace_id.value.strip()
        short_desc = self.input_short_description.value.strip()
        with self.output:
            prettyConvertToDataCollectionCommand = f"""wb workspace set-property \\
            --workspace={workspace_id} \\
            --properties=\"terra-type=data-collection,terra-workspace-short-description={short_desc}\"
            """
            print("Running command to convert workspace to data collection...")
            print(prettyConvertToDataCollectionCommand)
            print("Your data collection will be ready in less than one minute...")
            result = subprocess.run(["wb","workspace","set-property",
                                     f"--workspace={workspace_id}",
                                     f"--properties=terra-type=data-collection,terra-workspace-short-description={short_desc}"],
                                    capture_output=True,text=True)
            print(result.stderr) if not result.stdout else print(result.stdout)

convert_to_dc_widget = ConvertToDataCollectionWidget(create_ws_widget)
display(convert_to_dc_widget.vb)

### Publish an initial version of your data collection
<a id="publish-version"></a>

Now you'll publish a version of your newly created data collection which can be shared with others and added to other workspaces. 
Run the cell below to create a widget, then populate the widget's input fields and click the button to create a version and publish it.

Widget input parameters include:
- `Workspace ID`: Automatically populated with the workspace ID of the workspace created in the previous step.
- `Version`: Must be a string. This field is useful in particular if future releases of the data in this data collection are planned. It's suggested that the same versioning strategy be used for all releases of a data collection to make it transparent for users (e.g., if the first release is version `1.0`, the next release should be `2.0`, not `Version 2`).
- `Description`: Must be a string. This field should describe this particular version of your data collection and is displayed in the data catalog.

The output should resemble:
```
Workspace properties successfully updated.
ID:                <STUDY_NAME>_<YYMMDD>_dc_ws
Name:              <STUDY_NAME>-Data-<YYMMDD>
Description:       <DESCRIPTION>
Cloud Platform:    GCP
Google project:    <GOOGLE_PROJECT_ID>
Cloud console:     https://console.cloud.google.com/home/dashboard?project=<GOOGLE_PROJECT_ID>
Properties:
  terra-workspace-short-description: descriptive content
  terra-workspace-version: 1.0
  terra-type: data-collection
Created:           YYYY-MM-DD
Last updated:      YYYY-MM-DD
# Resources:       <NUMBER_OF_RESOURCES>
Workspace properties successfully updated.
ID:                <STUDY_NAME>_<YYMMDD>_dc_ws
Name:              <STUDY_NAME> Data <YYMMDD>
Description:       <DESCRIPTION>
Cloud Platform:    GCP
Google project:    <GOOGLE_PROJECT_ID>
Cloud console:     https://console.cloud.google.com/home/dashboard?project=<GOOGLE_PROJECT_ID>
Properties:
  terra-workspace-short-description: <SHORT_DESCRIPTION>
  terra-workspace-version: <VERSION>
  terra-type: data-collection
Created:           YYYY-MM-DD
Last updated:      YYYY-MM-DD
# Resources:       0
```
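The `Version` guidance above asks for unique names and one consistent scheme across releases. One rough way to check a candidate name against versions you have already published is sketched here; `version_shape` and `check_new_version` are hypothetical helpers, and the list of existing versions is supplied by hand rather than read from Workbench.

```python
import re


def version_shape(version: str) -> str:
    """Replace each run of digits with "N" so "1.0" and "2.0" share the shape "N.N"."""
    return re.sub(r"\d+", "N", version.strip())


def check_new_version(candidate: str, existing: list) -> list:
    """Return a list of readable problems; an empty list means none were found."""
    problems = []
    if candidate in existing:
        problems.append(f"Version {candidate!r} already exists.")
    shapes = {version_shape(v) for v in existing}
    if shapes and version_shape(candidate) not in shapes:
        problems.append(
            f"Version {candidate!r} does not follow the scheme of {sorted(existing)}."
        )
    return problems


print(check_new_version("2.0", ["1.0"]))        # []
print(check_new_version("Version 2", ["1.0"]))  # one scheme warning
```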

In [None]:
class PublishVersionWidget(object):
    def __init__(self, prev_widget):
        self.label = widgets.Label(value = 'Please provide appropriate values in the input boxes.')
        self.data_collection_ids = self.get_data_collections()
        self.new_ws_id = prev_widget.get_workspace_id()
        self.input_workspace_id = wu.DropdownInputWidget([self.new_ws_id],self.new_ws_id,"Workspace ID:").get()
        self.input_version = wu.TextInputWidget("<VERSION>","Version:").get()
        self.input_description = wu.TextInputWidget("<DESCRIPTION>","Description:").get()        
        self.button = wu.StyledButton('Publish version','Click to publish a new version','check').get()
        self.button.on_click(self.publish_version)
        self.output = widgets.Output()
        self.vb = widgets.VBox([
            self.label,
            self.input_workspace_id,
            self.input_version,
            self.input_description,
            self.button,
            self.output
        ], layout=wu.vbox_layout)

    def get_data_collections(self):
        result = subprocess.run(["wb","workspace","list","--format=JSON"],capture_output=True,text=True)
        ids_list = wu.list_data_collections(result.stdout)
        # Insert empty string to display as value of dropdown until changed by user.
        ids_list.insert(0,'')
        return ids_list

    def get_workspace_id(self):
        return self.input_workspace_id.value.strip()
    
    def publish_version(self,b):
        with self.output:
            # Save values as variables for reuse.
            workspace_id = self.input_workspace_id.value.strip()
            version = self.input_version.value.strip()
            description = self.input_description.value.strip()
            
            # 1) Point cloud env to target data-collection workspace.
            setWorkspaceCommand = f"wb workspace set --id={workspace_id}"
            setWorkspaceResult = subprocess.run(setWorkspaceCommand, shell = True, capture_output = True, text = True)
            print(setWorkspaceResult.stderr) if not setWorkspaceResult.stdout else print(setWorkspaceResult.stdout)

            # 2) Create version folder.
            # Pass arguments as a list so quotes in the description cannot break the command.
            createVersionCommand = ["wb", "folder", "create", f"--name={version}", f"--workspace={workspace_id}"]
            # The description is a stripped string (never None), so test for emptiness.
            if description:
                createVersionCommand.append(f"--description={description}")
            createVersionResult = subprocess.run(createVersionCommand, capture_output = True, text = True)
            print(createVersionResult.stderr) if not createVersionResult.stdout else print(createVersionResult.stdout)
            
            # 3) Get file tree from Workbench CLI.
            folderTreeResult = subprocess.run(["wb", "folder", "tree", "--format=JSON"], capture_output = True, text = True)

            # 4) Search tree for ID of desired version folder.
            folder_id = vfu.get_folder_id(version,json.loads(folderTreeResult.stdout))

            # 5) Publish desired folder as a version with today's date.
            today = date.today()
            formatted_date = today.strftime("%Y-%m-%d")
            publishVersionCommand = f"wb folder set-property --properties=terra-published-date={formatted_date} --id={folder_id}"
            publishVersionResult = subprocess.run(publishVersionCommand, shell = True, capture_output = True, text = True, check = True)
            print(publishVersionResult.stderr) if not publishVersionResult.stdout else print(publishVersionResult.stdout)

publish_version_widget = PublishVersionWidget(convert_to_dc_widget)
display(publish_version_widget.vb)

### Add resources to a data collection version
<a id="add-resources"></a>

Resources in data collections live in "versions". To make a resource available in a published data collection version, it must be within the top-level workspace folder corresponding to that version. Controlled and/or referenced resources can be added to the top-level version folder in the workspace corresponding to the data collection created in the previous step in the following ways.
* Navigate to the Workbench Resources tab for the workspace corresponding to your data collection. Move desired resources to the top-level folder named for your version. 
* Run the following Workbench CLI commands to get the version folder's ID, then move a resource to the version folder:

        # Get ID for version folder
        wb folder tree
        # Move desired resource to version folder
        wb resource move --folder-id=<FOLDER_ID> --name=<RESOURCE_NAME> --workspace=<WORKSPACE_ID>
* Use the widgets provided in [../workspace_resource_examples.ipynb](../workspace_resource_examples.ipynb) to create controlled resources and add referenced resources to the data collection's backing workspace, then move them to the version folder.
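
When several resources need to go into the same version folder, the `wb resource move` command shown above can be generated once per resource. This is a minimal sketch that only builds and prints the commands; `build_move_commands` is a hypothetical helper, and the resource names are placeholders.

```python
import shlex


def build_move_commands(folder_id: str, workspace_id: str, resource_names: list) -> list:
    """Return one `wb resource move` argument list per resource name."""
    return [
        [
            "wb", "resource", "move",
            f"--folder-id={folder_id}",
            f"--name={name}",
            f"--workspace={workspace_id}",
        ]
        for name in resource_names
    ]


# Placeholder values; substitute the folder ID reported by `wb folder tree`.
commands = build_move_commands("<FOLDER_ID>", "<WORKSPACE_ID>", ["my_bucket", "my_dataset"])
for command in commands:
    # Printed for review; pass `command` to subprocess.run(...) to execute it.
    print(shlex.join(command))
```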

### Add release notes to a data collection version
<a id="add-release-notes"></a>

You may wish to add "release notes" for each version of your data collection. The release note will appear in the data catalog modal and in the data collections table within a workspace. Workbench supports adding notes that exist as text files at URLs (e.g. files stored on GitHub, in a GCS bucket, et cetera). To add release notes, run the cell below to create a widget, populate the widget input fields with appropriate values and click the button.

Widget input parameters include:
- `Workspace ID`: Automatically populated with the workspace ID of the workspace created in the previous step.
- `Version`: Must be a string. Must match an existing published version of your data collection, which corresponds to a top-level folder in your data collection's underlying workspace.
- `Notes URL`: Must be a string. URL should be formatted like `https://www.<DOMAIN>.com`. Notes at this URL will be shown in the data catalog modal and in workspaces that add this data collection version.

You should see output like:
```
Workspace successfully loaded.
ID:                <WORKSPACE_ID>
Name:              <WORKSPACE_NAME>
Description:       <DESCRIPTION>
Cloud Platform:    GCP
Google project:    <GOOGLE_PROJECT_ID>
Cloud console:     https://console.cloud.google.com/home/dashboard?project=<GOOGLE_PROJECT_ID>
Properties:
  terra-workspace-short-description: <SHORT_DESCRIPTION>
  terra-type: data-collection
Created:           YYYY-MM-DD
Last updated:      YYYY-MM-DD
# Resources:       N
Folder properties successfully updated.
ID:          <FOLDER_ID>
Name:        <VERSION>
Description: <DESCRIPTION>
Parent ID:   null
Properties:
  terra-release-notes-url: <RELEASE_NOTES_URL>
  terra-published-date: <YYYY-MM-DD>
```
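Because the `Notes URL` value is stored as a plain string, a malformed URL would only surface later in the data catalog. A quick local check using the standard library is sketched here; `is_plausible_notes_url` is a hypothetical helper, and limiting the scheme to `http` and `https` is an assumption based on the example format above, not a documented Workbench rule.

```python
from urllib.parse import urlparse


def is_plausible_notes_url(url: str) -> bool:
    """Accept only absolute http(s) URLs that include a host name."""
    parsed = urlparse(url.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


print(is_plausible_notes_url("https://www.example.com/notes/v1.txt"))  # True
print(is_plausible_notes_url("www.example.com/notes/v1.txt"))          # False
```

The check confirms only the shape of the URL; it does not verify that the file exists or is readable by your collaborators.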

In [None]:
class AddReleaseNotesWidget(object):
    def __init__(self, prev_widget):
        self.label = widgets.Label(value = 'Please provide appropriate values in the input boxes.')
        self.data_collection_ids = self.get_data_collections()
        self.new_ws_id = prev_widget.get_workspace_id()
        self.input_workspace_id = wu.DropdownInputWidget([self.new_ws_id],self.new_ws_id,"Workspace ID:").get()
        self.input_version = wu.TextInputWidget("<VERSION>","Version:").get()
        self.input_notes_url = wu.TextInputWidget("<NOTES_URL>","Notes URL:").get()        
        self.button = wu.StyledButton('Attach notes','Click to attach release notes','check').get()
        self.output = widgets.Output()
        self.initial_fields = [
            self.label,
            self.input_workspace_id,
            self.input_version,
            self.input_notes_url,
            self.button,
            self.output
        ]
        self.vb = widgets.VBox(children=self.initial_fields, layout=wu.vbox_layout)
        self.button.on_click(self.add_release_notes)

    def get_data_collections(self):
        result = subprocess.run(["wb","workspace","list","--format=JSON"],capture_output=True,text=True)
        ids_list = wu.list_data_collections(result.stdout)
        # Insert empty string to display as value of dropdown until changed by user.
        ids_list.insert(0,'')
        return ids_list

    def add_release_notes(self,b):
        with self.output:
            # Save values as variables for reuse.
            workspace_id = self.input_workspace_id.value.strip()
            version = self.input_version.value.strip()
            notes_url = self.input_notes_url.value.strip()
            
            # 1) Point cloud env to target data-collection workspace.
            setWorkspaceCommand = f"wb workspace set --id={workspace_id}"
            setWorkspaceResult = subprocess.run(setWorkspaceCommand, shell = True, capture_output = True, text = True)
            print(setWorkspaceResult.stderr) if not setWorkspaceResult.stdout else print(setWorkspaceResult.stdout)

            # 2) Get file tree from Workbench CLI.
            folderTreeResult = subprocess.run(["wb", "folder", "tree", "--format=JSON"], capture_output = True, text = True)

            # 3) Search tree for ID of desired version folder.
            folder_id = vfu.get_folder_id(version,json.loads(folderTreeResult.stdout))

            # 4) Run command to add release notes URL to version.
            attachNotesCommand = ["wb", "folder", "set-property", f"--properties=terra-release-notes-url={notes_url}", f"--id={folder_id}", f"--workspace={workspace_id}"]
            attachNotesResult = subprocess.run(attachNotesCommand, capture_output = True, text = True, check = True)
            print(attachNotesResult.stderr) if not attachNotesResult.stdout else print(attachNotesResult.stdout)
            
add_release_notes_widget = AddReleaseNotesWidget(publish_version_widget)
display(add_release_notes_widget.vb)

## Add a data collection to your workspace
<a id="add-dc-to-ws"></a>

Follow the steps below to add your new data collection to a Workbench workspace.<br>The video below provides a visual walkthrough of these steps.

1. In the [Workbench workspace UI](https://workbench.verily.com/workspaces), select a workspace that is NOT the data collection workspace.
1. Navigate to the Resources tab.
1. Click the "+ Data catalog" button.
1. Select your newly created data collection from those listed in the modal.
1. Navigate through the steps in the modal to complete the addition of the data collection to your workspace.

<video controls src="screencasts/add_data_collection_to_workspace.mp4" width=600>Add data collection to workspace</video>

## Provenance

Generate information about this notebook environment and the packages installed.
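
The cells below rely on shell commands and Linux-specific files such as `/proc/cpuinfo`. As a portable complement, a few of the same facts can be gathered with the Python standard library. This is only a sketch: `collect_provenance` is a hypothetical helper, and it does not cover installed packages, JupyterLab extensions, or memory.

```python
import os
import platform
import sys
from datetime import datetime


def collect_provenance() -> dict:
    """Gather a few environment facts using only the standard library."""
    return {
        "timestamp": datetime.now().isoformat(timespec="seconds"),
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "cpu_count": os.cpu_count(),
    }


for key, value in collect_provenance().items():
    print(f"{key}: {value}")
```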

In [None]:
!date

Conda and pip installed packages:

In [None]:
!conda env export

JupyterLab extensions:

In [None]:
!jupyter labextension list

Number of cores:

In [None]:
!grep ^processor /proc/cpuinfo | wc -l

Memory:

In [None]:
!grep "^MemTotal:" /proc/meminfo

---
Copyright 2024 Verily Life Sciences LLC

Use of this source code is governed by a BSD-style   
license that can be found in the LICENSE file or at   
https://developers.google.com/open-source/licenses/bsd