# Working with data

<table align="left">

  <td>
    <a href="https://github.com/DataBiosphere/terra-axon-examples/blob/main/first_hour_on_vwb/working_with_data.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
</table>


## Overview

This notebook provides examples of working with referenced resource data. Build upon the best practices demonstrated in this notebook to visualize and analyze data in your own workspaces.


### Objective

Use this notebook to perform common workspace resource operations including:

- [Access data from referenced resources](#access-ref-data)
  - [Read data from BigQuery](#read-from-bq)
    - [Using cell magics](#cell-magics)
    - [Using pandas-gbq library](#using-pandas)
    - [Using google-cloud-bigquery](#using-gcb)

#### How to run this notebook

Run the [Notebook setup](#notebook-setup) section before running the cells of the other sections.

#### Costs

This notebook takes less than a minute to run, which will typically cost less than $0.01 of compute time on your cloud environment.


### Notebook setup <a id="notebook-setup"></a>

Run the cell below to import dependencies and utilities.


In [None]:
from google.cloud import bigquery
from IPython.display import display, HTML
import ipywidgets as widgets
import json
import pandas as pd
import pandas_gbq
import os
import subprocess
import widget_utils as wu

def get_bucket_url_from_reference(bucket_reference):
    """Resolve the bucket URL from a bucket reference in the workspace."""
    bucket_cmd_output = !terra resolve --name={bucket_reference}
    return bucket_cmd_output[0]


def get_current_workspace_id():
    """Resolve the current workspace ID from the workspace description."""
    workspace_cmd_output = !terra workspace describe --format=json | jq --raw-output ".id"
    return workspace_cmd_output[0]

CURRENT_WORKSPACE_ID = get_current_workspace_id()
print(f'Workspace ID: {CURRENT_WORKSPACE_ID}')
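
The helper functions above rely on IPython's `!` shell syntax, which only works inside a notebook cell. If you need the same lookup from a plain Python module, a minimal sketch using `subprocess` and `json` might look like the following. It assumes the `terra` CLI is on the `PATH` and that `terra workspace describe --format=json` returns a JSON object with an `id` field (as the `jq` filter above implies). The function names are illustrative and not part of this notebook's utilities.

```python
import json
import subprocess


def parse_workspace_id(describe_json: str) -> str:
    """Extract the `id` field from `terra workspace describe --format=json` output."""
    return json.loads(describe_json)["id"]


def get_current_workspace_id_subprocess() -> str:
    """Run the terra CLI (assumed to be on PATH) and return the workspace ID."""
    result = subprocess.run(
        ["terra", "workspace", "describe", "--format=json"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the CLI exits non-zero
    )
    return parse_workspace_id(result.stdout)
```

Parsing is kept in its own function so it can be checked against a saved JSON string without calling the CLI.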

### Workspace setup

<div class="alert alert-block alert-info">
<b>Note:</b> This notebook assumes that <a href="../../terra-axon-examples/workspace_setup.ipynb"><code>workspace_setup.ipynb</code></a> has been run.
</div>
    
`workspace_setup.ipynb` creates two Cloud Storage buckets for your workspace files with workspace reference names:

- ws_files
- ws_files_autodelete_after_two_weeks

The code in this notebook will write output files to the "autodelete" bucket by default.  
 Any file in this bucket will be automatically deleted <b>two weeks</b> after it is written.  
 This alleviates the need for you to remember to clean up temporary and example files manually.  
 If you want to write outputs to a durable location, simply change the assignment of the `BUCKET_REFERENCE` variable in the cell below and re-run the notebook.


In [None]:
# Change this to "ws_files" to use the durable workspace bucket instead of the autodelete bucket.
BUCKET_REFERENCE = "ws_files_autodelete_after_two_weeks"

In [None]:
MY_BUCKET = get_bucket_url_from_reference(BUCKET_REFERENCE)
print(f'Bucket URL: {MY_BUCKET}')
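
To write an output file to the resolved bucket, you can join the bucket URL with an object path and pass the result to pandas. The sketch below assumes that `MY_BUCKET` is a `gs://` URL and that the `gcsfs` package is installed, which pandas needs in order to write to Cloud Storage paths. The helper names and the `examples/` prefix are hypothetical.

```python
import posixpath


def build_output_path(bucket_url: str, *parts: str) -> str:
    """Join a bucket URL (assumed to look like gs://bucket-name) with path parts."""
    return posixpath.join(bucket_url.rstrip("/"), *parts)


def write_dataframe(df, bucket_url: str, filename: str) -> str:
    """Write a pandas DataFrame as CSV under an illustrative `examples/` prefix.

    Writing to a gs:// path with pandas assumes the gcsfs package is installed.
    """
    path = build_output_path(bucket_url, "examples", filename)
    df.to_csv(path, index=False)
    return path
```

In the notebook, a call such as `write_dataframe(some_df, MY_BUCKET, "results.csv")` would return the full object path that was written.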

## Working with data from referenced resources

<a id='access-ref-data'></a>

[Referenced resources](https://terra-docs.api.verily.com/docs/getting_started/web_ui/#referenced-vs-workspace-controlled-resources) represent data or other elements in Verily Workbench by pointing to a source that exists outside of the current workspace. To add a referenced resource to your workspace, use the <a href="../working_with_resources.ipynb">working_with_resources.ipynb</a> notebook.

### Reading data from BigQuery
<a id='read-from-bq'></a>

There are many ways to interact with your referenced resource data in a Verily Workbench cloud environment. This notebook provides examples of several options, each of which is appropriate for certain use cases.

### Using cell magics

<a id='cell-magics'></a>

For troubleshooting, demos and developing new queries, it's useful to leverage the [IPython magics for BigQuery](https://cloud.google.com/python/docs/reference/bigquery/latest/magics), several examples of which are provided below.

<div class="alert alert-block alert-success">
<b>Note:</b> 
    The examples in this sub-section utilize data from the 1000 Genomes data collection. Please add the BigQuery tables from that data collection to your current workspace before running the cells below. Instructions for adding a data collection to your workspace can be found <a href="https://support.workbench.verily.com/docs/how_to_guides/work_with_data/#add-a-data-collection">here</a>.</div>




#### Getting summary stats

<a id='inspect-bq-stats'></a>

To get summary statistics and visualizations for all the columns of a BigQuery table, you can use the `%bigquery_stats` cell magic. Run the cell below to view summary stats for all the table columns in the pedigree table of the 1000 Genomes dataset.

In [None]:
%bigquery_stats bigquery-public-data.human_genome_variants.1000_genomes_pedigree

#### Querying with cell magics

Run the cell below to count the distinct mothers, fathers, and individuals represented in the 1000 Genomes dataset.

In [None]:
%%bigquery

SELECT
    COUNT(DISTINCT Maternal_ID) as num_mothers,
    COUNT(DISTINCT Paternal_ID) as num_fathers,
    COUNT(DISTINCT Individual_ID) as num_individuals,
FROM `bigquery-public-data.human_genome_variants.1000_genomes_pedigree`;

Run the cell below to total the number of distinct families corresponding to each of the 25 populations represented in the 1000 Genomes dataset.


In [None]:
%%bigquery

SELECT
    Population,
    COUNT(DISTINCT(Family_ID)) as num_families,
FROM `bigquery-public-data.human_genome_variants.1000_genomes_pedigree`
GROUP BY
    Population;

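Run the cell below to select all individuals for whom the dataset contains both maternal and paternal IDs and count them by population and gender. The result is saved to a pandas DataFrame named `full_parent_data`, which you can then plot to show the distribution of these individuals by population.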


In [None]:
%%bigquery full_parent_data
SELECT
    Population,
    Gender,
    COUNT(DISTINCT Individual_ID) AS num_individuals
FROM 
    `bigquery-public-data.human_genome_variants.1000_genomes_pedigree`
WHERE
    Paternal_ID != '0' and Maternal_ID != '0'
GROUP BY
  Population, Gender
ORDER BY
  num_individuals DESC;
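
The `%%bigquery full_parent_data` cell above stores its result in a pandas DataFrame with one row per population and gender. One way to prepare that result for a stacked bar chart is to pivot it so that each population becomes a row and each gender a column. The sketch below shows the reshaping step on made-up rows that only mimic the column names of `full_parent_data`. The helper name is hypothetical and the values are not real query output.

```python
import pandas as pd


def pivot_by_population(df: pd.DataFrame) -> pd.DataFrame:
    """Reshape long (Population, Gender, num_individuals) rows into one row per population."""
    return df.pivot_table(
        index="Population",
        columns="Gender",
        values="num_individuals",
        aggfunc="sum",
        fill_value=0,  # populations missing a gender get a zero count
    )


# Made-up rows with the same columns as `full_parent_data`; not real query output.
sample = pd.DataFrame(
    {
        "Population": ["AAA", "AAA", "BBB"],
        "Gender": [1, 2, 1],
        "num_individuals": [3, 5, 4],
    }
)
wide = pivot_by_population(sample)
print(wide)
```

In the notebook, `pivot_by_population(full_parent_data).plot(kind='bar', stacked=True)` would draw the chart, assuming matplotlib is available in the environment.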

### Using pandas-gbq
<a id="using-pandas"></a>

The [`pandas-gbq` library](https://googleapis.dev/python/pandas-gbq/latest/index.html) provides a simple interface for running queries and uploading pandas dataframes to BigQuery. Because query results are returned as pandas dataframes, you can also use the pandas [utilities for creating plots](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) from your data.

Run the cell below to use the `pandas-gbq` library to read data from BigQuery and create a plot of the data.


In [None]:
query_df = pandas_gbq.read_gbq(
    """SELECT
        Population,
        -- PED convention: Gender 1 = male, 2 = female.
        COUNT(DISTINCT IF(Gender = 2, Individual_ID, NULL)) AS num_females,
        COUNT(DISTINCT IF(Gender = 1, Individual_ID, NULL)) AS num_males
    FROM 
        `bigquery-public-data.human_genome_variants.1000_genomes_pedigree`
    WHERE
        Paternal_ID != '0' and Maternal_ID != '0'
    GROUP BY
      Population
    ORDER BY
      Population DESC;""")

query_df.plot(
    kind='bar',
    title='Number of Individuals Per Population',
    stacked=True,
    x='Population'
)
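
When a query depends on a value chosen at run time, `pandas_gbq.read_gbq` accepts a `configuration` dictionary in the BigQuery job-configuration format, which can carry named query parameters instead of values pasted into the SQL string. The sketch below builds such a dictionary. The helper name is hypothetical, only `STRING` parameters are handled, and `"GBR"` is just an example population code.

```python
def named_string_params(**params: str) -> dict:
    """Build a pandas-gbq `configuration` dict with named STRING query parameters."""
    return {
        "query": {
            "parameterMode": "NAMED",
            "queryParameters": [
                {
                    "name": name,
                    "parameterType": {"type": "STRING"},
                    "parameterValue": {"value": value},
                }
                for name, value in params.items()
            ],
        }
    }


sql = """
SELECT COUNT(DISTINCT Individual_ID) AS num_individuals
FROM `bigquery-public-data.human_genome_variants.1000_genomes_pedigree`
WHERE Population = @population
"""
config = named_string_params(population="GBR")  # "GBR" is only an example value
# In the notebook: pandas_gbq.read_gbq(sql, configuration=config)
```

Passing values as parameters keeps the SQL text fixed and avoids quoting problems when the value comes from user input.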

### Using the BigQuery Python client

<a id="using-gcb"></a>

The cell below implements a widget using the [BigQuery Python client](https://github.com/googleapis/python-bigquery/tree/main). The purpose of this example is to demonstrate the possibilities beyond just running queries when using the Python BigQuery client in your Workbench cloud environment; you can also build dashboarding and tooling that leverages your data.

To use the widget:

1. Run the cell below to create the widget.
1. Select a table ID from the widget's dropdown of all the BigQuery tables available in your current workspace as resources.
1. Click the 'Resolve resource' button to view information about the table and produce an input field for a query.
1. Enter a query in the widget's text area.
1. Click the 'Query resource' button to run the query and output the result in the widget.

In [None]:
class ResolveDatasetWidget(object):
    def __init__(self):
        self.table = None
        self.client = bigquery.Client()
        self.label = widgets.Label(
            value="Please provide appropriate values in the input boxes.")
        self.resource_names = self.get_bq_resources()
        self.name_dropdown = wu.DropdownInputWidget(
            self.resource_names, self.resource_names[0], "Resource Name:").get()
        self.input_query_textarea = widgets.Textarea(
            value='',
            placeholder='<INSERT QUERY>',
            description='Query:',
            layout=wu.input_layout,
            style=wu.input_style
        )
        self.query = None
        self.query_result = None
        self.output = widgets.Output()
        self.resolve_button = wu.StyledButton(
            "Resolve resource", "Click to resolve a referenced resource", "check",).get()
        self.query_button = wu.StyledButton(
            "Query resource", "Click to query your referenced resource", "check",).get()
        self.resolve_button.on_click(self.resolve_resource)
        self.query_button.on_click(self.run_query)
        self.vb = widgets.VBox(
            children=[
                self.label,
                self.name_dropdown,
                self.resolve_button,
                self.output
            ],
            layout=wu.vbox_layout
        )

    def get_bq_resources(self):
        result = subprocess.run(
            ["terra", "resource", "list", "--format=JSON"], capture_output=True, text=True)
        ids_list = wu.list_bq_tables(result.stdout)
        return ids_list

    def run_query(self, b):
        with self.output:
            print(
                f"Accessing table '{self.table.project}.{self.table.dataset_id}.{self.table.table_id}'.")
            self.query = self.input_query_textarea.value
            print(f"Running query {self.query}")
            print("")
            self.query_result = pandas_gbq.read_gbq(self.query)
            print("Result...")
            print('')
            print(self.query_result)

    def resolve_resource(self, b):
        with self.output:
            commandList = ["terra", "resolve",
                           f"--name={self.name_dropdown.value}"]

            print('Running command:')
            print("terra resolve \\")
            print(f"  --name={self.name_dropdown.value}")
            print('')

            result = subprocess.run(
                commandList, capture_output=True, text=True)
            table_id = result.stdout.strip()

            # Fetch table metadata so the sample query can use the fully qualified table ID.
            self.table = self.client.get_table(self.name_dropdown.value)
            self.input_query_textarea.value = f"SELECT *  FROM `{self.table.project}.{self.table.dataset_id}.{self.table.table_id}` LIMIT 10"
            self.vb.children = [
                self.label,
                self.name_dropdown,
                self.input_query_textarea,
                self.query_button,
                self.output
            ]

# Instantiate widget
bq_dataset_widget = ResolveDatasetWidget()
display(bq_dataset_widget.vb)
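
The 'Resolve resource' step in the widget relies on the client's `get_table` method to fetch table metadata. Outside of a widget, the same call can be used to summarize a table before you query it. The sketch below is one possible helper. The name `summarize_table` is hypothetical, and it accepts any object whose `get_table()` result exposes the `project`, `dataset_id`, `table_id`, `num_rows`, and `schema` attributes that the BigQuery client's table objects provide.

```python
def summarize_table(client, table_id: str) -> dict:
    """Return basic metadata for a BigQuery table.

    `client` is expected to be a google.cloud.bigquery.Client (or any object
    whose get_table() returns something with the same attributes).
    """
    table = client.get_table(table_id)
    return {
        "full_id": f"{table.project}.{table.dataset_id}.{table.table_id}",
        "num_rows": table.num_rows,
        # Map each column name to its BigQuery type, e.g. STRING or INTEGER.
        "columns": {field.name: field.field_type for field in table.schema},
    }
```

In the notebook, `summarize_table(bigquery.Client(), "bigquery-public-data.human_genome_variants.1000_genomes_pedigree")` would return the row count and column types for the pedigree table.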

## Provenance

Generate information about this notebook environment and the packages installed.

In [None]:
!date

Conda and pip installed packages:


In [None]:
!conda env export

JupyterLab extensions:


In [None]:
!jupyter labextension list

Number of cores:


In [None]:
!grep ^processor /proc/cpuinfo | wc -l

Memory:


In [None]:
!grep "^MemTotal:" /proc/meminfo
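
The shell commands above print their results to the cell output. If you would rather store similar details alongside your analysis results, a minimal sketch using only the Python standard library could look like this. The function name and dictionary keys are illustrative, and the values cover only a subset of what the shell commands report.

```python
import datetime
import json
import os
import platform
import sys


def collect_provenance() -> dict:
    """Gather basic environment details using only the standard library."""
    return {
        "timestamp_utc": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "cpu_count": os.cpu_count(),  # number of logical CPUs, or None if unknown
    }


provenance = collect_provenance()
print(json.dumps(provenance, indent=2))
```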

---

Copyright 2022 Verily Life Sciences LLC

Use of this source code is governed by a BSD-style  
license that can be found in the LICENSE file or at  
https://developers.google.com/open-source/licenses/bsd
