# How to use a cohort
This notebook shows how to use a cohort saved from Data Explorer.

It uses a cohort saved in the [Terra Notebooks Playground workspace](https://app.terra.bio/#workspaces/help-gatk/Terra%20Notebooks%20Playground/data).

# Setup

In [None]:
library(reticulate)
library(bigrquery)
library(ggplot2)

In [None]:
BILLING_PROJECT_ID <- Sys.getenv('GOOGLE_PROJECT')
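
`Sys.getenv()` returns an empty string when a variable is unset, so a missing `GOOGLE_PROJECT` would only surface later as a BigQuery error. A minimal sketch of a fail-fast check follows; the helper name `require_env` is hypothetical and not part of this notebook or of any library.

```r
# Hypothetical helper (an assumption, not part of the original notebook):
# return the value of an environment variable, or stop with a clear message
# if it is unset or empty.
require_env <- function(name) {
    value <- Sys.getenv(name, unset = '')
    if (!nzchar(value)) {
        stop(sprintf("Environment variable '%s' is not set", name), call. = FALSE)
    }
    value
}

# Possible usage inside Terra, where GOOGLE_PROJECT is expected to be set:
# BILLING_PROJECT_ID <- require_env('GOOGLE_PROJECT')
```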

# Get the cohort query

If you look at the Data tab of the [Terra Notebooks Playground](https://app.terra.bio/#workspaces/help-gatk/Terra%20Notebooks%20Playground/data) workspace, you can see that someone used the [Data Explorer web user interface](https://test-data-explorer.appspot.com/) to define and save a cohort containing a subset of the samples in the 1000 Genomes dataset.

In this notebook we use the [FireCloud API](https://api.firecloud.org/) to retrieve the definition of that cohort programmatically. The definition includes an auto-generated SQL query that captures the user's selections in Data Explorer.


In [None]:
fapi <- import('firecloud.api')

In [None]:
# Hard-code these instead of using WORKSPACE_NAMESPACE/WORKSPACE_NAME, since other
# workspaces won't have the 1000g_americans cohort.
ws_namespace <- 'help-gatk'
ws_name <- 'Terra Notebooks Playground'

cohort_query <- fapi$get_entity(ws_namespace, ws_name, 'cohort', '1000g_americans')$json()$attributes$query
print(cohort_query)
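
The chained call above assumes the request succeeded and that the entity has an `attributes$query` field. As a rough sketch, the extraction step could be wrapped in a check that stops with a readable message when the field is missing. The helper name `extract_cohort_query` and the toy entity below are hypothetical; the toy list only mimics the shape the notebook reads from and is not real API output.

```r
# Hypothetical helper (an assumption, not part of firecloud.api): pull the
# SQL query out of a parsed cohort entity, failing clearly if it is absent.
extract_cohort_query <- function(entity) {
    query <- entity$attributes$query
    if (is.null(query) || !is.character(query) ||
        length(query) != 1 || !nzchar(query)) {
        stop('Cohort entity has no usable attributes$query field', call. = FALSE)
    }
    query
}

# Toy entity with made-up values, shaped like the list the notebook indexes.
toy_entity <- list(
    name = '1000g_americans',
    entityType = 'cohort',
    attributes = list(query = 'SELECT 1')
)
toy_query <- extract_cohort_query(toy_entity)
print(toy_query)
```

In the notebook, the same helper could be applied to the result of `fapi$get_entity(...)$json()`.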

# Call BigQuery

We then run the cohort query in BigQuery, billed to the project in `BILLING_PROJECT_ID`, and download the result into a data frame.

In [None]:
cohort_table <- bigrquery::bq_project_query(
    BILLING_PROJECT_ID,
    cohort_query
)
cohort_df <- bigrquery::bq_table_download(cohort_table)
dim(cohort_df)

In [None]:
head(cohort_df)

# Join with another table

In [None]:
query <- '
SELECT
    DISTINCT participant_id,
    Gender
FROM
    `verily-public-data.human_genome_variants.1000_genomes_participant_info`
'
cohort_info_table <- bigrquery::bq_project_query(
    BILLING_PROJECT_ID,
    query
)
cohort_info_df <- bigrquery::bq_table_download(cohort_info_table)
dim(cohort_info_df)

In [None]:
# Left join: keep every cohort row and add Gender where participant_id matches.
merged_df <- merge(x = cohort_df, y = cohort_info_df, by = 'participant_id', all.x = TRUE)
dim(merged_df)
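
To make the effect of `all.x = TRUE` concrete, here is a small sketch on toy data frames. The participant IDs and genders are made up for illustration and do not come from the 1000 Genomes tables.

```r
# Toy data (made-up values): three cohort members, one of whom ('C') has no
# row in the info table, plus an info row ('D') that is not in the cohort.
toy_cohort <- data.frame(participant_id = c('A', 'B', 'C'))
toy_info <- data.frame(
    participant_id = c('A', 'B', 'D'),
    Gender = c('female', 'male', 'male')
)

# all.x = TRUE keeps every row of x; unmatched rows get NA in y's columns.
toy_merged <- merge(x = toy_cohort, y = toy_info, by = 'participant_id', all.x = TRUE)
print(toy_merged)

# Rows of y with no match in x ('D') are dropped.
dim(toy_merged)
```

With a left join, the row count of the result matches the cohort as long as `participant_id` is unique in the right-hand table, so comparing `dim(merged_df)` with `dim(cohort_df)` is a quick sanity check.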

# Plot

In [None]:
table(merged_df$Gender)
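
Because the merge is a left join, any cohort member without a match would have `NA` for `Gender`, and `table()` drops `NA` values by default. A minimal sketch on a made-up vector shows how `useNA = 'ifany'` makes such rows visible in the counts.

```r
# Made-up values for illustration; one entry is missing.
toy_gender <- c('female', 'male', NA, 'female')

# Default behaviour: the NA entry is silently excluded from the counts.
counts_default <- table(toy_gender)

# useNA = 'ifany' adds an NA category whenever missing values are present.
counts_with_na <- table(toy_gender, useNA = 'ifany')

print(counts_default)
print(counts_with_na)
```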

In [None]:
ggplot(merged_df, aes(Gender)) + geom_bar()

# Provenance

In [None]:
devtools::session_info()

Copyright 2019 The Broad Institute, Inc., Verily Life Sciences, LLC All rights reserved.

This software may be modified and distributed under the terms of the BSD license. See the LICENSE file for details.