# How to use a cohort

This notebook shows how to use a cohort saved from Data Explorer.

It uses a cohort saved in the [Terra Notebooks Playground workspace](https://app.terra.bio/#workspaces/help-gatk/Terra%20Notebooks%20Playground/data).

## Setup

First, run the **`Python environment setup`** notebook in this workspace.

Then, in this section, we:

1. load the required Python packages
2. set the ID of the cloud project that is billed for BigQuery queries

In [1]:
import os

import pandas as pd
import firecloud.api as fapi



In [2]:
BILLING_PROJECT_ID = os.environ['GOOGLE_PROJECT']
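
If `GOOGLE_PROJECT` is not set (for example, when the notebook runs outside Terra), the lookup above fails with a bare `KeyError`. Below is a minimal sketch of a lookup that fails with a more descriptive message. The helper name `get_billing_project` is hypothetical and not part of the original notebook.

```python
import os


def get_billing_project(environ=os.environ, var_name='GOOGLE_PROJECT'):
    """Return the billing project ID, or raise a descriptive error.

    Hypothetical helper for illustration; `environ` is injectable so the
    function can be exercised without touching the real environment.
    """
    project_id = environ.get(var_name, '').strip()
    if not project_id:
        raise RuntimeError(
            f'Environment variable {var_name} is not set; '
            'cannot determine which project to bill for BigQuery queries.')
    return project_id
```

In the notebook this could be used as `BILLING_PROJECT_ID = get_billing_project()`.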

## Retrieve cohort SQL query

In [3]:
# Hard-code the workspace instead of using WORKSPACE_NAMESPACE/WORKSPACE_NAME, since
# other workspaces won't have the 1000g_americans cohort.
ws_namespace = 'help-gatk'
ws_name = 'Terra Notebooks Playground'
cohort_query = fapi.get_entity(ws_namespace, ws_name, 'cohort', '1000g_americans').json()['attributes']['query']
cohort_query

'SELECT DISTINCT t1.participant_id FROM (SELECT participant_id FROM `verily-public-data.human_genome_variants.1000_genomes_participant_info` WHERE  ((Super_Population_Description = "American"))) t1'
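
The chained lookup `['attributes']['query']` raises an unhelpful `KeyError` if the entity has no saved query. The sketch below pulls the query out of an already-decoded entity dictionary and reports a clearer error. The helper name `extract_cohort_query` is hypothetical, and the `name` and `entityType` keys in the example payload are assumptions about the response shape; only `attributes` and `query` appear in the cell above.

```python
def extract_cohort_query(entity):
    """Return the SQL query stored on a decoded cohort entity dict."""
    attributes = entity.get('attributes', {})
    query = attributes.get('query')
    if not query:
        raise ValueError(
            f"Entity {entity.get('name')!r} has no 'query' attribute")
    return query


# Stand-in for fapi.get_entity(...).json(); the exact payload shape is assumed.
example_entity = {
    'name': '1000g_americans',
    'entityType': 'cohort',
    'attributes': {'query': 'SELECT 1 AS participant_id'},
}
example_query = extract_cohort_query(example_entity)
```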

## Create pandas dataframe of cohort participant ids

In [4]:
participant_ids = pd.read_gbq(
    cohort_query,
    project_id=BILLING_PROJECT_ID,  # bill the query to the project chosen in Setup
    dialect='standard')
participant_ids.head()

Downloading: 100%|██████████| 535/535 [00:00<00:00, 4907.78rows/s]


| | participant_id |
|---|---|
| 0 | HG01433 |
| 1 | HG01445 |
| 2 | HG01452 |
| 3 | HG01473 |
| 4 | HG01482 |


## See what tables are available to join against

In [5]:
bq_table_entities = fapi.get_entities(ws_namespace, ws_name, 'BigQuery_table').json()
bq_tables = [e['attributes']['table_name'] for e in bq_table_entities]
bq_tables

['verily-public-data.human_genome_variants.1000_genomes_participant_info',
 'verily-public-data.human_genome_variants.1000_genomes_sample_info']
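
Because the cohort is itself a SQL query, one option is to let BigQuery do the join by wrapping the cohort query as a subquery, so only matching rows are downloaded. The sketch below only builds the SQL string. The helper name `build_cohort_join_sql` is hypothetical, it assumes the target table has a `participant_id` column, and the generated query has not been run against BigQuery here.

```python
def build_cohort_join_sql(cohort_query, table_name, key='participant_id'):
    """Build standard SQL that keeps only rows of `table_name` in the cohort.

    Hypothetical helper: assumes both the table and the cohort query
    expose a column named `key`.
    """
    return (
        f'SELECT t.* FROM `{table_name}` AS t '
        f'INNER JOIN ({cohort_query}) AS c '
        f'ON t.{key} = c.{key}'
    )


example_sql = build_cohort_join_sql('SELECT 1 AS participant_id', 'p.d.t')
```

The resulting string could then be passed to `pd.read_gbq` with `dialect='standard'`, in the same way as the cohort query.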

## Join cohort participant ids against sample_info table

In [6]:
sample_info = pd.read_gbq("SELECT * FROM `verily-public-data.human_genome_variants.1000_genomes_sample_info`",
                          project_id=BILLING_PROJECT_ID,  # bill the query to the project chosen in Setup
                          dialect="standard")
print("sample_info has %d rows" % len(sample_info.index))

# Match rows on participant_id. DataFrame.join without `on` aligns on the row index,
# which would pair unrelated participants by position.
sample_info_americans = participant_ids.merge(sample_info, on='participant_id', how='inner')
print("sample_info_americans has %d rows\n" % len(sample_info_americans.index))

sample_info_americans.head()

Downloading: 100%|██████████| 3500/3500 [00:04<00:00, 796.70rows/s]

sample_info has 3500 rows
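
To illustrate the difference between aligning rows by index and matching rows on a key column, here is a small sketch with made-up data. The IDs `P1`..`P3` and `S1`..`S3` are placeholders, not 1000 Genomes identifiers.

```python
import pandas as pd

# Toy stand-ins for the cohort and the sample table.
cohort = pd.DataFrame({'participant_id': ['P2', 'P3']})
samples = pd.DataFrame({
    'participant_id': ['P1', 'P2', 'P3'],
    'sample_id': ['S1', 'S2', 'S3'],
})

# Index-based join: row 0 is paired with row 0 and row 1 with row 1,
# regardless of whether the participant IDs agree.
by_position = cohort.join(samples, lsuffix='_L', rsuffix='_R')

# Key-based merge: rows are paired only when participant_id matches.
# validate= raises MergeError if a cohort ID appears more than once.
by_key = cohort.merge(samples, on='participant_id', how='inner',
                      validate='one_to_many')
```

In `by_position`, cohort participant `P2` ends up next to the sample row for `P1`; in `by_key`, each participant is paired with its own sample.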


## Provenance

In [7]:
import datetime
print(datetime.datetime.now())
!pip3 freeze

2020-01-13 21:53:52.072635
google-api-core==1.15.0
google-auth-oauthlib==0.4.1
google-cloud-bigquery==1.23.1
google-cloud-core==1.1.0
ibis-framework==1.2.0
multipledispatch==0.6.0
oauthlib==3.1.0
pandas==0.25.3
pandas-gbq==0.13.0
pydata-google-auth==0.2.1
regex==2020.1.8
requests-oauthlib==1.3.0
six==1.13.0
toolz==0.10.0
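
When only a few packages matter for reproducing a result, their versions can also be recorded programmatically. This is a minimal sketch that assumes Python 3.8 or later (for `importlib.metadata`); the helper name `package_versions` is hypothetical.

```python
from importlib import metadata  # standard library in Python 3.8+


def package_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions
```

For this notebook, a call such as `package_versions(['pandas', 'pandas-gbq', 'google-cloud-bigquery'])` would capture the packages used by the queries above.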


Copyright 2019 The Broad Institute, Inc., Verily Life Sciences, LLC All rights reserved.

This software may be modified and distributed under the terms of the BSD license. See the LICENSE file for details.