# How to read data from BigQuery

This notebook demonstrates two ways to read data from BigQuery with Python:
1. Using SQL via [pandas-gbq](https://pandas-gbq.readthedocs.io/en/latest/)
2. Using only Python code, via [Ibis](https://docs.ibis-project.org/), to extract the data of interest from BigQuery

## Setup

In [1]:
!pip3 install ibis-framework

Collecting ibis-framework
  Using cached ibis_framework-1.3.0-py3-none-any.whl (601 kB)
Collecting pandas>=0.21
  Using cached pandas-1.0.5-cp37-cp37m-manylinux1_x86_64.whl (10.1 MB)
Collecting numpy>=1.11
  Using cached numpy-1.19.1-cp37-cp37m-manylinux2010_x86_64.whl (14.5 MB)
Collecting regex
  Using cached regex-2020.7.14-cp37-cp37m-manylinux2010_x86_64.whl (660 kB)
Collecting multipledispatch>=0.6.0
  Using cached multipledispatch-0.6.0-py3-none-any.whl (11 kB)
Collecting pytz
  Using cached pytz-2020.1-py2.py3-none-any.whl (510 kB)
Processing /home/jupyter-user/.cache/pip/wheels/e2/83/7c/248063997a4f9ff6bf145822e620e8c37117a6b4c765584077/toolz-0.10.0-py3-none-any.whl
Collecting python-dateutil>=2.6.1
  Using cached python_dateutil-2.8.1-py2.py3-none-any.whl (227 kB)
Collecting six
  Using cached six-1.15.0-py2.py3-none-any.whl (10 kB)
ERROR: pyopenssl 19.1.0 has requirement cryptography>=2.8, but you'll have cryptography 2.1.4 which is incompatible.
Installing collected 

In [2]:
import pandas as pd
import pandas_gbq
import ibis
import os

In [3]:
BILLING_PROJECT_ID = os.environ['GOOGLE_PROJECT']
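
If `GOOGLE_PROJECT` is not set, the lookup above fails with a bare `KeyError`. The sketch below shows one way to produce a more readable error. The helper name `get_billing_project` is hypothetical and is not part of this notebook or of any Google library.

```python
import os


def get_billing_project(env=None, var="GOOGLE_PROJECT"):
    """Return the billing project ID, or raise a readable error if it is unset."""
    # Fall back to the real environment when no mapping is passed in.
    env = os.environ if env is None else env
    project = env.get(var, "").strip()
    if not project:
        raise RuntimeError(
            f"Environment variable {var} is not set; "
            "set it to the Google Cloud project that should be billed."
        )
    return project


# Example with an explicit mapping instead of the real environment.
print(get_billing_project({"GOOGLE_PROJECT": "my-billing-project"}))
```

In the notebook itself, `BILLING_PROJECT_ID = get_billing_project()` would read from the real environment.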

## Option 1: Retrieve filtered data from BigQuery using SQL.

The following SQL reads a subset of columns and a subset of rows from a BigQuery table into a Pandas DataFrame. `pd.read_gbq` delegates to the pandas-gbq package, so pandas-gbq must be installed.
* [Pandas](http://pandas.pydata.org/pandas-docs/stable/) is a popular Python package for data manipulation.
* To learn more about SQL syntax, see the [BigQuery standard SQL reference](https://cloud.google.com/bigquery/docs/reference/standard-sql/).

In [4]:
sample_info = pd.read_gbq("""
SELECT
  Sample,
  Gender,
  Relationship,
  Population,
  Population_Description,
  Super_Population,
  Super_Population_Description,
  Total_Exome_Sequence,
  Main_Project_E_Platform,
  Main_Project_E_Centers
FROM
  `bigquery-public-data.human_genome_variants.1000_genomes_sample_info`
WHERE
  -- Only include information for samples in phase 1.
  In_Phase1_Integrated_Variant_Set = TRUE
""",
    project_id=BILLING_PROJECT_ID,
    dialect='standard')

Downloading: 100%|██████████| 1092/1092 [00:00<00:00, 5607.88rows/s]
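
Both options in this notebook select the same ten columns, so it may help to define the column list once and generate the SQL from it. The following is a minimal sketch under that assumption. `build_select`, `COLUMNS`, and `TABLE` are hypothetical names introduced here, and the helper is only meant for trusted, hard-coded identifiers.

```python
import re

COLUMNS = [
    "Sample", "Gender", "Relationship", "Population",
    "Population_Description", "Super_Population",
    "Super_Population_Description", "Total_Exome_Sequence",
    "Main_Project_E_Platform", "Main_Project_E_Centers",
]
TABLE = "bigquery-public-data.human_genome_variants.1000_genomes_sample_info"

# Accept only plain identifiers so that nothing unexpected ends up in the SQL.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_select(columns, table, where=None):
    """Build a standard-SQL SELECT statement from trusted identifiers."""
    for name in columns:
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"Unexpected column name: {name!r}")
    if "`" in table:
        raise ValueError("Table name must not contain backticks.")
    column_list = ",\n  ".join(columns)
    sql = f"SELECT\n  {column_list}\nFROM\n  `{table}`"
    if where:
        sql += f"\nWHERE\n  {where}"
    return sql


query = build_select(COLUMNS, TABLE, "In_Phase1_Integrated_Variant_Set = TRUE")
print(query)
```

The resulting string can be passed to `pd.read_gbq` with the same `project_id` and `dialect` arguments used in the cell above.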


In [5]:
sample_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1092 entries, 0 to 1091
Data columns (total 10 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Sample                        1092 non-null   object 
 1   Gender                        1092 non-null   object 
 2   Relationship                  1092 non-null   object 
 3   Population                    1092 non-null   object 
 4   Population_Description        1092 non-null   object 
 5   Super_Population              1092 non-null   object 
 6   Super_Population_Description  1092 non-null   object 
 7   Total_Exome_Sequence          1069 non-null   float64
 8   Main_Project_E_Platform       1092 non-null   object 
 9   Main_Project_E_Centers        1092 non-null   object 
dtypes: float64(1), object(9)
memory usage: 85.4+ KB


In [6]:
sample_info.describe()

Unnamed: 0,Total_Exome_Sequence
count,1069.0
mean,10282430000.0
std,3845551000.0
min,2477265000.0
25%,8078811000.0
50%,9730942000.0
75%,11424060000.0
max,35117090000.0


In [7]:
sample_info.head()

Unnamed: 0,Sample,Gender,Relationship,Population,Population_Description,Super_Population,Super_Population_Description,Total_Exome_Sequence,Main_Project_E_Platform,Main_Project_E_Centers
0,HG00377,female,,FIN,Finnish in Finland,EUR,European,4105720000.0,ILLUMINA,BI
1,HG01278,female,mother,CLM,"Colombian in Medellin, Colombia",AMR,American,7717205000.0,ILLUMINA,BCM
2,NA18527,female,,CHB,"Han Chinese in Bejing, China",EAS,East Asian,5811252000.0,ILLUMINA,BCM
3,NA19672,female,mother,MXL,"Mexican Ancestry in Los Angeles, California",AMR,American,,,
4,NA20363,female,mother,ASW,African Ancestry in Southwest US,AFR,African,,,


## Option 2: Retrieve filtered data from BigQuery using Python.

The following Python code reads the same columns and rows from the BigQuery table into a Pandas DataFrame, without hand-written SQL.

From https://cloud.google.com/community/tutorials/bigquery-ibis:

*[Ibis](http://ibis-project.org/) is a Python library for doing data analysis. It offers a Pandas-like environment for executing data analysis in big data processing systems such as Google BigQuery. Ibis's primary goals are to be a type safe, expressive, composable, and familiar replacement for SQL.*

In [8]:
conn = ibis.bigquery.connect(
    project_id=BILLING_PROJECT_ID,
    dataset_id='bigquery-public-data.human_genome_variants')

In [9]:
print('ibis version: %s' % ibis.__version__)

ibis version: 1.3.0


In [10]:
sample_info_tbl = conn.table('1000_genomes_sample_info')
sample_info_tbl

BigQueryTable[table]
  name: bigquery-public-data.human_genome_variants.1000_genomes_sample_info
  schema:
    Sample : string
    Family_ID : string
    Population : string
    Population_Description : string
    Gender : string
    Relationship : string
    Unexpected_Parent_Child : string
    Non_Paternity : string
    Siblings : string
    Grandparents : string
    Avuncular : string
    Half_Siblings : string
    Unknown_Second_Order : string
    Third_Order : string
    In_Low_Coverage_Pilot : boolean
    LC_Pilot_Platforms : string
    LC_Pilot_Centers : string
    In_High_Coverage_Pilot : boolean
    HC_Pilot_Platforms : string
    HC_Pilot_Centers : string
    In_Exon_Targetted_Pilot : boolean
    ET_Pilot_Platforms : string
    ET_Pilot_Centers : string
    Has_Sequence_in_Phase1 : boolean
    Phase1_LC_Platform : string
    Phase1_LC_Centers : string
    Phase1_E_Platform : string
    Phase1_E_Centers : string
    In_Phase1_Integrated_Variant_Set : boolean
    Has_Phase1_chr

In [11]:
# Define the filter criteria. The `== True` comparison builds an Ibis
# expression, so do not replace it with `is True`.
phase_1_only = sample_info_tbl.In_Phase1_Integrated_Variant_Set == True

# Apply the filter and choose the columns to return.
phase_1_sample_info_tbl = sample_info_tbl.filter(phase_1_only)[
    'Sample', 'Gender', 'Relationship', 'Population',
    'Population_Description', 'Super_Population',
    'Super_Population_Description', 'Total_Exome_Sequence',
    'Main_Project_E_Platform', 'Main_Project_E_Centers'
]

In [12]:
# Optional: take a look at the SQL.
print(phase_1_sample_info_tbl.compile())

SELECT `Sample`, `Gender`, `Relationship`, `Population`,
       `Population_Description`, `Super_Population`,
       `Super_Population_Description`, `Total_Exome_Sequence`,
       `Main_Project_E_Platform`, `Main_Project_E_Centers`
FROM (
  SELECT *
  FROM `bigquery-public-data.human_genome_variants.1000_genomes_sample_info`
  WHERE `In_Phase1_Integrated_Variant_Set` = TRUE
) t0
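
The SQL that Ibis generates wraps the filter in a subquery, whereas the hand-written query in Option 1 is flat. The two forms are logically equivalent. The sketch below illustrates this locally, using an in-memory SQLite database as a stand-in for BigQuery. The table, column names, and rows are invented for illustration and are not taken from the 1000 Genomes data.

```python
import sqlite3

# Toy rows invented for illustration: (Sample, Gender, In_Phase1).
rows = [
    ("S1", "female", 1),
    ("S2", "male", 0),
    ("S3", "male", 1),
]

local_db = sqlite3.connect(":memory:")
local_db.execute(
    "CREATE TABLE sample_info (Sample TEXT, Gender TEXT, In_Phase1 INTEGER)"
)
local_db.executemany("INSERT INTO sample_info VALUES (?, ?, ?)", rows)

# Flat form, as written by hand in Option 1.
flat_sql = """
SELECT Sample, Gender
FROM sample_info
WHERE In_Phase1 = 1
ORDER BY Sample
"""

# Nested form, mirroring the shape of the SQL that Ibis compiled.
nested_sql = """
SELECT Sample, Gender
FROM (
  SELECT *
  FROM sample_info
  WHERE In_Phase1 = 1
) t0
ORDER BY Sample
"""

flat_rows = local_db.execute(flat_sql).fetchall()
nested_rows = local_db.execute(nested_sql).fetchall()
local_db.close()

print(flat_rows == nested_rows)
```

The `ORDER BY` clause is only there to make the comparison deterministic.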


In [13]:
# Optional: check how many rows the query will return.
phase_1_sample_info_tbl.count().execute()

1092

In [14]:
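# Go ahead and retrieve the data. The explicit limit is presumably there to
# override Ibis's default result limit (ibis.options.sql.default_limit).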
phase_1_sample_info_df = phase_1_sample_info_tbl.limit(1000000).execute()
phase_1_sample_info_df.shape

(1092, 10)

In [15]:
phase_1_sample_info_df.head()

Unnamed: 0,Sample,Gender,Relationship,Population,Population_Description,Super_Population,Super_Population_Description,Total_Exome_Sequence,Main_Project_E_Platform,Main_Project_E_Centers
0,HG00377,female,,FIN,Finnish in Finland,EUR,European,4105720000.0,ILLUMINA,BI
1,HG01278,female,mother,CLM,"Colombian in Medellin, Colombia",AMR,American,7717205000.0,ILLUMINA,BCM
2,NA18527,female,,CHB,"Han Chinese in Bejing, China",EAS,East Asian,5811252000.0,ILLUMINA,BCM
3,NA19672,female,mother,MXL,"Mexican Ancestry in Los Angeles, California",AMR,American,,,
4,NA20363,female,mother,ASW,African Ancestry in Southwest US,AFR,African,,,


## Provenance

In [16]:
import datetime
print(datetime.datetime.now())

2020-07-27 20:16:31.116879
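
A full `pip3 freeze` listing is long. As a complement, the sketch below records a UTC timestamp, the Python version, and the versions of only the packages this notebook depends on. It assumes Python 3.8 or later for `importlib.metadata`, and the helper name `package_versions` is hypothetical.

```python
import datetime
import platform
from importlib import metadata  # Available in Python 3.8+.


def package_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


provenance = {
    # A timezone-aware timestamp avoids ambiguity about the server's local time.
    "timestamp_utc": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "python": platform.python_version(),
    "packages": package_versions(["pandas", "pandas-gbq", "ibis-framework"]),
}
print(provenance)
```

Packages that are not installed are reported as `None` rather than raising an error.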


In [17]:
!pip3 freeze

absl-py==0.9.0
arrow==0.15.7
asn1crypto==0.24.0
astor==0.8.1
astroid==2.4.2
astropy==4.0.1.post1
attrs==19.3.0
backcall==0.2.0
bagit==1.7.0
bgzip==0.3.5
binaryornot==0.4.4
biopython==1.72
bleach==3.1.5
bokeh==1.0.0
brewer2mpl==1.4.1
bx-python==0.8.2
CacheControl==0.11.7
cachetools==4.1.0
certifi==2020.6.20
chardet==3.0.4
cli-builder==0.0.1
click==7.1.2
confuse==1.1.0
cookiecutter==1.7.2
cryptography==2.1.4
cwltool==1.0.20190228155703
cycler==0.10.0
Cython==0.29.20
decorator==4.4.2
defusedxml==0.6.0
entrypoints==0.3
enum34==1.1.6
fastinterval==0.1.1
firecloud==0.16.25
future==0.18.2
gast==0.3.3
ggplot==0.11.5
google-api-core==1.21.0
google-auth==1.18.0
google-auth-oauthlib==0.4.1
google-cloud-bigquery==1.23.1
google-cloud-bigquery-datatransfer==0.4.1
google-cloud-core==1.3.0
google-cloud-datastore==1.10.0
google-cloud-resource-manager==0.30.0
google-cloud-storage==1.29.0
google-pasta==0.2.0
google-resumable-media==0.5.1
googleapis-common-p

Copyright 2018 The Broad Institute, Inc., Verily Life Sciences, LLC All rights reserved.

This software may be modified and distributed under the terms of the BSD license. See the LICENSE file for details.