# How to read data from BigQuery

This notebook demonstrates two ways to read data from BigQuery with Python:
1. Using SQL via [pandas-gbq](https://pandas-gbq.readthedocs.io/en/latest/).
2. Using only Python code, via [Ibis](https://docs.ibis-project.org/), to extract the data of interest from BigQuery.

## Setup

In [None]:
!pip3 install ibis-framework

In [None]:
import os

import ibis
import pandas as pd
import pandas_gbq
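
Both options below run queries that BigQuery bills to a Google Cloud project, even when the data lives in a public dataset. The notebook itself does not say where that project comes from, so the following is only a minimal sketch of one way to look it up. The helper `resolve_billing_project` is hypothetical, and the variable name `GOOGLE_CLOUD_PROJECT` is an assumption; your environment may use a different one.

```python
import os


def resolve_billing_project(env=None, var_name="GOOGLE_CLOUD_PROJECT"):
    """Return the billing project ID from the environment, or None if unset.

    `var_name` is an assumption; use whichever variable your environment sets.
    """
    env = os.environ if env is None else env
    # Treat a missing or blank value as "no project configured".
    project = env.get(var_name, "").strip()
    return project or None


billing_project = resolve_billing_project()
print(billing_project)
```

If a value is found, it could be passed as the `project_id` argument of `pandas_gbq.read_gbq` or `ibis.bigquery.connect`. When it is `None`, those libraries are expected to fall back to whatever project the default credentials provide.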

## Option 1: Retrieve filtered data from BigQuery using SQL.

The following SQL reads a subset of the columns and rows of a BigQuery table into a pandas DataFrame.
* [Pandas](http://pandas.pydata.org/pandas-docs/stable/) is a popular Python package for data manipulation.
* To learn more about SQL syntax, see the [BigQuery standard SQL reference](https://cloud.google.com/bigquery/docs/reference/standard-sql/).

In [None]:
# Run the query and load the result into a pandas DataFrame.
sample_info = pandas_gbq.read_gbq(
    """
SELECT
  Sample,
  Gender,
  Relationship,
  Population,
  Population_Description,
  Super_Population,
  Super_Population_Description,
  Total_Exome_Sequence,
  Main_Project_E_Platform,
  Main_Project_E_Centers
FROM
  `bigquery-public-data.human_genome_variants.1000_genomes_sample_info`
WHERE
  -- Only include information for samples in phase 1.
  In_Phase1_Integrated_Variant_Set = TRUE
""",
    dialect="standard",  # the query uses standard SQL (backtick-quoted table name)
)
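
The column list in the query above is repeated later in this notebook, so it can help to keep it in one place. As a minimal sketch (not part of the original notebook), the query text could be assembled from a Python list. The helper `build_sample_info_query` and the constants `TABLE` and `COLUMNS` are hypothetical names introduced only for illustration.

```python
import re

TABLE = "bigquery-public-data.human_genome_variants.1000_genomes_sample_info"
COLUMNS = [
    "Sample",
    "Gender",
    "Relationship",
    "Population",
    "Population_Description",
    "Super_Population",
    "Super_Population_Description",
    "Total_Exome_Sequence",
    "Main_Project_E_Platform",
    "Main_Project_E_Centers",
]

# Accept only plain identifiers so that arbitrary text cannot end up in the SQL.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_sample_info_query(columns, table=TABLE, phase_1_only=True):
    """Build a standard SQL SELECT over `table` for the given column names."""
    for name in columns:
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"not a plain column name: {name!r}")
    select_list = ",\n  ".join(columns)
    query = f"SELECT\n  {select_list}\nFROM\n  `{table}`"
    if phase_1_only:
        # Same filter as the hand-written query: phase 1 samples only.
        query += "\nWHERE\n  In_Phase1_Integrated_Variant_Set = TRUE"
    return query


query = build_sample_info_query(COLUMNS)
print(query)
```

The resulting string could be passed to the same reader call used above in place of the literal query. Building SQL from strings is only reasonable here because the column names are fixed and validated; user-supplied values belong in query parameters instead.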

In [None]:
sample_info.info()

In [None]:
sample_info.describe()

In [None]:
sample_info.head()

## Option 2: Retrieve filtered data from BigQuery using Python.

The following Python code reads the same filtered subset of the BigQuery table into a pandas DataFrame, without writing SQL by hand.

From the tutorial at https://cloud.google.com/community/tutorials/bigquery-ibis:

*[Ibis](http://ibis-project.org/) is a Python library for doing data analysis. It offers a Pandas-like environment for executing data analysis in big data processing systems such as Google BigQuery. Ibis's primary goals are to be a type safe, expressive, composable, and familiar replacement for SQL.*

In [None]:
conn = ibis.bigquery.connect(dataset_id="bigquery-public-data.human_genome_variants")

In [None]:
sample_info_tbl = conn.table("1000_genomes_sample_info")
sample_info_tbl

In [None]:
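# Define the filter criteria. The comparison builds an Ibis boolean expression;
# nothing is evaluated until the query is executed.
phase_1_only = sample_info_tbl.In_Phase1_Integrated_Variant_Set == True  # noqa: E712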

# Apply the filter and choose the columns to return.
phase_1_sample_info_tbl = sample_info_tbl.filter(phase_1_only)[
    "Sample",
    "Gender",
    "Relationship",
    "Population",
    "Population_Description",
    "Super_Population",
    "Super_Population_Description",
    "Total_Exome_Sequence",
    "Main_Project_E_Platform",
    "Main_Project_E_Centers",
]

In [None]:
# Optional: take a look at the SQL.
print(phase_1_sample_info_tbl.compile())

In [None]:
# Optional: count the rows that the query will return.
phase_1_sample_info_tbl.count().execute()

In [None]:
# Retrieve the data. The explicit limit overrides any default row limit configured in Ibis.
phase_1_sample_info_df = phase_1_sample_info_tbl.limit(1000000).execute()
phase_1_sample_info_df.shape

In [None]:
phase_1_sample_info_df.head()

## Provenance

In [None]:
import datetime

print(datetime.datetime.now())

In [None]:
!pip3 freeze
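
The full `pip3 freeze` listing can be long. As a complementary sketch (an addition, not part of the original notebook), the cell below records only a UTC timestamp, the Python version, and the versions of the packages this notebook relies on. The `PACKAGES` list and the helper `package_versions` are illustrative names, and the code assumes Python 3.8 or later for `importlib.metadata`.

```python
import datetime
import platform
from importlib import metadata

# Distribution names as published on PyPI; adjust to match your environment.
PACKAGES = ["pandas", "pandas-gbq", "ibis-framework"]


def package_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


provenance = {
    # A timezone-aware timestamp stays unambiguous across machines.
    "timestamp_utc": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "python": platform.python_version(),
    "packages": package_versions(PACKAGES),
}

for key, value in provenance.items():
    print(f"{key}: {value}")
```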

Copyright 2018 The Broad Institute, Inc., Verily Life Sciences, LLC All rights reserved.

This software may be modified and distributed under the terms of the BSD license. See the LICENSE file for details.