# How to load data to BigQuery

This notebook demonstrates three ways to load data into BigQuery from a Python 3 notebook:
* [bq](https://cloud.google.com/bigquery/docs/bq-command-line-tool) command line tool
* [gcloud Python client](https://googleapis.github.io/google-cloud-python/latest/bigquery/usage/index.html#bigquery-basics)
* [pandas-gbq](https://pandas-gbq.readthedocs.io/en/latest/)

For files, the `bq` tool and the gcloud Python client both work well and produce the same result, so choose whichever you prefer.

For dataframes in memory, pandas-gbq is a good choice because there is no need to write the dataframe out to a file first.

See also the companion Terra Support article [Accessing Advanced GCP features in Terra](https://support.terra.bio/hc/en-us/articles/360051229072-Accessing-Advanced-GCP-features-in-Terra).

# Setup
If you do not have WRITE permission on the native Google Cloud Platform project used in this notebook, edit the global variables below in your clone.
* **The destination BigQuery dataset must already exist, and your pet account must have WRITE access to it.**
[**Click for step-by-step instructions to create a BQ dataset**](https://support.terra.bio/hc/en-us/articles/360051229072#h_01EPCCS08S69VE4VMT0F0NNDWR)

* Change the project and dataset names to your own. The remaining cells can be run as-is.

In [1]:
from google.cloud import bigquery
from google.cloud.bigquery import LoadJobConfig
from google.cloud.bigquery import SchemaField

import os
import numpy as np
import pandas as pd
import time

In [2]:
BILLING_PROJECT_ID = os.environ['GOOGLE_PROJECT']
CSV_PATH = 'gs://genomics-public-data/platinum-genomes/other/platinum_genomes_sample_info.csv'
# Also try this CSV which will yield some autodetect errors.
# gs://genomics-public-data/1000-genomes/other/sample_info/sample_info.csv

**Note that you will need to change the variables below to your own values** (expand the tips below if you need help finding these values).

In [3]:
# CHANGE THESE VARIABLES
DESTINATION_PROJECT_ID = 'your_GCP-native_project_ID'
DESTINATION_DATASET = 'your_BQ_dataset'

In [4]:
DESTINATION_PROJECT_ID = 'ah-native-gcp-project-74939'
DESTINATION_DATASET = 'BQ_dataset_autodelete_after_one_day'
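
Because every later cell depends on these two variables, it can help to fail early if they still hold the template placeholders. The following is a minimal sketch, not part of the original notebook: the helper name `check_destination` and the example values are hypothetical, and the check only compares against the placeholder strings shown above.

```python
# Placeholder strings copied from the "CHANGE THESE VARIABLES" cell.
PLACEHOLDERS = {'your_GCP-native_project_ID', 'your_BQ_dataset'}


def check_destination(project_id, dataset):
    """Return 'project.dataset', or raise ValueError if a value was not edited.

    Hypothetical helper: it does not contact BigQuery, it only catches the
    common mistake of running the notebook with the template values.
    """
    pairs = (('DESTINATION_PROJECT_ID', project_id),
             ('DESTINATION_DATASET', dataset))
    for name, value in pairs:
        if not value or value in PLACEHOLDERS:
            raise ValueError(f'{name} is not set; edit it before loading data')
    return f'{project_id}.{dataset}'


# Example with made-up values; in the notebook you would pass
# DESTINATION_PROJECT_ID and DESTINATION_DATASET instead.
print(check_destination('my-project', 'my_dataset'))
```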

### <font color="#FF6600">(expand for tip) </font> <font color="#445555">How to find your cloud-native project-ID</font><a class="tocSkip">

When logged in with your Terra user-ID, go to billing in the GCP console at [https://console.cloud.google.com/billing](https://console.cloud.google.com/billing)     
![finding project ID screen shot](https://storage.googleapis.com/terra-featured-workspaces/QuickStart/Advanced-GCP-features_Find-Project-ID_Step1_Screen%20shot.png)

1. Select the Organization you used when creating your cloud-native project    
2. Find the Project ID at right    

### <font color="#FF6600">(expand for tip) </font> <font color="#445555">How to find your cloud-native BigQuery dataset</font><a class="tocSkip">

Go to [https://console.cloud.google.com/bigquery](https://console.cloud.google.com/bigquery)   

In the left column, select your cloud-native project from the drop-down. You should see your BQ dataset listed:

![Find BQ dataset Screen shot](https://storage.googleapis.com/terra-featured-workspaces/QuickStart/Advanced-GCP-features_Find-BQ-dataset-name_Screen%20shot.png)

# Load data to BigQuery from a CSV

## Via the `bq` command line tool

In [5]:
DESTINATION_TABLE = 'py3_bq_' + time.strftime("%Y%m%d_%H%M%S")

In [6]:
%%bash -s "$DESTINATION_PROJECT_ID" "$DESTINATION_DATASET" "$DESTINATION_TABLE" "$CSV_PATH"

bq --project_id ${1} load --autodetect ${2}.${3} ${4}

Waiting on bqjob_ra89468b846fea74_0000017986eb53b3_1 ... (0s) Current status: RUNNING
Waiting on bqjob_ra89468b846fea74_0000017986eb53b3_1 ... (1s) Current status: RUNNING
Waiting on bqjob_ra89468b846fea74_0000017986eb53b3_1 ... (1s) Current status: DONE
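
If you would rather stay in Python than use a `%%bash` cell, the same `bq` invocation can be assembled as an argument list and handed to `subprocess`. This is a sketch under assumptions: the helper name `bq_load_command` and the example project, dataset, and table names are hypothetical, and the flags are only those already used in the cell above.

```python
import shlex


def bq_load_command(project_id, dataset, table, source_uri, autodetect=True):
    """Build the argv list for the `bq load` call used in the bash cell."""
    argv = ['bq', '--project_id', project_id, 'load']
    if autodetect:
        # Ask BigQuery to infer the schema from the file contents.
        argv.append('--autodetect')
    argv += [f'{dataset}.{table}', source_uri]
    return argv


# Made-up destination names; the CSV path is the one used in this notebook.
argv = bq_load_command(
    'my-project', 'my_dataset', 'py3_bq_example',
    'gs://genomics-public-data/platinum-genomes/other/platinum_genomes_sample_info.csv')

# Print the command for inspection without running it.
print(shlex.join(argv))

# To actually run it (requires the bq tool and credentials):
#   import subprocess
#   subprocess.run(argv, check=True)
```

Passing a list rather than a single shell string avoids quoting problems if a name ever contains unusual characters.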


Show the table schema.

In [7]:
%%bash -s "$DESTINATION_PROJECT_ID" "$DESTINATION_DATASET" "$DESTINATION_TABLE"

bq --project_id ${1} show ${2}.${3}

Table ah-native-gcp-project-74939:BQ_dataset_autodelete_after_one_day.py3_bq_20210519_231717

   Last modified                  Schema                 Total Rows   Total Bytes     Expiration      Time Partitioning   Clustered Fields   Labels  
 ----------------- ------------------------------------ ------------ ------------- ----------------- ------------------- ------------------ -------- 
  19 May 23:17:22   |- Catalog_ID: string                17           1441          20 May 23:17:22                                                  
                    |- Description: string                                                                                                           
                    |- Gender: string                                                                                                                
                    |- Race: string                                                                                                                  
      

## Via the gcloud Python client

See the [LoadJobConfig documentation](https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.LoadJobConfig.html) for the available options.

In [8]:
client = bigquery.Client(project=os.environ['GOOGLE_PROJECT'])
DESTINATION_TABLE = 'py3_gcloud_py_client_' + time.strftime("%Y%m%d_%H%M%S")

In [9]:
table_ref = client.dataset(DESTINATION_DATASET,
                           project=DESTINATION_PROJECT_ID).table(DESTINATION_TABLE)

# https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.LoadJobConfig.html
job_config = LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.autodetect = True

load_job = client.load_table_from_uri(CSV_PATH, table_ref, job_config=job_config)
print('Loading {}, starting job {}'.format(DESTINATION_TABLE, load_job.job_id))

Loading py3_gcloud_py_client_20210519_231725, starting job 1019ddda-3ca0-4e3b-9b13-8a8701fc2aab


In [10]:
# Waits for table load to complete.
load_job.result()
print('Job finished.')

Job finished.


In [11]:
load_job.errors
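
The cell above shows no output because `load_job.errors` is `None` for this load. With a messier CSV, such as the alternative file mentioned in the setup cells, the property can instead hold a list of error dictionaries. Below is a small sketch of one way to print them readably. The helper name `summarize_load_errors` is hypothetical, the sample dictionaries are invented for illustration, and the `reason` and `message` keys are assumed to follow the usual BigQuery error layout.

```python
def summarize_load_errors(errors):
    """Turn a load job's error list (or None) into short one-line strings."""
    if not errors:
        return []
    lines = []
    for err in errors:
        # .get() keeps the helper tolerant of entries missing a key.
        reason = err.get('reason', 'unknown')
        message = err.get('message', '')
        lines.append(f'{reason}: {message}')
    return lines


# Invented sample data, shaped like the dictionaries a failed load might report.
sample_errors = [
    {'reason': 'invalid', 'message': 'Could not parse value as INTEGER'},
    {'message': 'Too many errors encountered'},
]
for line in summarize_load_errors(sample_errors):
    print(line)
```

In the notebook this would be called as `summarize_load_errors(load_job.errors)`. Note that `load_job.result()` raises an exception when a job fails, so the call usually belongs inside an `except` block.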

In [12]:
table = client.get_table(table_ref)  # API Request
print(table.schema)

[SchemaField('Catalog_ID', 'STRING', 'NULLABLE', None, (), None), SchemaField('Description', 'STRING', 'NULLABLE', None, (), None), SchemaField('Gender', 'STRING', 'NULLABLE', None, (), None), SchemaField('Race', 'STRING', 'NULLABLE', None, (), None), SchemaField('Family_Member', 'INTEGER', 'NULLABLE', None, (), None), SchemaField('Relationship_to_Proband', 'STRING', 'NULLABLE', None, (), None), SchemaField('CT_Desc', 'STRING', 'NULLABLE', None, (), None), SchemaField('Cum_Pdl', 'STRING', 'NULLABLE', None, (), None)]
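
When autodetect guesses a column type incorrectly, an explicit schema is an alternative. The sketch below rebuilds the schema printed above as plain `(name, type)` pairs and turns it into a `LoadJobConfig`. The names `EXPECTED_COLUMNS` and `explicit_csv_job_config` are hypothetical, and skipping one header row is an assumption about the CSV layout. The BigQuery import is deferred so the column list can be inspected without the library installed.

```python
# Column names and types copied from the autodetected schema printed above.
EXPECTED_COLUMNS = [
    ('Catalog_ID', 'STRING'),
    ('Description', 'STRING'),
    ('Gender', 'STRING'),
    ('Race', 'STRING'),
    ('Family_Member', 'INTEGER'),
    ('Relationship_to_Proband', 'STRING'),
    ('CT_Desc', 'STRING'),
    ('Cum_Pdl', 'STRING'),
]


def explicit_csv_job_config(columns=EXPECTED_COLUMNS):
    """Build a CSV LoadJobConfig with a fixed schema instead of autodetect."""
    from google.cloud import bigquery  # imported lazily, only when called

    job_config = bigquery.LoadJobConfig()
    job_config.source_format = bigquery.SourceFormat.CSV
    # Assumption: the first row of the CSV is a header and should not be loaded.
    job_config.skip_leading_rows = 1
    job_config.schema = [
        bigquery.SchemaField(name, field_type, mode='NULLABLE')
        for name, field_type in columns
    ]
    return job_config
```

The result can be passed in place of the autodetect configuration, for example `client.load_table_from_uri(CSV_PATH, table_ref, job_config=explicit_csv_job_config())`.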


In [13]:
print(table.description)

None


In [14]:
print(table.num_rows)

17


# Load data to BigQuery from a dataframe

## Via pandas-gbq

In [15]:
DESTINATION_TABLE = 'py3_pandas_gbq_' + time.strftime("%Y%m%d_%H%M%S")

In [16]:
df = pd.DataFrame(np.random.randint(low=0, high=10, size=(5, 5)),
                  columns=['a', 'b', 'c', 'd', 'e'])

In [17]:
df.to_gbq(destination_table='.'.join([DESTINATION_DATASET, DESTINATION_TABLE]),
          project_id=DESTINATION_PROJECT_ID)

1it [00:04,  4.37s/it]
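
By default `to_gbq` refuses to write to a table that already exists. Its `if_exists` parameter accepts `'fail'`, `'replace'`, or `'append'` to control this. The wrapper below is a sketch only: the function name `upload_dataframe` is hypothetical, and it simply validates the option and forwards the call to the dataframe's own `to_gbq` method.

```python
VALID_IF_EXISTS = ('fail', 'replace', 'append')


def upload_dataframe(df, project_id, dataset, table, if_exists='fail'):
    """Write df to dataset.table via df.to_gbq and return the destination name.

    if_exists='fail'    -> raise if the table already exists (the default)
    if_exists='replace' -> drop and recreate the table
    if_exists='append'  -> add rows to the existing table
    """
    if if_exists not in VALID_IF_EXISTS:
        raise ValueError(f'if_exists must be one of {VALID_IF_EXISTS}')
    destination = f'{dataset}.{table}'
    df.to_gbq(destination_table=destination,
              project_id=project_id,
              if_exists=if_exists)
    return destination
```

In this notebook, `upload_dataframe(df, DESTINATION_PROJECT_ID, DESTINATION_DATASET, DESTINATION_TABLE, if_exists='append')` would add the five rows to the table again rather than fail.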


In [18]:
pd.io.gbq.read_gbq(
    f'SELECT COUNT(*) AS cnt FROM `{DESTINATION_PROJECT_ID}.{DESTINATION_DATASET}.{DESTINATION_TABLE}`')

|   | cnt |
|---|-----|
| 0 | 5   |


### <font color="#FF6600">(expand for tip) </font> <font color="#445555">What to expect</font><a class="tocSkip">

After running the cells above, you should see three new tables in your BQ dataset under your cloud-native project. If you set the dataset up to autodelete, the tables will disappear after the set time, so you will not keep paying storage costs for them.
![BQ datasets_Screen shot](https://storage.googleapis.com/terra-featured-workspaces/QuickStart/Advanced-GCP-features_BQ-datasets_Screen%20shot.png)

# Provenance

In [19]:
import datetime
print(datetime.datetime.now())

2021-05-19 23:17:36.387711


In [20]:
!pip3 freeze

absl-py==0.12.0
anyio==3.1.0
argon2-cffi==20.1.0
arrow==1.1.0
arviz==0.11.2
asn1crypto==0.24.0
astroid==2.5.6
astunparse==1.6.3
async-generator==1.10
attrs==21.2.0
backcall==0.2.0
bagit==1.8.1
bgzip==0.3.5
binaryornot==0.4.4
biopython==1.78
bleach==3.3.0
bokeh==2.3.1
brewer2mpl==1.4.1
bx-python==0.8.11
CacheControl==0.11.7
cachetools==4.2.2
certifi==2020.12.5
cffi==1.14.5
cftime==1.4.1
chardet==4.0.0
cli-builder==0.1.5
click==7.1.2
colorama==0.4.4
confuse==1.4.0
cookiecutter==1.7.2
crcmod==1.7
cryptography==3.4.7
cwltool==1.0.20190228155703
cycler==0.10.0
Cython==0.29.23
decorator==4.4.2
defusedxml==0.7.1
descartes==1.1.0
dill==0.3.3
entrypoints==0.3
facets-overview==1.0.0
fastinterval==0.1.1
fastprogress==1.0.0
filelock==3.0.12
firecloud==0.16.25
flatbuffers==1.12
future==0.18.2
gast==0.3.3
ggplot==0.11.5
gitdb==4.0.7
GitPython==3.1.17
google-api-core==1.26.3
google-auth==1.30.0
google-auth-oauthlib==0.4.4
google-cloud-bigquery==2.

Copyright 2018 The Broad Institute, Inc., Verily Life Sciences, LLC All rights reserved.

This software may be modified and distributed under the terms of the BSD license. See the LICENSE file for details.