# How to load data to BigQuery

This notebook demonstrates three ways to load data into BigQuery from a Python 3 notebook:
* [bq](https://cloud.google.com/bigquery/docs/bq-command-line-tool) command line tool
* [gcloud Python client](https://googleapis.github.io/google-cloud-python/latest/bigquery/usage/index.html#bigquery-basics) (the `google-cloud-bigquery` package)
* [pandas-gbq](https://pandas-gbq.readthedocs.io/en/latest/)

For files, the `bq` tool and the gcloud Python client both submit a BigQuery load job and give the same result, so choose whichever fits your workflow.

For dataframes already in memory, pandas-gbq loads them directly, with no need to write the data to a file first.

# Setup

In [1]:
from google.cloud import bigquery
from google.cloud.bigquery import LoadJobConfig
from google.cloud.bigquery import SchemaField

import os
import numpy as np
import pandas as pd
import time

Edit these global variables in your clone of this notebook if you do not have WRITE access to the Google Cloud Platform project named below.
* The destination BigQuery dataset must already exist, and your pet account must have WRITE access to it.
* The remaining cells can be run as-is.

In [2]:
# CHANGE THESE VARIABLES, IF NEEDED
DESTINATION_PROJECT_ID = 'terra-resources'
DESTINATION_DATASET = 'autodelete_after_one_day'

In [3]:
BILLING_PROJECT_ID = os.environ['GOOGLE_PROJECT']
CSV_PATH = 'gs://genomics-public-data/platinum-genomes/other/platinum_genomes_sample_info.csv'
# Also try this CSV which will yield some autodetect errors.
# gs://genomics-public-data/1000-genomes/other/sample_info/sample_info.csv

# Load data to BigQuery from a CSV

## Via the `bq` command line tool

In [4]:
DESTINATION_TABLE = 'py3_bq_' + time.strftime("%Y%m%d_%H%M%S")
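
The timestamp suffix keeps each run from colliding with a table created by an earlier run. A minimal sketch of the same idea as a reusable helper follows. `make_table_name` is a hypothetical name, and the character check is a deliberately conservative assumption (letters, digits, and underscores only); BigQuery itself accepts a wider set of characters.

```python
import re
import time


def make_table_name(prefix, now=None):
    """Return prefix + 'YYYYmmdd_HHMMSS', the naming pattern used in this notebook."""
    # `now` is a time.struct_time; default to the current local time.
    if now is None:
        now = time.localtime()
    name = prefix + time.strftime("%Y%m%d_%H%M%S", now)
    # Conservative check (an assumption of this sketch, stricter than BigQuery).
    if not re.fullmatch(r"[A-Za-z0-9_]{1,1024}", name):
        raise ValueError("unexpected characters in table name: " + name)
    return name


# A fixed timestamp makes the result reproducible for this example.
fixed = time.strptime("2020-07-27 21:32:02", "%Y-%m-%d %H:%M:%S")
print(make_table_name("py3_bq_", fixed))
```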

In [5]:
%%bash -s "$DESTINATION_PROJECT_ID" "$DESTINATION_DATASET" "$DESTINATION_TABLE" "$CSV_PATH"

bq --project_id ${1} load --autodetect ${2}.${3} ${4}

Waiting on bqjob_r6484e458e29b0fe1_00000173923090dd_1 ... (0s) Current status: RUNNING
Waiting on bqjob_r6484e458e29b0fe1_00000173923090dd_1 ... (1s) Current status: RUNNING
Waiting on bqjob_r6484e458e29b0fe1_00000173923090dd_1 ... (1s) Current status: DONE
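
When autodetect guesses a column type wrong, `bq load` also accepts an inline schema (`name:type,name:type`) as a final positional argument. The sketch below only assembles the argument list so it can be inspected before anything runs. `build_bq_load_args` is a hypothetical helper, the column list is copied from the schema this notebook reports for the Platinum Genomes CSV, and `--skip_leading_rows=1` assumes the file has exactly one header row.

```python
import shlex


def build_bq_load_args(project_id, dataset, table, csv_uri, schema=None):
    """Return the argv list for `bq load`, with autodetect or an inline schema."""
    args = ["bq", "--project_id", project_id, "load", "--source_format=CSV"]
    if schema:
        # With an explicit schema the header row must be skipped by hand.
        args.append("--skip_leading_rows=1")
    else:
        args.append("--autodetect")
    args += ["{}.{}".format(dataset, table), csv_uri]
    if schema:
        args.append(",".join("{}:{}".format(name, kind) for name, kind in schema))
    return args


columns = [
    ("Catalog_ID", "STRING"), ("Description", "STRING"), ("Gender", "STRING"),
    ("Race", "STRING"), ("Family_Member", "INTEGER"),
    ("Relationship_to_Proband", "STRING"), ("CT_Desc", "STRING"),
    ("Cum_Pdl", "STRING"),
]
args = build_bq_load_args(
    "terra-resources", "autodelete_after_one_day", "py3_bq_example",
    "gs://genomics-public-data/platinum-genomes/other/platinum_genomes_sample_info.csv",
    schema=columns,
)
# Print a shell-safe version of the command for review.
print(" ".join(shlex.quote(a) for a in args))
```

To actually run the command from Python, the list can be passed to `subprocess.run(args, check=True)`.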


Show the table schema.

In [6]:
%%bash -s "$DESTINATION_PROJECT_ID" "$DESTINATION_DATASET" "$DESTINATION_TABLE"

bq --project_id ${1} show ${2}.${3}

Table terra-resources:autodelete_after_one_day.py3_bq_20200727_213202

   Last modified                  Schema                 Total Rows   Total Bytes     Expiration      Time Partitioning   Clustered Fields   Labels  
 ----------------- ------------------------------------ ------------ ------------- ----------------- ------------------- ------------------ -------- 
  27 Jul 21:32:05   |- Catalog_ID: string                17           1441          03 Aug 21:32:05                                                  
                    |- Description: string                                                                                                           
                    |- Gender: string                                                                                                                
                    |- Race: string                                                                                                                  
                    |- Family

## Via the gcloud Python client

See the [LoadJobConfig reference](https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.LoadJobConfig.html) for the available load options.

In [7]:
# Jobs are billed to BILLING_PROJECT_ID (set from GOOGLE_PROJECT in Setup).
client = bigquery.Client(project=BILLING_PROJECT_ID)
DESTINATION_TABLE = 'py3_gcloud_py_client_' + time.strftime("%Y%m%d_%H%M%S")

In [8]:
table_ref = client.dataset(DESTINATION_DATASET,
                           project=DESTINATION_PROJECT_ID).table(DESTINATION_TABLE)

# https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.LoadJobConfig.html
job_config = LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.autodetect = True

load_job = client.load_table_from_uri(CSV_PATH, table_ref, job_config=job_config)
print('Loading {}, starting job {}'.format(DESTINATION_TABLE, load_job.job_id))

Loading py3_gcloud_py_client_20200727_213207, starting job 67adfc47-0cf4-4a8e-b550-5b50185a637b


In [9]:
# Waits for table load to complete.
load_job.result()
print('Job finished.')

Job finished.


In [10]:
load_job.errors
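
`load_job.errors` shows no output here because the load succeeded. For a failing load, such as the 1000 Genomes CSV mentioned in Setup, a small formatter can make the entries easier to scan. This sketch assumes each entry is a dict with `reason`, `location`, and `message` keys, which matches the error structure the BigQuery API returns. `summarize_errors` and the sample entry are hypothetical and for illustration only.

```python
def summarize_errors(errors):
    """Turn a job's `errors` value (a list of dicts, or None) into readable lines."""
    if not errors:
        return ["no errors reported"]
    lines = []
    for err in errors:
        reason = err.get("reason", "unknown")
        location = err.get("location") or "n/a"
        message = err.get("message", "")
        lines.append("[{}] {}: {}".format(reason, location, message))
    return lines


# Made-up entry in the assumed shape; not real BigQuery output.
sample = [{
    "reason": "invalid",
    "location": "gs://example-bucket/data.csv",
    "message": "illustrative message only",
}]
for line in summarize_errors(sample):
    print(line)
```

In the notebook this would be called as `summarize_errors(load_job.errors)`.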

In [11]:
table = client.get_table(table_ref)  # API Request
print(table.schema)

[SchemaField('Catalog_ID', 'STRING', 'NULLABLE', None, ()), SchemaField('Description', 'STRING', 'NULLABLE', None, ()), SchemaField('Gender', 'STRING', 'NULLABLE', None, ()), SchemaField('Race', 'STRING', 'NULLABLE', None, ()), SchemaField('Family_Member', 'INTEGER', 'NULLABLE', None, ()), SchemaField('Relationship_to_Proband', 'STRING', 'NULLABLE', None, ()), SchemaField('CT_Desc', 'STRING', 'NULLABLE', None, ()), SchemaField('Cum_Pdl', 'STRING', 'NULLABLE', None, ())]
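
The schema printed above was inferred by autodetect. To pin it explicitly instead, the same columns can be supplied through `LoadJobConfig.schema`. The following is a sketch under stated assumptions: `explicit_csv_config` is a hypothetical helper, the column list is copied from the printed schema, and `skip_leading_rows = 1` assumes the CSV has exactly one header row.

```python
# Column names and types copied from the autodetected schema printed above.
PLATINUM_SCHEMA = [
    ("Catalog_ID", "STRING"),
    ("Description", "STRING"),
    ("Gender", "STRING"),
    ("Race", "STRING"),
    ("Family_Member", "INTEGER"),
    ("Relationship_to_Proband", "STRING"),
    ("CT_Desc", "STRING"),
    ("Cum_Pdl", "STRING"),
]


def explicit_csv_config(columns):
    """Build a LoadJobConfig that uses a fixed schema instead of autodetect."""
    # Imported here so the column list can be reused without the client library.
    from google.cloud import bigquery

    config = bigquery.LoadJobConfig()
    config.source_format = bigquery.SourceFormat.CSV
    config.skip_leading_rows = 1  # assumption: one header row in the file
    config.schema = [
        bigquery.SchemaField(name, kind, mode="NULLABLE") for name, kind in columns
    ]
    return config
```

The result would be passed as `job_config=explicit_csv_config(PLATINUM_SCHEMA)` to `client.load_table_from_uri`, in place of the autodetect configuration.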


In [12]:
print(table.description)

None


In [13]:
print(table.num_rows)

17


# Load data to BigQuery from a dataframe

## Via pandas-gbq

In [14]:
DESTINATION_TABLE = 'py3_pandas_gbq_' + time.strftime("%Y%m%d_%H%M%S")

In [15]:
df = pd.DataFrame(np.random.randint(low=0, high=10, size=(5, 5)),
                  columns=['a', 'b', 'c', 'd', 'e'])

In [16]:
df.to_gbq(destination_table='.'.join([DESTINATION_DATASET, DESTINATION_TABLE]),
          project_id=DESTINATION_PROJECT_ID)

1it [00:05,  5.50s/it]
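
By default `to_gbq` raises an error if the destination table already exists (`if_exists='fail'`), which is why a fresh timestamped table name is used here. The sketch below wraps the call so the behavior is chosen explicitly. `upload_dataframe` is a hypothetical helper; it works with any object that exposes a pandas-style `to_gbq` method.

```python
VALID_IF_EXISTS = ("fail", "replace", "append")


def upload_dataframe(df, dataset, table, project_id, if_exists="fail"):
    """Load `df` into dataset.table and return the destination string."""
    if if_exists not in VALID_IF_EXISTS:
        raise ValueError("if_exists must be one of {}".format(VALID_IF_EXISTS))
    destination = "{}.{}".format(dataset, table)
    # DataFrame.to_gbq delegates to pandas-gbq, which must be installed.
    df.to_gbq(destination_table=destination,
              project_id=project_id,
              if_exists=if_exists)
    return destination
```

For example, `upload_dataframe(df, DESTINATION_DATASET, DESTINATION_TABLE, DESTINATION_PROJECT_ID, if_exists='append')` would add rows to an existing table rather than fail.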


In [17]:
pd.io.gbq.read_gbq('SELECT COUNT(*) AS cnt FROM `{}.{}.{}`'.format(DESTINATION_PROJECT_ID,
                                                                   DESTINATION_DATASET,
                                                                   DESTINATION_TABLE),
                   project_id=BILLING_PROJECT_ID,
                   dialect='standard')

Downloading: 100%|██████████| 1/1 [00:00<00:00,  7.41rows/s]


   cnt
0    5
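
The query above refers to the table as a backtick-quoted `project.dataset.table` path, which keeps the reference valid in standard SQL when the project ID contains a dash, as `terra-resources` does. A minimal sketch of a helper that builds the same query string follows; `count_rows_sql` is a hypothetical name.

```python
def count_rows_sql(project_id, dataset, table):
    """Return a standard SQL row-count query for project.dataset.table."""
    for part in (project_id, dataset, table):
        # A backtick inside an identifier would break the quoted path.
        if "`" in part:
            raise ValueError("backtick not allowed in identifier: " + part)
    return "SELECT COUNT(*) AS cnt FROM `{}.{}.{}`".format(project_id, dataset, table)


print(count_rows_sql("terra-resources", "autodelete_after_one_day", "py3_pandas_gbq_example"))
```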


# Provenance

In [18]:
import datetime
print(datetime.datetime.now())

2020-07-27 21:32:19.163267


In [19]:
!pip3 freeze

absl-py==0.9.0
arrow==0.15.7
asgiref==3.2.10
asn1crypto==0.24.0
astor==0.8.1
astroid==2.4.2
astropy==4.0.1.post1
attrs==19.3.0
backcall==0.2.0
bagit==1.7.0
bgzip==0.3.5
binaryornot==0.4.4
biopython==1.72
bleach==3.1.5
bokeh==1.0.0
brewer2mpl==1.4.1
bx-python==0.8.2
CacheControl==0.11.7
cachetools==4.1.0
certifi==2020.6.20
chardet==3.0.4
cli-builder==0.0.1
click==7.1.2
confuse==1.1.0
cookiecutter==1.7.2
cryptography==2.1.4
cwltool==1.0.20190228155703
cycler==0.10.0
Cython==0.29.20
decorator==4.4.2
defusedxml==0.6.0
Django==3.0.8
entrypoints==0.3
enum34==1.1.6
facets==1.1.0
facets-overview==1.0.0
fastinterval==0.1.1
firecloud==0.16.25
future==0.18.2
gast==0.3.3
ggplot==0.11.5
google-api-core==1.21.0
google-auth==1.18.0
google-auth-oauthlib==0.4.1
google-cloud-bigquery==1.23.1
google-cloud-bigquery-datatransfer==0.4.1
google-cloud-core==1.3.0
google-cloud-datastore==1.10.0
google-cloud-resource-manager==0.30.0
google-cloud-storage==1.29.0
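
The full `pip3 freeze` listing is long. When only the packages used in this notebook matter, a shorter provenance record can be produced with the standard library. This is a sketch, assuming Python 3.8 or later (for `importlib.metadata`); `package_versions` is a hypothetical helper.

```python
from importlib import metadata


def package_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


for name, version in package_versions(
        ["google-cloud-bigquery", "pandas-gbq", "pandas", "numpy"]).items():
    if version:
        print("{}=={}".format(name, version))
    else:
        print("{} (not installed)".format(name))
```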


Copyright 2018 The Broad Institute, Inc., Verily Life Sciences, LLC All rights reserved.

This software may be modified and distributed under the terms of the BSD license. See the LICENSE file for details.