# How to load data to BigQuery

This notebook demonstrates three ways to load data into BigQuery from a Python 3 notebook:
* [bq](https://cloud.google.com/bigquery/docs/bq-command-line-tool) command line tool
* [gcloud python client](https://googleapis.github.io/google-cloud-python/latest/bigquery/usage/index.html#bigquery-basics)
* [pandas-gbq](https://pandas-gbq.readthedocs.io/en/latest/)

For files, the `bq` tool and the gcloud Python client work equally well. Choose whichever you prefer.

For dataframes in memory, pandas-gbq loads the dataframe directly, with no need to write it out to a file first.

<div class="alert alert-block alert-info">
<b>Tip:</b> See also the companion Terra Support article <a href='https://support.terra.bio/hc/en-us/articles/360051229072-Accessing-Advanced-GCP-features-in-Terra'>Accessing Advanced GCP features in Terra</a>.
</div>


# Setup
Edit the global variables in your clone of this notebook to refer to a native Google Cloud Platform project to which you have WRITE access.
* **The destination BigQuery dataset should already exist. Your pet account must have WRITE access to it.**
[**Click for step-by-step instructions to create a BQ dataset**](https://support.terra.bio/hc/en-us/articles/360051229072#h_01EPCCS08S69VE4VMT0F0NNDWR)     


* Make sure to change to your own project and dataset names. The remaining cells can be run as-is.

In [None]:
import os
import time

import numpy as np
import pandas as pd
from google.cloud import bigquery
from google.cloud.bigquery import LoadJobConfig, SchemaField

In [None]:
CSV_PATH = (
    "gs://genomics-public-data/platinum-genomes/other/platinum_genomes_sample_info.csv"
)
# Also try this CSV which will yield some autodetect errors.
# gs://genomics-public-data/1000-genomes/other/sample_info/sample_info.csv

**Note that you will need to change the variables below to your own values** (expand the tips if you need help finding the variables)

In [None]:
# CHANGE THESE VARIABLES
DESTINATION_PROJECT_ID = "your_GCP-native_project_ID"
DESTINATION_DATASET = "your_BQ_dataset"

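In [None]:
# Example values. Running this cell overwrites the values set above,
# so skip it or edit it to match your own project and dataset.
DESTINATION_PROJECT_ID = "ah-native-gcp-project-74939"
DESTINATION_DATASET = "BQ_dataset_autodelete_after_one_day"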
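Before running the remaining cells, it can help to check that the two variables look like real identifiers rather than leftover placeholders. The sketch below is a minimal format check only. It assumes the usual naming rules (project IDs are 6 to 30 lowercase letters, digits, and hyphens; dataset IDs are letters, digits, and underscores) and does not verify that the project or dataset exists. The helper name `check_destination` is hypothetical.

```python
import re

# Google Cloud project IDs: 6-30 characters, lowercase letters, digits and
# hyphens, starting with a letter and not ending with a hyphen.
PROJECT_ID_RE = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")
# BigQuery dataset IDs: letters, digits and underscores, at most 1024 characters.
DATASET_ID_RE = re.compile(r"[A-Za-z0-9_]{1,1024}")


def check_destination(project_id, dataset_id):
    """Return a list of problems found; an empty list means both look valid."""
    problems = []
    if not PROJECT_ID_RE.fullmatch(project_id):
        problems.append(f"invalid project ID: {project_id!r}")
    if not DATASET_ID_RE.fullmatch(dataset_id):
        problems.append(f"invalid dataset ID: {dataset_id!r}")
    return problems


# The unchanged placeholder is flagged because of its uppercase letters
# and underscores.
print(check_destination("your_GCP-native_project_ID", "your_BQ_dataset"))
```

In the notebook, calling `check_destination(DESTINATION_PROJECT_ID, DESTINATION_DATASET)` and stopping when the result is non-empty would catch a forgotten edit early.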

### <font color="#FF6600">(expand for tip) </font> <font color="#445555">How to find your cloud-native project-ID</font><a class="tocSkip">

When logged in with your Terra user-ID, go to billing in the GCP console at [https://console.cloud.google.com/billing](https://console.cloud.google.com/billing)     
![finding project ID screen shot](https://storage.googleapis.com/terra-featured-workspaces/QuickStart/Advanced-GCP-features_Find-Project-ID_Step1_Screen%20shot.png)

1. Select the Organization you used when creating your cloud-native project    
2. Find the Project ID at right    

### <font color="#FF6600">(expand for tip) </font> <font color="#445555">How to find your cloud-native BigQuery dataset</font><a class="tocSkip">

Go to [https://console.cloud.google.com/bigquery](https://console.cloud.google.com/bigquery)   

In the left column, select your cloud-native Project from the drop-down. You should see your BQ dataset listed:

![Find BQ dataset screen shot](https://storage.googleapis.com/terra-featured-workspaces/QuickStart/Advanced-GCP-features_Find-BQ-dataset-name_Screen%20shot.png)

# Load data to BigQuery from a CSV

## Via the `bq` command line tool

In [None]:
DESTINATION_TABLE = "py3_bq_" + time.strftime("%Y%m%d_%H%M%S")

In [None]:
%%bash -s "$DESTINATION_PROJECT_ID" "$DESTINATION_DATASET" "$DESTINATION_TABLE" "$CSV_PATH"

bq --project_id ${1} load --autodetect ${2}.${3} ${4}
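The same `bq load` call can also be assembled in Python, which avoids passing variables through `%%bash` positional arguments. The following is a sketch under that assumption: it uses only the flags shown in the cell above, the helper name `build_bq_load_command` is hypothetical, and the project and dataset values are made-up examples.

```python
import shlex


def build_bq_load_command(project_id, dataset, table, source_uri):
    """Return the argv list for the same `bq load` call as the bash cell."""
    return [
        "bq",
        "--project_id", project_id,
        "load",
        "--autodetect",
        f"{dataset}.{table}",
        source_uri,
    ]


cmd = build_bq_load_command(
    "my-project-123",  # hypothetical project ID
    "my_dataset",  # hypothetical dataset name
    "py3_bq_20240101_000000",
    "gs://genomics-public-data/platinum-genomes/other/platinum_genomes_sample_info.csv",
)
# shlex.join quotes each argument so the line is safe to paste into a shell.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it without a shell, so the arguments need no extra quoting.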

Show the table schema.

In [None]:
%%bash -s "$DESTINATION_PROJECT_ID" "$DESTINATION_DATASET" "$DESTINATION_TABLE"

bq --project_id ${1} show ${2}.${3}
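If the schema shown is not what you expected, or autodetect fails outright as with the second CSV mentioned earlier, one option is to supply an explicit schema. The sketch below writes a schema file in the JSON format that `bq load` accepts. The column names and types are hypothetical and are not the real columns of either CSV.

```python
import json
import os
import tempfile

# Hypothetical column names and types, for illustration only; replace them
# with the real columns of your CSV.
schema = [
    {"name": "sample_id", "type": "STRING", "mode": "REQUIRED"},
    {"name": "population", "type": "STRING", "mode": "NULLABLE"},
    {"name": "read_count", "type": "INTEGER", "mode": "NULLABLE"},
]

# Write the schema to a temporary directory so nothing in the workspace
# is overwritten.
schema_path = os.path.join(tempfile.mkdtemp(), "schema.json")
with open(schema_path, "w") as f:
    json.dump(schema, f, indent=2)

print(schema_path)
```

The file path is then passed as the last positional argument to `bq load` in place of `--autodetect`, typically together with `--skip_leading_rows=1` when the CSV has a header row.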

## Via the gcloud Python client

https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.LoadJobConfig.html

In [None]:
client = bigquery.Client(project=os.environ["GOOGLE_PROJECT"])
DESTINATION_TABLE = "py3_gcloud_py_client_" + time.strftime("%Y%m%d_%H%M%S")

In [None]:
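# Build the table reference directly; client.dataset() is deprecated in
# recent versions of google-cloud-bigquery.
table_ref = bigquery.DatasetReference(
    DESTINATION_PROJECT_ID, DESTINATION_DATASET
).table(DESTINATION_TABLE)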

# https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.LoadJobConfig.html
job_config = LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.autodetect = True

load_job = client.load_table_from_uri(CSV_PATH, table_ref, job_config=job_config)
print("Loading {}, starting job {}".format(DESTINATION_TABLE, load_job.job_id))

In [None]:
# Wait for the load job to complete; raises an exception if the job failed.
load_job.result()
print("Job finished.")

In [None]:
load_job.errors
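The raw error list can be hard to read when a load hits many bad rows. The helper below is a minimal sketch. It assumes each entry is a mapping with `reason`, `location`, and `message` keys and that the list may be `None` when nothing went wrong. The function name and the sample entry are made up for illustration.

```python
def summarize_load_errors(errors):
    """Turn a job's error list into readable lines; handles None or empty."""
    if not errors:
        return ["no errors"]
    return [
        "{reason} at {location}: {message}".format(
            reason=err.get("reason", "unknown"),
            location=err.get("location", "unknown location"),
            message=err.get("message", ""),
        )
        for err in errors
    ]


# Made-up error entry for illustration; a real job reports its own.
sample_errors = [
    {
        "reason": "invalid",
        "location": "gs://bucket/file.csv",
        "message": "Could not parse 'abc' as INT64",
    },
]
for line in summarize_load_errors(sample_errors):
    print(line)
print(summarize_load_errors(None))
```

In the notebook this would be called as `summarize_load_errors(load_job.errors)`.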

In [None]:
table = client.get_table(table_ref)  # API Request
print(table.schema)

In [None]:
print(table.description)

In [None]:
print(table.num_rows)

# Load data to BigQuery from a dataframe

## Via pandas-gbq

In [None]:
DESTINATION_TABLE = "py3_pandas_gbq_" + time.strftime("%Y%m%d_%H%M%S")

In [None]:
df = pd.DataFrame(
    np.random.randint(low=0, high=10, size=(5, 5)), columns=["a", "b", "c", "d", "e"]
)

In [None]:
df.to_gbq(
    destination_table=".".join([DESTINATION_DATASET, DESTINATION_TABLE]),
    project_id=DESTINATION_PROJECT_ID,
)

In [None]:
pd.io.gbq.read_gbq(
    f"SELECT COUNT(*) AS cnt FROM `{DESTINATION_PROJECT_ID}.{DESTINATION_DATASET}.{DESTINATION_TABLE}`"
)
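The query above wraps the fully qualified table name in backticks, which keeps project IDs that contain hyphens valid in SQL. A small helper can keep that quoting in one place. This is a sketch with a hypothetical function name and example values, not part of pandas-gbq.

```python
def count_rows_sql(project_id, dataset, table):
    """Build a row-count query with a backtick-quoted, fully qualified table."""
    for part in (project_id, dataset, table):
        # A backtick inside an identifier would end the quoting early.
        if "`" in part:
            raise ValueError(f"backtick not allowed in identifier: {part!r}")
    return f"SELECT COUNT(*) AS cnt FROM `{project_id}.{dataset}.{table}`"


# Hypothetical values for illustration.
print(count_rows_sql("my-project-123", "my_dataset", "py3_pandas_gbq_20240101_000000"))
```

The returned string can be passed to `read_gbq` exactly as the f-string in the cell above is.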

### <font color="#FF6600">(expand for tip) </font> <font color="#445555">What to expect</font><a class="tocSkip">

After running the cells above, you should see three new tables in your BQ dataset under your cloud-native project. Note that if you set up the dataset to autodelete, the tables will disappear after the set time, so you will not be charged for storing them beyond that point.
![BQ datasets_Screen shot](https://storage.googleapis.com/terra-featured-workspaces/QuickStart/Advanced-GCP-features_BQ-datasets_Screen%20shot.png)

# Provenance

In [None]:
import datetime

print(datetime.datetime.now())

In [None]:
!pip3 freeze

Copyright 2018 The Broad Institute, Inc., Verily Life Sciences, LLC All rights reserved.

This software may be modified and distributed under the terms of the BSD license. See the LICENSE file for details.