# How to load data to BigQuery

This notebook demonstrates two ways to load data into BigQuery from an R notebook:
* the [bq](https://cloud.google.com/bigquery/docs/bq-command-line-tool) command line tool
* the [bigrquery](https://cloud.google.com/blog/products/gcp/google-cloud-platform-for-data-scientists-using-r-with-google-bigquery-part-2-storing-and-retrieving-data-frames) R package

<div class="alert alert-block alert-info">
<b>Tip:</b> See also the companion Terra Support article <a href='https://support.terra.bio/hc/en-us/articles/360051229072-Accessing-Advanced-GCP-features-in-Terra'>Accessing Advanced GCP features in Terra</a>.
</div>

## Setup

Edit the global variables in your clone of this notebook to refer to a native Google Cloud Platform project to which you have WRITE access.
* **The destination BigQuery dataset should already exist. Your pet account must have WRITE access to it.**
  [**Click for step-by-step instructions to create a BQ dataset**](https://support.terra.bio/hc/en-us/articles/360051229072#h_01EPCCS08S69VE4VMT0F0NNDWR)
* Change `DESTINATION_PROJECT_ID` and `DESTINATION_DATASET` to your own project and dataset names. The remaining cells can be run as-is.

In [None]:
library(jsonlite)
library(bigrquery)
library(lubridate)
library(tidyverse)

In [None]:
# This file loads fine via autodetect.
CSV_PATH <- 'gs://genomics-public-data/platinum-genomes/other/platinum_genomes_sample_info.csv'

# Also try this CSV which will yield some autodetect errors.
CSV_PATH_AUTODETECT_FAILS <- 'gs://genomics-public-data/1000-genomes/other/sample_info/sample_info.csv'

# Project billed for BigQuery jobs, read from the notebook environment.
BILLING_PROJECT_ID <- Sys.getenv('GOOGLE_PROJECT')

**Note that you will need to change the variables below to your own values** (expand the tips further down if you need help finding these values).

In [None]:
# CHANGE THESE VARIABLES
DESTINATION_PROJECT_ID <- 'your_GCP-native_project_ID'
DESTINATION_DATASET <- 'your_BQ_dataset'

In [None]:
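# Example values only. This cell overrides the placeholders above,
# so replace these with your own project and dataset before running it.
DESTINATION_PROJECT_ID <- 'ah-native-gcp-project-74939'
DESTINATION_DATASET <- 'BQ_dataset_autodelete_after_one_day'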

### <font color="#FF6600">(expand for tip) </font> <font color="#445555">How to find your cloud-native project-ID</font><a class="tocSkip">

When logged in with your Terra user ID, go to Billing in the GCP console at [https://console.cloud.google.com/billing](https://console.cloud.google.com/billing)
![finding project ID screen shot](https://storage.googleapis.com/terra-featured-workspaces/QuickStart/Advanced-GCP-features_Find-Project-ID_Step1_Screen%20shot.png)

1. Select the Organization you used when creating your cloud-native project
2. Find the Project ID on the right

### <font color="#FF6600">(expand for tip) </font> <font color="#445555">How to find your cloud-native BigQuery dataset</font><a class="tocSkip">

Go to [https://console.cloud.google.com/bigquery](https://console.cloud.google.com/bigquery)   

In the left column, select your cloud-native project from the drop-down. You should see your BQ dataset listed:

![Find BQ dataset screenshot](https://storage.googleapis.com/terra-featured-workspaces/QuickStart/Advanced-GCP-features_Find-BQ-dataset-name_Screen%20shot.png)

# Load data to BigQuery from a CSV

We'll do this using the `bq` command line tool and the `--autodetect` flag, which asks BigQuery to infer the table schema from the contents of the file.

In [None]:
DESTINATION_TABLE <- paste0('r_bq_autodetect_', strftime(now(), '%Y%m%d_%H%M%S'))

In [None]:
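# The trailing `2>&1` folds stderr into the captured output so that any
# load errors are shown in the notebook.
system(str_glue(str_c('bq --project_id {BILLING_PROJECT_ID} load ',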
                      '--autodetect ',
                      '{DESTINATION_PROJECT_ID}:{DESTINATION_DATASET}.{DESTINATION_TABLE} ',
                      '{CSV_PATH}  2>&1')),
      intern = TRUE)
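
Because `system(..., intern = TRUE)` returns the command's output as a character vector, a failed load can be easy to miss. The sketch below shows one possible way to surface the exit status alongside the output. `run_command` is a hypothetical helper name that is not part of this notebook; it relies on base R attaching a `status` attribute to the result when a command exits with a non-zero code. It is demonstrated with harmless shell commands rather than `bq`, and it assumes a POSIX shell is available.

```r
# Hypothetical helper: run a shell command, capture stdout and stderr together,
# and return both the output lines and the exit status.
run_command <- function(cmd) {
  # suppressWarnings() hides the warning R emits for a non-zero exit status;
  # the status itself is still available as an attribute of the result.
  output <- suppressWarnings(system(paste(cmd, '2>&1'), intern = TRUE))
  status <- attr(output, 'status')
  list(output = as.character(output),
       status = if (is.null(status)) 0L else as.integer(status))
}

ok <- run_command('echo hello')
failed <- run_command('false')
print(ok$status)      # zero when the command succeeds
print(failed$status)  # non-zero when the command fails
```

The same helper could wrap the `bq load` command string used in the cell above, followed by a check such as `stopifnot(result$status == 0)` to halt the notebook when the load fails.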

Show the table schema.

In [None]:
system(str_glue(str_c('bq --project_id {BILLING_PROJECT_ID} show ',
                      '{DESTINATION_PROJECT_ID}:{DESTINATION_DATASET}.{DESTINATION_TABLE}')),
      intern = TRUE)
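
If the schema shown is not what you expected, as may happen with `CSV_PATH_AUTODETECT_FAILS`, one option is to pass an explicit schema to `bq load` instead of `--autodetect`. The sketch below only builds the command string and does not run it. `build_bq_load_command`, the project, bucket, and table names, and the columns in `hypothetical_schema` are illustrative assumptions; they are not the real columns of either CSV. The `--source_format` and `--skip_leading_rows` flags and the inline `name:TYPE` schema argument are standard `bq load` options.

```r
# Hypothetical helper: build (but do not run) a `bq load` command that
# supplies an explicit inline schema instead of relying on --autodetect.
build_bq_load_command <- function(billing_project, destination, source_uri, schema) {
  # Turn c(col = 'TYPE', ...) into the inline form 'col:TYPE,...'.
  schema_arg <- paste0(names(schema), ':', unname(schema), collapse = ',')
  paste('bq', '--project_id', shQuote(billing_project, type = 'sh'), 'load',
        '--source_format=CSV',
        '--skip_leading_rows=1',  # skip the header row
        shQuote(destination, type = 'sh'),
        shQuote(source_uri, type = 'sh'),
        shQuote(schema_arg, type = 'sh'))
}

# Placeholder column names and types, for illustration only.
hypothetical_schema <- c(sample_id = 'STRING', age = 'INTEGER')

cmd <- build_bq_load_command('my-billing-project',
                             'my-project:my_dataset.my_table',
                             'gs://my-bucket/my_file.csv',
                             hypothetical_schema)
print(cmd)
```

Building the string separately from running it makes it easy to inspect the exact command before passing it to `system()`.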

# Load data to BigQuery from a dataframe

We'll do this using the R package `bigrquery`.

In [None]:
DESTINATION_TABLE <- paste0('r_bigrquery_', strftime(now(), '%Y%m%d_%H%M%S'))

In [None]:
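# Upload the built-in `mtcars` dataframe as a new table.
# WRITE_EMPTY makes the job fail if the table already contains data.
insert_upload_job(project = DESTINATION_PROJECT_ID,
                  dataset = DESTINATION_DATASET,
                  table = DESTINATION_TABLE,
                  values = mtcars,
                  billing = BILLING_PROJECT_ID,
                  write_disposition = 'WRITE_EMPTY')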
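
Recent bigrquery releases mark `insert_upload_job()` as deprecated in favor of `bq_table()` and `bq_table_upload()`. Assuming such a release is installed, an equivalent upload might look like the sketch below. `upload_data_frame` is a hypothetical wrapper name. Calling it requires valid credentials and WRITE access to the destination dataset, so the block only defines the function.

```r
# Hypothetical wrapper around the newer bigrquery upload API.
# Assumes a bigrquery release that provides bq_table() and bq_table_upload().
upload_data_frame <- function(df, project, dataset, table, billing) {
  # Reference to the destination table; nothing is created at this point.
  destination <- bigrquery::bq_table(project, dataset, table)
  # Extra arguments are passed through to the underlying upload job.
  bigrquery::bq_table_upload(destination,
                             values = df,
                             write_disposition = 'WRITE_EMPTY',
                             billing = billing)
}
```

With the variables defined in this notebook, a call could look like `upload_data_frame(mtcars, DESTINATION_PROJECT_ID, DESTINATION_DATASET, DESTINATION_TABLE, BILLING_PROJECT_ID)`.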

In [None]:
# Create a "connection" to the destination BigQuery dataset.
dbcon <- bigrquery::src_bigquery(project = DESTINATION_PROJECT_ID,
                                 dataset = DESTINATION_DATASET,
                                 billing = BILLING_PROJECT_ID)

# Create a 'virtual dataframe' backed by a BigQuery table.
tbl <- dplyr::tbl(dbcon, DESTINATION_TABLE)
colnames(tbl)
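
bigrquery also provides a DBI backend, which some readers may prefer over `src_bigquery()`. The sketch below assumes the DBI, dplyr, and dbplyr packages are installed alongside bigrquery. `open_table` is a hypothetical helper name. The block only defines the function, because opening a connection and reading a table require credentials.

```r
# Hypothetical helper: open a lazy dplyr table through bigrquery's DBI backend.
open_table <- function(project, dataset, table, billing) {
  con <- DBI::dbConnect(bigrquery::bigquery(),
                        project = project,
                        dataset = dataset,
                        billing = billing)
  # dplyr::tbl() returns a lazy table: rows are only downloaded on collect().
  dplyr::tbl(con, table)
}
```

A call such as `open_table(DESTINATION_PROJECT_ID, DESTINATION_DATASET, DESTINATION_TABLE, BILLING_PROJECT_ID)` could then be piped into `head()` and `collect()` to preview a few rows of the uploaded table.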

# Provenance

In [None]:
devtools::session_info()

Copyright 2018 The Broad Institute, Inc., Verily Life Sciences, LLC All rights reserved.

This software may be modified and distributed under the terms of the BSD license. See the LICENSE file for details.