# How to load data to BigQuery

Here we demonstrate two ways to load data into BigQuery from an R notebook:
* [bq](https://cloud.google.com/bigquery/docs/bq-command-line-tool) command line tool
* [bigrquery](https://cloud.google.com/blog/products/gcp/google-cloud-platform-for-data-scientists-using-r-with-google-bigquery-part-2-storing-and-retrieving-data-frames)

## Setup

In [1]:
library(jsonlite)
library(bigrquery)
library(lubridate)
library(tidyverse)


Attaching package: ‘lubridate’


The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union


── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.1     ✔ dplyr   1.0.0
✔ tidyr   1.1.0     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ lubridate::as.difftime() masks base::as.difftime()
✖ lubridate::date()        masks base::date()
✖ dplyr::filter()          masks stats::filter()
✖ purrr::flatten()         masks jsonlite::flatten()
✖ lubridate

Edit these global variables in your clone of this notebook if you do not have permission to write data to the native Google Cloud Platform project named below.
* The destination BigQuery dataset must already exist, and your pet service account must have write access to it.
* The remaining cells can be run as-is.

In [2]:
# CHANGE THESE VARIABLES, IF NEEDED
DESTINATION_PROJECT_ID <- 'terra-resources'
DESTINATION_DATASET <- 'autodelete_after_one_day'

In [3]:
# This file loads fine via autodetect.
CSV_PATH <- 'gs://genomics-public-data/platinum-genomes/other/platinum_genomes_sample_info.csv'

# Also try this CSV which will yield some autodetect errors.
CSV_PATH_AUTODETECT_FAILS <- 'gs://genomics-public-data/1000-genomes/other/sample_info/sample_info.csv'

BILLING_PROJECT_ID <- Sys.getenv('GOOGLE_PROJECT')
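
`Sys.getenv()` returns an empty string when a variable is unset, so a missing `GOOGLE_PROJECT` would only surface later as a harder-to-read `bq` error. A minimal sketch of a fail-fast check follows; the helper name `require_env` is hypothetical and not part of the original notebook.

```r
# Hypothetical helper (not part of the original notebook): read an environment
# variable and stop with a clear message if it is unset or empty.
require_env <- function(name) {
  value <- Sys.getenv(name, unset = '')
  if (!nzchar(value)) {
    stop(sprintf("Environment variable '%s' is not set.", name), call. = FALSE)
  }
  value
}

# Usage, assuming GOOGLE_PROJECT is defined by the notebook runtime:
# BILLING_PROJECT_ID <- require_env('GOOGLE_PROJECT')
```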

## Load data to BigQuery from a CSV

We'll do this using the `bq` command line tool and the `--autodetect` flag, which asks BigQuery to infer the table schema from the file contents.

In [4]:
DESTINATION_TABLE <- paste0('r_bq_autodetect_', strftime(now(), '%Y%m%d_%H%M%S'))

In [5]:
system(str_glue(str_c('bq --project_id {BILLING_PROJECT_ID} load ',
                      '--autodetect ',
                      '{DESTINATION_PROJECT_ID}:{DESTINATION_DATASET}.{DESTINATION_TABLE} ',
                      '{CSV_PATH}  2>&1')),
      intern = TRUE)
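
When autodetect guesses a column type wrongly, as it may for `CSV_PATH_AUTODETECT_FAILS`, one option is to pass an explicit schema to `bq load` instead of `--autodetect`. The sketch below only builds the command string. The helper name `bq_load_command`, the project, bucket, and table names, and the two-column schema are all placeholders for illustration, not the real layout of that file.

```r
# Hypothetical helper: assemble a `bq load` command with an inline schema.
# `schema` is a named character vector mapping column name -> BigQuery type.
bq_load_command <- function(billing_project, destination, csv_path, schema) {
  schema_arg <- paste0(names(schema), ':', schema, collapse = ',')
  paste('bq --project_id', billing_project, 'load',
        '--source_format=CSV',
        '--skip_leading_rows=1',  # skip the CSV header row
        destination,
        csv_path,
        shQuote(schema_arg, type = 'sh'))
}

# Placeholder schema; replace it with the actual columns of your CSV.
example_schema <- c(sample_id = 'STRING', read_count = 'INTEGER')
cmd <- bq_load_command('my-billing-project',
                       'my-project:my_dataset.my_table',
                       'gs://my-bucket/my_file.csv',
                       example_schema)
print(cmd)

# To run it from the notebook:
# system(paste(cmd, '2>&1'), intern = TRUE)
```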

Show the table schema.

In [6]:
system(str_glue(str_c('bq --project_id {BILLING_PROJECT_ID} show ',
                      '{DESTINATION_PROJECT_ID}:{DESTINATION_DATASET}.{DESTINATION_TABLE}')),
      intern = TRUE)
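
`bq show` prints a human-readable summary. For programmatic checks, `bq show --schema --format=prettyjson` emits the schema as JSON, which `jsonlite` (loaded in Setup) can parse into a data frame. A sketch, using a made-up two-field schema string in place of real command output:

```r
library(jsonlite)

# Made-up example of the JSON that `bq show --schema --format=prettyjson`
# returns; the field names are placeholders, not the real table schema.
schema_json <- '[
  {"name": "sample_id",  "type": "STRING",  "mode": "NULLABLE"},
  {"name": "read_count", "type": "INTEGER", "mode": "NULLABLE"}
]'

# fromJSON() turns a JSON array of objects into a data frame, one row per field.
schema_df <- fromJSON(schema_json)
print(schema_df[, c('name', 'type', 'mode')])

# In the notebook, the JSON could come from the command itself, for example:
# schema_json <- paste(
#   system('bq show --schema --format=prettyjson PROJECT:DATASET.TABLE', intern = TRUE),
#   collapse = '\n')
```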

## Load data to BigQuery from a dataframe

We'll do this using R package `bigrquery`.
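
Two details are worth handling before uploading a data frame such as `mtcars`. Its row names (the car models) are not a column, and as far as we know `bigrquery` does not upload row names. In addition, BigQuery's classic naming rules allow only letters, digits, and underscores in column names. A minimal sketch in base R follows; `prepare_for_bigquery` is a hypothetical helper, not a `bigrquery` function.

```r
# Hypothetical helper: make a data frame safer to upload to BigQuery.
prepare_for_bigquery <- function(df, rowname_col = 'row_name') {
  # Keep row names by moving them into an ordinary first column.
  out <- cbind(setNames(data.frame(rownames(df), stringsAsFactors = FALSE),
                        rowname_col),
               df)
  rownames(out) <- NULL
  # Replace characters outside [A-Za-z0-9_] with underscores.
  names(out) <- gsub('[^A-Za-z0-9_]', '_', names(out))
  out
}

cars <- prepare_for_bigquery(mtcars, rowname_col = 'model')
print(head(cars[, c('model', 'mpg', 'cyl')], 3))
```

The result can be passed to the upload call in place of `mtcars` if the model names should be kept.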

In [7]:
DESTINATION_TABLE <- paste0('r_bigrquery_', strftime(now(), '%Y%m%d_%H%M%S'))

In [8]:
# insert_upload_job() is deprecated in bigrquery; bq_table_upload() replaces it
# and waits for the upload job to finish before returning.
bq_table_upload(bq_table(DESTINATION_PROJECT_ID, DESTINATION_DATASET, DESTINATION_TABLE),
                values = mtcars,
                billing = BILLING_PROJECT_ID,
                write_disposition = 'WRITE_EMPTY')


In [10]:
# Create a "connection" to the destination BigQuery dataset.
dbcon <- bigrquery::src_bigquery(project = DESTINATION_PROJECT_ID,
                                 dataset = DESTINATION_DATASET,
                                 billing = BILLING_PROJECT_ID)

# Create a 'virtual dataframe' backed by a BigQuery table.
tbl <- dplyr::tbl(dbcon, DESTINATION_TABLE)
colnames(tbl)
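
To confirm that the upload wrote every row, one option is to compare `nrow(mtcars)` with a `COUNT(*)` on the new table. The sketch below builds the query text only. `row_count_sql` is a hypothetical helper with placeholder names, and the commented lines assume `bigrquery`'s `bq_project_query()` and `bq_table_download()` with valid credentials.

```r
# Hypothetical helper: standard SQL that counts the rows of one table.
row_count_sql <- function(project, dataset, table) {
  sprintf('SELECT COUNT(*) AS n FROM `%s.%s.%s`', project, dataset, table)
}

sql <- row_count_sql('my-project', 'my_dataset', 'my_table')
print(sql)

# With credentials available, the comparison might look like this:
# result <- bq_table_download(bq_project_query(BILLING_PROJECT_ID, sql))
# stopifnot(result$n == nrow(mtcars))
```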

## Provenance

In [11]:
devtools::session_info()

─ Session info ───────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 4.0.2 (2020-06-22)
 os       Ubuntu 18.04.4 LTS          
 system   x86_64, linux-gnu           
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       Etc/UTC                     
 date     2020-07-27                  

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version  date       lib source        
 assertthat    0.2.1    2019-03-21 [2] CRAN (R 4.0.2)
 backports     1.1.8    2020-06-17 [2] CRAN (R 4.0.2)
 base64enc     0.1-3    2015-07-28 [2] CRAN (R 4.0.2)
 bigrquery   * 1.3.1    2020-05-15 [2] CRAN (R 4.0.2)
 bit           1.1-15.2 2020-02-10 [2] CRAN (R 4.0.2)
 bit64         0.9-7    2017-05-08 [2] CRAN (R 4.0.2)
 blob          1.2.1    2020-01-20 [2] CRAN (R 4.0.2)
 broom         0.5.

Copyright 2018 The Broad Institute, Inc., Verily Life Sciences, LLC All rights reserved.

This software may be modified and distributed under the terms of the BSD license. See the LICENSE file for details.