# How to save and load R objects from the workspace bucket

R users often save intermediate results in R's native format (`.rda`), which loads quickly and restores objects exactly as they were saved.

See also [Notebooks 101 - How not to lose data output files or collaborator edits](https://broadinstitute.zendesk.com/hc/en-us/articles/360027300571-Notebooks-101-How-not-to-lose-data-output-files-or-collaborator-edits).

## Setup

First, run the **`R environment setup`** notebook in this workspace.

In [1]:
library(lubridate)
library(tidyverse)


Attaching package: ‘lubridate’


The following object is masked from ‘package:base’:

    date


── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──

✔ ggplot2 3.2.1     ✔ purrr   0.3.2
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   1.0.0     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ lubridate::as.difftime() masks base::as.difftime()
✖ lubridate::date()        masks base::date()
✖ dplyr::filter()          masks stats::filter()
✖ lubridate::intersect()   masks base::intersect()
✖ dplyr::lag()

Get the Cloud Storage bucket associated with this workspace.

In [2]:
(WORKSPACE_BUCKET <- Sys.getenv('WORKSPACE_BUCKET'))
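
`Sys.getenv()` returns an empty string when a variable is not set, so a notebook running outside the workspace would quietly build paths that start with `/r-objects/`. A minimal sketch of a guard follows; `require_env` is a hypothetical helper name, not something the workspace setup provides.

```r
# Hypothetical helper: fetch an environment variable or stop with a clear message.
require_env <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable ", name, " is not set.", call. = FALSE)
  }
  value
}

# In this notebook the call would look like:
# WORKSPACE_BUCKET <- require_env("WORKSPACE_BUCKET")
```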

In [3]:
TIMESTAMP <- strftime(now(), '%Y%m%d-%H%M%S')
(RDA_FILE <- str_glue('thousand_genomes_{TIMESTAMP}.rda'))
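
The timestamp in the file name keeps each run from overwriting an earlier upload. The same kind of name can be built with base R alone. This sketch uses a hypothetical helper, `make_rda_name`, and a fixed time so the result is reproducible.

```r
# Build a timestamped file name with base R only (no lubridate or stringr).
make_rda_name <- function(prefix, when = Sys.time()) {
  stamp <- format(when, "%Y%m%d-%H%M%S")
  paste0(prefix, "_", stamp, ".rda")
}

# A fixed time in UTC, used here only to make the example deterministic.
fixed_time <- as.POSIXct("2020-01-13 09:30:00", tz = "UTC")
example_name <- make_rda_name("thousand_genomes", fixed_time)
print(example_name)  # "thousand_genomes_20200113-093000.rda"
```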

## Read some data from Cloud Storage
Let’s retrieve the sample information for [1000 Genomes](http://www.internationalgenome.org/data "1000 Genomes").

This approach uses `gsutil cat` to stream the contents of the CSV file, since we want to load the whole thing.

If you want only a subset of columns or rows, retrieve the data from the BigQuery table [bigquery-public-data.human_genome_variants.1000_genomes_sample_info](https://bigquery.cloud.google.com/table/bigquery-public-data:human_genome_variants.1000_genomes_sample_info) instead.
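
For the BigQuery route, a query for a few columns and rows might look like the sketch below. It rests on several assumptions: the `bigrquery` package is installed, the environment variable `GOOGLE_PROJECT` names a project that can be billed for the query, and the table's column names match the CSV's. `sample_info_sql` and `query_sample_info` are hypothetical helper names.

```r
# Build the SQL text. Only pass trusted values here: they are pasted in verbatim.
sample_info_sql <- function(columns, super_population) {
  paste0(
    "SELECT ", paste(columns, collapse = ", "),
    " FROM `bigquery-public-data.human_genome_variants.1000_genomes_sample_info`",
    " WHERE Super_Population = '", super_population, "'"
  )
}

# Run the query and download the result (assumes bigrquery and valid credentials).
query_sample_info <- function(sql, billing = Sys.getenv("GOOGLE_PROJECT")) {
  result_table <- bigrquery::bq_project_query(billing, sql)
  bigrquery::bq_table_download(result_table)
}

sql <- sample_info_sql(c("Sample", "Population"), "EUR")
# eur_samples <- query_sample_info(sql)  # needs credentials, so left commented out
```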

In [4]:
df <- read_csv(pipe('gsutil cat gs://genomics-public-data/1000-genomes/other/sample_info/sample_info.csv'),
               guess_max = 5000)

Parsed with column specification:
cols(
  .default = col_character(),
  In_Low_Coverage_Pilot = col_double(),
  In_High_Coverage_Pilot = col_double(),
  In_Exon_Targetted_Pilot = col_double(),
  Has_Sequence_in_Phase1 = col_double(),
  In_Phase1_Integrated_Variant_Set = col_double(),
  Has_Phase1_chrY_SNPS = col_double(),
  Has_phase1_chrY_Deletions = col_double(),
  Has_phase1_chrMT_SNPs = col_double(),
  Total_LC_Sequence = col_double(),
  LC_Non_Duplicated_Aligned_Coverage = col_double(),
  Total_Exome_Sequence = col_double(),
  X_Targets_Covered_to_20x_or_greater = col_double(),
  VerifyBam_E_Omni_Free = col_double(),
  VerifyBam_E_Affy_Free = col_double(),
  VerifyBam_E_Omni_Chip = col_double(),
  VerifyBam_E_Affy_Chip = col_double(),
  VerifyBam_LC_Omni_Free = col_double(),
  VerifyBam_LC_Affy_Free = col_dou

## Save the object(s) to a local file

In [5]:
save(df, file = RDA_FILE)
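
`save()` accepts any number of objects and stores each one under its own name. Here is a small local round trip, using made-up objects and a temporary file rather than the notebook's data:

```r
# Save two objects to one temporary .rda file, remove them, and load them back.
scores <- data.frame(id = 1:3, value = c(0.5, 1.5, 2.5))
tags <- c("a", "b", "c")

rda_path <- tempfile(fileext = ".rda")
save(scores, tags, file = rda_path)

rm(scores, tags)
restored_names <- load(rda_path)  # load() returns the names of the restored objects

print(restored_names)
print(scores)
```

If you only ever store one object per file, `saveRDS()` and `readRDS()` are an alternative: they store the object without its name, so you choose the name when reading it back.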

## Transfer the file to the workspace bucket

In [6]:
system(str_glue('gsutil cp {RDA_FILE} {WORKSPACE_BUCKET}/r-objects/ 2>&1'), intern = TRUE)
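
With `intern = TRUE`, `system()` reports a failed command only as a warning and a `status` attribute on the returned output, which is easy to miss in a notebook. A minimal sketch of a stricter wrapper follows, assuming a Unix-like shell as in this environment; `run_cmd` is a hypothetical helper name.

```r
# Hypothetical wrapper: run a shell command, capture stdout and stderr,
# and stop with the captured output if the exit status is non-zero.
run_cmd <- function(cmd) {
  out <- suppressWarnings(system(paste(cmd, "2>&1"), intern = TRUE))
  status <- attr(out, "status")
  if (!is.null(status) && status != 0) {
    stop("Command failed with status ", status, ":\n",
         paste(out, collapse = "\n"), call. = FALSE)
  }
  out
}

# A harmless command stands in for the gsutil call here.
echo_output <- run_cmd("echo transfer ok")
print(echo_output)
```

In this notebook the transfer would then read `run_cmd(str_glue('gsutil cp {RDA_FILE} {WORKSPACE_BUCKET}/r-objects/'))`.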

## Now, load that object from the native format file in Cloud Storage

In [7]:
# The object exists in memory.
head(df)

Sample,Family_ID,Population,Population_Description,Gender,Relationship,Unexpected_Parent_Child,Non_Paternity,Siblings,Grandparents,⋯,In_Final_Phase_Variant_Calling,Has_Omni_Genotypes,Has_Axiom_Genotypes,Has_Affy_6_0_Genotypes,Has_Exome_LOF_Genotypes,EBV_Coverage,DNA_Source_from_Coriell,Has_Sequence_from_Blood_in_Index,Super_Population,Super_Population_Description
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<chr>,<chr>
HG00096,HG00096,GBR,British in England and Scotland,male,,,,,,⋯,1.0,1,,,1,20.31,,,EUR,European
HG00097,HG00097,GBR,British in England and Scotland,female,,,,,,⋯,1.0,1,,,1,169.49,,,EUR,European
HG00098,HG00098,GBR,British in England and Scotland,male,,,,,,⋯,,1,,,1,,,,EUR,European
HG00099,HG00099,GBR,British in England and Scotland,female,,,,,,⋯,1.0,1,,,1,23.04,,,EUR,European
HG00100,HG00100,GBR,British in England and Scotland,female,,,,,,⋯,1.0,1,,,1,116.22,,,EUR,European
HG00101,HG00101,GBR,British in England and Scotland,male,,,,,,⋯,1.0,1,,,1,82.0,,,EUR,European


In [8]:
# Go ahead and delete it.
rm(df)

In [9]:
# Okay, it's gone: the name `df` now resolves to the stats::df function instead.
head(df)

                                              
1 function (x, df1, df2, ncp, log = FALSE)    
2 {                                           
3     if (missing(ncp))                       
4         .Call(C_df, x, df1, df2, log)       
5     else .Call(C_dnf, x, df1, df2, ncp, log)
6 }                                           
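
The output above is not an error, because `df` is also the name of the F distribution density function in the `stats` package. To confirm that your own object is gone, restrict the lookup to the global environment. A small sketch with a stand-in data frame:

```r
# Create and remove a stand-in object named df.
df <- data.frame(x = 1:3)
rm(df)

# The default lookup walks the search path and finds stats::df.
in_search_path <- exists("df")

# Restricting the lookup to the global environment ignores attached packages.
in_global <- exists("df", envir = globalenv(), inherits = FALSE)

print(c(in_search_path = in_search_path, in_global = in_global))
```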

In [10]:
load(pipe(str_glue('gsutil cat {WORKSPACE_BUCKET}/r-objects/{RDA_FILE}')))
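
`load()` restores objects under the names they were saved with and replaces any existing objects that share those names. If you are unsure what a file contains, you can load it into a scratch environment first. This sketch uses a temporary local file and made-up object names rather than the file in the bucket:

```r
# Write a small .rda file to work with.
original <- data.frame(id = 1:2)
rda_path <- tempfile(fileext = ".rda")
save(original, file = rda_path)

# Load into a scratch environment so nothing in the global environment changes.
scratch <- new.env()
loaded_names <- load(rda_path, envir = scratch)
print(loaded_names)

# Pull out one object explicitly once you know what is there.
restored <- get("original", envir = scratch)
```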

In [11]:
# The object exists in memory again!
head(df)

Sample,Family_ID,Population,Population_Description,Gender,Relationship,Unexpected_Parent_Child,Non_Paternity,Siblings,Grandparents,⋯,In_Final_Phase_Variant_Calling,Has_Omni_Genotypes,Has_Axiom_Genotypes,Has_Affy_6_0_Genotypes,Has_Exome_LOF_Genotypes,EBV_Coverage,DNA_Source_from_Coriell,Has_Sequence_from_Blood_in_Index,Super_Population,Super_Population_Description
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<chr>,<chr>
HG00096,HG00096,GBR,British in England and Scotland,male,,,,,,⋯,1.0,1,,,1,20.31,,,EUR,European
HG00097,HG00097,GBR,British in England and Scotland,female,,,,,,⋯,1.0,1,,,1,169.49,,,EUR,European
HG00098,HG00098,GBR,British in England and Scotland,male,,,,,,⋯,,1,,,1,,,,EUR,European
HG00099,HG00099,GBR,British in England and Scotland,female,,,,,,⋯,1.0,1,,,1,23.04,,,EUR,European
HG00100,HG00100,GBR,British in England and Scotland,female,,,,,,⋯,1.0,1,,,1,116.22,,,EUR,European
HG00101,HG00101,GBR,British in England and Scotland,male,,,,,,⋯,1.0,1,,,1,82.0,,,EUR,European


## Provenance

In [12]:
devtools::session_info()

─ Session info ───────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.5.2 (2018-12-20)
 os       Debian GNU/Linux 9 (stretch)
 system   x86_64, linux-gnu           
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       Etc/UTC                     
 date     2020-01-13                  

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version    date       lib source                            
 assertthat    0.2.1      2019-03-21 [2] CRAN (R 3.5.2)                    
 backports     1.1.5      2019-10-02 [1] CRAN (R 3.5.2)                    
 base64enc     0.1-3      2015-07-28 [2] CRAN (R 3.5.2)                    
 broom         0.5.2      2019-04-07 [2] CRAN (R 3.5.2)                    
 callr         3.3.1      2019-07-18 [2] CRAN (R 3.5.2)                

Copyright 2018 The Broad Institute, Inc., Verily Life Sciences, LLC All rights reserved.

This software may be modified and distributed under the terms of the BSD license. See the LICENSE file for details.