The following dependencies are required to annotate your BigQuery variant data with the Mastermind Cited Variants Reference public dataset.
To install these dependencies on a Mac, you can use Homebrew.
These are needed to run Google Cloud queries from the command line:
- Google Cloud Platform Billing Account
- Google Cloud SDK
brew tap caskroom/cask
brew cask install google-cloud-sdk
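Before running any of the commands below, it can help to confirm that the SDK tools are on your PATH. The following is a minimal sketch, not part of this repository: `missing_commands` is a hypothetical helper name, and it assumes the Google Cloud SDK installs the `gcloud` and `bq` executables.

```shell
# Hypothetical helper (not part of this repository): print every command
# from the argument list that cannot be found on PATH.
missing_commands() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
  done
}

# Prints nothing when both tools are installed.
missing_commands gcloud bq
```

If either name is printed, install or reinstall the Google Cloud SDK before continuing.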
If you already have variant data in BigQuery tables, you can skip to step 2.
1. Setting up BigQuery and importing your data
2. Annotating your data with the Mastermind CVR
3. Exporting your annotated data
Commands are based on these instructions.
Use the -h flag to print help text for any of the commands.
All arguments that specify BigQuery tables should include both the dataset and table name, separated by a period. For example, if your dataset is dataset_1 and your table is table_1, the argument would be dataset_1.table_1.
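As an illustration of the table argument format, the sketch below checks that a value has exactly the two-part dataset.table shape and prints the two parts. `split_table_arg` is a hypothetical helper, not one of this repository's commands, and rejecting values with more than one period is an assumption made here for simplicity.

```shell
# Hypothetical helper (not part of this repository): verify that an argument
# looks like dataset.table and print the dataset and table on separate lines.
# Assumption: exactly one period, with non-empty text on both sides.
split_table_arg() {
  case "$1" in
    *.*.*|.*|*.|"")
      echo "expected dataset.table, got: $1" >&2
      return 1
      ;;
    *.*)
      # Text before the period, then text after it.
      printf '%s\n%s\n' "${1%%.*}" "${1#*.}"
      ;;
    *)
      echo "expected dataset.table, got: $1" >&2
      return 1
      ;;
  esac
}

split_table_arg dataset_1.table_1
```

A value without a period, such as table_1 alone, is rejected with a message on standard error.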
- Create a new project:
./create-project [project-id] [billing-id]
- List billing ids:
./list-billing-ids
- Set active project:
./set-active-project [project-id]
- Create a dataset:
./create-dataset [dataset-name]
- Create a bucket:
./create-bucket [bucket-name]
- Upload VCF to bucket
- Local file:
./upload-to-bucket [bucket-name] [path-to-file]
- From URL:
./upload-url-to-bucket [bucket-name] [URL]
- Convert VCF to BigQuery table:
./vcf-to-bq [bucket-name] [bucket-vcf-path] [VCF-table]
- Wait for task to finish
./watch-task [task-id]
- Set active project:
./set-active-project [project-id]
- Annotate VCF BigQuery table with CVR:
./annotate-vcf [VCF-table] [output-table] [assembly-version] [reference-name-type]
Example:
./annotate-vcf my_dataset.my_table my_dataset.my_annotated_table GRCh37 chr
- [VCF-table]: The input dataset table you want to annotate. To list project datasets:
bq ls
Then to list dataset tables:
bq ls [dataset]
- [output-table]: The output dataset table you want to create with the annotated variants.
- [assembly-version]: The assembly version your variants are using, either GRCh37 or GRCh38.
- [reference-name-type]: The type of data defined in the reference_name of your input dataset table imported from your VCF data. This will be the same as in the original VCF file's #CHROM column, which is one of the following data types:
  - contig: For example, NC_000014.9
  - chr: For example, 14
  - chr_prefix: For example, chr14
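The argument rules above can be sketched as two small shell helpers. Both are hypothetical and not part of this repository. `guess_reference_name_type` guesses the type from a single #CHROM value, assuming that contig names are RefSeq-style accessions starting with NC_ (as in the example above) and that unprefixed names are a one- or two-digit number, X, Y, M, or MT. `check_annotate_args` only checks the two enumerated arguments before you call ./annotate-vcf.

```shell
# Hypothetical helper: guess the reference-name-type from one #CHROM value.
# Assumptions: contigs start with "NC_"; prefixed names start with "chr".
guess_reference_name_type() {
  case "$1" in
    chr*) echo chr_prefix ;;
    NC_*) echo contig ;;
    [0-9]|[0-9][0-9]|X|Y|M|MT) echo chr ;;
    *) echo unknown; return 1 ;;
  esac
}

# Hypothetical helper: validate the four ./annotate-vcf arguments.
# $3 is the assembly-version and $4 is the reference-name-type.
check_annotate_args() {
  [ "$#" -eq 4 ] || { echo "expected 4 arguments, got $#" >&2; return 1; }
  case "$3" in
    GRCh37|GRCh38) ;;
    *) echo "assembly-version must be GRCh37 or GRCh38, got: $3" >&2; return 1 ;;
  esac
  case "$4" in
    contig|chr|chr_prefix) ;;
    *) echo "unrecognized reference-name-type: $4" >&2; return 1 ;;
  esac
}

guess_reference_name_type chr14
check_annotate_args my_dataset.my_table my_dataset.my_annotated_table GRCh37 chr
```

To get a #CHROM value to test, you could take the first field of the first non-header line of your original VCF file, for example with `grep -v '^#' [vcf-file] | head -n 1 | cut -f 1`.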
You can also list the Genomenon Mastermind CVR public datasets that are available for annotating your data. There is a GRCh37 and a GRCh38 version for each date the CVR was released as a BigQuery public dataset:
./list-cvr-tables -v [assembly-version]
This is only needed if you want to export your annotated variant data from BigQuery to an annotated VCF file.
You will need bcftools to run this, which can be installed on Mac with Homebrew:
brew install bcftools
- Generate representative header:
bcftools merge [vcf-file] [cvr-file] --print-header -O z -o [header.vcf.gz]
- Upload header file to bucket:
./upload-to-bucket [bucket-name] [path-to-header-file]
- Convert table to VCF:
./bq-to-vcf [bucket-name] [annotated-table] [header-bucket-path]