# Introduction

TileDB-VCF offers several interfaces to create and query your genomic variant data. Here we'll utilize the command-line interface (CLI) to walk through the process of loading raw VCF data into TileDB and then performing a few different types of exports.

***Hint: You can launch an interactive version of this tutorial using [TileDB Cloud Notebooks](https://console.tiledb.com/notebooks).***


# Create a Dataset

The process of creating a new VCF dataset involves 3 phases: `create`, `register`, and `store`. Detailed information about each phase is provided in the [ingestion algorithm](https://docs.tiledb.com/genomics/advanced/ingestion-algorithm) section.


We'll start with a small example using 3 synthetic VCF files available in `data/vcfs`:

In [2]:
tree data

data
├── s3-bcf-samples.txt
└── vcfs
    ├── G1.vcf.gz
    ├── G1.vcf.gz.csi
    ├── G2.vcf.gz
    ├── G2.vcf.gz.csi
    ├── G3.vcf.gz
    └── G3.vcf.gz.csi

1 directory, 7 files


Index files are required for ingestion. If your VCF/BCF files have not been indexed, you can use [`bcftools`](https://samtools.github.io/bcftools/bcftools.html) to do so:

```shell
for f in data/vcfs/*.vcf.gz; do bcftools index -c "$f"; done
```
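
When only some files lack an index, re-indexing everything is unnecessary. The following is a minimal sketch that skips files that already have one. It assumes the index sits next to the data file with a `.csi` or `.tbi` suffix; `has_index` is a hypothetical helper name, not part of `bcftools` or TileDB-VCF.

```shell
# Hypothetical helper: succeed if a CSI or TBI index sits next to the file.
has_index() {
  [ -e "$1.csi" ] || [ -e "$1.tbi" ]
}

for f in data/vcfs/*.vcf.gz; do
  [ -e "$f" ] || continue            # the glob matched nothing
  if has_index "$f"; then
    echo "already indexed: $f"
  elif command -v bcftools >/dev/null 2>&1; then
    bcftools index -c "$f"           # -c writes a CSI index
  else
    echo "bcftools not found; cannot index $f" >&2
  fi
done
```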

Creating an empty dataset is the first step to ingesting your VCF data. Let's save the dataset in a new directory called `small_dataset`:

In [3]:
tiledbvcf create --uri small_dataset

The next step is to **register** the samples that will be ingested, either by providing a text file that contains the location of each VCF or by passing a list of VCF files directly to the command.

In [4]:
tiledbvcf register --uri small_dataset data/vcfs/G*.vcf.gz
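
As an alternative to the command above, the text-file route could look roughly like this. The sketch assumes the file holds one VCF location per line and is passed with `--samples-file`, the flag this tutorial uses for the remote samples. The helper `write_samples_file` and the file name `local-samples.txt` are hypothetical.

```shell
# Hypothetical helper: list every .vcf.gz in a directory, one path per line.
write_samples_file() {
  for f in "$1"/*.vcf.gz; do
    [ -e "$f" ] && printf '%s\n' "$f"
  done > "$2"
  return 0
}

write_samples_file data/vcfs local-samples.txt

# Register from the list instead of naming each file on the command line.
if command -v tiledbvcf >/dev/null 2>&1; then
  tiledbvcf register --uri small_dataset --samples-file local-samples.txt
fi
```

A samples file is mostly useful when the list is too long for the command line or is generated by another tool.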

You can always determine which samples have been registered with a dataset using the `list` command:

In [5]:
tiledbvcf list --uri small_dataset

G1
G2
G3


The final step is to ingest the VCF files with the `store` command.

In [6]:
tiledbvcf store --uri small_dataset data/vcfs/G*.vcf.gz

That's it! Let's verify everything went okay using the `stat` command, which provides high-level statistics about our dataset, including the number of samples it contains and the variant attributes it includes.

In [7]:
tiledbvcf stat --uri small_dataset

Statistics for dataset 'small_dataset':
- Version: 2
- Row tile extent: 10
- Tile capacity: 10,000
- Anchor gap: 1,000
- Number of registered samples: 3
- Extracted attributes: none


At this point you have successfully created and populated a TileDB-VCF dataset. By default, *all* metadata recorded in the VCF data lines are ingested, but you can override this behavior with the `--attributes` flag, which allows you to specify the subset of fields you want to include.
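
A rough sketch of how `--attributes` might be used follows. It rests on two assumptions that should be confirmed with `tiledbvcf create --help`: that the flag is given to the `create` command as a comma-delimited list, and that fields are named with `fmt_` and `info_` prefixes. The helper `join_csv` and the dataset name `subset_dataset` are hypothetical.

```shell
# Hypothetical helper: join all arguments with commas.
join_csv() (
  IFS=,
  printf '%s\n' "$*"
)

# Assumed naming scheme: fmt_<FORMAT field>, info_<INFO field>.
attrs=$(join_csv fmt_GT fmt_DP info_END)
echo "$attrs"

# Assumption: --attributes is accepted by `create`, not by `store`.
if command -v tiledbvcf >/dev/null 2>&1; then
  tiledbvcf create --uri subset_dataset --attributes "$attrs"
fi
```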



# Incremental Updates

One of the key advantages of using TileDB as the data store for your variant data is the ability to efficiently add new samples as they become available. Furthermore, TileDB's native cloud features make it possible to ingest samples directly from remote locations. Here, we'll register and ingest the following additional samples, which are located on [AWS S3](https://aws.amazon.com/s3/):

In [8]:
cat data/s3-bcf-samples.txt

s3://tiledb-inc-demo-data/examples/notebooks/vcfs/G4.bcf
s3://tiledb-inc-demo-data/examples/notebooks/vcfs/G5.bcf
s3://tiledb-inc-demo-data/examples/notebooks/vcfs/G6.bcf
s3://tiledb-inc-demo-data/examples/notebooks/vcfs/G7.bcf
s3://tiledb-inc-demo-data/examples/notebooks/vcfs/G8.bcf
s3://tiledb-inc-demo-data/examples/notebooks/vcfs/G9.bcf
s3://tiledb-inc-demo-data/examples/notebooks/vcfs/G10.bcf


*Note: Samples in this second batch are stored as `BCF` files which are also supported by TileDB-VCF.*

This process is identical to the steps performed above. The only changes are setting `--scratch-mb` to allocate temporary space for downloading the files and providing the URLs of the remote files. In this case, we'll simply pass the `s3-bcf-samples.txt` file, which lists the BCF files we want to register and ingest.

In [9]:
tiledbvcf register \
  --uri small_dataset \
  --scratch-mb 1 \
  --samples-file data/s3-bcf-samples.txt

*Note: When ingesting samples from S3, you must configure enough disk scratch space to hold at least 20 samples (in general, 2 &times; `row_tile_extent` samples).*
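
The sizing rule in the note reduces to simple arithmetic: scratch space should cover 2 &times; `row_tile_extent` sample files. The sketch below works that out in megabytes. The helper name `scratch_mb_needed` is hypothetical, and the 5 MB per-sample size is a made-up figure for illustration, not a measurement of the demo files.

```shell
# Hypothetical helper: scratch space in MB = 2 x row_tile_extent x per-sample size in MB.
scratch_mb_needed() {
  echo $(( 2 * $1 * $2 ))
}

# Row tile extent of 10 (as reported by `stat`) and an assumed 5 MB per sample file.
scratch_mb_needed 10 5
```

With these inputs the result is 100, which is the value you would then pass to `--scratch-mb`.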

You can add the `--verbose` flag to print out more information during the `store` phase.

In [10]:
tiledbvcf store \
  --uri small_dataset \
  --scratch-mb 10 \
  --samples-file data/s3-bcf-samples.txt \
  --verbose

Initialization completed in 21.3871 sec.
Finished ingesting 0 / 7 samples (1.17e-05 sec)...
Done. Ingested 1,391 records (+ 69,548 anchors) from 7 samples in 30.4475 seconds.


Again, let's run the `stat` command to verify our dataset now includes 10 samples.

In [11]:
tiledbvcf stat --uri small_dataset

Statistics for dataset 'small_dataset':
- Version: 2
- Row tile extent: 10
- Tile capacity: 10,000
- Anchor gap: 1,000
- Number of registered samples: 10
- Extracted attributes: none
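
In a script, the same check can be automated by parsing the `stat` output. This is a minimal sketch that assumes the output keeps the `- Number of registered samples: N` line format shown above; `registered_samples` is a hypothetical helper name.

```shell
# Hypothetical helper: read `tiledbvcf stat` output on stdin and print the sample count.
registered_samples() {
  sed -n 's/^- Number of registered samples: //p' | tr -d ','   # drop thousands separators
}

expected=10
if command -v tiledbvcf >/dev/null 2>&1; then
  actual=$(tiledbvcf stat --uri small_dataset | registered_samples)
  if [ "$actual" = "$expected" ]; then
    echo "OK: $actual samples registered"
  else
    echo "Expected $expected samples, found $actual" >&2
  fi
fi
```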


Here, you've updated an existing dataset, which initially contained only *local VCF files*, by adding new *BCF files* located on a *remote* server. Because TileDB is designed to be easily updatable, this process happens efficiently and without ever touching the previously ingested data, which avoids computationally expensive operations like regenerating combined VCF files.

# Querying and Exporting

Now that you've created a TileDB-VCF dataset, you can begin to query the variant data for the 10 ingested samples. As mentioned earlier, several APIs are available for accessing the data. Notebook 2 provides examples using the Python module. Here, we'll continue working with the CLI, focusing now on the `export` command, which can produce 3 different types of outputs for any given query:

1. VCF (or BCF), producing one output file per exported sample
2. Tabular, producing a single tab-separated value (TSV) text file containing all intersecting records across the exported samples
3. Count-only, printing the total number of intersecting records (no output file is produced)

The following examples demonstrate basic usage of the CLI for export; see `tiledbvcf export --help` or the [online documentation](https://docs.tiledb.com/genomics/apis/cli) for more information.

## Export VCF/BCFs

In this example we will export several genomic regions from `small_dataset` created above:

In [12]:
mkdir -p data/exported-subsets

tiledbvcf export \
  --uri small_dataset \
  --regions 1:1-50000,2:1-50000,3:1-50000 \
  --sample-names G1,G2,G3 \
  --output-dir data/exported-subsets \
  --verbose

Initialized TileDB query with 3 column ranges.
Processed 315 cells in 6.62e-05 sec. Reported 13 cells.
Done. Exported 13 records in 15.2308 seconds.


This produced the three `bcf` files shown below, each of which contains the records intersecting the specified regions for the corresponding sample.

In [13]:
tree data/exported-subsets

data/exported-subsets
├── G1.bcf
├── G2.bcf
└── G3.bcf

0 directories, 3 files


You can use `bcftools` to examine any of the exported `bcf` files:

In [14]:
bcftools view --no-header data/exported-subsets/G1.bcf

1	13350	.	A	<NON_REF>	.	.	END=36258	GT:DP:GQ:MIN_DP:PL	1/0:50:3:43:44,29,99
1	42091	.	A	<NON_REF>	.	.	END=101445	GT:DP:GQ:MIN_DP:PL	0/0:8:91:60:35,62,92
2	11625	.	T	<NON_REF>	.	.	END=106375	GT:DP:GQ:MIN_DP:PL	0/0:27:72:76:70,30,83
3	14580	.	T	<NON_REF>	.	.	END=86190	GT:DP:GQ:MIN_DP:PL	0/1:50:78:41:67,11,43


If a sample does not contain any intersecting records, an output file is still created, but it will contain 0 records.
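
Because empty exports still produce a file, it can be useful to flag them. The sketch below counts data lines per exported file using `bcftools view --no-header`, as in the cell above; `count_records` is a hypothetical helper name.

```shell
# Hypothetical helper: count VCF data lines on stdin, ignoring '#' header lines.
count_records() {
  awk '!/^#/ { n++ } END { print n + 0 }'
}

for f in data/exported-subsets/*.bcf; do
  [ -e "$f" ] || continue                        # the glob matched nothing
  command -v bcftools >/dev/null 2>&1 || break   # nothing to do without bcftools
  n=$(bcftools view --no-header "$f" | count_records)
  if [ "$n" -eq 0 ]; then
    echo "no intersecting records: $f"
  fi
done
```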

To export *all* records, simply omit the `--regions` argument. For example, to recover the original BCFs for two samples:

In [15]:
mkdir -p data/exported-full

tiledbvcf export \
  --uri small_dataset \
  --sample-names G4,G5 \
  --output-dir data/exported-full \
  --verbose

Initialized TileDB query with 1 column ranges.
Processed 20758 cells in 0.0010847 sec. Reported 412 cells.
Done. Exported 412 records in 2.61805 seconds.


This will output two BCFs for the corresponding samples that include the complete set of information from the original files. Note that while these exports are *lossless* in terms of the actual stored data, they may not be *identical* to the original files. For example, exported BCF/VCFs will always include an `END` position field even if one was not present in the original files, and the ordering of fields within the `INFO` and `FORMAT` columns may also differ.
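
One way to check that an export kept the records despite such cosmetic differences is to compare a reduced projection of both files. The sketch below uses `bcftools query` to print only the site columns and genotypes, then compares the two listings. It assumes a local copy of the original file at the hypothetical path `G4.bcf`; the helper names are also hypothetical.

```shell
# Hypothetical helper: print site and genotype columns only, so that header
# differences and INFO/FORMAT field ordering do not affect the comparison.
core_fields() {
  bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n' "$1"
}

# Hypothetical helper: succeed when two files have identical contents.
same_contents() {
  cmp -s "$1" "$2"
}

orig=G4.bcf                          # assumed local copy of the original file
exported=data/exported-full/G4.bcf   # file written by the export above

if command -v bcftools >/dev/null 2>&1 && [ -e "$orig" ] && [ -e "$exported" ]; then
  core_fields "$orig" > orig.tsv
  core_fields "$exported" > exported.tsv
  if same_contents orig.tsv exported.tsv; then
    echo "records match"
  else
    echo "records differ" >&2
  fi
fi
```

Comparing a projection rather than the raw files is a deliberate choice: a byte-for-byte comparison would report differences that do not reflect any loss of data.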

## Export to TSV

Next we'll look at exporting data from `small_dataset` in *tabular* format, which makes it convenient to load into other tools for downstream analyses. The command arguments are largely the same as for exporting to VCF/BCF; you just need to change the output type to `t` (short for TSV) and specify which VCF fields should be included as columns in the output table. The standard fields include:

- `SAMPLE` (always included)
- `ID`
- `REF`
- `ALT`
- `QUAL`
- `CHR`
- `POS`
- `FILTER`

Fields within the `FORMAT` and `INFO` columns are accessed using special prefixes, `S:` and `I:`, respectively. This notation is used below to include genotype information from the `FORMAT:GT` field:

In [3]:
tiledbvcf export \
    --uri small_dataset \
    -Ot --tsv-fields CHR,POS,I:END,REF,S:GT \
    -s G1,G2,G3,G4 \
    --regions 1:1-50000

SAMPLE	CHR	POS	I:END	REF	S:GT
G4	1	10485	40554	G	0,1
G3	1	11758	92302	T	1,1
G2	1	12410	50341	A	1,1
G1	1	13350	36258	A	1,0
G1	1	42091	101445	A	0,0
G4	1	46864	86622	G	1,0


The results are output as a long-form table in which each row represents a single record from one sample that overlaps one of the specified regions.
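
A long-form table is easy to summarize with standard text tools. As a small sketch, the following counts rows per sample; it assumes a header row with `SAMPLE` in the first column, as in the output above. `records_per_sample` is a hypothetical helper name.

```shell
# Hypothetical helper: count rows per SAMPLE in a TSV read from stdin.
# Assumes a header row and SAMPLE in column 1.
records_per_sample() {
  awk -F '\t' 'NR > 1 { n[$1]++ } END { for (s in n) print s "\t" n[s] }' | sort
}

if command -v tiledbvcf >/dev/null 2>&1; then
  tiledbvcf export \
    --uri small_dataset \
    -Ot --tsv-fields CHR,POS,I:END,REF,S:GT \
    -s G1,G2,G3,G4 \
    --regions 1:1-50000 | records_per_sample
fi
```

For the six rows shown above, this would report two records each for G1 and G4 and one each for G2 and G3.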

There's one more special prefix worth pointing out: `Q:` (for *query*), which allows you to include columns containing the start (`Q:POS`) and end (`Q:END`) positions for the query region in which the variant is located. Here we perform the same query across a few more regions, add the query columns, and save the output to `data/exported-regions.tsv`.

In [6]:
tiledbvcf export \
    --uri small_dataset \
    -Ot --tsv-fields Q:POS,Q:END,CHR,POS,I:END,REF,S:GT \
    -s G1,G2,G3,G4 \
    --regions 1:1-50000,2:60000-100000,3:110000-160000 \
    --output-path data/exported-regions.tsv

cat data/exported-regions.tsv

SAMPLE	Q:POS	Q:END	CHR	POS	I:END	REF	S:GT
G4	1	50000	1	10485	40554	G	0,1
G3	1	50000	1	11758	92302	T	1,1
G2	1	50000	1	12410	50341	A	1,1
G1	1	50000	1	13350	36258	A	1,0
G1	1	50000	1	42091	101445	A	0,0
G4	1	50000	1	46864	86622	G	1,0
G3	60000	100000	2	11431	93855	A	1,0
G1	60000	100000	2	11625	106375	T	0,0
G2	60000	100000	2	14907	92383	T	1,1
G4	60000	100000	2	62758	132118	A	1,1
G2	60000	100000	2	92670	145136	C	0,1
G3	60000	100000	2	93967	149850	C	1,0
G2	110000	160000	3	78418	148323	G	0,0
G4	110000	160000	3	91891	169481	A	1,0
G3	110000	160000	3	111926	162835	A	0,0
G1	110000	160000	3	112593	174752	T	1,0
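
The query columns make it possible to see how much of each record actually falls inside its region. For instance, the G3 record on chromosome 2 starts at 11431, well before the query start of 60000, and is reported because its `END` extends into the region. The sketch below appends the overlap length to each row. It assumes the column order used above and treats all coordinates as 1-based and inclusive; `add_overlap` is a hypothetical helper name.

```shell
# Hypothetical helper: append the number of bases each record shares with its
# query region. Assumes the columns SAMPLE, Q:POS, Q:END, CHR, POS, I:END, ...
add_overlap() {
  awk -F '\t' -v OFS='\t' '
    NR == 1 { print $0, "OVERLAP"; next }
    {
      start = ($5 > $2) ? $5 : $2    # later of record start and query start
      end   = ($6 < $3) ? $6 : $3    # earlier of record end and query end
      print $0, end - start + 1
    }'
}

if [ -e data/exported-regions.tsv ]; then
  add_overlap < data/exported-regions.tsv
fi
```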


# Session

In [59]:
echo "Updated: $(date)"

Updated: Fri Jun  5 19:41:05 UTC 2020


In [61]:
tiledbvcf --version

TileDB-VCF build 984aeda-modified
TileDB version 1.7.7
