Database & API for investigating variant allele frequencies across multiple BAM files.

CLIMB-COVID/vafdb

vafdb

Setup

Clone the repository:

$ git clone https://github.com/CLIMB-COVID/vafdb.git
$ cd vafdb/

Run the setup.sh script. This creates the vafdb conda environment, initialises the database and builds the client program:

$ ./setup.sh

To start the vafdb server, run the start.sh script:

$ ./start.sh
VAFDB started.

To stop the vafdb server, run the stop.sh script:

$ ./stop.sh
VAFDB stopped.

Once the conda environment is activated, the client program can be used:

$ conda activate vafdb
$ vafdb -h
usage: vafdb [-h] [--host HOST] [--port PORT] [-v] {command} ...

positional arguments:
  {command}
    generate     Generate VAFs from metadata.
    filter       Filter VAFs and their metadata.
    delete       Delete VAFs and their metadata.

options:
  -h, --help     show this help message and exit
  --host HOST    Host of VAFDB instance. Default: localhost
  --port PORT    Port number of VAFDB instance. Default: 8000
  -v, --version  Client version number.

Projects

Define a project:

$ python manage.py newproject example_project --references /path/to/references.fasta

The following arguments can be provided when defining a project:

positional arguments:
  code

options:
  --references REFERENCES
                        Path of FASTA file containing reference sequence(s).
  --description DESCRIPTION
                        [optional] Project description.
  --region REGION       [optional] Specific region to store. Enter in 'CHROM:START-END' format. Default:
                        All regions
  --base-quality BASE_QUALITY
                        [optional] Minimum base quality for storing a VAF. Default: 0
  --mapping-quality MAPPING_QUALITY
                        [optional] Minimum mapping quality for storing a VAF. Default: 0
  --min-coverage MIN_COVERAGE
                        [optional] Minimum coverage for storing a VAF. Default: 0
  --min-entropy MIN_ENTROPY
                        [optional] Minimum entropy for storing a VAF. Default: 0
  --min-secondary-entropy MIN_SECONDARY_ENTROPY
                        [optional] Minimum secondary entropy for storing a VAF. Default: 0
  --insertions          [optional] Store insertion VAFs. Default: False
  --diff-confidence DIFF_CONFIDENCE
                        [optional] Only store VAFs with a different base from the reference, above a
                        certain confidence. Default: None
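
The --region option expects a string in 'CHROM:START-END' format. As a minimal sketch of how such a string could be validated before it is passed to newproject, the snippet below splits it into its three parts. The names Region and parse_region are hypothetical helpers for illustration and are not part of vafdb; whether vafdb treats the coordinates as 0-based or 1-based is not stated here, so the sketch only checks the shape of the string.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """Hypothetical container for a 'CHROM:START-END' region."""
    chrom: str
    start: int
    end: int


def parse_region(text: str) -> Region:
    """Split 'CHROM:START-END' into its parts, raising ValueError if malformed."""
    # Split on the last colon so that a colon inside the name does not break parsing.
    chrom, colon, span = text.rpartition(":")
    if not colon or not chrom:
        raise ValueError(f"expected 'CHROM:START-END', got {text!r}")
    start_text, dash, end_text = span.partition("-")
    if not dash:
        raise ValueError(f"expected 'CHROM:START-END', got {text!r}")
    # int() raises ValueError itself if either side is not a whole number.
    start, end = int(start_text), int(end_text)
    if start > end:
        raise ValueError(f"START must not exceed END in {text!r}")
    return Region(chrom, start, end)


if __name__ == "__main__":
    print(parse_region("chrom1:250-300"))
```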

To delete a project:

$ python manage.py deleteproject example_project

Generate data for a project

Create a metadata file, containing paths to BAM files:

$ cat metadata.tsv
sample_id   site   bam_path           collection_date
E21294149D  site1  /path/to/file.bam  2022-10-2
5523DEB355  site6  /path/to/file.bam  2022-9-3
FE3B496871  site2  /path/to/file.bam  2022-5-8
E2A89A963D  site0  /path/to/file.bam  2022-6-5
99508919E2  site1  /path/to/file.bam  2022-10-1
...
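
If the sample list lives in another system, the metadata file can be written programmatically. The following is a minimal sketch, assuming the file is tab-separated with a header row and the four columns shown in the example above; write_metadata is a hypothetical helper, not a vafdb function.

```python
import csv

# Column order taken from the example metadata.tsv above.
COLUMNS = ["sample_id", "site", "bam_path", "collection_date"]


def write_metadata(rows, out_path):
    """Write an iterable of dicts to a tab-separated file with a header row."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    write_metadata(
        [
            {
                "sample_id": "E21294149D",
                "site": "site1",
                "bam_path": "/path/to/file.bam",
                "collection_date": "2022-10-2",
            },
        ],
        "metadata.tsv",
    )
```

csv.DictWriter raises ValueError if a row contains a key that is not in COLUMNS, which catches misspelled column names early.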

Run vafdb generate with the project name, passing the metadata file via --tsv. One response is printed per sample:

$ vafdb generate example_project --tsv metadata.tsv
<[200] OK>
{
    "project" : "example_project",
    "sample_id": "E21294149D",
    "task_id": "f13e7d1b-b4f6-40fd-890f-b100ca5b27ee"
}
<[200] OK>
{
    "project" : "example_project",
    "sample_id": "5523DEB355",
    "task_id": "ed9ba54b-e1a6-4613-b8f9-12a295544943"
}
<[200] OK>
{
    "project" : "example_project",
    "sample_id": "FE3B496871",
    "task_id": "a77bbf24-f243-498f-938b-d8c8db7ba3ab"
}
<[200] OK>
{
    "project" : "example_project",
    "sample_id": "E2A89A963D",
    "task_id": "851bf18c-53e6-4f47-90be-191f0d8aa976"
}
<[200] OK>
{
    "project" : "example_project",
    "sample_id": "99508919E2",
    "task_id": "88aa7765-847d-437c-abe1-744114577ebb"
}
...

Retrieve data from a project

Filter data via the CLI with vafdb filter, and send the results to a file:

$ vafdb filter example_project --field reference chrom1 --field position__range 250,300 > vafs.tsv
$ vafdb filter example_project --field sample_id E2A89A963D > vafs.tsv
$ vafdb filter example_project --field position 500 --field collection_date__gt 2023-01-01 > vafs.tsv
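
The saved file can then be processed with standard tooling. This is a minimal sketch that assumes vafs.tsv is tab-separated with a header row and uses the column names that appear in the query output later in this document; the output format of vafdb filter is not confirmed here, and load_vafs is a hypothetical helper.

```python
import csv

# Columns converted to numbers; all other columns are kept as strings.
NUMERIC = {"position": int, "coverage": int, "confidence": float}


def load_vafs(path):
    """Read a tab-separated VAF file into a list of dicts."""
    rows = []
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            for name, cast in NUMERIC.items():
                # Convert only columns that are present and non-empty.
                if row.get(name) not in (None, ""):
                    row[name] = cast(row[name])
            rows.append(row)
    return rows


if __name__ == "__main__":
    vafs = load_vafs("vafs.tsv")
    print(f"{len(vafs)} VAFs loaded")
```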

Execute complex filtering via the Python client API with client.query:

# script.py

from vafdb import Client, utils, F

# Initialise the client
client = Client()

# Send a query to the database

# This query asks for all VAFs on chrom1 across all samples
# where there was a C->T or T->C mutation above a confidence
# of 70% and with coverage greater than 50 reads
results = client.query(
    "example_project",
    query=(
        F(reference="chrom1")
        & ((F(ref_base="C") & F(base="T")) | (F(ref_base="T") & F(base="C")))
        & F(confidence__gt=70)
        & F(coverage__gt=50)
    ),
)

# Convert VAFs into a Pandas DataFrame
df = utils.pandafy(results)

# Print the result in tsv format
print(df.to_csv(index=False, sep="\t"), end="")

$ python script.py
sample_id   ref_base  base  reference   position  coverage  confidence  diff  num_a  num_c  num_g  num_t  num_ds  pc_a   pc_c    pc_g   pc_t    pc_ds  entropy  secondary_entropy  site   bam_path
E21294149D  C         T     chrom1      10029     361       91.69       True  1      14     1      331    14      0.277  3.878   0.277  91.69   3.878  0.226    0.677              site8  /path/to/file.bam...
E21294149D  C         T     chrom1      14408     613       86.46       True  1      74     0      530    8       0.163  12.072  0.0    86.46   1.305  0.278    0.275              site8  /path/to/file.bam...
E21294149D  C         T     chrom1      16466     199       90.452      True  1      15     1      180    2       0.503  7.538   0.503  90.452  1.005  0.239    0.529              site8  /path/to/file.bam...
E21294149D  C         T     chrom1      19220     124       99.194      True  0      1      0      123    0       0.0    0.806   0.0    99.194  0.0    0.029    0.0                site8  /path/to/file.bam...
E21294149D  C         T     chrom1      21846     516       94.961      True  2      7      6      490    11      0.388  1.357   1.163  94.961  2.132  0.163    0.904              site8  /path/to/file.bam...
E21294149D  T         C     chrom1      26767     1351      90.6        True  10     1224   5      33     79      0.74   90.6    0.37   2.443   5.848  0.251    0.702              site8  /path/to/file.bam...
E21294149D  T         C     chrom1      27638     110       91.818      True  0      101    1      4      4       0.0    91.818  0.909  3.636   3.636  0.225    0.696              site8  /path/to/file.bam...
E21294149D  C         T     chrom1      27752     145       92.414      True  2      3      2      134    4       1.379  2.069   1.379  92.414  2.759  0.23     0.968              site8  /path/to/file.bam...
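
The derived columns in this output can be reproduced from the five count columns. The sketch below is inferred from the example rows above, not taken from the vafdb source, so treat the formulas as assumptions: coverage is the sum of the counts, confidence is the percentage of the most common category, entropy is the Shannon entropy of all five categories normalised by log(5), and secondary_entropy is the same calculation over the four categories left after removing the most common one, normalised by log(4). The function names are hypothetical.

```python
import math


def normalised_entropy(counts):
    """Shannon entropy of the counts, scaled to [0, 1] by log(number of categories)."""
    total = sum(counts)
    if total == 0 or len(counts) < 2:
        return 0.0
    # Zero counts contribute nothing to the entropy, so they are skipped.
    entropy = sum((c / total) * math.log(total / c) for c in counts if c > 0)
    return entropy / math.log(len(counts))


def summarise(num_a, num_c, num_g, num_t, num_ds):
    """Recompute the derived columns for one position (assumed formulas)."""
    counts = {"A": num_a, "C": num_c, "G": num_g, "T": num_t, "DS": num_ds}
    coverage = sum(counts.values())
    if coverage == 0:
        raise ValueError("cannot summarise a position with zero coverage")
    # The most common category; ties resolve to the first one in A, C, G, T, DS order.
    top = max(counts, key=counts.get)
    rest = [n for name, n in counts.items() if name != top]
    return {
        "coverage": coverage,
        "base": top,
        "confidence": round(100 * counts[top] / coverage, 3),
        "entropy": round(normalised_entropy(list(counts.values())), 3),
        "secondary_entropy": round(normalised_entropy(rest), 3),
    }


if __name__ == "__main__":
    # Counts from the row at position 10029 in the output above.
    print(summarise(num_a=1, num_c=14, num_g=1, num_t=331, num_ds=14))
```

With the counts from positions 10029 and 19220, these formulas give the confidence, entropy and secondary_entropy values shown in the table.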

Delete data from a project

$ vafdb delete example_project E21294149D
<[200] OK>
{
    "project" : "example_project",
    "sample_id": "E21294149D",
    "deleted": true
}
