DBofAFs

Pipeline for building a database of allele frequencies. It creates and updates a database of allele frequencies (AF) from VCF files. A metadata file in TSV format must be provided with the sample ID, family ID and category (e.g. disease or phenotype) of each sample in order to calculate the AF. The AF is calculated for the whole cohort, for the subcohort defined by each category, and for a set of pseudocontrols for each category, made up of all samples in the cohort except those in that category (e.g. samples with unrelated diseases).

License

DBofAFs source code is provided under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA 4.0). DBofAFs includes several third-party packages provided under other open source licenses; please check them for additional details.

Developers

Main developers

Gonzalo Núñez Moreno
Ionut-Florin Iancu
Lorena de la Fuente Lorente

Collaborators

Raquel Romero Fernández
Pablo Mínguez Paniagua

Contact

Gonzalo Núñez Moreno (gonzalo.nunezm@quironsalud.es)

Dependencies

Programming languages:

Python 3.6.7
R v3.5.0

Bioinformatic tools:

bcftools v1.3
bedtools v2.30.0
tabix (htslib) 1.9
bgzip (htslib) 1.9
GNU parallel 20210222
PLINK v1.90b6.9 64-bit (4 Mar 2019)
rename from util-linux 2.23.2. To use the rename from the Perl package instead, comment out the lines where the util-linux version is used and uncomment the ones where the Perl version is used.

Python libraries:

argparse
pandas
hail (0.2.30-2ae07d872f43)
time
gzip
ast
sys
glob
datetime

Pipeline description

This pipeline has the following steps:

First family filter and sample duplication management: known related samples (identified by the family ID in the metadata file) are discarded if any relative is already in the database. If a sample that is already in the database is introduced again, a prefix is added to its name, and the copy with less coverage is discarded during the second family filter.
Merge: all VCFs are merged into a single multi-sample VCF.
Imputation: coverage information is used to differentiate between a non-covered position and a covered, non-variant position.
Second family filter: the coefficient of relationship is calculated using PLINK to discard related or duplicated samples. If samples are related, the sample with more coverage is kept.
AF calculation: the Python package Hail is used to calculate the general, subcohort-specific (category) and pseudocontrol (all samples except the selected category) AF.
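
The three kinds of AF described in the last step can be illustrated with a small, plain-Python sketch. This is not the Hail code used by the pipeline. It assumes diploid genotypes encoded as alternate allele counts (0, 1 or 2), and it assumes that non-covered positions are treated as missing and left out of the denominator. The sample IDs and the category names `retina` and `cardio` are hypothetical.

```python
def allele_frequency(genotypes):
    """AF = alternate allele count / number of called alleles (None if nothing is called)."""
    called = [gt for gt in genotypes if gt is not None]
    if not called:
        return None
    return sum(called) / (2 * len(called))  # assumption: diploid samples


def cohort_frequencies(genotypes, categories):
    """Compute AF for the whole cohort, each category and each pseudocontrol set.

    genotypes:  sample ID -> alternate allele count (0, 1, 2), or None if not covered
    categories: sample ID -> category
    """
    result = {"all": allele_frequency(genotypes.values())}
    for category in sorted(set(categories.values())):
        inside = [gt for s, gt in genotypes.items() if categories[s] == category]
        outside = [gt for s, gt in genotypes.items() if categories[s] != category]
        result[category] = allele_frequency(inside)
        # Pseudocontrols: every sample in the cohort except the selected category.
        result["pseudocontrol_" + category] = allele_frequency(outside)
    return result


# Hypothetical genotypes for one variant in four samples.
example_genotypes = {"S1": 1, "S2": 0, "S3": 2, "S4": None}
example_categories = {"S1": "retina", "S2": "retina", "S3": "cardio", "S4": "cardio"}
example_af = cohort_frequencies(example_genotypes, example_categories)
```

In this example the non-covered sample `S4` contributes nothing to any denominator, which is why the imputation step matters: a covered, non-variant position would instead count as two reference alleles.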

Requirements

Metadata

A TSV file must be provided with information on all samples. This file must have a header with the following column names: SAMPLE, FAMILY and CATEGORY. These column names must be in capital letters. The category can be a disease, phenotype or any feature that can be used to define subcohorts.

Directory structure

The database directory must contain the following directories, created before running the pipeline. Important: the merged VCF directory must be named merged_vcf, not merged_vcfs (this originally read merged_vcfs, which is wrong).

└─coverage
│   discarded_bed
│   incorporated_bed   
│   new_bed   
│
└─db
└─imputed_vcf
└─individual_vcf
│   discarded_vcf
│   incorporated_vcf   
│   new_vcf   
│
└─merged_vcf
└─metadata
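
The layout above could be created with a short helper such as the following sketch. The function name `create_layout` and the idea of scripting this step are assumptions for illustration; the pipeline itself only requires that the directories exist.

```python
import os

# Directories taken from the tree above, relative to the database directory.
LAYOUT = [
    "coverage/discarded_bed",
    "coverage/incorporated_bed",
    "coverage/new_bed",
    "db",
    "imputed_vcf",
    "individual_vcf/discarded_vcf",
    "individual_vcf/incorporated_vcf",
    "individual_vcf/new_vcf",
    "merged_vcf",  # not merged_vcfs
    "metadata",
]


def create_layout(base_dir):
    """Create every directory of the layout under base_dir and return their paths."""
    created = []
    for relative in LAYOUT:
        path = os.path.join(base_dir, *relative.split("/"))
        os.makedirs(path, exist_ok=True)  # existing directories are left untouched
        created.append(path)
    return created
```

Calling `create_layout("/path/to/database")` (a placeholder path) is safe to repeat, since `exist_ok=True` leaves existing directories and their contents alone.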

Input VCF files must be copied into the individual_vcf/new_vcf/ directory, and the BED files with coverage information for each sample into coverage/new_bed/. These coverage files can be created with mosdepth: mosdepth --quantize 10: -n -x ${output_prefix} ${bamfile} produces a BED file with the regions covered at a depth of 10 reads or more. Sample names are retrieved from the header of each VCF file. BED files should follow the naming pattern [SAMPLEID]_*.bed or [SAMPLEID].*.bed.
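
Before a run, the input requirements described in this section (metadata header and BED file naming) can be checked with a sketch like the one below. It is not part of the pipeline. The function names `read_metadata` and `find_bed` are hypothetical, and the matching rule is one possible reading of the naming pattern given above.

```python
import csv
import fnmatch
import glob

REQUIRED_COLUMNS = ("SAMPLE", "FAMILY", "CATEGORY")


def read_metadata(path):
    """Return the metadata rows, raising ValueError if a required column is missing."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        header = reader.fieldnames or []
        # Column names are compared as-is, so lowercase headers are rejected.
        missing = [column for column in REQUIRED_COLUMNS if column not in header]
        if missing:
            raise ValueError("missing metadata columns: " + ", ".join(missing))
        return list(reader)


def find_bed(sample_id, bed_names):
    """Return the BED file names matching [SAMPLEID]_*.bed or [SAMPLEID].*.bed."""
    prefix = glob.escape(sample_id)  # keep wildcard characters in the ID literal
    patterns = (prefix + "_*.bed", prefix + ".*.bed")
    return sorted(
        name for name in bed_names
        if any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns)
    )
```

With this rule a sample named `S1` matches `S1_cov.bed` and `S1.quantized.bed`, but not `S10_cov.bed` or `S1.bed`. A sample with no match, or with more than one, is worth reviewing before starting the pipeline.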
