Nextflow pipeline for genotyping from epigenomics data
- Nextflow (https://www.nextflow.io/)
- samtools (http://www.htslib.org/)
- bcftools (http://www.htslib.org/)
- pyfaidx (https://github.com/mdshw5/pyfaidx)
Sample BAM files are merged by individual and then used in a bcftools-based genotyping pipeline.
Genetic relatedness is calculated using plink2.
[sabramov@dev0 ~]$ nextflow run genotyping.nf -profile Altius
[sabramov@dev0 ~]$ nextflow run clustering.nf -profile Altius
Sample file [--samples_file]
A tab-delimited file containing information about each sample. The file must contain a header and the following columns (other columns are permitted and ignored):
- indiv_id: Individual identifier for each sample; many samples can refer to one individual
- bam_file: Absolute path to the BAM-formatted file
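The sample file layout described above can be checked with a short script before launching the pipeline. The following is a minimal sketch, not part of the pipeline itself: the function name `group_bams_by_individual`, the example paths, and the extra `assay` column are all hypothetical and only illustrate how several samples map to one individual.

```python
import csv
import io
from collections import defaultdict

# Columns the samples file must contain; any other column is ignored.
REQUIRED_COLUMNS = ("indiv_id", "bam_file")


def group_bams_by_individual(handle):
    """Return {indiv_id: [bam_file, ...]} from a tab-delimited samples file."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"samples file is missing columns: {missing}")
    groups = defaultdict(list)
    for row in reader:
        groups[row["indiv_id"]].append(row["bam_file"])
    return dict(groups)


# Hypothetical file content: two samples for INDIV_1, one for INDIV_2.
example = io.StringIO(
    "indiv_id\tbam_file\tassay\n"
    "INDIV_1\t/data/s1.bam\tDNase\n"
    "INDIV_1\t/data/s2.bam\tATAC\n"
    "INDIV_2\t/data/s3.bam\tDNase\n"
)
groups = group_bams_by_individual(example)
print(groups)
```

Each key in the result corresponds to one individual whose BAM files would be merged before genotyping.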
Genome reference [--genome_fasta_file]
dbSNP reference [--dbsnp_file]
Ancestral genome [--genome_ancestral_fasta_file]
ENCODE blacklisted regions [--encode_blacklist_regions]
Chunk size [--chunksize 5000000]
Specifies the size (in base pairs) to use when dividing the genome into chunks for parallel processing.
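To illustrate what the chunk size controls, here is a rough sketch of dividing chromosomes into fixed-size regions. It assumes chromosome lengths are already known (for instance from a FASTA index); the function name `chunk_genome`, the chromosome names, and the lengths are hypothetical, and the pipeline's own chunking code may differ in detail.

```python
def chunk_genome(chrom_lengths, chunksize=5_000_000):
    """Yield region strings "chrom:start-end" (1-based, inclusive coordinates)."""
    for chrom, length in chrom_lengths.items():
        for start in range(1, length + 1, chunksize):
            # The last chunk of a chromosome is truncated at its end.
            end = min(start + chunksize - 1, length)
            yield f"{chrom}:{start}-{end}"


# Hypothetical chromosome lengths, for illustration only.
regions = list(chunk_genome({"chrA": 12_000_000, "chrB": 3_000_000}))
for region in regions:
    print(region)
```

Smaller chunks give more parallel jobs with less work each; larger chunks give fewer, longer jobs.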
SNP quality [--min_SNPQ 10]
Filter out variants whose quality score is below this value.
Genotype quality [--min_GQ 50]
Set an individual's genotype to ./. (missing) when its genotype quality (FORMAT/GQ) is less than this value.
Sequencing depth [--min_DP 12]
Minimum sequencing depth per individual to call heterozygous sites.
Per-allele depth [--min_AD 4]
Minimum sequencing depth at each allele per individual to call heterozygous sites.
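The three per-individual thresholds above (min_GQ, min_DP, min_AD) can be read as a small decision rule. The sketch below is one possible interpretation, not the pipeline's actual implementation (which is bcftools-based): it assumes the depth thresholds are applied only to heterozygous calls, as the descriptions suggest, and the function name `filter_genotype` and its argument layout are hypothetical.

```python
MISSING = (None, None)  # stands in for the VCF genotype "./."


def filter_genotype(gt, gq, dp, ad, min_GQ=50, min_DP=12, min_AD=4):
    """Return gt unchanged, or MISSING if it fails the thresholds.

    gt: pair of allele indices, e.g. (0, 1); ad: per-allele read depths.
    """
    if None in gt:
        return MISSING  # already missing
    if gq < min_GQ:
        return MISSING  # low genotype quality
    if gt[0] != gt[1]:  # heterozygous call
        if dp < min_DP:
            return MISSING  # too few reads overall
        if any(ad[allele] < min_AD for allele in gt):
            return MISSING  # one of the two alleles is poorly supported
    return gt


print(filter_genotype((0, 1), gq=60, dp=20, ad=[10, 10]))  # passes all thresholds
print(filter_genotype((0, 1), gq=60, dp=20, ad=[17, 3]))   # fails min_AD
```

Under this reading, a heterozygous call needs enough total reads and enough reads on both alleles, which guards against calling a site heterozygous from a handful of stray reads.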
Hardy-Weinberg equilibrium [--hwe_cutoff 0.01]
Filter out variants that deviate from Hardy-Weinberg equilibrium (p-value threshold).
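For intuition about the Hardy-Weinberg cutoff, the sketch below computes a p-value from genotype counts with a simple chi-square test (1 degree of freedom). This is an illustrative approximation only: the pipeline's tools may use an exact test instead, and the function name `hwe_chi2_pvalue` is hypothetical.

```python
import math


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 d.f.) p-value for deviation from Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0  # monomorphic site, nothing to test
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square distribution with 1 d.f.
    return math.erfc(math.sqrt(chi2 / 2))


hwe_cutoff = 0.01
for counts in [(25, 50, 25), (50, 0, 50)]:
    pval = hwe_chi2_pvalue(*counts)
    status = "keep" if pval >= hwe_cutoff else "filter"
    print(counts, status)
```

A site with no heterozygotes despite both alleles being common, such as the second example, falls far below the cutoff and would be removed.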
Output directory [--outdir output]
Specify the output directory.
The pipeline outputs a single VCF-formatted file containing the called and filtered genotypes for each distinct individual in the samples file. Each variant is annotated with the following extra information:
- ID field: dbSNP rs number
- INFO/CAF: 1000 Genomes Project allele frequency (from dbSNP annotation file)
- INFO/TOPMED: TOPMed project allele frequency (from dbSNP annotation file)
- INFO/AA: Inferred ancestral allele from EPO/PECAN alignments (see "Input" for information about how this is obtained)
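The annotations listed above can be read from the output with any VCF parser. As a dependency-free sketch, the snippet below pulls them from a single tab-separated VCF record; the record itself (coordinates, rs number, and frequency values) is made up for illustration, and the helper name `parse_info` is hypothetical.

```python
def parse_info(info_field):
    """Parse a VCF INFO string into a dict; flag-style keys map to True."""
    info = {}
    for item in info_field.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


# Made-up record, for illustration only (not real data).
line = "chr1\t12345\trs0000\tA\tG\t50\tPASS\tCAF=0.9,0.1;TOPMED=0.88,0.12;AA=A\tGT\t0/1"
fields = line.split("\t")

rs_id = fields[2]              # ID column: dbSNP rs number
info = parse_info(fields[7])   # INFO column
annotations = {key: info.get(key) for key in ("CAF", "TOPMED", "AA")}
print(rs_id, annotations)
```

Keys that are absent from a record come back as `None`, so downstream code should not assume every variant carries all three annotations.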