The V2G pipeline links genetic variants to their target genes in a cell type-specific manner. The only input the user must provide is a tab-delimited table of variants for each trait, formatted as described in the Variant table section.
You can download ABC predictions for 131 cell types and tissues from here, and the corresponding accessible peaks from here. Additional customization is described in the Config section.
Create a variant list for each trait. This step can be skipped if variant.list already has the required columns specified in the Variant table section.
LD-expand Aragam and Harst variants and include fine-mapped variants from the publication.

    # LD-expand
    bash preprocessing/CAD_*/log.addRsid.0.9.sh
    # reformat
    Rscript preprocessing/CAD_*/CreateVariantList.R
Preprocess the fine-mapped UK Biobank variants.

    # reformat
    Rscript preprocessing/UKBB_*/CreateVariantList.R
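
Whichever route produces the variant list, a quick header check can catch a missing column before the pipeline runs. The sketch below is illustrative only: the column names are the example values from the Variant table section (LeadVariant, RSID, chr, position, P.value), and the helper name `missing_columns` is hypothetical. Substitute the names your own table uses.

```python
import csv
import io

# Assumption: these names come from the Example column of the Variant table
# section; a real variant list may use different column names.
REQUIRED = ["LeadVariant", "RSID", "chr", "position", "P.value"]


def missing_columns(handle, required=REQUIRED, delimiter="\t"):
    """Return the required column names that are absent from the header row."""
    header = next(csv.reader(handle, delimiter=delimiter), [])
    return [name for name in required if name not in header]


# In-memory stand-in for a tab-delimited variant list (made-up values).
example = io.StringIO(
    "LeadVariant\tRSID\tchr\tposition\tP.value\tPIP\n"
    "rs1\trs1\tchr1\t100\t1e-9\t0.8\n"
)
print(missing_columns(example))  # []
```

To check a file on disk, pass an open file handle instead of the `io.StringIO` object.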
Create the conda environment from the snakemake/envs/V2G.yaml file.

    conda env create -f V2G.yaml
Run the Snakemake pipeline after setting up the config file. Each step of the pipeline is explained in the Snakefile.

    mkdir -p logs
    snakemake --conda-frontend conda --profile sherlock \
        --configfile snakemake/config/config.yaml \
        --rerun-incomplete \
        --snakefile snakemake/workflow/Snakefile \
        --cluster "sbatch -n 1 -J {rule} -o logs/{rule}_{wildcards}.qout -e logs/{rule}_{wildcards}.e --cpus-per-task {threads} --mem {resources.mem_gb}GB --time {resources.runtime_hr}:00:00"
VariantTable: described in the **Variant table** section.
OutDir: the absolute path to the output file location.
CodeDir: the absolute path to the script directory. E.g. {path}/V2G/snakemake/workflow/scripts
hg19sizes: the absolute path to the chromosome sizes file. E.g. {path}/V2G/resources/hg19_chr_sizes
RemoveNonCoding: whether to remove non-coding genes. (TRUE/FALSE)
MungeCredibleSet:
Genes: the absolute path to the bed file containing gene annotations. E.g., {path}/V2G/resources/RefSeqCurated.170308.bed.CollapsedGeneBounds.bed
Promoters: the absolute path to the file containing promoter annotations. E.g., {path}/V2G/resources/RefSeqCurated.170308.bed.CollapsedGeneBounds.TSS500bp.bed
UbiquitouslyExpressedGenes: the absolute path to the bed file containing ubiquitously expressed genes. E.g., {path}/V2G/resources/UbiquitouslyExpressedGenes.txt
ABCPRED: the absolute path to the ABC predictions that variants are overlapped with. E.g., {path}/V2G/resources/size_sorted_CombinedPredictions.AvgHiC.ABC0.015.minus150.txt.gz
ALLPEAKS: the absolute path to the directory containing [cell type-specific peaks](https://mitra.stanford.edu/engreitz/public/SchnitzlerKang2023/EnhancerList.minus150).
CelltypeTable: the absolute path to the table specifying cell type name patterns for cell type groups. The table is used to categorize the cell types in ABCPRED and ALLPEAKS. E.g., {path}/V2G/resources/grouped_celltype_table.txt
LipidBloodAssociationTable: the absolute path to the table containing variants associated with lipid level regulation. E.g., {path}/V2G/resources/lipid.level.csv
intersectWithABC:
header: the absolute path to the file containing the header for the output file of the intersectWithABC step. E.g., {path}/V2G/resources/ABC_overlap.header
intersectWithCelltypes:
PIP: only consider variants with a PIP greater than this value. E.g., 0
groupABCByCelltypes:
PIP: same as above.
groupPeakOverlap:
PIP: same as above.
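
Most of the options above are described as absolute paths, so a relative path is an easy mistake to make. The sketch below walks a config that has already been loaded into a Python dict and reports path-like keys whose values are not absolute. It is only a sketch: the nesting of the example dict, the helper name `relative_path_entries`, and the example paths are assumptions, not the pipeline's own validation.

```python
import os

# Keys that the option list above describes as absolute paths.
PATH_KEYS = {
    "OutDir", "CodeDir", "hg19sizes", "Genes", "Promoters",
    "UbiquitouslyExpressedGenes", "ABCPRED", "ALLPEAKS",
    "CelltypeTable", "LipidBloodAssociationTable", "header",
}


def relative_path_entries(config, prefix=""):
    """List dotted names of path keys whose values are not absolute paths."""
    problems = []
    for key, value in config.items():
        name = prefix + key
        if isinstance(value, dict):
            # Recurse into nested sections such as MungeCredibleSet.
            problems.extend(relative_path_entries(value, prefix=name + "."))
        elif key in PATH_KEYS and not os.path.isabs(str(value)):
            problems.append(name)
    return problems


# Hypothetical config; the nesting and the paths are assumptions.
config = {
    "OutDir": "/home/user/V2G_out",
    "MungeCredibleSet": {
        # Relative on purpose, to show what gets reported.
        "Genes": "resources/RefSeqCurated.170308.bed.CollapsedGeneBounds.bed",
    },
    "intersectWithCelltypes": {"PIP": 0},
}
print(relative_path_entries(config))  # ['MungeCredibleSet.Genes']
```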
This table tells the Snakemake pipeline which traits to process and how to read each trait's fine-mapped variant table. Each value in the fine_mapped_table column should point to the fine-mapped variant table of that trait.
ColumnName | Definition | Example |
---|---|---|
Trait | The unique identifier of each trait. Required. | CAD_Aragam2021 |
fine_mapped_table | The absolute path to the fine-mapped variant list. Required. | variant.list.txt |
LeadVariantCol | The lead variant column in the fine-mapped variant table. Required. | LeadVariant |
VariantCol | The variant ID column in the fine-mapped variant table. Required. | RSID |
ChrCol | The chromosome column in the fine-mapped variant table. Required. | chr |
PosCol | The variant position column in the fine-mapped variant table. Required. | position |
PCol | The p-value column in the fine-mapped variant table. Required. | P.value |
PIPCol | The posterior inclusion probability (PIP) column in the fine-mapped variant table. Set to NA if the column does not exist. | PIP |
Source | The source of the data | NA |
ZeroIndexed | Whether the variant position is zero indexed (F/T). Required. | F |
ExcludeVariants | IDs of variants to exclude from the analysis, if any. Required. | NA |
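
To make the layout concrete, the sketch below writes a one-row variant table using the column names and example values from the table above. It assumes the table is tab-delimited with a header line. The helper name `variant_table_text` and the /path/to/ prefix are hypothetical.

```python
import csv
import io

# Column names as listed in the table above.
COLUMNS = [
    "Trait", "fine_mapped_table", "LeadVariantCol", "VariantCol", "ChrCol",
    "PosCol", "PCol", "PIPCol", "Source", "ZeroIndexed", "ExcludeVariants",
]


def variant_table_text(rows):
    """Render rows (dicts keyed by COLUMNS) as tab-delimited text with a header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Example values from the table above; the /path/to/ prefix is a placeholder.
row = {
    "Trait": "CAD_Aragam2021",
    "fine_mapped_table": "/path/to/variant.list.txt",
    "LeadVariantCol": "LeadVariant",
    "VariantCol": "RSID",
    "ChrCol": "chr",
    "PosCol": "position",
    "PCol": "P.value",
    "PIPCol": "PIP",
    "Source": "NA",
    "ZeroIndexed": "F",
    "ExcludeVariants": "NA",
}
print(variant_table_text([row]), end="")
```

`csv.DictWriter` raises a `ValueError` if a row contains a key that is not in `COLUMNS`, which guards against misspelled column names.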
- CAD_Aragam2021_cell2gene.txt: credible set to gene links.
- Credibleset_gene_variant_info.tsv: variants in the credible sets contributing to the credible set to gene links.
Variant information:
- \*relevant_variants.tsv: variants in enhancers or accessible regions of the \* cell type group, where \* stands for the cell type group name.
Peak information:
- PeaksOverlapFull.Peaks.tsv: all variant-overlapping peaks.
- PeaksOverlapFull.tsv: all variants in peaks.
- CredibleSetPeakOverlapSummary.tsv: whether each credible set overlaps peaks in the cell type group.
- CredibleSetPeakOverlapSummary.with\*peakinfo.tsv: \* cell type group peaks containing variants.
- CredibleSetPeakOverlapSummary.withpeakinfo.only\*.tsv: \* cell type group peaks containing variants that only overlap peaks in this cell type group.
ABC information:
- ABCOverlapFull.tsv: variant-overlapping ABC enhancers.
- ABCOverlapFull.ranked.tsv: target genes of each credible set ranked by the maximum ABC scores of each gene in all cell types.
- ABCOverlap.ranked.\*.grouped.tsv: target genes of each credible set ranked by the maximum ABC score of each gene across all cell types in the \* cell type group.
- ABCTopGeneTableWithCellCategories.tsv: cell type groups of top-ranked credible set-gene pairs (by ABC scores).
- ABCVariantOverlapSummary.tsv: a summary table of the ABC interactions of each variant.
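
ABCOverlapFull.ranked.tsv is described above as ranking the target genes of each credible set by their maximum ABC score across cell types. A minimal sketch of that ranking idea follows. It is not the pipeline's actual script: the tuple layout, credible set labels, gene names, cell type names, and scores are all made up for illustration.

```python
from collections import defaultdict


def rank_genes_by_max_abc(records):
    """Rank genes per credible set by their highest ABC score.

    records: iterable of (credible_set, gene, cell_type, abc_score) tuples.
    Returns {credible_set: [(gene, max_score), ...]} sorted high to low.
    """
    best = defaultdict(dict)
    for credible_set, gene, _cell_type, score in records:
        current = best[credible_set].get(gene)
        if current is None or score > current:
            best[credible_set][gene] = score
    return {
        credible_set: sorted(scores.items(), key=lambda item: item[1], reverse=True)
        for credible_set, scores in best.items()
    }


# Made-up records for illustration only.
records = [
    ("cs1", "GENE_A", "cell_type_1", 0.02),
    ("cs1", "GENE_A", "cell_type_2", 0.10),
    ("cs1", "GENE_B", "cell_type_1", 0.05),
    ("cs2", "GENE_C", "cell_type_2", 0.03),
]
ranked = rank_genes_by_max_abc(records)
print(ranked["cs1"])  # [('GENE_A', 0.1), ('GENE_B', 0.05)]
```

Restricting `records` to the cell types of one group before ranking would give the per-group analogue described for ABCOverlap.ranked.\*.grouped.tsv.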
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /share/software/user/open/openblas/0.2.19/lib/libopenblasp-r0.2.19.so
Random number generation:
RNG: Mersenne-Twister
Normal: Inversion
Sample: Rejection
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] GenomicRanges_1.38.0 GenomeInfoDb_1.22.1 IRanges_2.20.2
[4] S4Vectors_0.24.4 BiocGenerics_0.32.0 tibble_3.1.4
[7] stringr_1.4.0 optparse_1.6.6 dplyr_1.0.7
[10] tidyr_1.1.2
loaded via a namespace (and not attached):
[1] XVector_0.26.0 magrittr_2.0.1 zlibbioc_1.32.0
[4] tidyselect_1.1.0 getopt_1.20.3 R6_2.5.1
[7] rlang_0.4.11 fansi_0.5.0 tools_3.6.1
[10] utf8_1.2.2 DBI_1.1.0 ellipsis_0.3.2
[13] assertthat_0.2.1 lifecycle_1.0.1 crayon_1.4.1
[16] GenomeInfoDbData_1.2.2 purrr_0.3.4 bitops_1.0-7
[19] vctrs_0.3.8 RCurl_1.98-1.2 glue_1.6.2
[22] stringi_1.5.3 compiler_3.6.1 pillar_1.6.3
[25] generics_0.1.0 pkgconfig_2.0.3