Skip to content
3UTR edited this page Jun 21, 2022 · 32 revisions

Welcome to the 3aQTL-pipe wiki!

Before starting the pipeline

Download test dataset

wget -i test_data_RNA.links.txt
wget -i test_data_genotype.links.txt
  • Refseq gene annotation file (hg19_refseq_whole_gene.bed) and refseq transcript id map to gene symbol (hg19_refseq_id_to_symbol.txt), both files can be downloaded from UCSC table browser follow the instructions shown in the figure below.
image
  • choose species in box 1
  • choose reference genome version in box 2
  • select "Genes and Gene Predictions", NCBI RefSeq", and "RefSeq All" in box 3
  • choose "BED" in box 4
  • check "name" in box 5
  • "name2" in box 6

NOTE: 3aQTL-pipe is not limited to human reference genome hg19, but also open to other human reference genomes and reference genomes of other species. If users choose other version of genomes, please replace the choosed species in red box 1 and assembly version in red box 2. The important thing is, the reference genome annotation used should be consistent with the one used in generating RNA-seq bam files.

Prepare the environment

1. Download and install python3

wget -c https://repo.anaconda.com/archive/Anaconda3-5.3.0-Linux-x86_64.sh
bash Anaconda3-5.3.0-Linux-x86_64.sh

2. Download and install other tools with conda

conda install -c r r-base
conda install -c conda-forge r-dplyr
conda install -c bioconda r-peer
conda install -c bioconda Bioconductor-impute
conda install -c bioconda matrixeqtl
conda install -c bioconda bedtools
conda install -c bioconda vcftools
conda install -c bioconda samtools
conda install -c bioconda plink=1.90

3. Install 3'aQTL-pipe

git clone https://github.com/3UTR/3aQTL-pipe.git

4. Install SuSieR

# start R
R
# use the following R function to install the latest susieR
> remote::install_github("stephenslab/susieR")

Alternatively, Users can also use our pipeline through the prebuilt docker container, see the corresponding section below.

Quick Start

Required data list before starting this pipe

Data description Example Data
RNA-seq alignment files (bam files) The 89 bam files in GEUVADIS RNA-seq project (CEU subpopulation)
Genotype data in VCF format The VCF files in GEUVADIS RNA-seq project
A text file contains all samples sample_list.txt (in the 3aQTL_pipe repository)
A reference genome annotation (BED) hg19_refseq_whole_gene.bed
ID mapping between Refseq ID and gene symbol hg19_refseq_transcript2geneSymbol.txt
A text file contains VCF file(s) vcf_list.txt
A tab-delimited text file contains known covariates, e.g. gender (optional) known_covariates.txt

1. Download 3aQTL-pipe and prepare required data files

Move/copy the data in above table to "./3aQTL-pipe/" If users want to repeat the pipeline on GEUVADIS dataset ("CEU" sub-population), you can use the "hg19_refseq_whole_gene.bed", "hg19_refseq_transcript2geneSymbol.txt" in the github repository and prepare a "sample_list.txt" and a "vcf_list.txt" for the dataset according to the example files we provided in the github repository.

Note: the bam files in "sample_list.txt" and VCF file(s) in "vcfList.txt" should including path.

For example:

sample1_id /path/to/bam/sample1_id.bam # change "/path/to/bam/" to the location where you put bam files

/path/to/vcf/testdat.vcf # change "/path/to/vcf/" to the location where you put vcf files

2. Start the pipe

2.1 change directory to 3aQTL_pipe

cd 3aQTL-pipe

2.2 prepare input data for running Dapars2

bash ./src/prepare_inputs_for_apa_quant.sh -s sample_list.txt -g hg19_refseq_whole_gene.bed -r hg19_refseq_id_to_symbol.txt 

This shell script will be run with 8 threads in parallel by default, users can change this by option "-t"

Three files will returned: refseq_3utr_annotation.bed, wigFile_and_readDepth.txt, Dapars2_running_configure.txt

2.3 run APA quantitative analysis across samples

# analyze single chromosome
python ./src/Dapars2_Multi_Sample.py Dapars2_running_configure.txt chr1

# alternatively, users can analyze all chromosomes automatically
cat refseq_3utr_annotation.bed | cut -f 1|sort|uniq |grep -v “MT” > chrList.txt
python ./src/DaPars2_Multi_Sample_Multi_Chr.py Dapars2_running_configure.txt chrList.txt

2.4 Aggregate Dapars2 results

Rscript ./src/merge_apa_quant_res_by_chr.R -c chrList.txt

This will generate a final APA quantitative profile "Dapars2_res.all_chromosomes.txt" by default. For more options please check with option "-h".

2.5 prepare inputs for 3'aQTL mapping

bash ./src/prepare_inputs_for_3aQTL_mapping.sh -c known_covariates.txt

Note: option “-c” takes a tab-delimited covariate file (default is “NA” if not available). An example can be found at https://github.com/3UTR/3aQTL-pipe.

2.6 Perform 3'aQTL mapping

Rscript ./src/run_3aQTL_mapping.R

Note: The cis 3’aQTLs “Cis_3aQTL_all_control_gene_exprs.txt” and trans 3’aQTLs “Trans_3aQTL_all_control_gene_exprs.txt” will be output in directory “Matrix_eQTL/”.

2.7 Visualize significant associationsRscript ./src/prepare_input_files_matrixeqtl.R

Rscript ./src/QTL_plot.R -s “chr7_128640188_A_G” -g “NM_001347928.2|IRF5|chr7|+”

Note: The first parameter specifies the SNP for visualization. The second one specifies the related transcript. This will generate a publication-ready plot “chr7_128640188_A_G.IRF5.pdf”

2.8 Prepare input files for fine-mapping

bash ./src/prepare_inputs_for_finemapping.sh

Note: The shell script requires “3UTR_location.txt”, genetic associations result (e.g. “Cis_3aQTL_all_control_gene_exprs.txt”) that generated in step 5 and step 6, respectively. And genotype data (“Genotype_matrix.txt”) is generated in step 5. It outputs a directory for each significant transcript with “3aQTL.vcf” and “expr.phen”.

2.9 Fine mapping analysis

bash ./src/run_fine_mapping.sh -t 8 
# option "-t" specifies the number of threads to be used, default if "1"

Note: SusieR will generate three files in each transcript directory. One plot describes the independent signals, an R binary file contains the results of susieR fine mapping, and a text file lists all independent signals with the suffix “.pdf”, “.rds”, and “.txt”, respectively.

3.0 merge fine mapping results of individual genes

Rscript ./src/merge_finemap_results.R

Docker image for 3aQTL-pipe

We have built a docker image in name of "3aqtl_pipe" for the whole pipeline includes all source codes of scripts involved in this pipeline. The docker image "3aqtl_pipe" has been pushed into Docker hub, users can pull down and run a new container from this image and use the pipeline through the created container.

Quick guide

We assumed users have docker installed in the server. If not, please contact the administrator of the server to install docker at first.

  1. Pull down the docker image "3aqtl_pipe"
docker pull 3utr/3aqtl_pipe:miniv4 # “miniv4" denotes the tag info
docker image ls # list all images to see whether "3utr/3aqtl_pipe:miniv4 was ready
  1. Run a docker container from the pulled docker image

Before creating a container, prepare all bam files and vcf files in a directory (e.g. "bam_vcf/", could be two sub directories within "bam_vcf") in local sever and mount this local directory to docker container (setted by option "-v"). Create another directory in local host (e.g. "workspace_3aqtl/") to save all results files.

docker run -it --name="3aqtl_container" -v /PATH/TO/bam_vcf:/home/bam_vcf -v /PATH/TO/workspace_3aqtl:/home/worksapce/3aqtl_pipe:miniv4 /bin/bash
# -v option creates volume for directory in local host and it will be map to the directory in container "/home/bam_vcf" (will be created if not exits).
# The purpose of creating a volume for "workspace_3aqtl" is to save the running results in the container, the data in this directory will be Synchronous between local host and the container
# the initial location of the container was set to /home
ls # list the contents in current location,you will see a "3aQTL_pipe" directory, "workspace" directory and a "bam_vcf" directory
cp -r ./3aQTL_pipe/src ./workspace/
cp ./3aQTL_pipe/known_covariates.txt ./workspace # copy the known covariates of the test dataset to workspace
cd ./workspace # change the directory to workspace
  1. Create a "sample_list.txt" and "vcf_list.txt" in "./workspace" directory

The two files contains samples and corresponding location of bam files, and location of vcf files, therefore they should be created after run a container.

**An example of "sample_list.txt":

image

**An example of "vcf_list.txt":

image

Here is an example way to do this:

#we assume the bam files and vcf files are located in the directory "bam_vcf" in the container initial location, if not you need to change 
# the location in commands below

cd workspace
for i in `ls ../bam_vcf/*.bam`;do sample=${i##*/};sample=${sample%%.*};echo -e "$sample\t$i";done > sample_list.txt
ls ../bam_vcf/*.vcf* > vcf_list.txt

After this, users can jump to "Quick Start" to go through the whole pipeline step by step.

Examples of output

1. Example of APA quantification across samples

An example of format of Dapars2 output can be found in the wiki of Dapars2

2. Example of 3'aQTL mapping by Matrix-eQTL

Matrix-eQTL will report the association statistics of tested SNP-Gene pair in tab-delimited text file as shown below:

image

3. Example of fine-mapping results

image

Authors

Xudong Zou, Ruofan Ding, Wenyan Chen, Gao Wang, Shumin Cheng, Wei Li, Lei Li

Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518055, China

Citation

  • Code and Execution:

[Ref TBD]

  • The first 3'aQTL atlas of human tissues:

An atlas of alternative polyadenylation quantitative trait loci contributing to complex trait and disease heritability

Lei Li, Kai-Lieh Huang, Yipeng Gao, Ya Cui, Gao Wang, Nathan D. Elrod, Yumei Li, Yiling Elaine Chen, Ping Ji, Fanglue Peng, William K. Russell, Eric J. Wagner & Wei Li. Nature Genetics,53,994-1005 (2021). DOI:https://doi.org/10.1038/s41588-021-00864-5

https://www.nature.com/articles/s41588-021-00864-5

Contact

For any issues, please create a GitHub Issue.