Skip to content
3UTR edited this page Jun 21, 2022 · 17 revisions

Github Release python Release numpy Release scipy Release R Release

DaPars (v2.1)

Dynamics analysis of Alternative PolyAdenylation from multiple RNA-seq data

Introduction

The dynamic usage of the 3’untranslated region (3’UTR) resulting from alternative polyadenylation (APA) is emerging as a pervasive mechanism for regulating mRNA diversity, stability and translation. Though RNA-seq provides the whole-transcriptome information and a lot of tools for analyzing gene/isoform expression are available, very few tool focus on the analysis of 3’UTR from standard RNA-seq. DaPars v2 is the next generation of DaPars that directly infers the dynamic alternative polyadenylation (APA) usage by comparing standard RNA-seq from multiple samples. Given the annotated gene model, DaPars v2 can infer the de novo proximal APA sites as well as the long and short 3’UTR expression levels. Finally, the dynamic APA usages of each samples will be identified.

Input files and format

Dapars2 requires the following file formats as input:

Format Description Example
BED a BED12 file of gene models. It can be downloaded from UCSC hg38_wholeGene_annotation.bed
Bedgraph files in this format store the reads alignment information, which can be converted from BAM files by bedtools (e.g.:"bedtools genomecov -ibam *.bam -bga -split -trackline") sample_1.wig
plain text a tab-delimited two columns file contains Refseq ID mapping with gene name hg38_refseq_IDmapping.txt

Usage

Step 1: Generate region annotation by DaPars_Extract_Anno.py

1.1 Download reference gene annotation from UCSC Table browser

Gene 3'UTR regions can be extracted from reference gene annotation, any reference gene annotations available at UCSC table browser are supported by DaPars2. Follow the guide shown in the figure below to get the gene annotation and the ID mapping between RefSeq gene id and gene symbol.

NOTE: DaPars2 is not limited to human reference genome hg38, but also open to other human reference genomes and reference genomes of other species. If users choose other version of genomes, please replace the choosed species in red box 1 and assembly version in red box 2. The important thing is, the reference genome annotation used should be consistent with the one used in generating RNA-seq bam files.

image
  • choose species in box 1
  • choose reference genome version in box 2
  • select "Genes and Gene Predictions", NCBI RefSeq", and "RefSeq All" in box 3
  • choose "BED" in box 4
  • check "name" in box 5 and "name2" in box 6

1.2 Extract 3'UTR region from gene annotation

DaPars2 will use the extracted distal polyadenylation sites to infer the proximal polyadenylation sites based on the alignment wiggle files of all samples. The output in this step will be used by step 3.

Run the following commands to generate 3‘UTR region annotation:

python DaPars_Extract_Anno.py -b hg38_wholeGene_annotation.bed -s hg38_refseq_IDmapping.txt -o hg38_3UTR_annotation.bed
-h, --help                                             Show this help message and exit.
-b GENE_BED_FILE, --bed=GENE_BED_FILE                  The gene model in BED format. The BED file can be downloaded from UCSC Table Browser.
-s ID_mapping_FILE, --gene_symbol_map=ID_mapping_FILE  The ID mapping between RefSeq transcript and gene symbol.
-o OUTPUT_FILE, --out-preifx=OUTPUT_FILE               The extracted annotation region will be stored into this file.

Step 2: Generate mapped reads files for all samples

DaPars2 will use number of mapping reads for each samples to normalize library sizes. The output in this step will be used by step 3.

The format of the mapping reads file is:

wig depth
wig/sample1.wig 76485651
wig/sample2.wig 57005615
wig/sample3.wig 68960763

The first column is the wig file location and the second column is the corresponding mapped reads. The order of the samples in this file must be the same as the order of wig files in step 3.


Step 3: Run DaPars2 (DaPars2_Multi_Sample_Multi_Chr.py)

Dapars2 uses multiple threads, Please run the following commands to perform DaPar2 analysis.

python DaPars2_Multi_Sample_Multi_Chr.py Dapars2_configure_file chrList.txt

The "Dapars2_configure_file" contains most of the parameter settings for running DaPars2.

The format of a configure file is:

# Specify the reference of 3'UTR region

Annotated_3UTR=RefSeq_hg38_3UTR_annotation.bed

# A comma separated list of wig files of all samples

Aligned_Wig_files=sample1.wig,sample2.wig

Output_directory=Dapars2_test/

Output_result_file=Dapars2

# Specify Coverage threshold

Coverage_threshold=10

# Specify the number of threads to process the analysis

Num_Threads=4

# Provide sequencing depth file for normalization

sequencing_depth_file=mapping_wig_location_with_depth.txt

The "chrList.txt" contains all chromosomes need to be analyzed. It can be extracted from "hg38_3UTR_annotation.bed" generated in Step 2 using command below:

> cat hg38_3UTR_annotation.bed |cut -f1|sort|uniq > chrList.txt

If you want to analyze each chromosome separately to save time and memory, Run the following commands to perform DaPars2 analysis:

python Dapars2_Multi_Sample.py Dapars2_configure_file current_processing_chromosome

Output format

Dapars2 will output a tab-delimited table contains at least 5 columns (as shown below). The first four columns are: Gene, fit_value, Predicted_Proximal_APA, Loci. And the additional columns represent PDUI values across samples.

"Predicted_Proximal_APA" means predicted proximal Poly(A) site; "Loci" means the 3'UTR region of the transcript. image

Docker container

We have built a docker image in name of "3aqtl_pipe" which includes the source codes of a pipeline for 3'aQTL mapping. The source codes of Dapars2 are also involved in this image. The "3aqtl_pipe" has been pushed to Docker Hub.

Users can use the 3'aQTL pipeline after creating a container from "3utr/3aqtl_pipe" image.

Release history

DaPars2

  • DaPars2 v2.1 is released

Contact