Gene Annotation

RNA sequencing is currently the method of choice for genome-wide profiling of gene expression. A popular approach to quantifying gene expression from RNA-seq data is to map reads to a reference genome and then count the reads mapped to each gene.
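To make the "map, then count" idea concrete, here is a minimal sketch of assigning already-mapped reads to genes by exon overlap. It is an illustration only, not the method used in Analysis_pipeline.Rmd: the function name, the toy coordinates, and the rule of discarding reads that overlap more than one gene are all assumptions chosen for clarity. Dedicated read-counting tools additionally handle strandedness, paired ends, and multi-mapping reads.

```python
from collections import defaultdict


def count_reads_per_gene(reads, exons):
    """Count reads that overlap exons of exactly one gene.

    reads: iterable of (chrom, start, end), 1-based inclusive coordinates.
    exons: iterable of (gene_id, chrom, start, end), same convention.
    Reads overlapping no gene, or more than one gene, are not counted.
    """
    exons_by_chrom = defaultdict(list)
    for gene_id, chrom, start, end in exons:
        exons_by_chrom[chrom].append((start, end, gene_id))

    counts = defaultdict(int)
    for chrom, r_start, r_end in reads:
        # Two closed intervals overlap when each starts before the other ends.
        hit_genes = {
            gene_id
            for e_start, e_end, gene_id in exons_by_chrom.get(chrom, [])
            if r_start <= e_end and e_start <= r_end
        }
        if len(hit_genes) == 1:
            counts[hit_genes.pop()] += 1
    return dict(counts)


# Hypothetical annotation: geneA has two exons, geneB overlaps geneA's second exon.
toy_exons = [
    ("geneA", "chr1", 100, 200),
    ("geneA", "chr1", 300, 400),
    ("geneB", "chr1", 350, 500),
]
# Hypothetical mapped reads.
toy_reads = [
    ("chr1", 150, 249),  # overlaps geneA only
    ("chr1", 310, 340),  # overlaps geneA only
    ("chr1", 360, 459),  # overlaps geneA and geneB: ambiguous, skipped
    ("chr1", 480, 579),  # overlaps geneB only
    ("chr2", 100, 199),  # no annotated gene on this chromosome
]
toy_counts = count_reads_per_gene(toy_reads, toy_exons)
```

Because the exon coordinates come directly from the annotation, changing the annotation source changes which reads are counted for which gene, which is why the choice of annotation matters for quantification.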

Gene annotation data, which include the chromosomal coordinates of exons for tens of thousands of genes, are required for this quantification process. For the human and mouse genomes, several major sources of gene annotation can be used for quantification, such as the Ensembl, GENCODE, UCSC, and RefSeq databases. However, the effect that the choice of annotation has on gene expression quantification in an RNA-seq pipeline is poorly understood. In this analysis, we compare the Ensembl and RefSeq human annotations in terms of their impact on gene expression quantification, using benchmark RNA-seq data generated by the SEquencing Quality Control (SEQC/MAQC III) consortium. We show that using the RefSeq gene annotation led to better quantification accuracy, as measured by correlation with ground truth such as expression data for >800 genes validated by real-time PCR.

Data

The raw data used in this analysis can be downloaded from here. The data should be downloaded to the root directory of the analysis pipeline.

The data consist of the following:

RNA-seq data
The RNA-seq data consist of four reference RNA samples that have been well characterised by the SEQC/MAQC Consortium. The first two are A (Universal Human Reference RNA, UHRR) and B (Human Brain Reference RNA, HBRR). Samples C and D are mixtures of A and B in ratios of 3:1 and 1:3, respectively. Each sample has four replicates, and each replicate was sequenced as a paired-end library with a read length of 100 bp.
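The known mixing ratios give samples C and D a built-in expectation: under the simplifying assumption that expression is measured on a linear scale and that the mixture is proportional to the amount of RNA from each source, a gene's expected level in a mixture is the weighted average of its levels in A and B. The sketch below illustrates that arithmetic. The function name and the expression values are hypothetical, and the sketch ignores any difference in mRNA content between the two source samples.

```python
def expected_mixture(expr_a, expr_b, parts_a, parts_b):
    """Expected linear-scale expression in a mixture of A and B.

    parts_a and parts_b are the mixing proportions, e.g. 3 and 1 for a 3:1 mix.
    """
    total = parts_a + parts_b
    return (parts_a * expr_a + parts_b * expr_b) / total


# Hypothetical expression of one gene in samples A and B (linear scale).
expr_a, expr_b = 80.0, 40.0

expected_c = expected_mixture(expr_a, expr_b, 3, 1)  # sample C is 3:1 (A:B)
expected_d = expected_mixture(expr_a, expr_b, 1, 3)  # sample D is 1:3 (A:B)
```

With these toy values the expected level is 70 in C and 50 in D, so a gene that is higher in A than in B should follow the ordering A > C > D > B.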

The files should be saved in the directory "fqs".

RefSeq NCBI chromosome alias file
The annotation from NCBI does not use the UCSC chromosome name format, so a chromosome alias file is provided for the RefSeq-NCBI release 39 annotation.
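As an illustration of what a chromosome alias file is for, the following sketch renames the first column of GTF records using a two-column lookup table. The layout of the alias file shipped with this analysis is not described here, so the tab-separated "RefSeq name, UCSC name" format, the function names, and the toy GTF record are assumptions.

```python
import csv
import io


def load_alias_table(handle):
    """Read a two-column, tab-separated table: source name -> target name."""
    return {
        row[0]: row[1]
        for row in csv.reader(handle, delimiter="\t")
        if len(row) >= 2
    }


def rename_gtf_chromosomes(gtf_lines, alias):
    """Yield GTF lines with the chromosome column translated via `alias`.

    Comment lines and chromosomes missing from the table pass through unchanged.
    """
    for line in gtf_lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.split("\t", 1)
        if len(fields) == 2:
            fields[0] = alias.get(fields[0], fields[0])
        yield "\t".join(fields)


# Assumed alias layout: RefSeq accession, then UCSC-style name.
alias_text = "NC_000001.11\tchr1\nNC_000023.11\tchrX\n"
alias = load_alias_table(io.StringIO(alias_text))

# Hypothetical GTF content for demonstration.
toy_gtf = [
    "#!toy header\n",
    'NC_000001.11\tsource\texon\t100\t200\t.\t+\t.\tgene_id "geneA";\n',
    'unplaced_1\tsource\texon\t5\t50\t.\t-\t.\tgene_id "geneB";\n',
]
renamed = list(rename_gtf_chromosomes(toy_gtf, alias))
```

Harmonising chromosome names in this way lets reads mapped to a genome with UCSC-style names (chr1, chrX, ...) be matched against annotation records that use RefSeq accessions.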

qRT-PCR gene expression
A TaqMan RT-PCR dataset, also from the SEQC project, with expression values measured for over 1000 genes was used to validate the expression values obtained from the RNA-seq data. Expression was measured for the UHRR and HBRR samples as well as for their mixtures.
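Validation of this kind usually comes down to matching genes between the two platforms and computing a correlation. The sketch below does this for log2-transformed values using Pearson correlation. Which correlation measure and which transformation the analysis itself uses is not stated here, so both are assumptions, as are the function name, the gene identifiers, and the numbers.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-gene expression values from the two platforms.
rnaseq = {"g1": 10.0, "g2": 200.0, "g3": 35.0, "g4": 5.0}
qpcr = {"g1": 1.2, "g2": 18.0, "g3": 4.1, "g5": 9.9}

# Only genes measured on both platforms can be compared.
shared = sorted(set(rnaseq) & set(qpcr))

log_rnaseq = [math.log2(rnaseq[g]) for g in shared]
log_qpcr = [math.log2(qpcr[g]) for g in shared]
r = pearson(log_rnaseq, log_qpcr)
```

Repeating the same comparison with counts obtained from each annotation gives one correlation per annotation, and the annotation with the higher correlation agrees more closely with the qRT-PCR measurements.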

Microarray BeadChip
Microarray data from the SEQC project, with samples A to D hybridized to Illumina BeadChip arrays.

GRCh38 reference genome
The human GRCh38.p12 reference genome from GENCODE (release 34) is used.

Gene annotation files

The following GTF files should be downloaded from their respective gene annotation databases and saved in the same directory as the code:

Ensembl release 100

RefSeq-NCBI release 39

Software

The entire pipeline should be run using R 4.0, along with the required Bioconductor packages.

Other tools needed include:

Python
Bash
RStudio

Running the analysis

To run the analysis, download the R Markdown file (Analysis_pipeline.Rmd) and follow the guided steps.


ShiLab-Bioinformatics/GeneAnnotation
