Skip to content

PoolLab/ReferenceEnhancer

Repository files navigation

# ReferenceEnhancer

The goal of ReferenceEnhancer is to generate a scRNA-seq optimized transcriptomic reference.

Generating a scRNA-seq optimized transcriptomic reference requires optimizing the genome annotation ("xxx.gtf") file that transcriptomic references are based on.

The following three aspects of genome annotations need to be optimized: A) Resolving gene overlap derived read loss; B) Recovering intergenic reads from 3' un-annotated exons; and C) Recovering intronic reads.

After optimizing and assembling the genome annotation, you can use "cellranger mkref" pipeline to assemble the optimized transcriptomic reference for mapping sequencing read data and compiling gene-cell matrices with the "cellranger count" (or other) pipeline.

## Installation

You can install the development version of ReferenceEnhancer as follows:

``` r
install.packages("devtools") 
require(devtools) 
install_github("PoolLab/ReferenceEnhancer")
```

## Example

# This is a sample workflow of the package:

This is the basic workflow for optimizing a genome annotation for single-cell RNA-seq work using ReferenceEnhancer:

1.  Load ReferenceEnhancer and import ENSEMBL/10x Genomics default genome annotation file (GTF).

This file can be downloaded from 10x Genomics provided reference transcriptome "gene" folder at "<https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest>" or Ensembl.org if wish to customize more.

For testing, we have provided a sample file.

library(ReferenceEnhancer)

genome_annotation <- LoadGtf(unoptimized_annotation_path = "test_genes.gtf")

2.  Identify all overlapping genes based on the ENSEMBL/10x Genomics default genome annotation file (GTF), rank-order them according to the number of gene overlaps.

Prioritize this gene list for manual curation focusing on exonically overlapping genes. The function saves the list of overlapping genes in working directory as overlapping_gene_list.csv.

gene_overlaps <- IdentifyOverlappers(genome_annotation = genome_annotation)

3.  Generate recommended actions for overlapping genes based on original genome annotation .gtf file and a list of overlapping genes.

The function updates overlapping_gene_list.csv file with added recommendations.

OverlapResolutions(genome_annotation = genome_annotation, overlap_data = gene_overlaps, gene_pattern = c("Rik$", "^Gm"))

4.  Extract intergenic reads from Cell Ranger aligned bam file. The function saves extracted intergenic reads in working directory as intergenic_reads.bed.

IsolateIntergenicReads(bam_file_name = "test_bam.bam", index_file_name = "test_index.bam.bai", barcode_length = 26)

5.  Generate gene boundaries in order to assign intergenic reads to a specific gene. The function save resulting in working directory as gene_ranges.bed.

Note: This step runs partially in bash/linux terminal. Before this step, make sure that bedops (<https://bedops.readthedocs.io/en/latest/>) has been installed to your computer.

GenerateGeneLocationBed(genome_annotation = genome_annotation, bedops_loc = NULL)

6.  Identify candidate genes for extension with excess 3' intergenic reads and create a rank ordered list of genes as a function of 3' intergenic read mapping within 10kb of known gene end. A rank ordered list of gene extension candidates is saved in working directory as gene_extension_candidates.csv.

Note: This step runs partially in bash/linux terminal. Before this step, make sure that bedtools (<https://bedtools.readthedocs.io/en/latest/content/installation.html>) has been installed to your computer and that it has been added to to the path in your R environment.

GenerateExtensionCandidates(bedtools_loc = NULL)

7.  Create the final optimized annotation file. The function saves the result in working directory as optimized_reference.gtf.

OptimizedAnnotationAssembler(unoptimized_annotation_path = "test_genes.gtf", gene_overlaps = "test_overlapping_gene_list.csv", gene_extension = "gene_extension_candidates.csv", gene_replacement = "test_gene_replacement.csv")

About

No description, website, or topics provided.

Resources

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages