<div class="alert alert-warning">
    <strong>Analyst Note: Fill In</strong><br />
    Fill in the human-readable name of your project as a header, such as:
    
   > # Dr. Doe Human Patient Time-Series
   
</div>

# RNASeq RSEM Counts Extraction

<div class="alert alert-warning">
    <strong>Analyst Note: Fill In</strong><br />
    Fill in the author attributions for your analysis, such as:
    
   > * Amanda Birmingham, CCBB (abirmingham@ucsd.edu)
   > * Based on upstream analysis by Guorong Xu, CCBB (g1xu@ucsd.edu)
</div>

## Table of Contents
* [Introduction](#Introduction)
* [Parameter Input](#Parameter-Input)
* [Library Import](#Library-Import)
* [Counts Extraction](#Counts-Extraction)
* [Citations](#Citations)
* [Appendix: R Session Info](#Appendix:-R-Session-Info)

## Introduction

This notebook takes the results of the RSEM ([1](#Citations)) RNASeq transcript quantification method and extracts a per-sample, per-gene count file for use in downstream analyses. It performs a subset of the work in the "0b_Optional_RNASeq_RSEM_QC_and_Counts_Preparation.ipynb" notebook.

[Table of Contents](#Table-of-Contents)

## Parameter Input

<div class="alert alert-warning">
<h4>Analyst note: Modify Code</h4>
The values shown below are example settings, and should be overwritten with appropriate values for your analysis.
</div>

In [1]:
gDataDir = "../inputs"
gOutputCountsFilename = "all_gene_counts.txt"

<div class="alert alert-warning">
<h4>Analyst note: Info</h4>
The values shown below are standard settings, and should NOT be changed without a clear reason and understanding of the consequences.
</div>

In [2]:
gOutputDir = "../outputs"
gRsemGenesFp = file.path(gDataDir, "all_genes_results.txt")
gOutputCountsFp = file.path(gOutputDir, gOutputCountsFilename)
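
Before running the extraction, it can help to confirm that the input file exists and that the output directory is present. The following is a minimal sketch, not part of the original notebook: `checkPaths` is a hypothetical helper, and the demonstration uses temporary paths rather than the real `gRsemGenesFp` and `gOutputDir`.

```r
# Hypothetical helper (not part of the original notebook): verify that the
# input file exists and create the output directory if it is missing.
checkPaths = function(inputFp, outputDir){
    if (!file.exists(inputFp)) {
        stop("Input file not found: ", inputFp)
    }
    if (!dir.exists(outputDir)) {
        dir.create(outputDir, recursive = TRUE)
    }
    invisible(TRUE)
}

# Demonstration on temporary paths so the sketch runs anywhere
demoInputFp = tempfile(fileext = ".txt")
writeLines("gene_id\tA.genes.results_expected_count", demoInputFp)
demoOutputDir = file.path(tempdir(), "demo_outputs")
checkPaths(demoInputFp, demoOutputDir)
```

In the notebook itself, the equivalent call would presumably be `checkPaths(gRsemGenesFp, gOutputDir)`.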

[Table of Contents](#Table-of-Contents)

## Library Import

Import the necessary R ([2](#Citations)) libraries:

In [None]:
# install.packages("splitstackshape")

In [None]:
#library(IRdisplay)
#library(splitstackshape)

[Table of Contents](#Table-of-Contents)


## Counts Extraction

Extract the raw counts (`expected_count`) column for each sample from the RSEM output.

In [3]:
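# Load the combined RSEM gene results and keep only the per-sample expected_count columns
loadAndCleanStarRsemAllGeneResults = function(rsemGenesFp, sep="\t"){
    rsemGenesDf = read.table(rsemGenesFp, header = TRUE, sep=sep, stringsAsFactors=FALSE)
    # fixed=TRUE matches the suffix literally, so "." is not treated as a regex wildcard;
    # drop=FALSE keeps a data frame even if only one column matches
    isCountCol = grepl(".results_expected_count", colnames(rsemGenesDf), fixed=TRUE)
    geneCountsDf = rsemGenesDf[, isCountCol, drop=FALSE]
    # Rename columns to bare sample names and index rows by gene id
    colnames(geneCountsDf) = sub(".genes.results_expected_count", "", colnames(geneCountsDf), fixed=TRUE)
    row.names(geneCountsDf) = rsemGenesDf$gene_id
    return(geneCountsDf)
}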
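
The column selection and renaming performed by the function above can be tried on a toy table. This is a minimal sketch, assuming the RSEM columns are named `<sample>.genes.results_expected_count`, as the function's patterns imply; the sample names, the `TPM` columns, and all values are made up for illustration.

```r
# Toy RSEM-style table; sample names and values are made up for illustration
toyDf = data.frame(
    gene_id = c("geneA", "geneB"),
    S1.genes.results_expected_count = c(10, 0),
    S1.genes.results_TPM = c(1.5, 0),
    S2.genes.results_expected_count = c(7.25, 3),
    S2.genes.results_TPM = c(0.9, 0.4),
    stringsAsFactors = FALSE
)

# Keep only the expected_count columns (fixed = TRUE matches the suffix literally)
countSuffix = ".genes.results_expected_count"
isCountCol = grepl(countSuffix, colnames(toyDf), fixed = TRUE)
toyCountsDf = toyDf[, isCountCol, drop = FALSE]

# Strip the suffix so columns are named by sample, and index rows by gene
colnames(toyCountsDf) = sub(countSuffix, "", colnames(toyCountsDf), fixed = TRUE)
row.names(toyCountsDf) = toyDf$gene_id
print(toyCountsDf)
```

The result has one row per gene and one column per sample, which is the shape the rest of the notebook works with.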

In [4]:
gUnorderedGeneCountsDf = loadAndCleanStarRsemAllGeneResults(gRsemGenesFp)

In [5]:
dim(gUnorderedGeneCountsDf)
head(gUnorderedGeneCountsDf)

| | EL20_S20_L004_R1_001 | EL62_S62_L004_R1_001 | EL60_S60_L004_R1_001 | EL59_S59_L004_R1_001 | EL61_S61_L004_R1_001 | EL29_S29_L004_R1_001 | EL49_S49_L004_R1_001 | EL32_S32_L004_R1_001 | EL51_S51_L004_R1_001 | EL22_S22_L004_R1_001 | ⋯ | EL6_S6_L004_R1_001 | EL24_S24_L004_R1_001 | EL10_S10_L004_R1_001 | EL12_S12_L004_R1_001 | EL34_S34_L004_R1_001 | EL38_S38_L004_R1_001 | EL58_S58_L004_R1_001 | EL50_S50_L004_R1_001 | EL44_S44_L004_R1_001 | EL63_S63_L004_R1_001 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ENSG00000000003.14 | 1.0 | 2.0 | 2.0 | 4.0 | 4.0 | 0.0 | 3.0 | 2.0 | 2.0 | 4.0 | ⋯ | 1.0 | 3.0 | 2.0 | 0.0 | 1.0 | 4.0 | 3.0 | 0.0 | 4.0 | 4.0 |
| ENSG00000000005.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ⋯ | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ENSG00000000419.12 | 118.0 | 127.0 | 107.0 | 159.0 | 79.0 | 83.0 | 100.0 | 151.0 | 189.0 | 79.0 | ⋯ | 47.0 | 118.0 | 92.0 | 64.0 | 183.0 | 189.0 | 86.0 | 137.0 | 184.0 | 118.0 |
| ENSG00000000457.13 | 240.62 | 276.31 | 254.35 | 225.02 | 234.61 | 200.96 | 165.91 | 269.95 | 323.61 | 157.46 | ⋯ | 106.28 | 294.79 | 227.07 | 203.26 | 263.25 | 272.84 | 186.32 | 216.41 | 292.29 | 253.39 |
| ENSG00000000460.16 | 88.38 | 68.69 | 50.65 | 75.98 | 65.39 | 55.04 | 54.09 | 53.05 | 87.39 | 42.54 | ⋯ | 31.72 | 57.21 | 56.93 | 56.74 | 72.75 | 66.16 | 71.68 | 46.59 | 86.71 | 57.61 |
| ENSG00000000938.12 | 2142.0 | 2545.0 | 2243.0 | 2249.0 | 1732.0 | 2083.0 | 2114.0 | 2422.0 | 3041.0 | 1814.0 | ⋯ | 754.0 | 2276.0 | 2138.0 | 2273.0 | 2817.0 | 2381.0 | 1847.0 | 1936.0 | 2874.0 | 2931.0 |


In [6]:
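# Strip only a leading "X", such as the one read.table prepends to column names
# that are not valid R names (e.g., names starting with a digit); an unanchored
# pattern would also delete any "X" inside a sample name
colnames(gUnorderedGeneCountsDf) = sub("^X", "", colnames(gUnorderedGeneCountsDf))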
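
The `X` handled in the cell above is presumably the one that `read.table` prepends when a column name is not a syntactically valid R name, for example a sample name that starts with a digit. The sketch below illustrates that behavior on a made-up two-sample file and shows `check.names = FALSE` as one possible alternative; the file contents and sample names are invented, and this is not necessarily how the original analysis handled it.

```r
# Made-up two-sample table whose first sample name starts with a digit
demoFp = tempfile(fileext = ".txt")
writeLines(c("gene_id\t1A_S1\tEX2_S2", "geneA\t5\t8"), demoFp)

# Default check.names = TRUE turns "1A_S1" into the syntactic name "X1A_S1"
checkedDf = read.table(demoFp, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
print(colnames(checkedDf))

# check.names = FALSE keeps the names exactly as written in the file
rawDf = read.table(demoFp, header = TRUE, sep = "\t", stringsAsFactors = FALSE,
                   check.names = FALSE)
print(colnames(rawDf))

# Removing every "X" also alters "EX2_S2"; anchoring strips only a leading one
allXRemoved = gsub("X", "", colnames(checkedDf))
leadingXRemoved = sub("^X", "", colnames(checkedDf))
print(allXRemoved)
print(leadingXRemoved)
```

Note that even the anchored pattern would strip a genuine leading `X` from a sample name, so reading with `check.names = FALSE` may be the safer choice when sample names can start with that letter.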

Write out the resulting gene counts to a tab-delimited text file:

In [7]:
write.table(gUnorderedGeneCountsDf, gOutputCountsFp, sep = "\t")
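
A file written this way has a header line with one fewer field than the data lines, because the row names (gene ids) get no header entry. `read.table` recognizes that layout and restores the first column as row names. The following is a minimal round-trip sketch on a made-up counts table, assuming the same `write.table` arguments as above; the data and the temporary file path are for illustration only.

```r
# Made-up counts table standing in for the real gene counts
demoCountsDf = data.frame(S1 = c(10, 0), S2 = c(7.25, 3),
                          row.names = c("geneA", "geneB"))
demoCountsFp = tempfile(fileext = ".txt")
write.table(demoCountsDf, demoCountsFp, sep = "\t")

# The header line has one fewer field than the data lines, so read.table
# uses the first column as row names
reloadedDf = read.table(demoCountsFp, header = TRUE, sep = "\t")
print(reloadedDf)
```

Downstream tools that do not follow this convention may need the header shifted by one column, so it is worth checking how the counts file will be consumed.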

[Table of Contents](#Table-of-Contents)

## Citations

1. Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011 Aug 4;12:323.
2. R Core Team (2016). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

[Table of Contents](#Table-of-Contents)

## Appendix: R Session Info

In [None]:
sessionInfo()

[Table of Contents](#Table-of-Contents)

Copyright (c) 2018 UC San Diego Center for Computational Biology & Bioinformatics under the MIT License

Notebook template by Amanda Birmingham