<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#E.-coli-MG1655-RNAseq-Data" data-toc-modified-id="E.-coli-MG1655-RNAseq-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span><em>E. coli</em> MG1655 RNAseq Data</a></span><ul class="toc-item"><li><span><a href="#Load-your-SummarizedExperiment-file" data-toc-modified-id="Load-your-SummarizedExperiment-file-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Load your SummarizedExperiment file</a></span></li><li><span><a href="#Ensure-colData-is-correct" data-toc-modified-id="Ensure-colData-is-correct-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Ensure colData is correct</a></span></li><li><span><a href="#Merge-with-PRECISE-SummarizedExperiment" data-toc-modified-id="Merge-with-PRECISE-SummarizedExperiment-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Merge with PRECISE SummarizedExperiment</a></span></li><li><span><a href="#Create-the-DESeqDataSet" data-toc-modified-id="Create-the-DESeqDataSet-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Create the DESeqDataSet</a></span></li><li><span><a href="#Remove-noisy-genes" data-toc-modified-id="Remove-noisy-genes-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Remove noisy genes</a></span></li><li><span><a href="#Calculate-TPM" data-toc-modified-id="Calculate-TPM-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Calculate TPM</a></span></li></ul></li><li><span><a href="#Multi-strain-E.-coli-RNAseq" data-toc-modified-id="Multi-strain-E.-coli-RNAseq-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Multi-strain <em>E. coli</em> RNAseq</a></span><ul class="toc-item"><li><span><a href="#Load-TPM-data-files" data-toc-modified-id="Load-TPM-data-files-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Load TPM data files</a></span></li><li><span><a href="#Merge-TPM-files" data-toc-modified-id="Merge-TPM-files-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Merge TPM files</a></span></li></ul></li></ul></div>

There are two methods to join RNAseq data for ICA:
1. Merge the SummarizedExperiment files (Preferred for MG1655)
2. Merge the TPM files (Necessary for multi-strain)

Merging the SE files is preferred because it gives a more accurate picture of which genes are noisy and should be discarded. However, this method is slower and more error-prone than merging the TPM files. If the `cbind` step raises an error, fall back to the TPM merging method.

If you are creating a new compendium, use pipeline 1 (Merge SE) and skip step 1.2.

In [2]:
suppressPackageStartupMessages(library("DESeq2"))
suppressPackageStartupMessages(library("dplyr"))

# _E. coli_ MG1655 RNAseq Data

## Load your SummarizedExperiment file

Use the RNAseq Workflow to generate your `se.rda` file. Replace the file `data/example_data/example_se.rda` with your data file. Feel free to rename the `example_data` folder.

In [3]:
DATA_DIR <- '../data/example_data/'

In [4]:
SE_FILE <- file.path(DATA_DIR,'example_se.rda')
load(SE_FILE)
new_se <- se
new_se

class: RangedSummarizedExperiment 
dim: 4386 2 
metadata(0):
assays(1): counts
rownames(4386): b0001 b0002 ... b4706 b4708
rowData names(0):
colnames(2): PROJECT__CONDITION__1 PROJECT__CONDITION__2
colData names(2): project condition

**NOTE:** This pipeline assumes sample IDs of the form **&lt;project>\__&lt;condition>\__&lt;rep#\>**, with the fields separated by double underscores.

In [5]:
# If this is FALSE, at least one of your sample IDs is incompatible.
# Change sample IDs using:
# colnames(new_se) <- c('id1','id2')
# The pattern is anchored (^...$) so that the whole ID must match
all(grepl('^\\w+__\\w+__\\d+$', colnames(new_se)))
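
When the check above returns `FALSE`, it helps to see which IDs are the problem. The following is a minimal sketch in base R; the sample IDs are made up for illustration, and the anchored pattern is an assumption about how strictly the naming convention should be enforced.

```r
# Hypothetical sample IDs; the last two are malformed on purpose
sample_ids <- c("control__wt_glc__1", "fur__wt_dpd__2",
                "fur_wt_dpd_1", "fur__wt_dpd__A")

# Anchored pattern for <project>__<condition>__<rep#>
id_pattern <- "^\\w+__\\w+__\\d+$"

# Flag each ID, then list the ones that do not follow the convention
is_valid <- grepl(id_pattern, sample_ids)
bad_ids <- sample_ids[!is_valid]
print(bad_ids)
```

In the notebook, `sample_ids` would be replaced by `colnames(new_se)`.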

## Ensure colData is correct

In [6]:
colData <- colnames(new_se) %>%                     # Get column names
            strsplit('__') %>%                      # Split them by double-underscores
            sapply(c) %>%                           # Turn them into column vectors
            t %>%                                   # Transpose into row vectors
            DataFrame(row.names=colnames(new_se))   # Create dataframe
colnames(colData) <- c('project','condition','rep')
colData(new_se) <- colData
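
To see what the split produces without loading Bioconductor, here is a rough base-R sketch of the same idea. It builds a plain `data.frame` rather than the `DataFrame` used above, so it would still need converting before being assigned to `colData`; the variable names are illustrative only.

```r
# Toy sample IDs in the <project>__<condition>__<rep#> format
sample_ids <- c("PROJECT__CONDITION__1", "PROJECT__CONDITION__2")

# Split each ID on the literal double underscore
parts <- strsplit(sample_ids, "__", fixed = TRUE)

# Every ID should yield exactly three fields
stopifnot(all(lengths(parts) == 3))

# Stack the fields into one row per sample
sample_table <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
colnames(sample_table) <- c("project", "condition", "rep")
rownames(sample_table) <- sample_ids

# Store the replicate number as an integer rather than text
sample_table$rep <- as.integer(sample_table$rep)
print(sample_table)
```

The `lengths(parts) == 3` check catches IDs with too few or too many fields before they silently produce a misaligned table.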

In [7]:
colData(new_se)

DataFrame with 2 rows and 3 columns
                       project condition      rep
                      <factor>  <factor> <factor>
PROJECT__CONDITION__1  PROJECT CONDITION        1
PROJECT__CONDITION__2  PROJECT CONDITION        2

## Merge with PRECISE SummarizedExperiment

In [8]:
load('../data/precise_data/se.rda')
preci_se <- se

In [10]:
colData <- colnames(preci_se) %>%                   # Get column names
            strsplit('__') %>%                      # Split them by double-underscores
            sapply(c) %>%                           # Turn them into column vectors
            t %>%                                   # Transpose into row vectors
            DataFrame(row.names=colnames(preci_se)) # Create dataframe
colnames(colData) <- c('project','condition','rep')
colData(preci_se) <- colData

In [11]:
preci_se

class: RangedSummarizedExperiment 
dim: 4386 278 
metadata(0):
assays(1): counts
rownames(4386): b0001 b0002 ... b4706 b4708
rowData names(0):
colnames(278): control__wt_glc__1 control__wt_glc__2 ...
  efeU__menFentCubiC_ale38__1 efeU__menFentCubiC_ale38__2
colData names(3): project condition rep

In [12]:
# If this line fails, use the multistrain workflow
final_se <- cbind(preci_se,new_se)
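
If the `cbind` fails, two likely culprits are gene IDs that do not line up and `colData` columns that differ between the two objects. This is an assumption about common causes, not an exhaustive list. The helper below is hypothetical and works on plain character vectors, standing in for `rownames(se)` and `colnames(colData(se))`.

```r
# Hypothetical helper: lists reasons two datasets may not be column-bindable
check_merge_inputs <- function(genes_a, genes_b, coldata_cols_a, coldata_cols_b) {
  problems <- character(0)
  if (!identical(genes_a, genes_b)) {
    problems <- c(problems, "gene IDs (row names) differ or are in a different order")
  }
  if (!setequal(coldata_cols_a, coldata_cols_b)) {
    problems <- c(problems, "colData column names differ")
  }
  problems
}

# Toy inputs: gene order differs and one colData column is missing
problems <- check_merge_inputs(
  genes_a = c("b0001", "b0002", "b0003"),
  genes_b = c("b0001", "b0003", "b0002"),
  coldata_cols_a = c("project", "condition", "rep"),
  coldata_cols_b = c("project", "condition")
)
print(problems)
```

An empty result means neither of these two issues applies, in which case the TPM merging method is the fallback.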

## Create the DESeqDataSet

In [13]:
# To group biological replicates, remove the replicate suffix (__<rep#>) from each sample name
# E.g. control__wt_glc__1 and control__wt_glc__2 are replicates
# Removing the suffix from each results in control__wt_glc for both
# Matching the suffix, rather than dropping a fixed 3 characters, also handles replicate numbers of 10 or more

colData(final_se)$group <- sub('__\\d+$', '', colnames(final_se))
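
Before building the DESeqDataSet, it can be useful to confirm how the samples fall into groups. The sketch below uses made-up sample IDs and strips the replicate suffix with a pattern, which is a choice made for this example.

```r
# Toy sample IDs; the last group has only one sample
sample_ids <- c("control__wt_glc__1", "control__wt_glc__2",
                "fur__wt_dpd__1", "fur__wt_dpd__2",
                "PROJECT__CONDITION__1")

# Drop the trailing __<rep#> to get the group label
group <- sub("__\\d+$", "", sample_ids)

# Count samples per group
reps_per_group <- table(group)
print(reps_per_group)

# Groups with a single sample have no biological replicate
singletons <- names(reps_per_group)[reps_per_group < 2]
print(singletons)
```

In the notebook, `sample_ids` would be `colnames(final_se)`.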

In [15]:
dds <- DESeqDataSet(final_se, design = ~group)

“some variables in design formula are characters, converting to factors”

## Remove noisy genes

ICA of small datasets can be sensitive to noise in expression data. The major source of noise in RNA-seq data is low counts (see [shot noise](https://en.wikipedia.org/wiki/Shot_noise)). Here, we remove genes that never exceed 10 fragments per million (FPM) in any sample of the dataset.

For the original ICA paper, we also removed genes whose length was shorter than 100 nts. New results with microarray data show that this is unnecessary.

In [16]:
nrow(dds)

# Get fragments per million (FPM): divide each column by its total count, then scale by 10^6
fpm <- sweep(assay(dds), 2, colSums(assay(dds)), FUN="/")*1e6
# Keep genes whose maximum FPM across all samples is > 10
keep_genes <- rownames(assay(dds))[apply(fpm,1,max) > 10]
dds <- dds[keep_genes,]
nrow(dds)
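
The filter can be tried on a small count matrix without DESeq2. The counts below are invented so that each sample totals one million fragments, which makes FPM equal to the raw count.

```r
# Toy count matrix: 3 genes x 2 samples, each column sums to 1e6
counts <- matrix(
  c(600000, 399995, 5,
    700000, 299999, 1),
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2"))
)

# Fragments per million: divide each column by its total, then scale
fpm <- sweep(counts, 2, colSums(counts), FUN = "/") * 1e6

# Keep genes whose maximum FPM across samples exceeds 10
keep <- apply(fpm, 1, max) > 10
filtered <- counts[keep, , drop = FALSE]
print(rownames(filtered))
```

Here `geneC` stays below the threshold in both samples and is dropped.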

## Calculate TPM

In [17]:
# The base condition for PRECISE is control__wt_glc
base_cond <- c('control__wt_glc__1','control__wt_glc__2')

In [18]:
fpkm_data <- fpkm(dds)

# TPM = FPKM / (sum of FPKM values in the sample) * 10^6
tpm_data <- sweep(fpkm_data,2,colSums(fpkm_data),`/`)*1e6

# log2(0) is undefined, so add 1 pseudocount to each value
log_tpm <- log2(tpm_data+1)

# Subtract the mean expression value of the baseline condition from each gene
log_tpm_norm <- log_tpm - rowMeans(log_tpm[,base_cond])
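
The same arithmetic can be followed on a toy matrix. The FPKM values and the third sample name below are invented; only the two baseline column names come from this notebook.

```r
# Toy FPKM matrix: 3 genes x 3 samples
fpkm_toy <- matrix(
  c(10, 30, 60,
    20, 20, 60,
    40, 10, 50),
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"),
                  c("control__wt_glc__1", "control__wt_glc__2", "test__cond__1"))
)
base_cond <- c("control__wt_glc__1", "control__wt_glc__2")

# TPM: rescale each sample so that its values sum to one million
tpm_toy <- sweep(fpkm_toy, 2, colSums(fpkm_toy), `/`) * 1e6
print(colSums(tpm_toy))

# Log-transform with a pseudocount of 1
log_tpm_toy <- log2(tpm_toy + 1)

# Center each gene on its mean across the baseline samples
log_tpm_norm_toy <- log_tpm_toy - rowMeans(log_tpm_toy[, base_cond])

# After centering, the baseline mean of every gene is zero (up to rounding)
print(rowMeans(log_tpm_norm_toy[, base_cond]))
```

Values in `log_tpm_norm_toy` are therefore log2 fold changes relative to the baseline condition.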

In [19]:
# Save files
write.csv(log_tpm, file = file.path(DATA_DIR,'log_tpm.csv'))
write.csv(log_tpm_norm, file = file.path(DATA_DIR,'log_tpm_norm.csv'))

**MAKE SURE TO USE LOG_TPM_NORM FOR ICA CALCULATIONS**

# Multi-strain _E. coli_ RNAseq

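This pipeline merges TPM files instead of SummarizedExperiment files. It also works for new _E. coli_ MG1655 RNAseq data, for example when the `cbind` step of the first pipeline fails.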

## Load TPM data files

In [20]:
new_tpm <- read.csv(file = file.path(DATA_DIR,'example_tpm.csv'),row.names=1)
head(new_tpm)
nrow(new_tpm)

| | PROJECT__CONDITION__1 | PROJECT__CONDITION__2 |
|---|---|---|
| | &lt;dbl&gt; | &lt;dbl&gt; |
| b0001 | 3568.40476 | 4769.24885 |
| b0002 | 1505.30782 | 1227.13934 |
| b0003 | 1205.91303 | 1048.46846 |
| b0004 | 904.27614 | 824.4345 |
| b0005 | 165.61081 | 124.00526 |
| b0006 | 81.75999 | 74.04125 |


In [21]:
# If this is FALSE, at least one of your sample IDs is incompatible.
# Change sample IDs using:
# colnames(new_tpm) <- c('id1','id2')
# The pattern is anchored (^...$) so that the whole ID must match
all(grepl('^\\w+__\\w+__\\d+$', colnames(new_tpm)))

In [45]:
precise_tpm <- read.csv(file=file.path('../data/precise_data/tpm.csv'),row.names=1)
head(precise_tpm)
nrow(precise_tpm)

Unnamed: 0_level_0,control__wt_glc__1,control__wt_glc__2,fur__wt_dpd__1,fur__wt_dpd__2,fur__wt_fe__1,fur__wt_fe__2,fur__delfur_dpd__1,fur__delfur_dpd__2,fur__delfur_fe2__1,fur__delfur_fe2__2,⋯,efeU__menFentC_ale29__1,efeU__menFentC_ale29__2,efeU__menFentC_ale30__1,efeU__menFentC_ale30__2,efeU__menFentCubiC_ale36__1,efeU__menFentCubiC_ale36__2,efeU__menFentCubiC_ale37__1,efeU__menFentCubiC_ale37__2,efeU__menFentCubiC_ale38__1,efeU__menFentCubiC_ale38__2
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
b0002,60272.78323,68198.72323,121169.20304,145540.59326,63882.06447,48004.52925,21511.75213,29469.20971,75339.01286,74165.29294,⋯,28891.2075,29042.9495,69461.22274,71009.1183,117782.59229,123758.69981,144570.48763,150727.71647,98361.06471,104063.02886
b0003,33377.06196,37164.56513,91475.25578,133756.16752,47904.42245,40046.05466,14717.08888,26674.55082,54063.75675,52111.35419,⋯,25854.8116,24755.9265,30160.51171,32604.9932,55072.62933,54680.31553,61322.83601,62218.66057,41121.10277,47295.6773
b0004,39928.51639,45480.54989,34795.42726,48008.86017,32693.80035,24683.96368,6257.00956,8796.56898,26505.65856,25967.8903,⋯,35420.3085,34007.0294,43456.61021,45430.1312,69098.96811,67017.28474,72025.89555,76254.50263,57135.92525,59933.42721
b0005,552.41138,521.43392,225.36046,206.70542,824.56606,607.41926,101.20178,115.29765,682.79173,461.27862,⋯,394.0763,299.5415,546.915,538.8763,156.36015,124.90783,231.31556,315.69882,354.656,332.15846
b0006,1007.04777,988.89326,875.24416,881.36419,863.208,911.8139,800.72576,952.40131,954.41607,912.67985,⋯,4320.1949,4113.9843,3419.40603,3201.9816,1561.63852,1579.56184,1521.48854,1501.31304,3157.1771,3310.68706
b0007,32.81408,32.35787,37.14096,32.07092,25.89253,31.45357,39.78911,38.56178,28.49857,30.14528,⋯,399.8913,404.0319,34.05348,36.5612,34.02924,41.92358,37.48823,41.76407,31.90325,26.59931


## Merge TPM files

Only keep genes that exist in both the PRECISE and the new TPM files. By default, `merge` performs an inner join, so genes missing from either file are dropped.

In [53]:
# Merge turns row.names into the first column
final_tpm <- merge(precise_tpm,new_tpm,by="row.names")
# Convert row.names column back to row names
rownames(final_tpm) <- final_tpm[,1]
# Delete row.names column (In R, minus sign means "except")
final_tpm <- final_tpm[,-1]

head(final_tpm)
nrow(final_tpm)

Unnamed: 0_level_0,control__wt_glc__1,control__wt_glc__2,fur__wt_dpd__1,fur__wt_dpd__2,fur__wt_fe__1,fur__wt_fe__2,fur__delfur_dpd__1,fur__delfur_dpd__2,fur__delfur_fe2__1,fur__delfur_fe2__2,⋯,efeU__menFentC_ale30__1,efeU__menFentC_ale30__2,efeU__menFentCubiC_ale36__1,efeU__menFentCubiC_ale36__2,efeU__menFentCubiC_ale37__1,efeU__menFentCubiC_ale37__2,efeU__menFentCubiC_ale38__1,efeU__menFentCubiC_ale38__2,PROJECT__CONDITION__1,PROJECT__CONDITION__2
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
b0002,60272.78323,68198.72323,121169.20304,145540.59326,63882.06447,48004.52925,21511.75213,29469.20971,75339.01286,74165.29294,⋯,69461.22274,71009.1183,117782.59229,123758.69981,144570.48763,150727.71647,98361.06471,104063.02886,1505.30782,1227.139336
b0003,33377.06196,37164.56513,91475.25578,133756.16752,47904.42245,40046.05466,14717.08888,26674.55082,54063.75675,52111.35419,⋯,30160.51171,32604.9932,55072.62933,54680.31553,61322.83601,62218.66057,41121.10277,47295.6773,1205.91303,1048.468459
b0004,39928.51639,45480.54989,34795.42726,48008.86017,32693.80035,24683.96368,6257.00956,8796.56898,26505.65856,25967.8903,⋯,43456.61021,45430.1312,69098.96811,67017.28474,72025.89555,76254.50263,57135.92525,59933.42721,904.27614,824.434498
b0005,552.41138,521.43392,225.36046,206.70542,824.56606,607.41926,101.20178,115.29765,682.79173,461.27862,⋯,546.915,538.8763,156.36015,124.90783,231.31556,315.69882,354.656,332.15846,165.61081,124.00526
b0006,1007.04777,988.89326,875.24416,881.36419,863.208,911.8139,800.72576,952.40131,954.41607,912.67985,⋯,3419.40603,3201.9816,1561.63852,1579.56184,1521.48854,1501.31304,3157.1771,3310.68706,81.75999,74.041248
b0007,32.81408,32.35787,37.14096,32.07092,25.89253,31.45357,39.78911,38.56178,28.49857,30.14528,⋯,34.05348,36.5612,34.02924,41.92358,37.48823,41.76407,31.90325,26.59931,6.12688,7.589872
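
To check how many genes the merge will drop, the row names can be compared beforehand. The sketch below uses two tiny invented data frames; their column names and values are illustrative only.

```r
# Toy TPM tables keyed by gene ID (row names)
precise_toy <- data.frame(s1 = c(1, 2, 3), s2 = c(4, 5, 6),
                          row.names = c("b0002", "b0003", "b0004"))
new_toy <- data.frame(n1 = c(7, 8, 9),
                      row.names = c("b0001", "b0002", "b0003"))

# Genes shared by both tables, and genes unique to each
shared <- intersect(rownames(precise_toy), rownames(new_toy))
only_precise <- setdiff(rownames(precise_toy), rownames(new_toy))
only_new <- setdiff(rownames(new_toy), rownames(precise_toy))

# Inner join on row names; merge() stores them in a "Row.names" column
merged <- merge(precise_toy, new_toy, by = "row.names")
rownames(merged) <- as.character(merged$Row.names)
merged$Row.names <- NULL
print(merged)
```

Only the shared genes survive the merge, so `only_precise` and `only_new` list exactly what was lost from each side.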


In [54]:
# The base condition for PRECISE is control__wt_glc
base_cond <- c('control__wt_glc__1','control__wt_glc__2')
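
Normalization depends on the baseline columns being present in the merged table. A small guard like the one below can catch a missing or misspelled baseline sample early; the column names in `sample_ids` are toy values.

```r
# Toy column names standing in for colnames(final_tpm)
sample_ids <- c("control__wt_glc__1", "control__wt_glc__2",
                "PROJECT__CONDITION__1", "PROJECT__CONDITION__2")
base_cond <- c("control__wt_glc__1", "control__wt_glc__2")

# Baseline samples that are absent from the table
missing_base <- setdiff(base_cond, sample_ids)
if (length(missing_base) > 0) {
  stop("Baseline samples not found: ", paste(missing_base, collapse = ", "))
}
print(length(missing_base))
```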

In [55]:
log_tpm <- log2(final_tpm+1)
log_tpm_norm <- log_tpm - rowMeans(log_tpm[,base_cond])
head(log_tpm_norm)

Unnamed: 0_level_0,control__wt_glc__1,control__wt_glc__2,fur__wt_dpd__1,fur__wt_dpd__2,fur__wt_fe__1,fur__wt_fe__2,fur__delfur_dpd__1,fur__delfur_dpd__2,fur__delfur_fe2__1,fur__delfur_fe2__2,⋯,efeU__menFentC_ale30__1,efeU__menFentC_ale30__2,efeU__menFentCubiC_ale36__1,efeU__menFentCubiC_ale36__2,efeU__menFentCubiC_ale37__1,efeU__menFentCubiC_ale37__2,efeU__menFentCubiC_ale38__1,efeU__menFentCubiC_ale38__2,PROJECT__CONDITION__1,PROJECT__CONDITION__2
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
b0002,-0.089117631,0.089117631,0.9183148,1.18271135,-0.005214733,-0.41744767,-1.5754562,-1.12139093,0.23276803,0.2101154,⋯,0.11558032,0.147376357,0.87741842,0.9488214,1.1730629,1.2332341,0.61745377,0.6987513,-5.41156,-5.706102
b0003,-0.077533178,0.077533178,1.3769639,1.92511093,0.443755597,0.18526277,-1.258843,-0.40091531,0.61825508,0.5651921,⋯,-0.22372463,-0.111296203,0.64492832,0.6346146,0.8000146,0.8209372,0.22348067,0.4253055,-4.86704,-5.068703
b0004,-0.09391291,0.09391291,-0.2924293,0.17196223,-0.382307295,-0.78773599,-2.7675927,-2.27619168,-0.68501369,-0.7145841,⋯,0.02824033,0.092312719,0.69731654,0.6531863,0.7571672,0.8394729,0.42305489,0.4920164,-5.556867,-5.690071
b0005,0.041551874,-0.041551874,-1.2481785,-1.3722616,0.618583216,0.17826538,-2.3953802,-2.20897856,0.34675653,-0.2180378,⋯,0.02715164,0.005828401,-1.77273019,-2.0944325,-1.2107147,-0.763689,-0.59631793,-0.6905919,-1.690318,-2.104812
b0006,0.013109564,-0.013109564,-0.1890496,-0.17900828,-0.209003939,-0.13006177,-0.3172737,-0.06729892,-0.06425339,-0.1286938,⋯,1.77571316,1.680961056,0.64552964,0.661983,0.6079769,0.588731,1.66063764,1.7291119,-3.593377,-3.734627
b0007,0.009798498,-0.009798498,0.1835152,-0.02226242,-0.320620325,-0.04944859,0.2803582,0.23628149,-0.18718075,-0.1088121,⋯,0.06173173,0.161417199,0.06073411,0.3539447,0.1965916,0.3485735,-0.02959551,-0.2831938,-2.236483,-1.967117


**MAKE SURE TO USE LOG_TPM_NORM FOR ICA CALCULATIONS**

In [52]:
# write.csv(final_tpm, file = '../data/tpm.csv')
# write.csv(log_tpm, file = '../data/log_tpm.csv')
# write.csv(log_tpm_norm, file = '../data/log_tpm_norm.csv')