# Differential gene expression with DESeq2

This notebook creates the following two outputs:
1. a matrix of significantly differentially expressed ORFs from the results of the differential expression analysis
2. a matrix of VSD-normalized counts ordered by variance across samples

Both files are created for transcript-level data and for "gene-level" data, which uses only transcripts with KEGG annotations and sums counts by KEGG annotation. 
<b>The second method produces data whose row names are KEGG annotations, so each annotation appears exactly once in the matrix. </b>


### Prepare environment
----

In [1]:
library('tximport', quietly=T)
library('DESeq2',quietly=T)
library('ashr',quietly=T)
library('tibble',quietly=T)
library('tidyverse',quietly=T)



## DE analysis on all open reading frames (ORFs)
---
### 1. Read in Salmon Counts
First, the count data from Salmon are read in with the `read.in` function, which takes an organism ID (`org`) and relies on:
- a pattern matching the sample ID in every Salmon output file path
- the directory holding that organism's Salmon files

My data were separated into different folders, one for each organism, and the Salmon output was placed within each. The pattern and directory can be changed to suit a different file organization scheme. The raw counts are then read in with `tximport`, specifying that <i>Salmon</i> was used.
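
As a rough illustration of the file discovery step, the sketch below builds a throwaway folder layout and names each file by its sample ID. The layout (one folder per sample holding a `quant.sf` file) and the `salmon_demo` directory are assumptions made for the example, and base R's `regmatches` stands in for `str_extract`.

```r
# Hypothetical layout: <tempdir>/salmon_demo/04/<sample id>/quant.sf
root <- file.path(tempdir(), "salmon_demo", "04")
samples <- c("04add_backA", "04pFe21_9A")
for (s in samples) {
  dir.create(file.path(root, s), recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(root, s, "quant.sf"))
}

# "\\.sf$" matches a literal ".sf" at the end of the file name
files <- list.files(root, pattern = "\\.sf$", recursive = TRUE, full.names = TRUE)

# Name each path by the sample ID found inside it
pattern <- "[[:digit:]]{2}(add_back|pFe21_9|pFe19)(A|B|C)"
names(files) <- regmatches(files, regexpr(pattern, files))

print(names(files))
```

Naming the file vector matters because `tximport` uses those names as the column names of the count matrix.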

Next, metadata are created for each Salmon file from the sample ID in its file path. The ID combines all three metadata categories (organism, treatment, replicate), so I extract the IDs and use a pattern for each category to separate them. 
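
The ID parsing can be sketched in base R on a few example IDs taken from the metadata tables in this notebook. This is only an illustration of the idea; the notebook itself uses `stringr`, and the variable names here are made up for the example.

```r
# Example sample IDs of the form <isolate><treatment code><replicate>
ids <- c("04add_backA", "08pFe19B", "13pFe21_9C")

# Pull each part out with a capture group
isolate <- sub("^([[:digit:]]{2}).*$", "\\1", ids)
treatment_code <- sub("^[[:digit:]]{2}(add_back|pFe21_9|pFe19)(A|B|C)$", "\\1", ids)
replicate <- sub("^.*(A|B|C)$", "\\1", ids)

# Map the treatment codes to readable labels
labels <- c(pFe19 = "High_Iron", pFe21_9 = "Low_Iron", add_back = "Add_Back")
parsed <- data.frame(id = ids,
                     isolate = isolate,
                     treatment = unname(labels[treatment_code]),
                     rep = replicate)
print(parsed)
```

Anchoring the replicate pattern to the end of the ID avoids matching an earlier capital letter by accident.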

In [2]:
pattern='[[:digit:]]{2}(add_back|pFe21_9|pFe19)(A|B|C)'

read.in <- function(org){
    dir <- paste("../expression_analysis/salmon/",org,sep='')
    files <- file.path(dir,list.files(dir,pattern=".sf",recursive=TRUE))
    
    names(files)=str_extract(files,pattern)
    
    if (!all(file.exists(files))) {
        stop("not all files exist\n",
             "Directory: ", dir, "\n",
             "Files: ", paste(files, collapse = "\n"))
    }
    
    # txOut = TRUE keeps transcript-level (ORF-level) estimates
    raw_counts <- tximport(files, type='salmon', txOut = TRUE)
    raw_counts
}

create.metadata=function(org){
    dir <- paste("../expression_analysis/salmon/",org,sep='')
    files <- file.path(dir,list.files(dir,pattern=".sf",recursive=TRUE))
    
    id=str_extract(files,pattern)
    metadata=data.frame('id'=id,
                        'isolate'=org,
                        'treatment'=str_extract(id,'[[:alpha:]]+(19|21_9|_back)'),
                        'rep'=str_extract(id, 'A|B|C'))
    metadata$treatment=str_replace_all(
        metadata$treatment,
        c('pFe19'='High_Iron', 'pFe21_9'='Low_Iron','add_back'='Add_Back'))
    print(metadata)
    metadata
}
counts_4=read.in('04')
metadata_4=create.metadata('04')

counts_8=read.in('08')
metadata_8=create.metadata('08')

counts_6=read.in('06')
metadata_6=create.metadata('06')

counts_13=read.in('13')
metadata_13=create.metadata('13')

reading in files with read_tsv

           id isolate treatment rep
1 04add_backA      04  Add_Back   A
2 04add_backB      04  Add_Back   B
3  04pFe21_9A      04  Low_Iron   A
4  04pFe21_9B      04  Low_Iron   B
5  04pFe21_9C      04  Low_Iron   C


           id isolate treatment rep
1 08add_backB      08  Add_Back   B
2 08add_backC      08  Add_Back   C
3    08pFe19A      08 High_Iron   A
4    08pFe19B      08 High_Iron   B
5    08pFe19C      08 High_Iron   C
6  08pFe21_9A      08  Low_Iron   A
7  08pFe21_9B      08  Low_Iron   B
8  08pFe21_9C      08  Low_Iron   C


           id isolate treatment rep
1 06add_backA      06  Add_Back   A
2 06add_backB      06  Add_Back   B
3 06add_backC      06  Add_Back   C
4    06pFe19A      06 High_Iron   A
5    06pFe19B      06 High_Iron   B
6    06pFe19C      06 High_Iron   C
7  06pFe21_9A      06  Low_Iron   A
8  06pFe21_9B      06  Low_Iron   B
9  06pFe21_9C      06  Low_Iron   C


           id isolate treatment rep
1 13add_backA      13  Add_Back   A
2 13add_backB      13  Add_Back   B
3    13pFe19A      13 High_Iron   A
4    13pFe19B      13 High_Iron   B
5    13pFe19C      13 High_Iron   C
6  13pFe21_9A      13  Low_Iron   A
7  13pFe21_9B      13  Low_Iron   B
8  13pFe21_9C      13  Low_Iron   C


Note: 13 (G. huxleyi) add_backC did not sequence well, and was not included in the analysis.

In [12]:
print(metadata_13)

           id isolate treatment rep
1 13add_backA      13  Add_Back   A
2 13add_backB      13  Add_Back   B
3    13pFe19A      13 High_Iron   A
4    13pFe19B      13 High_Iron   B
5    13pFe19C      13 High_Iron   C
6  13pFe21_9A      13  Low_Iron   A
7  13pFe21_9B      13  Low_Iron   B
8  13pFe21_9C      13  Low_Iron   C


### 2. Create DESeq2 object
---------
A DESeq2 object must be made to perform the differential expression analysis; this is done with the `dds` function. Since `tximport` was used to read in the data, I used `DESeqDataSetFromTximport`. The `dds` function completes two more tasks: it sets the low iron treatment as the reference level (so every other treatment is compared against it), and it keeps only ORFs with at least 5 counts in at least n samples, where <b>n = lowest # of replicates in any treatment.</b> 
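
The filtering rule can be checked on a toy matrix. The counts and ORF names below are invented for illustration; only the `rowSums(counts >= 5) >= n` expression mirrors the notebook.

```r
# Toy count matrix: 3 ORFs (rows) by 5 samples (columns)
counts <- matrix(c(0, 3, 9, 12, 1,
                   6, 7, 0,  0, 5,
                   4, 4, 4,  4, 4),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("orf_a", "orf_b", "orf_c"), paste0("s", 1:5)))

n <- 2  # smallest number of replicates in any treatment

# counts >= 5 is a TRUE/FALSE matrix; rowSums counts the TRUEs per ORF
keep <- rowSums(counts >= 5) >= n
print(keep)

filtered <- counts[keep, , drop = FALSE]
print(nrow(filtered))
```

Here `orf_c` is dropped even though its total count is 20, because no single sample reaches 5 counts.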

In [None]:
dds <- function(raw_counts, metadata, n){
    dds <- DESeqDataSetFromTximport(raw_counts,
                             colData=metadata,
                             design=~treatment)
    dds$treatment <- relevel(dds$treatment, ref = "Low_Iron")
    keep <- rowSums(counts(dds) >=5) >= n #filter out rows with too low expression
    print(nrow(dds))
    dds <- dds[keep, ]
    print(nrow(dds))
    dds
}

dds4 <- dds(counts_4, metadata_4, 2)
dds8 <- dds(counts_8, metadata_8, 2)
dds6 <- dds(counts_6, metadata_6, 3)
dds13 <- dds(counts_13, metadata_13, 3)  # note: metadata_13 lists only 2 Add_Back replicates, so the rule above would give n = 2

### 3. Run differential expression analysis
---
Because the low iron treatment was set as the reference level, `DESeq()` only needs to be run once per organism. The results for each comparison (high iron vs low iron and iron amendment vs low iron) can then be extracted with `results()` by specifying the contrast. `tidy = TRUE` returns a plain data frame with the ORF names in the first column, which is renamed to `orfs` below. 
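
A small base R sketch of why the reference level matters, using a made-up treatment vector. With `Low_Iron` as the first factor level, the design matrix for `~treatment` gets one coefficient per remaining treatment, each measured against `Low_Iron`, so a single model fit covers both comparisons.

```r
# Hypothetical treatment labels for six samples
treatment <- factor(c("Add_Back", "Add_Back", "High_Iron",
                      "High_Iron", "Low_Iron", "Low_Iron"))

# Factor levels are alphabetical by default, so Add_Back would be the reference
print(levels(treatment))

# Move Low_Iron to the front, as the dds function does
treatment <- relevel(treatment, ref = "Low_Iron")
print(levels(treatment))

# Each non-intercept column is one treatment compared against Low_Iron
design <- model.matrix(~ treatment)
print(colnames(design))
```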

In [None]:
# Run differential expression test
de4 <- DESeq(dds4)
de8 <- DESeq(dds8)
de6 <- DESeq(dds6)
de13 <- DESeq(dds13)

# Define contrasts
HvL <- c("treatment", "High_Iron", "Low_Iron")
AvL <- c("treatment", "Add_Back", "Low_Iron")

#results from Iron amendment vs Low Iron
AvL4 <- results(de4, contrast=AvL, tidy=TRUE)
AvL8 <- results(de8, contrast=AvL, tidy=TRUE)
AvL6 <- results(de6, contrast=AvL, tidy=TRUE)
AvL13 <- results(de13, contrast=AvL, tidy=TRUE)

#results from High Iron vs Low Iron
HvL8  <- results(de8, contrast=HvL, tidy=TRUE)
HvL6  <- results(de6, contrast=HvL,  tidy=TRUE)
HvL13  <- results(de13,contrast=HvL, tidy=TRUE)

colnames(AvL4)[1] <- "orfs"
colnames(AvL8)[1] <- "orfs"
colnames(AvL6)[1] <- "orfs"
colnames(AvL13)[1] <- "orfs"

colnames(HvL8)[1] <- "orfs"
colnames(HvL6)[1] <- "orfs"
colnames(HvL13)[1] <- "orfs"

In [45]:
head(dds4@colData)

DataFrame with 5 rows and 4 columns
                     id     isolate treatment         rep
            <character> <character>  <factor> <character>
04add_backA 04add_backA          04  Add_Back           A
04add_backB 04add_backB          04  Add_Back           B
04pFe21_9A   04pFe21_9A          04  Low_Iron           A
04pFe21_9B   04pFe21_9B          04  Low_Iron           B
04pFe21_9C   04pFe21_9C          04  Low_Iron           C

### 4. Make table for differentially expressed ORFs
---
Now that the results have been extracted, the number and percent of differentially expressed ORFs (padj < 0.05) can be calculated and written to a csv file. 

This loop goes through each organism's DESeq2 object and each contrast to extract the results, as above. The number of significantly differentially expressed ORFs, and their percent of all ORFs that passed the count filter, are then calculated and added to a dataframe. 
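
The counting step is sketched below on an invented results table. The point to notice is the handling of missing `padj` values: rows with `NA` are not counted as significant, but they still count toward the denominator.

```r
# Toy results table; padj can be NA for rows DESeq2 did not assign an adjusted p-value
res <- data.frame(orfs = paste0("orf_", 1:8),
                  padj = c(0.001, 0.2, NA, 0.04, 0.5, NA, 0.049, 0.9))

# Treat NA as "not significant"
is_sig <- !is.na(res$padj) & res$padj < 0.05

number_de <- sum(is_sig)
percent_de <- number_de / nrow(res) * 100

print(number_de)
print(percent_de)
```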

In [13]:
contrast = list('HvL' = c("treatment", "High_Iron", "Low_Iron"),
                'AvL' =c("treatment", "Add_Back", "Low_Iron"))
de_output = data.frame('Contrast'=as.character(), 'Organism'=as.character(), 
                       'number_de'=as.numeric(), 'percent_de'=as.numeric())
organism = list(de4, de8, de6, de13)

for (o in organism) {
    # identify the isolate from the 'isolate' column of the sample metadata
    if (colData(o)[1,2] == colData(de4)[1,2]) {
        isolate='C. closterium UGA4'
    } else if (colData(o)[1,2] == colData(de8)[1,2]) {
        isolate='C. closterium UGA8'
    } else if (colData(o)[1,2] == colData(de6)[1,2]) {
        isolate='G. oceanica'
    } else if (colData(o)[1,2] == colData(de13)[1,2]) {
        isolate='G. huxleyi'
    }
    for (c in contrast) {
        # UGA4 has no High_Iron samples, so that contrast is skipped
        if (isolate == 'C. closterium UGA4' && c[2] == 'High_Iron') {
            next
        }
        de.c = results(o, contrast=c, tidy=TRUE)
        # filter() drops rows where padj is NA, so those ORFs are not counted as significant
        de.num = nrow(filter(de.c, padj < 0.05))
        de.percent = de.num / nrow(o) * 100
        deAdd = data.frame('Contrast'= c[2], 'Organism'=isolate, 'number_de'=de.num, 'percent_de'=de.percent)
        de_output = rbind(de_output, deAdd)
    }
}

de_output

write.csv(de_output, '../expression_analysis/de_res_files/de_output.csv', row.names=F)

| Contrast (chr) | Organism (chr) | number_de (int) | percent_de (dbl) |
|---|---|---|---|
| Add_Back | C. closterium UGA4 | 883 | 3.250745 |
| High_Iron | C. closterium UGA8 | 1646 | 6.941047 |
| Add_Back | C. closterium UGA8 | 1914 | 8.071182 |
| High_Iron | G. oceanica | 3150 | 8.974103 |
| Add_Back | G. oceanica | 2195 | 6.253383 |
| High_Iron | G. huxleyi | 2911 | 10.066395 |
| Add_Back | G. huxleyi | 2539 | 8.779999 |


### 5. Normalization for visualization

#### 5.1 Log fold change shrinkage 
---
Using the `ashr` method, the `lfcShrink` function shrinks the log2 fold change estimates toward zero, most strongly for ORFs whose estimates are noisy (for example, those with low counts). The shrunken values are used for later visualization, such as MA plots or PCA plots.

The two functions below run the shrinkage for one contrast each (`lfc.HvL.save` for High Iron vs Low Iron, `lfc.AvL.save` for Add Back vs Low Iron) and save the result as, for example, 'lfc4.AvL.csv' for `de4` and the Add Back vs Low Iron contrast. 
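
To give a feel for what shrinkage does, here is a deliberately simplified sketch. It is not the `ashr` algorithm: it assumes a plain zero-centered normal prior with a made-up variance, under which the shrunken estimate is the raw estimate scaled by `prior_var / (prior_var + se^2)`. The numbers are invented.

```r
# Two ORFs with the same raw log2 fold change but different standard errors
lfc <- c(precise = 2.0, noisy = 2.0)
se  <- c(precise = 0.2, noisy = 2.0)

# Assumed prior variance for the true log2 fold changes (illustrative only)
prior_var <- 1

# Posterior mean under a normal prior centered at zero:
# the noisier the estimate, the harder it is pulled toward zero
shrunk <- lfc * prior_var / (prior_var + se^2)
print(round(shrunk, 3))
```

The precise estimate barely moves, while the noisy one is pulled most of the way to zero, which is the qualitative behavior that makes shrunken values easier to read in an MA plot.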

In [41]:
lfc.HvL.save = function(de,organism){
    lfc.df = lfcShrink(de, contrast = c("treatment", "High_Iron", "Low_Iron"), type='ashr')
    lfc.df = lfc.df %>% as.data.frame() %>% rownames_to_column('orfs')
    write.csv(lfc.df, paste('../expression_analysis/de_res_files/lfc',organism,'.HvL.csv',sep=''),
              row.names=F)
}
lfc.AvL.save = function(de,organism){
    lfc.df = lfcShrink(de, contrast = c('treatment','Add_Back','Low_Iron'), type='ashr')
    lfc.df = lfc.df %>% as.data.frame() %>% rownames_to_column('orfs')
    write.csv(lfc.df, paste('../expression_analysis/de_res_files/lfc',organism,'.AvL.csv',sep=''),
              row.names=F)
}

lfc.HvL.save(de8,'8')
lfc.HvL.save(de6,'6')
lfc.HvL.save(de13,'13')

lfc.AvL.save(de4,'4')
lfc.AvL.save(de8,'8')
lfc.AvL.save(de6,'6')
lfc.AvL.save(de13,'13')  

using 'ashr' for LFC shrinkage. If used in published research, please cite:
    Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
    https://doi.org/10.1093/biostatistics/kxw041


#### 5.2 VSD-normalize counts 
---
The count data can also be normalized for visualization in heatmaps. The `vsd.norm` function takes the ORF counts from the dds object (before the differential expression analysis was performed) and applies a variance stabilizing transformation, which makes the variance of an ORF roughly independent of its mean count. The normalized count data are then ordered by variance across samples, so the top rows have the highest variance in normalized counts, and saved in the vsd folder. 
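
The ordering step can be sketched in base R on an invented matrix. `apply(x, 1, var)` stands in here for `rowVars`, and the values are not real VSD output.

```r
# Toy matrix of transformed counts: 3 ORFs by 4 samples
vsd_mat <- matrix(c(5, 5, 5, 5,
                    1, 9, 1, 9,
                    4, 5, 6, 5),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("orf_flat", "orf_variable", "orf_mild"),
                                  c("add_backA", "add_backB", "pFe21_9A", "pFe21_9B")))

# Variance of each row across samples
row_var <- apply(vsd_mat, 1, var)

# Reorder rows so the most variable ORF comes first
ordered <- vsd_mat[order(row_var, decreasing = TRUE), ]
print(rownames(ordered))
```

Taking the first rows of the ordered matrix then gives the most variable ORFs, which is a common way to pick rows for a heatmap.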

In [52]:
vsd.norm <- function(dds){
    vst(dds,blind=FALSE)
    }

#full vsd deseq2 objects:
vsd4 <- vsd.norm(dds4)
vsd8 <- vsd.norm(dds8)
vsd6 <- vsd.norm(dds6)
vsd13 <- vsd.norm(dds13)

## order the df's by decreasing variance. top rows have highest variance between 
## samples. write dataframe
write.vsd <- function(vsd, org){
    vsd <- assay(vsd)
    vsd_order <- order(rowVars(vsd), decreasing=T)
    vsd_new <- vsd[vsd_order, ]
    print(paste('saving',org,sep=' '))
    vsd_new <- as.data.frame(vsd_new) %>% rownames_to_column("orfs")
    colnames(vsd_new) = str_remove(colnames(vsd_new),'[[:digit:]]{2}')
    write.csv(vsd_new, paste('../expression_analysis/vsd_files/', org, "vsd.csv", sep=""), row.names=FALSE)
}

write.vsd(vsd4, "04")
write.vsd(vsd8, "08")
write.vsd(vsd6, "06")
write.vsd(vsd13, "13")

using 'avgTxLength' from assays(dds), correcting for library size




[1] "saving 04"
[1] "saving 08"
[1] "saving 06"
[1] "saving 13"


In [43]:
head(vsd4)

class: DESeqTransform 
dim: 6 5 
metadata(1): version
assays(1): ''
rownames(6): NODE_10000_length_2060_cov_11.205335_g5364_i0.p1
  NODE_10001_length_2060_cov_5.581278_g5365_i0.p1 ...
  NODE_10003_length_2059_cov_28.617321_g5366_i1.p1
  NODE_10004_length_2059_cov_25.532729_g4469_i1.p1
rowData names(4): baseMean baseVar allZero dispFit
colnames(5): 04add_backA 04add_backB 04pFe21_9A 04pFe21_9B 04pFe21_9C
colData names(4): id isolate treatment rep

### 6. Subset results tables by significance and save
The results data frames are collected into a list and looped through to order rows by adjusted p-value and to pull out the significant rows (padj <= 0.05); both versions are saved. Sorting the _res files by positive or negative log2 fold change then gives the up-regulated and down-regulated genes.
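
A base R sketch of the ordering, subsetting, and up/down split on an invented results table. With the contrasts used in this notebook, a positive `log2FoldChange` means higher expression in the first-named treatment (High Iron or Add Back) than in Low Iron.

```r
# Toy results table with the two columns that matter here
res <- data.frame(orfs = paste0("orf_", 1:5),
                  log2FoldChange = c(2.1, -1.5, 0.3, -3.0, 1.2),
                  padj = c(0.01, 0.03, 0.6, NA, 0.049))

# Order by adjusted p-value; order() places NA values last
res <- res[order(res$padj), ]

# Keep significant rows, treating NA as not significant
sig <- res[!is.na(res$padj) & res$padj <= 0.05, ]

# Split by direction of change
up   <- sig[sig$log2FoldChange > 0, ]
down <- sig[sig$log2FoldChange < 0, ]

print(up$orfs)
print(down$orfs)
```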

In [18]:
obj.names <- character()   # names of the results objects, e.g. "AvL4"
df.names <- character()    # labels used in the file names, e.g. "4AvL"
samples <- c('4','8','6','13')
contrasts <- c('AvL','HvL')

# loop over contrasts first so the labels are built in the same order as the tables
for (c in contrasts){
    for (s in samples){
        if (!exists(paste0(c, s))){
            print(paste('contrast ', c, ' for ', s, ' not found'))
            next}
        obj.names <- append(obj.names, paste0(c, s))
        df.names <- append(df.names, paste0(s, c))
    }
}

# fetch each results table by name so tables and labels cannot get out of step
res.ls <- mget(obj.names)
names(res.ls) <- df.names

res.ls <- lapply(res.ls, function(df){   #order each df in list by adjusted p-value
    arrange(df, padj)
})

#save each df ordered by adjusted p-value
walk2(res.ls, paste0("../expression_analysis/de_res_files/", names(res.ls), "_res.csv"), write.csv,row.names=F)

res.ls.sig <- lapply(res.ls, function(df){   #pull out significant DE's
    filter(df, padj<=0.05)
})

#save significant de's for each df
walk2(res.ls.sig, paste0("../expression_analysis/de_res_files/", names(res.ls), "sig_res.csv"), write.csv,row.names=F)

[1] "contrast  HvL  for  4  not found"


# DE analysis at the KEGG gene level

## 1. Make df of raw counts and sum to KEGG level.
---
#### Read in the list of ORF-to-KOs from the KEGG notebook 
This list repeats ORFs when eggNOG assigned multiple KOs to a single ORF. Using the 'orfs' column from the organism-specific KO list, we map the counts to each row of the ORF-to-KO list. This automatically repeats the counts for each repeated ORF in the list and allows us to sum ORFs with matching KOs later. 

<b> Remember, not all rows from the counts table will have a KO assigned, and those rows will not appear in the ORF-to-KO list. Additionally, not all ORFs annotated by eggNOG were counted by Salmon. Thus we must remove any rows from the ORF-to-KO table whose ORFs are not found in the counts table. </b>

Because some ORFs had multiple KO assignments, the merging relationship is many-to-one: many ORF-KO rows match one counts row. "Each row in x (orf-to-ko list) matches at most 1 row in y (counts table)."

Once the two tables are merged, we can group by ko_id and sum the counts that share a ko_id. The result is a new raw counts table (matrix) which we can use in DESeq2. 
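
The merge-and-sum logic can be sketched in base R with invented ORFs, KO IDs, and counts. `merge` and `rowsum` stand in for the `left_join` and `group_by`/`summarize` calls used in the notebook.

```r
# Toy ORF-to-KO list: orf_1 has two KOs, orf_9 was never counted by Salmon
ko_map <- data.frame(orfs  = c("orf_1", "orf_1", "orf_2", "orf_3", "orf_9"),
                     ko_id = c("K00001", "K00002", "K00001", "K00003", "K00004"))

# Toy counts table: orf_4 has no KO annotation
counts <- data.frame(orfs = c("orf_1", "orf_2", "orf_3", "orf_4"),
                     sampleA = c(10, 5, 7, 100),
                     sampleB = c(20, 1, 0, 50))

# Drop KO rows whose ORF is missing from the counts table
ko_map <- ko_map[ko_map$orfs %in% counts$orfs, ]

# Many-to-one merge: orf_1's counts are repeated once per KO
merged <- merge(ko_map, counts, by = "orfs")

# Sum the counts of all ORFs that share a KO
ko_counts <- rowsum(as.matrix(merged[, c("sampleA", "sampleB")]),
                    group = merged$ko_id)
print(ko_counts)
```

Note that `orf_1` contributes its full counts to both `K00001` and `K00002`, so the KO-level totals can exceed the ORF-level totals.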

In [34]:
sum.kegg <- function(org, counts.raw){
    # ko_df = organism-specific orf-to-ko list
    ko_df <- read.csv(paste('../expression_analysis/kegg_files/ko', org,'_ls.csv', sep=''))
    ko_df <- select(ko_df, c('orfs', 'ko_id'))
    # make raw counts matrix into a tibble
    counts <- as_tibble(counts.raw$counts, rownames = "orfs")
    # remove orfs that eggnog annotated but salmon did not count
    ko <- filter(ko_df, orfs %in% counts$orfs)
    
    # merge the two, so orfs and their counts are repeated when they match to multiple ko's;
    # orfs without a kegg annotation are absent from ko and therefore dropped
    b <- left_join(x=ko, y=counts, by='orfs', relationship="many-to-one")
    if (nrow(b) != nrow(ko)) {
        stop("merged counts and ko_ids should have the same number of rows; merged counts has ",
             nrow(b), " rows and ko_ids has ", nrow(ko))
    }
    #group by ko_id and sum counts for each ko_id
    b <- b %>% select(!orfs) %>% 
        group_by(ko_id) %>% 
        summarize(across(everything(), sum)) %>%
        column_to_rownames("ko_id") %>% 
        as.matrix
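    # DESeqDataSetFromMatrix needs integer counts; this coercion truncates Salmon's fractional estimates
    mode(b) <- 'integer'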
    print(' # unique ko_ids should equal number rows of ko-summed counts.')
    print('# unique ko_ids = ')
    print(length(unique(ko$ko_id)))
    print(' and # ko-summed counts = ')
    print(nrow(b))
    b
    }

kcounts_4 <- sum.kegg('4', counts_4)
kcounts_8 <- sum.kegg('8', counts_8)
kcounts_6 <- sum.kegg('6', counts_6)
kcounts_13 <- sum.kegg('13', counts_13)


[1] " # unique ko_ids should equal number rows of ko-summed counts."
[1] "# unique ko_ids = "
[1] 4246
[1] " and # ko-summed counts = "
[1] 4246
[1] " # unique ko_ids should equal number rows of ko-summed counts."
[1] "# unique ko_ids = "
[1] 4438
[1] " and # ko-summed counts = "
[1] 4438
[1] " # unique ko_ids should equal number rows of ko-summed counts."
[1] "# unique ko_ids = "
[1] 5378
[1] " and # ko-summed counts = "
[1] 5378
[1] " # unique ko_ids should equal number rows of ko-summed counts."
[1] "# unique ko_ids = "
[1] 5273
[1] " and # ko-summed counts = "
[1] 5273


## 2. Create DESeq2 object for KEGG-summed counts
---
`DESeqDataSetFromMatrix` is used here because the counts are now summed to the KEGG KO level and stored in a matrix. The function otherwise mirrors `dds` above, except that the count threshold is 10 rather than 5. 
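
`DESeqDataSetFromMatrix` expects integer counts, while the summed Salmon estimates are fractional. The sketch below, on invented values, compares the `mode(x) <- 'integer'` coercion used in `sum.kegg` with rounding first; rounding is shown only as an alternative one might consider, not as what the notebook does.

```r
# Toy fractional counts, as Salmon estimates usually are
x <- matrix(c(0.9, 1.5, 2.49, 10.99), nrow = 2)

# Coercing the storage mode drops the fractional part
truncated <- x
mode(truncated) <- "integer"

# Rounding first keeps values closer to the original estimates
rounded <- round(x)
mode(rounded) <- "integer"

print(as.vector(truncated))
print(as.vector(rounded))
```

The difference is at most one count per entry, so it matters mainly for KOs with very low counts.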

In [35]:
## create deseq object from matrix
k.dds <- function(k.counts, metadata,n){
    #n = lowest number of reps in any treatment
   dds <-  DESeqDataSetFromMatrix(
       countData = k.counts, 
       colData = metadata, 
       design=~treatment)
    dds$treatment <- relevel(dds$treatment, ref = "Low_Iron")
    keep <- rowSums(counts(dds) >=10) >= n #filter out rows with too low expression
    dds <- dds[keep, ]
    dds}

k.dds4 <- k.dds(kcounts_4, metadata_4, 2)
k.dds8 <- k.dds(kcounts_8, metadata_8, 2)
k.dds6 <- k.dds(kcounts_6, metadata_6, 3)
k.dds13 <- k.dds(kcounts_13, metadata_13, 3)

“some variables in design formula are characters, converting to factors”


## 3. Run differential expression on Kegg counts
---
This process works just like above. The <i>contrasts</i>, or comparisons, are the same as before; they are redefined in the cell so that it also runs in a fresh session. 

In [36]:
## run differential expression analysis
de4.k <- DESeq(k.dds4)
de8.k <- DESeq(k.dds8)
de6.k <- DESeq(k.dds6)
de13.k <- DESeq(k.dds13)

# Define contrasts (same as in the ORF-level analysis)
HvL <- c("treatment", "High_Iron", "Low_Iron")
AvL <- c("treatment", "Add_Back", "Low_Iron")

#results from Iron amendment vs Low Iron 
AvL4.k <- results(de4.k, contrast=AvL, tidy=TRUE)
AvL8.k <- results(de8.k, contrast=AvL, tidy=TRUE)
AvL6.k <- results(de6.k, contrast=AvL, tidy=TRUE)
AvL13.k <- results(de13.k, contrast=AvL, tidy=TRUE)

#results from High Iron vs Low Iron
HvL8.k  <- results(de8.k, contrast=HvL, tidy=TRUE)
HvL6.k  <- results(de6.k, contrast=HvL,  tidy=TRUE)
HvL13.k  <- results(de13.k,contrast=HvL, tidy=TRUE)

#save deseq results

AvL4.k %>% na.omit() %>% write.csv('../expression_analysis/de_res_files/AvL4.kegg.csv', row.names=F)
AvL8.k %>% na.omit() %>% write.csv('../expression_analysis/de_res_files/AvL8.kegg.csv', row.names=F)
AvL6.k %>% na.omit() %>% write.csv('../expression_analysis/de_res_files/AvL6.kegg.csv', row.names=F)
AvL13.k %>% na.omit() %>% write.csv('../expression_analysis/de_res_files/AvL13.kegg.csv', row.names=F)

HvL8.k %>% na.omit() %>% write.csv('../expression_analysis/de_res_files/HvL8.kegg.csv', row.names=F)
HvL6.k %>% na.omit() %>% write.csv('../expression_analysis/de_res_files/HvL6.kegg.csv', row.names=F)
HvL13.k %>% na.omit() %>% write.csv('../expression_analysis/de_res_files/HvL13.kegg.csv', row.names=F)

estimating size factors

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing



## 4. Make table for differentially expressed KEGG genes
---
This loop is identical to the previous one, but the organisms (DESeq2 objects) refer to the KEGG-level results.

In [37]:
contrast = list('HvL' = c("treatment", "High_Iron", "Low_Iron"),
                'AvL' =c("treatment", "Add_Back", "Low_Iron"))
de_output = data.frame('Contrast'=as.character(), 'Organism'=as.character(), 
                       'number_de'=as.numeric(), 'percent_de'=as.numeric())
ko.organism = list(de4.k, de8.k, de6.k, de13.k)

for (o in ko.organism) {
    # identify the isolate from the 'isolate' column of the sample metadata
    if (colData(o)[1,2] == colData(de4.k)[1,2]) {
        isolate='C. closterium UGA4'
    } else if (colData(o)[1,2] == colData(de8.k)[1,2]) {
        isolate='C. closterium UGA8'
    } else if (colData(o)[1,2] == colData(de6.k)[1,2]) {
        isolate='G. oceanica'
    } else if (colData(o)[1,2] == colData(de13.k)[1,2]) {
        isolate='G. huxleyi'
    }
    for (c in contrast) {
        # UGA4 has no High_Iron samples, so that contrast is skipped
        if (isolate == 'C. closterium UGA4' && c[2] == 'High_Iron') {
            next
        }
        de.c = results(o, contrast=c, tidy=TRUE)
        # filter() drops rows where padj is NA, so those KOs are not counted as significant
        de.num = nrow(filter(de.c, padj < 0.05))
        de.percent = de.num / nrow(o) * 100
        deAdd = data.frame('Contrast'= c[2], 'Organism'=isolate, 'number_de'=de.num, 'percent_de'=de.percent)
        de_output = rbind(de_output, deAdd)
    }
}

de_output

write.csv(de_output, '../expression_analysis/de_res_files/de_ko_output.csv', row.names=F)


| Contrast (chr) | Organism (chr) | number_de (int) | percent_de (dbl) |
|---|---|---|---|
| Add_Back | C. closterium UGA4 | 207 | 6.569343 |
| High_Iron | C. closterium UGA8 | 292 | 9.624258 |
| Add_Back | C. closterium UGA8 | 829 | 27.323665 |
| High_Iron | G. oceanica | 1382 | 27.617906 |
| Add_Back | G. oceanica | 1291 | 25.799361 |
| High_Iron | G. huxleyi | 1681 | 36.134996 |
| Add_Back | G. huxleyi | 1507 | 32.394669 |


## 5. Normalization for visualization
---
### 5.1 Log fold change shrinkage on differential expression analysis results for PCA and MA plots

In [38]:

lfc.HvL.save = function(de,organism){
    lfc.df = lfcShrink(de, contrast = c("treatment", "High_Iron", "Low_Iron"), type='ashr')
    lfc.df = lfc.df %>% as.data.frame() %>% rownames_to_column('ko_id')
    write.csv(lfc.df, paste('../expression_analysis/de_res_files/lfc',organism,'.HvL.kegg.csv',sep=''),
              row.names=F)
}
lfc.AvL.save = function(de,organism){
    lfc.df = lfcShrink(de, contrast = c('treatment','Add_Back','Low_Iron'), type='ashr')
    lfc.df = lfc.df %>% as.data.frame() %>% rownames_to_column('ko_id')
    write.csv(lfc.df, paste('../expression_analysis/de_res_files/lfc',organism,'.AvL.kegg.csv',sep=''),
              row.names=F)
}

lfc.HvL.save(de8.k,'8')
lfc.HvL.save(de6.k,'6')
lfc.HvL.save(de13.k,'13')

lfc.AvL.save(de4.k,'4')
lfc.AvL.save(de8.k,'8')
lfc.AvL.save(de6.k,'6')
lfc.AvL.save(de13.k,'13')    

using 'ashr' for LFC shrinkage. If used in published research, please cite:
    Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
    https://doi.org/10.1093/biostatistics/kxw041


### 5.2 VST normalization on kegg summed counts for heatmaps
---

In [39]:
## vsd normalize and save ordered by variance
vsd.norm <- function(dds){
    varianceStabilizingTransformation(dds,blind=FALSE)
    }
k.vsd4 <- vsd.norm(k.dds4)
k.vsd8 <- vsd.norm(k.dds8)
k.vsd6 <- vsd.norm(k.dds6)
k.vsd13 <- vsd.norm(k.dds13)

write.vsd.k <- function(vsd, org){
    vsd <- assay(vsd)
    vsd_order <- order(rowVars(vsd), decreasing=T)
    vsd_new <- vsd[vsd_order, ]
    colnames(vsd_new) = str_remove(colnames(vsd_new),'[[:digit:]]{2}')
    write.csv(vsd_new, paste('../expression_analysis/vsd_files/', org, "vsd.kegg.csv", sep=""))
}

write.vsd.k(k.vsd4, '04')
write.vsd.k(k.vsd8, '08')
write.vsd.k(k.vsd6, '06')
write.vsd.k(k.vsd13, '13')