# Differential gene expression with DESeq2


This notebook creates two outputs:
1. a matrix of significantly differentially expressed ORFs from the results of the differential expression analysis
2. a matrix of VSD-normalized counts ordered by variance across samples

Both files are created for transcript-level data and for "gene-level" data, which uses only transcripts with KEGG annotations and sums counts by KEGG annotation. 
<b>The second method produces data with KEGG annotations as row names, meaning each annotation appears once in the matrix. </b>


### Prepare environment
----

In [None]:
 
library('tximport', quietly=T)
library('DESeq2',quietly=T)
library('ashr',quietly=T)
library('tibble',quietly=T)
library('tidyverse',quietly=T)
library('Glimma',quietly=T)
library('RCurl',quietly=T)

## DE analysis on all open reading frames (ORFs)
### 1. Read in Salmon counts
----
First, the count data from Salmon are read in with the `read.in` function, which uses:
- a pattern matching all Salmon output files
- the directory containing each Salmon file

My data were separated into different folders, one for each organism, with the Salmon output placed within each. The pattern and directory can be changed to match your file organization scheme. The raw counts are finally read in with `tximport`, specifying that <i>Salmon</i> was used.

Next, metadata are created for each Salmon file from the sample name embedded in its file path. The name combines all three metadata categories, organism_treatment_replicate, so I extract it and use a pattern matching each category to separate them accordingly. 
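
As a small illustration of the metadata step, the sketch below parses treatment and replicate out of sample ids using base R only. The ids (`pFe19A`, `pFe21_9B`, `add_backC`) are hypothetical examples of the naming scheme, not actual file names from this project.

```r
# Hypothetical sample ids; the real ones are extracted from the Salmon file paths
ids <- c("pFe19A", "pFe21_9B", "add_backC")

# Treatment code: letters followed by 19, 21_9 or _back
treat_code <- regmatches(ids, regexpr("[[:alpha:]]+(19|21_9|_back)", ids))

# Replicate: the trailing A, B or C
rep_id <- regmatches(ids, regexpr("[ABC]$", ids))

# Translate treatment codes to readable treatment names
lookup <- c(pFe19 = "High_Iron", pFe21_9 = "Low_Iron", add_back = "Add_Back")

metadata <- data.frame(id = ids,
                       treatment = unname(lookup[treat_code]),
                       rep = rep_id,
                       stringsAsFactors = FALSE)
print(metadata)
```

Using a named lookup vector keeps the code-to-name mapping in one place, so a new treatment only needs one extra entry.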

In [None]:
# Pattern that extracts the sample id (treatment + replicate) from each file path
pattern='[[:alpha:]]+([[:digit:]]{2}|_)(_9[[:alpha:]]|[[:alpha:]]*)'

# Read the Salmon quantification files for one organism with tximport
read.in <- function(org){
    dir <- paste("/work/nclab/lucy/SAB/Assembly/",org,"/salmon",sep='')
    # "\\.sf$" matches only file names ending in .sf
    files <- file.path(dir,list.files(dir,pattern="\\.sf$",recursive=TRUE))
    
    names(files)=str_extract(files,pattern)
    names(files)=str_replace(names(files),'oFe', 'pFe') #correct a sample for 08 from oFe to pFe

    if (!all(file.exists(files))) {
        print("ERROR IN FILE NAMES, not all files exist")
        print(paste("Directory:", dir))
        print(paste("Files:", files))
    }
    
    # txOut = TRUE keeps transcript-level (ORF-level) estimates
    raw_counts <- tximport(files, type='salmon', txOut = TRUE) 
    raw_counts
}

# Build the sample metadata (id, isolate, treatment, replicate) for one organism
create.metadata=function(org){
    dir <- paste("/work/nclab/lucy/SAB/Assembly/",org,"/salmon",sep='')
    files <- file.path(dir,list.files(dir,pattern="\\.sf$",recursive=TRUE))
    
    id=str_extract(files,pattern)
    id=str_replace(id,'oFe', 'pFe')
    metadata=data.frame('id'=id,
                        'isolate'=org,
                        'treatment'=str_extract(id,'[[:alpha:]]+(19|21_9|_back)'),
                        'rep'=str_extract(id, 'A|B|C'))
    metadata$treatment=str_replace_all(
        metadata$treatment,
        c('pFe19'='High_Iron', 'pFe21_9'='Low_Iron','add_back'='Add_Back'))
    print(metadata)
    metadata
}
counts_4=read.in('04')
metadata_4=create.metadata('04')

counts_8=read.in('08')
metadata_8=create.metadata('08')

counts_6=read.in('06')
metadata_6=create.metadata('06')

counts_13=read.in('13')
metadata_13=create.metadata('13')


### 2. Create DESeq2 object
---------
A DESeq2 object must be made to perform the differential expression analysis; this is done with the `dds` function. Since `tximport` was used to read in the data, I used `DESeqDataSetFromTximport`. The `dds` function completes a few more tasks: it sets the low iron treatment as the reference level (so that the other treatments are compared against it), and it filters out low-count ORFs, keeping only ORFs with at least 5 counts in at least n samples, where <b>n = lowest # of replicates in any treatment.</b> 
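
The filtering rule can be checked on a toy matrix. This is a base R sketch with made-up counts (the ORF and sample names are hypothetical); it applies the same `rowSums(counts >= 5) >= n` test used in the `dds` function.

```r
# Toy count matrix: 4 ORFs x 6 samples (values are made up for illustration)
counts <- matrix(c( 0,  1, 2, 0, 3, 1,
                   10, 12, 0, 0, 8, 9,
                    5,  5, 0, 0, 0, 0,
                    4,  4, 4, 4, 4, 4),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("orf_", 1:4), paste0("s", 1:6)))

n <- 2  # smallest number of replicates in any treatment

# Keep an ORF when at least n samples have 5 or more counts
keep <- rowSums(counts >= 5) >= n
filtered <- counts[keep, , drop = FALSE]
print(rownames(filtered))
```

Note that `orf_4` is dropped even though its total count (24) is higher than that of `orf_3` (10): the rule looks at per-sample counts, not row totals.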

In [None]:
dds <- function(raw_counts, metadata, n){
    dds <- DESeqDataSetFromTximport( raw_counts,
                             colData=metadata,
                             design=~treatment)
    dds$treatment <- relevel(dds$treatment, ref = "Low_Iron")
    keep <- rowSums(counts(dds) >=5) >= n #filter out rows with too low expression
    print(nrow(dds))
    dds <- dds[keep, ]
    print(nrow(dds))
    dds
}

dds4 <- dds(counts_4, metadata_4, 2)
dds8 <- dds(counts_8, metadata_8, 2)
dds6 <- dds(counts_6, metadata_6, 3)
dds13 <- dds(counts_13, metadata_13, 3)

### 3. Run differential expression analysis
---
Because the low iron treatment was set as the base level, only one differential expression test needs to be run per organism. The results for each comparison (high iron vs low iron and iron amendment vs low iron) are then extracted with `results()` by specifying the contrast. `tidy = TRUE` returns a plain data frame with the row names moved into the first column. 
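
To see why a single fit covers both comparisons against low iron, it helps to look at the design matrix. The base R sketch below uses a hypothetical six-sample layout (two replicates per treatment); with `Low_Iron` as the reference level, `~treatment` yields one coefficient for each of the other two treatments.

```r
# Hypothetical sample layout: two replicates per treatment
treatment <- factor(c("Add_Back", "Add_Back",
                      "High_Iron", "High_Iron",
                      "Low_Iron", "Low_Iron"))

# Default factor levels are alphabetical, so Add_Back would be the reference
print(levels(treatment))

# Make Low_Iron the reference level, as done in the dds function
treatment <- relevel(treatment, ref = "Low_Iron")

# One intercept (Low_Iron) plus one coefficient per non-reference treatment
design <- model.matrix(~treatment)
print(colnames(design))
```

Each non-intercept column corresponds to one "vs low iron" comparison, which is why both can be read from the same fitted model.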

In [10]:
# Run differential expression test
de4 <- DESeq(dds4)
de8 <- DESeq(dds8)
de6 <- DESeq(dds6)
de13 <- DESeq(dds13)

# Define contrasts
HvL <- c("treatment", "High_Iron", "Low_Iron")
AvL <- c("treatment", "Add_Back", "Low_Iron")

#results from Iron amendment vs Low Iron
AvL4 <- results(de4, contrast=AvL, tidy=TRUE)
AvL8 <- results(de8, contrast=AvL, tidy=TRUE)
AvL6 <- results(de6, contrast=AvL, tidy=TRUE)
AvL13 <- results(de13, contrast=AvL, tidy=TRUE)

#results from High Iron vs Low Iron
HvL8  <- results(de8, contrast=HvL, tidy=TRUE)
HvL6  <- results(de6, contrast=HvL,  tidy=TRUE)
HvL13  <- results(de13,contrast=HvL, tidy=TRUE)

colnames(AvL4)[1] <- "orfs"
colnames(AvL8)[1] <- "orfs"
colnames(AvL6)[1] <- "orfs"
colnames(AvL13)[1] <- "orfs"

colnames(HvL8)[1] <- "orfs"
colnames(HvL6)[1] <- "orfs"
colnames(HvL13)[1] <- "orfs"

estimating size factors

using 'avgTxLength' from assays(dds), correcting for library size

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing

estimating size factors

using 'avgTxLength' from assays(dds), correcting for library size

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing

estimating size factors

using 'avgTxLength' from assays(dds), correcting for library size

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing

estimating size factors

using 'avgTxLength' from assays(dds), correcting for library size

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing



### 4. Make table for differentially expressed ORFs
---
Now that the results from the test have been extracted, the number and percent of differentially expressed ORFs (padj < 0.05) can be calculated and written to a csv file. 

This loop goes through each organism's DESeq2 object and, for each contrast, extracts the results as above. The number and percent of significantly differentially expressed ORFs are then calculated and added to a data frame. 
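
The count and percent calculation itself is simple; the sketch below shows it on a made-up vector of adjusted p-values. DESeq2 can report `NA` for `padj`, so the example includes some. As in the loop, the denominator is the total number of rows, including those with `NA`.

```r
# Made-up adjusted p-values for 8 ORFs; NA marks ORFs without an adjusted p-value
padj <- c(0.001, 0.2, NA, 0.04, 0.5, NA, 0.049, 0.9)

# Number of ORFs below the 0.05 threshold (NA values are not counted)
n_de <- sum(padj < 0.05, na.rm = TRUE)

# Percent relative to all ORFs that were tested
percent_de <- n_de / length(padj) * 100

print(c(number_de = n_de, percent_de = percent_de))
```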

In [17]:
contrast = list('HvL' = c("treatment", "High_Iron", "Low_Iron"),
                'AvL' =c("treatment", "Add_Back", "Low_Iron"))
de_output = data.frame('Contrast'=as.character(), 'Organism'=as.character(), 
                       'number_de'=as.numeric(), 'percent_de'=as.numeric())
organism = list(de4, de8, de6, de13)

for (o in organism) {
     if (colData(o)[1,2] == colData(de4)[1,2]) {
        isolate='C. closterium UGA4'
    }else if (colData(o)[1,2] == colData(de8)[1,2]) {
        isolate='C. closterium UGA8'
    }else if (colData(o)[1,2] == colData(de6)[1,2]) {
           isolate='G. oceanica'
    }else if (colData(o)[1,2] == colData(de13)[1,2]) {            
        isolate='G. huxleyi'}
    for (c in contrast) {
        if (isolate =='C. closterium UGA4' & c[2]=='High_Iron'){
            next
        }
        de.c = results(o, contrast=c, tidy=TRUE)
        de.percent = (nrow(filter(de.c,(padj < 0.05)==T))/nrow(o)*100)
        de.num = nrow(filter(de.c,(padj < 0.05) ==T))
        deAdd = data.frame('Contrast'= c[2], 'Organism'=isolate, 'number_de'=de.num, 'percent_de'=de.percent)
        de_output = rbind(de_output, deAdd)
    }
              }

de_output

write.csv(de_output, './de_res_files/de_output.csv', row.names=F)

### 2.2 VSD Normalize counts 
---
Counts are variance-stabilized with `vst()`, then rows are ordered by variance across samples, so the top rows have the highest variance in normalized counts. The ordered tables are saved in the `./vsd_files/` folder. 
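
The ordering step can be sketched in base R on a toy matrix (values and ORF names are made up). The notebook uses `rowVars()`; here `apply(..., 1, var)` gives the same per-row variance without extra packages.

```r
# Toy matrix of normalized values: 3 ORFs x 4 samples (made-up numbers)
vsd <- matrix(c(5, 5, 5, 5,
                1, 9, 1, 9,
                4, 5, 6, 5),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("orf_a", "orf_b", "orf_c"), paste0("s", 1:4)))

# Per-row variance across samples
row_var <- apply(vsd, 1, var)

# Highest-variance rows first
vsd_ordered <- vsd[order(row_var, decreasing = TRUE), ]
print(rownames(vsd_ordered))
```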

In [8]:
vsd.norm <- function(dds){
    vst(dds,blind=FALSE)
    }

#full vsd deseq2 objects:
vsd4 <- vsd.norm(dds4)
vsd8 <- vsd.norm(dds8)
vsd6 <- vsd.norm(dds6)
vsd13 <- vsd.norm(dds13)

## order the df's by decreasing variance. top rows have highest variance between 
## samples. write dataframe
write.vsd <- function(vsd, org){
    vsd <- assay(vsd)
    vsd_order <- order(rowVars(vsd), decreasing=T)
    vsd_new <- vsd[vsd_order, ]
    print(paste('saving',org,sep=' '))
    vsd_new <- as.data.frame(vsd_new) %>% rownames_to_column("orfs")
    write.csv(vsd_new, paste('./vsd_files/', org, "vsd.csv", sep=""), row.names=FALSE)
}

write.vsd(vsd4, "04")
write.vsd(vsd8, "08")
write.vsd(vsd6, "06")
write.vsd(vsd13, "13")

using 'avgTxLength' from assays(dds), correcting for library size

using 'avgTxLength' from assays(dds), correcting for library size

using 'avgTxLength' from assays(dds), correcting for library size

using 'avgTxLength' from assays(dds), correcting for library size



[1] "saving 04"
[1] "saving 08"
[1] "saving 06"
[1] "saving 13"


## Log2 fold change shrinkage

In [12]:

lfc4.AvL <- lfcShrink(de4,  contrast=c("treatment", "Add_Back", "Low_Iron"), type='ashr')
lfc4.AvL <- lfc4.AvL %>% as.data.frame() %>% rownames_to_column("orfs") 
write.csv(lfc4.AvL, "./de_res_files/lfc4.AvL.csv", row.names=F)

lfc8.AvL <- lfcShrink(de8, contrast=c("treatment", "Add_Back", "Low_Iron"), type='ashr')
lfc8.AvL <- lfc8.AvL %>% as.data.frame() %>% rownames_to_column("orfs") 
write.csv(lfc8.AvL, "./de_res_files/lfc8.AvL.csv", row.names=F)

lfc6.AvL <- lfcShrink(de6, contrast=c("treatment", "Add_Back", "Low_Iron"), type='ashr')
lfc6.AvL <- lfc6.AvL %>% as.data.frame() %>% rownames_to_column("orfs") 
write.csv(lfc6.AvL, "./de_res_files/lfc6.AvL.csv", row.names=F)

lfc13.AvL <- lfcShrink(de13,  contrast=c("treatment", "Add_Back", "Low_Iron"), type='ashr')
lfc13.AvL <- lfc13.AvL %>% as.data.frame() %>% rownames_to_column("orfs") 
write.csv(lfc13.AvL, "./de_res_files/lfc13.AvL.csv", row.names=F)

lfc8.HvL <- lfcShrink(de8, contrast=c("treatment", "High_Iron", "Low_Iron"), type='ashr')
lfc8.HvL <- lfc8.HvL %>% as.data.frame() %>% rownames_to_column("orfs") 
write.csv(lfc8.HvL, "./de_res_files/lfc8.HvL.csv", row.names=F)

lfc6.HvL <- lfcShrink(de6, contrast=c("treatment", "High_Iron", "Low_Iron"), type='ashr')
lfc6.HvL <- lfc6.HvL %>% as.data.frame() %>% rownames_to_column("orfs") 
write.csv(lfc6.HvL, "./de_res_files/lfc6.HvL.csv", row.names=F)

lfc13.HvL <- lfcShrink(de13,  contrast=c("treatment", "High_Iron", "Low_Iron"), type='ashr')
lfc13.HvL <- lfc13.HvL %>% as.data.frame() %>% rownames_to_column("orfs") 
write.csv(lfc13.HvL, "./de_res_files/lfc13.HvL.csv", row.names=F)

lfc8.AvH <- lfcShrink(de8, contrast=c("treatment", "Add_Back", "High_Iron"), type='ashr')
lfc8.AvH <- lfc8.AvH %>% as.data.frame() %>% rownames_to_column("orfs") 
write.csv(lfc8.AvH, "./de_res_files/lfc8.AvH.csv", row.names=F)

lfc6.AvH <- lfcShrink(de6, contrast=c("treatment", "Add_Back", "High_Iron"), type='ashr')
lfc6.AvH <- lfc6.AvH %>% as.data.frame() %>% rownames_to_column("orfs") 
write.csv(lfc6.AvH, "./de_res_files/lfc6.AvH.csv", row.names=F)

lfc13.AvH <- lfcShrink(de13,  contrast=c("treatment", "Add_Back", "High_Iron"), type='ashr')
lfc13.AvH <- lfc13.AvH %>% as.data.frame() %>% rownames_to_column("orfs") 
write.csv(lfc13.AvH, "./de_res_files/lfc13.AvH.csv", row.names=F)

using 'ashr' for LFC shrinkage. If used in published research, please cite:
    Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
    https://doi.org/10.1093/biostatistics/kxw041

using 'ashr' for LFC shrinkage. If used in published research, please cite:
    Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
    https://doi.org/10.1093/biostatistics/kxw041

using 'ashr' for LFC shrinkage. If used in published research, please cite:
    Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
    https://doi.org/10.1093/biostatistics/kxw041

using 'ashr' for LFC shrinkage. If used in published research, please cite:
    Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
    https://doi.org/10.1093/biostatistics/kxw041

using 'ashr' for LFC shrinkage. If used in published research, please cite:
    Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
    https://doi.org/

## 2.4 Subset results tables by significance and save
The results data frames are collected into a list and looped through to order rows by adjusted p-value and to pull out the significant rows (padj <= 0.05); both versions are saved. Sorting the _res files by positive or negative log2 fold change then gives the up-regulated and down-regulated genes.
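
A minimal base R sketch of the ordering, filtering and up/down split, using a made-up results table (the ORF names and values are hypothetical, but the column names follow the DESeq2 results):

```r
# Made-up results table with the columns used for sorting and filtering
res <- data.frame(orfs = paste0("orf_", 1:5),
                  log2FoldChange = c(2.1, -1.5, 0.3, -3.0, 1.2),
                  padj = c(0.001, 0.03, 0.6, NA, 0.04),
                  stringsAsFactors = FALSE)

# Order by adjusted p-value (rows with NA padj go last)
res <- res[order(res$padj), ]

# Significant rows only; rows with NA padj are dropped explicitly
sig <- res[!is.na(res$padj) & res$padj <= 0.05, ]

# Split by the sign of the log2 fold change
up   <- sig[sig$log2FoldChange > 0, ]
down <- sig[sig$log2FoldChange < 0, ]

print(up$orfs)
print(down$orfs)
```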

In [13]:
samples <- c('4','8','6','13')
contrasts <- c('AvL','HvL','AvH')

# Build the list by looking each results table up by name (e.g. "AvL4"), so
# that list names and contents always match. A contrast is skipped when its
# results object has not been created in this session.
res.ls <- list()
for (s in samples){
    for (c in contrasts){
        obj <- paste(c, s, sep="")
        if (!exists(obj)){
            print(paste('contrast ', c, ' for ', s, ' not found'))
            next}
        res.ls[[paste(s, c, sep='')]] <- get(obj)
    }
}

res.ls <- lapply(res.ls, function(df){   #order each df in list by p.value
    arrange(df, padj)
})

#save each df ordered by p.value
walk2(res.ls, paste0("./de_res_files/", names(res.ls), "_res.csv", sep=""), write.csv,row.names=F)

res.ls.sig <- lapply(res.ls, function(df){   #pull out significant DE's
    filter(df, padj<=0.05)
})

#save significant de's for each df
walk2(res.ls.sig, paste0("./de_res_files/", names(res.ls), "sig_res.csv", sep=""), write.csv,row.names=F)

[1] "contrast  HvL  for  4  not found"
[1] "contrast  AvH  for  4  not found"


# Now repeat steps but summarize to 'kegg gene level'
## 1. Make df of raw counts which are summed to kegg level.
#### Read in the list of ORF-to-Ko's for each sample 
This list repeats ORFs when multiple KOs were assigned to a single ORF by eggNOG. Using the 'orfs' column from the organism-specific KO list, we want to map the counts to each row of the ORF-to-KO list. This automatically repeats the counts for each repeated ORF in the list and allows us to sum ORFs with matching KOs later. 

<b> Remember, not all rows from the counts table will have a KO assigned, and those rows will not appear in the ORF-to-KO list. Additionally, not all ORFs annotated by eggNOG were counted by Salmon. Thus we must remove any rows from the ORF-to-KO table for which ORFs are not found in the counts table. </b>

Because some ORFs had multiple KO assignments, the merging relationship will be many-to-one: many ORF-KO rows matching to one counts row. "Each row in x (orf-to-ko list) matches at most 1 row in y (counts table)."

Once the two tables are merged, we can group by ko_id and sum counts which have the same ko_id. The result is a new raw counts table (matrix) which can be used in DESeq2. 
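
The merge-and-sum logic can be sketched in base R on toy data. The ORF names, KO ids and counts below are made up; `rowsum()` stands in for the `group_by()`/`summarize()` step used in the notebook.

```r
# Toy ORF-to-KO list: orf1 has two KOs, orf9 was annotated but never counted
orf_to_ko <- data.frame(orfs  = c("orf1", "orf1", "orf2", "orf3", "orf9"),
                        ko_id = c("K00001", "K00002", "K00001", "K00003", "K00004"),
                        stringsAsFactors = FALSE)

# Toy counts: orf4 was counted but has no KO annotation
counts <- matrix(c(10, 20,
                    5,  5,
                    1,  0,
                    7,  7),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("orf1", "orf2", "orf3", "orf4"), c("s1", "s2")))

# Drop annotated ORFs that are missing from the counts table
ko <- orf_to_ko[orf_to_ko$orfs %in% rownames(counts), ]

# Repeat each ORF's counts once per KO assignment (many-to-one lookup)
expanded <- counts[ko$orfs, , drop = FALSE]

# Sum counts over all ORFs that share a KO id
ko_counts <- rowsum(expanded, group = ko$ko_id)
print(ko_counts)
```

Here `K00001` receives the counts of both `orf1` and `orf2`, while `orf1` also contributes in full to `K00002`. An ORF with several KOs is therefore counted once per KO, which is worth keeping in mind when interpreting the summed table.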

In [6]:
ko.def=read.csv('../kegg_names/ko_def.csv')
sum.kegg <- function(org, counts.raw){
    ko_df= read.csv(paste('../kegg_names/ko', org,'_ls.csv', sep=''))
    #ko_df = organism-specific orf-to-ko list
    ko_df <- select(ko_df, c('orfs', 'ko_id'))
    
    # make raw counts matrix into a tibble
    counts <- as_tibble(counts.raw$counts, rownames = "orfs")
    
    # remove orfs that eggnog annotated but salmon did not count
    ko <- filter(ko_df, (ko_df$orfs %in% counts$orfs)==TRUE)
    
    # remove orfs that did not have a matching kegg annotation
    #counts.ko <- filter(counts, (counts$orfs%in%ko_df$orfs)==TRUE)

    
    # merge the two, so orfs and their counts are repeated when they match to multiple ko's
    b <- left_join(x=ko, y=counts, by='orfs', relationship="many-to-one")
    # a many-to-one join must return exactly one row per orf-to-ko row
    if(nrow(b) != nrow(ko)){
        print(paste("ERROR: merged counts and ko_ids should have the same number of rows; ",
                    "merged counts has ", nrow(b), " rows and ko_ids has ", 
                    nrow(ko), sep=''))
    }

    #group by ko_id and sum counts for each ko_id
    b <- b %>% select(!orfs) %>% 
        group_by(ko_id) %>% 
        summarize(across(everything(), sum)) %>%
        column_to_rownames("ko_id") %>% 
        as.matrix
    
    mode(b) <- 'integer'
    print(' # unique ko_ids should equal number rows of ko-summed counts.')
    print('# unique ko_ids = ')
    print(length(unique(ko$ko_id)))
    print(' and # ko-summed counts = ')
    print(nrow(b))
    b
    }

kcounts_4 <- sum.kegg('4', counts_4)
kcounts_8 <- sum.kegg('8', counts_8)
kcounts_6 <- sum.kegg('6', counts_6)
kcounts_13 <- sum.kegg('13', counts_13)


[1] " # unique ko_ids should equal number rows of ko-summed counts."
[1] "# unique ko_ids = "
[1] 4243
[1] " and # ko-summed counts = "
[1] 4243
[1] " # unique ko_ids should equal number rows of ko-summed counts."
[1] "# unique ko_ids = "
[1] 4436
[1] " and # ko-summed counts = "
[1] 4436
[1] " # unique ko_ids should equal number rows of ko-summed counts."
[1] "# unique ko_ids = "
[1] 5374
[1] " and # ko-summed counts = "
[1] 5374
[1] " # unique ko_ids should equal number rows of ko-summed counts."
[1] "# unique ko_ids = "
[1] 5270
[1] " and # ko-summed counts = "
[1] 5270


## 2. Run DESeq2 on kegg-summed counts
Using `DESeqDataSetFromMatrix`, read in the summed counts matrix.

Next, add the KO name and symbol to the row metadata of each DESeq2 object. Then run the DE analysis and the VST normalization. 
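
Attaching annotation as row metadata only works if the annotation rows are in the same order as the rows of the count object. A base R sketch of that alignment with `match()`, using made-up KO ids and symbols:

```r
# Row names of a hypothetical KO-summed count object, in arbitrary order
ko_in_counts <- c("K00003", "K00001", "K00002")

# Hypothetical annotation table with an extra KO that is not in the counts
anno <- data.frame(ko_id  = c("K00001", "K00002", "K00003", "K00004"),
                   symbol = c("sym_a", "sym_b", "sym_c", "sym_d"),
                   stringsAsFactors = FALSE)

# match() gives, for each count row, the position of its KO in the annotation
anno_aligned <- anno[match(ko_in_counts, anno$ko_id), ]

# After alignment the ids agree row by row
print(all(anno_aligned$ko_id == ko_in_counts))
```

A KO that is missing from the annotation table would produce an `NA` row here, so checking the result before binding it to the object is a reasonable safeguard.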

In [7]:
## create deseq object from matrix
k.dds <- function(k.counts, metadata,n){
    #n = lowest number of reps in any treatment
   dds <-  DESeqDataSetFromMatrix(
       countData = k.counts, 
       colData = metadata, 
       design=~treatment)
    dds$treatment <- relevel(dds$treatment, ref = "Low_Iron")
    keep <- rowSums(counts(dds) >=10) >= n #filter out rows with too low expression
    dds <- dds[keep, ]
    dds}

k.dds4 <- k.dds(kcounts_4, metadata_4, 2)
k.dds8 <- k.dds(kcounts_8, metadata_8, 2)
k.dds6 <- k.dds(kcounts_6, metadata_6, 3)
k.dds13 <- k.dds(kcounts_13, metadata_13, 3)

add.rowData <- function(dds, ko_def){
    d <- data.frame('ko_id'= rownames(dds))
        #filter ko's which are present in the dds df
    v <- ko_def[(ko_def$ko_id%in%d$ko_id)==TRUE, c('ko_id', 'symbol', 'name')]
        #because counts are summed by ko_id, I need to take only the unique ko's from anno
    anno <- distinct(v)
    all(rownames(dds)==anno$ko_id)
    #reorder annotation table to same row order as dds
    anno <- anno[match(rownames(dds), anno$ko_id),]
    all(rownames(dds)==anno$ko_id)

    mcols(dds) <- cbind(mcols(dds), anno)
    print(mcols(dds))
}

#mcols(k.dds4) <- add.rowData(k.dds4, ko4_def)
#mcols(k.dds8) <- add.rowData(k.dds8, ko8_def)
#mcols(k.dds6) <- add.rowData(k.dds6, ko6_def)
#mcols(k.dds13) <- add.rowData(k.dds13, ko13_def)
#mcols(k.dds13)
    
## vsd normalize and save ordered by variance
vsd.norm <- function(dds){
    varianceStabilizingTransformation(dds,blind=FALSE)
    }
k.vsd4 <- vsd.norm(k.dds4)
k.vsd8 <- vsd.norm(k.dds8)
k.vsd6 <- vsd.norm(k.dds6)
k.vsd13 <- vsd.norm(k.dds13)
#k.vsd4 <- varianceStabilizingTransformation(k.dds4,blind = FALSE)
write.vsd.k <- function(vsd, org){
    vsd <- assay(vsd)
    vsd_order <- order(rowVars(vsd), decreasing=T)
    vsd_new <- vsd[vsd_order, ]
    write.csv(vsd_new, paste('./vsd_files/', org, "vsd.k.csv", sep=""))
}

write.vsd.k(k.vsd4, '04')
write.vsd.k(k.vsd8, '08')
write.vsd.k(k.vsd6, '06')
write.vsd.k(k.vsd13, '13')

## run differential expression analysis
de4.k <- DESeq(k.dds4)
de8.k <- DESeq(k.dds8)
de6.k <- DESeq(k.dds6)
de13.k <- DESeq(k.dds13)

#extract results
HvL <- c("treatment", "High_Iron", "Low_Iron")
AvL <- c("treatment", "Add_Back", "Low_Iron")
AvH <- c("treatment", "Add_Back", "High_Iron")

#results from Iron amendment vs Low Iron
AvL4.k <- results(de4.k, contrast=AvL, tidy=TRUE)
AvL8.k <- results(de8.k, contrast=AvL, tidy=T)
AvL6.k <- results(de6.k, contrast=AvL, tidy=TRUE)
AvL13.k <- results(de13.k, contrast=AvL, tidy=TRUE)

#results from High Iron vs Low Iron
HvL8.k  <- results(de8.k, contrast=HvL, tidy=TRUE)
HvL6.k  <- results(de6.k, contrast=HvL,  tidy=TRUE)
HvL13.k  <- results(de13.k,contrast=HvL, tidy=TRUE)

#results from Iron amendment vs High Iron
AvH8.k  <- results(de8.k, contrast=AvH, tidy=TRUE)
AvH6.k  <- results(de6.k, contrast=AvH,  tidy=TRUE)
AvH13.k  <- results(de13.k, contrast=AvH, tidy=TRUE)

#save deseq results

AvL4.k %>% na.omit() %>% write.csv('./de_res_files/AvL4.k.csv', row.names=F)
AvL8.k%>% na.omit() %>% write.csv('./de_res_files/AvL8.k.csv', row.names=F)
AvL6.k %>% na.omit() %>% write.csv('./de_res_files/AvL6.k.csv', row.names=F)
AvL13.k %>% na.omit() %>% write.csv('./de_res_files/AvL13.k.csv', row.names=F)

AvH8.k %>% na.omit() %>% write.csv('./de_res_files/AvH8.k.csv', row.names=F)
AvH6.k %>% na.omit() %>% write.csv('./de_res_files/AvH6.k.csv', row.names=F)
AvH13.k %>% na.omit() %>% write.csv('./de_res_files/AvH13.k.csv', row.names=F)

HvL8.k %>% na.omit() %>% write.csv('./de_res_files/HvL8.k.csv', row.names=F)
HvL6.k %>% na.omit() %>% write.csv('./de_res_files/HvL6.k.csv', row.names=F)
HvL13.k %>% na.omit() %>% write.csv('./de_res_files/HvL13.k.csv', row.names=F)


“some variables in design formula are characters, converting to factors”
“some variables in design formula are characters, converting to factors”
“some variables in design formula are characters, converting to factors”
“some variables in design formula are characters, converting to factors”
estimating size factors

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing

estimating size factors

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing

estimating size factors

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing

estimating size factors

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing



In [25]:
# how many DE's per treatment-sample?

avl4.perc.k=(nrow(filter(AvL4.k,(padj < 0.05)==T))/nrow(de4.k))*100
avl4.k=nrow(filter(AvL4.k,(padj < 0.05)==T))

avl8.perc.k=(nrow(filter(AvL8.k,(padj < 0.05)==T))/nrow(de8.k))*100
avl8.k=nrow(filter(AvL8.k,(padj < 0.05)==T))

avl6.perc.k=(nrow(filter(AvL6.k,(padj < 0.05)==T))/nrow(de6.k))*100
avl6.k=nrow(filter(AvL6.k,(padj < 0.05)==T))

avl13.perc.k=(nrow(filter(AvL13.k,(padj < 0.05)==T))/nrow(de13.k))*100
avl13.k=nrow(filter(AvL13.k,(padj < 0.05)==T))


hvl8.perc.k=(nrow(filter(HvL8.k,(padj < 0.05)==T))/nrow(de8.k))*100
hvl8.k=nrow(filter(HvL8.k,(padj < 0.05)==T))

hvl6.perc.k=(nrow(filter(HvL6.k,(padj < 0.05)==T))/nrow(de6.k))*100
hvl6.k=nrow(filter(HvL6.k,(padj < 0.05)==T))

hvl13.perc.k=(nrow(filter(HvL13.k,(padj < 0.05)==T))/nrow(de13.k))*100
hvl13.k=nrow(filter(HvL13.k,(padj < 0.05)==T))


avh8.perc.k=(nrow(filter(AvH8.k,(padj < 0.05)==T))/nrow(de8.k))*100
avh8.k=nrow(filter(AvH8.k,(padj < 0.05)==T))

avh6.perc.k=(nrow(filter(AvH6.k,(padj < 0.05)==T))/nrow(de6.k))*100
avh6.k=nrow(filter(AvH6.k,(padj < 0.05)==T))

avh13.perc.k=(nrow(filter(AvH13.k,(padj < 0.05)==T))/nrow(de13.k))*100
avh13.k=nrow(filter(AvH13.k,(padj < 0.05)==T))

de_ko_output = data.frame('Organism'=c('C. closterium 4', 'C. closterium 8', 'G. oceanica', 'G. huxleyi'),
                       'Iron amendment vs low iron'=c(avl4.k,avl8.k,avl6.k,avl13.k),
                       'High iron vs low iron'=c('NA', hvl8.k,hvl6.k,hvl13.k),
                       'Iron amendment vs high iron'=c('NA',avh8.k,avh6.k,avh13.k))

de_ko_output_perc = data.frame('Organism'=c('C. closterium 4', 'C. closterium 8', 'G. oceanica', 'G. huxleyi'),
                       'Iron amendment vs low iron'=c(avl4.perc.k,avl8.perc.k,avl6.perc.k,avl13.perc.k),
                       'High iron vs low iron'=c('NA', hvl8.perc.k,hvl6.perc.k,hvl13.perc.k),
                       'Iron amendment vs high iron'=c('NA',avh8.perc.k,avh6.perc.k,avh13.perc.k)) 

write.csv(de_ko_output, './de_res_files/de_ko_output.csv', row.names=F)
write.csv(de_ko_output_perc, './de_res_files/de_ko_output_perc.csv', row.names=F)

head(de_ko_output)

| | Organism | Iron.amendment.vs.low.iron | High.iron.vs.low.iron | Iron.amendment.vs.high.iron |
|---|---|---|---|---|
| | `<chr>` | `<int>` | `<chr>` | `<chr>` |
| 1 | C. closterium 4 | 206 | NA | NA |
| 2 | C. closterium 8 | 830 | 300 | 947 |
| 3 | G. oceanica | 1291 | 1381 | 221 |
| 4 | G. huxleyi | 1154 | 1132 | 396 |


In [46]:
#Save info for the top 30 VSD-normalized KOs with highest variance across samples

top30vsd.info <- function(k.vsd, ko_def, org){
    #order vsd matrix by variance across samples
    vsd <- assay(k.vsd)
    var_order <- order(rowVars(vsd), decreasing=T)
    vsd_new <- vsd[var_order, ]
    vsd_new <- as.data.frame(vsd_new) %>% rownames_to_column("ko_id") 
    #select top 30 KOs, these have the highest variance
    vsd30 <- vsd_new[1:30,]
    #merge with name and info for ko id and create separate column of enzyme info
    vsd30 <- left_join(vsd30, ko_def, by = 'ko_id') %>% select('ko_id', 'symbol', 'name')
    #names without an [EC:...] suffix get NA in the enzyme column (with a warning)
    vsd30 <- vsd30 %>% separate(name, c('name', 'enzyme'), "\\[EC:")  
    vsd30$enzyme <- str_replace(vsd30$enzyme, "\\]", "")
    write.csv(vsd30, paste("./vsd_files/vsd.30.", org, ".csv", sep=''), row.names=F)
}

#ko.def is the KO definition table read in above
top30vsd.info(k.vsd4, ko.def, "4")
top30vsd.info(k.vsd8, ko.def, "8")
top30vsd.info(k.vsd6, ko.def, "6")
top30vsd.info(k.vsd13, ko.def, "13")


“Expected 2 pieces. Missing pieces filled with `NA` in 17 rows [1, 2, 3, 5, 7, 8, 9, 10, 13, 17, 22, 24,
26, 27, 28, 29, 30].”
“Expected 2 pieces. Missing pieces filled with `NA` in 15 rows [2, 3, 4, 8, 9, 11, 12, 14, 15, 16, 20, 24,
26, 29, 30].”
“Expected 2 pieces. Missing pieces filled with `NA` in 14 rows [1, 4, 5, 6, 7, 8, 11, 12, 20, 21, 22, 27,
29, 30].”
“Expected 2 pieces. Missing pieces filled with `NA` in 18 rows [1, 3, 5, 6, 7, 10, 11, 12, 15, 17, 18, 20,
21, 22, 23, 24, 26, 29].”
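
The "Expected 2 pieces" warnings come from KO names that have no `[EC:...]` suffix, so the split produces only one piece and the enzyme column is filled with `NA`. A base R sketch of the same split that handles the missing case explicitly (the two names are taken from the KO definitions shown in this notebook):

```r
# One KO name with an EC number and one without
name <- c("ribonucleoside-diphosphate reductase alpha chain [EC:1.17.4.1]",
          "GINS complex subunit 3")

# Which names carry an EC annotation?
has_ec <- grepl("\\[EC:", name)

# Extract the text between "[EC:" and "]", or NA when there is none
enzyme <- ifelse(has_ec,
                 sub(".*\\[EC:([^]]*)\\].*", "\\1", name),
                 NA_character_)

# Name without the EC suffix
clean_name <- sub(" *\\[EC:.*$", "", name)

print(data.frame(name = clean_name, enzyme = enzyme))
```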


In [8]:
lfc4.k.AvL <- lfcShrink(de4.k,  contrast=c("treatment", "Add_Back", "Low_Iron"), type='ashr')
lfc4.k.AvL <- lfc4.k.AvL %>% as.data.frame() %>% rownames_to_column("ko_id") 
lfc8.k.AvL <- lfcShrink(de8.k, contrast=c("treatment", "Add_Back", "Low_Iron"), type='ashr')
lfc8.k.AvL <- lfc8.k.AvL %>% as.data.frame() %>% rownames_to_column("ko_id") 
lfc6.k.AvL <- lfcShrink(de6.k, contrast=c("treatment", "Add_Back", "Low_Iron"), type='ashr')
lfc6.k.AvL <- lfc6.k.AvL %>% as.data.frame() %>% rownames_to_column("ko_id") 
lfc13.k.AvL <- lfcShrink(de13.k,  contrast=c("treatment", "Add_Back", "Low_Iron"), type='ashr')
lfc13.k.AvL <- lfc13.k.AvL %>% as.data.frame() %>% rownames_to_column("ko_id") 


lfc8.k.HvL <- lfcShrink(de8.k, contrast=c("treatment", "High_Iron", "Low_Iron"), type='ashr')
lfc8.k.HvL <- lfc8.k.HvL %>% as.data.frame() %>% rownames_to_column("ko_id") 
lfc6.k.HvL <- lfcShrink(de6.k, contrast=c("treatment", "High_Iron", "Low_Iron"), type='ashr')
lfc6.k.HvL <- lfc6.k.HvL %>% as.data.frame() %>% rownames_to_column("ko_id") 
lfc13.k.HvL <- lfcShrink(de13.k,  contrast=c("treatment", "High_Iron", "Low_Iron"), type='ashr')
lfc13.k.HvL <- lfc13.k.HvL %>% as.data.frame() %>% rownames_to_column("ko_id") 

lfc8.k.AvH <- lfcShrink(de8.k, contrast=c("treatment", "Add_Back", "High_Iron"), type='ashr')
lfc8.k.AvH <- lfc8.k.AvH %>% as.data.frame() %>% rownames_to_column("ko_id") 
lfc6.k.AvH <- lfcShrink(de6.k, contrast=c("treatment", "Add_Back", "High_Iron"), type='ashr')
lfc6.k.AvH <- lfc6.k.AvH %>% as.data.frame() %>% rownames_to_column("ko_id")
lfc13.k.AvH <- lfcShrink(de13.k,  contrast=c("treatment", "Add_Back", "High_Iron"), type='ashr')
lfc13.k.AvH <- lfc13.k.AvH %>% as.data.frame() %>% rownames_to_column("ko_id") 

write.csv(lfc4.k.AvL, "./de_res_files/lfc4.k.AvL.csv", row.names=F)
write.csv(lfc8.k.AvL, "./de_res_files/lfc8.k.AvL.csv", row.names=F)
write.csv(lfc6.k.AvL, "./de_res_files/lfc6.k.AvL.csv", row.names=F)
write.csv(lfc13.k.AvL, "./de_res_files/lfc13.k.AvL.csv", row.names=F)
write.csv(lfc8.k.HvL, "./de_res_files/lfc8.k.HvL.csv", row.names=F)
write.csv(lfc6.k.HvL, "./de_res_files/lfc6.k.HvL.csv", row.names=F)
write.csv(lfc13.k.HvL, "./de_res_files/lfc13.k.HvL.csv", row.names=F)
write.csv(lfc8.k.AvH, "./de_res_files/lfc8.k.AvH.csv", row.names=F)
write.csv(lfc6.k.AvH, "./de_res_files/lfc6.k.AvH.csv", row.names=F)
write.csv(lfc13.k.AvH, "./de_res_files/lfc13.k.AvH.csv", row.names=F)

using 'ashr' for LFC shrinkage. If used in published research, please cite:
    Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
    https://doi.org/10.1093/biostatistics/kxw041

using 'ashr' for LFC shrinkage. If used in published research, please cite:
    Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
    https://doi.org/10.1093/biostatistics/kxw041

using 'ashr' for LFC shrinkage. If used in published research, please cite:
    Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
    https://doi.org/10.1093/biostatistics/kxw041

using 'ashr' for LFC shrinkage. If used in published research, please cite:
    Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
    https://doi.org/10.1093/biostatistics/kxw041

using 'ashr' for LFC shrinkage. If used in published research, please cite:
    Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
    https://doi.org/

In [22]:
ranked4.AvL = lfc4.k.AvL[order(abs(lfc4.k.AvL$log2FoldChange), decreasing=TRUE),]
ranked4.AvL = left_join(ranked4.AvL, ko.def)


ranked8.AvL = lfc8.k.AvL[order(abs(lfc8.k.AvL$log2FoldChange), decreasing=TRUE),]
ranked8.AvL = left_join(ranked8.AvL, ko.def)

ranked8.HvL = lfc8.k.HvL[order(abs(lfc8.k.HvL$log2FoldChange), decreasing=TRUE),]
ranked8.HvL = left_join(ranked8.HvL, ko.def)

ranked8.AvH = lfc8.k.AvH[order(abs(lfc8.k.AvH$log2FoldChange), decreasing=TRUE),]
ranked8.AvH = left_join(ranked8.AvH, ko.def)


ranked6.AvL = lfc6.k.AvL[order(abs(lfc6.k.AvL$log2FoldChange), decreasing=TRUE),]
ranked6.AvL = left_join(ranked6.AvL, ko.def)

ranked6.HvL = lfc6.k.HvL[order(abs(lfc6.k.HvL$log2FoldChange), decreasing=TRUE),]
ranked6.HvL = left_join(ranked6.HvL, ko.def)

ranked6.AvH = lfc6.k.AvH[order(abs(lfc6.k.AvH$log2FoldChange), decreasing=TRUE),]
ranked6.AvH = left_join(ranked6.AvH, ko.def)


ranked13.AvL = lfc13.k.AvL[order(abs(lfc13.k.AvL$log2FoldChange), decreasing=TRUE),]
ranked13.AvL = left_join(ranked13.AvL, ko.def)

ranked13.HvL = lfc13.k.HvL[order(abs(lfc13.k.HvL$log2FoldChange), decreasing=TRUE),]
ranked13.HvL = left_join(ranked13.HvL, ko.def)

ranked13.AvH = lfc13.k.AvH[order(abs(lfc13.k.AvH$log2FoldChange), decreasing=TRUE),]
ranked13.AvH = left_join(ranked13.AvH, ko.def)

head(ranked13.AvH)
write.csv(ranked4.AvL, "./de_res_files/ranked4.AvL.csv", row.names=F)
write.csv(ranked8.AvL, "./de_res_files/ranked8.AvL.csv", row.names=F)
write.csv(ranked6.AvL, "./de_res_files/ranked6.AvL.csv", row.names=F)
write.csv(ranked13.AvL, "./de_res_files/ranked13.AvL.csv", row.names=F)
write.csv(ranked8.AvH, "./de_res_files/ranked8.AvH.csv", row.names=F)
write.csv(ranked6.AvH, "./de_res_files/ranked6.AvH.csv", row.names=F)
write.csv(ranked13.AvH, "./de_res_files/ranked13.AvH.csv", row.names=F)
write.csv(ranked8.HvL, "./de_res_files/ranked8.HvL.csv", row.names=F)
write.csv(ranked6.HvL, "./de_res_files/ranked6.HvL.csv", row.names=F)
write.csv(ranked13.HvL, "./de_res_files/ranked13.HvL.csv", row.names=F)

[1m[22mJoining with `by = join_by(ko_id)`
[1m[22mJoining with `by = join_by(ko_id)`
[1m[22mJoining with `by = join_by(ko_id)`
[1m[22mJoining with `by = join_by(ko_id)`
[1m[22mJoining with `by = join_by(ko_id)`
[1m[22mJoining with `by = join_by(ko_id)`
[1m[22mJoining with `by = join_by(ko_id)`
[1m[22mJoining with `by = join_by(ko_id)`
[1m[22mJoining with `by = join_by(ko_id)`
[1m[22mJoining with `by = join_by(ko_id)`


| | ko_id | baseMean | log2FoldChange | lfcSE | pvalue | padj | symbol | name |
|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` |
| 1 | K00525 | 1073.92643 | -3.904022 | 0.28052 | 8.315263e-46 | 1.930804e-42 | E1.17.4.1A, nrdA, nrdE | ribonucleoside-diphosphate reductase alpha chain [EC:1.17.4.1] |
| 2 | K10734 | 25.89301 | -3.212495 | 0.5927836 | 1.303292e-09 | 1.06184e-07 | GINS3 | GINS complex subunit 3 |
| 3 | K10808 | 501.63986 | -3.196746 | 0.2288487 | 1.401861e-45 | 2.170081e-42 | RRM2 | ribonucleoside-diphosphate reductase subunit M2 [EC:1.17.4.1] |
| 4 | K18710 | 22.68003 | -3.160268 | 0.8150535 | 4.954375e-07 | 2.130381e-05 | SLBP | histone RNA hairpin-binding protein |
| 5 | K22156 | 56.82509 | -3.08514 | 0.4034052 | 9.545449e-16 | 1.528588e-13 | JADE3 | protein Jade-3 |
| 6 | K14291 | 83.10115 | -3.078023 | 0.3364991 | 3.207594e-21 | 8.762392e-19 | PHAX | phosphorylated adapter RNA export protein |
