## Set environment

In [1]:
suppressMessages(suppressWarnings(source("../config/config_sing.R")))
suppressMessages(suppressWarnings(library("DESeq2")))
show_env()

You are in Singularity: singularity_proj_encode_fcc 
BASE DIRECTORY (FD_BASE): /data/reddylab/Kuei 
WORK DIRECTORY (FD_WORK): /data/reddylab/Kuei/out 
CODE DIRECTORY (FD_CODE): /data/reddylab/Kuei/code 
PATH OF PROJECT (FD_PRJ): /data/reddylab/Kuei/code/Proj_CombEffect_ENCODE_FCC 
PATH OF RESULTS (FD_RES): /data/reddylab/Kuei/out/proj_combeffect_encode_fcc 
PATH OF LOG     (FD_LOG): /data/reddylab/Kuei/out/proj_combeffect_encode_fcc/log 


## Import count matrix and metadata

In [2]:
PREFIX = "A001_K562_WSTARRseq"
FOLDER = "coverage_astarrseq_peak_macs_input"

fdiry = file.path(FD_RES, "results", PREFIX, FOLDER, "summary")

fname = "matrix.raw.count.WGS.tsv"
fpath = file.path(fdiry, fname)
dat_count = read_tsv(fpath, show_col_types = FALSE)

fname = "metadata.raw.WGS.tsv"
fpath = file.path(fdiry, fname)
dat_meta = read_tsv(fpath, show_col_types = FALSE)

**Arrange count matrix and metadata**

In [3]:
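# Sample table for DESeq2: one row per sample, `condition` is Input or Output
dat_col = dat_meta %>% 
    dplyr::select(Sample, Group) %>% 
    dplyr::rename(condition = Group) %>%
    column_to_rownames(var = "Sample")

# Count matrix: peaks as row names, one column per sample
dat_cnt = dat_count %>% 
    column_to_rownames(var = "Peak")

# Treat missing counts as zero
dat_cnt[is.na(dat_cnt)] = 0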

**Show data**

In [4]:
head(dat_cnt)

,Input.rep1,Input.rep2,Input.rep3,Input.rep4,Output.rep1,Output.rep2,Output.rep3
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
chr1:100006256-100006880,17,51,63,41,69,57,136
chr1:100010437-100010915,5,31,29,35,49,39,77
chr1:10002087-10003910,13,68,72,64,85,85,177
chr1:100021298-100021629,0,10,13,9,14,9,19
chr1:100023727-100023976,2,14,14,6,16,17,48
chr1:100027983-100029702,23,75,84,57,103,107,225


In [5]:
dat_col

,condition
,<chr>
Input.rep1,Input
Input.rep2,Input
Input.rep3,Input
Input.rep4,Input
Output.rep1,Output
Output.rep2,Output
Output.rep3,Output


In [6]:
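# Every sample in the metadata must have a column in the count matrix ...
print(all(rownames(dat_col) %in% colnames(dat_cnt)))
# ... and the samples must appear in the same order in both tables
print(all(rownames(dat_col) ==   colnames(dat_cnt)))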

[1] TRUE
[1] TRUE
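
Both checks pass here. If the first returned TRUE and the second FALSE, the samples would all be present but in a different order, and DESeq2 expects the columns of the count matrix to follow the rows of the sample table. A minimal sketch of the usual fix, using small made-up tables (`toy_cnt` and `toy_col` are hypothetical and not part of this notebook's data):

```r
# Toy count matrix whose columns are deliberately out of order (made-up values)
toy_cnt = data.frame(
    Output.rep1 = c(5, 8),
    Input.rep1  = c(3, 4),
    Input.rep2  = c(2, 6),
    row.names   = c("peakA", "peakB"))

# Toy sample table in the intended order
toy_col = data.frame(
    condition = c("Input", "Input", "Output"),
    row.names = c("Input.rep1", "Input.rep2", "Output.rep1"))

# Same samples, different order: the first check passes, the second does not
print(all(rownames(toy_col) %in% colnames(toy_cnt)))
print(all(rownames(toy_col) ==   colnames(toy_cnt)))

# Reorder the count columns to follow the rows of the sample table
toy_cnt = toy_cnt[, rownames(toy_col), drop = FALSE]
print(all(rownames(toy_col) == colnames(toy_cnt)))
```

Indexing the columns by `rownames(toy_col)` both reorders them and drops any extra samples that have no metadata row.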


## Set up DESeq2

In [7]:
dds = DESeqDataSetFromMatrix(
    countData = dat_cnt, 
    colData   = dat_col, 
    design    = ~condition)

converting counts to integer mode

“some variables in design formula are characters, converting to factors”


**Pre-filtering**

In [8]:
### remove peaks with fewer than 10 reads in total across all samples
cat("Before filter:", nrow(dds), "\n")
dds = dds[rowSums(counts(dds)) >= 10,]
cat("After  filter:", nrow(dds), "\n")

### set the control condition (Input) as the reference level
dds$condition = relevel(dds$condition, ref = "Input")

Before filter: 246832 
After  filter: 246688 
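
The filter drops 246832 - 246688 = 144 peaks. Note that the threshold applies to the sum over all seven samples, not to each sample separately. A minimal sketch of the same rule on a made-up matrix (`toy` and its values are hypothetical):

```r
# Made-up counts: rows are peaks, columns are samples
toy = matrix(
    c( 0,  1,  2,    # peakA: total  3, dropped
       4,  3,  3,    # peakB: total 10, kept (the threshold is inclusive)
      20, 15, 30),   # peakC: total 65, kept
    nrow = 3, byrow = TRUE,
    dimnames = list(c("peakA", "peakB", "peakC"), c("s1", "s2", "s3")))

# Keep peaks whose total count across samples is at least 10
keep = rowSums(toy) >= 10
toy_filtered = toy[keep, , drop = FALSE]
print(rownames(toy_filtered))
```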


## Run DESeq2

In [9]:
dds = DESeq(dds)

estimating size factors

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing
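
The `estimating size factors` step is what makes counts comparable across libraries of different sequencing depth. By default DESeq2 uses a median-of-ratios estimate, which helps explain why a peak with higher raw Output counts can still receive a negative log2 fold change. The sketch below reimplements that idea in base R on the six peaks shown by `head(dat_cnt)`. It is an illustration only: six peaks are far too few to reproduce the factors DESeq2 estimated from the full matrix, and the variable names are hypothetical.

```r
# First six peaks from the head(dat_cnt) preview
cnt = matrix(
    c(17, 51, 63, 41,  69,  57, 136,
       5, 31, 29, 35,  49,  39,  77,
      13, 68, 72, 64,  85,  85, 177,
       0, 10, 13,  9,  14,   9,  19,
       2, 14, 14,  6,  16,  17,  48,
      23, 75, 84, 57, 103, 107, 225),
    nrow = 6, byrow = TRUE,
    dimnames = list(NULL, c(paste0("Input.rep", 1:4), paste0("Output.rep", 1:3))))

# Log geometric mean of each peak across samples; a zero count gives -Inf
log_geo_mean = rowMeans(log(cnt))
use = is.finite(log_geo_mean)

# Size factor per sample: median over usable peaks of count / geometric mean
size_factor = apply(cnt[use, , drop = FALSE], 2, function(x) {
    exp(median(log(x) - log_geo_mean[use]))
})
print(round(size_factor, 2))

# Dividing each column by its size factor gives depth-normalized counts
cnt_norm = sweep(cnt, 2, size_factor, "/")
```

Samples with deeper libraries get larger size factors, so their counts are scaled down before conditions are compared.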



## Get results

In [10]:
resultsNames(dds)

In [11]:
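# With a single two-level factor, results() reports Output vs Input,
# where Input is the reference level set during pre-filtering
res = results(dds)
res = as.data.frame(res) %>% rownames_to_column(var = "Peak")
head(res)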

,Peak,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj
,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,chr1:100006256-100006880,54.500445,-0.48849764,0.1997784,-2.4451969,0.01447731,0.09534129
2,chr1:100010437-100010915,30.645878,-0.30895395,0.2528932,-1.2216777,0.22182954,0.50094986
3,chr1:10002087-10003910,66.04553,-0.38624192,0.173012,-2.2324569,0.02558478,0.14019267
4,chr1:100021298-100021629,8.289442,-0.59404974,0.47994,-1.2377583,0.21580568,NA
5,chr1:100023727-100023976,12.258468,0.04988694,0.4075494,0.1224071,0.90257663,0.96366009
6,chr1:100027983-100029702,80.536286,-0.22927538,0.1665249,-1.3768236,0.1685668,0.43305688
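
The fourth peak has no `padj` value. By default `results()` applies independent filtering, which leaves the adjusted p-value missing for peaks with a low mean count, and that is the likely cause here since this peak has the lowest `baseMean` of the six. Any downstream threshold therefore has to handle missing values explicitly. A minimal sketch in base R, using rounded values from the preview above (`toy_res` and the 0.1 cutoff are assumptions for illustration, not part of the notebook):

```r
# Rounded values copied from the head(res) preview
toy_res = data.frame(
    Peak = c("chr1:100006256-100006880", "chr1:100010437-100010915",
             "chr1:10002087-10003910",   "chr1:100021298-100021629",
             "chr1:100023727-100023976", "chr1:100027983-100029702"),
    log2FoldChange = c(-0.488, -0.309, -0.386, -0.594, 0.050, -0.229),
    padj           = c(0.0953, 0.5009, 0.1402, NA, 0.9637, 0.4331))

# Hypothetical significance cutoff
alpha = 0.1

# Comparing NA with a number yields NA, so exclude missing padj explicitly
is_sig  = !is.na(toy_res$padj) & toy_res$padj < alpha
toy_sig = toy_res[is_sig, ]
print(toy_sig$Peak)
```

Without the `!is.na()` guard, logical indexing with an `NA` would add an all-`NA` row to the selection instead of dropping the peak.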


## Save results

In [12]:
fdiry = file.path(FD_RES, "results", PREFIX, FOLDER, "summary")
fname = "result.Log2FC.raw.deseq.WGS.tsv"
fpath = file.path(fdiry, fname)
print(fpath)

write_tsv(res, fpath)

[1] "/data/reddylab/Kuei/out/proj_combeffect_encode_fcc/results/A001_K562_WSTARRseq/coverage_astarrseq_peak_macs_input/summary/result.Log2FC.raw.deseq.WGS.tsv"
