**Set environment**

In [1]:
suppressMessages(suppressWarnings(source("../config/config_sing.R")))
suppressMessages(suppressWarnings(library("DESeq2")))
show_env()

You are in Singularity: singularity_proj_combeffect 
BASE DIRECTORY:     /data/reddylab/Kuei 
WORK DIRECTORY:     /data/reddylab/Kuei/out 
CODE DIRECTORY:     /data/reddylab/Kuei/code 
PATH OF SOURCE:     /data/reddylab/Kuei/source 
PATH OF EXECUTABLE: /data/reddylab/Kuei/bin 
PATH OF ANNOTATION: /data/reddylab/Kuei/annotation 
PATH OF PROJECT:    /data/reddylab/Kuei/code/Proj_CombEffect_ENCODE_FCC 
PATH OF RESULTS:    /data/reddylab/Kuei/out/proj_combeffect_encode_fcc 


## Import count matrix and metadata

In [2]:
PREFIX = "KS91_K562_ASTARRseq"
FOLDER = "coverage_astarrseq_peak_macs_input"

fdiry = file.path(FD_RES, "results", PREFIX, FOLDER, "summary")

fname = "matrix.raw.count.WGS.tsv"
fpath = file.path(fdiry, fname)
dat_count = read_tsv(fpath)

fname = "metadata.raw.WGS.tsv"
fpath = file.path(fdiry, fname)
dat_meta = read_tsv(fpath)

Rows: 246852 Columns: 14
── Column specification ──────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr  (2): Chrom, Peak
dbl (12): Start, End, Input.rep1, Input.rep2, Input.rep3, Input.rep4, Input....

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 10 Columns: 3
── Column specification ──────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (3): Sample, Group, FPath

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


**Arrange count matrix and metadata**

In [3]:
dat_col = dat_meta  %>% 
    dplyr::select(Sample, Group) %>% 
    dplyr::rename(condition = Group) %>%
    column_to_rownames(var = "Sample")

dat_cnt = dat_count %>% 
    dplyr::mutate(Peak = paste(Chrom, Start, End, sep = "_")) %>%
    dplyr::select(-Chrom, -Start, -End) %>%
    column_to_rownames(var = "Peak")

dat_cnt[is.na(dat_cnt)] = 0

In [4]:
head(dat_cnt)

,Input.rep1,Input.rep2,Input.rep3,Input.rep4,Input.rep5,Input.rep6,Output.rep1,Output.rep2,Output.rep3,Output.rep4
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
chr1_10015_10442,155,214,257,236,185,214,7,12,4,11
chr1_14253_14645,110,144,160,141,130,130,8,26,30,57
chr1_16015_16477,141,208,206,190,202,182,9,9,18,23
chr1_17237_17772,259,350,399,367,369,331,7,13,23,59
chr1_28903_29613,263,338,368,333,352,317,12,18,3,32
chr1_30803_31072,82,115,171,136,105,115,13,22,14,33


In [5]:
dat_col

,condition
,<chr>
Input.rep1,Input
Input.rep2,Input
Input.rep3,Input
Input.rep4,Input
Input.rep5,Input
Input.rep6,Input
Output.rep1,Output
Output.rep2,Output
Output.rep3,Output
Output.rep4,Output


In [6]:
print(all(rownames(dat_col) %in% colnames(dat_cnt)))
print(all(rownames(dat_col) ==   colnames(dat_cnt)))

[1] TRUE
[1] TRUE
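
Both checks pass here, so the count columns already line up with the metadata rows. If the first check passed but the second did not, the columns could be reordered by sample name before building the DESeq2 object. A minimal base R sketch, assuming toy objects (`toy_cnt` and `toy_col` are hypothetical stand-ins for `dat_cnt` and `dat_col`):

```r
# Toy stand-ins for dat_cnt / dat_col (hypothetical; columns deliberately out of order)
toy_cnt = data.frame(
    Output.rep1 = c(7, 8),
    Input.rep1  = c(155, 110),
    row.names   = c("peak_a", "peak_b"))
toy_col = data.frame(
    condition = c("Input", "Output"),
    row.names = c("Input.rep1", "Output.rep1"))

# The same samples are present, but in a different order
print(all(rownames(toy_col) %in% colnames(toy_cnt)))
print(all(rownames(toy_col) ==   colnames(toy_cnt)))

# Reorder the count columns by sample name so they match the metadata rows
toy_cnt = toy_cnt[, rownames(toy_col), drop = FALSE]
print(all(rownames(toy_col) == colnames(toy_cnt)))
```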


## Run DESeq2

In [7]:
dds = DESeqDataSetFromMatrix(
    countData = dat_cnt, 
    colData   = dat_col, 
    design    = ~condition)

converting counts to integer mode

“some variables in design formula are characters, converting to factors”
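
The message above appears because `condition` is stored as a character column in `dat_col`. One way to avoid it would be to convert the column to a factor with explicit levels before building the object, which also makes the reference level visible in the code. A minimal base R sketch, assuming a standalone vector named `condition` (hypothetical, mirroring the six Input and four Output samples):

```r
# Hypothetical stand-in for the `condition` column of dat_col
condition = c(rep("Input", 6), rep("Output", 4))

# Explicit levels: the first level ("Input") is treated as the reference
condition = factor(condition, levels = c("Input", "Output"))

print(levels(condition))
print(table(condition))
```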


In [8]:
### keep only peaks with at least 10 reads in total across all samples
dds = dds[rowSums(counts(dds)) >= 10,]

### set control condition as reference
dds$condition = relevel(dds$condition, ref = "Input")

In [9]:
dds = DESeq(dds)

estimating size factors

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing
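
The "estimating size factors" step corresponds, by default, to DESeq2's median-of-ratios normalization. The sketch below is a simplified base R illustration of that idea, not a copy of the DESeq2 implementation. It uses only the six peaks printed by `head(dat_cnt)`, so the values will not match the size factors estimated from the full matrix; `cnt`, `log_geo_mean`, and `size_factor` are names introduced here for illustration.

```r
# Counts copied from the six peaks shown by head(dat_cnt)
cnt = matrix(c(
    155, 214, 257, 236, 185, 214,  7, 12,  4, 11,
    110, 144, 160, 141, 130, 130,  8, 26, 30, 57,
    141, 208, 206, 190, 202, 182,  9,  9, 18, 23,
    259, 350, 399, 367, 369, 331,  7, 13, 23, 59,
    263, 338, 368, 333, 352, 317, 12, 18,  3, 32,
     82, 115, 171, 136, 105, 115, 13, 22, 14, 33),
    nrow = 6, byrow = TRUE,
    dimnames = list(NULL, c(paste0("Input.rep", 1:6), paste0("Output.rep", 1:4))))

# Log of the per-peak geometric mean across samples
log_geo_mean = rowMeans(log(cnt))

# Size factor per sample: median over peaks of (count / geometric mean).
# Peaks with a zero count in any sample have a -Inf log geometric mean and are skipped.
size_factor = apply(cnt, 2, function(x) {
    keep = is.finite(log_geo_mean) & x > 0
    exp(median(log(x[keep]) - log_geo_mean[keep]))
})

print(round(size_factor, 3))
```

In this toy subset the Input libraries are much deeper than the Output libraries at every peak, so the Input size factors come out above 1 and the Output size factors below 1.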



## Get results

In [10]:
resultsNames(dds)

In [11]:
res = results(dds)
res

log2 fold change (MLE): condition Output vs Input 
Wald test p-value: condition Output vs Input 
DataFrame with 246850 rows and 6 columns
                          baseMean log2FoldChange     lfcSE      stat
                         <numeric>      <numeric> <numeric> <numeric>
chr1_10015_10442           71.4038      -2.213613  0.269152  -8.22441
chr1_14253_14645           68.6303       0.222186  0.153410   1.44831
chr1_16015_16477           71.2541      -1.271295  0.201963  -6.29470
chr1_17237_17772          124.0313      -1.385289  0.163863  -8.45396
chr1_28903_29613          113.6310      -1.942200  0.188613 -10.29727
...                            ...            ...       ...       ...
chrX_156000382_156003205 1532.3823       0.468703 0.0375633  12.47767
chrX_156009687_156010227   59.3053      -1.562456 0.2365061  -6.60641
chrX_156016391_156016836   57.0261      -3.944716 0.4857124  -8.12151
chrX_156024950_156025593  164.6780      -0.401819 0.1139605  -3.52595
chrX_156030187_1560307

In [12]:
res = results(dds)
res = as.data.frame(res) %>% rownames_to_column(var = "Peak")
head(res)

,Peak,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj
,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,chr1_10015_10442,71.40375,-2.2136128,0.2691517,-8.2244061,1.961605e-16,2.910688e-15
2,chr1_14253_14645,68.63035,0.2221859,0.1534101,1.4483134,0.1475294,0.2178106
3,chr1_16015_16477,71.25409,-1.271295,0.2019628,-6.2946978,3.079993e-10,2.65301e-09
4,chr1_17237_17772,124.03128,-1.3852891,0.1638628,-8.4539594,2.815873e-17,4.431893e-16
5,chr1_28903_29613,113.63103,-1.9422003,0.1886131,-10.2972733,7.248681e-25,1.723665e-23
6,chr1_30803_31072,57.37875,-0.1332069,0.2145692,-0.6208109,0.534724,0.62314
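
To read the table, each peak can be labelled by the direction of its change once an adjusted p-value cutoff is chosen. The sketch below is one possible way to do this in base R, using the six rows printed above. The cutoff `PADJ_CUTOFF = 0.05` and the names `toy_res` and `direction` are assumptions for illustration and do not come from the notebook.

```r
# Values copied from head(res) above (log2FoldChange rounded to four decimals)
toy_res = data.frame(
    Peak = c("chr1_10015_10442", "chr1_14253_14645", "chr1_16015_16477",
             "chr1_17237_17772", "chr1_28903_29613", "chr1_30803_31072"),
    log2FoldChange = c(-2.2136, 0.2222, -1.2713, -1.3853, -1.9422, -0.1332),
    padj = c(2.910688e-15, 0.2178106, 2.65301e-09,
             4.431893e-16, 1.723665e-23, 0.62314))

# Hypothetical cutoff, not taken from the notebook
PADJ_CUTOFF = 0.05

# Peaks with a missing or large adjusted p-value are labelled "not significant";
# the rest are labelled by the sign of the log2 fold change (Output vs Input)
toy_res$direction = ifelse(
    is.na(toy_res$padj) | toy_res$padj >= PADJ_CUTOFF, "not significant",
    ifelse(toy_res$log2FoldChange > 0, "higher in Output", "lower in Output"))

print(table(toy_res$direction))
```

The `is.na()` guard is there because `padj` can be `NA` for peaks removed by DESeq2's independent filtering.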


## Save results

In [13]:
fdiry = file.path(FD_RES, "results", PREFIX, FOLDER, "summary")
fname = "result.Log2FC.raw.deseq.WGS.tsv"
fpath = file.path(fdiry, fname)
print(fpath)

write_tsv(res, fpath)

[1] "/data/reddylab/Kuei/out/proj_combeffect_encode_fcc/results/KS91_K562_ASTARRseq/coverage_astarrseq_peak_macs_input/summary/result.Log2FC.raw.deseq.WGS.tsv"
