**Set environment**

In [1]:
suppressMessages(suppressWarnings(source("../config/config_sing.R")))
suppressMessages(suppressWarnings(library("DESeq2")))
show_env()

You are in Singularity: singularity_proj_combeffect 
BASE DIRECTORY:     /data/reddylab/Kuei 
WORK DIRECTORY:     /data/reddylab/Kuei/out 
CODE DIRECTORY:     /data/reddylab/Kuei/code 
PATH OF SOURCE:     /data/reddylab/Kuei/source 
PATH OF EXECUTABLE: /data/reddylab/Kuei/bin 
PATH OF ANNOTATION: /data/reddylab/Kuei/annotation 
PATH OF PROJECT:    /data/reddylab/Kuei/code/Proj_CombEffect_ENCODE_FCC 
PATH OF RESULTS:    /data/reddylab/Kuei/out/proj_combeffect_encode_fcc 


In [2]:
PREFIX  = "Tewhey_K562_TileMPRA"
FOLDER  = "coverage_astarrseq_peak_macs_input"
REGIONS = c("GATA1", "MYC", "FADS")
TYPES   = c("raw", "norm")

In [11]:
TYPE  = "raw"
fdiry = file.path(FD_RES, "results", PREFIX, FOLDER, "summary")

for (REGION in REGIONS){
    ### show progress
    cat("\n+++++++++++++++++++++++++++++++\n")
    cat("Region:", REGION, "\n")
    flush.console()
    
    ### import count data
    fname = paste("matrix", TYPE, "count", REGION, "tsv", sep=".")
    fpath = file.path(fdiry, fname)
    dat_count = read_tsv(fpath, show_col_types = FALSE) %>% dplyr::select(-Type, -Region)
    print(fpath)
    flush.console()
    
    ### import metadata
    fname = paste("metadata", TYPE, REGION, "tsv", sep=".")
    fpath = file.path(fdiry, fname)
    dat_meta = read_tsv(fpath, show_col_types = FALSE)
    print(fpath)
    flush.console()
    
    ### Arrange count matrix and metadata
    dat_col = dat_meta  %>% 
        dplyr::select(Sample, Group) %>% 
        dplyr::rename(condition = Group) %>%
        column_to_rownames(var = "Sample")
    print(dat_col)
    
    dat_cnt = dat_count %>% 
        dplyr::mutate(Peak = paste(Chrom, Start, End, sep = "_")) %>%
        dplyr::select(-Chrom, -Start, -End) %>%
        column_to_rownames(var = "Peak")
    dat_cnt[is.na(dat_cnt)] = 0
    print(head(dat_cnt))
    
    ### create a DDS object
    dds = DESeqDataSetFromMatrix(
        countData = dat_cnt, 
        colData   = dat_col, 
        design    = ~condition)

    ### keep only peaks with at least 10 reads summed across all samples
    dds = dds[rowSums(counts(dds)) >= 10,]

    ### set control condition as reference
    dds$condition = relevel(dds$condition, ref = "Input")

    ### run DESeq2: estimate size factors and dispersions (local fit), then fit the model and test
    dds = DESeq(dds, fitType = 'local')

    ### extract results (default contrast: Output vs. the reference level Input)
    res = results(dds)
    res = as.data.frame(res) %>% rownames_to_column(var = "Peak")
    
    ### store results
    fname = paste("result", "Log2FC", TYPE, "deseq", REGION, "tsv", sep=".")
    fpath = file.path(fdiry, fname)
    write_tsv(res, fpath)
    print(fpath)
    flush.console()
}


+++++++++++++++++++++++++++++++
Region: GATA1 
[1] "/data/reddylab/Kuei/out/proj_combeffect_encode_fcc/results/Tewhey_K562_TileMPRA/coverage_astarrseq_peak_macs_input/summary/matrix.raw.count.GATA1.tsv"
[1] "/data/reddylab/Kuei/out/proj_combeffect_encode_fcc/results/Tewhey_K562_TileMPRA/coverage_astarrseq_peak_macs_input/summary/metadata.raw.GATA1.tsv"
            condition
Input.rep1      Input
Input.rep2      Input
Input.rep3      Input
Input.rep4      Input
Input.rep5      Input
Input.rep6      Input
Output.rep1    Output
Output.rep2    Output
Output.rep3    Output
Output.rep4    Output
Output.rep5    Output
                       Input.rep1 Input.rep2 Input.rep3 Input.rep4 Input.rep5
chrX_47796208_47796828       6811       7520       7050       7354       3542
chrX_47806139_47808167      22020      24874      22902      23825      11249
chrX_47809119_47809445       2180       2676       2383       2398       1167
chrX_47814810_47815443       7241       8307       7296       7542  

converting counts to integer mode

“some variables in design formula are characters, converting to factors”
estimating size factors

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing



[1] "/data/reddylab/Kuei/out/proj_combeffect_encode_fcc/results/Tewhey_K562_TileMPRA/coverage_astarrseq_peak_macs_input/summary/result.Log2FC.raw.deseq.GATA1.tsv"

+++++++++++++++++++++++++++++++
Region: MYC 
[1] "/data/reddylab/Kuei/out/proj_combeffect_encode_fcc/results/Tewhey_K562_TileMPRA/coverage_astarrseq_peak_macs_input/summary/matrix.raw.count.MYC.tsv"
[1] "/data/reddylab/Kuei/out/proj_combeffect_encode_fcc/results/Tewhey_K562_TileMPRA/coverage_astarrseq_peak_macs_input/summary/metadata.raw.MYC.tsv"
            condition
Input.rep1      Input
Input.rep2      Input
Input.rep3      Input
Input.rep4      Input
Input.rep5      Input
Input.rep6      Input
Output.rep1    Output
Output.rep2    Output
Output.rep3    Output
Output.rep4    Output
Output.rep5    Output
                         Input.rep1 Input.rep2 Input.rep3 Input.rep4 Input.rep5
chr8_126778902_126779728      10372      11304      10609      11042       5318
chr8_126782925_126783318       3076       3477       3122      

converting counts to integer mode

“some variables in design formula are characters, converting to factors”
estimating size factors

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing



[1] "/data/reddylab/Kuei/out/proj_combeffect_encode_fcc/results/Tewhey_K562_TileMPRA/coverage_astarrseq_peak_macs_input/summary/result.Log2FC.raw.deseq.MYC.tsv"

+++++++++++++++++++++++++++++++
Region: FADS 
[1] "/data/reddylab/Kuei/out/proj_combeffect_encode_fcc/results/Tewhey_K562_TileMPRA/coverage_astarrseq_peak_macs_input/summary/matrix.raw.count.FADS.tsv"
[1] "/data/reddylab/Kuei/out/proj_combeffect_encode_fcc/results/Tewhey_K562_TileMPRA/coverage_astarrseq_peak_macs_input/summary/metadata.raw.FADS.tsv"
            condition
Input.rep1      Input
Input.rep2      Input
Input.rep3      Input
Input.rep4      Input
Output.rep1    Output
Output.rep2    Output
Output.rep3    Output
Output.rep4    Output
                        Input.rep1 Input.rep2 Input.rep3 Input.rep4 Output.rep1
chr11_61554569_61556228     169523      95146     151286      95053       66392
chr11_61560645_61561556     106039      58543      94232      59762       33097
chr11_61567108_61567997      97609      55981   

converting counts to integer mode

“some variables in design formula are characters, converting to factors”
estimating size factors

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing



[1] "/data/reddylab/Kuei/out/proj_combeffect_encode_fcc/results/Tewhey_K562_TileMPRA/coverage_astarrseq_peak_macs_input/summary/result.Log2FC.raw.deseq.FADS.tsv"
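
The pre-filter and reference-level steps inside the loop can be tried on a toy matrix without DESeq2. This is a minimal base-R sketch; the matrix `toy_counts` and the factor `toy_condition` are made up for illustration and are not part of the analysis.

```r
### toy count matrix: 3 peaks x 4 samples (hypothetical values)
toy_counts = matrix(
    c(0, 1, 2, 3,
      5, 6, 7, 8,
      0, 0, 0, 0),
    nrow = 3, byrow = TRUE,
    dimnames = list(
        c("peak_a", "peak_b", "peak_c"),
        c("Input.rep1", "Input.rep2", "Output.rep1", "Output.rep2")))

### same rule as in the loop: keep peaks with >= 10 reads in total
keep = rowSums(toy_counts) >= 10
toy_filtered = toy_counts[keep, , drop = FALSE]
print(rownames(toy_filtered))

### make "Input" the reference level so fold changes read as Output vs. Input
toy_condition = factor(
    c("Output", "Output", "Input", "Input"),
    levels = c("Output", "Input"))
toy_condition = relevel(toy_condition, ref = "Input")
print(levels(toy_condition))
```

When a factor is built from character labels, R orders the levels alphabetically, so `Input` would already come first here; the explicit `relevel` guards against group names that sort differently.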


In [12]:
PREFIX  = "Tewhey_K562_TileMPRA"
FOLDER  = "coverage_astarrseq_peak_macs_input"
REGIONS = c("GATA1", "MYC", "FADS")
PROCESS = c("raw", "norm")

region  = REGIONS[1]
process = PROCESS[1]
fdiry = file.path(FD_RES, "results", PREFIX, FOLDER, "summary")
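
The input and output files in this notebook follow a dot-separated naming pattern assembled with `paste(..., sep=".")` and `file.path()`. A small sketch of that pattern, assuming a hypothetical helper `build_summary_path` and a placeholder folder (neither appears in the original code):

```r
### hypothetical helper: join the name parts with "." and prepend the folder
build_summary_path = function(fdiry, ...) {
    fname = paste(..., "tsv", sep = ".")
    file.path(fdiry, fname)
}

### "path/to/summary" is a placeholder, not the real results folder
fpath = build_summary_path("path/to/summary", "matrix", "raw", "count", "GATA1")
print(fpath)
```

Wrapping the pattern in one function keeps the import and export names consistent when looping over regions and processing types.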

**Arrange count matrix and metadata**

In [18]:
### import count data
fname = paste("count", process, region, "tsv", sep=".")
fpath = file.path(fdiry, fname)
dat_count = read_tsv(fpath, show_col_types = FALSE)

### import metadata
fname = paste("metadata", process, region, "tsv", sep=".")
fpath = file.path(fdiry, fname)
dat_meta = read_tsv(fpath, show_col_types = FALSE)

### Arrange count matrix and metadata
dat_col = dat_meta  %>% 
    dplyr::select(Sample, Group) %>% 
    dplyr::rename(condition = Group) %>%
    column_to_rownames(var = "Sample")

dat_cnt = dat_count %>% 
    dplyr::mutate(Peak = paste(Chrom, Start, End, sep = "_")) %>%
    dplyr::select(-Chrom, -Start, -End) %>%
    column_to_rownames(var = "Peak")

dat_cnt[is.na(dat_cnt)] = 0

### create a DDS object
dds = DESeqDataSetFromMatrix(
    countData = dat_cnt, 
    colData   = dat_col, 
    design    = ~condition)

### keep only peaks with at least 10 reads summed across all samples
dds = dds[rowSums(counts(dds)) >= 10,]

### set control condition as reference
dds$condition = relevel(dds$condition, ref = "Input")

### run DESeq2: estimate size factors and dispersions (local fit), then fit the model and test
dds = DESeq(dds, fitType = 'local')

### extract results (default contrast: Output vs. the reference level Input)
res = results(dds)
res = as.data.frame(res) %>% rownames_to_column(var = "Peak")

converting counts to integer mode

“some variables in design formula are characters, converting to factors”
estimating size factors

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing
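
The "estimating size factors" message in the log above refers to DESeq2's default median-of-ratios normalization. The following base-R sketch reproduces the idea on a made-up matrix; `toy_counts` and `median_of_ratios` are illustrative names, and the real `estimateSizeFactors()` covers more cases than this simplified version.

```r
### hypothetical counts: sample s2 is sample s1 sequenced twice as deep
toy_counts = cbind(
    s1 = c(10, 20, 40, 80),
    s2 = c(20, 40, 80, 160),
    s3 = c(12, 18, 44, 76))

median_of_ratios = function(counts) {
    ### per-peak log geometric mean across samples (the pseudo-reference)
    log_geo_means = rowMeans(log(counts))
    ### per-sample median of log ratios to the pseudo-reference,
    ### using only peaks with a finite reference and a non-zero count
    apply(counts, 2, function(cnts) {
        ok = is.finite(log_geo_means) & cnts > 0
        exp(median((log(cnts) - log_geo_means)[ok]))
    })
}

size_factors = median_of_ratios(toy_counts)
print(size_factors)

### dividing by the size factors puts the samples on a common scale
toy_norm = sweep(toy_counts, 2, size_factors, "/")
print(toy_norm)
```

Because `s2` is exactly twice `s1`, its size factor is twice as large and the two normalized columns coincide.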



In [19]:
head(res)

,Peak,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj
,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,chrX_47796208_47796828,10513.875,0.95281868,0.0820797,11.608457,3.73291e-31,4.660482e-31
2,chrX_47806139_47808167,24157.939,0.02557172,0.02814903,0.9084403,0.3636456,0.3672108
3,chrX_47809119_47809445,7455.885,2.43603768,0.03405452,71.5334643,0.0,0.0
4,chrX_47814810_47815443,10298.26,0.72157768,0.02954264,24.4249592,9.289171e-132,1.594641e-131
5,chrX_47816459_47818070,29919.229,0.94178695,0.05242764,17.9635575,3.759239e-72,5.3778e-72
6,chrX_47836113_47837157,45600.703,2.97943519,0.04843894,61.5090955,0.0,0.0
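
The `padj` column holds p-values adjusted for multiple testing; DESeq2's `results()` uses the Benjamini-Hochberg method by default. A small base-R sketch with made-up p-values shows the adjustment. It does not reproduce DESeq2's independent filtering, which can additionally set some `padj` values to `NA`.

```r
### hypothetical p-values, not taken from the results above
toy_pvalue = c(0.001, 0.02, 0.03, 0.5)

### Benjamini-Hochberg adjustment
toy_padj = p.adjust(toy_pvalue, method = "BH")
print(data.frame(pvalue = toy_pvalue, padj = toy_padj))
```

An adjusted value is never smaller than the raw p-value, which matches the pattern seen in the `pvalue` and `padj` columns of the table.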


In [21]:
fdiry = file.path(FD_RES, "results", PREFIX, FOLDER, "summary")
fname = paste("result", "Log2FC", "deseq", region, "tsv", sep=".")
fpath = file.path(fdiry, fname)
print(fpath)

write_tsv(res, fpath)

[1] "/data/reddylab/Kuei/out/proj_combeffect_encode_fcc/results/Tewhey_K562_TileMPRA/coverage_astarrseq_peak_macs_input/summary/result.Log2FC.deseq.GATA1.tsv"


In [9]:
print(dim(dat_cnt))
head(dat_cnt)

[1] 206  11


,Input.rep1,Input.rep2,Input.rep3,Input.rep4,Input.rep5,Input.rep6,Output.rep1,Output.rep2,Output.rep3,Output.rep4,Output.rep5
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
chrX_47796208_47796828,6811,7520,7050,7354,3542,3424,19544,11613,26592,20125,22423
chrX_47806139_47808167,22020,24874,22902,23825,11249,11262,33975,24256,43990,32303,35219
chrX_47809119_47809445,2180,2676,2383,2398,1167,1199,18841,13963,23003,18047,19074
chrX_47814810_47815443,7241,8307,7296,7542,3861,3962,18557,13950,22498,16826,18893
chrX_47816459_47818070,19441,21644,20106,21086,9853,10013,59095,35658,78412,51012,60721
chrX_47836113_47837157,9957,11604,10541,10802,5432,5058,137782,90865,140845,112907,118586


In [10]:
dat_col

,condition
,<chr>
Input.rep1,Input
Input.rep2,Input
Input.rep3,Input
Input.rep4,Input
Input.rep5,Input
Input.rep6,Input
Output.rep1,Output
Output.rep2,Output
Output.rep3,Output
Output.rep4,Output


In [11]:
### check that metadata rows and count columns name the same samples, in the same order
print(all(rownames(dat_col) %in% colnames(dat_cnt)))
print(all(rownames(dat_col) ==   colnames(dat_cnt)))

[1] TRUE
[1] TRUE
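
Both checks pass here. DESeq2 matches count columns to metadata rows by position, so this check is most useful before the `DESeqDataSet` is built. If the second check returned `FALSE`, the count columns could be reordered by the metadata row names. A minimal sketch with made-up data (`toy_col` and `toy_cnt` are hypothetical):

```r
### hypothetical metadata and counts with columns in a different order
toy_col = data.frame(
    condition = c("Input", "Input", "Output"),
    row.names = c("Input.rep1", "Input.rep2", "Output.rep1"))
toy_cnt = data.frame(
    Output.rep1 = c(30, 40),
    Input.rep1  = c(10, 20),
    Input.rep2  = c(11, 19),
    row.names   = c("peak_a", "peak_b"))

### same samples, but the order differs
print(all(rownames(toy_col) %in% colnames(toy_cnt)))
print(all(rownames(toy_col) ==   colnames(toy_cnt)))

### reorder the count columns by the metadata row names
toy_cnt = toy_cnt[, rownames(toy_col), drop = FALSE]
stopifnot(all(rownames(toy_col) == colnames(toy_cnt)))
```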
