**Set environment**

In [1]:
suppressMessages(suppressWarnings(source("../run_config_project_sing.R")))
show_env()

You are working on        Singularity: singularity_proj_encode_fcc 
BASE DIRECTORY (FD_BASE): /data/reddylab/Kuei 
REPO DIRECTORY (FD_REPO): /data/reddylab/Kuei/repo 
WORK DIRECTORY (FD_WORK): /data/reddylab/Kuei/work 
DATA DIRECTORY (FD_DATA): /data/reddylab/Kuei/data 

You are working with      ENCODE FCC 
PATH OF PROJECT (FD_PRJ): /data/reddylab/Kuei/repo/Proj_ENCODE_FCC 
PROJECT RESULTS (FD_RES): /data/reddylab/Kuei/repo/Proj_ENCODE_FCC/results 
PROJECT SCRIPTS (FD_EXE): /data/reddylab/Kuei/repo/Proj_ENCODE_FCC/scripts 
PROJECT DATA    (FD_DAT): /data/reddylab/Kuei/repo/Proj_ENCODE_FCC/data 
PROJECT NOTE    (FD_NBK): /data/reddylab/Kuei/repo/Proj_ENCODE_FCC/notebooks 
PROJECT DOCS    (FD_DOC): /data/reddylab/Kuei/repo/Proj_ENCODE_FCC/docs 
PROJECT LOG     (FD_LOG): /data/reddylab/Kuei/repo/Proj_ENCODE_FCC/log 
PROJECT REF     (FD_REF): /data/reddylab/Kuei/repo/Proj_ENCODE_FCC/references 



**Set global variables**

In [2]:
TXT_FOLDER_REGION = "region_for_analysis"

## Import data

**Helper function: loading data**

In [3]:
fun_load_data = function(txt_region_fdiry){
    ### set summary directory (holds the column description and file labels)
    txt_fdiry = file.path(txt_region_fdiry, "summary")

    ### get column names
    txt_fname = "description.tsv"
    txt_fpath = file.path(txt_fdiry, txt_fname)
    dat = read_tsv(txt_fpath, show_col_types = FALSE)
    vec_txt_cname = dat$Name

    ### get file labels
    txt_fname = "metadata.label.tsv"
    txt_fpath = file.path(txt_fdiry, txt_fname)
    dat_metadata = read_tsv(txt_fpath, show_col_types = FALSE)

    ### list region files (*bed*) in the region directory
    txt_fglob = file.path(txt_region_fdiry, "*bed*")
    vec_txt_fpath = Sys.glob(txt_fglob)
    vec_txt_fname = basename(vec_txt_fpath)

    ### map file names to labels; fall back to the file name when unmatched
    ### NOTE: the labels are not used below; the list is named by file name
    vec_txt_label = fun_str_map_match(
        vec_txt_fname, 
        dat_metadata$FName, 
        dat_metadata$Label, 
        .default=vec_txt_fname)

    ### read all region files (no header; column names from description.tsv)
    lst = lapply(vec_txt_fpath, function(txt_fpath){
        dat = read_tsv(txt_fpath, col_names = vec_txt_cname, show_col_types = FALSE)
        return(dat)
    })
    names(lst) = vec_txt_fname
    
    return(lst)
}
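
`fun_str_map_match` is not defined in this notebook, so its exact behavior is not visible here. The sketch below is a hypothetical base-R stand-in (named `map_match_sketch` to keep it distinct), assuming exact string matching with a per-element fallback. The project's helper may match differently.

```r
### hypothetical stand-in; the project's fun_str_map_match may behave differently
map_match_sketch = function(vec_txt_input, vec_txt_key, vec_txt_value, .default = NA){
    ### position of each input in the key vector (NA when there is no exact match)
    idx = match(vec_txt_input, vec_txt_key)
    vec_txt_output = vec_txt_value[idx]

    ### fill unmatched entries from .default (recycled to the input length)
    vec_lgl_miss = is.na(idx)
    vec_default  = rep_len(.default, length(vec_txt_input))
    vec_txt_output[vec_lgl_miss] = vec_default[vec_lgl_miss]
    return(vec_txt_output)
}

### made-up file names and labels for illustration
vec_txt_fname = c("a.bed.gz", "b.bed.gz", "c.bed.gz")
vec_txt_label = map_match_sketch(
    vec_txt_fname,
    c("a.bed.gz", "c.bed.gz"),
    c("Sample A", "Sample C"),
    .default = vec_txt_fname)
print(vec_txt_label)
```

With `.default` set to the file names, an unmatched file keeps its own name as its label.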

**Load data**

In [4]:
### set file directory
txt_folder = "fcc_astarr_macs_merge"
txt_fdiry  = file.path(FD_RES, "region", txt_folder)

### read tables
lst = fun_load_data(txt_fdiry)

### assign and show
lst_dat_region_astarr_input = lst
print(lapply(lst, nrow))

$K562.hg38.ASTARR.macs.KS91.input.rep_all.max_overlaps.q5.bed.gz
[1] 150042

$K562.hg38.ASTARR.macs.KS91.input.rep_all.union.q5.bed.gz
[1] 246852



In [5]:
### set file directory
txt_folder = "encode_open_chromatin"
txt_fdiry  = file.path(FD_RES, "region", txt_folder)

### read tables
lst = fun_load_data(txt_fdiry)

### assign and show
lst_dat_region_encode_ocr = lst
print(lapply(lst, nrow))

$K562.hg38.ENCSR000EKS.ENCFF274YGF.DNase.bed.gz
[1] 118721

$K562.hg38.ENCSR000EOT.ENCFF185XRG.DNase.bed.gz
[1] 159277

$K562.hg38.ENCSR483RKN.ENCFF558BLC.ATAC.bed.gz
[1] 203874

$K562.hg38.ENCSR483RKN.ENCFF925CYR.ATAC.bed.gz
[1] 123009

$K562.hg38.ENCSR868FGK.ENCFF333TAT.ATAC.bed.gz
[1] 269800

$K562.hg38.ENCSR868FGK.ENCFF948AFM.ATAC.bed.gz
[1] 181340



## Arrange table

**Arrange ENCODE OCR tables: add a Region column**

In [6]:
lst = lst_dat_region_encode_ocr
lst = lapply(lst, function(dat){
    dat = dat %>% dplyr::mutate(Region = fun_gen_region(Chrom, ChromStart, ChromEnd))
    return(dat)
})

### assign and show
lst_dat_region_encode_ocr_arrange = lst

res = lapply(lst, dim)
print(res)

dat = lst[[1]]
fun_display_table(head(dat, 3))

$K562.hg38.ENCSR000EKS.ENCFF274YGF.DNase.bed.gz
[1] 118721     11

$K562.hg38.ENCSR000EOT.ENCFF185XRG.DNase.bed.gz
[1] 159277     11

$K562.hg38.ENCSR483RKN.ENCFF558BLC.ATAC.bed.gz
[1] 203874     11

$K562.hg38.ENCSR483RKN.ENCFF925CYR.ATAC.bed.gz
[1] 123009     11

$K562.hg38.ENCSR868FGK.ENCFF333TAT.ATAC.bed.gz
[1] 269800     11

$K562.hg38.ENCSR868FGK.ENCFF948AFM.ATAC.bed.gz
[1] 181340     11



Chrom,ChromStart,ChromEnd,Name,Score,Strand,SignalValue,PValue,QValue,Peak,Region
chr1,181400,181530,.,0,.,0.299874,-1,-1,75,chr1:181400-181530
chr1,778660,778800,.,0,.,14.1383,-1,-1,75,chr1:778660-778800
chr1,779137,779200,.,0,.,0.33144,-1,-1,75,chr1:779137-779200
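
`fun_gen_region` is not defined in this notebook. Judging from the Region values in the table above, it builds a `chrom:start-end` string for each row. A minimal base-R sketch under that assumption follows. The name `gen_region_sketch` and the toy coordinates are illustrative only.

```r
### hypothetical stand-in for fun_gen_region (assumed format: chrom:start-end)
gen_region_sketch = function(vec_txt_chrom, vec_num_start, vec_num_end){
    ### format() with scientific = FALSE avoids strings such as "1e+05",
    ### which as.character() produces for a double like 100000
    vec_txt_start = format(vec_num_start, scientific = FALSE, trim = TRUE)
    vec_txt_end   = format(vec_num_end,   scientific = FALSE, trim = TRUE)
    paste0(vec_txt_chrom, ":", vec_txt_start, "-", vec_txt_end)
}

### toy table: the first row is taken from the output above, the second is made up
dat = data.frame(
    Chrom      = c("chr1", "chr1"),
    ChromStart = c(181400, 100000),
    ChromEnd   = c(181530, 100200))
dat$Region = gen_region_sketch(dat$Chrom, dat$ChromStart, dat$ChromEnd)
print(dat$Region)
```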


**Simplify all regions into four columns: Chrom, ChromStart, ChromEnd, Region**

In [7]:
### concatenate
lst = c(
    lst_dat_region_astarr_input,
    lst_dat_region_encode_ocr_arrange
)

### get the region only
lst = lapply(lst, function(dat){
    dat = dat %>% 
        dplyr::select(Chrom, ChromStart, ChromEnd, Region) %>%
        dplyr::distinct()
    return(dat)
})

### assign and show
lst_dat_region_merge = lst

res = lapply(lst, dim)
print(res)

dat = lst[[1]]
fun_display_table(head(dat, 3))

$K562.hg38.ASTARR.macs.KS91.input.rep_all.max_overlaps.q5.bed.gz
[1] 150042      4

$K562.hg38.ASTARR.macs.KS91.input.rep_all.union.q5.bed.gz
[1] 246852      4

$K562.hg38.ENCSR000EKS.ENCFF274YGF.DNase.bed.gz
[1] 118721      4

$K562.hg38.ENCSR000EOT.ENCFF185XRG.DNase.bed.gz
[1] 159277      4

$K562.hg38.ENCSR483RKN.ENCFF558BLC.ATAC.bed.gz
[1] 107082      4

$K562.hg38.ENCSR483RKN.ENCFF925CYR.ATAC.bed.gz
[1] 51861     4

$K562.hg38.ENCSR868FGK.ENCFF333TAT.ATAC.bed.gz
[1] 161693      4

$K562.hg38.ENCSR868FGK.ENCFF948AFM.ATAC.bed.gz
[1] 90015     4



Chrom,ChromStart,ChromEnd,Region
chr1,10038,10405,chr1:10038-10405
chr1,14282,14614,chr1:14282-14614
chr1,16025,16338,chr1:16025-16338
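
The ATAC row counts drop after this step (for example, from 203874 to 107082 for ENCFF558BLC), while the ASTARR and DNase counts do not change. This is consistent with several rows sharing the same coordinates and differing only in dropped columns such as Peak, although the notebook does not check this directly. The toy example below uses made-up coordinates, with base-R `unique()` standing in for `dplyr::distinct()`.

```r
### toy peak table: the first two rows share coordinates but differ in Peak
dat = data.frame(
    Chrom      = c("chr1", "chr1", "chr1"),
    ChromStart = c(1000L, 1000L, 5000L),
    ChromEnd   = c(2000L, 2000L, 5600L),
    Peak       = c(150L, 720L, 300L))
dat$Region = paste0(dat$Chrom, ":", dat$ChromStart, "-", dat$ChromEnd)

### keep the coordinate columns only, then drop duplicated rows
dat_region = unique(dat[, c("Chrom", "ChromStart", "ChromEnd", "Region")])

print(nrow(dat))
print(nrow(dat_region))
```

Comparing `nrow()` before and after this step on the real tables is a quick way to see how many rows were collapsed per file.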


## Define column description

In [8]:
### set column name and description
dat = tribble(
    ~Name,        ~Note,
    "Chrom",      "Name of the chromosome",
    "ChromStart", "The starting position of the feature in the chromosome",
    "ChromEnd",   "The ending position of the feature in the chromosome",
    "Region",     "chr:start-end for each row"
)

### assign and show
dat_cname = dat
fun_display_table(dat)

Name,Note
Chrom,Name of the chromosome
ChromStart,The starting position of the feature in the chromosome
ChromEnd,The ending position of the feature in the chromosome
Region,chr:start-end for each row


## Save results

In [9]:
### set output directory
txt_folder = TXT_FOLDER_REGION
txt_fdiry  = file.path(FD_RES, "region", txt_folder)
dir.create(txt_fdiry, showWarnings = FALSE, recursive = TRUE)

lst = lst_dat_region_merge
for (txt_fname in names(lst)){
    ### set file path (list names are the original file names)
    txt_fpath = file.path(txt_fdiry, txt_fname)

    ### sort by coordinate and write without a header
    ### (readr gzip-compresses the output because the file name ends in .gz)
    dat = lst[[txt_fname]]
    dat = dat %>% dplyr::arrange(Chrom, ChromStart, ChromEnd) %>% dplyr::distinct()
    write_tsv(dat, txt_fpath, col_names = FALSE)

    ### show progress
    print(txt_fname)
    flush.console()
}

[1] "K562.hg38.ASTARR.macs.KS91.input.rep_all.max_overlaps.q5.bed.gz"
[1] "K562.hg38.ASTARR.macs.KS91.input.rep_all.union.q5.bed.gz"
[1] "K562.hg38.ENCSR000EKS.ENCFF274YGF.DNase.bed.gz"
[1] "K562.hg38.ENCSR000EOT.ENCFF185XRG.DNase.bed.gz"
[1] "K562.hg38.ENCSR483RKN.ENCFF558BLC.ATAC.bed.gz"
[1] "K562.hg38.ENCSR483RKN.ENCFF925CYR.ATAC.bed.gz"
[1] "K562.hg38.ENCSR868FGK.ENCFF333TAT.ATAC.bed.gz"
[1] "K562.hg38.ENCSR868FGK.ENCFF948AFM.ATAC.bed.gz"


In [10]:
txt_folder = TXT_FOLDER_REGION
txt_fdiry  = file.path(FD_RES, "region", txt_folder, "summary")
txt_fname  = "description.tsv"
txt_fpath  = file.path(txt_fdiry, txt_fname)

dir.create(txt_fdiry, showWarnings = FALSE)
dat = dat_cname
write_tsv(dat, txt_fpath)
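
The region files are written without a header, and the column names are stored separately in `summary/description.tsv`, which is the layout `fun_load_data` expects when reading a region folder. The sketch below illustrates that round trip in base R on a toy table in a temporary directory. The file name `toy.bed.gz`, the folder `region_sketch`, and the use of `write.table`/`read.delim` in place of readr are assumptions made for the example.

```r
### toy region table and its column description
dat_region = data.frame(
    Chrom      = c("chr1", "chr1"),
    ChromStart = c(10038L, 14282L),
    ChromEnd   = c(10405L, 14614L),
    Region     = c("chr1:10038-10405", "chr1:14282-14614"),
    stringsAsFactors = FALSE)
dat_cname = data.frame(Name = names(dat_region), stringsAsFactors = FALSE)

### set up a temporary region folder with a summary subfolder
txt_fdiry = file.path(tempdir(), "region_sketch")
dir.create(file.path(txt_fdiry, "summary"), showWarnings = FALSE, recursive = TRUE)
txt_fpath_bed = file.path(txt_fdiry, "toy.bed.gz")
txt_fpath_des = file.path(txt_fdiry, "summary", "description.tsv")

### write the gzipped region file without a header, and the description with one
con = gzfile(txt_fpath_bed, "w")
write.table(dat_region, con, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
close(con)
write.table(dat_cname, txt_fpath_des, sep = "\t", quote = FALSE, row.names = FALSE)

### read back: take column names from the description, then apply them
vec_txt_cname = read.delim(txt_fpath_des, stringsAsFactors = FALSE)$Name
dat_back = read.delim(txt_fpath_bed, header = FALSE,
                      col.names = vec_txt_cname, stringsAsFactors = FALSE)
print(dat_back)
```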