**Data: Original files**

`PROJECT/data/processed/ASTARRseq_K562_hg38_KS91_210401/peaks`

```
KS91_K562_hg38_ASTARRseq_Input.all_reps.masked.union_narrowPeak.q5.bed
KS91_K562_hg38_ASTARRseq_Input.q5.in_all.max_overlaps.bed
```

**Results: SymLinks**

`PROJECT/results/region/starr_macs`

```
ASTARRseq_K562_KS91.hg38.Input.rep_all.union.q5.bed
ASTARRseq_K562_KS91.hg38.Input.rep_all.max_overlaps.q5.bed
```

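**Note: Differences between these two region sets**

Code chunk used to build the union region set (the less stringent set): peaks from the narrowPeak files of all input-library replicates are filtered on column 9 (the -log10 q-value in narrowPeak format), pooled, sorted, and merged wherever they overlap. The threshold is taken from `SLURM_ARRAY_TASK_ID`, so array task 5 produces the `q5` file.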
```
cat $(/bin/ls -1 ${SAMPLES}) \
| awk -vTHRES=${SLURM_ARRAY_TASK_ID} '$9>THRES' \
| sort -k1,1 -k2,2n \
| bedtools merge \
    -i - \
> "${OUTDIR}/KS91_K562_hg38_ASTARRseq_Input.all_reps.masked.union_narrowPeak.q${SLURM_ARRAY_TASK_ID}.bed" \
&& echo -e "Done!\t${OUTDIR}/KS91_K562_hg38_ASTARRseq_Input.all_reps.masked.union_narrowPeak.q${SLURM_ARRAY_TASK_ID}.bed" \
|| echo -e "Failed!\t${OUTDIR}/KS91_K562_hg38_ASTARRseq_Input.all_reps.masked.union_narrowPeak.q${SLURM_ARRAY_TASK_ID}.bed"
```
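
To make the effect of the union chunk concrete, here is a minimal sketch on toy data. The four peak rows and the helper names `toy_peaks` and `merge_sorted` are hypothetical, and the small `awk` program only stands in for `bedtools merge` so the example runs without bedtools; it is not the command used for the real files.

```shell
# Toy narrowPeak rows (10 tab-separated columns); column 9 is -log10(q-value).
toy_peaks() {
    printf 'chr1\t100\t200\tp1\t0\t.\t5\t9\t8.0\t50\n'
    printf 'chr1\t150\t300\tp2\t0\t.\t5\t9\t6.5\t40\n'
    printf 'chr1\t400\t500\tp3\t0\t.\t5\t9\t2.0\t30\n'
    printf 'chr2\t10\t90\tp4\t0\t.\t5\t9\t7.1\t20\n'
}

# Stand-in for `bedtools merge -i -` (an assumption for illustration):
# collapse overlapping or book-ended intervals in sorted BED input.
merge_sorted() {
    awk 'BEGIN { OFS = "\t" }
         NR > 1 && $1 == chrom && $2 + 0 <= end { if ($3 + 0 > end) end = $3 + 0; next }
         NR > 1 { print chrom, start, end }
         { chrom = $1; start = $2 + 0; end = $3 + 0 }
         END { if (NR > 0) print chrom, start, end }'
}

# Same pipeline shape as above: filter on column 9, sort, merge.
THRES=5
union=$(toy_peaks | awk -v THRES="${THRES}" '$9 > THRES' | sort -k1,1 -k2,2n | merge_sorted)
printf '%s\n' "${union}"
```

In this toy input, the peak with a column 9 value of 2.0 is dropped by the threshold, the two overlapping chr1 peaks collapse into one interval, and the chr2 peak passes through unchanged.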

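Code chunk used to build the overlap region set (the more stringent set): `bedtools multiinter` splits the genome into intervals annotated with the replicates that cover them, the `grep` keeps only lines with at least five commas (that is, intervals whose replicate list names all six replicates), and `bedtools merge` joins the surviving adjacent intervals while reporting the maximum replicate count from column 4.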
```
bedtools multiinter \
    -i \
        <(awk '$9>5' KS91_K562_hg38_ASTARRseq_Input_rep1.masked.dups_marked_peaks.narrowPeak) \
        <(awk '$9>5' KS91_K562_hg38_ASTARRseq_Input_rep2.masked.dups_marked_peaks.narrowPeak) \
        <(awk '$9>5' KS91_K562_hg38_ASTARRseq_Input_rep3.masked.dups_marked_peaks.narrowPeak) \
        <(awk '$9>5' KS91_K562_hg38_ASTARRseq_Input_rep4.masked.dups_marked_peaks.narrowPeak) \
        <(awk '$9>5' KS91_K562_hg38_ASTARRseq_Input_rep5.masked.dups_marked_peaks.narrowPeak) \
        <(awk '$9>5' KS91_K562_hg38_ASTARRseq_Input_rep6.masked.dups_marked_peaks.narrowPeak) \
| sed 's@/data/reddylab/Alex/encode4_duke//processing/atac_seq/210401_KS91_K562ASTARR_NovaSeq.hg38-pe-blacklist-removal//@@g' \
| /bin/grep -E "(.*,){5}" \
| bedtools merge -i - -c 4 -o max \
> KS91_K562_hg38_ASTARRseq_Input.q5.in_all.max_overlaps.bed
```
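
The `grep -E "(.*,){5}"` step is the least obvious part of the chunk above, so here is a minimal sketch of what it selects. The rows are toy data shaped like `bedtools multiinter` output for six inputs (chrom, start, end, number of inputs, comma-separated input list, six 0/1 flags); the helper name `toy_multiinter` and the `awk` alternative are hypothetical and were not part of the original pipeline.

```shell
# Toy rows shaped like bedtools multiinter output for six inputs.
toy_multiinter() {
    printf 'chr1\t100\t140\t3\t1,2,4\t1\t1\t0\t1\t0\t0\n'
    printf 'chr1\t140\t300\t6\t1,2,3,4,5,6\t1\t1\t1\t1\t1\t1\n'
    printf 'chr1\t300\t350\t5\t1,2,3,5,6\t1\t1\t1\t0\t1\t1\n'
    printf 'chr1\t900\t950\t6\t1,2,3,4,5,6\t1\t1\t1\t1\t1\t1\n'
}

# Filter used in the chunk above: at least five commas anywhere on the line,
# which here means all six inputs appear in the list column.
by_grep=$(toy_multiinter | grep -E '(.*,){5}' | cut -f 1-4)

# Hypothetical alternative: test the count column directly.
by_awk=$(toy_multiinter | awk -F '\t' '$4 == 6' | cut -f 1-4)

printf '%s\n' "${by_grep}"
```

Both filters keep the same two toy intervals. The `grep` form counts commas anywhere on the line, so it relies on no other column containing commas; testing column 4 avoids that dependency.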

**Set environment**

In [1]:
suppressMessages(suppressWarnings(source("../run_config_project_sing.R")))
show_env()

You are working on        Singularity: singularity_proj_encode_fcc 
BASE DIRECTORY (FD_BASE): /data/reddylab/Kuei 
REPO DIRECTORY (FD_REPO): /data/reddylab/Kuei/repo 
WORK DIRECTORY (FD_WORK): /data/reddylab/Kuei/work 
DATA DIRECTORY (FD_DATA): /data/reddylab/Kuei/data 

You are working with      ENCODE FCC 
PATH OF PROJECT (FD_PRJ): /data/reddylab/Kuei/repo/Proj_ENCODE_FCC 
PROJECT RESULTS (FD_RES): /data/reddylab/Kuei/repo/Proj_ENCODE_FCC/results 
PROJECT SCRIPTS (FD_EXE): /data/reddylab/Kuei/repo/Proj_ENCODE_FCC/scripts 
PROJECT DATA    (FD_DAT): /data/reddylab/Kuei/repo/Proj_ENCODE_FCC/data 
PROJECT NOTE    (FD_NBK): /data/reddylab/Kuei/repo/Proj_ENCODE_FCC/notebooks 
PROJECT DOCS    (FD_DOC): /data/reddylab/Kuei/repo/Proj_ENCODE_FCC/docs 
PROJECT LOG     (FD_LOG): /data/reddylab/Kuei/repo/Proj_ENCODE_FCC/log 
PROJECT REF     (FD_REF): /data/reddylab/Kuei/repo/Proj_ENCODE_FCC/references 



**Set global variable**

In [2]:
TXT_FOLDER_REGION = "fcc_astarr_macs"

## Import data

**Set the main chromosomes**

In [3]:
txt_fdiry = file.path(FD_DAT, "external")
txt_fname = "chrom.hg38.main.bed"
txt_fpath = file.path(txt_fdiry, txt_fname)

vec = c("Chrom", "ChromStart", "ChromEnd")
dat = read_tsv(txt_fpath, col_names = vec, show_col_types = FALSE)
vec = dat$Chrom

### assign and show
vec_txt_chrom = vec
print(vec)

 [1] "chr1"  "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17"
[10] "chr18" "chr19" "chr2"  "chr20" "chr21" "chr22" "chr3"  "chr4"  "chr5" 
[19] "chr6"  "chr7"  "chr8"  "chr9"  "chrX" 


**Import ASTARR input MACS peaks**

There are six replicates of input libraries.

In [4]:
txt_fdiry = file.path(FD_DAT, "processed", "STARR_ATAC_K562_Reddy_KS91_210401/peaks")
vec = dir(txt_fdiry)

vec_txt_fname = vec
for(txt in vec){cat(txt, "\n")}

KS91_K562_hg38_ASTARRseq_Input_rep1.masked.dups_marked_peaks.narrowPeak 
KS91_K562_hg38_ASTARRseq_Input_rep2.masked.dups_marked_peaks.narrowPeak 
KS91_K562_hg38_ASTARRseq_Input_rep3.masked.dups_marked_peaks.narrowPeak 
KS91_K562_hg38_ASTARRseq_Input_rep4.masked.dups_marked_peaks.narrowPeak 
KS91_K562_hg38_ASTARRseq_Input_rep5.masked.dups_marked_peaks.narrowPeak 
KS91_K562_hg38_ASTARRseq_Input_rep6.masked.dups_marked_peaks.narrowPeak 
KS91_K562_hg38_ASTARRseq_Input.all_reps.masked.union_narrowPeak.q5.bed 
KS91_K562_hg38_ASTARRseq_Input.q5.in_all.max_overlaps.bed 


Prepare names and read tables

In [5]:
### Prepare names
vec_txt_fname = c(
    "KS91_K562_hg38_ASTARRseq_Input.all_reps.masked.union_narrowPeak.q5.bed",
    "KS91_K562_hg38_ASTARRseq_Input.q5.in_all.max_overlaps.bed"
)
names(vec_txt_fname) = c("union", "max_overlaps")

### Read tables into a list
lst = lapply(vec_txt_fname, function(txt_fname){
    txt_fpath = file.path(txt_fdiry, txt_fname)
    dat = read_tsv(txt_fpath, col_names = FALSE, show_col_types = FALSE)
    return(dat)
})

### assign and show
lst_dat_import = lst
print(vec_txt_fname)

                                                                   union 
"KS91_K562_hg38_ASTARRseq_Input.all_reps.masked.union_narrowPeak.q5.bed" 
                                                            max_overlaps 
             "KS91_K562_hg38_ASTARRseq_Input.q5.in_all.max_overlaps.bed" 


In [6]:
lst = lst_dat_import
dat = lst[[1]]
fun_display_table(head(dat, 3))

X1,X2,X3
chr1,10015,10442
chr1,14253,14645
chr1,16015,16477


In [7]:
lst = lst_dat_import
dat = lst[[2]]
fun_display_table(head(dat, 3))

X1,X2,X3,X4
chr1,10038,10405,6
chr1,14282,14614,6
chr1,16025,16338,6


In [8]:
lst = lst_dat_import
lst = lapply(lst, function(dat){
    ### keep main chromosomes, keep only chrom/start/end,
    ### add a region label, and sort by position
    dat = dat %>% 
        dplyr::filter(X1 %in% vec_txt_chrom) %>%
        dplyr::select(X1:X3) %>%
        dplyr::mutate(X4 = fun_gen_region(X1, X2, X3)) %>%
        dplyr::arrange(X1, X2, X3)
    return(dat)
})

### assign
lst_dat_arrange = lst

**Double check the filtering process**

In [9]:
lst = lst_dat_import
dat = lst[[1]]
dat = as.data.frame(table(dat$X1))
fun_display_table(head(dat, 3))

Var1,Freq
chr1,30534
chr1_KI270706v1_random,35
chr1_KI270707v1_random,1


In [10]:
lst = lst_dat_arrange
dat = lst[[1]]
dat = as.data.frame(table(dat$X1))
fun_display_table(head(dat, 3))

Var1,Freq
chr1,30534
chr10,11398
chr11,12010


## Define column description

In [11]:
### set column name and description
dat = tribble(
    ~Name,        ~Note,
    "Chrom",      "Name of the chromosome",
    "ChromStart", "The starting position of the feature in the chromosome",
    "ChromEnd",   "The ending position of the feature in the chromosome",
    "Region",     "chr:start-end for each row"
)

### assign and show
dat_cname = dat
fun_display_table(dat)

Name,Note
Chrom,Name of the chromosome
ChromStart,The starting position of the feature in the chromosome
ChromEnd,The ending position of the feature in the chromosome
Region,chr:start-end for each row


## Save results

**Write description table of columns**

In [12]:
txt_folder = TXT_FOLDER_REGION
txt_fdiry  = file.path(FD_RES, "region", txt_folder, "summary")
txt_fname  = "description.tsv"
txt_fpath  = file.path(txt_fdiry, txt_fname)

### recursive = TRUE also creates missing parent folders
dir.create(txt_fdiry, showWarnings = FALSE, recursive = TRUE)
dat = dat_cname
write_tsv(dat, txt_fpath)

**Test loop**

In [13]:
lst = lst_dat_arrange
for (idx in names(lst)){
    txt_fname = paste(
        "K562.hg38.ASTARR.macs.KS91.input.rep_all",
        idx,
        "q5.bed.gz",
        sep = "."
    )
    print(idx)
    print(txt_fname)
}

[1] "union"
[1] "K562.hg38.ASTARR.macs.KS91.input.rep_all.union.q5.bed.gz"
[1] "max_overlaps"
[1] "K562.hg38.ASTARR.macs.KS91.input.rep_all.max_overlaps.q5.bed.gz"


**Write tables**

In [14]:
### set file path
txt_folder = TXT_FOLDER_REGION
txt_fdiry  = file.path(FD_RES, "region", txt_folder)
dir.create(txt_fdiry, showWarnings = FALSE)

lst = lst_dat_arrange
for (idx in names(lst)){
    ### set file name and path
    txt_fname = paste(
        "K562.hg38.ASTARR.macs.KS91.input.rep_all",
        idx,
        "q5.bed.gz",
        sep = "."
    )
    txt_fpath = file.path(txt_fdiry, txt_fname)

    ### write table
    dat = lst[[idx]]
    write_tsv(dat, txt_fpath, col_names = FALSE)

    ### show progress
    print(idx)
    print(txt_fname)
    flush.console()
}

[1] "union"
[1] "K562.hg38.ASTARR.macs.KS91.input.rep_all.union.q5.bed.gz"
[1] "max_overlaps"
[1] "K562.hg38.ASTARR.macs.KS91.input.rep_all.max_overlaps.q5.bed.gz"


**Save a copy to the reference folder**

In [15]:
FD_REF

In [16]:
### set file path
txt_folder = "fcc_peak_call_atac"
txt_fdiry  = file.path(FD_REF, txt_folder)
dir.create(txt_fdiry, showWarnings = FALSE)

lst = lst_dat_arrange
for (idx in names(lst)){
    ### set file name and path
    txt_fname = paste(
        "STARRseq_ATAC.K562.ReddyLab.hg38.Input.rep_all",
        idx,
        "q5.bed.gz",
        sep = "."
    )
    txt_fpath = file.path(txt_fdiry, txt_fname)

    ### write table
    dat = lst[[idx]]
    write_tsv(dat, txt_fpath, col_names = FALSE)

    ### show progress
    print(idx)
    print(txt_fname)
    flush.console()
}

[1] "union"
[1] "STARRseq_ATAC.K562.ReddyLab.hg38.Input.rep_all.union.q5.bed.gz"
[1] "max_overlaps"
[1] "STARRseq_ATAC.K562.ReddyLab.hg38.Input.rep_all.max_overlaps.q5.bed.gz"
