**Set environment**

In [1]:
suppressMessages(suppressWarnings(source("../run_config_project_sing.R")))
show_env()

You are working on        Singularity 
BASE DIRECTORY (FD_BASE): /mount 
REPO DIRECTORY (FD_REPO): /mount/repo 
WORK DIRECTORY (FD_WORK): /mount/work 
DATA DIRECTORY (FD_DATA): /mount/data 

You are working with      ENCODE FCC 
PATH OF PROJECT (FD_PRJ): /mount/repo/Proj_ENCODE_FCC 
PROJECT RESULTS (FD_RES): /mount/repo/Proj_ENCODE_FCC/results 
PROJECT SCRIPTS (FD_EXE): /mount/repo/Proj_ENCODE_FCC/scripts 
PROJECT DATA    (FD_DAT): /mount/repo/Proj_ENCODE_FCC/data 
PROJECT NOTE    (FD_NBK): /mount/repo/Proj_ENCODE_FCC/notebooks 
PROJECT DOCS    (FD_DOC): /mount/repo/Proj_ENCODE_FCC/docs 
PROJECT LOG     (FD_LOG): /mount/repo/Proj_ENCODE_FCC/log 
PROJECT APP     (FD_APP): /mount/repo/Proj_ENCODE_FCC/app 
PROJECT REF     (FD_REF): /mount/repo/Proj_ENCODE_FCC/references 



## Prepare: list reference files

In [2]:
txt_fdiry = file.path(FD_REF, "encode_chipseq_latest")
vec = dir(txt_fdiry)
for (txt in vec){cat(txt, "\n")}

files.processed_files.encode2.txt 
files.processed_files.encode3.txt 
files.processed_files.encode4.txt 
metadata.default_files.240705.tsv 
metadata.default_files.summary.tsv 
metadata.processed_selected.240705.tsv 
metadata.processed_selected.summary.tsv 


## Import metadata from reference

**ENCODE ChIP-seq Full**

In [3]:
### set file path
txt_fdiry = file.path(FD_REF, "encode_chipseq_latest")
txt_fname = "metadata.default_files.summary.tsv"
txt_fpath = file.path(txt_fdiry, txt_fname)

### read table
dat = read_tsv(txt_fpath, show_col_types = FALSE)

### show and assign
dat_metadata_chipseq_full = dat
print(dim(dat))
fun_display_table(head(dat, 3))

[1] 1630   12


Assay,Index_Experiment,Index_File,File_Type,Output_Type,Genome,Target,Analysis,RFA,md5sum,File_Name,File_Url
TF ChIP-seq,ENCSR800KMQ,ENCFF124KCE,bed narrowPeak,IDR thresholded peaks,GRCh38,ZBTB43,ENCODE4 v1.8.0 GRCh38,ENCODE4,3154aa211bac58ce311f643c7aa9023e,ENCFF124KCE.bed.gz,https://www.encodeproject.org/files/ENCFF124KCE/@@download/ENCFF124KCE.bed.gz
TF ChIP-seq,ENCSR800KMQ,ENCFF334ZCX,bigWig,signal p-value,GRCh38,ZBTB43,ENCODE4 v1.8.0 GRCh38,ENCODE4,cf6a2fcaf92a09842452f7fa7437e777,ENCFF334ZCX.bigWig,https://www.encodeproject.org/files/ENCFF334ZCX/@@download/ENCFF334ZCX.bigWig
TF ChIP-seq,ENCSR744GHR,ENCFF173QUY,bed narrowPeak,IDR thresholded peaks,GRCh38,E2F5,ENCODE4 v1.8.0 GRCh38,ENCODE4,8a96fec76b8ca5f6a49053f771395398,ENCFF173QUY.bed.gz,https://www.encodeproject.org/files/ENCFF173QUY/@@download/ENCFF173QUY.bed.gz


## Filter to get the final metadata table

In [4]:
### init
txt_assay = "Histone"
txt_ftype = "bed narrowPeak"

### filter to get selected files
dat = dat_metadata_chipseq_full
dat = dat %>% 
    dplyr::filter(str_detect(Assay, txt_assay)) %>%
    dplyr::filter(File_Type == txt_ftype)
    
### assign and show
dat_metadata_final = dat
print(dim(dat))
fun_display_table(head(dat, 3))

[1] 19 12


Assay,Index_Experiment,Index_File,File_Type,Output_Type,Genome,Target,Analysis,RFA,md5sum,File_Name,File_Url
Histone ChIP-seq,ENCSR000AKX,ENCFF909RKY,bed narrowPeak,replicated peaks,GRCh38,H4K20me1,ENCODE4 v1.6.1 GRCh38,ENCODE2,fe4c93b6faa0ab4153ef41fe516bced4,ENCFF909RKY.bed.gz,https://www.encodeproject.org/files/ENCFF909RKY/@@download/ENCFF909RKY.bed.gz
Histone ChIP-seq,ENCSR000APE,ENCFF963GZJ,bed narrowPeak,replicated peaks,GRCh38,H3K9me3,ENCODE4 v1.5.1 GRCh38,ENCODE2,8e71cfdb547cdb5057d99d12d7be3ba6,ENCFF963GZJ.bed.gz,https://www.encodeproject.org/files/ENCFF963GZJ/@@download/ENCFF963GZJ.bed.gz
Histone ChIP-seq,ENCSR000APC,ENCFF213OTI,bed narrowPeak,pseudoreplicated peaks,GRCh38,H2AFZ,ENCODE4 v1.6.0 GRCh38,ENCODE2,a5934e60f1ebf8e393aa111a26f74074,ENCFF213OTI.bed.gz,https://www.encodeproject.org/files/ENCFF213OTI/@@download/ENCFF213OTI.bed.gz
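
The filter above combines a pattern match on `Assay` with an exact match on `File_Type`. The sketch below reproduces that logic in base R on a toy table so it runs without the tidyverse; the toy rows and the names `dat_toy`, `idx`, and `dat_toy_final` are invented for illustration and are not part of the notebook.

```r
# Toy metadata table; values are illustrative only, not a real ENCODE export
dat_toy = data.frame(
    Assay     = c("TF ChIP-seq", "Histone ChIP-seq", "Histone ChIP-seq"),
    File_Type = c("bed narrowPeak", "bed narrowPeak", "bigWig"),
    Target    = c("ZBTB43", "H3K27ac", "H3K27ac"),
    stringsAsFactors = FALSE
)

txt_assay = "Histone"
txt_ftype = "bed narrowPeak"

# fixed = TRUE treats txt_assay as a literal substring rather than a regex;
# the second condition requires the file type to match exactly
idx = grepl(txt_assay, dat_toy$Assay, fixed = TRUE) & dat_toy$File_Type == txt_ftype
dat_toy_final = dat_toy[idx, , drop = FALSE]
print(dat_toy_final)
```

Only the Histone ChIP-seq row with a `bed narrowPeak` file survives. Note that `str_detect` interprets its pattern as a regular expression by default; for a plain string such as "Histone" the result is the same as the literal match used here.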


**Explore: Target**

In [5]:
dat = dat_metadata_final
table(dat$Target, dat$File_Type)

          
           bed narrowPeak
  H2AFZ                 1
  H3K27ac               1
  H3K27me3              2
  H3K36me3              2
  H3K4me1               2
  H3K4me2               1
  H3K4me3               4
  H3K79me2              1
  H3K9ac                2
  H3K9me1               1
  H3K9me3               1
  H4K20me1              1

**Explore: RFA (ENCODE project phase)**

In [6]:
dat = dat_metadata_final
table(dat$RFA)


ENCODE2 ENCODE3 
     18       1 

**Check: all files are in GRCh38 (hg38)**

In [7]:
dat = dat_metadata_final
table(dat$Assay, dat$Genome)

                  
                   GRCh38
  Histone ChIP-seq     19

**Check: file/output types**

In [8]:
dat = dat_metadata_final
table(dat$Output_Type, dat$File_Type)

                        
                         bed narrowPeak
  pseudoreplicated peaks             14
  replicated peaks                    5

**Check: one file per experiment**

In [9]:
dat = dat_metadata_final
table(dat$Index_Experiment, dat$File_Type)

             
              bed narrowPeak
  ENCSR000AKP              1
  ENCSR000AKQ              1
  ENCSR000AKR              1
  ENCSR000AKS              1
  ENCSR000AKT              1
  ENCSR000AKU              1
  ENCSR000AKV              1
  ENCSR000AKW              1
  ENCSR000AKX              1
  ENCSR000APC              1
  ENCSR000APD              1
  ENCSR000APE              1
  ENCSR000DWB              1
  ENCSR000DWD              1
  ENCSR000EVZ              1
  ENCSR000EWA              1
  ENCSR000EWB              1
  ENCSR000EWC              1
  ENCSR668LDD              1
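
The checks above are read by eye from the printed tables. As a hedged alternative, the same expectations can be written as assertions so the notebook stops if a refreshed metadata file breaks them. `fun_check_metadata` is a hypothetical helper, not a function defined in this project; the three toy rows reuse accessions shown in the table preview earlier.

```r
# Hypothetical helper (an assumption, not part of the original notebook):
# stop with an error if the selected metadata violates the expectations
# that were checked visually with table()
fun_check_metadata = function(dat, txt_genome = "GRCh38", txt_ftype = "bed narrowPeak"){
    stopifnot(
        all(dat$Genome == txt_genome),             # single genome assembly
        all(dat$File_Type == txt_ftype),           # single file type
        anyDuplicated(dat$Index_Experiment) == 0,  # one file per experiment
        anyDuplicated(dat$Index_File) == 0         # no repeated file accession
    )
    invisible(TRUE)
}

# Small stand-in for dat_metadata_final with only the columns that are checked
dat_toy = data.frame(
    Index_Experiment = c("ENCSR000AKX", "ENCSR000APE", "ENCSR000APC"),
    Index_File       = c("ENCFF909RKY", "ENCFF963GZJ", "ENCFF213OTI"),
    File_Type        = "bed narrowPeak",
    Genome           = "GRCh38",
    stringsAsFactors = FALSE
)
fun_check_metadata(dat_toy)
```

In the notebook itself the call would presumably be `fun_check_metadata(dat_metadata_final)`.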

## Prepare download files

In [10]:
### get file url
dat = dat_metadata_final
dat = dat %>% dplyr::select(File_Url)

### assign and show
dat_download_furl = dat
print(dim(dat))
fun_display_table(head(dat, 3))

[1] 19  1


File_Url
https://www.encodeproject.org/files/ENCFF909RKY/@@download/ENCFF909RKY.bed.gz
https://www.encodeproject.org/files/ENCFF963GZJ/@@download/ENCFF963GZJ.bed.gz
https://www.encodeproject.org/files/ENCFF213OTI/@@download/ENCFF213OTI.bed.gz


## Prepare download checksums

In [11]:
### get md5sum for each file
dat = dat_metadata_final
dat = dat %>% dplyr::select(md5sum, File_Name)

### assign and show
dat_download_md5sum = dat
print(dim(dat))
fun_display_table(head(dat, 3))

[1] 19  2


md5sum,File_Name
fe4c93b6faa0ab4153ef41fe516bced4,ENCFF909RKY.bed.gz
8e71cfdb547cdb5057d99d12d7be3ba6,ENCFF963GZJ.bed.gz
a5934e60f1ebf8e393aa111a26f74074,ENCFF213OTI.bed.gz


## Save results

In [12]:
### init
txt_folder = "encode_chipseq_histone"

### create folder for data
txt_fdiry = file.path(FD_DAT, "external", txt_folder)
dir.create(txt_fdiry, recursive = TRUE, showWarnings = FALSE)
print(txt_fdiry)

### create folder for results
txt_fdiry = file.path(FD_RES, "region", txt_folder, "summary")
dir.create(txt_fdiry, recursive = TRUE, showWarnings = FALSE)
print(txt_fdiry)

[1] "/mount/repo/Proj_ENCODE_FCC/data/external/encode_chipseq_histone"
[1] "/mount/repo/Proj_ENCODE_FCC/results/region/encode_chipseq_histone/summary"


In [13]:
### set directory
txt_fdiry = file.path(FD_DAT, "external", txt_folder)
txt_fname = "files.txt"
txt_fpath = file.path(txt_fdiry, txt_fname)

### write table
dat = dat_download_furl
write_tsv(dat, txt_fpath, col_names = FALSE)  
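
The notebook does not show how `files.txt` is consumed afterwards. Assuming the URLs are fetched one by one into the same folder, a minimal R sketch could look like the following. `fun_get_fpath` and `fun_download` are hypothetical helpers introduced here for illustration; the download function is defined but not called, so the example runs without network access.

```r
# Hypothetical helper: map each URL to a local path named after the remote file
fun_get_fpath = function(vec_url, txt_fdiry_out){
    file.path(txt_fdiry_out, basename(vec_url))
}

# Hypothetical helper: download each URL unless the target file already exists
fun_download = function(vec_url, txt_fdiry_out){
    vec_fpath = fun_get_fpath(vec_url, txt_fdiry_out)
    for (idx in seq_along(vec_url)){
        if (!file.exists(vec_fpath[idx])){
            # mode = "wb" writes in binary mode so gzip files are not altered
            download.file(vec_url[idx], vec_fpath[idx], mode = "wb", quiet = TRUE)
        }
    }
    invisible(vec_fpath)
}

# Two URLs taken from the table preview above; only the path mapping is run here
vec_url = c(
    "https://www.encodeproject.org/files/ENCFF909RKY/@@download/ENCFF909RKY.bed.gz",
    "https://www.encodeproject.org/files/ENCFF963GZJ/@@download/ENCFF963GZJ.bed.gz"
)
vec_fpath_toy = fun_get_fpath(vec_url, "data")
print(vec_fpath_toy)
```

Because `basename` of each URL equals the corresponding `File_Name`, files saved this way line up with the names recorded in the checksum table.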

In [14]:
### set directory
txt_fdiry = file.path(FD_DAT, "external", txt_folder)
txt_fname = "checksum_md5sum.txt"
txt_fpath = file.path(txt_fdiry, txt_fname)

### write table
dat = dat_download_md5sum
write_tsv(dat, txt_fpath, col_names = FALSE)  
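
Once the files are downloaded, the recorded checksums can be compared against the local copies. The sketch below does this in R with `tools::md5sum`; `fun_verify_md5sum` is a hypothetical helper, and the demonstration uses a temporary toy file rather than real ENCODE downloads.

```r
# Hypothetical helper: compare recorded md5sums with those of local files.
# Expects a table with the same two columns as dat_download_md5sum.
fun_verify_md5sum = function(dat, txt_fdiry_in){
    vec_fpath  = file.path(txt_fdiry_in, dat$File_Name)
    vec_md5sum = unname(tools::md5sum(vec_fpath))  # NA when a file cannot be read
    data.frame(
        File_Name = dat$File_Name,
        Pass      = !is.na(vec_md5sum) & vec_md5sum == dat$md5sum,
        stringsAsFactors = FALSE
    )
}

# Demonstration on a temporary file; file names and content are invented
txt_fdiry_tmp = tempdir()
txt_fpath_tmp = file.path(txt_fdiry_tmp, "toy_present.txt")
writeLines("toy content", txt_fpath_tmp)

dat_toy = data.frame(
    md5sum    = c(unname(tools::md5sum(txt_fpath_tmp)), "00000000000000000000000000000000"),
    File_Name = c("toy_present.txt", "toy_missing.txt"),
    stringsAsFactors = FALSE
)
dat_toy_check = fun_verify_md5sum(dat_toy, txt_fdiry_tmp)
print(dat_toy_check)
```

The first row passes because its checksum was computed from the file itself; the second fails because the file does not exist. Applied to the real data, the call would presumably take `dat_download_md5sum` and the `external/encode_chipseq_histone` folder.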

In [15]:
### set output path
txt_fdiry = file.path(FD_RES, "region", txt_folder, "summary")
txt_fname = "metadata.tsv"
txt_fpath = file.path(txt_fdiry, txt_fname)

### save table
dat = dat_metadata_final
write_tsv(dat, txt_fpath)  