**Set environment**

In [1]:
suppressMessages(suppressWarnings(source("../run_config_project_sing.R")))
show_env()

You are working on        Singularity 
BASE DIRECTORY (FD_BASE): /mount 
REPO DIRECTORY (FD_REPO): /mount/repo 
WORK DIRECTORY (FD_WORK): /mount/work 
DATA DIRECTORY (FD_DATA): /mount/data 

You are working with      ENCODE FCC 
PATH OF PROJECT (FD_PRJ): /mount/repo/Proj_ENCODE_FCC 
PROJECT RESULTS (FD_RES): /mount/repo/Proj_ENCODE_FCC/results 
PROJECT SCRIPTS (FD_EXE): /mount/repo/Proj_ENCODE_FCC/scripts 
PROJECT DATA    (FD_DAT): /mount/repo/Proj_ENCODE_FCC/data 
PROJECT NOTE    (FD_NBK): /mount/repo/Proj_ENCODE_FCC/notebooks 
PROJECT DOCS    (FD_DOC): /mount/repo/Proj_ENCODE_FCC/docs 
PROJECT LOG     (FD_LOG): /mount/repo/Proj_ENCODE_FCC/log 
PROJECT APP     (FD_APP): /mount/repo/Proj_ENCODE_FCC/app 
PROJECT REF     (FD_REF): /mount/repo/Proj_ENCODE_FCC/references 



**Set global variables**

In [2]:
TXT_FOLDER_REGION = "encode_chipseq_agarwal2023"

## Import data

**Read Agarwal2023 TableS5**

In [3]:
### set directory
txt_folder = TXT_FOLDER_REGION
txt_fdiry  = file.path(FD_REF, txt_folder)
txt_fname  = "Agarwal2023_TableS5.tsv"
txt_fpath  = file.path(txt_fdiry, txt_fname)

### read table
dat = read_tsv(txt_fpath, show_col_types = FALSE)
dat = dat %>% dplyr::filter(`Biosample term name` == "K562")

### assign and show
dat_meta_k562_agarwal2023 = dat
print(dim(dat))
fun_display_table(head(dat, 1))

[1] 1206   13


File accession,File type,Output type,File assembly,Experiment accession,Assay,Biosample term name,Experiment target,Biological replicate(s),File download URL,File Status,s3_uri,File analysis status
ENCFF254FCX,bigWig,fold change over control,GRCh38,ENCSR440VKE,TF ChIP-seq,K562,ADNP-human,1,https://www.encodeproject.org/files/ENCFF254FCX/@@download/ENCFF254FCX.bigWig,released,s3://encode-public/2020/12/04/ba7cfb89-7802-41db-aada-bef45e7e97a7/ENCFF254FCX.bigWig,released


**Read metadata from ENCODE portal**

In [4]:
### set directory
txt_folder = "encode_chipseq_latest"
txt_fdiry  = file.path(FD_REF, txt_folder)
txt_fname  = "metadata_240620.tsv"
txt_fpath  = file.path(txt_fdiry, txt_fname)

### read table
dat = read_tsv(txt_fpath, show_col_types = FALSE)

### assign and show
dat_meta_k562_encode_latest = dat
print(dim(dat))
fun_display_table(head(dat, 1))

[1] 18598    59


File accession,File format,File type,File format type,Output type,File assembly,Experiment accession,Assay,Donor(s),Biosample term id,Biosample term name,Biosample type,Biosample organism,Biosample treatments,Biosample treatments amount,Biosample treatments duration,Biosample genetic modifications methods,Biosample genetic modifications categories,Biosample genetic modifications targets,Biosample genetic modifications gene targets,Biosample genetic modifications site coordinates,Biosample genetic modifications zygosity,Experiment target,Library made from,Library depleted in,Library extraction method,Library lysis method,Library crosslinking method,Library strand specific,Experiment date released,Project,RBNS protein concentration,Library fragmentation method,Library size range,Biological replicate(s),Technical replicate(s),Read length,Mapped read length,Run type,Paired end,Paired with,Index of,Derived from,Size,Lab,md5sum,dbxrefs,File download URL,Genome annotation,Platform,Controlled by,File Status,s3_uri,Azure URL,File analysis title,File analysis status,Audit WARNING,Audit NOT_COMPLIANT,Audit ERROR
ENCFF076LYF,bigWig,bigWig,,fold change over control,GRCh38,ENCSR800KMQ,TF ChIP-seq,/human-donors/ENCDO000AAD/,EFO:0002067,K562,cell line,Homo sapiens,,,,CRISPR,insertion,"{'schema_version': '14', 'aliases': [], 'organism': {'schema_version': '6', '@type': ['Organism', 'Item'], 'name': 'human', 'taxon_id': '9606', 'scientific_name': 'Homo sapiens', '@id': '/organisms/human/', 'uuid': '7745b647-ff15-4ff3-9ced-b897d4e2983c', 'status': 'released'}, 'genes': ['/genes/23099/'], '@type': ['Target', 'Item'], 'name': 'ZBTB43-human', 'label': 'ZBTB43', '@id': '/targets/ZBTB43-human/', 'title': 'ZBTB43 (Homo sapiens)', 'investigated_as': ['transcription factor'], 'uuid': 'c4fd1273-5058-4d6b-90b2-4fa94d61d714', 'status': 'released'}",,,,ZBTB43-human,DNA,,,,formaldehyde,,2023-01-10,ENCODE,,sonication (generic),,1,1_1,,,,,,,"/files/ENCFF441EKA/, /files/ENCFF837JNH/",508674667,ENCODE Processing Pipeline,128444a048c5894ea1fdffda07a00a16,,https://www.encodeproject.org/files/ENCFF076LYF/@@download/ENCFF076LYF.bigWig,,,,released,s3://encode-public/2022/07/23/7fafea14-e96a-47d6-8945-706b0f1e0800/ENCFF076LYF.bigWig,https://datasetencode.blob.core.windows.net/dataset/2022/07/23/7fafea14-e96a-47d6-8945-706b0f1e0800/ENCFF076LYF.bigWig?sv=2019-10-10&si=prod&sr=c&sig=9qSQZo4ggrCNpybBExU8SypuUZV33igI11xw0P7rB3c%3D,ENCODE4 v1.8.0 GRCh38,released,"mild to moderate bottlenecking, missing genetic modification reagents",,


## Explore

In [5]:
dat = dat_meta_k562_agarwal2023
table(dat$`Output type`, dat$Assay)

                              
                               ATAC-seq DNase-seq Histone ChIP-seq TF ChIP-seq
  fold change over control            7         0               13        1181
  read-depth normalized signal        0         5                0           0

In [6]:
dat = dat_meta_k562_agarwal2023
table(dat$`File type`, dat$Assay)

        
         ATAC-seq DNase-seq Histone ChIP-seq TF ChIP-seq
  bigWig        7         5               13        1181

In [7]:
dat = dat_meta_k562_encode_latest
table(dat$`File type`, dat$Assay)

        
         ATAC-seq DNase-seq Histone ChIP-seq TF ChIP-seq
  bed          38        27              266        7719
  bigWig       26        19              414       10089

In [8]:
dat = dat_meta_k562_encode_latest
table(dat$`Output type`, dat$Assay)

                                        
                                         ATAC-seq DNase-seq Histone ChIP-seq
  base overlap signal                           0         3                0
  conservative IDR thresholded peaks            4         0                0
  fold change over control                     13         0              197
  IDR thresholded peaks                        15         0                0
  optimal IDR thresholded peaks                 0         0                0
  peaks                                         0        27              120
  peaks and background as input for IDR         0         0                0
  pseudoreplicated IDR thresholded peaks        0         0                0
  pseudoreplicated peaks                       13         0               77
  raw signal                                    0         0                4
  read-depth normalized signal                  0        13                0
  replicated peaks                 

In [9]:
dat = dat_meta_k562_encode_latest
vec = dat$`File accession`

dat = dat_meta_k562_agarwal2023
dat = dat %>% dplyr::select(`File accession`, Assay)

dat = dat %>% dplyr::mutate(Match = `File accession` %in% vec)
table(dat$Assay, dat$Match)

                  
                   FALSE TRUE
  ATAC-seq             0    7
  DNase-seq            0    5
  Histone ChIP-seq     0   13
  TF ChIP-seq          2 1179

In [10]:
dat %>% dplyr::filter(!Match)

File accession,Assay,Match
<chr>,<chr>,<lgl>
ENCFF576BIF,TF ChIP-seq,False
ENCFF543ZZV,TF ChIP-seq,False


**Both `ENCFF576BIF` and `ENCFF543ZZV` belong to the experiment `ENCSR000EGP`, which has been revoked.**
- [ENCFF543ZZV – ENCODE](https://www.encodeproject.org/files/ENCFF543ZZV/)
- [ENCFF576BIF – ENCODE](https://www.encodeproject.org/files/ENCFF576BIF/)
- [ENCSR000EGP – ENCODE](https://www.encodeproject.org/experiments/ENCSR000EGP/)
    - Status: revoked
    - Assay: ChIP-seq (TF ChIP-seq)
    - Target: ZNF143
    - Biosample summary: Homo sapiens K562
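
The unmatched accessions can also be listed with an anti-join, which keeps the other columns available for inspection. Below is a minimal sketch on toy data, assuming `dplyr` is installed; the tibbles `tbl_paper` and `tbl_portal` are hypothetical stand-ins for the Agarwal2023 table and the ENCODE portal metadata, not objects defined in this notebook.

```r
library(dplyr)

# Hypothetical stand-in for the Agarwal2023 table: three accessions
tbl_paper = tibble::tibble(
    `File accession` = c("ENCFF254FCX", "ENCFF576BIF", "ENCFF543ZZV"),
    Assay            = "TF ChIP-seq"
)

# Hypothetical stand-in for the portal metadata: only one accession overlaps
tbl_portal = tibble::tibble(
    `File accession` = c("ENCFF254FCX", "ENCFF076LYF")
)

# anti_join keeps the rows of tbl_paper whose accession is absent from tbl_portal
tbl_missing = dplyr::anti_join(tbl_paper, tbl_portal, by = "File accession")
print(tbl_missing)
```

On the toy data this returns the two rows that have no counterpart in `tbl_portal`, in the order they appear in `tbl_paper`.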

## Filter and arrange

**Filter and arrange table**

In [12]:
### init: get file accession numbers
dat = dat_meta_k562_encode_latest
vec_txt_index = dat$`File accession`

### filter
dat = dat_meta_k562_agarwal2023
dat = dat %>% dplyr::filter(`File accession` %in% vec_txt_index)
cat("Before filter:", nrow(dat_meta_k562_agarwal2023), "\n")

### arrange
dat = dat %>% 
    dplyr::mutate(
        Biosample        = `Biosample term name`,
        Index_Experiment = `Experiment accession`,
        Index_File       = `File accession`,
        File_Type        = `File type`,
        Output_Type      = `Output type`,
        Genome           = `File assembly`,
        Target           = str_remove(`Experiment target`, "-human"),
        Replicate        = `Biological replicate(s)`,
        FName            = basename(`File download URL`),
        FUrl             = `File download URL`
    ) %>%
    dplyr::select(
        Assay, 
        Index_Experiment, 
        Index_File,
        File_Type,
        Output_Type,
        Genome,
        Target,
        Replicate,
        FName,
        FUrl)

### assign and show
dat_meta_results = dat
cat("After filter: ", nrow(dat), "\n")
fun_display_table(head(dat, 3))

Before filter: 1206 
After filter:  1204 


Assay,Index_Experiment,Index_File,File_Type,Output_Type,Genome,Target,Replicate,FName,FUrl
TF ChIP-seq,ENCSR440VKE,ENCFF254FCX,bigWig,fold change over control,GRCh38,ADNP,1,ENCFF254FCX.bigWig,https://www.encodeproject.org/files/ENCFF254FCX/@@download/ENCFF254FCX.bigWig
TF ChIP-seq,ENCSR440VKE,ENCFF580YKQ,bigWig,fold change over control,GRCh38,ADNP,2,ENCFF580YKQ.bigWig,https://www.encodeproject.org/files/ENCFF580YKQ/@@download/ENCFF580YKQ.bigWig
TF ChIP-seq,ENCSR641BSL,ENCFF057BCB,bigWig,fold change over control,GRCh38,AGO1,2,ENCFF057BCB.bigWig,https://www.encodeproject.org/files/ENCFF057BCB/@@download/ENCFF057BCB.bigWig


**Get File URL**

In [16]:
dat = dat_meta_results
dat = dat %>% dplyr::select(FUrl)

dat_meta_furl = dat
print(dim(dat))
head(dat)

[1] 1204    1


FUrl
<chr>
https://www.encodeproject.org/files/ENCFF254FCX/@@download/ENCFF254FCX.bigWig
https://www.encodeproject.org/files/ENCFF580YKQ/@@download/ENCFF580YKQ.bigWig
https://www.encodeproject.org/files/ENCFF057BCB/@@download/ENCFF057BCB.bigWig
https://www.encodeproject.org/files/ENCFF593VAW/@@download/ENCFF593VAW.bigWig
https://www.encodeproject.org/files/ENCFF020OLK/@@download/ENCFF020OLK.bigWig
https://www.encodeproject.org/files/ENCFF521JBC/@@download/ENCFF521JBC.bigWig


**Get File Checksum**

In [17]:
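### init: get file names of the filtered files
dat = dat_meta_results
vec = dat$FName

### look up the md5sum of each file in the latest ENCODE metadata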
dat = dat_meta_k562_encode_latest
dat = dat %>% 
    dplyr::mutate(FName = basename(`File download URL`)) %>%
    dplyr::filter(FName %in% vec) %>%
    dplyr::select(md5sum, FName)

### assign and show
dat_meta_md5sum = dat
print(dim(dat))
head(dat)

[1] 1204    2


md5sum,FName
<chr>,<chr>
2a8eebea81e867a124a4d76d142c69e6,ENCFF403XOV.bigWig
7663da654f6de358e935f977633446ca,ENCFF216UCG.bigWig
5b270babaf701dd7ca2b615395e35198,ENCFF100VDV.bigWig
ce9fc55ad9517fd2b40625e5a01d3983,ENCFF961LTP.bigWig
f29a0fc38384d20345b512536a4f9b64,ENCFF702BPM.bigWig
04a1dbe8b21ecb46fd41d5a64b1b07aa,ENCFF738KPN.bigWig
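
Once the files are downloaded, the recorded checksums can be compared with the files on disk. Below is a minimal sketch using `tools::md5sum` from base R. The temporary file and the `tbl_expect` table are hypothetical stand-ins for a downloaded bigWig and for `dat_meta_md5sum`; the real download directory is not assumed here.

```r
# Hypothetical stand-in for a downloaded file: five bytes, no trailing newline
dir_tmp   = tempfile("md5_demo_")
dir.create(dir_tmp)
txt_fpath = file.path(dir_tmp, "example.bigWig")
writeBin(charToRaw("hello"), txt_fpath)

# Hypothetical stand-in for dat_meta_md5sum (expected md5sum, file name)
tbl_expect = data.frame(
    md5sum = "5d41402abc4b2a76b9719d911017c592",
    FName  = "example.bigWig",
    stringsAsFactors = FALSE
)

# Compute the observed checksum of each listed file and compare with the expected one
vec_observed     = unname(tools::md5sum(file.path(dir_tmp, tbl_expect$FName)))
tbl_expect$Match = vec_observed == tbl_expect$md5sum
print(tbl_expect)
```

A `FALSE` or `NA` in the `Match` column would flag a corrupted or missing file that should be downloaded again.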


## Save results

In [18]:
### set directory
txt_fdiry = file.path(FD_DAT, "external", TXT_FOLDER_REGION)
txt_fname = "files.txt"
txt_fpath = file.path(txt_fdiry, txt_fname)

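### create the output directory (including parent directories) if missing
dir.create(txt_fdiry, showWarnings = FALSE, recursive = TRUE)

### write file URLs, one per line, without header
dat = dat_meta_furl
write_delim(dat, txt_fpath, col_names = FALSE)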

In [20]:
### set directory
txt_fdiry = file.path(FD_DAT, "external", TXT_FOLDER_REGION)
txt_fname = "checksum_md5sum.txt"
txt_fpath = file.path(txt_fdiry, txt_fname)

### write checksum table (md5sum, file name), space-delimited without header
dat = dat_meta_md5sum
write_delim(dat, txt_fpath, delim = " ", col_names = FALSE)

In [21]:
### set directory
txt_fdiry = file.path(FD_DAT, "external", TXT_FOLDER_REGION, "summary")
txt_fname = "metadata.tsv"
txt_fpath = file.path(txt_fdiry, txt_fname)

### create the output directory (including parent directories) if missing
dir.create(txt_fdiry, showWarnings = FALSE, recursive = TRUE)

### write metadata table, tab-delimited to match the .tsv extension, without header
dat = dat_meta_results
write_delim(dat, txt_fpath, delim = "\t", col_names = FALSE)