## Set environment

In [1]:
suppressMessages(suppressWarnings(source("../config/config_sing.R")))
show_env()

You are in Singularity: singularity_proj_encode_fcc 
BASE DIRECTORY (FD_BASE): /data/reddylab/Kuei 
WORK DIRECTORY (FD_WORK): /data/reddylab/Kuei/out 
CODE DIRECTORY (FD_CODE): /data/reddylab/Kuei/code 
PATH OF PROJECT (FD_PRJ): /data/reddylab/Kuei/code/Proj_CombEffect_ENCODE_FCC 
PATH OF RESULTS (FD_RES): /data/reddylab/Kuei/out/proj_combeffect_encode_fcc 
PATH OF LOG     (FD_LOG): /data/reddylab/Kuei/out/proj_combeffect_encode_fcc/log 


In [2]:
fdiry = file.path(
    FD_RES, 
    "results", 
    "region", 
    "annotation_chipseq_tf_subset")
fnames = dir(fdiry)
fnames

## Import data

In [3]:
fdiry = file.path(
    FD_RES, 
    "results", 
    "region", 
    "annotation_chipseq_tf_subset")
fname = "K562.ENCSR000EGM.ENCFF660GHM.CTCF.bed.gz"
fpath = file.path(fdiry, fname)

dat = read_tsv(fpath, col_names = FALSE, show_col_types = FALSE)

dat_chipseq_tf_ctcf = dat
print(dim(dat))
head(dat)

[1] 58684    10


X1,X2,X3,X4,X5,X6,X7,X8,X9,X10
<chr>,<dbl>,<dbl>,<chr>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
chr1,16127,16367,.,776,.,24.18824,-1,3.52629,120
chr1,267886,268126,.,1000,.,78.91004,-1,4.97158,120
chr1,586068,586308,.,1000,.,52.4994,-1,4.97158,120
chr1,778768,778927,.,1000,.,167.7558,-1,4.97158,136
chr1,858026,858266,.,1000,.,37.67029,-1,4.97158,120
chr1,869845,869994,.,1000,.,194.36523,-1,4.97158,76


In [4]:
summary(dat$X8)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     -1      -1      -1      -1      -1      -1 
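
The summary shows that `X8` (the p-value column in narrowPeak order) is -1 for every peak, which is the format's placeholder for "no value assigned". As a minimal sketch, assuming one would rather carry such placeholders as `NA`, the following uses a toy data frame copied from the first rows of the output above. The helper name `sentinel_to_na` is hypothetical and only base R is used.

```r
# Toy data: three rows copied from the head(dat) output above (columns X7 to X10 only).
toy <- data.frame(
    X7  = c(24.18824, 78.91004, 52.4994),
    X8  = c(-1, -1, -1),
    X9  = c(3.52629, 4.97158, 4.97158),
    X10 = c(120, 120, 120)
)

# Hypothetical helper: replace the -1 placeholder with NA in one numeric vector.
sentinel_to_na <- function(x, sentinel = -1) {
    x[!is.na(x) & x == sentinel] <- NA
    x
}

# Columns in which every value equals the placeholder carry no information.
all_sentinel <- vapply(toy, function(x) all(x == -1), logical(1))
print(names(toy)[all_sentinel])

# Convert only the column that is known to use the placeholder.
toy$X8 <- sentinel_to_na(toy$X8)
print(toy)
```

Converting to `NA` is a design choice: it keeps later summaries from silently averaging the placeholder, at the cost of no longer matching the file's on-disk values.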

## Define column names

[ENCODE-DCC/chip-seq-pipeline: ENCODE Uniform processing pipeline for ChIP-seq](https://github.com/ENCODE-DCC/chip-seq-pipeline)

A narrowPeak (.narrowPeak) file is used by the ENCODE project to provide called peaks of signal enrichment based on pooled, normalized (interpreted) data. It is a BED 6+4 format: the six standard BED columns followed by four peak-specific columns. See the UCSC website for more detail on this format.

[IGV | narrowPeak](https://software.broadinstitute.org/software/igv/node/270)

[UCSC | ENCODE narrowPeak: Narrow (or Point-Source) Peaks format](http://genome.ucsc.edu/FAQ/FAQformat.html#format12)

- chrom string
    - Name of the chromosome for common peaks
- chromStart int
    - The starting position of the feature in the chromosome or scaffold for common peaks, shifted based on offset. The first base in a chromosome is numbered 0.
- chromEnd int
    - The ending position of the feature in the chromosome or scaffold for common peaks. 
    - The chromEnd base is not included in the display of the feature.
- name string
    - Name given to a region (preferably unique) for common peaks. 
    - Use '.' if no name is assigned.
- score int
    - Contains the scaled IDR value, min(int(-125 * log2(IDR)), 1000). 
    - e.g. peaks with an 
        - IDR of 0 have a score of 1000, 
        - IDR of 0.05 have a score of int(-125 * log2(0.05)) = 540, and 
        - IDR of 1.0 have a score of 0.
- strand [+-.] 
    - Use '.' if no strand is assigned.
- signalValue float
    - Measurement of enrichment for the region for merged peaks. 
    - When a peak list is provided this is the value from the peak list.
- p-value float
    - Merged peak p-value. 
    - When a peak list is provided this is the value from the peak list.
- q-value float
    - Merged peak q-value. When a peak list is provided this is the value from the peak list.
- peak int
    - Point-source called for this peak; 0-based offset from chromStart. Use -1 if no point-source called.
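
The score definition in the list can be checked numerically. The sketch below assumes the formula is min(int(-125 * log2(IDR)), 1000), which is the reading consistent with the worked example int(-125log2(0.05)) = 540. The function name `idr_to_score` is hypothetical and is not part of the ENCODE pipeline.

```r
# Scaled IDR score, assuming the formula min(int(-125 * log2(IDR)), 1000).
# Capping before truncating keeps IDR = 0 (where -log2 is Inf) well defined.
idr_to_score <- function(idr) {
    scaled <- -125 * log2(idr)
    as.integer(pmin(scaled, 1000))
}

# The three cases listed above: IDR of 0, 0.05, and 1.
print(idr_to_score(c(0, 0.05, 1)))
```

For these inputs the function returns 1000, 540, and 0, matching the three cases in the list.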

In [5]:
dat_cnames = tribble(
    ~Name,         ~Description,
    "Chrom",       "Chromosome",
    "Start",       "Start position",
    "End",         "End position",
    "Name",        "Name given to a region",
    "Score",       "Scaled IDR value, min(int(-125 * log2(IDR)), 1000).",
    "Strand",      "[+-.]; Use '.' if no strand is assigned.",
    "SignalValue", "Measurement of enrichment for the region for merged peaks.",
    "PValue",      "Measurement of statistical significance (-log10). Use -1 if no pValue is assigned.",
    "QValue",      "Measurement of statistical significance using false discovery rate (-log10). Use -1 if no qValue is assigned.",
    "Peak",        "Point-source called for this peak; 0-based offset from chromStart. Use -1 if no point-source called."
)

dat_cnames

Name,Description
<chr>,<chr>
Chrom,Chromosome
Start,Start position
End,End position
Name,Name given to a region
Score,"Scaled IDR value, min(int(-125 * log2(IDR)), 1000)."
Strand,[+-.]; Use '.' if no strand is assigned.
SignalValue,Measurement of enrichment for the region for merged peaks.
PValue,Measurement of statistical significance (-log10). Use -1 if no pValue is assigned.
QValue,Measurement of statistical significance using false discovery rate (-log10). Use -1 if no qValue is assigned.
Peak,Point-source called for this peak; 0-based offset from chromStart. Use -1 if no point-source called.
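
The table of names can be used to label the imported peaks. The sketch below is an illustration, not a step taken in this notebook: it applies the ten names to a toy stand-in for `dat` (three rows copied from the `head(dat)` output) and derives the summit coordinate from the `Peak` offset. The `Summit` column is a hypothetical addition, and base R is used so the block runs without the project config.

```r
# Toy stand-in for `dat`: three rows copied from the head(dat) output above.
toy <- data.frame(
    X1  = c("chr1", "chr1", "chr1"),
    X2  = c(16127, 267886, 778768),
    X3  = c(16367, 268126, 778927),
    X4  = ".",
    X5  = c(776, 1000, 1000),
    X6  = ".",
    X7  = c(24.18824, 78.91004, 167.7558),
    X8  = -1,
    X9  = c(3.52629, 4.97158, 4.97158),
    X10 = c(120, 120, 136)
)

# Same names, in the same order, as the Name column of dat_cnames.
cnames <- c("Chrom", "Start", "End", "Name", "Score", "Strand",
            "SignalValue", "PValue", "QValue", "Peak")

# Guard against a mismatch between the name list and the file layout.
stopifnot(length(cnames) == ncol(toy))
colnames(toy) <- cnames

# Peak is a 0-based offset from Start; -1 means no summit was called.
toy$Summit <- ifelse(toy$Peak >= 0, toy$Start + toy$Peak, NA)
print(toy[, c("Chrom", "Start", "End", "Peak", "Summit")])
```

With the full data the same assignment would be `colnames(dat) = dat_cnames$Name`, assuming the file keeps the standard narrowPeak column order.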


## Save results

In [6]:
fdiry = file.path(
    FD_RES, 
    "results", 
    "region", 
    "annotation_chipseq_tf_subset")
fname = "description.tsv"
fpath = file.path(fdiry, fname)

dat = dat_cnames
write_tsv(dat, fpath)

In [7]:
fdiry = file.path(
    FD_RES, 
    "results", 
    "region", 
    "annotation_chipseq_tf")
fname = "description.tsv"
fpath = file.path(fdiry, fname)

dat = dat_cnames
write_tsv(dat, fpath)
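
The last two cells differ only in the target folder. As a hedged sketch of the same step written as one loop, the block below writes a shortened stand-in for `dat_cnames` into both folders under a temporary directory and reads one copy back. The temporary root `fd_res` stands in for `FD_RES`, and base R `write.table` is used in place of `write_tsv` so the example runs on its own; both substitutions are assumptions made for illustration.

```r
# Stand-in for dat_cnames (first two rows only) and for FD_RES (a temp dir).
dat_cnames <- data.frame(
    Name        = c("Chrom", "Start"),
    Description = c("Chromosome", "Start position")
)
fd_res <- tempfile("fd_res_")

# The two target folders used in the cells above.
subdirs <- c("annotation_chipseq_tf_subset", "annotation_chipseq_tf")

fpaths <- character(0)
for (subdir in subdirs) {
    fdiry <- file.path(fd_res, "results", "region", subdir)
    dir.create(fdiry, recursive = TRUE, showWarnings = FALSE)
    fpath <- file.path(fdiry, "description.tsv")
    # Tab-separated, unquoted, no row names: the same layout as a plain TSV.
    write.table(dat_cnames, fpath, sep = "\t", quote = FALSE, row.names = FALSE)
    fpaths <- c(fpaths, fpath)
}

# Read one copy back to confirm the round trip.
chk <- read.delim(fpaths[1], stringsAsFactors = FALSE)
print(chk)
```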