This notebook prepares region tables for Ben and Gurkan. The tables to prepare are:

- ATAC regions with FCC z scores
- ATAC regions with covariates

**Set environment**

In [1]:
suppressMessages(suppressWarnings(source("../run_config_project_sing.R")))
show_env()

You are working on        Singularity: singularity_proj_encode_fcc 
BASE DIRECTORY (FD_BASE): /data/reddylab/Kuei 
REPO DIRECTORY (FD_REPO): /data/reddylab/Kuei/repo 
WORK DIRECTORY (FD_WORK): /data/reddylab/Kuei/work 
DATA DIRECTORY (FD_DATA): /data/reddylab/Kuei/data 

You are working with      ENCODE FCC 
PATH OF PROJECT (FD_PRJ): /data/reddylab/Kuei/repo/Proj_ENCODE_FCC 
PROJECT RESULTS (FD_RES): /data/reddylab/Kuei/repo/Proj_ENCODE_FCC/results 
PROJECT SCRIPTS (FD_EXE): /data/reddylab/Kuei/repo/Proj_ENCODE_FCC/scripts 
PROJECT DATA    (FD_DAT): /data/reddylab/Kuei/repo/Proj_ENCODE_FCC/data 
PROJECT NOTE    (FD_NBK): /data/reddylab/Kuei/repo/Proj_ENCODE_FCC/notebooks 
PROJECT DOCS    (FD_DOC): /data/reddylab/Kuei/repo/Proj_ENCODE_FCC/docs 
PROJECT LOG     (FD_LOG): /data/reddylab/Kuei/repo/Proj_ENCODE_FCC/log 
PROJECT REF     (FD_REF): /data/reddylab/Kuei/repo/Proj_ENCODE_FCC/references 



## Import other data

**Regions with TSS**

In [2]:
txt_fdiry = file.path(
    FD_RES, 
    "region_annotation", 
    "fcc_astarr_macs_input_overlap",
    "summary"
)
txt_fname = "matrix.annotation.genome_tss.tsv"
txt_fpath = file.path(txt_fdiry, txt_fname)

dat = read_tsv(txt_fpath, show_col_types = FALSE)

mat_region_annot_tss = dat
print(dim(dat))
fun_display_table(head(dat, 3))

[1] 9649    5


Chrom,ChromStart,ChromEnd,Region,TSS
chr1,28934,29499,chr1:28934-29499,1
chr1,826796,828040,chr1:826796-828040,1
chr1,876493,877795,chr1:876493-877795,1


**FCC coverage**

In [3]:
txt_fdiry = file.path(
    FD_RES, 
    "region_coverage_fcc", 
    "fcc_astarr_macs_input_overlap",
    "summary"
)
dir(txt_fdiry)

In [4]:
txt_fdiry = file.path(
    FD_RES, 
    "region_coverage_fcc", 
    "fcc_astarr_macs_input_overlap",
    "summary"
)
txt_fname = "result.coverage.zscore.share.tsv"
txt_fpath = file.path(txt_fdiry, txt_fname)

dat = read_tsv(txt_fpath, show_col_types = FALSE)

dat_region_score_fcc = dat
print(dim(dat))
fun_display_table(head(dat, 3))

[1] 432505      6


Chrom,ChromStart,ChromEnd,Region,Score,Assay
chr10,100729094,100729750,chr10:100729094-100729750,-0.3065107,CRISPRi-HCR FlowFISH
chr10,100743501,100744571,chr10:100743501-100744571,-0.2702473,CRISPRi-HCR FlowFISH
chr10,100745413,100745741,chr10:100745413-100745741,0.1130381,CRISPRi-HCR FlowFISH


## ATAC regions with covariates

In [5]:
txt_fdiry = file.path(
    FD_RES, 
    "region_integration",
    "fcc_astarr_macs_input_overlap",
    "analysis_enrichment_by_annotation_peak"
)
txt_fname = "region.prepare.covariate.tsv"
txt_fpath = file.path(txt_fdiry, txt_fname)

dat = read_tsv(txt_fpath, show_col_types = FALSE)

dat_region_covariate = dat
print(dim(dat))
fun_display_table(head(dat, 3))

[1] 150041      7


Chrom,ChromStart,ChromEnd,Region,ATAC,pGC,Length
chr1,10038,10405,chr1:10038-10405,0.5955004,0.523161,367
chr1,14282,14614,chr1:14282-14614,0.4535793,0.578313,332
chr1,16025,16338,chr1:16025-16338,0.5832908,0.587859,313


## ATAC regions with FCC labels

### Import data

In [6]:
txt_fdiry = file.path(
    FD_RES, 
    "region_annotation", 
    "fcc_astarr_macs_input_overlap",
    "summary"
)
txt_fname = "region.summary.fcc_peak_call.label.tsv"
txt_fpath = file.path(txt_fdiry, txt_fname)

dat = read_tsv(txt_fpath, show_col_types = FALSE)

dat_region_annot_fcc = dat
print(dim(dat))
fun_display_table(head(dat, 3))

[1] 155926     19


Chrom,ChromStart,ChromEnd,Region,Type,Num_Assay,TSS_Total,TSS_Essential,Label1,Label2,Label3,Screen_CRISPR_Total,Screen_CRISPR_Growth,Screen_CRISPR_HCRFF,Screen_CRISPR_E2G,Signif_CRISPR_Total,Signif_CRISPR_Growth,Signif_CRISPR_HCRFF,Signif_CRISPR_E2G
chr1,10038,10405,chr1:10038-10405,Repress,1,0,0,Silencer,Silencer,Silencer,0,0,0,0,0,0,0,0
chr1,10038,10405,chr1:10038-10405,Repress_GCFilter,1,0,0,Silencer,Silencer,Silencer,0,0,0,0,0,0,0,0
chr1,16025,16338,chr1:16025-16338,Repress,1,0,0,Silencer,Silencer,Silencer,0,0,0,0,0,0,0,0


## Arrange: Subset the regions

In [7]:
dat = dat_region_annot_fcc

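# Keep only the base Enhance/Repress calls (drops variants such as Repress_GCFilter)
# and require Num_Assay > 1
vec = c("Enhance", "Repress")
dat = dat %>% 
    dplyr::filter(Type %in% vec) %>%
    dplyr::filter(Num_Assay > 1)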

dat_region_annot_fcc_subset = dat
print(dim(dat))
fun_display_table(head(dat, 3))

[1] 18346    19


Chrom,ChromStart,ChromEnd,Region,Type,Num_Assay,TSS_Total,TSS_Essential,Label1,Label2,Label3,Screen_CRISPR_Total,Screen_CRISPR_Growth,Screen_CRISPR_HCRFF,Screen_CRISPR_E2G,Signif_CRISPR_Total,Signif_CRISPR_Growth,Signif_CRISPR_HCRFF,Signif_CRISPR_E2G
chr1,778233,779389,chr1:778233-779389,Enhance,3,0,0,Enhancer,Enhancer,Enhancer,1,1,0,0,0,0,0,0
chr1,958722,959968,chr1:958722-959968,Enhance,2,1,0,Enhancer,Promoter,TSS:Enhancer,1,1,0,0,0,0,0,0
chr1,960468,961615,chr1:960468-961615,Enhance,2,1,0,Enhancer,Promoter,TSS:Enhancer,1,1,0,0,0,0,0,0
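
The subsetting step can be illustrated on a toy table. This is a minimal base-R sketch with made-up rows, not the project data; it assumes `Num_Assay` counts the assays supporting a call, which the notebook does not state explicitly.

```r
# Toy table mimicking the Type / Num_Assay columns (all values are made up)
toy = data.frame(
    Region    = c("r1", "r1", "r2", "r3", "r4"),
    Type      = c("Repress", "Repress_GCFilter", "Enhance", "Enhance", "Repress"),
    Num_Assay = c(2, 2, 3, 1, 2),
    stringsAsFactors = FALSE
)

vec = c("Enhance", "Repress")

# %in% compares whole strings, so "Repress_GCFilter" is dropped;
# Num_Assay > 1 additionally removes the single-assay row for r3
toy_subset = toy[toy$Type %in% vec & toy$Num_Assay > 1, ]
print(toy_subset)
```

Because `%in%` is an exact comparison, the `_GCFilter` rows seen in the imported table never reach the subset, which is why no extra string handling is needed in the filter.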


### Mapping the labels

In [8]:
fun_str_map = function(vec_txt_input){
    vec_txt_pattern = c("Enhancer", "Silencer", "TSS:Enhancer", "TSS:Silencer")
    vec_txt_replace = c("Distal:Active", "Distal:Repressive", "Proximal:Active", "Proximal:Repressive")
    vec_txt_output = fun_str_map_match(
        vec_txt_input,
        vec_txt_pattern,
        vec_txt_replace
    )
    return(vec_txt_output)
}

**Test mapping function**

In [9]:
dat = dat_region_annot_fcc_subset
dat = dat %>% dplyr::mutate(Group = fun_str_map(Label3)) 

print(table(dat$Label3, dat$Group))

              
               Distal:Active Distal:Repressive Proximal:Active
  Enhancer             11623                 0               0
  Silencer                 0              1640               0
  TSS:Enhancer             0                 0            4974
  TSS:Silencer             0                 0               0
              
               Proximal:Repressive
  Enhancer                       0
  Silencer                       0
  TSS:Enhancer                   0
  TSS:Silencer                 109
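
The helper `fun_str_map_match()` comes from the project configuration and is not shown in this notebook. The cross-table above suggests it maps whole strings one-to-one, since `Enhancer` is never confused with `TSS:Enhancer`. A minimal sketch of such a helper is given below; the name `fun_str_map_match_sketch` and the choice to pass unmatched inputs through unchanged are assumptions, not the project's implementation.

```r
# Hypothetical stand-in for fun_str_map_match(); the real helper is not shown here
fun_str_map_match_sketch = function(vec_txt_input, vec_txt_pattern, vec_txt_replace){
    stopifnot(length(vec_txt_pattern) == length(vec_txt_replace))
    # match() compares whole strings, so "Enhancer" never matches inside "TSS:Enhancer"
    idx = match(vec_txt_input, vec_txt_pattern)
    vec_txt_output = vec_txt_replace[idx]
    # Assumption: inputs without a matching pattern are returned unchanged
    vec_txt_output[is.na(idx)] = vec_txt_input[is.na(idx)]
    return(vec_txt_output)
}

vec_txt_pattern = c("Enhancer", "Silencer", "TSS:Enhancer", "TSS:Silencer")
vec_txt_replace = c("Distal:Active", "Distal:Repressive", "Proximal:Active", "Proximal:Repressive")

# "Promoter" has no pattern and is passed through as-is
vec_txt_label = c("Enhancer", "TSS:Enhancer", "Silencer", "TSS:Silencer", "Promoter")
vec_txt_group = fun_str_map_match_sketch(vec_txt_label, vec_txt_pattern, vec_txt_replace)
print(vec_txt_group)
```

A substring-based replacement (for example with `gsub`) would be riskier here, because the distal patterns are substrings of the proximal ones.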


In [10]:
dat = dat_region_annot_fcc_subset
dat = dat %>% 
    dplyr::mutate(Group = fun_str_map(Label3)) %>%
    #dplyr::mutate(Group = Label3) %>%
    dplyr::select(Chrom:Region, Group)

dat_region_focal = dat
print(dim(dat))
print(table(dat$Group))
head(dat)

[1] 18346     5

      Distal:Active   Distal:Repressive     Proximal:Active Proximal:Repressive 
              11623                1640                4974                 109 


Chrom,ChromStart,ChromEnd,Region,Group
<chr>,<dbl>,<dbl>,<chr>,<chr>
chr1,778233,779389,chr1:778233-779389,Distal:Active
chr1,958722,959968,chr1:958722-959968,Proximal:Active
chr1,960468,961615,chr1:960468-961615,Proximal:Active
chr1,1005094,1005553,chr1:1005094-1005553,Distal:Active
chr1,1013154,1014482,chr1:1013154-1014482,Proximal:Active
chr1,1059012,1060137,chr1:1059012-1060137,Distal:Active


## Define pool regions

In [11]:
dat = dat_region_annot_fcc_subset
vec = unique(dat$Region)

vec_txt_region_fcc = vec

In [12]:
dat = dat_region_covariate
vec = unique(dat$Region)

vec_txt_region_tot = vec

In [13]:
dat = mat_region_annot_tss
vec = unique(dat$Region)

vec_txt_region_tss = vec

In [14]:
# Background regions: all regions in the covariate table that are not FCC focal regions
vec = setdiff(vec_txt_region_tot, vec_txt_region_fcc)

vec_txt_region_back = vec

In [15]:
dat = dat_region_covariate
dat = dat %>% 
    dplyr::filter(Region %in% vec_txt_region_back) %>%
    dplyr::mutate(
        Group = ifelse(
            Region %in% vec_txt_region_tss, 
            "Proximal:Inactive", 
            "Distal:Inactive"
        )
    ) %>%
    dplyr::select(Chrom:Region, Group)

dat_region_pool = dat
print(dim(dat))
head(dat)

[1] 131700      5


Chrom,ChromStart,ChromEnd,Region,Group
<chr>,<dbl>,<dbl>,<chr>,<chr>
chr1,10038,10405,chr1:10038-10405,Distal:Inactive
chr1,14282,14614,chr1:14282-14614,Distal:Inactive
chr1,16025,16338,chr1:16025-16338,Distal:Inactive
chr1,17288,17689,chr1:17288-17689,Distal:Inactive
chr1,28934,29499,chr1:28934-29499,Proximal:Inactive
chr1,115429,115969,chr1:115429-115969,Distal:Inactive
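
A quick accounting check is worth running at this point. The pool has 131700 rows out of 150041 covariate regions, so 18341 regions were removed, while the focal table has 18346 rows. The 5-row gap could come from repeated `Region` values in the focal table or from focal regions absent from the covariate table; the notebook does not show which. The base-R sketch below reproduces the pool definition on made-up identifiers and shows one way to separate the two cases.

```r
# Toy region sets (made-up identifiers, not the project data)
vec_tot = c("r1", "r2", "r3", "r4", "r5", "r6")  # all regions in the covariate table
vec_fcc = c("r2", "r2", "r5", "r9")              # focal rows: r2 repeated, r9 not in vec_tot
vec_tss = c("r1", "r5")                          # regions overlapping a TSS

# Background pool: total regions minus focal regions, split by TSS overlap
vec_back = setdiff(vec_tot, unique(vec_fcc))
grp_back = ifelse(vec_back %in% vec_tss, "Proximal:Inactive", "Distal:Inactive")
print(data.frame(Region = vec_back, Group = grp_back))

# Accounting: focal rows = removed regions + repeated rows + regions missing from the total
n_removed = length(vec_tot) - length(vec_back)
n_dup     = sum(duplicated(vec_fcc))
n_missing = length(setdiff(vec_fcc, vec_tot))
print(c(removed = n_removed, duplicated = n_dup, missing = n_missing))
```

Applied to the real vectors, the same three counts would show whether the focal and pool tables partition the covariate regions exactly.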


## Save results

In [16]:
txt_fdiry = file.path(
    FD_RES, 
    "region_integration",
    "fcc_astarr_macs_input_overlap",
    "analysis_enrichment_v2",
    "fcc_starrmpra_vote2_v2_split_pool_by_tss",
    "shared"
)
txt_fname = "region.coverage.fcc.zscore.tsv"
txt_fpath = file.path(txt_fdiry, txt_fname)

dat = dat_region_score_fcc
write_tsv(dat, txt_fpath)
fun_display_table(head(dat, 3))

Chrom,ChromStart,ChromEnd,Region,Score,Assay
chr10,100729094,100729750,chr10:100729094-100729750,-0.3065107,CRISPRi-HCR FlowFISH
chr10,100743501,100744571,chr10:100743501-100744571,-0.2702473,CRISPRi-HCR FlowFISH
chr10,100745413,100745741,chr10:100745413-100745741,0.1130381,CRISPRi-HCR FlowFISH


In [20]:
txt_fdiry = file.path(
    FD_RES, 
    "region_integration",
    "fcc_astarr_macs_input_overlap",
    "analysis_enrichment_v2",
    "fcc_starrmpra_vote2_v2_split_pool_by_tss",
    "shared"
)
txt_fname = "region.covariates.tsv"
txt_fpath = file.path(txt_fdiry, txt_fname)

dat = dat_region_covariate
write_tsv(dat, txt_fpath)
fun_display_table(head(dat, 3))

Chrom,ChromStart,ChromEnd,Region,ATAC,pGC,Length
chr1,10038,10405,chr1:10038-10405,0.5955004,0.523161,367
chr1,14282,14614,chr1:14282-14614,0.4535793,0.578313,332
chr1,16025,16338,chr1:16025-16338,0.5832908,0.587859,313


In [18]:
txt_fdiry = file.path(
    FD_RES, 
    "region_integration",
    "fcc_astarr_macs_input_overlap",
    "analysis_enrichment_v2",
    "fcc_starrmpra_vote2_v2_split_pool_by_tss",
    "shared"
)
txt_fname = "region.annotation.fcc.label.tsv"
txt_fpath = file.path(txt_fdiry, txt_fname)

dat = dat_region_focal
write_tsv(dat, txt_fpath)
fun_display_table(head(dat, 3))

Chrom,ChromStart,ChromEnd,Region,Group
chr1,778233,779389,chr1:778233-779389,Distal:Active
chr1,958722,959968,chr1:958722-959968,Proximal:Active
chr1,960468,961615,chr1:960468-961615,Proximal:Active


In [19]:
txt_fdiry = file.path(
    FD_RES, 
    "region_integration",
    "fcc_astarr_macs_input_overlap",
    "analysis_enrichment_v2",
    "fcc_starrmpra_vote2_v2_split_pool_by_tss",
    "shared"
)
txt_fname = "region.annotation.fcc.control.tsv"
txt_fpath = file.path(txt_fdiry, txt_fname)

dat = dat_region_pool
write_tsv(dat, txt_fpath)
fun_display_table(head(dat, 3))

Chrom,ChromStart,ChromEnd,Region,Group
chr1,10038,10405,chr1:10038-10405,Distal:Inactive
chr1,14282,14614,chr1:14282-14614,Distal:Inactive
chr1,16025,16338,chr1:16025-16338,Distal:Inactive
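
The four save cells above write to the same `shared` directory and differ only in file name and table. As a possible consolidation, the sketch below loops over a named list and creates the directory first, since writing fails when the target directory does not exist. It uses base R and `tempdir()` as a stand-in for the `FD_RES` path, with toy tables in place of the real ones; all of these are illustrative assumptions.

```r
# Hypothetical output directory; tempdir() stands in for file.path(FD_RES, ...)
txt_fdiry = file.path(tempdir(), "shared")
dir.create(txt_fdiry, recursive = TRUE, showWarnings = FALSE)

# Named list: file name -> table (toy tables, not the project data)
lst_dat = list(
    "region.covariates.tsv"           = data.frame(Region = c("r1", "r2"), pGC = c(0.52, 0.58)),
    "region.annotation.fcc.label.tsv" = data.frame(Region = "r2", Group = "Distal:Active")
)

# Write each table as a tab-separated file with a header row
for (txt_fname in names(lst_dat)) {
    txt_fpath = file.path(txt_fdiry, txt_fname)
    write.table(
        lst_dat[[txt_fname]], txt_fpath,
        sep = "\t", quote = FALSE, row.names = FALSE
    )
}
print(dir(txt_fdiry))
```

In the notebook itself, `write_tsv(dat, txt_fpath)` could replace the `write.table` call inside the loop.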
