### Summary:
In this notebook we process the ABC and Cicero links into a format that matches our other cRE-gene link outputs, and subset the SMORES links to CP (cRE-promoter) links only. The functions are designed to run automatically on outputs from many cell types. We also process both the significant and non-significant links files from ABC and Cicero into a format that downstream analyses can use to build background sets for enrichment.

## TO DO: clean this up and make sure dedicated sections for background prep are good
- I remove the sig ABC links from the AllPutative files before generating background files... but wouldn't this mean the background is incomplete? All the links in a selection should also be in the background, right?
- Pretty sure we do the same thing for Cicero non-sig too?
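
One way to check the concern above is to compare link IDs directly. The sketch below uses made-up toy links (not real outputs) and a hypothetical `link_ids` helper. It assumes links are identified as `CRE_gene`, with CRE coords in columns 1-3 and the gene name in column 7, as in the reformatted bedpe files.

```r
# Toy data only: hypothetical significant and non-significant links in the
# reformatted bedpe layout (CRE coords, gene coords, gene name, score)
sig_links <- data.frame(
    V1 = c('chr1', 'chr1'), V2 = c(100L, 900L), V3 = c(200L, 1000L),
    V4 = c('chr1', 'chr1'), V5 = c(5000L, 5000L), V6 = c(5001L, 5001L),
    V7 = c('GENE_A', 'GENE_A'), V8 = c(0.05, 0.03),
    stringsAsFactors = FALSE
)
nonsig_links <- data.frame(
    V1 = 'chr1', V2 = 300L, V3 = 400L,
    V4 = 'chr1', V5 = 5000L, V6 = 5001L,
    V7 = 'GENE_A', V8 = 0.001,
    stringsAsFactors = FALSE
)

# Hypothetical helper: build one ID per link as CREcoords_gene
link_ids <- function(df){
    paste(paste(df$V1, df$V2, df$V3, sep='-'), df$V7, sep='_')
}

# A background built from non-significant links only misses every selected link
missing_from_bg <- setdiff(link_ids(sig_links), link_ids(nonsig_links))
print(length(missing_from_bg))

# Adding the significant links back makes the selection a subset of the background
background <- unique(rbind(sig_links, nonsig_links))
all_in_bg <- all(link_ids(sig_links) %in% link_ids(background))
print(all_in_bg)
```

If the enrichment test expects the selection to be a subset of the background, the same `setdiff` check could be run on the real files before any comparison.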

# 1. Basic Preparation

In [1]:
# Import necessary libraries
suppressMessages(library(tidyverse))
suppressMessages(library(stringr))
suppressMessages(library(tictoc))
suppressMessages(library(parallel))

In [2]:
# Define celltypes list
celltypes <- c('beta','alpha','delta','gamma','ductal','acinar',
              'stellate','endothelial','schwann','immune')

h3k27ac_celltypes <- c('beta','alpha','delta','gamma','ductal','acinar')
no_h3k27ac_celltypes <- c('stellate','endothelial','schwann','immune')

### Define necessary reference files

In [3]:
# Read in the gene coords reference file
ref_df <- read.table('/nfs/lab/ABC/references/gene_coords.gencodev32.hg38.bed', sep='\t', header=FALSE) #read in gene coords ref

# Read in the TSS500bp version of the gene coords reference file
TSS_ref_fp <- '/nfs/lab/ABC/references/gene_coords.gencodev32.hg38.TSS500bp.bed'
TSS_ref_df <- read.table(TSS_ref_fp, sep='\t')

In [4]:
# All CREs (merged list)
cres_fp <- '/nfs/lab/projects/multiomic_islet/outputs/multiome/call_peaks/recluster_final_majorCTs_v2/mergedPeak.txt'

# Celltype specific CREs file path = ct_cres_prefix + celltype + ct_cres_suffix
ct_cres_prefix <- '/nfs/lab/projects/multiomic_islet/outputs/multiome/call_peaks/recluster_final_majorCTs_v2/'
ct_cres_suffix <- '.merged_peaks.anno.mergedOverlap.bed'

### Establish file naming practices for links files

In [5]:
# HM method links files = hm_prefix + celltype + '/' + celltype + hm_suffix
hm_prefix <- '/nfs/lab/projects/multiomic_islet/outputs/multiome/cRE-gene_links/230109_final_map_spearman/'
hm_suffix1 <- '_linked_ct_peaks_fdr0.1_corr.bedpe'
hm_suffix2 <- '_ALL_ct_peaks_corr_pvalues.bedpe'

# Outdir for CP-only SMORES links
hm_outdir <- '/nfs/lab/hmummey/multiomic_islet/intermediates/230228_SMORES_PP_investigation'

In [6]:
# Celltype specific ABC files = abc_prefix + celltype + abc_suffix
abc_prefix1 <- '/nfs/lab/projects/multiomic_islet/outputs/multiome/run_abc/230110_allCTs/outputs/230116_H3K27ac_CTs/'
abc_prefix2 <- '/nfs/lab/projects/multiomic_islet/outputs/multiome/run_abc/230110_allCTs/outputs/230116_no_H3K27ac_CTs/'
abc_suffix <- '/Prediction/EnhancerPredictions.hg38.mapped.bedpe'

# Outdir for reformatted ABC links
abc_outdir <- '/nfs/lab/projects/multiomic_islet/outputs/multiome/run_abc/reprocessed_outputs/230110_allCTs'

In [7]:
# Celltype specific Cicero files = cic_prefix + celltype + cic_suffix
cic_prefix <- '/nfs/lab/projects/multiomic_islet/outputs/multiome/cRE-gene_links/cicero/230111_final_map/Cicero_links.'
cic_suffix <- '.above0.05.dedup.bedpe'

# Outdir for reformatted Cicero links
cic_outdir <- '/nfs/lab/projects/multiomic_islet/outputs/multiome/cRE-gene_links/cicero/230111_final_map/reprocessed_outputs'

In [8]:
# Directory to write the 3 method combined background file to
overlap_outdir <- '/nfs/lab/projects/multiomic_islet/outputs/multiome/cRE-gene_links/method_overlaps/link_set_enrichment'
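
Since every input path above is built as prefix + celltype + suffix, it can help to confirm the files exist before starting the long loops. This is a minimal sketch with a hypothetical `check_inputs` helper, demonstrated on a temporary directory rather than the real lab paths.

```r
# Hypothetical helper: build per-celltype paths from a prefix and suffix and
# report which ones are missing, so long loops fail before they start
check_inputs <- function(celltypes, prefix, suffix){
    fps <- paste0(prefix, celltypes, suffix)
    data.frame(celltype = celltypes, fp = fps, exists = file.exists(fps),
               stringsAsFactors = FALSE)
}

# Toy demonstration in a temporary directory (only the 'beta' file is created)
tmp_prefix <- file.path(tempdir(), 'toy_links_')
invisible(file.create(paste0(tmp_prefix, 'beta', '.bed')))
status <- check_inputs(c('beta', 'alpha'), tmp_prefix, '.bed')
print(status[, c('celltype', 'exists')])
```

With the notebook's variables, a call such as `check_inputs(celltypes, cic_prefix, cic_suffix)` would follow the same pattern.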

# 2. Reformat ABC Links
Because ABC maps peaks to hg19, runs predictions, and then maps the cREs back to hg38, the cRE coords may no longer match our original peak calls, so we need to remap them to those peaks.
Considerations:
- The bedtools intersect command that overlaps ABC CREs with the peak calls list sometimes returns duplicate rows (i.e. one CRE is mapped to the same peak twice)
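
The duplicate and multi-peak overlaps described above can be resolved by keeping, for each ABC CRE, the peak with the largest overlap. The sketch below shows this on a made-up stand-in for `bedtools intersect -wo` output, assuming a 3-column `-b` file so that column 7 holds the overlap in bp.

```r
# Toy stand-in for `bedtools intersect -wo` output with a 3-column -b file:
# V1-V3 = ABC CRE, V4-V6 = overlapping peak, V7 = bp of overlap
overlap <- data.frame(
    V1 = c('chr1', 'chr1', 'chr1', 'chr2'),
    V2 = c(100L, 100L, 100L, 500L),
    V3 = c(400L, 400L, 400L, 800L),
    V4 = c('chr1', 'chr1', 'chr1', 'chr2'),
    V5 = c(90L, 350L, 90L, 450L),
    V6 = c(300L, 600L, 300L, 700L),
    V7 = c(200L, 50L, 200L, 200L),
    stringsAsFactors = FALSE
)
overlap$peak1 <- paste(overlap$V1, overlap$V2, overlap$V3, sep='-')
overlap$peak2 <- paste(overlap$V4, overlap$V5, overlap$V6, sep='-')

# Sort by overlap size (largest first), then keep the first row per ABC CRE.
# This drops exact repeats and picks the best peak when several overlap.
overlap_sorted <- overlap[order(overlap$V7, decreasing=TRUE), ]
peak_map <- overlap_sorted[!duplicated(overlap_sorted$peak1), c('peak1', 'peak2')]
print(peak_map)
```

Here the first CRE overlaps two different peaks and one of those overlaps is reported twice. Only the 200 bp overlap survives, so each ABC CRE maps to exactly one peak.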

## 2a. Reformat significant links

### Functions

In [40]:
### Function to overlap the ABC CREs with the peaks list
overlap_with_peaks <- function(celltype, abc_fp, abc_outdir){
    # Get the left side coordinates from each ABC link (unique and sorted)
    df <- read.table(abc_fp, sep='\t')
    sites <- df[,c(1,2,3)]
    chr_names <- c(paste("chr",seq(1:22),sep=''),'chrX','chrY')
    sites_cut <- sites[sites$V1 %in% chr_names,]
    all_sites_list = paste(sites_cut$V1,sites_cut$V2,sites_cut$V3,sep='_')
    all_sites_list <- sort(unique(all_sites_list))
    
    # Output a bed file of these coordinates 
    fin_df <- as.data.frame(str_split_fixed(all_sites_list, '_', n=3))
    out_fp <- file.path(abc_outdir, sprintf('%s_all_linked_sites.bed', celltype))
    write.table(fin_df, out_fp, sep='\t', row.names=FALSE, col.names=FALSE, quote=FALSE)
    
    # Overlap with celltype CREs list using bedtools in the terminal
    # Previously used -f but this loses some overlaps (if ABC peak is larger than mapping peak)
    abc_cres <- out_fp
    cres_fp <- paste0(ct_cres_prefix, celltype, ct_cres_suffix)
    overlap_fp <- file.path(abc_outdir, sprintf('%s_all_linked_sites.mergedPeaks.overlap.bed', celltype))
    cmd <- paste('bedtools intersect -a', abc_cres, '-b', cres_fp, '-wo >',overlap_fp, sep=' ')
    system(cmd)
}


### Function to map ABC cRE coords to overlapping peaks and output reformatted dataframe
map_to_overlap_peaks <- function(celltype, abc_fp, abc_outdir){
    # Read in the bedtools intersect output, cut out repeats, and get map
    overlap_fp <- file.path(abc_outdir, sprintf('%s_all_linked_sites.mergedPeaks.overlap.bed', celltype))
    abc_overlap <- read.table(overlap_fp, sep='\t')
    abc_overlap$peak1 <- paste(abc_overlap$V1, abc_overlap$V2, abc_overlap$V3, sep='-')
    abc_overlap$peak2 <- paste(abc_overlap$V4, abc_overlap$V5, abc_overlap$V6, sep='-')
    
    # Decide between peaks that map to multiple peaks (based on bp overlap)
    # Sort by the overlap # and then remove duplicate peak1
    abc_overlap_sort <- abc_overlap[order(abc_overlap$V7, decreasing = TRUE),]
    abc_overlap_cut <- abc_overlap_sort[!duplicated(abc_overlap_sort$peak1),c(8,9)]
    
    # Read in ABC output and map CRE coords
    abc_df <- read.table(abc_fp, sep='\t')
    abc_df$peak1 <- paste(abc_df$V1, abc_df$V2, abc_df$V3, sep="-")
    abc_df_overlap <- merge(abc_df, abc_overlap_cut, by='peak1')
    abc_nomap <- abc_df[!abc_df$peak1 %in% abc_overlap$peak1,seq(1,8)]

    # Print out statistics
    print(celltype)
    print(paste('Number of unique peaks before mapping: ',length(unique(abc_df$peak1)), sep=''))
    print(paste('Number of mapped links: ', dim(abc_df_overlap)[1], '/', dim(abc_df)[1], sep=''))
    print(paste('Number of unmapped links: ', dim(abc_nomap)[1], '/', dim(abc_df)[1], sep=''))
    print('')

    # Now make final output ABC df where the CRE coords cols are altered to be the mapped ones
    # Excludes any links for which the CRE did not map
    fin_abc_df <- cbind(str_split_fixed(abc_df_overlap$peak2, '-', 3), abc_df_overlap[,c(5,6,7,8,9)])
    mapped_abc_fp <- file.path(abc_outdir, sprintf('%s_mapped_links.bedpe',celltype))
    write.table(fin_abc_df, mapped_abc_fp, sep='\t', row.names=FALSE, col.names=FALSE, quote=FALSE)
}

### Run functions on all celltypes to reformat links

In [41]:
tic()
for (celltype in h3k27ac_celltypes){
    abc_fp <- paste0(abc_prefix1, celltype, abc_suffix)
    overlap_with_peaks(celltype, abc_fp, abc_outdir)
    map_to_overlap_peaks(celltype, abc_fp, abc_outdir)
}
toc()

[1] "beta"
[1] "Number of unique peaks before mapping: 13213"
[1] "Number of mapped links: 39306/39495"
[1] "Number of unmapped links: 189/39495"
[1] ""
[1] "alpha"
[1] "Number of unique peaks before mapping: 14026"
[1] "Number of mapped links: 40883/41043"
[1] "Number of unmapped links: 160/41043"
[1] ""
[1] "delta"
[1] "Number of unique peaks before mapping: 13439"
[1] "Number of mapped links: 40612/40774"
[1] "Number of unmapped links: 162/40774"
[1] ""
[1] "gamma"
[1] "Number of unique peaks before mapping: 13606"
[1] "Number of mapped links: 40763/40868"
[1] "Number of unmapped links: 105/40868"
[1] ""
[1] "ductal"
[1] "Number of unique peaks before mapping: 15859"
[1] "Number of mapped links: 56454/56518"
[1] "Number of unmapped links: 64/56518"
[1] ""
[1] "acinar"
[1] "Number of unique peaks before mapping: 14940"
[1] "Number of mapped links: 48776/48905"
[1] "Number of unmapped links: 129/48905"
[1] ""
8.987 sec elapsed


In [42]:
tic()
for (celltype in no_h3k27ac_celltypes){
    abc_fp <- paste0(abc_prefix2, celltype, abc_suffix)
    overlap_with_peaks(celltype, abc_fp, abc_outdir)
    map_to_overlap_peaks(celltype, abc_fp, abc_outdir)
}
toc()

[1] "stellate"
[1] "Number of unique peaks before mapping: 13641"
[1] "Number of mapped links: 39203/39226"
[1] "Number of unmapped links: 23/39226"
[1] ""
[1] "endothelial"
[1] "Number of unique peaks before mapping: 10270"
[1] "Number of mapped links: 39291/39315"
[1] "Number of unmapped links: 24/39315"
[1] ""
[1] "schwann"
[1] "Number of unique peaks before mapping: 8001"
[1] "Number of mapped links: 34169/34250"
[1] "Number of unmapped links: 81/34250"
[1] ""
[1] "immune"
[1] "Number of unique peaks before mapping: 11458"
[1] "Number of mapped links: 36135/36146"
[1] "Number of unmapped links: 11/36146"
[1] ""
4.651 sec elapsed


## 2b. Reformat non-significant links (huge files, this will take a while)
Steps:
1. Remove significant links from all links file -- tbh not necessary, but anything that reduces the df size here is valuable
2. Convert to bedpe 
    1. Map ABC CRE coords to peaks set (coords get garbled by repeated liftover)
    2. Get gene coords from ref file
    3. Reformat as a bedpe
3. Remove links above distance threshold (1Mb)
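
Because these files hold millions of rows, the TSS lookup and distance filter in the steps above can also be done on whole columns at once. This is a sketch on made-up genes and coordinates, not the notebook's implementation. It assumes the same reference layout as `gene_coords.gencodev32.hg38.bed` (start, end, gene name, strand in columns 2, 3, 4, 6) and one reference row per gene.

```r
# Toy gene reference: V2/V3 = start/end, V4 = gene name, V6 = strand
ref_toy <- data.frame(
    V1 = c('chr1', 'chr1'), V2 = c(1000, 5000000), V3 = c(3000, 5004000),
    V4 = c('GENE_A', 'GENE_B'), V5 = '.', V6 = c('+', '-'),
    stringsAsFactors = FALSE
)

# Toy putative links: CRE coords plus target gene (GENE_C is absent from the ref)
links <- data.frame(
    chr = 'chr1', start = c(2000, 2000, 2000), end = c(2500, 2500, 2500),
    TargetGene = c('GENE_A', 'GENE_B', 'GENE_C'),
    stringsAsFactors = FALSE
)

# TSS = start for '+' genes, end for '-' genes
ref_toy$tss <- ifelse(ref_toy$V6 == '-', pmax(ref_toy$V2, ref_toy$V3),
                      pmin(ref_toy$V2, ref_toy$V3))

# match() does one lookup for all links; genes missing from the ref get NA
links$tss <- ref_toy$tss[match(links$TargetGene, ref_toy$V4)]

# Distance from CRE center to TSS, computed on whole columns at once
cre_center <- links$start + (links$end - links$start) / 2
links$distance <- abs(cre_center - links$tss)

# which() drops NA distances rather than returning all-NA rows
dist_threshold <- 1000000
kept <- links[which(links$distance <= dist_threshold), ]
print(kept)
```

Only the GENE_A link passes: GENE_B is more than 1Mb away and GENE_C has no TSS in the toy reference.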

### Functions

In [36]:
### Function to extract a gene's TSS from the reference file
get_TSS <- function(gene, ref_df){
    if (gene %in% ref_df$V4){
        ref_df_cut = ref_df[ref_df$V4 ==gene,]
        # Use the first row's strand in case the gene is listed more than once
        if (ref_df_cut$V6[1] == '-'){
            tss = max(c(ref_df_cut$V2,ref_df_cut$V3))
        } else {
            tss = min((c(ref_df_cut$V2,ref_df_cut$V3)))
        }
        return(tss)
    } else {
        return(NA)
    }
}


### Function to calculate link distances from a bedpe style dataframe row
calc_link_distance <- function(link_df_row){
    CRE_start <- as.integer(link_df_row[2])
    CRE_end <- as.integer(link_df_row[3])
    gene_start <- as.integer(link_df_row[5])
    CRE_center <- CRE_start + (CRE_end - CRE_start)/2
    distance <- abs(CRE_center - gene_start)
    return(distance)
}


### Function to overlap the ABC CREs with the peaks list
overlap_with_peaks <- function(celltype, abc_fp, abc_outdir){
    # Get the left side coordinates from each ABC link (unique and sorted)
    df <- read.table(abc_fp, sep='\t')
    sites <- df[,c(1,2,3)]
    chr_names <- c(paste("chr",seq(1:22),sep=''),'chrX','chrY')
    sites_cut <- sites[sites$V1 %in% chr_names,]
    all_sites_list = paste(sites_cut$V1,sites_cut$V2,sites_cut$V3,sep='_')
    all_sites_list <- sort(unique(all_sites_list))
    
    # Output a bed file of these coordinates 
    fin_df <- as.data.frame(str_split_fixed(all_sites_list, '_', n=3))
    abc_cres_fp <- file.path(abc_outdir, sprintf('%s_all_nonsig_linked_sites.bed', celltype))
    write.table(fin_df, abc_cres_fp, sep='\t', row.names=FALSE, col.names=FALSE, quote=FALSE)
    
    # Overlap with celltype CREs list using bedtools in the terminal
    # Previously used -f but this loses some overlaps (if ABC peak is larger than mapping peak)
    cres_fp <- paste0(ct_cres_prefix, celltype, ct_cres_suffix)
    overlap_fp <- file.path(abc_outdir, sprintf('%s_all_nonsig_linked_sites.mergedPeaks.overlap.bed', celltype))
    cmd <- paste('bedtools intersect -a', abc_cres_fp, '-b', cres_fp, '-wo >',overlap_fp, sep=' ')
    system(cmd)
}


### Function to map ABC cRE coords to overlapping peaks and output reformatted dataframe
map_to_overlap_peaks <- function(celltype, abc_fp, abc_outdir){
    # Read in the bedtools intersect output, cut out repeats, and get map
    overlap_fp <- file.path(abc_outdir, sprintf('%s_all_nonsig_linked_sites.mergedPeaks.overlap.bed', celltype))
    abc_overlap <- read.table(overlap_fp, sep='\t')
    abc_overlap$peak1 <- paste(abc_overlap$V1, abc_overlap$V2, abc_overlap$V3, sep='-')
    abc_overlap$peak2 <- paste(abc_overlap$V4, abc_overlap$V5, abc_overlap$V6, sep='-')
    
    # Decide between peaks that map to multiple peaks (based on bp overlap)
    # Sort by the overlap # and then remove duplicate peak1
    abc_overlap_sort <- abc_overlap[order(abc_overlap$V7, decreasing = TRUE),]
    abc_overlap_cut <- abc_overlap_sort[!duplicated(abc_overlap_sort$peak1),c(8,9)]
    
    # Read in ABC output and map CRE coords
    abc_df <- read.table(abc_fp, sep='\t')
    abc_df$peak1 <- paste(abc_df$V1, abc_df$V2, abc_df$V3, sep="-")
    abc_df_overlap <- merge(abc_df, abc_overlap_cut, by='peak1')
    abc_nomap <- abc_df[!abc_df$peak1 %in% abc_overlap$peak1,seq(1,8)]

    # Print out statistics
    print(celltype)
    print(paste('Number of mapped links: ', dim(abc_df_overlap)[1], '/', dim(abc_df)[1], sep=''))
    print(paste('Number of unmapped links: ', dim(abc_nomap)[1], '/', dim(abc_df)[1], sep=''))
    print('')

    # Now make final output ABC df where the CRE coords cols are altered to be the mapped ones
    # Excludes any links for which the CRE did not map (peak one is moved to leftmost col by merge)
    fin_abc_df <- cbind(str_split_fixed(abc_df_overlap$peak2, '-', 3), abc_df_overlap[,c(5,6,7,8,9)])
    mapped_abc_fp <- file.path(abc_outdir, sprintf('%s_nonsig_mapped_links.bedpe',celltype))
    write.table(fin_abc_df, mapped_abc_fp, sep='\t', row.names=FALSE, col.names=FALSE, quote=FALSE)
}


### Function to prepare ABC background links reference using R, not bash
### Subset for links with score <= score_threshold (default 0.02), remove NaN links, select for 1Mb distance max
### Output everything that passes this to a file in bedpe format for comparisons
prep_ABC_background <- function(celltype, abc_dir, outdir, score_threshold=0.02, dist_threshold=1000000){
    # Read in all putative links
    all_links_fp <- file.path(abc_dir, 'EnhancerPredictionsAllPutative.txt.gz')
    all_links <- read.table(all_links_fp, sep='\t', header=TRUE)
    print(paste0('Number of links in AllPutative: ', dim(all_links)[1]))

    # Keep links that do not pass the significance threshold (score <= score_threshold) and are not NaN
    all_links_cut <- all_links[!is.na(all_links$ABC.Score) & all_links$ABC.Score <= score_threshold,]
    print(paste0('Number of links below threshold and not NaN: ', dim(all_links_cut)[1]))
    
    # Convert to bedpe format -- get TSS coords for each gene from ref_df
    genes <- all_links_cut$TargetGene    
    unique_genes <- unique(genes)
    gene_tss <- sapply(unique_genes, get_TSS, ref_df)
    tss_info <- data.frame(gene_tss, gene_tss+1)
    tss_info$TargetGene <- unique_genes

    # Merge TSS coords into all_links_cut and select cols to make a bedpe
    merged_df <- merge(all_links_cut, tss_info, by='TargetGene')
    all_links_bedpe <- merged_df[,c(2,3,4,2,25,26,1,21)] #TargetGene becomes col 1 in merge

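    # Remove links more than 1Mb apart (or whatever distance threshold you choose)
    # which() also drops links with an NA distance (gene missing from ref_df) instead of returning all-NA rows
    distances <- unlist(apply(all_links_bedpe, 1, calc_link_distance))
    fin_links <- all_links_bedpe[which(distances <= dist_threshold),]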
    print(paste0('Number of links below distance threshold (', dist_threshold, 'bps): ', dim(fin_links)[1]))
    
    filt_fp <- file.path(abc_dir, 'EnhancerPredictionsAllPutative.filtered.txt')
    write.table(fin_links, filt_fp, sep='\t', row.names=FALSE, col.names=FALSE, quote=FALSE)
    
    # Map CRE coords to peak calls (gzip files)
    overlap_with_peaks(celltype, filt_fp, outdir)
    map_to_overlap_peaks(celltype, filt_fp, outdir)
    system(sprintf('gzip %s', filt_fp))
}

### Use functions to prepare the non-significant links files

In [52]:
abc_outdir

In [53]:
# ABC prepare files for compare connections -- previously this only took like 5-15 mins per cell type

tic()
for (celltype in h3k27ac_celltypes){
    print(paste(celltype, Sys.time()))
    abc_dir <- paste0(abc_prefix1, celltype, "/Prediction")
    outdir <- file.path(abc_outdir, 'nonsig_links')
    prep_ABC_background(celltype, abc_dir, outdir)
    print('')
}
toc()

[1] "beta 2023-03-02 13:28:10"
[1] "Number of links in AllPutative: 9334969"
[1] "Number of links below threshold and not NaN: 9168946"
[1] "Number of links below distance threshold (1e+06bps): 2085731"
[1] "beta"
[1] "Number of mapped links: 308002/2085731"
[1] "Number of unmapped links: 1777729/2085731"
[1] ""
[1] ""
[1] "alpha 2023-03-02 13:57:51"
[1] "Number of links in AllPutative: 10141622"
[1] "Number of links below threshold and not NaN: 9970018"
[1] "Number of links below distance threshold (1e+06bps): 2285905"
[1] "alpha"
[1] "Number of mapped links: 349230/2285905"
[1] "Number of unmapped links: 1936675/2285905"
[1] ""
[1] ""
[1] "delta 2023-03-02 14:31:02"
[1] "Number of links in AllPutative: 10155074"
[1] "Number of links below threshold and not NaN: 9976711"
[1] "Number of links below distance threshold (1e+06bps): 2257965"
[1] "delta"
[1] "Number of mapped links: 340528/2257965"
[1] "Number of unmapped links: 1917437/2257965"
[1] ""
[1] ""
[1] "gamma 2023-03-02 14:58:46"

In [37]:
tic()
for (celltype in no_h3k27ac_celltypes){
    print(paste(celltype, Sys.time()))
    abc_dir <- paste0(abc_prefix2, celltype, "/Prediction")
    outdir <- file.path(abc_outdir, 'nonsig_links')
    prep_ABC_background(celltype, abc_dir, outdir)
    print('')
}
toc()

[1] "stellate 2023-03-01 16:00:01"
[1] "Number of links in AllPutative: 8852236"
[1] "Number of links below threshold and not NaN: 8671165"
[1] "Number of links below distance threshold (1e+06bps): 1983537"
[1] "stellate"
[1] "Number of mapped links: 268012/1983537"
[1] "Number of unmapped links: 1715525/1983537"
[1] ""
[1] ""
[1] "endothelial 2023-03-01 16:09:04"
[1] "Number of links in AllPutative: 4502790"
[1] "Number of links below threshold and not NaN: 4329432"
[1] "Number of links below distance threshold (1e+06bps): 995203"
[1] "endothelial"
[1] "Number of mapped links: 114712/995203"
[1] "Number of unmapped links: 880491/995203"
[1] ""
[1] ""
[1] "schwann 2023-03-01 16:11:55"
[1] "Number of links in AllPutative: 1817472"
[1] "Number of links below threshold and not NaN: 1687142"
[1] "Number of links below distance threshold (1e+06bps): 368158"
[1] "schwann"
[1] "Number of mapped links: 32686/368158"
[1] "Number of unmapped links: 335472/368158"
[1] ""
[1] ""
[1] "immune 2023-0

# 3. Reformat Cicero links
Considerations:
- Multimapping: some CREs overlap multiple gene promoters; each overlap counts as a separate link to that gene
- CRE coords need to be in the first 3 columns, so the coords are swapped for rows where the first peak (prom1) overlaps the gene promoter(s)
- What to list as the gene coords? The gene coords from the ref df, or the coords of the CRE that overlaps the gene's promoter? --> when we compare links we compare CRE_gene names, so it should be OK to leave as is
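
The classification and coordinate swap described above can be sketched on toy data. All coordinates, gene names, and the `genes_key` lookup below are made up for illustration. The lookup stands in for the peak-to-promoter overlaps that bedtools would provide.

```r
# Toy Cicero-style links: two peaks per row plus a score
links <- data.frame(
    chrom1 = 'chr1', start1 = c(100L, 5000L, 100L, 5000L), end1 = c(600L, 5500L, 600L, 5500L),
    chrom2 = 'chr1', start2 = c(5000L, 9000L, 9000L, 20000L), end2 = c(5500L, 9500L, 9500L, 20500L),
    score = c(0.4, 0.3, 0.2, 0.1),
    stringsAsFactors = FALSE
)

# Toy promoter lookup: peak ID -> comma-separated genes whose TSS region it overlaps
genes_key <- c('chr1-5000-5500' = 'GENE_A', 'chr1-20000-20500' = 'GENE_B,GENE_C')

peak1 <- paste(links$chrom1, links$start1, links$end1, sep='-')
peak2 <- paste(links$chrom2, links$start2, links$end2, sep='-')
links$prom1 <- unname(genes_key[peak1])   # NA when the peak overlaps no promoter
links$prom2 <- unname(genes_key[peak2])

# Class follows from how many of the two peaks overlap a promoter (0, 1, or 2)
n_prom <- (!is.na(links$prom1)) + (!is.na(links$prom2))
links$class <- c('CC', 'CP', 'PP')[n_prom + 1]

# Keep CP links and orient them so the CRE (non-promoter peak) comes first
cp <- links[links$class == 'CP', ]
swap <- !is.na(cp$prom1)   # promoter is the first peak, so swap the two peaks
cp_oriented <- data.frame(
    chrom1 = ifelse(swap, cp$chrom2, cp$chrom1),
    start1 = ifelse(swap, cp$start2, cp$start1),
    end1   = ifelse(swap, cp$end2, cp$end1),
    chrom2 = ifelse(swap, cp$chrom1, cp$chrom2),
    start2 = ifelse(swap, cp$start1, cp$start2),
    end2   = ifelse(swap, cp$end1, cp$end2),
    gene   = ifelse(swap, cp$prom1, cp$prom2),
    score  = cp$score,
    stringsAsFactors = FALSE
)
print(table(links$class))
print(cp_oriented)
```

In the toy output, the second link starts with the promoter peak, so its two peaks are swapped and the CRE `chr1:9000-9500` ends up in the first three columns.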

## 3a. Reformat significant Cicero CP links

### Functions

In [18]:
### Classify links
classify_links <- function(celltype, cic_fp, cic_outdir){
    # Write file of all unique CREs in cicero links
    df <- read.table(cic_fp, sep='\t')
    sites1 <- df[,c(1,2,3)]
    sites2 <- df[,c(4,5,6)]
    colnames(sites2) <- c("V1","V2","V3")
    chr_names <- c(paste("chr",seq(1:22),sep=''),'chrX','chrY')
    all_sites <- rbind(sites1[sites1$V1 %in% chr_names,],sites2[sites2$V1 %in% chr_names,])
    all_sites_list <- paste(all_sites$V1,all_sites$V2,all_sites$V3,sep='_')
    all_sites_list <- sort(unique(all_sites_list))

    fin_df <- as.data.frame(str_split_fixed(all_sites_list,'_',n=3))
    print(paste('Number of unique peaks: ', dim(fin_df)[1]))
    out_fp <- file.path(cic_outdir,sprintf('%s_links.unique_peaks.bedpe',celltype))
    write.table(fin_df, out_fp, sep='\t', row.names=FALSE, col.names=FALSE, quote=FALSE)
    
    # Overlap peaks with gene promoters
    cic_peaks <- out_fp
    overlap_fp <- file.path(cic_outdir, sprintf('%s_links.unique_peaks.gencodev32_TSS500.overlap.bed',celltype))
    cmd <- paste('bedtools intersect -a', cic_peaks, '-b', TSS_ref_fp, '-wa -wb >', overlap_fp, sep=' ')
    system(cmd)
    
    # Read in gene overlaps and classify all Cicero links
    gene_overlaps <- read.table(overlap_fp, sep='\t')
    gene_overlaps$peak1 <- paste(gene_overlaps$V1, gene_overlaps$V2, gene_overlaps$V3, sep='-')
    print(paste('Number of Cicero links that overlap gene TSS regions: ',dim(gene_overlaps)[1]))
    print(paste('Number of unique Cicero peaks that overlap at least one promoter', length(unique(gene_overlaps$peak1))))
    
    # Make a reference for each peak (which genes it overlaps)
    get_genes <- function(peak, gene_overlaps){
        gene <- gene_overlaps[gene_overlaps$peak1 == peak,7]
        return(paste(gene, collapse=","))
    }
    unique_peaks <- unique(gene_overlaps$peak1)
    genes_key <- unlist(lapply(unique_peaks, get_genes, gene_overlaps))
    names(genes_key) <- unique_peaks

    # Add in promoter classifications to all the links
    get_overlap_prom <- function(peak, genes_key){
        if (peak %in% names(genes_key)){
            return(genes_key[peak])
        } else {
            return(NA)
        }
    }
    df$peak1 = paste(df$V1, df$V2, df$V3, sep='-')
    df$peak2 = paste(df$V4, df$V5, df$V6, sep='-')
    df$prom1 = unlist(lapply(df$peak1, get_overlap_prom, genes_key))
    df$prom2 = unlist(lapply(df$peak2, get_overlap_prom, genes_key))

    # Classify each link based on prom1 and prom2
    classify_link <- function(df_row){
        prom1 = df_row[['prom1']]
        prom2 = df_row[['prom2']]
        if (is.na(prom1) && is.na(prom2)){
            return('CC')
        } else if (!is.na(prom1) && is.na(prom2)){
            return('CP')
        } else if (!is.na(prom1) && !is.na(prom2)){
            return('PP')
        } else if (is.na(prom1) && !is.na(prom2)){
            return('CP')
        }
    }
    df$class = apply(df,1, classify_link)
    print(table(df$class))
    out_fp = file.path(cic_outdir,sprintf('%s_links.wClass.bedpe',celltype))
    write.table(df[,-8],out_fp, sep='\t', col.names=FALSE, row.names=FALSE, quote=FALSE)
    print('')
}


### Cut down links to CP (CRE-gene promoter links)
cut_links_to_CP <- function(celltype, cic_outdir){
    # Read in classified cicero outputs
    cic_fp <- file.path(cic_outdir,sprintf('%s_links.wClass.bedpe',celltype))
    print(cic_fp)
    cic_df <- read.table(cic_fp, sep='\t')
    colnames(cic_df) <- c('chrom1', 'start1', 'end1', 'chrom2', 'start2', 'end2', 'Cicero_score', 
                          'peak1', 'peak2', 'prom1', 'prom2', 'class')

    # Cut down to just CP links
    cic_cp <- cic_df[cic_df$class == 'CP',]
    print(paste('CP links: ', dim(cic_cp)[1], '/', dim(cic_df)[1], sep=''))
    return(cic_cp)
}


### Reformat links to be: CRE coords, gene coords, gene name, score
reformat_CP_links <- function(cic_cp, celltype, cic_outdir){
    # Extract gene name from either prom1 or prom2 column
    get_gene <- function(row){
        gene1 <- row[["prom1"]]
        gene2 <- row[["prom2"]]
        if (is.na(gene1)){
            return(gene2)
        } else {
            return(gene1)
        }
    }
    cic_cp$gene <- apply(cic_cp, 1, get_gene)

    # Separate out multigene rows into new df for further work
    cic_cp1 <- cic_cp[!grepl(',', cic_cp$gene),]
    cic_cp2 <- cic_cp[grepl(',', cic_cp$gene),]

    # Split each multigene row into one row per gene: repeat each row once per
    # gene (vectorized, rather than growing a data frame row by row in a loop,
    # and safe when there are no multigene rows)
    gene_lists <- strsplit(cic_cp2$gene, ',')
    cic_cp2_split <- cic_cp2[rep(seq_len(nrow(cic_cp2)), lengths(gene_lists)), seq(1,12)]
    cic_cp2_split$gene <- as.character(unlist(gene_lists))

    # Combine split multigenes with the one gene rows in cic_cp1
    cic_cp_combo <- rbind(cic_cp1, cic_cp2_split)

    # Swap coords for links where prom1 has the gene (so the CRE ends up in the first 3 columns)
    cic_left <- cic_cp_combo[!is.na(cic_cp_combo$prom1),]
    cic_right <- cic_cp_combo[!is.na(cic_cp_combo$prom2),]
    cic_left <- cic_left[,c(4,5,6,1,2,3,seq(7,13))]
    colnames(cic_left) <- c('chrom1', 'start1', 'end1', 'chrom2', 'start2', 'end2', 'Cicero_score', 
                          'peak1', 'peak2', 'prom1', 'prom2', 'class', 'gene')
    cic_fin <- rbind(cic_right, cic_left)

    # Make final df for output (sort by CRE positions, cut out unnecessary cols)
    cic_fin <- cic_fin[order(cic_fin$chrom1,cic_fin$start1),]
    cic_fin2 <- cic_fin[,c(seq(1,6),13,7)] #order: coords, gene, score
    print(paste('Final number of distinct CRE-gene links: ',dim(cic_fin2)[1], sep=''))

    # Output to a file!
    out_fp <- file.path(cic_outdir, sprintf('%s_links.CP_reformat.bedpe',celltype))
    write.table(cic_fin2, out_fp, sep='\t', row.names=FALSE, col.names=FALSE, quote=FALSE)
}

### Run functions on all celltypes to reformat links

In [60]:
# Run function to create classified output file (input is already FDR thresholded)
tic()
for (celltype in celltypes){
    cic_fp <- paste0(cic_prefix, celltype, cic_suffix)
    print(cic_fp)
    classify_links(celltype, cic_fp, cic_outdir)
}
toc()

[1] "/nfs/lab/projects/multiomic_islet/outputs/multiome/cRE-gene_links/cicero/230111_final_map/Cicero_links.beta.above0.05.dedup.bedpe"
[1] "Number of unique peaks:  33921"
[1] "Number of Cicero links that overlap gene TSS regions:  10962"
[1] "Number of unique Cicero peaks that overlap at least one promoter 8543"

   CC    CP    PP 
37359 27356 24685 
[1] ""
[1] "/nfs/lab/projects/multiomic_islet/outputs/multiome/cRE-gene_links/cicero/230111_final_map/Cicero_links.alpha.above0.05.dedup.bedpe"
[1] "Number of unique peaks:  34242"
[1] "Number of Cicero links that overlap gene TSS regions:  12002"
[1] "Number of unique Cicero peaks that overlap at least one promoter 9390"

   CC    CP    PP 
31057 34764 29132 
[1] ""
[1] "/nfs/lab/projects/multiomic_islet/outputs/multiome/cRE-gene_links/cicero/230111_final_map/Cicero_links.delta.above0.05.dedup.bedpe"
[1] "Number of unique peaks:  40781"
[1] "Number of Cicero links that overlap gene TSS regions:  14127"
[1] "Number of unique Cicero peaks

In [88]:
# Cut down Cicero outputs to CP links and reformat
tic()
for (celltype in celltypes){
    print(celltype)
    cic_cp <- cut_links_to_CP(celltype, cic_outdir)
    reformat_CP_links(cic_cp, celltype, cic_outdir)
    print('')
}
toc()

[1] "beta"
[1] "/nfs/lab/projects/multiomic_islet/outputs/multiome/cRE-gene_links/cicero/230111_final_map/reprocessed_outputs/beta_links.wClass.bedpe"
[1] "CP links: 27356/89400"
[1] "Final number of distinct CRE-gene links: 34789"
[1] ""
[1] "alpha"
[1] "/nfs/lab/projects/multiomic_islet/outputs/multiome/cRE-gene_links/cicero/230111_final_map/reprocessed_outputs/alpha_links.wClass.bedpe"
[1] "CP links: 34764/94953"
[1] "Final number of distinct CRE-gene links: 44644"
[1] ""
[1] "delta"
[1] "/nfs/lab/projects/multiomic_islet/outputs/multiome/cRE-gene_links/cicero/230111_final_map/reprocessed_outputs/delta_links.wClass.bedpe"
[1] "CP links: 26834/62263"
[1] "Final number of distinct CRE-gene links: 34556"
[1] ""
[1] "gamma"
[1] "/nfs/lab/projects/multiomic_islet/outputs/multiome/cRE-gene_links/cicero/230111_final_map/reprocessed_outputs/gamma_links.wClass.bedpe"
[1] "CP links: 39390/91362"
[1] "Final number of distinct CRE-gene links: 51066"
[1] ""
[1] "ductal"
[1] "/nfs/lab/projects/mu

## 3b. Process non-significant links to CP bedpe files with gene names
Steps:
1. Deduplicate links (remove repeated links, generally where CREs are swapped)
2. Overlap all CREs with a gene TSS file and then classify links as CC, CP, PP 
3. Reformat CP links to have gene names listed on separate lines, also with gene coords (just using TSS500 coords here)
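
Steps 1 and 3 can be sketched on toy data as follows. The links and gene names are made up. The dedup key compares peak starts as numbers, and the multigene split repeats each row once per gene.

```r
# Toy links: rows 1 and 2 are the same pair of peaks listed in opposite order
links <- data.frame(
    V1 = 'chr1', V2 = c(100L, 5000L, 100L), V3 = c(600L, 5500L, 600L),
    V4 = 'chr1', V5 = c(5000L, 100L, 9000L), V6 = c(5500L, 600L, 9500L),
    V7 = c(0.4, 0.4, 0.2), stringsAsFactors = FALSE
)
peak1 <- paste(links$V1, links$V2, links$V3, sep='-')
peak2 <- paste(links$V4, links$V5, links$V6, sep='-')

# Step 1: order-independent key, with the smaller-start peak always first
first_is_left <- links$V2 < links$V5
key <- ifelse(first_is_left, paste(peak1, peak2, sep='_'), paste(peak2, peak1, sep='_'))
dedup <- links[!duplicated(key), ]

# Toy CP links after gene annotation: one row lists two genes
cp <- data.frame(cre = c('chr1-100-600', 'chr1-9000-9500'),
                 gene = c('GENE_A', 'GENE_B,GENE_C'), stringsAsFactors = FALSE)

# Step 3: repeat each row once per gene, then fill in the individual gene names
gene_lists <- strsplit(cp$gene, ',')
cp_split <- cp[rep(seq_len(nrow(cp)), lengths(gene_lists)), ]
cp_split$gene <- unlist(gene_lists)
print(dedup)
print(cp_split)
```

Both operations work on whole columns, which matters at the scale of the all-links files (millions of rows).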

### Functions

In [9]:
cic_suffix2 <- '.all.bedpe'

In [12]:
### Function to remove duplicate connections, from all links file
dedup_all_links <- function(all_links_fp, outdir, celltype){
    # Read in all_links file
    out_df_cut <- read.table(all_links_fp, sep='\t')
    
    # Remove duplicated links (same pair of peaks, listed in either order)
    # Build an order-independent key per link: the peak with the smaller start goes first.
    # Done on whole columns so starts are compared as numbers (apply() coerces each row to strings)
    peak1 <- paste(out_df_cut$V1, out_df_cut$V2, out_df_cut$V3, sep='-')
    peak2 <- paste(out_df_cut$V4, out_df_cut$V5, out_df_cut$V6, sep='-')
    first_is_left <- out_df_cut$V2 < out_df_cut$V5
    out_df_cut$ordered_peaks <- ifelse(first_is_left, paste(peak1, peak2, sep='_'), paste(peak2, peak1, sep='_'))
    out_df_fin = out_df_cut[!duplicated(out_df_cut$ordered_peaks),]

    # Output the deduplicated df
    out_fp2 = file.path(outdir,sprintf('Cicero_links.%s.all.dedup.bedpe',celltype))
    write.table(out_df_fin, out_fp2, sep='\t', col.names=FALSE, row.names=FALSE, quote=FALSE)
}




### Use functions to deduplicate all links and select for CP links, add gene name

In [14]:
cic_outdir2 <- '/nfs/lab/projects/multiomic_islet/outputs/multiome/cRE-gene_links/cicero/230111_final_map/reprocessed_all_links'

In [16]:
# Deduplicate all links -- took about 20 mins
for (celltype in c('beta')){
    print(paste(celltype, Sys.time()))
    cic_fp <- paste0(cic_prefix, celltype, cic_suffix2)
    dedup_all_links(cic_fp, cic_outdir2, celltype)
}

print(Sys.time())

[1] "beta 2023-02-24 12:11:48"
[1] "2023-02-24 12:31:04 PST"


In [19]:
# Run function to create classified output file (same functions, diff outdir) -- took 2.5 hours to run
tic()
for (celltype in c('beta')){
    cic_fp <- file.path(cic_outdir2,sprintf('Cicero_links.%s.all.dedup.bedpe',celltype))
    classify_links(celltype, cic_fp, cic_outdir2)
}
toc()

[1] "Number of unique peaks:  109918"
[1] "Number of Cicero links that overlap gene TSS regions:  18632"
[1] "Number of unique Cicero peaks that overlap at least one promoter 15024"

     CC      CP      PP 
3287657 1027391  114743 
[1] ""
8965.603 sec elapsed


In [20]:
# Cut down Cicero outputs to CP links and reformat (same functions, diff outdir) -- took 32 hours to run
tic()
for (celltype in c('beta')){
    print(celltype)
    cic_cp <- cut_links_to_CP(celltype, cic_outdir2)
    reformat_CP_links(cic_cp, celltype, cic_outdir2)
    print('')
}
toc()

[1] "beta"
[1] "/nfs/lab/projects/multiomic_islet/outputs/multiome/cRE-gene_links/cicero/230111_final_map/reprocessed_all_links/beta_links.wClass.bedpe"
[1] "CP links: 1027391/4429791"
[1] "Final number of distinct CRE-gene links: 1270478"
[1] ""
115584.908 sec elapsed


### Running on remaining large cell types for now... do this in parallel with mclapply!
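
As a minimal sketch of the pattern, the block below runs a placeholder per-celltype step with `mclapply` and logs start/done times to a shared file. The `run_step` function and the toy celltype list are hypothetical stand-ins for the real dedup and classify functions, and the core count is an arbitrary choice.

```r
library(parallel)

# Hypothetical stand-in for a long per-celltype step: logs start/done times
# to a shared file and returns a small result
run_step <- function(celltype, log_fp){
    write(paste(celltype, 'start', Sys.time()), file=log_fp, append=TRUE)
    n_chars <- nchar(celltype)   # placeholder for the real work
    write(paste(celltype, 'done', Sys.time()), file=log_fp, append=TRUE)
    return(n_chars)
}

toy_celltypes <- c('alpha', 'delta', 'gamma')
log_fp <- tempfile(fileext='.log')

# mclapply relies on forking, so mc.cores must be 1 on Windows
n_cores <- if (.Platform$OS.type == 'windows') 1 else 2
results <- mclapply(toy_celltypes, run_step, log_fp=log_fp, mc.cores=n_cores)
names(results) <- toy_celltypes

# Failed workers come back as 'try-error' objects rather than stopping the run
failed <- sapply(results, inherits, 'try-error')
print(failed)
print(readLines(log_fp))
```

Each worker loads its own copy of the data, so for the all-links files the number of cores is limited by memory as much as by CPU count. Checking `failed` afterwards matters because `mclapply` does not stop when one cell type errors.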

In [22]:
### Function to remove duplicate connections, from all links file -- modified for parallel use and logging
run_dedup_all_links <- function(celltype, outdir, log_fp){
    # Read in all_links file
    write(paste(celltype, 'start', Sys.time()), file=log_fp, append=TRUE)
    all_links_fp <- paste0(cic_prefix, celltype, cic_suffix2)
    out_df_cut <- read.table(all_links_fp, sep='\t')
    
    # Remove duplicated links (same pair of peaks, listed in either order)
    # Build an order-independent key per link: the peak with the smaller start goes first.
    # Done on whole columns so starts are compared as numbers (apply() coerces each row to strings)
    peak1 <- paste(out_df_cut$V1, out_df_cut$V2, out_df_cut$V3, sep='-')
    peak2 <- paste(out_df_cut$V4, out_df_cut$V5, out_df_cut$V6, sep='-')
    first_is_left <- out_df_cut$V2 < out_df_cut$V5
    out_df_cut$ordered_peaks <- ifelse(first_is_left, paste(peak1, peak2, sep='_'), paste(peak2, peak1, sep='_'))
    out_df_fin = out_df_cut[!duplicated(out_df_cut$ordered_peaks),]

    # Output the deduplicated df
    out_fp2 = file.path(outdir,sprintf('Cicero_links.%s.all.dedup.bedpe',celltype))
    write.table(out_df_fin, out_fp2, sep='\t', col.names=FALSE, row.names=FALSE, quote=FALSE)
    write(paste(celltype, 'done', Sys.time()), file=log_fp, append=TRUE)
}


### Classify links -- modified for parallel use and logging
run_classify_links <- function(celltype, cic_outdir, log_fp){
    write(paste(celltype, 'start', Sys.time()), file=log_fp, append=TRUE)
    cic_fp <- file.path(cic_outdir,sprintf('Cicero_links.%s.all.dedup.bedpe',celltype))
    
    # Write file of all unique CREs in cicero links
    df <- read.table(cic_fp, sep='\t')
    sites1 <- df[,c(1,2,3)]
    sites2 <- df[,c(4,5,6)]
    colnames(sites2) <- c("V1","V2","V3")
    chr_names <- c(paste("chr",seq(1:22),sep=''),'chrX','chrY')
    all_sites <- rbind(sites1[sites1$V1 %in% chr_names,],sites2[sites2$V1 %in% chr_names,])
    all_sites_list <- paste(all_sites$V1,all_sites$V2,all_sites$V3,sep='_')
    all_sites_list <- sort(unique(all_sites_list))

    fin_df <- as.data.frame(str_split_fixed(all_sites_list,'_',n=3))
    print(paste('Number of unique peaks: ', dim(fin_df)[1]))
    out_fp <- file.path(cic_outdir,sprintf('%s_links.unique_peaks.bedpe',celltype))
    write.table(fin_df, out_fp, sep='\t', row.names=FALSE, col.names=FALSE, quote=FALSE)
    
    # Overlap peaks with gene promoters
    cic_peaks <- out_fp
    overlap_fp <- file.path(cic_outdir, sprintf('%s_links.unique_peaks.gencodev32_TSS500.overlap.bed',celltype))
    cmd <- paste('bedtools intersect -a', cic_peaks, '-b', TSS_ref_fp, '-wa -wb >', overlap_fp, sep=' ')
    system(cmd)
    
    # Read in gene overlaps and classify all Cicero links
    gene_overlaps <- read.table(overlap_fp, sep='\t')
    gene_overlaps$peak1 <- paste(gene_overlaps$V1, gene_overlaps$V2, gene_overlaps$V3, sep='-')
    print(paste('Number of Cicero peak-TSS region overlaps: ',dim(gene_overlaps)[1]))
    print(paste('Number of unique Cicero peaks that overlap at least one promoter', length(unique(gene_overlaps$peak1))))
    
    
    # Make a reference for each peak (which genes it overlaps)
    get_genes <- function(peak, gene_overlaps){
        gene <- gene_overlaps[gene_overlaps$peak1 == peak,7]
        return(paste(gene, collapse=","))
    }
    unique_peaks <- unique(gene_overlaps$peak1)
    genes_key <- unlist(lapply(unique_peaks, get_genes, gene_overlaps))
    names(genes_key) <- unique_peaks

    # Add in promoter classifications to all the links
    get_overlap_prom <- function(peak, genes_key){
        if (peak %in% names(genes_key)){
            return(genes_key[peak])
        } else {
            return(NA)
        }
    }
    df$peak1 = paste(df$V1, df$V2, df$V3, sep='-')
    df$peak2 = paste(df$V4, df$V5, df$V6, sep='-')
    df$prom1 = unlist(lapply(df$peak1, get_overlap_prom, genes_key))
    df$prom2 = unlist(lapply(df$peak2, get_overlap_prom, genes_key))

    # Classify each link based on prom1 and prom2
    classify_link <- function(df_row){
        prom1 = df_row[['prom1']]
        prom2 = df_row[['prom2']]
        if (is.na(prom1) && is.na(prom2)){
            return('CC')
        } else if (!is.na(prom1) && is.na(prom2)){
            return('CP')
        } else if (!is.na(prom1) && !is.na(prom2)){
            return('PP')
        } else if (is.na(prom1) && !is.na(prom2)){
            return('CP')
        }
    }
    df$class = apply(df,1, classify_link)
    out_fp = file.path(cic_outdir,sprintf('%s_links.wClass.bedpe',celltype))
    write.table(df[,-8],out_fp, sep='\t', col.names=FALSE, row.names=FALSE, quote=FALSE)
    write(paste(celltype, 'done', Sys.time()), file=log_fp, append=TRUE)
}


### Function to run the cut links to CP and reformat functions -- for parallel use
run_reformat_CP <- function(celltype, outdir, log_fp){
    write(paste(celltype, 'start', Sys.time()), file=log_fp, append=TRUE)
    cic_cp <- cut_links_to_CP(celltype, outdir)
    reformat_CP_links(cic_cp, celltype, outdir)
    write(paste(celltype, 'done', Sys.time()), file=log_fp, append=TRUE)
}
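
The row-wise `apply()` in `run_dedup_all_links` is one of the slow steps (about 20 min for beta cells). A minimal sketch of the same order-independent deduplication done on whole columns instead is below. It assumes the same unnamed BEDPE layout (`V1`-`V3` first peak, `V4`-`V6` second peak, `V7` score); the function name `dedup_links_vectorized` and the toy values are hypothetical, and this has not been run on the real Cicero files.

```r
# Toy BEDPE-like links; rows 1 and 2 are the same peak pair in opposite order
toy_links <- data.frame(
    V1 = c('chr1', 'chr1', 'chr1'),
    V2 = c(100L, 5000L, 100L),
    V3 = c(600L, 5500L, 600L),
    V4 = c('chr1', 'chr1', 'chr1'),
    V5 = c(5000L, 100L, 9000L),
    V6 = c(5500L, 600L, 9500L),
    V7 = c(0.31, 0.31, 0.05),
    stringsAsFactors = FALSE
)

# Hypothetical vectorized version of the get_ordered_peaks + duplicated() step
dedup_links_vectorized <- function(df){
    peak_a <- paste(df$V1, df$V2, df$V3, sep='-')
    peak_b <- paste(df$V4, df$V5, df$V6, sep='-')
    # Numeric comparison on the start columns decides which peak goes first
    a_first <- df$V2 < df$V5
    ordered_peaks <- ifelse(a_first,
                            paste(peak_a, peak_b, sep='_'),
                            paste(peak_b, peak_a, sep='_'))
    df[!duplicated(ordered_peaks), ]
}

toy_dedup <- dedup_links_vectorized(toy_links)
print(nrow(toy_dedup))
```

Because the comparison runs on the numeric columns directly, it avoids the character coercion that `apply()` performs on a mixed-type data frame.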

In [25]:
log_fp <- '/nfs/lab/projects/multiomic_islet/outputs/multiome/cRE-gene_links/cicero/230111_final_map/reprocessed_all_links/log.txt'
celltypes_cut <- c('alpha','acinar','ductal','delta','gamma')
cic_outdir2

# Deduplicate all links -- took about 20 min for beta cells
write(paste('Deduplicating links:', Sys.time()), file=log_fp, append=TRUE)
mclapply(celltypes_cut, run_dedup_all_links, cic_outdir2, log_fp, mc.cores=5)
write("\n", file=log_fp, append=TRUE)

# Create classified output file (same functions, diff outdir) -- took 2.5 hours to run on beta cells
write(paste('Classifying links:',Sys.time()), file=log_fp, append=TRUE)
mclapply(celltypes_cut, run_classify_links, cic_outdir2, log_fp, mc.cores=5)
write("\n", file=log_fp, append=TRUE)

ERROR: Error in paste("Reformatting CP links:", Sys.time): cannot coerce type 'closure' to vector of type 'character'


In [None]:
# Cut down Cicero outputs to CP links and reformat (same functions, diff outdir) -- took 32 hours to run on beta cells
write(paste('Reformatting CP links:',Sys.time()), file=log_fp, append=TRUE)
mclapply(celltypes_cut, run_reformat_CP, cic_outdir2, log_fp, mc.cores=5)

write(paste('Done',Sys.time()), file=log_fp, append=TRUE)

### Also filter all tested Cicero links for score <=0.02 and distance <=1Mb

In [49]:
### Function to calculate link distances from a bedpe style dataframe row
calc_link_distance <- function(link_df_row){
    CRE_start <- as.integer(link_df_row[2])
    CRE_end <- as.integer(link_df_row[3])
    gene_start <- as.integer(link_df_row[5])
    CRE_center <- CRE_start + (CRE_end - CRE_start)/2
    distance <- abs(CRE_center - gene_start)
    return(distance)
}


### Function to filter Cicero background links reference using R, not bash
### Subset for links with score <= 0.02, select for 1Mb distance max
### These files should already be in a bedpe format!
filter_cicero_background <- function(celltype, cic_fp, outdir, score_threshold=0.02, dist_threshold=1000000){
    # Read in the processed background bedpe file
    all_links <- read.table(cic_fp, sep='\t', header=TRUE)
    colnames(all_links) <- c('CRE_chr','CRE_start','CRE_end','gene_chr','gene_start','gene_end','gene','score')

    # Keep only non-significant links: score at or below the threshold (0.02), dropping NA/NaN scores
    all_links_cut <- all_links[!is.na(all_links$score) & all_links$score <= score_threshold,]
    
    # Remove links more than 1Mb apart (or whatever distance threshold you choose)
    distances <- unlist(apply(all_links_cut, 1, calc_link_distance))
    fin_links <- all_links_cut[distances <= dist_threshold,]
    
    filt_fp <- str_replace(cic_fp, "CP_reformat", "final_filt")
    write.table(fin_links, filt_fp, sep='\t', row.names=FALSE, col.names=FALSE, quote=FALSE)
}
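
As a sanity check on the filtering logic, here is a small sketch that applies the same score and distance rules on whole columns to a toy data frame. The column names follow `filter_cicero_background`; the function name `filter_background_vectorized`, the gene names and all values are made up for illustration.

```r
# Toy links in the 8-column layout used above (values are made up)
toy_links <- data.frame(
    CRE_chr    = rep('chr1', 4),
    CRE_start  = c(1000, 1000, 1000, 1000),
    CRE_end    = c(1500, 1500, 1500, 1500),
    gene_chr   = rep('chr1', 4),
    gene_start = c(501250, 2001250, 1250, 501250),
    gene_end   = c(501251, 2001251, 1251, 501251),
    gene       = c('GENEA', 'GENEB', 'GENEC', 'GENED'),
    score      = c(0.01, 0.01, 0.50, NA),
    stringsAsFactors = FALSE
)

# Hypothetical column-wise version of the score + distance filter
filter_background_vectorized <- function(links, score_threshold=0.02, dist_threshold=1000000){
    # Same distance definition as calc_link_distance: CRE midpoint to gene_start
    CRE_center <- links$CRE_start + (links$CRE_end - links$CRE_start)/2
    distances <- abs(CRE_center - links$gene_start)
    keep <- !is.na(links$score) & links$score <= score_threshold & distances <= dist_threshold
    links[keep, ]
}

toy_filtered <- filter_background_vectorized(toy_links)
print(toy_filtered$gene)
```

In this toy set only `GENEA` survives: `GENEB` is 2Mb away, `GENEC` has a score above the threshold, and `GENED` has a missing score.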

In [51]:
cic_outdir2

In [50]:
tic()
for (celltype in celltypes[c(1,2,3,4,5,6)]){
    print(paste(celltype, Sys.time()))
    cic_fp <- file.path(cic_outdir2,sprintf('%s_links.CP_reformat.bedpe',celltype))
    print(cic_fp)
    filter_cicero_background(celltype, cic_fp, cic_outdir2) # outdir arg is unused; output path is derived from cic_fp
    print('')
}
toc()

[1] "beta 2023-03-02 11:05:40"
[1] "/nfs/lab/projects/multiomic_islet/outputs/multiome/cRE-gene_links/cicero/230111_final_map/reprocessed_all_links/beta_links.CP_reformat.bedpe"
[1] ""
[1] "alpha 2023-03-02 11:06:07"
[1] "/nfs/lab/projects/multiomic_islet/outputs/multiome/cRE-gene_links/cicero/230111_final_map/reprocessed_all_links/alpha_links.CP_reformat.bedpe"
[1] ""
[1] "delta 2023-03-02 11:06:45"
[1] "/nfs/lab/projects/multiomic_islet/outputs/multiome/cRE-gene_links/cicero/230111_final_map/reprocessed_all_links/delta_links.CP_reformat.bedpe"
[1] ""
[1] "gamma 2023-03-02 11:07:19"
[1] "/nfs/lab/projects/multiomic_islet/outputs/multiome/cRE-gene_links/cicero/230111_final_map/reprocessed_all_links/gamma_links.CP_reformat.bedpe"
[1] ""
[1] "ductal 2023-03-02 11:07:54"
[1] "/nfs/lab/projects/multiomic_islet/outputs/multiome/cRE-gene_links/cicero/230111_final_map/reprocessed_all_links/ductal_links.CP_reformat.bedpe"
[1] ""
[1] "acinar 2023-03-02 11:08:32"
[1] "/nfs/lab/projects/multiomic

# 4. Subset SMORES links to CP links

## 4a. Classify links as CP or PP

### Functions

In [None]:
### Classify links -- based on function used to classify Cicero links (very similar)
### celltype_prefix contains cell type name AND sig/all (for link set)
classify_links <- function(celltype_prefix, fp, outdir){
    # Write file of all unique CREs in links (only left set of coords here!)
    df <- read.table(fp, sep='\t')
    sites <- df[,c(1,2,3)]
    chr_names <- c(paste("chr",seq(1:22),sep=''),'chrX','chrY')
    all_sites <- sites[sites$V1 %in% chr_names,]
    all_sites_list <- paste(all_sites$V1,all_sites$V2,all_sites$V3,sep='_')
    all_sites_list <- sort(unique(all_sites_list))

    fin_df <- as.data.frame(str_split_fixed(all_sites_list,'_',n=3))
    print(paste('Number of unique peaks: ', dim(fin_df)[1]))
    peaks_fp <- file.path(outdir,sprintf('%s_links.unique_peaks.bedpe',celltype_prefix))
    write.table(fin_df, peaks_fp, sep='\t', row.names=FALSE, col.names=FALSE, quote=FALSE)
    
    # Overlap peaks with gene promoters
    overlap_fp <- file.path(outdir, sprintf('%s_links.unique_peaks.gencodev32_TSS500.overlap.bed',celltype_prefix))
    cmd <- paste('bedtools intersect -a', peaks_fp, '-b', TSS_ref_fp, '-wa -wb >', overlap_fp, sep=' ')
    system(cmd)
    
    # Read in gene overlaps and classify all links
    gene_overlaps <- read.table(overlap_fp, sep='\t')
    gene_overlaps$peak1 <- paste(gene_overlaps$V1, gene_overlaps$V2, gene_overlaps$V3, sep='-')
    print(paste('Number of TSS regions overlapping a linked CRE: ',dim(gene_overlaps)[1]))
    print(paste('Number of linked CREs that overlap at least one promoter', length(unique(gene_overlaps$peak1))))
    
    # Make a reference for each peak (which genes it overlaps)
    get_genes <- function(peak, gene_overlaps){
        gene <- gene_overlaps[gene_overlaps$peak1 == peak,7] 
        return(paste(gene, collapse=","))
    }
    unique_peaks <- unique(gene_overlaps$peak1)
    genes_key <- unlist(lapply(unique_peaks, get_genes, gene_overlaps))
    names(genes_key) <- unique_peaks

    # Add in promoter classifications to all the links
    get_overlap_prom <- function(peak, genes_key){
        if (peak %in% names(genes_key)){
            return(genes_key[peak])
        } else {
            return(NA)
        }
    }
    df$peak1 = paste(df$V1, df$V2, df$V3, sep='-')
    df$prom1 = unlist(lapply(df$peak1, get_overlap_prom, genes_key))

    # Classify each link based on prom1 only (CRE overlaps a promoter = PP, otherwise CP)
    classify_link <- function(df_row){
        prom1 = df_row[['prom1']]
        if(is.na(prom1)){
            return('CP')
        } else if(!is.na(prom1)){
            return('PP')
        }
    }
    df$class = apply(df,1, classify_link)
    fin_df <- df[order(df$V1,df$V2),]
    out_fp = file.path(outdir,sprintf('%s_links.wClass.bedpe',celltype_prefix))
    write.table(fin_df,out_fp, sep='\t', col.names=FALSE, row.names=FALSE, quote=FALSE)
}
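
To make the classification rule concrete, the sketch below runs the same idea on a toy example: collapse the genes per overlapping peak, look up each link's CRE by name, and call a link PP when its CRE overlaps a promoter. It assumes the bedtools overlap table has the gene name in column 7, as in `classify_links`; the toy coordinates and gene names are invented, and the per-peak lookup uses `split()` plus a named vector rather than the `lapply()` helpers above.

```r
# Toy CRE side of three links (left set of coords only)
toy_links <- data.frame(V1 = c('chr1', 'chr1', 'chr2'),
                        V2 = c(100, 900, 300),
                        V3 = c(600, 1400, 800),
                        stringsAsFactors = FALSE)

# Toy stand-in for the bedtools intersect output: one peak overlapping two TSS regions
toy_overlaps <- data.frame(V1 = c('chr1', 'chr1'),
                           V2 = c(900, 900),
                           V3 = c(1400, 1400),
                           V7 = c('GENEA', 'GENEB'),
                           stringsAsFactors = FALSE)
toy_overlaps$peak1 <- paste(toy_overlaps$V1, toy_overlaps$V2, toy_overlaps$V3, sep='-')

# Named vector: peak -> comma-separated genes whose promoter it overlaps
genes_key <- vapply(split(toy_overlaps$V7, toy_overlaps$peak1),
                    function(g) paste(g, collapse=','), character(1))

# Look up every link's CRE by name; peaks without an overlap come back as NA
toy_links$peak1 <- paste(toy_links$V1, toy_links$V2, toy_links$V3, sep='-')
toy_links$prom1 <- unname(genes_key[toy_links$peak1])
toy_links$class <- ifelse(is.na(toy_links$prom1), 'CP', 'PP')
print(toy_links[, c('peak1', 'prom1', 'class')])
```

Indexing a named vector with the full `peak1` column does the lookup in one step, which may matter for the larger "all links tested" files.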

### Run functions to classify links

In [None]:
# First classify significant links
for (celltype in celltypes){
    print(paste(celltype, Sys.time()))
    link_fp <- paste0(hm_prefix,celltype,'/',celltype,hm_suffix1)
    prefix <- paste0(celltype,'_sig')
    classify_links(prefix, link_fp, file.path(hm_outdir,'intermediates'))
    print("")
}

In [None]:
# Then classify all links tested
for (celltype in celltypes){
    print(paste(celltype, Sys.time()))
    link_fp2 <- paste0(hm_prefix,celltype,'/',celltype,hm_suffix2)
    prefix2 <- paste0(celltype,'_all')
    classify_links(prefix2, link_fp2, file.path(hm_outdir,'intermediates'))
    print("")
}

## 4b. Summarize proportion of links which are PP and output CP links-only file

In [None]:
# Significant links
for (celltype in celltypes){
    # Read in classified links file
    celltype_prefix <- paste0(celltype, '_sig')
    classified_fp <- file.path(hm_outdir,'intermediates',sprintf('%s_links.wClass.bedpe',celltype_prefix))
    df <- read.table(classified_fp, sep='\t')
    colnames(df) <- c('CRE_chr','CRE_start','CRE_end','gene_chr','gene_start','gene_end','gene',
                      'corr','pvalue','qvalue','CRE_peak','overlap_gene','class')
    
    # Write out stats
    class_prop <- table(df$class)
    print(class_prop)
    pp_links <- class_prop[['PP']]
    all_links <- dim(df)[1]
    print(paste0('Proportion of ', celltype_prefix, ' links which are PP: ', pp_links, '/', all_links, ' = ', pp_links/all_links))
    
    # Filter links file to CP links and output
    cp_links <- df[df$class =='CP',seq(1,10)]
    out_fp <- file.path(hm_outdir,sprintf('%s_CP_links.bedpe',celltype_prefix))
    write.table(cp_links, out_fp, sep='\t', col.names=FALSE, row.names=FALSE, quote=FALSE)
}

In [None]:
# All links
for (celltype in celltypes){
    # Read in classified links file
    celltype_prefix <- paste0(celltype, '_all')
    classified_fp <- file.path(hm_outdir,'intermediates',sprintf('%s_links.wClass.bedpe',celltype_prefix))
    df <- read.table(classified_fp, sep='\t')
    colnames(df) <- c('CRE_chr','CRE_start','CRE_end','gene_chr','gene_start','gene_end','gene',
                      'corr','pvalue','CRE_peak','overlap_gene','class')
    
    # Write out stats
    class_prop <- table(df$class)
    print(class_prop)
    pp_links <- class_prop[['PP']]
    all_links <- dim(df)[1]
    print(paste0('Proportion of ', celltype_prefix, ' links which are PP: ', pp_links, '/', all_links, ' = ', pp_links/all_links))
    
    # Filter links file to CP links and output
    cp_links <- df[df$class =='CP',seq(1,9)]
    out_fp <- file.path(hm_outdir,sprintf('%s_CP_links.bedpe',celltype_prefix))
    write.table(cp_links, out_fp, sep='\t', col.names=FALSE, row.names=FALSE, quote=FALSE)
}
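
One edge case in the summaries above: `class_prop[['PP']]` raises a subscript error if a cell type happens to have no PP links at all. A small sketch of one way around this is to tabulate a factor with both levels fixed, so a missing class is counted as 0. The toy class vector is made up.

```r
# Toy class column with no PP links at all
toy_classes <- c('CP', 'CP', 'CP')

# Fixing the factor levels guarantees both classes appear in the table
class_counts <- table(factor(toy_classes, levels=c('CP', 'PP')))
pp_links <- class_counts[['PP']]
pp_fraction <- pp_links / length(toy_classes)
print(class_counts)
print(pp_fraction)
```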

# 5. Merge the background link sets for all 3 methods

### Functions

In [None]:
### Function to extract a gene's TSS from the reference file
get_TSS <- function(gene, ref_df){
    if (gene %in% ref_df$V4){
        ref_df_cut = ref_df[ref_df$V4 ==gene,]
        # Use the first entry's strand so the condition is length-1 even if a gene has several rows
        if (ref_df_cut$V6[1] == '-'){
            tss = max(c(ref_df_cut$V2,ref_df_cut$V3))
        } else {
            tss = min((c(ref_df_cut$V2,ref_df_cut$V3)))
        }
        return(tss)
    } else {
        return(NA)
    }
}


### Function to calculate link distances from a bedpe style dataframe row
calc_link_distance <- function(link_df_row){
    CRE_start <- as.integer(link_df_row[2])
    CRE_end <- as.integer(link_df_row[3])
    gene_start <- as.integer(link_df_row[5])
    CRE_center <- CRE_start + (CRE_end - CRE_start)/2
    distance <- abs(CRE_center - gene_start)
    return(distance)
}


### Function to read in the 3 method background files (created in notebooks 1 and 2),
### merge them and then remove duplicates
merge_backgrounds <- function(celltype, smores_fp, abc_fp, cic_fp, outdir, dist_threshold=1000000){
    # Read in files and concatenate them
    smores_df <- read.table(smores_fp, sep='\t')[,-c(9)] ###removing p-value col so matches other dfs
    abc_df <- read.table(abc_fp, sep='\t')
    cic_df <- read.table(cic_fp, sep='\t')
    all_df <- rbind(smores_df, abc_df, cic_df)
    colnames(all_df) <- c('CRE_chr','CRE_start','CRE_end','gene_chr','gene_start','gene_end','gene','score')
    
    # Make link ID column and use this to remove duplicates
    all_df$link <- paste(paste(all_df$CRE_chr, all_df$CRE_start, all_df$CRE_end, sep='-'), all_df$gene, sep='_')
    all_df_cut <- all_df[!duplicated(all_df$link),]
    print(paste('Number of unique links in combined background set:', dim(all_df_cut)[1]))
    
    # Rewrite the gene TSS coord info (since cicero file doesn't have correct values)
    unique_genes <- unique(all_df_cut$gene)
    gene_tss <- sapply(unique_genes, get_TSS, ref_df)
    tss_info <- data.frame(gene_tss, gene_tss+1)
    tss_info$gene <- unique_genes    
    merged_df <- merge(all_df_cut, tss_info, by='gene')
    all_df2 <- merged_df[,c(2,3,4,2,10,11,1)] #TargetGene becomes col 1 in merge, not saving score bc not consistent across methods
    
    # Remove links more than 1Mb apart (or whatever distance threshold you choose)
    distances <- unlist(apply(all_df2, 1, calc_link_distance))
    all_df2_cut <- all_df2[distances <= dist_threshold,]
    fin_df <- all_df2_cut[order(all_df2_cut$CRE_chr, all_df2_cut$CRE_start),]

    # Save to bedpe file
    out_fp <- file.path(outdir, sprintf('%s_3method_merged_all_links.bedpe',celltype))
    write.table(fin_df, out_fp, sep='\t', row.names=FALSE, col.names=FALSE, quote=FALSE)
}
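
The strand rule in `get_TSS` (TSS is the larger coordinate on the minus strand, the smaller one on the plus strand) can be illustrated with a toy reference. This sketch assumes the reference layout implied by `get_TSS` (`V2`/`V3` coordinates, `V4` gene name, `V6` strand); the `V1` and `V5` columns, the gene names and the coordinates are assumptions made up for the example. It builds the lookup once as a named vector instead of calling the function per gene.

```r
# Toy gene reference in a BED6-like layout (hypothetical values)
toy_ref <- data.frame(V1 = c('chr1', 'chr1'),
                      V2 = c(1000L, 5000L),
                      V3 = c(2000L, 9000L),
                      V4 = c('GENEA', 'GENEB'),
                      V5 = c('.', '.'),
                      V6 = c('+', '-'),
                      stringsAsFactors = FALSE)

# Strand-aware TSS for every gene: max coord on '-', min coord otherwise
tss_lookup <- ifelse(toy_ref$V6 == '-',
                     pmax(toy_ref$V2, toy_ref$V3),
                     pmin(toy_ref$V2, toy_ref$V3))
names(tss_lookup) <- toy_ref$V4

# Genes missing from the reference come back as NA, matching get_TSS
toy_genes <- c('GENEB', 'GENEA', 'NOT_A_GENE')
toy_tss <- unname(tss_lookup[toy_genes])
print(toy_tss)
```

Links whose gene has no TSS in the reference end up with an `NA` distance, so it may be worth checking for `NA` values before applying the distance threshold.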

### Run functions to merge background files for celltype

In [None]:
smores_fp <- paste0(hm_prefix,celltype,'/',celltype,hm_suffix2)
abc_fp <- file.path(abc_dir, sprintf('%s_nonsig_mapped_links.bedpe',celltype))
cic_fp <- file.path(cic_dir, sprintf('%s_links.CP_reformat.bedpe',celltype))

tic()
merge_backgrounds(celltype, smores_fp, abc_fp, cic_fp, overlap_outdir)
toc()

### previous run took 130 seconds