# Ensembl Regulatory Build Overlap
This notebook explores the overlap of regulatory elements with balanced and unbalanced interactions.
The Ensembl Regulatory Build file (homo_sapiens.GRCh38.Regulatory_Build.regulatory_features.20210107.gff.gz) can be downloaded from
http://ftp.ensembl.org/pub/current_regulation/homo_sapiens/. An explanation of the Regulatory Build is available at
http://useast.ensembl.org/info/genome/funcgen/regulatory_build.html.

In [20]:
import gzip
import pandas as pd
from collections import defaultdict

Enter the path to the regulatory build file here:

In [3]:
regbuild_path = input()

/home/peter/data/diachromatic/homo_sapiens.GRCh38.Regulatory_Build.regulatory_features.20210107.gff.gz


## Format 
Specify the column names and data types. The start and end columns are integers. The score column contains "." in this file, so we read it as a string.

In [10]:
column_names = ['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attribute']
column_datatypes = {'seqname':str, 'source':str, 'feature':str, 'start':int, 'end':int, 'score':str, 'strand':str, 'frame':str, 'attribute':str}
df = pd.read_csv(regbuild_path, delimiter='\t', header=None, names=column_names, dtype=column_datatypes)

In [11]:
df.head()

Unnamed: 0,seqname,source,feature,start,end,score,strand,frame,attribute
0,GL000008.2,Regulatory_Build,open_chromatin_region,103733,104006,.,.,.,ID=open_chromatin_region:ENSR00000898744;bound...
1,GL000008.2,Regulatory_Build,open_chromatin_region,106249,106503,.,.,.,ID=open_chromatin_region:ENSR00000898745;bound...
2,GL000008.2,Regulatory_Build,open_chromatin_region,112878,113287,.,.,.,ID=open_chromatin_region:ENSR00000898746;bound...
3,GL000008.2,Regulatory_Build,open_chromatin_region,134012,134308,.,.,.,ID=open_chromatin_region:ENSR00000898747;bound...
4,GL000008.2,Regulatory_Build,open_chromatin_region,138112,138274,.,.,.,ID=open_chromatin_region:ENSR00001290160;bound...
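The attribute column packs several `key=value` pairs into one string separated by semicolons, as the truncated `ID=...;bound...` values above suggest. The following is a minimal sketch of how such a string could be split into a dictionary. The helper name `parse_gff_attributes` is hypothetical, and the `bound_start` and `bound_end` keys and their values in the example string are assumptions for illustration; only the `ID` value is taken from the output above.

```python
def parse_gff_attributes(attribute: str) -> dict:
    """Split a GFF attribute string of 'key=value' pairs joined by ';' into a dict."""
    result = {}
    for field in attribute.strip().split(";"):
        if not field:
            continue  # skip empty fields, e.g. from a trailing ';'
        key, _, value = field.partition("=")
        result[key.strip()] = value.strip()
    return result


# Illustrative string: the ID comes from the first row shown above,
# the bound_* keys and values are assumed for this example.
example = "ID=open_chromatin_region:ENSR00000898744;bound_start=103733;bound_end=104006"
attrs = parse_gff_attributes(example)

# The ID has the form '<feature>:<stable id>'; keep only the stable id.
feature_id = attrs["ID"].split(":", 1)[1]
print(feature_id, attrs)
```

Applied to the data frame, something like `df['attribute'].map(parse_gff_attributes)` would produce one dictionary per row, although for a file of this size it may be preferable to extract only the keys that are needed.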


In [14]:
feature_categories = df['feature'].unique()
print("The genomic elements listed in this file are: %s" % "; ".join(feature_categories))

The genomic elements listed in this file are open_chromatin_region; TF_binding_site; CTCF_binding_site; enhancer; promoter; promoter_flanking_region


# Hypothesis
We want to test whether the non-enriched digests of unbalanced interactions are more likely to be located in enhancers than those of balanced interactions.
To test this, we define a simple class for genomic elements. First let us view the chromosomes and other contigs.

In [15]:
contig_labels = df['seqname'].unique()
contig_labels

array(['GL000008.2', 'GL000009.2', 'GL000194.1', 'GL000195.1',
       'GL000205.2', 'GL000208.1', 'GL000213.1', 'GL000214.1',
       'GL000216.2', 'GL000218.1', 'GL000219.1', 'GL000220.1',
       'GL000221.1', 'GL000224.1', 'GL000225.1', 'KI270330.1',
       'KI270333.1', 'KI270336.1', 'KI270337.1', 'KI270438.1',
       'KI270442.1', 'KI270466.1', 'KI270519.1', 'KI270521.1',
       'KI270528.1', 'KI270538.1', 'KI270581.1', 'KI270706.1',
       'KI270707.1', 'KI270708.1', 'KI270709.1', 'KI270710.1',
       'KI270711.1', 'KI270712.1', 'KI270713.1', 'KI270714.1',
       'KI270716.1', 'KI270717.1', 'KI270718.1', 'KI270719.1',
       'KI270720.1', 'KI270721.1', 'KI270722.1', 'KI270723.1',
       'KI270724.1', 'KI270725.1', 'KI270726.1', 'KI270727.1',
       'KI270728.1', 'KI270729.1', 'KI270730.1', 'KI270731.1',
       'KI270732.1', 'KI270733.1', 'KI270734.1', 'KI270735.1',
       'KI270736.1', 'KI270737.1', 'KI270738.1', 'KI270739.1',
       'KI270741.1', 'KI270742.1', 'KI270743.1', 'KI270

We will omit elements that are not located on the canonical chromosomes.

In [16]:
class GenomicElement:
    """A genomic interval (contig, start, end) built from one row of the regulatory build file."""
    def __init__(self, contig, start, end):
        self._chrom = contig
        # Reject non-integer coordinates early so that later comparisons are safe.
        if not isinstance(start, int) or not isinstance(end, int):
            raise ValueError("Start and End need to be integers")
        self._start = start
        self._end = end
    
    @property
    def chrom(self):
        return self._chrom
    
    @property
    def start(self):
        return self._start
    
    @property
    def end(self):
        return self._end
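
Testing the hypothesis requires deciding whether a digest and a regulatory element overlap. The sketch below shows one way to write that check. It uses a stand-in `Interval` dataclass (hypothetical, with the same `chrom`, `start` and `end` attributes as the class above) so that it runs on its own, and it assumes 1-based, closed coordinates as in GFF; if the digest coordinates follow a different convention, the comparison would need adjusting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Stand-in for a genomic element; hypothetical, for illustration only."""
    chrom: str
    start: int
    end: int


def overlaps(a, b) -> bool:
    """Return True if a and b share at least one base (closed coordinates assumed)."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


enhancer = Interval("1", 1000, 1200)
print(overlaps(enhancer, Interval("1", 1200, 1500)))  # touching at position 1200
print(overlaps(enhancer, Interval("1", 1201, 1500)))  # adjacent, no shared base
print(overlaps(enhancer, Interval("2", 1000, 1200)))  # different chromosome
```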

In [17]:
valid_contigs = {'X', 'Y',
       '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12',
       '13', '14', '15', '16', '17', '18', '19', '20', '21', '22'}

In [21]:
elements = defaultdict(list)
df = df.reset_index()  # renumber rows 0..n-1 (the old index is kept as a new 'index' column)
for index, row in df.iterrows():
    if row['seqname'] in valid_contigs:
        elem = GenomicElement(contig=row['seqname'], start=row['start'], end=row['end'])
        elements[row['feature']].append(elem)
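
Iterating with `iterrows()` over every row can be slow for a file of this size. As a possible alternative, the sketch below filters to the canonical chromosomes first with `isin` and then iterates with `itertuples`. It runs on a small toy data frame (the `GL000008.2` row reuses coordinates from the output above; the other rows are invented), and it stores plain tuples instead of `GenomicElement` objects to stay self-contained.

```python
from collections import defaultdict

import pandas as pd

# Toy data: three rows on canonical chromosomes, one on an unplaced contig.
toy = pd.DataFrame({
    "seqname": ["1", "GL000008.2", "X", "1"],
    "feature": ["enhancer", "open_chromatin_region", "promoter", "enhancer"],
    "start": [100, 103733, 500, 900],
    "end": [200, 104006, 800, 950],
})
valid = {"1", "X"}  # shortened set of canonical chromosomes for the example

# Vectorised filter: keep only rows whose seqname is a canonical chromosome.
canonical = toy[toy["seqname"].isin(valid)]

# Group the remaining rows by feature type.
toy_elements = defaultdict(list)
for row in canonical.itertuples(index=False):
    toy_elements[row.feature].append((row.seqname, int(row.start), int(row.end)))

# The per-feature counts can also be read directly from the filtered frame.
counts = canonical["feature"].value_counts().to_dict()
print(counts)
```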

In [22]:
for k, v in elements.items():
    print(f"{k} n={len(v)}")

CTCF_binding_site n=175885
enhancer n=127935
open_chromatin_region n=110101
promoter n=36597
promoter_flanking_region n=140548
TF_binding_site n=30303
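
With more than 100,000 elements per category, comparing every digest against every element would be expensive. One option, sketched below under the assumption of 1-based closed coordinates, is to sort the elements of one category per chromosome and answer "does this region overlap any element?" with a binary search. The function names `build_index` and `overlaps_any` are hypothetical, and the toy intervals are invented. A running maximum of the end coordinates keeps the check correct even when elements overlap each other.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(intervals):
    """Build a per-chromosome index from (chrom, start, end) tuples."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, ivs in by_chrom.items():
        ivs.sort()
        starts = [s for s, _ in ivs]
        # max_ends[i] is the largest end among the first i + 1 sorted intervals.
        max_ends, running = [], 0
        for _, e in ivs:
            running = max(running, e)
            max_ends.append(running)
        index[chrom] = (starts, max_ends)
    return index


def overlaps_any(index, chrom, start, end):
    """Return True if [start, end] overlaps at least one indexed interval."""
    if chrom not in index:
        return False
    starts, max_ends = index[chrom]
    i = bisect_right(starts, end)  # number of intervals that start at or before `end`
    return i > 0 and max_ends[i - 1] >= start


# Toy enhancer intervals, invented for illustration.
enhancer_index = build_index([("1", 100, 200), ("1", 500, 600), ("2", 50, 60)])
print(overlaps_any(enhancer_index, "1", 150, 160))
print(overlaps_any(enhancer_index, "1", 201, 499))
```

Counting how many non-enriched digests of each interaction category return `True` would give the numbers needed for a comparison between balanced and unbalanced interactions.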


# Analyze Interactions at baits

In [23]:
import sys
import os
import pandas
import copy
sys.path.append("..")
from diachr import DiachromaticInteractionSet
from diachr import BaitedDigest
from diachr import BaitedDigestSet

# Import diachromatic file
Enter the path to a Diachromatic interaction file such as /JAV_MK_RALT_20000_st_fdr0.05_evaluated_and_categorized_interactions.tsv.gz

In [24]:
diachromatic_file = input()

/home/peter/data/diachromatic/JAV_MK_RALT_20000_st_fdr0.05_evaluated_and_categorized_interactions.tsv.gz


In [25]:
# Create DiachromaticInteractionSet
d11_interaction_set = DiachromaticInteractionSet(rpc_rule = 'ht')
d11_interaction_set.parse_file(
    i_file = diachromatic_file,
    verbose = True)

[INFO] Parsing Diachromatic interaction file ...
	[INFO] /home/peter/data/diachromatic/JAV_MK_RALT_20000_st_fdr0.05_evaluated_and_categorized_interactions.tsv.gz
	[INFO] Parsed 1,000,000 interaction lines ...
	[INFO] Parsed 2,000,000 interaction lines ...
	[INFO] Parsed 3,000,000 interaction lines ...
	[INFO] Parsed 4,000,000 interaction lines ...
	[INFO] Parsed 5,000,000 interaction lines ...
	[INFO] Set size: 5,249,507
[INFO] ... done.


Next, we create a BaitedDigestSet and pass the DiachromaticInteractionSet.

In [26]:
baited_digest_set = BaitedDigestSet()
read_interactions_info_dict = baited_digest_set.ingest_interaction_set(d11_interaction_set, verbose=True)

[INFO] Reading interactions and group them according to chromosomes and baited digests ...
	[INFO] Read 1,000,000 interactions ...
	[INFO] Read 2,000,000 interactions ...
	[INFO] Read 3,000,000 interactions ...
	[INFO] Read 4,000,000 interactions ...
	[INFO] Read 5,000,000 interactions ...
	[INFO] Total number of interactions read: 5,249,507
	[INFO] Total number of baited digests: 21,700
[INFO] ... done.


In [27]:
print(baited_digest_set.get_ingest_interaction_set_info_report())

[INFO] Report on ingestion of interactions:
	[INFO] Total number of interactions read: 5,249,507
	[INFO] Discarded NN and EE interactions: 330,224
	[INFO] Total number of ingested NE and EN interactions: 4,919,283
	[INFO] Broken down by interaction category and enrichment status: 
		[INFO] DIX: 
			[INFO] NE: 65
			[INFO] EN: 80
		[INFO] DI: 
			[INFO] NE: 97,547
			[INFO] EN: 100,355
		[INFO] UIR: 
			[INFO] NE: 99,118
			[INFO] EN: 98,784
		[INFO] UI: 
			[INFO] NE: 2,262,625
			[INFO] EN: 2,260,709
		[INFO] ALL: 
			[INFO] NE: 2,459,355
			[INFO] EN: 2,459,928
	[INFO] Total number of baited digests: 21,700
[INFO] End of report.
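
As a quick sanity check, the numbers in this report can be verified against each other. The sketch below copies the counts from the report above into a dictionary (the variable names are chosen here for illustration) and confirms that the categories add up to the reported totals.

```python
# Counts copied from the ingestion report above.
report = {
    "DIX": {"NE": 65, "EN": 80},
    "DI": {"NE": 97_547, "EN": 100_355},
    "UIR": {"NE": 99_118, "EN": 98_784},
    "UI": {"NE": 2_262_625, "EN": 2_260_709},
}
total_read = 5_249_507
discarded = 330_224
ingested = 4_919_283

# Sum the NE and EN columns over all interaction categories.
ne_total = sum(c["NE"] for c in report.values())
en_total = sum(c["EN"] for c in report.values())
per_category = {name: c["NE"] + c["EN"] for name, c in report.items()}

assert ne_total == 2_459_355              # matches "ALL: NE"
assert en_total == 2_459_928              # matches "ALL: EN"
assert ne_total + en_total == ingested    # NE + EN equals the ingested total
assert total_read - discarded == ingested # read minus discarded equals ingested
print(per_category)
```

The per-category sums also show that DI and UIR contain the same number of interactions (197,902 each).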



In [29]:
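di_baited_digests = 0
di_num_total = 0
count = 0
# Peek at the first baited digests of the first chromosome only.
for chrom in baited_digest_set._baited_digest_dict.keys():
    for baited_digest_key, baited_digest in baited_digest_set._baited_digest_dict[chrom].items():
        print(baited_digest)
        count += 1
        if count > 20:  # stop after 21 digests have been printed
            break
    break  # do not continue to the remaining chromosomes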

<diachr.baited_digest.BaitedDigest object at 0x7f0b0be29250>
<diachr.baited_digest.BaitedDigest object at 0x7f0b0be298b0>
<diachr.baited_digest.BaitedDigest object at 0x7f0b0be29c40>
<diachr.baited_digest.BaitedDigest object at 0x7f0b0bdcd5b0>
<diachr.baited_digest.BaitedDigest object at 0x7f0b0bdf2550>
<diachr.baited_digest.BaitedDigest object at 0x7f0b0bdf2b50>
<diachr.baited_digest.BaitedDigest object at 0x7f0b0bdf2eb0>
<diachr.baited_digest.BaitedDigest object at 0x7f0b0bd982b0>
<diachr.baited_digest.BaitedDigest object at 0x7f0b0bd98490>
<diachr.baited_digest.BaitedDigest object at 0x7f0b0bd98910>
<diachr.baited_digest.BaitedDigest object at 0x7f0b0bdbc310>
<diachr.baited_digest.BaitedDigest object at 0x7f0b0bdbc730>
<diachr.baited_digest.BaitedDigest object at 0x7f0b0bdbc970>
<diachr.baited_digest.BaitedDigest object at 0x7f0b0bdbce50>
<diachr.baited_digest.BaitedDigest object at 0x7f0b0bd60130>
<diachr.baited_digest.BaitedDigest object at 0x7f0b0bd60430>
<diachr.baited_digest.Ba