# COMP 561 Final Project

## Objectives
1. Identifying a set of bound and non‐bound DNA sequences for a given TF based on existing experimental data.
2. Calculating the DNA physical properties of each sequence.
3. Training a machine learning classifier to distinguish between bound and unbound sites.

## My Interpretation
1. Find bound and non-bound regions using the publicly available data
2. Use GBshape to identify the physical properties of each bound and non-bound region
3. Train a machine learning classifier using physical properties as features and bound vs non-bound as labels

## We have:
- human genome assembly hg19
- active regulatory regions of GM12878
- TF binding sites
- position weight matrix

## Questions
- why do we need the entire human genome?
    - end goal is to use the classifier to identify binding regions given a sequence and DNA physical properties
- what do we use the position weight matrix for?
    - we can compare the performance of this to our machine learning classifier
- what else?
    - train a classifier on just the sequence data and try to use it to identify transcription factors
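
The PWM comparison mentioned above could be sketched roughly as follows. This is a minimal illustration, assuming the PWM is stored as one probability dictionary per position and the background is uniform; the function name `pwm_log_odds` and the toy matrix values are hypothetical and are not the project's actual matrix.

```python
import math

def pwm_log_odds(seq, pwm, background=0.25):
    """Sum of per-position log2(p / background) for seq under pwm.

    pwm is a list with one dict per position mapping nucleotide -> probability.
    """
    if len(seq) != len(pwm):
        raise ValueError("sequence and PWM must have the same length")
    return sum(math.log2(pwm[i][nt] / background) for i, nt in enumerate(seq))

# Toy 3-position matrix (hypothetical values, for illustration only)
toy_pwm = [
    {'A': 0.7, 'C': 0.1, 'G': 0.1, 'T': 0.1},
    {'A': 0.1, 'C': 0.7, 'G': 0.1, 'T': 0.1},
    {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25},
]

score_match = pwm_log_odds('ACG', toy_pwm)     # agrees with the preferred bases
score_mismatch = pwm_log_odds('TTG', toy_pwm)  # disagrees at the first two positions
```

Thresholding such a score gives a simple sequence-only baseline to compare the classifier against.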
   
## Assumptions and Decisions:
- We will use a single chromosome and a single transcription factor. This reduces the size of the problem in a scalable way: if the approach succeeds here, it should carry over to the larger problem.

## Approach

### Data Generation
The data for our classifier will be binding sites on a single chromosome for a single transcription factor. We will pick one of the more common transcription factors so that we have more data points (~15 000). For the physical features of DNA, we will select from features available online and will most likely restrict ourselves to simpler properties (i.e. no second-order features). We will select 14 000 data points from binding sites and label them as class 1. For non-binding sites, we will iterate through regulatory regions and randomly select 15-nucleotide sequences that are known not to be binding sites. We considered iterating through every sequence in the genome, but the presence of repeats and unspecified nucleotides made that difficult. It also seems more realistic for scientists to scan regulatory regions in search of binding sites than to scan the entire genome (which is expensive, though it may yield interesting results).

### Data Representation
We will represent each data point as a multidimensional array. The simplest data point will be a vector of nucleotides, but we will likely also include other information (i.e. physical properties) at each nucleotide.
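
One way to turn a vector of nucleotides into numbers is a one-hot encoding. The sketch below is only an illustration: the helper name `one_hot` and the channel order `ATGC` are assumptions, and a character outside that alphabet (such as `N`) simply gets zeros in every channel.

```python
NT_ORDER = 'ATGC'  # channel order; an arbitrary but fixed choice

def one_hot(seq, order=NT_ORDER):
    """Return a flat list: one block of len(seq) 0/1 flags per nucleotide in order."""
    encoded = []
    for nt in order:
        # block for this nucleotide: 1 where the sequence has it, 0 elsewhere
        encoded.extend(1 if base == nt else 0 for base in seq)
    return encoded

vec = one_hot('AACG')
```

A 15-nt site encoded this way yields 4 x 15 = 60 features, and extra per-position properties can be appended as further blocks.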

In [1]:
# variables
chrom='chr1'
tf_name='CTCF'

## 1. Identifying bound/non-bound sequences

In [2]:
# input file: set of genomic regions that are active regulatory regions in GM12878
# a regulatory region is characterized by: chromosome #, start nt, end nt, ID (ex. chr2.3 is the 3rd site on chr2)
reg_regions = []
with open('data/wgEncodeRegTfbsClusteredV3.GM12878.merged.bed', 'r') as file:
    for line in file:
        region = line.rstrip().split('\t')
        
        # only take regions on chromosome 1
        # (assumes the chr1 rows come first in the file, so we stop at the first other chromosome)
        if region[0] != chrom: 
            break
        reg_regions.append(region)

In [3]:
reg_regions[-1]
# len(reg_regions)

['chr1', '249218925', '249219295', 'chr1.7404']
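
The loading cell above stops at the first row that is not on the chosen chromosome, which relies on the file being ordered with that chromosome first. A sketch that does not depend on row order is shown below; the helper name `read_bed` and the toy rows are hypothetical, and the four-column layout is assumed from the comment in the cell above.

```python
import io

def read_bed(handle, chrom):
    """Yield (chrom, start, end, name) for rows on chrom; coordinates as ints."""
    for line in handle:
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 4 or fields[0] != chrom:
            continue  # skip other chromosomes instead of stopping at the first one
        yield fields[0], int(fields[1]), int(fields[2]), fields[3]

# Toy input standing in for the real file (made-up coordinates)
toy_bed = io.StringIO(
    "chr2\t10\t50\tchr2.1\n"
    "chr1\t100\t200\tchr1.1\n"
    "chr1\t300\t450\tchr1.2\n"
)
rows = list(read_bed(toy_bed, 'chr1'))
```

Converting the coordinates to `int` once at load time also avoids the repeated `int(...)` calls in later cells.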

In [4]:
# input file: set of genomic coordinates of transcription factor binding sites for several transcription factors
# we chose to look at a single TF: 'CTCF'

# binding_sites is all bound sequences
binding_sites = []

# count frequencies of each tf
tf_freq = {}

with open ('data/factorbookMotifPos.txt', 'r') as file:
    for line in file:
        region = line.rstrip().split('\t')[1:]
        # only take regions on chromosome 1
        # (assumes the chr1 rows come first in the file, so we stop at the first other chromosome)
        if region[0] != chrom:
            break
            
        key = region[3]
        if key in tf_freq:
            tf_freq[key] += 1
        else:
            tf_freq[key] = 1
            
        if key == tf_name:        
            binding_sites.append(region)

In [5]:
tf_freq

{'UA1': 767,
 'CTCF': 14648,
 'NFY': 2653,
 'FOXA': 4950,
 'RUNX1': 2792,
 'UAK25': 2678,
 'UAK26': 1149,
 'UAK27': 1705,
 'v-Maf': 4302,
 'USF': 4778,
 'BHLHE40': 1609,
 'HNF4': 2487,
 'RXRA': 1190,
 'AP1': 15471,
 'NFE2': 2281,
 'MYC': 2883,
 'NRF1': 3987,
 'PAX5': 1441,
 'YY1': 2557,
 'UAK17': 791,
 'UAK18': 473,
 'SP1': 8720,
 'v-JUN': 3030,
 'CREB-ext': 811,
 'EBF1': 4181,
 'ZNF143-ext': 1801,
 'EGR1': 7011,
 'UAK42': 14428,
 'RFX5': 1055,
 'UA6': 90,
 'UAK52': 887,
 'UAK30': 974,
 'CEBPB': 5596,
 'ELF1': 2456,
 'NFKB1': 2680,
 'MAX': 2859,
 'UAK41': 635,
 'ZNF263': 7595,
 'UA2': 665,
 'GABP': 1645,
 'E2F4': 5039,
 'UAK21': 162,
 'NR2C2': 242,
 'TAL1': 1378,
 'ZEB1': 325,
 'GATA1-ext': 2033,
 'UAK36': 696,
 'AP2': 3167,
 'TEAD4': 1665,
 'UA7': 240,
 'CTCF-ext': 2364,
 'POU2F2': 540,
 'MEF2': 890,
 'UAK61': 340,
 'TCF12': 2394,
 'ZNF281': 1702,
 'UA3': 3275,
 'UAK29': 1412,
 'GATA1': 3542,
 'GATA3': 821,
 'UA9': 870,
 'TCF3': 1322,
 'ESR1': 1296,
 'E2F1': 3601,
 'STAT1': 2351,
 'BA

In [6]:
# binding_sites[-1]
len(binding_sites)

14648

In [7]:
# check that every binding site has the same length

len_sites = {}
for site in binding_sites:
    key = int(site[2]) - int(site[1])
    if key in len_sites:
        len_sites[key] += 1
    else:
        len_sites[key] = 1

In [8]:
len_sites

{15: 14648}

In [9]:
non_binding = []
k = 0
start = 0
for region in reg_regions:
    # print to track progress
    k += 1
    if k % 10 == 0 or k == len(reg_regions):
        print('\r' + str(k) + '/' + str(len(reg_regions)), end='')

    # Check whether the region contains a binding site.
    # Logic: for any given region, the first potential TF binding site cannot come before the
    # first potential TF binding site of the previous region (this assumes both lists are sorted
    # by start coordinate). We store that index as new_start and update start after each region.
    new_start = -1
    contains = False
    for i in range(start, len(binding_sites)):
        if region[0] != binding_sites[i][0] or int(binding_sites[i][1]) < int(region[1]):
            continue
        # reached only when the binding site starts at or after the start of the region
        # new_start is -1 only until the first such site is seen
        if new_start == -1:
            new_start = i
        # the binding site starts after the region ends: no overlap with this region
        if int(binding_sites[i][1]) > int(region[2]):
            break
        # the binding site lies entirely inside the region
        if int(binding_sites[i][2]) <= int(region[2]):
            contains = True
            break
    if not contains:
        non_binding.append(region)
    # keep the previous start if no candidate site was found (avoids starting from index -1)
    if new_start != -1:
        start = new_start

7404/7404

In [10]:
len(non_binding)

2417
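
The containment test used above (site start at or after the region start, site end at or before the region end) can also be written as a binary search. This is a sketch under two assumptions: the site start coordinates are sorted in ascending order, and every site has the same length, as the length check earlier found. The helper name `contains_site` and the toy coordinates are hypothetical.

```python
from bisect import bisect_left

def contains_site(region_start, region_end, site_starts, site_len):
    """True if some site [s, s + site_len] lies entirely within the region.

    site_starts must be sorted in ascending order.
    """
    # first site starting at or after region_start; if it does not fit, no later one does
    i = bisect_left(site_starts, region_start)
    return i < len(site_starts) and site_starts[i] + site_len <= region_end

starts = [100, 250, 400]  # toy coordinates, not real data
```

Each region then costs one `O(log n)` lookup, with no index to carry over between regions.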

In [11]:
with open('data/non_binding_regions.txt', 'w') as f:
    for region in non_binding:
        f.write(region[0] + '\t' + region[1] + '\t' + region[2] + '\t' + region[3] + '\n')

## 1.1 Converting experimental sequence data into a form that a classifier can accept

We do this by converting binding information into a vector of nucleotides. This will require the sequence of chromosome 1, which we have. In addition, we will sample regulatory regions without binding sites to build a sample of non-binding data points. 

Note: if a TF binds on the negative strand, we will reverse the DNA sequence and take the complement.
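
A compact way to reverse-complement a sequence is `str.translate`. The sketch below is an alternative illustration rather than the function used in this notebook; the name `reverse_complement` is hypothetical, and unlike the `complement` function defined below it keeps characters outside A/C/G/T (such as `N`) instead of dropping them, so the length never changes.

```python
_COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string; other characters are kept as-is."""
    return seq.translate(_COMPLEMENT)[::-1]

# 15-nt sequence shown in the chr_seq[16245:16260] output of this notebook
rc = reverse_complement('GCCAGCAGAGGGGTT')
```

Applying the function twice returns the original sequence, which makes for a quick sanity check.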

In [12]:
# input file: sequence of chromosome 1
chr_seq = ""
chr_lines = []
with open('data/'+chrom+'.fa', 'r') as file:
    next(file)
    chr_lines = file.read().splitlines()
chr_seq = ''.join(chr_lines).upper()

In [13]:
chr_seq[16245:16260]

'GCCAGCAGAGGGGTT'

In [14]:
def complement(sequence):
    '''
    Returns the reverse complement of a sequence of capital letters (A, T, G, C).
    Any other character (e.g. N) is silently dropped.
    '''
    pairs = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
    s = [pairs[nt] for nt in sequence if nt in pairs]
    return ''.join(s)[::-1]

In [15]:
binding_sites

[['chr1', '16245', '16260', 'CTCF', '1.97', '-'],
 ['chr1', '91265', '91280', 'CTCF', '1.4', '+'],
 ['chr1', '91419', '91434', 'CTCF', '2.07', '+'],
 ['chr1', '91421', '91436', 'CTCF', '2.8', '-'],
 ['chr1', '91431', '91446', 'CTCF', '1.95', '-'],
 ['chr1', '104986', '105001', 'CTCF', '2.83', '+'],
 ['chr1', '138972', '138987', 'CTCF', '3.43', '-'],
 ['chr1', '237749', '237764', 'CTCF', '2.25', '+'],
 ['chr1', '237751', '237766', 'CTCF', '3.15', '-'],
 ['chr1', '237761', '237776', 'CTCF', '2.57', '-'],
 ['chr1', '521532', '521547', 'CTCF', '2.93', '+'],
 ['chr1', '521534', '521549', 'CTCF', '3.45', '-'],
 ['chr1', '521544', '521559', 'CTCF', '2.53', '-'],
 ['chr1', '545980', '545995', 'CTCF', '3.27', '-'],
 ['chr1', '546048', '546063', 'CTCF', '3.27', '-'],
 ['chr1', '546116', '546131', 'CTCF', '3.74', '-'],
 ['chr1', '664719', '664734', 'CTCF', '2.54', '-'],
 ['chr1', '664870', '664885', 'CTCF', '2.02', '-'],
 ['chr1', '714182', '714197', 'CTCF', '2.21', '+'],
 ['chr1', '714272', '714

In [16]:
sequences=[]
y_b=[]

for site in binding_sites:
    seq = chr_seq[int(site[1]): int(site[2])]
    if site[5] == '-':
        sequences.append(complement(seq))
    else:
        sequences.append(seq)
    y_b.append(1)

In [17]:
sequences[90:100]

['GCCCCCTCACGTGGG',
 'CTCCCCTCCCCCGGC',
 'GGTCCCCCCGGAGGC',
 'CCGCCCGCTGCTGCC',
 'GGGCCACCTGGCAGC',
 'CTCCCGCCTGCTGGG',
 'GCGGCCTCCGCGGGC',
 'GCGCCCCCCTCCGAC',
 'GCGGCCCCTCCCGGC',
 'ACACCACCTCCTGGG']

In [18]:
y_b[90:100]

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

In [19]:
letters={'A': 0, 'T': 0, 'G': 0, 'C': 0}
for seq in sequences:
    for letter in seq:
        letters[letter]+=1

In [20]:
total = 0
for letter in letters:
    total += letters[letter]

for letter in letters:
    letters[letter] /= total

In [21]:
# nucleotide distribution in the binding sites
letters

{'A': 0.08544511196067722,
 'T': 0.19506189695976697,
 'G': 0.2802976515565265,
 'C': 0.4391953395230293}

In [22]:
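# convert sequences to a form that can be accepted by a machine learning classifier
# each sequence becomes four concatenated 0/1 blocks (A, T, G, C), one entry per position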

X_b=[]
nt_ind={'A':0, 'T':1, 'G':2, 'C':3}
for seq in sequences:
    x_seq = [[],[],[],[]]
    x=[]
    for nt in seq:
        x_seq[nt_ind[nt]].append(1)
        for i in range(4):
            if i == nt_ind[nt]:
                continue
            x_seq[i].append(0)
    for i in x_seq:
        x.extend(i)
    X_b.append(x)

In [23]:
X_b[:5]

[[1,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  1,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  1,
  1,
  0,
  0,
  0,
  1,
  1,
  1,
  1,
  0,
  1,
  0,
  0,
  1,
  0,
  0,
  0,
  1],
 [0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  1,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  1,
  1,
  0,
  0,
  0,
  1,
  0,
  1,
  1,
  1,
  1,
  1,
  0,
  0,
  0,
  0,
  0,
  1,
  0],
 [0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  1,
  0,
  1,
  1,
  0,
  0,
  0,
  0,
  0,
  1,
  1,
  0,
  1,
  1,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  1,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  1],
 [0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  

## 1.2 Generate set of non-binding regions for classification

Generate every 15-nt sequence (with overlaps) from the non-binding regions and randomly select a subset of them (roughly 15k; the sampling cell below draws `2*len(binding_sites)`).
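
The window generation and sampling described above can be sketched as follows. This is a minimal illustration on a toy string; the helper name `windows` is hypothetical, and filtering out windows with unspecified nucleotides and fixing the random seed are assumptions added for robustness and reproducibility, not steps taken from the cells below.

```python
import random

def windows(seq, k):
    """All overlapping length-k substrings of seq that contain only A, C, G, T."""
    valid = set('ACGT')
    return [seq[i:i + k] for i in range(len(seq) - k + 1) if set(seq[i:i + k]) <= valid]

toy_region = 'ACGTNACGTACG'  # toy sequence, not real data
candidates = windows(toy_region, 4)

rng = random.Random(0)  # fixed seed so the sample is reproducible
sample = rng.sample(candidates, 3)
```

Dropping windows that contain `N` up front keeps a later one-hot lookup from failing on an unknown character.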

In [24]:
nb_seq=[]
for region in non_binding:
    region_seq=chr_seq[int(region[1]):int(region[2])]
    if len(region_seq) < len(sequences[0]):
        continue
    for i in range(len(region_seq)-len(sequences[0])+1):
        nb_seq.append(region_seq[i:i+len(sequences[0])])

In [25]:
# nb_seq[:10]
len(nb_seq)

1544778

In [36]:
import random

nb_seq_sample=random.sample(nb_seq, 2*len(binding_sites))

In [27]:
nb_seq_sample[:10]

['CAAGCCAGCCAGAAT',
 'ACCAGCTAAGGGTGG',
 'ATAATCTTGTCAATA',
 'AATAATGATTAGAAT',
 'ACAGTGCCTTTTTTT',
 'TGTCAAGCCACAGTC',
 'AGCCTTCCCCTCACT',
 'TAAGATGCCAGGACA',
 'AATCAATAAAAGTTG',
 'TAAGTAGGTTCTCAA']

In [37]:
X_nb=[]
y_nb=[]

nt_ind={'A':0, 'T':1, 'G':2, 'C':3}
for seq in nb_seq_sample:
    x_seq = [[],[],[],[]]
    x=[]
    for nt in seq:
        x_seq[nt_ind[nt]].append(1)
        for i in range(4):
            if i == nt_ind[nt]:
                continue
            x_seq[i].append(0)
    for i in x_seq:
        x.extend(i)
    X_nb.append(x)
    y_nb.append(0)

In [29]:
X_nb[:10]

[[0,
  1,
  1,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  1,
  0,
  1,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  1,
  1,
  0,
  0,
  1,
  1,
  0,
  0,
  0,
  0,
  0],
 [1,
  0,
  0,
  1,
  0,
  0,
  0,
  1,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  1,
  1,
  1,
  0,
  1,
  1,
  0,
  1,
  1,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0],
 [1,
  0,
  1,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  1,
  0,
  1,
  0,
  1,
  0,
  0,
  1,
  0,
  1,
  1,
  0,
  1,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0],
 [1,
  1,
  0,
  1,
  1,
  0,
  0,
  1,
  0,
  0,
  1,
  0,
  1,
  1,
  0,
  0,
  0,
  1,
  0,
  

## 2. Calculating DNA physical properties of each sequence

Steps:
1. download table of all physical properties of a 

In [30]:
str(len([]))

'0'

## 3. Train a Machine Learning Classifier

1. Start by training a classifier using just sequence data. Then use a similar technique but incorporate physical data as well. See if there is any difference.
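
Incorporating physical data could amount to appending per-window property values to the sequence features. The sketch below assumes the property is tabulated per k-mer (pentamer tables are a common choice for DNA shape features); the names `shape_profile` and `combined_features` are hypothetical, and the table holds placeholder numbers, not real measurements.

```python
def shape_profile(seq, table, k=5):
    """Look up one value per k-mer window of seq; unknown k-mers map to 0.0."""
    return [table.get(seq[i:i + k], 0.0) for i in range(len(seq) - k + 1)]

def combined_features(one_hot_vec, seq, table, k=5):
    """Sequence features followed by the per-window physical-property values."""
    return list(one_hot_vec) + shape_profile(seq, table, k)

# Placeholder numbers only; real values would come from a published property table
toy_table = {'AAAAA': 1.0, 'AAAAC': 2.0}
profile = shape_profile('AAAAAC', toy_table)
features = combined_features([1, 0], 'AAAAAC', toy_table)
```

Training the same classifier once on the sequence features alone and once on the combined vector gives the comparison described in the step above.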

In [31]:
# baseline classifier without considering DNA physical properties

from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X=[]
X.extend(X_b)
X.extend(X_nb)

y=[]
y.extend(y_b)
y.extend(y_nb)

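# hold out 20% for validation, then split the remaining 80% into 75% train / 25% test
# overall: 60% train, 20% test, 20% validation
X_other, X_valid, y_other, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_other, y_other, test_size=0.25, random_state=42)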

In [32]:
svc=SVC()
svc.fit(X_train, y_train)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [33]:
test_pred=svc.predict(X_test)

In [35]:
from sklearn import metrics

metrics.f1_score(y_test, test_pred)

0.9977512541082858
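
For reference, the F1 score above is the harmonic mean of precision and recall for the positive class (label 1, bound). The sketch below recomputes it from label lists in plain Python; the function name `f1_from_labels` and the toy labels are hypothetical, and it is meant to show the arithmetic rather than replace `metrics.f1_score`.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """F1 for the positive class: 2 * precision * recall / (precision + recall)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

toy_true = [1, 1, 0, 0, 1]
toy_pred = [1, 0, 0, 1, 1]
toy_f1 = f1_from_labels(toy_true, toy_pred)
```

Reporting precision and recall separately alongside F1 shows whether errors are mostly missed sites or false alarms.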

In [39]:
pred=svc.predict(X_nb)

In [40]:
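# expected labels for the non-binding sample (all 0), despite the variable name
y_pred=[0 for i in range(2*len(binding_sites))]

In [42]:
# accuracy_score takes the true labels first, then the predictions
metrics.accuracy_score(y_pred, pred)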

0.996586564718733