# Training a DeepBind model

By Ziga Avsec (Avsecz on GitHub).

Now that you have learned the basics in the previous exercise, let's look at a real-world example. We are going to use a set of sequences at the ChIP-seq peaks of the FoxA1 transcription factor, as done by DeepBind (https://www.nature.com/articles/nbt.3300).

## Install the dependencies

In [None]:
!pip install concise

You can download the convenience functions from the previous exercise and import them directly:

In [None]:
!wget https://raw.githubusercontent.com/Avsecz/DL-genomics-exercise/master/utils.py

In [1]:
from utils import evaluate, plot_filters, plot_history, plot_seq_importance

Using TensorFlow backend.


## Loading the DeepBind data

The raw sequences from the ChIP-seq peaks, provided in the supplementary material of the article, look like this:

```
FoldID	EventID	seq	Bound
A	seq_00001_peak	CTGAAATTGCTGTATTTACCTTGAAAACCACAAACTGTAAACAGGGCACCTGTTCAGAGAAGATCTTCAAACTGCTCACTCACTAAATCAACACCTGGGAA	1
A	seq_00003_peak	GAGGCACTGGGGCAGAACAAATTTGCACAGTGTCTGCTGTGGACATAAGGGATTTCTTCAGCCCTATGCAAATAGTAATCCCACTAGTTCCCAGAAGATAA	1
A	seq_00005_peak	CTTTTACTATTTACCTTGGCAAGTCCAGGACCGGATTGATGATCTGTAAAGTGGATTTGTTATTTGGCTGTTTGCTTTGGCAGCTCTTGAAAGCACTTTGC	1
A	seq_00007_peak	AAAACAGATGTTGCAACAAGCAAACATCTTTGTGATGACACAGTGATGTTATCGTAGCTGTATAAACAACCGTATGGCCTTTGGCCTGAGATTCCGGAAGC	1
A	seq_00009_peak	CAGTGTTTGCCCTTCCAAAGCCAGAGCCATAAAAGGCAGCTTTCAAAGTCACTGCCGCAGAAATGTCAACATGAGGGGGAGGCCAGTCATGGTTTCTGAGG	1
A	seq_00011_peak	GAGGCCCCATGCTCATTTTTTTCCTCTCCCAGAATCTCAGAAAAGTAAATAACCACCCGAGCTGCTCTAGCGGGTAAACAGCCCAGAGTTTGCTCTCCTAA	1
A	seq_00013_peak	GAGAATGGAGAGAAGCAGCTAGGCAAATAATTGGCAAGAAAAGTAAACAGTTACAGTGCAGCTTTGTTTACCCACTCTGCCTATCTGCGTTTCTGAAATTG	1
A	seq_00015_peak	TGGTTCCAAGTGTGTCAACAGCCTGTTGCTTTCTAGTTCAACAAGAGGGAATAATCTTTGGTAAACATGGCCGTTGGAAAAAAGCAAATATTTGTCTTGGC	1
A	seq_00017_peak	CCAGGAAGAAGAACGATAAAGCTTGTTGACTTTTGCTCTTTGGAGGCTATCTTTCTCCTAGCAGAGTAAACGCATCTCTAGGGGATTAAAGGCAGGCTCCA	1
```
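
The excerpt is a tab-separated table with one peak per row: a fold identifier, an event identifier, the sequence itself, and a `Bound` flag. As a quick illustration of that layout, the sketch below parses the header and the first two rows with the standard library only. The names `RAW`, `rows`, and `bound` are illustrative; the loader further down uses `pd.read_table` for the same job.

```python
import csv
import io

# Header plus the first two rows of the excerpt above (tab-separated).
RAW = (
    "FoldID\tEventID\tseq\tBound\n"
    "A\tseq_00001_peak\tCTGAAATTGCTGTATTTACCTTGAAAACCACAAACTGTAAACAGGGCACCTGTTCAGAGAAGATCTTCAAACTGCTCACTCACTAAATCAACACCTGGGAA\t1\n"
    "A\tseq_00003_peak\tGAGGCACTGGGGCAGAACAAATTTGCACAGTGTCTGCTGTGGACATAAGGGATTTCTTCAGCCCTATGCAAATAGTAATCCCACTAGTTCCCAGAAGATAA\t1\n"
)

# Each row becomes a dict keyed by the column names in the header.
rows = list(csv.DictReader(io.StringIO(RAW), delimiter="\t"))

# Keep only the bound peaks, mirroring the `df.Bound == 1` filter in the loader.
bound = [r for r in rows if int(r["Bound"]) == 1]

for r in bound:
    print(r["EventID"], len(r["seq"]), r["seq"][:12] + "...")
```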

The code below downloads, loads, and pre-processes the data for you:

In [2]:
import numpy as np  # used by load_deepbind_data (np.ceil, np.array)
import pandas as pd
from concise.preprocessing import encodeDNA

def load_deepbind_data(path, shuffle=True):
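    """Load DeepBind sequences as a one-hot-encoded array with binary labels.
    
    Args:
      path: path or URL to a DeepBind .seq.gz file.
        example: 'https://github.com/jisraeli/DeepBind/raw/master/data/encode/FOXA1_HepG2_FOXA1_(SC-101058)_HudsonAlpha_AC.seq.gz'
      shuffle: if True, permute the order of the examples with a fixed seed.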
    
    Returns:
      (one-hot-encoded array, labels)
    """
    import random
    from random import Random
    def dincl_shuffle_string(s):
        sl = [s[2*i:2*(i+1)] for i in range(int(np.ceil(len(s)/2)))]
        random.shuffle(sl)
        return "".join(sl)
    
    # Load the positive sequences
    df = pd.read_table(path)
    df = df[df.Bound==1]
    pos_seq = list(df.seq)
    
    # Generate the negative set by shuffling non-overlapping dinucleotides of each positive sequence
    neg_seq = [dincl_shuffle_string(s) for s in pos_seq]
    seqs = pos_seq + neg_seq
    labels = [1] * len(pos_seq) + [0] * len(neg_seq)
    
    # Permute the order
    idx = list(range(len(seqs)))
    if shuffle:
        Random(42).shuffle(idx)
        
    return encodeDNA(seqs)[idx], np.array(labels)[idx]
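
The negative set above is built by cutting each positive sequence into consecutive two-base chunks and permuting those chunks. This keeps the length and the base composition of the sequence while scrambling where any motif sits. The sketch below re-implements the helper in isolation to show that effect. Passing an explicit `rng` is an addition made here for reproducibility: the helper inside the loader uses the unseeded module-level `random`, so its negatives differ between runs.

```python
import math
import random
from collections import Counter

def dinucl_shuffle_string(s, rng):
    """Shuffle the non-overlapping 2-base chunks of s using the given RNG."""
    chunks = [s[2 * i:2 * (i + 1)] for i in range(math.ceil(len(s) / 2))]
    rng.shuffle(chunks)
    return "".join(chunks)

rng = random.Random(0)  # seed chosen only for this illustration
seq = "CTGAAATTGCTGTATTTACC"  # first 20 bases of the first example peak
shuffled = dinucl_shuffle_string(seq, rng)

print(seq)
print(shuffled)
# Length and base counts are unchanged; only the chunk order differs.
print(len(seq) == len(shuffled), Counter(seq) == Counter(shuffled))
```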

In [5]:
x_train, y_train = load_deepbind_data('https://github.com/jisraeli/DeepBind/raw/master/data/encode/FOXA1_HepG2_FOXA1_(SC-101058)_HudsonAlpha_AC.seq.gz')

In [6]:
x_test, y_test = load_deepbind_data('https://github.com/jisraeli/DeepBind/raw/master/data/encode/FOXA1_HepG2_FOXA1_(SC-101058)_HudsonAlpha_B.seq.gz')

In [7]:
x_train[0][:10]

array([[0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 0., 1., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 0., 1.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.]])
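
Each row of this output is one base and each column one nucleotide. The sketch below decodes the ten rows shown above back into letters and re-encodes them, assuming the column order A, C, G, T. That order should match the default vocabulary of `encodeDNA`, but it is an assumption worth checking against your installed version of concise. `one_hot_encode` and `decode_one_hot` are illustrative helpers, not part of concise.

```python
import numpy as np

BASES = np.array(["A", "C", "G", "T"])  # assumed column order

def one_hot_encode(seq):
    """Return a (len(seq), 4) array; characters outside ACGT stay all-zero."""
    index = {base: i for i, base in enumerate(BASES)}
    arr = np.zeros((len(seq), 4), dtype=float)
    for pos, base in enumerate(seq.upper()):
        if base in index:
            arr[pos, index[base]] = 1.0
    return arr

def decode_one_hot(arr):
    """Map each row to the base with the largest value.

    Note: an all-zero row would be decoded as 'A', so this is only
    meaningful for rows that contain exactly one 1.
    """
    return "".join(BASES[np.asarray(arr).argmax(axis=-1)])

# The first ten rows of x_train[0], copied from the output above.
first10 = np.array([
    [0., 0., 0., 1.],
    [0., 0., 0., 1.],
    [0., 0., 0., 1.],
    [0., 0., 1., 0.],
    [1., 0., 0., 0.],
    [1., 0., 0., 0.],
    [0., 1., 0., 0.],
    [0., 0., 0., 1.],
    [1., 0., 0., 0.],
    [0., 1., 0., 0.],
])

decoded = decode_one_hot(first10)
print(decoded)  # TTTGAACTAC under the assumed A, C, G, T order
```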

In [8]:
y_train[:10]

array([1, 0, 0, 1, 1, 0, 0, 0, 0, 0])

## TODO

Your task is now to:
- train a model
- evaluate it
- visualize the filters
- visualize the importance scores
- look up the discovered motif
  - Visit http://cisbp.ccbr.utoronto.ca/ and search for the FoxA1 motif
  - Compare against other known motifs with Tomtom: http://meme-suite.org/tools/tomtom
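
One possible starting point for the first task, offered as a minimal sketch rather than the reference solution: a single convolutional layer followed by global max pooling and a small dense head, loosely following the DeepBind layout. It assumes TensorFlow 2.x with its bundled Keras (`tensorflow.keras`) rather than the standalone Keras used earlier in this notebook. The function name `build_model` and its default hyperparameters (16 filters of width 11, 32 hidden units) are illustrative choices, not values taken from the paper.

```python
def build_model(seq_len=101, n_filters=16, kernel_size=11, hidden_units=32):
    """Build a small DeepBind-like binary classifier for one-hot DNA.

    Assumption: input arrays have shape (n_examples, seq_len, 4).
    TensorFlow is imported lazily so that defining the function does not
    require it to be installed.
    """
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(seq_len, 4)),
        # Each filter acts as a motif scanner sliding along the sequence.
        layers.Conv1D(n_filters, kernel_size, activation="relu"),
        # Keep the strongest match of each filter, wherever it occurs.
        layers.GlobalMaxPooling1D(),
        layers.Dense(hidden_units, activation="relu"),
        # Single sigmoid unit: predicted probability that the sequence is bound.
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

With the arrays loaded above, something like `model = build_model(seq_len=x_train.shape[1])` followed by `history = model.fit(x_train, y_train, validation_split=0.2, epochs=10, batch_size=128)` should train it. The epoch count, batch size, and validation split are again arbitrary and worth tuning.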

------------------------------------------------------