### Custom ML model to identify local similarities ("links") with sequence profiles

Let $$X \in \{0,1\}^{B\times N\times 6 \times T \times 22}$$ be a batch of **one-hot encoded translated input sequences**,
where $B$ is `batch_size`, $N$ is the number of genomes and $T$ is the `tile_size` (in aa).
The axis of size 6 indexes the translated reading frames in the order (0,+),(1,+),(2,+),(0,-),(1,-),(2,-).
The axis of size 22 indexes the amino acid alphabet used here: the 20 amino acids, the stop symbol `*` and a character for a missing amino acid.
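To make the layout of one tile concrete, here is a minimal NumPy sketch of the one-hot encoding. It is not the code used by `seq.getNextBatch`: the alphabet order is an assumption copied from the background-frequency printout in this notebook, and `one_hot_tile` and the six example frame strings are hypothetical.

```python
import numpy as np

# Assumption: alphabet order as printed with the background frequencies
# in this notebook; index 0 (a blank) stands for a missing amino acid.
ALPHABET = " CKEWTGYAINVHSDFMRLPQ*"

def one_hot_tile(aa_seq, tile_size):
    """Return a (tile_size, 22) one-hot matrix, padded with the missing character."""
    padded = aa_seq[:tile_size].ljust(tile_size)
    idx = np.array([ALPHABET.index(c) for c in padded])
    return np.eye(len(ALPHABET), dtype=np.float32)[idx]

# Hypothetical translations of one tile, one string per reading frame
frames = ["MKT*", "CRL", "AAAA", "GG", "WY*", "Q"]
tile = np.stack([one_hot_tile(f, tile_size=4) for f in frames])
print(tile.shape)  # (6, 4, 22): frames x tile positions x alphabet
```

Stacking such tiles over genomes and then over the batch would give the five-dimensional tensor $X$ described above.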


In [1]:
import numpy as np
import tensorflow as tf

import sequtils as su
import seq

### Create random genomes as toy data 

In [2]:
N = 4           # number of genomes
tile_size = 20  # tile size measured in amino acids
# A tile is a consecutive subsequence of _one_ contig/scaffold/chromosome.
# Tiles should be about gene-sized.

genome_sizes = [[210,100], [30,220,150], [230,110,120,90], [180]] # in nucleotides
genomes = seq.getRandomGenomes(N, tile_size, genome_sizes)

In [3]:
genomes

[['TGAGGAGTCCCCCAAGGGGCACCGCCTAGGTAGATAAGACCGGGACTAAACGTGAACATTGCGGGTTACGCGTTTCCCTTTTGTTAGTCATTTTTTCCGACGAGCTGATTGGCTATTTAGCGTTCACCAGGCATACGGTTCCCTACGCTGTGGGTGGGATCACTCGAGTTTGCATAAACTAGTGGGGTGTGCTGCCGTAGCATCCGAGAA',
  'GACAGTAGGCAATACTTATTCCGCTTTCTGCCTGCTTTCTGAAGGATTCTATCCATCGCAACACTACTAATAAAGCTTACATGCTGCGTTATTGAGCGCC'],
 ['ATGCATCATCATCAAATTGATATCACTCAT',
  'CGACCGGGTTCTCCCTAAGTAACCCTTGCCTTCGGCTGAGTTCGACAACCACATACGAGGAGTAGGTTCCTTACGGTTATGTAAAAACTCCAAGCATACTCACAGACATCTGACTAGAAGGGTCCCCTTCGGGCTTATAAGGTCCTTTTCAACTCCGATTGAATAGTGACGGTCACGACAATTCGCCTTAAGTCCAGTGCTCTTTGGTGCCACCTAACCT',
  'TAATGTGGTCGTGCCCCTGCTTAAGCGTTACACGCGTGCCATCAATCTGATAGAATCCTGGACTGTTCGCCAATACTAACATGGTGCATAATGCTCCCAGTGAGTAGCCTTGGTGAAAACAGAGGCGCATTACGGCCGGGCATCCTCAAC'],
 ['AGGGAGACCTTTTTTTAAATGACGCTGCCTTCGTTCTACGCAATATTTCACCCGAATCCTCACTCCCAGCTTCTTCAGGGGTGACTATTAGCAACCGAGGCGCCGTCCCGGGTCGGCCTGCTTTTCAGAGTGCAAGTCTGATGCATCTTACCGGAGAACCGTTGGACCGCAGAGTTCGTTCCTCGCTAGTTGCTTTGTGTGGGTATCAACTGCCAGTCGAATGCCAAACT',
  'AGTTAGAGATAAGATTCCA

In [4]:
Q = seq.backGroundAAFreqs(genomes, True)
# set a small probability for 0-th character (missing aa) to avoid numerical error
Q[0] = 0.0001
Q /= Q.sum()
Q

background freqs:  2840.0 *
  0.0000
C 0.0324
K 0.0370
E 0.0352
W 0.0137
T 0.0599
G 0.0665
Y 0.0243
A 0.0577
I 0.0447
N 0.0345
V 0.0602
H 0.0310
S 0.0951
D 0.0289
F 0.0398
M 0.0137
R 0.0870
L 0.0940
P 0.0658
Q 0.0313
* 0.0472


array([9.99900003e-05, 3.23911235e-02, 3.69681306e-02, 3.52077484e-02,
       1.37310205e-02, 5.98531701e-02, 6.65426403e-02, 2.42933463e-02,
       5.77407032e-02, 4.47138399e-02, 3.45035940e-02, 6.02052473e-02,
       3.09828166e-02, 9.50609148e-02, 2.88703516e-02, 3.97847556e-02,
       1.37310205e-02, 8.69631395e-02, 9.40046832e-02, 6.58384860e-02,
       3.13348956e-02, 4.71783802e-02], dtype=float32)

#### Read in the genome

In [5]:
batch_size = 2  # constrained by RAM and gradient descent performance

X = seq.getNextBatch(genomes, batch_size, tile_size, verbose=False)
X.shape

(2, 4, 6, 20, 22)

## TensorFlow Model
Let $$P = (P[w,c,u]) \in [0,1]^{k \times 22 \times U}$$
be a collection of $U$ amino acid **profiles**, each of length $k$.
Let $$ Q = (Q[c]) \in [0,1]^{22}$$ be a background amino acid distribution.

Both are normalized distributions:
$$ \sum_c P[w,c,u] = \sum_c Q[c] = 1 \qquad \forall u,w.$$
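The notebook obtains normalized profiles by applying a softmax over the alphabet axis to unconstrained logits. The following NumPy sketch illustrates why this satisfies the constraint on $P$; the helper `softmax` and the random logits are illustrative assumptions, not part of the notebook's code.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
k, U = 3, 2                            # profile length, number of profiles
P_logit = rng.normal(size=(k, 22, U))  # unconstrained parameters
P = softmax(P_logit, axis=1)           # normalize over the alphabet axis c

col_sums = P.sum(axis=1)               # shape (k, U), one sum per (w, u)
print(np.allclose(col_sums, 1.0))
```

Because every entry of $P$ is strictly positive after the softmax, the log ratio $\ln(P[w,c,u]/Q[c])$ stays finite as long as $Q[c] > 0$.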

Define the scores tensor 
$$ S \in \mathbb{R}^ {B\times N \times U}$$
by
$$ S[b,g,u] = \max_{f=0}^5 \max_{v=0}^{T-k} \sum_{w=0}^{k-1} \sum_{c=0}^{21} X[b,g,f,v+w,c] \cdot \ln \frac {P[w,c,u]}{Q[c]}.$$

For a given batch, $S[b,g,u]$ is the maximal score that the $u$-th profile attains in the $b$-th tile of genome $g$, taken over all six frames $f$ and all start positions $v$.
It can be computed with a **one-dimensional convolution** followed by max pooling.

Define the intermediate variables:

$R \in \mathbb{R}^{k \times 22 \times U}$ by 
$$ R[w,c,u] := \ln \frac {P[w,c,u]}{Q[c]}.$$

$Z \in \mathbb{R}^{B\times N \times U \times 6 \times (T-k+1)}$ by
$$Z[b,g,u,f,v] := \sum_{w=0}^{k-1} \sum_{c=0}^{21} X[b,g,f,v+w,c] \cdot R[w,c,u].$$
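As a cross-check of these definitions, $Z$ and $S$ can be written directly in NumPy with sliding windows. This is a reference sketch on random toy data, not the TensorFlow implementation used below; the function name `scores` is hypothetical, and the axes of `Z` are stored in the order $(b,g,f,v,u)$ to match the output of the notebook's convolution. It assumes NumPy 1.20 or newer for `sliding_window_view`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def scores(X, R):
    """X: (B, N, 6, T, 22) one-hot tiles, R: (k, 22, U) log-odds.
    Returns Z with axes (b, g, f, v, u) and S with axes (b, g, u)."""
    k = R.shape[0]
    # windows[b, g, f, v, c, w] == X[b, g, f, v + w, c]
    windows = sliding_window_view(X, k, axis=3)
    Z = np.einsum("bgfvcw,wcu->bgfvu", windows, R)
    S = Z.max(axis=(2, 3))  # max over frames f and start positions v
    return Z, S

rng = np.random.default_rng(0)
B, N, T, k, U = 2, 4, 20, 3, 2
X = np.eye(22)[rng.integers(0, 22, size=(B, N, 6, T))]  # random one-hot toy input
R = rng.normal(size=(k, 22, U))                         # stand-in for ln(P/Q)

Z, S = scores(X, R)
print(Z.shape, S.shape)  # (2, 4, 6, 18, 2) (2, 4, 2)
```

The position axis has length $T-k+1 = 18$ because a profile of length $k$ can start at positions $v = 0, \dots, T-k$.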

In [6]:
U = 2 # number of profiles to train
k = 3 # length of profiles

In [7]:
P_logit_init = tf.random.normal([k, su.aa_alphabet_size, U],
                                stddev=1.,
                                dtype=tf.float32,
                                seed=1)

In [8]:
P_logit = tf.Variable(P_logit_init)

In [9]:
P = tf.nn.softmax(P_logit, axis=1)
P

<tf.Tensor: shape=(3, 22, 2), dtype=float32, numpy=
array([[[0.01283871, 0.15297097],
        [0.03084924, 0.00301303],
        [0.03191357, 0.06260679],
        [0.05227958, 0.00414839],
        [0.01402562, 0.0327656 ],
        [0.0549997 , 0.02661102],
        [0.18501115, 0.06115986],
        [0.01970647, 0.00784837],
        [0.10205929, 0.03379603],
        [0.02217916, 0.16052717],
        [0.00505716, 0.0223708 ],
        [0.01640806, 0.04776622],
        [0.08971284, 0.00355158],
        [0.04683332, 0.00932726],
        [0.04129053, 0.00614332],
        [0.02776053, 0.08522641],
        [0.00726591, 0.15249194],
        [0.02260764, 0.01665986],
        [0.02375722, 0.03723903],
        [0.05442174, 0.00720272],
        [0.10919815, 0.01072184],
        [0.02982442, 0.05585181]],

       [[0.05235683, 0.02591239],
        [0.02155031, 0.00584429],
        [0.03832261, 0.01645906],
        [0.01603701, 0.02683151],
        [0.0047699 , 0.02537894],
        [0.09475918, 0.16156

In [10]:
Q1 = tf.expand_dims(Q,0)
Q2 = tf.expand_dims(Q1,-1)
R = tf.math.log(P/Q2)
R

<tf.Tensor: shape=(3, 22, 2), dtype=float32, numpy=
array([[[ 4.8551497e+00,  7.3329334e+00],
        [-4.8772138e-02, -2.3749392e+00],
        [-1.4702488e-01,  5.2681756e-01],
        [ 3.9533973e-01, -2.1385465e+00],
        [ 2.1228012e-02,  8.6972153e-01],
        [-8.4566675e-02, -8.1056905e-01],
        [ 1.0225731e+00, -8.4351800e-02],
        [-2.0925546e-01, -1.1298963e+00],
        [ 5.6959158e-01, -5.3561908e-01],
        [-7.0112979e-01,  1.2781801e+00],
        [-1.9202578e+00, -4.3330696e-01],
        [-1.2999866e+00, -2.3144084e-01],
        [ 1.0631812e+00, -2.1660404e+00],
        [-7.0792294e-01, -2.3215771e+00],
        [ 3.5781801e-01, -1.5474491e+00],
        [-3.5986865e-01,  7.6182753e-01],
        [-6.3646358e-01,  2.4074543e+00],
        [-1.3471963e+00, -1.6524822e+00],
        [-1.3754580e+00, -9.2598718e-01],
        [-1.9044092e-01, -2.2127457e+00],
        [ 1.2484318e+00, -1.0724500e+00],
        [-4.5860830e-01,  1.6876614e-01]],

       [[ 6.2607675e+0

In [11]:
R.shape

TensorShape([3, 22, 2])

In [12]:
X1 = tf.expand_dims(X,-1) # 1 input channel
R1 = tf.expand_dims(R,-2) # 1 input channel
print ("X ", X1.shape, "\tR ", R1.shape)
Z1 = tf.nn.conv2d(X1, R1, strides=1,
                 padding='VALID', data_format="NHWC", name="Z")
Z = tf.squeeze(Z1, 4) # remove the alphabet axis, which the convolution collapsed to size 1
print("Z shape: ", Z.shape)

X  (2, 4, 6, 20, 22, 1) 	R  (3, 22, 1, 2)
Z shape:  (2, 4, 6, 18, 2)


In [13]:
Z[0,0,0]

<tf.Tensor: shape=(18, 2), dtype=float32, numpy=
array([[-3.0758557 , -3.0170102 ],
       [ 0.34099555, -1.3303499 ],
       [-1.4075589 , -3.2617812 ],
       [-0.1156095 , -3.6152835 ],
       [-0.08167011, -2.960205  ],
       [ 1.2930237 , -2.27994   ],
       [-0.03751284, -2.861636  ],
       [-0.67819375, -5.8154573 ],
       [-0.7165212 , -2.6640773 ],
       [-0.8692785 , -2.1967278 ],
       [-1.0513811 ,  0.22160077],
       [-2.0587752 , -0.07773656],
       [-2.2447054 , -4.848545  ],
       [-1.8161974 , -4.8099794 ],
       [-0.03225112,  0.06865031],
       [-2.8662808 , -4.234872  ],
       [-1.1999551 , -1.2777566 ],
       [-0.10705101, -2.0447001 ]], dtype=float32)>

In [14]:
dataset = tf.data.Dataset.from_tensor_slices(X)

In [15]:
Q = tf.expand_dims(Q,-1)
Q.shape

TensorShape([22, 1])
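Reshaping `Q` to `(22, 1)` lets it broadcast against a profile tensor of shape `(k, 22, U)` in the same way as the `(1, 22, 1)` version used earlier for `R`. The NumPy sketch below checks this equivalence on random toy distributions; the Dirichlet samples are stand-ins and not the notebook's `P` and `Q`.

```python
import numpy as np

rng = np.random.default_rng(1)
k, U = 3, 2
# Random normalized toy profiles, rearranged to shape (k, 22, U)
P = rng.dirichlet(np.ones(22), size=(k, U)).transpose(0, 2, 1)
Q = rng.dirichlet(np.ones(22))       # toy background distribution, shape (22,)

R_a = np.log(P / Q[None, :, None])   # Q reshaped to (1, 22, 1)
R_b = np.log(P / Q[:, None])         # Q reshaped to (22, 1)
print(R_a.shape, np.allclose(R_a, R_b))
```

Broadcasting aligns trailing axes, so the `(22, 1)` shape is matched against the alphabet and profile axes and then repeated along the leading axis of length `k`.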