# Part 3: EM implementation

In [1]:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

In [2]:
def read_data():
    with open('sequence.padded.txt') as handle:
        # I dont think this is the actual sequences we should be using
        # put using as placeholder for now.
        return [s.strip() for s in handle.readlines()]

data = read_data()
data

['ATACCCCTGGCTGGGTCATGGTGACCTGGAGGAAGCGT',
 'CATATATGGCCAGGGTCAGTGTGACCTCCATTTCCCAT',
 'AGCAGCTGGCCTGGGTCACAGTGACCTGACCTCAAACC',
 'AGGCTGTGTACAAGGTCAGAGTGACCTCTAGAAGCTCT',
 'TACTCTAGTTCCAGGTCATGGTGACCTGTGAAAAATCT',
 'AGGACTGTTTCAAGGTCACGGTGACCCTCGTGGGCTGT',
 'GCAGGAAGTTTTGGGTCACGGTGACCTCTAGTTGTTGA',
 'CAAGTGCTTCAAAGGTCATGGTGCCCTGGGGCCGAGAG',
 'ACCAACATGGCAGGGTCAAGTTGACCTCCCTGGCCACT',
 'TCTCTCTCTAGTAGGTCATGGTGACCTGTACACATTAT',
 'TCAGACCACAGAGGGTCAAGGTGACCTGAGAGATCAGT',
 'AGGCAATTCACTAGGTCAGGATGCCCTGGGGCAACAGT',
 'TAGTCCTGAAAAGGGTCATGTTGACCTGATTGTCATGT',
 'ATTAACTCTTCTAGGTCAGTGTGACCTAAACTCATCGG',
 'GGACAATTATTGGGGTCACGGTGACCTGCCTGTTTCAG',
 'GGTCCATAATATAGGTCATGTTGACCTGGGACAACTGG',
 'CTCCAGGAGCAGGGGTCAGGGTGACCTCCAGCTCCTCA',
 'GAGCCCATCTCTGGGTCATGTTGCCCTCTTACAGCACA',
 'TGGGTTAAACCTGGGTCATGTTGACCTAGATACATCTC',
 'GTGACATCCCCAGGGTCAAAGTGCCCTGAGTCTGGAGA',
 'GCCTTCTAGGTCAGCATGACCTGGTCCTCAGAGGGGGG',
 'GGCAATGAATCAAGGTCAGGCTAACCTGGCTTACTGCA',
 'CCTACTAGCCCTGGGTCAACGTGCCCTGTAAGAGCATG',
 'GGCGCAGCC

Create matrix $X_{i,j,p,k}$

In [3]:
seq_length = len(data[0])
motif_length = 8
number_motifs = seq_length - motif_length + 1
X = np.zeros((len(data), seq_length, motif_length, 4))

def nuc_to_one_hot(nuc):
    # Convert nucleotide to the index in one hot encoded array
    # that should be hot (==1)
    upper_nuc = nuc.upper()
    mapping = {'A': 0, 'T': 1, 'G': 2, 'C': 3}
    return mapping[upper_nuc]

j_p = []
for i in range(seq_length):
    for j in range(number_motifs):
        for p in range(motif_length):
            nuc = data[i][j+p]
            k_hot = nuc_to_one_hot(nuc)
            X[i][j][p][k_hot] = 1.0

assert X.sum() == seq_length * number_motifs * motif_length

Initialize model parameters.

In [4]:
def init_EM(seq_length, motif_length):
    number_motifs_per_sequence = seq_length - motif_length + 1
    lambda_j = np.random.uniform(0, 1, size=(number_motifs_per_sequence,))
    lambda_j_norm = lambda_j / lambda_j.sum()
    psi_0 = np.random.uniform(0, 1, size=(4, motif_length))
    psi_1 = np.random.uniform(0, 1, size=(4, motif_length))
    psi_0 = (psi_0.T/psi_0.sum(axis=1)).T  # https://stackoverflow.com/questions/16202348/numpy-divide-row-by-row-sum
    psi_1 = (psi_1.T/psi_1.sum(axis=1)).T
    
    return lambda_j_norm, psi_0, psi_1

In [5]:
lambda_j, psi_0, psi_1 = init_EM(seq_length, motif_length)

![](e.png)

## E step numerator

- $\prod_{p}\prod_{k} {\psi^{1}}$ is product of probabilities of the nucleotides in a given motif J 
- $\prod_{j'!=j}\prod_{p}\prod_{k}$ is product of probabilities of the nucleotides of all motifs for the same sequence that are not motif J.

The numerator is then the sum of everything.

### Questions
- What does the product of $P(C_{i} = j | X, \theta)$ look like in terms of shape?

If asking what is the prob of $C_{i} = j$ then you are really asking what is the prob of motif at position j being the transcription factor binding site. You need to be able to answer this question for all $j$. So you should have an array with shape `(number of sequences, number of motifs per sequence`).

- How does taking the log here work? Products become sums so below may not actually be correct. I don't think we can just take the log whenever we want. I am not sure I understand Quon description of how to work in log space.

### Log space notes


In [6]:
def E_step(X, lambda_j, psi_0, psi_1):
    num = lambda_j * (X * psi_1.T) * (X * psi_1.T)
    denom = lambda_j * (X * psi_1.T) * (X * psi_1.T)

First take product along axis 3. This will just return an array of what were the non-zero nucleotide probibilties ($\prod_{k}$).

#### Numerator ($psi_{1}$) part (in the motif)

In [7]:
a = X[0][0]*psi_1.T
a

array([[0.11597112, 0.        , 0.        , 0.        ],
       [0.        , 0.04522459, 0.        , 0.        ],
       [0.07555558, 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.10898697],
       [0.        , 0.        , 0.        , 0.06261992],
       [0.        , 0.        , 0.        , 0.29971703],
       [0.        , 0.        , 0.        , 0.04570203],
       [0.        , 0.07314194, 0.        , 0.        ]])

In [8]:
a.flatten()
a[a != 0]

array([0.11597112, 0.04522459, 0.07555558, 0.10898697, 0.06261992,
       0.29971703, 0.04570203, 0.07314194])

Now need where not in the motif (all other motifs in this case that are not motif 0 of sequence 0).

In [9]:
cut = X[0][np.arange(len(X[0]))!=0]
cut.shape

(37, 8, 4)

In [10]:
X[0].shape

(38, 8, 4)

In [11]:
b = (cut[0] * psi_0.T)
b.shape

(8, 4)

In [12]:
c = (cut * psi_0.T).flatten()
c[c != 0].prod()

7.311662667238134e-253

In [13]:
x=2
e = 2
x**e

4

In [14]:
np.log(x**e)

1.3862943611198906

In [15]:
e*np.log(x)

1.3862943611198906

![](e.png)

In [16]:
def numerator(i, j, X, lambda_j, psi_0, psi_1):
    # i = current sequence
    # j = current j index
    # X = Data
    psi_1_term = (X[i][j] * psi_1.T).flatten()
    # remove zero terms this is in lieu of having exponent X_{i,j,p,k}
    # cause term to go to 1
    psi_1_term = psi_1_term[psi_1_term != 0]
    # take product of all remaining terms (these are probibities seeing
    # the bases in the given motif in their given positions given they
    # are in the TFBS)
    psi_1_term = np.log(psi_1_term).sum()
    
    # now need to get product of all other motifs but assuming they are
    # not the TFBS
    psi_0_term = X[i][np.arange(len(X[i]))!=j]
    psi_0_term = psi_0_term.flatten()
    psi_0_term = np.log(psi_0_term[psi_0_term != 0]).sum()
    
    return np.log(lambda_j[j]) + psi_0_term + psi_1_term

In [None]:
def denominator(i, j, X, lambda_j, psi_0, psi_1):
    # Get all lambda values that are not j
    lambda_j_prime = np.delete(lambda_j, j)
    for j_prime in lambda_j_prime:
        psi_one_term = (X[i][j_prime] * psi_1.T).flatten()

AttributeError: 'numpy.ndarray' object has no attribute 'delete'

What is j prime prime in the denominator?

Convince myself that log of sums is same as product

In [17]:
a = a.flatten()
a = a[a != 0]
np.e**(np.log(a).sum())

2.709514915865458e-09

In [18]:
a.prod()

2.7095149158654565e-09

In [19]:
prod = X * psi_1.T
prod_3 = prod.prod(axis=3)
prod_3.shape

(357, 38, 8)

In [20]:
prod[0][0]  # one motif

array([[0.11597112, 0.        , 0.        , 0.        ],
       [0.        , 0.04522459, 0.        , 0.        ],
       [0.07555558, 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.10898697],
       [0.        , 0.        , 0.        , 0.06261992],
       [0.        , 0.        , 0.        , 0.29971703],
       [0.        , 0.        , 0.        , 0.04570203],
       [0.        , 0.07314194, 0.        , 0.        ]])

Then take the product along axis 2 of this new array. This is the product for each motif ($\prod_{p}$)

In [21]:
prod_3.prod(axis=2).shape

(357, 38)

Combine into one operation.

In [22]:
prod_pk_psi_1 = (X * psi_1.T)
prod_pk_psi_1[prod_pk_psi_1 == 0] = 1

prod_pk_psi_1 = np.log(prod_pk_psi_1)
prod_pk_psi_1[prod_pk_psi_1 == 0] = 1

n1 = prod_pk_psi_1.prod(axis=3).prod(axis=2)
n1.shape

(357, 38)

In [23]:
prod_pk_psi_0 = (X * psi_0.T)
prod_pk_psi_0[prod_pk_psi_0 == 0] = 1

prod_pk_psi_0 = np.log(prod_pk_psi_0)
prod_pk_psi_0[prod_pk_psi_0 == 0] = 1
n2 = prod_pk_psi_0.prod(axis=3).prod(axis=2)
n2.shape

(357, 38)

## M step

Quon says $\boldsymbol{E}[C_{i,j}] = P(C_{i} = j | X_{i}, \theta)$

He also gives what $\boldsymbol{E}[C_{i,j}] = P(C_{i} = j | X_{i}, \theta)$ equals to in the E step (shown in the image below.)

### $\lambda_{j}$

Take sum of all values at each index $i$ over vector $C_{i, j}$ which would store prob each $j$ (each motif) being the transcription factor binding site and divide this value by the number of sequences $N$.
- Would this not always just sum to 1? And therefore $\lambda_{j}$ is basically fixed at 1 over the number of sequences?

In [24]:
# use random data for now that should be in theory in the same shape as what the E step would produce
N = X.shape[0]
m = X.shape[1]
rand = np.random.random((N, m))

In [25]:
np.e**rand.sum(axis=0) / N

array([1.17646643e+74, 1.14650026e+75, 1.24620820e+75, 3.15698072e+72,
       2.26447212e+75, 2.97630218e+74, 7.71695357e+75, 3.73262896e+75,
       2.22443887e+73, 1.15356741e+76, 7.33034719e+76, 1.20733316e+77,
       1.54040141e+74, 3.74894193e+75, 2.93543782e+76, 1.09194068e+79,
       1.28624197e+75, 2.06408159e+77, 3.23948003e+75, 8.70052987e+75,
       4.08827950e+76, 1.37848745e+74, 1.81407734e+76, 2.52832555e+74,
       2.13575569e+72, 2.16275711e+76, 1.02883245e+73, 2.23829698e+75,
       7.58911551e+75, 6.57638003e+76, 6.09563577e+75, 3.01944384e+74,
       1.16087462e+75, 1.18945511e+74, 2.47621240e+77, 3.04173418e+72,
       3.66156042e+74, 4.09224375e+76])

### $\psi^{1}_{p, k}$

- How does taking a sum over $i$ for $C_{i, j}$look compared to taking a sum over $i$ and $j$?

Product of indicator variables for a given motif (For example the matrix at `X[0][0]`) and the expectation at that motif calculated during the E step. Then take a sum overall all those values and divide by the number of sequences.

In [26]:
(X[0][0][0][0]*rand[0][0]) / N # this would product one cell in the psi matrix need to iterate through p and Ks

0.0002811729716141022

In [27]:
for p in range(psi_1.T.shape[0]):
    for k in range(psi_1.T.shape[1]):
        products = []
        for i in range(X.shape[0]):
            for j in range(X.shape[1]):
                products.append(X[i][j][p][k] * rand[i][j])
    psi_1[k][p] = sum(products) / N
print(psi_1)

[[0.11597112 0.03387243 0.07555558 0.17352129 0.12573083 0.0417171
  0.19749639 0.23613525]
 [0.16619038 0.04522459 0.16139631 0.16748459 0.15447033 0.09024389
  0.14184798 0.07314194]
 [0.03724527 0.00148296 0.31625834 0.27427653 0.10607105 0.02439327
  0.11688184 0.12339072]
 [0.42362929 0.44322599 0.44895867 0.42713755 0.45416678 0.46115143
  0.42440585 0.42747741]]


### $\psi^{0}_{p, k}$

Seems pretty much like other $\psi$ but add some subtractions and change the denominator to the number of sequences times the number of possible motifs.

In [28]:
seq_len = 100

for p in range(psi_1.T.shape[0]):
    for k in range(psi_1.T.shape[1]):
        products = []
        for i in range(X.shape[0]):
            for j in range(X.shape[1]):
                products.append(1 - X[i][j][p][k] * rand[i][j])
    psi_1[k][p] = sum(products) / ((seq_len - X.shape[1] + 1 - 1) * N)
print(psi_1)

[[0.11597112 0.03387243 0.07555558 0.17352129 0.12573083 0.0417171
  0.19749639 0.23613525]
 [0.16619038 0.04522459 0.16139631 0.16748459 0.15447033 0.09024389
  0.14184798 0.07314194]
 [0.03724527 0.00148296 0.31625834 0.27427653 0.10607105 0.02439327
  0.11688184 0.12339072]
 [0.6060705  0.60575442 0.60566196 0.60601391 0.60557796 0.6054653
  0.60605797 0.60600843]]
