# In class EM implementation

## Stopping iteration

Need to stop somewhere, keep track how much it is improving when it slows down then you stop optimizing.

How would actually calculate?

- Each sequences have is length $L$
- How to calculate things we need given the data?

## Computing posteriors (E step)

- Posteriors goes into M step
- At some point need to calculate post for each base for each sequence
    - Already init $\lambda_{l}$ and $\Psi_{l}$

- Could do nested for loop but not efficient

In current model each base is sampled independently so each time you see an A you will have same value.

$ P(C_{ij}=1|X_{ij}=A) = \frac{\lambda_{1}\Psi^{1}_{A}}{\lambda_{0}\Psi^{0}_{A} + \lambda_{1}\Psi^{1}_{A}} $

For next class write code to calculate the posteriors.

$C_{i,j}$ converted into $C_{i, j, l}$ because $C_{i, j}$ could take on multiple values due to multiple models? Want to reduce $C$ to either zero or one when converting to expectation (?). 

In the E step $q = P(C | X)$ and therefore depends on what value X we have.

## For Thursday

In [2]:
import numpy as np

Initialize the model parameters $\theta$ which include $\lambda_{0}, \lambda_{1}, \Psi^{0}_{k}$ and $\Psi^{1}_{k}$.

In [6]:
def init_params():
    
    lambda_0 = np.random.uniform()
    lambda_1 = 1 - lambda_0
    
    def init_psi():
        psi = np.random.uniform(size=(4))
        psi_norm = psi / psi.sum()
        return psi
    
    psi_0 = init_psi()
    psi_1 = init_psi()
    
    return {
        'l0': lambda_0, 'l1': lambda_1, 'psi0': psi_0, 'psi1': psi_1
    }

theta_0 = init_params()
theta_0

{'l0': 0.03705496470442482,
 'l1': 0.9629450352955752,
 'psi0': array([0.89698044, 0.0187848 , 0.4707884 , 0.17963004]),
 'psi1': array([0.56367798, 0.56942416, 0.9238641 , 0.75458959])}

Calculate posterior probability array. Gives the probability $C=1$ given the identity of a specific nucleotide.

In [9]:
def post_probs(theta_0): 
    return [
        (
            (theta_0['l1']*theta_0['psi1'][i]) / 
            (theta_0['l0']*theta_0['psi0'][i] + theta_0['l1']*theta_0['psi1'][i])
        ) for i in range(4)
    ]

probs = post_probs(theta_0)  # probability of Cij == 1 given the identity of each nucleotide
probs

[0.9422987247226866, 0.9987321594690878, 0.9807678086255173, 0.990922779498142]

Code from sequence reader assignment to read in Quon enhancer data.

In [10]:
def read_seq_file(filepath):
    with open(filepath) as handle:
        return [s.upper().strip() for s in handle]

def nuc_to_one_hot(nuc):
    # Convert nucleotide to the index in one hot encoded array
    # that should be hot (==1)
    upper_nuc = nuc.upper()
    mapping = {'A': 0, 'T': 1, 'G': 2, 'C': 3}
    return mapping[upper_nuc]

def make_matrix(seqs):
    # input an iterable of sequences and return one hot matrix
    num_seqs, length = len(seqs), len(seqs[0])
    # assume all sequences are the same length
    matrix = np.zeros((num_seqs, length, 4))
    for i, each_seq in enumerate(seqs):
        for j, each_nuc in enumerate(each_seq):
            hot_index = nuc_to_one_hot(each_nuc)
            matrix[i][j][hot_index] = 1
    return matrix

In [11]:
seqs_path = '../assignments/data/sequence.padded.txt'
seqs = read_seq_file(seqs_path)
seq_matrix = make_matrix(seqs)

Multiply each one-hot encoded matrix by the posterior matrix. Since only one value in seq matrix is non-zero and at the same index as the posterior multiplying the two together will give 3D matrix that hold posterior probibilities for each base. The array at positions $X_{ij}$ could be reduced to single values by summing in that format makes more sense in the actual E step implementation.

In [12]:
base_probs = seq_matrix * probs

In [13]:
base_probs

array([[[0.94229872, 0.        , 0.        , 0.        ],
        [0.        , 0.99873216, 0.        , 0.        ],
        [0.94229872, 0.        , 0.        , 0.        ],
        ...,
        [0.        , 0.        , 0.        , 0.99092278],
        [0.        , 0.        , 0.98076781, 0.        ],
        [0.        , 0.99873216, 0.        , 0.        ]],

       [[0.        , 0.        , 0.        , 0.99092278],
        [0.94229872, 0.        , 0.        , 0.        ],
        [0.        , 0.99873216, 0.        , 0.        ],
        ...,
        [0.        , 0.        , 0.        , 0.99092278],
        [0.94229872, 0.        , 0.        , 0.        ],
        [0.        , 0.99873216, 0.        , 0.        ]],

       [[0.94229872, 0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.98076781, 0.        ],
        [0.        , 0.        , 0.        , 0.99092278],
        ...,
        [0.94229872, 0.        , 0.        , 0.        ],
        [0.        , 0.      

Version of matrix were we take sum of each array at $X_{ij}$.

In [14]:
base_probs_sum = base_probs.sum(axis=2)
base_probs_sum

array([[0.94229872, 0.99873216, 0.94229872, ..., 0.99092278, 0.98076781,
        0.99873216],
       [0.99092278, 0.94229872, 0.99873216, ..., 0.99092278, 0.94229872,
        0.99873216],
       [0.94229872, 0.98076781, 0.99092278, ..., 0.94229872, 0.99092278,
        0.99092278],
       ...,
       [0.99092278, 0.99092278, 0.94229872, ..., 0.98076781, 0.98076781,
        0.98076781],
       [0.94229872, 0.98076781, 0.98076781, ..., 0.99873216, 0.98076781,
        0.99092278],
       [0.99092278, 0.99873216, 0.99092278, ..., 0.99092278, 0.99092278,
        0.99092278]])