# BHIT (TensorFlow)

This notebook uses Python and the **TensorFlow** package to run BHIT.

## Import Necessary Packages

1. `time` package for measuring the running time of the whole program;
2. `numpy` package for matrix manipulation;
3. `tensorflow` package for GPU-accelerated calculation;
4. `OrderedDict` class for building the ordered predicate/function pairs passed to `tf.case`.

In [1]:
import time
import numpy as np
import tensorflow as tf
from collections import OrderedDict

## Define Some Helper Functions

### Function init()
This function initializes the parameters used in the program.

- `dataset`: The whole dataset read from the input file.
- `outFileName`: Name of output file.
- `iterNum`: Number of iterations.
- `burninNum`: Number of burn-in iterations.
- `ObsNum`: Number of observations, e.g., the size of the population.
- `SNPNum`: Number of single nucleotide polymorphisms (SNPs).
- `PhenoNum`: Number of phenotype types.
- `MAF`: Minor allele frequency, a value between 0 and 1 (by definition a minor allele frequency is at most 0.5).

**There are some differences between the Jupyter notebook version and the command-line version!**

*Parameter values are defined inside the function instead of being read from the command line, because a notebook cell is not launched with command-line arguments.*

In [2]:
def init():
    
    dataset = np.loadtxt('input.txt').transpose()
    outFileName = 'output.txt'
    iterNum = 30000
    burninNum = 29000
    ObsNum = 200
    SNPNum = 100
    PhenoNum = 1
    MAF = 0.5
    
    return dataset, outFileName, iterNum, burninNum, ObsNum, SNPNum, PhenoNum, MAF


### Function logLikeCont(contTensor)
This function calculates the log-likelihood of the continuous variates alone, given the data tensor.
#### Math formulas
We assume a Gaussian distribution on the continuous data, and the probability is:

$$P(Y\mid X) = \left(\frac{1}{2\pi}\right)^{n/2}\sqrt{\frac{\kappa_0}{\kappa_n}}\,\frac{\Gamma(v_n/2)}{\Gamma(v_0/2)}\,\frac{\left(\frac{v_0\sigma_0^2}{2}\right)^{v_0/2}}{\left(\frac{v_n\sigma_n^2}{2}\right)^{v_n/2}}$$

where

- $\kappa_n = \kappa_0 + n$
- $v_n = v_0 + n$
- $\sigma_n^2 = \frac{1}{v_n}(v_0\sigma_0^2+(n-1)s^2+\frac{\kappa_0 n}{\kappa_n}(\bar y - \mu_0)^2)$
- $s^2 = \frac{1}{n-1}\sum_{i=1}^n(y_i-\bar y)^2$
- $\kappa_0, \mu_0, v_0$ are predefined when the graph is created ([click here](#Create-TensorFlow-Graph)), $\sigma_0$ is fixed to 1 inside the function, and $n$ is the number of observations of the continuous data
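
As a cross-check of the formula, here is a minimal pure-Python sketch of the univariate log marginal likelihood. It is not the notebook's TensorFlow code: the name `log_marginal_gaussian` and the default hyperparameter values are assumptions chosen for illustration. The term $(n-1)s^2$ is computed directly as the sum of squared deviations, and the prefactor is taken as $(2\pi)^{-n/2}$, matching `-1*tf.log(2*PI)*obsNum / 2` in `func1`.

```python
import math

def log_marginal_gaussian(y, kappa0=1.0, nu0=2.0, mu0=0.0, sigma0_sq=1.0):
    """Log marginal likelihood of a 1-D sample under the conjugate Gaussian prior.

    All default hyperparameter values are illustrative assumptions.
    """
    n = len(y)
    y_bar = sum(y) / n
    # Sum of squared deviations, i.e. (n - 1) * s^2 in the formula.
    ss = sum((v - y_bar) ** 2 for v in y)
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    # nu_n * sigma_n^2 from the update rule.
    nu_sigma_n = nu0 * sigma0_sq + ss + kappa0 * n / kappa_n * (y_bar - mu0) ** 2
    return (-0.5 * n * math.log(2 * math.pi)
            + 0.5 * math.log(kappa0 / kappa_n)
            + math.lgamma(nu_n / 2) - math.lgamma(nu0 / 2)
            + 0.5 * nu0 * math.log(nu0 * sigma0_sq / 2)
            - 0.5 * nu_n * math.log(nu_sigma_n / 2))
```

With these defaults, a single observation at $y = 0$ has marginal density 0.25, which equals the corresponding Student-t predictive density, so the sketch can be verified by hand.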

#### Parameters
- `contTensor`: Input tensor.

#### Returns
- `logProb`: Logarithmic likelihood of continuous variates.

In [3]:
def logLikeCont(contTensor):
    varNum = tf.shape(contTensor)[0]
    obsNum = tf.cast(tf.shape(contTensor)[1], tf.float64)
    
    def func1():
        means = tf.reduce_mean(contTensor)
        sigma = tf.constant(1.0, dtype=tf.float64)
        
        nuVar = tf.multiply(NU0, tf.square(sigma))
        nuVar += tf.reduce_sum(tf.square(contTensor-means))
        nuVar += tf.multiply(KAPPA0, tf.multiply(
            obsNum/(KAPPA0+obsNum), tf.square(means-MU0)))

        res = (-1*tf.log(2*PI)*obsNum / 2 + tf.log(KAPPA0 / 
                (KAPPA0 + obsNum))/2 + tf.lgamma((NU0+obsNum)/2))
        res += (-1*tf.lgamma(NU0/2) + tf.log(NU0*tf.square(sigma)/
                2)*NU0/2 - tf.log(nuVar/2) * (NU0+obsNum)/2)
        return res
    
    # NOTE: func2 (the multivariate case) is not used by the tf.cond below.
    # It calls np.cov and np.arange on symbolic tensors, so it would need
    # TensorFlow equivalents before it could be wired in.
    def func2():
        means = tf.reduce_mean(contTensor, axis=1)
        lambda_arr = tf.diag(tf.ones_like(contTensor)[0])
        diff = tf.reshape(means-MU0, [1, -1])
        lambdaN = (lambda_arr + KAPPA0*obsNum/(KAPPA0+obsNum) * tf.matmul(
                    diff, diff, transpose_a=True, transpose_b=False))
        lambdaN += (obsNum-1)*np.cov(contTensor, rowvar=False, bias=False)

        logProb = (-tf.log(PI) * obsNum * varNum / 2 + tf.log(KAPPA0/
            (KAPPA0 + obsNum) * varNum / 2))
        logProb += tf.log(tf.matrix_determinant(lambda_arr)) * NU0/2
        logProb -= tf.log(tf.matrix_determinant(lambdaN)) * (NU0+obsNum)/2
        logProb += tf.reduce_sum(tf.lgamma((NU0+obsNum)/2 - 
            np.arange(varNum)/2) - tf.lgamma(NU0/2 - 
            np.arange(varNum)/2))
        return logProb
    
    # Only the single-variate case is evaluated; any other shape yields ZERO.
    logProb = tf.cond(tf.equal(varNum, 1), lambda: func1(), lambda: ZERO)
        
    return logProb


### Function logLikeDisc(discTensor)
This function calculates the log-likelihood of the discrete variates alone, given the data tensor.
#### Math formulas
We assume a Dirichlet distribution on the category probabilities of the discrete data, with density:

$$P(p_1, \cdots, p_{C_h}\mid \alpha_1, \cdots, \alpha_{C_h}) = \frac{1}{B(\alpha)}\prod_{j=1}^{C_h}p_{j}^{\alpha_j-1} $$

where 

- $C_h$ is the number of possible combination values in the genetic variation group (the input parameter `discTensor`)
- $\alpha = (\alpha_1, \cdots, \alpha_{C_h})$ is the parameter vector of the Dirichlet distribution
- $B(\alpha) = \frac{\prod_{j=1}^{C_h}\Gamma(\alpha_j)}{\Gamma(\sum_{j=1}^{C_h}\alpha_j)}$


Integrating out the probabilities $p_j$ gives the following formula:
$$ P(X\mid I) = \frac{\Gamma(\sum_{j=1}^{C_h}\alpha_j)}{\Gamma(\sum_{j=1}^{C_h}(n_j+\alpha_j))}\prod_{j=1}^{C_h}\frac{\Gamma(n_j+\alpha_j)}{\Gamma(\alpha_j)}$$

where 

- $X$ is the genetic variation group, `discTensor` in the program
- $I$ is the current partition, `Ix` or `Iy` in the program
- $n_j$ is the number of times the $j$-th possible combination appears in the genetic variation group $X$
- $\Gamma (x) = \int_0^{\infty}t^{x-1}e^{-t}dt$.
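
The integrated formula can be evaluated in log space with `math.lgamma`. The following is a minimal sketch under stated assumptions, not the notebook's implementation: the function name `log_dirichlet_multinomial` and the dictionary-based interface are hypothetical. Unlike `logLikeDisc`, which only visits the combinations present in the data, this sketch loops over every category listed in `alpha`.

```python
import math
from collections import Counter

def log_dirichlet_multinomial(observations, alpha):
    """Log of P(X | I) for categorical observations.

    observations: list of hashable category labels.
    alpha: dict mapping every possible label to its Dirichlet parameter.
    """
    counts = Counter(observations)
    total_alpha = sum(alpha.values())
    total_n = len(observations)
    # Ratio of Gamma functions over the summed parameters.
    log_prob = math.lgamma(total_alpha) - math.lgamma(total_n + total_alpha)
    # Product over categories; unobserved categories contribute zero.
    for label, a in alpha.items():
        log_prob += math.lgamma(counts[label] + a) - math.lgamma(a)
    return log_prob
```

For two categories with $\alpha_1 = \alpha_2 = 1$, one observation has probability $1/2$ and two identical observations have probability $1/3$, which is a quick way to sanity-check the sketch.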

#### Parameters
- `discTensor`: Input tensor.

#### Returns
- `logProb`: Logarithmic likelihood of discrete variates.

In [4]:
def logLikeDisc(discTensor):
    logProb = tf.constant(0.0, dtype=tf.float64)
    combined_tensor = tf.reduce_join(discTensor, 0, separator=' ')
    unique_tensor, _, N = tf.unique_with_counts(combined_tensor)
    
    idx = tf.string_split(unique_tensor, delimiter=' ')
    idx = tf.sparse_tensor_to_dense(idx, default_value='1')
    idx = tf.string_to_number(idx, out_type=tf.int32) - 1
    
    alpha = tf.gather(Odds, idx)
    alpha = tf.reduce_prod(alpha, axis=1)
    n_plus_alpha = tf.add(alpha, tf.cast(N, alpha.dtype))
    
    logProb += tf.reduce_sum(tf.lgamma(n_plus_alpha) - tf.lgamma(alpha))
    logProb -= tf.lgamma(tf.reduce_sum(n_plus_alpha))
    return logProb


### Function logLikeDepe(discTensor, contTensor)
This function calculates the log-likelihood of partitions that contain both continuous and discrete variates.
#### Math formulas
If an interaction between continuous and discrete data is detected, the probability is calculated as follows:
$$P = \prod_{m=1}^{M}P(Y_{\{m\}} \mid X_{\{I=h\}}) P({X_{\{I=h\}}}\mid I)$$

where

- $M$ is the total number of combination values of $X$ that are associated with $Y$
- The formula can be calculated by combining two formulas we defined above
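
The grouping behind this product can be sketched in plain Python. This is an illustration under simplifying assumptions, not the notebook's code: it handles a single phenotype, the helper names `group_by_combination` and `log_like_dependent` are hypothetical, and the two likelihood functions are passed in as arguments. The space-joined key mirrors `tf.reduce_join(discTensor, axis=0, separator=' ')` in the cell that follows.

```python
from collections import defaultdict

def group_by_combination(disc_rows, cont_values):
    """Group continuous values by the genotype combination of each observation.

    disc_rows: one list of genotypes per SNP, each of length n_obs.
    cont_values: list of n_obs phenotype values.
    """
    groups = defaultdict(list)
    for obs_idx, y in enumerate(cont_values):
        # Join the genotypes of all SNPs for this observation into one key.
        key = " ".join(str(row[obs_idx]) for row in disc_rows)
        groups[key].append(y)
    return dict(groups)

def log_like_dependent(disc_rows, cont_values, log_like_cont, log_like_disc):
    """Sum the continuous log-likelihood per group, then add the discrete term."""
    groups = group_by_combination(disc_rows, cont_values)
    total = sum(log_like_cont(ys) for ys in groups.values())
    return total + log_like_disc(disc_rows)
```

Passing the likelihood functions as parameters keeps the sketch independent of any particular implementation of the two formulas above.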

#### Parameters
- `discTensor`: Input discrete tensor.
- `contTensor`: Input continuous tensor.

#### Returns
- `logProb`: Logarithmic likelihood of both continuous and discrete variates.

In [5]:
def logLikeDepe(discTensor, contTensor):
    combined_tensor = tf.reduce_join(discTensor, axis=0, separator=' ')
    unique_tensor, _ = tf.unique(combined_tensor)
    
    def select_fn(elem):
        selected = tf.squeeze(tf.transpose(tf.gather(tf.transpose(contTensor), 
                    tf.where(tf.equal(combined_tensor, elem)))), [1])
        return logLikeCont(selected)
    
    logProb = tf.map_fn(lambda x: select_fn(x), unique_tensor, dtype=tf.float64)
    logProb = tf.reduce_sum(logProb)
    logProb += logLikeDisc(discTensor)
    
    return logProb


### Function calcProb(tensor1, tensor2)

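This function calculates the log-likelihood given a discrete tensor and a continuous tensor, dispatching to `logLikeDepe`, `logLikeDisc`, or `logLikeCont` depending on which of the two tensors is non-empty.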
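
The dispatch order can be sketched without TensorFlow as a chain of ordinary conditionals. This is a hedged illustration only: `calc_prob` and its callback parameters are hypothetical names, and plain Python lists stand in for tensors. In the notebook, the same first-match behaviour comes from passing an `OrderedDict` to `tf.case` with `exclusive=False`.

```python
def calc_prob(disc, cont, like_depe, like_disc, like_cont):
    """Pick the likelihood function based on which inputs are non-empty."""
    if len(disc) > 0 and len(cont) > 0:
        # Both kinds of variates are present: use the joint likelihood.
        return like_depe(disc, cont)
    if len(disc) > 0:
        return like_disc(disc)
    if len(cont) > 0:
        return like_cont(cont)
    # Neither is present: contribute nothing to the log-probability.
    return 0.0
```

The order of the checks matters: the combined case has to be tested first, because each single-tensor predicate is also true whenever both tensors are non-empty.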

In [6]:
def calcProb(tensor1, tensor2):
    shape1 = tf.shape(tensor1)
    shape2 = tf.shape(tensor2)
    
    pred_fn = OrderedDict([(tf.logical_and(tf.greater(shape1[0], 0), 
                tf.greater(shape2[0], 0)), lambda: logLikeDepe(tensor1, tensor2)),
                (tf.greater(shape1[0], 0), lambda: logLikeDisc(tensor1)),
                (tf.greater(shape2[0], 0), lambda: logLikeCont(tensor2))])
    res = tf.case(pred_fn, default=lambda: ZERO, exclusive=False)
    
    return res


## Use TensorFlow

### Create TensorFlow Graph

In [7]:
(dataset, outFileName, iterNum, burninNum, obsNum, 
    SNPNum, PhenoNum, MAF) = init()
TotalNum = SNPNum + PhenoNum
Odds = np.array([(1-MAF)**2, 2*MAF*(1-MAF), MAF**2])


graph = tf.Graph()
with graph.as_default():
    # Define TensorFlow constants.
    GeneData = tf.constant(dataset[:SNPNum].astype(np.int32).astype('str'))
    PhenoData = tf.constant(dataset[SNPNum:TotalNum])
    PI = tf.constant(np.pi, name='PI', dtype=tf.float64)
    ZERO = tf.zeros([], dtype=tf.float64)
    KAPPA0 = tf.constant(1.0, name='KAPPA', dtype=tf.float64)
    NU0 = tf.constant(PhenoNum+1, name='NU', dtype=tf.float64)
    MEANS = tf.reduce_mean(PhenoData, axis=1, name='MEANS')
    MU0 = tf.reduce_max(MEANS) + 2
    
    # Define TensorFlow placeholders.
    index1 = tf.placeholder(dtype=tf.int32)
    index2 = tf.placeholder(dtype=tf.int32)
    var1 = tf.placeholder(dtype=tf.int32)
    var2 = tf.placeholder(dtype=tf.int32)
    
    # TensorFlow random number generator.
    u = tf.random_uniform([], dtype=tf.float64)
    
    Dx = index1[:SNPNum]
    Cx = index1[SNPNum:TotalNum]
    Dy = index2[:SNPNum]
    Cy = index2[SNPNum:TotalNum]
    
    Dxx = tf.squeeze(tf.gather(GeneData, tf.where(tf.equal(Dx, var1))), [1])
    Cxx = tf.squeeze(tf.gather(PhenoData, tf.where(tf.equal(Cx, var1))), [1])
    Dxy = tf.squeeze(tf.gather(GeneData, tf.where(tf.equal(Dx, var2))), [1])
    Cxy = tf.squeeze(tf.gather(PhenoData, tf.where(tf.equal(Cx, var2))), [1])
    Dyx = tf.squeeze(tf.gather(GeneData, tf.where(tf.equal(Dy, var1))), [1])
    Cyx = tf.squeeze(tf.gather(PhenoData, tf.where(tf.equal(Cy, var1))), [1])
    Dyy = tf.squeeze(tf.gather(GeneData, tf.where(tf.equal(Dy, var2))), [1])
    Cyy = tf.squeeze(tf.gather(PhenoData, tf.where(tf.equal(Cy, var2))), [1])
    
    pX = 0 
    pY = 0
    pX += calcProb(Dxx, Cxx)
    pX += calcProb(Dxy, Cxy)
    pY += calcProb(Dyx, Cyx)
    pY += calcProb(Dyy, Cyy)
    
    accept = tf.log(u) <= tf.minimum(ZERO, pY-pX)
    res = tf.cond(accept, lambda: index2, lambda: index1)
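
The last two lines of the graph read as a Metropolis-style acceptance test on log-probabilities: the proposed partition `index2` replaces `index1` when $\log u \le \min(0, p_Y - p_X)$ for a uniform draw $u$. A minimal plain-Python sketch of that rule follows; the function name `metropolis_accept` and the explicit `u` argument are assumptions made so the example is deterministic and testable.

```python
import math
import random

def metropolis_accept(log_p_current, log_p_proposed, u):
    """Return True if the proposed state should replace the current one.

    u is a draw from Uniform[0, 1).
    """
    # log(0) is undefined; treat u == 0 as certain acceptance.
    log_u = math.log(u) if u > 0.0 else float("-inf")
    return log_u <= min(0.0, log_p_proposed - log_p_current)

# A proposal with higher log-probability is accepted for any draw of u.
accepted = metropolis_accept(-10.0, -5.0, random.random())
```

Working with `log(u)` rather than `u` avoids exponentiating the difference of two log-probabilities, which could underflow for very unlikely states.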


### Create TensorFlow Session

In [8]:
with tf.Session(graph=graph) as sess:
    start = time.time()
    # If you want to use TensorBoard to visualize graph, uncomment the following line.
    # writer = tf.summary.FileWriter('output/', sess.graph)
    sess.run(tf.global_variables_initializer())
    Ix = np.arange(TotalNum)
    for i in range(iterNum):
        while True:
            # Sort the two picks so the move always goes from the smaller index to the larger one.
            x, y = np.sort(np.random.choice(Ix, 2, False))
            k = np.where(Ix == x)[0]
            
            if len(k) > 1:
                k = np.random.choice(k, 1)
          
            Iy = np.array(Ix)
            Iy[k] = y
          
            tmp1 = np.where(Ix == x)[0]
            tmp2 = np.where(Iy == y)[0]
            if (len(tmp1)!=1 or len(tmp2)!=1):
                break
        
        Ix = sess.run(res, {index1: Ix, index2: Iy, var1: x, var2: y})
            
        if (i+1) % 5000 == 0:
            print('Progress: %.2f%%' % ((i+1)/iterNum*100))
                
    print('Training Complete! Index array:\n', Ix)

end = time.time()
print("The whole program runs about %.2f s." % (end-start))


Progress: 16.67%
Progress: 33.33%
Progress: 50.00%
Progress: 66.67%
Progress: 83.33%
Progress: 100.00%
Training Complete! Index array:
 [100 100   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  88  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  95  47  48  49  62  51  52  53
  54  91  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  92  89
  90  91  92  93  94  99  96  97  98  99 100]
The whole program runs about 80.26 s.
