### Usage and main fitting functions

The package comes with an artificial dataset to present the functionality.


## Quick mathematical description of the package

The package infers the parameters of two models: 

- Poisson Log Normal model (PLN)
- Poisson Log Normal-Principal Component Analysis model (PLN-PCA)



We consider the following model:  

- Consider $n$ samples $(i=1 \ldots n)$.

- Measure $x_{i}=\left(x_{i h}\right)_{1 \leq h \leq d}$, where $x_{i h}$ is the value of covariate $h$ for sample $i$
(altitude, temperature, categorical covariate, ...).

- Consider $p$ features (genes) $(j=1 \ldots p)$.

- Measure $Y=\left(Y_{i j}\right)_{1 \leq i \leq n, 1 \leq j \leq p}$, where $Y_{i j}$ is the number of times feature $j$ is observed in sample $i$.

- Associate a random vector $Z_{i}$ with each sample.
- Assume that the unknown $\left(W_{i}\right)_{1 \leq i \leq n}$ are independent and live in a space of dimension $q\leq p$ such that:

$$
\begin{aligned} 
W_{i} & \sim \mathcal{N}_q\left(0, I_{q}\right)  \\
Z_{i} &=\beta^{\top}\mathbf{x}_{i} +\mathbf{C}W_i  \in \mathbb R^p \\
Y_{i j} \mid Z_{i j} & \sim \mathcal{P}\left(\exp \left(o_{ij} + Z_{i j}\right)\right)
\end{aligned}
$$

where $C\in \mathbb R^{p\times q}$, $\beta \in \mathbb R^{d\times p}$, and $O = (o_{ij})_{1\leq i\leq n, 1\leq j\leq p}$ are known offsets. 
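To make the generative model concrete, here is a minimal NumPy sketch that draws counts from it. The function name `sample_pln` and the dimensions are illustrative assumptions and are not part of the package.

```python
import numpy as np

def sample_pln(X, beta, C, offsets, rng):
    """Draw counts from the model above (illustrative sketch, not the package API)."""
    n, q = X.shape[0], C.shape[1]
    W = rng.standard_normal((n, q))        # W_i ~ N_q(0, I_q)
    Z = X @ beta + W @ C.T                 # row i is beta^T x_i + C W_i
    Y = rng.poisson(np.exp(offsets + Z))   # Y_ij | Z_ij ~ Poisson(exp(o_ij + Z_ij))
    return Y, Z, W

rng = np.random.default_rng(0)
n, d, p, q = 200, 2, 5, 2                  # arbitrary sizes chosen for the example
X = rng.standard_normal((n, d))            # covariates
beta = 0.3 * rng.standard_normal((d, p))   # regression coefficients
C = 0.5 * rng.standard_normal((p, q))      # loadings; C C^T has rank at most q
offsets = np.zeros((n, p))                 # no offset
Y, Z, W = sample_pln(X, beta, C, offsets, rng)
```

Taking `q = p` in this sketch gives the full-rank case, and `q < p` gives a low-rank latent covariance $CC^{\top}$.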

Writing $\Sigma = CC^{\top}$, we can see that 

$$Z_{i} \sim \mathcal N_p (\beta^{\top}\mathbf{x}_{i}, \Sigma) $$

The unknown parameter is $\theta = (\Sigma,\beta)$. The latent variable of the model can be seen as $Z$ or $W$. 


- When $p=q$, we call this model the Poisson-Log Normal (PLN) model. In this case, $Z_i$ is a non-degenerate Gaussian with mean $\beta^{\top}\mathbf{x}_{i} \in \mathbb R^p$ and covariance matrix $\Sigma$.  
- When $q<p$, we call this model Poisson-Log Normal-Principal Component Analysis (PLN-PCA). Indeed, we are doing a PCA in the latent layer, estimating $\Sigma$ with a rank-$q$ matrix: $CC^{\top}$.

The goal of this package is to retrieve $\theta$ from the observed data $(Y, O, X)$. To do so, we will try to maximize the log-likelihood $\sum_{i} \log p_{\theta}(Y_i)$ of the model, where the marginal likelihood of a sample is:
$$p_{\theta}(Y_i)  = \int_{\mathbb R^q} p_{\theta}(Y_i,W)dW \overset{\text{ (if } p=q\text{)}}{=} \int_{\mathbb R^p} p_{\theta}(Y_i,Z)dZ$$

However, almost every integral involving the complete-data distribution is intractable, so neither gradient ascent nor the EM algorithm can be applied directly.  
We adopt two different approaches to circumvent this problem: 
- Variational approximation of the latent layer (Variational EM)
- Importance sampling based algorithm, using a gradient ascent method.









### Variational approach

We would like to use the EM algorithm, but the E step is intractable, since the distribution of $Z|Y_i$ (resp. $W|Y_i$) is unknown and cannot be integrated out. We therefore approximate the distribution of $Z|Y_i$ (resp. $W|Y_i$) with a distribution $\phi_i(Z)$ (resp. $\phi_i(W)$), where $\phi_i$ is taken from a chosen family of distributions. This changes the objective function to: 

$$\begin{align} J_Y(\theta,\phi) & = \frac 1 n \sum _{i = 1}^n J_{Y_i}(\theta, \phi_i) \\ 
J_{Y_i}(\theta, \phi_i)& =\log p_{\theta}(Y_i)-K L\left[\phi_i(Z_i) \|p_{\theta}(Z_i \mid Y_i)\right]\\ 
& = \mathbb{E}_{\phi_i}\left[\log p_{\theta}(Y_i, Z_i)\right] \underbrace{-\mathbb{E}_{\phi_i}[\log \phi_i(Z_i)]}_{\text {entropy } \mathcal{H}(\phi_i)} \\
\end{align}$$


We choose $\phi_i$ in the following family of distributions: 

$$
\phi_i \in \mathcal{Q}_{\text {diag}}=\left\{
 \mathcal{N}\left(M_{i}, \operatorname{diag} (S_{i}\odot S_i )\right)
, M_i \in \mathbb{R} ^q, S_i \in \mathbb{R} ^q\right\}
$$

We choose a Gaussian approximation because $W$ is Gaussian, so $W|Y_i$ may be well approximated. However, a diagonal covariance matrix ignores the dependency between the components of $W$ induced by conditioning on $Y_i$. 

Since the KL divergence is non-negative, $J_{Y_i}(\theta, \phi_i) \leq \log p_{\theta} (Y_i)$ for all $\phi_i$. The quantity $J_{Y}(\theta, \phi)$ is called the ELBO (Evidence Lower BOund).  
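For the diagonal Gaussian family on $W_i$, each expectation in $J_{Y_i}$ has a closed form, because $\mathbb E[\exp(a^{\top} W)]$ is known for Gaussian $W$. The sketch below evaluates it with NumPy. The function name `elbo_diag` and its argument layout are assumptions made for this illustration, not the package's interface.

```python
import numpy as np
from math import lgamma

def elbo_diag(Y, X, offsets, beta, C, M, S):
    """Per-sample ELBO J_{Y_i} for phi_i = N(M_i, diag(S_i * S_i)) on W_i (sketch)."""
    q = C.shape[1]
    mean_log_rate = offsets + X @ beta + M @ C.T             # E_phi[o_ij + Z_ij]
    # E_phi[exp(o_ij + Z_ij)] via the Gaussian moment generating function
    mean_rate = np.exp(mean_log_rate + 0.5 * (S**2) @ (C**2).T)
    log_factorial = np.vectorize(lgamma)(Y + 1.0)            # log(Y_ij!)
    expected_log_lik = (Y * mean_log_rate - mean_rate - log_factorial).sum(axis=1)
    # E_phi[log p(W_i)] plus the entropy of phi_i; the log(2*pi) terms cancel
    prior_plus_entropy = (-0.5 * (M**2 + S**2) + np.log(np.abs(S))).sum(axis=1) + 0.5 * q
    return expected_log_lik + prior_plus_entropy

# Sanity check: with C = 0 and beta = 0 the counts are Poisson(1) and independent of W,
# so phi_i = N(0, I_q) is the exact posterior and the bound is tight.
n, d, p, q = 3, 1, 4, 2
elbo = elbo_diag(np.zeros((n, p)), np.zeros((n, d)), np.zeros((n, p)),
                 np.zeros((d, p)), np.zeros((p, q)),
                 np.zeros((n, q)), np.ones((n, q)))
print(elbo)  # each entry equals log P(Y_i = 0) = -p
```

Moving `S` away from 1 in this toy case lowers the value, which illustrates that the ELBO is a lower bound on the log-likelihood.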

##### Variational EM 

Given an initialisation $(\theta^0, \phi^0)$, the variational EM aims at maximizing the ELBO by alternating between two steps: 

-  VE step: update $\phi$
$$
\phi^{t+1}=\underset{\phi \in \mathcal{Q}_{\text{diag}}}{\arg \max } J_Y(\theta^{t}, \phi)
$$
- M step: update $\theta$
$$
\theta^{t+1}=\underset{\theta}{\arg \max } J_Y(\theta, \phi^{t+1})
$$
Each step is an optimisation problem that is solved either in closed form or by gradient ascent. Note that $\phi$ is completely determined by $M = (M_i)_{1 \leq i \leq n } \in \mathbb R ^{n\times q}$ and $S = (S_i)_{1 \leq i \leq n } \in \mathbb R ^{n\times q}$, so that $J$ is a function of $(M, S, \beta, \Sigma)$. $M$ and $S$ are the variational parameters, $\beta$ and $\Sigma$ are the model parameters.  


##### Case $p = q$
In the case $p=q$ there is no dimension reduction, but the computations are very fast. 
When $p = q$, the M step is straightforward, since $\Sigma$ and $\beta$ can be updated in closed form: 

$$
\begin{aligned}
\beta^{(t+1)} &= (X^{\top}X)^{-1}X^{\top}M^{(t)} \\
\Sigma^{(t+1)} & = \frac{1}{n} \sum_{i}\left((M^{(t)}-X\beta^{(t+1)})_{i} (M^{(t)}-X\beta^{(t+1)})_{i}^{\top}+\operatorname{diag}\left(S^{(t)}_{i}\odot S^{(t)}_{i}\right)\right) \\
\end{aligned}
$$
This results in a fast algorithm, since gradient ascent is only needed for the variational parameters $M$ and $S$. In practice, it is enough to take one gradient step on $M$ and $S$, update $\beta$ and $\Sigma$ with their closed forms, take another gradient step on $M$ and $S$, and so on.
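The closed-form M step can be sketched in a few lines of NumPy. This sketch assumes the variational covariance of sample $i$ is $\operatorname{diag}(S_i \odot S_i)$, as in $\mathcal{Q}_{\text{diag}}$, and that $\Sigma$ is computed with the freshly updated $\beta$. The name `m_step_full_rank` is hypothetical.

```python
import numpy as np

def m_step_full_rank(X, M, S):
    """Closed-form update of (beta, Sigma) when p = q (illustrative sketch)."""
    n = X.shape[0]
    # Least-squares regression of the variational means on the covariates
    beta = np.linalg.solve(X.T @ X, X.T @ M)          # shape (d, p)
    R = M - X @ beta                                  # residuals, shape (n, p)
    # (1/n) * sum_i [ R_i R_i^T + diag(S_i * S_i) ]
    Sigma = (R.T @ R + np.diag((S**2).sum(axis=0))) / n
    return beta, Sigma

# Example with arbitrary sizes: means that are exactly linear in X and unit S
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2))
B = rng.standard_normal((2, 3))
beta_hat, Sigma_hat = m_step_full_rank(X, X @ B, np.ones((50, 3)))
```

In this example the residuals vanish, so `beta_hat` recovers `B` and `Sigma_hat` reduces to the average variational covariance, here the identity.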


##### Case $q < p$

When $q<p$, there is no closed form, and we must perform gradient ascent on all the parameters. In practice, we can update all the parameters in a single gradient step (solving each VE step and M step exactly is quite inefficient). 




### Importance sampling based algorithm 

In this section, we try to estimate the gradients of the log-likelihood with respect to $\theta = (C, \beta)$. 


We can use importance sampling to estimate the likelihood: 

 $$p_{\theta}(Y_i) = \int \tilde p_{\theta}^{(u)}(W) \mathrm dW \approx \frac 1 {n_s} \sum_{k=1}^{n_s} \frac {\tilde p_{\theta}^{(u)}(V_k)}{g(V_k)}, ~ ~ ~(V_{k})_{1 \leq k \leq n_s} \overset{iid}{\sim} g$$
 
where $g$ is the importance law, $n_s$ is the sampling effort and  


$$\begin{array}{ll}
\tilde p_{\theta}^{(u)}\ :& \mathbb R^{q}  \to  \mathbb R^+  \\
 & W \mapsto p_{\theta}(Y_i| W) p(W) \\
\end{array}$$

To learn more about the (crucial) choice of $g$, please see REF.

The gradient of the log-likelihood can then be approximated as follows:


$$
\nabla _{\theta} \log p_{\theta}(Y_i) \approx \nabla_{\theta} \log\left(\frac 1 {n_s} \sum_{k=1}^{n_s} \frac {\tilde p_{\theta}^{(u)}(V_k)}{g(V_k)}\right)
$$
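The quantity inside the gradient is the log of the importance sampling estimate, which is best computed in log space to avoid underflow. The NumPy sketch below does this for a single sample, assuming a diagonal Gaussian proposal $g = \mathcal N(\mu, \operatorname{diag}(\sigma^2))$. This proposal is only a placeholder and is not necessarily the choice made in the package, and the function names are hypothetical. Differentiating the result with respect to $(C, \beta)$, for instance with an automatic differentiation tool, would give the gradient approximation above.

```python
import numpy as np
from math import lgamma

def log_mean_exp(a):
    """Numerically stable log of the mean of exp(a)."""
    a_max = np.max(a)
    return a_max + np.log(np.mean(np.exp(a - a_max)))

def is_log_likelihood(y, x, o, beta, C, mu, sigma, n_s, rng):
    """Importance sampling estimate of log p_theta(Y_i) for one sample (sketch)."""
    q = C.shape[1]
    V = mu + sigma * rng.standard_normal((n_s, q))     # V_k ~ g, shape (n_s, q)
    log_rate = o + x @ beta + V @ C.T                  # o_ij + Z_ij for each draw
    log_factorial = sum(lgamma(v + 1.0) for v in y)
    # log p_theta(Y_i | W = V_k) for the Poisson emission
    log_lik = (y * log_rate - np.exp(log_rate)).sum(axis=1) - log_factorial
    # log p(V_k) for the standard Gaussian prior on W
    log_prior = -0.5 * (V**2).sum(axis=1) - 0.5 * q * np.log(2 * np.pi)
    # log g(V_k) for the assumed diagonal Gaussian proposal
    log_g = (-0.5 * (((V - mu) / sigma) ** 2).sum(axis=1)
             - np.log(sigma).sum() - 0.5 * q * np.log(2 * np.pi))
    return log_mean_exp(log_lik + log_prior - log_g)

rng = np.random.default_rng(0)
y = np.array([0, 1, 2])                                # counts for p = 3 features
C = 0.3 * rng.standard_normal((3, 2))                  # q = 2
estimate = is_log_likelihood(y, np.zeros(1), np.zeros(3), np.zeros((1, 3)),
                             C, np.zeros(2), np.ones(2), 1000, rng)
```

With `mu = 0` and `sigma = 1` the proposal equals the prior and the weights reduce to $p_{\theta}(Y_i \mid V_k)$; a proposal closer to the posterior of $W$ given $Y_i$ would reduce the variance of the estimate.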
  
  












<!---
 ### ZIPLN 

[//]ZIPLN model is a modified PLN model that tries to explain the zero inflated datasets. Basically, we add a latent variable $\xi$ parametrized by $B^0$ that will force some components of $Y$ to be zero. The model is the following : 

$$
\begin{aligned} 
W_{i} & \sim \mathcal{N}\left(0, I_{q}\right)  \\
Z_{i} &=\beta^{\top}\mathbf{x}_{i} +\mathbf{C}W_i  \in \mathbb R^p \\
\xi _{ij} &   \sim \mathcal{B}\left(\operatorname{logit}^{-1}\left(\mathbf x_{i}^{\top} B_{j}^{0}\right)\right) \in \mathbb R \\
Y_{i j} \mid Z_{i j} & \sim (1-\xi_{ij})\mathcal{P}\left(\exp \left(o_{ij} + Z_{i j}\right)\right)
\end{aligned}
$$


$
\text { We are interested in inferring } \theta=\left(\boldsymbol{\Sigma}, \boldsymbol{\beta}, \boldsymbol{B}^{0}\right) \in \mathbb{S}_{p}^{++} \times \mathcal{M}_{p, d}(\mathbb{R}) \times \mathcal{M}_{p, d}(\mathbb{R}) \text {,   where }\Sigma = CC^{\top}
$
-->
