# Test environment for package development

In [1]:
import jax
import jax.numpy as jnp
import numpy as np
import numpyro
import numpyro.distributions as dist
import pandas as pd

from frugalCopyla.model import Copula_Model



## Rough outline

Ideally, one could specify a (linear) model with a variety of parameters and generate samples from it directly. This would remove the need to work with the `numpyro` backend by hand and let you generate data from a simple model parameterisation alone.

**TO DO: NEED TO CHANGE ALL MENTION OF LINK FUNCTIONS TO INVERSE LINK FUNCTIONS**

The input should be a dictionary whose keys label the variables in your model. For each of these, specify in a sub-dictionary:

* `dist`: The distribution the variable is drawn from. These must be selected from `numpyro.distributions`
* `formula`: For each parameter in the chosen distribution, specify its linear model **only using variables defined earlier in the dictionary**. The names of the correct parameters can be found either by searching the `numpyro` documentation or by looking at the `arg_constraints` of the distribution (using the Normal as an example):
```
> numpyro.distributions.Normal.arg_constraints 

{'loc': Real(), 'scale': GreaterThan(lower_bound=0.0)}
```
* `params` (name will most likely change): Specifies the linear coefficients used to generate the primary variable through the linear model. A set of coefficients must be provided for each parameter. Note that the labelling of parameters (e.g. `'formula': {'rate': 'X ~ 1 + Z + A'}, 'params': {'rate': {'x_0': 0., 'x_1': 2., 'x_2': 1}}`) does not affect the linear model. Only the order of the specification matters. For example, `x_0` will be the coefficient of the first variable in the formula (always the intercept) and `x_2` will always be the last.
* `link`: Allows the user to provide a link function for each of the linear formulas. For example, the entry `'X': {'dist': dist.Exponential, 'formula': {'rate': 'X ~ 1 + Z + A'}, 'params': {'rate': {'x_0': 0., 'x_1': 2., 'x_2': 1}}, 'link': {'rate': jnp.exp}},` wraps the linear predictor in an exponential function, so that the probabilistic model is $$X \sim \text{Exponential}(\lambda=\exp(2Z + A)).$$ **Note that the link function must be a `jax` function (such as `jnp.exp`).** If no link function is required, leave it as `None`.
* `copula`: To specify a copula, first choose a `'class'` of copula from [frugalCopyla/copula_functions.py](../frugalCopyla/copula_functions.py). The copula functions will take in keyword arguments to calculate the log-likelihood of the copula factor. 
    * Under `vars`, provide a mapping of the variables linked by the copula and the function arguments using a dictionary. For example, the `bivariate_gaussian_copula_lpdf(u, v, rho)` factor takes two variables, and one `rho` parameter. If we wish to simulate a copula between `Z` and `Y`, provide `vars` the dictionary `..., 'vars': {'u': 'Z', 'v': 'Y'}`.
    * Under `'formula'`, specify the form of the linear predictor for the parameters passed to the copula. The coefficients for the linear predictor are specified under `'params'`.
    * Similarly to the other inputs, a link function can be chosen to wrap the linear predictor specified in `'formula'` and `'params'`.
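
To make the copula factor more concrete, here is a minimal standard-library sketch of the textbook log-density of a bivariate Gaussian copula. It is not the package's `bivariate_gaussian_copula_lpdf`, whose implementation is not shown here; the name `gaussian_copula_logpdf` is hypothetical, and plain Python is used in place of `jax` only so that the snippet is self-contained.

```python
import math
from statistics import NormalDist


def gaussian_copula_logpdf(u: float, v: float, rho: float) -> float:
    """Log-density of the bivariate Gaussian copula at (u, v), for |rho| < 1.

    Hypothetical helper for illustration; u and v must lie strictly in (0, 1).
    """
    std_normal = NormalDist()
    # Map the uniform-scale inputs to the standard normal scale.
    x = std_normal.inv_cdf(u)
    y = std_normal.inv_cdf(v)
    one_minus_rho2 = 1.0 - rho * rho
    return (
        -0.5 * math.log(one_minus_rho2)
        - (rho * rho * (x * x + y * y) - 2.0 * rho * x * y) / (2.0 * one_minus_rho2)
    )


# With rho = 0 the copula is the independence copula, so the log-density vanishes.
independent = gaussian_copula_logpdf(0.3, 0.7, 0.0)

# At the medians (x = y = 0) only the normalising term -0.5 * log(1 - rho^2) remains.
at_medians = gaussian_copula_logpdf(0.5, 0.5, 0.5)
```

The arguments `u`, `v` and `rho` mirror the keyword arguments that the `vars` and `formula` entries are mapped onto.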

For example, consider the following probabilistic model:
$$ A \sim \text{Bernoulli}(0.5) \newline Z \sim \mathcal{N}(0, 1) \newline X \sim \text{Exponential}(\exp(2Z + A)) \newline Y | \text{do}(X) \sim \mathcal{N}(X - 0.5, 1)$$
and a bivariate Gaussian copula between $Z$ and $Y$ parameterised by a fixed correlation term $\rho_{ZY} = \text{sigmoid}(1) \approx 0.731$.
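
As a rough illustration of the margins only, the distributions above can be sampled one variable at a time in the order they are defined. The sketch below uses just the standard library and deliberately ignores the copula between $Z$ and $Y$, so it does not reproduce what `Copula_Model` does; `sample_margins` is a hypothetical name.

```python
import math
import random


def sample_margins(n: int, seed: int = 0) -> list:
    """Draw n rows from the margins, ignoring the Z-Y copula (illustration only)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        a = 1 if rng.random() < 0.5 else 0           # A ~ Bernoulli(0.5)
        z = rng.gauss(0.0, 1.0)                      # Z ~ N(0, 1)
        x = rng.expovariate(math.exp(2.0 * z + a))   # X ~ Exponential(rate=exp(2Z + A))
        y = rng.gauss(x - 0.5, 1.0)                  # Y | do(X) ~ N(X - 0.5, 1)
        rows.append({"A": a, "Z": z, "X": x, "Y": y})
    return rows


rows = sample_margins(1000, seed=0)
```

Because each variable only depends on those drawn before it, the ordering requirement on the input dictionary corresponds to the order of the statements inside the loop.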

The model above is specified in `input_dict`.

In [2]:
input_dict = {
    'A': {'dist': dist.BernoulliProbs, 'formula': {'probs': 'A ~ 1'}, 'params': {'probs': {'z_0': 0.5}}, 'link': None}, 
    'Z': {'dist': dist.Normal, 'formula': {'loc': 'Z ~ 1', 'scale': 'Z ~ 1'}, 'params': {'loc': {'z_0': 0.}, 'scale': {'z_0': 1}}, 'link': None},
    'X': {'dist': dist.Exponential, 'formula': {'rate': 'X ~ 1 + Z + A'}, 'params': {'rate': {'x_0': 0., 'x_1': 2., 'x_2': 1}}, 'link': {'rate': jnp.exp}},
    'Y': {'dist': dist.Normal, 'formula': {'loc': 'Y ~ 1 + X', 'scale': 'Y ~ 1'}, 'params': {'loc': {'y_0': -0.5, 'y_1': 1.}, 'scale': {'phi': 1.}}, 'link': None},
    'copula': {'class': 'bivariate_gaussian_copula', 'vars': {'u': 'Z', 'v': 'Y'}, 'formula': {'rho': 'c ~ Z'}, 'params': {'rho': {'a': 1., 'b': 0.}}, 'link': {'rho': jax.nn.sigmoid}}
}
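
As a quick check of the copula entry, the value of `rho` implied by its coefficients can be computed by hand. This sketch assumes that `a` acts as the intercept and `b` as the coefficient of `Z` (an assumption about how the copula formula is parsed), and it uses a plain-Python `sigmoid` as a stand-in for `jax.nn.sigmoid`.

```python
import math


def sigmoid(t: float) -> float:
    """Logistic function, standing in for jax.nn.sigmoid."""
    return 1.0 / (1.0 + math.exp(-t))


a, b = 1.0, 0.0   # copula coefficients from input_dict (assumed intercept and slope)
z = 0.3           # any value of Z; it has no effect because b = 0
rho = sigmoid(a + b * z)
print(round(rho, 6))  # 0.731059
```

Since `b` is zero, `rho` does not vary with `Z`, which agrees with the constant `rho` column in the simulated output further down.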

Preparing the `Copula_Model`:

In [3]:
cop_mod = Copula_Model(input_dict)



We can see whether the model has been parsed correctly by looking at the `'full_formula'` entries in the output.

Currently, the code is set up such that each random variable in the linear model should be fetched from a dictionary named `record_dict`. That is, instead of seeing `Y ~ X + A` we should see `Y ~ record_dict['X'] + record_dict['A']`.
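
The positional matching of coefficients to formula terms, and the `record_dict` lookup, can be sketched as follows. This is not the package's parser: `expand_formula` is a hypothetical helper that only handles `+`-separated terms, and `eval` is used purely for illustration.

```python
def expand_formula(formula: str, coefs: dict) -> str:
    """Pair each right-hand-side term with a coefficient, by position (hypothetical helper)."""
    rhs = formula.split("~", 1)[1]
    terms = [t.strip() for t in rhs.split("+")]
    values = list(coefs.values())  # dicts keep insertion order; the key names are ignored
    if len(terms) != len(values):
        raise ValueError("one coefficient per term is required")
    parts = []
    for term, value in zip(terms, values):
        if term == "1":
            parts.append(str(value))  # intercept
        else:
            parts.append(f"{value} * record_dict['{term}']")
    return " + ".join(parts)


full = expand_formula("X ~ 1 + Z + A", {"x_0": 0.0, "x_1": 2.0, "x_2": 1})
print(full)  # 0.0 + 2.0 * record_dict['Z'] + 1 * record_dict['A']

# Evaluate the expanded string against previously sampled values.
record_dict = {"Z": 0.5, "A": 1}
linear_predictor = eval(full, {"__builtins__": {}}, {"record_dict": record_dict})
print(linear_predictor)  # 2.0
```

Renaming the keys `x_0`, `x_1`, `x_2` would leave the result unchanged, whereas reordering them would change which term each coefficient multiplies.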

In [4]:
parsed_model = cop_mod.parsed_model
parsed_model

{'A': {'dist': numpyro.distributions.discrete.BernoulliProbs,
  'formula': {'probs': 'A ~ 1'},
  'params': {'probs': {'z_0': 0.5}},
  'link': {},
  'full_formula': {'probs': '0.5'}},
 'Z': {'dist': numpyro.distributions.continuous.Normal,
  'formula': {'loc': 'Z ~ 1', 'scale': 'Z ~ 1'},
  'params': {'loc': {'z_0': 0.0}, 'scale': {'z_0': 1}},
  'link': {},
  'full_formula': {'loc': '0.0', 'scale': '1'}},
 'X': {'dist': numpyro.distributions.continuous.Exponential,
  'formula': {'rate': "X ~ 1 + record_dict['Z'] + record_dict['A']"},
  'params': {'rate': {'x_0': 0.0, 'x_1': 2.0, 'x_2': 1}},
  'link': {'rate': <CompiledFunction of <function _one_to_one_unop.<locals>.<lambda> at 0x10e2f5940>>},
  'full_formula': {'rate': "0.0 + 2.0 * record_dict['Z'] + 1 * record_dict['A']"}},
 'Y': {'dist': numpyro.distributions.continuous.Normal,
  'formula': {'loc': "Y ~ 1 + record_dict['X']", 'scale': 'Y ~ 1'},
  'params': {'loc': {'y_0': -0.5, 'y_1': 1.0}, 'scale': {'phi': 1.0}},
  'link': {},
  'full

Looks ok so far. Now we can simulate from the prior using MCMC. Specify the steps for warmup and sampling, the seed (if desired), and whether the joint is `'continuous'`, `'discrete'`, or `'mixed'`. If this last step is not specified correctly you may see an error.

The simulated data are returned as a dictionary containing the sampled variables, the copula variables on the quantile scale (`q_*`) together with their standard normal counterparts (`std_normal_*`), and the samples of the copula parameters (here `rho`):

In [5]:
sim_data = cop_mod.simulate_data(num_warmup=1000, num_samples=10000, joint_status='mixed', seed=0)
pd.DataFrame(sim_data).describe()

| | A | X | Y | Z | q_Y | q_Z | rho | std_normal_Y | std_normal_Z |
|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 0.5262 | 0.724472 | 0.366902 | 0.201363 | 0.5397 | 0.555629 | 0.731058 | 0.14243 | 0.201363 |
| std | 0.499338 | 1.054587 | 1.036412 | 0.867837 | 0.280151 | 0.262885 | 0.0 | 0.94551 | 0.867837 |
| min | 0.0 | 8e-06 | -3.436111 | -2.508168 | 0.001104 | 0.006068 | 0.731059 | -3.060654 | -2.508168 |
| 25% | 0.0 | 0.05526 | -0.331392 | -0.422512 | 0.303812 | 0.336326 | 0.731059 | -0.513469 | -0.422511 |
| 50% | 1.0 | 0.250307 | 0.29978 | 0.161178 | 0.551845 | 0.564023 | 0.731059 | 0.130325 | 0.161177 |
| 75% | 1.0 | 0.92827 | 0.984778 | 0.777478 | 0.782922 | 0.781561 | 0.731059 | 0.7821 | 0.777477 |
| max | 1.0 | 5.773732 | 4.266821 | 3.225994 | 0.999842 | 0.999372 | 0.731059 | 3.601015 | 3.225986 |
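
The relationship between the `std_normal_*` and `q_*` columns can be checked from the summary alone. The sketch below assumes that each `q_*` value is the standard normal CDF of the matching `std_normal_*` value; that reading is inferred from the numbers above rather than from the package source, and the medians are copied from the `50%` row.

```python
from statistics import NormalDist

std_normal = NormalDist()

# (median of std_normal_*, median of q_*) taken from the 50% row of the summary
medians = {
    "Z": (0.161177, 0.564023),
    "Y": (0.130325, 0.551845),
}

# Largest gap between Phi(median std_normal) and the reported median q.
max_gap = max(abs(std_normal.cdf(s) - q) for s, q in medians.values())
print(max_gap < 1e-4)  # True
```

The monotone transform maps medians to medians, which is why comparing the `50%` row is enough for a sanity check.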
