# Building your own Bayesian model

To develop a Bayesian solution to a statistical problem, you need to break it into three parts: *Model*, *Estimand* and *Algorithm*.

### Model
First, decide on your likelihood function and priors. Remember that the likelihood is the distribution your data are assumed to be generated from. There are some obvious questions to ask to whittle down the choice of likelihood:

- Is the data binary or discrete or continuous?
- Do you have enough data points to justify a Gaussian approximation?
- Is there a large order of magnitude difference between the lowest and highest values?
- Is there a physical justification for any introduced errors?
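As a rough illustration, the first and third questions above can be turned into a small helper. The function name `suggest_likelihood`, the family names it returns and the three-orders-of-magnitude threshold are all assumptions made for this sketch, not rules from the workshop. The questions about sample size and physically motivated errors still need human judgement.

```python
import math

def suggest_likelihood(values):
    """Return a rough, heuristic likelihood family for a list of numbers."""
    all_integers = all(float(v).is_integer() for v in values)
    # Only zeros and ones: binary data
    if all_integers and set(values) <= {0, 1}:
        return "bernoulli"
    # Non-negative whole numbers: count data
    if all_integers and min(values) >= 0:
        return "poisson"
    # Strictly positive data spanning >= 3 orders of magnitude (assumed threshold):
    # modelling on the log scale is often more sensible
    if min(values) > 0 and math.log10(max(values) / min(values)) >= 3:
        return "lognormal"
    # Fallback for everything else that is continuous
    return "normal"

print(suggest_likelihood([0, 1, 1, 0]))
print(suggest_likelihood([3, 0, 7, 2]))
print(suggest_likelihood([0.002, 1.5, 870.0]))
print(suggest_likelihood([-1.2, 0.4, 2.3]))
```

Treat the answer as a starting point only: count data with a variance much larger than its mean, for instance, may need something more flexible than a Poisson.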

Second, choose the priors that your parameters are drawn from:

- Do you have a suitable guess for an informative prior? If not, stick with a weakly informative prior.
- Could you use a high-variance 'conjugate' prior? They tend to increase the speed of your code.
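To see why a conjugate prior can be fast, here is a minimal sketch of the Gamma-Poisson pair, where the posterior is available in closed form and no sampling is needed. The counts, the prior values and the function name `gamma_poisson_update` are made up for illustration; the Gamma is parameterised by shape `alpha` and rate `beta`.

```python
def gamma_poisson_update(alpha, beta, counts):
    """Posterior Gamma(shape, rate) for a Poisson rate under a Gamma(alpha, beta) prior."""
    # Conjugacy: the shape gains the total count, the rate gains the number of observations
    return alpha + sum(counts), beta + len(counts)

counts = [2, 4, 3, 5, 1]      # made-up count data, for illustration only
alpha0, beta0 = 1.0, 0.1      # high-variance prior: mean 10, variance 100

alpha_n, beta_n = gamma_poisson_update(alpha0, beta0, counts)
posterior_mean = alpha_n / beta_n   # mean of a Gamma(shape, rate) is shape / rate
print(alpha_n, beta_n, posterior_mean)
```

With only five observations the posterior mean is already pulled from the prior mean of 10 towards the sample mean of 3, which is what a high-variance prior is meant to allow.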

### Estimand
Now that you have thought through which distribution best models the data, you need to decide which parameter(s) of the model hold the information relevant to your biological question. With only one or two parameters this is a very simple step, but models with thousands of parameters can make it rather tricky! Sometimes the estimand you are interested in is different from the parameter of the distribution. For example, consider a Poisson model for the number of counts per gene. The $\lambda$ parameter of the Poisson may consist of a cell-to-cell scale factor $A_i$ and a gene-to-gene mean $\mu_j$, so that $\lambda_{ij} = A_i \mu_j$. You want $\mu_j$ for biological interpretation, but $A_i$ is vital experimentally.
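The relationship $\lambda_{ij} = A_i \mu_j$ can be sketched in a few lines. The numbers below are hypothetical and chosen only to make the arithmetic easy to follow; nothing here is fitted to data.

```python
A = [0.5, 1.0, 2.0]     # hypothetical scale factors for three cells
mu = [10.0, 3.0]        # hypothetical mean counts for two genes

# Poisson rate for every (cell, gene) pair: lambda_ij = A_i * mu_j
lam = [[a_i * mu_j for mu_j in mu] for a_i in A]

# Once the scale factors are known, the estimand mu_j can be read off any row
mu_from_row = [[lam[i][j] / A[i] for j in range(len(mu))] for i in range(len(A))]

print(lam)           # [[5.0, 1.5], [10.0, 3.0], [20.0, 6.0]]
print(mu_from_row)   # [[10.0, 3.0], [10.0, 3.0], [10.0, 3.0]]
```

The data only ever inform the product, so $A_i$ has to be part of the model even though $\mu_j$ is the only quantity you report.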

### Algorithm
The workshop has glossed over this section under the assumption that the infinitely wise Andrew Gelman has sorted this out for our basic models. There are many different MCMC algorithms, each with its own strengths and weaknesses, never mind the non-MCMC algorithms. For our models, Stan's NUTS is like using a sledgehammer to crack...well...a nut.
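To give a feel for what an MCMC algorithm does, here is a minimal sketch of random-walk Metropolis, probably the simplest member of the family. This is not NUTS and not what Stan runs; it is a toy sampler for a single Poisson rate with a Gamma prior. The counts, prior values, step size and function names are all assumptions made for illustration.

```python
import math
import random

def log_posterior(lam, counts, alpha=1.0, beta=0.1):
    """Unnormalised log posterior: Gamma(alpha, beta) prior times Poisson likelihood."""
    if lam <= 0:
        return -math.inf          # the rate must be positive
    log_prior = (alpha - 1) * math.log(lam) - beta * lam
    log_lik = sum(counts) * math.log(lam) - len(counts) * lam
    return log_prior + log_lik

def metropolis(counts, n_steps=20000, step=1.0, start=1.0, seed=0):
    """Random-walk Metropolis: propose a nearby value, accept with probability min(1, ratio)."""
    rng = random.Random(seed)
    current, current_lp = start, log_posterior(start, counts)
    samples = []
    for _ in range(n_steps):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_posterior(proposal, counts)
        # Accept uphill moves always, downhill moves with probability exp(difference)
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples

counts = [2, 4, 3, 5, 1]                  # made-up counts
samples = metropolis(counts)[2000:]       # drop the first 2000 draws as burn-in
print(sum(samples) / len(samples))        # should land near the analytic mean 16 / 5.1
```

Because this prior is conjugate, the exact posterior is Gamma(16, 5.1), which gives a handy check on the sampler. NUTS additionally uses gradients of the log posterior to make much longer, better-directed moves, which matters once a model has more than a handful of parameters.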

# You're on your own

I've lifted a lovely data set of the number of great inventions and scientific discoveries per year from 1860 to 1959 from the even more lovely book by Ben Lambert, "A Student's Guide to Bayesian Statistics". Can you come up with the posterior distribution for the mean number of inventions per year?

In [None]:
import pystan

model_code="""
data { 
    // Define a variable to hold the number of data points being passed
    
    // Define another variable to actually hold the data

}

parameters {
    // What is the actual parameter for the model? Are there any limits to its possible values?

}

model {
    // What's a reasonable prior for the parameter?
    
    // What is the likelihood?

}
"""
# ____Remember to compile the Stan code_____



In [1]:
# ___Import the dataset here from data/evaluation_discoveries.csv____

# ___Make sure the key names in the pandas DataFrame match those defined in the model code___

# ___Trial a short version of your model just to check___


In [None]:
# ___If that works then go for the full monty___

# ___Extract the parameters___

# ___Plot a histogram/density plot of the posterior distribution___


# The Next Step: Hierarchical Models
