# Building your own Bayesian model

To develop a Bayesian solution to a statistical problem, you need to break it into three parts: *Model*, *Estimand* and *Algorithm*.

### Model
First, you decide on your likelihood function and priors. Remember, the likelihood is the distribution that your data are generated from. There are some obvious questions to ask to whittle down the choice of likelihood:

- Is the data binary or discrete or continuous?
- Do you have enough data points to justify a Gaussian approximation?
- Do the lowest and highest values differ by several orders of magnitude?
- Is there a physical justification for any introduced errors?

Second, you choose the priors that your parameters are sampled from:

- Do you have a suitable guess for an informative prior? If not, stick with a weakly informative prior.
- Could you use a high-variance 'conjugate' prior? They tend to increase the speed of your code.
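
To make the idea of a conjugate prior concrete, here is a minimal sketch using a Beta prior with a Bernoulli likelihood, where the posterior is available in closed form without any sampling. The data values and the Beta(1, 1) prior are made up for illustration and are not part of the workshop data.

```python
# Hypothetical binary data: 1 = success, 0 = failure (made up for illustration)
data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

# Weakly informative Beta(1, 1) prior, i.e. uniform on [0, 1]
prior_alpha, prior_beta = 1.0, 1.0

successes = sum(data)
failures = len(data) - successes

# Conjugacy: a Beta prior with a Bernoulli likelihood gives a Beta posterior,
# so the update is just adding the observed counts to the prior parameters
post_alpha = prior_alpha + successes
post_beta = prior_beta + failures

# Mean of a Beta(a, b) distribution is a / (a + b)
posterior_mean = post_alpha / (post_alpha + post_beta)
print(post_alpha, post_beta, posterior_mean)
```

Because the posterior has a known form, no MCMC is needed here, which is where the speed benefit of conjugate priors comes from.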

### Estimand
Now that you have thought through which distribution best models the data, you need to decide which parameter(s) of the model hold the information relevant to your biological question. With only 1 or 2 parameters this is a very simple step, but models with thousands of parameters can make it rather tricky! Sometimes the estimand you are interested in is different from the parameter of the distribution. For example, consider a Poisson model for the number of counts per gene. The $\lambda$ parameter of the Poisson may consist of a cell-to-cell scale factor $A_i$ and a gene-to-gene mean $\mu_j$, so that $\lambda_{ij} = A_i \mu_j$. You want $\mu_j$ for biological interpretation, but $A_i$ is still vital experimentally.
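
As a rough illustration of the difference between a model parameter and the estimand, the sketch below simulates counts from $\lambda_{ij} = A_i \mu_j$ and recovers $\mu_j$ while treating $A_i$ as known. The numbers of cells and genes, the values of `A` and `mu`, and the simple ratio estimator are all assumptions chosen for the example. In a real analysis $A_i$ would itself need to be estimated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and parameter values, chosen only for illustration
n_cells, n_genes = 200, 3
A = rng.uniform(0.5, 2.0, size=n_cells)   # cell-to-cell scale factors A_i
mu = np.array([1.0, 5.0, 20.0])           # gene-to-gene means mu_j

# lambda_ij = A_i * mu_j, one rate per (cell, gene) pair
lam = np.outer(A, mu)
counts = rng.poisson(lam)

# With A treated as known, the maximum likelihood estimate of mu_j
# is the total count for gene j divided by the total scale factor
mu_hat = counts.sum(axis=0) / A.sum()
print(mu_hat)
```

The Poisson rate `lam` is what the likelihood uses, but only `mu_hat` answers the biological question.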

### Algorithm
The workshop has glossed over this section on the assumption that the infinitely wise Andrew Gelman has sorted this out for our basic models. There are many different MCMC algorithms, each with its own strengths and weaknesses, never mind the non-MCMC algorithms. For our models, Stan's NUTS is like using a sledgehammer to crack...well...a nut.

# You're on your own

I've lifted a lovely data set of the number of great inventions and scientific discoveries per year, covering 1860 to 1959, from the even more lovely book by Ben Lambert, "A Student's Guide to Bayesian Statistics". Can you come up with the posterior distribution for the mean number of inventions per year?

In [None]:
import pystan

model_code="""
data { 
    // Define a variable to hold the number of data points being passed
    
    // Define another variable to actually hold the data

}

parameters {
    // What is the actual parameter for the model? Are there any limits to its possible values?

}

model {
    // What's a reasonable prior for the parameter?
    
    // What is the likelihood?

}
"""
# ____Remember to compile the Stan code_____



In [1]:
# ___Import the dataset here from data/evaluation_discoveries.csv___

# ___Make sure the key names in the pandas dataframe match those defined in the model code___

# ___Trial a short version of your model just to check___


In [None]:
# ___If that works then go for the full monty___

# ___Extract the parameters___

# ___Plot a histogram/density plot of the posterior distribution___


# The Next Step : Hierarchical Models

Previously we specified our likelihoods, then the priors, and fixed the parameters of the priors to reasonable values. There's nothing stopping us from treating the parameters of our priors as random variables themselves, to be sampled from an even higher-level distribution. Why would we ever want to do this?

Let's say you want to know the mean test score across all secondary school students in Edinburgh. You could pool all of the students from all of the schools together and find the mean of a single normal distribution across all of them. Or you could accept that a student's test score will be highly correlated with the school they go to. In that case you need a suitable distribution from which to sample the school-to-school means, and from that the overall mean. So you will have a higher-level normal distribution for the school-to-school means, which has priors on its mean and variance, feeding the means to the lower-level normal distribution of student-to-student scores, which has another prior on its variance.
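
One way to get a feel for this structure is to simulate from it. Writing the school means as $\theta_j \sim \mathrm{Normal}(\mu, \tau)$ and the student scores as $y_{ij} \sim \mathrm{Normal}(\theta_j, \sigma)$, the sketch below generates data in that order. The number of schools and students and the values of `overall_mean`, `between_sd` and `within_sd` are hypothetical and are not taken from the workshop data set.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes and parameter values, chosen only for illustration
n_schools, n_students = 8, 30
overall_mean, between_sd, within_sd = 60.0, 8.0, 5.0

# Higher level: each school draws its own mean around the overall mean
school_means = rng.normal(overall_mean, between_sd, size=n_schools)

# Lower level: each student's score is drawn around their school's mean
scores = rng.normal(school_means[:, None], within_sd,
                    size=(n_schools, n_students))

# Pooled view: a single mean and spread for every student
print("pooled mean:", scores.mean(), "pooled sd:", scores.std())

# Per-school view: the means differ from school to school
print("school sample means:", scores.mean(axis=1))
```

The pooled summary hides the school-to-school spread that the per-school means reveal, which is the variation the hierarchical model is designed to capture.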

**CHALLENGE**

Can you create both the simple Bayesian model, with just one normal distribution for the students from all schools pooled together, and the hierarchical model that accounts for school-to-school variation?

In [None]:
# BASIC MODEL
import pystan

model_code="""
data { 
    // Define a variable to hold the number of data points being passed
    
    // Define another variable to actually hold the data

}

parameters {
    // What is the actual parameter for the model? Are there any limits to its possible values?

}

model {
    // What's a reasonable prior for the parameter?
    
    // What is the likelihood?

}
"""
# ____Remember to compile the Stan code_____



In [1]:
# ___Import the dataset here from data/mean_test_results.csv___

# ___You will have to manipulate the data to allow Stan to process it___
# ___Can you find a way to combine all the columns into one long vector?___

# ___Make sure the key names in the pandas dataframe match those defined in the model code___

# ___Trial a short version of your model just to check___


In [2]:
# ___If that works then go for the full monty___

# ___Extract the parameters___

# ___Plot a histogram/density plot of the posterior distribution___


In [None]:
# HIERARCHICAL MODEL
import pystan

model_code="""
data { 
    // Define two variables to hold the number of data points being passed (rows and columns)
    
    // Define another variable to actually hold the data (think about its structure)

}

parameters {
    // What are the actual parameters for the model? Are there any limits to their possible values?

}

model {
    // What are reasonable priors for the parameters?
    
    // What distribution generates the school means?
    
    // What is the student score likelihood?

}
"""
# ____Remember to compile the Stan code_____



In [None]:
# ___Import the dataset here from data/mean_test_results.csv___

# ___You will have to manipulate the data to allow Stan to process it___
# ___Can you find a way to convert the data into an array instead of a dataframe?___

# ___Make sure the key names in the pandas dataframe match those defined in the model code___

# ___Trial a short version of your model just to check___


In [None]:
# ___If that works then go for the full monty___

# ___Extract the parameters___

# ___Plot a histogram/density plot of the posterior distribution___
