# Bayes Factors

Sections:
1. Revisiting the p-value
2. Bayes Theorem
3. Bayes Factor

The lecture draws from Wagenmakers, E. J. (2007). "A practical solution to the pervasive problems of p values." Psychonomic Bulletin & Review, 14(5), 779-804.

---
# 1. Revisiting the p-value

Remember, in the last class we went over the idea of the p-value. The Fisherian p-value reflects the probability that you would get the data you observed (or something more extreme) if the null hypothesis ($H_0$) were true.

$$ p = P(X | H_0) $$

Just to reiterate, the p-value itself has several major problems.

<br>

(1) p-values depend on **unobserved data**.

In parametric statistics we assume a shape or family for the null distribution, and the p-value is a tail area over outcomes at least as extreme as the one observed, most of which never occurred. This may seem like a problem restricted to parametric statistics, but non-parametric statistics also assume something about the nature of chance (e.g., that noise is _iid_), which may not always be correct.

<br>

(2) p-values depend on unknown and **subjective intentions.**

This makes sense if you consider the problem of "p-hacking". P-hacking is when you repeatedly test your data, as you collect your sample, until the p-value passes a specific threshold. Here the subjective intention of "monitoring" your data leads to the case where you're more likely to interpret a finding as significant when it is just a random configuration of $X$ that happens to take an extreme value.
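
The effect of monitoring can be illustrated with a small simulation. The sketch below is only an illustration under assumed settings that are not from the lecture: the null is true (mean 0, known standard deviation 1), a two-sided z-test is used, and the data are checked every 10 observations up to 100. The function names and all numeric settings are hypothetical.

```python
import math
import random

def two_sided_p(z):
    """Two-sided p-value for a z statistic under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))

def run_experiment(rng, n_max=100, peek_every=10, alpha=0.05):
    """Simulate one study in which H0 is true (mean 0, known sd 1).

    Returns (significant_at_any_peek, significant_at_final_n).
    """
    total = 0.0
    peeked_hit = False
    p = 1.0
    for n in range(1, n_max + 1):
        total += rng.gauss(0.0, 1.0)
        if n % peek_every == 0:
            z = total / math.sqrt(n)  # z-test with known variance
            p = two_sided_p(z)
            if p < alpha:
                peeked_hit = True  # a p-hacker would stop and report here
    return peeked_hit, p < alpha

rng = random.Random(0)
n_sims = 5000
results = [run_experiment(rng) for _ in range(n_sims)]
rate_peeking = sum(hit for hit, _ in results) / n_sims
rate_fixed = sum(final for _, final in results) / n_sims

print(f"false positive rate, single test at n=100: {rate_fixed:.3f}")
print(f"false positive rate, stop at first p<.05:  {rate_peeking:.3f}")
```

Under these assumptions the single test at the planned sample size should stay near the nominal 5%, while stopping at the first significant peek produces a noticeably higher false positive rate, even though the data-generating process is identical.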

Another subjective aspect of p-values is the very concept of using $\alpha = 0.05$ in the first place. This popular threshold is arbitrary and used only for historical reasons.

<br>

(3) p-values do not quantify **statistical evidence**

The p-value is a measure of the _existence_, not the magnitude, of an effect. For example, _p=0.1_ does not indicate substantially more or less evidence for the null hypothesis than _p=0.2_ or _p=0.06_.

Consider the case of the **p postulate**, the assumption that equal p-values provide equal evidence against the null hypothesis. Imagine you have two experiments with different sample sizes: Experiment 1 has 10 participants, while Experiment 2 has 100 participants. Imagine that we perform the same NHST on both data sets and get the same p-value (e.g., _p=0.01_). Taken at face value, the p-value says that both experiments provide equal evidence against the null hypothesis. However, Experiment 1 has a smaller sample size, so reaching the same p-value requires a much larger observed effect size than in Experiment 2. This means there is _more_ evidence against $H_0$ in Experiment 1 than in Experiment 2, but this fact isn't reflected in the traditional NHST.

<br>

(4) p-values are **poorly understood**

You can _never_ prove the $H_0$ with standard NHSTs! You can only measure evidence against it.
So you can never actually confirm that the null is true; you can only fail to reject it.


---
# 2. Bayes Theorem


An ideal null hypothesis ($H_0$) evaluation procedure should have 5 qualities:

1. It should depend only on data that were observed (unobserved data are irrelevant)
2. It should not depend on unknown intentions of the researcher.
3. It should provide a measure of evidence that takes into account both the null and alternative hypothesis.
4. It should be very easy to implement.
5. It should be “objective”

<br>

In all cases, rather than estimate the probability of your data if $H_0$ were true (i.e., $P(X | H_0)$), we want to evaluate the probability of $H_0$ given your data (i.e., $P(H_0 | X)$). Luckily, probability theory has a way of getting from $P(X | H_0)$ to $P(H_0 | X)$.

Let's say that you have a hypothesis ($H$) and data you collected to evaluate it ($D$). _Bayes Theorem_ states that

$$ P(H | D) = \frac{P(D | H) P(H)}{P(D)} $$

Now the left hand side of the equation, or the output of the theorem, is the _posterior probability_. 

$$ P(H | D) $$

The posterior probability is estimated from three other probabilities. The first is the likelihood, which we've seen before. It is the likelihood of the data given your hypothesis.

$$ P(D | H) $$

The second part of the theorem is the _prior distribution_, which reflects existing knowledge about the hypothesis.

$$ P(H) $$

The denominator is the marginal probability of the data, i.e., the probability of observing the data averaged over all hypotheses under consideration. It does not depend on $H$, so it acts as a normalization term and is often dropped.

$$ P(D) $$

Because $P(D)$ is constant with respect to $H$, we usually refer to the problem as being just the combination of the likelihood and the prior distribution.

$$ P(H | D) \propto P(D | H) P(H)$$

The beauty of the Bayesian approach is that you are estimating the probability of not just $H_0$ but really any hypothesis. So we now have a method for inferring _the evidence for the null as well as any other hypothesis_. 

In the article by Wagenmakers, he outlines an example of this approach, trying to find the optimal parameter ($\Theta$) for a binomial distribution given the data. Instead of getting just one value, you get a distribution over a range of possible values (the peak of which corresponds approximately to the value you get from traditional, non-Bayesian statistics).

![Posterior example](imgs/L19_PosteriorBinomial.png)
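
A posterior like this can be approximated on a grid. The following is a minimal sketch, not the procedure from the article: it assumes a uniform prior on $\Theta$ and uses made-up data (9 successes in 12 trials). The function name `binomial_posterior` and the grid size are hypothetical choices.

```python
import math

def binomial_posterior(successes, trials, grid_size=1001):
    """Grid approximation of P(theta | data) under a uniform prior on [0, 1]."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0] * grid_size  # assumed uninformative prior
    likelihood = [
        math.comb(trials, successes) * t**successes * (1 - t) ** (trials - successes)
        for t in thetas
    ]
    unnormalized = [lik * pri for lik, pri in zip(likelihood, prior)]
    evidence = sum(unnormalized)  # plays the role of P(D) on the grid
    posterior = [u / evidence for u in unnormalized]
    return thetas, posterior

# Illustrative data only: 9 successes out of 12 trials.
thetas, posterior = binomial_posterior(successes=9, trials=12)
mode = thetas[posterior.index(max(posterior))]
mean = sum(t * p for t, p in zip(thetas, posterior))

print(f"posterior mode: {mode:.3f}")  # coincides with the MLE, successes / trials
print(f"posterior mean: {mean:.3f}")
```

With a flat prior the posterior mode sits at the maximum likelihood estimate, which is the sense in which the peak matches the traditional answer; the rest of the curve carries the uncertainty that a single point estimate leaves out.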


Let's consider the power of the Bayesian approach to evaluating this in an applied regression context.

<br>

## Example: 

Let's consider a simple regression model. 

$$ Y = \beta_0 + \beta_1 X + \epsilon $$

In this case our _hypothesis_ is that $\beta_1 \neq 0$. Unlike the approach we discussed in the linear regression lectures, where we find the best fitting $\beta_1$ using maximum likelihood estimation, what we want to do here is evaluate the probability of a range of values for $\beta_1$. 

If $Y$ is normally distributed, the _likelihood_ function is then

$$ P(Y | X, \beta) = P(Y|\beta) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(Y - (\beta_0 + \beta_1 X))^2}{2\sigma^2} \right) $$

Now if we had a lot of prior data we could also assume a shape for the prior distribution of possible values that $\beta$ can take on. But on our first pass, we don't know this. So we stick with an uninformative prior (i.e., a uniform distribution).

$$ P(\beta) = Uniform( - \infty, + \infty) $$

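This makes our life easier, as we can use Bayes Theorem to flip the equation and get the probability distribution of the $\beta$'s. With a uniform prior the posterior is proportional to the likelihood, so for $n$ observations the log posterior is, up to an additive constant,

$$ \log P(\beta | Y) \approx - \frac{n}{2} \log(2 \pi) - n \log \sigma - \frac{1}{2 \sigma^2} \sum_{i=1}^{n} (Y_i - (\beta_0 + \beta_1 X_i))^2 $$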
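
As a rough illustration of evaluating a whole range of $\beta_1$ values, the sketch below computes a grid posterior for the slope on simulated data. Everything specific here is an assumption made for the example: the simulated data, the known $\sigma$, the flat prior, and the simplification of holding $\beta_0$ fixed so the grid stays one-dimensional.

```python
import math
import random

rng = random.Random(1)
n = 50
true_b0, true_b1, sigma = 1.0, 0.5, 1.0  # made-up values for the simulation
X = [rng.uniform(-2, 2) for _ in range(n)]
Y = [true_b0 + true_b1 * x + rng.gauss(0, sigma) for x in X]

def log_likelihood(b0, b1):
    """Gaussian log-likelihood of Y given X, with sigma treated as known."""
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))
    return -n / 2 * math.log(2 * math.pi) - n * math.log(sigma) - sse / (2 * sigma**2)

# Grid over beta_1; beta_0 is held at its simulated value (a simplification).
grid = [i / 100 for i in range(-100, 201)]  # -1.00 ... 2.00
log_post = [log_likelihood(true_b0, b1) for b1 in grid]  # flat prior: only adds a constant

# Normalize on the grid, subtracting the peak first for numerical stability.
peak = max(log_post)
weights = [math.exp(lp - peak) for lp in log_post]
total = sum(weights)
posterior = [w / total for w in weights]

b1_mode = grid[posterior.index(max(posterior))]
p_positive = sum(p for b1, p in zip(grid, posterior) if b1 > 0)

print(f"posterior mode of beta_1: {b1_mode:.2f}")
print(f"P(beta_1 > 0 | Y) on this grid: {p_positive:.3f}")
```

The output is a distribution over slopes rather than a single estimate, so statements such as "how probable is a positive slope" can be read directly off the grid. A fuller treatment would place priors on $\beta_0$ and $\sigma$ as well.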



---
# 3. Bayes Factor

<br>
Thanks to Bayes Theorem, we now have a method for estimating whole probability distributions of parameters (in this case, hypotheses) that we can use to directly calculate evidence for one hypothesis over another. Let's put it to use for hypothesis testing.

Let's say that along with the null hypothesis ($H_0$), you have your research hypothesis that you need to evaluate directly ($H_1$). Using Bayes Theorem we can directly calculate the posterior odds of one hypothesis over the other.

$$ \frac{P(H_0 | D)}{P(H_1 | D)} = \frac{P(D | H_0)}{P(D | H_1)} \cdot \frac{P(H_0)}{P(H_1)} $$

The first term on the right-hand side, the ratio of the likelihoods, is the _Bayes Factor_ (BF), here written as $BF_{01}$; the second term is the prior odds. Higher BFs reflect more evidence for the null over the alternative. Smaller BFs indicate more evidence for the alternative over the null. The BF fundamentally **prefers the hypothesis under which the observed data are more likely to arise.**

Now the equation above is the normative form of a BF. There are many variants that exist in the literature, but all are interpreted the same way: they provide a measure of evidence for the $H_0$ or the alternative hypothesis ($H_1$). 

Interpreting a BF is usually done by a heuristic table. The table below is the example provided by Wagenmakers.
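
To make the relation concrete, posterior odds are the likelihood ratio multiplied by the prior odds, and odds convert to a probability via odds / (1 + odds). The short sketch below uses hypothetical helper names and made-up numbers; it only restates the equation above in code.

```python
def posterior_odds(bf_01, prior_odds_01=1.0):
    """Posterior odds of H0 over H1: likelihood ratio times prior odds."""
    return bf_01 * prior_odds_01

def odds_to_probability(odds):
    """Convert odds in favor of H0 into P(H0 | D), assuming only H0 and H1 exist."""
    return odds / (1.0 + odds)

bf_01 = 3.0  # illustrative value: data are 3 times more likely under H0 than H1

# Equal prior odds: the posterior odds equal the likelihood ratio.
even = posterior_odds(bf_01, prior_odds_01=1.0)
print(even, odds_to_probability(even))

# A skeptic of H0 (prior odds 1:4) ends up with different posterior odds
# from the same data.
skeptic = posterior_odds(bf_01, prior_odds_01=0.25)
print(skeptic, odds_to_probability(skeptic))
```

The likelihood ratio is the part contributed by the data; the prior odds are the part contributed by the analyst, which is why the two are usually reported separately.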

![BF Table](imgs/L19_BFTable.png)

So inferring from a BF follows a similar heuristic to the one you use for inferring from a p-value. But instead of a simple binary decision (i.e., reject or fail to reject), with a BF you estimate the _degree of evidence_ for or against the null itself.

<br>

---

## Example

To understand BFs in practice, let's return to the regression problem.

Let's say that you are trying to see whether verbal fluency is impacted by income and social stress, after controlling for age & education. Your $H_1$ is thus defined by the model.

$$ Y_{VF} = \beta_0 + \beta_1 X_{income} + \beta_2 X_{SS} + \beta_3 X_{age} + \beta_4 X_{education} + \epsilon $$

In this model p = 5, with 3 factors being control or non-interest factors (i.e., intercept, $X_{age}$, $X_{education}$). Remember, the way we stated the research hypothesis is that both _income_ **and** _social stress_ impact verbal fluency. Therefore our null model is the case where both of these factors have no influence on _Y_ (i.e., $\beta_1 = 0$ & $\beta_2 = 0$). 

$$ Y_{VF} = \beta_0 + \beta_3 X_{age} + \beta_4 X_{education} + \epsilon $$

Now to evaluate the amount of information accounted for in each model, we can use a form of information criterion similar to the AIC that we discussed earlier. In this case we can use the Bayesian form.

**Bayesian Information Criterion (BIC):**

$$ BIC(H_i) = -2 \log L_i + p_i \log n $$

Here we are getting the BIC value for any hypothesis $H_i$ from $L_i$, the maximized likelihood of the data under the model that $H_i$ specifies. The term $p_i$ reflects the number of free parameters in the model and $n$ is the number of observations.

For linear regression with normally distributed errors, the likelihood term reduces (up to an additive constant that is the same for all models fit to the same data) to

$$ -2 \log L_i = n \log (1-R_i^2) = n \log \frac{RSS_i}{TSS} $$

So we can use the BIC to generate a BF comparing any given hypothesis against the null ($H_0$). We refer to this as $BF_{01}$, which quantifies the evidence for $H_0$ relative to $H_1$. 

$$ BF_{01} \approx \frac{P(D | H_0)}{P(D | H_1)} = \exp({\frac{\Delta {BIC}_{10}}{2}}) $$

Where

$$ \Delta {BIC}_{10} = BIC(H_1) - BIC(H_0) $$

For linear models, a useful approximation, with ${SSE}_i$ the sum of squared errors (residuals) of model $H_i$, is:

$$ \Delta {BIC}_{10} = n \log (\frac{{SSE}_1}{{SSE}_0}) + (p_1 - p_0) \log n $$

So the BF is just a transformation of the difference between the BICs of the two models. You can now interpret $BF_{01}$ according to the table above.
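
The calculation can be sketched in a few lines. The example below builds each BIC from its definition for a Gaussian linear model, $n \log(SSE_i / n) + p_i \log n$ (dropping constants shared across models), and then takes $BIC(H_1) - BIC(H_0)$. The sums of squares, sample size, and function names are made up for illustration; the parameter counts (3 and 5) follow the null and full models above.

```python
import math

def bic_linear(sse, n_params, n):
    """BIC of a Gaussian linear model, dropping constants shared across models."""
    return n * math.log(sse / n) + n_params * math.log(n)

def bf_01_from_bics(bic_h0, bic_h1):
    """BIC approximation to the Bayes Factor in favor of H0 over H1."""
    delta_bic_10 = bic_h1 - bic_h0
    return math.exp(delta_bic_10 / 2)

# Illustrative numbers only (not real data).
n = 100
sse_0, p_0 = 120.0, 3  # null model: intercept, age, education
sse_1, p_1 = 110.0, 5  # full model: adds income and social stress

bic_0 = bic_linear(sse_0, p_0, n)
bic_1 = bic_linear(sse_1, p_1, n)
delta_bic_10 = bic_1 - bic_0
bf_01 = bf_01_from_bics(bic_0, bic_1)

print(f"delta BIC_10 = {delta_bic_10:.3f}")
print(f"BF_01        = {bf_01:.3f}")
```

In this made-up case the full model fits better (smaller SSE), but the penalty for its two extra parameters offsets the gain, so $\Delta BIC_{10}$ is slightly positive and $BF_{01}$ is slightly above 1: essentially no evidence either way.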

The beauty of this formulation is that you can also directly estimate the posterior probability of any given hypothesis, in this case a linear regression model, from a set of $k$ competing models (assuming the models are equally likely a priori) as:

$$ P_{BIC}(H_i | D) = \frac{ \exp (-0.5 \cdot BIC(H_i)) }{\sum_{j=0}^{k-1} \exp (-0.5 \cdot BIC(H_j) ) } $$

Now if you boil it down to just evaluating two models, say $H_0$ and $H_1$, then this reduces to:

$$ P_{BIC}(H_0 | D) = \frac{1} {1+\exp(-0.5 \Delta {BIC}_{10}) } $$

Remember, in this way, the probability of $H_1$ is now just

$$ P_{BIC}(H_1 | D) = 1 - P_{BIC}(H_0 | D) $$
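
Both forms can be checked against each other numerically. The sketch below assumes equal prior probability for every model and uses made-up BIC values; the function name `bic_posteriors` is hypothetical. Subtracting the smallest BIC before exponentiating is a numerical-stability choice that leaves the resulting probabilities unchanged.

```python
import math

def bic_posteriors(bics):
    """Posterior model probabilities from BIC values, assuming equal priors."""
    best = min(bics)
    # Shifting by the smallest BIC avoids underflow and cancels in the ratio.
    weights = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(weights)
    return [w / total for w in weights]

# Illustrative BIC values only.
bic_h0, bic_h1 = 250.0, 246.0

p_h0, p_h1 = bic_posteriors([bic_h0, bic_h1])

# Two-model closed form, using delta BIC_10 = BIC(H1) - BIC(H0).
delta_bic_10 = bic_h1 - bic_h0
p_h0_closed = 1.0 / (1.0 + math.exp(-0.5 * delta_bic_10))

print(f"P(H0 | D) general form: {p_h0:.4f}")
print(f"P(H0 | D) closed form:  {p_h0_closed:.4f}")
print(f"P(H1 | D):              {p_h1:.4f}")
```

The same function handles more than two models by passing a longer list of BIC values.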

Now you can estimate the probability of ANY hypothesis, including the null, from the data you have. This relationship would look something like this.

![Posterior Probability of Null](imgs/L19_PostProbability.png)



    
    

