In [1]:
import numpy as np
import matplotlib.pyplot as plt

# Also used, but not necessary (all code works without it)
# import tqdm as tqdm

## Structure:

| Part | Topic | Type |
|:----:|-------|------|
| 1 | You are already a Bayesian | Intuition |
| 2 | The Keypad | Intuition |
| 3 | Bayes' Theorem | Derivation |
| 4 | Medical Test | Worked Example |
| 5 | The Astronomical Problem | Motivation |
| 6 | Marginalisation | Theory |
| 7 | Monty Hall | Worked Example (Optional) |
| 8 | Likelihoods | Theory (Optional) |
| 9 | Fitting an Absorption Line | Hands-on Exercise |
| 10 | Summary | — |


# Introduction to Bayesian Statistics (crash course)

## Part 1: You are already a Bayesian

In 1915, Albert Einstein presented his work on the Theory of General Relativity. In September 2015, LIGO made the first direct detection of gravitational waves (announced in February 2016), a result predicted by General Relativity.

**Question 1:** Upon hearing this, were you:

- a) More inclined to believe that GR was a good model of reality.
- b) Less inclined to believe that GR was a good model of reality.
- c) Neither. While GR survived another test, my belief about the validity of GR remained unchanged.

Think about this and lock in your answer before clicking below:

<details style="margin: 10px 0;">
<summary style="cursor: pointer; font-weight: bold; padding: 10px; background-color: #cce5ff; border: 1px solid #004085; border-radius: 5px; color: #004085;">Click to reveal answer</summary>

<div style="margin-top: 10px; padding: 15px; background-color: #cce5ff; border: 1px solid #004085; border-radius: 5px; color: #004085;">

If you answered a), you are a Bayesian. You updated your belief about a hypothesis (GR is correct) based on new evidence (LIGO detection).

If you answered b), you are lying.

If you answered c), you are a frequentist. Under frequentism, hypotheses don't have probabilities; they are either true or false. The data can only reject or fail to reject a null hypothesis. Feels a bit silly that this is the only valid frequentist interpretation, doesn't it?

</div>
</details>

**Question 2:** LIGO reported a $5\sigma$ detection. If you were a frequentist, how would you interpret this? 

- a) There is a $>99.9999\%$ probability that gravitational waves exist
- b) There is a $>99.9999\%$ probability that GR is correct
- c) Assuming gravitational waves do not exist, the probability of observing data at least this extreme due to noise alone is $<1 \times 10^{-6}$.
- d) The result is so unlikely under the null hypothesis that we can be essentially certain the null hypothesis is false.


<details style="margin: 10px 0;">
<summary style="cursor: pointer; font-weight: bold; padding: 10px; background-color: #cce5ff; border: 1px solid #004085; border-radius: 5px; color: #004085;">Click to reveal answer</summary>

<div style="margin-top: 10px; padding: 15px; background-color: #cce5ff; border: 1px solid #004085; border-radius: 5px; color: #004085;">


If you answered a) or b), you assigned a probability to a model. This is not allowed under frequentism.

If you answered c), this is the only correct answer under frequentism.

If you answered d), you also assigned a probability to a model, so under frequentism this is incorrect. You are allowed to say that the data are unlikely under the null hypothesis, but not that the null hypothesis itself is unlikely.

</div>
</details>
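To make the "$5\sigma$" language concrete, here is a minimal sketch that converts a significance in units of $\sigma$ into a tail probability. It assumes Gaussian noise and the one-sided convention; the function name `one_sided_p_value` is made up for this illustration.

```python
import math

def one_sided_p_value(n_sigma):
    """Tail probability P(X > n_sigma) for a standard normal variable X."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2))

# Probability of a noise fluctuation at least this large, assuming the null hypothesis
for n_sigma in (1, 2, 3, 5):
    print(f"{n_sigma} sigma -> p = {one_sided_p_value(n_sigma):.3g}")
```

Note that this number is $P(\text{data at least this extreme} \mid \text{null})$. It is not the probability that the null hypothesis is true.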

**Question 3:** Let's assume that LIGO reported that the mass of the black hole was $5.2\pm2.1$ solar masses, with a 95% confidence interval. Which of the following is the correct frequentist interpretation?

- a) There is a 95% probability that the true mass of the black hole lies between \[3.1, 7.3\] solar masses.
- b) There is a 5% probability that the true mass of the black hole is outside of this interval.
- c) The parameter lies in this interval with high confidence, but we cannot assign a probability to that statement.
- d) If we were to repeat this experiment many times, 95% of the confidence intervals constructed this way would contain the true value.

<details style="margin: 10px 0;">
<summary style="cursor: pointer; font-weight: bold; padding: 10px; background-color: #cce5ff; border: 1px solid #004085; border-radius: 5px; color: #004085;">Click to reveal answer</summary>

<div style="margin-top: 10px; padding: 15px; background-color: #cce5ff; border: 1px solid #004085; border-radius: 5px; color: #004085;">


The correct frequentist interpretation is (d).

A 95% confidence interval does not mean there is a 95% probability that the parameter lies in the interval you computed.

Under frequentism:

The parameter is fixed (though unknown).

The interval is random.

Probability statements apply only to the procedure, not to this specific interval.

The interval you obtained either contains the true value or it does not—there is no probability attached to that fact.

If, however, you instinctively read a confidence interval as “there is a 95% chance the parameter lies here,” then you once again assigned a probability to a hypothesis conditioned on data. You were thinking like a Bayesian.

</div>
</details>
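The "repeat the experiment many times" reading of a confidence interval can be checked by simulation. The sketch below is only an illustration under assumptions that are not in the original question: the true mass is fixed at 5.2, each experiment is a single Gaussian measurement, and the noise level is chosen so that the 95% half-width is 2.1.

```python
import numpy as np

rng = np.random.default_rng(42)

true_mass = 5.2           # assumed true value: fixed, but unknown to the observer
sigma = 2.1 / 1.96        # assumed noise level, so that 1.96 * sigma = 2.1
n_experiments = 100_000

# Each experiment yields one measurement and therefore one (random) interval
measurements = rng.normal(true_mass, sigma, size=n_experiments)
lower = measurements - 1.96 * sigma
upper = measurements + 1.96 * sigma

# Fraction of intervals that contain the fixed true value
covered = (lower <= true_mass) & (true_mass <= upper)
coverage = covered.mean()
print(f"Fraction of intervals containing the true mass: {coverage:.3f}")
```

The 95% describes the long-run behaviour of the procedure. Any single interval either contains the true value or it does not.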


## Part 2: The Keypad

You are trying to break into a house and you come across the keypad below.
<img src="Images/LockpadClean.png"
     alt=" "
     width="300"
     style="display:block; margin-left:auto; margin-right:auto;">

You note that this is a 4-digit lock. Given that each digit can take on the values 0-9, the number of possible combinations is $10^4 = 10,000$. If you had to guess randomly, you would expect it to take ~5000 guesses to unlock the door. Not great odds for the would-be burglar.

You take a closer look, and notice that there is a distinct wear pattern.

<img src="Images/LockpadWorn.png"
     alt=" "
     width="300"
     style="display:block; margin-left:auto; margin-right:auto;">

You notice that the keys 1, 2, 5, 7, and Enter are notably worn compared to the rest. The other keys look practically untouched. You reason that the code is probably a combination that only uses these four digits. If this is the case, there are now $4^4 = 256$ likely combinations. On average you would now expect it to take ~128 guesses to unlock the door.

Notice what we just did. We haven't seen the code. We haven't tried a combination. But our knowledge (i.e. our prior) has been updated. Updating your beliefs based on the wear of the keypad is just common sense. This is Bayesian reasoning in action.
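The counting above can be reproduced in a few lines. This is a sketch, assuming every candidate code is equally likely and that you never repeat a guess, in which case the expected number of guesses is $(N+1)/2$. The last count goes slightly beyond the text: it adds the stricter assumption that every worn key appears in the code.

```python
from itertools import product

all_digits = "0123456789"
worn_digits = "1257"      # the worn digit keys read off the keypad
code_length = 4

n_all = len(all_digits) ** code_length     # codes before looking at the wear
n_worn = len(worn_digits) ** code_length   # codes that use only worn keys

def expected_guesses(n_candidates):
    """Mean number of guesses for a uniformly random code, guessing without repeats."""
    return (n_candidates + 1) / 2

# Enumerate the worn-key codes explicitly as a sanity check
candidates = ["".join(c) for c in product(worn_digits, repeat=code_length)]

# Stricter assumption (not in the text): every worn key is used at least once
uses_every_worn_key = [c for c in candidates if set(c) == set(worn_digits)]

print(f"All codes:         {n_all}, expected guesses {expected_guesses(n_all)}")
print(f"Worn keys only:    {n_worn}, expected guesses {expected_guesses(n_worn)}")
print(f"Each worn key once: {len(uses_every_worn_key)}")
```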

You take a step back, and realise that the house number that you are looking at is 1572 Burglary Road:

<img src="Images/LockpadAddress.png"
     alt=" "
     width="300"
     style="display:block; margin-left:auto; margin-right:auto;">

1572 Burglary Road just so happens to contain all four digits of the subset of worn keys. Your first guess would likely be 1572. If that didn't work, you might try it backwards: 2751.

You've taken prior knowledge (people often use easy-to-remember combinations), combined it with evidence (the address, the wear pattern), and arrived at a posterior belief (1572 is very likely). You may still be wrong, it could be a different combination, but clearly 1572 is the best guess. Bayesian inference, which comes naturally, seems to have worked well.

A frequentist could construct a test here. Define a null hypothesis that the code is random, compute the probability of the coincidence, and probably reject the null. But all that says is that the null hypothesis (the code is random) is unlikely. It doesn't tell you what the code likely is, which is what we actually care about.

**Question:** Suppose all keys 0-9 were equally worn. How would this change your inference about the code?

## Part 3: Bayes' Theorem

### Notation
Before we derive anything, let's go over notation. Typically, probabilities are written like this:

$$P(A\,|\,B) $$

This denotes the probability of $A$ *given* $B$, also stated as the probability of $A$ *conditional* on $B$ being true or known. The vertical bar "$|$" should be read as "given" or "assuming". You may also see the following:

$$P(A,\,B\,|\,C) $$

The comma denotes "and", so the above reads "the probability that $A$ and $B$ are both true, given $C$". You may also see this written with the intersection symbol: $P(A\cap B\, |\, C)$. This means the same thing.

Some examples:
- $P(\text{rain})$ --- The probability that it will rain.
- $P(\text{rain}\,|\,\text{cloudy})$ --- The probability that it will rain, given that it is cloudy.
- $P(\text{rain, cold}\, | \,\text{winter})$ --- The probability it will rain and be cold, given that it's winter.

Also note that the order matters. Generally, $P(A\,|\,B) \ne P(B\,|\,A)$. For example:
- $P(\text{dead}\, | \, \text{guillotine}) \approx 1$ --- The probability of being dead given that you were guillotined is high.
- $P(\text{guillotine}\, | \, \text{dead}) \approx 0$ --- The probability of having been guillotined given that you are dead is low.

While that seems obvious, confusing the two is common, and is known as the prosecutor's fallacy, or transposing the conditional.

### The Product Rule

A fundamental rule in probability is the product rule. This states that the probability that both $A$ and $B$ occur can be written as:
$$P(A,\,B) =  P(A\,|\,B)P(B) =  P(B\,|\,A)P(A) $$
This reads as: The probability of $A$ and $B$ equals the probability of $A$ given $B$ times the probability of $B$ (which is equivalently the probability of $B$ given $A$ times the probability of $A$).
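A quick numerical check of the product rule, and of the fact that $P(A\,|\,B) \ne P(B\,|\,A)$ in general. The joint probabilities below are invented purely for illustration.

```python
import numpy as np

# Hypothetical joint probabilities P(rain?, cloudy?)
# rows: rain, no rain; columns: cloudy, clear
joint = np.array([[0.27, 0.03],
                  [0.33, 0.37]])

p_rain = joint[0, :].sum()        # P(rain), summing over cloudy/clear
p_cloudy = joint[:, 0].sum()      # P(cloudy), summing over rain/no rain
p_rain_and_cloudy = joint[0, 0]   # P(rain, cloudy)

# Conditional probabilities follow from the definition P(A|B) = P(A,B) / P(B)
p_rain_given_cloudy = p_rain_and_cloudy / p_cloudy
p_cloudy_given_rain = p_rain_and_cloudy / p_rain

print(f"P(rain | cloudy) = {p_rain_given_cloudy:.2f}")
print(f"P(cloudy | rain) = {p_cloudy_given_rain:.2f}")

# Both factorisations recover the same joint probability
print(np.isclose(p_rain_given_cloudy * p_cloudy, p_cloudy_given_rain * p_rain))
```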

### Deriving Bayes' Rule
We can take the above equation
 $$ P(A\,|\,B)P(B) =  P(B\,|\,A)P(A) $$
and rearrange for $P(A\,|\,B)$:
 $$ P(A\,|\,B) =  \frac{P(B\,|\,A)P(A)}{P(B)} $$
In Bayesian Inference, we want $P(\theta \, |\, D)$, where $\theta$ is a parameter and $D$ is data. We can rewrite the above equation as:

$$\boxed{P(\theta \mid D) = \frac{P(D \mid \theta) \, P(\theta)}{P(D)}}$$

**This is Bayes' Theorem.** It follows directly from the basic rules of probability. When people are distrustful of Bayes, it isn't because they disagree with the algebra; it is because they don't think that $P(\theta \mid D)$ is a meaningful quantity. Bayesians disagree.

Often in science we want $P(\theta \mid D)$. This is the *probability distribution* of a parameter value, given our data. But usually what is easier to compute from our physical models is $P(D \mid \theta)$, the probability of observing our data, given a particular parameter value/model. All that Bayes' theorem does is connect what we want (the posterior) with what is easy to compute (the likelihood). Each term in Bayes' Theorem has a name:

$$\underbrace{P(\theta \mid D)}_{\text{posterior}} = \frac{\overbrace{P(D \mid \theta)}^{\text{likelihood}} \, \overbrace{P(\theta)}^{\text{prior}}}{\underbrace{P(D)}_{\text{evidence}}}$$

| Term | Name | Meaning |
|------|------|---------|
| $P(\theta)$ | **Prior** | What you believed about $\theta$ before seeing data |
| $P(D \mid \theta)$ | **Likelihood** | The physics: the probability of this data given $\theta$ |
| $P(D)$ | **Evidence** | Normalisation constant (for now, more later) |
| $P(\theta \mid D)$ | **Posterior** | What you should believe about $\theta$ after seeing data |

Personally, I find the $P(A\mid B)$ notation confusing, so I often write Bayes' theorem as:

$$ P(\theta \mid D) = \frac{\mathcal{L}(\theta) \pi(\theta)}{\mathcal{Z}} $$

Not only is this visually simpler and uses cooler looking symbols, it helps remind me that the likelihood and priors are functions. Note that the evidence, $\mathcal{Z}$, does not depend on $\theta$. This means that it acts as a normalisation constant (kind of, see the nested sampling tutorial), which means that you can drop it, leaving:

$$ P(\theta \mid D) \propto \mathcal{L}(\theta) \pi(\theta) $$

Unless you are comparing models, the above equation is all that you need to find the probability distributions of your parameters. This is even more beneficial than it seems, because the evidence is often the hardest term to calculate.
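Here is a minimal sketch of what "posterior $\propto$ likelihood $\times$ prior" looks like in practice, evaluated on a grid. Everything in it is an assumption chosen for illustration: five made-up measurements of a quantity $\theta$ with known Gaussian noise ($\sigma = 1$), and a Gaussian prior centred on 4 with width 1.

```python
import numpy as np

# Hypothetical measurements of theta, each with known Gaussian noise sigma = 1
data = np.array([4.8, 5.3, 5.1, 4.6, 5.7])
sigma = 1.0

theta = np.linspace(0.0, 10.0, 1001)   # grid of candidate parameter values
dtheta = theta[1] - theta[0]

# log L(theta): sum over data points, dropping theta-independent constants
log_like = -0.5 * np.sum((data[:, None] - theta[None, :]) ** 2, axis=0) / sigma**2

# log pi(theta): hypothetical Gaussian prior, mean 4 and width 1
log_prior = -0.5 * (theta - 4.0) ** 2

# Unnormalised posterior; subtracting the maximum avoids numerical underflow
log_post = log_like + log_prior
unnormalised = np.exp(log_post - log_post.max())

# Normalising on the grid plays the role of dividing by the evidence Z
posterior = unnormalised / (unnormalised.sum() * dtheta)

theta_map = theta[np.argmax(posterior)]
print(f"Posterior peaks at theta = {theta_map:.2f}")
print(f"Data mean = {data.mean():.2f}, prior mean = 4.00")
```

The peak sits between the prior mean and the data mean, and its location never required the evidence: the normalisation only rescales the curve.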

### How to interpret this:

The likelihood, $\mathcal{L}(\theta)$, is what encodes your physical model: how the data would be generated for different parameter values. This is where your understanding of the physics, the instrument, and the noise enters the equation.

The prior, $\pi(\theta)$, encodes your state of knowledge before the experiment. This could be knowing that your parameter must be positive, or the typical chemical stellar abundances, the mass distribution of galaxies, etc. Previous experiments can help decide this, or maybe you are genuinely ignorant, and you want to express this.

The posterior, $P(\theta \mid D)$, is the logically correct way to combine these. It tells you what you should believe after seeing your data.

**Question:** If the prior $\pi(\theta)$ is constant (flat), what does the posterior become proportional to?

## Part 4: Bayes' Theorem in Action - Medical Test

<img src="Images/XKCD2545.png"
     alt="An XKCD comic"
     width="300"
     style="display:block; margin-left:auto; margin-right:auto;">

A classic study is to ask a group of medical students a variation of the following question:

There is a disease that 1 in 10,000 people have. A test exists that has an accuracy of 98%. For simplicity, the sensitivity and specificity of the test are both 98%. I.e. if you have the disease, the test is positive 98% of the time (sensitivity). If you don't have the disease, the test is negative 98% of the time (specificity). I.e. The false positive and false negative rates are the same. 

You take the test. It comes back positive. What is the chance that you actually have the disease?

These studies of medical students, often from prestigious universities, typically report that fewer than one in four students answer correctly. The most common (incorrect) answer is usually a 98% chance that you have the disease (the test is 98% accurate after all!). Why do so many smart professionals get this wrong? It isn't that these people are stupid, but I posit that it is because traditional, frequentist statistics is so confusing that a lot of people don't actually get it. (Add in link to stop teaching frequentism here). If Bayesian statistics were the default for what we taught, I bet that people would get this correct more often.

### Let's apply Bayes

Here our posterior is the probability that we have the disease, given that we got a positive test. The likelihood is the probability of testing positive given that we have the disease, which is 98%. Our prior is the prevalence: 1 in 10,000 people have this disease.

The tricky part is the evidence. In this case, the evidence is the total probability of testing positive. We have to consider all the ways that we could test positive. In this case there are two ways: Having the disease and testing positive, and not having the disease but still testing positive. 

Let's calculate this now. 

(some visualisation would also help here)

In [None]:
prior = 1/10000 # disease prevalence

likelihood = 0.98 # P(positive | disease): test is 98% accurate (specificity = sensitivity)

evidence = 0.98*prior + 0.02*(1-prior) # all the ways to test positive
#        True positive    False Positive

posterior = likelihood*prior/evidence
print(f'Probability of actually having the disease: {posterior*100:.2f}%')

So this means that despite the test's 98% accuracy, if we were to test positive, there would be about a 1 in 200 chance of actually having the disease! If this feels wrong, consider what would happen if the entire population of Sydney took this test. That's about 5 million people.

In [None]:
population = 5e6

people_with_disease = int(population*prior)
people_without_disease = int(population - people_with_disease)

print(f'Number of people with the disease: {people_with_disease}')
print(f'Number of people without the disease: {people_without_disease}')

# Consider how the test will divide each population
people_with_disease_positive = int(people_with_disease * 0.98) # true positive 
people_with_disease_negative = int(people_with_disease * 0.02) # false negative

people_without_disease_positive = int(people_without_disease * 0.02) # false positive
people_without_disease_negative = int(people_without_disease * 0.98) # true negative

print('-'*80)
print(f'Number of people with the disease found to have disease: {people_with_disease_positive}')
print(f'Number of (unfortunate) people with the disease found to NOT have disease: {people_with_disease_negative}')
print('-'*80)
print(f'Number of (temporarily scared) people without the disease found to have disease: {people_without_disease_positive}')
print(f'Number of people without the disease found to NOT have disease: {people_without_disease_negative}')


So if your test result was positive, you are either among the 490 people that do have the disease and tested positive or among the 99,990 people that don't have the disease and tested positive.

In [None]:
print(f'Probability of actually having disease: {people_with_disease_positive/(people_with_disease_positive + people_without_disease_positive)*100:.2f}%')

So far this is pretty straightforward and this could be easily done with frequentism. But let's assume that this is a very bad disease that you definitely don't want. Even a 0.49% chance of having it is concerning. So you decide to get a second test, with a different doctor at a different lab. If the second result also comes back positive, what is the probability that you have this disease? This is where Bayesian inference really shines. All we have to do is update our beliefs. Fill out the below section:

In [None]:
new_prior = # fill out

likelihood = 0.98 # P(positive | disease): test is 98% accurate (specificity = sensitivity)

evidence = 0.98*new_prior + 0.02*(1-new_prior) # all the ways to test positive

posterior = likelihood*new_prior/evidence
print(f'Probability of actually having the disease: {posterior*100:.2f}%')

**Question**: How many times would you need to repeat the test to be 98% confident that you have the disease?

<details style="margin: 10px 0;">
<summary style="cursor: pointer; font-weight: bold; padding: 10px; background-color: #cce5ff; border: 1px solid #004085; border-radius: 5px; color: #004085;">Click to reveal answer</summary>

<div style="margin-top: 10px; padding: 15px; background-color: #cce5ff; border: 1px solid #004085; border-radius: 5px; color: #004085;">
    
After the first test, your new prior should be the previous posterior. This is because your knowledge is no longer just the prevalence of the disease in the community; it now also includes the fact that you took a test and got a positive result. The only input that needs to change is the prior. The accuracy of the test stays the same, so the likelihood is the same. The evidence has the same form, as there are still the same two ways to test positive, but its value changes because it depends on the prior.

After the first test, your posterior should be 0.49%. You use that (unrounded) as your prior for the second test, and your posterior is 19.36%. Repeat and your posterior is 92.17%. Once more and your posterior is 99.83%.

So it takes **4 tests** to be more than 98% confident that you have the disease.

</div>
</details>
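Once you have attempted the exercise, the chain of updates can be checked with a short loop. This is a sketch that reuses the numbers from the example; the function name `update` and the variable names are made up for this illustration, and it assumes the repeated tests are independent of each other given your true disease status.

```python
prevalence = 1 / 10000        # prior before any test
sensitivity = 0.98            # P(positive | disease)
false_positive_rate = 0.02    # P(positive | no disease)

def update(prior):
    """Posterior probability of disease after one more positive test."""
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

belief = prevalence
history = []
for n_tests in range(1, 5):
    belief = update(belief)   # yesterday's posterior is today's prior
    history.append(belief)
    print(f"After {n_tests} positive test(s): {belief * 100:.2f}%")
```

Carrying the unrounded posterior forward matters: feeding the rounded 0.49% back in as the prior shifts the later values slightly.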


## Part 5: The Astronomical Problem

As you ought to have gathered by now, the Bayesian interpretation of probability is one of belief. Probability literally encapsulates one's (rational) belief about some model/hypothesis/parameter. In contrast, frequentism interprets the probability of an event as the limiting frequency with which that event occurs in an infinite sequence of identical, independent trials. These are not two shades of the same idea; they are actually different ideas. Hopefully the above examples have convinced you that probability can be interpreted as belief.

The problems in astronomy which we seek to solve often do not fit in with the frequentist philosophy. Astronomy is often the study of singular events, not samples from infinite populations. 

You cannot rerun the formation of the Galaxy, or the formation of our particular solar system. Most astronomical "samples" are not samples in the frequentist sense - draws from a well-defined population that could, in principle, be redrawn. They are unique systems observed at a particular moment in cosmic history.

Astronomy was one of the first sciences to adopt Bayesian methods. Laplace, contemporary of Reverend Bayes, used Bayesian inference to estimate the mass of Saturn from orbital data in the late 1700s. This problem does not have a meaningful "ensemble".

Unfortunately Bayesian methods fell out of fashion in the 20th century, as eugenics became fashionable. Both of these trends can be linked, in some respect, to R. A. Fisher, staunch racist and frequentist. The fall of eugenic beliefs and the rise of Bayesianism occurred during and shortly after the Second World War. Let me be clear: I am not saying that these two topics are causally tied, but they are historically related both in time and through the people involved, which is curious. The efficiency of project Ultra in breaking the Enigma machine was in part due to Alan Turing's use of Bayesian methods. Eugenics fell out of favour when the Allies won the war and the horrors of the Holocaust were revealed.

For most of the 20th century, most problems that would have benefitted from Bayesian methods were not computationally tractable. That changed in the early 1990s. Now, with modern computers, there is no reason not to use Bayesian methods.

## Part 6: Marginalisation

A word that you hear often in Bayesian works is *marginalisation*. This is the process of computing the probability of an event that we care about, by integrating (or summing) over all possible values of some other variable that we don't know or don't care about.

Suppose that you wanted to know the probability of some event $A$, but you can't compute it directly. What you *can* do is compute $P(A \mid B)$ which is the probability of $A$ given some other proposition(s) $B$. The problem is that you don't know whether $B$ is true or not. 

Marginalisation solves this by considering all possibilities for $B$. If $B$ can take on values $\{B_1, B_2, ... , B_n\}$ (or alternatively, if $B$ is a continuous parameter) and all these $B$'s are mutually exclusive and exhaustive (exactly one must be true), then:

$$ P(A) = \sum_{i=1}^n P(A\mid B_i) P(B_i)$$

This is the law of total probability.

You should think of marginalisation as a form of weighted average. I.e. $P(A)$ is a weighted average of conditional probabilities $P(A\mid B_i)$, where the weights are $P(B_i)$.

### An example

Assume you are doing an exoplanet transit survey, and you want to estimate what fraction of your target stars will yield a detected transiting planet. Here, we want $P(\text{detected})$, but we can't compute this directly because detectability changes across the sample.

One of the confounding factors is stellar photometric variability. Essentially, the noise floor depends on how active the star is. Quiet stars with stable brightness allow detection of shallow transits (small planets), while active stars with spot modulation, flares, etc. have higher noise floors that bury all but the deepest transit signals. Your survey contains a mix of both types, and the detection probability differs between them. In reality, this would be a continuous quantity that depends on some metric of stellar activity, but assume that these are discrete categories for now.

This is what marginalisation is designed for. We want $P(\text{detected})$, but we only know $P(\text{detected}\mid \text{activity type})$ for each type separately. Marginalisation lets you combine these conditional probabilities to calculate the overall probability.

Assume we know:
- $P(\text{detected}\mid \text{quiet}) = 0.025$. About 2.5% of quiet stars yield a detection.
- $P(\text{detected}\mid \text{active}) = 0.008$. About 0.8% of active stars yield a detection.
- $P(\text{quiet}) = 0.6$. 60% of stars are quiet.
- $P(\text{active}) = 0.4$. 40% of stars are active.

A detection can happen via two mutually exclusive paths: a star is quiet and you detect a planet, or the star is active and you detect a planet. Thus we can sum this:

$$P(\text{detected})=P(\text{detected},\, \text{quiet})+P(\text{detected},\, \text{active}) $$

Using the product rule $P(A, \, B)=P(A\mid B)P(B)$, we get:

$$P(\text{detected})=P(\text{detected} \mid \text{quiet})P(\text{quiet})+P(\text{detected} \mid \text{active})P(\text{active}) $$
$$= (0.025)(0.6) + (0.008)(0.4) = 0.015 + 0.0032 = 0.0182 $$

About 1.8% of the targets will yield a detection. Notice that the quiet stars contribute >80% of expected detections, despite being only 60% of the sample. This is what we would expect.
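The same sum can be written in a couple of lines of NumPy, using the numbers assumed in the example. The variable names are chosen for this sketch only.

```python
import numpy as np

# Order of entries: [quiet, active]
p_type = np.array([0.6, 0.4])                 # P(activity type)
p_det_given_type = np.array([0.025, 0.008])   # P(detected | activity type)

# Product rule gives the joint probabilities P(detected, type)
p_det_and_type = p_det_given_type * p_type

# Marginalise: sum the joint over every activity type
p_detected = p_det_and_type.sum()

# Share of expected detections contributed by each type, P(type | detected)
share = p_det_and_type / p_detected

print(f"P(detected) = {p_detected:.4f}")
print(f"Share of detections from quiet stars:  {share[0]:.1%}")
print(f"Share of detections from active stars: {share[1]:.1%}")
```

The last two lines are Bayes' theorem in disguise: dividing the joint by the marginal gives the probability of each activity type given a detection.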

This is simplified, of course. In practice, you'd marginalise over continuous distributions and over more parameters (radii, orbital periods, distance, etc).

For the continuous case, the summation gets replaced by an integral:

$$ P(\text{detected}) = \int_0^\infty P(\text{detected}\mid \text{activity level}) P(\text{activity level}) d(\text{activity level}) $$
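As a sketch of the continuous case, the integral can be approximated numerically on a grid. Both functions below are invented for illustration (they are not a real activity distribution or detection efficiency): an exponential density for the activity level $a$, and a detection probability $0.025\,e^{-a}$ that falls with activity. With these choices the integral has the closed form $0.025/2 = 0.0125$, which gives something to check against.

```python
import numpy as np

# Grid over activity level; the upper limit stands in for infinity
activity = np.linspace(0.0, 50.0, 50_001)

# Hypothetical density of activity levels, integrates to 1 on [0, inf)
p_activity = np.exp(-activity)

# Hypothetical detection probability that declines with activity
p_det_given_activity = 0.025 * np.exp(-activity)

# Trapezoidal rule for the marginalisation integral
integrand = p_det_given_activity * p_activity
p_detected = np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(activity))

print(f"Numerical P(detected) = {p_detected:.5f}")
print(f"Analytic  P(detected) = {0.025 / 2:.5f}")
```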

### Connection to Bayes' Theorem
 
Marginalisation appears in Bayes' Theorem as the evidence $\mathcal{Z}$, the denominator of Bayes' theorem. The evidence itself is a marginalisation over the parameter space (which is why it is sometimes called the marginal likelihood):

$$\mathcal{Z} = \int \mathcal{L}(\theta) \pi(\theta) d\theta $$

This is the total probability of observing the data, averaged over all possible parameter values weighted by the prior. It ensures the posterior normalises to 1, and it plays a central role in model comparison.

### Nuisance Parameters
This is how Bayesians handle nuisance parameters (quantities that affect your data but aren't of direct scientific interest). Rather than fixing them, you marginalise, integrating over your uncertainty about them. This propagates that uncertainty into your final inference on the parameters you care about.
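A minimal sketch of marginalising over a nuisance parameter on a grid. The setup is assumed for illustration only: a handful of made-up measurements with unknown mean $\mu$ (the parameter of interest) and unknown Gaussian noise level $\sigma$ (the nuisance parameter), with flat priors over the grid ranges.

```python
import numpy as np

data = np.array([4.8, 5.3, 5.1, 4.6, 5.7])   # hypothetical measurements

mu = np.linspace(3.0, 7.0, 401)       # parameter of interest
sigma = np.linspace(0.1, 3.0, 291)    # nuisance parameter
dmu = mu[1] - mu[0]
dsigma = sigma[1] - sigma[0]
MU, SIGMA = np.meshgrid(mu, sigma, indexing="ij")   # shape (401, 291)

# Gaussian log-likelihood at every (mu, sigma) pair, up to a constant
resid2 = np.sum((data[None, None, :] - MU[..., None]) ** 2, axis=-1)
log_like = -0.5 * resid2 / SIGMA**2 - data.size * np.log(SIGMA)

# Flat priors on the grid, so the posterior is proportional to the likelihood
post = np.exp(log_like - log_like.max())
post /= post.sum() * dmu * dsigma      # normalise the joint posterior

# Marginalise: integrate the joint posterior over the nuisance parameter
post_mu = post.sum(axis=1) * dsigma

mu_best = mu[np.argmax(post_mu)]
print(f"Marginal posterior for mu peaks at {mu_best:.2f}")
```

The resulting curve `post_mu` is wider than it would be with $\sigma$ fixed at a single value, because the uncertainty in $\sigma$ has been carried through.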

## Part 7: Bayes' Theorem in Action - The Monty Hall problem (optional)

A classic probability problem is the Monty Hall problem. Again, this is something that tends to confuse people. Again I argue that this is because people are taught to think like Frequentists, not Bayesians. We will look at this problem, and see how it is actually really simple under a Bayesian Framework. 

<img src="Images/MontyHall3Doors.png"
     alt="3 Door Monty Hall Problem"
     width="800"
     style="display:block; margin-left:auto; margin-right:auto;">

#### The Setup:

In the '70s there was a game show called "Let's Make a Deal" hosted by Monty Hall. There are three doors on the stage. Behind one door, there is a sports car. Behind each of the other two, there is a goat. You want the sports car. Monty makes a deal: If you choose the door that the car is behind, you get to keep it. You decide to choose the door on the right.

After you choose, Monty (who knows where the car is) opens the centre door, revealing a goat. Note that Monty will always open a door with a goat, he never reveals the car.


<img src="Images/MontyHall2Doors.png"
     alt="3 Door Monty Hall Problem"
     width="800"
     style="display:block; margin-left:auto; margin-right:auto;">


Monty then makes another deal: Do you want to stay with the door you originally chose, or switch to the other unopened door?

What you should do is switch. This is because switching has a 2/3 probability of winning the car. If you stay, you have a 1/3 probability of obtaining the car. 

#### Why do people get this wrong?

The intuitive (and wrong) answer is that whether you switch or not shouldn't matter. There are two doors left, so surely each has a 50% chance? This reasoning implicitly assumes that when Monty opened the door to the goat, you gained no information. So it feels like you are starting again from scratch with two equally likely options.

But Monty did give you information. He doesn't open a random door. He opened a door he knew had a goat behind it. The fact that it wasn't random is where the information comes from. 

#### How would a Bayesian approach this?

Let $C_i$ denote the hypothesis "the car is behind door $i$". Before any doors are opened, you have no reason to prefer any door, so the prior is uniform:

$$\pi(C_1) = \pi(C_2) = \pi(C_3) = \frac{1}{3} $$

You choose Door 3. Monty opens Door 2, revealing a goat. Let $M_2$ denote "Monty opens Door 2". We want to compute the posterior probabilities $P(C_1 \mid M_2)$ (car is behind Door 1, given Monty opens Door 2) and $P(C_3 \mid M_2)$ (car is behind Door 3, given Monty opens Door 2).

Apply Bayes' Theorem:

$$P(C_i \mid M_2) = \frac{\mathcal{L}(C_i)\pi(C_i)}{\mathcal{Z}}$$

where $\mathcal{L}(C_i) = P(M_2 \mid C_i)$ is the likelihood, and $\mathcal{Z} = P(M_2)$ is the evidence.

We need the likelihoods: the probability that Monty opens Door 2, given each possible location of the car.

- **If the car is behind Door 1 ($C_1$)**: Monty cannot open Door 1 (car) or Door 3 (your choice). He must open Door 2. So in this case, $\mathcal{L}(C_1) = 1$.
- **If the car is behind Door 2 ($C_2$)**: Monty never reveals the car, so he cannot open Door 2. So $\mathcal{L}(C_2) = 0$.
- **If the car is behind Door 3 ($C_3$)**: Monty cannot open Door 3 (your choice, and also the car). He can either open Door 1 or Door 2. Assuming he chooses randomly, each remaining door has a 50% chance of being opened. So $\mathcal{L}(C_3) = 1/2$.

Now, unless we are comparing the relative probabilities of different models, we usually don't need to compute the evidence. Recall that the posterior is proportional to the likelihood times the prior. In our case, the priors are equal, so here the posterior is proportional to the likelihood only. 

The (unnormalised) posterior $P(C_1 \mid M_2)$ (car behind Door 1) is then proportional to $1$.

The (unnormalised) posterior $P(C_3 \mid M_2)$ (car behind Door 3) is then proportional to $1/2$.

The ratio is 2:1. Hence it is twice as likely that the car is behind the other door than behind your door, and thus you should switch.

Now, if we wanted to compute the evidence $\mathcal{Z}$, we can. This would mean marginalisation. The event $M_2$ can occur in conjunction with any of the three car locations, so we sum over all possibilities:

$$\mathcal{Z} = \sum_{i=1}^3 \mathcal{L}(C_i) \pi(C_i)  = \mathcal{L}(C_1) \pi(C_1) + \mathcal{L}(C_2) \pi(C_2) + \mathcal{L}(C_3) \pi(C_3)  $$

$$ = (1)(\frac{1}{3}) + (0)(\frac{1}{3}) + (\frac{1}{2})(\frac{1}{3}) = \frac{1}{3} + \frac{1}{6} = \frac{1}{2} $$

This is the normalisation constant that rescales the above result so that the posteriors sum to 1: $P(C_1 \mid M_2) = \frac{1/3}{1/2} = \frac{2}{3}$ and $P(C_3 \mid M_2) = \frac{1/6}{1/2} = \frac{1}{3}$.
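The calculation is short enough to check numerically. The sketch below does the exact Bayes update with arrays, then cross-checks it with a simulation of the game as described (you always pick Door 3, Monty never reveals the car and chooses at random when he has a choice). The seed and number of games are arbitrary choices.

```python
import numpy as np

# Exact update. Entries are ordered [Door 1, Door 2, Door 3].
prior = np.full(3, 1 / 3)
likelihood = np.array([1.0, 0.0, 0.5])   # P(M_2 | C_i), given you chose Door 3

unnormalised = likelihood * prior
evidence = unnormalised.sum()            # Z = P(M_2)
posterior = unnormalised / evidence
print(f"Evidence Z = {evidence:.3f}")
print(f"Posterior over doors 1-3: {np.round(posterior, 3)}")

# Simulation cross-check
rng = np.random.default_rng(0)
n_games = 200_000
car = rng.integers(1, 4, size=n_games)    # car location: 1, 2 or 3
coin = rng.integers(1, 3, size=n_games)   # Monty's random pick (1 or 2) if car is behind 3

# Monty opens Door 2 if the car is behind 1, Door 1 if behind 2, else a random one
opened = np.where(car == 1, 2, np.where(car == 2, 1, coin))

# Condition on what we actually observed: Monty opened Door 2
observed = opened == 2
frac_switch_wins = np.mean(car[observed] == 1)
print(f"Simulated P(C_1 | M_2) = {frac_switch_wins:.3f}")
```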

<img src="Images/XKCD1282.png"
     alt="Another XKCD"
     width="800"
     style="display:block; margin-left:auto; margin-right:auto;">


## Part 8: Likelihoods (Optional)

The likelihood, $\mathcal{L}$ is the most important part of Bayes' Theorem. This is the part that encodes the physics, and answers the question: given a set of parameter values $\theta$, how probable is it that we would observe the data we actually got?

To construct a likelihood, we need a noise model. This is an assumption about how our measurements deviate from the true underlying signal (whatever true means). The noise model should reflect what we actually know about our instrument and observations.

The most common assumption is that the noise is Gaussian. This is often well justified by the central limit theorem. However, you should consider whether a different noise model is better, for example:

- Poisson for when you have low counts such as X-ray astronomy or radio astronomy.
- Log-normal for when quantities are the products of random variables, or for when things must be positive and span multiple orders of magnitude.
- Beta distributions for quantities bounded between 0 and 1.
- Mixture models for when you need to model good and bad data as separate populations (e.g. 95% of your data could be Gaussian, but 5% could be cosmic rays that might be log-normal).

Let's assume our noise model is Gaussian. In the following section we will be talking about spectral fluxes, so let's write our Gaussian in that light. The probability of measuring a flux $f_i$ at wavelength $\lambda_i$ given some parameter (or set of parameters) $\theta$ is:

$$P(f_i \mid \theta) = \frac{1}{\sigma_i \sqrt{2\pi}} \exp\!\left[ -\frac{\bigl(f_i - f(\lambda_i, \theta)\bigr)^2}{2\sigma_i^2} \right] $$

This is the Gaussian probability density function, centred on the model prediction $f(\lambda_i, \theta)$ with standard deviation $\sigma_i$. Measurements close to the model prediction are more likely than those far away.

If the measurement errors are independent (which often isn't the case: wavelengths can be correlated and pixel sensitivities are not always uncorrelated; ask Chris about Veloce), then the joint probability (or likelihood) of all measurements is the product of the individual probabilities:

$$ \mathcal{L}(\theta) = \prod_{i=1}^{N} P(f_i \mid \theta) $$

Where $N$ is the number of data points. If the errors are correlated, the likelihood becomes:

$$ \mathcal{L}(\theta) = \frac{1}{\sqrt{(2\pi)^N \text{det} \mathbf{C}}} \exp{ \left[-\frac{1}{2}(\mathbf{f}-\mathbf{m})^T \mathbf{C}^{-1}(\mathbf{f}-\mathbf{m})\right]}$$

Where $\mathbf{f}$ is the data vector, $\mathbf{m}$ is the vector of model predictions, and $\mathbf{C}$ is the covariance matrix. 
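
As a rough sketch of how the correlated likelihood could be evaluated in practice, the function below computes its logarithm with NumPy. The function name and the numbers are invented for this example. With a diagonal covariance matrix it should reduce to the sum of independent Gaussian terms, which makes for a handy sanity check.

```python
import numpy as np

def correlated_gaussian_log_like(data, model, cov):
    """Log-likelihood for Gaussian noise with covariance matrix `cov`."""
    resid = data - model
    n = len(resid)
    # Solve C x = r instead of forming the inverse of C explicitly
    chi2 = resid @ np.linalg.solve(cov, resid)
    # slogdet returns (sign, ln|det C|), which avoids overflow in det C
    sign, log_det = np.linalg.slogdet(cov)
    return -0.5 * (chi2 + log_det + n * np.log(2 * np.pi))

# Hypothetical measurements, model predictions and uncertainties
data = np.array([1.02, 0.85, 0.97])
model = np.ones(3)
sigma = np.array([0.1, 0.2, 0.15])

# Uncorrelated errors: the covariance matrix is diagonal
ln_L_matrix = correlated_gaussian_log_like(data, model, np.diag(sigma**2))

# The same thing written as a sum over independent Gaussians
ln_L_sum = np.sum(-0.5 * ((data - model) / sigma)**2
                  - 0.5 * np.log(2 * np.pi * sigma**2))

print(ln_L_matrix, ln_L_sum)
```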

For computational and mathematical convenience, you often work in log-likelihoods. For the uncorrelated Gaussian likelihood, that is:

$$ \ln \mathcal{L}(\theta) = - \frac{1}{2} \sum_{i=1}^N \left[\frac{(f_i - f(\lambda_i, \theta))^2}{\sigma_i^2} + \ln (2\pi \sigma_i^2) \right]$$

The second term doesn't depend on $\theta$, so maximising the log-likelihood is equivalent to minimising:

$$ \chi^2 (\theta) = \sum_{i=1}^N \frac{(f_i - f(\lambda_i, \theta))^2}{\sigma_i^2} $$

This is our old friend chi-squared. This is why I said that frequentism is sometimes a good approximation to Bayesian inference. The likelihood is $\mathcal{L} \propto \exp(-\chi^2/2)$, so minimising $\chi^2$ maximises the likelihood. If you adopt a flat prior, the posterior is proportional to the likelihood, so the $\chi^2$ minimum is also the posterior maximum.

But this equivalence only holds when you have no prior information to incorporate. When you do have prior knowledge (and in astronomy you usually do) the full Bayesian approach lets you use it. And even with flat priors, the MLE gives you only a point estimate, not the full posterior distribution.
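
To see the equivalence numerically, here is a minimal sketch that fits a single constant flux level `mu` to simulated data. The data, the grid and the variable names are all made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = np.full(50, 0.1)                # identical uncertainty on every point
data = 1.0 + rng.normal(0.0, sigma)     # true constant level is 1.0

# Evaluate chi-squared and the full Gaussian log-likelihood on a grid of mu
mu_grid = np.linspace(0.8, 1.2, 401)
chi2 = np.array([np.sum(((data - mu) / sigma)**2) for mu in mu_grid])
ln_L = -0.5 * chi2 - 0.5 * np.sum(np.log(2 * np.pi * sigma**2))

i_chi2 = np.argmin(chi2)
i_like = np.argmax(ln_L)

print(f"mu at minimum chi-squared:    {mu_grid[i_chi2]:.3f}")
print(f"mu at maximum log-likelihood: {mu_grid[i_like]:.3f}")
print(f"Sample mean of the data:      {data.mean():.3f}")
```

Both searches pick out the same grid point. With a flat prior on `mu`, that is also where the posterior peaks.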

Behind all this complicated-looking mathematics, the (Gaussian) likelihood is essentially just counting how many sigmas your model is off from your data, and you want to choose a model that minimises that.

## Part 9: Fitting an Absorption Line
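
A practical footnote on why log-likelihoods are preferred numerically: the product of many small probability densities quickly falls below what a floating point number can represent. The sketch below uses simulated residuals with an arbitrary $\sigma = 1$ (values chosen only for this illustration).

```python
import numpy as np

rng = np.random.default_rng(42)
n_points = 2000
sigma = 1.0
residuals = rng.normal(0.0, sigma, size=n_points)

# Gaussian probability density of each residual (all below 0.4 for sigma = 1)
pdf_values = np.exp(-0.5 * (residuals / sigma)**2) / (sigma * np.sqrt(2 * np.pi))

likelihood = np.prod(pdf_values)             # product of 2000 numbers below 1
log_likelihood = np.sum(np.log(pdf_values))  # sum of their logarithms

print(f"Direct product: {likelihood}")
print(f"Log-likelihood: {log_likelihood:.1f}")
```

The direct product underflows to zero, so every parameter value would look equally (im)probable, while the sum of logarithms stays finite and usable.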

Let's apply these ideas to a concrete astronomy problem: fitting an absorption line in a stellar spectrum.

What we will be doing is:
  1) Define a physical model (Gaussian absorption line)
  2) Simulate realistic data with known parameters
  3) Construct the likelihood function
  4) Define priors
  5) Compute the posterior on a grid
  6) Marginalise to extract parameter constraints
  7) Compute credible intervals and recover the radial velocity

#### The Setup

You have obtained a spectrum and identified an absorption feature. The data consists of flux measurements $\{f_i\}$ at wavelengths $\{\lambda_i\}$, each with an associated uncertainty $\{\sigma_i\}$. You want to determine the line's central wavelength $\lambda_0$ (which encodes radial velocity), its depth $A$, and its width $\omega$. 

We will model the absorption line as a Gaussian:

$$f(\lambda) = C - A \exp\!\left[-\frac{(\lambda - \lambda_0)^2}{2\omega^2}\right] $$

where $C$ is the continuum level. Our parameters are $\theta = \{C,\, A,\, \lambda_0,\, \omega\}$

In [None]:
def absorption_line_model(wavelength, C, A, lambda_0, w):
    """
    Inputs:
        - wavelength : numpy array
        - C          : float
        - A          : float  - positive = absorption
        - lambda_0   : float
        - w          : float
    Output:
        - flux       : array
    """

    #####################################
    # TODO: Implement model (see above) #
    #####################################

    return flux

<details style="margin: 10px 0;">
<summary style="cursor: pointer; font-weight: bold; padding: 10px; background-color: #cce5ff; border: 1px solid #004085; border-radius: 5px; color: #004085;">Hint 1</summary>

<div style="margin-top: 10px; padding: 15px; background-color: #cce5ff; border: 1px solid #004085; border-radius: 5px; color: #004085;">

**Wait!** Before clicking the next cell, think of the pedagogical consequences!

In future, the final hints will be the full solution. Earlier hints will have partial advice.
</div>
</details>



<details style="margin: 10px 0;">
<summary style="cursor: pointer; font-weight: bold; padding: 10px; background-color: #cce5ff; border: 1px solid #004085; border-radius: 5px; color: #004085;">Hint 2</summary>

<div style="margin-top: 10px; padding: 15px; background-color: #cce5ff; border: 1px solid #004085; border-radius: 5px; color: #004085;">

```python
def absorption_line_model(wavelength, C, A, lambda_0, w):
    """
    Inputs:
        - wavelength : numpy array
        - C          : float
        - A          : float  - positive = absorption
        - lambda_0   : float
        - w          : float
    Output:
        - flux       : array
    """

    flux = C - A * np.exp(-(wavelength - lambda_0)**2 / (2 * w**2))
    return flux
```

</div>
</details>


#### Simulate some data

Before fitting, we need to generate some synthetic data with known parameters so we can verify our inference. 

Let's simulate a star with a radial velocity of $v_r = 45$ km/s (moving away from us). This will Doppler shift the line to longer wavelengths:

$$\lambda_{\text{observed}} = \lambda_\text{rest} \left( 1+ \frac{v_r}{c} \right) $$

In [None]:
######################
# Physical Constants #
######################
c = 299792.458  # Speed of light in km/s
lambda_rest = 5172.7  # Mg I b line rest wavelength (Angstroms)

################################
# True Parameters (to recover) #
################################

v_radial_true = 45.0  # Radial velocity in km/s (positive = receding)
lambda0_true = lambda_rest * (1 + v_radial_true / c)  # Doppler-shifted wavelength

C_true = 1.0    # Continuum level (normalised)
A_true = 0.25   # Line depth
w_true = 0.8    # Line width (Angstroms) - thermal + instrumental broadening

print(f"Rest wavelength: {lambda_rest:.2f} Å")
print(f"True radial velocity: {v_radial_true:.1f} km/s")
print(f"True observed wavelength: {lambda0_true:.3f} Å")
print(f"Doppler shift: {lambda0_true - lambda_rest:.3f} Å")

############################
# Generate wavelength grid #
############################
wavelength = np.linspace(5165, 5185, 120)

###############################
# Generate data and add noise #
###############################
np.random.seed(0) 
sigma = 0.1  
flux_true = absorption_line_model(wavelength, C_true, A_true, lambda0_true, w_true)
flux_observed = flux_true + np.random.normal(0, sigma, size=len(wavelength))
flux_uncertainty = np.full_like(wavelength, sigma)

In [None]:
#############
# Plot data #
#############

plt.figure(figsize=(10, 6))
plt.errorbar(wavelength, flux_observed, yerr=flux_uncertainty, fmt='k.', alpha=0.5, label='Observed')
plt.plot(wavelength, flux_true, 'r-', lw=2, label='True model')
plt.axvline(lambda_rest, color='b', linestyle='--', alpha=0.5, label=f'Rest wavelength ({lambda_rest:.1f} Å)')
plt.axvline(lambda0_true, color='g', linestyle=':', alpha=0.5, label=f'Doppler shifted ({lambda0_true:.2f} Å)')
plt.xlabel('Wavelength (Å)')
plt.ylabel('Normalised flux')
plt.legend()
plt.title(f'Simulated Mg I absorption line ($v_r$ = {v_radial_true:.0f} km/s)')
plt.show()

#### Converting Wavelength to Radial Velocity

Since we ultimately care about radial velocity, rather than wavelength, let's write some helper functions to convert between them.

In [None]:
def wavelength_to_velocity(lambda_obs, lambda_rest=5172.7):
    """
    Convert observed wavelength to radial velocity.
    
    Inputs:
        - lambda_obs  : float or array
        - lambda_rest : float          
    Output:
        - v_r         : float or array 
    """
    c = 299792.458  # km/s
    ##############################
    # TODO: Implement conversion #       
    ##############################
    return v_r


def velocity_to_wavelength(v_r, lambda_rest=5172.7):
    """
    Convert radial velocity to observed wavelength.
    
    Inputs:
        - v_r         : float or array
        - lambda_rest : float          
    Output:
        - lambda_obs  : float or array 
    """
    c = 299792.458  # km/s
    ##############################
    # TODO: Implement conversion #       
    ##############################
    return lambda_obs

<details style="margin: 10px 0;">
<summary style="cursor: pointer; font-weight: bold; padding: 10px; background-color: #cce5ff; border: 1px solid #004085; border-radius: 5px; color: #004085;">Hint 1</summary>

<div style="margin-top: 10px; padding: 15px; background-color: #cce5ff; border: 1px solid #004085; border-radius: 5px; color: #004085;">

Recall:

$$\lambda_{\text{observed}} = \lambda_\text{rest} \left( 1+ \frac{v_r}{c} \right) $$

</div>
</details>

<details style="margin: 10px 0;">
<summary style="cursor: pointer; font-weight: bold; padding: 10px; background-color: #cce5ff; border: 1px solid #004085; border-radius: 5px; color: #004085;">Hint 2</summary>

<div style="margin-top: 10px; padding: 15px; background-color: #cce5ff; border: 1px solid #004085; border-radius: 5px; color: #004085;">

```python
def wavelength_to_velocity(lambda_obs, lambda_rest=5172.7):
    """
    Convert observed wavelength to radial velocity.
    
    Inputs:
        - lambda_obs  : float or array
        - lambda_rest : float          
    Output:
        - v_r         : float or array
    """
    c = 299792.458  # km/s
    v_r = c * (lambda_obs - lambda_rest) / lambda_rest
    return v_r    
```

</div>
</details>

<details style="margin: 10px 0;">
<summary style="cursor: pointer; font-weight: bold; padding: 10px; background-color: #cce5ff; border: 1px solid #004085; border-radius: 5px; color: #004085;">Hint 3</summary>

<div style="margin-top: 10px; padding: 15px; background-color: #cce5ff; border: 1px solid #004085; border-radius: 5px; color: #004085;">

```python
def velocity_to_wavelength(v_r, lambda_rest=5172.7):
    """
    Convert radial velocity to observed wavelength.
    
    Inputs:
        - v_r         : float or array 
        - lambda_rest : float          
    Output:
        - lambda_obs  : float or array
    """
    c = 299792.458  # km/s
    lambda_obs = lambda_rest * (1 + v_r / c)
    return lambda_obs
```

</div>
</details>


#### The likelihood

Now that we have generated the data and written some helper functions, it is time to apply Bayes' Theorem. Let's start with the likelihood. As we are assuming Gaussian errors, the log-likelihood is:

$$ \ln \mathcal{L}(\theta) = -\frac{1}{2} \sum_{i=1}^N  \frac{(f_i - f(\lambda_i,\, \theta))^2}{\sigma_i^2} + \text{const.}$$

In [None]:
def log_likelihood(theta, wavelength, flux_observed, flux_uncertainty):
    """
    Inputs:
        - theta            : tuple
        - wavelength       : array
        - flux_observed    : array
        - flux_uncertainty : array
    Output:
        - ln_L             : float - log-likelihood value
    """
    C, A, lambda0, w = theta

    ####################################
    # TODO: Compute the log-likelihood #
    ####################################
   
    return ln_L

<details style="margin: 10px 0;">
<summary style="cursor: pointer; font-weight: bold; padding: 10px; background-color: #cce5ff; border: 1px solid #004085; border-radius: 5px; color: #004085;">Click for solution</summary>

<div style="margin-top: 10px; padding: 15px; background-color: #cce5ff; border: 1px solid #004085; border-radius: 5px; color: #004085;">
    
```python
def log_likelihood(theta, wavelength, flux_observed, flux_uncertainty):
    """
    Inputs:
        - theta            : tuple
        - wavelength       : array
        - flux_observed    : array
        - flux_uncertainty : array
    Output:
        - ln_L             : float - log-likelihood value
    """
    C, A, lambda0, w = theta

    flux_model = absorption_line_model(wavelength, C, A, lambda0, w)
    chi_squared = np.sum(((flux_observed - flux_model) / flux_uncertainty)**2)
    ln_L = -0.5 * chi_squared
    return ln_L    
```
</div>
</details>


#### The Prior:

We need to specify the prior to apply Bayes' theorem. We *could* just use a uniform prior, but would this represent all of our knowledge? Let's think about what we know, and what we are uncertain about.

We are looking for the Mg I b line, which has a rest wavelength $\lambda_\text{rest} = 5172.7$ Å. We suspect that the star might be moving, and thus could have a radial velocity. If this is the case, we can argue that smaller radial velocities are more likely than larger ones, with larger radial velocities becoming increasingly unlikely.

In this case, we could use a Laplacian (double sided exponential) prior on $\lambda_0$, centred on the rest wavelength. That is to say:

$$ \pi(\lambda_0) = \frac{1}{2b} \exp \left( - \frac{|\lambda_0 - \lambda_{rest} |}{b} \right) $$

Here, the parameter $b$ is the scale parameter of the Laplacian distribution. It is analogous to the standard deviation in a Gaussian distribution. It essentially encodes the distribution of radial velocities in a stellar population. Let's assume that, from prior measurements of a stellar population, we know $b \approx 0.5$ Å, corresponding to a dispersion of roughly 30 km/s. We don't need to be exactly correct here. Assuming our data is informative, differences in the prior should wash out.
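
To check what this choice of $b$ implies in velocity terms, here is a small sketch. It converts $b$ to km/s with the Doppler relation above, then uses the fact that a Laplacian centred on zero places a fraction $1 - e^{-x/b}$ of its probability within $\pm x$. The helper name `prior_mass_within` and the velocity thresholds are made up for this illustration.

```python
import numpy as np

c = 299792.458        # speed of light in km/s
lambda_rest = 5172.7  # Mg I b rest wavelength in Angstroms
b = 0.5               # Laplacian scale parameter in Angstroms

# Convert the wavelength scale into a velocity scale
b_velocity = c * b / lambda_rest
print(f"b = {b} Å corresponds to {b_velocity:.1f} km/s")

def prior_mass_within(v_max, b_velocity):
    """Prior probability that |v_r| < v_max for a Laplacian centred on zero."""
    return 1.0 - np.exp(-v_max / b_velocity)

for v_max in (30.0, 100.0, 300.0):
    mass = prior_mass_within(v_max, b_velocity)
    print(f"P(|v_r| < {v_max:5.0f} km/s) = {mass:.3f}")
```

If those fractions look unreasonable for the stars you are observing, that is a sign the scale parameter should be revisited.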

In [None]:
def log_prior(theta):
    """
    Compute the log-prior for the absorption line model.
    
    Inputs:
        - theta : tuple - model parameters (C, A, lambda0, w)
    Output:
        - ln_prior : float
    """
    C, A, lambda0, w = theta
    
    # Rest wavelength and Laplacian scale parameter
    lambda_rest = 5172.7  # Angstroms
    b = 0.5               # Scale parameter (~30 km/s velocity dispersion)
    
    # Uniform prior bounds
    C_min, C_max = 0.5, 1.5
    A_min, A_max = 0.0, C      # Depth cannot exceed continuum
    w_min, w_max = 0.1, 5.0     

    ################################################
    # TODO: Apply uniform prior bounds for C, A, w #
    ################################################

    ##########################################
    # TODO: Apply Laplacian prior on lambda0 #
    ##########################################
    return ln_prior

<details style="margin: 10px 0;">
<summary style="cursor: pointer; font-weight: bold; padding: 10px; background-color: #cce5ff; border: 1px solid #004085; border-radius: 5px; color: #004085;">Hint 1</summary>

<div style="margin-top: 10px; padding: 15px; background-color: #cce5ff; border: 1px solid #004085; border-radius: 5px; color: #004085;">

For uniform priors, as long as the model parameter is within the bounds, the prior probability is constant, so we don't need to add it to the output. However, we need to forbid values outside our prior bounds. **If a model parameter is outside the bounds,** `return -np.inf`.

</div>
</details>

<details style="margin: 10px 0;">
<summary style="cursor: pointer; font-weight: bold; padding: 10px; background-color: #cce5ff; border: 1px solid #004085; border-radius: 5px; color: #004085;">Hint 2</summary>

<div style="margin-top: 10px; padding: 15px; background-color: #cce5ff; border: 1px solid #004085; border-radius: 5px; color: #004085;">

```python
    # Check uniform prior bounds
    if not (C_min < C < C_max):
        return -np.inf
    if not (A_min < A < A_max):
        return -np.inf
    if not (w_min < w < w_max):
        return -np.inf
```
</div>
</details>



<details style="margin: 10px 0;">
<summary style="cursor: pointer; font-weight: bold; padding: 10px; background-color: #cce5ff; border: 1px solid #004085; border-radius: 5px; color: #004085;">Hint 3</summary>

<div style="margin-top: 10px; padding: 15px; background-color: #cce5ff; border: 1px solid #004085; border-radius: 5px; color: #004085;">
    
```python
def log_prior(theta):
    """
    Compute the log-prior for the absorption line model.

    Inputs:
        - theta : tuple - model parameters (C, A, lambda0, w)
    Output:
        - ln_prior : float - log-prior value
    """
    C, A, lambda0, w = theta

    # Rest wavelength and Laplacian scale parameter
    lambda_rest = 5172.7  # Angstroms
    b = 0.5               # Scale parameter (~30 km/s velocity dispersion)

    # Uniform prior bounds
    C_min, C_max = 0.5, 1.5
    A_min, A_max = 0.0, C      # Depth cannot exceed continuum
    w_min, w_max = 0.1, 5.0

    # Check uniform prior bounds
    if not (C_min < C < C_max):
        return -np.inf
    if not (A_min < A < A_max):
        return -np.inf
    if not (w_min < w < w_max):
        return -np.inf

    # Laplacian prior on lambda0
    ln_prior = -np.log(2 * b) - np.abs(lambda0 - lambda_rest) / b
    return ln_prior
```
</div>
</details>


#### The Posterior

Finally, we can calculate the posterior. Remember that the log-posterior is the sum of the log-likelihood and log-prior. For now, we are ignoring the evidence.

In [None]:
def log_posterior(theta, wavelength, flux_observed, flux_uncertainty):
    """
    Inputs:
        - theta            : tuple - model parameters (C, A, lambda0, w)
        - wavelength       : array - wavelength values
        - flux_observed    : array - observed flux values
        - flux_uncertainty : array - flux uncertainties (1-sigma)
    Output:
        - ln_post          : float - log-posterior value
    """
    ##################################
    # TODO: Calculate Log-likelihood #
    ##################################

    #############################
    # TODO: Calculate log-prior #
    #############################

    ##################################
    # TODO: Sum to get log-posterior #
    ##################################
    
    return ln_post

<details style="margin: 10px 0;">
<summary style="cursor: pointer; font-weight: bold; padding: 10px; background-color: #cce5ff; border: 1px solid #004085; border-radius: 5px; color: #004085;">Click for solution</summary>

<div style="margin-top: 10px; padding: 15px; background-color: #cce5ff; border: 1px solid #004085; border-radius: 5px; color: #004085;">

```python
def log_posterior(theta, wavelength, flux_observed, flux_uncertainty):
    """
    Inputs:
        - theta            : tuple - model parameters (C, A, lambda0, w)
        - wavelength       : array - wavelength values
        - flux_observed    : array - observed flux values
        - flux_uncertainty : array - flux uncertainties (1-sigma)
    Output:
        - ln_post          : float - log-posterior value
    """
    # Skip the likelihood entirely if the prior forbids these parameters
    ln_prior = log_prior(theta)
    if not np.isfinite(ln_prior):
        return -np.inf

    ln_like = log_likelihood(theta, wavelength, flux_observed, flux_uncertainty)
    ln_post = ln_prior + ln_like
    return ln_post
```
</div>
</details>


#### Calculate the Posterior on a Grid

For problems involving only a few parameters, we can evaluate the posterior on a grid. This is not what people really do, as it is a brute-force technique. I am only showing this so you can learn what it looks like. Typically, you would use something such as MCMC to explore the space more efficiently.

With our 4 parameters, a full grid search is expensive (100 points per parameter gives $100^4 = 10^8$ evaluations). Instead, let's fix the continuum $C$ and width $\omega$ at their true values, and explore a 2D grid over line depth $A$ and central wavelength $\lambda_0$.

In [None]:
# Fix nuisance parameters (in practice you'd marginalise over these too)
C_fixed = C_true
w_fixed = w_true

# Create 2D grid over A and lambda0
n_grid = 250  # <--------------------- change grid resolution depending on your computer speed
A_grid = np.linspace(0.05, 0.5, n_grid)
lambda0_grid = np.linspace(5172, 5176, n_grid)

This is a good place to use `tqdm` if you have it. The cell below imports it if it is installed. If it isn't, a small fallback lets the code run anyway; you just won't get a progress bar.

In [None]:
try:
    from tqdm import tqdm
except ImportError:
    tqdm = lambda x, **kwargs: x  # Fallback: just return the iterable unchanged

# Evaluate log-posterior at each grid point
log_post_grid = np.zeros((n_grid, n_grid))

for i, A in enumerate(tqdm(A_grid, desc='Evaluating posterior')):
    for j, lam0 in enumerate(lambda0_grid):
        theta = (C_fixed, A, lam0, w_fixed)
        log_post_grid[i, j] = log_posterior(theta, wavelength, flux_observed, flux_uncertainty)

# Convert to posterior (subtract max for numerical stability, then exponentiate)
log_post_grid -= np.max(log_post_grid)
post_grid = np.exp(log_post_grid)

# Normalise so it integrates to 1
dA = A_grid[1] - A_grid[0]
dlam = lambda0_grid[1] - lambda0_grid[0]
post_grid /= np.sum(post_grid) * dA * dlam

#### Visualise the 2D posterior

In [None]:
plt.figure(figsize=(8, 6))
plt.contourf(lambda0_grid, A_grid, post_grid, levels=50, cmap='viridis')
plt.colorbar(label='Posterior density')
plt.axvline(lambda0_true, color='r', linestyle='--', label='True λ₀')
plt.axhline(A_true, color='r', linestyle='--', label='True A')
plt.axvline(lambda_rest, color='white', linestyle=':', alpha=0.5, label='Rest wavelength')
plt.xlabel('λ₀ (Å)')
plt.ylabel('Depth A')
plt.legend()
plt.title('2D Posterior (C and w fixed)')
plt.show()

#### Marginalising to 1D
To get the posterior on $\lambda_0$ alone, we marginalise (sum) over $A$. This is the numerical version of:

$$P(\lambda_0 \mid D) = \int P(A,\, \lambda_0 \mid D) dA $$

In [None]:
################################################
# TODO: Marginalise over A to get P(λ₀ | data) #
################################################
post_lambda0 = 

################################################
# TODO: Marginalise over λ₀ to get P(A | data) #
################################################
post_A = 

<details style="margin: 10px 0;">
<summary style="cursor: pointer; font-weight: bold; padding: 10px; background-color: #cce5ff; border: 1px solid #004085; border-radius: 5px; color: #004085;">Hint 1</summary>

<div style="margin-top: 10px; padding: 15px; background-color: #cce5ff; border: 1px solid #004085; border-radius: 5px; color: #004085;">

To marginalise, we need to call `np.sum`, select an axis, and multiply by `d[something]` (calculated above).

</div>
</details>

<details style="margin: 10px 0;">
<summary style="cursor: pointer; font-weight: bold; padding: 10px; background-color: #cce5ff; border: 1px solid #004085; border-radius: 5px; color: #004085;">Hint 2</summary>

<div style="margin-top: 10px; padding: 15px; background-color: #cce5ff; border: 1px solid #004085; border-radius: 5px; color: #004085;">

```python
# Marginalise over A to get P(λ₀ | data)
post_lambda0 = np.sum(post_grid, axis=0) * dA

# Marginalise over λ₀ to get P(A | data)
post_A = np.sum(post_grid, axis=1) * dlam
```
</div>
</details>


#### Plot Marginalised Posteriors

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].plot(lambda0_grid, post_lambda0, 'b-', lw=2)
axes[0].axvline(lambda0_true, color='r', linestyle='--', label='True value')
axes[0].axvline(lambda_rest, color='gray', linestyle=':', label='Rest wavelength')
axes[0].set_xlabel('λ₀ (Å)')
axes[0].set_ylabel('Marginal posterior density')
axes[0].set_title('P(λ₀ | data)')
axes[0].legend()

axes[1].plot(A_grid, post_A, 'b-', lw=2)
axes[1].axvline(A_true, color='r', linestyle='--', label='True value')
axes[1].set_xlabel('Depth A')
axes[1].set_ylabel('Marginal posterior density')
axes[1].set_title('P(A | data)')
axes[1].legend()

plt.tight_layout()
plt.show()

#### Credible Intervals vs Confidence Intervals

Before we extract uncertainties from our posterior, let's clarify what we are actually computing.

Recall that frequentist statistics uses **confidence** intervals. A 95% **confidence** interval does *not* mean "there is a 95% probability the parameter lies in this interval". It means that if you were to repeat the experiment many times and construct the interval the same way each time, 95% of those intervals would contain the true value. From a single **confidence** interval, all that you can honestly say is that the value either lies in the interval, or it doesn't.

In Bayesian statistics, we compute **credible** intervals. A 95% **credible** interval means what you wished a **confidence** interval meant: given our data and prior, there is a 95% probability that the parameter lies in this interval.

To give the frequentists some credit, in the limit of well-behaved problems with lots of data, the **confidence** interval should converge to the **credible** interval. However, the interpretation is different. Also, some adversarial problems exist where this is not the case.
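
The "repeat the experiment many times" definition can be checked by simulation. Here is a minimal sketch that draws many datasets from a Gaussian with known $\sigma$, builds the standard 95% confidence interval for the mean from each one, and counts how often the interval contains the truth. All of the numbers are arbitrary choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(123)
true_mean = 5.0
sigma = 2.0             # known measurement uncertainty
n_obs = 25              # measurements per experiment
n_experiments = 10000   # number of repeated experiments

# Each row is one repeated experiment
samples = rng.normal(true_mean, sigma, size=(n_experiments, n_obs))
sample_means = samples.mean(axis=1)

# 95% confidence interval for the mean: sample mean +/- 1.96 * sigma / sqrt(n)
half_width = 1.96 * sigma / np.sqrt(n_obs)
covered = np.abs(sample_means - true_mean) < half_width
coverage = covered.mean()

print(f"Fraction of intervals containing the true mean: {coverage:.3f}")
```

The fraction comes out close to 0.95, but note what it describes: the long-run behaviour of the procedure, not the probability that any one interval contains the true value. For this particular setup (Gaussian noise with known $\sigma$ and a flat prior on the mean), the 95% credible interval has the same endpoints, which is one instance of the convergence mentioned above.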

Anyway, let's calculate our credible interval and see what it is:

In [None]:
def credible_interval(x_grid, posterior, level=0.68):
    """
    Compute the credible interval from a 1D posterior on a grid.
    
    Inputs:
        - x_grid    : array - parameter values
        - posterior : array - posterior density (does not need to be normalised)
        - level     : float - credible level (default 0.68 for ~1-sigma)
    Output:
        - median    : float - posterior median
        - low       : float - lower bound of credible interval
        - high      : float - upper bound of credible interval
    """
    # Normalise
    posterior = posterior / np.sum(posterior)
    
    # Cumulative distribution
    cdf = np.cumsum(posterior)
    
    # Interpolate to find percentiles
    low = np.interp((1 - level) / 2, cdf, x_grid)
    median = np.interp(0.5, cdf, x_grid)
    high = np.interp((1 + level) / 2, cdf, x_grid)
    
    return median, low, high

In [None]:
#  Compute credible intervals for λ₀
median_lam, low_lam, high_lam = credible_interval(lambda0_grid, post_lambda0)

print(f"λ₀ = {median_lam:.3f} (+{high_lam - median_lam:.3f} / -{median_lam - low_lam:.3f}) Å")
print(f"True value: {lambda0_true:.3f} Å\n")

# Convert to radial velocity
v_median = wavelength_to_velocity(median_lam)
v_low = wavelength_to_velocity(low_lam)
v_high = wavelength_to_velocity(high_lam)

print(f"v_r = {v_median:.2f} (+{v_high - v_median:.2f} / -{v_median - v_low:.2f}) km/s")
print(f"True value: {v_radial_true:.2f} km/s")

## Part 10: Summary

Takeaways:
- Bayesian probabilities quantify belief, not long-run frequency.
- Bayes' theorem is just algebra. The magic is in the interpretation.
- Priors encode knowledge and shouldn't encode bias. With enough data, the prior becomes insignificant.
- The posterior answers the question you care about: The probability of a parameter given data.

What wasn't covered (but will be in future):
- How to efficiently explore high-dimensional posteriors (i.e. MCMC)
- Model comparison
- Hierarchical models

<hr style="border: none; border-top: 1px solid #ccc; margin: 20px 100px;">
<p style="text-align: center; font-style: italic; margin: 10px 60px;">
A scientist who has learned how to use probability theory directly as extended logic, has a great advantage in power and versatility over one who has learned only a collection of unrelated ad hoc devices.    <br><span style="font-style: normal; font-size: 0.9em;">— E.T. Jaynes</span>
</p>
<hr style="border: none; border-top: 1px solid #ccc; margin: 20px 100px;">
