# Building a Bayesian model
[Collard and coworkers](https://doi.org/10.1139/cjz-2020-0171) did a simple experiment. They collected samples of carrion beetles (beetles that feed on decaying animal matter) and measured morphological features of the beetles of various species, collected from different sites at different times of the year.

Imagine that you are in a cabin in Maine. There are plenty of carrion beetles in the area in which you are staying, so you are curious to look at the morphological features of the beetles in your area. You plan to set up a trap near your cabin to collect specimens of a given species. For each beetle you collect, you will measure its mass and the length of its [elytra](https://en.wikipedia.org/wiki/Elytron).

**a)** Propose a generative model, including priors, for the probability distribution describing the observed elytra length and mass.

**b)** Now you will generate data sets that you might expect to observe in your experiment. To generate one of the data sets do the following.

1. Draw a set of parameter values out of the probability distribution you constructed for the parameters (the prior).
2. Use those parameters to parametrize the likelihood and draw 40 samples of beetle masses and elytra lengths out of it.

You can do this many times to see how the data sets might look. Generate many such data sets and make plots of them. (Think carefully about how to plot these so that it is clear to you how data sets coming from your generative model may look.) Does this jibe with what you would expect from your experiment? If not, do you have any ideas why not?

**c)** *Do not attempt this part of the problem until parts (a) and (b) are complete.* You can access the measurements of Collard and coworkers [here](https://doi.org/10.5061/dryad.hdr7sqvg7). Extract the measurements made of *Nicrophorus orbicollis* at location MASS 10 in trap 0. Does the measured data set fall within the data sets you might expect from your proposed generative model?

In [None]:
import numpy as np
import iqplot
import bokeh.io
import bokeh.plotting  # used below via bokeh.plotting.figure
import colorcet
bokeh.io.output_notebook()

rng = np.random.default_rng()

In [127]:
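# a. The likelihood is a bivariate (multivariate) Normal distribution.
# Each prior below is a Normal distribution on the log10 of the parameter,
# specified by a center (*_mu) and a standard deviation (*_sigma).

# LENGTH in mm, assumed range ("bet the farm"):
# mean -> 10-100 mm
# logs = [1, 2]
mean_length_mu = 1.5
mean_length_sigma = 0.2551  # half-width 0.5 / 1.96

# std -> 1-8 mm
# logs = [0, 0.9]
# NOTE: the center 0.25 is below the midpoint (0.45) of the stated log range,
# so ~95% of this prior actually lies in about 0.6-5 mm.
std_length_mu = 0.25
std_length_sigma = 0.2296  # half-width 0.45 / 1.96

# MASS in grams ("bet the farm"):
# mean -> 10-1000 grams
# logs = [1, 3]
# NOTE: the center 1 is below the midpoint (2) of the stated log range,
# so ~95% of this prior actually lies in about 1-100 g.
mean_mass_mu = 1
mean_mass_sigma = 0.51  # half-width 1 / 1.96

# std -> 1-10 grams
# logs = [0, 1]
# NOTE: the center 0.25 is below the midpoint (0.5) of the stated log range,
# so ~95% of this prior actually lies in about 0.6-5.6 g.
std_mass_mu = 0.25
std_mass_sigma = 0.2551  # half-width 0.5 / 1.96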
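
The scale values above are consistent with treating each "bet the farm" range as a roughly 95% interval of a Normal distribution on the log10 scale, that is, half the log-range divided by 1.96. A minimal sketch of that conversion follows; the helper name `log10_normal_from_range` and its `z` argument are hypothetical and not part of the notebook.

```python
import numpy as np


def log10_normal_from_range(low, high, z=1.96):
    """Center and scale of a Normal prior on log10(x) that puts
    roughly 95% of its mass between `low` and `high` (for z = 1.96)."""
    log_low, log_high = np.log10(low), np.log10(high)
    mu = (log_low + log_high) / 2            # midpoint of the log range
    sigma = (log_high - log_low) / (2 * z)   # half-width divided by z
    return mu, sigma


# Mean elytra length assumed to lie between 10 and 100 mm
mu, sigma = log10_normal_from_range(10, 100)
print(f"{mu:.4f}, {sigma:.4f}")
```

Calling the helper with the other stated ranges gives centers at the midpoints of their log ranges, which is a quick way to check whether the hand-entered values match the intended ranges.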


Likelihood:
$$
\begin{aligned}
\left(\begin{array}{c}
M_i \\
L_i
\end{array}\right)
&\sim \mathrm{Norm} \left(
\left(\begin{array}{c}
\mu_M \\
\mu_L
\end{array}\right),
\left(\begin{array}{cc}
\sigma_M^2 & \sigma_{ML} \\
\sigma_{LM} & \sigma_L^2
\end{array}\right)
\right) \\
&\sim \mathrm{Norm} \left(
\left(\begin{array}{c}
\mu_M \\
\mu_L
\end{array}\right),
\left(\begin{array}{cc}
\sigma_M^2 & \rho \cdot \sigma_M \cdot \sigma_L \\
\rho \cdot \sigma_L \cdot \sigma_M & \sigma_L^2
\end{array}\right)
\right)
\end{aligned}
$$
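
For completeness, the priors can be written in the same notation. The block below is read off from the numerical values used in the code cells of this notebook, so treat it as a transcription rather than an independent derivation. The second argument of each Normal is a standard deviation (matching the `scale` argument of `rng.normal`), lengths are in mm, and masses are in grams.

$$
\begin{aligned}
\log_{10} \mu_L &\sim \mathrm{Norm}(1.5,\ 0.2551), \\
\log_{10} \sigma_L &\sim \mathrm{Norm}(0.25,\ 0.2296), \\
\log_{10} \mu_M &\sim \mathrm{Norm}(1,\ 0.51), \\
\log_{10} \sigma_M &\sim \mathrm{Norm}(0.25,\ 0.2551), \\
\frac{\rho + 1}{2} &\sim \mathrm{Beta}(5,\ 1.6).
\end{aligned}
$$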

In [128]:
samplings = 40
iterations = 100

length_mu = 10**rng.normal(mean_length_mu, mean_length_sigma, iterations)
length_sigma = 10**rng.normal(std_length_mu, std_length_sigma, iterations)
mass_mu = 10**rng.normal(mean_mass_mu, mean_mass_sigma, iterations)
mass_sigma = 10**rng.normal(std_mass_mu, std_mass_sigma, iterations)

# The prior for rho is based on a Beta distribution, assuming mostly positive correlation.
# A Beta variate lies in [0, 1] and rho must lie in [-1, 1], so use 2*beta - 1.
# Parameters for the Beta distribution were chosen from the website.
alpha = 5
beta = 1.6
rho = 2*rng.beta(alpha, beta, iterations) - 1
cov = rho * length_sigma * mass_sigma

# One row per simulated data set, one column per beetle
masses = np.zeros((iterations, samplings))
lengths = np.zeros((iterations, samplings))
for i, (length_m, length_s, mass_m, mass_s, cov_) in enumerate(
    zip(length_mu, length_sigma, mass_mu, mass_sigma, cov)
):
    # Note the ordering (length, mass) of the sampled vector
    mvn = rng.multivariate_normal(
        mean=[length_m, mass_m],
        cov=[[length_s**2, cov_], [cov_, mass_s**2]],
        size=samplings
    )
    lengths[i, :] = mvn[:, 0]
    masses[i, :] = mvn[:, 1]

# mass = masses.flatten()
# length = lengths.flatten()
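
Because the likelihood is Normal, a draw can be zero or negative, which is unphysical for a mass or a length and cannot be displayed on a log axis. The sketch below shows one way to count how often that happens per data set. The function name `fraction_nonpositive` is hypothetical, and the `demo` array is a synthetic stand-in for `masses` or `lengths`, not output from the notebook.

```python
import numpy as np

rng = np.random.default_rng(3252)  # arbitrary seed, only for reproducibility


def fraction_nonpositive(data):
    """Fraction of non-positive entries in each row (one row per data set)."""
    return np.mean(np.asarray(data) <= 0, axis=1)


# Synthetic stand-in: 3 data sets of 40 draws with different means and spreads
demo = rng.normal(
    loc=[[10.0], [2.0], [1.0]],
    scale=[[2.0], [3.0], [5.0]],
    size=(3, 40),
)
frac = fraction_nonpositive(demo)
print(frac)
```

Applied to the simulated arrays, a large fraction in many rows would suggest that the priors on the means and standard deviations allow implausible combinations.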


In [129]:
# Scatter plot of the generated data sets, one color per data set:
p = bokeh.plotting.figure(
    width=400, height=400,
    title="Generated Beetle Data",
    x_axis_label="Elytra Length (mm)",
    y_axis_label="Mass (grams)",
    x_axis_type='log',
    y_axis_type='log',
)

colors = colorcet.b_glasbey_category10

for l, m, color in zip(lengths, masses, colors):
    p.scatter(l, m, alpha=0.3, color=color)


p.xaxis.axis_label_text_font_size = "14pt"
p.yaxis.axis_label_text_font_size = "14pt"
bokeh.io.show(p)
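
Overlaying one hundred scatter plots can be hard to read. One alternative, offered here as a suggestion rather than as the approach taken in this notebook, is to reduce each simulated data set to a few summaries (means, standard deviations, and the correlation) and then plot the distribution of those summaries. The function name `summarize_data_sets` is hypothetical, and the small arrays in the usage lines are toy inputs.

```python
import numpy as np


def summarize_data_sets(lengths, masses):
    """Per-data-set summaries; inputs have shape (n_data_sets, n_beetles)."""
    lengths = np.asarray(lengths, dtype=float)
    masses = np.asarray(masses, dtype=float)

    # Center each row so the Pearson correlation can be computed row-wise
    lc = lengths - lengths.mean(axis=1, keepdims=True)
    mc = masses - masses.mean(axis=1, keepdims=True)
    corr = (lc * mc).sum(axis=1) / np.sqrt(
        (lc**2).sum(axis=1) * (mc**2).sum(axis=1)
    )

    return {
        "mean_length": lengths.mean(axis=1),
        "mean_mass": masses.mean(axis=1),
        "sd_length": lengths.std(axis=1, ddof=1),
        "sd_mass": masses.std(axis=1, ddof=1),
        "corr": corr,
    }


# Toy input: two data sets of three beetles each
summary = summarize_data_sets(
    [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]],
    [[2.0, 4.0, 6.0], [3.0, 2.0, 1.0]],
)
print(summary["corr"])
```

Each entry of the returned dictionary has one value per data set, so it can be passed to a histogram or ECDF plot.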

In [130]:
# c - Extract the measurements made of *Nicrophorus orbicollis* at location MASS 10 in trap 0. 
# Does the measured data set fall within the data sets you might expect from your proposed generative model?
import polars as pl
# import csv:
df = pl.read_csv("data.beetle.size.csv")
# choose Trap 0 Site MASS 10
df_filtered = df.filter(
    (pl.col("Trap") == 0) & 
    (pl.col("Site") == "MASS 10") &
    (pl.col("Genus") == "Nicrophorus") &
    (pl.col("Species") == "orbicollis")
).select(["Elytra_length_mm", "Mass_g"])
real_data_length = df_filtered["Elytra_length_mm"].to_numpy()
real_data_mass = df_filtered["Mass_g"].to_numpy()
p.scatter(real_data_length, real_data_mass, color="black", marker="x")


In [131]:
bokeh.io.show(p)
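
Beyond the visual overlay, a numerical check is to ask where a summary of the measured data falls among the same summary computed for each simulated data set. The sketch below is one way to do that. The function name `prior_predictive_rank` is hypothetical, and the numbers are placeholders rather than values from the Collard data set or from this notebook's simulations.

```python
import numpy as np


def prior_predictive_rank(sim_stats, observed_stat):
    """Fraction of simulated summary values that are <= the observed value."""
    sim_stats = np.asarray(sim_stats, dtype=float)
    return np.mean(sim_stats <= observed_stat)


# Placeholder values only: in practice, pass e.g. masses.mean(axis=1)
# and real_data_mass.mean()
sim_means = np.array([1.0, 2.0, 3.0, 4.0])
observed_mean = 2.5  # placeholder, not a real measurement
rank = prior_predictive_rank(sim_means, observed_mean)
print(rank)
```

A rank very close to 0 or 1 means the measured summary lies in the tail of what the generative model produces, or outside it, which would be a reason to revisit the priors.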