# Overdispersion Exercises

You'll notice the exercises below are not as guided as the exercises in previous lessons. There is no single correct way to solve them. You are encouraged to use all the resources available: not only the content in this lesson, but also previous lessons, libraries' documentation, books, articles, Google search, etc.

If you come up with a solution that differs from the official one, please share it with us! Discourse or a private message are both good options. It would be very nice to see your creative solutions 😄

At the same time, if one of the exercises is more challenging than expected or simply not clear, reach out so we can clarify it and make it better.

## Imports

In [1]:
import arviz as az
import bambi as bmb
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm

from scipy import stats

In [2]:
%matplotlib inline
plt.style.use("intuitivebayes.mplstyle")

mpl.rcParams["figure.dpi"] = 120
mpl.rcParams["figure.facecolor"] = "white"
mpl.rcParams["axes.spines.left"] = False
FIGSIZE = (7, 4)

## Exercise 1 [Easy]

**_Become a beta-binomial pro_**

In the "It's better to _look_ at it" sub-section of the lesson we created a visualization where we compared the probability mass function of the binomial and beta-binomial distributions. 

1.  All of the distributions in the visualization have something in common. What is it? _Hint: Think about properties that are commonly discussed when characterizing distributions_.
1.  Propose pairs of binomial and beta-binomial distributions with the same mean but different variances and plot them. You can use the functions `beta_binomial_mean` and `beta_binomial_variance` provided in the lesson. Before performing the actual computation, think about which distribution will have the larger variance.

In [3]:
def beta_binomial_mean(n, alpha, beta):
    return (n * alpha) / (alpha + beta)

def beta_binomial_variance(n, alpha, beta):
    t1 = (n * alpha * beta) / (alpha + beta) ** 2
    t2 = (alpha + beta + n) / (alpha + beta + 1)
    return t1 * t2
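
As a starting point for part 2, one possible way to line up a binomial and a beta-binomial distribution with the same mean is to pick `alpha` and `beta` so that `alpha / (alpha + beta)` equals the binomial's `p`. The sketch below is only an illustration: `binomial_mean`, `binomial_variance`, and all parameter values are assumptions that do not come from the lesson, and the two beta-binomial helpers are repeated so the block runs on its own.

```python
def beta_binomial_mean(n, alpha, beta):
    return (n * alpha) / (alpha + beta)


def beta_binomial_variance(n, alpha, beta):
    t1 = (n * alpha * beta) / (alpha + beta) ** 2
    t2 = (alpha + beta + n) / (alpha + beta + 1)
    return t1 * t2


# Hypothetical helpers for the binomial counterpart (not from the lesson)
def binomial_mean(n, p):
    return n * p


def binomial_variance(n, p):
    return n * p * (1 - p)


n, p = 20, 0.3  # arbitrary values chosen for illustration

# Any (alpha, beta) with alpha / (alpha + beta) == p gives the same mean
candidates = [(3, 7), (30, 70)]

print(f"binomial: mean={binomial_mean(n, p):.2f}, var={binomial_variance(n, p):.2f}")
for alpha, beta in candidates:
    m = beta_binomial_mean(n, alpha, beta)
    v = beta_binomial_variance(n, alpha, beta)
    print(f"beta-binomial({alpha}, {beta}): mean={m:.2f}, var={v:.2f}")
```

The printed means and variances can then guide which pairs are worth plotting.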

**_Your answer here_**

## Exercise 2 [Easy]

**_Plot the function $f(\pi) = n \times \pi \times (1 - \pi)$ for values of $\pi \in [0, 1]$ and some fixed values of $n$ of your choice._**

1. What's the maximum value of the function $f(\pi)$? And the minimum?
2. What does it mean for the binomial distribution?
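
To get started with the plot, the function can be evaluated on a grid of $\pi$ values. This is a minimal sketch that assumes a hypothetical helper `f` and arbitrary choices of `n`; it uses plain Python lists, which can be passed directly to `ax.plot(grid, values)`.

```python
def f(pi, n):
    """Evaluate n * pi * (1 - pi)."""
    return n * pi * (1 - pi)


# 101 equally spaced values of pi in [0, 1]
grid = [i / 100 for i in range(101)]

# Arbitrary values of n chosen for illustration
curves = {n: [f(pi, n) for pi in grid] for n in (5, 10, 20)}

for n, values in curves.items():
    print(f"n={n}: smallest={min(values):.2f}, largest={max(values):.2f}")
```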

**_Your answer here_**

## Exercise 3 [Easy]

**_Compute probabilities based on the observed data and the fitted Poisson distribution._**

In this exercise, use the values in the file `toy_counts.csv` that was introduced in the lesson.

1. Find the maximum likelihood estimate of $\lambda$ assuming $X \sim \mathrm{Poisson}(\lambda)$. _Hint: have a second look at the lesson if you need a refresher on how to do this. Feel free to search on the web for help too._
2. Compute the following probabilities, both empirically from the observed data and from the fitted Poisson distribution:

$$
\begin{aligned}
& P(X \le 2) \\
& P(X > 10)  \\
& P(4 \le X \le 8)
\end{aligned}
$$

You can use the `.pmf()` method of SciPy's random variables, but you could also implement it yourself.

3. Are there any substantial differences between the observed and the fitted probabilities? Why?
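
One way to organize these computations is sketched below. The counts are made-up placeholders, not the contents of `toy_counts.csv`, and all helper names are assumptions. The sketch relies on the fact that the maximum likelihood estimate of $\lambda$ under a Poisson model is the sample mean, and it implements the probability mass function by hand; `stats.poisson.pmf` and `stats.poisson.cdf` from SciPy would work equally well.

```python
import math

# Placeholder counts for illustration only; replace with the values in toy_counts.csv
counts = [0, 1, 1, 2, 3, 3, 4, 5, 7, 12]

# For a Poisson model, the maximum likelihood estimate of lambda is the sample mean
lam_hat = sum(counts) / len(counts)


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def poisson_prob_between(lo, hi, lam):
    """P(lo <= X <= hi) for X ~ Poisson(lam), both ends inclusive."""
    return sum(poisson_pmf(k, lam) for k in range(lo, hi + 1))


def empirical_prob(condition, data):
    """Proportion of observations that satisfy `condition`."""
    return sum(condition(x) for x in data) / len(data)


fitted = {
    "P(X <= 2)": poisson_prob_between(0, 2, lam_hat),
    "P(X > 10)": 1 - poisson_prob_between(0, 10, lam_hat),
    "P(4 <= X <= 8)": poisson_prob_between(4, 8, lam_hat),
}
observed = {
    "P(X <= 2)": empirical_prob(lambda x: x <= 2, counts),
    "P(X > 10)": empirical_prob(lambda x: x > 10, counts),
    "P(4 <= X <= 8)": empirical_prob(lambda x: 4 <= x <= 8, counts),
}

for name in fitted:
    print(f"{name}: observed={observed[name]:.3f}, fitted={fitted[name]:.3f}")
```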

**_Your answer here_**

## Exercise 4 [Medium]

**_Modeling of binomial outcomes with PyMC._**

In this exercise we're going to revisit the UC Berkeley admissions problem with PyMC.

1.  Reproduce the binomial model for graduate admissions using PyMC.
2.  Compute the coverage of the 95% HDI of the posterior predictive distribution.
3.  What are your conclusions about this model?
4.  Now, consider both gender and department as covariates and repeat points 1 to 3. Compare the estimation of the gender coefficient between the two models and conclude.
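
For part 2, coverage is the proportion of observations that fall inside their own predictive interval. The sketch below illustrates the idea with simulated draws rather than output from a fitted PyMC model; `hdi` and `coverage` are hypothetical helpers written with the standard library only. With real posterior predictive samples, `az.hdi` can compute the intervals instead.

```python
import math
import random


def hdi(draws, prob=0.95):
    """Return the narrowest interval containing a fraction `prob` of the draws."""
    ordered = sorted(draws)
    n_inside = math.ceil(prob * len(ordered))
    # Slide a window of n_inside consecutive sorted draws and keep the shortest one
    starts = range(len(ordered) - n_inside + 1)
    best = min(starts, key=lambda i: ordered[i + n_inside - 1] - ordered[i])
    return ordered[best], ordered[best + n_inside - 1]


def coverage(observed, predictive_draws, prob=0.95):
    """Proportion of observations that fall inside their own HDI."""
    hits = 0
    for y, draws in zip(observed, predictive_draws):
        lower, upper = hdi(draws, prob)
        hits += lower <= y <= upper
    return hits / len(observed)


# Simulated stand-ins: 12 observations, each with 1000 predictive draws
rng = random.Random(1234)
predictive_draws = [[rng.gauss(50, 5) for _ in range(1000)] for _ in range(12)]
observed = [rng.gauss(50, 5) for _ in range(12)]

print(f"Coverage of the 95% HDI: {coverage(observed, predictive_draws):.2f}")
```

A coverage far below the nominal 95% suggests the predictive distribution is too narrow for the data.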

**_Your answer here_**

## Exercise 5 [Medium]

**_Comparing Poisson vs negative binomial predictive distributions._**

At the beginning of the lesson we introduced the fish species diversity dataset. There we created a model (actually two models) to understand the association between the size of a lake and the number of species in it, and to predict the number of species given a lake size.

Now it's time to do some work on that dataset with our favorite probabilistic programming language: PyMC.

1. Reproduce the Poisson regression model with PyMC and get the predictive distribution of the number of species in a lake of 1000 square kilometers.
2. Same as #1, but using the negative binomial regression model.
3. Compare the predictive distributions obtained in #1 and #2. Compare their means and variances and explain the result.
4. How would you do the same in Bambi?
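
When interpreting the comparison in part 3, it may help to recall how the variance relates to the mean in each family. The sketch below assumes the mean and dispersion parameterization used by `pm.NegativeBinomial(mu=..., alpha=...)`, where the variance is $\mu + \mu^2 / \alpha$. The helper names and numeric values are illustrative only and are not estimates from the fish data.

```python
def poisson_variance(mu):
    """Variance of a Poisson distribution with mean mu."""
    return mu


def negative_binomial_variance(mu, alpha):
    """Variance of a negative binomial with mean mu and dispersion alpha."""
    return mu + mu**2 / alpha


mu = 30.0  # arbitrary mean, not an estimate from the fish data

print(f"Poisson variance: {poisson_variance(mu):.1f}")
for alpha in (1.0, 10.0, 100.0):
    var = negative_binomial_variance(mu, alpha)
    print(f"Negative binomial variance (alpha={alpha:g}): {var:.1f}")
```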

**_Your answer here_**

## Exercise 6 [Medium/Hard]

**_Analysis of trout eggs._**

The `troutegg.csv` file contains the **troutegg** dataset, with information about the survival of trout eggs at different times and locations.

Boxes of trout eggs were buried at 5 different stream locations and retrieved at 4 different times, and the number of surviving eggs was recorded.

The data frame contains 20 observations with the following 4 variables:

* **survive** the number of surviving eggs.
* **total** the number of eggs in the box.
* **location** the location in the stream (1, 2, 3, 4, and 5).
* **period** the number of weeks after placement at which the box was withdrawn (4, 7, 8, and 11).

**Note**: Consider both `location` and `period` as categorical covariates.

1.  Explore the dataset. Do you spot any challenges?
2.  Use Bambi to:
    1. Build two binomial models: one with additive effects and another with interaction effects
    1. Build two beta-binomial models: one with additive effects and another with interaction effects
3.  What is the estimated survival probability for the following cases in the dataset? Estimate the posterior with the four models and plot all of them together using a forest plot.
    * `location == 4` and `period == 4`
    * `location == 4` and `period == 8`
    * `location == 5` and `period == 11`
4.  For the cases listed in #3, get and visualize the predictive distribution alongside the observed number of surviving eggs.
5.  Which model do you think makes the most sense? List pros and cons of both models.

In part 2, **it's OK to use default priors** as prior elicitation is not the goal of the exercise.

**_Your answer here_**

## Exercise 7 [Hard]

**_Revisiting the fish species dataset with PyMC - Part 2._**

Consider the Poisson regression model that we have already built:

$$
\begin{aligned}
\text{species}_i &\sim \text{Poisson}(\mu_i) \\
\mu_i &= \exp[\beta_0 + \beta_1 \log(\text{area}_i)] \\
\beta_0 &\sim \text{Normal} \\
\beta_1 &\sim \text{Normal}
\end{aligned}
$$


1. What does the intercept mean in the model? Does it make sense to interpret it?
2. What transformation can be applied to the lake area in order to make the intercept $\beta_0$ relate to the number of species in a lake of average size?
3. What is needed to make the intercept $\beta_0$ relate to the number of species in a lake of 1000 square kilometers?
4. Write and fit both models in PyMC. Then:
    1.  Using the first model, predict the mean number of species for a lake of 1000 square kilometers.
    2.  Using the second model, predict the mean number of species for a lake of average size.
    3.  Compare these values with the posterior mean of $\beta_0$ in both models.
5.  What can you conclude?

**_Your answer here_**

## Exercise 8 [Hard]

**_The non-identifiability playground._**

Consider the additive trout eggs model created in a previous exercise.

Below we propose several model implementations that are not completely correct.

Explain what the problem is in each case and propose at least one solution (written in PyMC).

It can be useful to revisit all the content about parameter non-identifiability in the course.
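
As a refresher, parameters are non-identifiable when different parameter values produce exactly the same likelihood. The toy example below uses made-up numbers and a hypothetical `linear_predictor` helper rather than any of the models that follow. It shows the kind of check that can reveal the issue: shift one term by a constant, compensate in another term, and compare the resulting linear predictors.

```python
def linear_predictor(intercept, group_effects, group_idx):
    """Intercept plus the effect of the group each observation belongs to."""
    return [intercept + group_effects[i] for i in group_idx]


group_idx = [0, 0, 1, 2, 2]  # group membership of five made-up observations

# Original parameter values (arbitrary)
intercept, effects = 0.5, [-1.0, 0.0, 2.0]

# Move a constant from the group effects into the intercept
shift = 3.0
intercept_shifted = intercept + shift
effects_shifted = [e - shift for e in effects]

eta = linear_predictor(intercept, effects, group_idx)
eta_shifted = linear_predictor(intercept_shifted, effects_shifted, group_idx)

# Different parameters, same predictor: the data cannot tell them apart
print(eta)
print(eta_shifted)
```

The same kind of reasoning can be applied by hand to each model in this exercise to decide which terms can trade off against each other.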

In [53]:
# Setup
df_eggs = pd.read_csv("data/trout_egg.csv")
survive = df_eggs["survive"].to_numpy()
total = df_eggs["total"].to_numpy()

# Unique levels of each factor and, for every row, the integer index of its level
location, location_idx = np.unique(df_eggs["location"], return_inverse=True)
period, period_idx = np.unique(df_eggs["period"], return_inverse=True)

# Coordinates that label the dimensions of the model parameters
coords = {
    "location": location,
    "period": period,
}

**Model 1**

In [54]:
with pm.Model(coords=coords) as model_1:
    intercept = pm.Normal("intercept")
    b_location = pm.Normal("b_location", dims="location")
    b_period = pm.Normal("b_period", dims="period")
    mu = intercept + b_location[location_idx] + b_period[period_idx]
    p = pm.math.invlogit(mu)
    pm.Binomial("survive", p=p, n=total, observed=survive)

**Model 2**

In [55]:
with pm.Model(coords=coords) as model_2:
    b_location = pm.Normal("b_location", dims="location")
    b_period = pm.Normal("b_period", dims="period")
    mu = b_location[location_idx] + b_period[period_idx]
    p = pm.math.invlogit(mu)
    pm.Binomial("survive", p=p, n=total, observed=survive)

**Model 3**

In [56]:
with pm.Model(coords=coords) as model_3:
    intercept = pm.Normal("intercept")
    b_location = pm.ZeroSumNormal("b_location", dims="location")
    b_period = pm.Normal("b_period", dims="period")
    mu = intercept + b_location[location_idx] + b_period[period_idx]
    p = pm.math.invlogit(mu)
    pm.Binomial("survive", p=p, n=total, observed=survive)

**Model 4**

In [57]:
with pm.Model(coords=coords) as model_4:
    intercept = pm.Normal("intercept")
    b_location = pm.Normal("b_location", dims="location")
    b_period = pm.ZeroSumNormal("b_period", dims="period")
    mu = intercept + b_location[location_idx] + b_period[period_idx]
    p = pm.math.invlogit(mu)
    pm.Binomial("survive", p=p, n=total, observed=survive)

**_Your answer here_**