In [None]:
!pip install gdown
!gdown https://drive.google.com/uc?id=1GW1pjKOCoKOlC4Jqbqql_ghYD_n0iC6O
!gdown https://drive.google.com/uc?id=1FInZ2jrlZGNColU4sHF9JKGHP39fTVut
!gdown https://drive.google.com/uc?id=1n1qS6dcVVKcVJOuUIIm0VTz6cSyrtzDH

## Data & library imports

In [None]:
import pandas as pd
import plotly.express as px
import numpy as np
from scipy.stats import norm, lognorm

In [None]:
income = pd.read_csv('BDL municipality incomes 2015-2020.csv', sep=';', dtype={'Code': 'str'})

**Exercise 1.** In the first exercise, we'll perform a Bayesian estimation of the mean income of municipalities based on a random sample. The sample is already selected in the cell below. For simplicity, we'll assume that the income and its mean are normally distributed, and that the frequentist estimator of the standard deviation always gives us the correct answer (so that we can use a model with a known variance).  

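First, using the $3\sigma$ rule (three standard deviations on either side of the mean cover about 99.7% of a normal distribution, which we round to 99% here), calculate the hyperparameters for priors that assume: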
1. 99% probability that the mean income is between $10^4$ and $10^{12}$ PLN (a *weakly informative* prior),
2. 99% probability that the mean income is between $10^6$ and $2\cdot 10^8$ PLN (a *moderately informative* prior),
3. 99% probability that the mean income is between $4\cdot 10^7$ and $6 \cdot 10^7$ PLN (a *strongly informative* prior),
4. 99% probability that the mean income is between $8 \cdot 10^7$ and $10^8$ PLN (a strongly informative, but *incorrect* prior).
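
As a minimal sketch of the $3\sigma$ rule used above: if an interval is meant to span the prior mean plus or minus three standard deviations, the mean is the midpoint and the standard deviation is one sixth of the width. The helper name `prior_from_interval` and the interval $[0, 60]$ are illustrative assumptions, not part of the exercise data.

```python
import math

def prior_from_interval(lower, upper, n_sigmas=3):
    """Return (mean, sd) of a normal prior whose interval [lower, upper]
    spans mean +/- n_sigmas standard deviations.
    Hypothetical helper, introduced only for illustration."""
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    mean = (lower + upper) / 2
    sd = (upper - lower) / (2 * n_sigmas)
    return mean, sd

# Illustrative interval, unrelated to the income data.
mu0, sd0 = prior_from_interval(0.0, 60.0)

# Probability mass of a normal distribution within +/- 3 sd of its mean.
coverage = math.erf(3 / math.sqrt(2))
print(mu0, sd0, round(coverage, 4))
```

The coverage is slightly above 99%, so the priors built this way are a little more concentrated than a strict 99% interval would require.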

Write a function that takes the prior parameters, the mean and standard deviation estimated from the random sample, and the size of the sample, and returns the hyperparameters of the posterior distribution (the posterior mean and standard deviation). You can use the formulas from the description of this notebook or look them up in the [Wikipedia article](https://en.wikipedia.org/wiki/Conjugate_prior).
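
For reference, the standard conjugate update for a normal likelihood with known variance and a normal prior on the mean can be written as follows. The notation is an assumption made here: $\mu_0, \sigma_0$ are the prior mean and standard deviation, $\bar{x}$ is the sample mean, $\sigma$ is the (assumed known) standard deviation of the data, $n$ is the sample size, and $\mu_n, \sigma_n$ are the posterior hyperparameters.

$$
\sigma_n^2 = \left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}\right)^{-1}, \qquad
\mu_n = \sigma_n^2 \left(\frac{\mu_0}{\sigma_0^2} + \frac{n\,\bar{x}}{\sigma^2}\right)
$$

In words, the precisions (inverse variances) of the prior and of the sample mean add up, and the posterior mean is the precision-weighted average of $\mu_0$ and $\bar{x}$.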

Using the `norm.pdf` function, compute the posterior probability densities at the points given by `x = np.linspace(1e06, 2e08, 501)` for all four priors. Visualize the densities on a single plot. Annotate the plot with the true mean income and the value of the frequentist estimator (i.e. the arithmetic mean of the sample). Hint: create a data frame `posterior_pdf = pd.DataFrame({'x': x})` and add one column of computed density values per prior. Next, use `posterior_pdf = posterior_pdf.melt(id_vars='x', var_name="Type of prior")` to get the data in a long format suitable for plotting with `plotly.express`. Use the `fig = px.line()` function for plotting and `fig.add_vline()` to annotate the plot.  
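
The reshaping step in the hint can be sketched on a toy data frame. The column names `Curve A` and `Curve B` and their values are placeholders chosen for illustration; only the `melt` call mirrors the hint.

```python
import numpy as np
import pandas as pd

# Wide format: one column of y-values per curve (placeholder curves).
x = np.linspace(-1.0, 1.0, 5)
wide_df = pd.DataFrame({'x': x, 'Curve A': x**2, 'Curve B': np.abs(x)})

# Long format: one row per (x, curve) pair. The former column names go
# into 'Type of prior' and the y-values into the default 'value' column.
long_df = wide_df.melt(id_vars='x', var_name='Type of prior')

print(long_df.columns.tolist())
print(long_df.shape)
```

A long data frame like this can then be passed to `px.line` with `x='x'`, `y='value'` and `color='Type of prior'`, which draws one line per distinct value in the `Type of prior` column.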

Create a plot showing the probability density function of the moderately informative prior and the corresponding posterior. Answer the following questions: How did the sample influence the prior distributions? Is there a large difference between the posteriors for the weakly and the moderately informative priors? What is the effect of incorrectly specifying the prior compared to specifying a prior with a large variance?

What happens if you increase the size of the sample?  
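
One way to build intuition for this question is to look at how the posterior standard deviation and the weight given to the data change with $n$ in the known-variance normal model, where the posterior variance is $1/(1/\sigma_0^2 + n/\sigma^2)$. The sketch below uses made-up values `sd_prior = 10` and `sd_data = 20` rather than the income data, and the helper names are hypothetical.

```python
import math

def posterior_sd(sd_prior, sd_data, n):
    """Posterior sd of the mean under a normal prior and known data sd."""
    return math.sqrt(1.0 / (1.0 / sd_prior**2 + n / sd_data**2))

def data_weight(sd_prior, sd_data, n):
    """Share of the posterior mean that comes from the sample mean."""
    data_precision = n / sd_data**2
    return data_precision / (1.0 / sd_prior**2 + data_precision)

# Illustrative values only, not taken from the income data.
sd_prior, sd_data = 10.0, 20.0
sizes = (1, 4, 16, 64, 256)
sds = [posterior_sd(sd_prior, sd_data, n) for n in sizes]
weights = [data_weight(sd_prior, sd_data, n) for n in sizes]

for n, s, w in zip(sizes, sds, weights):
    print(f"n={n:4d}  posterior sd={s:.3f}  weight of data={w:.3f}")
```

Under these assumptions the posterior narrows and the sample mean dominates the prior mean as $n$ grows, which is the behaviour to look for when repeating the exercise with a larger sample.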


In [None]:
## Get the data:
income2020 = income['2020'].dropna()
true_mean, true_sd = income2020.mean(), income2020.std()
print('True mean:', round(true_mean), 'and standard deviation:', round(true_sd))
## Get the sample:
N = 36
#income_sample = income2020.sample(N)
income_sample = income2020[[2241, 1980, 2436,  979, 1064, 2146, 1983,  464, 1262,  318, 2429,
                            1609, 2320, 1383,  813, 1948, 2392, 1930, 1751, 1330, 1586,  856,
                            1149, 2369, 2189, 1993, 1911,  225,  546,  843, 1389,  821,  338,
                            1986, 1132, 1077]]
## Frequentist estimate:
mu_estim = income_sample.mean()
sd_estim = income_sample.std()
print('Estimated mean:', round(mu_estim), 'and standard deviation:', round(sd_estim))
## Write the rest of your code here. 

In [None]:
### Solution for the tutors.

## Prior for a 99% probability that the mean income
## is between 1e04 and 1e12 (weakly informative):
mu01, sd01 = (1e12+1e04)/2, (1e12-1e04)/6
## Prior for a 99% probability that the mean income
## is between 1e06 and 2e08 (moderately informative):
mu02, sd02 = (2e08+1e06)/2, (2e08-1e06)/6
## Prior for a 99% probability that the mean income
## is between 4e07 and 6e07 (strongly informative):
mu03, sd03 = (4e07+6e07)/2, (6e07-4e07)/6
## Prior for a 99% probability that the mean income
## is between 8e07 and 1e08 (strongly informative, but incorrect):
mu04, sd04 = (1e08+8e07)/2, (1e08-8e07)/6
mu0s = [mu01, mu02, mu03, mu04]
sd0s = [sd01, sd02, sd03, sd04]

## Posterior parameter function:
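def posterior_param(mu_prior, sd_prior, mu_data, sd_data, n):
  """
  Posterior mean and sd of a normal mean under a normal prior.
  Assumes that sd_data equals the true (known) standard deviation.
  """
  ## Precisions (inverse variances) of the prior and the sample mean add up:
  var_posterior = 1/(1/sd_prior**2 + n/sd_data**2)
  ## Precision-weighted average of the prior mean and the sample mean:
  mu_posterior = var_posterior*(mu_prior/sd_prior**2 + n*mu_data/sd_data**2)
  sd_posterior = np.sqrt(var_posterior)
  return (mu_posterior, sd_posterior)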

## Compute the posterior probability densities:
x = np.linspace(1e06, 2e08, 501)
posterior_pdf = pd.DataFrame({'x': x})
pmus = []
psds = []
for i, name in enumerate(['Weakly informative',
                          'Moderately informative',
                          'Strongly informative',
                          'Incorrect']):
  pmu, psd = posterior_param(mu0s[i], sd0s[i], mu_estim, sd_estim, N)
  posterior_pdf[name] = norm.pdf(x, loc=pmu, scale=psd)
  pmus.append(pmu)
  psds.append(psd)

## Visualize the result:
posterior_pdf = posterior_pdf.melt(id_vars='x', var_name="Type of prior")
fig = px.line(posterior_pdf, x='x', y='value', color='Type of prior')
fig.add_vline(x=true_mean, annotation_text='True mean', opacity=0.5)
fig.add_vline(x=mu_estim, annotation_text='Estimated mean', opacity=0.5, line_color='orange')
fig.show()

## Compare the moderately informative prior and its posterior:
fig2 = px.line(posterior_pdf[posterior_pdf['Type of prior']=='Moderately informative'],
               x='x', y='value')
fig2.add_scatter(x=x, y=norm.pdf(x, loc=mu02, scale=sd02), line_dash='dash', name='Prior')
fig2.add_vline(x=true_mean, annotation_text='True mean', opacity=0.5)
fig2.add_vline(x=mu_estim, annotation_text='Estimated mean', opacity=0.5, line_color='orange')
fig2.show()