# Hierarchical Regression: Elasticities

**Today**

* Price elasticity in a retail setting
* Orange juice data
* Hierarchical models
* Centered vs non-centered

In [None]:
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm

np.set_printoptions(precision=4, linewidth=150)

%matplotlib inline

## Elasticities

We will start today by discussing the price elasticity of demand:

\begin{align*}
  \eta_i &= \frac{\Delta q / q}{\Delta p / p} = \frac{dq / q}{dp / p}
\end{align*}

What sign should $\eta_i$ take?

Most individuals buy less of a good as the price increases $\rightarrow \eta_i < 0$

### Constant elasticity demand function

We will use the following demand function in today's lecture

\begin{align*}
  q = \alpha p^{\eta}
\end{align*}

One could motivate it theoretically, but we'll skip those steps and refer the interested student to the book "Economics and Consumer Behavior" by Angus Deaton and John Muellbauer.

Taking the log of the demand function and then taking the derivative w.r.t. $p$ gives us

\begin{align*}
  \log(q) &= \log(\alpha) + \eta \log(p) \\
  \frac{1}{q} \frac{\partial q}{\partial p} &= \eta \frac{1}{p} \\
  \rightarrow \frac{\partial q / q}{\partial p / p} &= \eta 
\end{align*}
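
The derivation above can be checked numerically. The following is a minimal sketch, with made-up values for $\alpha$ and $\eta$, that approximates $d \log(q) / d \log(p)$ with a central difference at several prices:

```python
import numpy as np


def demand(p, alpha=100.0, eta=-2.5):
    """Constant elasticity demand q = alpha * p**eta (parameter values are made up)."""
    return alpha * p**eta


def numerical_elasticity(p, h=1e-6, **kwargs):
    """Approximate d log(q) / d log(p) with a central difference in logs."""
    log_q_hi = np.log(demand(p * np.exp(h), **kwargs))
    log_q_lo = np.log(demand(p * np.exp(-h), **kwargs))
    return (log_q_hi - log_q_lo) / (2 * h)


prices = np.array([0.02, 0.04, 0.08])
elasticities = numerical_elasticity(prices, eta=-2.5)
print(elasticities)  # each entry is approximately -2.5, whatever the price
```

The estimated elasticity does not depend on the price at which it is evaluated, which is what "constant elasticity" means.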

### Goals of a retail store

The owners of retail stores are typically interested in maximizing the amount they earn.

A simplified view of how a store could achieve this is to choose a price for their goods that maximizes today's profits (this ignores dynamic opportunities and other important considerations).

\begin{align*}
  \pi &= p q - c q \\
  \pi &= \alpha p^{1 + \eta} - c \alpha p^{\eta} \\
  \frac{\partial \pi}{\partial p} &= \alpha (1 + \eta) p^{\eta} - c \alpha \eta p^{\eta-1} = 0\\
\end{align*}

\begin{align*}
  p &= c \frac{\eta}{1 + \eta}
\end{align*}

Two cases:

- $\eta < -1$: This results in $p > 0$ because $1 + \eta < 0 \rightarrow \frac{\eta}{1 + \eta} > 0$
- $\eta \geq -1$: This creates problems for profit maximization with this particular demand function. For $-1 < \eta < 0$ the formula gives $p < 0$, and at $\eta = -1$ it is undefined, because profit keeps increasing as the price rises and no finite maximizer exists. The easiest way forward is to acknowledge that our demand equation has shortcomings and that the elasticities faced by retail stores are not likely constant. We might justify this by claiming that the constant elasticity demand model is a reasonable local approximation to a more complex demand model that we aren't able to specify.
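
As a rough check of the pricing rule, the sketch below compares the closed-form price with a brute-force grid search. The values of `alpha`, `eta`, and `c` are hypothetical and chosen only for illustration:

```python
import numpy as np


def profit(p, alpha, eta, c):
    """Profit (p - c) * q under constant elasticity demand q = alpha * p**eta."""
    return (p - c) * alpha * p**eta


# Hypothetical values chosen only for illustration
alpha, eta, c = 100.0, -2.5, 0.03

# Closed-form solution of the first-order condition
p_closed_form = c * eta / (1 + eta)

# Brute-force search over a fine price grid starting at marginal cost
grid = np.linspace(c, 10 * c, 200_001)
p_grid = grid[np.argmax(profit(grid, alpha, eta, c))]

print(p_closed_form, p_grid)  # both are close to 0.05

# With inelastic demand (-1 < eta < 0) the formula returns a negative "price"
p_inelastic = c * (-0.5) / (1 + (-0.5))
print(p_inelastic)
```

With $\eta = -2.5$ the markup factor is $\frac{\eta}{1 + \eta} = \frac{5}{3}$, so the profit-maximizing price sits well above marginal cost.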

### At what level to think about an elasticity?

Imagine you own 5 stores in different geographies and the stores each sell the same set of goods. If you're the owner of these stores, you need to consider how to think about elasticities... 

Are they shared across products? Across geographies? Not shared at all?

**Global level**

All items in your store across all geographies have the same elasticity

**Product level**

Each item has an elasticity but that elasticity is the same across each store

**Store level**

All items in each store have the same elasticity but that elasticity differs by store

**(Product $\times$ Store) level**

Each product has a different elasticity in every store

### Trade-off

The more you allow your elasticities to differ by product and geography, the more accurately you will capture the differences across products and geographies...

But the more you allow for differences, the less data you have to accurately estimate each elasticity parameter
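
To make the trade-off concrete, here is a small counting sketch. It assumes a balanced panel with the 5 stores from the example above; the number of products and weeks are hypothetical:

```python
# Illustrative sizes: 5 stores as in the example above; the number of
# products and weeks are hypothetical.
n_stores, n_products, n_weeks = 5, 20, 100
n_obs = n_stores * n_products * n_weeks

# Number of distinct elasticity parameters at each level of aggregation
n_elasticities = {
    "global": 1,
    "product": n_products,
    "store": n_stores,
    "product x store": n_products * n_stores,
}

for level, k in n_elasticities.items():
    print(f"{level:>16}: {k:4d} elasticities, {n_obs // k:6d} observations each")
```

At the (product $\times$ store) level each elasticity is informed only by that product's weekly observations in that store, which is the data scarcity that hierarchical models try to address.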

### Cross-price elasticities

Typically, we would also consider how a price change in one good could affect the purchases of another good. For example, raising the price of orange juice might result in individuals choosing to purchase less orange juice AND more apple juice.

We are going to skip cross-price elasticities but the paper that motivated this exercise (which is in this handouts folder) did estimate cross-price elasticities.

## Data: Dominick's Orange Juice

This data was originally a part of the [Dominick's Finer Foods](https://www.chicagobooth.edu/research/kilts/datasets/dominicks) data collection which ran from 1989 to 1994.

We've followed work by [Greg Allenby](https://fisher.osu.edu/people/allenby.1) and have selected a subset of products. The subset includes 11 different orange juice selections -- [This data](https://cran.r-project.org/web/packages/bayesm/bayesm.pdf#page=40&zoom=100,132,89) is also available through the `bayesm` R package.

### Data description

There are 11 orange juices in the full dataset. They include:

1. Tropicana Premium 64 oz (premium)
2. Tropicana Premium 96 oz (premium)
3. Florida’s Natural 64 oz (premium)
4. Tropicana 64 oz (national)
5. Minute Maid 64 oz (national)
6. Minute Maid 96 oz (national)
7. Citrus Hill 64 oz (national)
8. Tree Fresh 64 oz (national)
9. Florida Gold 64 oz (national)
10. Dominicks 64 oz (store)
11. Dominicks 128 oz (store)

The prices in the dataset are reported in \\$/oz

The quantities sold are reported in logs in the `logmove` variable

There are approximately 80 stores in the dataset with between 85 and 125 weeks of data for each store

In [None]:
oj_raw = pd.read_csv("oj.csv", index_col=0)

In [None]:
oj_raw.head()

**Get a single log price per row**


In [None]:
oj = oj_raw.copy()

oj.loc[:, "logprice"] = oj.apply(
    lambda x: np.log(x[f'price{x["brand"].astype(int)}']), axis="columns"
)

**Label products premium/national/store**

* 0 -> premium
* 1 -> national
* 2 -> store

In [None]:
labeler = {
    1: 0,
    2: 0,
    3: 0,
    4: 1,
    5: 1,
    6: 1,
    7: 1,
    8: 1,
    9: 1,
    10: 2,
    11: 2,
}

oj.loc[:, "quality"] = oj.loc[:, "brand"].map(lambda x: labeler[x])

**Shrink dataset**

We would like to be able to sample from our posterior relatively quickly for this class, so we're going to discard some of the data...

We've run other (more complex) Bayesian models on the data successfully but it was relatively expensive time-wise -- The computational time of Bayesian models can sometimes be a drawback to doing large-scale work with them (but there are lots of great "approximate Bayesian methods" being developed that help combat this).

In [None]:
prods_to_keep = [1, 3, 4, 5, 7, 8, 10]

oj = oj.query(
    "store < 33 &"  # Only use 10 stores
    "(brand in @prods_to_keep)"
    ""
)

**Determine product/store/quality/week indexers**

In [None]:
# Determine the number of products, stores, and weeks
nproducts = oj.loc[:, "brand"].nunique()
nstores = oj.loc[:, "store"].nunique()
nquality = oj.loc[:, "quality"].nunique()
nweeks = oj.loc[:, "week"].nunique()
 
# Convert things into index references
brand_mapper = dict(zip(oj["brand"].unique(), range(nproducts)))
store_mapper = dict(zip(oj["store"].unique(), range(nstores)))
quality_mapper = dict(zip(oj["quality"].unique(), range(nquality)))
week_mapper = dict(zip(oj["week"].unique(), range(nweeks)))

# Add indexer columns
oj.loc[:, "brand_idx"] = oj.loc[:, "brand"].replace(brand_mapper)
oj.loc[:, "store_idx"] = oj.loc[:, "store"].replace(store_mapper)
oj.loc[:, "quality_idx"] = oj.loc[:, "quality"].replace(quality_mapper)
oj.loc[:, "week_idx"] = oj.loc[:, "week"].replace(week_mapper)
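
The indexer columns map arbitrary labels to contiguous integers starting at 0 so that they can be used to index parameter arrays. A minimal NumPy sketch of the same idea, using hypothetical brand labels, is shown below. Note that `np.unique` numbers labels in sorted order, whereas the mappers above number them in order of first appearance:

```python
import numpy as np

# Hypothetical brand labels for six rows of data
brand = np.array([10, 1, 4, 1, 10, 3])

# `codes[k]` is the position of `brand[k]` in the sorted array `labels`
labels, codes = np.unique(brand, return_inverse=True)

print(labels.tolist())  # [1, 3, 4, 10]
print(codes.tolist())   # [3, 0, 2, 0, 3, 1]

# The original labels can always be recovered from the codes
recovered = labels[codes]
```

Either numbering works for the models below, as long as the same mapping is used consistently when interpreting the estimated parameters.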

**Discard unneeded columns**

In [None]:
cols_to_keep = [
    "store_idx",
    "store",
    "brand_idx",
    "brand",
    "quality_idx",
    "quality",
    "week_idx",
    "logmove",
    "logprice"
]

df = oj.loc[:, cols_to_keep].reset_index(drop=True)

In [None]:
df.head()

## Bayesian Models for Orange Juice Demand

We now proceed to develop and explore a sequence of Bayesian models for orange juice demand

### Fully pooled

In the fully pooled model, we will treat all products and stores as if they were the same.

The model can be written as:

\begin{align*}
  \log(q_{i, s, t}) &= \alpha + \eta \log(p_{i, s, t}) + \sigma \varepsilon_{i, s, t} \\
  \alpha &\sim N(1, 10) \\
  \eta &\sim N(-1, 5) \\
  \sigma &\sim \text{HalfStudentT}(10, 5)
\end{align*}

In [None]:
# Prep data
log_q = df["logmove"].to_numpy()
log_p = df["logprice"].to_numpy()

In [None]:
m_fp = pm.Model()

with m_fp:
    # Data
    _log_q = pm.Data("log_q", log_q)
    _log_p = pm.Data("log_p", log_p)

    # Priors
    alpha = pm.Normal("alpha", 1, 10)
    eta = pm.Normal("eta", -1, 5)
    sigma = pm.HalfStudentT("sigma", nu=10, sigma=5)

    # Likelihood
    ll = pm.Normal(
        "ll", alpha + eta*_log_p, sigma, observed=_log_q
    )

**Sample from the posterior**

In [None]:
with m_fp:
    traces_fp = pm.sample(2000, tune=1500)

In [None]:
with m_fp:
    az.plot_trace(traces_fp)

**Diagnostics**

In [None]:
with m_fp:
    ess = az.ess(traces_fp, relative=True)
    rhat = az.rhat(traces_fp)

print("Effective Sample Size (min across parameters)")
print(f"\talpha: {ess['alpha'].values.min()}")
print(f"\teta: {ess['eta'].values.min()}")
print(f"\tsigma: {ess['sigma'].values.min()}")
print("rhat (max across parameters)")
print(f"\talpha: {rhat['alpha'].values.max()}")
print(f"\teta: {rhat['eta'].values.max()}")
print(f"\tsigma: {rhat['sigma'].values.max()}")

**Sampling posterior predictive**

In [None]:
with m_fp:
    spp_fp = pm.sample_posterior_predictive(traces_fp, 500)

spp_fp_logmove = spp_fp["ll"]

In [None]:
prods_to_plot = [1, 3, 4, 10]
store_idx_to_plot = [0, 7]

fig, ax = plt.subplots(
    len(prods_to_plot), len(store_idx_to_plot),
    figsize=(10, 18)
)

for (pidx, p) in enumerate(prods_to_plot):
    for (sidx, s) in enumerate(store_idx_to_plot):
        idx = (pidx, sidx)

        # Filter on the store index `s`, not the subplot column `sidx`
        subdf = df.query("brand == @p & store_idx == @s")
        subdf_idx = subdf.index

        for row in range(spp_fp_logmove.shape[0]):
            ax[idx].scatter(
                subdf["logprice"],
                spp_fp_logmove[row, subdf_idx],
                color="g", alpha=0.01, s=35
            )

        ax[idx].scatter(
            subdf["logprice"], subdf["logmove"],
            color="k", s=1
        )

        ax[idx].set_xlim(-4.0, -2.5)
        ax[idx].set_ylim(3.5, 15)


### Amnesia

In the amnesia model, we will treat all products and stores as if they were different and completely unrelated.

The model can be written as:

\begin{align*}
  \log(q_{i, s, t}) &= \alpha_{i, s} + \eta_{i, s} \log(p_{i, s, t}) + \sigma_{i, s} \varepsilon_{i, s, t} \\
  \alpha_{i, s} &\sim N(1, 10) \\
  \eta_{i, s} &\sim N(-1, 5) \\
  \sigma_{i, s} &\sim \text{HalfStudentT}(10, 5)
\end{align*}

In [None]:
# Prep data
log_q = df["logmove"].to_numpy()
log_p = df["logprice"].to_numpy()

product_idx = df["brand_idx"].to_numpy()
quality_idx = df["quality_idx"].to_numpy()
store_idx = df["store_idx"].to_numpy()

nproducts = df["brand_idx"].nunique()
nquality = df["quality_idx"].nunique()
nstores = df["store_idx"].nunique()


In [None]:
m_np = pm.Model()

with m_np:
    # Data
    _log_q = pm.Data("log_q", log_q)
    _log_p = pm.Data("log_p", log_p)
    _product_idx = pm.intX(pm.Data(
        "product_idx", product_idx
    ))
    _store_idx = pm.intX(pm.Data(
        "store_idx", store_idx
    ))

    # Priors
    alpha = pm.Normal("alpha", 1, 10, shape=(nstores, nproducts))
    eta = pm.Normal("eta", -1, 5, shape=(nstores, nproducts))
    sigma = pm.HalfStudentT("sigma", nu=10, sigma=5, shape=(nstores, nproducts))

    # Likelihood
    ll = pm.Normal(
        "ll",
        alpha[(_store_idx, _product_idx)] + eta[(_store_idx, _product_idx)]*_log_p,
        sigma[(_store_idx, _product_idx)],
        observed=_log_q
    )
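
The expression `eta[(_store_idx, _product_idx)]` picks out one parameter per observation. A small NumPy sketch of that indexing pattern, with made-up numbers standing in for a single draw of `eta`, may help:

```python
import numpy as np

# Stand-in for one draw of `eta`: one value per (store, product)
eta = np.array([
    [-1.0, -2.0, -3.0],   # store 0
    [-1.5, -2.5, -3.5],   # store 1
])

# Four hypothetical observations, each tagged with its store and product
store_idx = np.array([0, 1, 1, 0])
product_idx = np.array([2, 0, 2, 1])

# Integer-array indexing pairs the indices element by element,
# giving one elasticity per observation
eta_per_obs = eta[(store_idx, product_idx)]
print(eta_per_obs.tolist())  # [-3.0, -1.5, -3.5, -2.0]
```

The result has the same length as the data, so it can be multiplied element by element with the log prices.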

**Sample from the posterior**

In [None]:
with m_np:
    traces_np = pm.sample(2000, tune=1500)

In [None]:
with m_np:
    az.plot_trace(traces_np, compact=True)

**Diagnostics**

In [None]:
with m_np:
    ess = az.ess(traces_np, relative=True)
    rhat = az.rhat(traces_np)

print("Effective Sample Size (min across parameters)")
print(f"\talpha: {ess['alpha'].values.min()}")
print(f"\teta: {ess['eta'].values.min()}")
print(f"\tsigma: {ess['sigma'].values.min()}")
print("rhat (max across parameters)")
print(f"\talpha: {rhat['alpha'].values.max()}")
print(f"\teta: {rhat['eta'].values.max()}")
print(f"\tsigma: {rhat['sigma'].values.max()}")

**Sampling posterior predictive**

In [None]:
with m_np:
    spp_np = pm.sample_posterior_predictive(traces_np, 500)

spp_np_logmove = spp_np["ll"]

In [None]:
prods_to_plot = [1, 3, 4, 10]
store_idx_to_plot = [0, 7]

fig, ax = plt.subplots(
    len(prods_to_plot), len(store_idx_to_plot),
    figsize=(10, 18)
)

for (pidx, p) in enumerate(prods_to_plot):
    for (sidx, s) in enumerate(store_idx_to_plot):
        idx = (pidx, sidx)

        # Filter on the store index `s`, not the subplot column `sidx`
        subdf = df.query("brand == @p & store_idx == @s")
        subdf_idx = subdf.index

        for row in range(spp_np_logmove.shape[0]):
            ax[idx].scatter(
                subdf["logprice"],
                spp_np_logmove[row, subdf_idx],
                color="g", alpha=0.01, s=35
            )

        ax[idx].scatter(
            subdf["logprice"], subdf["logmove"],
            color="k", s=1
        )

        ax[idx].set_xlim(-4.0, -2.5)
        ax[idx].set_ylim(3.5, 15)


### Pooled by product

This will be our first hierarchical model. We will assume that the elasticity differs by product but is the same across all stores.

The model can be written as:

\begin{align*}
  \log(q_{i, s, t}) &= \alpha_{i, s} + \eta_{i} \log(p_{i, s, t}) + \sigma_{i} \varepsilon_{i, s, t} \\
  \alpha_{i, s} &\sim N(1, 10) \\
  \eta_{i} &\sim N(\bar{\eta}, \sigma^\eta) \\
  \sigma_{i} &\sim \text{HalfStudentT}(10, 5) \\
  \bar{\eta} &\sim N(-1, 5) \\
  \sigma^\eta &\sim \text{HalfStudentT}(5, 3)
\end{align*}

In [None]:
# Prep data
log_q = df["logmove"].to_numpy()
log_p = df["logprice"].to_numpy()

product_idx = df["brand_idx"].to_numpy()
store_idx = df["store_idx"].to_numpy()

nproducts = df["brand_idx"].unique().shape[0]
nstores = df["store_idx"].unique().shape[0]


In [None]:
m_hs = pm.Model()

with m_hs:
    # Data
    _log_q = pm.Data("log_q", log_q)
    _log_p = pm.Data("log_p", log_p)
    _product_idx = pm.intX(pm.Data(
        "product_idx", product_idx
    ))
    _store_idx = pm.intX(pm.Data(
        "store_idx", store_idx
    ))

    # Hyper priors
    eta_bar = pm.Normal("eta_bar", -1, 5)
    sigma_eta = pm.HalfStudentT("sigma_eta", nu=5, sigma=3)

    # Priors
    alpha = pm.Normal("alpha", 1, 10, shape=(nstores, nproducts))
    eta = pm.Normal("eta", eta_bar, sigma_eta, shape=nproducts)
    sigma = pm.HalfStudentT("sigma", nu=10, sigma=5, shape=nproducts)

    # Likelihood
    ll = pm.Normal(
        "ll",
        alpha[(_store_idx, _product_idx)] + eta[_product_idx]*_log_p,
        sigma[_product_idx],
        observed=_log_q
    )

**Sample from the posterior**

In [None]:
with m_hs:
    traces_hs = pm.sample(2000, tune=1500)

In [None]:
with m_hs:
    az.plot_trace(
        traces_hs, var_names=["alpha", "sigma", "eta", "eta_bar", "sigma_eta"],
        compact=True
    )

**Diagnostics**

In [None]:
with m_hs:
    ess = az.ess(traces_hs, relative=True)
    rhat = az.rhat(traces_hs)

print("Effective Sample Size (min across parameters)")
print(f"\talpha: {ess['alpha'].values.min()}")
print(f"\teta: {ess['eta'].values.min()}")
print(f"\tsigma: {ess['sigma'].values.min()}")
print("rhat (max across parameters)")
print(f"\talpha: {rhat['alpha'].values.max()}")
print(f"\teta: {rhat['eta'].values.max()}")
print(f"\tsigma: {rhat['sigma'].values.max()}")

**Sampling posterior predictive**

In [None]:
with m_hs:
    spp_hs = pm.sample_posterior_predictive(traces_hs, 250)

spp_hs_logmove = spp_hs["ll"]

In [None]:
prods_to_keep

In [None]:
prods_to_plot = [1, 3, 4, 10]
store_idx_to_plot = [0, 7]

fig, ax = plt.subplots(
    len(prods_to_plot), len(store_idx_to_plot),
    figsize=(10, 18)
)

for (pidx, p) in enumerate(prods_to_plot):
    for (sidx, s) in enumerate(store_idx_to_plot):
        idx = (pidx, sidx)

        # Filter on the store index `s`, not the subplot column `sidx`
        subdf = df.query("brand == @p & store_idx == @s")
        subdf_idx = subdf.index

        for row in range(spp_hs_logmove.shape[0]):
            ax[idx].scatter(
                subdf["logprice"],
                spp_hs_logmove[row, subdf_idx],
                color="g", alpha=0.01, s=35
            )

        ax[idx].scatter(
            subdf["logprice"], subdf["logmove"],
            color="k", s=1
        )

        ax[idx].set_xlim(-4.0, -2.5)
        ax[idx].set_ylim(3.5, 15)


### Multi-layered hierarchical

This will be our first multi-layered hierarchical model.

We will allow elasticity to differ by both product and store, but we will do it in a particular way. We will assume that a product's elasticity in each store is drawn from a product-specific distribution, and that the means of the product-specific distributions are themselves drawn from a distribution shared across products.

The model can be written as:

\begin{align*}
  \log(q_{i, s, t}) &= \alpha_{i, s} + \eta_{i, s} \log(p_{i, s, t}) + \sigma_{i} \varepsilon_{i, s, t} \\
  \alpha_{i, s} &\sim N(1, 10) \\
  \sigma_{i} &\sim \text{HalfStudentT}(10, 5) \\
  \eta_{i, s} &\sim N(\bar{\eta}_i, \sigma^\eta_i) \\
  \bar{\eta}_i &\sim N(\tilde{\eta}, \tilde{\sigma}) \\
  \sigma^\eta_i &\sim \text{HalfStudentT}(5, 3) \\
  \tilde{\eta} &\sim N(-1, 5) \\
  \tilde{\sigma} &\sim \text{HalfStudentT}(5, 3)
\end{align*}
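
One way to read the layers is as a recipe for generating elasticities from the top down. The sketch below draws a single set of elasticities from the prior using plain NumPy. The prior scales are taken from the PyMC3 model code in this section, and the sizes and the `half_student_t` helper are assumptions made for this illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
nstores, nproducts = 10, 7  # sizes are assumptions for this sketch


def half_student_t(rng, nu, sigma, size=None):
    """Draw from a half Student-t by taking |t| and scaling (illustrative helper)."""
    return sigma * np.abs(rng.standard_t(nu, size=size))


# Top layer: shared across all products
eta_tilde = rng.normal(-1, 5)
sigma_tilde = half_student_t(rng, 5, 3)

# Middle layer: one mean and one spread per product
eta_bar = rng.normal(eta_tilde, sigma_tilde, size=nproducts)
sigma_eta = half_student_t(rng, 5, 3, size=nproducts)

# Bottom layer: one elasticity per (store, product), centered on the product mean
eta = rng.normal(eta_bar[None, :], sigma_eta[None, :], size=(nstores, nproducts))

print(eta.shape)  # (10, 7)
```

Each column of `eta` shares a product-level mean and spread, which is how information about a product is pooled across stores.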

In [None]:
# Prep data
log_q = df["logmove"].to_numpy()
log_p = df["logprice"].to_numpy()

product_idx = df["brand_idx"].to_numpy()
store_idx = df["store_idx"].to_numpy()

nproducts = df["brand_idx"].unique().shape[0]
nstores = df["store_idx"].unique().shape[0]


In [None]:
m_mlhs = pm.Model()

with m_mlhs:
    # Data
    _log_q = pm.Data("log_q", log_q)
    _log_p = pm.Data("log_p", log_p)
    _product_idx = pm.intX(pm.Data(
        "product_idx", product_idx
    ))
    _store_idx = pm.intX(pm.Data(
        "store_idx", store_idx
    ))

    # Hyperhyper priors
    eta_tilde = pm.Normal("eta_tilde", -1, 5)
    sigma_tilde = pm.HalfStudentT("sigma_tilde", nu=5, sigma=3)

    # Hyperpriors
    eb_offset = pm.Normal("eb_offset", 0, 1, shape=nproducts)
    eta_bar = pm.Deterministic(
        "eta_bar", eta_tilde + sigma_tilde*eb_offset
    )
    sigma_eta = pm.HalfStudentT("sigma_eta", nu=5, sigma=3, shape=nproducts)

    # Priors
    alpha = pm.Normal("alpha", 1, 10, shape=(nstores, nproducts))
    eta_offset = pm.Normal("eta_offset", 0, 1, shape=(nstores, nproducts))
    eta = pm.Deterministic(
        "eta", eta_bar[None, :] + eta_offset*sigma_eta[None, :]
    )
    sigma = pm.HalfStudentT("sigma", nu=10, sigma=5, shape=nproducts)

    # Likelihood
    ll = pm.Normal(
        "ll",
        alpha[(_store_idx, _product_idx)] + eta[(_store_idx, _product_idx)]*_log_p,
        sigma[_product_idx],
        observed=_log_q
    )

# Centered model
# m_mlhs = pm.Model()

# with m_mlhs:
#     # Data
#     _log_q = pm.Data("log_q", log_q)
#     _log_p = pm.Data("log_p", log_p)
#     _product_idx = pm.intX(pm.Data(
#         "product_idx", product_idx
#     ))
#     _store_idx = pm.intX(pm.Data(
#         "store_idx", store_idx
#     ))

#     # Hyperhyper priors
#     eta_tilde = pm.Normal("eta_tilde", -1, 5)
#     sigma_tilde = pm.HalfStudentT("sigma_tilde", nu=5, sigma=3)

#     # Hyperpriors
#     eta_bar = pm.Normal("eta_bar", eta_tilde, sigma_tilde, shape=nproducts)
#     sigma_eta = pm.HalfStudentT("sigma_eta", nu=5, sigma=3, shape=nproducts)

#     # Priors
#     alpha = pm.Normal("alpha", 1, 10, shape=(nstores, nproducts))
#     eta = pm.Normal("eta", eta_bar[None, :], sigma_eta[None, :], shape=(nstores, nproducts))
#     sigma = pm.HalfStudentT("sigma", nu=10, sigma=5, shape=nproducts)

#     # Likelihood
#     ll = pm.Normal(
#         "ll",
#         alpha[(_store_idx, _product_idx)] + eta[(_store_idx, _product_idx)]*_log_p,
#         sigma[_product_idx],
#         observed=_log_q
#     )

Why did we build the model this way? We think the elasticity is more likely to be similar within a product than within a store so we want the distribution that the elasticities are being drawn from to be related to product rather than store.

This could have been achieved in other ways, and it would also be reasonable to take a different view than the one we took.

**Sample from the posterior**

In [None]:
with m_mlhs:
    traces_mlhs = pm.sample(1500, tune=1000, target_accept=0.95)

In [None]:
with m_mlhs:
    az.plot_trace(
        traces_mlhs, var_names=[
            "alpha", "sigma", "eta", "eta_bar", "sigma_eta",
            "eta_tilde", "sigma_tilde"
        ],
        compact=True
    )

**Diagnostics**

In [None]:
with m_mlhs:
    ess = az.ess(traces_mlhs, relative=True)
    rhat = az.rhat(traces_mlhs)

print("Effective Sample Size (min across parameters)")
print(f"\talpha: {ess['alpha'].values.min()}")
print(f"\teta: {ess['eta'].values.min()}")
print(f"\tsigma: {ess['sigma'].values.min()}")
print("rhat (max across parameters)")
print(f"\talpha: {rhat['alpha'].values.max()}")
print(f"\teta: {rhat['eta'].values.max()}")
print(f"\tsigma: {rhat['sigma'].values.max()}")

**Sampling posterior predictive**

In [None]:
with m_mlhs:
    spp_mlhs = pm.sample_posterior_predictive(traces_mlhs, 250)

spp_mlhs_logmove = spp_mlhs["ll"]

In [None]:
prods_to_plot = [1, 3, 4, 10]
store_idx_to_plot = [0, 7]

fig, ax = plt.subplots(
    len(prods_to_plot), len(store_idx_to_plot),
    figsize=(10, 18)
)

for (pidx, p) in enumerate(prods_to_plot):
    for (sidx, s) in enumerate(store_idx_to_plot):
        idx = (pidx, sidx)

        # Filter on the store index `s`, not the subplot column `sidx`
        subdf = df.query("brand == @p & store_idx == @s")
        subdf_idx = subdf.index

        for row in range(spp_mlhs_logmove.shape[0]):
            ax[idx].scatter(
                subdf["logprice"],
                spp_mlhs_logmove[row, subdf_idx],
                color="g", alpha=0.01, s=35
            )

        ax[idx].scatter(
            subdf["logprice"], subdf["logmove"],
            color="k", s=1
        )

        ax[idx].set_xlim(-4.0, -2.5)
        ax[idx].set_ylim(3.5, 15)


## Centered vs Non-Centered Hierarchical Models

We won't be able to cover this in as much depth as we would like, but we want to raise it because it is a serious issue that can arise when working with hierarchical models.

If you'd like more information about this issue, we recommend reading: [1](https://twiecki.io/blog/2017/02/08/bayesian-hierchical-non-centered/), [2](https://mc-stan.org/docs/2_18/stan-users-guide/reparameterization-section.html), [3](https://mc-stan.org/users/documentation/case-studies/divergences_and_bias.html)

### The issue

Let's focus on a somewhat generic hierarchical model where,

\begin{align*}
  \beta_i &\sim N(\bar{\beta}, \bar{\sigma}) \\
  \bar{\beta} &\sim N(\mu, \sigma) \\
  \bar{\sigma} &\sim \text{HalfStudentT}(\nu, \tau)
\end{align*}

We refer to this as the _centered model_. However, the same model can be written another way:

\begin{align*}
  \beta_i &= \bar{\beta} + \bar{\sigma} \beta_\text{offset} \\
  \beta_\text{offset} &\sim N(0, 1) \\
  \bar{\beta} &\sim N(\mu, \sigma) \\
  \bar{\sigma} &\sim \text{HalfStudentT}(\nu, \tau)
\end{align*}

This is what we refer to as the _non-centered model_.
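
The two parameterizations describe the same distribution for $\beta_i$. A quick simulation sketch, holding $\bar{\beta}$ and $\bar{\sigma}$ fixed at hypothetical values, illustrates this:

```python
import numpy as np

rng = np.random.default_rng(42)
n_draws = 200_000

# Hypothetical fixed values for the group mean and scale
beta_bar, sigma_bar = -2.0, 0.5

# Centered: draw beta_i directly from N(beta_bar, sigma_bar)
beta_centered = rng.normal(beta_bar, sigma_bar, size=n_draws)

# Non-centered: draw a standard normal offset, then shift and scale it
beta_offset = rng.normal(0.0, 1.0, size=n_draws)
beta_noncentered = beta_bar + sigma_bar * beta_offset

# Means and standard deviations agree up to simulation noise
print(beta_centered.mean(), beta_centered.std())
print(beta_noncentered.mean(), beta_noncentered.std())
```

The difference between the two is therefore not in what the model says, but in which variables the sampler has to explore: $\beta_i$ itself or the standardized offset.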

We use an image from the blog post by Thomas Wiecki that we linked to above to show why the non-centered model can be helpful

* Non-centered weakens the dependence between the group-level coefficients $\beta_i$ and $\bar{\sigma}$ by sampling $\beta_{\text{offset}}$ instead of $\beta_i$
* This then allows the Markov chain to more easily enter the "funnel" portion of the posterior

![](funnel.png)

**Why does this matter?**

The centered version of the model ends up overestimating the group-level standard deviation $\bar{\sigma}$, which means that you believe you are more capable of separating the grouped coefficients than you actually are.

![](sigma_centered_vs_noncentered.png)

This graph also comes from the [Thomas Wiecki post](https://twiecki.io/blog/2017/02/08/bayesian-hierchical-non-centered/), which we highly recommend reading if this interests you.